comparison writeup/nips2010_submission.tex @ 533:22d5cd82d5f0

resolved conflict
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 21:24:39 -0400
parents 2e33885730cf 85f2337d47d2
children 4d6493d171f6 47894d0ecbde
comparison
equal deleted inserted replaced
532:2e33885730cf 533:22d5cd82d5f0
1 \documentclass{article} % For LaTeX2e 1 \documentclass{article} % For LaTeX2e
2 \usepackage{nips10submit_e,times} 2 \usepackage{nips10submit_e,times}
3 3
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} 4 \usepackage{amsthm,amsmath,bbm}
5 \usepackage[psamsfonts]{amssymb}
5 \usepackage{algorithm,algorithmic} 6 \usepackage{algorithm,algorithmic}
6 \usepackage[utf8]{inputenc} 7 \usepackage[utf8]{inputenc}
7 \usepackage{graphicx,subfigure} 8 \usepackage{graphicx,subfigure}
8 \usepackage[numbers]{natbib} 9 \usepackage[numbers]{natbib}
9 10
75 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which 76 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which
76 performed as well as or better than previously proposed Restricted Boltzmann 77 performed as well as or better than previously proposed Restricted Boltzmann
77 Machines in terms of unsupervised extraction of a hierarchy of features 78 Machines in terms of unsupervised extraction of a hierarchy of features
78 useful for classification. The principle is that each layer starting from 79 useful for classification. The principle is that each layer starting from
79 the bottom is trained to encode its input (the output of the previous 80 the bottom is trained to encode its input (the output of the previous
80 layer) and to reconstruct it from a corrupted version of it. After this 81 layer) and to reconstruct it from a corrupted version. After this
81 unsupervised initialization, the stack of denoising auto-encoders can be 82 unsupervised initialization, the stack of denoising auto-encoders can be
82 converted into a deep supervised feedforward neural network and fine-tuned by 83 converted into a deep supervised feedforward neural network and fine-tuned by
83 stochastic gradient descent. 84 stochastic gradient descent.
84 85
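To make the layer-wise procedure above concrete, here is a minimal NumPy sketch (ours, not the authors' code); the layer sizes, masking-corruption level, learning rate and number of epochs are illustrative assumptions only.

\begin{verbatim}
# Sketch (ours) of the procedure described above: each denoising auto-encoder is
# trained to reconstruct its input from a corrupted version, and each new layer
# encodes the previous layer's output.  Layer sizes, corruption level, learning
# rate and epochs are illustrative guesses, not the values used in the paper.
import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def pretrain_dae_layer(X, n_hidden, corruption=0.25, lr=0.01, epochs=5):
    """Train one denoising auto-encoder (tied weights) on inputs X with rows in [0,1]."""
    n_in = X.shape[1]
    W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.uniform(size=n_in) > corruption)  # corrupt the input
            h = sigmoid(x_tilde @ W + b)                          # encode
            z = sigmoid(h @ W.T + c)                              # reconstruct
            dz = z - x                       # grad. of cross-entropy wrt decoder pre-activation
            dh = (dz @ W) * h * (1 - h)      # back-propagated to encoder pre-activation
            W -= lr * (np.outer(x_tilde, dh) + np.outer(dz, h))
            b -= lr * dh
            c -= lr * dz
    return W, b

def pretrain_stack(X, layer_sizes=(500, 500, 500)):
    """Greedy layer-wise pre-training; returns (W, b) pairs that initialize the supervised net."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # clean (uncorrupted) codes feed the next layer
    return params

# After pre-training, a softmax output layer is added on top and the whole stack is
# fine-tuned by stochastic gradient descent on the supervised labels (not shown).
\end{verbatim}

The tutorial cited in the model description ({\tt http://deeplearning.net/tutorial}) contains a complete implementation of this building block.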
85 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles 86 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
96 The hypothesis explored here is that a deep hierarchy of features 97 The hypothesis explored here is that a deep hierarchy of features
97 may be better able to provide sharing of statistical strength 98 may be better able to provide sharing of statistical strength
98 between different regions in input space or different tasks, 99 between different regions in input space or different tasks,
99 as discussed in the conclusion. 100 as discussed in the conclusion.
100 101
101 % TODO: why we care to evaluate this relative advantage
102
103 In this paper we ask the following questions: 102 In this paper we ask the following questions:
104 103
105 %\begin{enumerate} 104 %\begin{enumerate}
106 $\bullet$ %\item 105 $\bullet$ %\item
107 Do the good results previously obtained with deep architectures on the 106 Do the good results previously obtained with deep architectures on the
126 125
127 Our experimental results provide positive evidence towards all of these questions. 126 Our experimental results provide positive evidence towards all of these questions.
128 127
129 \vspace*{-1mm} 128 \vspace*{-1mm}
130 \section{Perturbation and Transformation of Character Images} 129 \section{Perturbation and Transformation of Character Images}
130 \label{s:perturbations}
131 \vspace*{-1mm} 131 \vspace*{-1mm}
132 132
133 This section describes the different transformations we used to stochastically 133 This section describes the different transformations we used to stochastically
134 transform source images in order to obtain new training examples. More details can 134 transform source images in order to obtain new training examples. More details can
135 be found in this technical report~\citep{ift6266-tr-anonymous}. 135 be found in this technical report~\citep{ift6266-tr-anonymous}.
141 There are two main parts in the pipeline. The first one, 141 There are two main parts in the pipeline. The first one,
142 from slant to pinch below, performs transformations. The second 142 from slant to pinch below, performs transformations. The second
143 part, from blur to contrast, adds different kinds of noise. 143 part, from blur to contrast, adds different kinds of noise.
144 144
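To fix ideas on how the two parts fit together, here is a schematic sketch (ours); the two toy modules below stand in for the real slant-to-pinch and blur-to-contrast modules, whose actual implementations are in the cited technical report, and the per-module \emph{complexity} parameter is the one discussed in the experiments section (each module receives its own draw).

\begin{verbatim}
# Schematic sketch (ours) of the two-part pipeline: deformation modules first,
# then noise modules.  The two toy modules below are stand-ins; the real slant,
# thickness, affine, elastic, pinch, blur, ..., contrast modules are described
# in the cited technical report.
import numpy as np

def slant(img, c):
    """Toy slant: shear each row horizontally in proportion to its height."""
    h = img.shape[0]
    return np.stack([np.roll(row, int(round(c * (r - h / 2))))
                     for r, row in enumerate(img)])

def contrast(img, c):
    """Toy contrast change: compress the dynamic range by a factor depending on c."""
    return 0.5 + (img - 0.5) * (1.0 - c)

DEFORMATIONS = [slant]      # real pipeline: slant ... pinch
NOISES = [contrast]         # real pipeline: blur ... contrast

def perturb(image, rng, max_complexity=0.7, add_noise=True):
    """Apply every module in order, each with its own randomly drawn complexity."""
    for module in DEFORMATIONS + (NOISES if add_noise else []):
        image = module(image, rng.uniform(0.0, max_complexity))
    return np.clip(image, 0.0, 1.0)

rng = np.random.RandomState(0)
example = perturb(rng.uniform(size=(32, 32)), rng)   # a random 32x32 stand-in image
\end{verbatim}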
145 \begin{figure}[ht] 145 \begin{figure}[ht]
146 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}} 146 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}}
147 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE 147 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE
148 \caption{Illustration of each transformation applied alone to the same image 148 \caption{Illustration of each transformation applied alone to the same image
149 of an upper-case h (top left). First row (from left to right): original image, slant, 149 of an upper-case h (top left). First row (from left to right): original image, slant,
150 thickness, affine transformation (translation, rotation, shear), 150 thickness, affine transformation (translation, rotation, shear),
151 local elastic deformation; second row (from left to right): 151 local elastic deformation; second row (from left to right):
353 Whereas much previous work on deep learning algorithms had been performed on 353 Whereas much previous work on deep learning algorithms had been performed on
354 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, 354 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
355 with 60~000 examples, and variants involving 10~000 355 with 60~000 examples, and variants involving 10~000
356 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want 356 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
357 to focus here on the case of much larger training sets, from 10 times to 357 to focus here on the case of much larger training sets, from 10 times to
358 1000 times larger. The larger datasets are obtained by first sampling from 358 1000 times larger.
359
360 The first step in constructing the larger datasets is to sample from
359 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, 361 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
360 and {\bf OCR data} (scanned machine printed characters). Once a character 362 and {\bf OCR data} (scanned machine printed characters). Once a character
361 is sampled from one of these sources (chosen randomly), a pipeline of 363 is sampled from one of these sources (chosen randomly), the pipeline of
362 the above transformations and/or noise processes is applied to the 364 the transformations and/or noise processes described in section \ref{s:perturbations}
363 image. 365 is applied to the image.
364 366
365 We compare the best MLP (according to validation set error) that we found against 367 We compare the best MLP against
366 the best SDA (again according to validation set error), along with a precise estimate 368 the best SDA (both models' hyper-parameters are selected to minimize the validation set error),
369 and compare both against a precise estimate
367 of human performance obtained via Amazon's Mechanical Turk (AMT) 370 of human performance obtained via Amazon's Mechanical Turk (AMT)
368 service\footnote{http://mturk.com}. 371 service ({\tt http://mturk.com}).
369 AMT users are paid small amounts 372 AMT users are paid small amounts
370 of money to perform tasks for which human intelligence is required. 373 of money to perform tasks for which human intelligence is required.
371 Mechanical Turk has been used extensively in natural language processing and vision. 374 Mechanical Turk has been used extensively in natural language processing and vision.
372 %processing \citep{SnowEtAl2008} and vision 375 %processing \citep{SnowEtAl2008} and vision
373 %\citep{SorokinAndForsyth2008,whitehill09}. 376 %\citep{SorokinAndForsyth2008,whitehill09}.
374 AMT users where presented 377 AMT users were presented
375 with 10 character images and asked to type 10 corresponding ASCII 378 with 10 character images and asked to choose 10 corresponding ASCII
376 characters. They were forced to make a hard choice among the 379 characters. They were forced to make a hard choice among the
377 62 or 10 character classes (all classes or digits only). 380 62 or 10 character classes (all classes or digits only).
378 Three users classified each image, allowing us 381 Three users classified each image, allowing us
379 to estimate inter-human variability. 382 to estimate inter-human variability. A total of 2500 images per dataset were classified.
380 383
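The 95\% confidence intervals reported with the results below are not detailed in the text; one simple way to obtain an error rate and such an interval from the AMT labels is a normal approximation to the binomial proportion, sketched here with a hypothetical error count.

\begin{verbatim}
# Sketch with a made-up error count: 2500 images per dataset, 3 AMT labels each.
# The normal-approximation binomial interval below is our assumption; the paper
# does not specify how its 95% confidence intervals were computed.
import math

def error_rate_with_ci(n_errors, n_total, z=1.96):
    """Error rate and normal-approximation 95% confidence interval."""
    p = n_errors / n_total
    half = z * math.sqrt(p * (1.0 - p) / n_total)
    return p, (p - half, p + half)

p, (low, high) = error_rate_with_ci(n_errors=1370, n_total=2500 * 3)  # hypothetical count
print("error %.1f%% (95%% CI: %.1f%%-%.1f%%)" % (100 * p, 100 * low, 100 * high))
\end{verbatim}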
381 \vspace*{-1mm} 384 \vspace*{-1mm}
382 \subsection{Data Sources} 385 \subsection{Data Sources}
383 \vspace*{-1mm} 386 \vspace*{-1mm}
384 387
389 widely used for training and testing character 392 widely used for training and testing character
390 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. 393 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
391 The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications, 394 The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications,
392 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 395 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
393 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. 396 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
394 The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended 397 The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended
395 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 398 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
396 for that purpose. We randomly split the remainder into a training set and a validation set for 399 for that purpose. We randomly split the remainder into a training set and a validation set for
397 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, 400 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation,
398 and 82587 for testing. 401 and 82587 for testing.
399 Most of the previous work reporting performance on this dataset used only the digits. 402 Most of the previous work reporting performance on this dataset used only the digits.
405 more like the natural distribution of letters in text). 408 more like the natural distribution of letters in text).
406 409
407 %\item 410 %\item
408 {\bf Fonts.} 411 {\bf Fonts.}
409 In order to have a good variety of sources, we downloaded a large number of free fonts from: 412 In order to have a good variety of sources, we downloaded a large number of free fonts from:
410 {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} 413 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
411 % TODO: pointless to anonymize, it's not pointing to our work 414 % TODO: pointless to anonymize, it's not pointing to our work
412 Including operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. 415 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
413 The {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by rendering the corresponding image, 416 The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by rendering the corresponding image,
414 directly as input to our models. 417 directly as input to our models.
415 418
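One plausible way (not necessarily what was done here) to render a character image from a chosen ttf file is via the Pillow (PIL) library; the font path, canvas size and final resolution below are placeholders.

\begin{verbatim}
# One plausible way (not necessarily the authors' code) to render a character
# from a downloaded ttf file into a small grey-scale image; sizes and the
# font path are placeholders.
from PIL import Image, ImageDraw, ImageFont

def render_char(ttf_path, char, canvas=64, final=32):
    font = ImageFont.truetype(ttf_path, size=int(canvas * 0.8))
    img = Image.new("L", (canvas, canvas), color=0)           # black background
    ImageDraw.Draw(img).text((canvas // 8, 0), char, fill=255, font=font)
    return img.resize((final, final))                         # e.g. 32x32 input image

# img = render_char("/path/to/some_font.ttf", "h")
\end{verbatim}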
416 %\item 419 %\item
417 {\bf Captchas.} 420 {\bf Captchas.}
418 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for 421 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
444 %\item 447 %\item
445 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. 448 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}.
446 449
447 %\item 450 %\item
448 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources 451 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
449 and sending them through the above transformation pipeline. 452 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
450 To generate each new example, a source is selected with probability $10\%$ from the fonts, 453 To generate each new example, a data source is selected with probability $10\%$ from the fonts,
451 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the 454 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
452 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$. 455 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
453 456
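For concreteness, the source-selection step can be sketched as follows (our code; the loaders that would actually fetch a raw character from each source are omitted). Each selected character is then passed through the pipeline, every module receiving its own complexity drawn uniformly from $[0,0.7]$.

\begin{verbatim}
# Sketch (ours) of the P07 source-selection step; only the sampling itself is
# shown, the loaders for the four data sources are left out.
import numpy as np

SOURCE_PROBS = {"fonts": 0.10, "captchas": 0.25, "ocr": 0.25, "nist": 0.40}

def pick_source(rng):
    names = sorted(SOURCE_PROBS)
    return rng.choice(names, p=[SOURCE_PROBS[n] for n in names])

rng = np.random.RandomState(0)
counts = {name: 0 for name in SOURCE_PROBS}
for _ in range(10000):
    counts[pick_source(rng)] += 1
print(counts)   # roughly 10% fonts, 25% captchas, 25% ocr, 40% nist
\end{verbatim}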
454 %\item 457 %\item
455 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion) 458 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
456 except that we only apply 459 except that we only apply
457 transformations from slant to pinch. Therefore, the character is 460 transformations from slant to pinch. Therefore, the character is
458 transformed but no additional noise is added to the image, giving images 461 transformed but no additional noise is added to the image, giving images
459 closer to the NIST dataset. 462 closer to the NIST dataset.
460 %\end{itemize} 463 %\end{itemize}
463 \subsection{Models and their Hyperparameters} 466 \subsection{Models and their Hyperparameters}
464 \vspace*{-1mm} 467 \vspace*{-1mm}
465 468
466 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single 469 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
467 hidden layer and with Stacked Denoising Auto-Encoders (SDA). 470 hidden layer and with Stacked Denoising Auto-Encoders (SDA).
468 All hyper-parameters are selected based on performance on the NISTP validation set. 471 \emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.}
469 472
470 {\bf Multi-Layer Perceptrons (MLP).} 473 {\bf Multi-Layer Perceptrons (MLP).}
471 Whereas previous work had compared deep architectures to both shallow MLPs and 474 Whereas previous work had compared deep architectures to both shallow MLPs and
472 SVMs, we only compared to MLPs here because of the very large datasets used 475 SVMs, we only compared to MLPs here because of the very large datasets used
473 (making the use of SVMs computationally inconvenient because of their quadratic 476 (making the use of SVMs computationally challenging because of their quadratic
474 scaling behavior). 477 scaling behavior).
475 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized 478 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
476 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. 479 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
477 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 480 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
478 The optimization procedure is as follows: training 481 Training examples are presented in minibatches of size 20. A constant learning
479 examples are presented in minibatches of size 20, a constant learning 482 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
480 rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
481 through preliminary experiments (measuring performance on a validation set), 483 through preliminary experiments (measuring performance on a validation set),
482 and $0.1$ was then selected. 484 and $0.1$ was then selected and used for training on the full training sets.
483 485
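A minimal NumPy sketch (ours, not the actual implementation) of one stochastic gradient step for the MLP just described follows; the 1000 hidden units and 62 classes are one of the configurations listed above, while the $32\times 32$ input size is our assumption.

\begin{verbatim}
# Sketch (ours) of the MLP described above: one tanh hidden layer, softmax output
# estimating P(class|image), minibatches of 20, constant learning rate 0.1.
# The 32x32 input size is an assumption; hidden/output sizes are one configuration.
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32 * 32, 1000, 62
W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(X, y, lr=0.1):
    """One constant-learning-rate update on a minibatch X (20 x n_in), integer labels y."""
    global W1, b1, W2, b2
    H = np.tanh(X @ W1 + b1)                                    # tanh hidden layer
    P = softmax(H @ W2 + b2)                                    # P(class | image)
    G = P.copy(); G[np.arange(len(y)), y] -= 1.0; G /= len(y)   # grad. of mean NLL wrt logits
    dH = (G @ W2.T) * (1.0 - H ** 2)
    W2 -= lr * (H.T @ G);  b2 -= lr * G.sum(axis=0)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)

sgd_step(rng.uniform(size=(20, n_in)), rng.randint(0, n_classes, size=20))  # one illustrative step
\end{verbatim}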
484 \begin{figure}[ht] 486 \begin{figure}[ht]
485 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} 487 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
486 \caption{Illustration of the computations and training criterion for the denoising 488 \caption{Illustration of the computations and training criterion for the denoising
487 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of 489 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
506 distribution $P(x)$ and the conditional distribution of interest 508 distribution $P(x)$ and the conditional distribution of interest
507 $P(y|x)$ (like in semi-supervised learning), and on the other hand 509 $P(y|x)$ (like in semi-supervised learning), and on the other hand
508 taking advantage of the expressive power and bias implicit in the 510 taking advantage of the expressive power and bias implicit in the
509 deep architecture (whereby complex concepts are expressed as 511 deep architecture (whereby complex concepts are expressed as
510 compositions of simpler ones through a deep hierarchy). 512 compositions of simpler ones through a deep hierarchy).
513
511 Here we chose to use the Denoising 514 Here we chose to use the Denoising
512 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for 515 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
513 these deep hierarchies of features, as it is very simple to train and 516 these deep hierarchies of features, as it is very simple to train and
514 explain (see Figure~\ref{fig:da}, as well as 517 explain (see Figure~\ref{fig:da}, as well as
515 tutorial and code there: {\tt http://deeplearning.net/tutorial}), 518 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
535 538
536 \vspace*{-1mm} 539 \vspace*{-1mm}
537 540
538 \begin{figure}[ht] 541 \begin{figure}[ht]
539 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} 542 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
540 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training 543 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
541 on NIST, 1 on NISTP, and 2 on P07. Left: overall results 544 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
542 of all models, on 3 different test sets corresponding to the three 545 of all models, on 3 different test sets (NIST, NISTP, P07).
543 datasets.
544 Right: error rates on NIST test digits only, along with the previous results from 546 Right: error rates on NIST test digits only, along with the previous results from
545 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 547 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
546 respectively based on ART, nearest neighbors, MLPs, and SVMs.} 548 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
547 549
548 \label{fig:error-rates-charts} 550 \label{fig:error-rates-charts}