comparison writeup/nips2010_submission.tex @ 472:2dd6e8962df1
conclusion
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Sun, 30 May 2010 10:44:20 -0400 |
parents | d02d288257bf |
children | bcf024e6ab23 |
469:d02d288257bf | 472:2dd6e8962df1 |
---|---|
277 \end{figure} | 277 \end{figure} |
278 | 278 |
279 | 279 |
280 \section{Experimental Setup} | 280 \section{Experimental Setup} |
281 | 281 |
282 \subsection{Training Datasets} | 282 Whereas much previous work on deep learning algorithms had been performed on |
283 | 283 the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
284 \subsubsection{Data Sources} | 284 with 60~000 examples, and variants involving 10~000 |
285 examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want | |
286 to focus here on the case of much larger training sets, from 10 times to | |
287 1000 times larger. The larger datasets are obtained by first sampling from | |
288 a {\em data source} (NIST characters, scanned machine-printed characters, characters | |
289 from fonts, or characters from captchas) and then optionally applying some of the | |
290 above transformations and/or noise processes. | |
291 | |
292 \subsection{Data Sources} | |
285 | 293 |
286 \begin{itemize} | 294 \begin{itemize} |
287 \item {\bf NIST} | 295 \item {\bf NIST} |
288 The NIST Special Database 19 (NIST19) is a very widely used dataset for training and testing OCR systems. | 296 Our main source of characters is the NIST Special Database 19~\cite{Grother-1995}, |
297 widely used for training and testing character | |
298 recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}. | |
289 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications, | 299 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications, |
290 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | 300 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes |
291 corresponding to "0"-"9","A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity. | 301 corresponding to "0"-"9","A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity. |
292 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one for classification task is recommended | 302 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended |
293 by NIST as testing set and is used in our work for that purpose. It contains 82600 examples, | 303 by NIST as a testing set and is used in our work and in some previous work~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}
294 while the training and validation sets (which have the same distribution) contain XXXXX and | 304 for that purpose. We randomly split the remainder into a training set and a validation set for |
295 XXXXX examples respectively. | 305 model selection. The sizes of these data sets are: XXX for training, XXX for validation, |
306 and XXX for testing. | |
296 Most of the previous work reporting results on that dataset uses only the digits. | 307 Most of the previous work reporting results on that dataset uses only the digits. |
297 Here we use all the classes, in both the training and testing phases. This is especially | 308 Here we use all the classes, in both the training and testing phases. This is especially |
298 useful to estimate the effect of a multi-task setting. | 309 useful to estimate the effect of a multi-task setting. |
299 Note that the distribution of the classes in the NIST training and test sets differs | 310 Note that the distribution of the classes in the NIST training and test sets differs |
300 substantially, with relatively many more digits in the test set, and uniform distribution | 311 substantially, with relatively many more digits in the test set, and uniform distribution |
303 | 314 |
304 \item {\bf Fonts} TODO!!! | 315 \item {\bf Fonts} TODO!!! |
305 | 316 |
306 \item {\bf Captchas} | 317 \item {\bf Captchas} |
307 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for | 318 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
308 generating characters of the same format as the NIST dataset. The core of this data source is composed with a random character | 319 generating characters of the same format as the NIST dataset. This software is based on |
309 generator and various kinds of tranformations similar to those described in the previous sections. | 320 a random character class generator and various kinds of transformations similar to those described in the previous sections.
310 In order to increase the variability of the data generated, different fonts are used for generating the characters. | 321 In order to increase the variability of the data generated, many different fonts are used for generating the characters. |
311 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity | 322 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
312 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are | 323 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are |
313 allowed and can be controlled via an easy-to-use facade class. | 324 allowed and can be controlled via an easy-to-use facade class (a minimal illustration of this kind of generation is sketched after this list). |
314 \item {\bf OCR data} | 325 \item {\bf OCR data} |
326 A large set (2 million) of scanned, OCRed and manually verified machine-printed | |
327 characters (from various documents and books) were included as an | |
328 additional source. This set is part of a larger corpus being collected by the Image Understanding | |
329 and Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern | |
330 ({\tt http://www.iupr.com}), which will be publicly released. | |
315 \end{itemize} | 331 \end{itemize} |
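The following is an editorial sketch (not the actual \emph{pycaptcha} adaptation) of the kind of generation the Captcha source performs: a random character class is rendered in a randomly chosen font and then rotated and slanted by amounts scaled by the complexity parameter. It uses Pillow, and the list of font paths is a hypothetical placeholder.

\begin{verbatim}
import string
import numpy as np
from PIL import Image, ImageDraw, ImageFont

rng = np.random.RandomState(3)
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 classes

def captcha_like_character(font_paths, complexity=0.7):
    # font_paths: hypothetical list of TrueType font files available locally.
    label = rng.randint(len(CLASSES))
    font = ImageFont.truetype(font_paths[rng.randint(len(font_paths))], size=24)
    img = Image.new("L", (32, 32), color=0)
    ImageDraw.Draw(img).text((4, 2), CLASSES[label], fill=255, font=font)
    angle = rng.uniform(-30, 30) * complexity      # rotation scaled by complexity
    shear = rng.uniform(-0.5, 0.5) * complexity    # horizontal slant
    img = img.rotate(angle, resample=Image.BILINEAR)
    img = img.transform((32, 32), Image.AFFINE, (1, shear, 0, 0, 1, 0))
    return np.asarray(img, dtype=np.float32) / 255.0, label
\end{verbatim}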
316 | 332 |
317 \subsubsection{Data Sets} | 333 \subsection{Data Sets} |
334 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label | |
335 from one of the 62 character classes. | |
318 \begin{itemize} | 336 \begin{itemize} |
319 \item {\bf NIST} This is the raw NIST special database 19. | 337 \item {\bf NIST}. This is the raw NIST Special Database 19. |
320 \item {\bf P07} | 338 \item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources |
321 The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. | 339 and sending them through the above transformation pipeline. |
322 For each new exemple to generate, we choose one source with the following probability: $0.1$ for the fonts, | 340 To generate each new example, a source is selected with probability $10\%$ from the fonts,
323 $0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in their order | 341 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the |
324 and for each of them we sample uniformly a complexity in the range $[0,0.7]$. | 342 order given above, and for each of them we sample a complexity uniformly in the range $[0,0.7]$ (see the sampling sketch after this list).
325 \item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) | 343 \item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
326 except that we only apply | 344 except that we only apply |
327 transformations from slant to pinch. Therefore, the character is | 345 transformations from slant to pinch. Therefore, the character is |
328 transformed but no additional noise is added to the image, giving images | 346 transformed but no additional noise is added to the image, giving images
329 closer to the NIST dataset. | 347 closer to the NIST dataset. |
330 \end{itemize} | 348 \end{itemize} |
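As a concrete reading of the P07 sampling procedure described above, here is a minimal Python sketch; the source generators and the transformation/noise modules are hypothetical placeholders standing in for the project's actual code.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sample_p07_example(sources, pipeline, max_complexity=0.7):
    # sources: dict mapping a name to a callable returning (32x32 image, label).
    # pipeline: ordered list of transformation/noise modules, each a callable
    #           f(image, complexity) -> image.  Both are placeholders.
    names = ["nist", "fonts", "captcha", "ocr"]
    probs = [0.40, 0.10, 0.25, 0.25]        # mixing proportions from the text
    image, label = sources[rng.choice(names, p=probs)]()
    for transform in pipeline:
        # every module receives its own complexity, uniform in [0, 0.7]
        image = transform(image, complexity=rng.uniform(0.0, max_complexity))
    return image, label
\end{verbatim}

NISTP is obtained in the same way, but with the pipeline truncated to the geometric transformations (slant to pinch) only.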
331 | 349 |
332 \subsection{Models and their Hyperparameters} | 350 \subsection{Models and their Hyperparameters} |
333 | 351 |
352 All hyper-parameters are selected based on performance on the NISTP validation set. | |
353 | |
334 \subsubsection{Multi-Layer Perceptrons (MLP)} | 354 \subsubsection{Multi-Layer Perceptrons (MLP)} |
335 | 355 |
336 An MLP is a family of functions that are described by stacking layers of of a function similar to | 356 Whereas previous work had compared deep architectures to both shallow MLPs and |
337 $$g(x) = \tanh(b+Wx)$$ | 357 SVMs, we compared only to MLPs here because of the very large datasets used.
338 The input, $x$, is a $d$-dimension vector. | 358 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
339 The output, $g(x)$, is a $m$-dimension vector. | 359 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
340 The parameter $W$ is a $m\times d$ matrix and is called the weight matrix. | 360 The hyper-parameters are the number of hidden units, taken in
341 The parameter $b$ is a $m$-vector and is called the bias vector. | 361 $\{300,500,800,1000,1500\}$, and the learning rate. Training
342 The non-linearity (here $\tanh$) is applied element-wise to the output vector. | 362 examples are presented in minibatches of size 20. A constant learning
343 Usually the input is referred to a input layer and similarly for the output. | 363 rate was chosen in $\{10^{-6},10^{-5},10^{-4},10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
344 You can of course chain several such functions to obtain a more complex one. | 364 through preliminary experiments, and 0.1 was selected.
345 Here is a common example | 365 |
346 $$f(x) = c + V\tanh(b+Wx)$$ | |
347 In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer. | |
348 Here the output layer does not have the same non-linearity as the hidden layer. | |
349 This is a common case where some specialized non-linearity is applied to the output layer only depending on the task at hand. | |
350 | |
351 If you put 3 or more hidden layers in such a network you obtain what is called a deep MLP. | |
352 The parameters to adapt are the weight matrix and the bias vector for each layer. | |
353 | 366 |
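For concreteness, here is a minimal numpy sketch of the supervised model and update just described (one tanh hidden layer, softmax outputs, minibatch SGD with a constant learning rate). The sizes and the initialization scheme are illustrative assumptions, not the exact code used in the experiments.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def init_mlp(n_in=32*32, n_hidden=1000, n_out=62):
    # small uniform weights, zero biases (illustrative initialization)
    W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
    W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out))
    return W1, np.zeros(n_hidden), W2, np.zeros(n_out)

def forward(params, X):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)                      # hidden layer
    Z = H @ W2 + b2
    Z -= Z.max(axis=1, keepdims=True)             # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax: P(class | image)
    return H, P

def sgd_step(params, X, y, lr=0.1):
    # one minibatch update of the negative log-likelihood
    W1, b1, W2, b2 = params
    n = X.shape[0]
    H, P = forward(params, X)
    dZ = P.copy()
    dZ[np.arange(n), y] -= 1.0                    # gradient of NLL wrt output scores
    dZ /= n
    dH = (dZ @ W2.T) * (1.0 - H ** 2)             # backprop through tanh
    return (W1 - lr * (X.T @ dH), b1 - lr * dH.sum(axis=0),
            W2 - lr * (H.T @ dZ), b2 - lr * dZ.sum(axis=0))
\end{verbatim}

Training then simply iterates sgd_step over minibatches of 20 examples with the selected learning rate of 0.1.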
354 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)} | 367 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)} |
355 \label{SdA} | 368 \label{SdA} |
356 | 369 |
357 Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization. | 370 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
358 This is essentially unsupervised training where the layer is made to reconstruct its input through and encoding and decoding phase. | 371 can be used to initialize the weights of each layer of a deep MLP (with many hidden
359 Denoising auto-encoders are a variant where the input is corrupted with random noise but the target is the uncorrupted input. | 372 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
360 The principle behind these initialization methods is that the network will learn the inherent relation between portions of the data and be able to represent them thus helping with whatever task we want to perform. | 373 apparently by setting the parameters in the basin of attraction of a
361 | 374 supervised gradient descent solution that generalizes
362 An auto-encoder unit is formed of two MLP layers with the bottom one called the encoding layer and the top one the decoding layer. | 375 better~\citep{Erhan+al-2010}. It is hypothesized that the
363 Usually the top and bottom weight matrices are the transpose of each other and are fixed this way. | 376 advantage brought by this procedure stems from a better prior, |
364 The network is trained as such and, when sufficiently trained, the MLP layer is initialized with the parameters of the encoding layer. | 377 on the one hand taking advantage of the link between the input |
365 The other parameters are discarded. | 378 distribution $P(x)$ and the conditional distribution of interest |
366 | 379 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
367 The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder starting from the bottom. | 380 taking advantage of the expressive power and bias implicit in the |
368 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. | 381 deep architecture (whereby complex concepts are expressed as |
369 The second and following layers receive the same treatment except that they take as input the encoded version of the data that has gone through the layers before it. | 382 compositions of simpler ones through a deep hierarchy). |
370 For additional details see \citet{vincent:icml08}. | 383 |
384 Here we chose to use the Denoising | |
385 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | |
386 these deep hierarchies of features, as it is very simple to train and | |
387 teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}), | |
388 provides immediate and efficient inference, and yielded results | |
389 comparable to or better than RBMs in a series of experiments | |
390 \citep{VincentPLarochelleH2008}. During training of a Denoising | |
391 Auto-Encoder, it is presented with a stochastically corrupted version | |
392 of the input and trained to reconstruct the uncorrupted input, | |
393 forcing the hidden units to represent the leading regularities in | |
394 the data. Once it is trained, its hidden unit activations can | |
395 be used as inputs for training a second one, etc. | |
396 After this unsupervised pre-training stage, the parameters | |
397 are used to initialize a deep MLP, which is fine-tuned by | |
398 the same standard supervised procedure used for the MLP (see previous section). | |
399 | |
400 The hyper-parameters are the same as for the MLP, with the addition of the | |
401 amount of corruption noise (we used the masking noise process, whereby a | |
402 fixed proportion of the input values, randomly selected, are zeroed), and a | |
403 separate learning rate for the unsupervised pre-training stage (selected | |
404 from the same set as above). The fraction of inputs corrupted was selected | |
405 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | |
406 of hidden layers but it was fixed to 3 based on previous work with | |
407 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | |
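To make the pre-training stage concrete, the following is a minimal numpy sketch of a denoising auto-encoder layer with tied weights, masking noise and cross-entropy reconstruction, stacked greedily. It is an editorial illustration under these assumptions (the layer sizes, learning rate and number of epochs shown are illustrative defaults), not the experimental code; the resulting parameters would then initialize the deep MLP before supervised fine-tuning.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(W, b, c, X, corruption=0.2, lr=0.01):
    # one minibatch update of a denoising auto-encoder with tied weights
    # (decoder uses W.T), masking noise and cross-entropy reconstruction
    n = X.shape[0]
    Xc = X * rng.binomial(1, 1.0 - corruption, X.shape)  # zero a fraction of inputs
    H = sigmoid(Xc @ W + b)                              # encode the corrupted input
    Xhat = sigmoid(H @ W.T + c)                          # reconstruct
    dZdec = (Xhat - X) / n               # gradient wrt decoder pre-activations
    dZenc = (dZdec @ W) * H * (1.0 - H)  # gradient wrt encoder pre-activations
    W -= lr * (Xc.T @ dZenc + dZdec.T @ H)   # tied weights: two contributions
    b -= lr * dZenc.sum(axis=0)
    c -= lr * dZdec.sum(axis=0)
    return W, b, c

def pretrain_stack(X, layer_sizes=(1000, 1000, 1000), corruption=0.2,
                   lr=0.01, n_epochs=10, batch_size=20):
    # greedy layer-wise pre-training: each layer is trained as a DAE on the
    # (clean) hidden representation produced by the layers below it
    params, inputs = [], X
    for n_hidden in layer_sizes:
        d = inputs.shape[1]
        W = rng.uniform(-0.05, 0.05, (d, n_hidden))
        b, c = np.zeros(n_hidden), np.zeros(d)
        for _ in range(n_epochs):
            for i in range(0, inputs.shape[0], batch_size):
                W, b, c = dae_step(W, b, c, inputs[i:i+batch_size], corruption, lr)
        params.append((W, b))
        inputs = sigmoid(inputs @ W + b)   # clean activations feed the next layer
    return params   # used to initialize the hidden layers of the deep MLP
\end{verbatim}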
371 | 408 |
372 \section{Experimental Results} | 409 \section{Experimental Results} |
373 | 410 |
374 \subsection{SDA vs MLP vs Humans} | 411 \subsection{SDA vs MLP vs Humans} |
375 | 412 |
399 on NIST digits classification using the same test set are included.} | 436 on NIST digits classification using the same test set are included.} |
400 \label{tab:sda-vs-mlp-vs-humans} | 437 \label{tab:sda-vs-mlp-vs-humans} |
401 \begin{center} | 438 \begin{center} |
402 \begin{tabular}{|l|r|r|r|r|} \hline | 439 \begin{tabular}{|l|r|r|r|r|} \hline |
403 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline | 440 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline |
404 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $>1.1\%$ \\ \hline | 441 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $1.4\%$ \\ \hline |
405 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline | 442 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline |
406 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline | 443 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline |
407 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline | 444 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline |
408 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline | 445 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline |
409 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline | 446 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline |
410 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline | 447 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline |
411 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline | 448 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline |
412 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline | 449 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline |
413 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline | 450 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline |
414 \citep{Migram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline | 451 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline |
415 \end{tabular} | 452 \end{tabular} |
416 \end{center} | 453 \end{center} |
417 \end{table} | 454 \end{table} |
418 | 455 |
419 \subsection{Perturbed Training Data More Helpful for SDAE} | 456 \subsection{Perturbed Training Data More Helpful for SDAE} |
487 \end{center} | 524 \end{center} |
488 \end{table} | 525 \end{table} |
489 | 526 |
490 \section{Conclusions} | 527 \section{Conclusions} |
491 | 528 |
529 The conclusions are positive for all the questions asked in the introduction. | |
530 \begin{itemize} | |
531 \item Do the good results previously obtained with deep architectures on the | |
532 MNIST digits generalize to the setting of a much larger and richer (but similar) | |
533 dataset, the NIST Special Database 19, with 62 classes and around 800k examples? | |
534 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level | |
535 performance. | |
536 \item To what extent does the perturbation of input images (e.g. adding | |
537 noise, affine transformations, background images) make the resulting | |
538 classifier better not only on similarly perturbed images but also on | |
539 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | |
540 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | |
541 MLPs were helped by perturbed training examples when tested on perturbed input images, | |
542 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs | |
543 were very significantly boosted by these out-of-distribution examples. | |
544 \item Similarly, does the feature learning step in deep learning algorithms benefit more | |
545 from training with similar but different classes (i.e. a multi-task learning scenario) than | |
546 a corresponding shallow and purely supervised architecture? | |
547 Whereas the improvement due to the multi-task setting was marginal or | |
548 negative for the MLP, it was very significant for the SDA. | |
549 \end{itemize} | |
550 | |
492 \bibliography{strings,ml,aigaion,specials} | 551 \bibliography{strings,ml,aigaion,specials} |
493 %\bibliographystyle{plainnat} | 552 %\bibliographystyle{plainnat} |
494 \bibliographystyle{unsrtnat} | 553 \bibliographystyle{unsrtnat} |
495 %\bibliographystyle{apalike} | 554 %\bibliographystyle{apalike} |
496 | 555 |