changeset 473:92d6df91939f

merge
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 30 May 2010 10:44:40 -0400
parents 2dd6e8962df1 (diff) 0a9600709a9f (current diff)
children bcf024e6ab23
files
diffstat 2 files changed, 107 insertions(+), 48 deletions(-)
--- a/writeup/ml.bib	Sat May 29 15:38:13 2010 -0700
+++ b/writeup/ml.bib	Sun May 30 10:44:40 2010 -0400
@@ -25739,7 +25739,7 @@
   bibsource = {DBLP, http://dblp.uni-trier.de}
 }
 
-@inproceedings{Migram+al-2005,
+@inproceedings{Milgram+al-2005,
   author = {Milgram, J. and Cheriet, M. and Sabourin, R.},
   title = {Estimating accurate multi-class probabilities with support vector machines},
   booktitle = {Int. Joint Conf. on Neural Networks},
--- a/writeup/nips2010_submission.tex	Sat May 29 15:38:13 2010 -0700
+++ b/writeup/nips2010_submission.tex	Sun May 30 10:44:40 2010 -0400
@@ -279,20 +279,31 @@
 
 \section{Experimental Setup}
 
-\subsection{Training Datasets}
+Whereas much previous work on deep learning algorithms had been performed on
+the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
+with 60~000 examples, and variants involving 10~000
+examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
+to focus here on the case of much larger training sets, from 10 times
+to 1000 times larger.  The larger datasets are obtained by first sampling from
+a {\em data source} (NIST characters, scanned machine-printed characters, characters
+from fonts, or characters from captchas) and then optionally applying some of the
+above transformations and/or noise processes.
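
For concreteness, here is a minimal Python sketch of the generation loop just described; it is an illustration only, not the authors' code. The draw_character function and the entries of transformations are hypothetical stand-ins for the actual source readers and perturbation modules, and the mixture proportions and complexity range are those given for the P07 dataset below.

    import random

    # Mixture proportions over the four data sources (the P07 values, see Data Sets below).
    SOURCE_PROBS = [("fonts", 0.10), ("captchas", 0.25), ("ocr", 0.25), ("nist", 0.40)]

    def sample_source():
        """Pick one data source according to the mixture proportions."""
        r, cumulative = random.random(), 0.0
        for name, p in SOURCE_PROBS:
            cumulative += p
            if r < cumulative:
                return name
        return SOURCE_PROBS[-1][0]

    def generate_example(draw_character, transformations, max_complexity=0.7):
        """Draw one raw character and send it through the transformation pipeline.

        draw_character(source) -> (image, label) and each transform(image, complexity)
        are placeholders for the real source readers and transformation/noise modules.
        """
        image, label = draw_character(sample_source())
        for transform in transformations:                     # applied in a fixed order
            complexity = random.uniform(0.0, max_complexity)  # per-module complexity
            image = transform(image, complexity)
        return image, label
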
 
-\subsubsection{Data Sources}
+\subsection{Data Sources}
 
 \begin{itemize}
 \item {\bf NIST}
-The NIST Special Database 19 (NIST19) is a very widely used dataset for training and testing OCR systems. 
+Our main source of characters is the NIST Special Database 19~\cite{Grother-1995}, 
+widely used for training and testing character
+recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}. 
 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled with one of the 62 classes
 corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 series of different complexity.
-The fourth series, $hsf_4$, experimentally recognized to be the most difficult one for classification task is recommended 
-by NIST as testing set and is used in our work for that purpose. It contains 82600 examples,
-while the training and validation sets (which have the same distribution) contain XXXXX and
-XXXXX examples respectively.
+The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
+by NIST as a test set, and is used for that purpose in our work as in some previous work~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
+We randomly split the remainder into a training set and a validation set for
+model selection. The sizes of these data sets are: XXX for training, XXX for validation,
+and XXX for testing.
 The performances reported by previous work on this dataset were mostly obtained using only the digits.
 Here we use all the classes, both in the training and testing phases. This is especially
 useful for estimating the effect of a multi-task setting.
@@ -305,23 +316,30 @@
 
 \item {\bf Captchas}
 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
-generating characters of the same format as the NIST dataset. The core of this data source is composed with a random character
-generator and various kinds of tranformations similar to those described in the previous sections. 
-In order to increase the variability of the data generated, different fonts are used for generating the characters. 
+generating characters of the same format as the NIST dataset. This software is based on
+a random character class generator and various kinds of transformations similar to those described in the previous sections.
+To increase the variability of the generated data, many different fonts are used for generating the characters.
 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
 allowed and can be controlled via an easy-to-use facade class.
 \item {\bf OCR data}
+A large set (2 million) of scanned, OCRed and manually verified machine-printed
+characters (from various documents and books) was included as an
+additional source. This set is part of a larger corpus being collected by the Image Understanding
+and Pattern Recognition research group led by Thomas Breuel at the University of Kaiserslautern
+({\tt http://www.iupr.com}), which will be publicly released.
 \end{itemize}
 
-\subsubsection{Data Sets}
+\subsection{Data Sets}
+All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
+from one of the 62 character classes.
 \begin{itemize}
-\item {\bf NIST} This is the raw NIST special database 19.
-\item {\bf P07}
-The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. 
-For each new exemple to generate, we choose one source with the following probability: $0.1$ for the fonts,
-$0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in their order
-and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
+\item {\bf NIST}. This is the raw NIST Special Database 19.
+\item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources
+and sending them through the above transformation pipeline.
+To generate each new example, a source is selected at random: the fonts with probability $10\%$,
+the captchas with $25\%$, the OCR data with $25\%$ and NIST with $40\%$. We apply all the transformations in the
+order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
 \item {\bf NISTP}. This dataset is equivalent to P07 (complexity parameter of $0.7$, with the same proportions of sources)
   except that we only apply
   transformations from slant to pinch. Therefore, the character is
@@ -331,43 +349,62 @@
 
 \subsection{Models and their Hyperparameters}
 
+All hyper-parameters are selected based on performance on the NISTP validation set.
+
 \subsubsection{Multi-Layer Perceptrons (MLP)}
 
-An MLP is a family of functions that are described by stacking layers of of a function similar to
-$$g(x) = \tanh(b+Wx)$$
-The input, $x$, is a $d$-dimension vector.  
-The output, $g(x)$, is a $m$-dimension vector.
-The parameter $W$ is a $m\times d$ matrix and is called the weight matrix.
-The parameter  $b$ is a $m$-vector and is called the bias vector.
-The non-linearity (here $\tanh$) is applied element-wise to the output vector.
-Usually the input is referred to a input layer and similarly for the output.
-You can of course chain several such functions to obtain a more complex one.
-Here is a common example
-$$f(x) = c + V\tanh(b+Wx)$$
-In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer.
-Here the output layer does not have the same non-linearity as the hidden layer.
-This is a common case where some specialized non-linearity is applied to the output layer only depending on the task at hand.
+Whereas previous work had compared deep architectures to both shallow MLPs and
+SVMs, we compare only to MLPs here because of the very large datasets used.
+The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
+exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
+The number of hidden units is itself a hyper-parameter, chosen from
+$\{300,500,800,1000,1500\}$. The optimization procedure is as follows: training
+examples are presented in minibatches of size 20, and a constant learning
+rate is chosen from $\{10^{-6},10^{-5},10^{-4},10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
+through preliminary experiments; 0.1 was selected.
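
As an illustration, a minimal numpy sketch of such a one-hidden-layer MLP (tanh hidden units, softmax output, minibatch SGD) is given below; the weight initialization scheme and the particular hidden-layer size are arbitrary assumptions, not the values chosen by the hyper-parameter search.

    import numpy as np

    rng = np.random.RandomState(0)
    n_in, n_hidden, n_classes = 32 * 32, 1000, 62      # 32x32 images, 62 character classes
    W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden));     b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

    def softmax(a):
        e = np.exp(a - a.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def forward(X):
        H = np.tanh(X.dot(W1) + b1)                    # hidden layer
        return H, softmax(H.dot(W2) + b2)              # estimates P(class | image)

    def sgd_step(X, y, lr=0.1):
        """One minibatch update of the cross-entropy loss with a constant learning rate."""
        global W1, b1, W2, b2
        H, P = forward(X)
        dZ = P.copy()
        dZ[np.arange(len(y)), y] -= 1.0                # gradient w.r.t. pre-softmax activations
        dZ /= len(y)
        dH = dZ.dot(W2.T) * (1.0 - H ** 2)             # backprop through tanh
        W2 -= lr * H.T.dot(dZ);  b2 -= lr * dZ.sum(axis=0)
        W1 -= lr * X.T.dot(dH);  b1 -= lr * dH.sum(axis=0)

    # e.g. one pass over minibatches of 20 examples:
    #   for i in range(0, len(Xtrain), 20): sgd_step(Xtrain[i:i+20], ytrain[i:i+20])
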
 
-If you put 3 or more hidden layers in such a network you obtain what is called a deep MLP.
-The parameters to adapt are the weight matrix and the bias vector for each layer.
 
 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
 \label{SdA}
 
-Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization.
-This is essentially unsupervised training where the layer is made to reconstruct its input through and encoding and decoding phase.
-Denoising auto-encoders are a variant where the input is corrupted with random noise but the target is the uncorrupted input.
-The principle behind these initialization methods is that the network will learn the inherent relation between portions of the data and be able to represent them thus helping with whatever task we want to perform.
+Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
+can be used to initialize the weights of each layer of a deep MLP (with many hidden
+layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
+enabling better generalization, apparently by setting the parameters in a
+basin of attraction of supervised gradient descent that yields better
+generalization~\citep{Erhan+al-2010}. It is hypothesized that the
+advantage brought by this procedure stems from a better prior:
+on the one hand it takes advantage of the link between the input
+distribution $P(x)$ and the conditional distribution of interest
+$P(y|x)$ (as in semi-supervised learning), and on the other hand
+it takes advantage of the expressive power and bias implicit in the
+deep architecture (whereby complex concepts are expressed as
+compositions of simpler ones through a deep hierarchy).
 
-An auto-encoder unit is formed of two MLP layers with the bottom one called the encoding layer and the top one the decoding layer.
-Usually the top and bottom weight matrices are the transpose of each other and are fixed this way.
-The network is trained as such and, when sufficiently trained, the MLP layer is initialized with the parameters of the encoding layer.
-The other parameters are discarded.
+Here we chose to use the Denoising
+Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
+these deep hierarchies of features, as it is very simple to train and
+explain (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
+provides immediate and efficient inference, and yielded results
+comparable to or better than RBMs in a series of experiments
+\citep{VincentPLarochelleH2008}. During training, a Denoising
+Auto-Encoder is presented with a stochastically corrupted version
+of the input and trained to reconstruct the uncorrupted input,
+forcing the hidden units to represent the leading regularities in
+the data. Once it is trained, its hidden unit activations can
+be used as inputs for training a second one, etc.
+After this unsupervised pre-training stage, the parameters
+are used to initialize a deep MLP, which is fine-tuned by
+the same standard supervised procedure used to train MLPs (see the previous section).
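
The following numpy sketch shows one such denoising auto-encoder layer with the masking-noise corruption described here; it is a simplified illustration, and the tied encoder/decoder weights, sigmoid units and learning rate are assumptions of the sketch rather than details stated in the text.

    import numpy as np

    rng = np.random.RandomState(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    class DenoisingAutoencoder:
        """One layer: corrupt the input, encode it, and reconstruct the clean input."""

        def __init__(self, n_visible, n_hidden, corruption=0.2):
            self.W = rng.uniform(-0.05, 0.05, (n_visible, n_hidden))  # tied weights (assumption)
            self.b_hid = np.zeros(n_hidden)
            self.b_vis = np.zeros(n_visible)
            self.corruption = corruption                              # fraction of inputs zeroed

        def encode(self, X):
            return sigmoid(X.dot(self.W) + self.b_hid)

        def train_step(self, X, lr=0.01):
            # masking noise: zero a randomly selected fixed proportion of the input values
            Xc = X * rng.binomial(1, 1.0 - self.corruption, X.shape)
            H = self.encode(Xc)
            R = sigmoid(H.dot(self.W.T) + self.b_vis)                 # reconstruction of clean X
            dR = (R - X) / len(X)            # cross-entropy loss: gradient at pre-sigmoid output
            dH = dR.dot(self.W) * H * (1.0 - H)
            self.W -= lr * (Xc.T.dot(dH) + dR.T.dot(H))
            self.b_vis -= lr * dR.sum(axis=0)
            self.b_hid -= lr * dH.sum(axis=0)
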
 
-The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder  starting from the bottom.
-During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder.
-The second and following layers receive the same treatment except that they take as input the encoded version of the data that has gone through the layers before it.
-For additional details see \citet{vincent:icml08}.
+The hyper-parameters are the same as for the MLP, with the addition of the
+amount of corruption noise (we used the masking noise process, whereby a
+fixed proportion of the input values, randomly selected, are zeroed), and a
+separate learning rate for the unsupervised pre-training stage (selected
+from the same set as above). The fraction of corrupted inputs was selected
+among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
+of hidden layers, but it was fixed to 3 based on previous work with
+stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
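
Reusing the DenoisingAutoencoder class from the sketch above, greedy layer-wise pre-training of three such layers could look as follows; the layer sizes, corruption level, number of epochs and the random placeholder data are illustrative assumptions, not the selected hyper-parameter values.

    import numpy as np

    rng = np.random.RandomState(0)
    X_train = rng.uniform(0.0, 1.0, (200, 32 * 32))    # placeholder for the real 32x32 images

    def minibatches(data, size=20):
        for start in range(0, len(data), size):
            yield data[start:start + size]

    layers, inputs = [], X_train
    for n_hidden, corruption in [(1000, 0.2)] * 3:     # 3 hidden layers (sizes are examples)
        dae = DenoisingAutoencoder(inputs.shape[1], n_hidden, corruption)
        for epoch in range(10):                        # unsupervised pre-training passes
            for batch in minibatches(inputs):
                dae.train_step(batch)
        layers.append(dae)
        inputs = dae.encode(inputs)                    # codes become the next layer's input

    # Each layer's (W, b_hid) would then initialize one hidden layer of a deep MLP,
    # which is fine-tuned with the supervised SGD procedure of the previous section.
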
 
 \section{Experimental Results}
 
@@ -401,7 +438,7 @@
 \begin{center}
 \begin{tabular}{|l|r|r|r|r|} \hline
       & NIST test          & NISTP test       & P07 test       & NIST test digits   \\ \hline
-Humans&   18.2\% $\pm$.1\%   &  39.4\%$\pm$.1\%   &  46.9\%$\pm$.1\%  &  $>1.1\%$ \\ \hline 
+Humans&   18.2\% $\pm$.1\%   &  39.4\%$\pm$.1\%   &  46.9\%$\pm$.1\%  &  $1.4\%$ \\ \hline 
 SDA0   &  23.7\% $\pm$.14\%  &  65.2\%$\pm$.34\%  & 97.45\%$\pm$.06\%  & 2.7\% $\pm$.14\%\\ \hline 
 SDA1   &  17.1\% $\pm$.13\%  &  29.7\%$\pm$.3\%  & 29.7\%$\pm$.3\%  & 1.4\% $\pm$.1\%\\ \hline 
 SDA2   &  18.7\% $\pm$.13\%  &  33.6\%$\pm$.3\%  & 39.9\%$\pm$.17\%  & 1.7\% $\pm$.1\%\\ \hline 
@@ -411,7 +448,7 @@
 \citep{Granger+al-2007} &     &                    &                   & 4.95\% $\pm$.18\% \\ \hline
 \citep{Cortes+al-2000} &      &                    &                   & 3.71\% $\pm$.16\% \\ \hline
 \citep{Oliveira+al-2002} &    &                    &                   & 2.4\% $\pm$.13\% \\ \hline
-\citep{Migram+al-2005} &      &                    &                   & 2.1\% $\pm$.12\% \\ \hline
+\citep{Milgram+al-2005} &      &                    &                   & 2.1\% $\pm$.12\% \\ \hline
 \end{tabular}
 \end{center}
 \end{table}
@@ -489,6 +526,28 @@
 
 \section{Conclusions}
 
+The conclusions are positive for all the questions asked in the introduction.
+\begin{itemize}
+\item Do the good results previously obtained with deep architectures on the
+MNIST digits generalize to the setting of a much larger and richer (but similar)
+dataset, the NIST Special Database 19, with 62 classes and around 800k examples?
+Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
+performance.
+\item To what extent does the perturbation of input images (e.g. adding
+noise, affine transformations, background images) make the resulting
+classifier better not only on similarly perturbed images but also on
+the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
+examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
+MLPs were helped by perturbed training examples when tested on perturbed input images,
+but were only marginally helped with respect to clean examples. On the other hand, the deep SDAs
+were very significantly boosted by these out-of-distribution examples.
+\item Similarly, does the feature learning step in deep learning algorithms benefit more
+from training with similar but different classes (i.e. a multi-task learning scenario) than
+a corresponding shallow and purely supervised architecture?
+Whereas the improvement due to the multi-task setting was marginal or
+negative for the MLP, it was very significant for the SDA.
+\end{itemize}
+
 \bibliography{strings,ml,aigaion,specials}
 %\bibliographystyle{plainnat}
 \bibliographystyle{unsrtnat}