# HG changeset patch
# User Yoshua Bengio
# Date 1275352942 14400
# Node ID 9a757d565e468e2d52b6535e6ce84f08f51155c7
# Parent b9cdb464de5fa6b5abe7aae4977ea80a85e39c8c
size reduction

diff -r b9cdb464de5f -r 9a757d565e46 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex Mon May 31 17:57:45 2010 -0400
+++ b/writeup/nips2010_submission.tex Mon May 31 20:42:22 2010 -0400
@@ -15,6 +15,7 @@
 %\makeanontitle
 \maketitle
+\vspace*{-2mm}
 \begin{abstract}
 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep
@@ -36,8 +37,10 @@
 obtained by training with these highly distorted images or by including object classes different from those in the target test set.
 \end{abstract}
+\vspace*{-2mm}
 \section{Introduction}
+\vspace*{-1mm}
 Deep Learning has emerged as a promising new area of research in statistical machine learning (see~\citet{Bengio-2009} for a review).
@@ -45,20 +48,13 @@
 of useful representations of data, which are better suited to the task at hand. This is in great part inspired by observations of the mammalian visual cortex, which consists of a chain of processing elements, each of which is associated with a
-different representation. In fact,
+different representation of the raw visual input. In fact,
 it was found recently that the features learnt in deep architectures resemble those observed in the first two of these stages (in areas V1 and V2
-of visual cortex)~\citep{HonglakL2008}.
-Processing images typically involves transforming the raw pixel data into
-new {\bf representations} that can be used for analysis or classification.
-For example, a principal component analysis representation linearly projects
-the input image into a lower-dimensional feature space.
-Why learn a representation? Current practice in the computer vision
-literature converts the raw pixels into a hand-crafted representation
-e.g.\ SIFT features~\citep{Lowe04}, but deep learning algorithms
-tend to discover similar features in their first few
-levels~\citep{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}.
-Learning increases the
+of visual cortex)~\citep{HonglakL2008}, and that they become more and
+more invariant to factors of variation (such as camera movement) in
+higher layers~\cite{Goodfellow2009}.
+Learning a hierarchy of features increases the
 ease and practicality of developing representations that are at once tailored to specific tasks, yet are able to borrow statistical strength from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
@@ -81,27 +77,49 @@
 the bottom is trained to encode their input (the output of the previous layer) and try to reconstruct it from a corrupted version of it. After this unsupervised initialization, the stack of denoising auto-encoders can be
-converted into a deep supervised feedforward neural network and trained by
+converted into a deep supervised feedforward neural network and fine-tuned by
 stochastic gradient descent.
+Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
+of semi-supervised and multi-task learning: the learner can exploit examples
+that are unlabeled and/or come from a distribution different from the target
+distribution, e.g., from other classes than those of interest. Whereas
+it has already been shown that deep learners can clearly take advantage of
+unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008}
+and multi-task learning, not much has been done yet to explore the impact
+of {\em out-of-distribution} examples and of the multi-task setting
+(but see~\citep{CollobertR2008-short}). In particular the {\em relative
+advantage} of deep learning for these settings has not been evaluated.
+
 In this paper we ask the following questions:
-\begin{enumerate}
-\item Do the good results previously obtained with deep architectures on the
-MNIST digits generalize to the setting of a much larger and richer (but similar)
+
+%\begin{enumerate}
+$\bullet$ %\item
+Do the good results previously obtained with deep architectures on the
+MNIST digit images generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
-\item To what extent does the perturbation of input images (e.g. adding
+
+$\bullet$ %\item
+To what extent does the perturbation of input images (e.g. adding
 noise, affine transformations, background images) make the resulting
-classifier better not only on similarly perturbed images but also on
+classifiers better not only on similarly perturbed images but also on
 the {\em original clean examples}?
-\item Do deep architectures benefit more from such {\em out-of-distribution}
+
+$\bullet$ %\item
+Do deep architectures {\em benefit more from such out-of-distribution}
 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
-\item Similarly, does the feature learning step in deep learning algorithms benefit more
+
+$\bullet$ %\item
+Similarly, does the feature learning step in deep learning algorithms benefit more
 from training with similar but different classes (i.e. a multi-task learning scenario) than a corresponding shallow and purely supervised architecture?
-\end{enumerate}
+%\end{enumerate}
+
 The experimental results presented here provide positive evidence towards all of these questions.
+\vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
+\vspace*{-1mm}
 This section describes the different transformations we used to stochastically transform source images in order to obtain data. More details can
@@ -115,7 +133,9 @@
 from slant to pinch below, performs transformations.
 The second part, from blur to contrast, adds different kinds of noise.
-{\large\bf Transformations}\\
+{\large\bf Transformations}
+
+\vspace*{2mm}
 {\bf Slant.} We mimic slant by shifting each row of the image
@@ -178,9 +198,13 @@
 d_1$, where $pinch$ is a parameter to the filter. The actual value is given by bilinear interpolation considering the pixels around the (non-integer) source position thus found.
-Here $pinch \sim U[-complexity, 0.7 \times complexity]$.\\
+Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
+
+\vspace*{1mm}
-{\large\bf Injecting Noise}\\
+{\large\bf Injecting Noise}
+
+\vspace*{1mm}
 {\bf Motion Blur.} This GIMP filter is a ``linear motion blur'' in GIMP
@@ -286,8 +310,9 @@
 \end{figure}
-
+\vspace*{-1mm}
 \section{Experimental Setup}
+\vspace*{-1mm}
 Whereas much previous work on deep learning algorithms had been performed on the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
@@ -299,10 +324,13 @@
 from fonts, or characters from captchas) and then optionally applying some of the above transformations and/or noise processes.
+\vspace*{-1mm}
 \subsection{Data Sources}
+\vspace*{-1mm}
-\begin{itemize}
-\item {\bf NIST}
+%\begin{itemize}
+%\item
+{\bf NIST.}
 Our main source of characters is the NIST Special Database 19~\cite{Grother-1995}, widely used for training and testing character recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
@@ -322,16 +350,16 @@
 of letters in the test set, not in the training set (more like the natural distribution of letters in text).
-\item {\bf Fonts}
+%\item
+{\bf Fonts.}
 In order to have a good variety of sources we downloaded a large number of free fonts from:
 {\tt http://anonymous.url.net}. %real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
 Including Windows 7's fonts, this adds up to a total of $9817$ different fonts from which we can choose uniformly.
 The ttf file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, directly as input to our models.
-
-
-\item {\bf Captchas}
+%\item
+{\bf Captchas.}
 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for generating characters of the same format as the NIST dataset. This software is based on a random character class generator and various kinds of transformations similar to those described in the previous sections.
@@ -339,37 +367,49 @@
 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are allowed and can be controlled via an easy-to-use facade class.
-\item {\bf OCR data}
+
+%\item
+{\bf OCR data.}
 A large set (2 million) of scanned, OCRed and manually verified machine-printed characters (from various documents and books) were included as an additional source. This set is part of a larger corpus being collected by the Image Understanding Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern ({\tt http://www.iupr.com}), and which will be publicly released.
-\end{itemize}
+%\end{itemize}
+\vspace*{-1mm}
 \subsection{Data Sets}
+\vspace*{-1mm}
+
 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label from one of the 62 character classes.
-\begin{itemize}
-\item {\bf NIST}. This is the raw NIST special database 19.
-\item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources
+%\begin{itemize}
+
+%\item
+{\bf NIST.} This is the raw NIST special database 19.
+
+%\item
+{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
 and sending them through the above transformation pipeline.
 For each new example to generate, a source is selected with probability $10\%$ from the fonts, $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the order given above, and for each of them we uniformly sample a complexity in the range $[0,0.7]$.
-\item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion)
+
+%\item
+{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
 except that we only apply transformations from slant to pinch. Therefore, the character is transformed but no additional noise is added to the image, giving images closer to the NIST dataset.
-\end{itemize}
+%\end{itemize}
+\vspace*{-1mm}
 \subsection{Models and their Hyperparameters}
+\vspace*{-1mm}
 All hyper-parameters are selected based on performance on the NISTP validation set.
-\subsubsection{Multi-Layer Perceptrons (MLP)}
-
+{\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep architectures to both shallow MLPs and SVMs, we only compared to MLPs here because of the very large datasets used. The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
@@ -380,10 +420,7 @@
 rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$ through preliminary experiments, and 0.1 was selected.
-
-\subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
-\label{SdA}
-
+{\bf Stacked Denoising Auto-Encoders (SDAE).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) can be used to initialize the weights of each layer of a deep MLP (with many hidden layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006}
@@ -397,7 +434,6 @@
 taking advantage of the expressive power and bias implicit in the deep architecture (whereby complex concepts are expressed as compositions of simpler ones through a deep hierarchy).
-
 Here we chose to use the Denoising Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for these deep hierarchies of features, as it is very simple to train and
@@ -413,8 +449,7 @@
 After this unsupervised pre-training stage, the parameters are used to initialize a deep MLP, which is fine-tuned by the same standard procedure used to train MLPs (see previous section).
-
-The hyper-parameters are the same as for the MLP, with the addition of the
+The SDA hyper-parameters are the same as for the MLP, with the addition of the
 amount of corruption noise (we used the masking noise process, whereby a fixed proportion of the input values, randomly selected, are zeroed), and a separate learning rate for the unsupervised pre-training stage (selected
@@ -423,9 +458,12 @@
 of hidden layers but it was fixed to 3 based on previous work with stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
+\vspace*{-1mm}
 \section{Experimental Results}
+\vspace*{-1mm}
 \subsection{SDA vs MLP vs Humans}
+\vspace*{-1mm}
 We compare here the best MLP (according to validation set error) that we found against the best SDA (again according to validation set error), along with a precise estimate
@@ -436,10 +474,14 @@
 processing \citep{SnowEtAl2008} and vision \citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented with 10 character images and asked to type 10 corresponding ASCII
-characters. Hence they were forced to make a hard choice among the
-62 character classes. Three users classified each image, allowing
+characters. They were forced to make a hard choice among the
+62 or 10 character classes (all classes or digits only).
+Three users classified each image, allowing us
 to estimate inter-human variability (shown as +/- in parentheses below).
+Figure~\ref{fig:error-rates-charts} summarizes the results obtained.
+More detailed results and tables can be found in the appendix.
+
 \begin{table}
 \caption{Overall comparison of error rates ($\pm$ std.err.)
 on 62 character classes (10 digits + 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training
@@ -476,7 +518,9 @@
 \label{fig:error-rates-charts}
 \end{figure}
+\vspace*{-1mm}
 \subsection{Perturbed Training Data More Helpful for SDAE}
+\vspace*{-1mm}
 \begin{table}
 \caption{Relative change in error rates due to the use of perturbed training data,
@@ -499,8 +543,9 @@
 \end{center}
 \end{table}
-
+\vspace*{-1mm}
 \subsection{Multi-Task Learning Effects}
+\vspace*{-1mm}
 As previously seen, the SDA is better able to benefit from the transformations applied to the data than the MLP. In this experiment we
@@ -554,19 +599,21 @@
 \label{fig:improvements-charts}
 \end{figure}
-A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
-can be executed on-line at {\tt http://deep.host22.com}.
-
+\vspace*{-1mm}
 \section{Conclusions}
+\vspace*{-1mm}
 The conclusions are positive for all the questions asked in the introduction.
-\begin{itemize}
-\item Do the good results previously obtained with deep architectures on the
+%\begin{itemize}
+$\bullet$ %\item
+Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar) dataset, the NIST special database 19, with 62 classes and around 800k examples? Yes, the SDA systematically outperformed the MLP, in fact reaching human-level performance.
-\item To what extent does the perturbation of input images (e.g. adding
+
+$\bullet$ %\item
+To what extent does the perturbation of input images (e.g. adding
 noise, affine transformations, background images) make the resulting classifier better not only on similarly perturbed images but also on the {\em original clean examples}?
 Do deep architectures benefit more from such {\em out-of-distribution}
@@ -574,16 +621,24 @@
 MLPs were helped by perturbed training examples when tested on perturbed input images, but only marginally helped with respect to clean examples. On the other hand, the deep SDAs were very significantly boosted by these out-of-distribution examples.
-\item Similarly, does the feature learning step in deep learning algorithms benefit more
+
+$\bullet$ %\item
+Similarly, does the feature learning step in deep learning algorithms benefit more
 from training with similar but different classes (i.e. a multi-task learning scenario) than a corresponding shallow and purely supervised architecture? Whereas the improvement due to the multi-task setting was marginal or negative for the MLP, it was very significant for the SDA.
-\end{itemize}
+%\end{itemize}
+A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
+can be executed on-line at {\tt http://deep.host22.com}.
+
+
+{\small
 \bibliography{strings,ml,aigaion,specials}
 %\bibliographystyle{plainnat}
 \bibliographystyle{unsrtnat}
 %\bibliographystyle{apalike}
+}
 \end{document}
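
The P07 generation procedure described under Data Sets in this patch (a source drawn with probabilities 10%, 25%, 25% and 40%, then every transformation applied in order with a complexity drawn uniformly from [0, 0.7]) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `sample_p07_plan`, the `transform_names` argument and the example module names are hypothetical and are not taken from the authors' code. The sketch only draws the random choices and does not implement the image transformations themselves.

```python
import random

# Source mixture stated in the text for P07.
SOURCE_PROBS = {"fonts": 0.10, "captchas": 0.25, "ocr": 0.25, "nist": 0.40}

# Upper bound of the uniform complexity range stated in the text.
MAX_COMPLEXITY = 0.7


def sample_p07_plan(transform_names, rng):
    """Draw the random choices for one new P07 example.

    transform_names: ordered list of transformation names (hypothetical
        labels; the order is the pipeline order given in the paper).
    rng: a random.Random instance, so that sampling is reproducible.

    Returns (source, plan), where plan is a list of
    (transform_name, complexity) pairs in pipeline order.
    """
    sources = list(SOURCE_PROBS)
    weights = [SOURCE_PROBS[s] for s in sources]
    # Pick one data source according to the stated proportions.
    source = rng.choices(sources, weights=weights, k=1)[0]
    # Every transformation is applied, each with its own complexity
    # sampled uniformly from [0, MAX_COMPLEXITY].
    plan = [(name, rng.uniform(0.0, MAX_COMPLEXITY)) for name in transform_names]
    return source, plan


# Example call with placeholder names (assumed, for illustration only).
example_source, example_plan = sample_p07_plan(
    ["slant", "pinch", "motion_blur", "contrast"], random.Random(0)
)
```

Passing an explicit `random.Random` instance keeps the draw reproducible. A NISTP-like variant would pass only the names from slant to pinch, as the text describes.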