# HG changeset patch
# User Yoshua Bengio <bengioy@iro.umontreal.ca>
# Date 1275352942 14400
# Node ID 9a757d565e468e2d52b6535e6ce84f08f51155c7
# Parent  b9cdb464de5fa6b5abe7aae4977ea80a85e39c8c
size reduction

diff -r b9cdb464de5f -r 9a757d565e46 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Mon May 31 17:57:45 2010 -0400
+++ b/writeup/nips2010_submission.tex	Mon May 31 20:42:22 2010 -0400
@@ -15,6 +15,7 @@
 %\makeanontitle
 \maketitle
 
+\vspace*{-2mm}
 \begin{abstract}
   Recent theoretical and empirical work in statistical machine learning has
   demonstrated the importance of learning algorithms for deep
@@ -36,8 +37,10 @@
   obtained by training with these highly distorted images or
   by including object classes different from those in the target test set.
 \end{abstract}
+\vspace*{-2mm}
 
 \section{Introduction}
+\vspace*{-1mm}
 
 Deep Learning has emerged as a promising new area of research in
 statistical machine learning (see~\citet{Bengio-2009} for a review).
@@ -45,20 +48,13 @@
 of useful representations of data, which are better suited to the task at hand.
 This is in great part inspired by observations of the mammalian visual cortex, 
 which consists of a chain of processing elements, each of which is associated with a
-different representation. In fact,
+different representation of the raw visual input. In fact,
 it was found recently that the features learnt in deep architectures resemble
 those observed in the first two of these stages (in areas V1 and V2
-of visual cortex)~\citep{HonglakL2008}.
-Processing images typically involves transforming the raw pixel data into
-new {\bf representations} that can be used for analysis or classification.
-For example, a principal component analysis representation linearly projects 
-the input image into a lower-dimensional feature space.
-Why learn a representation?  Current practice in the computer vision
-literature converts the raw pixels into a hand-crafted representation
-e.g.\ SIFT features~\citep{Lowe04}, but deep learning algorithms
-tend to discover similar features in their first few 
-levels~\citep{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}.
-Learning increases the
+of visual cortex)~\citep{HonglakL2008}, and that they become more and
+more invariant to factors of variation (such as camera movement) in
+higher layers~\citep{Goodfellow2009}.
+Learning a hierarchy of features increases the
 ease and practicality of developing representations that are at once
 tailored to specific tasks, yet are able to borrow statistical strength
 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
@@ -81,27 +77,49 @@
 the bottom is trained to encode their input (the output of the previous
 layer) and try to reconstruct it from a corrupted version of it. After this
 unsupervised initialization, the stack of denoising auto-encoders can be
-converted into a deep supervised feedforward neural network and trained by
+converted into a deep supervised feedforward neural network and fine-tuned by
 stochastic gradient descent.
 
+Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
+of semi-supervised and multi-task learning: the learner can exploit examples
+that are unlabeled and/or come from a distribution different from the target
+distribution, e.g., from other classes than those of interest. Whereas
+it has already been shown that deep learners can clearly take advantage of
+unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008}
+and multi-task learning, not much has been done yet to explore the impact
+of {\em out-of-distribution} examples and of the multi-task setting
+(but see~\citep{CollobertR2008-short}). In particular, the {\em relative
+advantage} of deep learning for these settings has not been evaluated.
+
 In this paper we ask the following questions:
-\begin{enumerate}
-\item Do the good results previously obtained with deep architectures on the
-MNIST digits generalize to the setting of a much larger and richer (but similar)
+
+%\begin{enumerate}
+$\bullet$ %\item 
+Do the good results previously obtained with deep architectures on the
+MNIST digit images generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
-\item To what extent does the perturbation of input images (e.g. adding
+
+$\bullet$ %\item 
+To what extent does the perturbation of input images (e.g. adding
 noise, affine transformations, background images) make the resulting
-classifier better not only on similarly perturbed images but also on
+classifiers better not only on similarly perturbed images but also on
 the {\em original clean examples}?
-\item Do deep architectures benefit more from such {\em out-of-distribution}
+
+$\bullet$ %\item 
+Do deep architectures {\em benefit more from such out-of-distribution}
 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
-\item Similarly, does the feature learning step in deep learning algorithms benefit more 
+
+$\bullet$ %\item 
+Similarly, does the feature learning step in deep learning algorithms benefit more 
 training with similar but different classes (i.e. a multi-task learning scenario) than
 a corresponding shallow and purely supervised architecture?
-\end{enumerate}
+%\end{enumerate}
+
 The experimental results presented here provide positive evidence towards all of these questions.
 
+\vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
+\vspace*{-1mm}
 
 This section describes the different transformations we used to stochastically
 transform source images in order to obtain data. More details can
@@ -115,7 +133,9 @@
 from slant to pinch below, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
 
-{\large\bf Transformations}\\
+{\large\bf Transformations}
+
+\vspace*{2mm}
 
 {\bf Slant.} 
 We mimic slant by shifting each row of the image
@@ -178,9 +198,13 @@
 d_1$, where $pinch$ is a parameter to the filter.
 The actual value is given by bilinear interpolation considering the pixels
 around the (non-integer) source position thus found.
-Here $pinch \sim U[-complexity, 0.7 \times complexity]$.\\
+Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
+
+\vspace*{1mm}
 
-{\large\bf Injecting Noise}\\
+{\large\bf Injecting Noise}
+
+\vspace*{1mm}
 
 {\bf Motion Blur.}
 This GIMP filter is a ``linear motion blur'' in GIMP
@@ -286,8 +310,9 @@
 \end{figure}
 
 
-
+\vspace*{-1mm}
 \section{Experimental Setup}
+\vspace*{-1mm}
 
 Whereas much previous work on deep learning algorithms had been performed on
 the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
@@ -299,10 +324,13 @@
 from fonts, or characters from captchas) and then optionally applying some of the
 above transformations and/or noise processes.
 
+\vspace*{-1mm}
 \subsection{Data Sources}
+\vspace*{-1mm}
 
-\begin{itemize}
-\item {\bf NIST}
+%\begin{itemize}
+%\item 
+{\bf NIST.}
 Our main source of characters is the NIST Special Database 19~\cite{Grother-1995}, 
 widely used for training and testing character
 recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}. 
@@ -322,16 +350,16 @@
 of letters in the test set, not in the training set (more like the natural distribution
 of letters in text).
 
-\item {\bf Fonts} 
+%\item 
+{\bf Fonts.} 
 In order to have a good variety of sources we downloaded an important number of free fonts from: {\tt http://anonymous.url.net}
 %real adress {\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
 in addition to Windows 7's, this adds up to a total of $9817$ different fonts that we can choose uniformly.
 The ttf file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, 
 directly as input to our models.
 
-
-
-\item {\bf Captchas}
+%\item 
+{\bf Captchas.}
 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a python based captcha generator library) for 
 generating characters of the same format as the NIST dataset. This software is based on
 a random character class generator and various kinds of tranformations similar to those described in the previous sections. 
@@ -339,37 +367,49 @@
 Transformations (slant, distorsions, rotation, translation) are applied to each randomly generated character with a complexity
 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are 
 allowed and can be controlled via an easy to use facade class.    
-\item {\bf OCR data}
+
+%\item 
+{\bf OCR data.}
 A large set (2 million) of scanned, OCRed and manually verified machine-printed 
 characters (from various documents and books) where included as an
 additional source. This set is part of a larger corpus being collected by the Image Understanding
 Pattern Recognition Research group lead by Thomas Breuel at University of Kaiserslautern 
 ({\tt http://www.iupr.com}), and which will be publically released.
-\end{itemize}
+%\end{itemize}
 
+\vspace*{-1mm}
 \subsection{Data Sets}
+\vspace*{-1mm}
+
 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
 from one of the 62 character classes.
-\begin{itemize}
-\item {\bf NIST}. This is the raw NIST special database 19.
-\item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources
+%\begin{itemize}
+
+%\item 
+{\bf NIST.} This is the raw NIST special database 19.
+
+%\item 
+{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
 and sending them through the above transformation pipeline.
 For each new exemple to generate, a source is selected with probability $10\%$ from the fonts,
 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
-\item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion)
+
+%\item 
+{\bf NISTP.} This data set is equivalent to P07 (complexity parameter of $0.7$, with the same proportions of sources)
   except that we only apply
   transformations from slant to pinch. Therefore, the character is
   transformed but no additionnal noise is added to the image, giving images
   closer to the NIST dataset.
-\end{itemize}
+%\end{itemize}
 
+\vspace*{-1mm}
 \subsection{Models and their Hyperparameters}
+\vspace*{-1mm}
 
 All hyper-parameters are selected based on performance on the NISTP validation set.
 
-\subsubsection{Multi-Layer Perceptrons (MLP)}
-
+{\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep architectures to both shallow MLPs and
 SVMs, we only compared to MLPs here because of the very large datasets used.
 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
@@ -380,10 +420,7 @@
 rate is chosen in $10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
 through preliminary experiments, and 0.1 was selected. 
 
-
-\subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
-\label{SdA}
-
+{\bf Stacked Denoising Auto-Encoders (SDAE).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many hidden 
 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006}
@@ -397,7 +434,6 @@
 taking advantage of the expressive power and bias implicit in the
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
-
 Here we chose to use the Denoising
 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
 these deep hierarchies of features, as it is very simple to train and
@@ -413,8 +449,7 @@
 After this unsupervised pre-training stage, the parameters
 are used to initialize a deep MLP, which is fine-tuned by
 the same standard procedure used to train them (see previous section).
-
-The hyper-parameters are the same as for the MLP, with the addition of the
+The SDA hyper-parameters are the same as for the MLP, with the addition of the
 amount of corruption noise (we used the masking noise process, whereby a
 fixed proportion of the input values, randomly selected, are zeroed), and a
 separate learning rate for the unsupervised pre-training stage (selected
@@ -423,9 +458,12 @@
 of hidden layers but it was fixed to 3 based on previous work with
 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
 
+\vspace*{-1mm}
 \section{Experimental Results}
 
+\vspace*{-1mm}
 \subsection{SDA vs MLP vs Humans}
+\vspace*{-1mm}
 
 We compare here the best MLP (according to validation set error) that we found against
 the best SDA (again according to validation set error), along with a precise estimate
@@ -436,10 +474,14 @@
 processing \citep{SnowEtAl2008} and vision
 \citep{SorokinAndForsyth2008,whitehill09}. AMT users where presented
 with 10 character images and asked to type 10 corresponding ascii
-characters. Hence they were forced to make a hard choice among the
-62 character classes. Three users classified each image, allowing
+characters. They were forced to make a hard choice among the
+62 or 10 character classes (all classes or digits only). 
+Three users classified each image, allowing
 to estimate inter-human variability (shown as +/- in parenthesis below).
 
+Figure~\ref{fig:error-rates-charts} summarizes the results obtained.
+More detailed results and tables can be found in the appendix.
+
 \begin{table}
 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training
@@ -476,7 +518,9 @@
 \label{fig:error-rates-charts}
 \end{figure}
 
+\vspace*{-1mm}
 \subsection{Perturbed Training Data More Helpful for SDAE}
+\vspace*{-1mm}
 
 \begin{table}
 \caption{Relative change in error rates due to the use of perturbed training data,
@@ -499,8 +543,9 @@
 \end{center}
 \end{table}
 
-
+\vspace*{-1mm}
 \subsection{Multi-Task Learning Effects}
+\vspace*{-1mm}
 
 As previously seen, the SDA is better able to benefit from the
 transformations applied to the data than the MLP. In this experiment we
@@ -554,19 +599,21 @@
 \label{fig:improvements-charts}
 \end{figure}
 
-A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 
-can be executed on-line at {\tt http://deep.host22.com}.
-
+\vspace*{-1mm}
 \section{Conclusions}
+\vspace*{-1mm}
 
 The conclusions are positive for all the questions asked in the introduction.
-\begin{itemize}
-\item Do the good results previously obtained with deep architectures on the
+%\begin{itemize}
+$\bullet$ %\item 
+Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
 performance.
-\item To what extent does the perturbation of input images (e.g. adding
+
+$\bullet$ %\item 
+To what extent does the perturbation of input images (e.g. adding
 noise, affine transformations, background images) make the resulting
 classifier better not only on similarly perturbed images but also on
 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
@@ -574,16 +621,24 @@
 MLPs were helped by perturbed training examples when tested on perturbed input images,
 but only marginally helped wrt clean examples. On the other hand, the deep SDAs
 were very significantly boosted by these out-of-distribution examples.
-\item Similarly, does the feature learning step in deep learning algorithms benefit more 
+
+$\bullet$ %\item 
+Similarly, does the feature learning step in deep learning algorithms benefit more 
 training with similar but different classes (i.e. a multi-task learning scenario) than
 a corresponding shallow and purely supervised architecture?
 Whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP, it was very significant for the SDA.
-\end{itemize}
+%\end{itemize}
 
+A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 
+can be executed on-line at {\tt http://deep.host22.com}.
+
+
+{\small
 \bibliography{strings,ml,aigaion,specials}
 %\bibliographystyle{plainnat}
 \bibliographystyle{unsrtnat}
 %\bibliographystyle{apalike}
+}
 
 \end{document}