diff writeup/nips2010_submission.tex @ 547:316c7bdad5ad

charts
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 02 Jun 2010 13:09:27 -0400
parents 1cdfc17e890f
children 34cb28249de0
--- a/writeup/nips2010_submission.tex	Wed Jun 02 11:45:17 2010 -0400
+++ b/writeup/nips2010_submission.tex	Wed Jun 02 13:09:27 2010 -0400
@@ -73,13 +73,14 @@
 learning, often in a greedy layer-wise ``unsupervised pre-training''
 stage~\citep{Bengio-2009}.  One of these layer initialization techniques,
 applied here, is the Denoising
-Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which
+Auto-Encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), 
+which
 performed similarly or better than previously proposed Restricted Boltzmann
 Machines in terms of unsupervised extraction of a hierarchy of features
 useful for classification.  The principle is that each layer starting from
 the bottom is trained to encode its input (the output of the previous
 layer) and to reconstruct it from a corrupted version. After this
-unsupervised initialization, the stack of denoising auto-encoders can be
+unsupervised initialization, the stack of DAs can be
 converted into a deep supervised feedforward neural network and fine-tuned by
 stochastic gradient descent.
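+As a minimal sketch of this two-phase procedure, in illustrative Python only
+(the helper callables \texttt{train\_denoising\_layer} and
+\texttt{build\_classifier} are hypothetical stand-ins, not the actual training
+code used here):
+{\small\begin{verbatim}
+def pretrain_and_finetune(images, labels, n_layers,
+                          train_denoising_layer, build_classifier):
+    representation, stack = images, []
+    for _ in range(n_layers):
+        # Each layer learns to reconstruct a corrupted version of its
+        # input (the output of the previously trained layer).
+        dae = train_denoising_layer(representation)
+        stack.append(dae)
+        representation = dae.encode(representation)
+    # The stack of encoders plus a supervised output layer is then
+    # fine-tuned as one deep network by stochastic gradient descent.
+    classifier = build_classifier(stack)
+    classifier.finetune_sgd(images, labels)
+    return classifier
+\end{verbatim}}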
 
@@ -124,6 +125,10 @@
 %\end{enumerate}
 
 Our experimental results provide positive evidence towards all of these questions.
+To achieve these results, we introduce in the next section a sophisticated system
+for stochastically transforming character images. The conclusion discusses
+the more general question of why deep learners may benefit so much from 
+the self-taught learning framework.
 
 \vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
@@ -131,7 +136,13 @@
 \vspace*{-1mm}
 
 This section describes the different transformations we used to stochastically
-transform source images in order to obtain data. More details can
+transform source images in order to obtain data from a distribution that
+covers a domain substantially larger than that of the clean characters from
+which we start. Although character transformations have been used before to
+improve character recognizers, this effort is on a large scale, both
+in the number of classes and in the complexity of the transformations, and hence
+in the complexity of the learning task.
+More details can
 be found in this technical report~\citep{ift6266-tr-anonymous}.
 The code for these transformations (mostly Python) is available at 
 {\tt http://anonymous.url.net}. All the modules in the pipeline share
@@ -334,10 +345,11 @@
 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
 and {\bf OCR data} (scanned machine printed characters). Once a character
-is sampled from one of these sources (chosen randomly), the pipeline of
-the transformations and/or noise processes described in section \ref{s:perturbations}
-is applied to the image.
+is sampled from one of these sources (chosen randomly), the second step is to
+apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.
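+As an illustration only, the two steps can be sketched in Python as follows;
+the \texttt{sources} and \texttt{transformations} callables and the shared
+\texttt{complexity} parameter are hypothetical placeholders for the released
+modules, not their actual interface.
+{\small\begin{verbatim}
+import random
+
+def generate_example(sources, transformations, complexity):
+    # Step 1: sample a clean character from a randomly chosen data source
+    # (NIST, Fonts, Captchas or OCR data).
+    image = random.choice(sources)()
+    # Step 2: push it through the pipeline of stochastic transformations
+    # and/or noise processes.
+    for transform in transformations:
+        image = transform(image, complexity)
+    return image
+\end{verbatim}}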
 
+To provide a baseline for error-rate comparison, we also estimate human performance
+on both the 62-class task and the 10-class digits task.
 We compare the best MLPs against
 the best SDAs (both models' hyper-parameters are selected to minimize the validation set error), 
 along with a comparison against a precise estimate
@@ -460,20 +472,6 @@
 through preliminary experiments (measuring performance on a validation set),
 and $0.1$ was then selected for optimizing on the whole training sets.
 
-\begin{figure}[ht]
-\vspace*{-2mm}
-\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
-\caption{Illustration of the computations and training criterion for the denoising
-auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
-the layer (i.e. raw input or output of previous layer)
-is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
-The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
-is compared to the uncorrupted input $x$ through the loss function
-$L_H(x,z)$, whose expected value is approximately minimized during training
-by tuning $\theta$ and $\theta'$.}
-\label{fig:da}
-\vspace*{-2mm}
-\end{figure}
 
 {\bf Stacked Denoising Auto-Encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
@@ -490,6 +488,21 @@
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
 
+\begin{figure}[ht]
+\vspace*{-2mm}
+\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
+\caption{Illustration of the computations and training criterion for the denoising
+auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
+the layer (i.e. raw input or output of previous layer)
+is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
+The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
+is compared to the uncorrupted input $x$ through the loss function
+$L_H(x,z)$, whose expected value is approximately minimized during training
+by tuning $\theta$ and $\theta'$.}
+\label{fig:da}
+\vspace*{-2mm}
+\end{figure}
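+To make the notation of Figure~\ref{fig:da} concrete, the following is a
+minimal numpy sketch of this computation, assuming sigmoid units, tied weights
+($W'=W^T$) and a cross-entropy reconstruction loss (common choices for this
+model, not necessarily the exact configuration used in our experiments); the
+gradient step on $\theta$ and $\theta'$ is omitted.
+{\small\begin{verbatim}
+import numpy as np
+
+def sigmoid(a):
+    return 1.0 / (1.0 + np.exp(-a))
+
+def denoising_autoencoder_loss(x, W, b, b_prime, corruption, rng):
+    # Corrupt the input: set a random fraction of the components to 0.
+    x_tilde = x * rng.binomial(1, 1.0 - corruption, size=x.shape)
+    # Encoder f_theta: code y computed from the corrupted input.
+    y = sigmoid(np.dot(x_tilde, W) + b)
+    # Decoder g_theta' (tied weights): reconstruction z of the clean input.
+    z = sigmoid(np.dot(y, W.T) + b_prime)
+    # Cross-entropy loss L_H(x, z), averaged over the examples in x.
+    eps = 1e-12
+    return -np.mean(np.sum(x * np.log(z + eps)
+                           + (1 - x) * np.log(1 - z + eps), axis=1))
+\end{verbatim}}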
+
 Here we chose to use the Denoising
 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
 these deep hierarchies of features, as it is very simple to train and
@@ -514,14 +527,14 @@
 from the same above set). The fraction of inputs corrupted was selected
 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
 of hidden layers but it was fixed to 3 based on previous work with
-stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
+SDAs on MNIST~\citep{VincentPLarochelleH2008}.
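+As an illustration, the explored slice of this hyper-parameter space can be
+written as the following (hypothetical) search grid; the learning rates and
+layer sizes mentioned above are omitted.
+{\small\begin{verbatim}
+# Illustrative hyper-parameter grid for the SDA (not the actual search code).
+sda_grid = {
+    "corruption_fraction": [0.10, 0.20, 0.50],  # fraction of corrupted inputs
+    "n_hidden_layers":     [3],                 # fixed, based on prior work
+}
+\end{verbatim}}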
 
 \vspace*{-1mm}
 
 \begin{figure}[ht]
 \vspace*{-2mm}
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
-\caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
+\caption{The SDAx bars (SDA0, SDA1, SDA2) are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
 of all models, on 3 different test sets (NIST, NISTP, P07).
 Right: error rates on NIST test digits only, along with the previous results from 
@@ -580,8 +593,8 @@
 The left side of the figure shows the improvement to the clean
 NIST test set error brought by the use of out-of-distribution examples
 (i.e. the perturbed examples from NISTP or P07). 
-Relative change is measured by taking
-(original model's error / perturbed-data model's error - 1).
+Relative percent change is measured by taking
+100\% $\times$ (original model's error / perturbed-data model's error - 1).
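+For instance, with a (hypothetical) original-model error of 2\% and a
+perturbed-data model error of 1.5\%, this gives
+$100\% \times (2/1.5 - 1) \approx +33\%$.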
 The right side of
 Figure~\ref{fig:improvements-charts} shows the relative improvement
 brought by the use of a multi-task setting, in which the same model is
@@ -589,9 +602,10 @@
 with all 62 classes when the target classes are respectively the digits,
 lower-case, or upper-case characters). Again, whereas the gain from the
 multi-task setting is marginal or negative for the MLP, it is substantial
-for the SDA.  Note that for these multi-task experiment, only the original
+for the SDA.  Note that to simplify these multi-task experiments, only the original
 NIST dataset is used. For example, the MLP-digits bar shows the relative
-improvement in MLP error rate on the NIST digits test set (1 - single-task
+percent improvement in MLP error rate on the NIST digits test set,
+computed as 100\% $\times$ (1 - single-task
 model's error / multi-task model's error).  The single-task model is
 trained with only 10 outputs (one per digit), seeing only digit examples,
 whereas the multi-task model is trained with 62 outputs, with all 62
@@ -647,16 +661,16 @@
 %\begin{itemize}
 
 $\bullet$ %\item 
-Do the good results previously obtained with deep architectures on the
+{\bf Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar)
-dataset, the NIST special database 19, with 62 classes and around 800k examples?
+dataset, the NIST special database 19, with 62 classes and around 800k examples}?
 Yes, the SDA {\bf systematically outperformed the MLP and all the previously
 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
 performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item 
-To what extent do self-taught learning scenarios help deep learners,
-and do they help them more than shallow supervised ones?
+{\bf To what extent do self-taught learning scenarios help deep learners,
+and do they help them more than shallow supervised ones}?
 We found that distorted training examples made the resulting classifier
 better not only on similarly perturbed images but also on
 the {\em original clean examples}, and, more importantly and more novel,
@@ -669,7 +683,8 @@
 were very significantly boosted by these out-of-distribution examples.
 Similarly, whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP (from +5.6\% to -3.6\% relative change), 
-it was very significant for the SDA (from +13\% to +27\% relative change).
+it was very significant for the SDA (from +13\% to +27\% relative change),
+which may be explained by the arguments below.
 %\end{itemize}
 
 In the original self-taught learning framework~\citep{RainaR2007}, the
@@ -682,7 +697,7 @@
 architectures, our experiments show that such a positive effect is accomplished
 even in a scenario with a \emph{very large number of labeled examples}.
 
-Why would deep learners benefit more from the self-taught learning framework?
+{\bf Why would deep learners benefit more from the self-taught learning framework}?
 The key idea is that the lower layers of the predictor compute a hierarchy
 of features that can be shared across tasks or across variants of the
 input distribution. Intermediate features that can be used in different