ift6266 repository (Mercurial): comparison of writeup/nips2010_submission.tex @ 547:316c7bdad5ad
changeset description: charts
author   | Yoshua Bengio <bengioy@iro.umontreal.ca>
date     | Wed, 02 Jun 2010 13:09:27 -0400
parents  | 1cdfc17e890f
children | 34cb28249de0
546:cf68f5685406 (old) | 547:316c7bdad5ad (new)
71 It is also only recently that successful algorithms were proposed to | 71 It is also only recently that successful algorithms were proposed to |
72 overcome some of these difficulties. All are based on unsupervised | 72 overcome some of these difficulties. All are based on unsupervised |
73 learning, often in a greedy layer-wise ``unsupervised pre-training'' | 73 learning, often in a greedy layer-wise ``unsupervised pre-training'' |
74 stage~\citep{Bengio-2009}. One of these layer initialization techniques, | 74 stage~\citep{Bengio-2009}. One of these layer initialization techniques, |
75 applied here, is the Denoising | 75 applied here, is the Denoising |
76 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which | 76 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), which |
77 which | |
77 performed similarly or better than previously proposed Restricted Boltzmann | 78 performed similarly or better than previously proposed Restricted Boltzmann |
78 Machines in terms of unsupervised extraction of a hierarchy of features | 79 Machines in terms of unsupervised extraction of a hierarchy of features |
79 useful for classification. The principle is that each layer starting from | 80 useful for classification. The principle is that each layer starting from |
80 the bottom is trained to encode its input (the output of the previous | 81 the bottom is trained to encode its input (the output of the previous |
81 layer) and to reconstruct it from a corrupted version. After this | 82 layer) and to reconstruct it from a corrupted version. After this |
82 unsupervised initialization, the stack of denoising auto-encoders can be | 83 unsupervised initialization, the stack of DAs can be |
83 converted into a deep supervised feedforward neural network and fine-tuned by | 84 converted into a deep supervised feedforward neural network and fine-tuned by |
84 stochastic gradient descent. | 85 stochastic gradient descent. |
85 | 86 |
86 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
87 of semi-supervised and multi-task learning: the learner can exploit examples | 88 of semi-supervised and multi-task learning: the learner can exploit examples |
122 training with similar but different classes (i.e. a multi-task learning scenario) than | 123 training with similar but different classes (i.e. a multi-task learning scenario) than |
123 a corresponding shallow and purely supervised architecture? | 124 a corresponding shallow and purely supervised architecture? |
124 %\end{enumerate} | 125 %\end{enumerate} |
125 | 126 |
126 Our experimental results provide positive evidence towards all of these questions. | 127 Our experimental results provide positive evidence towards all of these questions. |
128 To achieve these results, we introduce in the next section a sophisticated system | |
129 for stochastically transforming character images. The conclusion discusses | |
130 the more general question of why deep learners may benefit so much from | |
131 the self-taught learning framework. | |
127 | 132 |
128 \vspace*{-1mm} | 133 \vspace*{-1mm} |
129 \section{Perturbation and Transformation of Character Images} | 134 \section{Perturbation and Transformation of Character Images} |
130 \label{s:perturbations} | 135 \label{s:perturbations} |
131 \vspace*{-1mm} | 136 \vspace*{-1mm} |
132 | 137 |
133 This section describes the different transformations we used to stochastically | 138 This section describes the different transformations we used to stochastically |
134 transform source images in order to obtain data. More details can | 139 transform source images in order to obtain data from a much richer distribution which |
140 covers a domain substantially larger than the clean characters distribution from | |
141 which we start. Although character transformations have been used before to | |
142 improve character recognizers, this effort is on a large scale both | |
143 in number of classes and in the complexity of the transformations, hence | |
144 in the complexity of the learning task. | |
145 More details can | |
135 be found in this technical report~\citep{ift6266-tr-anonymous}. | 146 be found in this technical report~\citep{ift6266-tr-anonymous}. |
136 The code for these transformations (mostly Python) is available at | 147 The code for these transformations (mostly Python) is available at |
137 {\tt http://anonymous.url.net}. All the modules in the pipeline share | 148 {\tt http://anonymous.url.net}. All the modules in the pipeline share |
138 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 149 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the |
139 amount of deformation or noise introduced. | 150 amount of deformation or noise introduced. |
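As an illustration of how such a shared control parameter can drive a whole pipeline, here is a minimal Python sketch. The two example modules, their noise levels and the function names are illustrative placeholders, not the actual transformation modules described in the rest of this section.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(1234)

def add_background_noise(image, complexity):
    # Gaussian background noise whose amplitude grows with complexity.
    noisy = image + complexity * 0.3 * rng.randn(*image.shape)
    return np.clip(noisy, 0.0, 1.0)

def occlude_pixels(image, complexity):
    # Randomly zero a fraction of pixels that grows with complexity.
    mask = rng.binomial(1, 1.0 - 0.5 * complexity, size=image.shape)
    return image * mask

PIPELINE = [add_background_noise, occlude_pixels]

def perturb_image(image, complexity=0.5):
    # 0 <= complexity <= 1 modulates the amount of deformation or noise.
    assert 0.0 <= complexity <= 1.0
    for module in PIPELINE:
        image = module(image, complexity)
    return image
\end{verbatim}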
332 to 1000 times larger. | 343 to 1000 times larger. |
333 | 344 |
334 The first step in constructing the larger datasets (called NISTP and P07) is to sample from | 345 The first step in constructing the larger datasets (called NISTP and P07) is to sample from |
335 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 346 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
336 and {\bf OCR data} (scanned machine printed characters). Once a character | 347 and {\bf OCR data} (scanned machine printed characters). Once a character |
337 is sampled from one of these sources (chosen randomly), the pipeline of | 348 is sampled from one of these sources (chosen randomly), the second step is to |
338 the transformations and/or noise processes described in section \ref{s:perturbations} | 349 apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}. |
339 is applied to the image. | 350 |
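Schematically, this two-step construction can be sketched as below; the source loaders are placeholder stubs returning random grey-level arrays, and the perturb argument stands for the transformation pipeline sketched in the previous section.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(4321)

# Placeholder loaders standing in for the real sources (NIST, Fonts,
# Captchas, OCR data); each returns one grey-level character image.
SOURCES = {
    "NIST":     lambda: rng.rand(32, 32),
    "Fonts":    lambda: rng.rand(32, 32),
    "Captchas": lambda: rng.rand(32, 32),
    "OCR":      lambda: rng.rand(32, 32),
}

def sample_perturbed_example(perturb, complexity):
    # Step 1: sample a character from a randomly chosen data source.
    source = rng.choice(list(SOURCES))
    image = SOURCES[source]()
    # Step 2: apply the transformation/noise pipeline to the image.
    return perturb(image, complexity)

# e.g. sample_perturbed_example(perturb_image, complexity=0.5)
\end{verbatim}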
340 | 351 To provide a baseline for error rate comparison, we also estimate human performance |
352 on both the 62-class task and the 10-class digits task. | |
341 We compare the best MLPs against | 353 We compare the best MLPs against |
342 the best SDAs (both models' hyper-parameters are selected to minimize the validation set error), | 354 the best SDAs (both models' hyper-parameters are selected to minimize the validation set error), |
343 along with a comparison against a precise estimate | 355 along with a comparison against a precise estimate |
344 of human performance obtained via Amazon's Mechanical Turk (AMT) | 356 of human performance obtained via Amazon's Mechanical Turk (AMT) |
345 service (http://mturk.com). | 357 service (http://mturk.com). |
458 Training examples are presented in minibatches of size 20. A constant learning | 470 Training examples are presented in minibatches of size 20. A constant learning |
459 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ | 471 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ |
460 through preliminary experiments (measuring performance on a validation set), | 472 through preliminary experiments (measuring performance on a validation set), |
461 and $0.1$ was then selected for optimizing on the whole training sets. | 473 and $0.1$ was then selected for optimizing on the whole training sets. |
462 | 474 |
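This kind of preliminary search amounts to a small grid search scored on the validation set. A minimal sketch, where train_and_validate is a hypothetical helper that trains an MLP with a given constant learning rate and returns its validation error:

\begin{verbatim}
CANDIDATE_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def select_learning_rate(train_and_validate, rates=CANDIDATE_RATES):
    # Train once per candidate rate and keep the one with the lowest
    # validation error; that rate is then reused on the full training set.
    errors = {lr: train_and_validate(lr) for lr in rates}
    return min(errors, key=errors.get)
\end{verbatim}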
463 \begin{figure}[ht] | |
464 \vspace*{-2mm} | |
465 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
466 \caption{Illustration of the computations and training criterion for the denoising | |
467 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
468 the layer (i.e. raw input or output of previous layer) | |
469 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
470 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
471 is compared to the uncorrupted input $x$ through the loss function | |
472 $L_H(x,z)$, whose expected value is approximately minimized during training | |
473 by tuning $\theta$ and $\theta'$.} | |
474 \label{fig:da} | |
475 \vspace*{-2mm} | |
476 \end{figure} | |
477 | 475 |
478 {\bf Stacked Denoising Auto-Encoders (SDA).} | 476 {\bf Stacked Denoising Auto-Encoders (SDA).} |
479 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 477 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
480 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 478 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
481 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, | 479 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, |
487 distribution $P(x)$ and the conditional distribution of interest | 485 distribution $P(x)$ and the conditional distribution of interest |
488 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 486 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
489 taking advantage of the expressive power and bias implicit in the | 487 taking advantage of the expressive power and bias implicit in the |
490 deep architecture (whereby complex concepts are expressed as | 488 deep architecture (whereby complex concepts are expressed as |
491 compositions of simpler ones through a deep hierarchy). | 489 compositions of simpler ones through a deep hierarchy). |
490 | |
491 \begin{figure}[ht] | |
492 \vspace*{-2mm} | |
493 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
494 \caption{Illustration of the computations and training criterion for the denoising | |
495 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
496 the layer (i.e. raw input or output of previous layer) | |
497 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
498 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
499 is compared to the uncorrupted input $x$ through the loss function | |
500 $L_H(x,z)$, whose expected value is approximately minimized during training | |
501 by tuning $\theta$ and $\theta'$.} | |
502 \label{fig:da} | |
503 \vspace*{-2mm} | |
504 \end{figure} | |
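For concreteness, the computation described in this caption can be written out in a few lines of numpy. Sigmoid encoder and decoder, zeroing corruption and a cross-entropy loss $L_H$ are assumptions consistent with the description in the text and in \citep{VincentPLarochelleH2008}; this sketch is illustrative, not the code used in the experiments.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def da_training_step(x, W, b, W_prime, b_prime, corruption=0.2, lr=0.1):
    # One stochastic update on a minibatch x (rows are examples, values
    # in [0, 1]).  Corrupt the input by zeroing a random fraction of it.
    mask = rng.binomial(1, 1.0 - corruption, size=x.shape)
    x_tilde = x * mask
    # Encoder f_theta and decoder g_theta_prime (both sigmoid layers).
    y = sigmoid(x_tilde.dot(W) + b)
    z = sigmoid(y.dot(W_prime) + b_prime)
    # Cross-entropy reconstruction loss L_H(x, z), measured against the
    # *uncorrupted* input x.
    loss = -np.mean(np.sum(x * np.log(z) + (1 - x) * np.log(1 - z), axis=1))
    # Manual backpropagation through the two sigmoid layers.
    d_a2 = (z - x) / x.shape[0]
    grad_W_prime = y.T.dot(d_a2)
    grad_b_prime = d_a2.sum(axis=0)
    d_a1 = d_a2.dot(W_prime.T) * y * (1.0 - y)
    grad_W = x_tilde.T.dot(d_a1)
    grad_b = d_a1.sum(axis=0)
    # Gradient step on theta = (W, b) and theta' = (W_prime, b_prime).
    W -= lr * grad_W
    b -= lr * grad_b
    W_prime -= lr * grad_W_prime
    b_prime -= lr * grad_b_prime
    return loss
\end{verbatim}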
492 | 505 |
493 Here we chose to use the Denoising | 506 Here we chose to use the Denoising |
494 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | 507 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for |
495 these deep hierarchies of features, as it is very simple to train and | 508 these deep hierarchies of features, as it is very simple to train and |
496 explain (see Figure~\ref{fig:da}, as well as | 509 explain (see Figure~\ref{fig:da}, as well as |
512 fixed proportion of the input values, randomly selected, are zeroed), and a | 525 fixed proportion of the input values, randomly selected, are zeroed), and a |
513 separate learning rate for the unsupervised pre-training stage (selected | 526 separate learning rate for the unsupervised pre-training stage (selected |
514 from the same set as above). The fraction of inputs corrupted was selected | 527 from the same set as above). The fraction of inputs corrupted was selected |
515 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 528 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
516 of hidden layers but it was fixed to 3 based on previous work with | 529 of hidden layers but it was fixed to 3 based on previous work with |
517 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | 530 SDAs on MNIST~\citep{VincentPLarochelleH2008}. |
518 | 531 |
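Putting the pieces together, the greedy layer-wise pre-training loop and the hyper-parameter grid just described can be sketched as follows, reusing the sigmoid and da_training_step functions from the sketch accompanying Figure~\ref{fig:da}. The layer widths, epoch count, minibatch size and initialization scheme here are illustrative placeholders rather than the settings used in the experiments.

\begin{verbatim}
import numpy as np

# Hyper-parameter grid explored for the SDA (the number of hidden layers
# is fixed to 3); each combination is scored by validation error
# (selection loop omitted).
GRID = {
    "corruption":  [0.10, 0.20, 0.50],
    "pretrain_lr": [0.001, 0.01, 0.025, 0.075, 0.1, 0.5],
}

def greedy_pretrain(data, layer_sizes=(1000, 1000, 1000),
                    corruption=0.2, lr=0.1, n_epochs=10, batch=20):
    # Pre-train a stack of denoising auto-encoders layer by layer and
    # return the encoder parameters (W, b) of each layer.
    rng = np.random.RandomState(0)
    encoders, inputs = [], data
    for n_hidden in layer_sizes:
        n_in = inputs.shape[1]
        W = rng.uniform(-0.1, 0.1, size=(n_in, n_hidden))
        b = np.zeros(n_hidden)
        W_prime = rng.uniform(-0.1, 0.1, size=(n_hidden, n_in))
        b_prime = np.zeros(n_in)
        for epoch in range(n_epochs):
            for start in range(0, len(inputs), batch):
                da_training_step(inputs[start:start + batch],
                                 W, b, W_prime, b_prime, corruption, lr)
        encoders.append((W, b))
        # The codes of this layer are the training input of the next one.
        inputs = sigmoid(inputs.dot(W) + b)
    return encoders
\end{verbatim}

After this loop, the encoders would be stacked under a supervised output layer and the whole network fine-tuned by stochastic gradient descent, as described in the text.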
519 \vspace*{-1mm} | 532 \vspace*{-1mm} |
520 | 533 |
521 \begin{figure}[ht] | 534 \begin{figure}[ht] |
522 \vspace*{-2mm} | 535 \vspace*{-2mm} |
523 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} | 536 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} |
524 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained | 537 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained |
525 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | 538 on NIST, 1 on NISTP, and 2 on P07. Left: overall results |
526 of all models, on 3 different test sets (NIST, NISTP, P07). | 539 of all models, on 3 different test sets (NIST, NISTP, P07). |
527 Right: error rates on NIST test digits only, along with the previous results from the | 540 Right: error rates on NIST test digits only, along with the previous results from the |
528 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 541 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
529 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | 542 respectively based on ART, nearest neighbors, MLPs, and SVMs.} |
578 differences with the MLP are statistically and qualitatively | 591 differences with the MLP are statistically and qualitatively |
579 significant. | 592 significant. |
580 The left side of the figure shows the improvement to the clean | 593 The left side of the figure shows the improvement to the clean |
581 NIST test set error brought by the use of out-of-distribution examples | 594 NIST test set error brought by the use of out-of-distribution examples |
582 (i.e. the perturbed examples from NISTP or P07). | 595 (i.e. the perturbed examples from NISTP or P07). |
583 Relative change is measured by taking | 596 Relative percent change is measured by taking |
584 (original model's error / perturbed-data model's error - 1). | 597 100\% $\times$ (original model's error / perturbed-data model's error - 1). |
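In symbols, writing $e_{\rm clean}$ for the error of the model trained on unperturbed NIST and $e_{\rm perturbed}$ for the error of the model trained on NISTP or P07 (notation introduced here only for readability), this is
\[
  \mbox{relative change} = 100\% \times \left( \frac{e_{\rm clean}}{e_{\rm perturbed}} - 1 \right).
\]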
585 The right side of | 598 The right side of |
586 Figure~\ref{fig:improvements-charts} shows the relative improvement | 599 Figure~\ref{fig:improvements-charts} shows the relative improvement |
587 brought by the use of a multi-task setting, in which the same model is | 600 brought by the use of a multi-task setting, in which the same model is |
588 trained for more classes than the target classes of interest (i.e. training | 601 trained for more classes than the target classes of interest (i.e. training |
589 with all 62 classes when the target classes are respectively the digits, | 602 with all 62 classes when the target classes are respectively the digits, |
590 lower-case, or upper-case characters). Again, whereas the gain from the | 603 lower-case, or upper-case characters). Again, whereas the gain from the |
591 multi-task setting is marginal or negative for the MLP, it is substantial | 604 multi-task setting is marginal or negative for the MLP, it is substantial |
592 for the SDA. Note that for these multi-task experiment, only the original | 605 for the SDA. Note that to simplify these multi-task experiments, only the original |
593 NIST dataset is used. For example, the MLP-digits bar shows the relative | 606 NIST dataset is used. For example, the MLP-digits bar shows the relative |
594 improvement in MLP error rate on the NIST digits test set (1 - single-task | 607 percent improvement in MLP error rate on the NIST digits test set, which |
608 is 100\% $\times$ (1 - single-task | |
595 model's error / multi-task model's error). The single-task model is | 609 model's error / multi-task model's error). The single-task model is |
596 trained with only 10 outputs (one per digit), seeing only digit examples, | 610 trained with only 10 outputs (one per digit), seeing only digit examples, |
597 whereas the multi-task model is trained with 62 outputs, with all 62 | 611 whereas the multi-task model is trained with 62 outputs, with all 62 |
598 character classes as examples. Hence the hidden units are shared across | 612 character classes as examples. Hence the hidden units are shared across |
599 all tasks. For the multi-task model, the digit error rate is measured by | 613 all tasks. For the multi-task model, the digit error rate is measured by |
645 supervised learner. More precisely, | 659 supervised learner. More precisely, |
646 the answers are positive for all the questions asked in the introduction. | 660 the answers are positive for all the questions asked in the introduction. |
647 %\begin{itemize} | 661 %\begin{itemize} |
648 | 662 |
649 $\bullet$ %\item | 663 $\bullet$ %\item |
650 Do the good results previously obtained with deep architectures on the | 664 {\bf Do the good results previously obtained with deep architectures on the |
651 MNIST digits generalize to the setting of a much larger and richer (but similar) | 665 MNIST digits generalize to the setting of a much larger and richer (but similar) |
652 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 666 dataset, the NIST special database 19, with 62 classes and around 800k examples}? |
653 Yes, the SDA {\bf systematically outperformed the MLP and all the previously | 667 Yes, the SDA {\bf systematically outperformed the MLP and all the previously |
654 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level | 668 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level |
655 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. | 669 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
656 | 670 |
657 $\bullet$ %\item | 671 $\bullet$ %\item |
658 To what extent do self-taught learning scenarios help deep learners, | 672 {\bf To what extent do self-taught learning scenarios help deep learners, |
659 and do they help them more than shallow supervised ones? | 673 and do they help them more than shallow supervised ones}? |
660 We found that distorted training examples not only made the resulting | 674 We found that distorted training examples not only made the resulting |
661 classifier better on similarly perturbed images but also on | 675 classifier better on similarly perturbed images but also on |
662 the {\em original clean examples}, and, more importantly and more novel, | 676 the {\em original clean examples}, and, more importantly and more novel, |
663 that deep architectures benefit more from such {\em out-of-distribution} | 677 that deep architectures benefit more from such {\em out-of-distribution} |
664 examples. MLPs were helped by perturbed training examples when tested on perturbed input | 678 examples. MLPs were helped by perturbed training examples when tested on perturbed input |
667 or even hurt (10\% relative loss on digits) | 681 or even hurt (10\% relative loss on digits) |
668 with respect to clean examples. On the other hand, the deep SDAs | 682 with respect to clean examples. On the other hand, the deep SDAs |
669 were very significantly boosted by these out-of-distribution examples. | 683 were very significantly boosted by these out-of-distribution examples. |
670 Similarly, whereas the improvement due to the multi-task setting was marginal or | 684 Similarly, whereas the improvement due to the multi-task setting was marginal or |
671 negative for the MLP (from +5.6\% to -3.6\% relative change), | 685 negative for the MLP (from +5.6\% to -3.6\% relative change), |
672 it was very significant for the SDA (from +13\% to +27\% relative change). | 686 it was very significant for the SDA (from +13\% to +27\% relative change), |
687 which may be explained by the arguments below. | |
673 %\end{itemize} | 688 %\end{itemize} |
674 | 689 |
675 In the original self-taught learning framework~\citep{RainaR2007}, the | 690 In the original self-taught learning framework~\citep{RainaR2007}, the |
676 out-of-sample examples were used as a source of unsupervised data, and | 691 out-of-sample examples were used as a source of unsupervised data, and |
677 experiments showed its positive effects in a \emph{limited labeled data} | 692 experiments showed its positive effects in a \emph{limited labeled data} |
680 learning diminishes as the number of labeled examples increases (essentially, | 695 learning diminishes as the number of labeled examples increases (essentially, |
681 a ``diminishing returns'' scenario occurs). We note instead that, for deep | 696 a ``diminishing returns'' scenario occurs). We note instead that, for deep |
682 architectures, our experiments show that such a positive effect is accomplished | 697 architectures, our experiments show that such a positive effect is accomplished |
683 even in a scenario with a \emph{very large number of labeled examples}. | 698 even in a scenario with a \emph{very large number of labeled examples}. |
684 | 699 |
685 Why would deep learners benefit more from the self-taught learning framework? | 700 {\bf Why would deep learners benefit more from the self-taught learning framework}? |
686 The key idea is that the lower layers of the predictor compute a hierarchy | 701 The key idea is that the lower layers of the predictor compute a hierarchy |
687 of features that can be shared across tasks or across variants of the | 702 of features that can be shared across tasks or across variants of the |
688 input distribution. Intermediate features that can be used in different | 703 input distribution. Intermediate features that can be used in different |
689 contexts can be estimated in a way that allows one to share statistical | 704 contexts can be estimated in a way that allows one to share statistical |
690 strength. Features extracted through many levels are more likely to | 705 strength. Features extracted through many levels are more likely to |