changeset 569:9d01280ff1c1

Comments from Joseph Turian
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Thu, 03 Jun 2010 19:05:08 -0400
parents ae6ba0309bf9
children f4b95749ffba
files writeup/nips2010_submission.tex
diffstat 1 files changed, 86 insertions(+), 47 deletions(-)
--- a/writeup/nips2010_submission.tex	Thu Jun 03 13:19:16 2010 -0400
+++ b/writeup/nips2010_submission.tex	Thu Jun 03 19:05:08 2010 -0400
@@ -20,18 +20,37 @@
 
 \vspace*{-2mm}
 \begin{abstract}
-Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples} and show that {\em deep learners benefit more from them than a corresponding shallow learner}, in the area of handwritten character recognition. In fact, we show that they reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.  For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set.
+  Recent theoretical and empirical work in statistical machine learning has
+  demonstrated the importance of learning algorithms for deep
+  architectures, i.e., function classes obtained by composing multiple
+  non-linear transformations. Self-taught learning (exploiting unlabeled
+  examples or examples from other distributions) has already been applied
+  to deep learners, but mostly to show the advantage of unlabeled
+  examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
+For this purpose we
+  developed a powerful generator of stochastic variations and noise
+  processes for character images, including not only affine transformations
+  but also slant, local elastic deformations, changes in thickness,
+  background images, grey level changes, contrast, occlusion, and various
+  types of noise. The out-of-distribution examples are obtained from these
+  highly distorted images or by including examples of object classes
+  different from those in the target test set.
+  We show that {\em deep learners benefit
+    more from such out-of-distribution examples than a corresponding shallow learner}, at least in the area of
+  handwritten character recognition. In fact, we show that they reach
+  human-level performance on both handwritten digit classification and
+  62-class handwritten character recognition.  
 \end{abstract}
 \vspace*{-3mm}
 
 \section{Introduction}
 \vspace*{-1mm}
 
-Deep Learning has emerged as a promising new area of research in
+{\bf Deep Learning} has emerged as a promising new area of research in
 statistical machine learning (see~\citet{Bengio-2009} for a review).
 Learning algorithms for deep architectures are centered on the learning
 of useful representations of data, which are better suited to the task at hand.
-This is in great part inspired by observations of the mammalian visual cortex, 
+This is in part inspired by observations of the mammalian visual cortex, 
 which consists of a chain of processing elements, each of which is associated with a
 different representation of the raw visual input. In fact,
 it was found recently that the features learnt in deep architectures resemble
@@ -47,6 +66,22 @@
 general) features that are more robust to unanticipated sources of
 variance extant in real data.
 
+{\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles
+of semi-supervised and multi-task learning: the learner can exploit examples
+that are unlabeled and possibly come from a distribution different from the target
+distribution, e.g., from other classes than those of interest. 
+It has already been shown that deep learners can clearly take advantage of
+unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
+but more needs to be done to explore the impact
+of {\em out-of-distribution} examples and of the multi-task setting
+(one exception is~\citet{CollobertR2008}, which uses a different kind
+of learning algorithm). In particular the {\em relative
+advantage} of deep learning for these settings has not been evaluated.
+The hypothesis discussed in the conclusion is that a deep hierarchy of features
+may be better able to provide sharing of statistical strength
+between different regions in input space or different tasks.
+
+\iffalse
 Whereas a deep architecture can in principle be more powerful than a
 shallow one in terms of representation, depth appears to render the
 training problem more difficult in terms of optimization and local minima.
@@ -59,27 +94,16 @@
 which
 performed similarly or better than previously proposed Restricted Boltzmann
 Machines in terms of unsupervised extraction of a hierarchy of features
-useful for classification.  The principle is that each layer starting from
-the bottom is trained to encode its input (the output of the previous
-layer) and to reconstruct it from a corrupted version. After this
-unsupervised initialization, the stack of DAs can be
-converted into a deep supervised feedforward neural network and fine-tuned by
-stochastic gradient descent.
+useful for classification. Each layer is trained to denoise its
+input, creating a layer of features that can be used as input for the next layer.  
+\fi
+%The principle is that each layer starting from
+%the bottom is trained to encode its input (the output of the previous
+%layer) and to reconstruct it from a corrupted version. After this
+%unsupervised initialization, the stack of DAs can be
+%converted into a deep supervised feedforward neural network and fine-tuned by
+%stochastic gradient descent.
 
-Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
-of semi-supervised and multi-task learning: the learner can exploit examples
-that are unlabeled and possibly come from a distribution different from the target
-distribution, e.g., from other classes than those of interest. 
-It has already been shown that deep learners can clearly take advantage of
-unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
-but more needs to be done to explore the impact
-of {\em out-of-distribution} examples and of the multi-task setting
-(one exception is~\citep{CollobertR2008}, which uses very different kinds
-of learning algorithms). In particular the {\em relative
-advantage} of deep learning for these settings has not been evaluated.
-The hypothesis discussed in the conclusion is that a deep hierarchy of features
-may be better able to provide sharing of statistical strength
-between different regions in input space or different tasks.
 %
 In this paper we ask the following questions:
 
@@ -93,29 +117,42 @@
 To what extent does the perturbation of input images (e.g. adding
 noise, affine transformations, background images) make the resulting
 classifiers better not only on similarly perturbed images but also on
-the {\em original clean examples}?
+the {\em original clean examples}? We study this question in the
+context of the 62-class and 10-class tasks of the NIST special database 19.
 
 $\bullet$ %\item 
 Do deep architectures {\em benefit more from such out-of-distribution}
 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
+We use highly perturbed versions of the images as out-of-distribution examples.
 
 $\bullet$ %\item 
 Similarly, does the feature learning step in deep learning algorithms benefit more 
 from training with moderately different classes (i.e. a multi-task learning scenario) than
 a corresponding shallow and purely supervised architecture?
+We train on 62 classes and test on 10 (digits) or 26 (upper-case or lower-case letters)
+to answer this question.
 %\end{enumerate}
 
 Our experimental results provide positive evidence towards all of these questions.
 To achieve these results, we introduce in the next section a sophisticated system
-for stochastically transforming character images and then explain the methodology. 
+for stochastically transforming character images and then explain the methodology,
+which is based on training with or without these transformed images and testing on 
+clean ones. We measure the relative advantage of out-of-distribution examples
+for a deep learner vs a supervised shallow one.
+Code for generating these transformations as well as for the deep learning 
+algorithms is made available.
+We also estimate the relative advantage for deep learners of training with
+other classes than those of interest, by comparing learners trained on all
+62 classes to learners trained on only the subset of classes on which they
+are then tested.
 The conclusion discusses
 the more general question of why deep learners may benefit so much from 
 the self-taught learning framework.
 
-\vspace*{-1mm}
+\vspace*{-3mm}
 \section{Perturbation and Transformation of Character Images}
 \label{s:perturbations}
-\vspace*{-1mm}
+\vspace*{-2mm}
 
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
 %\begin{minipage}[b]{0.14\linewidth}
@@ -193,7 +230,7 @@
 proportionally to its height: $shift = round(slant \times height)$.  
 $slant \sim U[-complexity,complexity]$.
 The shift is randomly chosen to be either to the left or to the right.
-\vspace{1.1cm}
+\vspace{1cm}
 \end{minipage}
 %\vspace*{-4mm}
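For readers who want to see the slant module in code, here is a minimal Python sketch of the per-row shift described above; the function name, the interpretation of "height" as the row index, and the zero-filling at the borders are assumptions made for illustration, not the authors' generator code.

import numpy as np

def slant_image(image, complexity, rng=np.random):
    # Hedged sketch: each row is shifted horizontally by
    # shift = round(slant * row_position), slant ~ U[-complexity, complexity].
    # Treating "height" as the row index is an assumption.
    height, width = image.shape
    s = rng.uniform(-complexity, complexity)
    out = np.zeros_like(image)
    for y in range(height):
        shift = int(np.clip(round(s * y), -width, width))
        if shift >= 0:
            out[y, shift:] = image[y, :width - shift]
        else:
            out[y, :width + shift] = image[y, -shift:]
    return out

The sign of the sampled slant determines whether rows move to the left or to the right, matching the random left/right choice mentioned above.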
 
@@ -203,7 +240,7 @@
 \vspace*{-6mm}
 \begin{center}
 \includegraphics[scale=.4]{images/Affine_only.png}\\
-{\bf Affine Transformation}
+{\small {\bf Affine \mbox{Transformation}}}
 \end{center}
 \end{wrapfigure}
 %\end{minipage}%
@@ -230,13 +267,13 @@
 \begin{center}
 \vspace*{-4mm}
 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\
-{\bf Local Elastic}
+{\bf Local Elastic Deformation}
 \end{center}
 \end{wrapfigure}
 %\end{minipage}%
 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth}
 %\vspace*{-20mm}
-The {\bf local elastic} deformation 
+The {\bf local elastic deformation}
 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
 which provides more details. 
 The intensity of the displacement fields is given by 
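The local elastic deformation module follows Simard et al. (2003); a minimal Python sketch of that recipe is shown below. The helper name and the (alpha, sigma) parameterization are assumptions made for illustration; the paper's own mapping from the complexity knob to the displacement-field intensity is not reproduced here.

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha, sigma, rng=np.random):
    # Hedged sketch of the Simard et al. (2003) elastic distortion:
    # uniform random displacement fields, smoothed by a Gaussian of width
    # sigma and scaled by alpha, displace the sampling grid of the image.
    h, w = image.shape
    dx = alpha * gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    dy = alpha * gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # bilinear interpolation at the displaced coordinates
    return map_coordinates(image, [yy + dy, xx + dx], order=1, mode='reflect')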
@@ -246,7 +283,7 @@
 %\vspace{.9cm}
 \end{minipage}
 
-\vspace*{5mm}
+\vspace*{7mm}
 
 %\begin{minipage}[b]{0.14\linewidth}
 %\centering
@@ -277,7 +314,7 @@
 %\vspace{1.5cm}
 %\end{minipage}
 
-\vspace{2mm}
+\vspace{1mm}
 
 {\large\bf 2.2 Injecting Noise}
 %\subsection{Injecting Noise}
@@ -523,8 +560,8 @@
 %\citep{SorokinAndForsyth2008,whitehill09}. 
 AMT users were presented
 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
-characters. They were forced to make a hard choice among the
-62 or 10 character classes (all classes or digits only). 
+characters. For each image they were forced to choose a single class among
+either the 62 or the 10 character classes (depending on the task).
 80 subjects classified 2500 images per (dataset,task) pair,
 with the guarantee that 3 different subjects classified each image, allowing
 us to estimate inter-human variability (e.g a standard error of 0.1\%
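One plausible way to arrive at a standard-error figure of the kind quoted above is the usual binomial approximation sketched below; that the authors used exactly this formula, and the sample size in the usage comment, are assumptions.

import math

def error_rate_standard_error(p, n):
    # Binomial approximation: standard error of an error rate p estimated
    # from n independent labelings (an assumption about the methodology).
    return math.sqrt(p * (1.0 - p) / n)

# Illustrative only (hypothetical n): an error rate of about 1.4% estimated
# from 10,000 labelings gives error_rate_standard_error(0.014, 10000) ~ 0.0012,
# i.e. roughly 0.1%.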
@@ -637,10 +674,10 @@
 exponentials) on the output layer for estimating $P(class | image)$.
 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 
 Training examples are presented in minibatches of size 20. A constant learning
-rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
-through preliminary experiments (measuring performance on a validation set),
-and $0.1$ (which was found to work best) was then selected for optimizing on
-the whole training sets.
+rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
+%through preliminary experiments (measuring performance on a validation set),
+%and $0.1$ (which was found to work best) was then selected for optimizing on
+%the whole training sets.
 \vspace*{-1mm}
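As a reading aid, a minimal Python sketch of such a single-hidden-layer MLP baseline follows: softmax outputs estimating P(class | image), trained by minibatch SGD with a constant learning rate. The tanh hidden units, the weight initialization and the class name are assumptions for illustration, not details fixed by the text above.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class ShallowMLP:
    # One hidden layer (tanh assumed) and a softmax output layer.
    def __init__(self, n_in, n_hidden, n_out, rng=np.random):
        self.W1 = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.1, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        h = np.tanh(X @ self.W1 + self.b1)
        return h, softmax(h @ self.W2 + self.b2)

    def sgd_step(self, X, y, lr):
        # one step on a minibatch (e.g. 20 examples), constant learning rate lr
        h, p = self.forward(X)
        d2 = p.copy()
        d2[np.arange(len(y)), y] -= 1.0         # gradient of NLL w.r.t. output logits
        d2 /= len(y)
        d1 = (d2 @ self.W2.T) * (1.0 - h ** 2)  # backprop through tanh
        self.W2 -= lr * (h.T @ d2); self.b2 -= lr * d2.sum(0)
        self.W1 -= lr * (X.T @ d1); self.b1 -= lr * d1.sum(0)

A grid over n_hidden in {300, 500, 800, 1000, 1500} and lr in {0.001, 0.01, 0.025, 0.075, 0.1, 0.5} matches the choices described above.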
 
 
@@ -666,7 +703,7 @@
 \caption{Illustration of the computations and training criterion for the denoising
 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
 the layer (i.e. raw input or output of previous layer)
-is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
+is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
 is compared to the uncorrupted input $x$ through the loss function
 $L_H(x,z)$, whose expected value is approximately minimized during training
@@ -677,7 +714,7 @@
 
 Here we chose to use the Denoising
 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
-these deep hierarchies of features, as it is very simple to train and
+these deep hierarchies of features, as it is simple to train and
 explain (see Figure~\ref{fig:da}, as well as 
 tutorial and code there: {\tt http://deeplearning.net/tutorial}), 
 provides efficient inference, and yielded results
@@ -686,7 +723,9 @@
 Auto-encoder is presented with a stochastically corrupted version
 of the input and trained to reconstruct the uncorrupted input,
 forcing the hidden units to represent the leading regularities in
-the data. Once it is trained, in a purely unsupervised way, 
+the data. Here we use a random binary masking corruption
+(which sets a random subset of the inputs to 0).
+Once it is trained, in a purely unsupervised way, 
 its hidden units' activations can
 be used as inputs for training a second one, etc.
 After this unsupervised pre-training stage, the parameters
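To make the denoising auto-encoder recipe concrete, here is a minimal Python sketch of one layer with the random binary masking corruption, plus a note on stacking. Tied weights, sigmoid units and a cross-entropy form for $L_H$ are assumptions consistent with the cited tutorial, not details fixed by this text.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoder:
    # x is corrupted into x_tilde (random binary masking), encoded into
    # y = f_theta(x_tilde), decoded into z = g_theta'(y), and trained so
    # that z reconstructs the clean x (inputs assumed to lie in [0, 1]).
    def __init__(self, n_visible, n_hidden, corruption, rng=np.random):
        self.rng, self.corruption = rng, corruption
        self.W = rng.uniform(-0.1, 0.1, (n_visible, n_hidden))  # tied weights assumed
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_visible)  # decoder bias

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)       # y = f_theta(x)

    def train_step(self, x, lr):
        mask = self.rng.binomial(1, 1.0 - self.corruption, x.shape)
        x_tilde = x * mask                        # set a random subset of inputs to 0
        y = self.encode(x_tilde)
        z = sigmoid(y @ self.W.T + self.c)        # reconstruction z = g_theta'(y)
        dz = (z - x) / len(x)                     # grad of cross-entropy w.r.t. z's pre-activation
        dy = (dz @ self.W) * y * (1.0 - y)
        self.W -= lr * (x_tilde.T @ dy + dz.T @ y)
        self.b -= lr * dy.sum(0)
        self.c -= lr * dz.sum(0)

# Stacking: once one layer is trained (purely unsupervised), its hidden
# activations encode(x) become the training input of the next layer, and the
# resulting encoders initialize the deep network that is then fine-tuned.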
@@ -842,8 +881,8 @@
 {\bf Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
-Yes, the SDA {\bf systematically outperformed the MLP and all the previously
-published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
+Yes, the SDA {\em systematically outperformed the MLP and all the previously
+published results on this dataset} (those we are aware of), {\em in fact reaching human-level
 performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item 
@@ -858,10 +897,10 @@
 but only marginally helped (5\% relative improvement on all classes) 
 or even hurt (10\% relative loss on digits)
 with respect to clean examples . On the other hand, the deep SDAs
-were very significantly boosted by these out-of-distribution examples.
+were significantly boosted by these out-of-distribution examples.
 Similarly, whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP (from +5.6\% to -3.6\% relative change), 
-it was very significant for the SDA (from +13\% to +27\% relative change),
+it was quite significant for the SDA (from +13\% to +27\% relative change),
 which may be explained by the arguments below.
 %\end{itemize}
 
@@ -873,7 +912,7 @@
 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases.
 We note instead that, for deep
 architectures, our experiments show that such a positive effect is accomplished
-even in a scenario with a \emph{very large number of labeled examples},
+even in a scenario with a \emph{large number of labeled examples},
 i.e., here, the relative gain of self-taught learning is probably preserved
 in the asymptotic regime.