comparison writeup/nips2010_submission.tex @ 569:9d01280ff1c1
comments from Joseph Turian
author | Yoshua Bengio <bengioy@iro.umontreal.ca>
date | Thu, 03 Jun 2010 19:05:08 -0400
parents | ae6ba0309bf9
children | df749e70f637
568:ae6ba0309bf9 | 569:9d01280ff1c1
18 %\makeanontitle | 18 %\makeanontitle |
19 \maketitle | 19 \maketitle |
20 | 20 |
21 \vspace*{-2mm} | 21 \vspace*{-2mm} |
22 \begin{abstract} | 22 \begin{abstract} |
23 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples} and show that {\em deep learners benefit more from them than a corresponding shallow learner}, in the area of handwritten character recognition. In fact, we show that they reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. | 23 Recent theoretical and empirical work in statistical machine learning has |
24 demonstrated the importance of learning algorithms for deep | |
25 architectures, i.e., function classes obtained by composing multiple | |
26 non-linear transformations. Self-taught learning (exploiting unlabeled | |
27 examples or examples from other distributions) has already been applied | |
28 to deep learners, but mostly to show the advantage of unlabeled | |
29 examples. Here we explore the advantage brought by {\em out-of-distribution examples}. | |
30 For this purpose we | |
31 developed a powerful generator of stochastic variations and noise | |
32 processes for character images, including not only affine transformations | |
33 but also slant, local elastic deformations, changes in thickness, | |
34 background images, grey level changes, contrast, occlusion, and various | |
35 types of noise. The out-of-distribution examples are obtained from these | |
36 highly distorted images or by including examples of object classes | |
37 different from those in the target test set. | |
38 We show that {\em deep learners benefit | |
39 more from them than a corresponding shallow learner}, at least in the area of | |
40 handwritten character recognition. In fact, we show that they reach | |
41 human-level performance on both handwritten digit classification and | |
42 62-class handwritten character recognition. | |
24 \end{abstract} | 43 \end{abstract} |
25 \vspace*{-3mm} | 44 \vspace*{-3mm} |
26 | 45 |
27 \section{Introduction} | 46 \section{Introduction} |
28 \vspace*{-1mm} | 47 \vspace*{-1mm} |
29 | 48 |
30 Deep Learning has emerged as a promising new area of research in | 49 {\bf Deep Learning} has emerged as a promising new area of research in |
31 statistical machine learning (see~\citet{Bengio-2009} for a review). | 50 statistical machine learning (see~\citet{Bengio-2009} for a review). |
32 Learning algorithms for deep architectures are centered on the learning | 51 Learning algorithms for deep architectures are centered on the learning |
33 of useful representations of data, which are better suited to the task at hand. | 52 of useful representations of data, which are better suited to the task at hand. |
34 This is in great part inspired by observations of the mammalian visual cortex, | 53 This is in part inspired by observations of the mammalian visual cortex, |
35 which consists of a chain of processing elements, each of which is associated with a | 54 which consists of a chain of processing elements, each of which is associated with a |
36 different representation of the raw visual input. In fact, | 55 different representation of the raw visual input. In fact, |
37 it was found recently that the features learnt in deep architectures resemble | 56 it was found recently that the features learnt in deep architectures resemble |
38 those observed in the first two of these stages (in areas V1 and V2 | 57 those observed in the first two of these stages (in areas V1 and V2 |
39 of visual cortex)~\citep{HonglakL2008}, and that they become more and | 58 of visual cortex)~\citep{HonglakL2008}, and that they become more and |
45 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the | 64 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the |
46 feature representation can lead to higher-level (more abstract, more | 65 feature representation can lead to higher-level (more abstract, more |
47 general) features that are more robust to unanticipated sources of | 66 general) features that are more robust to unanticipated sources of |
48 variance extant in real data. | 67 variance extant in real data. |
49 | 68 |
69 {\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles | |
70 of semi-supervised and multi-task learning: the learner can exploit examples | |
71 that are unlabeled and possibly come from a distribution different from the target | |
72 distribution, e.g., from other classes than those of interest. | |
73 It has already been shown that deep learners can clearly take advantage of | |
74 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | |
75 but more needs to be done to explore the impact | |
76 of {\em out-of-distribution} examples and of the multi-task setting | |
77 (one exception is~\citep{CollobertR2008}, which uses a different kind | |
78 of learning algorithm). In particular the {\em relative | |
79 advantage} of deep learning for these settings has not been evaluated. | |
80 The hypothesis discussed in the conclusion is that a deep hierarchy of features | |
81 may be better able to provide sharing of statistical strength | |
82 between different regions in input space or different tasks. | |
83 | |
84 \iffalse | |
50 Whereas a deep architecture can in principle be more powerful than a | 85 Whereas a deep architecture can in principle be more powerful than a |
51 shallow one in terms of representation, depth appears to render the | 86 shallow one in terms of representation, depth appears to render the |
52 training problem more difficult in terms of optimization and local minima. | 87 training problem more difficult in terms of optimization and local minima. |
53 It is also only recently that successful algorithms were proposed to | 88 It is also only recently that successful algorithms were proposed to |
54 overcome some of these difficulties. All are based on unsupervised | 89 overcome some of these difficulties. All are based on unsupervised |
57 applied here, is the Denoising | 92 applied here, is the Denoising |
58 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), | 93 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), |
59 which | 94 which |
60 performed similarly or better than previously proposed Restricted Boltzmann | 95 performed similarly or better than previously proposed Restricted Boltzmann |
61 Machines in terms of unsupervised extraction of a hierarchy of features | 96 Machines in terms of unsupervised extraction of a hierarchy of features |
62 useful for classification. The principle is that each layer starting from | 97 useful for classification. Each layer is trained to denoise its |
63 the bottom is trained to encode its input (the output of the previous | 98 input, creating a layer of features that can be used as input for the next layer. |
64 layer) and to reconstruct it from a corrupted version. After this | 99 \fi |
65 unsupervised initialization, the stack of DAs can be | 100 %The principle is that each layer starting from |
66 converted into a deep supervised feedforward neural network and fine-tuned by | 101 %the bottom is trained to encode its input (the output of the previous |
67 stochastic gradient descent. | 102 %layer) and to reconstruct it from a corrupted version. After this |
68 | 103 %unsupervised initialization, the stack of DAs can be |
69 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 104 %converted into a deep supervised feedforward neural network and fine-tuned by |
70 of semi-supervised and multi-task learning: the learner can exploit examples | 105 %stochastic gradient descent. |
71 that are unlabeled and possibly come from a distribution different from the target | 106 |
72 distribution, e.g., from other classes than those of interest. | |
73 It has already been shown that deep learners can clearly take advantage of | |
74 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | |
75 but more needs to be done to explore the impact | |
76 of {\em out-of-distribution} examples and of the multi-task setting | |
77 (one exception is~\citep{CollobertR2008}, which uses very different kinds | |
78 of learning algorithms). In particular the {\em relative | |
79 advantage} of deep learning for these settings has not been evaluated. | |
80 The hypothesis discussed in the conclusion is that a deep hierarchy of features | |
81 may be better able to provide sharing of statistical strength | |
82 between different regions in input space or different tasks. | |
83 % | 107 % |
84 In this paper we ask the following questions: | 108 In this paper we ask the following questions: |
85 | 109 |
86 %\begin{enumerate} | 110 %\begin{enumerate} |
87 $\bullet$ %\item | 111 $\bullet$ %\item |
91 | 115 |
92 $\bullet$ %\item | 116 $\bullet$ %\item |
93 To what extent does the perturbation of input images (e.g. adding | 117 To what extent does the perturbation of input images (e.g. adding |
94 noise, affine transformations, background images) make the resulting | 118 noise, affine transformations, background images) make the resulting |
95 classifiers better not only on similarly perturbed images but also on | 119 classifiers better not only on similarly perturbed images but also on |
96 the {\em original clean examples}? | 120 the {\em original clean examples}? We study this question in the |
121 context of the 62-class and 10-class tasks of the NIST special database 19. | |
97 | 122 |
98 $\bullet$ %\item | 123 $\bullet$ %\item |
99 Do deep architectures {\em benefit more from such out-of-distribution} | 124 Do deep architectures {\em benefit more from such out-of-distribution} |
100 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 125 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
126 We use highly perturbed examples to generate out-of-distribution examples. | |
101 | 127 |
102 $\bullet$ %\item | 128 $\bullet$ %\item |
103 Similarly, does the feature learning step in deep learning algorithms benefit more | 129 Similarly, does the feature learning step in deep learning algorithms benefit more |
104 from training with moderately different classes (i.e. a multi-task learning scenario) than | 130 from training with moderately different classes (i.e. a multi-task learning scenario) than |
105 a corresponding shallow and purely supervised architecture? | 131 a corresponding shallow and purely supervised architecture? |
132 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) | |
133 to answer this question. | |
106 %\end{enumerate} | 134 %\end{enumerate} |
107 | 135 |
108 Our experimental results provide positive evidence towards all of these questions. | 136 Our experimental results provide positive evidence towards all of these questions. |
109 To achieve these results, we introduce in the next section a sophisticated system | 137 To achieve these results, we introduce in the next section a sophisticated system |
110 for stochastically transforming character images and then explain the methodology. | 138 for stochastically transforming character images and then explain the methodology, |
139 which is based on training with or without these transformed images and testing on | |
140 clean ones. We measure the relative advantage of out-of-distribution examples | |
141 for a deep learner vs a supervised shallow one. | |
142 Code for generating these transformations as well as for the deep learning | |
143 algorithms is made available. | |
144 We also estimate the relative advantage for deep learners of training with | |
145 other classes than those of interest, by comparing learners trained with | |
146 62 classes with learners trained with only a subset (on which they | |
147 are then tested). | |
111 The conclusion discusses | 148 The conclusion discusses |
112 the more general question of why deep learners may benefit so much from | 149 the more general question of why deep learners may benefit so much from |
113 the self-taught learning framework. | 150 the self-taught learning framework. |
114 | 151 |
115 \vspace*{-1mm} | 152 \vspace*{-3mm} |
116 \section{Perturbation and Transformation of Character Images} | 153 \section{Perturbation and Transformation of Character Images} |
117 \label{s:perturbations} | 154 \label{s:perturbations} |
118 \vspace*{-1mm} | 155 \vspace*{-2mm} |
119 | 156 |
120 \begin{wrapfigure}[8]{l}{0.15\textwidth} | 157 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
121 %\begin{minipage}[b]{0.14\linewidth} | 158 %\begin{minipage}[b]{0.14\linewidth} |
122 \vspace*{-5mm} | 159 \vspace*{-5mm} |
123 \begin{center} | 160 \begin{center} |
191 %\centering | 228 %\centering |
192 To produce {\bf slant}, each row of the image is shifted | 229 To produce {\bf slant}, each row of the image is shifted |
193 proportionally to its height: $shift = round(slant \times height)$. | 230 proportionally to its height: $shift = round(slant \times height)$. |
194 $slant \sim U[-complexity,complexity]$. | 231 $slant \sim U[-complexity,complexity]$. |
195 The shift is randomly chosen to be either to the left or to the right. | 232 The shift is randomly chosen to be either to the left or to the right. |
196 \vspace{1.1cm} | 233 \vspace{1cm} |
197 \end{minipage} | 234 \end{minipage} |
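For illustration only, here is a minimal NumPy sketch of a slant module consistent with the description above; the function name, the zero background fill, and the boundary handling are assumptions of this sketch, not the released data-generation code.

import numpy as np

def apply_slant(image, complexity, rng=np.random):
    # Shift row y of a 2D greyscale image horizontally by round(slant * y),
    # with slant ~ U[-complexity, complexity]; the sign of slant determines
    # whether rows move to the left or to the right.
    slant = rng.uniform(-complexity, complexity)
    height, width = image.shape
    out = np.zeros_like(image)                  # assumed background value: 0
    for y in range(height):
        shift = int(round(slant * y))
        shift = max(-width, min(width, shift))  # stay within the frame
        if shift >= 0:                          # move row to the right
            out[y, shift:width] = image[y, 0:width - shift]
        else:                                   # move row to the left
            out[y, 0:width + shift] = image[y, -shift:width]
    return out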
198 %\vspace*{-4mm} | 235 %\vspace*{-4mm} |
199 | 236 |
200 %\begin{minipage}[b]{0.14\linewidth} | 237 %\begin{minipage}[b]{0.14\linewidth} |
201 %\centering | 238 %\centering |
202 \begin{wrapfigure}[8]{l}{0.15\textwidth} | 239 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
203 \vspace*{-6mm} | 240 \vspace*{-6mm} |
204 \begin{center} | 241 \begin{center} |
205 \includegraphics[scale=.4]{images/Affine_only.png}\\ | 242 \includegraphics[scale=.4]{images/Affine_only.png}\\ |
206 {\bf Affine Transformation} | 243 {\small {\bf Affine \mbox{Transformation}}} |
207 \end{center} | 244 \end{center} |
208 \end{wrapfigure} | 245 \end{wrapfigure} |
209 %\end{minipage}% | 246 %\end{minipage}% |
210 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} | 247 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} |
211 A $2 \times 3$ {\bf affine transform} matrix (with | 248 A $2 \times 3$ {\bf affine transform} matrix (with |
228 %\hspace*{-8mm}\begin{minipage}[b]{0.25\linewidth} | 265 %\hspace*{-8mm}\begin{minipage}[b]{0.25\linewidth} |
229 %\centering | 266 %\centering |
230 \begin{center} | 267 \begin{center} |
231 \vspace*{-4mm} | 268 \vspace*{-4mm} |
232 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\ | 269 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\ |
233 {\bf Local Elastic} | 270 {\bf Local Elastic Deformation} |
234 \end{center} | 271 \end{center} |
235 \end{wrapfigure} | 272 \end{wrapfigure} |
236 %\end{minipage}% | 273 %\end{minipage}% |
237 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth} | 274 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth} |
238 %\vspace*{-20mm} | 275 %\vspace*{-20mm} |
239 The {\bf local elastic} deformation | 276 The {\bf local elastic deformation} |
240 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, | 277 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, |
241 which provides more details. | 278 which provides more details. |
242 The displacement fields have intensity | 279 The displacement fields have intensity |
243 $\alpha = \sqrt[3]{complexity} \times 10.0$ and are | 280 $\alpha = \sqrt[3]{complexity} \times 10.0$ and are |
244 convolved with a 2D Gaussian kernel (resulting in a blur) of | 281 convolved with a 2D Gaussian kernel (resulting in a blur) of |
245 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$. | 282 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$. |
246 %\vspace{.9cm} | 283 %\vspace{.9cm} |
247 \end{minipage} | 284 \end{minipage} |
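For concreteness, a minimal SciPy sketch of this deformation under the stated $\alpha$ and $\sigma$ formulas; the uniform raw displacement fields, bilinear interpolation, and constant boundary mode follow the usual Simard-style implementation and are assumptions here, not details taken from the released code.

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deformation(image, complexity, rng=np.random):
    # Random displacement fields in [-1, 1] are smoothed with a 2D Gaussian
    # of standard deviation sigma and scaled by alpha, then the image is
    # resampled along the displaced coordinates.
    alpha = (complexity ** (1.0 / 3.0)) * 10.0
    sigma = 10.0 - 7.0 * (complexity ** (1.0 / 3.0))
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(image.shape[0]),
                         np.arange(image.shape[1]), indexing='ij')
    coords = np.array([ys + dy, xs + dx])
    return map_coordinates(image, coords, order=1, mode='constant')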
248 | 285 |
249 \vspace*{5mm} | 286 \vspace*{7mm} |
250 | 287 |
251 %\begin{minipage}[b]{0.14\linewidth} | 288 %\begin{minipage}[b]{0.14\linewidth} |
252 %\centering | 289 %\centering |
253 \begin{wrapfigure}[7]{l}{0.15\textwidth} | 290 \begin{wrapfigure}[7]{l}{0.15\textwidth} |
254 \vspace*{-5mm} | 291 \vspace*{-5mm} |
275 around the (non-integer) source position thus found. | 312 around the (non-integer) source position thus found. |
276 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. | 313 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. |
277 %\vspace{1.5cm} | 314 %\vspace{1.5cm} |
278 %\end{minipage} | 315 %\end{minipage} |
279 | 316 |
280 \vspace{2mm} | 317 \vspace{1mm} |
281 | 318 |
282 {\large\bf 2.2 Injecting Noise} | 319 {\large\bf 2.2 Injecting Noise} |
283 %\subsection{Injecting Noise} | 320 %\subsection{Injecting Noise} |
284 \vspace{2mm} | 321 \vspace{2mm} |
285 | 322 |
521 Mechanical Turk has been used extensively in natural language processing and vision. | 558 Mechanical Turk has been used extensively in natural language processing and vision. |
522 %processing \citep{SnowEtAl2008} and vision | 559 %processing \citep{SnowEtAl2008} and vision |
523 %\citep{SorokinAndForsyth2008,whitehill09}. | 560 %\citep{SorokinAndForsyth2008,whitehill09}. |
524 AMT users were presented | 561 AMT users were presented |
525 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII | 562 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII |
526 characters. They were forced to make a hard choice among the | 563 characters. They were forced to choose a single character class (either among the |
527 62 or 10 character classes (all classes or digits only). | 564 62 or 10 character classes) for each image. |
528 80 subjects classified 2500 images per (dataset,task) pair, | 565 80 subjects classified 2500 images per (dataset,task) pair, |
529 with the guarantee that 3 different subjects classified each image, allowing | 566 with the guarantee that 3 different subjects classified each image, allowing |
530 us to estimate inter-human variability (e.g., a standard error of 0.1\% | 567 us to estimate inter-human variability (e.g., a standard error of 0.1\% |
531 on the average 18.2\% error made by humans on the 62-class task on the NIST test set). | 568 on the average 18.2\% error made by humans on the 62-class task on the NIST test set). |
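For reference, under a simple binomial model a standard error of $0.1\%$ at an error rate of $p \approx 0.182$ corresponds to roughly $n \approx p(1-p)/(0.001)^2 \approx 1.5 \times 10^5$ independent labels; inter-rater correlation or a per-subject analysis would change this count, so this is only an order-of-magnitude reading of the quoted figure, not the computation actually used here.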
532 | 569 |
635 scaling behavior). | 672 scaling behavior). |
636 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 673 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
637 exponentials) on the output layer for estimating $P(class | image)$. | 674 exponentials) on the output layer for estimating $P(class | image)$. |
638 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 675 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. |
639 Training examples are presented in minibatches of size 20. A constant learning | 676 Training examples are presented in minibatches of size 20. A constant learning |
640 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ | 677 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. |
641 through preliminary experiments (measuring performance on a validation set), | 678 %through preliminary experiments (measuring performance on a validation set), |
642 and $0.1$ (which was found to work best) was then selected for optimizing on | 679 %and $0.1$ (which was found to work best) was then selected for optimizing on |
643 the whole training sets. | 680 %the whole training sets. |
644 \vspace*{-1mm} | 681 \vspace*{-1mm} |
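To make this baseline concrete, below is a minimal NumPy sketch of such a one-hidden-layer MLP trained by minibatch stochastic gradient descent with a constant learning rate. The tanh hidden layer, softmax output, minibatch size of 20, and the hyper-parameter grids come from the text; the class name, initialization, and overall structure are assumptions of this sketch, not the authors' released code.

import numpy as np

class ShallowMLP:
    # Hedged sketch: one tanh hidden layer, softmax output estimating
    # P(class | image), trained by minibatch SGD on the cross-entropy loss.
    def __init__(self, n_in, n_hidden, n_classes, rng=np.random):
        scale = 1.0 / np.sqrt(n_in)
        self.W1 = rng.uniform(-scale, scale, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.zeros((n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)           # hidden representation
        logits = H @ self.W2 + self.b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # softmax probabilities
        return H, P

    def sgd_step(self, X, y, lr=0.1):
        # One SGD step on a minibatch; y holds integer class indices.
        n = X.shape[0]
        H, P = self.forward(X)
        dlogits = P.copy()
        dlogits[np.arange(n), y] -= 1.0              # dL/dlogits for cross-entropy
        dlogits /= n
        dH = dlogits @ self.W2.T * (1.0 - H ** 2)    # tanh' = 1 - tanh^2
        self.W2 -= lr * H.T @ dlogits
        self.b2 -= lr * dlogits.sum(axis=0)
        self.W1 -= lr * X.T @ dH
        self.b1 -= lr * dH.sum(axis=0)

A training loop would then call sgd_step on successive minibatches of 20 examples, with n_hidden chosen in {300, 500, 800, 1000, 1500} and lr in {0.001, 0.01, 0.025, 0.075, 0.1, 0.5} on a validation set, as described above.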
645 | 682 |
646 | 683 |
647 {\bf Stacked Denoising Auto-Encoders (SDA).} | 684 {\bf Stacked Denoising Auto-Encoders (SDA).} |
648 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 685 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
664 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | 701 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} |
665 \vspace*{-2mm} | 702 \vspace*{-2mm} |
666 \caption{Illustration of the computations and training criterion for the denoising | 703 \caption{Illustration of the computations and training criterion for the denoising |
667 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | 704 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of |
668 the layer (i.e. raw input or output of previous layer) | 705 the layer (i.e. raw input or output of previous layer) |
669 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | 706 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. |
670 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | 707 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which |
671 is compared to the uncorrupted input $x$ through the loss function | 708 is compared to the uncorrupted input $x$ through the loss function |
672 $L_H(x,z)$, whose expected value is approximately minimized during training | 709 $L_H(x,z)$, whose expected value is approximately minimized during training |
673 by tuning $\theta$ and $\theta'$.} | 710 by tuning $\theta$ and $\theta'$.} |
674 \label{fig:da} | 711 \label{fig:da} |
675 \vspace*{-2mm} | 712 \vspace*{-2mm} |
676 \end{figure} | 713 \end{figure} |
677 | 714 |
678 Here we chose to use the Denoising | 715 Here we chose to use the Denoising |
679 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for | 716 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for |
680 these deep hierarchies of features, as it is very simple to train and | 717 these deep hierarchies of features, as it is simple to train and |
681 explain (see Figure~\ref{fig:da}, as well as | 718 explain (see Figure~\ref{fig:da}, as well as |
682 the tutorial and code at {\tt http://deeplearning.net/tutorial}), | 719 the tutorial and code at {\tt http://deeplearning.net/tutorial}), |
683 provides efficient inference, and yielded results | 720 provides efficient inference, and yielded results |
684 comparable to or better than RBMs in a series of experiments | 721 comparable to or better than RBMs in a series of experiments |
685 \citep{VincentPLarochelleH2008}. During training, a Denoising | 722 \citep{VincentPLarochelleH2008}. During training, a Denoising |
686 Auto-encoder is presented with a stochastically corrupted version | 723 Auto-encoder is presented with a stochastically corrupted version |
687 of the input and trained to reconstruct the uncorrupted input, | 724 of the input and trained to reconstruct the uncorrupted input, |
688 forcing the hidden units to represent the leading regularities in | 725 forcing the hidden units to represent the leading regularities in |
689 the data. Once it is trained, in a purely unsupervised way, | 726 the data. Here we use the random binary masking corruption |
727 (which sets a random subset of the inputs to 0). | |
728 Once it is trained, in a purely unsupervised way, | |
690 its hidden units' activations can | 729 its hidden units' activations can |
691 be used as inputs for training a second one, etc. | 730 be used as inputs for training a second one, etc. |
692 After this unsupervised pre-training stage, the parameters | 731 After this unsupervised pre-training stage, the parameters |
693 are used to initialize a deep MLP, which is fine-tuned by | 732 are used to initialize a deep MLP, which is fine-tuned by |
694 the same standard procedure used to train them (see previous section). | 733 the same standard procedure used to train them (see previous section). |
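The following is a minimal NumPy sketch of one such denoising auto-encoder layer, matching the description above: binary masking corruption of the input, an encoder $f_\theta$, a decoder $g_{\theta'}$, and a cross-entropy reconstruction loss $L_H(x,z)$. The tied encoder/decoder weights, sigmoid units, and initialization are assumptions of this sketch rather than details stated here.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoder:
    # Hedged sketch of one DA layer: masking corruption, sigmoid encoder and
    # decoder with tied weights, cross-entropy reconstruction loss L_H(x, z).
    def __init__(self, n_visible, n_hidden, rng=np.random):
        scale = 1.0 / np.sqrt(n_visible)
        self.W = rng.uniform(-scale, scale, (n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.rng = rng

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hid)       # y = f_theta(x)

    def train_step(self, x, corruption=0.25, lr=0.01):
        # One SGD step on a minibatch x with values in [0, 1].
        mask = self.rng.binomial(1, 1.0 - corruption, x.shape)
        x_tilde = x * mask                            # zero a random subset of inputs
        y = self.encode(x_tilde)
        z = sigmoid(y @ self.W.T + self.b_vis)        # z = g_theta'(y)
        dz = (z - x) / x.shape[0]                     # grad of L_H w.r.t. decoder pre-activation
        dy = dz @ self.W * y * (1.0 - y)              # backprop through the encoder
        self.W -= lr * (x_tilde.T @ dy + dz.T @ y)    # tied-weight gradient
        self.b_vis -= lr * dz.sum(axis=0)
        self.b_hid -= lr * dy.sum(axis=0)
        return y

Stacking then amounts to training a second such layer on the codes returned by encode(); after this unsupervised pre-training, each layer's W and b_hid would initialize the corresponding layer of a deep MLP that is fine-tuned by supervised gradient descent, as described above.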
840 | 879 |
841 $\bullet$ %\item | 880 $\bullet$ %\item |
842 {\bf Do the good results previously obtained with deep architectures on the | 881 {\bf Do the good results previously obtained with deep architectures on the |
843 MNIST digits generalize to a much larger and richer (but similar) | 882 MNIST digits generalize to a much larger and richer (but similar) |
844 dataset, the NIST special database 19, with 62 classes and around 800k examples}? | 883 dataset, the NIST special database 19, with 62 classes and around 800k examples}? |
845 Yes, the SDA {\bf systematically outperformed the MLP and all the previously | 884 Yes, the SDA {\em systematically outperformed the MLP and all the previously |
846 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level | 885 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level |
847 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. | 886 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
848 | 887 |
849 $\bullet$ %\item | 888 $\bullet$ %\item |
850 {\bf To what extent do self-taught learning scenarios help deep learners, | 889 {\bf To what extent do self-taught learning scenarios help deep learners, |
851 and do they help them more than shallow supervised ones}? | 890 and do they help them more than shallow supervised ones}? |
856 examples. MLPs were helped by perturbed training examples when tested on perturbed input | 895 examples. MLPs were helped by perturbed training examples when tested on perturbed input |
857 images (65\% relative improvement on NISTP) | 896 images (65\% relative improvement on NISTP) |
858 but only marginally helped (5\% relative improvement on all classes) | 897 but only marginally helped (5\% relative improvement on all classes) |
859 or even hurt (10\% relative loss on digits) | 898 or even hurt (10\% relative loss on digits) |
860 with respect to clean examples. On the other hand, the deep SDAs | 899 with respect to clean examples. On the other hand, the deep SDAs |
861 were very significantly boosted by these out-of-distribution examples. | 900 were significantly boosted by these out-of-distribution examples. |
862 Similarly, whereas the improvement due to the multi-task setting was marginal or | 901 Similarly, whereas the improvement due to the multi-task setting was marginal or |
863 negative for the MLP (from +5.6\% to -3.6\% relative change), | 902 negative for the MLP (from +5.6\% to -3.6\% relative change), |
864 it was very significant for the SDA (from +13\% to +27\% relative change), | 903 it was quite significant for the SDA (from +13\% to +27\% relative change), |
865 which may be explained by the arguments below. | 904 which may be explained by the arguments below. |
866 %\end{itemize} | 905 %\end{itemize} |
867 | 906 |
868 In the original self-taught learning framework~\citep{RainaR2007}, the | 907 In the original self-taught learning framework~\citep{RainaR2007}, the |
869 out-of-sample examples were used as a source of unsupervised data, and | 908 out-of-sample examples were used as a source of unsupervised data, and |
871 scenario. However, many of the results by \citet{RainaR2007} (who used a | 910 scenario. However, many of the results by \citet{RainaR2007} (who used a |
872 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught | 911 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught |
873 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. | 912 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. |
874 We note instead that, for deep | 913 We note instead that, for deep |
875 architectures, our experiments show that such a positive effect is accomplished | 914 architectures, our experiments show that such a positive effect is accomplished |
876 even in a scenario with a \emph{very large number of labeled examples}, | 915 even in a scenario with a \emph{large number of labeled examples}, |
877 i.e., here, the relative gain of self-taught learning is probably preserved | 916 i.e., here, the relative gain of self-taught learning is probably preserved |
878 in the asymptotic regime. | 917 in the asymptotic regime. |
879 | 918 |
880 {\bf Why would deep learners benefit more from the self-taught learning framework}? | 919 {\bf Why would deep learners benefit more from the self-taught learning framework}? |
881 The key idea is that the lower layers of the predictor compute a hierarchy | 920 The key idea is that the lower layers of the predictor compute a hierarchy |