ift6266: comparison of writeup/aistats2011_submission.tex at 603:eb6244c6d861
aistats submission

author | Yoshua Bengio <bengioy@iro.umontreal.ca>
date | Sun, 31 Oct 2010 22:40:33 -0400
parents | 203c6071e104
children | 51213beaed8b
[...]

unsupervised initialization, the stack of layers can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.
One of these layer initialization techniques,
applied here, is the Denoising
Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
Figure~\ref{fig:da}), which performed similarly or
better~\citep{VincentPLarochelleH2008-very-small} than previously
proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
in terms of unsupervised extraction
of a hierarchy of features useful for classification. Each layer is trained
to denoise its input, creating a layer of features that can be used as
input for the next layer. Note that training a Denoising Auto-Encoder
can actually be seen as training a particular RBM by an inductive
principle different from maximum likelihood~\citep{ift6266-tr-anonymous}, % Vincent-SM-2010
namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
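
For concreteness, here is a minimal sketch of the per-layer denoising criterion; the notation is illustrative and not taken from this excerpt. An input $x$ is stochastically corrupted into $\tilde{x} \sim q(\tilde{x} \mid x)$ (e.g.\ by zeroing a random subset of its components), mapped to a hidden representation $h = s(W \tilde{x} + b)$, and decoded into a reconstruction $\hat{x} = s(W' h + b')$. The layer parameters are trained to minimize a reconstruction loss between $\hat{x}$ and the \emph{uncorrupted} $x$, such as
\[
  L(x, \hat{x}) = - \sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log (1 - \hat{x}_i) \right],
\]
after which $h$, computed from the clean input, becomes the training input of the next layer; the whole stack is then fine-tuned by supervised stochastic gradient descent.
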
Previous comparative experimental results with stacking of RBMs and DAs
to build deep supervised predictors had shown that they could outperform
shallow architectures in a variety of settings (see~\citet{Bengio-2009}
for a review), especially
when the data involves complex interactions between many factors of
variation~\citep{LarochelleH2007}. Other experiments have suggested

[...]

advantage of deep learning} for these settings has not been evaluated.


%
The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
{\bf benefit more from out-of-distribution examples than shallow learners} (with a single
level), both in the context of the multi-task setting and from
perturbed examples. Because we are able to improve on state-of-the-art
performance and reach human-level performance
on a large-scale task, we consider that this paper also contributes to
advancing the application of machine learning to handwritten character recognition.
More precisely, we ask and answer the following questions:

[...]

other classes than those of interest, by comparing learners trained with
62 classes with learners trained with only a subset (on which they
are then tested).
The conclusion discusses
the more general question of why deep learners may benefit so much from
out-of-distribution examples. Since out-of-distribution data
(perturbed or from other related classes) is very common, this conclusion
is of practical importance.

%\vspace*{-3mm}
%\newpage

[...]

\begin{figure*}[ht]
\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\vspace*{-3mm}
\caption{Relative improvement in error rate due to out-of-distribution examples.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
upper case, or lower case). The deep learner (SDA) benefits more from
out-of-distribution examples, compared to the shallow MLP.}
\label{fig:improvements-charts}
\vspace*{-2mm}
\end{figure*}

\vspace*{-2mm}

[...]

confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
``c'' and a ``C'' are often indistinguishable).

In addition, as shown in the left panel of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by out-of-distribution examples is greater for the deep
SDA, and these differences with respect to the shallow MLP are
statistically and qualitatively significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative percent change is measured by taking

[...]

\vspace*{-2mm}
\section{Conclusions and Discussion}
\vspace*{-2mm}

We have found that out-of-distribution examples (multi-task learning
and perturbed examples) are more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the answers are positive for all the questions asked in the introduction.
%\begin{itemize}

[...]

published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits,
and beating previously published results on the same data.

$\bullet$ %\item
{\bf To what extent do out-of-distribution examples help deep learners,
and do they help them more than shallow supervised ones}?
We found that distorted training examples not only made the resulting
classifier better on similarly perturbed images but also on
the {\em original clean examples}, and, more importantly and as a more novel finding,
that deep architectures benefit more from such {\em out-of-distribution}
examples. Shallow MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP)
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were significantly boosted by these out-of-distribution examples.
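
To make these relative figures concrete (the error rates in this example are hypothetical and merely illustrate the usual convention, $100 \times (\epsilon_{\mathrm{before}} - \epsilon_{\mathrm{after}}) / \epsilon_{\mathrm{before}}$): a 65\% relative improvement would take a test error of 30\% down to $30\% \times (1 - 0.65) = 10.5\%$, while a 10\% relative loss would take an error of 5\% up to $5\% \times 1.10 = 5.5\%$.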

[...]

architectures, our experiments show that such a positive effect is accomplished
even in a scenario with a \emph{large number of labeled examples},
i.e., here, the relative gain of self-taught learning is probably preserved
in the asymptotic regime.

{\bf Why would deep learners benefit more from the self-taught learning
framework and out-of-distribution examples}?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. A theoretical analysis of generalization improvements
due to sharing of intermediate features across tasks already points
towards that explanation~\citep{baxter95a}.
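
As an illustrative formalization of this sharing (the notation is ours and does not appear in this excerpt), a deep predictor for task $t$ can be written with $K$ shared layers and a small task-specific output layer,
\[
  h_0 = x, \qquad h_k = s(W_k h_{k-1} + b_k), \quad k = 1, \ldots, K, \qquad
  \hat{y}^{(t)} = \mathrm{softmax}\!\left(V^{(t)} h_K + c^{(t)}\right),
\]
so that examples from every task, and from perturbed variants of the input distribution, contribute training signal to the shared parameters $\{W_k, b_k\}$, while only the comparatively few task-specific parameters $\{V^{(t)}, c^{(t)}\}$ must be estimated from a single task's examples.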

[...]

We hypothesize that this is related to the hypotheses studied
in~\citet{Erhan+al-2010}, where it was found that online learning on a
huge dataset did not make the
advantage of the deep learning bias vanish, and a similar phenomenon
may be happening here. We hypothesize that unsupervised pre-training
of a deep hierarchy with out-of-distribution examples initializes the
model in the basin of attraction of supervised gradient descent
that corresponds to better generalization. Furthermore, such good
basins of attraction are not discovered by pure supervised learning
(with or without out-of-distribution examples) from random initialization, and more labeled examples
do not allow the shallow or purely supervised models to discover
the kind of better basins associated
with deep learning and out-of-distribution examples.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at the anonymous site {\tt http://deep.host22.com}.

\iffalse