# HG changeset patch
# User Yoshua Bengio
# Date 1288579233 14400
# Node ID eb6244c6d8610cc96beb848098b8ebb526352182
# Parent 203c6071e104940a42e0b23eecc5d137a8ef8ce7
aistats submission

diff -r 203c6071e104 -r eb6244c6d861 writeup/aistats2011_submission.tex
--- a/writeup/aistats2011_submission.tex Sun Oct 31 22:27:30 2010 -0400
+++ b/writeup/aistats2011_submission.tex Sun Oct 31 22:40:33 2010 -0400
@@ -109,7 +109,7 @@
 stochastic gradient descent.
 One of these layer initialization techniques,
 applied here, is the Denoising
-Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see
+Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
 Figure~\ref{fig:da}), which
 performed similarly or better~\citep{VincentPLarochelleH2008-very-small}
 than previously proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
@@ -118,10 +118,10 @@
 to denoise its input, creating a layer of features that can be used as
 input for the next layer. Note that training a Denoising Auto-Encoder
 can actually been seen as training a particular RBM by an inductive
-principle different from maximum likelihood~\cite{Vincent-SM-2010}, namely by
-Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
+principle different from maximum likelihood~\citep{ift6266-tr-anonymous}, % Vincent-SM-2010},
+namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
 
-Previous comparative experimental results with stacking of RBMs and DAEs
+Previous comparative experimental results with stacking of RBMs and DAs
 to build deep supervised predictors had shown that they could outperform
 shallow architectures in a variety of settings (see~\citet{Bengio-2009}
 for a review), especially
@@ -161,9 +161,9 @@
 %
 The {\bf main claim} of this paper is that deep learners (with several
 levels of representation) can
-{\bf benefit more from self-taught learning than shallow learners} (with a single
-level), both in the context of the multi-task setting and from {\em
-  out-of-distribution examples} in general. Because we are able to improve on state-of-the-art
+{\bf benefit more from out-of-distribution examples than shallow learners} (with a single
+level), both in the context of the multi-task setting and from
+ perturbed examples. Because we are able to improve on state-of-the-art
 performance and reach human-level performance
 on a large-scale task, we consider that this paper is also a contribution
 to advance the application of machine learning to handwritten character recognition.
@@ -212,7 +212,7 @@
 are then tested).
 
 The conclusion discusses
 the more general question of why deep learners may benefit so much from
-the self-taught learning framework. Since out-of-distribution data
+out-of-distribution examples. Since out-of-distribution data
 (perturbed or from other related classes) is very common, this conclusion
 is of practical importance.
@@ -512,13 +512,13 @@
 \vspace*{-3mm}
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
 \vspace*{-3mm}
-\caption{Relative improvement in error rate due to self-taught learning.
+\caption{Relative improvement in error rate due to out-of-distribution examples.
 Left: Improvement (or loss, when negative) induced by out-of-distribution
 examples (perturbed data).
 Right: Improvement (or loss, when negative) induced by multi-task
 learning (training on all classes and testing only on either digits,
 upper case, or lower-case). The deep learner (SDA) benefits more from
-both self-taught learning scenarios, compared to the shallow MLP.}
+out-of-distribution examples, compared to the shallow MLP.}
 \label{fig:improvements-charts}
 \vspace*{-2mm}
 \end{figure*}
@@ -557,8 +557,9 @@
 
 In addition, as shown in the left of
 Figure~\ref{fig:improvements-charts}, the relative improvement in error
-rate brought by self-taught learning is greater for the SDA, and these
-differences with the MLP are statistically and qualitatively
+rate brought by out-of-distribution examples is greater for the deep
+stacked SDA, and these
+differences with the shallow MLP are statistically and qualitatively
 significant.
 The left side of the figure shows the improvement to the clean
 NIST test set error brought by the use of out-of-distribution examples
@@ -623,7 +624,8 @@
 \section{Conclusions and Discussion}
 \vspace*{-2mm}
 
-We have found that the self-taught learning framework is more beneficial
+We have found that out-of-distribution examples (multi-task learning
+and perturbed examples) are more beneficial
 to a deep learner than to a traditional shallow and purely
 supervised learner. More precisely, the answers are positive for all the
 questions asked in the introduction.
@@ -639,13 +641,13 @@
 and beating previously published results on the same data.
 
 $\bullet$ %\item
-{\bf To what extent do self-taught learning scenarios help deep learners,
+{\bf To what extent do out-of-distribution examples help deep learners,
 and do they help them more than shallow supervised ones}?
 We found that distorted training examples not only made the resulting
 classifier better on similarly perturbed images but also on
 the {\em original clean examples}, and more importantly and more novel,
 that deep architectures benefit more from such {\em out-of-distribution}
-examples. MLPs were helped by perturbed training examples when tested on perturbed input
+examples. Shallow MLPs were helped by perturbed training examples when tested on perturbed input
 images (65\% relative improvement on NISTP) but only marginally helped
 (5\% relative improvement on all classes)
 or even hurt (10\% relative loss on digits)
@@ -669,7 +671,8 @@
 i.e., here, the relative gain of self-taught learning is probably preserved
 in the asymptotic regime.
 
-{\bf Why would deep learners benefit more from the self-taught learning framework}?
+{\bf Why would deep learners benefit more from the self-taught learning
+framework and out-of-distribution examples}?
 The key idea is that the lower layers of the predictor compute a hierarchy
 of features that can be shared across tasks or across variants of the
 input distribution. A theoretical analysis of generalization improvements
@@ -692,14 +695,14 @@
 it was found that online learning on a huge dataset did not make the
 advantage of the deep learning bias vanish, and a similar phenomenon
 may be happening here. We hypothesize that unsupervised pre-training
-of a deep hierarchy with self-taught learning initializes the
+of a deep hierarchy with out-of-distribution examples initializes the
 model in the basin of attraction of supervised gradient descent
 that corresponds to better generalization. Furthermore, such good
 basins of attraction are not discovered by pure supervised learning
-(with or without self-taught settings) from random initialization, and more labeled examples
+(with or without out-of-distribution examples) from random initialization, and more labeled examples
 does not allow the shallow or purely supervised models to discover
 the kind of better basins associated
-with deep learning and out-of-distribution examples.
+with deep learning and out-of-distribution examples.
 
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
 can be executed on-line at the anonymous site {\tt http://deep.host22.com}.
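
For reference, the following is a minimal numpy sketch of the denoising
auto-encoder (DA) layer described in the patched text: the layer corrupts its
input, is trained to reconstruct the clean input, and its hidden code becomes
the input of the next layer in a stacked model. Everything here is an
illustrative assumption, not the implementation used for the paper's SDA
experiments: the class name DenoisingAutoencoderLayer, the masking-corruption
rate, the learning rate, and the tied-weight/cross-entropy choices are all
invented defaults for exposition.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoderLayer:
    """One DA layer with tied weights; hypothetical names and hyper-parameters."""

    def __init__(self, n_visible, n_hidden, corruption=0.25, lr=0.1):
        self.W = rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)   # encoder bias
        self.b_vis = np.zeros(n_visible)  # decoder bias
        self.corruption = corruption      # fraction of input units zeroed out
        self.lr = lr

    def encode(self, x):
        # Hidden features; in a stacked DA these feed the next layer.
        return sigmoid(x @ self.W + self.b_hid)

    def decode(self, h):
        # Tied-weight reconstruction of the visible units.
        return sigmoid(h @ self.W.T + self.b_vis)

    def train_step(self, x):
        # Corrupt the input (masking noise), then learn to reconstruct
        # the *clean* input from the corrupted version.
        mask = rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        x_tilde = x * mask
        h = self.encode(x_tilde)
        x_rec = self.decode(h)
        # Gradients of the cross-entropy reconstruction loss
        # (sigmoid output => output-layer delta simplifies to x_rec - x).
        delta_vis = x_rec - x                              # (batch, n_visible)
        delta_hid = (delta_vis @ self.W) * h * (1.0 - h)   # (batch, n_hidden)
        grad_W = x_tilde.T @ delta_hid + delta_vis.T @ h   # encoder + decoder parts
        n = x.shape[0]
        self.W -= self.lr * grad_W / n
        self.b_hid -= self.lr * delta_hid.mean(axis=0)
        self.b_vis -= self.lr * delta_vis.mean(axis=0)
        return self.encode(x)  # clean-input features for the layer above

# Hypothetical usage: pre-train one layer on random binary "images"; the
# returned features would then serve as training input for the next layer.
X = rng.binomial(1, 0.5, size=(64, 784)).astype(float)
layer = DenoisingAutoencoderLayer(n_visible=784, n_hidden=256)
for _ in range(10):
    features = layer.train_step(X)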