# HG changeset patch
# User Yoshua Bengio
# Date 1288579233 14400
# Node ID eb6244c6d8610cc96beb848098b8ebb526352182
# Parent 203c6071e104940a42e0b23eecc5d137a8ef8ce7
aistats submission

diff -r 203c6071e104 -r eb6244c6d861 writeup/aistats2011_submission.tex
--- a/writeup/aistats2011_submission.tex Sun Oct 31 22:27:30 2010 -0400
+++ b/writeup/aistats2011_submission.tex Sun Oct 31 22:40:33 2010 -0400
@@ -109,7 +109,7 @@
 stochastic gradient descent.
 One of these layer initialization techniques,
 applied here, is the Denoising
-Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see
+Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
 Figure~\ref{fig:da}), which
 performed similarly or better~\citep{VincentPLarochelleH2008-very-small}
 than previously proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
@@ -118,10 +118,10 @@
 to denoise its input, creating a layer of features that can be used as
 input for the next layer. Note that training a Denoising Auto-Encoder
 can actually been seen as training a particular RBM by an inductive
-principle different from maximum likelihood~\cite{Vincent-SM-2010}, namely by
-Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
+principle different from maximum likelihood~\citep{ift6266-tr-anonymous}, % Vincent-SM-2010},
+namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
 
-Previous comparative experimental results with stacking of RBMs and DAEs
+Previous comparative experimental results with stacking of RBMs and DAs
 to build deep supervised predictors had shown that they could outperform
 shallow architectures in a variety of settings (see~\citet{Bengio-2009}
 for a review), especially
@@ -161,9 +161,9 @@
 %
 The {\bf main claim} of this paper is that deep learners (with several
 levels of representation) can
-{\bf benefit more from self-taught learning than shallow learners} (with a single
-level), both in the context of the multi-task setting and from {\em
-  out-of-distribution examples} in general. Because we are able to improve on state-of-the-art
+{\bf benefit more from out-of-distribution examples than shallow learners} (with a single
+level), both in the context of the multi-task setting and from
+ perturbed examples. Because we are able to improve on state-of-the-art
 performance and reach human-level performance
 on a large-scale task, we consider that this paper is also a contribution
 to advance the application of machine learning to handwritten character recognition.
@@ -212,7 +212,7 @@
 are then tested).
 
 The conclusion discusses
 the more general question of why deep learners may benefit so much from
-the self-taught learning framework. Since out-of-distribution data
+out-of-distribution examples. Since out-of-distribution data
 (perturbed or from other related classes) is very common, this conclusion
 is of practical importance.
@@ -512,13 +512,13 @@
 \vspace*{-3mm}
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
 \vspace*{-3mm}
-\caption{Relative improvement in error rate due to self-taught learning.
+\caption{Relative improvement in error rate due to out-of-distribution examples.
 Left: Improvement (or loss, when negative) induced by out-of-distribution
 examples (perturbed data).
 Right: Improvement (or loss, when negative) induced by multi-task
 learning (training on all classes and testing only on either digits,
 upper case, or lower-case). The deep learner (SDA) benefits more from
-both self-taught learning scenarios, compared to the shallow MLP.}
+out-of-distribution examples, compared to the shallow MLP.}
 \label{fig:improvements-charts}
 \vspace*{-2mm}
 \end{figure*}
@@ -557,8 +557,9 @@
 
 In addition, as shown in the left of
 Figure~\ref{fig:improvements-charts}, the relative improvement in error
-rate brought by self-taught learning is greater for the SDA, and these
-differences with the MLP are statistically and qualitatively
+rate brought by out-of-distribution examples is greater for the deep
+stacked SDA, and these
+differences with the shallow MLP are statistically and qualitatively
 significant.
 The left side of the figure shows the improvement to the clean
 NIST test set error brought by the use of out-of-distribution examples
@@ -623,7 +624,8 @@
 \section{Conclusions and Discussion}
 \vspace*{-2mm}
 
-We have found that the self-taught learning framework is more beneficial
+We have found that out-of-distribution examples (multi-task learning
+and perturbed examples) are more beneficial
 to a deep learner than to a traditional shallow and purely
 supervised learner. More precisely, the answers are positive for all the
 questions asked in the introduction.
@@ -639,13 +641,13 @@
 and beating previously published results on the same data.
 
 $\bullet$ %\item
-{\bf To what extent do self-taught learning scenarios help deep learners,
+{\bf To what extent do out-of-distribution examples help deep learners,
 and do they help them more than shallow supervised ones}?
 We found that distorted training examples not only made the resulting
 classifier better on similarly perturbed images but also on
 the {\em original clean examples}, and more importantly and more novel,
 that deep architectures benefit more from such {\em out-of-distribution}
-examples. MLPs were helped by perturbed training examples when tested on perturbed input
+examples. Shallow MLPs were helped by perturbed training examples when tested on perturbed input
 images (65\% relative improvement on NISTP) but only marginally helped
 (5\% relative improvement on all classes)
 or even hurt (10\% relative loss on digits)
@@ -669,7 +671,8 @@
 i.e., here, the relative gain of self-taught learning is probably preserved
 in the asymptotic regime.
 
-{\bf Why would deep learners benefit more from the self-taught learning framework}?
+{\bf Why would deep learners benefit more from the self-taught learning
+framework and out-of-distribution examples}?
 The key idea is that the lower layers of the predictor compute a hierarchy
 of features that can be shared across tasks or across variants of the
 input distribution. A theoretical analysis of generalization improvements
@@ -692,14 +695,14 @@
 it was found that online learning on a huge dataset did not make the
 advantage of the deep learning bias vanish, and a similar phenomenon
 may be happening here. We hypothesize that unsupervised pre-training
-of a deep hierarchy with self-taught learning initializes the
+of a deep hierarchy with out-of-distribution examples initializes the
 model in the basin of attraction of supervised gradient descent
 that corresponds to better generalization. Furthermore, such good
 basins of attraction are not discovered by pure supervised learning
-(with or without self-taught settings) from random initialization, and more labeled examples
+(with or without out-of-distribution examples) from random initialization, and more labeled examples
 does not allow the shallow or purely supervised models to discover
 the kind of better basins associated
-with deep learning and out-of-distribution examples.
+with deep learning and out-of-distribution examples.
 
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
 can be executed on-line at the anonymous site {\tt http://deep.host22.com}.
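
For reference, the following is a minimal numpy sketch of the denoising
auto-encoder (DA) layer described in the patched text: the layer corrupts its
input, is trained to reconstruct the clean input, and its hidden code becomes
the input of the next layer in a stacked model. Everything here is an
illustrative assumption, not the implementation used for the paper's SDA
experiments: the class name DenoisingAutoencoderLayer, the masking-corruption
rate, the learning rate, and the tied-weight/cross-entropy choices are all
invented defaults for exposition.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoderLayer:
    """One DA layer with tied weights; hypothetical names and hyper-parameters."""

    def __init__(self, n_visible, n_hidden, corruption=0.25, lr=0.1):
        self.W = rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)   # encoder bias
        self.b_vis = np.zeros(n_visible)  # decoder bias
        self.corruption = corruption      # fraction of input units zeroed out
        self.lr = lr

    def encode(self, x):
        # Hidden features; in a stacked DA these feed the next layer.
        return sigmoid(x @ self.W + self.b_hid)

    def decode(self, h):
        # Tied-weight reconstruction of the visible units.
        return sigmoid(h @ self.W.T + self.b_vis)

    def train_step(self, x):
        # Corrupt the input (masking noise), then learn to reconstruct
        # the *clean* input from the corrupted version.
        mask = rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        x_tilde = x * mask
        h = self.encode(x_tilde)
        x_rec = self.decode(h)
        # Gradients of the cross-entropy reconstruction loss
        # (sigmoid output => output-layer delta simplifies to x_rec - x).
        delta_vis = x_rec - x                              # (batch, n_visible)
        delta_hid = (delta_vis @ self.W) * h * (1.0 - h)   # (batch, n_hidden)
        grad_W = x_tilde.T @ delta_hid + delta_vis.T @ h   # encoder + decoder parts
        n = x.shape[0]
        self.W -= self.lr * grad_W / n
        self.b_hid -= self.lr * delta_hid.mean(axis=0)
        self.b_vis -= self.lr * delta_vis.mean(axis=0)
        return self.encode(x)  # clean-input features for the layer above

# Hypothetical usage: pre-train one layer on random binary "images"; the
# returned features would then serve as training input for the next layer.
X = rng.binomial(1, 0.5, size=(64, 784)).astype(float)
layer = DenoisingAutoencoderLayer(n_visible=784, n_hidden=256)
for _ in range(10):
    features = layer.train_step(X)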