diff writeup/aistats2011_revised.tex @ 624:49933073590c

added jmlr_review1.txt and jmlr_review2.txt
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 13 Mar 2011 18:25:25 -0400
parents d44c78c90669
--- a/writeup/aistats2011_revised.tex	Sun Jan 09 22:00:39 2011 -0500
+++ b/writeup/aistats2011_revised.tex	Sun Mar 13 18:25:25 2011 -0400
@@ -24,7 +24,8 @@
 \aistatstitle{Deep Learners Benefit More from Out-of-Distribution Examples}
 \runningtitle{Deep Learners for Out-of-Distribution Examples}
 \runningauthor{Bengio et al.}
-\aistatsauthor{Anonymous Authors}]
+\aistatsauthor{Anonymous Authors\\
+\vspace*{5mm}}]
 \iffalse
 Yoshua  Bengio \and
 Frédéric  Bastien \and
@@ -55,7 +56,7 @@
 
 %{\bf Running title: Deep Self-Taught Learning}
 
-%\vspace*{-2mm}
+\vspace*{5mm}
 \begin{abstract}
   Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did {\em beat previously published results and reach human-level performance}.
 \end{abstract}
@@ -297,7 +298,7 @@
 is 18.2\%, with a standard error of 0.1\%.
 We controlled noise in the labelling process by (1)
 requiring AMT workers with a higher than normal average of accepted
-responses (>95\%) on other tasks (2) discarding responses that were not
+responses ($>$95\%) on other tasks, (2) discarding responses that were not
 complete (10 predictions), and (3) discarding responses for which the
 time to predict was less than 3 seconds for NIST (the mean response time
 was 20 seconds) and 6 seconds for NISTP (average response time of
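
For readers who want to reproduce this quality-control step, the three criteria above amount to a simple per-response predicate. The sketch below is a hedged illustration, not the paper's actual pipeline: the field names (accept_rate, predictions, response_time, dataset) are hypothetical, while the thresholds are the ones stated in the text.

# Hypothetical sketch of the AMT noise-control filter described above; field
# names and data layout are assumptions, only the thresholds come from the text
# (>95% acceptance, 10 predictions per response, 3 s / 6 s minimum time).

MIN_ACCEPT_RATE = 0.95                           # criterion (1)
REQUIRED_PREDICTIONS = 10                        # criterion (2)
MIN_RESPONSE_TIME = {"NIST": 3.0, "NISTP": 6.0}  # criterion (3), in seconds

def keep_response(r):
    """Return True if an AMT response passes all three criteria."""
    return (r["accept_rate"] > MIN_ACCEPT_RATE
            and len(r["predictions"]) == REQUIRED_PREDICTIONS
            and r["response_time"] >= MIN_RESPONSE_TIME[r["dataset"]])

# Toy usage with made-up responses:
responses = [
    {"accept_rate": 0.97, "predictions": list("0123456789"),
     "response_time": 21.0, "dataset": "NIST"},
    {"accept_rate": 0.90, "predictions": list("0123456789"),
     "response_time": 2.0, "dataset": "NISTP"},
]
clean = [r for r in responses if keep_response(r)]  # keeps only the first response
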
@@ -497,8 +498,13 @@
 separate learning rate for the unsupervised pre-training stage (selected
 from the same set as above). The fraction of inputs corrupted was selected
 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
-of hidden layers but it was fixed to 3 based on previous work with
-SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. The size of the hidden
+of hidden layers, but it was fixed to 3 for most experiments,
+based on previous work with
+SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. 
+We also compared against 1 and 2 hidden layers, in order
+to disentangle the effect of depth from the effect of unsupervised
+pre-training.
+The size of the hidden
 layers was kept constant across layers, and the best results
 were obtained with the largest value that we could experiment
 with given our patience, 1000 hidden units.
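
To make the hyper-parameter search concrete, here is a minimal sketch of the grid described above. It assumes nothing beyond the text: LEARNING_RATES is a placeholder for the set mentioned earlier in the paper (not reproduced in this excerpt), corruption is chosen among {10%, 20%, 50%}, the network has 1, 2 or 3 hidden layers, and every hidden layer has 1000 units. This is an illustrative reconstruction, not the authors' code.

# Illustrative reconstruction of the SDA hyper-parameter grid described above.
from itertools import product

LEARNING_RATES = [...]        # placeholder: "the same set as above" in the paper
CORRUPTION_FRACTIONS = [0.10, 0.20, 0.50]
N_HIDDEN_LAYERS = [1, 2, 3]   # 3 for most experiments; 1 and 2 to isolate depth
HIDDEN_UNITS = 1000           # kept constant across hidden layers

def sda_configurations():
    for sup_lr, unsup_lr, corruption, depth in product(
            LEARNING_RATES, LEARNING_RATES, CORRUPTION_FRACTIONS, N_HIDDEN_LAYERS):
        yield {
            "supervised_lr": sup_lr,
            "pretraining_lr": unsup_lr,   # separate rate for the unsupervised stage
            "corruption": corruption,     # fraction of inputs corrupted
            "layer_sizes": [HIDDEN_UNITS] * depth,
        }
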
@@ -567,6 +573,16 @@
 majority of the errors from humans and from SDA1 are from out-of-context
 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
 ``c'' and a ``C'' are often indistinguishable).
+Regarding shallower networks pre-trained with unsupervised denoising
+auto-encoders, we find that the NIST test error is 21\% with one hidden
+layer and 20\% with two hidden layers (vs.\ 17\% in the same conditions
+with 3 hidden layers). Compare this with the 23\% error achieved
+by the MLP, i.e., a single hidden layer and no unsupervised pre-training.
+As found in previous work~\citep{Erhan+al-2010,Larochelle-jmlr-2009},
+these results show that both depth and
+unsupervised pre-training need to be combined in order to achieve
+the best results.
+
 
 In addition, as shown in the left of
 Figure~\ref{fig:improvements-charts}, the relative improvement in error
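
As a rough illustration of what combining depth with unsupervised pre-training means for the stacked denoising auto-encoders discussed above, the following sketch pre-trains each layer as a denoising auto-encoder with masking noise before the stack would be fine-tuned with labels. It is a toy reconstruction under stated assumptions (squared-error reconstruction, sigmoid units, tied weights), not the authors' Theano implementation.

# Toy NumPy sketch of greedy layer-wise denoising pre-training (masking noise,
# sigmoid units, tied weights, squared-error reconstruction). These choices are
# assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_denoising_layer(X, n_hidden, corruption=0.2, lr=0.1, epochs=10):
    """Train one denoising auto-encoder layer; return its weights and its codes."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_in, n_hidden))
    b = np.zeros(n_hidden)                       # encoder bias
    c = np.zeros(n_in)                           # decoder bias
    for _ in range(epochs):
        keep = rng.random(X.shape) > corruption  # masking noise: zero a fraction
        Xc = X * keep                            # corrupted input
        H = sigmoid(Xc @ W + b)                  # encode the corrupted input
        R = sigmoid(H @ W.T + c)                 # reconstruct the clean input
        dR = (R - X) * R * (1.0 - R)             # grad of 0.5*||R - X||^2 wrt pre-act
        dH = (dR @ W) * H * (1.0 - H)            # backprop through tied weights
        W -= lr * (Xc.T @ dH + dR.T @ H) / len(X)
        b -= lr * dH.mean(axis=0)
        c -= lr * dR.mean(axis=0)
    return W, b, sigmoid(X @ W + b)              # codes of the clean input

def pretrain_stack(X, layer_sizes=(1000, 1000, 1000), corruption=0.2):
    """Greedy layer-wise pre-training: each layer learns on the previous codes."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b, H = pretrain_denoising_layer(H, n_hidden, corruption)
        params.append((W, b))
    return params  # supervised fine-tuning of the whole stack would follow
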