comparison writeup/aistats2011_revised.tex @ 624:49933073590c

added jmlr_review1.txt and jmlr_review2.txt
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 13 Mar 2011 18:25:25 -0400
parents d44c78c90669
children
\twocolumn[
\aistatstitle{Deep Learners Benefit More from Out-of-Distribution Examples}
\runningtitle{Deep Learners for Out-of-Distribution Examples}
\runningauthor{Bengio et al.}
\aistatsauthor{Anonymous Authors\\
\vspace*{5mm}}]
\iffalse
Yoshua Bengio \and
Frédéric Bastien \and
Arnaud Bergeron \and
Nicolas Boulanger-Lewandowski \and
%\makeanontitle
%\maketitle

%{\bf Running title: Deep Self-Taught Learning}

\vspace*{5mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did {\em beat previously published results and reached human-level performance}.
\end{abstract}
%\vspace*{-3mm}
because each image was classified by 3 different persons.
The average error of humans on the 62-class task NIST test set
is 18.2\%, with a standard error of 0.1\%.
We controlled noise in the labelling process by (1)
requiring AMT workers with a higher than normal average of accepted
responses ($>$95\%) on other tasks, (2) discarding responses that were not
complete (10 predictions), (3) discarding responses for which the
time to predict was smaller than 3 seconds for NIST (the mean response time
was 20 seconds) and 6 seconds for NISTP (average response time of
45 seconds), and (4) discarding responses which were obviously wrong (10
identical ones, or "12345..."). Overall, after such filtering, we kept
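To make criteria (1)--(4) concrete, the following is a minimal Python sketch of how such a filter could be applied to the raw AMT responses; the field names (worker_acceptance_rate, predictions, response_time_s) and the obviously_wrong helper are hypothetical and introduced only for illustration, not taken from the actual scripts.

# Illustrative sketch of filtering rules (1)-(4); all field names are hypothetical.
MIN_RESPONSE_TIME_S = {"NIST": 3.0, "NISTP": 6.0}

def obviously_wrong(predictions):
    # e.g. ten identical answers, or a lazy sequence such as "1234567890"
    return len(set(predictions)) == 1 or predictions == list("1234567890")

def keep_response(response, dataset):
    return (
        response["worker_acceptance_rate"] > 0.95                         # (1) reliable worker
        and len(response["predictions"]) == 10                            # (2) complete response
        and response["response_time_s"] >= MIN_RESPONSE_TIME_S[dataset]   # (3) not answered too fast
        and not obviously_wrong(response["predictions"])                  # (4) not obviously wrong
    )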
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
from $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 for most experiments,
based on previous work with
SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
We also compared against networks with 1 and 2 hidden layers, in order
to disentangle the effect of depth from the effect of unsupervised
pre-training.
The size of the hidden layers was kept constant across layers, and the best
results were obtained with the largest value we had the patience to
experiment with, 1000 hidden units.
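As an illustration of the masking noise process and of a single pre-training layer, here is a minimal numpy sketch of one denoising auto-encoder in the spirit of~\citep{VincentPLarochelleH2008-very-small}; the sigmoid units, tied decoder weights and cross-entropy reconstruction loss are assumptions made for the sketch rather than details drawn from the experiments above.

import numpy as np

rng = np.random.RandomState(0)

def masking_noise(x, corruption_fraction):
    # For each example (row of x), zero a fixed proportion of randomly
    # selected input values.
    x_tilde = x.copy()
    n_corrupt = int(round(corruption_fraction * x.shape[1]))
    for i in range(x.shape[0]):
        idx = rng.choice(x.shape[1], size=n_corrupt, replace=False)
        x_tilde[i, idx] = 0.0
    return x_tilde

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_autoencoder_loss(x, W, b_hid, b_vis, corruption_fraction):
    # Corrupt, encode, decode (tied weights), then score the reconstruction
    # of the *uncorrupted* input x, assumed to lie in [0, 1] (e.g. pixel
    # intensities), with a cross-entropy loss.
    x_tilde = masking_noise(x, corruption_fraction)
    h = sigmoid(x_tilde @ W + b_hid)     # hidden representation
    x_hat = sigmoid(h @ W.T + b_vis)     # reconstruction of the clean input
    eps = 1e-7
    return -np.mean(np.sum(x * np.log(x_hat + eps)
                           + (1.0 - x) * np.log(1.0 - x_hat + eps), axis=1))

In the setting described above, three such layers of 1000 units each would be pre-trained greedily, one on top of the other, before supervised fine-tuning of the whole network.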
%\vspace*{-1mm}
and the 10-class (digits) task.
17\% error (SDA1) or 18\% error (humans) may seem large but a large
majority of the errors from humans and from SDA1 are from out-of-context
confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
``c'' and a ``C'' are often indistinguishable).
Regarding shallower networks pre-trained with unsupervised denoising
auto-encoders, we find that the NIST test error is 21\% with one hidden
layer and 20\% with two hidden layers (vs 17\% in the same conditions
with 3 hidden layers). Compare this with the 23\% error achieved
by the MLP, i.e. a single hidden layer and no unsupervised pre-training.
As found in previous work~\cite{Erhan+al-2010,Larochelle-jmlr-2009},
these results show that both depth and
unsupervised pre-training need to be combined in order to achieve
the best results.

In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by out-of-distribution examples is greater for the deep
SDA, and these