comparison writeup/aistats2011_revised.tex @ 624:49933073590c

added jmlr_review1.txt and jmlr_review2.txt
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 13 Mar 2011 18:25:25 -0400
parents d44c78c90669
children
\twocolumn[
\aistatstitle{Deep Learners Benefit More from Out-of-Distribution Examples}
\runningtitle{Deep Learners for Out-of-Distribution Examples}
\runningauthor{Bengio et al.}
\aistatsauthor{Anonymous Authors\\
\vspace*{5mm}}]
\iffalse
Yoshua Bengio \and
Frédéric Bastien \and
Arnaud Bergeron \and
Nicolas Boulanger-Lewandowski \and
%\makeanontitle
%\maketitle

%{\bf Running title: Deep Self-Taught Learning}

\vspace*{5mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did {\em beat previously published results and reached human-level performance}.
\end{abstract}
%\vspace*{-3mm}
because each image was classified by 3 different persons.
The average error of humans on the 62-class task NIST test set
is 18.2\%, with a standard error of 0.1\%.
We controlled noise in the labelling process by (1)
requiring AMT workers with a higher than normal average of accepted
responses ($>$95\%) on other tasks, (2) discarding responses that were not
complete (10 predictions), (3) discarding responses for which the
time to predict was smaller than 3 seconds for NIST (the mean response time
was 20 seconds) and 6 seconds for NISTP (average response time of
45 seconds), and (4) discarding responses which were obviously wrong (10
identical ones, or "12345..."). Overall, after such filtering, we kept
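To make criteria (1)--(4) concrete, the following is a minimal Python sketch of how such a filter could be applied to the raw AMT responses; the field names (worker_acceptance_rate, predictions, response_time_s) and the obviously_wrong helper are hypothetical and introduced only for illustration, not taken from the actual scripts.

# Illustrative sketch of filtering rules (1)-(4); all field names are hypothetical.
MIN_RESPONSE_TIME_S = {"NIST": 3.0, "NISTP": 6.0}

def obviously_wrong(predictions):
    # e.g. ten identical answers, or a lazy sequence such as "1234567890"
    return len(set(predictions)) == 1 or predictions == list("1234567890")

def keep_response(response, dataset):
    return (
        response["worker_acceptance_rate"] > 0.95                         # (1) reliable worker
        and len(response["predictions"]) == 10                            # (2) complete response
        and response["response_time_s"] >= MIN_RESPONSE_TIME_S[dataset]   # (3) not answered too fast
        and not obviously_wrong(response["predictions"])                  # (4) not obviously wrong
    )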
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
from $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 for most experiments,
based on previous work with
SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
We also compared against networks with 1 and 2 hidden layers, in order
to disentangle the effect of depth from the effect of unsupervised
pre-training.
The size of the hidden layers was kept constant across layers, and the best
results were obtained with the largest value we had the patience to
experiment with, 1000 hidden units.
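As an illustration of the masking noise process and of a single pre-training layer, here is a minimal numpy sketch of one denoising auto-encoder in the spirit of~\citep{VincentPLarochelleH2008-very-small}; the sigmoid units, tied decoder weights and cross-entropy reconstruction loss are assumptions made for the sketch rather than details drawn from the experiments above.

import numpy as np

rng = np.random.RandomState(0)

def masking_noise(x, corruption_fraction):
    # For each example (row of x), zero a fixed proportion of randomly
    # selected input values.
    x_tilde = x.copy()
    n_corrupt = int(round(corruption_fraction * x.shape[1]))
    for i in range(x.shape[0]):
        idx = rng.choice(x.shape[1], size=n_corrupt, replace=False)
        x_tilde[i, idx] = 0.0
    return x_tilde

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_autoencoder_loss(x, W, b_hid, b_vis, corruption_fraction):
    # Corrupt, encode, decode (tied weights), then score the reconstruction
    # of the *uncorrupted* input x, assumed to lie in [0, 1] (e.g. pixel
    # intensities), with a cross-entropy loss.
    x_tilde = masking_noise(x, corruption_fraction)
    h = sigmoid(x_tilde @ W + b_hid)     # hidden representation
    x_hat = sigmoid(h @ W.T + b_vis)     # reconstruction of the clean input
    eps = 1e-7
    return -np.mean(np.sum(x * np.log(x_hat + eps)
                           + (1.0 - x) * np.log(1.0 - x_hat + eps), axis=1))

In the setting described above, three such layers of 1000 units each would be pre-trained greedily, one on top of the other, before supervised fine-tuning of the whole network.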
%\vspace*{-1mm}
and the 10-class (digits) task.
17\% error (SDA1) or 18\% error (humans) may seem large but a large
majority of the errors from humans and from SDA1 are from out-of-context
confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
``c'' and a ``C'' are often indistinguishable).
Regarding shallower networks pre-trained with unsupervised denoising
auto-encoders, we find that the NIST test error is 21\% with one hidden
layer and 20\% with two hidden layers (vs 17\% in the same conditions
with 3 hidden layers). Compare this with the 23\% error achieved
by the MLP, i.e. a single hidden layer and no unsupervised pre-training.
As found in previous work~\cite{Erhan+al-2010,Larochelle-jmlr-2009},
these results show that both depth and
unsupervised pre-training need to be combined in order to achieve
the best results.

In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by out-of-distribution examples is greater for the deep
SDA, and these