ift6266: comparison of writeup/aistats2011_revised.tex @ 624:49933073590c
added jmlr_review1.txt and jmlr_review2.txt

author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Sun, 13 Mar 2011 18:25:25 -0400
parents:  d44c78c90669
children:
diff from 623:d44c78c90669 to 624:49933073590c:
@@ -22,11 +22,12 @@
 
 \twocolumn[
 \aistatstitle{Deep Learners Benefit More from Out-of-Distribution Examples}
 \runningtitle{Deep Learners for Out-of-Distribution Examples}
 \runningauthor{Bengio et al.}
-\aistatsauthor{Anonymous Authors}]
+\aistatsauthor{Anonymous Authors\\
+\vspace*{5mm}}]
 \iffalse
 Yoshua Bengio \and
 Frédéric Bastien \and
 Arnaud Bergeron \and
 Nicolas Boulanger-Lewandowski \and
@@ -53,11 +54,11 @@
 %\makeanontitle
 %\maketitle
 
 %{\bf Running title: Deep Self-Taught Learning}
 
-%\vspace*{-2mm}
+\vspace*{5mm}
 \begin{abstract}
 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did {\em beat previously published results and reached human-level performance}.
 \end{abstract}
 %\vspace*{-3mm}
 
@@ -295,11 +296,11 @@
 because each image was classified by 3 different persons.
 The average error of humans on the 62-class task NIST test set
 is 18.2\%, with a standard error of 0.1\%.
 We controlled noise in the labelling process by (1)
 requiring AMT workers with a higher than normal average of accepted
-responses (>95\%) on other tasks, (2) discarding responses that were not
+responses ($>$95\%) on other tasks, (2) discarding responses that were not
 complete (10 predictions), (3) discarding responses for which the
 time to predict was smaller than 3 seconds for NIST (the mean response time
 was 20 seconds) and 6 seconds for NISTP (average response time of
 45 seconds), and (4) discarding responses which were obviously wrong (10
 identical ones, or "12345..."). Overall, after such filtering, we kept
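The filtering rules (1)-(4) above amount to a simple per-response predicate. The following minimal Python sketch illustrates them; the record fields (approval_rate, predictions, response_time_s) and the helper itself are hypothetical, not the actual AMT post-processing scripts, and the "12345..." test is a simplified stand-in.

    def keep_response(r, dataset):
        """Apply the four filtering rules described in the text."""
        # (1) worker must have a high acceptance rate (>95%) on other tasks
        if r["approval_rate"] <= 0.95:
            return False
        # (2) the response must be complete: 10 predictions
        if len(r["predictions"]) != 10:
            return False
        # (3) discard implausibly fast answers: 3 s for NIST, 6 s for NISTP
        min_time = 3.0 if dataset == "NIST" else 6.0
        if r["response_time_s"] < min_time:
            return False
        # (4) discard obviously wrong answers: 10 identical labels,
        #     or a "12345..." keyboard walk (simplified check)
        preds = list(r["predictions"])
        if len(set(preds)) == 1 or "".join(preds).startswith("12345"):
            return False
        return True

    example = {"approval_rate": 0.97, "predictions": "3aB9kQ07mz",
               "response_time_s": 21.4}
    print(keep_response(example, "NIST"))  # True: this response is kept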
@@ -495,12 +496,17 @@
 amount of corruption noise (we used the masking noise process, whereby a
 fixed proportion of the input values, randomly selected, are zeroed), and a
 separate learning rate for the unsupervised pre-training stage (selected
 from the same set as above). The fraction of inputs corrupted was selected
 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
-of hidden layers, but it was fixed to 3 based on previous work with
-SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. The size of the hidden
+of hidden layers, but it was fixed to 3 for most experiments,
+based on previous work with
+SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
+We also compared against 1 and 2 hidden layers, in order
+to disentangle the effect of depth from the effect of unsupervised
+pre-training.
+The size of the hidden
 layers was kept constant across hidden layers, and the best results
 were obtained with the largest values that we could experiment
 with given our patience, with 1000 hidden units.
 
 %\vspace*{-1mm}
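The masking-noise corruption described in this passage (a fixed, randomly selected fraction of input values set to zero) can be stated concretely. Below is a minimal NumPy sketch under stated assumptions: it illustrates the corruption process and the quoted hyper-parameter choices (corruption fraction in {10%, 20%, 50%}, 3 hidden layers of 1000 units), not the actual experiment code; the pre-training learning-rate set and the batch shape are omitted or assumed.

    import numpy as np

    def masking_noise(x, corruption_fraction, rng):
        """Zero a fixed, randomly selected fraction of each input vector."""
        x = x.copy()
        n_corrupt = int(round(corruption_fraction * x.shape[1]))
        for row in x:                     # each row is one flattened image
            idx = rng.choice(x.shape[1], size=n_corrupt, replace=False)
            row[idx] = 0.0                # masking noise: zero these entries
        return x

    # Hyper-parameter choices quoted in the text
    corruption_fractions = [0.10, 0.20, 0.50]   # candidate corruption levels
    n_hidden_layers = 3                          # 1 and 2 layers also compared
    hidden_units_per_layer = 1000                # constant across layers

    rng = np.random.default_rng(0)
    x = rng.uniform(size=(5, 32 * 32))           # a small assumed batch of inputs
    x_tilde = masking_noise(x, corruption_fractions[1], rng)
    print((x_tilde == 0.0).mean())               # close to 0.20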
@@ -565,10 +571,20 @@
 and the 10-class (digits) task.
 17\% error (SDA1) or 18\% error (humans) may seem large, but a large
 majority of the errors from humans and from SDA1 are out-of-context
 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
 ``c'' and a ``C'' are often indistinguishable).
+Regarding shallower networks pre-trained with unsupervised denoising
+auto-encoders, we find that the NIST test error is 21\% with one hidden
+layer and 20\% with two hidden layers (vs 17\% in the same conditions
+with 3 hidden layers). Compare this with the 23\% error achieved
+by the MLP, i.e. a single hidden layer and no unsupervised pre-training.
+As found in previous work~\cite{Erhan+al-2010,Larochelle-jmlr-2009},
+these results show that both depth and
+unsupervised pre-training need to be combined in order to achieve
+the best results.
+
 
 In addition, as shown in the left of
 Figure~\ref{fig:improvements-charts}, the relative improvement in error
 rate brought by out-of-distribution examples is greater for the deep
 SDA, and these