ift6266: comparison of writeup/aistats2011_submission.tex at 603:eb6244c6d861
aistats submission

author | Yoshua Bengio <bengioy@iro.umontreal.ca>
date | Sun, 31 Oct 2010 22:40:33 -0400
parents | 203c6071e104
children | 51213beaed8b
[...]

unsupervised initialization, the stack of layers can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.
One of these layer initialization techniques,
applied here, is the Denoising
Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
Figure~\ref{fig:da}), which performed similarly or
better~\citep{VincentPLarochelleH2008-very-small} than previously
proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
in terms of unsupervised extraction
of a hierarchy of features useful for classification. Each layer is trained
to denoise its input, creating a layer of features that can be used as
input for the next layer. Note that training a Denoising Auto-Encoder
can actually be seen as training a particular RBM by an inductive
principle different from maximum likelihood~\citep{ift6266-tr-anonymous}, % Vincent-SM-2010
namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
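
For concreteness, here is a minimal sketch of the per-layer denoising criterion; the notation is illustrative and not taken from this excerpt. An input $x$ is stochastically corrupted into $\tilde{x} \sim q(\tilde{x} \mid x)$ (e.g.\ by zeroing a random subset of its components), mapped to a hidden representation $h = s(W \tilde{x} + b)$, and decoded into a reconstruction $\hat{x} = s(W' h + b')$. The layer parameters are trained to minimize a reconstruction loss between $\hat{x}$ and the \emph{uncorrupted} $x$, such as
\[
  L(x, \hat{x}) = - \sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log (1 - \hat{x}_i) \right],
\]
after which $h$, computed from the clean input, becomes the training input of the next layer; the whole stack is then fine-tuned by supervised stochastic gradient descent.
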
Previous comparative experimental results with stacking of RBMs and DAs
to build deep supervised predictors had shown that they could outperform
shallow architectures in a variety of settings (see~\citet{Bengio-2009}
for a review), especially
when the data involves complex interactions between many factors of
variation~\citep{LarochelleH2007}. Other experiments have suggested

[...]

advantage of deep learning} for these settings has not been evaluated.


%
The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
{\bf benefit more from out-of-distribution examples than shallow learners} (with a single
level), both in the context of the multi-task setting and from
perturbed examples. Because we are able to improve on state-of-the-art
performance and reach human-level performance
on a large-scale task, we consider that this paper also contributes to
advancing the application of machine learning to handwritten character recognition.
More precisely, we ask and answer the following questions:

[...]

other classes than those of interest, by comparing learners trained with
62 classes with learners trained with only a subset (on which they
are then tested).
The conclusion discusses
the more general question of why deep learners may benefit so much from
out-of-distribution examples. Since out-of-distribution data
(perturbed or from other related classes) is very common, this conclusion
is of practical importance.

%\vspace*{-3mm}
%\newpage

[...]

\begin{figure*}[ht]
\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\vspace*{-3mm}
\caption{Relative improvement in error rate due to out-of-distribution examples.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
upper case, or lower case). The deep learner (SDA) benefits more from
out-of-distribution examples, compared to the shallow MLP.}
\label{fig:improvements-charts}
\vspace*{-2mm}
\end{figure*}

\vspace*{-2mm}

[...]

confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
``c'' and a ``C'' are often indistinguishable).

In addition, as shown in the left panel of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by out-of-distribution examples is greater for the deep
SDA, and these differences with respect to the shallow MLP are
statistically and qualitatively significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative percent change is measured by taking

[...]

\vspace*{-2mm}
\section{Conclusions and Discussion}
\vspace*{-2mm}

We have found that out-of-distribution examples (multi-task learning
and perturbed examples) are more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the answers are positive for all the questions asked in the introduction.
%\begin{itemize}

[...]

published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits,
and beating previously published results on the same data.

$\bullet$ %\item
{\bf To what extent do out-of-distribution examples help deep learners,
and do they help them more than shallow supervised ones}?
We found that distorted training examples not only made the resulting
classifier better on similarly perturbed images but also on
the {\em original clean examples}, and, more importantly and as a more novel finding,
that deep architectures benefit more from such {\em out-of-distribution}
examples. Shallow MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP)
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were significantly boosted by these out-of-distribution examples.
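
To make these relative figures concrete (the error rates in this example are hypothetical and merely illustrate the usual convention, $100 \times (\epsilon_{\mathrm{before}} - \epsilon_{\mathrm{after}}) / \epsilon_{\mathrm{before}}$): a 65\% relative improvement would take a test error of 30\% down to $30\% \times (1 - 0.65) = 10.5\%$, while a 10\% relative loss would take an error of 5\% up to $5\% \times 1.10 = 5.5\%$.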

[...]

architectures, our experiments show that such a positive effect is accomplished
even in a scenario with a \emph{large number of labeled examples},
i.e., here, the relative gain of self-taught learning is probably preserved
in the asymptotic regime.

{\bf Why would deep learners benefit more from the self-taught learning
framework and out-of-distribution examples}?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. A theoretical analysis of generalization improvements
due to sharing of intermediate features across tasks already points
towards that explanation~\citep{baxter95a}.
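
As an illustrative formalization of this sharing (the notation is ours and does not appear in this excerpt), a deep predictor for task $t$ can be written with $K$ shared layers and a small task-specific output layer,
\[
  h_0 = x, \qquad h_k = s(W_k h_{k-1} + b_k), \quad k = 1, \ldots, K, \qquad
  \hat{y}^{(t)} = \mathrm{softmax}\!\left(V^{(t)} h_K + c^{(t)}\right),
\]
so that examples from every task, and from perturbed variants of the input distribution, contribute training signal to the shared parameters $\{W_k, b_k\}$, while only the comparatively few task-specific parameters $\{V^{(t)}, c^{(t)}\}$ must be estimated from a single task's examples.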

[...]

We hypothesize that this is related to the hypotheses studied
in~\citet{Erhan+al-2010}, where it was found that online learning on a
huge dataset did not make the
advantage of the deep learning bias vanish, and a similar phenomenon
may be happening here. We hypothesize that unsupervised pre-training
of a deep hierarchy with out-of-distribution examples initializes the
model in the basin of attraction of supervised gradient descent
that corresponds to better generalization. Furthermore, such good
basins of attraction are not discovered by pure supervised learning
(with or without out-of-distribution examples) from random initialization, and more labeled examples
do not allow the shallow or purely supervised models to discover
the kind of better basins associated
with deep learning and out-of-distribution examples.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at the anonymous site {\tt http://deep.host22.com}.

\iffalse