% writeup/aistats2011_submission.tex, changeset 603:eb6244c6d861
% "aistats submission" -- Yoshua Bengio <bengioy@iro.umontreal.ca>, Sun, 31 Oct 2010 22:40:33 -0400
% parent 602:203c6071e104, child 51213beaed8b
unsupervised initialization, the stack of layers can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.
One of these layer initialization techniques,
applied here, is the Denoising
Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
Figure~\ref{fig:da}), which performed similarly or
better~\citep{VincentPLarochelleH2008-very-small} than the previously
proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
in terms of unsupervised extraction
of a hierarchy of features useful for classification. Each layer is trained
to denoise its input, creating a layer of features that can be used as
input for the next layer. Note that training a Denoising Auto-encoder
can actually be seen as training a particular RBM by an inductive
principle different from maximum likelihood~\citep{ift6266-tr-anonymous}, % Vincent-SM-2010
namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
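For concreteness, the per-layer training criterion can be summarized as in
\citet{VincentPLarochelleH2008-very-small}: corrupt the input, encode the corrupted
version, and reconstruct the {\em clean} input from that code,
\[
\tilde{x} \sim q(\tilde{x} \mid x), \qquad
h = s(W \tilde{x} + b), \qquad
\hat{x} = s(W' h + b'), \qquad
\min_{W,b,W',b'} \ E_{x,\tilde{x}}\left[ L(x, \hat{x}) \right],
\]
where $q$ is a stochastic corruption process (e.g. zeroing a random subset of input
components), $s$ a non-linearity such as the sigmoid, and $L$ a reconstruction loss
such as cross-entropy; forcing $\hat{x}$ to be computed from the corrupted $\tilde{x}$
is what pushes $h$ towards robust features. The particular choices of $q$, $s$ and $L$
written here are generic placeholders rather than a specification of our exact setup.
The sketch below illustrates the procedure in NumPy (a single DA layer with tied
weights $W' = W^{\top}$ and masking noise, followed by greedy stacking); it is a
minimal illustration only, not the code used for the experiments reported here, and
all names and hyper-parameter values in it are placeholders.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DA:
    """One denoising auto-encoder layer (tied weights, masking noise)."""

    def __init__(self, n_in, n_hid, corruption=0.25, lr=0.01):
        self.W = rng.normal(0.0, 0.01, (n_in, n_hid))  # shared encoder/decoder weights
        self.b = np.zeros(n_hid)                       # hidden-unit biases
        self.c = np.zeros(n_in)                        # reconstruction biases
        self.corruption, self.lr = corruption, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def train_step(self, x):
        # Corrupt the input (masking noise), then reconstruct the *clean* input.
        x_tilde = x * (rng.random(x.shape) > self.corruption)
        h = self.encode(x_tilde)
        x_hat = sigmoid(h @ self.W.T + self.c)
        # Gradients of the cross-entropy reconstruction loss (tied weights).
        d_out = x_hat - x
        d_hid = (d_out @ self.W) * h * (1.0 - h)
        self.W -= self.lr * (np.outer(d_out, h) + np.outer(x_tilde, d_hid))
        self.b -= self.lr * d_hid
        self.c -= self.lr * d_out

def pretrain(X, sizes, epochs=10):
    """Greedy layer-wise pre-training: each DA denoises the previous layer's output."""
    layers, inputs = [], X
    for n_in, n_hid in zip(sizes[:-1], sizes[1:]):
        da = DA(n_in, n_hid)
        for _ in range(epochs):
            for x in inputs:
                da.train_step(x)
        layers.append(da)
        inputs = np.array([da.encode(x) for x in inputs])
    return layers  # each (W, b) then initializes one layer of the supervised network
\end{verbatim}
As described above, the weights and biases learned this way initialize a deep
feedforward network that is subsequently fine-tuned by stochastic gradient descent
on the supervised task.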

Previous comparative experimental results with stacking of RBMs and DAs
to build deep supervised predictors had shown that they could outperform
shallow architectures in a variety of settings (see~\citet{Bengio-2009}
for a review), especially
when the data involves complex interactions between many factors of
variation~\citep{LarochelleH2007}. Other experiments have suggested
% ...

advantage of deep learning} for these settings has not been evaluated.

%
The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
{\bf benefit more from out-of-distribution examples than shallow learners} (with a single
level), both in the context of the multi-task setting and from
perturbed examples. Because we are able to improve on state-of-the-art
performance and reach human-level performance
on a large-scale task, we consider that this paper is also a contribution
to advancing the application of machine learning to handwritten character recognition.
More precisely, we ask and answer the following questions:
% ...

other classes than those of interest, by comparing learners trained on
all 62 classes with learners trained on only a subset of them (on which they
are then tested).
The conclusion discusses
the more general question of why deep learners may benefit so much from
out-of-distribution examples. Since out-of-distribution data
(perturbed or from other related classes) is very common, this conclusion
is of practical importance.
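To make the class-subset comparison above concrete, the following sketch shows one
way such an evaluation could be implemented. It is illustrative only: in particular,
restricting the 62-way predictions to the classes of interest at test time is an
assumption made for this sketch, not a statement of the exact protocol, and all
names in it are placeholders.
\begin{verbatim}
import numpy as np

def subset_error(probs, y_true, subset):
    """Test error of a 62-class model evaluated only on a subset of classes.

    probs:  (n, 62) predicted class probabilities on the subset's test examples
    y_true: (n,) true labels, all belonging to `subset`
    subset: class indices of interest (e.g. the 10 digit classes)
    """
    subset = np.asarray(subset)
    restricted = probs[:, subset]                 # ignore out-of-subset classes
    pred = subset[np.argmax(restricted, axis=1)]  # best class within the subset
    return float(np.mean(pred != y_true))
\end{verbatim}
The same measure applied to a model trained only on the subset gives the baseline
against which the multi-task gain (or loss) is assessed.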
%\vspace*{-3mm}
%\newpage
% ...

\begin{figure*}[ht]
\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\vspace*{-3mm}
\caption{Relative improvement in error rate due to out-of-distribution examples.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
upper case, or lower case). The deep learner (SDA) benefits more from
out-of-distribution examples than the shallow MLP does.}
\label{fig:improvements-charts}
\vspace*{-2mm}
\end{figure*}

\vspace*{-2mm}
% ...

confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
``c'' and a ``C'' are often indistinguishable).

In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by out-of-distribution examples is greater for the deep
SDA, and these
differences with the shallow MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative percent change is measured by taking
% ...

\vspace*{-2mm}
\section{Conclusions and Discussion}
\vspace*{-2mm}

We have found that out-of-distribution examples (both from multi-task learning
and from perturbed examples) are more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the answers are positive for all the questions asked in the introduction.
%\begin{itemize}
% ...

published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits,
and beating previously published results on the same data.

$\bullet$ %\item
{\bf To what extent do out-of-distribution examples help deep learners,
and do they help them more than shallow supervised ones}?
We found that distorted training examples not only made the resulting
classifier better on similarly perturbed images but also on
the {\em original clean examples}, and, more importantly (this is the more novel
finding), that deep architectures benefit more from such {\em out-of-distribution}
examples. Shallow MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP)
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were significantly boosted by these out-of-distribution examples.
% ...

architectures, our experiments show that such a positive effect is achieved
even in a scenario with a \emph{large number of labeled examples},
i.e., here, the relative gain of self-taught learning is probably preserved
in the asymptotic regime.

{\bf Why would deep learners benefit more from the self-taught learning
framework and out-of-distribution examples}?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. A theoretical analysis of generalization improvements
due to sharing of intermediate features across tasks already points
towards that explanation~\citep{baxter95a}.
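To convey the flavor of such an analysis (stated informally here for intuition, not
as the precise bound of the cited work): if $T$ tasks share a representation chosen
from a class $\mathcal{F}$ and each task only adds a task-specific predictor from a
class $\mathcal{G}$ on top of it, then for finite classes a standard
uniform-convergence argument gives a per-task sample requirement of roughly
\[
m = O\left( \frac{1}{\epsilon^2} \left( \log |\mathcal{G}|
    + \frac{1}{T} \log |\mathcal{F}| \right) \right),
\]
so the cost of learning the shared features is amortized across the $T$ tasks, while
each task pays only for its own output mapping.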
% ...

We hypothesize that this is related to the hypotheses studied
in~\citet{Erhan+al-2010}, where
it was found that online learning on a huge dataset did not make the
advantage of the deep learning bias vanish; a similar phenomenon
may be happening here. We hypothesize that unsupervised pre-training
of a deep hierarchy with out-of-distribution examples initializes the
model in the basin of attraction of supervised gradient descent
that corresponds to better generalization. Furthermore, such good
basins of attraction are not discovered by pure supervised learning
(with or without out-of-distribution examples) from random initialization, and more labeled examples
do not allow the shallow or purely supervised models to discover
the kind of better basins associated
with deep learning and out-of-distribution examples.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be run on-line at the anonymous site {\tt http://deep.host22.com}.

\iffalse