comparison writeup/nips2010_submission.tex @ 507:b8e33d3d7f65

merge
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 13:57:16 -0400
parents 8bf07979b8ba a41a8925be70
children 8c2ab4f246b1 6f042a71be23
18 \vspace*{-2mm}
19 \begin{abstract}
20 Recent theoretical and empirical work in statistical machine learning has
21 demonstrated the importance of learning algorithms for deep
22 architectures, i.e., function classes obtained by composing multiple
- 23 non-linear transformations. The self-taught learning (exploiting unlabeled
+ 23 non-linear transformations. Self-taught learning (exploiting unlabeled
24 examples or examples from other distributions) has already been applied
25 to deep learners, but mostly to show the advantage of unlabeled
26 examples. Here we explore the advantage brought by {\em out-of-distribution
27 examples} and show that {\em deep learners benefit more from them than a
28 corresponding shallow learner}, in the area
72 applied here, is the Denoising
73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which
74 performed similarly to or better than previously proposed Restricted Boltzmann
75 Machines in terms of unsupervised extraction of a hierarchy of features
76 useful for classification. The principle is that each layer starting from
- 77 the bottom is trained to encode their input (the output of the previous
- 78 layer) and try to reconstruct it from a corrupted version of it. After this
+ 77 the bottom is trained to encode its input (the output of the previous
+ 78 layer) and to reconstruct it from a corrupted version of it. After this
79 unsupervised initialization, the stack of denoising auto-encoders can be
80 converted into a deep supervised feedforward neural network and fine-tuned by
81 stochastic gradient descent.
82
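As a concrete sketch of the per-layer training criterion just described (the notation below is ours, following the formulation of the cited DAE paper~\citep{VincentPLarochelleH2008-very-small}, and not taken from this file; the corruption process and loss actually used in the experiments may differ):

% Illustrative notation only (assumed, not from this revision): $x$ is the layer
% input, $s(\cdot)$ an elementwise sigmoid, and the loss is the reconstruction
% cross-entropy against the uncorrupted input.
\begin{align*}
\tilde{x} &\sim q_{\mathcal{D}}(\tilde{x} \mid x) && \text{stochastic corruption (e.g., masking noise)}\\
h &= s(W\tilde{x} + b) && \text{encoder: hidden representation}\\
\hat{x} &= s(W'h + b') && \text{decoder: reconstruction}\\
L(x,\hat{x}) &= -\sum_k \big[ x_k \log \hat{x}_k + (1-x_k)\log(1-\hat{x}_k) \big] && \text{reconstruction cross-entropy}
\end{align*}

Each layer minimizes $L$ on the outputs of the previously trained layer; after this greedy pre-training, the decoders are dropped, a supervised output layer is placed on top of the stacked encoders, and the whole network is fine-tuned by stochastic gradient descent, as described above.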
83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
89 and multi-task learning, not much has been done yet to explore the impact
90 of {\em out-of-distribution} examples and of the multi-task setting
91 (but see~\citep{CollobertR2008}). In particular, the {\em relative
92 advantage} of deep learning for this setting has not been evaluated.
93
+ 94 % TODO: Explain why we care about this question.
+ 95
96 In this paper we ask the following questions:
97
98 %\begin{enumerate}
99 $\bullet$ %\item
100 Do the good results previously obtained with deep architectures on the
115 Similarly, does the feature learning step in deep learning algorithms benefit more
116 from training with similar but different classes (i.e., a multi-task learning scenario) than
117 a corresponding shallow and purely supervised architecture?
118 %\end{enumerate}
119
- 118 The experimental results presented here provide positive evidence towards all of these questions.
+ 120 Our experimental results provide evidence to support positive answers to all of these questions.
121
122 \vspace*{-1mm}
123 \section{Perturbation and Transformation of Character Images}
124 \vspace*{-1mm}
125
583 \fi
584
585
586 \begin{figure}[h]
587 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
- 586 \caption{Relative improvement in error rate due to self-taught learning.
- 587 Left: Improvement (or loss, when negative)
- 588 induced by out-of-distribution examples (perturbed data).
- 589 Right: Improvement (or loss, when negative) induced by multi-task
- 590 learning (training on all classes and testing only on either digits,
- 591 upper case, or lower-case). The deep learner (SDA) benefits more from
- 592 both self-taught learning scenarios, compared to the shallow MLP.}
+ 588 \caption{Charts corresponding to tables 2 (left) and 3 (right), from Appendix I.}
589 \label{fig:improvements-charts}
590 \end{figure}
591
592 \vspace*{-1mm}
593 \section{Conclusions}