comparison writeup/nips2010_submission.tex @ 569:9d01280ff1c1

comments from Joseph Turian
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Thu, 03 Jun 2010 19:05:08 -0400
parents ae6ba0309bf9
children df749e70f637
568:ae6ba0309bf9 569:9d01280ff1c1
18 %\makeanontitle 18 %\makeanontitle
19 \maketitle 19 \maketitle
20 20
21 \vspace*{-2mm} 21 \vspace*{-2mm}
22 \begin{abstract} 22 \begin{abstract}
23 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples} and show that {\em deep learners benefit more from them than a corresponding shallow learner}, in the area of handwritten character recognition. In fact, we show that they reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. 23 Recent theoretical and empirical work in statistical machine learning has
24 demonstrated the importance of learning algorithms for deep
25 architectures, i.e., function classes obtained by composing multiple
26 non-linear transformations. Self-taught learning (exploiting unlabeled
27 examples or examples from other distributions) has already been applied
28 to deep learners, but mostly to show the advantage of unlabeled
29 examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
30 For this purpose we
31 developed a powerful generator of stochastic variations and noise
32 processes for character images, including not only affine transformations
33 but also slant, local elastic deformations, changes in thickness,
34 background images, grey level changes, contrast, occlusion, and various
35 types of noise. The out-of-distribution examples are obtained from these
36 highly distorted images or by including examples of object classes
37 different from those in the target test set.
38 We show that {\em deep learners benefit
39 more from them than a corresponding shallow learner}, at least in the area of
40 handwritten character recognition. In fact, we show that they reach
41 human-level performance on both handwritten digit classification and
42 62-class handwritten character recognition.
24 \end{abstract} 43 \end{abstract}
25 \vspace*{-3mm} 44 \vspace*{-3mm}
26 45
27 \section{Introduction} 46 \section{Introduction}
28 \vspace*{-1mm} 47 \vspace*{-1mm}
29 48
30 Deep Learning has emerged as a promising new area of research in 49 {\bf Deep Learning} has emerged as a promising new area of research in
31 statistical machine learning (see~\citet{Bengio-2009} for a review). 50 statistical machine learning (see~\citet{Bengio-2009} for a review).
32 Learning algorithms for deep architectures are centered on the learning 51 Learning algorithms for deep architectures are centered on the learning
33 of useful representations of data, which are better suited to the task at hand. 52 of useful representations of data, which are better suited to the task at hand.
34 This is in great part inspired by observations of the mammalian visual cortex, 53 This is in part inspired by observations of the mammalian visual cortex,
35 which consists of a chain of processing elements, each of which is associated with a 54 which consists of a chain of processing elements, each of which is associated with a
36 different representation of the raw visual input. In fact, 55 different representation of the raw visual input. In fact,
37 it was found recently that the features learnt in deep architectures resemble 56 it was found recently that the features learnt in deep architectures resemble
38 those observed in the first two of these stages (in areas V1 and V2 57 those observed in the first two of these stages (in areas V1 and V2
39 of visual cortex)~\citep{HonglakL2008}, and that they become more and 58 of visual cortex)~\citep{HonglakL2008}, and that they become more and
45 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the 64 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
46 feature representation can lead to higher-level (more abstract, more 65 feature representation can lead to higher-level (more abstract, more
47 general) features that are more robust to unanticipated sources of 66 general) features that are more robust to unanticipated sources of
48 variance extant in real data. 67 variance extant in real data.
49 68
69 {\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles
70 of semi-supervised and multi-task learning: the learner can exploit examples
71 that are unlabeled and possibly come from a distribution different from the target
72 distribution, e.g., from other classes than those of interest.
73 It has already been shown that deep learners can clearly take advantage of
74 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
75 but more needs to be done to explore the impact
76 of {\em out-of-distribution} examples and of the multi-task setting
77 (one exception is~\citet{CollobertR2008}, which uses a different kind
78 of learning algorithm). In particular, the {\em relative
79 advantage} of deep learning for these settings has not been evaluated.
80 The hypothesis discussed in the conclusion is that a deep hierarchy of features
81 may be better able to provide sharing of statistical strength
82 between different regions in input space or different tasks.
83
84 \iffalse
50 Whereas a deep architecture can in principle be more powerful than a 85 Whereas a deep architecture can in principle be more powerful than a
51 shallow one in terms of representation, depth appears to render the 86 shallow one in terms of representation, depth appears to render the
52 training problem more difficult in terms of optimization and local minima. 87 training problem more difficult in terms of optimization and local minima.
53 It is also only recently that successful algorithms were proposed to 88 It is also only recently that successful algorithms were proposed to
54 overcome some of these difficulties. All are based on unsupervised 89 overcome some of these difficulties. All are based on unsupervised
57 applied here, is the Denoising 92 applied here, is the Denoising
58 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), 93 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}),
59 which 94 which
60 performed similarly or better than previously proposed Restricted Boltzmann 95 performed similarly or better than previously proposed Restricted Boltzmann
61 Machines in terms of unsupervised extraction of a hierarchy of features 96 Machines in terms of unsupervised extraction of a hierarchy of features
62 useful for classification. The principle is that each layer starting from 97 useful for classification. Each layer is trained to denoise its
63 the bottom is trained to encode its input (the output of the previous 98 input, creating a layer of features that can be used as input for the next layer.
64 layer) and to reconstruct it from a corrupted version. After this 99 \fi
65 unsupervised initialization, the stack of DAs can be 100 %The principle is that each layer starting from
66 converted into a deep supervised feedforward neural network and fine-tuned by 101 %the bottom is trained to encode its input (the output of the previous
67 stochastic gradient descent. 102 %layer) and to reconstruct it from a corrupted version. After this
68 103 %unsupervised initialization, the stack of DAs can be
69 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles 104 %converted into a deep supervised feedforward neural network and fine-tuned by
70 of semi-supervised and multi-task learning: the learner can exploit examples 105 %stochastic gradient descent.
71 that are unlabeled and possibly come from a distribution different from the target 106
72 distribution, e.g., from other classes than those of interest.
73 It has already been shown that deep learners can clearly take advantage of
74 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
75 but more needs to be done to explore the impact
76 of {\em out-of-distribution} examples and of the multi-task setting
77 (one exception is~\citep{CollobertR2008}, which uses very different kinds
78 of learning algorithms). In particular the {\em relative
79 advantage} of deep learning for these settings has not been evaluated.
80 The hypothesis discussed in the conclusion is that a deep hierarchy of features
81 may be better able to provide sharing of statistical strength
82 between different regions in input space or different tasks.
83 % 107 %
84 In this paper we ask the following questions: 108 In this paper we ask the following questions:
85 109
86 %\begin{enumerate} 110 %\begin{enumerate}
87 $\bullet$ %\item 111 $\bullet$ %\item
91 115
92 $\bullet$ %\item 116 $\bullet$ %\item
93 To what extent does the perturbation of input images (e.g. adding 117 To what extent does the perturbation of input images (e.g. adding
94 noise, affine transformations, background images) make the resulting 118 noise, affine transformations, background images) make the resulting
95 classifiers better not only on similarly perturbed images but also on 119 classifiers better not only on similarly perturbed images but also on
96 the {\em original clean examples}? 120 the {\em original clean examples}? We study this question in the
121 context of the 62-class and 10-class tasks of the NIST special database 19.
97 122
98 $\bullet$ %\item 123 $\bullet$ %\item
99 Do deep architectures {\em benefit more from such out-of-distribution} 124 Do deep architectures {\em benefit more from such out-of-distribution}
100 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? 125 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
126 We use highly perturbed versions of the training images as out-of-distribution examples.
101 127
102 $\bullet$ %\item 128 $\bullet$ %\item
103 Similarly, does the feature learning step in deep learning algorithms benefit more 129 Similarly, does the feature learning step in deep learning algorithms benefit more
104 from training with moderately different classes (i.e. a multi-task learning scenario) than 130 from training with moderately different classes (i.e. a multi-task learning scenario) than
105 a corresponding shallow and purely supervised architecture? 131 a corresponding shallow and purely supervised architecture?
132 We train on 62 classes and test on 10 (digits) or 26 (upper-case or lower-case letters)
133 to answer this question.
106 %\end{enumerate} 134 %\end{enumerate}
107 135
108 Our experimental results provide positive evidence towards all of these questions. 136 Our experimental results provide positive evidence towards all of these questions.
109 To achieve these results, we introduce in the next section a sophisticated system 137 To achieve these results, we introduce in the next section a sophisticated system
110 for stochastically transforming character images and then explain the methodology. 138 for stochastically transforming character images and then explain the methodology,
139 which is based on training with or without these transformed images and testing on
140 clean ones. We measure the relative advantage of out-of-distribution examples
141 for a deep learner vs a supervised shallow one.
142 Code for generating these transformations as well as for the deep learning
143 algorithms is made available.
144 We also estimate the relative advantage for deep learners of training with
145 classes other than those of interest, by comparing learners trained with
146 62 classes with learners trained with only a subset (on which they
147 are then tested).
111 The conclusion discusses 148 The conclusion discusses
112 the more general question of why deep learners may benefit so much from 149 the more general question of why deep learners may benefit so much from
113 the self-taught learning framework. 150 the self-taught learning framework.
114 151
115 \vspace*{-1mm} 152 \vspace*{-3mm}
116 \section{Perturbation and Transformation of Character Images} 153 \section{Perturbation and Transformation of Character Images}
117 \label{s:perturbations} 154 \label{s:perturbations}
118 \vspace*{-1mm} 155 \vspace*{-2mm}
119 156
120 \begin{wrapfigure}[8]{l}{0.15\textwidth} 157 \begin{wrapfigure}[8]{l}{0.15\textwidth}
121 %\begin{minipage}[b]{0.14\linewidth} 158 %\begin{minipage}[b]{0.14\linewidth}
122 \vspace*{-5mm} 159 \vspace*{-5mm}
123 \begin{center} 160 \begin{center}
191 %\centering 228 %\centering
192 To produce {\bf slant}, each row of the image is shifted 229 To produce {\bf slant}, each row of the image is shifted
193 horizontally in proportion to its height: $shift = round(slant \times height)$, 230 horizontally in proportion to its height: $shift = round(slant \times height)$,
194 where $slant \sim U[-complexity,complexity]$, so the shift 231 where $slant \sim U[-complexity,complexity]$, so the shift
195 can be either to the left or to the right. 232 can be either to the left or to the right.
196 \vspace{1.1cm} 233 \vspace{1cm}
197 \end{minipage} 234 \end{minipage}
198 %\vspace*{-4mm} 235 %\vspace*{-4mm}
199 236
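For concreteness, here is a minimal Python sketch of this slant module (our
illustration, not the generator's actual code; the function name and the
clipping of extreme shifts are our own choices):
\begin{verbatim}
import numpy as np

def apply_slant(image, complexity, rng=np.random):
    """image: 2D grey-level array (height x width)."""
    slant = rng.uniform(-complexity, complexity)  # sign gives the direction
    height, width = image.shape
    out = np.zeros_like(image)
    for row in range(height):
        # horizontal shift proportional to the row's height in the image
        shift = int(round(slant * row))
        shift = max(-(width - 1), min(width - 1, shift))
        if shift >= 0:
            out[row, shift:] = image[row, :width - shift]
        else:
            out[row, :width + shift] = image[row, -shift:]
    return out
\end{verbatim}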
200 %\begin{minipage}[b]{0.14\linewidth} 237 %\begin{minipage}[b]{0.14\linewidth}
201 %\centering 238 %\centering
202 \begin{wrapfigure}[8]{l}{0.15\textwidth} 239 \begin{wrapfigure}[8]{l}{0.15\textwidth}
203 \vspace*{-6mm} 240 \vspace*{-6mm}
204 \begin{center} 241 \begin{center}
205 \includegraphics[scale=.4]{images/Affine_only.png}\\ 242 \includegraphics[scale=.4]{images/Affine_only.png}\\
206 {\bf Affine Transformation} 243 {\small {\bf Affine \mbox{Transformation}}}
207 \end{center} 244 \end{center}
208 \end{wrapfigure} 245 \end{wrapfigure}
209 %\end{minipage}% 246 %\end{minipage}%
210 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} 247 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
211 A $2 \times 3$ {\bf affine transform} matrix (with 248 A $2 \times 3$ {\bf affine transform} matrix (with
228 %\hspace*{-8mm}\begin{minipage}[b]{0.25\linewidth} 265 %\hspace*{-8mm}\begin{minipage}[b]{0.25\linewidth}
229 %\centering 266 %\centering
230 \begin{center} 267 \begin{center}
231 \vspace*{-4mm} 268 \vspace*{-4mm}
232 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\ 269 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\
233 {\bf Local Elastic} 270 {\bf Local Elastic Deformation}
234 \end{center} 271 \end{center}
235 \end{wrapfigure} 272 \end{wrapfigure}
236 %\end{minipage}% 273 %\end{minipage}%
237 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth} 274 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth}
238 %\vspace*{-20mm} 275 %\vspace*{-20mm}
239 The {\bf local elastic} deformation 276 The {\bf local elastic deformation}
240 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, 277 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
241 which provides more details. 278 which provides more details.
242 The intensity of the displacement fields is given by 279 The intensity of the displacement fields is given by
243 $\alpha = \sqrt[3]{complexity} \times 10.0$; the fields are 280 $\alpha = \sqrt[3]{complexity} \times 10.0$; the fields are
244 convolved with a 2D Gaussian kernel (resulting in a blur) of 281 convolved with a 2D Gaussian kernel (resulting in a blur) of
245 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$. 282 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$.
246 %\vspace{.9cm} 283 %\vspace{.9cm}
247 \end{minipage} 284 \end{minipage}
248 285
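A possible implementation of this deformation, in the spirit
of~\citet{SimardSP03-short} and using the $\alpha$ and $\sigma$ formulas
above (the use of {\tt scipy} for the smoothing and the bilinear resampling
is our own choice, not necessarily that of the original pipeline):
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, complexity, rng=np.random):
    alpha = complexity ** (1.0 / 3.0) * 10.0        # field intensity
    sigma = 10.0 - 7.0 * complexity ** (1.0 / 3.0)  # blur of the fields
    h, w = image.shape
    # random displacement fields, smoothed by a 2D Gaussian kernel, then scaled
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.array([rows + dy, cols + dx])
    return map_coordinates(image, coords, order=1, mode='reflect')
\end{verbatim}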
249 \vspace*{5mm} 286 \vspace*{7mm}
250 287
251 %\begin{minipage}[b]{0.14\linewidth} 288 %\begin{minipage}[b]{0.14\linewidth}
252 %\centering 289 %\centering
253 \begin{wrapfigure}[7]{l}{0.15\textwidth} 290 \begin{wrapfigure}[7]{l}{0.15\textwidth}
254 \vspace*{-5mm} 291 \vspace*{-5mm}
275 around the (non-integer) source position thus found. 312 around the (non-integer) source position thus found.
276 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. 313 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
277 %\vspace{1.5cm} 314 %\vspace{1.5cm}
278 %\end{minipage} 315 %\end{minipage}
279 316
280 \vspace{2mm} 317 \vspace{1mm}
281 318
282 {\large\bf 2.2 Injecting Noise} 319 {\large\bf 2.2 Injecting Noise}
283 %\subsection{Injecting Noise} 320 %\subsection{Injecting Noise}
284 \vspace{2mm} 321 \vspace{2mm}
285 322
521 Mechanical Turk has been used extensively in natural language processing and vision. 558 Mechanical Turk has been used extensively in natural language processing and vision.
522 %processing \citep{SnowEtAl2008} and vision 559 %processing \citep{SnowEtAl2008} and vision
523 %\citep{SorokinAndForsyth2008,whitehill09}. 560 %\citep{SorokinAndForsyth2008,whitehill09}.
524 AMT users were presented 561 AMT users were presented
525 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII 562 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
526 characters. They were forced to make a hard choice among the 563 characters. They were forced to choose a single character class (among either the
527 62 or 10 character classes (all classes or digits only). 564 62 or the 10 character classes, depending on the task) for each image.
528 80 subjects classified 2500 images per (dataset,task) pair, 565 80 subjects classified 2500 images per (dataset,task) pair,
529 with the guarantee that 3 different subjects classified each image, allowing 566 with the guarantee that 3 different subjects classified each image, allowing
530 us to estimate inter-human variability (e.g.\ a standard error of 0.1\% 567 us to estimate inter-human variability (e.g.\ a standard error of 0.1\%
531 on the average 18.2\% error made by humans on the NIST test set for the 62-class task). 568 on the average 18.2\% error made by humans on the NIST test set for the 62-class task).
532 569
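As a back-of-the-envelope check of the quoted figure (this estimate is ours;
the text does not spell out the estimator), treating each of the roughly
$80 \times 2500$ human labels as an independent Bernoulli trial gives a
standard error of about $0.1\%$ around the $18.2\%$ mean error:
\begin{verbatim}
import math

p, n_labels = 0.182, 80 * 2500  # error rate, assumed number of labels
std_err = math.sqrt(p * (1 - p) / n_labels)
print("standard error ~ {:.2%}".format(std_err))  # about 0.09%, i.e. ~0.1%
\end{verbatim}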
635 scaling behavior). 672 scaling behavior).
636 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized 673 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
637 exponentials) on the output layer for estimating $P(class | image)$. 674 exponentials) on the output layer for estimating $P(class | image)$.
638 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 675 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
639 Training examples are presented in minibatches of size 20. A constant learning 676 Training examples are presented in minibatches of size 20. A constant learning
640 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ 677 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
641 through preliminary experiments (measuring performance on a validation set), 678 %through preliminary experiments (measuring performance on a validation set),
642 and $0.1$ (which was found to work best) was then selected for optimizing on 679 %and $0.1$ (which was found to work best) was then selected for optimizing on
643 the whole training sets. 680 %the whole training sets.
644 \vspace*{-1mm} 681 \vspace*{-1mm}
645 682
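For reference, a small numpy sketch of this baseline (our illustration; the
hyperparameter values come from the text, while the initialization scheme and
the function and variable names are arbitrary): one tanh hidden layer, a
softmax output estimating $P(class | image)$, and plain stochastic gradient
descent on minibatches of 20 with a constant learning rate.
\begin{verbatim}
import numpy as np

def init_mlp(n_in, n_hidden, n_out, rng=np.random):
    scale = 1.0 / np.sqrt(n_in)  # arbitrary small-scale initialization
    return {"W1": rng.uniform(-scale, scale, (n_in, n_hidden)),
            "b1": np.zeros(n_hidden),
            "W2": rng.uniform(-scale, scale, (n_hidden, n_out)),
            "b2": np.zeros(n_out)}

def forward(params, x):
    h = np.tanh(x @ params["W1"] + params["b1"])      # tanh hidden layer
    logits = h @ params["W2"] + params["b2"]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)        # softmax: P(class | image)

def sgd_step(params, x, y, lr=0.1):
    """x: (20, n_in) minibatch of images, y: (20,) integer class labels."""
    h, p = forward(params, x)
    n = x.shape[0]
    d_logits = p.copy()
    d_logits[np.arange(n), y] -= 1.0                  # gradient of NLL wrt logits
    d_logits /= n
    d_h = (d_logits @ params["W2"].T) * (1.0 - h ** 2)
    params["W2"] -= lr * (h.T @ d_logits)
    params["b2"] -= lr * d_logits.sum(axis=0)
    params["W1"] -= lr * (x.T @ d_h)
    params["b1"] -= lr * d_h.sum(axis=0)
\end{verbatim}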
646 683
647 {\bf Stacked Denoising Auto-Encoders (SDA).} 684 {\bf Stacked Denoising Auto-Encoders (SDA).}
648 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) 685 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
664 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} 701 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
665 \vspace*{-2mm} 702 \vspace*{-2mm}
666 \caption{Illustration of the computations and training criterion for the denoising 703 \caption{Illustration of the computations and training criterion for the denoising
667 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of 704 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
668 the layer (i.e. raw input or output of previous layer) 705 the layer (i.e. raw input or output of previous layer)
669 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. 706 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
670 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which 707 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
671 is compared to the uncorrupted input $x$ through the loss function 708 is compared to the uncorrupted input $x$ through the loss function
672 $L_H(x,z)$, whose expected value is approximately minimized during training 709 $L_H(x,z)$, whose expected value is approximately minimized during training
673 by tuning $\theta$ and $\theta'$.} 710 by tuning $\theta$ and $\theta'$.}
674 \label{fig:da} 711 \label{fig:da}
675 \vspace*{-2mm} 712 \vspace*{-2mm}
676 \end{figure} 713 \end{figure}
677 714
678 Here we chose to use the Denoising 715 Here we chose to use the Denoising
679 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for 716 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
680 these deep hierarchies of features, as it is very simple to train and 717 these deep hierarchies of features, as it is simple to train and
681 explain (see Figure~\ref{fig:da}, as well as the 718 explain (see Figure~\ref{fig:da}, as well as the
682 tutorial and code at {\tt http://deeplearning.net/tutorial}), 719 tutorial and code at {\tt http://deeplearning.net/tutorial}),
683 provides efficient inference, and yielded results 720 provides efficient inference, and yielded results
684 comparable to or better than RBMs in a series of experiments 721 comparable to or better than RBMs in a series of experiments
685 \citep{VincentPLarochelleH2008}. During training, a Denoising 722 \citep{VincentPLarochelleH2008}. During training, a Denoising
686 Auto-encoder is presented with a stochastically corrupted version 723 Auto-encoder is presented with a stochastically corrupted version
687 of the input and trained to reconstruct the uncorrupted input, 724 of the input and trained to reconstruct the uncorrupted input,
688 forcing the hidden units to represent the leading regularities in 725 forcing the hidden units to represent the leading regularities in
689 the data. Once it is trained, in a purely unsupervised way, 726 the data. Here we use the random binary masking corruption
727 (which sets a random subset of the inputs to 0).
728 Once it is trained, in a purely unsupervised way,
690 its hidden units' activations can 729 its hidden units' activations can
691 be used as inputs for training a second one, etc. 730 be used as inputs for training a second one, etc.
692 After this unsupervised pre-training stage, the parameters 731 After this unsupervised pre-training stage, the parameters
693 are used to initialize a deep MLP, which is fine-tuned by 732 are used to initialize a deep MLP, which is fine-tuned by
694 the same standard procedure used to train the MLPs (see previous section). 733 the same standard procedure used to train the MLPs (see previous section).
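The sketch below (ours, not the implementation used in the experiments; see
the tutorial URL above for real code) illustrates one such denoising
auto-encoder layer with binary masking corruption, tied decoder weights
$W' = W^T$ (our simplifying choice) and the cross-entropy reconstruction loss
$L_H(x,z)$ of Figure~\ref{fig:da}:
\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def da_sgd_step(W, b, b_prime, x, corruption=0.25, lr=0.01, rng=np.random):
    """One pre-training step on a minibatch x with values in [0, 1].
    W: (n_in, n_hidden); decoder weights are tied to W."""
    x_tilde = x * (rng.rand(*x.shape) > corruption)  # binary masking corruption
    y = sigmoid(x_tilde @ W + b)                     # encoder f_theta
    z = sigmoid(y @ W.T + b_prime)                   # decoder g_theta'
    # cross-entropy loss L_H(x, z): its gradient wrt the decoder
    # pre-activation is (z - x), averaged here over the minibatch
    d_pre_z = (z - x) / x.shape[0]
    d_pre_y = (d_pre_z @ W) * y * (1.0 - y)
    W -= lr * (x_tilde.T @ d_pre_y + d_pre_z.T @ y)  # W enters both mappings
    b -= lr * d_pre_y.sum(axis=0)
    b_prime -= lr * d_pre_z.sum(axis=0)

def encode(W, b, x):
    # hidden code of the (uncorrupted) input: the training data for the next layer
    return sigmoid(x @ W + b)
\end{verbatim}
After every layer has been pre-trained this way, each layer's $W$ and $b$
initialize the corresponding layer of the deep MLP that is then fine-tuned as
described above.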
840 879
841 $\bullet$ %\item 880 $\bullet$ %\item
842 {\bf Do the good results previously obtained with deep architectures on the 881 {\bf Do the good results previously obtained with deep architectures on the
843 MNIST digits generalize to a much larger and richer (but similar) 882 MNIST digits generalize to a much larger and richer (but similar)
844 dataset, the NIST special database 19, with 62 classes and around 800k examples}? 883 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
845 Yes, the SDA {\bf systematically outperformed the MLP and all the previously 884 Yes, the SDA {\em systematically outperformed the MLP and all the previously
846 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level 885 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
847 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. 886 performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
848 887
849 $\bullet$ %\item 888 $\bullet$ %\item
850 {\bf To what extent do self-taught learning scenarios help deep learners, 889 {\bf To what extent do self-taught learning scenarios help deep learners,
851 and do they help them more than shallow supervised ones}? 890 and do they help them more than shallow supervised ones}?
856 examples. MLPs were helped by perturbed training examples when tested on perturbed input 895 examples. MLPs were helped by perturbed training examples when tested on perturbed input
857 images (65\% relative improvement on NISTP) 896 images (65\% relative improvement on NISTP)
858 but only marginally helped (5\% relative improvement on all classes) 897 but only marginally helped (5\% relative improvement on all classes)
859 or even hurt (10\% relative loss on digits) 898 or even hurt (10\% relative loss on digits)
860 with respect to clean examples. On the other hand, the deep SDAs 899 with respect to clean examples. On the other hand, the deep SDAs
861 were very significantly boosted by these out-of-distribution examples. 900 were significantly boosted by these out-of-distribution examples.
862 Similarly, whereas the improvement due to the multi-task setting was marginal or 901 Similarly, whereas the improvement due to the multi-task setting was marginal or
863 negative for the MLP (from +5.6\% to -3.6\% relative change), 902 negative for the MLP (from +5.6\% to -3.6\% relative change),
864 it was very significant for the SDA (from +13\% to +27\% relative change), 903 it was quite significant for the SDA (from +13\% to +27\% relative change),
865 which may be explained by the arguments below. 904 which may be explained by the arguments below.
866 %\end{itemize} 905 %\end{itemize}
867 906
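For clarity on how the relative figures above can be read (this is the usual
definition of a relative change in error, not a formula stated in the text):
\begin{verbatim}
def relative_improvement(err_baseline, err_variant):
    # positive = relative improvement, negative = relative loss
    return (err_baseline - err_variant) / err_baseline

# purely illustrative numbers, not taken from the results tables:
print(relative_improvement(0.20, 0.07))  # 0.65, i.e. a 65% relative improvement
\end{verbatim}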
868 In the original self-taught learning framework~\citep{RainaR2007}, the 907 In the original self-taught learning framework~\citep{RainaR2007}, the
869 out-of-sample examples were used as a source of unsupervised data, and 908 out-of-sample examples were used as a source of unsupervised data, and
871 scenario. However, many of the results by \citet{RainaR2007} (who used a 910 scenario. However, many of the results by \citet{RainaR2007} (who used a
872 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught 911 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught
873 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. 912 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases.
874 We note instead that, for deep 913 We note instead that, for deep
875 architectures, our experiments show that such a positive effect is accomplished 914 architectures, our experiments show that such a positive effect is accomplished
876 even in a scenario with a \emph{very large number of labeled examples}, 915 even in a scenario with a \emph{large number of labeled examples},
877 i.e., here, the relative gain of self-taught learning is probably preserved 916 i.e., here, the relative gain of self-taught learning is probably preserved
878 in the asymptotic regime. 917 in the asymptotic regime.
879 918
880 {\bf Why would deep learners benefit more from the self-taught learning framework}? 919 {\bf Why would deep learners benefit more from the self-taught learning framework}?
881 The key idea is that the lower layers of the predictor compute a hierarchy 920 The key idea is that the lower layers of the predictor compute a hierarchy