comparison writeup/nips2010_submission.tex @ 569:9d01280ff1c1
comments from Joseph Turian
author | Yoshua Bengio <bengioy@iro.umontreal.ca>
date | Thu, 03 Jun 2010 19:05:08 -0400
parents | ae6ba0309bf9
children | df749e70f637
568:ae6ba0309bf9 | 569:9d01280ff1c1
18 %\makeanontitle | 18 %\makeanontitle |
19 \maketitle | 19 \maketitle |
20 | 20 |
21 \vspace*{-2mm} | 21 \vspace*{-2mm} |
22 \begin{abstract} | 22 \begin{abstract} |
23 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples} and show that {\em deep learners benefit more from them than a corresponding shallow learner}, in the area of handwritten character recognition. In fact, we show that they reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. | 23 Recent theoretical and empirical work in statistical machine learning has |
24 demonstrated the importance of learning algorithms for deep | |
25 architectures, i.e., function classes obtained by composing multiple | |
26 non-linear transformations. Self-taught learning (exploiting unlabeled | |
27 examples or examples from other distributions) has already been applied | |
28 to deep learners, but mostly to show the advantage of unlabeled | |
29 examples. Here we explore the advantage brought by {\em out-of-distribution examples}. | |
30 For this purpose we | |
31 developed a powerful generator of stochastic variations and noise | |
32 processes for character images, including not only affine transformations | |
33 but also slant, local elastic deformations, changes in thickness, | |
34 background images, grey level changes, contrast, occlusion, and various | |
35 types of noise. The out-of-distribution examples are obtained from these | |
36 highly distorted images or by including examples of object classes | |
37 different from those in the target test set. | |
38 We show that {\em deep learners benefit | |
39 more from them than a corresponding shallow learner}, at least in the area of | |
40 handwritten character recognition. In fact, we show that they reach | |
41 human-level performance on both handwritten digit classification and | |
42 62-class handwritten character recognition. | |
24 \end{abstract} | 43 \end{abstract} |
25 \vspace*{-3mm} | 44 \vspace*{-3mm} |
26 | 45 |
27 \section{Introduction} | 46 \section{Introduction} |
28 \vspace*{-1mm} | 47 \vspace*{-1mm} |
29 | 48 |
30 Deep Learning has emerged as a promising new area of research in | 49 {\bf Deep Learning} has emerged as a promising new area of research in |
31 statistical machine learning (see~\citet{Bengio-2009} for a review). | 50 statistical machine learning (see~\citet{Bengio-2009} for a review). |
32 Learning algorithms for deep architectures are centered on the learning | 51 Learning algorithms for deep architectures are centered on the learning |
33 of useful representations of data, which are better suited to the task at hand. | 52 of useful representations of data, which are better suited to the task at hand. |
34 This is in great part inspired by observations of the mammalian visual cortex, | 53 This is in part inspired by observations of the mammalian visual cortex, |
35 which consists of a chain of processing elements, each of which is associated with a | 54 which consists of a chain of processing elements, each of which is associated with a |
36 different representation of the raw visual input. In fact, | 55 different representation of the raw visual input. In fact, |
37 it was found recently that the features learnt in deep architectures resemble | 56 it was found recently that the features learnt in deep architectures resemble |
38 those observed in the first two of these stages (in areas V1 and V2 | 57 those observed in the first two of these stages (in areas V1 and V2 |
39 of visual cortex)~\citep{HonglakL2008}, and that they become more and | 58 of visual cortex)~\citep{HonglakL2008}, and that they become more and |
45 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the | 64 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the |
46 feature representation can lead to higher-level (more abstract, more | 65 feature representation can lead to higher-level (more abstract, more |
47 general) features that are more robust to unanticipated sources of | 66 general) features that are more robust to unanticipated sources of |
48 variance extant in real data. | 67 variance extant in real data. |
49 | 68 |
69 {\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles | |
70 of semi-supervised and multi-task learning: the learner can exploit examples | |
71 that are unlabeled and possibly come from a distribution different from the target | |
72 distribution, e.g., from other classes than those of interest. | |
73 It has already been shown that deep learners can clearly take advantage of | |
74 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | |
75 but more needs to be done to explore the impact | |
76 of {\em out-of-distribution} examples and of the multi-task setting | |
77 (one exception is~\citep{CollobertR2008}, which uses a different kind | |
78 of learning algorithm). In particular the {\em relative | |
79 advantage} of deep learning for these settings has not been evaluated. | |
80 The hypothesis discussed in the conclusion is that a deep hierarchy of features | |
81 may be better able to provide sharing of statistical strength | |
82 between different regions in input space or different tasks. | |
83 | |
84 \iffalse | |
50 Whereas a deep architecture can in principle be more powerful than a | 85 Whereas a deep architecture can in principle be more powerful than a |
51 shallow one in terms of representation, depth appears to render the | 86 shallow one in terms of representation, depth appears to render the |
52 training problem more difficult in terms of optimization and local minima. | 87 training problem more difficult in terms of optimization and local minima. |
53 It is also only recently that successful algorithms were proposed to | 88 It is also only recently that successful algorithms were proposed to |
54 overcome some of these difficulties. All are based on unsupervised | 89 overcome some of these difficulties. All are based on unsupervised |
57 applied here, is the Denoising | 92 applied here, is the Denoising |
58 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), | 93 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), |
59 which | 94 which |
60 performed similarly or better than previously proposed Restricted Boltzmann | 95 performed similarly or better than previously proposed Restricted Boltzmann |
61 Machines in terms of unsupervised extraction of a hierarchy of features | 96 Machines in terms of unsupervised extraction of a hierarchy of features |
62 useful for classification. The principle is that each layer starting from | 97 useful for classification. Each layer is trained to denoise its |
63 the bottom is trained to encode its input (the output of the previous | 98 input, creating a layer of features that can be used as input for the next layer. |
64 layer) and to reconstruct it from a corrupted version. After this | 99 \fi |
65 unsupervised initialization, the stack of DAs can be | 100 %The principle is that each layer starting from |
66 converted into a deep supervised feedforward neural network and fine-tuned by | 101 %the bottom is trained to encode its input (the output of the previous |
67 stochastic gradient descent. | 102 %layer) and to reconstruct it from a corrupted version. After this |
68 | 103 %unsupervised initialization, the stack of DAs can be |
69 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 104 %converted into a deep supervised feedforward neural network and fine-tuned by |
70 of semi-supervised and multi-task learning: the learner can exploit examples | 105 %stochastic gradient descent. |
71 that are unlabeled and possibly come from a distribution different from the target | 106 |
72 distribution, e.g., from other classes than those of interest. | |
73 It has already been shown that deep learners can clearly take advantage of | |
74 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | |
75 but more needs to be done to explore the impact | |
76 of {\em out-of-distribution} examples and of the multi-task setting | |
77 (one exception is~\citep{CollobertR2008}, which uses very different kinds | |
78 of learning algorithms). In particular the {\em relative | |
79 advantage} of deep learning for these settings has not been evaluated. | |
80 The hypothesis discussed in the conclusion is that a deep hierarchy of features | |
81 may be better able to provide sharing of statistical strength | |
82 between different regions in input space or different tasks. | |
83 % | 107 % |
84 In this paper we ask the following questions: | 108 In this paper we ask the following questions: |
85 | 109 |
86 %\begin{enumerate} | 110 %\begin{enumerate} |
87 $\bullet$ %\item | 111 $\bullet$ %\item |
91 | 115 |
92 $\bullet$ %\item | 116 $\bullet$ %\item |
93 To what extent does the perturbation of input images (e.g. adding | 117 To what extent does the perturbation of input images (e.g. adding |
94 noise, affine transformations, background images) make the resulting | 118 noise, affine transformations, background images) make the resulting |
95 classifiers better not only on similarly perturbed images but also on | 119 classifiers better not only on similarly perturbed images but also on |
96 the {\em original clean examples}? | 120 the {\em original clean examples}? We study this question in the |
121 context of the 62-class and 10-class tasks of the NIST special database 19. | |
97 | 122 |
98 $\bullet$ %\item | 123 $\bullet$ %\item |
99 Do deep architectures {\em benefit more from such out-of-distribution} | 124 Do deep architectures {\em benefit more from such out-of-distribution} |
100 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 125 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
126 We use highly perturbed examples to generate out-of-distribution examples. | |
101 | 127 |
102 $\bullet$ %\item | 128 $\bullet$ %\item |
103 Similarly, does the feature learning step in deep learning algorithms benefit more | 129 Similarly, does the feature learning step in deep learning algorithms benefit more |
104 from training with moderately different classes (i.e. a multi-task learning scenario) than | 130 from training with moderately different classes (i.e. a multi-task learning scenario) than |
105 a corresponding shallow and purely supervised architecture? | 131 a corresponding shallow and purely supervised architecture? |
132 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) | |
133 to answer this question. | |
106 %\end{enumerate} | 134 %\end{enumerate} |
107 | 135 |
108 Our experimental results provide positive evidence towards all of these questions. | 136 Our experimental results provide positive evidence towards all of these questions. |
109 To achieve these results, we introduce in the next section a sophisticated system | 137 To achieve these results, we introduce in the next section a sophisticated system |
110 for stochastically transforming character images and then explain the methodology. | 138 for stochastically transforming character images and then explain the methodology, |
139 which is based on training with or without these transformed images and testing on | |
140 clean ones. We measure the relative advantage of out-of-distribution examples | |
141 for a deep learner vs a supervised shallow one. | |
142 Code for generating these transformations as well as for the deep learning | |
143 algorithms is made available. | |
144 We also estimate the relative advantage for deep learners of training with | |
145 other classes than those of interest, by comparing learners trained with | |
146 62 classes with learners trained with only a subset (on which they | |
147 are then tested). | |
111 The conclusion discusses | 148 The conclusion discusses |
112 the more general question of why deep learners may benefit so much from | 149 the more general question of why deep learners may benefit so much from |
113 the self-taught learning framework. | 150 the self-taught learning framework. |
114 | 151 |
115 \vspace*{-1mm} | 152 \vspace*{-3mm} |
116 \section{Perturbation and Transformation of Character Images} | 153 \section{Perturbation and Transformation of Character Images} |
117 \label{s:perturbations} | 154 \label{s:perturbations} |
118 \vspace*{-1mm} | 155 \vspace*{-2mm} |
119 | 156 |
120 \begin{wrapfigure}[8]{l}{0.15\textwidth} | 157 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
121 %\begin{minipage}[b]{0.14\linewidth} | 158 %\begin{minipage}[b]{0.14\linewidth} |
122 \vspace*{-5mm} | 159 \vspace*{-5mm} |
123 \begin{center} | 160 \begin{center} |
191 %\centering | 228 %\centering |
192 To produce {\bf slant}, each row of the image is shifted | 229 To produce {\bf slant}, each row of the image is shifted |
193 proportionally to its height: $shift = round(slant \times height)$. | 230 proportionally to its height: $shift = round(slant \times height)$. |
194 $slant \sim U[-complexity,complexity]$. | 231 $slant \sim U[-complexity,complexity]$. |
195 The shift is randomly chosen to be either to the left or to the right. | 232 The shift is randomly chosen to be either to the left or to the right. |
196 \vspace{1.1cm} | 233 \vspace{1cm} |
197 \end{minipage} | 234 \end{minipage} |
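For illustration only, here is a minimal NumPy sketch of a slant module consistent with the description above; the function name, the zero background fill, and the boundary handling are assumptions of this sketch, not the released data-generation code.

import numpy as np

def apply_slant(image, complexity, rng=np.random):
    # Shift row y of a 2D greyscale image horizontally by round(slant * y),
    # with slant ~ U[-complexity, complexity]; the sign of slant determines
    # whether rows move to the left or to the right.
    slant = rng.uniform(-complexity, complexity)
    height, width = image.shape
    out = np.zeros_like(image)                  # assumed background value: 0
    for y in range(height):
        shift = int(round(slant * y))
        shift = max(-width, min(width, shift))  # stay within the frame
        if shift >= 0:                          # move row to the right
            out[y, shift:width] = image[y, 0:width - shift]
        else:                                   # move row to the left
            out[y, 0:width + shift] = image[y, -shift:width]
    return out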
198 %\vspace*{-4mm} | 235 %\vspace*{-4mm} |
199 | 236 |
200 %\begin{minipage}[b]{0.14\linewidth} | 237 %\begin{minipage}[b]{0.14\linewidth} |
201 %\centering | 238 %\centering |
202 \begin{wrapfigure}[8]{l}{0.15\textwidth} | 239 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
203 \vspace*{-6mm} | 240 \vspace*{-6mm} |
204 \begin{center} | 241 \begin{center} |
205 \includegraphics[scale=.4]{images/Affine_only.png}\\ | 242 \includegraphics[scale=.4]{images/Affine_only.png}\\ |
206 {\bf Affine Transformation} | 243 {\small {\bf Affine \mbox{Transformation}}} |
207 \end{center} | 244 \end{center} |
208 \end{wrapfigure} | 245 \end{wrapfigure} |
209 %\end{minipage}% | 246 %\end{minipage}% |
210 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} | 247 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} |
211 A $2 \times 3$ {\bf affine transform} matrix (with | 248 A $2 \times 3$ {\bf affine transform} matrix (with |
228 %\hspace*{-8mm}\begin{minipage}[b]{0.25\linewidth} | 265 %\hspace*{-8mm}\begin{minipage}[b]{0.25\linewidth} |
229 %\centering | 266 %\centering |
230 \begin{center} | 267 \begin{center} |
231 \vspace*{-4mm} | 268 \vspace*{-4mm} |
232 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\ | 269 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\ |
233 {\bf Local Elastic} | 270 {\bf Local Elastic Deformation} |
234 \end{center} | 271 \end{center} |
235 \end{wrapfigure} | 272 \end{wrapfigure} |
236 %\end{minipage}% | 273 %\end{minipage}% |
237 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth} | 274 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth} |
238 %\vspace*{-20mm} | 275 %\vspace*{-20mm} |
239 The {\bf local elastic} deformation | 276 The {\bf local elastic deformation} |
240 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, | 277 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, |
241 which provides more details. | 278 which provides more details. |
242 The displacement fields have intensity | 279 The displacement fields have intensity |
243 $\alpha = \sqrt[3]{complexity} \times 10.0$ and are | 280 $\alpha = \sqrt[3]{complexity} \times 10.0$ and are |
244 convolved with a 2D Gaussian kernel (resulting in a blur) of | 281 convolved with a 2D Gaussian kernel (resulting in a blur) of |
245 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$. | 282 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$. |
246 %\vspace{.9cm} | 283 %\vspace{.9cm} |
247 \end{minipage} | 284 \end{minipage} |
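For concreteness, a minimal SciPy sketch of this deformation under the stated $\alpha$ and $\sigma$ formulas; the uniform raw displacement fields, bilinear interpolation, and constant boundary mode follow the usual Simard-style implementation and are assumptions here, not details taken from the released code.

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deformation(image, complexity, rng=np.random):
    # Random displacement fields in [-1, 1] are smoothed with a 2D Gaussian
    # of standard deviation sigma and scaled by alpha, then the image is
    # resampled along the displaced coordinates.
    alpha = (complexity ** (1.0 / 3.0)) * 10.0
    sigma = 10.0 - 7.0 * (complexity ** (1.0 / 3.0))
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(image.shape[0]),
                         np.arange(image.shape[1]), indexing='ij')
    coords = np.array([ys + dy, xs + dx])
    return map_coordinates(image, coords, order=1, mode='constant')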
248 | 285 |
249 \vspace*{5mm} | 286 \vspace*{7mm} |
250 | 287 |
251 %\begin{minipage}[b]{0.14\linewidth} | 288 %\begin{minipage}[b]{0.14\linewidth} |
252 %\centering | 289 %\centering |
253 \begin{wrapfigure}[7]{l}{0.15\textwidth} | 290 \begin{wrapfigure}[7]{l}{0.15\textwidth} |
254 \vspace*{-5mm} | 291 \vspace*{-5mm} |
275 around the (non-integer) source position thus found. | 312 around the (non-integer) source position thus found. |
276 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. | 313 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. |
277 %\vspace{1.5cm} | 314 %\vspace{1.5cm} |
278 %\end{minipage} | 315 %\end{minipage} |
279 | 316 |
280 \vspace{2mm} | 317 \vspace{1mm} |
281 | 318 |
282 {\large\bf 2.2 Injecting Noise} | 319 {\large\bf 2.2 Injecting Noise} |
283 %\subsection{Injecting Noise} | 320 %\subsection{Injecting Noise} |
284 \vspace{2mm} | 321 \vspace{2mm} |
285 | 322 |
521 Mechanical Turk has been used extensively in natural language processing and vision. | 558 Mechanical Turk has been used extensively in natural language processing and vision. |
522 %processing \citep{SnowEtAl2008} and vision | 559 %processing \citep{SnowEtAl2008} and vision |
523 %\citep{SorokinAndForsyth2008,whitehill09}. | 560 %\citep{SorokinAndForsyth2008,whitehill09}. |
524 AMT users were presented | 561 AMT users were presented |
525 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII | 562 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII |
526 characters. They were forced to make a hard choice among the | 563 characters. They were forced to choose a single character class (either among the |
527 62 or 10 character classes (all classes or digits only). | 564 62 or 10 character classes) for each image. |
528 80 subjects classified 2500 images per (dataset,task) pair, | 565 80 subjects classified 2500 images per (dataset,task) pair, |
529 with the guarantee that 3 different subjects classified each image, allowing | 566 with the guarantee that 3 different subjects classified each image, allowing |
530 us to estimate inter-human variability (e.g., a standard error of 0.1\% | 567 us to estimate inter-human variability (e.g., a standard error of 0.1\% |
531 on the average 18.2\% error made by humans on the 62-class task on the NIST test set). | 568 on the average 18.2\% error made by humans on the 62-class task on the NIST test set). |
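For reference, under a simple binomial model a standard error of $0.1\%$ at an error rate of $p \approx 0.182$ corresponds to roughly $n \approx p(1-p)/(0.001)^2 \approx 1.5 \times 10^5$ independent labels; inter-rater correlation or a per-subject analysis would change this count, so this is only an order-of-magnitude reading of the quoted figure, not the computation actually used here.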
532 | 569 |
635 scaling behavior). | 672 scaling behavior). |
636 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 673 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
637 exponentials) on the output layer for estimating $P(class | image)$. | 674 exponentials) on the output layer for estimating $P(class | image)$. |
638 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 675 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. |
639 Training examples are presented in minibatches of size 20. A constant learning | 676 Training examples are presented in minibatches of size 20. A constant learning |
640 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ | 677 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. |
641 through preliminary experiments (measuring performance on a validation set), | 678 %through preliminary experiments (measuring performance on a validation set), |
642 and $0.1$ (which was found to work best) was then selected for optimizing on | 679 %and $0.1$ (which was found to work best) was then selected for optimizing on |
643 the whole training sets. | 680 %the whole training sets. |
644 \vspace*{-1mm} | 681 \vspace*{-1mm} |
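To make this baseline concrete, below is a minimal NumPy sketch of such a one-hidden-layer MLP trained by minibatch stochastic gradient descent with a constant learning rate. The tanh hidden layer, softmax output, minibatch size of 20, and the hyper-parameter grids come from the text; the class name, initialization, and overall structure are assumptions of this sketch, not the authors' released code.

import numpy as np

class ShallowMLP:
    # Hedged sketch: one tanh hidden layer, softmax output estimating
    # P(class | image), trained by minibatch SGD on the cross-entropy loss.
    def __init__(self, n_in, n_hidden, n_classes, rng=np.random):
        scale = 1.0 / np.sqrt(n_in)
        self.W1 = rng.uniform(-scale, scale, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.zeros((n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)           # hidden representation
        logits = H @ self.W2 + self.b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # softmax probabilities
        return H, P

    def sgd_step(self, X, y, lr=0.1):
        # One SGD step on a minibatch; y holds integer class indices.
        n = X.shape[0]
        H, P = self.forward(X)
        dlogits = P.copy()
        dlogits[np.arange(n), y] -= 1.0              # dL/dlogits for cross-entropy
        dlogits /= n
        dH = dlogits @ self.W2.T * (1.0 - H ** 2)    # tanh' = 1 - tanh^2
        self.W2 -= lr * H.T @ dlogits
        self.b2 -= lr * dlogits.sum(axis=0)
        self.W1 -= lr * X.T @ dH
        self.b1 -= lr * dH.sum(axis=0)

A training loop would then call sgd_step on successive minibatches of 20 examples, with n_hidden chosen in {300, 500, 800, 1000, 1500} and lr in {0.001, 0.01, 0.025, 0.075, 0.1, 0.5} on a validation set, as described above.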
645 | 682 |
646 | 683 |
647 {\bf Stacked Denoising Auto-Encoders (SDA).} | 684 {\bf Stacked Denoising Auto-Encoders (SDA).} |
648 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 685 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
664 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | 701 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} |
665 \vspace*{-2mm} | 702 \vspace*{-2mm} |
666 \caption{Illustration of the computations and training criterion for the denoising | 703 \caption{Illustration of the computations and training criterion for the denoising |
667 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | 704 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of |
668 the layer (i.e. raw input or output of previous layer) | 705 the layer (i.e. raw input or output of previous layer) |
669 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | 706 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. |
670 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | 707 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which |
671 is compared to the uncorrupted input $x$ through the loss function | 708 is compared to the uncorrupted input $x$ through the loss function |
672 $L_H(x,z)$, whose expected value is approximately minimized during training | 709 $L_H(x,z)$, whose expected value is approximately minimized during training |
673 by tuning $\theta$ and $\theta'$.} | 710 by tuning $\theta$ and $\theta'$.} |
674 \label{fig:da} | 711 \label{fig:da} |
675 \vspace*{-2mm} | 712 \vspace*{-2mm} |
676 \end{figure} | 713 \end{figure} |
677 | 714 |
678 Here we chose to use the Denoising | 715 Here we chose to use the Denoising |
679 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for | 716 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for |
680 these deep hierarchies of features, as it is very simple to train and | 717 these deep hierarchies of features, as it is simple to train and |
681 explain (see Figure~\ref{fig:da}, as well as | 718 explain (see Figure~\ref{fig:da}, as well as |
682 the tutorial and code at {\tt http://deeplearning.net/tutorial}), | 719 the tutorial and code at {\tt http://deeplearning.net/tutorial}), |
683 provides efficient inference, and yielded results | 720 provides efficient inference, and yielded results |
684 comparable to or better than RBMs in a series of experiments | 721 comparable to or better than RBMs in a series of experiments |
685 \citep{VincentPLarochelleH2008}. During training, a Denoising | 722 \citep{VincentPLarochelleH2008}. During training, a Denoising |
686 Auto-encoder is presented with a stochastically corrupted version | 723 Auto-encoder is presented with a stochastically corrupted version |
687 of the input and trained to reconstruct the uncorrupted input, | 724 of the input and trained to reconstruct the uncorrupted input, |
688 forcing the hidden units to represent the leading regularities in | 725 forcing the hidden units to represent the leading regularities in |
689 the data. Once it is trained, in a purely unsupervised way, | 726 the data. Here we use the random binary masking corruption |
727 (which sets a random subset of the inputs to 0). | |
728 Once it is trained, in a purely unsupervised way, | |
690 its hidden units' activations can | 729 its hidden units' activations can |
691 be used as inputs for training a second one, etc. | 730 be used as inputs for training a second one, etc. |
692 After this unsupervised pre-training stage, the parameters | 731 After this unsupervised pre-training stage, the parameters |
693 are used to initialize a deep MLP, which is fine-tuned by | 732 are used to initialize a deep MLP, which is fine-tuned by |
694 the same standard procedure used to train them (see previous section). | 733 the same standard procedure used to train them (see previous section). |
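The following is a minimal NumPy sketch of one such denoising auto-encoder layer, matching the description above: binary masking corruption of the input, an encoder $f_\theta$, a decoder $g_{\theta'}$, and a cross-entropy reconstruction loss $L_H(x,z)$. The tied encoder/decoder weights, sigmoid units, and initialization are assumptions of this sketch rather than details stated here.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoder:
    # Hedged sketch of one DA layer: masking corruption, sigmoid encoder and
    # decoder with tied weights, cross-entropy reconstruction loss L_H(x, z).
    def __init__(self, n_visible, n_hidden, rng=np.random):
        scale = 1.0 / np.sqrt(n_visible)
        self.W = rng.uniform(-scale, scale, (n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.rng = rng

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hid)       # y = f_theta(x)

    def train_step(self, x, corruption=0.25, lr=0.01):
        # One SGD step on a minibatch x with values in [0, 1].
        mask = self.rng.binomial(1, 1.0 - corruption, x.shape)
        x_tilde = x * mask                            # zero a random subset of inputs
        y = self.encode(x_tilde)
        z = sigmoid(y @ self.W.T + self.b_vis)        # z = g_theta'(y)
        dz = (z - x) / x.shape[0]                     # grad of L_H w.r.t. decoder pre-activation
        dy = dz @ self.W * y * (1.0 - y)              # backprop through the encoder
        self.W -= lr * (x_tilde.T @ dy + dz.T @ y)    # tied-weight gradient
        self.b_vis -= lr * dz.sum(axis=0)
        self.b_hid -= lr * dy.sum(axis=0)
        return y

Stacking then amounts to training a second such layer on the codes returned by encode(); after this unsupervised pre-training, each layer's W and b_hid would initialize the corresponding layer of a deep MLP that is fine-tuned by supervised gradient descent, as described above.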
840 | 879 |
841 $\bullet$ %\item | 880 $\bullet$ %\item |
842 {\bf Do the good results previously obtained with deep architectures on the | 881 {\bf Do the good results previously obtained with deep architectures on the |
843 MNIST digits generalize to a much larger and richer (but similar) | 882 MNIST digits generalize to a much larger and richer (but similar) |
844 dataset, the NIST special database 19, with 62 classes and around 800k examples}? | 883 dataset, the NIST special database 19, with 62 classes and around 800k examples}? |
845 Yes, the SDA {\bf systematically outperformed the MLP and all the previously | 884 Yes, the SDA {\em systematically outperformed the MLP and all the previously |
846 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level | 885 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level |
847 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. | 886 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
848 | 887 |
849 $\bullet$ %\item | 888 $\bullet$ %\item |
850 {\bf To what extent do self-taught learning scenarios help deep learners, | 889 {\bf To what extent do self-taught learning scenarios help deep learners, |
851 and do they help them more than shallow supervised ones}? | 890 and do they help them more than shallow supervised ones}? |
856 examples. MLPs were helped by perturbed training examples when tested on perturbed input | 895 examples. MLPs were helped by perturbed training examples when tested on perturbed input |
857 images (65\% relative improvement on NISTP) | 896 images (65\% relative improvement on NISTP) |
858 but only marginally helped (5\% relative improvement on all classes) | 897 but only marginally helped (5\% relative improvement on all classes) |
859 or even hurt (10\% relative loss on digits) | 898 or even hurt (10\% relative loss on digits) |
860 with respect to clean examples. On the other hand, the deep SDAs | 899 with respect to clean examples. On the other hand, the deep SDAs |
861 were very significantly boosted by these out-of-distribution examples. | 900 were significantly boosted by these out-of-distribution examples. |
862 Similarly, whereas the improvement due to the multi-task setting was marginal or | 901 Similarly, whereas the improvement due to the multi-task setting was marginal or |
863 negative for the MLP (from +5.6\% to -3.6\% relative change), | 902 negative for the MLP (from +5.6\% to -3.6\% relative change), |
864 it was very significant for the SDA (from +13\% to +27\% relative change), | 903 it was quite significant for the SDA (from +13\% to +27\% relative change), |
865 which may be explained by the arguments below. | 904 which may be explained by the arguments below. |
866 %\end{itemize} | 905 %\end{itemize} |
867 | 906 |
868 In the original self-taught learning framework~\citep{RainaR2007}, the | 907 In the original self-taught learning framework~\citep{RainaR2007}, the |
869 out-of-sample examples were used as a source of unsupervised data, and | 908 out-of-sample examples were used as a source of unsupervised data, and |
871 scenario. However, many of the results by \citet{RainaR2007} (who used a | 910 scenario. However, many of the results by \citet{RainaR2007} (who used a |
872 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught | 911 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught |
873 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. | 912 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. |
874 We note instead that, for deep | 913 We note instead that, for deep |
875 architectures, our experiments show that such a positive effect is accomplished | 914 architectures, our experiments show that such a positive effect is accomplished |
876 even in a scenario with a \emph{very large number of labeled examples}, | 915 even in a scenario with a \emph{large number of labeled examples}, |
877 i.e., here, the relative gain of self-taught learning is probably preserved | 916 i.e., here, the relative gain of self-taught learning is probably preserved |
878 in the asymptotic regime. | 917 in the asymptotic regime. |
879 | 918 |
880 {\bf Why would deep learners benefit more from the self-taught learning framework}? | 919 {\bf Why would deep learners benefit more from the self-taught learning framework}? |
881 The key idea is that the lower layers of the predictor compute a hierarchy | 920 The key idea is that the lower layers of the predictor compute a hierarchy |