comparison writeup/techreport.tex @ 584:81c6fde68a8a

corrections to techreport.tex
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 18 Sep 2010 18:25:11 -0400
parents ae77edb9df67
children
583:ae77edb9df67 584:81c6fde68a8a
32 Razvan Pascanu \and 32 Razvan Pascanu \and
33 Salah Rifai \and 33 Salah Rifai \and
34 Francois Savard \and 34 Francois Savard \and
35 Guillaume Sicard 35 Guillaume Sicard
36 } 36 }
37 \date{June 8th, 2010, Technical Report 1353, Dept. IRO, U. Montreal} 37 \date{June 3, 2010, Technical Report 1353, Dept. IRO, U. Montreal}
38 38
39 \begin{document} 39 \begin{document}
40 40
41 %\makeanontitle 41 %\makeanontitle
42 \maketitle 42 \maketitle
43 43
44 %\vspace*{-2mm} 44 %\vspace*{-2mm}
45 \begin{abstract} 45 \begin{abstract}
46 Recent theoretical and empirical work in statistical machine learning has 46 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.
47 demonstrated the importance of learning algorithms for deep
48 architectures, i.e., function classes obtained by composing multiple
49 non-linear transformations. Self-taught learning (exploiting unlabeled
50 examples or examples from other distributions) has already been applied
51 to deep learners, but mostly to show the advantage of unlabeled
52 examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
53 For this purpose we
54 developed a powerful generator of stochastic variations and noise
55 processes for character images, including not only affine transformations
56 but also slant, local elastic deformations, changes in thickness,
57 background images, grey level changes, contrast, occlusion, and various
58 types of noise. The out-of-distribution examples are obtained from these
59 highly distorted images or by including examples of object classes
60 different from those in the target test set.
61 We show that {\em deep learners benefit
62 more from them than a corresponding shallow learner}, at least in the area of
63 handwritten character recognition. In fact, we show that they reach
64 human-level performance on both handwritten digit classification and
65 62-class handwritten character recognition.
66 \end{abstract} 47 \end{abstract}
67 %\vspace*{-3mm} 48 %\vspace*{-3mm}
68 49
69 \section{Introduction} 50 \section{Introduction}
70 %\vspace*{-1mm} 51 %\vspace*{-1mm}
71 52
72 {\bf Deep Learning} has emerged as a promising new area of research in 53 {\bf Deep Learning} has emerged as a promising new area of research in
73 statistical machine learning (see~\citet{Bengio-2009} for a review). 54 statistical machine learning (see~\citet{Bengio-2009} for a review).
74 Learning algorithms for deep architectures are centered on the learning 55 Learning algorithms for deep architectures are centered on the learning
75 of useful representations of data, which are better suited to the task at hand. 56 of useful representations of data, which are better suited to the task at hand,
57 and are organized in a hierarchy with multiple levels.
76 This is in part inspired by observations of the mammalian visual cortex, 58 This is in part inspired by observations of the mammalian visual cortex,
77 which consists of a chain of processing elements, each of which is associated with a 59 which consists of a chain of processing elements, each of which is associated with a
78 different representation of the raw visual input. In fact, 60 different representation of the raw visual input. In fact,
79 it was found recently that the features learnt in deep architectures resemble 61 it was found recently that the features learnt in deep architectures resemble
80 those observed in the first two of these stages (in areas V1 and V2 62 those observed in the first two of these stages (in areas V1 and V2
102 advantage} of deep learning for these settings has not been evaluated. 84 advantage} of deep learning for these settings has not been evaluated.
103 The hypothesis discussed in the conclusion is that a deep hierarchy of features 85 The hypothesis discussed in the conclusion is that a deep hierarchy of features
104 may be better able to provide sharing of statistical strength 86 may be better able to provide sharing of statistical strength
105 between different regions in input space or different tasks. 87 between different regions in input space or different tasks.
106 88
107 \iffalse
108 Whereas a deep architecture can in principle be more powerful than a 89 Whereas a deep architecture can in principle be more powerful than a
109 shallow one in terms of representation, depth appears to render the 90 shallow one in terms of representation, depth appears to render the
110 training problem more difficult in terms of optimization and local minima. 91 training problem more difficult in terms of optimization and local minima.
111 It is also only recently that successful algorithms were proposed to 92 It is also only recently that successful algorithms were proposed to
112 overcome some of these difficulties. All are based on unsupervised 93 overcome some of these difficulties. All are based on unsupervised
117 which 98 which
118 performed similarly or better than previously proposed Restricted Boltzmann 99 performed similarly or better than previously proposed Restricted Boltzmann
119 Machines in terms of unsupervised extraction of a hierarchy of features 100 Machines in terms of unsupervised extraction of a hierarchy of features
120 useful for classification. Each layer is trained to denoise its 101 useful for classification. Each layer is trained to denoise its
121 input, creating a layer of features that can be used as input for the next layer. 102 input, creating a layer of features that can be used as input for the next layer.
122 \fi 103
123 %The principle is that each layer starting from 104 %The principle is that each layer starting from
124 %the bottom is trained to encode its input (the output of the previous 105 %the bottom is trained to encode its input (the output of the previous
125 %layer) and to reconstruct it from a corrupted version. After this 106 %layer) and to reconstruct it from a corrupted version. After this
126 %unsupervised initialization, the stack of DAs can be 107 %unsupervised initialization, the stack of DAs can be
127 %converted into a deep supervised feedforward neural network and fine-tuned by 108 %converted into a deep supervised feedforward neural network and fine-tuned by
142 classifiers better not only on similarly perturbed images but also on 123 classifiers better not only on similarly perturbed images but also on
143 the {\em original clean examples}? We study this question in the 124 the {\em original clean examples}? We study this question in the
144 context of the 62-class and 10-class tasks of the NIST special database 19. 125 context of the 62-class and 10-class tasks of the NIST special database 19.
145 126
146 $\bullet$ %\item 127 $\bullet$ %\item
147 Do deep architectures {\em benefit more from such out-of-distribution} 128 Do deep architectures {\em benefit {\bf more} from such out-of-distribution}
148 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? 129 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
149 We use highly perturbed examples to generate out-of-distribution examples. 130 We use highly perturbed examples to generate out-of-distribution examples.
150 131
151 $\bullet$ %\item 132 $\bullet$ %\item
152 Similarly, does the feature learning step in deep learning algorithms benefit more 133 Similarly, does the feature learning step in deep learning algorithms benefit {\bf more}
153 from training with moderately different classes (i.e. a multi-task learning scenario) than 134 from training with moderately {\em different classes} (i.e. a multi-task learning scenario) than
154 a corresponding shallow and purely supervised architecture? 135 a corresponding shallow and purely supervised architecture?
155 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) 136 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case)
156 to answer this question. 137 to answer this question.
157 %\end{enumerate} 138 %\end{enumerate}
158 139
159 Our experimental results provide positive evidence towards all of these questions. 140 Our experimental results provide positive evidence towards all of these questions,
141 as well as classifiers that reach human-level performance on 62-class isolated character
142 recognition and beat previously published results on the NIST dataset (special database 19).
160 To achieve these results, we introduce in the next section a sophisticated system 143 To achieve these results, we introduce in the next section a sophisticated system
161 for stochastically transforming character images and then explain the methodology, 144 for stochastically transforming character images and then explain the methodology,
162 which is based on training with or without these transformed images and testing on 145 which is based on training with or without these transformed images and testing on
163 clean ones. We measure the relative advantage of out-of-distribution examples 146 clean ones. We measure the relative advantage of out-of-distribution examples
147 (perturbed or out-of-class)
164 for a deep learner vs a supervised shallow one. 148 for a deep learner vs a supervised shallow one.
165 Code for generating these transformations as well as for the deep learning 149 Code for generating these transformations as well as for the deep learning
166 algorithms is made available. 150 algorithms is made available at {\tt http://hg.assembla.com/ift6266}.
167 We also estimate the relative advantage for deep learners of training with 151 We estimate the relative advantage for deep learners of training with
168 other classes than those of interest, by comparing learners trained with 152 other classes than those of interest, by comparing learners trained with
169 62 classes with learners trained with only a subset (on which they 153 62 classes with learners trained with only a subset (on which they
170 are then tested). 154 are then tested).
171 The conclusion discusses 155 The conclusion discusses
172 the more general question of why deep learners may benefit so much from 156 the more general question of why deep learners may benefit so much from
173 the self-taught learning framework. 157 the self-taught learning framework. Since out-of-distribution data
158 (perturbed or from other related classes) is very common, this conclusion
159 is of practical importance.
174 160
175 %\vspace*{-3mm} 161 %\vspace*{-3mm}
176 \newpage 162 %\newpage
177 \section{Perturbation and Transformation of Character Images} 163 \section{Perturbed and Transformed Character Images}
178 \label{s:perturbations} 164 \label{s:perturbations}
179 %\vspace*{-2mm} 165 %\vspace*{-2mm}
180 166
181 \begin{wrapfigure}[8]{l}{0.15\textwidth} 167 \begin{wrapfigure}[8]{l}{0.15\textwidth}
182 %\begin{minipage}[b]{0.14\linewidth} 168 %\begin{minipage}[b]{0.14\linewidth}
183 %\vspace*{-5mm} 169 %\vspace*{-5mm}
184 \begin{center} 170 \begin{center}
185 \includegraphics[scale=.4]{images/Original.png}\\ 171 \includegraphics[scale=.4]{Original.png}\\
186 {\bf Original} 172 {\bf Original}
187 \end{center} 173 \end{center}
188 \end{wrapfigure} 174 \end{wrapfigure}
189 %%\vspace{0.7cm} 175 %%\vspace{0.7cm}
190 %\end{minipage}% 176 %\end{minipage}%
196 which we start. 182 which we start.
197 Although character transformations have been used before to 183 Although character transformations have been used before to
198 improve character recognizers, this effort is on a large scale both 184 improve character recognizers, this effort is on a large scale both
199 in the number of classes and in the complexity of the transformations, hence 185 in the number of classes and in the complexity of the transformations, hence
200 in the complexity of the learning task. 186 in the complexity of the learning task.
201 More details can
202 be found in this technical report~\citep{ift6266-tr-anonymous}.
203 The code for these transformations (mostly python) is available at 187 The code for these transformations (mostly python) is available at
204 {\tt http://anonymous.url.net}. All the modules in the pipeline share 188 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
205 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the 189 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
206 amount of deformation or noise introduced. 190 amount of deformation or noise introduced.
207 There are two main parts in the pipeline. The first one, 191 There are two main parts in the pipeline. The first one,
208 from slant to pinch below, performs transformations. The second 192 from slant to pinch below, performs transformations. The second
209 part, from blur to contrast, adds different kinds of noise. 193 part, from blur to contrast, adds different kinds of noise.
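To make the control flow concrete, the following minimal Python sketch (illustrative only, not the released ift6266 code; the module names and interface are assumptions) shows how every module can share the single $complexity$ parameter, with the transformation modules applied before the noise modules:
\begin{verbatim}
import numpy as np

class TransformationPipeline(object):
    """Chain of modules sharing one complexity knob in [0, 1]."""
    def __init__(self, transformations, noise_modules, complexity=0.5):
        assert 0.0 <= complexity <= 1.0
        # Transformations (slant ... pinch) run before noise (blur ... contrast).
        self.modules = list(transformations) + list(noise_modules)
        self.complexity = complexity

    def __call__(self, image, rng):
        for module in self.modules:
            image = module(image, self.complexity, rng)
        return image

def slant(image, complexity, rng):
    # Placeholder module obeying the assumed interface: the amount of
    # deformation grows with the shared complexity parameter.
    shear = rng.uniform(-complexity, complexity)
    return image  # a real module would apply a horizontal shear of `shear`

pipeline = TransformationPipeline([slant], [], complexity=0.3)
out = pipeline(np.zeros((32, 32)), np.random.RandomState(0))
\end{verbatim}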
219 %\begin{wrapfigure}[7]{l}{0.15\textwidth} 203 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
220 \begin{minipage}[b]{0.14\linewidth} 204 \begin{minipage}[b]{0.14\linewidth}
221 %\centering 205 %\centering
222 \begin{center} 206 \begin{center}
223 \vspace*{-5mm} 207 \vspace*{-5mm}
224 \includegraphics[scale=.4]{images/Thick_only.png}\\ 208 \includegraphics[scale=.4]{Thick_only.png}\\
225 %{\bf Thickness} 209 %{\bf Thickness}
226 \end{center} 210 \end{center}
227 \vspace{.6cm} 211 \vspace{.6cm}
228 \end{minipage}% 212 \end{minipage}%
229 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} 213 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
247 \subsubsection*{Slant} 231 \subsubsection*{Slant}
248 \vspace*{2mm} 232 \vspace*{2mm}
249 233
250 \begin{minipage}[b]{0.14\linewidth} 234 \begin{minipage}[b]{0.14\linewidth}
251 \centering 235 \centering
252 \includegraphics[scale=.4]{images/Slant_only.png}\\ 236 \includegraphics[scale=.4]{Slant_only.png}\\
253 %{\bf Slant} 237 %{\bf Slant}
254 \end{minipage}% 238 \end{minipage}%
255 \hspace{0.3cm} 239 \hspace{0.3cm}
256 \begin{minipage}[b]{0.83\linewidth} 240 \begin{minipage}[b]{0.83\linewidth}
257 %\centering 241 %\centering
269 253
270 \begin{minipage}[b]{0.14\linewidth} 254 \begin{minipage}[b]{0.14\linewidth}
271 %\centering 255 %\centering
272 %\begin{wrapfigure}[8]{l}{0.15\textwidth} 256 %\begin{wrapfigure}[8]{l}{0.15\textwidth}
273 \begin{center} 257 \begin{center}
274 \includegraphics[scale=.4]{images/Affine_only.png} 258 \includegraphics[scale=.4]{Affine_only.png}
275 \vspace*{6mm} 259 \vspace*{6mm}
276 %{\small {\bf Affine \mbox{Transformation}}} 260 %{\small {\bf Affine \mbox{Transformation}}}
277 \end{center} 261 \end{center}
278 %\end{wrapfigure} 262 %\end{wrapfigure}
279 \end{minipage}% 263 \end{minipage}%
299 %\hspace*{-8mm} 283 %\hspace*{-8mm}
300 \begin{minipage}[b]{0.14\linewidth} 284 \begin{minipage}[b]{0.14\linewidth}
301 %\centering 285 %\centering
302 \begin{center} 286 \begin{center}
303 \vspace*{5mm} 287 \vspace*{5mm}
304 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png} 288 \includegraphics[scale=.4]{Localelasticdistorsions_only.png}
305 %{\bf Local Elastic Deformation} 289 %{\bf Local Elastic Deformation}
306 \end{center} 290 \end{center}
307 %\end{wrapfigure} 291 %\end{wrapfigure}
308 \end{minipage}% 292 \end{minipage}%
309 \hspace{3mm} 293 \hspace{3mm}
326 \begin{minipage}[b]{0.14\linewidth} 310 \begin{minipage}[b]{0.14\linewidth}
327 %\centering 311 %\centering
328 %\begin{wrapfigure}[7]{l}{0.15\textwidth} 312 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
329 %\vspace*{-5mm} 313 %\vspace*{-5mm}
330 \begin{center} 314 \begin{center}
331 \includegraphics[scale=.4]{images/Pinch_only.png}\\ 315 \includegraphics[scale=.4]{Pinch_only.png}\\
332 \vspace*{15mm} 316 \vspace*{15mm}
333 %{\bf Pinch} 317 %{\bf Pinch}
334 \end{center} 318 \end{center}
335 %\end{wrapfigure} 319 %\end{wrapfigure}
336 %%\vspace{.6cm} 320 %%\vspace{.6cm}
363 347
364 %%\vspace*{-.2cm} 348 %%\vspace*{-.2cm}
365 \begin{minipage}[t]{0.14\linewidth} 349 \begin{minipage}[t]{0.14\linewidth}
366 \centering 350 \centering
367 \vspace*{0mm} 351 \vspace*{0mm}
368 \includegraphics[scale=.4]{images/Motionblur_only.png} 352 \includegraphics[scale=.4]{Motionblur_only.png}
369 %{\bf Motion Blur} 353 %{\bf Motion Blur}
370 \end{minipage}% 354 \end{minipage}%
371 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 355 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
372 %%\vspace*{.5mm} 356 %%\vspace*{.5mm}
373 \vspace*{2mm} 357 \vspace*{2mm}
384 \subsubsection*{Occlusion} 368 \subsubsection*{Occlusion}
385 369
386 \begin{minipage}[t]{0.14\linewidth} 370 \begin{minipage}[t]{0.14\linewidth}
387 \centering 371 \centering
388 \vspace*{3mm} 372 \vspace*{3mm}
389 \includegraphics[scale=.4]{images/occlusion_only.png}\\ 373 \includegraphics[scale=.4]{occlusion_only.png}\\
390 %{\bf Occlusion} 374 %{\bf Occlusion}
391 %%\vspace{.5cm} 375 %%\vspace{.5cm}
392 \end{minipage}% 376 \end{minipage}%
393 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 377 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
394 %\vspace*{-18mm} 378 %\vspace*{-18mm}
397 image. Pixels are combined by taking the max(occluder, occluded), 381 image. Pixels are combined by taking the max(occluder, occluded),
398 i.e. keeping the lighter ones. 382 i.e. keeping the lighter ones.
399 The rectangle corners 383 The rectangle corners
400 are sampled so that larger complexity gives larger rectangles. 384 are sampled so that larger complexity gives larger rectangles.
401 The destination position in the occluded image is also sampled 385 The destination position in the occluded image is also sampled
402 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). 386 according to a normal distribution.
403 This module is skipped with probability 60\%. 387 This module is skipped with probability 60\%.
404 %%\vspace{7mm} 388 %%\vspace{7mm}
405 \end{minipage} 389 \end{minipage}
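A minimal sketch of such an occlusion step is given below, assuming grey-scale arrays with values in $[0,1]$; only the max-combination, the complexity-dependent rectangle size, the normally distributed destination and the 60\% skip probability are taken from the description above, the exact sampling distributions being illustrative:
\begin{verbatim}
import numpy as np

def occlude(image, occluder, complexity, rng):
    # Skipped with probability 60%, as in the text.
    if rng.uniform() < 0.6:
        return image
    h, w = image.shape
    # Rectangle size grows with complexity (illustrative scaling only).
    rh = max(1, int(rng.uniform(0.2, 1.0) * complexity * h))
    rw = max(1, int(rng.uniform(0.2, 1.0) * complexity * w))
    sy = rng.randint(0, h - rh + 1)
    sx = rng.randint(0, w - rw + 1)
    patch = occluder[sy:sy + rh, sx:sx + rw]
    # Destination sampled from a normal distribution around the image centre.
    dy = int(np.clip(rng.normal(h / 2.0, h / 4.0), 0, h - rh))
    dx = int(np.clip(rng.normal(w / 2.0, w / 4.0), 0, w - rw))
    out = image.copy()
    # Combine pixels with max(occluder, occluded), keeping the lighter ones.
    out[dy:dy + rh, dx:dx + rw] = np.maximum(out[dy:dy + rh, dx:dx + rw], patch)
    return out

rng = np.random.RandomState(0)
img = rng.uniform(size=(32, 32))
occluded = occlude(img, rng.uniform(size=(32, 32)), complexity=0.7, rng=rng)
\end{verbatim}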
406 390
407 %\vspace*{1mm} 391 %\vspace*{1mm}
411 %\vspace*{-6mm} 395 %\vspace*{-6mm}
412 \begin{minipage}[t]{0.14\linewidth} 396 \begin{minipage}[t]{0.14\linewidth}
413 \begin{center} 397 \begin{center}
414 %\centering 398 %\centering
415 \vspace*{6mm} 399 \vspace*{6mm}
416 \includegraphics[scale=.4]{images/Bruitgauss_only.png} 400 \includegraphics[scale=.4]{Bruitgauss_only.png}
417 %{\bf Gaussian Smoothing} 401 %{\bf Gaussian Smoothing}
418 \end{center} 402 \end{center}
419 %\end{wrapfigure} 403 %\end{wrapfigure}
420 %%\vspace{.5cm} 404 %%\vspace{.5cm}
421 \end{minipage}% 405 \end{minipage}%
447 \begin{minipage}[t]{0.14\textwidth} 431 \begin{minipage}[t]{0.14\textwidth}
448 %\begin{wrapfigure}[7]{l}{ 432 %\begin{wrapfigure}[7]{l}{
449 %\vspace*{-5mm} 433 %\vspace*{-5mm}
450 \begin{center} 434 \begin{center}
451 \vspace*{1mm} 435 \vspace*{1mm}
452 \includegraphics[scale=.4]{images/Permutpixel_only.png} 436 \includegraphics[scale=.4]{Permutpixel_only.png}
453 %{\small\bf Permute Pixels} 437 %{\small\bf Permute Pixels}
454 \end{center} 438 \end{center}
455 %\end{wrapfigure} 439 %\end{wrapfigure}
456 \end{minipage}% 440 \end{minipage}%
457 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth} 441 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth}
474 %%\vspace*{-3mm} 458 %%\vspace*{-3mm}
475 \begin{center} 459 \begin{center}
476 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} 460 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
477 %\centering 461 %\centering
478 \vspace*{0mm} 462 \vspace*{0mm}
479 \includegraphics[scale=.4]{images/Distorsiongauss_only.png} 463 \includegraphics[scale=.4]{Distorsiongauss_only.png}
480 %{\small \bf Gauss. Noise} 464 %{\small \bf Gauss. Noise}
481 \end{center} 465 \end{center}
482 %\end{wrapfigure} 466 %\end{wrapfigure}
483 \end{minipage}% 467 \end{minipage}%
484 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} 468 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
496 480
497 \begin{minipage}[t]{\linewidth} 481 \begin{minipage}[t]{\linewidth}
498 \begin{minipage}[t]{0.14\linewidth} 482 \begin{minipage}[t]{0.14\linewidth}
499 \centering 483 \centering
500 \vspace*{0mm} 484 \vspace*{0mm}
501 \includegraphics[scale=.4]{images/background_other_only.png} 485 \includegraphics[scale=.4]{background_other_only.png}
502 %{\small \bf Bg Image} 486 %{\small \bf Bg Image}
503 \end{minipage}% 487 \end{minipage}%
504 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 488 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
505 \vspace*{1mm} 489 \vspace*{1mm}
506 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random 490 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random
515 \subsubsection*{Salt and Pepper Noise} 499 \subsubsection*{Salt and Pepper Noise}
516 500
517 \begin{minipage}[t]{0.14\linewidth} 501 \begin{minipage}[t]{0.14\linewidth}
518 \centering 502 \centering
519 \vspace*{0mm} 503 \vspace*{0mm}
520 \includegraphics[scale=.4]{images/Poivresel_only.png} 504 \includegraphics[scale=.4]{Poivresel_only.png}
521 %{\small \bf Salt \& Pepper} 505 %{\small \bf Salt \& Pepper}
522 \end{minipage}% 506 \end{minipage}%
523 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 507 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
524 \vspace*{1mm} 508 \vspace*{1mm}
525 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels. 509 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels.
537 %\begin{minipage}[t]{0.14\linewidth} 521 %\begin{minipage}[t]{0.14\linewidth}
538 %\centering 522 %\centering
539 \begin{center} 523 \begin{center}
540 \vspace*{4mm} 524 \vspace*{4mm}
541 %\hspace*{-1mm} 525 %\hspace*{-1mm}
542 \includegraphics[scale=.4]{images/Rature_only.png}\\ 526 \includegraphics[scale=.4]{Rature_only.png}\\
543 %{\bf Scratches} 527 %{\bf Scratches}
544 \end{center} 528 \end{center}
545 \end{minipage}% 529 \end{minipage}%
546 %\end{wrapfigure} 530 %\end{wrapfigure}
547 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} 531 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
563 \subsubsection*{Grey Level and Contrast Changes} 547 \subsubsection*{Grey Level and Contrast Changes}
564 548
565 \begin{minipage}[t]{0.15\linewidth} 549 \begin{minipage}[t]{0.15\linewidth}
566 \centering 550 \centering
567 \vspace*{0mm} 551 \vspace*{0mm}
568 \includegraphics[scale=.4]{images/Contrast_only.png} 552 \includegraphics[scale=.4]{Contrast_only.png}
569 %{\bf Grey Level \& Contrast} 553 %{\bf Grey Level \& Contrast}
570 \end{minipage}% 554 \end{minipage}%
571 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth} 555 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth}
572 \vspace*{1mm} 556 \vspace*{1mm}
573 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white 557 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white
579 %\vspace{2mm} 563 %\vspace{2mm}
580 564
581 565
582 \iffalse 566 \iffalse
583 \begin{figure}[ht] 567 \begin{figure}[ht]
584 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\ 568 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{example_t.png}}}\\
585 \caption{Illustration of the pipeline of stochastic 569 \caption{Illustration of the pipeline of stochastic
586 transformations applied to the image of a lower-case \emph{t} 570 transformations applied to the image of a lower-case \emph{t}
587 (the upper left image). Each image in the pipeline (going from 571 (the upper left image). Each image in the pipeline (going from
588 left to right, first top line, then bottom line) shows the result 572 left to right, first top line, then bottom line) shows the result
589 of applying one of the modules in the pipeline. The last image 573 of applying one of the modules in the pipeline. The last image
624 %\citep{SorokinAndForsyth2008,whitehill09}. 608 %\citep{SorokinAndForsyth2008,whitehill09}.
625 AMT users were presented 609 AMT users were presented
626 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII 610 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
627 characters. They were forced to choose a single character class (either among the 611 characters. They were forced to choose a single character class (either among the
628 62 or 10 character classes) for each image. 612 62 or 10 character classes) for each image.
629 80 subjects classified 2500 images per (dataset,task) pair, 613 80 subjects classified 2500 images per (dataset,task) pair.
630 with the guarantee that 3 different subjects classified each image, allowing 614 Different human labelers sometimes provided a different label for the same
631 us to estimate inter-human variability (e.g a standard error of 0.1\% 615 example, and we were able to estimate the error variance due to this effect
632 on the average 18.2\% error done by humans on the 62-class task NIST test set). 616 because each image was classified by 3 different persons.
617 The average error of humans on the 62-class task NIST test set
618 is 18.2\%, with a standard error of 0.1\%.
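The quoted standard error is what one obtains by treating the human error rate as a binomial proportion; the short calculation below is purely illustrative (the exact number of human judgements is not restated here, so the value of $n$ is an assumption):
\begin{verbatim}
import math

def error_rate_stderr(p, n):
    # Standard error of an error rate treated as a binomial proportion.
    return math.sqrt(p * (1.0 - p) / n)

# With p = 0.182, a standard error of about 0.1% corresponds to roughly
# n ~ 150,000 labelled test judgements (n chosen here for illustration).
print(error_rate_stderr(0.182, 150000))   # ~ 0.000996
\end{verbatim}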
633 619
634 %\vspace*{-3mm} 620 %\vspace*{-3mm}
635 \subsection{Data Sources} 621 \subsection{Data Sources}
636 %\vspace*{-2mm} 622 %\vspace*{-2mm}
637 623
731 717
732 {\bf Multi-Layer Perceptrons (MLP).} 718 {\bf Multi-Layer Perceptrons (MLP).}
733 Whereas previous work had compared deep architectures to both shallow MLPs and 719 Whereas previous work had compared deep architectures to both shallow MLPs and
734 SVMs, we only compared to MLPs here because of the very large datasets used 720 SVMs, we only compared to MLPs here because of the very large datasets used
735 (making the use of SVMs computationally challenging because of their quadratic 721 (making the use of SVMs computationally challenging because of their quadratic
736 scaling behavior). 722 scaling behavior). Preliminary experiments training SVMs (libSVM) on subsets of the training
723 set small enough to fit in memory yielded substantially worse results
724 than those obtained with MLPs. For training on nearly a billion examples
725 (with the perturbed data), the MLPs and SDA are much more convenient than
726 classifiers based on kernel methods.
737 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized 727 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
738 exponentials) on the output layer for estimating $P(class | image)$. 728 exponentials) on the output layer for estimating $P(class | image)$.
739 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 729 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
740 Training examples are presented in minibatches of size 20. A constant learning 730 Training examples are presented in minibatches of size 20. A constant learning
741 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. 731 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
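For concreteness, a minimal numpy sketch of such a shallow baseline follows (not the authors' implementation; initialization scales and the example sizes are placeholders): one $\tanh$ hidden layer, a softmax output estimating $P(class | image)$, and plain stochastic gradient descent on minibatches of 20 with a constant learning rate.
\begin{verbatim}
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class MLP(object):
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.1, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = np.tanh(x.dot(self.W1) + self.b1)
        return h, softmax(h.dot(self.W2) + self.b2)

    def sgd_step(self, x, y_onehot, lr=0.01):
        h, p = self.forward(x)
        # Gradient of the minibatch negative log-likelihood of P(class|image).
        d_out = (p - y_onehot) / x.shape[0]
        d_h = d_out.dot(self.W2.T) * (1.0 - h ** 2)
        self.W2 -= lr * h.T.dot(d_out); self.b2 -= lr * d_out.sum(axis=0)
        self.W1 -= lr * x.T.dot(d_h);   self.b1 -= lr * d_h.sum(axis=0)

rng = np.random.RandomState(0)
mlp = MLP(n_in=32 * 32, n_hidden=1000, n_out=62, rng=rng)
x = rng.uniform(size=(20, 32 * 32))          # one minibatch of 20 images
y = np.eye(62)[rng.randint(0, 62, size=20)]  # one-hot labels
mlp.sgd_step(x, y, lr=0.01)
\end{verbatim}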
749 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) 739 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
750 can be used to initialize the weights of each layer of a deep MLP (with many hidden 740 can be used to initialize the weights of each layer of a deep MLP (with many hidden
751 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, 741 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
752 apparently setting parameters in the 742 apparently setting parameters in the
753 basin of attraction of supervised gradient descent that yields better 743 basin of attraction of supervised gradient descent that yields better
754 generalization~\citep{Erhan+al-2010}. It is hypothesized that the 744 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
745 pre-training phase} uses all of the training images but not the training labels.
746 Each layer is trained in turn to produce a new representation of its input
747 (starting from the raw pixels).
748 It is hypothesized that the
755 advantage brought by this procedure stems from a better prior, 749 advantage brought by this procedure stems from a better prior,
756 on the one hand taking advantage of the link between the input 750 on the one hand taking advantage of the link between the input
757 distribution $P(x)$ and the conditional distribution of interest 751 distribution $P(x)$ and the conditional distribution of interest
758 $P(y|x)$ (like in semi-supervised learning), and on the other hand 752 $P(y|x)$ (like in semi-supervised learning), and on the other hand
759 taking advantage of the expressive power and bias implicit in the 753 taking advantage of the expressive power and bias implicit in the
760 deep architecture (whereby complex concepts are expressed as 754 deep architecture (whereby complex concepts are expressed as
761 compositions of simpler ones through a deep hierarchy). 755 compositions of simpler ones through a deep hierarchy).
762 756
763 \begin{figure}[ht] 757 \begin{figure}[ht]
764 %\vspace*{-2mm} 758 %\vspace*{-2mm}
765 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} 759 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{denoising_autoencoder_small.pdf}}}
766 %\vspace*{-2mm} 760 %\vspace*{-2mm}
767 \caption{Illustration of the computations and training criterion for the denoising 761 \caption{Illustration of the computations and training criterion for the denoising
768 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of 762 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
769 the layer (i.e. raw input or output of previous layer) 763 the layer (i.e. raw input or output of previous layer)
770 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. 764 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
800 fixed proportion of the input values, randomly selected, are zeroed), and a 794 fixed proportion of the input values, randomly selected, are zeroed), and a
801 separate learning rate for the unsupervised pre-training stage (selected 795 separate learning rate for the unsupervised pre-training stage (selected
802 from the same above set). The fraction of inputs corrupted was selected 796 from the same above set). The fraction of inputs corrupted was selected
803 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number 797 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
804 of hidden layers but it was fixed to 3 based on previous work with 798 of hidden layers but it was fixed to 3 based on previous work with
805 SDAs on MNIST~\citep{VincentPLarochelleH2008}. 799 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The number of hidden
800 units was kept the same across all hidden layers, and the best results
801 were obtained with the largest value that we could afford to
802 experiment with, namely 1000 hidden units.
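The sketch below illustrates, under the assumptions above (it is not the released code), one denoising auto-encoder layer with tied weights and the greedy layer-wise pre-training loop: a fraction of the input components is zeroed, the corrupted input is encoded and decoded, the parameters are updated to reconstruct the clean input, and the code of each trained layer becomes the input of the next.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder(object):
    def __init__(self, n_in, n_hidden, rng, corruption=0.2):
        self.W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_in)
        self.corruption = corruption
        self.rng = rng

    def encode(self, x):
        return sigmoid(x.dot(self.W) + self.b_h)

    def train_step(self, x, lr=0.01):
        # Corrupt: zero a `corruption` fraction of randomly chosen inputs.
        mask = self.rng.uniform(size=x.shape) > self.corruption
        x_tilde = x * mask
        y = self.encode(x_tilde)
        z = sigmoid(y.dot(self.W.T) + self.b_v)   # tied-weight reconstruction
        # Cross-entropy reconstruction gradient (inputs assumed in [0, 1]).
        d_z = (z - x) / x.shape[0]
        d_y = d_z.dot(self.W) * y * (1.0 - y)
        self.W -= lr * (x_tilde.T.dot(d_y) + d_z.T.dot(y))
        self.b_v -= lr * d_z.sum(axis=0)
        self.b_h -= lr * d_y.sum(axis=0)
        return y

# Greedy layer-wise pre-training: each layer's code feeds the next layer.
rng = np.random.RandomState(0)
layers = [DenoisingAutoencoder(32 * 32, 1000, rng),
          DenoisingAutoencoder(1000, 1000, rng),
          DenoisingAutoencoder(1000, 1000, rng)]
x = rng.uniform(size=(20, 32 * 32))
for da in layers:
    x = da.train_step(x)
\end{verbatim}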
806 803
807 %\vspace*{-1mm} 804 %\vspace*{-1mm}
808 805
809 \begin{figure}[ht] 806 \begin{figure}[ht]
810 %\vspace*{-2mm} 807 %\vspace*{-2mm}
811 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} 808 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{error_rates_charts.pdf}}}
812 %\vspace*{-3mm} 809 %\vspace*{-3mm}
813 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained 810 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
814 on NIST, 1 on NISTP, and 2 on P07. Left: overall results 811 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
815 of all models, on NIST and NISTP test sets. 812 of all models, on NIST and NISTP test sets.
816 Right: error rates on NIST test digits only, along with the previous results from 813 Right: error rates on NIST test digits only, along with the previous results from
821 \end{figure} 818 \end{figure}
822 819
823 820
824 \begin{figure}[ht] 821 \begin{figure}[ht]
825 %\vspace*{-3mm} 822 %\vspace*{-3mm}
826 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} 823 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{improvements_charts.pdf}}}
827 %\vspace*{-3mm} 824 %\vspace*{-3mm}
828 \caption{Relative improvement in error rate due to self-taught learning. 825 \caption{Relative improvement in error rate due to self-taught learning.
829 Left: Improvement (or loss, when negative) 826 Left: Improvement (or loss, when negative)
830 induced by out-of-distribution examples (perturbed data). 827 induced by out-of-distribution examples (perturbed data).
831 Right: Improvement (or loss, when negative) induced by multi-task 828 Right: Improvement (or loss, when negative) induced by multi-task
854 19 test set from the literature, respectively based on ARTMAP neural 851 19 test set from the literature, respectively based on ARTMAP neural
855 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search 852 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search
856 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs 853 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs
857 ~\citep{Milgram+al-2005}. More detailed and complete numerical results 854 ~\citep{Milgram+al-2005}. More detailed and complete numerical results
858 (figures and tables, including standard errors on the error rates) can be 855 (figures and tables, including standard errors on the error rates) can be
859 found in Appendix I of the supplementary material. 856 found in Appendix I.
860 The deep learner not only outperformed the shallow ones and 857 The deep learner not only outperformed the shallow ones and
861 previously published performance (in a statistically and qualitatively 858 previously published performance (in a statistically and qualitatively
862 significant way) but when trained with perturbed data 859 significant way) but when trained with perturbed data
863 reaches human performance on both the 62-class task 860 reaches human performance on both the 62-class task
864 and the 10-class (digits) task. 861 and the 10-class (digits) task.
945 {\bf Do the good results previously obtained with deep architectures on the 942 {\bf Do the good results previously obtained with deep architectures on the
946 MNIST digits generalize to a much larger and richer (but similar) 943 MNIST digits generalize to a much larger and richer (but similar)
947 dataset, the NIST special database 19, with 62 classes and around 800k examples}? 944 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
948 Yes, the SDA {\em systematically outperformed the MLP and all the previously 945 Yes, the SDA {\em systematically outperformed the MLP and all the previously
949 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level 946 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
950 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. 947 performance} at around 17\% error on the 62-class task and 1.4\% on the digits,
948 and beating previously published results on the same data.
951 949
952 $\bullet$ %\item 950 $\bullet$ %\item
953 {\bf To what extent do self-taught learning scenarios help deep learners, 951 {\bf To what extent do self-taught learning scenarios help deep learners,
954 and do they help them more than shallow supervised ones}? 952 and do they help them more than shallow supervised ones}?
955 We found that distorted training examples not only made the resulting 953 We found that distorted training examples not only made the resulting
981 in the asymptotic regime. 979 in the asymptotic regime.
982 980
983 {\bf Why would deep learners benefit more from the self-taught learning framework}? 981 {\bf Why would deep learners benefit more from the self-taught learning framework}?
984 The key idea is that the lower layers of the predictor compute a hierarchy 982 The key idea is that the lower layers of the predictor compute a hierarchy
985 of features that can be shared across tasks or across variants of the 983 of features that can be shared across tasks or across variants of the
986 input distribution. Intermediate features that can be used in different 984 input distribution. A theoretical analysis of generalization improvements
985 due to sharing of intermediate features across tasks already points
986 towards that explanation~\citep{baxter95a}.
987 Intermediate features that can be used in different
987 contexts can be estimated in a way that allows to share statistical 988 contexts can be estimated in a way that allows to share statistical
988 strength. Features extracted through many levels are more likely to 989 strength. Features extracted through many levels are more likely to
989 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest), 990 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
990 increasing the likelihood that they would be useful for a larger array 991 increasing the likelihood that they would be useful for a larger array
991 of tasks and input conditions. 992 of tasks and input conditions.
1009 with deep learning and self-taught learning. 1010 with deep learning and self-taught learning.
1010 1011
1011 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 1012 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
1012 can be executed on-line at {\tt http://deep.host22.com}. 1013 can be executed on-line at {\tt http://deep.host22.com}.
1013 1014
1014 %\newpage 1015
1016 \section*{Appendix I: Detailed Numerical Results}
1017
1018 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered.
1019 They also contain additional data such as test errors on P07 and standard errors.
1020
1021 \begin{table}[ht]
1022 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
1023 26 lower + 26 upper), except for the last column -- digits only, between deep architecture with pre-training
1024 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture
1025 (MLP=Multi-Layer Perceptron). All models were trained on clean (NIST) or perturbed data (NISTP or P07),
1026 using a validation set to select hyper-parameters and other training choices.
1027 \{SDA,MLP\}0 are trained on NIST,
1028 \{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07.
1029 The human error rate on digits is a lower bound because it does not count digits that were
1030 recognized as letters. For comparison, the results found in the literature
1031 on NIST digits classification using the same test set are included.}
1032 \label{tab:sda-vs-mlp-vs-humans}
1033 \begin{center}
1034 \begin{tabular}{|l|r|r|r|r|} \hline
1035 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
1036 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $1.4\%$ \\ \hline
1037 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline
1038 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline
1039 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline
1040 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
1041 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline
1042 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline
1043 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline
1044 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline
1045 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
1046 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline
1047 \end{tabular}
1048 \end{center}
1049 \end{table}
1050
1051 \begin{table}[ht]
1052 \caption{Relative change in error rates due to the use of perturbed training data,
1053 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
1054 A positive value indicates that training on the perturbed data helped for the
1055 given test set (the first 3 columns on the 62-class tasks and the last one is
1056 on the clean 10-class digits). Clearly, the deep learning models did benefit more
1057 from perturbed training data, even when testing on clean data, whereas the MLP
1058 trained on perturbed data performed worse on the clean digits and about the same
1059 on the clean characters. }
1060 \label{tab:perturbation-effect}
1061 \begin{center}
1062 \begin{tabular}{|l|r|r|r|r|} \hline
1063 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
1064 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline
1065 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline
1066 MLP0/MLP1-1 & 5.2\% & 65\% & -13\% & -10\% \\ \hline
1067 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline
1068 \end{tabular}
1069 \end{center}
1070 \end{table}
1071
1072 \begin{table}[ht]
1073 \caption{Test error rates and relative change in error rates due to the use of
1074 a multi-task setting, i.e., training on each task in isolation vs training
1075 for all three tasks together, for MLPs vs SDAs. The SDA benefits much
1076 more from the multi-task setting. All experiments used only the
1077 unperturbed NIST data, using validation error for model selection.
1078 Relative improvement is 1 - single-task error / multi-task error.}
1079 \label{tab:multi-task}
1080 \begin{center}
1081 \begin{tabular}{|l|r|r|r|} \hline
1082 & single-task & multi-task & relative \\
1083 & setting & setting & improvement \\ \hline
1084 MLP-digits & 3.77\% & 3.99\% & 5.6\% \\ \hline
1085 MLP-lower & 17.4\% & 16.8\% & -4.1\% \\ \hline
1086 MLP-upper & 7.84\% & 7.54\% & -3.6\% \\ \hline
1087 SDA-digits & 2.6\% & 3.56\% & 27\% \\ \hline
1088 SDA-lower & 12.3\% & 14.4\% & 15\% \\ \hline
1089 SDA-upper & 5.93\% & 6.78\% & 13\% \\ \hline
1090 \end{tabular}
1091 \end{center}
1092 \end{table}
1093
1094 %\afterpage{\clearpage}
1095 \clearpage
1015 { 1096 {
1016 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,specials,aigaion-shorter} 1097 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,specials,aigaion-shorter}
1017 %\bibliographystyle{plainnat} 1098 %\bibliographystyle{plainnat}
1018 \bibliographystyle{unsrtnat} 1099 \bibliographystyle{unsrtnat}
1019 %\bibliographystyle{apalike} 1100 %\bibliographystyle{apalike}