comparison writeup/techreport.tex @ 584:81c6fde68a8a
corrections to techreport.tex
author | Yoshua Bengio <bengioy@iro.umontreal.ca>
---|---
date | Sat, 18 Sep 2010 18:25:11 -0400
parents | ae77edb9df67
children | (none)
583:ae77edb9df67 | 584:81c6fde68a8a |
---|---|
32 Razvan Pascanu \and | 32 Razvan Pascanu \and |
33 Salah Rifai \and | 33 Salah Rifai \and |
34 Francois Savard \and | 34 Francois Savard \and |
35 Guillaume Sicard | 35 Guillaume Sicard |
36 } | 36 } |
37 \date{June 8th, 2010, Technical Report 1353, Dept. IRO, U. Montreal} | 37 \date{June 3, 2010, Technical Report 1353, Dept. IRO, U. Montreal} |
38 | 38 |
39 \begin{document} | 39 \begin{document} |
40 | 40 |
41 %\makeanontitle | 41 %\makeanontitle |
42 \maketitle | 42 \maketitle |
43 | 43 |
44 %\vspace*{-2mm} | 44 %\vspace*{-2mm} |
45 \begin{abstract} | 45 \begin{abstract} |
46 Recent theoretical and empirical work in statistical machine learning has | 46 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. |
47 demonstrated the importance of learning algorithms for deep | |
48 architectures, i.e., function classes obtained by composing multiple | |
49 non-linear transformations. Self-taught learning (exploiting unlabeled | |
50 examples or examples from other distributions) has already been applied | |
51 to deep learners, but mostly to show the advantage of unlabeled | |
52 examples. Here we explore the advantage brought by {\em out-of-distribution examples}. | |
53 For this purpose we | |
54 developed a powerful generator of stochastic variations and noise | |
55 processes for character images, including not only affine transformations | |
56 but also slant, local elastic deformations, changes in thickness, | |
57 background images, grey level changes, contrast, occlusion, and various | |
58 types of noise. The out-of-distribution examples are obtained from these | |
59 highly distorted images or by including examples of object classes | |
60 different from those in the target test set. | |
61 We show that {\em deep learners benefit | |
62 more from them than a corresponding shallow learner}, at least in the area of | |
63 handwritten character recognition. In fact, we show that they reach | |
64 human-level performance on both handwritten digit classification and | |
65 62-class handwritten character recognition. | |
66 \end{abstract} | 47 \end{abstract} |
67 %\vspace*{-3mm} | 48 %\vspace*{-3mm} |
68 | 49 |
69 \section{Introduction} | 50 \section{Introduction} |
70 %\vspace*{-1mm} | 51 %\vspace*{-1mm} |
71 | 52 |
72 {\bf Deep Learning} has emerged as a promising new area of research in | 53 {\bf Deep Learning} has emerged as a promising new area of research in |
73 statistical machine learning (see~\citet{Bengio-2009} for a review). | 54 statistical machine learning (see~\citet{Bengio-2009} for a review). |
74 Learning algorithms for deep architectures are centered on the learning | 55 Learning algorithms for deep architectures are centered on the learning |
75 of useful representations of data, which are better suited to the task at hand. | 56 of useful representations of data, which are better suited to the task at hand, |
57 and are organized in a hierarchy with multiple levels. | |
76 This is in part inspired by observations of the mammalian visual cortex, | 58 This is in part inspired by observations of the mammalian visual cortex, |
77 which consists of a chain of processing elements, each of which is associated with a | 59 which consists of a chain of processing elements, each of which is associated with a |
78 different representation of the raw visual input. In fact, | 60 different representation of the raw visual input. In fact, |
79 it was found recently that the features learnt in deep architectures resemble | 61 it was found recently that the features learnt in deep architectures resemble |
80 those observed in the first two of these stages (in areas V1 and V2 | 62 those observed in the first two of these stages (in areas V1 and V2 |
102 advantage} of deep learning for these settings has not been evaluated. | 84 advantage} of deep learning for these settings has not been evaluated. |
103 The hypothesis discussed in the conclusion is that a deep hierarchy of features | 85 The hypothesis discussed in the conclusion is that a deep hierarchy of features |
104 may be better able to provide sharing of statistical strength | 86 may be better able to provide sharing of statistical strength |
105 between different regions in input space or different tasks. | 87 between different regions in input space or different tasks. |
106 | 88 |
107 \iffalse | |
108 Whereas a deep architecture can in principle be more powerful than a | 89 Whereas a deep architecture can in principle be more powerful than a |
109 shallow one in terms of representation, depth appears to render the | 90 shallow one in terms of representation, depth appears to render the |
110 training problem more difficult in terms of optimization and local minima. | 91 training problem more difficult in terms of optimization and local minima. |
111 It is also only recently that successful algorithms were proposed to | 92 It is also only recently that successful algorithms were proposed to |
112 overcome some of these difficulties. All are based on unsupervised | 93 overcome some of these difficulties. All are based on unsupervised |
117 which | 98 which |
118 performed similarly or better than previously proposed Restricted Boltzmann | 99 performed similarly or better than previously proposed Restricted Boltzmann |
119 Machines in terms of unsupervised extraction of a hierarchy of features | 100 Machines in terms of unsupervised extraction of a hierarchy of features |
120 useful for classification. Each layer is trained to denoise its | 101 useful for classification. Each layer is trained to denoise its |
121 input, creating a layer of features that can be used as input for the next layer. | 102 input, creating a layer of features that can be used as input for the next layer. |
122 \fi | 103 |
123 %The principle is that each layer starting from | 104 %The principle is that each layer starting from |
124 %the bottom is trained to encode its input (the output of the previous | 105 %the bottom is trained to encode its input (the output of the previous |
125 %layer) and to reconstruct it from a corrupted version. After this | 106 %layer) and to reconstruct it from a corrupted version. After this |
126 %unsupervised initialization, the stack of DAs can be | 107 %unsupervised initialization, the stack of DAs can be |
127 %converted into a deep supervised feedforward neural network and fine-tuned by | 108 %converted into a deep supervised feedforward neural network and fine-tuned by |
142 classifiers better not only on similarly perturbed images but also on | 123 classifiers better not only on similarly perturbed images but also on |
143 the {\em original clean examples}? We study this question in the | 124 the {\em original clean examples}? We study this question in the |
144 context of the 62-class and 10-class tasks of the NIST special database 19. | 125 context of the 62-class and 10-class tasks of the NIST special database 19. |
145 | 126 |
146 $\bullet$ %\item | 127 $\bullet$ %\item |
147 Do deep architectures {\em benefit more from such out-of-distribution} | 128 Do deep architectures {\em benefit {\bf more} from such out-of-distribution} |
148 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 129 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
149 We use highly perturbed examples to generate out-of-distribution examples. | 130 We use highly perturbed examples to generate out-of-distribution examples. |
150 | 131 |
151 $\bullet$ %\item | 132 $\bullet$ %\item |
152 Similarly, does the feature learning step in deep learning algorithms benefit more | 133 Similarly, does the feature learning step in deep learning algorithms benefit {\bf more} |
153 from training with moderately different classes (i.e. a multi-task learning scenario) than | 134 from training with moderately {\em different classes} (i.e. a multi-task learning scenario) than |
154 a corresponding shallow and purely supervised architecture? | 135 a corresponding shallow and purely supervised architecture? |
155 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) | 136 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) |
156 to answer this question. | 137 to answer this question. |
157 %\end{enumerate} | 138 %\end{enumerate} |
158 | 139 |
159 Our experimental results provide positive evidence towards all of these questions. | 140 Our experimental results provide positive evidence towards all of these questions, |
141 as well as classifiers that reach human-level performance on 62-class isolated character | |
142 recognition and beat previously published results on the NIST dataset (special database 19). | |
160 To achieve these results, we introduce in the next section a sophisticated system | 143 To achieve these results, we introduce in the next section a sophisticated system |
161 for stochastically transforming character images and then explain the methodology, | 144 for stochastically transforming character images and then explain the methodology, |
162 which is based on training with or without these transformed images and testing on | 145 which is based on training with or without these transformed images and testing on |
163 clean ones. We measure the relative advantage of out-of-distribution examples | 146 clean ones. We measure the relative advantage of out-of-distribution examples |
147 (perturbed or out-of-class) | |
164 for a deep learner vs a supervised shallow one. | 148 for a deep learner vs a supervised shallow one. |
165 Code for generating these transformations as well as for the deep learning | 149 Code for generating these transformations as well as for the deep learning |
166 algorithms are made available. | 150 algorithms are made available at {\tt http://hg.assembla.com/ift6266}. |
167 We also estimate the relative advantage for deep learners of training with | 151 We estimate the relative advantage for deep learners of training with |
168 other classes than those of interest, by comparing learners trained with | 152 other classes than those of interest, by comparing learners trained with |
169 62 classes with learners trained with only a subset (on which they | 153 62 classes with learners trained with only a subset (on which they |
170 are then tested). | 154 are then tested). |
171 The conclusion discusses | 155 The conclusion discusses |
172 the more general question of why deep learners may benefit so much from | 156 the more general question of why deep learners may benefit so much from |
173 the self-taught learning framework. | 157 the self-taught learning framework. Since out-of-distribution data |
158 (perturbed or from other related classes) is very common, this conclusion | |
159 is of practical importance. | |
174 | 160 |
175 %\vspace*{-3mm} | 161 %\vspace*{-3mm} |
176 \newpage | 162 %\newpage |
177 \section{Perturbation and Transformation of Character Images} | 163 \section{Perturbed and Transformed Character Images} |
178 \label{s:perturbations} | 164 \label{s:perturbations} |
179 %\vspace*{-2mm} | 165 %\vspace*{-2mm} |
180 | 166 |
181 \begin{wrapfigure}[8]{l}{0.15\textwidth} | 167 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
182 %\begin{minipage}[b]{0.14\linewidth} | 168 %\begin{minipage}[b]{0.14\linewidth} |
183 %\vspace*{-5mm} | 169 %\vspace*{-5mm} |
184 \begin{center} | 170 \begin{center} |
185 \includegraphics[scale=.4]{images/Original.png}\\ | 171 \includegraphics[scale=.4]{Original.png}\\ |
186 {\bf Original} | 172 {\bf Original} |
187 \end{center} | 173 \end{center} |
188 \end{wrapfigure} | 174 \end{wrapfigure} |
189 %%\vspace{0.7cm} | 175 %%\vspace{0.7cm} |
190 %\end{minipage}% | 176 %\end{minipage}% |
196 which we start. | 182 which we start. |
197 Although character transformations have been used before to | 183 Although character transformations have been used before to |
198 improve character recognizers, this effort is on a large scale both | 184 improve character recognizers, this effort is on a large scale both |
199 in number of classes and in the complexity of the transformations, hence | 185 in number of classes and in the complexity of the transformations, hence |
200 in the complexity of the learning task. | 186 in the complexity of the learning task. |
201 More details can | |
202 be found in this technical report~\citep{ift6266-tr-anonymous}. | |
203 The code for these transformations (mostly python) is available at | 187 The code for these transformations (mostly python) is available at |
204 {\tt http://anonymous.url.net}. All the modules in the pipeline share | 188 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share |
205 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 189 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the |
206 amount of deformation or noise introduced. | 190 amount of deformation or noise introduced. |
207 There are two main parts in the pipeline. The first one, | 191 There are two main parts in the pipeline. The first one, |
208 from slant to pinch below, performs transformations. The second | 192 from slant to pinch below, performs transformations. The second |
209 part, from blur to contrast, adds different kinds of noise. | 193 part, from blur to contrast, adds different kinds of noise. |
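The contract described above — a sequence of modules, each modulated by one global `complexity` knob and some skipped with a fixed probability — can be sketched as follows. This is an illustrative sketch, not the authors' actual code; the module names, skip probabilities, and the list-of-grey-levels "image" are all placeholder assumptions.

```python
import random

def apply_pipeline(image, modules, complexity, rng=None):
    """Run the image through each pipeline module in order.

    `modules` is a list of (transform, skip_probability) pairs; each
    transform takes (image, complexity) and returns a new image, so a
    single global complexity parameter in [0, 1] (0 = clean, 1 = maximally
    distorted) modulates the whole pipeline, as described in the text.
    """
    assert 0.0 <= complexity <= 1.0
    rng = rng or random.Random(0)  # seeded for reproducibility of the sketch
    for transform, skip_prob in modules:
        if rng.random() < skip_prob:  # e.g. occlusion is skipped 60% of the time
            continue
        image = transform(image, complexity)
    return image

# Toy "modules" acting on a flat list of grey levels, just to show the contract.
thicken = (lambda img, c: [min(1.0, p + 0.3 * c) for p in img], 0.0)
occlude = (lambda img, c: [0.0] + img[1:], 0.6)  # skipped 60% of the time

out = apply_pipeline([0.2, 0.5, 0.9], [thicken, occlude], complexity=0.5)
```

Chaining the modules through a single shared parameter is what lets the generator produce anything from near-clean to heavily distorted examples from one code path.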
219 %\begin{wrapfigure}[7]{l}{0.15\textwidth} | 203 %\begin{wrapfigure}[7]{l}{0.15\textwidth} |
220 \begin{minipage}[b]{0.14\linewidth} | 204 \begin{minipage}[b]{0.14\linewidth} |
221 %\centering | 205 %\centering |
222 \begin{center} | 206 \begin{center} |
223 \vspace*{-5mm} | 207 \vspace*{-5mm} |
224 \includegraphics[scale=.4]{images/Thick_only.png}\\ | 208 \includegraphics[scale=.4]{Thick_only.png}\\ |
225 %{\bf Thickness} | 209 %{\bf Thickness} |
226 \end{center} | 210 \end{center} |
227 \vspace{.6cm} | 211 \vspace{.6cm} |
228 \end{minipage}% | 212 \end{minipage}% |
229 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} | 213 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} |
247 \subsubsection*{Slant} | 231 \subsubsection*{Slant} |
248 \vspace*{2mm} | 232 \vspace*{2mm} |
249 | 233 |
250 \begin{minipage}[b]{0.14\linewidth} | 234 \begin{minipage}[b]{0.14\linewidth} |
251 \centering | 235 \centering |
252 \includegraphics[scale=.4]{images/Slant_only.png}\\ | 236 \includegraphics[scale=.4]{Slant_only.png}\\ |
253 %{\bf Slant} | 237 %{\bf Slant} |
254 \end{minipage}% | 238 \end{minipage}% |
255 \hspace{0.3cm} | 239 \hspace{0.3cm} |
256 \begin{minipage}[b]{0.83\linewidth} | 240 \begin{minipage}[b]{0.83\linewidth} |
257 %\centering | 241 %\centering |
269 | 253 |
270 \begin{minipage}[b]{0.14\linewidth} | 254 \begin{minipage}[b]{0.14\linewidth} |
271 %\centering | 255 %\centering |
272 %\begin{wrapfigure}[8]{l}{0.15\textwidth} | 256 %\begin{wrapfigure}[8]{l}{0.15\textwidth} |
273 \begin{center} | 257 \begin{center} |
274 \includegraphics[scale=.4]{images/Affine_only.png} | 258 \includegraphics[scale=.4]{Affine_only.png} |
275 \vspace*{6mm} | 259 \vspace*{6mm} |
276 %{\small {\bf Affine \mbox{Transformation}}} | 260 %{\small {\bf Affine \mbox{Transformation}}} |
277 \end{center} | 261 \end{center} |
278 %\end{wrapfigure} | 262 %\end{wrapfigure} |
279 \end{minipage}% | 263 \end{minipage}% |
299 %\hspace*{-8mm} | 283 %\hspace*{-8mm} |
300 \begin{minipage}[b]{0.14\linewidth} | 284 \begin{minipage}[b]{0.14\linewidth} |
301 %\centering | 285 %\centering |
302 \begin{center} | 286 \begin{center} |
303 \vspace*{5mm} | 287 \vspace*{5mm} |
304 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png} | 288 \includegraphics[scale=.4]{Localelasticdistorsions_only.png} |
305 %{\bf Local Elastic Deformation} | 289 %{\bf Local Elastic Deformation} |
306 \end{center} | 290 \end{center} |
307 %\end{wrapfigure} | 291 %\end{wrapfigure} |
308 \end{minipage}% | 292 \end{minipage}% |
309 \hspace{3mm} | 293 \hspace{3mm} |
326 \begin{minipage}[b]{0.14\linewidth} | 310 \begin{minipage}[b]{0.14\linewidth} |
327 %\centering | 311 %\centering |
328 %\begin{wrapfigure}[7]{l}{0.15\textwidth} | 312 %\begin{wrapfigure}[7]{l}{0.15\textwidth} |
329 %\vspace*{-5mm} | 313 %\vspace*{-5mm} |
330 \begin{center} | 314 \begin{center} |
331 \includegraphics[scale=.4]{images/Pinch_only.png}\\ | 315 \includegraphics[scale=.4]{Pinch_only.png}\\ |
332 \vspace*{15mm} | 316 \vspace*{15mm} |
333 %{\bf Pinch} | 317 %{\bf Pinch} |
334 \end{center} | 318 \end{center} |
335 %\end{wrapfigure} | 319 %\end{wrapfigure} |
336 %%\vspace{.6cm} | 320 %%\vspace{.6cm} |
363 | 347 |
364 %%\vspace*{-.2cm} | 348 %%\vspace*{-.2cm} |
365 \begin{minipage}[t]{0.14\linewidth} | 349 \begin{minipage}[t]{0.14\linewidth} |
366 \centering | 350 \centering |
367 \vspace*{0mm} | 351 \vspace*{0mm} |
368 \includegraphics[scale=.4]{images/Motionblur_only.png} | 352 \includegraphics[scale=.4]{Motionblur_only.png} |
369 %{\bf Motion Blur} | 353 %{\bf Motion Blur} |
370 \end{minipage}% | 354 \end{minipage}% |
371 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} | 355 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} |
372 %%\vspace*{.5mm} | 356 %%\vspace*{.5mm} |
373 \vspace*{2mm} | 357 \vspace*{2mm} |
384 \subsubsection*{Occlusion} | 368 \subsubsection*{Occlusion} |
385 | 369 |
386 \begin{minipage}[t]{0.14\linewidth} | 370 \begin{minipage}[t]{0.14\linewidth} |
387 \centering | 371 \centering |
388 \vspace*{3mm} | 372 \vspace*{3mm} |
389 \includegraphics[scale=.4]{images/occlusion_only.png}\\ | 373 \includegraphics[scale=.4]{occlusion_only.png}\\ |
390 %{\bf Occlusion} | 374 %{\bf Occlusion} |
391 %%\vspace{.5cm} | 375 %%\vspace{.5cm} |
392 \end{minipage}% | 376 \end{minipage}% |
393 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} | 377 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} |
394 %\vspace*{-18mm} | 378 %\vspace*{-18mm} |
397 image. Pixels are combined by taking the max(occluder, occluded), | 381 image. Pixels are combined by taking the max(occluder, occluded), |
398 i.e. keeping the lighter ones. | 382 i.e. keeping the lighter ones. |
399 The rectangle corners | 383 The rectangle corners |
400 are sampled so that larger complexity gives larger rectangles. | 384 are sampled so that larger complexity gives larger rectangles. |
401 The destination position in the occluded image is also sampled | 385 The destination position in the occluded image is also sampled |
402 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). | 386 according to a normal distribution. |
403 This module is skipped with probability 60\%. | 387 This module is skipped with probability 60\%. |
404 %%\vspace{7mm} | 388 %%\vspace{7mm} |
405 \end{minipage} | 389 \end{minipage} |
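The pixel-combination rule of the occlusion module — paste a rectangular patch from an occluder image and keep the lighter of the two values at each pixel — can be sketched as below. The patch size/position sampling is simplified away here; in the text both depend on `complexity` and a normal distribution, so treat this as a minimal illustration rather than the authors' implementation.

```python
def occlude(occluded, occluder, top, left):
    """Paste `occluder` at (top, left) of `occluded`, taking the
    pixel-wise max so the lighter value wins (as described in the text)."""
    out = [row[:] for row in occluded]  # copy; do not mutate the input
    for i, patch_row in enumerate(occluder):
        for j, p in enumerate(patch_row):
            y, x = top + i, left + j
            if 0 <= y < len(out) and 0 <= x < len(out[0]):  # clip at borders
                out[y][x] = max(out[y][x], p)
    return out

base = [[0.1, 0.2],
        [0.3, 0.4]]
patch = [[0.9]]
result = occlude(base, patch, 1, 1)  # only pixel (1,1) changes: max(0.4, 0.9)
```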
406 | 390 |
407 %\vspace*{1mm} | 391 %\vspace*{1mm} |
411 %\vspace*{-6mm} | 395 %\vspace*{-6mm} |
412 \begin{minipage}[t]{0.14\linewidth} | 396 \begin{minipage}[t]{0.14\linewidth} |
413 \begin{center} | 397 \begin{center} |
414 %\centering | 398 %\centering |
415 \vspace*{6mm} | 399 \vspace*{6mm} |
416 \includegraphics[scale=.4]{images/Bruitgauss_only.png} | 400 \includegraphics[scale=.4]{Bruitgauss_only.png} |
417 %{\bf Gaussian Smoothing} | 401 %{\bf Gaussian Smoothing} |
418 \end{center} | 402 \end{center} |
419 %\end{wrapfigure} | 403 %\end{wrapfigure} |
420 %%\vspace{.5cm} | 404 %%\vspace{.5cm} |
421 \end{minipage}% | 405 \end{minipage}% |
447 \begin{minipage}[t]{0.14\textwidth} | 431 \begin{minipage}[t]{0.14\textwidth} |
448 %\begin{wrapfigure}[7]{l}{ | 432 %\begin{wrapfigure}[7]{l}{ |
449 %\vspace*{-5mm} | 433 %\vspace*{-5mm} |
450 \begin{center} | 434 \begin{center} |
451 \vspace*{1mm} | 435 \vspace*{1mm} |
452 \includegraphics[scale=.4]{images/Permutpixel_only.png} | 436 \includegraphics[scale=.4]{Permutpixel_only.png} |
453 %{\small\bf Permute Pixels} | 437 %{\small\bf Permute Pixels} |
454 \end{center} | 438 \end{center} |
455 %\end{wrapfigure} | 439 %\end{wrapfigure} |
456 \end{minipage}% | 440 \end{minipage}% |
457 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth} | 441 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth} |
474 %%\vspace*{-3mm} | 458 %%\vspace*{-3mm} |
475 \begin{center} | 459 \begin{center} |
476 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} | 460 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} |
477 %\centering | 461 %\centering |
478 \vspace*{0mm} | 462 \vspace*{0mm} |
479 \includegraphics[scale=.4]{images/Distorsiongauss_only.png} | 463 \includegraphics[scale=.4]{Distorsiongauss_only.png} |
480 %{\small \bf Gauss. Noise} | 464 %{\small \bf Gauss. Noise} |
481 \end{center} | 465 \end{center} |
482 %\end{wrapfigure} | 466 %\end{wrapfigure} |
483 \end{minipage}% | 467 \end{minipage}% |
484 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} | 468 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} |
496 | 480 |
497 \begin{minipage}[t]{\linewidth} | 481 \begin{minipage}[t]{\linewidth} |
498 \begin{minipage}[t]{0.14\linewidth} | 482 \begin{minipage}[t]{0.14\linewidth} |
499 \centering | 483 \centering |
500 \vspace*{0mm} | 484 \vspace*{0mm} |
501 \includegraphics[scale=.4]{images/background_other_only.png} | 485 \includegraphics[scale=.4]{background_other_only.png} |
502 %{\small \bf Bg Image} | 486 %{\small \bf Bg Image} |
503 \end{minipage}% | 487 \end{minipage}% |
504 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} | 488 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} |
505 \vspace*{1mm} | 489 \vspace*{1mm} |
506 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random | 490 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random |
515 \subsubsection*{Salt and Pepper Noise} | 499 \subsubsection*{Salt and Pepper Noise} |
516 | 500 |
517 \begin{minipage}[t]{0.14\linewidth} | 501 \begin{minipage}[t]{0.14\linewidth} |
518 \centering | 502 \centering |
519 \vspace*{0mm} | 503 \vspace*{0mm} |
520 \includegraphics[scale=.4]{images/Poivresel_only.png} | 504 \includegraphics[scale=.4]{Poivresel_only.png} |
521 %{\small \bf Salt \& Pepper} | 505 %{\small \bf Salt \& Pepper} |
522 \end{minipage}% | 506 \end{minipage}% |
523 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} | 507 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} |
524 \vspace*{1mm} | 508 \vspace*{1mm} |
525 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels. | 509 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels. |
537 %\begin{minipage}[t]{0.14\linewidth} | 521 %\begin{minipage}[t]{0.14\linewidth} |
538 %\centering | 522 %\centering |
539 \begin{center} | 523 \begin{center} |
540 \vspace*{4mm} | 524 \vspace*{4mm} |
541 %\hspace*{-1mm} | 525 %\hspace*{-1mm} |
542 \includegraphics[scale=.4]{images/Rature_only.png}\\ | 526 \includegraphics[scale=.4]{Rature_only.png}\\ |
543 %{\bf Scratches} | 527 %{\bf Scratches} |
544 \end{center} | 528 \end{center} |
545 \end{minipage}% | 529 \end{minipage}% |
546 %\end{wrapfigure} | 530 %\end{wrapfigure} |
547 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} | 531 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} |
563 \subsubsection*{Grey Level and Contrast Changes} | 547 \subsubsection*{Grey Level and Contrast Changes} |
564 | 548 |
565 \begin{minipage}[t]{0.15\linewidth} | 549 \begin{minipage}[t]{0.15\linewidth} |
566 \centering | 550 \centering |
567 \vspace*{0mm} | 551 \vspace*{0mm} |
568 \includegraphics[scale=.4]{images/Contrast_only.png} | 552 \includegraphics[scale=.4]{Contrast_only.png} |
569 %{\bf Grey Level \& Contrast} | 553 %{\bf Grey Level \& Contrast} |
570 \end{minipage}% | 554 \end{minipage}% |
571 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth} | 555 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth} |
572 \vspace*{1mm} | 556 \vspace*{1mm} |
573 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white | 557 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white |
579 %\vspace{2mm} | 563 %\vspace{2mm} |
580 | 564 |
581 | 565 |
582 \iffalse | 566 \iffalse |
583 \begin{figure}[ht] | 567 \begin{figure}[ht] |
584 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\ | 568 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{example_t.png}}}\\ |
585 \caption{Illustration of the pipeline of stochastic | 569 \caption{Illustration of the pipeline of stochastic |
586 transformations applied to the image of a lower-case \emph{t} | 570 transformations applied to the image of a lower-case \emph{t} |
587 (the upper left image). Each image in the pipeline (going from | 571 (the upper left image). Each image in the pipeline (going from |
588 left to right, first top line, then bottom line) shows the result | 572 left to right, first top line, then bottom line) shows the result |
589 of applying one of the modules in the pipeline. The last image | 573 of applying one of the modules in the pipeline. The last image |
624 %\citep{SorokinAndForsyth2008,whitehill09}. | 608 %\citep{SorokinAndForsyth2008,whitehill09}. |
625 AMT users were presented | 609 AMT users were presented |
626 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII | 610 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII |
627 characters. They were forced to choose a single character class (either among the | 611 characters. They were forced to choose a single character class (either among the |
628 62 or 10 character classes) for each image. | 612 62 or 10 character classes) for each image. |
629 80 subjects classified 2500 images per (dataset,task) pair, | 613 80 subjects classified 2500 images per (dataset,task) pair. |
630 with the guarantee that 3 different subjects classified each image, allowing | 614 Different human labelers sometimes provided a different label for the same |
631 us to estimate inter-human variability (e.g a standard error of 0.1\% | 615 example, and we were able to estimate the error variance due to this effect |
632 on the average 18.2\% error done by humans on the 62-class task NIST test set). | 616 because each image was classified by 3 different persons. |
617 The average error of humans on the 62-class task NIST test set | |
618 is 18.2\%, with a standard error of 0.1\%. | |
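A standard error on a human error rate like the 18.2% above can be obtained by treating each label as a Bernoulli trial with error probability $p$, giving $SE = \sqrt{p(1-p)/n}$. The sketch below is only an illustration of that formula; the value of `n` is a hypothetical label count, not the paper's exact number of human judgments.

```python
import math

def standard_error(p, n):
    """Binomial standard error of an estimated error rate p over n labels."""
    return math.sqrt(p * (1.0 - p) / n)

# With an error rate of 18.2% and on the order of 1.5e5 labels (illustrative),
# the standard error comes out around 0.001, i.e. about 0.1%.
se = standard_error(0.182, 150_000)
```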
633 | 619 |
634 %\vspace*{-3mm} | 620 %\vspace*{-3mm} |
635 \subsection{Data Sources} | 621 \subsection{Data Sources} |
636 %\vspace*{-2mm} | 622 %\vspace*{-2mm} |
637 | 623 |
731 | 717 |
732 {\bf Multi-Layer Perceptrons (MLP).} | 718 {\bf Multi-Layer Perceptrons (MLP).} |
733 Whereas previous work had compared deep architectures to both shallow MLPs and | 719 Whereas previous work had compared deep architectures to both shallow MLPs and |
734 SVMs, we only compared to MLPs here because of the very large datasets used | 720 SVMs, we only compared to MLPs here because of the very large datasets used |
735 (making the use of SVMs computationally challenging because of their quadratic | 721 (making the use of SVMs computationally challenging because of their quadratic |
736 scaling behavior). | 722 scaling behavior). Preliminary experiments on training SVMs (libSVM) with subsets of the training |
723 set small enough to fit in memory yielded substantially worse results | |
724 than those obtained with MLPs. For training on nearly a billion examples | |
725 (with the perturbed data), the MLPs and SDA are much more convenient than | |
726 classifiers based on kernel methods. | |
737 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 727 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
738 exponentials) on the output layer for estimating $P(class | image)$. | 728 exponentials) on the output layer for estimating $P(class | image)$. |
739 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 729 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. |
740 Training examples are presented in minibatches of size 20. A constant learning | 730 Training examples are presented in minibatches of size 20. A constant learning |
741 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. | 731 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. |
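The forward computation of the MLP just described — one hidden layer of $\tanh$ units followed by a softmax output estimating $P(class \mid image)$ — can be sketched as follows. This is not the authors' code: the layer sizes are toy values (the paper uses 300 to 1500 hidden units and minibatches of 20), and the weights are random placeholders.

```python
import math
import random

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer MLP: tanh hidden units, softmax outputs."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
         for w, b in zip(W1, b1)]                 # hidden layer activations
    z = [sum(wi * hi for wi, hi in zip(w, h)) + b
         for w, b in zip(W2, b2)]                 # output pre-activations
    m = max(z)                                    # shift for numerical stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]                   # softmax: P(class | image)

rng = random.Random(0)
n_in, n_hid, n_out = 4, 3, 2                      # toy sizes
W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_out)]
probs = mlp_forward([0.5, -0.2, 0.1, 0.9], W1, [0.0] * n_hid, W2, [0.0] * n_out)
```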
749 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 739 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
750 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 740 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
751 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, | 741 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, |
752 apparently setting parameters in the | 742 apparently setting parameters in the |
753 basin of attraction of supervised gradient descent, yielding better | 743 basin of attraction of supervised gradient descent, yielding better |
754 generalization~\citep{Erhan+al-2010}. It is hypothesized that the | 744 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised |
745 pre-training phase} uses all of the training images but not the training labels. | |
746 Each layer is trained in turn to produce a new representation of its input | |
747 (starting from the raw pixels). | |
748 It is hypothesized that the | |
755 advantage brought by this procedure stems from a better prior, | 749 advantage brought by this procedure stems from a better prior, |
756 on the one hand taking advantage of the link between the input | 750 on the one hand taking advantage of the link between the input |
757 distribution $P(x)$ and the conditional distribution of interest | 751 distribution $P(x)$ and the conditional distribution of interest |
758 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 752 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
759 taking advantage of the expressive power and bias implicit in the | 753 taking advantage of the expressive power and bias implicit in the |
760 deep architecture (whereby complex concepts are expressed as | 754 deep architecture (whereby complex concepts are expressed as |
761 compositions of simpler ones through a deep hierarchy). | 755 compositions of simpler ones through a deep hierarchy). |
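One denoising auto-encoder step of the kind used to pre-train each layer can be sketched as below: the input $x$ is corrupted into $\tilde{x}$ (a fraction of inputs zeroed), encoded into code $y$, decoded back to a reconstruction $z$, and scored against the uncorrupted $x$ with a cross-entropy loss. This forward computation is illustrative only; the weight tying ($W' = W^T$) and sigmoid units follow common denoising auto-encoder practice rather than being confirmed details of the authors' implementation, and no gradient update is shown.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def da_step(x, W, b, b_prime, corrupt_frac, rng):
    """Forward pass of one denoising auto-encoder layer (tied weights)."""
    x_tilde = [0.0 if rng.random() < corrupt_frac else xi for xi in x]  # corruption
    y = [sigmoid(sum(wij * xj for wij, xj in zip(row, x_tilde)) + bi)
         for row, bi in zip(W, b)]                      # encoder f_theta(x~)
    z = [sigmoid(sum(W[i][j] * yi for i, yi in enumerate(y)) + b_prime[j])
         for j in range(len(x))]                        # tied-weight decoder
    loss = -sum(xi * math.log(zi) + (1 - xi) * math.log(1 - zi)
                for xi, zi in zip(x, z))                # cross-entropy vs clean x
    return y, z, loss

rng = random.Random(0)
x = [0.0, 1.0, 1.0, 0.0]                                # toy 4-pixel "input"
W = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(3)]
y, z, loss = da_step(x, W, b=[0.0] * 3, b_prime=[0.0] * 4,
                     corrupt_frac=0.25, rng=rng)
```

In the greedy layer-wise scheme, the code `y` produced by a trained layer becomes the input `x` of the next layer.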

\begin{figure}[ht]
%\vspace*{-2mm}
\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{denoising_autoencoder_small.pdf}}}
%\vspace*{-2mm}
\caption{Illustration of the computations and training criterion for the denoising
auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
the layer (i.e., raw input or output of the previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 based on previous work with
SDAs on MNIST~\citep{VincentPLarochelleH2008}. The number of hidden
units was kept constant across hidden layers, and the best results
were obtained with the largest value that we could experiment
with given our patience, 1000 hidden units.
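The model selection described above amounts to a grid search. In the sketch below, the corruption fractions, depth, and layer width come from the text; the learning-rate values and the `validation_error` evaluator are placeholders, since the exact rate set ("the same set as above") and the training pipeline are not shown in this excerpt.

```python
from itertools import product

# From the text: corruption fraction searched over {10%, 20%, 50%},
# depth fixed to 3 hidden layers of 1000 units each.
CORRUPTION_FRACTIONS = [0.10, 0.20, 0.50]
N_HIDDEN_LAYERS = 3
N_HIDDEN_UNITS = 1000
# Placeholder values: the actual learning-rate set is elided here.
LEARNING_RATES = [0.1, 0.01, 0.001]

def select_hyperparams(validation_error):
    """Return the (corruption, pretrain_lr) pair minimizing validation
    error, where validation_error(corruption, lr) trains a model with
    that configuration and evaluates it on the validation set."""
    grid = product(CORRUPTION_FRACTIONS, LEARNING_RATES)
    return min(grid, key=lambda cfg: validation_error(*cfg))

# Toy evaluator standing in for training an SDA; here, smaller
# corruption and smaller learning rate happen to score "better".
best = select_hyperparams(lambda corr, lr: corr + lr)
```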

%\vspace*{-1mm}

\begin{figure}[ht]
%\vspace*{-2mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{error_rates_charts.pdf}}}
%\vspace*{-3mm}
\caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on NIST and NISTP test sets.
Right: error rates on NIST test digits only, along with the previous results from
\end{figure}


\begin{figure}[ht]
%\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{improvements_charts.pdf}}}
%\vspace*{-3mm}
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
(figures and tables, including standard errors on the error rates) can be
found in the Appendix.
The deep learner not only outperformed the shallow ones and
previously published performance (in a statistically and qualitatively
significant way) but when trained with perturbed data
reaches human performance on both the 62-class task
and the 10-class (digits) task.
{\bf Do the good results previously obtained with deep architectures on the
MNIST digits generalize to a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples}?
Yes, the SDA {\em systematically outperformed the MLP and all the previously
published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits,
and beating previously published results on the same data.

$\bullet$ %\item
{\bf To what extent do self-taught learning scenarios help deep learners,
and do they help them more than shallow supervised ones}?
We found that distorted training examples not only made the resulting
in the asymptotic regime.

{\bf Why would deep learners benefit more from the self-taught learning framework}?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. A theoretical analysis of generalization improvements
due to sharing of intermediate features across tasks already points
towards that explanation~\citep{baxter95a}.
Intermediate features that can be used in different
contexts can be estimated in a way that allows the sharing of statistical
strength. Features extracted through many levels are more likely to
be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
with deep learning and self-taught learning.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.


\section*{Appendix I: Detailed Numerical Results}

These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered.
They also contain additional data such as test errors on P07 and standard errors.

\begin{table}[ht]
\caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
26 lower + 26 upper), except for the last column -- digits only, between a deep architecture with pre-training
(SDA=Stacked Denoising Autoencoder) and an ordinary shallow architecture
(MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07)
and using a validation set to select hyper-parameters and other training choices.
\{SDA,MLP\}0 are trained on NIST,
\{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07.
The human error rate on digits is a lower bound because it does not count digits that were
recognized as letters. For comparison, the results found in the literature
on NIST digits classification using the same test set are included.}
\label{tab:sda-vs-mlp-vs-humans}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
& NIST test & NISTP test & P07 test & NIST test digits \\ \hline
Humans & 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $1.4\%$ \\ \hline
SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline
SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline
SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline
MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline
MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline
\citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline
\citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline
\citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
\citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{table}[ht]
\caption{Relative change in error rates due to the use of perturbed training data,
either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
A positive value indicates that training on the perturbed data helped for the
given test set (the first 3 columns are on the 62-class task and the last one is
on the clean 10-class digits). Clearly, the deep learning models did benefit more
from perturbed training data, even when testing on clean data, whereas the MLP
trained on perturbed data performed worse on the clean digits and about the same
on the clean characters.}
\label{tab:perturbation-effect}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
& NIST test & NISTP test & P07 test & NIST test digits \\ \hline
SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline
SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline
MLP0/MLP1-1 & 5.2\% & 65\% & -13\% & -10\% \\ \hline
MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{table}[ht]
\caption{Test error rates and relative change in error rates due to the use of
a multi-task setting, i.e., training on each task in isolation vs training
for all three tasks together, for MLPs vs SDAs. The SDA benefits much
more from the multi-task setting. All experiments are on the
unperturbed NIST data, using validation error for model selection.
Relative improvement is 1 - single-task error / multi-task error.}
\label{tab:multi-task}
\begin{center}
\begin{tabular}{|l|r|r|r|} \hline
& single-task & multi-task & relative \\
& setting & setting & improvement \\ \hline
MLP-digits & 3.77\% & 3.99\% & 5.6\% \\ \hline
MLP-lower & 17.4\% & 16.8\% & -4.1\% \\ \hline
MLP-upper & 7.84\% & 7.54\% & -3.6\% \\ \hline
SDA-digits & 2.6\% & 3.56\% & 27\% \\ \hline
SDA-lower & 12.3\% & 14.4\% & 15\% \\ \hline
SDA-upper & 5.93\% & 6.78\% & 13\% \\ \hline
\end{tabular}
\end{center}
\end{table}
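As a quick numeric check of the caption's formula (an illustrative sketch, not part of the original experiments), the SDA rows of the table above can be reproduced directly:

```python
def relative_improvement(single_task_error, multi_task_error):
    """Relative improvement as defined in the caption:
    1 - single-task error / multi-task error."""
    return 1.0 - single_task_error / multi_task_error

# SDA-digits row: 2.6% single-task vs 3.56% multi-task
sda_digits = 100.0 * relative_improvement(2.6, 3.56)
print(f"{sda_digits:.0f}%")  # → 27%
```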

%\afterpage{\clearpage}
\clearpage
{
\bibliography{strings,strings-short,strings-shorter,ift6266_ml,specials,aigaion-shorter}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}