comparison writeup/nips2010_cameraready.tex @ 633:13baba8a4522

merge
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 19 Mar 2011 22:51:40 -0400
parents d840139444fe
1 \documentclass{article} % For LaTeX2e
2 \usepackage{nips10submit_e,times}
3 \usepackage{wrapfig}
4 \usepackage{amsthm,amsmath,bbm}
5 \usepackage[psamsfonts]{amssymb}
6 \usepackage{algorithm,algorithmic}
7 \usepackage[utf8]{inputenc}
8 \usepackage{graphicx,subfigure}
9 \usepackage[numbers]{natbib}
10
11 \addtolength{\textwidth}{10mm}
12 \addtolength{\textheight}{10mm}
13 \addtolength{\topmargin}{-5mm}
14 \addtolength{\evensidemargin}{-5mm}
15 \addtolength{\oddsidemargin}{-5mm}
16
17 %\setlength\parindent{0mm}
18
19 \title{Deep Self-Taught Learning for Handwritten Character Recognition}
20 \author{
21 Frédéric Bastien,
22 Yoshua Bengio,
23 Arnaud Bergeron,
24 Nicolas Boulanger-Lewandowski,
25 Thomas Breuel,\\
26 {\bf Youssouf Chherawala,
27 Moustapha Cisse,
28 Myriam Côté,
29 Dumitru Erhan,
30 Jeremy Eustache,}\\
31 {\bf Xavier Glorot,
32 Xavier Muller,
33 Sylvain Pannetier Lebeuf,
34 Razvan Pascanu,} \\
35 {\bf Salah Rifai,
36 Francois Savard,
37 Guillaume Sicard}\\
38 Dept. IRO, U. Montreal
39 }
40
41 \begin{document}
42
43 %\makeanontitle
44 \maketitle
45
46 \vspace*{-2mm}
47 \begin{abstract}
48 Recent theoretical and empirical work in statistical machine learning has
49 demonstrated the importance of learning algorithms for deep
50 architectures, i.e., function classes obtained by composing multiple
51 non-linear transformations. Self-taught learning (exploiting unlabeled
52 examples or examples from other distributions) has already been applied
53 to deep learners, but mostly to show the advantage of unlabeled
54 examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
55 For this purpose we
56 developed a powerful generator of stochastic variations and noise
57 processes for character images, including not only affine transformations
58 but also slant, local elastic deformations, changes in thickness,
59 background images, grey level changes, contrast, occlusion, and various
60 types of noise. The out-of-distribution examples are obtained from these
61 highly distorted images or by including examples of object classes
62 different from those in the target test set.
63 We show that {\em deep learners benefit
64 more from them than a corresponding shallow learner}, at least in the area of
65 handwritten character recognition. In fact, we show that they reach
66 human-level performance on both handwritten digit classification and
67 62-class handwritten character recognition.
68 \end{abstract}
69 \vspace*{-3mm}
70
71 \section{Introduction}
72 \vspace*{-1mm}
73
74 {\bf Deep Learning} has emerged as a promising new area of research in
75 statistical machine learning~\citep{Hinton06}
76 (see \citet{Bengio-2009} for a review).
77 Learning algorithms for deep architectures are centered on the learning
78 of useful representations of data, which are better suited to the task at hand,
79 and are organized in a hierarchy with multiple levels.
80 This is in part inspired by observations of the mammalian visual cortex,
81 which consists of a chain of processing elements, each of which is associated with a
82 different representation of the raw visual input. In fact,
83 it was found recently that the features learnt in deep architectures resemble
84 those observed in the first two of these stages (in areas V1 and V2
85 of visual cortex)~\citep{HonglakL2008}, and that they become more and
86 more invariant to factors of variation (such as camera movement) in
87 higher layers~\citep{Goodfellow2009}.
88 It has been hypothesized that learning a hierarchy of features increases the
89 ease and practicality of developing representations that are at once
90 tailored to specific tasks, yet are able to borrow statistical strength
91 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
92 feature representation can lead to higher-level (more abstract, more
93 general) features that are more robust to unanticipated sources of
94 variance extant in real data.
95
96 {\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles
97 of semi-supervised and multi-task learning: the learner can exploit examples
98 that are unlabeled and possibly come from a distribution different from the target
99 distribution, e.g., from other classes than those of interest.
100 It has already been shown that deep learners can clearly take advantage of
101 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
102 but more needs to be done to explore the impact
103 of {\em out-of-distribution} examples and of the multi-task setting
104 (one exception is~\citep{CollobertR2008}, which uses a different kind
105 of learning algorithm). In particular the {\em relative
106 advantage} of deep learning for these settings has not been evaluated.
107 The hypothesis discussed in the conclusion is that a deep hierarchy of features
108 may be better able to provide sharing of statistical strength
109 between different regions in input space or different tasks.
110
111 Previous comparative experimental results with stacking of Restricted Boltzmann Machines (RBMs) and Denoising
112 Auto-encoders (DAs) to build deep supervised predictors had shown that they could outperform
113 shallow architectures in a variety of settings, especially
114 when the data involves complex interactions between many factors of
115 variation~\citep{LarochelleH2007,Bengio-2009}. Other experiments have suggested
116 that the unsupervised layer-wise pre-training acted as a useful
117 prior~\citep{Erhan+al-2010} that allows one to initialize a deep
118 neural network in a much smaller region of parameter space,
119 one corresponding to better generalization.
120
121 To further the understanding of the reasons for the good performance
122 observed with deep learners, we focus here on the following {\em hypothesis}:
123 intermediate levels of representation, especially when there are
124 more such levels, can be exploited to {\bf share
125 statistical strength across different but related types of examples},
126 such as examples coming from other tasks than the task of interest
127 (the multi-task setting), or examples coming from an overlapping
128 but different distribution (images with different kinds of perturbations
129 and noises, here). This is consistent with the hypotheses discussed
130 in~\citet{Bengio-2009} regarding the potential advantage
131 of deep learning and the idea that more levels of representation can
132 give rise to more abstract, more general features of the raw input.
133
147
148 %
149 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
150 {\bf benefit more from out-of-distribution examples than shallow learners} (with a single
151 level), both in the context of the multi-task setting and from
152 perturbed examples. Because we are able to improve on state-of-the-art
153 performance and reach human-level performance
154 on a large-scale task, we consider that this paper is also a contribution
155 to advance the application of machine learning to handwritten character recognition.
156 More precisely, we ask and answer the following questions:
157
158 %\begin{enumerate}
159 $\bullet$ %\item
160 Do the good results previously obtained with deep architectures on the
161 MNIST digit images generalize to the setting of a similar but much larger and richer
162 dataset, the NIST special database 19, with 62 classes and around 800k examples?
163
164 $\bullet$ %\item
165 To what extent does the perturbation of input images (e.g. adding
166 noise, affine transformations, background images) make the resulting
167 classifiers better not only on similarly perturbed images but also on
168 the {\em original clean examples}? We study this question in the
169 context of the 62-class and 10-class tasks of the NIST special database 19.
170
171 $\bullet$ %\item
172 Do deep architectures {\em benefit {\bf more} from such out-of-distribution}
173 examples, in particular do they benefit more from
174 examples that are perturbed versions of the examples from the task of interest?
175
176 $\bullet$ %\item
177 Similarly, does the feature learning step in deep learning algorithms benefit {\bf more}
178 from training with moderately {\em different classes} (i.e. a multi-task learning scenario) than
179 a corresponding shallow and purely supervised architecture?
180 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case)
181 to answer this question.
182 %\end{enumerate}
183
184 Our experimental results provide positive evidence towards all of these questions,
185 as well as {\bf classifiers that reach human-level performance on 62-class isolated character
186 recognition and beat previously published results on the NIST dataset (special database 19)}.
187 To achieve these results, we introduce in the next section a sophisticated system
188 for stochastically transforming character images and then explain the methodology,
189 which is based on training with or without these transformed images and testing on
190 clean ones.
191 Code for generating these transformations, as well as for the deep learning
192 algorithms, is made available at {\tt http://hg.assembla.com/ift6266}.
193
194 \vspace*{-3mm}
195 %%\newpage
196 \section{Perturbed and Transformed Character Images}
197 \label{s:perturbations}
198 \vspace*{-2mm}
199
200 %\begin{minipage}[h]{\linewidth}
201 \begin{wrapfigure}[8]{l}{0.15\textwidth}
202 %\begin{minipage}[b]{0.14\linewidth}
203 \vspace*{-5mm}
204 \begin{center}
205 \includegraphics[scale=.4]{images/Original.png}\\
206 {\bf Original}
207 \end{center}
208 \end{wrapfigure}
209 %\vspace{0.7cm}
210 %\end{minipage}%
211 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
212 This section describes the different transformations we used to stochastically
213 transform $32 \times 32$ source images (such as the one on the left)
214 in order to obtain data from a larger distribution which
215 covers a domain substantially larger than the clean characters distribution from
216 which we start.
217 Although character transformations have been used before to
218 improve character recognizers, this effort is on a large scale both
219 in the number of classes and in the complexity of the transformations, hence
220 in the complexity of the learning task.
221 More details can
222 be found in the accompanying technical report~\citep{ARXIV-2010}.
223 The code for these transformations (mostly Python) is available at
224 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
225 a global control parameter ($0 \le complexity \le 1$) modulating the
226 amount of deformation or noise.
227 There are two main parts in the pipeline. The first one,
228 from thickness to pinch, performs transformations. The second
229 part, from blur to contrast, adds different kinds of noise.
230 %\end{minipage}
231
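As an illustration, the overall control flow can be sketched in a few lines of Python (a simplified sketch, not the released code, whose module names and interfaces differ):
{\small\begin{verbatim}
import random

def transform(image, complexity, modules):
    # modules: ordered list of (function, skip_probability) pairs, where each
    # function maps (image, complexity) to a new image; transformation modules
    # have skip_probability 0, most noise modules are often skipped.
    for module, skip_probability in modules:
        if random.random() >= skip_probability:
            image = module(image, complexity)
    return image
\end{verbatim}}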
232 %\newpage
233 \vspace*{1mm}
234 %\subsection{Transformations}
235 {\large\bf 2.1 Transformations}
236 \vspace*{1mm}
237
238
239 \begin{minipage}[h]{\linewidth}
240 \begin{wrapfigure}[7]{l}{0.15\textwidth}
241 %\begin{minipage}[b]{0.14\linewidth}
242 %\centering
243 \begin{center}
244 \vspace*{-5mm}
245 \includegraphics[scale=.4]{images/Thick_only.png}\\
246 {\bf Thickness}
247 \end{center}
248 %\vspace{.6cm}
249 %\end{minipage}%
250 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
251 \end{wrapfigure}
252 To change character {\bf thickness}, morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
253 are applied. The neighborhood of each pixel is multiplied
254 element-wise with a {\em structuring element} matrix.
255 The pixel value is replaced by the maximum or the minimum of the resulting
256 matrix, respectively for dilation or erosion. Ten different structuring elements with
257 increasing dimensions (the largest is $5\times5$) were used. For each image, we
258 randomly sample the operator type (dilation or erosion) with equal probability and one structuring
259 element from the subset of the $n={\rm round}(m \times complexity)$ smallest structuring elements,
260 where $m=10$ for dilation and $m=6$ for erosion (to avoid completely erasing thin characters).
261 A neutral element (no transformation)
262 is always present in the set.
263 %\vspace{.4cm}
264 \end{minipage}
265 \vspace*{3mm}
266
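A hedged sketch of this sampling, using SciPy grey-scale morphology (the structuring-element set below is only a stand-in for the one actually used):
{\small\begin{verbatim}
import random
import numpy as np
from scipy import ndimage

# Stand-in for the ten structuring elements of increasing size (largest 5x5);
# the first one is the neutral 1x1 element (no transformation).
ELEMENTS = [np.ones((min(k, 5), min(k, 5)), dtype=bool) for k in range(1, 11)]

def thickness(image, complexity):
    dilate = random.random() < 0.5           # dilation or erosion, equal odds
    m = 10 if dilate else 6                  # fewer choices for erosion
    n = max(1, int(round(m * complexity)))
    element = random.choice(ELEMENTS[:n])    # among the n smallest elements
    op = ndimage.grey_dilation if dilate else ndimage.grey_erosion
    return op(image, footprint=element)
\end{verbatim}}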
267 \begin{minipage}[h]{\linewidth}
268 \begin{wrapfigure}[7]{l}{0.15\textwidth}
269 %\begin{minipage}[b]{0.14\linewidth}
270 %\centering
271 \begin{center}
272 \vspace*{-5mm}
273 \includegraphics[scale=.4]{images/Slant_only.png}\\
274 {\bf Slant}
275 \end{center}
276 \end{wrapfigure}
277
278 %\end{minipage}%
279 %\hspace{0.3cm}
280 %\begin{minipage}[b]{0.83\linewidth}
281 %\centering
282 To produce {\bf slant}, each row of the image is shifted
283 horizontally in proportion to its vertical position (height): $shift = round(slant \times height)$,
284 with $slant \sim U[-complexity,complexity]$.
285 The shift is randomly chosen to be either to the left or to the right.
286 %\vspace{8mm}
287 \end{minipage}
288 \vspace*{10mm}
289
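A minimal sketch following the text literally (with wrap-around row shifts for brevity, unlike the padded shifts one would use in practice):
{\small\begin{verbatim}
import numpy as np

def slant(image, complexity):
    amount = np.random.uniform(-complexity, complexity)
    direction = np.random.choice([-1, 1])        # shift left or right
    out = np.empty_like(image)
    for row in range(image.shape[0]):
        shift = int(round(direction * amount * row))
        out[row] = np.roll(image[row], shift)    # wrap-around for simplicity
    return out
\end{verbatim}}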
290 \begin{minipage}[h]{\linewidth}
291 %\begin{minipage}[b]{0.14\linewidth}
292 %\centering
293 \begin{wrapfigure}[7]{l}{0.15\textwidth}
294 \begin{center}
295 \vspace*{-5mm}
296 \includegraphics[scale=.4]{images/Affine_only.png}\\
297 {\small {\bf Affine \mbox{Transformation}}}
298 \end{center}
299 \end{wrapfigure}
300 %\end{minipage}%
301 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
302 A $2 \times 3$ {\bf affine transform} matrix (with
303 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$.
304 Output pixel $(x,y)$ takes the value of input pixel
305 nearest to $(ax+by+c,dx+ey+f)$,
306 producing scaling, translation, rotation and shearing.
307 Marginal distributions of $(a,b,c,d,e,f)$ have been tuned to
308 forbid large rotations (to avoid confusing classes) but to give good
309 variability of the transformation: $a$ and $d$ $\sim U[1-3\,
310 complexity,1+3\,complexity]$, $b$ and $e$ $\sim U[-3\,complexity,3\,
311 complexity]$, and $c$ and $f \sim U[-4 \,complexity, 4 \,
312 complexity]$.\\
313 %\end{minipage}
314 \end{minipage}
315 \vspace*{3mm}
316
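A sketch of this sampling with SciPy (glossing over the (row, column) versus $(x,y)$ axis convention that the real code must handle):
{\small\begin{verbatim}
import numpy as np
from scipy import ndimage

def affine(image, complexity):
    k = complexity
    a, d = np.random.uniform(1 - 3*k, 1 + 3*k, size=2)
    b, e = np.random.uniform(-3*k, 3*k, size=2)
    c, f = np.random.uniform(-4*k, 4*k, size=2)
    # order=0: each output pixel takes the value of the nearest input pixel,
    # located at (a*x + b*y + c, d*x + e*y + f).
    return ndimage.affine_transform(image, np.array([[a, b], [d, e]]),
                                    offset=(c, f), order=0)
\end{verbatim}}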
317 \vspace*{-4.5mm}
318
319 \begin{minipage}[h]{\linewidth}
320 \begin{wrapfigure}[7]{l}{0.15\textwidth}
321 %\hspace*{-8mm}\begin{minipage}[b]{0.25\linewidth}
322 %\centering
323 \begin{center}
324 \vspace*{-4mm}
325 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}\\
326 {\bf Local Elastic Deformation}
327 \end{center}
328 \end{wrapfigure}
329 %\end{minipage}%
330 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth}
331 %\vspace*{-20mm}
332 The {\bf local elastic deformation}
333 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
334 which provides more details.
335 Random displacement fields are scaled by an intensity
336 $\alpha = \sqrt[3]{complexity} \times 10.0$ and
337 convolved with a 2D Gaussian kernel (which smooths them) of
338 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$.
339 %\vspace{.9cm}
340 \end{minipage}
341
342 \vspace*{7mm}
343
344 %\begin{minipage}[b]{0.14\linewidth}
345 %\centering
346 \begin{minipage}[h]{\linewidth}
347 \begin{wrapfigure}[7]{l}{0.15\textwidth}
348 \vspace*{-5mm}
349 \begin{center}
350 \includegraphics[scale=.4]{images/Pinch_only.png}\\
351 {\bf Pinch}
352 \end{center}
353 \end{wrapfigure}
354 %\vspace{.6cm}
355 %\end{minipage}%
356 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
357 The {\bf pinch} module applies the ``Whirl and pinch'' GIMP filter with whirl set to 0.
358 A pinch is ``similar to projecting the image onto an elastic
359 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
360 For a square input image, draw a radius-$r$ disk
361 around its center $C$. Any pixel $P$ belonging to
362 that disk has its value replaced by
363 the value of a ``source'' pixel in the original image,
364 on the line that goes through $C$ and $P$, but
365 at some other distance $d_2$. Define $d_1={\rm distance}(P,C)$
366 and $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
367 d_1$, where $pinch$ is a parameter of the filter.
368 The actual value is given by bilinear interpolation considering the pixels
369 around the (non-integer) source position thus found.
370 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
371 %\vspace{1.5cm}
372 \end{minipage}
373
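The source-pixel mapping and bilinear interpolation can be sketched as follows (an illustrative approximation, not the GIMP implementation):
{\small\begin{verbatim}
import numpy as np
from scipy import ndimage

def pinch(image, complexity):
    amount = np.random.uniform(-complexity, 0.7 * complexity)
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = min(cy, cx)                               # radius of the affected disk
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    dy, dx = yy - cy, xx - cx
    d1 = np.sqrt(dy ** 2 + dx ** 2)
    scale = np.ones_like(d1)                      # ratio d2 / d1
    inside = (d1 > 0) & (d1 < r)
    scale[inside] = np.sin(np.pi * d1[inside] / (2 * r)) ** (-amount)
    # order=1: bilinear interpolation at the non-integer source positions.
    return ndimage.map_coordinates(image, [cy + dy * scale, cx + dx * scale],
                                   order=1, mode='nearest')
\end{verbatim}}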
374 \vspace{1mm}
375
376 {\large\bf 2.2 Injecting Noise}
377 %\subsection{Injecting Noise}
378 %\vspace{2mm}
379
380 \begin{minipage}[h]{\linewidth}
381 %\vspace*{-.2cm}
382 %\begin{minipage}[t]{0.14\linewidth}
383 \begin{wrapfigure}[8]{l}{0.15\textwidth}
384 \begin{center}
385 \vspace*{-5mm}
386 %\vspace*{-2mm}
387 \includegraphics[scale=.4]{images/Motionblur_only.png}\\
388 {\bf Motion Blur}
389 %\end{minipage}%
390 \end{center}
391 \end{wrapfigure}
392 %\hspace{0.3cm}
393 %\begin{minipage}[t]{0.83\linewidth}
394 %\vspace*{.5mm}
395 The {\bf motion blur} module is GIMP's ``linear motion blur'', which
396 has parameters $length$ and $angle$. The value of
397 a pixel in the final image is approximately the mean of the first $length$ pixels
398 found by moving in the $angle$ direction,
399 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
400 \vspace{5mm}
401 \end{minipage}
402 %\end{minipage}
403
404 \vspace*{1mm}
405
406 \begin{minipage}[h]{\linewidth}
407 \begin{minipage}[t]{0.14\linewidth}
408 \centering
409 \includegraphics[scale=.4]{images/occlusion_only.png}\\
410 {\bf Occlusion}
411 %\vspace{.5cm}
412 \end{minipage}%
413 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
414 \vspace*{-18mm}
415 The {\bf occlusion} module selects a random rectangle from an {\em occluder} character
416 image and places it over the original {\em occluded}
417 image. Pixels are combined by taking $\max(occluder, occluded)$,
418 i.e. keeping the lighter ones.
419 The rectangle corners
420 are sampled so that a larger complexity gives larger rectangles.
421 The destination position in the occluded image is also sampled
422 according to a normal distribution (more details in~\citet{ARXIV-2010}).
423 This module is skipped with probability 60\%.
424 %\vspace{7mm}
425 \end{minipage}
426 \end{minipage}
427
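A simplified sketch (with uniform corner sampling instead of the normal-distributed positions used in practice):
{\small\begin{verbatim}
import numpy as np

def occlusion(occluded, occluder, complexity, rng=np.random):
    if rng.random() < 0.6:                    # module skipped 60% of the time
        return occluded
    h, w = occluded.shape
    rh = max(1, int(rng.uniform(0, complexity) * h))  # larger complexity
    rw = max(1, int(rng.uniform(0, complexity) * w))  # gives larger rectangles
    sy, sx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    dy, dx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    out = occluded.copy()
    # Keep the lighter of the two pixels: max(occluder, occluded).
    out[dy:dy + rh, dx:dx + rw] = np.maximum(out[dy:dy + rh, dx:dx + rw],
                                             occluder[sy:sy + rh, sx:sx + rw])
    return out
\end{verbatim}}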
428 \vspace*{1mm}
429
430 \begin{wrapfigure}[8]{l}{0.15\textwidth}
431 \vspace*{-3mm}
432 \begin{center}
433 %\begin{minipage}[t]{0.14\linewidth}
434 %\centering
435 \includegraphics[scale=.4]{images/Bruitgauss_only.png}\\
436 {\bf Gaussian Smoothing}
437 \end{center}
438 \end{wrapfigure}
439 %\vspace{.5cm}
440 %\end{minipage}%
441 %\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
442 With the {\bf Gaussian smoothing} module,
443 different regions of the image are spatially smoothed.
444 This is achieved by first convolving
445 the image with an isotropic Gaussian kernel of
446 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
447 complexity]$ and $[2,2 + 6 \times complexity]$. This filtered image is normalized
448 between $0$ and $1$. We also create an isotropic weighted averaging window of the
449 kernel size, with its maximum value at the center. For each image we sample
450 uniformly a number of pixels, from $3$ to $3 + 10 \times complexity$, that will serve as
451 averaging centers between the original image and the filtered one. We
452 initialize to zero a mask matrix of the image size. For each selected pixel
453 we add to the mask the averaging window centered on it. The final image is
454 computed from the following element-wise operation: $\frac{image + filtered\_image
455 \times mask}{mask+1}$.
456 This module is skipped with probability 75\%.
457 %\end{minipage}
458
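A hedged sketch of this blending (the exact window shape and normalization below are assumptions, not the released code):
{\small\begin{verbatim}
import numpy as np
from scipy import ndimage

def local_smoothing(image, complexity, rng=np.random):
    if rng.random() < 0.75:                          # skipped 75% of the time
        return image
    size = int(round(rng.uniform(12, 12 + 20 * complexity))) | 1  # odd size
    var = rng.uniform(2, 2 + 6 * complexity)
    filtered = ndimage.gaussian_filter(image, sigma=np.sqrt(var))
    filtered = (filtered - filtered.min()) / (filtered.ptp() + 1e-8)
    # Isotropic averaging window of the kernel size, peaked at its center.
    tri = 1.0 - np.abs(np.arange(size) - size // 2) / (size // 2 + 1.0)
    window = np.outer(tri, tri)
    mask = np.zeros_like(image)
    half = size // 2
    n_centers = rng.randint(3, int(3 + 10 * complexity) + 1)
    for _ in range(n_centers):
        y, x = rng.randint(0, image.shape[0]), rng.randint(0, image.shape[1])
        y0, y1 = max(0, y - half), min(image.shape[0], y + half + 1)
        x0, x1 = max(0, x - half), min(image.shape[1], x + half + 1)
        mask[y0:y1, x0:x1] += window[y0 - y + half:y1 - y + half,
                                     x0 - x + half:x1 - x + half]
    return (image + filtered * mask) / (mask + 1.0)  # element-wise blending
\end{verbatim}}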
459 %\newpage
460
461 \vspace*{1mm}
462
463 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
464 %\centering
465 \begin{minipage}[t]{\linewidth}
466 \begin{wrapfigure}[7]{l}{0.15\textwidth}
467 \vspace*{-5mm}
468 \begin{center}
469 \includegraphics[scale=.4]{images/Permutpixel_only.png}\\
470 {\small\bf Permute Pixels}
471 \end{center}
472 \end{wrapfigure}
473 %\end{minipage}%
474 %\hspace{-0cm}\begin{minipage}[t]{0.86\linewidth}
475 %\vspace*{-20mm}
476 This module {\bf permutes neighbouring pixels}. It first selects a
477 fraction $\frac{complexity}{3}$ of pixels randomly in the image. Each
478 of these pixels is then sequentially exchanged with a random pixel
479 among its four nearest neighbors (on its left, right, top or bottom).
480 This module is skipped with probability 80\%.\\
481 \vspace*{1mm}
482 \end{minipage}
483
484 \vspace{-3mm}
485
486 \begin{minipage}[t]{\linewidth}
487 \begin{wrapfigure}[7]{l}{0.15\textwidth}
488 %\vspace*{-3mm}
489 \begin{center}
490 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
491 %\centering
492 \vspace*{-5mm}
493 \includegraphics[scale=.4]{images/Distorsiongauss_only.png}\\
494 {\small \bf Gauss. Noise}
495 \end{center}
496 \end{wrapfigure}
497 %\end{minipage}%
498 %\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
499 \vspace*{12mm}
500 The {\bf Gaussian noise} module simply adds, to each pixel of the image independently, a
501 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
502 This module is skipped with probability 70\%.
503 %\vspace{1.1cm}
504 \end{minipage}
505
506 \vspace*{1.2cm}
507
508 \begin{minipage}[t]{\linewidth}
509 \begin{minipage}[t]{0.14\linewidth}
510 \centering
511 \includegraphics[scale=.4]{images/background_other_only.png}\\
512 {\small \bf Bg Image}
513 \end{minipage}%
514 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
515 \vspace*{-18mm}
516 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random
517 background image behind the letter, from a randomly chosen natural image,
518 with contrast adjustments depending on $complexity$, to preserve
519 more or less of the original character image.
520 %\vspace{.8cm}
521 \end{minipage}
522 \end{minipage}
523 %\vspace{-.7cm}
524
525 \begin{minipage}[t]{0.14\linewidth}
526 \centering
527 \includegraphics[scale=.4]{images/Poivresel_only.png}\\
528 {\small \bf Salt \& Pepper}
529 \end{minipage}%
530 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
531 \vspace*{-18mm}
532 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to a random subset of pixels.
533 The fraction of selected pixels is $0.2 \times complexity$.
534 This module is skipped with probability 75\%.
535 %\vspace{.9cm}
536 \end{minipage}
537 %\vspace{-.7cm}
538
539 \vspace{1mm}
540
541 \begin{minipage}[t]{\linewidth}
542 \begin{wrapfigure}[7]{l}{0.14\textwidth}
543 %\begin{minipage}[t]{0.14\linewidth}
544 %\centering
545 \begin{center}
546 \vspace*{-4mm}
547 \hspace*{-1mm}\includegraphics[scale=.4]{images/Rature_only.png}\\
548 {\bf Scratches}
549 %\end{minipage}%
550 \end{center}
551 \end{wrapfigure}
552 %\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
553 %\vspace{.4cm}
554 The {\bf scratches} module places line-like white patches on the image. The
555 lines are heavily transformed images of the digit ``1'' (one), chosen
556 at random among 500 such images,
557 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
558 complexity)^2)$ (in degrees), using bi-cubic interpolation.
559 Two passes of a grey-scale morphological erosion filter
560 are applied, reducing the width of the line
561 by an amount controlled by $complexity$.
562 This module is skipped with probability 85\%. The probabilities
563 of applying 1, 2, or 3 patches are (50\%,30\%,20\%).
564 \end{minipage}
565
566 \vspace*{1mm}
567
568 \begin{minipage}[t]{0.25\linewidth}
569 \centering
570 \hspace*{-16mm}\includegraphics[scale=.4]{images/Contrast_only.png}\\
571 {\bf Grey Level \& Contrast}
572 \end{minipage}%
573 \hspace{-12mm}\begin{minipage}[t]{0.82\linewidth}
574 \vspace*{-18mm}
575 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white
576 to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$
577 so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
578 polarity is inverted with probability 50\%.
579 %\vspace{.7cm}
580 \end{minipage}
581 \vspace{2mm}
582
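Assuming pixel values already lie in $[0,1]$, this module can be sketched as:
{\small\begin{verbatim}
import numpy as np

def grey_level_and_contrast(image, complexity, rng=np.random):
    c = rng.uniform(1 - 0.85 * complexity, 1.0)
    lo, hi = (1 - c) / 2.0, 1 - (1 - c) / 2.0
    out = lo + image * (hi - lo)       # squeeze grey levels into [lo, hi]
    if rng.random() < 0.5:             # invert polarity half of the time
        out = 1.0 - out
    return out
\end{verbatim}}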
583 \iffalse
584 \begin{figure}[ht]
585 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\
586 \caption{Illustration of the pipeline of stochastic
587 transformations applied to the image of a lower-case \emph{t}
588 (the upper left image). Each image in the pipeline (going from
589 left to right, first top line, then bottom line) shows the result
590 of applying one of the modules in the pipeline. The last image
591 (bottom right) is used as training example.}
592 \label{fig:pipeline}
593 \end{figure}
594 \fi
595
596 \vspace*{-3mm}
597 \section{Experimental Setup}
598 \vspace*{-1mm}
599
600 Much previous work on deep learning had been performed on
601 the MNIST digits task
602 with 60~000 examples, and variants involving 10~000
603 examples~\citep{VincentPLarochelleH2008-very-small}.
604 The focus here is on much larger training sets, from 10 to
605 1000 times larger, and on 62 classes.
606
607 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
608 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
609 and {\bf OCR data} (scanned machine printed characters). Once a character
610 is sampled from one of these {\em data sources} (chosen randomly), the second step is to
611 apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.
612
613 To provide a baseline for error rate comparisons we also estimate human performance
614 on both the 62-class task and the 10-class digits task.
615 We compare the best Multi-Layer Perceptrons (MLP) against
616 the best Stacked Denoising Auto-encoders (SDA), with
617 both models' hyper-parameters selected to minimize the validation set error.
618 The estimate of human performance was obtained via Amazon's
619 Mechanical Turk (AMT)
620 service ({\tt http://mturk.com}).
621 AMT users are paid small amounts
622 of money to perform tasks for which human intelligence is required.
623 An incentive for them to do the job right is that payment can be denied
624 if the job is not properly done.
625 Mechanical Turk has been used extensively in natural language processing and vision.
626 %processing \citep{SnowEtAl2008} and vision
627 %\citep{SorokinAndForsyth2008,whitehill09}.
628 AMT users were presented
629 with 10 character images at a time (from a test set) and asked to choose 10 corresponding ASCII
630 characters. They were forced to choose a single character class (either among the
631 62 or 10 character classes) for each image.
632 80 subjects classified 2500 images per (dataset,task) pair.
633 Different human labelers sometimes provided a different label for the same
634 example, and we were able to estimate the error variance due to this effect
635 because each image was classified by 3 different persons.
636 The average error of humans on the 62-class task NIST test set
637 is 18.2\%, with a standard error of 0.1\%.
638
639 \vspace*{-3mm}
640 \subsection{Data Sources}
641 \vspace*{-2mm}
642
643 %\begin{itemize}
644 %\item
645 {\bf NIST.}
646 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
647 widely used for training and testing character
648 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
649 The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications,
650 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
651 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
652 The fourth partition (called $hsf_4$, 82587 examples),
653 experimentally recognized to be the most difficult one, is the one recommended
654 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
655 for that purpose. We randomly split the remainder (731,668 examples) into a training set and a validation set for
656 model selection.
657 The results reported by previous work on this dataset mostly concern only the digits.
658 Here we use all the classes both in the training and testing phase. This is especially
659 useful to estimate the effect of a multi-task setting.
660 The distribution of the classes in the NIST training and test sets differs
661 substantially, with relatively many more digits in the test set, and a more uniform distribution
662 of letters in the test set (whereas in the training set they are distributed
663 more like in natural text).
664 \vspace*{-1mm}
665
666 %\item
667 {\bf Fonts.}
668 In order to have a good variety of sources we downloaded a large number of free fonts from
669 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
670 % TODO: pointless to anonymize, it's not pointing to our work
671 Including the fonts shipped with an operating system (Windows 7), there is a total of $9817$ different fonts, from which we choose uniformly.
672 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
673 directly as input to our models.
674 \vspace*{-1mm}
675
676 %\item
677 {\bf Captchas.}
678 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
679 generating characters of the same format as the NIST dataset. This software is based on
680 a random character class generator and various kinds of transformations similar to those described in the previous sections.
681 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
682 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
683 depending on the value of the complexity parameter provided by the user of the data source.
684 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
685 \vspace*{-1mm}
686
687 %\item
688 {\bf OCR data.}
689 A large set (2 million) of scanned, OCRed and manually verified machine-printed
690 characters were included as an
691 additional source. This set is part of a larger corpus being collected by the Image Understanding
692 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
693 ({\tt http://www.iupr.com}), and which will be publicly released.
694 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
695 %\end{itemize}
696
697 \vspace*{-3mm}
698 \subsection{Data Sets}
699 \vspace*{-2mm}
700
701 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
702 from one of the 62 character classes. They are obtained from the optional application of the
703 perturbation pipeline to iid samples from the datasources, and they are randomly split into
704 training set, validation set, and test set.
705 %\begin{itemize}
706 \vspace*{-1mm}
707
708 %\item
709 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
710 \{651668 / 80000 / 82587\} \{training / validation / test\} examples, containing
711 upper case, lower case, and digits.
712 \vspace*{-1mm}
713
714 %\item
715 {\bf P07.} This dataset of upper case, lower case and digit images
716 is obtained by taking raw characters from all four of the above sources
717 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
718 For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
719 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
720 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
721 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples.
722 \vspace*{-1mm}
723
724 %\item
725 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
726 except that we only apply
727 transformations from slant to pinch. Therefore, the character is
728 transformed but no additional noise is added to the image, giving images
729 closer to the NIST dataset.
730 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples
731 obtained from the corresponding NIST sets plus other sources.
732 %\end{itemize}
733
734 \vspace*{-3mm}
735 \subsection{Models and their Hyperparameters}
736 \vspace*{-2mm}
737
738 The experiments are performed using MLPs (with a single
739 hidden layer) and deep SDAs.
740 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
741
742 {\bf Multi-Layer Perceptrons (MLP).}
743 Whereas previous work had compared deep architectures to both shallow MLPs and
744 SVMs, we only compared to MLPs here because of the very large datasets used
745 (making the use of SVMs computationally challenging because of their quadratic
746 scaling behavior). Preliminary experiments on training SVMs (libSVM) with subsets of the training
747 set small enough to fit in memory yielded substantially worse results
748 than those obtained with MLPs. For training on nearly a hundred million examples
749 (with the perturbed data), the MLPs and SDA are much more convenient than
750 classifiers based on kernel methods.
751 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
752 exponentials) on the output layer for estimating $P(class | image)$.
753 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
754 Training examples are presented in minibatches of size 20. A constant learning
755 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
756 %through preliminary experiments (measuring performance on a validation set),
757 %and $0.1$ (which was found to work best) was then selected for optimizing on
758 %the whole training sets.
759 \vspace*{-1mm}
760
761
762 {\bf Stacked Denoising Auto-encoders (SDA).}
763 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
764 can be used to initialize the weights of each layer of a deep MLP (with many hidden
765 layers),
766 apparently setting the parameters in a
767 basin of attraction of supervised gradient descent that yields better
768 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
769 pre-training phase} uses all of the training images but not the training labels.
770 Each layer is trained in turn to produce a new representation of its input
771 (starting from the raw pixels).
772 It is hypothesized that the
773 advantage brought by this procedure stems from a better prior,
774 on the one hand taking advantage of the link between the input
775 distribution $P(x)$ and the conditional distribution of interest
776 $P(y|x)$ (like in semi-supervised learning), and on the other hand
777 taking advantage of the expressive power and bias implicit in the
778 deep architecture (whereby complex concepts are expressed as
779 compositions of simpler ones through a deep hierarchy).
780
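To make the layer-wise procedure concrete, the following NumPy sketch pre-trains a single denoising auto-encoder layer with masking noise and tied weights, one example at a time (the actual experiments used Theano-based minibatch code, so this is only illustrative):
{\small\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_da_layer(X, n_hidden, corruption=0.2, lr=0.01, n_epochs=10,
                      rng=np.random.RandomState(0)):
    # X: (n_examples, n_in) matrix of inputs in [0,1] (raw pixels or the
    # previous layer's representation).  Returns weights, biases and the
    # uncorrupted hidden representation used to train the next layer.
    n_in = X.shape[1]
    W = rng.uniform(-1.0 / np.sqrt(n_in), 1.0 / np.sqrt(n_in),
                    (n_in, n_hidden))
    b, b_prime = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(n_epochs):
        for x in X:
            x_tilde = x * (rng.uniform(size=n_in) > corruption)  # masking
            y = sigmoid(x_tilde @ W + b)                          # encoder
            z = sigmoid(y @ W.T + b_prime)                        # tied decoder
            # Cross-entropy reconstruction loss against the *uncorrupted* x;
            # with sigmoid outputs, d(loss)/d(pre-activation) = z - x.
            dz = z - x
            dy = (dz @ W) * y * (1 - y)
            W -= lr * (np.outer(x_tilde, dy) + np.outer(dz, y))
            b -= lr * dy
            b_prime -= lr * dz
    return W, b, sigmoid(X @ W + b)
\end{verbatim}}
Stacking amounts to calling this procedure again on the returned hidden representation; the stacked weights and biases then initialize the deep MLP that is fine-tuned with supervision.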
781 \iffalse
782 \begin{figure}[ht]
783 \vspace*{-2mm}
784 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
785 \vspace*{-2mm}
786 \caption{Illustration of the computations and training criterion for the denoising
787 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
788 the layer (i.e. raw input or output of previous layer)
789 s corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
790 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
791 is compared to the uncorrupted input $x$ through the loss function
792 $L_H(x,z)$, whose expected value is approximately minimized during training
793 by tuning $\theta$ and $\theta'$.}
794 \label{fig:da}
795 \vspace*{-2mm}
796 \end{figure}
797 \fi
798
799 Here we chose to use the Denoising
800 Auto-encoder~\citep{VincentPLarochelleH2008-very-small} as the building block for
801 these deep hierarchies of features, as it is simple to train and
802 explain (see % Figure~\ref{fig:da}, as well as
803 the tutorial and code at {\tt http://deeplearning.net/tutorial}),
804 provides efficient inference, and yielded results
805 comparable to or better than RBMs in a series of experiments
806 \citep{VincentPLarochelleH2008-very-small}. In fact, it corresponds to a Gaussian
807 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}.
808 During training, a Denoising
809 Auto-encoder is presented with a stochastically corrupted version
810 of the input and trained to reconstruct the uncorrupted input,
811 forcing the hidden units to represent the leading regularities in
812 the data. Here we use the random binary masking corruption
813 (which sets to 0 a random subset of the inputs).
814 Once it is trained, in a purely unsupervised way,
815 its hidden units' activations can
816 be used as inputs for training a second one, etc.
817 After this unsupervised pre-training stage, the parameters
818 are used to initialize a deep MLP, which is fine-tuned by
819 the same standard procedure used to train MLPs (see the previous section).
820 The SDA hyper-parameters are the same as for the MLP, with the addition of the
821 amount of corruption noise (we used the masking noise process, whereby a
822 fixed proportion of the input values, randomly selected, are zeroed), and a
823 separate learning rate for the unsupervised pre-training stage (selected
824 from the same above set). The fraction of inputs corrupted was selected
825 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
826 of hidden layers but it was fixed to 3 based on previous work with
827 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. The size of the hidden
828 layers was kept constant across hidden layers, and the best results
829 were obtained with the largest value that we could afford to experiment
830 with, namely 1000 hidden units.
831
832 \vspace*{-1mm}
833
834 \begin{figure}[ht]
835 \vspace*{-2mm}
836 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
837 \vspace*{-3mm}
838 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
839 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
840 of all models, on NIST and NISTP test sets.
841 Right: error rates on NIST test digits only, along with the previous results from
842 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
843 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
844 \label{fig:error-rates-charts}
845 \vspace*{-2mm}
846 \end{figure}
847
848
849 \begin{figure}[ht]
850 \vspace*{-3mm}
851 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
852 \vspace*{-3mm}
853 \caption{Relative improvement in error rate due to self-taught learning.
854 Left: Improvement (or loss, when negative)
855 induced by out-of-distribution examples (perturbed data).
856 Right: Improvement (or loss, when negative) induced by multi-task
857 learning (training on all classes and testing only on either digits,
858 upper case, or lower-case). The deep learner (SDA) benefits more from
859 both self-taught learning scenarios, compared to the shallow MLP.}
860 \label{fig:improvements-charts}
861 \vspace*{-2mm}
862 \end{figure}
863
864 \section{Experimental Results}
865 \vspace*{-2mm}
866
867 %\vspace*{-1mm}
868 %\subsection{SDA vs MLP vs Humans}
869 %\vspace*{-1mm}
870 The models are either trained on NIST (MLP0 and SDA0),
871 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested
872 on either NIST, NISTP or P07 (regardless of the data set used for training),
873 either on the 62-class task
874 or on the 10-digits task. Training time (including about half
875 for unsupervised pre-training, for SDAs) on the larger
876 datasets is around one day on a GPU (GTX 285).
877 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
878 comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
879 SDA2), along with the previous results on the digits NIST special database
880 19 test set from the literature, respectively based on ARTMAP neural
881 networks~\citep{Granger+al-2007}, fast nearest-neighbor
882 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
883 SVMs~\citep{Milgram+al-2005}.% More detailed and complete numerical results
884 %(figures and tables, including standard errors on the error rates) can be
885 %found in Appendix.
886 The deep learner not only outperformed the shallow ones and
887 previously published performance (in a statistically and qualitatively
888 significant way) but, when trained with perturbed data,
889 also reached human performance on both the 62-class task
890 and the 10-class (digits) task.
891 17\% error (SDA1) or 18\% error (humans) may seem large but a large
892 majority of the errors from humans and from SDA1 are from out-of-context
893 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
894 ``c'' and a ``C'' are often indistinguishable).
895
896 In addition, as shown in the left of
897 Figure~\ref{fig:improvements-charts}, the relative improvement in error
898 rate brought by self-taught learning is greater for the SDA, and these
899 differences with the MLP are statistically and qualitatively
900 significant.
901 The left side of the figure shows the improvement to the clean
902 NIST test set error brought by the use of out-of-distribution examples
903 (i.e. the perturbed examples from NISTP or P07),
904 over the models trained exclusively on NIST (respectively SDA0 and MLP0).
905 Relative percent change is measured by taking
906 $100 \% \times$ (original model's error / perturbed-data model's error - 1).
907 The right side of
908 Figure~\ref{fig:improvements-charts} shows the relative improvement
909 brought by the use of a multi-task setting, in which the same model is
910 trained for more classes than the target classes of interest (i.e. training
911 with all 62 classes when the target classes are respectively the digits,
912 lower-case, or upper-case characters). Again, whereas the gain from the
913 multi-task setting is marginal or negative for the MLP, it is substantial
914 for the SDA. Note that to simplify these multi-task experiments, only the original
915 NIST dataset is used. For example, the MLP-digits bar shows that the relative
916 percent improvement in MLP error rate on the NIST digits test set
917 is $100\% \times$ (single-task
918 model's error / multi-task model's error - 1). The single-task model is
919 trained with only 10 outputs (one per digit), seeing only digit examples,
920 whereas the multi-task model is trained with 62 outputs, with all 62
921 character classes as examples. Hence the hidden units are shared across
922 all tasks. For the multi-task model, the digit error rate is measured by
923 comparing the correct digit class with the output class associated with the
924 maximum conditional probability among only the digit classes outputs. The
925 setting is similar for the other two target classes (lower case characters
926 and upper case characters). Note however that some types of perturbations
927 (NISTP) help more than others (P07) when testing on the clean images.
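
For instance, the digit error rate of the 62-output model can be computed by restricting the argmax to the digit outputs, as in the sketch below (the class ordering is an assumption):
{\small\begin{verbatim}
import numpy as np

def restricted_error_rate(probs, labels, allowed_classes):
    # probs: (n_examples, 62) conditional class probabilities of the
    # multi-task model; the prediction is the argmax restricted to the
    # allowed output subset (e.g. the 10 digit classes).
    allowed = np.asarray(list(allowed_classes))
    predictions = allowed[np.argmax(probs[:, allowed], axis=1)]
    return np.mean(predictions != labels)

# Hypothetical usage, assuming classes 0-9 are the digits:
# digit_error = restricted_error_rate(probs, labels, range(10))
\end{verbatim}}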
928 %%\vspace*{-1mm}
929 %\subsection{Perturbed Training Data More Helpful for SDA}
930 %\vspace*{-1mm}
931
932 %\vspace*{-1mm}
933 %\subsection{Multi-Task Learning Effects}
934 %\vspace*{-1mm}
935
936 \iffalse
937 As previously seen, the SDA is better able to benefit from the
938 transformations applied to the data than the MLP. In this experiment we
939 define three tasks: recognizing digits (knowing that the input is a digit),
940 recognizing upper case characters (knowing that the input is one), and
941 recognizing lower case characters (knowing that the input is one). We
942 consider the digit classification task as the target task and we want to
943 evaluate whether training with the other tasks can help or hurt, and
944 whether the effect is different for MLPs versus SDAs. The goal is to find
945 out if deep learning can benefit more (or less) from multiple related tasks
946 (i.e. the multi-task setting) compared to a corresponding purely supervised
947 shallow learner.
948
949 We use a single hidden layer MLP with 1000 hidden units, and a SDA
950 with 3 hidden layers (1000 hidden units per layer), pre-trained and
951 fine-tuned on NIST.
952
953 Our results show that the MLP benefits marginally from the multi-task setting
954 in the case of digits (5\% relative improvement) but is actually hurt in the case
955 of characters (respectively 3\% and 4\% worse for lower and upper class characters).
956 On the other hand the SDA benefited from the multi-task setting, with relative
957 error rate improvements of 27\%, 15\% and 13\% respectively for digits,
958 lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
959 \fi
960
961
962 \vspace*{-2mm}
963 \section{Conclusions and Discussion}
964 \vspace*{-2mm}
965
966 We have found that the self-taught learning framework is more beneficial
967 to a deep learner than to a traditional shallow and purely
968 supervised learner. More precisely,
969 the answers are positive for all the questions asked in the introduction.
970 %\begin{itemize}
971
972 $\bullet$ %\item
973 {\bf Do the good results previously obtained with deep architectures on the
974 MNIST digits generalize to a much larger and richer (but similar)
975 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
976 Yes, the SDA {\em systematically outperformed the MLP and all the previously
977 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
978 performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
980
981 $\bullet$ %\item
982 {\bf To what extent do self-taught learning scenarios help deep learners,
983 and do they help them more than shallow supervised ones}?
984 We found that distorted training examples not only made the resulting
985 classifier better on similarly perturbed images but also on
986 the {\em original clean examples}; more importantly, and this is the novel finding,
987 deep architectures benefit more from such {\em out-of-distribution}
988 examples. MLPs were helped by perturbed training examples when tested on perturbed input
989 images (65\% relative improvement on NISTP)
990 but only marginally helped (5\% relative improvement on all classes)
991 or even hurt (10\% relative loss on digits)
992 with respect to clean examples. On the other hand, the deep SDAs
993 were significantly boosted by these out-of-distribution examples.
994 Similarly, whereas the improvement due to the multi-task setting was marginal or
995 negative for the MLP (from +5.6\% to -3.6\% relative change),
996 it was quite significant for the SDA (from +13\% to +27\% relative change),
997 which may be explained by the arguments below.
998 %\end{itemize}
999
1000 In the original self-taught learning framework~\citep{RainaR2007}, the
1001 out-of-sample examples were used as a source of unsupervised data, and
1002 experiments showed its positive effects in a \emph{limited labeled data}
1003 scenario. However, many of the results by \citet{RainaR2007} (who used a
1004 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught
1005 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases.
1006 We note instead that, for deep
1007 architectures, our experiments show that such a positive effect is accomplished
1008 even in a scenario with a \emph{large number of labeled examples},
1009 i.e., here, the relative gain of self-taught learning and
1010 out-of-distribution examples is probably preserved
1011 in the asymptotic regime. However, note that in our perturbation experiments
1012 (but not in our multi-task experiments),
1013 even the out-of-distribution examples are labeled, unlike in the
1014 earlier self-taught learning experiments~\citep{RainaR2007}.
1015
1016 {\bf Why would deep learners benefit more from the self-taught learning framework}?
1017 The key idea is that the lower layers of the predictor compute a hierarchy
1018 of features that can be shared across tasks or across variants of the
1019 input distribution. A theoretical analysis of generalization improvements
1020 due to sharing of intermediate features across tasks already points
1021 towards that explanation~\cite{baxter95a}.
1022 Intermediate features that can be used in different
1023 contexts can be estimated in a way that allows sharing of statistical
1024 strength. Features extracted through many levels are more likely to
1025 be more abstract and more invariant to some of the factors of variation
1026 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest),
1027 increasing the likelihood that they would be useful for a larger array
1028 of tasks and input conditions.
1029 Therefore, we hypothesize that both depth and unsupervised
1030 pre-training play a part in explaining the advantages observed here, and future
1031 experiments could attempt to tease apart these factors.
1032 And why would deep learners benefit from the self-taught learning
1033 scenarios even when the number of labeled examples is very large?
1034 We hypothesize that this is related to the hypotheses studied
1035 in~\citet{Erhan+al-2010}. In~\citet{Erhan+al-2010}
1036 it was found that online learning on a huge dataset did not make the
1037 advantage of the deep learning bias vanish, and a similar phenomenon
1038 may be happening here. We hypothesize that unsupervised pre-training
1039 of a deep hierarchy with self-taught learning initializes the
1040 model in the basin of attraction of supervised gradient descent
1041 that corresponds to better generalization. Furthermore, such good
1042 basins of attraction are not discovered by pure supervised learning
1043 (with or without self-taught settings), and more labeled examples
1044 do not allow the model to go from the poorer basins of attraction discovered
1045 by the purely supervised shallow models to the kind of better basins associated
1046 with deep learning and self-taught learning.
1047
1048 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
1049 can be executed on-line at {\tt http://deep.host22.com}.
1050
1051 %\newpage
1052 {
1053 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,aigaion-shorter,specials}
1054 %\bibliographystyle{plainnat}
1055 \bibliographystyle{unsrtnat}
1056 %\bibliographystyle{apalike}
1057 }
1058
1059
1060 \end{document}