comparison writeup/mlj_submission.tex @ 585:4933077b8676

MLJ submission
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 29 Sep 2010 21:06:47 -0400
1 \documentclass{article} % For LaTeX2e
2 \usepackage{times}
3 \usepackage{wrapfig}
4 \usepackage{amsthm,amsmath,bbm}
5 \usepackage[psamsfonts]{amssymb}
6 \usepackage{algorithm,algorithmic}
7 \usepackage[utf8]{inputenc}
8 \usepackage{graphicx,subfigure}
9 \usepackage[numbers]{natbib}
10
11 \addtolength{\textwidth}{10mm}
12 \addtolength{\evensidemargin}{-5mm}
13 \addtolength{\oddsidemargin}{-5mm}
14
15 %\setlength\parindent{0mm}
16
17 \title{Deep Self-Taught Learning for Handwritten Character Recognition}
18 \author{
19 Frédéric Bastien \and
20 Yoshua Bengio \and
21 Arnaud Bergeron \and
22 Nicolas Boulanger-Lewandowski \and
23 Thomas Breuel \and
24 Youssouf Chherawala \and
25 Moustapha Cisse \and
26 Myriam Côté \and
27 Dumitru Erhan \and
28 Jeremy Eustache \and
29 Xavier Glorot \and
30 Xavier Muller \and
31 Sylvain Pannetier Lebeuf \and
32 Razvan Pascanu \and
33 Salah Rifai \and
34 Francois Savard \and
35 Guillaume Sicard
36 }
37 \date{September 30th, submission to MLJ special issue on learning from multi-label data}
38
39 \begin{document}
40
41 %\makeanontitle
42 \maketitle
43
44 %\vspace*{-2mm}
45 \begin{abstract}
46 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.
47 \end{abstract}
48 %\vspace*{-3mm}
49
50 Keywords: self-taught learning, multi-task learning, out-of-distribution examples, deep learning, handwriting recognition.
51
52 \section{Introduction}
53 %\vspace*{-1mm}
54
55 {\bf Deep Learning} has emerged as a promising new area of research in
56 statistical machine learning (see~\citet{Bengio-2009} for a review).
57 Learning algorithms for deep architectures are centered on the learning
58 of useful representations of data, which are better suited to the task at hand,
59 and are organized in a hierarchy with multiple levels.
60 This is in part inspired by observations of the mammalian visual cortex,
61 which consists of a chain of processing elements, each of which is associated with a
62 different representation of the raw visual input. In fact,
63 it was found recently that the features learnt in deep architectures resemble
64 those observed in the first two of these stages (in areas V1 and V2
65 of visual cortex)~\citep{HonglakL2008}, and that they become more and
66 more invariant to factors of variation (such as camera movement) in
67 higher layers~\citep{Goodfellow2009}.
68 Learning a hierarchy of features increases the
69 ease and practicality of developing representations that are at once
70 tailored to specific tasks, yet are able to borrow statistical strength
71 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
72 feature representation can lead to higher-level (more abstract, more
general) features that are more robust to unanticipated sources of
variation present in real data.
75
76 {\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles
77 of semi-supervised and multi-task learning: the learner can exploit examples
78 that are unlabeled and possibly come from a distribution different from the target
79 distribution, e.g., from other classes than those of interest.
80 It has already been shown that deep learners can clearly take advantage of
81 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
82 but more needs to be done to explore the impact
83 of {\em out-of-distribution} examples and of the {\em multi-task} setting
(one exception is~\citet{CollobertR2008}, which uses a different kind
85 of learning algorithm). In particular the {\em relative
86 advantage of deep learning} for these settings has not been evaluated.
87 The hypothesis discussed in the conclusion is that in the context of
88 multi-task learning and the availability of out-of-distribution training examples,
89 a deep hierarchy of features
90 may be better able to provide sharing of statistical strength
91 between different regions in input space or different tasks, compared to
92 a shallow learner.
93
94 Whereas a deep architecture can in principle be more powerful than a
95 shallow one in terms of representation, depth appears to render the
96 training problem more difficult in terms of optimization and local minima.
97 It is also only recently that successful algorithms were proposed to
98 overcome some of these difficulties. All are based on unsupervised
learning, often in a greedy layer-wise ``unsupervised pre-training''
100 stage~\citep{Bengio-2009}. One of these layer initialization techniques,
101 applied here, is the Denoising
102 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}),
103 which
performed similarly to or better than previously proposed Restricted Boltzmann
105 Machines in terms of unsupervised extraction of a hierarchy of features
106 useful for classification. Each layer is trained to denoise its
107 input, creating a layer of features that can be used as input for the next layer.
108
109 %The principle is that each layer starting from
110 %the bottom is trained to encode its input (the output of the previous
111 %layer) and to reconstruct it from a corrupted version. After this
112 %unsupervised initialization, the stack of DAs can be
113 %converted into a deep supervised feedforward neural network and fine-tuned by
114 %stochastic gradient descent.
115
116 %
117 In this paper we ask the following questions:
118
119 %\begin{enumerate}
120 $\bullet$ %\item
121 Do the good results previously obtained with deep architectures on the
122 MNIST digit images generalize to the setting of a much larger and richer (but similar)
123 dataset, the NIST special database 19, with 62 classes and around 800k examples?
124
125 $\bullet$ %\item
126 To what extent does the perturbation of input images (e.g. adding
127 noise, affine transformations, background images) make the resulting
128 classifiers better not only on similarly perturbed images but also on
129 the {\em original clean examples}? We study this question in the
130 context of the 62-class and 10-class tasks of the NIST special database 19.
131
132 $\bullet$ %\item
133 Do deep architectures {\em benefit {\bf more} from such out-of-distribution}
134 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
135 We use highly perturbed examples to generate out-of-distribution examples.
136
137 $\bullet$ %\item
138 Similarly, does the feature learning step in deep learning algorithms benefit {\bf more}
139 from training with moderately {\em different classes} (i.e. a multi-task learning scenario) than
140 a corresponding shallow and purely supervised architecture?
141 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case)
142 to answer this question.
143 %\end{enumerate}
144
145 Our experimental results provide positive evidence towards all of these questions,
146 as well as classifiers that reach human-level performance on 62-class isolated character
147 recognition and beat previously published results on the NIST dataset (special database 19).
148 To achieve these results, we introduce in the next section a sophisticated system
149 for stochastically transforming character images and then explain the methodology,
150 which is based on training with or without these transformed images and testing on
151 clean ones. We measure the relative advantage of out-of-distribution examples
152 (perturbed or out-of-class)
153 for a deep learner vs a supervised shallow one.
154 Code for generating these transformations as well as for the deep learning
algorithms is made available at {\tt http://hg.assembla.com/ift6266}.
156 We estimate the relative advantage for deep learners of training with
157 other classes than those of interest, by comparing learners trained with
158 62 classes with learners trained with only a subset (on which they
159 are then tested).
160 The conclusion discusses
161 the more general question of why deep learners may benefit so much from
162 the self-taught learning framework. Since out-of-distribution data
163 (perturbed or from other related classes) is very common, this conclusion
164 is of practical importance.
165
166 %\vspace*{-3mm}
167 %\newpage
168 \section{Perturbed and Transformed Character Images}
169 \label{s:perturbations}
170 %\vspace*{-2mm}
171
172 \begin{wrapfigure}[8]{l}{0.15\textwidth}
173 %\begin{minipage}[b]{0.14\linewidth}
174 %\vspace*{-5mm}
175 \begin{center}
176 \includegraphics[scale=.4]{images/Original.png}\\
177 {\bf Original}
178 \end{center}
179 \end{wrapfigure}
180 %%\vspace{0.7cm}
181 %\end{minipage}%
182 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
183 This section describes the different transformations we used to stochastically
184 transform $32 \times 32$ source images (such as the one on the left)
in order to obtain data from a distribution that
covers a domain substantially larger than that of the clean characters from
which we start.
188 Although character transformations have been used before to
189 improve character recognizers, this effort is on a large scale both
190 in number of classes and in the complexity of the transformations, hence
191 in the complexity of the learning task.
192 The code for these transformations (mostly python) is available at
193 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
194 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
195 amount of deformation or noise introduced.
196 There are two main parts in the pipeline. The first one,
197 from slant to pinch below, performs transformations. The second
198 part, from blur to contrast, adds different kinds of noise.
199 %\end{minipage}
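As a minimal illustration of how the shared $complexity$ knob drives the whole pipeline, the
following Python sketch (not the repository code itself) simply chains a list of module
functions, each of which receives the image and the same $complexity$ value:
\begin{verbatim}
# Hypothetical driver: each module is a callable taking (image, complexity, rng)
# and returning a new image; the same complexity value modulates every module.
import numpy as np

def transform_pipeline(image, complexity, modules, rng=np.random):
    out = image
    for module in modules:           # slant ... pinch, then blur ... contrast
        out = module(out, complexity, rng)
    return out
\end{verbatim}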
200
201 %\vspace*{1mm}
202 \subsection{Transformations}
203 %{\large\bf 2.1 Transformations}
204 %\vspace*{1mm}
205
206 \subsubsection*{Thickness}
207
208 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
209 \begin{minipage}[b]{0.14\linewidth}
210 %\centering
211 \begin{center}
212 \vspace*{-5mm}
213 \includegraphics[scale=.4]{images/Thick_only.png}\\
214 %{\bf Thickness}
215 \end{center}
216 \vspace{.6cm}
217 \end{minipage}%
218 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
219 %\end{wrapfigure}
220 To change character {\bf thickness}, morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
221 are applied. The neighborhood of each pixel is multiplied
222 element-wise with a {\em structuring element} matrix.
223 The pixel value is replaced by the maximum or the minimum of the resulting
matrix, respectively for dilation or erosion. Ten structuring elements of
increasing size (the largest being $5\times5$) were used. For each image, we
randomly sample the operator type (dilation or erosion, with equal probability) and one structuring
element from the subset of the $n=round(m \times complexity)$ smallest structuring elements,
where $m=10$ for dilation and $m=6$ for erosion (to avoid completely erasing thin characters).
229 A neutral element (no transformation)
230 is always present in the set.
231 %%\vspace{.4cm}
232 \end{minipage}
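A minimal NumPy/SciPy sketch of this module is given below; representing the ten structuring
elements as flat square elements of growing size is an assumption made here for brevity (the
repository code may use differently shaped elements):
\begin{verbatim}
# Sketch of the thickness module: random dilation or erosion with a
# structuring element whose maximal size grows with `complexity`.
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def change_thickness(image, complexity, rng=np.random):
    sizes = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # assumed sides of the 10 elements
    dilate = rng.rand() < 0.5                # operator type, equal probability
    m = 10 if dilate else 6                  # erosion uses only the 6 smallest
    n = int(round(m * complexity))
    if n == 0:
        return image                         # neutral element: no change
    k = sizes[rng.randint(0, n)]             # one of the n smallest elements
    op = grey_dilation if dilate else grey_erosion
    return op(image, size=(k, k))
\end{verbatim}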
233
234 \vspace{2mm}
235
236 \subsubsection*{Slant}
237 \vspace*{2mm}
238
239 \begin{minipage}[b]{0.14\linewidth}
240 \centering
241 \includegraphics[scale=.4]{images/Slant_only.png}\\
242 %{\bf Slant}
243 \end{minipage}%
244 \hspace{0.3cm}
245 \begin{minipage}[b]{0.83\linewidth}
246 %\centering
To produce {\bf slant}, each row of the image is shifted horizontally
by an amount proportional to its height (vertical position): $shift = round(slant \times height)$,
with $slant \sim U[-complexity,complexity]$.
The shift is randomly chosen to be either to the left or to the right.
251 \vspace{5mm}
252 \end{minipage}
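A NumPy sketch of the slant module (rows are shifted by whole pixels and pixels shifted out of
the frame are dropped, an assumption made here):
\begin{verbatim}
import numpy as np

def slant(image, complexity, rng=np.random):
    h, w = image.shape
    s = rng.uniform(-complexity, complexity)   # slant ~ U[-complexity, complexity]
    out = np.zeros_like(image)
    for y in range(h):
        shift = int(round(s * y))              # proportional to the row's height
        if shift >= 0:
            out[y, shift:] = image[y, :w - shift]
        else:
            out[y, :w + shift] = image[y, -shift:]
    return out
\end{verbatim}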
253 %\vspace*{-4mm}
254
255 %\newpage
256
257 \subsubsection*{Affine Transformations}
258
259 \begin{minipage}[b]{0.14\linewidth}
260 %\centering
261 %\begin{wrapfigure}[8]{l}{0.15\textwidth}
262 \begin{center}
263 \includegraphics[scale=.4]{images/Affine_only.png}
264 \vspace*{6mm}
265 %{\small {\bf Affine \mbox{Transformation}}}
266 \end{center}
267 %\end{wrapfigure}
268 \end{minipage}%
269 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
270 \noindent A $2 \times 3$ {\bf affine transform} matrix (with
271 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$.
272 Output pixel $(x,y)$ takes the value of input pixel
273 nearest to $(ax+by+c,dx+ey+f)$,
274 producing scaling, translation, rotation and shearing.
275 Marginal distributions of $(a,b,c,d,e,f)$ have been tuned to
276 forbid large rotations (to avoid confusing classes) but to give good
variability of the transformation: $a$ and $d$ $\sim U[1-3\,complexity,\,1+3\,complexity]$,
$b$ and $e$ $\sim U[-3\,complexity,\,3\,complexity]$, and
$c$ and $f$ $\sim U[-4\,complexity,\,4\,complexity]$.\\
281 \end{minipage}
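The affine module can be sketched as follows, using the nearest-neighbour lookup described
above and assuming that out-of-bounds source positions map to background:
\begin{verbatim}
import numpy as np

def affine(image, complexity, rng=np.random):
    h, w = image.shape
    a, d = rng.uniform(1 - 3*complexity, 1 + 3*complexity, 2)
    b, e = rng.uniform(-3*complexity, 3*complexity, 2)
    c, f = rng.uniform(-4*complexity, 4*complexity, 2)
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx = int(round(a*x + b*y + c))   # input pixel nearest to (ax+by+c, ...)
            sy = int(round(d*x + e*y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
\end{verbatim}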
282
283 %\vspace*{-4.5mm}
284 \subsubsection*{Local Elastic Deformations}
285
286 %\begin{minipage}[t]{\linewidth}
287 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
288 %\hspace*{-8mm}
289 \begin{minipage}[b]{0.14\linewidth}
290 %\centering
291 \begin{center}
292 \vspace*{5mm}
293 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}
294 %{\bf Local Elastic Deformation}
295 \end{center}
296 %\end{wrapfigure}
297 \end{minipage}%
298 \hspace{3mm}
299 \begin{minipage}[b]{0.85\linewidth}
300 %%\vspace*{-20mm}
301 The {\bf local elastic deformation}
302 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
303 which provides more details.
The displacement fields have intensity
$\alpha = \sqrt[3]{complexity} \times 10.0$ and are
convolved with a 2D Gaussian kernel (resulting in a blur) of
standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$.
308 \vspace{2mm}
309 \end{minipage}
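A sketch following this recipe, with random displacement fields smoothed by the Gaussian kernel
and scaled by $\alpha$ (bilinear interpolation via {\tt map\_coordinates} is an implementation
choice made here):
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, complexity, rng=np.random):
    alpha = 10.0 * complexity ** (1.0 / 3)
    sigma = 10 - 7 * complexity ** (1.0 / 3)
    h, w = image.shape
    dx = alpha * gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    dy = alpha * gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.array([(yy + dy).ravel(), (xx + dx).ravel()])
    out = map_coordinates(image, coords, order=1, mode='constant')
    return out.reshape(h, w)
\end{verbatim}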
310
311 \vspace*{4mm}
312
313 \subsubsection*{Pinch}
314
315 \begin{minipage}[b]{0.14\linewidth}
316 %\centering
317 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
318 %\vspace*{-5mm}
319 \begin{center}
320 \includegraphics[scale=.4]{images/Pinch_only.png}\\
321 \vspace*{15mm}
322 %{\bf Pinch}
323 \end{center}
324 %\end{wrapfigure}
325 %%\vspace{.6cm}
326 \end{minipage}%
327 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
328 The {\bf pinch} module applies the ``Whirl and pinch'' GIMP filter with whirl set to 0.
329 A pinch is ``similar to projecting the image onto an elastic
330 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
331 For a square input image, draw a radius-$r$ disk
332 around its center $C$. Any pixel $P$ belonging to
333 that disk has its value replaced by
334 the value of a ``source'' pixel in the original image,
335 on the line that goes through $C$ and $P$, but
336 at some other distance $d_2$. Define $d_1=distance(P,C)$
and $d_2 = \sin\!\left(\frac{\pi d_1}{2r}\right)^{-pinch} \times
d_1$, where $pinch$ is a parameter of the filter.
339 The actual value is given by bilinear interpolation considering the pixels
340 around the (non-integer) source position thus found.
341 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
342 %%\vspace{1.5cm}
343 \end{minipage}
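In code, the pinch mapping can be sketched as below; nearest-neighbour sampling replaces the
bilinear interpolation, for brevity:
\begin{verbatim}
import numpy as np

def pinch(image, complexity, rng=np.random):
    h, w = image.shape
    r = min(h, w) / 2.0                      # radius of the affected disk
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0    # centre C
    p = rng.uniform(-complexity, 0.7 * complexity)
    out = image.copy()
    for y in range(h):
        for x in range(w):
            d1 = np.hypot(y - cy, x - cx)
            if 0 < d1 < r:
                d2 = np.sin(np.pi * d1 / (2 * r)) ** (-p) * d1
                sy = int(round(cy + (y - cy) * d2 / d1))
                sx = int(round(cx + (x - cx) * d2 / d1))
                if 0 <= sy < h and 0 <= sx < w:
                    out[y, x] = image[sy, sx]
    return out
\end{verbatim}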
344
345 %\vspace{1mm}
346
347 %{\large\bf 2.2 Injecting Noise}
348 \subsection{Injecting Noise}
349 %\vspace{2mm}
350
351 \subsubsection*{Motion Blur}
352
353 %%\vspace*{-.2cm}
354 \begin{minipage}[t]{0.14\linewidth}
355 \centering
356 \vspace*{0mm}
357 \includegraphics[scale=.4]{images/Motionblur_only.png}
358 %{\bf Motion Blur}
359 \end{minipage}%
360 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
361 %%\vspace*{.5mm}
362 \vspace*{2mm}
363 The {\bf motion blur} module is GIMP's ``linear motion blur'', which
364 has parameters $length$ and $angle$. The value of
365 a pixel in the final image is approximately the mean of the first $length$ pixels
366 found by moving in the $angle$ direction,
367 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
368 %\vspace{5mm}
369 \end{minipage}
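A rough NumPy approximation of this filter (averaging the first $length$ pixels along the
sampled direction; the exact GIMP kernel may differ):
\begin{verbatim}
import numpy as np

def motion_blur(image, complexity, rng=np.random):
    angle = np.radians(rng.uniform(0, 360))
    length = int(round(abs(rng.normal(0, 3 * complexity))))
    if length <= 1:
        return image
    h, w = image.shape
    dy, dx = np.sin(angle), np.cos(angle)
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    acc = np.zeros_like(image, dtype=float)
    for k in range(length):                  # mean of the first `length` pixels
        sy = np.clip(np.round(rows + k * dy), 0, h - 1).astype(int)
        sx = np.clip(np.round(cols + k * dx), 0, w - 1).astype(int)
        acc += image[sy, sx]
    return acc / length
\end{verbatim}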
370
371 %\vspace*{1mm}
372
373 \subsubsection*{Occlusion}
374
375 \begin{minipage}[t]{0.14\linewidth}
376 \centering
377 \vspace*{3mm}
378 \includegraphics[scale=.4]{images/occlusion_only.png}\\
379 %{\bf Occlusion}
380 %%\vspace{.5cm}
381 \end{minipage}%
382 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
383 %\vspace*{-18mm}
384 The {\bf occlusion} module selects a random rectangle from an {\em occluder} character
385 image and places it over the original {\em occluded}
386 image. Pixels are combined by taking the max(occluder, occluded),
387 i.e. keeping the lighter ones.
388 The rectangle corners
389 are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution.
392 This module is skipped with probability 60\%.
393 %%\vspace{7mm}
394 \end{minipage}
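A sketch with explicit (illustrative) choices for the sampling laws, since the exact
distributions of the rectangle size and destination are not fully specified above:
\begin{verbatim}
import numpy as np

def occlude(occluded, occluder, complexity, rng=np.random):
    if rng.rand() < 0.6:                     # module skipped 60% of the time
        return occluded
    h, w = occluded.shape
    rh = max(1, int(rng.uniform(0, complexity) * h))   # grows with complexity
    rw = max(1, int(rng.uniform(0, complexity) * w))
    sy = rng.randint(0, h - rh + 1)
    sx = rng.randint(0, w - rw + 1)
    patch = occluder[sy:sy + rh, sx:sx + rw]
    dy = int(np.clip(rng.normal((h - rh) / 2.0, h / 4.0), 0, h - rh))
    dx = int(np.clip(rng.normal((w - rw) / 2.0, w / 4.0), 0, w - rw))
    out = occluded.copy()
    region = out[dy:dy + rh, dx:dx + rw]
    out[dy:dy + rh, dx:dx + rw] = np.maximum(region, patch)  # keep lighter pixels
    return out
\end{verbatim}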
395
396 %\vspace*{1mm}
397 \subsubsection*{Gaussian Smoothing}
398
399 %\begin{wrapfigure}[8]{l}{0.15\textwidth}
400 %\vspace*{-6mm}
401 \begin{minipage}[t]{0.14\linewidth}
402 \begin{center}
403 %\centering
404 \vspace*{6mm}
405 \includegraphics[scale=.4]{images/Bruitgauss_only.png}
406 %{\bf Gaussian Smoothing}
407 \end{center}
408 %\end{wrapfigure}
409 %%\vspace{.5cm}
410 \end{minipage}%
411 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
412 With the {\bf Gaussian smoothing} module,
413 different regions of the image are spatially smoothed.
414 This is achieved by first convolving
415 the image with an isotropic Gaussian kernel of
416 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
417 complexity]$ and $[2,2 + 6 \times complexity]$. This filtered image is normalized
418 between $0$ and $1$. We also create an isotropic weighted averaging window, of the
kernel size, with maximum value at the center. For each image we sample
uniformly between $3$ and $3 + 10 \times complexity$ pixels that will serve as
averaging centers between the original image and the filtered one. We
422 initialize to zero a mask matrix of the image size. For each selected pixel
423 we add to the mask the averaging window centered on it. The final image is
424 computed from the following element-wise operation: $\frac{image + filtered\_image
425 \times mask}{mask+1}$.
426 This module is skipped with probability 75\%.
427 \end{minipage}
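A sketch of this module; the cone-shaped averaging window used below (linear decay from the
center) is an assumption, as the text above only requires an isotropic window with its maximum
at the center:
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def local_smoothing(image, complexity, rng=np.random):
    if rng.rand() < 0.75:                    # module skipped 75% of the time
        return image
    h, w = image.shape
    ksize = int(rng.uniform(12, 12 + 20 * complexity))
    var = rng.uniform(2, 2 + 6 * complexity)
    filt = gaussian_filter(image, np.sqrt(var))   # gaussian_filter takes a std. dev.
    filt = (filt - filt.min()) / (filt.max() - filt.min() + 1e-8)
    c = (ksize - 1) / 2.0                    # window with maximum at its centre
    yy, xx = np.meshgrid(np.arange(ksize), np.arange(ksize), indexing='ij')
    window = np.maximum(0.0, 1.0 - np.hypot(yy - c, xx - c) / (c + 1e-8))
    mask = np.zeros_like(image, dtype=float)
    n_centres = rng.randint(3, 4 + int(10 * complexity))
    for _ in range(n_centres):
        cy, cx = rng.randint(0, h), rng.randint(0, w)
        top, left = cy - ksize // 2, cx - ksize // 2
        y0, y1 = max(0, top), min(h, top + ksize)
        x0, x1 = max(0, left), min(w, left + ksize)
        mask[y0:y1, x0:x1] += window[y0 - top:y1 - top, x0 - left:x1 - left]
    return (image + filt * mask) / (mask + 1.0)
\end{verbatim}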
428
429 %\newpage
430
431 %\vspace*{-9mm}
432 \subsubsection*{Permute Pixels}
433
434 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
435 %\centering
436 \begin{minipage}[t]{0.14\textwidth}
437 %\begin{wrapfigure}[7]{l}{
438 %\vspace*{-5mm}
439 \begin{center}
440 \vspace*{1mm}
441 \includegraphics[scale=.4]{images/Permutpixel_only.png}
442 %{\small\bf Permute Pixels}
443 \end{center}
444 %\end{wrapfigure}
445 \end{minipage}%
446 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth}
447 \vspace*{1mm}
448 %%\vspace*{-20mm}
449 This module {\bf permutes neighbouring pixels}. It first selects a
450 fraction $\frac{complexity}{3}$ of pixels randomly in the image. Each
451 of these pixels is then sequentially exchanged with a random pixel
452 among its four nearest neighbors (on its left, right, top or bottom).
453 This module is skipped with probability 80\%.\\
454 %\vspace*{1mm}
455 \end{minipage}
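A sketch of the pixel permutation module (wrapping at the image border is an assumption made
here):
\begin{verbatim}
import numpy as np

def permute_pixels(image, complexity, rng=np.random):
    if rng.rand() < 0.8:                     # module skipped 80% of the time
        return image
    out = image.copy()
    h, w = out.shape
    n = int(round((complexity / 3.0) * h * w))   # fraction complexity/3 of pixels
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)] # four nearest neighbours
    for _ in range(n):
        y, x = rng.randint(0, h), rng.randint(0, w)
        dy, dx = offsets[rng.randint(0, 4)]
        y2, x2 = (y + dy) % h, (x + dx) % w      # wrap at the border (assumption)
        out[y, x], out[y2, x2] = out[y2, x2], out[y, x]
    return out
\end{verbatim}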
456
457 %\vspace{-3mm}
458
459 \subsubsection*{Gaussian Noise}
460
461 \begin{minipage}[t]{0.14\textwidth}
462 %\begin{wrapfigure}[7]{l}{
463 %%\vspace*{-3mm}
464 \begin{center}
465 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
466 %\centering
467 \vspace*{0mm}
468 \includegraphics[scale=.4]{images/Distorsiongauss_only.png}
469 %{\small \bf Gauss. Noise}
470 \end{center}
471 %\end{wrapfigure}
472 \end{minipage}%
473 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
474 \vspace*{1mm}
475 %\vspace*{12mm}
476 The {\bf Gaussian noise} module simply adds, to each pixel of the image independently, a
477 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
478 This module is skipped with probability 70\%.
479 %%\vspace{1.1cm}
480 \end{minipage}
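This module amounts to the following few lines (clipping the result back to $[0,1]$ is an
assumption made here):
\begin{verbatim}
import numpy as np

def add_gaussian_noise(image, complexity, rng=np.random):
    if rng.rand() < 0.7:                     # module skipped 70% of the time
        return image
    noisy = image + rng.normal(0.0, complexity / 10.0, image.shape)
    return np.clip(noisy, 0.0, 1.0)          # keep grey levels in [0, 1]
\end{verbatim}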
481
482 %\vspace*{1.2cm}
483
484 \subsubsection*{Background Image Addition}
485
486 \begin{minipage}[t]{\linewidth}
487 \begin{minipage}[t]{0.14\linewidth}
488 \centering
489 \vspace*{0mm}
490 \includegraphics[scale=.4]{images/background_other_only.png}
491 %{\small \bf Bg Image}
492 \end{minipage}%
493 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
494 \vspace*{1mm}
495 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random
496 background image behind the letter, from a randomly chosen natural image,
497 with contrast adjustments depending on $complexity$, to preserve
498 more or less of the original character image.
499 %%\vspace{.8cm}
500 \end{minipage}
501 \end{minipage}
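A sketch assuming white-ink characters on a dark background (so that a pixel-wise maximum
keeps the character in front) and a pool of pre-loaded natural images at least as large as the
character image; the contrast-adjustment rule below is illustrative only:
\begin{verbatim}
import numpy as np

def add_background(char_img, natural_images, complexity, rng=np.random):
    bg = natural_images[rng.randint(0, len(natural_images))]
    h, w = char_img.shape
    y = rng.randint(0, bg.shape[0] - h + 1)  # random crop of the natural image
    x = rng.randint(0, bg.shape[1] - w + 1)
    patch = bg[y:y + h, x:x + w].astype(float)
    patch = (patch - patch.min()) / (patch.max() - patch.min() + 1e-8)
    patch *= complexity                      # stronger background at high complexity
    return np.maximum(char_img, patch)       # character stays in front
\end{verbatim}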
502 %%\vspace{-.7cm}
503
504 \subsubsection*{Salt and Pepper Noise}
505
506 \begin{minipage}[t]{0.14\linewidth}
507 \centering
508 \vspace*{0mm}
509 \includegraphics[scale=.4]{images/Poivresel_only.png}
510 %{\small \bf Salt \& Pepper}
511 \end{minipage}%
512 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
513 \vspace*{1mm}
The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels.
The fraction of affected pixels is $0.2 \times complexity$.
516 This module is skipped with probability 75\%.
517 %%\vspace{.9cm}
518 \end{minipage}
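A sketch of this module, treating $0.2 \times complexity$ as the fraction of affected pixels:
\begin{verbatim}
import numpy as np

def salt_and_pepper(image, complexity, rng=np.random):
    if rng.rand() < 0.75:                    # module skipped 75% of the time
        return image
    out = image.copy()
    h, w = out.shape
    n = int(round(0.2 * complexity * h * w)) # affected pixels
    ys = rng.randint(0, h, n)
    xs = rng.randint(0, w, n)
    out[ys, xs] = rng.uniform(0.0, 1.0, n)   # replace with U[0,1] noise
    return out
\end{verbatim}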
519 %%\vspace{-.7cm}
520
521 %\vspace{1mm}
522 \subsubsection*{Scratches}
523
524 \begin{minipage}[t]{0.14\textwidth}
525 %\begin{wrapfigure}[7]{l}{
526 %\begin{minipage}[t]{0.14\linewidth}
527 %\centering
528 \begin{center}
529 \vspace*{4mm}
530 %\hspace*{-1mm}
531 \includegraphics[scale=.4]{images/Rature_only.png}\\
532 %{\bf Scratches}
533 \end{center}
534 \end{minipage}%
535 %\end{wrapfigure}
536 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
537 %%\vspace{.4cm}
538 The {\bf scratches} module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among 500 such images,
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
complexity)^2)$ (in degrees), using bi-cubic interpolation.
543 Two passes of a grey-scale morphological erosion filter
544 are applied, reducing the width of the line
545 by an amount controlled by $complexity$.
546 This module is skipped with probability 85\%. The probabilities
547 of applying 1, 2, or 3 patches are (50\%,30\%,20\%).
548 \end{minipage}
549
550 %\vspace*{1mm}
551
552 \subsubsection*{Grey Level and Contrast Changes}
553
554 \begin{minipage}[t]{0.15\linewidth}
555 \centering
556 \vspace*{0mm}
557 \includegraphics[scale=.4]{images/Contrast_only.png}
558 %{\bf Grey Level \& Contrast}
559 \end{minipage}%
560 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth}
561 \vspace*{1mm}
562 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white
563 to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$
564 so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
565 polarity is inverted with probability 50\%.
566 %%\vspace{.7cm}
567 \end{minipage}
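This module reduces to a few lines:
\begin{verbatim}
import numpy as np

def grey_level_and_contrast(image, complexity, rng=np.random):
    C = rng.uniform(1 - 0.85 * complexity, 1.0)
    out = image * C + (1.0 - C) / 2.0        # map [0,1] into [(1-C)/2, 1-(1-C)/2]
    if rng.rand() < 0.5:
        out = 1.0 - out                      # invert polarity
    return out
\end{verbatim}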
568 %\vspace{2mm}
569
570
571 \iffalse
572 \begin{figure}[ht]
573 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{example_t.png}}}\\
574 \caption{Illustration of the pipeline of stochastic
575 transformations applied to the image of a lower-case \emph{t}
576 (the upper left image). Each image in the pipeline (going from
577 left to right, first top line, then bottom line) shows the result
578 of applying one of the modules in the pipeline. The last image
579 (bottom right) is used as training example.}
580 \label{fig:pipeline}
581 \end{figure}
582 \fi
583
584 %\vspace*{-3mm}
585 \section{Experimental Setup}
586 %\vspace*{-1mm}
587
Much previous work on deep learning has been performed on
the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
The focus here is on much larger training sets, from 10 times to
1000 times larger, and on 62 classes.
594
595 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
596 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
597 and {\bf OCR data} (scanned machine printed characters). Once a character
598 is sampled from one of these sources (chosen randomly), the second step is to
599 apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.
600
601 To provide a baseline of error rate comparison we also estimate human performance
602 on both the 62-class task and the 10-class digits task.
603 We compare the best Multi-Layer Perceptrons (MLP) against
604 the best Stacked Denoising Auto-encoders (SDA), when
605 both models' hyper-parameters are selected to minimize the validation set error.
The human performance estimate was obtained via Amazon's Mechanical Turk (AMT)
service ({\tt http://mturk.com}).
609 AMT users are paid small amounts
610 of money to perform tasks for which human intelligence is required.
611 Mechanical Turk has been used extensively in natural language processing and vision.
612 %processing \citep{SnowEtAl2008} and vision
613 %\citep{SorokinAndForsyth2008,whitehill09}.
614 AMT users were presented
615 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
616 characters. They were forced to choose a single character class (either among the
617 62 or 10 character classes) for each image.
618 80 subjects classified 2500 images per (dataset,task) pair.
Different human labelers sometimes provided different labels for the same
example; since each image was classified by 3 different persons,
we were able to estimate the error variance due to this effect.
622 The average error of humans on the 62-class task NIST test set
623 is 18.2\%, with a standard error of 0.1\%.
624
625 %\vspace*{-3mm}
626 \subsection{Data Sources}
627 %\vspace*{-2mm}
628
629 %\begin{itemize}
630 %\item
631 {\bf NIST.}
632 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
633 widely used for training and testing character
634 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
635 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications,
636 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
638 The fourth partition (called $hsf_4$, 82587 examples),
639 experimentally recognized to be the most difficult one, is the one recommended
640 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
641 for that purpose. We randomly split the remainder (731668 examples) into a training set and a validation set for
642 model selection.
Most of the results reported by previous work on that dataset concern only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
useful for estimating the effect of a multi-task setting.
646 The distribution of the classes in the NIST training and test sets differs
647 substantially, with relatively many more digits in the test set, and a more uniform distribution
648 of letters in the test set (whereas in the training set they are distributed
649 more like in natural text).
650 %\vspace*{-1mm}
651
652 %\item
653 {\bf Fonts.}
In order to have a good variety of sources we downloaded a large number of free fonts from:
655 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
656 % TODO: pointless to anonymize, it's not pointing to our work
Including the operating system's (Windows 7) fonts, there are a total of $9817$ different fonts to choose uniformly from.
658 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
659 directly as input to our models.
660 %\vspace*{-1mm}
661
662 %\item
663 {\bf Captchas.}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
665 generating characters of the same format as the NIST dataset. This software is based on
666 a random character class generator and various kinds of transformations similar to those described in the previous sections.
667 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
668 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
669 depending on the value of the complexity parameter provided by the user of the data source.
670 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
671 %\vspace*{-1mm}
672
673 %\item
674 {\bf OCR data.}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters was included as an
677 additional source. This set is part of a larger corpus being collected by the Image Understanding
678 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
679 ({\tt http://www.iupr.com}), and which will be publicly released.
680 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
681 %\end{itemize}
682
683 %\vspace*{-3mm}
684 \subsection{Data Sets}
685 %\vspace*{-2mm}
686
687 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
688 from one of the 62 character classes.
689 %\begin{itemize}
690 %\vspace*{-1mm}
691
692 %\item
693 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
694 \{651668 / 80000 / 82587\} \{training / validation / test\} examples.
695 %\vspace*{-1mm}
696
697 %\item
698 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
699 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
700 For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
701 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
702 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
703 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
704 %\vspace*{-1mm}
705
706 %\item
707 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
708 except that we only apply
709 transformations from slant to pinch. Therefore, the character is
710 transformed but no additional noise is added to the image, giving images
711 closer to the NIST dataset.
712 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
713 %\end{itemize}
714
715 %\vspace*{-3mm}
716 \subsection{Models and their Hyperparameters}
717 %\vspace*{-2mm}
718
719 The experiments are performed using MLPs (with a single
720 hidden layer) and SDAs.
721 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
722
723 {\bf Multi-Layer Perceptrons (MLP).}
724 Whereas previous work had compared deep architectures to both shallow MLPs and
725 SVMs, we only compared to MLPs here because of the very large datasets used
726 (making the use of SVMs computationally challenging because of their quadratic
scaling behavior). Preliminary experiments training SVMs (libSVM) on subsets of the training
set small enough to fit in memory yielded substantially worse results
729 than those obtained with MLPs. For training on nearly a billion examples
(with the perturbed data), the MLPs and SDAs are much more convenient than
731 classifiers based on kernel methods.
732 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
733 exponentials) on the output layer for estimating $P(class | image)$.
734 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
735 Training examples are presented in minibatches of size 20. A constant learning
736 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
737 %through preliminary experiments (measuring performance on a validation set),
738 %and $0.1$ (which was found to work best) was then selected for optimizing on
739 %the whole training sets.
740 %\vspace*{-1mm}
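For concreteness, a minimal NumPy sketch of such a shallow model (one $\tanh$ hidden layer and
a softmax output trained by minibatch stochastic gradient descent) is shown below; the actual
experiments used GPU code, and the sizes are simply the ones quoted above:
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

class ShallowMLP:
    """One tanh hidden layer + softmax output, trained by minibatch SGD."""
    def __init__(self, n_in=32 * 32, n_hid=1000, n_out=62, lr=0.1):
        self.W1 = rng.uniform(-0.05, 0.05, (n_in, n_hid))
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.uniform(-0.05, 0.05, (n_hid, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        A = H @ self.W2 + self.b2
        A -= A.max(axis=1, keepdims=True)               # numerical stability
        P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # P(class | image)
        return H, P

    def train_batch(self, X, y):                        # X: (20, n_in), y: int labels
        H, P = self.forward(X)
        G = P.copy()
        G[np.arange(len(y)), y] -= 1.0                  # grad of NLL wrt softmax input
        G /= len(y)
        dW2, db2 = H.T @ G, G.sum(axis=0)
        dH = (G @ self.W2.T) * (1.0 - H ** 2)           # tanh derivative
        dW1, db1 = X.T @ dH, dH.sum(axis=0)
        for p, g in ((self.W1, dW1), (self.b1, db1), (self.W2, dW2), (self.b2, db2)):
            p -= self.lr * g
\end{verbatim}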
741
742
743 {\bf Stacked Denoising Auto-Encoders (SDA).}
744 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
745 can be used to initialize the weights of each layer of a deep MLP (with many hidden
746 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
747 apparently setting parameters in the
748 basin of attraction of supervised gradient descent yielding better
749 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
750 pre-training phase} uses all of the training images but not the training labels.
751 Each layer is trained in turn to produce a new representation of its input
752 (starting from the raw pixels).
753 It is hypothesized that the
754 advantage brought by this procedure stems from a better prior,
755 on the one hand taking advantage of the link between the input
756 distribution $P(x)$ and the conditional distribution of interest
757 $P(y|x)$ (like in semi-supervised learning), and on the other hand
758 taking advantage of the expressive power and bias implicit in the
759 deep architecture (whereby complex concepts are expressed as
760 compositions of simpler ones through a deep hierarchy).
761
762 \begin{figure}[ht]
763 %\vspace*{-2mm}
764 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
765 %\vspace*{-2mm}
766 \caption{Illustration of the computations and training criterion for the denoising
767 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
768 the layer (i.e. raw input or output of previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
770 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
771 is compared to the uncorrupted input $x$ through the loss function
772 $L_H(x,z)$, whose expected value is approximately minimized during training
773 by tuning $\theta$ and $\theta'$.}
774 \label{fig:da}
775 %\vspace*{-2mm}
776 \end{figure}
777
778 Here we chose to use the Denoising
779 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
780 these deep hierarchies of features, as it is simple to train and
781 explain (see Figure~\ref{fig:da}, as well as
782 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
783 provides efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
785 \citep{VincentPLarochelleH2008}. During training, a Denoising
786 Auto-encoder is presented with a stochastically corrupted version
787 of the input and trained to reconstruct the uncorrupted input,
788 forcing the hidden units to represent the leading regularities in
789 the data. Here we use the random binary masking corruption
790 (which sets to 0 a random subset of the inputs).
791 Once it is trained, in a purely unsupervised way,
792 its hidden units' activations can
793 be used as inputs for training a second one, etc.
794 After this unsupervised pre-training stage, the parameters
795 are used to initialize a deep MLP, which is fine-tuned by
the same standard procedure used to train MLPs (see previous section).
797 The SDA hyper-parameters are the same as for the MLP, with the addition of the
798 amount of corruption noise (we used the masking noise process, whereby a
799 fixed proportion of the input values, randomly selected, are zeroed), and a
800 separate learning rate for the unsupervised pre-training stage (selected
801 from the same above set). The fraction of inputs corrupted was selected
802 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
803 of hidden layers but it was fixed to 3 based on previous work with
804 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The size of the hidden
805 layers was kept constant across hidden layers, and the best results
806 were obtained with the largest values that we could experiment
807 with given our patience, with 1000 hidden units.
808
809 %\vspace*{-1mm}
810
811 \begin{figure}[ht]
812 %\vspace*{-2mm}
813 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
814 %\vspace*{-3mm}
815 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
816 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
817 of all models, on NIST and NISTP test sets.
818 Right: error rates on NIST test digits only, along with the previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
820 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
821 \label{fig:error-rates-charts}
822 %\vspace*{-2mm}
823 \end{figure}
824
825
826 \begin{figure}[ht]
827 %\vspace*{-3mm}
828 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
829 %\vspace*{-3mm}
830 \caption{Relative improvement in error rate due to self-taught learning.
831 Left: Improvement (or loss, when negative)
832 induced by out-of-distribution examples (perturbed data).
833 Right: Improvement (or loss, when negative) induced by multi-task
834 learning (training on all classes and testing only on either digits,
835 upper case, or lower-case). The deep learner (SDA) benefits more from
836 both self-taught learning scenarios, compared to the shallow MLP.}
837 \label{fig:improvements-charts}
838 %\vspace*{-2mm}
839 \end{figure}
840
841 \section{Experimental Results}
842 %\vspace*{-2mm}
843
844 %%\vspace*{-1mm}
845 %\subsection{SDA vs MLP vs Humans}
846 %%\vspace*{-1mm}
847 The models are either trained on NIST (MLP0 and SDA0),
848 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested
849 on either NIST, NISTP or P07, either on the 62-class task
or on the 10-digits task. Training on the larger
datasets takes around one day on a GPU-285 (with roughly half
of that time spent on unsupervised pre-training, for the SDAs).
853 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
854 comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
855 SDA2), along with the previous results on the digits NIST special database
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
860 (figures and tables, including standard errors on the error rates) can be
found in the Appendix.
862 The deep learner not only outperformed the shallow ones and
863 previously published performance (in a statistically and qualitatively
864 significant way) but when trained with perturbed data
865 reaches human performance on both the 62-class task
866 and the 10-class (digits) task.
867 17\% error (SDA1) or 18\% error (humans) may seem large but a large
868 majority of the errors from humans and from SDA1 are from out-of-context
869 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
``c'' and a ``C'' are often indistinguishable).
871
872 In addition, as shown in the left of
873 Figure~\ref{fig:improvements-charts}, the relative improvement in error
874 rate brought by self-taught learning is greater for the SDA, and these
875 differences with the MLP are statistically and qualitatively
876 significant.
877 The left side of the figure shows the improvement to the clean
878 NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
880 Relative percent change is measured by taking
881 $100 \% \times$ (original model's error / perturbed-data model's error - 1).
882 The right side of
883 Figure~\ref{fig:improvements-charts} shows the relative improvement
884 brought by the use of a multi-task setting, in which the same model is
885 trained for more classes than the target classes of interest (i.e. training
886 with all 62 classes when the target classes are respectively the digits,
887 lower-case, or upper-case characters). Again, whereas the gain from the
888 multi-task setting is marginal or negative for the MLP, it is substantial
889 for the SDA. Note that to simplify these multi-task experiments, only the original
NIST dataset is used. For example, the MLP-digits bar shows the relative
percent improvement in MLP error rate on the NIST digits test set,
computed as $100\% \times$ (single-task
model's error / multi-task model's error $-$ 1). The single-task model is
894 trained with only 10 outputs (one per digit), seeing only digit examples,
895 whereas the multi-task model is trained with 62 outputs, with all 62
896 character classes as examples. Hence the hidden units are shared across
897 all tasks. For the multi-task model, the digit error rate is measured by
898 comparing the correct digit class with the output class associated with the
899 maximum conditional probability among only the digit classes outputs. The
900 setting is similar for the other two target classes (lower case characters
901 and upper case characters).
902 %%\vspace*{-1mm}
903 %\subsection{Perturbed Training Data More Helpful for SDA}
904 %%\vspace*{-1mm}
905
906 %%\vspace*{-1mm}
907 %\subsection{Multi-Task Learning Effects}
908 %%\vspace*{-1mm}
909
910 \iffalse
911 As previously seen, the SDA is better able to benefit from the
912 transformations applied to the data than the MLP. In this experiment we
913 define three tasks: recognizing digits (knowing that the input is a digit),
914 recognizing upper case characters (knowing that the input is one), and
915 recognizing lower case characters (knowing that the input is one). We
916 consider the digit classification task as the target task and we want to
917 evaluate whether training with the other tasks can help or hurt, and
918 whether the effect is different for MLPs versus SDAs. The goal is to find
919 out if deep learning can benefit more (or less) from multiple related tasks
920 (i.e. the multi-task setting) compared to a corresponding purely supervised
921 shallow learner.
922
923 We use a single hidden layer MLP with 1000 hidden units, and a SDA
924 with 3 hidden layers (1000 hidden units per layer), pre-trained and
925 fine-tuned on NIST.
926
927 Our results show that the MLP benefits marginally from the multi-task setting
928 in the case of digits (5\% relative improvement) but is actually hurt in the case
929 of characters (respectively 3\% and 4\% worse for lower and upper class characters).
930 On the other hand the SDA benefited from the multi-task setting, with relative
931 error rate improvements of 27\%, 15\% and 13\% respectively for digits,
932 lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
933 \fi
934
935
936 %\vspace*{-2mm}
937 \section{Conclusions and Discussion}
938 %\vspace*{-2mm}
939
940 We have found that the self-taught learning framework is more beneficial
941 to a deep learner than to a traditional shallow and purely
942 supervised learner. More precisely,
943 the answers are positive for all the questions asked in the introduction.
944 %\begin{itemize}
945
946 $\bullet$ %\item
947 {\bf Do the good results previously obtained with deep architectures on the
948 MNIST digits generalize to a much larger and richer (but similar)
949 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
Yes, the SDA {\em systematically outperformed the MLP and all the previously
published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
954
955 $\bullet$ %\item
956 {\bf To what extent do self-taught learning scenarios help deep learners,
957 and do they help them more than shallow supervised ones}?
958 We found that distorted training examples not only made the resulting
959 classifier better on similarly perturbed images but also on
the {\em original clean examples}; more importantly, and a more novel finding,
deep architectures benefit more from such {\em out-of-distribution}
962 examples. MLPs were helped by perturbed training examples when tested on perturbed input
963 images (65\% relative improvement on NISTP)
964 but only marginally helped (5\% relative improvement on all classes)
965 or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
967 were significantly boosted by these out-of-distribution examples.
968 Similarly, whereas the improvement due to the multi-task setting was marginal or
969 negative for the MLP (from +5.6\% to -3.6\% relative change),
970 it was quite significant for the SDA (from +13\% to +27\% relative change),
971 which may be explained by the arguments below.
972 %\end{itemize}
973
974 In the original self-taught learning framework~\citep{RainaR2007}, the
975 out-of-sample examples were used as a source of unsupervised data, and
976 experiments showed its positive effects in a \emph{limited labeled data}
977 scenario. However, many of the results by \citet{RainaR2007} (who used a
978 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught
979 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases.
We note instead that, for deep
architectures, our experiments show that such a positive effect is obtained
even in a scenario with a \emph{large number of labeled examples},
i.e., the relative gain of self-taught learning appears to be preserved
in the asymptotic regime.
985
986 {\bf Why would deep learners benefit more from the self-taught learning framework}?
987 The key idea is that the lower layers of the predictor compute a hierarchy
988 of features that can be shared across tasks or across variants of the
989 input distribution. A theoretical analysis of generalization improvements
990 due to sharing of intermediate features across tasks already points
towards that explanation~\citep{baxter95a}.
992 Intermediate features that can be used in different
contexts can be estimated in a way that allows statistical
strength to be shared. Features extracted through many levels are more likely to
995 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
996 increasing the likelihood that they would be useful for a larger array
997 of tasks and input conditions.
998 Therefore, we hypothesize that both depth and unsupervised
999 pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease these factors apart.
1001 And why would deep learners benefit from the self-taught learning
1002 scenarios even when the number of labeled examples is very large?
1003 We hypothesize that this is related to the hypotheses studied
1004 in~\citet{Erhan+al-2010}. Whereas in~\citet{Erhan+al-2010}
1005 it was found that online learning on a huge dataset did not make the
1006 advantage of the deep learning bias vanish, a similar phenomenon
1007 may be happening here. We hypothesize that unsupervised pre-training
1008 of a deep hierarchy with self-taught learning initializes the
1009 model in the basin of attraction of supervised gradient descent
1010 that corresponds to better generalization. Furthermore, such good
1011 basins of attraction are not discovered by pure supervised learning
(with or without self-taught settings), and more labeled examples
do not allow the model to go from the poorer basins of attraction discovered
1014 by the purely supervised shallow models to the kind of better basins associated
1015 with deep learning and self-taught learning.
1016
1017 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
1018 can be executed on-line at {\tt http://deep.host22.com}.
1019
1020
1021 \section*{Appendix I: Detailed Numerical Results}
1022
These tables correspond to Figures~\ref{fig:error-rates-charts} and~\ref{fig:improvements-charts} and contain the raw error rates for each model and dataset considered.
1024 They also contain additional data such as test errors on P07 and standard errors.
1025
1026 \begin{table}[ht]
1027 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
1028 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training
1029 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture
1030 (MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07)
1031 and using a validation set to select hyper-parameters and other training choices.
1032 \{SDA,MLP\}0 are trained on NIST,
1033 \{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07.
1034 The human error rate on digits is a lower bound because it does not count digits that were
1035 recognized as letters. For comparison, the results found in the literature
1036 on NIST digits classification using the same test set are included.}
1037 \label{tab:sda-vs-mlp-vs-humans}
1038 \begin{center}
1039 \begin{tabular}{|l|r|r|r|r|} \hline
1040 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
1041 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $1.4\%$ \\ \hline
1042 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline
1043 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline
1044 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline
1045 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
1046 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline
1047 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline
1048 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline
1049 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline
1050 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
1051 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline
1052 \end{tabular}
1053 \end{center}
1054 \end{table}
1055
1056 \begin{table}[ht]
1057 \caption{Relative change in error rates due to the use of perturbed training data,
1058 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
1059 A positive value indicates that training on the perturbed data helped for the
1060 given test set (the first 3 columns on the 62-class tasks and the last one is
1061 on the clean 10-class digits). Clearly, the deep learning models did benefit more
1062 from perturbed training data, even when testing on clean data, whereas the MLP
1063 trained on perturbed data performed worse on the clean digits and about the same
1064 on the clean characters. }
1065 \label{tab:perturbation-effect}
1066 \begin{center}
1067 \begin{tabular}{|l|r|r|r|r|} \hline
1068 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
1069 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline
1070 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline
1071 MLP0/MLP1-1 & 5.2\% & 65\% & -13\% & -10\% \\ \hline
1072 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline
1073 \end{tabular}
1074 \end{center}
1075 \end{table}
1076
1077 \begin{table}[ht]
1078 \caption{Test error rates and relative change in error rates due to the use of
1079 a multi-task setting, i.e., training on each task in isolation vs training
1080 for all three tasks together, for MLPs vs SDAs. The SDA benefits much
more from the multi-task setting. All experiments are on the
unperturbed NIST data only, using validation error for model selection.
1083 Relative improvement is 1 - single-task error / multi-task error.}
1084 \label{tab:multi-task}
1085 \begin{center}
1086 \begin{tabular}{|l|r|r|r|} \hline
1087 & single-task & multi-task & relative \\
1088 & setting & setting & improvement \\ \hline
1089 MLP-digits & 3.77\% & 3.99\% & 5.6\% \\ \hline
1090 MLP-lower & 17.4\% & 16.8\% & -4.1\% \\ \hline
1091 MLP-upper & 7.84\% & 7.54\% & -3.6\% \\ \hline
1092 SDA-digits & 2.6\% & 3.56\% & 27\% \\ \hline
1093 SDA-lower & 12.3\% & 14.4\% & 15\% \\ \hline
1094 SDA-upper & 5.93\% & 6.78\% & 13\% \\ \hline
1095 \end{tabular}
1096 \end{center}
1097 \end{table}
1098
1099 %\afterpage{\clearpage}
1100 \clearpage
1101 {
1102 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,specials,aigaion-shorter}
1103 %\bibliographystyle{plainnat}
1104 \bibliographystyle{unsrtnat}
1105 %\bibliographystyle{apalike}
1106 }
1107
1108
1109 \end{document}