comparison writeup/nips2010_submission.tex @ 484:9a757d565e46

size reduction
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Mon, 31 May 2010 20:42:22 -0400
parents b9cdb464de5f
children 6beaf3328521
\begin{document}

%\makeanontitle
\maketitle

\vspace*{-2mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has
demonstrated the importance of learning algorithms for deep
architectures, i.e., function classes obtained by composing multiple
non-linear transformations. The self-taught learning (exploiting unlabeled
%...

images, color, contrast, occlusion, and various types of pixel and
spatially correlated noise. The out-of-distribution examples are
obtained by training with these highly distorted images or
by including object classes different from those in the target test set.
\end{abstract}
\vspace*{-2mm}

\section{Introduction}
\vspace*{-1mm}

Deep Learning has emerged as a promising new area of research in
statistical machine learning (see~\citet{Bengio-2009} for a review).
Learning algorithms for deep architectures are centered on the learning
of useful representations of data, which are better suited to the task at hand.
This is in great part inspired by observations of the mammalian visual cortex,
which consists of a chain of processing elements, each of which is associated with a
different representation of the raw visual input. In fact,
it was found recently that the features learnt in deep architectures resemble
those observed in the first two of these stages (in areas V1 and V2
of visual cortex)~\citep{HonglakL2008}, and that they become more and
more invariant to factors of variation (such as camera movement) in
higher layers~\cite{Goodfellow2009}.
Learning a hierarchy of features increases the
ease and practicality of developing representations that are at once
tailored to specific tasks, yet are able to borrow statistical strength
from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
feature representation can lead to higher-level (more abstract, more
general) features that are more robust to unanticipated sources of
%...

Machines in terms of unsupervised extraction of a hierarchy of features
useful for classification. The principle is that each layer starting from
the bottom is trained to encode its input (the output of the previous
layer) and to reconstruct it from a corrupted version of it. After this
unsupervised initialization, the stack of denoising auto-encoders can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.

Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
of semi-supervised and multi-task learning: the learner can exploit examples
that are unlabeled and/or come from a distribution different from the target
distribution, e.g., from classes other than those of interest. Whereas
it has already been shown that deep learners can clearly take advantage of
unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008}
and multi-task learning, not much has been done yet to explore the impact
of {\em out-of-distribution} examples and of the multi-task setting
(but see~\citep{CollobertR2008-short}). In particular the {\em relative
advantage} of deep learning for these settings has not been evaluated.

In this paper we ask the following questions:

%\begin{enumerate}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digit images generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifiers better not only on similarly perturbed images but also on
the {\em original clean examples}?

$\bullet$ %\item
Do deep architectures {\em benefit more from such out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
%\end{enumerate}

The experimental results presented here provide positive evidence towards all of these questions.

\vspace*{-1mm}
\section{Perturbation and Transformation of Character Images}
\vspace*{-1mm}

This section describes the different transformations we used to stochastically
transform source images in order to obtain data. More details can
be found in this technical report~\citep{ift6266-tr-anonymous}.
The code for these transformations (mostly python) is available at
%...

There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

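The sketch below illustrates how such a two-stage pipeline can be organized; the module names and calling convention are hypothetical (an illustration, not the project's actual code), while the per-module complexity sampling follows the description of the generated datasets given later.

\begin{verbatim}
import numpy as np

def apply_pipeline(image, deformation_modules, noise_modules, rng,
                   max_complexity=0.7):
    """Apply the deformation modules (slant ... pinch) first, then the
    noise modules (blur ... contrast).  Each module is assumed to be a
    function f(image, complexity, rng) -> image and receives its own
    complexity drawn uniformly in [0, max_complexity]."""
    for module in deformation_modules + noise_modules:
        image = module(image, rng.uniform(0.0, max_complexity), rng)
    return image
\end{verbatim}
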
{\large\bf Transformations}

\vspace*{2mm}

{\bf Slant.}
We mimic slant by shifting each row of the image horizontally,
proportionally to its height (i.e. its vertical position): $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
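A minimal sketch of this transformation (assuming a 2-D numpy array and, for brevity, wrap-around at the image border rather than padding with background):

\begin{verbatim}
import numpy as np

def slant_image(image, slant):
    """Shift row y horizontally by round(slant * y) pixels."""
    out = np.empty_like(image)
    for y in range(image.shape[0]):
        out[y] = np.roll(image[y], int(round(slant * y)))
    return out
\end{verbatim}
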
%...

at some other distance $d_2$. Define $d_1$ to be the distance between $P$
and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
d_1$, where $pinch$ is a parameter to the filter.
The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.
Here $pinch \sim U[-complexity, 0.7 \times complexity]$.

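A direct, unoptimized transcription of this formula (a sketch only, not the actual generator code), assuming a grey-level numpy array:

\begin{verbatim}
import numpy as np

def bilinear(image, y, x):
    """Bilinear interpolation at the (non-integer) position (y, x)."""
    y = min(max(y, 0.0), image.shape[0] - 1.0)
    x = min(max(x, 0.0), image.shape[1] - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, image.shape[0] - 1), min(x0 + 1, image.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * ((1 - wx) * image[y0, x0] + wx * image[y0, x1]) +
            wy * ((1 - wx) * image[y1, x0] + wx * image[y1, x1]))

def pinch_image(image, pinch, r):
    """For each pixel at distance d1 < r from the center C, sample the
    source at distance d2 = sin(pi*d1/(2*r))**(-pinch) * d1 along the
    same direction, using bilinear interpolation."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = image.copy()
    for y in range(h):
        for x in range(w):
            dy, dx = y - cy, x - cx
            d1 = np.hypot(dy, dx)
            if d1 == 0.0 or d1 >= r:
                continue
            d2 = np.sin(np.pi * d1 / (2 * r)) ** (-pinch) * d1
            out[y, x] = bilinear(image, cy + dy * d2 / d1, cx + dx * d2 / d1)
    return out
\end{verbatim}
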
\vspace*{1mm}

{\large\bf Injecting Noise}

\vspace*{1mm}

{\bf Motion Blur.}
This is a ``linear motion blur'' in GIMP
terminology, with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the first $length$ pixels
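A rough numpy equivalent of this filter (a simplified sketch: it wraps around at the image border, which the real filter presumably does not):

\begin{verbatim}
import numpy as np

def linear_motion_blur(image, length, angle_degrees):
    """Average, for each pixel, the `length` pixels reached by stepping
    from it along the direction given by `angle_degrees`."""
    theta = np.deg2rad(angle_degrees)
    dy, dx = np.sin(theta), np.cos(theta)
    copies = [np.roll(np.roll(image, -int(round(i * dy)), axis=0),
                      -int(round(i * dx)), axis=1)
              for i in range(length)]
    return np.mean(copies, axis=0)
\end{verbatim}
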
%...

color and contrast changes.}
\label{fig:transfo}
\end{figure}


\vspace*{-1mm}
\section{Experimental Setup}
\vspace*{-1mm}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
%...
to 1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source} (NIST characters, scanned machine printed characters, characters
from fonts, or characters from captchas) and then optionally applying some of the
above transformations and/or noise processes.

\vspace*{-1mm}
\subsection{Data Sources}
\vspace*{-1mm}

%\begin{itemize}
%\item
{\bf NIST.}
Our main source of characters is the NIST Special Database 19~\cite{Grother-1995},
widely used for training and testing character
recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
The dataset is composed of 8????? digits and characters (upper and lower cases), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
%...
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a uniform distribution
of letters in the test set but not in the training set (which is closer to the natural distribution
of letters in text).

%\item
{\bf Fonts.}
In order to have a good variety of sources we downloaded a large number of free fonts from {\tt http://anonymous.url.net}.
%real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
In addition to Windows 7's fonts, this adds up to a total of $9817$ different fonts that we can choose from uniformly.
The ttf file is either used as input to the Captcha generator (see below) or, by producing a corresponding image,
directly as input to our models.

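As an illustration of the second use (rendering a ttf file directly into an image for our models), a hypothetical helper along these lines could be used; the glyph placement and scaling details are glossed over and the function is ours, not the project's code:

\begin{verbatim}
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_character(ttf_path, char, size=32):
    """Render one character from a .ttf file as a size x size
    grey-level array with values in [0, 1]."""
    font = ImageFont.truetype(ttf_path, int(size * 0.8))
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), char, fill=255, font=font)
    return np.asarray(img, dtype=np.float64) / 255.0
\end{verbatim}
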
%\item
{\bf Captchas.}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the generated data, many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
allowed and can be controlled via an easy-to-use facade class.

%\item
{\bf OCR data.}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) were included as an
additional source. This set is part of a larger corpus being collected by the Image Understanding
and Pattern Recognition research group led by Thomas Breuel at the University of Kaiserslautern
({\tt http://www.iupr.com}), which will be publicly released.
%\end{itemize}

\vspace*{-1mm}
\subsection{Data Sets}
\vspace*{-1mm}

All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
%\begin{itemize}

%\item
{\bf NIST.} This is the raw NIST special database 19.

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the above transformation pipeline.
For each new example to generate, a source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$
(see the sketch below).

%\item
{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of sources)
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
%\end{itemize}

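The sketch below illustrates this sampling scheme. The helper that draws a raw character from a given source and the list of pipeline modules are hypothetical placeholders rather than the actual generator code, but the source probabilities and the complexity range are the ones stated above.

\begin{verbatim}
import numpy as np

SOURCES = ["font", "captcha", "ocr", "nist"]
PROBS = [0.10, 0.25, 0.25, 0.40]

def generate_p07_example(rng, draw_from_source, pipeline_modules):
    """Draw a raw 32x32 character from one of the four sources, then apply
    every pipeline module in order, each with a complexity in [0, 0.7]."""
    source = rng.choice(SOURCES, p=PROBS)
    image, label = draw_from_source(source)
    for module in pipeline_modules:
        image = module(image, rng.uniform(0.0, 0.7), rng)
    return image, label
\end{verbatim}

Under the same conventions, NISTP would be generated with the same source proportions but with only the deformation modules (slant to pinch) in the list.
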
\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}

All hyper-parameters are selected based on performance on the NISTP validation set.

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used.
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate was chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

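For concreteness, a minimal numpy sketch of this architecture and of one SGD update follows (an illustration under the hyper-parameters above, not the code actually used; the weight initialization shown is an arbitrary choice):

\begin{verbatim}
import numpy as np

class OneHiddenLayerMLP:
    """One tanh hidden layer and a softmax output estimating
    P(class | image), trained by minibatch SGD."""
    def __init__(self, n_in=32 * 32, n_hidden=1000, n_classes=62, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.01, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.01, (n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        logits = H @ self.W2 + self.b2
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        return H, P / P.sum(axis=1, keepdims=True)

    def sgd_step(self, X, y, lr=0.1):
        """One update on a minibatch (e.g. X of shape (20, 1024), y int labels).
        Returns the mean cross-entropy on the minibatch."""
        H, P = self.forward(X)
        G = P.copy()
        G[np.arange(len(y)), y] -= 1.0    # gradient of cross-entropy wrt logits
        G /= len(y)
        dW2, db2 = H.T @ G, G.sum(axis=0)
        dA = (G @ self.W2.T) * (1.0 - H ** 2)   # back through tanh
        dW1, db1 = X.T @ dA, dA.sum(axis=0)
        for p, g in ((self.W1, dW1), (self.b1, db1),
                     (self.W2, dW2), (self.b2, db2)):
            p -= lr * g
        return -np.log(P[np.arange(len(y)), y] + 1e-12).mean()
\end{verbatim}
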
{\bf Stacked Denoising Auto-Encoders (SDAE).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization, apparently by setting parameters in the
basin of attraction of supervised gradient descent, yielding better
%...
distribution $P(x)$ and the conditional distribution of interest
$P(y|x)$ (as in semi-supervised learning), and on the other hand
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).
Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
%...
the data. Once it is trained, its hidden unit activations can
be used as inputs for training a second one, etc.
After this unsupervised pre-training stage, the parameters
are used to initialize a deep MLP, which is fine-tuned by
the same standard procedure used to train MLPs (see above).
The SDA hyper-parameters are the same as for the MLP, with the addition of the
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

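As an illustration of the building block and of the masking noise process, a minimal numpy sketch of one unsupervised pre-training update follows; sigmoid units, tied weights and a cross-entropy reconstruction cost are assumptions made for this sketch (common choices for denoising auto-encoders), and this is not the actual experiment code:

\begin{verbatim}
import numpy as np

def masking_noise(X, corruption_fraction, rng):
    """Randomly zero out input values with probability corruption_fraction
    (an i.i.d. approximation of zeroing a fixed proportion of them)."""
    return X * (rng.random(X.shape) >= corruption_fraction)

def dae_pretrain_step(X, W, b_hid, b_vis, lr, corruption_fraction, rng):
    """One denoising auto-encoder update on a minibatch X (values in [0,1]).
    W has shape (n_visible, n_hidden); parameters are updated in place.
    Returns the hidden activations on the clean input, which become the
    inputs used to pre-train the next layer of the stack."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    X_tilde = masking_noise(X, corruption_fraction, rng)   # corrupt
    H = sigmoid(X_tilde @ W + b_hid)                       # encode
    Z = sigmoid(H @ W.T + b_vis)                           # reconstruct
    dZ = (Z - X) / len(X)           # grad of cross-entropy wrt decoder pre-activation
    dA = (dZ @ W) * H * (1.0 - H)   # grad wrt encoder pre-activation
    W -= lr * (X_tilde.T @ dA + dZ.T @ H)   # encoder + decoder contributions
    b_hid -= lr * dA.sum(axis=0)
    b_vis -= lr * dZ.sum(axis=0)
    return sigmoid(X @ W + b_hid)
\end{verbatim}
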
\vspace*{-1mm}
\section{Experimental Results}

\vspace*{-1mm}
\subsection{SDA vs MLP vs Humans}
\vspace*{-1mm}

We compare here the best MLP (according to validation set error) that we found against
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}. AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language
processing \citep{SnowEtAl2008} and vision
\citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented
with 10 character images and asked to type the 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing us
to estimate inter-human variability (shown as +/- in parentheses below).

Figure~\ref{fig:error-rates-charts} summarizes the results obtained.
More detailed results and tables can be found in the appendix.

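One plausible way of turning the three classifications collected per image into an error rate with an uncertainty estimate is sketched below; the exact computation behind the reported +/- values and the 95\% confidence intervals is not spelled out in this excerpt, so both the grouping by user and the formula are assumptions:

\begin{verbatim}
import numpy as np

def human_error_rate(correct):
    """correct: boolean array of shape (n_images, 3), whether each of the
    three AMT users labelled each image correctly.  Returns the mean error
    rate and half the width of a rough 95% interval based on the spread
    across users."""
    per_user_error = 1.0 - np.asarray(correct, dtype=float).mean(axis=0)
    std_err = per_user_error.std(ddof=1) / np.sqrt(per_user_error.size)
    return per_user_error.mean(), 1.96 * std_err
\end{verbatim}
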
\begin{table}
\caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
26 lower + 26 upper), except for the last columns -- digits only, between the deep architecture with pre-training
(SDA=Stacked Denoising Autoencoder) and the ordinary shallow architecture
%...
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature. }
\label{fig:error-rates-charts}
\end{figure}

\vspace*{-1mm}
\subsection{Perturbed Training Data More Helpful for SDAE}
\vspace*{-1mm}

\begin{table}
\caption{Relative change in error rates due to the use of perturbed training data,
either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
A positive value indicates that training on the perturbed data helped for the
%...
MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline
\end{tabular}
\end{center}
\end{table}

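One reading of the row labels in this table (e.g. MLP0/MLP2-1), offered here as an interpretation since the definition is not spelled out in this excerpt, is that each entry reports
\[
  \frac{e_{\mathrm{clean}}}{e_{\mathrm{perturbed}}} - 1,
\]
where $e_{\mathrm{clean}}$ and $e_{\mathrm{perturbed}}$ denote the test error of the model trained on clean data (e.g. MLP0) and on perturbed data (e.g. MLP2) respectively; this quantity is positive exactly when training on the perturbed data lowered the error, consistent with the caption.
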
\vspace*{-1mm}
\subsection{Multi-Task Learning Effects}
\vspace*{-1mm}

As previously seen, the SDA is better able to benefit from the
transformations applied to the data than the MLP. In this experiment we
define three tasks: recognizing digits (knowing that the input is a digit),
recognizing upper case characters (knowing that the input is one), and
%...
\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
\caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).}
\label{fig:improvements-charts}
\end{figure}

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

The conclusions are positive for all the questions asked in the introduction.
%\begin{itemize}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
performance.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifiers better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input images,
but only marginally helped with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP, it was very significant for the SDA.
%\end{itemize}

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

{\small
\bibliography{strings,ml,aigaion,specials}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}
}

\end{document}