Mercurial repository ift6266, comparison of writeup/nips2010_submission.tex @ 484:9a757d565e46
size reduction
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Mon, 31 May 2010 20:42:22 -0400 |
parents | b9cdb464de5f |
children | 6beaf3328521 |
483:b9cdb464de5f | 484:9a757d565e46 |
---|---|
13 \begin{document} | 13 \begin{document} |
14 | 14 |
15 %\makeanontitle | 15 %\makeanontitle |
16 \maketitle | 16 \maketitle |
17 | 17 |
18 \vspace*{-2mm} | |
18 \begin{abstract} | 19 \begin{abstract} |
19 Recent theoretical and empirical work in statistical machine learning has | 20 Recent theoretical and empirical work in statistical machine learning has |
20 demonstrated the importance of learning algorithms for deep | 21 demonstrated the importance of learning algorithms for deep |
21 architectures, i.e., function classes obtained by composing multiple | 22 architectures, i.e., function classes obtained by composing multiple |
22 non-linear transformations. The self-taught learning (exploiting unlabeled | 23 non-linear transformations. The self-taught learning (exploiting unlabeled |
34 images, color, contrast, occlusion, and various types of pixel and | 35 images, color, contrast, occlusion, and various types of pixel and |
35 spatially correlated noise. The out-of-distribution examples are | 36 spatially correlated noise. The out-of-distribution examples are |
36 obtained by training with these highly distorted images or | 37 obtained by training with these highly distorted images or |
37 by including object classes different from those in the target test set. | 38 by including object classes different from those in the target test set. |
38 \end{abstract} | 39 \end{abstract} |
40 \vspace*{-2mm} | |
39 | 41 |
40 \section{Introduction} | 42 \section{Introduction} |
43 \vspace*{-1mm} | |
41 | 44 |
42 Deep Learning has emerged as a promising new area of research in | 45 Deep Learning has emerged as a promising new area of research in |
43 statistical machine learning (see~\citet{Bengio-2009} for a review). | 46 statistical machine learning (see~\citet{Bengio-2009} for a review). |
44 Learning algorithms for deep architectures are centered on the learning | 47 Learning algorithms for deep architectures are centered on the learning |
45 of useful representations of data, which are better suited to the task at hand. | 48 of useful representations of data, which are better suited to the task at hand. |
46 This is in great part inspired by observations of the mammalian visual cortex, | 49 This is in great part inspired by observations of the mammalian visual cortex, |
47 which consists of a chain of processing elements, each of which is associated with a | 50 which consists of a chain of processing elements, each of which is associated with a |
48 different representation. In fact, | 51 different representation of the raw visual input. In fact, |
49 it was found recently that the features learnt in deep architectures resemble | 52 it was found recently that the features learnt in deep architectures resemble |
50 those observed in the first two of these stages (in areas V1 and V2 | 53 those observed in the first two of these stages (in areas V1 and V2 |
51 of visual cortex)~\citep{HonglakL2008}. | 54 of visual cortex)~\citep{HonglakL2008}, and that they become more and |
52 Processing images typically involves transforming the raw pixel data into | 55 more invariant to factors of variation (such as camera movement) in |
53 new {\bf representations} that can be used for analysis or classification. | 56 higher layers~\cite{Goodfellow2009}. |
54 For example, a principal component analysis representation linearly projects | 57 Learning a hierarchy of features increases the |
55 the input image into a lower-dimensional feature space. | |
56 Why learn a representation? Current practice in the computer vision | |
57 literature converts the raw pixels into a hand-crafted representation | |
58 e.g.\ SIFT features~\citep{Lowe04}, but deep learning algorithms | |
59 tend to discover similar features in their first few | |
60 levels~\citep{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}. | |
61 Learning increases the | |
62 ease and practicality of developing representations that are at once | 58 ease and practicality of developing representations that are at once |
63 tailored to specific tasks, yet are able to borrow statistical strength | 59 tailored to specific tasks, yet are able to borrow statistical strength |
64 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the | 60 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the |
65 feature representation can lead to higher-level (more abstract, more | 61 feature representation can lead to higher-level (more abstract, more |
66 general) features that are more robust to unanticipated sources of | 62 general) features that are more robust to unanticipated sources of |
79 Machines in terms of unsupervised extraction of a hierarchy of features | 75 Machines in terms of unsupervised extraction of a hierarchy of features |
80 useful for classification. The principle is that each layer starting from | 76 useful for classification. The principle is that each layer starting from |
81 the bottom is trained to encode its input (the output of the previous | 77 the bottom is trained to encode its input (the output of the previous |
82 layer) and to reconstruct it from a corrupted version of it. After this | 78 layer) and to reconstruct it from a corrupted version of it. After this |
83 unsupervised initialization, the stack of denoising auto-encoders can be | 79 unsupervised initialization, the stack of denoising auto-encoders can be |
84 converted into a deep supervised feedforward neural network and trained by | 80 converted into a deep supervised feedforward neural network and fine-tuned by |
85 stochastic gradient descent. | 81 stochastic gradient descent. |
86 | 82 |
83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | |
84 of semi-supervised and multi-task learning: the learner can exploit examples | |
85 that are unlabeled and/or come from a distribution different from the target | |
86 distribution, e.g., from other classes than those of interest. Whereas |
87 it has already been shown that deep learners can clearly take advantage of | |
88 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008} | |
89 and multi-task learning, not much has been done yet to explore the impact | |
90 of {\em out-of-distribution} examples and of the multi-task setting | |
91 (but see~\citep{CollobertR2008-short}). In particular the {\em relative | |
92 advantage} of deep learning for this setting has not been evaluated. |
93 | |
87 In this paper we ask the following questions: | 94 In this paper we ask the following questions: |
88 \begin{enumerate} | 95 |
89 \item Do the good results previously obtained with deep architectures on the | 96 %\begin{enumerate} |
90 MNIST digits generalize to the setting of a much larger and richer (but similar) | 97 $\bullet$ %\item |
98 Do the good results previously obtained with deep architectures on the | |
99 MNIST digit images generalize to the setting of a much larger and richer (but similar) | |
91 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 100 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
92 \item To what extent does the perturbation of input images (e.g. adding | 101 |
102 $\bullet$ %\item | |
103 To what extent does the perturbation of input images (e.g. adding | |
93 noise, affine transformations, background images) make the resulting | 104 noise, affine transformations, background images) make the resulting |
94 classifier better not only on similarly perturbed images but also on | 105 classifiers better not only on similarly perturbed images but also on |
95 the {\em original clean examples}? | 106 the {\em original clean examples}? |
96 \item Do deep architectures benefit more from such {\em out-of-distribution} | 107 |
108 $\bullet$ %\item | |
109 Do deep architectures {\em benefit more from such out-of-distribution} | |
97 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 110 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
98 \item Similarly, does the feature learning step in deep learning algorithms benefit more | 111 |
112 $\bullet$ %\item | |
113 Similarly, does the feature learning step in deep learning algorithms benefit more | |
99 from training with similar but different classes (i.e. a multi-task learning scenario) than | 114 from training with similar but different classes (i.e. a multi-task learning scenario) than |
100 a corresponding shallow and purely supervised architecture? | 115 a corresponding shallow and purely supervised architecture? |
101 \end{enumerate} | 116 %\end{enumerate} |
117 | |
102 The experimental results presented here provide positive evidence towards all of these questions. | 118 The experimental results presented here provide positive evidence towards all of these questions. |
103 | 119 |
120 \vspace*{-1mm} | |
104 \section{Perturbation and Transformation of Character Images} | 121 \section{Perturbation and Transformation of Character Images} |
122 \vspace*{-1mm} | |
105 | 123 |
106 This section describes the different transformations we used to stochastically | 124 This section describes the different transformations we used to stochastically |
107 transform source images in order to obtain new examples. More details can | 125 transform source images in order to obtain new examples. More details can |
108 be found in this technical report~\citep{ift6266-tr-anonymous}. | 126 be found in this technical report~\citep{ift6266-tr-anonymous}. |
109 The code for these transformations (mostly python) is available at | 127 The code for these transformations (mostly python) is available at |
113 | 131 |
114 There are two main parts in the pipeline. The first one, | 132 There are two main parts in the pipeline. The first one, |
115 from slant to pinch below, performs transformations. The second | 133 from slant to pinch below, performs transformations. The second |
116 part, from blur to contrast, adds different kinds of noise. | 134 part, from blur to contrast, adds different kinds of noise. |
117 | 135 |
118 {\large\bf Transformations}\\ | 136 {\large\bf Transformations} |
137 | |
138 \vspace*{2mm} | |
119 | 139 |
120 {\bf Slant.} | 140 {\bf Slant.} |
121 We mimic slant by shifting each row of the image | 141 We mimic slant by shifting each row of the image |
122 proportionally to its height: $shift = round(slant \times height)$. | 142 proportionally to its height: $shift = round(slant \times height)$. |
123 The $slant$ coefficient can be negative or positive with equal probability | 143 The $slant$ coefficient can be negative or positive with equal probability |
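As a rough illustration of this slant operation (not the pipeline's actual Python code), the NumPy sketch below shifts each row by an amount proportional to its vertical position; the function name, the zero padding at the borders, and the reading of ``height'' as the row index are assumptions, and the sampling of the $slant$ coefficient is left to the caller.
\begin{verbatim}
import numpy as np

def apply_slant(image, slant):
    """Shift row y of a (height x width) grey-level image horizontally by
    round(slant * y), mimicking a slanted character.  Pixels shifted out
    of the frame are dropped and the freed positions are zero-padded."""
    height, width = image.shape
    out = np.zeros_like(image)
    for y in range(height):
        shift = int(round(slant * y))   # shift proportional to the row position
        if shift >= 0:
            out[y, shift:] = image[y, :width - shift]
        else:
            out[y, :width + shift] = image[y, -shift:]
    return out
\end{verbatim}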
176 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ | 196 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ |
177 and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times | 197 and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times |
178 d_1$, where $pinch$ is a parameter to the filter. | 198 d_1$, where $pinch$ is a parameter to the filter. |
179 The actual value is given by bilinear interpolation considering the pixels | 199 The actual value is given by bilinear interpolation considering the pixels |
180 around the (non-integer) source position thus found. | 200 around the (non-integer) source position thus found. |
181 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.\\ | 201 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. |
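For concreteness, a possible NumPy/SciPy rendering of this pinch mapping is sketched below (the pipeline itself relies on the corresponding GIMP filter); the function name, the default radius, leaving pixels beyond the radius untouched, and the border mode are assumptions, while the radial formula and the bilinear interpolation follow the description above.
\begin{verbatim}
import numpy as np
from scipy.ndimage import map_coordinates

def apply_pinch(image, pinch, radius=None):
    """For each destination pixel at distance d1 < r from the centre C,
    fetch the source value at distance
        d2 = sin(pi * d1 / (2 * r)) ** (-pinch) * d1
    along the same direction, using bilinear interpolation (order=1)."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = radius if radius is not None else min(cy, cx)
    yy, xx = np.mgrid[0:h, 0:w]
    dy, dx = yy - cy, xx - cx
    d1 = np.sqrt(dy ** 2 + dx ** 2)
    inside = (d1 > 0) & (d1 < r)
    scale = np.ones_like(d1)                 # identity outside the radius
    scale[inside] = np.sin(np.pi * d1[inside] / (2 * r)) ** (-pinch)
    src_y, src_x = cy + dy * scale, cx + dx * scale
    return map_coordinates(image, [src_y, src_x], order=1, mode='nearest')
\end{verbatim}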
182 | 202 |
183 {\large\bf Injecting Noise}\\ | 203 \vspace*{1mm} |
204 | |
205 {\large\bf Injecting Noise} | |
206 | |
207 \vspace*{1mm} | |
184 | 208 |
185 {\bf Motion Blur.} | 209 {\bf Motion Blur.} |
186 This is GIMP's ``linear motion blur'' filter, | 210 This is GIMP's ``linear motion blur'' filter, |
187 with two parameters, $length$ and $angle$. The value of | 211 with two parameters, $length$ and $angle$. The value of |
188 a pixel in the final image is approximately the mean value of the first $length$ pixels | 212 a pixel in the final image is approximately the mean value of the first $length$ pixels |
284 color and contrast changes.} | 308 color and contrast changes.} |
285 \label{fig:transfo} | 309 \label{fig:transfo} |
286 \end{figure} | 310 \end{figure} |
287 | 311 |
288 | 312 |
289 | 313 \vspace*{-1mm} |
290 \section{Experimental Setup} | 314 \section{Experimental Setup} |
315 \vspace*{-1mm} | |
291 | 316 |
292 Whereas much previous work on deep learning algorithms had been performed on | 317 Whereas much previous work on deep learning algorithms had been performed on |
293 the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, | 318 the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
294 with 60~000 examples, and variants involving 10~000 | 319 with 60~000 examples, and variants involving 10~000 |
295 examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want | 320 examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want |
297 to 1000 times larger. The larger datasets are obtained by first sampling from | 322 to 1000 times larger. The larger datasets are obtained by first sampling from |
298 a {\em data source} (NIST characters, scanned machine-printed characters, characters | 323 a {\em data source} (NIST characters, scanned machine-printed characters, characters |
299 from fonts, or characters from captchas) and then optionally applying some of the | 324 from fonts, or characters from captchas) and then optionally applying some of the |
300 above transformations and/or noise processes. | 325 above transformations and/or noise processes. |
301 | 326 |
327 \vspace*{-1mm} | |
302 \subsection{Data Sources} | 328 \subsection{Data Sources} |
303 | 329 \vspace*{-1mm} |
304 \begin{itemize} | 330 |
305 \item {\bf NIST} | 331 %\begin{itemize} |
332 %\item | |
333 {\bf NIST.} | |
306 Our main source of characters is the NIST Special Database 19~\cite{Grother-1995}, | 334 Our main source of characters is the NIST Special Database 19~\cite{Grother-1995}, |
307 widely used for training and testing character | 335 widely used for training and testing character |
308 recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}. | 336 recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}. |
309 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications, | 337 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications, |
310 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | 338 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes |
320 Note that the distribution of the classes in the NIST training and test sets differs | 348 Note that the distribution of the classes in the NIST training and test sets differs |
321 substantially, with relatively many more digits in the test set, and uniform distribution | 349 substantially, with relatively many more digits in the test set, and uniform distribution |
322 of letters in the test set, not in the training set (more like the natural distribution | 350 of letters in the test set, not in the training set (more like the natural distribution |
323 of letters in text). | 351 of letters in text). |
324 | 352 |
325 \item {\bf Fonts} | 353 %\item |
354 {\bf Fonts.} | |
326 In order to have a good variety of sources, we downloaded a large number of free fonts from {\tt http://anonymous.url.net} | 355 In order to have a good variety of sources, we downloaded a large number of free fonts from {\tt http://anonymous.url.net} |
327 %real adress {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} | 356 %real adress {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} |
328 in addition to those bundled with Windows 7; this adds up to a total of $9817$ different fonts from which we can choose uniformly. | 357 in addition to those bundled with Windows 7; this adds up to a total of $9817$ different fonts from which we can choose uniformly. |
329 Each {\tt ttf} font file is either used as input to the Captcha generator (see next item) or, once rendered as an image, | 358 Each {\tt ttf} font file is either used as input to the Captcha generator (see next item) or, once rendered as an image, |
330 used directly as input to our models. | 359 used directly as input to our models. |
331 | 360 |
332 | 361 %\item |
333 | 362 {\bf Captchas.} |
334 \item {\bf Captchas} | |
335 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for | 363 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for |
336 generating characters of the same format as the NIST dataset. This software is based on | 364 generating characters of the same format as the NIST dataset. This software is based on |
337 a random character class generator and various kinds of transformations similar to those described in the previous sections. | 365 a random character class generator and various kinds of transformations similar to those described in the previous sections. |
338 In order to increase the variability of the data generated, many different fonts are used for generating the characters. | 366 In order to increase the variability of the data generated, many different fonts are used for generating the characters. |
339 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity | 367 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity |
340 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are | 368 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are |
341 allowed and can be controlled via an easy-to-use facade class. | 369 allowed and can be controlled via an easy-to-use facade class. |
342 \item {\bf OCR data} | 370 |
371 %\item | |
372 {\bf OCR data.} | |
343 A large set (2 million) of scanned, OCRed and manually verified machine-printed | 373 A large set (2 million) of scanned, OCRed and manually verified machine-printed |
344 characters (from various documents and books) was included as an | 374 characters (from various documents and books) was included as an |
345 additional source. This set is part of a larger corpus being collected by the Image Understanding | 375 additional source. This set is part of a larger corpus being collected by the Image Understanding |
346 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern | 376 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern |
347 ({\tt http://www.iupr.com}), which will be publicly released. | 377 ({\tt http://www.iupr.com}), which will be publicly released. |
348 \end{itemize} | 378 %\end{itemize} |
349 | 379 |
380 \vspace*{-1mm} | |
350 \subsection{Data Sets} | 381 \subsection{Data Sets} |
382 \vspace*{-1mm} | |
383 | |
351 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label | 384 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label |
352 from one of the 62 character classes. | 385 from one of the 62 character classes. |
353 \begin{itemize} | 386 %\begin{itemize} |
354 \item {\bf NIST}. This is the raw NIST special database 19. | 387 |
355 \item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources | 388 %\item |
389 {\bf NIST.} This is the raw NIST special database 19. | |
390 | |
391 %\item | |
392 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | |
356 and sending them through the above transformation pipeline. | 393 and sending them through the above transformation pipeline. |
357 For each new example to generate, a source is selected with probability $10\%$ from the fonts, | 394 For each new example to generate, a source is selected with probability $10\%$ from the fonts, |
358 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | 395 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the |
359 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$ (a code sketch of this sampling step is given below). | 396 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$ (a code sketch of this sampling step is given below). |
360 \item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) | 397 |
398 %\item |
399 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) |
361 except that we only apply | 400 except that we only apply |
362 transformations from slant to pinch. Therefore, the character is | 401 transformations from slant to pinch. Therefore, the character is |
363 transformed but no additional noise is added to the image, giving images | 402 transformed but no additional noise is added to the image, giving images |
364 closer to the NIST dataset. | 403 closer to the NIST dataset. |
365 \end{itemize} | 404 %\end{itemize} |
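The per-example generation procedure for P07 could be sketched as follows; the helper names and the transformation interface are hypothetical, while the source proportions and the $[0,0.7]$ complexity range are the ones stated above.
\begin{verbatim}
import random

# Hypothetical interfaces; only the proportions and the complexity
# range come from the description of P07 above.
SOURCES = {'font': 0.10, 'captcha': 0.25, 'ocr': 0.25, 'nist': 0.40}

def generate_p07_example(sample_from_source, transformations):
    """Pick a data source according to the stated proportions, then apply
    every transformation of the pipeline in order, each with its own
    complexity drawn uniformly from [0, 0.7]."""
    names, probs = zip(*SOURCES.items())
    source = random.choices(names, weights=probs, k=1)[0]
    image, label = sample_from_source(source)
    for transform in transformations:        # pipeline order given above
        complexity = random.uniform(0.0, 0.7)
        image = transform(image, complexity)
    return image, label
\end{verbatim}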
366 | 405 |
406 \vspace*{-1mm} | |
367 \subsection{Models and their Hyperparameters} | 407 \subsection{Models and their Hyperparameters} |
408 \vspace*{-1mm} | |
368 | 409 |
369 All hyper-parameters are selected based on performance on the NISTP validation set. | 410 All hyper-parameters are selected based on performance on the NISTP validation set. |
370 | 411 |
371 \subsubsection{Multi-Layer Perceptrons (MLP)} | 412 {\bf Multi-Layer Perceptrons (MLP).} |
372 | |
373 Whereas previous work had compared deep architectures to both shallow MLPs and | 413 Whereas previous work had compared deep architectures to both shallow MLPs and |
374 SVMs, we only compared to MLPs here because of the very large datasets used. | 414 SVMs, we only compared to MLPs here because of the very large datasets used. |
375 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 415 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
376 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. | 416 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. |
377 The hyper-parameters are the following: number of hidden units, taken in | 417 The hyper-parameters are the following: number of hidden units, taken in |
378 $\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training | 418 $\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training |
379 examples are presented in minibatches of size 20. A constant learning | 419 examples are presented in minibatches of size 20. A constant learning |
380 rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$ | 420 rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$ |
381 through preliminary experiments, and 0.1 was selected. | 421 through preliminary experiments, and 0.1 was selected. |
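To make the supervised baseline concrete, here is a minimal NumPy sketch of such a single-hidden-layer MLP (tanh hidden units, softmax output, constant-learning-rate minibatch SGD); the class name, the weight initialization, and the default of 1000 hidden units are illustrative assumptions rather than the implementation actually used.
\begin{verbatim}
import numpy as np

class OneHiddenLayerMLP:
    """Single tanh hidden layer + softmax output estimating P(class|image),
    trained by minibatch stochastic gradient descent."""

    def __init__(self, n_in=32 * 32, n_hidden=1000, n_classes=62, lr=0.1):
        rng = np.random.RandomState(0)
        bound = np.sqrt(6.0 / (n_in + n_hidden))
        self.W1 = rng.uniform(-bound, bound, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.zeros((n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)
        self.lr = lr

    def forward(self, X):
        h = np.tanh(X @ self.W1 + self.b1)
        logits = h @ self.W2 + self.b2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        return h, p / p.sum(axis=1, keepdims=True)    # softmax probabilities

    def sgd_step(self, X, y):
        """One update on a minibatch (minibatches of size 20 in the paper)."""
        n = X.shape[0]
        h, p = self.forward(X)
        d_logits = p.copy()
        d_logits[np.arange(n), y] -= 1.0               # cross-entropy gradient
        d_logits /= n
        dh = (d_logits @ self.W2.T) * (1.0 - h ** 2)   # back through tanh
        self.W2 -= self.lr * (h.T @ d_logits)
        self.b2 -= self.lr * d_logits.sum(axis=0)
        self.W1 -= self.lr * (X.T @ dh)
        self.b1 -= self.lr * dh.sum(axis=0)
\end{verbatim}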
382 | 422 |
383 | 423 {\bf Stacked Denoising Auto-Encoders (SDAE).} |
384 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)} | |
385 \label{SdA} | |
386 | |
387 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 424 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
388 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 425 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
389 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006} | 426 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006} |
390 enabling better generalization, apparently setting parameters in the | 427 enabling better generalization, apparently setting parameters in the |
391 basin of attraction of supervised gradient descent, yielding better | 428 basin of attraction of supervised gradient descent, yielding better |
395 distribution $P(x)$ and the conditional distribution of interest | 432 distribution $P(x)$ and the conditional distribution of interest |
396 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 433 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
397 taking advantage of the expressive power and bias implicit in the | 434 taking advantage of the expressive power and bias implicit in the |
398 deep architecture (whereby complex concepts are expressed as | 435 deep architecture (whereby complex concepts are expressed as |
399 compositions of simpler ones through a deep hierarchy). | 436 compositions of simpler ones through a deep hierarchy). |
400 | |
401 Here we chose to use the Denoising | 437 Here we chose to use the Denoising |
402 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | 438 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for |
403 these deep hierarchies of features, as it is very simple to train and | 439 these deep hierarchies of features, as it is very simple to train and |
404 teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}), | 440 teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}), |
405 provides immediate and efficient inference, and yielded results | 441 provides immediate and efficient inference, and yielded results |
411 the data. Once it is trained, its hidden unit activations can | 447 the data. Once it is trained, its hidden unit activations can |
412 be used as inputs for training a second one, etc. | 448 be used as inputs for training a second one, etc. |
413 After this unsupervised pre-training stage, the parameters | 449 After this unsupervised pre-training stage, the parameters |
414 are used to initialize a deep MLP, which is fine-tuned by | 450 are used to initialize a deep MLP, which is fine-tuned by |
415 the same standard procedure used to train MLPs (see previous section). | 451 the same standard procedure used to train MLPs (see previous section). |
416 | 452 The SDA hyper-parameters are the same as for the MLP, with the addition of the |
417 The hyper-parameters are the same as for the MLP, with the addition of the | |
418 amount of corruption noise (we used the masking noise process, whereby a | 453 amount of corruption noise (we used the masking noise process, whereby a |
419 fixed proportion of the input values, randomly selected, are zeroed), and a | 454 fixed proportion of the input values, randomly selected, are zeroed), and a |
420 separate learning rate for the unsupervised pre-training stage (selected | 455 separate learning rate for the unsupervised pre-training stage (selected |
421 from the same above set). The fraction of inputs corrupted was selected | 456 from the same above set). The fraction of inputs corrupted was selected |
422 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 457 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
423 of hidden layers, but it was fixed to 3 based on previous work with | 458 of hidden layers, but it was fixed to 3 based on previous work with |
424 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | 459 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. |
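As an illustration of the greedy unsupervised pre-training step, the sketch below trains a single denoising auto-encoder layer with masking noise and returns its parameters together with the hidden representation on which the next layer would be trained; tied weights, sigmoid units, the cross-entropy reconstruction loss and the default hyper-parameter values are assumptions of the sketch, whereas the masking-noise corruption and the layer-wise stacking follow the description above.
\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_denoising_layer(X, n_hidden, corruption=0.2, lr=0.01,
                             n_epochs=10, batch_size=20, seed=0):
    """Train one denoising auto-encoder layer: zero a random fraction of
    the inputs (masking noise), reconstruct the uncorrupted input, and
    return (W, b_hidden, hidden representation of X) so that the next
    layer can be trained on that representation."""
    rng = np.random.RandomState(seed)
    n_in = X.shape[1]
    bound = 4.0 * np.sqrt(6.0 / (n_in + n_hidden))
    W = rng.uniform(-bound, bound, (n_in, n_hidden))   # tied weights
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(n_epochs):
        for start in range(0, X.shape[0], batch_size):
            x = X[start:start + batch_size]
            mask = rng.binomial(1, 1.0 - corruption, x.shape)  # masking noise
            h = sigmoid((mask * x) @ W + b_h)                  # encode
            z = sigmoid(h @ W.T + b_v)                         # reconstruct
            dz = (z - x) / x.shape[0]            # cross-entropy loss gradient
            dh = (dz @ W) * h * (1.0 - h)
            W -= lr * ((mask * x).T @ dh + dz.T @ h)
            b_v -= lr * dz.sum(axis=0)
            b_h -= lr * dh.sum(axis=0)
    return W, b_h, sigmoid(X @ W + b_h)
\end{verbatim}
After all layers have been pre-trained this way, the learned weights and biases would initialize the corresponding layers of a deep MLP, which is then fine-tuned as in the supervised MLP sketch above.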
425 | 460 |
461 \vspace*{-1mm} | |
426 \section{Experimental Results} | 462 \section{Experimental Results} |
427 | 463 |
464 \vspace*{-1mm} | |
428 \subsection{SDA vs MLP vs Humans} | 465 \subsection{SDA vs MLP vs Humans} |
466 \vspace*{-1mm} | |
429 | 467 |
430 We compare here the best MLP (according to validation set error) that we found against | 468 We compare here the best MLP (according to validation set error) that we found against |
431 the best SDA (again according to validation set error), along with a precise estimate | 469 the best SDA (again according to validation set error), along with a precise estimate |
432 of human performance obtained via Amazon's Mechanical Turk (AMT) | 470 of human performance obtained via Amazon's Mechanical Turk (AMT) |
433 service\footnote{http://mturk.com}. AMT users are paid small amounts | 471 service\footnote{http://mturk.com}. AMT users are paid small amounts |
434 of money to perform tasks for which human intelligence is required. | 472 of money to perform tasks for which human intelligence is required. |
435 Mechanical Turk has been used extensively in natural language | 473 Mechanical Turk has been used extensively in natural language |
436 processing \citep{SnowEtAl2008} and vision | 474 processing \citep{SnowEtAl2008} and vision |
437 \citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented | 475 \citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented |
438 with 10 character images and asked to type 10 corresponding ASCII | 476 with 10 character images and asked to type 10 corresponding ASCII |
439 characters. Hence they were forced to make a hard choice among the | 477 characters. They were forced to make a hard choice among the |
440 62 character classes. Three users classified each image, allowing | 478 62 or 10 character classes (all classes or digits only). |
479 Three users classified each image, allowing | |
441 us to estimate inter-human variability (shown as $\pm$ in parentheses below). | 480 us to estimate inter-human variability (shown as $\pm$ in parentheses below). |
481 | |
482 Figure~\ref{fig:error-rates-charts} summarizes the results obtained. | |
483 More detailed results and tables can be found in the appendix. | |
442 | 484 |
443 \begin{table} | 485 \begin{table} |
444 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + | 486 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + |
445 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training | 487 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training |
446 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture | 488 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture |
474 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ | 516 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ |
475 \caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature.} | 517 \caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature.} |
476 \label{fig:error-rates-charts} | 518 \label{fig:error-rates-charts} |
477 \end{figure} | 519 \end{figure} |
478 | 520 |
521 \vspace*{-1mm} | |
479 \subsection{Perturbed Training Data More Helpful for SDAE} | 522 \subsection{Perturbed Training Data More Helpful for SDAE} |
523 \vspace*{-1mm} | |
480 | 524 |
481 \begin{table} | 525 \begin{table} |
482 \caption{Relative change in error rates due to the use of perturbed training data, | 526 \caption{Relative change in error rates due to the use of perturbed training data, |
483 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. | 527 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. |
484 A positive value indicates that training on the perturbed data helped for the | 528 A positive value indicates that training on the perturbed data helped for the |
497 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline | 541 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline |
498 \end{tabular} | 542 \end{tabular} |
499 \end{center} | 543 \end{center} |
500 \end{table} | 544 \end{table} |
501 | 545 |
502 | 546 \vspace*{-1mm} |
503 \subsection{Multi-Task Learning Effects} | 547 \subsection{Multi-Task Learning Effects} |
548 \vspace*{-1mm} | |
504 | 549 |
505 As previously seen, the SDA is better able to benefit from the | 550 As previously seen, the SDA is better able to benefit from the |
506 transformations applied to the data than the MLP. In this experiment we | 551 transformations applied to the data than the MLP. In this experiment we |
507 define three tasks: recognizing digits (knowing that the input is a digit), | 552 define three tasks: recognizing digits (knowing that the input is a digit), |
508 recognizing upper case characters (knowing that the input is one), and | 553 recognizing upper case characters (knowing that the input is one), and |
552 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | 597 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ |
553 \caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).} | 598 \caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).} |
554 \label{fig:improvements-charts} | 599 \label{fig:improvements-charts} |
555 \end{figure} | 600 \end{figure} |
556 | 601 |
557 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 602 \vspace*{-1mm} |
558 can be executed on-line at {\tt http://deep.host22.com}. | |
559 | |
560 \section{Conclusions} | 603 \section{Conclusions} |
604 \vspace*{-1mm} | |
561 | 605 |
562 The conclusions are positive for all the questions asked in the introduction. | 606 The conclusions are positive for all the questions asked in the introduction. |
563 \begin{itemize} | 607 %\begin{itemize} |
564 \item Do the good results previously obtained with deep architectures on the | 608 $\bullet$ %\item |
609 Do the good results previously obtained with deep architectures on the | |
565 MNIST digits generalize to the setting of a much larger and richer (but similar) | 610 MNIST digits generalize to the setting of a much larger and richer (but similar) |
566 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 611 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
567 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level | 612 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level |
568 performance. | 613 performance. |
569 \item To what extent does the perturbation of input images (e.g. adding | 614 |
615 $\bullet$ %\item | |
616 To what extent does the perturbation of input images (e.g. adding | |
570 noise, affine transformations, background images) make the resulting | 617 noise, affine transformations, background images) make the resulting |
571 classifier better not only on similarly perturbed images but also on | 618 classifier better not only on similarly perturbed images but also on |
572 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | 619 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} |
573 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 620 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
574 MLPs were helped by perturbed training examples when tested on perturbed input images, | 621 MLPs were helped by perturbed training examples when tested on perturbed input images, |
575 but were only marginally helped with respect to clean examples. On the other hand, the deep SDAs | 622 but were only marginally helped with respect to clean examples. On the other hand, the deep SDAs |
576 were very significantly boosted by these out-of-distribution examples. | 623 were very significantly boosted by these out-of-distribution examples. |
577 \item Similarly, does the feature learning step in deep learning algorithms benefit more | 624 |
625 $\bullet$ %\item | |
626 Similarly, does the feature learning step in deep learning algorithms benefit more | |
578 from training with similar but different classes (i.e. a multi-task learning scenario) than | 627 from training with similar but different classes (i.e. a multi-task learning scenario) than |
579 a corresponding shallow and purely supervised architecture? | 628 a corresponding shallow and purely supervised architecture? |
580 Whereas the improvement due to the multi-task setting was marginal or | 629 Whereas the improvement due to the multi-task setting was marginal or |
581 negative for the MLP, it was very significant for the SDA. | 630 negative for the MLP, it was very significant for the SDA. |
582 \end{itemize} | 631 %\end{itemize} |
583 | 632 |
633 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | |
634 can be executed on-line at {\tt http://deep.host22.com}. | |
635 | |
636 | |
637 {\small | |
584 \bibliography{strings,ml,aigaion,specials} | 638 \bibliography{strings,ml,aigaion,specials} |
585 %\bibliographystyle{plainnat} | 639 %\bibliographystyle{plainnat} |
586 \bibliographystyle{unsrtnat} | 640 \bibliographystyle{unsrtnat} |
587 %\bibliographystyle{apalike} | 641 %\bibliographystyle{apalike} |
642 } | |
588 | 643 |
589 \end{document} | 644 \end{document} |