comparison writeup/nips2010_submission.tex @ 472:2dd6e8962df1

conclusion
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 30 May 2010 10:44:20 -0400
parents d02d288257bf
children bcf024e6ab23
469:d02d288257bf 472:2dd6e8962df1
277 \end{figure} 277 \end{figure}
278 278
279 279
280 \section{Experimental Setup} 280 \section{Experimental Setup}
281 281
282 \subsection{Training Datasets} 282 Whereas much previous work on deep learning algorithms had been performed on
283 283 the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
284 \subsubsection{Data Sources} 284 with 60~000 examples, and variants involving 10~000
285 examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
286 to focus here on the case of much larger training sets, from 10 to
287 1000 times larger. The larger datasets are obtained by first sampling from
288 a {\em data source} (NIST characters, scanned machine printed characters, characters
289 from fonts, or characters from captchas) and then optionally applying some of the
290 above transformations and/or noise processes.
291
292 \subsection{Data Sources}
285 293
286 \begin{itemize} 294 \begin{itemize}
287 \item {\bf NIST} 295 \item {\bf NIST}
288 The NIST Special Database 19 (NIST19) is a very widely used dataset for training and testing OCR systems. 296 Our main source of characters is the NIST Special Database 19~\cite{Grother-1995},
297 widely used for training and testing character
298 recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
289 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications, 299 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications,
290 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 300 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
291 corresponding to ``0''--``9'', ``A''--``Z'' and ``a''--``z''. The dataset contains 8 series of different complexity. 301 corresponding to ``0''--``9'', ``A''--``Z'' and ``a''--``z''. The dataset contains 8 series of different complexity.
292 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one for classification task is recommended 302 The fourth series, $hsf_4$, experimentally recognized as the most difficult, is recommended
293 by NIST as testing set and is used in our work for that purpose. It contains 82600 examples, 303 by NIST as a testing set and is used in our work and some previous work~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}
294 while the training and validation sets (which have the same distribution) contain XXXXX and 304 for that purpose. We randomly split the remainder into a training set and a validation set for
295 XXXXX examples respectively. 305 model selection. The sizes of these data sets are: XXX for training, XXX for validation,
306 and XXX for testing.
296 Most previous work reporting results on this dataset uses only the digit classes. 307 Most previous work reporting results on this dataset uses only the digit classes.
297 Here we use all the classes, in both the training and testing phases. This is especially 308 Here we use all the classes, in both the training and testing phases. This is especially
298 useful to estimate the effect of a multi-task setting. 309 useful to estimate the effect of a multi-task setting.
299 Note that the distribution of the classes in the NIST training and test sets differs 310 Note that the distribution of the classes in the NIST training and test sets differs
300 substantially, with proportionally more digits in the test set, and a uniform distribution 311 substantially, with proportionally more digits in the test set, and a uniform distribution
303 314
304 \item {\bf Fonts} TODO!!! 315 \item {\bf Fonts} TODO!!!
305 316
306 \item {\bf Captchas} 317 \item {\bf Captchas}
307 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for 318 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
308 generating characters of the same format as the NIST dataset. The core of this data source is composed with a random character 319 generating characters of the same format as the NIST dataset. This software is based on
309 generator and various kinds of tranformations similar to those described in the previous sections. 320 a random character class generator and various kinds of transformations similar to those described in the previous sections.
310 In order to increase the variability of the data generated, different fonts are used for generating the characters. 321 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
311 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity 322 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
312 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are 323 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
313 allowed and can be controlled via an easy-to-use facade class (see the sketch after this list). 324 allowed and can be controlled via an easy-to-use facade class (see the sketch after this list).
314 \item {\bf OCR data} 325 \item {\bf OCR data}
326 A large set (2 million) of scanned, OCRed and manually verified machine-printed
327 characters (from various documents and books) was included as an
328 additional source. This set is part of a larger corpus being collected by the Image Understanding
329 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern
330 ({\tt http://www.iupr.com}), and which will be publicly released.
315 \end{itemize} 331 \end{itemize}
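As a rough illustration of the facade mentioned in the Captchas item above, here is a minimal Python sketch; the class name, the two complexity values, and the {\tt render}/{\tt distort} callables are hypothetical stand-ins rather than the actual \emph{pycaptcha} adaptation.

{\small
\begin{verbatim}
import random, string

class CaptchaSource:
    """Illustrative facade: pick a random character class and font,
    render it in NIST format, then distort it at one of the two
    allowed complexity levels (the values below are illustrative)."""
    COMPLEXITY = {"easy": 0.3, "hard": 0.7}

    def __init__(self, fonts, render, distort, level="easy", seed=0):
        self.fonts, self.render, self.distort = fonts, render, distort
        self.complexity = self.COMPLEXITY[level]
        self.rng = random.Random(seed)

    def next_example(self):
        label = self.rng.choice(string.digits + string.ascii_letters)
        font = self.rng.choice(self.fonts)
        image = self.render(label, font)  # 32x32 grey-level in [0,1]
        return self.distort(image, self.complexity), label
\end{verbatim}
}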
316 332
317 \subsubsection{Data Sets} 333 \subsection{Data Sets}
334 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
335 from one of the 62 character classes.
318 \begin{itemize} 336 \begin{itemize}
319 \item {\bf NIST} This is the raw NIST special database 19. 337 \item {\bf NIST}. This is the raw NIST Special Database 19.
320 \item {\bf P07} 338 \item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources
321 The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. 339 and sending them through the above transformation pipeline.
322 For each new exemple to generate, we choose one source with the following probability: $0.1$ for the fonts, 340 For each new example to generate, a source is selected with probability $10\%$ from the fonts,
323 $0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in their order 341 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
324 and for each of them we sample uniformly a complexity in the range $[0,0.7]$. 342 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$ (see the sketch after this list).
325 \item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) 343 \item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
326 except that we only apply 344 except that we only apply
327 transformations from slant to pinch. Therefore, the character is 345 transformations from slant to pinch. Therefore, the character is
328 transformed but no additional noise is added to the image, giving images 346 transformed but no additional noise is added to the image, giving images
329 closer to the NIST dataset. 347 closer to the NIST dataset.
330 \end{itemize} 348 \end{itemize}
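To make the P07 sampling procedure concrete, here is a minimal Python sketch; the source and transformation callables are hypothetical placeholders for the modules described in the previous sections.

{\small
\begin{verbatim}
import numpy as np

def sample_p07_example(sources, pipeline, rng, max_complexity=0.7):
    """Draw one P07 example.

    sources:  4 callables (fonts, captchas, OCR, NIST), each returning
              a (32x32 image in [0,1], label) pair.
    pipeline: transformation callables in the fixed order given above,
              each taking (image, complexity) and returning an image.
    """
    probs = [0.10, 0.25, 0.25, 0.40]  # fonts, captchas, OCR, NIST
    source = sources[rng.choice(len(sources), p=probs)]
    image, label = source(rng)
    for transform in pipeline:
        # each module draws its own complexity, uniform in [0, 0.7]
        image = transform(image, rng.uniform(0.0, max_complexity))
    return image, label
\end{verbatim}
}

NISTP would use the same sampler with the pipeline restricted to the transformations from slant to pinch.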
331 349
332 \subsection{Models and their Hyperparameters} 350 \subsection{Models and their Hyperparameters}
333 351
352 All hyper-parameters are selected based on performance on the NISTP validation set.
353
334 \subsubsection{Multi-Layer Perceptrons (MLP)} 354 \subsubsection{Multi-Layer Perceptrons (MLP)}
335 355
336 An MLP is a family of functions that are described by stacking layers of of a function similar to 356 Whereas previous work had compared deep architectures to both shallow MLPs and
337 $$g(x) = \tanh(b+Wx)$$ 357 SVMs, we only compared to MLPs here because of the very large datasets used.
338 The input, $x$, is a $d$-dimension vector. 358 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
339 The output, $g(x)$, is a $m$-dimension vector. 359 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
340 The parameter $W$ is a $m\times d$ matrix and is called the weight matrix. 360 The hyper-parameters are the following: the number of hidden units, taken in
341 The parameter $b$ is a $m$-vector and is called the bias vector. 361 $\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
342 The non-linearity (here $\tanh$) is applied element-wise to the output vector. 362 examples are presented in minibatches of size 20. A constant learning
343 Usually the input is referred to a input layer and similarly for the output. 363 rate was chosen in $\{10^{-6},10^{-5},10^{-4},10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
344 You can of course chain several such functions to obtain a more complex one. 364 through preliminary experiments, and 0.1 was selected.
345 Here is a common example 365
346 $$f(x) = c + V\tanh(b+Wx)$$
347 In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer.
348 Here the output layer does not have the same non-linearity as the hidden layer.
349 This is a common case where some specialized non-linearity is applied to the output layer only depending on the task at hand.
350
351 If you put 3 or more hidden layers in such a network you obtain what is called a deep MLP.
352 The parameters to adapt are the weight matrix and the bias vector for each layer.
353 366
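For concreteness, here is a minimal NumPy sketch of this architecture and its minibatch SGD update; the initialization scale is an illustrative assumption, not a detail taken from our implementation.

{\small
\begin{verbatim}
import numpy as np

class OneHiddenLayerMLP:
    """One tanh hidden layer, softmax output over the 62 classes,
    trained by minibatch SGD with a constant learning rate."""
    def __init__(self, n_in=32*32, n_hidden=500, n_out=62, seed=0):
        rng = np.random.RandomState(seed)
        self.W1 = rng.uniform(-0.01, 0.01, (n_in, n_hidden))  # assumed scale
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.01, 0.01, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.H = np.tanh(X.dot(self.W1) + self.b1)    # hidden layer
        A = self.H.dot(self.W2) + self.b2
        E = np.exp(A - A.max(axis=1, keepdims=True))  # stable softmax
        return E / E.sum(axis=1, keepdims=True)       # P(class | image)

    def sgd_step(self, X, y, lr=0.1):
        """One update on a minibatch (we used minibatches of size 20)."""
        dA = self.forward(X)
        dA[np.arange(len(y)), y] -= 1.0               # softmax + NLL gradient
        dA /= len(y)
        dH = dA.dot(self.W2.T) * (1.0 - self.H ** 2)  # backprop through tanh
        self.W2 -= lr * self.H.T.dot(dA); self.b2 -= lr * dA.sum(axis=0)
        self.W1 -= lr * X.T.dot(dH);      self.b1 -= lr * dH.sum(axis=0)
\end{verbatim}
}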
354 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)} 367 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
355 \label{SdA} 368 \label{SdA}
356 369
357 Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization. 370 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
358 This is essentially unsupervised training where the layer is made to reconstruct its input through and encoding and decoding phase. 371 can be used to initialize the weights of each layer of a deep MLP (with many hidden
359 Denoising auto-encoders are a variant where the input is corrupted with random noise but the target is the uncorrupted input. 372 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
360 The principle behind these initialization methods is that the network will learn the inherent relation between portions of the data and be able to represent them thus helping with whatever task we want to perform. 373 enabling better generalization: the unsupervised stage apparently sets the parameters in the
361 374 basin of attraction of a supervised gradient descent solution with better
362 An auto-encoder unit is formed of two MLP layers with the bottom one called the encoding layer and the top one the decoding layer. 375 generalization~\citep{Erhan+al-2010}. It is hypothesized that the
363 Usually the top and bottom weight matrices are the transpose of each other and are fixed this way. 376 advantage brought by this procedure stems from a better prior,
364 The network is trained as such and, when sufficiently trained, the MLP layer is initialized with the parameters of the encoding layer. 377 on the one hand taking advantage of the link between the input
365 The other parameters are discarded. 378 distribution $P(x)$ and the conditional distribution of interest
366 379 $P(y|x)$ (like in semi-supervised learning), and on the other hand
367 The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder starting from the bottom. 380 taking advantage of the expressive power and bias implicit in the
368 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. 381 deep architecture (whereby complex concepts are expressed as
369 The second and following layers receive the same treatment except that they take as input the encoded version of the data that has gone through the layers before it. 382 compositions of simpler ones through a deep hierarchy).
370 For additional details see \citet{vincent:icml08}. 383
384 Here we chose to use the Denoising
385 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
386 these deep hierarchies of features, as it is very simple to train and
387 teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
388 provides immediate and efficient inference, and yielded results
389 comparable to or better than RBMs in a series of experiments
390 \citep{VincentPLarochelleH2008}. During training of a Denoising
391 Auto-Encoder, it is presented with a stochastically corrupted version
392 of the input and trained to reconstruct the uncorrupted input,
393 forcing the hidden units to represent the leading regularities in
394 the data. Once it is trained, its hidden unit activations can
395 be used as inputs for training a second one, etc.
396 After this unsupervised pre-training stage, the parameters
397 are used to initialize a deep MLP, which is fine-tuned by
398 the same standard procedure used to train MLPs (see the previous section).
399
400 The hyper-parameters are the same as for the MLP, with the addition of the
401 amount of corruption noise (we used the masking noise process, whereby a
402 fixed proportion of the input values, randomly selected, are zeroed), and a
403 separate learning rate for the unsupervised pre-training stage (selected
404 from the same set as above). The fraction of inputs corrupted was selected
405 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
406 of hidden layers but it was fixed to 3 based on previous work with
407 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
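A minimal NumPy sketch of one such building block follows, with masking noise and tied weights; the sigmoid units and the cross-entropy-style reconstruction gradient are illustrative assumptions rather than our exact implementation.

{\small
\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """Corrupt the input with masking noise, encode, decode with tied
    weights, and learn to reconstruct the *uncorrupted* input."""
    def __init__(self, n_in, n_hidden, corruption=0.2, seed=0):
        self.rng = np.random.RandomState(seed)
        self.W = self.rng.uniform(-0.01, 0.01, (n_in, n_hidden))
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_in)       # decoder bias
        self.corruption = corruption  # fraction of inputs zeroed

    def encode(self, X):
        return sigmoid(X.dot(self.W) + self.b)

    def sgd_step(self, X, lr=0.1):
        mask = self.rng.binomial(1, 1.0 - self.corruption, X.shape)
        Xc = X * mask                            # masking noise
        H = self.encode(Xc)
        Z = sigmoid(H.dot(self.W.T) + self.c)    # reconstruction
        dZ = (Z - X) / len(X)                    # cross-entropy + sigmoid
        dH = dZ.dot(self.W) * H * (1.0 - H)
        self.W -= lr * (Xc.T.dot(dH) + dZ.T.dot(H))  # tied-weight gradient
        self.b -= lr * dH.sum(axis=0)
        self.c -= lr * dZ.sum(axis=0)
\end{verbatim}
}

Stacking amounts to training a second such layer on the {\tt encode()} outputs of the first, and so on, before the supervised fine-tuning stage.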
371 408
372 \section{Experimental Results} 409 \section{Experimental Results}
373 410
374 \subsection{SDA vs MLP vs Humans} 411 \subsection{SDA vs MLP vs Humans}
375 412
399 on NIST digits classification using the same test set are included.} 436 on NIST digits classification using the same test set are included.}
400 \label{tab:sda-vs-mlp-vs-humans} 437 \label{tab:sda-vs-mlp-vs-humans}
401 \begin{center} 438 \begin{center}
402 \begin{tabular}{|l|r|r|r|r|} \hline 439 \begin{tabular}{|l|r|r|r|r|} \hline
403 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline 440 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
404 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $>1.1\%$ \\ \hline 441 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $1.4\%$ \\ \hline
405 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline 442 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline
406 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline 443 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline
407 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline 444 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline
408 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline 445 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
409 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline 446 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline
410 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline 447 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline
411 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline 448 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline
412 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline 449 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline
413 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline 450 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
414 \citep{Migram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline 451 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline
415 \end{tabular} 452 \end{tabular}
416 \end{center} 453 \end{center}
417 \end{table} 454 \end{table}
418 455
419 \subsection{Perturbed Training Data More Helpful for SDAE} 456 \subsection{Perturbed Training Data More Helpful for SDAE}
487 \end{center} 524 \end{center}
488 \end{table} 525 \end{table}
489 526
490 \section{Conclusions} 527 \section{Conclusions}
491 528
529 The conclusions are positive for all the questions asked in the introduction.
530 \begin{itemize}
531 \item Do the good results previously obtained with deep architectures on the
532 MNIST digits generalize to the setting of a much larger and richer (but similar)
533 dataset, the NIST Special Database 19, with 62 classes and around 800k examples?
534 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
535 performance.
536 \item To what extent does the perturbation of input images (e.g. adding
537 noise, affine transformations, background images) make the resulting
538 classifier better not only on similarly perturbed images but also on
539 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
540 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
541 MLPs were helped by perturbed training examples when tested on perturbed input images,
542 but were only marginally helped on clean examples. On the other hand, the deep SDAs
543 were very significantly boosted by these out-of-distribution examples.
544 \item Similarly, does the feature learning step in deep learning algorithms benefit more
545 from training with similar but different classes (i.e. a multi-task learning scenario) than
546 a corresponding shallow and purely supervised architecture?
547 Whereas the improvement due to the multi-task setting was marginal or
548 negative for the MLP, it was very significant for the SDA.
549 \end{itemize}
550
492 \bibliography{strings,ml,aigaion,specials} 551 \bibliography{strings,ml,aigaion,specials}
493 %\bibliographystyle{plainnat} 552 %\bibliographystyle{plainnat}
494 \bibliographystyle{unsrtnat} 553 \bibliographystyle{unsrtnat}
495 %\bibliographystyle{apalike} 554 %\bibliographystyle{apalike}
496 555