comparison writeup/nips2010_submission.tex @ 472:2dd6e8962df1
conclusion
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Sun, 30 May 2010 10:44:20 -0400 |
parents | d02d288257bf |
children | bcf024e6ab23 |
469:d02d288257bf | 472:2dd6e8962df1 |
---|---|
277 \end{figure} | 277 \end{figure} |
278 | 278 |
279 | 279 |
280 \section{Experimental Setup} | 280 \section{Experimental Setup} |
281 | 281 |
282 \subsection{Training Datasets} | 282 Whereas much previous work on deep learning algorithms had been performed on |
283 | 283 the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
284 \subsubsection{Data Sources} | 284 with 60~000 examples, and variants involving 10~000 |
285 examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want | |
286 to focus here on the case of much larger training sets, from 10 times to | |
287 1000 times larger. The larger datasets are obtained by first sampling from | |
288 a {\em data source} (NIST characters, scanned machine-printed characters, characters | |
289 from fonts, or characters from captchas) and then optionally applying some of the | |
290 above transformations and/or noise processes. | |
291 | |
292 \subsection{Data Sources} | |
285 | 293 |
286 \begin{itemize} | 294 \begin{itemize} |
287 \item {\bf NIST} | 295 \item {\bf NIST} |
288 The NIST Special Database 19 (NIST19) is a very widely used dataset for training and testing OCR systems. | 296 Our main source of characters is the NIST Special Database 19~\cite{Grother-1995}, |
297 widely used for training and testing character | |
298 recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}. | |
289 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications, | 299 The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications, |
290 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | 300 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes |
291 corresponding to "0"-"9","A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity. | 301 corresponding to "0"-"9","A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity. |
292 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one for classification task is recommended | 302 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended |
293 by NIST as testing set and is used in our work for that purpose. It contains 82600 examples, | 303 by NIST as a testing set and is used in our work and in some previous work~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}
294 while the training and validation sets (which have the same distribution) contain XXXXX and | 304 for that purpose. We randomly split the remainder into a training set and a validation set for |
295 XXXXX examples respectively. | 305 model selection. The sizes of these data sets are: XXX for training, XXX for validation, |
306 and XXX for testing. | |
296 Most of the previous work reporting results on that dataset uses only the digits. | 307 Most of the previous work reporting results on that dataset uses only the digits. |
297 Here we use all the classes, in both the training and testing phases. This is especially | 308 Here we use all the classes, in both the training and testing phases. This is especially |
298 useful to estimate the effect of a multi-task setting. | 309 useful to estimate the effect of a multi-task setting. |
299 Note that the distribution of the classes in the NIST training and test sets differs | 310 Note that the distribution of the classes in the NIST training and test sets differs |
300 substantially, with relatively many more digits in the test set, and uniform distribution | 311 substantially, with relatively many more digits in the test set, and uniform distribution |
303 | 314 |
304 \item {\bf Fonts} TODO!!! | 315 \item {\bf Fonts} TODO!!! |
305 | 316 |
306 \item {\bf Captchas} | 317 \item {\bf Captchas} |
307 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for | 318 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
308 generating characters of the same format as the NIST dataset. The core of this data source is composed with a random character | 319 generating characters of the same format as the NIST dataset. This software is based on |
309 generator and various kinds of tranformations similar to those described in the previous sections. | 320 a random character class generator and various kinds of transformations similar to those described in the previous sections.
310 In order to increase the variability of the data generated, different fonts are used for generating the characters. | 321 In order to increase the variability of the data generated, many different fonts are used for generating the characters. |
311 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity | 322 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
312 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are | 323 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are |
313 allowed and can be controlled via an easy-to-use facade class. | 324 allowed and can be controlled via an easy-to-use facade class (a minimal illustration of this kind of generation is sketched after this list). |
314 \item {\bf OCR data} | 325 \item {\bf OCR data} |
326 A large set (2 million) of scanned, OCRed and manually verified machine-printed | |
327 characters (from various documents and books) were included as an | |
328 additional source. This set is part of a larger corpus being collected by the Image Understanding | |
329 and Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern | |
330 ({\tt http://www.iupr.com}), which will be publicly released. | |
315 \end{itemize} | 331 \end{itemize} |
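The following is an editorial sketch (not the actual \emph{pycaptcha} adaptation) of the kind of generation the Captcha source performs: a random character class is rendered in a randomly chosen font and then rotated and slanted by amounts scaled by the complexity parameter. It uses Pillow, and the list of font paths is a hypothetical placeholder.

\begin{verbatim}
import string
import numpy as np
from PIL import Image, ImageDraw, ImageFont

rng = np.random.RandomState(3)
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 classes

def captcha_like_character(font_paths, complexity=0.7):
    # font_paths: hypothetical list of TrueType font files available locally.
    label = rng.randint(len(CLASSES))
    font = ImageFont.truetype(font_paths[rng.randint(len(font_paths))], size=24)
    img = Image.new("L", (32, 32), color=0)
    ImageDraw.Draw(img).text((4, 2), CLASSES[label], fill=255, font=font)
    angle = rng.uniform(-30, 30) * complexity      # rotation scaled by complexity
    shear = rng.uniform(-0.5, 0.5) * complexity    # horizontal slant
    img = img.rotate(angle, resample=Image.BILINEAR)
    img = img.transform((32, 32), Image.AFFINE, (1, shear, 0, 0, 1, 0))
    return np.asarray(img, dtype=np.float32) / 255.0, label
\end{verbatim}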
316 | 332 |
317 \subsubsection{Data Sets} | 333 \subsection{Data Sets} |
334 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label | |
335 from one of the 62 character classes. | |
318 \begin{itemize} | 336 \begin{itemize} |
319 \item {\bf NIST} This is the raw NIST special database 19. | 337 \item {\bf NIST}. This is the raw NIST Special Database 19. |
320 \item {\bf P07} | 338 \item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources |
321 The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. | 339 and sending them through the above transformation pipeline. |
322 For each new exemple to generate, we choose one source with the following probability: $0.1$ for the fonts, | 340 To generate each new example, a source is selected with probability $10\%$ from the fonts,
323 $0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in their order | 341 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the |
324 and for each of them we sample uniformly a complexity in the range $[0,0.7]$. | 342 order given above, and for each of them we sample a complexity uniformly in the range $[0,0.7]$ (see the sampling sketch after this list).
325 \item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) | 343 \item {\bf NISTP} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
326 except that we only apply | 344 except that we only apply |
327 transformations from slant to pinch. Therefore, the character is | 345 transformations from slant to pinch. Therefore, the character is |
328 transformed but no additional noise is added to the image, giving images | 346 transformed but no additional noise is added to the image, giving images
329 closer to the NIST dataset. | 347 closer to the NIST dataset. |
330 \end{itemize} | 348 \end{itemize} |
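As a concrete reading of the P07 sampling procedure described above, here is a minimal Python sketch; the source generators and the transformation/noise modules are hypothetical placeholders standing in for the project's actual code.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sample_p07_example(sources, pipeline, max_complexity=0.7):
    # sources: dict mapping a name to a callable returning (32x32 image, label).
    # pipeline: ordered list of transformation/noise modules, each a callable
    #           f(image, complexity) -> image.  Both are placeholders.
    names = ["nist", "fonts", "captcha", "ocr"]
    probs = [0.40, 0.10, 0.25, 0.25]        # mixing proportions from the text
    image, label = sources[rng.choice(names, p=probs)]()
    for transform in pipeline:
        # every module receives its own complexity, uniform in [0, 0.7]
        image = transform(image, complexity=rng.uniform(0.0, max_complexity))
    return image, label
\end{verbatim}

NISTP is obtained in the same way, but with the pipeline truncated to the geometric transformations (slant to pinch) only.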
331 | 349 |
332 \subsection{Models and their Hyperparameters} | 350 \subsection{Models and their Hyperparameters} |
333 | 351 |
352 All hyper-parameters are selected based on performance on the NISTP validation set. | |
353 | |
334 \subsubsection{Multi-Layer Perceptrons (MLP)} | 354 \subsubsection{Multi-Layer Perceptrons (MLP)} |
335 | 355 |
336 An MLP is a family of functions that are described by stacking layers of of a function similar to | 356 Whereas previous work had compared deep architectures to both shallow MLPs and |
337 $$g(x) = \tanh(b+Wx)$$ | 357 SVMs, we compared only to MLPs here because of the very large datasets used.
338 The input, $x$, is a $d$-dimension vector. | 358 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
339 The output, $g(x)$, is a $m$-dimension vector. | 359 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
340 The parameter $W$ is a $m\times d$ matrix and is called the weight matrix. | 360 The hyper-parameters are the number of hidden units, taken in
341 The parameter $b$ is a $m$-vector and is called the bias vector. | 361 $\{300,500,800,1000,1500\}$, and the learning rate. Training
342 The non-linearity (here $\tanh$) is applied element-wise to the output vector. | 362 examples are presented in minibatches of size 20. A constant learning
343 Usually the input is referred to a input layer and similarly for the output. | 363 rate was chosen in $\{10^{-6},10^{-5},10^{-4},10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
344 You can of course chain several such functions to obtain a more complex one. | 364 through preliminary experiments, and 0.1 was selected.
345 Here is a common example | 365 |
346 $$f(x) = c + V\tanh(b+Wx)$$ | |
347 In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer. | |
348 Here the output layer does not have the same non-linearity as the hidden layer. | |
349 This is a common case where some specialized non-linearity is applied to the output layer only depending on the task at hand. | |
350 | |
351 If you put 3 or more hidden layers in such a network you obtain what is called a deep MLP. | |
352 The parameters to adapt are the weight matrix and the bias vector for each layer. | |
353 | 366 |
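For concreteness, here is a minimal numpy sketch of the supervised model and update just described (one tanh hidden layer, softmax outputs, minibatch SGD with a constant learning rate). The sizes and the initialization scheme are illustrative assumptions, not the exact code used in the experiments.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def init_mlp(n_in=32*32, n_hidden=1000, n_out=62):
    # small uniform weights, zero biases (illustrative initialization)
    W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
    W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out))
    return W1, np.zeros(n_hidden), W2, np.zeros(n_out)

def forward(params, X):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)                      # hidden layer
    Z = H @ W2 + b2
    Z -= Z.max(axis=1, keepdims=True)             # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax: P(class | image)
    return H, P

def sgd_step(params, X, y, lr=0.1):
    # one minibatch update of the negative log-likelihood
    W1, b1, W2, b2 = params
    n = X.shape[0]
    H, P = forward(params, X)
    dZ = P.copy()
    dZ[np.arange(n), y] -= 1.0                    # gradient of NLL wrt output scores
    dZ /= n
    dH = (dZ @ W2.T) * (1.0 - H ** 2)             # backprop through tanh
    return (W1 - lr * (X.T @ dH), b1 - lr * dH.sum(axis=0),
            W2 - lr * (H.T @ dZ), b2 - lr * dZ.sum(axis=0))
\end{verbatim}

Training then simply iterates sgd_step over minibatches of 20 examples with the selected learning rate of 0.1.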
354 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)} | 367 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)} |
355 \label{SdA} | 368 \label{SdA} |
356 | 369 |
357 Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization. | 370 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
358 This is essentially unsupervised training where the layer is made to reconstruct its input through and encoding and decoding phase. | 371 can be used to initialize the weights of each layer of a deep MLP (with many hidden
359 Denoising auto-encoders are a variant where the input is corrupted with random noise but the target is the uncorrupted input. | 372 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
360 The principle behind these initialization methods is that the network will learn the inherent relation between portions of the data and be able to represent them thus helping with whatever task we want to perform. | 373 apparently by setting the parameters in the basin of attraction of a
361 | 374 supervised gradient descent solution that generalizes
362 An auto-encoder unit is formed of two MLP layers with the bottom one called the encoding layer and the top one the decoding layer. | 375 better~\citep{Erhan+al-2010}. It is hypothesized that the
363 Usually the top and bottom weight matrices are the transpose of each other and are fixed this way. | 376 advantage brought by this procedure stems from a better prior, |
364 The network is trained as such and, when sufficiently trained, the MLP layer is initialized with the parameters of the encoding layer. | 377 on the one hand taking advantage of the link between the input |
365 The other parameters are discarded. | 378 distribution $P(x)$ and the conditional distribution of interest |
366 | 379 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
367 The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder starting from the bottom. | 380 taking advantage of the expressive power and bias implicit in the |
368 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. | 381 deep architecture (whereby complex concepts are expressed as |
369 The second and following layers receive the same treatment except that they take as input the encoded version of the data that has gone through the layers before it. | 382 compositions of simpler ones through a deep hierarchy). |
370 For additional details see \citet{vincent:icml08}. | 383 |
384 Here we chose to use the Denoising | |
385 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | |
386 these deep hierarchies of features, as it is very simple to train and | |
387 teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}), | |
388 provides immediate and efficient inference, and yielded results | |
389 comparable to or better than RBMs in a series of experiments | |
390 \citep{VincentPLarochelleH2008}. During training of a Denoising | |
391 Auto-Encoder, it is presented with a stochastically corrupted version | |
392 of the input and trained to reconstruct the uncorrupted input, | |
393 forcing the hidden units to represent the leading regularities in | |
394 the data. Once it is trained, its hidden unit activations can | |
395 be used as inputs for training a second one, etc. | |
396 After this unsupervised pre-training stage, the parameters | |
397 are used to initialize a deep MLP, which is fine-tuned by | |
398 the same standard supervised procedure used for the MLP (see previous section). | |
399 | |
400 The hyper-parameters are the same as for the MLP, with the addition of the | |
401 amount of corruption noise (we used the masking noise process, whereby a | |
402 fixed proportion of the input values, randomly selected, are zeroed), and a | |
403 separate learning rate for the unsupervised pre-training stage (selected | |
404 from the same set as above). The fraction of inputs corrupted was selected | |
405 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | |
406 of hidden layers but it was fixed to 3 based on previous work with | |
407 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | |
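To make the pre-training stage concrete, the following is a minimal numpy sketch of a denoising auto-encoder layer with tied weights, masking noise and cross-entropy reconstruction, stacked greedily. It is an editorial illustration under these assumptions (the layer sizes, learning rate and number of epochs shown are illustrative defaults), not the experimental code; the resulting parameters would then initialize the deep MLP before supervised fine-tuning.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(W, b, c, X, corruption=0.2, lr=0.01):
    # one minibatch update of a denoising auto-encoder with tied weights
    # (decoder uses W.T), masking noise and cross-entropy reconstruction
    n = X.shape[0]
    Xc = X * rng.binomial(1, 1.0 - corruption, X.shape)  # zero a fraction of inputs
    H = sigmoid(Xc @ W + b)                              # encode the corrupted input
    Xhat = sigmoid(H @ W.T + c)                          # reconstruct
    dZdec = (Xhat - X) / n               # gradient wrt decoder pre-activations
    dZenc = (dZdec @ W) * H * (1.0 - H)  # gradient wrt encoder pre-activations
    W -= lr * (Xc.T @ dZenc + dZdec.T @ H)   # tied weights: two contributions
    b -= lr * dZenc.sum(axis=0)
    c -= lr * dZdec.sum(axis=0)
    return W, b, c

def pretrain_stack(X, layer_sizes=(1000, 1000, 1000), corruption=0.2,
                   lr=0.01, n_epochs=10, batch_size=20):
    # greedy layer-wise pre-training: each layer is trained as a DAE on the
    # (clean) hidden representation produced by the layers below it
    params, inputs = [], X
    for n_hidden in layer_sizes:
        d = inputs.shape[1]
        W = rng.uniform(-0.05, 0.05, (d, n_hidden))
        b, c = np.zeros(n_hidden), np.zeros(d)
        for _ in range(n_epochs):
            for i in range(0, inputs.shape[0], batch_size):
                W, b, c = dae_step(W, b, c, inputs[i:i+batch_size], corruption, lr)
        params.append((W, b))
        inputs = sigmoid(inputs @ W + b)   # clean activations feed the next layer
    return params   # used to initialize the hidden layers of the deep MLP
\end{verbatim}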
371 | 408 |
372 \section{Experimental Results} | 409 \section{Experimental Results} |
373 | 410 |
374 \subsection{SDA vs MLP vs Humans} | 411 \subsection{SDA vs MLP vs Humans} |
375 | 412 |
399 on NIST digits classification using the same test set are included.} | 436 on NIST digits classification using the same test set are included.} |
400 \label{tab:sda-vs-mlp-vs-humans} | 437 \label{tab:sda-vs-mlp-vs-humans} |
401 \begin{center} | 438 \begin{center} |
402 \begin{tabular}{|l|r|r|r|r|} \hline | 439 \begin{tabular}{|l|r|r|r|r|} \hline |
403 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline | 440 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline |
404 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $>1.1\%$ \\ \hline | 441 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $1.4\%$ \\ \hline |
405 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline | 442 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline |
406 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline | 443 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline |
407 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline | 444 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline |
408 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline | 445 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline |
409 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline | 446 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline |
410 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline | 447 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline |
411 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline | 448 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline |
412 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline | 449 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline |
413 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline | 450 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline |
414 \citep{Migram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline | 451 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline |
415 \end{tabular} | 452 \end{tabular} |
416 \end{center} | 453 \end{center} |
417 \end{table} | 454 \end{table} |
418 | 455 |
419 \subsection{Perturbed Training Data More Helpful for SDAE} | 456 \subsection{Perturbed Training Data More Helpful for SDAE} |
487 \end{center} | 524 \end{center} |
488 \end{table} | 525 \end{table} |
489 | 526 |
490 \section{Conclusions} | 527 \section{Conclusions} |
491 | 528 |
529 The conclusions are positive for all the questions asked in the introduction. | |
530 \begin{itemize} | |
531 \item Do the good results previously obtained with deep architectures on the | |
532 MNIST digits generalize to the setting of a much larger and richer (but similar) | |
533 dataset, the NIST Special Database 19, with 62 classes and around 800k examples? | |
534 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level | |
535 performance. | |
536 \item To what extent does the perturbation of input images (e.g. adding | |
537 noise, affine transformations, background images) make the resulting | |
538 classifier better not only on similarly perturbed images but also on | |
539 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | |
540 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | |
541 MLPs were helped by perturbed training examples when tested on perturbed input images, | |
542 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs | |
543 were very significantly boosted by these out-of-distribution examples. | |
544 \item Similarly, does the feature learning step in deep learning algorithms benefit more | |
545 from training with similar but different classes (i.e. a multi-task learning scenario) than | |
546 a corresponding shallow and purely supervised architecture? | |
547 Whereas the improvement due to the multi-task setting was marginal or | |
548 negative for the MLP, it was very significant for the SDA. | |
549 \end{itemize} | |
550 | |
492 \bibliography{strings,ml,aigaion,specials} | 551 \bibliography{strings,ml,aigaion,specials} |
493 %\bibliographystyle{plainnat} | 552 %\bibliographystyle{plainnat} |
494 \bibliographystyle{unsrtnat} | 553 \bibliographystyle{unsrtnat} |
495 %\bibliographystyle{apalike} | 554 %\bibliographystyle{apalike} |
496 | 555 |