comparison writeup/nips2010_submission.tex @ 549:ef172f4a322a

ça fitte ("it fits")
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 02 Jun 2010 13:56:01 -0400
parents 34cb28249de0
children 662299f265ab
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
of letters in the test set, compared to the training set (in the latter, the letters are distributed
more like the natural distribution of letters in text).
\vspace*{-1mm}

%\item
{\bf Fonts.}
In order to have a good variety of sources we downloaded a large number of free fonts from:
{\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
% TODO: pointless to anonymize, it's not pointing to our work
Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
The chosen {\tt ttf} file is used either as input to the Captcha generator (see next item) or, by producing a corresponding image,
directly as input to our models.
\vspace*{-1mm}

%\item
{\bf Captchas.}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based CAPTCHA generator) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the generated data, many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source.
%Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
\vspace*{-1mm}

%\item
{\bf OCR data.}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) were included as an
\vspace*{-1mm}

All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
%\begin{itemize}
\vspace*{-1mm}

%\item
{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
\{651668 / 80000 / 82587\} \{training / validation / test\} examples.
\vspace*{-1mm}

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the transformation pipeline described in section \ref{s:perturbations}.
For each new example, a data source is selected with probability $10\%$ for the fonts,
$25\%$ for the captchas, $25\%$ for the OCR data and $40\%$ for NIST. We apply all the transformations in the
order given above, and for each of them we sample a \emph{complexity} uniformly in the range $[0,0.7]$.
It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
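The sampling scheme just described can be sketched as follows; the source names and the `sample_p07_settings` helper are illustrative stand-ins, not the paper's generation code:

```python
import random

# Mixture weights over the four raw-character sources, as stated above.
SOURCES = [("fonts", 0.10), ("captcha", 0.25), ("ocr", 0.25), ("nist", 0.40)]

def sample_p07_settings(n_transforms, rng=random):
    """Pick a data source, plus one complexity per transformation."""
    names, weights = zip(*SOURCES)
    source = rng.choices(names, weights=weights, k=1)[0]
    # Each transformation draws its own complexity, uniform in [0, 0.7].
    complexities = [rng.uniform(0.0, 0.7) for _ in range(n_transforms)]
    return source, complexities
```

The transformations themselves would then be applied in the fixed order given above, each at its sampled complexity.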
\vspace*{-1mm}

%\item
{\bf NISTP.} This dataset is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
except that we only apply
transformations from slant to pinch. Therefore, the character is
The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
Training examples are presented in minibatches of size 20. A constant learning
rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ was then selected for optimizing on the whole training sets.
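A minimal sketch of this selection step; the `valid_error` callable, standing in for training an MLP at a given rate and measuring its validation error, is hypothetical:

```python
# Candidate constant learning rates from the text.
LEARNING_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def select_learning_rate(valid_error, rates=LEARNING_RATES):
    """Return the rate with the lowest (hypothetical) validation error."""
    return min(rates, key=valid_error)
```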
\vspace*{-1mm}


{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
compositions of simpler ones through a deep hierarchy).
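The per-layer pre-training computation shown in the figure below can be sketched as a minimal NumPy version; sigmoid units, masking noise, and tied decoder weights are assumptions for this sketch, not specifics taken from this section:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid = 32 * 32, 500        # 32x32 inputs; one hidden-layer size from the text

W = rng.normal(0.0, 0.01, (n_hid, n_in))   # encoder weights (theta)
b = np.zeros(n_hid)
b_prime = np.zeros(n_in)                   # decoder bias; decoder weights tied as W.T

def denoising_step(x, corruption=0.25):
    """Corrupt x into x~, encode to code y, decode to z, score with cross-entropy."""
    x_tilde = x * (rng.random(x.shape) > corruption)   # masking noise
    y = sigmoid(W @ x_tilde + b)                       # y = f_theta(x~)
    z = sigmoid(W.T @ y + b_prime)                     # z = g_theta'(y)
    eps = 1e-12
    # Reconstruction loss is measured against the *uncorrupted* input x.
    loss = -np.sum(x * np.log(z + eps) + (1.0 - x) * np.log(1.0 - z + eps))
    return z, loss
```

In an actual SDA, the gradient of this loss would update the layer's parameters, after which the layer's clean-input codes feed the next layer's pre-training.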

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
\vspace*{-2mm}
\caption{Illustration of the computations and training criterion for the denoising
auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
the layer (i.e. raw input or output of the previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
\vspace*{-1mm}

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
\vspace*{-3mm}
\caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on NIST and NISTP test sets.
Right: error rates on NIST test digits only, along with previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005},
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
\label{fig:error-rates-charts}
\vspace*{-2mm}
\end{figure}


\section{Experimental Results}
\vspace*{-2mm}

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}
The models are either trained on NIST (MLP0 and SDA0),
significant way) but when trained with perturbed data
reaches human performance on both the 62-class task
and the 10-class (digits) task.

\begin{figure}[ht]
\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\vspace*{-3mm}
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
error rate improvements of 27\%, 15\% and 13\% respectively for digits,
lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
\fi


\vspace*{-2mm}
\section{Conclusions and Discussion}
\vspace*{-2mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the answers are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
{\bf Do the good results previously obtained with deep architectures on the
MNIST digits generalize to a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples}?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset} (those we are aware of), {\bf in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
