comparison writeup/nips2010_submission.tex @ 547:316c7bdad5ad

charts
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 02 Jun 2010 13:09:27 -0400
parents 1cdfc17e890f
children 34cb28249de0
It is also only recently that successful algorithms were proposed to
overcome some of these difficulties. All are based on unsupervised
learning, often in a greedy layer-wise ``unsupervised pre-training''
stage~\citep{Bengio-2009}. One of these layer initialization techniques,
applied here, is the Denoising
Auto-Encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}),
which performed similarly to or better than previously proposed Restricted Boltzmann
Machines in terms of unsupervised extraction of a hierarchy of features
useful for classification. The principle is that each layer, starting from
the bottom, is trained to encode its input (the output of the previous
layer) and to reconstruct it from a corrupted version. After this
unsupervised initialization, the stack of DAs can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.

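The layer-wise procedure can be sketched as follows (an illustrative NumPy skeleton of the stacking logic only; the {\tt pretrain\_layer} argument stands for any unsupervised layer trainer such as a DA, and the dummy trainer shown here is a placeholder rather than our actual code):

{\small
\begin{verbatim}
# Sketch of greedy layer-wise pre-training followed by conversion to a
# feedforward net. `pretrain_layer` stands for any unsupervised layer
# trainer (e.g. a denoising auto-encoder); the one used below is a dummy.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def greedy_pretrain(X, layer_sizes, pretrain_layer):
    """Train one layer at a time, each on the codes produced by the previous one."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)   # unsupervised training of this layer
        params.append((W, b))
        H = sigmoid(H @ W + b)               # codes become the next layer's input
    return params

def forward(params, x):
    """The pre-trained stack, viewed as a deep feedforward net (before fine-tuning)."""
    for W, b in params:
        x = sigmoid(x @ W + b)
    return x

# Dummy layer trainer, for illustration only (random weights instead of a real DA).
rng = np.random.RandomState(0)
dummy_trainer = lambda H, n: (rng.uniform(-0.05, 0.05, (H.shape[1], n)), np.zeros(n))
params = greedy_pretrain(rng.rand(100, 32 * 32), [1000, 1000, 1000], dummy_trainer)
\end{verbatim}
}
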
Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
of semi-supervised and multi-task learning: the learner can exploit examples
training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
%\end{enumerate}

Our experimental results provide positive evidence for all of these questions.
To achieve these results, we introduce in the next section a sophisticated system
for stochastically transforming character images. The conclusion discusses
the more general question of why deep learners may benefit so much from
the self-taught learning framework.

\vspace*{-1mm}
\section{Perturbation and Transformation of Character Images}
\label{s:perturbations}
\vspace*{-1mm}

This section describes the different transformations we used to stochastically
transform source images in order to obtain data from a distribution that
covers a domain substantially larger than that of the clean characters from
which we start. Although character transformations have been used before to
improve character recognizers, this effort is large-scale both
in the number of classes and in the complexity of the transformations, and hence
in the complexity of the learning task.
More details can
be found in a companion technical report~\citep{ift6266-tr-anonymous}.
The code for these transformations (mostly Python) is available at
{\tt http://anonymous.url.net}. All the modules in the pipeline share
a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
amount of deformation or noise introduced.
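Concretely, the pipeline can be sketched as follows (the module names and the particular transformations shown are illustrative simplifications rather than the actual modules of our system; each module draws its own random parameters, scaled by the shared $complexity$ value):

{\small
\begin{verbatim}
# Hypothetical sketch of the perturbation pipeline (not the actual modules):
# every module draws its own random parameters, scaled by `complexity` in [0, 1].
import numpy as np

rng = np.random.RandomState(0)

def translate(img, complexity):
    max_shift = int(round(4 * complexity))            # shift by at most 4 pixels
    dx, dy = rng.randint(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def occlude_pixels(img, complexity):
    mask = rng.rand(*img.shape) > 0.2 * complexity    # zero up to ~20% of the pixels
    return img * mask

def add_gaussian_noise(img, complexity):
    noise = rng.normal(0.0, 0.3 * complexity, img.shape)
    return np.clip(img + noise, 0.0, 1.0)

PIPELINE = [translate, occlude_pixels, add_gaussian_noise]

def perturb(img, complexity=0.5):
    """Apply the whole pipeline to a grey-level character image in [0, 1]."""
    for module in PIPELINE:
        img = module(img, complexity)
    return img
\end{verbatim}
}
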
to 1000 times larger.

The first step in constructing the larger datasets (called NISTP and P07) is to sample from
a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
and {\bf OCR data} (scanned machine-printed characters). Once a character
is sampled from one of these sources (chosen randomly), the second step is to
apply the pipeline of transformations and/or noise processes described in Section~\ref{s:perturbations}.

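Schematically, the generation of one perturbed example can be sketched as follows (an illustrative snippet that reuses the {\tt perturb} function sketched above; the per-source samplers and mixture weights shown here are dummies rather than the actual ones):

{\small
\begin{verbatim}
# Hypothetical sketch of the two-step generation of one training example:
# (1) pick a data source at random and sample a labelled character from it,
# (2) send the image through the stochastic perturbation pipeline (`perturb` above).
import numpy as np

rng = np.random.RandomState(1)

def sample_example(sources, source_probs, complexity=0.7):
    """`sources` maps a source name to a callable returning (image, label)."""
    name = rng.choice(list(sources.keys()), p=source_probs)
    image, label = sources[name]()
    return perturb(image, complexity), label

# Dummy samplers standing in for NIST, Fonts, Captchas and OCR data.
dummy = lambda: (rng.rand(32, 32), rng.randint(62))
sources = {'nist': dummy, 'fonts': dummy, 'captcha': dummy, 'ocr': dummy}
image, label = sample_example(sources, source_probs=[0.4, 0.2, 0.2, 0.2])
\end{verbatim}
}
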
To provide a baseline for error rate comparison, we also estimate human performance
on both the 62-class task and the 10-class digit task.
We compare the best MLPs against
the best SDAs (both models' hyper-parameters are selected to minimize the validation set error),
as well as against a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service ({\tt http://mturk.com}).
Training examples are presented in minibatches of size 20. A constant learning
rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ was then selected for optimizing on the whole training sets.

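This selection amounts to a small grid search over the candidate rates, as in the sketch below (the error-evaluation function is supplied by the caller; the one used in the illustration is made up, and only the candidate set comes from the text above):

{\small
\begin{verbatim}
# Sketch of the learning-rate selection on the validation set.
CANDIDATE_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def select_learning_rate(validation_error, candidates=CANDIDATE_RATES):
    """Return the candidate learning rate with the lowest validation error.

    `validation_error(lr)` is expected to train a model with constant learning
    rate `lr` (minibatches of size 20) and return its validation set error.
    """
    return min(candidates, key=validation_error)

# Illustration with a made-up error curve (0.1 comes out best by construction).
best_rate = select_learning_rate(lambda lr: abs(lr - 0.1))
\end{verbatim}
}
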
{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
distribution $P(x)$ and the conditional distribution of interest
$P(y|x)$ (like in semi-supervised learning), and on the other hand
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
\caption{Illustration of the computations and training criterion for the denoising
auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
the layer (i.e. raw input or output of the previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
is compared to the uncorrupted input $x$ through the loss function
$L_H(x,z)$, whose expected value is approximately minimized during training
by tuning $\theta$ and $\theta'$.}
\label{fig:da}
\vspace*{-2mm}
\end{figure}

Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
explain (see Figure~\ref{fig:da}, as well as
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of corrupted inputs was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 based on previous work with
SDAs on MNIST~\citep{VincentPLarochelleH2008}.

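The per-layer computation and training criterion of Figure~\ref{fig:da} can be sketched as follows (a minimal NumPy illustration with made-up layer size and learning rate, untied $\theta$ and $\theta'$, and masking noise as the corruption process; this is not the code used for the experiments):

{\small
\begin{verbatim}
# Minimal sketch of one denoising auto-encoder layer (illustrative sizes only).
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hid = 32 * 32, 1000        # input and code sizes (illustrative)
lr, corruption = 0.01, 0.2         # pre-training learning rate and corruption level

W  = rng.uniform(-0.05, 0.05, (n_in, n_hid));  b  = np.zeros(n_hid)   # encoder (theta)
Wp = rng.uniform(-0.05, 0.05, (n_hid, n_in));  bp = np.zeros(n_in)    # decoder (theta')

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def da_step(x):
    """One SGD step of the denoising auto-encoder on one input x in [0,1]^n_in."""
    global W, b, Wp, bp
    x_tilde = x * (rng.rand(n_in) > corruption)   # zero a random subset of inputs
    y = sigmoid(x_tilde @ W + b)                  # code y = f_theta(x_tilde)
    z = sigmoid(y @ Wp + bp)                      # reconstruction z = g_theta'(y)
    loss = -np.sum(x * np.log(z + 1e-12)          # cross-entropy L_H(x, z)
                   + (1 - x) * np.log(1 - z + 1e-12))
    dz = z - x                                    # gradient w.r.t. decoder pre-activation
    dy = (dz @ Wp.T) * y * (1 - y)                # back-propagated through the decoder
    Wp -= lr * np.outer(y, dz);       bp -= lr * dz
    W  -= lr * np.outer(x_tilde, dy); b  -= lr * dy
    return loss
\end{verbatim}
}

Once a layer has been pre-trained this way, its code $y$ becomes the input of the next layer's DA, and the resulting weights and biases initialize the hidden layers of the deep supervised network that is then fine-tuned by stochastic gradient descent.
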
\vspace*{-1mm}

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
\caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on 3 different test sets (NIST, NISTP, P07).
Right: error rates on NIST test digits only, along with the previous results from the
literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
differences with the MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative percent change is measured as 100\% $\times$ (original model's error /
perturbed-data model's error $-$ 1).
The right side of
Figure~\ref{fig:improvements-charts} shows the relative improvement
brought by the use of a multi-task setting, in which the same model is
trained for more classes than the target classes of interest (i.e. training
with all 62 classes when the target classes are respectively the digits,
lower-case, or upper-case characters). Again, whereas the gain from the
multi-task setting is marginal or negative for the MLP, it is substantial
for the SDA. Note that to simplify these multi-task experiments, only the original
NIST dataset is used. For example, the MLP-digits bar shows the relative
percent improvement in MLP error rate on the NIST digits test set,
computed as 100\% $\times$ (1 $-$ single-task
model's error / multi-task model's error). The single-task model is
trained with only 10 outputs (one per digit), seeing only digit examples,
whereas the multi-task model is trained with 62 outputs, with all 62
character classes as examples. Hence the hidden units are shared across
all tasks. For the multi-task model, the digit error rate is measured by
supervised learner. More precisely,
the answers are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
{\bf Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples}?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.

$\bullet$ %\item
{\bf To what extent do self-taught learning scenarios help deep learners,
and do they help them more than shallow supervised ones}?
We found that distorted training examples not only made the resulting
classifier better on similarly perturbed images but also improved it on
the {\em original clean examples}, and, more importantly (a more novel finding),
that deep architectures benefit more from such {\em out-of-distribution}
examples. MLPs were helped by perturbed training examples when tested on perturbed input
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.
Similarly, whereas the improvement due to the multi-task setting was marginal or
negative for the MLP (from +5.6\% to -3.6\% relative change),
it was very significant for the SDA (from +13\% to +27\% relative change),
which may be explained by the arguments below.
%\end{itemize}

In the original self-taught learning framework~\citep{RainaR2007}, the
out-of-sample examples were used as a source of unsupervised data, and
experiments showed its positive effects in a \emph{limited labeled data}
learning diminishes as the number of labeled examples increases (essentially,
a ``diminishing returns'' scenario occurs). We note instead that, for deep
architectures, our experiments show that such a positive effect is obtained
even in a scenario with a \emph{very large number of labeled examples}.

{\bf Why would deep learners benefit more from the self-taught learning framework}?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. Intermediate features that can be used in different
contexts can be estimated in a way that allows statistical
strength to be shared. Features extracted through many levels are more likely to