ift6266 repository (Mercurial): comparison of writeup/nips2010_submission.tex @ 547:316c7bdad5ad
changeset description: charts
author   | Yoshua Bengio <bengioy@iro.umontreal.ca>
date     | Wed, 02 Jun 2010 13:09:27 -0400
parents  | 1cdfc17e890f
children | 34cb28249de0
546:cf68f5685406 (old) | 547:316c7bdad5ad (new)
71 It is also only recently that successful algorithms were proposed to | 71 It is also only recently that successful algorithms were proposed to |
72 overcome some of these difficulties. All are based on unsupervised | 72 overcome some of these difficulties. All are based on unsupervised |
73 learning, often in a greedy layer-wise ``unsupervised pre-training'' | 73 learning, often in a greedy layer-wise ``unsupervised pre-training'' |
74 stage~\citep{Bengio-2009}. One of these layer initialization techniques, | 74 stage~\citep{Bengio-2009}. One of these layer initialization techniques, |
75 applied here, is the Denoising | 75 applied here, is the Denoising |
76 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which | 76 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), which |
77 which | |
77 performed similarly or better than previously proposed Restricted Boltzmann | 78 performed similarly or better than previously proposed Restricted Boltzmann |
78 Machines in terms of unsupervised extraction of a hierarchy of features | 79 Machines in terms of unsupervised extraction of a hierarchy of features |
79 useful for classification. The principle is that each layer starting from | 80 useful for classification. The principle is that each layer starting from |
80 the bottom is trained to encode its input (the output of the previous | 81 the bottom is trained to encode its input (the output of the previous |
81 layer) and to reconstruct it from a corrupted version. After this | 82 layer) and to reconstruct it from a corrupted version. After this |
82 unsupervised initialization, the stack of denoising auto-encoders can be | 83 unsupervised initialization, the stack of DAs can be |
83 converted into a deep supervised feedforward neural network and fine-tuned by | 84 converted into a deep supervised feedforward neural network and fine-tuned by |
84 stochastic gradient descent. | 85 stochastic gradient descent. |
85 | 86 |
86 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
87 of semi-supervised and multi-task learning: the learner can exploit examples | 88 of semi-supervised and multi-task learning: the learner can exploit examples |
122 training with similar but different classes (i.e. a multi-task learning scenario) than | 123 training with similar but different classes (i.e. a multi-task learning scenario) than |
123 a corresponding shallow and purely supervised architecture? | 124 a corresponding shallow and purely supervised architecture? |
124 %\end{enumerate} | 125 %\end{enumerate} |
125 | 126 |
126 Our experimental results provide positive evidence towards all of these questions. | 127 Our experimental results provide positive evidence towards all of these questions. |
128 To achieve these results, we introduce in the next section a sophisticated system | |
129 for stochastically transforming character images. The conclusion discusses | |
130 the more general question of why deep learners may benefit so much from | |
131 the self-taught learning framework. | |
127 | 132 |
128 \vspace*{-1mm} | 133 \vspace*{-1mm} |
129 \section{Perturbation and Transformation of Character Images} | 134 \section{Perturbation and Transformation of Character Images} |
130 \label{s:perturbations} | 135 \label{s:perturbations} |
131 \vspace*{-1mm} | 136 \vspace*{-1mm} |
132 | 137 |
133 This section describes the different transformations we used to stochastically | 138 This section describes the different transformations we used to stochastically |
134 transform source images in order to obtain data. More details can | 139 transform source images in order to obtain data from a much richer distribution which |
140 covers a domain substantially larger than the clean characters distribution from | |
141 which we start. Although character transformations have been used before to | |
142 improve character recognizers, this effort is on a large scale both | |
143 in number of classes and in the complexity of the transformations, hence | |
144 in the complexity of the learning task. | |
145 More details can | |
135 be found in this technical report~\citep{ift6266-tr-anonymous}. | 146 be found in this technical report~\citep{ift6266-tr-anonymous}. |
136 The code for these transformations (mostly Python) is available at | 147 The code for these transformations (mostly Python) is available at |
137 {\tt http://anonymous.url.net}. All the modules in the pipeline share | 148 {\tt http://anonymous.url.net}. All the modules in the pipeline share |
138 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 149 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the |
139 amount of deformation or noise introduced. | 150 amount of deformation or noise introduced. |
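As an illustration of how such a shared control parameter can drive a whole pipeline, here is a minimal Python sketch. The two example modules, their noise levels and the function names are illustrative placeholders, not the actual transformation modules described in the rest of this section.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(1234)

def add_background_noise(image, complexity):
    # Gaussian background noise whose amplitude grows with complexity.
    noisy = image + complexity * 0.3 * rng.randn(*image.shape)
    return np.clip(noisy, 0.0, 1.0)

def occlude_pixels(image, complexity):
    # Randomly zero a fraction of pixels that grows with complexity.
    mask = rng.binomial(1, 1.0 - 0.5 * complexity, size=image.shape)
    return image * mask

PIPELINE = [add_background_noise, occlude_pixels]

def perturb_image(image, complexity=0.5):
    # 0 <= complexity <= 1 modulates the amount of deformation or noise.
    assert 0.0 <= complexity <= 1.0
    for module in PIPELINE:
        image = module(image, complexity)
    return image
\end{verbatim}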
332 to 1000 times larger. | 343 to 1000 times larger. |
333 | 344 |
334 The first step in constructing the larger datasets (called NISTP and P07) is to sample from | 345 The first step in constructing the larger datasets (called NISTP and P07) is to sample from |
335 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 346 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
336 and {\bf OCR data} (scanned machine printed characters). Once a character | 347 and {\bf OCR data} (scanned machine printed characters). Once a character |
337 is sampled from one of these sources (chosen randomly), the pipeline of | 348 is sampled from one of these sources (chosen randomly), the second step is to |
338 the transformations and/or noise processes described in section \ref{s:perturbations} | 349 apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}. |
339 is applied to the image. | 350 |
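Schematically, this two-step construction can be sketched as below; the source loaders are placeholder stubs returning random grey-level arrays, and the perturb argument stands for the transformation pipeline sketched in the previous section.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(4321)

# Placeholder loaders standing in for the real sources (NIST, Fonts,
# Captchas, OCR data); each returns one grey-level character image.
SOURCES = {
    "NIST":     lambda: rng.rand(32, 32),
    "Fonts":    lambda: rng.rand(32, 32),
    "Captchas": lambda: rng.rand(32, 32),
    "OCR":      lambda: rng.rand(32, 32),
}

def sample_perturbed_example(perturb, complexity):
    # Step 1: sample a character from a randomly chosen data source.
    source = rng.choice(list(SOURCES))
    image = SOURCES[source]()
    # Step 2: apply the transformation/noise pipeline to the image.
    return perturb(image, complexity)

# e.g. sample_perturbed_example(perturb_image, complexity=0.5)
\end{verbatim}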
340 | 351 To provide a baseline for error rate comparison, we also estimate human performance |
352 on both the 62-class task and the 10-class digits task. | |
341 We compare the best MLPs against | 353 We compare the best MLPs against |
342 the best SDAs (both models' hyper-parameters are selected to minimize the validation set error), | 354 the best SDAs (both models' hyper-parameters are selected to minimize the validation set error), |
343 along with a comparison against a precise estimate | 355 along with a comparison against a precise estimate |
344 of human performance obtained via Amazon's Mechanical Turk (AMT) | 356 of human performance obtained via Amazon's Mechanical Turk (AMT) |
345 service (http://mturk.com). | 357 service (http://mturk.com). |
458 Training examples are presented in minibatches of size 20. A constant learning | 470 Training examples are presented in minibatches of size 20. A constant learning |
459 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ | 471 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ |
460 through preliminary experiments (measuring performance on a validation set), | 472 through preliminary experiments (measuring performance on a validation set), |
461 and $0.1$ was then selected for optimizing on the whole training sets. | 473 and $0.1$ was then selected for optimizing on the whole training sets. |
462 | 474 |
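This kind of preliminary search amounts to a small grid search scored on the validation set. A minimal sketch, where train_and_validate is a hypothetical helper that trains an MLP with a given constant learning rate and returns its validation error:

\begin{verbatim}
CANDIDATE_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def select_learning_rate(train_and_validate, rates=CANDIDATE_RATES):
    # Train once per candidate rate and keep the one with the lowest
    # validation error; that rate is then reused on the full training set.
    errors = {lr: train_and_validate(lr) for lr in rates}
    return min(errors, key=errors.get)
\end{verbatim}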
463 \begin{figure}[ht] | |
464 \vspace*{-2mm} | |
465 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
466 \caption{Illustration of the computations and training criterion for the denoising | |
467 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
468 the layer (i.e. raw input or output of previous layer) | |
469 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
470 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
471 is compared to the uncorrupted input $x$ through the loss function | |
472 $L_H(x,z)$, whose expected value is approximately minimized during training | |
473 by tuning $\theta$ and $\theta'$.} | |
474 \label{fig:da} | |
475 \vspace*{-2mm} | |
476 \end{figure} | |
477 | 475 |
478 {\bf Stacked Denoising Auto-Encoders (SDA).} | 476 {\bf Stacked Denoising Auto-Encoders (SDA).} |
479 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 477 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
480 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 478 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
481 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, | 479 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, |
487 distribution $P(x)$ and the conditional distribution of interest | 485 distribution $P(x)$ and the conditional distribution of interest |
488 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 486 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
489 taking advantage of the expressive power and bias implicit in the | 487 taking advantage of the expressive power and bias implicit in the |
490 deep architecture (whereby complex concepts are expressed as | 488 deep architecture (whereby complex concepts are expressed as |
491 compositions of simpler ones through a deep hierarchy). | 489 compositions of simpler ones through a deep hierarchy). |
490 | |
491 \begin{figure}[ht] | |
492 \vspace*{-2mm} | |
493 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
494 \caption{Illustration of the computations and training criterion for the denoising | |
495 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
496 the layer (i.e. raw input or output of previous layer) | |
497 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
498 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
499 is compared to the uncorrupted input $x$ through the loss function | |
500 $L_H(x,z)$, whose expected value is approximately minimized during training | |
501 by tuning $\theta$ and $\theta'$.} | |
502 \label{fig:da} | |
503 \vspace*{-2mm} | |
504 \end{figure} | |
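For concreteness, the computation described in this caption can be written out in a few lines of numpy. Sigmoid encoder and decoder, zeroing corruption and a cross-entropy loss $L_H$ are assumptions consistent with the description in the text and in \citep{VincentPLarochelleH2008}; this sketch is illustrative, not the code used in the experiments.

\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def da_training_step(x, W, b, W_prime, b_prime, corruption=0.2, lr=0.1):
    # One stochastic update on a minibatch x (rows are examples, values
    # in [0, 1]).  Corrupt the input by zeroing a random fraction of it.
    mask = rng.binomial(1, 1.0 - corruption, size=x.shape)
    x_tilde = x * mask
    # Encoder f_theta and decoder g_theta_prime (both sigmoid layers).
    y = sigmoid(x_tilde.dot(W) + b)
    z = sigmoid(y.dot(W_prime) + b_prime)
    # Cross-entropy reconstruction loss L_H(x, z), measured against the
    # *uncorrupted* input x.
    loss = -np.mean(np.sum(x * np.log(z) + (1 - x) * np.log(1 - z), axis=1))
    # Manual backpropagation through the two sigmoid layers.
    d_a2 = (z - x) / x.shape[0]
    grad_W_prime = y.T.dot(d_a2)
    grad_b_prime = d_a2.sum(axis=0)
    d_a1 = d_a2.dot(W_prime.T) * y * (1.0 - y)
    grad_W = x_tilde.T.dot(d_a1)
    grad_b = d_a1.sum(axis=0)
    # Gradient step on theta = (W, b) and theta' = (W_prime, b_prime).
    W -= lr * grad_W
    b -= lr * grad_b
    W_prime -= lr * grad_W_prime
    b_prime -= lr * grad_b_prime
    return loss
\end{verbatim}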
492 | 505 |
493 Here we chose to use the Denoising | 506 Here we chose to use the Denoising |
494 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | 507 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for |
495 these deep hierarchies of features, as it is very simple to train and | 508 these deep hierarchies of features, as it is very simple to train and |
496 explain (see Figure~\ref{fig:da}, as well as | 509 explain (see Figure~\ref{fig:da}, as well as |
512 fixed proportion of the input values, randomly selected, are zeroed), and a | 525 fixed proportion of the input values, randomly selected, are zeroed), and a |
513 separate learning rate for the unsupervised pre-training stage (selected | 526 separate learning rate for the unsupervised pre-training stage (selected |
514 from the same set as above). The fraction of inputs corrupted was selected | 527 from the same set as above). The fraction of inputs corrupted was selected |
515 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 528 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
516 of hidden layers but it was fixed to 3 based on previous work with | 529 of hidden layers but it was fixed to 3 based on previous work with |
517 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | 530 SDAs on MNIST~\citep{VincentPLarochelleH2008}. |
518 | 531 |
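Putting the pieces together, the greedy layer-wise pre-training loop and the hyper-parameter grid just described can be sketched as follows, reusing the sigmoid and da_training_step functions from the sketch accompanying Figure~\ref{fig:da}. The layer widths, epoch count, minibatch size and initialization scheme here are illustrative placeholders rather than the settings used in the experiments.

\begin{verbatim}
import numpy as np

# Hyper-parameter grid explored for the SDA (the number of hidden layers
# is fixed to 3); each combination is scored by validation error
# (selection loop omitted).
GRID = {
    "corruption":  [0.10, 0.20, 0.50],
    "pretrain_lr": [0.001, 0.01, 0.025, 0.075, 0.1, 0.5],
}

def greedy_pretrain(data, layer_sizes=(1000, 1000, 1000),
                    corruption=0.2, lr=0.1, n_epochs=10, batch=20):
    # Pre-train a stack of denoising auto-encoders layer by layer and
    # return the encoder parameters (W, b) of each layer.
    rng = np.random.RandomState(0)
    encoders, inputs = [], data
    for n_hidden in layer_sizes:
        n_in = inputs.shape[1]
        W = rng.uniform(-0.1, 0.1, size=(n_in, n_hidden))
        b = np.zeros(n_hidden)
        W_prime = rng.uniform(-0.1, 0.1, size=(n_hidden, n_in))
        b_prime = np.zeros(n_in)
        for epoch in range(n_epochs):
            for start in range(0, len(inputs), batch):
                da_training_step(inputs[start:start + batch],
                                 W, b, W_prime, b_prime, corruption, lr)
        encoders.append((W, b))
        # The codes of this layer are the training input of the next one.
        inputs = sigmoid(inputs.dot(W) + b)
    return encoders
\end{verbatim}

After this loop, the encoders would be stacked under a supervised output layer and the whole network fine-tuned by stochastic gradient descent, as described in the text.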
519 \vspace*{-1mm} | 532 \vspace*{-1mm} |
520 | 533 |
521 \begin{figure}[ht] | 534 \begin{figure}[ht] |
522 \vspace*{-2mm} | 535 \vspace*{-2mm} |
523 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} | 536 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} |
524 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained | 537 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained |
525 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | 538 on NIST, 1 on NISTP, and 2 on P07. Left: overall results |
526 of all models, on 3 different test sets (NIST, NISTP, P07). | 539 of all models, on 3 different test sets (NIST, NISTP, P07). |
527 Right: error rates on NIST test digits only, along with the previous results from the | 540 Right: error rates on NIST test digits only, along with the previous results from the |
528 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 541 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
529 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | 542 respectively based on ART, nearest neighbors, MLPs, and SVMs.} |
578 differences with the MLP are statistically and qualitatively | 591 differences with the MLP are statistically and qualitatively |
579 significant. | 592 significant. |
580 The left side of the figure shows the improvement to the clean | 593 The left side of the figure shows the improvement to the clean |
581 NIST test set error brought by the use of out-of-distribution examples | 594 NIST test set error brought by the use of out-of-distribution examples |
582 (i.e. the perturbed examples from NISTP or P07). | 595 (i.e. the perturbed examples from NISTP or P07). |
583 Relative change is measured by taking | 596 Relative percent change is measured by taking |
584 (original model's error / perturbed-data model's error - 1). | 597 100\% $\times$ (original model's error / perturbed-data model's error - 1). |
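In symbols, writing $e_{\rm clean}$ for the error of the model trained on unperturbed NIST and $e_{\rm perturbed}$ for the error of the model trained on NISTP or P07 (notation introduced here only for readability), this is
\[
  \mbox{relative change} = 100\% \times \left( \frac{e_{\rm clean}}{e_{\rm perturbed}} - 1 \right).
\]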
585 The right side of | 598 The right side of |
586 Figure~\ref{fig:improvements-charts} shows the relative improvement | 599 Figure~\ref{fig:improvements-charts} shows the relative improvement |
587 brought by the use of a multi-task setting, in which the same model is | 600 brought by the use of a multi-task setting, in which the same model is |
588 trained for more classes than the target classes of interest (i.e. training | 601 trained for more classes than the target classes of interest (i.e. training |
589 with all 62 classes when the target classes are respectively the digits, | 602 with all 62 classes when the target classes are respectively the digits, |
590 lower-case, or upper-case characters). Again, whereas the gain from the | 603 lower-case, or upper-case characters). Again, whereas the gain from the |
591 multi-task setting is marginal or negative for the MLP, it is substantial | 604 multi-task setting is marginal or negative for the MLP, it is substantial |
592 for the SDA. Note that for these multi-task experiment, only the original | 605 for the SDA. Note that to simplify these multi-task experiments, only the original |
593 NIST dataset is used. For example, the MLP-digits bar shows the relative | 606 NIST dataset is used. For example, the MLP-digits bar shows the relative |
594 improvement in MLP error rate on the NIST digits test set (1 - single-task | 607 percent improvement in MLP error rate on the NIST digits test set, which |
608 is 100\% $\times$ (1 - single-task | |
595 model's error / multi-task model's error). The single-task model is | 609 model's error / multi-task model's error). The single-task model is |
596 trained with only 10 outputs (one per digit), seeing only digit examples, | 610 trained with only 10 outputs (one per digit), seeing only digit examples, |
597 whereas the multi-task model is trained with 62 outputs, with all 62 | 611 whereas the multi-task model is trained with 62 outputs, with all 62 |
598 character classes as examples. Hence the hidden units are shared across | 612 character classes as examples. Hence the hidden units are shared across |
599 all tasks. For the multi-task model, the digit error rate is measured by | 613 all tasks. For the multi-task model, the digit error rate is measured by |
645 supervised learner. More precisely, | 659 supervised learner. More precisely, |
646 the answers are positive for all the questions asked in the introduction. | 660 the answers are positive for all the questions asked in the introduction. |
647 %\begin{itemize} | 661 %\begin{itemize} |
648 | 662 |
649 $\bullet$ %\item | 663 $\bullet$ %\item |
650 Do the good results previously obtained with deep architectures on the | 664 {\bf Do the good results previously obtained with deep architectures on the |
651 MNIST digits generalize to the setting of a much larger and richer (but similar) | 665 MNIST digits generalize to the setting of a much larger and richer (but similar) |
652 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 666 dataset, the NIST special database 19, with 62 classes and around 800k examples}? |
653 Yes, the SDA {\bf systematically outperformed the MLP and all the previously | 667 Yes, the SDA {\bf systematically outperformed the MLP and all the previously |
654 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level | 668 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level |
655 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. | 669 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
656 | 670 |
657 $\bullet$ %\item | 671 $\bullet$ %\item |
658 To what extent do self-taught learning scenarios help deep learners, | 672 {\bf To what extent do self-taught learning scenarios help deep learners, |
659 and do they help them more than shallow supervised ones? | 673 and do they help them more than shallow supervised ones}? |
660 We found that distorted training examples not only made the resulting | 674 We found that distorted training examples not only made the resulting |
661 classifier better on similarly perturbed images but also on | 675 classifier better on similarly perturbed images but also on |
662 the {\em original clean examples}, and, more importantly and more novel, | 676 the {\em original clean examples}, and, more importantly and more novel, |
663 that deep architectures benefit more from such {\em out-of-distribution} | 677 that deep architectures benefit more from such {\em out-of-distribution} |
664 examples. MLPs were helped by perturbed training examples when tested on perturbed input | 678 examples. MLPs were helped by perturbed training examples when tested on perturbed input |
667 or even hurt (10\% relative loss on digits) | 681 or even hurt (10\% relative loss on digits) |
668 with respect to clean examples. On the other hand, the deep SDAs | 682 with respect to clean examples. On the other hand, the deep SDAs |
669 were very significantly boosted by these out-of-distribution examples. | 683 were very significantly boosted by these out-of-distribution examples. |
670 Similarly, whereas the improvement due to the multi-task setting was marginal or | 684 Similarly, whereas the improvement due to the multi-task setting was marginal or |
671 negative for the MLP (from +5.6\% to -3.6\% relative change), | 685 negative for the MLP (from +5.6\% to -3.6\% relative change), |
672 it was very significant for the SDA (from +13\% to +27\% relative change). | 686 it was very significant for the SDA (from +13\% to +27\% relative change), |
687 which may be explained by the arguments below. | |
673 %\end{itemize} | 688 %\end{itemize} |
674 | 689 |
675 In the original self-taught learning framework~\citep{RainaR2007}, the | 690 In the original self-taught learning framework~\citep{RainaR2007}, the |
676 out-of-sample examples were used as a source of unsupervised data, and | 691 out-of-sample examples were used as a source of unsupervised data, and |
677 experiments showed its positive effects in a \emph{limited labeled data} | 692 experiments showed its positive effects in a \emph{limited labeled data} |
680 learning diminishes as the number of labeled examples increases (essentially, | 695 learning diminishes as the number of labeled examples increases (essentially, |
681 a ``diminishing returns'' scenario occurs). We note instead that, for deep | 696 a ``diminishing returns'' scenario occurs). We note instead that, for deep |
682 architectures, our experiments show that such a positive effect is accomplished | 697 architectures, our experiments show that such a positive effect is accomplished |
683 even in a scenario with a \emph{very large number of labeled examples}. | 698 even in a scenario with a \emph{very large number of labeled examples}. |
684 | 699 |
685 Why would deep learners benefit more from the self-taught learning framework? | 700 {\bf Why would deep learners benefit more from the self-taught learning framework}? |
686 The key idea is that the lower layers of the predictor compute a hierarchy | 701 The key idea is that the lower layers of the predictor compute a hierarchy |
687 of features that can be shared across tasks or across variants of the | 702 of features that can be shared across tasks or across variants of the |
688 input distribution. Intermediate features that can be used in different | 703 input distribution. Intermediate features that can be used in different |
689 contexts can be estimated in a way that allows one to share statistical | 704 contexts can be estimated in a way that allows one to share statistical |
690 strength. Features extracted through many levels are more likely to | 705 strength. Features extracted through many levels are more likely to |