diff writeup/nips2010_submission.tex @ 502:2b35a6e5ece4
Changes from Myriam
author   | Yoshua Bengio <bengioy@iro.umontreal.ca>
date     | Tue, 01 Jun 2010 13:37:40 -0400
parents  | 5927432d8b8d
children | a0e820f04f8e
--- a/writeup/nips2010_submission.tex	Tue Jun 01 12:28:05 2010 -0400
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 13:37:40 2010 -0400
@@ -327,6 +327,22 @@
 the above transformations and/or noise processes is applied to the
 image.
 
+We compare the best MLP (according to validation set error) that we found against
+the best SDA (again according to validation set error), along with a precise estimate
+of human performance obtained via Amazon's Mechanical Turk (AMT)
+service\footnote{http://mturk.com}.
+AMT users are paid small amounts
+of money to perform tasks for which human intelligence is required.
+Mechanical Turk has been used extensively in natural language
+processing \citep{SnowEtAl2008} and vision
+\citep{SorokinAndForsyth2008,whitehill09}.
+AMT users were presented
+with 10 character images and asked to type 10 corresponding ASCII
+characters. They were forced to make a hard choice among the
+62 or 10 character classes (all classes or digits only).
+Three users classified each image, allowing us
+to estimate inter-human variability.
+
 \vspace*{-1mm}
 \subsection{Data Sources}
 \vspace*{-1mm}
@@ -410,11 +426,15 @@
 \subsection{Models and their Hyperparameters}
 \vspace*{-1mm}
 
+The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
+hidden layer and with Stacked Denoising Auto-Encoders (SDA).
 All hyper-parameters are selected based on performance on the NISTP validation set.
 
 {\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep architectures to both shallow MLPs and
-SVMs, we only compared to MLPs here because of the very large datasets used.
+SVMs, we only compared to MLPs here because of the very large datasets used
+(making the use of SVMs computationally inconvenient because of their quadratic
+scaling behavior).
 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
 exponentials) on the output layer for estimating P(class | image).
 The hyper-parameters are the following: number of hidden units, taken in
@@ -423,7 +443,7 @@
 rate is chosen in $10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
 through preliminary experiments, and 0.1 was selected.
 
-{\bf Stacked Denoising Auto-Encoders (SDAE).}
+{\bf Stacked Denoising Auto-Encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many hidden
 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006}
@@ -439,6 +459,7 @@
 compositions of simpler ones through a deep hierarchy).
 Here we chose to use the Denoising
 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
+% ADD AN IMAGE?
 these deep hierarchies of features, as it is very simple to train and
 teach (see tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides immediate and efficient inference, and yielded results
@@ -468,22 +489,6 @@
 %\subsection{SDA vs MLP vs Humans}
 %\vspace*{-1mm}
 
-We compare the best MLP (according to validation set error) that we found against
-the best SDA (again according to validation set error), along with a precise estimate
-of human performance obtained via Amazon's Mechanical Turk (AMT)
-service\footnote{http://mturk.com}.
-%AMT users are paid small amounts
-%of money to perform tasks for which human intelligence is required.
-%Mechanical Turk has been used extensively in natural language
-%processing \citep{SnowEtAl2008} and vision
-%\citep{SorokinAndForsyth2008,whitehill09}.
-AMT users where presented
-with 10 character images and asked to type 10 corresponding ASCII
-characters. They were forced to make a hard choice among the
-62 or 10 character classes (all classes or digits only).
-Three users classified each image, allowing
-to estimate inter-human variability (shown as +/- in parenthesis below).
-
 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
 comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
 SDA2), along with the previous results on the digits NIST special database
@@ -501,9 +506,13 @@
 Figure~\ref{fig:fig:improvements-charts}, the relative improvement in error
 rate brought by self-taught learning is greater for the SDA, and these
 differences with the MLP are statistically and qualitatively
-significant. The left side of the figure shows the improvement to the clean
+significant.
+The left side of the figure shows the improvement to the clean
 NIST test set error brought by the use of out-of-distribution examples
-(i.e. the perturbed examples examples from NISTP or P07). The right side of
+(i.e. the perturbed examples from NISTP or P07).
+Relative change is measured by taking
+(original model's error / perturbed-data model's error - 1).
+The right side of
 Figure~\ref{fig:fig:improvements-charts} shows the relative improvement
 brought by the use of a multi-task setting, in which the same model is
 trained for more classes than the target classes of interest (i.e. training
@@ -525,13 +534,19 @@
 
 \begin{figure}[h]
 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
-\caption{Left: overall results; error bars indicate a 95\% confidence interval.
-Right: error rates on NIST test digits only, with results from literature. }
+\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
+on NIST, 1 on NISTP, and 2 on P07. Left: overall results
+of all models, on 3 different test sets corresponding to the three
+datasets.
+Right: error rates on NIST test digits only, along with the previous results from
+literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}
+respectively based on ART, nearest neighbors, MLPs, and SVMs.}
+
 \label{fig:error-rates-charts}
 \end{figure}
 
 %\vspace*{-1mm}
-%\subsection{Perturbed Training Data More Helpful for SDAE}
+%\subsection{Perturbed Training Data More Helpful for SDA}
 %\vspace*{-1mm}
 
 %\vspace*{-1mm}
@@ -580,16 +595,19 @@
 \section{Conclusions}
 \vspace*{-1mm}
 
-The conclusions are positive for all the questions asked in the introduction.
+We have found that the self-taught learning framework is more beneficial
+to a deep learner than to a traditional shallow and purely
+supervised learner. More precisely,
+the conclusions are positive for all the questions asked in the introduction.
 %\begin{itemize}
 
 $\bullet$ %\item
 Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
-Yes, the SDA systematically outperformed the MLP and all the previously
+Yes, the SDA {\bf systematically outperformed the MLP and all the previously
 published results on this dataset (as far as we know), in fact reaching human-level
-performance.
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item
 To what extent does the perturbation of input images (e.g. adding
@@ -597,8 +615,11 @@
 classifier better not only on similarly perturbed images but also on
 the {\em original clean examples}? Do deep architectures benefit more from such {\em
 out-of-distribution} examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
-MLPs were helped by perturbed training examples when tested on perturbed input images,
-but only marginally helped with respect to clean examples. On the other hand, the deep SDAs
+MLPs were helped by perturbed training examples when tested on perturbed input
+images (65\% relative improvement on NISTP)
+but only marginally helped (5\% relative improvement on all classes)
+or even hurt (10\% relative loss on digits)
+with respect to clean examples. On the other hand, the deep SDAs
 were very significantly boosted by these out-of-distribution examples.
 
 $\bullet$ %\item
@@ -606,9 +627,23 @@
 training with similar but different classes (i.e. a multi-task learning scenario) than
 a corresponding shallow and purely supervised architecture?
 Whereas the improvement due to the multi-task setting was marginal or
-negative for the MLP, it was very significant for the SDA.
+negative for the MLP (from +5.6\% to -3.6\% relative change),
+it was very significant for the SDA (from +13\% to +27\% relative change).
 %\end{itemize}
 
+Why would deep learners benefit more from the self-taught learning framework?
+The key idea is that the lower layers of the predictor compute a hierarchy
+of features that can be shared across tasks or across variants of the
+input distribution. Intermediate features that can be used in different
+contexts can be estimated in a way that allows the sharing of statistical
+strength. Features extracted through many levels are more likely to
+be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
+increasing the likelihood that they would be useful for a larger array
+of tasks and input conditions.
+Therefore, we hypothesize that both depth and unsupervised
+pre-training play a part in explaining the advantages observed here, and future
+experiments could attempt to tease apart these factors.
+
 A Flash demo of the recognizer (where both the MLP and the SDA can
 be compared) can be executed on-line at {\tt http://deep.host22.com}.
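As a reading aid for the relative-change measure introduced in this changeset, i.e. "(original model's error / perturbed-data model's error - 1)", the computation can be written out as the display below. This is a minimal sketch, not part of the committed LaTeX; the symbols Delta_rel, e_orig and e_pert are assumed names that do not appear in the paper.

    % Sketch only (assumed notation): e_{orig} is the test error of the model trained
    % on the original (clean) data, e_{pert} the test error of the model trained on
    % the perturbed data (NISTP or P07); a positive value means the perturbed-data
    % model is better.
    \begin{equation*}
      \Delta_{\mathrm{rel}} = \frac{e_{\mathrm{orig}}}{e_{\mathrm{pert}}} - 1
    \end{equation*}

For example, with hypothetical error rates e_orig = 0.30 and e_pert = 0.20, the relative change is 0.30/0.20 - 1 = 0.5, i.e. a 50% relative improvement, consistent with the sign convention in the conclusions, where positive values indicate a gain from the perturbed or multi-task training.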