comparison of writeup/nips2010_submission.tex @ 502:2b35a6e5ece4
description: "changements de Myriam" (changes from Myriam)
author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Tue, 01 Jun 2010 13:37:40 -0400
parents:  5927432d8b8d
children: a0e820f04f8e
and {\bf OCR data} (scanned machine printed characters). Once a character
is sampled from one of these sources (chosen randomly), a pipeline of
the above transformations and/or noise processes is applied to the
image.

We compare the best MLP (according to validation set error) that we found against
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}.
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language
processing \citep{SnowEtAl2008} and vision
\citep{SorokinAndForsyth2008,whitehill09}.
AMT users were presented
with 10 character images and asked to type the 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing
us to estimate inter-human variability.

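As a concrete illustration, inter-human variability can be estimated from
such triple-labeled data as the spread of the three labelers' error rates.
The sketch below (Python with numpy) is purely illustrative: the data
layout and variable names are hypothetical, not the actual AMT export
format or the script used for the paper.
{\small
\begin{verbatim}
# Hypothetical sketch: human error rate and inter-labeler spread,
# assuming answers[i] holds the three labels typed for image i.
import numpy as np

def human_error_stats(answers, truth):
    answers = np.asarray(answers)        # shape (n_images, 3)
    truth = np.asarray(truth)[:, None]   # shape (n_images, 1)
    errors = (answers != truth)          # per-image, per-labeler mistakes
    per_labeler = errors.mean(axis=0)    # error rate of each labeler
    return per_labeler.mean(), per_labeler.std(ddof=1)

# toy data: 4 images, 3 labelers each
answers = [('a','a','o'), ('3','3','3'), ('B','8','B'), ('z','z','z')]
truth   = ['a', '3', 'B', 'z']
mean, spread = human_error_stats(answers, truth)
print("human error: %.1f%% +/- %.1f%%" % (100*mean, 100*spread))
\end{verbatim}
}
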
\vspace*{-1mm}
\subsection{Data Sources}
\vspace*{-1mm}

%\begin{itemize}
[...]

\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}

The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
hidden layer and with Stacked Denoising Auto-Encoders (SDA).
All hyper-parameters are selected based on performance on the NISTP validation set.

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used
(making the use of SVMs computationally impractical, given their quadratic
or worse scaling with the number of training examples).
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: the number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

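The following is a minimal sketch (Python with numpy) of the training
procedure just described: one $\tanh$ hidden layer, a softmax output
layer, and minibatch SGD with a constant learning rate. It is not the
code used for the experiments, and the $32\times32$ input size is an
assumption made for illustration.
{\small
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32*32, 500, 62  # assumed input size; 62 classes
lr, batch_size = 0.1, 20                    # selected hyper-parameters

W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

def forward(X):
    H = np.tanh(X.dot(W1) + b1)                  # hidden layer
    A = H.dot(W2) + b2
    A -= A.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)  # softmax P(class|image)
    return H, P

def sgd_step(X, y):
    global W1, b1, W2, b2
    H, P = forward(X)
    G = P.copy(); G[np.arange(len(y)), y] -= 1.0  # gradient of NLL wrt logits
    G /= len(y)
    GH = G.dot(W2.T) * (1 - H**2)                 # backprop through tanh
    W2 -= lr * H.T.dot(G); b2 -= lr * G.sum(axis=0)
    W1 -= lr * X.T.dot(GH); b1 -= lr * GH.sum(axis=0)

# one pass over a toy random dataset, in minibatches of 20
X = rng.rand(200, n_in); y = rng.randint(0, n_classes, 200)
for i in range(0, len(X), batch_size):
    sgd_step(X[i:i+batch_size], y[i:i+batch_size])
\end{verbatim}
}
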
{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization, apparently setting parameters in the
basin of attraction of supervised gradient descent yielding better
[...]
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).
Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
% ADD AN IMAGE?
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising
[...]
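To make the building block concrete, here is a hedged numpy sketch of one
training step of a denoising auto-encoder with masking noise, tied weights,
and a cross-entropy reconstruction loss, in the style
of~\citet{VincentPLarochelleH2008}; the corruption level and other settings
are placeholders, not the values used in our experiments.
{\small
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
n_vis, n_hid, lr, corruption = 32*32, 500, 0.1, 0.25  # placeholder settings

W = rng.uniform(-0.05, 0.05, (n_vis, n_hid))
b_hid, b_vis = np.zeros(n_hid), np.zeros(n_vis)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dae_step(X):
    """One SGD step on a minibatch X with entries in [0, 1]."""
    global W, b_hid, b_vis
    mask = rng.binomial(1, 1.0 - corruption, X.shape)  # drop ~25% of pixels
    Xc = X * mask                                      # corrupted input
    H = sigmoid(Xc.dot(W) + b_hid)                     # encode
    R = sigmoid(H.dot(W.T) + b_vis)                    # decode (tied weights)
    Gr = (R - X) / len(X)     # grad of cross-entropy wrt pre-sigmoid output
    Gh = Gr.dot(W) * H * (1 - H)                       # backprop to encoder
    W -= lr * (Xc.T.dot(Gh) + Gr.T.dot(H))             # both uses of tied W
    b_hid -= lr * Gh.sum(axis=0); b_vis -= lr * Gr.sum(axis=0)

dae_step(rng.rand(20, n_vis))  # toy minibatch of 20 images
\end{verbatim}
}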

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}

Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
SDA2), along with the previous results on the digits NIST special database
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor search
[...]
significant way) but reaches human performance on both the 62-class task
and the 10-class (digits) task. In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by self-taught learning is greater for the SDA, and these
differences with the MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative change is measured as
$\frac{e_{\mathrm{orig}}}{e_{\mathrm{pert}}} - 1$, where $e_{\mathrm{orig}}$
is the error of the model trained on the original data and
$e_{\mathrm{pert}}$ that of the model trained with perturbed data.
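For instance, with purely illustrative numbers, an original-data model at
30\% error and a perturbed-data model at 24\% error would give a relative
improvement of
\[
\frac{0.30}{0.24} - 1 = 0.25 = 25\%.
\]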
The right side of
Figure~\ref{fig:improvements-charts} shows the relative improvement
brought by the use of a multi-task setting, in which the same model is
trained for more classes than the target classes of interest (i.e. training
with all 62 classes when the target classes are respectively the digits,
lower-case, or upper-case characters). Again, whereas the gain from the
[...]
setting is similar for the other two target classes (lower-case characters
and upper-case characters).

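One plausible way to score such a 62-way model on a restricted target task
is to limit the softmax prediction to the target classes. The snippet below
is a hypothetical illustration of that evaluation in numpy; the class
layout and the restriction of the argmax are assumptions of this sketch,
not details taken from our experimental setup.
{\small
\begin{verbatim}
import numpy as np

def subset_error(P, y, target_classes):
    """P: (n, 62) class probabilities; y: labels within the subset."""
    target = np.asarray(target_classes)
    pred = target[P[:, target].argmax(axis=1)]  # argmax over subset only
    return (pred != np.asarray(y)).mean()

# e.g. if digits occupy classes 0-9 of the output layer (assumed layout)
rng = np.random.RandomState(0)
P = rng.dirichlet(np.ones(62), size=100)  # dummy predicted probabilities
y = rng.randint(0, 10, size=100)
print("digits-only test error: %.2f" % subset_error(P, y, range(10)))
\end{verbatim}
}
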
\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Left: overall results of all models, on 3 different test sets
corresponding to the three datasets; 0 indicates training on NIST, 1 on
NISTP, and 2 on P07, and error bars indicate a 95\% confidence interval.
Right: error rates on NIST test digits only, along with the previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005},
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
\label{fig:error-rates-charts}
\end{figure}

%\vspace*{-1mm}
%\subsection{Perturbed Training Data More Helpful for SDA}
%\vspace*{-1mm}

%\vspace*{-1mm}
%\subsection{Multi-Task Learning Effects}
%\vspace*{-1mm}
[...]

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the conclusions are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset (as far as we know), in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP),
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP (from +5.6\% to -3.6\% relative change),
it was very significant for the SDA (from +13\% to +27\% relative change).
%\end{itemize}

Why would deep learners benefit more from the self-taught learning framework?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. Intermediate features that can be used in different
contexts can be estimated in a way that allows statistical strength to be
shared. Features extracted through many levels are more likely to
be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
Therefore, we hypothesize that both depth and unsupervised
pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease these factors apart.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

\newpage