# HG changeset patch
# User Dumitru Erhan
# Date 1275414908 25200
# Node ID a41a8925be70e626eba5c5edee982baf6e54ba7a
# Parent  e837ef6eef8c48f05a0235ef3987d315d7add6bd
# Parent  a0e820f04f8e7736d4d7240540700cfe5391f1e2
merge

diff -r e837ef6eef8c -r a41a8925be70 writeup/ift6266_ml.bib
--- a/writeup/ift6266_ml.bib	Tue Jun 01 10:53:07 2010 -0700
+++ b/writeup/ift6266_ml.bib	Tue Jun 01 10:55:08 2010 -0700
@@ -267,6 +267,14 @@
 mixture that has a dominant tail",
 }
 
+@techreport{ift6266-tr-anonymous,
+  author = "Anonymous authors",
+  title = "Generating and Exploiting Perturbed and Multi-Task Handwritten
+Training Data for Deep Architectures",
+  institution = "University X.",
+  year = 2010,
+}
+
 @TechReport{Abdallah+Plumbley-06,
   author = "Samer Abdallah and Mark Plumbley",
   title = "Geometry Dependency Analysis",

diff -r e837ef6eef8c -r a41a8925be70 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Tue Jun 01 10:53:07 2010 -0700
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 10:55:08 2010 -0700
@@ -201,7 +201,7 @@
 {\bf Pinch.}
 This GIMP filter is named "Whirl and pinch", but whirl was set to 0. A pinch is
 ``similar to projecting the image onto an elastic
-surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
+surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
 For a square input image, think of drawing a circle of
 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging
 to that disk (region inside circle) will have its value recalculated by taking
@@ -329,6 +329,23 @@
 the above transformations and/or noise processes is applied to the image.
 
+We compare the best MLP (according to validation set error) that we found against
+the best SDA (again according to validation set error), along with a precise estimate
+of human performance obtained via Amazon's Mechanical Turk (AMT)
+service\footnote{http://mturk.com}.
+AMT users are paid small amounts
+of money to perform tasks for which human intelligence is required.
+Mechanical Turk has been used extensively in natural language processing and vision.
+%processing \citep{SnowEtAl2008} and vision
+%\citep{SorokinAndForsyth2008,whitehill09}.
+AMT users were presented
+with 10 character images and asked to type the 10 corresponding ASCII
+characters. They were forced to make a hard choice among the
+62 or 10 character classes (all classes or digits only).
+Three users classified each image, allowing us
+to estimate inter-human variability.
+
 \vspace*{-1mm}
 \subsection{Data Sources}
 \vspace*{-1mm}
@@ -412,11 +429,15 @@
 \subsection{Models and their Hyperparameters}
 \vspace*{-1mm}
 
+The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
+hidden layer and with Stacked Denoising Auto-Encoders (SDA).
 All hyper-parameters are selected based on performance on the NISTP validation set.
 
 {\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep architectures to both shallow MLPs and
-SVMs, we only compared to MLPs here because of the very large datasets used.
+SVMs, we only compared to MLPs here because of the very large datasets used
+(making the use of SVMs computationally inconvenient because of their quadratic
+scaling with the number of training examples).
 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax
 (normalized exponentials) on the output layer for estimating P(class | image).
 The hyper-parameters are the following: number of hidden units, taken in
@@ -425,7 +446,7 @@
 rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$ through
 preliminary experiments, and 0.1 was selected.
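To make the MLP baseline above concrete, here is a minimal numpy sketch (not the
authors' code) of a single-hidden-layer network with a tanh hidden layer and a
softmax output estimating P(class | image), trained by stochastic gradient
descent with the 0.1 learning rate quoted above. The input size (assumed 32x32
images), the hidden-layer width, and the initialization scheme are illustrative
assumptions; the paper selects the number of hidden units and the learning rate
on the NISTP validation set.

import numpy as np

rng = np.random.RandomState(0)

def init_mlp(n_in=32 * 32, n_hidden=500, n_classes=62):
    """Randomly initialize a single-hidden-layer MLP.
    n_in and n_hidden are assumed values; the paper tunes the number of
    hidden units on the NISTP validation set."""
    W1 = rng.uniform(-0.01, 0.01, size=(n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.01, 0.01, size=(n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    return [W1, b1, W2, b2]

def forward(params, X):
    """tanh hidden layer followed by a softmax output layer, returning the
    hidden codes and estimates of P(class | image) for each row of X."""
    W1, b1, W2, b2 = params
    H = np.tanh(X.dot(W1) + b1)                    # hidden representation
    logits = H.dot(W2) + b2
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(logits)
    return H, e / e.sum(axis=1, keepdims=True)

def sgd_step(params, X, y, lr=0.1):
    """One stochastic gradient step on the average negative log-likelihood;
    lr=0.1 matches the learning rate selected in the text."""
    W1, b1, W2, b2 = params
    H, P = forward(params, X)
    n = X.shape[0]
    dlogits = P.copy()
    dlogits[np.arange(n), y] -= 1.0                # dNLL/dlogits for softmax
    dlogits /= n
    dH = dlogits.dot(W2.T) * (1.0 - H ** 2)        # tanh'(a) = 1 - tanh(a)^2
    W2 -= lr * H.T.dot(dlogits)
    b2 -= lr * dlogits.sum(axis=0)
    W1 -= lr * X.T.dot(dH)
    b1 -= lr * dH.sum(axis=0)
    return params

# Example on random data (shape-checking only; real inputs are NIST images):
#   params = init_mlp()
#   X = rng.rand(16, 32 * 32); y = rng.randint(0, 62, size=16)
#   params = sgd_step(params, X, y)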
 
-{\bf Stacked Denoising Auto-Encoders (SDAE).}
+{\bf Stacked Denoising Auto-Encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many
 hidden layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006}
@@ -441,6 +462,7 @@
 compositions of simpler ones through a deep hierarchy). Here we chose to use
 the Denoising Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
+% ADD AN IMAGE?
 these deep hierarchies of features, as it is very simple to train and teach
 (see tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides immediate and efficient inference, and yielded results
@@ -470,22 +492,6 @@
 %\subsection{SDA vs MLP vs Humans}
 %\vspace*{-1mm}
 
-We compare the best MLP (according to validation set error) that we found against
-the best SDA (again according to validation set error), along with a precise estimate
-of human performance obtained via Amazon's Mechanical Turk (AMT)
-service\footnote{http://mturk.com}.
-%AMT users are paid small amounts
-%of money to perform tasks for which human intelligence is required.
-%Mechanical Turk has been used extensively in natural language
-%processing \citep{SnowEtAl2008} and vision
-%\citep{SorokinAndForsyth2008,whitehill09}.
-AMT users where presented
-with 10 character images and asked to type 10 corresponding ASCII
-characters. They were forced to make a hard choice among the
-62 or 10 character classes (all classes or digits only).
-Three users classified each image, allowing
-to estimate inter-human variability (shown as +/- in parenthesis below).
-
 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, comparing
 Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1, SDA2), along
 with the previous results on the digits NIST special database
@@ -503,9 +509,13 @@
 Figure~\ref{fig:fig:improvements-charts}, the relative improvement
 in error rate brought by self-taught learning is greater for the SDA, and these
 differences with the MLP are statistically and qualitatively
-significant. The left side of the figure shows the improvement to the clean
+significant.
+The left side of the figure shows the improvement to the clean
 NIST test set error brought by the use of out-of-distribution examples
-(i.e. the perturbed examples examples from NISTP or P07). The right side of
+(i.e. the perturbed examples from NISTP or P07).
+Relative change is measured by taking
+(original model's error / perturbed-data model's error - 1).
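Written out, the relative-change measure described in the added lines above is
(with e_orig and e_pert as our own shorthand, not the paper's notation):

% e_orig = test error of the model trained on unperturbed NIST data
% e_pert = test error of the same architecture trained on NISTP or P07
\[
  \mbox{relative improvement} \;=\; \frac{e_{\mathrm{orig}}}{e_{\mathrm{pert}}} - 1 ,
\]

so a positive value means the model trained with the out-of-distribution
(perturbed) examples reaches the lower error; presumably the analogous ratio
is used for the multi-task comparison on the right side of the figure.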
+The right side of
 Figure~\ref{fig:fig:improvements-charts} shows the relative improvement
 brought by the use of a multi-task setting, in which the same model is
 trained for more classes than the target classes of interest (i.e. training
@@ -527,12 +537,19 @@
 \begin{figure}[h]
 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
-\caption{Charts corresponding to table 1 of Appendix I. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from literature. }
+\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
+on NIST, 1 on NISTP, and 2 on P07. Left: overall results
+of all models, on 3 different test sets corresponding to the three
+datasets.
+Right: error rates on NIST test digits only, along with the previous results from
+the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005},
+respectively based on ART, nearest neighbors, MLPs, and SVMs.}
+
 \label{fig:error-rates-charts}
 \end{figure}
 
 %\vspace*{-1mm}
-%\subsection{Perturbed Training Data More Helpful for SDAE}
+%\subsection{Perturbed Training Data More Helpful for SDA}
 %\vspace*{-1mm}
 
 %\vspace*{-1mm}
@@ -575,16 +592,19 @@
 \section{Conclusions}
 \vspace*{-1mm}
 
-The conclusions are positive for all the questions asked in the introduction.
+We have found that the self-taught learning framework is more beneficial
+to a deep learner than to a traditional shallow and purely
+supervised learner. More precisely,
+the conclusions are positive for all the questions asked in the introduction.
 %\begin{itemize}
 
 $\bullet$ %\item
 Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
-Yes, the SDA systematically outperformed the MLP and all the previously
+Yes, the SDA {\bf systematically outperformed the MLP and all the previously
 published results on this dataset (as far as we know), in fact reaching human-level
-performance.
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item
 To what extent does the perturbation of input images (e.g. adding
@@ -592,8 +612,11 @@
 classifier better not only on similarly perturbed images but also on
 the {\em original clean examples}? Do deep architectures benefit more from such
 {\em out-of-distribution} examples, i.e. do they benefit more from the
 self-taught learning~\citep{RainaR2007} framework?
-MLPs were helped by perturbed training examples when tested on perturbed input images,
-but only marginally helped with respect to clean examples. On the other hand, the deep SDAs
+MLPs were helped by perturbed training examples when tested on perturbed input
+images (65\% relative improvement on NISTP)
+but only marginally helped (5\% relative improvement on all classes)
+or even hurt (10\% relative loss on digits)
+with respect to clean examples. On the other hand, the deep SDAs
 were very significantly boosted by these out-of-distribution examples.
 
 $\bullet$ %\item
@@ -601,9 +624,23 @@
 training with similar but different classes (i.e. a multi-task learning
 scenario) than a corresponding shallow and purely supervised architecture?
 Whereas the improvement due to the multi-task setting was marginal or
-negative for the MLP, it was very significant for the SDA.
+negative for the MLP (from +5.6\% to -3.6\% relative change),
+it was very significant for the SDA (from +13\% to +27\% relative change).
 %\end{itemize}
 
+Why would deep learners benefit more from the self-taught learning framework?
+The key idea is that the lower layers of the predictor compute a hierarchy
+of features that can be shared across tasks or across variants of the
+input distribution. Intermediate features that can be used in different
+contexts can be estimated in a way that allows sharing statistical
+strength. Features extracted through many levels are more likely to
+be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
+increasing the likelihood that they would be useful for a larger array
+of tasks and input conditions.
+Therefore, we hypothesize that both depth and unsupervised
+pre-training play a part in explaining the advantages observed here, and future
+experiments could attempt to tease apart these factors.
+
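For a concrete picture of the unsupervised pre-training hypothesized above to
matter: in the SDA, each layer is first trained as a denoising auto-encoder
that reconstructs its input from a randomly corrupted copy, and the stacked
layers then initialize a deep MLP that is fine-tuned with supervised gradient
descent. The numpy sketch below is only a schematic illustration under assumed
layer sizes, corruption level, learning rate, sigmoid units and tied weights;
it is not the authors' implementation (see {\tt http://deeplearning.net/tutorial}
for the reference code the paper points to).

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_dae_layer(X, n_hidden, corruption=0.25, lr=0.01, n_epochs=10):
    """Train one denoising auto-encoder layer: zero out a random fraction of
    the input, encode with a sigmoid layer, decode with tied weights, and
    minimize cross-entropy reconstruction error.  All sizes and rates here
    are assumed for illustration only."""
    n_in = X.shape[1]
    W = rng.uniform(-0.1, 0.1, size=(n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(n_epochs):
        mask = rng.binomial(1, 1.0 - corruption, size=X.shape)
        H = sigmoid((X * mask).dot(W) + b_h)     # encode the corrupted input
        R = sigmoid(H.dot(W.T) + b_v)            # reconstruct the clean input
        dR = (R - X) / X.shape[0]                # cross-entropy grad wrt pre-sigmoid
        dH = dR.dot(W) * H * (1.0 - H)
        W -= lr * ((X * mask).T.dot(dH) + dR.T.dot(H))   # tied-weight gradient
        b_h -= lr * dH.sum(axis=0)
        b_v -= lr * dR.sum(axis=0)
    return W, b_h

def pretrain_stack(X, layer_sizes=(1000, 1000, 1000)):
    """Greedy layer-wise pre-training: each new layer is trained on the
    (uncorrupted) codes produced by the layers below it.  The resulting
    weights would then initialize a deep MLP for supervised fine-tuning."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae_layer(H, n_hidden)
        layers.append((W, b))
        H = sigmoid(H.dot(W) + b)
    return layers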
 A Flash demo of the recognizer (where both the MLP and the SDA can be
 compared) can be executed on-line at {\tt http://deep.host22.com}.