# HG changeset patch
# User Dumitru Erhan
# Date 1275414908 25200
# Node ID a41a8925be70e626eba5c5edee982baf6e54ba7a
# Parent  e837ef6eef8c48f05a0235ef3987d315d7add6bd
# Parent  a0e820f04f8e7736d4d7240540700cfe5391f1e2
merge

diff -r e837ef6eef8c -r a41a8925be70 writeup/ift6266_ml.bib
--- a/writeup/ift6266_ml.bib	Tue Jun 01 10:53:07 2010 -0700
+++ b/writeup/ift6266_ml.bib	Tue Jun 01 10:55:08 2010 -0700
@@ -267,6 +267,14 @@
 mixture that has a dominant tail",
 }
 
+@techreport{ift6266-tr-anonymous,
+  author = "Anonymous authors",
+  title = "Generating and Exploiting Perturbed and Multi-Task Handwritten
+Training Data for Deep Architectures",
+  institution = "University X.",
+  year = 2010,
+}
+
 @TechReport{Abdallah+Plumbley-06,
   author = "Samer Abdallah and Mark Plumbley",
   title = "Geometry Dependency Analysis",

diff -r e837ef6eef8c -r a41a8925be70 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Tue Jun 01 10:53:07 2010 -0700
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 10:55:08 2010 -0700
@@ -201,7 +201,7 @@
 {\bf Pinch.}
 This GIMP filter is named "Whirl and pinch", but whirl was set to 0. A pinch is
 ``similar to projecting the image onto an elastic
-surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
+surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
 For a square input image, think of drawing a circle of
 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging
 to that disk (region inside circle) will have its value recalculated by taking
@@ -329,6 +329,23 @@
 the above transformations and/or noise processes is applied to the image.
 
+We compare the best MLP (according to validation set error) that we found against
+the best SDA (again according to validation set error), along with a precise estimate
+of human performance obtained via Amazon's Mechanical Turk (AMT)
+service\footnote{http://mturk.com}.
+AMT users are paid small amounts
+of money to perform tasks for which human intelligence is required.
+Mechanical Turk has been used extensively in natural language processing and vision.
+%processing \citep{SnowEtAl2008} and vision
+%\citep{SorokinAndForsyth2008,whitehill09}.
+AMT users were presented
+with 10 character images and asked to type the 10 corresponding ASCII
+characters. They were forced to make a hard choice among the
+62 or 10 character classes (all classes or digits only).
+Three users classified each image, allowing us
+to estimate inter-human variability.
+
 \vspace*{-1mm}
 \subsection{Data Sources}
 \vspace*{-1mm}
@@ -412,11 +429,15 @@
 \subsection{Models and their Hyperparameters}
 \vspace*{-1mm}
 
+The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
+hidden layer and with Stacked Denoising Auto-Encoders (SDA).
 All hyper-parameters are selected based on performance on the NISTP validation set.
 
 {\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep architectures to both shallow MLPs and
-SVMs, we only compared to MLPs here because of the very large datasets used.
+SVMs, we only compared to MLPs here because of the very large datasets used
+(making the use of SVMs computationally inconvenient because of their quadratic
+scaling with the number of training examples).
 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax
 (normalized exponentials) on the output layer for estimating P(class | image).
 The hyper-parameters are the following: number of hidden units, taken in
@@ -425,7 +446,7 @@
 rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$ through
 preliminary experiments, and 0.1 was selected.
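To make the MLP baseline above concrete, here is a minimal numpy sketch (not the
authors' code) of a single-hidden-layer network with a tanh hidden layer and a
softmax output estimating P(class | image), trained by stochastic gradient
descent with the 0.1 learning rate quoted above. The input size (assumed 32x32
images), the hidden-layer width, and the initialization scheme are illustrative
assumptions; the paper selects the number of hidden units and the learning rate
on the NISTP validation set.

import numpy as np

rng = np.random.RandomState(0)

def init_mlp(n_in=32 * 32, n_hidden=500, n_classes=62):
    """Randomly initialize a single-hidden-layer MLP.
    n_in and n_hidden are assumed values; the paper tunes the number of
    hidden units on the NISTP validation set."""
    W1 = rng.uniform(-0.01, 0.01, size=(n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.01, 0.01, size=(n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    return [W1, b1, W2, b2]

def forward(params, X):
    """tanh hidden layer followed by a softmax output layer, returning the
    hidden codes and estimates of P(class | image) for each row of X."""
    W1, b1, W2, b2 = params
    H = np.tanh(X.dot(W1) + b1)                    # hidden representation
    logits = H.dot(W2) + b2
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(logits)
    return H, e / e.sum(axis=1, keepdims=True)

def sgd_step(params, X, y, lr=0.1):
    """One stochastic gradient step on the average negative log-likelihood;
    lr=0.1 matches the learning rate selected in the text."""
    W1, b1, W2, b2 = params
    H, P = forward(params, X)
    n = X.shape[0]
    dlogits = P.copy()
    dlogits[np.arange(n), y] -= 1.0                # dNLL/dlogits for softmax
    dlogits /= n
    dH = dlogits.dot(W2.T) * (1.0 - H ** 2)        # tanh'(a) = 1 - tanh(a)^2
    W2 -= lr * H.T.dot(dlogits)
    b2 -= lr * dlogits.sum(axis=0)
    W1 -= lr * X.T.dot(dH)
    b1 -= lr * dH.sum(axis=0)
    return params

# Example on random data (shape-checking only; real inputs are NIST images):
#   params = init_mlp()
#   X = rng.rand(16, 32 * 32); y = rng.randint(0, 62, size=16)
#   params = sgd_step(params, X, y)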
 
-{\bf Stacked Denoising Auto-Encoders (SDAE).}
+{\bf Stacked Denoising Auto-Encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many
 hidden layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006}
@@ -441,6 +462,7 @@
 compositions of simpler ones through a deep hierarchy). Here we chose to use
 the Denoising Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
+% ADD AN IMAGE?
 these deep hierarchies of features, as it is very simple to train and teach
 (see tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides immediate and efficient inference, and yielded results
@@ -470,22 +492,6 @@
 %\subsection{SDA vs MLP vs Humans}
 %\vspace*{-1mm}
 
-We compare the best MLP (according to validation set error) that we found against
-the best SDA (again according to validation set error), along with a precise estimate
-of human performance obtained via Amazon's Mechanical Turk (AMT)
-service\footnote{http://mturk.com}.
-%AMT users are paid small amounts
-%of money to perform tasks for which human intelligence is required.
-%Mechanical Turk has been used extensively in natural language
-%processing \citep{SnowEtAl2008} and vision
-%\citep{SorokinAndForsyth2008,whitehill09}.
-AMT users where presented
-with 10 character images and asked to type 10 corresponding ASCII
-characters. They were forced to make a hard choice among the
-62 or 10 character classes (all classes or digits only).
-Three users classified each image, allowing
-to estimate inter-human variability (shown as +/- in parenthesis below).
-
 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, comparing
 Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1, SDA2), along
 with the previous results on the digits NIST special database
@@ -503,9 +509,13 @@
 Figure~\ref{fig:fig:improvements-charts}, the relative improvement
 in error rate brought by self-taught learning is greater for the SDA, and these
 differences with the MLP are statistically and qualitatively
-significant. The left side of the figure shows the improvement to the clean
+significant.
+The left side of the figure shows the improvement to the clean
 NIST test set error brought by the use of out-of-distribution examples
-(i.e. the perturbed examples examples from NISTP or P07). The right side of
+(i.e. the perturbed examples from NISTP or P07).
+Relative change is measured by taking
+(original model's error / perturbed-data model's error - 1).
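Written out, the relative-change measure described in the added lines above is
(with e_orig and e_pert as our own shorthand, not the paper's notation):

% e_orig = test error of the model trained on unperturbed NIST data
% e_pert = test error of the same architecture trained on NISTP or P07
\[
  \mbox{relative improvement} \;=\; \frac{e_{\mathrm{orig}}}{e_{\mathrm{pert}}} - 1 ,
\]

so a positive value means the model trained with the out-of-distribution
(perturbed) examples reaches the lower error; presumably the analogous ratio
is used for the multi-task comparison on the right side of the figure.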
+The right side of
 Figure~\ref{fig:fig:improvements-charts} shows the relative improvement
 brought by the use of a multi-task setting, in which the same model is
 trained for more classes than the target classes of interest (i.e. training
@@ -527,12 +537,19 @@
 \begin{figure}[h]
 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
-\caption{Charts corresponding to table 1 of Appendix I. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from literature. }
+\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
+on NIST, 1 on NISTP, and 2 on P07. Left: overall results
+of all models, on 3 different test sets corresponding to the three
+datasets.
+Right: error rates on NIST test digits only, along with the previous results from
+the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005},
+respectively based on ART, nearest neighbors, MLPs, and SVMs.}
+
 \label{fig:error-rates-charts}
 \end{figure}
 
 %\vspace*{-1mm}
-%\subsection{Perturbed Training Data More Helpful for SDAE}
+%\subsection{Perturbed Training Data More Helpful for SDA}
 %\vspace*{-1mm}
 
 %\vspace*{-1mm}
@@ -575,16 +592,19 @@
 \section{Conclusions}
 \vspace*{-1mm}
 
-The conclusions are positive for all the questions asked in the introduction.
+We have found that the self-taught learning framework is more beneficial
+to a deep learner than to a traditional shallow and purely
+supervised learner. More precisely,
+the conclusions are positive for all the questions asked in the introduction.
 %\begin{itemize}
 
 $\bullet$ %\item
 Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
-Yes, the SDA systematically outperformed the MLP and all the previously
+Yes, the SDA {\bf systematically outperformed the MLP and all the previously
 published results on this dataset (as far as we know), in fact reaching human-level
-performance.
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item
 To what extent does the perturbation of input images (e.g. adding
@@ -592,8 +612,11 @@
 classifier better not only on similarly perturbed images but also on
 the {\em original clean examples}? Do deep architectures benefit more from such
 {\em out-of-distribution} examples, i.e. do they benefit more from the
 self-taught learning~\citep{RainaR2007} framework?
-MLPs were helped by perturbed training examples when tested on perturbed input images,
-but only marginally helped with respect to clean examples. On the other hand, the deep SDAs
+MLPs were helped by perturbed training examples when tested on perturbed input
+images (65\% relative improvement on NISTP)
+but only marginally helped (5\% relative improvement on all classes)
+or even hurt (10\% relative loss on digits)
+with respect to clean examples. On the other hand, the deep SDAs
 were very significantly boosted by these out-of-distribution examples.
 
 $\bullet$ %\item
@@ -601,9 +624,23 @@
 training with similar but different classes (i.e. a multi-task learning
 scenario) than a corresponding shallow and purely supervised architecture?
 Whereas the improvement due to the multi-task setting was marginal or
-negative for the MLP, it was very significant for the SDA.
+negative for the MLP (from +5.6\% to -3.6\% relative change),
+it was very significant for the SDA (from +13\% to +27\% relative change).
 %\end{itemize}
 
+Why would deep learners benefit more from the self-taught learning framework?
+The key idea is that the lower layers of the predictor compute a hierarchy
+of features that can be shared across tasks or across variants of the
+input distribution. Intermediate features that can be used in different
+contexts can be estimated in a way that allows sharing statistical
+strength. Features extracted through many levels are more likely to
+be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
+increasing the likelihood that they would be useful for a larger array
+of tasks and input conditions.
+Therefore, we hypothesize that both depth and unsupervised
+pre-training play a part in explaining the advantages observed here, and future
+experiments could attempt to tease apart these factors.
+
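For a concrete picture of the unsupervised pre-training hypothesized above to
matter: in the SDA, each layer is first trained as a denoising auto-encoder
that reconstructs its input from a randomly corrupted copy, and the stacked
layers then initialize a deep MLP that is fine-tuned with supervised gradient
descent. The numpy sketch below is only a schematic illustration under assumed
layer sizes, corruption level, learning rate, sigmoid units and tied weights;
it is not the authors' implementation (see {\tt http://deeplearning.net/tutorial}
for the reference code the paper points to).

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_dae_layer(X, n_hidden, corruption=0.25, lr=0.01, n_epochs=10):
    """Train one denoising auto-encoder layer: zero out a random fraction of
    the input, encode with a sigmoid layer, decode with tied weights, and
    minimize cross-entropy reconstruction error.  All sizes and rates here
    are assumed for illustration only."""
    n_in = X.shape[1]
    W = rng.uniform(-0.1, 0.1, size=(n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(n_epochs):
        mask = rng.binomial(1, 1.0 - corruption, size=X.shape)
        H = sigmoid((X * mask).dot(W) + b_h)     # encode the corrupted input
        R = sigmoid(H.dot(W.T) + b_v)            # reconstruct the clean input
        dR = (R - X) / X.shape[0]                # cross-entropy grad wrt pre-sigmoid
        dH = dR.dot(W) * H * (1.0 - H)
        W -= lr * ((X * mask).T.dot(dH) + dR.T.dot(H))   # tied-weight gradient
        b_h -= lr * dH.sum(axis=0)
        b_v -= lr * dR.sum(axis=0)
    return W, b_h

def pretrain_stack(X, layer_sizes=(1000, 1000, 1000)):
    """Greedy layer-wise pre-training: each new layer is trained on the
    (uncorrupted) codes produced by the layers below it.  The resulting
    weights would then initialize a deep MLP for supervised fine-tuning."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae_layer(H, n_hidden)
        layers.append((W, b))
        H = sigmoid(H.dot(W) + b)
    return layers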
 A Flash demo of the recognizer (where both the MLP and the SDA can be
 compared) can be executed on-line at {\tt http://deep.host22.com}.