changeset 584:81c6fde68a8a

corrections to techreport.tex
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 18 Sep 2010 18:25:11 -0400
parents ae77edb9df67
children 4933077b8676
files writeup/techreport.tex
diffstat 1 files changed, 148 insertions(+), 67 deletions(-)
--- a/writeup/techreport.tex	Sat Sep 18 16:44:46 2010 -0400
+++ b/writeup/techreport.tex	Sat Sep 18 18:25:11 2010 -0400
@@ -34,7 +34,7 @@
 Francois  Savard \and 
 Guillaume  Sicard 
 }
-\date{June 8th, 2010, Technical Report 1353, Dept. IRO, U. Montreal}
+\date{June 3, 2010, Technical Report 1353, Dept. IRO, U. Montreal}
 
 \begin{document}
 
@@ -43,26 +43,7 @@
 
 %\vspace*{-2mm}
 \begin{abstract}
-  Recent theoretical and empirical work in statistical machine learning has
-  demonstrated the importance of learning algorithms for deep
-  architectures, i.e., function classes obtained by composing multiple
-  non-linear transformations. Self-taught learning (exploiting unlabeled
-  examples or examples from other distributions) has already been applied
-  to deep learners, but mostly to show the advantage of unlabeled
-  examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
-For this purpose we
-  developed a powerful generator of stochastic variations and noise
-  processes for character images, including not only affine transformations
-  but also slant, local elastic deformations, changes in thickness,
-  background images, grey level changes, contrast, occlusion, and various
-  types of noise. The out-of-distribution examples are obtained from these
-  highly distorted images or by including examples of object classes
-  different from those in the target test set.
-  We show that {\em deep learners benefit
-    more from them than a corresponding shallow learner}, at least in the area of
-  handwritten character recognition. In fact, we show that they reach
-  human-level performance on both handwritten digit classification and
-  62-class handwritten character recognition.  
+  Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}.  For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set.  We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.
 \end{abstract}
 %\vspace*{-3mm}
 
@@ -72,7 +53,8 @@
 {\bf Deep Learning} has emerged as a promising new area of research in
 statistical machine learning (see~\citet{Bengio-2009} for a review).
 Learning algorithms for deep architectures are centered on the learning
-of useful representations of data, which are better suited to the task at hand.
+of useful representations of data, which are better suited to the task at hand,
+and are organized in a hierarchy with multiple levels.
 This is in part inspired by observations of the mammalian visual cortex, 
 which consists of a chain of processing elements, each of which is associated with a
 different representation of the raw visual input. In fact,
@@ -104,7 +86,6 @@
 may be better able to provide sharing of statistical strength
 between different regions in input space or different tasks.
 
-\iffalse
 Whereas a deep architecture can in principle be more powerful than a
 shallow one in terms of representation, depth appears to render the
 training problem more difficult in terms of optimization and local minima.
@@ -119,7 +100,7 @@
 Machines in terms of unsupervised extraction of a hierarchy of features
 useful for classification. Each layer is trained to denoise its
 input, creating a layer of features that can be used as input for the next layer.  
-\fi
+
 %The principle is that each layer starting from
 %the bottom is trained to encode its input (the output of the previous
 %layer) and to reconstruct it from a corrupted version. After this
@@ -144,37 +125,42 @@
 context of the 62-class and 10-class tasks of the NIST special database 19.
 
 $\bullet$ %\item 
-Do deep architectures {\em benefit more from such out-of-distribution}
+Do deep architectures {\em benefit {\bf more} from such out-of-distribution}
 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
 We use highly perturbed examples to generate out-of-distribution examples.
 
 $\bullet$ %\item 
-Similarly, does the feature learning step in deep learning algorithms benefit more 
-from training with moderately different classes (i.e. a multi-task learning scenario) than
+Similarly, does the feature learning step in deep learning algorithms benefit {\bf more}
+from training with moderately {\em different classes} (i.e. a multi-task learning scenario) than
 a corresponding shallow and purely supervised architecture?
 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case)
 to answer this question.
 %\end{enumerate}
 
-Our experimental results provide positive evidence towards all of these questions.
+Our experimental results provide positive evidence towards all of these questions,
+as well as classifiers that reach human-level performance on 62-class isolated character
+recognition and beat previously published results on the NIST dataset (special database 19).
 To achieve these results, we introduce in the next section a sophisticated system
 for stochastically transforming character images and then explain the methodology,
 which is based on training with or without these transformed images and testing on 
 clean ones. We measure the relative advantage of out-of-distribution examples
+(perturbed or out-of-class)
 for a deep learner vs a supervised shallow one.
 Code for generating these transformations as well as for the deep learning 
-algorithms are made available. 
-We also estimate the relative advantage for deep learners of training with
+algorithms is made available at {\tt http://hg.assembla.com/ift6266}.
+We estimate the relative advantage for deep learners of training with
 other classes than those of interest, by comparing learners trained with
 62 classes with learners trained with only a subset (on which they
 are then tested).
 The conclusion discusses
 the more general question of why deep learners may benefit so much from 
-the self-taught learning framework.
+the self-taught learning framework. Since out-of-distribution data
+(perturbed or from other related classes) is very common, this conclusion
+is of practical importance.
 
 %\vspace*{-3mm}
-\newpage
-\section{Perturbation and Transformation of Character Images}
+%\newpage
+\section{Perturbed and Transformed Character Images}
 \label{s:perturbations}
 %\vspace*{-2mm}
 
@@ -182,7 +168,7 @@
 %\begin{minipage}[b]{0.14\linewidth}
 %\vspace*{-5mm}
 \begin{center}
-\includegraphics[scale=.4]{images/Original.png}\\
+\includegraphics[scale=.4]{Original.png}\\
 {\bf Original}
 \end{center}
 \end{wrapfigure}
@@ -198,10 +184,8 @@
 improve character recognizers, this effort is on a large scale both
 in number of classes and in the complexity of the transformations, hence
 in the complexity of the learning task.
-More details can
-be found in this technical report~\citep{ift6266-tr-anonymous}.
 The code for these transformations (mostly python) is available at 
-{\tt http://anonymous.url.net}. All the modules in the pipeline share
+{\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
 amount of deformation or noise introduced. 
 There are two main parts in the pipeline. The first one,
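To make the role of the shared $complexity$ knob concrete, here is a minimal Python sketch of how such a pipeline could be organized; the class names and the `apply` interface are hypothetical illustrations, not the actual modules of the ift6266 repository.

    import numpy as np

    class TransformationModule:
        """One step of the perturbation pipeline; deformation strength follows `complexity`."""
        def apply(self, image, complexity, rng):
            raise NotImplementedError

    class Slant(TransformationModule):
        def apply(self, image, complexity, rng):
            # Shear each row horizontally; the maximum shear grows with `complexity`.
            shear = rng.uniform(-1.0, 1.0) * complexity
            rows, _ = image.shape
            out = np.empty_like(image)
            for r in range(rows):
                out[r] = np.roll(image[r], int(round(shear * (r - rows / 2))))
            return out

    def perturb(image, modules, complexity=0.5, seed=0):
        """Apply every module in turn, all sharing one complexity value in [0, 1]."""
        rng = np.random.RandomState(seed)
        for module in modules:
            image = module.apply(image, complexity, rng)
        return image

    # e.g. perturb(img, [Slant()], complexity=0.7) on a 2-D grey-level array `img`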
@@ -221,7 +205,7 @@
 %\centering
 \begin{center}
 \vspace*{-5mm}
-\includegraphics[scale=.4]{images/Thick_only.png}\\
+\includegraphics[scale=.4]{Thick_only.png}\\
 %{\bf Thickness}
 \end{center}
 \vspace{.6cm}
@@ -249,7 +233,7 @@
 
 \begin{minipage}[b]{0.14\linewidth}
 \centering
-\includegraphics[scale=.4]{images/Slant_only.png}\\
+\includegraphics[scale=.4]{Slant_only.png}\\
 %{\bf Slant}
 \end{minipage}%
 \hspace{0.3cm}
@@ -271,7 +255,7 @@
 %\centering
 %\begin{wrapfigure}[8]{l}{0.15\textwidth}
 \begin{center}
-\includegraphics[scale=.4]{images/Affine_only.png}
+\includegraphics[scale=.4]{Affine_only.png}
 \vspace*{6mm}
 %{\small {\bf Affine \mbox{Transformation}}}
 \end{center}
@@ -301,7 +285,7 @@
 %\centering
 \begin{center}
 \vspace*{5mm}
-\includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}
+\includegraphics[scale=.4]{Localelasticdistorsions_only.png}
 %{\bf Local Elastic Deformation}
 \end{center}
 %\end{wrapfigure}
@@ -328,7 +312,7 @@
 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
 %\vspace*{-5mm}
 \begin{center}
-\includegraphics[scale=.4]{images/Pinch_only.png}\\
+\includegraphics[scale=.4]{Pinch_only.png}\\
 \vspace*{15mm}
 %{\bf Pinch}
 \end{center}
@@ -365,7 +349,7 @@
 \begin{minipage}[t]{0.14\linewidth}
 \centering
 \vspace*{0mm}
-\includegraphics[scale=.4]{images/Motionblur_only.png}
+\includegraphics[scale=.4]{Motionblur_only.png}
 %{\bf Motion Blur}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
@@ -386,7 +370,7 @@
 \begin{minipage}[t]{0.14\linewidth}
 \centering
 \vspace*{3mm}
-\includegraphics[scale=.4]{images/occlusion_only.png}\\
+\includegraphics[scale=.4]{occlusion_only.png}\\
 %{\bf Occlusion}
 %%\vspace{.5cm}
 \end{minipage}%
@@ -399,7 +383,7 @@
 The rectangle corners
 are sampled so that larger complexity gives larger rectangles.
 The destination position in the occluded image is also sampled
-according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
+according to a normal distribution.
 This module is skipped with probability 60\%.
 %%\vspace{7mm}
 \end{minipage}
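A rough Python sketch of the occlusion step described above (rectangle size growing with $complexity$, destination drawn from a normal distribution, module skipped with probability 60\%); the function name and the exact sampling constants are hypothetical, not taken from the actual module.

    import numpy as np

    def occlude(image, other_image, complexity, rng):
        """Paste a rectangle cut from another character image over `image`."""
        if rng.uniform() < 0.6:                                  # module skipped with probability 60%
            return image
        rows, cols = image.shape
        h = max(1, int(rng.uniform(0.0, complexity) * rows))     # larger complexity gives
        w = max(1, int(rng.uniform(0.0, complexity) * cols))     # larger rectangles
        r0 = rng.randint(0, rows - h + 1)
        c0 = rng.randint(0, cols - w + 1)
        patch = other_image[r0:r0 + h, c0:c0 + w]
        # Destination position sampled from a normal distribution (std. dev. chosen arbitrarily here).
        dr = int(np.clip(rng.normal(rows / 2.0, rows / 4.0), 0, rows - h))
        dc = int(np.clip(rng.normal(cols / 2.0, cols / 4.0), 0, cols - w))
        out = image.copy()
        out[dr:dr + h, dc:dc + w] = np.maximum(out[dr:dr + h, dc:dc + w], patch)
        return out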
@@ -413,7 +397,7 @@
 \begin{center}
 %\centering
 \vspace*{6mm}
-\includegraphics[scale=.4]{images/Bruitgauss_only.png}
+\includegraphics[scale=.4]{Bruitgauss_only.png}
 %{\bf Gaussian Smoothing}
 \end{center}
 %\end{wrapfigure}
@@ -449,7 +433,7 @@
 %\vspace*{-5mm}
 \begin{center}
 \vspace*{1mm}
-\includegraphics[scale=.4]{images/Permutpixel_only.png}
+\includegraphics[scale=.4]{Permutpixel_only.png}
 %{\small\bf Permute Pixels}
 \end{center}
 %\end{wrapfigure}
@@ -476,7 +460,7 @@
 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
 %\centering
 \vspace*{0mm}
-\includegraphics[scale=.4]{images/Distorsiongauss_only.png}
+\includegraphics[scale=.4]{Distorsiongauss_only.png}
 %{\small \bf Gauss. Noise}
 \end{center}
 %\end{wrapfigure}
@@ -498,7 +482,7 @@
 \begin{minipage}[t]{0.14\linewidth}
 \centering
 \vspace*{0mm}
-\includegraphics[scale=.4]{images/background_other_only.png}
+\includegraphics[scale=.4]{background_other_only.png}
 %{\small \bf Bg Image}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
@@ -517,7 +501,7 @@
 \begin{minipage}[t]{0.14\linewidth}
 \centering
 \vspace*{0mm}
-\includegraphics[scale=.4]{images/Poivresel_only.png}
+\includegraphics[scale=.4]{Poivresel_only.png}
 %{\small \bf Salt \& Pepper}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
@@ -539,7 +523,7 @@
 \begin{center}
 \vspace*{4mm}
 %\hspace*{-1mm}
-\includegraphics[scale=.4]{images/Rature_only.png}\\
+\includegraphics[scale=.4]{Rature_only.png}\\
 %{\bf Scratches}
 \end{center}
 \end{minipage}%
@@ -565,7 +549,7 @@
 \begin{minipage}[t]{0.15\linewidth}
 \centering
 \vspace*{0mm}
-\includegraphics[scale=.4]{images/Contrast_only.png}
+\includegraphics[scale=.4]{Contrast_only.png}
 %{\bf Grey Level \& Contrast}
 \end{minipage}%
 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth}
@@ -581,7 +565,7 @@
 
 \iffalse
 \begin{figure}[ht]
-\centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\
+\centerline{\resizebox{.9\textwidth}{!}{\includegraphics{example_t.png}}}\\
 \caption{Illustration of the pipeline of stochastic 
 transformations applied to the image of a lower-case \emph{t}
 (the upper left image). Each image in the pipeline (going from
@@ -626,10 +610,12 @@
 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
 characters. They were forced to choose a single character class (either among the
 62 or 10 character classes) for each image.
-80 subjects classified 2500 images per (dataset,task) pair,
-with the guarantee that 3 different subjects classified each image, allowing
-us to estimate inter-human variability (e.g a standard error of 0.1\%
-on the average 18.2\% error done by humans on the 62-class task NIST test set). 
+80 subjects classified 2500 images per (dataset,task) pair.
+Different human labelers sometimes provided a different label for the same
+example, and we were able to estimate the error variance due to this effect
+because each image was classified by 3 different persons. 
+The average error of humans on the 62-class task NIST test set
+is 18.2\%, with a standard error of 0.1\%.
 
 %\vspace*{-3mm}
 \subsection{Data Sources}
@@ -733,7 +719,11 @@
 Whereas previous work had compared deep architectures to both shallow MLPs and
 SVMs, we only compared to MLPs here because of the very large datasets used
 (making the use of SVMs computationally challenging because of their quadratic
-scaling behavior).
+scaling behavior). Preliminary experiments training SVMs (libSVM) on subsets of the training
+set small enough to fit in memory yielded substantially worse results
+than those obtained with MLPs. For training on nearly a billion examples
+(with the perturbed data), MLPs and SDAs are much more convenient than
+classifiers based on kernel methods.
 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
 exponentials) on the output layer for estimating $P(class | image)$.
 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 
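For concreteness, a minimal numpy sketch of this shallow baseline's forward pass (single $\tanh$ hidden layer, softmax output estimating $P(class|image)$); the real implementation in the ift6266 repository is Theano-based, and the 32$\times$32 input size below is only an illustrative assumption.

    import numpy as np

    def mlp_forward(x, W1, b1, W2, b2):
        """Single hidden layer with tanh units, softmax (normalized exponentials) output."""
        h = np.tanh(x.dot(W1) + b1)                      # hidden representation
        logits = h.dot(W2) + b2
        logits -= logits.max(axis=1, keepdims=True)      # for numerical stability
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)          # rows are P(class | image)

    rng = np.random.RandomState(0)
    n_in, n_hidden, n_classes = 32 * 32, 1000, 62        # n_hidden chosen in {300,500,800,1000,1500}
    W1 = rng.randn(n_in, n_hidden) * 0.01
    W2 = rng.randn(n_hidden, n_classes) * 0.01
    probs = mlp_forward(rng.rand(5, n_in), W1, np.zeros(n_hidden), W2, np.zeros(n_classes))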
@@ -751,7 +741,11 @@
 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, 
 apparently setting parameters in the
 basin of attraction of supervised gradient descent yielding better 
-generalization~\citep{Erhan+al-2010}. It is hypothesized that the
+generalization~\citep{Erhan+al-2010}.  This initial {\em unsupervised
+pre-training phase} uses all of the training images but not the training labels.
+Each layer is trained in turn to produce a new representation of its input
+(starting from the raw pixels).
+It is hypothesized that the
 advantage brought by this procedure stems from a better prior,
 on the one hand taking advantage of the link between the input
 distribution $P(x)$ and the conditional distribution of interest
@@ -762,7 +756,7 @@
 
 \begin{figure}[ht]
 %\vspace*{-2mm}
-\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
+\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{denoising_autoencoder_small.pdf}}}
 %\vspace*{-2mm}
 \caption{Illustration of the computations and training criterion for the denoising
 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
@@ -802,13 +796,16 @@
 from the same above set). The fraction of inputs corrupted was selected
 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
 of hidden layers but it was fixed to 3 based on previous work with
-SDAs on MNIST~\citep{VincentPLarochelleH2008}.
+SDAs on MNIST~\citep{VincentPLarochelleH2008}. The same number of hidden
+units was used for all hidden layers, and the best results
+were obtained with the largest value that we could experiment
+with given our patience, which turned out to be 1000 hidden units.
 
 %\vspace*{-1mm}
 
 \begin{figure}[ht]
 %\vspace*{-2mm}
-\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
+\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{error_rates_charts.pdf}}}
 %\vspace*{-3mm}
 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
@@ -823,7 +820,7 @@
 
 \begin{figure}[ht]
 %\vspace*{-3mm}
-\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
+\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{improvements_charts.pdf}}}
 %\vspace*{-3mm}
 \caption{Relative improvement in error rate due to self-taught learning. 
 Left: Improvement (or loss, when negative)
@@ -856,7 +853,7 @@
 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs
 ~\citep{Milgram+al-2005}.  More detailed and complete numerical results
 (figures and tables, including standard errors on the error rates) can be
-found in Appendix I of the supplementary material.  
+found in Appendix I.
 The deep learner not only outperformed the shallow ones and
 previously published performance (in a statistically and qualitatively
 significant way) but when trained with perturbed data
@@ -947,7 +944,8 @@
 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
 Yes, the SDA {\em systematically outperformed the MLP and all the previously
 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
-performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits,
+and beating previously published results on the same data.
 
 $\bullet$ %\item 
 {\bf To what extent do self-taught learning scenarios help deep learners,
@@ -983,7 +981,10 @@
 {\bf Why would deep learners benefit more from the self-taught learning framework}?
 The key idea is that the lower layers of the predictor compute a hierarchy
 of features that can be shared across tasks or across variants of the
-input distribution. Intermediate features that can be used in different
+input distribution. A theoretical analysis of generalization improvements
+due to sharing of intermediate features across tasks already points
+towards that explanation~\citep{baxter95a}.
+Intermediate features that can be used in different
 contexts can be estimated in a way that allows to share statistical 
 strength. Features extracted through many levels are more likely to
 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
@@ -1011,7 +1012,87 @@
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 
 can be executed on-line at {\tt http://deep.host22.com}.
 
-%\newpage
+
+\section*{Appendix I: Detailed Numerical Results}
+
+These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered.
+They also contain additional data such as test errors on P07 and standard errors.
+
+\begin{table}[ht]
+\caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
+26 lower + 26 upper), except for the last column (digits only), between the deep architecture with pre-training
+(SDA=Stacked Denoising Autoencoder) and the ordinary shallow architecture
+(MLP=Multi-Layer Perceptron). The models are trained on clean (NIST) or perturbed (NISTP, P07) data,
+using a validation set to select hyper-parameters and other training choices.
+\{SDA,MLP\}0 are trained on NIST,
+\{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07.
+The human error rate on digits is a lower bound because it does not count digits that were
+recognized as letters. For comparison, the results found in the literature
+on NIST digits classification using the same test set are included.}
+\label{tab:sda-vs-mlp-vs-humans}
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|} \hline
+      & NIST test          & NISTP test       & P07 test       & NIST test digits   \\ \hline
+Humans&   18.2\% $\pm$.1\%   &  39.4\%$\pm$.1\%   &  46.9\%$\pm$.1\%  &  $1.4\%$ \\ \hline 
+SDA0   &  23.7\% $\pm$.14\%  &  65.2\%$\pm$.34\%  & 97.45\%$\pm$.06\%  & 2.7\% $\pm$.14\%\\ \hline 
+SDA1   &  17.1\% $\pm$.13\%  &  29.7\%$\pm$.3\%  & 29.7\%$\pm$.3\%  & 1.4\% $\pm$.1\%\\ \hline 
+SDA2   &  18.7\% $\pm$.13\%  &  33.6\%$\pm$.3\%  & 39.9\%$\pm$.17\%  & 1.7\% $\pm$.1\%\\ \hline 
+MLP0   &  24.2\% $\pm$.15\%  & 68.8\%$\pm$.33\%  & 78.70\%$\pm$.14\%  & 3.45\% $\pm$.15\% \\ \hline 
+MLP1   &  23.0\% $\pm$.15\%  &  41.8\%$\pm$.35\%  & 90.4\%$\pm$.1\%  & 3.85\% $\pm$.16\% \\ \hline 
+MLP2   &  24.3\% $\pm$.15\%  &  46.0\%$\pm$.35\%  & 54.7\%$\pm$.17\%  & 4.85\% $\pm$.18\% \\ \hline 
+\citep{Granger+al-2007} &     &                    &                   & 4.95\% $\pm$.18\% \\ \hline
+\citep{Cortes+al-2000} &      &                    &                   & 3.71\% $\pm$.16\% \\ \hline
+\citep{Oliveira+al-2002} &    &                    &                   & 2.4\% $\pm$.13\% \\ \hline
+\citep{Milgram+al-2005} &      &                    &                   & 2.1\% $\pm$.12\% \\ \hline
+\end{tabular}
+\end{center}
+\end{table}
+
+\begin{table}[ht]
+\caption{Relative change in error rates due to the use of perturbed training data,
+either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
+A positive value indicates that training on the perturbed data helped for the
+given test set (the first 3 columns are on the 62-class task and the last one
+on the clean 10-class digits). Clearly, the deep learning models did benefit more
+from perturbed training data, even when testing on clean data, whereas the MLP
+trained on perturbed data performed worse on the clean digits and about the same
+on the clean characters. }
+\label{tab:perturbation-effect}
+\begin{center}
+\begin{tabular}{|l|r|r|r|r|} \hline
+      & NIST test          & NISTP test      & P07 test       & NIST test digits   \\ \hline
+SDA0/SDA1-1   &  38\%      &  84\%           & 228\%          &  93\% \\ \hline 
+SDA0/SDA2-1   &  27\%      &  94\%           & 144\%          &  59\% \\ \hline 
+MLP0/MLP1-1   &  5.2\%     &  65\%           & -13\%          & -10\%  \\ \hline 
+MLP0/MLP2-1   &  -0.4\%    &  49\%           & 44\%           & -29\% \\ \hline 
+\end{tabular}
+\end{center}
+\end{table}
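To make the row labels explicit, each entry is the corresponding error-rate ratio minus one; for instance, for SDA0/SDA1-1 in the NIST-digits column of Table~\ref{tab:sda-vs-mlp-vs-humans} (values agree up to rounding of the published error rates):
\[ \frac{2.7\%}{1.4\%} - 1 \approx 0.93 = 93\%. \]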
+
+\begin{table}[ht]
+\caption{Test error rates and relative change in error rates due to the use of
+a multi-task setting, i.e., training on each task in isolation vs training
+for all three tasks together, for MLPs vs SDAs. The SDA benefits much
+more from the multi-task setting. All experiments are on the
+unperturbed NIST data only, using validation error for model selection.
+Relative improvement is 1 - single-task error / multi-task error.}
+\label{tab:multi-task}
+\begin{center}
+\begin{tabular}{|l|r|r|r|} \hline
+             & single-task  & multi-task  & relative \\ 
+             & setting      & setting     & improvement \\ \hline
+MLP-digits   &  3.77\%      &  3.99\%     & 5.6\%   \\ \hline 
+MLP-lower   &  17.4\%      &  16.8\%     &  -4.1\%    \\ \hline 
+MLP-upper   &  7.84\%     &  7.54\%      & -3.6\%    \\ \hline 
+SDA-digits   &  2.6\%      &  3.56\%     & 27\%    \\ \hline 
+SDA-lower   &  12.3\%      &  14.4\%    & 15\%    \\ \hline 
+SDA-upper   &  5.93\%     &  6.78\%      & 13\%    \\ \hline 
+\end{tabular}
+\end{center}
+\end{table}
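As a worked instance of the formula given in the caption, applied to the SDA-digits row:
\[ 1 - \frac{2.6\%}{3.56\%} \approx 0.27 = 27\%. \]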
+
+%\afterpage{\clearpage}
+\clearpage
 {
 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,specials,aigaion-shorter}
 %\bibliographystyle{plainnat}