comparison writeup/techreport.tex @ 584:81c6fde68a8a

corrections to techreport.tex
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 18 Sep 2010 18:25:11 -0400
parents ae77edb9df67
children
583:ae77edb9df67 584:81c6fde68a8a
32 Razvan Pascanu \and 32 Razvan Pascanu \and
33 Salah Rifai \and 33 Salah Rifai \and
34 Francois Savard \and 34 Francois Savard \and
35 Guillaume Sicard 35 Guillaume Sicard
36 } 36 }
37 \date{June 8th, 2010, Technical Report 1353, Dept. IRO, U. Montreal} 37 \date{June 3, 2010, Technical Report 1353, Dept. IRO, U. Montreal}
38 38
39 \begin{document} 39 \begin{document}
40 40
41 %\makeanontitle 41 %\makeanontitle
42 \maketitle 42 \maketitle
43 43
44 %\vspace*{-2mm} 44 %\vspace*{-2mm}
45 \begin{abstract} 45 \begin{abstract}
46 Recent theoretical and empirical work in statistical machine learning has 46 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.
47 demonstrated the importance of learning algorithms for deep
48 architectures, i.e., function classes obtained by composing multiple
49 non-linear transformations. Self-taught learning (exploiting unlabeled
50 examples or examples from other distributions) has already been applied
51 to deep learners, but mostly to show the advantage of unlabeled
52 examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
53 For this purpose we
54 developed a powerful generator of stochastic variations and noise
55 processes for character images, including not only affine transformations
56 but also slant, local elastic deformations, changes in thickness,
57 background images, grey level changes, contrast, occlusion, and various
58 types of noise. The out-of-distribution examples are obtained from these
59 highly distorted images or by including examples of object classes
60 different from those in the target test set.
61 We show that {\em deep learners benefit
62 more from them than a corresponding shallow learner}, at least in the area of
63 handwritten character recognition. In fact, we show that they reach
64 human-level performance on both handwritten digit classification and
65 62-class handwritten character recognition.
66 \end{abstract} 47 \end{abstract}
67 %\vspace*{-3mm} 48 %\vspace*{-3mm}
68 49
69 \section{Introduction} 50 \section{Introduction}
70 %\vspace*{-1mm} 51 %\vspace*{-1mm}
71 52
72 {\bf Deep Learning} has emerged as a promising new area of research in 53 {\bf Deep Learning} has emerged as a promising new area of research in
73 statistical machine learning (see~\citet{Bengio-2009} for a review). 54 statistical machine learning (see~\citet{Bengio-2009} for a review).
74 Learning algorithms for deep architectures are centered on the learning 55 Learning algorithms for deep architectures are centered on the learning
75 of useful representations of data, which are better suited to the task at hand. 56 of useful representations of data, which are better suited to the task at hand,
57 and are organized in a hierarchy with multiple levels.
76 This is in part inspired by observations of the mammalian visual cortex, 58 This is in part inspired by observations of the mammalian visual cortex,
77 which consists of a chain of processing elements, each of which is associated with a 59 which consists of a chain of processing elements, each of which is associated with a
78 different representation of the raw visual input. In fact, 60 different representation of the raw visual input. In fact,
79 it was found recently that the features learnt in deep architectures resemble 61 it was found recently that the features learnt in deep architectures resemble
80 those observed in the first two of these stages (in areas V1 and V2 62 those observed in the first two of these stages (in areas V1 and V2
102 advantage} of deep learning for these settings has not been evaluated. 84 advantage} of deep learning for these settings has not been evaluated.
103 The hypothesis discussed in the conclusion is that a deep hierarchy of features 85 The hypothesis discussed in the conclusion is that a deep hierarchy of features
104 may be better able to provide sharing of statistical strength 86 may be better able to provide sharing of statistical strength
105 between different regions in input space or different tasks. 87 between different regions in input space or different tasks.
106 88
107 \iffalse
108 Whereas a deep architecture can in principle be more powerful than a 89 Whereas a deep architecture can in principle be more powerful than a
109 shallow one in terms of representation, depth appears to render the 90 shallow one in terms of representation, depth appears to render the
110 training problem more difficult in terms of optimization and local minima. 91 training problem more difficult in terms of optimization and local minima.
111 It is also only recently that successful algorithms were proposed to 92 It is also only recently that successful algorithms were proposed to
112 overcome some of these difficulties. All are based on unsupervised 93 overcome some of these difficulties. All are based on unsupervised
117 which 98 which
118 performed similarly or better than previously proposed Restricted Boltzmann 99 performed similarly or better than previously proposed Restricted Boltzmann
119 Machines in terms of unsupervised extraction of a hierarchy of features 100 Machines in terms of unsupervised extraction of a hierarchy of features
120 useful for classification. Each layer is trained to denoise its 101 useful for classification. Each layer is trained to denoise its
121 input, creating a layer of features that can be used as input for the next layer. 102 input, creating a layer of features that can be used as input for the next layer.
122 \fi 103
123 %The principle is that each layer starting from 104 %The principle is that each layer starting from
124 %the bottom is trained to encode its input (the output of the previous 105 %the bottom is trained to encode its input (the output of the previous
125 %layer) and to reconstruct it from a corrupted version. After this 106 %layer) and to reconstruct it from a corrupted version. After this
126 %unsupervised initialization, the stack of DAs can be 107 %unsupervised initialization, the stack of DAs can be
127 %converted into a deep supervised feedforward neural network and fine-tuned by 108 %converted into a deep supervised feedforward neural network and fine-tuned by
142 classifiers better not only on similarly perturbed images but also on 123 classifiers better not only on similarly perturbed images but also on
143 the {\em original clean examples}? We study this question in the 124 the {\em original clean examples}? We study this question in the
144 context of the 62-class and 10-class tasks of the NIST special database 19. 125 context of the 62-class and 10-class tasks of the NIST special database 19.
145 126
146 $\bullet$ %\item 127 $\bullet$ %\item
147 Do deep architectures {\em benefit more from such out-of-distribution} 128 Do deep architectures {\em benefit {\bf more} from such out-of-distribution}
148 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? 129 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
149 We use highly perturbed examples to generate out-of-distribution examples. 130 We use highly perturbed examples to generate out-of-distribution examples.
150 131
151 $\bullet$ %\item 132 $\bullet$ %\item
152 Similarly, does the feature learning step in deep learning algorithms benefit more 133 Similarly, does the feature learning step in deep learning algorithms benefit {\bf more}
153 from training with moderately different classes (i.e. a multi-task learning scenario) than 134 from training with moderately {\em different classes} (i.e. a multi-task learning scenario) than
154 a corresponding shallow and purely supervised architecture? 135 a corresponding shallow and purely supervised architecture?
155 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) 136 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case)
156 to answer this question. 137 to answer this question.
157 %\end{enumerate} 138 %\end{enumerate}
158 139
159 Our experimental results provide positive evidence towards all of these questions. 140 Our experimental results provide positive evidence towards all of these questions,
141 as well as classifiers that reach human-level performance on 62-class isolated character
142 recognition and beat previously published results on the NIST dataset (special database 19).
160 To achieve these results, we introduce in the next section a sophisticated system 143 To achieve these results, we introduce in the next section a sophisticated system
161 for stochastically transforming character images and then explain the methodology, 144 for stochastically transforming character images and then explain the methodology,
162 which is based on training with or without these transformed images and testing on 145 which is based on training with or without these transformed images and testing on
163 clean ones. We measure the relative advantage of out-of-distribution examples 146 clean ones. We measure the relative advantage of out-of-distribution examples
147 (perturbed or out-of-class)
164 for a deep learner vs a supervised shallow one. 148 for a deep learner vs a supervised shallow one.
165 Code for generating these transformations as well as for the deep learning 149 Code for generating these transformations as well as for the deep learning
166 algorithms is made available. 150 algorithms is made available at {\tt http://hg.assembla.com/ift6266}.
167 We also estimate the relative advantage for deep learners of training with 151 We estimate the relative advantage for deep learners of training with
168 other classes than those of interest, by comparing learners trained with 152 other classes than those of interest, by comparing learners trained with
169 62 classes with learners trained with only a subset (on which they 153 62 classes with learners trained with only a subset (on which they
170 are then tested). 154 are then tested).
171 The conclusion discusses 155 The conclusion discusses
172 the more general question of why deep learners may benefit so much from 156 the more general question of why deep learners may benefit so much from
173 the self-taught learning framework. 157 the self-taught learning framework. Since out-of-distribution data
158 (perturbed or from other related classes) is very common, this conclusion
159 is of practical importance.
174 160
175 %\vspace*{-3mm} 161 %\vspace*{-3mm}
176 \newpage 162 %\newpage
177 \section{Perturbation and Transformation of Character Images} 163 \section{Perturbed and Transformed Character Images}
178 \label{s:perturbations} 164 \label{s:perturbations}
179 %\vspace*{-2mm} 165 %\vspace*{-2mm}
180 166
181 \begin{wrapfigure}[8]{l}{0.15\textwidth} 167 \begin{wrapfigure}[8]{l}{0.15\textwidth}
182 %\begin{minipage}[b]{0.14\linewidth} 168 %\begin{minipage}[b]{0.14\linewidth}
183 %\vspace*{-5mm} 169 %\vspace*{-5mm}
184 \begin{center} 170 \begin{center}
185 \includegraphics[scale=.4]{images/Original.png}\\ 171 \includegraphics[scale=.4]{Original.png}\\
186 {\bf Original} 172 {\bf Original}
187 \end{center} 173 \end{center}
188 \end{wrapfigure} 174 \end{wrapfigure}
189 %%\vspace{0.7cm} 175 %%\vspace{0.7cm}
190 %\end{minipage}% 176 %\end{minipage}%
196 which we start. 182 which we start.
197 Although character transformations have been used before to 183 Although character transformations have been used before to
198 improve character recognizers, this effort is on a large scale both 184 improve character recognizers, this effort is on a large scale both
199 in the number of classes and in the complexity of the transformations, hence 185 in the number of classes and in the complexity of the transformations, hence
200 in the complexity of the learning task. 186 in the complexity of the learning task.
201 More details can
202 be found in this technical report~\citep{ift6266-tr-anonymous}.
203 The code for these transformations (mostly python) is available at 187 The code for these transformations (mostly python) is available at
204 {\tt http://anonymous.url.net}. All the modules in the pipeline share 188 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
205 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the 189 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
206 amount of deformation or noise introduced. 190 amount of deformation or noise introduced.
207 There are two main parts in the pipeline. The first one, 191 There are two main parts in the pipeline. The first one,
208 from slant to pinch below, performs transformations. The second 192 from slant to pinch below, performs transformations. The second
209 part, from blur to contrast, adds different kinds of noise. 193 part, from blur to contrast, adds different kinds of noise.
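To make the control flow concrete, the following minimal Python sketch (illustrative only, not the released ift6266 code; the module names and interface are assumptions) shows how every module can share the single $complexity$ parameter, with the transformation modules applied before the noise modules:
\begin{verbatim}
import numpy as np

class TransformationPipeline(object):
    """Chain of modules sharing one complexity knob in [0, 1]."""
    def __init__(self, transformations, noise_modules, complexity=0.5):
        assert 0.0 <= complexity <= 1.0
        # Transformations (slant ... pinch) run before noise (blur ... contrast).
        self.modules = list(transformations) + list(noise_modules)
        self.complexity = complexity

    def __call__(self, image, rng):
        for module in self.modules:
            image = module(image, self.complexity, rng)
        return image

def slant(image, complexity, rng):
    # Placeholder module obeying the assumed interface: the amount of
    # deformation grows with the shared complexity parameter.
    shear = rng.uniform(-complexity, complexity)
    return image  # a real module would apply a horizontal shear of `shear`

pipeline = TransformationPipeline([slant], [], complexity=0.3)
out = pipeline(np.zeros((32, 32)), np.random.RandomState(0))
\end{verbatim}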
219 %\begin{wrapfigure}[7]{l}{0.15\textwidth} 203 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
220 \begin{minipage}[b]{0.14\linewidth} 204 \begin{minipage}[b]{0.14\linewidth}
221 %\centering 205 %\centering
222 \begin{center} 206 \begin{center}
223 \vspace*{-5mm} 207 \vspace*{-5mm}
224 \includegraphics[scale=.4]{images/Thick_only.png}\\ 208 \includegraphics[scale=.4]{Thick_only.png}\\
225 %{\bf Thickness} 209 %{\bf Thickness}
226 \end{center} 210 \end{center}
227 \vspace{.6cm} 211 \vspace{.6cm}
228 \end{minipage}% 212 \end{minipage}%
229 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} 213 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
247 \subsubsection*{Slant} 231 \subsubsection*{Slant}
248 \vspace*{2mm} 232 \vspace*{2mm}
249 233
250 \begin{minipage}[b]{0.14\linewidth} 234 \begin{minipage}[b]{0.14\linewidth}
251 \centering 235 \centering
252 \includegraphics[scale=.4]{images/Slant_only.png}\\ 236 \includegraphics[scale=.4]{Slant_only.png}\\
253 %{\bf Slant} 237 %{\bf Slant}
254 \end{minipage}% 238 \end{minipage}%
255 \hspace{0.3cm} 239 \hspace{0.3cm}
256 \begin{minipage}[b]{0.83\linewidth} 240 \begin{minipage}[b]{0.83\linewidth}
257 %\centering 241 %\centering
269 253
270 \begin{minipage}[b]{0.14\linewidth} 254 \begin{minipage}[b]{0.14\linewidth}
271 %\centering 255 %\centering
272 %\begin{wrapfigure}[8]{l}{0.15\textwidth} 256 %\begin{wrapfigure}[8]{l}{0.15\textwidth}
273 \begin{center} 257 \begin{center}
274 \includegraphics[scale=.4]{images/Affine_only.png} 258 \includegraphics[scale=.4]{Affine_only.png}
275 \vspace*{6mm} 259 \vspace*{6mm}
276 %{\small {\bf Affine \mbox{Transformation}}} 260 %{\small {\bf Affine \mbox{Transformation}}}
277 \end{center} 261 \end{center}
278 %\end{wrapfigure} 262 %\end{wrapfigure}
279 \end{minipage}% 263 \end{minipage}%
299 %\hspace*{-8mm} 283 %\hspace*{-8mm}
300 \begin{minipage}[b]{0.14\linewidth} 284 \begin{minipage}[b]{0.14\linewidth}
301 %\centering 285 %\centering
302 \begin{center} 286 \begin{center}
303 \vspace*{5mm} 287 \vspace*{5mm}
304 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png} 288 \includegraphics[scale=.4]{Localelasticdistorsions_only.png}
305 %{\bf Local Elastic Deformation} 289 %{\bf Local Elastic Deformation}
306 \end{center} 290 \end{center}
307 %\end{wrapfigure} 291 %\end{wrapfigure}
308 \end{minipage}% 292 \end{minipage}%
309 \hspace{3mm} 293 \hspace{3mm}
326 \begin{minipage}[b]{0.14\linewidth} 310 \begin{minipage}[b]{0.14\linewidth}
327 %\centering 311 %\centering
328 %\begin{wrapfigure}[7]{l}{0.15\textwidth} 312 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
329 %\vspace*{-5mm} 313 %\vspace*{-5mm}
330 \begin{center} 314 \begin{center}
331 \includegraphics[scale=.4]{images/Pinch_only.png}\\ 315 \includegraphics[scale=.4]{Pinch_only.png}\\
332 \vspace*{15mm} 316 \vspace*{15mm}
333 %{\bf Pinch} 317 %{\bf Pinch}
334 \end{center} 318 \end{center}
335 %\end{wrapfigure} 319 %\end{wrapfigure}
336 %%\vspace{.6cm} 320 %%\vspace{.6cm}
363 347
364 %%\vspace*{-.2cm} 348 %%\vspace*{-.2cm}
365 \begin{minipage}[t]{0.14\linewidth} 349 \begin{minipage}[t]{0.14\linewidth}
366 \centering 350 \centering
367 \vspace*{0mm} 351 \vspace*{0mm}
368 \includegraphics[scale=.4]{images/Motionblur_only.png} 352 \includegraphics[scale=.4]{Motionblur_only.png}
369 %{\bf Motion Blur} 353 %{\bf Motion Blur}
370 \end{minipage}% 354 \end{minipage}%
371 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 355 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
372 %%\vspace*{.5mm} 356 %%\vspace*{.5mm}
373 \vspace*{2mm} 357 \vspace*{2mm}
384 \subsubsection*{Occlusion} 368 \subsubsection*{Occlusion}
385 369
386 \begin{minipage}[t]{0.14\linewidth} 370 \begin{minipage}[t]{0.14\linewidth}
387 \centering 371 \centering
388 \vspace*{3mm} 372 \vspace*{3mm}
389 \includegraphics[scale=.4]{images/occlusion_only.png}\\ 373 \includegraphics[scale=.4]{occlusion_only.png}\\
390 %{\bf Occlusion} 374 %{\bf Occlusion}
391 %%\vspace{.5cm} 375 %%\vspace{.5cm}
392 \end{minipage}% 376 \end{minipage}%
393 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 377 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
394 %\vspace*{-18mm} 378 %\vspace*{-18mm}
397 image. Pixels are combined by taking the max(occluder, occluded), 381 image. Pixels are combined by taking the max(occluder, occluded),
398 i.e. keeping the lighter ones. 382 i.e. keeping the lighter ones.
399 The rectangle corners 383 The rectangle corners
400 are sampled so that larger complexity gives larger rectangles. 384 are sampled so that larger complexity gives larger rectangles.
401 The destination position in the occluded image is also sampled 385 The destination position in the occluded image is also sampled
402 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). 386 according to a normal distribution.
403 This module is skipped with probability 60\%. 387 This module is skipped with probability 60\%.
404 %%\vspace{7mm} 388 %%\vspace{7mm}
405 \end{minipage} 389 \end{minipage}
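A minimal sketch of such an occlusion step is given below, assuming grey-scale arrays with values in $[0,1]$; only the max-combination, the complexity-dependent rectangle size, the normally distributed destination and the 60\% skip probability are taken from the description above, the exact sampling distributions being illustrative:
\begin{verbatim}
import numpy as np

def occlude(image, occluder, complexity, rng):
    # Skipped with probability 60%, as in the text.
    if rng.uniform() < 0.6:
        return image
    h, w = image.shape
    # Rectangle size grows with complexity (illustrative scaling only).
    rh = max(1, int(rng.uniform(0.2, 1.0) * complexity * h))
    rw = max(1, int(rng.uniform(0.2, 1.0) * complexity * w))
    sy = rng.randint(0, h - rh + 1)
    sx = rng.randint(0, w - rw + 1)
    patch = occluder[sy:sy + rh, sx:sx + rw]
    # Destination sampled from a normal distribution around the image centre.
    dy = int(np.clip(rng.normal(h / 2.0, h / 4.0), 0, h - rh))
    dx = int(np.clip(rng.normal(w / 2.0, w / 4.0), 0, w - rw))
    out = image.copy()
    # Combine pixels with max(occluder, occluded), keeping the lighter ones.
    out[dy:dy + rh, dx:dx + rw] = np.maximum(out[dy:dy + rh, dx:dx + rw], patch)
    return out

rng = np.random.RandomState(0)
img = rng.uniform(size=(32, 32))
occluded = occlude(img, rng.uniform(size=(32, 32)), complexity=0.7, rng=rng)
\end{verbatim}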
406 390
407 %\vspace*{1mm} 391 %\vspace*{1mm}
411 %\vspace*{-6mm} 395 %\vspace*{-6mm}
412 \begin{minipage}[t]{0.14\linewidth} 396 \begin{minipage}[t]{0.14\linewidth}
413 \begin{center} 397 \begin{center}
414 %\centering 398 %\centering
415 \vspace*{6mm} 399 \vspace*{6mm}
416 \includegraphics[scale=.4]{images/Bruitgauss_only.png} 400 \includegraphics[scale=.4]{Bruitgauss_only.png}
417 %{\bf Gaussian Smoothing} 401 %{\bf Gaussian Smoothing}
418 \end{center} 402 \end{center}
419 %\end{wrapfigure} 403 %\end{wrapfigure}
420 %%\vspace{.5cm} 404 %%\vspace{.5cm}
421 \end{minipage}% 405 \end{minipage}%
447 \begin{minipage}[t]{0.14\textwidth} 431 \begin{minipage}[t]{0.14\textwidth}
448 %\begin{wrapfigure}[7]{l}{ 432 %\begin{wrapfigure}[7]{l}{
449 %\vspace*{-5mm} 433 %\vspace*{-5mm}
450 \begin{center} 434 \begin{center}
451 \vspace*{1mm} 435 \vspace*{1mm}
452 \includegraphics[scale=.4]{images/Permutpixel_only.png} 436 \includegraphics[scale=.4]{Permutpixel_only.png}
453 %{\small\bf Permute Pixels} 437 %{\small\bf Permute Pixels}
454 \end{center} 438 \end{center}
455 %\end{wrapfigure} 439 %\end{wrapfigure}
456 \end{minipage}% 440 \end{minipage}%
457 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth} 441 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth}
474 %%\vspace*{-3mm} 458 %%\vspace*{-3mm}
475 \begin{center} 459 \begin{center}
476 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} 460 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
477 %\centering 461 %\centering
478 \vspace*{0mm} 462 \vspace*{0mm}
479 \includegraphics[scale=.4]{images/Distorsiongauss_only.png} 463 \includegraphics[scale=.4]{Distorsiongauss_only.png}
480 %{\small \bf Gauss. Noise} 464 %{\small \bf Gauss. Noise}
481 \end{center} 465 \end{center}
482 %\end{wrapfigure} 466 %\end{wrapfigure}
483 \end{minipage}% 467 \end{minipage}%
484 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} 468 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
496 480
497 \begin{minipage}[t]{\linewidth} 481 \begin{minipage}[t]{\linewidth}
498 \begin{minipage}[t]{0.14\linewidth} 482 \begin{minipage}[t]{0.14\linewidth}
499 \centering 483 \centering
500 \vspace*{0mm} 484 \vspace*{0mm}
501 \includegraphics[scale=.4]{images/background_other_only.png} 485 \includegraphics[scale=.4]{background_other_only.png}
502 %{\small \bf Bg Image} 486 %{\small \bf Bg Image}
503 \end{minipage}% 487 \end{minipage}%
504 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 488 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
505 \vspace*{1mm} 489 \vspace*{1mm}
506 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random 490 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random
515 \subsubsection*{Salt and Pepper Noise} 499 \subsubsection*{Salt and Pepper Noise}
516 500
517 \begin{minipage}[t]{0.14\linewidth} 501 \begin{minipage}[t]{0.14\linewidth}
518 \centering 502 \centering
519 \vspace*{0mm} 503 \vspace*{0mm}
520 \includegraphics[scale=.4]{images/Poivresel_only.png} 504 \includegraphics[scale=.4]{Poivresel_only.png}
521 %{\small \bf Salt \& Pepper} 505 %{\small \bf Salt \& Pepper}
522 \end{minipage}% 506 \end{minipage}%
523 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} 507 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
524 \vspace*{1mm} 508 \vspace*{1mm}
525 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels. 509 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels.
537 %\begin{minipage}[t]{0.14\linewidth} 521 %\begin{minipage}[t]{0.14\linewidth}
538 %\centering 522 %\centering
539 \begin{center} 523 \begin{center}
540 \vspace*{4mm} 524 \vspace*{4mm}
541 %\hspace*{-1mm} 525 %\hspace*{-1mm}
542 \includegraphics[scale=.4]{images/Rature_only.png}\\ 526 \includegraphics[scale=.4]{Rature_only.png}\\
543 %{\bf Scratches} 527 %{\bf Scratches}
544 \end{center} 528 \end{center}
545 \end{minipage}% 529 \end{minipage}%
546 %\end{wrapfigure} 530 %\end{wrapfigure}
547 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} 531 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
563 \subsubsection*{Grey Level and Contrast Changes} 547 \subsubsection*{Grey Level and Contrast Changes}
564 548
565 \begin{minipage}[t]{0.15\linewidth} 549 \begin{minipage}[t]{0.15\linewidth}
566 \centering 550 \centering
567 \vspace*{0mm} 551 \vspace*{0mm}
568 \includegraphics[scale=.4]{images/Contrast_only.png} 552 \includegraphics[scale=.4]{Contrast_only.png}
569 %{\bf Grey Level \& Contrast} 553 %{\bf Grey Level \& Contrast}
570 \end{minipage}% 554 \end{minipage}%
571 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth} 555 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth}
572 \vspace*{1mm} 556 \vspace*{1mm}
573 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white 557 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white
579 %\vspace{2mm} 563 %\vspace{2mm}
580 564
581 565
582 \iffalse 566 \iffalse
583 \begin{figure}[ht] 567 \begin{figure}[ht]
584 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\ 568 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{example_t.png}}}\\
585 \caption{Illustration of the pipeline of stochastic 569 \caption{Illustration of the pipeline of stochastic
586 transformations applied to the image of a lower-case \emph{t} 570 transformations applied to the image of a lower-case \emph{t}
587 (the upper left image). Each image in the pipeline (going from 571 (the upper left image). Each image in the pipeline (going from
588 left to right, first top line, then bottom line) shows the result 572 left to right, first top line, then bottom line) shows the result
589 of applying one of the modules in the pipeline. The last image 573 of applying one of the modules in the pipeline. The last image
624 %\citep{SorokinAndForsyth2008,whitehill09}. 608 %\citep{SorokinAndForsyth2008,whitehill09}.
625 AMT users were presented 609 AMT users were presented
626 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII 610 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
627 characters. They were forced to choose a single character class (either among the 611 characters. They were forced to choose a single character class (either among the
628 62 or 10 character classes) for each image. 612 62 or 10 character classes) for each image.
629 80 subjects classified 2500 images per (dataset,task) pair, 613 80 subjects classified 2500 images per (dataset,task) pair.
630 with the guarantee that 3 different subjects classified each image, allowing 614 Different human labelers sometimes provided a different label for the same
631 us to estimate inter-human variability (e.g a standard error of 0.1\% 615 example, and we were able to estimate the error variance due to this effect
632 on the average 18.2\% error done by humans on the 62-class task NIST test set). 616 because each image was classified by 3 different persons.
617 The average error of humans on the 62-class task NIST test set
618 is 18.2\%, with a standard error of 0.1\%.
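The quoted standard error is what one obtains by treating the human error rate as a binomial proportion; the short calculation below is purely illustrative (the exact number of human judgements is not restated here, so the value of $n$ is an assumption):
\begin{verbatim}
import math

def error_rate_stderr(p, n):
    # Standard error of an error rate treated as a binomial proportion.
    return math.sqrt(p * (1.0 - p) / n)

# With p = 0.182, a standard error of about 0.1% corresponds to roughly
# n ~ 150,000 labelled test judgements (n chosen here for illustration).
print(error_rate_stderr(0.182, 150000))   # ~ 0.000996
\end{verbatim}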
633 619
634 %\vspace*{-3mm} 620 %\vspace*{-3mm}
635 \subsection{Data Sources} 621 \subsection{Data Sources}
636 %\vspace*{-2mm} 622 %\vspace*{-2mm}
637 623
731 717
732 {\bf Multi-Layer Perceptrons (MLP).} 718 {\bf Multi-Layer Perceptrons (MLP).}
733 Whereas previous work had compared deep architectures to both shallow MLPs and 719 Whereas previous work had compared deep architectures to both shallow MLPs and
734 SVMs, we only compared to MLPs here because of the very large datasets used 720 SVMs, we only compared to MLPs here because of the very large datasets used
735 (making the use of SVMs computationally challenging because of their quadratic 721 (making the use of SVMs computationally challenging because of their quadratic
736 scaling behavior). 722 scaling behavior). Preliminary experiments training SVMs (libSVM) on subsets of the training
723 set small enough to fit in memory yielded substantially worse results
724 than those obtained with MLPs. For training on nearly a billion examples
725 (with the perturbed data), the MLPs and SDA are much more convenient than
726 classifiers based on kernel methods.
737 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized 727 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
738 exponentials) on the output layer for estimating $P(class | image)$. 728 exponentials) on the output layer for estimating $P(class | image)$.
739 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 729 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
740 Training examples are presented in minibatches of size 20. A constant learning 730 Training examples are presented in minibatches of size 20. A constant learning
741 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. 731 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
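For concreteness, a minimal numpy sketch of such a shallow baseline follows (not the authors' implementation; initialization scales and the example sizes are placeholders): one $\tanh$ hidden layer, a softmax output estimating $P(class | image)$, and plain stochastic gradient descent on minibatches of 20 with a constant learning rate.
\begin{verbatim}
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class MLP(object):
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.1, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = np.tanh(x.dot(self.W1) + self.b1)
        return h, softmax(h.dot(self.W2) + self.b2)

    def sgd_step(self, x, y_onehot, lr=0.01):
        h, p = self.forward(x)
        # Gradient of the minibatch negative log-likelihood of P(class|image).
        d_out = (p - y_onehot) / x.shape[0]
        d_h = d_out.dot(self.W2.T) * (1.0 - h ** 2)
        self.W2 -= lr * h.T.dot(d_out); self.b2 -= lr * d_out.sum(axis=0)
        self.W1 -= lr * x.T.dot(d_h);   self.b1 -= lr * d_h.sum(axis=0)

rng = np.random.RandomState(0)
mlp = MLP(n_in=32 * 32, n_hidden=1000, n_out=62, rng=rng)
x = rng.uniform(size=(20, 32 * 32))          # one minibatch of 20 images
y = np.eye(62)[rng.randint(0, 62, size=20)]  # one-hot labels
mlp.sgd_step(x, y, lr=0.01)
\end{verbatim}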
749 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) 739 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
750 can be used to initialize the weights of each layer of a deep MLP (with many hidden 740 can be used to initialize the weights of each layer of a deep MLP (with many hidden
751 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, 741 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
752 apparently setting parameters in the 742 apparently setting parameters in the
753 basin of attraction of supervised gradient descent that yields better 743 basin of attraction of supervised gradient descent that yields better
754 generalization~\citep{Erhan+al-2010}. It is hypothesized that the 744 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
745 pre-training phase} uses all of the training images but not the training labels.
746 Each layer is trained in turn to produce a new representation of its input
747 (starting from the raw pixels).
748 It is hypothesized that the
755 advantage brought by this procedure stems from a better prior, 749 advantage brought by this procedure stems from a better prior,
756 on the one hand taking advantage of the link between the input 750 on the one hand taking advantage of the link between the input
757 distribution $P(x)$ and the conditional distribution of interest 751 distribution $P(x)$ and the conditional distribution of interest
758 $P(y|x)$ (like in semi-supervised learning), and on the other hand 752 $P(y|x)$ (like in semi-supervised learning), and on the other hand
759 taking advantage of the expressive power and bias implicit in the 753 taking advantage of the expressive power and bias implicit in the
760 deep architecture (whereby complex concepts are expressed as 754 deep architecture (whereby complex concepts are expressed as
761 compositions of simpler ones through a deep hierarchy). 755 compositions of simpler ones through a deep hierarchy).
762 756
763 \begin{figure}[ht] 757 \begin{figure}[ht]
764 %\vspace*{-2mm} 758 %\vspace*{-2mm}
765 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} 759 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{denoising_autoencoder_small.pdf}}}
766 %\vspace*{-2mm} 760 %\vspace*{-2mm}
767 \caption{Illustration of the computations and training criterion for the denoising 761 \caption{Illustration of the computations and training criterion for the denoising
768 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of 762 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
769 the layer (i.e. raw input or output of previous layer) 763 the layer (i.e. raw input or output of previous layer)
770 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. 764 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
800 fixed proportion of the input values, randomly selected, are zeroed), and a 794 fixed proportion of the input values, randomly selected, are zeroed), and a
801 separate learning rate for the unsupervised pre-training stage (selected 795 separate learning rate for the unsupervised pre-training stage (selected
802 from the same above set). The fraction of inputs corrupted was selected 796 from the same above set). The fraction of inputs corrupted was selected
803 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number 797 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
804 of hidden layers but it was fixed to 3 based on previous work with 798 of hidden layers but it was fixed to 3 based on previous work with
805 SDAs on MNIST~\citep{VincentPLarochelleH2008}. 799 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The number of hidden
800 units was kept the same across all hidden layers, and the best results
801 were obtained with the largest value that we could afford to
802 experiment with, namely 1000 hidden units.
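The sketch below illustrates, under the assumptions above (it is not the released code), one denoising auto-encoder layer with tied weights and the greedy layer-wise pre-training loop: a fraction of the input components is zeroed, the corrupted input is encoded and decoded, the parameters are updated to reconstruct the clean input, and the code of each trained layer becomes the input of the next.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder(object):
    def __init__(self, n_in, n_hidden, rng, corruption=0.2):
        self.W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_in)
        self.corruption = corruption
        self.rng = rng

    def encode(self, x):
        return sigmoid(x.dot(self.W) + self.b_h)

    def train_step(self, x, lr=0.01):
        # Corrupt: zero a `corruption` fraction of randomly chosen inputs.
        mask = self.rng.uniform(size=x.shape) > self.corruption
        x_tilde = x * mask
        y = self.encode(x_tilde)
        z = sigmoid(y.dot(self.W.T) + self.b_v)   # tied-weight reconstruction
        # Cross-entropy reconstruction gradient (inputs assumed in [0, 1]).
        d_z = (z - x) / x.shape[0]
        d_y = d_z.dot(self.W) * y * (1.0 - y)
        self.W -= lr * (x_tilde.T.dot(d_y) + d_z.T.dot(y))
        self.b_v -= lr * d_z.sum(axis=0)
        self.b_h -= lr * d_y.sum(axis=0)
        return y

# Greedy layer-wise pre-training: each layer's code feeds the next layer.
rng = np.random.RandomState(0)
layers = [DenoisingAutoencoder(32 * 32, 1000, rng),
          DenoisingAutoencoder(1000, 1000, rng),
          DenoisingAutoencoder(1000, 1000, rng)]
x = rng.uniform(size=(20, 32 * 32))
for da in layers:
    x = da.train_step(x)
\end{verbatim}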
806 803
807 %\vspace*{-1mm} 804 %\vspace*{-1mm}
808 805
809 \begin{figure}[ht] 806 \begin{figure}[ht]
810 %\vspace*{-2mm} 807 %\vspace*{-2mm}
811 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} 808 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{error_rates_charts.pdf}}}
812 %\vspace*{-3mm} 809 %\vspace*{-3mm}
813 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained 810 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
814 on NIST, 1 on NISTP, and 2 on P07. Left: overall results 811 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
815 of all models, on NIST and NISTP test sets. 812 of all models, on NIST and NISTP test sets.
816 Right: error rates on NIST test digits only, along with the previous results from 813 Right: error rates on NIST test digits only, along with the previous results from
821 \end{figure} 818 \end{figure}
822 819
823 820
824 \begin{figure}[ht] 821 \begin{figure}[ht]
825 %\vspace*{-3mm} 822 %\vspace*{-3mm}
826 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} 823 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{improvements_charts.pdf}}}
827 %\vspace*{-3mm} 824 %\vspace*{-3mm}
828 \caption{Relative improvement in error rate due to self-taught learning. 825 \caption{Relative improvement in error rate due to self-taught learning.
829 Left: Improvement (or loss, when negative) 826 Left: Improvement (or loss, when negative)
830 induced by out-of-distribution examples (perturbed data). 827 induced by out-of-distribution examples (perturbed data).
831 Right: Improvement (or loss, when negative) induced by multi-task 828 Right: Improvement (or loss, when negative) induced by multi-task
854 19 test set from the literature, respectively based on ARTMAP neural 851 19 test set from the literature, respectively based on ARTMAP neural
855 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search 852 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search
856 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs 853 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs
857 ~\citep{Milgram+al-2005}. More detailed and complete numerical results 854 ~\citep{Milgram+al-2005}. More detailed and complete numerical results
858 (figures and tables, including standard errors on the error rates) can be 855 (figures and tables, including standard errors on the error rates) can be
859 found in Appendix I of the supplementary material. 856 found in Appendix I.
860 The deep learner not only outperformed the shallow ones and 857 The deep learner not only outperformed the shallow ones and
861 previously published performance (in a statistically and qualitatively 858 previously published performance (in a statistically and qualitatively
862 significant way) but when trained with perturbed data 859 significant way) but when trained with perturbed data
863 reaches human performance on both the 62-class task 860 reaches human performance on both the 62-class task
864 and the 10-class (digits) task. 861 and the 10-class (digits) task.
945 {\bf Do the good results previously obtained with deep architectures on the 942 {\bf Do the good results previously obtained with deep architectures on the
946 MNIST digits generalize to a much larger and richer (but similar) 943 MNIST digits generalize to a much larger and richer (but similar)
947 dataset, the NIST special database 19, with 62 classes and around 800k examples}? 944 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
948 Yes, the SDA {\em systematically outperformed the MLP and all the previously 945 Yes, the SDA {\em systematically outperformed the MLP and all the previously
949 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level 946 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
950 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. 947 performance} at around 17\% error on the 62-class task and 1.4\% on the digits,
948 and beating previously published results on the same data.
951 949
952 $\bullet$ %\item 950 $\bullet$ %\item
953 {\bf To what extent do self-taught learning scenarios help deep learners, 951 {\bf To what extent do self-taught learning scenarios help deep learners,
954 and do they help them more than shallow supervised ones}? 952 and do they help them more than shallow supervised ones}?
955 We found that distorted training examples not only made the resulting 953 We found that distorted training examples not only made the resulting
981 in the asymptotic regime. 979 in the asymptotic regime.
982 980
983 {\bf Why would deep learners benefit more from the self-taught learning framework}? 981 {\bf Why would deep learners benefit more from the self-taught learning framework}?
984 The key idea is that the lower layers of the predictor compute a hierarchy 982 The key idea is that the lower layers of the predictor compute a hierarchy
985 of features that can be shared across tasks or across variants of the 983 of features that can be shared across tasks or across variants of the
986 input distribution. Intermediate features that can be used in different 984 input distribution. A theoretical analysis of generalization improvements
985 due to sharing of intermediate features across tasks already points
986 towards that explanation~\citep{baxter95a}.
987 Intermediate features that can be used in different
987 contexts can be estimated in a way that allows to share statistical 988 contexts can be estimated in a way that allows to share statistical
988 strength. Features extracted through many levels are more likely to 989 strength. Features extracted through many levels are more likely to
989 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest), 990 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
990 increasing the likelihood that they would be useful for a larger array 991 increasing the likelihood that they would be useful for a larger array
991 of tasks and input conditions. 992 of tasks and input conditions.
1009 with deep learning and self-taught learning. 1010 with deep learning and self-taught learning.
1010 1011
1011 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 1012 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
1012 can be executed on-line at {\tt http://deep.host22.com}. 1013 can be executed on-line at {\tt http://deep.host22.com}.
1013 1014
1014 %\newpage 1015
1016 \section*{Appendix I: Detailed Numerical Results}
1017
1018 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered.
1019 They also contain additional data such as test errors on P07 and standard errors.
1020
1021 \begin{table}[ht]
1022 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
1023 26 lower + 26 upper), except for the last column -- digits only, between deep architecture with pre-training
1024 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture
1025 (MLP=Multi-Layer Perceptron). All models were trained on clean (NIST) or perturbed data (NISTP or P07),
1026 using a validation set to select hyper-parameters and other training choices.
1027 \{SDA,MLP\}0 are trained on NIST,
1028 \{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07.
1029 The human error rate on digits is a lower bound because it does not count digits that were
1030 recognized as letters. For comparison, the results found in the literature
1031 on NIST digits classification using the same test set are included.}
1032 \label{tab:sda-vs-mlp-vs-humans}
1033 \begin{center}
1034 \begin{tabular}{|l|r|r|r|r|} \hline
1035 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
1036 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $1.4\%$ \\ \hline
1037 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline
1038 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline
1039 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline
1040 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
1041 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline
1042 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline
1043 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline
1044 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline
1045 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
1046 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline
1047 \end{tabular}
1048 \end{center}
1049 \end{table}
1050
1051 \begin{table}[ht]
1052 \caption{Relative change in error rates due to the use of perturbed training data,
1053 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
1054 A positive value indicates that training on the perturbed data helped for the
1055 given test set (the first 3 columns on the 62-class tasks and the last one is
1056 on the clean 10-class digits). Clearly, the deep learning models did benefit more
1057 from perturbed training data, even when testing on clean data, whereas the MLP
1058 trained on perturbed data performed worse on the clean digits and about the same
1059 on the clean characters. }
1060 \label{tab:perturbation-effect}
1061 \begin{center}
1062 \begin{tabular}{|l|r|r|r|r|} \hline
1063 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
1064 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline
1065 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline
1066 MLP0/MLP1-1 & 5.2\% & 65\% & -13\% & -10\% \\ \hline
1067 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline
1068 \end{tabular}
1069 \end{center}
1070 \end{table}
1071
1072 \begin{table}[ht]
1073 \caption{Test error rates and relative change in error rates due to the use of
1074 a multi-task setting, i.e., training on each task in isolation vs training
1075 for all three tasks together, for MLPs vs SDAs. The SDA benefits much
1076 more from the multi-task setting. All experiments used only the
1077 unperturbed NIST data, using validation error for model selection.
1078 Relative improvement is 1 - single-task error / multi-task error.}
1079 \label{tab:multi-task}
1080 \begin{center}
1081 \begin{tabular}{|l|r|r|r|} \hline
1082 & single-task & multi-task & relative \\
1083 & setting & setting & improvement \\ \hline
1084 MLP-digits & 3.77\% & 3.99\% & 5.6\% \\ \hline
1085 MLP-lower & 17.4\% & 16.8\% & -4.1\% \\ \hline
1086 MLP-upper & 7.84\% & 7.54\% & -3.6\% \\ \hline
1087 SDA-digits & 2.6\% & 3.56\% & 27\% \\ \hline
1088 SDA-lower & 12.3\% & 14.4\% & 15\% \\ \hline
1089 SDA-upper & 5.93\% & 6.78\% & 13\% \\ \hline
1090 \end{tabular}
1091 \end{center}
1092 \end{table}
1093
1094 %\afterpage{\clearpage}
1095 \clearpage
1015 { 1096 {
1016 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,specials,aigaion-shorter} 1097 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,specials,aigaion-shorter}
1017 %\bibliographystyle{plainnat} 1098 %\bibliographystyle{plainnat}
1018 \bibliographystyle{unsrtnat} 1099 \bibliographystyle{unsrtnat}
1019 %\bibliographystyle{apalike} 1100 %\bibliographystyle{apalike}