% Source: Mercurial repository ift6266, file writeup/nips2010_submission.tex, revision 484:9a757d565e46
% Commit message: size reduction
% Commit author: Yoshua Bengio <bengioy@iro.umontreal.ca>
% Commit date: Mon, 31 May 2010 20:42:22 -0400
% Parent revision: b9cdb464de5f; child revision: 6beaf3328521
\documentclass{article} % For LaTeX2e
\usepackage{nips10submit_e,times}
\usepackage{amsthm,amsmath,amssymb,bbold,bbm}
\usepackage{algorithm,algorithmic}
\usepackage[utf8]{inputenc}
\usepackage{graphicx,subfigure}
\usepackage[numbers]{natbib}

\title{Deep Self-Taught Learning for Handwritten Character Recognition}
\author{The IFT6266 Gang}

\begin{document}

%\makeanontitle
\maketitle

\vspace*{-2mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has
demonstrated the importance of learning algorithms for deep
architectures, i.e., function classes obtained by composing multiple
non-linear transformations. Self-taught learning (exploiting unlabeled
examples or examples from other distributions) has already been applied
to deep learners, but mostly to show the advantage of unlabeled
examples. Here we explore the advantage brought by {\em out-of-distribution
examples} and show that {\em deep learners benefit more from them than a
corresponding shallow learner}, in the area
of handwritten character recognition. In fact, we show that they reach
human-level performance on both handwritten digit classification and
62-class handwritten character recognition. For this purpose we
developed a powerful generator of stochastic variations and noise
processes for character images, including not only affine transformations but
also slant, local elastic deformations, changes in thickness, background
images, color, contrast, occlusion, and various types of pixel and
spatially correlated noise. The out-of-distribution examples are
obtained by training with these highly distorted images or
by including object classes different from those in the target test set.
\end{abstract}
\vspace*{-2mm}

\section{Introduction}
\vspace*{-1mm}

Deep Learning has emerged as a promising new area of research in
statistical machine learning (see~\citet{Bengio-2009} for a review).
Learning algorithms for deep architectures are centered on the learning
of useful representations of data, which are better suited to the task at hand.
This is in great part inspired by observations of the mammalian visual cortex,
which consists of a chain of processing elements, each of which is associated with a
different representation of the raw visual input. In fact,
it was found recently that the features learnt in deep architectures resemble
those observed in the first two of these stages (in areas V1 and V2
of visual cortex)~\citep{HonglakL2008}, and that they become more and
more invariant to factors of variation (such as camera movement) in
higher layers~\cite{Goodfellow2009}.
Learning a hierarchy of features increases the
ease and practicality of developing representations that are at once
tailored to specific tasks, yet are able to borrow statistical strength
from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
feature representation can lead to higher-level (more abstract, more
general) features that are more robust to unanticipated sources of
variance extant in real data.

Whereas a deep architecture can in principle be more powerful than a
shallow one in terms of representation, depth appears to render the
training problem more difficult in terms of optimization and local minima.
It is also only recently that successful algorithms were proposed to
overcome some of these difficulties. All are based on unsupervised
learning, often in a greedy layer-wise ``unsupervised pre-training''
stage~\citep{Bengio-2009}. One of these layer initialization techniques,
applied here, is the Denoising
Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which
performed similarly to or better than previously proposed Restricted Boltzmann
Machines in terms of unsupervised extraction of a hierarchy of features
useful for classification.
The principle is that each layer, starting from
the bottom, is trained to encode its input (the output of the previous
layer) and to reconstruct it from a corrupted version. After this
unsupervised initialization, the stack of denoising auto-encoders can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.

Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
of semi-supervised and multi-task learning: the learner can exploit examples
that are unlabeled and/or come from a distribution different from the target
distribution, e.g., from other classes than those of interest. Whereas
it has already been shown that deep learners can clearly take advantage of
unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008}
and multi-task learning, not much has been done yet to explore the impact
of {\em out-of-distribution} examples and of the multi-task setting
(but see~\citep{CollobertR2008-short}). In particular the {\em relative
advantage} of deep learning for these settings has not been evaluated.

In this paper we ask the following questions:

%\begin{enumerate}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digit images generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifiers better not only on similarly perturbed images but also on
the {\em original clean examples}?

$\bullet$ %\item
Do deep architectures {\em benefit more from such out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
%\end{enumerate}

The experimental results presented here provide positive evidence towards all of these questions.

\vspace*{-1mm}
\section{Perturbation and Transformation of Character Images}
\vspace*{-1mm}

This section describes the different transformations we used to stochastically
transform source images in order to obtain data. More details can
be found in a technical report~\citep{ift6266-tr-anonymous}.
The code for these transformations (mostly Python) is available at
{\tt http://anonymous.url.net}. All the modules in the pipeline share
a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
amount of deformation or noise introduced.

There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

{\large\bf Transformations}

\vspace*{2mm}

{\bf Slant.}
We mimic slant by shifting each row of the image
proportionally to its height: $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
and its value is randomly sampled according to the complexity level:
$slant \sim U[0,complexity]$, so the
maximum displacement for the lowest or highest pixel line is
$round(complexity \times 32)$.\\
{\bf Thickness.}
Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
are applied. The neighborhood of each pixel is multiplied
element-wise with a {\em structuring element} matrix.
The pixel value is replaced by the maximum or the minimum of the resulting
matrix, respectively for dilation or erosion. Ten different structuring elements with
increasing dimensions (largest is $5\times5$) were used.
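As a sketch of the operation just described (the symbols $I$, $B$, $r$, $D$ and $E$ are our own shorthand, we assume square structuring elements of odd size, and neither boundary handling nor the treatment of zero entries of $B$ is specified here), for an image $I$ and a $(2r+1)\times(2r+1)$ structuring element $B$, dilation and erosion would read
\begin{align*}
D(x,y) &= \max_{-r \le i,j \le r} B(i,j)\, I(x+i,y+j), \\
E(x,y) &= \min_{-r \le i,j \le r} B(i,j)\, I(x+i,y+j),
\end{align*}
so that the largest $5\times5$ element corresponds to $r=2$.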
For each image, we randomly sample the operator type (dilation or erosion) with equal probability and one structuring
element from a subset of the $n$ smallest structuring elements, where $n$ is
$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
for erosion. A neutral element is always present in the set, and if it is
chosen no transformation is applied. Erosion allows only the six
smallest structuring elements because when the character is too thin it may
be completely erased.\\
{\bf Affine Transformations.}
A $2 \times 3$ affine transform matrix (with
6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
Each pixel $(x,y)$ of the output image takes the value of the pixel
nearest to $(ax+by+c,dx+ey+f)$ in the input image. This
produces scaling, translation, rotation and shearing.
The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
forbid large rotations (so as not to confuse classes) while still
giving good variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
complexity]$.\\
{\bf Local Elastic Deformations.}
This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03},
which provides more details.
Two ``displacement'' fields are generated and applied, for horizontal
and vertical displacements of pixels.
To generate a pixel in either field, first a value between -1 and 1 is
chosen from a uniform distribution. Then all the pixels, in both fields, are
multiplied by a constant $\alpha$ which controls the intensity of the
displacements (larger $\alpha$ translates into larger wiggles).
Each field is convolved with a Gaussian 2D kernel of
standard deviation $\sigma$. Visually, this results in a blur.
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.\\
{\bf Pinch.}
This GIMP filter is named ``Whirl and
pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
For a square input image, think of drawing a circle of
radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
that disk (region inside circle) will have its value recalculated by taking
the value of another ``source'' pixel in the original image. The position of
that source pixel is found on the line that goes through $C$ and $P$, but
at some other distance $d_2$. Define $d_1$ to be the distance between $P$
and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
d_1$, where $pinch$ is a parameter to the filter.
The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.
Here $pinch \sim U[-complexity, 0.7 \times complexity]$.

\vspace*{1mm}

{\large\bf Injecting Noise}

\vspace*{1mm}

{\bf Motion Blur.}
This GIMP filter is a ``linear motion blur'' in GIMP
terminology, with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the $length$ first pixels
found by moving in the $angle$ direction.
Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.\\
{\bf Occlusion.}
This filter selects a random rectangle from an {\em occluder} character
image and places it over the original {\em occluded} character
image. Pixels are combined by taking the max(occluder,occluded),
closer to black. The rectangle corners
are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
It has a 60\% probability of not being applied at all.\\
{\bf Pixel Permutation.}
This filter permutes neighbouring pixels. It first selects
$\frac{complexity}{3}$ pixels randomly in the image. Each of them is then
sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers
of exchanges to the left, right, top, and bottom are equal, or differ by
at most 1 if the number of selected pixels is not a multiple of 4.
It has an 80\% probability of not being applied at all.\\
{\bf Gaussian Noise.}
This filter simply adds, to each pixel of the image independently, a
noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
It has a 70\% probability of not being applied at all.\\
{\bf Background Images.}
Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
background behind the letter. The background is chosen by first selecting,
at random, an image from a set of images. Then a 32$\times$32 subregion
of that image is chosen as the background image (by sampling position
uniformly while making sure not to cross image borders).
To combine the original letter image and the background image, contrast
adjustments are made. We first get the maximal values (i.e. maximal
intensity) for both the original image and the background image, $maximage$
and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
Each background pixel value is multiplied by $\frac{max(maximage -
contrast, 0)}{maxbg}$ (higher contrast yields a darker
background). The output image pixels are max(background,original).\\
{\bf Salt and Pepper Noise.}
This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
The number of selected pixels is $0.2 \times complexity$.
This filter has a 75\% probability of not being applied at all.\\
{\bf Spatially Gaussian Noise.}
Different regions of the image are spatially smoothed.
The image is convolved with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12,12 + 20 \times
complexity]$ and $[2,2 + 6 \times complexity]$, respectively. The result is normalized
between $0$ and $1$. We also create a symmetric averaging window, of the
kernel size, with maximum value at the center. For each image we sample
uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
averaging centers between the original image and the filtered one. We
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation: $\frac{image + filtered\_image
\times mask}{mask+1}$.
This filter has a 75\% probability of not being applied at all.\\
{\bf Scratches.}
The scratches module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among five thousand such images. The 1 image is
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
complexity)^2)$, using bicubic interpolation.
Two passes of a greyscale morphological erosion filter
are applied, reducing the width of the line
by an amount controlled by $complexity$.
This filter is applied only 15\% of the time. When it is applied, 50\%
of the time only one patch image is generated and applied; in 30\% of
cases two patches are generated, and otherwise three patches are
generated. The patch is applied by taking the maximal value of the
patch or the original image at each of the $32\times32$ pixel locations.\\
{\bf Color and Contrast Changes.}
This filter changes the contrast and may invert the image polarity (white
on black to black on white). The contrast $C$ is defined here as the
difference between the maximum and the minimum pixel value of the image.
Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
polarity is inverted with $0.5$ probability.

\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\
\caption{Illustration of the pipeline of stochastic
transformations applied to the image of a lower-case t
(the upper left image). Each image in the pipeline (going from
left to right, first top line, then bottom line) shows the result
of applying one of the modules in the pipeline. The last image
(bottom right) is used as a training example.}
\label{fig:pipeline}
\end{figure}

\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\
\caption{Illustration of each transformation applied alone to the same image
of an upper-case h (top left). First row (from left to right): original image, slant,
thickness, affine transformation, local elastic deformation; second row (from left to right):
pinch, motion blur, occlusion, pixel permutation, Gaussian noise; third row (from left to right):
background image, salt and pepper noise, spatially Gaussian noise, scratches,
color and contrast changes.}
\label{fig:transfo}
\end{figure}

\vspace*{-1mm}
\section{Experimental Setup}
\vspace*{-1mm}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
to focus here on the case of much larger training sets, from 10 times to
1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source} (NIST characters, scanned machine printed characters, characters
from fonts, or characters from captchas) and then optionally applying some of the
above transformations and/or noise processes.

\vspace*{-1mm}
\subsection{Data Sources}
\vspace*{-1mm}

%\begin{itemize}
%\item
{\bf NIST.}
Our main source of characters is the NIST Special Database 19~\cite{Grother-1995},
widely used for training and testing character
recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
The dataset is composed of 8????? digits and characters (upper and lower cases), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 series of different complexity.
The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
by NIST as a testing set and is used in our work and some previous work~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}
for that purpose. We randomly split the remainder into a training set and a validation set for
model selection. The sizes of these data sets are: 651668 for training, 80000 for validation,
and 82587 for testing.
The performances reported by previous work on that dataset mostly use only the digits.
Here we use all the classes both in the training and testing phase. This is especially
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a uniform distribution
of letters in the test set, not in the training set (more like the natural distribution
of letters in text).

%\item
{\bf Fonts.}
In order to have a good variety of sources we downloaded a large number of free fonts from: {\tt http://anonymous.url.net}
%real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
Together with the Windows 7 fonts, this adds up to a total of $9817$ different fonts, from which we choose uniformly.
The ttf file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
directly as input to our models.

%\item
{\bf Captchas.}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the data generated, many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
allowed and can be controlled via an easy-to-use facade class.

%\item
{\bf OCR data.}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) were included as an
additional source. This set is part of a larger corpus being collected by the Image Understanding
Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern
({\tt http://www.iupr.com}), and which will be publicly released.
%\end{itemize}

\vspace*{-1mm}
\subsection{Data Sets}
\vspace*{-1mm}

All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.

%\begin{itemize}
%\item
{\bf NIST.} This is the raw NIST special database 19.

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the above transformation pipeline.
For each new example to generate, a source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST.
We apply all the transformations in the
order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$.

%\item
{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion)
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
%\end{itemize}

\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}

All hyper-parameters are selected based on performance on the NISTP validation set.

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used.
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

{\bf Stacked Denoising Auto-Encoders (SDAE).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
apparently setting parameters in a
basin of attraction of supervised gradient descent that yields better
generalization~\citep{Erhan+al-2010}.
It is hypothesized that the
advantage brought by this procedure stems from a better prior,
on the one hand taking advantage of the link between the input
distribution $P(x)$ and the conditional distribution of interest
$P(y|x)$ (like in semi-supervised learning), and on the other hand
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).

Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
teach (see tutorial and code there: {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising
Auto-Encoder, it is presented with a stochastically corrupted version
of the input and trained to reconstruct the uncorrupted input,
forcing the hidden units to represent the leading regularities in
the data. Once it is trained, its hidden unit activations can
be used as inputs for training a second one, etc.
After this unsupervised pre-training stage, the parameters
are used to initialize a deep MLP, which is fine-tuned by
the same standard procedure used to train MLPs (see above).
The SDA hyper-parameters are the same as for the MLP, with the addition of the
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
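
As a sketch of one pre-training step for a single layer (the symbols below are our own shorthand; the sigmoid non-linearity $s$ and the cross-entropy loss are assumptions borrowed from the usual formulation of~\citep{VincentPLarochelleH2008}, not choices stated in this paper), consider an input $x \in [0,1]^d$ and a masked copy $\tilde{x}$ in which a fixed proportion of randomly selected components is set to zero:
\begin{align*}
h &= s(W \tilde{x} + b), \\
z &= s(W' h + b'), \\
L(x,z) &= -\sum_{k=1}^{d} \left[ x_k \log z_k + (1-x_k) \log (1-z_k) \right].
\end{align*}
Under these assumptions the parameters $(W,b,W',b')$ would be updated by stochastic gradient descent on $L(x,z)$, so the reconstruction target is the uncorrupted $x$ rather than $\tilde{x}$, and the hidden activations $h$ then serve as inputs for training the next layer.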

\vspace*{-1mm}
\section{Experimental Results}

\vspace*{-1mm}
\subsection{SDA vs MLP vs Humans}
\vspace*{-1mm}

We compare here the best MLP (according to validation set error) that we found against
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}.
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language
processing \citep{SnowEtAl2008} and vision
\citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented
with 10 character images and asked to type 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing us
to estimate inter-human variability (shown as +/- in parentheses below).

Figure~\ref{fig:error-rates-charts} summarizes the results obtained.
More detailed results and tables can be found in the appendix.

\begin{table}
\caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
26 lower + 26 upper), except for the last column (digits only), between deep architecture with pre-training
(SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture
(MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07)
and using a validation set to select hyper-parameters and other training choices.
\{SDA,MLP\}0 are trained on NIST,
\{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07.
The human error rate on digits is a lower bound because it does not count digits that were
recognized as letters.
For comparison, the results found in the literature
on NIST digits classification using the same test set are included.}
\label{tab:sda-vs-mlp-vs-humans}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
      & NIST test          & NISTP test       & P07 test         & NIST test digits   \\ \hline
Humans& 18.2\% $\pm$.1\%   & 39.4\%$\pm$.1\%  & 46.9\%$\pm$.1\%  & $1.4\%$ \\ \hline
SDA0  & 23.7\% $\pm$.14\%  & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline
SDA1  & 17.1\% $\pm$.13\%  & 29.7\%$\pm$.3\%  & 29.7\%$\pm$.3\%  & 1.4\% $\pm$.1\%\\ \hline
SDA2  & 18.7\% $\pm$.13\%  & 33.6\%$\pm$.3\%  & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline
MLP0  & 24.2\% $\pm$.15\%  & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
MLP1  & 23.0\% $\pm$.15\%  & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\%  & 3.85\% $\pm$.16\% \\ \hline
MLP2  & 24.3\% $\pm$.15\%  & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline
\citep{Granger+al-2007} &     &                    &                   & 4.95\% $\pm$.18\% \\ \hline
\citep{Cortes+al-2000} &      &                    &                   & 3.71\% $\pm$.16\% \\ \hline
\citep{Oliveira+al-2002} &    &                    &                   & 2.4\% $\pm$.13\% \\ \hline
\citep{Milgram+al-2005} &      &                    &                   & 2.1\% $\pm$.12\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Charts corresponding to Table~\ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature.}
\label{fig:error-rates-charts}
\end{figure}

\vspace*{-1mm}
\subsection{Perturbed Training Data More Helpful for SDAE}
\vspace*{-1mm}

\begin{table}
\caption{Relative change in error rates due to the use of perturbed training data,
either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
A positive value indicates that training on the perturbed data helped for the
given test set (the first 3 columns are on the 62-class tasks and the last one is
on the clean 10-class digits).
Clearly, the deep learning models did benefit more
from perturbed training data, even when testing on clean data, whereas the MLP
trained on perturbed data performed worse on the clean digits and about the same
on the clean characters.}
\label{tab:perturbation-effect}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
             & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
SDA0/SDA1-1  & 38\%      & 84\%       & 228\%    & 93\% \\ \hline
SDA0/SDA2-1  & 27\%      & 94\%       & 144\%    & 59\% \\ \hline
MLP0/MLP1-1  & 5.2\%     & 65\%       & -13\%    & -10\% \\ \hline
MLP0/MLP2-1  & -0.4\%    & 49\%       & 44\%     & -29\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\vspace*{-1mm}
\subsection{Multi-Task Learning Effects}
\vspace*{-1mm}

As previously seen, the SDA is better able to benefit from the
transformations applied to the data than the MLP. In this experiment we
define three tasks: recognizing digits (knowing that the input is a digit),
recognizing upper case characters (knowing that the input is one), and
recognizing lower case characters (knowing that the input is one). We
consider the digit classification task as the target task and we want to
evaluate whether training with the other tasks can help or hurt, and
whether the effect is different for MLPs versus SDAs. The goal is to find
out if deep learning can benefit more (or less) from multiple related tasks
(i.e. the multi-task setting) compared to a corresponding purely supervised
shallow learner.

We use a single hidden layer MLP with 1000 hidden units, and an SDA
with 3 hidden layers (1000 hidden units per layer), pre-trained and
fine-tuned on NIST.

Our results show that the MLP benefits marginally from the multi-task setting
in the case of digits (5\% relative improvement) but is actually hurt in the case
of characters (respectively 3\% and 4\% worse for lower and upper case characters).
On the other hand the SDA benefitted from the multi-task setting, with relative
error rate improvements of 27\%, 15\% and 13\% respectively for digits,
lower and upper case characters, as shown in Table~\ref{tab:multi-task}.

\begin{table}
\caption{Test error rates and relative change in error rates due to the use of
a multi-task setting, i.e., training on each task in isolation vs training
for all three tasks together, for MLPs vs SDAs. The SDA benefits much
more from the multi-task setting. All experiments use only the
unperturbed NIST data, using validation error for model selection.
Relative improvement is 1 - single-task error / multi-task error.}
\label{tab:multi-task}
\begin{center}
\begin{tabular}{|l|r|r|r|} \hline
             & single-task  & multi-task  & relative \\
             & setting      & setting     & improvement \\ \hline
MLP-digits   & 3.77\%       & 3.99\%      & 5.6\% \\ \hline
MLP-lower    & 17.4\%       & 16.8\%      & -4.1\% \\ \hline
MLP-upper    & 7.84\%       & 7.54\%      & -3.6\% \\ \hline
SDA-digits   & 2.6\%        & 3.56\%      & 27\% \\ \hline
SDA-lower    & 12.3\%       & 14.4\%      & 15\% \\ \hline
SDA-upper    & 5.93\%       & 6.78\%      & 13\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
\caption{Charts corresponding to Tables~\ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).}
\label{fig:improvements-charts}
\end{figure}

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

The conclusions are positive for all the questions asked in the introduction.

%\begin{itemize}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
performance.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g.
adding noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input images,
but only marginally helped with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP, it was very significant for the SDA.
%\end{itemize}

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

{\small
\bibliography{strings,ml,aigaion,specials}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}
}

\end{document}