comparison writeup/nips2010_submission.tex @ 484:9a757d565e46

size reduction
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Mon, 31 May 2010 20:42:22 -0400
parents b9cdb464de5f
children 6beaf3328521
\begin{document}

%\makeanontitle
\maketitle

\vspace*{-2mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has
demonstrated the importance of learning algorithms for deep
architectures, i.e., function classes obtained by composing multiple
non-linear transformations. The self-taught learning (exploiting unlabeled
%...

images, color, contrast, occlusion, and various types of pixel and
spatially correlated noise. The out-of-distribution examples are
obtained by training with these highly distorted images or
by including object classes different from those in the target test set.
\end{abstract}
\vspace*{-2mm}

\section{Introduction}
\vspace*{-1mm}

Deep Learning has emerged as a promising new area of research in
statistical machine learning (see~\citet{Bengio-2009} for a review).
Learning algorithms for deep architectures are centered on the learning
of useful representations of data, which are better suited to the task at hand.
This is in great part inspired by observations of the mammalian visual cortex,
which consists of a chain of processing elements, each of which is associated with a
different representation of the raw visual input. In fact,
it was found recently that the features learnt in deep architectures resemble
those observed in the first two of these stages (in areas V1 and V2
of visual cortex)~\citep{HonglakL2008}, and that they become more and
more invariant to factors of variation (such as camera movement) in
higher layers~\cite{Goodfellow2009}.
Learning a hierarchy of features increases the
ease and practicality of developing representations that are at once
tailored to specific tasks, yet are able to borrow statistical strength
from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
feature representation can lead to higher-level (more abstract, more
general) features that are more robust to unanticipated sources of
%...

Machines in terms of unsupervised extraction of a hierarchy of features
useful for classification. The principle is that each layer starting from
the bottom is trained to encode its input (the output of the previous
layer) and to reconstruct it from a corrupted version of it. After this
unsupervised initialization, the stack of denoising auto-encoders can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.

Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
of semi-supervised and multi-task learning: the learner can exploit examples
that are unlabeled and/or come from a distribution different from the target
distribution, e.g., from classes other than those of interest. Whereas
it has already been shown that deep learners can clearly take advantage of
unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008}
and multi-task learning, not much has been done yet to explore the impact
of {\em out-of-distribution} examples and of the multi-task setting
(but see~\citep{CollobertR2008-short}). In particular the {\em relative
advantage} of deep learning for these settings has not been evaluated.

In this paper we ask the following questions:

%\begin{enumerate}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digit images generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifiers better not only on similarly perturbed images but also on
the {\em original clean examples}?

$\bullet$ %\item
Do deep architectures {\em benefit more from such out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
%\end{enumerate}

The experimental results presented here provide positive evidence towards all of these questions.

\vspace*{-1mm}
\section{Perturbation and Transformation of Character Images}
\vspace*{-1mm}

This section describes the different transformations we used to stochastically
transform source images in order to obtain data. More details can
be found in this technical report~\citep{ift6266-tr-anonymous}.
The code for these transformations (mostly python) is available at
%...

There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

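The sketch below illustrates how such a two-stage pipeline can be organized; the module names and calling convention are hypothetical (an illustration, not the project's actual code), while the per-module complexity sampling follows the description of the generated datasets given later.

\begin{verbatim}
import numpy as np

def apply_pipeline(image, deformation_modules, noise_modules, rng,
                   max_complexity=0.7):
    """Apply the deformation modules (slant ... pinch) first, then the
    noise modules (blur ... contrast).  Each module is assumed to be a
    function f(image, complexity, rng) -> image and receives its own
    complexity drawn uniformly in [0, max_complexity]."""
    for module in deformation_modules + noise_modules:
        image = module(image, rng.uniform(0.0, max_complexity), rng)
    return image
\end{verbatim}
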
{\large\bf Transformations}

\vspace*{2mm}

{\bf Slant.}
We mimic slant by shifting each row of the image horizontally,
proportionally to its height (i.e. its vertical position): $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
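A minimal sketch of this transformation (assuming a 2-D numpy array and, for brevity, wrap-around at the image border rather than padding with background):

\begin{verbatim}
import numpy as np

def slant_image(image, slant):
    """Shift row y horizontally by round(slant * y) pixels."""
    out = np.empty_like(image)
    for y in range(image.shape[0]):
        out[y] = np.roll(image[y], int(round(slant * y)))
    return out
\end{verbatim}
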
%...

at some other distance $d_2$. Define $d_1$ to be the distance between $P$
and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
d_1$, where $pinch$ is a parameter to the filter.
The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.
Here $pinch \sim U[-complexity, 0.7 \times complexity]$.

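A direct, unoptimized transcription of this formula (a sketch only, not the actual generator code), assuming a grey-level numpy array:

\begin{verbatim}
import numpy as np

def bilinear(image, y, x):
    """Bilinear interpolation at the (non-integer) position (y, x)."""
    y = min(max(y, 0.0), image.shape[0] - 1.0)
    x = min(max(x, 0.0), image.shape[1] - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, image.shape[0] - 1), min(x0 + 1, image.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * ((1 - wx) * image[y0, x0] + wx * image[y0, x1]) +
            wy * ((1 - wx) * image[y1, x0] + wx * image[y1, x1]))

def pinch_image(image, pinch, r):
    """For each pixel at distance d1 < r from the center C, sample the
    source at distance d2 = sin(pi*d1/(2*r))**(-pinch) * d1 along the
    same direction, using bilinear interpolation."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = image.copy()
    for y in range(h):
        for x in range(w):
            dy, dx = y - cy, x - cx
            d1 = np.hypot(dy, dx)
            if d1 == 0.0 or d1 >= r:
                continue
            d2 = np.sin(np.pi * d1 / (2 * r)) ** (-pinch) * d1
            out[y, x] = bilinear(image, cy + dy * d2 / d1, cx + dx * d2 / d1)
    return out
\end{verbatim}
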
\vspace*{1mm}

{\large\bf Injecting Noise}

\vspace*{1mm}

{\bf Motion Blur.}
This is a ``linear motion blur'' in GIMP
terminology, with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the first $length$ pixels
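A rough numpy equivalent of this filter (a simplified sketch: it wraps around at the image border, which the real filter presumably does not):

\begin{verbatim}
import numpy as np

def linear_motion_blur(image, length, angle_degrees):
    """Average, for each pixel, the `length` pixels reached by stepping
    from it along the direction given by `angle_degrees`."""
    theta = np.deg2rad(angle_degrees)
    dy, dx = np.sin(theta), np.cos(theta)
    copies = [np.roll(np.roll(image, -int(round(i * dy)), axis=0),
                      -int(round(i * dx)), axis=1)
              for i in range(length)]
    return np.mean(copies, axis=0)
\end{verbatim}
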
%...

color and contrast changes.}
\label{fig:transfo}
\end{figure}


\vspace*{-1mm}
\section{Experimental Setup}
\vspace*{-1mm}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\cite{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
%...
to 1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source} (NIST characters, scanned machine printed characters, characters
from fonts, or characters from captchas) and then optionally applying some of the
above transformations and/or noise processes.

\vspace*{-1mm}
\subsection{Data Sources}
\vspace*{-1mm}

%\begin{itemize}
%\item
{\bf NIST.}
Our main source of characters is the NIST Special Database 19~\cite{Grother-1995},
widely used for training and testing character
recognition systems~\cite{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
The dataset is composed of 8????? digits and characters (upper and lower cases), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
%...
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a uniform distribution
of letters in the test set but not in the training set (which is closer to the natural distribution
of letters in text).

%\item
{\bf Fonts.}
In order to have a good variety of sources we downloaded a large number of free fonts from {\tt http://anonymous.url.net}.
%real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
In addition to Windows 7's fonts, this adds up to a total of $9817$ different fonts that we can choose from uniformly.
The ttf file is either used as input to the Captcha generator (see below) or, by producing a corresponding image,
directly as input to our models.

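As an illustration of the second use (rendering a ttf file directly into an image for our models), a hypothetical helper along these lines could be used; the glyph placement and scaling details are glossed over and the function is ours, not the project's code:

\begin{verbatim}
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_character(ttf_path, char, size=32):
    """Render one character from a .ttf file as a size x size
    grey-level array with values in [0, 1]."""
    font = ImageFont.truetype(ttf_path, int(size * 0.8))
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), char, fill=255, font=font)
    return np.asarray(img, dtype=np.float64) / 255.0
\end{verbatim}
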
%\item
{\bf Captchas.}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the generated data, many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
allowed and can be controlled via an easy-to-use facade class.

%\item
{\bf OCR data.}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) were included as an
additional source. This set is part of a larger corpus being collected by the Image Understanding
and Pattern Recognition research group led by Thomas Breuel at the University of Kaiserslautern
({\tt http://www.iupr.com}), which will be publicly released.
%\end{itemize}

\vspace*{-1mm}
\subsection{Data Sets}
\vspace*{-1mm}

All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
%\begin{itemize}

%\item
{\bf NIST.} This is the raw NIST special database 19.

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the above transformation pipeline.
For each new example to generate, a source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$
(see the sketch below).

%\item
{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of sources)
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
%\end{itemize}

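The sketch below illustrates this sampling scheme. The helper that draws a raw character from a given source and the list of pipeline modules are hypothetical placeholders rather than the actual generator code, but the source probabilities and the complexity range are the ones stated above.

\begin{verbatim}
import numpy as np

SOURCES = ["font", "captcha", "ocr", "nist"]
PROBS = [0.10, 0.25, 0.25, 0.40]

def generate_p07_example(rng, draw_from_source, pipeline_modules):
    """Draw a raw 32x32 character from one of the four sources, then apply
    every pipeline module in order, each with a complexity in [0, 0.7]."""
    source = rng.choice(SOURCES, p=PROBS)
    image, label = draw_from_source(source)
    for module in pipeline_modules:
        image = module(image, rng.uniform(0.0, 0.7), rng)
    return image, label
\end{verbatim}

Under the same conventions, NISTP would be generated with the same source proportions but with only the deformation modules (slant to pinch) in the list.
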
\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}

All hyper-parameters are selected based on performance on the NISTP validation set.

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used.
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate was chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

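For concreteness, a minimal numpy sketch of this architecture and of one SGD update follows (an illustration under the hyper-parameters above, not the code actually used; the weight initialization shown is an arbitrary choice):

\begin{verbatim}
import numpy as np

class OneHiddenLayerMLP:
    """One tanh hidden layer and a softmax output estimating
    P(class | image), trained by minibatch SGD."""
    def __init__(self, n_in=32 * 32, n_hidden=1000, n_classes=62, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.01, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.01, (n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)
        logits = H @ self.W2 + self.b2
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        return H, P / P.sum(axis=1, keepdims=True)

    def sgd_step(self, X, y, lr=0.1):
        """One update on a minibatch (e.g. X of shape (20, 1024), y int labels).
        Returns the mean cross-entropy on the minibatch."""
        H, P = self.forward(X)
        G = P.copy()
        G[np.arange(len(y)), y] -= 1.0    # gradient of cross-entropy wrt logits
        G /= len(y)
        dW2, db2 = H.T @ G, G.sum(axis=0)
        dA = (G @ self.W2.T) * (1.0 - H ** 2)   # back through tanh
        dW1, db1 = X.T @ dA, dA.sum(axis=0)
        for p, g in ((self.W1, dW1), (self.b1, db1),
                     (self.W2, dW2), (self.b2, db2)):
            p -= lr * g
        return -np.log(P[np.arange(len(y)), y] + 1e-12).mean()
\end{verbatim}
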
{\bf Stacked Denoising Auto-Encoders (SDAE).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization, apparently by setting parameters in the
basin of attraction of supervised gradient descent, yielding better
%...
distribution $P(x)$ and the conditional distribution of interest
$P(y|x)$ (as in semi-supervised learning), and on the other hand
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).
Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
%...
the data. Once it is trained, its hidden unit activations can
be used as inputs for training a second one, etc.
After this unsupervised pre-training stage, the parameters
are used to initialize a deep MLP, which is fine-tuned by
the same standard procedure used to train MLPs (see above).
The SDA hyper-parameters are the same as for the MLP, with the addition of the
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

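As an illustration of the building block and of the masking noise process, a minimal numpy sketch of one unsupervised pre-training update follows; sigmoid units, tied weights and a cross-entropy reconstruction cost are assumptions made for this sketch (common choices for denoising auto-encoders), and this is not the actual experiment code:

\begin{verbatim}
import numpy as np

def masking_noise(X, corruption_fraction, rng):
    """Randomly zero out input values with probability corruption_fraction
    (an i.i.d. approximation of zeroing a fixed proportion of them)."""
    return X * (rng.random(X.shape) >= corruption_fraction)

def dae_pretrain_step(X, W, b_hid, b_vis, lr, corruption_fraction, rng):
    """One denoising auto-encoder update on a minibatch X (values in [0,1]).
    W has shape (n_visible, n_hidden); parameters are updated in place.
    Returns the hidden activations on the clean input, which become the
    inputs used to pre-train the next layer of the stack."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    X_tilde = masking_noise(X, corruption_fraction, rng)   # corrupt
    H = sigmoid(X_tilde @ W + b_hid)                       # encode
    Z = sigmoid(H @ W.T + b_vis)                           # reconstruct
    dZ = (Z - X) / len(X)           # grad of cross-entropy wrt decoder pre-activation
    dA = (dZ @ W) * H * (1.0 - H)   # grad wrt encoder pre-activation
    W -= lr * (X_tilde.T @ dA + dZ.T @ H)   # encoder + decoder contributions
    b_hid -= lr * dA.sum(axis=0)
    b_vis -= lr * dZ.sum(axis=0)
    return sigmoid(X @ W + b_hid)
\end{verbatim}
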
\vspace*{-1mm}
\section{Experimental Results}

\vspace*{-1mm}
\subsection{SDA vs MLP vs Humans}
\vspace*{-1mm}

We compare here the best MLP (according to validation set error) that we found against
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}. AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language
processing \citep{SnowEtAl2008} and vision
\citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented
with 10 character images and asked to type the 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing us
to estimate inter-human variability (shown as +/- in parentheses below).

Figure~\ref{fig:error-rates-charts} summarizes the results obtained.
More detailed results and tables can be found in the appendix.

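One plausible way of turning the three classifications collected per image into an error rate with an uncertainty estimate is sketched below; the exact computation behind the reported +/- values and the 95\% confidence intervals is not spelled out in this excerpt, so both the grouping by user and the formula are assumptions:

\begin{verbatim}
import numpy as np

def human_error_rate(correct):
    """correct: boolean array of shape (n_images, 3), whether each of the
    three AMT users labelled each image correctly.  Returns the mean error
    rate and half the width of a rough 95% interval based on the spread
    across users."""
    per_user_error = 1.0 - np.asarray(correct, dtype=float).mean(axis=0)
    std_err = per_user_error.std(ddof=1) / np.sqrt(per_user_error.size)
    return per_user_error.mean(), 1.96 * std_err
\end{verbatim}
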
\begin{table}
\caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
26 lower + 26 upper), except for the last columns -- digits only, between the deep architecture with pre-training
(SDA=Stacked Denoising Autoencoder) and the ordinary shallow architecture
%...
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature. }
\label{fig:error-rates-charts}
\end{figure}

\vspace*{-1mm}
\subsection{Perturbed Training Data More Helpful for SDAE}
\vspace*{-1mm}

\begin{table}
\caption{Relative change in error rates due to the use of perturbed training data,
either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
A positive value indicates that training on the perturbed data helped for the
%...
MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline
\end{tabular}
\end{center}
\end{table}

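One reading of the row labels in this table (e.g. MLP0/MLP2-1), offered here as an interpretation since the definition is not spelled out in this excerpt, is that each entry reports
\[
  \frac{e_{\mathrm{clean}}}{e_{\mathrm{perturbed}}} - 1,
\]
where $e_{\mathrm{clean}}$ and $e_{\mathrm{perturbed}}$ denote the test error of the model trained on clean data (e.g. MLP0) and on perturbed data (e.g. MLP2) respectively; this quantity is positive exactly when training on the perturbed data lowered the error, consistent with the caption.
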
\vspace*{-1mm}
\subsection{Multi-Task Learning Effects}
\vspace*{-1mm}

As previously seen, the SDA is better able to benefit from the
transformations applied to the data than the MLP. In this experiment we
define three tasks: recognizing digits (knowing that the input is a digit),
recognizing upper case characters (knowing that the input is one), and
%...
\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
\caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).}
\label{fig:improvements-charts}
\end{figure}

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

The conclusions are positive for all the questions asked in the introduction.
%\begin{itemize}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
performance.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifiers better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input images,
but only marginally helped with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP, it was very significant for the SDA.
%\end{itemize}

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

{\small
\bibliography{strings,ml,aigaion,specials}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}
}

\end{document}