comparison writeup/nips2010_submission.tex @ 533:22d5cd82d5f0

resolved conflict
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 21:24:39 -0400
parents 2e33885730cf 85f2337d47d2
children 4d6493d171f6 47894d0ecbde
comparison
equal deleted inserted replaced
532:2e33885730cf 533:22d5cd82d5f0
1 \documentclass{article} % For LaTeX2e 1 \documentclass{article} % For LaTeX2e
2 \usepackage{nips10submit_e,times} 2 \usepackage{nips10submit_e,times}
3 3
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} 4 \usepackage{amsthm,amsmath,bbm}
5 \usepackage[psamsfonts]{amssymb}
5 \usepackage{algorithm,algorithmic} 6 \usepackage{algorithm,algorithmic}
6 \usepackage[utf8]{inputenc} 7 \usepackage[utf8]{inputenc}
7 \usepackage{graphicx,subfigure} 8 \usepackage{graphicx,subfigure}
8 \usepackage[numbers]{natbib} 9 \usepackage[numbers]{natbib}
9 10
75 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which 76 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which
76 performed as well as or better than previously proposed Restricted Boltzmann 77 performed as well as or better than previously proposed Restricted Boltzmann
77 Machines in terms of unsupervised extraction of a hierarchy of features 78 Machines in terms of unsupervised extraction of a hierarchy of features
78 useful for classification. The principle is that each layer starting from 79 useful for classification. The principle is that each layer starting from
79 the bottom is trained to encode its input (the output of the previous 80 the bottom is trained to encode its input (the output of the previous
80 layer) and to reconstruct it from a corrupted version of it. After this 81 layer) and to reconstruct it from a corrupted version. After this
81 unsupervised initialization, the stack of denoising auto-encoders can be 82 unsupervised initialization, the stack of denoising auto-encoders can be
82 converted into a deep supervised feedforward neural network and fine-tuned by 83 converted into a deep supervised feedforward neural network and fine-tuned by
83 stochastic gradient descent. 84 stochastic gradient descent.
84 85
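To make the layer-wise procedure above concrete, here is a minimal NumPy sketch (ours, not the authors' code); the layer sizes, masking-corruption level, learning rate and number of epochs are illustrative assumptions only.

\begin{verbatim}
# Sketch (ours) of the procedure described above: each denoising auto-encoder is
# trained to reconstruct its input from a corrupted version, and each new layer
# encodes the previous layer's output.  Layer sizes, corruption level, learning
# rate and epochs are illustrative guesses, not the values used in the paper.
import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def pretrain_dae_layer(X, n_hidden, corruption=0.25, lr=0.01, epochs=5):
    """Train one denoising auto-encoder (tied weights) on inputs X with rows in [0,1]."""
    n_in = X.shape[1]
    W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.uniform(size=n_in) > corruption)  # corrupt the input
            h = sigmoid(x_tilde @ W + b)                          # encode
            z = sigmoid(h @ W.T + c)                              # reconstruct
            dz = z - x                       # grad. of cross-entropy wrt decoder pre-activation
            dh = (dz @ W) * h * (1 - h)      # back-propagated to encoder pre-activation
            W -= lr * (np.outer(x_tilde, dh) + np.outer(dz, h))
            b -= lr * dh
            c -= lr * dz
    return W, b

def pretrain_stack(X, layer_sizes=(500, 500, 500)):
    """Greedy layer-wise pre-training; returns (W, b) pairs that initialize the supervised net."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # clean (uncorrupted) codes feed the next layer
    return params

# After pre-training, a softmax output layer is added on top and the whole stack is
# fine-tuned by stochastic gradient descent on the supervised labels (not shown).
\end{verbatim}

The tutorial cited in the model description ({\tt http://deeplearning.net/tutorial}) contains a complete implementation of this building block.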
85 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles 86 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
96 The hypothesis explored here is that a deep hierarchy of features 97 The hypothesis explored here is that a deep hierarchy of features
97 may be better able to provide sharing of statistical strength 98 may be better able to provide sharing of statistical strength
98 between different regions in input space or different tasks, 99 between different regions in input space or different tasks,
99 as discussed in the conclusion. 100 as discussed in the conclusion.
100 101
101 % TODO: why we care to evaluate this relative advantage
102
103 In this paper we ask the following questions: 102 In this paper we ask the following questions:
104 103
105 %\begin{enumerate} 104 %\begin{enumerate}
106 $\bullet$ %\item 105 $\bullet$ %\item
107 Do the good results previously obtained with deep architectures on the 106 Do the good results previously obtained with deep architectures on the
126 125
127 Our experimental results provide positive evidence towards all of these questions. 126 Our experimental results provide positive evidence towards all of these questions.
128 127
129 \vspace*{-1mm} 128 \vspace*{-1mm}
130 \section{Perturbation and Transformation of Character Images} 129 \section{Perturbation and Transformation of Character Images}
130 \label{s:perturbations}
131 \vspace*{-1mm} 131 \vspace*{-1mm}
132 132
133 This section describes the different transformations we used to stochastically 133 This section describes the different transformations we used to stochastically
134 transform source images in order to obtain new training examples. More details can 134 transform source images in order to obtain new training examples. More details can
135 be found in this technical report~\citep{ift6266-tr-anonymous}. 135 be found in this technical report~\citep{ift6266-tr-anonymous}.
141 There are two main parts in the pipeline. The first one, 141 There are two main parts in the pipeline. The first one,
142 from slant to pinch below, performs transformations. The second 142 from slant to pinch below, performs transformations. The second
143 part, from blur to contrast, adds different kinds of noise. 143 part, from blur to contrast, adds different kinds of noise.
144 144
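To fix ideas on how the two parts fit together, here is a schematic sketch (ours); the two toy modules below stand in for the real slant-to-pinch and blur-to-contrast modules, whose actual implementations are in the cited technical report, and the per-module \emph{complexity} parameter is the one discussed in the experiments section (each module receives its own draw).

\begin{verbatim}
# Schematic sketch (ours) of the two-part pipeline: deformation modules first,
# then noise modules.  The two toy modules below are stand-ins; the real slant,
# thickness, affine, elastic, pinch, blur, ..., contrast modules are described
# in the cited technical report.
import numpy as np

def slant(img, c):
    """Toy slant: shear each row horizontally in proportion to its height."""
    h = img.shape[0]
    return np.stack([np.roll(row, int(round(c * (r - h / 2))))
                     for r, row in enumerate(img)])

def contrast(img, c):
    """Toy contrast change: compress the dynamic range by a factor depending on c."""
    return 0.5 + (img - 0.5) * (1.0 - c)

DEFORMATIONS = [slant]      # real pipeline: slant ... pinch
NOISES = [contrast]         # real pipeline: blur ... contrast

def perturb(image, rng, max_complexity=0.7, add_noise=True):
    """Apply every module in order, each with its own randomly drawn complexity."""
    for module in DEFORMATIONS + (NOISES if add_noise else []):
        image = module(image, rng.uniform(0.0, max_complexity))
    return np.clip(image, 0.0, 1.0)

rng = np.random.RandomState(0)
example = perturb(rng.uniform(size=(32, 32)), rng)   # a random 32x32 stand-in image
\end{verbatim}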
145 \begin{figure}[ht] 145 \begin{figure}[ht]
146 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}} 146 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}}
147 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE 147 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE
148 \caption{Illustration of each transformation applied alone to the same image 148 \caption{Illustration of each transformation applied alone to the same image
149 of an upper-case h (top left). First row (from left to right): original image, slant, 149 of an upper-case h (top left). First row (from left to right): original image, slant,
150 thickness, affine transformation (translation, rotation, shear), 150 thickness, affine transformation (translation, rotation, shear),
151 local elastic deformation; second row (from left to right): 151 local elastic deformation; second row (from left to right):
353 Whereas much previous work on deep learning algorithms had been performed on 353 Whereas much previous work on deep learning algorithms had been performed on
354 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, 354 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
355 with 60~000 examples, and variants involving 10~000 355 with 60~000 examples, and variants involving 10~000
356 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want 356 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
357 to focus here on the case of much larger training sets, from 10 times to 357 to focus here on the case of much larger training sets, from 10 times to
358 1000 times larger. The larger datasets are obtained by first sampling from 358 1000 times larger.
359
360 The first step in constructing the larger datasets is to sample from
359 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, 361 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
360 and {\bf OCR data} (scanned machine printed characters). Once a character 362 and {\bf OCR data} (scanned machine printed characters). Once a character
361 is sampled from one of these sources (chosen randomly), a pipeline of 363 is sampled from one of these sources (chosen randomly), the pipeline of
362 the above transformations and/or noise processes is applied to the 364 the transformations and/or noise processes described in section \ref{s:perturbations}
363 image. 365 is applied to the image.
364 366
365 We compare the best MLP (according to validation set error) that we found against 367 We compare the best MLP against
366 the best SDA (again according to validation set error), along with a precise estimate 368 the best SDA (both models' hyper-parameters are selected to minimize the validation set error),
369 and compare both against a precise estimate
367 of human performance obtained via Amazon's Mechanical Turk (AMT) 370 of human performance obtained via Amazon's Mechanical Turk (AMT)
368 service\footnote{http://mturk.com}. 371 service ({\tt http://mturk.com}).
369 AMT users are paid small amounts 372 AMT users are paid small amounts
370 of money to perform tasks for which human intelligence is required. 373 of money to perform tasks for which human intelligence is required.
371 Mechanical Turk has been used extensively in natural language processing and vision. 374 Mechanical Turk has been used extensively in natural language processing and vision.
372 %processing \citep{SnowEtAl2008} and vision 375 %processing \citep{SnowEtAl2008} and vision
373 %\citep{SorokinAndForsyth2008,whitehill09}. 376 %\citep{SorokinAndForsyth2008,whitehill09}.
374 AMT users where presented 377 AMT users were presented
375 with 10 character images and asked to type 10 corresponding ASCII 378 with 10 character images and asked to choose 10 corresponding ASCII
376 characters. They were forced to make a hard choice among the 379 characters. They were forced to make a hard choice among the
377 62 or 10 character classes (all classes or digits only). 380 62 or 10 character classes (all classes or digits only).
378 Three users classified each image, allowing us 381 Three users classified each image, allowing us
379 to estimate inter-human variability. 382 to estimate inter-human variability. A total of 2500 images per dataset were classified.
380 383
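The 95\% confidence intervals reported with the results below are not detailed in the text; one simple way to obtain an error rate and such an interval from the AMT labels is a normal approximation to the binomial proportion, sketched here with a hypothetical error count.

\begin{verbatim}
# Sketch with a made-up error count: 2500 images per dataset, 3 AMT labels each.
# The normal-approximation binomial interval below is our assumption; the paper
# does not specify how its 95% confidence intervals were computed.
import math

def error_rate_with_ci(n_errors, n_total, z=1.96):
    """Error rate and normal-approximation 95% confidence interval."""
    p = n_errors / n_total
    half = z * math.sqrt(p * (1.0 - p) / n_total)
    return p, (p - half, p + half)

p, (low, high) = error_rate_with_ci(n_errors=1370, n_total=2500 * 3)  # hypothetical count
print("error %.1f%% (95%% CI: %.1f%%-%.1f%%)" % (100 * p, 100 * low, 100 * high))
\end{verbatim}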
381 \vspace*{-1mm} 384 \vspace*{-1mm}
382 \subsection{Data Sources} 385 \subsection{Data Sources}
383 \vspace*{-1mm} 386 \vspace*{-1mm}
384 387
389 widely used for training and testing character 392 widely used for training and testing character
390 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. 393 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
391 The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications, 394 The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications,
392 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 395 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
393 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. 396 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
394 The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended 397 The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended
395 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 398 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
396 for that purpose. We randomly split the remainder into a training set and a validation set for 399 for that purpose. We randomly split the remainder into a training set and a validation set for
397 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, 400 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation,
398 and 82587 for testing. 401 and 82587 for testing.
399 Most of the previous work reporting performance on this dataset used only the digits. 402 Most of the previous work reporting performance on this dataset used only the digits.
405 more like the natural distribution of letters in text). 408 more like the natural distribution of letters in text).
406 409
407 %\item 410 %\item
408 {\bf Fonts.} 411 {\bf Fonts.}
409 In order to have a good variety of sources, we downloaded a large number of free fonts from: 412 In order to have a good variety of sources, we downloaded a large number of free fonts from:
410 {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} 413 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
411 % TODO: pointless to anonymize, it's not pointing to our work 414 % TODO: pointless to anonymize, it's not pointing to our work
412 Including operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. 415 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
413 The {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by rendering the corresponding image, 416 The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by rendering the corresponding image,
414 directly as input to our models. 417 directly as input to our models.
415 418
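One plausible way (not necessarily what was done here) to render a character image from a chosen ttf file is via the Pillow (PIL) library; the font path, canvas size and final resolution below are placeholders.

\begin{verbatim}
# One plausible way (not necessarily the authors' code) to render a character
# from a downloaded ttf file into a small grey-scale image; sizes and the
# font path are placeholders.
from PIL import Image, ImageDraw, ImageFont

def render_char(ttf_path, char, canvas=64, final=32):
    font = ImageFont.truetype(ttf_path, size=int(canvas * 0.8))
    img = Image.new("L", (canvas, canvas), color=0)           # black background
    ImageDraw.Draw(img).text((canvas // 8, 0), char, fill=255, font=font)
    return img.resize((final, final))                         # e.g. 32x32 input image

# img = render_char("/path/to/some_font.ttf", "h")
\end{verbatim}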
416 %\item 419 %\item
417 {\bf Captchas.} 420 {\bf Captchas.}
418 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for 421 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
444 %\item 447 %\item
445 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. 448 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}.
446 449
447 %\item 450 %\item
448 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources 451 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
449 and sending them through the above transformation pipeline. 452 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
450 To generate each new example, a source is selected with probability $10\%$ from the fonts, 453 To generate each new example, a data source is selected with probability $10\%$ from the fonts,
451 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the 454 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
452 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$. 455 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
453 456
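For concreteness, the source-selection step can be sketched as follows (our code; the loaders that would actually fetch a raw character from each source are omitted). Each selected character is then passed through the pipeline, every module receiving its own complexity drawn uniformly from $[0,0.7]$.

\begin{verbatim}
# Sketch (ours) of the P07 source-selection step; only the sampling itself is
# shown, the loaders for the four data sources are left out.
import numpy as np

SOURCE_PROBS = {"fonts": 0.10, "captchas": 0.25, "ocr": 0.25, "nist": 0.40}

def pick_source(rng):
    names = sorted(SOURCE_PROBS)
    return rng.choice(names, p=[SOURCE_PROBS[n] for n in names])

rng = np.random.RandomState(0)
counts = {name: 0 for name in SOURCE_PROBS}
for _ in range(10000):
    counts[pick_source(rng)] += 1
print(counts)   # roughly 10% fonts, 25% captchas, 25% ocr, 40% nist
\end{verbatim}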
454 %\item 457 %\item
455 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion) 458 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
456 except that we only apply 459 except that we only apply
457 transformations from slant to pinch. Therefore, the character is 460 transformations from slant to pinch. Therefore, the character is
458 transformed but no additional noise is added to the image, giving images 461 transformed but no additional noise is added to the image, giving images
459 closer to the NIST dataset. 462 closer to the NIST dataset.
460 %\end{itemize} 463 %\end{itemize}
463 \subsection{Models and their Hyperparameters} 466 \subsection{Models and their Hyperparameters}
464 \vspace*{-1mm} 467 \vspace*{-1mm}
465 468
466 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single 469 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
467 hidden layer and with Stacked Denoising Auto-Encoders (SDA). 470 hidden layer and with Stacked Denoising Auto-Encoders (SDA).
468 All hyper-parameters are selected based on performance on the NISTP validation set. 471 \emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.}
469 472
470 {\bf Multi-Layer Perceptrons (MLP).} 473 {\bf Multi-Layer Perceptrons (MLP).}
471 Whereas previous work had compared deep architectures to both shallow MLPs and 474 Whereas previous work had compared deep architectures to both shallow MLPs and
472 SVMs, we only compared to MLPs here because of the very large datasets used 475 SVMs, we only compared to MLPs here because of the very large datasets used
473 (making the use of SVMs computationally inconvenient because of their quadratic 476 (making the use of SVMs computationally challenging because of their quadratic
474 scaling behavior). 477 scaling behavior).
475 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized 478 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
476 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. 479 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
477 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 480 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
478 The optimization procedure is as follows: training 481 Training examples are presented in minibatches of size 20. A constant learning
479 examples are presented in minibatches of size 20, a constant learning 482 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
480 rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
481 through preliminary experiments (measuring performance on a validation set), 483 through preliminary experiments (measuring performance on a validation set),
482 and $0.1$ was then selected. 484 and $0.1$ was then selected and used for training on the full training sets.
483 485
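A minimal NumPy sketch (ours, not the actual implementation) of one stochastic gradient step for the MLP just described follows; the 1000 hidden units and 62 classes are one of the configurations listed above, while the $32\times 32$ input size is our assumption.

\begin{verbatim}
# Sketch (ours) of the MLP described above: one tanh hidden layer, softmax output
# estimating P(class|image), minibatches of 20, constant learning rate 0.1.
# The 32x32 input size is an assumption; hidden/output sizes are one configuration.
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32 * 32, 1000, 62
W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(X, y, lr=0.1):
    """One constant-learning-rate update on a minibatch X (20 x n_in), integer labels y."""
    global W1, b1, W2, b2
    H = np.tanh(X @ W1 + b1)                                    # tanh hidden layer
    P = softmax(H @ W2 + b2)                                    # P(class | image)
    G = P.copy(); G[np.arange(len(y)), y] -= 1.0; G /= len(y)   # grad. of mean NLL wrt logits
    dH = (G @ W2.T) * (1.0 - H ** 2)
    W2 -= lr * (H.T @ G);  b2 -= lr * G.sum(axis=0)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)

sgd_step(rng.uniform(size=(20, n_in)), rng.randint(0, n_classes, size=20))  # one illustrative step
\end{verbatim}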
484 \begin{figure}[ht] 486 \begin{figure}[ht]
485 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} 487 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
486 \caption{Illustration of the computations and training criterion for the denoising 488 \caption{Illustration of the computations and training criterion for the denoising
487 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of 489 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
506 distribution $P(x)$ and the conditional distribution of interest 508 distribution $P(x)$ and the conditional distribution of interest
507 $P(y|x)$ (like in semi-supervised learning), and on the other hand 509 $P(y|x)$ (like in semi-supervised learning), and on the other hand
508 taking advantage of the expressive power and bias implicit in the 510 taking advantage of the expressive power and bias implicit in the
509 deep architecture (whereby complex concepts are expressed as 511 deep architecture (whereby complex concepts are expressed as
510 compositions of simpler ones through a deep hierarchy). 512 compositions of simpler ones through a deep hierarchy).
513
511 Here we chose to use the Denoising 514 Here we chose to use the Denoising
512 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for 515 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
513 these deep hierarchies of features, as it is very simple to train and 516 these deep hierarchies of features, as it is very simple to train and
514 explain (see Figure~\ref{fig:da}, as well as 517 explain (see Figure~\ref{fig:da}, as well as
515 tutorial and code there: {\tt http://deeplearning.net/tutorial}), 518 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
535 538
536 \vspace*{-1mm} 539 \vspace*{-1mm}
537 540
538 \begin{figure}[ht] 541 \begin{figure}[ht]
539 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} 542 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
540 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training 543 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
541 on NIST, 1 on NISTP, and 2 on P07. Left: overall results 544 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
542 of all models, on 3 different test sets corresponding to the three 545 of all models, on 3 different test sets (NIST, NISTP, P07).
543 datasets.
544 Right: error rates on NIST test digits only, along with the previous results from 546 Right: error rates on NIST test digits only, along with the previous results from
545 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 547 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
546 respectively based on ART, nearest neighbors, MLPs, and SVMs.} 548 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
547 549
548 \label{fig:error-rates-charts} 550 \label{fig:error-rates-charts}