# HG changeset patch
# User Dumitru Erhan
# Date 1275441481 25200
# Node ID 8fe77eac344f2a9340b1f7d483a20205e1ce7565
# Parent  8b7e054d22bd4ef5fbf60c9557ede44eac18f38d
Clarifying the experimental setup, typos here and there

diff -r 8b7e054d22bd -r 8fe77eac344f writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Tue Jun 01 14:23:34 2010 -0700
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 18:18:01 2010 -0700
@@ -1,7 +1,8 @@
 \documentclass{article} % For LaTeX2e
 \usepackage{nips10submit_e,times}
-\usepackage{amsthm,amsmath,amssymb,bbold,bbm}
+\usepackage{amsthm,amsmath,bbm}
+\usepackage[psamsfonts]{amssymb}
 \usepackage{algorithm,algorithmic}
 \usepackage[utf8]{inputenc}
 \usepackage{graphicx,subfigure}
@@ -77,7 +78,7 @@
 Machines in terms of unsupervised extraction of a hierarchy of features
 useful for classification. The principle is that each layer starting from
 the bottom is trained to encode its input (the output of the previous
-layer) and to reconstruct it from a corrupted version of it. After this
+layer) and to reconstruct it from a corrupted version. After this
 unsupervised initialization, the stack of denoising auto-encoders can be
 converted into a deep supervised feedforward neural network and fine-tuned by
 stochastic gradient descent.
@@ -97,8 +98,6 @@
 between different regions in input space or different tasks, as discussed
 in the conclusion.
 
-% TODO: why we care to evaluate this relative advantage
-
 In this paper we ask the following questions:
 
 %\begin{enumerate}
@@ -127,6 +126,7 @@
 
 \vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
+\label{s:perturbations}
 \vspace*{-1mm}
 
 This section describes the different transformations we used to stochastically
@@ -142,7 +142,7 @@
 part, from blur to contrast, adds different kinds of noise.
 
 \begin{figure}[ht]
-\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}}
+\centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}}
 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE
 \caption{Illustration of each transformation applied alone to the same image of
 an upper-case h (top left). First row (from left to right) : original image, slant,
@@ -354,28 +354,31 @@
 with 60~000 examples, and variants involving 10~000
 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
 to focus here on the case of much larger training sets, from 10 times to
-to 1000 times larger. The larger datasets are obtained by first sampling from
+1000 times larger.
+
+The first step in constructing the larger datasets is to sample from
 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
 and {\bf OCR data} (scanned machine printed characters). Once a character
-is sampled from one of these sources (chosen randomly), a pipeline of
-the above transformations and/or noise processes is applied to the
-image.
+is sampled from one of these sources (chosen randomly), the pipeline of
+the transformations and/or noise processes described in section \ref{s:perturbations}
+is applied to the image.
 
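In code form, the generation loop just described amounts to a few lines. The sketch below is illustrative only: samplers and transforms are hypothetical stand-ins for the paper's actual data-source and perturbation modules (which this patch does not contain), and the source proportions are the ones specified for P07 further down, under Data Sources.

    import random

    def generate_example(rng, samplers, transforms, max_complexity=0.7):
        """Draw one perturbed character example: pick a data source at
        random, sample a raw character from it, then apply the full
        transformation pipeline.

        samplers: dict mapping a source name to a zero-argument function
            returning (image, label) -- a stand-in for the real modules.
        transforms: ordered list of functions taking (image, complexity).
        """
        # Source proportions specified for P07 in the Data Sources section:
        # fonts 10%, captchas 25%, OCR data 25%, NIST 40%.
        names = ["fonts", "captcha", "ocr", "nist"]
        weights = [0.10, 0.25, 0.25, 0.40]
        source = rng.choices(names, weights=weights)[0]
        image, label = samplers[source]()
        # Apply every transformation in the fixed order given in the text,
        # each with its own complexity drawn uniformly from [0, 0.7].
        for transform in transforms:
            image = transform(image, rng.uniform(0.0, max_complexity))
        return image, label

    # Usage sketch: generate_example(random.Random(0), samplers, transforms)

Passing only the slant-to-pinch subset of transforms would reproduce the noise-free NISTP variant described under Data Sources.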
-We compare the best MLP (according to validation set error) that we found against
-the best SDA (again according to validation set error), along with a precise estimate
+We compare the best MLP against
+the best SDA (both models' hyper-parameters are selected to minimize the validation set error),
+as well as against a precise estimate
 of human performance obtained via Amazon's Mechanical Turk (AMT)
-service\footnote{http://mturk.com}.
+service (http://mturk.com).
 AMT users are paid small amounts
 of money to perform tasks for which human intelligence is required.
 Mechanical Turk has been used extensively in natural language processing and vision.
 %processing \citep{SnowEtAl2008} and vision
 %\citep{SorokinAndForsyth2008,whitehill09}.
-AMT users where presented
-with 10 character images and asked to type 10 corresponding ASCII
+AMT users were presented
+with 10 character images and asked to choose 10 corresponding ASCII
 characters. They were forced to make a hard choice among the
 62 or 10 character classes (all classes or digits only).
 Three users classified each image, allowing
-to estimate inter-human variability.
+to estimate inter-human variability. A total of 2500 images per dataset were classified.
 
 \vspace*{-1mm}
 \subsection{Data Sources}
@@ -390,7 +393,7 @@
 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
-The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended
+The fourth partition (called $hsf_4$), experimentally recognized as the most difficult one, is the one recommended
 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
 for that purpose. We randomly split the remainder into a training set and a validation set for
 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation,
@@ -406,10 +409,10 @@
 %\item
 {\bf Fonts.}
 In order to have a good variety of sources we downloaded an important number of free fonts from:
-{\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
+{\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
 % TODO: pointless to anonymize, it's not pointing to our work
-Including operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
-The {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
+Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total that we can choose uniformly from.
+The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by producing a corresponding image,
 directly as input to our models.
 
 %\item
@@ -445,13 +448,13 @@
 
 %\item
 {\bf P07.}
 This dataset is obtained by taking raw characters from all four of the above sources
-and sending them through the above transformation pipeline.
-For each new example to generate, a source is selected with probability $10\%$ from the fonts,
+and sending them through the transformation pipeline described in section \ref{s:perturbations}.
+To generate each new example, a data source is selected with probability $10\%$ from the fonts,
 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
-order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
+order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
 %\item
-{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion)
+{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
 except that we only apply transformations from slant to pinch. Therefore, the character is
 transformed but no additional noise is added to the image, giving images
@@ -464,21 +467,20 @@
 
 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
 hidden layer and with Stacked Denoising Auto-Encoders (SDA).
-All hyper-parameters are selected based on performance on the NISTP validation set.
+\emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.}
 
 {\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep
 architectures to both shallow MLPs and SVMs, we only compared to MLPs here
 because of the very large datasets used
-(making the use of SVMs computationally inconvenient because of their quadratic
+(making the use of SVMs computationally challenging because of their quadratic
 scaling behavior). The MLP has a single hidden layer with $\tanh$ activation
 functions, and softmax (normalized exponentials) on the output layer for estimating $P(class | image)$.
 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
-The optimization procedure is as follows: training
-examples are presented in minibatches of size 20, a constant learning
-rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
+Training examples are presented in minibatches of size 20. A constant learning
+rate was chosen from $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
 through preliminary experiments (measuring performance on a validation set),
-and $0.1$ was then selected.
+and $0.1$ was then selected for training on the full training sets.
 
 \begin{figure}[ht]
 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
@@ -507,11 +509,12 @@
 taking advantage of the expressive power and bias implicit in the
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
+
 Here we chose to use the Denoising
 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
 these deep hierarchies of features, as it is very simple to train and
 teach (see Figure~\ref{fig:da}, as well as
-tutorial and code there: {\tt http://deeplearning.net/tutorial}),
+tutorial and code at {\tt http://deeplearning.net/tutorial}),
 provides immediate and efficient inference, and yielded results
 comparable or better than RBMs in series of experiments
 \citep{VincentPLarochelleH2008}. During training, a Denoising
@@ -536,10 +539,9 @@
 
 \begin{figure}[ht]
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
-\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
+\caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
-of all models, on 3 different test sets corresponding to the three
-datasets.
+of all models, on 3 different test sets (NIST, NISTP, P07).
 Right: error rates on NIST test digits only, along with the previous results from
 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
@@ -693,7 +695,7 @@
 experiments showed its positive effects in a \emph{limited labeled data}
 scenario. However, many of the results by \citet{RainaR2007} (who used a
 shallow, sparse coding approach) suggest that the relative gain of self-taught
-learning diminishes as the number of labeled examples increases, (essentially,
+learning diminishes as the number of labeled examples increases (essentially,
 a ``diminishing returns'' scenario occurs). We note that, for deep
 architectures, our experiments show that such a positive effect is accomplished
 even in a scenario with a \emph{very large number of labeled examples}.
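The MLP baseline specified in the Models section (single tanh hidden layer, softmax output estimating P(class | image), minibatches of size 20, constant learning rate 0.1) is compact enough to sketch in full. The NumPy code below is an illustrative re-implementation under those stated hyper-parameters, not the authors' code (the text points instead to http://deeplearning.net/tutorial); the input size and the weight initialization are placeholder assumptions.

    import numpy as np

    class SimpleMLP:
        """One tanh hidden layer, softmax output: P(class | image)."""

        def __init__(self, n_in, n_hidden, n_out, rng):
            # Placeholder initialization: small uniform weights for the
            # hidden layer, zeros for the output layer.
            self.W1 = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.W2 = np.zeros((n_hidden, n_out))
            self.b2 = np.zeros(n_out)

        def forward(self, X):
            H = np.tanh(X @ self.W1 + self.b1)            # hidden layer
            logits = H @ self.W2 + self.b2
            logits -= logits.max(axis=1, keepdims=True)   # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)             # softmax
            return H, P

        def sgd_step(self, X, y, lr=0.1):
            """One minibatch SGD step on the negative log-likelihood."""
            n = X.shape[0]
            H, P = self.forward(X)
            dlogits = P.copy()
            dlogits[np.arange(n), y] -= 1.0               # dNLL / dlogits
            dlogits /= n
            dH = dlogits @ self.W2.T
            dpre = dH * (1.0 - H ** 2)                    # tanh derivative
            self.W2 -= lr * (H.T @ dlogits)
            self.b2 -= lr * dlogits.sum(axis=0)
            self.W1 -= lr * (X.T @ dpre)
            self.b1 -= lr * dpre.sum(axis=0)

    # Training-loop sketch: 62 output classes as in the paper; the 32x32
    # input size and the arrays X_train (flattened images) / y_train
    # (integer labels) are assumptions for illustration.
    # mlp = SimpleMLP(n_in=32 * 32, n_hidden=1000, n_out=62,
    #                 rng=np.random.default_rng(0))
    # for start in range(0, len(X_train), 20):   # minibatches of size 20
    #     mlp.sgd_step(X_train[start:start + 20],
    #                  y_train[start:start + 20], lr=0.1)

Running this same loop over NIST, NISTP, or P07 corresponds to the three training conditions (0, 1, 2) in the error-rate figure above.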