# HG changeset patch
# User Yoshua Bengio
# Date 1275514843 14400
# Node ID 8f6c09d1140f842b1e78a80bf44292a886ba2fb8
# Parent  35c611363291ea4356acf893e8bc356a2209e9db
it fits again

diff -r 35c611363291 -r 8f6c09d1140f writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Wed Jun 02 17:28:43 2010 -0400
+++ b/writeup/nips2010_submission.tex	Wed Jun 02 17:40:43 2010 -0400
@@ -133,8 +133,6 @@
 \vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
 \label{s:perturbations}
-{\large\bf Transformations}
-
 \vspace*{-1mm}
 
 \begin{minipage}[b]{0.14\linewidth}
@@ -144,9 +142,10 @@
 \vspace{1.2cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Original:}
+{\bf Original.}
 This section describes the different transformations we used to stochastically
-transform source images in order to obtain data from a larger distribution which
+transform source images such as the one on the left
+in order to obtain data from a larger distribution which
 covers a domain substantially larger than the clean characters distribution from
 which we start. Although character transformations have been used before to
 improve character recognizers, this effort is on a large scale both
@@ -158,12 +157,13 @@
 {\tt http://anonymous.url.net}.
 All the modules in the pipeline share a global control parameter ($0 \le complexity \le 1$)
 that allows one to modulate the amount of deformation or noise introduced.
-
 There are two main parts in the pipeline. The first one,
 from slant to pinch below, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
 \end{minipage}
 
+{\large\bf Transformations}
+
 \begin{minipage}[b]{0.14\linewidth}
 \centering
@@ -172,7 +172,7 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
 %\centering
-{\bf Slant:}
+{\bf Slant.}
 Each row of the image is shifted proportionally to its height:
 $shift = round(slant \times height)$, with
 $slant \sim U[-complexity,complexity]$.
@@ -187,7 +187,7 @@
 \vspace{.9cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Thinkness:}
+{\bf Thickness.}
 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
 are applied. The neighborhood of each pixel is multiplied
 element-wise with a {\em structuring element} matrix.
@@ -210,7 +210,7 @@
 \label{fig:Affine}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Affine Transformations:}
+{\bf Affine Transformations.}
 A $2 \times 3$ affine transform matrix (with 6 parameters $(a,b,c,d,e,f)$)
 is sampled according to the $complexity$ level.
 Output pixel $(x,y)$ takes the value of input pixel
@@ -230,7 +230,7 @@
 \label{fig:Elastic}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Local Elastic Deformations:}
+{\bf Local Elastic Deformations.}
 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
 which provides more details.
 The intensity of the displacement fields is given by
@@ -248,7 +248,7 @@
 \vspace{.6cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Pinch:}
+{\bf Pinch.}
 This is the ``Whirl and pinch'' GIMP filter, but with whirl set to 0.
 A pinch is ``similar to projecting the image onto an elastic surface and
 pressing or pulling on the center of the surface'' (GIMP documentation manual).
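The slant module above is simple enough to spell out in full. Below is a minimal NumPy sketch of it, under stated assumptions: the function name is ours, a row's index stands in for its ``height'', and out-of-range pixels wrap around (the actual pipeline may instead pad with background); this is an illustration, not the authors' pipeline code.

import numpy as np

def slant_image(image, complexity, rng=np.random):
    """Shift each row of `image` horizontally in proportion to its row index."""
    s = rng.uniform(-complexity, complexity)  # slant ~ U[-complexity, complexity]
    out = np.empty_like(image)
    for y in range(image.shape[0]):
        shift = int(round(s * y))             # shift = round(slant * height)
        out[y] = np.roll(image[y], shift)     # assumption: wrap-around shift
    return out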
@@ -277,7 +277,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Motion Blur:}
+{\bf Motion Blur.}
 This is GIMP's ``linear motion blur'' with parameters $length$ and $angle$.
 The value of a pixel in the final image is approximately the mean value of
 the first $length$ pixels
@@ -294,7 +294,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Occlusion:}
+{\bf Occlusion.}
 Selects a random rectangle from an {\em occluder} character
 image and places it over the original {\em occluded} image.
 Pixels are combined by taking the max(occluder,occluded),
@@ -313,7 +313,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Pixel Permutation:}
+{\bf Pixel Permutation.}
 This filter permutes neighbouring pixels. It first selects a fraction
 $\frac{complexity}{3}$ of pixels randomly in the image. Each of them is then
 sequentially exchanged with another pixel in its $V4$ neighbourhood.
@@ -328,7 +328,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Gaussian Noise:}
+{\bf Gaussian Noise.}
 This filter simply adds, to each pixel of the image independently, a
 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
 This filter is skipped with probability 70\%.
@@ -342,7 +342,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Background Images:}
+{\bf Background Images.}
 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
 background behind the letter, from a randomly chosen natural image,
 with contrast adjustments depending on $complexity$, to preserve
@@ -357,7 +357,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Salt and Pepper Noise:}
+{\bf Salt and Pepper Noise.}
 This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
 The fraction of selected pixels is $0.2 \times complexity$.
 This filter is skipped with probability 75\%.
@@ -372,7 +372,7 @@
 \vspace{.5cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Spatially Gaussian Noise:}
+{\bf Spatially Gaussian Noise.}
 Different regions of the image are spatially smoothed by convolving
 the image with a symmetric Gaussian kernel of
 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
@@ -396,7 +396,7 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
 \vspace{.4cm}
-{\bf Scratches:}
+{\bf Scratches.}
 The scratches module places line-like white patches on the image. The
 lines are heavily transformed images of the digit ``1'' (one), chosen
 at random among 500 such ``1'' images,
@@ -416,7 +416,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Grey Level and Contrast Changes:}
+{\bf Grey Level and Contrast Changes.}
 This filter changes the contrast and may invert the image polarity (white
 to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$
 so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
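As an example of the noise modules, here is a hedged sketch of the salt-and-pepper filter just described. Reading $0.2 \times complexity$ as the fraction of affected pixels, and replacing (rather than adding to) the selected values, are our interpretations of the text; the function is illustrative, not the paper's code.

import numpy as np

def salt_and_pepper(image, complexity, rng=np.random):
    """Replace a random subset of pixels with values drawn from U[0, 1]."""
    if rng.uniform() < 0.75:                   # module skipped with probability 75%
        return image
    out = image.copy()
    flat = out.ravel()                         # flat view into the copy
    n = int(0.2 * complexity * flat.size)      # assumed fraction of pixels affected
    idx = rng.choice(flat.size, size=n, replace=False)
    flat[idx] = rng.uniform(0.0, 1.0, size=n)  # noise ~ U[0, 1]
    return out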
The
@@ -440,16 +440,16 @@
 \fi
 
-\vspace*{-1mm}
+\vspace*{-2mm}
 \section{Experimental Setup}
 \vspace*{-1mm}
 
-Whereas much previous work on deep learning algorithms had been performed on
-the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
+Much previous work on deep learning had been performed on
+the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
 with 60~000 examples, and variants involving 10~000
-examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
-to focus here on the case of much larger training sets, from 10 times to
-to 1000 times larger.
+examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
+The focus here is on much larger training sets, from 10 times
+to 1000 times larger, and with 62 classes.
 The first step in constructing the larger datasets (called NISTP and P07)
 is to sample from a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
@@ -470,17 +470,17 @@
 %processing \citep{SnowEtAl2008} and vision
 %\citep{SorokinAndForsyth2008,whitehill09}.
 AMT users were presented
-with 10 character images and asked to choose 10 corresponding ASCII
+with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
 characters. They were forced to make a hard choice among the
 62 or 10 character classes (all classes or digits only).
-A total 2500 images/dataset were classified by XXX subjects,
-with 3 subjects classifying each image, allowing
+80 subjects classified 2500 images per (dataset,task) pair,
+with the guarantee that 3 different subjects classified each image, allowing
 us to estimate inter-human variability (e.g.\ a standard error of 0.1\%
-on the average 18\% error done by humans on the 62-class task).
+on the average 18.2\% error made by humans on the 62-class task of the NIST test set).
 
-\vspace*{-1mm}
+\vspace*{-3mm}
 \subsection{Data Sources}
-\vspace*{-1mm}
+\vspace*{-2mm}
 
 %\begin{itemize}
 %\item
@@ -499,10 +499,10 @@
 The performances reported by previous work on that dataset mostly use only the digits.
 Here we use all the classes both in the training and testing phase. This is
 especially useful to estimate the effect of a multi-task setting.
-Note that the distribution of the classes in the NIST training and test sets differs
-substantially, with relatively many more digits in the test set, and more uniform distribution
-of letters in the test set, compared to the training set (in the latter, the letters are distributed
-more like the natural distribution of letters in text).
+The distribution of the classes in the NIST training and test sets differs
+substantially, with relatively many more digits in the test set, and a more uniform distribution
+of letters in the test set (whereas in the training set the letters are distributed
+more like in natural text).
 
 \vspace*{-1mm}
 %\item
@@ -529,16 +529,16 @@
 %\item
 {\bf OCR data.}
 A large set (2 million) of scanned, OCRed and manually verified machine-printed
-characters (from various documents and books) where included as an
+characters were included as an
 additional source. This set is part of a larger corpus being collected by the Image Understanding
 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
 ({\tt http://www.iupr.com}), and which will be publicly released.
 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
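The text does not spell out how the quoted inter-human standard error is estimated. One plausible estimator, treating each (image, subject) judgement as a Bernoulli trial, is sketched below; the function and the pooling over judgements are our assumptions, not the authors' analysis code, and pooling ignores per-image correlation between the 3 subjects.

import numpy as np

def error_rate_and_se(correct):
    """Mean error rate and its binomial standard error over all judgements.

    `correct` is a boolean array with one entry per (image, subject) pair,
    e.g. 2500 images x 3 subjects = 7500 judgements per (dataset, task).
    """
    correct = np.asarray(correct, dtype=float)
    p_err = 1.0 - correct.mean()                        # average human error
    se = np.sqrt(p_err * (1.0 - p_err) / correct.size)  # binomial standard error
    return p_err, se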
 %\end{itemize}
 
-\vspace*{-1mm}
+\vspace*{-3mm}
 \subsection{Data Sets}
-\vspace*{-1mm}
+\vspace*{-2mm}
 
 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$)
 associated with a label from one of the 62 character classes.
@@ -568,13 +568,13 @@
 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 %\end{itemize}
 
-\vspace*{-1mm}
+\vspace*{-3mm}
 \subsection{Models and their Hyperparameters}
-\vspace*{-1mm}
+\vspace*{-2mm}
 
 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
 hidden layer and with Stacked Denoising Auto-Encoders (SDA).
-\emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.}
+\emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
 
 {\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep architectures to both shallow MLPs and
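The section cuts off here, but since the SDA is the paper's main model, a compact sketch of the kind of denoising auto-encoder layer that gets stacked (following Vincent et al., 2008, cited above as VincentPLarochelleH2008) may help. The sizes, learning rate, masking-noise corruption level, tied weights and plain per-example SGD step are illustrative assumptions, not the paper's hyper-parameters.

import numpy as np

rng = np.random.RandomState(0)
n_in, n_hid, lr = 32 * 32, 500, 0.05           # assumed sizes and learning rate
W = rng.uniform(-0.1, 0.1, (n_in, n_hid))      # tied encoder/decoder weights
b_h, b_v = np.zeros(n_hid), np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, corruption=0.25):
    """One SGD step of denoising reconstruction on one example x in [0,1]^n_in."""
    global W, b_h, b_v
    x_tilde = x * (rng.uniform(size=x.shape) > corruption)  # masking noise
    h = sigmoid(x_tilde @ W + b_h)                          # encode corrupted input
    z = sigmoid(h @ W.T + b_v)                              # reconstruct clean input
    dz = z - x                       # cross-entropy gradient w.r.t. pre-sigmoid output
    dh = (dz @ W) * h * (1.0 - h)    # backprop through decoder into hidden layer
    W -= lr * (np.outer(dz, h) + np.outer(x_tilde, dh))
    b_v -= lr * dz
    b_h -= lr * dh

After pre-training each layer this way on the previous layer's outputs, the stack is topped with a supervised layer and fine-tuned by gradient descent, which is the usual SDA recipe the paper builds on.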