# HG changeset patch
# User Dumitru Erhan
# Date 1275417124 25200
# Node ID eaa595ea2402ccb2244749f3fe942d5a173d9957
# Parent 460a4e78c9a411e09c5784d387364f258ab3cd5e
section 3 quickpass

diff -r 460a4e78c9a4 -r eaa595ea2402 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex Tue Jun 01 11:15:37 2010 -0700
+++ b/writeup/nips2010_submission.tex Tue Jun 01 11:32:04 2010 -0700
@@ -309,7 +309,7 @@
 \begin{figure}[h]
 \resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\
 \caption{Illustration of the pipeline of stochastic
-transformations applied to the image of a lower-case t
+transformations applied to the image of a lower-case \emph{t}
 (the upper left image).
 Each image in the pipeline (going from left to right, first top line,
 then bottom line) shows the result of applying one of the modules in the pipeline. The last image
@@ -361,11 +361,11 @@
 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
 widely used for training and testing character
 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
-The dataset is composed with 814255 digits and characters (upper and lower cases), with hand checked classifications,
+The dataset is composed of 814255 digits and characters (upper and lower cases), with hand-checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
-corresponding to "0"-"9","A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity.
-The fourth series, $hsf_4$, experimentally recognized to be the most difficult one is recommended
-by NIST as testing set and is used in our work and some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
+corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
+The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended
+by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
 for that purpose. We randomly split the remainder into a training set and a validation set for model selection.
 The sizes of these data sets are: 651668 for training,
 80000 for validation, and 82587 for testing.
@@ -373,15 +373,16 @@
 Here we use all the classes both in the training and testing phase. This is especially useful to estimate the effect
 of a multi-task setting.
 Note that the distribution of the classes in the NIST training and test sets differs
-substantially, with relatively many more digits in the test set, and uniform distribution
-of letters in the test set, not in the training set (more like the natural distribution
-of letters in text).
+substantially, with relatively many more digits in the test set, and more uniform distribution
+of letters in the test set, compared to the training set (in the latter, the letters are distributed
+more like the natural distribution of letters in text).
 
 %\item
 {\bf Fonts.}
-In order to have a good variety of sources we downloaded an important number of free fonts from: {\tt http://anonymous.url.net}
-%real adress {\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
-in addition to Windows 7's, this adds up to a total of $9817$ different fonts that we can choose uniformly. 
+In order to have a good variety of sources we downloaded a large number of free fonts from:
+{\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
+% TODO: pointless to anonymize, it's not pointing to our work
+Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
 The {\tt ttf} file is either used as input of the Captcha generator
 (see next item) or, by producing a corresponding image, directly
 as input to our models.
@@ -392,8 +393,8 @@
 a random character class generator and various kinds of transformations similar to those described in the previous sections.
 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
-depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
-allowed and can be controlled via an easy to use facade class.
+depending on the value of the complexity parameter provided by the user of the data source.
+%Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
 
 %\item
 {\bf OCR data.}
@@ -402,6 +403,7 @@
 additional source. This set is part of a larger corpus being collected by the Image Understanding
 Pattern Recognition Research group lead by Thomas Breuel at University of Kaiserslautern
 ({\tt http://www.iupr.com}), and which will be publicly released.
+%TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
 %\end{itemize}
 
 \vspace*{-1mm}
@@ -444,12 +446,13 @@
 (making the use of SVMs computationally inconvenient
 because of their quadratic scaling behavior). The MLP has a single hidden
 layer with $\tanh$ activation functions, and softmax (normalized
-exponentials) on the output layer for estimating P(class | image).
-The hyper-parameters are the following: number of hidden units, taken in
-$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
-examples are presented in minibatches of size 20. A constant learning
+exponentials) on the output layer for estimating $P(class | image)$.
+The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
+The optimization procedure is as follows: training
+examples are presented in minibatches of size 20, a constant learning
 rate is chosen in $10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
-through preliminary experiments, and 0.1 was selected.
+through preliminary experiments (measuring performance on a validation set),
+and $0.1$ was then selected.
 
 {\bf Stacked Denoising Auto-Encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
@@ -472,8 +475,8 @@
 teach (see tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides immediate and efficient inference, and yielded results
 comparable or better than RBMs in series of experiments
-\citep{VincentPLarochelleH2008}. During training of a Denoising
-Auto-Encoder, it is presented with a stochastically corrupted version
+\citep{VincentPLarochelleH2008}. During training, a Denoising
+Auto-Encoder is presented with a stochastically corrupted version
 of the input and trained to reconstruct the uncorrupted input,
 forcing the hidden units to represent the leading regularities in
 the data. Once it is trained, its hidden units activations can
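To make the MLP baseline in the hunk at line 444 concrete, here is a minimal sketch (not the authors' code; the image size, the synthetic data, and all variable names are assumptions) of a one-hidden-layer tanh network with a softmax output estimating P(class | image), trained by minibatch SGD with minibatch size 20 and the constant learning rate 0.1 reported above.

# Minimal sketch of the MLP baseline: tanh hidden layer, softmax output,
# minibatch SGD.  Sizes and data below are illustrative placeholders only.
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32 * 32, 500, 62      # assumed 32x32 images, 62 classes
X = rng.rand(1000, n_in).astype("float32")        # placeholder images
y = rng.randint(n_classes, size=1000)             # placeholder labels

W1 = rng.uniform(-0.01, 0.01, (n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.01, 0.01, (n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(Xb):
    h = np.tanh(Xb @ W1 + b1)                     # hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)    # softmax: P(class | image)

lr, batch = 0.1, 20                               # values reported in the text
for epoch in range(5):
    for i in range(0, len(X), batch):
        Xb, yb = X[i:i + batch], y[i:i + batch]
        h, p = forward(Xb)
        d_logits = p.copy()
        d_logits[np.arange(len(yb)), yb] -= 1     # gradient of NLL wrt logits
        d_logits /= len(yb)
        dW2 = h.T @ d_logits
        dh = d_logits @ W2.T * (1 - h ** 2)       # backprop through tanh
        dW1 = Xb.T @ dh
        W2 -= lr * dW2; b2 -= lr * d_logits.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * dh.sum(axis=0)

In the paper's setting, the number of hidden units would be chosen from {300, 500, 800, 1000, 1500} and the learning rate from the listed grid by measuring error on the validation split, as the edited paragraph states.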
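Similarly, the denoising auto-encoder training step described in the SDA hunk can be sketched as below. The corruption scheme (zeroing a random fraction of inputs), the layer sizes, and the learning rate are illustrative assumptions; only the idea from the text is kept: corrupt the input stochastically, then train the layer to reconstruct the uncorrupted input.

# Minimal sketch of one denoising auto-encoder layer with tied weights.
# Corruption zeroes a random fraction of each input; the reconstruction
# target is the clean input, as described in the paragraph above.
import numpy as np

rng = np.random.RandomState(0)
n_visible, n_hidden = 32 * 32, 500                # assumed sizes
X = rng.rand(1000, n_visible)                     # placeholder inputs in [0, 1]

W = rng.uniform(-0.01, 0.01, (n_visible, n_hidden))
b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

lr, batch, corruption = 0.1, 20, 0.25
for epoch in range(5):
    for i in range(0, len(X), batch):
        x = X[i:i + batch]
        x_tilde = x * (rng.rand(*x.shape) > corruption)  # stochastic corruption
        h = sigmoid(x_tilde @ W + b_h)                   # encode the corrupted input
        z = sigmoid(h @ W.T + b_v)                       # decode (tied weights)
        # cross-entropy reconstruction loss against the clean input x
        dz = (z - x) / len(x)
        dh = dz @ W * h * (1 - h)
        dW = x_tilde.T @ dh + dz.T @ h                   # both paths through tied W
        W -= lr * dW
        b_h -= lr * dh.sum(axis=0)
        b_v -= lr * dz.sum(axis=0)

Once such a layer is trained, its hidden-unit activations would serve as the representation fed to the next layer of the stack, which is what the (truncated) sentence at the end of the hunk goes on to describe.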