# HG changeset patch
# User Dumitru Erhan <dumitru.erhan@gmail.com>
# Date 1275418694 25200
# Node ID 18a6379999fdbbd375508ce5b0b077f811b921de
# Parent  eaa595ea2402ccb2244749f3fe942d5a173d9957
more after lunch :)

diff -r eaa595ea2402 -r 18a6379999fd writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Tue Jun 01 11:32:04 2010 -0700
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 11:58:14 2010 -0700
@@ -364,7 +364,7 @@
 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 
 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. 
-The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one is the one recommended 
+The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended 
 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
 for that purpose. We randomly split the remainder into a training set and a validation set for
 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, 
@@ -446,19 +446,19 @@
 (making the use of SVMs computationally inconvenient because of their quadratic
 scaling behavior).
 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
-exponentials) on the output layer for estimating$ P(class | image)$.
+exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 
 The optimization procedure is as follows: training
 examples are presented in minibatches of size 20, a constant learning
-rate is chosen in $10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
+rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
 through preliminary experiments (measuring performance on a validation set),
 and $0.1$ was then selected.
 
 {\bf Stacked Denoising Auto-Encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many hidden 
-layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}
-enabling better generalization, apparently setting parameters in the
+layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
+apparently setting parameters in the
 basin of attraction of supervised gradient descent yielding better 
 generalization~\citep{Erhan+al-2010}. It is hypothesized that the
 advantage brought by this procedure stems from a better prior,
@@ -508,7 +508,7 @@
 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs
 ~\citep{Milgram+al-2005}.  More detailed and complete numerical results
 (figures and tables, including standard errors on the error rates) can be
-found in the supplementary material.  The 3 kinds of model differ in the
+found in Appendix I of the supplementary material.  The 3 kinds of model differ in the
 training sets used: NIST only (MLP0,SDA0), NISTP (MLP1, SDA1), or P07
 (MLP2, SDA2). The deep learner not only outperformed the shallow ones and
 previously published performance (in a statistically and qualitatively
@@ -609,7 +609,7 @@
 We have found that the self-taught learning framework is more beneficial
 to a deep learner than to a traditional shallow and purely
 supervised learner. More precisely, 
-the conclusions are positive for all the questions asked in the introduction.
+the answers are positive for all the questions asked in the introduction.
 %\begin{itemize}
 
 $\bullet$ %\item 
@@ -617,7 +617,7 @@
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
 Yes, the SDA {\bf systematically outperformed the MLP and all the previously
-published results on this dataset (as far as we know), in fact reaching human-level
+published results on this dataset (the ones that we are aware of), in fact reaching human-level
 performance} at round 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item