comparison writeup/nips2010_submission.tex @ 566:b9b811e886ae

Small fixes
author Olivier Delalleau <delallea@iro>
date Thu, 03 Jun 2010 13:16:00 -0400
is sampled from one of these sources (chosen randomly), the second step is to
apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.

To provide a baseline for error rate comparison we also estimate human performance
on both the 62-class task and the 10-class digits task.
We compare the best Multi-Layer Perceptrons (MLP) against
the best Stacked Denoising Auto-encoders (SDA), when
both models' hyper-parameters are selected to minimize the validation set error.
We also provide a comparison against a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service ({\tt http://mturk.com}).
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language processing and vision.
% [...]
Results reported by previous work on that dataset mostly use only the digits.
Here we use all the classes, in both the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
The distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
of letters in the test set (whereas in the training set they are distributed
more like in natural text).
\vspace*{-1mm}

%\item
{\bf Fonts.}
% [...]

\vspace*{-3mm}
\subsection{Models and their Hyperparameters}
\vspace*{-2mm}

The experiments are performed using MLPs (with a single
hidden layer) and SDAs.
\emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used
% [...]
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
Training examples are presented in minibatches of size 20. A constant learning
rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ (which was found to work best) was then selected for optimizing on
the whole training sets.
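The following is a minimal numpy sketch of such a model, purely for
illustration (it is not the code used in the experiments): a single tanh
hidden layer, a softmax output estimating
$P(\mathrm{class} \mid \mathrm{image})$, and constant-rate SGD on
minibatches of 20. The $32 \times 32$ input size and all identifiers are
assumptions made for this sketch.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def softmax(z):
    # Numerically stable softmax over the class dimension.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Single tanh hidden layer + softmax output; hidden size and learning
# rate taken from the grids in the text, 32x32 input is an assumption.
n_in, n_hidden, n_out = 32 * 32, 500, 62
W1 = rng.uniform(-0.01, 0.01, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.01, 0.01, (n_hidden, n_out)); b2 = np.zeros(n_out)

def sgd_step(x, y, lr=0.1):              # constant rate 0.1, as selected
    global W1, b1, W2, b2
    h = np.tanh(x @ W1 + b1)             # hidden layer activations
    p = softmax(h @ W2 + b2)             # P(class | image)
    d_out = p.copy()                     # gradient of the mean negative
    d_out[np.arange(len(y)), y] -= 1.0   # log-likelihood w.r.t. the
    d_out /= len(y)                      # softmax pre-activation
    d_h = (d_out @ W2.T) * (1.0 - h**2)  # backprop through tanh
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (x.T @ d_h);   b1 -= lr * d_h.sum(axis=0)
    return -np.log(p[np.arange(len(y)), y]).mean()

x = rng.rand(20, n_in)                   # one minibatch of size 20
y = rng.randint(0, n_out, size=20)       # (random stand-in data)
loss = sgd_step(x, y)
\end{verbatim}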
\vspace*{-1mm}


{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
% [...]
\label{fig:da}
\vspace*{-2mm}
\end{figure}

Here we chose to use the Denoising
Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
explain (see Figure~\ref{fig:da}, as well as the
tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides efficient inference, and has yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training, a Denoising
Auto-encoder is presented with a stochastically corrupted version
of the input and trained to reconstruct the uncorrupted input,
forcing the hidden units to represent the leading regularities in
the data. Once it is trained, in a purely unsupervised way,
its hidden units' activations can
be used as inputs for training a second one, etc.
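As above, here is a small illustrative numpy sketch of one denoising
auto-encoder training step and of the greedy stacking just described. It
is not the experimental code; tied weights, zeroing corruption, and the
25\% corruption level are choices made only for the example.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    # One layer: corrupt the input, encode, decode, and reconstruct
    # the *uncorrupted* input (tied encoder/decoder weights).
    def __init__(self, n_visible, n_hidden):
        self.W = rng.uniform(-0.01, 0.01, (n_visible, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_visible)

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_h)

    def train_step(self, x, corruption=0.25, lr=0.1):
        # Stochastic corruption: zero a random fraction of each input.
        mask = rng.binomial(1, 1.0 - corruption, x.shape)
        h = self.encode(x * mask)             # code from corrupted input
        r = sigmoid(h @ self.W.T + self.b_v)  # reconstruction
        # Cross-entropy against the clean input; d_r is its gradient
        # w.r.t. the decoder pre-activation, averaged over the batch.
        d_r = (r - x) / len(x)
        d_h = (d_r @ self.W) * h * (1.0 - h)
        self.W -= lr * ((x * mask).T @ d_h + d_r.T @ h)
        self.b_v -= lr * d_r.sum(axis=0)
        self.b_h -= lr * d_h.sum(axis=0)

# Greedy layer-wise stacking: train layer 1 on the data, then train
# layer 2 on layer 1's activations for the clean input.
x = rng.rand(20, 32 * 32)
da1 = DenoisingAutoencoder(32 * 32, 1000)
da1.train_step(x)
da2 = DenoisingAutoencoder(1000, 1000)
da2.train_step(da1.encode(x))
\end{verbatim}
In the standard SDA recipe this unsupervised stacking is followed by
supervised fine-tuning of the whole network.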
% [...]
on either NIST, NISTP or P07, either on the 62-class task
or on the 10-digits task.
Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
SDA2), along with the previous results on the digits NIST special database
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
(figures and tables, including standard errors on the error rates) can be
found in Appendix I of the supplementary material.
% [...]
lower-case, or upper-case characters). Again, whereas the gain from the
multi-task setting is marginal or negative for the MLP, it is substantial
for the SDA. Note that to simplify these multi-task experiments, only the original
NIST dataset is used. For example, the MLP-digits bar shows the relative
percent improvement in MLP error rate on the NIST digits test set,
which is $100\% \times$ (single-task
model's error / multi-task model's error $-$ 1).
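(With purely illustrative numbers: a single-task error of $5\%$ against a
multi-task error of $4\%$ would give $100\% \times (5/4 - 1) = 25\%$.)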
The single-task model is
trained with only 10 outputs (one per digit), seeing only digit examples,
whereas the multi-task model is trained with 62 outputs, with all 62
character classes as examples. Hence the hidden units are shared across
all tasks. For the multi-task model, the digit error rate is measured by
comparing the correct digit class with the output class associated with the