comparison writeup/nips2010_submission.tex @ 518:460a4e78c9a4
merging is fun, merging is fun, merging is fun
author:   Dumitru Erhan <dumitru.erhan@gmail.com>
date:     Tue, 01 Jun 2010 11:15:37 -0700
parents:  0a5945249f2b 092dae9a5040
children: eaa595ea2402
unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}
and multi-task learning, not much has been done yet to explore the impact
of {\em out-of-distribution} examples and of the multi-task setting
(but see~\citep{CollobertR2008}). In particular, the {\em relative
advantage} of deep learning for these settings has not been evaluated.
The hypothesis explored here is that a deep hierarchy of features
may be better able to provide sharing of statistical strength
between different regions in input space or different tasks,
as discussed in the conclusion.

% TODO: why we care to evaluate this relative advantage

In this paper we ask the following questions:

\vspace*{-1mm}
\section{Experimental Setup}
\vspace*{-1mm}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
to focus here on the case of much larger training sets, from 10 times to
1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
%\begin{itemize}
%\item
{\bf NIST.}
Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
The dataset is composed of 814~255 digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3~600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 series of different complexity.
The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
by NIST as a testing set and is used in our work and some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
for that purpose. We randomly split the remainder into a training set and a validation set for
model selection. The sizes of these data sets are: 651~668 for training, 80~000 for validation,
and 82~587 for testing.
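These sizes add up to the full 814~255 examples. As an illustration only, the random split could be expressed as follows in NumPy; the variable names and the fixed seed are assumptions for this sketch, not the actual preprocessing code.
\begin{verbatim}
import numpy as np

# Sketch of the random train/validation split described above
# (hypothetical code, not the actual ift6266 pipeline).  After
# setting aside the 82 587 hsf_4 examples for testing, 731 668
# examples remain to be divided between training and validation.
n_remaining, n_train = 731668, 651668
rng = np.random.RandomState(1234)    # assumed seed, for illustration
perm = rng.permutation(n_remaining)  # shuffle example indices
train_idx = perm[:n_train]           # 651 668 training examples
valid_idx = perm[n_train:]           # 80 000 validation examples
\end{verbatim}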
The performances reported by previous work on that dataset mostly concern only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
through preliminary experiments, and 0.1 was selected.
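A minimal sketch of this kind of validation-based choice (the candidate values and the helper are hypothetical; the actual search is not reproduced here):
\begin{verbatim}
def select_learning_rate(candidates, validation_error):
    """Return the candidate with the lowest validation error.

    `validation_error` is a hypothetical callable that trains a
    model with the given learning rate and returns its error rate
    on the validation set.
    """
    return min(candidates, key=validation_error)

# e.g. select_learning_rate([0.0001, 0.001, 0.01, 0.1],
#                           validation_error)
\end{verbatim}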

{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
enabling better generalization, apparently by setting the parameters in a
basin of attraction of supervised gradient descent that yields better
generalization~\citep{Erhan+al-2010}.
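As an illustration of the unsupervised step that initializes one layer, here is a minimal NumPy sketch of a denoising auto-encoder update with tied weights and masking noise, following the recipe of~\citep{VincentPLarochelleH2008}; it is a sketch under those assumptions, not the paper's actual implementation.
\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_update(x, W, b, b_prime, noise, lr, rng):
    """One SGD step of a denoising auto-encoder (tied weights).

    x: (batch, d) inputs in [0, 1]; W: (d, k), shared by the
    encoder and the (transposed) decoder.  After pretraining,
    (W, b) initialize one hidden layer of the deep MLP.
    """
    x_tilde = x * (rng.uniform(size=x.shape) >= noise)  # mask inputs
    h = sigmoid(x_tilde @ W + b)                        # encode
    z = sigmoid(h @ W.T + b_prime)                      # decode
    dz = z - x        # cross-entropy grad. at decoder pre-activation
    dh = (dz @ W) * h * (1.0 - h)                       # backprop
    W -= lr * (x_tilde.T @ dh + dz.T @ h)   # tied-weight gradient
    b -= lr * dh.sum(axis=0)
    b_prime -= lr * dz.sum(axis=0)
    return W, b, b_prime
\end{verbatim}
Each layer would be pretrained in turn on the representation produced by the previous one, before supervised fine-tuning of the whole network.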
It is hypothesized that the
advantage brought by this procedure stems from a better prior,
on the one hand taking advantage of the link between the input
Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
SDA2), along with previous results on the digits of the NIST special
database 19 test set from the literature, based respectively on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
(figures and tables, including standard errors on the error rates) can be
found in the supplementary material.
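Concretely, for an error rate $\hat{p}$ measured on $n$ test examples, the usual binomial approximation (an assumption here; the exact estimator used is not spelled out in this excerpt) gives the standard error and 95\% confidence interval
\[
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}, \qquad
\hat{p} \pm 1.96\,\mathrm{SE}(\hat{p}),
\]
which is what the error bars in Figure~\ref{fig:error-rates-charts} represent.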
The three kinds of models differ in the
training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07
(MLP2, SDA2). The deep learner not only outperformed the shallow ones and
\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on 3 different test sets corresponding to the three
datasets.
Right: error rates on NIST test digits only, along with the previous results from the
literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
respectively based on ART, nearest neighbors, MLPs, and SVMs.}

\label{fig:error-rates-charts}
\end{figure}
