comparison writeup/nips2010_submission.tex @ 518:460a4e78c9a4

merging is fun, merging is fun, merging is fun
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Tue, 01 Jun 2010 11:15:37 -0700
parents 0a5945249f2b 092dae9a5040
children eaa595ea2402
517:0a5945249f2b 518:460a4e78c9a4
88 88 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}
89 89 and multi-task learning, not much has been done yet to explore the impact
90 90 of {\em out-of-distribution} examples and of the multi-task setting
91 91 (but see~\citep{CollobertR2008}). In particular, the {\em relative
92 92 advantage} of deep learning for these settings has not been evaluated.
93 The hypothesis explored here is that a deep hierarchy of features
94 may be better able to provide sharing of statistical strength
95 between different regions in input space or different tasks,
96 as discussed in the conclusion.
93 97
94 98 % TODO: why we care to evaluate this relative advantage
95 99
96 100 In this paper we ask the following questions:
97 101
318 322 \vspace*{-1mm}
319 323 \section{Experimental Setup}
320 324 \vspace*{-1mm}
321 325
322 326 Whereas much previous work on deep learning algorithms had been performed on
323 327 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
324 328 with 60~000 examples, and variants involving 10~000
325 329 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
326 330 to focus here on the case of much larger training sets, from 10 times to
327 331 1000 times larger. The larger datasets are obtained by first sampling from
328 332 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
354 358 %\begin{itemize}
355 359 %\item
356 360 {\bf NIST.}
357 361 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
358 362 widely used for training and testing character
359 363 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
360 364 The dataset is composed of 814~255 digits and characters (upper and lower case), with hand-checked classifications,
361 365 extracted from handwritten sample forms of 3600 writers. The characters are labelled with one of the 62 classes
362 366 corresponding to ``0''--``9'', ``A''--``Z'' and ``a''--``z''. The dataset contains 8 series of varying complexity.
363 367 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
364 368 by NIST as a testing set and is used for that purpose in our work and in some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
365 369 We randomly split the remainder into a training set and a validation set for
366 370 model selection. The sizes of these data sets are: 651~668 for training, 80~000 for validation,
367 371 and 82~587 for testing.
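For concreteness, here is a minimal sketch of that random split, assuming the 731~668 non-test examples (which, together with the 82~587 test examples, account for the 814~255 characters) are already loaded as a list of (image, label) pairs; the function name and the fixed seed are illustrative assumptions, not taken from the paper:

import random

def split_nist(non_test_examples, n_valid=80000, seed=0):
    """Randomly split the non-test portion of NIST SD19 into a training set
    and a validation set used for model selection (651668 / 80000 here)."""
    rng = random.Random(seed)      # hypothetical fixed seed, for reproducibility
    examples = list(non_test_examples)
    rng.shuffle(examples)          # random split, as described in the text
    return examples[n_valid:], examples[:n_valid]   # (train, valid)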
368 372 The performance reported in previous work on that dataset mostly concerns only the digits.
369 373 Here we use all the classes, both in the training and testing phases. This is especially
448 452 through preliminary experiments, and 0.1 was selected.
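The preliminary model selection alluded to above amounts to trying a few candidate values and keeping the one with the lowest error on the validation set defined earlier; a generic sketch follows, in which the candidate values and the two helper functions are hypothetical, not the paper's actual settings:

def select_by_validation(train_fn, valid_error_fn, candidates=(0.001, 0.01, 0.1, 1.0)):
    """Keep the candidate hyperparameter value with the lowest validation error.
    train_fn(value) is assumed to train a model and return it;
    valid_error_fn(model) returns its error rate on the validation set."""
    best_value, best_error = None, float('inf')
    for value in candidates:
        model = train_fn(value)           # preliminary training run
        error = valid_error_fn(model)     # measured on the held-out validation set
        if error < best_error:
            best_value, best_error = value, error
    return best_value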
449 453
450 454 {\bf Stacked Denoising Auto-Encoders (SDA).}
451 455 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
452 456 can be used to initialize the weights of each layer of a deep MLP (with many hidden
453 457 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
454 458 enabling better generalization, apparently by setting the parameters in the
455 459 basin of attraction of supervised gradient descent solutions that
456 460 generalize better~\citep{Erhan+al-2010}. It is hypothesized that the
457 461 advantage brought by this procedure stems from a better prior,
458 462 on the one hand taking advantage of the link between the input
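To make the layer-wise initialization just described concrete, here is a minimal NumPy sketch of pretraining a stack of denoising auto-encoders whose weights then initialize the hidden layers of a deep MLP; the layer sizes, corruption level, learning rate and training loop are illustrative assumptions rather than the paper's actual configuration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dae_layer(X, n_hidden, corruption=0.25, lr=0.1, epochs=10, rng=None):
    """Train one denoising auto-encoder on inputs X (values in [0, 1]) and return
    its encoder parameters: tied weights, masking noise, cross-entropy
    reconstruction loss, plain per-example SGD."""
    rng = rng or np.random.RandomState(0)
    n_visible = X.shape[1]
    W = rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(epochs):
        for x in X:
            mask = rng.binomial(1, 1.0 - corruption, size=n_visible)
            x_tilde = x * mask                      # corrupt the input (masking noise)
            h = sigmoid(x_tilde @ W + b_h)          # encode the corrupted input
            z = sigmoid(h @ W.T + b_v)              # reconstruct with tied weights
            dz = z - x                              # gradient of the cross-entropy loss
            dh = (dz @ W) * h * (1.0 - h)
            W -= lr * (np.outer(x_tilde, dh) + np.outer(dz, h))
            b_v -= lr * dz
            b_h -= lr * dh
    return W, b_h

def init_deep_mlp(X, layer_sizes=(500, 500, 500)):
    """Greedy layer-wise pretraining: each auto-encoder initializes one hidden
    layer of the deep MLP, which is then fine-tuned by supervised gradient
    descent (fine-tuning not shown)."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae_layer(H, n_hidden)
        layers.append((W, b))
        H = sigmoid(H @ W + b)                      # representation fed to the next layer
    return layers

After this unsupervised initialization, the whole network (with an added output layer) is fine-tuned by supervised gradient descent, as described above.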
496 500 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
497 501 comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
498 502 SDA2), along with previous results from the literature on the digits of the
499 503 NIST special database 19 test set, respectively based on ARTMAP neural
500 504 networks~\citep{Granger+al-2007}, fast nearest-neighbor
501 505 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
502 506 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
503 507 (figures and tables, including standard errors on the error rates) can be
504 508 found in the supplementary material. The three kinds of models differ in the
505 509 training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07
506 510 (MLP2, SDA2). The deep learner not only outperformed the shallow ones and
541 545 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training
542 546 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
543 547 of all models, on three different test sets corresponding to the three
544 548 datasets.
545 549 Right: error rates on NIST test digits only, along with the previous results from the
546 550 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005},
547 551 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
548 552
549 553 \label{fig:error-rates-charts}
550 554 \end{figure}
551 555
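The 95\% confidence intervals in the caption above are presumably obtained from the binomial standard error of an error rate $\hat{p}$ measured on $n$ test examples, using the usual normal approximation (an assumption on our part; the supplementary material cited in the text gives the exact figures):
\[
\hat{p} \;\pm\; 1.96\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} .
\]
For example, with the $n = 82\,587$ NIST test examples mentioned earlier, an error rate around $20\%$ would carry an interval of roughly $\pm 0.3\%$.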