comparison writeup/nips2010_submission.tex @ 518:460a4e78c9a4
merging is fun, merging is fun, merging is fun
author:   Dumitru Erhan <dumitru.erhan@gmail.com>
date:     Tue, 01 Jun 2010 11:15:37 -0700
parents:  0a5945249f2b 092dae9a5040
children: eaa595ea2402
unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}
and multi-task learning, not much has been done yet to explore the impact
of {\em out-of-distribution} examples and of the multi-task setting
(but see~\citep{CollobertR2008}). In particular, the {\em relative
advantage} of deep learning for these settings has not been evaluated.
The hypothesis explored here is that a deep hierarchy of features
may be better able to provide sharing of statistical strength
between different regions in input space or different tasks,
as discussed in the conclusion.

% TODO: why we care to evaluate this relative advantage

In this paper we ask the following questions:

\vspace*{-1mm}
\section{Experimental Setup}
\vspace*{-1mm}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
to focus here on the case of much larger training sets, from 10 times to
1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
%\begin{itemize}
%\item
{\bf NIST.}
Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
The dataset is composed of 814~255 digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3~600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 series of different complexity.
The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
by NIST as a testing set and is used in our work and some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
for that purpose. We randomly split the remainder into a training set and a validation set for
model selection. The sizes of these data sets are: 651~668 for training, 80~000 for validation,
and 82~587 for testing.
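These sizes add up to the full 814~255 examples. As an illustration only, the random split could be expressed as follows in NumPy; the variable names and the fixed seed are assumptions for this sketch, not the actual preprocessing code.
\begin{verbatim}
import numpy as np

# Sketch of the random train/validation split described above
# (hypothetical code, not the actual ift6266 pipeline).  After
# setting aside the 82 587 hsf_4 examples for testing, 731 668
# examples remain to be divided between training and validation.
n_remaining, n_train = 731668, 651668
rng = np.random.RandomState(1234)    # assumed seed, for illustration
perm = rng.permutation(n_remaining)  # shuffle example indices
train_idx = perm[:n_train]           # 651 668 training examples
valid_idx = perm[n_train:]           # 80 000 validation examples
\end{verbatim}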
The performances reported by previous work on that dataset mostly concern only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
through preliminary experiments, and 0.1 was selected.
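A minimal sketch of this kind of validation-based choice (the candidate values and the helper are hypothetical; the actual search is not reproduced here):
\begin{verbatim}
def select_learning_rate(candidates, validation_error):
    """Return the candidate with the lowest validation error.

    `validation_error` is a hypothetical callable that trains a
    model with the given learning rate and returns its error rate
    on the validation set.
    """
    return min(candidates, key=validation_error)

# e.g. select_learning_rate([0.0001, 0.001, 0.01, 0.1],
#                           validation_error)
\end{verbatim}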

{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
enabling better generalization, apparently by setting the parameters in a
basin of attraction of supervised gradient descent that yields better
generalization~\citep{Erhan+al-2010}.
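As an illustration of the unsupervised step that initializes one layer, here is a minimal NumPy sketch of a denoising auto-encoder update with tied weights and masking noise, following the recipe of~\citep{VincentPLarochelleH2008}; it is a sketch under those assumptions, not the paper's actual implementation.
\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_update(x, W, b, b_prime, noise, lr, rng):
    """One SGD step of a denoising auto-encoder (tied weights).

    x: (batch, d) inputs in [0, 1]; W: (d, k), shared by the
    encoder and the (transposed) decoder.  After pretraining,
    (W, b) initialize one hidden layer of the deep MLP.
    """
    x_tilde = x * (rng.uniform(size=x.shape) >= noise)  # mask inputs
    h = sigmoid(x_tilde @ W + b)                        # encode
    z = sigmoid(h @ W.T + b_prime)                      # decode
    dz = z - x        # cross-entropy grad. at decoder pre-activation
    dh = (dz @ W) * h * (1.0 - h)                       # backprop
    W -= lr * (x_tilde.T @ dh + dz.T @ h)   # tied-weight gradient
    b -= lr * dh.sum(axis=0)
    b_prime -= lr * dz.sum(axis=0)
    return W, b, b_prime
\end{verbatim}
Each layer would be pretrained in turn on the representation produced by the previous one, before supervised fine-tuning of the whole network.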
It is hypothesized that the
advantage brought by this procedure stems from a better prior,
on the one hand taking advantage of the link between the input
Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
SDA2), along with previous results on the digits of the NIST special
database 19 test set from the literature, based respectively on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
(figures and tables, including standard errors on the error rates) can be
found in the supplementary material.
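Concretely, for an error rate $\hat{p}$ measured on $n$ test examples, the usual binomial approximation (an assumption here; the exact estimator used is not spelled out in this excerpt) gives the standard error and 95\% confidence interval
\[
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}, \qquad
\hat{p} \pm 1.96\,\mathrm{SE}(\hat{p}),
\]
which is what the error bars in Figure~\ref{fig:error-rates-charts} represent.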
The three kinds of models differ in the
training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07
(MLP2, SDA2). The deep learner not only outperformed the shallow ones and
\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on 3 different test sets corresponding to the three
datasets.
Right: error rates on NIST test digits only, along with the previous results from the
literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
respectively based on ART, nearest neighbors, MLPs, and SVMs.}

\label{fig:error-rates-charts}
\end{figure}
