comparison writeup/nips2010_submission.tex @ 520:18a6379999fd

more after lunch :)
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Tue, 01 Jun 2010 11:58:14 -0700
parents eaa595ea2402
children 13816dbef6ed
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled with one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended
by NIST as a testing set and is used in our work, as well as in some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005},
for that purpose. We randomly split the remainder into a training set and a validation set for
model selection. The sizes of these datasets are: 651668 for training, 80000 for validation,
and 82587 for testing.
Most of the performance figures reported by previous work on this dataset concern only the digits.
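The split arithmetic can be checked directly: removing the 82587 $hsf_4$ test
examples from the 814255 total leaves 731668 examples, split into 651668 for
training and 80000 for validation. For concreteness, a minimal sketch of such a
random split is shown below; it is an illustration only (not the preprocessing
code actually used), and the array names \texttt{images} and \texttt{labels}
are placeholders.
\begin{verbatim}
import numpy as np

n_total = 814255      # all labelled NIST characters
n_test  = 82587       # hsf_4 partition, kept aside for testing
n_valid = 80000       # held out for model selection
n_train = n_total - n_test - n_valid          # = 651668

rng = np.random.RandomState(1234)             # fixed seed for reproducibility
perm = rng.permutation(n_total - n_test)      # shuffle the non-test examples

train_idx, valid_idx = perm[:n_train], perm[n_train:]
assert len(train_idx) == 651668 and len(valid_idx) == 80000

# train_set = images[train_idx], labels[train_idx]
# valid_set = images[valid_idx], labels[valid_idx]
\end{verbatim}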
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we compared only to MLPs here because of the very large datasets used
(which make SVMs computationally impractical, given their quadratic
scaling behavior in the number of training examples).
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(class | image)$.
The number of hidden units is chosen from $\{300,500,800,1000,1500\}$.
The optimization procedure is as follows: training
examples are presented in minibatches of size 20, and a constant learning
rate is chosen from $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set);
$0.1$ was then selected.
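For concreteness, the following minimal sketch shows a single-hidden-layer
$\tanh$/softmax MLP of this kind trained by minibatch stochastic gradient
descent with the selected hyper-parameters (batch size 20, learning rate
$0.1$). It is an illustration only, not the implementation used in our
experiments; the $32\times 32$ input size, the choice of 1000 hidden units and
the weight initialization are assumptions of the sketch.
\begin{verbatim}
import numpy as np

n_in, n_hidden, n_out = 32 * 32, 1000, 62   # 62 classes: 0-9, A-Z, a-z
lr, batch_size = 0.1, 20                    # selected hyper-parameters

rng = np.random.RandomState(0)
W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out)); b2 = np.zeros(n_out)

def forward(x):
    h = np.tanh(x.dot(W1) + b1)                       # hidden layer
    a = h.dot(W2) + b2
    a -= a.max(axis=1, keepdims=True)                 # numerical stability
    p = np.exp(a); p /= p.sum(axis=1, keepdims=True)  # softmax: P(class|image)
    return h, p

def sgd_step(x, y):               # x: (20, n_in) minibatch, y: integer labels
    global W1, b1, W2, b2
    h, p = forward(x)
    d_out = p.copy(); d_out[np.arange(len(y)), y] -= 1.0; d_out /= len(y)
    d_hid = d_out.dot(W2.T) * (1.0 - h ** 2)          # tanh derivative
    W2 -= lr * h.T.dot(d_out); b2 -= lr * d_out.sum(0)
    W1 -= lr * x.T.dot(d_hid); b1 -= lr * d_hid.sum(0)
\end{verbatim}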
{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
apparently setting parameters in a
basin of attraction of supervised gradient descent that yields better
generalization~\citep{Erhan+al-2010}. It is hypothesized that the
advantage brought by this procedure stems from a better prior,
on the one hand taking advantage of the link between the input
distribution $P(x)$ and the conditional distribution of interest
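As a concrete illustration of this unsupervised pre-training, the sketch below
performs one stochastic gradient step for a single denoising auto-encoder layer
(masking noise, sigmoid units, tied weights, cross-entropy reconstruction);
stacking several such layers and then fine-tuning the whole network with
supervised gradient descent gives the SDA. The hyper-parameter values and the
tied-weight choice are assumptions of this sketch, not a description of the
exact model used.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, lr = 32 * 32, 1000, 0.01

W = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, corruption=0.25):
    """One SGD step on a minibatch x of inputs scaled to [0, 1]."""
    global W, b_h, b_v
    mask = rng.binomial(1, 1.0 - corruption, x.shape)  # masking noise
    x_tilde = x * mask                                  # corrupted input
    h = sigmoid(x_tilde.dot(W) + b_h)                   # encoder
    z = sigmoid(h.dot(W.T) + b_v)                       # tied-weight decoder
    d_z = (z - x) / len(x)    # gradient of cross-entropy reconstruction loss
    d_h = d_z.dot(W) * h * (1.0 - h)
    W   -= lr * (x_tilde.T.dot(d_h) + d_z.T.dot(h))
    b_h -= lr * d_h.sum(0); b_v -= lr * d_z.sum(0)
\end{verbatim}
After pre-training, the learned encoder weights initialize the corresponding
hidden layer of the deep MLP before supervised fine-tuning.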
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
(figures and tables, including standard errors on the error rates) can be
found in Appendix I of the supplementary material. The three kinds of models differ in the
training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07
(MLP2, SDA2). The deep learner not only outperforms the shallow ones and
the previously published results (in a statistically and qualitatively
significant way) but also reaches human performance on both the 62-class task
and the 10-class (digits) task. In addition, as shown in the left of
607 \vspace*{-1mm} 607 \vspace*{-1mm}
608 608
609 We have found that the self-taught learning framework is more beneficial 609 We have found that the self-taught learning framework is more beneficial
610 to a deep learner than to a traditional shallow and purely 610 to a deep learner than to a traditional shallow and purely
611 supervised learner. More precisely, 611 supervised learner. More precisely,
612 the conclusions are positive for all the questions asked in the introduction. 612 the answers are positive for all the questions asked in the introduction.
613 %\begin{itemize} 613 %\begin{itemize}
614 614
615 $\bullet$ %\item 615 $\bullet$ %\item
616 Do the good results previously obtained with deep architectures on the 616 Do the good results previously obtained with deep architectures on the
617 MNIST digits generalize to the setting of a much larger and richer (but similar) 617 MNIST digits generalize to the setting of a much larger and richer (but similar)
618 dataset, the NIST special database 19, with 62 classes and around 800k examples? 618 dataset, the NIST special database 19, with 62 classes and around 800k examples?
619 Yes, the SDA {\bf systematically outperformed the MLP and all the previously 619 Yes, the SDA {\bf systematically outperformed the MLP and all the previously
620 published results on this dataset (as far as we know), in fact reaching human-level 620 published results on this dataset (the one that we are aware of), in fact reaching human-level
621 performance} at round 17\% error on the 62-class task and 1.4\% on the digits. 621 performance} at round 17\% error on the 62-class task and 1.4\% on the digits.
622 622
623 $\bullet$ %\item 623 $\bullet$ %\item
624 To what extent does the perturbation of input images (e.g. adding 624 To what extent does the perturbation of input images (e.g. adding
625 noise, affine transformations, background images) make the resulting 625 noise, affine transformations, background images) make the resulting