comparison writeup/nips2010_submission.tex @ 520:18a6379999fd
more after lunch :)
author: Dumitru Erhan <dumitru.erhan@gmail.com>
date: Tue, 01 Jun 2010 11:58:14 -0700
parents: eaa595ea2402
children: 13816dbef6ed
519:eaa595ea2402 (left) | 520:18a6379999fd (right) |
362 widely used for training and testing character | 362 widely used for training and testing character |
363 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. | 363 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. |
364 The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications, | 364 The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications, |
365 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | 365 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes |
366 corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. | 366 corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. |
367 The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one is the one recommended | 367 The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended |
368 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 368 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
369 for that purpose. We randomly split the remainder into a training set and a validation set for | 369 for that purpose. We randomly split the remainder into a training set and a validation set for |
370 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, | 370 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, |
371 and 82587 for testing. | 371 and 82587 for testing. |
372 The performances reported by previous work on that dataset mostly concern only the digits. | 372 The performances reported by previous work on that dataset mostly concern only the digits. |
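The split just described ($hsf_4$ held out for testing, the remaining 731668 examples divided at random into 651668 training and 80000 validation examples) can be illustrated with a minimal Python sketch. The array names `X` and `y` and the random seed are assumptions made for illustration; they are not taken from the paper or its code base.

    import numpy as np

    def split_train_valid(X, y, n_valid=80000, seed=0):
        """Randomly split the non-hsf_4 NIST examples into the training and
        validation sets used for model selection (651668 / 80000)."""
        rng = np.random.RandomState(seed)
        perm = rng.permutation(len(X))               # shuffle example indices
        valid_idx, train_idx = perm[:n_valid], perm[n_valid:]
        return (X[train_idx], y[train_idx]), (X[valid_idx], y[valid_idx])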
444 Whereas previous work had compared deep architectures to both shallow MLPs and | 444 Whereas previous work had compared deep architectures to both shallow MLPs and |
445 SVMs, we only compared to MLPs here because of the very large datasets used | 445 SVMs, we only compared to MLPs here because of the very large datasets used |
446 (making the use of SVMs computationally inconvenient because of their quadratic | 446 (making the use of SVMs computationally inconvenient because of their quadratic |
447 scaling behavior). | 447 scaling behavior). |
448 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 448 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
449 exponentials) on the output layer for estimating$ P(class | image)$. | 449 exponentials) on the output layer for estimating $P(class | image)$. |
450 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 450 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. |
451 The optimization procedure is as follows: training | 451 The optimization procedure is as follows: training |
452 examples are presented in minibatches of size 20, a constant learning | 452 examples are presented in minibatches of size 20, a constant learning |
453 rate is chosen in $10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$ | 453 rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$ |
454 through preliminary experiments (measuring performance on a validation set), | 454 through preliminary experiments (measuring performance on a validation set), |
455 and $0.1$ was then selected. | 455 and $0.1$ was then selected. |
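To make the shallow baseline concrete, here is a minimal numpy sketch of the MLP just described: one $\tanh$ hidden layer, a softmax output estimating $P(class \mid image)$, and plain minibatch SGD with the selected constant learning rate of $0.1$. The initialization scale and the Python names are assumptions for illustration; the paper's experiments were not necessarily implemented this way.

    import numpy as np

    def softmax(a):
        a = a - a.max(axis=1, keepdims=True)            # subtract row max for stability
        e = np.exp(a)
        return e / e.sum(axis=1, keepdims=True)

    class ShallowMLP:
        """One tanh hidden layer and a softmax output estimating P(class | image)."""
        def __init__(self, n_in, n_hidden, n_classes, rng):
            scale = 1.0 / np.sqrt(n_in)                 # assumed initialization scale
            self.W1 = rng.uniform(-scale, scale, (n_in, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.uniform(-scale, scale, (n_hidden, n_classes))
            self.b2 = np.zeros(n_classes)

        def forward(self, X):
            h = np.tanh(X @ self.W1 + self.b1)          # hidden activations
            return h, softmax(h @ self.W2 + self.b2)    # class probabilities

        def sgd_step(self, X, y, lr=0.1):
            """One constant-learning-rate SGD step on a minibatch (X, y)."""
            h, p = self.forward(X)
            onehot = np.eye(p.shape[1])[y]
            d_out = (p - onehot) / len(X)               # grad of mean NLL wrt logits
            d_hid = (d_out @ self.W2.T) * (1.0 - h ** 2)  # backprop through tanh
            self.W2 -= lr * (h.T @ d_out); self.b2 -= lr * d_out.sum(axis=0)
            self.W1 -= lr * (X.T @ d_hid); self.b1 -= lr * d_hid.sum(axis=0)

In the setup described above, such a model would be trained on minibatches of 20 examples, with the number of hidden units chosen from $\{300,500,800,1000,1500\}$ and the learning rate chosen from the grid above by validation error.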
456 | 456 |
457 {\bf Stacked Denoising Auto-Encoders (SDA).} | 457 {\bf Stacked Denoising Auto-Encoders (SDA).} |
458 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 458 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
459 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 459 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
460 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006} | 460 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, |
461 enabling better generalization, apparently setting parameters in the | 461 apparently setting parameters in the |
462 basin of attraction of supervised gradient descent yielding better | 462 basin of attraction of supervised gradient descent yielding better |
463 generalization~\citep{Erhan+al-2010}. It is hypothesized that the | 463 generalization~\citep{Erhan+al-2010}. It is hypothesized that the |
464 advantage brought by this procedure stems from a better prior, | 464 advantage brought by this procedure stems from a better prior, |
465 on the one hand taking advantage of the link between the input | 465 on the one hand taking advantage of the link between the input |
466 distribution $P(x)$ and the conditional distribution of interest | 466 distribution $P(x)$ and the conditional distribution of interest |
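The greedy layer-wise pre-training referred to above can be illustrated for a single denoising auto-encoder layer. The sketch below is an assumption-laden illustration, not the paper's implementation: sigmoid units, tied weights, masking-noise corruption and a cross-entropy reconstruction loss are common choices for denoising auto-encoders, and the corruption level and learning rate shown are placeholders.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    class DenoisingAutoencoder:
        """One SDA layer: corrupt the input, encode it, reconstruct the clean
        input, and minimize cross-entropy reconstruction error (tied weights)."""
        def __init__(self, n_visible, n_hidden, rng, corruption=0.25):
            scale = 1.0 / np.sqrt(n_visible)            # assumed initialization scale
            self.W = rng.uniform(-scale, scale, (n_visible, n_hidden))
            self.b_hid = np.zeros(n_hidden)
            self.b_vis = np.zeros(n_visible)
            self.corruption, self.rng = corruption, rng

        def step(self, X, lr=0.01):
            """One pre-training SGD step on a minibatch X of inputs in [0, 1]."""
            mask = self.rng.binomial(1, 1.0 - self.corruption, X.shape)
            Xc = X * mask                               # masking-noise corruption
            h = sigmoid(Xc @ self.W + self.b_hid)       # encode the corrupted input
            r = sigmoid(h @ self.W.T + self.b_vis)      # reconstruct the clean input
            d_r = (r - X) / len(X)                      # grad of mean cross-entropy
            d_h = (d_r @ self.W) * h * (1.0 - h)
            self.W -= lr * (Xc.T @ d_h + d_r.T @ h)     # tied-weight gradient
            self.b_hid -= lr * d_h.sum(axis=0)
            self.b_vis -= lr * d_r.sum(axis=0)
            return h                                    # codes fed to the next layer

Each layer would be pre-trained this way on the codes produced by the previous one, after which the stacked encoding weights initialize a deep MLP that is fine-tuned by supervised gradient descent, as described in the paragraph above.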
506 19 test set from the literature, respectively based on ARTMAP neural | 506 19 test set from the literature, respectively based on ARTMAP neural |
507 networks~\citep{Granger+al-2007}, fast nearest-neighbor | 507 networks~\citep{Granger+al-2007}, fast nearest-neighbor |
508 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and | 508 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and |
509 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results | 509 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results |
510 (figures and tables, including standard errors on the error rates) can be | 510 (figures and tables, including standard errors on the error rates) can be |
511 found in the supplementary material. The 3 kinds of model differ in the | 511 found in Appendix I of the supplementary material. The 3 kinds of model differ in the |
512 training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07 | 512 training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07 |
513 (MLP2, SDA2). The deep learner not only outperformed the shallow ones and | 513 (MLP2, SDA2). The deep learner not only outperformed the shallow ones and |
514 previously published performance (in a statistically and qualitatively | 514 previously published performance (in a statistically and qualitatively |
515 significant way) but also reaches human performance on both the 62-class task | 515 significant way) but also reaches human performance on both the 62-class task |
516 and the 10-class (digits) task. In addition, as shown in the left of | 516 and the 10-class (digits) task. In addition, as shown in the left of |
607 \vspace*{-1mm} | 607 \vspace*{-1mm} |
608 | 608 |
609 We have found that the self-taught learning framework is more beneficial | 609 We have found that the self-taught learning framework is more beneficial |
610 to a deep learner than to a traditional shallow and purely | 610 to a deep learner than to a traditional shallow and purely |
611 supervised learner. More precisely, | 611 supervised learner. More precisely, |
612 the conclusions are positive for all the questions asked in the introduction. | 612 the answers are positive for all the questions asked in the introduction. |
613 %\begin{itemize} | 613 %\begin{itemize} |
614 | 614 |
615 $\bullet$ %\item | 615 $\bullet$ %\item |
616 Do the good results previously obtained with deep architectures on the | 616 Do the good results previously obtained with deep architectures on the |
617 MNIST digits generalize to the setting of a much larger and richer (but similar) | 617 MNIST digits generalize to the setting of a much larger and richer (but similar) |
618 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 618 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
619 Yes, the SDA {\bf systematically outperformed the MLP and all the previously | 619 Yes, the SDA {\bf systematically outperformed the MLP and all the previously |
620 published results on this dataset (as far as we know), in fact reaching human-level | 620 published results on this dataset (those that we are aware of), in fact reaching human-level |
621 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. | 621 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
622 | 622 |
623 $\bullet$ %\item | 623 $\bullet$ %\item |
624 To what extent does the perturbation of input images (e.g. adding | 624 To what extent does the perturbation of input images (e.g. adding |
625 noise, affine transformations, background images) make the resulting | 625 noise, affine transformations, background images) make the resulting |