Mercurial > ift6266
comparison writeup/nips2010_submission.tex @ 566:b9b811e886ae
Small fixes
author: Olivier Delalleau <delallea@iro>
date: Thu, 03 Jun 2010 13:16:00 -0400
parents: dc5c3f538a05
children: ae6ba0309bf9
comparison of 565:34278b732d2c with 566:b9b811e886ae
is sampled from one of these sources (chosen randomly), the second step is to
apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.

To provide a baseline for error-rate comparison, we also estimate human performance
on both the 62-class task and the 10-class digits task.
We compare the best Multi-Layer Perceptrons (MLP) against
the best Stacked Denoising Auto-encoders (SDA), when
both models' hyper-parameters are selected to minimize the validation set error.
We also provide a comparison against a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service (http://mturk.com).
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language processing and vision.
The performances reported by previous work on that dataset mostly use only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
The distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
of letters in the test set (whereas in the training set they are distributed
more like in natural text).
\vspace*{-1mm}

%\item
{\bf Fonts.}

\vspace*{-3mm}
\subsection{Models and their Hyperparameters}
\vspace*{-2mm}

The experiments are performed using MLPs (with a single
hidden layer) and SDAs.
\emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used
exponentials) on the output layer for estimating $P(class | image)$.
The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
Training examples are presented in minibatches of size 20. A constant learning
rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ (which was found to work best) was then selected for optimizing on
the whole training sets.
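The minibatch SGD setup described above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the input size, tanh hidden units, and Gaussian weight initialization are assumptions for the sketch; only the hidden-layer size (from $\{300,500,800,1000,1500\}$), the softmax output, the minibatch size of 20, and the constant learning rate of $0.1$ come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: 32x32 inputs, 62 character classes.
n_in, n_hidden, n_out = 32 * 32, 500, 62   # 500 is one of {300,500,800,1000,1500}
lr, batch_size = 0.1, 20                   # learning rate and minibatch size from the text

W1 = rng.normal(0.0, 0.01, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.01, (n_hidden, n_out)); b2 = np.zeros(n_out)

def softmax(z):
    # Normalized exponentials on the output layer: estimates P(class | image).
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(x, y):
    """One constant-learning-rate SGD step on a minibatch, minimizing
    the negative log-likelihood of the correct classes."""
    global W1, b1, W2, b2
    h = np.tanh(x @ W1 + b1)              # single hidden layer (tanh assumed)
    p = softmax(h @ W2 + b2)              # class posteriors
    d_out = p.copy()
    d_out[np.arange(len(y)), y] -= 1.0    # gradient of NLL w.r.t. the logits
    d_out /= len(y)
    d_hid = (d_out @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (x.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)
    return -np.log(p[np.arange(len(y)), y]).mean()

# A synthetic minibatch, just to show the interface.
x = rng.normal(size=(batch_size, n_in))
y = rng.integers(0, n_out, size=batch_size)
loss = sgd_step(x, y)
```

In the actual experiments the learning rate was itself a hyper-parameter, chosen on a validation set as described above; here it is simply fixed at the selected value.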
\vspace*{-1mm}


{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
\label{fig:da}
\vspace*{-2mm}
\end{figure}

Here we chose to use the Denoising
Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
explain (see Figure~\ref{fig:da}, as well as the
tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training, a Denoising
Auto-encoder is presented with a stochastically corrupted version
of the input and trained to reconstruct the uncorrupted input,
forcing the hidden units to represent the leading regularities in
the data. Once it is trained, in a purely unsupervised way,
its hidden units' activations can
be used as inputs for training a second one, etc.
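The corrupt-then-reconstruct training step described above can be sketched as follows. The masking-noise level, sigmoid units, tied weights, and layer sizes are assumptions made for this illustration (they are not the exact settings of the experiments); the essential point is that the reconstruction loss is measured against the uncorrupted input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and noise level, for illustration only.
n_visible, n_hidden = 32 * 32, 500
corruption = 0.25        # assumed fraction of input pixels zeroed out
lr = 0.1

W = rng.normal(0.0, 0.01, (n_visible, n_hidden))  # tied encoder/decoder weights
b_hid = np.zeros(n_hidden)
b_vis = np.zeros(n_visible)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def da_step(x):
    """One training step: stochastically corrupt the input, encode, decode,
    and minimize the cross-entropy against the *uncorrupted* input."""
    global W, b_hid, b_vis
    mask = rng.random(x.shape) > corruption   # masking noise
    x_tilde = x * mask
    h = sigmoid(x_tilde @ W + b_hid)          # hidden representation
    z = sigmoid(h @ W.T + b_vis)              # reconstruction (tied weights)
    loss = -(x * np.log(z + 1e-9)
             + (1 - x) * np.log(1 - z + 1e-9)).sum(axis=1).mean()
    d_z = (z - x) / x.shape[0]                # grad w.r.t. decoder pre-activation
    d_h = (d_z @ W) * h * (1.0 - h)           # backprop into the encoder
    W -= lr * (d_z.T @ h + x_tilde.T @ d_h)   # both tied-weight contributions
    b_vis -= lr * d_z.sum(axis=0)
    b_hid -= lr * d_h.sum(axis=0)
    return loss

# A synthetic minibatch of binary images, to show the interface.
x = (rng.random((20, n_visible)) > 0.5).astype(float)
loss = da_step(x)
```

Once such a layer is trained, `h` computed on clean inputs would serve as the input representation for training the next denoising auto-encoder in the stack.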
on either NIST, NISTP or P07, either on the 62-class task
or on the 10-digits task.
Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
SDA2), along with the previous results on the digits NIST special database
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
(figures and tables, including standard errors on the error rates) can be
found in Appendix I of the supplementary material.
lower-case, or upper-case characters). Again, whereas the gain from the
multi-task setting is marginal or negative for the MLP, it is substantial
for the SDA. Note that to simplify these multi-task experiments, only the original
NIST dataset is used. For example, the MLP-digits bar shows that the relative
percent improvement in MLP error rate on the NIST digits test set
is $100\% \times$ (single-task
model's error / multi-task model's error - 1). The single-task model is
trained with only 10 outputs (one per digit), seeing only digit examples,
whereas the multi-task model is trained with 62 outputs, with all 62
character classes as examples. Hence the hidden units are shared across
all tasks. For the multi-task model, the digit error rate is measured by
comparing the correct digit class with the output class associated with the