comparison writeup/nips2010_submission.tex @ 566:b9b811e886ae

Small fixes
author Olivier Delalleau <delallea@iro>
date Thu, 03 Jun 2010 13:16:00 -0400
is sampled from one of these sources (chosen randomly), the second step is to
apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.

To provide a baseline for error rate comparison we also estimate human performance
on both the 62-class task and the 10-class digits task.
We compare the best Multi-Layer Perceptrons (MLP) against
the best Stacked Denoising Auto-encoders (SDA), when
both models' hyper-parameters are selected to minimize the validation set error.
We also provide a comparison against a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service ({\tt http://mturk.com}).
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language processing and vision.
% [...]
Results reported by previous work on that dataset mostly use only the digits.
Here we use all the classes, in both the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
The distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
of letters in the test set (whereas in the training set they are distributed
more like in natural text).
\vspace*{-1mm}

%\item
{\bf Fonts.}
% [...]

\vspace*{-3mm}
\subsection{Models and their Hyperparameters}
\vspace*{-2mm}

The experiments are performed using MLPs (with a single
hidden layer) and SDAs.
\emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used
% [...]
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
Training examples are presented in minibatches of size 20. A constant learning
rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ (which was found to work best) was then selected for optimizing on
the whole training sets.
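The following is a minimal numpy sketch of such a model, purely for
illustration (it is not the code used in the experiments): a single tanh
hidden layer, a softmax output estimating
$P(\mathrm{class} \mid \mathrm{image})$, and constant-rate SGD on
minibatches of 20. The $32 \times 32$ input size and all identifiers are
assumptions made for this sketch.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def softmax(z):
    # Numerically stable softmax over the class dimension.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Single tanh hidden layer + softmax output; hidden size and learning
# rate taken from the grids in the text, 32x32 input is an assumption.
n_in, n_hidden, n_out = 32 * 32, 500, 62
W1 = rng.uniform(-0.01, 0.01, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.01, 0.01, (n_hidden, n_out)); b2 = np.zeros(n_out)

def sgd_step(x, y, lr=0.1):              # constant rate 0.1, as selected
    global W1, b1, W2, b2
    h = np.tanh(x @ W1 + b1)             # hidden layer activations
    p = softmax(h @ W2 + b2)             # P(class | image)
    d_out = p.copy()                     # gradient of the mean negative
    d_out[np.arange(len(y)), y] -= 1.0   # log-likelihood w.r.t. the
    d_out /= len(y)                      # softmax pre-activation
    d_h = (d_out @ W2.T) * (1.0 - h**2)  # backprop through tanh
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (x.T @ d_h);   b1 -= lr * d_h.sum(axis=0)
    return -np.log(p[np.arange(len(y)), y]).mean()

x = rng.rand(20, n_in)                   # one minibatch of size 20
y = rng.randint(0, n_out, size=20)       # (random stand-in data)
loss = sgd_step(x, y)
\end{verbatim}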
\vspace*{-1mm}


{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
% [...]
\label{fig:da}
\vspace*{-2mm}
\end{figure}

Here we chose to use the Denoising
Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
explain (see Figure~\ref{fig:da}, as well as the
tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides efficient inference, and has yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training, a Denoising
Auto-encoder is presented with a stochastically corrupted version
of the input and trained to reconstruct the uncorrupted input,
forcing the hidden units to represent the leading regularities in
the data. Once it is trained, in a purely unsupervised way,
its hidden units' activations can
be used as inputs for training a second one, etc.
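As above, here is a small illustrative numpy sketch of one denoising
auto-encoder training step and of the greedy stacking just described. It
is not the experimental code; tied weights, zeroing corruption, and the
25\% corruption level are choices made only for the example.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    # One layer: corrupt the input, encode, decode, and reconstruct
    # the *uncorrupted* input (tied encoder/decoder weights).
    def __init__(self, n_visible, n_hidden):
        self.W = rng.uniform(-0.01, 0.01, (n_visible, n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.b_v = np.zeros(n_visible)

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_h)

    def train_step(self, x, corruption=0.25, lr=0.1):
        # Stochastic corruption: zero a random fraction of each input.
        mask = rng.binomial(1, 1.0 - corruption, x.shape)
        h = self.encode(x * mask)             # code from corrupted input
        r = sigmoid(h @ self.W.T + self.b_v)  # reconstruction
        # Cross-entropy against the clean input; d_r is its gradient
        # w.r.t. the decoder pre-activation, averaged over the batch.
        d_r = (r - x) / len(x)
        d_h = (d_r @ self.W) * h * (1.0 - h)
        self.W -= lr * ((x * mask).T @ d_h + d_r.T @ h)
        self.b_v -= lr * d_r.sum(axis=0)
        self.b_h -= lr * d_h.sum(axis=0)

# Greedy layer-wise stacking: train layer 1 on the data, then train
# layer 2 on layer 1's activations for the clean input.
x = rng.rand(20, 32 * 32)
da1 = DenoisingAutoencoder(32 * 32, 1000)
da1.train_step(x)
da2 = DenoisingAutoencoder(1000, 1000)
da2.train_step(da1.encode(x))
\end{verbatim}
In the standard SDA recipe this unsupervised stacking is followed by
supervised fine-tuning of the whole network.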
% [...]
on either NIST, NISTP or P07, either on the 62-class task
or on the 10-digits task.
Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
SDA2), along with the previous results on the digits NIST special database
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
(figures and tables, including standard errors on the error rates) can be
found in Appendix I of the supplementary material.
% [...]
lower-case, or upper-case characters). Again, whereas the gain from the
multi-task setting is marginal or negative for the MLP, it is substantial
for the SDA. Note that to simplify these multi-task experiments, only the original
NIST dataset is used. For example, the MLP-digits bar shows the relative
percent improvement in MLP error rate on the NIST digits test set,
which is $100\% \times$ (single-task
model's error / multi-task model's error $-$ 1).
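(With purely illustrative numbers: a single-task error of $5\%$ against a
multi-task error of $4\%$ would give $100\% \times (5/4 - 1) = 25\%$.)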
The single-task model is
trained with only 10 outputs (one per digit), seeing only digit examples,
whereas the multi-task model is trained with 62 outputs, with all 62
character classes as examples. Hence the hidden units are shared across
all tasks. For the multi-task model, the digit error rate is measured by
comparing the correct digit class with the output class associated with the