ift6266: comparison writeup/nips2010_submission.tex @ 485:6beaf3328521
tables removed
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Mon, 31 May 2010 21:50:00 -0400 |
parents | 9a757d565e46 |
children | 877af97ee193 6c9ff48e15cd |
484:9a757d565e46 | 485:6beaf3328521 |
---|---|
459 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | 459 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. |
460 | 460 |
461 \vspace*{-1mm} | 461 \vspace*{-1mm} |
462 \section{Experimental Results} | 462 \section{Experimental Results} |
463 | 463 |
464 \vspace*{-1mm} | 464 %\vspace*{-1mm} |
465 \subsection{SDA vs MLP vs Humans} | 465 %\subsection{SDA vs MLP vs Humans} |
466 \vspace*{-1mm} | 466 %\vspace*{-1mm} |
467 | 467 |
468 We compare here the best MLP (according to validation set error) that we found against | 468 We compare the best MLP (according to validation set error) that we found against |
469 the best SDA (again according to validation set error), along with a precise estimate | 469 the best SDA (again according to validation set error), along with a precise estimate |
470 of human performance obtained via Amazon's Mechanical Turk (AMT) | 470 of human performance obtained via Amazon's Mechanical Turk (AMT) |
471 service\footnote{http://mturk.com}. AMT users are paid small amounts | 471 service\footnote{http://mturk.com}. |
472 of money to perform tasks for which human intelligence is required. | 472 %AMT users are paid small amounts |
473 Mechanical Turk has been used extensively in natural language | 473 %of money to perform tasks for which human intelligence is required. |
474 processing \citep{SnowEtAl2008} and vision | 474 %Mechanical Turk has been used extensively in natural language |
475 \citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented | 475 %processing \citep{SnowEtAl2008} and vision |
476 %\citep{SorokinAndForsyth2008,whitehill09}. | |
477 AMT users were presented | |
476 with 10 character images and asked to type 10 corresponding ASCII | 478 with 10 character images and asked to type 10 corresponding ASCII |
477 characters. They were forced to make a hard choice among the | 479 characters. They were forced to make a hard choice among the |
478 62 or 10 character classes (all classes or digits only). | 480 62 or 10 character classes (all classes or digits only). |
479 Three users classified each image, allowing us | 481 Three users classified each image, allowing us |
480 to estimate inter-human variability (shown as $\pm$ in parentheses below). | 482 to estimate inter-human variability (shown as $\pm$ in parentheses below). |
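
The excerpt does not spell out the estimator behind the $\pm$ inter-human variability term. A minimal Python sketch of one plausible computation, treating the three answers collected per image as three rater columns -- an assumed data layout, not taken from the paper:

import numpy as np

def human_error_estimate(true_labels, votes):
    """Mean human error rate and inter-rater spread.

    true_labels: (n_images,) int array of gold classes.
    votes: (n_images, 3) int array of AMT answers, one column per
           answer slot (hypothetical layout; real raters varied per image).
    """
    per_rater_err = (votes != true_labels[:, None]).mean(axis=0)
    return per_rater_err.mean(), per_rater_err.std(ddof=1)

# Toy usage: 5 images, 3 answers each.
y = np.array([3, 7, 1, 0, 9])
v = np.array([[3, 3, 5], [7, 7, 7], [1, 2, 1], [0, 0, 0], [9, 4, 9]])
print(human_error_estimate(y, v))  # mean error and spread, here about (0.2, 0.2)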
481 | 483 |
482 Figure~\ref{fig:error-rates-charts} summarizes the results obtained. | 484 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, |
483 More detailed results and tables can be found in the appendix. | 485 comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1, |
484 | 486 SDA2), along with previous results on the NIST special database 19 digits |
485 \begin{table} | 487 test set from the |
488 literature, | |
489 respectively based on ARTMAP neural networks~% | |
490 \citep{Granger+al-2007}, fast nearest-neighbor search~% | |
491 \citep{Cortes+al-2000}, MLPs~% | |
492 \citep{Oliveira+al-2002}, and SVMs~% | |
493 \citep{Milgram+al-2005}. | |
494 More detailed and complete numerical results (figures and tables) | |
495 can be found in the appendix. The three kinds of models differ in the | |
496 training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), | |
497 or P07 (MLP2, SDA2). The deep learner not only outperformed | |
498 the shallow ones and previously published results | |
499 but also reached human performance on both the 62-class | |
500 task and the 10-class (digits) task. In addition, as shown | |
501 in the left side of Figure~\ref{fig:improvements-charts}, | |
502 the relative improvement in error rate brought by | |
503 self-taught learning is greater for the SDA. The left | |
504 side shows the improvement to the clean NIST test set error | |
505 brought by the use of out-of-distribution | |
506 examples (i.e. the perturbed examples from NISTP | |
507 or P07). The right side of Figure~\ref{fig:improvements-charts} | |
508 shows the relative improvement brought by the use | |
509 of a multi-task setting, in which the same model is trained | |
510 for more classes than the target classes of interest | |
511 (i.e. training with all 62 classes when the target classes | |
512 are respectively the digits, lower-case, or upper-case | |
513 characters). Again, whereas the gain is marginal | |
514 or negative for the MLP, it is substantial for the SDA. | |
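
The exact convention behind these relative improvements is only implied by the row labels of the appendix tables (e.g. MLP0/MLP2-1). Under that reading, a worked LaTeX definition -- an inference, not the authors' stated formula:

% Inferred from row labels such as MLP0/MLP2-1 (an assumption):
% baseline = model trained on NIST only (MLP0/SDA0),
% perturbed = same architecture trained on NISTP or P07 (MLP1/2, SDA1/2).
\[
  \Delta_{\mathrm{rel}} \;=\; \frac{e_{\mathrm{baseline}}}{e_{\mathrm{perturbed}}} - 1,
\]
% so $\Delta_{\mathrm{rel}} > 0$ exactly when training on the perturbed data
% lowered the error on the given test set, matching the table caption's
% sign convention.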
515 | |
516 | |
517 \begin{figure}[h] | |
518 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ | |
519 \caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature. } | |
520 \label{fig:error-rates-charts} | |
521 \end{figure} | |
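
The tables report $\pm$ one standard error while the chart's error bars show a 95\% confidence interval; assuming the usual binomial/normal approximation (not stated in the excerpt), the two are related by:

% Assumed relation between the reported std. err. and the 95% bars:
\[
  \mathrm{CI}_{95\%} \;=\; \hat{e} \,\pm\, 1.96\,\widehat{\mathrm{SE}}(\hat{e}),
  \qquad
  \widehat{\mathrm{SE}}(\hat{e}) \;=\; \sqrt{\hat{e}(1-\hat{e})/n},
\]
% where $\hat{e}$ is the measured test error rate and $n$ the number of
% test examples (a standard approximation, not taken from the paper).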
522 | |
523 %\vspace*{-1mm} | |
524 %\subsection{Perturbed Training Data More Helpful for SDAE} | |
525 %\vspace*{-1mm} | |
526 | |
527 %\vspace*{-1mm} | |
528 %\subsection{Multi-Task Learning Effects} | |
529 %\vspace*{-1mm} | |
530 | |
531 \iffalse | |
532 As previously seen, the SDA is better able to benefit from the | |
533 transformations applied to the data than the MLP. In this experiment we | |
534 define three tasks: recognizing digits (knowing that the input is a digit), | |
535 recognizing upper case characters (knowing that the input is one), and | |
536 recognizing lower case characters (knowing that the input is one). We | |
537 consider the digit classification task as the target task and we want to | |
538 evaluate whether training with the other tasks can help or hurt, and | |
539 whether the effect is different for MLPs versus SDAs. The goal is to find | |
540 out if deep learning can benefit more (or less) from multiple related tasks | |
541 (i.e. the multi-task setting) compared to a corresponding purely supervised | |
542 shallow learner. | |
543 | |
544 We use a single hidden layer MLP with 1000 hidden units, and an SDA | |
545 with 3 hidden layers (1000 hidden units per layer), pre-trained and | |
546 fine-tuned on NIST. | |
547 | |
548 Our results show that the MLP benefits marginally from the multi-task setting | |
549 in the case of digits (5\% relative improvement) but is actually hurt in the case | |
550 of characters (respectively 3\% and 4\% worse for lower- and upper-case characters). | |
551 On the other hand, the SDA benefited from the multi-task setting, with relative | |
552 error rate improvements of 27\%, 15\% and 13\% respectively for digits, | |
553 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | |
554 \fi | |
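
As a concrete reading of the multi-task protocol described above (one model trained on all 62 classes, then scored on a target subset whose membership is given), a minimal numpy sketch; the class-id layout and the restriction-by-masking step are assumptions, not the authors' code:

import numpy as np

DIGITS = np.arange(10)  # assumed ids 0-9 for the ten digit classes

def subset_error(logits, labels, subset):
    """Error rate when the input is known to belong to `subset`:
    restrict the 62-way scores to the allowed classes and take the
    argmax within them."""
    pred = subset[np.argmax(logits[:, subset], axis=1)]
    return float(np.mean(pred != labels))

# Toy usage: 4 examples with random 62-class scores, digit-only evaluation.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 62))
labels = np.array([0, 3, 9, 7])
print(subset_error(logits, labels, DIGITS))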
555 | |
556 | |
557 \begin{figure}[h] | |
558 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | |
559 \caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).} | |
560 \label{fig:improvements-charts} | |
561 \end{figure} | |
562 | |
563 \vspace*{-1mm} | |
564 \section{Conclusions} | |
565 \vspace*{-1mm} | |
566 | |
567 The conclusions are positive for all the questions asked in the introduction. | |
568 %\begin{itemize} | |
569 $\bullet$ %\item | |
570 Do the good results previously obtained with deep architectures on the | |
571 MNIST digits generalize to the setting of a much larger and richer (but similar) | |
572 dataset, the NIST special database 19, with 62 classes and around 800k examples? | |
573 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level | |
574 performance. | |
575 | |
576 $\bullet$ %\item | |
577 To what extent does the perturbation of input images (e.g. adding | |
578 noise, affine transformations, background images) make the resulting | |
579 classifier better not only on similarly perturbed images but also on | |
580 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | |
581 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | |
582 MLPs were helped by perturbed training examples when tested on perturbed input images, | |
583 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs | |
584 were very significantly boosted by these out-of-distribution examples. | |
585 | |
586 $\bullet$ %\item | |
587 Similarly, does the feature learning step in deep learning algorithms benefit more from | |
588 training with similar but different classes (i.e. a multi-task learning scenario) than | |
589 a corresponding shallow and purely supervised architecture? | |
590 Whereas the improvement due to the multi-task setting was marginal or | |
591 negative for the MLP, it was very significant for the SDA. | |
592 %\end{itemize} | |
593 | |
594 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | |
595 can be executed on-line at {\tt http://deep.host22.com}. | |
596 | |
597 | |
598 {\small | |
599 \bibliography{strings,ml,aigaion,specials} | |
600 %\bibliographystyle{plainnat} | |
601 \bibliographystyle{unsrtnat} | |
602 %\bibliographystyle{apalike} | |
603 } | |
604 | |
605 \newpage | |
606 | |
607 \centerline{APPENDIX FOR {\bf Deep Self-Taught Learning for Handwritten Character Recognition}} | |
608 | |
609 \vspace*{1cm} | |
610 | |
611 \begin{table}[h] | |
486 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + | 612 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + |
487 26 lower + 26 upper), except for the last column -- digits only, between deep architecture with pre-training | 613 26 lower + 26 upper), except for the last column -- digits only, between deep architecture with pre-training |
488 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture | 614 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture |
489 (MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07) | 615 (MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07) |
490 and using a validation set to select hyper-parameters and other training choices. | 616 and using a validation set to select hyper-parameters and other training choices. |
510 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline | 636 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline |
511 \end{tabular} | 637 \end{tabular} |
512 \end{center} | 638 \end{center} |
513 \end{table} | 639 \end{table} |
514 | 640 |
515 \begin{figure}[h] | 641 \begin{table}[h] |
516 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ | |
517 \caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature. } | |
518 \label{fig:error-rates-charts} | |
519 \end{figure} | |
520 | |
521 \vspace*{-1mm} | |
522 \subsection{Perturbed Training Data More Helpful for SDAE} | |
523 \vspace*{-1mm} | |
524 | |
525 \begin{table} | |
526 \caption{Relative change in error rates due to the use of perturbed training data, | 642 \caption{Relative change in error rates due to the use of perturbed training data, |
527 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. | 643 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. |
528 A positive value indicates that training on the perturbed data helped for the | 644 A positive value indicates that training on the perturbed data helped for the |
529 given test set (the first 3 columns are on the 62-class tasks and the last one is | 645 given test set (the first 3 columns are on the 62-class tasks and the last one is |
530 on the clean 10-class digits). Clearly, the deep learning models did benefit more | 646 on the clean 10-class digits). Clearly, the deep learning models did benefit more |
541 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline | 657 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline |
542 \end{tabular} | 658 \end{tabular} |
543 \end{center} | 659 \end{center} |
544 \end{table} | 660 \end{table} |
545 | 661 |
546 \vspace*{-1mm} | 662 \begin{table}[h] |
547 \subsection{Multi-Task Learning Effects} | |
548 \vspace*{-1mm} | |
549 | |
550 As previously seen, the SDA is better able to benefit from the | |
551 transformations applied to the data than the MLP. In this experiment we | |
552 define three tasks: recognizing digits (knowing that the input is a digit), | |
553 recognizing upper case characters (knowing that the input is one), and | |
554 recognizing lower case characters (knowing that the input is one). We | |
555 consider the digit classification task as the target task and we want to | |
556 evaluate whether training with the other tasks can help or hurt, and | |
557 whether the effect is different for MLPs versus SDAs. The goal is to find | |
558 out if deep learning can benefit more (or less) from multiple related tasks | |
559 (i.e. the multi-task setting) compared to a corresponding purely supervised | |
560 shallow learner. | |
561 | |
562 We use a single hidden layer MLP with 1000 hidden units, and an SDA | |
563 with 3 hidden layers (1000 hidden units per layer), pre-trained and | |
564 fine-tuned on NIST. | |
565 | |
566 Our results show that the MLP benefits marginally from the multi-task setting | |
567 in the case of digits (5\% relative improvement) but is actually hurt in the case | |
568 of characters (respectively 3\% and 4\% worse for lower- and upper-case characters). | |
569 On the other hand, the SDA benefited from the multi-task setting, with relative | |
570 error rate improvements of 27\%, 15\% and 13\% respectively for digits, | |
571 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | |
572 | |
573 \begin{table} | |
574 \caption{Test error rates and relative change in error rates due to the use of | 663 \caption{Test error rates and relative change in error rates due to the use of |
575 a multi-task setting, i.e., training on each task in isolation vs training | 664 a multi-task setting, i.e., training on each task in isolation vs training |
576 for all three tasks together, for MLPs vs SDAs. The SDA benefits much | 665 for all three tasks together, for MLPs vs SDAs. The SDA benefits much |
577 more from the multi-task setting. All experiments are performed on only the | 666 more from the multi-task setting. All experiments are performed on only the |
578 unperturbed NIST data, using validation error for model selection. | 667 unperturbed NIST data, using validation error for model selection. |
591 \end{tabular} | 680 \end{tabular} |
592 \end{center} | 681 \end{center} |
593 \end{table} | 682 \end{table} |
594 | 683 |
595 | 684 |
596 \begin{figure}[h] | |
597 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | |
598 \caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).} | |
599 \label{fig:improvements-charts} | |
600 \end{figure} | |
601 | |
602 \vspace*{-1mm} | |
603 \section{Conclusions} | |
604 \vspace*{-1mm} | |
605 | |
606 The conclusions are positive for all the questions asked in the introduction. | |
607 %\begin{itemize} | |
608 $\bullet$ %\item | |
609 Do the good results previously obtained with deep architectures on the | |
610 MNIST digits generalize to the setting of a much larger and richer (but similar) | |
611 dataset, the NIST special database 19, with 62 classes and around 800k examples? | |
612 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level | |
613 performance. | |
614 | |
615 $\bullet$ %\item | |
616 To what extent does the perturbation of input images (e.g. adding | |
617 noise, affine transformations, background images) make the resulting | |
618 classifier better not only on similarly perturbed images but also on | |
619 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | |
620 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | |
621 MLPs were helped by perturbed training examples when tested on perturbed input images, | |
622 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs | |
623 were very significantly boosted by these out-of-distribution examples. | |
624 | |
625 $\bullet$ %\item | |
626 Similarly, does the feature learning step in deep learning algorithms benefit more from | |
627 training with similar but different classes (i.e. a multi-task learning scenario) than | |
628 a corresponding shallow and purely supervised architecture? | |
629 Whereas the improvement due to the multi-task setting was marginal or | |
630 negative for the MLP, it was very significant for the SDA. | |
631 %\end{itemize} | |
632 | |
633 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | |
634 can be executed on-line at {\tt http://deep.host22.com}. | |
635 | |
636 | |
637 {\small | |
638 \bibliography{strings,ml,aigaion,specials} | |
639 %\bibliographystyle{plainnat} | |
640 \bibliographystyle{unsrtnat} | |
641 %\bibliographystyle{apalike} | |
642 } | |
643 | |
644 \end{document} | 685 \end{document} |