Mercurial repository ift6266: comparison of writeup/jmlr_submission.tex @ 594:537f8b786655
commit message: submitted JMLR paper
author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Tue, 05 Oct 2010 15:07:12 -0400
parents:  18a7e7fdea4d
children:
593:18a7e7fdea4d (old) | 594:537f8b786655 (new)
1 %\documentclass[twoside,11pt]{article} % For LaTeX2e | |
1 \documentclass{article} % For LaTeX2e | 2 \documentclass{article} % For LaTeX2e |
2 | 3 \usepackage{jmlr2e} |
3 \usepackage{times} | 4 \usepackage{times} |
4 \usepackage{wrapfig} | 5 \usepackage{wrapfig} |
5 %\usepackage{amsthm} % not to be used with springer tools | 6 %\usepackage{amsthm} % not to be used with springer tools |
6 \usepackage{amsmath} | 7 \usepackage{amsmath} |
7 \usepackage{bbm} | 8 \usepackage{bbm} |
8 \usepackage[psamsfonts]{amssymb} | 9 \usepackage[utf8]{inputenc} |
10 %\usepackage[psamsfonts]{amssymb} | |
9 %\usepackage{algorithm,algorithmic} % not used after all | 11 %\usepackage{algorithm,algorithmic} % not used after all |
10 \usepackage[utf8]{inputenc} | |
11 \usepackage{graphicx,subfigure} | 12 \usepackage{graphicx,subfigure} |
12 \usepackage{natbib} % was [numbers]{natbib} | 13 \usepackage{natbib} % was [numbers]{natbib} |
13 | 14 |
14 \addtolength{\textwidth}{10mm} | 15 \addtolength{\textwidth}{10mm} |
15 \addtolength{\evensidemargin}{-5mm} | 16 \addtolength{\evensidemargin}{-5mm} |
16 \addtolength{\oddsidemargin}{-5mm} | 17 \addtolength{\oddsidemargin}{-5mm} |
17 | 18 |
18 %\setlength\parindent{0mm} | 19 %\setlength\parindent{0mm} |
20 | |
21 \begin{document} | |
19 | 22 |
20 \title{Deep Self-Taught Learning for Handwritten Character Recognition} | 23 \title{Deep Self-Taught Learning for Handwritten Character Recognition} |
21 \author{ | 24 \author{ |
22 Yoshua Bengio \and | 25 Yoshua Bengio \and |
23 Frédéric Bastien \and | 26 Frédéric Bastien \and |
35 Razvan Pascanu \and | 38 Razvan Pascanu \and |
36 Salah Rifai \and | 39 Salah Rifai \and |
37 Francois Savard \and | 40 Francois Savard \and |
38 Guillaume Sicard | 41 Guillaume Sicard |
39 } | 42 } |
40 \date{September 30th} | 43 \date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada} |
41 | 44 \jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al} |
42 | 45 \editor{} |
43 \begin{document} | |
44 | 46 |
45 %\makeanontitle | 47 %\makeanontitle |
46 \maketitle | 48 \maketitle |
47 | 49 |
50 {\bf Running title: Deep Self-Taught Learning} | |
51 | |
48 %\vspace*{-2mm} | 52 %\vspace*{-2mm} |
49 \begin{abstract} | 53 \begin{abstract} |
50 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. | 54 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in a large-scale handwritten character recognition setting. In fact, we show that they {\em beat previously published results and reach human-level performance}. |
51 \end{abstract} | 55 \end{abstract} |
52 %\vspace*{-3mm} | 56 %\vspace*{-3mm} |
53 | 57 |
58 \begin{keywords} | |
59 Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning | |
60 \end{keywords} | |
54 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} | 61 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} |
62 | |
63 | |
55 | 64 |
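The abstract above lists the stochastic variations applied to character images (affine transformations, slant, local elastic deformations, thickness changes, background images, grey level and contrast changes, occlusion, various noises). The authors' actual generator is the one in the ift6266 repository linked later in the text; the Python fragment below is only an illustrative sketch of three of those perturbations (a random affine map, a Simard-style elastic deformation, additive pixel noise) using NumPy and SciPy, with all parameter values chosen arbitrarily.

# Illustrative sketch only (NumPy/SciPy), not the authors' generator from the
# ift6266 repository: three of the stochastic perturbations named in the abstract,
# applied to a 32x32 grey-level character image with arbitrary parameter choices.
import numpy as np
from scipy import ndimage

rng = np.random.RandomState(0)

def random_affine(img, max_rot=0.3, max_shear=0.3, max_scale=0.15):
    """Random rotation/shear/scale about the image centre."""
    theta = rng.uniform(-max_rot, max_rot)
    shear = rng.uniform(-max_shear, max_shear)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    A = scale * np.array([[np.cos(theta), -np.sin(theta) + shear],
                          [np.sin(theta),  np.cos(theta)]])
    centre = (np.array(img.shape) - 1) / 2.0
    offset = centre - A.dot(centre)            # keep the centre of the image fixed
    return ndimage.affine_transform(img, A, offset=offset, mode='constant')

def random_elastic(img, alpha=8.0, sigma=3.0):
    """Local elastic deformation: a smoothed random displacement field."""
    dy = ndimage.gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    dx = ndimage.gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]), indexing='ij')
    return ndimage.map_coordinates(img, [y + dy, x + dx], order=1, mode='reflect')

def perturb(img):
    """Compose a few stochastic variations and add pixel noise."""
    out = random_elastic(random_affine(img))
    out = out + rng.normal(0.0, 0.05, out.shape)
    return np.clip(out, 0.0, 1.0)

distorted = perturb(rng.rand(32, 32))          # dummy clean image stands in for NIST data

In the paper's setup, such distorted images provide the out-of-distribution training data, while evaluation is always carried out on clean images.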
56 \section{Introduction} | 65 \section{Introduction} |
57 %\vspace*{-1mm} | 66 %\vspace*{-1mm} |
58 | 67 |
59 {\bf Deep Learning} has emerged as a promising new area of research in | 68 {\bf Deep Learning} has emerged as a promising new area of research in |
60 statistical machine learning (see \citet{Bengio-2009} for a review). | 69 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review. |
61 Learning algorithms for deep architectures are centered on the learning | 70 Learning algorithms for deep architectures are centered on the learning |
62 of useful representations of data, which are better suited to the task at hand, | 71 of useful representations of data, which are better suited to the task at hand, |
63 and are organized in a hierarchy with multiple levels. | 72 and are organized in a hierarchy with multiple levels. |
64 This is in part inspired by observations of the mammalian visual cortex, | 73 This is in part inspired by observations of the mammalian visual cortex, |
65 which consists of a chain of processing elements, each of which is associated with a | 74 which consists of a chain of processing elements, each of which is associated with a |
89 of learning algorithm). In particular the {\em relative | 98 of learning algorithm). In particular the {\em relative |
90 advantage of deep learning} for these settings has not been evaluated. | 99 advantage of deep learning} for these settings has not been evaluated. |
91 The hypothesis discussed in the conclusion is that in the context of | 100 The hypothesis discussed in the conclusion is that in the context of |
92 multi-task learning and the availability of out-of-distribution training examples, | 101 multi-task learning and the availability of out-of-distribution training examples, |
93 a deep hierarchy of features | 102 a deep hierarchy of features |
94 may be better able to provide sharing of statistical strength | 103 may be better able to provide {\em sharing of statistical strength} |
95 between different regions in input space or different tasks, compared to | 104 between different regions in input space or different tasks, compared to |
96 a shallow learner. | 105 a shallow learner. |
97 | 106 |
98 Whereas a deep architecture can in principle be more powerful than a | 107 Whereas a deep architecture can in principle be more powerful than a |
99 shallow one in terms of representation, depth appears to render the | 108 shallow one in terms of representation, depth appears to render the |
116 %unsupervised initialization, the stack of DAs can be | 125 %unsupervised initialization, the stack of DAs can be |
117 %converted into a deep supervised feedforward neural network and fine-tuned by | 126 %converted into a deep supervised feedforward neural network and fine-tuned by |
118 %stochastic gradient descent. | 127 %stochastic gradient descent. |
119 | 128 |
120 % | 129 % |
121 In this paper we ask the following questions: | 130 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can |
131 {\bf benefit more from self-taught learning than shallow learners} (with a single | |
132 level), both in the context of the multi-task setting and from {\em | |
133 out-of-distribution examples} in general. Because we are able to improve on state-of-the-art | |
134 performance and reach human-level performance | |
135 on a large-scale task, we consider that this paper is also a contribution | |
136 to advancing the application of machine learning to handwritten character recognition. |
137 More precisely, we ask and answer the following questions: | |
122 | 138 |
123 %\begin{enumerate} | 139 %\begin{enumerate} |
124 $\bullet$ %\item | 140 $\bullet$ %\item |
125 Do the good results previously obtained with deep architectures on the | 141 Do the good results previously obtained with deep architectures on the |
126 MNIST digit images generalize to the setting of a much larger and richer (but similar) | 142 MNIST digit images generalize to the setting of a similar but much larger and richer |
127 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 143 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
128 | 144 |
129 $\bullet$ %\item | 145 $\bullet$ %\item |
130 To what extent does the perturbation of input images (e.g. adding | 146 To what extent does the perturbation of input images (e.g. adding |
131 noise, affine transformations, background images) make the resulting | 147 noise, affine transformations, background images) make the resulting |
145 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) | 161 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) |
146 to answer this question. | 162 to answer this question. |
147 %\end{enumerate} | 163 %\end{enumerate} |
148 | 164 |
149 Our experimental results provide positive evidence towards all of these questions, | 165 Our experimental results provide positive evidence towards all of these questions, |
150 as well as classifiers that reach human-level performance on 62-class isolated character | 166 as well as {\em classifiers that reach human-level performance on 62-class isolated character |
151 recognition and beat previously published results on the NIST dataset (special database 19). | 167 recognition and beat previously published results on the NIST dataset (special database 19)}. |
152 To achieve these results, we introduce in the next section a sophisticated system | 168 To achieve these results, we introduce in the next section a sophisticated system |
153 for stochastically transforming character images and then explain the methodology, | 169 for stochastically transforming character images and then explain the methodology, |
154 which is based on training with or without these transformed images and testing on | 170 which is based on training with or without these transformed images and testing on |
155 clean ones. We measure the relative advantage of out-of-distribution examples | 171 clean ones. We measure the relative advantage of out-of-distribution examples |
156 (perturbed or out-of-class) | 172 (perturbed or out-of-class) |
157 for a deep learner vs a supervised shallow one. | 173 for a deep learner vs a supervised shallow one. |
158 Code for generating these transformations as well as for the deep learning | 174 Code for generating these transformations as well as for the deep learning |
159 algorithms is made available at {\tt http://hg.assembla.com/ift6266}. | 175 algorithms is made available at {\tt http://hg.assembla.com/ift6266}. |
160 We estimate the relative advantage for deep learners of training with | 176 We also estimate the relative advantage for deep learners of training with |
161 classes other than those of interest, by comparing learners trained with | 177 classes other than those of interest, by comparing learners trained with |
162 62 classes with learners trained with only a subset (on which they | 178 62 classes with learners trained with only a subset (on which they |
163 are then tested). | 179 are then tested). |
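As a concrete reading of the protocol just described, the sketch below uses hypothetical scikit-learn stand-ins and random placeholder arrays (not the paper's stacked denoising auto-encoder, its MLP baseline, or NIST data). It trains a shallower and a deeper classifier once on the target digit classes alone and once with extra out-of-class examples mixed in, always testing on clean digit images, and reports the relative advantage brought by the extra data.

# Protocol sketch only: stand-in scikit-learn models and random placeholder data.
# The paper's deep learner is a stacked denoising auto-encoder and its shallow
# baseline a one-hidden-layer MLP trained on NIST SD19; none of that is reproduced here.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
d, n_classes = 28 * 28, 62                       # 62 = digits + upper case + lower case

# Placeholder digit-only training set, out-of-class training set, clean digit test set.
X_digits, y_digits = rng.rand(1000, d), rng.randint(10, size=1000)
X_extra,  y_extra  = rng.rand(1000, d), rng.randint(n_classes, size=1000)
X_test,   y_test   = rng.rand(300, d),  rng.randint(10, size=300)

def digit_test_error(hidden_layers, X_tr, y_tr):
    """Train on the given set, evaluate only on the clean digit test set."""
    model = MLPClassifier(hidden_layer_sizes=hidden_layers, max_iter=50, random_state=0)
    model.fit(X_tr, y_tr)
    return np.mean(model.predict(X_test) != y_test)

for name, hidden in [("shallow", (100,)), ("deeper", (100, 100, 100))]:
    err_target_only = digit_test_error(hidden, X_digits, y_digits)
    err_with_extra  = digit_test_error(hidden, np.vstack([X_digits, X_extra]),
                                       np.concatenate([y_digits, y_extra]))
    # Relative advantage of the out-of-distribution / out-of-class training data.
    print(name, (err_target_only - err_with_extra) / err_target_only)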
164 The conclusion discusses | 180 The conclusion discusses |
165 the more general question of why deep learners may benefit so much from | 181 the more general question of why deep learners may benefit so much from |
994 due to sharing of intermediate features across tasks already points | 1010 due to sharing of intermediate features across tasks already points |
995 towards that explanation~\citep{baxter95a}. | 1011 towards that explanation~\citep{baxter95a}. |
996 Intermediate features that can be used in different | 1012 Intermediate features that can be used in different |
997 contexts can be estimated in a way that allows sharing of statistical | 1013 contexts can be estimated in a way that allows sharing of statistical |
998 strength. Features extracted through many levels are more likely to | 1014 strength. Features extracted through many levels are more likely to |
999 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest), | 1015 be more abstract and more invariant to some of the factors of variation |
1016 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest), | |
1000 increasing the likelihood that they would be useful for a larger array | 1017 increasing the likelihood that they would be useful for a larger array |
1001 of tasks and input conditions. | 1018 of tasks and input conditions. |
1002 Therefore, we hypothesize that both depth and unsupervised | 1019 Therefore, we hypothesize that both depth and unsupervised |
1003 pre-training play a part in explaining the advantages observed here, and future | 1020 pre-training play a part in explaining the advantages observed here, and future |
1004 experiments could attempt to tease apart these factors. | 1021 experiments could attempt to tease apart these factors. |
1005 And why would deep learners benefit from the self-taught learning | 1022 And why would deep learners benefit from the self-taught learning |
1006 scenarios even when the number of labeled examples is very large? | 1023 scenarios even when the number of labeled examples is very large? |
1007 We hypothesize that this is related to the hypotheses studied | 1024 We hypothesize that this is related to the hypotheses studied |
1008 in~\citet{Erhan+al-2010}. Whereas in~\citet{Erhan+al-2010} | 1025 in~\citet{Erhan+al-2010}. In~\citet{Erhan+al-2010} |
1009 it was found that online learning on a huge dataset did not make the | 1026 it was found that online learning on a huge dataset did not make the |
1010 advantage of the deep learning bias vanish, a similar phenomenon | 1027 advantage of the deep learning bias vanish, and a similar phenomenon |
1011 may be happening here. We hypothesize that unsupervised pre-training | 1028 may be happening here. We hypothesize that unsupervised pre-training |
1012 of a deep hierarchy with self-taught learning initializes the | 1029 of a deep hierarchy with self-taught learning initializes the |
1013 model in the basin of attraction of supervised gradient descent | 1030 model in the basin of attraction of supervised gradient descent |
1014 that corresponds to better generalization. Furthermore, such good | 1031 that corresponds to better generalization. Furthermore, such good |
1015 basins of attraction are not discovered by pure supervised learning | 1032 basins of attraction are not discovered by pure supervised learning |
1016 (with or without self-taught settings), and more labeled examples | 1033 (with or without self-taught settings) from random initialization, and more labeled examples |
1017 do not allow the model to go from the poorer basins of attraction discovered | 1034 do not allow the shallow or purely supervised models to discover |
1018 by the purely supervised shallow models to the kind of better basins associated | 1035 the kind of better basins associated |
1019 with deep learning and self-taught learning. | 1036 with deep learning and self-taught learning. |
1020 | 1037 |
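The mechanism hypothesized in the paragraph above, unsupervised layer-wise pre-training placing the deep network in a good basin of attraction before supervised fine-tuning, can be made concrete with a small sketch. The fragment below is written in PyTorch with placeholder data and arbitrary layer sizes; it is not the authors' Theano code from the ift6266 repository, only a minimal greedy denoising-auto-encoder pre-training followed by supervised fine-tuning of the stacked layers.

# Minimal sketch (PyTorch, placeholder data), not the paper's implementation:
# greedy layer-wise denoising-auto-encoder pre-training, then supervised fine-tuning.
import torch
import torch.nn as nn

sizes = [28 * 28, 500, 500, 500]                  # input plus three hidden layers (arbitrary)
encoders = [nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def pretrain_layer(enc, data, corruption=0.25, steps=20, lr=0.1):
    """One denoising auto-encoder: corrupt the input, reconstruct the clean version."""
    dec = nn.Linear(enc.out_features, enc.in_features)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(steps):
        mask = (torch.rand_like(data) > corruption).float()      # masking noise
        recon = torch.sigmoid(dec(torch.sigmoid(enc(data * mask))))
        loss = nn.functional.binary_cross_entropy(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(enc(data)).detach()      # representation fed to the next layer

X = torch.rand(256, 28 * 28)                      # unlabeled placeholder images in [0, 1]
h = X
for enc in encoders:                              # greedy unsupervised pre-training
    h = pretrain_layer(enc, h)

# Stack the pre-trained encoders, add a 62-way output layer, fine-tune with labels.
model = nn.Sequential(*[m for enc in encoders for m in (enc, nn.Sigmoid())],
                      nn.Linear(sizes[-1], 62))
y = torch.randint(0, 62, (256,))                  # placeholder labels
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(20):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()

The purely supervised baseline discussed above simply skips the pre-training loop and trains the same stack from random initialization.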
1021 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 1038 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
1022 can be executed on-line at {\tt http://deep.host22.com}. | 1039 can be executed on-line at {\tt http://deep.host22.com}. |
1023 | 1040 |