# HG changeset patch
# User Yoshua Bengio
# Date 1286305632 14400
# Node ID 537f8b786655c0c45968c99a2d907d3f91f8f48a
# Parent 18a7e7fdea4df8a814113ae9512b3f1245ebb04b
submitted JMLR paper

diff -r 18a7e7fdea4d -r 537f8b786655 writeup/jmlr_submission.tex
--- a/writeup/jmlr_submission.tex Fri Oct 01 15:54:34 2010 -0400
+++ b/writeup/jmlr_submission.tex Tue Oct 05 15:07:12 2010 -0400
@@ -1,13 +1,14 @@
+%\documentclass[twoside,11pt]{article} % For LaTeX2e
 \documentclass{article} % For LaTeX2e
-
+\usepackage{jmlr2e}
 \usepackage{times}
 \usepackage{wrapfig}
 %\usepackage{amsthm} % not to be used with springer tools
 \usepackage{amsmath}
 \usepackage{bbm}
-\usepackage[psamsfonts]{amssymb}
+\usepackage[utf8]{inputenc}
+%\usepackage[psamsfonts]{amssymb}
 %\usepackage{algorithm,algorithmic} % not used after all
-\usepackage[utf8]{inputenc}
 \usepackage{graphicx,subfigure}
 \usepackage{natbib} % was [numbers]{natbib}
@@ -17,6 +18,8 @@
 %\setlength\parindent{0mm}
+\begin{document}
+
 \title{Deep Self-Taught Learning for Handwritten Character Recognition}
 \author{
 Yoshua Bengio \and
@@ -37,27 +40,33 @@
 Francois Savard \and
 Guillaume Sicard
 }
-\date{September 30th}
-
-
-\begin{document}
+\date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
+\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al}
+\editor{}
 %\makeanontitle
 \maketitle
+{\bf Running title: Deep Self-Taught Learning}
+
 %\vspace*{-2mm}
 \begin{abstract}
- Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.
+ Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
+For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in a large-scale handwritten character recognition setting. In fact, we show that they {\em beat previously published results and reach human-level performance}.
 \end{abstract}
 %\vspace*{-3mm}
-
+
+\begin{keywords}
+Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning
+\end{keywords}
 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition}
+
+
 \section{Introduction}
 %\vspace*{-1mm}
 {\bf Deep Learning} has emerged as a promising new area of research in
-statistical machine learning (see \citet{Bengio-2009} for a review).
+statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
 Learning algorithms for deep architectures are centered on the learning
 of useful representations of data, which are better suited to the task at hand,
 and are organized in a hierarchy with multiple levels.
@@ -91,7 +100,7 @@
 The hypothesis discussed in the conclusion is that in the context of
 multi-task learning and the availability of out-of-distribution training examples,
 a deep hierarchy of features
-may be better able to provide sharing of statistical strength
+may be better able to provide {\em sharing of statistical strength}
 between different regions in input space or different tasks, compared to
 a shallow learner.
@@ -118,12 +127,19 @@
 %stochastic gradient descent.
 %
-In this paper we ask the following questions:
+The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
+{\bf benefit more from self-taught learning than shallow learners} (with a single
+level), both in the context of the multi-task setting and from {\em
+ out-of-distribution examples} in general. Because we are able to improve on state-of-the-art
+performance and reach human-level performance
+on a large-scale task, we consider that this paper is also a contribution
+to advancing the application of machine learning to handwritten character recognition.
+More precisely, we ask and answer the following questions:
 %\begin{enumerate}
 $\bullet$ %\item
 Do the good results previously obtained with deep architectures on the
-MNIST digit images generalize to the setting of a much larger and richer (but similar)
+MNIST digit images generalize to the setting of a similar but much larger and richer
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
 $\bullet$ %\item
@@ -147,8 +163,8 @@
 %\end{enumerate}
 Our experimental results provide positive evidence towards all of these questions,
-as well as classifiers that reach human-level performance on 62-class isolated character
-recognition and beat previously published results on the NIST dataset (special database 19).
+as well as {\em classifiers that reach human-level performance on 62-class isolated character
+recognition and beat previously published results on the NIST dataset (special database 19)}.
 To achieve these results, we introduce in the next section a sophisticated
 system for stochastically transforming character images and then explain the
 methodology, which is based on training with or without these transformed images and testing on
@@ -157,7 +173,7 @@
 for a deep learner vs a supervised shallow one.
 Code for generating these transformations as well as for the deep learning
 algorithms are made available at {\tt http://hg.assembla.com/ift6266}.
-We estimate the relative advantage for deep learners of training with
+We also estimate the relative advantage for deep learners of training with
 other classes than those of interest, by comparing learners trained with
 62 classes with learners trained with only a subset (on which they
 are then tested).
@@ -996,7 +1012,8 @@
 Intermediate features that can be used in different contexts
 can be estimated in a way that allows to share statistical strength.
 Features extracted through many levels are more likely to
-be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
+be more abstract and more invariant to some of the factors of variation
+in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest),
 increasing the likelihood that they would be useful for a larger array of tasks
 and input conditions. Therefore, we hypothesize that both depth and unsupervised
@@ -1005,17 +1022,17 @@
 And why would deep learners benefit from the self-taught learning scenarios
 even when the number of labeled examples is very large?
 We hypothesize that this is related to the hypotheses studied
-in~\citet{Erhan+al-2010}. Whereas in~\citet{Erhan+al-2010}
+in~\citet{Erhan+al-2010}. In~\citet{Erhan+al-2010}
 it was found that online learning on a huge dataset did not make the
-advantage of the deep learning bias vanish, a similar phenomenon
+advantage of the deep learning bias vanish, and a similar phenomenon
 may be happening here. We hypothesize that unsupervised pre-training
 of a deep hierarchy with self-taught learning initializes the model in the basin
 of attraction of supervised gradient descent that corresponds to better generalization.
 Furthermore, such good basins of attraction are not discovered by pure supervised learning
-(with or without self-taught settings), and more labeled examples
-does not allow the model to go from the poorer basins of attraction discovered
-by the purely supervised shallow models to the kind of better basins associated
+(with or without self-taught settings) from random initialization, and more labeled examples
+do not allow the shallow or purely supervised models to discover
+the kind of better basins associated
 with deep learning and self-taught learning.
 
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
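The abstract and introduction above describe a generator that distorts character images with affine transformations, slant, local elastic deformations, thickness changes, background images, grey-level and contrast changes, occlusion and several kinds of noise. The Python sketch below only illustrates how a few such stochastic perturbations can be composed on the fly; it is not the generator used for the paper (that code is in the ift6266 repository linked above), and every function name, parameter range and probability in it is an assumption chosen for the example.

# Illustrative sketch only -- not the authors' generator (that code is in the
# repository referenced above). Assumes 32x32 grey-level images in [0, 1];
# all parameter ranges and probabilities are arbitrary choices for the example.
import numpy as np
from scipy import ndimage


def random_affine(img, rng, max_rot=0.3, max_shear=0.2, max_scale=0.15):
    """Random rotation / shear / scaling about the image centre."""
    h, w = img.shape
    theta = rng.uniform(-max_rot, max_rot)
    shear = rng.uniform(-max_shear, max_shear)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    m = np.array([[np.cos(theta), -np.sin(theta) + shear],
                  [np.sin(theta),  np.cos(theta)]]) / scale
    centre = np.array([h, w]) / 2.0
    offset = centre - m @ centre           # keep the image centre fixed
    return ndimage.affine_transform(img, m, offset=offset, order=1, cval=0.0)


def elastic_deform(img, rng, alpha=8.0, sigma=3.0):
    """Local elastic deformation: a Gaussian-smoothed random displacement field."""
    h, w = img.shape
    dy = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dx = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.indices((h, w), dtype=float)
    return ndimage.map_coordinates(img, [y + dy, x + dx], order=1, mode='reflect')


def pixel_noise(img, rng, contrast=0.3, pepper=0.05):
    """Random contrast change plus salt-and-pepper pixel noise."""
    out = img * rng.uniform(1.0 - contrast, 1.0 + contrast)
    mask = rng.random(img.shape) < pepper
    out[mask] = rng.random(int(mask.sum()))  # overwrite a few random pixels
    return np.clip(out, 0.0, 1.0)


def perturb(img, rng):
    """Compose the distortions, each applied with some probability."""
    for op, p in ((random_affine, 0.9), (elastic_deform, 0.9), (pixel_noise, 0.7)):
        if rng.random() < p:
            img = op(img, rng)
    return img


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.zeros((32, 32))
    clean[4:28, 14:18] = 1.0               # a crude vertical stroke, like a "1"
    distorted = perturb(clean, rng)
    print(distorted.shape, float(distorted.min()), float(distorted.max()))

Applying an independently re-sampled perturbation of this kind to each training example is what turns the clean character classes into the highly distorted, out-of-distribution training data discussed above.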