comparison writeup/jmlr_submission.tex @ 594:537f8b786655

submitted JMLR paper
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 05 Oct 2010 15:07:12 -0400
parents 18a7e7fdea4d
children
%\documentclass[twoside,11pt]{article} % For LaTeX2e
\documentclass{article} % For LaTeX2e
\usepackage{jmlr2e}
\usepackage{times}
\usepackage{wrapfig}
%\usepackage{amsthm} % not to be used with springer tools
\usepackage{amsmath}
\usepackage{bbm}
\usepackage[utf8]{inputenc}
%\usepackage[psamsfonts]{amssymb}
%\usepackage{algorithm,algorithmic} % not used after all
\usepackage{graphicx,subfigure}
\usepackage{natbib} % was [numbers]{natbib}

\addtolength{\textwidth}{10mm}
\addtolength{\evensidemargin}{-5mm}
\addtolength{\oddsidemargin}{-5mm}

%\setlength\parindent{0mm}

\begin{document}

\title{Deep Self-Taught Learning for Handwritten Character Recognition}
\author{
Yoshua Bengio \and
Frédéric Bastien \and
[...]
Razvan Pascanu \and
Salah Rifai \and
Francois Savard \and
Guillaume Sicard
}
\date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al}
\editor{}
%\makeanontitle
\maketitle

{\bf Running title: Deep Self-Taught Learning}

%\vspace*{-2mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in a large-scale handwritten character recognition setting. In fact, we show that they {\em beat previously published results and reach human-level performance}.
\end{abstract}
%\vspace*{-3mm}

\begin{keywords}
Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning
\end{keywords}
%\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition}

\section{Introduction}
%\vspace*{-1mm}

{\bf Deep Learning} has emerged as a promising new area of research in
statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
Learning algorithms for deep architectures are centered on the learning
of useful representations of data, which are better suited to the task at hand,
and are organized in a hierarchy with multiple levels.
This is in part inspired by observations of the mammalian visual cortex,
which consists of a chain of processing elements, each of which is associated with a
[...]
of learning algorithm). In particular, the {\em relative
advantage of deep learning} for these settings has not been evaluated.
The hypothesis discussed in the conclusion is that in the context of
multi-task learning and the availability of out-of-distribution training examples,
a deep hierarchy of features
may be better able to provide {\em sharing of statistical strength}
between different regions in input space or different tasks, compared to
a shallow learner.

Whereas a deep architecture can in principle be more powerful than a
shallow one in terms of representation, depth appears to render the
[...]
%unsupervised initialization, the stack of DAs can be
%converted into a deep supervised feedforward neural network and fine-tuned by
%stochastic gradient descent.

%
The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
{\bf benefit more from self-taught learning than shallow learners} (with a single
level), both in the context of the multi-task setting and from {\em
out-of-distribution examples} in general. Because we are able to improve on the state of the art
and reach human-level performance
on a large-scale task, we consider that this paper is also a contribution
to advancing the application of machine learning to handwritten character recognition.
More precisely, we ask and answer the following questions:

%\begin{enumerate}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digit images generalize to the setting of a similar but much larger and richer
dataset, the NIST special database 19, with 62 classes and around 800k examples?

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
[...]
We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case)
to answer this question.
%\end{enumerate}

Our experimental results provide positive answers to all of these questions,
as well as {\em classifiers that reach human-level performance on 62-class isolated character
recognition and beat previously published results on the NIST dataset (special database 19)}.
To achieve these results, we introduce in the next section a sophisticated system
for stochastically transforming character images and then explain the methodology,
which is based on training with or without these transformed images and testing on
clean ones. We measure the relative advantage of out-of-distribution examples
(perturbed or out-of-class)
for a deep learner vs.\ a supervised shallow one.
The code for generating these transformations as well as for the deep learning
algorithms is made available at {\tt http://hg.assembla.com/ift6266}.
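
As a rough illustration of what such a stochastic transformation pipeline can look like,
the following minimal sketch (Python with NumPy and SciPy; the function name, parameter
ranges and the particular composition of distortions are hypothetical, not the actual
generator from the above repository) applies a randomly parameterized affine distortion
followed by slight blurring and pixel noise to a grey-level character image:
\begin{verbatim}
import numpy as np
from scipy.ndimage import affine_transform, gaussian_filter

rng = np.random.RandomState(0)

def perturb(img, complexity=0.5):
    """Randomly distort a grey-level character image with values in [0, 1].
    `complexity` scales the amount of distortion (illustrative only)."""
    # random rotation, scaling and shear around the image centre
    angle = rng.uniform(-0.3, 0.3) * complexity
    scale = 1.0 + rng.uniform(-0.2, 0.2) * complexity
    shear = rng.uniform(-0.2, 0.2) * complexity
    c, s = np.cos(angle), np.sin(angle)
    mat = np.array([[c, -s + shear], [s, c]]) / scale
    centre = np.array(img.shape) / 2.0
    offset = centre - mat.dot(centre)   # keep the image centre fixed
    out = affine_transform(img, mat, offset=offset, mode='constant')
    # slight blur and additive pixel noise, then clip back to [0, 1]
    out = gaussian_filter(out, sigma=0.5 * complexity)
    out = out + rng.normal(0.0, 0.1 * complexity, out.shape)
    return np.clip(out, 0.0, 1.0)
\end{verbatim}
The actual generator composes many more modules than this sketch, covering slant,
local elastic deformations, changes in thickness, background images, grey level
changes, contrast, occlusion, and various types of noise.
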
We also estimate the relative advantage that deep learners gain from training on
classes other than those of interest, by comparing learners trained on all
62 classes with learners trained on only the subset of classes on which they
are then tested.
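
To make this comparison concrete, the helper below (a minimal Python/NumPy sketch with
hypothetical names; whether predictions should also be restricted to the target subset
is an evaluation design choice left here as an option) scores a 62-way classifier only
on test examples whose true label belongs to the subset of interest, e.g. the digits:
\begin{verbatim}
import numpy as np

DIGITS = np.arange(10)   # e.g. the 10 digit classes among the 62 NIST classes

def subset_error_rate(probs, labels, subset=DIGITS, restrict=False):
    """probs: (n, 62) class scores; labels: (n,) integer class labels."""
    mask = np.isin(labels, subset)     # keep only test examples from the subset
    p = probs[mask]
    if restrict:                       # optionally forbid out-of-subset predictions
        allowed = np.isin(np.arange(p.shape[1]), subset)
        p = np.where(allowed, p, -np.inf)
    pred = p.argmax(axis=1)
    return float(np.mean(pred != labels[mask]))
\end{verbatim}
The error of a learner trained on all 62 classes can then be compared, on the same
subset test examples, with that of a learner trained only on that subset.
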
The conclusion discusses
the more general question of why deep learners may benefit so much from
[...]
due to sharing of intermediate features across tasks already points
towards that explanation~\citep{baxter95a}.
Intermediate features that can be used in different
contexts can be estimated in a way that allows sharing of statistical
strength. Features extracted through many levels are more likely to
be more abstract and more invariant to some of the factors of variation
in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest),
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
Therefore, we hypothesize that both depth and unsupervised
pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease these factors apart.
And why would deep learners benefit from the self-taught learning
scenarios even when the number of labeled examples is very large?
We hypothesize that this is related to the hypotheses studied
in~\citet{Erhan+al-2010}, where it was found that online learning
on a huge dataset did not make the advantage of the deep learning bias vanish;
a similar phenomenon may be happening here. We hypothesize that unsupervised pre-training
of a deep hierarchy with self-taught learning initializes the
model in the basin of attraction of supervised gradient descent
that corresponds to better generalization. Furthermore, such good
basins of attraction are not discovered by pure supervised learning
(with or without self-taught settings) from random initialization, and more labeled examples
do not allow the shallow or purely supervised models to discover
the kind of better basins associated
with deep learning and self-taught learning.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.