Mercurial repository ift6266: comparison of writeup/jmlr_submission.tex @ 594:537f8b786655
commit message: submitted JMLR paper
author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Tue, 05 Oct 2010 15:07:12 -0400
parents:  18a7e7fdea4d
children:
593:18a7e7fdea4d (old) | 594:537f8b786655 (new)
1 %\documentclass[twoside,11pt]{article} % For LaTeX2e | |
1 \documentclass{article} % For LaTeX2e | 2 \documentclass{article} % For LaTeX2e |
2 | 3 \usepackage{jmlr2e} |
3 \usepackage{times} | 4 \usepackage{times} |
4 \usepackage{wrapfig} | 5 \usepackage{wrapfig} |
5 %\usepackage{amsthm} % not to be used with springer tools | 6 %\usepackage{amsthm} % not to be used with springer tools |
6 \usepackage{amsmath} | 7 \usepackage{amsmath} |
7 \usepackage{bbm} | 8 \usepackage{bbm} |
8 \usepackage[psamsfonts]{amssymb} | 9 \usepackage[utf8]{inputenc} |
10 %\usepackage[psamsfonts]{amssymb} | |
9 %\usepackage{algorithm,algorithmic} % not used after all | 11 %\usepackage{algorithm,algorithmic} % not used after all |
10 \usepackage[utf8]{inputenc} | |
11 \usepackage{graphicx,subfigure} | 12 \usepackage{graphicx,subfigure} |
12 \usepackage{natbib} % was [numbers]{natbib} | 13 \usepackage{natbib} % was [numbers]{natbib} |
13 | 14 |
14 \addtolength{\textwidth}{10mm} | 15 \addtolength{\textwidth}{10mm} |
15 \addtolength{\evensidemargin}{-5mm} | 16 \addtolength{\evensidemargin}{-5mm} |
16 \addtolength{\oddsidemargin}{-5mm} | 17 \addtolength{\oddsidemargin}{-5mm} |
17 | 18 |
18 %\setlength\parindent{0mm} | 19 %\setlength\parindent{0mm} |
20 | |
21 \begin{document} | |
19 | 22 |
20 \title{Deep Self-Taught Learning for Handwritten Character Recognition} | 23 \title{Deep Self-Taught Learning for Handwritten Character Recognition} |
21 \author{ | 24 \author{ |
22 Yoshua Bengio \and | 25 Yoshua Bengio \and |
23 Frédéric Bastien \and | 26 Frédéric Bastien \and |
35 Razvan Pascanu \and | 38 Razvan Pascanu \and |
36 Salah Rifai \and | 39 Salah Rifai \and |
37 Francois Savard \and | 40 Francois Savard \and |
38 Guillaume Sicard | 41 Guillaume Sicard |
39 } | 42 } |
40 \date{September 30th} | 43 \date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada} |
41 | 44 \jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al} |
42 | 45 \editor{} |
43 \begin{document} | |
44 | 46 |
45 %\makeanontitle | 47 %\makeanontitle |
46 \maketitle | 48 \maketitle |
47 | 49 |
50 {\bf Running title: Deep Self-Taught Learning} | |
51 | |
48 %\vspace*{-2mm} | 52 %\vspace*{-2mm} |
49 \begin{abstract} | 53 \begin{abstract} |
50 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. | 54 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples}. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that {\em deep learners benefit more from out-of-distribution examples than a corresponding shallow learner}, at least in a large-scale handwritten character recognition setting. In fact, we show that they {\em beat previously published results and reach human-level performance}. |
51 \end{abstract} | 55 \end{abstract} |
52 %\vspace*{-3mm} | 56 %\vspace*{-3mm} |
53 | 57 |
58 \begin{keywords} | |
59 Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning | |
60 \end{keywords} | |
54 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} | 61 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} |
62 | |
63 | |
55 | 64 |
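The abstract above lists the stochastic variations applied to character images (affine transformations, slant, local elastic deformations, thickness changes, background images, grey level and contrast changes, occlusion, various noises). The authors' actual generator is the one in the ift6266 repository linked later in the text; the Python fragment below is only an illustrative sketch of three of those perturbations (a random affine map, a Simard-style elastic deformation, additive pixel noise) using NumPy and SciPy, with all parameter values chosen arbitrarily.

# Illustrative sketch only (NumPy/SciPy), not the authors' generator from the
# ift6266 repository: three of the stochastic perturbations named in the abstract,
# applied to a 32x32 grey-level character image with arbitrary parameter choices.
import numpy as np
from scipy import ndimage

rng = np.random.RandomState(0)

def random_affine(img, max_rot=0.3, max_shear=0.3, max_scale=0.15):
    """Random rotation/shear/scale about the image centre."""
    theta = rng.uniform(-max_rot, max_rot)
    shear = rng.uniform(-max_shear, max_shear)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    A = scale * np.array([[np.cos(theta), -np.sin(theta) + shear],
                          [np.sin(theta),  np.cos(theta)]])
    centre = (np.array(img.shape) - 1) / 2.0
    offset = centre - A.dot(centre)            # keep the centre of the image fixed
    return ndimage.affine_transform(img, A, offset=offset, mode='constant')

def random_elastic(img, alpha=8.0, sigma=3.0):
    """Local elastic deformation: a smoothed random displacement field."""
    dy = ndimage.gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    dx = ndimage.gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]), indexing='ij')
    return ndimage.map_coordinates(img, [y + dy, x + dx], order=1, mode='reflect')

def perturb(img):
    """Compose a few stochastic variations and add pixel noise."""
    out = random_elastic(random_affine(img))
    out = out + rng.normal(0.0, 0.05, out.shape)
    return np.clip(out, 0.0, 1.0)

distorted = perturb(rng.rand(32, 32))          # dummy clean image stands in for NIST data

In the paper's setup, such distorted images provide the out-of-distribution training data, while evaluation is always carried out on clean images.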
56 \section{Introduction} | 65 \section{Introduction} |
57 %\vspace*{-1mm} | 66 %\vspace*{-1mm} |
58 | 67 |
59 {\bf Deep Learning} has emerged as a promising new area of research in | 68 {\bf Deep Learning} has emerged as a promising new area of research in |
60 statistical machine learning (see \citet{Bengio-2009} for a review). | 69 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review. |
61 Learning algorithms for deep architectures are centered on the learning | 70 Learning algorithms for deep architectures are centered on the learning |
62 of useful representations of data, which are better suited to the task at hand, | 71 of useful representations of data, which are better suited to the task at hand, |
63 and are organized in a hierarchy with multiple levels. | 72 and are organized in a hierarchy with multiple levels. |
64 This is in part inspired by observations of the mammalian visual cortex, | 73 This is in part inspired by observations of the mammalian visual cortex, |
65 which consists of a chain of processing elements, each of which is associated with a | 74 which consists of a chain of processing elements, each of which is associated with a |
89 of learning algorithm). In particular the {\em relative | 98 of learning algorithm). In particular the {\em relative |
90 advantage of deep learning} for these settings has not been evaluated. | 99 advantage of deep learning} for these settings has not been evaluated. |
91 The hypothesis discussed in the conclusion is that in the context of | 100 The hypothesis discussed in the conclusion is that in the context of |
92 multi-task learning and the availability of out-of-distribution training examples, | 101 multi-task learning and the availability of out-of-distribution training examples, |
93 a deep hierarchy of features | 102 a deep hierarchy of features |
94 may be better able to provide sharing of statistical strength | 103 may be better able to provide {\em sharing of statistical strength} |
95 between different regions in input space or different tasks, compared to | 104 between different regions in input space or different tasks, compared to |
96 a shallow learner. | 105 a shallow learner. |
97 | 106 |
98 Whereas a deep architecture can in principle be more powerful than a | 107 Whereas a deep architecture can in principle be more powerful than a |
99 shallow one in terms of representation, depth appears to render the | 108 shallow one in terms of representation, depth appears to render the |
116 %unsupervised initialization, the stack of DAs can be | 125 %unsupervised initialization, the stack of DAs can be |
117 %converted into a deep supervised feedforward neural network and fine-tuned by | 126 %converted into a deep supervised feedforward neural network and fine-tuned by |
118 %stochastic gradient descent. | 127 %stochastic gradient descent. |
119 | 128 |
120 % | 129 % |
121 In this paper we ask the following questions: | 130 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can |
131 {\bf benefit more from self-taught learning than shallow learners} (with a single | |
132 level), both in the context of the multi-task setting and from {\em | |
133 out-of-distribution examples} in general. Because we are able to improve on state-of-the-art | |
134 performance and reach human-level performance | |
135 on a large-scale task, we consider that this paper is also a contribution | |
136 to advancing the application of machine learning to handwritten character recognition. |
137 More precisely, we ask and answer the following questions: | |
122 | 138 |
123 %\begin{enumerate} | 139 %\begin{enumerate} |
124 $\bullet$ %\item | 140 $\bullet$ %\item |
125 Do the good results previously obtained with deep architectures on the | 141 Do the good results previously obtained with deep architectures on the |
126 MNIST digit images generalize to the setting of a much larger and richer (but similar) | 142 MNIST digit images generalize to the setting of a similar but much larger and richer |
127 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 143 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
128 | 144 |
129 $\bullet$ %\item | 145 $\bullet$ %\item |
130 To what extent does the perturbation of input images (e.g. adding | 146 To what extent does the perturbation of input images (e.g. adding |
131 noise, affine transformations, background images) make the resulting | 147 noise, affine transformations, background images) make the resulting |
145 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) | 161 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) |
146 to answer this question. | 162 to answer this question. |
147 %\end{enumerate} | 163 %\end{enumerate} |
148 | 164 |
149 Our experimental results provide positive evidence towards all of these questions, | 165 Our experimental results provide positive evidence towards all of these questions, |
150 as well as classifiers that reach human-level performance on 62-class isolated character | 166 as well as {\em classifiers that reach human-level performance on 62-class isolated character |
151 recognition and beat previously published results on the NIST dataset (special database 19). | 167 recognition and beat previously published results on the NIST dataset (special database 19)}. |
152 To achieve these results, we introduce in the next section a sophisticated system | 168 To achieve these results, we introduce in the next section a sophisticated system |
153 for stochastically transforming character images and then explain the methodology, | 169 for stochastically transforming character images and then explain the methodology, |
154 which is based on training with or without these transformed images and testing on | 170 which is based on training with or without these transformed images and testing on |
155 clean ones. We measure the relative advantage of out-of-distribution examples | 171 clean ones. We measure the relative advantage of out-of-distribution examples |
156 (perturbed or out-of-class) | 172 (perturbed or out-of-class) |
157 for a deep learner vs a supervised shallow one. | 173 for a deep learner vs a supervised shallow one. |
158 Code for generating these transformations as well as for the deep learning | 174 Code for generating these transformations as well as for the deep learning |
159 algorithms is made available at {\tt http://hg.assembla.com/ift6266}. | 175 algorithms is made available at {\tt http://hg.assembla.com/ift6266}. |
160 We estimate the relative advantage for deep learners of training with | 176 We also estimate the relative advantage for deep learners of training with |
161 classes other than those of interest, by comparing learners trained with | 177 classes other than those of interest, by comparing learners trained with |
162 62 classes with learners trained with only a subset (on which they | 178 62 classes with learners trained with only a subset (on which they |
163 are then tested). | 179 are then tested). |
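As a concrete reading of the protocol just described, the sketch below uses hypothetical scikit-learn stand-ins and random placeholder arrays (not the paper's stacked denoising auto-encoder, its MLP baseline, or NIST data). It trains a shallower and a deeper classifier once on the target digit classes alone and once with extra out-of-class examples mixed in, always testing on clean digit images, and reports the relative advantage brought by the extra data.

# Protocol sketch only: stand-in scikit-learn models and random placeholder data.
# The paper's deep learner is a stacked denoising auto-encoder and its shallow
# baseline a one-hidden-layer MLP trained on NIST SD19; none of that is reproduced here.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
d, n_classes = 28 * 28, 62                       # 62 = digits + upper case + lower case

# Placeholder digit-only training set, out-of-class training set, clean digit test set.
X_digits, y_digits = rng.rand(1000, d), rng.randint(10, size=1000)
X_extra,  y_extra  = rng.rand(1000, d), rng.randint(n_classes, size=1000)
X_test,   y_test   = rng.rand(300, d),  rng.randint(10, size=300)

def digit_test_error(hidden_layers, X_tr, y_tr):
    """Train on the given set, evaluate only on the clean digit test set."""
    model = MLPClassifier(hidden_layer_sizes=hidden_layers, max_iter=50, random_state=0)
    model.fit(X_tr, y_tr)
    return np.mean(model.predict(X_test) != y_test)

for name, hidden in [("shallow", (100,)), ("deeper", (100, 100, 100))]:
    err_target_only = digit_test_error(hidden, X_digits, y_digits)
    err_with_extra  = digit_test_error(hidden, np.vstack([X_digits, X_extra]),
                                       np.concatenate([y_digits, y_extra]))
    # Relative advantage of the out-of-distribution / out-of-class training data.
    print(name, (err_target_only - err_with_extra) / err_target_only)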
164 The conclusion discusses | 180 The conclusion discusses |
165 the more general question of why deep learners may benefit so much from | 181 the more general question of why deep learners may benefit so much from |
994 due to sharing of intermediate features across tasks already points | 1010 due to sharing of intermediate features across tasks already points |
995 towards that explanation~\citep{baxter95a}. | 1011 towards that explanation~\citep{baxter95a}. |
996 Intermediate features that can be used in different | 1012 Intermediate features that can be used in different |
997 contexts can be estimated in a way that allows sharing of statistical | 1013 contexts can be estimated in a way that allows sharing of statistical |
998 strength. Features extracted through many levels are more likely to | 1014 strength. Features extracted through many levels are more likely to |
999 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest), | 1015 be more abstract and more invariant to some of the factors of variation |
1016 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest), | |
1000 increasing the likelihood that they would be useful for a larger array | 1017 increasing the likelihood that they would be useful for a larger array |
1001 of tasks and input conditions. | 1018 of tasks and input conditions. |
1002 Therefore, we hypothesize that both depth and unsupervised | 1019 Therefore, we hypothesize that both depth and unsupervised |
1003 pre-training play a part in explaining the advantages observed here, and future | 1020 pre-training play a part in explaining the advantages observed here, and future |
1004 experiments could attempt to tease apart these factors. | 1021 experiments could attempt to tease apart these factors. |
1005 And why would deep learners benefit from the self-taught learning | 1022 And why would deep learners benefit from the self-taught learning |
1006 scenarios even when the number of labeled examples is very large? | 1023 scenarios even when the number of labeled examples is very large? |
1007 We hypothesize that this is related to the hypotheses studied | 1024 We hypothesize that this is related to the hypotheses studied |
1008 in~\citet{Erhan+al-2010}. Whereas in~\citet{Erhan+al-2010} | 1025 in~\citet{Erhan+al-2010}. In~\citet{Erhan+al-2010} |
1009 it was found that online learning on a huge dataset did not make the | 1026 it was found that online learning on a huge dataset did not make the |
1010 advantage of the deep learning bias vanish, a similar phenomenon | 1027 advantage of the deep learning bias vanish, and a similar phenomenon |
1011 may be happening here. We hypothesize that unsupervised pre-training | 1028 may be happening here. We hypothesize that unsupervised pre-training |
1012 of a deep hierarchy with self-taught learning initializes the | 1029 of a deep hierarchy with self-taught learning initializes the |
1013 model in the basin of attraction of supervised gradient descent | 1030 model in the basin of attraction of supervised gradient descent |
1014 that corresponds to better generalization. Furthermore, such good | 1031 that corresponds to better generalization. Furthermore, such good |
1015 basins of attraction are not discovered by pure supervised learning | 1032 basins of attraction are not discovered by pure supervised learning |
1016 (with or without self-taught settings), and more labeled examples | 1033 (with or without self-taught settings) from random initialization, and more labeled examples |
1017 do not allow the model to go from the poorer basins of attraction discovered | 1034 do not allow the shallow or purely supervised models to discover |
1018 by the purely supervised shallow models to the kind of better basins associated | 1035 the kind of better basins associated |
1019 with deep learning and self-taught learning. | 1036 with deep learning and self-taught learning. |
1020 | 1037 |
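The mechanism hypothesized in the paragraph above, unsupervised layer-wise pre-training placing the deep network in a good basin of attraction before supervised fine-tuning, can be made concrete with a small sketch. The fragment below is written in PyTorch with placeholder data and arbitrary layer sizes; it is not the authors' Theano code from the ift6266 repository, only a minimal greedy denoising-auto-encoder pre-training followed by supervised fine-tuning of the stacked layers.

# Minimal sketch (PyTorch, placeholder data), not the paper's implementation:
# greedy layer-wise denoising-auto-encoder pre-training, then supervised fine-tuning.
import torch
import torch.nn as nn

sizes = [28 * 28, 500, 500, 500]                  # input plus three hidden layers (arbitrary)
encoders = [nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def pretrain_layer(enc, data, corruption=0.25, steps=20, lr=0.1):
    """One denoising auto-encoder: corrupt the input, reconstruct the clean version."""
    dec = nn.Linear(enc.out_features, enc.in_features)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(steps):
        mask = (torch.rand_like(data) > corruption).float()      # masking noise
        recon = torch.sigmoid(dec(torch.sigmoid(enc(data * mask))))
        loss = nn.functional.binary_cross_entropy(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(enc(data)).detach()      # representation fed to the next layer

X = torch.rand(256, 28 * 28)                      # unlabeled placeholder images in [0, 1]
h = X
for enc in encoders:                              # greedy unsupervised pre-training
    h = pretrain_layer(enc, h)

# Stack the pre-trained encoders, add a 62-way output layer, fine-tune with labels.
model = nn.Sequential(*[m for enc in encoders for m in (enc, nn.Sigmoid())],
                      nn.Linear(sizes[-1], 62))
y = torch.randint(0, 62, (256,))                  # placeholder labels
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(20):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()

The purely supervised baseline discussed above simply skips the pre-training loop and trains the same stack from random initialization.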
1021 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 1038 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
1022 can be executed on-line at {\tt http://deep.host22.com}. | 1039 can be executed on-line at {\tt http://deep.host22.com}. |
1023 | 1040 |