comparison writeup/nips2010_cameraready.tex @ 607:d840139444fe

NIPS workshop spotlight
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Fri, 26 Nov 2010 17:41:43 -0500
parents bd7d11089a47
children
comparing 606:bd7d11089a47 with 607:d840139444fe
@@ -6,15 +6,15 @@
 \usepackage{algorithm,algorithmic}
 \usepackage[utf8]{inputenc}
 \usepackage{graphicx,subfigure}
 \usepackage[numbers]{natbib}
 
-\addtolength{\textwidth}{20mm}
-\addtolength{\textheight}{20mm}
-\addtolength{\topmargin}{-10mm}
-\addtolength{\evensidemargin}{-10mm}
-\addtolength{\oddsidemargin}{-10mm}
+\addtolength{\textwidth}{10mm}
+\addtolength{\textheight}{10mm}
+\addtolength{\topmargin}{-5mm}
+\addtolength{\evensidemargin}{-5mm}
+\addtolength{\oddsidemargin}{-5mm}
 
 %\setlength\parindent{0mm}
 
 \title{Deep Self-Taught Learning for Handwritten Character Recognition}
 \author{
@@ -70,11 +70,12 @@
 
 \section{Introduction}
 \vspace*{-1mm}
 
 {\bf Deep Learning} has emerged as a promising new area of research in
-statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008-very-small,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
+statistical machine learning~\citep{Hinton06}
+(see \citet{Bengio-2009} for a review).
 Learning algorithms for deep architectures are centered on the learning
 of useful representations of data, which are better suited to the task at hand,
 and are organized in a hierarchy with multiple levels.
 This is in part inspired by observations of the mammalian visual cortex,
 which consists of a chain of processing elements, each of which is associated with a
@@ -105,40 +106,10 @@
 advantage} of deep learning for these settings has not been evaluated.
 The hypothesis discussed in the conclusion is that a deep hierarchy of features
 may be better able to provide sharing of statistical strength
 between different regions in input space or different tasks.
 
-\iffalse
-Whereas a deep architecture can in principle be more powerful than a
-shallow one in terms of representation, depth appears to render the
-training problem more difficult in terms of optimization and local minima.
-It is also only recently that successful algorithms were proposed to
-overcome some of these difficulties. All are based on unsupervised
-learning, often in an greedy layer-wise ``unsupervised pre-training''
-stage~\citep{Bengio-2009}.
-The principle is that each layer starting from
-the bottom is trained to represent its input (the output of the previous
-layer). After this
-unsupervised initialization, the stack of layers can be
-converted into a deep supervised feedforward neural network and fine-tuned by
-stochastic gradient descent.
-One of these layer initialization techniques,
-applied here, is the Denoising
-Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
-Figure~\ref{fig:da}), which performed similarly or
-better~\citep{VincentPLarochelleH2008-very-small} than previously
-proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
-in terms of unsupervised extraction
-of a hierarchy of features useful for classification. Each layer is trained
-to denoise its input, creating a layer of features that can be used as
-input for the next layer, forming a Stacked Denoising Auto-encoder (SDA).
-Note that training a Denoising Auto-encoder
-can actually been seen as training a particular RBM by an inductive
-principle different from maximum likelihood~\citep{Vincent-SM-2010},
-namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
-\fi
-
 Previous comparative experimental results with stacking of RBMs and DAs
 to build deep supervised predictors had shown that they could outperform
 shallow architectures in a variety of settings, especially
 when the data involves complex interactions between many factors of
 variation~\citep{LarochelleH2007,Bengio-2009}. Other experiments have suggested
@@ -158,22 +129,22 @@
 and noises, here). This is consistent with the hypotheses discussed
 in~\citet{Bengio-2009} regarding the potential advantage
 of deep learning and the idea that more levels of representation can
 give rise to more abstract, more general features of the raw input.
 
-This hypothesis is related to a learning setting called
-{\bf self-taught learning}~\citep{RainaR2007}, which combines principles
+This hypothesis is related to the
+{\bf self-taught learning} setting~\citep{RainaR2007}, which combines principles
 of semi-supervised and multi-task learning: the learner can exploit examples
 that are unlabeled and possibly come from a distribution different from the target
-distribution, e.g., from other classes than those of interest.
-It has already been shown that deep learners can clearly take advantage of
+distribution, e.g., from classes other than those of interest.
+It has already been shown that deep learners can take advantage of
 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
 but more needed to be done to explore the impact
 of {\em out-of-distribution} examples and of the {\em multi-task} setting
 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised
 pre-training only with the first layer). In particular the {\em relative
-advantage of deep learning} for these settings has not been evaluated.
+advantage of deep learning} for these settings had not been evaluated.
 
 
 %
 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
 {\bf benefit more from out-of-distribution examples than shallow learners} (with a single
@@ -224,11 +195,11 @@
 %%\newpage
 \section{Perturbed and Transformed Character Images}
 \label{s:perturbations}
 \vspace*{-2mm}
 
-\begin{minipage}[h]{\linewidth}
+%\begin{minipage}[h]{\linewidth}
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
 %\begin{minipage}[b]{0.14\linewidth}
 \vspace*{-5mm}
 \begin{center}
 \includegraphics[scale=.4]{images/Original.png}\\
@@ -249,18 +220,18 @@
 in the complexity of the learning task.
 More details can
 be found in this technical report~\citep{ARXIV-2010}.
 The code for these transformations (mostly python) is available at
 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
-a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
-amount of deformation or noise introduced.
+a global control parameter ($0 \le complexity \le 1$) modulating the
+amount of deformation or noise.
 There are two main parts in the pipeline. The first one,
 from thickness to pinch, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
-\end{minipage}
+%\end{minipage}
 
-\newpage
+%\newpage
 \vspace*{1mm}
 %\subsection{Transformations}
 {\large\bf 2.1 Transformations}
 \vspace*{1mm}
 
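To make the pipeline's contract concrete, here is a minimal Python sketch of
the two-part structure described above. The module bodies and names below are
illustrative stand-ins, not the actual transformations from
{\tt http://hg.assembla.com/ift6266}; only the shared global parameter
$0 \le complexity \le 1$ and the transformations-then-noise ordering are taken
from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    def pinch(img, complexity):
        # Stand-in for a module from the first part of the pipeline;
        # a real module would geometrically warp the character.
        return np.clip(img * (1.0 - 0.1 * complexity), 0.0, 1.0)

    def contrast(img, complexity):
        # Stand-in for a noise module from the second part: compress
        # the dynamic range by an amount that grows with complexity.
        low = 0.5 * complexity * rng.random()
        return low + (1.0 - low) * img

    # Transformations are applied first, then noise, with every module
    # reading the same global complexity parameter.
    TRANSFORMATIONS = [pinch]
    NOISES = [contrast]

    def perturb(img, complexity=0.5):
        for module in TRANSFORMATIONS + NOISES:
            img = module(img, complexity)
        return img

    example = perturb(rng.random((32, 32)), complexity=0.7)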
@@ -402,30 +373,35 @@
 
 \vspace{1mm}
 
 {\large\bf 2.2 Injecting Noise}
 %\subsection{Injecting Noise}
-\vspace{2mm}
+%\vspace{2mm}
 
 \begin{minipage}[h]{\linewidth}
 %\vspace*{-.2cm}
-\begin{minipage}[t]{0.14\linewidth}
-\centering
-\vspace*{-2mm}
+%\begin{minipage}[t]{0.14\linewidth}
+\begin{wrapfigure}[8]{l}{0.15\textwidth}
+\begin{center}
+\vspace*{-5mm}
+%\vspace*{-2mm}
 \includegraphics[scale=.4]{images/Motionblur_only.png}\\
 {\bf Motion Blur}
-\end{minipage}%
-\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
+%\end{minipage}%
+\end{center}
+\end{wrapfigure}
+%\hspace{0.3cm}
+%\begin{minipage}[t]{0.83\linewidth}
 %\vspace*{.5mm}
 The {\bf motion blur} module is GIMP's ``linear motion blur'', which
 has parameters $length$ and $angle$. The value of
 a pixel in the final image is approximately the mean of the first $length$ pixels
 found by moving in the $angle$ direction,
 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
 \vspace{5mm}
 \end{minipage}
-\end{minipage}
+%\end{minipage}
 
 \vspace*{1mm}
 
 \begin{minipage}[h]{\linewidth}
 \begin{minipage}[t]{0.14\linewidth}
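Given those two parameter distributions, a rough numpy reimplementation of the
sampling and averaging step is sketched below. The paper delegates to GIMP's
filter, so this version is only an approximation; taking the absolute value of
the sampled $length$ is an added assumption (the filter needs a nonnegative
length).

    import numpy as np

    def motion_blur(img, complexity, rng=np.random.default_rng(0)):
        # Sample the module's parameters as described in the text.
        angle = rng.uniform(0.0, 360.0)                  # degrees
        length = abs(rng.normal(0.0, 3.0 * complexity))  # std dev = 3*complexity
        steps = int(round(length))
        if steps < 1:
            return img
        dy, dx = np.sin(np.deg2rad(angle)), np.cos(np.deg2rad(angle))
        h, w = img.shape
        ys, xs = np.mgrid[0:h, 0:w]
        out = np.zeros_like(img, dtype=float)
        # Each output pixel is the mean of the first `steps` pixels
        # encountered while moving in the `angle` direction.
        for k in range(steps):
            yk = np.clip(np.round(ys + k * dy).astype(int), 0, h - 1)
            xk = np.clip(np.round(xs + k * dx).astype(int), 0, w - 1)
            out += img[yk, xk]
        return out / steps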
@@ -441,20 +417,20 @@
 image. Pixels are combined by taking the max(occluder, occluded),
 i.e. keeping the lighter ones.
 The rectangle corners
 are sampled so that larger complexity gives larger rectangles.
-The destination position in the occluded image are also sampled
-according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
+The destination position in the occluded image is also sampled
+according to a normal distribution (more details in~\citet{ARXIV-2010}).
 This module is skipped with probability 60\%.
 %\vspace{7mm}
 \end{minipage}
 \end{minipage}
 
 \vspace*{1mm}
 
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
-\vspace*{-6mm}
+\vspace*{-3mm}
 \begin{center}
 %\begin{minipage}[t]{0.14\linewidth}
 %\centering
 \includegraphics[scale=.4]{images/Bruitgauss_only.png}\\
 {\bf Gaussian Smoothing}
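A sketch of the occlusion logic follows. Only the max combination rule, the
complexity scaling of rectangle size, the normal sampling of the destination,
and the 60\% skip probability come from the text; the exact distributions are
in the technical report, so the numeric constants below are illustrative
stand-ins.

    import numpy as np

    def occlusion(occluded, occluder, complexity, rng=np.random.default_rng(0)):
        if rng.random() < 0.6:        # module skipped with probability 60%
            return occluded
        h, w = occluded.shape
        # Larger complexity gives larger rectangles (illustrative scaling).
        rh = max(1, int(h * rng.uniform(0.2, 0.2 + 0.6 * complexity)))
        rw = max(1, int(w * rng.uniform(0.2, 0.2 + 0.6 * complexity)))
        top = rng.integers(0, h - rh + 1)
        left = rng.integers(0, w - rw + 1)
        patch = occluder[top:top + rh, left:left + rw]
        # Destination position sampled from a normal around the centre.
        y0 = int(np.clip((h - rh) // 2 + rng.normal(0, h / 8), 0, h - rh))
        x0 = int(np.clip((w - rw) // 2 + rng.normal(0, w / 8), 0, w - rw))
        out = occluded.copy()
        out[y0:y0 + rh, x0:x0 + rw] = np.maximum(
            out[y0:y0 + rh, x0:x0 + rw], patch)   # keep the lighter pixels
        return out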
@@ -480,11 +456,11 @@
 This module is skipped with probability 75\%.
 %\end{minipage}
 
 %\newpage
 
-\vspace*{-9mm}
+\vspace*{1mm}
 
 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
 %\centering
 \begin{minipage}[t]{\linewidth}
 \begin{wrapfigure}[7]{l}{0.15\textwidth}
@@ -620,13 +596,13 @@
 \vspace*{-3mm}
 \section{Experimental Setup}
 \vspace*{-1mm}
 
 Much previous work on deep learning had been performed on
-the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
+the MNIST digits task,
 with 60~000 examples, and variants involving 10~000
-examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
-The focus here is on much larger training sets, from 10 times to
-to 1000 times larger, and 62 classes.
+examples~\citep{VincentPLarochelleH2008-very-small}.
+The focus here is on much larger training sets, from 10 times
+to 1000 times larger, and 62 classes.
 
 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
@@ -784,11 +760,11 @@
 
 
 {\bf Stacked Denoising Auto-encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many hidden
-layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
+layers),
 apparently setting parameters in the
 basin of attraction of supervised gradient descent yielding better
 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
 pre-training phase} uses all of the training images but not the training labels.
 Each layer is trained in turn to produce a new representation of its input
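As a concrete picture of this layer-wise procedure, here is a minimal numpy
sketch of one denoising auto-encoder layer and the greedy pre-training loop,
assuming sigmoid units, masking corruption, and tied weights; the class name,
hyperparameters, and layer sizes are illustrative, and the paper points to
{\tt http://deeplearning.net/tutorial} for the real implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    class DenoisingAutoencoder:
        def __init__(self, n_in, n_hid, lr=0.1, corruption=0.25):
            self.W = rng.normal(0.0, 0.01, (n_in, n_hid))  # tied weights
            self.b = np.zeros(n_hid)   # encoder bias
            self.c = np.zeros(n_in)    # decoder bias
            self.lr, self.corruption = lr, corruption

        def encode(self, x):
            return sigmoid(x @ self.W + self.b)

        def denoising_step(self, x):
            # Corrupt the input, reconstruct, and take one gradient step
            # on the cross-entropy reconstruction loss L_H(x, z).
            x_tilde = x * (rng.random(x.shape) > self.corruption)
            h = self.encode(x_tilde)
            z = sigmoid(h @ self.W.T + self.c)
            d = (z - x) / len(x)               # grad wrt decoder pre-activation
            dh = (d @ self.W) * h * (1.0 - h)  # grad wrt encoder pre-activation
            self.W -= self.lr * (x_tilde.T @ dh + d.T @ h)
            self.b -= self.lr * dh.sum(axis=0)
            self.c -= self.lr * d.sum(axis=0)

    def pretrain(layers, x, epochs=10):
        # Greedy layer-wise pre-training: each layer learns to denoise the
        # representation produced by the layers below it (no labels used).
        rep = x
        for layer in layers:
            for _ in range(epochs):
                layer.denoising_step(rep)
            rep = layer.encode(rep)
        return rep

    layers = [DenoisingAutoencoder(32 * 32, 500), DenoisingAutoencoder(500, 500)]
    top_code = pretrain(layers, rng.random((64, 32 * 32)))

After this unsupervised phase, the stack plus an output layer would be
fine-tuned as a deep supervised network by stochastic gradient descent, as
described above.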
@@ -800,10 +776,11 @@
 $P(y|x)$ (like in semi-supervised learning), and on the other hand
 taking advantage of the expressive power and bias implicit in the
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
 
+\iffalse
 \begin{figure}[ht]
 \vspace*{-2mm}
 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
 \vspace*{-2mm}
 \caption{Illustration of the computations and training criterion for the denoising
@@ -815,15 +792,16 @@
 $L_H(x,z)$, whose expected value is approximately minimized during training
 by tuning $\theta$ and $\theta'$.}
 \label{fig:da}
 \vspace*{-2mm}
 \end{figure}
+\fi
 
 Here we chose to use the Denoising
-Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
+Auto-encoder~\citep{VincentPLarochelleH2008-very-small} as the building block for
 these deep hierarchies of features, as it is simple to train and
-explain (see Figure~\ref{fig:da}, as well as
+explain (see % Figure~\ref{fig:da}, as well as
 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides efficient inference, and yielded results
-comparable or better than RBMs in series of experiments
+comparable to or better than RBMs in a series of experiments
 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian
 RBM trained by a Score Matching criterion~\cite{Vincent-SM-2010}.
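Since the illustrating figure is now commented out, it may help to restate the
standard DA computations it depicted. Following the cited
\citet{VincentPLarochelleH2008-very-small}, with sigmoid $s$, corruption
process $C$, and parameters $\theta=(W,b)$, $\theta'=(W',b')$ as in the
figure caption (the notation here is reconstructed from that paper, not taken
verbatim from this one):

\begin{align*}
\tilde{x} &\sim C(\tilde{x} \mid x) && \textrm{(stochastically corrupt the input)}\\
h &= s(W \tilde{x} + b) && \textrm{(encode the corrupted input)}\\
z &= s(W' h + b') && \textrm{(decode, i.e.\ reconstruct $x$)}\\
L_H(x,z) &= - \textstyle\sum_i \big( x_i \log z_i + (1-x_i) \log (1-z_i) \big)
\end{align*}

The expected value of $L_H(x,z)$ under the corruption process is what training
approximately minimizes by tuning $\theta$ and $\theta'$, matching the caption
above.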