comparison writeup/nips2010_cameraready.tex @ 607:d840139444fe
NIPS workshop spotlight
author   | Yoshua Bengio <bengioy@iro.umontreal.ca>
date     | Fri, 26 Nov 2010 17:41:43 -0500
parents  | bd7d11089a47
children | (none)
606:bd7d11089a47 | 607:d840139444fe |
---|---|
6 \usepackage{algorithm,algorithmic} | 6 \usepackage{algorithm,algorithmic} |
7 \usepackage[utf8]{inputenc} | 7 \usepackage[utf8]{inputenc} |
8 \usepackage{graphicx,subfigure} | 8 \usepackage{graphicx,subfigure} |
9 \usepackage[numbers]{natbib} | 9 \usepackage[numbers]{natbib} |
10 | 10 |
11 \addtolength{\textwidth}{20mm} | 11 \addtolength{\textwidth}{10mm} |
12 \addtolength{\textheight}{20mm} | 12 \addtolength{\textheight}{10mm} |
13 \addtolength{\topmargin}{-10mm} | 13 \addtolength{\topmargin}{-5mm} |
14 \addtolength{\evensidemargin}{-10mm} | 14 \addtolength{\evensidemargin}{-5mm} |
15 \addtolength{\oddsidemargin}{-10mm} | 15 \addtolength{\oddsidemargin}{-5mm} |
16 | 16 |
17 %\setlength\parindent{0mm} | 17 %\setlength\parindent{0mm} |
18 | 18 |
19 \title{Deep Self-Taught Learning for Handwritten Character Recognition} | 19 \title{Deep Self-Taught Learning for Handwritten Character Recognition} |
20 \author{ | 20 \author{ |
70 | 70 |
71 \section{Introduction} | 71 \section{Introduction} |
72 \vspace*{-1mm} | 72 \vspace*{-1mm} |
73 | 73 |
74 {\bf Deep Learning} has emerged as a promising new area of research in | 74 {\bf Deep Learning} has emerged as a promising new area of research in |
75 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008-very-small,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review. | 75 statistical machine learning~\citep{Hinton06} |
76 (see \citet{Bengio-2009} for a review). | |
76 Learning algorithms for deep architectures are centered on the learning | 77 Learning algorithms for deep architectures are centered on the learning |
77 of useful representations of data, which are better suited to the task at hand, | 78 of useful representations of data, which are better suited to the task at hand, |
78 and are organized in a hierarchy with multiple levels. | 79 and are organized in a hierarchy with multiple levels. |
79 This is in part inspired by observations of the mammalian visual cortex, | 80 This is in part inspired by observations of the mammalian visual cortex, |
80 which consists of a chain of processing elements, each of which is associated with a | 81 which consists of a chain of processing elements, each of which is associated with a |
105 advantage} of deep learning for these settings has not been evaluated. | 106 advantage} of deep learning for these settings has not been evaluated. |
106 The hypothesis discussed in the conclusion is that a deep hierarchy of features | 107 The hypothesis discussed in the conclusion is that a deep hierarchy of features |
107 may be better able to provide sharing of statistical strength | 108 may be better able to provide sharing of statistical strength |
108 between different regions in input space or different tasks. | 109 between different regions in input space or different tasks. |
109 | 110 |
110 \iffalse | |
111 Whereas a deep architecture can in principle be more powerful than a | |
112 shallow one in terms of representation, depth appears to render the | |
113 training problem more difficult in terms of optimization and local minima. | |
114 It is also only recently that successful algorithms were proposed to | |
115 overcome some of these difficulties. All are based on unsupervised | |
116 learning, often in a greedy layer-wise ``unsupervised pre-training'' |
117 stage~\citep{Bengio-2009}. | |
118 The principle is that each layer starting from | |
119 the bottom is trained to represent its input (the output of the previous | |
120 layer). After this | |
121 unsupervised initialization, the stack of layers can be | |
122 converted into a deep supervised feedforward neural network and fine-tuned by | |
123 stochastic gradient descent. | |
124 One of these layer initialization techniques, | |
125 applied here, is the Denoising | |
126 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see | |
127 Figure~\ref{fig:da}), which performed similarly or | |
128 better~\citep{VincentPLarochelleH2008-very-small} than previously | |
129 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06} | |
130 in terms of unsupervised extraction | |
131 of a hierarchy of features useful for classification. Each layer is trained | |
132 to denoise its input, creating a layer of features that can be used as | |
133 input for the next layer, forming a Stacked Denoising Auto-encoder (SDA). | |
134 Note that training a Denoising Auto-encoder | |
135 can actually be seen as training a particular RBM by an inductive |
136 principle different from maximum likelihood~\citep{Vincent-SM-2010}, | |
137 namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}. | |
138 \fi | |
139 | |
140 Previous comparative experimental results with stacking of RBMs and DAs | 111 Previous comparative experimental results with stacking of RBMs and DAs |
141 to build deep supervised predictors had shown that they could outperform | 112 to build deep supervised predictors had shown that they could outperform |
142 shallow architectures in a variety of settings, especially | 113 shallow architectures in a variety of settings, especially |
143 when the data involves complex interactions between many factors of | 114 when the data involves complex interactions between many factors of |
144 variation~\citep{LarochelleH2007,Bengio-2009}. Other experiments have suggested | 115 variation~\citep{LarochelleH2007,Bengio-2009}. Other experiments have suggested |
158 and noises, here). This is consistent with the hypotheses discussed | 129 and noises, here). This is consistent with the hypotheses discussed |
159 in~\citet{Bengio-2009} regarding the potential advantage | 130 in~\citet{Bengio-2009} regarding the potential advantage |
160 of deep learning and the idea that more levels of representation can | 131 of deep learning and the idea that more levels of representation can |
161 give rise to more abstract, more general features of the raw input. | 132 give rise to more abstract, more general features of the raw input. |
162 | 133 |
163 This hypothesis is related to a learning setting called | 134 This hypothesis is related to the |
164 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles | 135 {\bf self-taught learning} setting~\citep{RainaR2007}, which combines principles |
165 of semi-supervised and multi-task learning: the learner can exploit examples | 136 of semi-supervised and multi-task learning: the learner can exploit examples |
166 that are unlabeled and possibly come from a distribution different from the target | 137 that are unlabeled and possibly come from a distribution different from the target |
167 distribution, e.g., from other classes than those of interest. | 138 distribution, e.g., from classes other than those of interest. |
168 It has already been shown that deep learners can clearly take advantage of | 139 It has already been shown that deep learners can take advantage of |
169 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | 140 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, |
170 but more needed to be done to explore the impact | 141 but more needed to be done to explore the impact |
171 of {\em out-of-distribution} examples and of the {\em multi-task} setting | 142 of {\em out-of-distribution} examples and of the {\em multi-task} setting |
172 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised | 143 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised |
173 pre-training only with the first layer). In particular the {\em relative | 144 pre-training only with the first layer). In particular the {\em relative |
174 advantage of deep learning} for these settings has not been evaluated. | 145 advantage of deep learning} for these settings had not been evaluated. |
175 | 146 |
176 | 147 |
177 % | 148 % |
178 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can | 149 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can |
179 {\bf benefit more from out-of-distribution examples than shallow learners} (with a single | 150 {\bf benefit more from out-of-distribution examples than shallow learners} (with a single |
224 %%\newpage | 195 %%\newpage |
225 \section{Perturbed and Transformed Character Images} | 196 \section{Perturbed and Transformed Character Images} |
226 \label{s:perturbations} | 197 \label{s:perturbations} |
227 \vspace*{-2mm} | 198 \vspace*{-2mm} |
228 | 199 |
229 \begin{minipage}[h]{\linewidth} | 200 %\begin{minipage}[h]{\linewidth} |
230 \begin{wrapfigure}[8]{l}{0.15\textwidth} | 201 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
231 %\begin{minipage}[b]{0.14\linewidth} | 202 %\begin{minipage}[b]{0.14\linewidth} |
232 \vspace*{-5mm} | 203 \vspace*{-5mm} |
233 \begin{center} | 204 \begin{center} |
234 \includegraphics[scale=.4]{images/Original.png}\\ | 205 \includegraphics[scale=.4]{images/Original.png}\\ |
249 in the complexity of the learning task. | 220 in the complexity of the learning task. |
250 More details can | 221 More details can |
251 be found in this technical report~\citep{ARXIV-2010}. | 222 be found in this technical report~\citep{ARXIV-2010}. |
252 The code for these transformations (mostly python) is available at | 223 The code for these transformations (mostly python) is available at |
253 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share | 224 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share |
254 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 225 a global control parameter ($0 \le complexity \le 1$) modulating the |
255 amount of deformation or noise introduced. | 226 amount of deformation or noise. |
256 There are two main parts in the pipeline. The first one, | 227 There are two main parts in the pipeline. The first one, |
257 from thickness to pinch, performs transformations. The second | 228 from thickness to pinch, performs transformations. The second |
258 part, from blur to contrast, adds different kinds of noise. | 229 part, from blur to contrast, adds different kinds of noise. |
259 \end{minipage} | 230 %\end{minipage} |
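A rough Python sketch of the pipeline control flow described above; the module interface (a `transform` method and a `skip_probability` attribute) and the module list are illustrative assumptions, not the actual ift6266 code:

    import random

    def apply_pipeline(image, modules, complexity):
        """Run the image through each perturbation module in order.
        complexity in [0, 1] globally modulates deformation/noise."""
        assert 0.0 <= complexity <= 1.0
        out = image
        for module in modules:
            # Several modules are stochastically skipped
            # (e.g., occlusion 60% of the time, smoothing 75%).
            if random.random() < module.skip_probability:
                continue
            out = module.transform(out, complexity)
        return out

In such a sketch, the first entries of the module list would be the transformations (thickness through pinch), followed by the noise modules (blur through contrast).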
260 | 231 |
261 \newpage | 232 %\newpage |
262 \vspace*{1mm} | 233 \vspace*{1mm} |
263 %\subsection{Transformations} | 234 %\subsection{Transformations} |
264 {\large\bf 2.1 Transformations} | 235 {\large\bf 2.1 Transformations} |
265 \vspace*{1mm} | 236 \vspace*{1mm} |
266 | 237 |
402 | 373 |
403 \vspace{1mm} | 374 \vspace{1mm} |
404 | 375 |
405 {\large\bf 2.2 Injecting Noise} | 376 {\large\bf 2.2 Injecting Noise} |
406 %\subsection{Injecting Noise} | 377 %\subsection{Injecting Noise} |
407 \vspace{2mm} | 378 %\vspace{2mm} |
408 | 379 |
409 \begin{minipage}[h]{\linewidth} | 380 \begin{minipage}[h]{\linewidth} |
410 %\vspace*{-.2cm} | 381 %\vspace*{-.2cm} |
411 \begin{minipage}[t]{0.14\linewidth} | 382 %\begin{minipage}[t]{0.14\linewidth} |
412 \centering | 383 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
413 \vspace*{-2mm} | 384 \begin{center} |
385 \vspace*{-5mm} | |
386 %\vspace*{-2mm} | |
414 \includegraphics[scale=.4]{images/Motionblur_only.png}\\ | 387 \includegraphics[scale=.4]{images/Motionblur_only.png}\\ |
415 {\bf Motion Blur} | 388 {\bf Motion Blur} |
416 \end{minipage}% | 389 %\end{minipage}% |
417 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} | 390 \end{center} |
391 \end{wrapfigure} | |
392 %\hspace{0.3cm} | |
393 %\begin{minipage}[t]{0.83\linewidth} | |
418 %\vspace*{.5mm} | 394 %\vspace*{.5mm} |
419 The {\bf motion blur} module is GIMP's ``linear motion blur'', which | 395 The {\bf motion blur} module is GIMP's ``linear motion blur'', which |
420 has parameters $length$ and $angle$. The value of | 396 has parameters $length$ and $angle$. The value of |
421 a pixel in the final image is approximately the mean of the first $length$ pixels | 397 a pixel in the final image is approximately the mean of the first $length$ pixels |
422 found by moving in the $angle$ direction, | 398 found by moving in the $angle$ direction, |
423 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. | 399 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. |
424 \vspace{5mm} | 400 \vspace{5mm} |
425 \end{minipage} | 401 \end{minipage} |
426 \end{minipage} | 402 %\end{minipage} |
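A minimal sketch of the parameter sampling stated above, assuming NumPy; folding the Normal draw to a non-negative $length$ via abs() is an assumption, and the blur itself is delegated to GIMP in the actual module:

    import numpy as np

    def sample_motion_blur_params(complexity, rng=np.random):
        # angle ~ U[0, 360] degrees
        angle = rng.uniform(0.0, 360.0)
        # length ~ Normal(0, (3 * complexity)^2); assumed folded to >= 0
        length = abs(rng.normal(0.0, 3.0 * complexity))
        return angle, length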
427 | 403 |
428 \vspace*{1mm} | 404 \vspace*{1mm} |
429 | 405 |
430 \begin{minipage}[h]{\linewidth} | 406 \begin{minipage}[h]{\linewidth} |
431 \begin{minipage}[t]{0.14\linewidth} | 407 \begin{minipage}[t]{0.14\linewidth} |
441 image. Pixels are combined by taking the max(occluder, occluded), | 417 image. Pixels are combined by taking the max(occluder, occluded), |
442 i.e. keeping the lighter ones. | 418 i.e. keeping the lighter ones. |
443 The rectangle corners | 419 The rectangle corners |
444 are sampled so that larger complexity gives larger rectangles. | 420 are sampled so that larger complexity gives larger rectangles. |
445 The destination position in the occluded image is also sampled | 421 The destination position in the occluded image is also sampled |
446 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). | 422 according to a normal distribution (more details in~\citet{ARXIV-2010}). |
447 This module is skipped with probability 60\%. | 423 This module is skipped with probability 60\%. |
448 %\vspace{7mm} | 424 %\vspace{7mm} |
449 \end{minipage} | 425 \end{minipage} |
450 \end{minipage} | 426 \end{minipage} |
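The max(occluder, occluded) pixel rule above can be sketched as follows, assuming grayscale NumPy arrays in which lighter pixels have larger values and an occluder patch that fits at the sampled destination (the corner and destination sampling are omitted here):

    import numpy as np

    def paste_occluder(occluded, occluder, x, y):
        """Paste the occluder at (x, y), keeping the lighter (max)
        pixel at each position, as described above."""
        h, w = occluder.shape
        region = occluded[y:y + h, x:x + w]
        occluded[y:y + h, x:x + w] = np.maximum(region, occluder)
        return occluded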
451 | 427 |
452 \vspace*{1mm} | 428 \vspace*{1mm} |
453 | 429 |
454 \begin{wrapfigure}[8]{l}{0.15\textwidth} | 430 \begin{wrapfigure}[8]{l}{0.15\textwidth} |
455 \vspace*{-6mm} | 431 \vspace*{-3mm} |
456 \begin{center} | 432 \begin{center} |
457 %\begin{minipage}[t]{0.14\linewidth} | 433 %\begin{minipage}[t]{0.14\linewidth} |
458 %\centering | 434 %\centering |
459 \includegraphics[scale=.4]{images/Bruitgauss_only.png}\\ | 435 \includegraphics[scale=.4]{images/Bruitgauss_only.png}\\ |
460 {\bf Gaussian Smoothing} | 436 {\bf Gaussian Smoothing} |
480 This module is skipped with probability 75\%. | 456 This module is skipped with probability 75\%. |
481 %\end{minipage} | 457 %\end{minipage} |
482 | 458 |
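The Gaussian smoothing module's exact kernel parameters are not shown in this excerpt; the sketch below only illustrates the general shape of such a module, and SciPy's gaussian_filter with a width that grows with complexity is an assumption:

    import random
    from scipy.ndimage import gaussian_filter

    def maybe_gaussian_smooth(image, complexity):
        # Skipped with probability 75%, as stated above.
        if random.random() < 0.75:
            return image
        # Assumed schedule: blur width grows with complexity.
        return gaussian_filter(image, sigma=3.0 * complexity)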
483 %\newpage | 459 %\newpage |
484 | 460 |
485 \vspace*{-9mm} | 461 \vspace*{1mm} |
486 | 462 |
487 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} | 463 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} |
488 %\centering | 464 %\centering |
489 \begin{minipage}[t]{\linewidth} | 465 \begin{minipage}[t]{\linewidth} |
490 \begin{wrapfigure}[7]{l}{0.15\textwidth} | 466 \begin{wrapfigure}[7]{l}{0.15\textwidth} |
620 \vspace*{-3mm} | 596 \vspace*{-3mm} |
621 \section{Experimental Setup} | 597 \section{Experimental Setup} |
622 \vspace*{-1mm} | 598 \vspace*{-1mm} |
623 | 599 |
624 Much previous work on deep learning had been performed on | 600 Much previous work on deep learning had been performed on |
625 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, | 601 the MNIST digits task |
626 with 60~000 examples, and variants involving 10~000 | 602 with 60~000 examples, and variants involving 10~000 |
627 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}. | 603 examples~\citep{VincentPLarochelleH2008-very-small}. |
628 The focus here is on much larger training sets, from 10 times to | 604 The focus here is on much larger training sets, from 10 times to |
629 1000 times larger, and 62 classes. | 605 1000 times larger, and 62 classes. |
630 | 606 |
631 The first step in constructing the larger datasets (called NISTP and P07) is to sample from | 607 The first step in constructing the larger datasets (called NISTP and P07) is to sample from |
632 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 608 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
784 | 760 |
785 | 761 |
786 {\bf Stacked Denoising Auto-encoders (SDA).} | 762 {\bf Stacked Denoising Auto-encoders (SDA).} |
787 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 763 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
788 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 764 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
789 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, | 765 layers) |
790 apparently setting parameters in the | 766 apparently setting parameters in the |
791 basin of attraction of supervised gradient descent that yields better | 767 basin of attraction of supervised gradient descent that yields better |
792 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised | 768 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised |
793 pre-training phase} uses all of the training images but not the training labels. | 769 pre-training phase} uses all of the training images but not the training labels. |
794 Each layer is trained in turn to produce a new representation of its input | 770 Each layer is trained in turn to produce a new representation of its input |
800 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 776 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
801 taking advantage of the expressive power and bias implicit in the | 777 taking advantage of the expressive power and bias implicit in the |
802 deep architecture (whereby complex concepts are expressed as | 778 deep architecture (whereby complex concepts are expressed as |
803 compositions of simpler ones through a deep hierarchy). | 779 compositions of simpler ones through a deep hierarchy). |
804 | 780 |
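The two-phase procedure just described (unsupervised layer-wise pre-training, then supervised fine-tuning) has the following shape; the pretrain/encode/finetune interfaces are hypothetical stand-ins rather than the authors' implementation (see {\tt http://deeplearning.net/tutorial} for that):

    def train_sda(layers, finetune, unlabeled_images, labeled_pairs):
        """Greedy layer-wise unsupervised pre-training, then supervised
        fine-tuning of the whole stack."""
        representation = unlabeled_images
        for layer in layers:
            # Each layer is trained (without labels) to produce a new
            # representation of its input, i.e. of the output of the
            # layers below it.
            layer.pretrain(representation)
            representation = layer.encode(representation)
        # The stack is then treated as a deep feedforward network and
        # fine-tuned by stochastic gradient descent on P(y|x).
        finetune(layers, labeled_pairs)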
781 \iffalse | |
805 \begin{figure}[ht] | 782 \begin{figure}[ht] |
806 \vspace*{-2mm} | 783 \vspace*{-2mm} |
807 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | 784 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} |
808 \vspace*{-2mm} | 785 \vspace*{-2mm} |
809 \caption{Illustration of the computations and training criterion for the denoising | 786 \caption{Illustration of the computations and training criterion for the denoising |
815 $L_H(x,z)$, whose expected value is approximately minimized during training | 792 $L_H(x,z)$, whose expected value is approximately minimized during training |
816 by tuning $\theta$ and $\theta'$.} | 793 by tuning $\theta$ and $\theta'$.} |
817 \label{fig:da} | 794 \label{fig:da} |
818 \vspace*{-2mm} | 795 \vspace*{-2mm} |
819 \end{figure} | 796 \end{figure} |
797 \fi | |
820 | 798 |
821 Here we chose to use the Denoising | 799 Here we chose to use the Denoising |
822 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for | 800 Auto-encoder~\citep{VincentPLarochelleH2008-very-small} as the building block for |
823 these deep hierarchies of features, as it is simple to train and | 801 these deep hierarchies of features, as it is simple to train and |
824 explain (see Figure~\ref{fig:da}, as well as | 802 explain (see % Figure~\ref{fig:da}, as well as |
825 the tutorial and code at {\tt http://deeplearning.net/tutorial}), | 803 the tutorial and code at {\tt http://deeplearning.net/tutorial}), |
826 provides efficient inference, and yielded results | 804 provides efficient inference, and yielded results |
827 comparable to or better than RBMs in a series of experiments | 805 comparable to or better than RBMs in a series of experiments |
828 \citep{VincentPLarochelleH2008-very-small}. It in fact corresponds to a Gaussian | 806 \citep{VincentPLarochelleH2008-very-small}. It in fact corresponds to a Gaussian |
829 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. | 807 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. |
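One evaluation of the Denoising Auto-encoder training criterion, as a sketch: sigmoid units, tied weights ($W' = W^T$), and zero-masking corruption are assumptions consistent with \citep{VincentPLarochelleH2008-very-small}, not necessarily the exact configuration used here:

    import numpy as np

    def da_loss(x, W, b, b_prime, corruption_level, rng=np.random):
        """Corrupt x, encode, decode, and return the cross-entropy
        reconstruction loss L_H(x, z)."""
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
        mask = rng.rand(*x.shape) >= corruption_level   # zero-masking noise
        x_tilde = x * mask                              # corrupted input
        h = sigmoid(x_tilde @ W + b)                    # hidden code
        z = sigmoid(h @ W.T + b_prime)                  # reconstruction
        return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z))

Training then tunes $\theta = (W, b)$ and $\theta' = (b')$ to approximately minimize the expected value of this loss, as in the figure caption above.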