comparison writeup/nips2010_submission.tex @ 550:662299f265ab

suggestions from Ian
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 02 Jun 2010 15:44:46 -0400
parents ef172f4a322a
children 8f365abf171d
549:ef172f4a322a 550:662299f265ab
31 corresponding shallow learner}, in the area 31 corresponding shallow learner}, in the area
32 of handwritten character recognition. In fact, we show that they reach 32 of handwritten character recognition. In fact, we show that they reach
33 human-level performance on both handwritten digit classification and 33 human-level performance on both handwritten digit classification and
34 62-class handwritten character recognition. For this purpose we 34 62-class handwritten character recognition. For this purpose we
35 developed a powerful generator of stochastic variations and noise 35 developed a powerful generator of stochastic variations and noise
36 processes character images, including not only affine transformations but 36 processes for character images, including not only affine transformations but
37 also slant, local elastic deformations, changes in thickness, background 37 also slant, local elastic deformations, changes in thickness, background
38 images, grey level changes, contrast, occlusion, and various types of pixel and 38 images, grey level changes, contrast, occlusion, and various types of
39 spatially correlated noise. The out-of-distribution examples are 39 noise. The out-of-distribution examples are
40 obtained by training with these highly distorted images or 40 obtained from these highly distorted images or
41 by including object classes different from those in the target test set. 41 by including examples of object classes different from those in the target test set.
42 \end{abstract} 42 \end{abstract}
43 \vspace*{-2mm} 43 \vspace*{-2mm}
44 44
45 \section{Introduction} 45 \section{Introduction}
46 \vspace*{-1mm} 46 \vspace*{-1mm}
85 stochastic gradient descent. 85 stochastic gradient descent.
86 86
87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles 87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
88 of semi-supervised and multi-task learning: the learner can exploit examples 88 of semi-supervised and multi-task learning: the learner can exploit examples
89 that are unlabeled and/or come from a distribution different from the target 89 that are unlabeled and/or come from a distribution different from the target
90 distribution, e.g., from other classes that those of interest. 90 distribution, e.g., from other classes than those of interest.
91 It has already been shown that deep learners can clearly take advantage of 91 It has already been shown that deep learners can clearly take advantage of
92 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, 92 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
93 but more needs to be done to explore the impact 93 but more needs to be done to explore the impact
94 of {\em out-of-distribution} examples and of the multi-task setting 94 of {\em out-of-distribution} examples and of the multi-task setting
95 (one exception is~\citep{CollobertR2008}, but using very different kinds 95 (one exception is~\citep{CollobertR2008}, which uses very different kinds
96 of learning algorithms). In particular the {\em relative 96 of learning algorithms). In particular the {\em relative
97 advantage} of deep learning for this settings has not been evaluated. 97 advantage} of deep learning for these settings has not been evaluated.
98 The hypothesis explored here is that a deep hierarchy of features 98 The hypothesis explored here is that a deep hierarchy of features
99 may be better able to provide sharing of statistical strength 99 may be better able to provide sharing of statistical strength
100 between different regions in input space or different tasks, 100 between different regions in input space or different tasks,
101 as discussed in the conclusion. 101 as discussed in the conclusion.
102 102
118 Do deep architectures {\em benefit more from such out-of-distribution} 118 Do deep architectures {\em benefit more from such out-of-distribution}
119 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? 119 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
120 120
121 $\bullet$ %\item 121 $\bullet$ %\item
122 Similarly, does the feature learning step in deep learning algorithms benefit more 122 Similarly, does the feature learning step in deep learning algorithms benefit more
123 training with similar but different classes (i.e. a multi-task learning scenario) than 123 from training with moderately different classes (i.e. a multi-task learning scenario) than
124 a corresponding shallow and purely supervised architecture? 124 a corresponding shallow and purely supervised architecture?
125 %\end{enumerate} 125 %\end{enumerate}
126 126
127 Our experimental results provide positive evidence towards all of these questions. 127 Our experimental results provide positive evidence towards all of these questions.
128 To achieve these results, we introduce in the next section a sophisticated system 128 To achieve these results, we introduce in the next section a sophisticated system
197 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. 197 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
198 Output pixel $(x,y)$ takes the value of input pixel 198 Output pixel $(x,y)$ takes the value of input pixel
199 nearest to $(ax+by+c,dx+ey+f)$, 199 nearest to $(ax+by+c,dx+ey+f)$,
200 producing scaling, translation, rotation and shearing. 200 producing scaling, translation, rotation and shearing.
201 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to 201 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
202 forbid important rotations (not to confuse classes) but to give good 202 forbid large rotations (not to confuse classes) but to give good
203 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times 203 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
204 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 204 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
205 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times 205 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
206 complexity]$. 206 complexity]$.
207 \vspace*{-1mm} 207 \vspace*{-1mm}
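For concreteness, here is a minimal numpy sketch of the inverse-mapped affine warp just described, with the parameters sampled from the marginals given above; the function name, the scalar loop, and the zero background are our illustrative choices, not the authors' generator code:

    import numpy as np

    def random_affine(img, complexity, rng=np.random):
        # Sample the 6 parameters with the hand-tuned marginals from the text.
        a, d = rng.uniform(1 - 3*complexity, 1 + 3*complexity, 2)
        b, e = rng.uniform(-3*complexity, 3*complexity, 2)
        c, f = rng.uniform(-4*complexity, 4*complexity, 2)
        h, w = img.shape
        out = np.zeros_like(img)
        for y in range(h):
            for x in range(w):
                # Output pixel (x, y) takes the value of the input pixel
                # nearest to (a*x + b*y + c, d*x + e*y + f).
                sx = int(round(a*x + b*y + c))
                sy = int(round(d*x + e*y + f))
                if 0 <= sx < w and 0 <= sy < h:
                    out[y, x] = img[sy, sx]
        return out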
238 \vspace*{0.5mm} 238 \vspace*{0.5mm}
239 239
240 {\bf Motion Blur.} 240 {\bf Motion Blur.}
241 This is GIMP's ``linear motion blur'' 241 This is GIMP's ``linear motion blur''
242 with parameters $length$ and $angle$. The value of 242 with parameters $length$ and $angle$. The value of
243 a pixel in the final image is approximately the mean value of the $length$ first pixels 243 a pixel in the final image is approximately the mean value of the first $length$ pixels
244 found by moving in the $angle$ direction. 244 found by moving in the $angle$ direction.
245 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. 245 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
246 \vspace*{-1mm} 246 \vspace*{-1mm}
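A rough numpy sketch of this filter; we fold a negative sampled length to its absolute value (an assumption, since the text samples $length$ from a Normal centred at 0) and clamp the averaging ray at the image border:

    import numpy as np

    def motion_blur(img, complexity, rng=np.random):
        angle = np.deg2rad(rng.uniform(0, 360))
        length = int(abs(rng.normal(0, 3 * complexity)))  # assumed: |draw|
        if length < 1:
            return img
        dx, dy = np.cos(angle), np.sin(angle)
        steps = np.arange(length)
        h, w = img.shape
        out = np.empty_like(img, dtype=float)
        for y in range(h):
            for x in range(w):
                # Mean of the first `length` pixels along the blur direction.
                xs = np.clip(np.round(x + dx * steps).astype(int), 0, w - 1)
                ys = np.clip(np.round(y + dy * steps).astype(int), 0, h - 1)
                out[y, x] = img[ys, xs].mean()
        return out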
247 247
248 {\bf Occlusion.} 248 {\bf Occlusion.}
255 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). 255 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
256 This filter is skipped with probability 60\%. 256 This filter is skipped with probability 60\%.
257 \vspace*{-1mm} 257 \vspace*{-1mm}
258 258
259 {\bf Pixel Permutation.} 259 {\bf Pixel Permutation.}
260 This filter permutes neighbouring pixels. It selects first 260 This filter permutes neighbouring pixels. It first selects
261 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then 261 a fraction $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then
262 sequentially exchanged with another pixel in its $V4$ neighbourhood. 262 sequentially exchanged with another pixel in its $V4$ neighbourhood.
263 This filter is skipped with probability 80\%. 263 This filter is skipped with probability 80\%.
264 \vspace*{-1mm} 264 \vspace*{-1mm}
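One possible numpy reading of this filter; we take the $V4$ neighbourhood to be the four 4-connected neighbours and pick the swap partner uniformly among them, two details the description leaves open:

    import numpy as np

    def pixel_permutation(img, complexity, rng=np.random):
        if rng.uniform() < 0.8:            # filter skipped with probability 80%
            return img
        out = img.copy()
        h, w = img.shape
        n = int(h * w * complexity / 3)    # fraction complexity/3 of the pixels
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # V4 neighbours (assumed)
        for _ in range(n):
            y, x = rng.randint(0, h), rng.randint(0, w)
            dy, dx = offsets[rng.randint(4)]
            ny = min(max(y + dy, 0), h - 1)
            nx = min(max(x + dx, 0), w - 1)
            out[y, x], out[ny, nx] = out[ny, nx], out[y, x]  # swap the two pixels
        return out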
265 265
266 {\bf Gaussian Noise.} 266 {\bf Gaussian Noise.}
267 This filter simply adds, to each pixel of the image independently, a 267 This filter simply adds, to each pixel of the image independently, a
268 noise $\sim Normal(0(\frac{complexity}{10})^2)$. 268 noise $\sim {\rm Normal}(0,(\frac{complexity}{10})^2)$.
269 This filter is skipped with probability 70\%. 269 This filter is skipped with probability 70\%.
270 \vspace*{-1mm} 270 \vspace*{-1mm}
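In code this filter reduces to the skip test plus one line of noise; the clipping assumes grey levels normalized to $[0,1]$:

    import numpy as np

    def gaussian_noise(img, complexity, rng=np.random):
        if rng.uniform() < 0.7:   # filter skipped with probability 70%
            return img
        # Independent Normal(0, (complexity/10)^2) noise on each pixel.
        return np.clip(img + rng.normal(0, complexity / 10.0, img.shape), 0.0, 1.0)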
271 271
272 {\bf Background Images.} 272 {\bf Background Images.}
273 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random 273 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
362 %\citep{SorokinAndForsyth2008,whitehill09}. 362 %\citep{SorokinAndForsyth2008,whitehill09}.
363 AMT users were presented 363 AMT users were presented
364 with 10 character images and asked to choose 10 corresponding ASCII 364 with 10 character images and asked to choose 10 corresponding ASCII
365 characters. They were forced to make a hard choice among the 365 characters. They were forced to make a hard choice among the
366 62 or 10 character classes (all classes or digits only). 366 62 or 10 character classes (all classes or digits only).
367 Three users classified each image, allowing 367 A total of 2500 images per dataset were classified by XXX subjects,
368 to estimate inter-human variability. A total 2500 images/dataset were classified. 368 with 3 subjects classifying each image, allowing
369 us to estimate inter-human variability (e.g., a standard error of 0.1\%
370 on the average 18\% error made by humans on the 62-class task).
369 371
370 \vspace*{-1mm} 372 \vspace*{-1mm}
371 \subsection{Data Sources} 373 \subsection{Data Sources}
372 \vspace*{-1mm} 374 \vspace*{-1mm}
373 375
418 %\item 420 %\item
419 {\bf OCR data.} 421 {\bf OCR data.}
420 A large set (2 million) of scanned, OCRed and manually verified machine-printed 422 A large set (2 million) of scanned, OCRed and manually verified machine-printed
421 characters (from various documents and books) were included as an 423 characters (from various documents and books) were included as an
422 additional source. This set is part of a larger corpus being collected by the Image Understanding 424 additional source. This set is part of a larger corpus being collected by the Image Understanding
423 Pattern Recognition Research group lead by Thomas Breuel at University of Kaiserslautern 425 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern
424 ({\tt http://www.iupr.com}), and which will be publicly released. 426 ({\tt http://www.iupr.com}), and which will be publicly released.
425 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this 427 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
426 %\end{itemize} 428 %\end{itemize}
427 429
428 \vspace*{-1mm} 430 \vspace*{-1mm}
521 \citep{VincentPLarochelleH2008}. During training, a Denoising 523 \citep{VincentPLarochelleH2008}. During training, a Denoising
522 Auto-Encoder is presented with a stochastically corrupted version 524 Auto-Encoder is presented with a stochastically corrupted version
523 of the input and trained to reconstruct the uncorrupted input, 525 of the input and trained to reconstruct the uncorrupted input,
524 forcing the hidden units to represent the leading regularities in 526 forcing the hidden units to represent the leading regularities in
525 the data. Once it is trained, in a purely unsupervised way, 527 the data. Once it is trained, in a purely unsupervised way,
526 its hidden units activations can 528 its hidden units' activations can
527 be used as inputs for training a second one, etc. 529 be used as inputs for training a second one, etc.
528 After this unsupervised pre-training stage, the parameters 530 After this unsupervised pre-training stage, the parameters
529 are used to initialize a deep MLP, which is fine-tuned by 531 are used to initialize a deep MLP, which is fine-tuned by
530 the same standard procedure used to train them (see previous section). 532 the same standard procedure as used to train the MLPs (see previous section).
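To make the greedy layer-wise procedure concrete, here is a plain-numpy sketch of one denoising auto-encoder layer (tied weights, masking corruption, cross-entropy reconstruction) and of the stacking loop; the hyper-parameters and the per-example update are illustrative, not those used in the experiments:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_dae_layer(X, n_hidden, corruption, lr=0.1, epochs=10, rng=np.random):
        # One layer: corrupt the input, encode, reconstruct the clean input.
        n_vis = X.shape[1]
        W = rng.normal(0, 0.01, (n_vis, n_hidden))
        b_h, b_v = np.zeros(n_hidden), np.zeros(n_vis)
        for _ in range(epochs):
            for x in X:
                x_tilde = x * (rng.uniform(size=n_vis) > corruption)  # masking noise
                h = sigmoid(x_tilde @ W + b_h)                        # encode
                x_hat = sigmoid(h @ W.T + b_v)                        # decode (tied W)
                g_v = x_hat - x                     # cross-entropy gradient
                g_h = (g_v @ W) * h * (1 - h)
                W -= lr * (np.outer(x_tilde, g_h) + np.outer(g_v, h))
                b_h -= lr * g_h
                b_v -= lr * g_v
        return W, b_h

    def pretrain_sda(X, layer_sizes, corruption):
        # Greedy stacking: each trained layer's (uncorrupted) activations
        # become the training input of the next layer.
        params, H = [], X
        for n_hidden in layer_sizes:
            W, b = train_dae_layer(H, n_hidden, corruption)
            params.append((W, b))
            H = sigmoid(H @ W + b)
        return params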
531 The SDA hyper-parameters are the same as for the MLP, with the addition of the 533 The SDA hyper-parameters are the same as for the MLP, with the addition of the
560 %\vspace*{-1mm} 562 %\vspace*{-1mm}
561 %\subsection{SDA vs MLP vs Humans} 563 %\subsection{SDA vs MLP vs Humans}
562 %\vspace*{-1mm} 564 %\vspace*{-1mm}
563 The models are either trained on NIST (MLP0 and SDA0), 565 The models are either trained on NIST (MLP0 and SDA0),
564 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested 566 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested
565 on either NIST, NISTP or P07, either on all 62 classes 567 on either NIST, NISTP or P07, either on the 62-class task
566 or only on the digits (considering only the outputs 568 or on the 10-digit task.
567 associated with digit classes).
568 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, 569 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
569 comparing Humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1, 570 comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
570 SDA2), along with the previous results on the digits NIST special database 571 SDA2), along with the previous results on the digits NIST special database
571 19 test set from the literature, respectively based on ARTMAP neural 572 19 test set from the literature, respectively based on ARTMAP neural
572 networks~\citep{Granger+al-2007}, fast nearest-neighbor 573 networks~\citep{Granger+al-2007}, fast nearest-neighbor
573 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and 574 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and
574 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results 575 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results
577 The deep learner not only outperformed the shallow ones and 578 The deep learner not only outperformed the shallow ones and
578 previously published performance (in a statistically and qualitatively 579 previously published performance (in a statistically and qualitatively
579 significant way) but, when trained with perturbed data, 580 significant way) but, when trained with perturbed data,
580 reaches human performance on both the 62-class task 581 reaches human performance on both the 62-class task
581 and the 10-class (digits) task. 582 and the 10-class (digits) task.
583 The 17\% error (SDA1) or 18\% error (humans) may seem large, but a large
584 majority of the errors made by humans and by SDA1 stem from out-of-context
585 confusions (e.g., a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
586 ``c'' and a ``C'' are often indistinguishable).
582 587
583 \begin{figure}[ht] 588 \begin{figure}[ht]
584 \vspace*{-3mm} 589 \vspace*{-3mm}
585 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} 590 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
586 \vspace*{-3mm} 591 \vspace*{-3mm}
623 all tasks. For the multi-task model, the digit error rate is measured by 628 all tasks. For the multi-task model, the digit error rate is measured by
624 comparing the correct digit class with the output class associated with the 629 comparing the correct digit class with the output class associated with the
625 maximum conditional probability among only the digit class outputs. The 630 maximum conditional probability among only the digit class outputs. The
626 setting is similar for the other two tasks (lower case characters 631 setting is similar for the other two tasks (lower case characters
627 and upper case characters). 632 and upper case characters).
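A short sketch of this restricted-argmax measurement; the slice of the 62 outputs occupied by each task is assumed for illustration, not taken from the paper's code:

    import numpy as np

    def task_error(probs, labels, task_idx):
        # probs: (n, 62) conditional probabilities; task_idx: the output
        # indices belonging to one task (digits, lower case or upper case).
        task_idx = np.asarray(task_idx)
        # Argmax among the task's own outputs only, mapped back to class ids.
        pred = task_idx[probs[:, task_idx].argmax(axis=1)]
        return float((pred != labels).mean())

    # e.g., if the digits occupy the first 10 outputs (hypothetical layout):
    # digit_err = task_error(model_probs, true_digit_labels, np.arange(10))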
628
629 %\vspace*{-1mm} 633 %\vspace*{-1mm}
630 %\subsection{Perturbed Training Data More Helpful for SDA} 634 %\subsection{Perturbed Training Data More Helpful for SDA}
631 %\vspace*{-1mm} 635 %\vspace*{-1mm}
632 636
633 %\vspace*{-1mm} 637 %\vspace*{-1mm}
699 703
700 In the original self-taught learning framework~\citep{RainaR2007}, the 704 In the original self-taught learning framework~\citep{RainaR2007}, the
701 out-of-sample examples were used as a source of unsupervised data, and 705 out-of-sample examples were used as a source of unsupervised data, and
702 experiments showed its positive effects in a \emph{limited labeled data} 706 experiments showed its positive effects in a \emph{limited labeled data}
703 scenario. However, many of the results by \citet{RainaR2007} (who used a 707 scenario. However, many of the results by \citet{RainaR2007} (who used a
704 shallow, sparse coding approach) suggest that the relative gain of self-taught 708 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught
705 learning diminishes as the number of labeled examples increases (essentially, 709 learning vs. ordinary supervised learning} diminishes as the number of labeled examples increases.
706 a ``diminishing returns'' scenario occurs). We note instead that, for deep 710 We note instead that, for deep
707 architectures, our experiments show that such a positive effect is accomplished 711 architectures, our experiments show that such a positive effect is accomplished
708 even in a scenario with a \emph{very large number of labeled examples}. 712 even in a scenario with a \emph{very large number of labeled examples},
713 i.e., the relative gain of self-taught learning is probably preserved
714 in the asymptotic regime.
709 715
710 {\bf Why would deep learners benefit more from the self-taught learning framework}? 716 {\bf Why would deep learners benefit more from the self-taught learning framework}?
711 The key idea is that the lower layers of the predictor compute a hierarchy 717 The key idea is that the lower layers of the predictor compute a hierarchy
712 of features that can be shared across tasks or across variants of the 718 of features that can be shared across tasks or across variants of the
713 input distribution. Intermediate features that can be used in different 719 input distribution. Intermediate features that can be used in different
729 of a deep hierarchy with self-taught learning initializes the 735 of a deep hierarchy with self-taught learning initializes the
730 model in the basin of attraction of supervised gradient descent 736 model in the basin of attraction of supervised gradient descent
731 that corresponds to better generalization. Furthermore, such good 737 that corresponds to better generalization. Furthermore, such good
732 basins of attraction are not discovered by pure supervised learning 738 basins of attraction are not discovered by pure supervised learning
733 (with or without self-taught settings), and more labeled examples 739 (with or without self-taught settings), and more labeled examples
734 does not allow to go from the poorer basins of attraction discovered 740 do not allow the model to go from the poorer basins of attraction discovered
735 by the purely supervised shallow models to the kind of better basins associated 741 by the purely supervised shallow models to the kind of better basins associated
736 with deep learning and self-taught learning. 742 with deep learning and self-taught learning.
737 743
738 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 744 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
739 can be executed on-line at {\tt http://deep.host22.com}. 745 can be executed on-line at {\tt http://deep.host22.com}.