comparison writeup/nips2010_submission.tex @ 550:662299f265ab
suggestions from Ian
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Wed, 02 Jun 2010 15:44:46 -0400 |
parents | ef172f4a322a |
children | 8f365abf171d |
549:ef172f4a322a | 550:662299f265ab |
---|---|
31 corresponding shallow learner}, in the area | 31 corresponding shallow learner}, in the area |
32 of handwritten character recognition. In fact, we show that they reach | 32 of handwritten character recognition. In fact, we show that they reach |
33 human-level performance on both handwritten digit classification and | 33 human-level performance on both handwritten digit classification and |
34 62-class handwritten character recognition. For this purpose we | 34 62-class handwritten character recognition. For this purpose we |
35 developed a powerful generator of stochastic variations and noise | 35 developed a powerful generator of stochastic variations and noise |
36 processes character images, including not only affine transformations but | 36 processes for character images, including not only affine transformations but |
37 also slant, local elastic deformations, changes in thickness, background | 37 also slant, local elastic deformations, changes in thickness, background |
38 images, grey level changes, contrast, occlusion, and various types of pixel and | 38 images, grey level changes, contrast, occlusion, and various types of |
39 spatially correlated noise. The out-of-distribution examples are | 39 noise. The out-of-distribution examples are |
40 obtained by training with these highly distorted images or | 40 obtained from these highly distorted images or |
41 by including object classes different from those in the target test set. | 41 by including examples of object classes different from those in the target test set. |
42 \end{abstract} | 42 \end{abstract} |
43 \vspace*{-2mm} | 43 \vspace*{-2mm} |
44 | 44 |
45 \section{Introduction} | 45 \section{Introduction} |
46 \vspace*{-1mm} | 46 \vspace*{-1mm} |
85 stochastic gradient descent. | 85 stochastic gradient descent. |
86 | 86 |
87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
88 of semi-supervised and multi-task learning: the learner can exploit examples | 88 of semi-supervised and multi-task learning: the learner can exploit examples |
89 that are unlabeled and/or come from a distribution different from the target | 89 that are unlabeled and/or come from a distribution different from the target |
90 distribution, e.g., from other classes that those of interest. | 90 distribution, e.g., from other classes than those of interest. |
91 It has already been shown that deep learners can clearly take advantage of | 91 It has already been shown that deep learners can clearly take advantage of |
92 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | 92 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, |
93 but more needs to be done to explore the impact | 93 but more needs to be done to explore the impact |
94 of {\em out-of-distribution} examples and of the multi-task setting | 94 of {\em out-of-distribution} examples and of the multi-task setting |
95 (one exception is~\citep{CollobertR2008}, but using very different kinds | 95 (one exception is~\citep{CollobertR2008}, which uses very different kinds |
96 of learning algorithms). In particular the {\em relative | 96 of learning algorithms). In particular the {\em relative |
97 advantage} of deep learning for this settings has not been evaluated. | 97 advantage} of deep learning for these settings has not been evaluated. |
98 The hypothesis explored here is that a deep hierarchy of features | 98 The hypothesis explored here is that a deep hierarchy of features |
99 may be better able to provide sharing of statistical strength | 99 may be better able to provide sharing of statistical strength |
100 between different regions in input space or different tasks, | 100 between different regions in input space or different tasks, |
101 as discussed in the conclusion. | 101 as discussed in the conclusion. |
102 | 102 |
118 Do deep architectures {\em benefit more from such out-of-distribution} | 118 Do deep architectures {\em benefit more from such out-of-distribution} |
119 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 119 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
120 | 120 |
121 $\bullet$ %\item | 121 $\bullet$ %\item |
122 Similarly, does the feature learning step in deep learning algorithms benefit more | 122 Similarly, does the feature learning step in deep learning algorithms benefit more |
123 training with similar but different classes (i.e. a multi-task learning scenario) than | 123 from training with moderately different classes (i.e. a multi-task learning scenario) than |
124 a corresponding shallow and purely supervised architecture? | 124 a corresponding shallow and purely supervised architecture? |
125 %\end{enumerate} | 125 %\end{enumerate} |
126 | 126 |
127 Our experimental results provide positive evidence towards all of these questions. | 127 Our experimental results provide positive evidence towards all of these questions. |
128 To achieve these results, we introduce in the next section a sophisticated system | 128 To achieve these results, we introduce in the next section a sophisticated system |
197 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. | 197 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. |
198 Output pixel $(x,y)$ takes the value of input pixel | 198 Output pixel $(x,y)$ takes the value of input pixel |
199 nearest to $(ax+by+c,dx+ey+f)$, | 199 nearest to $(ax+by+c,dx+ey+f)$, |
200 producing scaling, translation, rotation and shearing. | 200 producing scaling, translation, rotation and shearing. |
201 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to | 201 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to |
202 forbid important rotations (not to confuse classes) but to give good | 202 forbid large rotations (not to confuse classes) but to give good |
203 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times | 203 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times |
204 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 | 204 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 |
205 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times | 205 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times |
206 complexity]$. | 206 complexity]$. |
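For concreteness, here is a minimal numpy sketch of this sampling and resampling step. The function name, the 2-D grayscale array convention, and the rounding-to-nearest rule are our own assumptions, not the paper's actual generator code:

```python
import numpy as np

def random_affine(image, complexity, rng=np.random):
    """Sketch: sample the 6 affine parameters from the uniform ranges
    given above, then let output pixel (x, y) take the value of the
    input pixel nearest to (ax+by+c, dx+ey+f)."""
    h, w = image.shape
    a, d = rng.uniform(1 - 3*complexity, 1 + 3*complexity, size=2)
    b, e = rng.uniform(-3*complexity, 3*complexity, size=2)
    c, f = rng.uniform(-4*complexity, 4*complexity, size=2)
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # nearest input pixel (rounding rule is our assumption)
            sx, sy = int(round(a*x + b*y + c)), int(round(d*x + e*y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
```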
207 \vspace*{-1mm} | 207 \vspace*{-1mm} |
238 \vspace*{0.5mm} | 238 \vspace*{0.5mm} |
239 | 239 |
240 {\bf Motion Blur.} | 240 {\bf Motion Blur.} |
241 This is GIMP's ``linear motion blur'' | 241 This is GIMP's ``linear motion blur'' |
242 with parameters $length$ and $angle$. The value of | 242 with parameters $length$ and $angle$. The value of |
243 a pixel in the final image is approximately the mean value of the $length$ first pixels | 243 a pixel in the final image is approximately the mean value of the first $length$ pixels |
244 found by moving in the $angle$ direction. | 244 found by moving in the $angle$ direction. |
245 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. | 245 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. |
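A rough sketch of how such a blur could be implemented follows; since a normal draw can be negative, we take its absolute value for $length$, an assumption the text leaves implicit:

```python
import numpy as np

def motion_blur(image, complexity, rng=np.random):
    """Sketch of a linear motion blur: each output pixel is the mean of
    the first `length` pixels found by stepping in the `angle` direction."""
    angle = rng.uniform(0, 360)                  # degrees
    length = abs(rng.normal(0, 3 * complexity))  # half-normal (assumption)
    n = max(1, int(round(length)))
    dx, dy = np.cos(np.radians(angle)), np.sin(np.radians(angle))
    h, w = image.shape
    out = np.zeros_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            samples = []
            for k in range(n):
                sx, sy = int(round(x + k*dx)), int(round(y + k*dy))
                if 0 <= sx < w and 0 <= sy < h:
                    samples.append(image[sy, sx])
            out[y, x] = np.mean(samples) if samples else image[y, x]
    return out
```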
246 \vspace*{-1mm} | 246 \vspace*{-1mm} |
247 | 247 |
248 {\bf Occlusion.} | 248 {\bf Occlusion.} |
255 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). | 255 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). |
256 This filter is skipped with probability 60\%. | 256 This filter is skipped with probability 60\%. |
257 \vspace*{-1mm} | 257 \vspace*{-1mm} |
258 | 258 |
259 {\bf Pixel Permutation.} | 259 {\bf Pixel Permutation.} |
260 This filter permutes neighbouring pixels. It selects first | 260 This filter permutes neighbouring pixels. It first selects |
261 $\frac{complexity}{3}$ pixels randomly in the image. Each of them is then | 261 a fraction $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then |
262 sequentially exchanged with another pixel in its $V4$ neighbourhood. | 262 sequentially exchanged with another pixel in its $V4$ neighbourhood. |
263 This filter is skipped with probability 80\%. | 263 This filter is skipped with probability 80\%. |
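One possible reading of this filter in numpy; the uniform selection of pixels and the clamping of neighbours at image borders are our assumptions:

```python
import numpy as np

def pixel_permutation(image, complexity, rng=np.random):
    """Sketch: swap a fraction complexity/3 of the pixels, each with a
    randomly chosen 4-connected (V4) neighbour."""
    if rng.uniform() < 0.8:              # filter skipped 80% of the time
        return image
    out = image.copy()
    h, w = image.shape
    k = int((complexity / 3.0) * h * w)  # number of pixels to move
    for _ in range(k):
        y, x = rng.randint(0, h), rng.randint(0, w)
        dy, dx = [(-1, 0), (1, 0), (0, -1), (0, 1)][rng.randint(4)]
        ny, nx = min(max(y + dy, 0), h - 1), min(max(x + dx, 0), w - 1)
        out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
```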
264 \vspace*{-1mm} | 264 \vspace*{-1mm} |
265 | 265 |
266 {\bf Gaussian Noise.} | 266 {\bf Gaussian Noise.} |
267 This filter simply adds, to each pixel of the image independently, a | 267 This filter simply adds, to each pixel of the image independently, a |
268 noise $\sim Normal(0(\frac{complexity}{10})^2)$. | 268 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. |
269 This filter is skipped with probability 70\%. | 269 This filter is skipped with probability 70\%. |
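This filter is simple enough to state directly; a sketch, assuming pixel values in $[0,1]$:

```python
import numpy as np

def gaussian_noise(image, complexity, rng=np.random):
    """Sketch: add i.i.d. Normal(0, (complexity/10)^2) noise per pixel."""
    if rng.uniform() < 0.7:          # the filter is skipped 70% of the time
        return image
    noisy = image + rng.normal(0.0, complexity / 10.0, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # [0,1] pixel range is our assumption
```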
270 \vspace*{-1mm} | 270 \vspace*{-1mm} |
271 | 271 |
272 {\bf Background Images.} | 272 {\bf Background Images.} |
273 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random | 273 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random |
362 %\citep{SorokinAndForsyth2008,whitehill09}. | 362 %\citep{SorokinAndForsyth2008,whitehill09}. |
363 AMT users were presented | 363 AMT users were presented |
364 with 10 character images and asked to choose 10 corresponding ASCII | 364 with 10 character images and asked to choose 10 corresponding ASCII |
365 characters. They were forced to make a hard choice among the | 365 characters. They were forced to make a hard choice among the |
366 62 or 10 character classes (all classes or digits only). | 366 62 or 10 character classes (all classes or digits only). |
367 Three users classified each image, allowing | 367 A total of 2500 images per dataset were classified by XXX subjects, |
368 to estimate inter-human variability. A total 2500 images/dataset were classified. | 368 with 3 subjects classifying each image, allowing |
369 us to estimate inter-human variability (e.g. a standard error of 0.1\% |
370 on the average 18\% error made by humans on the 62-class task). |
369 | 371 |
370 \vspace*{-1mm} | 372 \vspace*{-1mm} |
371 \subsection{Data Sources} | 373 \subsection{Data Sources} |
372 \vspace*{-1mm} | 374 \vspace*{-1mm} |
373 | 375 |
418 %\item | 420 %\item |
419 {\bf OCR data.} | 421 {\bf OCR data.} |
420 A large set (2 million) of scanned, OCRed and manually verified machine-printed | 422 A large set (2 million) of scanned, OCRed and manually verified machine-printed |
421 characters (from various documents and books) were included as an | 423 characters (from various documents and books) were included as an |
422 additional source. This set is part of a larger corpus being collected by the Image Understanding | 424 additional source. This set is part of a larger corpus being collected by the Image Understanding |
423 Pattern Recognition Research group lead by Thomas Breuel at University of Kaiserslautern | 425 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern |
424 ({\tt http://www.iupr.com}), and which will be publicly released. | 426 ({\tt http://www.iupr.com}), and which will be publicly released. |
425 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this | 427 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this |
426 %\end{itemize} | 428 %\end{itemize} |
427 | 429 |
428 \vspace*{-1mm} | 430 \vspace*{-1mm} |
521 \citep{VincentPLarochelleH2008}. During training, a Denoising | 523 \citep{VincentPLarochelleH2008}. During training, a Denoising |
522 Auto-Encoder is presented with a stochastically corrupted version | 524 Auto-Encoder is presented with a stochastically corrupted version |
523 of the input and trained to reconstruct the uncorrupted input, | 525 of the input and trained to reconstruct the uncorrupted input, |
524 forcing the hidden units to represent the leading regularities in | 526 forcing the hidden units to represent the leading regularities in |
525 the data. Once it is trained, in a purely unsupervised way, | 527 the data. Once it is trained, in a purely unsupervised way, |
526 its hidden units activations can | 528 its hidden units' activations can |
527 be used as inputs for training a second one, etc. | 529 be used as inputs for training a second one, etc. |
528 After this unsupervised pre-training stage, the parameters | 530 After this unsupervised pre-training stage, the parameters |
529 are used to initialize a deep MLP, which is fine-tuned by | 531 are used to initialize a deep MLP, which is fine-tuned by |
530 the same standard procedure used to train the MLPs (see previous section). | 532 the same standard procedure used to train the MLPs (see previous section). |
531 The SDA hyper-parameters are the same as for the MLP, with the addition of the | 533 The SDA hyper-parameters are the same as for the MLP, with the addition of the |
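A minimal numpy sketch of this greedy layer-wise pre-training loop, assuming tied weights, masking noise, and a cross-entropy reconstruction loss; the paper's Theano implementation and hyper-parameters may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae_layer(X, n_hidden, corruption, lr=0.1, epochs=5, rng=np.random):
    """One denoising auto-encoder layer: corrupt the input with masking
    noise, then learn to reconstruct the *uncorrupted* input."""
    n_vis = X.shape[1]
    W = rng.normal(0, 0.01, size=(n_vis, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.uniform(size=n_vis) > corruption)  # corrupt
            h = sigmoid(x_tilde @ W + b_h)                        # encode
            x_hat = sigmoid(h @ W.T + b_v)                        # decode (tied W)
            d_v = x_hat - x                     # grad of cross-entropy loss
            d_h = (d_v @ W) * h * (1.0 - h)     # backprop into hidden layer
            W -= lr * (np.outer(x_tilde, d_h) + np.outer(d_v, h))
            b_h -= lr * d_h
            b_v -= lr * d_v
    return W, b_h

def pretrain_sda(X, layer_sizes, corruption):
    """Greedy stacking: each trained layer's hidden activations become
    the training input of the next layer."""
    reps, params = X, []
    for n_hidden in layer_sizes:
        W, b_h = train_dae_layer(reps, n_hidden, corruption)
        params.append((W, b_h))
        reps = sigmoid(reps @ W + b_h)
    return params  # used to initialize the deep MLP before fine-tuning
```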
560 %\vspace*{-1mm} | 562 %\vspace*{-1mm} |
561 %\subsection{SDA vs MLP vs Humans} | 563 %\subsection{SDA vs MLP vs Humans} |
562 %\vspace*{-1mm} | 564 %\vspace*{-1mm} |
563 The models are either trained on NIST (MLP0 and SDA0), | 565 The models are either trained on NIST (MLP0 and SDA0), |
564 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested | 566 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested |
565 on either NIST, NISTP or P07, either on all 62 classes | 567 on either NIST, NISTP or P07, either on the 62-class task |
566 or only on the digits (considering only the outputs | 568 or on the 10-digit task. |
567 associated with digit classes). | |
568 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, | 569 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, |
569 comparing Humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1, | 570 comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1, |
570 SDA2), along with the previous results on the digits NIST special database | 571 SDA2), along with the previous results on the digits NIST special database |
571 19 test set from the literature, based respectively on ARTMAP neural | 572 19 test set from the literature, based respectively on ARTMAP neural |
572 networks~\citep{Granger+al-2007}, fast nearest-neighbor | 573 networks~\citep{Granger+al-2007}, fast nearest-neighbor |
573 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and | 574 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and |
574 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results | 575 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results |
577 The deep learner not only outperformed the shallow ones and | 578 The deep learner not only outperformed the shallow ones and |
578 previously published performance (in a statistically and qualitatively | 579 previously published performance (in a statistically and qualitatively |
579 significant way) but, when trained with perturbed data, | 580 significant way) but, when trained with perturbed data, |
580 reaches human performance on both the 62-class task | 581 reaches human performance on both the 62-class task |
581 and the 10-class (digits) task. | 582 and the 10-class (digits) task. |
583 17\% error (SDA1) or 18\% error (humans) may seem high, but a large |
584 majority of the errors from humans and from SDA1 are out-of-context |
585 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a |
586 ``c'' and a ``C'' are often indistinguishable). |
582 | 587 |
583 \begin{figure}[ht] | 588 \begin{figure}[ht] |
584 \vspace*{-3mm} | 589 \vspace*{-3mm} |
585 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} | 590 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} |
586 \vspace*{-3mm} | 591 \vspace*{-3mm} |
623 all tasks. For the multi-task model, the digit error rate is measured by | 628 all tasks. For the multi-task model, the digit error rate is measured by |
624 comparing the correct digit class with the output class associated with the | 629 comparing the correct digit class with the output class associated with the |
625 maximum conditional probability among the digit-class outputs only. The | 630 maximum conditional probability among the digit-class outputs only. The |
626 setting is similar for the other two target classes (lower case characters | 631 setting is similar for the other two target classes (lower case characters |
627 and upper case characters). | 632 and upper case characters). |
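A sketch of this evaluation rule, assuming (hypothetically) that the 10 digit classes occupy the first 10 of the 62 output units; the function name and array layout are ours:

```python
import numpy as np

def digit_error_rate(probs, digit_labels):
    """Restrict the softmax outputs to the digit classes and take the
    argmax among those classes only, ignoring the letter outputs."""
    digit_probs = probs[:, :10]                  # (n_examples, 10)
    predictions = np.argmax(digit_probs, axis=1)
    return float(np.mean(predictions != digit_labels))
```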
628 | |
629 %\vspace*{-1mm} | 633 %\vspace*{-1mm} |
630 %\subsection{Perturbed Training Data More Helpful for SDA} | 634 %\subsection{Perturbed Training Data More Helpful for SDA} |
631 %\vspace*{-1mm} | 635 %\vspace*{-1mm} |
632 | 636 |
633 %\vspace*{-1mm} | 637 %\vspace*{-1mm} |
699 | 703 |
700 In the original self-taught learning framework~\citep{RainaR2007}, the | 704 In the original self-taught learning framework~\citep{RainaR2007}, the |
701 out-of-sample examples were used as a source of unsupervised data, and | 705 out-of-sample examples were used as a source of unsupervised data, and |
702 experiments showed its positive effects in a \emph{limited labeled data} | 706 experiments showed its positive effects in a \emph{limited labeled data} |
703 scenario. However, many of the results by \citet{RainaR2007} (who used a | 707 scenario. However, many of the results by \citet{RainaR2007} (who used a |
704 shallow, sparse coding approach) suggest that the relative gain of self-taught | 708 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught |
705 learning diminishes as the number of labeled examples increases (essentially, | 709 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. |
706 a ``diminishing returns'' scenario occurs). We note instead that, for deep | 710 We note instead that, for deep |
707 architectures, our experiments show that such a positive effect is accomplished | 711 architectures, our experiments show that such a positive effect is accomplished |
708 even in a scenario with a \emph{very large number of labeled examples}. | 712 even in a scenario with a \emph{very large number of labeled examples}, |
713 i.e., here, the relative gain of self-taught learning is probably preserved | |
714 in the asymptotic regime. | |
709 | 715 |
710 {\bf Why would deep learners benefit more from the self-taught learning framework}? | 716 {\bf Why would deep learners benefit more from the self-taught learning framework}? |
711 The key idea is that the lower layers of the predictor compute a hierarchy | 717 The key idea is that the lower layers of the predictor compute a hierarchy |
712 of features that can be shared across tasks or across variants of the | 718 of features that can be shared across tasks or across variants of the |
713 input distribution. Intermediate features that can be used in different | 719 input distribution. Intermediate features that can be used in different |
729 of a deep hierarchy with self-taught learning initializes the | 735 of a deep hierarchy with self-taught learning initializes the |
730 model in the basin of attraction of supervised gradient descent | 736 model in the basin of attraction of supervised gradient descent |
731 that corresponds to better generalization. Furthermore, such good | 737 that corresponds to better generalization. Furthermore, such good |
732 basins of attraction are not discovered by pure supervised learning | 738 basins of attraction are not discovered by pure supervised learning |
733 (with or without self-taught settings), and more labeled examples | 739 (with or without self-taught settings), and more labeled examples |
734 does not allow to go from the poorer basins of attraction discovered | 740 do not allow the model to go from the poorer basins of attraction discovered |
735 by the purely supervised shallow models to the kind of better basins associated | 741 by the purely supervised shallow models to the kind of better basins associated |
736 with deep learning and self-taught learning. | 742 with deep learning and self-taught learning. |
737 | 743 |
738 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 744 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
739 can be executed on-line at {\tt http://deep.host22.com}. | 745 can be executed on-line at {\tt http://deep.host22.com}. |