Mercurial repository ift6266, comparison of writeup/nips2010_submission.tex @ 529:4354c3c8f49c
"longer conclusion"

author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Tue, 01 Jun 2010 20:48:05 -0400
parents:  07bc0ca8d246
children: 85f2337d47d2 2e33885730cf
compared revisions: 528:f79049b0b847 and 529:4354c3c8f49c

lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
\fi


\vspace*{-1mm}
\section{Conclusions and Discussion}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
% ...
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset} (those we are aware of), {\bf in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.

$\bullet$ %\item
To what extent do self-taught learning scenarios help deep learners,
and do they help them more than shallow supervised ones?
We found that distorted training examples made the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}, and, more importantly (a more novel result),
that deep architectures benefit more from such {\em out-of-distribution}
examples. MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP)
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
when tested on clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.
Similarly, whereas the improvement due to the multi-task setting was marginal or
negative for the MLP (from +5.6\% to -3.6\% relative change),
it was very significant for the SDA (from +13\% to +27\% relative change);
the notion of relative change used in these comparisons is spelled out just after this list.
%\end{itemize}
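
The relative figures quoted in the list above can be read as follows (a minimal
formalization, assuming the relative change is measured against the error rate of
the corresponding baseline trained on clean, single-task data; the exact convention
is the one defined with the results and is not restated in this section):
\[
  \Delta_{\mathrm{rel}} =
  \frac{\mathrm{err}_{\mathrm{baseline}} - \mathrm{err}_{\mathrm{new}}}
       {\mathrm{err}_{\mathrm{baseline}}},
\]
where $\mathrm{err}_{\mathrm{new}}$ denotes the error of the same architecture trained in the
self-taught setting (perturbed data or multiple tasks) and $\mathrm{err}_{\mathrm{baseline}}$ the
error of the clean, single-task baseline. For example, dropping from a hypothetical 10\% error
to 3.5\% error gives $\Delta_{\mathrm{rel}} = (10-3.5)/10 = 65\%$, a relative improvement,
while a negative $\Delta_{\mathrm{rel}}$ is a relative loss.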

In the original self-taught learning framework~\citep{RainaR2007}, the
out-of-sample examples were used as a source of unsupervised data, and
experiments showed the positive effects of this approach in a \emph{limited labeled data}
scenario. However, many of the results by \citet{RainaR2007} (who used a
shallow, sparse coding approach) suggest that the relative gain of self-taught
learning diminishes as the number of labeled examples increases (essentially,
a ``diminishing returns'' scenario occurs). We note instead that, for deep
architectures, our experiments show that such a positive effect is obtained
even in a scenario with a \emph{very large number of labeled examples}.

Why would deep learners benefit more from the self-taught learning framework?
The key idea is that the lower layers of the predictor compute a hierarchy
% ...
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
Therefore, we hypothesize that both depth and unsupervised
pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease apart these factors.
And why would deep learners benefit from the self-taught learning
scenarios even when the number of labeled examples is very large?
We hypothesize that this is related to the hypotheses studied
in~\citet{Erhan+al-2010}, where it was found that online learning
on a huge dataset did not make the advantage of the deep learning
bias vanish; a similar phenomenon may be happening here.
We hypothesize that unsupervised pre-training
of a deep hierarchy with self-taught learning initializes the
model in the basin of attraction of supervised gradient descent
that corresponds to better generalization. Furthermore, such good
basins of attraction are not discovered by pure supervised learning
(with or without self-taught settings), and more labeled examples
do not suffice to move the learner from the poorer basins of attraction discovered
by the purely supervised shallow models to the kind of better basins associated
with deep learning and self-taught learning.
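
A minimal sketch of the kind of training procedure discussed above, written in plain
numpy: greedy layer-wise denoising-autoencoder pre-training of a feature hierarchy,
followed by supervised fine-tuning of the whole stack by gradient descent. The layer
sizes, the 25\% masking noise, the tied weights and the SGD updates below are
illustrative assumptions, not the exact configuration used for the experiments
reported here.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoderLayer:
    def __init__(self, n_in, n_hidden):
        self.W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        self.b = np.zeros(n_hidden)   # hidden-unit biases
        self.c = np.zeros(n_in)       # reconstruction biases (tied weights)

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def pretrain_step(self, x, corruption=0.25, lr=0.1):
        # Corrupt the input by masking a random fraction of its entries,
        # then learn to reconstruct the *clean* input from the corrupted one.
        mask = rng.binomial(1, 1.0 - corruption, x.shape)
        h = sigmoid((x * mask) @ self.W + self.b)
        r = sigmoid(h @ self.W.T + self.c)
        dr = r - x                              # cross-entropy gradient at output
        dh = (dr @ self.W) * h * (1.0 - h)      # back through the sigmoid code
        self.W -= lr * ((x * mask).T @ dh + dr.T @ h) / x.shape[0]
        self.b -= lr * dh.mean(axis=0)
        self.c -= lr * dr.mean(axis=0)

def pretrain_stack(layers, X, epochs=10):
    # Greedy layer-wise unsupervised pre-training: each layer denoises the
    # representation produced by the already-trained layers below it.
    inp = X
    for layer in layers:
        for _ in range(epochs):
            layer.pretrain_step(inp)
        inp = layer.encode(inp)

def finetune_step(layers, W_out, b_out, X, Y, lr=0.1):
    # Supervised fine-tuning: forward pass through the pre-trained stack,
    # softmax output, and backpropagation through every layer.
    acts = [X]
    for layer in layers:
        acts.append(layer.encode(acts[-1]))
    logits = acts[-1] @ W_out + b_out
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    delta = (p - Y) / X.shape[0]                # softmax cross-entropy gradient
    back = delta @ W_out.T                      # propagate before updating W_out
    W_out -= lr * acts[-1].T @ delta
    b_out -= lr * delta.sum(axis=0)
    for layer, a_in, a_out in zip(reversed(layers),
                                  reversed(acts[:-1]), reversed(acts[1:])):
        d = back * a_out * (1.0 - a_out)
        back = d @ layer.W.T
        layer.W -= lr * a_in.T @ d
        layer.b -= lr * d.sum(axis=0)

# Tiny synthetic usage (a stand-in for the real character images).
X = rng.rand(64, 20)
Y = np.eye(3)[rng.randint(0, 3, 64)]
layers = [DenoisingAutoencoderLayer(20, 16), DenoisingAutoencoderLayer(16, 12)]
pretrain_stack(layers, X)
W_out, b_out = rng.uniform(-0.1, 0.1, (12, 3)), np.zeros(3)
for _ in range(50):
    finetune_step(layers, W_out, b_out, X, Y)
\end{verbatim}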

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be run on-line at {\tt http://deep.host22.com}.

\newpage