comparison writeup/nips2010_submission.tex @ 531:85f2337d47d2
merging, praying that I did not delete stuff this time :)
author | Dumitru Erhan <dumitru.erhan@gmail.com> |
---|---|
date | Tue, 01 Jun 2010 18:19:40 -0700 |
parents | 8fe77eac344f 4354c3c8f49c |
children | 22d5cd82d5f0 5157a5830125 |
530:8fe77eac344f | 531:85f2337d47d2 |
---|---|
649 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | 649 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. |
650 \fi | 650 \fi |
651 | 651 |
652 | 652 |
653 \vspace*{-1mm} | 653 \vspace*{-1mm} |
654 \section{Conclusions} | 654 \section{Conclusions and Discussion} |
655 \vspace*{-1mm} | 655 \vspace*{-1mm} |
656 | 656 |
657 We have found that the self-taught learning framework is more beneficial | 657 We have found that the self-taught learning framework is more beneficial |
658 to a deep learner than to a traditional shallow and purely | 658 to a deep learner than to a traditional shallow and purely |
659 supervised learner. More precisely, | 659 supervised learner. More precisely, |
663 $\bullet$ %\item | 663 $\bullet$ %\item |
664 Do the good results previously obtained with deep architectures on the | 664 Do the good results previously obtained with deep architectures on the |
665 MNIST digits generalize to the setting of a much larger and richer (but similar) | 665 MNIST digits generalize to the setting of a much larger and richer (but similar) |
666 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 666 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
667 Yes, the SDA {\bf systematically outperformed the MLP and all the previously | 667 Yes, the SDA {\bf systematically outperformed the MLP and all the previously |
668 published results on this dataset (the one that we are aware of), in fact reaching human-level | 668 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level |
669 performance} at round 17\% error on the 62-class task and 1.4\% on the digits. | 669 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
670 | 670 |
671 $\bullet$ %\item | 671 $\bullet$ %\item |
672 To what extent does the perturbation of input images (e.g. adding | 672 To what extent do self-taught learning scenarios help deep learners, |
673 noise, affine transformations, background images) make the resulting | 673 and do they help them more than shallow supervised ones? |
674 classifier better not only on similarly perturbed images but also on | 674 We found that distorted training examples not only made the resulting |
675 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | 675 classifier better on similarly perturbed images but also on |
676 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 676 the {\em original clean examples}, and, more importantly and as a more novel result, |
677 MLPs were helped by perturbed training examples when tested on perturbed input | 677 that deep architectures benefit more from such {\em out-of-distribution} |
| 678 examples. MLPs were helped by perturbed training examples when tested on perturbed input |
678 images (65\% relative improvement on NISTP) | 679 images (65\% relative improvement on NISTP) |
679 but only marginally helped (5\% relative improvement on all classes) | 680 but only marginally helped (5\% relative improvement on all classes) |
680 or even hurt (10\% relative loss on digits) | 681 or even hurt (10\% relative loss on digits) |
681 with respect to clean examples. On the other hand, the deep SDAs | 682 with respect to clean examples. On the other hand, the deep SDAs |
682 were very significantly boosted by these out-of-distribution examples. | 683 were very significantly boosted by these out-of-distribution examples. |
683 | 684 Similarly, whereas the improvement due to the multi-task setting was marginal or |
684 $\bullet$ %\item | |
685 Similarly, does the feature learning step in deep learning algorithms benefit more | |
686 training with similar but different classes (i.e. a multi-task learning scenario) than | |
687 a corresponding shallow and purely supervised architecture? | |
688 Whereas the improvement due to the multi-task setting was marginal or | |
689 negative for the MLP (from +5.6\% to -3.6\% relative change), | 685 negative for the MLP (from +5.6\% to -3.6\% relative change), |
690 it was very significant for the SDA (from +13\% to +27\% relative change). | 686 it was very significant for the SDA (from +13\% to +27\% relative change). |
691 %\end{itemize} | 687 %\end{itemize} |
692 | 688 |
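The relative figures quoted in the bullets above (65\%, 5\%, 10\%, +5.6\%, -3.6\%, +13\%, +27\%) read as relative changes in test error. The excerpt does not spell out the convention, so the snippet below is a minimal sketch assuming the standard definition; the notation $\epsilon_{\text{baseline}}$, $\epsilon_{\text{new}}$ and the numbers in the comments are illustrative, not taken from the paper.

```latex
% Assumed convention: relative change in test error between two settings
% (e.g. training on clean vs. perturbed data, or single- vs. multi-task).
\[
  \text{relative improvement} \;=\;
  \frac{\epsilon_{\text{baseline}} - \epsilon_{\text{new}}}{\epsilon_{\text{baseline}}}
  \times 100\%
\]
% Illustrative numbers only: an error rate falling from 10\% to 3.5\% is a
% 65\% relative improvement; one rising from 10\% to 11\% is a 10\% relative loss.
```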
693 In the original self-taught learning framework~\citep{RainaR2007}, the | 689 In the original self-taught learning framework~\citep{RainaR2007}, the |
694 out-of-sample examples were used as a source of unsupervised data, and | 690 out-of-sample examples were used as a source of unsupervised data, and |
695 experiments showed its positive effects in a \emph{limited labeled data} | 691 experiments showed its positive effects in a \emph{limited labeled data} |
696 scenario. However, many of the results by \citet{RainaR2007} (who used a | 692 scenario. However, many of the results by \citet{RainaR2007} (who used a |
697 shallow, sparse coding approach) suggest that the relative gain of self-taught | 693 shallow, sparse coding approach) suggest that the relative gain of self-taught |
698 learning diminishes as the number of labeled examples increases (essentially, | 694 learning diminishes as the number of labeled examples increases (essentially, |
699 a ``diminishing returns'' scenario occurs). We note that, for deep | 695 a ``diminishing returns'' scenario occurs). We note instead that, for deep |
700 architectures, our experiments show that such a positive effect is accomplished | 696 architectures, our experiments show that such a positive effect is accomplished |
701 even in a scenario with a \emph{very large number of labeled examples}. | 697 even in a scenario with a \emph{very large number of labeled examples}. |
702 | 698 |
703 Why would deep learners benefit more from the self-taught learning framework? | 699 Why would deep learners benefit more from the self-taught learning framework? |
704 The key idea is that the lower layers of the predictor compute a hierarchy | 700 The key idea is that the lower layers of the predictor compute a hierarchy |
710 increasing the likelihood that they would be useful for a larger array | 706 increasing the likelihood that they would be useful for a larger array |
711 of tasks and input conditions. | 707 of tasks and input conditions. |
712 Therefore, we hypothesize that both depth and unsupervised | 708 Therefore, we hypothesize that both depth and unsupervised |
713 pre-training play a part in explaining the advantages observed here, and future | 709 pre-training play a part in explaining the advantages observed here, and future |
714 experiments could attempt to tease apart these factors. | 710 experiments could attempt to tease apart these factors. |
| 711 And why would deep learners benefit from the self-taught learning |
| 712 scenarios even when the number of labeled examples is very large? |
| 713 We hypothesize that this is related to the hypotheses studied |
| 714 in~\citet{Erhan+al-2010}, where it was found that online learning |
| 715 on a huge dataset did not make the advantage of the deep learning |
| 716 bias vanish; a similar phenomenon may be happening here. |
| 717 We hypothesize that unsupervised pre-training |
| 718 of a deep hierarchy with self-taught learning initializes the |
| 719 model in the basin of attraction of supervised gradient descent |
| 720 that corresponds to better generalization. Furthermore, such good |
| 721 basins of attraction are not discovered by pure supervised learning |
| 722 (with or without self-taught settings), and more labeled examples |
| 723 do not allow the learner to move from the poorer basins of attraction discovered |
| 724 by the purely supervised shallow models to the kind of better basins associated |
| 725 with deep learning and self-taught learning. |
715 | 726 |
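The discussion above hinges on the SDA training scheme: greedy layer-wise unsupervised pre-training with denoising autoencoders, possibly on perturbed self-taught data, followed by supervised fine-tuning of the whole stack. Below is a minimal sketch of that scheme, not the authors' implementation; the PyTorch framework, layer sizes, corruption level, learning rates and epoch counts are all assumptions, and the 32x32 (1024-pixel) toy input size is assumed as well, while the 62 classes mirror the NIST SD19 task mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_layer(data, in_dim, hid_dim, corruption=0.25, epochs=5, lr=0.1):
    """Train one denoising autoencoder on `data`; return its encoder."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        for x in data.split(128):                                   # mini-batches
            noisy = x * (torch.rand_like(x) > corruption).float()   # masking corruption
            recon = torch.sigmoid(dec(torch.sigmoid(enc(noisy))))
            loss = F.binary_cross_entropy(recon, x)                 # reconstruct the clean input
            opt.zero_grad(); loss.backward(); opt.step()
    return enc

def build_sda(unlabeled, layer_sizes):
    """Greedy layer-wise pre-training: each denoising autoencoder is trained
    on the codes produced by the already-trained layers below it."""
    encoders, h = [], unlabeled
    for in_dim, hid_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
        enc = pretrain_layer(h, in_dim, hid_dim)
        encoders.append(enc)
        with torch.no_grad():
            h = torch.sigmoid(enc(h))                               # input for the next layer
    return encoders

def finetune(encoders, x, y, n_classes, epochs=5, lr=0.1):
    """Stack the pre-trained encoders, add a linear output layer, and
    fine-tune the whole network with a cross-entropy (softmax) objective."""
    layers = []
    for enc in encoders:
        layers += [enc, nn.Sigmoid()]
    model = nn.Sequential(*layers, nn.Linear(encoders[-1].out_features, n_classes))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for xb, yb in zip(x.split(128), y.split(128)):
            loss = F.cross_entropy(model(xb), yb)
            opt.zero_grad(); loss.backward(); opt.step()
    return model

# Toy usage with random tensors standing in for 32x32 character images
# (1024 inputs, 62 classes as in NIST SD19).  Perturbed out-of-distribution
# data would be fed to build_sda, labeled data to finetune.
unlabeled_x = torch.rand(1000, 1024)
labeled_x, labeled_y = torch.rand(500, 1024), torch.randint(0, 62, (500,))
sda = finetune(build_sda(unlabeled_x, [1024, 500, 500]), labeled_x, labeled_y, 62)
```

Pre-training only shapes the initial weights; it is the subsequent supervised fine-tuning, started from that initialization, that is hypothesized to land in a better basin of attraction than purely supervised training from a random start.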
716 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 727 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
717 can be executed on-line at {\tt http://deep.host22.com}. | 728 can be executed on-line at {\tt http://deep.host22.com}. |
718 | 729 |
719 \newpage | 730 \newpage |