comparison writeup/nips2010_submission.tex @ 531:85f2337d47d2

merging, praying that I did not delete stuff this time :)
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Tue, 01 Jun 2010 18:19:40 -0700
parents 8fe77eac344f 4354c3c8f49c
children 22d5cd82d5f0 5157a5830125
lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
\fi


\vspace*{-1mm}
\section{Conclusions and Discussion}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
[...]
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset} (those that we are aware of), {\bf in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.

$\bullet$ %\item
To what extent do self-taught learning scenarios help deep learners,
and do they help them more than shallow supervised ones?
We found that distorted training examples not only made the resulting
classifier better on similarly perturbed images but also on
the {\em original clean examples}, and, more importantly (and this is the
more novel finding), that deep architectures benefit more from such
{\em out-of-distribution} examples.
MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP)
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples
(this train-on-perturbed, test-on-clean protocol is sketched below).
Similarly, whereas the improvement due to the multi-task setting was marginal or
negative for the MLP (from +5.6\% to -3.6\% relative change),
it was very significant for the SDA (from +13\% to +27\% relative change).
%\end{itemize}
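
To make the train-on-perturbed, test-on-clean protocol concrete, the following
minimal Python sketch trains one classifier on clean inputs and one on perturbed
inputs, then evaluates both on clean and perturbed test sets. Synthetic random
data and a generic scikit-learn MLP are used purely as stand-ins for NIST/NISTP
and for the models compared in this paper, and the perturbation is plain
additive noise rather than the full pipeline of added noise, affine
transformations, and background images used in our experiments.
\begin{verbatim}
# Sketch of the protocol only: train on clean vs. perturbed inputs,
# then test on both clean and perturbed inputs.  Synthetic data and a
# generic MLP stand in for NIST/NISTP and for the models in the paper.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)

def make_split(n, d=32 * 32, n_classes=62):
    # Random "images" and labels standing in for NIST character data.
    return rng.rand(n, d), rng.randint(n_classes, size=n)

def perturb(X, noise_level=0.3):
    # Stand-in for the real perturbation pipeline: additive Gaussian noise.
    return np.clip(X + noise_level * rng.randn(*X.shape), 0.0, 1.0)

X_train, y_train = make_split(2000)
X_test, y_test = make_split(500)

models = {
    "trained on clean": MLPClassifier((100,), max_iter=50).fit(X_train, y_train),
    "trained on perturbed": MLPClassifier((100,), max_iter=50).fit(perturb(X_train), y_train),
}
for name, model in models.items():
    for test_name, Xt in [("clean", X_test), ("perturbed", perturb(X_test))]:
        print(name, "| tested on", test_name,
              "| error = %.3f" % (1.0 - model.score(Xt, y_test)))
\end{verbatim}
The comparison of interest is the error on the clean test set: it measures
whether training on out-of-distribution examples transfers back to the original
input distribution, which is where the deep and shallow learners differed most
in our experiments.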

In the original self-taught learning framework~\citep{RainaR2007}, the
out-of-sample examples were used as a source of unsupervised data, and
experiments showed its positive effects in a \emph{limited labeled data}
scenario. However, many of the results by \citet{RainaR2007} (who used a
shallow, sparse coding approach) suggest that the relative gain of self-taught
learning diminishes as the number of labeled examples increases (essentially,
a ``diminishing returns'' scenario occurs). We note instead that, for deep
architectures, our experiments show such a positive effect even in a
scenario with a \emph{very large number of labeled examples}.

Why would deep learners benefit more from the self-taught learning framework?
The key idea is that the lower layers of the predictor compute a hierarchy
[...]
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
Therefore, we hypothesize that both depth and unsupervised
pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease apart these factors.
And why would deep learners benefit from the self-taught learning
scenarios even when the number of labeled examples is very large?
We hypothesize that this is related to the hypotheses studied
in~\citet{Erhan+al-2010}, where it was found that online learning on a huge
dataset did not make the advantage of the deep learning bias vanish; a similar
phenomenon may be happening here. We hypothesize that unsupervised pre-training
of a deep hierarchy with self-taught learning initializes the
model in the basin of attraction of supervised gradient descent
that corresponds to better generalization. Furthermore, such good
basins of attraction are not discovered by pure supervised learning
(with or without self-taught settings), and more labeled examples
do not allow the model to move from the poorer basins of attraction discovered
by the purely supervised shallow models to the kind of better basins associated
with deep learning and self-taught learning.
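
As a purely illustrative sketch of what ``unsupervised pre-training of a deep
hierarchy'' refers to, the following NumPy code greedily pre-trains two
denoising auto-encoder layers on unlabeled (and possibly perturbed) inputs;
the resulting weights would then initialize a deep network that is fine-tuned
on the labeled examples. This is not the implementation used for our
experiments: it uses untied weights, squared-error reconstruction, full-batch
updates, and random stand-in data, and it omits the supervised fine-tuning
stage altogether.
\begin{verbatim}
# Simplified sketch of greedy layer-wise denoising auto-encoder pre-training.
# It only illustrates the idea of building a feature hierarchy from
# unlabeled data before any supervised training takes place.
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dae_layer(X, n_hidden, corruption=0.3, lr=0.1, epochs=20):
    """Train one denoising auto-encoder on (unlabeled) inputs X."""
    n, d = X.shape
    W1 = 0.01 * rng.randn(d, n_hidden); b1 = np.zeros(n_hidden)
    W2 = 0.01 * rng.randn(n_hidden, d); b2 = np.zeros(d)
    for _ in range(epochs):
        Xn = X * (rng.rand(*X.shape) > corruption)   # masking corruption
        H = sigmoid(Xn @ W1 + b1)                    # code
        R = sigmoid(H @ W2 + b2)                     # reconstruction
        dR = (R - X) * R * (1 - R) / n               # squared-error gradient
        dH = (dR @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ dR);  b2 -= lr * dR.sum(axis=0)
        W1 -= lr * (Xn.T @ dH); b1 -= lr * dH.sum(axis=0)
    return W1, b1

# Random stand-in for the (possibly perturbed) unlabeled character images.
X = rng.rand(1000, 32 * 32)

# Greedy stacking: each layer is pre-trained on the codes of the previous one.
layers, inp = [], X
for n_hidden in (500, 500):
    W, b = pretrain_dae_layer(inp, n_hidden)
    layers.append((W, b))
    inp = sigmoid(inp @ W + b)

# 'layers' would now initialize the lower layers of a deep classifier,
# which is subsequently fine-tuned on the labeled examples (not shown).
print([W.shape for W, _ in layers])
\end{verbatim}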

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

\newpage