comparison writeup/nips2010_submission.tex @ 529:4354c3c8f49c

longer conclusion
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 20:48:05 -0400
parents 07bc0ca8d246
children 85f2337d47d2 2e33885730cf
--- writeup/nips2010_submission.tex	(528:f79049b0b847)
+++ writeup/nips2010_submission.tex	(529:4354c3c8f49c)
@@ -647,11 +647,11 @@
 lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
 \fi
 
 
 \vspace*{-1mm}
-\section{Conclusions}
+\section{Conclusions and Discussion}
 \vspace*{-1mm}
 
 We have found that the self-taught learning framework is more beneficial
 to a deep learner than to a traditional shallow and purely
 supervised learner. More precisely,
@@ -661,42 +661,38 @@
 $\bullet$ %\item
 Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
 Yes, the SDA {\bf systematically outperformed the MLP and all the previously
-published results on this dataset (the one that we are aware of), in fact reaching human-level
-performance} at round 17\% error on the 62-class task and 1.4\% on the digits.
+published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item
-To what extent does the perturbation of input images (e.g. adding
-noise, affine transformations, background images) make the resulting
-classifier better not only on similarly perturbed images but also on
-the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
-examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
-MLPs were helped by perturbed training examples when tested on perturbed input
+To what extent do self-taught learning scenarios help deep learners,
+and do they help them more than shallow supervised ones?
+We found that distorted training examples not only made the resulting
+classifier better on similarly perturbed images but also on
+the {\em original clean examples}, and, more importantly and as a more novel result,
+that deep architectures benefit more from such {\em out-of-distribution}
+examples. MLPs were helped by perturbed training examples when tested on perturbed input
 images (65\% relative improvement on NISTP)
 but only marginally helped (5\% relative improvement on all classes)
 or even hurt (10\% relative loss on digits)
 with respect to clean examples. On the other hand, the deep SDAs
 were very significantly boosted by these out-of-distribution examples.
-
-$\bullet$ %\item
-Similarly, does the feature learning step in deep learning algorithms benefit more
-training with similar but different classes (i.e. a multi-task learning scenario) than
-a corresponding shallow and purely supervised architecture?
-Whereas the improvement due to the multi-task setting was marginal or
+Similarly, whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP (from +5.6\% to -3.6\% relative change),
 it was very significant for the SDA (from +13\% to +27\% relative change).
 %\end{itemize}
 
 In the original self-taught learning framework~\citep{RainaR2007}, the
 out-of-sample examples were used as a source of unsupervised data, and
 experiments showed its positive effects in a \emph{limited labeled data}
 scenario. However, many of the results by \citet{RainaR2007} (who used a
 shallow, sparse coding approach) suggest that the relative gain of self-taught
 learning diminishes as the number of labeled examples increases (essentially,
-a ``diminishing returns'' scenario occurs). We note that, for deep
+a ``diminishing returns'' scenario occurs). We note instead that, for deep
 architectures, our experiments show that such a positive effect is accomplished
 even in a scenario with a \emph{very large number of labeled examples}.
 
 Why would deep learners benefit more from the self-taught learning framework?
 The key idea is that the lower layers of the predictor compute a hierarchy
@@ -708,10 +704,25 @@
 increasing the likelihood that they would be useful for a larger array
 of tasks and input conditions.
 Therefore, we hypothesize that both depth and unsupervised
 pre-training play a part in explaining the advantages observed here, and future
 experiments could attempt to tease apart these factors.
+And why would deep learners benefit from the self-taught learning
+scenarios even when the number of labeled examples is very large?
+We hypothesize that this is related to the hypotheses studied
+in~\citet{Erhan+al-2010}. There it was found that online learning
+on a huge dataset did not make the advantage of the deep learning
+bias vanish, and a similar phenomenon may be happening here.
+We hypothesize that unsupervised pre-training
+of a deep hierarchy with self-taught learning initializes the
+model in the basin of attraction of supervised gradient descent
+that corresponds to better generalization. Furthermore, such good
+basins of attraction are not discovered by pure supervised learning
+(with or without self-taught settings), and more labeled examples
+do not allow the learner to move from the poorer basins of attraction discovered
+by the purely supervised shallow models to the kind of better basins associated
+with deep learning and self-taught learning.
 
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
 can be executed on-line at {\tt http://deep.host22.com}.
 
 \newpage
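
For readers who want the operational picture behind the SDA discussed in this conclusion: each layer of the stacked denoising auto-encoder is pre-trained, without labels, to reconstruct its input from a stochastically corrupted copy of it, and the full stack is then fine-tuned by supervised gradient descent, whereas the MLP baseline is trained by supervised gradient descent alone. The sketch below states the standard per-layer denoising auto-encoder objective in generic notation (corruption process $q$, weights $W$, $W'$, biases $b$, $b'$, sigmoid $s$); these symbols are illustrative and are not taken verbatim from the paper.

% Illustrative sketch of one layer's pre-training objective (generic notation, not verbatim from the paper):
\begin{align*}
\tilde{x} &\sim q(\tilde{x} \mid x) && \text{stochastic corruption of the input} \\
h &= s(W \tilde{x} + b) && \text{hidden representation (encoder)} \\
\hat{x} &= s(W' h + b') && \text{reconstruction (decoder)} \\
\mathcal{L}(x, \hat{x}) &= -\textstyle\sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log (1 - \hat{x}_i) \right] && \text{reconstruction cross-entropy}
\end{align*}

Each successive layer is pre-trained in this way on the representation produced by the layer below it; the supervised fine-tuning stage that follows is what the conclusions above compare against the purely supervised MLP.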