diff writeup/nips2010_submission.tex @ 531:85f2337d47d2
merging, praying that I did not delete stuff this time :)
author:   Dumitru Erhan <dumitru.erhan@gmail.com>
date:     Tue, 01 Jun 2010 18:19:40 -0700
parents:  8fe77eac344f 4354c3c8f49c
children: 22d5cd82d5f0 5157a5830125
--- a/writeup/nips2010_submission.tex	Tue Jun 01 18:18:01 2010 -0700
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 18:19:40 2010 -0700
@@ -651,7 +651,7 @@
 \vspace*{-1mm}
-\section{Conclusions}
+\section{Conclusions and Discussion}
 \vspace*{-1mm}
 
 We have found that the self-taught learning framework is more beneficial
@@ -665,27 +665,23 @@
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
 Yes, the SDA {\bf systematically outperformed the MLP and all the previously
-published results on this dataset (the one that we are aware of), in fact reaching human-level
-performance} at round 17\% error on the 62-class task and 1.4\% on the digits.
+published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item
-To what extent does the perturbation of input images (e.g. adding
-noise, affine transformations, background images) make the resulting
-classifier better not only on similarly perturbed images but also on
-the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
-examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
-MLPs were helped by perturbed training examples when tested on perturbed input
+To what extent do self-taught learning scenarios help deep learners,
+and do they help them more than shallow supervised ones?
+We found that distorted training examples not only made the resulting
+classifier better on similarly perturbed images but also on
+the {\em original clean examples}, and, more importantly (a novel finding),
+that deep architectures benefit more from such {\em out-of-distribution}
+examples.
 MLPs were helped by perturbed training examples when tested on perturbed input
 images (65\% relative improvement on NISTP) but only marginally helped (5\%
 relative improvement on all classes) or even hurt (10\% relative loss on digits)
 with respect to clean examples. On the other hand, the deep SDAs were very
 significantly boosted by these out-of-distribution examples.
-
-$\bullet$ %\item
-Similarly, does the feature learning step in deep learning algorithms benefit more
-training with similar but different classes (i.e. a multi-task learning scenario) than
-a corresponding shallow and purely supervised architecture?
-Whereas the improvement due to the multi-task setting was marginal or
+Similarly, whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP (from +5.6\% to -3.6\% relative change), it was very
 significant for the SDA (from +13\% to +27\% relative change).
 %\end{itemize}
@@ -695,8 +691,8 @@
 experiments showed its positive effects in a \emph{limited labeled data}
 scenario. However, many of the results by \citet{RainaR2007} (who used a
 shallow, sparse coding approach) suggest that the relative gain of self-taught
-learning diminishes as the number of labeled examples increases (essentially,
-a ``diminishing returns'' scenario occurs). We note that, for deep
+learning diminishes as the number of labeled examples increases (essentially,
+a ``diminishing returns'' scenario occurs). We note instead that, for deep
 architectures, our experiments show that such a positive effect is accomplished
 even in a scenario with a \emph{very large number of labeled examples}.
@@ -712,6 +708,21 @@
 Therefore, we hypothesize that both depth and unsupervised
 pre-training play a part in explaining the advantages observed here, and future
 experiments could attempt at teasing apart these factors.
+And why would deep learners benefit from self-taught learning
+scenarios even when the number of labeled examples is very large?
+We hypothesize that this is related to the hypotheses studied
+in~\citet{Erhan+al-2010}, where it was found that online learning
+on a huge dataset did not make the advantage of the deep learning
+bias vanish; a similar phenomenon may be happening here.
+We conjecture that unsupervised pre-training
+of a deep hierarchy with self-taught learning initializes the
+model in a basin of attraction of supervised gradient descent
+that corresponds to better generalization. Furthermore, such good
+basins of attraction are not discovered by pure supervised learning
+(with or without self-taught settings), and additional labeled examples
+do not suffice to move from the poorer basins of attraction discovered
+by the purely supervised shallow models to the kind of better basins associated
+with deep learning and self-taught learning.
 
 A Flash demo of the recognizer (where both the MLP and the SDA can be
 compared) can be executed on-line at {\tt http://deep.host22.com}.
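For readers unfamiliar with the building block behind the SDA discussed above, the following is a minimal illustrative sketch of pre-training a single denoising-autoencoder layer: the input is corrupted by zero-masking, but the reconstruction target is the clean input, which forces the layer to learn robust features. This is not the paper's actual pipeline (which stacks several such layers on NIST images and then fine-tunes with supervised gradient descent); the toy data, dimensions, and function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_step(X, W, b_hid, b_vis, corruption=0.3, lr=0.1):
    """One full-batch gradient step of a denoising autoencoder with tied
    weights; returns the mean squared reconstruction error on clean X."""
    mask = (rng.random(X.shape) > corruption).astype(X.dtype)
    X_tilde = X * mask                    # zero-masking corruption
    H = sigmoid(X_tilde @ W + b_hid)      # encode the corrupted input
    R = sigmoid(H @ W.T + b_vis)          # decode with tied weights W^T
    G_R = (R - X) * R * (1.0 - R)         # grad wrt decoder pre-activation
    G_H = (G_R @ W) * H * (1.0 - H)       # grad wrt encoder pre-activation
    n = X.shape[0]
    # Tied weights: the gradient has an encoder term and a decoder term.
    W -= lr * (X_tilde.T @ G_H + G_R.T @ H) / n
    b_hid -= lr * G_H.mean(axis=0)
    b_vis -= lr * G_R.mean(axis=0)
    return float(np.mean((R - X) ** 2))

# Hypothetical low-rank toy data standing in for image rows in [0, 1],
# so there is structure for the layer to discover.
Z = rng.normal(size=(200, 8))
A = rng.normal(size=(8, 64))
X = sigmoid(Z @ A)

n_visible, n_hidden = 64, 32
W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
b_hid = np.zeros(n_hidden)
b_vis = np.zeros(n_visible)

errors = [dae_step(X, W, b_hid, b_vis) for _ in range(100)]
```

In a stacked setup, the hidden codes `sigmoid(X @ W + b_hid)` of the clean input would become the training data for the next layer, and the learned weights would then initialize the supervised network, which is the mechanism the basin-of-attraction hypothesis above refers to.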