# HG changeset patch
# User Yoshua Bengio
# Date 1275439685 14400
# Node ID 4354c3c8f49c16b1839a366f989d95f4e511c223
# Parent f79049b0b8478f470e1a92f3a604e2a90374bbec
longer conclusion

diff -r f79049b0b847 -r 4354c3c8f49c writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex Tue Jun 01 20:23:59 2010 -0400
+++ b/writeup/nips2010_submission.tex Tue Jun 01 20:48:05 2010 -0400
@@ -649,7 +649,7 @@
 \vspace*{-1mm}
-\section{Conclusions}
+\section{Conclusions and Discussion}
 \vspace*{-1mm}
 We have found that the self-taught learning framework is more beneficial
@@ -663,27 +663,23 @@
 MNIST digits generalize to the setting of a much larger and richer (but similar) dataset, the NIST special database 19, with 62 classes and around 800k examples? Yes, the SDA {\bf systematically outperformed the MLP and all the previously
-published results on this dataset (the one that we are aware of), in fact reaching human-level
-performance} at round 17\% error on the 62-class task and 1.4\% on the digits.
+published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 $\bullet$ %\item
-To what extent does the perturbation of input images (e.g. adding
-noise, affine transformations, background images) make the resulting
-classifier better not only on similarly perturbed images but also on
-the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
-examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
-MLPs were helped by perturbed training examples when tested on perturbed input
+To what extent do self-taught learning scenarios help deep learners,
+and do they help them more than shallow supervised ones?
+We found that distorted training examples made the resulting
+classifier better not only on similarly perturbed images but also on
+the {\em original clean examples} and, as a more important and more novel finding,
+that deep architectures benefit more from such {\em out-of-distribution}
+examples. MLPs were helped by perturbed training examples when tested on perturbed input
 images (65\% relative improvement on NISTP) but only marginally helped (5\% relative improvement on all classes) or even hurt (10\% relative loss on digits) with respect to clean examples . On the other hand, the deep SDAs were very significantly boosted by these out-of-distribution examples.
-
-$\bullet$ %\item
-Similarly, does the feature learning step in deep learning algorithms benefit more
-training with similar but different classes (i.e. a multi-task learning scenario) than
-a corresponding shallow and purely supervised architecture?
-Whereas the improvement due to the multi-task setting was marginal or
+Similarly, whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP (from +5.6\% to -3.6\% relative change), it was very significant for the SDA (from +13\% to +27\% relative change).
 %\end{itemize}
@@ -694,7 +690,7 @@
 scenario. However, many of the results by \citet{RainaR2007} (who used a shallow, sparse coding approach) suggest that the relative gain of self-taught learning diminishes as the number of labeled examples increases, (essentially,
-a ``diminishing returns'' scenario occurs). We note that, for deep
+a ``diminishing returns'' scenario occurs). We note instead that, for deep
 architectures, our experiments show that such a positive effect is accomplished even in a scenario with a \emph{very large number of labeled examples}.
@@ -710,6 +706,21 @@
 Therefore, we hypothesize that both depth and unsupervised pre-training play a part in explaining the advantages observed here, and future experiments could attempt at teasing apart these factors.
+And why would deep learners benefit from the self-taught learning
+scenarios even when the number of labeled examples is very large?
+We hypothesize that this is related to the hypotheses studied
+in~\citet{Erhan+al-2010}. Just as~\citet{Erhan+al-2010} found
+that online learning on a huge dataset did not make the
+advantage of the deep learning bias vanish, a similar phenomenon
+may be happening here. We hypothesize that unsupervised pre-training
+of a deep hierarchy with self-taught learning initializes the
+model in a basin of attraction of supervised gradient descent
+that corresponds to better generalization. Furthermore, such good
+basins of attraction are not discovered by pure supervised learning
+(with or without self-taught settings), and more labeled examples
+do not allow the learner to move from the poorer basins of attraction discovered
+by the purely supervised shallow models to the kind of better basins associated
+with deep learning and self-taught learning.
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) can be executed on-line at {\tt http://deep.host22.com}.