# HG changeset patch
# User Yoshua Bengio <bengioy@iro.umontreal.ca>
# Date 1275439685 14400
# Node ID 4354c3c8f49c16b1839a366f989d95f4e511c223
# Parent  f79049b0b8478f470e1a92f3a604e2a90374bbec
longer conclusion

diff -r f79049b0b847 -r 4354c3c8f49c writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Tue Jun 01 20:23:59 2010 -0400
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 20:48:05 2010 -0400
@@ -649,7 +649,7 @@
 
 
 \vspace*{-1mm}
-\section{Conclusions}
+\section{Conclusions and Discussion}
 \vspace*{-1mm}
 
 We have found that the self-taught learning framework is more beneficial
@@ -663,27 +663,23 @@
 MNIST digits generalize to the setting of a much larger and richer (but similar)
 dataset, the NIST special database 19, with 62 classes and around 800k examples?
 Yes, the SDA {\bf systematically outperformed the MLP and all the previously
-published results on this dataset (the one that we are aware of), in fact reaching human-level
-performance} at round 17\% error on the 62-class task and 1.4\% on the digits.
+published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
+performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
 
 $\bullet$ %\item 
-To what extent does the perturbation of input images (e.g. adding
-noise, affine transformations, background images) make the resulting
-classifier better not only on similarly perturbed images but also on
-the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
-examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
-MLPs were helped by perturbed training examples when tested on perturbed input 
+To what extent do self-taught learning scenarios help deep learners,
+and do they help them more than shallow supervised ones?
+We found that distorted training examples made the resulting
+classifier better not only on similarly perturbed images but also on
+the {\em original clean examples}, and, more importantly (and this is the more novel finding),
+that deep architectures benefit more from such {\em out-of-distribution}
+examples. MLPs were helped by perturbed training examples when tested on perturbed input 
 images (65\% relative improvement on NISTP) 
 but only marginally helped (5\% relative improvement on all classes) 
 or even hurt (10\% relative loss on digits)
 with respect to clean examples . On the other hand, the deep SDAs
 were very significantly boosted by these out-of-distribution examples.
-
-$\bullet$ %\item 
-Similarly, does the feature learning step in deep learning algorithms benefit more 
-training with similar but different classes (i.e. a multi-task learning scenario) than
-a corresponding shallow and purely supervised architecture?
-Whereas the improvement due to the multi-task setting was marginal or
+Similarly, whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP (from +5.6\% to -3.6\% relative change), 
 it was very significant for the SDA (from +13\% to +27\% relative change).
 %\end{itemize}
@@ -694,7 +690,7 @@
 scenario. However, many of the results by \citet{RainaR2007} (who used a
 shallow, sparse coding approach) suggest that the relative gain of self-taught
 learning diminishes as the number of labeled examples increases, (essentially,
-a ``diminishing returns'' scenario occurs).  We note that, for deep
+a ``diminishing returns'' scenario occurs).  We note instead that, for deep
 architectures, our experiments show that such a positive effect is accomplished
 even in a scenario with a \emph{very large number of labeled examples}.
 
@@ -710,6 +706,21 @@
 Therefore, we hypothesize that both depth and unsupervised
 pre-training play a part in explaining the advantages observed here, and future
 experiments could attempt at teasing apart these factors.
+And why would deep learners benefit from the self-taught learning
+scenarios even when the number of labeled examples is very large?
+We hypothesize that this is related to the hypotheses studied
+by~\citet{Erhan+al-2010}. Just as \citet{Erhan+al-2010} found
+that online learning on a huge dataset did not make the
+advantage of the deep learning bias vanish, a similar phenomenon
+may be happening here. We hypothesize that unsupervised pre-training
+of a deep hierarchy with self-taught learning initializes the
+model in a basin of attraction of supervised gradient descent
+that corresponds to better generalization. Furthermore, such good
+basins of attraction are not discovered by pure supervised learning
+(with or without self-taught settings), and more labeled examples
+do not allow the learner to move from the poorer basins of attraction discovered
+by the purely supervised shallow models to the kind of better basins associated
+with deep learning and self-taught learning.
 
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 
 can be executed on-line at {\tt http://deep.host22.com}.