comparison writeup/nips2010_submission.tex @ 531:85f2337d47d2
merging, praying that I did not delete stuff this time :)
author | Dumitru Erhan <dumitru.erhan@gmail.com> |
---|---|
date | Tue, 01 Jun 2010 18:19:40 -0700 |
parents | 8fe77eac344f 4354c3c8f49c |
children | 22d5cd82d5f0 5157a5830125 |
530:8fe77eac344f | 531:85f2337d47d2 |
---|---|
649 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | 649 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. |
650 \fi | 650 \fi |
651 | 651 |
652 | 652 |
653 \vspace*{-1mm} | 653 \vspace*{-1mm} |
654 \section{Conclusions} | 654 \section{Conclusions and Discussion} |
655 \vspace*{-1mm} | 655 \vspace*{-1mm} |
656 | 656 |
657 We have found that the self-taught learning framework is more beneficial | 657 We have found that the self-taught learning framework is more beneficial |
658 to a deep learner than to a traditional shallow and purely | 658 to a deep learner than to a traditional shallow and purely |
659 supervised learner. More precisely, | 659 supervised learner. More precisely, |
663 $\bullet$ %\item | 663 $\bullet$ %\item |
664 Do the good results previously obtained with deep architectures on the | 664 Do the good results previously obtained with deep architectures on the |
665 MNIST digits generalize to the setting of a much larger and richer (but similar) | 665 MNIST digits generalize to the setting of a much larger and richer (but similar) |
666 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 666 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
667 Yes, the SDA {\bf systematically outperformed the MLP and all the previously | 667 Yes, the SDA {\bf systematically outperformed the MLP and all the previously |
668 published results on this dataset (the one that we are aware of), in fact reaching human-level | 668 published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level |
669 performance} at round 17\% error on the 62-class task and 1.4\% on the digits. | 669 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
670 | 670 |
671 $\bullet$ %\item | 671 $\bullet$ %\item |
672 To what extent does the perturbation of input images (e.g. adding | 672 To what extent do self-taught learning scenarios help deep learners, |
673 noise, affine transformations, background images) make the resulting | 673 and do they help them more than shallow supervised ones? |
674 classifier better not only on similarly perturbed images but also on | 674 We found that distorted training examples not only made the resulting |
675 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | 675 classifier better on similarly perturbed images but also on |
676 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 676 the {\em original clean examples}, and, more importantly and as a more novel result, |
677 MLPs were helped by perturbed training examples when tested on perturbed input | 677 that deep architectures benefit more from such {\em out-of-distribution} |
| 678 examples. MLPs were helped by perturbed training examples when tested on perturbed input |
678 images (65\% relative improvement on NISTP) | 679 images (65\% relative improvement on NISTP) |
679 but only marginally helped (5\% relative improvement on all classes) | 680 but only marginally helped (5\% relative improvement on all classes) |
680 or even hurt (10\% relative loss on digits) | 681 or even hurt (10\% relative loss on digits) |
681 with respect to clean examples. On the other hand, the deep SDAs | 682 with respect to clean examples. On the other hand, the deep SDAs |
682 were very significantly boosted by these out-of-distribution examples. | 683 were very significantly boosted by these out-of-distribution examples. |
683 | 684 Similarly, whereas the improvement due to the multi-task setting was marginal or |
684 $\bullet$ %\item | |
685 Similarly, does the feature learning step in deep learning algorithms benefit more | |
686 training with similar but different classes (i.e. a multi-task learning scenario) than | |
687 a corresponding shallow and purely supervised architecture? | |
688 Whereas the improvement due to the multi-task setting was marginal or | |
689 negative for the MLP (from +5.6\% to -3.6\% relative change), | 685 negative for the MLP (from +5.6\% to -3.6\% relative change), |
690 it was very significant for the SDA (from +13\% to +27\% relative change). | 686 it was very significant for the SDA (from +13\% to +27\% relative change). |
691 %\end{itemize} | 687 %\end{itemize} |
692 | 688 |
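The relative figures quoted in the bullets above (65\%, 5\%, 10\%, +5.6\%, -3.6\%, +13\%, +27\%) read as relative changes in test error. The excerpt does not spell out the convention, so the snippet below is a minimal sketch assuming the standard definition; the notation $\epsilon_{\text{baseline}}$, $\epsilon_{\text{new}}$ and the numbers in the comments are illustrative, not taken from the paper.

```latex
% Assumed convention: relative change in test error between two settings
% (e.g. training on clean vs. perturbed data, or single- vs. multi-task).
\[
  \text{relative improvement} \;=\;
  \frac{\epsilon_{\text{baseline}} - \epsilon_{\text{new}}}{\epsilon_{\text{baseline}}}
  \times 100\%
\]
% Illustrative numbers only: an error rate falling from 10\% to 3.5\% is a
% 65\% relative improvement; one rising from 10\% to 11\% is a 10\% relative loss.
```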
693 In the original self-taught learning framework~\citep{RainaR2007}, the | 689 In the original self-taught learning framework~\citep{RainaR2007}, the |
694 out-of-sample examples were used as a source of unsupervised data, and | 690 out-of-sample examples were used as a source of unsupervised data, and |
695 experiments showed its positive effects in a \emph{limited labeled data} | 691 experiments showed its positive effects in a \emph{limited labeled data} |
696 scenario. However, many of the results by \citet{RainaR2007} (who used a | 692 scenario. However, many of the results by \citet{RainaR2007} (who used a |
697 shallow, sparse coding approach) suggest that the relative gain of self-taught | 693 shallow, sparse coding approach) suggest that the relative gain of self-taught |
698 learning diminishes as the number of labeled examples increases (essentially, | 694 learning diminishes as the number of labeled examples increases (essentially, |
699 a ``diminishing returns'' scenario occurs). We note that, for deep | 695 a ``diminishing returns'' scenario occurs). We note instead that, for deep |
700 architectures, our experiments show that such a positive effect is accomplished | 696 architectures, our experiments show that such a positive effect is accomplished |
701 even in a scenario with a \emph{very large number of labeled examples}. | 697 even in a scenario with a \emph{very large number of labeled examples}. |
702 | 698 |
703 Why would deep learners benefit more from the self-taught learning framework? | 699 Why would deep learners benefit more from the self-taught learning framework? |
704 The key idea is that the lower layers of the predictor compute a hierarchy | 700 The key idea is that the lower layers of the predictor compute a hierarchy |
710 increasing the likelihood that they would be useful for a larger array | 706 increasing the likelihood that they would be useful for a larger array |
711 of tasks and input conditions. | 707 of tasks and input conditions. |
712 Therefore, we hypothesize that both depth and unsupervised | 708 Therefore, we hypothesize that both depth and unsupervised |
713 pre-training play a part in explaining the advantages observed here, and future | 709 pre-training play a part in explaining the advantages observed here, and future |
714 experiments could attempt to tease apart these factors. | 710 experiments could attempt to tease apart these factors. |
| 711 And why would deep learners benefit from the self-taught learning |
| 712 scenarios even when the number of labeled examples is very large? |
| 713 We hypothesize that this is related to the hypotheses studied |
| 714 in~\citet{Erhan+al-2010}, where it was found that online learning |
| 715 on a huge dataset did not make the advantage of the deep learning |
| 716 bias vanish; a similar phenomenon may be happening here. |
| 717 We hypothesize that unsupervised pre-training |
| 718 of a deep hierarchy with self-taught learning initializes the |
| 719 model in the basin of attraction of supervised gradient descent |
| 720 that corresponds to better generalization. Furthermore, such good |
| 721 basins of attraction are not discovered by pure supervised learning |
| 722 (with or without self-taught settings), and more labeled examples |
| 723 do not allow the learner to move from the poorer basins of attraction discovered |
| 724 by the purely supervised shallow models to the kind of better basins associated |
| 725 with deep learning and self-taught learning. |
715 | 726 |
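The discussion above hinges on the SDA training scheme: greedy layer-wise unsupervised pre-training with denoising autoencoders, possibly on perturbed self-taught data, followed by supervised fine-tuning of the whole stack. Below is a minimal sketch of that scheme, not the authors' implementation; the PyTorch framework, layer sizes, corruption level, learning rates and epoch counts are all assumptions, and the 32x32 (1024-pixel) toy input size is assumed as well, while the 62 classes mirror the NIST SD19 task mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_layer(data, in_dim, hid_dim, corruption=0.25, epochs=5, lr=0.1):
    """Train one denoising autoencoder on `data`; return its encoder."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        for x in data.split(128):                                   # mini-batches
            noisy = x * (torch.rand_like(x) > corruption).float()   # masking corruption
            recon = torch.sigmoid(dec(torch.sigmoid(enc(noisy))))
            loss = F.binary_cross_entropy(recon, x)                 # reconstruct the clean input
            opt.zero_grad(); loss.backward(); opt.step()
    return enc

def build_sda(unlabeled, layer_sizes):
    """Greedy layer-wise pre-training: each denoising autoencoder is trained
    on the codes produced by the already-trained layers below it."""
    encoders, h = [], unlabeled
    for in_dim, hid_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
        enc = pretrain_layer(h, in_dim, hid_dim)
        encoders.append(enc)
        with torch.no_grad():
            h = torch.sigmoid(enc(h))                               # input for the next layer
    return encoders

def finetune(encoders, x, y, n_classes, epochs=5, lr=0.1):
    """Stack the pre-trained encoders, add a linear output layer, and
    fine-tune the whole network with a cross-entropy (softmax) objective."""
    layers = []
    for enc in encoders:
        layers += [enc, nn.Sigmoid()]
    model = nn.Sequential(*layers, nn.Linear(encoders[-1].out_features, n_classes))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for xb, yb in zip(x.split(128), y.split(128)):
            loss = F.cross_entropy(model(xb), yb)
            opt.zero_grad(); loss.backward(); opt.step()
    return model

# Toy usage with random tensors standing in for 32x32 character images
# (1024 inputs, 62 classes as in NIST SD19).  Perturbed out-of-distribution
# data would be fed to build_sda, labeled data to finetune.
unlabeled_x = torch.rand(1000, 1024)
labeled_x, labeled_y = torch.rand(500, 1024), torch.randint(0, 62, (500,))
sda = finetune(build_sda(unlabeled_x, [1024, 500, 500]), labeled_x, labeled_y, 62)
```

Pre-training only shapes the initial weights; it is the subsequent supervised fine-tuning, started from that initialization, that is hypothesized to land in a better basin of attraction than purely supervised training from a random start.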
716 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 727 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
717 can be executed on-line at {\tt http://deep.host22.com}. | 728 can be executed on-line at {\tt http://deep.host22.com}. |
718 | 729 |
719 \newpage | 730 \newpage |