changeset 624:49933073590c

added jmlr_review1.txt and jmlr_review2.txt
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 13 Mar 2011 18:25:25 -0400
parents d44c78c90669
children 128bc92897f2 249a180795e3
files writeup/aistats2011_revised.tex writeup/aistats_review_response.txt writeup/jmlr_review1.txt writeup/jmlr_review2.txt
diffstat 4 files changed, 121 insertions(+), 73 deletions(-) [+]
--- a/writeup/aistats2011_revised.tex	Sun Jan 09 22:00:39 2011 -0500
+++ b/writeup/aistats2011_revised.tex	Sun Mar 13 18:25:25 2011 -0400
@@ -24,7 +24,8 @@
 \aistatstitle{Deep Learners Benefit More from Out-of-Distribution Examples}
 \runningtitle{Deep Learners for Out-of-Distribution Examples}
 \runningauthor{Bengio et al.}
-\aistatsauthor{Anonymous Authors}]
+\aistatsauthor{Anonymous Authors\\
+\vspace*{5mm}}]
 \iffalse
 Yoshua  Bengio \and
 Frédéric  Bastien \and
@@ -55,7 +56,7 @@
 
 %{\bf Running title: Deep Self-Taught Learning}
 
-%\vspace*{-2mm}
+\vspace*{5mm}
 \begin{abstract}
   Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits), using both a multi-task setting and perturbed examples in order to obtain out-of-distribution examples. The results agree with the hypothesis, and show that a deep learner did {\em beat previously published results and reached human-level performance}.
 \end{abstract}
@@ -297,7 +298,7 @@
 is 18.2\%, with a standard error of 0.1\%.
 We controlled noise in the labelling process by (1)
 requiring AMT workers with a higher than normal average of accepted
-responses (>95\%) on other tasks (2) discarding responses that were not
+responses ($>$95\%) on other tasks (2) discarding responses that were not
 complete (10 predictions) (3) discarding responses for which the
 time to predict was smaller than 3 seconds for NIST (the mean response time
 was 20 seconds) and 6 seconds for NISTP (average response time of
@@ -497,8 +498,13 @@
 separate learning rate for the unsupervised pre-training stage (selected
 from the same above set). The fraction of inputs corrupted was selected
 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
-of hidden layers but it was fixed to 3 based on previous work with
-SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. The size of the hidden
+of hidden layers but it was fixed to 3 for most experiments,
+based on previous work with
+SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. 
+We also compared against 1 and 2 hidden layers, in order
+to disentangle the effect of depth from the effect of unsupervised
+pre-training.
+The size of the hidden
 layers was kept constant across hidden layers, and the best results
 were obtained with the largest values that we could experiment
 with given our patience, with 1000 hidden units.
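The hyper-parameter search just described can be pictured with a small Python sketch. The learning-rate values below are placeholders (the actual candidate set is given earlier in the paper and is not shown in this hunk); the corruption levels, depths, and layer size follow the text above.

from itertools import product

# Placeholder learning rates: the actual candidate set is defined earlier in the paper.
learning_rates = [0.1, 0.01, 0.001]
pretrain_learning_rates = [0.1, 0.01, 0.001]   # separate rate for unsupervised pre-training
corruption_levels = [0.10, 0.20, 0.50]         # fraction of corrupted inputs
depths = [1, 2, 3]                             # 3 hidden layers used for most experiments
layer_size = 1000                              # hidden units per layer, constant across layers

grid = [
    {"lr": lr, "pretrain_lr": plr, "corruption": c, "layers": [layer_size] * d}
    for lr, plr, c, d in product(learning_rates, pretrain_learning_rates,
                                 corruption_levels, depths)
]
print(len(grid), "candidate SDA configurations")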
@@ -567,6 +573,16 @@
 majority of the errors from humans and from SDA1 are from out-of-context
 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
 ``c'' and a ``C'' are often indistinguishable).
+Regarding shallower networks pre-trained with unsupervised denoising
+auto-encoders, we find that the NIST test error is 21\% with one hidden
+layer and 20\% with two hidden layers (vs 17\% in the same conditions
+with 3 hidden layers). Compare this with the 23\% error achieved
+by the MLP, i.e. a single hidden layer and no unsupervised pre-training.
+As found in previous work~\citep{Erhan+al-2010,Larochelle-jmlr-2009},
+these results show that both depth and
+unsupervised pre-training need to be combined in order to achieve
+the best results.
+
 
 In addition, as shown in the left of
 Figure~\ref{fig:improvements-charts}, the relative improvement in error
--- a/writeup/aistats_review_response.txt	Sun Jan 09 22:00:39 2011 -0500
+++ b/writeup/aistats_review_response.txt	Sun Mar 13 18:25:25 2011 -0400
@@ -1,22 +1,15 @@
 
 We thank the reviewers for their thoughtful comments. Please find our responses below.
 
-* Comparisons with shallower networks, but using unsupervised pre-training:
-e will add those results to the paper. Previous work in our group with
-very similar data (the InfiniteMNIST dataset were published in JMLR in 2010
-"Why Does Unsupervised Pre-training Help Deep Learning?"). The results indeed
-show improvement when going from 1 to 2 and then 3 layers, even when using
-unsupervised pre-training (RBM or Denoising Auto-Encoder).
+* Comparisons with shallower networks, but using unsupervised pre-training. We have added those results to the paper. On the NIST test set, 62 classes,
+using NISTP to train (which gives the best results on NIST):
+  MLP (1 hidden layer, no unsupervised pre-training): 24% error
+  DA  (1 hidden layer, unsupervised pre-training):    21% error
+  SDA (2 hidden layers, unsupervised pre-training):   20% error
+  SDA (3 hidden layers, unsupervised pre-training):   17% error
+Previous work in our group on very similar data (the InfiniteMNIST dataset) was published in JMLR in 2010 ("Why Does Unsupervised Pre-training Help Deep Learning?"). Those results already show an improvement when going from 1 to 2 and then 3 layers, even when using unsupervised pre-training (RBM or Denoising Auto-Encoder). The new experiment helps to disentangle, to some extent, the effect of depth from the effect of unsupervised pre-training, and confirms that both are required to achieve the best results.
 
-* Comparisons with SVMs. We have tried several kinds of SVMs. The main limitation
-of course is the size of the training set. One option is to use a non-linear SVM
-with a reduced training set, and the other is to use an online linear SVM.
-Another option we have considered is to project the input non-linearly in a
-high-dimensional but sparse representation and then use an online linear SVM on that space.
-For this experiment we have thresholded input pixel gray levels considered a
-low-order polynomial expansion (e.g. only looking at pairs of non-zero pixels).
-We have obtained the following results until now, all substantially worse than those
-obtained with the MLP and deep nets. 
+* Comparisons with SVMs. The main limitation, of course, is the size of the training set. One option is to use a non-linear SVM with a reduced training set; another is to use an online linear SVM. A third option is to project the input non-linearly into a high-dimensional but sparse representation and then use an online linear SVM in that space. For this, we thresholded the input pixel gray levels and projected into the space of order-2 products (a sketch of this feature construction is given after the table below). Results:
 
 SVM type   training set   input               online    validation test set
             type / size   features            training  set error    error
@@ -27,61 +20,12 @@
 Linear SVM,  NISTP, 800k,  sparse quadratic,   81.76%,  83.69%,     85.56%
 RBF SVM,     NISTP, 100k,  original,           74.73%,  56.57%,     64.22%
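As a rough illustration of the feature construction referred to above (threshold the gray levels, then take order-2 products of the non-zero pixels as sparse binary features for an online linear SVM), here is a minimal numpy sketch. The threshold, the hashed feature-space size, and the hash function are illustrative assumptions, not the exact setup behind the numbers in the table.

import numpy as np

def sparse_quadratic_features(image, threshold=0.5, n_features=2**20):
    """Binarize gray levels, then hash order-2 products of non-zero pixels
    into a fixed-size sparse binary feature vector (hash collisions ignored)."""
    x = (image.ravel() > threshold)
    on = np.nonzero(x)[0]
    idx = set(int(i) for i in on)          # order-1 (linear) terms
    for a in range(len(on)):
        for b in range(a + 1, len(on)):    # order-2 terms: pairs of non-zero pixels
            idx.add(int(on[a] * 7919 + on[b]) % n_features)
    return np.fromiter(sorted(idx), dtype=np.int64)

rng = np.random.RandomState(0)
img = rng.rand(32, 32)                     # toy stand-in for a character image
print(sparse_quadratic_features(img).size, "active features")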
 
-The best results were obtained with the sparse quadratic input features, and
-training on the CLEAN data (NIST) rather than the perturbed data (NISTP). 
-A summary of the above results was added to the revised paper.
+The best results were obtained with the sparse quadratic input features, and training on the clean data (NIST) rather than the perturbed data (NISTP).  A summary of the above results was added to the revised paper.
 
 
-* Using distorted characters as the corruption process of the Denoising
-Auto-Encoder (DAE). We had already performed preliminary experiments with this idea
-and it did not work very well (in fact it depends on the kind of distortion
-considered), i.e., it did not improve on the simpler forms of noise we used
-for the AISTATS submission.  We have several interpretations for this, which should
-probably go (along with more extensive simulations) into another paper.
-The main interpretation for those results is that the DAE learns good
-features by being given as target (to reconstruct) a pattern of higher
-density (according to the unknown, underlying generating distribution) than
-the network input. This is how it gets to know where the density should
-concentrate. Hence distortions that are *plausible* in the input distribution
-(such as translation, rotation, scaling, etc.) are not very useful, whereas
-corruption due to a form of noise are useful. In fact, the most useful 
-is a very simple form of noise, that guarantees that the input is much
-less likely than the target, such as Gaussian noise. Another way to think
-about it is to consider the symmetries involved. A corruption process should
-be such that swapping input for target should be very unlikely: this is
-true for many kinds of noises, but not for geometric transformations
-and deformations.
+* Using distorted characters as the corruption process of the Denoising Auto-Encoder (DAE). We had already performed preliminary experiments with this idea; the results varied depending on the type of distortion, but did not improve on the original noise process. We believe that the DAE learns good features when the target to reconstruct is more likely (under the underlying data-generating distribution) than the corrupted input. Hence distortions that are *plausible* in the input distribution (such as translation, rotation, scaling, etc.) are not very useful, whereas corruption by a form of noise is useful. Consider also the symmetries involved: a translation is as likely to be to the right as to the left, so it is hard to predict.
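To make this contrast concrete, here is a small numpy sketch of the two kinds of corruption for a DAE input: unstructured noise (masking or Gaussian), which makes the corrupted input much less probable than the clean target, versus a plausible transformation (a small translation), which stays within the input distribution. The noise levels and the one-pixel shift are illustrative choices only.

import numpy as np

rng = np.random.RandomState(42)

def masking_noise(x, corruption_level=0.2):
    """DAE-style corruption: zero out a random fraction of the inputs."""
    return x * rng.binomial(n=1, p=1.0 - corruption_level, size=x.shape)

def gaussian_noise(x, sigma=0.3):
    """DAE-style corruption: add isotropic Gaussian noise."""
    return x + sigma * rng.randn(*x.shape)

def translate(image, dx=1):
    """A 'plausible' distortion: shift the image horizontally by dx pixels."""
    return np.roll(image, dx, axis=1)

clean = rng.rand(28, 28)                   # toy stand-in for a character image
for name, corrupted in [("masking", masking_noise(clean)),
                        ("gaussian", gaussian_noise(clean)),
                        ("translate", translate(clean))]:
    print(name, float(np.mean((corrupted - clean) ** 2)))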
 
-* Human labeling: We controlled noise in the labelling process by (1)
-requiring AMT workers with a higher than normal average of accepted
-responses (>95%) on other tasks (2) discarding responses that were not
-complete (10 predictions) (3) discarding responses for which for which the
-time to predict was smaller than 3 seconds for NIST (the mean response time
-was 20 seconds) and 6 seconds seconds for NISTP (average response time of
-45 seconds) (4) discarding responses which were obviously wrong (10
-identical ones, or "12345..."). Overall, after such filtering, we kept
-approximately 95% of the AMT workers' responses. The above paragraph
-was added to the revision. We thank the reviewer for
-the suggestion about multi-stage questionnaires, we will definitely
-consider this as an option next time we perform this experiment. However,
-to be fair, if we were to do so, we should also consider the same
-multi-stage decision process for the machine learning algorithms as well.
+* Human labeling: We controlled noise in the labeling process by (1) requiring AMT workers with a higher-than-normal average of accepted responses (>95%) on other tasks, (2) discarding responses that were not complete (10 predictions), (3) discarding responses for which the time to predict was smaller than 3 seconds for NIST (the mean response time was 20 seconds) and 6 seconds for NISTP (average response time of 45 seconds), and (4) discarding responses which were obviously wrong (10 identical ones, or "12345..."). Overall, after such filtering, we kept approximately 95% of the AMT workers' responses. The above paragraph was added to the revision. We thank the reviewer for the suggestion about multi-stage questionnaires; we will definitely consider this as an option the next time we perform this experiment. However, to be fair, if we were to do so, we should also consider the same multi-stage decision process for the machine learning algorithms.
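For clarity, the four filtering rules above can be written as a small Python predicate; the field names, the data layout, and the exact "obviously wrong" pattern test are assumptions made for this illustration.

def keep_response(resp, dataset="NIST"):
    """Apply the four AMT filtering rules described above to one response.
    `resp` is assumed to be a dict with keys: accept_rate (fraction of
    previously accepted tasks), predictions (list of 10 predicted labels),
    and response_time (seconds)."""
    min_time = 3.0 if dataset == "NIST" else 6.0           # rule (3): 3 s for NIST, 6 s for NISTP
    preds = resp["predictions"]
    if resp["accept_rate"] <= 0.95:                        # rule (1): worker accept rate > 95%
        return False
    if len(preds) != 10:                                   # rule (2): complete (10 predictions)
        return False
    if resp["response_time"] < min_time:                   # rule (3): too fast to be genuine
        return False
    obviously_wrong = len(set(preds)) == 1 or "".join(map(str, preds)).startswith("12345")
    if obviously_wrong:                                    # rule (4): 10 identical, or "12345..."
        return False
    return True

responses = [
    {"accept_rate": 0.98, "predictions": list("aB3kM9zQx7"), "response_time": 21.0},
    {"accept_rate": 0.99, "predictions": list("1111111111"), "response_time": 25.0},
]
print([keep_response(r) for r in responses])               # [True, False]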
 
-* Size of labeled set: in our JMLR 2010 paper on deep learning (cited
-above), we already verified the effect of number of labeled examples on the
-deep learners and shallow learners (with or without unsupervised
-pre-training); see fig. 11 of that paper, which involves data very similar
-to those studied here. Basically (and somewhat surprisingly) the deep
-learners with unsupervised pre-training can take more advantage of a large
-amount of labeled examples, presumably because of the initialization effect
-(that benefits from the prior that representations that are useful for P(X)
-are also useful for P(Y|X)), and the effect does not disappear when the
-number of labeled examples increases. Other work in the semi-supervised
-setting (Lee et al, NIPS2009, "Unsupervised feature learning...") also show
-that the advantage of unsupervised feature learning by a deep architecture
-is most pronounced in the semi-supervised setting with very few labeled
-examples. Adding the training curve in the self-taught settings of this AISTAT
-submission is a good idea, but probably unlikely to provide results
-different from the above already reported in the literature in similar
-settings.
+* Size of labeled set: in our JMLR 2010 paper on deep learning (cited above, see fig. 11), we already verified the effect of the number of labeled examples on the deep and shallow learners (with or without unsupervised pre-training). Basically (and somewhat surprisingly) the deep learners with unsupervised pre-training can take more advantage of a large amount of labeled examples, presumably because of the initialization effect, and this advantage does not disappear as the number of labeled examples increases. Similar results were obtained in the semi-supervised setting (Lee et al., NIPS 2009). Adding the training curve in the self-taught setting of this AISTATS submission is a good idea, and we will have it for the final version.
 
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/writeup/jmlr_review1.txt	Sun Mar 13 18:25:25 2011 -0400
@@ -0,0 +1,77 @@
+This paper presents an experimental analysis of the generalization effects of supervised learning leveraging additional out-of-distribution data and certain kinds of perturbations and transformations of examples (handwritten characters). Overall, I feel the paper is interesting, but in its current form the basic content would be more suitable for a conference publication than JMLR.
+There are no new algorithmic advances proposed, as the authors use a number of existing techniques (neural networks, deep learning auto-encoders, multi-task learning, semi-supervised learning and self-taught learning).
+What they do show is that some combinations of these approaches might be quite useful for deep networks. However, I feel there are some missing points both in the text and the experiments themselves, that I detail below.
+
+Comments about the Introduction:
+
+Firstly, a small point: the introduction does not do a good job of explaining the two main topics: "deep learning" and particularly "self-taught learning" (and as there is really no "middle" part of this paper, it just goes straight to experiments after the introduction, there is little elsewhere, either). The deep-learning paragraph explains multi-layer neural nets and why they might be useful, and states "deep learning has emerged as a promising new area of research", but it seems to me the only new area of research is the way they are trained, which should be explained here in the text -- that is not mentioned, which is misleading.  (Actually something about deep learning is explained later, but it seems to be in the wrong section: it is in the ``self-taught learning'' paragraphs.) More importantly I feel that the self-taught learning section fails to explain adequately what self-taught learning even is. It is written:
+
+``Self-taught learning (Raina et al., 2007) is a paradigm that combines principles of semi-supervised and multi-task learning: the learner can exploit examples that are unlabeled and possibly come from a distribution different from the target distribution, e.g., from other classes than those of interest.''
+
+Firstly, this tries to explain one concept by introducing two others that are not explained (semi-supervised learning and multi-task learning). Secondly, I don't think it's clear from that description that there is also labeled data involved here. I think Raina's website explains it more clearly:
+
+"In self-taught learning, we are given a small amount of labeled data for a supervised learning task, and lots of additional unlabeled data that does not share the labels of the supervised problem and does not arise from the same distribution. This paper introduces an algorithm for self-taught learning based on sparse coding."
+
+Comparing the two descriptions I also find the word ``possibly'' troubling in the paper -- why write ``possibly'' here? If the data is not out-of-distribution, then this is just semi-supervised learning, isn't it? 
+
+I think, as this paper hinges on deep learning and self-taught learning, more should be done to explain them. In particular, very little of Raina et al.'s approach is explained, e.g. the algorithm they used or the experiments that were conducted. Moreover, other papers have worked on the same setting, and a section discussing prior work should be added. In particular:
+
+	J. Weston, R. Collobert, F. Sinz, L. Bottou and V. Vapnik. "Inference with the Universum", ICML 2006
+
+also studies algorithms for learning with labeled data + out-of-sample unlabeled data, and even has experiments on handwritten character recognition with many classes.
+Also, I guess that several works have looked at learning in the case of a different distribution in training than in test, e.g., to name one:
+
+	Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. 
+	Domain adaptation: Learning bounds and algorithms. 
+	In Proceedings of The 22nd Annual Conference on Learning Theory (COLT 2009). Montréal, Canada, June 2009. Omnipress. Longer arxiv version.
+
+Perhaps that area of research is worth mentioning too.
+
+The introduction also states:  ``It has already been shown that deep learners can clearly take advantage of unsupervised learning and unlabeled examples (Bengio, 2009; Weston et al., 2008),  but more needs to be done to explore the impact of out-of-distribution examples and of the multi-task setting (one exception is (Collobert and Weston, 2008), which uses a different kind of learning algorithm). In particular the relative advantage of deep learning for these settings has not been evaluated. "
+
+Several points here:
+- The first sentence should make it clear this is semi-supervised learning that also uses labeled examples (I do not think it is clear).
+- I'm not sure what ``which uses a different kind of learning algorithm'' means -- different to what? To the algorithm in this paper, to Raina et al., or something else.. ? 
+- I believe if one is going to discuss the multi-task setting, then several other works should be cited and explained, in particular:
+	Rich Caruana, "Multitask Learning," Ph.D. Thesis, School of Computer Science, CMU, 1997.
+for multi-tasking in neural networks (although I am sure there are many other works as well), and:
+        A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Rie K. Ando and Tong Zhang. Journal of Machine Learning Research, Vol 6:1817-1853, 2005. 
+which uses multi-tasking in the setting of semi-supervised learning. I'm sure there are other works as well.
+- Finally, I believe there are more ``exceptions'' than Collobert and Weston, 2008. For example: 
+	H. Mobahi, R. Collobert, J. Weston. Deep Learning from Temporal Coherence in Video. ICML 2009. 
+seems to directly compare within distribution and out-of-distribution unlabeled data for convolutional networks. The fact that there are already papers on this topic (and that you do not take the time to explain the differences between these and your own work) lessens the impact. 
+
+I think the phrase ``Whereas a deep architecture can in principle be more powerful than a shallow one in terms of representation'' cannot be written without at least a citation, and I think it depends on what you mean by the word ``powerful'', doesn't it? E.g. can't you have infinite VC dimension with a shallow representation?  (Also, I don't think you define what a ``shallow learner'' is anywhere; more explanation always helps.) Also, I feel it would be better if ``sharing of statistical strength'', which is in italics, was explained.
+
+Finally, this is just a style point, but I feel there is too much use of bold and italics at the end of the introduction. You should sell your paper, but sometimes one can go overboard.
+ 
+
+Section 2:
+
+Section 2 is a relatively large chunk of the paper (3 pages) and could basically be put in the appendix, couldn't it? Or else, little is done to justify why it is placed in the paper right after the introduction.  After that section, we go straight to the experiments -- it feels like the paper has some missing sections and was not fully written somehow. It goes straight from the introduction to ``Perturbed and Transformed Character Images'', which is not what I was expecting. For example, I was expecting more details of self-taught learning and why it would help.
+I think describing/citing previous work on learning invariances and transforming images would make sense in the context of this section too.
+
+Experiments:
+
+``Much previous work on deep learning had been performed on the MNIST digits task (Hinton et al., 2006; Ranzato et al., 2007; Bengio et al., 2007; Salakhutdinov and Hinton, 2009), with 60 000 examples, and variants involving 10 000 examples (Larochelle et al., 2009b; Vincent et al., 2008b). The focus here is on much larger training sets, from 10 times to 1000 times larger, and 62 classes.''
+I feel this is unfair. There are many large-scale deep learning papers with large datasets. You should make that clear, e.g.:
+
+	Large-scale Deep Unsupervised Learning using Graphics Processors, Rajat Raina, Anand Madhavan, Andrew Y. Ng , ICML 2009
+
+to name one, but there are many others...
+
+Sec. 3: `` The average error of humans on the 62-class task NIST test set is 18.2%, with a standard error of 0.1%.''.  I think at this point you should explain why this is so high.
+
+``Preliminary experiments on training SVMs (libSVM) with subsets of the training set allowing the program to fit in memory yielded substantially worse results than MLPs."
+-- I think you should mention here work done trying to speed up SVMs for exactly this task, e.g.:
+	Gaëlle Loosli, Stéphane Canu and Léon Bottou: Training Invariant Support Vector Machines using Selective Sampling, in Large Scale Kernel Machines,  301–320, MIT Press, Cambridge, MA., 2007.
+where the authors trained an SVM on 8,100,000 examples generated from MNIST. Also, showing a learning curve might be nice if you can't do the training with the full data.
+
+The experimental results look on the whole good. However, I still feel the following issues could be resolved:
+
+- The shallow MLP, as I understand it, has a maximum of 1500 hidden units, whereas the deep MLP has three layers of 1000 hidden units. Hence, the deep MLPs have a lot more capacity. So shouldn't you try shallow MLPs with more hidden units? It would also be good to show training and test error rates for different numbers of hidden units.
+- If many shallow and deep MLP methods, and other non-MLP methods, have been compared on MNIST, why not compare on that as well? You can still do this in a self-taught learning setup, e.g. using other data as unlabeled data, no?
+- The idea of transforming digits seems closer to learning invariances than to self-taught learning to me. This should be discussed.
+- There is no comparison to Raina et al., despite using their idea of ``self-taught learning'' in the title. Indeed, could Raina et al.'s algorithm be compared in both shallow and deep mode? I feel that, as this is only an experimental paper, more permutations could be done to understand this phenomenon better.
+
+
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/writeup/jmlr_review2.txt	Sun Mar 13 18:25:25 2011 -0400
@@ -0,0 +1,11 @@
+The paper “Deep Self-Taught Learning for Handwritten Character Recognition” by Bengio et al. claims that deep neural networks benefit more from self-taught learning than shallow ones.
+
+The paper presents neural network models applied to handwritten character recognition. Various transformations and noise-injection modes for generating additional training data are introduced to obtain so-called “out-of-distribution” examples. MLPs with one hidden layer are then trained on various data sets in a fully supervised way and compared with three-hidden-layer MLPs where each layer is initialized in an unsupervised way and then fine-tuned using Back-Propagation. It is then concluded that deep learners benefit more from out-of-distribution examples as well as from a multi-task setting.
+
+It is well known that artificially increasing the training data, by either adding noise or incorporating some prior knowledge in the generation of new data points, acts as a regularizer and helps to improve performance (Simard et al. 2003, ICDAR). It is therefore not very surprising that deep architectures with a higher complexity profit more from this procedure. The paper suggests that MLPs (with one hidden layer) perform worse than deep SDAs (i.e. pretrained MLPs with three hidden layers), especially when the training data is artificially increased.  I would argue that an MLP with three hidden layers trained in a fully supervised way would also perform better than the 1-hidden-layer MLP. Therefore it would have been interesting to see results for such an MLP. Only in this way would a fair comparison between shallow and deep MLPs, as well as between supervised and unsupervised training, be possible.
+
+This paper claims that deep architectures with unsupervised pre-training outperform shallow ones and that additional training data is more beneficial for deep architectures. I think the authors should have compared their SDA with a 3-hidden-layer MLP to support this claim. Furthermore, it is claimed that unsupervised pre-training is required to successfully train deep (3-hidden-layer) MLPs. However, there are no experiments in this paper that justify this claim, and I would also argue that deep MLPs can be successfully trained with Back-Propagation, especially if enough training data is available (Ciresan et al 2010, Neural Computation). I therefore strongly encourage the authors to either include the result of such an experiment or adjust the conclusion accordingly.
+
+To cut a long story short, this paper wants to establish SDAs as the state of the art for character recognition, without even checking if deep MLPs trained in the usual supervised way are better or not. I ran a simple test and trained a three-hidden-layer MLP (500-500-500) on deformed NIST and obtained a test error rate of 1.08% on the un-deformed NIST test set, compared with the 1.4% of the SDA in Table 1. For this particular task, a three-hidden-layer MLP outperforms an even bigger SDA. I am therefore not fully convinced that unsupervised pre-training is necessary to obtain good performance for the presented task.
+
+The extensive use of font styles made it hard to follow the paper. It is also very difficult to understand which networks were trained with which data. In particular, in Table 1 of the Appendix it is not clear whether the nets in the last column, tested on digits, were trained on 62 characters or only on digits.