diff writeup/nips_rebuttal.txt @ 572:7ee0e41dd3d5

gone through all
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Fri, 06 Aug 2010 15:12:01 -0400
parents 0a8f39ea62b1
children 07b727a12632
--- a/writeup/nips_rebuttal.txt	Fri Aug 06 14:50:56 2010 -0400
+++ b/writeup/nips_rebuttal.txt	Fri Aug 06 15:12:01 2010 -0400
@@ -49,15 +49,30 @@
 |||
 |||Quality - The paper is technically sound. The only possible technical shortcomings I see are (1) that the authors seem to equate unsupervised learning with either the addition of noise to training examples, or the use of untested categories (i.e., multi-task learning); it might be useful to also quantify the improvement seen when the SDAs are applied with unlabeled data (without added noise, and without superfluous categories). It is also not completely clear in the setup which (fraction of) data is labeled, which not, and how it is used in training. For instance, NIST comes with annotations, so are all distorted images assumed to belong to the same class, etc. 
 |||
+
+Reviewer_5 about semi-supervised learning: In the unsupervised phase, no labels are used. In the supervised fine-tuning phase, all labels are used, so this is not the semi-supervised setting. This paper did not examine the potential advantage of exploiting large quantities of additional unlabeled data, but the availability of the generated dataset and of the learning setup would make it easy to conduct an empirical study to answer this interesting question. Note, however, that previous work [5] already investigated the relative advantage of the semi-supervised setting for deep vs shallow architectures, which is why we did not focus on this here. It might still be worth doing these experiments because the deep learning algorithms were different.
+
 |||And (2) I'm not sure how accurately the scores from Amazon Mechanical Turks (AMT) indicate human-level performance, since human errors may be present either in the AMT predictions or in the original hand-curation of the labeled test data. 
 |||
-|||Clarity - The paper is fairly clearly written, with a few spelling and grammatical errors. Most importantly, the description of the SDA training could be improved and expanded to aid non-specialist readers. (In order to understand the training approach I had to read several of the cited papers). Shortening section 2 (possibly relegating details such as parameter ranges to the supplement) should free up enough space to add a gentle introduction to deep learning with SDAs, which makes it clear that the purpose of deep learning is to induce hierarchical features from raw data via unsupervised methods (it was not made explicit in the manuscript that the input features were (I presume) the raw pixel values of the character images). Note that the authors do cite a supplement, but I did not have access to it. 
+
+"...human errors may be present...": Indeed, there are variations across human labelings, which we have estimated (exploiting the fact that each character was viewed by 3 different humans) and reported in the paper (the standard deviations across humans are large, but the standard error across a large test set is very small, so we believe these numbers to be fairly accurate).
+
+|||Clarity - The paper is fairly clearly written, with a few spelling and grammatical errors. Most importantly, the description of the SDA training could be improved and expanded to aid non-specialist readers. (In order to understand the training approach I had to read several of the cited papers). Shortening section 2 (possibly relegating details such as parameter ranges to the supplement) should free up enough space to add a gentle introduction to deep learning with SDAs, which makes it clear that the purpose of deep learning is to induce hierarchical features from raw data via unsupervised methods (it was not made explicit in the manuscript that the input features were (I presume) the raw pixel values of the character images). Note that the authors do cite a supplement, but I did not have access to it.
+
+"...authors do cite a supplement, but I did not have access to it...": That is strange. We could (and still can) access it from the CMT web site. We will make sure to include complete pseudo-code for SDAs in it.
+
 |||Finally, the distinction between semi-supervised and self-taught learning should be better explained. 
 |||
 |||Originality - The main contributions of the manuscript is a well-organized evaluation of previously described approaches to assess the benefits of deep learning -- the use of larger data sets (including larger numbers of categories), the framework of image transformations to generate appropriate larger sets for self-taught learning, and the results showing performance comparable to that of humans. The main theoretical result seems to be that adding noise to training examples and/or including categories during training that are not used during testing (i.e., "borrowing strength" via multitask learning) improves classification accuracy even when extremely large numbers of labeled training examples are available. The utility of added noise during training has been well-known for many years, but had previously been thought to result from generalization error induced by bias in the training set (i.e., limited sample sizes), whereas the authors show that the advantage persists even for large sample sizes. 
 |||
+
+"Reviewer_5 on...The main contributions of the manuscript...": the main contribution is actually to show that the self-taught learning setting is more beneficial to deeper architectures.
+
 |||Significance - The results of this paper are very good, and the ideas are of importance not only within the specific application of character recognition. One limit is the restriction to MLPs and not other more recent learning approaches. 
 |||
+
+"...restriction to MLPs...": That restriction was motivated by the computational challenge of training on hundreds of millions of examples. Apart from linear models (which do not fare well on this task and do not take advantage of large training sets), it is not clear to us what else could be used, and so MLPs were the obvious candidates to compare with. We will explore the use of SVM approximations, as suggested by Reviewer_1.
+
 |||Please summarize your review in 1-2 sentences	 The manuscript provides results consistent with earlier findings, and introduces a detailed set of noise-adding procedures that work well for the specific task of character recognition. The presentation should be adequately clear to other researchers working on the same task, but could be improved to make the article more accessible to nonspecialists.
 |||Masked Reviewer ID:	Assigned_Reviewer_6
 |||Review:	
@@ -75,6 +90,10 @@
 |||Originality: 
 |||The novelty of the approach is somewhat marginal since the approach is reminiscent of prior work on character recognition using deformations and transformations. However, this paper shows that it can achieve the state-of-the-art performance via this approach. 
 |||
+
+"Reviewer 6:...novelty of the approach is somewhat marginal since the approach is reminiscent of prior work on character recognition using deformations and transformations". The main originality lies not there but in showing that deep learners can take greater advantage than shallow learners of such data and of the self-taught learning framework in general.
+
+
 |||Significance: 
 |||The paper tries to address a number of interesting questions related to deep learning and multi-task learning. Furthermore, this work can provide a new large scale data benchmark for deep learning (beyond MNIST).
 |||Please summarize your review in 1-2 sentences	 The paper tries to address a number of interesting questions related to deep learning and multi-task learning on a large scale handwritten character dataset. Furthermore, the presented method seems to achieve the state-of-the-art.