--- a/writeup/nips_rebuttal.txt	Fri Aug 06 15:12:01 2010 -0400
+++ b/writeup/nips_rebuttal.txt	Fri Aug 06 15:26:58 2010 -0400
@@ -33,13 +33,13 @@
 |||Another interesting observation is that deep learners benefit more from multi-task learning compared to shallow multi-layer perceptrons. It would also be interesting to compare to SVMs that are built incrementally, i.e. fit SVMs using a subset of data, retain support vectors, add more data, etc. This would better justify empirical findings.
 |||

-"It would also be interesting to compare to SVMs...": ordinary SVMs cannot be used on such large datasets, and indeed it is a good idea to explore variants of SVMs or approximations of SVMs. We will continue exploring this thread (and the particular suggestion made) and hope to include these results in the final paper.
+"It would also be interesting to compare to SVMs...": ordinary SVMs cannot be used on such large datasets, and indeed it is a good idea to explore variants of SVMs or approximations of SVMs. We will continue exploring this thread (and the particular suggestion made) and hope to include these results in the final paper, to add more shallow learners to the comparison.

 |||While the paper is mostly empirical, it would be helpful to provide some theoretical analysis. It would be interesting to work out under what conditions one would expect deep models to benefit from out-of-distribution examples (obviously if the distribution of those examples is very different, it would naturally hurt model performance), or when one would expect deep models to benefit more from multi-task setting compared to shallow learners.
 |||

 "...it would be helpful to provide some theoretical analysis...": indeed, but this is either mathematically
-challenging (to say the least) or would require very strong assumptions on the data distribution. Remember that deep models involve a non-convex optimization. However, there is already a body of theoretical literature which answers some basic questions about this issue, starting with the work of Jonathan Baxter (COLT 1995) "Learning internal representations". We will add that citation. Basically, the argument is about capacity and sharing it across tasks so as to achieve better generalization. The lower layers implement features that can potentially be shared across tasks. As long as some sharing is possible (because the same features can be useful for several tasks), then there is a benefit that can be achieved with shared internal representations. Whereas a one-hidden-layer MLP can only share linear features, a deep architecture can share non-linear ones which have the potential for representing more abstract concepts.
+challenging (to say the least) or would likely require very strong assumptions on the data distribution. Remember also that deep models involve a non-convex optimization. However, there is already a body of theoretical literature which answers some basic questions about this issue, starting with the work of Jonathan Baxter (COLT 1995) "Learning internal representations". We will add that citation. Basically, the argument is about capacity and sharing it across tasks so as to achieve better generalization. The lower layers implement features that can potentially be shared across tasks. As long as some sharing is possible (because the same features can be useful for several tasks), then there is a potential benefit that can be achieved with shared internal representations. Whereas a one-hidden-layer MLP can only share linear features, a deep architecture can share non-linear ones which have the potential for representing more abstract concepts.

 |||Please summarize your review in 1-2 sentences	 The paper is mostly well-written and provides an extensive empirical study showing that model with deep architectures can benefit from self-taught learning setting.
 |||Masked Reviewer ID:	Assigned_Reviewer_5
@@ -55,7 +55,7 @@
 |||And (2) I'm not sure how accurately the scores from Amazon Mechanical Turks (AMT) indicate human-level performance, since human errors may be present either in the AMT predictions or in the original hand-curation of the labeled test data.
 |||

-"...human errors may be present...": Indeed, there are variations across human labelings, which we have estimated (exploiting the fact that each character was viewed by 3 different humans), and reported in the paper (the standard deviations across humans are large, but the standard error across a large test set is very small, so we believe these numbers to be fairly accurate).
+"...human errors may be present...": Indeed, there are variations across human labelings, which we have estimated (exploiting the fact that each character was viewed by 3 different humans), and reported in the paper (the standard deviations across humans are large, but the standard error across a large test set is very small, so we believe the average error numbers to be fairly accurate).

 |||Clarity - The paper is fairly clearly written, with a few spelling and grammatical errors. Most importantly, the description of the SDA training could be improved and expanded to aid non-specialist readers. (In order to understand the training approach I had to read several of the cited papers). Shortening section 2 (possibly relegating details such as parameter ranges to the supplement) should free up enough space to add a gentle introduction to deep learning with SDAs, which makes it clear that the purpose of deep learning is to induce hierarchical features from raw data via unsupervised methods (it was not made explicit in the manuscript that the input features were (I presume) the raw pixel values of the character images). Note that the authors do cite a supplement, but I did not have access to it.

@@ -71,7 +71,7 @@
 |||Significance - The results of this paper are very good, and the ideas are of importance not only within the specific application of character recognition. One limit is the restriction to MLPs and not other more recent learning approaches.
 |||

-"...restriction to MLPs...": that restriction was motivated by the computational challenge of training on hundreds of millions of examples. Apart from linear models (which do not fare well on this task and do not take advantage of large training sets), it is not clear to us what could be used, and so MLPs were the obvious candidates to compare with. We will explore the use of SVM approximations, as suggested by Reviewer_1.
+"...restriction to MLPs...": that restriction was motivated by the computational challenge of training on hundreds of millions of examples. Apart from linear models (which do not fare well on this task and do not take advantage of large training sets), it is not clear to us what could be used, and so MLPs were the obvious candidates to compare with. We will explore the use of SVM approximations, as suggested by Reviewer_1. Other suggestions are welcome.

 |||Please summarize your review in 1-2 sentences	 The manuscript provides results consistent with earlier findings, and introduces a detailed set of noise-adding procedures that work well for the specific task of character recognition. The presentation should be adequately clear to other researchers working on the same task, but could be improved to make the article more accessible to nonspecialists.
 |||Masked Reviewer ID:	Assigned_Reviewer_6