diff writeup/nips_rebuttal_clean.txt @ 576:185d79636a20

now fits
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 07 Aug 2010 22:54:54 -0400
parents bff9ab360ef4
children 685756a11fd2
--- a/writeup/nips_rebuttal_clean.txt	Sat Aug 07 22:46:12 2010 -0400
+++ b/writeup/nips_rebuttal_clean.txt	Sat Aug 07 22:54:54 2010 -0400
@@ -24,31 +24,22 @@
 
 
 "...it would be helpful to provide some theoretical analysis...": indeed,
-but this is either mathematically challenging (to say the least, since deep
+but this appears mathematically challenging (to say the least, since deep
 models involve a non-convex optimization) or would likely require very
-strong assumptions on the data distribution. However, there exists
-theoretical literature which answers some basic questions about this issue,
-starting with the work of Jonathan Baxter (COLT 1995) "Learning internal
-representations". The argument is about capacity and sharing it across
-tasks so as to achieve better generalization. The lower layers implement
-features that can potentially be shared across tasks. As long as some
-sharing is possible (because the same features can be useful for several
-tasks), then there is a potential benefit from shared internal
-representations. Whereas a one-hidden-layer MLP can only share linear
+strong distributional assumptions. However, previous
+theoretical literature already provides some answers, e.g.,
+Jonathan Baxter's (COLT 1995) "Learning internal
+representations". The argument is about sharing capacity across
+tasks to improve generalization: lower-layer features can potentially
+be shared across tasks. Whereas a one-hidden-layer MLP can only share linear
 features, a deep architecture can share non-linear ones which have the
 potential for representing more abstract concepts.
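
For concreteness, a minimal sketch of what such sharing looks like (layer
sizes, task names and class counts below are illustrative only, not those
of the paper):

import numpy as np

rng = np.random.RandomState(0)

def layer(n_in, n_out):
    # one fully-connected layer with a tanh non-linearity
    W = rng.uniform(-0.1, 0.1, (n_in, n_out))
    b = np.zeros(n_out)
    return lambda x: np.tanh(x.dot(W) + b)

# non-linear lower layers shared by every task
shared = [layer(1024, 500), layer(500, 500)]

# one linear output head per task
heads = {name: (rng.uniform(-0.1, 0.1, (500, k)), np.zeros(k))
         for name, k in [("digits", 10), ("upper", 26), ("lower", 26)]}

def predict(x, task):
    h = x
    for f in shared:        # capacity shared across all tasks
        h = f(h)
    W, b = heads[task]      # only this part is task-specific
    return h.dot(W) + b

x = rng.rand(1, 1024)                 # one dummy input image
print(predict(x, "digits").shape)     # -> (1, 10)

Each task only adds a linear read-out on top of the same stack of
non-linear features, which is where the capacity-sharing argument applies.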
 
 Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
-labels are used. In the supervised fine-tuning phase, all labels are used,
-so this is not the semi-supervised setting. This paper did not examine the
-potential advantage of exploiting large quantities of additional unlabeled
-data, but the availability of the generated dataset and of the learning
-setup would make it possible to easily conduct a study to answer this
-interesting question. Note however that previous work [5] already
-investigated the relative advantage of the semi-supervised setting for deep
-vs shallow architectures, which is why we did not focus on this here. It
-might still be worth to do these experiments because the deep learning
-algorithms were different.
+labels are used. In the supervised fine-tuning phase, all labels are used.
+So this is *not* the semi-supervised setting, which was already
+studied in [5], showing the advantage of depth. Instead, we focus here
+on the out-of-distribution aspect of self-taught learning.
 
 "...human errors may be present...": Indeed, there are variations across
 human labelings, which we have estimated (since each character was viewed
@@ -56,24 +47,24 @@
 across humans are large, but the standard error across a large test set is
 very small, so we believe the average error numbers to be fairly accurate).
 
-"...authors do cite a supplement, but I did not have access to it...": that
-is strange. We could (and still can) access it from the CMT web site. We
-will make sure to include a complete pseudo-code of SDAs in it.
+"...supplement, but I did not have access to it...": strange!  We could
+(and still can) access it. We will include a complete pseudo-code of SDAs
+in it.
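
For reference, a minimal sketch of the unsupervised phase of that
pseudo-code (numpy, sigmoid units, masking noise, squared reconstruction
error; sizes and hyper-parameters are illustrative only, and the
supervised fine-tuning phase, which backpropagates through the whole
stack, is not shown):

import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def pretrain_dae_layer(X, n_hidden, noise=0.3, lr=0.1, epochs=5):
    # One denoising autoencoder layer: corrupt the input, encode, decode
    # with tied weights, take a gradient step on the reconstruction error.
    n_in = X.shape[1]
    W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.rand(n_in) > noise)   # masking corruption
            h = sigmoid(x_tilde.dot(W) + b)          # encoder
            z = sigmoid(h.dot(W.T) + c)              # decoder (tied weights)
            da_dec = (z - x) * z * (1 - z)           # grad wrt decoder pre-activation
            da_enc = da_dec.dot(W) * h * (1 - h)     # grad wrt encoder pre-activation
            W -= lr * (np.outer(da_dec, h) + np.outer(x_tilde, da_enc))
            b -= lr * da_enc
            c -= lr * da_dec
    return W, b

def pretrain_sda(X, layer_sizes):
    # Greedy layer-wise phase: each layer is trained as a denoising
    # autoencoder on the output of the layer below; no labels are used.
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_dae_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H.dot(W) + b)
    return params  # supervised fine-tuning of the whole stack follows

params = pretrain_sda(rng.rand(100, 64), [32, 16])   # toy unlabeled data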
 
 "...main contributions of the manuscript...": the main contribution is
 actually to show that the self-taught learning setting is more beneficial
 to deeper architectures.
 
 "...restriction to MLPs...": that restriction was motivated by the
-computational challenge of training on hundreds of millions of
-examples. Apart from linear models (which do not fare well on this task),
-it is not clear to us what could be used, and so MLPs were the obvious
-candidates to compare with. We will explore the use of SVM approximations,
-as suggested by Reviewer_1. Other suggestions are welcome.
+computational challenge of training on nearly a billion examples. Linear
+models do not fare well here, and most non-parametric models do not scale
+well, so MLPs (which have been used before on this task) were the natural
+baseline. We will explore the use of SVM approximations, as suggested
+by Reviewer_1.
 
 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
 prior work on character recognition using deformations and
-transformations".  The main originality is in showing that deep learners
+transformations".  Main originality = showing that deep learners
 can take more advantage than shallow learners of such data and of the
 self-taught learning framework in general.