comparison writeup/nips_rebuttal_clean.txt @ 576:185d79636a20

now fits
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 07 Aug 2010 22:54:54 -0400
parents bff9ab360ef4
children 685756a11fd2

SVMs cannot be used on such large datasets. We will explore SVM variants,
following the suggestion to add SVM results to the paper.

"...it would be helpful to provide some theoretical analysis...": indeed,
but this appears mathematically challenging (to say the least, since deep
models involve a non-convex optimization) or would likely require very
strong distributional assumptions. However, previous theoretical literature
already provides some answers, e.g., Jonathan Baxter's (COLT 1995)
"Learning internal representations". The argument is about sharing capacity
across tasks to improve generalization: lower-layer features can
potentially be shared across tasks. Whereas a one-hidden-layer MLP can only
share linear features, a deep architecture can share non-linear ones, which
have the potential to represent more abstract concepts.
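
To make the sharing argument concrete, the following is a minimal numpy
sketch, not code from the paper; the layer sizes and the two tasks are
hypothetical. The stacked non-linear layers form a trunk reused by every
task, while only the small output heads are task-specific.

import numpy as np

rng = np.random.RandomState(0)

def relu(x):
    return np.maximum(0.0, x)

n_in, n_h1, n_h2 = 784, 500, 500   # shared trunk: two non-linear layers
n_out_a, n_out_b = 62, 10          # two illustrative tasks (e.g. characters vs digits)

# Shared parameters: every task reuses these non-linear feature extractors.
W1 = rng.normal(scale=0.01, size=(n_in, n_h1)); b1 = np.zeros(n_h1)
W2 = rng.normal(scale=0.01, size=(n_h1, n_h2)); b2 = np.zeros(n_h2)

# Task-specific output layers (the only per-task capacity).
Wa = rng.normal(scale=0.01, size=(n_h2, n_out_a)); ba = np.zeros(n_out_a)
Wb = rng.normal(scale=0.01, size=(n_h2, n_out_b)); bb = np.zeros(n_out_b)

def shared_features(x):
    # Two stacked non-linear layers: what a deep net can share across tasks.
    # A one-hidden-layer MLP could only share a single non-linear map, and a
    # linear model could only share linear features of x.
    h1 = relu(x @ W1 + b1)
    return relu(h1 @ W2 + b2)

x = rng.rand(5, n_in)        # a hypothetical minibatch of inputs
h = shared_features(x)       # computed once, reused by both tasks
logits_a = h @ Wa + ba       # task A head
logits_b = h @ Wb + bb       # task B head
print(logits_a.shape, logits_b.shape)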

Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
labels are used. In the supervised fine-tuning phase, all labels are used.
So this is *not* the semi-supervised setting, which was already studied in
[5], showing the advantage of depth. Instead, we focus here on the
out-of-distribution aspect of self-taught learning.
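
To illustrate the two phases (with hypothetical sizes and random stand-in
data; this is a sketch, not the paper's SDA implementation, and it
pre-trains only a single denoising layer): the first loop never touches the
labels y, while the second uses all of them.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hid, n_classes, lr = 64, 32, 10, 0.1
X = rng.rand(256, n_in)               # inputs in [0, 1]
y = rng.randint(n_classes, size=256)  # labels, used only in fine-tuning

W = rng.normal(scale=0.01, size=(n_in, n_hid)); b = np.zeros(n_hid); c = np.zeros(n_in)
V = rng.normal(scale=0.01, size=(n_hid, n_classes)); d = np.zeros(n_classes)

# Phase 1: unsupervised pre-training of a denoising autoencoder (no labels).
for _ in range(50):
    xb = X[rng.randint(len(X), size=32)]
    xt = xb * (rng.rand(*xb.shape) > 0.3)   # corrupt ~30% of the inputs
    h = sigmoid(xt @ W + b)                 # encode the corrupted input
    z = sigmoid(h @ W.T + c)                # reconstruct (tied weights)
    dz = z - xb                             # cross-entropy gradient at the decoder
    dh = (dz @ W) * h * (1 - h)
    W -= lr * (xt.T @ dh + dz.T @ h) / len(xb)
    b -= lr * dh.mean(axis=0)
    c -= lr * dz.mean(axis=0)

# Phase 2: supervised fine-tuning of the same encoder, with all labels.
for _ in range(50):
    idx = rng.randint(len(X), size=32)
    xb, yb = X[idx], y[idx]
    h = sigmoid(xb @ W + b)                 # pre-trained layer, now fine-tuned
    logits = h @ V + d
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    dlog = p.copy()
    dlog[np.arange(len(yb)), yb] -= 1.0     # softmax cross-entropy gradient
    dh = (dlog @ V.T) * h * (1 - h)
    V -= lr * (h.T @ dlog) / len(xb)
    d -= lr * dlog.mean(axis=0)
    W -= lr * (xb.T @ dh) / len(xb)
    b -= lr * dh.mean(axis=0)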
53 "...human errors may be present...": Indeed, there are variations across 44 "...human errors may be present...": Indeed, there are variations across
54 human labelings, which have have estimated (since each character was viewed 45 human labelings, which have have estimated (since each character was viewed
55 by 3 different humans), and reported in the paper (the standard deviations 46 by 3 different humans), and reported in the paper (the standard deviations
56 across humans are large, but the standard error across a large test set is 47 across humans are large, but the standard error across a large test set is
57 very small, so we believe the average error numbers to be fairly accurate). 48 very small, so we believe the average error numbers to be fairly accurate).
58 49
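
The statistical point is simply that the standard error of a mean shrinks
as 1/sqrt(N). A toy numpy illustration, with a made-up disagreement rate
rather than the paper's numbers:

import numpy as np

rng = np.random.RandomState(0)

n_test = 100000                    # a hypothetical large test set
errors = rng.rand(n_test) < 0.2    # made-up 20% per-item labeling noise

std_across_items = errors.std()                     # large: individual labels are noisy
standard_error = errors.std() / np.sqrt(n_test)     # tiny: the mean error is precise
print(std_across_items, standard_error)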
59 "...authors do cite a supplement, but I did not have access to it...": that 50 "...supplement, but I did not have access to it...": strange! We could
60 is strange. We could (and still can) access it from the CMT web site. We 51 (and still can) access it. We will include a complete pseudo-code of SDAs
61 will make sure to include a complete pseudo-code of SDAs in it. 52 in it.
62 53
63 "...main contributions of the manuscript...": the main contribution is 54 "...main contributions of the manuscript...": the main contribution is
64 actually to show that the self-taught learning setting is more beneficial 55 actually to show that the self-taught learning setting is more beneficial
65 to deeper architectures. 56 to deeper architectures.
66 57
67 "...restriction to MLPs...": that restriction was motivated by the 58 "...restriction to MLPs...": that restriction was motivated by the
68 computational challenge of training on hundreds of millions of 59 computational challenge of training on nearly a billion examples. Linear
69 examples. Apart from linear models (which do not fare well on this task), 60 models do not fare well here, and most non-parametric models do not scale
70 it is not clear to us what could be used, and so MLPs were the obvious 61 well, so MLPs (which have been used before on this task) were natural as
71 candidates to compare with. We will explore the use of SVM approximations, 62 the baseline. We will explore the use of SVM approximations, as suggested
72 as suggested by Reviewer_1. Other suggestions are welcome. 63 by Reviewer_1.
73 64
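
As one example of an SVM approximation that scales to datasets of this
size, a linear SVM can be trained by SGD on the hinge loss over streamed
minibatches. The sketch below is hypothetical (random stand-in data, no
kernel), not a method from the paper; a kernel approximation such as random
features could be layered on top.

import numpy as np

rng = np.random.RandomState(0)

n_features, lam, lr = 784, 1e-4, 0.01
w = np.zeros(n_features)
b = 0.0

def sgd_step(w, b, xb, yb):
    # One subgradient step on the L2-regularized hinge loss; yb is in {-1, +1}.
    margins = yb * (xb @ w + b)
    active = margins < 1.0            # examples that violate the margin
    gw, gb = lam * w, 0.0             # gradient of the regularizer
    if active.any():
        gw = gw - (yb[active, None] * xb[active]).sum(axis=0) / len(xb)
        gb = gb - yb[active].sum() / len(xb)
    return w - lr * gw, b - lr * gb

for _ in range(100):                  # in practice, stream minibatches from disk
    xb = rng.randn(64, n_features)            # stand-in features
    yb = rng.choice([-1.0, 1.0], size=64)     # stand-in binary labels
    w, b = sgd_step(w, b, xb, yb)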
74 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of 65 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
75 prior work on character recognition using deformations and 66 prior work on character recognition using deformations and
76 transformations". The main originality is in showing that deep learners 67 transformations". Main originality = showing that deep learners
77 can take more advantage than shallow learners of such data and of the 68 can take more advantage than shallow learners of such data and of the
78 self-taught learning framework in general. 69 self-taught learning framework in general.
79 70