# HG changeset patch
# User Yoshua Bengio
# Date 1281236094 14400
# Node ID 185d79636a2045a446e1a4f6e9604b8529f53573
# Parent bff9ab360ef4e6db90f5a422b792a281d690698d
now fits

diff -r bff9ab360ef4 -r 185d79636a20 writeup/nips_rebuttal_clean.txt
--- a/writeup/nips_rebuttal_clean.txt Sat Aug 07 22:46:12 2010 -0400
+++ b/writeup/nips_rebuttal_clean.txt Sat Aug 07 22:54:54 2010 -0400
@@ -24,31 +24,22 @@
 
 
 "...it would be helpful to provide some theoretical analysis...": indeed,
-but this is either mathematically challenging (to say the least, since deep
+but this appears mathematically challenging (to say the least, since deep
 models involve a non-convex optimization) or would likely require very
-strong assumptions on the data distribution. However, there exists
-theoretical literature which answers some basic questions about this issue,
-starting with the work of Jonathan Baxter (COLT 1995) "Learning internal
-representations". The argument is about capacity and sharing it across
-tasks so as to achieve better generalization. The lower layers implement
-features that can potentially be shared across tasks. As long as some
-sharing is possible (because the same features can be useful for several
-tasks), then there is a potential benefit from shared internal
-representations. Whereas a one-hidden-layer MLP can only share linear
+strong distributional assumptions. However, previous
+theoretical literature already provides some answers, e.g.,
+Jonathan Baxter's (COLT 1995) "Learning internal
+representations". The argument is about sharing capacity across
+tasks to improve generalization: lower layers features can potentially
+be shared across tasks. Whereas a one-hidden-layer MLP can only share linear
 features, a deep architecture can share non-linear ones which have the
 potential for representing more abstract concepts.
 
 Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
-labels are used. In the supervised fine-tuning phase, all labels are used,
-so this is not the semi-supervised setting. This paper did not examine the
-potential advantage of exploiting large quantities of additional unlabeled
-data, but the availability of the generated dataset and of the learning
-setup would make it possible to easily conduct a study to answer this
-interesting question. Note however that previous work [5] already
-investigated the relative advantage of the semi-supervised setting for deep
-vs shallow architectures, which is why we did not focus on this here. It
-might still be worth to do these experiments because the deep learning
-algorithms were different.
+labels are used. In the supervised fine-tuning phase, all labels are used.
+So this is *not* the semi-supervised setting, which was already previously
+studied [5], showing the advantage of depth. Instead, we focus here
+on the out-of-distribution aspect of self-taught learning.
 
 "...human errors may be present...": Indeed, there are variations across
 human labelings, which have have estimated (since each character was viewed
@@ -56,24 +47,24 @@
 across humans are large, but the standard error across a large test set is
 very small, so we believe the average error numbers to be fairly accurate).
 
-"...authors do cite a supplement, but I did not have access to it...": that
-is strange. We could (and still can) access it from the CMT web site. We
-will make sure to include a complete pseudo-code of SDAs in it.
+"...supplement, but I did not have access to it...": strange! We could
+(and still can) access it. We will include a complete pseudo-code of SDAs
+in it.
 
 "...main contributions of the manuscript...": the main contribution is
 actually to show that the self-taught learning setting is more beneficial
 to deeper architectures.
 
 "...restriction to MLPs...": that restriction was motivated by the
-computational challenge of training on hundreds of millions of
-examples. Apart from linear models (which do not fare well on this task),
-it is not clear to us what could be used, and so MLPs were the obvious
-candidates to compare with. We will explore the use of SVM approximations,
-as suggested by Reviewer_1. Other suggestions are welcome.
+computational challenge of training on nearly a billion examples. Linear
+models do not fare well here, and most non-parametric models do not scale
+well, so MLPs (which have been used before on this task) were natural as
+the baseline. We will explore the use of SVM approximations, as suggested
+by Reviewer_1.
 
 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
 prior work on character recognition using deformations and
-transformations". The main originality is in showing that deep learners
+transformations". Main originality = showing that deep learners
 can take more advantage than shallow learners of such data and of the
 self-taught learning framework in general.
 