diff writeup/nips_rebuttal_clean.txt @ 574:d12b9a1432e8
cleaned-up version, fewer typos, shortened (but need 700 chars less)
author | Dumitru Erhan <dumitru.erhan@gmail.com> |
---|---|
date | Sat, 07 Aug 2010 18:39:36 -0700 |
parents | |
children | bff9ab360ef4 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/writeup/nips_rebuttal_clean.txt Sat Aug 07 18:39:36 2010 -0700
@@ -0,0 +1,78 @@
+Reviewer_1 claims that handwriting recognition is essentially solved, and we
+believe that this is not true. Indeed, the best methods have been
+getting essentially human performance in the case of clean digits. We are not
+aware of previous papers showing that human performance has been reached on the
+full character set. Furthermore, it is clear from our own experimentation that
+humans still greatly outperform machines when the characters are heavily
+distorted (e.g. the NISTP dataset). Playing with the provided demo will
+quickly convince you that this is true.
+
+"...not intended to compete with the state-of-the-art...": We actually included
+comparisons with the state-of-the-art on the NIST dataset (and beat it).
+
+"the demonstrations that self-taught learning can help deep learners is
+helpful": indeed, but it is even more interesting to consider the result that
+self-taught learning was found *more helpful for deep learners than for shallow
+ones*. Since the availability of out-of-distribution data is common (especially
+out-of-class data), this is of practical importance.
+
+Reviewer_4: "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be
+used on such large datasets, and indeed it is a good idea to explore variants of
+SVMs or approximations of SVMs. We will continue exploring this thread (and the
+particular suggestion made) and hope to include these results in the final
+paper, to add more shallow learners to the comparison.
+
+"...it would be helpful to provide some theoretical analysis...": indeed, but
+this is either mathematically challenging (to say the least, since deep models
+involve a non-convex optimization) or would likely require very strong
+assumptions on the data distribution. However, there exists
+theoretical literature which answers some basic questions about this issue,
+starting with the work of Jonathan Baxter (COLT 1995) "Learning internal
+representations". The argument is about capacity
+and sharing it across tasks so as to achieve better generalization. The lower
+layers implement features that can potentially be shared across tasks. As long
+as some sharing is possible (because the same features can be useful for several
+tasks), then there is a potential benefit from shared
+internal representations. Whereas a one-hidden-layer MLP can only share linear
+features, a deep architecture can share non-linear ones which have the potential
+for representing more abstract concepts.
+
+Reviewer_5 about semi-supervised learning: In the unsupervised phase, no labels
+are used. In the supervised fine-tuning phase, all labels are used, so this is
+not the semi-supervised setting. This paper did not examine the potential
+advantage of exploiting large quantities of additional unlabeled data, but the
+availability of the generated dataset and of the learning setup would make it
+possible to easily conduct a study to answer this interesting
+question. Note however that previous work [5] already investigated the relative
+advantage of the semi-supervised setting for deep vs shallow architectures,
+which is why we did not focus on this here. It might still be worth doing these
+experiments because the deep learning algorithms were different.
+
+"...human errors may be present...": Indeed, there are variations across human
+labelings, which we have estimated (since each character
+was viewed by 3 different humans), and reported in the paper (the standard
+deviations across humans are large, but the standard error across a large test
+set is very small, so we believe the average error numbers to be fairly
+accurate).
+
+"...authors do cite a supplement, but I did not have access to it...": that is
+strange. We could (and still can) access it from the CMT web site. We will make
+sure to include a complete pseudo-code of SDAs in it.
+
+"...main contributions of the manuscript...": the main
+contribution is actually to show that the self-taught learning setting is more
+beneficial to deeper architectures.
+
+"...restriction to MLPs...": that restriction was motivated by the computational
+challenge of training on hundreds of millions of examples. Apart from linear
+models (which do not fare well on this task), it is not clear to us what
+could be used, and so MLPs were the
+obvious candidates to compare with. We will explore the use of SVM
+approximations, as suggested by Reviewer_1. Other suggestions are welcome.
+
+"Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
+prior work on character recognition using deformations and transformations".
+The main originality is in showing that deep learners can take more advantage
+than shallow learners of such data and of the self-taught learning framework in
+general.
+