writeup/nips_rebuttal_clean.txt @ changeset 575:bff9ab360ef4 (ift6266 repository)

author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Sat, 07 Aug 2010 22:46:12 -0400
parents:  d12b9a1432e8
children: 185d79636a20
Reviewer_1 claims that handwriting recognition is essentially solved: we
believe this is not true. Yes, the best methods have been getting
essentially human performance in the case of clean digits, but we are not
aware of previous papers achieving human performance on the full character
set. It is clear from our own experimentation (playing with the demo will
quickly convince you) that humans still clearly outperform machines when
the characters are heavily distorted (e.g. as in our NISTP dataset).

"...not intended to compete with the state-of-the-art...": we had in fact
included comparisons with the state-of-the-art on the NIST dataset (and
beat it).

"the demonstrations that self-taught learning can help deep learners is
helpful": indeed, but it is even more interesting to consider the result
that self-taught learning was found *more helpful for deep learners than
for shallow ones*. Since out-of-distribution data is common (especially
out-of-class data), this is of practical importance.

Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary
SVMs cannot be used on such large datasets. We will explore SVM variants
(including the particular suggestion made) and hope to add SVM results to
the final paper.
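To make the kind of variant we have in mind concrete, here is a minimal
sketch (Python/scikit-learn; the data stream, shapes, and hyperparameters
are illustrative placeholders rather than our actual setup): random Fourier
features approximating an RBF kernel, followed by a linear SVM trained
online by SGD, so that time and memory stay linear in the number of
examples.

    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)

    def minibatches(n_batches=100, batch_size=1000, dim=32 * 32, n_classes=62):
        """Hypothetical stand-in for a stream of labeled character images."""
        for _ in range(n_batches):
            yield rng.rand(batch_size, dim), rng.randint(n_classes, size=batch_size)

    # Explicit random features whose dot products approximate an RBF kernel,
    # so the n-by-n kernel matrix of an ordinary SVM is never formed.
    rbf = RBFSampler(gamma=0.01, n_components=2000, random_state=0)
    rbf.fit(np.zeros((1, 32 * 32)))  # fit only records the input dimension

    # Linear SVM (hinge loss) updated online, one minibatch at a time.
    svm = SGDClassifier(loss="hinge", alpha=1e-6)
    for X, y in minibatches():
        svm.partial_fit(rbf.transform(X), y, classes=np.arange(62))

Because the feature map is explicit, training never needs to hold the full
kernel matrix in memory, which is what rules out ordinary SVMs at this scale.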
"...it would be helpful to provide some theoretical analysis...": indeed,
but this is either mathematically challenging (to say the least, since deep
models involve a non-convex optimization) or would likely require very
strong assumptions on the data distribution. However, there exists
theoretical literature which answers some basic questions about this issue,
starting with the work of Jonathan Baxter (COLT 1995), "Learning internal
representations". The argument is about capacity and sharing it across
tasks so as to achieve better generalization. The lower layers implement
features that can potentially be shared across tasks. As long as some
sharing is possible (because the same features can be useful for several
tasks), there is a potential benefit from shared internal representations.
Whereas a one-hidden-layer MLP can only share linear features, a deep
architecture can share non-linear ones, which have the potential for
representing more abstract concepts.

Reviewer_5, about semi-supervised learning: in the unsupervised phase, no
labels are used; in the supervised fine-tuning phase, all labels are used,
so this is not the semi-supervised setting. This paper did not examine the
potential advantage of exploiting large quantities of additional unlabeled
data, but the availability of the generated dataset and of the learning
setup would make it easy to conduct a study answering this interesting
question. Note, however, that previous work [5] already investigated the
relative advantage of the semi-supervised setting for deep vs. shallow
architectures, which is why we did not focus on it here. It might still be
worth doing these experiments because the deep learning algorithms were
different.

"...human errors may be present...": indeed, there are variations across
human labelings, which we have estimated (since each character was viewed
by 3 different humans) and reported in the paper. The standard deviations
across humans are large, but the standard error across a large test set is
very small (the standard error of a mean shrinks as the standard deviation
divided by the square root of the number of test examples), so we believe
the average error numbers to be fairly accurate.

"...authors do cite a supplement, but I did not have access to it...": that
is strange. We could (and still can) access it from the CMT web site. We
will make sure to include complete pseudo-code for SDAs in it.
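In the meantime, the sketch below conveys the two-phase recipe discussed in
the reply to Reviewer_5 above: greedy layer-wise denoising-autoencoder
pre-training without labels, then supervised fine-tuning of the whole stack
with labels. It is plain numpy with illustrative shapes and hyperparameters,
not the actual implementation behind the paper or its pseudo-code.

    import numpy as np

    rng = np.random.RandomState(0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    def pretrain_dae_layer(X, n_hidden, noise=0.3, lr=0.1, epochs=10):
        """Unsupervised phase: one denoising autoencoder (tied weights), no labels."""
        n_in = X.shape[1]
        W = rng.normal(0.0, 0.01, (n_in, n_hidden))
        b, c = np.zeros(n_hidden), np.zeros(n_in)
        for _ in range(epochs):
            Xn = X * (rng.rand(*X.shape) > noise)   # masking corruption
            H = sigmoid(Xn @ W + b)                 # encode the corrupted input
            Z = sigmoid(H @ W.T + c)                # decode: reconstruct the clean input
            dz = (Z - X) / len(X)                   # cross-entropy output gradient
            dh = (dz @ W) * H * (1.0 - H)
            W -= lr * (Xn.T @ dh + dz.T @ H)        # encoder + decoder contributions
            b -= lr * dh.sum(0)
            c -= lr * dz.sum(0)
        return W, b

    # Illustrative data: 1000 examples of 28x28 "images", 62 character classes.
    X = rng.rand(1000, 784)
    y = rng.randint(62, size=1000)

    # Phase 1: greedy layer-wise pre-training; each layer sees the code below it.
    layers, H = [], X
    for n_hidden in (500, 500):
        W, b = pretrain_dae_layer(H, n_hidden)
        layers.append([W, b])
        H = sigmoid(H @ W + b)

    # Phase 2: supervised fine-tuning of the whole stack, using all the labels.
    V, d = rng.normal(0.0, 0.01, (500, 62)), np.zeros(62)  # softmax on top
    for _ in range(10):
        acts = [X]
        for W, b in layers:                         # forward pass through the stack
            acts.append(sigmoid(acts[-1] @ W + b))
        P = np.exp(acts[-1] @ V + d)
        P /= P.sum(1, keepdims=True)                # softmax class probabilities
        G = P
        G[np.arange(len(y)), y] -= 1.0              # gradient of the neg. log-likelihood
        G /= len(y)
        delta = (G @ V.T) * acts[-1] * (1.0 - acts[-1])
        V -= 0.1 * acts[-1].T @ G
        d -= 0.1 * G.sum(0)
        for i in range(len(layers) - 1, -1, -1):    # backpropagate into the stack
            W, b = layers[i]
            prev = (delta @ W.T) * acts[i] * (1.0 - acts[i])
            layers[i] = [W - 0.1 * acts[i].T @ delta, b - 0.1 * delta.sum(0)]
            delta = prev

Note that the labels y enter only in the second phase, which is why this
setup is not semi-supervised in the usual sense.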
+"...authors do cite a supplement, but I did not have access to it...": that +is strange. We could (and still can) access it from the CMT web site. We +will make sure to include a complete pseudo-code of SDAs in it. -"...main contributions of the manuscript...": the main -contribution is actually to show that the self-taught learning setting is more -beneficial to deeper architectures. +"...main contributions of the manuscript...": the main contribution is +actually to show that the self-taught learning setting is more beneficial +to deeper architectures. -"...restriction to MLPs...": that restriction was motivated by the computational -challenge of training on hundreds of millions of examples. Apart from linear -models (which do not fare well on this task), it is not clear to us what -could be used, and so MLPs were the -obvious candidates to compare with. We will explore the use of SVM -approximations, as suggested by Reviewer_1. Other suggestions are welcome. +"...restriction to MLPs...": that restriction was motivated by the +computational challenge of training on hundreds of millions of +examples. Apart from linear models (which do not fare well on this task), +it is not clear to us what could be used, and so MLPs were the obvious +candidates to compare with. We will explore the use of SVM approximations, +as suggested by Reviewer_1. Other suggestions are welcome. "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of -prior work on character recognition using deformations and transformations". -The main originality is in showing that deep learners can take more advantage -than shallow learners of such data and of the self-taught learning framework in -general. +prior work on character recognition using deformations and +transformations". The main originality is in showing that deep learners +can take more advantage than shallow learners of such data and of the +self-taught learning framework in general.