# HG changeset patch
# User Yoshua Bengio
# Date 1281235572 14400
# Node ID bff9ab360ef4e6db90f5a422b792a281d690698d
# Parent d12b9a1432e8bef9e31adcf852c01c23e3667ad8
nips_rebuttal_clean
diff -r d12b9a1432e8 -r bff9ab360ef4 writeup/nips_rebuttal.txt
--- a/writeup/nips_rebuttal.txt Sat Aug 07 18:39:36 2010 -0700
+++ b/writeup/nips_rebuttal.txt Sat Aug 07 22:46:12 2010 -0400
@@ -9,16 +9,16 @@
 |||
 |||The paper is well-written and the contributions are presented clearly. However, this paper only presents the results established methods to an application that is already essentially solved.
-Reviewer_1 claims that handwriting recognition is essentially solved, and we believe that this is not true. It is true that the best methods have been getting essentially human performance in the case of clean digits. We are not aware of previous papers showing that human performance has been reached on the full character set. Furthermore, it is clear from our own experimentation that humans still greatly outperform machines when the characters are heavily distorted (e.g. as in our NISTP dataset). Playing with the provided demo will quickly convince you that this is true.
+Reviewer_1 claims that handwriting recognition is essentially solved: we believe this is not true. Yes the best methods have been getting essentially human performance in the case of clean digits. But we are not aware of previous papers achieving human performance on the full character set. It is clear from our own experimentation (play with the demo to convince yourself) that humans still clearly outperform machines when the characters are heavily distorted (e.g. as in our NISTP dataset).
 |||While the experiments were run thoroughly and engineered well, the results are not intended to compete with the state-of-the-art, so this is not an application paper. While the main conclusion -- that self-taught learning helps deep learners -- is somewhat interesting, it is not shown to apply generally, and even so is not significant enough to merit acceptance since both the models and self-taught learning methods have been previously shown to be useful (albeit separately).
-"...not intended to compete with the state-of-the-art...": We actually included comparisons with the state-of-the-art on the NIST dataset (and beat it).
+"...not intended to compete with the state-of-the-art...": We had included comparisons with the state-of-the-art on the NIST dataset (and beat it).
 |||
 |||Because the experiments were run well, the new datasets are useful contributions, and the demonstration that self-taught learning can help deep learners is helpful, it would be good for other researchers to see this work. It would be appropriate for a workshop or technical report, or as part of a review or survey paper.
-"the demonstrations that self-taught learning can help deep learners is helpful": indeed, but it is even more interesting to consider the result that self-taught learning was found MORE HELPFUL FOR DEEP LEARNERS THAN SHALLOW ONES. Since the availability of out-of-distribution data is common (especially out-of-class data), this is of practical importance.
+"the demonstrations that self-taught learning can help deep learners is helpful": indeed, but it is even more interesting to consider the result that self-taught learning was found MORE HELPFUL FOR DEEP LEARNERS THAN SHALLOW ONES. Since out-of-distribution data is common (especially out-of-class data), this is practically important.
 |||Please summarize your review in 1-2 sentences Since there is no technical or methodological contribution, this paper should not be accepted to this conference.
 |||Masked Reviewer ID: Assigned_Reviewer_4
@@ -33,7 +33,7 @@
 |||Another interesting observation is that deep learners benefit more from multi-task learning compared to shallow multi-layer perceptrons. It would also be interesting to compare to SVMs that are built incrementally, i.e. fit SVMs using a subset of data, retain support vectors, add more data, etc. This would better justify empirical findings.
 |||
-"It would also be interesting to compare to SVMs...": ordinary SVMs cannot be used on such large datasets, and indeed it is a good idea to explore variants of SVMs or approximations of SVMs. We will continue exploring this thread (and the particular suggestion made) and hope to include these results in the final paper, to add more shallow learners to the comparison.
+Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be used on such large datasets. We will explore SVM variants such as the suggestion made to add SVM results to the paper.
 |||While the paper is mostly empirical, it would be helpful to provide some theoretical analysis. It would be interesting to work out under what conditions one would expect deep models to benefit from out-of-distribution examples (obviously if the distribution of those examples is very different, it would naturally hurt model performance), or when one would expect deep models to benefit more from multi-task setting compared to shallow learners.
 |||
diff -r d12b9a1432e8 -r bff9ab360ef4 writeup/nips_rebuttal_clean.txt
--- a/writeup/nips_rebuttal_clean.txt Sat Aug 07 18:39:36 2010 -0700
+++ b/writeup/nips_rebuttal_clean.txt Sat Aug 07 22:46:12 2010 -0400
@@ -1,78 +1,79 @@
-Reviewer_1 claims that handwriting recognition is essentially solved, and we
-believe that this is not true. Indeed, the best methods have been
-getting essentially human performance in the case of clean digits. We are not
-aware of previous papers showing that human performance has been reached on the
-full character set. Furthermore, it is clear from our own experimentation that
-humans still greatly outperform machines when the characters are heavily
-distorted (e.g. the NISTP dataset). Playing with the provided demo will
-quickly convince you that this is true.
-"...not intended to compete with the state-of-the-art...": We actually included
+Reviewer_1 claims that handwriting recognition is essentially solved: we
+believe this is not true. Yes the best methods have been getting
+essentially human performance in the case of clean digits. But we are not
+aware of previous papers achieving human performance on the full character
+set. It is clear from our own experimentation (play with the demo to
+convince yourself) that humans still clearly outperform machines when the
+characters are heavily distorted (e.g. as in our NISTP dataset).
+
+
+"...not intended to compete with the state-of-the-art...": We had included
 comparisons with the state-of-the-art on the NIST dataset (and beat it).
+
 "the demonstrations that self-taught learning can help deep learners is
-helpful": indeed, but it is even more interesting to consider the result that
-self-taught learning was found *more helpful for deep learners than for shallow
-ones*. Since the availability of out-of-distribution data is common (especially
+helpful": indeed, but it is even more interesting to consider the result
+that self-taught learning was found *more helpful for deep learners than
+for shallow ones*. Since out-of-distribution data is common (especially
 out-of-class data), this is of practical importance.
-Reviewer_4: "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be
-used on such large datasets, and indeed it is a good idea to explore variants of
-SVMs or approximations of SVMs. We will continue exploring this thread (and the
-particular suggestion made) and hope to include these results in the final
-paper, to add more shallow learners to the comparison.
+Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary
+SVMs cannot be used on such large datasets. We will explore SVM variants
+such as the suggestion made to add SVM results to the paper.
+
-"...it would be helpful to provide some theoretical analysis...": indeed, but
-this is either mathematically challenging (to say the least, since deep models
-involve a non-convex optimization) or would likely require very strong
-assumptions on the data distribution. However, there exists
+"...it would be helpful to provide some theoretical analysis...": indeed,
+but this is either mathematically challenging (to say the least, since deep
+models involve a non-convex optimization) or would likely require very
+strong assumptions on the data distribution. However, there exists
 theoretical literature which answers some basic questions about this issue, starting with the work of Jonathan Baxter (COLT 1995) "Learning internal
-representations". The argument is about capacity
-and sharing it across tasks so as to achieve better generalization. The lower
-layers implement features that can potentially be shared across tasks. As long
-as some sharing is possible (because the same features can be useful for several
-tasks), then there is a potential benefit from shared
-internal representations. Whereas a one-hidden-layer MLP can only share linear
-features, a deep architecture can share non-linear ones which have the potential
-for representing more abstract concepts.
+representations". The argument is about capacity and sharing it across
+tasks so as to achieve better generalization. The lower layers implement
+features that can potentially be shared across tasks. As long as some
+sharing is possible (because the same features can be useful for several
+tasks), then there is a potential benefit from shared internal
+representations. Whereas a one-hidden-layer MLP can only share linear
+features, a deep architecture can share non-linear ones which have the
+potential for representing more abstract concepts.
-Reviewer_5 about semi-supervised learning: In the unsupervised phase, no labels
-are used. In the supervised fine-tuning phase, all labels are used, so this is
-not the semi-supervised setting. This paper did not examine the potential
-advantage of exploiting large quantities of additional unlabeled data, but the
-availability of the generated dataset and of the learning setup would make it
-possible to easily conduct a study to answer this interesting
-question. Note however that previous work [5] already investigated the relative
-advantage of the semi-supervised setting for deep vs shallow architectures,
-which is why we did not focus on this here. It might still be worth to do these
-experiments because the deep learning algorithms were different.
+Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
+labels are used. In the supervised fine-tuning phase, all labels are used,
+so this is not the semi-supervised setting. This paper did not examine the
+potential advantage of exploiting large quantities of additional unlabeled
+data, but the availability of the generated dataset and of the learning
+setup would make it possible to easily conduct a study to answer this
+interesting question. Note however that previous work [5] already
+investigated the relative advantage of the semi-supervised setting for deep
+vs shallow architectures, which is why we did not focus on this here. It
+might still be worth to do these experiments because the deep learning
+algorithms were different.
-"...human errors may be present...": Indeed, there are variations across human
-labelings, which have have estimated (since each character
-was viewed by 3 different humans), and reported in the paper (the standard
-deviations across humans are large, but the standard error across a large test
-set is very small, so we believe the average error numbers to be fairly
-accurate).
+"...human errors may be present...": Indeed, there are variations across
+human labelings, which have have estimated (since each character was viewed
+by 3 different humans), and reported in the paper (the standard deviations
+across humans are large, but the standard error across a large test set is
+very small, so we believe the average error numbers to be fairly accurate).
-"...authors do cite a supplement, but I did not have access to it...": that is
-strange. We could (and still can) access it from the CMT web site. We will make
-sure to include a complete pseudo-code of SDAs in it.
+"...authors do cite a supplement, but I did not have access to it...": that
+is strange. We could (and still can) access it from the CMT web site. We
+will make sure to include a complete pseudo-code of SDAs in it.
-"...main contributions of the manuscript...": the main
-contribution is actually to show that the self-taught learning setting is more
-beneficial to deeper architectures.
+"...main contributions of the manuscript...": the main contribution is
+actually to show that the self-taught learning setting is more beneficial
+to deeper architectures.
-"...restriction to MLPs...": that restriction was motivated by the computational
-challenge of training on hundreds of millions of examples. Apart from linear
-models (which do not fare well on this task), it is not clear to us what
-could be used, and so MLPs were the
-obvious candidates to compare with. We will explore the use of SVM
-approximations, as suggested by Reviewer_1. Other suggestions are welcome.
+"...restriction to MLPs...": that restriction was motivated by the
+computational challenge of training on hundreds of millions of
+examples. Apart from linear models (which do not fare well on this task),
+it is not clear to us what could be used, and so MLPs were the obvious
+candidates to compare with. We will explore the use of SVM approximations,
+as suggested by Reviewer_1. Other suggestions are welcome.
 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
-prior work on character recognition using deformations and transformations".
-The main originality is in showing that deep learners can take more advantage
-than shallow learners of such data and of the self-taught learning framework in
-general.
+prior work on character recognition using deformations and
+transformations". The main originality is in showing that deep learners
+can take more advantage than shallow learners of such data and of the
+self-taught learning framework in general.