diff writeup/nips_rebuttal_clean.txt @ 575:bff9ab360ef4

author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 07 Aug 2010 22:46:12 -0400
parents d12b9a1432e8
children 185d79636a20
--- a/writeup/nips_rebuttal_clean.txt	Sat Aug 07 18:39:36 2010 -0700
+++ b/writeup/nips_rebuttal_clean.txt	Sat Aug 07 22:46:12 2010 -0400
@@ -1,78 +1,79 @@
-Reviewer_1 claims that handwriting recognition is essentially solved, and we
-believe that this is not true. Indeed, the best methods have been
-getting essentially human performance in the case of clean digits. We are not
-aware of previous papers showing that human performance has been reached on the
-full character set. Furthermore, it is clear from our own experimentation that
-humans still greatly outperform machines when the characters are heavily
-distorted (e.g. the NISTP dataset). Playing with the provided demo will
-quickly convince you that this is true.
 
-"...not intended to compete with the state-of-the-art...": We actually included
+Reviewer_1 claims that handwriting recognition is essentially solved: we
+believe this is not true. Yes, the best methods have reached essentially
+human performance on clean digits. But we are not aware of previous
+papers achieving human performance on the full character set. It is
+clear from our own experimentation (play with the demo to convince
+yourself) that humans still clearly outperform machines when the
+characters are heavily distorted (e.g. as in our NISTP dataset).
+
+
+"...not intended to compete with the state-of-the-art...": We had included
 comparisons with the state-of-the-art on the NIST dataset (and beat it).
 
+
 "the demonstrations that self-taught learning can help deep learners is
-helpful": indeed, but it is even more interesting to consider the result that
-self-taught learning was found *more helpful for deep learners than for shallow
-ones*. Since the availability of out-of-distribution data is common (especially
+helpful": indeed, but it is even more interesting to consider the result
+that self-taught learning was found *more helpful for deep learners than
+for shallow ones*. Since out-of-distribution data is common (especially
 out-of-class data), this is of practical importance.
 
-Reviewer_4: "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be
-used on such large datasets, and indeed it is a good idea to explore variants of
-SVMs or approximations of SVMs. We will continue exploring this thread (and the
-particular suggestion made) and hope to include these results in the final
-paper, to add more shallow learners to the comparison.
+Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary
+SVMs cannot be used on such large datasets. We will explore SVM variants
+such as the suggestion made to add SVM results to the paper.
+
 
-"...it would be helpful to provide some theoretical analysis...": indeed, but
-this is either mathematically challenging (to say the least, since deep models
-involve a non-convex optimization) or would likely require very strong
-assumptions on the data distribution. However, there exists
+"...it would be helpful to provide some theoretical analysis...": indeed,
+but this is either mathematically challenging (to say the least, since deep
+models involve a non-convex optimization) or would likely require very
+strong assumptions on the data distribution. However, there exists
 theoretical literature which answers some basic questions about this issue,
 starting with the work of Jonathan Baxter (COLT 1995) "Learning internal
-representations". The argument is about capacity
-and sharing it across tasks so as to achieve better generalization. The lower
-layers implement features that can potentially be shared across tasks. As long
-as some sharing is possible (because the same features can be useful for several
-tasks), then there is a potential benefit from shared
-internal representations. Whereas a one-hidden-layer MLP can only share linear
-features, a deep architecture can share non-linear ones which have the potential
-for representing more abstract concepts.
+representations". The argument is about capacity and sharing it across
+tasks so as to achieve better generalization. The lower layers implement
+features that can potentially be shared across tasks. As long as some
+sharing is possible (because the same features can be useful for several
+tasks), then there is a potential benefit from shared internal
+representations. Whereas a one-hidden-layer MLP can only share linear
+features, a deep architecture can share non-linear ones which have the
+potential for representing more abstract concepts.
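+
+To make the sharing argument concrete, here is a minimal sketch (the
+layer sizes, the two hypothetical tasks and the NumPy setup are ours,
+for illustration only): both task-specific output layers reuse the same
+stack of shared non-linear features.
+
+import numpy as np
+
+rng = np.random.RandomState(0)
+relu = lambda x: np.maximum(0.0, x)
+
+# Hypothetical sizes: 32x32 inputs, two shared non-linear layers.
+W1 = rng.randn(1024, 500) * 0.01
+W2 = rng.randn(500, 500) * 0.01
+V_chars = rng.randn(500, 62) * 0.01   # e.g. a 62-class character task
+V_digits = rng.randn(500, 10) * 0.01  # e.g. a 10-class digit task
+
+def shared_features(x):
+    # Deep case: a composition of non-linear layers is shared, so both
+    # tasks can reuse abstract features; a one-hidden-layer MLP could
+    # only share the single layer relu(x @ W1).
+    return relu(relu(x @ W1) @ W2)
+
+def predict_characters(x):
+    return shared_features(x) @ V_chars
+
+def predict_digits(x):
+    return shared_features(x) @ V_digits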
 
-Reviewer_5 about semi-supervised learning: In the unsupervised phase, no labels
-are used. In the supervised fine-tuning phase, all labels are used, so this is
-not the semi-supervised setting. This paper did not examine the potential
-advantage of exploiting large quantities of additional unlabeled data, but the
-availability of the generated dataset and of the learning setup would make it
-possible to easily conduct a study to answer this interesting
-question. Note however that previous work [5] already investigated the relative
-advantage of the semi-supervised setting for deep vs shallow architectures,
-which is why we did not focus on this here. It might still be worth to do these
-experiments because the deep learning algorithms were different.
+Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
+labels are used. In the supervised fine-tuning phase, all labels are used,
+so this is not the semi-supervised setting. This paper did not examine the
+potential advantage of exploiting large quantities of additional unlabeled
+data, but the availability of the generated dataset and of the learning
+setup would make it possible to easily conduct a study to answer this
+interesting question. Note however that previous work [5] already
+investigated the relative advantage of the semi-supervised setting for deep
+vs shallow architectures, which is why we did not focus on this here. It
+might still be worthwhile to do these experiments, because the deep
+learning algorithms were different.
 
-"...human errors may be present...": Indeed, there are variations across human
-labelings, which have have estimated (since each character
-was viewed by 3 different humans), and reported in the paper (the standard
-deviations across humans are large, but the standard error across a large test
-set is very small, so we believe the average error numbers to be fairly
-accurate).
+"...human errors may be present...": Indeed, there are variations across
+human labelings, which have have estimated (since each character was viewed
+by 3 different humans), and reported in the paper (the standard deviations
+across humans are large, but the standard error across a large test set is
+very small, so we believe the average error numbers to be fairly accurate).
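+
+To illustrate the distinction (with hypothetical numbers, not the
+paper's): the standard error of a mean shrinks with the square root of
+the test-set size, so a large per-character disagreement still yields an
+accurate average.
+
+import math
+
+# Hypothetical illustration: a large per-character standard deviation
+# still gives a tiny standard error over a large test set.
+sd_per_char = 0.2                # illustrative disagreement level
+n_test = 800000                  # illustrative test-set size
+se = sd_per_char / math.sqrt(n_test)
+print(se)                        # ~0.00022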
 
-"...authors do cite a supplement, but I did not have access to it...": that is
-strange. We could (and still can) access it from the CMT web site. We will make
-sure to include a complete pseudo-code of SDAs in it.
+"...authors do cite a supplement, but I did not have access to it...": that
+is strange. We could (and still can) access it from the CMT web site. We
+will make sure to include a complete pseudo-code of SDAs in it.
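+
+In the meantime, here is a compact runnable sketch of the two phases
+(greedy layer-wise unsupervised denoising pre-training, then supervised
+fine-tuning). The layer sizes, masking noise, learning rates, losses and
+toy data below are illustrative placeholders, not our actual settings:
+
+import numpy as np
+
+rng = np.random.RandomState(0)
+sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
+
+# Hypothetical tiny SDA: input dim 64, then two hidden layers.
+sizes = [64, 32, 16]
+Ws = [rng.randn(a, b) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
+
+def pretrain(X, lr=0.1, noise=0.3, epochs=10):
+    """Greedy layer-wise denoising pre-training: no labels are used."""
+    h = X
+    for i, W in enumerate(Ws):
+        for _ in range(epochs):
+            x_t = h * (rng.rand(*h.shape) > noise)  # masking corruption
+            code = sigmoid(x_t @ W)                 # encode corrupted input
+            rec = sigmoid(code @ W.T)               # decode (tied weights)
+            d_rec = (rec - h) * rec * (1 - rec)     # squared-error grad
+            d_code = (d_rec @ W) * code * (1 - code)
+            Ws[i] = W = W - lr * (d_rec.T @ code + x_t.T @ d_code) / len(h)
+        h = sigmoid(h @ W)                          # input for next layer
+
+def finetune(X, y, n_classes, lr=0.1, epochs=10):
+    """Supervised fine-tuning: all labels used, all layers updated."""
+    V = rng.randn(sizes[-1], n_classes) * 0.1       # softmax weights
+    Y = np.eye(n_classes)[y]                        # one-hot targets
+    for _ in range(epochs):
+        hs = [X]
+        for W in Ws:                                # forward pass
+            hs.append(sigmoid(hs[-1] @ W))
+        p = np.exp(hs[-1] @ V)
+        p /= p.sum(1, keepdims=True)                # softmax outputs
+        d = (p - Y) / len(X)                        # cross-entropy grad
+        gV, d = hs[-1].T @ d, d @ V.T
+        for i in reversed(range(len(Ws))):          # backprop through stack
+            d = d * hs[i + 1] * (1 - hs[i + 1])
+            gW, d = hs[i].T @ d, d @ Ws[i].T
+            Ws[i] -= lr * gW
+        V -= lr * gV
+
+# Toy usage (random data, shapes only): unlabeled then labeled phase.
+pretrain(rng.rand(500, 64))
+finetune(rng.rand(200, 64), rng.randint(0, 5, 200), n_classes=5)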
 
-"...main contributions of the manuscript...": the main
-contribution is actually to show that the self-taught learning setting is more
-beneficial to deeper architectures.
+"...main contributions of the manuscript...": the main contribution is
+actually to show that the self-taught learning setting is more beneficial
+to deeper architectures.
 
-"...restriction to MLPs...": that restriction was motivated by the computational
-challenge of training on hundreds of millions of examples. Apart from linear
-models (which do not fare well on this task), it is not clear to us what 
-could be used, and so MLPs were the
-obvious candidates to compare with. We will explore the use of SVM
-approximations, as suggested by Reviewer_1. Other suggestions are welcome.
+"...restriction to MLPs...": that restriction was motivated by the
+computational challenge of training on hundreds of millions of
+examples. Apart from linear models (which do not fare well on this task),
+it is not clear to us what could be used, and so MLPs were the obvious
+candidates to compare with. We will explore the use of SVM approximations,
+as suggested by Reviewer_1. Other suggestions are welcome.
 
 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
-prior work on character recognition using deformations and transformations".
-The main originality is in showing that deep learners can take more advantage
-than shallow learners of such data and of the self-taught learning framework in
-general.
+prior work on character recognition using deformations and
+transformations".  The main originality is in showing that deep learners
+can take more advantage than shallow learners of such data and of the
+self-taught learning framework in general.