writeup/nips_rebuttal_clean.txt @ 575:bff9ab360ef4 (nips_rebuttal_clean)

author:    Yoshua Bengio <bengioy@iro.umontreal.ca>
date:      Sat, 07 Aug 2010 22:46:12 -0400
parents:   d12b9a1432e8
children:  185d79636a20
Reviewer_1 claims that handwriting recognition is essentially solved: we
believe this is not true. Yes, the best methods have been getting
essentially human performance in the case of clean digits. But we are not
aware of previous papers achieving human performance on the full character
set. It is clear from our own experimentation (play with the demo to
convince yourself) that humans still clearly outperform machines when the
characters are heavily distorted (e.g. as in our NISTP dataset).

11 "...not intended to compete with the state-of-the-art...": We had included | |
11 comparisons with the state-of-the-art on the NIST dataset (and beat it). | 12 comparisons with the state-of-the-art on the NIST dataset (and beat it). |
12 | 13 |
14 | |
13 "the demonstrations that self-taught learning can help deep learners is | 15 "the demonstrations that self-taught learning can help deep learners is |
14 helpful": indeed, but it is even more interesting to consider the result that | 16 helpful": indeed, but it is even more interesting to consider the result |
15 self-taught learning was found *more helpful for deep learners than for shallow | 17 that self-taught learning was found *more helpful for deep learners than |
16 ones*. Since the availability of out-of-distribution data is common (especially | 18 for shallow ones*. Since out-of-distribution data is common (especially |
17 out-of-class data), this is of practical importance. | 19 out-of-class data), this is of practical importance. |
18 | 20 |
19 Reviewer_4: "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be | 21 Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary |
20 used on such large datasets, and indeed it is a good idea to explore variants of | 22 SVMs cannot be used on such large datasets. We will explore SVM variants |
21 SVMs or approximations of SVMs. We will continue exploring this thread (and the | 23 such as the suggestion made to add SVM results to the paper. |
22 particular suggestion made) and hope to include these results in the final | |
23 paper, to add more shallow learners to the comparison. | |
24 | 24 |
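As one illustration of what such an approximation could look like (a rough
sketch with hypothetical sizes and hyper-parameters, not the experiment we
will actually run), a linear SVM trained by stochastic gradient descent on
top of an explicit kernel approximation scales to datasets of this size:

    # Hypothetical sketch: approximate kernelized SVM for very large datasets.
    # Dataset sizes, shapes and hyper-parameters below are placeholders.
    import numpy as np
    from sklearn.kernel_approximation import RBFSampler  # random Fourier features
    from sklearn.linear_model import SGDClassifier       # linear SVM via SGD (hinge loss)

    rng = np.random.RandomState(0)
    X_train = rng.rand(10000, 1024)        # placeholder for 32x32 character images
    y_train = rng.randint(0, 62, 10000)    # placeholder labels (62 character classes)

    # Map inputs to an explicit feature space approximating an RBF kernel,
    # then fit a linear SVM with stochastic gradient descent on the hinge loss.
    features = RBFSampler(gamma=0.01, n_components=2000, random_state=0)
    svm = SGDClassifier(loss="hinge", alpha=1e-5)
    svm.fit(features.fit_transform(X_train), y_train)
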
25 "...it would be helpful to provide some theoretical analysis...": indeed, but | 25 |
26 this is either mathematically challenging (to say the least, since deep models | 26 "...it would be helpful to provide some theoretical analysis...": indeed, |
27 involve a non-convex optimization) or would likely require very strong | 27 but this is either mathematically challenging (to say the least, since deep |
28 assumptions on the data distribution. However, there exists | 28 models involve a non-convex optimization) or would likely require very |
29 strong assumptions on the data distribution. However, there exists | |
29 theoretical literature which answers some basic questions about this issue, | 30 theoretical literature which answers some basic questions about this issue, |
30 starting with the work of Jonathan Baxter (COLT 1995) "Learning internal | 31 starting with the work of Jonathan Baxter (COLT 1995) "Learning internal |
31 representations". The argument is about capacity | 32 representations". The argument is about capacity and sharing it across |
32 and sharing it across tasks so as to achieve better generalization. The lower | 33 tasks so as to achieve better generalization. The lower layers implement |
33 layers implement features that can potentially be shared across tasks. As long | 34 features that can potentially be shared across tasks. As long as some |
34 as some sharing is possible (because the same features can be useful for several | 35 sharing is possible (because the same features can be useful for several |
35 tasks), then there is a potential benefit from shared | 36 tasks), then there is a potential benefit from shared internal |
36 internal representations. Whereas a one-hidden-layer MLP can only share linear | 37 representations. Whereas a one-hidden-layer MLP can only share linear |
37 features, a deep architecture can share non-linear ones which have the potential | 38 features, a deep architecture can share non-linear ones which have the |
38 for representing more abstract concepts. | 39 potential for representing more abstract concepts. |
39 | 40 |
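To make the sharing argument concrete, here is a minimal illustrative
sketch (our own illustration, with made-up layer sizes, not code from the
paper) of a network whose lower layers are shared across tasks while only
the output layers are task-specific:

    # Illustrative sketch: two tasks sharing deep, non-linear lower layers.
    import numpy as np

    rng = np.random.RandomState(0)

    def layer(n_in, n_out):
        return rng.randn(n_in, n_out) * 0.01, np.zeros(n_out)

    # Shared lower layers: features potentially reusable by every task.
    W1, b1 = layer(1024, 500)
    W2, b2 = layer(500, 500)

    # Task-specific output layers (e.g., digits vs. letters).
    W_digits, b_digits = layer(500, 10)
    W_letters, b_letters = layer(500, 52)

    def shared_features(x):
        h1 = np.tanh(x @ W1 + b1)     # non-linear shared features
        return np.tanh(h1 @ W2 + b2)  # more abstract shared features

    def predict_digits(x):
        return shared_features(x) @ W_digits + b_digits

    def predict_letters(x):
        return shared_features(x) @ W_letters + b_letters
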
Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
labels are used. In the supervised fine-tuning phase, all labels are used,
so this is not the semi-supervised setting. This paper did not examine the
potential advantage of exploiting large quantities of additional unlabeled
data, but the availability of the generated dataset and of the learning
setup would make it possible to easily conduct a study to answer this
interesting question. Note however that previous work [5] already
investigated the relative advantage of the semi-supervised setting for deep
vs shallow architectures, which is why we did not focus on this here. It
might still be worthwhile to do these experiments because the deep learning
algorithms were different.

51 "...human errors may be present...": Indeed, there are variations across human | 53 "...human errors may be present...": Indeed, there are variations across |
52 labelings, which have have estimated (since each character | 54 human labelings, which have have estimated (since each character was viewed |
53 was viewed by 3 different humans), and reported in the paper (the standard | 55 by 3 different humans), and reported in the paper (the standard deviations |
54 deviations across humans are large, but the standard error across a large test | 56 across humans are large, but the standard error across a large test set is |
55 set is very small, so we believe the average error numbers to be fairly | 57 very small, so we believe the average error numbers to be fairly accurate). |
56 accurate). | |
57 | 58 |
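For illustration only (with made-up numbers, not the paper's figures), the
point is simply that the standard error of a mean error rate shrinks with
the square root of the test set size, even when per-example disagreement
between labelers is large:

    # Toy illustration with made-up numbers (not the paper's figures).
    import math

    n_test = 100000        # hypothetical number of test characters
    error_rate = 0.15      # hypothetical mean human error rate
    # Standard deviation of the per-example 0/1 error indicator:
    sigma = math.sqrt(error_rate * (1 - error_rate))  # about 0.36: large
    standard_error = sigma / math.sqrt(n_test)        # about 0.001: very small
    print(sigma, standard_error)
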
58 "...authors do cite a supplement, but I did not have access to it...": that is | 59 "...authors do cite a supplement, but I did not have access to it...": that |
59 strange. We could (and still can) access it from the CMT web site. We will make | 60 is strange. We could (and still can) access it from the CMT web site. We |
60 sure to include a complete pseudo-code of SDAs in it. | 61 will make sure to include a complete pseudo-code of SDAs in it. |
61 | 62 |
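As a placeholder until then, here is a minimal sketch of the unsupervised
pre-training phase of a standard stacked denoising auto-encoder (greedy
layer-wise training with masking noise and tied weights); all sizes and
hyper-parameters are placeholders rather than the settings used in the
paper, and the supervised fine-tuning of the whole stack is only indicated
at the end:

    # Sketch of SDA pre-training (placeholder hyper-parameters, not the paper's).
    import numpy as np

    rng = np.random.RandomState(0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    def pretrain_denoising_layer(X, n_hidden, noise=0.3, lr=0.1, epochs=5):
        """One denoising auto-encoder layer, trained to reconstruct X from a
        corrupted version of X (masking noise), with tied weights."""
        n_in = X.shape[1]
        W = rng.randn(n_in, n_hidden) * 0.01
        b, c = np.zeros(n_hidden), np.zeros(n_in)
        for _ in range(epochs):
            X_tilde = X * (rng.rand(*X.shape) > noise)  # randomly mask inputs
            H = sigmoid(X_tilde @ W + b)                # encode
            R = sigmoid(H @ W.T + c)                    # decode (tied weights)
            # Gradients of the squared reconstruction error ||R - X||^2 / 2
            dR = (R - X) * R * (1 - R)
            dH = (dR @ W) * H * (1 - H)
            W -= lr * (X_tilde.T @ dH + dR.T @ H) / len(X)
            b -= lr * dH.mean(0)
            c -= lr * dR.mean(0)
        return W, b

    # Unsupervised phase: greedily stack layers, each trained on the
    # representation produced by the previously trained layers.
    X = rng.rand(1000, 784)                             # placeholder unlabeled data
    stack, H = [], X
    for n_hidden in (500, 500, 500):
        W, b = pretrain_denoising_layer(H, n_hidden)
        stack.append((W, b))
        H = sigmoid(H @ W + b)

    # Supervised phase (not shown): add a softmax output layer on top of the
    # stack and fine-tune all parameters by back-propagation on labeled data.
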
62 "...main contributions of the manuscript...": the main | 63 "...main contributions of the manuscript...": the main contribution is |
63 contribution is actually to show that the self-taught learning setting is more | 64 actually to show that the self-taught learning setting is more beneficial |
64 beneficial to deeper architectures. | 65 to deeper architectures. |
65 | 66 |
66 "...restriction to MLPs...": that restriction was motivated by the computational | 67 "...restriction to MLPs...": that restriction was motivated by the |
67 challenge of training on hundreds of millions of examples. Apart from linear | 68 computational challenge of training on hundreds of millions of |
68 models (which do not fare well on this task), it is not clear to us what | 69 examples. Apart from linear models (which do not fare well on this task), |
69 could be used, and so MLPs were the | 70 it is not clear to us what could be used, and so MLPs were the obvious |
70 obvious candidates to compare with. We will explore the use of SVM | 71 candidates to compare with. We will explore the use of SVM approximations, |
71 approximations, as suggested by Reviewer_1. Other suggestions are welcome. | 72 as suggested by Reviewer_1. Other suggestions are welcome. |
72 | 73 |
73 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of | 74 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of |
74 prior work on character recognition using deformations and transformations". | 75 prior work on character recognition using deformations and |
75 The main originality is in showing that deep learners can take more advantage | 76 transformations". The main originality is in showing that deep learners |
76 than shallow learners of such data and of the self-taught learning framework in | 77 can take more advantage than shallow learners of such data and of the |
77 general. | 78 self-taught learning framework in general. |
78 | 79 |