comparison writeup/nips_rebuttal_clean.txt @ 575:bff9ab360ef4

nips_rebuttal_clean
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 07 Aug 2010 22:46:12 -0400
parents d12b9a1432e8
children 185d79636a20
Reviewer_1 claims that handwriting recognition is essentially solved; we
believe this is not true. Yes, the best methods have reached essentially
human performance on clean digits, but we are not aware of previous papers
achieving human performance on the full character set. It is also clear
from our own experimentation (playing with the provided demo will quickly
convince you) that humans still clearly outperform machines when the
characters are heavily distorted (e.g. as in our NISTP dataset).

11 "...not intended to compete with the state-of-the-art...": We had included
11 comparisons with the state-of-the-art on the NIST dataset (and beat it). 12 comparisons with the state-of-the-art on the NIST dataset (and beat it).
12 13
14
13 "the demonstrations that self-taught learning can help deep learners is 15 "the demonstrations that self-taught learning can help deep learners is
14 helpful": indeed, but it is even more interesting to consider the result that 16 helpful": indeed, but it is even more interesting to consider the result
15 self-taught learning was found *more helpful for deep learners than for shallow 17 that self-taught learning was found *more helpful for deep learners than
16 ones*. Since the availability of out-of-distribution data is common (especially 18 for shallow ones*. Since out-of-distribution data is common (especially
17 out-of-class data), this is of practical importance. 19 out-of-class data), this is of practical importance.
18 20
19 Reviewer_4: "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be 21 Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary
20 used on such large datasets, and indeed it is a good idea to explore variants of 22 SVMs cannot be used on such large datasets. We will explore SVM variants
21 SVMs or approximations of SVMs. We will continue exploring this thread (and the 23 such as the suggestion made to add SVM results to the paper.
22 particular suggestion made) and hope to include these results in the final
23 paper, to add more shallow learners to the comparison.
24 24
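For concreteness, here is the kind of SVM approximation we have in mind (an
illustrative sketch only, using scikit-learn's random Fourier features and
an SGD-trained linear SVM, with made-up data and untuned hyperparameters;
it is not a commitment to this exact variant):

    # Sketch of an SVM approximation that scales to very large datasets:
    # random Fourier features for the RBF kernel (Rahimi & Recht, 2007)
    # feeding a linear SVM trained with SGD on the hinge loss. The data
    # generator is a placeholder and hyperparameters are illustrative.
    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier

    n_classes = 62                      # digits plus upper/lower case letters
    rbf = RBFSampler(gamma=0.01, n_components=2000, random_state=0)
    svm = SGDClassifier(loss="hinge", alpha=1e-6)   # linear SVM objective

    def minibatches(n_chunks=100, chunk=1024, rng=np.random.RandomState(0)):
        """Placeholder generator standing in for streaming NIST-like data."""
        for _ in range(n_chunks):
            X = rng.rand(chunk, 32 * 32)        # fake 32x32 character images
            y = rng.randint(n_classes, size=chunk)
            yield X, y

    for i, (X, y) in enumerate(minibatches()):
        Z = rbf.fit_transform(X) if i == 0 else rbf.transform(X)
        svm.partial_fit(Z, y, classes=np.arange(n_classes))
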
25 "...it would be helpful to provide some theoretical analysis...": indeed, but 25
26 this is either mathematically challenging (to say the least, since deep models 26 "...it would be helpful to provide some theoretical analysis...": indeed,
27 involve a non-convex optimization) or would likely require very strong 27 but this is either mathematically challenging (to say the least, since deep
28 assumptions on the data distribution. However, there exists 28 models involve a non-convex optimization) or would likely require very
29 strong assumptions on the data distribution. However, there exists
29 theoretical literature which answers some basic questions about this issue, 30 theoretical literature which answers some basic questions about this issue,
30 starting with the work of Jonathan Baxter (COLT 1995) "Learning internal 31 starting with the work of Jonathan Baxter (COLT 1995) "Learning internal
31 representations". The argument is about capacity 32 representations". The argument is about capacity and sharing it across
32 and sharing it across tasks so as to achieve better generalization. The lower 33 tasks so as to achieve better generalization. The lower layers implement
33 layers implement features that can potentially be shared across tasks. As long 34 features that can potentially be shared across tasks. As long as some
34 as some sharing is possible (because the same features can be useful for several 35 sharing is possible (because the same features can be useful for several
35 tasks), then there is a potential benefit from shared 36 tasks), then there is a potential benefit from shared internal
36 internal representations. Whereas a one-hidden-layer MLP can only share linear 37 representations. Whereas a one-hidden-layer MLP can only share linear
37 features, a deep architecture can share non-linear ones which have the potential 38 features, a deep architecture can share non-linear ones which have the
38 for representing more abstract concepts. 39 potential for representing more abstract concepts.
39 40
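To make the capacity-sharing argument concrete, here is a toy NumPy sketch
(made-up layer sizes and random data, not our actual model) in which two
tasks reuse the same stack of non-linear hidden layers and differ only in
their output layers:

    # Two tasks share the deep non-linear feature extractor (the candidate
    # "internal representation"); only the output layers are task-specific.
    import numpy as np

    rng = np.random.RandomState(0)
    n_in, n_hid, n_out, n_tasks = 32 * 32, 500, 62, 2

    # Shared parameters: two stacked non-linear layers.
    W1, b1 = 0.01 * rng.randn(n_in, n_hid), np.zeros(n_hid)
    W2, b2 = 0.01 * rng.randn(n_hid, n_hid), np.zeros(n_hid)
    # Task-specific linear output layers.
    heads = [(0.01 * rng.randn(n_hid, n_out), np.zeros(n_out))
             for _ in range(n_tasks)]

    def shared_features(x):
        """Deep non-linear features reused by every task."""
        h1 = np.tanh(x @ W1 + b1)
        return np.tanh(h1 @ W2 + b2)

    def predict(x, task):
        """Per-task scores computed on top of the shared representation."""
        V, c = heads[task]
        return shared_features(x) @ V + c

    x = rng.rand(10, n_in)            # a fake minibatch of character images
    scores_task0, scores_task1 = predict(x, 0), predict(x, 1)
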
Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
labels are used. In the supervised fine-tuning phase, all labels are used,
so this is not the semi-supervised setting. This paper did not examine the
potential advantage of exploiting large quantities of additional unlabeled
data, but the availability of the generated dataset and of the learning
setup would make it easy to conduct a study answering this interesting
question. Note however that previous work [5] already investigated the
relative advantage of the semi-supervised setting for deep vs shallow
architectures, which is why we did not focus on it here. It might still be
worthwhile to run these experiments, because the deep learning algorithms
were different.

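For clarity about what each phase sees, here is a toy NumPy sketch of the
two phases (tiny made-up sizes and random data; this is not our actual SDA
implementation, whose pseudo-code will appear in the supplement). The
pre-training loop never reads the labels, while the fine-tuning loop uses
all of them:

    import numpy as np

    rng = np.random.RandomState(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):
        e = np.exp(a - a.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    X = rng.rand(256, 64)                  # inputs (e.g. pixel intensities)
    y = rng.randint(10, size=256)          # labels, used only in phase 2
    onehot = np.eye(10)[y]
    sizes, lr = [64, 50, 50], 0.1
    Ws = [0.1 * rng.randn(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(b) for b in sizes[1:]]

    # Phase 1: greedy layer-wise denoising autoencoder pre-training (no labels).
    H = X
    for l in range(len(Ws)):
        V, c = 0.1 * rng.randn(sizes[l + 1], sizes[l]), np.zeros(sizes[l])
        for _ in range(200):
            noisy = H * (rng.rand(*H.shape) > 0.25)      # masking corruption
            h = sigmoid(noisy @ Ws[l] + bs[l])
            r = sigmoid(h @ V + c)                       # reconstruct clean H
            d_r = (r - H) * r * (1 - r) / len(H)         # squared-error grad
            d_h = (d_r @ V.T) * h * (1 - h)
            V -= lr * h.T @ d_r
            c -= lr * d_r.sum(0)
            Ws[l] -= lr * noisy.T @ d_h
            bs[l] -= lr * d_h.sum(0)
        H = sigmoid(H @ Ws[l] + bs[l])                   # feed the next layer

    # Phase 2: supervised fine-tuning of the whole stack (all labels used).
    U, d = 0.1 * rng.randn(sizes[-1], 10), np.zeros(10)
    for _ in range(200):
        acts = [X]
        for W, b in zip(Ws, bs):                         # forward pass
            acts.append(sigmoid(acts[-1] @ W + b))
        p = softmax(acts[-1] @ U + d)
        delta = (p - onehot) / len(X)                    # cross-entropy grad
        dU, dd = acts[-1].T @ delta, delta.sum(0)
        delta = (delta @ U.T) * acts[-1] * (1 - acts[-1])
        U -= lr * dU
        d -= lr * dd
        for l in reversed(range(len(Ws))):               # backprop the stack
            dW, db = acts[l].T @ delta, delta.sum(0)
            if l > 0:
                delta = (delta @ Ws[l].T) * acts[l] * (1 - acts[l])
            Ws[l] -= lr * dW
            bs[l] -= lr * db
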
51 "...human errors may be present...": Indeed, there are variations across human 53 "...human errors may be present...": Indeed, there are variations across
52 labelings, which have have estimated (since each character 54 human labelings, which have have estimated (since each character was viewed
53 was viewed by 3 different humans), and reported in the paper (the standard 55 by 3 different humans), and reported in the paper (the standard deviations
54 deviations across humans are large, but the standard error across a large test 56 across humans are large, but the standard error across a large test set is
55 set is very small, so we believe the average error numbers to be fairly 57 very small, so we believe the average error numbers to be fairly accurate).
56 accurate).
57 58
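As a back-of-the-envelope illustration of this statistical point (all
numbers below are invented, not the ones reported in the paper):

    import numpy as np

    rng = np.random.RandomState(0)
    n_test = 100000                  # size of a large test set
    p_err = 0.17                     # hypothetical average human error rate

    # Variability across the 3 human labelers: the standard deviation is large.
    per_labeler_error = np.array([0.14, 0.17, 0.20])
    print("std across labelers:", per_labeler_error.std(ddof=1))

    # Standard error of the mean error over the large test set: very small.
    errors = (rng.rand(n_test) < p_err).astype(float)   # one outcome per example
    print("standard error over test set:", errors.std(ddof=1) / np.sqrt(n_test))
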
58 "...authors do cite a supplement, but I did not have access to it...": that is 59 "...authors do cite a supplement, but I did not have access to it...": that
59 strange. We could (and still can) access it from the CMT web site. We will make 60 is strange. We could (and still can) access it from the CMT web site. We
60 sure to include a complete pseudo-code of SDAs in it. 61 will make sure to include a complete pseudo-code of SDAs in it.
61 62
62 "...main contributions of the manuscript...": the main 63 "...main contributions of the manuscript...": the main contribution is
63 contribution is actually to show that the self-taught learning setting is more 64 actually to show that the self-taught learning setting is more beneficial
64 beneficial to deeper architectures. 65 to deeper architectures.
65 66
66 "...restriction to MLPs...": that restriction was motivated by the computational 67 "...restriction to MLPs...": that restriction was motivated by the
67 challenge of training on hundreds of millions of examples. Apart from linear 68 computational challenge of training on hundreds of millions of
68 models (which do not fare well on this task), it is not clear to us what 69 examples. Apart from linear models (which do not fare well on this task),
69 could be used, and so MLPs were the 70 it is not clear to us what could be used, and so MLPs were the obvious
70 obvious candidates to compare with. We will explore the use of SVM 71 candidates to compare with. We will explore the use of SVM approximations,
71 approximations, as suggested by Reviewer_1. Other suggestions are welcome. 72 as suggested by Reviewer_1. Other suggestions are welcome.
72 73
73 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of 74 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
74 prior work on character recognition using deformations and transformations". 75 prior work on character recognition using deformations and
75 The main originality is in showing that deep learners can take more advantage 76 transformations". The main originality is in showing that deep learners
76 than shallow learners of such data and of the self-taught learning framework in 77 can take more advantage than shallow learners of such data and of the
77 general. 78 self-taught learning framework in general.
78 79
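For readers unfamiliar with this kind of data generation, here is an
illustrative sketch of such deformations and transformations (random
magnitudes, a stand-in image, and SciPy routines; it is not our actual
perturbation pipeline):

    import numpy as np
    from scipy.ndimage import affine_transform, gaussian_filter, map_coordinates

    rng = np.random.RandomState(0)
    img = rng.rand(32, 32)                    # stand-in for a character image

    # Small random affine transformation (rotation, shear, scaling) about the
    # image center.
    theta = rng.uniform(-0.2, 0.2)
    shear = rng.uniform(-0.1, 0.1)
    scale = rng.uniform(0.9, 1.1)
    A = scale * np.array([[np.cos(theta), -np.sin(theta) + shear],
                          [np.sin(theta),  np.cos(theta)]])
    center = np.array(img.shape) / 2.0
    warped = affine_transform(img, A, offset=center - A @ center, order=1)

    # Elastic-style local distortion: a smooth random displacement field.
    dy = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma=4) * 6
    dx = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma=4) * 6
    ys, xs = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
    distorted = map_coordinates(warped, [ys + dy, xs + dx], order=1)
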