comparison writeup/nips_rebuttal_clean.txt @ 579:5a777a2550e0

SVM performance is worse
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 08 Aug 2010 13:38:55 -0400
parents 61aae4fd2da5
children 83da863b924d

Reviewer_1 claims that handwriting recognition is essentially solved: we believe this is not true. Yes, the best methods have essentially reached human performance on clean digits, but we are not aware of previous papers achieving human performance on the full character set. Our own experimentation (play with the demo to convince yourself) makes it clear that humans still clearly outperform machines when the characters are heavily distorted (e.g., as in our NISTP dataset).

"...not intended to compete with the state-of-the-art...": We had included comparisons with the state-of-the-art on the NIST dataset (and beat it).

"the demonstrations that self-taught learning can help deep learners is helpful": indeed, but it is even more interesting to consider the result that self-taught learning was found *more helpful for deep learners than for shallow ones*. Since out-of-distribution data is common (especially out-of-class data), this is of practical importance.

Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be used on such large datasets. When trained on smaller subsets, they perform much worse than MLPs (above 30% error vs. 24% for MLPs on the 62-character NIST set). Following this suggestion, we will explore SVM variants and add SVM results to the paper.
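
As a purely illustrative sketch of what a scalable SVM variant could look like (the scikit-learn-based setup, batch size and feature layout below are assumptions, not the experiments reported above), a linear SVM can be trained online with SGD on the hinge loss:

    # Sketch only: a linear SVM trained with stochastic gradient descent
    # (hinge loss), streamed in mini-batches so it scales to large datasets.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    N_CLASSES = 62                     # NIST: digits plus upper/lower-case letters
    clf = SGDClassifier(loss="hinge")  # hinge loss gives a linear SVM objective

    def minibatches(X, y, batch_size=10000):
        for start in range(0, len(y), batch_size):
            yield X[start:start + batch_size], y[start:start + batch_size]

    def train_linear_svm(X, y):
        # X: flattened image pixels, y: integer labels in [0, N_CLASSES)
        for Xb, yb in minibatches(X, y):
            clf.partial_fit(Xb, yb, classes=np.arange(N_CLASSES))
        return clf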

"...it would be helpful to provide some theoretical analysis...": indeed, but this appears mathematically challenging (to say the least, since deep models involve a non-convex optimization) or would likely require very strong distributional assumptions. However, previous theoretical literature already provides some answers, e.g., Jonathan Baxter's (COLT 1995) "Learning internal representations". The argument is about sharing capacity across tasks to improve generalization: lower-layer features can potentially be shared across tasks. Whereas a one-hidden-layer MLP can only share linear features, a deep architecture can share non-linear ones, which have the potential to represent more abstract concepts.

Reviewer_5, about semi-supervised learning: In the unsupervised phase, no labels are used. In the supervised fine-tuning phase, all labels are used. So this is *not* the semi-supervised setting, which was previously studied [5], showing the advantage of depth. Instead, we focus here on the out-of-distribution aspect of self-taught learning.
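
For clarity, the two phases can be sketched as follows (hypothetical code, with layer sizes and a denoising-autoencoder reconstruction criterion assumed for illustration; this is not the exact training code used for the paper):

    # Phase 1 uses no labels at all; phase 2 fine-tunes the whole stack with
    # all available labels. Assumed PyTorch-style sketch.
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(32 * 32, 1000), nn.Tanh())
    decoder = nn.Linear(1000, 32 * 32)
    classifier = nn.Linear(1000, 62)

    def pretrain_step(x, opt):
        # Unsupervised phase: reconstruct the input from a corrupted copy.
        corrupted = x * (torch.rand_like(x) > 0.25)   # random masking noise
        loss = nn.functional.mse_loss(decoder(encoder(corrupted)), x)
        opt.zero_grad(); loss.backward(); opt.step()

    def finetune_step(x, y, opt):
        # Supervised phase: all labels fine-tune the encoder and classifier.
        loss = nn.functional.cross_entropy(classifier(encoder(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()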