comparison writeup/nips_rebuttal_clean.txt @ 576:185d79636a20

now fits
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 07 Aug 2010 22:54:54 -0400
parents bff9ab360ef4
children 685756a11fd2

SVMs cannot be used on such large datasets. We will explore SVM variants,
following the suggestion to add SVM results to the paper.

"...it would be helpful to provide some theoretical analysis...": indeed,
but this appears mathematically challenging (to say the least, since deep
models involve a non-convex optimization) or would likely require very
strong distributional assumptions. However, previous theoretical literature
already provides some answers, e.g., Jonathan Baxter's (COLT 1995)
"Learning internal representations". The argument is about sharing capacity
across tasks to improve generalization: lower-layer features can
potentially be shared across tasks. Whereas a one-hidden-layer MLP can only
share linear features, a deep architecture can share non-linear ones, which
have the potential to represent more abstract concepts.
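
To make the sharing argument concrete, the following is a minimal numpy
sketch, not code from the paper; the layer sizes and the two tasks are
hypothetical. The stacked non-linear layers form a trunk reused by every
task, while only the small output heads are task-specific.

import numpy as np

rng = np.random.RandomState(0)

def relu(x):
    return np.maximum(0.0, x)

n_in, n_h1, n_h2 = 784, 500, 500   # shared trunk: two non-linear layers
n_out_a, n_out_b = 62, 10          # two illustrative tasks (e.g. characters vs digits)

# Shared parameters: every task reuses these non-linear feature extractors.
W1 = rng.normal(scale=0.01, size=(n_in, n_h1)); b1 = np.zeros(n_h1)
W2 = rng.normal(scale=0.01, size=(n_h1, n_h2)); b2 = np.zeros(n_h2)

# Task-specific output layers (the only per-task capacity).
Wa = rng.normal(scale=0.01, size=(n_h2, n_out_a)); ba = np.zeros(n_out_a)
Wb = rng.normal(scale=0.01, size=(n_h2, n_out_b)); bb = np.zeros(n_out_b)

def shared_features(x):
    # Two stacked non-linear layers: what a deep net can share across tasks.
    # A one-hidden-layer MLP could only share a single non-linear map, and a
    # linear model could only share linear features of x.
    h1 = relu(x @ W1 + b1)
    return relu(h1 @ W2 + b2)

x = rng.rand(5, n_in)        # a hypothetical minibatch of inputs
h = shared_features(x)       # computed once, reused by both tasks
logits_a = h @ Wa + ba       # task A head
logits_b = h @ Wb + bb       # task B head
print(logits_a.shape, logits_b.shape)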

Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
labels are used. In the supervised fine-tuning phase, all labels are used.
So this is *not* the semi-supervised setting, which was already studied in
[5], showing the advantage of depth. Instead, we focus here on the
out-of-distribution aspect of self-taught learning.
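
To illustrate the two phases (with hypothetical sizes and random stand-in
data; this is a sketch, not the paper's SDA implementation, and it
pre-trains only a single denoising layer): the first loop never touches the
labels y, while the second uses all of them.

import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hid, n_classes, lr = 64, 32, 10, 0.1
X = rng.rand(256, n_in)               # inputs in [0, 1]
y = rng.randint(n_classes, size=256)  # labels, used only in fine-tuning

W = rng.normal(scale=0.01, size=(n_in, n_hid)); b = np.zeros(n_hid); c = np.zeros(n_in)
V = rng.normal(scale=0.01, size=(n_hid, n_classes)); d = np.zeros(n_classes)

# Phase 1: unsupervised pre-training of a denoising autoencoder (no labels).
for _ in range(50):
    xb = X[rng.randint(len(X), size=32)]
    xt = xb * (rng.rand(*xb.shape) > 0.3)   # corrupt ~30% of the inputs
    h = sigmoid(xt @ W + b)                 # encode the corrupted input
    z = sigmoid(h @ W.T + c)                # reconstruct (tied weights)
    dz = z - xb                             # cross-entropy gradient at the decoder
    dh = (dz @ W) * h * (1 - h)
    W -= lr * (xt.T @ dh + dz.T @ h) / len(xb)
    b -= lr * dh.mean(axis=0)
    c -= lr * dz.mean(axis=0)

# Phase 2: supervised fine-tuning of the same encoder, with all labels.
for _ in range(50):
    idx = rng.randint(len(X), size=32)
    xb, yb = X[idx], y[idx]
    h = sigmoid(xb @ W + b)                 # pre-trained layer, now fine-tuned
    logits = h @ V + d
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    dlog = p.copy()
    dlog[np.arange(len(yb)), yb] -= 1.0     # softmax cross-entropy gradient
    dh = (dlog @ V.T) * h * (1 - h)
    V -= lr * (h.T @ dlog) / len(xb)
    d -= lr * dlog.mean(axis=0)
    W -= lr * (xb.T @ dh) / len(xb)
    b -= lr * dh.mean(axis=0)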
53 "...human errors may be present...": Indeed, there are variations across 44 "...human errors may be present...": Indeed, there are variations across
54 human labelings, which have have estimated (since each character was viewed 45 human labelings, which have have estimated (since each character was viewed
55 by 3 different humans), and reported in the paper (the standard deviations 46 by 3 different humans), and reported in the paper (the standard deviations
56 across humans are large, but the standard error across a large test set is 47 across humans are large, but the standard error across a large test set is
57 very small, so we believe the average error numbers to be fairly accurate). 48 very small, so we believe the average error numbers to be fairly accurate).
58 49
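
The statistical point is simply that the standard error of a mean shrinks
as 1/sqrt(N). A toy numpy illustration, with a made-up disagreement rate
rather than the paper's numbers:

import numpy as np

rng = np.random.RandomState(0)

n_test = 100000                    # a hypothetical large test set
errors = rng.rand(n_test) < 0.2    # made-up 20% per-item labeling noise

std_across_items = errors.std()                     # large: individual labels are noisy
standard_error = errors.std() / np.sqrt(n_test)     # tiny: the mean error is precise
print(std_across_items, standard_error)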
59 "...authors do cite a supplement, but I did not have access to it...": that 50 "...supplement, but I did not have access to it...": strange! We could
60 is strange. We could (and still can) access it from the CMT web site. We 51 (and still can) access it. We will include a complete pseudo-code of SDAs
61 will make sure to include a complete pseudo-code of SDAs in it. 52 in it.
62 53
63 "...main contributions of the manuscript...": the main contribution is 54 "...main contributions of the manuscript...": the main contribution is
64 actually to show that the self-taught learning setting is more beneficial 55 actually to show that the self-taught learning setting is more beneficial
65 to deeper architectures. 56 to deeper architectures.
66 57
67 "...restriction to MLPs...": that restriction was motivated by the 58 "...restriction to MLPs...": that restriction was motivated by the
68 computational challenge of training on hundreds of millions of 59 computational challenge of training on nearly a billion examples. Linear
69 examples. Apart from linear models (which do not fare well on this task), 60 models do not fare well here, and most non-parametric models do not scale
70 it is not clear to us what could be used, and so MLPs were the obvious 61 well, so MLPs (which have been used before on this task) were natural as
71 candidates to compare with. We will explore the use of SVM approximations, 62 the baseline. We will explore the use of SVM approximations, as suggested
72 as suggested by Reviewer_1. Other suggestions are welcome. 63 by Reviewer_1.
73 64
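
As one example of an SVM approximation that scales to datasets of this
size, a linear SVM can be trained by SGD on the hinge loss over streamed
minibatches. The sketch below is hypothetical (random stand-in data, no
kernel), not a method from the paper; a kernel approximation such as random
features could be layered on top.

import numpy as np

rng = np.random.RandomState(0)

n_features, lam, lr = 784, 1e-4, 0.01
w = np.zeros(n_features)
b = 0.0

def sgd_step(w, b, xb, yb):
    # One subgradient step on the L2-regularized hinge loss; yb is in {-1, +1}.
    margins = yb * (xb @ w + b)
    active = margins < 1.0            # examples that violate the margin
    gw, gb = lam * w, 0.0             # gradient of the regularizer
    if active.any():
        gw = gw - (yb[active, None] * xb[active]).sum(axis=0) / len(xb)
        gb = gb - yb[active].sum() / len(xb)
    return w - lr * gw, b - lr * gb

for _ in range(100):                  # in practice, stream minibatches from disk
    xb = rng.randn(64, n_features)            # stand-in features
    yb = rng.choice([-1.0, 1.0], size=64)     # stand-in binary labels
    w, b = sgd_step(w, b, xb, yb)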
74 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of 65 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
75 prior work on character recognition using deformations and 66 prior work on character recognition using deformations and
76 transformations". The main originality is in showing that deep learners 67 transformations". Main originality = showing that deep learners
77 can take more advantage than shallow learners of such data and of the 68 can take more advantage than shallow learners of such data and of the
78 self-taught learning framework in general. 69 self-taught learning framework in general.
79 70