ift6266: comparison writeup/nips_rebuttal_clean.txt @ 576:185d79636a20
now fits
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
date | Sat, 07 Aug 2010 22:54:54 -0400 |
parents | bff9ab360ef4 |
children | 685756a11fd2 |
575:bff9ab360ef4 | 576:185d79636a20 |
---|---|
22 SVMs cannot be used on such large datasets. We will explore SVM variants, | 22 SVMs cannot be used on such large datasets. We will explore SVM variants, |
23 following the suggestion to add SVM results to the paper. | 23 following the suggestion to add SVM results to the paper. |
24 | 24 |
25 | 25 |
26 "...it would be helpful to provide some theoretical analysis...": indeed, | 26 "...it would be helpful to provide some theoretical analysis...": indeed, |
27 but this is either mathematically challenging (to say the least, since deep | 27 but this appears mathematically challenging (to say the least, since deep |
28 models involve a non-convex optimization) or would likely require very | 28 models involve a non-convex optimization) or would likely require very |
29 strong assumptions on the data distribution. However, there exists | 29 strong distributional assumptions. However, previous |
30 theoretical literature which answers some basic questions about this issue, | 30 theoretical literature already provides some answers, e.g., |
31 starting with the work of Jonathan Baxter (COLT 1995) "Learning internal | 31 Jonathan Baxter's (COLT 1995) "Learning internal |
32 representations". The argument is about capacity and sharing it across | 32 representations". The argument is about sharing capacity across |
33 tasks so as to achieve better generalization. The lower layers implement | 33 tasks to improve generalization: lower-layer features can potentially |
34 features that can potentially be shared across tasks. As long as some | 34 be shared across tasks. Whereas a one-hidden-layer MLP can only share linear |
35 sharing is possible (because the same features can be useful for several | |
36 tasks), then there is a potential benefit from shared internal | |
37 representations. Whereas a one-hidden-layer MLP can only share linear | |
38 features, a deep architecture can share non-linear ones which have the | 35 features, a deep architecture can share non-linear ones which have the |
39 potential for representing more abstract concepts. | 36 potential for representing more abstract concepts. |
40 | 37 |
41 Reviewer_5 about semi-supervised learning: In the unsupervised phase, no | 38 Reviewer_5 about semi-supervised learning: In the unsupervised phase, no |
42 labels are used. In the supervised fine-tuning phase, all labels are used, | 39 labels are used. In the supervised fine-tuning phase, all labels are used. |
43 so this is not the semi-supervised setting. This paper did not examine the | 40 So this is *not* the semi-supervised setting, which was previously |
44 potential advantage of exploiting large quantities of additional unlabeled | 41 studied [5], showing the advantage of depth. Instead, we focus here |
45 data, but the availability of the generated dataset and of the learning | 42 on the out-of-distribution aspect of self-taught learning. |
46 setup would make it possible to easily conduct a study to answer this | |
47 interesting question. Note however that previous work [5] already | |
48 investigated the relative advantage of the semi-supervised setting for deep | |
49 vs shallow architectures, which is why we did not focus on this here. It | |
50 might still be worth doing these experiments because the deep learning | |
51 algorithms were different. | |
52 | 43 |
53 "...human errors may be present...": Indeed, there are variations across | 44 "...human errors may be present...": Indeed, there are variations across |
54 human labelings, which we have estimated (since each character was viewed | 45 human labelings, which we have estimated (since each character was viewed |
55 by 3 different humans), and reported in the paper (the standard deviations | 46 by 3 different humans), and reported in the paper (the standard deviations |
56 across humans are large, but the standard error across a large test set is | 47 across humans are large, but the standard error across a large test set is |
57 very small, so we believe the average error numbers to be fairly accurate). | 48 very small, so we believe the average error numbers to be fairly accurate). |
58 | 49 |
59 "...authors do cite a supplement, but I did not have access to it...": that | 50 "...supplement, but I did not have access to it...": strange! We could |
60 is strange. We could (and still can) access it from the CMT web site. We | 51 (and still can) access it. We will include a complete pseudo-code of SDAs |
61 will make sure to include a complete pseudo-code of SDAs in it. | 52 in it. |
62 | 53 |
63 "...main contributions of the manuscript...": the main contribution is | 54 "...main contributions of the manuscript...": the main contribution is |
64 actually to show that the self-taught learning setting is more beneficial | 55 actually to show that the self-taught learning setting is more beneficial |
65 to deeper architectures. | 56 to deeper architectures. |
66 | 57 |
67 "...restriction to MLPs...": that restriction was motivated by the | 58 "...restriction to MLPs...": that restriction was motivated by the |
68 computational challenge of training on hundreds of millions of | 59 computational challenge of training on nearly a billion examples. Linear |
69 examples. Apart from linear models (which do not fare well on this task), | 60 models do not fare well here, and most non-parametric models do not scale |
70 it is not clear to us what could be used, and so MLPs were the obvious | 61 well, so MLPs (which have been used before on this task) were natural as |
71 candidates to compare with. We will explore the use of SVM approximations, | 62 the baseline. We will explore the use of SVM approximations, as suggested |
72 as suggested by Reviewer_1. Other suggestions are welcome. | 63 by Reviewer_1. |
73 | 64 |
74 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of | 65 "Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of |
75 prior work on character recognition using deformations and | 66 prior work on character recognition using deformations and |
76 transformations". The main originality is in showing that deep learners | 67 transformations". Main originality = showing that deep learners |
77 can take more advantage than shallow learners of such data and of the | 68 can take more advantage than shallow learners of such data and of the |
78 self-taught learning framework in general. | 69 self-taught learning framework in general. |
79 | 70 |
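
Appended for readers who could not access the supplement: a minimal NumPy sketch, not the authors' code, of the SDA training protocol the rebuttal describes above, i.e., greedy unsupervised pre-training of denoising auto-encoder layers (no labels used), followed by supervised fine-tuning of the whole stack through a softmax output (all labels used). The stacked encoders play the role of the shared lower-layer representation in the capacity-sharing argument. Layer sizes, corruption level, learning rates, number of classes and the random toy data are placeholder assumptions, not the settings of the paper.

    import numpy as np

    rng = np.random.RandomState(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    class DALayer:
        """One denoising auto-encoder layer with tied weights."""
        def __init__(self, n_in, n_hid, corruption=0.3):
            self.W = rng.uniform(-0.1, 0.1, (n_in, n_hid))
            self.b = np.zeros(n_hid)       # encoder bias
            self.b_rec = np.zeros(n_in)    # reconstruction bias
            self.corruption = corruption

        def encode(self, x):
            return sigmoid(x @ self.W + self.b)

        def pretrain_step(self, x, lr=0.1):
            # Unsupervised step: corrupt the input, reconstruct it, and take one
            # gradient step on the cross-entropy reconstruction error. No labels.
            x_c = x * rng.binomial(1, 1.0 - self.corruption, x.shape)
            h = sigmoid(x_c @ self.W + self.b)
            z = sigmoid(h @ self.W.T + self.b_rec)
            dz = (z - x) / len(x)                  # grad w.r.t. decoder pre-activation
            dh = (dz @ self.W) * h * (1.0 - h)     # grad w.r.t. encoder pre-activation
            self.W -= lr * (x_c.T @ dh + dz.T @ h)
            self.b -= lr * dh.sum(axis=0)
            self.b_rec -= lr * dz.sum(axis=0)

    def pretrain(layers, X, epochs=5):
        # Greedy layer-wise unsupervised pre-training (the "no labels" phase).
        rep = X
        for layer in layers:
            for _ in range(epochs):
                layer.pretrain_step(rep)
            rep = layer.encode(rep)

    def finetune_step(layers, W_out, b_out, X, y, lr=0.1):
        # Supervised step (the "all labels" phase): forward through the stack,
        # softmax on top, backpropagate the labelled error into every layer.
        acts = [X]
        for layer in layers:
            acts.append(layer.encode(acts[-1]))
        p = softmax(acts[-1] @ W_out + b_out)
        delta = p
        delta[np.arange(len(y)), y] -= 1.0
        delta /= len(y)
        back = delta @ W_out.T                     # propagate before updating W_out
        W_out -= lr * acts[-1].T @ delta
        b_out -= lr * delta.sum(axis=0)
        for i in reversed(range(len(layers))):
            h = acts[i + 1]
            d = back * h * (1.0 - h)
            back = d @ layers[i].W.T               # uses W before its update below
            layers[i].W -= lr * acts[i].T @ d
            layers[i].b -= lr * d.sum(axis=0)

    # Toy usage: random data standing in for 32x32 character images, 10 classes.
    X = rng.rand(256, 32 * 32)
    y = rng.randint(0, 10, 256)
    layers = [DALayer(32 * 32, 500), DALayer(500, 500)]
    W_out = rng.uniform(-0.1, 0.1, (500, 10))
    b_out = np.zeros(10)
    pretrain(layers, X)                            # unsupervised phase
    for _ in range(10):
        finetune_step(layers, W_out, b_out, X, y)  # supervised fine-tuning phase

The only structural point the sketch is meant to convey is the order of the two phases: the pre-training loop never touches y, while finetune_step uses the labels to update every layer of the stack.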