comparison writeup/nips_rebuttal_clean.txt @ 578:61aae4fd2da5

typo fixed, uploaded to CMT
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 08 Aug 2010 08:16:21 -0400
parents 685756a11fd2
children 5a777a2550e0
13 "...it would be helpful to provide some theoretical analysis...": indeed, but this appears mathematically challenging (to say the least, since deep models involve a non-convex optimization) or would likely require very strong distributional assumptions. However, previous theoretical literature already provides some answers, e.g., Jonathan Baxter's (COLT 1995) "Learning internal representations". The argument is about sharing capacity across tasks to improve generalization: lower layers features can potentially be shared across tasks. Whereas a one-hidden-layer MLP can only share linear features, a deep architecture can share non-linear ones which have the potential for representing more abstract concepts. 13 "...it would be helpful to provide some theoretical analysis...": indeed, but this appears mathematically challenging (to say the least, since deep models involve a non-convex optimization) or would likely require very strong distributional assumptions. However, previous theoretical literature already provides some answers, e.g., Jonathan Baxter's (COLT 1995) "Learning internal representations". The argument is about sharing capacity across tasks to improve generalization: lower layers features can potentially be shared across tasks. Whereas a one-hidden-layer MLP can only share linear features, a deep architecture can share non-linear ones which have the potential for representing more abstract concepts.
Reviewer_5 about semi-supervised learning: In the unsupervised phase, no labels are used. In the supervised fine-tuning phase, all labels are used. So this is *not* the semi-supervised setting, which was already studied in [5], showing the advantage of depth. Instead, we focus here on the out-of-distribution aspect of self-taught learning.
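A minimal sketch of the two phases (hypothetical function and method names, not the paper's code):

    # Phase 1 sees only unlabeled examples, which in the self-taught setting
    # may come from other distributions; phase 2 uses all available labels.
    def train_self_taught(model, unlabeled_x, labeled_x, labeled_y):
        model.pretrain(unlabeled_x)           # unsupervised: no labels used
        model.finetune(labeled_x, labeled_y)  # supervised: all labels used
        return model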
17 "...human errors may be present...": Indeed, there are variations across human labelings, which have have estimated (since each character was viewed by 3 different humans), and reported in the paper (the standard deviations across humans are large, but the standard error across a large test set is very small, so we believe the average error numbers to be fairly accurate). 17 "...human errors may be present...": Indeed, there are variations across human labelings, which have been estimated (since each character was viewed by 3 different humans), and reported in the paper (the standard deviations across humans are large, but the standard error across a large test set is very small, so we believe the average error numbers to be fairly accurate).
19 "...supplement, but I did not have access to it...": strange! We could (and still can) access it. We will include a complete pseudo-code of SDAs in it. 19 "...supplement, but I did not have access to it...": strange! We could (and still can) access it. We will include a complete pseudo-code of SDAs in it.
21 "...main contributions of the manuscript...": the main contribution is actually to show that the self-taught learning setting is more beneficial to deeper architectures. 21 "...main contributions of the manuscript...": the main contribution is actually to show that the self-taught learning setting is more beneficial to deeper architectures.