# HG changeset patch
# User Yoshua Bengio
# Date 1281289306 14400
# Node ID 83da863b924d3dd3efccff498c5b01bc9ba631b2
# Parent  5a777a2550e02aa6fadfe9116d04a35e09b92790
minor

diff -r 5a777a2550e0 -r 83da863b924d writeup/nips_rebuttal_clean.txt
--- a/writeup/nips_rebuttal_clean.txt	Sun Aug 08 13:38:55 2010 -0400
+++ b/writeup/nips_rebuttal_clean.txt	Sun Aug 08 13:41:46 2010 -0400
@@ -6,7 +6,7 @@
 
 "the demonstrations that self-taught learning can help deep learners is helpful": indeed, but it is even more interesting to consider the result that self-taught learning was found *more helpful for deep learners than for shallow ones*. Since out-of-distribution data is common (especially out-of-class data), this is of practical importance.
 
-Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be used on such large datasets. When training on smaller datasets they perform much worse than MLPs (above 30% vs 24% for MLPs on NIST 62 characters). We will explore SVM variants such as the suggestion made to add SVM results to the paper.
+Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary SVMs cannot be used on such large datasets. When training on smaller datasets they perform much worse than MLPs (above 30% vs 24% for MLPs on NIST 62 characters). We will explore SVM variants that can exploit large datasets, such as the suggestion made to add SVM results to the paper.
 
 "...it would be helpful to provide some theoretical analysis...": indeed, but this appears mathematically challenging (to say the least, since deep models involve a non-convex optimization) or would likely require very strong distributional assumptions. However, previous theoretical literature already provides some answers, e.g., Jonathan Baxter's (COLT 1995) "Learning internal representations". The argument is about sharing capacity across tasks to improve generalization: lower layers features can potentially be shared across tasks. Whereas a one-hidden-layer MLP can only share linear features, a deep architecture can share non-linear ones which have the potential for representing more abstract concepts.
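
To make the point about "SVM variants that can exploit large datasets" concrete, here is a minimal Python sketch of one such variant: a linear SVM trained out-of-core with stochastic gradient descent. It assumes scikit-learn's SGDClassifier and is not taken from the rebuttal or the paper; the stream_minibatches generator is hypothetical and stands in for whatever pipeline would feed the 62-class NIST data.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def train_streaming_linear_svm(stream_minibatches, n_classes=62):
        # hinge loss gives the linear-SVM objective, optimized by SGD
        clf = SGDClassifier(loss="hinge", alpha=1e-5)
        classes = np.arange(n_classes)  # all labels must be declared on the first call
        for X_batch, y_batch in stream_minibatches:
            # one SGD pass over this minibatch only, so the full dataset
            # never has to fit in memory
            clf.partial_fit(X_batch, y_batch, classes=classes)
        return clf

Kernel approximations (e.g., random features or Nystroem sampling followed by a linear SVM) would be another family of variants in the same spirit.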
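The capacity-sharing argument (lower-layer features shared across tasks, with only linear readouts being task-specific) can also be illustrated with a short toy sketch. The numpy code below is illustrative only, not the architecture used in the paper; layer sizes (784, 500, 500) and the two heads (62 characters, 10 digits) are assumed for the example. Two non-linear hidden layers are shared across tasks, whereas a one-hidden-layer MLP could share only a single layer of features.

    import numpy as np

    rng = np.random.default_rng(0)

    def layer(n_in, n_out):
        return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

    # shared non-linear feature extractor (lower layers)
    W1, b1 = layer(784, 500)
    W2, b2 = layer(500, 500)

    # task-specific linear output heads
    heads = {"characters": layer(500, 62), "digits": layer(500, 10)}

    def shared_features(x):
        h1 = np.tanh(x @ W1 + b1)      # first shared non-linear layer
        return np.tanh(h1 @ W2 + b2)   # second shared non-linear layer

    def predict(x, task):
        W_out, b_out = heads[task]
        return shared_features(x) @ W_out + b_out  # task-specific linear readout

    x = rng.normal(size=(4, 784))              # dummy minibatch
    print(predict(x, "characters").shape)      # (4, 62)
    print(predict(x, "digits").shape)          # (4, 10)

Every parameter except the small output heads is shared, so the non-linear features learned from one task's data are available to the others.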