# HG changeset patch
# User Yoshua Bengio
# Date 1294626835 18000
# Node ID e162e75ac5c6250bc0b02882a68418a342a2436e
# Parent  5c67f674d724e4f2121fdf6de85a9de76224ec06
# Parent  820764689d2fd9b0300b1d19eba2678767fa3412
merge

diff -r 820764689d2f -r e162e75ac5c6 writeup/aistats_review_response.txt
--- a/writeup/aistats_review_response.txt	Sun Jan 09 12:45:44 2011 -0500
+++ b/writeup/aistats_review_response.txt	Sun Jan 09 21:33:55 2011 -0500
@@ -1,9 +1,9 @@
-We thank the authors for their thoughtful comments. Here are some responses.
+We thank the reviewers for their thoughtful comments. Please find our responses below.
 
 * Comparisons with shallower networks, but using unsupervised pre-training:
 We will add those results to the paper. Previous work in our group with
-very similar data (the InfiniteMNIST dataset were published in JMLR in 20102
+very similar data was published in JMLR in 2010 (using the InfiniteMNIST dataset;
 "Why Does Unsupervised Pre-training Help Deep Learning?"). The results
 indeed show improvement when going from 1 to 2 and then 3 layers, even
 when using unsupervised pre-training (RBM or Denoising Auto-Encoder).
 
@@ -51,7 +51,35 @@
 true for many kinds of noises, but not for geometric transformations
 and deformations.
 
-* Human labeling:
+* Human labeling: We controlled noise in the labeling process by (1)
+requiring AMT workers with a higher than normal average of accepted
+responses (>95%) on other tasks, (2) discarding responses that were not
+complete (10 predictions), (3) discarding responses for which the time
+to predict was smaller than 3 seconds for NIST (the mean response time
+was 20 seconds) and 6 seconds for NISTP (average response time of
+45 seconds), and (4) discarding responses which were obviously wrong (10
+identical ones, or "12345..."). Overall, after such filtering, we kept
+approximately 95% of the AMT workers' responses. We thank the reviewer for
+the suggestion about multi-stage questionnaires; we will definitely
+consider this as an option the next time we perform this experiment. However,
+to be fair, if we were to do so, we should also consider the same
+multi-stage decision process for the machine learning algorithms as well.
 
-* Size of labeled set:
+* Size of labeled set: In our JMLR 2010 paper on deep learning (cited
+above), we already verified the effect of the number of labeled examples on
+deep learners and shallow learners (with or without unsupervised
+pre-training); see fig. 11 of that paper, which involves data very similar
+to those studied here. Basically (and somewhat surprisingly), the deep
+learners with unsupervised pre-training can take more advantage of a large
+amount of labeled examples, presumably because of the initialization effect
+(which benefits from the prior that representations useful for P(X)
+are also useful for P(Y|X)), and the effect does not disappear when the
+number of labeled examples increases. Other work in the semi-supervised
+setting (Lee et al., NIPS 2009, "Unsupervised feature learning...") also shows
+that the advantage of unsupervised feature learning by a deep architecture
+is most pronounced in the semi-supervised setting with very few labeled
+examples. Adding the training curve in the self-taught setting of this AISTATS
+submission is a good idea, but it is unlikely to provide results
+different from those already reported in the literature in similar
+settings.
 
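
Note: the following is a minimal sketch, for illustration only, of the AMT response filtering criteria (1)-(4) described in the human-labeling response above. The record field names (worker_approval_rate, predictions, response_time, dataset) and the exact test for "obviously wrong" answers are assumptions for this sketch, not the actual scripts or data format used in the study.

    # Sketch of the AMT response filtering described in the review response.
    # Thresholds come from the text; data layout is assumed.

    MIN_APPROVAL_RATE = 0.95            # (1) keep only workers with >95% accepted work
    REQUIRED_PREDICTIONS = 10           # (2) a complete response has 10 predictions
    MIN_RESPONSE_TIME = {"NIST": 3.0,   # (3) minimum plausible response time, in seconds
                         "NISTP": 6.0}

    def is_obviously_wrong(predictions):
        """(4) All-identical answers, or a trivial sequence such as '12345...'."""
        if len(set(predictions)) == 1:
            return True
        if "".join(map(str, predictions)).startswith("12345"):
            return True
        return False

    def keep_response(response):
        """Apply filters (1)-(4); return True if the response is kept."""
        if response["worker_approval_rate"] <= MIN_APPROVAL_RATE:
            return False
        if len(response["predictions"]) != REQUIRED_PREDICTIONS:
            return False
        if response["response_time"] < MIN_RESPONSE_TIME[response["dataset"]]:
            return False
        if is_obviously_wrong(response["predictions"]):
            return False
        return True

    def filter_responses(responses):
        """Return only the responses that pass all four filters."""
        return [r for r in responses if keep_response(r)]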