# HG changeset patch
# User Yoshua Bengio
# Date 1294626835 18000
# Node ID e162e75ac5c6250bc0b02882a68418a342a2436e
# Parent  5c67f674d724e4f2121fdf6de85a9de76224ec06
# Parent  820764689d2fd9b0300b1d19eba2678767fa3412
merge

diff -r 820764689d2f -r e162e75ac5c6 writeup/aistats_review_response.txt
--- a/writeup/aistats_review_response.txt	Sun Jan 09 12:45:44 2011 -0500
+++ b/writeup/aistats_review_response.txt	Sun Jan 09 21:33:55 2011 -0500
@@ -1,9 +1,9 @@
-We thank the authors for their thoughtful comments. Here are some responses.
+We thank the reviewers for their thoughtful comments. Please find our responses below.
 
 * Comparisons with shallower networks, but using unsupervised pre-training:
 We will add those results to the paper. Previous work in our group with
-very similar data (the InfiniteMNIST dataset were published in JMLR in 20102
+very similar data was published in JMLR in 2010 (using the InfiniteMNIST dataset;
 "Why Does Unsupervised Pre-training Help Deep Learning?"). The results
 indeed show improvement when going from 1 to 2 and then 3 layers, even
 when using unsupervised pre-training (RBM or Denoising Auto-Encoder).
 
@@ -51,7 +51,35 @@
 true for many kinds of noises, but not for geometric transformations
 and deformations.
 
-* Human labeling:
+* Human labeling: We controlled noise in the labeling process by (1)
+requiring AMT workers with a higher than normal average of accepted
+responses (>95%) on other tasks, (2) discarding responses that were not
+complete (10 predictions), (3) discarding responses for which the time
+to predict was smaller than 3 seconds for NIST (the mean response time
+was 20 seconds) and 6 seconds for NISTP (average response time of
+45 seconds), and (4) discarding responses which were obviously wrong (10
+identical ones, or "12345..."). Overall, after such filtering, we kept
+approximately 95% of the AMT workers' responses. We thank the reviewer for
+the suggestion about multi-stage questionnaires; we will definitely
+consider this as an option the next time we perform this experiment. However,
+to be fair, if we were to do so, we should also consider the same
+multi-stage decision process for the machine learning algorithms as well.
 
-* Size of labeled set:
+* Size of labeled set: In our JMLR 2010 paper on deep learning (cited
+above), we already verified the effect of the number of labeled examples on
+deep learners and shallow learners (with or without unsupervised
+pre-training); see fig. 11 of that paper, which involves data very similar
+to those studied here. Basically (and somewhat surprisingly), the deep
+learners with unsupervised pre-training can take more advantage of a large
+amount of labeled examples, presumably because of the initialization effect
+(which benefits from the prior that representations useful for P(X)
+are also useful for P(Y|X)), and the effect does not disappear when the
+number of labeled examples increases. Other work in the semi-supervised
+setting (Lee et al., NIPS 2009, "Unsupervised feature learning...") also shows
+that the advantage of unsupervised feature learning by a deep architecture
+is most pronounced in the semi-supervised setting with very few labeled
+examples. Adding the training curve in the self-taught setting of this AISTATS
+submission is a good idea, but it is unlikely to provide results
+different from those already reported in the literature in similar
+settings.
 
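
Note: the following is a minimal sketch, for illustration only, of the AMT response filtering criteria (1)-(4) described in the human-labeling response above. The record field names (worker_approval_rate, predictions, response_time, dataset) and the exact test for "obviously wrong" answers are assumptions for this sketch, not the actual scripts or data format used in the study.

    # Sketch of the AMT response filtering described in the review response.
    # Thresholds come from the text; data layout is assumed.

    MIN_APPROVAL_RATE = 0.95            # (1) keep only workers with >95% accepted work
    REQUIRED_PREDICTIONS = 10           # (2) a complete response has 10 predictions
    MIN_RESPONSE_TIME = {"NIST": 3.0,   # (3) minimum plausible response time, in seconds
                         "NISTP": 6.0}

    def is_obviously_wrong(predictions):
        """(4) All-identical answers, or a trivial sequence such as '12345...'."""
        if len(set(predictions)) == 1:
            return True
        if "".join(map(str, predictions)).startswith("12345"):
            return True
        return False

    def keep_response(response):
        """Apply filters (1)-(4); return True if the response is kept."""
        if response["worker_approval_rate"] <= MIN_APPROVAL_RATE:
            return False
        if len(response["predictions"]) != REQUIRED_PREDICTIONS:
            return False
        if response["response_time"] < MIN_RESPONSE_TIME[response["dataset"]]:
            return False
        if is_obviously_wrong(response["predictions"]):
            return False
        return True

    def filter_responses(responses):
        """Return only the responses that pass all four filters."""
        return [r for r in responses if keep_response(r)]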