# HG changeset patch
# User Yoshua Bengio <bengioy@iro.umontreal.ca>
# Date 1294593225 18000
# Node ID b0cdd200b2bdcdf935fcc8912116f3ebae618528
# Parent  337253b82409fb0bbcdf48a058b53c29014b7868
added aistats_review_response.txt

diff -r 337253b82409 -r b0cdd200b2bd writeup/aistats_review_response.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/writeup/aistats_review_response.txt	Sun Jan 09 12:13:45 2011 -0500
@@ -0,0 +1,57 @@
+
+We thank the reviewers for their thoughtful comments. Here are some responses.
+
+* Comparisons with shallower networks, but using unsupervised pre-training:
+We will add those results to the paper. Previous work in our group with
+very similar data (the InfiniteMNIST dataset) was published in JMLR in 2010
+("Why Does Unsupervised Pre-training Help Deep Learning?"). The results indeed
+show improvement when going from 1 to 2 and then 3 layers, even when using
+unsupervised pre-training (RBM or Denoising Auto-Encoder).
+
+* Comparisons with SVMs. We have tried several kinds of SVMs. The main limitation
+of course is the size of the training set. One option is to use a non-linear SVM
+with a reduced training set, and another is to use an online linear SVM.
+A third option we have considered is to project the input non-linearly into a
+high-dimensional but sparse representation and then use an online linear SVM on that space.
+For this experiment we have thresholded input pixel gray levels and considered a
+low-order polynomial expansion (e.g. only looking at pairs of non-zero pixels).
+We have obtained the following results so far, all substantially worse than those
+obtained with the MLP and deep nets.
+
+SVM type     training  training  input              online    validation  test
+             set       set size  features           training  set error   set error
+                                                    error
+Linear SVM   NIST      651k      original           36.62%    34.41%      42.26%
+Linear SVM   NIST      651k      sparse quadratic   30.96%    28.00%      41.28%
+Linear SVM   NISTP     800k      original           88.50%    85.24%      87.36%
+Linear SVM   NISTP     800k      sparse quadratic   81.76%    83.69%      85.56%
+RBF SVM      NISTP     100k      original           74.73%    56.57%      64.22%
+
+The best results were obtained with the sparse quadratic input features, and
+training on the CLEAN data (NIST) rather than the perturbed data (NISTP).
+
+
+* Using distorted characters as the corruption process of the Denoising
+Auto-Encoder (DAE). We had already performed preliminary experiments with this idea
+and it did not work very well (in fact it depends on the kind of distortion
+considered), i.e., it did not improve on the simpler forms of noise we used
+for the AISTATS submission.  We have several interpretations for this, which should
+probably go (along with more extensive simulations) into another paper.
+The main interpretation for those results is that the DAE learns good
+features by being given as target (to reconstruct) a pattern of higher
+density (according to the unknown, underlying generating distribution) than
+the network input. This is how it gets to know where the density should
+concentrate. Hence distortions that are *plausible* in the input distribution
+(such as translation, rotation, scaling, etc.) are not very useful, whereas
+corruptions due to a form of noise are useful. In fact, the most useful
+is a very simple form of noise, such as Gaussian noise, which guarantees
+that the input is much less likely than the target. Another way to think
+about it is to consider the symmetries involved. A corruption process should
+be such that swapping input and target yields a very unlikely pair: this is
+true for many kinds of noise, but not for geometric transformations
+and deformations.
+
+* Human labeling: 
+
+* Size of labeled set:
+