# HG changeset patch
# User Yoshua Bengio
# Date 1294593225 18000
# Node ID b0cdd200b2bdcdf935fcc8912116f3ebae618528
# Parent  337253b82409fb0bbcdf48a058b53c29014b7868
added aistats_review_response.txt
diff -r 337253b82409 -r b0cdd200b2bd writeup/aistats_review_response.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/writeup/aistats_review_response.txt	Sun Jan 09 12:13:45 2011 -0500
@@ -0,0 +1,57 @@
+
+We thank the reviewers for their thoughtful comments. Here are some responses.
+
+* Comparisons with shallower networks, but using unsupervised pre-training:
+We will add those results to the paper. Previous work in our group with
+very similar data (the InfiniteMNIST dataset) was published in JMLR in 2010
+("Why Does Unsupervised Pre-training Help Deep Learning?"). The results indeed
+show improvement when going from 1 to 2 and then 3 layers, even when using
+unsupervised pre-training (RBM or Denoising Auto-Encoder).
+
+* Comparisons with SVMs. We have tried several kinds of SVMs. The main limitation
+of course is the size of the training set. One option is to use a non-linear SVM
+with a reduced training set, and another is to use an online linear SVM.
+A third option we have considered is to project the input non-linearly into a
+high-dimensional but sparse representation and then use an online linear SVM on that space.
+For this experiment we have thresholded input pixel gray levels and considered a
+low-order polynomial expansion (e.g. only looking at pairs of non-zero pixels).
+We have obtained the following results so far, all substantially worse than those
+obtained with the MLP and deep nets.
+
+SVM type     training set   input              online training   validation   test set
+             type / size    features           set error         error        error
+-----------  -------------  -----------------  ----------------  -----------  --------
+Linear SVM   NIST, 651k     original           36.62%            34.41%       42.26%
+Linear SVM   NIST, 651k     sparse quadratic   30.96%            28.00%       41.28%
+Linear SVM   NISTP, 800k    original           88.50%            85.24%       87.36%
+Linear SVM   NISTP, 800k    sparse quadratic   81.76%            83.69%       85.56%
+RBF SVM      NISTP, 100k    original           74.73%            56.57%       64.22%
+
+The best results were obtained with the sparse quadratic input features, and
+training on the CLEAN data (NIST) rather than the perturbed data (NISTP).
+
+
+* Using distorted characters as the corruption process of the Denoising
+Auto-Encoder (DAE). We had already performed preliminary experiments with this idea
+and it did not work very well (in fact it depends on the kind of distortion
+considered), i.e., it did not improve on the simpler forms of noise we used
+for the AISTATS submission. We have several interpretations for this, which should
+probably go (along with more extensive simulations) into another paper.
+The main interpretation for those results is that the DAE learns good
+features by being given as target (to reconstruct) a pattern of higher
+density (according to the unknown, underlying generating distribution) than
+the network input. This is how it gets to know where the density should
+concentrate. Hence distortions that are *plausible* in the input distribution
+(such as translation, rotation, scaling, etc.) are not very useful, whereas
+corruption due to a form of noise is useful. In fact, the most useful
+is a very simple form of noise that guarantees that the input is much
+less likely than the target, such as Gaussian noise. Another way to think
+about it is to consider the symmetries involved. A corruption process should
+be such that swapping input for target should be very unlikely: this is
+true for many kinds of noise, but not for geometric transformations
+and deformations.
+
+* Human labeling:
+
+* Size of labeled set:
+