We thank the reviewers for their thoughtful comments. Here are some responses.

* Comparisons with shallower networks, but using unsupervised pre-training:
We will add those results to the paper. Previous work in our group on very
similar data (the InfiniteMNIST dataset) was published in JMLR in 2010
("Why Does Unsupervised Pre-training Help Deep Learning?"). Those results
indeed show improvement when going from 1 to 2 and then 3 layers, even when
using unsupervised pre-training (RBM or Denoising Auto-Encoder).

* Comparisons with SVMs. We have tried several kinds of SVMs. The main limitation,
of course, is the size of the training set. One option is to use a non-linear SVM
with a reduced training set, and another is to use an online linear SVM.
A third option we have considered is to project the input non-linearly into a
high-dimensional but sparse representation and then use an online linear SVM in that space.
For this experiment we have thresholded the input pixel gray levels and considered a
low-order polynomial expansion (e.g., only looking at pairs of non-zero pixels).
We have obtained the following results so far, all substantially worse than those
obtained with the MLP and deep nets.

SVM type    Training set  Size  Input features    Online training  Validation  Test set
                                                  error            error       error
Linear SVM  NIST          651k  original          36.62%           34.41%      42.26%
Linear SVM  NIST          651k  sparse quadratic  30.96%           28.00%      41.28%
Linear SVM  NISTP         800k  original          88.50%           85.24%      87.36%
Linear SVM  NISTP         800k  sparse quadratic  81.76%           83.69%      85.56%
RBF SVM     NISTP         100k  original          74.73%           56.57%      64.22%

The best results were obtained with the sparse quadratic input features, and by
training on the clean data (NIST) rather than the perturbed data (NISTP).
A sketch of the sparse quadratic feature construction is given below.

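For concreteness, here is a minimal Python sketch of the kind of sparse quadratic
feature construction and online linear SVM described above. It is illustrative
only: the 32x32 image size, the 0.3 binarization threshold, the 62 character
classes, and the use of scikit-learn's SGDClassifier (a linear SVM trained by SGD
with the hinge loss) are assumptions made for the sketch, not the exact setup
used in our experiments.

    # Illustrative sketch only (assumed setup, not the experimental code):
    # binarize pixels, expand to sparse pairwise ("quadratic") features,
    # and train an online linear SVM on that representation.
    import numpy as np
    import scipy.sparse as sp
    from sklearn.linear_model import SGDClassifier

    D = 32 * 32                         # number of input pixels (assumed 32x32 images)
    N_FEATURES = D + D * (D - 1) // 2   # unary features + one feature per pixel pair

    def sparse_quadratic_features(images, threshold=0.3):
        """Binarize pixels and build sparse unary + pairwise features."""
        rows, cols = [], []
        for r, img in enumerate(images):
            on = np.flatnonzero(img.ravel() > threshold)  # indices of "on" pixels
            for i in on:                                  # unary features
                rows.append(r)
                cols.append(int(i))
            for a in range(len(on)):                      # pairwise features, i < j
                i = int(on[a])
                for j in on[a + 1:]:
                    j = int(j)
                    # position of the pair (i, j) in the upper-triangular enumeration
                    k = D + i * D - i * (i + 1) // 2 + (j - i - 1)
                    rows.append(r)
                    cols.append(k)
        data = np.ones(len(rows), dtype=np.float32)
        return sp.csr_matrix((data, (rows, cols)), shape=(len(images), N_FEATURES))

    # Online linear SVM: hinge loss optimized by SGD, one mini-batch at a time.
    clf = SGDClassifier(loss="hinge", alpha=1e-6)
    classes = np.arange(62)             # assuming the 62 NIST character classes

    def train_on_stream(batches):
        """'batches' yields (images, labels) mini-batches from the training stream."""
        for images, labels in batches:
            clf.partial_fit(sparse_quadratic_features(images), labels, classes=classes)

With D = 1024 pixels this gives roughly half a million potential features, but each
example activates only a number of them quadratic in its count of "on" pixels, so the
representation stays very sparse and the linear SVM can still be updated online.
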
* Using distorted characters as the corruption process of the Denoising
Auto-Encoder (DAE). We had already performed preliminary experiments with this idea
and it did not work very well (in fact it depends on the kind of distortion
considered), i.e., it did not improve on the simpler forms of noise we used
for the AISTATS submission. We have several interpretations of these results, which
should probably go (along with more extensive simulations) into another paper.
The main interpretation is that the DAE learns good features by being given
as target (to reconstruct) a pattern of higher density (according to the
unknown, underlying data-generating distribution) than the network input.
This is how it learns where the density should concentrate. Hence distortions
that are *plausible* under the input distribution (such as translation,
rotation, scaling, etc.) are not very useful, whereas corruption by some form
of noise is useful. In fact, the most useful corruption is a very simple form
of noise that guarantees the input is much less likely than the target, such
as Gaussian noise. Another way to think about it is to consider the symmetries
involved: a corruption process should be such that the roles of input and
target could not plausibly be swapped; this is true for many kinds of noise,
but not for geometric transformations and deformations. A sketch illustrating
this training criterion with Gaussian corruption is given below.

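To make the argument concrete, below is a minimal numpy sketch of one DAE update
with isotropic Gaussian corruption: the clean pattern x is the reconstruction
target, while the network only sees the corrupted (lower-density) x_tilde. It is
only an illustration of the criterion discussed above, not the code used for the
submission; the layer sizes, learning rate, noise level, tied weights, and
squared-error loss are all assumptions made for the sketch.

    # Illustrative sketch only: one-layer denoising auto-encoder with
    # Gaussian corruption, trained by stochastic gradient descent.
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, lr, sigma = 32 * 32, 500, 0.01, 0.25  # assumed hyper-parameters

    W = rng.normal(0.0, 0.01, size=(n_in, n_hid))  # tied encoder/decoder weights
    b_h = np.zeros(n_hid)                          # hidden biases
    b_v = np.zeros(n_in)                           # visible (reconstruction) biases

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_update(x):
        """One stochastic gradient step on a single example x (pixels in [0, 1])."""
        global W, b_h, b_v
        x_tilde = x + rng.normal(0.0, sigma, size=x.shape)  # Gaussian corruption
        h = sigmoid(x_tilde @ W + b_h)                      # encode the corrupted input
        x_hat = sigmoid(h @ W.T + b_v)                      # reconstruction
        # The squared reconstruction error is measured against the *clean* x,
        # not x_tilde: the target is the higher-density pattern the corrupted
        # input came from.
        delta_v = (x_hat - x) * x_hat * (1.0 - x_hat)       # gradient at visible pre-activation
        delta_h = (delta_v @ W) * h * (1.0 - h)             # back-propagated to hidden layer
        W -= lr * (np.outer(x_tilde, delta_h) + np.outer(delta_v, h))  # tied-weight gradient
        b_h -= lr * delta_h
        b_v -= lr * delta_v
        return np.mean((x_hat - x) ** 2)
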
* Human labeling:

* Size of labeled set: