We thank the reviewers for their thoughtful comments. Please find our responses below.

* Comparisons with shallower networks, but using unsupervised pre-training: we will add those results to the paper. Previous work in our group with very similar data (the InfiniteMNIST dataset) was published in JMLR in 2010 ("Why Does Unsupervised Pre-training Help Deep Learning?"). Those results indeed show improvement when going from 1 to 2 and then to 3 layers, even when using unsupervised pre-training (RBM or Denoising Auto-Encoder).

* Comparisons with SVMs: we have tried several kinds of SVMs. The main limitation, of course, is the size of the training set. One option is a non-linear SVM on a reduced training set; another is an online linear SVM. A further option we have considered is to project the input non-linearly into a high-dimensional but sparse representation and then use an online linear SVM on that space. For this experiment we thresholded the input pixel gray levels and considered a low-order polynomial expansion (e.g., only looking at pairs of non-zero pixels); a code sketch of this expansion is given after this list. The results obtained so far, all substantially worse than those obtained with the MLP and deep nets, are:

  SVM type     Training set / size   Input features     Online training error   Validation error   Test error
  Linear SVM   NIST / 651k           original           36.62%                  34.41%             42.26%
  Linear SVM   NIST / 651k           sparse quadratic   30.96%                  28.00%             41.28%
  Linear SVM   NISTP / 800k          original           88.50%                  85.24%             87.36%
  Linear SVM   NISTP / 800k          sparse quadratic   81.76%                  83.69%             85.56%
  RBF SVM      NISTP / 100k          original           74.73%                  56.57%             64.22%

  The best results were obtained with the sparse quadratic input features, and by training on the clean data (NIST) rather than the perturbed data (NISTP).

* Using distorted characters as the corruption process of the Denoising Auto-Encoder (DAE): we had already performed preliminary experiments with this idea and it did not work very well (in fact, it depends on the kind of distortion considered), i.e., it did not improve on the simpler forms of noise we used for the AISTATS submission. We have several interpretations of this, which should probably go (along with more extensive simulations) into another paper. The main interpretation is that the DAE learns good features by being given as target (to reconstruct) a pattern of higher density (according to the unknown, underlying generating distribution) than the network input; this is how it gets to know where the density should concentrate. Hence distortions that are *plausible* under the input distribution (such as translation, rotation, scaling, etc.) are not very useful, whereas corruption by a form of noise is. In fact, the most useful corruption is a very simple form of noise that guarantees the input is much less likely than the target, such as Gaussian noise. Another way to think about it is to consider the symmetries involved: a corruption process should be such that swapping input and target is very unlikely. This holds for many kinds of noise, but not for geometric transformations and deformations. A minimal sketch of a DAE update with Gaussian corruption also follows below.
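For concreteness, here is a minimal sketch of the sparse quadratic feature expansion mentioned in the SVM comparison above, paired with a hinge-loss SGD classifier standing in for the online linear SVM. The threshold, the hashed dimension for pair features, and the use of scikit-learn are illustrative assumptions, not the exact pipeline behind the numbers in the table.

    import numpy as np
    from scipy import sparse
    from sklearn.linear_model import SGDClassifier

    D_PAIR = 1 << 18  # hashed dimension for pair features (illustrative)

    def sparse_quadratic(X, threshold=0.5):
        """Threshold gray levels, then add one hashed feature per pair of
        simultaneously non-zero pixels (a low-order polynomial expansion)."""
        n, d = X.shape
        B = X > threshold
        rows, cols, vals = [], [], []
        for i in range(n):
            on = np.flatnonzero(B[i])
            for j in on:                      # linear part: binarized pixels
                rows.append(i); cols.append(j); vals.append(1.0)
            for a in range(len(on)):          # quadratic part: pixel pairs
                for b in range(a + 1, len(on)):
                    h = d + (int(on[a]) * 7919 + int(on[b])) % D_PAIR
                    rows.append(i); cols.append(h); vals.append(1.0)
        return sparse.csr_matrix((vals, (rows, cols)), shape=(n, d + D_PAIR))

    # An online linear SVM is approximated by hinge-loss SGD, updated one
    # mini-batch at a time, so the full 651k/800k set never sits in memory:
    svm = SGDClassifier(loss="hinge")
    # for X_batch, y_batch in training_stream:
    #     svm.partial_fit(sparse_quadratic(X_batch), y_batch, classes=classes)

Hashing pairs of pixel indices into a fixed-size space keeps the expanded representation sparse and bounded, which is what makes an online linear SVM tractable at this scale.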
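And here is a minimal NumPy sketch of one DAE update with the Gaussian corruption argued for above: the network reconstructs the clean input from a corrupted copy, so the target is (with high probability) denser under the data distribution than the input. The tied weights, sigmoid units, cross-entropy loss, and all sizes are illustrative assumptions, not the settings used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_step(x, W, bh, bv, sigma=0.3, lr=0.01):
        """One SGD step of a tied-weight DAE on a single example x."""
        x_tilde = x + sigma * rng.standard_normal(x.shape)  # corrupt input
        h = sigmoid(x_tilde @ W + bh)                       # encode
        z = sigmoid(h @ W.T + bv)                           # decode (tied W)
        dz = z - x                # cross-entropy gradient; target = clean x
        dh = (dz @ W) * h * (1.0 - h)
        W -= lr * (np.outer(x_tilde, dh) + np.outer(dz, h))
        bh -= lr * dh
        bv -= lr * dz
        return float(-(x * np.log(z) + (1 - x) * np.log(1 - z)).sum())

    # Illustrative sizes: 28x28 gray-level characters, 500 hidden units.
    d_v, d_h = 784, 500
    W = 0.01 * rng.standard_normal((d_v, d_h))
    bh, bv = np.zeros(d_h), np.zeros(d_v)
    # loss = dae_step(x, W, bh, bv)   # x: one example in [0, 1]^784

Replacing the x_tilde line with a random rotation or translation of x would give the "plausible distortion" variant that, per the discussion above, helps less: input and target then have comparable density, and swapping them is no longer unlikely.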
* Human labeling: we controlled noise in the labeling process by (1) requiring AMT workers with a higher-than-normal rate of accepted responses (>95%) on other tasks, (2) discarding responses that were not complete (10 predictions), (3) discarding responses for which the time to predict was smaller than 3 seconds for NIST (the mean response time was 20 seconds) and 6 seconds for NISTP (average response time of 45 seconds), and (4) discarding responses which were obviously wrong (10 identical ones, or "12345..."). Overall, after such filtering, we kept approximately 95% of the AMT workers' responses; a short sketch of these filtering rules appears at the end of this response. We thank the reviewer for the suggestion about multi-stage questionnaires; we will definitely consider it as an option the next time we perform such an experiment. To be fair, however, if we were to do so, we should also consider the same multi-stage decision process for the machine learning algorithms.

* Size of labeled set: in our JMLR 2010 paper on deep learning (cited above), we already verified the effect of the number of labeled examples on the deep learners and shallow learners (with or without unsupervised pre-training); see fig. 11 of that paper, which involves data very similar to those studied here. Basically (and somewhat surprisingly), the deep learners with unsupervised pre-training can take more advantage of a large number of labeled examples, presumably because of the initialization effect (which benefits from the prior that representations useful for P(X) are also useful for P(Y|X)), and the effect does not disappear as the number of labeled examples increases. Other work in the semi-supervised setting (Lee et al., NIPS 2009, "Unsupervised feature learning...") also shows that the advantage of unsupervised feature learning by a deep architecture is most pronounced in the semi-supervised setting with very few labeled examples. Adding the training curve in the self-taught setting of this AISTATS submission is a good idea, but it is unlikely to yield results different from those already reported in the literature in similar settings.
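Finally, for clarity, the response-filtering rules from the human-labeling point can be stated as a few lines of code. This is purely illustrative; the field names (worker_approval, answers, seconds, dataset) are hypothetical stand-ins for the actual AMT export.

    MIN_SECONDS = {"NIST": 3.0, "NISTP": 6.0}   # rule (3) time thresholds

    def keep_response(r):
        if r["worker_approval"] <= 0.95:             # (1) reliable workers only
            return False
        if len(r["answers"]) != 10:                  # (2) complete: 10 predictions
            return False
        if r["seconds"] < MIN_SECONDS[r["dataset"]]: # (3) answered too fast
            return False
        if len(set(r["answers"])) == 1:              # (4) 10 identical answers
            return False
        if "".join(r["answers"]) in "0123456789" * 2:  # (4) "12345..."-style runs
            return False
        return True

    # Toy example; in our experiments roughly 95% of responses passed.
    responses = [{"worker_approval": 0.98, "answers": list("3a8p0qz2k9"),
                  "seconds": 21.5, "dataset": "NIST"}]
    kept = [r for r in responses if keep_response(r)]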