view writeup/nips_rebuttal_clean.txt @ 574:d12b9a1432e8

cleaned-up version, fewer typos, shortened (but need 700 chars less)
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Sat, 07 Aug 2010 18:39:36 -0700

Reviewer_1 claims that handwriting recognition is essentially solved; we believe
that this is not true. The best methods do reach essentially human performance
on clean digits, but we are not aware of previous papers showing that human
performance has been reached on the full character set. Furthermore, it is clear
from our own experimentation that humans still greatly outperform machines when
the characters are heavily distorted (e.g. the NISTP dataset). Playing with the
provided demo will quickly convince you that this is true.

"...not intended to compete with the state-of-the-art...": We actually included
comparisons with the state-of-the-art on the NIST dataset (and beat it).

"the demonstrations that self-taught learning can help deep learners is
helpful": indeed, but it is even more interesting to consider the result that
self-taught learning was found *more helpful for deep learners than for shallow
ones*. Since the availability of out-of-distribution data is common (especially
out-of-class data), this is of practical importance.

Reviewer_4: "It would also be interesting to compare to SVMs...": ordinary SVMs
cannot be used on such large datasets, so it is indeed a good idea to explore
variants or approximations of SVMs. We will continue exploring this thread (and
the particular suggestion made) and hope to include these results in the final
paper, to add more shallow learners to the comparison.

"...it would be helpful to provide some theoretical analysis...": indeed, but
this is either mathematically challenging (to say the least, since deep models
involve a non-convex optimization) or would likely require very strong
assumptions on the data distribution. However, there is theoretical literature
that answers some basic questions about this issue, starting with the work of
Jonathan Baxter (COLT 1995), "Learning internal representations". The argument
is about capacity and sharing it across tasks so as to achieve better
generalization. The lower layers implement features that can potentially be
shared across tasks. As long as some sharing is possible (because the same
features can be useful for several tasks), there is a potential benefit from
shared internal representations. Whereas a one-hidden-layer MLP can only share
linear features, a deep architecture can share non-linear ones, which have the
potential for representing more abstract concepts.
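
The sharing argument can be illustrated with a minimal sketch. This is not the
architecture or training code used in the paper: the layer sizes, the task names
("digits", "letters"), the class counts and the helper names are all
hypothetical, and the weights are random rather than trained. It only shows the
structure being discussed, in which several tasks reuse the same non-linear
lower layers and differ only in a small task-specific output layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization only).
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def shared_features(x, layers):
    # Stack of non-linear layers whose parameters are shared by every task.
    h = x
    for W, b in layers:
        h = np.tanh(h @ W + b)
    return h

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical sizes: flattened input images, two shared hidden layers.
n_in, n_hidden = 1024, 64
shared = [init_layer(n_in, n_hidden), init_layer(n_hidden, n_hidden)]

# One output layer per task; only these parameters are task-specific.
heads = {
    "digits": init_layer(n_hidden, 10),
    "letters": init_layer(n_hidden, 52),
}

def predict(x, task):
    # Every task reads the same shared representation.
    h = shared_features(x, shared)
    W, b = heads[task]
    return softmax(h @ W + b)

x = rng.random((5, n_in))          # five fake input examples
p_digits = predict(x, "digits")    # shape (5, 10)
p_letters = predict(x, "letters")  # shape (5, 52)
```

With a single hidden layer, what is shared before the non-linearity is only a
linear projection of the input; with two or more shared layers, as here, the
tasks also share features that are non-linear functions of the input.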

Reviewer_5 about semi-supervised learning: In the unsupervised phase, no labels
are used. In the supervised fine-tuning phase, all labels are used, so this is
not the semi-supervised setting. This paper did not examine the potential
advantage of exploiting large quantities of additional unlabeled data, but the
availability of the generated dataset and of the learning setup would make it
possible to easily conduct a study to answer this interesting
question. Note however that previous work [5] already investigated the relative
advantage of the semi-supervised setting for deep vs shallow architectures,
which is why we did not focus on this here. It might still be worth doing these
experiments because the deep learning algorithms were different.

"...human errors may be present...": Indeed, there are variations across human
labelings, which we have estimated (since each character was viewed by 3
different humans) and reported in the paper (the standard
deviations across humans are large, but the standard error across a large test
set is very small, so we believe the average error numbers to be fairly
accurate).
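
The difference between a large standard deviation and a small standard error
can be made concrete with a short sketch. The numbers below are hypothetical and
are not the figures reported in the paper. The sketch assumes that each test
example is an independent right-or-wrong outcome, so that the standard error of
the average error rate shrinks as one over the square root of the test set size.

```python
import math

def error_rate_standard_error(error_rate, n_examples):
    # Each example is treated as an independent Bernoulli trial (assumption),
    # so the per-example standard deviation is sqrt(p * (1 - p)).
    per_example_std = math.sqrt(error_rate * (1.0 - error_rate))
    # Averaging over n examples divides the spread by sqrt(n).
    return per_example_std / math.sqrt(n_examples)

# Hypothetical error rate and test set sizes, for illustration only.
hypothetical_error_rate = 0.2
small_test = error_rate_standard_error(hypothetical_error_rate, 100)
large_test = error_rate_standard_error(hypothetical_error_rate, 10_000)

print(f"standard error with 100 examples:    {small_test:.4f}")
print(f"standard error with 10,000 examples: {large_test:.4f}")
```

Under these assumptions a 100-fold larger test set gives a 10-fold smaller
standard error, which is why an average over a large test set can be accurate
even when individual labelers disagree considerably.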

"...authors do cite a supplement, but I did not have access to it...": that is
strange. We could (and still can) access it from the CMT web site. We will make
sure to include a complete pseudo-code of SDAs in it.

"...main contributions of the manuscript...": the main
contribution is actually to show that the self-taught learning setting is more
beneficial to deeper architectures.

"...restriction to MLPs...": that restriction was motivated by the computational
challenge of training on hundreds of millions of examples. Apart from linear
models (which do not fare well on this task), it is not clear to us what 
could be used, and so MLPs were the
obvious candidates to compare with. We will explore the use of SVM
approximations, as suggested by Reviewer_4. Other suggestions are welcome.

"Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
prior work on character recognition using deformations and transformations".
The main originality is in showing that deep learners can take more advantage
than shallow learners of such data and of the self-taught learning framework in
general.