

Reviewer_1 claims that handwriting recognition is essentially solved: we
believe this is not true. Yes, the best methods now achieve essentially
human-level performance on clean digits, but we are not aware of previous
papers achieving human performance on the full character set. It is clear
from our own experimentation (play with the demo to convince yourself)
that humans still clearly outperform machines when the characters are
heavily distorted (e.g. as in our NISTP dataset).


"...not intended to compete with the state-of-the-art...": We had included
comparisons with the state-of-the-art on the NIST dataset (and beat it).


"the demonstrations that self-taught learning can help deep learners is
helpful": indeed, but it is even more interesting to consider the result
that self-taught learning was found *more helpful for deep learners than
for shallow ones*. Since out-of-distribution data is common (especially
out-of-class data), this is of practical importance.

Reviewer_4, "It would also be interesting to compare to SVMs...": ordinary
SVMs cannot be used on such large datasets. We will explore SVM variants
such as the suggestion made to add SVM results to the paper.
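
For illustration only, here is a minimal sketch of the kind of scalable SVM
approximation we have in mind: random Fourier features to approximate an
RBF kernel, followed by a linear SVM trained by SGD. It assumes
scikit-learn's RBFSampler and SGDClassifier, and the hyperparameters are
placeholders rather than tuned values.

    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    # Random Fourier features approximate an RBF kernel, so a linear SVM
    # (hinge loss trained by SGD) can stand in for a kernel SVM while the
    # training cost grows roughly linearly with the number of examples.
    approx_svm = make_pipeline(
        RBFSampler(gamma=0.05, n_components=1000, random_state=0),
        SGDClassifier(loss="hinge", alpha=1e-5),
    )
    # Usage: approx_svm.fit(X_train, y_train); approx_svm.predict(X_test)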


"...it would be helpful to provide some theoretical analysis...": indeed,
but this is either mathematically challenging (to say the least, since deep
models involve a non-convex optimization) or would likely require very
strong assumptions on the data distribution. However, there exists
theoretical literature which answers some basic questions about this issue,
starting with the work of Jonathan Baxter (COLT 1995) "Learning internal
representations". The argument is about capacity and sharing it across
tasks so as to achieve better generalization. The lower layers implement
features that can potentially be shared across tasks. As long as some
sharing is possible (because the same features can be useful for several
tasks), then there is a potential benefit from shared internal
representations. Whereas a one-hidden-layer MLP can only share linear
features, a deep architecture can share non-linear ones which have the
potential for representing more abstract concepts.
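
To make the capacity-sharing argument concrete, here is a minimal sketch
(NumPy; layer sizes and variable names are illustrative, not those of the
paper) in which two non-linear layers are shared across tasks and only the
output weights are task-specific:

    import numpy as np

    rng = np.random.RandomState(0)
    # Two shared non-linear layers; their capacity is amortized across
    # tasks because every task re-uses the same features.
    W1, b1 = 0.01 * rng.randn(784, 500), np.zeros(500)   # shared layer 1
    W2, b2 = 0.01 * rng.randn(500, 500), np.zeros(500)   # shared layer 2
    W_out = {"digits": 0.01 * rng.randn(500, 10),        # task-specific
             "letters": 0.01 * rng.randn(500, 52)}       # output weights

    def predict(x, task):
        h1 = np.tanh(x @ W1 + b1)      # shared non-linear features
        h2 = np.tanh(h1 @ W2 + b2)     # shared, more abstract features
        return h2 @ W_out[task]        # only this part depends on the task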

Reviewer_5 about semi-supervised learning: In the unsupervised phase, no
labels are used. In the supervised fine-tuning phase, all labels are used,
so this is not the semi-supervised setting. This paper did not examine the
potential advantage of exploiting large quantities of additional unlabeled
data, but the availability of the generated dataset and of the learning
setup would make it possible to easily conduct a study to answer this
interesting question. Note however that previous work [5] already
investigated the relative advantage of the semi-supervised setting for deep
vs shallow architectures, which is why we did not focus on this here. It
might still be worthwhile to run these experiments, because the deep
learning algorithms used in [5] were different.

"...human errors may be present...": Indeed, there are variations across
human labelings, which have have estimated (since each character was viewed
by 3 different humans), and reported in the paper (the standard deviations
across humans are large, but the standard error across a large test set is
very small, so we believe the average error numbers to be fairly accurate).
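
In equation form: if sigma is the standard deviation of the disagreement
across human labelers and N is the number of test characters, the standard
error on the mean error is SE = sigma / sqrt(N), so even a large sigma
translates into a small uncertainty on the reported averages once N is
large, as it is for our test sets (the exact figures are in the paper).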

"...authors do cite a supplement, but I did not have access to it...": that
is strange. We could (and still can) access it from the CMT web site. We
will make sure to include a complete pseudo-code of SDAs in it.
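
In the meantime, here is a minimal sketch of a single denoising-autoencoder
layer, the building block of the SDA (NumPy; sizes, corruption level and
variable names are illustrative, and the supplement will contain the
complete pseudo-code):

    import numpy as np

    rng = np.random.RandomState(0)
    W = 0.01 * rng.randn(784, 500)           # tied encoder/decoder weights
    b_h, b_v = np.zeros(500), np.zeros(784)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_reconstruction_loss(x, corruption=0.25):
        # Corrupt the input by zeroing a random subset of its entries,
        # then try to reconstruct the *clean* input from the corrupted copy.
        mask = rng.binomial(1, 1.0 - corruption, size=x.shape)
        h = sigmoid((x * mask) @ W + b_h)    # encode the corrupted input
        x_rec = sigmoid(h @ W.T + b_v)       # decode with tied weights
        # Cross-entropy reconstruction loss; during unsupervised
        # pre-training its gradient is back-propagated to W, b_h and b_v.
        return -np.mean(x * np.log(x_rec + 1e-8)
                        + (1.0 - x) * np.log(1.0 - x_rec + 1e-8))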

"...main contributions of the manuscript...": the main contribution is
actually to show that the self-taught learning setting is more beneficial
to deeper architectures.

"...restriction to MLPs...": that restriction was motivated by the
computational challenge of training on hundreds of millions of
examples. Apart from linear models (which do not fare well on this task),
it is not clear to us what else could be used, and so MLPs were the obvious
candidates to compare with. We will explore the use of SVM approximations,
as discussed in our reply above. Other suggestions are welcome.

"Reviewer 6:...novelty [..] is somewhat marginal since [...] reminiscent of
prior work on character recognition using deformations and
transformations".  The main originality is in showing that deep learners
can take more advantage than shallow learners of such data and of the
self-taught learning framework in general.