The paper “Deep Self-Taught Learning for Handwritten Character Recognition” by Bengio et al. claims that deep neural networks benefit more from self-taught learning than shallow ones do.

The paper presents neural network models applied to handwritten character recognition. Various transformations and noise-injection schemes for generating additional training data are introduced to obtain so-called “out-of-distribution” examples. MLPs with one hidden layer are then trained on various data sets in a fully supervised way and compared with three-hidden-layer MLPs in which each layer is initialized in an unsupervised way and then fine-tuned using Back-Propagation. It is then concluded that deep learners benefit more from out-of-distribution examples as well as from a multi-task setting.
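
(To make the setup concrete, a minimal sketch of greedy layer-wise denoising-autoencoder pretraining of the kind described above is given below. It reflects my own assumptions, tied weights, sigmoid units, masking noise, squared-error reconstruction and per-example SGD on placeholder data, and is not the authors' implementation.)

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def pretrain_dae_layer(X, n_hidden, corruption=0.3, lr=0.1, epochs=1, seed=0):
        # One denoising-autoencoder layer: corrupt the input, encode, decode with
        # tied weights, and minimize squared reconstruction error by SGD.
        rng = np.random.default_rng(seed)
        n_visible = X.shape[1]
        W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
        for _ in range(epochs):
            for x in X:
                x_tilde = x * (rng.random(n_visible) > corruption)  # masking noise
                h = sigmoid(x_tilde @ W + b_h)                      # encode
                x_hat = sigmoid(h @ W.T + b_v)                      # decode
                d_v = (x_hat - x) * x_hat * (1.0 - x_hat)           # output delta
                d_h = (d_v @ W) * h * (1.0 - h)                     # hidden delta
                W -= lr * (np.outer(x_tilde, d_h) + np.outer(d_v, h))
                b_h -= lr * d_h
                b_v -= lr * d_v
        return W, b_h

    # Greedy stacking: each layer is pretrained on the representation produced by
    # the layer below; the resulting weights would then initialize a
    # 3-hidden-layer MLP that is fine-tuned with supervised Back-Propagation.
    X = np.random.default_rng(0).random((200, 28 * 28))  # placeholder images
    reps, layers = X, []
    for n_hidden in (500, 500, 500):
        W, b = pretrain_dae_layer(reps, n_hidden)
        layers.append((W, b))
        reps = sigmoid(reps @ W + b)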

It is well known that artificially increasing the training data, either by adding noise or by incorporating some prior knowledge into the generation of new data points, acts as a regularizer and helps to improve performance (Simard et al. 2003, ICDAR). It is therefore not very surprising that deep architectures with a higher complexity profit more from this procedure. The paper suggests that MLPs (with one hidden layer) perform worse than deep SDAs (i.e. pretrained MLPs with three hidden layers), especially when the training data is artificially increased. I would argue that an MLP with three hidden layers trained in a fully supervised way would also perform better than the 1-hidden-layer MLP. It would therefore have been interesting to see results for such an MLP. Only in this way would a fair comparison between shallow and deep MLPs, as well as between supervised and unsupervised training, be possible.
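
(For illustration, Simard-style elastic deformation plus additive noise amounts to only a few lines; the sketch below uses parameter values and a helper name of my own choosing rather than anything taken from the paper.)

    import numpy as np
    from scipy.ndimage import gaussian_filter, map_coordinates

    def elastic_distort(image, alpha=8.0, sigma=3.0, noise_std=0.1, seed=None):
        # Random smoothed displacement field (elastic deformation) followed by
        # additive Gaussian pixel noise; the input is a 2-D image in [0, 1].
        rng = np.random.default_rng(seed)
        h, w = image.shape
        dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
        dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
        y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        warped = map_coordinates(image, [y + dy, x + dx], order=1, mode="reflect")
        return np.clip(warped + rng.normal(0.0, noise_std, (h, w)), 0.0, 1.0)

    # Example on a placeholder 28x28 image.
    augmented = elastic_distort(np.random.default_rng(0).random((28, 28)), seed=0)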

This paper claims that deep architectures with unsupervised pre-training outperform shallow ones and that additional training data is more beneficial for deep architectures. I think the authors should have compared their SDA with a 3-hidden-layer MLP to support this claim. Furthermore, it is claimed that unsupervised pre-training is required to successfully train deep (3-hidden-layer) MLPs. However, there are no experiments in this paper that justify this claim, and I would also argue that deep MLPs can be successfully trained with Back-Propagation, especially if enough training data is available (Ciresan et al. 2010, Neural Computation). I therefore strongly encourage the authors to either include the result of such an experiment or adjust the conclusion accordingly.

To cut a long story short, this paper wants to establish SDAs as the state of the art for character recognition, without even checking whether deep MLPs trained in the usual supervised way are better or not. I ran a simple test and trained a three-hidden-layer MLP (500-500-500) on deformed NIST, obtaining a test error rate of 1.08% on the un-deformed NIST test set, compared with the 1.4% of the SDA in Table 1. For this particular task a three-hidden-layer MLP outperforms an even bigger SDA. I am therefore not fully convinced that unsupervised pre-training is necessary to obtain good performance on the presented task.
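
(Such a baseline is easy to reproduce with off-the-shelf tools; the sketch below uses placeholder data and hyper-parameters of my own choosing, scikit-learn's MLPClassifier with three layers of 500 tanh units and plain SGD, and is not the exact configuration I trained.)

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    # Placeholder arrays standing in for the deformed training set and the
    # un-deformed NIST test set (flattened 28x28 images, 10 digit classes).
    X_train, y_train = rng.random((1000, 28 * 28)), rng.integers(0, 10, 1000)
    X_test, y_test = rng.random((200, 28 * 28)), rng.integers(0, 10, 200)

    # Plain supervised Back-Propagation, no unsupervised pre-training.
    mlp = MLPClassifier(hidden_layer_sizes=(500, 500, 500), activation="tanh",
                        solver="sgd", learning_rate_init=0.01, max_iter=20)
    mlp.fit(X_train, y_train)
    print("test error rate: %.2f%%" % (100.0 * (1.0 - mlp.score(X_test, y_test))))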

The extensive use of font styles made the paper hard to follow. It is also very difficult to understand which networks were trained on what data. In particular, in Table 1 of the Appendix it is not clear whether the nets in the last column, which are tested on digits, were trained on all 62 character classes or only on digits.