diff writeup/nips2010_submission.tex @ 550:662299f265ab

suggestions from Ian
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 02 Jun 2010 15:44:46 -0400
parents ef172f4a322a
children 8f365abf171d
--- a/writeup/nips2010_submission.tex	Wed Jun 02 13:56:01 2010 -0400
+++ b/writeup/nips2010_submission.tex	Wed Jun 02 15:44:46 2010 -0400
@@ -33,12 +33,12 @@
   human-level performance on both handwritten digit classification and
   62-class handwritten character recognition.  For this purpose we
   developed a powerful generator of stochastic variations and noise
-  processes character images, including not only affine transformations but
+  processes for character images, including not only affine transformations but
   also slant, local elastic deformations, changes in thickness, background
-  images, grey level changes, contrast, occlusion, and various types of pixel and
-  spatially correlated noise. The out-of-distribution examples are 
-  obtained by training with these highly distorted images or
-  by including object classes different from those in the target test set.
+  images, grey level changes, contrast, occlusion, and various types of
+  noise. The out-of-distribution examples are 
+  obtained from these highly distorted images or
+  by including examples of object classes different from those in the target test set.
 \end{abstract}
 \vspace*{-2mm}
 
@@ -87,14 +87,14 @@
 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
 of semi-supervised and multi-task learning: the learner can exploit examples
 that are unlabeled and/or come from a distribution different from the target
-distribution, e.g., from other classes that those of interest. 
+distribution, e.g., from other classes than those of interest. 
 It has already been shown that deep learners can clearly take advantage of
 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
 but more needs to be done to explore the impact
 of {\em out-of-distribution} examples and of the multi-task setting
-(one exception is~\citep{CollobertR2008}, but using very different kinds
+(one exception is~\citep{CollobertR2008}, which uses very different kinds
 of learning algorithms). In particular the {\em relative
-advantage} of deep learning for this settings has not been evaluated.
+advantage} of deep learning for these settings has not been evaluated.
 The hypothesis explored here is that a deep hierarchy of features
 may be better able to provide sharing of statistical strength
 between different regions in input space or different tasks,
@@ -120,7 +120,7 @@
 
 $\bullet$ %\item 
 Similarly, does the feature learning step in deep learning algorithms benefit more 
-training with similar but different classes (i.e. a multi-task learning scenario) than
+from training with moderately different classes (i.e. a multi-task learning scenario) than
 a corresponding shallow and purely supervised architecture?
 %\end{enumerate}
 
@@ -199,7 +199,7 @@
 nearest to $(ax+by+c,dx+ey+f)$,
 producing scaling, translation, rotation and shearing.
 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
-forbid important rotations (not to confuse classes) but to give good
+forbid large rotations (so as not to confuse classes) but to give good
 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
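
The affine deformation described in the hunk above can be made concrete with a short sketch. The following is a minimal illustration, not the authors' actual generator: it assumes a grey-level image stored as a 2-D numpy array, the (column, row) coordinate convention for (x, y), and nearest-pixel sampling; the function name random_affine is invented for the example.

# A minimal sketch (not the authors' generator) of the affine module described
# above, assuming a grey-level image stored as a 2-D numpy array and the
# convention that (x, y) indexes (column, row).
import numpy as np

def random_affine(image, complexity, rng=np.random):
    # Parameter ranges taken from the text; complexity = 0 gives the identity map.
    a = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity)
    d = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity)
    b = rng.uniform(-3 * complexity, 3 * complexity)
    e = rng.uniform(-3 * complexity, 3 * complexity)
    c = rng.uniform(-4 * complexity, 4 * complexity)
    f = rng.uniform(-4 * complexity, 4 * complexity)
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # output pixel (x, y) takes the value of the input pixel
            # nearest to (a*x + b*y + c, d*x + e*y + f)
            sx = int(round(a * x + b * y + c))
            sy = int(round(d * x + e * y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out

In this sketch, complexity = 0 yields the identity map, which matches the intent that the complexity knob controls the amount of distortion.
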
@@ -240,7 +240,7 @@
 {\bf Motion Blur.}
 This is GIMP's ``linear motion blur'' 
 with parameters $length$ and $angle$. The value of
-a pixel in the final image is approximately the  mean value of the $length$ first pixels
+a pixel in the final image is approximately the  mean value of the first $length$ pixels
 found by moving in the $angle$ direction. 
 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
 \vspace*{-1mm}
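
As a rough illustration of the motion blur just described, the sketch below averages, for each output pixel, the first length pixels met while moving in the sampled angle direction; GIMP's exact algorithm is not reproduced, and taking the absolute value of the Normal draw for length is an assumption made here to handle negative samples.

# A rough sketch of the linear motion blur described above (GIMP's exact
# algorithm is not reproduced); the absolute value of the Normal draw is an
# assumption made here to handle negative lengths.
import numpy as np

def motion_blur(image, complexity, rng=np.random):
    angle = np.radians(rng.uniform(0.0, 360.0))
    length = int(abs(rng.normal(0.0, 3.0 * complexity)))
    if length <= 1:
        return image.copy()              # negligible blur
    dx, dy = np.cos(angle), np.sin(angle)
    h, w = image.shape
    out = np.zeros_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            # average the first `length` pixels met while moving along `angle`
            samples = []
            for k in range(length):
                sx, sy = int(round(x + k * dx)), int(round(y + k * dy))
                if 0 <= sx < w and 0 <= sy < h:
                    samples.append(image[sy, sx])
            out[y, x] = np.mean(samples) if samples else image[y, x]
    return out
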
@@ -257,15 +257,15 @@
 \vspace*{-1mm}
 
 {\bf Pixel Permutation.}
-This filter permutes neighbouring pixels. It selects first
-$\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
+This filter permutes neighbouring pixels. It first selects 
+a fraction $\frac{complexity}{3}$ of the pixels at random in the image. Each of them is then
 sequentially exchanged with another pixel in its $V4$ neighbourhood. 
 This filter is skipped with probability 80\%.
 \vspace*{-1mm}
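
A minimal sketch of the pixel permutation filter follows, under the reading that a fraction complexity/3 of the pixel positions is drawn without replacement and each is swapped with a random 4-connected (V4) neighbour; the boundary handling and function name are choices made for the example rather than details taken from the paper.

# A minimal sketch of the pixel permutation filter: a fraction complexity/3 of
# the pixel positions is drawn without replacement and each one is swapped with
# a random 4-connected (V4) neighbour; swaps that would leave the image are
# simply skipped (a boundary-handling assumption).
import numpy as np

V4 = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # 4-connected neighbourhood offsets

def pixel_permutation(image, complexity, rng=np.random):
    if rng.uniform() < 0.8:               # the filter is skipped 80% of the time
        return image.copy()
    out = image.copy()
    h, w = image.shape
    n = int(round((complexity / 3.0) * h * w))
    for flat in rng.choice(h * w, size=n, replace=False):
        y, x = divmod(int(flat), w)
        dy, dx = V4[rng.randint(len(V4))]
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w:
            out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
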
 
 {\bf Gaussian Noise.}
 This filter simply adds, to each pixel of the image independently, a
-noise $\sim Normal(0(\frac{complexity}{10})^2)$.
+noise $\sim {\rm Normal}(0,(\frac{complexity}{10})^2)$.
 This filter is skipped with probability 70\%.
 \vspace*{-1mm}
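
The Gaussian noise filter reduces to a few lines; the sketch below adds i.i.d. noise of standard deviation complexity/10 to every pixel and skips the filter with probability 0.7, as stated above.

# Gaussian noise filter: i.i.d. noise with standard deviation complexity/10 is
# added to every pixel; the whole filter is skipped with probability 0.7.
import numpy as np

def gaussian_noise(image, complexity, rng=np.random):
    if rng.uniform() < 0.7:
        return image.copy()
    return image + rng.normal(0.0, complexity / 10.0, size=image.shape)
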
 
@@ -364,8 +364,10 @@
 with 10 character images and asked to choose 10 corresponding ASCII
 characters. They were forced to make a hard choice among the
 62 or 10 character classes (all classes or digits only). 
-Three users classified each image, allowing
-to estimate inter-human variability. A total 2500 images/dataset were classified.
+A total of 2500 images per dataset were classified by XXX subjects,
+with 3 subjects classifying each image, allowing
+us to estimate inter-human variability (e.g. a standard error of 0.1\%
+on the average 18\% error made by humans on the 62-class task). 
 
 \vspace*{-1mm}
 \subsection{Data Sources}
@@ -420,7 +422,7 @@
 A large set (2 million) of scanned, OCRed and manually verified machine-printed 
 characters (from various documents and books) was included as an
 additional source. This set is part of a larger corpus being collected by the Image Understanding
-Pattern Recognition Research group lead by Thomas Breuel at University of Kaiserslautern 
+Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern 
 ({\tt http://www.iupr.com}), which will be publicly released.
 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
 %\end{itemize}
@@ -523,7 +525,7 @@
 of the input and trained to reconstruct the uncorrupted input,
 forcing the hidden units to represent the leading regularities in
 the data. Once it is trained, in a purely unsupervised way, 
-its hidden units activations can
+its hidden units' activations can
 be used as inputs for training a second one, etc.
 After this unsupervised pre-training stage, the parameters
 are used to initialize a deep MLP, which is fine-tuned by
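
To make the stacking procedure described in the hunk above concrete, here is a compact numpy sketch of a single denoising auto-encoder layer with sigmoid units, masking corruption, tied weights and a cross-entropy reconstruction loss; the corruption level, learning rate and tied weights are illustrative assumptions, not details taken from the paper.

# A compact numpy sketch (not the authors' implementation) of one denoising
# auto-encoder layer with sigmoid units, masking corruption, tied weights and a
# cross-entropy reconstruction loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoder:
    def __init__(self, n_in, n_hidden, corruption=0.25, lr=0.1, rng=np.random):
        self.W = rng.uniform(-0.1, 0.1, size=(n_in, n_hidden))
        self.b_h = np.zeros(n_hidden)          # hidden-layer biases
        self.b_v = np.zeros(n_in)              # reconstruction biases
        self.corruption, self.lr, self.rng = corruption, lr, rng

    def hidden(self, x):
        """Hidden-unit activations; used as the input of the next layer once trained."""
        return sigmoid(x @ self.W + self.b_h)

    def train_step(self, x):
        """One gradient step on a minibatch x of shape (batch, n_in), values in [0, 1]."""
        mask = self.rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        x_tilde = x * mask                     # corrupt the input (masking noise)
        h = self.hidden(x_tilde)
        z = sigmoid(h @ self.W.T + self.b_v)   # reconstruction of the *uncorrupted* x
        dz = (z - x) / x.shape[0]              # gradient of mean cross-entropy w.r.t. the pre-activation of z
        dh = (dz @ self.W) * h * (1.0 - h)
        self.W -= self.lr * (x_tilde.T @ dh + dz.T @ h)   # tied weights: two contributions
        self.b_h -= self.lr * dh.sum(axis=0)
        self.b_v -= self.lr * dz.sum(axis=0)

Stacking then consists of training one such layer, computing hidden(x) on the clean inputs, and using those activations as the training data of the next layer; the learned weights are what initialize the deep MLP that supervised fine-tuning starts from.
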
@@ -562,11 +564,10 @@
 %\vspace*{-1mm}
 The models are either trained on NIST (MLP0 and SDA0), 
 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested
-on either NIST, NISTP or P07, either on all 62 classes
-or only on the digits (considering only the outputs
-associated with digit classes).
+on either NIST, NISTP or P07, either on the 62-class task
+or on the 10-class (digits) task.
 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
-comparing Humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
+comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
 SDA2), along with the previous results on the digits NIST special database
 19 test set from the literature respectively based on ARTMAP neural
 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search
@@ -579,6 +580,10 @@
 significant way) but when trained with perturbed data
 reaches human performance on both the 62-class task
 and the 10-class (digits) task. 
+17\% error (SDA1) or 18\% error (humans) may seem high, but a large
+majority of the errors made by humans and by SDA1 are out-of-context
+confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
+``c'' and a ``C'' are often indistinguishable).
 
 \begin{figure}[ht]
 \vspace*{-3mm}
@@ -625,7 +630,6 @@
 maximum conditional probability among only the digit class outputs.  The
 setting is similar for the other two target classes (lower case characters
 and upper case characters).
-
 %\vspace*{-1mm}
 %\subsection{Perturbed Training Data More Helpful for SDA}
 %\vspace*{-1mm}
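
The digits-only evaluation described in the hunk above can be sketched as follows: keep the full set of output probabilities, but take the argmax only over the outputs of the target subset. The index layout (outputs 0-9 being the digits) and the function name are assumptions made for illustration.

# A small sketch of the restricted-class evaluation: the 62 output
# probabilities are kept, but the prediction is the argmax over the subset of
# interest (digits here). The index layout is a hypothetical choice.
import numpy as np

DIGIT_CLASSES = np.arange(10)   # hypothetical layout: outputs 0-9 are the digits

def restricted_error_rate(probs, labels, class_subset=DIGIT_CLASSES):
    """probs: (n, 62) conditional class probabilities; labels: true indices,
    all assumed to belong to class_subset."""
    sub = probs[:, class_subset]                    # keep only the allowed outputs
    preds = class_subset[np.argmax(sub, axis=1)]    # argmax among that subset
    return float(np.mean(preds != labels))
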
@@ -701,11 +705,13 @@
 out-of-sample examples were used as a source of unsupervised data, and
 experiments showed its positive effects in a \emph{limited labeled data}
 scenario. However, many of the results by \citet{RainaR2007} (who used a
-shallow, sparse coding approach) suggest that the relative gain of self-taught
-learning diminishes as the number of labeled examples increases (essentially,
-a ``diminishing returns'' scenario occurs).  We note instead that, for deep
+shallow, sparse coding approach) suggest that the {\em relative gain of self-taught
+learning vs ordinary supervised learning} diminishes as the number of labeled examples increases.
+We note instead that, for deep
 architectures, our experiments show that such a positive effect is accomplished
-even in a scenario with a \emph{very large number of labeled examples}.
+even in a scenario with a \emph{very large number of labeled examples},
+i.e., here, the relative gain of self-taught learning is probably preserved
+in the asymptotic regime.
 
 {\bf Why would deep learners benefit more from the self-taught learning framework}?
 The key idea is that the lower layers of the predictor compute a hierarchy
@@ -731,7 +737,7 @@
 that corresponds to better generalization. Furthermore, such good
 basins of attraction are not discovered by pure supervised learning
 (with or without self-taught settings), and more labeled examples
-does not allow to go from the poorer basins of attraction discovered
+do not allow the model to go from the poorer basins of attraction discovered
 by the purely supervised shallow models to the kind of better basins associated
 with deep learning and self-taught learning.