# HG changeset patch
# User Yoshua Bengio
# Date 1275507886 14400
# Node ID 662299f265abc10e071f8bad80f9f10e9fa05aa6
# Parent ef172f4a322ab5231d7fce3992bcccea447bbce9
suggestions from Ian

diff -r ef172f4a322a -r 662299f265ab writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex Wed Jun 02 13:56:01 2010 -0400
+++ b/writeup/nips2010_submission.tex Wed Jun 02 15:44:46 2010 -0400
@@ -33,12 +33,12 @@
   human-level performance on both handwritten digit classification and
   62-class handwritten character recognition. For this purpose we
   developed a powerful generator of stochastic variations and noise
-  processes character images, including not only affine transformations but
+  processes for character images, including not only affine transformations but
   also slant, local elastic deformations, changes in thickness, background
-  images, grey level changes, contrast, occlusion, and various types of pixel and
-  spatially correlated noise. The out-of-distribution examples are
-  obtained by training with these highly distorted images or
-  by including object classes different from those in the target test set.
+  images, grey level changes, contrast, occlusion, and various types of
+  noise. The out-of-distribution examples are
+  obtained from these highly distorted images or
+  by including examples of object classes different from those in the target test set.
 \end{abstract}
 \vspace*{-2mm}
@@ -87,14 +87,14 @@
 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
 of semi-supervised and multi-task learning: the learner can exploit examples
 that are unlabeled and/or come from a distribution different from the target
-distribution, e.g., from other classes that those of interest.
+distribution, e.g., from other classes than those of interest.
 It has already been shown that deep learners can clearly take
 advantage of unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
 but more needs to be done to explore the impact
 of {\em out-of-distribution} examples and of the multi-task setting
-(one exception is~\citep{CollobertR2008}, but using very different kinds
+(one exception is~\citep{CollobertR2008}, which uses very different kinds
 of learning algorithms). In particular the {\em relative
-advantage} of deep learning for this settings has not been evaluated.
+advantage} of deep learning for these settings has not been evaluated.
 The hypothesis explored here is that a deep hierarchy of features
 may be better able to provide sharing of statistical strength
 between different regions in input space or different tasks,
@@ -120,7 +120,7 @@
 $\bullet$ %\item
 Similarly, does the feature learning step in deep learning algorithms benefit more
-training with similar but different classes (i.e. a multi-task learning scenario) than
+from training with moderately different classes (i.e. a multi-task learning scenario) than
 a corresponding shallow and purely supervised architecture?
 %\end{enumerate}
@@ -199,7 +199,7 @@
 nearest to $(ax+by+c,dx+ey+f)$,
 producing scaling, translation, rotation and shearing.
 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
-forbid important rotations (not to confuse classes) but to give good
+forbid large rotations (not to confuse classes) but to give good
 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim[-3 \times complexity,3
 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
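
For concreteness, a minimal NumPy sketch of the affine module described in the hunk above follows: output pixel (x,y) takes the value of the input pixel nearest to (ax+by+c, dx+ey+f), with a,d ~ U[1-3*complexity, 1+3*complexity], b,e ~ U[-3*complexity, 3*complexity] and c,f ~ U[-4*complexity, 4*complexity]. The function name, the rounding to the nearest pixel and the clipping of out-of-range coordinates at the image border are illustrative assumptions, not the generator used for the paper.

    import numpy as np

    def random_affine(image, complexity, rng=np.random):
        # Sample (a, b, c, d, e, f) with the marginals given in the text.
        a, d = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity, size=2)
        b, e = rng.uniform(-3 * complexity, 3 * complexity, size=2)
        c, f = rng.uniform(-4 * complexity, 4 * complexity, size=2)

        height, width = image.shape
        out = np.zeros_like(image)
        for y in range(height):
            for x in range(width):
                # Output pixel (x, y) takes the value of the input pixel
                # nearest to (a*x + b*y + c, d*x + e*y + f); coordinates
                # falling outside the image are clipped to the border here.
                sx = int(round(a * x + b * y + c))
                sy = int(round(d * x + e * y + f))
                sx = min(max(sx, 0), width - 1)
                sy = min(max(sy, 0), height - 1)
                out[y, x] = image[sy, sx]
        return out

With complexity = 0 this reduces to the identity; larger values produce stronger scaling, translation, rotation and shearing, while large rotations remain forbidden by the hand-tuned marginals above.
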
@@ -240,7 +240,7 @@
 {\bf Motion Blur.}
 This is GIMP's ``linear motion blur'' with parameters $length$ and $angle$. The value of
-a pixel in the final image is approximately the mean value of the $length$ first pixels
+a pixel in the final image is approximately the mean value of the first $length$ pixels
 found by moving in the $angle$ direction. Here $angle \sim U[0,360]$ degrees, and
 $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
 \vspace*{-1mm}
@@ -257,15 +257,15 @@
 \vspace*{-1mm}
 {\bf Pixel Permutation.}
-This filter permutes neighbouring pixels. It selects first
-$\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
+This filter permutes neighbouring pixels. It first selects
+fraction $\frac{complexity}{3}$ of pixels randomly in the image. Each of them is then
 sequentially exchanged with one other in as $V4$ neighbourhood.
 This filter is skipped with probability 80\%.
 \vspace*{-1mm}
 {\bf Gaussian Noise.}
 This filter simply adds, to each pixel of the image independently, a
-noise $\sim Normal(0(\frac{complexity}{10})^2)$.
+noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
 This filter is skipped with probability 70\%.
 \vspace*{-1mm}
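
As a sketch of the two pixel-level filters above, the NumPy snippet below applies the stated skip probabilities and parameter ranges; the border handling, the reading of "fraction complexity/3 of pixels" as a count of pixel visits, and the function names are assumptions for illustration rather than the filters actually used.

    import numpy as np

    def pixel_permutation(image, complexity, rng=np.random):
        # Skipped with probability 80%.
        if rng.uniform() < 0.8:
            return image
        out = image.copy()
        height, width = out.shape
        # Visit a random fraction complexity/3 of the pixels and exchange
        # each of them with one of its 4-connected (V4) neighbours.
        n_swaps = int(round(complexity / 3.0 * height * width))
        for _ in range(n_swaps):
            y = rng.randint(0, height)
            x = rng.randint(0, width)
            dy, dx = [(-1, 0), (1, 0), (0, -1), (0, 1)][rng.randint(0, 4)]
            y2 = min(max(y + dy, 0), height - 1)  # neighbours outside the
            x2 = min(max(x + dx, 0), width - 1)   # image are clipped here
            out[y, x], out[y2, x2] = out[y2, x2], out[y, x]
        return out

    def gaussian_noise(image, complexity, rng=np.random):
        # Skipped with probability 70%; otherwise add i.i.d.
        # Normal(0, (complexity/10)^2) noise to every pixel.
        if rng.uniform() < 0.7:
            return image
        return image + rng.normal(0.0, complexity / 10.0, size=image.shape)
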
@@ -364,8 +364,10 @@
 with 10 character images and asked to choose 10 corresponding ASCII
 characters. They were forced to make a hard choice among the
 62 or 10 character classes (all classes or digits only).
-Three users classified each image, allowing
-to estimate inter-human variability. A total 2500 images/dataset were classified.
+A total of 2500 images/dataset were classified by XXX subjects,
+with 3 subjects classifying each image, allowing
+us to estimate inter-human variability (e.g.\ a standard error of 0.1\%
+on the average 18\% error made by humans on the 62-class task).
 \vspace*{-1mm}
 \subsection{Data Sources}
@@ -420,7 +422,7 @@
 A large set (2 million) of scanned, OCRed and manually verified machine-printed
 characters (from various documents and books) where included as an
 additional source. This set is part of a larger corpus being collected by the Image Understanding
-Pattern Recognition Research group lead by Thomas Breuel at University of Kaiserslautern
+Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
 ({\tt http://www.iupr.com}), and which will be publicly released.
 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
 %\end{itemize}
@@ -523,7 +525,7 @@
 of the input and trained to reconstruct the uncorrupted input,
 forcing the hidden units to represent the leading regularities in
 the data. Once it is trained, in a purely unsupervised way,
-its hidden units activations can
+its hidden units' activations can
 be used as inputs for training a second one, etc.
 After this unsupervised pre-training stage, the parameters
 are used to initialize a deep MLP, which is fine-tuned by
@@ -562,11 +564,10 @@
 %\vspace*{-1mm}
 The models are either trained on NIST (MLP0 and SDA0),
 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested
-on either NIST, NISTP or P07, either on all 62 classes
-or only on the digits (considering only the outputs
-associated with digit classes).
+on either NIST, NISTP or P07, either on the 62-class task
+or on the 10-class (digits) task.
 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
-comparing Humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
+comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
 SDA2), along with the previous results on the digits NIST special database
 19 test set from the literature respectively based on ARTMAP neural
 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search
@@ -579,6 +580,10 @@
 significant way) but when trained with perturbed data
 reaches human performance on both the 62-class task
 and the 10-class (digits) task.
+17\% error (SDA1) or 18\% error (humans) may seem high, but a large
+majority of the errors from humans and from SDA1 are from out-of-context
+confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
+``c'' and a ``C'' are often indistinguishable).
 \begin{figure}[ht]
 \vspace*{-3mm}
@@ -625,7 +630,6 @@
 maximum conditional probability among only the digit classes outputs. The
 setting is similar for the other two target classes (lower case characters
 and upper case characters).
-
 %\vspace*{-1mm}
 %\subsection{Perturbed Training Data More Helpful for SDA}
 %\vspace*{-1mm}
@@ -701,11 +705,13 @@
 out-of-sample examples were used as a source of unsupervised data, and
 experiments showed its positive effects in a \emph{limited labeled data}
 scenario. However, many of the results by \citet{RainaR2007} (who used a
-shallow, sparse coding approach) suggest that the relative gain of self-taught
-learning diminishes as the number of labeled examples increases (essentially,
-a ``diminishing returns'' scenario occurs). We note instead that, for deep
+shallow, sparse coding approach) suggest that the {\em relative gain of self-taught
+learning vs ordinary supervised learning} diminishes as the number of labeled examples increases.
+We note instead that, for deep
 architectures, our experiments show that such a positive effect is accomplished
-even in a scenario with a \emph{very large number of labeled examples}.
+even in a scenario with a \emph{very large number of labeled examples},
+i.e., here, the relative gain of self-taught learning is probably preserved
+in the asymptotic regime.
 {\bf Why would deep learners benefit more from the self-taught learning framework}?
 The key idea is that the lower layers of the predictor compute a hierarchy
@@ -731,7 +737,7 @@
 that corresponds to better generalization. Furthermore, such good
 basins of attraction are not discovered by pure supervised learning
 (with or without self-taught settings), and more labeled examples
-does not allow to go from the poorer basins of attraction discovered
+does not allow the model to go from the poorer basins of attraction discovered
 by the purely supervised shallow models to the kind of better basins associated
 with deep learning and self-taught learning.
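
To make the stacked denoising autoencoder procedure described above (random corruption of the input, reconstruction of the uncorrupted input, then re-use of the hidden units' activations as input to the next layer) more concrete, here is a minimal NumPy sketch. The masking corruption, the squared-error reconstruction cost, tied weights and plain per-example SGD are simplifying assumptions for illustration; they are not the exact cost or code used for the experiments.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_denoising_autoencoder(X, n_hidden, corruption=0.25, lr=0.1,
                                    n_epochs=10, rng=np.random):
        # One layer: corrupt the input with masking noise, encode it, decode
        # it (tied weights here), and minimise the squared reconstruction
        # error against the *uncorrupted* input.
        n_visible = X.shape[1]
        W = rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden))
        b_hid = np.zeros(n_hidden)
        b_vis = np.zeros(n_visible)
        for _ in range(n_epochs):
            for x in X:
                keep = rng.uniform(size=n_visible) > corruption
                x_tilde = x * keep                      # corrupted input
                h = sigmoid(x_tilde @ W + b_hid)        # hidden code
                x_hat = sigmoid(h @ W.T + b_vis)        # reconstruction
                d_out = (x_hat - x) * x_hat * (1 - x_hat)
                d_hid = (d_out @ W) * h * (1 - h)
                W -= lr * (np.outer(x_tilde, d_hid) + np.outer(d_out, h))
                b_hid -= lr * d_hid
                b_vis -= lr * d_out
        # Hidden-unit activations on the clean inputs feed the next layer.
        return (W, b_hid), sigmoid(X @ W + b_hid)

    def pretrain_stack(X, layer_sizes, corruption=0.25):
        # Greedy, purely unsupervised, layer-by-layer pre-training.
        params, layer_input = [], X
        for n_hidden in layer_sizes:
            (W, b), layer_input = train_denoising_autoencoder(
                layer_input, n_hidden, corruption)
            params.append((W, b))
        return params

The list of (W, b) pairs returned by pretrain_stack is what would then initialize the hidden layers of a deep MLP before the supervised fine-tuning stage mentioned above.
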
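
The digits-only evaluation described above (taking the class with maximum conditional probability among only the digit outputs of the 62-way classifier) amounts to a restricted argmax. A small sketch follows, assuming a hypothetical class ordering in which the first 10 of the 62 outputs are the digits; the same restriction applies to the lower-case and upper-case subsets.

    import numpy as np

    # Hypothetical ordering assumption for illustration: outputs 0-9 are the
    # digits '0'-'9', followed by the lower-case and upper-case letters.
    DIGIT_CLASSES = np.arange(10)

    def predict_digits_only(class_probabilities):
        # class_probabilities: array of shape (n_examples, 62) holding the
        # conditional class probabilities output by the 62-way classifier.
        # The digits-task prediction is the argmax over digit outputs only.
        return np.argmax(class_probabilities[:, DIGIT_CLASSES], axis=1)
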