comparison writeup/nips2010_submission.tex @ 550:662299f265ab
suggestions from Ian
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Wed, 02 Jun 2010 15:44:46 -0400 |
parents | ef172f4a322a |
children | 8f365abf171d |
549:ef172f4a322a | 550:662299f265ab |
---|---|
31 corresponding shallow learner}, in the area | 31 corresponding shallow learner}, in the area |
32 of handwritten character recognition. In fact, we show that they reach | 32 of handwritten character recognition. In fact, we show that they reach |
33 human-level performance on both handwritten digit classification and | 33 human-level performance on both handwritten digit classification and |
34 62-class handwritten character recognition. For this purpose we | 34 62-class handwritten character recognition. For this purpose we |
35 developed a powerful generator of stochastic variations and noise | 35 developed a powerful generator of stochastic variations and noise |
36 processes character images, including not only affine transformations but | 36 processes for character images, including not only affine transformations but |
37 also slant, local elastic deformations, changes in thickness, background | 37 also slant, local elastic deformations, changes in thickness, background |
38 images, grey level changes, contrast, occlusion, and various types of pixel and | 38 images, grey level changes, contrast, occlusion, and various types of |
39 spatially correlated noise. The out-of-distribution examples are | 39 noise. The out-of-distribution examples are |
40 obtained by training with these highly distorted images or | 40 obtained from these highly distorted images or |
41 by including object classes different from those in the target test set. | 41 by including examples of object classes different from those in the target test set. |
42 \end{abstract} | 42 \end{abstract} |
43 \vspace*{-2mm} | 43 \vspace*{-2mm} |
44 | 44 |
45 \section{Introduction} | 45 \section{Introduction} |
46 \vspace*{-1mm} | 46 \vspace*{-1mm} |
85 stochastic gradient descent. | 85 stochastic gradient descent. |
86 | 86 |
87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 87 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
88 of semi-supervised and multi-task learning: the learner can exploit examples | 88 of semi-supervised and multi-task learning: the learner can exploit examples |
89 that are unlabeled and/or come from a distribution different from the target | 89 that are unlabeled and/or come from a distribution different from the target |
90 distribution, e.g., from other classes that those of interest. | 90 distribution, e.g., from other classes than those of interest. |
91 It has already been shown that deep learners can clearly take advantage of | 91 It has already been shown that deep learners can clearly take advantage of |
92 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | 92 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, |
93 but more needs to be done to explore the impact | 93 but more needs to be done to explore the impact |
94 of {\em out-of-distribution} examples and of the multi-task setting | 94 of {\em out-of-distribution} examples and of the multi-task setting |
95 (one exception is~\citep{CollobertR2008}, but using very different kinds | 95 (one exception is~\citep{CollobertR2008}, which uses very different kinds |
96 of learning algorithms). In particular the {\em relative | 96 of learning algorithms). In particular the {\em relative |
97 advantage} of deep learning for this settings has not been evaluated. | 97 advantage} of deep learning for these settings has not been evaluated. |
98 The hypothesis explored here is that a deep hierarchy of features | 98 The hypothesis explored here is that a deep hierarchy of features |
99 may be better able to provide sharing of statistical strength | 99 may be better able to provide sharing of statistical strength |
100 between different regions in input space or different tasks, | 100 between different regions in input space or different tasks, |
101 as discussed in the conclusion. | 101 as discussed in the conclusion. |
102 | 102 |
118 Do deep architectures {\em benefit more from such out-of-distribution} | 118 Do deep architectures {\em benefit more from such out-of-distribution} |
119 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 119 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
120 | 120 |
121 $\bullet$ %\item | 121 $\bullet$ %\item |
122 Similarly, does the feature learning step in deep learning algorithms benefit more | 122 Similarly, does the feature learning step in deep learning algorithms benefit more |
123 training with similar but different classes (i.e. a multi-task learning scenario) than | 123 from training with moderately different classes (i.e. a multi-task learning scenario) than |
124 a corresponding shallow and purely supervised architecture? | 124 a corresponding shallow and purely supervised architecture? |
125 %\end{enumerate} | 125 %\end{enumerate} |
126 | 126 |
127 Our experimental results provide positive evidence towards all of these questions. | 127 Our experimental results provide positive evidence towards all of these questions. |
128 To achieve these results, we introduce in the next section a sophisticated system | 128 To achieve these results, we introduce in the next section a sophisticated system |
197 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. | 197 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. |
198 Output pixel $(x,y)$ takes the value of input pixel | 198 Output pixel $(x,y)$ takes the value of input pixel |
199 nearest to $(ax+by+c,dx+ey+f)$, | 199 nearest to $(ax+by+c,dx+ey+f)$, |
200 producing scaling, translation, rotation and shearing. | 200 producing scaling, translation, rotation and shearing. |
201 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to | 201 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to |
202 forbid important rotations (not to confuse classes) but to give good | 202 forbid large rotations (not to confuse classes) but to give good |
203 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times | 203 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times |
204 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 | 204 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 |
205 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times | 205 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times |
206 complexity]$. | 206 complexity]$. |
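For concreteness, here is a minimal numpy sketch of this sampling and resampling step. The function name, the 2-D grayscale array convention, and the rounding-to-nearest rule are our own assumptions, not the paper's actual generator code:

```python
import numpy as np

def random_affine(image, complexity, rng=np.random):
    """Sketch: sample the 6 affine parameters from the uniform ranges
    given above, then let output pixel (x, y) take the value of the
    input pixel nearest to (ax+by+c, dx+ey+f)."""
    h, w = image.shape
    a, d = rng.uniform(1 - 3*complexity, 1 + 3*complexity, size=2)
    b, e = rng.uniform(-3*complexity, 3*complexity, size=2)
    c, f = rng.uniform(-4*complexity, 4*complexity, size=2)
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # nearest input pixel (rounding rule is our assumption)
            sx, sy = int(round(a*x + b*y + c)), int(round(d*x + e*y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
```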
207 \vspace*{-1mm} | 207 \vspace*{-1mm} |
238 \vspace*{0.5mm} | 238 \vspace*{0.5mm} |
239 | 239 |
240 {\bf Motion Blur.} | 240 {\bf Motion Blur.} |
241 This is GIMP's ``linear motion blur'' | 241 This is GIMP's ``linear motion blur'' |
242 with parameters $length$ and $angle$. The value of | 242 with parameters $length$ and $angle$. The value of |
243 a pixel in the final image is approximately the mean value of the $length$ first pixels | 243 a pixel in the final image is approximately the mean value of the first $length$ pixels |
244 found by moving in the $angle$ direction. | 244 found by moving in the $angle$ direction. |
245 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. | 245 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. |
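A rough sketch of how such a blur could be implemented follows; since a normal draw can be negative, we take its absolute value for $length$, an assumption the text leaves implicit:

```python
import numpy as np

def motion_blur(image, complexity, rng=np.random):
    """Sketch of a linear motion blur: each output pixel is the mean of
    the first `length` pixels found by stepping in the `angle` direction."""
    angle = rng.uniform(0, 360)                  # degrees
    length = abs(rng.normal(0, 3 * complexity))  # half-normal (assumption)
    n = max(1, int(round(length)))
    dx, dy = np.cos(np.radians(angle)), np.sin(np.radians(angle))
    h, w = image.shape
    out = np.zeros_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            samples = []
            for k in range(n):
                sx, sy = int(round(x + k*dx)), int(round(y + k*dy))
                if 0 <= sx < w and 0 <= sy < h:
                    samples.append(image[sy, sx])
            out[y, x] = np.mean(samples) if samples else image[y, x]
    return out
```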
246 \vspace*{-1mm} | 246 \vspace*{-1mm} |
247 | 247 |
248 {\bf Occlusion.} | 248 {\bf Occlusion.} |
255 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). | 255 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). |
256 This filter is skipped with probability 60\%. | 256 This filter is skipped with probability 60\%. |
257 \vspace*{-1mm} | 257 \vspace*{-1mm} |
258 | 258 |
259 {\bf Pixel Permutation.} | 259 {\bf Pixel Permutation.} |
260 This filter permutes neighbouring pixels. It selects first | 260 This filter permutes neighbouring pixels. It first selects |
261 $\frac{complexity}{3}$ pixels randomly in the image. Each of them is then | 261 a fraction $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then |
262 sequentially exchanged with another pixel in its $V4$ neighbourhood. | 262 sequentially exchanged with another pixel in its $V4$ neighbourhood. |
263 This filter is skipped with probability 80\%. | 263 This filter is skipped with probability 80\%. |
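One possible reading of this filter in numpy; the uniform selection of pixels and the clamping of neighbours at image borders are our assumptions:

```python
import numpy as np

def pixel_permutation(image, complexity, rng=np.random):
    """Sketch: swap a fraction complexity/3 of the pixels, each with a
    randomly chosen 4-connected (V4) neighbour."""
    if rng.uniform() < 0.8:              # filter skipped 80% of the time
        return image
    out = image.copy()
    h, w = image.shape
    k = int((complexity / 3.0) * h * w)  # number of pixels to move
    for _ in range(k):
        y, x = rng.randint(0, h), rng.randint(0, w)
        dy, dx = [(-1, 0), (1, 0), (0, -1), (0, 1)][rng.randint(4)]
        ny, nx = min(max(y + dy, 0), h - 1), min(max(x + dx, 0), w - 1)
        out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
```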
264 \vspace*{-1mm} | 264 \vspace*{-1mm} |
265 | 265 |
266 {\bf Gaussian Noise.} | 266 {\bf Gaussian Noise.} |
267 This filter simply adds, to each pixel of the image independently, a | 267 This filter simply adds, to each pixel of the image independently, a |
268 noise $\sim Normal(0(\frac{complexity}{10})^2)$. | 268 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. |
269 This filter is skipped with probability 70\%. | 269 This filter is skipped with probability 70\%. |
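This filter is simple enough to state directly; a sketch, assuming pixel values in $[0,1]$:

```python
import numpy as np

def gaussian_noise(image, complexity, rng=np.random):
    """Sketch: add i.i.d. Normal(0, (complexity/10)^2) noise per pixel."""
    if rng.uniform() < 0.7:          # the filter is skipped 70% of the time
        return image
    noisy = image + rng.normal(0.0, complexity / 10.0, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # [0,1] pixel range is our assumption
```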
270 \vspace*{-1mm} | 270 \vspace*{-1mm} |
271 | 271 |
272 {\bf Background Images.} | 272 {\bf Background Images.} |
273 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random | 273 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random |
362 %\citep{SorokinAndForsyth2008,whitehill09}. | 362 %\citep{SorokinAndForsyth2008,whitehill09}. |
363 AMT users were presented | 363 AMT users were presented |
364 with 10 character images and asked to choose 10 corresponding ASCII | 364 with 10 character images and asked to choose 10 corresponding ASCII |
365 characters. They were forced to make a hard choice among the | 365 characters. They were forced to make a hard choice among the |
366 62 or 10 character classes (all classes or digits only). | 366 62 or 10 character classes (all classes or digits only). |
367 Three users classified each image, allowing | 367 A total of 2500 images per dataset were classified by XXX subjects, |
368 to estimate inter-human variability. A total 2500 images/dataset were classified. | 368 with 3 subjects classifying each image, allowing |
369 us to estimate inter-human variability (e.g. a standard error of 0.1\% |
370 on the average 18\% error made by humans on the 62-class task). |
369 | 371 |
370 \vspace*{-1mm} | 372 \vspace*{-1mm} |
371 \subsection{Data Sources} | 373 \subsection{Data Sources} |
372 \vspace*{-1mm} | 374 \vspace*{-1mm} |
373 | 375 |
418 %\item | 420 %\item |
419 {\bf OCR data.} | 421 {\bf OCR data.} |
420 A large set (2 million) of scanned, OCRed and manually verified machine-printed | 422 A large set (2 million) of scanned, OCRed and manually verified machine-printed |
421 characters (from various documents and books) were included as an | 423 characters (from various documents and books) were included as an |
422 additional source. This set is part of a larger corpus being collected by the Image Understanding | 424 additional source. This set is part of a larger corpus being collected by the Image Understanding |
423 Pattern Recognition Research group lead by Thomas Breuel at University of Kaiserslautern | 425 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern |
424 ({\tt http://www.iupr.com}), and which will be publicly released. | 426 ({\tt http://www.iupr.com}), and which will be publicly released. |
425 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this | 427 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this |
426 %\end{itemize} | 428 %\end{itemize} |
427 | 429 |
428 \vspace*{-1mm} | 430 \vspace*{-1mm} |
521 \citep{VincentPLarochelleH2008}. During training, a Denoising | 523 \citep{VincentPLarochelleH2008}. During training, a Denoising |
522 Auto-Encoder is presented with a stochastically corrupted version | 524 Auto-Encoder is presented with a stochastically corrupted version |
523 of the input and trained to reconstruct the uncorrupted input, | 525 of the input and trained to reconstruct the uncorrupted input, |
524 forcing the hidden units to represent the leading regularities in | 526 forcing the hidden units to represent the leading regularities in |
525 the data. Once it is trained, in a purely unsupervised way, | 527 the data. Once it is trained, in a purely unsupervised way, |
526 its hidden units activations can | 528 its hidden units' activations can |
527 be used as inputs for training a second one, etc. | 529 be used as inputs for training a second one, etc. |
528 After this unsupervised pre-training stage, the parameters | 530 After this unsupervised pre-training stage, the parameters |
529 are used to initialize a deep MLP, which is fine-tuned by | 531 are used to initialize a deep MLP, which is fine-tuned by |
530 the same standard procedure used to train the MLPs (see previous section). | 532 the same standard procedure used to train the MLPs (see previous section). |
531 The SDA hyper-parameters are the same as for the MLP, with the addition of the | 533 The SDA hyper-parameters are the same as for the MLP, with the addition of the |
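A minimal numpy sketch of this greedy layer-wise pre-training loop, assuming tied weights, masking noise, and a cross-entropy reconstruction loss; the paper's Theano implementation and hyper-parameters may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae_layer(X, n_hidden, corruption, lr=0.1, epochs=5, rng=np.random):
    """One denoising auto-encoder layer: corrupt the input with masking
    noise, then learn to reconstruct the *uncorrupted* input."""
    n_vis = X.shape[1]
    W = rng.normal(0, 0.01, size=(n_vis, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.uniform(size=n_vis) > corruption)  # corrupt
            h = sigmoid(x_tilde @ W + b_h)                        # encode
            x_hat = sigmoid(h @ W.T + b_v)                        # decode (tied W)
            d_v = x_hat - x                     # grad of cross-entropy loss
            d_h = (d_v @ W) * h * (1.0 - h)     # backprop into hidden layer
            W -= lr * (np.outer(x_tilde, d_h) + np.outer(d_v, h))
            b_h -= lr * d_h
            b_v -= lr * d_v
    return W, b_h

def pretrain_sda(X, layer_sizes, corruption):
    """Greedy stacking: each trained layer's hidden activations become
    the training input of the next layer."""
    reps, params = X, []
    for n_hidden in layer_sizes:
        W, b_h = train_dae_layer(reps, n_hidden, corruption)
        params.append((W, b_h))
        reps = sigmoid(reps @ W + b_h)
    return params  # used to initialize the deep MLP before fine-tuning
```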
560 %\vspace*{-1mm} | 562 %\vspace*{-1mm} |
561 %\subsection{SDA vs MLP vs Humans} | 563 %\subsection{SDA vs MLP vs Humans} |
562 %\vspace*{-1mm} | 564 %\vspace*{-1mm} |
563 The models are either trained on NIST (MLP0 and SDA0), | 565 The models are either trained on NIST (MLP0 and SDA0), |
564 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested | 566 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested |
565 on either NIST, NISTP or P07, either on all 62 classes | 567 on either NIST, NISTP or P07, either on the 62-class task |
566 or only on the digits (considering only the outputs | 568 or on the 10-digit task. |
567 associated with digit classes). | |
568 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, | 569 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, |
569 comparing Humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1, | 570 comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1, |
570 SDA2), along with the previous results on the digits NIST special database | 571 SDA2), along with the previous results on the digits NIST special database |
571 19 test set from the literature, based respectively on ARTMAP neural | 572 19 test set from the literature, based respectively on ARTMAP neural |
572 networks~\citep{Granger+al-2007}, fast nearest-neighbor | 573 networks~\citep{Granger+al-2007}, fast nearest-neighbor |
573 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and | 574 search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002-short}, and |
574 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results | 575 SVMs~\citep{Milgram+al-2005}. More detailed and complete numerical results |
577 The deep learner not only outperformed the shallow ones and | 578 The deep learner not only outperformed the shallow ones and |
578 previously published performance (in a statistically and qualitatively | 579 previously published performance (in a statistically and qualitatively |
579 significant way) but, when trained with perturbed data, | 580 significant way) but, when trained with perturbed data, |
580 reaches human performance on both the 62-class task | 581 reaches human performance on both the 62-class task |
581 and the 10-class (digits) task. | 582 and the 10-class (digits) task. |
583 17\% error (SDA1) or 18\% error (humans) may seem high, but a large |
584 majority of the errors from humans and from SDA1 are out-of-context |
585 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a |
586 ``c'' and a ``C'' are often indistinguishable). |
582 | 587 |
583 \begin{figure}[ht] | 588 \begin{figure}[ht] |
584 \vspace*{-3mm} | 589 \vspace*{-3mm} |
585 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} | 590 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} |
586 \vspace*{-3mm} | 591 \vspace*{-3mm} |
623 all tasks. For the multi-task model, the digit error rate is measured by | 628 all tasks. For the multi-task model, the digit error rate is measured by |
624 comparing the correct digit class with the output class associated with the | 629 comparing the correct digit class with the output class associated with the |
625 maximum conditional probability among the digit-class outputs only. The | 630 maximum conditional probability among the digit-class outputs only. The |
626 setting is similar for the other two target classes (lower case characters | 631 setting is similar for the other two target classes (lower case characters |
627 and upper case characters). | 632 and upper case characters). |
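A sketch of this evaluation rule, assuming (hypothetically) that the 10 digit classes occupy the first 10 of the 62 output units; the function name and array layout are ours:

```python
import numpy as np

def digit_error_rate(probs, digit_labels):
    """Restrict the softmax outputs to the digit classes and take the
    argmax among those classes only, ignoring the letter outputs."""
    digit_probs = probs[:, :10]                  # (n_examples, 10)
    predictions = np.argmax(digit_probs, axis=1)
    return float(np.mean(predictions != digit_labels))
```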
628 | |
629 %\vspace*{-1mm} | 633 %\vspace*{-1mm} |
630 %\subsection{Perturbed Training Data More Helpful for SDA} | 634 %\subsection{Perturbed Training Data More Helpful for SDA} |
631 %\vspace*{-1mm} | 635 %\vspace*{-1mm} |
632 | 636 |
633 %\vspace*{-1mm} | 637 %\vspace*{-1mm} |
699 | 703 |
700 In the original self-taught learning framework~\citep{RainaR2007}, the | 704 In the original self-taught learning framework~\citep{RainaR2007}, the |
701 out-of-sample examples were used as a source of unsupervised data, and | 705 out-of-sample examples were used as a source of unsupervised data, and |
702 experiments showed its positive effects in a \emph{limited labeled data} | 706 experiments showed its positive effects in a \emph{limited labeled data} |
703 scenario. However, many of the results by \citet{RainaR2007} (who used a | 707 scenario. However, many of the results by \citet{RainaR2007} (who used a |
704 shallow, sparse coding approach) suggest that the relative gain of self-taught | 708 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught |
705 learning diminishes as the number of labeled examples increases (essentially, | 709 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. |
706 a ``diminishing returns'' scenario occurs). We note instead that, for deep | 710 We note instead that, for deep |
707 architectures, our experiments show that such a positive effect is accomplished | 711 architectures, our experiments show that such a positive effect is accomplished |
708 even in a scenario with a \emph{very large number of labeled examples}. | 712 even in a scenario with a \emph{very large number of labeled examples}, |
713 i.e., here, the relative gain of self-taught learning is probably preserved | |
714 in the asymptotic regime. | |
709 | 715 |
710 {\bf Why would deep learners benefit more from the self-taught learning framework}? | 716 {\bf Why would deep learners benefit more from the self-taught learning framework}? |
711 The key idea is that the lower layers of the predictor compute a hierarchy | 717 The key idea is that the lower layers of the predictor compute a hierarchy |
712 of features that can be shared across tasks or across variants of the | 718 of features that can be shared across tasks or across variants of the |
713 input distribution. Intermediate features that can be used in different | 719 input distribution. Intermediate features that can be used in different |
729 of a deep hierarchy with self-taught learning initializes the | 735 of a deep hierarchy with self-taught learning initializes the |
730 model in the basin of attraction of supervised gradient descent | 736 model in the basin of attraction of supervised gradient descent |
731 that corresponds to better generalization. Furthermore, such good | 737 that corresponds to better generalization. Furthermore, such good |
732 basins of attraction are not discovered by pure supervised learning | 738 basins of attraction are not discovered by pure supervised learning |
733 (with or without self-taught settings), and more labeled examples | 739 (with or without self-taught settings), and more labeled examples |
734 does not allow to go from the poorer basins of attraction discovered | 740 do not allow the model to go from the poorer basins of attraction discovered |
735 by the purely supervised shallow models to the kind of better basins associated | 741 by the purely supervised shallow models to the kind of better basins associated |
736 with deep learning and self-taught learning. | 742 with deep learning and self-taught learning. |
737 | 743 |
738 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 744 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
739 can be executed on-line at {\tt http://deep.host22.com}. | 745 can be executed on-line at {\tt http://deep.host22.com}. |