# HG changeset patch
# User Yoshua Bengio
# Date 1275498567 14400
# Node ID 316c7bdad5ad433dcc1e68a1b1cc4a87d6d72d96
# Parent cf68f56854063c45a0bb6913b36b5a864a1181bf
charts

diff -r cf68f5685406 -r 316c7bdad5ad writeup/images/charts.ods
Binary file writeup/images/charts.ods has changed
diff -r cf68f5685406 -r 316c7bdad5ad writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex Wed Jun 02 11:45:17 2010 -0400
+++ b/writeup/nips2010_submission.tex Wed Jun 02 13:09:27 2010 -0400
@@ -73,13 +73,14 @@
 learning, often in a greedy layer-wise ``unsupervised pre-training''
 stage~\citep{Bengio-2009}. One of these layer initialization techniques,
 applied here, is the Denoising
-Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which
+Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}),
+which
 performed similarly or better than previously proposed Restricted Boltzmann
 Machines in terms of unsupervised extraction of a hierarchy of features
 useful for classification. The principle is that each layer starting from
 the bottom is trained to encode its input (the output of the previous
 layer) and to reconstruct it from a corrupted version. After this
-unsupervised initialization, the stack of denoising auto-encoders can be
+unsupervised initialization, the stack of DAs can be
 converted into a deep supervised feedforward neural network and fine-tuned by
 stochastic gradient descent.
@@ -124,6 +125,10 @@
 %\end{enumerate}
 Our experimental results provide positive evidence towards all of these questions.
+To achieve these results, we introduce in the next section a sophisticated system
+for stochastically transforming character images. The conclusion discusses
+the more general question of why deep learners may benefit so much from
+the self-taught learning framework.

 \vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
@@ -131,7 +136,13 @@
 \vspace*{-1mm}

 This section describes the different transformations we used to stochastically
-transform source images in order to obtain data. More details can
+transform source images in order to obtain data from a distribution which
+covers a domain substantially larger than that of the clean characters from
+which we start. Although character transformations have been used before to
+improve character recognizers, this effort is on a large scale both
+in the number of classes and in the complexity of the transformations, and hence
+in the complexity of the learning task.
+More details can
 be found in this technical report~\citep{ift6266-tr-anonymous}.
 The code for these transformations (mostly Python) is available at
 {\tt http://anonymous.url.net}. All the modules in the pipeline share
@@ -334,10 +345,11 @@
 The first step in constructing the larger datasets (called NISTP and P07) is
 to sample from a {\em data source}: {\bf NIST} (NIST database 19),
 {\bf Fonts}, {\bf Captchas}, and {\bf OCR data} (scanned machine printed
 characters). Once a character
-is sampled from one of these sources (chosen randomly), the pipeline of
-the transformations and/or noise processes described in section \ref{s:perturbations}
-is applied to the image.
+is sampled from one of these sources (chosen randomly), the second step is to
+apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.
+To provide a baseline for error-rate comparison, we also estimate human performance
+on both the 62-class task and the 10-class digits task.
 We compare the best MLPs against the best SDAs (both models'
 hyper-parameters are selected to minimize the validation set error),
 along with a comparison against a precise estimate
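To make the two-step construction of NISTP and P07 concrete, here is a minimal Python sketch of the sample-then-perturb loop, under stated assumptions: the source callables, the transformation callables, and the shared complexity parameter are illustrative stand-ins, not the actual pipeline code at http://anonymous.url.net.

    import random

    def sample_character(sources):
        # Step 1: pick a data source at random (e.g. NIST, Fonts, Captchas, OCR)
        # and draw one labeled character image from it.
        draw = random.choice(sources)
        return draw()                      # assumed to return an (image, label) pair

    def perturb(image, transformations, complexity=0.7, rng=random):
        # Step 2: apply each stochastic transformation / noise process in sequence.
        # Each module is assumed to be a callable taking (image, complexity, rng);
        # `complexity` is the shared knob controlling the strength of the distortions.
        for transform in transformations:
            image = transform(image, complexity, rng)
        return image

    def build_perturbed_dataset(sources, transformations, n_examples, complexity=0.7):
        dataset = []
        for _ in range(n_examples):
            image, label = sample_character(sources)
            dataset.append((perturb(image, transformations, complexity), label))
        return dataset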
@@ -460,20 +472,6 @@
 through preliminary experiments (measuring performance on a validation set),
 and $0.1$ was then selected for optimizing on the whole training sets.
-\begin{figure}[ht]
-\vspace*{-2mm}
-\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
-\caption{Illustration of the computations and training criterion for the denoising
-auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
-the layer (i.e. raw input or output of previous layer)
-is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
-The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
-is compared to the uncorrupted input $x$ through the loss function
-$L_H(x,z)$, whose expected value is approximately minimized during training
-by tuning $\theta$ and $\theta'$.}
-\label{fig:da}
-\vspace*{-2mm}
-\end{figure}

 {\bf Stacked Denoising Auto-Encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
@@ -490,6 +488,21 @@
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).

+\begin{figure}[ht]
+\vspace*{-2mm}
+\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
+\caption{Illustration of the computations and training criterion for the denoising
+auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
+the layer (i.e. raw input or output of previous layer)
+is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
+The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
+is compared to the uncorrupted input $x$ through the loss function
+$L_H(x,z)$, whose expected value is approximately minimized during training
+by tuning $\theta$ and $\theta'$.}
+\label{fig:da}
+\vspace*{-2mm}
+\end{figure}
+
 Here we chose to use the Denoising
 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
 these deep hierarchies of features, as it is very simple to train and
@@ -514,14 +527,14 @@
 from the same above set). The fraction of inputs corrupted was selected
 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
 of hidden layers but it was fixed to 3 based on previous work with
-stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
+SDAs on MNIST~\citep{VincentPLarochelleH2008}.

 \vspace*{-1mm}

 \begin{figure}[ht]
 \vspace*{-2mm}
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
-\caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
+\caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
 of all models, on 3 different test sets (NIST, NISTP, P07).
 Right: error rates on NIST test digits only, along with the previous results from
@@ -580,8 +593,8 @@
 The left side of the figure shows the improvement to the clean
 NIST test set error brought by the use of out-of-distribution
 examples (i.e. the perturbed examples from NISTP or P07).
-Relative change is measured by taking
-(original model's error / perturbed-data model's error - 1).
+Relative percent change is measured by taking
+100\% $\times$ (original model's error / perturbed-data model's error - 1).
 The right side of Figure~\ref{fig:improvements-charts} shows the relative
 improvement brought by the use of a multi-task setting, in which the same
 model is
@@ -589,9 +602,10 @@
 with all 62 classes when the target classes are respectively the digits,
 lower-case, or upper-case characters). Again, whereas the gain from the
 multi-task setting is marginal or negative for the MLP, it is substantial
-for the SDA. Note that for these multi-task experiment, only the original
+for the SDA. Note that to simplify these multi-task experiments, only the original
 NIST dataset is used. For example, the MLP-digits bar shows the relative
-improvement in MLP error rate on the NIST digits test set (1 - single-task
+percent improvement in MLP error rate on the NIST digits test set
+is 100\% $\times$ (1 - single-task
 model's error / multi-task model's error). The single-task model is trained
 with only 10 outputs (one per digit), seeing only digit examples,
 whereas the multi-task model is trained with 62 outputs, with all 62
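As a quick arithmetic illustration of the two relative measures just defined, the short Python snippet below transcribes them literally from the text; the error rates plugged in are made up for the example and are not results from the paper.

    def relative_percent_change(original_error, perturbed_data_error):
        # 100% x (original model's error / perturbed-data model's error - 1)
        return 100.0 * (original_error / perturbed_data_error - 1.0)

    def multitask_relative_improvement(single_task_error, multi_task_error):
        # 100% x (1 - single-task model's error / multi-task model's error)
        return 100.0 * (1.0 - single_task_error / multi_task_error)

    # Made-up error rates, purely to show the arithmetic:
    print(relative_percent_change(2.0, 1.6))         # 100 * (2.0/1.6 - 1) = 25.0
    print(multitask_relative_improvement(1.5, 2.0))  # 100 * (1 - 1.5/2.0) = 25.0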
@@ -647,16 +661,16 @@
 %\begin{itemize}
 $\bullet$ %\item
-Do the good results previously obtained with deep architectures on the
+{\bf Do the good results previously obtained with deep architectures on the
 MNIST digits generalize to the setting of a much larger and richer (but similar)
-dataset, the NIST special database 19, with 62 classes and around 800k examples?
+dataset, the NIST special database 19, with 62 classes and around 800k examples}?
 Yes, the SDA {\bf systematically outperformed the MLP and all the previously
 published results on this dataset} (the ones that we are aware of), {\bf in fact
 reaching human-level performance} at around 17\% error on the 62-class task
 and 1.4\% on the digits.

 $\bullet$ %\item
-To what extent do self-taught learning scenarios help deep learners,
-and do they help them more than shallow supervised ones?
+{\bf To what extent do self-taught learning scenarios help deep learners,
+and do they help them more than shallow supervised ones}?
 We found that distorted training examples not only made the resulting
 classifier better on similarly perturbed images but also on
 the {\em original clean examples}, and more importantly and more novel,
@@ -669,7 +683,8 @@
 were very significantly boosted by these out-of-distribution examples.
 Similarly, whereas the improvement due to the multi-task setting was marginal or
 negative for the MLP (from +5.6\% to -3.6\% relative change),
-it was very significant for the SDA (from +13\% to +27\% relative change).
+it was very significant for the SDA (from +13\% to +27\% relative change),
+which may be explained by the arguments below.
 %\end{itemize}

 In the original self-taught learning framework~\citep{RainaR2007}, the
@@ -682,7 +697,7 @@
 architectures, our experiments show that such a positive effect is accomplished
 even in a scenario with a \emph{very large number of labeled examples}.

-Why would deep learners benefit more from the self-taught learning framework?
+{\bf Why would deep learners benefit more from the self-taught learning framework}?
 The key idea is that the lower layers of the predictor compute a hierarchy
 of features that can be shared across tasks or across variants of the
 input distribution. Intermediate features that can be used in different
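Finally, as a rough illustration of the per-layer training criterion summarized in the caption of Figure~\ref{fig:da}, the NumPy sketch below performs one stochastic-gradient step of a single denoising auto-encoder layer with masking corruption. The layer sizes, corruption level, learning rate, untied decoder weights, and sigmoid/cross-entropy choices are assumptions made for the example, not the paper's actual code or settings.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def init_da(n_in, n_hidden, rng):
        # theta = (W, b) for the encoder f, theta' = (Wp, bp) for the decoder g (untied here).
        return {"W":  rng.uniform(-0.01, 0.01, (n_in, n_hidden)),
                "b":  np.zeros(n_hidden),
                "Wp": rng.uniform(-0.01, 0.01, (n_hidden, n_in)),
                "bp": np.zeros(n_in)}

    def da_sgd_step(params, x, rng, corruption=0.1, lr=0.1):
        # Corrupt x into x_tilde by zeroing a random fraction of its components.
        x_tilde = x * rng.binomial(1, 1.0 - corruption, size=x.shape)
        # Encode y = f_theta(x_tilde), then decode z = g_theta'(y).
        y = sigmoid(x_tilde @ params["W"] + params["b"])
        z = sigmoid(y @ params["Wp"] + params["bp"])
        # Cross-entropy reconstruction loss L_H(x, z) against the *uncorrupted* input x.
        loss = -np.sum(x * np.log(z + 1e-10) + (1.0 - x) * np.log(1.0 - z + 1e-10))
        # Back-propagation: sigmoid + cross-entropy yields the (z - x) error signal.
        dz = z - x
        dy = (dz @ params["Wp"].T) * y * (1.0 - y)
        params["Wp"] -= lr * np.outer(y, dz)
        params["bp"] -= lr * dz
        params["W"]  -= lr * np.outer(x_tilde, dy)
        params["b"]  -= lr * dy
        return loss

    # Usage on a made-up 32x32 binary image (sizes and hyper-parameters are illustrative only):
    rng = np.random.RandomState(0)
    params = init_da(n_in=32 * 32, n_hidden=500, rng=rng)
    x = (rng.uniform(size=32 * 32) > 0.5).astype(float)
    print(da_sgd_step(params, x, rng, corruption=0.1, lr=0.1))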