# HG changeset patch
# User Yoshua Bengio
# Date 1300654184 14400
# Node ID 507cb92d8e15ad6aacaea1cef287536b66ccf7a7
# Parent 677d1b1d8158f0a8935e62aedd2a7c4abd06f7bd
minor modifications

diff -r 677d1b1d8158 -r 507cb92d8e15 writeup/aistats2011_cameraready.tex
--- a/writeup/aistats2011_cameraready.tex Sat Mar 19 23:11:17 2011 -0400
+++ b/writeup/aistats2011_cameraready.tex Sun Mar 20 16:49:44 2011 -0400
@@ -11,6 +11,7 @@
 %\usepackage{algorithm,algorithmic} % not used after all
 \usepackage{graphicx,subfigure}
 \usepackage{natbib}
+%\usepackage{afterpage}
 \addtolength{\textwidth}{10mm}
 \addtolength{\evensidemargin}{-5mm}
@@ -32,15 +33,15 @@
 \bf Thomas Breuel \and
 Youssouf Chherawala \and
 \bf Moustapha Cisse \and
-Myriam Côté \and \\
-\bf Dumitru Erhan \and
-Jeremy Eustache \and
+Myriam Côté \and
+\bf Dumitru Erhan \\
+\and \bf Jeremy Eustache \and
 \bf Xavier Glorot \and
-Xavier Muller \and \\
-\bf Sylvain Pannetier Lebeuf \and
-Razvan Pascanu \and
+Xavier Muller \and
+\bf Sylvain Pannetier Lebeuf \\
+\and \bf Razvan Pascanu \and
 \bf Salah Rifai \and
-Francois Savard \and \\
+Francois Savard \and
 \bf Guillaume Sicard \\
 \vspace*{1mm}}
@@ -143,7 +144,7 @@
 more such levels, can be exploited to {\bf share statistical strength
 across different but related types of examples}, such as examples coming
 from other tasks than the task of interest
-(the multi-task setting), or examples coming from an overlapping
+(the multi-task setting~\citep{caruana97a}), or examples coming from an overlapping
 but different distribution (images with different kinds of perturbations
 and noises, here). This is consistent with the hypotheses discussed
 in~\citet{Bengio-2009} regarding the potential advantage
@@ -152,15 +153,18 @@
 This hypothesis is related to a learning setting called
 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles
-of semi-supervised and multi-task learning: the learner can exploit examples
+of semi-supervised and multi-task learning: in addition to the labeled
+examples from the target distribution, the learner can exploit examples
 that are unlabeled and possibly come from a distribution different from the target
 distribution, e.g., from other classes than those of interest.
 It has already been shown that deep learners can clearly take advantage of
-unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
+unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}
+in order to improve performance on a supervised task,
 but more needed to be done to explore the impact
 of {\em out-of-distribution} examples and of the {\em multi-task} setting
-(one exception is~\citep{CollobertR2008}, which shares and uses unsupervised
-pre-training only with the first layer). In particular the {\em relative
+(two exceptions are~\citet{CollobertR2008}, which shares and uses unsupervised
+pre-training only with the first layer, and~\citet{icml2009_093} in the case
+of video data). In particular the {\em relative
 advantage of deep learning} for these settings has not been evaluated.
@@ -233,7 +237,7 @@
 There are two main parts in the pipeline. The first one,
 from thickness to pinch, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
-More details can be found in~\citep{ARXIV-2010}.
+More details can be found in~\citet{ARXIV-2010}.
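The two-part structure of the perturbation pipeline just described (deforming transformations from thickness to pinch, then noise from blur to contrast) can be sketched in a few lines of Python. This is only a hedged illustration: the toy module implementations below ({\tt slant}, {\tt contrast}) and the single shared {\tt complexity} argument are assumptions made for the example, not the actual operators, which are specified in~\citet{ARXIV-2010}.

\begin{verbatim}
import numpy as np

def slant(img, complexity, rng):
    # Illustrative deformation: shear each row horizontally (a crude slant).
    h, w = img.shape
    shear = 0.5 * rng.uniform(-complexity, complexity)
    out = np.zeros_like(img)
    for i in range(h):
        out[i] = np.roll(img[i], int(round(shear * (i - h / 2))))
    return out

def contrast(img, complexity, rng):
    # Illustrative noise stage: randomly reduce contrast around mid-grey.
    factor = 1.0 - rng.uniform(0.0, complexity)
    return 0.5 + factor * (img - 0.5)

def apply_pipeline(img, complexity, rng,
                   transformations=(slant,), noises=(contrast,)):
    # First the deforming transformations, then the noise modules.
    for module in list(transformations) + list(noises):
        img = module(img, complexity, rng)
    return np.clip(img, 0.0, 1.0)   # keep grey levels in [0, 1]

rng = np.random.RandomState(0)
x = rng.rand(32, 32)                 # stand-in for a 32x32 character image
y = apply_pipeline(x, complexity=0.3, rng=rng)
\end{verbatim}

In the actual generator each module has its own parameters; the shared {\tt complexity} knob here only mirrors the idea that every perturbation can be made more or less severe.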
 \begin{figure*}[ht]
 \centering
@@ -268,7 +272,8 @@
 Much previous work on deep learning had been performed on
 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
 with 60,000 examples, and variants involving 10,000
-examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}.
+examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}.\footnote{Fortunately, there
+are more and more exceptions, such as~\citet{RainaICML09}, which uses a million examples.}
 The focus here is on much larger training sets, from 10 times to
 1000 times larger, and 62 classes.
@@ -346,7 +351,7 @@
 In order to have a good variety of sources we downloaded a large number of free fonts from:
 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. % TODO: pointless to anonymize, it's not pointing to our work
-Including an operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
+Including an operating system's (Windows 7) fonts, we chose uniformly among a total of $9817$ different fonts.
 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or,
 by producing a corresponding image, directly as input to our models.
 %\vspace*{-1mm}
@@ -368,7 +373,7 @@
 characters were included as an additional source.
 This set is part of a larger corpus being collected by the Image Understanding
 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
-({\tt http://www.iupr.com}), and which will be publicly released.
+({\tt http://www.iupr.com}).%, and which will be publicly released.
 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
 %\end{itemize}
@@ -376,8 +381,7 @@
 \subsection{Data Sets}
 %\vspace*{-2mm}
-All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
-from one of the 62 character classes.
+All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with one of 62 character labels.
 %\begin{itemize}
 %\vspace*{-1mm}
@@ -387,10 +391,11 @@
 %\vspace*{-1mm}
 %\item
-{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
+{\bf P07.} This dataset is obtained by taking raw characters from the above 4 sources
 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
-For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
-$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
+For each generated example, a data source is selected with probability $10\%$ from the fonts,
+$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. The transformations are
+applied in the order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples
 obtained from the corresponding NIST sets plus other sources.
@@ -401,29 +406,13 @@
 except that we only apply
 transformations from slant to pinch (see Fig.\ref{fig:transform}(b-f)).
 Therefore, the character is
- transformed but no additional noise is added to the image, giving images
+ transformed but without added noise, yielding images
 closer to the NIST dataset.
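To make the P07 generation procedure above concrete, here is a minimal sketch of how one example could be drawn. The source probabilities ($10\%$/$25\%$/$25\%$/$40\%$) and the per-transformation complexity range $[0,0.7]$ come directly from the text; {\tt load\_raw\_character} and the list of {\tt transformations} are hypothetical placeholders for the real data readers and for the pipeline modules of section~\ref{s:perturbations}.

\begin{verbatim}
import numpy as np

SOURCES = ["font", "captcha", "ocr", "nist"]
PROBS   = [0.10, 0.25, 0.25, 0.40]   # source mixture stated in the text

def generate_p07_example(load_raw_character, transformations, rng):
    source = rng.choice(SOURCES, p=PROBS)          # pick a data source
    img, label = load_raw_character(source, rng)   # 32x32 grey-level character
    for transform in transformations:              # fixed order, as described
        complexity = rng.uniform(0.0, 0.7)         # fresh complexity per module
        img = transform(img, complexity, rng)
    return img, label
\end{verbatim}

The second dataset described above (slant-to-pinch transformations only, no added noise) would use the same loop with the noise modules left out.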
 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples
 obtained from the corresponding NIST sets plus other sources.
 %\end{itemize}
-\begin{figure*}[ht]
-%\vspace*{-2mm}
-\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
-%\vspace*{-2mm}
-\caption{Illustration of the computations and training criterion for the denoising
-auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
-the layer (i.e. raw input or output of previous layer)
-s corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
-The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
-is compared to the uncorrupted input $x$ through the loss function
-$L_H(x,z)$, whose expected value is approximately minimized during training
-by tuning $\theta$ and $\theta'$.}
-\label{fig:da}
-%\vspace*{-2mm}
-\end{figure*}
-
-%\vspace*{-3mm}
+\vspace*{-3mm}
 \subsection{Models and their Hyper-parameters}
 %\vspace*{-2mm}
@@ -431,7 +420,8 @@
 hidden layer) and deep SDAs.
 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
-{\bf Multi-Layer Perceptrons (MLP).} The MLP output estimated with
+{\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates the
+conditional class probabilities
 \[
 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)),
 \]
@@ -474,6 +464,23 @@
 %the whole training sets.
 %\vspace*{-1mm}
+\begin{figure*}[htb]
+%\vspace*{-2mm}
+\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
+%\vspace*{-2mm}
+\caption{Illustration of the computations and training criterion for the denoising
+auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
+the layer (i.e. raw input or output of previous layer)
+is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
+The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
+is compared to the uncorrupted input $x$ through the loss function
+$L_H(x,z)$, whose expected value is approximately minimized during training
+by tuning $\theta$ and $\theta'$.}
+\label{fig:da}
+%\vspace*{-2mm}
+\end{figure*}
+
+%\afterpage{\clearpage}
 {\bf Stacked Denoising Auto-encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
@@ -481,8 +488,9 @@
 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
 apparently setting parameters in the
 basin of attraction of supervised gradient descent yielding better
-generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
-pre-training phase} uses all of the training images but not the training labels.
+generalization~\citep{Erhan+al-2010}.
+This initial {\em unsupervised
+pre-training phase} does not use the training labels.
 Each layer is trained in turn to produce a new representation of its input
 (starting from the raw pixels).
 It is hypothesized that the
@@ -501,30 +509,26 @@
 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides efficient inference, and yielded results
 comparable or better than RBMs in a series of experiments
-\citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian
+\citep{VincentPLarochelleH2008-very-small}.
+Some denoising auto-encoders correspond
+to a Gaussian
 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}.
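Before moving on to the denoising auto-encoder details, the MLP baseline defined by the softmax equation earlier in this subsection can be written out explicitly. This is a minimal NumPy sketch of the forward pass only; the layer sizes are toy values and the training loop is omitted, so it is an illustration rather than the experiment code.

\begin{verbatim}
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # subtract max for numerical stability
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    # P(class | input = x) = softmax(b2 + W2 tanh(b1 + W1 x))
    h = np.tanh(b1 + W1 @ x)           # hidden layer
    return softmax(b2 + W2 @ h)        # probabilities over the 62 classes

rng = np.random.RandomState(0)
x = rng.rand(32 * 32)                                     # flattened 32x32 image
W1, b1 = 0.01 * rng.randn(500, 32 * 32), np.zeros(500)    # 500 hidden units (toy size)
W2, b2 = 0.01 * rng.randn(62, 500), np.zeros(62)
p = mlp_forward(x, W1, b1, W2, b2)                        # sums to 1
\end{verbatim}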
 During its unsupervised training, a Denoising Auto-encoder is presented with a
 stochastically corrupted version $\tilde{x}$ of the input $x$ and trained to produce
 a reconstruction $z$ of the uncorrupted input $x$. Because the network has
 to denoise, it forces the hidden units $y$ to represent the leading regularities in
-the data. Following~\citep{VincentPLarochelleH2008-very-small}
-the hidden units output $y$ is obtained through the sigmoid-affine
+the data. In a slight departure from \citet{VincentPLarochelleH2008-very-small},
+the hidden units' output $y$ is obtained through the tanh-affine
 encoder
-\[
- y={\rm sigm}(c+V x)
-\]
-where ${\rm sigm}(a)=1/(1+\exp(-a))$
-and the reconstruction is obtained through the same transformation
-\[
- z={\rm sigm}(d+V' y)
-\]
-using the transpose of encoder weights.
+$y=\tanh(c+V x)$
+and the reconstruction is obtained through the transposed transformation
+$z=\tanh(d+V' y)$.
 The training set average of the cross-entropy
-reconstruction loss
+reconstruction loss (after mapping values in $(-1,1)$ back into $(0,1)$)
 \[
- L_H(x,z)=\sum_i z_i \log x_i + (1-z_i) \log(1-x_i)
+ L_H(x,z)=-\sum_i \frac{(z_i+1)}{2} \log \frac{(x_i+1)}{2} + \frac{(1-z_i)}{2} \log\frac{(1-x_i)}{2}
 \]
 is minimized.
 Here we use the random binary masking corruption
@@ -552,7 +556,7 @@
 among \{10\%, 20\%, 50\%\}.
 Another hyper-parameter is the number of hidden layers but it was fixed to 3
 for our experiments, based on previous work with
-SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
+SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. We also compared against 1 and 2 hidden layers, to disentangle the effect of depth from that of unsupervised pre-training.
@@ -767,7 +771,7 @@
 of features that can be shared across tasks or across variants of the
 input distribution. A theoretical analysis of generalization improvements
 due to sharing of intermediate features across tasks already points
-towards that explanation~\cite{baxter95a}.
+towards that explanation~\citep{baxter95a}.
 Intermediate features that can be used in different
 contexts can be estimated in a way that allows sharing of statistical strength.
 Features extracted through many levels are more likely to
@@ -794,7 +798,7 @@
 the kind of better basins associated
 with deep learning and out-of-distribution examples.
-A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
+A Java demo of the recognizer (where both the MLP and the SDA can be compared)
 can be executed on-line at {\tt http://deep.host22.com}.

\iffalse
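To tie together the corruption process, the tanh encoder/decoder pair and the reconstruction loss described above, here is a hedged NumPy sketch of a single denoising auto-encoder forward pass. The masking fraction is one of the values considered above ($20\%$); the loss is written in the usual orientation (uncorrupted input as the target) with values mapped from $(-1,1)$ back into $(0,1)$, so the details should be read as an approximation of the description, not as the authors' implementation. Gradient computation and the parameter update are omitted.

\begin{verbatim}
import numpy as np

def mask_corrupt(x, fraction, rng):
    # Random binary masking corruption: a random subset of inputs is set to 0.
    keep = rng.rand(x.size) >= fraction
    return x * keep

def da_forward(x_tilde, V, c, d):
    # Encoder y = tanh(c + V x~); decoder z = tanh(d + V' y), tied weights.
    y = np.tanh(c + V @ x_tilde)
    z = np.tanh(d + V.T @ y)
    return y, z

def reconstruction_loss(x, z, eps=1e-7):
    # Cross-entropy after mapping (-1,1) values back into (0,1).
    p = np.clip((x + 1.0) / 2.0, eps, 1.0 - eps)   # uncorrupted input as target
    q = np.clip((z + 1.0) / 2.0, eps, 1.0 - eps)   # reconstruction
    return -np.sum(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

rng = np.random.RandomState(0)
x = 2.0 * rng.rand(32 * 32) - 1.0                  # toy input in (-1, 1)
V, c, d = 0.01 * rng.randn(500, 32 * 32), np.zeros(500), np.zeros(32 * 32)
x_tilde = mask_corrupt(x, fraction=0.20, rng=rng)  # 20% masking corruption
y, z = da_forward(x_tilde, V, c, d)
loss = reconstruction_loss(x, z)
\end{verbatim}

Stacking three such layers, each pre-trained on the previous layer's {\tt y}, and then fine-tuning the whole network on the supervised objective corresponds to the SDA setup used in the experiments.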