comparison writeup/aistats2011_cameraready.tex @ 639:507cb92d8e15
minor modifications
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Sun, 20 Mar 2011 16:49:44 -0400 |
parents | 677d1b1d8158 |
children | 8b1a0b9fecff |
638:677d1b1d8158 | 639:507cb92d8e15 |
---|---|
9 \usepackage[utf8]{inputenc} | 9 \usepackage[utf8]{inputenc} |
10 \usepackage[psamsfonts]{amssymb} | 10 \usepackage[psamsfonts]{amssymb} |
11 %\usepackage{algorithm,algorithmic} % not used after all | 11 %\usepackage{algorithm,algorithmic} % not used after all |
12 \usepackage{graphicx,subfigure} | 12 \usepackage{graphicx,subfigure} |
13 \usepackage{natbib} | 13 \usepackage{natbib} |
14 %\usepackage{afterpage} | |
14 | 15 |
15 \addtolength{\textwidth}{10mm} | 16 \addtolength{\textwidth}{10mm} |
16 \addtolength{\evensidemargin}{-5mm} | 17 \addtolength{\evensidemargin}{-5mm} |
17 \addtolength{\oddsidemargin}{-5mm} | 18 \addtolength{\oddsidemargin}{-5mm} |
18 | 19 |
30 \bf Arnaud Bergeron \and | 31 \bf Arnaud Bergeron \and |
31 Nicolas Boulanger-Lewandowski \and \\ | 32 Nicolas Boulanger-Lewandowski \and \\ |
32 \bf Thomas Breuel \and | 33 \bf Thomas Breuel \and |
33 Youssouf Chherawala \and | 34 Youssouf Chherawala \and |
34 \bf Moustapha Cisse \and | 35 \bf Moustapha Cisse \and |
35 Myriam Côté \and \\ | 36 Myriam Côté \and |
36 \bf Dumitru Erhan \and | 37 \bf Dumitru Erhan \\ |
37 Jeremy Eustache \and | 38 \and \bf Jeremy Eustache \and |
38 \bf Xavier Glorot \and | 39 \bf Xavier Glorot \and |
39 Xavier Muller \and \\ | 40 Xavier Muller \and |
40 \bf Sylvain Pannetier Lebeuf \and | 41 \bf Sylvain Pannetier Lebeuf \\ |
41 Razvan Pascanu \and | 42 \and \bf Razvan Pascanu \and |
42 \bf Salah Rifai \and | 43 \bf Salah Rifai \and |
43 Francois Savard \and \\ | 44 Francois Savard \and |
44 \bf Guillaume Sicard \\ | 45 \bf Guillaume Sicard \\ |
45 \vspace*{1mm}} | 46 \vspace*{1mm}} |
46 | 47 |
47 %I can't use aistatsaddress in a single side paragraph. | 48 %I can't use aistatsaddress in a single side paragraph. |
48 %The document is 2 columns, but this section spans the 2 columns, so there is only 1 left | 49 %The document is 2 columns, but this section spans the 2 columns, so there is only 1 left |
141 observed with deep learners, we focus here on the following {\em hypothesis}: | 142 observed with deep learners, we focus here on the following {\em hypothesis}: |
142 intermediate levels of representation, especially when there are | 143 intermediate levels of representation, especially when there are |
143 more such levels, can be exploited to {\bf share | 144 more such levels, can be exploited to {\bf share |
144 statistical strength across different but related types of examples}, | 145 statistical strength across different but related types of examples}, |
145 such as examples coming from other tasks than the task of interest | 146 such as examples coming from other tasks than the task of interest |
146 (the multi-task setting), or examples coming from an overlapping | 147 (the multi-task setting~\citep{caruana97a}), or examples coming from an overlapping |
147 but different distribution (images with different kinds of perturbations | 148 but different distribution (images with different kinds of perturbations |
148 and noises, here). This is consistent with the hypotheses discussed | 149 and noises, here). This is consistent with the hypotheses discussed |
149 in~\citet{Bengio-2009} regarding the potential advantage | 150 in~\citet{Bengio-2009} regarding the potential advantage |
150 of deep learning and the idea that more levels of representation can | 151 of deep learning and the idea that more levels of representation can |
151 give rise to more abstract, more general features of the raw input. | 152 give rise to more abstract, more general features of the raw input. |
152 | 153 |
153 This hypothesis is related to a learning setting called | 154 This hypothesis is related to a learning setting called |
154 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles | 155 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles |
155 of semi-supervised and multi-task learning: the learner can exploit examples | 156 of semi-supervised and multi-task learning: in addition to the labeled |
157 examples from the target distribution, the learner can exploit examples | |
156 that are unlabeled and possibly come from a distribution different from the target | 158 that are unlabeled and possibly come from a distribution different from the target |
157 distribution, e.g., from other classes than those of interest. | 159 distribution, e.g., from other classes than those of interest. |
158 It has already been shown that deep learners can clearly take advantage of | 160 It has already been shown that deep learners can clearly take advantage of |
159 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | 161 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small} |
162 in order to improve performance on a supervised task, | |
160 but more needed to be done to explore the impact | 163 but more needed to be done to explore the impact |
161 of {\em out-of-distribution} examples and of the {\em multi-task} setting | 164 of {\em out-of-distribution} examples and of the {\em multi-task} setting |
162 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised | 165 (two exceptions are~\citet{CollobertR2008}, which shares and uses unsupervised |
163 pre-training only with the first layer). In particular the {\em relative | 166 pre-training only with the first layer, and~\citet{icml2009_093} in the case |
167 of video data). In particular the {\em relative | |
164 advantage of deep learning} for these settings has not been evaluated. | 168 advantage of deep learning} for these settings has not been evaluated. |
165 | 169 |
166 | 170 |
167 % | 171 % |
168 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can | 172 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can |
231 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 235 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the |
232 amount of deformation or noise introduced. | 236 amount of deformation or noise introduced. |
233 There are two main parts in the pipeline. The first one, | 237 There are two main parts in the pipeline. The first one, |
234 from thickness to pinch, performs transformations. The second | 238 from thickness to pinch, performs transformations. The second |
235 part, from blur to contrast, adds different kinds of noise. | 239 part, from blur to contrast, adds different kinds of noise. |
236 More details can be found in~\citep{ARXIV-2010}. | 240 More details can be found in~\citet{ARXIV-2010}. |
237 | 241 |
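As a rough Python sketch of this structure (assuming each deformation or noise module exposes a call of the form module(image, complexity); this is an illustration, not the actual generator of~\citet{ARXIV-2010}):

    def perturb(image, modules, complexity):
        """Sketch of the two-part pipeline: `modules` is assumed to list the
        deformation modules (thickness ... pinch) followed by the noise modules
        (blur ... contrast), each taking (image, complexity) and returning a
        new image."""
        assert 0.0 <= complexity <= 1.0   # global control parameter
        for apply_module in modules:      # deformations first, then added noise
            image = apply_module(image, complexity)
        return image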
238 \begin{figure*}[ht] | 242 \begin{figure*}[ht] |
239 \centering | 243 \centering |
240 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} | 244 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} |
241 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} | 245 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} |
266 %\vspace*{-1mm} | 270 %\vspace*{-1mm} |
267 | 271 |
268 Much previous work on deep learning had been performed on | 272 Much previous work on deep learning had been performed on |
269 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, | 273 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
270 with 60,000 examples, and variants involving 10,000 | 274 with 60,000 examples, and variants involving 10,000 |
271 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}. | 275 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}.\footnote{Fortunately, there |
276 are more and more exceptions of course, such as~\citet{RainaICML09} using a million examples.} | |
272 The focus here is on much larger training sets, from 10 times to | 277 The focus here is on much larger training sets, from 10 times to |
273 1000 times larger, and 62 classes. | 278 1000 times larger, and 62 classes. |
274 | 279 |
275 The first step in constructing the larger datasets (called NISTP and P07) is to sample from | 280 The first step in constructing the larger datasets (called NISTP and P07) is to sample from |
276 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 281 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
344 %\item | 349 %\item |
345 {\bf Fonts.} | 350 {\bf Fonts.} |
346 In order to have a good variety of sources, we downloaded a large number of free fonts from: | 351 In order to have a good variety of sources, we downloaded a large number of free fonts from: |
347 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. | 352 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. |
348 % TODO: pointless to anonymize, it's not pointing to our work | 353 % TODO: pointless to anonymize, it's not pointing to our work |
349 Including an operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. | 354 Including an operating system's (Windows 7) fonts, we chose uniformly from $9817$ different fonts. |
350 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | 355 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, |
351 directly as input to our models. | 356 directly as input to our models. |
352 %\vspace*{-1mm} | 357 %\vspace*{-1mm} |
353 | 358 |
354 %\item | 359 %\item |
366 {\bf OCR data.} | 371 {\bf OCR data.} |
367 A large set (2 million) of scanned, OCRed and manually verified machine-printed | 372 A large set (2 million) of scanned, OCRed and manually verified machine-printed |
368 characters was included as an | 373 characters was included as an |
369 additional source. This set is part of a larger corpus being collected by the Image Understanding | 374 additional source. This set is part of a larger corpus being collected by the Image Understanding |
370 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern | 375 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern |
371 ({\tt http://www.iupr.com}), and which will be publicly released. | 376 ({\tt http://www.iupr.com}).%, and which will be publicly released. |
372 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this | 377 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this |
373 %\end{itemize} | 378 %\end{itemize} |
374 | 379 |
375 %\vspace*{-3mm} | 380 %\vspace*{-3mm} |
376 \subsection{Data Sets} | 381 \subsection{Data Sets} |
377 %\vspace*{-2mm} | 382 %\vspace*{-2mm} |
378 | 383 |
379 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label | 384 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with one of 62 character labels. |
380 from one of the 62 character classes. | |
381 %\begin{itemize} | 385 %\begin{itemize} |
382 %\vspace*{-1mm} | 386 %\vspace*{-1mm} |
383 | 387 |
384 %\item | 388 %\item |
385 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has | 389 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has |
386 \{651,668 / 80,000 / 82,587\} \{training / validation / test\} examples. | 390 \{651,668 / 80,000 / 82,587\} \{training / validation / test\} examples. |
387 %\vspace*{-1mm} | 391 %\vspace*{-1mm} |
388 | 392 |
389 %\item | 393 %\item |
390 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | 394 {\bf P07.} This dataset is obtained by taking raw characters from the above 4 sources |
391 and sending them through the transformation pipeline described in section \ref{s:perturbations}. | 395 and sending them through the transformation pipeline described in section \ref{s:perturbations}. |
392 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, | 396 For each generated example, a data source is selected with probability $10\%$ from the fonts, |
393 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | 397 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. The transformations are |
398 applied in the | |
394 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. | 399 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. |
395 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples | 400 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples |
396 obtained from the corresponding NIST sets plus other sources. | 401 obtained from the corresponding NIST sets plus other sources. |
397 %\vspace*{-1mm} | 402 %\vspace*{-1mm} |
398 | 403 |
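To make the sampling recipe concrete, here is a minimal Python sketch of drawing one P07 example under the mixture proportions and complexity range given above; draw_raw_character and the ordered transformations list are hypothetical helpers, not code from the paper.

    import numpy as np

    SOURCES = ["fonts", "captcha", "ocr", "nist"]
    PROBS = [0.10, 0.25, 0.25, 0.40]                 # mixture proportions from the text

    def generate_p07_example(draw_raw_character, transformations, rng=np.random):
        """Sketch only: pick a data source, then perturb the raw character."""
        source = SOURCES[rng.choice(len(SOURCES), p=PROBS)]
        image, label = draw_raw_character(source)    # hypothetical helper
        for transform in transformations:            # applied in the order given above
            complexity = rng.uniform(0.0, 0.7)       # one complexity draw per transformation
            image = transform(image, complexity)
        return image, label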
399 %\item | 404 %\item |
400 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) | 405 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) |
401 except that we only apply | 406 except that we only apply |
402 transformations from slant to pinch (see Fig.~\ref{fig:transform}(b-f)). | 407 transformations from slant to pinch (see Fig.~\ref{fig:transform}(b-f)). |
403 Therefore, the character is | 408 Therefore, the character is |
404 transformed but no additional noise is added to the image, giving images | 409 transformed but without added noise, yielding images |
405 closer to the NIST dataset. | 410 closer to the NIST dataset. |
406 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples | 411 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples |
407 obtained from the corresponding NIST sets plus other sources. | 412 obtained from the corresponding NIST sets plus other sources. |
408 %\end{itemize} | 413 %\end{itemize} |
409 | 414 |
410 \begin{figure*}[ht] | 415 \vspace*{-3mm} |
411 %\vspace*{-2mm} | |
412 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
413 %\vspace*{-2mm} | |
414 \caption{Illustration of the computations and training criterion for the denoising | |
415 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
416 the layer (i.e. raw input or output of previous layer) | |
417 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
418 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
419 is compared to the uncorrupted input $x$ through the loss function | |
420 $L_H(x,z)$, whose expected value is approximately minimized during training | |
421 by tuning $\theta$ and $\theta'$.} | |
422 \label{fig:da} | |
423 %\vspace*{-2mm} | |
424 \end{figure*} | |
425 | |
426 %\vspace*{-3mm} | |
427 \subsection{Models and their Hyper-parameters} | 416 \subsection{Models and their Hyper-parameters} |
428 %\vspace*{-2mm} | 417 %\vspace*{-2mm} |
429 | 418 |
430 The experiments are performed using MLPs (with a single | 419 The experiments are performed using MLPs (with a single |
431 hidden layer) and deep SDAs. | 420 hidden layer) and deep SDAs. |
432 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} | 421 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} |
433 | 422 |
434 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimated with | 423 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates the |
424 class-conditional probabilities | |
435 \[ | 425 \[ |
436 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)), | 426 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)), |
437 \] | 427 \] |
438 i.e., two layers, where $p={\rm softmax}(a)$ means that | 428 i.e., two layers, where $p={\rm softmax}(a)$ means that |
439 $p_i(x)=\exp(a_i)/\sum_j \exp(a_j)$ | 429 $p_i(x)=\exp(a_i)/\sum_j \exp(a_j)$ |
472 %through preliminary experiments (measuring performance on a validation set), | 462 %through preliminary experiments (measuring performance on a validation set), |
473 %and $0.1$ (which was found to work best) was then selected for optimizing on | 463 %and $0.1$ (which was found to work best) was then selected for optimizing on |
474 %the whole training sets. | 464 %the whole training sets. |
475 %\vspace*{-1mm} | 465 %\vspace*{-1mm} |
476 | 466 |
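For illustration, a minimal NumPy sketch of this forward computation (the weight shapes, e.g. W1 of size n_hidden x n_inputs and W2 of size 62 x n_hidden, are assumptions; this is not the implementation used for the experiments):

    import numpy as np

    def mlp_output(x, W1, b1, W2, b2):
        """Sketch of P(class | input = x) = softmax(b2 + W2 tanh(b1 + W1 x))."""
        h = np.tanh(b1 + W1.dot(x))   # hidden layer
        a = b2 + W2.dot(h)            # pre-softmax activations
        a = a - a.max()               # subtract the max for numerical stability
        e = np.exp(a)
        return e / e.sum()            # p_i = exp(a_i) / sum_j exp(a_j)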
467 \begin{figure*}[htb] | |
468 %\vspace*{-2mm} | |
469 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
470 %\vspace*{-2mm} | |
471 \caption{Illustration of the computations and training criterion for the denoising | |
472 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
473 the layer (i.e. raw input or output of previous layer) | |
474 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
475 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
476 is compared to the uncorrupted input $x$ through the loss function | |
477 $L_H(x,z)$, whose expected value is approximately minimized during training | |
478 by tuning $\theta$ and $\theta'$.} | |
479 \label{fig:da} | |
480 %\vspace*{-2mm} | |
481 \end{figure*} | |
482 | |
483 %\afterpage{\clearpage} | |
477 | 484 |
478 {\bf Stacked Denoising Auto-encoders (SDA).} | 485 {\bf Stacked Denoising Auto-encoders (SDA).} |
479 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 486 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
480 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 487 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
481 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, | 488 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, |
482 apparently setting parameters in the | 489 apparently setting parameters in the |
483 basin of attraction of supervised gradient descent yielding better | 490 basin of attraction of supervised gradient descent yielding better |
484 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised | 491 generalization~\citep{Erhan+al-2010}. |
485 pre-training phase} uses all of the training images but not the training labels. | 492 This initial {\em unsupervised |
493 pre-training phase} does not use the training labels. | |
486 Each layer is trained in turn to produce a new representation of its input | 494 Each layer is trained in turn to produce a new representation of its input |
487 (starting from the raw pixels). | 495 (starting from the raw pixels). |
488 It is hypothesized that the | 496 It is hypothesized that the |
489 advantage brought by this procedure stems from a better prior, | 497 advantage brought by this procedure stems from a better prior, |
490 on the one hand taking advantage of the link between the input | 498 on the one hand taking advantage of the link between the input |
499 these deep hierarchies of features, as it is simple to train and | 507 these deep hierarchies of features, as it is simple to train and |
500 explain (see Figure~\ref{fig:da}, as well as | 508 explain (see Figure~\ref{fig:da}, as well as |
501 tutorial and code at {\tt http://deeplearning.net/tutorial}), | 509 tutorial and code at {\tt http://deeplearning.net/tutorial}), |
502 provides efficient inference, and yielded results | 510 provides efficient inference, and yielded results |
503 comparable to or better than RBMs in a series of experiments | 511 comparable to or better than RBMs in a series of experiments |
504 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian | 512 \citep{VincentPLarochelleH2008-very-small}. |
513 Some denoising auto-encoders correspond | |
514 to a Gaussian | |
505 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. | 515 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. |
506 During its unsupervised training, a Denoising | 516 During its unsupervised training, a Denoising |
507 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$ | 517 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$ |
508 of the input $x$ and trained to produce a reconstruction $z$ | 518 of the input $x$ and trained to produce a reconstruction $z$ |
509 of the uncorrupted input $x$. Because the network has to denoise, the | 519 of the uncorrupted input $x$. Because the network has to denoise, the |
510 hidden units $y$ are forced to represent the leading regularities in | 520 hidden units $y$ are forced to represent the leading regularities in |
511 the data. Following~\citep{VincentPLarochelleH2008-very-small} | 521 the data. In a slight departure from \citet{VincentPLarochelleH2008-very-small}, |
512 the hidden units output $y$ is obtained through the sigmoid-affine | 522 the hidden units output $y$ is obtained through the tanh-affine |
513 encoder | 523 encoder |
514 \[ | 524 $y=\tanh(c+V x)$ |
515 y={\rm sigm}(c+V x) | 525 and the reconstruction is obtained through the transposed transformation |
516 \] | 526 $z=\tanh(d+V' y)$. |
517 where ${\rm sigm}(a)=1/(1+\exp(-a))$ | |
518 and the reconstruction is obtained through the same transformation | |
519 \[ | |
520 z={\rm sigm}(d+V' y) | |
521 \] | |
522 using the transpose of encoder weights. | |
523 The training | 527 The training |
524 set average of the cross-entropy | 528 set average of the cross-entropy |
525 reconstruction loss | 529 reconstruction loss (after mapping values in $(-1,1)$ back into $(0,1)$) |
526 \[ | 530 \[ |
527 L_H(x,z)=\sum_i z_i \log x_i + (1-z_i) \log(1-x_i) | 531 L_H(x,z)=-\sum_i \left[ \frac{(x_i+1)}{2} \log \frac{(z_i+1)}{2} + \frac{(1-x_i)}{2} \log\frac{(1-z_i)}{2} \right] |
528 \] | 532 \] |
529 is minimized. | 533 is minimized. |
530 Here we use the random binary masking corruption | 534 Here we use the random binary masking corruption |
531 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and | 535 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and |
532 copies the rest). | 536 copies the rest). |
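Putting these pieces together, a minimal NumPy sketch of the denoising auto-encoder computations (masking corruption, tied tanh encoder/decoder, and the remapped cross-entropy loss); the corruption fraction shown and the use of plain NumPy are assumptions for illustration, and the stochastic gradient updates on the parameters (V, c, d) are not shown.

    import numpy as np

    def dae_forward(x, V, c, d, corruption=0.2, rng=np.random):
        """Sketch only: one forward pass and loss of the denoising auto-encoder."""
        mask = (rng.rand(*x.shape) > corruption).astype(x.dtype)
        x_tilde = mask * x                 # masking corruption: zero a random subset of x
        y = np.tanh(c + V.dot(x_tilde))    # encoder f_theta
        z = np.tanh(d + V.T.dot(y))        # decoder g_theta' (transposed weights)
        p = (x + 1.0) / 2.0                # target mapped back into (0,1)
        q = np.clip((z + 1.0) / 2.0, 1e-7, 1.0 - 1e-7)  # reconstruction, clipped for log
        loss = -np.sum(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))
        return y, z, loss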
550 separate learning rate for the unsupervised pre-training stage (selected | 554 separate learning rate for the unsupervised pre-training stage (selected |
551 from the same above set). The fraction of inputs corrupted was selected | 555 from the same above set). The fraction of inputs corrupted was selected |
552 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 556 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
553 of hidden layers, but it was fixed to 3 for our experiments, | 557 of hidden layers, but it was fixed to 3 for our experiments, |
554 based on previous work with | 558 based on previous work with |
555 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. | 559 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. |
556 We also compared against 1 and against 2 hidden layers, | 560 We also compared against 1 and against 2 hidden layers, |
557 to disentangle the effect of depth from that of unsupervised | 561 to disentangle the effect of depth from that of unsupervised |
558 pre-training. | 562 pre-training. |
559 The size of each hidden | 563 The size of each hidden |
560 layer was kept constant across hidden layers, and the best results | 564 layer was kept constant across hidden layers, and the best results |
765 framework and out-of-distribution examples}? | 769 framework and out-of-distribution examples}? |
766 The key idea is that the lower layers of the predictor compute a hierarchy | 770 The key idea is that the lower layers of the predictor compute a hierarchy |
767 of features that can be shared across tasks or across variants of the | 771 of features that can be shared across tasks or across variants of the |
768 input distribution. A theoretical analysis of generalization improvements | 772 input distribution. A theoretical analysis of generalization improvements |
769 due to sharing of intermediate features across tasks already points | 773 due to sharing of intermediate features across tasks already points |
770 towards that explanation~\cite{baxter95a}. | 774 towards that explanation~\citep{baxter95a}. |
771 Intermediate features that can be used in different | 775 Intermediate features that can be used in different |
772 contexts can be estimated in a way that allows one to share statistical | 776 contexts can be estimated in a way that allows one to share statistical |
773 strength. Features extracted through many levels are more likely to | 777 strength. Features extracted through many levels are more likely to |
774 be more abstract and more invariant to some of the factors of variation | 778 be more abstract and more invariant to some of the factors of variation |
775 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest), | 779 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest), |
792 (with or without out-of-distribution examples) from random initialization, and more labeled examples | 796 (with or without out-of-distribution examples) from random initialization, and more labeled examples |
793 do not allow the shallow or purely supervised models to discover | 797 do not allow the shallow or purely supervised models to discover |
794 the kind of better basins associated | 798 the kind of better basins associated |
795 with deep learning and out-of-distribution examples. | 799 with deep learning and out-of-distribution examples. |
796 | 800 |
797 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 801 A Java demo of the recognizer (where both the MLP and the SDA can be compared) |
798 can be executed on-line at {\tt http://deep.host22.com}. | 802 can be executed on-line at {\tt http://deep.host22.com}. |
799 | 803 |
800 \iffalse | 804 \iffalse |
801 \section*{Appendix I: Detailed Numerical Results} | 805 \section*{Appendix I: Detailed Numerical Results} |
802 | 806 |