# HG changeset patch
# User Yoshua Bengio
# Date 1300654184 14400
# Node ID 507cb92d8e15ad6aacaea1cef287536b66ccf7a7
# Parent 677d1b1d8158f0a8935e62aedd2a7c4abd06f7bd
minor modifications

diff -r 677d1b1d8158 -r 507cb92d8e15 writeup/aistats2011_cameraready.tex
--- a/writeup/aistats2011_cameraready.tex Sat Mar 19 23:11:17 2011 -0400
+++ b/writeup/aistats2011_cameraready.tex Sun Mar 20 16:49:44 2011 -0400
@@ -11,6 +11,7 @@
 %\usepackage{algorithm,algorithmic} % not used after all
 \usepackage{graphicx,subfigure}
 \usepackage{natbib}
+%\usepackage{afterpage}
 \addtolength{\textwidth}{10mm}
 \addtolength{\evensidemargin}{-5mm}
@@ -32,15 +33,15 @@
 \bf Thomas Breuel \and
 Youssouf Chherawala \and
 \bf Moustapha Cisse \and
-Myriam Côté \and \\
-\bf Dumitru Erhan \and
-Jeremy Eustache \and
+Myriam Côté \and
+\bf Dumitru Erhan \\
+\and \bf Jeremy Eustache \and
 \bf Xavier Glorot \and
-Xavier Muller \and \\
-\bf Sylvain Pannetier Lebeuf \and
-Razvan Pascanu \and
+Xavier Muller \and
+\bf Sylvain Pannetier Lebeuf \\
+\and \bf Razvan Pascanu \and
 \bf Salah Rifai \and
-Francois Savard \and \\
+Francois Savard \and
 \bf Guillaume Sicard \\
 \vspace*{1mm}}
@@ -143,7 +144,7 @@
 more such levels, can be exploited to {\bf share statistical strength
 across different but related types of examples}, such as examples coming
 from other tasks than the task of interest
-(the multi-task setting), or examples coming from an overlapping
+(the multi-task setting~\citep{caruana97a}), or examples coming from an overlapping
 but different distribution (images with different kinds of perturbations
 and noises, here). This is consistent with the hypotheses discussed
 in~\citet{Bengio-2009} regarding the potential advantage
@@ -152,15 +153,18 @@
 This hypothesis is related to a learning setting called
 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles
-of semi-supervised and multi-task learning: the learner can exploit examples
+of semi-supervised and multi-task learning: in addition to the labeled
+examples from the target distribution, the learner can exploit examples
 that are unlabeled and possibly come from a distribution different from the target
 distribution, e.g., from other classes than those of interest.
 It has already been shown that deep learners can clearly take advantage of
-unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
+unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}
+in order to improve performance on a supervised task,
 but more needed to be done to explore the impact
 of {\em out-of-distribution} examples and of the {\em multi-task} setting
-(one exception is~\citep{CollobertR2008}, which shares and uses unsupervised
-pre-training only with the first layer). In particular the {\em relative
+(two exceptions are~\citet{CollobertR2008}, which shares and uses unsupervised
+pre-training only with the first layer, and~\citet{icml2009_093} in the case
+of video data). In particular the {\em relative
 advantage of deep learning} for these settings has not been evaluated.
@@ -233,7 +237,7 @@
 There are two main parts in the pipeline. The first one,
 from thickness to pinch, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
-More details can be found in~\citep{ARXIV-2010}.
+More details can be found in~\citet{ARXIV-2010}.
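The two-part structure of the perturbation pipeline just described (deforming transformations from thickness to pinch, then noise from blur to contrast) can be sketched in a few lines of Python. This is only a hedged illustration: the toy module implementations below ({\tt slant}, {\tt contrast}) and the single shared {\tt complexity} argument are assumptions made for the example, not the actual operators, which are specified in~\citet{ARXIV-2010}.

\begin{verbatim}
import numpy as np

def slant(img, complexity, rng):
    # Illustrative deformation: shear each row horizontally (a crude slant).
    h, w = img.shape
    shear = 0.5 * rng.uniform(-complexity, complexity)
    out = np.zeros_like(img)
    for i in range(h):
        out[i] = np.roll(img[i], int(round(shear * (i - h / 2))))
    return out

def contrast(img, complexity, rng):
    # Illustrative noise stage: randomly reduce contrast around mid-grey.
    factor = 1.0 - rng.uniform(0.0, complexity)
    return 0.5 + factor * (img - 0.5)

def apply_pipeline(img, complexity, rng,
                   transformations=(slant,), noises=(contrast,)):
    # First the deforming transformations, then the noise modules.
    for module in list(transformations) + list(noises):
        img = module(img, complexity, rng)
    return np.clip(img, 0.0, 1.0)   # keep grey levels in [0, 1]

rng = np.random.RandomState(0)
x = rng.rand(32, 32)                 # stand-in for a 32x32 character image
y = apply_pipeline(x, complexity=0.3, rng=rng)
\end{verbatim}

In the actual generator each module has its own parameters; the shared {\tt complexity} knob here only mirrors the idea that every perturbation can be made more or less severe.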
 \begin{figure*}[ht]
 \centering
@@ -268,7 +272,8 @@
 Much previous work on deep learning had been performed on
 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
 with 60,000 examples, and variants involving 10,000
-examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}.
+examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}.\footnote{Fortunately, there
+are more and more exceptions, such as~\citet{RainaICML09}, which uses a million examples.}
 The focus here is on much larger training sets, from 10 times to
 1000 times larger, and 62 classes.
@@ -346,7 +351,7 @@
 In order to have a good variety of sources we downloaded a large number of free fonts from:
 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. % TODO: pointless to anonymize, it's not pointing to our work
-Including an operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
+Including an operating system's (Windows 7) fonts, we chose uniformly among a total of $9817$ different fonts.
 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or,
 by producing a corresponding image, directly as input to our models.
 %\vspace*{-1mm}
@@ -368,7 +373,7 @@
 characters were included as an additional source.
 This set is part of a larger corpus being collected by the Image Understanding
 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
-({\tt http://www.iupr.com}), and which will be publicly released.
+({\tt http://www.iupr.com}).%, and which will be publicly released.
 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
 %\end{itemize}
@@ -376,8 +381,7 @@
 \subsection{Data Sets}
 %\vspace*{-2mm}
-All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
-from one of the 62 character classes.
+All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with one of 62 character labels.
 %\begin{itemize}
 %\vspace*{-1mm}
@@ -387,10 +391,11 @@
 %\vspace*{-1mm}
 %\item
-{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
+{\bf P07.} This dataset is obtained by taking raw characters from the above 4 sources
 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
-For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
-$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
+For each generated example, a data source is selected with probability $10\%$ from the fonts,
+$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. The transformations are
+applied in the order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples
 obtained from the corresponding NIST sets plus other sources.
@@ -401,29 +406,13 @@
 except that we only apply
 transformations from slant to pinch (see Fig.\ref{fig:transform}(b-f)).
 Therefore, the character is
- transformed but no additional noise is added to the image, giving images
+ transformed but without added noise, yielding images
 closer to the NIST dataset.
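To make the P07 generation procedure above concrete, here is a minimal sketch of how one example could be drawn. The source probabilities ($10\%$/$25\%$/$25\%$/$40\%$) and the per-transformation complexity range $[0,0.7]$ come directly from the text; {\tt load\_raw\_character} and the list of {\tt transformations} are hypothetical placeholders for the real data readers and for the pipeline modules of section~\ref{s:perturbations}.

\begin{verbatim}
import numpy as np

SOURCES = ["font", "captcha", "ocr", "nist"]
PROBS   = [0.10, 0.25, 0.25, 0.40]   # source mixture stated in the text

def generate_p07_example(load_raw_character, transformations, rng):
    source = rng.choice(SOURCES, p=PROBS)          # pick a data source
    img, label = load_raw_character(source, rng)   # 32x32 grey-level character
    for transform in transformations:              # fixed order, as described
        complexity = rng.uniform(0.0, 0.7)         # fresh complexity per module
        img = transform(img, complexity, rng)
    return img, label
\end{verbatim}

The second dataset described above (slant-to-pinch transformations only, no added noise) would use the same loop with the noise modules left out.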
 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples
 obtained from the corresponding NIST sets plus other sources.
 %\end{itemize}
-\begin{figure*}[ht]
-%\vspace*{-2mm}
-\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
-%\vspace*{-2mm}
-\caption{Illustration of the computations and training criterion for the denoising
-auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
-the layer (i.e. raw input or output of previous layer)
-s corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
-The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
-is compared to the uncorrupted input $x$ through the loss function
-$L_H(x,z)$, whose expected value is approximately minimized during training
-by tuning $\theta$ and $\theta'$.}
-\label{fig:da}
-%\vspace*{-2mm}
-\end{figure*}
-
-%\vspace*{-3mm}
+\vspace*{-3mm}
 \subsection{Models and their Hyper-parameters}
 %\vspace*{-2mm}
@@ -431,7 +420,8 @@
 hidden layer) and deep SDAs.
 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
-{\bf Multi-Layer Perceptrons (MLP).} The MLP output estimated with
+{\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates the
+conditional class probabilities
 \[
 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)),
 \]
@@ -474,6 +464,23 @@
 %the whole training sets.
 %\vspace*{-1mm}
+\begin{figure*}[htb]
+%\vspace*{-2mm}
+\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
+%\vspace*{-2mm}
+\caption{Illustration of the computations and training criterion for the denoising
+auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
+the layer (i.e. raw input or output of previous layer)
+is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
+The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
+is compared to the uncorrupted input $x$ through the loss function
+$L_H(x,z)$, whose expected value is approximately minimized during training
+by tuning $\theta$ and $\theta'$.}
+\label{fig:da}
+%\vspace*{-2mm}
+\end{figure*}
+
+%\afterpage{\clearpage}
 {\bf Stacked Denoising Auto-encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
@@ -481,8 +488,9 @@
 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
 apparently setting parameters in the
 basin of attraction of supervised gradient descent yielding better
-generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
-pre-training phase} uses all of the training images but not the training labels.
+generalization~\citep{Erhan+al-2010}.
+This initial {\em unsupervised
+pre-training phase} does not use the training labels.
 Each layer is trained in turn to produce a new representation of its input
 (starting from the raw pixels).
 It is hypothesized that the
@@ -501,30 +509,26 @@
 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides efficient inference, and yielded results
 comparable or better than RBMs in a series of experiments
-\citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian
+\citep{VincentPLarochelleH2008-very-small}.
+Some denoising auto-encoders correspond
+to a Gaussian
 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}.
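Before moving on to the denoising auto-encoder details, the MLP baseline defined by the softmax equation earlier in this subsection can be written out explicitly. This is a minimal NumPy sketch of the forward pass only; the layer sizes are toy values and the training loop is omitted, so it is an illustration rather than the experiment code.

\begin{verbatim}
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # subtract max for numerical stability
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    # P(class | input = x) = softmax(b2 + W2 tanh(b1 + W1 x))
    h = np.tanh(b1 + W1 @ x)           # hidden layer
    return softmax(b2 + W2 @ h)        # probabilities over the 62 classes

rng = np.random.RandomState(0)
x = rng.rand(32 * 32)                                     # flattened 32x32 image
W1, b1 = 0.01 * rng.randn(500, 32 * 32), np.zeros(500)    # 500 hidden units (toy size)
W2, b2 = 0.01 * rng.randn(62, 500), np.zeros(62)
p = mlp_forward(x, W1, b1, W2, b2)                        # sums to 1
\end{verbatim}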
 During its unsupervised training, a Denoising Auto-encoder is presented with a
 stochastically corrupted version $\tilde{x}$ of the input $x$ and trained to produce
 a reconstruction $z$ of the uncorrupted input $x$. Because the network has
 to denoise, it forces the hidden units $y$ to represent the leading regularities in
-the data. Following~\citep{VincentPLarochelleH2008-very-small}
-the hidden units output $y$ is obtained through the sigmoid-affine
+the data. In a slight departure from \citet{VincentPLarochelleH2008-very-small},
+the hidden units' output $y$ is obtained through the tanh-affine
 encoder
-\[
- y={\rm sigm}(c+V x)
-\]
-where ${\rm sigm}(a)=1/(1+\exp(-a))$
-and the reconstruction is obtained through the same transformation
-\[
- z={\rm sigm}(d+V' y)
-\]
-using the transpose of encoder weights.
+$y=\tanh(c+V x)$
+and the reconstruction is obtained through the transposed transformation
+$z=\tanh(d+V' y)$.
 The training set average of the cross-entropy
-reconstruction loss
+reconstruction loss (after mapping values in $(-1,1)$ back into $(0,1)$)
 \[
- L_H(x,z)=\sum_i z_i \log x_i + (1-z_i) \log(1-x_i)
+ L_H(x,z)=-\sum_i \frac{(z_i+1)}{2} \log \frac{(x_i+1)}{2} + \frac{(1-z_i)}{2} \log\frac{(1-x_i)}{2}
 \]
 is minimized.
 Here we use the random binary masking corruption
@@ -552,7 +556,7 @@
 among \{10\%, 20\%, 50\%\}.
 Another hyper-parameter is the number of hidden layers but it was fixed to 3
 for our experiments, based on previous work with
-SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
+SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. We also compared against 1 and 2 hidden layers, to disentangle the effect of depth from that of unsupervised pre-training.
@@ -767,7 +771,7 @@
 of features that can be shared across tasks or across variants of the
 input distribution. A theoretical analysis of generalization improvements
 due to sharing of intermediate features across tasks already points
-towards that explanation~\cite{baxter95a}.
+towards that explanation~\citep{baxter95a}.
 Intermediate features that can be used in different
 contexts can be estimated in a way that allows sharing of statistical strength.
 Features extracted through many levels are more likely to
@@ -794,7 +798,7 @@
 the kind of better basins associated
 with deep learning and out-of-distribution examples.
-A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
+A Java demo of the recognizer (where both the MLP and the SDA can be compared)
 can be executed on-line at {\tt http://deep.host22.com}.

\iffalse
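To tie together the corruption process, the tanh encoder/decoder pair and the reconstruction loss described above, here is a hedged NumPy sketch of a single denoising auto-encoder forward pass. The masking fraction is one of the values considered above ($20\%$); the loss is written in the usual orientation (uncorrupted input as the target) with values mapped from $(-1,1)$ back into $(0,1)$, so the details should be read as an approximation of the description, not as the authors' implementation. Gradient computation and the parameter update are omitted.

\begin{verbatim}
import numpy as np

def mask_corrupt(x, fraction, rng):
    # Random binary masking corruption: a random subset of inputs is set to 0.
    keep = rng.rand(x.size) >= fraction
    return x * keep

def da_forward(x_tilde, V, c, d):
    # Encoder y = tanh(c + V x~); decoder z = tanh(d + V' y), tied weights.
    y = np.tanh(c + V @ x_tilde)
    z = np.tanh(d + V.T @ y)
    return y, z

def reconstruction_loss(x, z, eps=1e-7):
    # Cross-entropy after mapping (-1,1) values back into (0,1).
    p = np.clip((x + 1.0) / 2.0, eps, 1.0 - eps)   # uncorrupted input as target
    q = np.clip((z + 1.0) / 2.0, eps, 1.0 - eps)   # reconstruction
    return -np.sum(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

rng = np.random.RandomState(0)
x = 2.0 * rng.rand(32 * 32) - 1.0                  # toy input in (-1, 1)
V, c, d = 0.01 * rng.randn(500, 32 * 32), np.zeros(500), np.zeros(32 * 32)
x_tilde = mask_corrupt(x, fraction=0.20, rng=rng)  # 20% masking corruption
y, z = da_forward(x_tilde, V, c, d)
loss = reconstruction_loss(x, z)
\end{verbatim}

Stacking three such layers, each pre-trained on the previous layer's {\tt y}, and then fine-tuning the whole network on the supervised objective corresponds to the SDA setup used in the experiments.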