# HG changeset patch
# User Yoshua Bengio <bengioy@iro.umontreal.ca>
# Date 1275422792 14400
# Node ID c778d20ab6f807359979da4de6a54b5738249cdc
# Parent  d41926a68993080c9b4702f4ddcb75e45f94d184
space adjustments

diff -r d41926a68993 -r c778d20ab6f8 writeup/ift6266_ml.bib
--- a/writeup/ift6266_ml.bib	Tue Jun 01 15:52:54 2010 -0400
+++ b/writeup/ift6266_ml.bib	Tue Jun 01 16:06:32 2010 -0400
@@ -25829,3 +25829,11 @@
  pages = {2035--2043},
  year = 2009
 }
+
+@techreport{ift6266-tr-anonymous,
+ author = "Anonymous authors",
+ title = "Generating and Exploiting Perturbed and Multi-Task Handwritten 
+     Training Data for Deep Architectures",
+ institution = "University X.",
+ year = 2010,
+}
\ No newline at end of file
diff -r d41926a68993 -r c778d20ab6f8 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Tue Jun 01 15:52:54 2010 -0400
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 16:06:32 2010 -0400
@@ -7,6 +7,8 @@
 \usepackage{graphicx,subfigure}
 \usepackage[numbers]{natbib}
 
+%\setlength\parindent{0mm}
+
 \title{Deep Self-Taught Learning for Handwritten Character Recognition}
 \author{The IFT6266 Gang}
 
@@ -139,8 +141,8 @@
 from slant to pinch below, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
 
-\begin{figure}[h]
-\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\
+\begin{figure}[ht]
+\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}}
 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE
 \caption{Illustration of each transformation applied alone to the same image
 of an upper-case h (top left). First row (from left to right) : original image, slant,
@@ -163,7 +165,9 @@
 and its value is randomly sampled according to the complexity level:
 $slant \sim U[0,complexity]$, so the
 maximum displacement for the lowest or highest pixel line is of
-$round(complexity \times 32)$.\\
+$round(complexity \times 32)$.
+\vspace*{0mm}
+
 {\bf Thickness.}
 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
 are applied. The neighborhood of each pixel is multiplied
@@ -177,7 +181,9 @@
 for erosion.  A neutral element is always present in the set, and if it is
 chosen no transformation is applied.  Erosion allows only the six
 smallest structural elements because when the character is too thin it may
-be completely erased.\\
+be completely erased.
+\vspace*{0mm}
+
 {\bf Affine Transformations.}
 A $2 \times 3$ affine transform matrix (with
 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
@@ -189,7 +195,9 @@
 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim[-3 \times complexity,3
 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
-complexity]$.\\
+complexity]$.
+\vspace*{0mm}
+
 {\bf Local Elastic Deformations.}
 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
 which provides more details. 
@@ -202,7 +210,9 @@
 Each field is convoluted with a Gaussian 2D kernel of
 standard deviation $\sigma$. Visually, this results in a blur.
 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
-\sqrt[3]{complexity}$.\\
+\sqrt[3]{complexity}$.
+\vspace*{0mm}
+
 {\bf Pinch.}
 This is a GIMP filter called ``Whirl and
 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
@@ -230,7 +240,9 @@
 terminology, with two parameters, $length$ and $angle$. The value of
 a pixel in the final image is approximately the  mean value of the $length$ first pixels
 found by moving in the $angle$ direction. 
-Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.\\
+Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
+\vspace*{0mm}
+
 {\bf Occlusion.}
 Selects a random rectangle from an {\em occluder} character
 images and places it over the original {\em occluded} character
@@ -239,7 +251,9 @@
 are sampled so that larger complexity gives larger rectangles.
 The destination position in the occluded image are also sampled
 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
-This filter has a probability of 60\% of not being applied.\\
+This filter has a probability of 60\% of not being applied.
+\vspace*{0mm}
+
 {\bf Pixel Permutation.}
 This filter permutes neighbouring pixels. It selects first
 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
@@ -247,11 +261,15 @@
 of exchanges to the left, right, top, bottom is equal or does not differ
 from more than 1 if the number of selected pixels is not a multiple of 4.
 % TODO: The previous sentence is hard to parse
-This filter has a probability of 80\% of not being applied.\\
+This filter has a probability of 80\% of not being applied.
+\vspace*{0mm}
+
 {\bf Gaussian Noise.}
 This filter simply adds, to each pixel of the image independently, a
 noise $\sim Normal(0(\frac{complexity}{10})^2)$.
-It has a probability of 70\% of not being applied.\\
+It has a probability of 70\% of not being applied.
+\vspace*{0mm}
+
 {\bf Background Images.}
 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
 background behind the letter. The background is chosen by first selecting,
@@ -264,11 +282,15 @@
 and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
 Each background pixel value is multiplied by $\frac{max(maximage -
   contrast, 0)}{maxbg}$ (higher contrast yield darker
-background). The output image pixels are max(background,original).\\
+background). The output image pixels are max(background,original).
+\vspace*{0mm}
+
 {\bf Salt and Pepper Noise.}
 This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
 The number of selected pixels is $0.2 \times complexity$.
-This filter has a probability of not being applied at all of 75\%.\\
+This filter has a probability of not being applied at all of 75\%.
+\vspace*{0mm}
+
 {\bf Spatially Gaussian Noise.}
 Different regions of the image are spatially smoothed.
 The image is convolved with a symmetric Gaussian kernel of
@@ -282,7 +304,9 @@
 we add to the mask the averaging window centered to it.  The final image is
 computed from the following element-wise operation: $\frac{image + filtered
   image \times mask}{mask+1}$.
-This filter has a probability of not being applied at all of 75\%.\\
+This filter has a probability of not being applied at all of 75\%.
+\vspace*{0mm}
+
 {\bf Scratches.}
 The scratches module places line-like white patches on the image.  The
 lines are heavily transformed images of the digit ``1'' (one), chosen
@@ -296,7 +320,9 @@
 of the time, only one patch image is generated and applied. In 30\% of
 cases, two patches are generated, and otherwise three patches are
 generated. The patch is applied by taking the maximal value on any given
-patch or the original image, for each of the 32x32 pixel locations.\\
+patch or the original image, for each of the 32x32 pixel locations.
+\vspace*{0mm}
+
 {\bf Grey Level and Contrast Changes.}
 This filter changes the contrast and may invert the image polarity (white
 on black to black on white). The contrast $C$ is defined here as the
@@ -306,8 +332,8 @@
 polarity is inverted with $0.5$ probability.
 
 \iffalse
-\begin{figure}[h]
-\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\
+\begin{figure}[ht]
+\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}}
 \caption{Illustration of the pipeline of stochastic 
 transformations applied to the image of a lower-case \emph{t}
 (the upper left image). Each image in the pipeline (going from
@@ -454,10 +480,11 @@
 through preliminary experiments (measuring performance on a validation set),
 and $0.1$ was then selected.
 
-\begin{figure}[h]
-\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}
+\begin{figure}[ht]
+\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
 \caption{Illustration of the computations and training criterion for the denoising
-auto-encoder used to pre-train each layer of the deep architecture. Input $x$
+auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
+the layer (i.e., the raw input or the output of the previous layer)
 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
 is compared to the uncorrupted input $x$ through the loss function
@@ -506,6 +533,22 @@
 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.
 
 \vspace*{-1mm}
+
+\begin{figure}[ht]
+\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
+\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
+on NIST, 1 on NISTP, and 2 on P07. Left: overall results
+of all models, on 3 different test sets corresponding to the three
+datasets.
+Right: error rates on NIST test digits only, along with previous results from the
+literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
+respectively based on ART, nearest neighbors, MLPs, and SVMs.}
+
+\label{fig:error-rates-charts}
+\vspace*{-1mm}
+\end{figure}
+
+
 \section{Experimental Results}
 
 %\vspace*{-1mm}
@@ -525,8 +568,24 @@
 (MLP2, SDA2). The deep learner not only outperformed the shallow ones and
 previously published performance (in a statistically and qualitatively
 significant way) but reaches human performance on both the 62-class task
-and the 10-class (digits) task. In addition, as shown in the left of
-Figure~\ref{fig:fig:improvements-charts}, the relative improvement in error
+and the 10-class (digits) task. 
+
+\begin{figure}[ht]
+\vspace*{-2mm}
+\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
+\caption{Relative improvement in error rate due to self-taught learning. 
+Left: Improvement (or loss, when negative)
+induced by out-of-distribution examples (perturbed data). 
+Right: Improvement (or loss, when negative) induced by multi-task 
+learning (training on all classes and testing only on either digits,
+upper case, or lower-case). The deep learner (SDA) benefits more from
+both self-taught learning scenarios, compared to the shallow MLP.}
+\label{fig:improvements-charts}
+\vspace*{-2mm}
+\end{figure}
+
+In addition, as shown in the left of
+Figure~\ref{fig:improvements-charts}, the relative improvement in error
 rate brought by self-taught learning is greater for the SDA, and these
 differences with the MLP are statistically and qualitatively
 significant. 
@@ -536,7 +595,7 @@
 Relative change is measured by taking
 (original model's error / perturbed-data model's error - 1).
 The right side of
-Figure~\ref{fig:fig:improvements-charts} shows the relative improvement
+Figure~\ref{fig:improvements-charts} shows the relative improvement
 brought by the use of a multi-task setting, in which the same model is
 trained for more classes than the target classes of interest (i.e. training
 with all 62 classes when the target classes are respectively the digits,
@@ -555,19 +614,6 @@
 setting is similar for the other two target classes (lower case characters
 and upper case characters).
 
-\begin{figure}[h]
-\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
-\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
-on NIST, 1 on NISTP, and 2 on P07. Left: overall results
-of all models, on 3 different test sets corresponding to the three
-datasets.
-Right: error rates on NIST test digits only, along with the previous results from 
-literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
-respectively based on ART, nearest neighbors, MLPs, and SVMs.}
-
-\label{fig:error-rates-charts}
-\end{figure}
-
 %\vspace*{-1mm}
 %\subsection{Perturbed Training Data More Helpful for SDA}
 %\vspace*{-1mm}
@@ -602,18 +648,6 @@
 \fi
 
 
-\begin{figure}[h]
-\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
-\caption{Relative improvement in error rate due to self-taught learning. 
-Left: Improvement (or loss, when negative)
-induced by out-of-distribution examples (perturbed data). 
-Right: Improvement (or loss, when negative) induced by multi-task 
-learning (training on all classes and testing only on either digits,
-upper case, or lower-case). The deep learner (SDA) benefits more from
-both self-taught learning scenarios, compared to the shallow MLP.}
-\label{fig:improvements-charts}
-\end{figure}
-
 \vspace*{-1mm}
 \section{Conclusions}
 \vspace*{-1mm}