# HG changeset patch
# User Olivier Delalleau <delallea@iro>
# Date 1275571082 14400
# Node ID cf5a7ee2d89222c0f044f7681e93ff0b0dfd8fab
# Parent  143a1467f157e1235f2c4f42203279b40862bf3c
# Parent  17d16700e0c8c19bb7a199de93d8fe8dbca8fdf9
Merged

diff -r 143a1467f157 -r cf5a7ee2d892 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Thu Jun 03 09:16:53 2010 -0400
+++ b/writeup/nips2010_submission.tex	Thu Jun 03 09:18:02 2010 -0400
@@ -107,7 +107,8 @@
 
 Our experimental results provide positive evidence towards all of these questions.
 To achieve these results, we introduce in the next section a sophisticated system
-for stochastically transforming character images. The conclusion discusses
+for stochastically transforming character images and then explain the methodology. 
+The conclusion discusses
 the more general question of why deep learners may benefit so much from 
 the self-taught learning framework.
 
@@ -165,7 +166,7 @@
 %\end{minipage}%
 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
 \end{wrapfigure}
-Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
+To change character {\bf thickness}, morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
 are applied. The neighborhood of each pixel is multiplied
 element-wise with a {\em structuring element} matrix.
 The pixel value is replaced by the maximum or the minimum of the resulting
@@ -188,7 +189,7 @@
 \hspace{0.3cm}\begin{minipage}[b]{0.83\linewidth}
 %\centering
 %\vspace*{-15mm}
-Each row of the image is shifted
+To produce {\bf slant}, each row of the image is shifted
 proportionally to its height: $shift = round(slant \times height)$.  
 $slant \sim U[-complexity,complexity]$.
 \vspace{1.5cm}
@@ -201,12 +202,12 @@
 \vspace*{-6mm}
 \begin{center}
 \includegraphics[scale=.4]{images/Affine_only.png}\\
-{\bf Affine}
+{\bf Affine Transformation}
 \end{center}
 \end{wrapfigure}
 %\end{minipage}%
 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-A $2 \times 3$ affine transform matrix (with
+A $2 \times 3$ {\bf affine transform} matrix (with
 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$.
 Output pixel $(x,y)$ takes the value of input pixel
 nearest to $(ax+by+c,dx+ey+f)$,
@@ -234,8 +235,8 @@
 %\end{minipage}%
 %\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth}
 %\vspace*{-20mm}
-This local elastic deformation 
-filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
+The {\bf local elastic} deformation 
+module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
 which provides more details. 
 The intensity of the displacement fields is given by 
 $\alpha = \sqrt[3]{complexity} \times 10.0$, which are 
@@ -258,7 +259,7 @@
 %\vspace{.6cm}
 %\end{minipage}%
 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-This is the ``Whirl and pinch'' GIMP filter with whirl was set to 0. 
+The {\bf pinch} module applies the ``Whirl and pinch'' GIMP filter with whirl set to 0. 
 A pinch is ``similar to projecting the image onto an elastic
 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
 For a square input image, draw a radius-$r$ disk
@@ -267,7 +268,7 @@
 the value of a ``source'' pixel in the original image,
 on the line that goes through $C$ and $P$, but
 at some other distance $d_2$. Define $d_1=distance(P,C) = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
-d_1$, where $pinch$ is a parameter to the filter.
+d_1$, where $pinch$ is a parameter of the filter.
 The actual value is given by bilinear interpolation considering the pixels
 around the (non-integer) source position thus found.
 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
@@ -289,8 +290,8 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
 %\vspace*{.5mm}
-This is GIMP's ``linear motion blur'' 
-with parameters $length$ and $angle$. The value of
+The {\bf motion blur} module is GIMP's ``linear motion blur'', which
+has parameters $length$ and $angle$. The value of
 a pixel in the final image is approximately the  mean of the first $length$ pixels
 found by moving in the $angle$ direction,
 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
@@ -307,14 +308,14 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
 \vspace*{-18mm}
-Selects a random rectangle from an {\em occluder} character
+The {\bf occlusion} module selects a random rectangle from an {\em occluder} character
 image and places it over the original {\em occluded}
 image. Pixels are combined by taking the max(occluder,occluded),
 closer to black. The rectangle corners
 are sampled so that larger complexity gives larger rectangles.
 The destination position in the occluded image are also sampled
 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
-This filter is skipped with probability 60\%.
+This module is skipped with probability 60\%.
 %\vspace{7mm}
 \end{minipage}
 
@@ -332,7 +333,8 @@
 %\vspace{.5cm}
 %\end{minipage}%
 %\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
-Different regions of the image are spatially smoothed by convolving
+With the {\bf Gaussian smoothing} module, 
+different regions of the image are spatially smoothed by convolving
 the image with a symmetric Gaussian kernel of
 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
@@ -344,7 +346,7 @@
 we add to the mask the averaging window centered to it.  The final image is
 computed from the following element-wise operation: $\frac{image + filtered
   image \times mask}{mask+1}$.
-This filter is skipped with probability 75\%.
+This module is skipped with probability 75\%.
 %\end{minipage}
 
 \newpage
@@ -364,14 +366,14 @@
 %\end{minipage}%
 %\hspace{-0cm}\begin{minipage}[t]{0.86\linewidth}
 %\vspace*{-20mm}
-This filter permutes neighbouring pixels. It first selects
+This module {\bf permutes neighbouring pixels}. It first selects
 fraction $\frac{complexity}{3}$ of pixels randomly in the image. Each of them are then
 sequentially exchanged with one other in as $V4$ neighbourhood. 
-This filter is skipped with probability 80\%.\\
+This module is skipped with probability 80\%.\\
 \vspace*{1mm}
 \end{minipage}
 
-\vspace{-1mm}
+\vspace{-3mm}
 
 \begin{minipage}[t]{\linewidth}
 \begin{wrapfigure}[7]{l}{0.15\textwidth}
@@ -387,13 +389,13 @@
 %\end{minipage}%
 %\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
 \vspace*{12mm}
-This filter simply adds, to each pixel of the image independently, a
+The {\bf Gaussian noise} module simply adds, to each pixel of the image independently, a
 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
-This filter is skipped with probability 70\%.
+This module is skipped with probability 70\%.
 %\vspace{1.1cm}
 \end{minipage}
 
-\vspace*{1.5cm}
+\vspace*{1.2cm}
 
 \begin{minipage}[t]{\linewidth}
 \begin{minipage}[t]{0.14\linewidth}
@@ -403,7 +405,7 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
 \vspace*{-18mm}
-Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
+Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random
 background image behind the letter, from a randomly chosen natural image,
 with contrast adjustments depending on $complexity$, to preserve
 more or less of the original character image.
@@ -419,9 +421,9 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
 \vspace*{-18mm}
-This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
+The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels.
 The number of selected pixels is $0.2 \times complexity$.
-This filter is skipped with probability 75\%.
+This module is skipped with probability 75\%.
 %\vspace{.9cm}
 \end{minipage}
 %\vspace{-.7cm}
@@ -441,7 +443,7 @@
 \end{wrapfigure}
 %\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
 %\vspace{.4cm}
-The scratches module places line-like white patches on the image.  The
+The {\bf scratches} module places line-like white patches on the image.  The
 lines are heavily transformed images of the digit ``1'' (one), chosen
 at random among 500 such 1 images,
 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
@@ -449,20 +451,20 @@
 Two passes of a grey-scale morphological erosion filter
 are applied, reducing the width of the line
 by an amount controlled by $complexity$.
-This filter is skipped with probability 85\%. The probabilities
+This module is skipped with probability 85\%. The probabilities
 of applying 1, 2, or 3 patches are (50\%,30\%,20\%).
 \end{minipage}
 
 \vspace*{2mm}
 
-\begin{minipage}[t]{0.20\linewidth}
+\begin{minipage}[t]{0.25\linewidth}
 \centering
-\hspace*{-7mm}\includegraphics[scale=.4]{images/Contrast_only.png}\\
-{\bf Grey \& Contrast}
+\hspace*{-16mm}\includegraphics[scale=.4]{images/Contrast_only.png}\\
+{\bf Grey Level \& Contrast}
 \end{minipage}%
-\hspace{-4mm}\begin{minipage}[t]{0.82\linewidth}
-\vspace*{-18mm}
-This filter changes the contrast by changing grey levels, and may invert the image polarity (white
+\hspace{-12mm}\begin{minipage}[t]{0.82\linewidth}
+\vspace*{-18mm}
+The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white
 to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$ 
 so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
 polarity is inverted with probability 50\%.
@@ -710,6 +712,21 @@
 \end{figure}
 
 
+\begin{figure}[ht]
+\vspace*{-3mm}
+\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
+\vspace*{-3mm}
+\caption{Relative improvement in error rate due to self-taught learning. 
+Left: Improvement (or loss, when negative)
+induced by out-of-distribution examples (perturbed data). 
+Right: Improvement (or loss, when negative) induced by multi-task 
+learning (training on all classes and testing only on either digits,
+upper case, or lower-case). The deep learner (SDA) benefits more from
+both self-taught learning scenarios, compared to the shallow MLP.}
+\label{fig:improvements-charts}
+\vspace*{-2mm}
+\end{figure}
+
 \section{Experimental Results}
 \vspace*{-2mm}
 
@@ -739,21 +756,6 @@
 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
 ``c'' and a ``C'' are often indistinguishible).
 
-\begin{figure}[ht]
-\vspace*{-3mm}
-\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
-\vspace*{-3mm}
-\caption{Relative improvement in error rate due to self-taught learning. 
-Left: Improvement (or loss, when negative)
-induced by out-of-distribution examples (perturbed data). 
-Right: Improvement (or loss, when negative) induced by multi-task 
-learning (training on all classes and testing only on either digits,
-upper case, or lower-case). The deep learner (SDA) benefits more from
-both self-taught learning scenarios, compared to the shallow MLP.}
-\label{fig:improvements-charts}
-\vspace*{-2mm}
-\end{figure}
-
 In addition, as shown in the left of
 Figure~\ref{fig:improvements-charts}, the relative improvement in error
 rate brought by self-taught learning is greater for the SDA, and these