Mercurial > ift6266

--- a/writeup/nips2010_submission.tex	Wed Jun 02 01:34:49 2010 -0400
+++ b/writeup/nips2010_submission.tex	Wed Jun 02 10:23:33 2010 -0400
@@ -143,6 +143,7 @@
 part, from blur to contrast, adds different kinds of noise.

 \begin{figure}[ht]
+\vspace*{-2mm}
 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}}
 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE
 \caption{Illustration of each transformation applied alone to the same image
@@ -153,21 +154,18 @@
 background image, salt and pepper noise, spatially Gaussian noise, scratches,
 grey level and contrast changes.}
 \label{fig:transfo}
+\vspace*{-2mm}
 \end{figure}

 {\large\bf Transformations}

-\vspace*{2mm}
+\vspace*{0.5mm}

 {\bf Slant.}
-We mimic slant by shifting each row of the image
+Each row of the image is shifted
 proportionally to its height: $shift = round(slant \times height)$.
-The $slant$ coefficient can be negative or positive with equal probability
-and its value is randomly sampled according to the complexity level:
-$slant \sim U[0,complexity]$, so the
-maximum displacement for the lowest or highest pixel line is of
-$round(complexity \times 32)$.
-\vspace*{0mm}
+$slant \sim U[-complexity,complexity]$.
+\vspace*{-1mm}

 {\bf Thickness.}
 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
@@ -177,46 +175,38 @@
 matrix, respectively for dilation or erosion. Ten different structural elements with
 increasing dimensions (largest is $5\times5$) were used.  For each image,
 randomly sample the operator type (dilation or erosion) with equal probability and one structural
-element from a subset of the $n$ smallest structuring elements where $n$ is
-$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
-for erosion.  A neutral element is always present in the set, and if it is
-chosen no transformation is applied.  Erosion allows only the six
-smallest structural elements because when the character is too thin it may
-be completely erased.
-\vspace*{0mm}
+element from a subset of the $n=round(m \times complexity)$ smallest structuring elements
+where $m=10$ for dilation and $m=6$ for erosion (to avoid completely erasing thin characters).
+A neutral element (no transformation)
+is always present in the set. is applied.
+\vspace*{-1mm}

 {\bf Affine Transformations.}
 A $2 \times 3$ affine transform matrix (with
 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
-Each pixel $(x,y)$ of the output image takes the value of the pixel
-nearest to $(ax+by+c,dx+ey+f)$ in the input image.  This
-produces scaling, translation, rotation and shearing.
+Output pixel $(x,y)$ takes the value of input pixel
+nearest to $(ax+by+c,dx+ey+f)$,
+producing scaling, translation, rotation and shearing.
 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
 forbid important rotations (not to confuse classes) but to give good
 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim[-3 \times complexity,3
 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
 complexity]$.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Local Elastic Deformations.}
 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
 which provides more details.
-Two ``displacements'' fields are generated and applied, for horizontal
-and vertical displacements of pixels.
-To generate a pixel in either field, first a value between -1 and 1 is
-chosen from a uniform distribution. Then all the pixels, in both fields, are
-multiplied by a constant $\alpha$ which controls the intensity of the
-displacements (larger $\alpha$ translates into larger wiggles).
-Each field is convoluted with a Gaussian 2D kernel of
-standard deviation $\sigma$. Visually, this results in a blur.
-$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
-\sqrt[3]{complexity}$.
-\vspace*{0mm}
+The intensity of the displacement fields is given by
+$\alpha = \sqrt[3]{complexity} \times 10.0$, which are
+convolved with a Gaussian 2D kernel (resulting in a blur) of
+standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$.
+\vspace*{-1mm}

 {\bf Pinch.}
-This is a GIMP filter called ``Whirl and
-pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
+This is the ``Whirl and pinch'' GIMP filter but with whirl was set to 0.
+A pinch is ``similar to projecting the image onto an elastic
 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
 For a square input image, this is akin to drawing a circle of
 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
@@ -230,11 +220,11 @@
 around the (non-integer) source position thus found.
 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.

-\vspace*{1mm}
+\vspace*{0.5mm}

 {\large\bf Injecting Noise}

-\vspace*{1mm}
+\vspace*{0.5mm}

 {\bf Motion Blur.}
 This is a ``linear motion blur'' in GIMP
@@ -242,7 +232,7 @@
 a pixel in the final image is approximately the  mean value of the $length$ first pixels
 found by moving in the $angle$ direction.
 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Occlusion.}
 Selects a random rectangle from an {\em occluder} character
@@ -253,7 +243,7 @@
 The destination position in the occluded image are also sampled
 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
 This filter has a probability of 60\% of not being applied.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Pixel Permutation.}
 This filter permutes neighbouring pixels. It selects first
@@ -263,13 +253,13 @@
 from more than 1 if the number of selected pixels is not a multiple of 4.
 % TODO: The previous sentence is hard to parse
 This filter has a probability of 80\% of not being applied.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Gaussian Noise.}
 This filter simply adds, to each pixel of the image independently, a
 noise $\sim Normal(0(\frac{complexity}{10})^2)$.
 It has a probability of 70\% of not being applied.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Background Images.}
 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
@@ -284,13 +274,13 @@
 Each background pixel value is multiplied by $\frac{max(maximage -
   contrast, 0)}{maxbg}$ (higher contrast yield darker
 background). The output image pixels are max(background,original).
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Salt and Pepper Noise.}
 This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
 The number of selected pixels is $0.2 \times complexity$.
 This filter has a probability of not being applied at all of 75\%.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Spatially Gaussian Noise.}
 Different regions of the image are spatially smoothed.
@@ -306,7 +296,7 @@
 computed from the following element-wise operation: $\frac{image + filtered
   image \times mask}{mask+1}$.
 This filter has a probability of not being applied at all of 75\%.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Scratches.}
 The scratches module places line-like white patches on the image.  The
@@ -322,7 +312,7 @@
 cases, two patches are generated, and otherwise three patches are
 generated. The patch is applied by taking the maximal value on any given
 patch or the original image, for each of the 32x32 pixel locations.
-\vspace*{0mm}
+\vspace*{-1mm}

 {\bf Grey Level and Contrast Changes.}
 This filter changes the contrast and may invert the image polarity (white
@@ -487,6 +477,7 @@
 and $0.1$ was then selected for optimizing on the whole training sets.

 \begin{figure}[ht]
+\vspace*{-2mm}
 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
 \caption{Illustration of the computations and training criterion for the denoising
 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
@@ -497,6 +488,7 @@
 $L_H(x,z)$, whose expected value is approximately minimized during training
 by tuning $\theta$ and $\theta'$.}
 \label{fig:da}
+\vspace*{-2mm}
 \end{figure}

 {\bf Stacked Denoising Auto-Encoders (SDA).}
@@ -543,6 +535,7 @@
 \vspace*{-1mm}

 \begin{figure}[ht]
+\vspace*{-2mm}
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
@@ -552,7 +545,7 @@
 respectively based on ART, nearest neighbors, MLPs, and SVMs.}

 \label{fig:error-rates-charts}
-\vspace*{-1mm}
+\vspace*{-2mm}
 \end{figure}
--- a/writeup/techreport.tex	Wed Jun 02 01:34:49 2010 -0400
+++ b/writeup/techreport.tex	Wed Jun 02 10:23:33 2010 -0400
@@ -23,7 +23,7 @@
 of examples for character images based on a pipeline of stochastic
 transformations that include not only the usual affine transformations
 but also the addition of slant, local elastic deformations, changes
-in thickness, background images, color, contrast, occlusion, and
+in thickness, background images, grey level, contrast, occlusion, and
 various types of pixel and spatially correlated noise.
 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders)
 on the task of learning to classify digits and letters transformed
@@ -101,6 +101,16 @@

 \subsection{Slant}

+We mimic slant by shifting each row of the image
+proportionally to its height: $shift = round(slant \times height)$.
+The $slant$ coefficient can be negative or positive with equal probability
+and its value is randomly sampled according to the complexity level:
+$slant \sim U[0,complexity]$, so the
+maximum displacement for the lowest or highest pixel line is of
+$round(complexity \times 32)$.
+
+---
+
 In order to mimic a slant effect, we simply shift each row of the image
 proportionnaly to its height: $shift = round(slant \times height)$.  We
 round the shift in order to have a discret displacement. We do not use a
@@ -116,6 +126,22 @@

 \subsection{Thickness}

+Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
+are applied. The neighborhood of each pixel is multiplied
+element-wise with a {\em structuring element} matrix.
+The pixel value is replaced by the maximum or the minimum of the resulting
+matrix, respectively for dilation or erosion. Ten different structural elements with
+increasing dimensions (largest is $5\times5$) were used.  For each image,
+randomly sample the operator type (dilation or erosion) with equal probability and one structural
+element from a subset of the $n$ smallest structuring elements where $n$ is
+$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
+for erosion.  A neutral element is always present in the set, and if it is
+chosen no transformation is applied.  Erosion allows only the six
+smallest structural elements because when the character is too thin it may
+be completely erased.
+
+---
+
 To change the thickness of the characters we used morpholigical operators:
 dilation and erosion~\cite{Haralick87,Serra82}.

@@ -138,6 +164,20 @@

 \subsection{Affine Transformations}

+A $2 \times 3$ affine transform matrix (with
+6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
+Each pixel $(x,y)$ of the output image takes the value of the pixel
+nearest to $(ax+by+c,dx+ey+f)$ in the input image.  This
+produces scaling, translation, rotation and shearing.
+The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
+forbid important rotations (not to confuse classes) but to give good
+variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
+complexity,1+3 \times complexity]$, $b$ and $e$ $\sim[-3 \times complexity,3
+\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
+complexity]$.
+
+----
+
 We generate an affine transform matrix according to the complexity level,
 then we apply it directly to the image.  The matrix is of size $2 \times
 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$.  Formally,
@@ -156,6 +196,21 @@

 \subsection{Local Elastic Deformations}

+This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
+which provides more details.
+Two ``displacements'' fields are generated and applied, for horizontal
+and vertical displacements of pixels.
+To generate a pixel in either field, first a value between -1 and 1 is
+chosen from a uniform distribution. Then all the pixels, in both fields, are
+multiplied by a constant $\alpha$ which controls the intensity of the
+displacements (larger $\alpha$ translates into larger wiggles).
+Each field is convolved with a Gaussian 2D kernel of
+standard deviation $\sigma$. Visually, this results in a blur.
+$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
+\sqrt[3]{complexity}$.
+
+----
+
 This filter induces a "wiggly" effect in the image. The description here
 will be brief, as the algorithm follows precisely what is described in
 \cite{SimardSP03}.
@@ -193,6 +248,23 @@

 \subsection{Pinch}

+This is a GIMP filter called ``Whirl and
+pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
+surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
+For a square input image, this is akin to drawing a circle of
+radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
+that disk (region inside circle) will have its value recalculated by taking
+the value of another ``source'' pixel in the original image. The position of
+that source pixel is found on the line that goes through $C$ and $P$, but
+at some other distance $d_2$. Define $d_1$ to be the distance between $P$
+and $C$. $d_2$ is given by $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
+d_1$, where $pinch$ is a parameter to the filter.
+The actual value is given by bilinear interpolation considering the pixels
+around the (non-integer) source position thus found.
+Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
+
+---
+
 This is another GIMP filter we used. The filter is in fact named "Whirl and
 pinch", but we don't use the "whirl" part (whirl is set to 0). As described
 in GIMP, a pinch is "similar to projecting the image onto an elastic
@@ -222,6 +294,14 @@

 \subsection{Motion Blur}

+This is a ``linear motion blur'' in GIMP
+terminology, with two parameters, $length$ and $angle$. The value of
+a pixel in the final image is approximately the  mean value of the $length$ first pixels
+found by moving in the $angle$ direction.
+Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
+
+----
+
 This is a GIMP filter we applied, a "linear motion blur" in GIMP
 terminology. The description will be brief as it is a well-known filter.

@@ -238,6 +318,17 @@

 \subsection{Occlusion}

+Selects a random rectangle from an {\em occluder} character
+images and places it over the original {\em occluded} character
+image. Pixels are combined by taking the max(occluder,occluded),
+closer to black. The rectangle corners
+are sampled so that larger complexity gives larger rectangles.
+The destination position in the occluded image are also sampled
+according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
+This filter has a probability of 60\% of not being applied.
+
+---
+
 This filter selects random parts of other (hereafter "occlusive") letter
 images and places them over the original letter (hereafter "occluded")
 image. To be more precise, having selected a subregion of the occlusive
@@ -281,6 +372,16 @@

 \subsection{Pixel Permutation}

+This filter permutes neighbouring pixels. It selects first
+$\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
+sequentially exchanged with one other pixel in its $V4$ neighbourhood. The number
+of exchanges to the left, right, top, bottom is equal or does not differ
+from more than 1 if the number of selected pixels is not a multiple of 4.
+% TODO: The previous sentence is hard to parse
+This filter has a probability of 80\% of not being applied.
+
+---
+
 This filter permuts neighbouring pixels. It selects first
 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
 sequentially exchanged to one other pixel in its $V4$ neighbourhood. Number
@@ -293,6 +394,12 @@
 \subsection{Gaussian Noise}

 This filter simply adds, to each pixel of the image independently, a
+noise $\sim Normal(0(\frac{complexity}{10})^2)$.
+It has a probability of 70\% of not being applied.
+
+---
+
+This filter simply adds, to each pixel of the image independently, a
 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$.

 It has has a probability of not being applied, at all, of 70\%.
@@ -300,6 +407,21 @@

 \subsection{Background Images}

+Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
+background behind the letter. The background is chosen by first selecting,
+at random, an image from a set of images. Then a 32$\times$32 sub-region
+of that image is chosen as the background image (by sampling position
+uniformly while making sure not to cross image borders).
+To combine the original letter image and the background image, contrast
+adjustments are made. We first get the maximal values (i.e. maximal
+intensity) for both the original image and the background image, $maximage$
+and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
+Each background pixel value is multiplied by $\frac{max(maximage -
+  contrast, 0)}{maxbg}$ (higher contrast yield darker
+background). The output image pixels are max(background,original).
+
+---
+
 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random
 background behind the letter. The background is chosen by first selecting,
 at random, an image from a set of images. Then we choose a 32x32 subregion
@@ -323,6 +445,12 @@

 \subsection{Salt and Pepper Noise}

+This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
+The number of selected pixels is $0.2 \times complexity$.
+This filter has a probability of not being applied at all of 75\%.
+
+---
+
 This filter adds noise to the image by randomly selecting a certain number
 of them and, for those selected pixels, assign a random value according to
 a uniform distribution over the $[0,1]$ ranges. This last distribution does
@@ -335,6 +463,22 @@

 \subsection{Spatially Gaussian Noise}

+Different regions of the image are spatially smoothed.
+The image is convolved with a symmetric Gaussian kernel of
+size and variance chosen uniformly in the ranges $[12,12 + 20 \times
+complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
+between $0$ and $1$.  We also create a symmetric averaging window, of the
+kernel size, with maximum value at the center.  For each image we sample
+uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
+averaging centers between the original image and the filtered one.  We
+initialize to zero a mask matrix of the image size. For each selected pixel
+we add to the mask the averaging window centered to it.  The final image is
+computed from the following element-wise operation: $\frac{image + filtered
+  image \times mask}{mask+1}$.
+This filter has a probability of not being applied at all of 75\%.
+
+----
+
 The aim of this transformation is to filter, with a gaussian kernel,
 different regions of the image. In order to save computing time we decided
 to convolve the whole image only once with a symmetric gaussian kernel of
@@ -354,6 +498,22 @@
 \subsection{Scratches}

 The scratches module places line-like white patches on the image.  The
+lines are heavily transformed images of the digit ``1'' (one), chosen
+at random among five thousands such 1 images. The 1 image is
+randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
+complexity)^2$, using bi-cubic interpolation,
+Two passes of a grey-scale morphological erosion filter
+are applied, reducing the width of the line
+by an amount controlled by $complexity$.
+This filter is only applied only 15\% of the time. When it is applied, 50\%
+of the time, only one patch image is generated and applied. In 30\% of
+cases, two patches are generated, and otherwise three patches are
+generated. The patch is applied by taking the maximal value on any given
+patch or the original image, for each of the 32x32 pixel locations.
+
+---
+
+The scratches module places line-like white patches on the image.  The
 lines are in fact heavily transformed images of the digit "1" (one), chosen
 at random among five thousands such start images of this digit.

@@ -387,8 +547,16 @@
 generated. The patch is applied by taking the maximal value on any given
 patch or the original image, for each of the 32x32 pixel locations.

-\subsection{Color and Contrast Changes}
+\subsection{Grey Level and Contrast Changes}

+This filter changes the contrast and may invert the image polarity (white
+on black to black on white). The contrast $C$ is defined here as the
+difference between the maximum and the minimum pixel value of the image.
+Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
+The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
+polarity is inverted with $0.5$ probability.
+
+---
 This filter changes the constrast and may invert the image polarity (white
 on black to black on white). The contrast $C$ is defined here as the
 difference between the maximum and the minimum pixel value of the image. A
@@ -417,7 +585,7 @@
 thickness, affine transformation, local elastic deformation; second row (from left to rigth) :
 pinch, motion blur, occlusion, pixel permutation, gaussian noise; third row (from left to rigth) :
 background image, salt and pepper noise, spatially gaussian noise, scratches,
-color and contrast changes.}
+grey level and contrast changes.}
 \label{fig:transfo}
 \end{figure}