ift6266: writeup/techreport.tex comparison

comparison writeup/techreport.tex @ 541:8aad1c6ec39a

space reduction

author	Yoshua Bengio <bengioy@iro.umontreal.ca>
date	Wed, 02 Jun 2010 10:23:33 -0400
parents	6593e67381a3
children	9ebb335ca904


-:269c39f55134
+:8aad1c6ec39a
 had been evaluated on rather small datasets with a few tens of thousands
 of examples. Here we propose a powerful generator of variations
 of examples for character images based on a pipeline of stochastic
 transformations that include not only the usual affine transformations
 but also the addition of slant, local elastic deformations, changes
-in thickness, background images, color, contrast, occlusion, and
+in thickness, background images, grey level, contrast, occlusion, and
 various types of pixel and spatially correlated noise.
 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders)
 on the task of learning to classify digits and letters transformed
 with this pipeline, using the hundreds of millions of generated examples
 and testing on the full 62-class NIST test set.
 from slant to pinch, performs transformations of the character. The second
 part, from blur to contrast, adds noise to the image.
 \subsection{Slant}
+We mimic slant by shifting each row of the image
+proportionally to its height: $shift = round(slant \times height)$.
+The $slant$ coefficient can be negative or positive with equal probability
+and its value is randomly sampled according to the complexity level:
+$slant \sim U[0,complexity]$, so the
+maximum displacement for the lowest or highest pixel line is
+$round(complexity \times 32)$.
+---
 In order to mimic a slant effect, we simply shift each row of the image
 proportionally to its height: $shift = round(slant \times height)$.  We
 round the shift in order to have a discrete displacement. We do not use a
 filter to smooth the result in order to save computing time and also
 because later transformations have similar effects.
 maximum displacement for the lowest or highest pixel line is of
 $round(complexity \times 32)$.
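As a rough illustration of the row-shifting rule above, the following Python sketch applies $shift = round(slant \times height)$ to an image stored as a list of rows. The function name slant_image, the choice of measuring height as the row index from the top, and the zero fill for vacated pixels are assumptions made for this example, not details taken from the report.

```python
import random


def slant_image(image, complexity, rng=random):
    """Shift each row horizontally by round(slant * row_index).

    Assumptions (not from the report): row 0 is the reference row,
    pixels shifted outside the image are dropped, and vacated
    pixels are filled with 0.0.
    """
    width = len(image[0])
    # slant ~ U[0, complexity], with a random sign
    slant = rng.uniform(0.0, complexity) * rng.choice((-1, 1))
    out = []
    for y, row in enumerate(image):
        shift = round(slant * y)
        new_row = [0.0] * width
        for x, value in enumerate(row):
            nx = x + shift
            if 0 <= nx < width:
                new_row[nx] = value
        out.append(new_row)
    return out
```

With complexity set to 0 the sampled slant is 0 and the image is returned unchanged, which is a convenient sanity check.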
 \subsection{Thickness}
+Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
+are applied. The neighborhood of each pixel is multiplied
+element-wise with a {\em structuring element} matrix.
+The pixel value is replaced by the maximum or the minimum of the resulting
+matrix, respectively for dilation or erosion. Ten different structuring elements with
+increasing dimensions (largest is $5\times5$) were used.  For each image, we
+randomly sample the operator type (dilation or erosion) with equal probability and one structuring
+element from a subset of the $n$ smallest structuring elements where $n$ is
+$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
+for erosion.  A neutral element is always present in the set, and if it is
+chosen no transformation is applied.  Erosion allows only the six
+smallest structuring elements because when the character is too thin it may
+be completely erased.
+---
 To change the thickness of the characters we used morphological operators:
 dilation and erosion~\cite{Haralick87,Serra82}.
 The basic idea of such transform is, for each pixel, to multiply in the
 smallest structural elements because when the character is too thin it may
 erase it completely.
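The dilation and erosion operators described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name morph is hypothetical, the structuring element is treated as a binary mask (zero entries are ignored rather than multiplied in), and neighbours falling outside the image are skipped. The ten structuring elements used in the report are not reproduced here.

```python
def morph(image, selem, op):
    """Grey-scale dilation (op=max) or erosion (op=min).

    `selem` is a small binary matrix; only positions where it is
    non-zero contribute to the max/min (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    sh, sw = len(selem), len(selem[0])
    cy, cx = sh // 2, sw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            values = []
            for j in range(sh):
                for i in range(sw):
                    if not selem[j][i]:
                        continue
                    yy, xx = y + j - cy, x + i - cx
                    if 0 <= yy < h and 0 <= xx < w:
                        values.append(image[yy][xx])
            # Fall back to the original value if no neighbour was selected
            out[y][x] = op(values) if values else image[y][x]
    return out
```

Calling morph(image, selem, max) thickens bright strokes, while morph(image, selem, min) thins them.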
 \subsection{Affine Transformations}
+A $2 \times 3$ affine transform matrix (with
+6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
+Each pixel $(x,y)$ of the output image takes the value of the pixel
+nearest to $(ax+by+c,dx+ey+f)$ in the input image.  This
+produces scaling, translation, rotation and shearing.
+The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
+forbid large rotations (so as not to confuse classes) while still giving good
+variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
+complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
+\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
+complexity]$.
+----
 We generate an affine transform matrix according to the complexity level,
 then we apply it directly to the image.  The matrix is of size $2 \times
 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$.  Formally,
 for each pixel $(x,y)$ of the output image, we give the value of the pixel
 nearest to : $(ax+by+c,dx+ey+f)$, in the input image.  This allows to
 complexity]$.
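The nearest-pixel lookup described above can be illustrated with a short Python sketch. The function name apply_affine and the fill value for source positions outside the image are assumptions of this example; the parameters $(a,b,c,d,e,f)$ are passed in directly rather than sampled, so the sketch does not commit to any particular sampling scheme.

```python
def apply_affine(image, params, fill=0.0):
    """Nearest-neighbour affine resampling.

    Output pixel (x, y) takes the value of the input pixel nearest to
    (a*x + b*y + c, d*x + e*y + f). Source positions outside the image
    get `fill` (an assumption of this sketch).
    """
    a, b, c, d, e, f = params
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = round(a * x + b * y + c)
            sy = round(d * x + e * y + f)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out
```

For instance, params = (1, 0, 0, 0, 1, 0) leaves the image unchanged, and params = (1, 0, 1, 0, 1, 0) reads each pixel from one column to its right.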
 \subsection{Local Elastic Deformations}
+This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
+which provides more details.
+Two ``displacement'' fields are generated and applied, for horizontal
+and vertical displacements of pixels.
+To generate a pixel in either field, first a value between -1 and 1 is
+chosen from a uniform distribution. Then all the pixels, in both fields, are
+multiplied by a constant $\alpha$ which controls the intensity of the
+displacements (larger $\alpha$ translates into larger wiggles).
+Each field is convolved with a Gaussian 2D kernel of
+standard deviation $\sigma$. Visually, this results in a blur.
+$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
+\sqrt[3]{complexity}$.
+----
 This filter induces a "wiggly" effect in the image. The description here
 will be brief, as the algorithm follows precisely what is described in
 \cite{SimardSP03}.
 The general idea is to generate two "displacements" fields, for horizontal
 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
 \sqrt[3]{complexity}$.
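To make the dependence on $complexity$ concrete, the sketch below computes $\alpha$ and $\sigma$ from the two formulas above and draws one raw (unsmoothed) displacement field. The function names are hypothetical, and the Gaussian smoothing step is deliberately left out of this short example.

```python
import random


def elastic_parameters(complexity):
    """Return (alpha, sigma) from the formulas in the text."""
    root = complexity ** (1.0 / 3.0)
    alpha = 10.0 * root          # displacement intensity
    sigma = 10.0 - 7.0 * root    # std. dev. of the smoothing kernel
    return alpha, sigma


def raw_displacement_field(height, width, alpha, rng=random):
    """One displacement field before Gaussian smoothing:
    each entry is alpha * U[-1, 1]."""
    return [[alpha * rng.uniform(-1.0, 1.0) for _ in range(width)]
            for _ in range(height)]
```

At complexity 1 this gives $\alpha = 10$ and $\sigma = 3$ (large, weakly smoothed wiggles); at complexity 0 it gives $\alpha = 0$, so no displacement at all.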
 \subsection{Pinch}
+This is a GIMP filter called ``Whirl and
+pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
+surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
+For a square input image, this is akin to drawing a circle of
+radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
+that disk (region inside circle) will have its value recalculated by taking
+the value of another ``source'' pixel in the original image. The position of
+that source pixel is found on the line that goes through $C$ and $P$, but
+at some other distance $d_2$. Define $d_1$ to be the distance between $P$
+and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
+d_1$, where $pinch$ is a parameter of the filter.
+The actual value is given by bilinear interpolation considering the pixels
+around the (non-integer) source position thus found.
+Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
+---
 This is another GIMP filter we used. The filter is in fact named "Whirl and
 pinch", but we don't use the "whirl" part (whirl is set to 0). As described
 in GIMP, a pinch is "similar to projecting the image onto an elastic
 surface and pressing or pulling on the center of the surface".
 The value for $pinch$ in our case was given by sampling from a uniform
 distribution over the range $[-complexity, 0.7 \times complexity]$.
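The source-pixel rule $d_2 = \sin(\frac{\pi d_1}{2r})^{-pinch} \times d_1$ can be checked numerically with the sketch below. The function names are hypothetical, this is not GIMP's implementation, and leaving points at the centre or outside the disk untouched is an assumption of this example. Bilinear interpolation of the resulting non-integer position is omitted.

```python
import math


def pinch_source_distance(d1, r, pinch):
    """Distance d2 from the centre at which the source pixel is read."""
    if d1 <= 0.0 or d1 >= r:
        return d1  # centre and points outside the disk are left as-is
    return math.sin(math.pi * d1 / (2.0 * r)) ** (-pinch) * d1


def pinch_source_position(p, c, r, pinch):
    """Source position on the line through centre c and point p."""
    dx, dy = p[0] - c[0], p[1] - c[1]
    d1 = math.hypot(dx, dy)
    if d1 == 0.0 or d1 >= r:
        return (float(p[0]), float(p[1]))
    scale = pinch_source_distance(d1, r, pinch) / d1
    return (c[0] + dx * scale, c[1] + dy * scale)
```

A positive $pinch$ gives $d_2 > d_1$ (pixels are fetched from farther out), a negative $pinch$ gives $d_2 < d_1$, and $pinch = 0$ is the identity.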
 \subsection{Motion Blur}
+This is a ``linear motion blur'' in GIMP
+terminology, with two parameters, $length$ and $angle$. The value of
+a pixel in the final image is approximately the mean value of the first $length$ pixels
+found by moving in the $angle$ direction.
+Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
+----
 This is a GIMP filter we applied, a "linear motion blur" in GIMP
 terminology. The description will be brief as it is a well-known filter.
 This algorithm has two input parameters, $length$ and $angle$. The value of
 a pixel in the final image is the mean value of the $length$ first pixels
 $[0,360]$ degrees. The length, though, depends on the complexity; it's
 sampled from a Gaussian distribution of mean 0 and standard deviation
 $\sigma = 3 \times complexity$.
 \subsection{Occlusion}
+This filter selects a random rectangle from an {\em occluder} character
+image and places it over the original {\em occluded} character
+image. Pixels are combined by taking max(occluder,occluded),
+which keeps the brighter of the two values. The rectangle corners
+are sampled so that larger complexity gives larger rectangles.
+The destination position in the occluded image is also sampled
+according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
+This filter has a probability of 60\% of not being applied.
+---
 This filter selects random parts of other (hereafter "occlusive") letter
 images and places them over the original letter (hereafter "occluded")
 image. To be more precise, having selected a subregion of the occlusive
 image and a destination position in the occluded image, to determine the
 This filter has a probability of not being applied, at all, of 60\%.
 \subsection{Pixel Permutation}
+This filter permutes neighbouring pixels. It selects first
+$\frac{complexity}{3}$ pixels randomly in the image. Each of them is then
+sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers
+of exchanges to the left, right, top and bottom are equal, or differ
+by at most 1 if the number of selected pixels is not a multiple of 4.
+% TODO: The previous sentence is hard to parse
+This filter has a probability of 80\% of not being applied.
+---
 This filter permutes neighbouring pixels. It selects first
 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
 sequentially exchanged to one other pixel in its $V4$ neighbourhood. Number
 of exchanges to the left, right, top, bottom are equal or does not differ
 from more than 1 if the number of selected pixels is not a multiple of 4.
 \subsection{Gaussian Noise}
 This filter simply adds, to each pixel of the image independently, a
+noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
+It has a probability of 70\% of not being applied.
+---
+This filter simply adds, to each pixel of the image independently, a
 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$.
 It has a probability of not being applied, at all, of 70\%.
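A minimal Python sketch of this filter, assuming the image is a list of rows of floats: the function name gaussian_noise and the p_skip argument are introduced for illustration only, and the sketch does not clip the result to $[0,1]$ because the text does not say whether clipping is applied.

```python
import random


def gaussian_noise(image, complexity, rng=random, p_skip=0.7):
    """Add independent Normal(0, (complexity/10)^2) noise to each pixel.

    With probability `p_skip` the filter is not applied and a copy of
    the input is returned.
    """
    if rng.random() < p_skip:
        return [row[:] for row in image]
    sigma = complexity / 10.0
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in image]
```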
 \subsection{Background Images}
+Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
+background behind the letter. The background is chosen by first selecting,
+at random, an image from a set of images. Then a 32$\times$32 sub-region
+of that image is chosen as the background image (by sampling position
+uniformly while making sure not to cross image borders).
+To combine the original letter image and the background image, contrast
+adjustments are made. We first get the maximal values (i.e. maximal
+intensity) for both the original image and the background image, $maximage$
+and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
+Each background pixel value is multiplied by $\frac{max(maximage -
+contrast, 0)}{maxbg}$ (higher contrast yields a darker
+background). The output image pixels are max(background,original).
+---
 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random
 background behind the letter. The background is chosen by first selecting,
 at random, an image from a set of images. Then we choose a 32x32 subregion
 of that image as the background image (by sampling x and y positions
 The final image is found by taking the brightest (i.e. value nearest to 1)
 pixel from either the background image or the corresponding pixel in the
 original image.
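The contrast adjustment and pixel-wise maximum described above can be written compactly. In this sketch the function name add_background is hypothetical, the background is assumed to be already cropped to the size of the character image, and an all-black background (where $maxbg = 0$) is handled by using a scale of 0, which the text does not specify.

```python
def add_background(image, background, contrast):
    """Scale the background by max(maximage - contrast, 0) / maxbg,
    then keep the brighter of background and original at each pixel."""
    max_image = max(max(row) for row in image)
    max_bg = max(max(row) for row in background)
    # Guard against division by zero (assumption of this sketch)
    scale = max(max_image - contrast, 0.0) / max_bg if max_bg > 0 else 0.0
    return [[max(b * scale, v) for v, b in zip(irow, brow)]
            for irow, brow in zip(image, background)]
```

With $contrast = 1$ and pixel values in $[0,1]$ the scale is 0, so the background vanishes and the character image is returned unchanged.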
 \subsection{Salt and Pepper Noise}
+This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
+The number of selected pixels is $0.2 \times complexity$.
+This filter has a probability of not being applied at all of 75\%.
+---
 This filter adds noise to the image by randomly selecting a certain number
 of them and, for those selected pixels, assign a random value according to
 a uniform distribution over the $[0,1]$ ranges. This last distribution does
 not change according to complexity. Instead, the number of selected pixels
 lowest extreme, no pixel is changed.
 This filter also has a probability of not being applied, at all, of 75\%.
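The following Python sketch illustrates one possible reading of this filter. It assumes that $0.2 \times complexity$ is the fraction of pixels that are selected, and that each selected pixel is overwritten with a value drawn from $U[0,1]$; both points are interpretations, and the function name salt_and_pepper and the p_skip argument are hypothetical.

```python
import random


def salt_and_pepper(image, complexity, rng=random, p_skip=0.75):
    """Overwrite a random subset of pixels with U[0, 1] values.

    Assumption: the subset contains round(0.2 * complexity * n_pixels)
    distinct pixels. With probability `p_skip` nothing is changed.
    """
    out = [row[:] for row in image]
    if rng.random() < p_skip:
        return out
    h, w = len(image), len(image[0])
    n_selected = round(0.2 * complexity * h * w)
    for idx in rng.sample(range(h * w), n_selected):
        out[idx // w][idx % w] = rng.uniform(0.0, 1.0)
    return out
```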
 \subsection{Spatially Gaussian Noise}
+Different regions of the image are spatially smoothed.
+The image is convolved with a symmetric Gaussian kernel of
+size and variance chosen uniformly in the ranges $[12,12 + 20 \times
+complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
+between $0$ and $1$.  We also create a symmetric averaging window, of the
+kernel size, with maximum value at the center.  For each image we sample
+uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
+averaging centers between the original image and the filtered one.  We
+initialize to zero a mask matrix of the image size. For each selected pixel
+we add to the mask the averaging window centered on it.  The final image is
+computed from the following element-wise operation: $\frac{image + filtered
+image \times mask}{mask+1}$.
+This filter has a probability of not being applied at all of 75\%.
+----
 The aim of this transformation is to filter, with a gaussian kernel,
 different regions of the image. In order to save computing time we decided
 to convolve the whole image only once with a symmetric gaussian kernel of
 size and variance chosen uniformly in the ranges: $[12,12 + 20 \times
 This filter has a probability of not being applied, at all, of 75\%.
 \subsection{Scratches}
 The scratches module places line-like white patches on the image.  The
+lines are heavily transformed images of the digit ``1'' (one), chosen
+at random among five thousand such images. The selected image is
+randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
+complexity)^2)$, using bi-cubic interpolation.
+Two passes of a grey-scale morphological erosion filter
+are applied, reducing the width of the line
+by an amount controlled by $complexity$.
+This filter is applied only 15\% of the time. When it is applied, 50\%
+of the time, only one patch image is generated and applied. In 30\% of
+cases, two patches are generated, and otherwise three patches are
+generated. The patch is applied by taking the maximal value on any given
+patch or the original image, for each of the 32$\times$32 pixel locations.
+---
+The scratches module places line-like white patches on the image.  The
 lines are in fact heavily transformed images of the digit "1" (one), chosen
 at random among five thousands such start images of this digit.
 Once the image is selected, the transformation begins by finding the first
 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is
 of the time, only one patch image is generated and applied. In 30\% of
 cases, two patches are generated, and otherwise three patches are
 generated. The patch is applied by taking the maximal value on any given
 patch or the original image, for each of the 32x32 pixel locations.
-\subsection{Color and Contrast Changes}
+\subsection{Grey Level and Contrast Changes}
+This filter changes the contrast and may invert the image polarity (white
+on black to black on white). The contrast $C$ is defined here as the
+difference between the maximum and the minimum pixel value of the image.
+Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
+The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
+polarity is inverted with $0.5$ probability.
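The normalization into $[\frac{1-C}{2},1-\frac{1-C}{2}]$ followed by a random polarity flip can be sketched as follows. The function name change_contrast is hypothetical, and mapping a constant image to mid-grey is an assumption of this example, since the text does not cover that case.

```python
import random


def change_contrast(image, complexity, rng=random):
    """Rescale pixel values to a sampled contrast C, then invert the
    polarity with probability 0.5."""
    contrast = rng.uniform(1.0 - 0.85 * complexity, 1.0)
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    span = hi - lo
    low = (1.0 - contrast) / 2.0  # lower end of the target range

    def rescale(v):
        if span == 0:
            return 0.5  # constant image: mid-grey (assumption)
        return low + contrast * (v - lo) / span

    out = [[rescale(v) for v in row] for row in image]
    if rng.random() < 0.5:
        out = [[1.0 - v for v in row] for row in out]
    return out
```

Because the target range is symmetric around 0.5, the minimum and maximum of the output always sum to 1, whether or not the polarity is inverted.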
+---
 This filter changes the contrast and may invert the image polarity (white
 on black to black on white). The contrast $C$ is defined here as the
 difference between the maximum and the minimum pixel value of the image. A
 contrast value is sampled uniformly between $1$ and $1-0.85 \times
 complexity$ (this ensures a minimum contrast of $0.15$). We then simply
 \caption{Illustration of each transformation applied to the same image
 of the upper-case h (upper-left image). First row (from left to right): original image, slant,
 thickness, affine transformation, local elastic deformation; second row (from left to right):
 pinch, motion blur, occlusion, pixel permutation, Gaussian noise; third row (from left to right):
 background image, salt and pepper noise, spatially Gaussian noise, scratches,
-color and contrast changes.}
+grey level and contrast changes.}
 \label{fig:transfo}
 \end{figure}
 \section{Experimental Setup}
