comparison writeup/techreport.tex @ 541:8aad1c6ec39a

reduction espace
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 02 Jun 2010 10:23:33 -0400
parents 6593e67381a3
children 9ebb335ca904
comparison
equal deleted inserted replaced
540:269c39f55134 541:8aad1c6ec39a
21 had been evaluated on rather small datasets with a few tens of thousands 21 had been evaluated on rather small datasets with a few tens of thousands
22 of examples. Here we propose a powerful generator of variations 22 of examples. Here we propose a powerful generator of variations
23 of examples for character images based on a pipeline of stochastic 23 of examples for character images based on a pipeline of stochastic
24 transformations that include not only the usual affine transformations 24 transformations that include not only the usual affine transformations
25 but also the addition of slant, local elastic deformations, changes 25 but also the addition of slant, local elastic deformations, changes
26 in thickness, background images, color, contrast, occlusion, and 26 in thickness, background images, grey level, contrast, occlusion, and
27 various types of pixel and spatially correlated noise. 27 various types of pixel and spatially correlated noise.
28 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders) 28 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders)
29 on the task of learning to classify digits and letters transformed 29 on the task of learning to classify digits and letters transformed
30 with this pipeline, using the hundreds of millions of generated examples 30 with this pipeline, using the hundreds of millions of generated examples
31 and testing on the full 62-class NIST test set. 31 and testing on the full 62-class NIST test set.
99 from slant to pinch, performs transformations of the character. The second 99 from slant to pinch, performs transformations of the character. The second
100 part, from blur to contrast, adds noise to the image. 100 part, from blur to contrast, adds noise to the image.
101 101
102 \subsection{Slant} 102 \subsection{Slant}
103 103
104 We mimic slant by shifting each row of the image
105 proportionally to its height: $shift = round(slant \times height)$.
106 The $slant$ coefficient can be negative or positive with equal probability
107 and its value is randomly sampled according to the complexity level:
108 $slant \sim U[0,complexity]$, so the
109 maximum displacement for the lowest or highest pixel line is of
110 $round(complexity \times 32)$.
111
112 ---
113
104 In order to mimic a slant effect, we simply shift each row of the image 114 In order to mimic a slant effect, we simply shift each row of the image
105 proportionnaly to its height: $shift = round(slant \times height)$. We 115 proportionnaly to its height: $shift = round(slant \times height)$. We
106 round the shift in order to have a discret displacement. We do not use a 116 round the shift in order to have a discret displacement. We do not use a
107 filter to smooth the result in order to save computing time and also 117 filter to smooth the result in order to save computing time and also
108 because latter transformations have similar effects. 118 because latter transformations have similar effects.
113 maximum displacement for the lowest or highest pixel line is of 123 maximum displacement for the lowest or highest pixel line is of
114 $round(complexity \times 32)$. 124 $round(complexity \times 32)$.
115 125
116 126
117 \subsection{Thickness} 127 \subsection{Thickness}
128
129 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
130 are applied. The neighborhood of each pixel is multiplied
131 element-wise with a {\em structuring element} matrix.
132 The pixel value is replaced by the maximum or the minimum of the resulting
133 matrix, respectively for dilation or erosion. Ten different structural elements with
134 increasing dimensions (largest is $5\times5$) were used. For each image,
135 randomly sample the operator type (dilation or erosion) with equal probability and one structural
136 element from a subset of the $n$ smallest structuring elements where $n$ is
137 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
138 for erosion. A neutral element is always present in the set, and if it is
139 chosen no transformation is applied. Erosion allows only the six
140 smallest structural elements because when the character is too thin it may
141 be completely erased.
142
143 ---
118 144
119 To change the thickness of the characters we used morpholigical operators: 145 To change the thickness of the characters we used morpholigical operators:
120 dilation and erosion~\cite{Haralick87,Serra82}. 146 dilation and erosion~\cite{Haralick87,Serra82}.
121 147
122 The basic idea of such transform is, for each pixel, to multiply in the 148 The basic idea of such transform is, for each pixel, to multiply in the
136 smallest structural elements because when the character is too thin it may 162 smallest structural elements because when the character is too thin it may
137 erase it completly. 163 erase it completly.
138 164
139 \subsection{Affine Transformations} 165 \subsection{Affine Transformations}
140 166
167 A $2 \times 3$ affine transform matrix (with
168 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
169 Each pixel $(x,y)$ of the output image takes the value of the pixel
170 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This
171 produces scaling, translation, rotation and shearing.
172 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
173 forbid important rotations (not to confuse classes) but to give good
174 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
175 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim[-3 \times complexity,3
176 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
177 complexity]$.
178
179 ----
180
141 We generate an affine transform matrix according to the complexity level, 181 We generate an affine transform matrix according to the complexity level,
142 then we apply it directly to the image. The matrix is of size $2 \times 182 then we apply it directly to the image. The matrix is of size $2 \times
143 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$. Formally, 183 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$. Formally,
144 for each pixel $(x,y)$ of the output image, we give the value of the pixel 184 for each pixel $(x,y)$ of the output image, we give the value of the pixel
145 nearest to : $(ax+by+c,dx+ey+f)$, in the input image. This allows to 185 nearest to : $(ax+by+c,dx+ey+f)$, in the input image. This allows to
154 complexity]$. 194 complexity]$.
155 195
156 196
157 \subsection{Local Elastic Deformations} 197 \subsection{Local Elastic Deformations}
158 198
199 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
200 which provides more details.
201 Two ``displacements'' fields are generated and applied, for horizontal
202 and vertical displacements of pixels.
203 To generate a pixel in either field, first a value between -1 and 1 is
204 chosen from a uniform distribution. Then all the pixels, in both fields, are
205 multiplied by a constant $\alpha$ which controls the intensity of the
206 displacements (larger $\alpha$ translates into larger wiggles).
207 Each field is convolved with a Gaussian 2D kernel of
208 standard deviation $\sigma$. Visually, this results in a blur.
209 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
210 \sqrt[3]{complexity}$.
211
212 ----
213
159 This filter induces a "wiggly" effect in the image. The description here 214 This filter induces a "wiggly" effect in the image. The description here
160 will be brief, as the algorithm follows precisely what is described in 215 will be brief, as the algorithm follows precisely what is described in
161 \cite{SimardSP03}. 216 \cite{SimardSP03}.
162 217
163 The general idea is to generate two "displacements" fields, for horizontal 218 The general idea is to generate two "displacements" fields, for horizontal
190 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times 245 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
191 \sqrt[3]{complexity}$. 246 \sqrt[3]{complexity}$.
192 247
193 248
194 \subsection{Pinch} 249 \subsection{Pinch}
250
251 This is a GIMP filter called ``Whirl and
252 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
253 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
254 For a square input image, this is akin to drawing a circle of
255 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
256 that disk (region inside circle) will have its value recalculated by taking
257 the value of another ``source'' pixel in the original image. The position of
258 that source pixel is found on the line that goes through $C$ and $P$, but
259 at some other distance $d_2$. Define $d_1$ to be the distance between $P$
260 and $C$. $d_2$ is given by $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
261 d_1$, where $pinch$ is a parameter to the filter.
262 The actual value is given by bilinear interpolation considering the pixels
263 around the (non-integer) source position thus found.
264 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
265
266 ---
195 267
196 This is another GIMP filter we used. The filter is in fact named "Whirl and 268 This is another GIMP filter we used. The filter is in fact named "Whirl and
197 pinch", but we don't use the "whirl" part (whirl is set to 0). As described 269 pinch", but we don't use the "whirl" part (whirl is set to 0). As described
198 in GIMP, a pinch is "similar to projecting the image onto an elastic 270 in GIMP, a pinch is "similar to projecting the image onto an elastic
199 surface and pressing or pulling on the center of the surface". 271 surface and pressing or pulling on the center of the surface".
220 The value for $pinch$ in our case was given by sampling from an uniform 292 The value for $pinch$ in our case was given by sampling from an uniform
221 distribution over the range $[-complexity, 0.7 \times complexity]$. 293 distribution over the range $[-complexity, 0.7 \times complexity]$.
222 294
223 \subsection{Motion Blur} 295 \subsection{Motion Blur}
224 296
297 This is a ``linear motion blur'' in GIMP
298 terminology, with two parameters, $length$ and $angle$. The value of
299 a pixel in the final image is approximately the mean value of the $length$ first pixels
300 found by moving in the $angle$ direction.
301 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
302
303 ----
304
225 This is a GIMP filter we applied, a "linear motion blur" in GIMP 305 This is a GIMP filter we applied, a "linear motion blur" in GIMP
226 terminology. The description will be brief as it is a well-known filter. 306 terminology. The description will be brief as it is a well-known filter.
227 307
228 This algorithm has two input parameters, $length$ and $angle$. The value of 308 This algorithm has two input parameters, $length$ and $angle$. The value of
229 a pixel in the final image is the mean value of the $length$ first pixels 309 a pixel in the final image is the mean value of the $length$ first pixels
235 $[0,360]$ degrees. The length, though, depends on the complexity; it's 315 $[0,360]$ degrees. The length, though, depends on the complexity; it's
236 sampled from a Gaussian distribution of mean 0 and standard deviation 316 sampled from a Gaussian distribution of mean 0 and standard deviation
237 $\sigma = 3 \times complexity$. 317 $\sigma = 3 \times complexity$.
238 318
239 \subsection{Occlusion} 319 \subsection{Occlusion}
320
321 Selects a random rectangle from an {\em occluder} character
322 images and places it over the original {\em occluded} character
323 image. Pixels are combined by taking the max(occluder,occluded),
324 closer to black. The rectangle corners
325 are sampled so that larger complexity gives larger rectangles.
326 The destination position in the occluded image are also sampled
327 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
328 This filter has a probability of 60\% of not being applied.
329
330 ---
240 331
241 This filter selects random parts of other (hereafter "occlusive") letter 332 This filter selects random parts of other (hereafter "occlusive") letter
242 images and places them over the original letter (hereafter "occluded") 333 images and places them over the original letter (hereafter "occluded")
243 image. To be more precise, having selected a subregion of the occlusive 334 image. To be more precise, having selected a subregion of the occlusive
244 image and a desination position in the occluded image, to determine the 335 image and a desination position in the occluded image, to determine the
279 This filter has a probability of not being applied, at all, of 60\%. 370 This filter has a probability of not being applied, at all, of 60\%.
280 371
281 372
282 \subsection{Pixel Permutation} 373 \subsection{Pixel Permutation}
283 374
375 This filter permutes neighbouring pixels. It selects first
376 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
377 sequentially exchanged with one other pixel in its $V4$ neighbourhood. The number
378 of exchanges to the left, right, top, bottom is equal or does not differ
379 from more than 1 if the number of selected pixels is not a multiple of 4.
380 % TODO: The previous sentence is hard to parse
381 This filter has a probability of 80\% of not being applied.
382
383 ---
384
284 This filter permuts neighbouring pixels. It selects first 385 This filter permuts neighbouring pixels. It selects first
285 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then 386 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then
286 sequentially exchanged to one other pixel in its $V4$ neighbourhood. Number 387 sequentially exchanged to one other pixel in its $V4$ neighbourhood. Number
287 of exchanges to the left, right, top, bottom are equal or does not differ 388 of exchanges to the left, right, top, bottom are equal or does not differ
288 from more than 1 if the number of selected pixels is not a multiple of 4. 389 from more than 1 if the number of selected pixels is not a multiple of 4.
291 392
292 393
293 \subsection{Gaussian Noise} 394 \subsection{Gaussian Noise}
294 395
295 This filter simply adds, to each pixel of the image independently, a 396 This filter simply adds, to each pixel of the image independently, a
397 noise $\sim Normal(0(\frac{complexity}{10})^2)$.
398 It has a probability of 70\% of not being applied.
399
400 ---
401
402 This filter simply adds, to each pixel of the image independently, a
296 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$. 403 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$.
297 404
298 It has has a probability of not being applied, at all, of 70\%. 405 It has has a probability of not being applied, at all, of 70\%.
299 406
300 407
301 \subsection{Background Images} 408 \subsection{Background Images}
409
410 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
411 background behind the letter. The background is chosen by first selecting,
412 at random, an image from a set of images. Then a 32$\times$32 sub-region
413 of that image is chosen as the background image (by sampling position
414 uniformly while making sure not to cross image borders).
415 To combine the original letter image and the background image, contrast
416 adjustments are made. We first get the maximal values (i.e. maximal
417 intensity) for both the original image and the background image, $maximage$
418 and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
419 Each background pixel value is multiplied by $\frac{max(maximage -
420 contrast, 0)}{maxbg}$ (higher contrast yield darker
421 background). The output image pixels are max(background,original).
422
423 ---
302 424
303 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random 425 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random
304 background behind the letter. The background is chosen by first selecting, 426 background behind the letter. The background is chosen by first selecting,
305 at random, an image from a set of images. Then we choose a 32x32 subregion 427 at random, an image from a set of images. Then we choose a 32x32 subregion
306 of that image as the background image (by sampling x and y positions 428 of that image as the background image (by sampling x and y positions
320 The final image is found by taking the brightest (i.e. value nearest to 1) 442 The final image is found by taking the brightest (i.e. value nearest to 1)
321 pixel from either the background image or the corresponding pixel in the 443 pixel from either the background image or the corresponding pixel in the
322 original image. 444 original image.
323 445
324 \subsection{Salt and Pepper Noise} 446 \subsection{Salt and Pepper Noise}
447
448 This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
449 The number of selected pixels is $0.2 \times complexity$.
450 This filter has a probability of not being applied at all of 75\%.
451
452 ---
325 453
326 This filter adds noise to the image by randomly selecting a certain number 454 This filter adds noise to the image by randomly selecting a certain number
327 of them and, for those selected pixels, assign a random value according to 455 of them and, for those selected pixels, assign a random value according to
328 a uniform distribution over the $[0,1]$ ranges. This last distribution does 456 a uniform distribution over the $[0,1]$ ranges. This last distribution does
329 not change according to complexity. Instead, the number of selected pixels 457 not change according to complexity. Instead, the number of selected pixels
332 lowest extreme, no pixel is changed. 460 lowest extreme, no pixel is changed.
333 461
334 This filter also has a probability of not being applied, at all, of 75\%. 462 This filter also has a probability of not being applied, at all, of 75\%.
335 463
336 \subsection{Spatially Gaussian Noise} 464 \subsection{Spatially Gaussian Noise}
465
466 Different regions of the image are spatially smoothed.
467 The image is convolved with a symmetric Gaussian kernel of
468 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
469 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
470 between $0$ and $1$. We also create a symmetric averaging window, of the
471 kernel size, with maximum value at the center. For each image we sample
472 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
473 averaging centers between the original image and the filtered one. We
474 initialize to zero a mask matrix of the image size. For each selected pixel
475 we add to the mask the averaging window centered to it. The final image is
476 computed from the following element-wise operation: $\frac{image + filtered
477 image \times mask}{mask+1}$.
478 This filter has a probability of not being applied at all of 75\%.
479
480 ----
337 481
338 The aim of this transformation is to filter, with a gaussian kernel, 482 The aim of this transformation is to filter, with a gaussian kernel,
339 different regions of the image. In order to save computing time we decided 483 different regions of the image. In order to save computing time we decided
340 to convolve the whole image only once with a symmetric gaussian kernel of 484 to convolve the whole image only once with a symmetric gaussian kernel of
341 size and variance choosen uniformly in the ranges: $[12,12 + 20 \times 485 size and variance choosen uniformly in the ranges: $[12,12 + 20 \times
352 This filter has a probability of not being applied, at all, of 75\%. 496 This filter has a probability of not being applied, at all, of 75\%.
353 497
354 \subsection{Scratches} 498 \subsection{Scratches}
355 499
356 The scratches module places line-like white patches on the image. The 500 The scratches module places line-like white patches on the image. The
501 lines are heavily transformed images of the digit ``1'' (one), chosen
502 at random among five thousands such 1 images. The 1 image is
503 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
504 complexity)^2$, using bi-cubic interpolation,
505 Two passes of a grey-scale morphological erosion filter
506 are applied, reducing the width of the line
507 by an amount controlled by $complexity$.
508 This filter is only applied only 15\% of the time. When it is applied, 50\%
509 of the time, only one patch image is generated and applied. In 30\% of
510 cases, two patches are generated, and otherwise three patches are
511 generated. The patch is applied by taking the maximal value on any given
512 patch or the original image, for each of the 32x32 pixel locations.
513
514 ---
515
516 The scratches module places line-like white patches on the image. The
357 lines are in fact heavily transformed images of the digit "1" (one), chosen 517 lines are in fact heavily transformed images of the digit "1" (one), chosen
358 at random among five thousands such start images of this digit. 518 at random among five thousands such start images of this digit.
359 519
360 Once the image is selected, the transformation begins by finding the first 520 Once the image is selected, the transformation begins by finding the first
361 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is 521 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is
385 of the time, only one patch image is generated and applied. In 30\% of 545 of the time, only one patch image is generated and applied. In 30\% of
386 cases, two patches are generated, and otherwise three patches are 546 cases, two patches are generated, and otherwise three patches are
387 generated. The patch is applied by taking the maximal value on any given 547 generated. The patch is applied by taking the maximal value on any given
388 patch or the original image, for each of the 32x32 pixel locations. 548 patch or the original image, for each of the 32x32 pixel locations.
389 549
390 \subsection{Color and Contrast Changes} 550 \subsection{Grey Level and Contrast Changes}
391 551
552 This filter changes the contrast and may invert the image polarity (white
553 on black to black on white). The contrast $C$ is defined here as the
554 difference between the maximum and the minimum pixel value of the image.
555 Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
556 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
557 polarity is inverted with $0.5$ probability.
558
559 ---
392 This filter changes the constrast and may invert the image polarity (white 560 This filter changes the constrast and may invert the image polarity (white
393 on black to black on white). The contrast $C$ is defined here as the 561 on black to black on white). The contrast $C$ is defined here as the
394 difference between the maximum and the minimum pixel value of the image. A 562 difference between the maximum and the minimum pixel value of the image. A
395 contrast value is sampled uniformly between $1$ and $1-0.85 \times 563 contrast value is sampled uniformly between $1$ and $1-0.85 \times
396 complexity$ (this insure a minimum constrast of $0.15$). We then simply 564 complexity$ (this insure a minimum constrast of $0.15$). We then simply
415 \caption{Illustration of each transformation applied to the same image 583 \caption{Illustration of each transformation applied to the same image
416 of the upper-case h (upper-left image). first row (from left to rigth) : original image, slant, 584 of the upper-case h (upper-left image). first row (from left to rigth) : original image, slant,
417 thickness, affine transformation, local elastic deformation; second row (from left to rigth) : 585 thickness, affine transformation, local elastic deformation; second row (from left to rigth) :
418 pinch, motion blur, occlusion, pixel permutation, gaussian noise; third row (from left to rigth) : 586 pinch, motion blur, occlusion, pixel permutation, gaussian noise; third row (from left to rigth) :
419 background image, salt and pepper noise, spatially gaussian noise, scratches, 587 background image, salt and pepper noise, spatially gaussian noise, scratches,
420 color and contrast changes.} 588 grey level and contrast changes.}
421 \label{fig:transfo} 589 \label{fig:transfo}
422 \end{figure} 590 \end{figure}
423 591
424 592
425 \section{Experimental Setup} 593 \section{Experimental Setup}