Mercurial > ift6266
comparison writeup/techreport.tex @ 541:8aad1c6ec39a
reduction espace
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Wed, 02 Jun 2010 10:23:33 -0400 |
parents | 6593e67381a3 |
children | 9ebb335ca904 |
comparison
equal
deleted
inserted
replaced
540:269c39f55134 | 541:8aad1c6ec39a |
---|---|
21 had been evaluated on rather small datasets with a few tens of thousands | 21 had been evaluated on rather small datasets with a few tens of thousands |
22 of examples. Here we propose a powerful generator of variations | 22 of examples. Here we propose a powerful generator of variations |
23 of examples for character images based on a pipeline of stochastic | 23 of examples for character images based on a pipeline of stochastic |
24 transformations that include not only the usual affine transformations | 24 transformations that include not only the usual affine transformations |
25 but also the addition of slant, local elastic deformations, changes | 25 but also the addition of slant, local elastic deformations, changes |
26 in thickness, background images, color, contrast, occlusion, and | 26 in thickness, background images, grey level, contrast, occlusion, and |
27 various types of pixel and spatially correlated noise. | 27 various types of pixel and spatially correlated noise. |
28 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders) | 28 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders) |
29 on the task of learning to classify digits and letters transformed | 29 on the task of learning to classify digits and letters transformed |
30 with this pipeline, using the hundreds of millions of generated examples | 30 with this pipeline, using the hundreds of millions of generated examples |
31 and testing on the full 62-class NIST test set. | 31 and testing on the full 62-class NIST test set. |
99 from slant to pinch, performs transformations of the character. The second | 99 from slant to pinch, performs transformations of the character. The second |
100 part, from blur to contrast, adds noise to the image. | 100 part, from blur to contrast, adds noise to the image. |
101 | 101 |
102 \subsection{Slant} | 102 \subsection{Slant} |
103 | 103 |
104 We mimic slant by shifting each row of the image | |
105 proportionally to its height: $shift = round(slant \times height)$. | |
106 The $slant$ coefficient can be negative or positive with equal probability | |
107 and its value is randomly sampled according to the complexity level: | |
108 $slant \sim U[0,complexity]$, so the | |
109 maximum displacement for the lowest or highest pixel line is of | |
110 $round(complexity \times 32)$. | |
111 | |
112 --- | |
113 | |
104 In order to mimic a slant effect, we simply shift each row of the image | 114 In order to mimic a slant effect, we simply shift each row of the image |
105 proportionnaly to its height: $shift = round(slant \times height)$. We | 115 proportionnaly to its height: $shift = round(slant \times height)$. We |
106 round the shift in order to have a discret displacement. We do not use a | 116 round the shift in order to have a discret displacement. We do not use a |
107 filter to smooth the result in order to save computing time and also | 117 filter to smooth the result in order to save computing time and also |
108 because latter transformations have similar effects. | 118 because latter transformations have similar effects. |
113 maximum displacement for the lowest or highest pixel line is of | 123 maximum displacement for the lowest or highest pixel line is of |
114 $round(complexity \times 32)$. | 124 $round(complexity \times 32)$. |
115 | 125 |
116 | 126 |
117 \subsection{Thickness} | 127 \subsection{Thickness} |
128 | |
129 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82} | |
130 are applied. The neighborhood of each pixel is multiplied | |
131 element-wise with a {\em structuring element} matrix. | |
132 The pixel value is replaced by the maximum or the minimum of the resulting | |
133 matrix, respectively for dilation or erosion. Ten different structural elements with | |
134 increasing dimensions (largest is $5\times5$) were used. For each image, | |
135 randomly sample the operator type (dilation or erosion) with equal probability and one structural | |
136 element from a subset of the $n$ smallest structuring elements where $n$ is | |
137 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$ | |
138 for erosion. A neutral element is always present in the set, and if it is | |
139 chosen no transformation is applied. Erosion allows only the six | |
140 smallest structural elements because when the character is too thin it may | |
141 be completely erased. | |
142 | |
143 --- | |
118 | 144 |
119 To change the thickness of the characters we used morpholigical operators: | 145 To change the thickness of the characters we used morpholigical operators: |
120 dilation and erosion~\cite{Haralick87,Serra82}. | 146 dilation and erosion~\cite{Haralick87,Serra82}. |
121 | 147 |
122 The basic idea of such transform is, for each pixel, to multiply in the | 148 The basic idea of such transform is, for each pixel, to multiply in the |
136 smallest structural elements because when the character is too thin it may | 162 smallest structural elements because when the character is too thin it may |
137 erase it completly. | 163 erase it completly. |
138 | 164 |
139 \subsection{Affine Transformations} | 165 \subsection{Affine Transformations} |
140 | 166 |
167 A $2 \times 3$ affine transform matrix (with | |
168 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. | |
169 Each pixel $(x,y)$ of the output image takes the value of the pixel | |
170 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This | |
171 produces scaling, translation, rotation and shearing. | |
172 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to | |
173 forbid important rotations (not to confuse classes) but to give good | |
174 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times | |
175 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim[-3 \times complexity,3 | |
176 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times | |
177 complexity]$. | |
178 | |
179 ---- | |
180 | |
141 We generate an affine transform matrix according to the complexity level, | 181 We generate an affine transform matrix according to the complexity level, |
142 then we apply it directly to the image. The matrix is of size $2 \times | 182 then we apply it directly to the image. The matrix is of size $2 \times |
143 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$. Formally, | 183 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$. Formally, |
144 for each pixel $(x,y)$ of the output image, we give the value of the pixel | 184 for each pixel $(x,y)$ of the output image, we give the value of the pixel |
145 nearest to : $(ax+by+c,dx+ey+f)$, in the input image. This allows to | 185 nearest to : $(ax+by+c,dx+ey+f)$, in the input image. This allows to |
154 complexity]$. | 194 complexity]$. |
155 | 195 |
156 | 196 |
157 \subsection{Local Elastic Deformations} | 197 \subsection{Local Elastic Deformations} |
158 | 198 |
199 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, | |
200 which provides more details. | |
201 Two ``displacements'' fields are generated and applied, for horizontal | |
202 and vertical displacements of pixels. | |
203 To generate a pixel in either field, first a value between -1 and 1 is | |
204 chosen from a uniform distribution. Then all the pixels, in both fields, are | |
205 multiplied by a constant $\alpha$ which controls the intensity of the | |
206 displacements (larger $\alpha$ translates into larger wiggles). | |
207 Each field is convolved with a Gaussian 2D kernel of | |
208 standard deviation $\sigma$. Visually, this results in a blur. | |
209 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | |
210 \sqrt[3]{complexity}$. | |
211 | |
212 ---- | |
213 | |
159 This filter induces a "wiggly" effect in the image. The description here | 214 This filter induces a "wiggly" effect in the image. The description here |
160 will be brief, as the algorithm follows precisely what is described in | 215 will be brief, as the algorithm follows precisely what is described in |
161 \cite{SimardSP03}. | 216 \cite{SimardSP03}. |
162 | 217 |
163 The general idea is to generate two "displacements" fields, for horizontal | 218 The general idea is to generate two "displacements" fields, for horizontal |
190 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | 245 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times |
191 \sqrt[3]{complexity}$. | 246 \sqrt[3]{complexity}$. |
192 | 247 |
193 | 248 |
194 \subsection{Pinch} | 249 \subsection{Pinch} |
250 | |
251 This is a GIMP filter called ``Whirl and | |
252 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic | |
253 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). | |
254 For a square input image, this is akin to drawing a circle of | |
255 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | |
256 that disk (region inside circle) will have its value recalculated by taking | |
257 the value of another ``source'' pixel in the original image. The position of | |
258 that source pixel is found on the line that goes through $C$ and $P$, but | |
259 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ | |
260 and $C$. $d_2$ is given by $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times | |
261 d_1$, where $pinch$ is a parameter to the filter. | |
262 The actual value is given by bilinear interpolation considering the pixels | |
263 around the (non-integer) source position thus found. | |
264 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. | |
265 | |
266 --- | |
195 | 267 |
196 This is another GIMP filter we used. The filter is in fact named "Whirl and | 268 This is another GIMP filter we used. The filter is in fact named "Whirl and |
197 pinch", but we don't use the "whirl" part (whirl is set to 0). As described | 269 pinch", but we don't use the "whirl" part (whirl is set to 0). As described |
198 in GIMP, a pinch is "similar to projecting the image onto an elastic | 270 in GIMP, a pinch is "similar to projecting the image onto an elastic |
199 surface and pressing or pulling on the center of the surface". | 271 surface and pressing or pulling on the center of the surface". |
220 The value for $pinch$ in our case was given by sampling from an uniform | 292 The value for $pinch$ in our case was given by sampling from an uniform |
221 distribution over the range $[-complexity, 0.7 \times complexity]$. | 293 distribution over the range $[-complexity, 0.7 \times complexity]$. |
222 | 294 |
223 \subsection{Motion Blur} | 295 \subsection{Motion Blur} |
224 | 296 |
297 This is a ``linear motion blur'' in GIMP | |
298 terminology, with two parameters, $length$ and $angle$. The value of | |
299 a pixel in the final image is approximately the mean value of the $length$ first pixels | |
300 found by moving in the $angle$ direction. | |
301 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. | |
302 | |
303 ---- | |
304 | |
225 This is a GIMP filter we applied, a "linear motion blur" in GIMP | 305 This is a GIMP filter we applied, a "linear motion blur" in GIMP |
226 terminology. The description will be brief as it is a well-known filter. | 306 terminology. The description will be brief as it is a well-known filter. |
227 | 307 |
228 This algorithm has two input parameters, $length$ and $angle$. The value of | 308 This algorithm has two input parameters, $length$ and $angle$. The value of |
229 a pixel in the final image is the mean value of the $length$ first pixels | 309 a pixel in the final image is the mean value of the $length$ first pixels |
235 $[0,360]$ degrees. The length, though, depends on the complexity; it's | 315 $[0,360]$ degrees. The length, though, depends on the complexity; it's |
236 sampled from a Gaussian distribution of mean 0 and standard deviation | 316 sampled from a Gaussian distribution of mean 0 and standard deviation |
237 $\sigma = 3 \times complexity$. | 317 $\sigma = 3 \times complexity$. |
238 | 318 |
239 \subsection{Occlusion} | 319 \subsection{Occlusion} |
320 | |
321 Selects a random rectangle from an {\em occluder} character | |
322 images and places it over the original {\em occluded} character | |
323 image. Pixels are combined by taking the max(occluder,occluded), | |
324 closer to black. The rectangle corners | |
325 are sampled so that larger complexity gives larger rectangles. | |
326 The destination position in the occluded image are also sampled | |
327 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}). | |
328 This filter has a probability of 60\% of not being applied. | |
329 | |
330 --- | |
240 | 331 |
241 This filter selects random parts of other (hereafter "occlusive") letter | 332 This filter selects random parts of other (hereafter "occlusive") letter |
242 images and places them over the original letter (hereafter "occluded") | 333 images and places them over the original letter (hereafter "occluded") |
243 image. To be more precise, having selected a subregion of the occlusive | 334 image. To be more precise, having selected a subregion of the occlusive |
244 image and a desination position in the occluded image, to determine the | 335 image and a desination position in the occluded image, to determine the |
279 This filter has a probability of not being applied, at all, of 60\%. | 370 This filter has a probability of not being applied, at all, of 60\%. |
280 | 371 |
281 | 372 |
282 \subsection{Pixel Permutation} | 373 \subsection{Pixel Permutation} |
283 | 374 |
375 This filter permutes neighbouring pixels. It selects first | |
376 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then | |
377 sequentially exchanged with one other pixel in its $V4$ neighbourhood. The number | |
378 of exchanges to the left, right, top, bottom is equal or does not differ | |
379 from more than 1 if the number of selected pixels is not a multiple of 4. | |
380 % TODO: The previous sentence is hard to parse | |
381 This filter has a probability of 80\% of not being applied. | |
382 | |
383 --- | |
384 | |
284 This filter permuts neighbouring pixels. It selects first | 385 This filter permuts neighbouring pixels. It selects first |
285 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then | 386 $\frac{complexity}{3}$ pixels randomly in the image. Each of them are then |
286 sequentially exchanged to one other pixel in its $V4$ neighbourhood. Number | 387 sequentially exchanged to one other pixel in its $V4$ neighbourhood. Number |
287 of exchanges to the left, right, top, bottom are equal or does not differ | 388 of exchanges to the left, right, top, bottom are equal or does not differ |
288 from more than 1 if the number of selected pixels is not a multiple of 4. | 389 from more than 1 if the number of selected pixels is not a multiple of 4. |
291 | 392 |
292 | 393 |
293 \subsection{Gaussian Noise} | 394 \subsection{Gaussian Noise} |
294 | 395 |
295 This filter simply adds, to each pixel of the image independently, a | 396 This filter simply adds, to each pixel of the image independently, a |
397 noise $\sim Normal(0(\frac{complexity}{10})^2)$. | |
398 It has a probability of 70\% of not being applied. | |
399 | |
400 --- | |
401 | |
402 This filter simply adds, to each pixel of the image independently, a | |
296 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$. | 403 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$. |
297 | 404 |
298 It has has a probability of not being applied, at all, of 70\%. | 405 It has has a probability of not being applied, at all, of 70\%. |
299 | 406 |
300 | 407 |
301 \subsection{Background Images} | 408 \subsection{Background Images} |
409 | |
410 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random | |
411 background behind the letter. The background is chosen by first selecting, | |
412 at random, an image from a set of images. Then a 32$\times$32 sub-region | |
413 of that image is chosen as the background image (by sampling position | |
414 uniformly while making sure not to cross image borders). | |
415 To combine the original letter image and the background image, contrast | |
416 adjustments are made. We first get the maximal values (i.e. maximal | |
417 intensity) for both the original image and the background image, $maximage$ | |
418 and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$. | |
419 Each background pixel value is multiplied by $\frac{max(maximage - | |
420 contrast, 0)}{maxbg}$ (higher contrast yield darker | |
421 background). The output image pixels are max(background,original). | |
422 | |
423 --- | |
302 | 424 |
303 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random | 425 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random |
304 background behind the letter. The background is chosen by first selecting, | 426 background behind the letter. The background is chosen by first selecting, |
305 at random, an image from a set of images. Then we choose a 32x32 subregion | 427 at random, an image from a set of images. Then we choose a 32x32 subregion |
306 of that image as the background image (by sampling x and y positions | 428 of that image as the background image (by sampling x and y positions |
320 The final image is found by taking the brightest (i.e. value nearest to 1) | 442 The final image is found by taking the brightest (i.e. value nearest to 1) |
321 pixel from either the background image or the corresponding pixel in the | 443 pixel from either the background image or the corresponding pixel in the |
322 original image. | 444 original image. |
323 | 445 |
324 \subsection{Salt and Pepper Noise} | 446 \subsection{Salt and Pepper Noise} |
447 | |
448 This filter adds noise $\sim U[0,1]$ to random subsets of pixels. | |
449 The number of selected pixels is $0.2 \times complexity$. | |
450 This filter has a probability of not being applied at all of 75\%. | |
451 | |
452 --- | |
325 | 453 |
326 This filter adds noise to the image by randomly selecting a certain number | 454 This filter adds noise to the image by randomly selecting a certain number |
327 of them and, for those selected pixels, assign a random value according to | 455 of them and, for those selected pixels, assign a random value according to |
328 a uniform distribution over the $[0,1]$ ranges. This last distribution does | 456 a uniform distribution over the $[0,1]$ ranges. This last distribution does |
329 not change according to complexity. Instead, the number of selected pixels | 457 not change according to complexity. Instead, the number of selected pixels |
332 lowest extreme, no pixel is changed. | 460 lowest extreme, no pixel is changed. |
333 | 461 |
334 This filter also has a probability of not being applied, at all, of 75\%. | 462 This filter also has a probability of not being applied, at all, of 75\%. |
335 | 463 |
336 \subsection{Spatially Gaussian Noise} | 464 \subsection{Spatially Gaussian Noise} |
465 | |
466 Different regions of the image are spatially smoothed. | |
467 The image is convolved with a symmetric Gaussian kernel of | |
468 size and variance chosen uniformly in the ranges $[12,12 + 20 \times | |
469 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized | |
470 between $0$ and $1$. We also create a symmetric averaging window, of the | |
471 kernel size, with maximum value at the center. For each image we sample | |
472 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be | |
473 averaging centers between the original image and the filtered one. We | |
474 initialize to zero a mask matrix of the image size. For each selected pixel | |
475 we add to the mask the averaging window centered to it. The final image is | |
476 computed from the following element-wise operation: $\frac{image + filtered | |
477 image \times mask}{mask+1}$. | |
478 This filter has a probability of not being applied at all of 75\%. | |
479 | |
480 ---- | |
337 | 481 |
338 The aim of this transformation is to filter, with a gaussian kernel, | 482 The aim of this transformation is to filter, with a gaussian kernel, |
339 different regions of the image. In order to save computing time we decided | 483 different regions of the image. In order to save computing time we decided |
340 to convolve the whole image only once with a symmetric gaussian kernel of | 484 to convolve the whole image only once with a symmetric gaussian kernel of |
341 size and variance choosen uniformly in the ranges: $[12,12 + 20 \times | 485 size and variance choosen uniformly in the ranges: $[12,12 + 20 \times |
352 This filter has a probability of not being applied, at all, of 75\%. | 496 This filter has a probability of not being applied, at all, of 75\%. |
353 | 497 |
354 \subsection{Scratches} | 498 \subsection{Scratches} |
355 | 499 |
356 The scratches module places line-like white patches on the image. The | 500 The scratches module places line-like white patches on the image. The |
501 lines are heavily transformed images of the digit ``1'' (one), chosen | |
502 at random among five thousands such 1 images. The 1 image is | |
503 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times | |
504 complexity)^2$, using bi-cubic interpolation, | |
505 Two passes of a grey-scale morphological erosion filter | |
506 are applied, reducing the width of the line | |
507 by an amount controlled by $complexity$. | |
508 This filter is only applied only 15\% of the time. When it is applied, 50\% | |
509 of the time, only one patch image is generated and applied. In 30\% of | |
510 cases, two patches are generated, and otherwise three patches are | |
511 generated. The patch is applied by taking the maximal value on any given | |
512 patch or the original image, for each of the 32x32 pixel locations. | |
513 | |
514 --- | |
515 | |
516 The scratches module places line-like white patches on the image. The | |
357 lines are in fact heavily transformed images of the digit "1" (one), chosen | 517 lines are in fact heavily transformed images of the digit "1" (one), chosen |
358 at random among five thousands such start images of this digit. | 518 at random among five thousands such start images of this digit. |
359 | 519 |
360 Once the image is selected, the transformation begins by finding the first | 520 Once the image is selected, the transformation begins by finding the first |
361 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is | 521 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is |
385 of the time, only one patch image is generated and applied. In 30\% of | 545 of the time, only one patch image is generated and applied. In 30\% of |
386 cases, two patches are generated, and otherwise three patches are | 546 cases, two patches are generated, and otherwise three patches are |
387 generated. The patch is applied by taking the maximal value on any given | 547 generated. The patch is applied by taking the maximal value on any given |
388 patch or the original image, for each of the 32x32 pixel locations. | 548 patch or the original image, for each of the 32x32 pixel locations. |
389 | 549 |
390 \subsection{Color and Contrast Changes} | 550 \subsection{Grey Level and Contrast Changes} |
391 | 551 |
552 This filter changes the contrast and may invert the image polarity (white | |
553 on black to black on white). The contrast $C$ is defined here as the | |
554 difference between the maximum and the minimum pixel value of the image. | |
555 Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$). | |
556 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The | |
557 polarity is inverted with $0.5$ probability. | |
558 | |
559 --- | |
392 This filter changes the constrast and may invert the image polarity (white | 560 This filter changes the constrast and may invert the image polarity (white |
393 on black to black on white). The contrast $C$ is defined here as the | 561 on black to black on white). The contrast $C$ is defined here as the |
394 difference between the maximum and the minimum pixel value of the image. A | 562 difference between the maximum and the minimum pixel value of the image. A |
395 contrast value is sampled uniformly between $1$ and $1-0.85 \times | 563 contrast value is sampled uniformly between $1$ and $1-0.85 \times |
396 complexity$ (this insure a minimum constrast of $0.15$). We then simply | 564 complexity$ (this insure a minimum constrast of $0.15$). We then simply |
415 \caption{Illustration of each transformation applied to the same image | 583 \caption{Illustration of each transformation applied to the same image |
416 of the upper-case h (upper-left image). first row (from left to rigth) : original image, slant, | 584 of the upper-case h (upper-left image). first row (from left to rigth) : original image, slant, |
417 thickness, affine transformation, local elastic deformation; second row (from left to rigth) : | 585 thickness, affine transformation, local elastic deformation; second row (from left to rigth) : |
418 pinch, motion blur, occlusion, pixel permutation, gaussian noise; third row (from left to rigth) : | 586 pinch, motion blur, occlusion, pixel permutation, gaussian noise; third row (from left to rigth) : |
419 background image, salt and pepper noise, spatially gaussian noise, scratches, | 587 background image, salt and pepper noise, spatially gaussian noise, scratches, |
420 color and contrast changes.} | 588 grey level and contrast changes.} |
421 \label{fig:transfo} | 589 \label{fig:transfo} |
422 \end{figure} | 590 \end{figure} |
423 | 591 |
424 | 592 |
425 \section{Experimental Setup} | 593 \section{Experimental Setup} |