writeup/techreport.tex @ 462:f59af1648d83

clean up the techreport
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Fri, 28 May 2010 08:49:36 -0600
\usepackage{times}
\usepackage{mlapa}
\usepackage{subfigure}

\begin{document}
\title{Generating and Exploiting Perturbed and Multi-Task Handwritten Training Data for Deep Architectures}
\author{The IFT6266 Gang}
\date{April 2010, Technical Report, Dept. IRO, U. Montreal}

\maketitle

in thickness, background images, color, contrast, occlusion, and
various types of pixel and spatially correlated noise.
We evaluate a deep learning algorithm (Stacked Denoising Autoencoders)
on the task of learning to classify digits and letters transformed
with this pipeline, using the hundreds of millions of generated examples
and testing on the full 62-class NIST test set.
We find that the SDA outperforms its
shallow counterpart, an ordinary Multi-Layer Perceptron,
and that it is better able to take advantage of the additional
generated data, as well as better able to take advantage of
the multi-task setting, i.e.,
training from more classes than those of interest in the end.
In fact, we find that the SDA reaches human performance as
estimated by the Amazon Mechanical Turk on the 62-class NIST test characters.
\end{abstract}

\section{Introduction}

Deep Learning has emerged as a promising new area of research in
from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
feature representation can lead to higher-level (more abstract, more
general) features that are more robust to unanticipated sources of
variance extant in real data.

Whereas a deep architecture can in principle be more powerful than a
shallow one in terms of representation, depth appears to render the
training problem more difficult in terms of optimization and local minima.
It is also only recently that successful algorithms were proposed to
overcome some of these difficulties. All are based on unsupervised
learning, often in a greedy layer-wise ``unsupervised pre-training''
stage~\cite{Bengio-2009}. One of these layer initialization techniques,
applied here, is the Denoising
Auto-Encoder~(DEA)~\cite{VincentPLarochelleH2008-very-small}, which
performed similarly to, or better than, previously proposed Restricted
Boltzmann Machines in terms of unsupervised extraction of a hierarchy of
features useful for classification. The principle is that each layer,
starting from the bottom, is trained to encode its input (the output of the
previous layer) and to reconstruct it from a corrupted version of it. After
this unsupervised initialization, the stack of denoising auto-encoders can
be converted into a deep supervised feedforward neural network and trained
by stochastic gradient descent.

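To make the principle concrete, here is a minimal NumPy sketch of greedy
pre-training for a single denoising auto-encoder layer (tied weights,
masking corruption). It only illustrates the idea; it is not the
implementation used in our experiments, and all names are ours.

\begin{verbatim}
import numpy as np

def pretrain_dae_layer(X, n_hidden, corruption=0.25, lr=0.1, epochs=10,
                       rng=np.random):
    # X: (n_examples, n_visible) matrix with values in [0, 1].
    n_visible = X.shape[1]
    W = rng.uniform(-0.1, 0.1, (n_visible, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.rand(n_visible) > corruption)  # corrupt
            h = sigmoid(x_tilde @ W + b_h)                    # encode
            z = sigmoid(h @ W.T + b_v)                        # reconstruct
            dz = z - x                 # cross-entropy gradient (pre-sigmoid)
            dh = (dz @ W) * h * (1 - h)
            W -= lr * (np.outer(x_tilde, dh) + np.outer(dz, h))
            b_h -= lr * dh
            b_v -= lr * dz
    return W, b_h   # used to initialize one layer of the deep network
\end{verbatim}
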
\section{Perturbation and Transformation of Character Images}

This section describes the different transformations we used to generate
data, in the order in which they are applied. The code for these
transformations (mostly Python) is available at
{\tt http://anonymous.url.net}. All the modules in the pipeline share a
global control parameter ($0 \le complexity \le 1$) that allows one to
modulate the amount of deformation or noise introduced.

We can differentiate two important parts in the pipeline. The first one,
from slant to pinch, performs transformations of the character. The second
part, from blur to contrast, adds noise to the image.

\subsection{Slant}

In order to mimic a slant effect, we simply shift each row of the image
proportionally to its height: $shift = round(slant \times height)$. We
round the shift in order to have a discrete displacement. We do not use a
filter to smooth the result, both to save computing time and because later
transformations have similar effects.

The $slant$ coefficient can be negative or positive with equal probability,
and its magnitude is sampled according to the complexity level: we draw a
number uniformly in the range $[0, complexity]$, so the maximum
displacement for the lowest or highest pixel line is
$round(complexity \times 32)$.

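As a rough sketch of this module (assuming $32\times32$ images stored as
NumPy arrays in $[0,1]$; names and details are ours, not the pipeline's):

\begin{verbatim}
import numpy as np

def slant(image, complexity, rng=np.random):
    # Shift each row proportionally to its index (the row height).
    slant_coef = rng.uniform(0, complexity) * rng.choice([-1, 1])
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        shift = int(round(slant_coef * y))
        if shift >= 0:
            out[y, shift:] = image[y, :image.shape[1] - shift]
        else:
            out[y, :shift] = image[y, -shift:]
    return out
\end{verbatim}
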
\subsection{Thickness}

To change the thickness of the characters we used morphological operators:
dilation and erosion~\cite{Haralick87,Serra82}.

The basic idea of such transforms is, for each pixel, to multiply
element-wise its neighbourhood with a matrix called the structuring
element. For dilation we then replace the pixel value by the maximum of
the result, and by the minimum for erosion. This dilates or erodes objects
in the image, and the strength of the transform depends only on the
structuring element.

We used ten different structuring elements of increasing dimensions (the
biggest is $5\times5$). For each image, we randomly sample the operator
type (dilation or erosion) with equal probability, and one structuring
element from a subset of the $n$ smallest structuring elements, where $n$
is $round(10 \times complexity)$ for dilation and $round(6 \times
complexity)$ for erosion. A neutral element is always present in the set;
if it is chosen, the transformation is not applied. Erosion is restricted
to the six smallest structuring elements because, when the character is too
thin, erosion may erase it completely.

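A simplified sketch of this module is given below; the exact structuring
elements and the handling of the neutral element differ in the actual
pipeline.

\begin{verbatim}
import numpy as np
from scipy import ndimage

def change_thickness(image, complexity, rng=np.random):
    # Structuring elements of increasing size (illustrative shapes only).
    elements = [np.ones((h, w)) for h, w in
                [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2),
                 (3, 3), (3, 4), (4, 4), (5, 5)]]
    dilate = rng.rand() < 0.5
    n = int(round((10 if dilate else 6) * complexity))
    if n == 0:                       # neutral case: no transformation
        return image
    elem = elements[rng.randint(n)]  # pick among the n smallest elements
    if dilate:
        return ndimage.grey_dilation(image, footprint=elem)
    return ndimage.grey_erosion(image, footprint=elem)
\end{verbatim}
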
\subsection{Affine Transformations}

We generate an affine transform matrix according to the complexity level,
then we apply it directly to the image. The matrix is of size $2 \times
3$, so we can represent it by six parameters $(a,b,c,d,e,f)$. Formally,
each pixel $(x,y)$ of the output image takes the value of the pixel nearest
to $(ax+by+c, dx+ey+f)$ in the input image. This allows us to produce
scaling, translation, rotation and shearing variations.

The sampling of the parameters $(a,b,c,d,e,f)$ has been tuned by hand to
forbid large rotations (which could confuse classes) while still giving
good variability to the transformation. For each image we sample the
parameters uniformly in the following ranges: $a$ and $d$ in $[1-3 \times
complexity, 1+3 \times complexity]$, $b$ and $e$ in $[-3 \times complexity,
3 \times complexity]$, and $c$ and $f$ in $[-4 \times complexity, 4 \times
complexity]$.

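A sketch of the nearest-neighbour mapping described above (our
illustration, not the pipeline code):

\begin{verbatim}
import numpy as np

def affine(image, complexity, rng=np.random):
    h, w = image.shape
    a, d = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity, size=2)
    b, e = rng.uniform(-3 * complexity, 3 * complexity, size=2)
    c, f = rng.uniform(-4 * complexity, 4 * complexity, size=2)
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # nearest-neighbour lookup of the source pixel
            sx = int(round(a * x + b * y + c))
            sy = int(round(d * x + e * y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
\end{verbatim}
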
\subsection{Local Elastic Deformations}

This filter induces a ``wiggly'' effect in the image. The description here
will be brief, as the algorithm follows precisely what is described in
\cite{SimardSP03}.

The general idea is to generate two ``displacement'' fields, for horizontal
and vertical displacements of pixels. Each of these fields has the same
size as the original image.

When generating the transformed image, we loop over the x and y positions
in the fields and select, as a value, the value of the pixel in the
original image at the (relative) position given by the displacement fields
for this x and y. If the position we would retrieve is outside the borders
of the image, we use a value of 0 instead.

To generate a pixel in either field, first a value between -1 and 1 is
chosen from a uniform distribution. Then all the pixels, in both fields,
are multiplied by a constant $\alpha$ which controls the intensity of the
displacements (a bigger $\alpha$ translates into larger wiggles).

As a final step, each field is convolved with a 2D Gaussian kernel of
standard deviation $\sigma$. Visually, this results in a ``blur''
filter. This has the effect of making values next to each other in the
displacement fields similar. In effect, this makes the wiggles more
coherent and less noisy.

As the displacement fields are costly to compute, 50 pairs of fields were
generated per complexity in increments of 0.1 (50 pairs for 0.1, 50 pairs
for 0.2, etc.), and afterwards, given a complexity, we select randomly
among the 50 corresponding pairs.

$\sigma$ and $\alpha$ are linked to complexity through the formulas
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.

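The following sketch computes the fields on the fly rather than reusing the
50 pre-generated pairs, but otherwise follows the description above (names
are ours):

\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def elastic_deform(image, complexity, rng=np.random):
    h, w = image.shape
    alpha = complexity ** (1.0 / 3.0) * 10.0
    sigma = 10.0 - 7.0 * complexity ** (1.0 / 3.0)
    # Random fields in [-1, 1], scaled by alpha, then smoothed.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)) * alpha, sigma)
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)) * alpha, sigma)
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx, sy = int(x + dx[y, x]), int(y + dy[y, x])
            if 0 <= sx < w and 0 <= sy < h:   # outside the image -> 0
                out[y, x] = image[sy, sx]
    return out
\end{verbatim}
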
\subsection{Pinch}

This is another GIMP filter we used. The filter is in fact named ``Whirl
and pinch'', but we do not use the ``whirl'' part (whirl is set to 0). As
described in GIMP, a pinch is ``similar to projecting the image onto an
elastic surface and pressing or pulling on the center of the surface''.

Mathematically, for a square input image, think of drawing a circle of
radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
that disk (the region inside the circle) will have its value recalculated
by taking the value of another ``source'' pixel in the original image. The
position of that source pixel is found on the line that goes through $C$
and $P$, but at some other distance $d_2$. Define $d_1$ to be the distance
between $P$ and $C$. Then $d_2$ is given by
$d_2 = \sin\left(\frac{\pi d_1}{2r}\right)^{-pinch} \times d_1$,
where $pinch$ is a parameter of the filter.

If the region considered is not square then, before computing $d_2$, the
smallest dimension (x or y) is stretched such that we may consider the
region as if it was square. Then, after $d_2$ has been computed and the
corresponding components $d_{2,x}$ and $d_{2,y}$ have been found, the
component corresponding to the stretched dimension is compressed back by
the inverse ratio.

The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.

The value for $pinch$ in our case is sampled from a uniform distribution
over the range $[-complexity, 0.7 \times complexity]$.

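A sketch of the radial remapping, assuming a square image and using
nearest-neighbour lookup instead of the bilinear interpolation of the
actual filter:

\begin{verbatim}
import numpy as np

def pinch(image, complexity, rng=np.random):
    h, w = image.shape                       # assumed square here
    p = rng.uniform(-complexity, 0.7 * complexity)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = min(cx, cy)
    out = image.copy()
    for y in range(h):
        for x in range(w):
            d1 = np.hypot(x - cx, y - cy)
            if 0 < d1 <= r:
                d2 = np.sin(np.pi * d1 / (2 * r)) ** (-p) * d1
                sx = cx + (x - cx) * d2 / d1
                sy = cy + (y - cy) * d2 / d1
                ix, iy = int(round(sx)), int(round(sy))
                if 0 <= ix < w and 0 <= iy < h:
                    out[y, x] = image[iy, ix]
    return out
\end{verbatim}
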
\subsection{Motion Blur}

This is a GIMP filter we applied, a ``linear motion blur'' in GIMP
terminology. The description will be brief as it is a well-known filter.

This algorithm has two input parameters, $length$ and $angle$. The value of
a pixel in the final image is the mean value of the first $length$ pixels
found by moving in the $angle$ direction. An approximation of this idea is
used, as we won't fall onto precise pixels by following that direction;
this is done using the Bresenham line algorithm.

The angle, in our case, is chosen from a uniform distribution over
$[0,360]$ degrees. The length, though, depends on the complexity; it is
sampled from a Gaussian distribution of mean 0 and standard deviation
$\sigma = 3 \times complexity$.

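A sketch of the idea, using simple rounding along the direction instead of
the Bresenham line algorithm:

\begin{verbatim}
import numpy as np

def motion_blur(image, complexity, rng=np.random):
    angle = np.deg2rad(rng.uniform(0, 360))
    length = int(abs(rng.normal(0, 3 * complexity)))
    if length <= 1:
        return image
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # average the `length` first pixels along the chosen direction
            vals = []
            for k in range(length):
                sx = int(round(x + k * np.cos(angle)))
                sy = int(round(y + k * np.sin(angle)))
                if 0 <= sx < w and 0 <= sy < h:
                    vals.append(image[sy, sx])
            out[y, x] = np.mean(vals) if vals else image[y, x]
    return out
\end{verbatim}
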
\subsection{Occlusion}

This filter selects random parts of other (hereafter ``occlusive'') letter
images and places them over the original letter (hereafter ``occluded'')
image. To be more precise, having selected a subregion of the occlusive
image and a destination position in the occluded image, the final value of
a given overlapping pixel is whichever of the two pixels is the lightest.
As a reminder, the background value is 0 (black), so the value nearest to 1
is selected.

To select a subpart of the occlusive image, four numbers are generated. For
compatibility with the code, we call them ``haut'', ``bas'', ``gauche'' and
``droite'' (respectively meaning top, bottom, left and right). Each of
these numbers is sampled from a Gaussian distribution of mean $8 \times
complexity$ and standard deviation $2$; the larger the complexity, the
larger the occlusion. The absolute value is taken, as the numbers must be
positive, and the maximum value is capped at 15.

These four sizes collectively define a window centered on the middle pixel
of the occlusive image. This is the part that will be extracted as the
occlusion.

The next step is to select a destination position in the occluded
image. Vertical and horizontal displacements $y\_arrivee$ and $x\_arrivee$
are selected according to Gaussian distributions of mean 0 and of standard
deviations of, respectively, 3 and 2. Then a horizontal placement mode,
$place$, is selected among three values meaning left, middle or right.

If $place$ is ``middle'', the occlusion will be horizontally centered
around the horizontal middle of the occluded image, then shifted according
to $x\_arrivee$. If $place$ is ``left'', it will be placed on the left of
the occluded image, then displaced right according to $x\_arrivee$. The
contrary happens if $place$ is ``right''.

In both the horizontal and vertical positioning, the maximum position in
either direction is such that the selected occlusion won't go beyond the
borders of the occluded image.

This filter has a probability of not being applied, at all, of 60\%.

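A simplified sketch of this module (it ignores the left/middle/right
placement mode and assumes $32\times32$ images; names are ours):

\begin{verbatim}
import numpy as np

def occlude(image, other, complexity, rng=np.random):
    # `image` and `other` are 32x32 character images in [0, 1];
    # `other` provides the occluding patch.
    if rng.rand() < 0.6:                      # 60% chance of doing nothing
        return image
    top, bottom, left, right = np.minimum(
        np.abs(rng.normal(8 * complexity, 2, size=4)).astype(int), 15)
    cy, cx = 16, 16                           # centre of the occluder image
    patch = other[cy - top:cy + bottom + 1, cx - left:cx + right + 1]
    ph, pw = patch.shape
    # roughly centred destination, jittered vertically and horizontally
    oy = int(np.clip(16 - ph // 2 + rng.normal(0, 3), 0, 32 - ph))
    ox = int(np.clip(16 - pw // 2 + rng.normal(0, 2), 0, 32 - pw))
    out = image.copy()
    region = out[oy:oy + ph, ox:ox + pw]
    out[oy:oy + ph, ox:ox + pw] = np.maximum(region, patch)  # keep lightest
    return out
\end{verbatim}
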
\subsection{Pixel Permutation}

This filter permutes neighbouring pixels. It first selects a fraction
$\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is
then sequentially exchanged with another pixel in its $V4$ neighbourhood.
The numbers of exchanges to the left, right, top and bottom are equal, or
differ by at most 1 when the number of selected pixels is not a multiple
of 4.

It has a probability of not being applied, at all, of 80\%.

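A sketch that swaps each selected pixel with a random $V4$ neighbour,
without enforcing the exact balance of directions described above:

\begin{verbatim}
import numpy as np

def permute_pixels(image, complexity, rng=np.random):
    if rng.rand() < 0.8:                      # 80% chance of doing nothing
        return image
    h, w = image.shape
    out = image.copy()
    n = int(round(complexity / 3.0 * h * w))  # number of pixels to disturb
    for _ in range(n):
        y, x = rng.randint(h), rng.randint(w)
        dy, dx = [(-1, 0), (1, 0), (0, -1), (0, 1)][rng.randint(4)]
        y2, x2 = y + dy, x + dx
        if 0 <= y2 < h and 0 <= x2 < w:       # swap with a V4 neighbour
            out[y, x], out[y2, x2] = out[y2, x2], out[y, x]
    return out
\end{verbatim}
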
\subsection{Gaussian Noise}

This filter simply adds, to each pixel of the image independently, a
Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$.

It has a probability of not being applied, at all, of 70\%.

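For instance (the clipping back to $[0,1]$ is our assumption):

\begin{verbatim}
import numpy as np

def gaussian_noise(image, complexity, rng=np.random):
    if rng.rand() < 0.7:                      # 70% chance of doing nothing
        return image
    noisy = image + rng.normal(0, complexity / 10.0, image.shape)
    return np.clip(noisy, 0.0, 1.0)           # keep pixel values in [0, 1]
\end{verbatim}
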
\subsection{Background Images}

Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random
background behind the letter. The background is chosen by first selecting,
at random, an image from a set of images. Then we choose a $32\times32$
subregion of that image as the background image (by sampling x and y
positions uniformly while making sure not to cross image borders).

To combine the original letter image and the background image, contrast
adjustments are made. We first get the maximal values (i.e. maximal
intensity) for both the original image and the background image, $maximage$
and $maxbg$. We also have a parameter, $contrast$, given by sampling from a
uniform distribution over $[complexity, 1]$.

Once we have all these numbers, we first adjust the values of the
background image: each pixel value is multiplied by $\frac{max(maximage -
contrast, 0)}{maxbg}$. Therefore the higher the contrast, the darker the
background will be.

The final image is found by taking, at each position, the brightest
(i.e. value nearest to 1) of the background pixel and the corresponding
pixel in the original image.

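The combination rule can be sketched as follows (assuming a non-black
background patch; names are ours):

\begin{verbatim}
import numpy as np

def add_background(image, bg, complexity, rng=np.random):
    # `bg` is a 32x32 patch cropped from a random background image, in [0,1].
    contrast = rng.uniform(complexity, 1.0)
    maximage, maxbg = image.max(), bg.max()
    scale = max(maximage - contrast, 0.0) / maxbg if maxbg > 0 else 0.0
    # keep, at every position, the brightest of letter and background
    return np.maximum(image, bg * scale)
\end{verbatim}
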
\subsection{Salt and Pepper Noise}

This filter adds noise to the image by randomly selecting a certain number
of pixels and, for those selected pixels, assigning a random value sampled
uniformly over the $[0,1]$ range. This distribution does not change
according to complexity; instead, the number of selected pixels does: the
proportion of changed pixels corresponds to $complexity / 5$, which means
that, at most, 20\% of the pixels will be randomized. At the lowest
extreme, no pixel is changed.

This filter also has a probability of not being applied, at all, of 75\%.

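A sketch of this module:

\begin{verbatim}
import numpy as np

def salt_and_pepper(image, complexity, rng=np.random):
    if rng.rand() < 0.75:                     # 75% chance of doing nothing
        return image
    out = image.copy()
    mask = rng.rand(*image.shape) < complexity / 5.0   # pixels to change
    out[mask] = rng.rand(int(mask.sum()))     # uniform values in [0, 1]
    return out
\end{verbatim}
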
\subsection{Spatially Gaussian Noise}

The aim of this transformation is to filter, with a Gaussian kernel,
different regions of the image. In order to save computing time we decided
to convolve the whole image only once with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12, 12 + 20 \times
complexity]$ and $[2, 2 + 6 \times complexity]$. The result is normalized
between $0$ and $1$. We also create a symmetric averaging window, of the
kernel size, with maximum value at the center. For each image we sample
uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
averaging centers between the original image and the filtered one. We
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation:
$\frac{image + filtered\_image \times mask}{mask + 1}$.

This filter has a probability of not being applied, at all, of 75\%.

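A sketch of this module; the exact shape of the averaging window is our
guess (a Gaussian bump), not necessarily the one used in the pipeline:

\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def local_blur(image, complexity, rng=np.random):
    # Blur the whole image once, then blend it back around a few centres.
    if rng.rand() < 0.75:                     # 75% chance of doing nothing
        return image
    h, w = image.shape
    sigma = rng.uniform(2, 2 + 6 * complexity)
    blurred = gaussian_filter(image, sigma)
    span = blurred.max() - blurred.min()
    blurred = (blurred - blurred.min()) / max(span, 1e-8)
    size = int(rng.uniform(12, 12 + 20 * complexity))
    ax = np.arange(size) - (size - 1) / 2.0
    window = np.exp(-(ax[None, :] ** 2 + ax[:, None] ** 2)
                    / (2.0 * (size / 4.0) ** 2))
    mask = np.zeros((h + size, w + size))     # padded to skip border tests
    for _ in range(rng.randint(3, 4 + int(10 * complexity))):
        cy, cx = rng.randint(h), rng.randint(w)
        mask[cy:cy + size, cx:cx + size] += window
    mask = mask[size // 2:size // 2 + h, size // 2:size // 2 + w]
    return (image + blurred * mask) / (mask + 1.0)
\end{verbatim}
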
214 \subsection{"Ratures"} 354 \subsection{Scratches}
215 355
216 The ratures ("scratches") filter places line-like white patches on the image. The lines are in fact heavily transformed images of the digit "1" (one), chosen at random among five thousands such start images of this digit. 356 The scratches module places line-like white patches on the image. The
217 357 lines are in fact heavily transformed images of the digit "1" (one), chosen
218 Once the image is selected, the transformation begins by finding the first $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is then cropped to the region thus delimited, then this cropped version is expanded to 32x32 again. It is then rotated by a random angle having a Gaussian distribution of mean 90 and standard deviation $100 \times complexity$ (in degrees). The rotation is done with bicubic interpolation. 358 at random among five thousands such start images of this digit.
219 359
220 The rotated image is then resized to 50x50, with anti-aliasing. In that image, we crop the image again by selecting a region delimited horizontally to $left$ to $left+32$ and vertically by $top$ to $top+32$. 360 Once the image is selected, the transformation begins by finding the first
221 361 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is
222 Once this is done, two passes of a greyscale morphological erosion filter are applied. Put briefly, this erosion filter reduces the width of the line by a certain $smoothing$ amount. For small complexities (< 0.5), $smoothing$ is 6, so the line is very small. For complexities ranging from 0.25 to 0.5, $smoothing$ is 5. It is 4 for complexities 0.5 to 0.75, and 3 for higher complexities. 362 then cropped to the region thus delimited, then this cropped version is
223 363 expanded to $32\times32$ again. It is then rotated by a random angle having a
224 To compensate for border effects, the image is then cropped to 28x28 by removing two pixels everywhere on the borders, then expanded to 32x32 again. The pixel values are then linearly expanded such that the minimum value is 0 and the maximal one is 1. Then, 50\% of the time, the image is vertically flipped. 364 Gaussian distribution of mean 90 and standard deviation $100 \times
225 365 complexity$ (in degrees). The rotation is done with bicubic interpolation.
226 This filter is only applied only 15\% of the time. When it is applied, 50\% of the time, only one patch image is generated and applied. In 30\% of cases, two patches are generated, and otherwise three patches are generated. The patch is applied by taking the maximal value on any given patch or the original image, for each of the 32x32 pixel locations. 366
367 The rotated image is then resized to $50\times50$, with anti-aliasing. In
368 that image, we crop the image again by selecting a region delimited
369 horizontally to $left$ to $left+32$ and vertically by $top$ to $top+32$.
370
371 Once this is done, two passes of a greyscale morphological erosion filter
372 are applied. Put briefly, this erosion filter reduces the width of the line
373 by a certain $smoothing$ amount. For small complexities (< 0.5),
374 $smoothing$ is 6, so the line is very small. For complexities ranging from
375 0.25 to 0.5, $smoothing$ is 5. It is 4 for complexities 0.5 to 0.75, and 3
376 for higher complexities.
377
378 To compensate for border effects, the image is then cropped to 28x28 by
379 removing two pixels everywhere on the borders, then expanded to 32x32
380 again. The pixel values are then linearly expanded such that the minimum
381 value is 0 and the maximal one is 1. Then, 50\% of the time, the image is
382 vertically flipped.
383
384 This filter is only applied only 15\% of the time. When it is applied, 50\%
385 of the time, only one patch image is generated and applied. In 30\% of
386 cases, two patches are generated, and otherwise three patches are
387 generated. The patch is applied by taking the maximal value on any given
388 patch or the original image, for each of the 32x32 pixel locations.
227 389
\subsection{Color and Contrast Changes}

This filter changes the contrast and may invert the image polarity (white
on black to black on white). The contrast $C$ is defined here as the
difference between the maximum and the minimum pixel value of the image. A
contrast value is sampled uniformly between $1$ and $1 - 0.85 \times
complexity$ (this ensures a minimum contrast of $0.15$). We then simply
normalize the image to the range $[\frac{1-C}{2}, 1-\frac{1-C}{2}]$. The
polarity is inverted with probability $0.5$.

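A sketch of the contrast normalization and polarity inversion:

\begin{verbatim}
import numpy as np

def change_contrast(image, complexity, rng=np.random):
    contrast = rng.uniform(1 - 0.85 * complexity, 1.0)   # C >= 0.15
    lo, hi = (1.0 - contrast) / 2.0, 1.0 - (1.0 - contrast) / 2.0
    span = image.max() - image.min()
    out = lo + (image - image.min()) * (hi - lo) / max(span, 1e-8)
    if rng.rand() < 0.5:                                  # invert polarity
        out = 1.0 - out
    return out
\end{verbatim}
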
\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\
\caption{Illustration of the pipeline of stochastic
of applying one of the modules in the pipeline. The last image
(bottom right) is used as training example.}
\label{fig:pipeline}
\end{figure}

\section{Learning Algorithms for Deep Architectures}

Learning in deep networks has long been a problem, since well-known
learning algorithms do not generalize well on deep architectures: using
them on a deep network usually yields worse generalization than on a
shallow one. Recently, new initialization techniques have been proposed
that enable better generalization overall. One of these initialization
techniques, the denoising auto-encoder, was outlined in the introduction;
for more details see section \ref{SdA}.

After this initialization is done, standard training algorithms work. In
our case, since we have large data sets, we use stochastic gradient
descent. This resembles minibatch training, except that the batches are
selected at random. To speed up computation, we randomly pre-arranged the
examples in batches and used those for all training experiments.


\section{Experimental Setup}

\subsection{Training Datasets}
