comparison writeup/nips2010_submission.tex @ 476:db28764b8252

Merge
author fsavard
date Sun, 30 May 2010 12:06:45 -0400
parents ead3085c1c66 bcf024e6ab23
children 6593e67381a3
There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

{\large\bf Transformations}\\
{\bf Slant.}
We mimic slant by shifting each row of the image
proportionally to its height: $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
and its value is randomly sampled according to the complexity level:
$slant \sim U[0,complexity]$, so the
maximum displacement for the lowest or highest pixel line is
$round(complexity \times 32)$.\\
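A minimal NumPy sketch of this row-shifting step (illustrative only: the
function name, the use of the row index as ``height'' and the wrap-around
shift are our assumptions, not the pipeline's actual code):
\begin{verbatim}
import numpy as np

def apply_slant(image, complexity, rng=np.random):
    # slant magnitude ~ U[0, complexity], with a random sign
    slant = rng.uniform(0, complexity) * rng.choice([-1, 1])
    out = np.empty_like(image)
    for row in range(image.shape[0]):
        # shift each row proportionally to its height (row index);
        # np.roll wraps around, the real pipeline may pad instead
        out[row] = np.roll(image[row], int(round(slant * row)))
    return out
\end{verbatim}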
{\bf Thickness.}
Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
are applied. The neighborhood of each pixel is multiplied
element-wise with a {\em structuring element} matrix.
The pixel value is replaced by the maximum or the minimum of the resulting
matrix, respectively for dilation or erosion. Ten different structural elements with
[\ldots]
$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
for erosion. A neutral element is always present in the set, and if it is
chosen no transformation is applied. Erosion allows only the six
smallest structural elements because when the character is too thin it may
be completely erased.\\
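A sketch of the dilation/erosion step, read as standard grey-scale morphology
(maximum or minimum over the support of the structuring element); the actual
element shapes used by the pipeline are not reproduced here:
\begin{verbatim}
import numpy as np

def morph(image, selem, dilate=True):
    # selem: square 0/1 structuring element; dilation takes the max over
    # its support in each neighbourhood, erosion takes the min
    k = selem.shape[0]
    pad = k // 2
    padded = np.pad(image, pad, mode='constant')
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + k, j:j + k]
            vals = patch[selem > 0]
            out[i, j] = vals.max() if dilate else vals.min()
    return out
\end{verbatim}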
{\bf Affine Transformations.}
A $2 \times 3$ affine transform matrix (with
6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
Each pixel $(x,y)$ of the output image takes the value of the pixel
nearest to $(ax+by+c,dx+ey+f)$ in the input image. This
produces scaling, translation, rotation and shearing.
[\ldots]
forbid important rotations (so as not to confuse classes) but to give good
variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
complexity]$.\\
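The nearest-neighbour resampling described above can be sketched as follows
(the $(x,y)$-to-array-index convention and the zero background for
out-of-range sources are our assumptions):
\begin{verbatim}
import numpy as np

def affine_nearest(image, a, b, c, d, e, f):
    # output pixel (x, y) takes the value of the input pixel nearest to
    # (a*x + b*y + c, d*x + e*y + f)
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx = int(round(a * x + b * y + c))
            sy = int(round(d * x + e * y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
\end{verbatim}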
{\bf Local Elastic Deformations.}
This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03},
which provides more details.
Two ``displacement'' fields are generated and applied, for horizontal
and vertical displacements of pixels.
To generate a pixel in either field, first a value between -1 and 1 is
[\ldots]
displacements (larger $\alpha$ translates into larger wiggles).
Each field is convolved with a 2D Gaussian kernel of
standard deviation $\sigma$. Visually, this results in a blur.
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.\\
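A compact sketch of this elastic deformation, assuming the random fields are
scaled by $\alpha$ after the Gaussian smoothing and applied with bilinear
interpolation (both assumptions on our part):
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, complexity, rng=np.random):
    alpha = 10.0 * complexity ** (1.0 / 3.0)
    sigma = 10.0 - 7.0 * complexity ** (1.0 / 3.0)
    # random displacement fields in [-1, 1], smoothed then scaled
    dx = alpha * gaussian_filter(rng.uniform(-1, 1, image.shape), sigma)
    dy = alpha * gaussian_filter(rng.uniform(-1, 1, image.shape), sigma)
    yy, xx = np.meshgrid(np.arange(image.shape[0]),
                         np.arange(image.shape[1]), indexing='ij')
    coords = np.array([yy + dy, xx + dx])
    return map_coordinates(image, coords, order=1, mode='constant')
\end{verbatim}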
{\bf Pinch.}
This GIMP filter is named ``Whirl and
pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
For a square input image, think of drawing a circle of
radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
[\ldots]
The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.
Here $pinch \sim U[-complexity, 0.7 \times complexity]$.\\

{\large\bf Injecting Noise}\\
{\bf Motion Blur.}
This is GIMP's ``linear motion blur'' filter,
with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the first $length$ pixels
found by moving in the $angle$ direction.
Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.\\
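GIMP's implementation is not reproduced here, but the averaging it performs
can be approximated as follows (the boundary handling is our choice):
\begin{verbatim}
import numpy as np

def motion_blur(image, length, angle_deg):
    # average, for each pixel, the first `length` pixels met when stepping
    # in the given direction; samples falling outside the image are skipped
    theta = np.deg2rad(angle_deg)
    dx, dy = np.cos(theta), np.sin(theta)
    h, w = image.shape
    out = np.zeros_like(image, dtype=float)
    steps = max(int(round(length)), 1)
    for y in range(h):
        for x in range(w):
            vals = []
            for t in range(steps):
                sx, sy = int(round(x + t * dx)), int(round(y + t * dy))
                if 0 <= sx < w and 0 <= sy < h:
                    vals.append(image[sy, sx])
            out[y, x] = np.mean(vals) if vals else image[y, x]
    return out
\end{verbatim}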
{\bf Occlusion.}
This filter selects a random rectangle from an {\em occluder} character
image and places it over the original {\em occluded} character
image. Pixels are combined by taking $\max(occluder, occluded)$,
i.e.\ the value closer to black. The rectangle corners
are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
It has a 60\% probability of not being applied at all.\\
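The combination rule itself is simple; a sketch follows, with the rectangle
and destination given explicitly rather than sampled as in the pipeline:
\begin{verbatim}
import numpy as np

def occlude(occluded, occluder, rect, dest):
    # rect = (top, left, height, width) in the occluder image,
    # dest = (top, left) in the occluded image; both assumed to fit
    t, l, hh, ww = rect
    dt, dl = dest
    out = occluded.copy()
    patch = occluder[t:t + hh, l:l + ww]
    # pixel-wise maximum keeps the value closer to black
    out[dt:dt + hh, dl:dl + ww] = np.maximum(out[dt:dt + hh, dl:dl + ww], patch)
    return out
\end{verbatim}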
{\bf Pixel Permutation.}
This filter permutes neighbouring pixels. It first selects a fraction
$\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then
sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers
of exchanges to the left, right, top and bottom are equal, or differ by at most
1 when the number of selected pixels is not a multiple of 4.
It has an 80\% probability of not being applied at all.\\
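A sketch of the exchange procedure, cycling through the four directions so
that their counts stay balanced (clamping at the borders is our choice):
\begin{verbatim}
import numpy as np

def permute_pixels(image, complexity, rng=np.random):
    h, w = image.shape
    n = int(h * w * complexity / 3.0)   # fraction complexity/3 of the pixels
    out = image.copy()
    dirs = [(0, -1), (0, 1), (-1, 0), (1, 0)]   # left, right, up, down
    for k in range(n):
        y, x = rng.randint(0, h), rng.randint(0, w)
        dy, dx = dirs[k % 4]
        ny = min(max(y + dy, 0), h - 1)
        nx = min(max(x + dx, 0), w - 1)
        out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
\end{verbatim}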
{\bf Gaussian Noise.}
This filter simply adds, to each pixel of the image independently, a
noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
It has a 70\% probability of not being applied at all.\\
{\bf Background Images.}
Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
background behind the letter. The background is chosen by first selecting,
at random, an image from a set of images. Then a 32$\times$32 subregion
of that image is chosen as the background image (by sampling position
uniformly while making sure not to cross image borders).
[\ldots]
intensity) for both the original image and the background image, $maximage$
and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
Each background pixel value is multiplied by $\frac{\max(maximage -
contrast, 0)}{maxbg}$ (higher contrast yields a darker
background). The output image pixels are $\max(background, original)$.\\
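A sketch of the background blending, with the crop, contrast rescaling and
pixel-wise maximum written out (variable and function names are ours):
\begin{verbatim}
import numpy as np

def add_background(original, bg_source, complexity, rng=np.random):
    # random 32x32 crop of the background source image
    h, w = original.shape
    top = rng.randint(0, bg_source.shape[0] - h + 1)
    left = rng.randint(0, bg_source.shape[1] - w + 1)
    bg = bg_source[top:top + h, left:left + w].astype(float)
    contrast = rng.uniform(complexity, 1.0)
    maximage, maxbg = original.max(), max(bg.max(), 1e-8)
    # higher contrast yields a darker (dimmer) background
    bg = bg * max(maximage - contrast, 0.0) / maxbg
    return np.maximum(bg, original)
\end{verbatim}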
{\bf Salt and Pepper Noise.}
This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
The fraction of selected pixels is $0.2 \times complexity$.
This filter has a 75\% probability of not being applied at all.\\
{\bf Spatially Gaussian Noise.}
Different regions of the image are spatially smoothed.
The image is convolved with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12,12 + 20 \times
complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
between $0$ and $1$. We also create a symmetric averaging window, of the
[\ldots]
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation: $\frac{image + filtered\_image
\times mask}{mask+1}$.
This filter has a 75\% probability of not being applied at all.\\
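A sketch of this spatially weighted smoothing; the number of selected pixels
and the averaging-window size are elided in the text above, so the values
used below are placeholders:
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_smoothing(image, complexity, n_centers=5, half_win=5,
                      rng=np.random):
    sigma = rng.uniform(2, 2 + 6 * complexity)
    filtered = gaussian_filter(image, sigma)
    span = max(filtered.max() - filtered.min(), 1e-8)
    filtered = (filtered - filtered.min()) / span   # normalize to [0, 1]
    mask = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for _ in range(n_centers):
        y, x = rng.randint(0, h), rng.randint(0, w)
        # accumulate an averaging window centred on the selected pixel
        mask[max(y - half_win, 0):y + half_win + 1,
             max(x - half_win, 0):x + half_win + 1] += 1.0
    return (image + filtered * mask) / (mask + 1.0)
\end{verbatim}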
{\bf Scratches.}
The scratches module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among five thousand such images. The 1 image is
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
complexity)^2)$, using bicubic interpolation,
[\ldots]
This filter is applied only 15\% of the time. When it is applied, 50\%
of the time only one patch image is generated and applied. In 30\% of
cases two patches are generated, and otherwise three patches are
generated. The patch is applied by taking the maximal value of the
patch or the original image at each of the 32$\times$32 pixel locations.\\
{\bf Color and Contrast Changes.}
This filter changes the contrast and may invert the image polarity (white
on black to black on white). The contrast $C$ is defined here as the
difference between the maximum and the minimum pixel value of the image.
Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$
(a sketch of this normalization is given after the figure). The
[\ldots]
\end{figure}
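Returning to the contrast change above, the normalization step alone can be
sketched as follows (the elided polarity inversion is not included):
\begin{verbatim}
import numpy as np

def change_contrast(image, complexity, rng=np.random):
    # target contrast C ~ U[1 - 0.85*complexity, 1], so C >= 0.15
    C = rng.uniform(1.0 - 0.85 * complexity, 1.0)
    lo, hi = image.min(), image.max()
    unit = (image - lo) / max(hi - lo, 1e-8)   # rescale to [0, 1]
    return (1.0 - C) / 2.0 + unit * C          # into [(1-C)/2, 1-(1-C)/2]
\end{verbatim}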


\section{Experimental Setup}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
to focus here on the case of much larger training sets, from 10 times to
1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source} (NIST characters, scanned machine-printed characters, characters
from fonts, or characters from captchas) and then optionally applying some of the
above transformations and/or noise processes.

\subsection{Data Sources}

\begin{itemize}
\item {\bf NIST}
Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 series of different complexity.
The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
by NIST as a testing set and is used in our work and some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}
for that purpose. We randomly split the remainder into a training set and a validation set for
model selection. The sizes of these data sets are: XXX for training, XXX for validation,
and XXX for testing.
The performances reported by previous work on that dataset mostly use only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a uniform distribution
[\ldots]

\item {\bf Fonts} TODO!!!

\item {\bf Captchas}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the data generated, many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
allowed and can be controlled via an easy-to-use facade class.
\item {\bf OCR data}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) was included as an
additional source. This set is part of a larger corpus being collected by the Image Understanding
Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern
({\tt http://www.iupr.com}), and which will be publicly released.
\end{itemize}

\subsection{Data Sets}
All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
\begin{itemize}
\item {\bf NIST.} This is the raw NIST Special Database 19.
\item {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the above transformation pipeline.
For each new example to generate, a source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$ (a sampling sketch follows this list).
\item {\bf NISTP.} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
\end{itemize}
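The P07 sampling procedure can be sketched as follows (the source callables
and the transformation list are stand-ins for the actual pipeline modules):
\begin{verbatim}
import numpy as np

def sample_p07_example(sources, transformations, rng=np.random):
    # sources: dict name -> callable returning a raw 32x32 character image
    # transformations: callables f(image, complexity), in pipeline order
    names = ['fonts', 'captchas', 'ocr', 'nist']
    probs = [0.10, 0.25, 0.25, 0.40]
    image = sources[names[rng.choice(len(names), p=probs)]]()
    for transform in transformations:
        # each transformation gets its own complexity in [0, 0.7]
        image = transform(image, rng.uniform(0.0, 0.7))
    return image
\end{verbatim}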

\subsection{Models and their Hyperparameters}

All hyper-parameters are selected based on performance on the NISTP validation set.

\subsubsection{Multi-Layer Perceptrons (MLP)}

Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used.
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: the number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

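A minimal NumPy sketch of the model and update just described (one $\tanh$
hidden layer, softmax outputs, constant-learning-rate minibatch gradient
descent); the layer sizes and initialization scheme below are illustrative:
\begin{verbatim}
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class OneHiddenLayerMLP:
    def __init__(self, n_in=32 * 32, n_hidden=500, n_out=62, rng=np.random):
        self.W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.H = np.tanh(X @ self.W1 + self.b1)
        return softmax(self.H @ self.W2 + self.b2)   # P(class | image)

    def sgd_step(self, X, y, lr=0.1):
        # one minibatch step on the negative log-likelihood
        P = self.forward(X)
        n = X.shape[0]
        dZ2 = P
        dZ2[np.arange(n), y] -= 1.0
        dZ2 /= n
        dH = (dZ2 @ self.W2.T) * (1.0 - self.H ** 2)
        self.W2 -= lr * (self.H.T @ dZ2)
        self.b2 -= lr * dZ2.sum(axis=0)
        self.W1 -= lr * (X.T @ dH)
        self.b1 -= lr * dH.sum(axis=0)
\end{verbatim}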
\subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
\label{SdA}

Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization: they apparently set the parameters in the
basin of attraction of supervised gradient descent solutions that yield better
generalization~\citep{Erhan+al-2010}. It is hypothesized that the
advantage brought by this procedure stems from a better prior,
on the one hand taking advantage of the link between the input
distribution $P(x)$ and the conditional distribution of interest
$P(y|x)$ (as in semi-supervised learning), and on the other hand
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).

Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising
Auto-Encoder, it is presented with a stochastically corrupted version
of the input and trained to reconstruct the uncorrupted input,
forcing the hidden units to represent the leading regularities in
the data. Once it is trained, its hidden unit activations can
be used as inputs for training a second one, and so on.
After this unsupervised pre-training stage, the parameters
are used to initialize a deep MLP, which is fine-tuned by
the same standard supervised procedure used to train MLPs (see the previous section).

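A sketch of the greedy pre-training of one such layer with masking noise;
the tied decoder weights and the cross-entropy reconstruction loss below
follow~\citet{VincentPLarochelleH2008} but are simplifying assumptions here:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dae_layer(X, n_hidden, corruption=0.2, lr=0.01, epochs=10,
                       batch=20, rng=np.random):
    n_in = X.shape[1]
    W = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for start in range(0, X.shape[0], batch):
            x = X[start:start + batch]
            # masking noise: a fixed fraction of inputs is zeroed at random
            x_tilde = x * (rng.uniform(size=x.shape) > corruption)
            h = sigmoid(x_tilde @ W + b_h)       # encoder
            x_rec = sigmoid(h @ W.T + b_v)       # decoder (tied weights)
            d_rec = (x_rec - x) / x.shape[0]     # cross-entropy gradient
            d_h = (d_rec @ W) * h * (1.0 - h)
            W -= lr * (x_tilde.T @ d_h + d_rec.T @ h)
            b_h -= lr * d_h.sum(axis=0)
            b_v -= lr * d_rec.sum(axis=0)
    # hidden activations feed the next layer; (W, b_h) initialize the deep MLP
    return W, b_h, sigmoid(X @ W + b_h)
\end{verbatim}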
The hyper-parameters are the same as for the MLP, with the addition of the
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

\section{Experimental Results}

\subsection{SDA vs MLP vs Humans}

[\ldots]
on NIST digits classification using the same test set are included.}
\label{tab:sda-vs-mlp-vs-humans}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
          & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
Humans    & 18.2\% $\pm$.1\%  & 39.4\%$\pm$.1\%  & 46.9\%$\pm$.1\%   & $1.4\%$ \\ \hline
SDA0      & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\% \\ \hline
SDA1      & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\%  & 29.7\%$\pm$.3\%   & 1.4\% $\pm$.1\% \\ \hline
SDA2      & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\%  & 39.9\%$\pm$.17\%  & 1.7\% $\pm$.1\% \\ \hline
MLP0      & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
MLP1      & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\%   & 3.85\% $\pm$.16\% \\ \hline
MLP2      & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\%  & 4.85\% $\pm$.18\% \\ \hline
\citep{Granger+al-2007}  & & & & 4.95\% $\pm$.18\% \\ \hline
\citep{Cortes+al-2000}   & & & & 3.71\% $\pm$.16\% \\ \hline
\citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
\citep{Milgram+al-2005}  & & & & 2.1\% $\pm$.12\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{figure}[h]
[\ldots]

\section{Conclusions}

The conclusions are positive for all the questions asked in the introduction.
\begin{itemize}
\item Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST Special Database 19, with 62 classes and around 800k examples?
Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
performance.
\item To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input images,
but were only marginally helped with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.
\item Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP, it was very significant for the SDA.
\end{itemize}

\bibliography{strings,ml,aigaion,specials}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}
