comparison writeup/nips2010_submission.tex @ 495:5764a2ae1fb5

typos
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 11:02:10 -0400
parents a194ce5a4249
children e41007dd40e9 2b58eda9fc08
18 \vspace*{-2mm}
19 \begin{abstract}
20 Recent theoretical and empirical work in statistical machine learning has
21 demonstrated the importance of learning algorithms for deep
22 architectures, i.e., function classes obtained by composing multiple
23 non-linear transformations. Self-taught learning (exploiting unlabeled
24 examples or examples from other distributions) has already been applied
25 to deep learners, but mostly to show the advantage of unlabeled
26 examples. Here we explore the advantage brought by {\em out-of-distribution
27 examples} and show that {\em deep learners benefit more from them than a
28 corresponding shallow learner}, in the area
137
138 \vspace*{2mm}
139
140 {\bf Slant.}
141 We mimic slant by shifting each row of the image
142 proportionally to its height: $shift = round(slant \times height)$.
143 The $slant$ coefficient can be negative or positive with equal probability
144 and its value is randomly sampled according to the complexity level:
145 $slant \sim U[0,complexity]$, so the
146 maximum displacement for the lowest or highest pixel line is
147 $round(complexity \times 32)$.\\
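For concreteness, here is a minimal NumPy sketch of this row-shift rule (not the actual pipeline code); it assumes a $32\times32$ image, zero padding for pixels shifted in from outside the frame, and that the ``height'' of a row is its index counted from the top:
\begin{verbatim}
import numpy as np

def apply_slant(image, complexity, rng=np.random):
    # slant ~ U[0, complexity], with a random sign
    slant = rng.uniform(0.0, complexity) * rng.choice([-1.0, 1.0])
    out = np.zeros_like(image)
    height, width = image.shape
    for y in range(height):
        shift = int(round(slant * y))        # displacement grows with row height
        src = np.arange(width) - shift       # source column for each target column
        valid = (src >= 0) & (src < width)   # columns shifted in from outside stay 0
        out[y, valid] = image[y, src[valid]]
    return out
\end{verbatim}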
148 {\bf Thickness.}
149 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
150 are applied. The neighborhood of each pixel is multiplied
151 element-wise with a {\em structuring element} matrix.
152 The pixel value is replaced by the maximum or the minimum of the resulting
153 matrix, respectively for dilation or erosion. Ten different structuring elements with
154 increasing dimensions (largest is $5\times5$) were used. For each image,
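A small NumPy sketch of this greyscale dilation/erosion step may help; it is only an illustration of the rule as stated, and the border padding and the restriction of the max/min to the non-zero entries of the structuring element are assumptions:
\begin{verbatim}
import numpy as np

def morph(image, struct, mode="dilate"):
    # Multiply each pixel's neighborhood element-wise by the structuring
    # element and keep the max (dilation) or min (erosion) of the product.
    sh, sw = struct.shape
    ph, pw = sh // 2, sw // 2
    pad_val = 0.0 if mode == "dilate" else 1.0   # border handling: an assumption
    padded = np.pad(image, ((ph, ph), (pw, pw)), constant_values=pad_val)
    mask = struct > 0                            # only the element's support counts
    out = np.empty_like(image)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            vals = (padded[y:y + sh, x:x + sw] * struct)[mask]
            out[y, x] = vals.max() if mode == "dilate" else vals.min()
    return out
\end{verbatim}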
190 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
191 For a square input image, think of drawing a circle of
192 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
193 that disk (region inside circle) will have its value recalculated by taking
194 the value of another ``source'' pixel in the original image. The position of
195 that source pixel is found on the line that goes through $C$ and $P$, but
196 at some other distance $d_2$. Define $d_1$ to be the distance between $P$
197 and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi d_1}{2r})^{-pinch} \times
198 d_1$, where $pinch$ is a parameter to the filter.
199 The actual value is given by bilinear interpolation considering the pixels
200 around the (non-integer) source position thus found.
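The mapping can be made explicit with a short NumPy sketch; the choice of center, radius, the handling of the degenerate $d_1=0$ case and the clamping at image borders are assumptions, the actual filter being the GIMP implementation cited above:
\begin{verbatim}
import numpy as np

def apply_pinch(image, pinch, center, r):
    # Each pixel P inside the disk of radius r around C takes the value of a
    # source pixel at distance d2 = sin(pi*d1/(2r))**(-pinch) * d1 along C->P,
    # read off with bilinear interpolation.
    h, w = image.shape
    cy, cx = center
    out = image.astype(float).copy()
    for y in range(h):
        for x in range(w):
            dy, dx = y - cy, x - cx
            d1 = np.hypot(dy, dx)
            if d1 == 0.0 or d1 >= r:   # center and pixels outside the disk unchanged
                continue
            d2 = np.sin(np.pi * d1 / (2.0 * r)) ** (-pinch) * d1
            sy, sx = cy + dy * d2 / d1, cx + dx * d2 / d1
            y0 = int(np.clip(np.floor(sy), 0, h - 2))   # clamp to image: an assumption
            x0 = int(np.clip(np.floor(sx), 0, w - 2))
            fy, fx = sy - y0, sx - x0
            out[y, x] = ((1 - fy) * (1 - fx) * image[y0, x0]
                         + (1 - fy) * fx * image[y0, x0 + 1]
                         + fy * (1 - fx) * image[y0 + 1, x0]
                         + fy * fx * image[y0 + 1, x0 + 1])
    return out
\end{verbatim}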
233 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
234 It has a probability of not being applied at all of 70\%.\\
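In code, this amounts to the following sketch (clipping the result back to $[0,1]$ is an assumption, not stated above):
\begin{verbatim}
import numpy as np

def add_gaussian_noise(image, complexity, rng=np.random):
    # Skipped 70% of the time; otherwise add i.i.d. noise with
    # mean 0 and standard deviation complexity/10.
    if rng.uniform() < 0.7:
        return image
    noisy = image + rng.normal(0.0, complexity / 10.0, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
\end{verbatim}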
235 {\bf Background Images.}
236 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
237 background behind the letter. The background is chosen by first selecting,
238 at random, an image from a set of images. Then a 32$\times$32 sub-region
239 of that image is chosen as the background image (by sampling position
240 uniformly while making sure not to cross image borders).
241 To combine the original letter image and the background image, contrast
242 adjustments are made. We first get the maximal values (i.e. maximal
243 intensity) for both the original image and the background image, $maximage$
250 The number of selected pixels is $0.2 \times complexity$.
251 This filter has a probability of not being applied at all of 75\%.\\
252 {\bf Spatially Gaussian Noise.}
253 Different regions of the image are spatially smoothed.
254 The image is convolved with a symmetric Gaussian kernel of
255 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
256 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
257 between $0$ and $1$. We also create a symmetric averaging window, of the
258 kernel size, with maximum value at the center. For each image we sample
259 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
260 averaging centers between the original image and the filtered one. We
266 {\bf Scratches.}
267 The scratches module places line-like white patches on the image. The
268 lines are heavily transformed images of the digit ``1'' (one), chosen
269 at random among five thousand such 1 images. The 1 image is
270 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
271 complexity)^2)$, using bi-cubic interpolation.
272 Two passes of a grey-scale morphological erosion filter
273 are applied, reducing the width of the line
274 by an amount controlled by $complexity$.
275 This filter is applied only 15\% of the time. When it is applied, 50\%
276 of the time only one patch image is generated and applied. In 30\% of
277 cases, two patches are generated, and otherwise three patches are
278 generated. The patches are applied by taking, at each of the $32\times32$ pixel
279 locations, the maximal value over the patches and the original image.\\
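The nested probabilities and the pixel-wise max combination are summarized in the following sketch, where {\tt make\_patch} is a hypothetical helper standing in for the crop/rotate/erode processing of a random ``1'' image described above:
\begin{verbatim}
import numpy as np

def apply_scratches(image, make_patch, rng=np.random):
    # Applied 15% of the time; then 1, 2 or 3 patches with probabilities
    # 0.5, 0.3 and 0.2; combined with the image by a pixel-wise maximum.
    if rng.uniform() >= 0.15:
        return image
    n_patches = rng.choice([1, 2, 3], p=[0.5, 0.3, 0.2])
    out = image.copy()
    for _ in range(n_patches):
        out = np.maximum(out, make_patch())   # make_patch(): transformed "1" image
    return out
\end{verbatim}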
280 {\bf Color and Contrast Changes.}
281 This filter changes the contrast and may invert the image polarity (white
282 on black to black on white). The contrast $C$ is defined here as the
283 difference between the maximum and the minimum pixel value of the image.
284 The new contrast is sampled as $C \sim U[1-0.85 \times complexity,1]$ (so $C \geq 0.15$).
285 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
286 polarity is inverted with probability $0.5$.
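Assuming pixel values in $[0,1]$, a minimal sketch of this contrast change is:
\begin{verbatim}
import numpy as np

def change_contrast(image, complexity, rng=np.random):
    # Sample the target contrast C, rescale the image into
    # [(1-C)/2, 1-(1-C)/2], and invert polarity half of the time.
    c = rng.uniform(1.0 - 0.85 * complexity, 1.0)
    lo, hi = image.min(), image.max()
    span = (hi - lo) if hi > lo else 1.0      # guard against flat images: an assumption
    rescaled = (image - lo) / span * c + (1.0 - c) / 2.0
    if rng.uniform() < 0.5:
        rescaled = 1.0 - rescaled             # white on black <-> black on white
    return rescaled
\end{verbatim}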
287
288
289 \begin{figure}[h]
299
300
301 \begin{figure}[h]
302 \resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\
303 \caption{Illustration of each transformation applied alone to the same image
304 of an upper-case h (top left). First row (from left to right): original image, slant,
305 thickness, affine transformation, local elastic deformation; second row (from left to right):
306 pinch, motion blur, occlusion, pixel permutation, Gaussian noise; third row (from left to right):
307 background image, salt and pepper noise, spatially Gaussian noise, scratches,
308 color and contrast changes.}
309 \label{fig:transfo}
310 \end{figure}
311
353 %\item
354 {\bf Fonts.}
355 In order to have a good variety of sources, we downloaded a large number of free fonts from {\tt http://anonymous.url.net}.
356 %real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
357 In addition to the Windows 7 fonts, this adds up to a total of $9817$ different fonts from which we can choose uniformly.
358 The {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by producing a corresponding image,
359 directly as input to our models.
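The text does not specify how a {\tt ttf} file is rendered into a $32\times32$ character image; one plausible recipe, using the Pillow imaging library (font size and placement here are arbitrary assumptions), is:
\begin{verbatim}
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_char(ttf_path, char, point_size=28):
    # Draw one character from a .ttf font onto a 32x32 greyscale canvas
    # and return it as an array scaled to [0, 1].
    font = ImageFont.truetype(ttf_path, point_size)
    canvas = Image.new("L", (32, 32), color=0)       # black background
    draw = ImageDraw.Draw(canvas)
    draw.text((2, 0), char, fill=255, font=font)     # rough placement: an assumption
    return np.asarray(canvas, dtype=np.float32) / 255.0
\end{verbatim}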
360
361 %\item
362 {\bf Captchas.}
363 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
364 generating characters of the same format as the NIST dataset. This software is based on
365 a random character class generator and various kinds of transformations similar to those described in the previous sections.
366 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
367 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
368 depending on the complexity parameter provided by the user of the data source. Two levels of complexity are
369 allowed and can be controlled via an easy-to-use facade class.
370
371 %\item
372 {\bf OCR data.}
373 A large set (2 million) of scanned, OCRed and manually verified machine-printed
374 characters (from various documents and books) were included as an
375 additional source. This set is part of a larger corpus being collected by the Image Understanding
376 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern
377 ({\tt http://www.iupr.com}), which will be publicly released.
378 %\end{itemize}
379
380 \vspace*{-1mm}
381 \subsection{Data Sets}
382 \vspace*{-1mm}
389 {\bf NIST.} This is the raw NIST Special Database 19.
390
391 %\item
392 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
393 and sending them through the above transformation pipeline.
394 For each new example to be generated, a source is selected with probability $10\%$ from the fonts,
395 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
396 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
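The generation of a P07 example is thus a two-level sampling process, sketched below; {\tt draw\_raw\_char} and {\tt pipeline} are hypothetical stand-ins for the four raw sources and for the ordered list of transformation modules described in the previous section:
\begin{verbatim}
import numpy as np

SOURCES = {"fonts": 0.10, "captcha": 0.25, "ocr": 0.25, "nist": 0.40}

def generate_p07_example(draw_raw_char, pipeline, rng=np.random):
    # Pick a raw source according to the 10/25/25/40% mixture, then apply
    # every transformation in order, each with its own complexity ~ U[0, 0.7].
    names, probs = zip(*SOURCES.items())
    source = rng.choice(names, p=probs)
    image = draw_raw_char(source)
    for transform in pipeline:
        complexity = rng.uniform(0.0, 0.7)
        image = transform(image, complexity)
    return image
\end{verbatim}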
397
398 %\item
399 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
400 except that we apply only the
401 transformations from slant to pinch. Therefore, the character is
402 transformed but no additional noise is added to the image, giving images
403 closer to the NIST dataset.
404 %\end{itemize}
405
406 \vspace*{-1mm}
407 \subsection{Models and their Hyperparameters}
473 %of money to perform tasks for which human intelligence is required.
474 %Mechanical Turk has been used extensively in natural language
475 %processing \citep{SnowEtAl2008} and vision
476 %\citep{SorokinAndForsyth2008,whitehill09}.
477 AMT users were presented
478 with 10 character images and asked to type 10 corresponding ASCII
479 characters. They were forced to make a hard choice among the
480 62 or 10 character classes (all classes or digits only).
481 Three users classified each image, allowing
482 us to estimate inter-human variability (shown as +/- in parentheses below).
483
553 fine-tuned on NIST.
554
555 Our results show that the MLP benefits marginally from the multi-task setting
556 in the case of digits (5\% relative improvement) but is actually hurt in the case
557 of characters (respectively 3\% and 4\% worse for lower- and upper-case characters).
558 On the other hand, the SDA benefited from the multi-task setting, with relative
559 error rate improvements of 27\%, 15\% and 13\% respectively for digits,
560 lower- and upper-case characters, as shown in Table~\ref{tab:multi-task}.
561 \fi
562
563
593 noise, affine transformations, background images) make the resulting
594 classifier better not only on similarly perturbed images but also on
595 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
596 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
597 MLPs were helped by perturbed training examples when tested on perturbed input images,
598 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs
599 were very significantly boosted by these out-of-distribution examples.
600
601 $\bullet$ %\item
602 Similarly, does the feature learning step in deep learning algorithms benefit more from
603 training with similar but different classes (i.e. a multi-task learning scenario) than