comparison writeup/nips2010_submission.tex @ 495:5764a2ae1fb5
typos
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Tue, 01 Jun 2010 11:02:10 -0400 |
parents | a194ce5a4249 |
children | e41007dd40e9 2b58eda9fc08 |
494:405cabc08c92 | 495:5764a2ae1fb5 |
---|---|
18 \vspace*{-2mm} | 18 \vspace*{-2mm} |
19 \begin{abstract} | 19 \begin{abstract} |
20 Recent theoretical and empirical work in statistical machine learning has | 20 Recent theoretical and empirical work in statistical machine learning has |
21 demonstrated the importance of learning algorithms for deep | 21 demonstrated the importance of learning algorithms for deep |
22 architectures, i.e., function classes obtained by composing multiple | 22 architectures, i.e., function classes obtained by composing multiple |
23 non-linear transformations. The self-taught learning (exploitng unlabeled | 23 non-linear transformations. Self-taught learning (exploiting unlabeled |
24 examples or examples from other distributions) has already been applied | 24 examples or examples from other distributions) has already been applied |
25 to deep learners, but mostly to show the advantage of unlabeled | 25 to deep learners, but mostly to show the advantage of unlabeled |
26 examples. Here we explore the advantage brought by {\em out-of-distribution | 26 examples. Here we explore the advantage brought by {\em out-of-distribution |
27 examples} and show that {\em deep learners benefit more from them than a | 27 examples} and show that {\em deep learners benefit more from them than a |
28 corresponding shallow learner}, in the area | 28 corresponding shallow learner}, in the area |
137 | 137 |
138 \vspace*{2mm} | 138 \vspace*{2mm} |
139 | 139 |
140 {\bf Slant.} | 140 {\bf Slant.} |
141 We mimic slant by shifting each row of the image | 141 We mimic slant by shifting each row of the image |
142 proportionnaly to its height: $shift = round(slant \times height)$. | 142 proportionally to its height: $shift = round(slant \times height)$. |
143 The $slant$ coefficient can be negative or positive with equal probability | 143 The $slant$ coefficient can be negative or positive with equal probability |
144 and its value is randomly sampled according to the complexity level: | 144 and its value is randomly sampled according to the complexity level: |
145 $slant \sim U[0,complexity]$, so the | 145 $slant \sim U[0,complexity]$, so the |
146 maximum displacement for the lowest or highest pixel line is | 146 maximum displacement for the lowest or highest pixel line is |
147 $round(complexity \times 32)$.\\ | 147 $round(complexity \times 32)$.\\ |
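The slant step is fully specified by the two formulas above, so it can be sketched directly. A minimal NumPy sketch, assuming wrap-around shifting and height measured from the top row (the function name is ours, not the paper's):

    import numpy as np

    def slant_image(img, complexity, rng):
        # Sign is positive or negative with equal probability,
        # magnitude ~ U[0, complexity], as described above.
        slant = rng.uniform(0.0, complexity) * rng.choice([-1.0, 1.0])
        out = np.empty_like(img)
        for y in range(img.shape[0]):
            # shift = round(slant * height); np.roll wraps around,
            # zero-padding would be another reasonable choice.
            out[y] = np.roll(img[y], int(round(slant * y)))
        return out

    # Example: slant_image(np.zeros((32, 32)), 0.7, np.random.default_rng(0))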
148 {\bf Thickness.} | 148 {\bf Thickness.} |
149 Morpholigical operators of dilation and erosion~\citep{Haralick87,Serra82} | 149 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82} |
150 are applied. The neighborhood of each pixel is multiplied | 150 are applied. The neighborhood of each pixel is multiplied |
151 element-wise with a {\em structuring element} matrix. | 151 element-wise with a {\em structuring element} matrix. |
152 The pixel value is replaced by the maximum or the minimum of the resulting | 152 The pixel value is replaced by the maximum or the minimum of the resulting |
153 matrix, for dilation and erosion respectively. Ten different structuring elements with | 153 matrix, for dilation and erosion respectively. Ten different structuring elements with |
154 increasing dimensions (largest is $5\times5$) were used. For each image, | 154 increasing dimensions (largest is $5\times5$) were used. For each image, |
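The hunk cuts off before the rule for picking a structuring element per image, so the sketch below takes the element as a parameter. A minimal sketch of the dilation/erosion step using SciPy's grey-scale morphology (names ours):

    import numpy as np
    from scipy import ndimage

    def change_thickness(img, dilate, elem):
        # Dilation thickens strokes (pixel-wise max over the neighbourhood),
        # erosion thins them (pixel-wise min), exactly as described above.
        op = ndimage.grey_dilation if dilate else ndimage.grey_erosion
        return op(img, footprint=elem)

    # One possible structuring element; the largest used in the paper is 5x5.
    elem = np.ones((3, 3), dtype=bool)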
190 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}. | 190 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}. |
191 For a square input image, think of drawing a circle of | 191 For a square input image, think of drawing a circle of |
192 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | 192 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to |
193 that disk (region inside circle) will have its value recalculated by taking | 193 that disk (region inside circle) will have its value recalculated by taking |
194 the value of another ``source'' pixel in the original image. The position of | 194 the value of another ``source'' pixel in the original image. The position of |
195 that source pixel is found on the line thats goes through $C$ and $P$, but | 195 that source pixel is found on the line that goes through $C$ and $P$, but |
196 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ | 196 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ |
197 and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times | 197 and $C$. $d_2$ is given by $d_2 = \sin(\frac{\pi{}d_1}{2r})^{-pinch} \times |
198 d_1$, where $pinch$ is a parameter to the filter. | 198 d_1$, where $pinch$ is a parameter to the filter. |
199 The actual value is given by bilinear interpolation considering the pixels | 199 The actual value is given by bilinear interpolation considering the pixels |
200 around the (non-integer) source position thus found. | 200 around the (non-integer) source position thus found. |
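Since the pinch is an inverse warp given entirely by the $d_2$ formula, it can be reproduced with a coordinate map plus bilinear interpolation. A minimal sketch, assuming scipy.ndimage.map_coordinates for the interpolation and nearest-edge handling outside the image (names ours):

    import numpy as np
    from scipy.ndimage import map_coordinates

    def pinch(img, center, r, pinch_param):
        h, w = img.shape
        ys, xs = np.mgrid[0:h, 0:w].astype(float)
        dy, dx = ys - center[0], xs - center[1]
        d1 = np.hypot(dy, dx)
        scale = np.ones_like(d1)            # identity outside the disk
        inside = (d1 > 0) & (d1 < r)
        # d2 = sin(pi*d1/(2r))^(-pinch) * d1, along the ray from C through P
        scale[inside] = np.sin(np.pi * d1[inside] / (2 * r)) ** (-pinch_param)
        src_y = center[0] + dy * scale
        src_x = center[1] + dx * scale
        # order=1 is bilinear interpolation at the (non-integer) source position
        return map_coordinates(img, [src_y, src_x], order=1, mode='nearest')

    # Example: pinch(np.random.rand(32, 32), center=(16, 16), r=16, pinch_param=0.5)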
233 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. | 233 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. |
234 It has a 70\% probability of not being applied at all.\\ | 234 It has a 70\% probability of not being applied at all.\\ |
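A minimal sketch of this noise step, with the two stated numbers (standard deviation complexity/10, 70% skip probability); clipping back to [0, 1] is our assumption:

    import numpy as np

    def add_gaussian_noise(img, complexity, rng):
        if rng.uniform() < 0.7:          # not applied at all 70% of the time
            return img
        # Normal(0, (complexity/10)^2): rng.normal takes the standard deviation
        noise = rng.normal(0.0, complexity / 10.0, size=img.shape)
        return np.clip(img + noise, 0.0, 1.0)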
235 {\bf Background Images.} | 235 {\bf Background Images.} |
236 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random | 236 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random |
237 background behind the letter. The background is chosen by first selecting, | 237 background behind the letter. The background is chosen by first selecting, |
238 at random, an image from a set of images. Then a 32$\times$32 subregion | 238 at random, an image from a set of images. Then a 32$\times$32 sub-region |
239 of that image is chosen as the background image (by sampling position | 239 of that image is chosen as the background image (by sampling position |
240 uniformly while making sure not to cross image borders). | 240 uniformly while making sure not to cross image borders). |
241 To combine the original letter image and the background image, contrast | 241 To combine the original letter image and the background image, contrast |
242 adjustments are made. We first get the maximal values (i.e. maximal | 242 adjustments are made. We first get the maximal values (i.e. maximal |
243 intensity) for both the original image and the background image, $maximage$ | 243 intensity) for both the original image and the background image, $maximage$ |
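The hunk cuts off inside the contrast-adjustment step, so the sketch below covers only the part that is fully specified: choosing a random image and then a 32x32 sub-region uniformly among positions that do not cross its borders (names ours):

    import numpy as np

    def sample_background(images, rng, size=32):
        img = images[rng.integers(len(images))]   # random source image
        h, w = img.shape
        y = rng.integers(0, h - size + 1)         # uniform over valid corners
        x = rng.integers(0, w - size + 1)
        return img[y:y + size, x:x + size]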
250 The number of selected pixels is $0.2 \times complexity$. | 250 The number of selected pixels is $0.2 \times complexity$. |
251 This filter has a 75\% probability of not being applied at all.\\ | 251 This filter has a 75\% probability of not being applied at all.\\ |
252 {\bf Spatially Gaussian Noise.} | 252 {\bf Spatially Gaussian Noise.} |
253 Different regions of the image are spatially smoothed. | 253 Different regions of the image are spatially smoothed. |
254 The image is convolved with a symmetric Gaussian kernel of | 254 The image is convolved with a symmetric Gaussian kernel of |
255 size and variance choosen uniformly in the ranges $[12,12 + 20 \times | 255 size and variance chosen uniformly in the ranges $[12,12 + 20 \times |
256 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized | 256 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized |
257 between $0$ and $1$. We also create a symmetric averaging window of the | 257 between $0$ and $1$. We also create a symmetric averaging window of the |
258 kernel size, with maximum value at the center. For each image we sample | 258 kernel size, with maximum value at the center. For each image we sample |
259 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be | 259 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be |
260 averaging centers between the original image and the filtered one. We | 260 averaging centers between the original image and the filtered one. We |
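This hunk also stops mid-sentence, so the blending below is partly guessed: a minimal sketch that blurs with the sampled kernel variance, then blends original and blurred images around the sampled centers with a window of the kernel size peaked at its center. Folding the sampled size into the filter's sigma and the Gaussian window shape are our simplifications:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def spatially_gaussian_noise(img, complexity, rng):
        size = int(rng.uniform(12, 12 + 20 * complexity))   # kernel size
        var = rng.uniform(2, 2 + 6 * complexity)            # kernel variance
        blurred = gaussian_filter(img, sigma=np.sqrt(var))
        span = blurred.max() - blurred.min()
        blurred = (blurred - blurred.min()) / max(span, 1e-8)  # normalize to [0,1]
        half = min(size // 2, img.shape[0] // 2 - 1)        # keep window inside image
        yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
        window = np.exp(-(yy ** 2 + xx ** 2) / (2.0 * (half / 2.0) ** 2 + 1e-8))
        out = img.copy()
        n = int(rng.integers(3, 3 + int(10 * complexity) + 1))  # averaging centers
        for _ in range(n):
            cy = int(rng.integers(half, img.shape[0] - half))
            cx = int(rng.integers(half, img.shape[1] - half))
            sl = np.s_[cy - half:cy + half + 1, cx - half:cx + half + 1]
            out[sl] = (1 - window) * img[sl] + window * blurred[sl]
        return out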
266 {\bf Scratches.} | 266 {\bf Scratches.} |
267 The scratches module places line-like white patches on the image. The | 267 The scratches module places line-like white patches on the image. The |
268 lines are heavily transformed images of the digit ``1'' (one), chosen | 268 lines are heavily transformed images of the digit ``1'' (one), chosen |
269 at random among five thousand such images. The 1 image is | 269 at random among five thousand such images. The 1 image is |
270 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times | 270 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times |
271 complexity)^2$, using bicubic interpolation, | 271 complexity)^2)$, using bi-cubic interpolation. |
272 Two passes of a greyscale morphological erosion filter | 272 Two passes of a grey-scale morphological erosion filter |
273 are applied, reducing the width of the line | 273 are applied, reducing the width of the line |
274 by an amount controlled by $complexity$. | 274 by an amount controlled by $complexity$. |
275 This filter is applied only 15\% of the time. When it is applied, 50\% | 275 This filter is applied only 15\% of the time. When it is applied, 50\% |
276 of the time, only one patch image is generated and applied. In 30\% of | 276 of the time, only one patch image is generated and applied. In 30\% of |
277 cases, two patches are generated, and otherwise three patches are | 277 cases, two patches are generated, and otherwise three patches are |
278 generated. Each patch is applied by taking the maximal value of the | 278 generated. Each patch is applied by taking the maximal value of the |
279 patch and the original image at each of the $32\times32$ pixel locations.\\ | 279 patch and the original image at each of the $32\times32$ pixel locations.\\ |
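The firing probabilities and the pixel-wise-maximum combination are fully specified, so they are easy to sketch. Preparing the scratch patches themselves (crop, rotate, erode) is omitted; `patches` below are assumed to be ready 32x32 arrays (names ours):

    import numpy as np

    def apply_scratches(img, patches, rng):
        if rng.uniform() >= 0.15:                     # module fires 15% of the time
            return img
        n = rng.choice([1, 2, 3], p=[0.5, 0.3, 0.2])  # one, two or three patches
        out = img
        for _ in range(n):
            patch = patches[rng.integers(len(patches))]
            out = np.maximum(out, patch)              # pixel-wise max, as described
        return out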
280 {\bf Color and Contrast Changes.} | 280 {\bf Color and Contrast Changes.} |
281 This filter changes the constrast and may invert the image polarity (white | 281 This filter changes the contrast and may invert the image polarity (white |
282 on black to black on white). The contrast $C$ is defined here as the | 282 on black to black on white). The contrast $C$ is defined here as the |
283 difference between the maximum and the minimum pixel value of the image. | 283 difference between the maximum and the minimum pixel value of the image. |
284 Contrast $\sim U[1-0.85 \times complexity,1]$ (so constrast $\geq 0.15$). | 284 Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$). |
285 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The | 285 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The |
286 polarity is inverted with $0.5$ probability. | 286 polarity is inverted with $0.5$ probability. |
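The contrast change is fully determined by the formulas above. A minimal sketch, assuming the input is already scaled to [0, 1] (function name ours):

    import numpy as np

    def change_contrast(img, complexity, rng):
        C = rng.uniform(1 - 0.85 * complexity, 1.0)   # contrast, at least 0.15
        lo, hi = (1 - C) / 2.0, 1 - (1 - C) / 2.0     # target range
        span = img.max() - img.min()
        out = lo + (img - img.min()) * (hi - lo) / max(span, 1e-8)
        if rng.uniform() < 0.5:                       # invert polarity
            out = 1.0 - out
        return out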
287 | 287 |
288 | 288 |
289 \begin{figure}[h] | 289 \begin{figure}[h] |
299 | 299 |
300 | 300 |
301 \begin{figure}[h] | 301 \begin{figure}[h] |
302 \resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\ | 302 \resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\ |
303 \caption{Illustration of each transformation applied alone to the same image | 303 \caption{Illustration of each transformation applied alone to the same image |
304 of an upper-case h (top left). First row (from left to rigth) : original image, slant, | 304 of an upper-case H (top left). First row (from left to right): original image, slant, |
305 thickness, affine transformation, local elastic deformation; second row (from left to rigth) : | 305 thickness, affine transformation, local elastic deformation; second row (from left to right): |
306 pinch, motion blur, occlusion, pixel permutation, Gaussian noise; third row (from left to rigth) : | 306 pinch, motion blur, occlusion, pixel permutation, Gaussian noise; third row (from left to right): |
307 background image, salt and pepper noise, spatially Gaussian noise, scratches, | 307 background image, salt and pepper noise, spatially Gaussian noise, scratches, |
308 color and contrast changes.} | 308 color and contrast changes.} |
309 \label{fig:transfo} | 309 \label{fig:transfo} |
310 \end{figure} | 310 \end{figure} |
311 | 311 |
353 %\item | 353 %\item |
354 {\bf Fonts.} | 354 {\bf Fonts.} |
355 In order to have a good variety of sources we downloaded a large number of free fonts from {\tt http://anonymous.url.net} | 355 In order to have a good variety of sources we downloaded a large number of free fonts from {\tt http://anonymous.url.net} |
356 %real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} | 356 %real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} |
357 in addition to Windows 7's. This adds up to a total of $9817$ different fonts from which we can choose uniformly. | 357 in addition to Windows 7's. This adds up to a total of $9817$ different fonts from which we can choose uniformly. |
358 The ttf file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | 358 The {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, |
359 directly as input to our models. | 359 directly as input to our models. |
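Producing an image directly from a ttf file can be done with PIL; a minimal sketch of that path, where the point size and placement are our choices, not the paper's:

    from PIL import Image, ImageDraw, ImageFont

    def render_char(ttf_path, char, size=32):
        font = ImageFont.truetype(ttf_path, 24)   # load the ttf at some point size
        img = Image.new("L", (size, size), 0)     # black 32x32 grey-scale canvas
        ImageDraw.Draw(img).text((4, 2), char, fill=255, font=font)
        return img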
360 | 360 |
361 %\item | 361 %\item |
362 {\bf Captchas.} | 362 {\bf Captchas.} |
363 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for | 363 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for |
364 generating characters of the same format as the NIST dataset. This software is based on | 364 generating characters of the same format as the NIST dataset. This software is based on |
365 a random character class generator and various kinds of tranformations similar to those described in the previous sections. | 365 a random character class generator and various kinds of transformations similar to those described in the previous sections. |
366 In order to increase the variability of the data generated, many different fonts are used for generating the characters. | 366 In order to increase the variability of the data generated, many different fonts are used for generating the characters. |
367 Transformations (slant, distorsions, rotation, translation) are applied to each randomly generated character with a complexity | 367 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity |
368 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are | 368 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are |
369 allowed and can be controlled via an easy-to-use facade class. | 369 allowed and can be controlled via an easy-to-use facade class. |
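As an illustration of what such a facade could look like, here is a hypothetical sketch reusing render_char and slant_image from the sketches above; the class name, the level names, and the level-to-complexity mapping are all ours, since the adapted pycaptcha code is not shown:

    import string
    import numpy as np

    class CaptchaSource:
        LEVELS = {"easy": 0.3, "hard": 0.7}       # hypothetical mapping

        def __init__(self, font_paths, level="easy"):
            self.font_paths = font_paths
            self.complexity = self.LEVELS[level]

        def sample(self, rng):
            # Random class among the 62 characters, random font, then
            # transformations (only slant is shown here).
            char = rng.choice(list(string.ascii_letters + string.digits))
            ttf = self.font_paths[int(rng.integers(len(self.font_paths)))]
            img = np.asarray(render_char(ttf, char), dtype=float) / 255.0
            return slant_image(img, self.complexity, rng), char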
370 | 370 |
371 %\item | 371 %\item |
372 {\bf OCR data.} | 372 {\bf OCR data.} |
373 A large set (2 million) of scanned, OCRed and manually verified machine-printed | 373 A large set (2 million) of scanned, OCRed and manually verified machine-printed |
374 characters (from various documents and books) were included as an | 374 characters (from various documents and books) were included as an |
375 additional source. This set is part of a larger corpus being collected by the Image Understanding | 375 additional source. This set is part of a larger corpus being collected by the Image Understanding |
376 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern | 376 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern |
377 ({\tt http://www.iupr.com}), and which will be publically released. | 377 ({\tt http://www.iupr.com}), and which will be publicly released. |
378 %\end{itemize} | 378 %\end{itemize} |
379 | 379 |
380 \vspace*{-1mm} | 380 \vspace*{-1mm} |
381 \subsection{Data Sets} | 381 \subsection{Data Sets} |
382 \vspace*{-1mm} | 382 \vspace*{-1mm} |
389 {\bf NIST.} This is the raw NIST Special Database 19. | 389 {\bf NIST.} This is the raw NIST Special Database 19. |
390 | 390 |
391 %\item | 391 %\item |
392 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | 392 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources |
393 and sending them through the above transformation pipeline. | 393 and sending them through the above transformation pipeline. |
394 For each new exemple to generate, a source is selected with probability $10\%$ from the fonts, | 394 To generate each new example, a source is selected with probability $10\%$ from the fonts, |
395 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | 395 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the |
396 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$. | 396 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$. |
397 | 397 |
398 %\item | 398 %\item |
399 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) | 399 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) |
400 except that we only apply | 400 except that we only apply |
401 transformations from slant to pinch. Therefore, the character is | 401 transformations from slant to pinch. Therefore, the character is |
402 transformed but no additionnal noise is added to the image, giving images | 402 transformed but no additional noise is added to the image, giving images |
403 closer to the NIST dataset. | 403 closer to the NIST dataset. |
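Putting the two recipes together, a minimal sketch of the per-example generation loop; draw_raw and the module names in SLANT_TO_PINCH are stand-ins for code the paper does not show:

    import numpy as np

    SOURCES, PROBS = ["font", "captcha", "ocr", "nist"], [0.10, 0.25, 0.25, 0.40]
    SLANT_TO_PINCH = {"slant", "thickness", "affine", "elastic", "pinch"}  # assumed names

    def generate_example(rng, draw_raw, pipeline, nistp=False):
        # draw_raw(source, rng) -> (raw image, label); pipeline is a list of
        # (name, fn) pairs standing in for the transformation modules above.
        source = rng.choice(SOURCES, p=PROBS)
        img, label = draw_raw(source, rng)
        for name, fn in pipeline:
            if nistp and name not in SLANT_TO_PINCH:
                continue                               # NISTP: no noise modules
            img = fn(img, rng.uniform(0.0, 0.7), rng)  # complexity ~ U[0, 0.7]
        return img, label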
404 %\end{itemize} | 404 %\end{itemize} |
405 | 405 |
406 \vspace*{-1mm} | 406 \vspace*{-1mm} |
407 \subsection{Models and their Hyperparameters} | 407 \subsection{Models and their Hyperparameters} |
473 %of money to perform tasks for which human intelligence is required. | 473 %of money to perform tasks for which human intelligence is required. |
474 %Mechanical Turk has been used extensively in natural language | 474 %Mechanical Turk has been used extensively in natural language |
475 %processing \citep{SnowEtAl2008} and vision | 475 %processing \citep{SnowEtAl2008} and vision |
476 %\citep{SorokinAndForsyth2008,whitehill09}. | 476 %\citep{SorokinAndForsyth2008,whitehill09}. |
477 AMT users were presented | 477 AMT users were presented |
478 with 10 character images and asked to type 10 corresponding ascii | 478 with 10 character images and asked to type 10 corresponding ASCII |
479 characters. They were forced to make a hard choice among the | 479 characters. They were forced to make a hard choice among the |
480 62 or 10 character classes (all classes or digits only). | 480 62 or 10 character classes (all classes or digits only). |
481 Three users classified each image, allowing us | 481 Three users classified each image, allowing us |
482 to estimate inter-human variability (shown as +/- in parentheses below). | 482 to estimate inter-human variability (shown as +/- in parentheses below). |
483 | 483 |
553 fine-tuned on NIST. | 553 fine-tuned on NIST. |
554 | 554 |
555 Our results show that the MLP benefits marginally from the multi-task setting | 555 Our results show that the MLP benefits marginally from the multi-task setting |
556 in the case of digits (5\% relative improvement) but is actually hurt in the case | 556 in the case of digits (5\% relative improvement) but is actually hurt in the case |
557 of characters (3\% and 4\% worse for lower- and upper-case characters, respectively). | 557 of characters (3\% and 4\% worse for lower- and upper-case characters, respectively). |
558 On the other hand the SDA benefitted from the multi-task setting, with relative | 558 On the other hand the SDA benefited from the multi-task setting, with relative |
559 error rate improvements of 27\%, 15\% and 13\% respectively for digits, | 559 error rate improvements of 27\%, 15\% and 13\% respectively for digits, |
560 lower- and upper-case characters, as shown in Table~\ref{tab:multi-task}. | 560 lower- and upper-case characters, as shown in Table~\ref{tab:multi-task}. |
561 \fi | 561 \fi |
562 | 562 |
563 | 563 |
593 noise, affine transformations, background images) make the resulting | 593 noise, affine transformations, background images) make the resulting |
594 classifier better not only on similarly perturbed images but also on | 594 classifier better not only on similarly perturbed images but also on |
595 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | 595 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} |
596 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 596 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
597 MLPs were helped by perturbed training examples when tested on perturbed input images, | 597 MLPs were helped by perturbed training examples when tested on perturbed input images, |
598 but only marginally helped wrt clean examples. On the other hand, the deep SDAs | 598 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs |
599 were very significantly boosted by these out-of-distribution examples. | 599 were very significantly boosted by these out-of-distribution examples. |
600 | 600 |
601 $\bullet$ %\item | 601 $\bullet$ %\item |
602 Similarly, does the feature learning step in deep learning algorithms benefit more from | 602 Similarly, does the feature learning step in deep learning algorithms benefit more from |
603 training with similar but different classes (i.e. a multi-task learning scenario) than | 603 training with similar but different classes (i.e. a multi-task learning scenario) than |