Mercurial repository: ift6266
comparison: writeup/nips2010_submission.tex @ 469:d02d288257bf
redone bib style
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Sat, 29 May 2010 18:03:37 -0400 |
parents | e0e57270b2af |
children | 2dd6e8962df1 ead3085c1c66 |
468:d48a7777e4d8 | 469:d02d288257bf |
---|---|
3 | 3 |
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} | 4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} |
5 \usepackage{algorithm,algorithmic} | 5 \usepackage{algorithm,algorithmic} |
6 \usepackage[utf8]{inputenc} | 6 \usepackage[utf8]{inputenc} |
7 \usepackage{graphicx,subfigure} | 7 \usepackage{graphicx,subfigure} |
8 \usepackage{mlapa} | 8 \usepackage[numbers]{natbib} |
9 | 9 |
10 \title{Generating and Exploiting Perturbed and Multi-Task Handwritten Training Data for Deep Architectures} | 10 \title{Generating and Exploiting Perturbed and Multi-Task Handwritten Training Data for Deep Architectures} |
11 \author{The IFT6266 Gang} | 11 \author{The IFT6266 Gang} |
12 | 12 |
13 \begin{document} | 13 \begin{document} |
43 \end{abstract} | 43 \end{abstract} |
44 | 44 |
45 \section{Introduction} | 45 \section{Introduction} |
46 | 46 |
47 Deep Learning has emerged as a promising new area of research in | 47 Deep Learning has emerged as a promising new area of research in |
48 statistical machine learning (see~\emcite{Bengio-2009} for a review). | 48 statistical machine learning (see~\citet{Bengio-2009} for a review). |
49 Learning algorithms for deep architectures are centered on the learning | 49 Learning algorithms for deep architectures are centered on the learning |
50 of useful representations of data, which are better suited to the task at hand. | 50 of useful representations of data, which are better suited to the task at hand. |
51 This is in great part inspired by observations of the mammalian visual cortex, | 51 This is in great part inspired by observations of the mammalian visual cortex, |
52 which consists of a chain of processing elements, each of which is associated with a | 52 which consists of a chain of processing elements, each of which is associated with a |
53 different representation. In fact, | 53 different representation. In fact, |
54 it was found recently that the features learnt in deep architectures resemble | 54 it was found recently that the features learnt in deep architectures resemble |
55 those observed in the first two of these stages (in areas V1 and V2 | 55 those observed in the first two of these stages (in areas V1 and V2 |
56 of visual cortex)~\cite{HonglakL2008}. | 56 of visual cortex)~\citep{HonglakL2008}. |
57 Processing images typically involves transforming the raw pixel data into | 57 Processing images typically involves transforming the raw pixel data into |
58 new {\bf representations} that can be used for analysis or classification. | 58 new {\bf representations} that can be used for analysis or classification. |
59 For example, a principal component analysis representation linearly projects | 59 For example, a principal component analysis representation linearly projects |
60 the input image into a lower-dimensional feature space. | 60 the input image into a lower-dimensional feature space. |
61 Why learn a representation? Current practice in the computer vision | 61 Why learn a representation? Current practice in the computer vision |
62 literature converts the raw pixels into a hand-crafted representation | 62 literature converts the raw pixels into a hand-crafted representation |
63 (e.g.\ SIFT features~\cite{Lowe04}), but deep learning algorithms | 63 e.g.\ SIFT features~\citep{Lowe04}, but deep learning algorithms |
64 tend to discover similar features in their first few | 64 tend to discover similar features in their first few |
65 levels~\cite{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}. | 65 levels~\citep{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}. |
66 Learning increases the | 66 Learning increases the |
67 ease and practicality of developing representations that are at once | 67 ease and practicality of developing representations that are at once |
68 tailored to specific tasks, yet are able to borrow statistical strength | 68 tailored to specific tasks, yet are able to borrow statistical strength |
69 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the | 69 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the |
70 feature representation can lead to higher-level (more abstract, more | 70 feature representation can lead to higher-level (more abstract, more |
75 shallow one in terms of representation, depth appears to render the | 75 shallow one in terms of representation, depth appears to render the |
76 training problem more difficult in terms of optimization and local minima. | 76 training problem more difficult in terms of optimization and local minima. |
77 It is also only recently that successful algorithms were proposed to | 77 It is also only recently that successful algorithms were proposed to |
78 overcome some of these difficulties. All are based on unsupervised | 78 overcome some of these difficulties. All are based on unsupervised |
79 learning, often in a greedy layer-wise ``unsupervised pre-training'' | 79 learning, often in a greedy layer-wise ``unsupervised pre-training'' |
80 stage~\cite{Bengio-2009}. One of these layer initialization techniques, | 80 stage~\citep{Bengio-2009}. One of these layer initialization techniques, |
81 applied here, is the Denoising | 81 applied here, is the Denoising |
82 Auto-Encoder~(DAE)~\cite{VincentPLarochelleH2008-very-small}, which | 82 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which |
83 performed as well as or better than previously proposed Restricted Boltzmann | 83 performed as well as or better than previously proposed Restricted Boltzmann |
84 Machines in terms of unsupervised extraction of a hierarchy of features | 84 Machines in terms of unsupervised extraction of a hierarchy of features |
85 useful for classification. The principle is that each layer starting from | 85 useful for classification. The principle is that each layer starting from |
86 the bottom is trained to encode its input (the output of the previous | 86 the bottom is trained to encode its input (the output of the previous |
87 layer) and to reconstruct it from a corrupted version. After this | 87 layer) and to reconstruct it from a corrupted version. After this |
97 \item To what extent does the perturbation of input images (e.g. adding | 97 \item To what extent does the perturbation of input images (e.g. adding |
98 noise, affine transformations, background images) make the resulting | 98 noise, affine transformations, background images) make the resulting |
99 classifier better not only on similarly perturbed images but also on | 99 classifier better not only on similarly perturbed images but also on |
100 the {\em original clean examples}? | 100 the {\em original clean examples}? |
101 \item Do deep architectures benefit more from such {\em out-of-distribution} | 101 \item Do deep architectures benefit more from such {\em out-of-distribution} |
102 examples, i.e. do they benefit more from the self-taught learning~\cite{RainaR2007} framework? | 102 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
103 \item Similarly, does the feature learning step in deep learning algorithms benefit more | 103 \item Similarly, does the feature learning step in deep learning algorithms benefit more |
104 from training with similar but different classes (i.e. a multi-task learning scenario) than | 104 from training with similar but different classes (i.e. a multi-task learning scenario) than |
105 a corresponding shallow and purely supervised architecture? | 105 a corresponding shallow and purely supervised architecture? |
106 \end{enumerate} | 106 \end{enumerate} |
107 The experimental results presented here provide positive evidence towards all of these questions. | 107 The experimental results presented here provide positive evidence towards all of these questions. |
108 | 108 |
109 \section{Perturbation and Transformation of Character Images} | 109 \section{Perturbation and Transformation of Character Images} |
110 | 110 |
111 This section describes the different transformations we used to stochastically | 111 This section describes the different transformations we used to stochastically |
112 transform source images in order to obtain new training examples. More details can | 112 transform source images in order to obtain new training examples. More details can |
113 be found in this technical report~\cite{ift6266-tr-anonymous}. | 113 be found in this technical report~\citep{ift6266-tr-anonymous}. |
114 The code for these transformations (mostly python) is available at | 114 The code for these transformations (mostly python) is available at |
115 {\tt http://anonymous.url.net}. All the modules in the pipeline share | 115 {\tt http://anonymous.url.net}. All the modules in the pipeline share |
116 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 116 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the |
117 amount of deformation or noise introduced. | 117 amount of deformation or noise introduced. |
118 | 118 |
128 and its value is randomly sampled according to the complexity level: | 128 and its value is randomly sampled according to the complexity level: |
129 $slant \sim U[0,complexity]$, so the | 129 $slant \sim U[0,complexity]$, so the |
130 maximum displacement for the lowest or highest pixel line is | 130 maximum displacement for the lowest or highest pixel line is |
131 $round(complexity \times 32)$.\\ | 131 $round(complexity \times 32)$.\\ |
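As an illustration, here is a minimal NumPy sketch of a slant operation consistent with the description above. The function name `slant_image`, the 32x32 shape assumption, the random sign of the lean, and the choice of pivoting the shear around the middle row are our assumptions, not taken from the actual pipeline code.

```python
import numpy as np

def slant_image(image, complexity, rng=np.random):
    """Horizontal shear of a 32x32 grayscale character image.

    Each pixel row is shifted in proportion to its distance from the
    middle row, so the top and bottom rows move the most (magnitude up
    to about round(complexity * 32) pixels when slant = complexity).
    """
    h, w = image.shape
    slant = rng.uniform(0, complexity)        # slant ~ U[0, complexity]
    sign = rng.choice([-1, 1])                # lean left or right (assumed)
    out = np.zeros_like(image)
    for row in range(h):
        # displacement grows linearly from the middle row outward
        shift = int(round(sign * slant * 2 * (row - h / 2.0)))
        src = np.arange(w) - shift
        valid = (src >= 0) & (src < w)
        out[row, valid] = image[row, src[valid]]
    return out
```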
132 {\bf Thickness}\\ | 132 {\bf Thickness}\\ |
133 Morphological operators of dilation and erosion~\cite{Haralick87,Serra82} | 133 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82} |
134 are applied. The neighborhood of each pixel is multiplied | 134 are applied. The neighborhood of each pixel is multiplied |
135 element-wise with a {\em structuring element} matrix. | 135 element-wise with a {\em structuring element} matrix. |
136 The pixel value is replaced by the maximum or the minimum of the resulting | 136 The pixel value is replaced by the maximum or the minimum of the resulting |
137 matrix, respectively for dilation or erosion. Ten different structuring elements with | 137 matrix, respectively for dilation or erosion. Ten different structuring elements with |
138 increasing dimensions (largest is $5\times5$) were used. For each image, | 138 increasing dimensions (largest is $5\times5$) were used. For each image, |
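A simplified sketch of this thickness step, using square structuring elements from SciPy's grey-scale morphology rather than the ten specific elements mentioned above; the element-size sampling and the 50/50 choice between dilation and erosion are assumptions.

```python
import numpy as np
from scipy import ndimage

def change_thickness(image, complexity, rng=np.random):
    """Thicken or thin strokes with grey-scale morphology.

    A square structuring element whose size grows with `complexity`
    (up to 5x5) is applied as either a dilation (thicker strokes) or
    an erosion (thinner strokes).
    """
    size = 1 + int(round(4 * complexity * rng.uniform(0, 1)))   # 1..5
    if size == 1:
        return image                        # degenerate element: no change
    if rng.uniform(0, 1) < 0.5:
        return ndimage.grey_dilation(image, size=(size, size))
    return ndimage.grey_erosion(image, size=(size, size))
```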
154 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times | 154 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times |
155 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 | 155 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 |
156 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times | 156 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times |
157 complexity]$.\\ | 157 complexity]$.\\ |
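The affine step can be sketched with scipy.ndimage.affine_transform. Only the sampling ranges of the six coefficients come from the text; the assignment of coefficients to matrix entries, the interpolation order, and the name `random_affine` are assumptions.

```python
import numpy as np
from scipy import ndimage

def random_affine(image, complexity, rng=np.random):
    """Random affine map with coefficients sampled as in the text.

    The near-identity (scale) coefficients are drawn from U[1-3c, 1+3c],
    the shear terms from U[-3c, 3c], and the translations from
    U[-4c, 4c], where c is the complexity level.
    """
    c = complexity
    a, d = rng.uniform(1 - 3 * c, 1 + 3 * c, size=2)   # diagonal (scale) terms
    b, e = rng.uniform(-3 * c, 3 * c, size=2)          # off-diagonal (shear) terms
    tx, ty = rng.uniform(-4 * c, 4 * c, size=2)        # translations
    matrix = np.array([[a, b],
                       [e, d]])
    # affine_transform maps output coordinates back into the input image
    return ndimage.affine_transform(image, matrix, offset=(tx, ty),
                                    order=1, mode='constant', cval=0.0)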
158 {\bf Local Elastic Deformations}\\ | 158 {\bf Local Elastic Deformations}\\ |
159 This filter induces a ``wiggly'' effect in the image, following~\cite{SimardSP03}, | 159 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03}, |
160 which provides more details. | 160 which provides more details. |
161 Two ``displacement'' fields are generated and applied, for horizontal | 161 Two ``displacement'' fields are generated and applied, for horizontal |
162 and vertical displacements of pixels. | 162 and vertical displacements of pixels. |
163 To generate a pixel in either field, first a value between -1 and 1 is | 163 To generate a pixel in either field, first a value between -1 and 1 is |
164 chosen from a uniform distribution. Then all the pixels, in both fields, are | 164 chosen from a uniform distribution. Then all the pixels, in both fields, are |
169 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | 169 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times |
170 \sqrt[3]{complexity}$.\\ | 170 \sqrt[3]{complexity}$.\\ |
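A compact NumPy/SciPy sketch of this deformation in the standard Simard-style formulation; the interpolation order and border mode are our choices.

```python
import numpy as np
from scipy import ndimage

def elastic_deform(image, complexity, rng=np.random):
    """Local elastic deformation in the spirit of Simard et al. (2003).

    Random displacement fields with values in [-1, 1] are smoothed with
    a Gaussian of width sigma and scaled by alpha, with
    alpha = complexity**(1/3) * 10 and sigma = 10 - 7 * complexity**(1/3).
    """
    alpha = (complexity ** (1.0 / 3.0)) * 10.0
    sigma = 10.0 - 7.0 * (complexity ** (1.0 / 3.0))
    h, w = image.shape
    # raw displacement fields, one per axis, smoothed and scaled
    dx = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.array([ys + dy, xs + dx])
    # bilinear interpolation at the displaced coordinates
    return ndimage.map_coordinates(image, coords, order=1, mode='reflect')
```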
171 {\bf Pinch}\\ | 171 {\bf Pinch}\\ |
172 This GIMP filter is named ``Whirl and | 172 This GIMP filter is named ``Whirl and |
173 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic | 173 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic |
174 surface and pressing or pulling on the center of the surface''~\cite{GIMP-manual}. | 174 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}. |
175 For a square input image, think of drawing a circle of | 175 For a square input image, think of drawing a circle of |
176 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | 176 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to |
177 that disk (region inside circle) will have its value recalculated by taking | 177 that disk (region inside circle) will have its value recalculated by taking |
178 the value of another ``source'' pixel in the original image. The position of | 178 the value of another ``source'' pixel in the original image. The position of |
179 that source pixel is found on the line that goes through $C$ and $P$, but | 179 that source pixel is found on the line that goes through $C$ and $P$, but |
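The radial remapping can be sketched as follows. This does not reproduce the exact GIMP ``Whirl and pinch'' formula, only the idea of re-sampling each pixel inside the disk from a point on the line through C and P; the exponent-based distance rescaling, the radius, and the pinch-amount range are assumptions.

```python
import numpy as np
from scipy import ndimage

def pinch(image, complexity, rng=np.random):
    """Simplified pinch: re-sample pixels inside a disk around the centre.

    Each pixel P inside the disk of radius r around the centre C takes
    its value from a source point on the line through C and P, whose
    distance from C is rescaled according to a pinch amount drawn from
    U[-complexity, complexity].
    """
    h, w = image.shape
    amount = rng.uniform(-complexity, complexity)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(h, w) / 2.0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    dy, dx = ys - cy, xs - cx
    dist = np.sqrt(dy ** 2 + dx ** 2)
    safe = np.maximum(dist, 1e-8)            # avoid 0**negative at the centre
    # inside the disk, rescale the distance to the source pixel; outside, keep it
    scale = np.where(dist < radius, (safe / radius) ** amount, 1.0)
    src_y = cy + dy * scale
    src_x = cx + dx * scale
    return ndimage.map_coordinates(image, [src_y, src_x], order=1, mode='reflect')
```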
196 images and places it over the original {\em occluded} character | 196 images and places it over the original {\em occluded} character |
197 image. Pixels are combined by taking the max(occluder,occluded), | 197 image. Pixels are combined by taking the max(occluder,occluded), |
198 i.e. the value closer to black. The corners of the occluder rectangle | 198 i.e. the value closer to black. The corners of the occluder rectangle |
199 are sampled so that larger complexity gives larger rectangles. | 199 are sampled so that larger complexity gives larger rectangles. |
200 The destination position in the occluded image is also sampled | 200 The destination position in the occluded image is also sampled |
201 according to a normal distribution (see more details in~\cite{ift6266-tr-anonymous}. | 201 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}). |
202 It has a 60\% probability of not being applied at all.\\ | 202 It has a 60\% probability of not being applied at all.\\ |
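A rough sketch of the occlusion step. Only the max-combination rule and the 60% skip probability come from the text; the patch-size sampling, the normal-distribution parameters for the destination, and the name `add_occlusion` are assumptions.

```python
import numpy as np

def add_occlusion(image, occluder_source, complexity, rng=np.random):
    """Paste a rectangular patch taken from another character over `image`.

    The patch size grows with complexity; pixels are combined with
    max(occluder, occluded).  Applied with probability 40% only.
    """
    if rng.uniform(0, 1) < 0.6:              # 60% chance of doing nothing
        return image
    h, w = image.shape
    # occluder rectangle: larger complexity allows a larger patch (assumed)
    ph = max(1, int(round(rng.uniform(0, complexity) * h)))
    pw = max(1, int(round(rng.uniform(0, complexity) * w)))
    sy = rng.randint(0, h - ph + 1)
    sx = rng.randint(0, w - pw + 1)
    patch = occluder_source[sy:sy + ph, sx:sx + pw]
    # destination sampled around the image centre (normal distribution, assumed spread)
    dy = int(np.clip(rng.normal(h / 2, h / 4), 0, h - ph))
    dx = int(np.clip(rng.normal(w / 2, w / 4), 0, w - pw))
    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = np.maximum(out[dy:dy + ph, dx:dx + pw], patch)
    return out
```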
203 {\bf Pixel Permutation}\\ | 203 {\bf Pixel Permutation}\\ |
204 This filter permutes neighbouring pixels. It first selects a fraction | 204 This filter permutes neighbouring pixels. It first selects a fraction |
205 $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then | 205 $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then |
206 sequentially exchanged with another pixel in its $V4$ neighbourhood. Number | 206 sequentially exchanged with another pixel in its $V4$ neighbourhood. Number |
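A sketch of the pixel-permutation filter, assuming that $\frac{complexity}{3}$ denotes the fraction of pixels to move; the function name and the clipping at the image border are our choices.

```python
import numpy as np

def permute_pixels(image, complexity, rng=np.random):
    """Swap a fraction of pixels with one of their 4-connected neighbours.

    A fraction (assumed complexity / 3) of pixel positions is chosen at
    random; each is exchanged with a randomly chosen neighbour above,
    below, to the left or to the right.
    """
    h, w = image.shape
    out = image.copy()
    n_swaps = int(round((complexity / 3.0) * h * w))
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # V4 neighbourhood
    for _ in range(n_swaps):
        y = rng.randint(0, h)
        x = rng.randint(0, w)
        dy, dx = offsets[rng.randint(0, 4)]
        ny, nx = np.clip(y + dy, 0, h - 1), np.clip(x + dx, 0, w - 1)
        out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
```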
210 {\bf Gaussian Noise}\\ | 210 {\bf Gaussian Noise}\\ |
211 This filter simply adds, to each pixel of the image independently, a | 211 This filter simply adds, to each pixel of the image independently, a |
212 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. | 212 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. |
213 It has a 70\% probability of not being applied at all.\\ | 213 It has a 70\% probability of not being applied at all.\\ |
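For instance, a minimal sketch of this noise step; pixel values are assumed to lie in [0, 1] and the clipping is our addition.

```python
import numpy as np

def add_gaussian_noise(image, complexity, rng=np.random):
    """Add independent per-pixel noise ~ Normal(0, (complexity/10)^2).

    Applied with probability 30% only (the text says it is skipped 70%
    of the time).
    """
    if rng.uniform(0, 1) < 0.7:
        return image
    noise = rng.normal(0.0, complexity / 10.0, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)
```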
214 {\bf Background Images}\\ | 214 {\bf Background Images}\\ |
215 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random | 215 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random |
216 background behind the letter. The background is chosen by first selecting, | 216 background behind the letter. The background is chosen by first selecting, |
217 at random, an image from a set of images. Then a 32$\times$32 subregion | 217 at random, an image from a set of images. Then a 32$\times$32 subregion |
218 of that image is chosen as the background image (by sampling position | 218 of that image is chosen as the background image (by sampling position |
219 uniformly while making sure not to cross image borders). | 219 uniformly while making sure not to cross image borders). |
220 To combine the original letter image and the background image, contrast | 220 To combine the original letter image and the background image, contrast |
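A minimal sketch of the background step. The contrast-based combination described in the remainder of this paragraph (omitted from this hunk) is not reproduced; a simple max-composite is used instead, and `background_pool` is a hypothetical list of grayscale image arrays assumed to be larger than 32x32.

```python
import numpy as np

def add_background(char_image, background_pool, rng=np.random):
    """Place the character over a random 32x32 crop of a natural image.

    A background image is drawn from `background_pool`, a 32x32
    sub-window is cropped at a uniformly random position that stays
    inside the image borders, and the character is composited on top
    by taking the pixel-wise maximum (an assumption).
    """
    h, w = char_image.shape                  # expected 32 x 32
    bg = background_pool[rng.randint(0, len(background_pool))]
    # sample a crop position that does not cross the background's borders
    y = rng.randint(0, bg.shape[0] - h + 1)
    x = rng.randint(0, bg.shape[1] - w + 1)
    crop = bg[y:y + h, x:x + w]
    return np.maximum(char_image, crop)
```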
365 The other parameters are discarded. | 365 The other parameters are discarded. |
366 | 366 |
367 The stacked version is an adaptation to deep MLPs in which each layer is initialized with a denoising auto-encoder, starting from the bottom. | 367 The stacked version is an adaptation to deep MLPs in which each layer is initialized with a denoising auto-encoder, starting from the bottom. |
368 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. | 368 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. |
369 The second and following layers receive the same treatment except that they take as input the encoded version of the data produced by the layers below them. | 369 The second and following layers receive the same treatment except that they take as input the encoded version of the data produced by the layers below them. |
370 For additional details see \cite{vincent:icml08}. | 370 For additional details see \citet{vincent:icml08}. |
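A toy NumPy sketch of this greedy layer-wise pre-training procedure. Squared-error reconstruction, tied weights, zero-masking corruption and plain per-example SGD are simplifications chosen to keep the sketch short; the actual experiments rely on a full implementation as described in the reference above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(data, n_hidden, corruption=0.25, lr=0.1, epochs=10,
                   rng=np.random):
    """Train one denoising auto-encoder layer on `data` (n_examples x n_in).

    The input is corrupted by zeroing a random fraction of entries; the
    layer learns to reconstruct the clean input (squared error, tied
    weights).  Returns the learned weights and hidden bias.
    """
    n_in = data.shape[1]
    W = rng.normal(0, 0.01, (n_in, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_in)
    for _ in range(epochs):
        for x in data:
            x_tilde = x * (rng.uniform(size=n_in) > corruption)   # corrupt
            h = sigmoid(x_tilde @ W + b_h)                         # encode
            x_hat = sigmoid(h @ W.T + b_v)                         # decode
            # gradients of 0.5 * ||x_hat - x||^2 with tied weights
            d_out = (x_hat - x) * x_hat * (1 - x_hat)
            d_hid = (d_out @ W) * h * (1 - h)
            W -= lr * (np.outer(x_tilde, d_hid) + np.outer(d_out, h))
            b_h -= lr * d_hid
            b_v -= lr * d_out
    return W, b_h

def greedy_pretrain(data, layer_sizes, rng=np.random):
    """Stacked pre-training: each layer is trained on the encoded output
    of the layers below it, starting from the raw input."""
    params, rep = [], data
    for n_hidden in layer_sizes:
        W, b_h = pretrain_layer(rep, n_hidden, rng=rng)
        params.append((W, b_h))
        rep = sigmoid(rep @ W + b_h)      # feed the encoded data to the next layer
    return params
```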
371 | 371 |
372 \section{Experimental Results} | 372 \section{Experimental Results} |
373 | 373 |
374 \subsection{SDA vs MLP vs Humans} | 374 \subsection{SDA vs MLP vs Humans} |
375 | 375 |
377 the best SDA (again according to validation set error), along with a precise estimate | 377 the best SDA (again according to validation set error), along with a precise estimate |
378 of human performance obtained via Amazon's Mechanical Turk (AMT) | 378 of human performance obtained via Amazon's Mechanical Turk (AMT) |
379 service\footnote{http://mturk.com}. AMT users are paid small amounts | 379 service\footnote{http://mturk.com}. AMT users are paid small amounts |
380 of money to perform tasks for which human intelligence is required. | 380 of money to perform tasks for which human intelligence is required. |
381 Mechanical Turk has been used extensively in natural language | 381 Mechanical Turk has been used extensively in natural language |
382 processing \cite{SnowEtAl2008} and vision | 382 processing \citep{SnowEtAl2008} and vision |
383 \cite{SorokinAndForsyth2008,whitehill09}. AMT users were presented | 383 \citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented |
384 with 10 character images and asked to type 10 corresponding ASCII | 384 with 10 character images and asked to type 10 corresponding ASCII |
385 characters. Hence they were forced to make a hard choice among the | 385 characters. Hence they were forced to make a hard choice among the |
386 62 character classes. Three users classified each image, allowing | 386 62 character classes. Three users classified each image, allowing |
387 to estimate inter-human variability (shown as +/- in parenthesis below). | 387 to estimate inter-human variability (shown as +/- in parenthesis below). |
388 | 388 |
406 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline | 406 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline |
407 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline | 407 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline |
408 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline | 408 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline |
409 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline | 409 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline |
410 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline | 410 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline |
411 \cite{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline | 411 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline |
412 \cite{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline | 412 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline |
413 \cite{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline | 413 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline |
414 \cite{Migram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline | 414 \citep{Migram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline |
415 \end{tabular} | 415 \end{tabular} |
416 \end{center} | 416 \end{center} |
417 \end{table} | 417 \end{table} |
418 | 418 |
419 \subsection{Perturbed Training Data More Helpful for SDAE} | 419 \subsection{Perturbed Training Data More Helpful for SDAE} |
425 given test set (the first 3 columns are on the 62-class tasks and the last one is | 425 given test set (the first 3 columns are on the 62-class tasks and the last one is |
426 on the clean 10-class digits). Clearly, the deep learning models did benefit more | 426 on the clean 10-class digits). Clearly, the deep learning models did benefit more |
427 from perturbed training data, even when testing on clean data, whereas the MLP | 427 from perturbed training data, even when testing on clean data, whereas the MLP |
428 trained on perturbed data performed worse on the clean digits and about the same | 428 trained on perturbed data performed worse on the clean digits and about the same |
429 on the clean characters. } | 429 on the clean characters. } |
430 \label{tab:sda-vs-mlp-vs-humans} | 430 \label{tab:perturbation-effect} |
431 \begin{center} | 431 \begin{center} |
432 \begin{tabular}{|l|r|r|r|r|} \hline | 432 \begin{tabular}{|l|r|r|r|r|} \hline |
433 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline | 433 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline |
434 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline | 434 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline |
435 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline | 435 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline |
488 \end{table} | 488 \end{table} |
489 | 489 |
490 \section{Conclusions} | 490 \section{Conclusions} |
491 | 491 |
492 \bibliography{strings,ml,aigaion,specials} | 492 \bibliography{strings,ml,aigaion,specials} |
493 \bibliographystyle{mlapa} | 493 %\bibliographystyle{plainnat} |
 | 494 \bibliographystyle{unsrtnat} |
494 %\bibliographystyle{apalike} | 495 %\bibliographystyle{apalike} |
495 | 496 |
496 \end{document} | 497 \end{document} |