comparison writeup/nips2010_submission.tex @ 469:d02d288257bf

redone bib style
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 29 May 2010 18:03:37 -0400
parents e0e57270b2af
children 2dd6e8962df1 ead3085c1c66
comparison
equal deleted inserted replaced
468:d48a7777e4d8 469:d02d288257bf
3 3
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} 4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm}
5 \usepackage{algorithm,algorithmic} 5 \usepackage{algorithm,algorithmic}
6 \usepackage[utf8]{inputenc} 6 \usepackage[utf8]{inputenc}
7 \usepackage{graphicx,subfigure} 7 \usepackage{graphicx,subfigure}
8 \usepackage{mlapa} 8 \usepackage[numbers]{natbib}
9 9
10 \title{Generating and Exploiting Perturbed and Multi-Task Handwritten Training Data for Deep Architectures} 10 \title{Generating and Exploiting Perturbed and Multi-Task Handwritten Training Data for Deep Architectures}
11 \author{The IFT6266 Gang} 11 \author{The IFT6266 Gang}
12 12
13 \begin{document} 13 \begin{document}
43 \end{abstract} 43 \end{abstract}
44 44
45 \section{Introduction} 45 \section{Introduction}
46 46
47 Deep Learning has emerged as a promising new area of research in 47 Deep Learning has emerged as a promising new area of research in
48 statistical machine learning (see~\emcite{Bengio-2009} for a review). 48 statistical machine learning (see~\citet{Bengio-2009} for a review).
49 Learning algorithms for deep architectures are centered on the learning 49 Learning algorithms for deep architectures are centered on the learning
50 of useful representations of data, which are better suited to the task at hand. 50 of useful representations of data, which are better suited to the task at hand.
51 This is in great part inspired by observations of the mammalian visual cortex, 51 This is in great part inspired by observations of the mammalian visual cortex,
52 which consists of a chain of processing elements, each of which is associated with a 52 which consists of a chain of processing elements, each of which is associated with a
53 different representation. In fact, 53 different representation. In fact,
54 it was found recently that the features learnt in deep architectures resemble 54 it was found recently that the features learnt in deep architectures resemble
55 those observed in the first two of these stages (in areas V1 and V2 55 those observed in the first two of these stages (in areas V1 and V2
56 of visual cortex)~\cite{HonglakL2008}. 56 of visual cortex)~\citep{HonglakL2008}.
57 Processing images typically involves transforming the raw pixel data into 57 Processing images typically involves transforming the raw pixel data into
58 new {\bf representations} that can be used for analysis or classification. 58 new {\bf representations} that can be used for analysis or classification.
59 For example, a principal component analysis representation linearly projects 59 For example, a principal component analysis representation linearly projects
60 the input image into a lower-dimensional feature space. 60 the input image into a lower-dimensional feature space.
61 Why learn a representation? Current practice in the computer vision 61 Why learn a representation? Current practice in the computer vision
62 literature converts the raw pixels into a hand-crafted representation 62 literature converts the raw pixels into a hand-crafted representation
63 (e.g.\ SIFT features~\cite{Lowe04}), but deep learning algorithms 63 (e.g.\ SIFT features~\citep{Lowe04}), but deep learning algorithms
64 tend to discover similar features in their first few 64 tend to discover similar features in their first few
65 levels~\cite{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}. 65 levels~\citep{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}.
66 Learning increases the 66 Learning increases the
67 ease and practicality of developing representations that are at once 67 ease and practicality of developing representations that are at once
68 tailored to specific tasks, yet are able to borrow statistical strength 68 tailored to specific tasks, yet are able to borrow statistical strength
69 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the 69 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
70 feature representation can lead to higher-level (more abstract, more 70 feature representation can lead to higher-level (more abstract, more
75 shallow one in terms of representation, depth appears to render the 75 shallow one in terms of representation, depth appears to render the
76 training problem more difficult in terms of optimization and local minima. 76 training problem more difficult in terms of optimization and local minima.
77 It is also only recently that successful algorithms were proposed to 77 It is also only recently that successful algorithms were proposed to
78 overcome some of these difficulties. All are based on unsupervised 78 overcome some of these difficulties. All are based on unsupervised
79 learning, often in a greedy layer-wise ``unsupervised pre-training'' 79 learning, often in a greedy layer-wise ``unsupervised pre-training''
80 stage~\cite{Bengio-2009}. One of these layer initialization techniques, 80 stage~\citep{Bengio-2009}. One of these layer initialization techniques,
81 applied here, is the Denoising 81 applied here, is the Denoising
82 Auto-Encoder~(DAE)~\cite{VincentPLarochelleH2008-very-small}, which 82 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which
83 performed similarly or better than previously proposed Restricted Boltzmann 83 performed similarly or better than previously proposed Restricted Boltzmann
84 Machines in terms of unsupervised extraction of a hierarchy of features 84 Machines in terms of unsupervised extraction of a hierarchy of features
85 useful for classification. The principle is that each layer, starting from 85 useful for classification. The principle is that each layer, starting from
86 the bottom, is trained to encode its input (the output of the previous 86 the bottom, is trained to encode its input (the output of the previous
87 layer) and to reconstruct it from a corrupted version of it. After this 87 layer) and to reconstruct it from a corrupted version of it. After this
97 \item To what extent does the perturbation of input images (e.g. adding 97 \item To what extent does the perturbation of input images (e.g. adding
98 noise, affine transformations, background images) make the resulting 98 noise, affine transformations, background images) make the resulting
99 classifier better not only on similarly perturbed images but also on 99 classifier better not only on similarly perturbed images but also on
100 the {\em original clean examples}? 100 the {\em original clean examples}?
101 \item Do deep architectures benefit more from such {\em out-of-distribution} 101 \item Do deep architectures benefit more from such {\em out-of-distribution}
102 examples, i.e. do they benefit more from the self-taught learning~\cite{RainaR2007} framework? 102 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
103 \item Similarly, does the feature learning step in deep learning algorithms benefit more 103 \item Similarly, does the feature learning step in deep learning algorithms benefit more
104 from training with similar but different classes (i.e. a multi-task learning scenario) than 104 from training with similar but different classes (i.e. a multi-task learning scenario) than
105 a corresponding shallow and purely supervised architecture? 105 a corresponding shallow and purely supervised architecture?
106 \end{enumerate} 106 \end{enumerate}
107 The experimental results presented here provide positive evidence towards all of these questions. 107 The experimental results presented here provide positive evidence towards all of these questions.
108 108
109 \section{Perturbation and Transformation of Character Images} 109 \section{Perturbation and Transformation of Character Images}
110 110
111 This section describes the different transformations we used to stochastically 111 This section describes the different transformations we used to stochastically
112 transform source images in order to obtain data. More details can 112 transform source images in order to obtain data. More details can
113 be found in this technical report~\cite{ift6266-tr-anonymous}. 113 be found in this technical report~\citep{ift6266-tr-anonymous}.
114 The code for these transformations (mostly python) is available at 114 The code for these transformations (mostly python) is available at
115 {\tt http://anonymous.url.net}. All the modules in the pipeline share 115 {\tt http://anonymous.url.net}. All the modules in the pipeline share
116 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the 116 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
117 amount of deformation or noise introduced. 117 amount of deformation or noise introduced.
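As a rough illustration of how such a shared control parameter can be threaded through the pipeline, the following Python sketch shows one possible interface; the module names and signatures are hypothetical, not the actual anonymized code.
\begin{verbatim}
import numpy

def apply_pipeline(image, modules, complexity, rng=numpy.random):
    """Apply each transformation module in turn to a 32x32 image.

    Hypothetical interface: each module exposes transform(image,
    complexity, rng) and internally decides how much deformation or
    noise to introduce for the given complexity in [0, 1].
    """
    assert 0.0 <= complexity <= 1.0
    out = image
    for module in modules:
        out = module.transform(out, complexity, rng)
    return out
\end{verbatim}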
118 118
128 and its value is randomly sampled according to the complexity level: 128 and its value is randomly sampled according to the complexity level:
129 $slant \sim U[0,complexity]$, so the 129 $slant \sim U[0,complexity]$, so the
130 maximum displacement for the lowest or highest pixel line is 130 maximum displacement for the lowest or highest pixel line is
131 $round(complexity \times 32)$.\\ 131 $round(complexity \times 32)$.\\
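A minimal Python sketch of such a slant, assuming a $32\times32$ grey-scale array with background value 0 and rows shifted proportionally to their index (the sign convention and border handling are assumptions, not taken from the report):
\begin{verbatim}
import numpy

def slant(image, complexity, rng=numpy.random):
    """Shear a 32x32 image horizontally with slant ~ U[0, complexity]."""
    s = rng.uniform(0.0, complexity)
    h, w = image.shape
    out = numpy.zeros_like(image)          # background assumed to be 0
    for row in range(h):
        shift = int(round(s * row))        # largest shift ~ round(s * 32)
        if shift < w:
            out[row, shift:] = image[row, :w - shift]
    return out
\end{verbatim}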
132 {\bf Thickness}\\ 132 {\bf Thickness}\\
133 Morphological operators of dilation and erosion~\cite{Haralick87,Serra82} 133 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
134 are applied. The neighborhood of each pixel is multiplied 134 are applied. The neighborhood of each pixel is multiplied
135 element-wise with a {\em structuring element} matrix. 135 element-wise with a {\em structuring element} matrix.
136 The pixel value is replaced by the maximum or the minimum of the resulting 136 The pixel value is replaced by the maximum or the minimum of the resulting
137 matrix, respectively for dilation or erosion. Ten different structuring elements with 137 matrix, respectively for dilation or erosion. Ten different structuring elements with
138 increasing dimensions (largest is $5\times5$) were used. For each image, 138 increasing dimensions (largest is $5\times5$) were used. For each image,
154 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times 154 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
155 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 155 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
156 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times 156 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
157 complexity]$.\\ 157 complexity]$.\\
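A sketch of this sampling in Python, using PIL's affine transform as an example backend; the mapping of $(a,\dots,f)$ onto PIL's coefficient order, the interpolation and the fill value are assumptions rather than details from the paper.
\begin{verbatim}
import numpy
from PIL import Image

def random_affine(image, complexity, rng=numpy.random):
    """Sample (a..f) as in the text and warp a [0,1] greyscale array."""
    a, d = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity, size=2)
    b, e = rng.uniform(-3 * complexity, 3 * complexity, size=2)
    c, f = rng.uniform(-4 * complexity, 4 * complexity, size=2)
    pil = Image.fromarray((image * 255).astype(numpy.uint8))
    # PIL maps output pixel (x, y) to input (a*x + b*y + c, d*x + e*y + f)
    warped = pil.transform(pil.size, Image.AFFINE, (a, b, c, d, e, f))
    return numpy.asarray(warped, dtype=numpy.float32) / 255.0
\end{verbatim}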
158 {\bf Local Elastic Deformations}\\ 158 {\bf Local Elastic Deformations}\\
159 This filter induces a ``wiggly'' effect in the image, following~\cite{SimardSP03}, 159 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03},
160 which provides more details. 160 which provides more details.
161 Two "displacements" fields are generated and applied, for horizontal 161 Two "displacements" fields are generated and applied, for horizontal
162 and vertical displacements of pixels. 162 and vertical displacements of pixels.
163 To generate a pixel in either field, first a value between -1 and 1 is 163 To generate a pixel in either field, first a value between -1 and 1 is
164 chosen from a uniform distribution. Then all the pixels, in both fields, are 164 chosen from a uniform distribution. Then all the pixels, in both fields, are
169 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times 169 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
170 \sqrt[3]{complexity}$.\\ 170 \sqrt[3]{complexity}$.\\
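A minimal Python sketch of the procedure in the spirit of \citet{SimardSP03}; the interpolation order and boundary mode below are assumptions.
\begin{verbatim}
import numpy
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, complexity, rng=numpy.random):
    """Smooth random displacement fields, scaled by alpha, move pixels."""
    alpha = 10.0 * complexity ** (1.0 / 3.0)
    sigma = 10.0 - 7.0 * complexity ** (1.0 / 3.0)
    dx = alpha * gaussian_filter(rng.uniform(-1, 1, image.shape), sigma)
    dy = alpha * gaussian_filter(rng.uniform(-1, 1, image.shape), sigma)
    rows, cols = numpy.meshgrid(numpy.arange(image.shape[0]),
                                numpy.arange(image.shape[1]),
                                indexing='ij')
    coords = numpy.array([rows + dy, cols + dx])
    return map_coordinates(image, coords, order=1, mode='constant')
\end{verbatim}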
171 {\bf Pinch}\\ 171 {\bf Pinch}\\
172 This GIMP filter is named ``Whirl and 172 This GIMP filter is named ``Whirl and
173 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic 173 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
174 surface and pressing or pulling on the center of the surface''~\cite{GIMP-manual}. 174 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
175 For a square input image, think of drawing a circle of 175 For a square input image, think of drawing a circle of
176 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to 176 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
177 that disk (region inside circle) will have its value recalculated by taking 177 that disk (region inside circle) will have its value recalculated by taking
178 the value of another ``source'' pixel in the original image. The position of 178 the value of another ``source'' pixel in the original image. The position of
179 that source pixel is found on the line that goes through $C$ and $P$, but 179 that source pixel is found on the line that goes through $C$ and $P$, but
196 images and places it over the original {\em occluded} character 196 images and places it over the original {\em occluded} character
197 image. Pixels are combined by taking $\max(occluder, occluded)$, 197 image. Pixels are combined by taking $\max(occluder, occluded)$,
198 i.e. the value closer to black. The corners of the occluder rectangle 198 i.e. the value closer to black. The corners of the occluder rectangle
199 are sampled so that larger complexity gives larger rectangles. 199 are sampled so that larger complexity gives larger rectangles.
200 The destination position in the occluded image is also sampled 200 The destination position in the occluded image is also sampled
201 according to a normal distribution (see more details in~\cite{ift6266-tr-anonymous}. 201 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
202 It has a probability of 60\% of not being applied at all.\\ 202 It has a probability of 60\% of not being applied at all.\\
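The following Python sketch conveys the idea; the exact size and position distributions are in the technical report, so the ones used here are placeholders.
\begin{verbatim}
import numpy

def occlude(image, occluder, complexity, rng=numpy.random):
    """Paste part of another character over `image`, dark pixels winning."""
    if rng.uniform() < 0.6:                     # not applied 60% of the time
        return image
    h, w = image.shape
    rh = max(1, int(round(rng.uniform(0, complexity) * h)))  # placeholder
    rw = max(1, int(round(rng.uniform(0, complexity) * w)))  # placeholder
    top = rng.randint(0, h - rh + 1)
    left = rng.randint(0, w - rw + 1)
    patch = occluder[:rh, :rw]
    out = image.copy()
    region = out[top:top + rh, left:left + rw]
    out[top:top + rh, left:left + rw] = numpy.maximum(region, patch)
    return out
\end{verbatim}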
203 {\bf Pixel Permutation}\\ 203 {\bf Pixel Permutation}\\
204 This filter permutes neighbouring pixels. It first selects a fraction 204 This filter permutes neighbouring pixels. It first selects a fraction
205 $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then 205 $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then
206 sequentially exchanged with another pixel in its $V4$ neighbourhood. Number 206 sequentially exchanged with another pixel in its $V4$ neighbourhood. Number
210 {\bf Gaussian Noise}\\ 210 {\bf Gaussian Noise}\\
211 This filter simply adds, to each pixel of the image independently, a 211 This filter simply adds, to each pixel of the image independently, a
212 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. 212 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
213 It has a probability of 70\% of not being applied at all.\\ 213 It has a probability of 70\% of not being applied at all.\\
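In code this is essentially a one-liner; a sketch follows, where clipping to the assumed $[0,1]$ pixel range is an assumption.
\begin{verbatim}
import numpy

def gaussian_noise(image, complexity, rng=numpy.random):
    """Add i.i.d. N(0, (complexity/10)^2) noise to each pixel."""
    if rng.uniform() < 0.7:                 # not applied 70% of the time
        return image
    noisy = image + rng.normal(0.0, complexity / 10.0, size=image.shape)
    return numpy.clip(noisy, 0.0, 1.0)      # assumed pixel range [0, 1]
\end{verbatim}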
214 {\bf Background Images}\\ 214 {\bf Background Images}\\
215 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random 215 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
216 background behind the letter. The background is chosen by first selecting, 216 background behind the letter. The background is chosen by first selecting,
217 at random, an image from a set of images. Then a 32$\times$32 subregion 217 at random, an image from a set of images. Then a 32$\times$32 subregion
218 of that image is chosen as the background image (by sampling position 218 of that image is chosen as the background image (by sampling position
219 uniformly while making sure not to cross image borders). 219 uniformly while making sure not to cross image borders).
220 To combine the original letter image and the background image, contrast 220 To combine the original letter image and the background image, contrast
365 The other parameters are discarded. 365 The other parameters are discarded.
366 366
367 The stacked version is an adaptation to deep MLPs in which each layer is initialized with a denoising auto-encoder, starting from the bottom. 367 The stacked version is an adaptation to deep MLPs in which each layer is initialized with a denoising auto-encoder, starting from the bottom.
368 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. 368 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder.
369 The second and following layers receive the same treatment, except that they take as input the encoded version of the data produced by the layers below them. 369 The second and following layers receive the same treatment, except that they take as input the encoded version of the data produced by the layers below them.
370 For additional details see \cite{vincent:icml08}. 370 For additional details see \citet{vincent:icml08}.
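The greedy layer-wise procedure can be summarized by the following Python sketch; \texttt{train\_dae} stands for training a single denoising auto-encoder (in the experiments, by minibatch stochastic gradient descent on the reconstruction of a corrupted input) and is assumed rather than given here.
\begin{verbatim}
def pretrain_stack(data, layer_sizes, corruption_level, train_dae):
    """Greedy layer-wise pre-training of stacked denoising auto-encoders.

    train_dae(inputs, n_hidden, corruption_level) is assumed to train one
    denoising auto-encoder and return (encode_fn, params).  The collected
    params are then used to initialize the corresponding layers of a deep
    MLP, which is finally fine-tuned with supervised gradient descent.
    """
    layers = []
    representation = data
    for n_hidden in layer_sizes:
        encode, params = train_dae(representation, n_hidden,
                                   corruption_level)
        layers.append(params)
        representation = encode(representation)  # input to the next layer
    return layers
\end{verbatim}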
371 371
372 \section{Experimental Results} 372 \section{Experimental Results}
373 373
374 \subsection{SDA vs MLP vs Humans} 374 \subsection{SDA vs MLP vs Humans}
375 375
377 the best SDA (again according to validation set error), along with a precise estimate 377 the best SDA (again according to validation set error), along with a precise estimate
378 of human performance obtained via Amazon's Mechanical Turk (AMT) 378 of human performance obtained via Amazon's Mechanical Turk (AMT)
379 service\footnote{http://mturk.com}. AMT users are paid small amounts 379 service\footnote{http://mturk.com}. AMT users are paid small amounts
380 of money to perform tasks for which human intelligence is required. 380 of money to perform tasks for which human intelligence is required.
381 Mechanical Turk has been used extensively in natural language 381 Mechanical Turk has been used extensively in natural language
382 processing \cite{SnowEtAl2008} and vision 382 processing \citep{SnowEtAl2008} and vision
383 \cite{SorokinAndForsyth2008,whitehill09}. AMT users were presented 383 \citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented
384 with 10 character images and asked to type 10 corresponding ASCII 384 with 10 character images and asked to type 10 corresponding ASCII
385 characters. Hence they were forced to make a hard choice among the 385 characters. Hence they were forced to make a hard choice among the
386 62 character classes. Three users classified each image, allowing 386 62 character classes. Three users classified each image, allowing
387 us to estimate inter-human variability (shown as +/- in parentheses below). 387 us to estimate inter-human variability (shown as +/- in parentheses below).
388 388
406 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline 406 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline
407 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline 407 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline
408 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline 408 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
409 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline 409 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline
410 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline 410 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline
411 \cite{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline 411 \citep{Granger+al-2007} & & & & 4.95\% $\pm$.18\% \\ \hline
412 \cite{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline 412 \citep{Cortes+al-2000} & & & & 3.71\% $\pm$.16\% \\ \hline
413 \cite{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline 413 \citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
414 \cite{Migram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline 414 \citep{Migram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline
415 \end{tabular} 415 \end{tabular}
416 \end{center} 416 \end{center}
417 \end{table} 417 \end{table}
418 418
419 \subsection{Perturbed Training Data More Helpful for SDAE} 419 \subsection{Perturbed Training Data More Helpful for SDAE}
425 given test set (the first 3 columns are on the 62-class tasks and the last one is 425 given test set (the first 3 columns are on the 62-class tasks and the last one is
426 on the clean 10-class digits). Clearly, the deep learning models did benefit more 426 on the clean 10-class digits). Clearly, the deep learning models did benefit more
427 from perturbed training data, even when testing on clean data, whereas the MLP 427 from perturbed training data, even when testing on clean data, whereas the MLP
428 trained on perturbed data performed worse on the clean digits and about the same 428 trained on perturbed data performed worse on the clean digits and about the same
429 on the clean characters. } 429 on the clean characters. }
430 \label{tab:sda-vs-mlp-vs-humans} 430 \label{tab:perturbation-effect}
431 \begin{center} 431 \begin{center}
432 \begin{tabular}{|l|r|r|r|r|} \hline 432 \begin{tabular}{|l|r|r|r|r|} \hline
433 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline 433 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
434 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline 434 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline
435 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline 435 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline
488 \end{table} 488 \end{table}
489 489
490 \section{Conclusions} 490 \section{Conclusions}
491 491
492 \bibliography{strings,ml,aigaion,specials} 492 \bibliography{strings,ml,aigaion,specials}
493 \bibliographystyle{mlapa} 493 %\bibliographystyle{plainnat}
494 \bibliographystyle{unsrtnat}
494 %\bibliographystyle{apalike} 495 %\bibliographystyle{apalike}
495 496
496 \end{document} 497 \end{document}