comparison writeup/techreport.tex @ 582:9ebb335ca904

techreport.tex
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 18 Sep 2010 16:32:08 -0400
parents 8aad1c6ec39a
children ae77edb9df67
comparison
581:df749e70f637 582:9ebb335ca904
1 \documentclass[12pt,letterpaper]{article} 1 \documentclass{article} % For LaTeX2e
2 \usepackage{times}
3 \usepackage{wrapfig}
4 \usepackage{amsthm,amsmath,bbm}
5 \usepackage[psamsfonts]{amssymb}
6 \usepackage{algorithm,algorithmic}
2 \usepackage[utf8]{inputenc} 7 \usepackage[utf8]{inputenc}
3 \usepackage{graphicx} 8 \usepackage{graphicx,subfigure}
4 \usepackage{times} 9 \usepackage[numbers]{natbib}
5 \usepackage{mlapa} 10
6 \usepackage{subfigure} 11 \addtolength{\textwidth}{10mm}
12 \addtolength{\evensidemargin}{-5mm}
13 \addtolength{\oddsidemargin}{-5mm}
14
15 %\setlength\parindent{0mm}
16
17 \title{Deep Self-Taught Learning for Handwritten Character Recognition}
18 \author{
19 Frédéric Bastien \and
20 Yoshua Bengio \and
21 Arnaud Bergeron \and
22 Nicolas Boulanger-Lewandowski \and
23 Thomas Breuel \and
24 Youssouf Chherawala \and
25 Moustapha Cisse \and
26 Myriam Côté \and
27 Dumitru Erhan \and
28 Jeremy Eustache \and
29 Xavier Glorot \and
30 Xavier Muller \and
31 Sylvain Pannetier Lebeuf \and
32 Razvan Pascanu \and
33 Salah Rifai \and
34 Francois Savard \and
35 Guillaume Sicard
36 }
37 \date{June 8th, 2010, Technical Report 1353, Dept. IRO, U. Montreal}
7 38
8 \begin{document} 39 \begin{document}
9 \title{Generating and Exploiting Perturbed and Multi-Task Handwritten Training Data for Deep Architectures} 40
10 \author{The IFT6266 Gang} 41 %\makeanontitle
11 \date{April 2010, Technical Report, Dept. IRO, U. Montreal}
12
13 \maketitle 42 \maketitle
14 43
44 %\vspace*{-2mm}
15 \begin{abstract} 45 \begin{abstract}
16 Recent theoretical and empirical work in statistical machine learning has 46 Recent theoretical and empirical work in statistical machine learning has
17 demonstrated the importance of learning algorithms for deep 47 demonstrated the importance of learning algorithms for deep
18 architectures, i.e., function classes obtained by composing multiple 48 architectures, i.e., function classes obtained by composing multiple
19 non-linear transformations. In the area of handwriting recognition, 49 non-linear transformations. Self-taught learning (exploiting unlabeled
20 deep learning algorithms 50 examples or examples from other distributions) has already been applied
21 had been evaluated on rather small datasets with a few tens of thousands 51 to deep learners, but mostly to show the advantage of unlabeled
22 of examples. Here we propose a powerful generator of variations 52 examples. Here we explore the advantage brought by {\em out-of-distribution examples}.
23 of examples for character images based on a pipeline of stochastic 53 For this purpose we
24 transformations that include not only the usual affine transformations 54 developed a powerful generator of stochastic variations and noise
25 but also the addition of slant, local elastic deformations, changes 55 processes for character images, including not only affine transformations
26 in thickness, background images, grey level, contrast, occlusion, and 56 but also slant, local elastic deformations, changes in thickness,
27 various types of pixel and spatially correlated noise. 57 background images, grey level changes, contrast, occlusion, and various
28 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders) 58 types of noise. The out-of-distribution examples are obtained from these
29 on the task of learning to classify digits and letters transformed 59 highly distorted images or by including examples of object classes
30 with this pipeline, using the hundreds of millions of generated examples 60 different from those in the target test set.
31 and testing on the full 62-class NIST test set. 61 We show that {\em deep learners benefit
32 We find that the SDA outperforms its 62 more from them than a corresponding shallow learner}, at least in the area of
33 shallow counterpart, an ordinary Multi-Layer Perceptron, 63 handwritten character recognition. In fact, we show that they reach
34 and that it is better able to take advantage of the additional 64 human-level performance on both handwritten digit classification and
35 generated data, as well as better able to take advantage of 65 62-class handwritten character recognition.
36 the multi-task setting, i.e.,
37 training from more classes than those of interest in the end.
38 In fact, we find that the SDA reaches human performance as
39 estimated by the Amazon Mechanical Turk on the 62-class NIST test characters.
40 \end{abstract} 66 \end{abstract}
67 %\vspace*{-3mm}
41 68
42 \section{Introduction} 69 \section{Introduction}
43 70 %\vspace*{-1mm}
44 Deep Learning has emerged as a promising new area of research in 71
45 statistical machine learning (see~\emcite{Bengio-2009} for a review). 72 {\bf Deep Learning} has emerged as a promising new area of research in
73 statistical machine learning (see~\citet{Bengio-2009} for a review).
46 Learning algorithms for deep architectures are centered on the learning 74 Learning algorithms for deep architectures are centered on the learning
47 of useful representations of data, which are better suited to the task at hand. 75 of useful representations of data, which are better suited to the task at hand.
48 This is in great part inspired by observations of the mammalian visual cortex, 76 This is in part inspired by observations of the mammalian visual cortex,
49 which consists of a chain of processing elements, each of which is associated with a 77 which consists of a chain of processing elements, each of which is associated with a
50 different representation. In fact, 78 different representation of the raw visual input. In fact,
51 it was found recently that the features learnt in deep architectures resemble 79 it was found recently that the features learnt in deep architectures resemble
52 those observed in the first two of these stages (in areas V1 and V2 80 those observed in the first two of these stages (in areas V1 and V2
53 of visual cortex)~\cite{HonglakL2008}. 81 of visual cortex)~\citep{HonglakL2008}, and that they become more and
54 Processing images typically involves transforming the raw pixel data into 82 more invariant to factors of variation (such as camera movement) in
55 new {\bf representations} that can be used for analysis or classification. 83 higher layers~\citep{Goodfellow2009}.
56 For example, a principal component analysis representation linearly projects 84 Learning a hierarchy of features increases the
57 the input image into a lower-dimensional feature space.
58 Why learn a representation? Current practice in the computer vision
59 literature converts the raw pixels into a hand-crafted representation
60 (e.g.\ SIFT features~\cite{Lowe04}), but deep learning algorithms
61 tend to discover similar features in their first few
62 levels~\cite{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}.
63 Learning increases the
64 ease and practicality of developing representations that are at once 85 ease and practicality of developing representations that are at once
65 tailored to specific tasks, yet are able to borrow statistical strength 86 tailored to specific tasks, yet are able to borrow statistical strength
66 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the 87 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
67 feature representation can lead to higher-level (more abstract, more 88 feature representation can lead to higher-level (more abstract, more
68 general) features that are more robust to unanticipated sources of 89 general) features that are more robust to unanticipated sources of
69 variance extant in real data. 90 variance extant in real data.
70 91
92 {\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles
93 of semi-supervised and multi-task learning: the learner can exploit examples
94 that are unlabeled and possibly come from a distribution different from the target
95 distribution, e.g., from other classes than those of interest.
96 It has already been shown that deep learners can clearly take advantage of
97 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
98 but more needs to be done to explore the impact
99 of {\em out-of-distribution} examples and of the multi-task setting
100 (one exception is~\citep{CollobertR2008}, which uses a different kind
101 of learning algorithm). In particular the {\em relative
102 advantage} of deep learning for these settings has not been evaluated.
103 The hypothesis discussed in the conclusion is that a deep hierarchy of features
104 may be better able to provide sharing of statistical strength
105 between different regions in input space or different tasks.
106
107 \iffalse
71 Whereas a deep architecture can in principle be more powerful than a 108 Whereas a deep architecture can in principle be more powerful than a
72 shallow one in terms of representation, depth appears to render the 109 shallow one in terms of representation, depth appears to render the
73 training problem more difficult in terms of optimization and local minima. 110 training problem more difficult in terms of optimization and local minima.
74 It is also only recently that successful algorithms were proposed to 111 It is also only recently that successful algorithms were proposed to
75 overcome some of these difficulties. All are based on unsupervised 112 overcome some of these difficulties. All are based on unsupervised
76 learning, often in a greedy layer-wise ``unsupervised pre-training'' 113 learning, often in a greedy layer-wise ``unsupervised pre-training''
77 stage~\cite{Bengio-2009}. One of these layer initialization techniques, 114 stage~\citep{Bengio-2009}. One of these layer initialization techniques,
78 applied here, is the Denoising 115 applied here, is the Denoising
79 Auto-Encoder~(DEA)~\cite{VincentPLarochelleH2008-very-small}, which 116 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}),
117 which
80 performed similarly or better than previously proposed Restricted Boltzmann 118 performed similarly or better than previously proposed Restricted Boltzmann
81 Machines in terms of unsupervised extraction of a hierarchy of features 119 Machines in terms of unsupervised extraction of a hierarchy of features
82 useful for classification. The principle is that each layer starting from 120 useful for classification. Each layer is trained to denoise its
83 the bottom is trained to encode its input (the output of the previous 121 input, creating a layer of features that can be used as input for the next layer.
84 layer) and try to reconstruct it from a corrupted version of it. After this 122 \fi
85 unsupervised initialization, the stack of denoising auto-encoders can be 123 %The principle is that each layer starting from
86 converted into a deep supervised feedforward neural network and trained by 124 %the bottom is trained to encode its input (the output of the previous
87 stochastic gradient descent. 125 %layer) and to reconstruct it from a corrupted version. After this
88 126 %unsupervised initialization, the stack of DAs can be
89 127 %converted into a deep supervised feedforward neural network and fine-tuned by
128 %stochastic gradient descent.
129
130 %
131 In this paper we ask the following questions:
132
133 %\begin{enumerate}
134 $\bullet$ %\item
135 Do the good results previously obtained with deep architectures on the
136 MNIST digit images generalize to the setting of a much larger and richer (but similar)
137 dataset, the NIST special database 19, with 62 classes and around 800k examples?
138
139 $\bullet$ %\item
140 To what extent does the perturbation of input images (e.g. adding
141 noise, affine transformations, background images) make the resulting
142 classifiers better not only on similarly perturbed images but also on
143 the {\em original clean examples}? We study this question in the
144 context of the 62-class and 10-class tasks of the NIST special database 19.
145
146 $\bullet$ %\item
147 Do deep architectures {\em benefit more from such out-of-distribution}
148 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
149 We use highly perturbed examples to generate out-of-distribution examples.
150
151 $\bullet$ %\item
152 Similarly, does the feature learning step in deep learning algorithms benefit more
153 from training with moderately different classes (i.e. a multi-task learning scenario) than
154 a corresponding shallow and purely supervised architecture?
155 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case)
156 to answer this question.
157 %\end{enumerate}
158
159 Our experimental results provide positive evidence towards all of these questions.
160 To achieve these results, we introduce in the next section a sophisticated system
161 for stochastically transforming character images and then explain the methodology,
162 which is based on training with or without these transformed images and testing on
163 clean ones. We measure the relative advantage of out-of-distribution examples
164 for a deep learner vs a supervised shallow one.
165 Code for generating these transformations as well as for the deep learning
166 algorithms is made available.
167 We also estimate the relative advantage for deep learners of training with
168 other classes than those of interest, by comparing learners trained with
169 62 classes with learners trained with only a subset (on which they
170 are then tested).
171 The conclusion discusses
172 the more general question of why deep learners may benefit so much from
173 the self-taught learning framework.
174
175 %\vspace*{-3mm}
176 \newpage
90 \section{Perturbation and Transformation of Character Images} 177 \section{Perturbation and Transformation of Character Images}
91 178 \label{s:perturbations}
92 This section describes the different transformations we used to generate data, in the order in which they are applied. 179 %\vspace*{-2mm}
180
181 \begin{wrapfigure}[8]{l}{0.15\textwidth}
182 %\begin{minipage}[b]{0.14\linewidth}
183 %\vspace*{-5mm}
184 \begin{center}
185 \includegraphics[scale=.4]{images/Original.png}\\
186 {\bf Original}
187 \end{center}
188 \end{wrapfigure}
189 %%\vspace{0.7cm}
190 %\end{minipage}%
191 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
192 This section describes the different transformations we used to stochastically
193 transform $32 \times 32$ source images (such as the one on the left)
194 in order to obtain data from a larger distribution which
195 covers a domain substantially larger than the clean characters distribution from
196 which we start.
197 Although character transformations have been used before to
198 improve character recognizers, this effort is on a large scale both
199 in number of classes and in the complexity of the transformations, hence
200 in the complexity of the learning task.
201 More details can
202 be found in this technical report~\citep{ift6266-tr-anonymous}.
93 The code for these transformations (mostly python) is available at 203 The code for these transformations (mostly python) is available at
94 {\tt http://anonymous.url.net}. All the modules in the pipeline share 204 {\tt http://anonymous.url.net}. All the modules in the pipeline share
95 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the 205 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
96 amount of deformation or noise introduced. 206 amount of deformation or noise introduced.
97 207 There are two main parts in the pipeline. The first one,
98 We can differentiate two important parts in the pipeline. The first one, 208 from slant to pinch below, performs transformations. The second
99 from slant to pinch, performs transformations of the character. The second 209 part, from blur to contrast, adds different kinds of noise.
100 part, from blur to contrast, adds noise to the image. 210 %\end{minipage}
101 211
102 \subsection{Slant} 212 %\vspace*{1mm}
103 213 \subsection{Transformations}
104 We mimic slant by shifting each row of the image 214 %{\large\bf 2.1 Transformations}
105 proportionally to its height: $shift = round(slant \times height)$. 215 %\vspace*{1mm}
106 The $slant$ coefficient can be negative or positive with equal probability 216
107 and its value is randomly sampled according to the complexity level: 217 \subsubsection*{Thickness}
108 $slant \sim U[0,complexity]$, so the 218
109 maximum displacement for the lowest or highest pixel line is 219 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
110 $round(complexity \times 32)$. 220 \begin{minipage}[b]{0.14\linewidth}
111 221 %\centering
112 --- 222 \begin{center}
113 223 \vspace*{-5mm}
114 In order to mimic a slant effect, we simply shift each row of the image 224 \includegraphics[scale=.4]{images/Thick_only.png}\\
115 proportionally to its height: $shift = round(slant \times height)$. We 225 %{\bf Thickness}
116 round the shift in order to have a discrete displacement. We do not use a 226 \end{center}
117 filter to smooth the result in order to save computing time and also 227 \vspace{.6cm}
118 because later transformations have similar effects. 228 \end{minipage}%
119 229 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
120 The $slant$ coefficient can be negative or positive with equal probability 230 %\end{wrapfigure}
121 and its value is randomly sampled according to the complexity level. In 231 To change character {\bf thickness}, morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
122 our case we take uniformly a number in the range $[0,complexity]$, so the
123 maximum displacement for the lowest or highest pixel line is
124 $round(complexity \times 32)$.
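For illustration only, a minimal Python/NumPy sketch of this slant transformation could look as follows. It is not the pipeline's actual code; it assumes a $32 \times 32$ grey-scale array with values in $[0,1]$ and background at 0, and the function name is hypothetical.

\begin{verbatim}
import numpy as np

def apply_slant(img, complexity, rng=np.random):
    """Hypothetical sketch: shift each row proportionally to its height."""
    # slant ~ U[0, complexity], sign chosen with equal probability
    slant = rng.uniform(0, complexity) * rng.choice([-1, 1])
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        shift = int(round(slant * y))   # discrete displacement, no smoothing
        if shift >= 0:
            out[y, shift:] = img[y, :w - shift]
        else:
            out[y, :w + shift] = img[y, -shift:]
    return out
\end{verbatim}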
125
126
127 \subsection{Thickness}
128
129 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
130 are applied. The neighborhood of each pixel is multiplied 232 are applied. The neighborhood of each pixel is multiplied
131 element-wise with a {\em structuring element} matrix. 233 element-wise with a {\em structuring element} matrix.
132 The pixel value is replaced by the maximum or the minimum of the resulting 234 The pixel value is replaced by the maximum or the minimum of the resulting
133 matrix, respectively for dilation or erosion. Ten different structural elements with 235 matrix, respectively for dilation or erosion. Ten different structural elements with
134 increasing dimensions (largest is $5\times5$) were used. For each image, 236 increasing dimensions (largest is $5\times5$) were used. For each image,
135 randomly sample the operator type (dilation or erosion) with equal probability and one structural 237 randomly sample the operator type (dilation or erosion) with equal probability and one structural
136 element from a subset of the $n$ smallest structuring elements where $n$ is 238 element from a subset of the $n=round(m \times complexity)$ smallest structuring elements
137 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$ 239 where $m=10$ for dilation and $m=6$ for erosion (to avoid completely erasing thin characters).
138 for erosion. A neutral element is always present in the set, and if it is 240 A neutral element (no transformation)
139 chosen no transformation is applied. Erosion allows only the six 241 is always present in the set.
140 smallest structural elements because when the character is too thin it may 242 %%\vspace{.4cm}
141 be completely erased. 243 \end{minipage}
142 244
143 --- 245 \vspace{2mm}
144 246
145 To change the thickness of the characters we used morphological operators: 247 \subsubsection*{Slant}
146 dilation and erosion~\cite{Haralick87,Serra82}. 248 \vspace*{2mm}
147 249
148 The basic idea of such a transform is, for each pixel, to multiply in an 250 \begin{minipage}[b]{0.14\linewidth}
149 element-wise manner its neighbourhood with a matrix called the structuring 251 \centering
150 element. Then for dilation we replace the pixel value by the maximum of 252 \includegraphics[scale=.4]{images/Slant_only.png}\\
151 the result, or the minimum for erosion. This will dilate or erode objects 253 %{\bf Slant}
152 in the image and the strength of the transform only depends on the 254 \end{minipage}%
153 structuring element. 255 \hspace{0.3cm}
154 256 \begin{minipage}[b]{0.83\linewidth}
155 We used ten different structural elements with increasing dimensions (the 257 %\centering
156 biggest is $5\times5$). For each image, we randomly sample the operator 258 To produce {\bf slant}, each row of the image is shifted
157 type (dilation or erosion) with equal probability and one structural 259 proportionally to its height: $shift = round(slant \times height)$.
158 element from a subset of the $n$ smallest structuring elements where $n$ is 260 $slant \sim U[-complexity,complexity]$.
159 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$ 261 The shift is randomly chosen to be either to the left or to the right.
160 for erosion. A neutral element is always present in the set; if it is 262 \vspace{5mm}
161 chosen the transformation is not applied. Erosion allows only the six 263 \end{minipage}
162 smallest structural elements because when the character is too thin it may 264 %\vspace*{-4mm}
163 erase it completely. 265
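As a hypothetical illustration only (the real module uses ten specific structuring elements rather than the square ones assumed here), a simplified sketch of the thickness module could be:

\begin{verbatim}
import numpy as np
from scipy import ndimage

def apply_thickness(img, complexity, rng=np.random):
    """Simplified sketch: dilate or erode with a random structuring element."""
    dilate = rng.rand() < 0.5              # dilation or erosion, equal probability
    m = 10 if dilate else 6                # erosion restricted to smaller elements
    n = int(round(m * complexity))
    if n < 1:
        return img                         # neutral element: no change
    # hypothetical size schedule standing in for the ten structuring elements
    sizes = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
    k = sizes[rng.randint(n)]
    if k == 1:
        return img                         # neutral element chosen
    op = ndimage.grey_dilation if dilate else ndimage.grey_erosion
    return op(img, size=(k, k))
\end{verbatim}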
164 266 %\newpage
165 \subsection{Affine Transformations} 267
166 268 \subsubsection*{Affine Transformations}
167 A $2 \times 3$ affine transform matrix (with 269
168 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. 270 \begin{minipage}[b]{0.14\linewidth}
169 Each pixel $(x,y)$ of the output image takes the value of the pixel 271 %\centering
170 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This 272 %\begin{wrapfigure}[8]{l}{0.15\textwidth}
171 produces scaling, translation, rotation and shearing. 273 \begin{center}
172 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to 274 \includegraphics[scale=.4]{images/Affine_only.png}
173 forbid important rotations (not to confuse classes) but to give good 275 \vspace*{6mm}
174 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times 276 %{\small {\bf Affine \mbox{Transformation}}}
175 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 277 \end{center}
176 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times 278 %\end{wrapfigure}
177 complexity]$. 279 \end{minipage}%
178 280 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
179 ---- 281 \noindent A $2 \times 3$ {\bf affine transform} matrix (with
180 282 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$.
181 We generate an affine transform matrix according to the complexity level, 283 Output pixel $(x,y)$ takes the value of input pixel
182 then we apply it directly to the image. The matrix is of size $2 \times 284 nearest to $(ax+by+c,dx+ey+f)$,
183 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$. Formally, 285 producing scaling, translation, rotation and shearing.
184 for each pixel $(x,y)$ of the output image, we give the value of the pixel 286 Marginal distributions of $(a,b,c,d,e,f)$ have been tuned to
185 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This allows us to 287 forbid large rotations (to avoid confusing classes) but to give good
186 produce scaling, translation, rotation and shearing variations. 288 variability of the transformation: $a$ and $d$ $\sim U[1-3
187 289 complexity,1+3\,complexity]$, $b$ and $e$ $\sim U[-3 \,complexity,3\,
188 The sampling of the parameters $(a,b,c,d,e,f)$ has been tuned by hand to 290 complexity]$, and $c$ and $f \sim U[-4 \,complexity, 4 \,
189 forbid important rotations (not to confuse classes) but to give good 291 complexity]$.\\
190 variability of the transformation. For each image we sample uniformly the 292 \end{minipage}
191 parameters in the following ranges: $a$ and $d$ in $[1-3 \times 293
192 complexity,1+3 \times complexity]$, $b$ and $e$ in $[-3 \times complexity,3 294 %\vspace*{-4.5mm}
193 \times complexity]$ and $c$ and $f$ in $[-4 \times complexity, 4 \times 295 \subsubsection*{Local Elastic Deformations}
194 complexity]$. 296
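To make the mapping concrete, a minimal sketch of this affine transformation (not the pipeline's code; the $(x,y)$ = (column, row) convention is an assumption) could be:

\begin{verbatim}
import numpy as np

def apply_affine(img, complexity, rng=np.random):
    """Sketch: sample (a,b,c,d,e,f), map output pixels to nearest input pixels."""
    a, d = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity, 2)
    b, e = rng.uniform(-3 * complexity, 3 * complexity, 2)
    c, f = rng.uniform(-4 * complexity, 4 * complexity, 2)
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):                 # (x, y) = (column, row) by assumption
            sx = int(round(a * x + b * y + c))
            sy = int(round(d * x + e * y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = img[sy, sx]
    return out
\end{verbatim}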
195 297 %\begin{minipage}[t]{\linewidth}
196 298 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
197 \subsection{Local Elastic Deformations} 299 %\hspace*{-8mm}
198 300 \begin{minipage}[b]{0.14\linewidth}
199 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, 301 %\centering
302 \begin{center}
303 \vspace*{5mm}
304 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png}
305 %{\bf Local Elastic Deformation}
306 \end{center}
307 %\end{wrapfigure}
308 \end{minipage}%
309 \hspace{3mm}
310 \begin{minipage}[b]{0.85\linewidth}
311 %%\vspace*{-20mm}
312 The {\bf local elastic deformation}
313 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
200 which provides more details. 314 which provides more details.
201 Two ``displacements'' fields are generated and applied, for horizontal 315 The intensity of the displacement fields is given by
202 and vertical displacements of pixels. 316 $\alpha = \sqrt[3]{complexity} \times 10.0$, which are
203 To generate a pixel in either field, first a value between -1 and 1 is 317 convolved with a Gaussian 2D kernel (resulting in a blur) of
204 chosen from a uniform distribution. Then all the pixels, in both fields, are 318 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$.
205 multiplied by a constant $\alpha$ which controls the intensity of the 319 \vspace{2mm}
206 displacements (larger $\alpha$ translates into larger wiggles). 320 \end{minipage}
207 Each field is convolved with a Gaussian 2D kernel of 321
208 standard deviation $\sigma$. Visually, this results in a blur. 322 \vspace*{4mm}
209 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times 323
210 \sqrt[3]{complexity}$. 324 \subsubsection*{Pinch}
211 325
212 ---- 326 \begin{minipage}[b]{0.14\linewidth}
213 327 %\centering
214 This filter induces a "wiggly" effect in the image. The description here 328 %\begin{wrapfigure}[7]{l}{0.15\textwidth}
215 will be brief, as the algorithm follows precisely what is described in 329 %\vspace*{-5mm}
216 \cite{SimardSP03}. 330 \begin{center}
217 331 \includegraphics[scale=.4]{images/Pinch_only.png}\\
218 The general idea is to generate two "displacements" fields, for horizontal 332 \vspace*{15mm}
219 and vertical displacements of pixels. Each of these fields has the same 333 %{\bf Pinch}
220 size as the original image. 334 \end{center}
221 335 %\end{wrapfigure}
222 When generating the transformed image, we'll loop over the x and y 336 %%\vspace{.6cm}
223 positions in the fields and select, as a value, the value of the pixel in 337 \end{minipage}%
224 the original image at the (relative) position given by the displacement 338 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
225 fields for this x and y. If the position we'd retrieve is outside the 339 The {\bf pinch} module applies the ``Whirl and pinch'' GIMP filter with whirl set to 0.
226 borders of the image, we use a 0 value instead. 340 A pinch is ``similar to projecting the image onto an elastic
227
228 To generate a pixel in either field, first a value between -1 and 1 is
229 chosen from a uniform distribution. Then all the pixels, in both fields, are
230 multiplied by a constant $\alpha$ which controls the intensity of the
231 displacements (bigger $\alpha$ translates into larger wiggles).
232
233 As a final step, each field is convolved with a Gaussian 2D kernel of
234 standard deviation $\sigma$. Visually, this results in a "blur"
235 filter. This has the effect of making values next to each other in the
236 displacement fields similar. In effect, this makes the wiggles more
237 coherent, less noisy.
238
239 As displacement fields were slow to compute, 50 pairs of fields were
240 generated per complexity in increments of 0.1 (50 pairs for 0.1, 50 pairs
241 for 0.2, etc.), and afterwards, given a complexity, we selected randomly
242 among the 50 corresponding pairs.
243
244 $\sigma$ and $\alpha$ were linked to complexity through the formulas
245 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
246 \sqrt[3]{complexity}$.
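A hypothetical sketch of these local elastic deformations, using SciPy for the smoothing and interpolation rather than the precomputed fields described above, could be:

\begin{verbatim}
import numpy as np
from scipy import ndimage

def apply_elastic(img, complexity, rng=np.random):
    """Sketch of local elastic deformation in the spirit of Simard et al. (2003)."""
    alpha = complexity ** (1 / 3.0) * 10.0        # displacement intensity
    sigma = 10 - 7 * complexity ** (1 / 3.0)      # smoothing of the fields
    h, w = img.shape
    dx = alpha * ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    dy = alpha * ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # pixels displaced outside the image default to 0 (mode='constant')
    return ndimage.map_coordinates(img, [ys + dy, xs + dx],
                                   order=1, mode='constant')
\end{verbatim}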
247
248
249 \subsection{Pinch}
250
251 This is a GIMP filter called ``Whirl and
252 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
253 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). 341 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
254 For a square input image, this is akin to drawing a circle of 342 For a square input image, draw a radius-$r$ disk
255 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to 343 around its center $C$. Any pixel $P$ belonging to
256 that disk (region inside circle) will have its value recalculated by taking 344 that disk has its value replaced by
257 the value of another ``source'' pixel in the original image. The position of 345 the value of a ``source'' pixel in the original image,
258 that source pixel is found on the line that goes through $C$ and $P$, but 346 on the line that goes through $C$ and $P$, but
259 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ 347 at some other distance $d_2$. Define $d_1=distance(P,C)$
260 and $C$. $d_2$ is given by $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times 348 and $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times
261 d_1$, where $pinch$ is a parameter to the filter. 349 d_1$, where $pinch$ is a parameter of the filter.
262 The actual value is given by bilinear interpolation considering the pixels 350 The actual value is given by bilinear interpolation considering the pixels
263 around the (non-integer) source position thus found. 351 around the (non-integer) source position thus found.
264 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. 352 Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
265 353 %%\vspace{1.5cm}
266 --- 354 \end{minipage}
267 355
268 This is another GIMP filter we used. The filter is in fact named "Whirl and 356 %\vspace{1mm}
269 pinch", but we don't use the "whirl" part (whirl is set to 0). As described 357
270 in GIMP, a pinch is "similar to projecting the image onto an elastic 358 %{\large\bf 2.2 Injecting Noise}
271 surface and pressing or pulling on the center of the surface". 359 \subsection{Injecting Noise}
272 360 %\vspace{2mm}
273 Mathematically, for a square input image, think of drawing a circle of 361
274 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to 362 \subsubsection*{Motion Blur}
275 that disk (region inside circle) will have its value recalculated by taking 363
276 the value of another "source" pixel in the original image. The position of 364 %%\vspace*{-.2cm}
277 that source pixel is found on the line thats goes through $C$ and $P$, but 365 \begin{minipage}[t]{0.14\linewidth}
278 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ 366 \centering
279 and $C$. $d_2$ is given by $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times 367 \vspace*{0mm}
280 d_1$, where $pinch$ is a parameter to the filter. 368 \includegraphics[scale=.4]{images/Motionblur_only.png}
281 369 %{\bf Motion Blur}
282 If the region considered is not square then, before computing $d_2$, the 370 \end{minipage}%
283 smallest dimension (x or y) is stretched such that we may consider the 371 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
284 region as if it was square. Then, after $d_2$ has been computed and 372 %%\vspace*{.5mm}
285 corresponding components $d_2\_x$ and $d_2\_y$ have been found, the 373 \vspace*{2mm}
286 component corresponding to the stretched dimension is compressed back by an 374 The {\bf motion blur} module is GIMP's ``linear motion blur'', which
287 inverse ratio. 375 has parameters $length$ and $angle$. The value of
288 376 a pixel in the final image is approximately the mean of the first $length$ pixels
289 The actual value is given by bilinear interpolation considering the pixels 377 found by moving in the $angle$ direction,
290 around the (non-integer) source position thus found. 378 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
291 379 %\vspace{5mm}
292 The value for $pinch$ in our case was given by sampling from an uniform 380 \end{minipage}
293 distribution over the range $[-complexity, 0.7 \times complexity]$. 381
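A hypothetical sketch of the pinch mapping for a square image (not GIMP's implementation) could be:

\begin{verbatim}
import numpy as np
from scipy import ndimage

def apply_pinch(img, complexity, rng=np.random):
    """Sketch: pinch pixels towards/away from the image centre."""
    pinch = rng.uniform(-complexity, 0.7 * complexity)
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = min(cy, cx)                               # radius of the affected disk
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    d1 = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    scale = np.ones_like(d1)                      # d2/d1 ratio; 1 outside the disk
    inside = (d1 > 0) & (d1 < r)
    scale[inside] = np.sin(np.pi * d1[inside] / (2 * r)) ** (-pinch)
    src_y = cy + (ys - cy) * scale                # source pixel on the C-P line
    src_x = cx + (xs - cx) * scale
    return ndimage.map_coordinates(img, [src_y, src_x], order=1, mode='nearest')
\end{verbatim}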
294 382 %\vspace*{1mm}
295 \subsection{Motion Blur} 383
296 384 \subsubsection*{Occlusion}
297 This is a ``linear motion blur'' in GIMP 385
298 terminology, with two parameters, $length$ and $angle$. The value of 386 \begin{minipage}[t]{0.14\linewidth}
299 a pixel in the final image is approximately the mean value of the $length$ first pixels 387 \centering
300 found by moving in the $angle$ direction. 388 \vspace*{3mm}
301 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. 389 \includegraphics[scale=.4]{images/occlusion_only.png}\\
302 390 %{\bf Occlusion}
303 ---- 391 %%\vspace{.5cm}
304 392 \end{minipage}%
305 This is a GIMP filter we applied, a "linear motion blur" in GIMP 393 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
306 terminology. The description will be brief as it is a well-known filter. 394 %\vspace*{-18mm}
307 395 The {\bf occlusion} module selects a random rectangle from an {\em occluder} character
308 This algorithm has two input parameters, $length$ and $angle$. The value of 396 image and places it over the original {\em occluded}
309 a pixel in the final image is the mean value of the $length$ first pixels 397 image. Pixels are combined by taking the max(occluder, occluded),
310 found by moving in the $angle$ direction. An approximation of this idea is 398 i.e. keeping the lighter ones.
311 used, as we won't fall onto precise pixels by following that 399 The rectangle corners
312 direction. This is done using the Bresenham line algorithm.
313
314 The angle, in our case, is chosen from a uniform distribution over
315 $[0,360]$ degrees. The length, though, depends on the complexity; it's
316 sampled from a Gaussian distribution of mean 0 and standard deviation
317 $\sigma = 3 \times complexity$.
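A crude, purely illustrative sketch of such a linear motion blur (stepping and rounding rather than Bresenham's algorithm) could be:

\begin{verbatim}
import numpy as np

def apply_motion_blur(img, complexity, rng=np.random):
    """Sketch: average the first `length` pixels along a random direction."""
    angle = np.radians(rng.uniform(0, 360))
    length = int(round(abs(rng.normal(0, 3 * complexity))))
    if length <= 1:
        return img
    h, w = img.shape
    dy, dx = np.sin(angle), np.cos(angle)
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            acc, n = 0.0, 0
            for t in range(length):       # pixels reached by stepping in that direction
                yy, xx = int(round(y + t * dy)), int(round(x + t * dx))
                if 0 <= yy < h and 0 <= xx < w:
                    acc += img[yy, xx]
                    n += 1
            out[y, x] = acc / max(n, 1)
    return out
\end{verbatim}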
318
319 \subsection{Occlusion}
320
321 Selects a random rectangle from an {\em occluder} character
322 image and places it over the original {\em occluded} character
323 image. Pixels are combined by taking the max(occluder,occluded),
324 i.e. keeping the lighter one. The rectangle corners
325 are sampled so that larger complexity gives larger rectangles. 400 are sampled so that larger complexity gives larger rectangles.
326 The destination position in the occluded image is also sampled 401 The destination position in the occluded image is also sampled
327 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}). 402 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
328 This filter has a probability of 60\% of not being applied. 403 This module is skipped with probability 60\%.
329 404 %%\vspace{7mm}
330 --- 405 \end{minipage}
331 406
332 This filter selects random parts of other (hereafter "occlusive") letter 407 %\vspace*{1mm}
333 images and places them over the original letter (hereafter "occluded") 408 \subsubsection*{Gaussian Smoothing}
334 image. To be more precise, having selected a subregion of the occlusive 409
335 image and a desination position in the occluded image, to determine the 410 %\begin{wrapfigure}[8]{l}{0.15\textwidth}
336 final value for a given overlapping pixel, it selects whichever pixel is 411 %\vspace*{-6mm}
337 the lightest. As a reminder, the background value is 0, black, so the value 412 \begin{minipage}[t]{0.14\linewidth}
338 nearest to 1 is selected. 413 \begin{center}
339 414 %\centering
340 To select a subpart of the occlusive image, four numbers are generated. For 415 \vspace*{6mm}
341 compability with the code, we'll call them "haut", "bas", "gauche" and 416 \includegraphics[scale=.4]{images/Bruitgauss_only.png}
342 "droite" (respectively meaning top, bottom, left and right). Each of these 417 %{\bf Gaussian Smoothing}
343 numbers is selected according to a Gaussian distribution of mean $8 \times 418 \end{center}
344 complexity$ and standard deviation $2$. This means the largest the 419 %\end{wrapfigure}
345 complexity is, the biggest the occlusion will be. The absolute value is 420 %%\vspace{.5cm}
346 taken, as the numbers must be positive, and the maximum value is capped at 421 \end{minipage}%
347 15. 422 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
348 423 With the {\bf Gaussian smoothing} module,
349 These four sizes collectively define a window centered on the middle pixel 424 different regions of the image are spatially smoothed.
350 of the occlusive image. This is the part that will be extracted as the 425 This is achieved by first convolving
351 occlusion. 426 the image with an isotropic Gaussian kernel of
352
353 The next step is to select a destination position in the occluded
354 image. Vertical and horizontal displacements $y\_arrivee$ and $x\_arrivee$
355 are selected according to Gaussian distributions of mean 0 and of standard
356 deviations of, respectively, 3 and 2. Then a horizontal placement mode,
357 $place$, is selected among three values, meaning
358 left, middle or right.
359
360 If $place$ is "middle", the occlusion will be horizontally centered
361 around the horizontal middle of the occluded image, then shifted according
362 to $x\_arrivee$. If $place$ is "left", it will be placed on the left of
363 the occluded image, then displaced right according to $x\_arrivee$. The
364 contrary happens if $place$ is $right$.
365
366 In both the horizontal and vertical positioning, the maximum position in
367 either direction is such that the selected occlusion won't go beyond the
368 borders of the occluded image.
369
370 This filter has a probability of not being applied, at all, of 60\%.
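A simplified, hypothetical sketch of the occlusion module (the rectangle and placement sampling are only approximated) could be:

\begin{verbatim}
import numpy as np

def apply_occlusion(img, occluder, complexity, rng=np.random):
    """Sketch: paste a rectangle from another character, keeping lighter pixels."""
    if rng.rand() < 0.6:                  # not applied 60% of the time
        return img
    h, w = img.shape
    # rectangle extents grow with complexity, each capped at 15 pixels
    top, bottom, left, right = np.minimum(
        np.abs(rng.normal(8 * complexity, 2, 4)).astype(int), 15)
    cy, cx = h // 2, w // 2
    patch = occluder[cy - top:cy + bottom + 1, cx - left:cx + right + 1]
    ph, pw = patch.shape
    oy = int(np.clip(cy + rng.normal(0, 3) - ph // 2, 0, h - ph))
    ox = int(np.clip(cx + rng.normal(0, 2) - pw // 2, 0, w - pw))
    out = img.copy()
    out[oy:oy + ph, ox:ox + pw] = np.maximum(out[oy:oy + ph, ox:ox + pw], patch)
    return out
\end{verbatim}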
371
372
373 \subsection{Pixel Permutation}
374
375 This filter permutes neighbouring pixels. It selects first
376 $\frac{complexity}{3}$ pixels randomly in the image. Each of them is then
377 sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers
378 of exchanges to the left, right, top and bottom are kept equal, or differ
379 by at most 1 when the number of selected pixels is not a multiple of 4.
381 This filter has a probability of 80\% of not being applied.
382
383 ---
384
385 This filter permutes neighbouring pixels. It first selects
386 $\frac{complexity}{3}$ pixels randomly in the image. Each of them is then
387 sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers
388 of exchanges to the left, right, top and bottom are kept equal, or differ
389 by at most 1 when the number of selected pixels is not a multiple of 4.
390
391 It has a probability of not being applied, at all, of 80\%.
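A hypothetical sketch of this pixel permutation (with wrap-around at the borders as a simplification) could be:

\begin{verbatim}
import numpy as np

def permute_pixels(img, complexity, rng=np.random):
    """Sketch: swap randomly chosen pixels with one of their 4 nearest neighbours."""
    if rng.rand() < 0.8:                  # not applied 80% of the time
        return img
    out = img.copy()
    h, w = img.shape
    k = int(h * w * complexity / 3)       # fraction complexity/3 of the pixels
    for _ in range(k):
        y, x = rng.randint(h), rng.randint(w)
        dy, dx = [(0, 1), (0, -1), (1, 0), (-1, 0)][rng.randint(4)]
        y2, x2 = (y + dy) % h, (x + dx) % w   # wrap at borders for simplicity
        out[y, x], out[y2, x2] = out[y2, x2], out[y, x]
    return out
\end{verbatim}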
392
393
394 \subsection{Gaussian Noise}
395
396 This filter simply adds, to each pixel of the image independently, a
397 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
398 It has a probability of 70\% of not being applied.
399
400 ---
401
402 This filter simply adds, to each pixel of the image independently, a
403 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$.
404
405 It has a probability of not being applied, at all, of 70\%.
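A minimal sketch of this additive Gaussian noise, not the pipeline's actual code, could be:

\begin{verbatim}
import numpy as np

def add_gaussian_noise(img, complexity, rng=np.random):
    """Sketch: i.i.d. Gaussian noise with std = complexity/10, clipped to [0,1]."""
    if rng.rand() < 0.7:                  # not applied 70% of the time
        return img
    noisy = img + rng.normal(0, complexity / 10.0, img.shape)
    return np.clip(noisy, 0.0, 1.0)
\end{verbatim}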
406
407
408 \subsection{Background Images}
409
410 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
411 background behind the letter. The background is chosen by first selecting,
412 at random, an image from a set of images. Then a 32$\times$32 sub-region
413 of that image is chosen as the background image (by sampling position
414 uniformly while making sure not to cross image borders).
415 To combine the original letter image and the background image, contrast
416 adjustments are made. We first get the maximal values (i.e. maximal
417 intensity) for both the original image and the background image, $maximage$
418 and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
419 Each background pixel value is multiplied by $\frac{max(maximage -
420 contrast, 0)}{maxbg}$ (higher contrast yields darker
421 background). The output image pixels are max(background,original).
422
423 ---
424
425 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random
426 background behind the letter. The background is chosen by first selecting,
427 at random, an image from a set of images. Then we choose a 32x32 subregion
428 of that image as the background image (by sampling x and y positions
429 uniformly while making sure not to cross image borders).
430
431 To combine the original letter image and the background image, contrast
432 adjustments are made. We first get the maximal values (i.e. maximal
433 intensity) for both the original image and the background image, $maximage$
434 and $maxbg$. We also have a parameter, $contrast$, given by sampling from a
435 uniform distribution over $[complexity, 1]$.
436
437 Once we have all these numbers, we first adjust the values for the
438 background image. Each pixel value is multiplied by $\frac{max(maximage -
439 contrast, 0)}{maxbg}$. Therefore the higher the contrast, the darker the
440 background will be.
441
442 The final image is found by taking the brightest (i.e. value nearest to 1)
443 pixel from either the background image or the corresponding pixel in the
444 original image.
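A hypothetical sketch of this background combination, assuming the background image is at least $32 \times 32$, could be:

\begin{verbatim}
import numpy as np

def add_background(img, bg_image, complexity, rng=np.random):
    """Sketch: blend a random crop of a larger background image behind the character."""
    h, w = img.shape
    y0 = rng.randint(bg_image.shape[0] - h + 1)
    x0 = rng.randint(bg_image.shape[1] - w + 1)
    bg = bg_image[y0:y0 + h, x0:x0 + w].astype(float)
    contrast = rng.uniform(complexity, 1.0)
    maximage, maxbg = img.max(), max(bg.max(), 1e-8)
    bg = bg * max(maximage - contrast, 0) / maxbg  # higher contrast -> darker background
    return np.maximum(img, bg)                     # keep the brightest pixel
\end{verbatim}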
445
446 \subsection{Salt and Pepper Noise}
447
448 This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
449 The fraction of selected pixels is $0.2 \times complexity$.
450 This filter has a probability of not being applied at all of 75\%.
451
452 ---
453
454 This filter adds noise to the image by randomly selecting a certain number
455 of pixels and, for those selected pixels, assigning a random value according to
456 a uniform distribution over the $[0,1]$ range. This last distribution does
457 not change according to complexity. Instead, the number of selected pixels
458 does: the proportion of changed pixels corresponds to $complexity / 5$,
459 which means, as a maximum, 20\% of the pixels will be randomized. On the
460 lowest extreme, no pixel is changed.
461
462 This filter also has a probability of not being applied, at all, of 75\%.
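A minimal sketch of this salt and pepper noise, not the pipeline's actual code, could be:

\begin{verbatim}
import numpy as np

def salt_and_pepper(img, complexity, rng=np.random):
    """Sketch: replace a fraction 0.2*complexity of the pixels by U[0,1] values."""
    if rng.rand() < 0.75:                 # not applied 75% of the time
        return img
    out = img.copy()
    mask = rng.rand(*img.shape) < 0.2 * complexity
    out[mask] = rng.rand(int(mask.sum()))
    return out
\end{verbatim}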
463
464 \subsection{Spatially Gaussian Noise}
465
466 Different regions of the image are spatially smoothed.
467 The image is convolved with a symmetric Gaussian kernel of
468 size and variance chosen uniformly in the ranges $[12,12 + 20 \times 427 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
469 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized 428 complexity]$ and $[2,2 + 6 \times complexity]$. This filtered image is normalized
470 between $0$ and $1$. We also create a symmetric averaging window, of the 429 between $0$ and $1$. We also create an isotropic weighted averaging window, of the
471 kernel size, with maximum value at the center. For each image we sample 430 kernel size, with maximum value at the center. For each image we sample
472 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be 431 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
473 averaging centers between the original image and the filtered one. We 432 averaging centers between the original image and the filtered one. We
474 initialize to zero a mask matrix of the image size. For each selected pixel 433 initialize to zero a mask matrix of the image size. For each selected pixel
475 we add to the mask the averaging window centered to it. The final image is 434 we add to the mask the averaging window centered on it. The final image is
476 computed from the following element-wise operation: $\frac{image + filtered 435 computed from the following element-wise operation: $\frac{image + filtered\_image
477 image \times mask}{mask+1}$. 436 \times mask}{mask+1}$.
478 This filter has a probability of not being applied at all of 75\%. 437 This module is skipped with probability 75\%.
479 438 \end{minipage}
480 ---- 439
481 440 %\newpage
482 The aim of this transformation is to filter, with a gaussian kernel, 441
483 different regions of the image. In order to save computing time we decided 442 %\vspace*{-9mm}
484 to convolve the whole image only once with a symmetric Gaussian kernel of 443 \subsubsection*{Permute Pixels}
485 size and variance chosen uniformly in the ranges: $[12,12 + 20 \times 444
486 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized 445 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
487 between $0$ and $1$. We also create a symmetric averaging window, of the 446 %\centering
488 kernel size, with maximum value at the center. For each image we sample 447 \begin{minipage}[t]{0.14\textwidth}
489 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be 448 %\begin{wrapfigure}[7]{l}{
490 averaging centers between the original image and the filtered one. We 449 %\vspace*{-5mm}
491 initialize to zero a mask matrix of the image size. For each selected pixel 450 \begin{center}
492 we add to the mask the averaging window centered to it. The final image is 451 \vspace*{1mm}
493 computed from the following element-wise operation: $\frac{image + filtered 452 \includegraphics[scale=.4]{images/Permutpixel_only.png}
494 image \times mask}{mask+1}$. 453 %{\small\bf Permute Pixels}
495 454 \end{center}
496 This filter has a probability of not being applied, at all, of 75\%. 455 %\end{wrapfigure}
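A hypothetical, simplified sketch of this spatially varying smoothing (the shape of the averaging window is an assumption) could be:

\begin{verbatim}
import numpy as np
from scipy import ndimage

def spatial_gaussian_smoothing(img, complexity, rng=np.random):
    """Sketch: blend the image with a blurred copy around a few random centres."""
    if rng.rand() < 0.75:                 # not applied 75% of the time
        return img
    h, w = img.shape
    size = int(rng.uniform(12, 12 + 20 * complexity))
    sigma = rng.uniform(2, 2 + 6 * complexity)
    filtered = ndimage.gaussian_filter(img, sigma)
    filtered = (filtered - filtered.min()) / max(filtered.max() - filtered.min(), 1e-8)
    ax = np.arange(size) - (size - 1) / 2.0
    # averaging window of the kernel size, maximal at its centre (shape assumed)
    win = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * (size / 4.0) ** 2))
    mask = np.zeros((h, w))
    for _ in range(rng.randint(3, 3 + int(10 * complexity) + 1)):
        cy, cx = rng.randint(h), rng.randint(w)
        for i in range(size):
            for j in range(size):
                yy, xx = cy - size // 2 + i, cx - size // 2 + j
                if 0 <= yy < h and 0 <= xx < w:
                    mask[yy, xx] += win[i, j]
    return (img + filtered * mask) / (mask + 1)
\end{verbatim}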
497 456 \end{minipage}%
498 \subsection{Scratches} 457 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth}
499 458 \vspace*{1mm}
500 The scratches module places line-like white patches on the image. The 459 %%\vspace*{-20mm}
460 This module {\bf permutes neighbouring pixels}. It first selects a
461 fraction $\frac{complexity}{3}$ of pixels randomly in the image. Each
462 of these pixels is then sequentially exchanged with a random pixel
463 among its four nearest neighbors (on its left, right, top or bottom).
464 This module is skipped with probability 80\%.\\
465 %\vspace*{1mm}
466 \end{minipage}
467
468 %\vspace{-3mm}
469
470 \subsubsection*{Gaussian Noise}
471
472 \begin{minipage}[t]{0.14\textwidth}
473 %\begin{wrapfigure}[7]{l}{
474 %%\vspace*{-3mm}
475 \begin{center}
476 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
477 %\centering
478 \vspace*{0mm}
479 \includegraphics[scale=.4]{images/Distorsiongauss_only.png}
480 %{\small \bf Gauss. Noise}
481 \end{center}
482 %\end{wrapfigure}
483 \end{minipage}%
484 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
485 \vspace*{1mm}
486 %\vspace*{12mm}
487 The {\bf Gaussian noise} module simply adds, to each pixel of the image independently, a
488 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
489 This module is skipped with probability 70\%.
490 %%\vspace{1.1cm}
491 \end{minipage}
492
493 %\vspace*{1.2cm}
494
495 \subsubsection*{Background Image Addition}
496
497 \begin{minipage}[t]{\linewidth}
498 \begin{minipage}[t]{0.14\linewidth}
499 \centering
500 \vspace*{0mm}
501 \includegraphics[scale=.4]{images/background_other_only.png}
502 %{\small \bf Bg Image}
503 \end{minipage}%
504 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
505 \vspace*{1mm}
506 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random
507 background image behind the letter, from a randomly chosen natural image,
508 with contrast adjustments depending on $complexity$, to preserve
509 more or less of the original character image.
510 %%\vspace{.8cm}
511 \end{minipage}
512 \end{minipage}
513 %%\vspace{-.7cm}
514
515 \subsubsection*{Salt and Pepper Noise}
516
517 \begin{minipage}[t]{0.14\linewidth}
518 \centering
519 \vspace*{0mm}
520 \includegraphics[scale=.4]{images/Poivresel_only.png}
521 %{\small \bf Salt \& Pepper}
522 \end{minipage}%
523 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
524 \vspace*{1mm}
525 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels.
526 The fraction of selected pixels is $0.2 \times complexity$.
527 This module is skipped with probability 75\%.
528 %%\vspace{.9cm}
529 \end{minipage}
530 %%\vspace{-.7cm}
531
532 %\vspace{1mm}
533 \subsubsection*{Scratches}
534
535 \begin{minipage}[t]{0.14\textwidth}
536 %\begin{wrapfigure}[7]{l}{
537 %\begin{minipage}[t]{0.14\linewidth}
538 %\centering
539 \begin{center}
540 \vspace*{4mm}
541 %\hspace*{-1mm}
542 \includegraphics[scale=.4]{images/Rature_only.png}\\
543 %{\bf Scratches}
544 \end{center}
545 \end{minipage}%
546 %\end{wrapfigure}
547 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
548 %%\vspace{.4cm}
549 The {\bf scratches} module places line-like white patches on the image. The
501 lines are heavily transformed images of the digit ``1'' (one), chosen 550 lines are heavily transformed images of the digit ``1'' (one), chosen
502 at random among five thousand such 1 images. The 1 image is 551 at random among 500 such 1 images,
503 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times 552 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
504 complexity)^2)$, using bi-cubic interpolation, 553 complexity)^2)$ (in degrees), using bi-cubic interpolation.
505 Two passes of a grey-scale morphological erosion filter 554 Two passes of a grey-scale morphological erosion filter
506 are applied, reducing the width of the line 555 are applied, reducing the width of the line
507 by an amount controlled by $complexity$. 556 by an amount controlled by $complexity$.
508 This filter is applied only 15\% of the time. When it is applied, 50\% 557 This module is skipped with probability 85\%. The probabilities
509 of the time, only one patch image is generated and applied. In 30\% of 558 of applying 1, 2, or 3 patches are (50\%,30\%,20\%).
510 cases, two patches are generated, and otherwise three patches are 559 \end{minipage}
511 generated. The patch is applied by taking the maximal value on any given 560
512 patch or the original image, for each of the 32x32 pixel locations. 561 %\vspace*{1mm}
513 562
514 --- 563 \subsubsection*{Grey Level and Contrast Changes}
515 564
516 The scratches module places line-like white patches on the image. The 565 \begin{minipage}[t]{0.15\linewidth}
517 lines are in fact heavily transformed images of the digit "1" (one), chosen 566 \centering
518 at random among five thousand such images of this digit. 567 \vspace*{0mm}
519 568 \includegraphics[scale=.4]{images/Contrast_only.png}
520 Once the image is selected, the transformation begins by finding the first 569 %{\bf Grey Level \& Contrast}
521 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is 570 \end{minipage}%
522 then cropped to the region thus delimited, then this cropped version is 571 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth}
523 expanded to $32\times32$ again. It is then rotated by a random angle having a 572 \vspace*{1mm}
524 Gaussian distribution of mean 90 and standard deviation $100 \times 573 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white
525 complexity$ (in degrees). The rotation is done with bicubic interpolation. 574 to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$
526 575 so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
527 The rotated image is then resized to $50\times50$, with anti-aliasing. In 576 polarity is inverted with probability 50\%.
528 that image, we crop the image again by selecting a region delimited 577 %%\vspace{.7cm}
529 horizontally from $left$ to $left+32$ and vertically from $top$ to $top+32$. 578 \end{minipage}
530 579 %\vspace{2mm}
531 Once this is done, two passes of a greyscale morphological erosion filter 580
532 are applied. Put briefly, this erosion filter reduces the width of the line 581
533 by a certain $smoothing$ amount. For small complexities (< 0.25), 582 \iffalse
534 $smoothing$ is 6, so the line is very small. For complexities ranging from 583 \begin{figure}[ht]
535 0.25 to 0.5, $smoothing$ is 5. It is 4 for complexities 0.5 to 0.75, and 3 584 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\
536 for higher complexities.
537
538 To compensate for border effects, the image is then cropped to 28x28 by
539 removing two pixels everywhere on the borders, then expanded to 32x32
540 again. The pixel values are then linearly expanded such that the minimum
541 value is 0 and the maximal one is 1. Then, 50\% of the time, the image is
542 vertically flipped.
543
544 This filter is applied only 15\% of the time. When it is applied, 50\%
545 of the time, only one patch image is generated and applied. In 30\% of
546 cases, two patches are generated, and otherwise three patches are
547 generated. The patch is applied by taking the maximal value on any given
548 patch or the original image, for each of the 32x32 pixel locations.
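A heavily simplified, hypothetical sketch of the scratches module, assuming a list of ``1'' images is available and applying a single erosion pass, could be:

\begin{verbatim}
import numpy as np
from scipy import ndimage

def add_scratches(img, one_images, complexity, rng=np.random):
    """Sketch: overlay 1-3 transformed images of the digit '1' as white scratches."""
    if rng.rand() < 0.85:                 # not applied 85% of the time
        return img
    n_patches = rng.choice([1, 2, 3], p=[0.5, 0.3, 0.2])
    out = img.copy()
    for _ in range(n_patches):
        patch = one_images[rng.randint(len(one_images))]
        angle = rng.normal(0, 100 * complexity)       # rotation angle in degrees
        patch = ndimage.rotate(patch, angle, reshape=False, order=3)
        # the text describes two erosion passes; one pass shown here for brevity
        patch = ndimage.grey_erosion(patch, size=(3, 3))
        out = np.maximum(out, np.clip(patch, 0.0, 1.0))
    return out
\end{verbatim}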
549
550 \subsection{Grey Level and Contrast Changes}
551
552 This filter changes the contrast and may invert the image polarity (white
553 on black to black on white). The contrast $C$ is defined here as the
554 difference between the maximum and the minimum pixel value of the image.
555 Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
556 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
557 polarity is inverted with $0.5$ probability.
558
559 ---
560 This filter changes the contrast and may invert the image polarity (white
561 on black to black on white). The contrast $C$ is defined here as the
562 difference between the maximum and the minimum pixel value of the image. A
563 contrast value is sampled uniformly between $1$ and $1-0.85 \times
564 complexity$ (this ensures a minimum contrast of $0.15$). We then simply
565 normalize the image to the range $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
566 polarity is inverted with $0.5$ probability.
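A minimal sketch of this grey level and contrast change, not the pipeline's actual code, could be:

\begin{verbatim}
import numpy as np

def change_contrast(img, complexity, rng=np.random):
    """Sketch: rescale grey levels to contrast C and maybe invert polarity."""
    C = rng.uniform(1 - 0.85 * complexity, 1.0)   # contrast in [0.15, 1]
    lo, hi = (1 - C) / 2.0, 1 - (1 - C) / 2.0
    span = max(img.max() - img.min(), 1e-8)
    out = (img - img.min()) / span * (hi - lo) + lo
    if rng.rand() < 0.5:
        out = 1.0 - out                           # invert polarity
    return out
\end{verbatim}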
567
568
569 \begin{figure}[h]
570 \resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\
571 \caption{Illustration of the pipeline of stochastic 585 \caption{Illustration of the pipeline of stochastic
572 transformations applied to the image of a lower-case t 586 transformations applied to the image of a lower-case \emph{t}
573 (the upper left image). Each image in the pipeline (going from 587 (the upper left image). Each image in the pipeline (going from
574 left to right, first top line, then bottom line) shows the result 588 left to right, first top line, then bottom line) shows the result
575 of applying one of the modules in the pipeline. The last image 589 of applying one of the modules in the pipeline. The last image
576 (bottom right) is used as training example.} 590 (bottom right) is used as training example.}
577 \label{fig:pipeline} 591 \label{fig:pipeline}
578 \end{figure} 592 \end{figure}
579 593 \fi
580 594
581 \begin{figure}[h] 595 %\vspace*{-3mm}
582 \resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\ 596 \section{Experimental Setup}
583 \caption{Illustration of each transformation applied to the same image 597 %\vspace*{-1mm}
584 of the upper-case H (upper-left image). First row (from left to right): original image, slant, 598
585 thickness, affine transformation, local elastic deformation; second row (from left to right): 599 Much previous work on deep learning had been performed on
586 pinch, motion blur, occlusion, pixel permutation, Gaussian noise; third row (from left to right): 600 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
587 background image, salt and pepper noise, spatially Gaussian noise, scratches, 601 with 60~000 examples, and variants involving 10~000
588 grey level and contrast changes.} 602 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
589 \label{fig:transfo} 603 The focus here is on much larger training sets, from 10 times to
604 1000 times larger, and 62 classes.
605
606 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
607 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
608 and {\bf OCR data} (scanned machine printed characters). Once a character
609 is sampled from one of these sources (chosen randomly), the second step is to
610 apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}.
611
612 To provide a baseline of error rate comparison we also estimate human performance
613 on both the 62-class task and the 10-class digits task.
614 We compare the best Multi-Layer Perceptrons (MLP) against
615 the best Stacked Denoising Auto-encoders (SDA), when
616 both models' hyper-parameters are selected to minimize the validation set error.
617 We also provide a comparison against a precise estimate
618 of human performance obtained via Amazon's Mechanical Turk (AMT)
619 service (http://mturk.com).
620 AMT users are paid small amounts
621 of money to perform tasks for which human intelligence is required.
622 Mechanical Turk has been used extensively in natural language processing and vision.
623 %processing \citep{SnowEtAl2008} and vision
624 %\citep{SorokinAndForsyth2008,whitehill09}.
625 AMT users were presented
626 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
627 characters. They were forced to choose a single character class (either among the
628 62 or 10 character classes) for each image.
629 80 subjects classified 2500 images per (dataset,task) pair,
630 with the guarantee that 3 different subjects classified each image, allowing
631 us to estimate inter-human variability (e.g., a standard error of 0.1\%
632 on the average 18.2\% error made by humans on the 62-class task NIST test set).
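As a rough back-of-the-envelope check (an approximation added here for clarity,
not the exact procedure used to obtain the quoted figure), the standard error of
an error-rate estimate $\hat{p}$ based on $n$ independent classification
judgments is approximately
$$\mathrm{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},$$
so a standard error on the order of $0.1\%$ at $\hat{p} \approx 0.18$ corresponds
to on the order of $10^5$ independent judgments.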
633
634 %\vspace*{-3mm}
635 \subsection{Data Sources}
636 %\vspace*{-2mm}
637
638 %\begin{itemize}
639 %\item
640 {\bf NIST.}
641 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
642 widely used for training and testing character
643 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
644 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand-checked classifications,
645 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
646 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
647 The fourth partition (called $hsf_4$, 82587 examples),
648 experimentally recognized to be the most difficult one, is the one recommended
649 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
650 for that purpose. We randomly split the remainder (731668 examples) into a training set and a validation set for
651 model selection.
652 The results reported in previous work on this dataset were mostly obtained using only the digits.
653 Here we use all the classes in both the training and testing phases. This is especially
654 useful for estimating the effect of a multi-task setting.
655 The distribution of the classes in the NIST training and test sets differs
656 substantially, with relatively many more digits in the test set, and a more uniform distribution
657 of letters in the test set (whereas in the training set they are distributed
658 more like in natural text).
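For reference, the full 62-class label set can be written down directly (a small
Python sketch; the particular index ordering is an assumption of ours, not
something fixed by the dataset):
\begin{verbatim}
import string

# Digits, upper-case letters, then lower-case letters: 10 + 26 + 26 = 62.
CLASSES = list(string.digits + string.ascii_uppercase + string.ascii_lowercase)
assert len(CLASSES) == 62
label_to_index = {c: i for i, c in enumerate(CLASSES)}
\end{verbatim}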
659 %\vspace*{-1mm}
660
661 %\item
662 {\bf Fonts.}
663 In order to have a good variety of sources we downloaded a large number of free fonts from:
664 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
665 % TODO: pointless to anonymize, it's not pointing to our work
666 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
667 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
668 directly as input to our models.
669 %\vspace*{-1mm}
670
671 %\item
672 {\bf Captchas.}
673 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
674 generating characters of the same format as the NIST dataset. This software is based on
675 a random character class generator and various kinds of transformations similar to those described in the previous sections.
676 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
677 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
678 depending on the value of the complexity parameter provided by the user of the data source.
679 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
680 %\vspace*{-1mm}
681
682 %\item
683 {\bf OCR data.}
684 A large set (2 million) of scanned, OCRed and manually verified machine-printed
685 characters was included as an
686 additional source. This set is part of a larger corpus being collected by the Image Understanding
687 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
688 ({\tt http://www.iupr.com}), and which will be publicly released.
689 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
690 %\end{itemize}
691
692 %\vspace*{-3mm}
693 \subsection{Data Sets}
694 %\vspace*{-2mm}
695
696 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
697 from one of the 62 character classes.
698 %\begin{itemize}
699 %\vspace*{-1mm}
700
701 %\item
702 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
703 \{651668 / 80000 / 82587\} \{training / validation / test\} examples.
704 %\vspace*{-1mm}
705
706 %\item
707 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
708 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
709 For each new example to be generated, a data source is first selected: with probability $10\%$ from the fonts,
710 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
711 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
712 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
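The generation of one P07 example can thus be sketched as follows (a Python
sketch: the {\tt draw\_from\_source} callback and the {\tt pipeline} list of
transformation modules are placeholders for the generators described above;
only the sampling logic follows the text):
\begin{verbatim}
import numpy as np

SOURCES = ["font", "captcha", "ocr", "nist"]
PROBS   = [0.10,   0.25,      0.25,  0.40]

def generate_p07_example(draw_from_source, pipeline, rng=np.random):
    source = rng.choice(SOURCES, p=PROBS)      # pick a data source
    image = draw_from_source(source)
    for transform in pipeline:                 # fixed module order
        complexity = rng.uniform(0.0, 0.7)     # one complexity per module
        image = transform(image, complexity)
    return image
\end{verbatim}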
713 %\vspace*{-1mm}
714
715 %\item
716 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
717 except that we only apply
718 transformations from slant to pinch. Therefore, the character is
719 transformed but no additional noise is added to the image, giving images
720 closer to the NIST dataset.
721 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
722 %\end{itemize}
723
724 %\vspace*{-3mm}
725 \subsection{Models and their Hyperparameters}
726 %\vspace*{-2mm}
727
728 The experiments are performed using MLPs (with a single
729 hidden layer) and SDAs.
730 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
731
732 {\bf Multi-Layer Perceptrons (MLP).}
733 Whereas previous work had compared deep architectures to both shallow MLPs and
734 SVMs, we only compared to MLPs here because of the very large datasets used
735 (making the use of SVMs computationally challenging because of their quadratic
736 scaling behavior).
737 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
738 exponentials) on the output layer for estimating $P(class | image)$.
739 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
740 Training examples are presented in minibatches of size 20. A constant learning
741 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
742 %through preliminary experiments (measuring performance on a validation set),
743 %and $0.1$ (which was found to work best) was then selected for optimizing on
744 %the whole training sets.
745 %\vspace*{-1mm}
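A minimal sketch of the corresponding prediction function and of the
hyper-parameter grids just listed (plain NumPy; shapes and names are our own
illustrative choices):
\begin{verbatim}
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # Single tanh hidden layer, softmax output: estimates P(class | image).
    # x: (d,), W1: (n_hidden, d), b1: (n_hidden,), W2: (62, n_hidden), b2: (62,)
    h = np.tanh(b1 + W1.dot(x))
    a = b2 + W2.dot(h)
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

N_HIDDEN_GRID      = [300, 500, 800, 1000, 1500]
LEARNING_RATE_GRID = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]
MINIBATCH_SIZE     = 20
\end{verbatim}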
746
747
748 {\bf Stacked Denoising Auto-Encoders (SDA).}
749 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
750 can be used to initialize the weights of each layer of a deep MLP (with many hidden
751 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
752 apparently setting parameters in the
753 basin of attraction of supervised gradient descent yielding better
754 generalization~\citep{Erhan+al-2010}. It is hypothesized that the
755 advantage brought by this procedure stems from a better prior,
756 on the one hand taking advantage of the link between the input
757 distribution $P(x)$ and the conditional distribution of interest
758 $P(y|x)$ (like in semi-supervised learning), and on the other hand
759 taking advantage of the expressive power and bias implicit in the
760 deep architecture (whereby complex concepts are expressed as
761 compositions of simpler ones through a deep hierarchy).
762
763 \begin{figure}[ht]
764 %\vspace*{-2mm}
765 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
766 %\vspace*{-2mm}
767 \caption{Illustration of the computations and training criterion for the denoising
768 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
769 the layer (i.e. raw input or output of previous layer)
770 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
771 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
772 is compared to the uncorrupted input $x$ through the loss function
773 $L_H(x,z)$, whose expected value is approximately minimized during training
774 by tuning $\theta$ and $\theta'$.}
775 \label{fig:da}
776 %\vspace*{-2mm}
590 \end{figure} 777 \end{figure}
591 778
592 779 Here we chose to use the Denoising
593 \section{Experimental Setup} 780 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
594 781 these deep hierarchies of features, as it is simple to train and
595 \subsection{Training Datasets} 782 explain (see Figure~\ref{fig:da}, as well as
596 783 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
597 \subsubsection{Data Sources} 784 provides efficient inference, and yielded results
598 785 comparable or better than RBMs in series of experiments
599 \begin{itemize} 786 comparable to or better than RBMs in a series of experiments
600 \item {\bf NIST} 787 Auto-encoder is presented with a stochastically corrupted version
601 The NIST Special Database 19 (NIST19) is a very widely used dataset for training and testing OCR systems. 788 of the input and trained to reconstruct the uncorrupted input,
602 The dataset is composed with over 800 000 digits and characters (upper and lower cases), with hand checked classifications, 789 forcing the hidden units to represent the leading regularities in
603 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 790 the data. Here we use the random binary masking corruption
604 corresponding to "0"-"9","A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity. 791 (which sets to 0 a random subset of the inputs).
605 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one for classification task is recommended 792 Once it is trained, in a purely unsupervised way,
606 by NIST as testing set and is used in our work for that purpose. 793 its hidden units' activations can
607 The performances reported by previous work on that dataset mostly use only the digits. 794 be used as inputs for training a second one, etc.
608 Here we use the whole classes both in the training and testing phase. 795 After this unsupervised pre-training stage, the parameters
609 796 are used to initialize a deep MLP, which is fine-tuned by
610 797 the same standard procedure used to train them (see previous section).
611 \item {\bf Fonts} 798 The SDA hyper-parameters are the same as for the MLP, with the addition of the
612 In order to have a good variety of sources we downloaded an important number of free fonts from: {\tt http://anonymous.url.net} 799 amount of corruption noise (we used the masking noise process, whereby a
613 %real adress {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} 800 fixed proportion of the input values, randomly selected, are zeroed), and a
614 in addition to Windows 7's, this adds up to a total of $9817$ different fonts that we can choose uniformly. 801 separate learning rate for the unsupervised pre-training stage (selected
615 The ttf file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, 802 from the same above set). The fraction of inputs corrupted was selected
616 directly as input to our models. 803 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
617 %Guillaume are there other details I forgot on the font selection? 804 of hidden layers but it was fixed to 3 based on previous work with
618 805 SDAs on MNIST~\citep{VincentPLarochelleH2008}.
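One pre-training update of such a denoising auto-encoder can be sketched as
follows (a NumPy sketch assuming tied weights, sigmoid units and a
cross-entropy reconstruction loss; these choices and all names are assumptions
consistent with the description above, not the exact implementation):
\begin{verbatim}
import numpy as np

def masking_corruption(x, fraction, rng=np.random):
    # Set a randomly selected subset (`fraction`) of the inputs to 0.
    return x * (rng.uniform(size=x.shape) >= fraction)

def dae_update(x, W, b, b_prime, corruption, lr, rng=np.random):
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    x_tilde = masking_corruption(x, corruption, rng)
    y = sigmoid(b + W.dot(x_tilde))          # encoder f_theta
    z = sigmoid(b_prime + W.T.dot(y))        # decoder g_theta' (tied weights)
    dz = z - x                               # grad of cross-entropy L_H(x, z)
    dy = W.dot(dz) * y * (1.0 - y)
    W -= lr * (np.outer(dy, x_tilde) + np.outer(y, dz))
    b -= lr * dy
    b_prime -= lr * dz
    return W, b, b_prime
\end{verbatim}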
619 \item {\bf Captchas} 806
620 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a python based captcha generator library) for 807 %\vspace*{-1mm}
621 generating characters of the same format as the NIST dataset. The core of this data source is composed with a random character 808
622 generator and various kinds of transformations similar to those described in the previous sections. 809 \begin{figure}[ht]
623 In order to increase the variability of the data generated, different fonts are used for generating the characters. 810 %\vspace*{-2mm}
624 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity 811 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
625 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are 812 %\vspace*{-3mm}
626 allowed and can be controlled via an easy to use facade class. 813 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
627 \item {\bf OCR data} 814 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
628 \end{itemize} 815 of all models, on NIST and NISTP test sets.
629 816 Right: error rates on NIST test digits only, along with the previous results from
630 \subsubsection{Data Sets} 817 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
631 \begin{itemize} 818 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
632 \item {\bf P07} 819 \label{fig:error-rates-charts}
633 The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. 820 %\vspace*{-2mm}
634 For each new example to be generated, we choose one source with the following probabilities: $0.1$ for the fonts,
635 $0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in their order
636 and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
637 \item {\bf NISTP} {\em use NISTP rather than PNIST, to remain politically correct...}
638 NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion) except that we only apply transformations from slant to pinch. Therefore, the character is transformed
639 but no additional noise is added to the image, which gives images closer to the NIST dataset.
640 \end{itemize}
641
642 We noticed that the distributions of the training sets and the test sets differ.
643 Since our validation sets are sampled from the training set, they have approximately the same distribution, but the test set has a completely different distribution, as illustrated in Figure~\ref{setsdata}.
644
645 \begin{figure}
646 \subfigure[NIST training]{\includegraphics[width=0.5\textwidth]{images/nisttrainstats}}
647 \subfigure[NIST validation]{\includegraphics[width=0.5\textwidth]{images/nistvalidstats}}
648 \subfigure[NIST test]{\includegraphics[width=0.5\textwidth]{images/nistteststats}}
649 \subfigure[NISTP validation]{\includegraphics[width=0.5\textwidth]{images/nistpvalidstats}}
650 \caption{Proportion of each class in some of the data sets}
651 \label{setsdata}
652 \end{figure} 821 \end{figure}
653 822
654 \subsection{Models and their Hyperparameters} 823
655 824 \begin{figure}[ht]
656 \subsubsection{Multi-Layer Perceptrons (MLP)} 825 %\vspace*{-3mm}
657 826 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
658 An MLP is a family of functions that are described by stacking layers of a function similar to 827 %\vspace*{-3mm}
659 $$g(x) = \tanh(b+Wx)$$ 828 \caption{Relative improvement in error rate due to self-taught learning.
660 The input, $x$, is a $d$-dimension vector. 829 Left: Improvement (or loss, when negative)
661 The output, $g(x)$, is a $m$-dimension vector. 830 induced by out-of-distribution examples (perturbed data).
662 The parameter $W$ is a $m\times d$ matrix and is called the weight matrix. 831 Right: Improvement (or loss, when negative) induced by multi-task
663 The parameter $b$ is a $m$-vector and is called the bias vector. 832 learning (training on all classes and testing only on either digits,
664 The non-linearity (here $\tanh$) is applied element-wise to the output vector. 833 upper case, or lower-case). The deep learner (SDA) benefits more from
665 Usually the input is referred to as the input layer, and similarly for the output. 834 both self-taught learning scenarios, compared to the shallow MLP.}
666 You can of course chain several such functions to obtain a more complex one. 835 \label{fig:improvements-charts}
667 Here is a common example 836 %\vspace*{-2mm}
668 $$f(x) = c + V\tanh(b+Wx)$$ 837 \end{figure}
669 In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer.
670 Here the output layer does not have the same non-linearity as the hidden layer.
671 This is a common case where some specialized non-linearity is applied to the output layer only depending on the task at hand.
672
673 If you put 3 or more hidden layers in such a network you obtain what is called a deep MLP.
674 The parameters to adapt are the weight matrix and the bias vector for each layer.
675
676 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
677 \label{SdA}
678
679 Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization.
680 This is essentially unsupervised training where the layer is made to reconstruct its input through an encoding and decoding phase.
681 Denoising auto-encoders are a variant where the input is corrupted with random noise but the target is the uncorrupted input.
682 The principle behind these initialization methods is that the network will learn the inherent relation between portions of the data and be able to represent them thus helping with whatever task we want to perform.
683
684 An auto-encoder unit is formed of two MLP layers with the bottom one called the encoding layer and the top one the decoding layer.
685 Usually the top and bottom weight matrices are the transpose of each other and are fixed this way.
686 The network is trained as such and, when sufficiently trained, the MLP layer is initialized with the parameters of the encoding layer.
687 The other parameters are discarded.
688
689 The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder starting from the bottom.
690 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder.
691 The second and following layers receive the same treatment except that they take as input the encoded version of the data that has gone through the layers before it.
692 For additional details see \cite{vincent:icml08}.
693 838
694 \section{Experimental Results} 839 \section{Experimental Results}
695 840 %\vspace*{-2mm}
696 \subsection{SDA vs MLP vs Humans} 841
697 842 %%\vspace*{-1mm}
698 We compare here the best MLP (according to validation set error) that we found against 843 %\subsection{SDA vs MLP vs Humans}
699 the best SDA (again according to validation set error), along with a precise estimate 844 %%\vspace*{-1mm}
700 of human performance obtained via Amazon's Mechanical Turk (AMT) 845 The models are either trained on NIST (MLP0 and SDA0),
701 service\footnote{http://mturk.com}. AMT users are paid small amounts 846 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested
702 of money to perform tasks for which human intelligence is required. 847 on either NIST, NISTP or P07, either on the 62-class task
703 Mechanical Turk has been used extensively in natural language 848 or on the 10-digits task. Training (including about half
704 processing \cite{SnowEtAl2008} and vision 849 for unsupervised pre-training, for DAs) on the larger
705 \cite{SorokinAndForsyth2008,whitehill09}. AMT users where presented 850 datasets takes around one day on a GPU-285.
706 with 10 character images and asked to type 10 corresponding ascii 851 Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
707 characters. Hence they were forced to make a hard choice among the 852 comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1,
708 62 character classes. Three users classified each image, allowing 853 SDA2), along with the previous results on the digits NIST special database
709 to estimate inter-human variability (shown as +/- in parenthesis below). 854 19 test set from the literature, respectively based on ARTMAP neural
710 855 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search
711 \begin{table} 856 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs
712 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + 857 ~\citep{Milgram+al-2005}. More detailed and complete numerical results
713 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training 858 (figures and tables, including standard errors on the error rates) can be
714 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture 859 found in Appendix I of the supplementary material.
715 (MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07) 860 The deep learner not only outperformed the shallow ones and
716 and using a validation set to select hyper-parameters and other training choices. 861 previously published performance (in a statistically and qualitatively
717 \{SDA,MLP\}0 are trained on NIST, 862 significant way) but when trained with perturbed data
718 \{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07. 863 reaches human performance on both the 62-class task
719 The human error rate on digits is a lower bound because it does not count digits that were 864 and the 10-class (digits) task.
720 recognized as letters. For comparison, the results found in the literature 865 17\% error (SDA1) or 18\% error (humans) may seem large but a large
721 on NIST digits classification using the same test set are included.} 866 majority of the errors from humans and from SDA1 are from out-of-context
722 \label{tab:sda-vs-mlp-vs-humans} 867 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
723 \begin{center} 868 ``c'' and a ``C'' are often indistinguishable).
724 \begin{tabular}{|l|r|r|r|r|} \hline 869
725 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline 870 In addition, as shown in the left of
726 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $>1.1\%$ \\ \hline 871 Figure~\ref{fig:improvements-charts}, the relative improvement in error
727 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline 872 rate brought by self-taught learning is greater for the SDA, and these
728 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline 873 differences with the MLP are statistically and qualitatively
729 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline 874 significant.
730 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline 875 The left side of the figure shows the improvement to the clean
731 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline 876 NIST test set error brought by the use of out-of-distribution examples
732 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline 877 (i.e. the perturbed examples from NISTP or P07).
733 [5] & & & & 4.95\% $\pm$.18\% \\ \hline 878 Relative percent change is measured by taking
734 [2] & & & & 3.71\% $\pm$.16\% \\ \hline 879 $100 \% \times$ (original model's error / perturbed-data model's error - 1).
735 [3] & & & & 2.4\% $\pm$.13\% \\ \hline 880 The right side of
736 [4] & & & & 2.1\% $\pm$.12\% \\ \hline 881 Figure~\ref{fig:improvements-charts} shows the relative improvement
737 \end{tabular} 882 brought by the use of a multi-task setting, in which the same model is
738 \end{center} 883 trained for more classes than the target classes of interest (i.e. training
739 \end{table} 884 with all 62 classes when the target classes are respectively the digits,
740 885 lower-case, or upper-case characters). Again, whereas the gain from the
741 \subsection{Perturbed Training Data More Helpful for SDAE} 886 multi-task setting is marginal or negative for the MLP, it is substantial
742 887 for the SDA. Note that to simplify these multi-task experiments, only the original
743 \begin{table} 888 NIST dataset is used. For example, the MLP-digits bar shows the relative
744 \caption{Relative change in error rates due to the use of perturbed training data, 889 percent improvement in MLP error rate on the NIST digits test set
745 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. 890 is $100\% \times$ (single-task
746 A positive value indicates that training on the perturbed data helped for the 891 model's error / multi-task model's error - 1). The single-task model is
747 given test set (the first 3 columns on the 62-class tasks and the last one is 892 trained with only 10 outputs (one per digit), seeing only digit examples,
748 on the clean 10-class digits). Clearly, the deep learning models did benefit more 893 whereas the multi-task model is trained with 62 outputs, with all 62
749 from perturbed training data, even when testing on clean data, whereas the MLP 894 character classes as examples. Hence the hidden units are shared across
750 trained on perturbed data performed worse on the clean digits and about the same 895 all tasks. For the multi-task model, the digit error rate is measured by
751 on the clean characters. } 896 comparing the correct digit class with the output class associated with the
752 \label{tab:sda-vs-mlp-vs-humans} 897 maximum conditional probability among only the digit classes outputs. The
753 \begin{center} 898 setting is similar for the other two target classes (lower case characters
754 \begin{tabular}{|l|r|r|r|r|} \hline 899 and upper case characters).
755 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline 900 %%\vspace*{-1mm}
756 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline 901 %\subsection{Perturbed Training Data More Helpful for SDA}
757 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline 902 %%\vspace*{-1mm}
758 MLP0/MLP1-1 & 5.2\% & 65\% & -13\% & -10\% \\ \hline 903
759 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline 904 %%\vspace*{-1mm}
760 \end{tabular} 905 %\subsection{Multi-Task Learning Effects}
761 \end{center} 906 %%\vspace*{-1mm}
762 \end{table} 907
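As a concrete instance of this relative-change measure (a worked example of the
arithmetic only, using the SDA0 and SDA1 error rates on the clean NIST test set
reported above):
$$100\% \times \left(\frac{23.7\%}{17.1\%} - 1\right) \approx 38\%,$$
i.e. on the clean NIST test set the error of the NIST-only model (SDA0) is about
$38\%$ higher, in relative terms, than that of the model trained on perturbed
data (SDA1).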
763 908 \iffalse
764
765 \subsection{Multi-Task Learning Effects}
766
767 As previously seen, the SDA is better able to benefit from the 909 As previously seen, the SDA is better able to benefit from the
768 transformations applied to the data than the MLP. In this experiment we 910 transformations applied to the data than the MLP. In this experiment we
769 define three tasks: recognizing digits (knowing that the input is a digit), 911 define three tasks: recognizing digits (knowing that the input is a digit),
770 recognizing upper case characters (knowing that the input is one), and 912 recognizing upper case characters (knowing that the input is one), and
771 recognizing lower case characters (knowing that the input is one). We 913 recognizing lower case characters (knowing that the input is one). We
781 fine-tuned on NIST. 923 fine-tuned on NIST.
782 924
783 Our results show that the MLP benefits marginally from the multi-task setting 925 Our results show that the MLP benefits marginally from the multi-task setting
784 in the case of digits (5\% relative improvement) but is actually hurt in the case 926 in the case of digits (5\% relative improvement) but is actually hurt in the case
785 of characters (respectively 3\% and 4\% worse for lower and upper class characters). 927 of characters (respectively 3\% and 4\% worse for lower and upper class characters).
786 On the other hand the SDA benefitted from the multi-task setting, with relative 928 On the other hand the SDA benefited from the multi-task setting, with relative
787 error rate improvements of 27\%, 15\% and 13\% respectively for digits, 929 error rate improvements of 27\%, 15\% and 13\% respectively for digits,
788 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. 930 lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
789 931 \fi
790 \begin{table} 932
791 \caption{Test error rates and relative change in error rates due to the use of 933
792 a multi-task setting, i.e., training on each task in isolation vs training 934 %\vspace*{-2mm}
793 for all three tasks together, for MLPs vs SDAs. The SDA benefits much 935 \section{Conclusions and Discussion}
794 more from the multi-task setting. All experiments are on only the 936 %\vspace*{-2mm}
795 unperturbed NIST data, using validation error for model selection. 937
796 Relative improvement is 1 - single-task error / multi-task error.} 938 We have found that the self-taught learning framework is more beneficial
797 \label{tab:multi-task} 939 to a deep learner than to a traditional shallow and purely
798 \begin{center} 940 supervised learner. More precisely,
799 \begin{tabular}{|l|r|r|r|} \hline 941 the answers are positive for all the questions asked in the introduction.
800 & single-task & multi-task & relative \\ 942 %\begin{itemize}
801 & setting & setting & improvement \\ \hline 943
802 MLP-digits & 3.77\% & 3.99\% & 5.6\% \\ \hline 944 $\bullet$ %\item
803 MLP-lower & 17.4\% & 16.8\% & -4.1\% \\ \hline 945 {\bf Do the good results previously obtained with deep architectures on the
804 MLP-upper & 7.84\% & 7.54\% & -3.6\% \\ \hline 946 MNIST digits generalize to a much larger and richer (but similar)
805 SDA-digits & 2.6\% & 3.56\% & 27\% \\ \hline 947 dataset, the NIST special database 19, with 62 classes and around 800k examples}?
806 SDA-lower & 12.3\% & 14.4\% & 15\% \\ \hline 948 Yes, the SDA {\em systematically outperformed the MLP and all the previously
807 SDA-upper & 5.93\% & 6.78\% & 13\% \\ \hline 949 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level
808 \end{tabular} 950 performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
809 \end{center} 951
810 \end{table} 952 $\bullet$ %\item
811 953 {\bf To what extent do self-taught learning scenarios help deep learners,
812 \section{Conclusions} 954 and do they help them more than shallow supervised ones}?
813 955 We found that distorted training examples not only made the resulting
814 \bibliography{strings,ml,aigaion,specials} 956 classifier better on similarly perturbed images but also on
815 \bibliographystyle{mlapa} 957 the {\em original clean examples}, and more importantly and more novel,
958 that deep architectures benefit more from such {\em out-of-distribution}
959 examples. MLPs were helped by perturbed training examples when tested on perturbed input
960 images (65\% relative improvement on NISTP)
961 but only marginally helped (5\% relative improvement on all classes)
962 or even hurt (10\% relative loss on digits)
963 with respect to clean examples. On the other hand, the deep SDAs
964 were significantly boosted by these out-of-distribution examples.
965 Similarly, whereas the improvement due to the multi-task setting was marginal or
966 negative for the MLP (from +5.6\% to -3.6\% relative change),
967 it was quite significant for the SDA (from +13\% to +27\% relative change),
968 which may be explained by the arguments below.
969 %\end{itemize}
970
971 In the original self-taught learning framework~\citep{RainaR2007}, the
972 out-of-sample examples were used as a source of unsupervised data, and
973 experiments showed its positive effects in a \emph{limited labeled data}
974 scenario. However, many of the results by \citet{RainaR2007} (who used a
975 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught
976 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases.
977 We note instead that, for deep
978 architectures, our experiments show that such a positive effect is accomplished
979 even in a scenario with a \emph{large number of labeled examples},
980 i.e., here, the relative gain of self-taught learning is probably preserved
981 in the asymptotic regime.
982
983 {\bf Why would deep learners benefit more from the self-taught learning framework}?
984 The key idea is that the lower layers of the predictor compute a hierarchy
985 of features that can be shared across tasks or across variants of the
986 input distribution. Intermediate features that can be used in different
987 contexts can be estimated in a way that allows one to share statistical
988 strength. Features extracted through many levels are more likely to
989 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
990 increasing the likelihood that they would be useful for a larger array
991 of tasks and input conditions.
992 Therefore, we hypothesize that both depth and unsupervised
993 pre-training play a part in explaining the advantages observed here, and future
994 experiments could attempt to tease apart these factors.
995 And why would deep learners benefit from the self-taught learning
996 scenarios even when the number of labeled examples is very large?
997 We hypothesize that this is related to the hypotheses studied
998 in~\citet{Erhan+al-2010}. Whereas in~\citet{Erhan+al-2010}
999 it was found that online learning on a huge dataset did not make the
1000 advantage of the deep learning bias vanish, a similar phenomenon
1001 may be happening here. We hypothesize that unsupervised pre-training
1002 of a deep hierarchy with self-taught learning initializes the
1003 model in the basin of attraction of supervised gradient descent
1004 that corresponds to better generalization. Furthermore, such good
1005 basins of attraction are not discovered by pure supervised learning
1006 (with or without self-taught settings), and more labeled examples
1007 do not allow the model to go from the poorer basins of attraction discovered
1008 by the purely supervised shallow models to the kind of better basins associated
1009 with deep learning and self-taught learning.
1010
1011 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
1012 can be executed on-line at {\tt http://deep.host22.com}.
1013
1014 \newpage
1015 {
1016 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,aigaion-shorter,specials}
1017 %\bibliographystyle{plainnat}
1018 \bibliographystyle{unsrtnat}
1019 %\bibliographystyle{apalike}
1020 }
1021
816 1022
817 \end{document} 1023 \end{document}