comparison writeup/techreport.tex @ 582:9ebb335ca904
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Sat, 18 Sep 2010 16:32:08 -0400 |
parents | 8aad1c6ec39a |
children | ae77edb9df67 |
comparison
581:df749e70f637 | 582:9ebb335ca904 |
---|---|
1 \documentclass[12pt,letterpaper]{article} | 1 \documentclass{article} % For LaTeX2e |
2 \usepackage{times} | |
3 \usepackage{wrapfig} | |
4 \usepackage{amsthm,amsmath,bbm} | |
5 \usepackage[psamsfonts]{amssymb} | |
6 \usepackage{algorithm,algorithmic} | |
2 \usepackage[utf8]{inputenc} | 7 \usepackage[utf8]{inputenc} |
3 \usepackage{graphicx} | 8 \usepackage{graphicx,subfigure} |
4 \usepackage{times} | 9 \usepackage[numbers]{natbib} |
5 \usepackage{mlapa} | 10 |
6 \usepackage{subfigure} | 11 \addtolength{\textwidth}{10mm} |
12 \addtolength{\evensidemargin}{-5mm} | |
13 \addtolength{\oddsidemargin}{-5mm} | |
14 | |
15 %\setlength\parindent{0mm} | |
16 | |
17 \title{Deep Self-Taught Learning for Handwritten Character Recognition} | |
18 \author{ | |
19 Frédéric Bastien \and | |
20 Yoshua Bengio \and | |
21 Arnaud Bergeron \and | |
22 Nicolas Boulanger-Lewandowski \and | |
23 Thomas Breuel \and | |
24 Youssouf Chherawala \and | |
25 Moustapha Cisse \and | |
26 Myriam Côté \and | |
27 Dumitru Erhan \and | |
28 Jeremy Eustache \and | |
29 Xavier Glorot \and | |
30 Xavier Muller \and | |
31 Sylvain Pannetier Lebeuf \and | |
32 Razvan Pascanu \and | |
33 Salah Rifai \and | |
34 Francois Savard \and | |
35 Guillaume Sicard | |
36 } | |
37 \date{June 8th, 2010, Technical Report 1353, Dept. IRO, U. Montreal} | |
7 | 38 |
8 \begin{document} | 39 \begin{document} |
9 \title{Generating and Exploiting Perturbed and Multi-Task Handwritten Training Data for Deep Architectures} | 40 |
10 \author{The IFT6266 Gang} | 41 %\makeanontitle |
11 \date{April 2010, Technical Report, Dept. IRO, U. Montreal} | |
12 | |
13 \maketitle | 42 \maketitle |
14 | 43 |
44 %\vspace*{-2mm} | |
15 \begin{abstract} | 45 \begin{abstract} |
16 Recent theoretical and empirical work in statistical machine learning has | 46 Recent theoretical and empirical work in statistical machine learning has |
17 demonstrated the importance of learning algorithms for deep | 47 demonstrated the importance of learning algorithms for deep |
18 architectures, i.e., function classes obtained by composing multiple | 48 architectures, i.e., function classes obtained by composing multiple |
19 non-linear transformations. In the area of handwriting recognition, | 49 non-linear transformations. Self-taught learning (exploiting unlabeled |
20 deep learning algorithms | 50 examples or examples from other distributions) has already been applied |
21 had been evaluated on rather small datasets with a few tens of thousands | 51 to deep learners, but mostly to show the advantage of unlabeled |
22 of examples. Here we propose a powerful generator of variations | 52 examples. Here we explore the advantage brought by {\em out-of-distribution examples}. |
23 of examples for character images based on a pipeline of stochastic | 53 For this purpose we |
24 transformations that include not only the usual affine transformations | 54 developed a powerful generator of stochastic variations and noise |
25 but also the addition of slant, local elastic deformations, changes | 55 processes for character images, including not only affine transformations |
26 in thickness, background images, grey level, contrast, occlusion, and | 56 but also slant, local elastic deformations, changes in thickness, |
27 various types of pixel and spatially correlated noise. | 57 background images, grey level changes, contrast, occlusion, and various |
28 We evaluate a deep learning algorithm (Stacked Denoising Autoencoders) | 58 types of noise. The out-of-distribution examples are obtained from these |
29 on the task of learning to classify digits and letters transformed | 59 highly distorted images or by including examples of object classes |
30 with this pipeline, using the hundreds of millions of generated examples | 60 different from those in the target test set. |
31 and testing on the full 62-class NIST test set. | 61 We show that {\em deep learners benefit |
32 We find that the SDA outperforms its | 62 more from them than a corresponding shallow learner}, at least in the area of |
33 shallow counterpart, an ordinary Multi-Layer Perceptron, | 63 handwritten character recognition. In fact, we show that they reach |
34 and that it is better able to take advantage of the additional | 64 human-level performance on both handwritten digit classification and |
35 generated data, as well as better able to take advantage of | 65 62-class handwritten character recognition. |
36 the multi-task setting, i.e., | |
37 training from more classes than those of interest in the end. | |
38 In fact, we find that the SDA reaches human performance as | |
39 estimated by the Amazon Mechanical Turk on the 62-class NIST test characters. | |
40 \end{abstract} | 66 \end{abstract} |
67 %\vspace*{-3mm} | |
41 | 68 |
42 \section{Introduction} | 69 \section{Introduction} |
43 | 70 %\vspace*{-1mm} |
44 Deep Learning has emerged as a promising new area of research in | 71 |
45 statistical machine learning (see~\emcite{Bengio-2009} for a review). | 72 {\bf Deep Learning} has emerged as a promising new area of research in |
73 statistical machine learning (see~\citet{Bengio-2009} for a review). | |
46 Learning algorithms for deep architectures are centered on the learning | 74 Learning algorithms for deep architectures are centered on the learning |
47 of useful representations of data, which are better suited to the task at hand. | 75 of useful representations of data, which are better suited to the task at hand. |
48 This is in great part inspired by observations of the mammalian visual cortex, | 76 This is in part inspired by observations of the mammalian visual cortex, |
49 which consists of a chain of processing elements, each of which is associated with a | 77 which consists of a chain of processing elements, each of which is associated with a |
50 different representation. In fact, | 78 different representation of the raw visual input. In fact, |
51 it was found recently that the features learnt in deep architectures resemble | 79 it was found recently that the features learnt in deep architectures resemble |
52 those observed in the first two of these stages (in areas V1 and V2 | 80 those observed in the first two of these stages (in areas V1 and V2 |
53 of visual cortex)~\cite{HonglakL2008}. | 81 of visual cortex)~\citep{HonglakL2008}, and that they become more and |
54 Processing images typically involves transforming the raw pixel data into | 82 more invariant to factors of variation (such as camera movement) in |
55 new {\bf representations} that can be used for analysis or classification. | 83 higher layers~\citep{Goodfellow2009}. |
56 For example, a principal component analysis representation linearly projects | 84 Learning a hierarchy of features increases the |
57 the input image into a lower-dimensional feature space. | |
58 Why learn a representation? Current practice in the computer vision | |
59 literature converts the raw pixels into a hand-crafted representation | |
60 (e.g.\ SIFT features~\cite{Lowe04}), but deep learning algorithms | |
61 tend to discover similar features in their first few | |
62 levels~\cite{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}. | |
63 Learning increases the | |
64 ease and practicality of developing representations that are at once | 85 ease and practicality of developing representations that are at once |
65 tailored to specific tasks, yet are able to borrow statistical strength | 86 tailored to specific tasks, yet are able to borrow statistical strength |
66 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the | 87 from other related tasks (e.g., modeling different kinds of objects). Finally, learning the |
67 feature representation can lead to higher-level (more abstract, more | 88 feature representation can lead to higher-level (more abstract, more |
68 general) features that are more robust to unanticipated sources of | 89 general) features that are more robust to unanticipated sources of |
69 variance extant in real data. | 90 variance extant in real data. |
70 | 91 |
92 {\bf Self-taught learning}~\citep{RainaR2007} is a paradigm that combines principles | |
93 of semi-supervised and multi-task learning: the learner can exploit examples | |
94 that are unlabeled and possibly come from a distribution different from the target | |
95 distribution, e.g., from other classes than those of interest. | |
96 It has already been shown that deep learners can clearly take advantage of | |
97 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | |
98 but more needs to be done to explore the impact | |
99 of {\em out-of-distribution} examples and of the multi-task setting | |
100 (one exception is~\citep{CollobertR2008}, which uses a different kind | |
101 of learning algorithm). In particular the {\em relative | |
102 advantage} of deep learning for these settings has not been evaluated. | |
103 The hypothesis discussed in the conclusion is that a deep hierarchy of features | |
104 may be better able to provide sharing of statistical strength | |
105 between different regions in input space or different tasks. | |
106 | |
107 \iffalse | |
71 Whereas a deep architecture can in principle be more powerful than a | 108 Whereas a deep architecture can in principle be more powerful than a |
72 shallow one in terms of representation, depth appears to render the | 109 shallow one in terms of representation, depth appears to render the |
73 training problem more difficult in terms of optimization and local minima. | 110 training problem more difficult in terms of optimization and local minima. |
74 It is also only recently that successful algorithms were proposed to | 111 It is also only recently that successful algorithms were proposed to |
75 overcome some of these difficulties. All are based on unsupervised | 112 overcome some of these difficulties. All are based on unsupervised |
76 learning, often in a greedy layer-wise ``unsupervised pre-training'' | 113 learning, often in a greedy layer-wise ``unsupervised pre-training'' |
77 stage~\cite{Bengio-2009}. One of these layer initialization techniques, | 114 stage~\citep{Bengio-2009}. One of these layer initialization techniques, |
78 applied here, is the Denoising | 115 applied here, is the Denoising |
79 Auto-Encoder~(DEA)~\cite{VincentPLarochelleH2008-very-small}, which | 116 Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), |
117 which | |
80 performed similarly or better than previously proposed Restricted Boltzmann | 118 performed similarly or better than previously proposed Restricted Boltzmann |
81 Machines in terms of unsupervised extraction of a hierarchy of features | 119 Machines in terms of unsupervised extraction of a hierarchy of features |
82 useful for classification. The principle is that each layer, starting from | 120 useful for classification. Each layer is trained to denoise its |
83 the bottom, is trained to encode its input (the output of the previous | 121 input, creating a layer of features that can be used as input for the next layer. |
84 layer) and to reconstruct it from a corrupted version of it. After this | 122 \fi |
85 unsupervised initialization, the stack of denoising auto-encoders can be | 123 %The principle is that each layer starting from |
86 converted into a deep supervised feedforward neural network and trained by | 124 %the bottom is trained to encode its input (the output of the previous |
87 stochastic gradient descent. | 125 %layer) and to reconstruct it from a corrupted version. After this |
88 | 126 %unsupervised initialization, the stack of DAs can be |
89 | 127 %converted into a deep supervised feedforward neural network and fine-tuned by |
128 %stochastic gradient descent. | |
129 | |
130 % | |
131 In this paper we ask the following questions: | |
132 | |
133 %\begin{enumerate} | |
134 $\bullet$ %\item | |
135 Do the good results previously obtained with deep architectures on the | |
136 MNIST digit images generalize to the setting of a much larger and richer (but similar) | |
137 dataset, the NIST special database 19, with 62 classes and around 800k examples? | |
138 | |
139 $\bullet$ %\item | |
140 To what extent does the perturbation of input images (e.g. adding | |
141 noise, affine transformations, background images) make the resulting | |
142 classifiers better not only on similarly perturbed images but also on | |
143 the {\em original clean examples}? We study this question in the | |
144 context of the 62-class and 10-class tasks of the NIST special database 19. | |
145 | |
146 $\bullet$ %\item | |
147 Do deep architectures {\em benefit more from such out-of-distribution} | |
148 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | |
149 We use highly perturbed examples to generate out-of-distribution examples. | |
150 | |
151 $\bullet$ %\item | |
152 Similarly, does the feature learning step in deep learning algorithms benefit more | |
153 from training with moderately different classes (i.e. a multi-task learning scenario) than | |
154 a corresponding shallow and purely supervised architecture? | |
155 We train on 62 classes and test on 10 (digits) or 26 (upper case or lower case) | |
156 to answer this question. | |
157 %\end{enumerate} | |
158 | |
159 Our experimental results provide positive evidence towards all of these questions. | |
160 To achieve these results, we introduce in the next section a sophisticated system | |
161 for stochastically transforming character images and then explain the methodology, | |
162 which is based on training with or without these transformed images and testing on | |
163 clean ones. We measure the relative advantage of out-of-distribution examples | |
164 for a deep learner vs a supervised shallow one. | |
165 Code for generating these transformations as well as for the deep learning | |
166 algorithms is made available. | |
167 We also estimate the relative advantage for deep learners of training with | |
168 other classes than those of interest, by comparing learners trained with | |
169 62 classes with learners trained with only a subset (on which they | |
170 are then tested). | |
171 The conclusion discusses | |
172 the more general question of why deep learners may benefit so much from | |
173 the self-taught learning framework. | |
174 | |
175 %\vspace*{-3mm} | |
176 \newpage | |
90 \section{Perturbation and Transformation of Character Images} | 177 \section{Perturbation and Transformation of Character Images} |
91 | 178 \label{s:perturbations} |
92 This section describes the different transformations we used to generate data, in the order in which they are applied. | 179 %\vspace*{-2mm} |
180 | |
181 \begin{wrapfigure}[8]{l}{0.15\textwidth} | |
182 %\begin{minipage}[b]{0.14\linewidth} | |
183 %\vspace*{-5mm} | |
184 \begin{center} | |
185 \includegraphics[scale=.4]{images/Original.png}\\ | |
186 {\bf Original} | |
187 \end{center} | |
188 \end{wrapfigure} | |
189 %%\vspace{0.7cm} | |
190 %\end{minipage}% | |
191 %\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} | |
192 This section describes the different transformations we used to stochastically | |
193 transform $32 \times 32$ source images (such as the one on the left) | |
194 in order to obtain data from a larger distribution which | |
195 covers a domain substantially larger than the clean characters distribution from | |
196 which we start. | |
197 Although character transformations have been used before to | |
198 improve character recognizers, this effort is on a large scale both | |
199 in number of classes and in the complexity of the transformations, hence | |
200 in the complexity of the learning task. | |
201 More details can | |
202 be found in this technical report~\citep{ift6266-tr-anonymous}. | |
93 The code for these transformations (mostly python) is available at | 203 The code for these transformations (mostly python) is available at |
94 {\tt http://anonymous.url.net}. All the modules in the pipeline share | 204 {\tt http://anonymous.url.net}. All the modules in the pipeline share |
95 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 205 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the |
96 amount of deformation or noise introduced. | 206 amount of deformation or noise introduced. |
97 | 207 There are two main parts in the pipeline. The first one, |
98 We can differentiate two important parts in the pipeline. The first one, | 208 from slant to pinch below, performs transformations. The second |
99 from slant to pinch, performs transformations of the character. The second | 209 part, from blur to contrast, adds different kinds of noise. |
100 part, from blur to contrast, adds noise to the image. | 210 %\end{minipage} |
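
As a reading aid, here is a minimal Python sketch of how such a pipeline of modules sharing the global $complexity$ parameter could be chained. It is not the repository's actual code (which lives at the anonymous URL above); the module functions and their skip probabilities are placeholders.

```python
import numpy as np

def apply_pipeline(image, modules, complexity, rng):
    """Apply each (transform_fn, skip_probability) in order to a 32x32 image.

    Each transform_fn takes (image, complexity, rng) and returns a new image;
    values are kept in [0, 1].  `modules` and the skip probabilities are
    placeholders for the transformation and noise modules described below."""
    out = image
    for transform_fn, skip_probability in modules:
        if rng.uniform() < skip_probability:
            continue  # several noise modules are skipped most of the time
        out = np.clip(transform_fn(out, complexity, rng), 0.0, 1.0)
    return out

# usage sketch: rng = np.random.default_rng(0); modules = [(slant, 0.0), ...]
```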
101 | 211 |
102 \subsection{Slant} | 212 %\vspace*{1mm} |
103 | 213 \subsection{Transformations} |
104 We mimic slant by shifting each row of the image | 214 %{\large\bf 2.1 Transformations} |
105 proportionally to its height: $shift = round(slant \times height)$. | 215 %\vspace*{1mm} |
106 The $slant$ coefficient can be negative or positive with equal probability | 216 |
107 and its value is randomly sampled according to the complexity level: | 217 \subsubsection*{Thickness} |
108 $slant \sim U[0,complexity]$, so the | 218 |
109 maximum displacement for the lowest or highest pixel line is of | 219 %\begin{wrapfigure}[7]{l}{0.15\textwidth} |
110 $round(complexity \times 32)$. | 220 \begin{minipage}[b]{0.14\linewidth} |
111 | 221 %\centering |
112 --- | 222 \begin{center} |
113 | 223 \vspace*{-5mm} |
114 In order to mimic a slant effect, we simply shift each row of the image | 224 \includegraphics[scale=.4]{images/Thick_only.png}\\ |
115 proportionally to its height: $shift = round(slant \times height)$. We | 225 %{\bf Thickness} |
116 round the shift in order to have a discrete displacement. We do not use a | 226 \end{center} |
117 filter to smooth the result in order to save computing time and also | 227 \vspace{.6cm} |
118 because later transformations have similar effects. | 228 \end{minipage}% |
119 | 229 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} |
120 The $slant$ coefficient can be negative or positive with equal probability | 230 %\end{wrapfigure} |
121 and its value is randomly sampled according to the complexity level. In | 231 To change character {\bf thickness}, morphological operators of dilation and erosion~\citep{Haralick87,Serra82} |
122 our case we take uniformly a number in the range $[0,complexity]$, so the | |
123 maximum displacement for the lowest or highest pixel line is of | |
124 $round(complexity \times 32)$. | |
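
For illustration, a minimal sketch of the slant module as described above (not the authors' implementation); the wrap-around behaviour of np.roll is an assumption, the real module may instead pad with background.

```python
import numpy as np

def slant(image, complexity, rng):
    """Shift each row by round(slant * row); slant ~ U[0, complexity], random sign."""
    amount = rng.uniform(0.0, complexity) * rng.choice([-1.0, 1.0])
    out = np.zeros_like(image)
    for row in range(image.shape[0]):
        # np.roll wraps pixels around; the original code may instead pad with 0
        out[row] = np.roll(image[row], int(round(amount * row)))
    return out
```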
125 | |
126 | |
127 \subsection{Thickness} | |
128 | |
129 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82} | |
130 are applied. The neighborhood of each pixel is multiplied | 232 are applied. The neighborhood of each pixel is multiplied |
131 element-wise with a {\em structuring element} matrix. | 233 element-wise with a {\em structuring element} matrix. |
132 The pixel value is replaced by the maximum or the minimum of the resulting | 234 The pixel value is replaced by the maximum or the minimum of the resulting |
133 matrix, respectively for dilation or erosion. Ten different structural elements with | 235 matrix, respectively for dilation or erosion. Ten different structural elements with |
134 increasing dimensions (largest is $5\times5$) were used. For each image, | 236 increasing dimensions (largest is $5\times5$) were used. For each image, |
135 randomly sample the operator type (dilation or erosion) with equal probability and one structural | 237 randomly sample the operator type (dilation or erosion) with equal probability and one structural |
136 element from a subset of the $n$ smallest structuring elements where $n$ is | 238 element from a subset of the $n=round(m \times complexity)$ smallest structuring elements |
137 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$ | 239 where $m=10$ for dilation and $m=6$ for erosion (to avoid completely erasing thin characters). |
138 for erosion. A neutral element is always present in the set, and if it is | 240 A neutral element (no transformation) |
139 chosen no transformation is applied. Erosion allows only the six | 241 is always present in the set. |
140 smallest structural elements because when the character is too thin it may | 242 %%\vspace{.4cm} |
141 be completely erased. | 243 \end{minipage} |
142 | 244 |
143 --- | 245 \vspace{2mm} |
144 | 246 |
145 To change the thickness of the characters we used morphological operators: | 247 \subsubsection*{Slant} |
146 dilation and erosion~\cite{Haralick87,Serra82}. | 248 \vspace*{2mm} |
147 | 249 |
148 The basic idea of such a transform is, for each pixel, to multiply its | 250 \begin{minipage}[b]{0.14\linewidth} |
149 neighbourhood element-wise with a matrix called the structuring | 251 \centering |
150 element. Then for dilation we replace the pixel value by the maximum of | 252 \includegraphics[scale=.4]{images/Slant_only.png}\\ |
151 the result, or the minimum for erosion. This will dilate or erode objects | 253 %{\bf Slant} |
152 in the image and the strength of the transform only depends on the | 254 \end{minipage}% |
153 structuring element. | 255 \hspace{0.3cm} |
154 | 256 \begin{minipage}[b]{0.83\linewidth} |
155 We used ten different structural elements with increasing dimensions (the | 257 %\centering |
156 biggest is $5\times5$). For each image, we randomly sample the operator | 258 To produce {\bf slant}, each row of the image is shifted |
157 type (dilation or erosion) with equal probability and one structural | 259 proportionally to its height: $shift = round(slant \times height)$. |
158 element from a subset of the $n$ smallest structuring elements where $n$ is | 260 $slant \sim U[-complexity,complexity]$. |
159 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$ | 261 The shift is randomly chosen to be either to the left or to the right. |
160 for erosion. A neutral element is always present in the set; if it is | 262 \vspace{5mm} |
161 chosen the transformation is not applied. Erosion allows only the six | 263 \end{minipage} |
162 smallest structural elements because when the character is too thin erosion | 264 %\vspace*{-4mm} |
163 may erase it completely. | 265 |
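
A sketch of the thickness module using scipy's grey-scale morphology; the exact shapes of the ten structuring elements are not specified in the text, so the list below is illustrative only.

```python
import numpy as np
from scipy import ndimage

# ten flat structuring elements of increasing size (largest 5x5);
# the 1x1 element is the neutral one (no change)
STRUCTS = [np.ones((1, 1)), np.ones((2, 1)), np.ones((1, 2)), np.ones((2, 2)),
           np.ones((3, 2)), np.ones((2, 3)), np.ones((3, 3)), np.ones((4, 4)),
           np.ones((5, 4)), np.ones((5, 5))]

def thickness(image, complexity, rng):
    """Dilate or erode with an element among the n = round(m * complexity)
    smallest ones, m = 10 for dilation and m = 6 for erosion."""
    dilate = rng.uniform() < 0.5
    n = max(1, int(round((10 if dilate else 6) * complexity)))
    footprint = STRUCTS[rng.integers(0, n)]
    op = ndimage.grey_dilation if dilate else ndimage.grey_erosion
    return op(image, footprint=footprint)
```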
164 | 266 %\newpage |
165 \subsection{Affine Transformations} | 267 |
166 | 268 \subsubsection*{Affine Transformations} |
167 A $2 \times 3$ affine transform matrix (with | 269 |
168 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. | 270 \begin{minipage}[b]{0.14\linewidth} |
169 Each pixel $(x,y)$ of the output image takes the value of the pixel | 271 %\centering |
170 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This | 272 %\begin{wrapfigure}[8]{l}{0.15\textwidth} |
171 produces scaling, translation, rotation and shearing. | 273 \begin{center} |
172 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to | 274 \includegraphics[scale=.4]{images/Affine_only.png} |
173 forbid important rotations (not to confuse classes) but to give good | 275 \vspace*{6mm} |
174 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times | 276 %{\small {\bf Affine \mbox{Transformation}}} |
175 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 | 277 \end{center} |
176 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times | 278 %\end{wrapfigure} |
177 complexity]$. | 279 \end{minipage}% |
178 | 280 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} |
179 ---- | 281 \noindent A $2 \times 3$ {\bf affine transform} matrix (with |
180 | 282 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$. |
181 We generate an affine transform matrix according to the complexity level, | 283 Output pixel $(x,y)$ takes the value of input pixel |
182 then we apply it directly to the image. The matrix is of size $2 \times | 284 nearest to $(ax+by+c,dx+ey+f)$, |
183 3$, so we can represent it by six parameters $(a,b,c,d,e,f)$. Formally, | 285 producing scaling, translation, rotation and shearing. |
184 for each pixel $(x,y)$ of the output image, we give the value of the pixel | 286 Marginal distributions of $(a,b,c,d,e,f)$ have been tuned to |
185 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This allows us to | 287 forbid large rotations (to avoid confusing classes) but to give good |
186 produce scaling, translation, rotation and shearing variations. | 288 variability of the transformation: $a$ and $d$ $\sim U[1-3 |
187 | 289 complexity,1+3\,complexity]$, $b$ and $e$ $\sim U[-3 \,complexity,3\, |
188 The sampling of the parameters $(a,b,c,d,e,f)$ has been tuned by hand to | 290 complexity]$, and $c$ and $f \sim U[-4 \,complexity, 4 \, |
189 forbid important rotations (not to confuse classes) but to give good | 291 complexity]$.\\ |
190 variability of the transformation. For each image we sample uniformly the | 292 \end{minipage} |
191 parameters in the following ranges: $a$ and $d$ in $[1-3 \times | 293 |
192 complexity,1+3 \times complexity]$, $b$ and $e$ in $[-3 \times complexity,3 | 294 %\vspace*{-4.5mm} |
193 \times complexity]$ and $c$ and $f$ in $[-4 \times complexity, 4 \times | 295 \subsubsection*{Local Elastic Deformations} |
194 complexity]$. | 296 |
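
A sketch of the affine module with the parameter ranges given above; the $(x,y)$ indexing convention and the background fill value are assumptions.

```python
import numpy as np

def affine(image, complexity, rng):
    """Sample (a,b,c,d,e,f) and look up, for each output pixel, the nearest
    input pixel at (ax+by+c, dx+ey+f); out-of-range pixels become background."""
    a, d = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity, size=2)
    b, e = rng.uniform(-3 * complexity, 3 * complexity, size=2)
    c, f = rng.uniform(-4 * complexity, 4 * complexity, size=2)
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx = int(round(a * x + b * y + c))  # source column (assumed convention)
            sy = int(round(d * x + e * y + f))  # source row
            if 0 <= sy < h and 0 <= sx < w:
                out[y, x] = image[sy, sx]
    return out
```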
195 | 297 %\begin{minipage}[t]{\linewidth} |
196 | 298 %\begin{wrapfigure}[7]{l}{0.15\textwidth} |
197 \subsection{Local Elastic Deformations} | 299 %\hspace*{-8mm} |
198 | 300 \begin{minipage}[b]{0.14\linewidth} |
199 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, | 301 %\centering |
302 \begin{center} | |
303 \vspace*{5mm} | |
304 \includegraphics[scale=.4]{images/Localelasticdistorsions_only.png} | |
305 %{\bf Local Elastic Deformation} | |
306 \end{center} | |
307 %\end{wrapfigure} | |
308 \end{minipage}% | |
309 \hspace{3mm} | |
310 \begin{minipage}[b]{0.85\linewidth} | |
311 %%\vspace*{-20mm} | |
312 The {\bf local elastic deformation} | |
313 module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, | |
200 which provides more details. | 314 which provides more details. |
201 Two ``displacement'' fields are generated and applied, for horizontal | 315 The intensity of the displacement fields is given by |
202 and vertical displacements of pixels. | 316 $\alpha = \sqrt[3]{complexity} \times 10.0$; the fields are |
203 To generate a pixel in either field, first a value between -1 and 1 is | 317 convolved with a Gaussian 2D kernel (resulting in a blur) of |
204 chosen from a uniform distribution. Then all the pixels, in both fields, are | 318 standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$. |
205 multiplied by a constant $\alpha$ which controls the intensity of the | 319 \vspace{2mm} |
206 displacements (larger $\alpha$ translates into larger wiggles). | 320 \end{minipage} |
207 Each field is convolved with a Gaussian 2D kernel of | 321 |
208 standard deviation $\sigma$. Visually, this results in a blur. | 322 \vspace*{4mm} |
209 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | 323 |
210 \sqrt[3]{complexity}$. | 324 \subsubsection*{Pinch} |
211 | 325 |
212 ---- | 326 \begin{minipage}[b]{0.14\linewidth} |
213 | 327 %\centering |
214 This filter induces a "wiggly" effect in the image. The description here | 328 %\begin{wrapfigure}[7]{l}{0.15\textwidth} |
215 will be brief, as the algorithm follows precisely what is described in | 329 %\vspace*{-5mm} |
216 \cite{SimardSP03}. | 330 \begin{center} |
217 | 331 \includegraphics[scale=.4]{images/Pinch_only.png}\\ |
218 The general idea is to generate two "displacements" fields, for horizontal | 332 \vspace*{15mm} |
219 and vertical displacements of pixels. Each of these fields has the same | 333 %{\bf Pinch} |
220 size as the original image. | 334 \end{center} |
221 | 335 %\end{wrapfigure} |
222 When generating the transformed image, we'll loop over the x and y | 336 %%\vspace{.6cm} |
223 positions in the fields and select, as a value, the value of the pixel in | 337 \end{minipage}% |
224 the original image at the (relative) position given by the displacement | 338 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} |
225 fields for this x and y. If the position we'd retrieve is outside the | 339 The {\bf pinch} module applies the ``Whirl and pinch'' GIMP filter with whirl set to 0. |
226 borders of the image, we use a 0 value instead. | 340 A pinch is ``similar to projecting the image onto an elastic |
227 | |
228 To generate a pixel in either field, first a value between -1 and 1 is | |
229 chosen from a uniform distribution. Then all the pixels, in both fields, are | |
230 multiplied by a constant $\alpha$ which controls the intensity of the | |
231 displacements (bigger $\alpha$ translates into larger wiggles). | |
232 | |
233 As a final step, each field is convoluted with a Gaussian 2D kernel of | |
234 standard deviation $\sigma$. Visually, this results in a "blur" | |
235 filter. This has the effect of making values next to each other in the | |
236 displacement fields similar. In effect, this makes the wiggles more | |
237 coherent, less noisy. | |
238 | |
239 As displacement fields take a long time to compute, 50 pairs of fields were | |
240 generated per complexity in increments of 0.1 (50 pairs for 0.1, 50 pairs | |
241 for 0.2, etc.), and afterwards, given a complexity, we selected randomly | |
242 among the 50 corresponding pairs. | |
243 | |
244 $\sigma$ and $\alpha$ were linked to complexity through the formulas | |
245 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | |
246 \sqrt[3]{complexity}$. | |
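
A sketch of the local elastic deformation in the spirit of the description above; scipy.ndimage is used for the Gaussian smoothing and interpolation, and the caching of 50 precomputed field pairs per complexity level is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deformation(image, complexity, rng):
    """Random displacement fields in [-1, 1], scaled by alpha and smoothed
    with a Gaussian of standard deviation sigma, then used to resample."""
    alpha = complexity ** (1.0 / 3.0) * 10.0
    sigma = 10.0 - 7.0 * complexity ** (1.0 / 3.0)
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)) * alpha, sigma)
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)) * alpha, sigma)
    rows, cols = np.indices((h, w))
    coords = np.array([rows + dy, cols + dx])
    # pixels displaced outside the image take the background value 0
    return map_coordinates(image, coords, order=1, mode='constant', cval=0.0)
```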
247 | |
248 | |
249 \subsection{Pinch} | |
250 | |
251 This is a GIMP filter called ``Whirl and | |
252 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic | |
253 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). | 341 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). |
254 For a square input image, this is akin to drawing a circle of | 342 For a square input image, draw a radius-$r$ disk |
255 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | 343 around its center $C$. Any pixel $P$ belonging to |
256 that disk (region inside circle) will have its value recalculated by taking | 344 that disk has its value replaced by |
257 the value of another ``source'' pixel in the original image. The position of | 345 the value of a ``source'' pixel in the original image, |
258 that source pixel is found on the line that goes through $C$ and $P$, but | 346 on the line that goes through $C$ and $P$, but |
259 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ | 347 at some other distance $d_2$. Define $d_1=distance(P,C)$ |
260 and $C$. $d_2$ is given by $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times | 348 and $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times |
261 d_1$, where $pinch$ is a parameter to the filter. | 349 d_1$, where $pinch$ is a parameter of the filter. |
262 The actual value is given by bilinear interpolation considering the pixels | 350 The actual value is given by bilinear interpolation considering the pixels |
263 around the (non-integer) source position thus found. | 351 around the (non-integer) source position thus found. |
264 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. | 352 Here $pinch \sim U[-complexity, 0.7 \times complexity]$. |
265 | 353 %%\vspace{1.5cm} |
266 --- | 354 \end{minipage} |
267 | 355 |
268 This is another GIMP filter we used. The filter is in fact named "Whirl and | 356 %\vspace{1mm} |
269 pinch", but we don't use the "whirl" part (whirl is set to 0). As described | 357 |
270 in GIMP, a pinch is "similar to projecting the image onto an elastic | 358 %{\large\bf 2.2 Injecting Noise} |
271 surface and pressing or pulling on the center of the surface". | 359 \subsection{Injecting Noise} |
272 | 360 %\vspace{2mm} |
273 Mathematically, for a square input image, think of drawing a circle of | 361 |
274 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | 362 \subsubsection*{Motion Blur} |
275 that disk (region inside circle) will have its value recalculated by taking | 363 |
276 the value of another "source" pixel in the original image. The position of | 364 %%\vspace*{-.2cm} |
277 that source pixel is found on the line that goes through $C$ and $P$, but | 365 \begin{minipage}[t]{0.14\linewidth} |
278 at some other distance $d_2$. Define $d_1$ to be the distance between $P$ | 366 \centering |
279 and $C$. $d_2$ is given by $d_2 = sin(\frac{\pi{}d_1}{2r})^{-pinch} \times | 367 \vspace*{0mm} |
280 d_1$, where $pinch$ is a parameter to the filter. | 368 \includegraphics[scale=.4]{images/Motionblur_only.png} |
281 | 369 %{\bf Motion Blur} |
282 If the region considered is not square then, before computing $d_2$, the | 370 \end{minipage}% |
283 smallest dimension (x or y) is stretched such that we may consider the | 371 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} |
284 region as if it was square. Then, after $d_2$ has been computed and | 372 %%\vspace*{.5mm} |
285 corresponding components $d_2\_x$ and $d_2\_y$ have been found, the | 373 \vspace*{2mm} |
286 component corresponding to the stretched dimension is compressed back by an | 374 The {\bf motion blur} module is GIMP's ``linear motion blur'', which |
287 inverse ratio. | 375 has parameters $length$ and $angle$. The value of |
288 | 376 a pixel in the final image is approximately the mean of the first $length$ pixels |
289 The actual value is given by bilinear interpolation considering the pixels | 377 found by moving in the $angle$ direction, |
290 around the (non-integer) source position thus found. | 378 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. |
291 | 379 %\vspace{5mm} |
292 The value for $pinch$ in our case was given by sampling from a uniform | 380 \end{minipage} |
293 distribution over the range $[-complexity, 0.7 \times complexity]$. | 381 |
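
A sketch of the pinch mapping described above; it uses nearest-neighbour lookup rather than the bilinear interpolation of the real filter, and the handling of non-square regions is omitted.

```python
import numpy as np

def pinch(image, complexity, rng):
    """Replace each pixel at distance d1 from the centre by the pixel at
    distance d2 = sin(pi*d1/(2r))**(-p) * d1 along the same radius."""
    p = rng.uniform(-complexity, 0.7 * complexity)
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = min(cy, cx)
    out = image.copy()
    for y in range(h):
        for x in range(w):
            d1 = np.hypot(y - cy, x - cx)
            if 0 < d1 < r:
                d2 = np.sin(np.pi * d1 / (2 * r)) ** (-p) * d1
                sy = int(round(cy + (y - cy) * d2 / d1))
                sx = int(round(cx + (x - cx) * d2 / d1))
                if 0 <= sy < h and 0 <= sx < w:
                    out[y, x] = image[sy, sx]
    return out
```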
294 | 382 %\vspace*{1mm} |
295 \subsection{Motion Blur} | 383 |
296 | 384 \subsubsection*{Occlusion} |
297 This is a ``linear motion blur'' in GIMP | 385 |
298 terminology, with two parameters, $length$ and $angle$. The value of | 386 \begin{minipage}[t]{0.14\linewidth} |
299 a pixel in the final image is approximately the mean value of the first $length$ pixels | 387 \centering |
300 found by moving in the $angle$ direction. | 388 \vspace*{3mm} |
301 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. | 389 \includegraphics[scale=.4]{images/occlusion_only.png}\\ |
302 | 390 %{\bf Occlusion} |
303 ---- | 391 %%\vspace{.5cm} |
304 | 392 \end{minipage}% |
305 This is a GIMP filter we applied, a "linear motion blur" in GIMP | 393 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} |
306 terminology. The description will be brief as it is a well-known filter. | 394 %\vspace*{-18mm} |
307 | 395 The {\bf occlusion} module selects a random rectangle from an {\em occluder} character |
308 This algorithm has two input parameters, $length$ and $angle$. The value of | 396 image and places it over the original {\em occluded} |
309 a pixel in the final image is the mean value of the first $length$ pixels | 397 image. Pixels are combined by taking the max(occluder, occluded), |
310 found by moving in the $angle$ direction. An approximation of this idea is | 398 i.e. keeping the lighter ones. |
311 used, since moving along that direction does not land exactly on pixel | 399 The rectangle corners |
312 centers. This is done using the Bresenham line algorithm. | |
313 | |
314 The angle, in our case, is chosen from a uniform distribution over | |
315 $[0,360]$ degrees. The length, though, depends on the complexity; it's | |
316 sampled from a Gaussian distribution of mean 0 and standard deviation | |
317 $\sigma = 3 \times complexity$. | |
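
A sketch of the motion blur module under the sampling rules above; it steps along the chosen direction with simple rounding rather than the Bresenham line algorithm of the real filter.

```python
import numpy as np

def motion_blur(image, complexity, rng):
    """Each pixel becomes the mean of the `length` pixels reached by stepping
    in the `angle` direction; angle ~ U[0,360], length ~ |N(0,(3c)^2)|."""
    angle = np.deg2rad(rng.uniform(0, 360))
    length = int(abs(rng.normal(0, 3 * complexity)))
    if length < 1:
        return image
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            samples = []
            for k in range(length):
                sy = int(round(y + k * np.sin(angle)))
                sx = int(round(x + k * np.cos(angle)))
                if 0 <= sy < h and 0 <= sx < w:
                    samples.append(image[sy, sx])
            out[y, x] = np.mean(samples) if samples else image[y, x]
    return out
```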
318 | |
319 \subsection{Occlusion} | |
320 | |
321 Selects a random rectangle from an {\em occluder} character | |
322 image and places it over the original {\em occluded} character | |
323 image. Pixels are combined by taking the max(occluder,occluded), | |
324 i.e. keeping the one closer to white. The rectangle corners | |
325 are sampled so that larger complexity gives larger rectangles. | 400 are sampled so that larger complexity gives larger rectangles. |
326 The destination position in the occluded image is also sampled | 401 The destination position in the occluded image is also sampled |
327 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}). | 402 according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}). |
328 This filter has a probability of 60\% of not being applied. | 403 This module is skipped with probability 60\%. |
329 | 404 %%\vspace{7mm} |
330 --- | 405 \end{minipage} |
331 | 406 |
332 This filter selects random parts of other (hereafter "occlusive") letter | 407 %\vspace*{1mm} |
333 images and places them over the original letter (hereafter "occluded") | 408 \subsubsection*{Gaussian Smoothing} |
334 image. To be more precise, having selected a subregion of the occlusive | 409 |
335 image and a destination position in the occluded image, to determine the | 410 %\begin{wrapfigure}[8]{l}{0.15\textwidth} |
336 final value for a given overlapping pixel, it selects whichever pixel is | 411 %\vspace*{-6mm} |
337 the lightest. As a reminder, the background value is 0, black, so the value | 412 \begin{minipage}[t]{0.14\linewidth} |
338 nearest to 1 is selected. | 413 \begin{center} |
339 | 414 %\centering |
340 To select a subpart of the occlusive image, four numbers are generated. For | 415 \vspace*{6mm} |
341 compatibility with the code, we'll call them "haut", "bas", "gauche" and | 416 \includegraphics[scale=.4]{images/Bruitgauss_only.png} |
342 "droite" (respectively meaning top, bottom, left and right). Each of these | 417 %{\bf Gaussian Smoothing} |
343 numbers is selected according to a Gaussian distribution of mean $8 \times | 418 \end{center} |
344 complexity$ and standard deviation $2$. This means the larger the | 419 %\end{wrapfigure} |
345 complexity is, the bigger the occlusion will be. The absolute value is | 420 %%\vspace{.5cm} |
346 taken, as the numbers must be positive, and the maximum value is capped at | 421 \end{minipage}% |
347 15. | 422 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} |
348 | 423 With the {\bf Gaussian smoothing} module, |
349 These four sizes collectively define a window centered on the middle pixel | 424 different regions of the image are spatially smoothed. |
350 of the occlusive image. This is the part that will be extracted as the | 425 This is achieved by first convolving |
351 occlusion. | 426 the image with an isotropic Gaussian kernel of |
352 | |
353 The next step is to select a destination position in the occluded | |
354 image. Vertical and horizontal displacements $y\_arrivee$ and $x\_arrivee$ | |
355 are selected according to Gaussian distributions of mean 0 and of standard | |
356 deviations of, respectively, 3 and 2. Then a horizontal placement mode, | |
357 $place$, is selected among three values meaning | |
358 left, middle or right. | |
359 | |
360 If $place$ is "middle", the occlusion will be horizontally centered | |
361 around the horizontal middle of the occluded image, then shifted according | |
362 to $x\_arrivee$. If $place$ is "left", it will be placed on the left of | |
363 the occluded image, then displaced right according to $x\_arrivee$. The | |
364 opposite happens if $place$ is "right". | |
365 | |
366 In both the horizontal and vertical positioning, the maximum position in | |
367 either direction is such that the selected occlusion won't go beyond the | |
368 borders of the occluded image. | |
369 | |
370 This filter has a probability of not being applied, at all, of 60\%. | |
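
A much simplified sketch of the occlusion module (the left/middle/right placement mode is omitted, the occluder image is passed in explicitly, and the size/position constants follow the description above only approximately).

```python
import numpy as np

def occlusion(occluded, occluder, complexity, rng):
    """Paste a rectangle cut from an occluder character over the occluded one,
    combining pixels with max(), i.e. keeping the lighter value."""
    if rng.uniform() < 0.6:          # the module is skipped 60% of the time
        return occluded
    h, w = occluded.shape
    top, bottom, left, right = np.clip(
        np.abs(rng.normal(8 * complexity, 2, size=4)), 1, 15).astype(int)
    cy, cx = h // 2, w // 2
    patch = occluder[max(0, cy - top):cy + bottom, max(0, cx - left):cx + right]
    ph, pw = patch.shape
    y0 = int(np.clip(rng.normal(0, 3) + (h - ph) // 2, 0, h - ph))
    x0 = int(np.clip(rng.normal(0, 2) + (w - pw) // 2, 0, w - pw))
    out = occluded.copy()
    out[y0:y0 + ph, x0:x0 + pw] = np.maximum(out[y0:y0 + ph, x0:x0 + pw], patch)
    return out
```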
371 | |
372 | |
373 \subsection{Pixel Permutation} | |
374 | |
375 This filter permutes neighbouring pixels. It selects first | |
376 $\frac{complexity}{3}$ pixels randomly in the image. Each of them is then | |
377 sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers | |
378 of exchanges to the left, right, top and bottom are equal, or differ by at most | |
379 1 when the number of selected pixels is not a multiple of 4. | |
381 This filter has a probability of 80\% of not being applied. | |
382 | |
383 --- | |
384 | |
385 This filter permutes neighbouring pixels. It first selects | |
386 $\frac{complexity}{3}$ pixels randomly in the image. Each of them is then | |
387 sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers | |
388 of exchanges to the left, right, top and bottom are equal, or differ by at most | |
389 1 when the number of selected pixels is not a multiple of 4. | |
390 | |
391 It has a probability of not being applied, at all, of 80\%. | |
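
A sketch of the pixel permutation module; unlike the balanced left/right/top/bottom counts described above, the neighbour of each selected pixel is chosen uniformly among its four nearest neighbours.

```python
import numpy as np

def permute_pixels(image, complexity, rng):
    """Exchange a fraction complexity/3 of the pixels with a 4-neighbour."""
    if rng.uniform() < 0.8:          # skipped 80% of the time
        return image
    out = image.copy()
    h, w = out.shape
    n = int(complexity / 3.0 * h * w)
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(n):
        y, x = rng.integers(0, h), rng.integers(0, w)
        dy, dx = offsets[rng.integers(0, 4)]
        y2, x2 = np.clip(y + dy, 0, h - 1), np.clip(x + dx, 0, w - 1)
        out[y, x], out[y2, x2] = out[y2, x2], out[y, x]
    return out
```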
392 | |
393 | |
394 \subsection{Gaussian Noise} | |
395 | |
396 This filter simply adds, to each pixel of the image independently, a | |
397 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. | |
398 It has a probability of 70\% of not being applied. | |
399 | |
400 --- | |
401 | |
402 This filter simply adds, to each pixel of the image independently, a | |
403 Gaussian noise of mean $0$ and standard deviation $\frac{complexity}{10}$. | |
404 | |
405 It has a probability of not being applied, at all, of 70\%. | |
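
The corresponding one-line sketch for the Gaussian noise module:

```python
import numpy as np

def gaussian_noise(image, complexity, rng):
    """Add i.i.d. noise ~ Normal(0, (complexity/10)^2) to every pixel."""
    if rng.uniform() < 0.7:          # skipped 70% of the time
        return image
    return np.clip(image + rng.normal(0, complexity / 10.0, image.shape), 0, 1)
```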
406 | |
407 | |
408 \subsection{Background Images} | |
409 | |
410 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random | |
411 background behind the letter. The background is chosen by first selecting, | |
412 at random, an image from a set of images. Then a 32$\times$32 sub-region | |
413 of that image is chosen as the background image (by sampling position | |
414 uniformly while making sure not to cross image borders). | |
415 To combine the original letter image and the background image, contrast | |
416 adjustments are made. We first get the maximal values (i.e. maximal | |
417 intensity) for both the original image and the background image, $maximage$ | |
418 and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$. | |
419 Each background pixel value is multiplied by $\frac{max(maximage - | |
420 contrast, 0)}{maxbg}$ (higher contrast yields darker | |
421 background). The output image pixels are max(background,original). | |
422 | |
423 --- | |
424 | |
425 Following~\cite{Larochelle-jmlr-2009}, this transformation adds a random | |
426 background behind the letter. The background is chosen by first selecting, | |
427 at random, an image from a set of images. Then we choose a 32x32 subregion | |
428 of that image as the background image (by sampling x and y positions | |
429 uniformly while making sure not to cross image borders). | |
430 | |
431 To combine the original letter image and the background image, contrast | |
432 adjustments are made. We first get the maximal values (i.e. maximal | |
433 intensity) for both the original image and the background image, $maximage$ | |
434 and $maxbg$. We also have a parameter, $contrast$, given by sampling from a | |
435 uniform distribution over $[complexity, 1]$. | |
436 | |
437 Once we have all these numbers, we first adjust the values for the | |
438 background image. Each pixel value is multiplied by $\frac{max(maximage - | |
439 contrast, 0)}{maxbg}$. Therefore the higher the contrast, the darker the | |
440 background will be. | |
441 | |
442 The final image is found by taking the brightest (i.e. value nearest to 1) | |
443 pixel from either the background image or the corresponding pixel in the | |
444 original image. | |
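
A sketch of the background image module following the combination rule above; the sampling of the natural-image patch itself is assumed to have been done by the caller.

```python
import numpy as np

def add_background(image, background, complexity, rng):
    """Blend a 32x32 natural-image patch behind the character: rescale the
    background with the contrast rule above, then keep the brighter pixel."""
    contrast = rng.uniform(complexity, 1.0)
    maximage, maxbg = image.max(), background.max()
    if maxbg > 0:
        background = background * max(maximage - contrast, 0.0) / maxbg
    return np.maximum(image, background)
```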
445 | |
446 \subsection{Salt and Pepper Noise} | |
447 | |
448 This filter adds noise $\sim U[0,1]$ to random subsets of pixels. | |
449 The proportion of selected pixels is $0.2 \times complexity$. | |
450 This filter has a probability of not being applied at all of 75\%. | |
451 | |
452 --- | |
453 | |
454 This filter adds noise to the image by randomly selecting a certain number of | |
455 pixels and, for those selected pixels, assigning a random value according to | |
456 a uniform distribution over the $[0,1]$ range. This distribution does | |
457 not change according to complexity. Instead, the number of selected pixels | |
458 does: the proportion of changed pixels corresponds to $complexity / 5$, | |
459 which means, as a maximum, 20\% of the pixels will be randomized. On the | |
460 lowest extreme, no pixel is changed. | |
461 | |
462 This filter also has a probability of not being applied, at all, of 75\%. | |
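
A sketch of the salt and pepper module; following the detailed description above, $0.2 \times complexity$ is treated as the proportion of pixels to replace.

```python
import numpy as np

def salt_and_pepper(image, complexity, rng):
    """Replace a proportion 0.2*complexity of the pixels by values ~ U[0,1]."""
    if rng.uniform() < 0.75:         # skipped 75% of the time
        return image
    out = image.copy()
    mask = rng.uniform(size=image.shape) < 0.2 * complexity
    out[mask] = rng.uniform(size=mask.sum())
    return out
```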
463 | |
464 \subsection{Spatially Gaussian Noise} | |
465 | |
466 Different regions of the image are spatially smoothed. | |
467 The image is convolved with a symmetric Gaussian kernel of | |
468 size and variance chosen uniformly in the ranges $[12,12 + 20 \times | 427 size and variance chosen uniformly in the ranges $[12,12 + 20 \times |
469 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized | 428 complexity]$ and $[2,2 + 6 \times complexity]$. This filtered image is normalized |
470 between $0$ and $1$. We also create a symmetric averaging window, of the | 429 between $0$ and $1$. We also create an isotropic weighted averaging window, of the |
471 kernel size, with maximum value at the center. For each image we sample | 430 kernel size, with maximum value at the center. For each image we sample |
472 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be | 431 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be |
473 averaging centers between the original image and the filtered one. We | 432 averaging centers between the original image and the filtered one. We |
474 initialize to zero a mask matrix of the image size. For each selected pixel | 433 initialize to zero a mask matrix of the image size. For each selected pixel |
475 we add to the mask the averaging window centered on it. The final image is | 434 we add to the mask the averaging window centered on it. The final image is |
476 computed from the following element-wise operation: $\frac{image + filtered\_image | 435 computed from the following element-wise operation: $\frac{image + filtered\_image |
477 \times mask}{mask+1}$. | 436 \times mask}{mask+1}$. |
478 This filter has a probability of not being applied at all of 75\%. | 437 This module is skipped with probability 75\%. |
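
A sketch of the spatially varying Gaussian smoothing; the shape of the peaked averaging window is an assumption (only "maximum value at the center" is specified above).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spatially_gaussian_smoothing(image, complexity, rng):
    """Blend the image with a blurred copy around a few random centres,
    using the mask formula (image + filtered_image * mask) / (mask + 1)."""
    if rng.uniform() < 0.75:                       # skipped 75% of the time
        return image
    size = int(rng.uniform(12, 12 + 20 * complexity))
    sigma = rng.uniform(2, 2 + 6 * complexity)
    blurred = gaussian_filter(image, sigma)
    blurred = (blurred - blurred.min()) / max(np.ptp(blurred), 1e-8)
    # weighted averaging window of the kernel size, peaked at its centre
    ax = 1.0 - np.abs(np.linspace(-1, 1, size))
    window = np.outer(ax, ax)
    h, w = image.shape
    mask = np.zeros((h, w))
    n_centres = rng.integers(3, 3 + int(10 * complexity) + 1)
    for _ in range(n_centres):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        y0, x0 = max(0, cy - size // 2), max(0, cx - size // 2)
        y1, x1 = min(h, cy + size // 2), min(w, cx + size // 2)
        wy0, wx0 = y0 - (cy - size // 2), x0 - (cx - size // 2)
        mask[y0:y1, x0:x1] += window[wy0:wy0 + (y1 - y0), wx0:wx0 + (x1 - x0)]
    return (image + blurred * mask) / (mask + 1.0)
```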
479 | 438 \end{minipage} |
480 ---- | 439 |
481 | 440 %\newpage |
482 The aim of this transformation is to filter, with a Gaussian kernel, | 441 |
483 different regions of the image. In order to save computing time we decided | 442 %\vspace*{-9mm} |
484 to convolve the whole image only once with a symmetric Gaussian kernel of | 443 \subsubsection*{Permute Pixels} |
485 size and variance chosen uniformly in the ranges: $[12,12 + 20 \times | 444 |
486 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized | 445 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} |
487 between $0$ and $1$. We also create a symmetric averaging window, of the | 446 %\centering |
488 kernel size, with maximum value at the center. For each image we sample | 447 \begin{minipage}[t]{0.14\textwidth} |
489 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be | 448 %\begin{wrapfigure}[7]{l}{ |
490 averaging centers between the original image and the filtered one. We | 449 %\vspace*{-5mm} |
491 initialize to zero a mask matrix of the image size. For each selected pixel | 450 \begin{center} |
492 we add to the mask the averaging window centered on it. The final image is | 451 \vspace*{1mm} |
493 computed from the following element-wise operation: $\frac{image + filtered\_image | 452 \includegraphics[scale=.4]{images/Permutpixel_only.png} |
494 \times mask}{mask+1}$. | 453 %{\small\bf Permute Pixels} |
495 | 454 \end{center} |
496 This filter has a probability of not being applied, at all, of 75\%. | 455 %\end{wrapfigure} |
497 | 456 \end{minipage}% |
498 \subsection{Scratches} | 457 \hspace{3mm}\begin{minipage}[t]{0.86\linewidth} |
499 | 458 \vspace*{1mm} |
500 The scratches module places line-like white patches on the image. The | 459 %%\vspace*{-20mm} |
460 This module {\bf permutes neighbouring pixels}. It first selects a | |
461 fraction $\frac{complexity}{3}$ of pixels randomly in the image. Each | |
462 of these pixels is then sequentially exchanged with a random pixel | |
463 among its four nearest neighbors (on its left, right, top or bottom). | |
464 This module is skipped with probability 80\%.\\ | |
465 %\vspace*{1mm} | |
466 \end{minipage} | |
467 | |
468 %\vspace{-3mm} | |
469 | |
470 \subsubsection*{Gaussian Noise} | |
471 | |
472 \begin{minipage}[t]{0.14\textwidth} | |
473 %\begin{wrapfigure}[7]{l}{ | |
474 %%\vspace*{-3mm} | |
475 \begin{center} | |
476 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth} | |
477 %\centering | |
478 \vspace*{0mm} | |
479 \includegraphics[scale=.4]{images/Distorsiongauss_only.png} | |
480 %{\small \bf Gauss. Noise} | |
481 \end{center} | |
482 %\end{wrapfigure} | |
483 \end{minipage}% | |
484 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} | |
485 \vspace*{1mm} | |
486 %\vspace*{12mm} | |
487 The {\bf Gaussian noise} module simply adds, to each pixel of the image independently, a | |
488 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. | |
489 This module is skipped with probability 70\%. | |
490 %%\vspace{1.1cm} | |
491 \end{minipage} | |
492 | |
493 %\vspace*{1.2cm} | |
494 | |
495 \subsubsection*{Background Image Addition} | |
496 | |
497 \begin{minipage}[t]{\linewidth} | |
498 \begin{minipage}[t]{0.14\linewidth} | |
499 \centering | |
500 \vspace*{0mm} | |
501 \includegraphics[scale=.4]{images/background_other_only.png} | |
502 %{\small \bf Bg Image} | |
503 \end{minipage}% | |
504 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} | |
505 \vspace*{1mm} | |
506 Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random | |
507 background image behind the letter, from a randomly chosen natural image, | |
508 with contrast adjustments depending on $complexity$, to preserve | |
509 more or less of the original character image. | |
510 %%\vspace{.8cm} | |
511 \end{minipage} | |
512 \end{minipage} | |
513 %%\vspace{-.7cm} | |
514 | |
515 \subsubsection*{Salt and Pepper Noise} | |
516 | |
517 \begin{minipage}[t]{0.14\linewidth} | |
518 \centering | |
519 \vspace*{0mm} | |
520 \includegraphics[scale=.4]{images/Poivresel_only.png} | |
521 %{\small \bf Salt \& Pepper} | |
522 \end{minipage}% | |
523 \hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth} | |
524 \vspace*{1mm} | |
525 The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels. | |
526 The proportion of selected pixels is $0.2 \times complexity$. | |
527 This module is skipped with probability 75\%. | |
528 %%\vspace{.9cm} | |
529 \end{minipage} | |
530 %%\vspace{-.7cm} | |
531 | |
532 %\vspace{1mm} | |
533 \subsubsection*{Scratches} | |
534 | |
535 \begin{minipage}[t]{0.14\textwidth} | |
536 %\begin{wrapfigure}[7]{l}{ | |
537 %\begin{minipage}[t]{0.14\linewidth} | |
538 %\centering | |
539 \begin{center} | |
540 \vspace*{4mm} | |
541 %\hspace*{-1mm} | |
542 \includegraphics[scale=.4]{images/Rature_only.png}\\ | |
543 %{\bf Scratches} | |
544 \end{center} | |
545 \end{minipage}% | |
546 %\end{wrapfigure} | |
547 \hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth} | |
548 %%\vspace{.4cm} | |
549 The {\bf scratches} module places line-like white patches on the image. The | |
501 lines are heavily transformed images of the digit ``1'' (one), chosen | 550 lines are heavily transformed images of the digit ``1'' (one), chosen |
502 at random among five thousands such 1 images. The 1 image is | 551 at random among 500 such 1 images, |
503 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times | 552 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times |
504 complexity)^2)$, using bi-cubic interpolation. | 553 complexity)^2)$ (in degrees), using bi-cubic interpolation. |
505 Two passes of a grey-scale morphological erosion filter | 554 Two passes of a grey-scale morphological erosion filter |
506 are applied, reducing the width of the line | 555 are applied, reducing the width of the line |
507 by an amount controlled by $complexity$. | 556 by an amount controlled by $complexity$. |
508 This filter is applied only 15\% of the time. When it is applied, 50\% | 557 This module is skipped with probability 85\%. The probabilities |
509 of the time, only one patch image is generated and applied. In 30\% of | 558 of applying 1, 2, or 3 patches are (50\%,30\%,20\%). |
510 cases, two patches are generated, and otherwise three patches are | 559 \end{minipage} |
511 generated. The patch is applied by taking the maximal value on any given | 560 |
512 patch or the original image, for each of the 32x32 pixel locations. | 561 %\vspace*{1mm} |
513 | 562 |
514 --- | 563 \subsubsection*{Grey Level and Contrast Changes} |
515 | 564 |
516 The scratches module places line-like white patches on the image. The | 565 \begin{minipage}[t]{0.15\linewidth} |
517 lines are in fact heavily transformed images of the digit "1" (one), chosen | 566 \centering |
518 at random among five thousands such start images of this digit. | 567 \vspace*{0mm} |
519 | 568 \includegraphics[scale=.4]{images/Contrast_only.png} |
520 Once the image is selected, the transformation begins by finding the first | 569 %{\bf Grey Level \& Contrast} |
521 $top$, $bottom$, $right$ and $left$ non-zero pixels in the image. It is | 570 \end{minipage}% |
522 then cropped to the region thus delimited, then this cropped version is | 571 \hspace{3mm}\begin{minipage}[t]{0.85\linewidth} |
523 expanded to $32\times32$ again. It is then rotated by a random angle having a | 572 \vspace*{1mm} |
524 Gaussian distribution of mean 90 and standard deviation $100 \times | 573 The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white |
525 complexity$ (in degrees). The rotation is done with bicubic interpolation. | 574 to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$ |
526 | 575 so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The |
527 The rotated image is then resized to $50\times50$, with anti-aliasing. In | 576 polarity is inverted with probability 50\%. |
528 that image, we crop the image again by selecting a region delimited | 577 %%\vspace{.7cm} |
529 horizontally to $left$ to $left+32$ and vertically by $top$ to $top+32$. | 578 \end{minipage} |
530 | 579 %\vspace{2mm} |
531 Once this is done, two passes of a greyscale morphological erosion filter | 580 |
532 are applied. Put briefly, this erosion filter reduces the width of the line | 581 |
533 by a certain $smoothing$ amount. For small complexities (< 0.5), | 582 \iffalse |
534 $smoothing$ is 6, so the line is very small. For complexities ranging from | 583 \begin{figure}[ht] |
535 0.25 to 0.5, $smoothing$ is 5. It is 4 for complexities 0.5 to 0.75, and 3 | 584 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\ |
536 for higher complexities. | |
537 | |
538 To compensate for border effects, the image is then cropped to 28x28 by | |
539 removing two pixels everywhere on the borders, then expanded to 32x32 | |
540 again. The pixel values are then linearly expanded such that the minimum | |
541 value is 0 and the maximal one is 1. Then, 50\% of the time, the image is | |
542 vertically flipped. | |
543 | |
544 This filter is only applied only 15\% of the time. When it is applied, 50\% | |
545 of the time, only one patch image is generated and applied. In 30\% of | |
546 cases, two patches are generated, and otherwise three patches are | |
547 generated. The patch is applied by taking the maximal value on any given | |
548 patch or the original image, for each of the 32x32 pixel locations. | |
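
A compact sketch of applying one scratch patch; the cropping/resizing steps described above are omitted and the mapping from $complexity$ to the erosion size is only an approximation of the 6/5/4/3 schedule.

```python
import numpy as np
from scipy.ndimage import rotate, grey_erosion

def add_scratch(image, one_image, complexity, rng):
    """Overlay a heavily transformed image of a '1' (one_image, 32x32 in [0,1])
    as a scratch, combining with max() to keep the lighter pixels."""
    angle = rng.normal(0, 100 * complexity)
    patch = rotate(one_image, angle, reshape=False, order=3, cval=0.0)
    # approximate mapping: smoothing goes from 6 (low complexity) down to 3
    smoothing = 3 + int(3 * max(0.0, 1.0 - complexity))
    for _ in range(2):
        patch = grey_erosion(patch, size=(smoothing, smoothing))
    return np.maximum(image, np.clip(patch, 0.0, 1.0))
```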
549 | |
550 \subsection{Grey Level and Contrast Changes} | |
551 | |
552 This filter changes the contrast and may invert the image polarity (white | |
553 on black to black on white). The contrast $C$ is defined here as the | |
554 difference between the maximum and the minimum pixel value of the image. | |
555 Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$). | |
556 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The | |
557 polarity is inverted with $0.5$ probability. | |
558 | |
567 | |
568 | |
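A minimal sketch of this module (ours, not the project's implementation), for an image stored as an array with values in $[0,1]$:
\begin{verbatim}
import numpy as np

def change_contrast(image, complexity, rng):
    # Contrast C is sampled in U[1 - 0.85*complexity, 1], so C >= 0.15.
    C = rng.uniform(1.0 - 0.85 * complexity, 1.0)
    # Rescale the pixel values into [(1-C)/2, 1 - (1-C)/2].
    lo, hi = image.min(), image.max()
    stretched = (image - lo) / max(hi - lo, 1e-8)
    out = stretched * C + (1.0 - C) / 2.0
    # Invert the polarity (white on black <-> black on white) half of the time.
    if rng.uniform() < 0.5:
        out = 1.0 - out
    return out
\end{verbatim}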
569 \begin{figure}[h] | |
570 \resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\ | |
571 \caption{Illustration of the pipeline of stochastic | 585 \caption{Illustration of the pipeline of stochastic |
572 transformations applied to the image of a lower-case t | 586 transformations applied to the image of a lower-case \emph{t} |
573 (the upper left image). Each image in the pipeline (going from | 587 (the upper left image). Each image in the pipeline (going from |
574 left to right, first top line, then bottom line) shows the result | 588 left to right, first top line, then bottom line) shows the result |
575 of applying one of the modules in the pipeline. The last image | 589 of applying one of the modules in the pipeline. The last image |
576 (bottom right) is used as training example.} | 590 (bottom right) is used as training example.} |
577 \label{fig:pipeline} | 591 \label{fig:pipeline} |
578 \end{figure} | 592 \end{figure} |
579 | 593 \fi |
580 | 594 |
581 \begin{figure}[h] | 595 %\vspace*{-3mm} |
582 \resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\ | 596 \section{Experimental Setup} |
583 \caption{Illustration of each transformation applied to the same image | 597 %\vspace*{-1mm} |
584 of the upper-case h (upper-left image). First row (from left to right): original image, slant, | 598
585 thickness, affine transformation, local elastic deformation; second row (from left to right): | 599 Much previous work on deep learning had been performed on |
586 pinch, motion blur, occlusion, pixel permutation, gaussian noise; third row (from left to right): | 600 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
587 background image, salt and pepper noise, spatially gaussian noise, scratches, | 601 with 60~000 examples, and variants involving 10~000 |
588 grey level and contrast changes.} | 602 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}. |
589 \label{fig:transfo} | 603 The focus here is on much larger training sets, from 10 times to |
604 1000 times larger, and 62 classes. |
605 | |
606 The first step in constructing the larger datasets (called NISTP and P07) is to sample from | |
607 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | |
608 and {\bf OCR data} (scanned machine printed characters). Once a character | |
609 is sampled from one of these sources (chosen randomly), the second step is to | |
610 apply a pipeline of transformations and/or noise processes described in section \ref{s:perturbations}. | |
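The two steps can be summarized by the following sketch (a simplification with hypothetical helper callables; the source probabilities shown are those used for P07 and NISTP below, and \texttt{pipeline} stands for the ordered list of transformation and noise modules of section~\ref{s:perturbations}):
\begin{verbatim}
import numpy as np

def generate_example(sources, probabilities, pipeline, rng, max_complexity=0.7):
    # Step 1: pick a data source (NIST, Fonts, Captchas, OCR) at random,
    # here with the P07/NISTP proportions (0.4, 0.1, 0.25, 0.25).
    i = rng.choice(len(sources), p=probabilities)
    image, label = sources[i]()
    # Step 2: apply every module of the pipeline, each with its own
    # complexity sampled uniformly in [0, max_complexity].
    for module in pipeline:
        image = module(image, rng.uniform(0.0, max_complexity))
    return image, label
\end{verbatim}
The functions in \texttt{sources} (each returning a raw character image and its label) and the \texttt{module} callables are stand-ins for the actual data readers and transformation modules, which are not shown here.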
611 | |
612 To provide a baseline for error rate comparison, we also estimate human performance |
613 on both the 62-class task and the 10-class digits task. | |
614 We compare the best Multi-Layer Perceptrons (MLP) against | |
615 the best Stacked Denoising Auto-encoders (SDA), when | |
616 both models' hyper-parameters are selected to minimize the validation set error. | |
617 We also provide a comparison against a precise estimate | |
618 of human performance obtained via Amazon's Mechanical Turk (AMT) | |
619 service (http://mturk.com). | |
620 AMT users are paid small amounts | |
621 of money to perform tasks for which human intelligence is required. | |
622 Mechanical Turk has been used extensively in natural language processing and vision. | |
623 %processing \citep{SnowEtAl2008} and vision | |
624 %\citep{SorokinAndForsyth2008,whitehill09}. | |
625 AMT users were presented | |
626 with 10 character images (from a test set) and asked to choose 10 corresponding ASCII | |
627 characters. They were forced to choose a single character class (either among the | |
628 62 or 10 character classes) for each image. | |
629 80 subjects classified 2500 images per (dataset,task) pair, | |
630 with the guarantee that 3 different subjects classified each image, allowing | |
631 us to estimate inter-human variability (e.g., a standard error of 0.1\% |
632 on the average 18.2\% error made by humans on the 62-class task NIST test set). |
633 | |
634 %\vspace*{-3mm} | |
635 \subsection{Data Sources} | |
636 %\vspace*{-2mm} | |
637 | |
638 %\begin{itemize} | |
639 %\item | |
640 {\bf NIST.} | |
641 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995}, | |
642 widely used for training and testing character | |
643 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. | |
644 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand-checked classifications, |
645 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | |
646 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. | |
647 The fourth partition (called $hsf_4$, 82587 examples), | |
648 experimentally recognized to be the most difficult one, is the one recommended | |
649 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | |
650 for that purpose. We randomly split the remainder (731668 examples) into a training set and a validation set for | |
651 model selection. | |
652 The performances reported by previous work on that dataset mostly use only the digits. | |
653 Here we use all the classes both in the training and testing phase. This is especially | |
654 useful to estimate the effect of a multi-task setting. | |
655 The distribution of the classes in the NIST training and test sets differs | |
656 substantially, with relatively many more digits in the test set, and a more uniform distribution | |
657 of letters in the test set (whereas in the training set they are distributed | |
658 more like in natural text). | |
659 %\vspace*{-1mm} | |
660 | |
661 %\item | |
662 {\bf Fonts.} | |
663 In order to have a good variety of sources we downloaded a large number of free fonts from: |
664 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. | |
665 % TODO: pointless to anonymize, it's not pointing to our work | |
666 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. | |
667 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | |
668 directly as input to our models. | |
669 %\vspace*{-1mm} | |
670 | |
671 %\item | |
672 {\bf Captchas.} | |
673 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a python based captcha generator library) for | |
674 generating characters of the same format as the NIST dataset. This software is based on | |
675 a random character class generator and various kinds of transformations similar to those described in the previous sections. | |
676 In order to increase the variability of the data generated, many different fonts are used for generating the characters. | |
677 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity | |
678 depending on the value of the complexity parameter provided by the user of the data source. | |
679 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class? | |
680 %\vspace*{-1mm} | |
681 | |
682 %\item | |
683 {\bf OCR data.} | |
684 A large set (2 million) of scanned, OCRed and manually verified machine-printed | |
685 characters were included as an |
686 additional source. This set is part of a larger corpus being collected by the Image Understanding | |
687 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern | |
688 ({\tt http://www.iupr.com}), and which will be publicly released. | |
689 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this | |
690 %\end{itemize} | |
691 | |
692 %\vspace*{-3mm} | |
693 \subsection{Data Sets} | |
694 %\vspace*{-2mm} | |
695 | |
696 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label | |
697 from one of the 62 character classes. | |
698 %\begin{itemize} | |
699 %\vspace*{-1mm} | |
700 | |
701 %\item | |
702 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has | |
703 \{651668 / 80000 / 82587\} \{training / validation / test\} examples. | |
704 %\vspace*{-1mm} | |
705 | |
706 %\item | |
707 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | |
708 and sending them through the transformation pipeline described in section \ref{s:perturbations}. | |
709 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, | |
710 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | |
711 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. | |
712 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. | |
713 %\vspace*{-1mm} | |
714 | |
715 %\item | |
716 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) | |
717 except that we only apply | |
718 transformations from slant to pinch. Therefore, the character is | |
719 transformed but no additional noise is added to the image, giving images | |
720 closer to the NIST dataset. | |
721 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. | |
722 %\end{itemize} | |
723 | |
724 %\vspace*{-3mm} | |
725 \subsection{Models and their Hyperparameters} | |
726 %\vspace*{-2mm} | |
727 | |
728 The experiments are performed using MLPs (with a single | |
729 hidden layer) and SDAs. | |
730 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} | |
731 | |
732 {\bf Multi-Layer Perceptrons (MLP).} | |
733 Whereas previous work had compared deep architectures to both shallow MLPs and | |
734 SVMs, we only compared to MLPs here because of the very large datasets used | |
735 (making the use of SVMs computationally challenging because of their quadratic | |
736 scaling behavior). | |
737 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | |
738 exponentials) on the output layer for estimating $P(class | image)$. | |
739 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | |
740 Training examples are presented in minibatches of size 20. A constant learning | |
741 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. | |
742 %through preliminary experiments (measuring performance on a validation set), | |
743 %and $0.1$ (which was found to work best) was then selected for optimizing on | |
744 %the whole training sets. | |
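As a minimal sketch of this architecture (an illustration under our own naming, not the experimental code), the forward computation of such an MLP can be written as:
\begin{verbatim}
import numpy as np

def init_mlp(n_in, n_hidden, n_out, rng):
    # One tanh hidden layer followed by a softmax output layer.
    W1 = rng.uniform(-0.1, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.1, 0.1, (n_hidden, n_out)); b2 = np.zeros(n_out)
    return W1, b1, W2, b2

def mlp_probs(params, x):
    # x: minibatch of flattened 32x32 images, shape (batch_size, 1024).
    W1, b1, W2, b2 = params
    h = np.tanh(x.dot(W1) + b1)              # hidden layer
    a = h.dot(W2) + b2                       # output pre-activations
    a -= a.max(axis=1, keepdims=True)        # for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)  # softmax: P(class | image)
\end{verbatim}
Training then proceeds by stochastic gradient descent (typically on the negative log-likelihood of the correct class), with minibatches of size 20 and a constant learning rate chosen from the set above.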
745 %\vspace*{-1mm} | |
746 | |
747 | |
748 {\bf Stacked Denoising Auto-Encoders (SDA).} | |
749 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | |
750 can be used to initialize the weights of each layer of a deep MLP (with many hidden | |
751 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, | |
752 apparently setting parameters in the | |
753 basin of attraction of supervised gradient descent yielding better | |
754 generalization~\citep{Erhan+al-2010}. It is hypothesized that the | |
755 advantage brought by this procedure stems from a better prior, | |
756 on the one hand taking advantage of the link between the input | |
757 distribution $P(x)$ and the conditional distribution of interest | |
758 $P(y|x)$ (like in semi-supervised learning), and on the other hand | |
759 taking advantage of the expressive power and bias implicit in the | |
760 deep architecture (whereby complex concepts are expressed as | |
761 compositions of simpler ones through a deep hierarchy). | |
762 | |
763 \begin{figure}[ht] | |
764 %\vspace*{-2mm} | |
765 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
766 %\vspace*{-2mm} | |
767 \caption{Illustration of the computations and training criterion for the denoising | |
768 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
769 the layer (i.e. raw input or output of previous layer) | |
770 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. |
771 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
772 is compared to the uncorrupted input $x$ through the loss function | |
773 $L_H(x,z)$, whose expected value is approximately minimized during training | |
774 by tuning $\theta$ and $\theta'$.} | |
775 \label{fig:da} | |
776 %\vspace*{-2mm} | |
590 \end{figure} | 777 \end{figure} |
591 | 778 |
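To make the computations of Figure~\ref{fig:da} concrete, here is a minimal forward-pass sketch of a denoising auto-encoder with masking corruption (our illustration: sigmoid units, tied decoder weights and a cross-entropy reconstruction loss are assumed here, not prescribed by the figure):
\begin{verbatim}
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def da_loss(x, W, b, b_prime, corruption, rng):
    # x: minibatch of inputs in [0,1]^d (raw input or previous layer's code).
    # Masking corruption: a random subset of the inputs is set to 0.
    mask = rng.binomial(1, 1.0 - corruption, size=x.shape)
    x_tilde = x * mask
    y = sigmoid(x_tilde.dot(W) + b)          # code (encoder f_theta)
    z = sigmoid(y.dot(W.T) + b_prime)        # reconstruction (decoder g_theta')
    # Cross-entropy reconstruction loss L_H(x, z), compared to the
    # *uncorrupted* input x.
    eps = 1e-7
    return -np.mean(np.sum(x * np.log(z + eps)
                           + (1.0 - x) * np.log(1.0 - z + eps), axis=1))
\end{verbatim}
The parameters are then updated by gradient descent on this loss (in practice via automatic differentiation); once trained, the code $y$ serves as input for the next layer's auto-encoder.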
592 | 779 Here we chose to use the Denoising |
593 \section{Experimental Setup} | 780 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for |
594 | 781 these deep hierarchies of features, as it is simple to train and |
595 \subsection{Training Datasets} | 782 explain (see Figure~\ref{fig:da}, as well as |
596 | 783 tutorial and code there: {\tt http://deeplearning.net/tutorial}), |
597 \subsubsection{Data Sources} | 784 provides efficient inference, and yielded results |
598 | 785 comparable to or better than RBMs in a series of experiments |
599 \begin{itemize} | 786 \citep{VincentPLarochelleH2008}. During training, a Denoising |
600 \item {\bf NIST} | 787 Auto-encoder is presented with a stochastically corrupted version |
601 The NIST Special Database 19 (NIST19) is a very widely used dataset for training and testing OCR systems. | 788 of the input and trained to reconstruct the uncorrupted input, |
602 The dataset is composed with over 800 000 digits and characters (upper and lower cases), with hand checked classifications, | 789 forcing the hidden units to represent the leading regularities in |
603 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | 790 the data. Here we use the random binary masking corruption |
604 corresponding to "0"-"9","A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity. | 791 (which sets to 0 a random subset of the inputs). |
605 The fourth series, $hsf_4$, experimentally recognized to be the most difficult one for classification task is recommended | 792 Once it is trained, in a purely unsupervised way, |
606 by NIST as testing set and is used in our work for that purpose. | 793 its hidden units' activations can |
607 The performances reported by previous work on that dataset mostly use only the digits. | 794 be used as inputs for training a second one, etc. |
608 Here we use the whole classes both in the training and testing phase. | 795 After this unsupervised pre-training stage, the parameters |
609 | 796 are used to initialize a deep MLP, which is fine-tuned by |
610 | 797 the same standard procedure used to train MLPs (see previous section). |
611 \item {\bf Fonts} | 798 The SDA hyper-parameters are the same as for the MLP, with the addition of the |
612 In order to have a good variety of sources we downloaded an important number of free fonts from: {\tt http://anonymous.url.net} | 799 amount of corruption noise (we used the masking noise process, whereby a |
613 %real adress {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} | 800 fixed proportion of the input values, randomly selected, are zeroed), and a |
614 in addition to Windows 7's, this adds up to a total of $9817$ different fonts that we can choose uniformly. | 801 separate learning rate for the unsupervised pre-training stage (selected |
615 The ttf file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | 802 from the same above set). The fraction of inputs corrupted was selected |
616 directly as input to our models. | 803 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
617 %Guillaume are there other details I forgot on the font selection? | 804 of hidden layers but it was fixed to 3 based on previous work with |
618 | 805 SDAs on MNIST~\citep{VincentPLarochelleH2008}. |
619 \item {\bf Captchas} | 806 |
620 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a python based captcha generator library) for | 807 %\vspace*{-1mm} |
621 generating characters of the same format as the NIST dataset. The core of this data source is composed with a random character | 808 |
622 generator and various kinds of tranformations similar to those described in the previous sections. | 809 \begin{figure}[ht] |
623 In order to increase the variability of the data generated, different fonts are used for generating the characters. | 810 %\vspace*{-2mm} |
624 Transformations (slant, distorsions, rotation, translation) are applied to each randomly generated character with a complexity | 811 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} |
625 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are | 812 %\vspace*{-3mm} |
626 allowed and can be controlled via an easy to use facade class. | 813 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained |
627 \item {\bf OCR data} | 814 on NIST, 1 on NISTP, and 2 on P07. Left: overall results |
628 \end{itemize} | 815 of all models, on NIST and NISTP test sets. |
629 | 816 Right: error rates on NIST test digits only, along with the previous results from |
630 \subsubsection{Data Sets} | 817 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
631 \begin{itemize} | 818 respectively based on ART, nearest neighbors, MLPs, and SVMs.} |
632 \item {\bf P07} | 819 \label{fig:error-rates-charts} |
633 The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. | 820 %\vspace*{-2mm} |
634 For each new example to generate, we choose one source with the following probabilities: $0.1$ for the fonts, |
635 $0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in the order given above, |
636 and for each of them we sample uniformly a complexity in the range $[0,0.7]$. | |
637 \item {\bf NISTP} {\em do not use PNIST but NISTP, to remain politically correct...} |
638 NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions) except that we only apply transformations from slant to pinch. Therefore, the character is transformed |
639 but no additional noise is added to the image, which gives images closer to the NIST dataset. |
640 \end{itemize} | |
641 | |
642 We noticed that the distributions of the training sets and the test sets differ. |
643 Since our validation sets are sampled from the training set, they have approximately the same distribution, but the test set has a completely different distribution, as illustrated in Figure~\ref{setsdata}. |
644 | |
645 \begin{figure} | |
646 \subfigure[NIST training]{\includegraphics[width=0.5\textwidth]{images/nisttrainstats}} | |
647 \subfigure[NIST validation]{\includegraphics[width=0.5\textwidth]{images/nistvalidstats}} | |
648 \subfigure[NIST test]{\includegraphics[width=0.5\textwidth]{images/nistteststats}} | |
649 \subfigure[NISTP validation]{\includegraphics[width=0.5\textwidth]{images/nistpvalidstats}} | |
650 \caption{Proportion of each class in some of the data sets} | |
651 \label{setsdata} | |
652 \end{figure} | 821 \end{figure} |
653 | 822 |
654 \subsection{Models and their Hyperparameters} | 823 |
655 | 824 \begin{figure}[ht] |
656 \subsubsection{Multi-Layer Perceptrons (MLP)} | 825 %\vspace*{-3mm} |
657 | 826 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} |
658 An MLP is a family of functions that are described by stacking layers of a function similar to | 827 %\vspace*{-3mm} |
659 $$g(x) = \tanh(b+Wx)$$ | 828 \caption{Relative improvement in error rate due to self-taught learning. |
660 The input, $x$, is a $d$-dimensional vector. | 829 Left: Improvement (or loss, when negative) |
661 The output, $g(x)$, is an $m$-dimensional vector. | 830 induced by out-of-distribution examples (perturbed data). |
662 The parameter $W$ is an $m\times d$ matrix and is called the weight matrix. | 831 Right: Improvement (or loss, when negative) induced by multi-task |
663 The parameter $b$ is an $m$-vector and is called the bias vector. | 832 learning (training on all classes and testing only on either digits, |
664 The non-linearity (here $\tanh$) is applied element-wise to the output vector. | 833 upper case, or lower-case). The deep learner (SDA) benefits more from |
665 Usually the input is referred to as the input layer, and similarly for the output. | 834 both self-taught learning scenarios, compared to the shallow MLP.} |
666 You can of course chain several such functions to obtain a more complex one. | 835 \label{fig:improvements-charts} |
667 Here is a common example | 836 %\vspace*{-2mm} |
668 $$f(x) = c + V\tanh(b+Wx)$$ | 837 \end{figure} |
669 In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer. | |
670 Here the output layer does not have the same non-linearity as the hidden layer. | |
671 This is a common case where some specialized non-linearity is applied to the output layer only depending on the task at hand. | |
672 | |
673 If you put 3 or more hidden layers in such a network you obtain what is called a deep MLP. | |
674 The parameters to adapt are the weight matrix and the bias vector for each layer. | |
675 | |
676 \subsubsection{Stacked Denoising Auto-Encoders (SDAE)} | |
677 \label{SdA} | |
678 | |
679 Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization. | |
680 This is essentially unsupervised training where the layer is made to reconstruct its input through an encoding and decoding phase. |
681 Denoising auto-encoders are a variant where the input is corrupted with random noise but the target is the uncorrupted input. | |
682 The principle behind these initialization methods is that the network will learn the inherent relation between portions of the data and be able to represent them thus helping with whatever task we want to perform. | |
683 | |
684 An auto-encoder unit is formed of two MLP layers with the bottom one called the encoding layer and the top one the decoding layer. | |
685 Usually the top and bottom weight matrices are the transpose of each other and are fixed this way. | |
686 The network is trained as such and, when sufficiently trained, the MLP layer is initialized with the parameters of the encoding layer. | |
687 The other parameters are discarded. | |
688 | |
689 The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder starting from the bottom. | |
690 During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. | |
691 The second and following layers receive the same treatment except that they take as input the encoded version of the data that has gone through the layers before it. | |
692 For additional details see \cite{vincent:icml08}. | |
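A sketch of this greedy layer-wise procedure (the helper passed as \texttt{train\_da} is assumed to train one denoising auto-encoder on its inputs, as above, and to return the corresponding encoding function):
\begin{verbatim}
def pretrain_stack(data, layer_sizes, train_da):
    # Greedy layer-wise unsupervised pre-training.
    representations = data          # raw input for the bottom layer
    encoders = []
    for n_hidden in layer_sizes:    # e.g. 3 hidden layers
        encode = train_da(representations, n_hidden)   # train one DA
        encoders.append(encode)
        # The next layer sees the encoded version of the data.
        representations = encode(representations)
    return encoders  # their weights initialize the deep MLP before fine-tuning
\end{verbatim}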
693 | 838 |
694 \section{Experimental Results} | 839 \section{Experimental Results} |
695 | 840 %\vspace*{-2mm} |
696 \subsection{SDA vs MLP vs Humans} | 841 |
697 | 842 %%\vspace*{-1mm} |
698 We compare here the best MLP (according to validation set error) that we found against | 843 %\subsection{SDA vs MLP vs Humans} |
699 the best SDA (again according to validation set error), along with a precise estimate | 844 %%\vspace*{-1mm} |
700 of human performance obtained via Amazon's Mechanical Turk (AMT) | 845 The models are either trained on NIST (MLP0 and SDA0), |
701 service\footnote{http://mturk.com}. AMT users are paid small amounts | 846 NISTP (MLP1 and SDA1), or P07 (MLP2 and SDA2), and tested |
702 of money to perform tasks for which human intelligence is required. | 847 on either NIST, NISTP or P07, either on the 62-class task |
703 Mechanical Turk has been used extensively in natural language | 848 or on the 10-digits task. Training (including about half |
704 processing \cite{SnowEtAl2008} and vision | 849 for unsupervised pre-training, for DAs) on the larger |
705 \cite{SorokinAndForsyth2008,whitehill09}. AMT users were presented | 850 datasets takes around one day on a GPU-285. |
706 with 10 character images and asked to type 10 corresponding ASCII | 851 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, |
707 characters. Hence they were forced to make a hard choice among the | 852 comparing humans, the three MLPs (MLP0, MLP1, MLP2) and the three SDAs (SDA0, SDA1, |
708 62 character classes. Three users classified each image, allowing us | 853 SDA2), along with the previous results on the digits NIST special database |
709 to estimate inter-human variability (shown as +/- in parentheses below). | 854 19 test set from the literature, respectively based on ARTMAP neural |
710 | 855 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search |
711 \begin{table} | 856 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002-short}, and SVMs |
712 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + | 857 ~\citep{Milgram+al-2005}. More detailed and complete numerical results |
713 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training | 858 (figures and tables, including standard errors on the error rates) can be |
714 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture | 859 found in Appendix I of the supplementary material. |
715 (MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07) | 860 The deep learner not only outperformed the shallow ones and |
716 and using a validation set to select hyper-parameters and other training choices. | 861 previously published performance (in a statistically and qualitatively |
717 \{SDA,MLP\}0 are trained on NIST, | 862 significant way) but when trained with perturbed data |
718 \{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07. | 863 reaches human performance on both the 62-class task |
719 The human error rate on digits is a lower bound because it does not count digits that were | 864 and the 10-class (digits) task. |
720 recognized as letters. For comparison, the results found in the literature | 865 17\% error (SDA1) or 18\% error (humans) may seem large but a large |
721 on NIST digits classification using the same test set are included.} | 866 majority of the errors from humans and from SDA1 are from out-of-context |
722 \label{tab:sda-vs-mlp-vs-humans} | 867 confusions (e.g. a vertical bar can be a ``1'', an ``l'' or an ``L'', and a |
723 \begin{center} | 868 ``c'' and a ``C'' are often indistinguishable). |
724 \begin{tabular}{|l|r|r|r|r|} \hline | 869 |
725 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline | 870 In addition, as shown in the left of |
726 Humans& 18.2\% $\pm$.1\% & 39.4\%$\pm$.1\% & 46.9\%$\pm$.1\% & $>1.1\%$ \\ \hline | 871 Figure~\ref{fig:improvements-charts}, the relative improvement in error |
727 SDA0 & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\%\\ \hline | 872 rate brought by self-taught learning is greater for the SDA, and these |
728 SDA1 & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\% & 29.7\%$\pm$.3\% & 1.4\% $\pm$.1\%\\ \hline | 873 differences with the MLP are statistically and qualitatively |
729 SDA2 & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\% & 39.9\%$\pm$.17\% & 1.7\% $\pm$.1\%\\ \hline | 874 significant. |
730 MLP0 & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline | 875 The left side of the figure shows the improvement to the clean |
731 MLP1 & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\% & 3.85\% $\pm$.16\% \\ \hline | 876 NIST test set error brought by the use of out-of-distribution examples |
732 MLP2 & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\% & 4.85\% $\pm$.18\% \\ \hline | 877 (i.e. the perturbed examples from NISTP or P07). |
733 [5] & & & & 4.95\% $\pm$.18\% \\ \hline | 878 Relative percent change is measured by taking |
734 [2] & & & & 3.71\% $\pm$.16\% \\ \hline | 879 $100 \% \times$ (original model's error / perturbed-data model's error - 1). |
735 [3] & & & & 2.4\% $\pm$.13\% \\ \hline | 880 The right side of |
736 [4] & & & & 2.1\% $\pm$.12\% \\ \hline | 881 Figure~\ref{fig:improvements-charts} shows the relative improvement |
737 \end{tabular} | 882 brought by the use of a multi-task setting, in which the same model is |
738 \end{center} | 883 trained for more classes than the target classes of interest (i.e. training |
739 \end{table} | 884 with all 62 classes when the target classes are respectively the digits, |
740 | 885 lower-case, or upper-case characters). Again, whereas the gain from the |
741 \subsection{Perturbed Training Data More Helpful for SDAE} | 886 multi-task setting is marginal or negative for the MLP, it is substantial |
742 | 887 for the SDA. Note that to simplify these multi-task experiments, only the original |
743 \begin{table} | 888 NIST dataset is used. For example, the MLP-digits bar shows the relative |
744 \caption{Relative change in error rates due to the use of perturbed training data, | 889 percent improvement in MLP error rate on the NIST digits test set |
745 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. | 890 is $100\% \times$ (single-task |
746 A positive value indicates that training on the perturbed data helped for the | 891 model's error / multi-task model's error - 1). The single-task model is |
747 given test set (the first 3 columns on the 62-class tasks and the last one is | 892 trained with only 10 outputs (one per digit), seeing only digit examples, |
748 on the clean 10-class digits). Clearly, the deep learning models did benefit more | 893 whereas the multi-task model is trained with 62 outputs, with all 62 |
749 from perturbed training data, even when testing on clean data, whereas the MLP | 894 character classes as examples. Hence the hidden units are shared across |
750 trained on perturbed data performed worse on the clean digits and about the same | 895 all tasks. For the multi-task model, the digit error rate is measured by |
751 on the clean characters. } | 896 comparing the correct digit class with the output class associated with the |
752 \label{tab:perturbed-data-effect} | 897 comparing the correct digit class with the output class associated with the |
753 \begin{center} | 898 setting is similar for the other two target classes (lower case characters |
754 \begin{tabular}{|l|r|r|r|r|} \hline | 899 and upper case characters). |
755 & NIST test & NISTP test & P07 test & NIST test digits \\ \hline | 900 %%\vspace*{-1mm} |
756 SDA0/SDA1-1 & 38\% & 84\% & 228\% & 93\% \\ \hline | 901 %\subsection{Perturbed Training Data More Helpful for SDA} |
757 SDA0/SDA2-1 & 27\% & 94\% & 144\% & 59\% \\ \hline | 902 %%\vspace*{-1mm} |
758 MLP0/MLP1-1 & 5.2\% & 65\% & -13\% & -10\% \\ \hline | 903 |
759 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline | 904 %%\vspace*{-1mm} |
760 \end{tabular} | 905 %\subsection{Multi-Task Learning Effects} |
761 \end{center} | 906 %%\vspace*{-1mm} |
762 \end{table} | 907 |
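For concreteness, the relative-change measure used in this section, applied to the NISTP test errors of the MLPs reported above (68.8\% for MLP0, trained on NIST, versus 41.8\% for MLP1, trained on NISTP), gives
\[
100\% \times \left( \frac{68.8\%}{41.8\%} - 1 \right) \approx 65\% ,
\]
which is the value shown in the MLP0/MLP1-1 row for the NISTP test set.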
763 | 908 \iffalse |
764 | |
765 \subsection{Multi-Task Learning Effects} | |
766 | |
767 As previously seen, the SDA is better able to benefit from the | 909 As previously seen, the SDA is better able to benefit from the |
768 transformations applied to the data than the MLP. In this experiment we | 910 transformations applied to the data than the MLP. In this experiment we |
769 define three tasks: recognizing digits (knowing that the input is a digit), | 911 define three tasks: recognizing digits (knowing that the input is a digit), |
770 recognizing upper case characters (knowing that the input is one), and | 912 recognizing upper case characters (knowing that the input is one), and |
771 recognizing lower case characters (knowing that the input is one). We | 913 recognizing lower case characters (knowing that the input is one). We |
781 fine-tuned on NIST. | 923 fine-tuned on NIST. |
782 | 924 |
783 Our results show that the MLP benefits marginally from the multi-task setting | 925 Our results show that the MLP benefits marginally from the multi-task setting |
784 in the case of digits (5\% relative improvement) but is actually hurt in the case | 926 in the case of digits (5\% relative improvement) but is actually hurt in the case |
785 of characters (respectively 3\% and 4\% worse for lower and upper class characters). | 927 of characters (respectively 3\% and 4\% worse for lower and upper class characters). |
786 On the other hand the SDA benefitted from the multi-task setting, with relative | 928 On the other hand the SDA benefited from the multi-task setting, with relative |
787 error rate improvements of 27\%, 15\% and 13\% respectively for digits, | 929 error rate improvements of 27\%, 15\% and 13\% respectively for digits, |
788 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | 930 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. |
789 | 931 \fi |
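The multi-task digit error rate described above amounts to an argmax restricted to the digit outputs; a minimal sketch (our own helper, with the class indices of the target subset passed explicitly):
\begin{verbatim}
import numpy as np

def restricted_predictions(probs, allowed_classes):
    # probs: (batch_size, 62) conditional class probabilities of the
    # multi-task model; allowed_classes: e.g. the indices of the 10 digits.
    allowed = np.asarray(allowed_classes)
    return allowed[np.argmax(probs[:, allowed], axis=1)]

# digit_error = np.mean(restricted_predictions(probs, digit_indices) != labels)
\end{verbatim}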
790 \begin{table} | 932 |
791 \caption{Test error rates and relative change in error rates due to the use of | 933 |
792 a multi-task setting, i.e., training on each task in isolation vs training | 934 %\vspace*{-2mm} |
793 for all three tasks together, for MLPs vs SDAs. The SDA benefits much | 935 \section{Conclusions and Discussion} |
794 more from the multi-task setting. All experiments use only the | 936 %\vspace*{-2mm} |
795 unperturbed NIST data, using validation error for model selection. | 937 |
796 Relative improvement is 1 - single-task error / multi-task error.} | 938 We have found that the self-taught learning framework is more beneficial |
797 \label{tab:multi-task} | 939 to a deep learner than to a traditional shallow and purely |
798 \begin{center} | 940 supervised learner. More precisely, |
799 \begin{tabular}{|l|r|r|r|} \hline | 941 the answers are positive for all the questions asked in the introduction. |
800 & single-task & multi-task & relative \\ | 942 %\begin{itemize} |
801 & setting & setting & improvement \\ \hline | 943 |
802 MLP-digits & 3.77\% & 3.99\% & 5.6\% \\ \hline | 944 $\bullet$ %\item |
803 MLP-lower & 17.4\% & 16.8\% & -4.1\% \\ \hline | 945 {\bf Do the good results previously obtained with deep architectures on the |
804 MLP-upper & 7.84\% & 7.54\% & -3.6\% \\ \hline | 946 MNIST digits generalize to a much larger and richer (but similar) |
805 SDA-digits & 2.6\% & 3.56\% & 27\% \\ \hline | 947 dataset, the NIST special database 19, with 62 classes and around 800k examples}? |
806 SDA-lower & 12.3\% & 14.4\% & 15\% \\ \hline | 948 Yes, the SDA {\em systematically outperformed the MLP and all the previously |
807 SDA-upper & 5.93\% & 6.78\% & 13\% \\ \hline | 949 published results on this dataset} (the ones that we are aware of), {\em in fact reaching human-level |
808 \end{tabular} | 950 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
809 \end{center} | 951 |
810 \end{table} | 952 $\bullet$ %\item |
811 | 953 {\bf To what extent do self-taught learning scenarios help deep learners, |
812 \section{Conclusions} | 954 and do they help them more than shallow supervised ones}? |
813 | 955 We found that distorted training examples not only made the resulting |
814 \bibliography{strings,ml,aigaion,specials} | 956 classifier better on similarly perturbed images but also on |
815 \bibliographystyle{mlapa} | 957 the {\em original clean examples}, and, more importantly and more novel, |
958 that deep architectures benefit more from such {\em out-of-distribution} | |
959 examples. MLPs were helped by perturbed training examples when tested on perturbed input | |
960 images (65\% relative improvement on NISTP) | |
961 but only marginally helped (5\% relative improvement on all classes) | |
962 or even hurt (10\% relative loss on digits) | |
963 with respect to clean examples. On the other hand, the deep SDAs |
964 were significantly boosted by these out-of-distribution examples. | |
965 Similarly, whereas the improvement due to the multi-task setting was marginal or | |
966 negative for the MLP (from +5.6\% to -3.6\% relative change), | |
967 it was quite significant for the SDA (from +13\% to +27\% relative change), | |
968 which may be explained by the arguments below. | |
969 %\end{itemize} | |
970 | |
971 In the original self-taught learning framework~\citep{RainaR2007}, the | |
972 out-of-sample examples were used as a source of unsupervised data, and | |
973 experiments showed its positive effects in a \emph{limited labeled data} | |
974 scenario. However, many of the results by \citet{RainaR2007} (who used a | |
975 shallow, sparse coding approach) suggest that the {\em relative gain of self-taught | |
976 learning vs ordinary supervised learning} diminishes as the number of labeled examples increases. | |
977 We note instead that, for deep | |
978 architectures, our experiments show that such a positive effect is accomplished | |
979 even in a scenario with a \emph{large number of labeled examples}, | |
980 i.e., here, the relative gain of self-taught learning is probably preserved | |
981 in the asymptotic regime. | |
982 | |
983 {\bf Why would deep learners benefit more from the self-taught learning framework}? | |
984 The key idea is that the lower layers of the predictor compute a hierarchy | |
985 of features that can be shared across tasks or across variants of the | |
986 input distribution. Intermediate features that can be used in different | |
987 contexts can be estimated in a way that allows sharing statistical |
988 strength. Features extracted through many levels are more likely to | |
989 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest), | |
990 increasing the likelihood that they would be useful for a larger array | |
991 of tasks and input conditions. | |
992 Therefore, we hypothesize that both depth and unsupervised | |
993 pre-training play a part in explaining the advantages observed here, and future | |
994 experiments could attempt to tease apart these factors. |
995 And why would deep learners benefit from the self-taught learning | |
996 scenarios even when the number of labeled examples is very large? | |
997 We hypothesize that this is related to the hypotheses studied | |
998 in~\citet{Erhan+al-2010}. Whereas in~\citet{Erhan+al-2010} | |
999 it was found that online learning on a huge dataset did not make the | |
1000 advantage of the deep learning bias vanish, a similar phenomenon | |
1001 may be happening here. We hypothesize that unsupervised pre-training | |
1002 of a deep hierarchy with self-taught learning initializes the | |
1003 model in the basin of attraction of supervised gradient descent | |
1004 that corresponds to better generalization. Furthermore, such good | |
1005 basins of attraction are not discovered by pure supervised learning | |
1006 (with or without self-taught settings), and more labeled examples | |
1007 do not allow the model to go from the poorer basins of attraction discovered |
1008 by the purely supervised shallow models to the kind of better basins associated | |
1009 with deep learning and self-taught learning. | |
1010 | |
1011 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | |
1012 can be executed on-line at {\tt http://deep.host22.com}. | |
1013 | |
1014 \newpage | |
1015 { | |
1016 \bibliography{strings,strings-short,strings-shorter,ift6266_ml,aigaion-shorter,specials} | |
1017 %\bibliographystyle{plainnat} | |
1018 \bibliographystyle{unsrtnat} | |
1019 %\bibliographystyle{apalike} | |
1020 } | |
1021 | |
816 | 1022 |
817 \end{document} | 1023 \end{document} |