writeup/nips2010_submission.tex @ 476:db28764b8252 (Merge)
author: fsavard
date: Sun, 30 May 2010 12:06:45 -0400
parents: ead3085c1c66 bcf024e6ab23
children: 6593e67381a3
There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

{\large\bf Transformations}\\
{\bf Slant.}
We mimic slant by shifting each row of the image
proportionally to its height: $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
and its value is randomly sampled according to the complexity level:
$slant \sim U[0,complexity]$, so the
maximum displacement for the lowest or highest pixel line is
$round(complexity \times 32)$.\\
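As an illustration, a minimal NumPy sketch of this row-shifting operation (our own re-implementation for exposition, not the exact code used to build the datasets) is:
\begin{verbatim}
import numpy as np

def apply_slant(image, complexity, rng=np.random):
    # slant coefficient: random sign, magnitude ~ U[0, complexity]
    slant = rng.uniform(0, complexity) * rng.choice([-1, 1])
    height, width = image.shape
    out = np.zeros_like(image)
    for y in range(height):
        shift = int(round(slant * y))  # shift proportional to the row height
        if shift >= 0:
            out[y, shift:] = image[y, :width - shift]
        else:
            out[y, :width + shift] = image[y, -shift:]
    return out
\end{verbatim}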
{\bf Thickness.}
Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
are applied. The neighborhood of each pixel is multiplied
element-wise with a {\em structuring element} matrix.
The pixel value is replaced by the maximum or the minimum of the resulting
matrix, respectively for dilation or erosion. Ten different structural elements with
[...]
$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
for erosion. A neutral element is always present in the set, and if it is
chosen no transformation is applied. Erosion allows only the six
smallest structural elements because when the character is too thin it may
be completely erased.\\
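A sketch of this step using SciPy's grey-level morphology (the particular structuring-element shapes below are our assumption, since the description of the ten elements is only partially reproduced above) could look like:
\begin{verbatim}
import numpy as np
from scipy import ndimage

def apply_thickness(image, complexity, rng=np.random):
    # neutral 1x1 element plus increasingly large square elements (assumed shapes)
    elements = [np.ones((1, 1))] + [np.ones((k, k)) for k in range(2, 12)]
    if rng.rand() < 0.5:   # dilation: thicker strokes
        idx = rng.randint(0, int(round(10 * complexity)) + 1)
        return ndimage.grey_dilation(image, footprint=elements[idx])
    else:                  # erosion: thinner strokes, only the smallest elements
        idx = rng.randint(0, int(round(6 * complexity)) + 1)
        return ndimage.grey_erosion(image, footprint=elements[idx])
\end{verbatim}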
{\bf Affine Transformations.}
A $2 \times 3$ affine transform matrix (with
6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
Each pixel $(x,y)$ of the output image takes the value of the pixel
nearest to $(ax+by+c,dx+ey+f)$ in the input image. This
produces scaling, translation, rotation and shearing.
[...]
forbid important rotations (not to confuse classes) but to give good
variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
complexity]$.\\
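The nearest-neighbour resampling described above can be sketched as follows (again an illustrative re-implementation):
\begin{verbatim}
import numpy as np

def apply_affine(image, complexity, rng=np.random):
    a = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity)
    d = rng.uniform(1 - 3 * complexity, 1 + 3 * complexity)
    b = rng.uniform(-3 * complexity, 3 * complexity)
    e = rng.uniform(-3 * complexity, 3 * complexity)
    c = rng.uniform(-4 * complexity, 4 * complexity)
    f = rng.uniform(-4 * complexity, 4 * complexity)
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # nearest input pixel to (ax + by + c, dx + ey + f)
            sx = int(round(a * x + b * y + c))
            sy = int(round(d * x + e * y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
\end{verbatim}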
{\bf Local Elastic Deformations.}
This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03},
which provides more details.
Two ``displacement'' fields are generated and applied, for horizontal
and vertical displacements of pixels.
To generate a pixel in either field, first a value between -1 and 1 is
[...]
displacements (larger $\alpha$ translates into larger wiggles).
Each field is convolved with a Gaussian 2D kernel of
standard deviation $\sigma$. Visually, this results in a blur.
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.\\
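In the spirit of \citet{SimardSP03}, a simplified sketch of this deformation (not the exact dataset code) is:
\begin{verbatim}
import numpy as np
from scipy import ndimage

def apply_elastic(image, complexity, rng=np.random):
    alpha = complexity ** (1.0 / 3.0) * 10.0
    sigma = 10 - 7 * complexity ** (1.0 / 3.0)
    h, w = image.shape
    # random fields in [-1, 1], smoothed by a Gaussian kernel, scaled by alpha
    dx = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = ndimage.gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.array([ys + dy, xs + dx])
    return ndimage.map_coordinates(image, coords, order=1, mode='constant')
\end{verbatim}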
{\bf Pinch.}
This GIMP filter is named ``Whirl and
pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
For a square input image, think of drawing a circle of
radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
[...]
The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.
Here $pinch \sim U[-complexity, 0.7 \times complexity]$.\\

{\large\bf Injecting Noise}\\
{\bf Motion Blur.}
This GIMP filter is a ``linear motion blur'' in GIMP
terminology, with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the first $length$ pixels
found by moving in the $angle$ direction.
Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.\\
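A naive re-implementation of this directional averaging (the GIMP plug-in itself is more sophisticated) could be:
\begin{verbatim}
import numpy as np

def apply_motion_blur(image, complexity, rng=np.random):
    angle = np.radians(rng.uniform(0, 360))
    length = int(abs(rng.normal(0, 3 * complexity)))
    if length == 0:
        return image
    h, w = image.shape
    dx, dy = np.cos(angle), np.sin(angle)
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            # average the pixels met while moving in the angle direction
            samples = []
            for k in range(length + 1):
                sx, sy = int(round(x + k * dx)), int(round(y + k * dy))
                if 0 <= sx < w and 0 <= sy < h:
                    samples.append(image[sy, sx])
            out[y, x] = np.mean(samples)
    return out
\end{verbatim}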
{\bf Occlusion.}
This filter selects a random rectangle from an {\em occluder} character
image and places it over the original {\em occluded} character
image. Pixels are combined by taking $\max(occluder, occluded)$,
i.e. the value closer to black. The rectangle corners
are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
It has a 60\% probability of not being applied at all.\\
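For illustration, a simplified occlusion step (the corner and position sampling below are uniform rather than the normal sampling described above) might be:
\begin{verbatim}
import numpy as np

def apply_occlusion(image, occluder, complexity, rng=np.random):
    if rng.rand() < 0.6:          # 60% of the time, do nothing
        return image
    h, w = image.shape
    rh = max(1, int(rng.uniform(0, complexity) * h))
    rw = max(1, int(rng.uniform(0, complexity) * w))
    sy, sx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    dy, dx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    out = image.copy()
    patch = occluder[sy:sy + rh, sx:sx + rw]
    region = out[dy:dy + rh, dx:dx + rw]
    out[dy:dy + rh, dx:dx + rw] = np.maximum(region, patch)
    return out
\end{verbatim}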
{\bf Pixel Permutation.}
This filter permutes neighbouring pixels. It first selects a fraction
$\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then
sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers
of exchanges to the left, right, top and bottom are equal, or differ
by at most 1 if the number of selected pixels is not a multiple of 4.
It has an 80\% probability of not being applied at all.\\
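A simplified sketch (here the neighbour direction is drawn uniformly, rather than balanced over the four directions as described above):
\begin{verbatim}
import numpy as np

def apply_permutation(image, complexity, rng=np.random):
    if rng.rand() < 0.8:          # 80% of the time, do nothing
        return image
    h, w = image.shape
    out = image.copy()
    n = int(h * w * complexity / 3.0)
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # V4 neighbourhood
    for _ in range(n):
        y, x = rng.randint(0, h), rng.randint(0, w)
        dy, dx = directions[rng.randint(0, 4)]
        ny, nx = min(max(y + dy, 0), h - 1), min(max(x + dx, 0), w - 1)
        out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
\end{verbatim}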
{\bf Gaussian Noise.}
This filter simply adds, to each pixel of the image independently, a
noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
It has a 70\% probability of not being applied at all.\\
{\bf Background Images.}
Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
background behind the letter. The background is chosen by first selecting,
at random, an image from a set of images. Then a 32$\times$32 subregion
of that image is chosen as the background image (by sampling position
uniformly while making sure not to cross image borders).
[...]
intensity) for both the original image and the background image, $maximage$
and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
Each background pixel value is multiplied by $\frac{\max(maximage -
contrast, 0)}{maxbg}$ (higher contrast yields a darker
background). The output image pixels are $\max(background, original)$.\\
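A sketch of this compositing step, assuming a list {\tt bg\_pool} of larger grey-level background images (the pool itself is not part of this excerpt):
\begin{verbatim}
import numpy as np

def apply_background(image, bg_pool, complexity, rng=np.random):
    bg_img = bg_pool[rng.randint(0, len(bg_pool))]
    h, w = image.shape
    y = rng.randint(0, bg_img.shape[0] - h + 1)
    x = rng.randint(0, bg_img.shape[1] - w + 1)
    background = bg_img[y:y + h, x:x + w].astype(float)
    maximage, maxbg = image.max(), max(background.max(), 1e-8)
    contrast = rng.uniform(complexity, 1.0)
    background = background * max(maximage - contrast, 0.0) / maxbg
    return np.maximum(background, image)   # keep the value closer to black
\end{verbatim}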
{\bf Salt and Pepper Noise.}
This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
The fraction of selected pixels is $0.2 \times complexity$.
This filter has a 75\% probability of not being applied at all.\\
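A minimal sketch of this noise process:
\begin{verbatim}
import numpy as np

def apply_salt_pepper(image, complexity, rng=np.random):
    if rng.rand() < 0.75:         # 75% of the time, do nothing
        return image
    out = image.copy()
    mask = rng.rand(*image.shape) < 0.2 * complexity
    out[mask] = rng.uniform(0, 1, size=mask.sum())
    return out
\end{verbatim}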
{\bf Spatially Gaussian Noise.}
Different regions of the image are spatially smoothed.
The image is convolved with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12,12 + 20 \times
complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
between $0$ and $1$. We also create a symmetric averaging window, of the
[...]
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation: $\frac{image + filtered\,image \times mask}{mask+1}$.
This filter has a 75\% probability of not being applied at all.\\
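A loose sketch of this local smoothing (the window size, number of regions and their placement below are assumptions, since part of the description is elided above):
\begin{verbatim}
import numpy as np
from scipy import ndimage

def apply_local_smoothing(image, complexity, rng=np.random):
    if rng.rand() < 0.75:         # 75% of the time, do nothing
        return image
    sigma = rng.uniform(2, 2 + 6 * complexity)
    filtered = ndimage.gaussian_filter(image, sigma)
    filtered = (filtered - filtered.min()) / max(filtered.ptp(), 1e-8)
    h, w = image.shape
    mask = np.zeros(image.shape)
    for _ in range(3):            # a few smoothed regions (assumed count)
        cy, cx = rng.randint(0, h), rng.randint(0, w)
        half = 4                  # assumed half-width of the averaging window
        mask[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half] += 1.0
    return (image + filtered * mask) / (mask + 1.0)
\end{verbatim}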
{\bf Scratches.}
The scratches module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among five thousand such images. The ``1'' image is
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
complexity)^2)$, using bicubic interpolation,
[...]
This filter is applied only 15\% of the time. When it is applied, 50\%
of the time only one patch image is generated and applied. In 30\% of
cases, two patches are generated, and otherwise three patches are
generated. The patch is applied by taking the maximal value of the
patch or the original image, for each of the 32$\times$32 pixel locations.\\
{\bf Color and Contrast Changes.}
This filter changes the contrast and may invert the image polarity (white
on black to black on white). The contrast $C$ is defined here as the
difference between the maximum and the minimum pixel value of the image.
Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$.
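A sketch of this rescaling (the 50\% polarity-inversion probability is an assumption of ours, since the corresponding sentence is elided from this excerpt):
\begin{verbatim}
import numpy as np

def apply_contrast(image, complexity, rng=np.random):
    C = rng.uniform(1 - 0.85 * complexity, 1.0)
    lo, hi = (1 - C) / 2.0, 1 - (1 - C) / 2.0
    span = max(image.max() - image.min(), 1e-8)
    out = lo + (image - image.min()) / span * (hi - lo)
    if rng.rand() < 0.5:          # polarity inversion (assumed probability)
        out = lo + hi - out
    return out
\end{verbatim}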
[...]

\section{Experimental Setup}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
to focus here on the case of much larger training sets, from 10 times to
1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source} (NIST characters, scanned machine-printed characters, characters
from fonts, or characters from captchas) and then optionally applying some of the
above transformations and/or noise processes.

\subsection{Data Sources}

\begin{itemize}
\item {\bf NIST}
Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 series of different complexity.
The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
by NIST as a testing set and is used for that purpose in our work and in some previous
work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
We randomly split the remainder into a training set and a validation set for
model selection. The sizes of these data sets are: XXX for training, XXX for validation,
and XXX for testing.
The performances reported by previous work on that dataset mostly use only the digits.
Here we use all the classes both in the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a uniform distribution
[...]

\item {\bf Fonts} TODO!!!

\item {\bf Captchas}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the data generated, many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
allowed and can be controlled via an easy-to-use facade class.
\item {\bf OCR data}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) was included as an
additional source. This set is part of a larger corpus being collected by the Image Understanding
and Pattern Recognition research group led by Thomas Breuel at the University of Kaiserslautern
({\tt http://www.iupr.com}), and which will be publicly released.
\end{itemize}

\subsection{Data Sets}
All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
\begin{itemize}
\item {\bf NIST}. This is the raw NIST Special Database 19.
\item {\bf P07}. This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the above transformation pipeline (a sampling sketch is given after this list).
For each new example to generate, a source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
\item {\bf NISTP}. NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
\end{itemize}
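The P07 generation loop can be summarized as follows; {\tt draw\_font}, {\tt draw\_captcha}, {\tt draw\_ocr}, {\tt draw\_nist} and {\tt TRANSFORMATIONS} are hypothetical stand-ins for the actual data sources and pipeline modules:
\begin{verbatim}
import numpy as np

def generate_p07_example(rng=np.random):
    # hypothetical stand-ins for the four data sources described above
    sources = [draw_font, draw_captcha, draw_ocr, draw_nist]
    probs = [0.10, 0.25, 0.25, 0.40]
    source = sources[rng.choice(len(sources), p=probs)]
    image, label = source(rng)
    for transform in TRANSFORMATIONS:      # applied in the order given above
        complexity = rng.uniform(0, 0.7)   # per-transformation complexity
        image = transform(image, complexity, rng)
    return image, label
\end{verbatim}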

\subsection{Models and their Hyperparameters}

All hyper-parameters are selected based on performance on the NISTP validation set.

\subsubsection{Multi-Layer Perceptrons (MLP)}

Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compare to MLPs here because of the very large datasets used.
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P({\rm class} | {\rm image})$.
The hyper-parameters are the following: the number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate was chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.
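As a sketch of this model (our own NumPy illustration of the architecture and of one minibatch update, not the code actually used for the experiments):
\begin{verbatim}
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_forward(X, W1, b1, W2, b2):
    # single tanh hidden layer, softmax output: P(class | image)
    H = np.tanh(X.dot(W1) + b1)
    return softmax(H.dot(W2) + b2)

def sgd_step(X, y, W1, b1, W2, b2, lr=0.1):
    # one minibatch step of gradient descent on the negative log-likelihood
    H = np.tanh(X.dot(W1) + b1)
    P = softmax(H.dot(W2) + b2)
    n = X.shape[0]
    dZ2 = P.copy(); dZ2[np.arange(n), y] -= 1.0; dZ2 /= n
    dW2, db2 = H.T.dot(dZ2), dZ2.sum(axis=0)
    dH = dZ2.dot(W2.T) * (1 - H ** 2)
    dW1, db1 = X.T.dot(dH), dH.sum(axis=0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g
\end{verbatim}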
\subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
\label{SdA}

Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization, apparently by setting the parameters in the
basin of attraction of supervised gradient descent solutions that
generalize better~\citep{Erhan+al-2010}. It is hypothesized that the
advantage brought by this procedure stems from a better prior,
on the one hand taking advantage of the link between the input
distribution $P(x)$ and the conditional distribution of interest
$P(y|x)$ (as in semi-supervised learning), and on the other hand
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).

Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising
Auto-Encoder, it is presented with a stochastically corrupted version
of the input and trained to reconstruct the uncorrupted input,
forcing the hidden units to represent the leading regularities in
the data. Once it is trained, its hidden unit activations can
be used as inputs for training a second one, etc.
After this unsupervised pre-training stage, the parameters
are used to initialize a deep MLP, which is fine-tuned by
the same standard procedure used to train MLPs (see previous section).

The hyper-parameters are the same as for the MLP, with the addition of the
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

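The greedy layer-wise pre-training of one such denoising auto-encoder layer can be sketched as follows (an illustrative NumPy version with tied weights and masking noise, not the actual experiment code):
\begin{verbatim}
import numpy as np

def pretrain_dae_layer(X, n_hidden, corruption=0.2, lr=0.01,
                       epochs=10, rng=np.random):
    n_in = X.shape[1]
    W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.rand(n_in) > corruption)  # masking noise
            h = sigmoid(x_tilde.dot(W) + b_h)            # encode
            z = sigmoid(h.dot(W.T) + b_v)                # decode (tied weights)
            # gradient of the cross-entropy reconstruction error
            dz = z - x
            dh = dz.dot(W) * h * (1 - h)
            W -= lr * (np.outer(x_tilde, dh) + np.outer(dz, h))
            b_v -= lr * dz
            b_h -= lr * dh
    # W and b_h initialize one hidden layer of the deep MLP; the hidden
    # activations sigmoid(X.dot(W) + b_h) become the input of the next layer.
    return W, b_h
\end{verbatim}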
\section{Experimental Results}

\subsection{SDA vs MLP vs Humans}

[...]
on NIST digits classification using the same test set are included.}
\label{tab:sda-vs-mlp-vs-humans}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
          & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
Humans    & 18.2\% $\pm$.1\%  & 39.4\%$\pm$.1\%  & 46.9\%$\pm$.1\%   & $1.4\%$ \\ \hline
SDA0      & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\% \\ \hline
SDA1      & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\%  & 29.7\%$\pm$.3\%   & 1.4\% $\pm$.1\% \\ \hline
SDA2      & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\%  & 39.9\%$\pm$.17\%  & 1.7\% $\pm$.1\% \\ \hline
MLP0      & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
MLP1      & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\%   & 3.85\% $\pm$.16\% \\ \hline
MLP2      & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\%  & 4.85\% $\pm$.18\% \\ \hline
\citep{Granger+al-2007}  & & & & 4.95\% $\pm$.18\% \\ \hline
\citep{Cortes+al-2000}   & & & & 3.71\% $\pm$.16\% \\ \hline
\citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
\citep{Milgram+al-2005}  & & & & 2.1\% $\pm$.12\% \\ \hline
\end{tabular}
\end{center}
\end{table}

[...]

\section{Conclusions}

The conclusions are positive for all the questions asked in the introduction.
\begin{itemize}
\item Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST Special Database 19, with 62 classes and around 800k examples?
Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
performance.
\item To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input images,
but were only marginally helped with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.
\item Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP, it was very significant for the SDA.
\end{itemize}

\bibliography{strings,ml,aigaion,specials}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}