ift6266: comparison of writeup/nips2010_submission.tex @ 554:e95395f51d72
summary:  minor
author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Wed, 02 Jun 2010 18:17:52 -0400
parents:  8f6c09d1140f
children: b6dfba0a110c
--- writeup/nips2010_submission.tex (553:8f6c09d1140f)
+++ writeup/nips2010_submission.tex (554:e95395f51d72)
@@ -18,29 +18,11 @@
 %\makeanontitle
 \maketitle
 
 \vspace*{-2mm}
 \begin{abstract}
-Recent theoretical and empirical work in statistical machine learning has
-demonstrated the importance of learning algorithms for deep
-architectures, i.e., function classes obtained by composing multiple
-non-linear transformations. Self-taught learning (exploiting unlabeled
-examples or examples from other distributions) has already been applied
-to deep learners, but mostly to show the advantage of unlabeled
-examples. Here we explore the advantage brought by {\em out-of-distribution
-examples} and show that {\em deep learners benefit more from them than a
-corresponding shallow learner}, in the area
-of handwritten character recognition. In fact, we show that they reach
-human-level performance on both handwritten digit classification and
-62-class handwritten character recognition. For this purpose we
-developed a powerful generator of stochastic variations and noise
-processes for character images, including not only affine transformations but
-also slant, local elastic deformations, changes in thickness, background
-images, grey level changes, contrast, occlusion, and various types of
-noise. The out-of-distribution examples are
-obtained from these highly distorted images or
-by including examples of object classes different from those in the target test set.
+Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples} and show that {\em deep learners benefit more from them than a corresponding shallow learner}, in the area of handwritten character recognition. In fact, we show that they reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set.
 \end{abstract}
 \vspace*{-2mm}
 
 \section{Introduction}
 \vspace*{-1mm}
@@ -181,15 +163,15 @@
 
 
 \begin{minipage}[b]{0.14\linewidth}
 \centering
 \includegraphics[scale=.45]{images/Thick_only.PNG}
-\label{fig:Think}
+\label{fig:Thick}
 \vspace{.9cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Thinkness.}
+{\bf Thickness.}
 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
 are applied. The neighborhood of each pixel is multiplied
 element-wise with a {\em structuring element} matrix.
 The pixel value is replaced by the maximum or the minimum of the resulting
 matrix, respectively for dilation or erosion. Ten different structural elements with
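The dilation/erosion operation this hunk describes can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a flat (binary) structuring element and pixel values in $[0,1]$, for which taking the max/min over the pixels the element selects agrees with the multiply-then-max/min description in the text (`grey_morph` is a hypothetical name).

```python
import numpy as np

def grey_morph(img, se, op="dilate"):
    """Grey-scale dilation/erosion with a flat (binary) structuring
    element, for images with values in [0, 1]: each pixel is replaced by
    the max (dilation) or min (erosion) over the neighbours selected by
    the structuring element."""
    kh, kw = se.shape
    ph, pw = kh // 2, kw // 2
    # Pad so border pixels have full neighbourhoods; the pad value is
    # neutral for the chosen operation (0 for max, 1 for min).
    pad_val = 0.0 if op == "dilate" else 1.0
    padded = np.pad(img.astype(float), ((ph, ph), (pw, pw)),
                    constant_values=pad_val)
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            vals = padded[i:i + kh, j:j + kw][se > 0]
            out[i, j] = vals.max() if op == "dilate" else vals.min()
    return out
```

With a 3x3 cross-shaped element, dilating a single bright pixel spreads it to its four neighbours, while eroding it removes it, which is the thickening/thinning effect the transformation exploits.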
@@ -370,16 +352,16 @@
 \includegraphics[scale=.45]{images/Bruitgauss_only.PNG}
 \label{fig:Original}
 \vspace{.5cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Spatially Gaussian Noise.}
+{\bf Spatially Gaussian Smoothing.}
 Different regions of the image are spatially smoothed by convolving
-the image is convolved with a symmetric Gaussian kernel of
+the image with a symmetric Gaussian kernel of
 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
-between $0$ and $1$. We also create a symmetric averaging window, of the
+between $0$ and $1$. We also create a symmetric weighted averaging window, of the
 kernel size, with maximum value at the center. For each image we sample
 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
 averaging centers between the original image and the filtered one. We
 initialize to zero a mask matrix of the image size. For each selected pixel
 we add to the mask the averaging window centered to it. The final image is
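The steps in this hunk can be sketched in NumPy. This is an illustration under stated assumptions, not the paper's code: the hunk is cut off after "The final image is", so the final step, a pixel-wise blend of the original and filtered images weighted by the clipped mask, is an assumption consistent with the preceding text; the function names (`gaussian_kernel`, `local_smooth`) are hypothetical, and the Gaussian kernel itself, scaled to peak at 1, stands in for the weighted averaging window.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized symmetric 2-D Gaussian kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def local_smooth(img, complexity, rng):
    """Smooth randomly chosen regions of `img` (values in [0, 1])."""
    h, w = img.shape
    # Kernel size and std-dev drawn as in the text; size forced odd.
    size = int(rng.uniform(12, 12 + 20 * complexity)) | 1
    sigma = rng.uniform(2, 2 + 6 * complexity)
    kern = gaussian_kernel(size, sigma)
    ph = size // 2
    padded = np.pad(img, ph, mode="edge")
    smooth = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            smooth[i, j] = (padded[i:i + size, j:j + size] * kern).sum()
    # Normalize the filtered image between 0 and 1.
    smooth = (smooth - smooth.min()) / (smooth.max() - smooth.min() + 1e-8)
    # Symmetric weighted averaging window of the kernel size, max at center.
    win = kern / kern.max()
    # Sample averaging centers and accumulate the window into a mask,
    # cropping the window at the image borders.
    mask = np.zeros_like(img, dtype=float)
    n_centers = int(rng.uniform(3, 3 + 10 * complexity + 1))
    for _ in range(n_centers):
        ci, cj = int(rng.integers(0, h)), int(rng.integers(0, w))
        i0, j0 = max(0, ci - ph), max(0, cj - ph)
        i1, j1 = min(h, ci + ph + 1), min(w, cj + ph + 1)
        wi0, wj0 = i0 - (ci - ph), j0 - (cj - ph)
        mask[i0:i1, j0:j1] += win[wi0:wi0 + (i1 - i0), wj0:wj0 + (j1 - j0)]
    # Assumed final step: blend original and filtered images pixel-wise,
    # with the clipped mask as the filtered image's weight.
    mask = np.clip(mask, 0.0, 1.0)
    return mask * smooth + (1.0 - mask) * img
```

Because the mask is nonzero only near the sampled centers, the output equals the original image away from them and the Gaussian-filtered image at them, giving the locally smoothed regions the caption describes.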
@@ -399,11 +381,11 @@
 {\bf Scratches.}
 The scratches module places line-like white patches on the image. The
 lines are heavily transformed images of the digit ``1'' (one), chosen
 at random among 500 such 1 images,
 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
-complexity)^2$, using bi-cubic interpolation.
+complexity)^2)$ (in degrees), using bi-cubic interpolation.
 Two passes of a grey-scale morphological erosion filter
 are applied, reducing the width of the line
 by an amount controlled by $complexity$.
 This filter is skipped with probability 85\%. The probabilities
 of applying 1, 2, or 3 patches are (50\%,30\%,20\%).
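The sampling logic of the scratches module can be sketched as follows. This is a simplified illustration, not the paper's code: it uses nearest-neighbour rotation instead of the paper's bi-cubic interpolation, omits the random cropping and the erosion thinning pass, assumes the bank of "1" images is pre-cropped to the target image size, and composites the white patches by pixel-wise maximum (an assumption; `rotate_nn` and `apply_scratches` are hypothetical names).

```python
import numpy as np

def rotate_nn(img, deg):
    """Rotate about the image centre with nearest-neighbour sampling
    (the paper uses bi-cubic interpolation; this keeps the sketch short)."""
    h, w = img.shape
    t = np.deg2rad(deg)
    yy, xx = np.indices((h, w))
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Inverse mapping: for each output pixel, find its source coordinate.
    ys = cy + (yy - cy) * np.cos(t) - (xx - cx) * np.sin(t)
    xs = cx + (yy - cy) * np.sin(t) + (xx - cx) * np.cos(t)
    ys = np.clip(np.round(ys).astype(int), 0, h - 1)
    xs = np.clip(np.round(xs).astype(int), 0, w - 1)
    return img[ys, xs]

def apply_scratches(img, ones_bank, complexity, rng):
    """Composite 1-3 heavily rotated '1' images onto `img` as white
    line-like patches."""
    # 1, 2 or 3 patches with probabilities 50%, 30%, 20%.
    n = rng.choice([1, 2, 3], p=[0.5, 0.3, 0.2])
    out = img.copy()
    for _ in range(n):
        patch = ones_bank[rng.integers(len(ones_bank))]
        angle = rng.normal(0.0, 100.0 * complexity)  # std-dev in degrees
        # White patches: composite by taking the pixel-wise maximum
        # (assumes patches are pre-cropped to the image size).
        out = np.maximum(out, rotate_nn(patch, angle))
    return out
```

At low $complexity$ the rotation angles stay small and the scratches remain roughly vertical; at high $complexity$ the $100 \times complexity$ standard deviation makes the strokes land at essentially arbitrary orientations.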