comparison writeup/nips2010_submission.tex @ 554:e95395f51d72

minor
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 02 Jun 2010 18:17:52 -0400
parents 8f6c09d1140f
children b6dfba0a110c
comparison
553:8f6c09d1140f 554:e95395f51d72
18 %\makeanontitle 18 %\makeanontitle
19 \maketitle 19 \maketitle
20 20
21 \vspace*{-2mm} 21 \vspace*{-2mm}
22 \begin{abstract} 22 \begin{abstract}
23 Recent theoretical and empirical work in statistical machine learning has 23 Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by {\em out-of-distribution examples} and show that {\em deep learners benefit more from them than a corresponding shallow learner}, in the area of handwritten character recognition. In fact, we show that they reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set.
24 demonstrated the importance of learning algorithms for deep
25 architectures, i.e., function classes obtained by composing multiple
26 non-linear transformations. Self-taught learning (exploiting unlabeled
27 examples or examples from other distributions) has already been applied
28 to deep learners, but mostly to show the advantage of unlabeled
29 examples. Here we explore the advantage brought by {\em out-of-distribution
30 examples} and show that {\em deep learners benefit more from them than a
31 corresponding shallow learner}, in the area
32 of handwritten character recognition. In fact, we show that they reach
33 human-level performance on both handwritten digit classification and
34 62-class handwritten character recognition. For this purpose we
35 developed a powerful generator of stochastic variations and noise
36 processes for character images, including not only affine transformations but
37 also slant, local elastic deformations, changes in thickness, background
38 images, grey level changes, contrast, occlusion, and various types of
39 noise. The out-of-distribution examples are
40 obtained from these highly distorted images or
41 by including examples of object classes different from those in the target test set.
42 \end{abstract} 24 \end{abstract}
43 \vspace*{-2mm} 25 \vspace*{-2mm}
44 26
45 \section{Introduction} 27 \section{Introduction}
46 \vspace*{-1mm} 28 \vspace*{-1mm}
181 163
182 164
183 \begin{minipage}[b]{0.14\linewidth} 165 \begin{minipage}[b]{0.14\linewidth}
184 \centering 166 \centering
185 \includegraphics[scale=.45]{images/Thick_only.PNG} 167 \includegraphics[scale=.45]{images/Thick_only.PNG}
186 \label{fig:Think} 168 \label{fig:Thick}
187 \vspace{.9cm} 169 \vspace{.9cm}
188 \end{minipage}% 170 \end{minipage}%
189 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} 171 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
190 {\bf Thinkness.} 172 {\bf Thickness.}
191 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82} 173 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
192 are applied. The neighborhood of each pixel is multiplied 174 are applied. The neighborhood of each pixel is multiplied
193 element-wise with a {\em structuring element} matrix. 175 element-wise with a {\em structuring element} matrix.
194 The pixel value is replaced by the maximum or the minimum of the resulting 176 The pixel value is replaced by the maximum or the minimum of the resulting
195 matrix, respectively for dilation or erosion. Ten different structuring elements with 177 matrix, respectively for dilation or erosion. Ten different structuring elements with
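
For readers who want to experiment with the thickness transformation described in this hunk, a minimal Python sketch is given below. It assumes NumPy/SciPy and flat square structuring elements; the ten elements actually used by the generator, and the way $complexity$ selects among them, are not specified in this excerpt, so those choices (and the function name change_thickness) are placeholders rather than the authors' code.

import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def change_thickness(image, rng, complexity=0.5):
    """Thicken (dilate) or thin (erode) strokes with a flat structuring
    element.  The square elements and the rule tying their size to
    `complexity` are placeholders, not the generator's actual choices."""
    # Hypothetical rule: larger elements become possible as complexity grows.
    max_size = 2 + int(3 * complexity)               # element side in {2,...,5}
    size = int(rng.integers(2, max_size + 1))
    footprint = np.ones((size, size), dtype=bool)    # flat square element
    if rng.random() < 0.5:
        return grey_dilation(image, footprint=footprint)   # thicker strokes
    return grey_erosion(image, footprint=footprint)         # thinner strokes

# Usage on a dummy 32x32 image with ink encoded as high grey levels.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
out = change_thickness(img, rng, complexity=0.3)

With a flat (binary) structuring element, taking the maximum or minimum over the covered neighborhood is the standard grey-scale dilation/erosion, which matches the max/min rule described in the text.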
370 \includegraphics[scale=.45]{images/Bruitgauss_only.PNG} 352 \includegraphics[scale=.45]{images/Bruitgauss_only.PNG}
371 \label{fig:Original} 353 \label{fig:Original}
372 \vspace{.5cm} 354 \vspace{.5cm}
373 \end{minipage}% 355 \end{minipage}%
374 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth} 356 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
375 {\bf Spatially Gaussian Noise.} 357 {\bf Spatially Gaussian Smoothing.}
376 Different regions of the image are spatially smoothed by convolving 358 Different regions of the image are spatially smoothed by convolving
377 the image is convolved with a symmetric Gaussian kernel of 359 the image with a symmetric Gaussian kernel of
378 size and variance chosen uniformly in the ranges $[12,12 + 20 \times 360 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
379 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized 361 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
380 between $0$ and $1$. We also create a symmetric averaging window, of the 362 between $0$ and $1$. We also create a symmetric weighted averaging window, of the
381 kernel size, with maximum value at the center. For each image we sample 363 kernel size, with maximum value at the center. For each image we sample
382 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be 364 uniformly from $3$ to $3 + 10 \times complexity$ pixels that will be
383 averaging centers between the original image and the filtered one. We 365 averaging centers between the original image and the filtered one. We
384 initialize to zero a mask matrix of the image size. For each selected pixel 366 initialize to zero a mask matrix of the image size. For each selected pixel
385 we add to the mask the averaging window centered to it. The final image is 367 we add to the mask the averaging window centered to it. The final image is
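
A rough sketch of the spatially localized smoothing described above, under the same NumPy/SciPy assumption. scipy's gaussian_filter sizes its kernel from sigma rather than from an explicit window, so the sampled kernel size is reused only for the averaging window; the Gaussian-bump window and the mask-based blend at the end are guesses at details the excerpt leaves open (the hunk stops before the final step), not the project's actual implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def local_gaussian_smoothing(image, rng, complexity=0.5):
    """Smooth a few randomly chosen regions of `image` (values in [0, 1])."""
    h, w = image.shape
    # Globally blurred copy; variance drawn from [2, 2 + 6*complexity],
    # result rescaled to [0, 1].
    sigma = np.sqrt(rng.uniform(2.0, 2.0 + 6.0 * complexity))
    blurred = gaussian_filter(image, sigma=sigma)
    blurred = (blurred - blurred.min()) / (blurred.max() - blurred.min() + 1e-8)
    # Symmetric weighted averaging window with its maximum at the centre
    # (a Gaussian bump as a stand-in); odd size in [12, 12 + 20*complexity].
    win = int(rng.uniform(12, 12 + 20 * complexity)) | 1
    ax = np.arange(win) - win // 2
    bump = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * (win / 4.0) ** 2))
    # Accumulate windows centred on 3 to 3 + 10*complexity random pixels.
    mask = np.zeros((h, w))
    n_centres = int(rng.integers(3, 3 + int(10 * complexity) + 1))
    for _ in range(n_centres):
        cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
        y0, x0 = max(cy - win // 2, 0), max(cx - win // 2, 0)
        y1, x1 = min(cy + win // 2 + 1, h), min(cx + win // 2 + 1, w)
        by0, bx0 = y0 - (cy - win // 2), x0 - (cx - win // 2)
        mask[y0:y1, x0:x1] += bump[by0:by0 + (y1 - y0), bx0:bx0 + (x1 - x0)]
    mask = np.clip(mask, 0.0, 1.0)
    # Blend: smoothed where the mask is strong, untouched elsewhere.
    return mask * blurred + (1.0 - mask) * image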
399 {\bf Scratches.} 381 {\bf Scratches.}
400 The scratches module places line-like white patches on the image. The 382 The scratches module places line-like white patches on the image. The
401 lines are heavily transformed images of the digit ``1'' (one), chosen 383 lines are heavily transformed images of the digit ``1'' (one), chosen
402 at random among 500 such 1 images, 384 at random among 500 such 1 images,
403 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times 385 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
404 complexity)^2)$, using bi-cubic interpolation. 386 complexity)^2)$ (in degrees), using bi-cubic interpolation.
405 Two passes of a grey-scale morphological erosion filter 387 Two passes of a grey-scale morphological erosion filter
406 are applied, reducing the width of the line 388 are applied, reducing the width of the line
407 by an amount controlled by $complexity$. 389 by an amount controlled by $complexity$.
408 This filter is skipped with probability 85\%. The probabilities 390 This filter is skipped with probability 85\%. The probabilities
409 of applying 1, 2, or 3 patches are (50\%,30\%,20\%). 391 of applying 1, 2, or 3 patches are (50\%,30\%,20\%).
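
The scratches module can likewise be sketched as follows, again assuming NumPy/SciPy. Here ones_pool stands in for the 500 transformed images of the digit 1 and is assumed to contain patches already matching the image size; the random crop and the rule for compositing the white patches onto the image (an element-wise max here) are assumptions, not the authors' code.

import numpy as np
from scipy.ndimage import rotate, grey_erosion

def add_scratches(image, ones_pool, rng, complexity=0.5):
    """Overlay 1-3 line-like white patches built from images of the digit 1."""
    out = image.copy()
    # 1, 2 or 3 patches with probabilities 50%, 30%, 20%.
    n_patches = int(rng.choice([1, 2, 3], p=[0.5, 0.3, 0.2]))
    for _ in range(n_patches):
        patch = ones_pool[int(rng.integers(len(ones_pool)))].copy()
        # Rotate by an angle ~ Normal(0, (100*complexity)^2) degrees,
        # bi-cubic interpolation (order=3); the random crop is omitted here.
        angle = rng.normal(0.0, 100.0 * complexity)
        patch = rotate(patch, angle, reshape=False, order=3, mode='constant')
        # The thinning filter is skipped with probability 85%; otherwise two
        # passes of grey-scale erosion narrow the line.
        if rng.random() < 0.15:
            footprint = np.ones((2, 2), dtype=bool)
            for _ in range(2):
                patch = grey_erosion(patch, footprint=footprint)
        # Scratches are white: composite with an element-wise max (assumed).
        out = np.maximum(out, np.clip(patch, 0.0, 1.0))
    return out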