comparison writeup/nips2010_submission.tex @ 523:c778d20ab6f8

space adjustments
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 16:06:32 -0400
parents d41926a68993
children 07bc0ca8d246
\usepackage{amsthm,amsmath,amssymb,bbold,bbm}
\usepackage{algorithm,algorithmic}
\usepackage[utf8]{inputenc}
\usepackage{graphicx,subfigure}
\usepackage[numbers]{natbib}

%\setlength\parindent{0mm}

\title{Deep Self-Taught Learning for Handwritten Character Recognition}
\author{The IFT6266 Gang}

\begin{document}

There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

\begin{figure}[ht]
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}}
% TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE
\caption{Illustration of each transformation applied alone to the same image
of an upper-case h (top left). First row (from left to right): original image, slant,
thickness, affine transformation (translation, rotation, shear),
local elastic deformation; second row (from left to right):
proportionally to its height: $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
and its value is randomly sampled according to the complexity level:
$slant \sim U[0,complexity]$, so the
maximum displacement for the lowest or highest pixel line is
$round(complexity \times 32)$.
\vspace*{0mm}

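The slant operation can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the pipeline's actual code: the function name is ours, and the wrap-around border handling of \texttt{np.roll} is an assumption (the original's border policy is not specified here).

```python
import numpy as np

def slant_image(image, complexity, rng=np.random):
    """Shift each pixel row horizontally, proportionally to its height:
    shift = round(slant * row_index), with slant ~ +/- U[0, complexity].
    Border handling (np.roll wrap-around) is an assumption."""
    slant = rng.uniform(0, complexity) * rng.choice([-1, 1])
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        shift = int(round(slant * y))
        out[y] = np.roll(image[y], shift)
    return out
```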
{\bf Thickness.}
Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
are applied. The neighborhood of each pixel is multiplied
element-wise with a {\em structuring element} matrix.
The pixel value is replaced by the maximum or the minimum of the resulting
element from a subset of the $n$ smallest structuring elements where $n$ is
$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
for erosion. A neutral element is always present in the set, and if it is
chosen no transformation is applied. Erosion allows only the six
smallest structuring elements because when the character is too thin it may
be completely erased.
\vspace*{0mm}

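The core morphological step can be sketched as follows, assuming non-negative pixel values (ink bright on dark). The function name and zero-padding at the borders are our assumptions; the original uses a library of structuring elements of increasing size.

```python
import numpy as np

def dilate(image, struct):
    """Grey-scale dilation: multiply each neighborhood element-wise with
    the structuring element and keep the maximum (erosion would keep the
    minimum). Borders are zero-padded (an assumption)."""
    h, w = image.shape
    sh, sw = struct.shape
    ph, pw = sh // 2, sw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), constant_values=0)
    out = np.empty_like(image)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + sh, x:x + sw]
            out[y, x] = (window * struct).max()
    return out
```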
{\bf Affine Transformations.}
A $2 \times 3$ affine transform matrix (with
6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
Each pixel $(x,y)$ of the output image takes the value of the pixel
nearest to $(ax+by+c,dx+ey+f)$ in the input image. This
The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
forbid large rotations (to avoid confusing classes) but to give good
variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
complexity]$.
\vspace*{0mm}

{\bf Local Elastic Deformations.}
This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
which provides more details.
Two ``displacement'' fields are generated and applied, for horizontal
and vertical displacements of pixels.
multiplied by a constant $\alpha$ which controls the intensity of the
displacements (larger $\alpha$ translates into larger wiggles).
Each field is convolved with a 2D Gaussian kernel of
standard deviation $\sigma$. Visually, this results in a blur.
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.
\vspace*{0mm}

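A minimal sketch of the Simard-style deformation, under several assumptions: separable Gaussian smoothing, uniform raw fields in $[-1,1]$, nearest-pixel lookup with clipping at the borders, and a kernel radius capped to the image size. None of these details are specified by the text above.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    g = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def elastic_deform(image, complexity, rng=np.random):
    """Two random displacement fields, Gaussian-smoothed and scaled by
    alpha = complexity^(1/3) * 10, sigma = 10 - 7 * complexity^(1/3)."""
    alpha = complexity ** (1 / 3.0) * 10.0
    sigma = 10 - 7 * complexity ** (1 / 3.0)
    h, w = image.shape
    radius = min(int(3 * sigma) + 1, (min(h, w) - 1) // 2)  # cap (assumption)
    g = gaussian_kernel(sigma, radius)
    def smooth(field):
        field = np.apply_along_axis(lambda r: np.convolve(r, g, 'same'), 1, field)
        return np.apply_along_axis(lambda c: np.convolve(c, g, 'same'), 0, field)
    dx = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    dy = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    ys, xs = np.indices((h, w))
    sx = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    sy = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    return image[sy, sx]
```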
{\bf Pinch.}
This is a GIMP filter called ``Whirl and
pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
For a square input image, this is akin to drawing a circle of
{\bf Motion Blur.}
This is a ``linear motion blur'' in GIMP
terminology, with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the first $length$ pixels
found by moving in the $angle$ direction.
Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
\vspace*{0mm}

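A direct (slow) sketch of this averaging; taking the absolute value of the normally sampled $length$ and rounding pixel steps are our assumptions, not GIMP's exact algorithm.

```python
import numpy as np

def motion_blur(image, complexity, rng=np.random):
    """Each pixel becomes the mean of the first `length` pixels met when
    stepping in the `angle` direction. length < 1 leaves the image as-is."""
    angle = np.deg2rad(rng.uniform(0, 360))
    length = int(abs(rng.normal(0, 3 * complexity)))  # abs(): assumption
    if length < 1:
        return image.copy()
    h, w = image.shape
    dy, dx = np.sin(angle), np.cos(angle)
    out = np.zeros_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            vals = []
            for k in range(length):
                yy, xx = int(round(y + k * dy)), int(round(x + k * dx))
                if 0 <= yy < h and 0 <= xx < w:
                    vals.append(image[yy, xx])
            out[y, x] = np.mean(vals) if vals else image[y, x]
    return out
```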
{\bf Occlusion.}
Selects a random rectangle from an {\em occluder} character
image and places it over the original {\em occluded} character
image. Pixels are combined by taking $\max(occluder, occluded)$,
i.e. the value closer to black. The rectangle corners
are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
This filter has a probability of 60\% of not being applied.
\vspace*{0mm}

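The max-combination can be sketched as below, assuming ink is encoded as high values. The rectangle-size and position sampling here is simplified to uniform draws; the paper samples positions from a normal distribution.

```python
import numpy as np

def occlude(occluded, occluder, complexity, rng=np.random):
    """Paste a random rectangle of the occluder over the occluded image,
    combining pixels with max (the value closer to black, assuming
    ink = high value). Uniform position sampling is a simplification."""
    h, w = occluded.shape
    rh = max(1, int(h * complexity * rng.uniform(0.2, 1.0)))
    rw = max(1, int(w * complexity * rng.uniform(0.2, 1.0)))
    sy, sx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    dy, dx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    out = occluded.copy()
    patch = occluder[sy:sy + rh, sx:sx + rw]
    out[dy:dy + rh, dx:dx + rw] = np.maximum(out[dy:dy + rh, dx:dx + rw], patch)
    return out
```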
{\bf Pixel Permutation.}
This filter permutes neighbouring pixels. It first selects a fraction
$\frac{complexity}{3}$ of the image pixels at random. Each of them is then
sequentially exchanged with another pixel in its $V4$ neighbourhood. The
numbers of exchanges to the left, right, top, and bottom are equal, or
differ by at most 1 when the number of selected pixels is not a multiple of 4.
This filter has a probability of 80\% of not being applied.
\vspace*{0mm}

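A simplified sketch of the swap: each selected pixel is exchanged with a uniformly chosen in-bounds 4-neighbour, rather than balancing the four directions exactly as the text describes. Function name and selection with replacement are our simplifications.

```python
import numpy as np

def permute_pixels(image, complexity, rng=np.random):
    """Swap a fraction complexity/3 of the pixels with a random V4
    (4-connected) neighbour. Direction balancing is omitted (assumption)."""
    out = image.copy()
    h, w = out.shape
    n = int(h * w * complexity / 3)
    for _ in range(n):
        y, x = rng.randint(0, h), rng.randint(0, w)
        moves = [(dy, dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= y + dy < h and 0 <= x + dx < w]
        dy, dx = moves[rng.randint(0, len(moves))]
        out[y, x], out[y + dy, x + dx] = out[y + dy, x + dx], out[y, x]
    return out
```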
{\bf Gaussian Noise.}
This filter simply adds, to each pixel of the image independently, a
noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
It has a probability of 70\% of not being applied.
\vspace*{0mm}

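This one is a one-liner plus clipping; clipping to $[0,1]$ is our assumption about the pixel range, not stated above.

```python
import numpy as np

def add_gaussian_noise(image, complexity, rng=np.random):
    """Add i.i.d. noise ~ Normal(0, (complexity/10)^2) to every pixel;
    the [0, 1] clipping is an assumption about the pixel range."""
    noisy = image + rng.normal(0.0, complexity / 10.0, image.shape)
    return np.clip(noisy, 0.0, 1.0)
```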
{\bf Background Images.}
Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
background behind the letter. The background is chosen by first selecting,
at random, an image from a set of images. Then a 32$\times$32 sub-region
of that image is chosen as the background image (by sampling position
adjustments are made. We first get the maximal values (i.e. maximal
intensity) for both the original image and the background image, $maximage$
and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
Each background pixel value is multiplied by $\frac{\max(maximage -
contrast, 0)}{maxbg}$ (a higher $contrast$ yields a darker
background). The output image pixels are $\max(background, original)$.
\vspace*{0mm}

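The blending step can be sketched directly from the formulas above; selecting the background patch itself is omitted, and the function name is ours.

```python
import numpy as np

def add_background(image, background, complexity, rng=np.random):
    """Scale the background by max(maximage - contrast, 0) / maxbg with
    contrast ~ U[complexity, 1], then combine with an element-wise max."""
    contrast = rng.uniform(complexity, 1.0)
    maximage = image.max()
    maxbg = background.max()
    bg = background * max(maximage - contrast, 0.0) / maxbg
    return np.maximum(bg, image)
```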
{\bf Salt and Pepper Noise.}
This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
The fraction of selected pixels is $0.2 \times complexity$.
This filter has a probability of 75\% of not being applied.
\vspace*{0mm}

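A sketch under one notable assumption: the selected pixels are *replaced* by $U[0,1]$ values (the usual salt-and-pepper convention), whereas the text literally says the noise is added.

```python
import numpy as np

def salt_and_pepper(image, complexity, rng=np.random):
    """Replace a fraction 0.2 * complexity of the pixels by noise ~ U[0,1].
    Replacement (rather than addition) is an assumption."""
    out = image.copy()
    h, w = out.shape
    n = int(h * w * 0.2 * complexity)
    idx = rng.choice(h * w, size=n, replace=False)
    out.flat[idx] = rng.uniform(0.0, 1.0, size=n)
    return out
```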
{\bf Spatially Gaussian Noise.}
Different regions of the image are spatially smoothed.
The image is convolved with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12,12 + 20 \times
complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
averaging centers between the original image and the filtered one. We
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation: $\frac{image + filtered\_image
\times mask}{mask+1}$.
This filter has a probability of 75\% of not being applied.
\vspace*{0mm}

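The final element-wise blend is worth making concrete: where the accumulated mask is 0 the original pixel is kept, and larger mask values weight the filtered image more heavily. The function name is ours.

```python
import numpy as np

def blend_with_mask(image, filtered, mask):
    """Element-wise (image + filtered * mask) / (mask + 1): mask == 0
    keeps the original pixel; large mask values approach the filtered one."""
    return (image + filtered * mask) / (mask + 1.0)
```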
{\bf Scratches.}
The scratches module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among five thousand such 1 images. The 1 image is
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
by an amount controlled by $complexity$.
This filter is applied only 15\% of the time. When it is applied, 50\%
of the time, only one patch image is generated and applied. In 30\% of
cases, two patches are generated, and otherwise three patches are
generated. Each patch is applied by taking, at each of the $32 \times 32$
pixel locations, the maximal value of the patch or the original image.
\vspace*{0mm}

{\bf Grey Level and Contrast Changes.}
This filter changes the contrast and may invert the image polarity (white
on black to black on white). The contrast $C$ is defined here as the
difference between the maximum and the minimum pixel value of the image.
Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
polarity is inverted with $0.5$ probability.

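The contrast change maps the image's value range onto the interval $[\frac{1-C}{2},\,1-\frac{1-C}{2}]$; a sketch (function name ours, non-constant input assumed):

```python
import numpy as np

def change_contrast(image, complexity, rng=np.random):
    """Sample C ~ U[1 - 0.85*complexity, 1], rescale the image into
    [(1-C)/2, 1 - (1-C)/2], and invert polarity with probability 0.5."""
    C = rng.uniform(1 - 0.85 * complexity, 1.0)
    lo, hi = (1 - C) / 2.0, 1 - (1 - C) / 2.0
    mn, mx = image.min(), image.max()          # assumes mx > mn
    out = lo + (image - mn) / (mx - mn) * (hi - lo)
    if rng.uniform(0, 1) < 0.5:
        out = 1.0 - out                        # polarity inversion
    return out
```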
\iffalse
\begin{figure}[ht]
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}}\\
\caption{Illustration of the pipeline of stochastic
transformations applied to the image of a lower-case \emph{t}
(the upper left image). Each image in the pipeline (going from
left to right, first top line, then bottom line) shows the result
of applying one of the modules in the pipeline. The last image
examples are presented in minibatches of size 20, a constant learning
rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ was then selected.

\begin{figure}[ht]
\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
\caption{Illustration of the computations and training criterion for the denoising
auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
the layer (i.e. raw input or output of previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
is compared to the uncorrupted input $x$ through the loss function
$L_H(x,z)$, whose expected value is approximately minimized during training
by tuning $\theta$ and $\theta'$.}
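The caption's computation path can be sketched in numpy. Masking corruption, sigmoid units, tied weights ($W' = W^T$), and the layer sizes are illustrative assumptions; only the $x \to \tilde{x} \to y \to z \to L_H(x,z)$ structure is taken from the text.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One denoising auto-encoder layer: corrupt x into x~, encode into y,
# decode into z, and compare z with the *uncorrupted* x via cross-entropy.
n_in, n_hid = 32 * 32, 500                     # illustrative sizes
W = rng.uniform(-0.01, 0.01, (n_in, n_hid))
b, b_prime = np.zeros(n_hid), np.zeros(n_in)

def forward(x, corruption=0.25):
    x_tilde = x * (rng.uniform(size=x.shape) > corruption)  # masking noise
    y = sigmoid(x_tilde @ W + b)               # encoder f_theta
    z = sigmoid(y @ W.T + b_prime)             # decoder g_theta' (tied W)
    eps = 1e-12
    L_H = -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
    return z, L_H
```

Training would minimize the expected $L_H$ by gradient descent on $\theta = (W, b)$ and $\theta' = (b')$, which is omitted here.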
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

\vspace*{-1mm}

\begin{figure}[ht]
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on 3 different test sets corresponding to the three
datasets.
Right: error rates on NIST test digits only, along with the previous results from
literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
respectively based on ART, nearest neighbors, MLPs, and SVMs.}

\label{fig:error-rates-charts}
\vspace*{-1mm}
\end{figure}


\section{Experimental Results}

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}
found in Appendix I of the supplementary material. The 3 kinds of model differ in the
training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07
(MLP2, SDA2). The deep learner not only outperformed the shallow ones and
previously published performance (in a statistically and qualitatively
significant way) but reaches human performance on both the 62-class task
and the 10-class (digits) task.

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
upper case, or lower-case). The deep learner (SDA) benefits more from
both self-taught learning scenarios, compared to the shallow MLP.}
\label{fig:improvements-charts}
\vspace*{-2mm}
\end{figure}

In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by self-taught learning is greater for the SDA, and these
differences with the MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative change is measured by taking
(original model's error / perturbed-data model's error $- 1$).
The right side of
Figure~\ref{fig:improvements-charts} shows the relative improvement
brought by the use of a multi-task setting, in which the same model is
trained for more classes than the target classes of interest (i.e. training
with all 62 classes when the target classes are respectively the digits,
lower-case, or upper-case characters). Again, whereas the gain from the
multi-task setting is marginal or negative for the MLP, it is substantial
comparing the correct digit class with the output class associated with the
maximum conditional probability among only the digit classes outputs. The
setting is similar for the other two target classes (lower case characters
and upper case characters).

%\vspace*{-1mm}
%\subsection{Perturbed Training Data More Helpful for SDA}
%\vspace*{-1mm}

%\vspace*{-1mm}
error rate improvements of 27\%, 15\% and 13\% respectively for digits,
lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
\fi


\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial