# HG changeset patch
# User Yoshua Bengio
# Date 1275514843 14400
# Node ID 8f6c09d1140f842b1e78a80bf44292a886ba2fb8
# Parent  35c611363291ea4356acf893e8bc356a2209e9db
it fits again

diff -r 35c611363291 -r 8f6c09d1140f writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Wed Jun 02 17:28:43 2010 -0400
+++ b/writeup/nips2010_submission.tex	Wed Jun 02 17:40:43 2010 -0400
@@ -133,8 +133,6 @@
 \vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
 \label{s:perturbations}
-{\large\bf Transformations}
-
 \vspace*{-1mm}
 
 \begin{minipage}[b]{0.14\linewidth}
@@ -144,9 +142,10 @@
 \vspace{1.2cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Original:}
+{\bf Original.}
 This section describes the different transformations we used to stochastically
-transform source images in order to obtain data from a larger distribution which
+transform source images such as the one on the left
+in order to obtain data from a larger distribution which
 covers a domain substantially larger than the clean characters distribution from
 which we start. Although character transformations have been used before to
 improve character recognizers, this effort is on a large scale both
@@ -158,12 +157,13 @@
 {\tt http://anonymous.url.net}.
 All the modules in the pipeline share a global control parameter ($0 \le complexity \le 1$)
 that allows one to modulate the amount of deformation or noise introduced.
-
 There are two main parts in the pipeline. The first one,
 from slant to pinch below, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
 \end{minipage}
 
+{\large\bf Transformations}
+
 \begin{minipage}[b]{0.14\linewidth}
 \centering
@@ -172,7 +172,7 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
 %\centering
-{\bf Slant:}
+{\bf Slant.}
 Each row of the image is shifted proportionally to its height:
 $shift = round(slant \times height)$, with
 $slant \sim U[-complexity,complexity]$.
@@ -187,7 +187,7 @@
 \vspace{.9cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Thinkness:}
+{\bf Thickness.}
 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
 are applied. The neighborhood of each pixel is multiplied
 element-wise with a {\em structuring element} matrix.
@@ -210,7 +210,7 @@
 \label{fig:Affine}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Affine Transformations:}
+{\bf Affine Transformations.}
 A $2 \times 3$ affine transform matrix (with 6 parameters $(a,b,c,d,e,f)$)
 is sampled according to the $complexity$ level.
 Output pixel $(x,y)$ takes the value of input pixel
@@ -230,7 +230,7 @@
 \label{fig:Elastic}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Local Elastic Deformations:}
+{\bf Local Elastic Deformations.}
 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
 which provides more details.
 The intensity of the displacement fields is given by
@@ -248,7 +248,7 @@
 \vspace{.6cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Pinch:}
+{\bf Pinch.}
 This is the ``Whirl and pinch'' GIMP filter, but with whirl set to 0.
 A pinch is ``similar to projecting the image onto an elastic surface and
 pressing or pulling on the center of the surface'' (GIMP documentation manual).
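The slant module above is simple enough to spell out in full. Below is a minimal NumPy sketch of it, under stated assumptions: the function name is ours, a row's index stands in for its ``height'', and out-of-range pixels wrap around (the actual pipeline may instead pad with background); this is an illustration, not the authors' pipeline code.

import numpy as np

def slant_image(image, complexity, rng=np.random):
    """Shift each row of `image` horizontally in proportion to its row index."""
    s = rng.uniform(-complexity, complexity)  # slant ~ U[-complexity, complexity]
    out = np.empty_like(image)
    for y in range(image.shape[0]):
        shift = int(round(s * y))             # shift = round(slant * height)
        out[y] = np.roll(image[y], shift)     # assumption: wrap-around shift
    return out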
@@ -277,7 +277,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Motion Blur:}
+{\bf Motion Blur.}
 This is GIMP's ``linear motion blur'' with parameters $length$ and $angle$.
 The value of a pixel in the final image is approximately the mean value of
 the first $length$ pixels
@@ -294,7 +294,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Occlusion:}
+{\bf Occlusion.}
 Selects a random rectangle from an {\em occluder} character
 image and places it over the original {\em occluded} image.
 Pixels are combined by taking the max(occluder,occluded),
@@ -313,7 +313,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Pixel Permutation:}
+{\bf Pixel Permutation.}
 This filter permutes neighbouring pixels. It first selects a fraction
 $\frac{complexity}{3}$ of pixels randomly in the image. Each of them is then
 sequentially exchanged with another pixel in its $V4$ neighbourhood.
@@ -328,7 +328,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Gaussian Noise:}
+{\bf Gaussian Noise.}
 This filter simply adds, to each pixel of the image independently, a
 noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
 This filter is skipped with probability 70\%.
@@ -342,7 +342,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Background Images:}
+{\bf Background Images.}
 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
 background behind the letter, from a randomly chosen natural image,
 with contrast adjustments depending on $complexity$, to preserve
@@ -357,7 +357,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Salt and Pepper Noise:}
+{\bf Salt and Pepper Noise.}
 This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
 The fraction of selected pixels is $0.2 \times complexity$.
 This filter is skipped with probability 75\%.
@@ -372,7 +372,7 @@
 \vspace{.5cm}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Spatially Gaussian Noise:}
+{\bf Spatially Gaussian Noise.}
 Different regions of the image are spatially smoothed by convolving
 the image with a symmetric Gaussian kernel of
 size and variance chosen uniformly in the ranges $[12,12 + 20 \times
@@ -396,7 +396,7 @@
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
 \vspace{.4cm}
-{\bf Scratches:}
+{\bf Scratches.}
 The scratches module places line-like white patches on the image. The
 lines are heavily transformed images of the digit ``1'' (one), chosen
 at random among 500 such ``1'' images,
@@ -416,7 +416,7 @@
 \label{fig:Original}
 \end{minipage}%
 \hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
-{\bf Grey Level and Contrast Changes:}
+{\bf Grey Level and Contrast Changes.}
 This filter changes the contrast and may invert the image polarity (white
 to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$
 so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
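As an example of the noise modules, here is a hedged sketch of the salt-and-pepper filter just described. Reading $0.2 \times complexity$ as the fraction of affected pixels, and replacing (rather than adding to) the selected values, are our interpretations of the text; the function is illustrative, not the paper's code.

import numpy as np

def salt_and_pepper(image, complexity, rng=np.random):
    """Replace a random subset of pixels with values drawn from U[0, 1]."""
    if rng.uniform() < 0.75:                   # module skipped with probability 75%
        return image
    out = image.copy()
    flat = out.ravel()                         # flat view into the copy
    n = int(0.2 * complexity * flat.size)      # assumed fraction of pixels affected
    idx = rng.choice(flat.size, size=n, replace=False)
    flat[idx] = rng.uniform(0.0, 1.0, size=n)  # noise ~ U[0, 1]
    return out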
The
@@ -440,16 +440,16 @@
 \fi
 
-\vspace*{-1mm}
+\vspace*{-2mm}
 \section{Experimental Setup}
 \vspace*{-1mm}
 
-Whereas much previous work on deep learning algorithms had been performed on
-the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
+Much previous work on deep learning had been performed on
+the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
 with 60~000 examples, and variants involving 10~000
-examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
-to focus here on the case of much larger training sets, from 10 times to
-to 1000 times larger.
+examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
+The focus here is on much larger training sets, from 10 times
+to 1000 times larger, and with 62 classes.
 The first step in constructing the larger datasets (called NISTP and P07)
 is to sample from a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
@@ -470,17 +470,17 @@
 %processing \citep{SnowEtAl2008} and vision
 %\citep{SorokinAndForsyth2008,whitehill09}.
 AMT users were presented
-with 10 character images and asked to choose 10 corresponding ASCII
+with 10 character images (from a test set) and asked to choose 10 corresponding ASCII
 characters. They were forced to make a hard choice among the
 62 or 10 character classes (all classes or digits only).
-A total 2500 images/dataset were classified by XXX subjects,
-with 3 subjects classifying each image, allowing
+80 subjects classified 2500 images per (dataset,task) pair,
+with the guarantee that 3 different subjects classified each image, allowing
 us to estimate inter-human variability (e.g.\ a standard error of 0.1\%
-on the average 18\% error done by humans on the 62-class task).
+on the average 18.2\% error made by humans on the 62-class task of the NIST test set).
 
-\vspace*{-1mm}
+\vspace*{-3mm}
 \subsection{Data Sources}
-\vspace*{-1mm}
+\vspace*{-2mm}
 
 %\begin{itemize}
 %\item
@@ -499,10 +499,10 @@
 The performances reported by previous work on that dataset mostly use only the digits.
 Here we use all the classes both in the training and testing phase. This is
 especially useful to estimate the effect of a multi-task setting.
-Note that the distribution of the classes in the NIST training and test sets differs
-substantially, with relatively many more digits in the test set, and more uniform distribution
-of letters in the test set, compared to the training set (in the latter, the letters are distributed
-more like the natural distribution of letters in text).
+The distribution of the classes in the NIST training and test sets differs
+substantially, with relatively many more digits in the test set, and a more uniform distribution
+of letters in the test set (whereas in the training set the letters are distributed
+more like in natural text).
 
 \vspace*{-1mm}
 %\item
@@ -529,16 +529,16 @@
 %\item
 {\bf OCR data.}
 A large set (2 million) of scanned, OCRed and manually verified machine-printed
-characters (from various documents and books) where included as an
+characters were included as an
 additional source. This set is part of a larger corpus being collected by the Image Understanding
 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
 ({\tt http://www.iupr.com}), and which will be publicly released.
 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
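The text does not spell out how the quoted inter-human standard error is estimated. One plausible estimator, treating each (image, subject) judgement as a Bernoulli trial, is sketched below; the function and the pooling over judgements are our assumptions, not the authors' analysis code, and pooling ignores per-image correlation between the 3 subjects.

import numpy as np

def error_rate_and_se(correct):
    """Mean error rate and its binomial standard error over all judgements.

    `correct` is a boolean array with one entry per (image, subject) pair,
    e.g. 2500 images x 3 subjects = 7500 judgements per (dataset, task).
    """
    correct = np.asarray(correct, dtype=float)
    p_err = 1.0 - correct.mean()                        # average human error
    se = np.sqrt(p_err * (1.0 - p_err) / correct.size)  # binomial standard error
    return p_err, se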
 %\end{itemize}
 
-\vspace*{-1mm}
+\vspace*{-3mm}
 \subsection{Data Sets}
-\vspace*{-1mm}
+\vspace*{-2mm}
 
 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$)
 associated with a label from one of the 62 character classes.
@@ -568,13 +568,13 @@
 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 %\end{itemize}
 
-\vspace*{-1mm}
+\vspace*{-3mm}
 \subsection{Models and their Hyperparameters}
-\vspace*{-1mm}
+\vspace*{-2mm}
 
 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
 hidden layer and with Stacked Denoising Auto-Encoders (SDA).
-\emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.}
+\emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
 
 {\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep architectures to both shallow MLPs and
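The section cuts off here, but since the SDA is the paper's main model, a compact sketch of the kind of denoising auto-encoder layer that gets stacked (following Vincent et al., 2008, cited above as VincentPLarochelleH2008) may help. The sizes, learning rate, masking-noise corruption level, tied weights and plain per-example SGD step are illustrative assumptions, not the paper's hyper-parameters.

import numpy as np

rng = np.random.RandomState(0)
n_in, n_hid, lr = 32 * 32, 500, 0.05           # assumed sizes and learning rate
W = rng.uniform(-0.1, 0.1, (n_in, n_hid))      # tied encoder/decoder weights
b_h, b_v = np.zeros(n_hid), np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, corruption=0.25):
    """One SGD step of denoising reconstruction on one example x in [0,1]^n_in."""
    global W, b_h, b_v
    x_tilde = x * (rng.uniform(size=x.shape) > corruption)  # masking noise
    h = sigmoid(x_tilde @ W + b_h)                          # encode corrupted input
    z = sigmoid(h @ W.T + b_v)                              # reconstruct clean input
    dz = z - x                       # cross-entropy gradient w.r.t. pre-sigmoid output
    dh = (dz @ W) * h * (1.0 - h)    # backprop through decoder into hidden layer
    W -= lr * (np.outer(dz, h) + np.outer(x_tilde, dh))
    b_v -= lr * dz
    b_h -= lr * dh

After pre-training each layer this way on the previous layer's outputs, the stack is topped with a supervised layer and fine-tuned by gradient descent, which is the usual SDA recipe the paper builds on.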