comparison writeup/nips2010_submission.tex @ 476:db28764b8252

Merge
author fsavard
date Sun, 30 May 2010 12:06:45 -0400
parents ead3085c1c66 bcf024e6ab23
children 6593e67381a3
There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

{\large\bf Transformations}\\
{\bf Slant.}
We mimic slant by shifting each row of the image
proportionally to its height: $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
and its value is randomly sampled according to the complexity level:
$slant \sim U[0,complexity]$, so the
maximum displacement for the lowest or highest pixel line is
$round(complexity \times 32)$.\\
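A minimal NumPy sketch of this row-shifting step (illustrative only: the
function name, the use of the row index as ``height'' and the wrap-around
shift are our assumptions, not the pipeline's actual code):
\begin{verbatim}
import numpy as np

def apply_slant(image, complexity, rng=np.random):
    # slant magnitude ~ U[0, complexity], with a random sign
    slant = rng.uniform(0, complexity) * rng.choice([-1, 1])
    out = np.empty_like(image)
    for row in range(image.shape[0]):
        # shift each row proportionally to its height (row index);
        # np.roll wraps around, the real pipeline may pad instead
        out[row] = np.roll(image[row], int(round(slant * row)))
    return out
\end{verbatim}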
{\bf Thickness.}
Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
are applied. The neighborhood of each pixel is multiplied
element-wise with a {\em structuring element} matrix.
The pixel value is replaced by the maximum or the minimum of the resulting
matrix, respectively for dilation or erosion. Ten different structural elements with
[\ldots]
$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
for erosion. A neutral element is always present in the set, and if it is
chosen no transformation is applied. Erosion allows only the six
smallest structural elements because when the character is too thin it may
be completely erased.\\
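A sketch of the dilation/erosion step, read as standard grey-scale morphology
(maximum or minimum over the support of the structuring element); the actual
element shapes used by the pipeline are not reproduced here:
\begin{verbatim}
import numpy as np

def morph(image, selem, dilate=True):
    # selem: square 0/1 structuring element; dilation takes the max over
    # its support in each neighbourhood, erosion takes the min
    k = selem.shape[0]
    pad = k // 2
    padded = np.pad(image, pad, mode='constant')
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + k, j:j + k]
            vals = patch[selem > 0]
            out[i, j] = vals.max() if dilate else vals.min()
    return out
\end{verbatim}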
{\bf Affine Transformations.}
A $2 \times 3$ affine transform matrix (with
6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
Each pixel $(x,y)$ of the output image takes the value of the pixel
nearest to $(ax+by+c,dx+ey+f)$ in the input image. This
produces scaling, translation, rotation and shearing.
[\ldots]
forbid important rotations (so as not to confuse classes) but to give good
variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
complexity]$.\\
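The nearest-neighbour resampling described above can be sketched as follows
(the $(x,y)$-to-array-index convention and the zero background for
out-of-range sources are our assumptions):
\begin{verbatim}
import numpy as np

def affine_nearest(image, a, b, c, d, e, f):
    # output pixel (x, y) takes the value of the input pixel nearest to
    # (a*x + b*y + c, d*x + e*y + f)
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx = int(round(a * x + b * y + c))
            sy = int(round(d * x + e * y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
\end{verbatim}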
{\bf Local Elastic Deformations.}
This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03},
which provides more details.
Two ``displacement'' fields are generated and applied, for horizontal
and vertical displacements of pixels.
To generate a pixel in either field, first a value between -1 and 1 is
[\ldots]
displacements (larger $\alpha$ translates into larger wiggles).
Each field is convolved with a 2D Gaussian kernel of
standard deviation $\sigma$. Visually, this results in a blur.
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.\\
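A compact sketch of this elastic deformation, assuming the random fields are
scaled by $\alpha$ after the Gaussian smoothing and applied with bilinear
interpolation (both assumptions on our part):
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, complexity, rng=np.random):
    alpha = 10.0 * complexity ** (1.0 / 3.0)
    sigma = 10.0 - 7.0 * complexity ** (1.0 / 3.0)
    # random displacement fields in [-1, 1], smoothed then scaled
    dx = alpha * gaussian_filter(rng.uniform(-1, 1, image.shape), sigma)
    dy = alpha * gaussian_filter(rng.uniform(-1, 1, image.shape), sigma)
    yy, xx = np.meshgrid(np.arange(image.shape[0]),
                         np.arange(image.shape[1]), indexing='ij')
    coords = np.array([yy + dy, xx + dx])
    return map_coordinates(image, coords, order=1, mode='constant')
\end{verbatim}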
{\bf Pinch.}
This GIMP filter is named ``Whirl and
pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
For a square input image, think of drawing a circle of
radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
[\ldots]
The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.
Here $pinch \sim U[-complexity, 0.7 \times complexity]$.\\

{\large\bf Injecting Noise}\\
{\bf Motion Blur.}
This is GIMP's ``linear motion blur'' filter,
with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the first $length$ pixels
found by moving in the $angle$ direction.
Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.\\
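GIMP's implementation is not reproduced here, but the averaging it performs
can be approximated as follows (the boundary handling is our choice):
\begin{verbatim}
import numpy as np

def motion_blur(image, length, angle_deg):
    # average, for each pixel, the first `length` pixels met when stepping
    # in the given direction; samples falling outside the image are skipped
    theta = np.deg2rad(angle_deg)
    dx, dy = np.cos(theta), np.sin(theta)
    h, w = image.shape
    out = np.zeros_like(image, dtype=float)
    steps = max(int(round(length)), 1)
    for y in range(h):
        for x in range(w):
            vals = []
            for t in range(steps):
                sx, sy = int(round(x + t * dx)), int(round(y + t * dy))
                if 0 <= sx < w and 0 <= sy < h:
                    vals.append(image[sy, sx])
            out[y, x] = np.mean(vals) if vals else image[y, x]
    return out
\end{verbatim}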
{\bf Occlusion.}
This filter selects a random rectangle from an {\em occluder} character
image and places it over the original {\em occluded} character
image. Pixels are combined by taking $\max(occluder, occluded)$,
i.e.\ the value closer to black. The rectangle corners
are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
It has a 60\% probability of not being applied at all.\\
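The combination rule itself is simple; a sketch follows, with the rectangle
and destination given explicitly rather than sampled as in the pipeline:
\begin{verbatim}
import numpy as np

def occlude(occluded, occluder, rect, dest):
    # rect = (top, left, height, width) in the occluder image,
    # dest = (top, left) in the occluded image; both assumed to fit
    t, l, hh, ww = rect
    dt, dl = dest
    out = occluded.copy()
    patch = occluder[t:t + hh, l:l + ww]
    # pixel-wise maximum keeps the value closer to black
    out[dt:dt + hh, dl:dl + ww] = np.maximum(out[dt:dt + hh, dl:dl + ww], patch)
    return out
\end{verbatim}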
{\bf Pixel Permutation.}
This filter permutes neighbouring pixels. It first selects a fraction
$\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then
sequentially exchanged with one other pixel in its $V4$ neighbourhood. The numbers
of exchanges to the left, right, top and bottom are equal, or differ by at most
1 when the number of selected pixels is not a multiple of 4.
It has an 80\% probability of not being applied at all.\\
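A sketch of the exchange procedure, cycling through the four directions so
that their counts stay balanced (clamping at the borders is our choice):
\begin{verbatim}
import numpy as np

def permute_pixels(image, complexity, rng=np.random):
    h, w = image.shape
    n = int(h * w * complexity / 3.0)   # fraction complexity/3 of the pixels
    out = image.copy()
    dirs = [(0, -1), (0, 1), (-1, 0), (1, 0)]   # left, right, up, down
    for k in range(n):
        y, x = rng.randint(0, h), rng.randint(0, w)
        dy, dx = dirs[k % 4]
        ny = min(max(y + dy, 0), h - 1)
        nx = min(max(x + dx, 0), w - 1)
        out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
\end{verbatim}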
{\bf Gaussian Noise.}
This filter simply adds, to each pixel of the image independently, a
noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
It has a 70\% probability of not being applied at all.\\
{\bf Background Images.}
Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
background behind the letter. The background is chosen by first selecting,
at random, an image from a set of images. Then a 32$\times$32 subregion
of that image is chosen as the background image (by sampling position
uniformly while making sure not to cross image borders).
[\ldots]
intensity) for both the original image and the background image, $maximage$
and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
Each background pixel value is multiplied by $\frac{\max(maximage -
contrast, 0)}{maxbg}$ (higher contrast yields a darker
background). The output image pixels are $\max(background, original)$.\\
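A sketch of the background blending, with the crop, contrast rescaling and
pixel-wise maximum written out (variable and function names are ours):
\begin{verbatim}
import numpy as np

def add_background(original, bg_source, complexity, rng=np.random):
    # random 32x32 crop of the background source image
    h, w = original.shape
    top = rng.randint(0, bg_source.shape[0] - h + 1)
    left = rng.randint(0, bg_source.shape[1] - w + 1)
    bg = bg_source[top:top + h, left:left + w].astype(float)
    contrast = rng.uniform(complexity, 1.0)
    maximage, maxbg = original.max(), max(bg.max(), 1e-8)
    # higher contrast yields a darker (dimmer) background
    bg = bg * max(maximage - contrast, 0.0) / maxbg
    return np.maximum(bg, original)
\end{verbatim}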
{\bf Salt and Pepper Noise.}
This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
The fraction of selected pixels is $0.2 \times complexity$.
This filter has a 75\% probability of not being applied at all.\\
{\bf Spatially Gaussian Noise.}
Different regions of the image are spatially smoothed.
The image is convolved with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12,12 + 20 \times
complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
between $0$ and $1$. We also create a symmetric averaging window, of the
[\ldots]
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation: $\frac{image + filtered\_image
\times mask}{mask+1}$.
This filter has a 75\% probability of not being applied at all.\\
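A sketch of this spatially weighted smoothing; the number of selected pixels
and the averaging-window size are elided in the text above, so the values
used below are placeholders:
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_smoothing(image, complexity, n_centers=5, half_win=5,
                      rng=np.random):
    sigma = rng.uniform(2, 2 + 6 * complexity)
    filtered = gaussian_filter(image, sigma)
    span = max(filtered.max() - filtered.min(), 1e-8)
    filtered = (filtered - filtered.min()) / span   # normalize to [0, 1]
    mask = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for _ in range(n_centers):
        y, x = rng.randint(0, h), rng.randint(0, w)
        # accumulate an averaging window centred on the selected pixel
        mask[max(y - half_win, 0):y + half_win + 1,
             max(x - half_win, 0):x + half_win + 1] += 1.0
    return (image + filtered * mask) / (mask + 1.0)
\end{verbatim}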
{\bf Scratches.}
The scratches module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among five thousand such images. The 1 image is
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
complexity)^2)$, using bicubic interpolation,
[\ldots]
This filter is applied only 15\% of the time. When it is applied, 50\%
of the time only one patch image is generated and applied. In 30\% of
cases two patches are generated, and otherwise three patches are
generated. The patch is applied by taking the maximal value of the
patch or the original image at each of the 32$\times$32 pixel locations.\\
{\bf Color and Contrast Changes.}
This filter changes the contrast and may invert the image polarity (white
on black to black on white). The contrast $C$ is defined here as the
difference between the maximum and the minimum pixel value of the image.
Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$
(a sketch of this normalization is given after the figure). The
[\ldots]
\end{figure}
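Returning to the contrast change above, the normalization step alone can be
sketched as follows (the elided polarity inversion is not included):
\begin{verbatim}
import numpy as np

def change_contrast(image, complexity, rng=np.random):
    # target contrast C ~ U[1 - 0.85*complexity, 1], so C >= 0.15
    C = rng.uniform(1.0 - 0.85 * complexity, 1.0)
    lo, hi = image.min(), image.max()
    unit = (image - lo) / max(hi - lo, 1e-8)   # rescale to [0, 1]
    return (1.0 - C) / 2.0 + unit * C          # into [(1-C)/2, 1-(1-C)/2]
\end{verbatim}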


\section{Experimental Setup}

Whereas much previous work on deep learning algorithms had been performed on
the MNIST digits classification task~\citep{Hinton06,ranzato-07,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
with 60~000 examples, and variants involving 10~000
examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
to focus here on the case of much larger training sets, from 10 times to
1000 times larger. The larger datasets are obtained by first sampling from
a {\em data source} (NIST characters, scanned machine-printed characters, characters
from fonts, or characters from captchas) and then optionally applying some of the
above transformations and/or noise processes.

\subsection{Data Sources}

\begin{itemize}
\item {\bf NIST}
Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}.
The dataset is composed of 8????? digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 series of different complexity.
The fourth series, $hsf_4$, experimentally recognized to be the most difficult one, is recommended
by NIST as a testing set and is used in our work and some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005}
for that purpose. We randomly split the remainder into a training set and a validation set for
model selection. The sizes of these data sets are: XXX for training, XXX for validation,
and XXX for testing.
The performances reported by previous work on that dataset mostly use only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a uniform distribution
[\ldots]

\item {\bf Fonts} TODO!!!

\item {\bf Captchas}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the data generated, many different fonts are used for generating the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
allowed and can be controlled via an easy-to-use facade class.
\item {\bf OCR data}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) was included as an
additional source. This set is part of a larger corpus being collected by the Image Understanding
Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern
({\tt http://www.iupr.com}), and which will be publicly released.
\end{itemize}

\subsection{Data Sets}
All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
\begin{itemize}
\item {\bf NIST.} This is the raw NIST Special Database 19.
\item {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the above transformation pipeline.
For each new example to generate, a source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$ (a sampling sketch follows this list).
\item {\bf NISTP.} NISTP is equivalent to P07 (complexity parameter of $0.7$ with the same source proportions)
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
\end{itemize}
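The P07 sampling procedure can be sketched as follows (the source callables
and the transformation list are stand-ins for the actual pipeline modules):
\begin{verbatim}
import numpy as np

def sample_p07_example(sources, transformations, rng=np.random):
    # sources: dict name -> callable returning a raw 32x32 character image
    # transformations: callables f(image, complexity), in pipeline order
    names = ['fonts', 'captchas', 'ocr', 'nist']
    probs = [0.10, 0.25, 0.25, 0.40]
    image = sources[names[rng.choice(len(names), p=probs)]]()
    for transform in transformations:
        # each transformation gets its own complexity in [0, 0.7]
        image = transform(image, rng.uniform(0.0, 0.7))
    return image
\end{verbatim}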

\subsection{Models and their Hyperparameters}

All hyper-parameters are selected based on performance on the NISTP validation set.

\subsubsection{Multi-Layer Perceptrons (MLP)}

Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used.
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: the number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

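A minimal NumPy sketch of the model and update just described (one $\tanh$
hidden layer, softmax outputs, constant-learning-rate minibatch gradient
descent); the layer sizes and initialization scheme below are illustrative:
\begin{verbatim}
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class OneHiddenLayerMLP:
    def __init__(self, n_in=32 * 32, n_hidden=500, n_out=62, rng=np.random):
        self.W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        self.H = np.tanh(X @ self.W1 + self.b1)
        return softmax(self.H @ self.W2 + self.b2)   # P(class | image)

    def sgd_step(self, X, y, lr=0.1):
        # one minibatch step on the negative log-likelihood
        P = self.forward(X)
        n = X.shape[0]
        dZ2 = P
        dZ2[np.arange(n), y] -= 1.0
        dZ2 /= n
        dH = (dZ2 @ self.W2.T) * (1.0 - self.H ** 2)
        self.W2 -= lr * (self.H.T @ dZ2)
        self.b2 -= lr * dZ2.sum(axis=0)
        self.W1 -= lr * (X.T @ dH)
        self.b1 -= lr * dH.sum(axis=0)
\end{verbatim}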
\subsubsection{Stacked Denoising Auto-Encoders (SDAE)}
\label{SdA}

Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization: they apparently set the parameters in the
basin of attraction of supervised gradient descent solutions that yield better
generalization~\citep{Erhan+al-2010}. It is hypothesized that the
advantage brought by this procedure stems from a better prior,
on the one hand taking advantage of the link between the input
distribution $P(x)$ and the conditional distribution of interest
$P(y|x)$ (as in semi-supervised learning), and on the other hand
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).

Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising
Auto-Encoder, it is presented with a stochastically corrupted version
of the input and trained to reconstruct the uncorrupted input,
forcing the hidden units to represent the leading regularities in
the data. Once it is trained, its hidden unit activations can
be used as inputs for training a second one, and so on.
After this unsupervised pre-training stage, the parameters
are used to initialize a deep MLP, which is fine-tuned by
the same standard supervised procedure used to train MLPs (see the previous section).

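A sketch of the greedy pre-training of one such layer with masking noise;
the tied decoder weights and the cross-entropy reconstruction loss below
follow~\citet{VincentPLarochelleH2008} but are simplifying assumptions here:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dae_layer(X, n_hidden, corruption=0.2, lr=0.01, epochs=10,
                       batch=20, rng=np.random):
    n_in = X.shape[1]
    W = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for start in range(0, X.shape[0], batch):
            x = X[start:start + batch]
            # masking noise: a fixed fraction of inputs is zeroed at random
            x_tilde = x * (rng.uniform(size=x.shape) > corruption)
            h = sigmoid(x_tilde @ W + b_h)       # encoder
            x_rec = sigmoid(h @ W.T + b_v)       # decoder (tied weights)
            d_rec = (x_rec - x) / x.shape[0]     # cross-entropy gradient
            d_h = (d_rec @ W) * h * (1.0 - h)
            W -= lr * (x_tilde.T @ d_h + d_rec.T @ h)
            b_h -= lr * d_h.sum(axis=0)
            b_v -= lr * d_rec.sum(axis=0)
    # hidden activations feed the next layer; (W, b_h) initialize the deep MLP
    return W, b_h, sigmoid(X @ W + b_h)
\end{verbatim}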
The hyper-parameters are the same as for the MLP, with the addition of the
amount of corruption noise (we used the masking noise process, whereby a
fixed proportion of the input values, randomly selected, are zeroed), and a
separate learning rate for the unsupervised pre-training stage (selected
from the same set as above). The fraction of inputs corrupted was selected
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers, but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

\section{Experimental Results}

\subsection{SDA vs MLP vs Humans}

[\ldots]
on NIST digits classification using the same test set are included.}
\label{tab:sda-vs-mlp-vs-humans}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
          & NIST test & NISTP test & P07 test & NIST test digits \\ \hline
Humans    & 18.2\% $\pm$.1\%  & 39.4\%$\pm$.1\%  & 46.9\%$\pm$.1\%   & $1.4\%$ \\ \hline
SDA0      & 23.7\% $\pm$.14\% & 65.2\%$\pm$.34\% & 97.45\%$\pm$.06\% & 2.7\% $\pm$.14\% \\ \hline
SDA1      & 17.1\% $\pm$.13\% & 29.7\%$\pm$.3\%  & 29.7\%$\pm$.3\%   & 1.4\% $\pm$.1\% \\ \hline
SDA2      & 18.7\% $\pm$.13\% & 33.6\%$\pm$.3\%  & 39.9\%$\pm$.17\%  & 1.7\% $\pm$.1\% \\ \hline
MLP0      & 24.2\% $\pm$.15\% & 68.8\%$\pm$.33\% & 78.70\%$\pm$.14\% & 3.45\% $\pm$.15\% \\ \hline
MLP1      & 23.0\% $\pm$.15\% & 41.8\%$\pm$.35\% & 90.4\%$\pm$.1\%   & 3.85\% $\pm$.16\% \\ \hline
MLP2      & 24.3\% $\pm$.15\% & 46.0\%$\pm$.35\% & 54.7\%$\pm$.17\%  & 4.85\% $\pm$.18\% \\ \hline
\citep{Granger+al-2007}  & & & & 4.95\% $\pm$.18\% \\ \hline
\citep{Cortes+al-2000}   & & & & 3.71\% $\pm$.16\% \\ \hline
\citep{Oliveira+al-2002} & & & & 2.4\% $\pm$.13\% \\ \hline
\citep{Milgram+al-2005}  & & & & 2.1\% $\pm$.12\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{figure}[h]
[\ldots]

\section{Conclusions}

The conclusions are positive for all the questions asked in the introduction.
\begin{itemize}
\item Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST Special Database 19, with 62 classes and around 800k examples?
Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
performance.
\item To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input images,
but were only marginally helped with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.
\item Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP, it was very significant for the SDA.
\end{itemize}

\bibliography{strings,ml,aigaion,specials}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}
