% view writeup/techreport.tex @ 413:f2dd75248483
%
% initial commit of mlp with options for detection and 36 classes
% author youssouf
% date Thu, 29 Apr 2010 16:51:03 -0400

\documentclass[12pt,letterpaper]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{times}
\usepackage{mlapa}

\begin{document}
\title{Generating and Exploiting Perturbed Training Data for Deep Architectures}
\author{The IFT6266 Gang}
\date{April 2010, Technical Report, Dept. IRO, U. Montreal}

\maketitle

\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has
demonstrated the importance of learning algorithms for deep
architectures, i.e., function classes obtained by composing multiple
non-linear transformations. In the area of handwriting recognition,
deep learning algorithms
have so far been evaluated on rather small datasets with a few tens of thousands
of examples. Here we propose a powerful generator of variations
of examples for character images based on a pipeline of stochastic
transformations that include not only the usual affine transformations
but also the addition of slant, local elastic deformations, changes
in thickness, background images, color, contrast, occlusion, and
various types of pixel and spatially correlated noise.
We evaluate a deep learning algorithm (Stacked Denoising Autoencoders)
on the task of learning to classify digits and letters transformed
with this pipeline, using hundreds of millions of generated examples
and testing on the full NIST test set.
We find that the SDA outperforms its
shallow counterpart, an ordinary Multi-Layer Perceptron,
and that it is better able to take advantage of the additional
generated data.
\end{abstract}

\section{Introduction}

Deep Learning has emerged as a promising new area of research in
statistical machine learning (see~\emcite{Bengio-2009} for a review).
Learning algorithms for deep architectures are centered on the learning
of useful representations of data, which are better suited to the task at hand.
This is in large part inspired by observations of the mammalian visual cortex,
which consists of a chain of processing elements, each of which is associated with a
different representation. In fact,
it was found recently that the features learnt in deep architectures resemble
those observed in the first two of these stages (in areas V1 and V2
of visual cortex)~\cite{HonglakL2008}.
Processing images typically involves transforming the raw pixel data into
new {\bf representations} that can be used for analysis or classification.
For example, a principal component analysis representation linearly projects 
the input image into a lower-dimensional feature space.
Why learn a representation?  Current practice in the computer vision
literature converts the raw pixels into a hand-crafted representation
(e.g.\ SIFT features~\cite{Lowe04}), but deep learning algorithms
tend to discover similar features in their first few 
levels~\cite{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}.
Learning increases the
ease and practicality of developing representations that are at once
tailored to specific tasks, yet are able to borrow statistical strength
from other related tasks (e.g., modeling different kinds of objects). Finally, learning the
feature representation can lead to higher-level (more abstract, more
general) features that are more robust to unanticipated sources of
variance extant in real data.

Whereas a deep architecture can in principle be more powerful than a shallow
one in terms of representation, depth appears to render the training problem
more difficult in terms of optimization and local minima.
Only recently have
successful algorithms been proposed to overcome some of these
difficulties.

\section{Perturbation and Transformation of Character Images}

\subsection{Adding Slant}
To mimic a slant effect, we simply shift each row of the image horizontally in proportion to its height.
The slant coefficient is randomly sampled according to the complexity level and is negative or positive with equal probability.
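
As an illustration only, the row shift can be written as follows, where the symbols $I$ (input image), $I'$ (slanted image), $(x,y)$ (column and row indices) and $s$ (the sampled slant coefficient) are notation introduced here and not taken from the implementation:
$$I'(x, y) = I\left(x - \mathrm{round}(s\,y),\; y\right)$$
Under this convention, pixels that fall outside the image are assumed to take the background value. For example, with $s = 0.5$ the row $y = 4$ is shifted by two pixels, and the row $y = 0$ is left in place.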

\subsection{Changing Thickness}
To change the thickness of the characters we used morphological operators: dilation and erosion~\cite{Haralick87,Serra82}.
The basic idea of such a transform is, for each pixel, to multiply its neighbourhood element-wise with a matrix called the structuring element.
For dilation we then replace the pixel value by the maximum of the result, and for erosion by the minimum.
This dilates or erodes the objects in the image; the strength of the transform depends only on the structuring element.
We used ten different structuring elements of various shapes (the biggest is $5\times5$).
For each image, we randomly sample the operator type (dilation or erosion) and one structuring element
from a subset that depends on the complexity level (the higher the complexity, the bigger the structuring element can be).
Erosion is restricted to the five smallest structuring elements because, when the character is too thin, erosion may erase it completely.
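
For reference, here is a sketch of the standard flat definitions of these two operators, which we assume (without having checked the implementation) correspond to the description above. The symbols $I$ (image), $B$ (structuring element with entries in $\{0,1\}$) and $(i,j)$ (offsets relative to the centre of $B$) are introduced here for illustration:
$$D_B(I)(x,y) = \max_{(i,j)\,:\,B_{ij}=1} I(x+i,\,y+j), \qquad E_B(I)(x,y) = \min_{(i,j)\,:\,B_{ij}=1} I(x+i,\,y+j)$$
With a $3\times3$ structuring element of ones, for instance, and assuming the character is brighter than the background, dilation replaces each pixel by the brightest value in its $3\times3$ neighbourhood and thickens the strokes by one pixel on each side, whereas erosion takes the darkest value and thins them by the same amount.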

\subsection{Affine Transformations}
We generate an affine transform matrix according to the complexity level, then apply it directly to the image.
This produces variations in scale, translation, rotation and shear. We took care that the maximum rotation applied
to the image is small enough not to confuse classes.
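
In coordinates, a sketch of such a transform maps each pixel position $(x,y)$ to a new position $(x',y')$ as follows, where the coefficients $a_{ij}$ and the translation $(t_x, t_y)$ are notation introduced here:
$$\left(\begin{array}{c} x' \\ y' \end{array}\right) = \left(\begin{array}{cc} a_{11} & a_{12} \\ a_{21} & a_{22} \end{array}\right) \left(\begin{array}{c} x \\ y \end{array}\right) + \left(\begin{array}{c} t_x \\ t_y \end{array}\right)$$
A pure rotation by an angle $\theta$ corresponds to $a_{11} = a_{22} = \cos\theta$ and $-a_{12} = a_{21} = \sin\theta$, a pure scaling to a diagonal matrix, and a horizontal shear to $a_{11} = a_{22} = 1$, $a_{21} = 0$ with $a_{12}$ as the shear coefficient. How the six coefficients are sampled from the complexity level is not specified here.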

\subsection{Local Elastic Deformations}
\subsection{GIMP transformation}
\subsection{Occlusion}
\subsection{Background Images}
\subsection{Salt and Pepper Noise}
\subsection{Spatially Gaussian Noise}
\subsection{Color and Contrast Changes}

\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\
\caption{Illustration of the pipeline of stochastic 
transformations applied to the image of a lower-case t
(the upper left image). Each image in the pipeline (going from
left to right, first top line, then bottom line) shows the result
of applying one of the modules in the pipeline. The last image
(bottom right) is used as a training example.}
\label{fig:pipeline}
\end{figure}

\section{Learning Algorithms for Deep Architectures}

\section{Experimental Setup}

\subsection{Training Datasets}

\subsubsection{Data Sources}

\begin{itemize}
\item {\bf NIST}
\item {\bf Fonts}
\item {\bf Captchas}
\item {\bf OCR data}
\end{itemize}

\subsubsection{Data Sets}
\begin{itemize}
\item {\bf NIST}
\item {\bf P07}
\item {\bf NISTP} {\em do not use PNIST but NISTP, to remain politically correct...}
\end{itemize}

\subsection{Models and their Hyperparameters}

\subsubsection{Multi-Layer Perceptrons (MLP)}

An MLP is a family of functions that are described by stacking layers of a function similar to
$$g(x) = \tanh(b+Wx)$$
The input, $x$, is a $d$-dimensional vector.
The output, $g(x)$, is an $m$-dimensional vector.
The parameter $W$ is an $m\times d$ matrix and is called the weight matrix.
The parameter $b$ is an $m$-dimensional vector and is called the bias vector.
The non-linearity (here $\tanh$) is applied element-wise to the vector $b+Wx$.
Usually the input is referred to as the input layer, and similarly for the output.
Several such functions can of course be chained to obtain a more complex one.
Here is a common example:
$$f(x) = c + V\tanh(b+Wx)$$
In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer.
Here the output layer does not have the same non-linearity as the hidden layer.
This is common: a specialized non-linearity, chosen according to the task at hand, is applied to the output layer only.

If you put 3 or more hidden layers in such a network you obtain what is called a deep MLP.
The parameters to adapt are the weight matrix and the bias vector for each layer.
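
As a sketch of this chaining, with layer indices $k$ and layer widths $m_k$ as notation introduced here, a network with $L$ hidden layers can be written as
$$h_0 = x, \qquad h_k = \tanh(b_k + W_k h_{k-1}) \quad (k = 1,\ldots,L), \qquad f(x) = c + V h_L$$
where $W_k$ is an $m_k \times m_{k-1}$ matrix, $b_k$ is an $m_k$-dimensional vector and $m_0 = d$. Hidden layer $k$ therefore contributes $m_k m_{k-1} + m_k$ adaptable parameters. As a purely illustrative figure, a hidden layer with $d = 1024$ inputs (a $32\times32$ image) and $m_1 = 500$ units has $500 \times 1024 + 500 = 512{,}500$ parameters; these sizes are assumptions, not the settings used in the experiments.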

\subsubsection{Stacked Denoising Auto-Encoders (SDAE)}

Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization.
Denoising auto-encoders are a variant in which the input is corrupted with random noise and the network is trained to reconstruct the original, uncorrupted input.
The principle behind these initialization methods is that the network learns the inherent relations between portions of the data and becomes able to represent them, which helps with whatever task we want to perform.

An auto-encoder unit is formed of two MLP layers, with the bottom one called the encoding layer and the top one the decoding layer.
Usually the top and bottom weight matrices are constrained to be the transpose of each other (tied weights).
The auto-encoder is trained in this form and, when sufficiently trained, the corresponding MLP layer is initialized with the parameters of the encoding layer.
The other parameters are discarded.
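
One way to write a single denoising auto-encoder with tied weights, in the notation of the MLP section, is sketched below. The corrupted input $\tilde{x}$, the reconstruction $\hat{x}$, the decoder bias $b'$, the non-linearity $s$ and the reconstruction loss $\ell$ are symbols introduced here; the specific choices of corruption process, $s$ and $\ell$ are left open.
$$h = s(b + W\tilde{x}), \qquad \hat{x} = s(b' + W^{\top} h), \qquad \min_{W,\,b,\,b'}\; \ell(x, \hat{x})$$
Note that the loss compares $\hat{x}$ with the clean input $x$, not with $\tilde{x}$. After training, $W$ and $b$ initialize the MLP layer and $b'$ is discarded.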

The stacked version is an adaptation to deep MLPs in which each layer is initialized with a denoising auto-encoder, starting from the bottom.
During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder.
The second and following layers receive the same treatment, except that they take as input the encoded version of the data that has gone through the layers below them.
For additional details see \cite{vincent:icml08}.
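
The layer-wise procedure can be sketched as follows, where $h_k$ denotes the output of layer $k$, $\tilde{h}_{k-1}$ a corrupted copy of $h_{k-1}$, and $s$, $b'_k$ and $\ell$ are respectively a non-linearity, a decoder bias and a reconstruction loss (all notation introduced here for illustration). With $h_0 = x$, layer $k$ is pre-trained for $k = 1, 2, \ldots$ in turn by solving
$$\min_{W_k,\,b_k,\,b'_k}\; \ell\left(h_{k-1},\; s\left(b'_k + W_k^{\top}\, s(b_k + W_k \tilde{h}_{k-1})\right)\right), \qquad h_k = s(b_k + W_k h_{k-1})$$
while the parameters of layers $1,\ldots,k-1$ are held fixed. In this sketch the inputs $h_{k-1}$ are computed from uncorrupted data, and only the auto-encoder currently being trained sees the corrupted copy.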

\section{Experimental Results}

\subsection{SDA vs MLP}

\begin{center}
\begin{tabular}{lcc}
      & train w/   & train w/    \\
      & NIST       & P07 + NIST  \\ \hline 
SDA   &            &             \\ \hline 
MLP   &            &             \\ \hline 
\end{tabular}
\end{center}

\subsection{Perturbed Training Data More Helpful for SDAE}

\subsection{Training with More Classes than Necessary}

\section{Conclusions}

\bibliography{strings,ml,aigaion,specials}
\bibliographystyle{mlapa}

\end{document}