\documentclass[12pt,letterpaper]{article}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{times}
\usepackage{mlapa}

\begin{document}
\title{Generating and Exploiting Perturbed Training Data for Deep Architectures}
\author{The IFT6266 Gang}
\date{April 2010, Technical Report, Dept. IRO, U. Montreal}

\maketitle

\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. In the area of handwriting recognition, deep learning algorithms had been evaluated on rather small datasets with a few tens of thousands of examples. Here we propose a powerful generator of variations of examples for character images based on a pipeline of stochastic transformations that include not only the usual affine transformations but also the addition of slant, local elastic deformations, changes in thickness, background images, color, contrast, occlusion, and various types of pixel and spatially correlated noise. We evaluate a deep learning algorithm (Stacked Denoising Autoencoders) on the task of learning to classify digits and letters transformed with this pipeline, using the hundreds of millions of generated examples and testing on the full NIST test set. We find that the SDA outperforms its shallow counterpart, an ordinary Multi-Layer Perceptron, and that it is better able to take advantage of the additional generated data.
\end{abstract}

\section{Introduction}

Deep Learning has emerged as a promising new area of research in statistical machine learning (see~\emcite{Bengio-2009} for a review). Learning algorithms for deep architectures are centered on the learning of useful representations of data, which are better suited to the task at hand.
This is in great part inspired by observations of the mammalian visual cortex, which consists of a chain of processing elements, each of which is associated with a different representation. In fact, it was found recently that the features learnt in deep architectures resemble those observed in the first two of these stages (in areas V1 and V2 of visual cortex)~\cite{HonglakL2008}.

Processing images typically involves transforming the raw pixel data into new {\bf representations} that can be used for analysis or classification. For example, a principal component analysis representation linearly projects the input image into a lower-dimensional feature space.

Why learn a representation? Current practice in the computer vision literature converts the raw pixels into a hand-crafted representation (e.g.\ SIFT features~\cite{Lowe04}), but deep learning algorithms tend to discover similar features in their first few levels~\cite{HonglakL2008,ranzato-08,Koray-08,VincentPLarochelleH2008-very-small}. Learning increases the ease and practicality of developing representations that are at once tailored to specific tasks, yet are able to borrow statistical strength from other related tasks (e.g., modeling different kinds of objects). Finally, learning the feature representation can lead to higher-level (more abstract, more general) features that are more robust to unanticipated sources of variance extant in real data.

Whereas a deep architecture can in principle be more powerful than a shallow one in terms of representation, depth appears to render the training problem more difficult in terms of optimization and local minima. It is also only recently that successful algorithms were proposed to overcome some of these difficulties.

\section{Perturbation and Transformation of Character Images}

\subsection{Adding Slant}

In order to mimic a slant effect, we simply shift each row of the image in proportion to its height.
The coefficient is randomly sampled according to the complexity level and can be negative or positive with equal probability.

\subsection{Changing Thickness}

To change the thickness of the characters we used morphological operators: dilation and erosion~\cite{Haralick87,Serra82}. The basic idea of these transforms is, for each pixel, to multiply its neighbourhood element-wise with a matrix called the structuring element. For dilation we then replace the pixel value with the maximum of the result, and for erosion with the minimum. This dilates or erodes objects in the image; the strength of the transform depends only on the structuring element. We used ten different structuring elements with various shapes (the largest is $5\times5$). For each image, we randomly sample the operator type (dilation or erosion) and one structuring element from a subset that depends on the complexity (the higher the complexity, the larger the structuring element can be). Erosion is restricted to the five smallest structuring elements because it may erase a character completely when the character is too thin.

\subsection{Affine Transformations}

We generate an affine transform matrix according to the complexity level, then we apply it directly to the image. This allows us to produce scaling, translation, rotation and shearing variations. We took care that the maximum rotation applied to the image is low enough not to confuse classes.

\subsection{Local Elastic Deformations}

\subsection{GIMP transformation}

\subsection{Occlusion}

\subsection{Background Images}

\subsection{Salt and Pepper Noise}

\subsection{Spatially Gaussian Noise}

\subsection{Color and Contrast Changes}

\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\
\caption{Illustration of the pipeline of stochastic transformations applied to the image of a lower-case t (the upper left image).
Each image in the pipeline (going from left to right, first top line, then bottom line) shows the result of applying one of the modules in the pipeline. The last image (bottom right) is used as a training example.}
\label{fig:pipeline}
\end{figure}

\section{Learning Algorithms for Deep Architectures}

\section{Experimental Setup}

\subsection{Training Datasets}

\subsubsection{Data Sources}

\begin{itemize}
\item {\bf NIST}
\item {\bf Fonts}
\item {\bf Captchas}
\item {\bf OCR data}
\end{itemize}

\subsubsection{Data Sets}

\begin{itemize}
\item {\bf NIST}
\item {\bf P07}
\item {\bf NISTP} {\em do not use PNIST but NISTP, to stay politically correct...}
\end{itemize}

\subsection{Models and their Hyperparameters}

\subsubsection{Multi-Layer Perceptrons (MLP)}

An MLP belongs to a family of functions obtained by stacking layers, each similar to
$$g(x) = \tanh(b+Wx)$$
The input, $x$, is a $d$-dimensional vector. The output, $g(x)$, is an $m$-dimensional vector. The parameter $W$ is an $m\times d$ matrix and is called the weight matrix. The parameter $b$ is an $m$-vector and is called the bias vector. The non-linearity (here $\tanh$) is applied element-wise to the output vector. The input is usually referred to as the input layer, and the output as the output layer. You can of course chain several such functions to obtain a more complex one. Here is a common example:
$$f(x) = c + V\tanh(b+Wx)$$
In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer. Here the output layer does not have the same non-linearity as the hidden layer. This is a common case, where a specialized non-linearity chosen according to the task at hand is applied to the output layer only. If you put three or more hidden layers in such a network you obtain what is called a deep MLP. The parameters to adapt are the weight matrix and the bias vector for each layer.
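
As an illustration of the stacking described above, a network with $L$ hidden layers can be written recursively. The indexed symbols $h_k$, $W_k$, $b_k$, the layer sizes $m_k$ and the depth $L$ are notation introduced only for this sketch and are not used elsewhere in this report; $V$ and $c$ play the same role as in the example above.
$$h_0 = x, \qquad h_k = \tanh(b_k + W_k h_{k-1}) \quad (k = 1,\ldots,L), \qquad f(x) = c + V h_L$$
With $L=1$ this reduces to $f(x) = c + V\tanh(b_1+W_1x)$, the single-hidden-layer example, and the deep case corresponds to $L \geq 3$. If layer $k$ has $m_k$ units (with $m_0 = d$), then $W_k$ is an $m_k\times m_{k-1}$ matrix and $b_k$ is an $m_k$-vector.
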
\subsubsection{Stacked Denoising Auto-Encoders (SDA)}

Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization. Denoising auto-encoders are a variant in which the input is corrupted with random noise and the network is trained to reconstruct the original, uncorrupted input. The principle behind these initialization methods is that the network will learn the inherent relation between portions of the data and be able to represent them, which helps with whatever task we want to perform.

An auto-encoder unit is formed of two MLP layers, with the bottom one called the encoding layer and the top one the decoding layer. Usually the top and bottom weight matrices are constrained to be the transpose of each other. The auto-encoder is trained in this form and, once it is sufficiently trained, the corresponding MLP layer is initialized with the parameters of the encoding layer. The other parameters are discarded.

The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder, starting from the bottom. During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder. The second and following layers receive the same treatment, except that they take as input the encoded version of the data that has gone through the layers before them. For additional details see \cite{vincent:icml08}.

\section{Experimental Results}

\subsection{SDA vs MLP}

\begin{center}
\begin{tabular}{lcc}
    & train w/ & train w/   \\
    & NIST     & P07 + NIST \\ \hline
SDA &          &            \\ \hline
MLP &          &            \\ \hline
\end{tabular}
\end{center}

\subsection{Perturbed Training Data More Helpful for SDA}

\subsection{Training with More Classes than Necessary}

\section{Conclusions}

\bibliography{strings,ml,aigaion,specials}
\bibliographystyle{mlapa}

\end{document}