# HG changeset patch
# User Dumitru Erhan
# Date 1275441481 25200
# Node ID 8fe77eac344f2a9340b1f7d483a20205e1ce7565
# Parent  8b7e054d22bd4ef5fbf60c9557ede44eac18f38d
Clarifying the experimental setup, typos here and there

diff -r 8b7e054d22bd -r 8fe77eac344f writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex	Tue Jun 01 14:23:34 2010 -0700
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 18:18:01 2010 -0700
@@ -1,7 +1,8 @@
 \documentclass{article} % For LaTeX2e
 \usepackage{nips10submit_e,times}
-\usepackage{amsthm,amsmath,amssymb,bbold,bbm}
+\usepackage{amsthm,amsmath,bbm}
+\usepackage[psamsfonts]{amssymb}
 \usepackage{algorithm,algorithmic}
 \usepackage[utf8]{inputenc}
 \usepackage{graphicx,subfigure}
@@ -77,7 +78,7 @@
 Machines in terms of unsupervised extraction of a hierarchy of features
 useful for classification. The principle is that each layer starting from
 the bottom is trained to encode its input (the output of the previous
-layer) and to reconstruct it from a corrupted version of it. After this
+layer) and to reconstruct it from a corrupted version. After this
 unsupervised initialization, the stack of denoising auto-encoders can be
 converted into a deep supervised feedforward neural network and fine-tuned by
 stochastic gradient descent.
@@ -97,8 +98,6 @@
 between different regions in input space or different tasks, as discussed
 in the conclusion.
 
-% TODO: why we care to evaluate this relative advantage
-
 In this paper we ask the following questions:
 
 %\begin{enumerate}
@@ -127,6 +126,7 @@
 
 \vspace*{-1mm}
 \section{Perturbation and Transformation of Character Images}
+\label{s:perturbations}
 \vspace*{-1mm}
 
 This section describes the different transformations we used to stochastically
@@ -142,7 +142,7 @@
 part, from blur to contrast, adds different kinds of noise.
 
 \begin{figure}[ht]
-\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}}
+\centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}}
 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE
 \caption{Illustration of each transformation applied alone to the same image of
 an upper-case h (top left). First row (from left to right) : original image, slant,
@@ -354,28 +354,31 @@
 with 60~000 examples, and variants involving 10~000
 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
 to focus here on the case of much larger training sets, from 10 times to
-to 1000 times larger. The larger datasets are obtained by first sampling from
+1000 times larger.
+
+The first step in constructing the larger datasets is to sample from
 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
 and {\bf OCR data} (scanned machine printed characters). Once a character
-is sampled from one of these sources (chosen randomly), a pipeline of
-the above transformations and/or noise processes is applied to the
-image.
+is sampled from one of these sources (chosen randomly), the pipeline of
+the transformations and/or noise processes described in section \ref{s:perturbations}
+is applied to the image.
 
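In code form, the generation loop just described amounts to a few lines. The sketch below is illustrative only: samplers and transforms are hypothetical stand-ins for the paper's actual data-source and perturbation modules (which this patch does not contain), and the source proportions are the ones specified for P07 further down, under Data Sources.

    import random

    def generate_example(rng, samplers, transforms, max_complexity=0.7):
        """Draw one perturbed character example: pick a data source at
        random, sample a raw character from it, then apply the full
        transformation pipeline.

        samplers: dict mapping a source name to a zero-argument function
            returning (image, label) -- a stand-in for the real modules.
        transforms: ordered list of functions taking (image, complexity).
        """
        # Source proportions specified for P07 in the Data Sources section:
        # fonts 10%, captchas 25%, OCR data 25%, NIST 40%.
        names = ["fonts", "captcha", "ocr", "nist"]
        weights = [0.10, 0.25, 0.25, 0.40]
        source = rng.choices(names, weights=weights)[0]
        image, label = samplers[source]()
        # Apply every transformation in the fixed order given in the text,
        # each with its own complexity drawn uniformly from [0, 0.7].
        for transform in transforms:
            image = transform(image, rng.uniform(0.0, max_complexity))
        return image, label

    # Usage sketch: generate_example(random.Random(0), samplers, transforms)

Passing only the slant-to-pinch subset of transforms would reproduce the noise-free NISTP variant described under Data Sources.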
-We compare the best MLP (according to validation set error) that we found against
-the best SDA (again according to validation set error), along with a precise estimate
+We compare the best MLP against
+the best SDA (both models' hyper-parameters are selected to minimize the validation set error),
+as well as against a precise estimate
 of human performance obtained via Amazon's Mechanical Turk (AMT)
-service\footnote{http://mturk.com}.
+service (http://mturk.com).
 AMT users are paid small amounts
 of money to perform tasks for which human intelligence is required.
 Mechanical Turk has been used extensively in natural language processing and vision.
 %processing \citep{SnowEtAl2008} and vision
 %\citep{SorokinAndForsyth2008,whitehill09}.
-AMT users where presented
-with 10 character images and asked to type 10 corresponding ASCII
+AMT users were presented
+with 10 character images and asked to choose 10 corresponding ASCII
 characters. They were forced to make a hard choice among the
 62 or 10 character classes (all classes or digits only).
 Three users classified each image, allowing
-to estimate inter-human variability.
+to estimate inter-human variability. A total of 2500 images per dataset were classified.
 
 \vspace*{-1mm}
 \subsection{Data Sources}
@@ -390,7 +393,7 @@
 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
-The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended
+The fourth partition (called $hsf_4$), experimentally recognized as the most difficult one, is the one recommended
 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
 for that purpose. We randomly split the remainder into a training set and a validation set for
 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation,
@@ -406,10 +409,10 @@
 %\item
 {\bf Fonts.}
 In order to have a good variety of sources we downloaded an important number of free fonts from:
-{\tt http://cg.scs.carleton.ca/~luc/freefonts.html}
+{\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
 % TODO: pointless to anonymize, it's not pointing to our work
-Including operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
-The {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
+Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total that we can choose uniformly from.
+The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by producing a corresponding image,
 directly as input to our models.
 
 %\item
@@ -445,13 +448,13 @@
 
 %\item
 {\bf P07.}
 This dataset is obtained by taking raw characters from all four of the above sources
-and sending them through the above transformation pipeline.
-For each new example to generate, a source is selected with probability $10\%$ from the fonts,
+and sending them through the transformation pipeline described in section \ref{s:perturbations}.
+To generate each new example, a data source is selected with probability $10\%$ from the fonts,
 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
-order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$.
+order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
 %\item
-{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion)
+{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
 except that we only apply transformations from slant to pinch. Therefore, the character is
 transformed but no additional noise is added to the image, giving images
@@ -464,21 +467,20 @@
 
 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
 hidden layer and with Stacked Denoising Auto-Encoders (SDA).
-All hyper-parameters are selected based on performance on the NISTP validation set.
+\emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.}
 
 {\bf Multi-Layer Perceptrons (MLP).}
 Whereas previous work had compared deep
 architectures to both shallow MLPs and SVMs, we only compared to MLPs here
 because of the very large datasets used
-(making the use of SVMs computationally inconvenient because of their quadratic
+(making the use of SVMs computationally challenging because of their quadratic
 scaling behavior). The MLP has a single hidden layer with $\tanh$ activation
 functions, and softmax (normalized exponentials) on the output layer for estimating $P(class | image)$.
 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
-The optimization procedure is as follows: training
-examples are presented in minibatches of size 20, a constant learning
-rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
+Training examples are presented in minibatches of size 20. A constant learning
+rate was chosen from $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
 through preliminary experiments (measuring performance on a validation set),
-and $0.1$ was then selected.
+and $0.1$ was then selected for training on the full training sets.
 
 \begin{figure}[ht]
 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
@@ -507,11 +509,12 @@
 taking advantage of the expressive power and bias implicit in the
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
+
 Here we chose to use the Denoising
 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
 these deep hierarchies of features, as it is very simple to train and
 teach (see Figure~\ref{fig:da}, as well as
-tutorial and code there: {\tt http://deeplearning.net/tutorial}),
+tutorial and code at {\tt http://deeplearning.net/tutorial}),
 provides immediate and efficient inference, and yielded results
 comparable or better than RBMs in series of experiments
 \citep{VincentPLarochelleH2008}. During training, a Denoising
@@ -536,10 +539,9 @@
 
 \begin{figure}[ht]
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
-\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
+\caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
-of all models, on 3 different test sets corresponding to the three
-datasets.
+of all models, on 3 different test sets (NIST, NISTP, P07).
 Right: error rates on NIST test digits only, along with the previous results from
 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
@@ -693,7 +695,7 @@
 experiments showed its positive effects in a \emph{limited labeled data}
 scenario. However, many of the results by \citet{RainaR2007} (who used a
 shallow, sparse coding approach) suggest that the relative gain of self-taught
-learning diminishes as the number of labeled examples increases, (essentially,
+learning diminishes as the number of labeled examples increases (essentially,
 a ``diminishing returns'' scenario occurs). We note that, for deep
 architectures, our experiments show that such a positive effect is accomplished
 even in a scenario with a \emph{very large number of labeled examples}.
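The MLP baseline specified in the Models section (single tanh hidden layer, softmax output estimating P(class | image), minibatches of size 20, constant learning rate 0.1) is compact enough to sketch in full. The NumPy code below is an illustrative re-implementation under those stated hyper-parameters, not the authors' code (the text points instead to http://deeplearning.net/tutorial); the input size and the weight initialization are placeholder assumptions.

    import numpy as np

    class SimpleMLP:
        """One tanh hidden layer, softmax output: P(class | image)."""

        def __init__(self, n_in, n_hidden, n_out, rng):
            # Placeholder initialization: small uniform weights for the
            # hidden layer, zeros for the output layer.
            self.W1 = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.W2 = np.zeros((n_hidden, n_out))
            self.b2 = np.zeros(n_out)

        def forward(self, X):
            H = np.tanh(X @ self.W1 + self.b1)            # hidden layer
            logits = H @ self.W2 + self.b2
            logits -= logits.max(axis=1, keepdims=True)   # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)             # softmax
            return H, P

        def sgd_step(self, X, y, lr=0.1):
            """One minibatch SGD step on the negative log-likelihood."""
            n = X.shape[0]
            H, P = self.forward(X)
            dlogits = P.copy()
            dlogits[np.arange(n), y] -= 1.0               # dNLL / dlogits
            dlogits /= n
            dH = dlogits @ self.W2.T
            dpre = dH * (1.0 - H ** 2)                    # tanh derivative
            self.W2 -= lr * (H.T @ dlogits)
            self.b2 -= lr * dlogits.sum(axis=0)
            self.W1 -= lr * (X.T @ dpre)
            self.b1 -= lr * dpre.sum(axis=0)

    # Training-loop sketch: 62 output classes as in the paper; the 32x32
    # input size and the arrays X_train (flattened images) / y_train
    # (integer labels) are assumptions for illustration.
    # mlp = SimpleMLP(n_in=32 * 32, n_hidden=1000, n_out=62,
    #                 rng=np.random.default_rng(0))
    # for start in range(0, len(X_train), 20):   # minibatches of size 20
    #     mlp.sgd_step(X_train[start:start + 20],
    #                  y_train[start:start + 20], lr=0.1)

Running this same loop over NIST, NISTP, or P07 corresponds to the three training conditions (0, 1, 2) in the error-rate figure above.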