comparison writeup/aistats2011_submission.tex @ 602:203c6071e104

aistats submission looking good
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 31 Oct 2010 22:27:30 -0400
parents 1f5d2d01b84d
children eb6244c6d861
601:84cb106ef428 602:203c6071e104
1 %\documentclass[twoside,11pt]{article} % For LaTeX2e 1 %\documentclass[twoside,11pt]{article} % For LaTeX2e
2 \documentclass{article} % For LaTeX2e 2 \documentclass{article} % For LaTeX2e
3 \usepackage{aistats2e_2011} 3 \usepackage{aistats2e_2011}
4 \usepackage{times} 4 %\usepackage{times}
5 \usepackage{wrapfig} 5 \usepackage{wrapfig}
6 \usepackage{amsthm} 6 \usepackage{amsthm}
7 \usepackage{amsmath} 7 \usepackage{amsmath}
8 \usepackage{bbm} 8 \usepackage{bbm}
9 \usepackage[utf8]{inputenc} 9 \usepackage[utf8]{inputenc}
18 18
19 %\setlength\parindent{0mm} 19 %\setlength\parindent{0mm}
20 20
21 \begin{document} 21 \begin{document}
22 22
23 \title{Deeper Learners Benefit More from Multi-Task and Perturbed Examples} 23 \twocolumn[
24 \author{ 24 \aistatstitle{Deeper Learners Benefit More from Multi-Task and Perturbed Examples}
25 \runningtitle{Deep Learners for Out-of-Distribution Examples}
26 \runningauthor{Bengio et al.}
27 \aistatsauthor{Anonymous Authors}]
28 \iffalse
25 Yoshua Bengio \and 29 Yoshua Bengio \and
26 Frédéric Bastien \and 30 Frédéric Bastien \and
27 Arnaud Bergeron \and 31 Arnaud Bergeron \and
28 Nicolas Boulanger-Lewandowski \and 32 Nicolas Boulanger-Lewandowski \and
29 Thomas Breuel \and 33 Thomas Breuel \and
37 Sylvain Pannetier Lebeuf \and 41 Sylvain Pannetier Lebeuf \and
38 Razvan Pascanu \and 42 Razvan Pascanu \and
39 Salah Rifai \and 43 Salah Rifai \and
40 Francois Savard \and 44 Francois Savard \and
41 Guillaume Sicard 45 Guillaume Sicard
42 } 46 %}
43 \date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada} 47 \fi
48 %\aistatsaddress{Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
49 %\date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
44 %\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al} 50 %\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al}
45 %\editor{} 51 %\editor{}
46 52
47 %\makeanontitle 53 %\makeanontitle
48 \maketitle 54 %\maketitle
49 55
50 %{\bf Running title: Deep Self-Taught Learning} 56 %{\bf Running title: Deep Self-Taught Learning}
51 57
52 \vspace*{-2mm} 58 %\vspace*{-2mm}
53 \begin{abstract} 59 \begin{abstract}
54 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because 60 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because
55 they can be shared across tasks and examples from different but related 61 they can be shared across tasks and examples from different but related
56 distributions, can yield even more benefits where there are more such levels of representation. The experiments are performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits). We show that a deep learner could not only {\em beat previously published results but also reach human-level performance}. 62 distributions, can yield even more benefits where there are more such levels of representation. The experiments are performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits). We show that a deep learner could not only {\em beat previously published results but also reach human-level performance}.
57 \end{abstract} 63 \end{abstract}
58 \vspace*{-3mm} 64 %\vspace*{-3mm}
59 65
60 %\begin{keywords} 66 %\begin{keywords}
61 %Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning 67 %Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning
62 %\end{keywords} 68 %\end{keywords}
63 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} 69 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition}
64 70
65 71
66 72
67 \section{Introduction} 73 \section{Introduction}
68 \vspace*{-1mm} 74 %\vspace*{-1mm}
69 75
70 {\bf Deep Learning} has emerged as a promising new area of research in 76 {\bf Deep Learning} has emerged as a promising new area of research in
71 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review. 77 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
72 Learning algorithms for deep architectures are centered on the learning 78 Learning algorithms for deep architectures are centered on the learning
73 of useful representations of data, which are better suited to the task at hand, 79 of useful representations of data, which are better suited to the task at hand,
103 stochastic gradient descent. 109 stochastic gradient descent.
104 One of these layer initialization techniques, 110 One of these layer initialization techniques,
105 applied here, is the Denoising 111 applied here, is the Denoising
106 Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see 112 Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see
107 Figure~\ref{fig:da}), which performed similarly or 113 Figure~\ref{fig:da}), which performed similarly or
108 better~\citep{VincentPLarchelleH2008-very-small} than previously 114 better~\citep{VincentPLarochelleH2008-very-small} than previously
109 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06} 115 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
110 in terms of unsupervised extraction 116 in terms of unsupervised extraction
111 of a hierarchy of features useful for classification. Each layer is trained 117 of a hierarchy of features useful for classification. Each layer is trained
112 to denoise its input, creating a layer of features that can be used as 118 to denoise its input, creating a layer of features that can be used as
113 input for the next layer. Note that training a Denoising Auto-Encoder 119 input for the next layer. Note that training a Denoising Auto-Encoder
208 the more general question of why deep learners may benefit so much from 214 the more general question of why deep learners may benefit so much from
209 the self-taught learning framework. Since out-of-distribution data 215 the self-taught learning framework. Since out-of-distribution data
210 (perturbed or from other related classes) is very common, this conclusion 216 (perturbed or from other related classes) is very common, this conclusion
211 is of practical importance. 217 is of practical importance.
212 218
213 \vspace*{-3mm} 219 %\vspace*{-3mm}
214 %\newpage 220 %\newpage
215 \section{Perturbed and Transformed Character Images} 221 \section{Perturbed and Transformed Character Images}
216 \label{s:perturbations} 222 \label{s:perturbations}
217 \vspace*{-2mm} 223 %\vspace*{-2mm}
218 224
219 Figure~\ref{fig:transform} shows the different transformations we used to stochastically 225 Figure~\ref{fig:transform} shows the different transformations we used to stochastically
220 transform $32 \times 32$ source images (such as the one in Fig.~\ref{fig:torig}) 226 transform $32 \times 32$ source images (such as the one in Fig.~\ref{fig:torig})
221 in order to obtain data from a larger distribution which 227 in order to obtain data from a larger distribution which
222 covers a domain substantially larger than the clean characters distribution from 228 covers a domain substantially larger than the clean characters distribution from
232 There are two main parts in the pipeline. The first one, 238 There are two main parts in the pipeline. The first one,
233 from slant to pinch below, performs transformations. The second 239 from slant to pinch below, performs transformations. The second
234 part, from blur to contrast, adds different kinds of noise. 240 part, from blur to contrast, adds different kinds of noise.
235 More details can be found in~\citep{ift6266-tr-anonymous}. 241 More details can be found in~\citep{ift6266-tr-anonymous}.
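As a rough illustration of how such a pipeline can be organized (a sketch only; the module names follow Figure~\ref{fig:transform}, while the firing probability and per-module strength sampling below are assumptions rather than the actual implementation):
\begin{verbatim}
import random

# Module names follow the text: the first group ("from slant to pinch")
# transforms the character, the second ("from blur to contrast") adds
# noise.  The real pipeline contains more modules than listed here.
TRANSFORMATIONS = ["thickness", "slant", "affine", "pinch"]
NOISE_MODULES = ["blur", "contrast"]

def apply_module(image, name, amount):
    # Placeholder for the real module implementations (see the technical
    # report); maps a 32x32 grey-level image in [0,1] to a perturbed one.
    return image

def perturb(image, complexity, rng=random):
    # Transformations first, then noise; whether a module fires and how
    # strongly it perturbs are random (the probability and bound on the
    # amount below are illustrative assumptions).
    for name in TRANSFORMATIONS + NOISE_MODULES:
        if rng.random() < 0.5:
            amount = rng.uniform(0.0, complexity)
            image = apply_module(image, name, amount)
    return image
\end{verbatim}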
236 242
237 \begin{figure}[ht] 243 \begin{figure*}[ht]
238 \centering 244 \centering
239 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} 245 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}}
240 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} 246 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}}
241 \subfigure[Slant]{\includegraphics[scale=0.6]{images/Slant_only.png}} 247 \subfigure[Slant]{\includegraphics[scale=0.6]{images/Slant_only.png}}
242 \subfigure[Affine Transformation]{\includegraphics[scale=0.6]{images/Affine_only.png}} 248 \subfigure[Affine Transformation]{\includegraphics[scale=0.6]{images/Affine_only.png}}
255 \caption{Top left (a): example original image. Others (b-o): examples of the effect 261 \caption{Top left (a): example original image. Others (b-o): examples of the effect
256 of each transformation module taken separately. Actual perturbed examples are obtained by 262 of each transformation module taken separately. Actual perturbed examples are obtained by
257 a pipeline of these, with random choices about which module to apply and how much perturbation 263 a pipeline of these, with random choices about which module to apply and how much perturbation
258 to apply.} 264 to apply.}
259 \label{fig:transform} 265 \label{fig:transform}
260 \vspace*{-2mm} 266 %\vspace*{-2mm}
261 \end{figure} 267 \end{figure*}
262 268
263 \vspace*{-3mm} 269 %\vspace*{-3mm}
264 \section{Experimental Setup} 270 \section{Experimental Setup}
265 \vspace*{-1mm} 271 %\vspace*{-1mm}
266 272
267 Much previous work on deep learning had been performed on 273 Much previous work on deep learning had been performed on
268 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, 274 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
269 with 60~000 examples, and variants involving 10~000 275 with 60~000 examples, and variants involving 10~000
270 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}. 276 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008}.
271 The focus here is on much larger training sets, from 10 times to 277 The focus here is on much larger training sets, from 10 times to
272 1000 times larger, and 62 classes. 278 1000 times larger, and 62 classes.
273 279
274 The first step in constructing the larger datasets (called NISTP and P07) is to sample from 280 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
275 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, 281 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
299 example, and we were able to estimate the error variance due to this effect 305 example, and we were able to estimate the error variance due to this effect
300 because each image was classified by 3 different persons. 306 because each image was classified by 3 different persons.
301 The average error of humans on the 62-class task NIST test set 307 The average error of humans on the 62-class task NIST test set
302 is 18.2\%, with a standard error of 0.1\%. 308 is 18.2\%, with a standard error of 0.1\%.
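A minimal sketch of one way such an estimate can be obtained from the individual AMT judgments (the exact estimator is not specified here; pooling the three labelers per image, as below, is an assumption):
\begin{verbatim}
import numpy as np

def human_error_and_se(correct):
    # `correct` is an (n_images, 3) array of 0/1 judgments, one column
    # per labeler; each image's error is averaged over its 3 labelers,
    # and the standard error is then taken over images.
    per_image_error = 1.0 - correct.mean(axis=1)
    mean_error = per_image_error.mean()
    std_error = per_image_error.std(ddof=1) / np.sqrt(len(per_image_error))
    return mean_error, std_error
\end{verbatim}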
303 309
304 \vspace*{-3mm} 310 %\vspace*{-3mm}
305 \subsection{Data Sources} 311 \subsection{Data Sources}
306 \vspace*{-2mm} 312 %\vspace*{-2mm}
307 313
308 %\begin{itemize} 314 %\begin{itemize}
309 %\item 315 %\item
310 {\bf NIST.} 316 {\bf NIST.}
311 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995}, 317 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
334 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. 340 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
335 % TODO: pointless to anonymize, it's not pointing to our work 341 % TODO: pointless to anonymize, it's not pointing to our work
336 Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total from which we can choose uniformly. 342 Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total from which we can choose uniformly.
337 The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by producing a corresponding image, 343 The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by producing a corresponding image,
338 directly as input to our models. 344 directly as input to our models.
339 \vspace*{-1mm} 345 %\vspace*{-1mm}
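For illustration, a character image could be produced from a uniformly chosen {\tt ttf} file along the following lines (a sketch using PIL; the actual rendering parameters of the data source are not reproduced here):
\begin{verbatim}
import random
from PIL import Image, ImageDraw, ImageFont

def render_char(font_paths, char, size=32, rng=random):
    # Uniform choice among the available ttf files; the point size and
    # placement below are illustrative assumptions.
    path = rng.choice(font_paths)
    font = ImageFont.truetype(path, int(size * 0.8))
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), char, fill=255, font=font)
    # Return grey-level values in [0,1], as used by the data sets below.
    return [p / 255.0 for p in img.getdata()]
\end{verbatim}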
340 346
341 %\item 347 %\item
342 {\bf Captchas.} 348 {\bf Captchas.}
343 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for 349 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
344 generating characters of the same format as the NIST dataset. This software is based on 350 generating characters of the same format as the NIST dataset. This software is based on
345 a random character class generator and various kinds of transformations similar to those described in the previous sections. 351 a random character class generator and various kinds of transformations similar to those described in the previous sections.
346 In order to increase the variability of the data generated, many different fonts are used for generating the characters. 352 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
347 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity 353 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
348 depending on the value of the complexity parameter provided by the user of the data source. 354 depending on the value of the complexity parameter provided by the user of the data source.
349 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class? 355 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
350 \vspace*{-1mm} 356 %\vspace*{-1mm}
351 357
352 %\item 358 %\item
353 {\bf OCR data.} 359 {\bf OCR data.}
354 A large set (2 million) of scanned, OCRed and manually verified machine-printed 360 A large set (2 million) of scanned, OCRed and manually verified machine-printed
355 characters were included as an 361 characters were included as an
357 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern 363 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
358 ({\tt http://www.iupr.com}), and which will be publicly released. 364 ({\tt http://www.iupr.com}), and which will be publicly released.
359 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this 365 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
360 %\end{itemize} 366 %\end{itemize}
361 367
362 \vspace*{-3mm} 368 %\vspace*{-3mm}
363 \subsection{Data Sets} 369 \subsection{Data Sets}
364 \vspace*{-2mm} 370 %\vspace*{-2mm}
365 371
366 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label 372 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
367 from one of the 62 character classes. 373 from one of the 62 character classes.
368 %\begin{itemize} 374 %\begin{itemize}
369 \vspace*{-1mm} 375 %\vspace*{-1mm}
370 376
371 %\item 377 %\item
372 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has 378 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
373 \{651668 / 80000 / 82587\} \{training / validation / test\} examples. 379 \{651668 / 80000 / 82587\} \{training / validation / test\} examples.
374 \vspace*{-1mm} 380 %\vspace*{-1mm}
375 381
376 %\item 382 %\item
377 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources 383 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
378 and sending them through the transformation pipeline described in section \ref{s:perturbations}. 384 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
379 To generate each new example, a data source is selected with probability $10\%$ from the fonts, 385 To generate each new example, a data source is selected with probability $10\%$ from the fonts,
380 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the 386 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
381 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. 387 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
382 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. 388 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
383 \vspace*{-1mm} 389 %\vspace*{-1mm}
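In pseudocode terms, drawing one P07 example looks as follows (a sketch only; the two helper functions stand for the data sources above and the perturbation pipeline of Section~\ref{s:perturbations}):
\begin{verbatim}
import random

SOURCES = {"fonts": 0.10, "captchas": 0.25, "ocr": 0.25, "nist": 0.40}

def sample_p07_example(draw_from_source, pipeline, rng=random):
    # `draw_from_source(name)` yields a raw (image, label) pair from the
    # named source; `pipeline(image, max_complexity)` applies all modules
    # in order, sampling a complexity uniformly in [0, max_complexity]
    # for each of them.  Both are placeholders.
    names, weights = zip(*SOURCES.items())
    source = rng.choices(names, weights=weights, k=1)[0]
    image, label = draw_from_source(source)
    return pipeline(image, 0.7), label
\end{verbatim}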
384 390
385 %\item 391 %\item
386 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) 392 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
387 except that we only apply 393 except that we only apply
388 transformations from slant to pinch. Therefore, the character is 394 transformations from slant to pinch. Therefore, the character is
389 transformed but no additional noise is added to the image, giving images 395 transformed but no additional noise is added to the image, giving images
390 closer to the NIST dataset. 396 closer to the NIST dataset.
391 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. 397 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
392 %\end{itemize} 398 %\end{itemize}
393 399
394 \vspace*{-3mm} 400 \begin{figure*}[ht]
401 %\vspace*{-2mm}
402 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
403 %\vspace*{-2mm}
404 \caption{Illustration of the computations and training criterion for the denoising
405 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
406 the layer (i.e. raw input or output of previous layer)
407 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
408 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
409 is compared to the uncorrupted input $x$ through the loss function
410 $L_H(x,z)$, whose expected value is approximately minimized during training
411 by tuning $\theta$ and $\theta'$.}
412 \label{fig:da}
413 %\vspace*{-2mm}
414 \end{figure*}
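Spelled out, and assuming the usual instantiation of the denoising auto-encoder of \citet{VincentPLarochelleH2008} (an affine-plus-sigmoid encoder and decoder and a cross-entropy reconstruction loss are assumed here, as in that work), the computations of Figure~\ref{fig:da} are
\begin{align*}
\tilde{x} &\sim q(\tilde{x} \mid x) && \text{(stochastic corruption of the input)}\\
y &= f_\theta(\tilde{x}) = \mathrm{sigm}(W \tilde{x} + b) && \text{(encoder)}\\
z &= g_{\theta'}(y) = \mathrm{sigm}(W' y + b') && \text{(decoder)}\\
L_H(x,z) &= -\textstyle\sum_i \left[ x_i \log z_i + (1-x_i)\log(1-z_i) \right] && \text{(reconstruction loss)}
\end{align*}
with $\theta=(W,b)$ and $\theta'=(W',b')$ tuned by stochastic gradient descent to approximately minimize the expected loss over the training data and the corruption process.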
415
416 %\vspace*{-3mm}
395 \subsection{Models and their Hyperparameters} 417 \subsection{Models and their Hyperparameters}
396 \vspace*{-2mm} 418 %\vspace*{-2mm}
397 419
398 The experiments are performed using MLPs (with a single 420 The experiments are performed using MLPs (with a single
399 hidden layer) and SDAs. 421 hidden layer) and SDAs.
400 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} 422 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
401 423
414 Training examples are presented in minibatches of size 20. A constant learning 436 Training examples are presented in minibatches of size 20. A constant learning
415 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. 437 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
416 %through preliminary experiments (measuring performance on a validation set), 438 %through preliminary experiments (measuring performance on a validation set),
417 %and $0.1$ (which was found to work best) was then selected for optimizing on 439 %and $0.1$ (which was found to work best) was then selected for optimizing on
418 %the whole training sets. 440 %the whole training sets.
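A sketch of the corresponding training loop (constant learning rate, minibatches of 20; the gradient computation itself is a placeholder for the MLP's backpropagation):
\begin{verbatim}
import numpy as np

LEARNING_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]  # candidate grid
BATCH_SIZE = 20

def sgd_epoch(params, grad_fn, X, y, lr, rng=np.random):
    # One epoch of constant-learning-rate minibatch SGD; `grad_fn` returns
    # the gradient of the training loss w.r.t. each parameter array.
    order = rng.permutation(len(X))
    for start in range(0, len(X), BATCH_SIZE):
        idx = order[start:start + BATCH_SIZE]
        for p, g in zip(params, grad_fn(params, X[idx], y[idx])):
            p -= lr * g
    return params

# The learning rate is a hyper-parameter, e.g. selected by training one
# model per value in LEARNING_RATES and keeping the one with the lowest
# NISTP validation error (selection protocol assumed).
\end{verbatim}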
419 \vspace*{-1mm} 441 %\vspace*{-1mm}
420 442
421 443
422 {\bf Stacked Denoising Auto-Encoders (SDA).} 444 {\bf Stacked Denoising Auto-Encoders (SDA).}
423 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) 445 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
424 can be used to initialize the weights of each layer of a deep MLP (with many hidden 446 can be used to initialize the weights of each layer of a deep MLP (with many hidden
435 distribution $P(x)$ and the conditional distribution of interest 457 distribution $P(x)$ and the conditional distribution of interest
436 $P(y|x)$ (like in semi-supervised learning), and on the other hand 458 $P(y|x)$ (like in semi-supervised learning), and on the other hand
437 taking advantage of the expressive power and bias implicit in the 459 taking advantage of the expressive power and bias implicit in the
438 deep architecture (whereby complex concepts are expressed as 460 deep architecture (whereby complex concepts are expressed as
439 compositions of simpler ones through a deep hierarchy). 461 compositions of simpler ones through a deep hierarchy).
440
441 \begin{figure}[ht]
442 \vspace*{-2mm}
443 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
444 \vspace*{-2mm}
445 \caption{Illustration of the computations and training criterion for the denoising
446 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
447 the layer (i.e. raw input or output of previous layer)
448 s corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
449 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
450 is compared to the uncorrupted input $x$ through the loss function
451 $L_H(x,z)$, whose expected value is approximately minimized during training
452 by tuning $\theta$ and $\theta'$.}
453 \label{fig:da}
454 \vspace*{-2mm}
455 \end{figure}
456 462
457 Here we chose to use the Denoising 463 Here we chose to use the Denoising
458 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for 464 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
459 these deep hierarchies of features, as it is simple to train and 465 these deep hierarchies of features, as it is simple to train and
460 explain (see Figure~\ref{fig:da}, as well as 466 explain (see Figure~\ref{fig:da}, as well as
483 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The size of the hidden 489 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The size of the hidden
484 layers was kept constant across layers, and the best results 490 layers was kept constant across layers, and the best results
485 were obtained with the largest value our patience allowed, 491 were obtained with the largest value our patience allowed,
486 namely 1000 hidden units. 492 namely 1000 hidden units.
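Schematically, the greedy layer-wise procedure reads as follows (a sketch with placeholder training routines, not the actual code):
\begin{verbatim}
def pretrain_and_finetune(layers, unsup_data, sup_data,
                          train_dae, finetune):
    # Greedy layer-wise unsupervised pre-training with denoising
    # auto-encoders, followed by supervised fine-tuning of the whole
    # deep MLP; `train_dae` and `finetune` are placeholders for the
    # routines described in the text.
    inputs = unsup_data
    for layer in layers:
        train_dae(layer, inputs)       # learn to denoise this layer's input
        inputs = layer.encode(inputs)  # its code feeds the next layer
    finetune(layers, sup_data)         # supervised fine-tuning of the stack
    return layers
\end{verbatim}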
487 493
488 \vspace*{-1mm} 494 %\vspace*{-1mm}
489 495
490 \begin{figure}[ht] 496 \begin{figure*}[ht]
491 %\vspace*{-2mm} 497 %\vspace*{-2mm}
492 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} 498 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
493 %\vspace*{-3mm} 499 %\vspace*{-3mm}
494 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained 500 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
495 on NIST, 1 on NISTP, and 2 on P07. Left: overall results 501 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
496 of all models, on NIST and NISTP test sets. 502 of all models, on NIST and NISTP test sets.
497 Right: error rates on NIST test digits only, along with the previous results from 503 Right: error rates on NIST test digits only, along with the previous results from
498 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 504 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
499 respectively based on ART, nearest neighbors, MLPs, and SVMs.} 505 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
500 \label{fig:error-rates-charts} 506 \label{fig:error-rates-charts}
501 \vspace*{-2mm} 507 %\vspace*{-2mm}
502 \end{figure} 508 \end{figure*}
503 509
504 510
505 \begin{figure}[ht] 511 \begin{figure*}[ht]
506 \vspace*{-3mm} 512 \vspace*{-3mm}
507 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} 513 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
508 \vspace*{-3mm} 514 \vspace*{-3mm}
509 \caption{Relative improvement in error rate due to self-taught learning. 515 \caption{Relative improvement in error rate due to self-taught learning.
510 Left: Improvement (or loss, when negative) 516 Left: Improvement (or loss, when negative)
513 learning (training on all classes and testing only on either digits, 519 learning (training on all classes and testing only on either digits,
514 upper case, or lower-case). The deep learner (SDA) benefits more from 520 upper case, or lower-case). The deep learner (SDA) benefits more from
515 both self-taught learning scenarios, compared to the shallow MLP.} 521 both self-taught learning scenarios, compared to the shallow MLP.}
516 \label{fig:improvements-charts} 522 \label{fig:improvements-charts}
517 \vspace*{-2mm} 523 \vspace*{-2mm}
518 \end{figure} 524 \end{figure*}
519 525
526 \vspace*{-2mm}
520 \section{Experimental Results} 527 \section{Experimental Results}
521 \vspace*{-2mm} 528 \vspace*{-2mm}
522 529
523 %%\vspace*{-1mm} 530 %%\vspace*{-1mm}
524 %\subsection{SDA vs MLP vs Humans} 531 %\subsection{SDA vs MLP vs Humans}
693 does not allow the shallow or purely supervised models to discover 700 does not allow the shallow or purely supervised models to discover
694 the kind of better basins associated 701 the kind of better basins associated
695 with deep learning and self-taught learning. 702 with deep learning and self-taught learning.
696 703
697 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 704 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
698 can be executed on-line at {\tt http://deep.host22.com}. 705 can be executed on-line at the anonymous site {\tt http://deep.host22.com}.
699 706
700 \iffalse 707 \iffalse
701 \section*{Appendix I: Detailed Numerical Results} 708 \section*{Appendix I: Detailed Numerical Results}
702 709
703 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered. 710 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered.