# HG changeset patch
# User Yoshua Bengio
# Date 1288578450 14400
# Node ID 203c6071e104940a42e0b23eecc5d137a8ef8ce7
# Parent  84cb106ef42892b421a8ad3bc99f67e7c1bfe06c
aistats submission looking good

diff -r 84cb106ef428 -r 203c6071e104 writeup/aistats2011_submission.tex
--- a/writeup/aistats2011_submission.tex	Sun Oct 31 09:12:06 2010 -0400
+++ b/writeup/aistats2011_submission.tex	Sun Oct 31 22:27:30 2010 -0400
@@ -1,7 +1,7 @@
 %\documentclass[twoside,11pt]{article} % For LaTeX2e
 \documentclass{article} % For LaTeX2e
 \usepackage{aistats2e_2011}
-\usepackage{times}
+%\usepackage{times}
 \usepackage{wrapfig}
 \usepackage{amsthm}
 \usepackage{amsmath}
@@ -20,8 +20,12 @@
 \begin{document}
-\title{Deeper Learners Benefit More from Multi-Task and Perturbed Examples}
-\author{
+\twocolumn[
+\aistatstitle{Deeper Learners Benefit More from Multi-Task and Perturbed Examples}
+\runningtitle{Deep Learners for Out-of-Distribution Examples}
+\runningauthor{Bengio et al.}
+\aistatsauthor{Anonymous Authors}]
+\iffalse
 Yoshua Bengio \and
 Frédéric Bastien \and
 Arnaud Bergeron \and
@@ -39,23 +43,25 @@
 Salah Rifai \and
 Francois Savard \and
 Guillaume Sicard
-}
-\date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
+%}
+\fi
+%\aistatsaddress{Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
+%\date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
 %\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al}
 %\editor{}
 %\makeanontitle
-\maketitle
+%\maketitle
 %{\bf Running title: Deep Self-Taught Learning}
-\vspace*{-2mm}
+%\vspace*{-2mm}
 \begin{abstract}
   Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning
   algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation.
   The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks
   and examples from different but related distributions, can yield even more benefits where there are more such levels
   of representation. The experiments are performed on a large-scale handwritten character recognition setting with
   62 classes (upper case, lower case, digits). We show that a deep learner could not only
   {\em beat previously published results but also reach human-level performance}.
 \end{abstract}
-\vspace*{-3mm}
+%\vspace*{-3mm}
 %\begin{keywords}
 %Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning
@@ -65,7 +71,7 @@
 \section{Introduction}
-\vspace*{-1mm}
+%\vspace*{-1mm}
 {\bf Deep Learning} has emerged as a promising new area of research in
 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}.
 See \citet{Bengio-2009} for a review.
@@ -105,7 +111,7 @@
 applied here, is the Denoising
 Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see Figure~\ref{fig:da}), which
 performed similarly or
-better~\citep{VincentPLarchelleH2008-very-small} than previously
+better~\citep{VincentPLarochelleH2008-very-small} than previously
 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
 in terms of unsupervised extraction of a hierarchy of features useful for
 classification. Each layer is trained
@@ -210,11 +216,11 @@
 (perturbed or from other related classes) is very common, this conclusion
 is of practical importance.
-\vspace*{-3mm}
+%\vspace*{-3mm}
 %\newpage
 \section{Perturbed and Transformed Character Images}
 \label{s:perturbations}
-\vspace*{-2mm}
+%\vspace*{-2mm}
 Figure~\ref{fig:transform} shows the different transformations we used to stochastically
 transform $32 \times 32$ source images (such as the one in Fig.\ref{fig:torig})
@@ -234,7 +240,7 @@
 part, from blur to contrast, adds different kinds of noise.
 More details can be found in~\citep{ift6266-tr-anonymous}.
-\begin{figure}[ht]
+\begin{figure*}[ht]
 \centering
 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}}
 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}}
@@ -257,17 +263,17 @@
 a pipeline of these, with random choices about which module to apply and how
 much perturbation to apply.}
 \label{fig:transform}
-\vspace*{-2mm}
-\end{figure}
+%\vspace*{-2mm}
+\end{figure*}
-\vspace*{-3mm}
+%\vspace*{-3mm}
 \section{Experimental Setup}
-\vspace*{-1mm}
+%\vspace*{-1mm}
 Much previous work on deep learning had been performed on
 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
 with 60~000 examples, and variants involving 10~000
-examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
+examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008}.
 The focus here is on much larger training sets, from 10 times
 to 1000 times larger, and 62 classes.
@@ -301,9 +307,9 @@
 The average error of humans on the 62-class task NIST test set
 is 18.2\%, with a standard error of 0.1\%.
-\vspace*{-3mm}
+%\vspace*{-3mm}
 \subsection{Data Sources}
-\vspace*{-2mm}
+%\vspace*{-2mm}
 %\begin{itemize}
 %\item
@@ -336,7 +342,7 @@
 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts
 that we can choose uniformly from.
 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item)
 or, by producing a corresponding image, directly as input to our models.
-\vspace*{-1mm}
+%\vspace*{-1mm}
 %\item
 {\bf Captchas.}
@@ -347,7 +353,7 @@
 Transformations (slant, distortions, rotation, translation) are applied to each randomly
 generated character with a complexity depending on the value of the complexity parameter
 provided by the user of the data source.
 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
-\vspace*{-1mm}
+%\vspace*{-1mm}
 %\item
 {\bf OCR data.}
@@ -359,19 +365,19 @@
 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
 %\end{itemize}
-\vspace*{-3mm}
+%\vspace*{-3mm}
 \subsection{Data Sets}
-\vspace*{-2mm}
+%\vspace*{-2mm}
 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
 from one of the 62 character classes.
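
The stochastic perturbation pipeline of the "Perturbed and Transformed Character Images" section can be summarized compactly. The composition notation below is introduced here only for illustration and is not the paper's own: $T_1,\dots,T_K$ stand for the transformation and noise modules of Figure~\ref{fig:transform}, applied in the order given there, each with its own sampled complexity; the uniform range $[0,0.7]$ is the one stated for the P07 set described below, where all modules are applied.

\begin{equation*}
  \tilde{x} \;=\; \bigl( T_K(\,\cdot\,; c_K) \circ \cdots \circ T_1(\,\cdot\,; c_1) \bigr)(x),
  \qquad c_k \sim \mathcal{U}[0,\,0.7],
\end{equation*}

where $x$ is a $32 \times 32$ source image and each $c_k$ is the per-module \emph{complexity} controlling how much perturbation that module applies.
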
 %\begin{itemize}
-\vspace*{-1mm}
+%\vspace*{-1mm}
 %\item
 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
 \{651668 / 80000 / 82587\} \{training / validation / test\} examples.
-\vspace*{-1mm}
+%\vspace*{-1mm}
 %\item
 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
@@ -380,7 +386,7 @@
 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST.
 We apply all the transformations in the order given above, and for each of them
 we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
-\vspace*{-1mm}
+%\vspace*{-1mm}
 %\item
 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
@@ -391,9 +397,25 @@
 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 %\end{itemize}
-\vspace*{-3mm}
+\begin{figure*}[ht]
+%\vspace*{-2mm}
+\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
+%\vspace*{-2mm}
+\caption{Illustration of the computations and training criterion for the denoising
+auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
+the layer (i.e. raw input or output of previous layer)
+is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
+The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
+is compared to the uncorrupted input $x$ through the loss function
+$L_H(x,z)$, whose expected value is approximately minimized during training
+by tuning $\theta$ and $\theta'$.}
+\label{fig:da}
+%\vspace*{-2mm}
+\end{figure*}
+
+%\vspace*{-3mm}
 \subsection{Models and their Hyperparameters}
-\vspace*{-2mm}
+%\vspace*{-2mm}
 The experiments are performed using MLPs (with a single
 hidden layer) and SDAs.
@@ -416,7 +438,7 @@
 %through preliminary experiments (measuring performance on a validation set),
 %and $0.1$ (which was found to work best) was then selected for optimizing on
 %the whole training sets.
-\vspace*{-1mm}
+%\vspace*{-1mm}
 {\bf Stacked Denoising Auto-Encoders (SDA).}
@@ -438,22 +460,6 @@
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
-\begin{figure}[ht]
-\vspace*{-2mm}
-\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
-\vspace*{-2mm}
-\caption{Illustration of the computations and training criterion for the denoising
-auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
-the layer (i.e. raw input or output of previous layer)
-s corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
-The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
-is compared to the uncorrupted input $x$ through the loss function
-$L_H(x,z)$, whose expected value is approximately minimized during training
-by tuning $\theta$ and $\theta'$.}
-\label{fig:da}
-\vspace*{-2mm}
-\end{figure}
-
 Here we chose to use the Denoising
 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
 these deep hierarchies of features, as it is simple to train and
@@ -485,9 +491,9 @@
 were obtained with the largest values that we could experiment
 with given our patience, with 1000 hidden units.
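
For concreteness, the computations named in the caption of Figure~\ref{fig:da} can be written out as follows. This is a sketch based on the denoising auto-encoder formulation the paper cites (VincentPLarochelleH2008), with the usual affine-plus-sigmoid encoder and decoder and the cross-entropy reconstruction loss appropriate for inputs in $[0,1]$; the paper's exact parameterization may differ.

\begin{align*}
  \tilde{x} &\sim q(\tilde{x} \mid x) && \text{stochastic corruption of the input } x\\
  y &= f_\theta(\tilde{x}) = \operatorname{sigmoid}(W \tilde{x} + b) && \text{encoder, } \theta = (W, b)\\
  z &= g_{\theta'}(y) = \operatorname{sigmoid}(W' y + b') && \text{decoder / reconstruction, } \theta' = (W', b')\\
  L_H(x, z) &= -\sum_i \bigl[ x_i \log z_i + (1 - x_i) \log (1 - z_i) \bigr] && \text{reconstruction loss}
\end{align*}

Training a layer tunes $\theta$ and $\theta'$ to approximately minimize the expected value of $L_H(x,z)$; in the usual stacked (SDA) setting, the code $y$ computed from the uncorrupted input then serves as the input of the next layer's auto-encoder.
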
-\vspace*{-1mm}
+%\vspace*{-1mm}
-\begin{figure}[ht]
+\begin{figure*}[ht]
 %\vspace*{-2mm}
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
 %\vspace*{-3mm}
@@ -498,11 +504,11 @@
 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
 \label{fig:error-rates-charts}
-\vspace*{-2mm}
-\end{figure}
+%\vspace*{-2mm}
+\end{figure*}
-\begin{figure}[ht]
+\begin{figure*}[ht]
 \vspace*{-3mm}
 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
 \vspace*{-3mm}
@@ -515,8 +521,9 @@
 both self-taught learning scenarios, compared to the shallow MLP.}
 \label{fig:improvements-charts}
 \vspace*{-2mm}
-\end{figure}
+\end{figure*}
+\vspace*{-2mm}
 \section{Experimental Results}
 \vspace*{-2mm}
@@ -695,7 +702,7 @@
 with deep learning and self-taught learning.
 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
-can be executed on-line at {\tt http://deep.host22.com}.
+can be executed on-line at the anonymous site {\tt http://deep.host22.com}.
 \iffalse
 \section*{Appendix I: Detailed Numerical Results}