comparison writeup/nips2010_cameraready.tex @ 607:d840139444fe

NIPS workshop spotlight
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Fri, 26 Nov 2010 17:41:43 -0500
parents bd7d11089a47
children
comparing 606:bd7d11089a47 with 607:d840139444fe
@@ -6,15 +6,15 @@
 \usepackage{algorithm,algorithmic}
 \usepackage[utf8]{inputenc}
 \usepackage{graphicx,subfigure}
 \usepackage[numbers]{natbib}
 
-\addtolength{\textwidth}{20mm}
-\addtolength{\textheight}{20mm}
-\addtolength{\topmargin}{-10mm}
-\addtolength{\evensidemargin}{-10mm}
-\addtolength{\oddsidemargin}{-10mm}
+\addtolength{\textwidth}{10mm}
+\addtolength{\textheight}{10mm}
+\addtolength{\topmargin}{-5mm}
+\addtolength{\evensidemargin}{-5mm}
+\addtolength{\oddsidemargin}{-5mm}
 
 %\setlength\parindent{0mm}
 
 \title{Deep Self-Taught Learning for Handwritten Character Recognition}
 \author{
@@ -70,11 +70,12 @@
 
 \section{Introduction}
 \vspace*{-1mm}
 
 {\bf Deep Learning} has emerged as a promising new area of research in
-statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008-very-small,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
+statistical machine learning~\citep{Hinton06}
+(see \citet{Bengio-2009} for a review).
 Learning algorithms for deep architectures are centered on the learning
 of useful representations of data, which are better suited to the task at hand,
 and are organized in a hierarchy with multiple levels.
 This is in part inspired by observations of the mammalian visual cortex,
 which consists of a chain of processing elements, each of which is associated with a
@@ -105,40 +106,10 @@
 advantage} of deep learning for these settings has not been evaluated.
 The hypothesis discussed in the conclusion is that a deep hierarchy of features
 may be better able to provide sharing of statistical strength
 between different regions in input space or different tasks.
 
-\iffalse
-Whereas a deep architecture can in principle be more powerful than a
-shallow one in terms of representation, depth appears to render the
-training problem more difficult in terms of optimization and local minima.
-It is also only recently that successful algorithms were proposed to
-overcome some of these difficulties. All are based on unsupervised
-learning, often in an greedy layer-wise ``unsupervised pre-training''
-stage~\citep{Bengio-2009}.
-The principle is that each layer starting from
-the bottom is trained to represent its input (the output of the previous
-layer). After this
-unsupervised initialization, the stack of layers can be
-converted into a deep supervised feedforward neural network and fine-tuned by
-stochastic gradient descent.
-One of these layer initialization techniques,
-applied here, is the Denoising
-Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
-Figure~\ref{fig:da}), which performed similarly or
-better~\citep{VincentPLarochelleH2008-very-small} than previously
-proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
-in terms of unsupervised extraction
-of a hierarchy of features useful for classification. Each layer is trained
-to denoise its input, creating a layer of features that can be used as
-input for the next layer, forming a Stacked Denoising Auto-encoder (SDA).
-Note that training a Denoising Auto-encoder
-can actually been seen as training a particular RBM by an inductive
-principle different from maximum likelihood~\citep{Vincent-SM-2010},
-namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
-\fi
-
 Previous comparative experimental results with stacking of RBMs and DAs
 to build deep supervised predictors had shown that they could outperform
 shallow architectures in a variety of settings, especially
 when the data involves complex interactions between many factors of
 variation~\citep{LarochelleH2007,Bengio-2009}. Other experiments have suggested
@@ -158,22 +129,22 @@
 and noises, here). This is consistent with the hypotheses discussed
 in~\citet{Bengio-2009} regarding the potential advantage
 of deep learning and the idea that more levels of representation can
 give rise to more abstract, more general features of the raw input.
 
-This hypothesis is related to a learning setting called
-{\bf self-taught learning}~\citep{RainaR2007}, which combines principles
+This hypothesis is related to the
+{\bf self-taught learning} setting~\citep{RainaR2007}, which combines principles
 of semi-supervised and multi-task learning: the learner can exploit examples
 that are unlabeled and possibly come from a distribution different from the target
-distribution, e.g., from other classes than those of interest.
-It has already been shown that deep learners can clearly take advantage of
+distribution, e.g., from classes other than those of interest.
+It has already been shown that deep learners can take advantage of
 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
 but more needed to be done to explore the impact
 of {\em out-of-distribution} examples and of the {\em multi-task} setting
 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised
 pre-training only with the first layer). In particular the {\em relative
-advantage of deep learning} for these settings has not been evaluated.
+advantage of deep learning} for these settings had not been evaluated.
 
 
 %
 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
 {\bf benefit more from out-of-distribution examples than shallow learners} (with a single
@@ -224,11 +195,11 @@
 %%\newpage
 \section{Perturbed and Transformed Character Images}
 \label{s:perturbations}
 \vspace*{-2mm}
 
-\begin{minipage}[h]{\linewidth}
+%\begin{minipage}[h]{\linewidth}
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
 %\begin{minipage}[b]{0.14\linewidth}
 \vspace*{-5mm}
 \begin{center}
 \includegraphics[scale=.4]{images/Original.png}\\
@@ -249,18 +220,18 @@
 in the complexity of the learning task.
 More details can
 be found in this technical report~\citep{ARXIV-2010}.
 The code for these transformations (mostly python) is available at
 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
-a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
-amount of deformation or noise introduced.
+a global control parameter ($0 \le complexity \le 1$) modulating the
+amount of deformation or noise.
 There are two main parts in the pipeline. The first one,
 from thickness to pinch, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
-\end{minipage}
+%\end{minipage}
 
-\newpage
+%\newpage
 \vspace*{1mm}
 %\subsection{Transformations}
 {\large\bf 2.1 Transformations}
 \vspace*{1mm}
 
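To make the pipeline's contract concrete, here is a minimal Python sketch of
the two-part structure described above. The module bodies and names below are
illustrative stand-ins, not the actual transformations from
{\tt http://hg.assembla.com/ift6266}; only the shared global parameter
$0 \le complexity \le 1$ and the transformations-then-noise ordering are taken
from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    def pinch(img, complexity):
        # Stand-in for a module from the first part of the pipeline;
        # a real module would geometrically warp the character.
        return np.clip(img * (1.0 - 0.1 * complexity), 0.0, 1.0)

    def contrast(img, complexity):
        # Stand-in for a noise module from the second part: compress
        # the dynamic range by an amount that grows with complexity.
        low = 0.5 * complexity * rng.random()
        return low + (1.0 - low) * img

    # Transformations are applied first, then noise, with every module
    # reading the same global complexity parameter.
    TRANSFORMATIONS = [pinch]
    NOISES = [contrast]

    def perturb(img, complexity=0.5):
        for module in TRANSFORMATIONS + NOISES:
            img = module(img, complexity)
        return img

    example = perturb(rng.random((32, 32)), complexity=0.7)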
@@ -402,30 +373,35 @@
 
 \vspace{1mm}
 
 {\large\bf 2.2 Injecting Noise}
 %\subsection{Injecting Noise}
-\vspace{2mm}
+%\vspace{2mm}
 
 \begin{minipage}[h]{\linewidth}
 %\vspace*{-.2cm}
-\begin{minipage}[t]{0.14\linewidth}
-\centering
-\vspace*{-2mm}
+%\begin{minipage}[t]{0.14\linewidth}
+\begin{wrapfigure}[8]{l}{0.15\textwidth}
+\begin{center}
+\vspace*{-5mm}
+%\vspace*{-2mm}
 \includegraphics[scale=.4]{images/Motionblur_only.png}\\
 {\bf Motion Blur}
-\end{minipage}%
-\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
+%\end{minipage}%
+\end{center}
+\end{wrapfigure}
+%\hspace{0.3cm}
+%\begin{minipage}[t]{0.83\linewidth}
 %\vspace*{.5mm}
 The {\bf motion blur} module is GIMP's ``linear motion blur'', which
 has parameters $length$ and $angle$. The value of
 a pixel in the final image is approximately the mean of the first $length$ pixels
 found by moving in the $angle$ direction,
 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
 \vspace{5mm}
 \end{minipage}
-\end{minipage}
+%\end{minipage}
 
 \vspace*{1mm}
 
 \begin{minipage}[h]{\linewidth}
 \begin{minipage}[t]{0.14\linewidth}
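Given those two parameter distributions, a rough numpy reimplementation of the
sampling and averaging step is sketched below. The paper delegates to GIMP's
filter, so this version is only an approximation; taking the absolute value of
the sampled $length$ is an added assumption (the filter needs a nonnegative
length).

    import numpy as np

    def motion_blur(img, complexity, rng=np.random.default_rng(0)):
        # Sample the module's parameters as described in the text.
        angle = rng.uniform(0.0, 360.0)                  # degrees
        length = abs(rng.normal(0.0, 3.0 * complexity))  # std dev = 3*complexity
        steps = int(round(length))
        if steps < 1:
            return img
        dy, dx = np.sin(np.deg2rad(angle)), np.cos(np.deg2rad(angle))
        h, w = img.shape
        ys, xs = np.mgrid[0:h, 0:w]
        out = np.zeros_like(img, dtype=float)
        # Each output pixel is the mean of the first `steps` pixels
        # encountered while moving in the `angle` direction.
        for k in range(steps):
            yk = np.clip(np.round(ys + k * dy).astype(int), 0, h - 1)
            xk = np.clip(np.round(xs + k * dx).astype(int), 0, w - 1)
            out += img[yk, xk]
        return out / steps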
@@ -441,20 +417,20 @@
 image. Pixels are combined by taking the max(occluder, occluded),
 i.e. keeping the lighter ones.
 The rectangle corners
 are sampled so that larger complexity gives larger rectangles.
-The destination position in the occluded image are also sampled
-according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
+The destination position in the occluded image is also sampled
+according to a normal distribution (more details in~\citet{ARXIV-2010}).
 This module is skipped with probability 60\%.
 %\vspace{7mm}
 \end{minipage}
 \end{minipage}
 
 \vspace*{1mm}
 
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
-\vspace*{-6mm}
+\vspace*{-3mm}
 \begin{center}
 %\begin{minipage}[t]{0.14\linewidth}
 %\centering
 \includegraphics[scale=.4]{images/Bruitgauss_only.png}\\
 {\bf Gaussian Smoothing}
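A sketch of the occlusion logic follows. Only the max combination rule, the
complexity scaling of rectangle size, the normal sampling of the destination,
and the 60\% skip probability come from the text; the exact distributions are
in the technical report, so the numeric constants below are illustrative
stand-ins.

    import numpy as np

    def occlusion(occluded, occluder, complexity, rng=np.random.default_rng(0)):
        if rng.random() < 0.6:        # module skipped with probability 60%
            return occluded
        h, w = occluded.shape
        # Larger complexity gives larger rectangles (illustrative scaling).
        rh = max(1, int(h * rng.uniform(0.2, 0.2 + 0.6 * complexity)))
        rw = max(1, int(w * rng.uniform(0.2, 0.2 + 0.6 * complexity)))
        top = rng.integers(0, h - rh + 1)
        left = rng.integers(0, w - rw + 1)
        patch = occluder[top:top + rh, left:left + rw]
        # Destination position sampled from a normal around the centre.
        y0 = int(np.clip((h - rh) // 2 + rng.normal(0, h / 8), 0, h - rh))
        x0 = int(np.clip((w - rw) // 2 + rng.normal(0, w / 8), 0, w - rw))
        out = occluded.copy()
        out[y0:y0 + rh, x0:x0 + rw] = np.maximum(
            out[y0:y0 + rh, x0:x0 + rw], patch)   # keep the lighter pixels
        return out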
@@ -480,11 +456,11 @@
 This module is skipped with probability 75\%.
 %\end{minipage}
 
 %\newpage
 
-\vspace*{-9mm}
+\vspace*{1mm}
 
 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
 %\centering
 \begin{minipage}[t]{\linewidth}
 \begin{wrapfigure}[7]{l}{0.15\textwidth}
@@ -620,13 +596,13 @@
 \vspace*{-3mm}
 \section{Experimental Setup}
 \vspace*{-1mm}
 
 Much previous work on deep learning had been performed on
-the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
+the MNIST digits task,
 with 60~000 examples, and variants involving 10~000
-examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
-The focus here is on much larger training sets, from 10 times to
-to 1000 times larger, and 62 classes.
+examples~\citep{VincentPLarochelleH2008-very-small}.
+The focus here is on much larger training sets, from 10 times
+to 1000 times larger, and 62 classes.
 
 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
@@ -784,11 +760,11 @@
 
 
 {\bf Stacked Denoising Auto-encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many hidden
-layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
+layers),
 apparently setting parameters in the
 basin of attraction of supervised gradient descent yielding better
 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
 pre-training phase} uses all of the training images but not the training labels.
 Each layer is trained in turn to produce a new representation of its input
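As a concrete picture of this layer-wise procedure, here is a minimal numpy
sketch of one denoising auto-encoder layer and the greedy pre-training loop,
assuming sigmoid units, masking corruption, and tied weights; the class name,
hyperparameters, and layer sizes are illustrative, and the paper points to
{\tt http://deeplearning.net/tutorial} for the real implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    class DenoisingAutoencoder:
        def __init__(self, n_in, n_hid, lr=0.1, corruption=0.25):
            self.W = rng.normal(0.0, 0.01, (n_in, n_hid))  # tied weights
            self.b = np.zeros(n_hid)   # encoder bias
            self.c = np.zeros(n_in)    # decoder bias
            self.lr, self.corruption = lr, corruption

        def encode(self, x):
            return sigmoid(x @ self.W + self.b)

        def denoising_step(self, x):
            # Corrupt the input, reconstruct, and take one gradient step
            # on the cross-entropy reconstruction loss L_H(x, z).
            x_tilde = x * (rng.random(x.shape) > self.corruption)
            h = self.encode(x_tilde)
            z = sigmoid(h @ self.W.T + self.c)
            d = (z - x) / len(x)               # grad wrt decoder pre-activation
            dh = (d @ self.W) * h * (1.0 - h)  # grad wrt encoder pre-activation
            self.W -= self.lr * (x_tilde.T @ dh + d.T @ h)
            self.b -= self.lr * dh.sum(axis=0)
            self.c -= self.lr * d.sum(axis=0)

    def pretrain(layers, x, epochs=10):
        # Greedy layer-wise pre-training: each layer learns to denoise the
        # representation produced by the layers below it (no labels used).
        rep = x
        for layer in layers:
            for _ in range(epochs):
                layer.denoising_step(rep)
            rep = layer.encode(rep)
        return rep

    layers = [DenoisingAutoencoder(32 * 32, 500), DenoisingAutoencoder(500, 500)]
    top_code = pretrain(layers, rng.random((64, 32 * 32)))

After this unsupervised phase, the stack plus an output layer would be
fine-tuned as a deep supervised network by stochastic gradient descent, as
described above.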
@@ -800,10 +776,11 @@
 $P(y|x)$ (like in semi-supervised learning), and on the other hand
 taking advantage of the expressive power and bias implicit in the
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
 
+\iffalse
 \begin{figure}[ht]
 \vspace*{-2mm}
 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
 \vspace*{-2mm}
 \caption{Illustration of the computations and training criterion for the denoising
@@ -815,15 +792,16 @@
 $L_H(x,z)$, whose expected value is approximately minimized during training
 by tuning $\theta$ and $\theta'$.}
 \label{fig:da}
 \vspace*{-2mm}
 \end{figure}
+\fi
 
 Here we chose to use the Denoising
-Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
+Auto-encoder~\citep{VincentPLarochelleH2008-very-small} as the building block for
 these deep hierarchies of features, as it is simple to train and
-explain (see Figure~\ref{fig:da}, as well as
+explain (see % Figure~\ref{fig:da}, as well as
 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides efficient inference, and yielded results
-comparable or better than RBMs in series of experiments
+comparable to or better than RBMs in a series of experiments
 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian
 RBM trained by a Score Matching criterion~\cite{Vincent-SM-2010}.
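Since the illustrating figure is now commented out, it may help to restate the
standard DA computations it depicted. Following the cited
\citet{VincentPLarochelleH2008-very-small}, with sigmoid $s$, corruption
process $C$, and parameters $\theta=(W,b)$, $\theta'=(W',b')$ as in the
figure caption (the notation here is reconstructed from that paper, not taken
verbatim from this one):

\begin{align*}
\tilde{x} &\sim C(\tilde{x} \mid x) && \textrm{(stochastically corrupt the input)}\\
h &= s(W \tilde{x} + b) && \textrm{(encode the corrupted input)}\\
z &= s(W' h + b') && \textrm{(decode, i.e.\ reconstruct $x$)}\\
L_H(x,z) &= - \textstyle\sum_i \big( x_i \log z_i + (1-x_i) \log (1-z_i) \big)
\end{align*}

The expected value of $L_H(x,z)$ under the corruption process is what training
approximately minimizes by tuning $\theta$ and $\theta'$, matching the caption
above.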