ift6266: diff writeup/nips2010_cameraready.tex @ 607:d840139444fe
NIPS workshop spotlight
| author   | Yoshua Bengio <bengioy@iro.umontreal.ca> |
|----------|------------------------------------------|
| date     | Fri, 26 Nov 2010 17:41:43 -0500          |
| parents  | bd7d11089a47                             |
| children |                                          |
--- a/writeup/nips2010_cameraready.tex Mon Nov 22 16:03:46 2010 -0500
+++ b/writeup/nips2010_cameraready.tex Fri Nov 26 17:41:43 2010 -0500
@@ -8,11 +8,11 @@
 \usepackage{graphicx,subfigure}
 \usepackage[numbers]{natbib}
-\addtolength{\textwidth}{20mm}
-\addtolength{\textheight}{20mm}
-\addtolength{\topmargin}{-10mm}
-\addtolength{\evensidemargin}{-10mm}
-\addtolength{\oddsidemargin}{-10mm}
+\addtolength{\textwidth}{10mm}
+\addtolength{\textheight}{10mm}
+\addtolength{\topmargin}{-5mm}
+\addtolength{\evensidemargin}{-5mm}
+\addtolength{\oddsidemargin}{-5mm}
 %\setlength\parindent{0mm}
@@ -72,7 +72,8 @@
 \vspace*{-1mm}
 {\bf Deep Learning} has emerged as a promising new area of research in
-statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008-very-small,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
+statistical machine learning~\citep{Hinton06}
+(see \citet{Bengio-2009} for a review).
 Learning algorithms for deep architectures are centered on the learning
 of useful representations of data, which are better suited to the task at hand,
 and are organized in a hierarchy with multiple levels.
@@ -107,36 +108,6 @@
 may be better able to provide sharing of statistical strength
 between different regions in input space or different tasks.
 
-\iffalse
-Whereas a deep architecture can in principle be more powerful than a
-shallow one in terms of representation, depth appears to render the
-training problem more difficult in terms of optimization and local minima.
-It is also only recently that successful algorithms were proposed to
-overcome some of these difficulties. All are based on unsupervised
-learning, often in an greedy layer-wise ``unsupervised pre-training''
-stage~\citep{Bengio-2009}.
-The principle is that each layer starting from
-the bottom is trained to represent its input (the output of the previous
-layer). After this
-unsupervised initialization, the stack of layers can be
-converted into a deep supervised feedforward neural network and fine-tuned by
-stochastic gradient descent.
-One of these layer initialization techniques,
-applied here, is the Denoising
-Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
-Figure~\ref{fig:da}), which performed similarly or
-better~\citep{VincentPLarochelleH2008-very-small} than previously
-proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
-in terms of unsupervised extraction
-of a hierarchy of features useful for classification. Each layer is trained
-to denoise its input, creating a layer of features that can be used as
-input for the next layer, forming a Stacked Denoising Auto-encoder (SDA).
-Note that training a Denoising Auto-encoder
-can actually been seen as training a particular RBM by an inductive
-principle different from maximum likelihood~\citep{Vincent-SM-2010},
-namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}.
-\fi
-
 Previous comparative experimental results with stacking of RBMs and DAs
 to build deep supervised predictors had shown that they could outperform
 shallow architectures in a variety of settings, especially
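The passage struck out above (and the SDA paragraph kept later in this file) describes greedy layer-wise pre-training: each layer learns to denoise the representation produced by the layer below, and the resulting stack is then fine-tuned as a supervised feedforward network. Below is a minimal sketch of the unsupervised phase only, not the ift6266 code: the class and function names are invented, and tied weights, masking corruption, cross-entropy reconstruction and per-example SGD are assumptions made for this illustration.

```python
# Illustrative sketch of greedy layer-wise denoising pre-training.
# Not the ift6266 code: tied weights, masking corruption, cross-entropy
# reconstruction and inputs in [0, 1] are assumptions of this sketch.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoderLayer:
    def __init__(self, n_in, n_hid, rng):
        self.W = rng.uniform(-0.1, 0.1, size=(n_in, n_hid))
        self.b = np.zeros(n_hid)   # hidden bias
        self.c = np.zeros(n_in)    # reconstruction bias (weights are tied)

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def pretrain_step(self, x, rng, corruption=0.25, lr=0.1):
        """Corrupt x, reconstruct the clean x, take one SGD step."""
        x_tilde = x * rng.binomial(1, 1.0 - corruption, size=x.shape)
        h = self.encode(x_tilde)
        z = sigmoid(h @ self.W.T + self.c)  # reconstruction of the clean input
        dz = z - x                          # grad of cross-entropy wrt pre-sigmoid output
        dh = (dz @ self.W) * h * (1.0 - h)
        self.W -= lr * (np.outer(x_tilde, dh) + np.outer(dz, h))
        self.b -= lr * dh
        self.c -= lr * dz

def pretrain_stack(layers, X, rng, n_epochs=5):
    """Train each layer, bottom-up, to denoise the output of the layer below."""
    H = X
    for layer in layers:
        for _ in range(n_epochs):
            for x in H:
                layer.pretrain_step(x, rng)
        H = np.array([layer.encode(x) for x in H])  # input of the next layer
    return layers

# Toy usage: two layers pre-trained on random "images" with values in [0, 1].
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(100, 64))
stack = pretrain_stack([DenoisingAutoencoderLayer(64, 32, rng),
                        DenoisingAutoencoderLayer(32, 16, rng)], X, rng)
```

Supervised fine-tuning would then reuse the trained encoders as the hidden layers of an MLP, add a classifier on top, and continue with stochastic gradient descent on labelled data, as the text describes.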
@@ -160,18 +131,18 @@
 of deep learning and the idea that more levels of representation can give
 rise to more abstract, more general features of the raw input.
 
-This hypothesis is related to a learning setting called
-{\bf self-taught learning}~\citep{RainaR2007}, which combines principles
+This hypothesis is related to the
+{\bf self-taught learning} setting~\citep{RainaR2007}, which combines principles
 of semi-supervised and multi-task learning: the learner can exploit examples
 that are unlabeled and possibly come from a distribution different from the target
-distribution, e.g., from other classes than those of interest.
-It has already been shown that deep learners can clearly take advantage of
+distribution, e.g., from classes other than those of interest.
+It has already been shown that deep learners can take advantage of
 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
 but more needed to be done to explore the impact
 of {\em out-of-distribution} examples and of the {\em multi-task} setting
 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised
 pre-training only with the first layer). In particular the {\em relative
-advantage of deep learning} for these settings has not been evaluated.
+advantage of deep learning} for these settings had not been evaluated.
 %
@@ -226,7 +197,7 @@
 \label{s:perturbations}
 \vspace*{-2mm}
-\begin{minipage}[h]{\linewidth}
+%\begin{minipage}[h]{\linewidth}
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
 %\begin{minipage}[b]{0.14\linewidth}
 \vspace*{-5mm}
@@ -251,14 +222,14 @@
 be found in this technical report~\citep{ARXIV-2010}.
 The code for these transformations (mostly python) is available at
 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
-a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
-amount of deformation or noise introduced.
+a global control parameter ($0 \le complexity \le 1$) modulating the
+amount of deformation or noise.
 There are two main parts in the pipeline. The first one,
 from thickness to pinch, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
-\end{minipage}
+%\end{minipage}
 
-\newpage
+%\newpage
 \vspace*{1mm}
 %\subsection{Transformations}
 {\large\bf 2.1 Transformations}
@@ -404,17 +375,22 @@
 {\large\bf 2.2 Injecting Noise}
 %\subsection{Injecting Noise}
-\vspace{2mm}
+%\vspace{2mm}
 \begin{minipage}[h]{\linewidth}
 %\vspace*{-.2cm}
-\begin{minipage}[t]{0.14\linewidth}
-\centering
-\vspace*{-2mm}
+%\begin{minipage}[t]{0.14\linewidth}
+\begin{wrapfigure}[8]{l}{0.15\textwidth}
+\begin{center}
+\vspace*{-5mm}
+%\vspace*{-2mm}
 \includegraphics[scale=.4]{images/Motionblur_only.png}\\
 {\bf Motion Blur}
-\end{minipage}%
-\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
+%\end{minipage}%
+\end{center}
+\end{wrapfigure}
+%\hspace{0.3cm}
+%\begin{minipage}[t]{0.83\linewidth}
 %\vspace*{.5mm}
 The {\bf motion blur} module is GIMP's ``linear motion
 blur'', which has parameters $length$ and $angle$. The value of
@@ -423,7 +399,7 @@
 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
 \vspace{5mm}
 \end{minipage}
-\end{minipage}
+%\end{minipage}
 
 \vspace*{1mm}
 
@@ -443,7 +419,7 @@
 The rectangle corners are sampled so that larger complexity gives larger
 rectangles. The destination position in the occluded image are also sampled
-according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
+according to a normal distribution (more details in~\citet{ARXIV-2010}).
 This module is skipped with probability 60\%.
 %\vspace{7mm}
 \end{minipage}
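As a small illustration of the stochastic parametrization quoted above, the motion-blur parameters and the occlusion module's skip decision could be drawn as below. This is only a sketch: the function names are invented, GIMP's actual "linear motion blur" filter is not reproduced, and taking the absolute value of the sampled length is an assumption made here to keep the blur length nonnegative.

```python
# Sketch of the stochastic parameters described in the text (not the ift6266
# pipeline code). `complexity` is the global control parameter in [0, 1].
import numpy as np

def sample_motion_blur_params(complexity, rng):
    """angle ~ U[0, 360] degrees; length ~ Normal(0, (3 * complexity)^2).
    The absolute value is an assumption of this sketch, used only to keep
    the blur length nonnegative."""
    angle = rng.uniform(0.0, 360.0)
    length = abs(rng.normal(0.0, 3.0 * complexity))
    return angle, length

def occlusion_applied(rng, skip_probability=0.6):
    """The occlusion module is skipped with probability 60%."""
    return rng.uniform() >= skip_probability

rng = np.random.RandomState(0)
complexity = 0.5
angle, length = sample_motion_blur_params(complexity, rng)
if occlusion_applied(rng):
    pass  # rectangle corners and destination position would be sampled here
```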
@@ -452,7 +428,7 @@
 \vspace*{1mm}
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
-\vspace*{-6mm}
+\vspace*{-3mm}
 \begin{center}
 %\begin{minipage}[t]{0.14\linewidth}
 %\centering
@@ -482,7 +458,7 @@
 %\newpage
-\vspace*{-9mm}
+\vspace*{1mm}
 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
 %\centering
@@ -622,9 +598,9 @@
 \vspace*{-1mm}
 Much previous work on deep learning had been performed on
-the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
+the MNIST digits task with 60~000 examples,
 and variants involving 10~000
-examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
+examples~\citep{VincentPLarochelleH2008-very-small}.
 The focus here is on much larger training sets, from 10 times to
 to 1000 times larger, and 62 classes.
@@ -786,7 +762,7 @@
 {\bf Stacked Denoising Auto-encoders (SDA).} Various auto-encoder variants
 and Restricted Boltzmann Machines (RBMs) can be used to initialize the weights
 of each layer of a deep MLP (with many hidden
-layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
+layers)
 apparently setting parameters in the basin of attraction of
 supervised gradient descent yielding better
 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised
@@ -802,6 +778,7 @@
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
 
+\iffalse
 \begin{figure}[ht]
 \vspace*{-2mm}
 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
@@ -817,11 +794,12 @@
 \label{fig:da}
 \vspace*{-2mm}
 \end{figure}
+\fi
 
 Here we chose to use the Denoising
-Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
+Auto-encoder~\citep{VincentPLarochelleH2008-very-small} as the building block for
 these deep hierarchies of features, as it is simple to train and
-explain (see Figure~\ref{fig:da}, as well as
+explain (see % Figure~\ref{fig:da}, as well as
 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
 provides efficient inference, and yielded results
 comparable or better than RBMs in series of experiments