changeset 607:d840139444fe

NIPS workshop spotlight
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Fri, 26 Nov 2010 17:41:43 -0500
parents bd7d11089a47
children 456a8fb9829e
files writeup/NIPS2010_workshop_spotlight.ppt writeup/ift6266_ml.bib writeup/nips2010_cameraready.tex writeup/strings-shorter.bib
diffstat 4 files changed, 54 insertions(+), 76 deletions(-)
Binary file writeup/NIPS2010_workshop_spotlight.ppt has changed
--- a/writeup/ift6266_ml.bib	Mon Nov 22 16:03:46 2010 -0500
+++ b/writeup/ift6266_ml.bib	Fri Nov 26 17:41:43 2010 -0500
@@ -9752,7 +9752,7 @@
 
 
 @Article{Hinton06,
-  author =       "Goeffrey E. Hinton and Simon Osindero and {Yee Whye} Teh",
+  author =       "Geoffrey E. Hinton and Simon Osindero and {Yee Whye} Teh",
   title =        "A fast learning algorithm for deep belief nets",
   journal =      "Neural Computation",
   volume =       "18",
--- a/writeup/nips2010_cameraready.tex	Mon Nov 22 16:03:46 2010 -0500
+++ b/writeup/nips2010_cameraready.tex	Fri Nov 26 17:41:43 2010 -0500
@@ -8,11 +8,11 @@
 \usepackage{graphicx,subfigure}
 \usepackage[numbers]{natbib}
 
-\addtolength{\textwidth}{20mm}
-\addtolength{\textheight}{20mm}
-\addtolength{\topmargin}{-10mm}
-\addtolength{\evensidemargin}{-10mm}
-\addtolength{\oddsidemargin}{-10mm}
+\addtolength{\textwidth}{10mm}
+\addtolength{\textheight}{10mm}
+\addtolength{\topmargin}{-5mm}
+\addtolength{\evensidemargin}{-5mm}
+\addtolength{\oddsidemargin}{-5mm}
 
 %\setlength\parindent{0mm}
 
@@ -72,7 +72,8 @@
 \vspace*{-1mm}
 
 {\bf Deep Learning} has emerged as a promising new area of research in
-statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008-very-small,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
+statistical machine learning~\citep{Hinton06}
+(see \citet{Bengio-2009} for a review).
 Learning algorithms for deep architectures are centered on the learning
 of useful representations of data, which are better suited to the task at hand,
 and are organized in a hierarchy with multiple levels.
@@ -107,36 +108,6 @@
 may be better able to provide sharing of statistical strength
 between different regions in input space or different tasks.
 
-\iffalse
-Whereas a deep architecture can in principle be more powerful than a
-shallow one in terms of representation, depth appears to render the
-training problem more difficult in terms of optimization and local minima.
-It is also only recently that successful algorithms were proposed to
-overcome some of these difficulties.  All are based on unsupervised
-learning, often in a greedy layer-wise ``unsupervised pre-training''
-stage~\citep{Bengio-2009}.  
-The principle is that each layer starting from
-the bottom is trained to represent its input (the output of the previous
-layer). After this
-unsupervised initialization, the stack of layers can be
-converted into a deep supervised feedforward neural network and fine-tuned by
-stochastic gradient descent.
-One of these layer initialization techniques,
-applied here, is the Denoising
-Auto-encoder~(DA)~\citep{VincentPLarochelleH2008-very-small} (see
-Figure~\ref{fig:da}), which performed similarly or 
-better~\citep{VincentPLarochelleH2008-very-small} than previously
-proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06} 
-in terms of unsupervised extraction
-of a hierarchy of features useful for classification. Each layer is trained
-to denoise its input, creating a layer of features that can be used as
-input for the next layer, forming a Stacked Denoising Auto-encoder (SDA).
-Note that training a Denoising Auto-encoder
-can actually be seen as training a particular RBM by an inductive
-principle different from maximum likelihood~\citep{Vincent-SM-2010}, 
-namely by Score Matching~\citep{Hyvarinen-2005,HyvarinenA2008}. 
-\fi
-
 Previous comparative experimental results with stacking of RBMs and DAs
 to build deep supervised predictors had shown that they could outperform
 shallow architectures in a variety of settings, especially
@@ -160,18 +131,18 @@
 of deep learning and the idea that more levels of representation can
 give rise to more abstract, more general features of the raw input.
 
-This hypothesis is related to a learning setting called
-{\bf self-taught learning}~\citep{RainaR2007}, which combines principles
+This hypothesis is related to the
+{\bf self-taught learning} setting~\citep{RainaR2007}, which combines principles
 of semi-supervised and multi-task learning: the learner can exploit examples
 that are unlabeled and possibly come from a distribution different from the target
-distribution, e.g., from other classes than those of interest. 
-It has already been shown that deep learners can clearly take advantage of
+distribution, e.g., from classes other than those of interest. 
+It has already been shown that deep learners can take advantage of
 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small},
 but more needed to be done to explore the impact
 of {\em out-of-distribution} examples and of the {\em multi-task} setting
 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised
 pre-training only with the first layer). In particular the {\em relative
-advantage of deep learning} for these settings has not been evaluated.
+advantage of deep learning} for these settings had not been evaluated.
 
 
 %
@@ -226,7 +197,7 @@
 \label{s:perturbations}
 \vspace*{-2mm}
 
-\begin{minipage}[h]{\linewidth}
+%\begin{minipage}[h]{\linewidth}
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
 %\begin{minipage}[b]{0.14\linewidth}
 \vspace*{-5mm}
@@ -251,14 +222,14 @@
 be found in this technical report~\citep{ARXIV-2010}.
 The code for these transformations (mostly python) is available at 
 {\tt http://hg.assembla.com/ift6266}. All the modules in the pipeline share
-a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
-amount of deformation or noise introduced. 
+a global control parameter ($0 \le complexity \le 1$) modulating the
+amount of deformation or noise. 
 There are two main parts in the pipeline. The first one,
 from thickness to pinch, performs transformations. The second
 part, from blur to contrast, adds different kinds of noise.
-\end{minipage}
+%\end{minipage}
 
-\newpage
+%\newpage
 \vspace*{1mm}
 %\subsection{Transformations}
 {\large\bf 2.1 Transformations}
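
For illustration, a minimal Python sketch of how a pipeline driven by a single
complexity parameter could be organized follows; the stage functions and names
below are placeholders assumed for this example, not the actual modules from
http://hg.assembla.com/ift6266.

import numpy as np

def thicken(img, complexity, rng):
    # Placeholder for the first part (thickness ... pinch): a transformation
    # whose strength grows with complexity.
    return np.clip(img + 0.1 * complexity, 0.0, 1.0)

def add_noise(img, complexity, rng):
    # Placeholder for the second part (blur ... contrast): additive noise
    # whose scale grows with complexity.
    return np.clip(img + rng.normal(0.0, 0.1 * complexity, img.shape), 0.0, 1.0)

TRANSFORMATIONS = [thicken]     # first part of the pipeline
NOISE_MODULES = [add_noise]     # second part of the pipeline

def perturb(img, complexity=0.5, seed=0):
    # All modules share the same global control parameter in [0, 1].
    assert 0.0 <= complexity <= 1.0
    rng = np.random.RandomState(seed)
    for module in TRANSFORMATIONS + NOISE_MODULES:
        img = module(img, complexity, rng)
    return img

A call such as perturb(img, complexity=0.8) then runs every transformation
followed by every noise module under the same global knob.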
@@ -404,17 +375,22 @@
 
 {\large\bf 2.2 Injecting Noise}
 %\subsection{Injecting Noise}
-\vspace{2mm}
+%\vspace{2mm}
 
 \begin{minipage}[h]{\linewidth}
 %\vspace*{-.2cm}
-\begin{minipage}[t]{0.14\linewidth}
-\centering
-\vspace*{-2mm}
+%\begin{minipage}[t]{0.14\linewidth}
+\begin{wrapfigure}[8]{l}{0.15\textwidth}
+\begin{center}
+\vspace*{-5mm}
+%\vspace*{-2mm}
 \includegraphics[scale=.4]{images/Motionblur_only.png}\\
 {\bf Motion Blur}
-\end{minipage}%
-\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
+%\end{minipage}%
+\end{center}
+\end{wrapfigure}
+%\hspace{0.3cm}
+%\begin{minipage}[t]{0.83\linewidth}
 %\vspace*{.5mm}
 The {\bf motion blur} module is GIMP's ``linear motion blur'', which
 has parameters $length$ and $angle$. The value of
@@ -423,7 +399,7 @@
 $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
 \vspace{5mm}
 \end{minipage}
-\end{minipage}
+%\end{minipage}
 
 \vspace*{1mm}
 
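
As a hedged sketch of the sampling just described: the code below draws the
angle and length as stated and applies a simple line-shaped convolution kernel.
GIMP's actual "linear motion blur" filter is not reproduced, and taking the
absolute value of the sampled length is an assumption.

import numpy as np
from scipy.ndimage import convolve

def motion_blur(img, complexity, rng):
    angle = rng.uniform(0.0, 360.0)                   # angle ~ U[0, 360] degrees
    length = abs(rng.normal(0.0, 3.0 * complexity))   # std = 3 x complexity
    n = max(int(round(length)), 1)
    if n == 1:
        return img                                    # negligible blur
    # Line-shaped kernel along the sampled direction (an approximation).
    kernel = np.zeros((n, n))
    c = (n - 1) / 2.0
    for t in np.linspace(-c, c, n):
        row = int(round(c + t * np.sin(np.deg2rad(angle))))
        col = int(round(c + t * np.cos(np.deg2rad(angle))))
        kernel[row, col] = 1.0
    kernel /= kernel.sum()
    return convolve(img, kernel, mode='nearest')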
@@ -443,7 +419,7 @@
 The rectangle corners
 are sampled so that larger complexity gives larger rectangles.
 The destination position in the occluded image is also sampled
-according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
+according to a normal distribution (more details in~\citet{ARXIV-2010}).
 This module is skipped with probability 60\%.
 %\vspace{7mm}
 \end{minipage}
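
A rough Python sketch of this occlusion step, under assumed size and position
distributions (the exact sampling is described in the cited technical report
and is not reproduced here):

import numpy as np

def occlude(img, donor, complexity, rng):
    # img and donor are 2-D grayscale arrays of the same shape.
    if rng.uniform() < 0.6:                 # module skipped with probability 60%
        return img
    h, w = img.shape
    # Larger complexity -> larger occluding rectangle (assumed proportionality).
    rh = max(1, int(rng.uniform(0.0, complexity) * h))
    rw = max(1, int(rng.uniform(0.0, complexity) * w))
    top = rng.randint(0, donor.shape[0] - rh + 1)
    left = rng.randint(0, donor.shape[1] - rw + 1)
    patch = donor[top:top + rh, left:left + rw]
    # Destination position sampled from a normal distribution around the centre.
    dr = int(np.clip(rng.normal(h / 2.0, h / 4.0), 0, h - rh))
    dc = int(np.clip(rng.normal(w / 2.0, w / 4.0), 0, w - rw))
    out = img.copy()
    out[dr:dr + rh, dc:dc + rw] = patch
    return out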
@@ -452,7 +428,7 @@
 \vspace*{1mm}
 
 \begin{wrapfigure}[8]{l}{0.15\textwidth}
-\vspace*{-6mm}
+\vspace*{-3mm}
 \begin{center}
 %\begin{minipage}[t]{0.14\linewidth}
 %\centering
@@ -482,7 +458,7 @@
 
 %\newpage
 
-\vspace*{-9mm}
+\vspace*{1mm}
 
 %\hspace*{-3mm}\begin{minipage}[t]{0.18\linewidth}
 %\centering
@@ -622,9 +598,9 @@
 \vspace*{-1mm}
 
 Much previous work on deep learning had been performed on
-the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
+the MNIST digits task
 with 60~000 examples, and variants involving 10~000
-examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}.
+examples~\citep{VincentPLarochelleH2008-very-small}.
 The focus here is on much larger training sets, from 10 times to 
 1000 times larger, and 62 classes.
 
@@ -786,7 +762,7 @@
 {\bf Stacked Denoising Auto-encoders (SDA).}
 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
 can be used to initialize the weights of each layer of a deep MLP (with many hidden 
-layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, 
+layers),
 apparently setting parameters in the
 basin of attraction of supervised gradient descent, yielding better
 generalization~\citep{Erhan+al-2010}.  This initial {\em unsupervised
@@ -802,6 +778,7 @@
 deep architecture (whereby complex concepts are expressed as
 compositions of simpler ones through a deep hierarchy).
 
+\iffalse
 \begin{figure}[ht]
 \vspace*{-2mm}
 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
@@ -817,11 +794,12 @@
 \label{fig:da}
 \vspace*{-2mm}
 \end{figure}
+\fi
 
 Here we chose to use the Denoising
-Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
+Auto-encoder~\citep{VincentPLarochelleH2008-very-small} as the building block for
 these deep hierarchies of features, as it is simple to train and
-explain (see Figure~\ref{fig:da}, as well as 
+explain (see % Figure~\ref{fig:da}, as well as 
 the tutorial and code at {\tt http://deeplearning.net/tutorial}), 
 provides efficient inference, and yielded results
 comparable to or better than those of RBMs in a series of experiments
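
To make the building block concrete, here is a minimal NumPy sketch of one
denoising auto-encoder update with tied weights, masking noise, and a
cross-entropy reconstruction loss; it is a didactic assumption, not the
deeplearning.net tutorial code. Stacking such layers, each trained on the
hidden representation of the previous one, gives the SDA used in the paper.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def da_update(x, W, b, c, rng, corruption=0.3, lr=0.1):
    # x: one input vector in [0, 1]^d; W: d x k weight matrix (tied);
    # b: hidden bias (k,); c: reconstruction bias (d,).
    x_tilde = x * (rng.uniform(size=x.shape) > corruption)   # masking noise
    h = sigmoid(x_tilde @ W + b)                             # encoder
    z = sigmoid(h @ W.T + c)                                 # tied-weight decoder
    delta_z = z - x                  # grad of cross-entropy w.r.t. pre-activation
    delta_h = (delta_z @ W) * h * (1.0 - h)
    W -= lr * (np.outer(x_tilde, delta_h) + np.outer(delta_z, h))
    b -= lr * delta_h
    c -= lr * delta_z
    return W, b, c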
--- a/writeup/strings-shorter.bib	Mon Nov 22 16:03:46 2010 -0500
+++ b/writeup/strings-shorter.bib	Fri Nov 26 17:41:43 2010 -0500
@@ -81,20 +81,20 @@
 @String{ICDAR03 =  "Proc. {ICDAR}'03"}
 @String{ICDAR07 =  "Proc. {ICDAR}'07"}
 
-@String{ICML96 = "{ICML} 1996"}
-@String{ICML97 = "{ICML} 1997"}
-@String{ICML98 = "{ICML} 1998"}
-@String{ICML99 = "{ICML} 1999"}
-@String{ICML00 = "{ICML} 2000"}
-@String{ICML01 = "{ICML} 2001"}
-@String{ICML02 = "{ICML} 2002"}
-@String{ICML03 = "{ICML} 2003"}
-@String{ICML04 = "{ICML} 2004"}
-@String{ICML05 = "{ICML} 2005"}
-@String{ICML06 = "{ICML} 2006"}
-@String{ICML07 = "{ICML} 2007"}
-@String{ICML08 = "{ICML} 2008"}
-@String{ICML09 = "{ICML} 2009"}
+@String{ICML96 = "{ICML}"}
+@String{ICML97 = "{ICML}"}
+@String{ICML98 = "{ICML}"}
+@String{ICML99 = "{ICML}"}
+@String{ICML00 = "{ICML}"}
+@String{ICML01 = "{ICML}"}
+@String{ICML02 = "{ICML}"}
+@String{ICML03 = "{ICML}"}
+@String{ICML04 = "{ICML}"}
+@String{ICML05 = "{ICML}"}
+@String{ICML06 = "{ICML}"}
+@String{ICML07 = "{ICML}"}
+@String{ICML08 = "{ICML}"}
+@String{ICML09 = "{ICML}"}
 @string{icml09loc = {}}
 @STRING{aistats05 = "AISTATS'2005"}
 @STRING{aistats07 = "AISTATS'2007"}