comparison writeup/nips2010_submission.tex @ 513:66a905508e34
resolved merge conflict
author    Yoshua Bengio <bengioy@iro.umontreal.ca>
date      Tue, 01 Jun 2010 14:05:02 -0400
parents   6f042a71be23 8c2ab4f246b1
children  920a38715c90
512:6f042a71be23 | 513:66a905508e34 |
18 \vspace*{-2mm} | 18 \vspace*{-2mm} |
19 \begin{abstract} | 19 \begin{abstract} |
20 Recent theoretical and empirical work in statistical machine learning has | 20 Recent theoretical and empirical work in statistical machine learning has |
21 demonstrated the importance of learning algorithms for deep | 21 demonstrated the importance of learning algorithms for deep |
22 architectures, i.e., function classes obtained by composing multiple | 22 architectures, i.e., function classes obtained by composing multiple |
23 non-linear transformations. Self-taught learning (exploiting unlabeled | 23 non-linear transformations. Self-taught learning (exploiting unlabeled |
24 examples or examples from other distributions) has already been applied | 24 examples or examples from other distributions) has already been applied |
25 to deep learners, but mostly to show the advantage of unlabeled | 25 to deep learners, but mostly to show the advantage of unlabeled |
26 examples. Here we explore the advantage brought by {\em out-of-distribution | 26 examples. Here we explore the advantage brought by {\em out-of-distribution |
27 examples} and show that {\em deep learners benefit more from them than a | 27 examples} and show that {\em deep learners benefit more from them than a |
28 corresponding shallow learner}, in the area | 28 corresponding shallow learner}, in the area |
72 applied here, is the Denoising | 72 applied here, is the Denoising |
73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which | 73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which |
74 performed similarly or better than previously proposed Restricted Boltzmann | 74 performed similarly or better than previously proposed Restricted Boltzmann |
75 Machines in terms of unsupervised extraction of a hierarchy of features | 75 Machines in terms of unsupervised extraction of a hierarchy of features |
76 useful for classification. The principle is that each layer starting from | 76 useful for classification. The principle is that each layer starting from |
77 the bottom is trained to encode its input (the output of the previous | 77 the bottom is trained to encode its input (the output of the previous |
78 layer) and to reconstruct it from a corrupted version of it. After this | 78 layer) and to reconstruct it from a corrupted version of it. After this |
79 unsupervised initialization, the stack of denoising auto-encoders can be | 79 unsupervised initialization, the stack of denoising auto-encoders can be |
80 converted into a deep supervised feedforward neural network and fine-tuned by | 80 converted into a deep supervised feedforward neural network and fine-tuned by |
81 stochastic gradient descent. | 81 stochastic gradient descent. |
82 | 82 |
83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
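The hunk above describes the denoising auto-encoder training principle in prose only. Below is a minimal numpy sketch of a single DAE layer, assuming tied weights, sigmoid units, masking noise, and a squared-error reconstruction loss; the layer sizes, learning rate, and corruption level are illustrative assumptions, not values reported in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def dae_step(x, W, b_hid, b_vis, lr=0.1, corruption=0.25):
        """One SGD step of denoising auto-encoder training on one example."""
        mask = rng.random(x.shape) > corruption      # masking noise
        x_tilde = x * mask                           # corrupted input
        h = sigmoid(x_tilde @ W + b_hid)             # encode the corrupted input
        x_hat = sigmoid(h @ W.T + b_vis)             # decode with tied weights
        err = x_hat - x                              # reconstruct the *clean* input
        d_vis = err * x_hat * (1 - x_hat)            # backprop through decoder
        d_hid = (d_vis @ W) * h * (1 - h)            # backprop through encoder
        W -= lr * (np.outer(d_vis, h) + np.outer(x_tilde, d_hid))
        b_vis -= lr * d_vis
        b_hid -= lr * d_hid
        return 0.5 * np.sum(err ** 2)

    n_in, n_hid = 1024, 500                          # e.g. 32x32 input images
    W = rng.normal(0.0, 0.01, (n_in, n_hid))
    b_hid, b_vis = np.zeros(n_hid), np.zeros(n_in)
    for _ in range(10):                              # toy loop on random data
        dae_step(rng.random(n_in), W, b_hid, b_vis)

After each layer is trained this way in turn, the stack initializes a feedforward classifier that is fine-tuned by supervised stochastic gradient descent, as the text describes.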
117 Similarly, does the feature learning step in deep learning algorithms benefit more | 117 Similarly, does the feature learning step in deep learning algorithms benefit more |
118 from training with similar but different classes (i.e., a multi-task learning scenario) than | 118 from training with similar but different classes (i.e., a multi-task learning scenario) than |
119 a corresponding shallow and purely supervised architecture? | 119 a corresponding shallow and purely supervised architecture? |
120 %\end{enumerate} | 120 %\end{enumerate} |
121 | 121 |
122 Our experimental results provide evidence to support positive answers to all of these questions. | 122 The experimental results presented here support positive answers to all of these questions. |
123 | 123 |
124 \vspace*{-1mm} | 124 \vspace*{-1mm} |
125 \section{Perturbation and Transformation of Character Images} | 125 \section{Perturbation and Transformation of Character Images} |
126 \vspace*{-1mm} | 126 \vspace*{-1mm} |
127 | 127 |
202 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | 202 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times |
203 \sqrt[3]{complexity}$.\\ | 203 \sqrt[3]{complexity}$.\\ |
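A quick numerical reading of the two parameter formulas above, under the assumption (suggested by the formulas but not stated in this excerpt) that complexity ranges over [0, 1]:

    # Direct transcription of the alpha/sigma schedule above.
    def deformation_params(complexity):
        alpha = complexity ** (1.0 / 3.0) * 10.0          # deformation amplitude
        sigma = 10.0 - 7.0 * complexity ** (1.0 / 3.0)    # smoothing width
        return alpha, sigma

    print(deformation_params(0.0))   # (0.0, 10.0): no deformation, heavy smoothing
    print(deformation_params(1.0))   # (10.0, 3.0): strongest, least-smoothed deformation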
204 {\bf Pinch.} | 204 {\bf Pinch.} |
205 This GIMP filter is named ``Whirl and | 205 This GIMP filter is named ``Whirl and |
206 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic | 206 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic |
207 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). | 207 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}. |
208 For a square input image, think of drawing a circle of | 208 For a square input image, think of drawing a circle of |
209 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | 209 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to |
210 that disk (region inside circle) will have its value recalculated by taking | 210 that disk (region inside circle) will have its value recalculated by taking |
211 the value of another ``source'' pixel in the original image. The position of | 211 the value of another ``source'' pixel in the original image. The position of |
212 that source pixel is found on the line that goes through $C$ and $P$, but | 212 that source pixel is found on the line that goes through $C$ and $P$, but |
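The hunk ends mid-sentence, before the exact rule locating the source pixel. As a rough illustration of the mapping it describes (a source pixel on the line through $C$ and $P$, pulled toward the center), here is a sketch using an assumed radial remapping d_src = d * (d/r) ** k; the actual GIMP formula differs and is not given in this excerpt.

    import numpy as np

    def pinch(img, r, k=0.5):
        """Toy pinch: remap pixels inside the disk of radius r (assumed formula)."""
        h, w = img.shape
        cy, cx = (h - 1) / 2.0, (w - 1) / 2.0    # center point C
        out = img.copy()
        for y in range(h):
            for x in range(w):
                d = np.hypot(y - cy, x - cx)     # distance from C to P
                if 0 < d < r:                    # only the disk is affected
                    d_src = d * (d / r) ** k     # source lies between C and P,
                    sy = int(round(cy + (y - cy) * d_src / d))   # stretching the
                    sx = int(round(cx + (x - cx) * d_src / d))   # center outward
                    out[y, x] = img[sy, sx]      # copy value from source pixel
        return out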
336 the best SDA (again according to validation set error), along with a precise estimate | 336 the best SDA (again according to validation set error), along with a precise estimate |
337 of human performance obtained via Amazon's Mechanical Turk (AMT) | 337 of human performance obtained via Amazon's Mechanical Turk (AMT) |
338 service\footnote{http://mturk.com}. | 338 service\footnote{http://mturk.com}. |
339 AMT users are paid small amounts | 339 AMT users are paid small amounts |
340 of money to perform tasks for which human intelligence is required. | 340 of money to perform tasks for which human intelligence is required. |
341 Mechanical Turk has been used extensively in natural language processing and vision. | 341 Mechanical Turk has been used extensively in natural language |
342 %processing \citep{SnowEtAl2008} and vision | 342 processing \citep{SnowEtAl2008} and vision |
343 %\citep{SorokinAndForsyth2008,whitehill09}. | 343 \citep{SorokinAndForsyth2008,whitehill09}. |
344 %\citep{SorokinAndForsyth2008,whitehill09}. | |
345 AMT users were presented | 344 AMT users were presented |
346 with 10 character images and asked to type 10 corresponding ASCII | 345 with 10 character images and asked to type 10 corresponding ASCII |
347 characters. They were forced to make a hard choice among the | 346 characters. They were forced to make a hard choice among the |
348 62 or 10 character classes (all classes or digits only). | 347 62 or 10 character classes (all classes or digits only). |
349 Three users classified each image, allowing | 348 Three users classified each image, allowing |
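The hunk stops before saying what the three labels per image are used for. One plausible aggregation (an assumption, not a procedure stated in this excerpt) is a majority vote over the three typed characters:

    from collections import Counter

    def majority_label(labels):
        """labels: the three ASCII characters typed by the three AMT users."""
        return Counter(labels).most_common(1)[0][0]   # ties broken arbitrarily

    print(majority_label(["a", "a", "o"]))   # hypothetical responses -> "a"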
585 \fi | 584 \fi |
586 | 585 |
587 | 586 |
588 \begin{figure}[h] | 587 \begin{figure}[h] |
589 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | 588 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ |
590 \caption{Charts corresponding to tables 2 (left) and 3 (right), from Appendix I.} | 589 \caption{Relative improvement in error rate due to self-taught learning. |
 | 590 Left: Improvement (or loss, when negative) |
 | 591 induced by out-of-distribution examples (perturbed data). |
 | 592 Right: Improvement (or loss, when negative) induced by multi-task |
 | 593 learning (training on all classes and testing only on either digits, |
 | 594 upper case, or lower-case). The deep learner (SDA) benefits more from |
 | 595 both self-taught learning scenarios, compared to the shallow MLP.} |
591 \label{fig:improvements-charts} | 596 \label{fig:improvements-charts} |
592 \end{figure} | 597 \end{figure} |
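The new caption reports relative improvements in error rate. A minimal sketch of that quantity, assuming it is the relative reduction of the baseline error (the exact definition lives in the paper's tables, not in this excerpt):

    def relative_improvement(err_baseline, err_selftaught):
        """Positive when self-taught learning helps, negative when it hurts."""
        return (err_baseline - err_selftaught) / err_baseline

    # e.g. error falling from 2.0% to 1.4% is a ~30% relative improvement
    print(relative_improvement(0.020, 0.014))   # ~0.30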
593 | 598 |
594 \vspace*{-1mm} | 599 \vspace*{-1mm} |
595 \section{Conclusions} | 600 \section{Conclusions} |