comparison writeup/nips2010_submission.tex @ 509:860c755ddcff

argh, sorry about that
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Tue, 01 Jun 2010 10:58:15 -0700
parents a41a8925be70
children 8c2ab4f246b1
comparison of 508:c421ea80edeb with 509:860c755ddcff
\vspace*{-2mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has
demonstrated the importance of learning algorithms for deep
architectures, i.e., function classes obtained by composing multiple
non-linear transformations. Self-taught learning (exploiting unlabeled
examples or examples from other distributions) has already been applied
to deep learners, but mostly to show the advantage of unlabeled
examples. Here we explore the advantage brought by {\em out-of-distribution
examples} and show that {\em deep learners benefit more from them than a
corresponding shallow learner}, in the area
[...]
applied here, is the Denoising
Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which
performed similarly or better than previously proposed Restricted Boltzmann
Machines in terms of unsupervised extraction of a hierarchy of features
useful for classification. The principle is that each layer starting from
the bottom is trained to encode its input (the output of the previous
layer) and to reconstruct it from a corrupted version of it. After this
unsupervised initialization, the stack of denoising auto-encoders can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.
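
In a simplified sketch of this per-layer criterion (the generic denoising
auto-encoder formulation~\citep{VincentPLarochelleH2008-very-small}, not the
exact corruption process or loss used in our experiments), the layer input
$x$ is corrupted into $\tilde{x} \sim q(\tilde{x} \mid x)$, e.g.\ by zeroing a
random subset of its components, encoded into $h = s(W\tilde{x} + b)$ with an
element-wise sigmoid $s$, and decoded into a reconstruction
$\hat{x} = s(W'h + b')$; the parameters $(W,b,W',b')$ are trained to minimize
a reconstruction loss such as the cross-entropy
\[
  L(x,\hat{x}) = -\sum_i \left[ x_i \log \hat{x}_i + (1-x_i)\log(1-\hat{x}_i) \right].
\]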

Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
[...]
and multi-task learning, not much has been done yet to explore the impact
of {\em out-of-distribution} examples and of the multi-task setting
(but see~\citep{CollobertR2008}). In particular, the {\em relative
advantage} of deep learning for this setting has not been evaluated.

In this paper we ask the following questions:

%\begin{enumerate}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
[...]
Similarly, does the feature learning step in deep learning algorithms benefit more from
training with similar but different classes (i.e., a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
%\end{enumerate}

Our experimental results provide evidence to support positive answers to all of these questions.

\vspace*{-1mm}
\section{Perturbation and Transformation of Character Images}
\vspace*{-1mm}

[...]
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.\\
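For example, a complexity of $1$ yields $\alpha = 10$ and $\sigma = 3$, while a
complexity of $0$ yields $\alpha = 0$ and $\sigma = 10$.\\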
{\bf Pinch.}
This GIMP filter is named ``Whirl and
pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}.
For a square input image, think of drawing a circle of
radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
that disk (the region inside the circle) will have its value recalculated by taking
the value of another ``source'' pixel in the original image. The position of
that source pixel is found on the line that goes through $C$ and $P$, but
[...]
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}.
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language
processing \citep{SnowEtAl2008} and vision
\citep{SorokinAndForsyth2008,whitehill09}.
AMT users were presented
with 10 character images and asked to type 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing
[...]
\fi


\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
upper case, or lower case). The deep learner (SDA) benefits more from
both self-taught learning scenarios, compared to the shallow MLP.}
\label{fig:improvements-charts}
\end{figure}

\vspace*{-1mm}
\section{Conclusions}