comparison writeup/nips2010_submission.tex @ 509:860c755ddcff
argh, sorry about that
author    Dumitru Erhan <dumitru.erhan@gmail.com>
date      Tue, 01 Jun 2010 10:58:15 -0700
parents   a41a8925be70
children  8c2ab4f246b1
508:c421ea80edeb | 509:860c755ddcff |
---|---|
18 \vspace*{-2mm} | 18 \vspace*{-2mm} |
19 \begin{abstract} | 19 \begin{abstract} |
20 Recent theoretical and empirical work in statistical machine learning has | 20 Recent theoretical and empirical work in statistical machine learning has |
21 demonstrated the importance of learning algorithms for deep | 21 demonstrated the importance of learning algorithms for deep |
22 architectures, i.e., function classes obtained by composing multiple | 22 architectures, i.e., function classes obtained by composing multiple |
23 non-linear transformations. Self-taught learning (exploiting unlabeled | 23 non-linear transformations. The self-taught learning (exploiting unlabeled |
24 examples or examples from other distributions) has already been applied | 24 examples or examples from other distributions) has already been applied |
25 to deep learners, but mostly to show the advantage of unlabeled | 25 to deep learners, but mostly to show the advantage of unlabeled |
26 examples. Here we explore the advantage brought by {\em out-of-distribution | 26 examples. Here we explore the advantage brought by {\em out-of-distribution |
27 examples} and show that {\em deep learners benefit more from them than a | 27 examples} and show that {\em deep learners benefit more from them than a |
28 corresponding shallow learner}, in the area | 28 corresponding shallow learner}, in the area |
72 applied here, is the Denoising | 72 applied here, is the Denoising |
73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which | 73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which |
74 performed similarly or better than previously proposed Restricted Boltzmann | 74 performed similarly or better than previously proposed Restricted Boltzmann |
75 Machines in terms of unsupervised extraction of a hierarchy of features | 75 Machines in terms of unsupervised extraction of a hierarchy of features |
76 useful for classification. The principle is that each layer starting from | 76 useful for classification. The principle is that each layer starting from |
77 the bottom is trained to encode its input (the output of the previous | 77 the bottom is trained to encode their input (the output of the previous |
78 layer) and to reconstruct it from a corrupted version of it. After this | 78 layer) and try to reconstruct it from a corrupted version of it. After this |
79 unsupervised initialization, the stack of denoising auto-encoders can be | 79 unsupervised initialization, the stack of denoising auto-encoders can be |
80 converted into a deep supervised feedforward neural network and fine-tuned by | 80 converted into a deep supervised feedforward neural network and fine-tuned by |
81 stochastic gradient descent. | 81 stochastic gradient descent. |
82 | 82 |
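As a rough illustration of the procedure described above (greedy layer-wise denoising pretraining, then conversion of the stack into a feedforward classifier for supervised fine-tuning with stochastic gradient descent), here is a minimal PyTorch-style sketch. It is not the paper's implementation; the layer sizes, masking-noise level, learning rate, and number of epochs are illustrative assumptions.

```python
import torch
import torch.nn as nn

def pretrain_layer(encoder, batches, corruption=0.25, epochs=10, lr=0.1):
    """One denoising auto-encoder: corrupt the input with masking noise,
    encode it, and reconstruct the *clean* input from the code."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for x in batches:  # x: (batch, in_features), values assumed in [0, 1]
            noisy = x * (torch.rand_like(x) > corruption).float()   # masking noise
            recon = torch.sigmoid(decoder(torch.sigmoid(encoder(noisy))))
            loss = nn.functional.binary_cross_entropy(recon, x)
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder

def build_sda(layer_sizes, n_classes, unlabeled_batches):
    """Greedy layer-wise pretraining; the resulting stack is topped with a
    linear classifier and can then be fine-tuned end-to-end with SGD."""
    layers, data = [], list(unlabeled_batches)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        enc = pretrain_layer(nn.Linear(n_in, n_out), data)
        layers += [enc, nn.Sigmoid()]
        with torch.no_grad():  # propagate the data to feed the next layer
            data = [torch.sigmoid(enc(x)) for x in data]
    return nn.Sequential(*layers, nn.Linear(layer_sizes[-1], n_classes))

# Hypothetical usage: net = build_sda([32 * 32, 1000, 1000, 1000], 62, unlabeled_batches),
# followed by ordinary supervised training of `net` with a cross-entropy loss.
```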
83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
89 and multi-task learning, not much has been done yet to explore the impact | 89 and multi-task learning, not much has been done yet to explore the impact |
90 of {\em out-of-distribution} examples and of the multi-task setting | 90 of {\em out-of-distribution} examples and of the multi-task setting |
91 (but see~\citep{CollobertR2008}). In particular the {\em relative | 91 (but see~\citep{CollobertR2008}). In particular the {\em relative |
92 advantage} of deep learning for this setting has not been evaluated. | 92 advantage} of deep learning for this setting has not been evaluated. |
93 | 93 |
94 % TODO: Explain why we care about this question. | |
95 | |
96 In this paper we ask the following questions: | 94 In this paper we ask the following questions: |
97 | 95 |
98 %\begin{enumerate} | 96 %\begin{enumerate} |
99 $\bullet$ %\item | 97 $\bullet$ %\item |
100 Do the good results previously obtained with deep architectures on the | 98 Do the good results previously obtained with deep architectures on the |
115 Similarly, does the feature learning step in deep learning algorithms benefit more | 113 Similarly, does the feature learning step in deep learning algorithms benefit more |
116 from training with similar but different classes (i.e., a multi-task learning scenario) than | 114 from training with similar but different classes (i.e., a multi-task learning scenario) than |
117 a corresponding shallow and purely supervised architecture? | 115 a corresponding shallow and purely supervised architecture? |
118 %\end{enumerate} | 116 %\end{enumerate} |
119 | 117 |
120 Our experimental results provide evidence to support positive answers to all of these questions. | 118 The experimental results presented here provide positive evidence towards all of these questions. |
121 | 119 |
122 \vspace*{-1mm} | 120 \vspace*{-1mm} |
123 \section{Perturbation and Transformation of Character Images} | 121 \section{Perturbation and Transformation of Character Images} |
124 \vspace*{-1mm} | 122 \vspace*{-1mm} |
125 | 123 |
199 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | 197 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times |
200 \sqrt[3]{complexity}$.\\ | 198 \sqrt[3]{complexity}$.\\ |
201 {\bf Pinch.} | 199 {\bf Pinch.} |
202 This GIMP filter is named ``Whirl and | 200 This GIMP filter is named ``Whirl and |
203 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic | 201 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic |
204 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). | 202 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}. |
205 For a square input image, think of drawing a circle of | 203 For a square input image, think of drawing a circle of |
206 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | 204 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to |
207 that disk (region inside circle) will have its value recalculated by taking | 205 that disk (region inside circle) will have its value recalculated by taking |
208 the value of another ``source'' pixel in the original image. The position of | 206 the value of another ``source'' pixel in the original image. The position of |
209 that source pixel is found on the line that goes through $C$ and $P$, but | 207 that source pixel is found on the line that goes through $C$ and $P$, but |
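The hunk cuts off before the exact rule that places the source pixel along the line through $C$ and $P$. Purely as an illustration of this kind of radial remapping, the sketch below uses a hypothetical power-law mapping of the normalised radius; the `strength` parameter and the mapping itself are assumptions, not the GIMP formula.

```python
import numpy as np

def pinch(image, strength=0.5):
    """Remap pixels inside the inscribed circle: each destination pixel P takes
    the value of a source pixel on the line through the centre C and P,
    at a remapped (here: power-law) distance from C."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cy, cx)
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    d = np.sqrt(dy ** 2 + dx ** 2) / radius      # normalised distance from the centre
    inside = d < 1.0
    scale = np.ones_like(d)
    scale[inside] = d[inside] ** strength        # hypothetical remapping of the radius
    src_y = np.clip(np.rint(cy + dy * scale), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(cx + dx * scale), 0, w - 1).astype(int)
    out = image.copy()
    out[inside] = image[src_y[inside], src_x[inside]]
    return out
```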
333 the best SDA (again according to validation set error), along with a precise estimate | 331 the best SDA (again according to validation set error), along with a precise estimate |
334 of human performance obtained via Amazon's Mechanical Turk (AMT) | 332 of human performance obtained via Amazon's Mechanical Turk (AMT) |
335 service\footnote{http://mturk.com}. | 333 service\footnote{http://mturk.com}. |
336 AMT users are paid small amounts | 334 AMT users are paid small amounts |
337 of money to perform tasks for which human intelligence is required. | 335 of money to perform tasks for which human intelligence is required. |
338 Mechanical Turk has been used extensively in natural language processing and vision. | 336 Mechanical Turk has been used extensively in natural language |
339 %processing \citep{SnowEtAl2008} and vision | 337 processing \citep{SnowEtAl2008} and vision |
340 %\citep{SorokinAndForsyth2008,whitehill09}. | 338 \citep{SorokinAndForsyth2008,whitehill09}. |
341 %\citep{SorokinAndForsyth2008,whitehill09}. | |
342 AMT users were presented | 339 AMT users were presented |
343 with 10 character images and asked to type 10 corresponding ASCII | 340 with 10 character images and asked to type 10 corresponding ASCII |
344 characters. They were forced to make a hard choice among the | 341 characters. They were forced to make a hard choice among the |
345 62 or 10 character classes (all classes or digits only). | 342 62 or 10 character classes (all classes or digits only). |
346 Three users classified each image, allowing | 343 Three users classified each image, allowing |
582 \fi | 579 \fi |
583 | 580 |
584 | 581 |
585 \begin{figure}[h] | 582 \begin{figure}[h] |
586 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | 583 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ |
587 \caption{Charts corresponding to tables 2 (left) and 3 (right), from Appendix I.} | 584 \caption{Relative improvement in error rate due to self-taught learning. |
585 Left: Improvement (or loss, when negative) | |
586 induced by out-of-distribution examples (perturbed data). | |
587 Right: Improvement (or loss, when negative) induced by multi-task | |
588 learning (training on all classes and testing only on either digits, | |
589 upper case, or lower-case). The deep learner (SDA) benefits more from | |
590 both self-taught learning scenarios, compared to the shallow MLP.} | |
588 \label{fig:improvements-charts} | 591 \label{fig:improvements-charts} |
589 \end{figure} | 592 \end{figure} |
590 | 593 |
591 \vspace*{-1mm} | 594 \vspace*{-1mm} |
592 \section{Conclusions} | 595 \section{Conclusions} |