comparison writeup/nips2010_submission.tex @ 514:920a38715c90

merge
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 14:05:21 -0400
parents 66a905508e34 d057941417ed
children 092dae9a5040
\vspace*{-2mm}
\begin{abstract}
Recent theoretical and empirical work in statistical machine learning has
demonstrated the importance of learning algorithms for deep
architectures, i.e., function classes obtained by composing multiple
non-linear transformations. Self-taught learning (exploiting unlabeled
examples or examples from other distributions) has already been applied
to deep learners, but mostly to show the advantage of unlabeled
examples. Here we explore the advantage brought by {\em out-of-distribution
examples} and show that {\em deep learners benefit more from them than a
corresponding shallow learner}, in the area

[...]
applied here, is the Denoising
Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which
performed similarly to or better than previously proposed Restricted Boltzmann
Machines in terms of unsupervised extraction of a hierarchy of features
useful for classification. The principle is that each layer, starting from
the bottom, is trained to encode its input (the output of the previous
layer) and to reconstruct it from a corrupted version of it. After this
unsupervised initialization, the stack of denoising auto-encoders can be
converted into a deep supervised feedforward neural network and fine-tuned by
stochastic gradient descent.

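The layer-wise pre-training procedure described above can be sketched in a few
lines of code. This is a minimal illustration rather than the paper's
implementation: the masking corruption, sigmoid units, tied weights,
cross-entropy reconstruction loss, learning rate and toy layer sizes are all
assumptions made for the sake of the example.

# Minimal sketch of greedy layer-wise denoising auto-encoder pre-training
# (illustrative only; hyper-parameters and modeling choices are assumptions).
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoEncoder:
    def __init__(self, n_in, n_hidden, corruption=0.25, lr=0.1):
        self.W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_in)
        self.corruption, self.lr = corruption, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hid)

    def train_step(self, x):
        # Corrupt the input by randomly zeroing a fraction of its components,
        # then encode the corrupted input and reconstruct the clean one.
        mask = rng.binomial(1, 1.0 - self.corruption, x.shape)
        h = sigmoid((x * mask) @ self.W + self.b_hid)
        z = sigmoid(h @ self.W.T + self.b_vis)          # tied weights
        # Gradients of the cross-entropy reconstruction loss.
        dz = z - x                                      # (batch, n_in)
        dh = (dz @ self.W) * h * (1.0 - h)              # (batch, n_hidden)
        self.W -= self.lr * ((x * mask).T @ dh + dz.T @ h) / len(x)
        self.b_hid -= self.lr * dh.mean(axis=0)
        self.b_vis -= self.lr * dz.mean(axis=0)

def pretrain_stack(X, layer_sizes, epochs=10, batch=20):
    # Greedy layer-wise pre-training: the first DAE is trained on the data,
    # each subsequent DAE on the codes produced by the layer below it.
    layers, data = [], X
    for n_hidden in layer_sizes:
        dae = DenoisingAutoEncoder(data.shape[1], n_hidden)
        for _ in range(epochs):
            for i in range(0, len(data), batch):
                dae.train_step(data[i:i + batch])
        layers.append(dae)
        data = dae.encode(data)   # clean codes become the next layer's input
    return layers

# Toy usage with random 32x32 binary "images".
X = (rng.rand(500, 1024) > 0.5).astype(float)
stack = pretrain_stack(X, [256, 64])
# The stacked encoders would then initialize a feedforward network to which a
# supervised output layer is added, fine-tuned by stochastic gradient descent.
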
Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles

[...]

The hypothesis explored here is that a deep hierarchy of features
may be better able to provide sharing of statistical strength
between different regions in input space or different tasks,
as discussed in the conclusion.

% TODO: why we care to evaluate this relative advantage

In this paper we ask the following questions:

%\begin{enumerate}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the

[...]

Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e., a multi-task learning
scenario) than a corresponding shallow and purely supervised architecture?
%\end{enumerate}

Our experimental results provide positive evidence for all of these questions.

\vspace*{-1mm}
\section{Perturbation and Transformation of Character Images}
\vspace*{-1mm}
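The section that begins here deals with perturbing and transforming character
images. Since its body is not part of this excerpt, the sketch below only
illustrates the general kind of operation such a pipeline applies; the choice
of random rotation, shear and pixel noise is an assumption made for
illustration, not the set of transformations the paper defines.

# Generic perturbation of a character image (illustrative stand-ins only:
# a random rotation, a random shear and salt-and-pepper pixel noise).
import numpy as np
from scipy.ndimage import affine_transform, rotate

rng = np.random.RandomState(0)

def perturb(img, max_angle=15.0, max_shear=0.3, noise_prob=0.05):
    # Random rotation about the image centre.
    out = rotate(img, rng.uniform(-max_angle, max_angle),
                 reshape=False, order=1, mode='constant')
    # Random shear: each output row is sampled from a column-dependent input row.
    s = rng.uniform(-max_shear, max_shear)
    out = affine_transform(out, [[1.0, s], [0.0, 1.0]],
                           offset=[-s * img.shape[1] / 2.0, 0.0],
                           order=1, mode='constant')
    # Replace a random subset of pixels with random values.
    mask = rng.rand(*out.shape) < noise_prob
    out[mask] = rng.rand(int(mask.sum()))
    return np.clip(out, 0.0, 1.0)

# Toy usage: a crude vertical stroke standing in for a character image.
img = np.zeros((32, 32))
img[8:24, 14:18] = 1.0
distorted = perturb(img)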