ift6266: comparison of writeup/nips2010_submission.tex @ 514:920a38715c90
merge
| author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
|---|---|
| date | Tue, 01 Jun 2010 14:05:21 -0400 |
| parents | 66a905508e34 d057941417ed |
| children | 092dae9a5040 |
513:66a905508e34 (parent) | 514:920a38715c90 (this changeset) |
---|---|
18 \vspace*{-2mm} | 18 \vspace*{-2mm} |
19 \begin{abstract} | 19 \begin{abstract} |
20 Recent theoretical and empirical work in statistical machine learning has | 20 Recent theoretical and empirical work in statistical machine learning has |
21 demonstrated the importance of learning algorithms for deep | 21 demonstrated the importance of learning algorithms for deep |
22 architectures, i.e., function classes obtained by composing multiple | 22 architectures, i.e., function classes obtained by composing multiple |
23 non-linear transformations. The self-taught learning (exploiting unlabeled | 23 non-linear transformations. Self-taught learning (exploiting unlabeled |
24 examples or examples from other distributions) has already been applied | 24 examples or examples from other distributions) has already been applied |
25 to deep learners, but mostly to show the advantage of unlabeled | 25 to deep learners, but mostly to show the advantage of unlabeled |
26 examples. Here we explore the advantage brought by {\em out-of-distribution | 26 examples. Here we explore the advantage brought by {\em out-of-distribution |
27 examples} and show that {\em deep learners benefit more from them than a | 27 examples} and show that {\em deep learners benefit more from them than a |
28 corresponding shallow learner}, in the area | 28 corresponding shallow learner}, in the area |
[... lines 29-71 not shown ...]
72 applied here, is the Denoising | 72 applied here, is the Denoising |
73 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which | 73 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which |
74 performed similarly or better than previously proposed Restricted Boltzmann | 74 performed similarly or better than previously proposed Restricted Boltzmann |
75 Machines in terms of unsupervised extraction of a hierarchy of features | 75 Machines in terms of unsupervised extraction of a hierarchy of features |
76 useful for classification. The principle is that each layer starting from | 76 useful for classification. The principle is that each layer starting from |
77 the bottom is trained to encode their input (the output of the previous | 77 the bottom is trained to encode its input (the output of the previous |
78 layer) and try to reconstruct it from a corrupted version of it. After this | 78 layer) and to reconstruct it from a corrupted version of it. After this |
79 unsupervised initialization, the stack of denoising auto-encoders can be | 79 unsupervised initialization, the stack of denoising auto-encoders can be |
80 converted into a deep supervised feedforward neural network and fine-tuned by | 80 converted into a deep supervised feedforward neural network and fine-tuned by |
81 stochastic gradient descent. | 81 stochastic gradient descent. |
82 | 82 |
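As a minimal sketch of the denoising auto-encoder training step described in the paragraph above (corrupt the input, encode the corrupted version, reconstruct the clean input), one common formulation is given below; the notation ($x$, $\tilde{x}$, $h$, $\hat{x}$, the corruption distribution $q_{\mathcal{D}}$, the squashing function $s$, and the parameters $W$, $b$, $W'$, $b'$) is chosen here for illustration and is not taken from either revision of the paper.

% Illustrative formulation only; the symbols are assumptions, not the paper's notation.
\begin{align*}
  \tilde{x} &\sim q_{\mathcal{D}}(\tilde{x} \mid x)  && \text{stochastically corrupt the input} \\
  h &= s(W\tilde{x} + b)                             && \text{encode the corrupted input} \\
  \hat{x} &= s(W'h + b')                             && \text{decode, i.e.\ reconstruct the clean input} \\
  L(x,\hat{x}) &= -\sum_i \big[ x_i \log \hat{x}_i + (1-x_i)\log(1-\hat{x}_i) \big]
  && \text{reconstruction loss (for inputs in } [0,1] \text{), minimized by SGD}
\end{align*}

After this unsupervised stage, each layer's encoder ($W$, $b$) would initialize one layer of the deep feedforward network, which is then fine-tuned with a supervised objective by stochastic gradient descent, as the paragraph states.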
83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
93 The hypothesis explored here is that a deep hierarchy of features | 93 The hypothesis explored here is that a deep hierarchy of features |
94 may be better able to provide sharing of statistical strength | 94 may be better able to provide sharing of statistical strength |
95 between different regions in input space or different tasks, | 95 between different regions in input space or different tasks, |
96 as discussed in the conclusion. | 96 as discussed in the conclusion. |
97 | 97 |
| 98 % TODO: why we care to evaluate this relative advantage |
| 99 |
98 In this paper we ask the following questions: | 100 In this paper we ask the following questions: |
99 | 101 |
100 %\begin{enumerate} | 102 %\begin{enumerate} |
101 $\bullet$ %\item | 103 $\bullet$ %\item |
102 Do the good results previously obtained with deep architectures on the | 104 Do the good results previously obtained with deep architectures on the |
[... lines 103-116 / 105-118 not shown ...]
117 Similarly, does the feature learning step in deep learning algorithms benefit more from | 119 Similarly, does the feature learning step in deep learning algorithms benefit more from |
118 training with similar but different classes (i.e. a multi-task learning scenario) than | 120 training with similar but different classes (i.e. a multi-task learning scenario) than |
119 a corresponding shallow and purely supervised architecture? | 121 a corresponding shallow and purely supervised architecture? |
120 %\end{enumerate} | 122 %\end{enumerate} |
121 | 123 |
122 The experimental results presented here provide positive evidence towards all of these questions. | 124 Our experimental results provide positive evidence towards all of these questions. |
123 | 125 |
124 \vspace*{-1mm} | 126 \vspace*{-1mm} |
125 \section{Perturbation and Transformation of Character Images} | 127 \section{Perturbation and Transformation of Character Images} |
126 \vspace*{-1mm} | 128 \vspace*{-1mm} |
127 | 129 |