comparison writeup/nips2010_submission.tex @ 507:b8e33d3d7f65
merge
repository: ift6266
author:     Yoshua Bengio <bengioy@iro.umontreal.ca>
date:       Tue, 01 Jun 2010 13:57:16 -0400
parents:    8bf07979b8ba a41a8925be70
children:   8c2ab4f246b1 6f042a71be23
506:8bf07979b8ba (old) | 507:b8e33d3d7f65 (new)
18 \vspace*{-2mm} | 18 \vspace*{-2mm} |
19 \begin{abstract} | 19 \begin{abstract} |
20 Recent theoretical and empirical work in statistical machine learning has | 20 Recent theoretical and empirical work in statistical machine learning has |
21 demonstrated the importance of learning algorithms for deep | 21 demonstrated the importance of learning algorithms for deep |
22 architectures, i.e., function classes obtained by composing multiple | 22 architectures, i.e., function classes obtained by composing multiple |
23 non-linear transformations. The self-taught learning (exploiting unlabeled | 23 non-linear transformations. Self-taught learning (exploiting unlabeled |
24 examples or examples from other distributions) has already been applied | 24 examples or examples from other distributions) has already been applied |
25 to deep learners, but mostly to show the advantage of unlabeled | 25 to deep learners, but mostly to show the advantage of unlabeled |
26 examples. Here we explore the advantage brought by {\em out-of-distribution | 26 examples. Here we explore the advantage brought by {\em out-of-distribution |
27 examples} and show that {\em deep learners benefit more from them than a | 27 examples} and show that {\em deep learners benefit more from them than a |
28 corresponding shallow learner}, in the area | 28 corresponding shallow learner}, in the area |
72 applied here, is the Denoising | 72 applied here, is the Denoising |
73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which | 73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which |
74 performed similarly or better than previously proposed Restricted Boltzmann | 74 performed similarly or better than previously proposed Restricted Boltzmann |
75 Machines in terms of unsupervised extraction of a hierarchy of features | 75 Machines in terms of unsupervised extraction of a hierarchy of features |
76 useful for classification. The principle is that each layer starting from | 76 useful for classification. The principle is that each layer starting from |
77 the bottom is trained to encode their input (the output of the previous | 77 the bottom is trained to encode its input (the output of the previous |
78 layer) and try to reconstruct it from a corrupted version of it. After this | 78 layer) and to reconstruct it from a corrupted version of it. After this |
79 unsupervised initialization, the stack of denoising auto-encoders can be | 79 unsupervised initialization, the stack of denoising auto-encoders can be |
80 converted into a deep supervised feedforward neural network and fine-tuned by | 80 converted into a deep supervised feedforward neural network and fine-tuned by |
81 stochastic gradient descent. | 81 stochastic gradient descent. |
82 | 82 |
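As a rough sketch of the per-layer principle described above (illustrative only, not the code used for the experiments in this paper), a single denoising auto-encoder layer can be trained as follows; the masking-noise corruption, sigmoid units, tied weights, and cross-entropy reconstruction cost are assumptions made here for concreteness:

# Minimal numpy sketch of one denoising auto-encoder layer (illustrative only;
# sigmoid units, masking noise, tied weights, and cross-entropy cost are
# assumptions, not necessarily the exact choices made in the paper).
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoEncoderLayer:
    def __init__(self, n_visible, n_hidden, corruption=0.25, lr=0.1):
        self.W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)   # encoder bias
        self.b_vis = np.zeros(n_visible)  # decoder bias
        self.corruption = corruption
        self.lr = lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hid)

    def decode(self, h):
        return sigmoid(h @ self.W.T + self.b_vis)  # tied weights

    def sgd_step(self, x):
        # Corrupt the input (masking noise), encode the corrupted version,
        # and reconstruct the *clean* input from the code.
        mask = rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        x_tilde = x * mask
        h = self.encode(x_tilde)
        z = self.decode(h)
        # Gradients of the cross-entropy reconstruction cost
        # (for sigmoid outputs, d cost / d pre-activation = z - x).
        dz = z - x
        dh = (dz @ self.W) * h * (1.0 - h)
        grad_W = (x_tilde.T @ dh + dz.T @ h) / x.shape[0]  # tied-weight gradient
        self.W     -= self.lr * grad_W
        self.b_hid -= self.lr * dh.mean(axis=0)
        self.b_vis -= self.lr * dz.mean(axis=0)
        # Return average cross-entropy reconstruction cost for monitoring.
        eps = 1e-9
        return float(np.mean(-np.sum(x * np.log(z + eps)
                                     + (1 - x) * np.log(1 - z + eps), axis=1)))

# Example: unsupervised pre-training of one layer on random "pixel" data.
layer = DenoisingAutoEncoderLayer(n_visible=32 * 32, n_hidden=500)
batch = rng.uniform(size=(20, 32 * 32))
cost = layer.sgd_step(batch)

Stacking such layers (each trained on the hidden representation of the layer below) and adding a supervised output layer yields the deep feedforward network that is then fine-tuned end-to-end by stochastic gradient descent, as described above.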
83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
89 and multi-task learning, not much has been done yet to explore the impact | 89 and multi-task learning, not much has been done yet to explore the impact |
90 of {\em out-of-distribution} examples and of the multi-task setting | 90 of {\em out-of-distribution} examples and of the multi-task setting |
91 (but see~\citep{CollobertR2008}). In particular, the {\em relative | 91 (but see~\citep{CollobertR2008}). In particular, the {\em relative |
92 advantage} of deep learning for this setting has not been evaluated. | 92 advantage} of deep learning for this setting has not been evaluated. |
93 | 93 |
94 % TODO: Explain why we care about this question. | |
95 | |
94 In this paper we ask the following questions: | 96 In this paper we ask the following questions: |
95 | 97 |
96 %\begin{enumerate} | 98 %\begin{enumerate} |
97 $\bullet$ %\item | 99 $\bullet$ %\item |
98 Do the good results previously obtained with deep architectures on the | 100 Do the good results previously obtained with deep architectures on the |
113 Similarly, does the feature learning step in deep learning algorithms benefit more | 115 Similarly, does the feature learning step in deep learning algorithms benefit more |
114 from training with similar but different classes (i.e., a multi-task learning scenario) than | 116 from training with similar but different classes (i.e., a multi-task learning scenario) than |
115 a corresponding shallow and purely supervised architecture? | 117 a corresponding shallow and purely supervised architecture? |
116 %\end{enumerate} | 118 %\end{enumerate} |
117 | 119 |
118 The experimental results presented here provide positive evidence towards all of these questions. | 120 Our experimental results provide evidence to support positive answers to all of these questions. |
119 | 121 |
120 \vspace*{-1mm} | 122 \vspace*{-1mm} |
121 \section{Perturbation and Transformation of Character Images} | 123 \section{Perturbation and Transformation of Character Images} |
122 \vspace*{-1mm} | 124 \vspace*{-1mm} |
123 | 125 |
581 \fi | 583 \fi |
582 | 584 |
583 | 585 |
584 \begin{figure}[h] | 586 \begin{figure}[h] |
585 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | 587 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ |
586 \caption{Relative improvement in error rate due to self-taught learning. | 588 \caption{Charts corresponding to Tables 2 (left) and 3 (right), from Appendix I.} |
587 Left: Improvement (or loss, when negative) | |
588 induced by out-of-distribution examples (perturbed data). | |
589 Right: Improvement (or loss, when negative) induced by multi-task | |
590 learning (training on all classes and testing only on either digits, | |
591 upper case, or lower-case). The deep learner (SDA) benefits more from | |
592 both self-taught learning scenarios, compared to the shallow MLP.} | |
593 \label{fig:improvements-charts} | 589 \label{fig:improvements-charts} |
594 \end{figure} | 590 \end{figure} |
595 | 591 |
596 \vspace*{-1mm} | 592 \vspace*{-1mm} |
597 \section{Conclusions} | 593 \section{Conclusions} |