comparison writeup/nips2010_submission.tex @ 504:e837ef6eef8c

commit early, commit often: a couple of changes to kick-start things
author dumitru@dumitru.mtv.corp.google.com
date Tue, 01 Jun 2010 10:53:07 -0700
parents 5927432d8b8d
children a41a8925be70
comparison of revisions 501:5927432d8b8d and 504:e837ef6eef8c
18 \vspace*{-2mm} 18 \vspace*{-2mm}
19 \begin{abstract} 19 \begin{abstract}
20 Recent theoretical and empirical work in statistical machine learning has 20 Recent theoretical and empirical work in statistical machine learning has
21 demonstrated the importance of learning algorithms for deep 21 demonstrated the importance of learning algorithms for deep
22 architectures, i.e., function classes obtained by composing multiple 22 architectures, i.e., function classes obtained by composing multiple
23 non-linear transformations. The self-taught learning (exploiting unlabeled 23 non-linear transformations. Self-taught learning (exploiting unlabeled
24 examples or examples from other distributions) has already been applied 24 examples or examples from other distributions) has already been applied
25 to deep learners, but mostly to show the advantage of unlabeled 25 to deep learners, but mostly to show the advantage of unlabeled
26 examples. Here we explore the advantage brought by {\em out-of-distribution 26 examples. Here we explore the advantage brought by {\em out-of-distribution
27 examples} and show that {\em deep learners benefit more from them than a 27 examples} and show that {\em deep learners benefit more from them than a
28 corresponding shallow learner}, in the area 28 corresponding shallow learner}, in the area
72 applied here, is the Denoising 72 applied here, is the Denoising
73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which 73 Auto-Encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small}, which
74 performed similarly to or better than previously proposed Restricted Boltzmann 74 performed similarly to or better than previously proposed Restricted Boltzmann
75 Machines in terms of unsupervised extraction of a hierarchy of features 75 Machines in terms of unsupervised extraction of a hierarchy of features
76 useful for classification. The principle is that each layer starting from 76 useful for classification. The principle is that each layer starting from
77 the bottom is trained to encode their input (the output of the previous 77 the bottom is trained to encode its input (the output of the previous
78 layer) and try to reconstruct it from a corrupted version of it. After this 78 layer) and to reconstruct it from a corrupted version of it. After this
79 unsupervised initialization, the stack of denoising auto-encoders can be 79 unsupervised initialization, the stack of denoising auto-encoders can be
80 converted into a deep supervised feedforward neural network and fine-tuned by 80 converted into a deep supervised feedforward neural network and fine-tuned by
81 stochastic gradient descent. 81 stochastic gradient descent.
82 82
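The stacking-and-fine-tuning recipe summarized in the paragraph above (corrupt the input, encode it, reconstruct the clean input, train each layer on the codes of the layer below, then fine-tune) is easier to follow with a concrete sketch. The following is an editorial illustration rather than the authors' implementation: it assumes masking noise, sigmoid units with tied weights, squared reconstruction error, and toy layer sizes and hyper-parameters; supervised fine-tuning is only indicated in a comment.

    # Minimal sketch of one denoising auto-encoder (DAE) layer and greedy
    # layer-wise pretraining. Assumptions (not from the paper): masking noise,
    # sigmoid units, tied weights, squared reconstruction error, toy sizes.
    import numpy as np

    rng = np.random.RandomState(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class DenoisingAutoEncoder:
        def __init__(self, n_visible, n_hidden, corruption=0.25, lr=0.05):
            self.W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
            self.b_hid = np.zeros(n_hidden)
            self.b_vis = np.zeros(n_visible)
            self.corruption = corruption
            self.lr = lr

        def encode(self, x):
            return sigmoid(x @ self.W + self.b_hid)

        def decode(self, h):
            return sigmoid(h @ self.W.T + self.b_vis)   # tied weights

        def train_step(self, x):
            # Corrupt the input by zeroing random entries, then learn to
            # reconstruct the *clean* input from the corrupted version.
            x_tilde = x * rng.binomial(1, 1.0 - self.corruption, size=x.shape)
            h = self.encode(x_tilde)
            x_rec = self.decode(h)
            # Backprop of the squared reconstruction error (tied weights).
            err = x_rec - x
            d_vis = err * x_rec * (1.0 - x_rec)          # delta at reconstruction
            d_hid = (d_vis @ self.W) * h * (1.0 - h)     # delta at hidden code
            grad_W = (x_tilde.T @ d_hid + d_vis.T @ h) / x.shape[0]
            self.W -= self.lr * grad_W
            self.b_hid -= self.lr * d_hid.mean(axis=0)
            self.b_vis -= self.lr * d_vis.mean(axis=0)
            return 0.5 * (err ** 2).sum(axis=1).mean()

    # Greedy layer-wise pretraining: each DAE is trained on the codes produced
    # by the layer below; fine-tuning (not shown) would stack the encoders into
    # a feedforward classifier and continue with supervised gradient descent.
    X = rng.rand(256, 32 * 32)                           # stand-in for 32x32 images
    inp = X
    for dae in [DenoisingAutoEncoder(32 * 32, 500), DenoisingAutoEncoder(500, 500)]:
        for epoch in range(5):
            losses = [dae.train_step(inp[i:i + 32]) for i in range(0, len(inp), 32)]
            print("reconstruction loss:", np.mean(losses))
        inp = dae.encode(inp)                            # input for the next layer

The stochastic corruption is what keeps each layer from simply copying its input; it pushes the learned features toward representations that are robust to partially destroyed inputs, which is the property the unsupervised initialization is meant to provide.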
83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles 83 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
89 and multi-task learning, not much has been done yet to explore the impact 89 and multi-task learning, not much has been done yet to explore the impact
90 of {\em out-of-distribution} examples and of the multi-task setting 90 of {\em out-of-distribution} examples and of the multi-task setting
91 (but see~\citep{CollobertR2008}). In particular, the {\em relative 91 (but see~\citep{CollobertR2008}). In particular, the {\em relative
92 advantage} of deep learning for this setting has not been evaluated. 92 advantage} of deep learning for this setting has not been evaluated.
93 93
94 % TODO: Explain why we care about this question.
95
94 In this paper we ask the following questions: 96 In this paper we ask the following questions:
95 97
96 %\begin{enumerate} 98 %\begin{enumerate}
97 $\bullet$ %\item 99 $\bullet$ %\item
98 Do the good results previously obtained with deep architectures on the 100 Do the good results previously obtained with deep architectures on the
113 Similarly, does the feature learning step in deep learning algorithms benefit more 115 Similarly, does the feature learning step in deep learning algorithms benefit more
114 from training with similar but different classes (i.e., a multi-task learning scenario) than 116 from training with similar but different classes (i.e., a multi-task learning scenario) than
115 a corresponding shallow and purely supervised architecture? 117 a corresponding shallow and purely supervised architecture?
116 %\end{enumerate} 118 %\end{enumerate}
117 119
118 The experimental results presented here provide positive evidence towards all of these questions. 120 Our experimental results provide evidence to support positive answers to all of these questions.
119 121
120 \vspace*{-1mm} 122 \vspace*{-1mm}
121 \section{Perturbation and Transformation of Character Images} 123 \section{Perturbation and Transformation of Character Images}
122 \vspace*{-1mm} 124 \vspace*{-1mm}
123 125
523 setting is similar for the other two target classes (lower case characters 525 setting is similar for the other two target classes (lower case characters
524 and upper case characters). 526 and upper case characters).
525 527
526 \begin{figure}[h] 528 \begin{figure}[h]
527 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ 529 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
528 \caption{Left: overall results; error bars indicate a 95\% confidence interval. 530 \caption{Charts corresponding to Table 1 of Appendix I. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from literature.}
529 Right: error rates on NIST test digits only, with results from literature. }
530 \label{fig:error-rates-charts} 531 \label{fig:error-rates-charts}
531 \end{figure} 532 \end{figure}
532 533
533 %\vspace*{-1mm} 534 %\vspace*{-1mm}
534 %\subsection{Perturbed Training Data More Helpful for SDAE} 535 %\subsection{Perturbed Training Data More Helpful for SDAE}
564 \fi 565 \fi
565 566
566 567
567 \begin{figure}[h] 568 \begin{figure}[h]
568 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ 569 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
569 \caption{Relative improvement in error rate due to self-taught learning. 570 \caption{Charts corresponding to Tables 2 (left) and 3 (right) of Appendix I.}
570 Left: Improvement (or loss, when negative)
571 induced by out-of-distribution examples (perturbed data).
572 Right: Improvement (or loss, when negative) induced by multi-task
573 learning (training on all classes and testing only on either digits,
574 upper case, or lower-case). The deep learner (SDA) benefits more from
575 both self-taught learning scenarios, compared to the shallow MLP.}
576 \label{fig:improvements-charts} 571 \label{fig:improvements-charts}
577 \end{figure} 572 \end{figure}
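A note on reading the improvement charts: the quantity plotted is the relative improvement in error rate brought by self-taught learning. The exact definition is given in Appendix I rather than in this excerpt, so the formula below is an assumed reading of that convention, not a quotation from the paper:

$$\text{relative improvement} \;=\; \frac{\epsilon_{\text{baseline}} - \epsilon_{\text{self-taught}}}{\epsilon_{\text{baseline}}},$$

under which positive values mean that out-of-distribution examples or multi-task training reduced the test error, and negative values correspond to the ``loss, when negative'' wording of the earlier caption.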
578 573
579 \vspace*{-1mm} 574 \vspace*{-1mm}
580 \section{Conclusions} 575 \section{Conclusions}