ift6266: comparison of writeup/aistats2011_submission.tex @ 602:203c6071e104
aistats submission looking good
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
date | Sun, 31 Oct 2010 22:27:30 -0400 |
parents | 1f5d2d01b84d |
children | eb6244c6d861 |
comparison
601:84cb106ef428 (before) | 602:203c6071e104 (after) |
1 %\documentclass[twoside,11pt]{article} % For LaTeX2e | 1 %\documentclass[twoside,11pt]{article} % For LaTeX2e |
2 \documentclass{article} % For LaTeX2e | 2 \documentclass{article} % For LaTeX2e |
3 \usepackage{aistats2e_2011} | 3 \usepackage{aistats2e_2011} |
4 \usepackage{times} | 4 %\usepackage{times} |
5 \usepackage{wrapfig} | 5 \usepackage{wrapfig} |
6 \usepackage{amsthm} | 6 \usepackage{amsthm} |
7 \usepackage{amsmath} | 7 \usepackage{amsmath} |
8 \usepackage{bbm} | 8 \usepackage{bbm} |
9 \usepackage[utf8]{inputenc} | 9 \usepackage[utf8]{inputenc} |
18 | 18 |
19 %\setlength\parindent{0mm} | 19 %\setlength\parindent{0mm} |
20 | 20 |
21 \begin{document} | 21 \begin{document} |
22 | 22 |
23 \title{Deeper Learners Benefit More from Multi-Task and Perturbed Examples} | 23 \twocolumn[ |
24 \author{ | 24 \aistatstitle{Deeper Learners Benefit More from Multi-Task and Perturbed Examples} |
25 \runningtitle{Deep Learners for Out-of-Distribution Examples} | |
26 \runningauthor{Bengio et al.} | |
27 \aistatsauthor{Anonymous Authors}] | |
28 \iffalse | |
25 Yoshua Bengio \and | 29 Yoshua Bengio \and |
26 Frédéric Bastien \and | 30 Frédéric Bastien \and |
27 Arnaud Bergeron \and | 31 Arnaud Bergeron \and |
28 Nicolas Boulanger-Lewandowski \and | 32 Nicolas Boulanger-Lewandowski \and |
29 Thomas Breuel \and | 33 Thomas Breuel \and |
37 Sylvain Pannetier Lebeuf \and | 41 Sylvain Pannetier Lebeuf \and |
38 Razvan Pascanu \and | 42 Razvan Pascanu \and |
39 Salah Rifai \and | 43 Salah Rifai \and |
40 Francois Savard \and | 44 Francois Savard \and |
41 Guillaume Sicard | 45 Guillaume Sicard |
42 } | 46 %} |
43 \date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada} | 47 \fi |
48 %\aistatsaddress{Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada} | |
49 %\date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada} | |
44 %\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al} | 50 %\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al} |
45 %\editor{} | 51 %\editor{} |
46 | 52 |
47 %\makeanontitle | 53 %\makeanontitle |
48 \maketitle | 54 %\maketitle |
49 | 55 |
50 %{\bf Running title: Deep Self-Taught Learning} | 56 %{\bf Running title: Deep Self-Taught Learning} |
51 | 57 |
52 \vspace*{-2mm} | 58 %\vspace*{-2mm} |
53 \begin{abstract} | 59 \begin{abstract} |
54 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because | 60 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because |
55 they can be shared across tasks and examples from different but related | 61 they can be shared across tasks and examples from different but related |
56 distributions, can yield even more benefits where there are more such levels of representation. The experiments are performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits). We show that a deep learner could not only {\em beat previously published results but also reach human-level performance}. | 62 distributions, can yield even more benefits where there are more such levels of representation. The experiments are performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits). We show that a deep learner could not only {\em beat previously published results but also reach human-level performance}. |
57 \end{abstract} | 63 \end{abstract} |
58 \vspace*{-3mm} | 64 %\vspace*{-3mm} |
59 | 65 |
60 %\begin{keywords} | 66 %\begin{keywords} |
61 %Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning | 67 %Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning |
62 %\end{keywords} | 68 %\end{keywords} |
63 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} | 69 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} |
64 | 70 |
65 | 71 |
66 | 72 |
67 \section{Introduction} | 73 \section{Introduction} |
68 \vspace*{-1mm} | 74 %\vspace*{-1mm} |
69 | 75 |
70 {\bf Deep Learning} has emerged as a promising new area of research in | 76 {\bf Deep Learning} has emerged as a promising new area of research in |
71 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review. | 77 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review. |
72 Learning algorithms for deep architectures are centered on the learning | 78 Learning algorithms for deep architectures are centered on the learning |
73 of useful representations of data, which are better suited to the task at hand, | 79 of useful representations of data, which are better suited to the task at hand, |
103 stochastic gradient descent. | 109 stochastic gradient descent. |
104 One of these layer initialization techniques, | 110 One of these layer initialization techniques, |
105 applied here, is the Denoising | 111 applied here, is the Denoising |
106 Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see | 112 Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see |
107 Figure~\ref{fig:da}), which performed similarly or | 113 Figure~\ref{fig:da}), which performed similarly or |
108 better~\citep{VincentPLarchelleH2008-very-small} than previously | 114 better~\citep{VincentPLarochelleH2008-very-small} than previously |
109 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06} | 115 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06} |
110 in terms of unsupervised extraction | 116 in terms of unsupervised extraction |
111 of a hierarchy of features useful for classification. Each layer is trained | 117 of a hierarchy of features useful for classification. Each layer is trained |
112 to denoise its input, creating a layer of features that can be used as | 118 to denoise its input, creating a layer of features that can be used as |
113 input for the next layer. Note that training a Denoising Auto-Encoder | 119 input for the next layer. Note that training a Denoising Auto-Encoder |
208 the more general question of why deep learners may benefit so much from | 214 the more general question of why deep learners may benefit so much from |
209 the self-taught learning framework. Since out-of-distribution data | 215 the self-taught learning framework. Since out-of-distribution data |
210 (perturbed or from other related classes) is very common, this conclusion | 216 (perturbed or from other related classes) is very common, this conclusion |
211 is of practical importance. | 217 is of practical importance. |
212 | 218 |
213 \vspace*{-3mm} | 219 %\vspace*{-3mm} |
214 %\newpage | 220 %\newpage |
215 \section{Perturbed and Transformed Character Images} | 221 \section{Perturbed and Transformed Character Images} |
216 \label{s:perturbations} | 222 \label{s:perturbations} |
217 \vspace*{-2mm} | 223 %\vspace*{-2mm} |
218 | 224 |
219 Figure~\ref{fig:transform} shows the different transformations we used to stochastically | 225 Figure~\ref{fig:transform} shows the different transformations we used to stochastically |
220 transform $32 \times 32$ source images (such as the one in Fig.~\ref{fig:torig}) | 226 transform $32 \times 32$ source images (such as the one in Fig.~\ref{fig:torig}) |
221 in order to obtain data from a larger distribution which | 227 in order to obtain data from a larger distribution which |
222 covers a domain substantially larger than the clean characters distribution from | 228 covers a domain substantially larger than the clean characters distribution from |
232 There are two main parts in the pipeline. The first one, | 238 There are two main parts in the pipeline. The first one, |
233 from slant to pinch below, performs transformations. The second | 239 from slant to pinch below, performs transformations. The second |
234 part, from blur to contrast, adds different kinds of noise. | 240 part, from blur to contrast, adds different kinds of noise. |
235 More details can be found in~\citep{ift6266-tr-anonymous}. | 241 More details can be found in~\citep{ift6266-tr-anonymous}. |
236 | 242 |
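For concreteness, the pipeline described above can be sketched as follows. This is a rough illustration only: the module interface (`transform`) and the per-module application probability `p_apply` are hypothetical, not the authors' implementation; only the fixed module order and the idea of random choices of modules and perturbation amounts come from the text.

```python
import random

def perturb(image, modules, complexity=0.7, p_apply=0.5):
    """Stochastically transform one 32x32 source image.

    `modules` are applied in a fixed order (slant ... contrast); each module
    is applied with probability `p_apply` (hypothetical value), with a
    perturbation amount drawn up to `complexity`.
    """
    for module in modules:
        if random.random() < p_apply:                 # random choice of which modules to apply
            amount = random.uniform(0.0, complexity)  # random amount of perturbation
            image = module.transform(image, amount)
    return image
```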
237 \begin{figure}[ht] | 243 \begin{figure*}[ht] |
238 \centering | 244 \centering |
239 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} | 245 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} |
240 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} | 246 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} |
241 \subfigure[Slant]{\includegraphics[scale=0.6]{images/Slant_only.png}} | 247 \subfigure[Slant]{\includegraphics[scale=0.6]{images/Slant_only.png}} |
242 \subfigure[Affine Transformation]{\includegraphics[scale=0.6]{images/Affine_only.png}} | 248 \subfigure[Affine Transformation]{\includegraphics[scale=0.6]{images/Affine_only.png}} |
255 \caption{Top left (a): example original image. Others (b-o): examples of the effect | 261 \caption{Top left (a): example original image. Others (b-o): examples of the effect |
256 of each transformation module taken separately. Actual perturbed examples are obtained by | 262 of each transformation module taken separately. Actual perturbed examples are obtained by |
257 a pipeline of these, with random choices about which module to apply and how much perturbation | 263 a pipeline of these, with random choices about which module to apply and how much perturbation |
258 to apply.} | 264 to apply.} |
259 \label{fig:transform} | 265 \label{fig:transform} |
260 \vspace*{-2mm} | 266 %\vspace*{-2mm} |
261 \end{figure} | 267 \end{figure*} |
262 | 268 |
263 \vspace*{-3mm} | 269 %\vspace*{-3mm} |
264 \section{Experimental Setup} | 270 \section{Experimental Setup} |
265 \vspace*{-1mm} | 271 %\vspace*{-1mm} |
266 | 272 |
267 Much previous work on deep learning has been performed on | 273 Much previous work on deep learning has been performed on |
268 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, | 274 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
269 with 60~000 examples, and variants involving 10~000 | 275 with 60~000 examples, and variants involving 10~000 |
270 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}. | 276 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008}. |
271 The focus here is on much larger training sets, from 10 times | 277 The focus here is on much larger training sets, from 10 times |
272 to 1000 times larger, and 62 classes. | 278 to 1000 times larger, and 62 classes. |
273 | 279 |
274 The first step in constructing the larger datasets (called NISTP and P07) is to sample from | 280 The first step in constructing the larger datasets (called NISTP and P07) is to sample from |
275 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 281 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
299 example, and we were able to estimate the error variance due to this effect | 305 example, and we were able to estimate the error variance due to this effect |
300 because each image was classified by 3 different persons. | 306 because each image was classified by 3 different persons. |
301 The average error of humans on the 62-class task NIST test set | 307 The average error of humans on the 62-class task NIST test set |
302 is 18.2\%, with a standard error of 0.1\%. | 308 is 18.2\%, with a standard error of 0.1\%. |
303 | 309 |
304 \vspace*{-3mm} | 310 %\vspace*{-3mm} |
305 \subsection{Data Sources} | 311 \subsection{Data Sources} |
306 \vspace*{-2mm} | 312 %\vspace*{-2mm} |
307 | 313 |
308 %\begin{itemize} | 314 %\begin{itemize} |
309 %\item | 315 %\item |
310 {\bf NIST.} | 316 {\bf NIST.} |
311 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995}, | 317 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995}, |
334 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. | 340 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. |
335 % TODO: pointless to anonymize, it's not pointing to our work | 341 % TODO: pointless to anonymize, it's not pointing to our work |
336 Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total, from which we choose uniformly. | 342 Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total, from which we choose uniformly. |
337 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | 343 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, |
338 directly as input to our models. | 344 directly as input to our models. |
339 \vspace*{-1mm} | 345 %\vspace*{-1mm} |
340 | 346 |
341 %\item | 347 %\item |
342 {\bf Captchas.} | 348 {\bf Captchas.} |
343 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for | 349 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for |
344 generating characters of the same format as the NIST dataset. This software is based on | 350 generating characters of the same format as the NIST dataset. This software is based on |
345 a random character class generator and various kinds of transformations similar to those described in the previous sections. | 351 a random character class generator and various kinds of transformations similar to those described in the previous sections. |
346 In order to increase the variability of the data generated, many different fonts are used for generating the characters. | 352 In order to increase the variability of the data generated, many different fonts are used for generating the characters. |
347 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity | 353 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity |
348 depending on the value of the complexity parameter provided by the user of the data source. | 354 depending on the value of the complexity parameter provided by the user of the data source. |
349 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class? | 355 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class? |
350 \vspace*{-1mm} | 356 %\vspace*{-1mm} |
351 | 357 |
352 %\item | 358 %\item |
353 {\bf OCR data.} | 359 {\bf OCR data.} |
354 A large set (2 million) of scanned, OCRed and manually verified machine-printed | 360 A large set (2 million) of scanned, OCRed and manually verified machine-printed |
355 characters were included as an | 361 characters were included as an |
357 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern | 363 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern |
358 ({\tt http://www.iupr.com}), and which will be publicly released. | 364 ({\tt http://www.iupr.com}), and which will be publicly released. |
359 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this | 365 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this |
360 %\end{itemize} | 366 %\end{itemize} |
361 | 367 |
362 \vspace*{-3mm} | 368 %\vspace*{-3mm} |
363 \subsection{Data Sets} | 369 \subsection{Data Sets} |
364 \vspace*{-2mm} | 370 %\vspace*{-2mm} |
365 | 371 |
366 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label | 372 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label |
367 from one of the 62 character classes. | 373 from one of the 62 character classes. |
368 %\begin{itemize} | 374 %\begin{itemize} |
369 \vspace*{-1mm} | 375 %\vspace*{-1mm} |
370 | 376 |
371 %\item | 377 %\item |
372 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has | 378 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has |
373 \{651668 / 80000 / 82587\} \{training / validation / test\} examples. | 379 \{651668 / 80000 / 82587\} \{training / validation / test\} examples. |
374 \vspace*{-1mm} | 380 %\vspace*{-1mm} |
375 | 381 |
376 %\item | 382 %\item |
377 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | 383 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources |
378 and sending them through the transformation pipeline described in section \ref{s:perturbations}. | 384 and sending them through the transformation pipeline described in section \ref{s:perturbations}. |
379 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, | 385 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, |
380 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | 386 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the |
381 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. | 387 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. |
382 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. | 388 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. |
383 \vspace*{-1mm} | 389 %\vspace*{-1mm} |
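A minimal sketch of how one P07 example could be generated from the mixture described above. Only the source proportions and the $[0,0.7]$ complexity range come from the text; the `sources` objects, their `draw()` method, and the `pipeline` callable are placeholders.

```python
import random

# Mixture proportions from the text (fonts / captchas / OCR / NIST).
SOURCES = {"fonts": 0.10, "captchas": 0.25, "ocr": 0.25, "nist": 0.40}

def sample_p07_example(sources, pipeline):
    name = random.choices(list(SOURCES), weights=list(SOURCES.values()))[0]
    image, label = sources[name].draw()      # one raw 32x32 character and its class
    # inside the pipeline, a complexity is drawn uniformly in [0, 0.7]
    # for each transformation applied to this character
    image = pipeline(image, max_complexity=0.7)
    return image, label
```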
384 | 390 |
385 %\item | 391 %\item |
386 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) | 392 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) |
387 except that we only apply | 393 except that we only apply |
388 transformations from slant to pinch. Therefore, the character is | 394 transformations from slant to pinch. Therefore, the character is |
389 transformed but no additional noise is added to the image, giving images | 395 transformed but no additional noise is added to the image, giving images |
390 closer to the NIST dataset. | 396 closer to the NIST dataset. |
391 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. | 397 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. |
392 %\end{itemize} | 398 %\end{itemize} |
393 | 399 |
394 \vspace*{-3mm} | 400 \begin{figure*}[ht] |
401 %\vspace*{-2mm} | |
402 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
403 %\vspace*{-2mm} | |
404 \caption{Illustration of the computations and training criterion for the denoising | |
405 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
406 the layer (i.e., raw input or output of the previous layer) | |
407 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
408 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
409 is compared to the uncorrupted input $x$ through the loss function | |
410 $L_H(x,z)$, whose expected value is approximately minimized during training | |
411 by tuning $\theta$ and $\theta'$.} | |
412 \label{fig:da} | |
413 %\vspace*{-2mm} | |
414 \end{figure*} | |
415 | |
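The computation the caption describes can be written out explicitly. The affine-plus-sigmoid parametrization and the cross-entropy reconstruction loss below are the usual choices from Vincent et al. (2008) and are stated here as assumptions; the paper itself only names $f_\theta$, $g_{\theta'}$ and $L_H$.

```latex
% One pre-training step of a denoising auto-encoder layer (parametrization
% assumed, following Vincent et al., 2008).
\begin{align*}
  \tilde{x} &\sim q(\tilde{x}\mid x) && \text{stochastic corruption of the input}\\
  y &= f_\theta(\tilde{x}) = \mathrm{sigm}(W\tilde{x}+b) && \text{encoder}\\
  z &= g_{\theta'}(y) = \mathrm{sigm}(W'y+b') && \text{decoder / reconstruction}\\
  L_H(x,z) &= -\textstyle\sum_i \big[\, x_i\log z_i + (1-x_i)\log(1-z_i) \,\big] && \text{reconstruction cross-entropy}
\end{align*}
```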
416 %\vspace*{-3mm} | |
395 \subsection{Models and their Hyperparameters} | 417 \subsection{Models and their Hyperparameters} |
396 \vspace*{-2mm} | 418 %\vspace*{-2mm} |
397 | 419 |
398 The experiments are performed using MLPs (with a single | 420 The experiments are performed using MLPs (with a single |
399 hidden layer) and SDAs. | 421 hidden layer) and SDAs. |
400 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} | 422 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} |
401 | 423 |
414 Training examples are presented in minibatches of size 20. A constant learning | 436 Training examples are presented in minibatches of size 20. A constant learning |
415 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. | 437 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. |
416 %through preliminary experiments (measuring performance on a validation set), | 438 %through preliminary experiments (measuring performance on a validation set), |
417 %and $0.1$ (which was found to work best) was then selected for optimizing on | 439 %and $0.1$ (which was found to work best) was then selected for optimizing on |
418 %the whole training sets. | 440 %the whole training sets. |
419 \vspace*{-1mm} | 441 %\vspace*{-1mm} |
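A sketch of the learning-rate selection implied here: train one model per candidate constant rate and keep the one with the lowest validation error. Only the batch size, the candidate rates, and the use of the NISTP validation error come from the text; `train_mlp` and `error_rate` are hypothetical helpers standing in for the actual training code.

```python
# Candidate constant learning rates and minibatch size from the text.
LEARNING_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]
BATCH_SIZE = 20

def select_mlp(train_set, valid_set, n_hidden, train_mlp, error_rate):
    best_err, best_model = float("inf"), None
    for lr in LEARNING_RATES:
        model = train_mlp(train_set, n_hidden=n_hidden,
                          batch_size=BATCH_SIZE, learning_rate=lr)
        err = error_rate(model, valid_set)   # NISTP validation error in the paper
        if err < best_err:
            best_err, best_model = err, model
    return best_model, best_err
```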
420 | 442 |
421 | 443 |
422 {\bf Stacked Denoising Auto-Encoders (SDA).} | 444 {\bf Stacked Denoising Auto-Encoders (SDA).} |
423 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 445 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
424 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 446 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
435 distribution $P(x)$ and the conditional distribution of interest | 457 distribution $P(x)$ and the conditional distribution of interest |
436 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 458 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
437 taking advantage of the expressive power and bias implicit in the | 459 taking advantage of the expressive power and bias implicit in the |
438 deep architecture (whereby complex concepts are expressed as | 460 deep architecture (whereby complex concepts are expressed as |
439 compositions of simpler ones through a deep hierarchy). | 461 compositions of simpler ones through a deep hierarchy). |
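The stacking procedure itself can be summarized as greedy layer-wise unsupervised pre-training followed by supervised fine-tuning; the sketch below assumes placeholder callables (`pretrain_layer`, `encode`, `finetune`) and is not the authors' code.

```python
def build_sda(train_x, train_y, layer_sizes, pretrain_layer, encode, finetune):
    """Greedy layer-wise pre-training followed by supervised fine-tuning.

    pretrain_layer trains one denoising auto-encoder on its input
    representation, encode maps inputs through a trained layer, and
    finetune performs supervised back-propagation through the whole stack.
    """
    layers, h = [], train_x
    for size in layer_sizes:
        layer = pretrain_layer(h, n_hidden=size)   # unsupervised, one layer at a time
        layers.append(layer)
        h = encode(layer, h)                       # features fed to the next layer
    return finetune(layers, train_x, train_y)      # supervised fine-tuning of the stack
```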
440 | |
441 \begin{figure}[ht] | |
442 \vspace*{-2mm} | |
443 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
444 \vspace*{-2mm} | |
445 \caption{Illustration of the computations and training criterion for the denoising | |
446 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
447 the layer (i.e., raw input or output of the previous layer) | |
448 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
449 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
450 is compared to the uncorrupted input $x$ through the loss function | |
451 $L_H(x,z)$, whose expected value is approximately minimized during training | |
452 by tuning $\theta$ and $\theta'$.} | |
453 \label{fig:da} | |
454 \vspace*{-2mm} | |
455 \end{figure} | |
456 | 462 |
457 Here we chose to use the Denoising | 463 Here we chose to use the Denoising |
458 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for | 464 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for |
459 these deep hierarchies of features, as it is simple to train and | 465 these deep hierarchies of features, as it is simple to train and |
460 explain (see Figure~\ref{fig:da}, as well as | 466 explain (see Figure~\ref{fig:da}, as well as |
483 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The size of the hidden | 489 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The size of the hidden |
484 layers was kept constant across hidden layers, and the best results | 490 layers was kept constant across hidden layers, and the best results |
485 were obtained with the largest values that we could experiment | 491 were obtained with the largest values that we could experiment |
486 with given our patience, with 1000 hidden units. | 492 with given our patience, with 1000 hidden units. |
487 | 493 |
488 \vspace*{-1mm} | 494 %\vspace*{-1mm} |
489 | 495 |
490 \begin{figure}[ht] | 496 \begin{figure*}[ht] |
491 %\vspace*{-2mm} | 497 %\vspace*{-2mm} |
492 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} | 498 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} |
493 %\vspace*{-3mm} | 499 %\vspace*{-3mm} |
494 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained | 500 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained |
495 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | 501 on NIST, 1 on NISTP, and 2 on P07. Left: overall results |
496 of all models, on NIST and NISTP test sets. | 502 of all models, on NIST and NISTP test sets. |
497 Right: error rates on NIST test digits only, along with the previous results from | 503 Right: error rates on NIST test digits only, along with the previous results from |
498 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 504 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
499 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | 505 respectively based on ART, nearest neighbors, MLPs, and SVMs.} |
500 \label{fig:error-rates-charts} | 506 \label{fig:error-rates-charts} |
501 \vspace*{-2mm} | 507 %\vspace*{-2mm} |
502 \end{figure} | 508 \end{figure*} |
503 | 509 |
504 | 510 |
505 \begin{figure}[ht] | 511 \begin{figure*}[ht] |
506 \vspace*{-3mm} | 512 \vspace*{-3mm} |
507 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} | 513 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} |
508 \vspace*{-3mm} | 514 \vspace*{-3mm} |
509 \caption{Relative improvement in error rate due to self-taught learning. | 515 \caption{Relative improvement in error rate due to self-taught learning. |
510 Left: Improvement (or loss, when negative) | 516 Left: Improvement (or loss, when negative) |
513 learning (training on all classes and testing only on either digits, | 519 learning (training on all classes and testing only on either digits, |
514 upper case, or lower-case). The deep learner (SDA) benefits more from | 520 upper case, or lower-case). The deep learner (SDA) benefits more from |
515 both self-taught learning scenarios, compared to the shallow MLP.} | 521 both self-taught learning scenarios, compared to the shallow MLP.} |
516 \label{fig:improvements-charts} | 522 \label{fig:improvements-charts} |
517 \vspace*{-2mm} | 523 \vspace*{-2mm} |
518 \end{figure} | 524 \end{figure*} |
519 | 525 |
526 \vspace*{-2mm} | |
520 \section{Experimental Results} | 527 \section{Experimental Results} |
521 \vspace*{-2mm} | 528 \vspace*{-2mm} |
522 | 529 |
523 %%\vspace*{-1mm} | 530 %%\vspace*{-1mm} |
524 %\subsection{SDA vs MLP vs Humans} | 531 %\subsection{SDA vs MLP vs Humans} |
693 does not allow the shallow or purely supervised models to discover | 700 does not allow the shallow or purely supervised models to discover |
694 the kind of better basins associated | 701 the kind of better basins associated |
695 with deep learning and self-taught learning. | 702 with deep learning and self-taught learning. |
696 | 703 |
697 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 704 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
698 can be executed on-line at {\tt http://deep.host22.com}. | 705 can be executed on-line at the anonymous site {\tt http://deep.host22.com}. |
699 | 706 |
700 \iffalse | 707 \iffalse |
701 \section*{Appendix I: Detailed Numerical Results} | 708 \section*{Appendix I: Detailed Numerical Results} |
702 | 709 |
703 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered. | 710 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered. |