comparison writeup/aistats2011_submission.tex @ 602:203c6071e104

aistats submission looking good
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 31 Oct 2010 22:27:30 -0400
parents 1f5d2d01b84d
children eb6244c6d861
601:84cb106ef428 602:203c6071e104
1 %\documentclass[twoside,11pt]{article} % For LaTeX2e 1 %\documentclass[twoside,11pt]{article} % For LaTeX2e
2 \documentclass{article} % For LaTeX2e 2 \documentclass{article} % For LaTeX2e
3 \usepackage{aistats2e_2011} 3 \usepackage{aistats2e_2011}
4 \usepackage{times} 4 %\usepackage{times}
5 \usepackage{wrapfig} 5 \usepackage{wrapfig}
6 \usepackage{amsthm} 6 \usepackage{amsthm}
7 \usepackage{amsmath} 7 \usepackage{amsmath}
8 \usepackage{bbm} 8 \usepackage{bbm}
9 \usepackage[utf8]{inputenc} 9 \usepackage[utf8]{inputenc}
18 18
19 %\setlength\parindent{0mm} 19 %\setlength\parindent{0mm}
20 20
21 \begin{document} 21 \begin{document}
22 22
23 \title{Deeper Learners Benefit More from Multi-Task and Perturbed Examples} 23 \twocolumn[
24 \author{ 24 \aistatstitle{Deeper Learners Benefit More from Multi-Task and Perturbed Examples}
25 \runningtitle{Deep Learners for Out-of-Distribution Examples}
26 \runningauthor{Bengio et al.}
27 \aistatsauthor{Anonymous Authors}]
28 \iffalse
25 Yoshua Bengio \and 29 Yoshua Bengio \and
26 Frédéric Bastien \and 30 Frédéric Bastien \and
27 Arnaud Bergeron \and 31 Arnaud Bergeron \and
28 Nicolas Boulanger-Lewandowski \and 32 Nicolas Boulanger-Lewandowski \and
29 Thomas Breuel \and 33 Thomas Breuel \and
37 Sylvain Pannetier Lebeuf \and 41 Sylvain Pannetier Lebeuf \and
38 Razvan Pascanu \and 42 Razvan Pascanu \and
39 Salah Rifai \and 43 Salah Rifai \and
40 Francois Savard \and 44 Francois Savard \and
41 Guillaume Sicard 45 Guillaume Sicard
42 } 46 %}
43 \date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada} 47 \fi
48 %\aistatsaddress{Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
49 %\date{{\tt bengioy@iro.umontreal.ca}, Dept. IRO, U. Montreal, P.O. Box 6128, Centre-Ville branch, H3C 3J7, Montreal (Qc), Canada}
44 %\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al} 50 %\jmlrheading{}{2010}{}{10/2010}{XX/2011}{Yoshua Bengio et al}
45 %\editor{} 51 %\editor{}
46 52
47 %\makeanontitle 53 %\makeanontitle
48 \maketitle 54 %\maketitle
49 55
50 %{\bf Running title: Deep Self-Taught Learning} 56 %{\bf Running title: Deep Self-Taught Learning}
51 57
52 \vspace*{-2mm} 58 %\vspace*{-2mm}
53 \begin{abstract} 59 \begin{abstract}
54 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because 60 Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because
55 they can be shared across tasks and examples from different but related 61 they can be shared across tasks and examples from different but related
56 distributions, can yield even more benefits where there are more such levels of representation. The experiments are performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits). We show that a deep learner could not only {\em beat previously published results but also reach human-level performance}. 62 distributions, can yield even more benefits where there are more such levels of representation. The experiments are performed on a large-scale handwritten character recognition setting with 62 classes (upper case, lower case, digits). We show that a deep learner could not only {\em beat previously published results but also reach human-level performance}.
57 \end{abstract} 63 \end{abstract}
58 \vspace*{-3mm} 64 %\vspace*{-3mm}
59 65
60 %\begin{keywords} 66 %\begin{keywords}
61 %Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning 67 %Deep learning, self-taught learning, out-of-distribution examples, handwritten character recognition, multi-task learning
62 %\end{keywords} 68 %\end{keywords}
63 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition} 69 %\keywords{self-taught learning \and multi-task learning \and out-of-distribution examples \and deep learning \and handwriting recognition}
64 70
65 71
66 72
67 \section{Introduction} 73 \section{Introduction}
68 \vspace*{-1mm} 74 %\vspace*{-1mm}
69 75
70 {\bf Deep Learning} has emerged as a promising new area of research in 76 {\bf Deep Learning} has emerged as a promising new area of research in
71 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review. 77 statistical machine learning~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,VincentPLarochelleH2008,ranzato-08,TaylorHintonICML2009,Larochelle-jmlr-2009,Salakhutdinov+Hinton-2009,HonglakL2009,HonglakLNIPS2009,Jarrett-ICCV2009,Taylor-cvpr-2010}. See \citet{Bengio-2009} for a review.
72 Learning algorithms for deep architectures are centered on the learning 78 Learning algorithms for deep architectures are centered on the learning
73 of useful representations of data, which are better suited to the task at hand, 79 of useful representations of data, which are better suited to the task at hand,
103 stochastic gradient descent. 109 stochastic gradient descent.
104 One of these layer initialization techniques, 110 One of these layer initialization techniques,
105 applied here, is the Denoising 111 applied here, is the Denoising
106 Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see 112 Auto-encoder~(DAE)~\citep{VincentPLarochelleH2008-very-small} (see
107 Figure~\ref{fig:da}), which performed similarly or 113 Figure~\ref{fig:da}), which performed similarly or
108 better~\citep{VincentPLarchelleH2008-very-small} than previously 114 better~\citep{VincentPLarochelleH2008-very-small} than previously
109 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06} 115 proposed Restricted Boltzmann Machines (RBM)~\citep{Hinton06}
110 in terms of unsupervised extraction 116 in terms of unsupervised extraction
111 of a hierarchy of features useful for classification. Each layer is trained 117 of a hierarchy of features useful for classification. Each layer is trained
112 to denoise its input, creating a layer of features that can be used as 118 to denoise its input, creating a layer of features that can be used as
113 input for the next layer. Note that training a Denoising Auto-Encoder 119 input for the next layer. Note that training a Denoising Auto-Encoder
208 the more general question of why deep learners may benefit so much from 214 the more general question of why deep learners may benefit so much from
209 the self-taught learning framework. Since out-of-distribution data 215 the self-taught learning framework. Since out-of-distribution data
210 (perturbed or from other related classes) is very common, this conclusion 216 (perturbed or from other related classes) is very common, this conclusion
211 is of practical importance. 217 is of practical importance.
212 218
213 \vspace*{-3mm} 219 %\vspace*{-3mm}
214 %\newpage 220 %\newpage
215 \section{Perturbed and Transformed Character Images} 221 \section{Perturbed and Transformed Character Images}
216 \label{s:perturbations} 222 \label{s:perturbations}
217 \vspace*{-2mm} 223 %\vspace*{-2mm}
218 224
219 Figure~\ref{fig:transform} shows the different transformations we used to stochastically 225 Figure~\ref{fig:transform} shows the different transformations we used to stochastically
220 transform $32 \times 32$ source images (such as the one in Fig.~\ref{fig:torig}) 226 transform $32 \times 32$ source images (such as the one in Fig.~\ref{fig:torig})
221 in order to obtain data from a larger distribution which 227 in order to obtain data from a larger distribution which
222 covers a domain substantially larger than the clean characters distribution from 228 covers a domain substantially larger than the clean characters distribution from
232 There are two main parts in the pipeline. The first one, 238 There are two main parts in the pipeline. The first one,
233 from slant to pinch below, performs transformations. The second 239 from slant to pinch below, performs transformations. The second
234 part, from blur to contrast, adds different kinds of noise. 240 part, from blur to contrast, adds different kinds of noise.
235 More details can be found in~\citep{ift6266-tr-anonymous}. 241 More details can be found in~\citep{ift6266-tr-anonymous}.
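As a rough illustration of how such a pipeline can be organized (a sketch only; the module names follow Figure~\ref{fig:transform}, while the firing probability and per-module strength sampling below are assumptions rather than the actual implementation):
\begin{verbatim}
import random

# Module names follow the text: the first group ("from slant to pinch")
# transforms the character, the second ("from blur to contrast") adds
# noise.  The real pipeline contains more modules than listed here.
TRANSFORMATIONS = ["thickness", "slant", "affine", "pinch"]
NOISE_MODULES = ["blur", "contrast"]

def apply_module(image, name, amount):
    # Placeholder for the real module implementations (see the technical
    # report); maps a 32x32 grey-level image in [0,1] to a perturbed one.
    return image

def perturb(image, complexity, rng=random):
    # Transformations first, then noise; whether a module fires and how
    # strongly it perturbs are random (the probability and bound on the
    # amount below are illustrative assumptions).
    for name in TRANSFORMATIONS + NOISE_MODULES:
        if rng.random() < 0.5:
            amount = rng.uniform(0.0, complexity)
            image = apply_module(image, name, amount)
    return image
\end{verbatim}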
236 242
237 \begin{figure}[ht] 243 \begin{figure*}[ht]
238 \centering 244 \centering
239 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} 245 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}}
240 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} 246 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}}
241 \subfigure[Slant]{\includegraphics[scale=0.6]{images/Slant_only.png}} 247 \subfigure[Slant]{\includegraphics[scale=0.6]{images/Slant_only.png}}
242 \subfigure[Affine Transformation]{\includegraphics[scale=0.6]{images/Affine_only.png}} 248 \subfigure[Affine Transformation]{\includegraphics[scale=0.6]{images/Affine_only.png}}
255 \caption{Top left (a): example original image. Others (b-o): examples of the effect 261 \caption{Top left (a): example original image. Others (b-o): examples of the effect
256 of each transformation module taken separately. Actual perturbed examples are obtained by 262 of each transformation module taken separately. Actual perturbed examples are obtained by
257 a pipeline of these, with random choices about which module to apply and how much perturbation 263 a pipeline of these, with random choices about which module to apply and how much perturbation
258 to apply.} 264 to apply.}
259 \label{fig:transform} 265 \label{fig:transform}
260 \vspace*{-2mm} 266 %\vspace*{-2mm}
261 \end{figure} 267 \end{figure*}
262 268
263 \vspace*{-3mm} 269 %\vspace*{-3mm}
264 \section{Experimental Setup} 270 \section{Experimental Setup}
265 \vspace*{-1mm} 271 %\vspace*{-1mm}
266 272
267 Much previous work on deep learning had been performed on 273 Much previous work on deep learning had been performed on
268 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, 274 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
269 with 60~000 examples, and variants involving 10~000 275 with 60~000 examples, and variants involving 10~000
270 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}. 276 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008}.
271 The focus here is on much larger training sets, from 10 times to 277 The focus here is on much larger training sets, from 10 times to
272 1000 times larger, and 62 classes. 278 1000 times larger, and 62 classes.
273 279
274 The first step in constructing the larger datasets (called NISTP and P07) is to sample from 280 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
275 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, 281 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
299 example, and we were able to estimate the error variance due to this effect 305 example, and we were able to estimate the error variance due to this effect
300 because each image was classified by 3 different persons. 306 because each image was classified by 3 different persons.
301 The average error of humans on the 62-class task NIST test set 307 The average error of humans on the 62-class task NIST test set
302 is 18.2\%, with a standard error of 0.1\%. 308 is 18.2\%, with a standard error of 0.1\%.
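A minimal sketch of one way such an estimate can be obtained from the individual AMT judgments (the exact estimator is not specified here; pooling the three labelers per image, as below, is an assumption):
\begin{verbatim}
import numpy as np

def human_error_and_se(correct):
    # `correct` is an (n_images, 3) array of 0/1 judgments, one column
    # per labeler; each image's error is averaged over its 3 labelers,
    # and the standard error is then taken over images.
    per_image_error = 1.0 - correct.mean(axis=1)
    mean_error = per_image_error.mean()
    std_error = per_image_error.std(ddof=1) / np.sqrt(len(per_image_error))
    return mean_error, std_error
\end{verbatim}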
303 309
304 \vspace*{-3mm} 310 %\vspace*{-3mm}
305 \subsection{Data Sources} 311 \subsection{Data Sources}
306 \vspace*{-2mm} 312 %\vspace*{-2mm}
307 313
308 %\begin{itemize} 314 %\begin{itemize}
309 %\item 315 %\item
310 {\bf NIST.} 316 {\bf NIST.}
311 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995}, 317 Our main source of characters is the NIST Special Database 19~\citep{Grother-1995},
334 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. 340 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
335 % TODO: pointless to anonymize, it's not pointing to our work 341 % TODO: pointless to anonymize, it's not pointing to our work
336 Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total from which we can choose uniformly. 342 Including the operating system's (Windows 7) fonts, there are $9817$ different fonts in total from which we can choose uniformly.
337 The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by producing a corresponding image, 343 The chosen {\tt ttf} file is either used as input to the Captcha generator (see next item) or, by producing a corresponding image,
338 directly as input to our models. 344 directly as input to our models.
339 \vspace*{-1mm} 345 %\vspace*{-1mm}
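For illustration, a character image could be produced from a uniformly chosen {\tt ttf} file along the following lines (a sketch using PIL; the actual rendering parameters of the data source are not reproduced here):
\begin{verbatim}
import random
from PIL import Image, ImageDraw, ImageFont

def render_char(font_paths, char, size=32, rng=random):
    # Uniform choice among the available ttf files; the point size and
    # placement below are illustrative assumptions.
    path = rng.choice(font_paths)
    font = ImageFont.truetype(path, int(size * 0.8))
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((2, 2), char, fill=255, font=font)
    # Return grey-level values in [0,1], as used by the data sets below.
    return [p / 255.0 for p in img.getdata()]
\end{verbatim}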
340 346
341 %\item 347 %\item
342 {\bf Captchas.} 348 {\bf Captchas.}
343 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for 349 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
344 generating characters of the same format as the NIST dataset. This software is based on 350 generating characters of the same format as the NIST dataset. This software is based on
345 a random character class generator and various kinds of transformations similar to those described in the previous sections. 351 a random character class generator and various kinds of transformations similar to those described in the previous sections.
346 In order to increase the variability of the data generated, many different fonts are used for generating the characters. 352 In order to increase the variability of the data generated, many different fonts are used for generating the characters.
347 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity 353 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
348 depending on the value of the complexity parameter provided by the user of the data source. 354 depending on the value of the complexity parameter provided by the user of the data source.
349 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class? 355 %Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
350 \vspace*{-1mm} 356 %\vspace*{-1mm}
351 357
352 %\item 358 %\item
353 {\bf OCR data.} 359 {\bf OCR data.}
354 A large set (2 million) of scanned, OCRed and manually verified machine-printed 360 A large set (2 million) of scanned, OCRed and manually verified machine-printed
355 characters were included as an 361 characters were included as an
357 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern 363 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern
358 ({\tt http://www.iupr.com}), and which will be publicly released. 364 ({\tt http://www.iupr.com}), and which will be publicly released.
359 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this 365 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
360 %\end{itemize} 366 %\end{itemize}
361 367
362 \vspace*{-3mm} 368 %\vspace*{-3mm}
363 \subsection{Data Sets} 369 \subsection{Data Sets}
364 \vspace*{-2mm} 370 %\vspace*{-2mm}
365 371
366 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label 372 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
367 from one of the 62 character classes. 373 from one of the 62 character classes.
368 %\begin{itemize} 374 %\begin{itemize}
369 \vspace*{-1mm} 375 %\vspace*{-1mm}
370 376
371 %\item 377 %\item
372 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has 378 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
373 \{651668 / 80000 / 82587\} \{training / validation / test\} examples. 379 \{651668 / 80000 / 82587\} \{training / validation / test\} examples.
374 \vspace*{-1mm} 380 %\vspace*{-1mm}
375 381
376 %\item 382 %\item
377 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources 383 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
378 and sending them through the transformation pipeline described in section \ref{s:perturbations}. 384 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
379 To generate each new example, a data source is selected with probability $10\%$ from the fonts, 385 To generate each new example, a data source is selected with probability $10\%$ from the fonts,
380 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the 386 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
381 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. 387 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
382 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. 388 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
383 \vspace*{-1mm} 389 %\vspace*{-1mm}
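In pseudocode terms, drawing one P07 example looks as follows (a sketch only; the two helper functions stand for the data sources above and the perturbation pipeline of Section~\ref{s:perturbations}):
\begin{verbatim}
import random

SOURCES = {"fonts": 0.10, "captchas": 0.25, "ocr": 0.25, "nist": 0.40}

def sample_p07_example(draw_from_source, pipeline, rng=random):
    # `draw_from_source(name)` yields a raw (image, label) pair from the
    # named source; `pipeline(image, max_complexity)` applies all modules
    # in order, sampling a complexity uniformly in [0, max_complexity]
    # for each of them.  Both are placeholders.
    names, weights = zip(*SOURCES.items())
    source = rng.choices(names, weights=weights, k=1)[0]
    image, label = draw_from_source(source)
    return pipeline(image, 0.7), label
\end{verbatim}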
384 390
385 %\item 391 %\item
386 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) 392 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
387 except that we only apply 393 except that we only apply
388 transformations from slant to pinch. Therefore, the character is 394 transformations from slant to pinch. Therefore, the character is
389 transformed but no additional noise is added to the image, giving images 395 transformed but no additional noise is added to the image, giving images
390 closer to the NIST dataset. 396 closer to the NIST dataset.
391 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples. 397 It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
392 %\end{itemize} 398 %\end{itemize}
393 399
394 \vspace*{-3mm} 400 \begin{figure*}[ht]
401 %\vspace*{-2mm}
402 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
403 %\vspace*{-2mm}
404 \caption{Illustration of the computations and training criterion for the denoising
405 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
406 the layer (i.e. raw input or output of previous layer)
407 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
408 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
409 is compared to the uncorrupted input $x$ through the loss function
410 $L_H(x,z)$, whose expected value is approximately minimized during training
411 by tuning $\theta$ and $\theta'$.}
412 \label{fig:da}
413 %\vspace*{-2mm}
414 \end{figure*}
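Spelled out, and assuming the usual instantiation of the denoising auto-encoder of \citet{VincentPLarochelleH2008} (an affine-plus-sigmoid encoder and decoder and a cross-entropy reconstruction loss are assumed here, as in that work), the computations of Figure~\ref{fig:da} are
\begin{align*}
\tilde{x} &\sim q(\tilde{x} \mid x) && \text{(stochastic corruption of the input)}\\
y &= f_\theta(\tilde{x}) = \mathrm{sigm}(W \tilde{x} + b) && \text{(encoder)}\\
z &= g_{\theta'}(y) = \mathrm{sigm}(W' y + b') && \text{(decoder)}\\
L_H(x,z) &= -\textstyle\sum_i \left[ x_i \log z_i + (1-x_i)\log(1-z_i) \right] && \text{(reconstruction loss)}
\end{align*}
with $\theta=(W,b)$ and $\theta'=(W',b')$ tuned by stochastic gradient descent to approximately minimize the expected loss over the training data and the corruption process.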
415
416 %\vspace*{-3mm}
395 \subsection{Models and their Hyperparameters} 417 \subsection{Models and their Hyperparameters}
396 \vspace*{-2mm} 418 %\vspace*{-2mm}
397 419
398 The experiments are performed using MLPs (with a single 420 The experiments are performed using MLPs (with a single
399 hidden layer) and SDAs. 421 hidden layer) and SDAs.
400 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} 422 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
401 423
414 Training examples are presented in minibatches of size 20. A constant learning 436 Training examples are presented in minibatches of size 20. A constant learning
415 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. 437 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
416 %through preliminary experiments (measuring performance on a validation set), 438 %through preliminary experiments (measuring performance on a validation set),
417 %and $0.1$ (which was found to work best) was then selected for optimizing on 439 %and $0.1$ (which was found to work best) was then selected for optimizing on
418 %the whole training sets. 440 %the whole training sets.
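A sketch of the corresponding training loop (constant learning rate, minibatches of 20; the gradient computation itself is a placeholder for the MLP's backpropagation):
\begin{verbatim}
import numpy as np

LEARNING_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]  # candidate grid
BATCH_SIZE = 20

def sgd_epoch(params, grad_fn, X, y, lr, rng=np.random):
    # One epoch of constant-learning-rate minibatch SGD; `grad_fn` returns
    # the gradient of the training loss w.r.t. each parameter array.
    order = rng.permutation(len(X))
    for start in range(0, len(X), BATCH_SIZE):
        idx = order[start:start + BATCH_SIZE]
        for p, g in zip(params, grad_fn(params, X[idx], y[idx])):
            p -= lr * g
    return params

# The learning rate is a hyper-parameter, e.g. selected by training one
# model per value in LEARNING_RATES and keeping the one with the lowest
# NISTP validation error (selection protocol assumed).
\end{verbatim}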
419 \vspace*{-1mm} 441 %\vspace*{-1mm}
420 442
421 443
422 {\bf Stacked Denoising Auto-Encoders (SDA).} 444 {\bf Stacked Denoising Auto-Encoders (SDA).}
423 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) 445 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
424 can be used to initialize the weights of each layer of a deep MLP (with many hidden 446 can be used to initialize the weights of each layer of a deep MLP (with many hidden
435 distribution $P(x)$ and the conditional distribution of interest 457 distribution $P(x)$ and the conditional distribution of interest
436 $P(y|x)$ (like in semi-supervised learning), and on the other hand 458 $P(y|x)$ (like in semi-supervised learning), and on the other hand
437 taking advantage of the expressive power and bias implicit in the 459 taking advantage of the expressive power and bias implicit in the
438 deep architecture (whereby complex concepts are expressed as 460 deep architecture (whereby complex concepts are expressed as
439 compositions of simpler ones through a deep hierarchy). 461 compositions of simpler ones through a deep hierarchy).
440
441 \begin{figure}[ht]
442 \vspace*{-2mm}
443 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
444 \vspace*{-2mm}
445 \caption{Illustration of the computations and training criterion for the denoising
446 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
447 the layer (i.e. raw input or output of previous layer)
448 s corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
449 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
450 is compared to the uncorrupted input $x$ through the loss function
451 $L_H(x,z)$, whose expected value is approximately minimized during training
452 by tuning $\theta$ and $\theta'$.}
453 \label{fig:da}
454 \vspace*{-2mm}
455 \end{figure}
456 462
457 Here we chose to use the Denoising 463 Here we chose to use the Denoising
458 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for 464 Auto-encoder~\citep{VincentPLarochelleH2008} as the building block for
459 these deep hierarchies of features, as it is simple to train and 465 these deep hierarchies of features, as it is simple to train and
460 explain (see Figure~\ref{fig:da}, as well as 466 explain (see Figure~\ref{fig:da}, as well as
483 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The size of the hidden 489 SDAs on MNIST~\citep{VincentPLarochelleH2008}. The size of the hidden
484 layers was kept constant across layers, and the best results 490 layers was kept constant across layers, and the best results
485 were obtained with the largest value our patience allowed, 491 were obtained with the largest value our patience allowed,
486 namely 1000 hidden units. 492 namely 1000 hidden units.
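Schematically, the greedy layer-wise procedure reads as follows (a sketch with placeholder training routines, not the actual code):
\begin{verbatim}
def pretrain_and_finetune(layers, unsup_data, sup_data,
                          train_dae, finetune):
    # Greedy layer-wise unsupervised pre-training with denoising
    # auto-encoders, followed by supervised fine-tuning of the whole
    # deep MLP; `train_dae` and `finetune` are placeholders for the
    # routines described in the text.
    inputs = unsup_data
    for layer in layers:
        train_dae(layer, inputs)       # learn to denoise this layer's input
        inputs = layer.encode(inputs)  # its code feeds the next layer
    finetune(layers, sup_data)         # supervised fine-tuning of the stack
    return layers
\end{verbatim}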
487 493
488 \vspace*{-1mm} 494 %\vspace*{-1mm}
489 495
490 \begin{figure}[ht] 496 \begin{figure*}[ht]
491 %\vspace*{-2mm} 497 %\vspace*{-2mm}
492 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} 498 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
493 %\vspace*{-3mm} 499 %\vspace*{-3mm}
494 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained 500 \caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
495 on NIST, 1 on NISTP, and 2 on P07. Left: overall results 501 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
496 of all models, on NIST and NISTP test sets. 502 of all models, on NIST and NISTP test sets.
497 Right: error rates on NIST test digits only, along with the previous results from 503 Right: error rates on NIST test digits only, along with the previous results from
498 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 504 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
499 respectively based on ART, nearest neighbors, MLPs, and SVMs.} 505 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
500 \label{fig:error-rates-charts} 506 \label{fig:error-rates-charts}
501 \vspace*{-2mm} 507 %\vspace*{-2mm}
502 \end{figure} 508 \end{figure*}
503 509
504 510
505 \begin{figure}[ht] 511 \begin{figure*}[ht]
506 \vspace*{-3mm} 512 \vspace*{-3mm}
507 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} 513 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
508 \vspace*{-3mm} 514 \vspace*{-3mm}
509 \caption{Relative improvement in error rate due to self-taught learning. 515 \caption{Relative improvement in error rate due to self-taught learning.
510 Left: Improvement (or loss, when negative) 516 Left: Improvement (or loss, when negative)
513 learning (training on all classes and testing only on either digits, 519 learning (training on all classes and testing only on either digits,
514 upper case, or lower-case). The deep learner (SDA) benefits more from 520 upper case, or lower-case). The deep learner (SDA) benefits more from
515 both self-taught learning scenarios, compared to the shallow MLP.} 521 both self-taught learning scenarios, compared to the shallow MLP.}
516 \label{fig:improvements-charts} 522 \label{fig:improvements-charts}
517 \vspace*{-2mm} 523 \vspace*{-2mm}
518 \end{figure} 524 \end{figure*}
519 525
526 \vspace*{-2mm}
520 \section{Experimental Results} 527 \section{Experimental Results}
521 \vspace*{-2mm} 528 \vspace*{-2mm}
522 529
523 %%\vspace*{-1mm} 530 %%\vspace*{-1mm}
524 %\subsection{SDA vs MLP vs Humans} 531 %\subsection{SDA vs MLP vs Humans}
693 does not allow the shallow or purely supervised models to discover 700 does not allow the shallow or purely supervised models to discover
694 the kind of better basins associated 701 the kind of better basins associated
695 with deep learning and self-taught learning. 702 with deep learning and self-taught learning.
696 703
697 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 704 A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
698 can be executed on-line at {\tt http://deep.host22.com}. 705 can be executed on-line at the anonymous site {\tt http://deep.host22.com}.
699 706
700 \iffalse 707 \iffalse
701 \section*{Appendix I: Detailed Numerical Results} 708 \section*{Appendix I: Detailed Numerical Results}
702 709
703 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered. 710 These tables correspond to Figures 2 and 3 and contain the raw error rates for each model and dataset considered.