comparison writeup/aistats2011_cameraready.tex @ 639:507cb92d8e15

minor modifications
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 20 Mar 2011 16:49:44 -0400
parents 677d1b1d8158
children 8b1a0b9fecff
9 \usepackage[utf8]{inputenc} 9 \usepackage[utf8]{inputenc}
10 \usepackage[psamsfonts]{amssymb} 10 \usepackage[psamsfonts]{amssymb}
11 %\usepackage{algorithm,algorithmic} % not used after all 11 %\usepackage{algorithm,algorithmic} % not used after all
12 \usepackage{graphicx,subfigure} 12 \usepackage{graphicx,subfigure}
13 \usepackage{natbib} 13 \usepackage{natbib}
14 %\usepackage{afterpage}
14 15
15 \addtolength{\textwidth}{10mm} 16 \addtolength{\textwidth}{10mm}
16 \addtolength{\evensidemargin}{-5mm} 17 \addtolength{\evensidemargin}{-5mm}
17 \addtolength{\oddsidemargin}{-5mm} 18 \addtolength{\oddsidemargin}{-5mm}
18 19
30 \bf Arnaud Bergeron \and 31 \bf Arnaud Bergeron \and
31 Nicolas Boulanger-Lewandowski \and \\ 32 Nicolas Boulanger-Lewandowski \and \\
32 \bf Thomas Breuel \and 33 \bf Thomas Breuel \and
33 Youssouf Chherawala \and 34 Youssouf Chherawala \and
34 \bf Moustapha Cisse \and 35 \bf Moustapha Cisse \and
35 Myriam Côté \and \\ 36 Myriam Côté \and
36 \bf Dumitru Erhan \and 37 \bf Dumitru Erhan \\
37 Jeremy Eustache \and 38 \and \bf Jeremy Eustache \and
38 \bf Xavier Glorot \and 39 \bf Xavier Glorot \and
39 Xavier Muller \and \\ 40 Xavier Muller \and
40 \bf Sylvain Pannetier Lebeuf \and 41 \bf Sylvain Pannetier Lebeuf \\
41 Razvan Pascanu \and 42 \and \bf Razvan Pascanu \and
42 \bf Salah Rifai \and 43 \bf Salah Rifai \and
43 Francois Savard \and \\ 44 Francois Savard \and
44 \bf Guillaume Sicard \\ 45 \bf Guillaume Sicard \\
45 \vspace*{1mm}} 46 \vspace*{1mm}}
46 47
47 %I can't use aistatsaddress in a single-side paragraph. 48 %I can't use aistatsaddress in a single-side paragraph.
48 %The document is 2 columns, but this section spans the 2 columns, so there is only 1 left 49 %The document is 2 columns, but this section spans the 2 columns, so there is only 1 left
141 observed with deep learners, we focus here on the following {\em hypothesis}: 142 observed with deep learners, we focus here on the following {\em hypothesis}:
142 intermediate levels of representation, especially when there are 143 intermediate levels of representation, especially when there are
143 more such levels, can be exploited to {\bf share 144 more such levels, can be exploited to {\bf share
144 statistical strength across different but related types of examples}, 145 statistical strength across different but related types of examples},
145 such as examples coming from other tasks than the task of interest 146 such as examples coming from other tasks than the task of interest
146 (the multi-task setting), or examples coming from an overlapping 147 (the multi-task setting~\citep{caruana97a}), or examples coming from an overlapping
147 but different distribution (images with different kinds of perturbations 148 but different distribution (images with different kinds of perturbations
148 and noises, here). This is consistent with the hypotheses discussed 149 and noises, here). This is consistent with the hypotheses discussed
149 in~\citet{Bengio-2009} regarding the potential advantage 150 in~\citet{Bengio-2009} regarding the potential advantage
150 of deep learning and the idea that more levels of representation can 151 of deep learning and the idea that more levels of representation can
151 give rise to more abstract, more general features of the raw input. 152 give rise to more abstract, more general features of the raw input.
152 153
153 This hypothesis is related to a learning setting called 154 This hypothesis is related to a learning setting called
154 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles 155 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles
155 of semi-supervised and multi-task learning: the learner can exploit examples 156 of semi-supervised and multi-task learning: in addition to the labeled
157 examples from the target distribution, the learner can exploit examples
156 that are unlabeled and possibly come from a distribution different from the target 158 that are unlabeled and possibly come from a distribution different from the target
157 distribution, e.g., from other classes than those of interest. 159 distribution, e.g., from other classes than those of interest.
158 It has already been shown that deep learners can clearly take advantage of 160 It has already been shown that deep learners can clearly take advantage of
159 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, 161 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}
162 in order to improve performance on a supervised task,
160 but more needed to be done to explore the impact 163 but more needed to be done to explore the impact
161 of {\em out-of-distribution} examples and of the {\em multi-task} setting 164 of {\em out-of-distribution} examples and of the {\em multi-task} setting
162 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised 165 (two exceptions are~\citet{CollobertR2008}, which shares and uses unsupervised
163 pre-training only with the first layer). In particular the {\em relative 166 pre-training only with the first layer, and~\citet{icml2009_093} in the case
167 of video data). In particular, the {\em relative
164 advantage of deep learning} for these settings has not been evaluated. 168 advantage of deep learning} for these settings has not been evaluated.
165 169
166 170
167 % 171 %
168 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can 172 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can
231 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the 235 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the
232 amount of deformation or noise introduced. 236 amount of deformation or noise introduced.
233 There are two main parts in the pipeline. The first one, 237 There are two main parts in the pipeline. The first one,
234 from thickness to pinch, performs transformations. The second 238 from thickness to pinch, performs transformations. The second
235 part, from blur to contrast, adds different kinds of noise. 239 part, from blur to contrast, adds different kinds of noise.
236 More details can be found in~\citep{ARXIV-2010}. 240 More details can be found in~\citet{ARXIV-2010}.
237 241
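As an illustrative aid (not the actual implementation, which is described in~\citet{ARXIV-2010}), the following sketch shows how a single global $complexity$ parameter can drive such a two-part pipeline; the module names and their bodies are placeholders assumed only for this example.
\begin{verbatim}
# Hypothetical sketch of the complexity-driven perturbation pipeline.
# Each module receives the global 0 <= complexity <= 1 parameter and
# scales its own deformation/noise parameters accordingly; the module
# bodies below are placeholders, not the real transformations.
import numpy

def thickness(img, complexity, rng):    # first part: deformations ...
    return img                          # (placeholder)

def pinch(img, complexity, rng):
    return img                          # (placeholder)

def blur(img, complexity, rng):         # second part: noise ...
    return img                          # (placeholder)

def contrast(img, complexity, rng):
    return img                          # (placeholder)

PIPELINE = [thickness, pinch, blur, contrast]

def perturb(img, complexity, rng=numpy.random):
    assert 0.0 <= complexity <= 1.0
    for module in PIPELINE:             # apply modules in the given order
        img = module(img, complexity, rng)
    return numpy.clip(img, 0.0, 1.0)    # keep grey levels in [0, 1]
\end{verbatim}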
238 \begin{figure*}[ht] 242 \begin{figure*}[ht]
239 \centering 243 \centering
240 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} 244 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}}
241 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} 245 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}}
266 %\vspace*{-1mm} 270 %\vspace*{-1mm}
267 271
268 Much previous work on deep learning had been performed on 272 Much previous work on deep learning had been performed on
269 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, 273 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
270 with 60,000 examples, and variants involving 10,000 274 with 60,000 examples, and variants involving 10,000
271 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}. 275 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}.\footnote{Fortunately, there
276 are more and more exceptions, of course, such as~\citet{RainaICML09}, which uses a million examples.}
272 The focus here is on much larger training sets, from 10 times to 277 The focus here is on much larger training sets, from 10 times to
273 1000 times larger, and 62 classes. 278 1000 times larger, and 62 classes.
274 279
275 The first step in constructing the larger datasets (called NISTP and P07) is to sample from 280 The first step in constructing the larger datasets (called NISTP and P07) is to sample from
276 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, 281 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
344 %\item 349 %\item
345 {\bf Fonts.} 350 {\bf Fonts.}
346 In order to have a good variety of sources, we downloaded a large number of free fonts from: 351 In order to have a good variety of sources, we downloaded a large number of free fonts from:
347 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. 352 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
348 % TODO: pointless to anonymize, it's not pointing to our work 353 % TODO: pointless to anonymize, it's not pointing to our work
349 Including an operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. 354 Including the fonts of an operating system (Windows 7), we chose uniformly among a total of $9817$ different fonts.
350 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, 355 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
351 directly as input to our models. 356 directly as input to our models.
352 %\vspace*{-1mm} 357 %\vspace*{-1mm}
353 358
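As a purely illustrative sketch (the path, character and rendering details are hypothetical, not taken from our generator), a {\tt ttf} file can be turned into a small grey-level image along these lines using the Python Imaging Library:
\begin{verbatim}
# Illustrative only: render one character from a chosen ttf file into a
# 32x32 grey-level array with values in [0, 1].
import numpy
from PIL import Image, ImageDraw, ImageFont

def render_character(ttf_path, char, size=32):
    font = ImageFont.truetype(ttf_path, size)        # load the chosen font
    canvas = Image.new("L", (size, size), color=0)   # black background
    ImageDraw.Draw(canvas).text((0, 0), char, fill=255, font=font)
    return numpy.asarray(canvas, dtype="float32") / 255.0

# e.g. render_character("freefonts/SomeFont.ttf", "a")  # hypothetical path
\end{verbatim}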
354 %\item 359 %\item
366 {\bf OCR data.} 371 {\bf OCR data.}
367 A large set (2 million) of scanned, OCRed and manually verified machine-printed 372 A large set (2 million) of scanned, OCRed and manually verified machine-printed
367 characters was included as an 372 characters was included as an
369 additional source. This set is part of a larger corpus being collected by the Image Understanding 374 additional source. This set is part of a larger corpus being collected by the Image Understanding
369 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern 374 Pattern Recognition Research group led by Thomas Breuel at the University of Kaiserslautern
371 ({\tt http://www.iupr.com}), and which will be publicly released. 376 ({\tt http://www.iupr.com}).%, and which will be publicly released.
372 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this 377 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this
373 %\end{itemize} 378 %\end{itemize}
374 379
375 %\vspace*{-3mm} 380 %\vspace*{-3mm}
376 \subsection{Data Sets} 381 \subsection{Data Sets}
377 %\vspace*{-2mm} 382 %\vspace*{-2mm}
378 383
379 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label 384 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with one of 62 character labels.
380 from one of the 62 character classes.
381 %\begin{itemize} 385 %\begin{itemize}
382 %\vspace*{-1mm} 386 %\vspace*{-1mm}
383 387
384 %\item 388 %\item
385 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has 389 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
386 \{651,668 / 80,000 / 82,587\} \{training / validation / test\} examples. 390 \{651,668 / 80,000 / 82,587\} \{training / validation / test\} examples.
387 %\vspace*{-1mm} 391 %\vspace*{-1mm}
388 392
389 %\item 393 %\item
390 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources 394 {\bf P07.} This dataset is obtained by taking raw characters from the above 4 sources
391 and sending them through the transformation pipeline described in section \ref{s:perturbations}. 395 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
392 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, 396 For each generated example, a data source is selected with probability $10\%$ from the fonts,
393 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the 397 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. The transformations are
398 applied in the
394 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. 399 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
395 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples 400 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples
396 obtained from the corresponding NIST sets plus other sources. 401 obtained from the corresponding NIST sets plus other sources.
397 %\vspace*{-1mm} 402 %\vspace*{-1mm}
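For concreteness, a minimal sketch of this per-example sampling procedure is given below; {\tt sample\_from} and {\tt perturb} stand in for the per-source samplers and the perturbation pipeline of section~\ref{s:perturbations}, and are assumptions made only for illustration.
\begin{verbatim}
# Sketch of generating one P07 example: select a data source with the
# stated probabilities, then send the raw character through the
# perturbation pipeline with a complexity drawn uniformly in [0, 0.7].
# `sample_from` and `perturb` are placeholders for the real components.
import numpy

SOURCES = ["fonts", "captchas", "ocr", "nist"]
PROBS   = [0.10,    0.25,       0.25,  0.40]

def generate_p07_example(sample_from, perturb, rng=numpy.random):
    source = rng.choice(SOURCES, p=PROBS)      # pick a data source
    image, label = sample_from(source)         # raw 32x32 character
    complexity = rng.uniform(0.0, 0.7)         # per-example complexity
    return perturb(image, complexity, rng), label
\end{verbatim}
The NISTP set described next would be obtained with the same source proportions but a fixed complexity of $0.7$ and only the slant-to-pinch transformations enabled.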
398 403
399 %\item 404 %\item
400 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) 405 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
401 except that we only apply 406 except that we only apply
402 transformations from slant to pinch (see Fig.~\ref{fig:transform}(b-f)). 407 transformations from slant to pinch (see Fig.~\ref{fig:transform}(b-f)).
403 Therefore, the character is 408 Therefore, the character is
404 transformed but no additional noise is added to the image, giving images 409 transformed but without added noise, yielding images
405 closer to the NIST dataset. 410 closer to the NIST dataset.
406 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples 411 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples
407 obtained from the corresponding NIST sets plus other sources. 412 obtained from the corresponding NIST sets plus other sources.
408 %\end{itemize} 413 %\end{itemize}
409 414
410 \begin{figure*}[ht] 415 \vspace*{-3mm}
411 %\vspace*{-2mm}
412 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
413 %\vspace*{-2mm}
414 \caption{Illustration of the computations and training criterion for the denoising
415 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
416 the layer (i.e. raw input or output of previous layer)
417 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
418 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
419 is compared to the uncorrupted input $x$ through the loss function
420 $L_H(x,z)$, whose expected value is approximately minimized during training
421 by tuning $\theta$ and $\theta'$.}
422 \label{fig:da}
423 %\vspace*{-2mm}
424 \end{figure*}
425
426 %\vspace*{-3mm}
427 \subsection{Models and their Hyper-parameters} 416 \subsection{Models and their Hyper-parameters}
428 %\vspace*{-2mm} 417 %\vspace*{-2mm}
429 418
430 The experiments are performed using MLPs (with a single 419 The experiments are performed using MLPs (with a single
431 hidden layer) and deep SDAs. 420 hidden layer) and deep SDAs.
432 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} 421 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
433 422
434 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimated with 423 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates the
424 class-conditional probabilities
435 \[ 425 \[
436 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)), 426 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)),
437 \] 427 \]
438 i.e., two layers, where $p={\rm softmax}(a)$ means that 428 i.e., two layers, where $p={\rm softmax}(a)$ means that
439 $p_i(x)=\exp(a_i)/\sum_j \exp(a_j)$ 429 $p_i(x)=\exp(a_i)/\sum_j \exp(a_j)$
472 %through preliminary experiments (measuring performance on a validation set), 462 %through preliminary experiments (measuring performance on a validation set),
473 %and $0.1$ (which was found to work best) was then selected for optimizing on 463 %and $0.1$ (which was found to work best) was then selected for optimizing on
474 %the whole training sets. 464 %the whole training sets.
475 %\vspace*{-1mm} 465 %\vspace*{-1mm}
476 466
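To make the MLP notation above concrete, a minimal sketch of this two-layer forward pass is shown below; the weight shapes and the max-subtraction in the softmax are standard choices assumed here, not details from our implementation.
\begin{verbatim}
# Sketch of the two-layer MLP output P(class | input = x):
#   softmax(b2 + W2 tanh(b1 + W1 x))
# with W1 of shape (n_hidden, n_in) and W2 of shape (n_classes, n_hidden).
import numpy

def softmax(a):
    a = a - a.max()                  # standard numerical stabilization
    e = numpy.exp(a)
    return e / e.sum()               # p_i = exp(a_i) / sum_j exp(a_j)

def mlp_class_probabilities(x, W1, b1, W2, b2):
    h = numpy.tanh(b1 + W1.dot(x))   # hidden layer
    return softmax(b2 + W2.dot(h))   # class-conditional probabilities
\end{verbatim}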
467 \begin{figure*}[htb]
468 %\vspace*{-2mm}
469 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
470 %\vspace*{-2mm}
471 \caption{Illustration of the computations and training criterion for the denoising
472 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
473 the layer (i.e. raw input or output of previous layer)
474 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
475 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
476 is compared to the uncorrupted input $x$ through the loss function
477 $L_H(x,z)$, whose expected value is approximately minimized during training
478 by tuning $\theta$ and $\theta'$.}
479 \label{fig:da}
480 %\vspace*{-2mm}
481 \end{figure*}
482
483 %\afterpage{\clearpage}
477 484
478 {\bf Stacked Denoising Auto-encoders (SDA).} 485 {\bf Stacked Denoising Auto-encoders (SDA).}
479 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) 486 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
480 can be used to initialize the weights of each layer of a deep MLP (with many hidden 487 can be used to initialize the weights of each layer of a deep MLP (with many hidden
481 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, 488 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006},
482 apparently setting the parameters in a 489 apparently setting the parameters in a
483 basin of attraction of supervised gradient descent that yields better 490 basin of attraction of supervised gradient descent that yields better
484 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised 491 generalization~\citep{Erhan+al-2010}.
485 pre-training phase} uses all of the training images but not the training labels. 492 This initial {\em unsupervised
493 pre-training phase} does not use the training labels.
486 Each layer is trained in turn to produce a new representation of its input 494 Each layer is trained in turn to produce a new representation of its input
487 (starting from the raw pixels). 495 (starting from the raw pixels).
488 It is hypothesized that the 496 It is hypothesized that the
489 advantage brought by this procedure stems from a better prior, 497 advantage brought by this procedure stems from a better prior,
490 on the one hand taking advantage of the link between the input 498 on the one hand taking advantage of the link between the input
499 these deep hierarchies of features, as it is simple to train and 507 these deep hierarchies of features, as it is simple to train and
500 explain (see Figure~\ref{fig:da}, as well as the 508 explain (see Figure~\ref{fig:da}, as well as the
501 tutorial and code at {\tt http://deeplearning.net/tutorial}), 509 tutorial and code at {\tt http://deeplearning.net/tutorial}),
502 provides efficient inference, and yielded results 510 provides efficient inference, and yielded results
503 comparable to or better than RBMs in a series of experiments 511 comparable to or better than RBMs in a series of experiments
504 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian 512 \citep{VincentPLarochelleH2008-very-small}.
513 Some denoising auto-encoders correspond
514 to a Gaussian
505 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. 515 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}.
506 During its unsupervised training, a Denoising 516 During its unsupervised training, a Denoising
507 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$ 517 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$
508 of the input $x$ and trained to produce a reconstruction $z$ 518 of the input $x$ and trained to produce a reconstruction $z$
509 of the uncorrupted input $x$. Because the network has to denoise, the 519 of the uncorrupted input $x$. Because the network has to denoise, the
510 hidden units $y$ are forced to represent the leading regularities in 520 hidden units $y$ are forced to represent the leading regularities in
511 the data. Following~\citep{VincentPLarochelleH2008-very-small} 521 the data. In a slight departure from \citet{VincentPLarochelleH2008-very-small},
512 the hidden units' output $y$ is obtained through the sigmoid-affine 522 the hidden units' output $y$ is obtained through the tanh-affine
513 encoder 523 encoder
514 \[ 524 $y=\tanh(c+V x)$
515 y={\rm sigm}(c+V x) 525 and the reconstruction is obtained through the transposed transformation
516 \] 526 $z=\tanh(d+V' y)$.
517 where ${\rm sigm}(a)=1/(1+\exp(-a))$
518 and the reconstruction is obtained through the same transformation
519 \[
520 z={\rm sigm}(d+V' y)
521 \]
522 using the transpose of encoder weights.
523 The training 527 The training
524 set average of the cross-entropy 528 set average of the cross-entropy
525 reconstruction loss 529 reconstruction loss (after mapping values in $(-1,1)$ back to $(0,1)$)
526 \[ 530 \[
527 L_H(x,z)=\sum_i z_i \log x_i + (1-z_i) \log(1-x_i) 531 L_H(x,z)=-\sum_i \frac{x_i+1}{2} \log \frac{z_i+1}{2} + \frac{1-x_i}{2} \log\frac{1-z_i}{2}
528 \] 532 \]
529 is minimized. 533 is minimized.
530 Here we use the random binary masking corruption 534 Here we use the random binary masking corruption
531 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and 535 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and
532 copies the rest). 536 copies the rest).
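The following is a small {\tt numpy} sketch of these computations (masking corruption, tied tanh encoder/decoder and the remapped cross-entropy loss); it only illustrates the forward pass and the loss of Figure~\ref{fig:da}, under the assumption that the parameters $(V,c,d)$ are then updated by stochastic gradient descent (in practice via automatic differentiation, as in the tutorial code mentioned above), and is not our actual implementation.
\begin{verbatim}
# Sketch of one denoising auto-encoder forward pass and its loss:
# masking corruption, tanh encoder/decoder with tied (transposed)
# weights, and cross-entropy after mapping (-1,1) values back to (0,1).
# Gradients w.r.t. (V, c, d) would be obtained by automatic
# differentiation in practice; only the computations are shown here.
import numpy

def da_loss(x, V, c, d, corruption=0.2, rng=numpy.random):
    # corruption=0.2 is one of the fractions considered in the text
    mask = rng.binomial(1, 1.0 - corruption, size=x.shape)  # masking noise
    x_tilde = x * mask                           # corrupted input
    y = numpy.tanh(c + V.dot(x_tilde))           # encoder f_theta
    z = numpy.tanh(d + V.T.dot(y))               # decoder g_theta' (tied V)
    p = (x + 1.0) / 2.0                          # target, mapped to (0,1)
    q = numpy.clip((z + 1.0) / 2.0, 1e-7, 1 - 1e-7)  # reconstruction
    return -numpy.sum(p * numpy.log(q) + (1 - p) * numpy.log(1 - q))
\end{verbatim}
Each layer of the SDA is pre-trained in this way, in turn, on the representation produced by the previous layer, before the whole network is trained with the supervised criterion.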
550 separate learning rate for the unsupervised pre-training stage (selected 554 separate learning rate for the unsupervised pre-training stage (selected
551 from the same set as above). The fraction of inputs corrupted was selected 555 from the same set as above). The fraction of inputs corrupted was selected
552 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number 556 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
553 of hidden layers but it was fixed to 3 for our experiments, 557 of hidden layers but it was fixed to 3 for our experiments,
554 based on previous work with 558 based on previous work with
555 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. 559 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
556 We also compared against 1 and against 2 hidden layers, 560 We also compared against 1 and against 2 hidden layers,
557 to disentangle the effect of depth from that of unsupervised 561 to disentangle the effect of depth from that of unsupervised
558 pre-training. 562 pre-training.
559 The size of each hidden 563 The size of each hidden
560 layer was kept constant across hidden layers, and the best results 564 layer was kept constant across hidden layers, and the best results
765 framework and out-of-distribution examples}? 769 framework and out-of-distribution examples}?
766 The key idea is that the lower layers of the predictor compute a hierarchy 770 The key idea is that the lower layers of the predictor compute a hierarchy
767 of features that can be shared across tasks or across variants of the 771 of features that can be shared across tasks or across variants of the
768 input distribution. A theoretical analysis of generalization improvements 772 input distribution. A theoretical analysis of generalization improvements
769 due to sharing of intermediate features across tasks already points 773 due to sharing of intermediate features across tasks already points
770 towards that explanation~\cite{baxter95a}. 774 towards that explanation~\citep{baxter95a}.
771 Intermediate features that can be used in different 775 Intermediate features that can be used in different
772 contexts can be estimated in a way that allows one to share statistical 776 contexts can be estimated in a way that allows one to share statistical
773 strength. Features extracted through many levels are more likely to 777 strength. Features extracted through many levels are more likely to
774 be more abstract and more invariant to some of the factors of variation 778 be more abstract and more invariant to some of the factors of variation
775 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest), 779 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest),
792 (with or without out-of-distribution examples) from random initialization, and more labeled examples 796 (with or without out-of-distribution examples) from random initialization, and more labeled examples
793 do not allow the shallow or purely supervised models to discover 797 do not allow the shallow or purely supervised models to discover
794 the kind of better basins associated 798 the kind of better basins associated
795 with deep learning and out-of-distribution examples. 799 with deep learning and out-of-distribution examples.
796 800
797 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) 801 A Java demo of the recognizer (where both the MLP and the SDA can be compared)
798 can be executed on-line at {\tt http://deep.host22.com}. 802 can be executed on-line at {\tt http://deep.host22.com}.
799 803
800 \iffalse 804 \iffalse
801 \section*{Appendix I: Detailed Numerical Results} 805 \section*{Appendix I: Detailed Numerical Results}
802 806