comparison writeup/aistats2011_cameraready.tex @ 639:507cb92d8e15
minor modifications
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Sun, 20 Mar 2011 16:49:44 -0400 |
parents | 677d1b1d8158 |
children | 8b1a0b9fecff |
638:677d1b1d8158 | 639:507cb92d8e15 |
---|---|
9 \usepackage[utf8]{inputenc} | 9 \usepackage[utf8]{inputenc} |
10 \usepackage[psamsfonts]{amssymb} | 10 \usepackage[psamsfonts]{amssymb} |
11 %\usepackage{algorithm,algorithmic} % not used after all | 11 %\usepackage{algorithm,algorithmic} % not used after all |
12 \usepackage{graphicx,subfigure} | 12 \usepackage{graphicx,subfigure} |
13 \usepackage{natbib} | 13 \usepackage{natbib} |
14 %\usepackage{afterpage} | |
14 | 15 |
15 \addtolength{\textwidth}{10mm} | 16 \addtolength{\textwidth}{10mm} |
16 \addtolength{\evensidemargin}{-5mm} | 17 \addtolength{\evensidemargin}{-5mm} |
17 \addtolength{\oddsidemargin}{-5mm} | 18 \addtolength{\oddsidemargin}{-5mm} |
18 | 19 |
30 \bf Arnaud Bergeron \and | 31 \bf Arnaud Bergeron \and |
31 Nicolas Boulanger-Lewandowski \and \\ | 32 Nicolas Boulanger-Lewandowski \and \\ |
32 \bf Thomas Breuel \and | 33 \bf Thomas Breuel \and |
33 Youssouf Chherawala \and | 34 Youssouf Chherawala \and |
34 \bf Moustapha Cisse \and | 35 \bf Moustapha Cisse \and |
35 Myriam Côté \and \\ | 36 Myriam Côté \and |
36 \bf Dumitru Erhan \and | 37 \bf Dumitru Erhan \\ |
37 Jeremy Eustache \and | 38 \and \bf Jeremy Eustache \and |
38 \bf Xavier Glorot \and | 39 \bf Xavier Glorot \and |
39 Xavier Muller \and \\ | 40 Xavier Muller \and |
40 \bf Sylvain Pannetier Lebeuf \and | 41 \bf Sylvain Pannetier Lebeuf \\ |
41 Razvan Pascanu \and | 42 \and \bf Razvan Pascanu \and |
42 \bf Salah Rifai \and | 43 \bf Salah Rifai \and |
43 Francois Savard \and \\ | 44 Francois Savard \and |
44 \bf Guillaume Sicard \\ | 45 \bf Guillaume Sicard \\ |
45 \vspace*{1mm}} | 46 \vspace*{1mm}} |
46 | 47 |
47 %I can't use aistatsaddress in a single side paragraph. | 48 %I can't use aistatsaddress in a single side paragraph. |
48 %The document is 2 columns, but this section spans the 2 columns, so there is only 1 left | 49 %The document is 2 columns, but this section spans the 2 columns, so there is only 1 left |
141 observed with deep learners, we focus here on the following {\em hypothesis}: | 142 observed with deep learners, we focus here on the following {\em hypothesis}: |
142 intermediate levels of representation, especially when there are | 143 intermediate levels of representation, especially when there are |
143 more such levels, can be exploited to {\bf share | 144 more such levels, can be exploited to {\bf share |
144 statistical strength across different but related types of examples}, | 145 statistical strength across different but related types of examples}, |
145 such as examples coming from other tasks than the task of interest | 146 such as examples coming from other tasks than the task of interest |
146 (the multi-task setting), or examples coming from an overlapping | 147 (the multi-task setting~\citep{caruana97a}), or examples coming from an overlapping |
147 but different distribution (images with different kinds of perturbations | 148 but different distribution (images with different kinds of perturbations |
148 and noises, here). This is consistent with the hypotheses discussed | 149 and noises, here). This is consistent with the hypotheses discussed |
149 in~\citet{Bengio-2009} regarding the potential advantage | 150 in~\citet{Bengio-2009} regarding the potential advantage |
150 of deep learning and the idea that more levels of representation can | 151 of deep learning and the idea that more levels of representation can |
151 give rise to more abstract, more general features of the raw input. | 152 give rise to more abstract, more general features of the raw input. |
152 | 153 |
153 This hypothesis is related to a learning setting called | 154 This hypothesis is related to a learning setting called |
154 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles | 155 {\bf self-taught learning}~\citep{RainaR2007}, which combines principles |
155 of semi-supervised and multi-task learning: the learner can exploit examples | 156 of semi-supervised and multi-task learning: in addition to the labeled |
157 examples from the target distribution, the learner can exploit examples | |
156 that are unlabeled and possibly come from a distribution different from the target | 158 that are unlabeled and possibly come from a distribution different from the target |
157 distribution, e.g., from other classes than those of interest. | 159 distribution, e.g., from other classes than those of interest. |
158 It has already been shown that deep learners can clearly take advantage of | 160 It has already been shown that deep learners can clearly take advantage of |
159 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small}, | 161 unsupervised learning and unlabeled examples~\citep{Bengio-2009,WestonJ2008-small} |
162 in order to improve performance on a supervised task, | |
160 but more needed to be done to explore the impact | 163 but more needed to be done to explore the impact |
161 of {\em out-of-distribution} examples and of the {\em multi-task} setting | 164 of {\em out-of-distribution} examples and of the {\em multi-task} setting |
162 (one exception is~\citep{CollobertR2008}, which shares and uses unsupervised | 165 (two exceptions are~\citet{CollobertR2008}, which shares and uses unsupervised |
163 pre-training only with the first layer). In particular the {\em relative | 166 pre-training only with the first layer, and~\citet{icml2009_093} in the case |
167 of video data). In particular the {\em relative | |
164 advantage of deep learning} for these settings has not been evaluated. | 168 advantage of deep learning} for these settings has not been evaluated. |
165 | 169 |
166 | 170 |
167 % | 171 % |
168 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can | 172 The {\bf main claim} of this paper is that deep learners (with several levels of representation) can |
231 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the | 235 a global control parameter ($0 \le complexity \le 1$) that allows one to modulate the |
232 amount of deformation or noise introduced. | 236 amount of deformation or noise introduced. |
233 There are two main parts in the pipeline. The first one, | 237 There are two main parts in the pipeline. The first one, |
234 from thickness to pinch, performs transformations. The second | 238 from thickness to pinch, performs transformations. The second |
235 part, from blur to contrast, adds different kinds of noise. | 239 part, from blur to contrast, adds different kinds of noise. |
236 More details can be found in~\citep{ARXIV-2010}. | 240 More details can be found in~\citet{ARXIV-2010}. |
237 | 241 |
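As a rough Python sketch of this structure (assuming each deformation or noise module exposes a call of the form module(image, complexity); this is an illustration, not the actual generator of~\citet{ARXIV-2010}):

    def perturb(image, modules, complexity):
        """Sketch of the two-part pipeline: `modules` is assumed to list the
        deformation modules (thickness ... pinch) followed by the noise modules
        (blur ... contrast), each taking (image, complexity) and returning a
        new image."""
        assert 0.0 <= complexity <= 1.0   # global control parameter
        for apply_module in modules:      # deformations first, then added noise
            image = apply_module(image, complexity)
        return image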
238 \begin{figure*}[ht] | 242 \begin{figure*}[ht] |
239 \centering | 243 \centering |
240 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} | 244 \subfigure[Original]{\includegraphics[scale=0.6]{images/Original.png}\label{fig:torig}} |
241 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} | 245 \subfigure[Thickness]{\includegraphics[scale=0.6]{images/Thick_only.png}} |
266 %\vspace*{-1mm} | 270 %\vspace*{-1mm} |
267 | 271 |
268 Much previous work on deep learning had been performed on | 272 Much previous work on deep learning had been performed on |
269 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, | 273 the MNIST digits task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
270 with 60,000 examples, and variants involving 10,000 | 274 with 60,000 examples, and variants involving 10,000 |
271 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}. | 275 examples~\citep{Larochelle-jmlr-2009,VincentPLarochelleH2008-very-small}.\footnote{Fortunately, there |
276 are more and more exceptions of course, such as~\citet{RainaICML09} using a million examples.} | |
272 The focus here is on much larger training sets, from 10 times to | 277 The focus here is on much larger training sets, from 10 times to |
273 1000 times larger, and 62 classes. | 278 1000 times larger, and 62 classes. |
274 | 279 |
275 The first step in constructing the larger datasets (called NISTP and P07) is to sample from | 280 The first step in constructing the larger datasets (called NISTP and P07) is to sample from |
276 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 281 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
344 %\item | 349 %\item |
345 {\bf Fonts.} | 350 {\bf Fonts.} |
346 In order to have a good variety of sources, we downloaded a large number of free fonts from: | 351 In order to have a good variety of sources, we downloaded a large number of free fonts from: |
347 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. | 352 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. |
348 % TODO: pointless to anonymize, it's not pointing to our work | 353 % TODO: pointless to anonymize, it's not pointing to our work |
349 Including an operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. | 354 Including an operating system's (Windows 7) fonts, we chose uniformly from $9817$ different fonts. |
350 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | 355 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, |
351 directly as input to our models. | 356 directly as input to our models. |
352 %\vspace*{-1mm} | 357 %\vspace*{-1mm} |
353 | 358 |
354 %\item | 359 %\item |
366 {\bf OCR data.} | 371 {\bf OCR data.} |
367 A large set (2 million) of scanned, OCRed and manually verified machine-printed | 372 A large set (2 million) of scanned, OCRed and manually verified machine-printed |
368 characters was included as an | 373 characters was included as an |
369 additional source. This set is part of a larger corpus being collected by the Image Understanding | 374 additional source. This set is part of a larger corpus being collected by the Image Understanding |
370 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern | 375 Pattern Recognition Research group led by Thomas Breuel at University of Kaiserslautern |
371 ({\tt http://www.iupr.com}), and which will be publicly released. | 376 ({\tt http://www.iupr.com}).%, and which will be publicly released. |
372 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this | 377 %TODO: let's hope that Thomas is not a reviewer! :) Seriously though, maybe we should anonymize this |
373 %\end{itemize} | 378 %\end{itemize} |
374 | 379 |
375 %\vspace*{-3mm} | 380 %\vspace*{-3mm} |
376 \subsection{Data Sets} | 381 \subsection{Data Sets} |
377 %\vspace*{-2mm} | 382 %\vspace*{-2mm} |
378 | 383 |
379 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label | 384 All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with one of 62 character labels. |
380 from one of the 62 character classes. | |
381 %\begin{itemize} | 385 %\begin{itemize} |
382 %\vspace*{-1mm} | 386 %\vspace*{-1mm} |
383 | 387 |
384 %\item | 388 %\item |
385 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has | 389 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has |
386 \{651,668 / 80,000 / 82,587\} \{training / validation / test\} examples. | 390 \{651,668 / 80,000 / 82,587\} \{training / validation / test\} examples. |
387 %\vspace*{-1mm} | 391 %\vspace*{-1mm} |
388 | 392 |
389 %\item | 393 %\item |
390 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | 394 {\bf P07.} This dataset is obtained by taking raw characters from the above 4 sources |
391 and sending them through the transformation pipeline described in section \ref{s:perturbations}. | 395 and sending them through the transformation pipeline described in section \ref{s:perturbations}. |
392 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, | 396 For each generated example, a data source is selected with probability $10\%$ from the fonts, |
393 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | 397 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. The transformations are |
398 applied in the | |
394 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. | 399 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. |
395 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples | 400 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples |
396 obtained from the corresponding NIST sets plus other sources. | 401 obtained from the corresponding NIST sets plus other sources. |
397 %\vspace*{-1mm} | 402 %\vspace*{-1mm} |
398 | 403 |
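To make the sampling recipe concrete, here is a minimal Python sketch of drawing one P07 example under the mixture proportions and complexity range given above; draw_raw_character and the ordered transformations list are hypothetical helpers, not code from the paper.

    import numpy as np

    SOURCES = ["fonts", "captcha", "ocr", "nist"]
    PROBS = [0.10, 0.25, 0.25, 0.40]                 # mixture proportions from the text

    def generate_p07_example(draw_raw_character, transformations, rng=np.random):
        """Sketch only: pick a data source, then perturb the raw character."""
        source = SOURCES[rng.choice(len(SOURCES), p=PROBS)]
        image, label = draw_raw_character(source)    # hypothetical helper
        for transform in transformations:            # applied in the order given above
            complexity = rng.uniform(0.0, 0.7)       # one complexity draw per transformation
            image = transform(image, complexity)
        return image, label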
399 %\item | 404 %\item |
400 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) | 405 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) |
401 except that we only apply | 406 except that we only apply |
402 transformations from slant to pinch (see Fig.~\ref{fig:transform}(b-f)). | 407 transformations from slant to pinch (see Fig.~\ref{fig:transform}(b-f)). |
403 Therefore, the character is | 408 Therefore, the character is |
404 transformed but no additional noise is added to the image, giving images | 409 transformed but without added noise, yielding images |
405 closer to the NIST dataset. | 410 closer to the NIST dataset. |
406 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples | 411 It has \{81,920,000 / 80,000 / 20,000\} \{training / validation / test\} examples |
407 obtained from the corresponding NIST sets plus other sources. | 412 obtained from the corresponding NIST sets plus other sources. |
408 %\end{itemize} | 413 %\end{itemize} |
409 | 414 |
410 \begin{figure*}[ht] | 415 \vspace*{-3mm} |
411 %\vspace*{-2mm} | |
412 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
413 %\vspace*{-2mm} | |
414 \caption{Illustration of the computations and training criterion for the denoising | |
415 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
416 the layer (i.e. raw input or output of previous layer) | |
417 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
418 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
419 is compared to the uncorrupted input $x$ through the loss function | |
420 $L_H(x,z)$, whose expected value is approximately minimized during training | |
421 by tuning $\theta$ and $\theta'$.} | |
422 \label{fig:da} | |
423 %\vspace*{-2mm} | |
424 \end{figure*} | |
425 | |
426 %\vspace*{-3mm} | |
427 \subsection{Models and their Hyper-parameters} | 416 \subsection{Models and their Hyper-parameters} |
428 %\vspace*{-2mm} | 417 %\vspace*{-2mm} |
429 | 418 |
430 The experiments are performed using MLPs (with a single | 419 The experiments are performed using MLPs (with a single |
431 hidden layer) and deep SDAs. | 420 hidden layer) and deep SDAs. |
432 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} | 421 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} |
433 | 422 |
434 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimated with | 423 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates the |
424 class-conditional probabilities | |
435 \[ | 425 \[ |
436 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)), | 426 P({\rm class}|{\rm input}=x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)), |
437 \] | 427 \] |
438 i.e., two layers, where $p={\rm softmax}(a)$ means that | 428 i.e., two layers, where $p={\rm softmax}(a)$ means that |
439 $p_i(x)=\exp(a_i)/\sum_j \exp(a_j)$ | 429 $p_i(x)=\exp(a_i)/\sum_j \exp(a_j)$ |
472 %through preliminary experiments (measuring performance on a validation set), | 462 %through preliminary experiments (measuring performance on a validation set), |
473 %and $0.1$ (which was found to work best) was then selected for optimizing on | 463 %and $0.1$ (which was found to work best) was then selected for optimizing on |
474 %the whole training sets. | 464 %the whole training sets. |
475 %\vspace*{-1mm} | 465 %\vspace*{-1mm} |
476 | 466 |
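For illustration, a minimal NumPy sketch of this forward computation (the weight shapes, e.g. W1 of size n_hidden x n_inputs and W2 of size 62 x n_hidden, are assumptions; this is not the implementation used for the experiments):

    import numpy as np

    def mlp_output(x, W1, b1, W2, b2):
        """Sketch of P(class | input = x) = softmax(b2 + W2 tanh(b1 + W1 x))."""
        h = np.tanh(b1 + W1.dot(x))   # hidden layer
        a = b2 + W2.dot(h)            # pre-softmax activations
        a = a - a.max()               # subtract the max for numerical stability
        e = np.exp(a)
        return e / e.sum()            # p_i = exp(a_i) / sum_j exp(a_j)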
467 \begin{figure*}[htb] | |
468 %\vspace*{-2mm} | |
469 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | |
470 %\vspace*{-2mm} | |
471 \caption{Illustration of the computations and training criterion for the denoising | |
472 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | |
473 the layer (i.e. raw input or output of previous layer) | |
474 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | |
475 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | |
476 is compared to the uncorrupted input $x$ through the loss function | |
477 $L_H(x,z)$, whose expected value is approximately minimized during training | |
478 by tuning $\theta$ and $\theta'$.} | |
479 \label{fig:da} | |
480 %\vspace*{-2mm} | |
481 \end{figure*} | |
482 | |
483 %\afterpage{\clearpage} | |
477 | 484 |
478 {\bf Stacked Denoising Auto-encoders (SDA).} | 485 {\bf Stacked Denoising Auto-encoders (SDA).} |
479 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 486 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
480 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 487 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
481 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, | 488 layers)~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006}, |
482 apparently setting parameters in the | 489 apparently setting parameters in the |
483 basin of attraction of supervised gradient descent yielding better | 490 basin of attraction of supervised gradient descent yielding better |
484 generalization~\citep{Erhan+al-2010}. This initial {\em unsupervised | 491 generalization~\citep{Erhan+al-2010}. |
485 pre-training phase} uses all of the training images but not the training labels. | 492 This initial {\em unsupervised |
493 pre-training phase} does not use the training labels. | |
486 Each layer is trained in turn to produce a new representation of its input | 494 Each layer is trained in turn to produce a new representation of its input |
487 (starting from the raw pixels). | 495 (starting from the raw pixels). |
488 It is hypothesized that the | 496 It is hypothesized that the |
489 advantage brought by this procedure stems from a better prior, | 497 advantage brought by this procedure stems from a better prior, |
490 on the one hand taking advantage of the link between the input | 498 on the one hand taking advantage of the link between the input |
499 these deep hierarchies of features, as it is simple to train and | 507 these deep hierarchies of features, as it is simple to train and |
500 explain (see Figure~\ref{fig:da}, as well as | 508 explain (see Figure~\ref{fig:da}, as well as |
501 tutorial and code at {\tt http://deeplearning.net/tutorial}), | 509 tutorial and code at {\tt http://deeplearning.net/tutorial}), |
502 provides efficient inference, and yielded results | 510 provides efficient inference, and yielded results |
503 comparable to or better than RBMs in a series of experiments | 511 comparable to or better than RBMs in a series of experiments |
504 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian | 512 \citep{VincentPLarochelleH2008-very-small}. |
513 Some denoising auto-encoders correspond | |
514 to a Gaussian | |
505 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. | 515 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. |
506 During its unsupervised training, a Denoising | 516 During its unsupervised training, a Denoising |
507 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$ | 517 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$ |
508 of the input $x$ and trained to produce a reconstruction $z$ | 518 of the input $x$ and trained to produce a reconstruction $z$ |
509 of the uncorrupted input $x$. Because the network has to denoise, the | 519 of the uncorrupted input $x$. Because the network has to denoise, the |
510 hidden units $y$ are forced to represent the leading regularities in | 520 hidden units $y$ are forced to represent the leading regularities in |
511 the data. Following~\citep{VincentPLarochelleH2008-very-small} | 521 the data. In a slight departure from \citet{VincentPLarochelleH2008-very-small}, |
512 the hidden units output $y$ is obtained through the sigmoid-affine | 522 the hidden units output $y$ is obtained through the tanh-affine |
513 encoder | 523 encoder |
514 \[ | 524 $y=\tanh(c+V x)$ |
515 y={\rm sigm}(c+V x) | 525 and the reconstruction is obtained through the transposed transformation |
516 \] | 526 $z=\tanh(d+V' y)$. |
517 where ${\rm sigm}(a)=1/(1+\exp(-a))$ | |
518 and the reconstruction is obtained through the same transformation | |
519 \[ | |
520 z={\rm sigm}(d+V' y) | |
521 \] | |
522 using the transpose of encoder weights. | |
523 The training | 527 The training |
524 set average of the cross-entropy | 528 set average of the cross-entropy |
525 reconstruction loss | 529 reconstruction loss (after mapping values in $(-1,1)$ back into $(0,1)$) |
526 \[ | 530 \[ |
527 L_H(x,z)=\sum_i z_i \log x_i + (1-z_i) \log(1-x_i) | 531 L_H(x,z)=-\sum_i \left[ \frac{(x_i+1)}{2} \log \frac{(z_i+1)}{2} + \frac{(1-x_i)}{2} \log\frac{(1-z_i)}{2} \right] |
528 \] | 532 \] |
529 is minimized. | 533 is minimized. |
530 Here we use the random binary masking corruption | 534 Here we use the random binary masking corruption |
531 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and | 535 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and |
532 copies the rest). | 536 copies the rest). |
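Putting these pieces together, a minimal NumPy sketch of the denoising auto-encoder computations (masking corruption, tied tanh encoder/decoder, and the remapped cross-entropy loss); the corruption fraction shown and the use of plain NumPy are assumptions for illustration, and the stochastic gradient updates on the parameters (V, c, d) are not shown.

    import numpy as np

    def dae_forward(x, V, c, d, corruption=0.2, rng=np.random):
        """Sketch only: one forward pass and loss of the denoising auto-encoder."""
        mask = (rng.rand(*x.shape) > corruption).astype(x.dtype)
        x_tilde = mask * x                 # masking corruption: zero a random subset of x
        y = np.tanh(c + V.dot(x_tilde))    # encoder f_theta
        z = np.tanh(d + V.T.dot(y))        # decoder g_theta' (transposed weights)
        p = (x + 1.0) / 2.0                # target mapped back into (0,1)
        q = np.clip((z + 1.0) / 2.0, 1e-7, 1.0 - 1e-7)  # reconstruction, clipped for log
        loss = -np.sum(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))
        return y, z, loss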
550 separate learning rate for the unsupervised pre-training stage (selected | 554 separate learning rate for the unsupervised pre-training stage (selected |
551 from the same above set). The fraction of inputs corrupted was selected | 555 from the same above set). The fraction of inputs corrupted was selected |
552 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 556 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
553 of hidden layers, but it was fixed to 3 for our experiments, | 557 of hidden layers, but it was fixed to 3 for our experiments, |
554 based on previous work with | 558 based on previous work with |
555 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. | 559 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. |
556 We also compared against 1 and against 2 hidden layers, | 560 We also compared against 1 and against 2 hidden layers, |
557 to disentangle the effect of depth from that of unsupervised | 561 to disentangle the effect of depth from that of unsupervised |
558 pre-training. | 562 pre-training. |
559 The size of each hidden | 563 The size of each hidden |
560 layer was kept constant across hidden layers, and the best results | 564 layer was kept constant across hidden layers, and the best results |
765 framework and out-of-distribution examples}? | 769 framework and out-of-distribution examples}? |
766 The key idea is that the lower layers of the predictor compute a hierarchy | 770 The key idea is that the lower layers of the predictor compute a hierarchy |
767 of features that can be shared across tasks or across variants of the | 771 of features that can be shared across tasks or across variants of the |
768 input distribution. A theoretical analysis of generalization improvements | 772 input distribution. A theoretical analysis of generalization improvements |
769 due to sharing of intermediate features across tasks already points | 773 due to sharing of intermediate features across tasks already points |
770 towards that explanation~\cite{baxter95a}. | 774 towards that explanation~\citep{baxter95a}. |
771 Intermediate features that can be used in different | 775 Intermediate features that can be used in different |
772 contexts can be estimated in a way that allows one to share statistical | 776 contexts can be estimated in a way that allows one to share statistical |
773 strength. Features extracted through many levels are more likely to | 777 strength. Features extracted through many levels are more likely to |
774 be more abstract and more invariant to some of the factors of variation | 778 be more abstract and more invariant to some of the factors of variation |
775 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest), | 779 in the underlying distribution (as the experiments in~\citet{Goodfellow2009} suggest), |
792 (with or without out-of-distribution examples) from random initialization, and more labeled examples | 796 (with or without out-of-distribution examples) from random initialization, and more labeled examples |
793 do not allow the shallow or purely supervised models to discover | 797 do not allow the shallow or purely supervised models to discover |
794 the kind of better basins associated | 798 the kind of better basins associated |
795 with deep learning and out-of-distribution examples. | 799 with deep learning and out-of-distribution examples. |
796 | 800 |
797 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 801 A Java demo of the recognizer (where both the MLP and the SDA can be compared) |
798 can be executed on-line at {\tt http://deep.host22.com}. | 802 can be executed on-line at {\tt http://deep.host22.com}. |
799 | 803 |
800 \iffalse | 804 \iffalse |
801 \section*{Appendix I: Detailed Numerical Results} | 805 \section*{Appendix I: Detailed Numerical Results} |
802 | 806 |