comparison writeup/aistats2011_cameraready.tex @ 631:510220effb14

corrections requested by the reviewer
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 19 Mar 2011 22:44:53 -0400
parents f55f1b1499c4
children 54e8958e963b
comparison
630:f55f1b1499c4 631:510220effb14
8 \usepackage{bbm} 8 \usepackage{bbm}
9 \usepackage[utf8]{inputenc} 9 \usepackage[utf8]{inputenc}
10 \usepackage[psamsfonts]{amssymb} 10 \usepackage[psamsfonts]{amssymb}
11 %\usepackage{algorithm,algorithmic} % not used after all 11 %\usepackage{algorithm,algorithmic} % not used after all
12 \usepackage{graphicx,subfigure} 12 \usepackage{graphicx,subfigure}
13 \usepackage[numbers]{natbib} 13 \usepackage{natbib}
14 14
15 \addtolength{\textwidth}{10mm} 15 \addtolength{\textwidth}{10mm}
16 \addtolength{\evensidemargin}{-5mm} 16 \addtolength{\evensidemargin}{-5mm}
17 \addtolength{\oddsidemargin}{-5mm} 17 \addtolength{\oddsidemargin}{-5mm}
18 18
428 428
429 The experiments are performed using MLPs (with a single 429 The experiments are performed using MLPs (with a single
430 hidden layer) and deep SDAs. 430 hidden layer) and deep SDAs.
431 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} 431 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
432 432
433 {\bf Multi-Layer Perceptrons (MLP).} Whereas previous work had compared 433 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates
434 \[
435 P({\rm class}|{\rm input}=x)
436 \]
437 with
438 \[
439 f(x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)),
440 \]
441 i.e., two layers, where
442 \[
443 p={\rm softmax}(a)
444 \]
445 means that
446 \[
447 p_i(x)=\exp(a_i)/\sum_j \exp(a_j)
448 \]
449 with $p_i(x)$ representing the probability
450 for class $i$, $\tanh$ is the element-wise
451 hyperbolic tangent, $b_i$ are parameter vectors, and $W_i$ are
452 parameter matrices (one per layer). The
453 number of rows of $W_1$ is called the number of hidden units (of the
454 single hidden layer, here), and
455 is one way to control capacity (the other main ways to control capacity are
456 the number of training iterations and, optionally, a regularization penalty
457 on the parameters, which was not used here because it did not help).
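As an illustrative aside (not from the paper, whose experiments used the Theano code at {\tt http://deeplearning.net/tutorial}; the input size, 500 hidden units and 62 classes below are assumptions made for the example), the two-layer computation $f(x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1x))$ can be sketched in NumPy as:
\begin{verbatim}
import numpy as np

def softmax(a):
    # p_i = exp(a_i) / sum_j exp(a_j); subtract max(a) for numerical stability
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    h = np.tanh(b1 + W1.dot(x))      # hidden layer; rows of W1 = number of hidden units
    return softmax(b2 + W2.dot(h))   # estimated P(class | input = x)

# assumed sizes: 32x32 input, 500 hidden units, 62 character classes
rng = np.random.RandomState(0)
W1, b1 = 0.01 * rng.randn(500, 1024), np.zeros(500)
W2, b2 = 0.01 * rng.randn(62, 500), np.zeros(62)
p = mlp_forward(rng.rand(1024), W1, b1, W2, b2)   # p is a length-62 probability vector
\end{verbatim}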
458 Whereas previous work had compared
434 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs 459 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs
435 here because of the very large datasets used (making the use of SVMs 460 here because of the very large datasets used (making the use of SVMs
436 computationally challenging because of their quadratic scaling 461 computationally challenging because of their quadratic scaling
437 behavior). Preliminary experiments on training SVMs (libSVM) with subsets 462 behavior). Preliminary experiments on training SVMs (libSVM) with subsets
438 of the training set allowing the program to fit in memory yielded 463 of the training set allowing the program to fit in memory yielded
446 better implementation allowing for training with more examples and 471 better implementation allowing for training with more examples and
447 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the 472 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the
448 perturbed data), the MLPs and SDA are much more convenient than classifiers 473 perturbed data), the MLPs and SDA are much more convenient than classifiers
449 based on kernel methods. The MLP has a single hidden layer with $\tanh$ 474 based on kernel methods. The MLP has a single hidden layer with $\tanh$
450 activation functions, and softmax (normalized exponentials) on the output 475 activation functions, and softmax (normalized exponentials) on the output
451 layer for estimating $P(class | image)$. The number of hidden units is 476 layer for estimating $P({\rm class} | {\rm input})$. The number of hidden units is
452 taken in $\{300,500,800,1000,1500\}$. Training examples are presented in 477 taken in $\{300,500,800,1000,1500\}$. Training examples are presented in
453 minibatches of size 20. A constant learning rate was chosen among $\{0.001, 478 minibatches of size 20, i.e., the parameters are iteratively updated in the direction
479 of the mean gradient of the next 20 examples. A constant learning rate was chosen among $\{0.001,
454 0.01, 0.025, 0.075, 0.1, 0.5\}$. 480 0.01, 0.025, 0.075, 0.1, 0.5\}$.
455 %through preliminary experiments (measuring performance on a validation set), 481 %through preliminary experiments (measuring performance on a validation set),
456 %and $0.1$ (which was found to work best) was then selected for optimizing on 482 %and $0.1$ (which was found to work best) was then selected for optimizing on
457 %the whole training sets. 483 %the whole training sets.
458 %\vspace*{-1mm} 484 %\vspace*{-1mm}
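As a sketch of the training loop just described (minibatches of size 20, parameters moved along the mean gradient with a constant learning rate); {\tt grad\_fn} is an assumed helper returning the mean minibatch gradients, not something taken from the paper's code:
\begin{verbatim}
import numpy as np

def sgd_train(params, grad_fn, X, y, learning_rate=0.1,
              minibatch_size=20, n_epochs=1):
    # params: list of NumPy arrays, updated in place
    # grad_fn(params, Xb, yb): mean gradient of the loss over the minibatch
    n = len(X)
    for _ in range(n_epochs):
        for start in range(0, n - minibatch_size + 1, minibatch_size):
            Xb = X[start:start + minibatch_size]
            yb = y[start:start + minibatch_size]
            for p, g in zip(params, grad_fn(params, Xb, yb)):
                p -= learning_rate * g   # constant rate, e.g. chosen in {0.001,...,0.5}
    return params
\end{verbatim}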
484 tutorial and code there: {\tt http://deeplearning.net/tutorial}), 510 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
485 provides efficient inference, and yielded results 511 provides efficient inference, and yielded results
486 comparable to or better than RBMs in a series of experiments 512 comparable to or better than RBMs in a series of experiments
487 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian 513 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian
488 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. 514 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}.
489 During training, a Denoising 515 During its unsupervised training, a Denoising
490 Auto-encoder is presented with a stochastically corrupted version 516 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$
491 of the input and trained to reconstruct the uncorrupted input, 517 of the input $x$ and trained to produce a reconstruction $z$
492 forcing the hidden units to represent the leading regularities in 518 of the uncorrupted input $x$. The need to denoise forces the
493 the data. Here we use the random binary masking corruption 519 hidden units $y$ to represent the leading regularities in
494 (which sets to 0 a random subset of the inputs). 520 the data. Following~\citep{VincentPLarochelleH2008-very-small},
495 Once it is trained, in a purely unsupervised way, 521 the hidden units' output $y$ is obtained through
496 its hidden units' activations can 522 \[
497 be used as inputs for training a second one, etc. 523 y={\rm sigm}(c+V x)
524 \]
525 where ${\rm sigm}(a)=1/(1+\exp(-a))$
526 and the reconstruction is
527 \[
528 z={\rm sigm}(d+V' y).
529 \]
530 We minimize the training
531 set average of the cross-entropy
532 reconstruction error
533 \[
534 L_H(x,z)=-\sum_i \left[ x_i \log z_i + (1-x_i) \log(1-z_i) \right].
535 \]
536 Here we use the random binary masking corruption
537 (which forms $\tilde{x}$ by setting a random subset of the elements of $x$ to 0 and
538 copying the rest).
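To make the above concrete, here is a small NumPy sketch of a single denoising step (our illustration, not the paper's implementation; we assume tied weights, i.e., the decoder matrix $V'$ is taken as $V^\top$ as in the {\tt deeplearning.net} tutorial, and the sizes are made up):
\begin{verbatim}
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def masking_corruption(x, fraction, rng):
    # set a randomly selected fraction of the elements of x to 0, copy the rest
    return x * (rng.rand(x.shape[0]) >= fraction)

def dae_loss(x, V, c, d, fraction, rng):
    x_tilde = masking_corruption(x, fraction, rng)   # corrupted input
    y = sigm(c + V.dot(x_tilde))                     # hidden code y = sigm(c + V x~)
    z = sigm(d + V.T.dot(y))                         # reconstruction z = sigm(d + V' y)
    # cross-entropy reconstruction error L_H(x, z)
    return -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

rng = np.random.RandomState(0)
x = rng.rand(1024)                                   # assumed 32x32 input in [0, 1]
V, c, d = 0.01 * rng.randn(500, 1024), np.zeros(500), np.zeros(1024)
print(dae_loss(x, V, c, d, fraction=0.2, rng=rng))
\end{verbatim}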
539 Once the first denoising auto-encoder is trained, its parameters can be used
540 to set the first layer of the deep MLP. The original data are then processed
541 through that first layer, and the outputs of the hidden units form a new
542 representation that can be used as input data for training a second denoising
543 auto-encoder, still in a purely unsupervised way.
544 This is repeated for the desired number of hidden layers.
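Schematically, the greedy layer-wise procedure can be sketched as follows (the helper {\tt train\_dae}, which trains one denoising auto-encoder without supervision and returns its encoder parameters, is an assumption of ours, not part of the paper):
\begin{verbatim}
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def greedy_pretrain(X, hidden_sizes, train_dae):
    # X: (n_examples, n_features) uncorrupted data
    # hidden_sizes: e.g. [1000, 1000, 1000] for 3 hidden layers
    layers, rep = [], X
    for n_hidden in hidden_sizes:
        V, c = train_dae(rep, n_hidden)   # unsupervised training of one DAE on 'rep'
        layers.append((V, c))
        rep = sigm(c + rep.dot(V.T))      # new representation = hidden-unit outputs
    return layers                         # used to initialize the deep MLP (see below)
\end{verbatim}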
498 After this unsupervised pre-training stage, the parameters 545 After this unsupervised pre-training stage, the parameters
499 are used to initialize a deep MLP, which is fine-tuned by 546 are used to initialize a deep MLP (similar to the above, but
500 the same standard procedure used to train them (see above). 547 with more layers), which is fine-tuned by
548 the same standard procedure (stochastic gradient descent)
549 used to train MLPs in general (see above).
550 The top-layer parameters of the deep MLP (the layer that outputs the
551 class probabilities and takes the top hidden layer as input) can
552 be initialized to 0.
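In code form (a sketch with shapes and names of our own choosing), the initialization described above amounts to copying the pre-trained encoder parameters into the hidden layers and zeroing the top softmax layer before supervised fine-tuning:
\begin{verbatim}
import numpy as np

def init_deep_mlp(pretrained_layers, n_classes):
    # pretrained_layers: list of (V, c) pairs from the stacked denoising auto-encoders
    params = [(V.copy(), c.copy()) for (V, c) in pretrained_layers]
    n_top_inputs = pretrained_layers[-1][0].shape[0]   # size of the top hidden layer
    W_out = np.zeros((n_classes, n_top_inputs))        # top-layer weights initialized to 0
    b_out = np.zeros(n_classes)
    params.append((W_out, b_out))
    return params   # then fine-tune all parameters by stochastic gradient descent
\end{verbatim}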
501 The SDA hyper-parameters are the same as for the MLP, with the addition of the 553 The SDA hyper-parameters are the same as for the MLP, with the addition of the
502 amount of corruption noise (we used the masking noise process, whereby a 554 amount of corruption noise (we used the masking noise process, whereby a
503 fixed proportion of the input values, randomly selected, are zeroed), and a 555 fixed proportion of the input values, randomly selected, are zeroed), and a
504 separate learning rate for the unsupervised pre-training stage (selected 556 separate learning rate for the unsupervised pre-training stage (selected
505 from the same set of values as above). The fraction of inputs corrupted was selected 557 from the same set of values as above). The fraction of inputs corrupted was selected
506 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number 558 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
507 of hidden layers, but it was fixed to 3 for most experiments, 559 of hidden layers, but it was fixed to 3 for our experiments,
508 based on previous work with 560 based on previous work with
509 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. 561 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
510 We also compared against 1 and against 2 hidden layers, in order 562 We also compared against 1 and against 2 hidden layers, in order
511 to disentangle the effect of depth from the effect of unsupervised 563 to disentangle the effect of depth from the effect of unsupervised
512 pre-training. 564 pre-training.
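For reference, the hyper-parameter grids quoted in this section can be enumerated as below (an illustrative listing only; the paper selects the configuration with the lowest NISTP validation error, and the search itself is not reproduced here):
\begin{verbatim}
from itertools import product

hidden_units      = [300, 500, 800, 1000, 1500]
learning_rates    = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]   # supervised training
pretrain_rates    = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]   # SDA pre-training (same set)
corruption_levels = [0.10, 0.20, 0.50]                      # fraction of masked inputs
n_hidden_layers   = [1, 2, 3]                               # 3 used for most experiments

sda_grid = list(product(hidden_units, learning_rates, pretrain_rates,
                        corruption_levels, n_hidden_layers))
print(len(sda_grid), "candidate SDA configurations")
\end{verbatim}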