comparison writeup/aistats2011_cameraready.tex @ 631:510220effb14
corrections requested by reviewer
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Sat, 19 Mar 2011 22:44:53 -0400 |
parents | f55f1b1499c4 |
children | 54e8958e963b |
630:f55f1b1499c4 | 631:510220effb14 |
---|---|
8 \usepackage{bbm} | 8 \usepackage{bbm} |
9 \usepackage[utf8]{inputenc} | 9 \usepackage[utf8]{inputenc} |
10 \usepackage[psamsfonts]{amssymb} | 10 \usepackage[psamsfonts]{amssymb} |
11 %\usepackage{algorithm,algorithmic} % not used after all | 11 %\usepackage{algorithm,algorithmic} % not used after all |
12 \usepackage{graphicx,subfigure} | 12 \usepackage{graphicx,subfigure} |
13 \usepackage[numbers]{natbib} | 13 \usepackage{natbib} |
14 | 14 |
15 \addtolength{\textwidth}{10mm} | 15 \addtolength{\textwidth}{10mm} |
16 \addtolength{\evensidemargin}{-5mm} | 16 \addtolength{\evensidemargin}{-5mm} |
17 \addtolength{\oddsidemargin}{-5mm} | 17 \addtolength{\oddsidemargin}{-5mm} |
18 | 18 |
428 | 428 |
429 The experiments are performed using MLPs (with a single | 429 The experiments are performed using MLPs (with a single |
430 hidden layer) and deep SDAs. | 430 hidden layer) and deep SDAs. |
431 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} | 431 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} |
432 | 432 |
433 {\bf Multi-Layer Perceptrons (MLP).} Whereas previous work had compared | 433 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates |
434 \[ | |
435 P({\rm class}|{\rm input}=x) | |
436 \] | |
437 with | |
438 \[ | |
439 f(x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)), | |
440 \] | |
441 i.e., two layers, where | |
442 \[ | |
443 p={\rm softmax}(a) | |
444 \] | |
445 means that | |
446 \[ | |
447 p_i(x)=\exp(a_i)/\sum_j \exp(a_j), | |
448 \] | |
449 the probability assigned to class $i$; | |
450 $\tanh$ is the element-wise | |
451 hyperbolic tangent, $b_i$ are parameter vectors, and $W_i$ are | |
452 parameter matrices (one per layer). The | |
453 number of rows of $W_1$ is called the number of hidden units (of the | |
454 single hidden layer, here), and | |
455 is one way to control capacity (the main other ways to control capacity are | |
456 the number of training iterations and optionally a regularization penalty | |
457 on the parameters, not used here because it did not help). | |
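As a concrete reading of the equations above, here is a minimal NumPy sketch of the two-layer forward pass $f(x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x))$; the layer sizes and initialization below are illustrative assumptions, not the settings used in the experiments.

```python
# Illustrative sketch only (not the paper's implementation): the
# single-hidden-layer MLP f(x) = softmax(b2 + W2 tanh(b1 + W1 x)).
import numpy as np

def softmax(a):
    # p_i = exp(a_i) / sum_j exp(a_j), shifted by max(a) for numerical stability
    e = np.exp(a - a.max())
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    """Return the estimated P(class | input = x)."""
    h = np.tanh(b1 + W1 @ x)       # hidden layer; rows of W1 = hidden units
    return softmax(b2 + W2 @ h)    # output layer: one probability per class

# Assumed sizes for the example (32x32 input, 500 hidden units, 62 classes):
rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32 * 32, 500, 62
W1 = rng.uniform(-0.01, 0.01, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.01, 0.01, (n_classes, n_hidden)); b2 = np.zeros(n_classes)
p = mlp_forward(rng.rand(n_in), W1, b1, W2, b2)   # p >= 0 and p.sum() == 1
```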
458 Whereas previous work had compared | |
434 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs | 459 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs |
435 here because of the very large datasets used (making the use of SVMs | 460 here because of the very large datasets used (making the use of SVMs |
436 computationally challenging, given their quadratic scaling | 461 computationally challenging, given their quadratic scaling |
437 behavior). Preliminary experiments on training SVMs (libSVM) with subsets | 462 behavior). Preliminary experiments on training SVMs (libSVM) with subsets |
438 of the training set allowing the program to fit in memory yielded | 463 of the training set allowing the program to fit in memory yielded |
446 better implementation allowing for training with more examples and | 471 better implementation allowing for training with more examples and |
447 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the | 472 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the |
448 perturbed data), the MLPs and SDA are much more convenient than classifiers | 473 perturbed data), the MLPs and SDA are much more convenient than classifiers |
449 based on kernel methods. The MLP has a single hidden layer with $\tanh$ | 474 based on kernel methods. The MLP has a single hidden layer with $\tanh$ |
450 activation functions, and softmax (normalized exponentials) on the output | 475 activation functions, and softmax (normalized exponentials) on the output |
451 layer for estimating $P(class | image)$. The number of hidden units is | 476 layer for estimating $P({\rm class} | {\rm input})$. The number of hidden units is |
452 chosen from $\{300,500,800,1000,1500\}$. Training examples are presented in | 477 chosen from $\{300,500,800,1000,1500\}$. Training examples are presented in |
453 minibatches of size 20. A constant learning rate was chosen among $\{0.001, | 478 minibatches of size 20, i.e., the parameters are iteratively updated in the direction |
479 of the mean gradient of the next 20 examples. A constant learning rate was chosen among $\{0.001, | |
454 0.01, 0.025, 0.075, 0.1, 0.5\}$. | 480 0.01, 0.025, 0.075, 0.1, 0.5\}$. |
455 %through preliminary experiments (measuring performance on a validation set), | 481 %through preliminary experiments (measuring performance on a validation set), |
456 %and $0.1$ (which was found to work best) was then selected for optimizing on | 482 %and $0.1$ (which was found to work best) was then selected for optimizing on |
457 %the whole training sets. | 483 %the whole training sets. |
458 %\vspace*{-1mm} | 484 %\vspace*{-1mm} |
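To make the training procedure just described concrete, the following sketch shows constant-learning-rate stochastic gradient descent with minibatches of 20 examples; `grad_nll` is a hypothetical helper returning the mean gradient of the negative log-likelihood over a minibatch, and is not part of the paper's code.

```python
# Sketch under stated assumptions: plain minibatch SGD with a constant learning rate.
def sgd_train(params, grad_nll, X, y, learning_rate=0.1, batch_size=20, n_epochs=1):
    """params: dict of NumPy arrays; grad_nll(params, Xb, yb) -> dict of gradients
    (mean over the minibatch) with the same keys -- assumed, not from the paper."""
    n_examples = X.shape[0]
    for _ in range(n_epochs):
        for start in range(0, n_examples, batch_size):
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            grads = grad_nll(params, Xb, yb)   # mean gradient of the next 20 examples
            for name in params:                # move along the negative gradient
                params[name] -= learning_rate * grads[name]
    return params
```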
484 tutorial and code there: {\tt http://deeplearning.net/tutorial}), | 510 tutorial and code there: {\tt http://deeplearning.net/tutorial}), |
485 provides efficient inference, and yielded results | 511 provides efficient inference, and yielded results |
486 comparable to or better than RBMs in a series of experiments | 512 comparable to or better than RBMs in a series of experiments |
487 \citep{VincentPLarochelleH2008-very-small}. It in fact corresponds to a Gaussian | 513 \citep{VincentPLarochelleH2008-very-small}. It in fact corresponds to a Gaussian |
488 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. | 514 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. |
489 During training, a Denoising | 515 During its unsupervised training, a Denoising |
490 Auto-encoder is presented with a stochastically corrupted version | 516 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$ |
491 of the input and trained to reconstruct the uncorrupted input, | 517 of the input $x$ and trained to produce a reconstruction $z$ |
492 forcing the hidden units to represent the leading regularities in | 518 of the uncorrupted input $x$. Because the network has to denoise, |
493 the data. Here we use the random binary masking corruption | 519 the hidden units $y$ are forced to represent the leading regularities in |
494 (which sets to 0 a random subset of the inputs). | 520 the data. Following~\citep{VincentPLarochelleH2008-very-small}, |
495 Once it is trained, in a purely unsupervised way, | 521 the hidden unit outputs $y$ are obtained through |
496 its hidden units' activations can | 522 \[ |
497 be used as inputs for training a second one, etc. | 523 y={\rm sigm}(c+V x) |
524 \] | |
525 where ${\rm sigm}(a)=1/(1+\exp(-a))$ | |
526 and the reconstruction is | |
527 \[ | |
528 z={\rm sigm}(d+V' y). | |
529 \] | |
530 We minimize the training | |
531 set average of the cross-entropy | |
532 reconstruction error | |
533 \[ | |
534 L_H(x,z)=-\sum_i \left[ x_i \log z_i + (1-x_i) \log(1-z_i) \right]. | |
535 \] | |
536 Here we use the random binary masking corruption | |
537 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and | |
538 copies the rest). | |
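Putting the last few equations together, the sketch below shows one stochastic update of a single denoising auto-encoder with masking corruption, assuming tied weights ($V'=V^{\top}$) and the cross-entropy loss above; it is an illustration, not the code used for the experiments.

```python
# Illustrative denoising auto-encoder update (assumes tied weights V' = V^T).
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_update(x, V, c, d, corruption=0.2, lr=0.01, rng=np.random):
    # Masking corruption: set a random subset of the inputs to 0, copy the rest.
    x_tilde = x * (rng.rand(x.size) >= corruption)
    y = sigm(c + V @ x_tilde)        # hidden code y = sigm(c + V x~)
    z = sigm(d + V.T @ y)            # reconstruction z = sigm(d + V' y)
    # Gradients of L_H(x, z) = -sum_i [x_i log z_i + (1 - x_i) log(1 - z_i)]
    dz = z - x                       # w.r.t. the reconstruction pre-activation
    dy = (V @ dz) * y * (1.0 - y)    # back-propagated to the code pre-activation
    V -= lr * (np.outer(dy, x_tilde) + np.outer(y, dz))  # both uses of V (tied)
    c -= lr * dy
    d -= lr * dz
    return V, c, d
```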
539 Once the first denoising auto-encoder is trained, its parameters can be used | |
540 to set the first layer of the deep MLP. The original data are then processed | |
541 through that first layer, and the outputs of the hidden units form a new | |
542 representation that can be used as input data for training a second denoising | |
543 auto-encoder, still in a purely unsupervised way. | |
544 This is repeated for the desired number of hidden layers. | |
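The greedy layer-wise procedure of the previous paragraph can be summarized as follows; `train_dae` is a hypothetical helper (e.g. repeated calls to an update like the one sketched above) and `sigm` is the logistic sigmoid.

```python
# Sketch of greedy layer-wise unsupervised pre-training of the stack.
import numpy as np
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def pretrain_sda(X, layer_sizes, corruption=0.2, pretrain_lr=0.01):
    """Return one (V, c) pair per hidden layer, later used to initialize the deep MLP.
    `train_dae` (returning V, c, d for one layer) is assumed, not the paper's API."""
    layers, data = [], X
    for n_hidden in layer_sizes:                       # e.g. 3 hidden layers
        V, c, d = train_dae(data, n_hidden, corruption, pretrain_lr)
        layers.append((V, c))                          # d is only needed for reconstruction
        data = sigm(c + data @ V.T)                    # uncorrupted data mapped through the
    return layers                                      # new layer feeds the next auto-encoder
```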
498 After this unsupervised pre-training stage, the parameters | 545 After this unsupervised pre-training stage, the parameters |
499 are used to initialize a deep MLP, which is fine-tuned by | 546 are used to initialize a deep MLP (similar to the above, but |
500 the same standard procedure used to train them (see above). | 547 with more layers), which is fine-tuned by |
548 the same standard procedure (stochastic gradient descent) | |
549 used to train MLPs in general (see above). | |
550 The top layer parameters of the deep MLP (the layer that outputs the | |
551 class probabilities and takes the top hidden layer as input) can | |
552 be initialized to 0. | |
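As a sketch of this initialization step (all names here are illustrative, not the paper's code): the pre-trained weights become the hidden layers of the deep MLP, and the new top softmax layer starts at zero before supervised fine-tuning with the same SGD procedure as the plain MLP.

```python
# Illustrative initialization of the deep MLP from the pre-trained stack.
import numpy as np

def init_deep_mlp(pretrained_layers, n_classes):
    params = {}
    for k, (V, c) in enumerate(pretrained_layers):     # hidden layers from the DAEs
        params["W%d" % (k + 1)] = V.copy()
        params["b%d" % (k + 1)] = c.copy()
    n_top = pretrained_layers[-1][0].shape[0]          # size of the top hidden layer
    params["W_out"] = np.zeros((n_classes, n_top))     # top layer initialized to 0
    params["b_out"] = np.zeros(n_classes)
    return params                                      # then fine-tune with SGD as above
```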
501 The SDA hyper-parameters are the same as for the MLP, with the addition of the | 553 The SDA hyper-parameters are the same as for the MLP, with the addition of the |
502 amount of corruption noise (we used the masking noise process, whereby a | 554 amount of corruption noise (we used the masking noise process, whereby a |
503 fixed proportion of the input values, randomly selected, are zeroed), and a | 555 fixed proportion of the input values, randomly selected, are zeroed), and a |
504 separate learning rate for the unsupervised pre-training stage (selected | 556 separate learning rate for the unsupervised pre-training stage (selected |
505 from the same set as above). The fraction of inputs corrupted was selected | 557 from the same set as above). The fraction of inputs corrupted was selected |
506 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 558 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
507 of hidden layers but it was fixed to 3 for most experiments, | 559 of hidden layers but it was fixed to 3 for our experiments, |
508 based on previous work with | 560 based on previous work with |
509 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. | 561 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. |
510 We also compared against 1 and 2 hidden layers, in order | 562 We also compared against 1 and 2 hidden layers, in order |
511 to disentangle the effect of depth from the effect of unsupervised | 563 to disentangle the effect of depth from the effect of unsupervised |
512 pre-training. | 564 pre-training. |
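For reference, the candidate values listed in this section amount to the following search space; the exhaustive grid search shown here, scored by NISTP validation error, is an assumed illustration of "selected based on the validation set error", not a description of the exact search procedure used.

```python
# Assumed illustration of hyper-parameter selection on the NISTP validation set.
from itertools import product

grid = {
    "n_hidden":      [300, 500, 800, 1000, 1500],
    "learning_rate": [0.001, 0.01, 0.025, 0.075, 0.1, 0.5],
    "corruption":    [0.10, 0.20, 0.50],                     # SDA only
    "pretrain_lr":   [0.001, 0.01, 0.025, 0.075, 0.1, 0.5],  # SDA only
}

def select_hyperparams(nistp_validation_error, grid):
    """nistp_validation_error(settings) -> error rate (hypothetical callback)."""
    keys = list(grid)
    candidates = (dict(zip(keys, combo)) for combo in product(*grid.values()))
    return min(candidates, key=nistp_validation_error)
```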