ift6266: comparison of writeup/aistats2011_cameraready.tex @ 637:fe98896745a5
fitting
author | Yoshua Bengio <bengioy@iro.umontreal.ca>
date | Sat, 19 Mar 2011 23:07:03 -0400
parents | 83d53ffe3f25
children | 677d1b1d8158
636:83d53ffe3f25 | 637:fe98896745a5
523 where ${\rm sigm}(a)=1/(1+\exp(-a))$ | 523 where ${\rm sigm}(a)=1/(1+\exp(-a))$ |
524 and the reconstruction is obtained through the same transformation | 524 and the reconstruction is obtained through the same transformation |
525 \[ | 525 \[ |
526 z={\rm sigm}(d+V' y) | 526 z={\rm sigm}(d+V' y) |
527 \] | 527 \] |
528 but using the transpose of the encoder weights. | 528 using the transpose of encoder weights. |
529 We minimize the training | 529 The training |
530 set average of the cross-entropy | 530 set average of the cross-entropy |
531 reconstruction error | 531 reconstruction loss |
532 \[ | 532 \[ |
533 L_H(x,z)=\sum_i z_i \log x_i + (1-z_i) \log(1-x_i). | 533 L_H(x,z)=\sum_i z_i \log x_i + (1-z_i) \log(1-x_i) |
534 \] | 534 \] |
 | 535 is minimized. |
535 Here we use the random binary masking corruption | 536 Here we use the random binary masking corruption |
536 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and | 537 (which in $\tilde{x}$ sets to 0 a random subset of the elements of $x$, and |
537 copies the rest). | 538 copies the rest). |
538 Once the first denoising auto-encoder is trained, its parameters can be used | 539 Once the first denoising auto-encoder is trained, its parameters can be used |
539 to set the first layer of the deep MLP. The original data are then processed | 540 to set the first layer of the deep MLP. The original data are then processed |
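Below this hunk, purely as an illustration and not code from the repository, is a minimal NumPy sketch of the denoising auto-encoder layer described above: masking corruption, a sigmoid encoder, a decoder tied to the transposed encoder weights, and an average cross-entropy reconstruction loss (written here in its usual binary form). All names, sizes, and the 20% corruption value are illustrative.

    import numpy as np

    rng = np.random.RandomState(0)

    def sigm(a):
        return 1.0 / (1.0 + np.exp(-a))

    def mask_corrupt(x, fraction, rng):
        """Set a random subset (`fraction`) of the elements of x to 0, copy the rest."""
        keep = rng.binomial(n=1, p=1.0 - fraction, size=x.shape)
        return x * keep

    def dae_forward(x, W, c, d, fraction, rng):
        x_tilde = mask_corrupt(x, fraction, rng)   # corrupted input \tilde{x}
        y = sigm(c + x_tilde.dot(W))               # hidden code
        z = sigm(d + y.dot(W.T))                   # reconstruction through the transposed weights
        eps = 1e-7                                 # numerical safety for the logs
        # average binary cross-entropy between the clean input and its reconstruction
        loss = -np.mean(np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps), axis=1))
        return y, z, loss

    # illustrative sizes: 784 inputs, 1000 hidden units, 20% masking corruption
    n_in, n_hid = 784, 1000
    W = rng.uniform(-0.01, 0.01, size=(n_in, n_hid))
    c, d = np.zeros(n_hid), np.zeros(n_in)
    x = rng.uniform(0.0, 1.0, size=(32, n_in))     # a small batch of inputs in [0, 1]
    y, z, loss = dae_forward(x, W, c, d, fraction=0.2, rng=rng)

    # After gradient-based training, W and c would initialize the first layer of the
    # deep MLP, and the codes y would feed the next auto-encoder in the stack.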
556 from the same above set). The fraction of inputs corrupted was selected | 557 from the same above set). The fraction of inputs corrupted was selected |
557 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 558 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
558 of hidden layers but it was fixed to 3 for our experiments, | 559 of hidden layers but it was fixed to 3 for our experiments, |
559 based on previous work with | 560 based on previous work with |
560 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. | 561 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. |
561 We also compared against 1 and against 2 hidden layers, in order | 562 We also compared against 1 and against 2 hidden layers, |
562 to disentangle the effect of depth from the effect of unsupervised | 563 to disentangle the effect of depth from that of unsupervised |
563 pre-training. | 564 pre-training. |
564 The size of the hidden | 565 The size of each hidden |
565 layers was kept constant across hidden layers, and the best results | 566 layer was kept constant across hidden layers, and the best results |
566 were obtained with the largest values that we could experiment | 567 were obtained with the largest values that we tried |
567 with given our patience, with 1000 hidden units. | 568 (1000 hidden units). |
568 | 569 |
569 %\vspace*{-1mm} | 570 %\vspace*{-1mm} |
570 | 571 |
571 \begin{figure*}[ht] | 572 \begin{figure*}[ht] |
572 %\vspace*{-2mm} | 573 %\vspace*{-2mm} |
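As a quick reference, the hyper-parameter choices listed in the second hunk can be summarized in the same illustrative style (hypothetical variable names, values taken from the text):

    corruption_fractions = [0.10, 0.20, 0.50]  # candidate fractions of masked inputs
    n_hidden_layers = 3                        # fixed, following prior SDA work on MNIST
    hidden_layer_size = 1000                   # same size for every hidden layer; largest value tried
    shallower_baselines = [1, 2]               # hidden-layer counts trained to isolate the effect of depth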