comparison writeup/aistats2011_cameraready.tex @ 637:fe98896745a5

fitting
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 19 Mar 2011 23:07:03 -0400
parents 83d53ffe3f25
children 677d1b1d8158
523 where ${\rm sigm}(a)=1/(1+\exp(-a))$ 523 where ${\rm sigm}(a)=1/(1+\exp(-a))$
524 and the reconstruction is obtained through the same transformation 524 and the reconstruction is obtained through the same transformation
525 \[ 525 \[
526 z={\rm sigm}(d+V' y) 526 z={\rm sigm}(d+V' y)
527 \] 527 \]
528 but using the transpose of the encoder weights. 528 using the transpose of the encoder weights.
529 We minimize the training 529 The training
530 set average of the cross-entropy 530 set average of the cross-entropy
531 reconstruction error 531 reconstruction loss
532 \[ 532 \[
533 L_H(x,z)=-\sum_i \left[ x_i \log z_i + (1-x_i) \log(1-z_i) \right]. 533 L_H(x,z)=-\sum_i \left[ x_i \log z_i + (1-x_i) \log(1-z_i) \right]
534 \] 534 \]
535 is minimized.
535 Here we use the random binary masking corruption 536 Here we use the random binary masking corruption
536 (which forms $\tilde{x}$ by setting a random subset of the elements of $x$ to 0 537 (which forms $\tilde{x}$ by setting a random subset of the elements of $x$ to 0
537 and copying the rest). 538 and copying the rest).
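For concreteness, the corruption, encoding, decoding, and loss just described can be sketched in a few lines of NumPy. This is our own illustrative sketch, not the authors' code: it assumes the encoder has the form $y={\rm sigm}(c+V\tilde{x})$ with a hidden bias $c$, and the names `dae_step`, `mask_fraction`, and `lr` are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, V, c, d, mask_fraction=0.2, lr=0.01):
    # Binary masking corruption: set a random subset of the inputs to 0
    # and copy the rest.
    x_tilde = x * (rng.random(x.shape) >= mask_fraction)
    y = sigm(c + V @ x_tilde)   # encode the corrupted input
    z = sigm(d + V.T @ y)       # decode with the transposed encoder weights
    # Cross-entropy reconstruction loss against the uncorrupted input x.
    eps = 1e-12
    loss = -np.sum(x * np.log(z + eps) + (1.0 - x) * np.log(1.0 - z + eps))
    # One SGD step; V receives gradient terms from both the encoder and
    # the (tied) decoder.
    dz = z - x
    dy = (V @ dz) * y * (1.0 - y)
    V -= lr * (np.outer(dy, x_tilde) + np.outer(y, dz))
    c -= lr * dy
    d -= lr * dz
    return loss

Note that the loss is measured against the uncorrupted $x$ even though the encoder sees $\tilde{x}$; this is what makes the auto-encoder denoising.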
538 Once the first denoising auto-encoder is trained, its parameters can be used 539 Once the first denoising auto-encoder is trained, its parameters can be used
539 to set the first layer of the deep MLP. The original data are then processed 540 to set the first layer of the deep MLP. The original data are then processed
556 from the same set as above). The fraction of corrupted inputs was selected 557 from the same set as above). The fraction of corrupted inputs was selected
557 from $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number 558 from $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
558 of hidden layers, but it was fixed to 3 for our experiments, 559 of hidden layers, but it was fixed to 3 for our experiments,
559 based on previous work with 560 based on previous work with
560 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. 561 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
561 We also compared against 1 and against 2 hidden layers, in order 562 We also compared against 1 and against 2 hidden layers,
562 to disentangle the effect of depth from the effect of unsupervised 563 to disentangle the effect of depth from that of unsupervised
563 pre-training. 564 pre-training.
564 The size of the hidden 565 The size of each hidden
565 layers was kept constant across hidden layers, and the best results 566 layer was kept the same across layers, and the best results
566 were obtained with the largest values that we could experiment 567 were obtained with the largest value we tried
567 with given our patience, with 1000 hidden units. 568 (1000 hidden units).
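The stacking procedure and the hyper-parameters above can likewise be sketched; again this is our own illustration under stated assumptions, not the paper's code. `train_dae` is a hypothetical batch wrapper around the single-example step sketched earlier, the depth and layer size follow the values reported in the text (3 hidden layers of 1000 units each), and `mask_fraction` would be selected from $\{10\%, 20\%, 50\%\}$ on validation data.

import numpy as np

rng = np.random.default_rng(0)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(H, n_hidden, mask_fraction, lr=0.01, n_epochs=5):
    # Train one tied-weight denoising auto-encoder on the rows of H.
    n_vis = H.shape[1]
    V = rng.normal(0.0, 0.01, size=(n_hidden, n_vis))
    c, d = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(n_epochs):
        for x in H:
            x_t = x * (rng.random(n_vis) >= mask_fraction)  # masking corruption
            y = sigm(c + V @ x_t)
            z = sigm(d + V.T @ y)
            dz = z - x                   # cross-entropy gradient at the output
            dy = (V @ dz) * y * (1.0 - y)
            V -= lr * (np.outer(dy, x_t) + np.outer(y, dz))  # tied weights
            c -= lr * dy
            d -= lr * dz
    return V, c

def pretrain_stack(X, n_layers=3, n_hidden=1000, mask_fraction=0.2):
    # Greedy layer-wise pre-training: each trained encoder initializes one
    # hidden layer of the deep MLP, and the uncorrupted data are mapped
    # through it to provide the training inputs for the next layer.
    params, H = [], X
    for _ in range(n_layers):
        V, c = train_dae(H, n_hidden, mask_fraction)
        params.append((V, c))
        H = sigm(c + H @ V.T)
    return params

The returned $(V, c)$ pairs set the weights and biases of the MLP's hidden layers, just as the first auto-encoder's parameters set its first layer above.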
568 569
569 %\vspace*{-1mm} 570 %\vspace*{-1mm}
570 571
571 \begin{figure*}[ht] 572 \begin{figure*}[ht]
572 %\vspace*{-2mm} 573 %\vspace*{-2mm}