# HG changeset patch
# User Yoshua Bengio
# Date 1275393338 14400
# Node ID a194ce5a42494f17a75bb1e37e2233a704a91ad7
# Parent 19eab4daf212e2e0d394696d90a698b495c9f414
difference stat. sign.

diff -r 19eab4daf212 -r a194ce5a4249 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex Mon May 31 22:15:44 2010 -0400
+++ b/writeup/nips2010_submission.tex Tue Jun 01 07:55:38 2010 -0400
@@ -488,31 +488,37 @@
 networks ~\citep{Granger+al-2007}, fast nearest-neighbor search
 ~\citep{Cortes+al-2000}, MLPs ~\citep{Oliveira+al-2002}, and SVMs
 ~\citep{Milgram+al-2005}. More detailed and complete numerical results
-(figures and tables) can be found in the appendix. The 3 kinds of model
-differ in the training sets used: NIST only (MLP0,SDA0), NISTP (MLP1,
-SDA1), or P07 (MLP2, SDA2). The deep learner not only outperformed the
-shallow ones and previously published performance but reaches human
-performance on both the 62-class task and the 10-class (digits) task. In
-addition, as shown in the left of Figure~\ref{fig:fig:improvements-charts},
-the relative improvement in error rate brought by self-taught learning is
-greater for the SDA. The left side shows the improvement to the clean NIST
-test set error brought by the use of out-of-distribution examples (i.e. the
-perturbed examples examples from NISTP or P07). The right side of
+(figures and tables, including standard errors on the error rates) can be
+found in the supplementary material. The 3 kinds of model differ in the
+training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07
+(MLP2, SDA2). The deep learner not only outperformed the shallow ones and
+previously published performance (in a statistically and qualitatively
+significant way) but also reaches human performance on both the 62-class
+task and the 10-class (digits) task. In addition, as shown in the left of
+Figure~\ref{fig:fig:improvements-charts}, the relative improvement in error
+rate brought by self-taught learning is greater for the SDA, and these
+differences with the MLP are statistically and qualitatively
+significant. The left side of the figure shows the improvement to the clean
+NIST test set error brought by the use of out-of-distribution examples
+(i.e. the perturbed examples from NISTP or P07). The right side of
 Figure~\ref{fig:fig:improvements-charts} shows the relative improvement
 brought by the use of a multi-task setting, in which the same model is
 trained for more classes than the target classes of interest (i.e. training
 with all 62 classes when the target classes are respectively the digits,
-lower-case, or upper-case characters). Again, whereas the gain is marginal
-or negative for the MLP, it is substantial for the SDA. Note that for
-these multi-task experiment, only the original NIST dataset is used. For
-example, the MLP-digits bar shows the relative improvement in MLP error
-rate on the NIST digits test set (1 - single-task model's error /
-multi-task model's error). The single-task model is trained with only 10
-outputs (one per digit), seeing only digit examples, whereas the multi-task
-model is trained with 62 outputs, with all 62 character classes as
-examples. For the multi-task model, the digit error rate is measured by
-comparing the correct digit class with the output class associated with
-the maximum conditional probability among only the digit classes outputs.
+lower-case, or upper-case characters). Again, whereas the gain from the
+multi-task setting is marginal or negative for the MLP, it is substantial
+for the SDA. Note that for these multi-task experiments, only the original
+NIST dataset is used. For example, the MLP-digits bar shows the relative
+improvement in MLP error rate on the NIST digits test set (1 - single-task
+model's error / multi-task model's error). The single-task model is
+trained with only 10 outputs (one per digit), seeing only digit examples,
+whereas the multi-task model is trained with 62 outputs, with all 62
+character classes as examples. Hence the hidden units are shared across
+all tasks. For the multi-task model, the digit error rate is measured by
+comparing the correct digit class with the output class associated with the
+maximum conditional probability among only the digit class outputs. The
+setting is similar for the other two target classes (lower-case and
+upper-case characters).
 
 \begin{figure}[h]
 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
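As a rough illustration of the digit error-rate measurement described in the added text, here is a minimal sketch in Python/NumPy. It is not taken from the paper's code base; the array names and the assumption that the digit classes occupy the first 10 of the 62 outputs are made up for the example. It simply restricts the argmax of the multi-task model's conditional probabilities to the digit outputs:

import numpy as np

# Illustrative assumption: outputs 0..9 of the 62-way multi-task model are the digit classes.
DIGIT_OUTPUTS = np.arange(10)

def digit_error_rate(probs_62, digit_labels):
    # probs_62: (n_examples, 62) conditional class probabilities from the multi-task model
    # digit_labels: (n_examples,) correct digit classes (0..9), digit test examples only
    digit_probs = probs_62[:, DIGIT_OUTPUTS]       # keep only the digit-class outputs
    predictions = np.argmax(digit_probs, axis=1)   # most probable class among the digits
    return float(np.mean(predictions != digit_labels))

The single-task baseline is the same computation applied to a 10-output model, and the relative-improvement bars in the figure are then formed from these two error rates as described in the text.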