# HG changeset patch
# User Yoshua Bengio
# Date 1274970566 21600
# Node ID 5ead24fd4d4988eba6b51b2c2d9c2c6b14be6010
# Parent  c0f738f0cef054428423a138c3230a6b521849a9
# Parent  78ed4628071d8b225099f8dfe3eed3e5930cdb73
merge

diff -r 78ed4628071d -r 5ead24fd4d49 writeup/techreport.tex
--- a/writeup/techreport.tex	Wed May 26 20:25:39 2010 -0400
+++ b/writeup/techreport.tex	Thu May 27 08:29:26 2010 -0600
@@ -372,23 +372,33 @@
 to estimate inter-human variability (shown as +/- in parenthesis below).
 
 \begin{table}
-\caption{Overall comparison of error rates on 62 character classes (10 digits +
+\caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits +
 26 lower + 26 upper), except for last columns -- digits only, between deep architecture with pre-training
 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture 
-(MLP=Multi-Layer Perceptron). }
+(MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07)
+and using a validation set to select hyper-parameters and other training choices. 
+\{SDA,MLP\}0 are trained on NIST,
+\{SDA,MLP\}1 are trained on NISTP, and \{SDA,MLP\}2 are trained on P07.
+The human error rate on digits is a lower bound because it does not count digits that were
+recognized as letters.}
 \label{tab:sda-vs-mlp-vs-humans}
 \begin{center}
 \begin{tabular}{|l|r|r|r|r|} \hline
-      & NIST test & NISTP test & P07 test & NIST test digits   \\ \hline
-Humans&           &           &          &                     \\ \hline
-SDA   &           &           &          &                     \\ \hline
-MLP   &           &           &          &                     \\ \hline
+      & NIST test          & NISTP test        & P07 test           & NIST test digits   \\ \hline
+Humans& 18.2\% $\pm$.1\%   & 39.4\%$\pm$.1\%   & 46.9\%$\pm$.1\%    & $>1.1\%$           \\ \hline
+SDA0  & 23.7\% $\pm$.14\%  & 65.2\%$\pm$.34\%  & 97.45\%$\pm$.06\%  & 2.7\% $\pm$.14\%   \\ \hline
+SDA1  & 17.1\% $\pm$.13\%  & 29.7\%$\pm$.3\%   & 29.7\%$\pm$.3\%    & 1.4\% $\pm$.1\%    \\ \hline
+SDA2  & 18.7\% $\pm$.13\%  & 33.6\%$\pm$.3\%   & 39.9\%$\pm$.17\%   & 1.7\% $\pm$.1\%    \\ \hline
+MLP0  & 24.2\% $\pm$.15\%  & \%$\pm$.35\%      & \%$\pm$.1\%        & 3.45\% $\pm$.16\%  \\ \hline
+MLP1  & 23.0\% $\pm$.15\%  & 41.8\%$\pm$.35\%  & 90.4\%$\pm$.1\%    & 3.85\% $\pm$.16\%  \\ \hline
+MLP2  & ?\% $\pm$.15\%     & ?\%$\pm$.35\%     & 90.4\%$\pm$.1\%    & 3.85\% $\pm$.16\%  \\ \hline
 \end{tabular}
 \end{center}
 \end{table}
 
 \subsection{Perturbed Training Data More Helpful for SDAE}
 
+
 \subsection{Training with More Classes than Necessary}
 
 As previously seen, the SDA is better able to benefit from the transformations applied to the data than the MLP. We are now training SDAs and MLPs on single classes from NIST (respectively digits, lower case characters and upper case characters), to compare the test results with those from models trained on the entire NIST database (per-class test error, with an a priori on the desired class). The goal is to find out if training the model with more classes than necessary reduces the test error on a single class, as opposed to training it only with the desired class. We use a single hidden layer MLP with 1000 hidden units, and a SDA with 3 hidden layers (1000 hidden units per layer), pre-trained and fine-tuned on NIST.
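
To make the setup described in the last paragraph concrete, the Python/numpy sketch below mirrors the two model families compared in the table: a single-hidden-layer MLP with 1000 hidden units, and a Stacked Denoising Autoencoder with three hidden layers of 1000 units, greedily pre-trained layer by layer on corrupted inputs and then fine-tuned with a softmax output. Only the layer sizes and the pre-train/fine-tune schedule come from the text; the 32x32 input size, masking-noise corruption level, learning rates, and iteration counts are illustrative placeholders, and this is not the implementation behind the reported results.

# Hypothetical sketch, NOT the authors' code: it only reflects the layer sizes
# stated in the text (MLP: one 1000-unit hidden layer; SDA: three 1000-unit
# hidden layers, denoising pre-training then supervised fine-tuning).
# Input size, corruption level, learning rates and epoch counts are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class Layer:
    """One sigmoid layer; doubles as a tied-weight denoising autoencoder for pre-training."""
    def __init__(self, n_in, n_out):
        self.W = rng.normal(0.0, 0.01, (n_in, n_out))
        self.b = np.zeros(n_out)      # hidden bias
        self.b_vis = np.zeros(n_in)   # reconstruction bias, used only during pre-training

    def forward(self, x):
        return sigmoid(x @ self.W + self.b)

    def pretrain_step(self, x, corruption=0.25, lr=0.01):
        """One denoising update: corrupt the input, encode, decode, reduce squared error."""
        x_tilde = x * (rng.random(x.shape) > corruption)   # masking noise
        h = self.forward(x_tilde)
        z = sigmoid(h @ self.W.T + self.b_vis)              # reconstruction of the clean x
        dz = (z - x) * z * (1 - z)
        dh = (dz @ self.W) * h * (1 - h)
        self.W -= lr * (x_tilde.T @ dh + dz.T @ h)          # tied-weight gradient
        self.b -= lr * dh.sum(axis=0)
        self.b_vis -= lr * dz.sum(axis=0)

class Classifier:
    """Stack of sigmoid layers plus a softmax output, trained by plain SGD back-prop."""
    def __init__(self, layers, n_classes):
        self.layers = layers
        self.V = rng.normal(0.0, 0.01, (layers[-1].W.shape[1], n_classes))
        self.c = np.zeros(n_classes)

    def forward(self, x):
        acts = [x]
        for layer in self.layers:
            acts.append(layer.forward(acts[-1]))
        return acts, softmax(acts[-1] @ self.V + self.c)

    def sgd_step(self, x, y_onehot, lr=0.01):
        """One supervised update (plain training for the MLP, fine-tuning for the SDA)."""
        acts, p = self.forward(x)
        dout = p - y_onehot                         # softmax / cross-entropy gradient
        self.V -= lr * acts[-1].T @ dout
        self.c -= lr * dout.sum(axis=0)
        delta = dout @ self.V.T
        for layer, a_in, a_out in zip(reversed(self.layers), acts[-2::-1], acts[:0:-1]):
            delta = delta * a_out * (1 - a_out)     # back through the sigmoid
            grad_W, grad_b = a_in.T @ delta, delta.sum(axis=0)
            delta = delta @ layer.W.T               # propagate before updating the weights
            layer.W -= lr * grad_W
            layer.b -= lr * grad_b

n_in, n_classes = 32 * 32, 62                       # assumed 32x32 inputs, 62 character classes

mlp = Classifier([Layer(n_in, 1000)], n_classes)    # shallow baseline (MLP rows of the table)
sda_layers = [Layer(n_in, 1000), Layer(1000, 1000), Layer(1000, 1000)]

x = rng.random((16, n_in))                          # stand-in minibatch; real runs stream NIST/NISTP/P07
y = np.eye(n_classes)[rng.integers(0, n_classes, 16)]

h = x
for layer in sda_layers:                            # greedy layer-wise denoising pre-training
    for _ in range(10):
        layer.pretrain_step(h)
    h = layer.forward(h)

sda = Classifier(sda_layers, n_classes)             # deep model (SDA rows of the table)
for _ in range(10):
    sda.sgd_step(x, y)
    mlp.sgd_step(x, y)

The tied-weight denoising step and plain SGD fine-tuning are only meant to show where unsupervised pre-training enters the pipeline; in the experiments above, hyper-parameters and other training choices were selected on a validation set, as the table caption notes.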