diff writeup/aistats2011_revised.tex @ 623:d44c78c90669

entered revisions for AMT and SVMs
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 09 Jan 2011 22:00:39 -0500
parents 09b7dee216f4
children 49933073590c
--- a/writeup/aistats2011_revised.tex	Sun Jan 09 21:47:28 2011 -0500
+++ b/writeup/aistats2011_revised.tex	Sun Jan 09 22:00:39 2011 -0500
@@ -295,6 +295,15 @@
 because each image was classified by 3 different persons. 
 The average error of humans on the 62-class task NIST test set
 is 18.2\%, with a standard error of 0.1\%.
+We controlled noise in the labelling process by (1)
+requiring AMT workers to have a higher than normal average of accepted
+responses ($>95$\%) on other tasks, (2) discarding responses that were not
+complete (10 predictions), (3) discarding responses for which the
+time to predict was shorter than 3 seconds for NIST (the mean response time
+was 20 seconds) and 6 seconds for NISTP (average response time of
+45 seconds), and (4) discarding responses which were obviously wrong (10
+identical ones, or ``12345...''). Overall, after such filtering, we kept
+approximately 95\% of the AMT workers' responses.
 
 %\vspace*{-3mm}
 \subsection{Data Sources}
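For concreteness, the four filtering rules described in the paragraph added above amount to something like the sketch below. This is only an illustrative Python sketch, not the post-processing script actually used for the study; the response representation and field names (approval_rate, answers, seconds) are assumptions made for the example, and only the numeric thresholds come from the text.

    # Illustrative sketch of the AMT response filtering (hypothetical fields).
    MIN_APPROVAL = 0.95                          # rule (1): worker acceptance rate > 95%
    EXPECTED_ANSWERS = 10                        # rule (2): a complete response has 10 predictions
    MIN_SECONDS = {'NIST': 3.0, 'NISTP': 6.0}    # rule (3): minimum plausible response time

    def keep_response(response, dataset):
        answers = response['answers']
        if response['approval_rate'] <= MIN_APPROVAL:          # rule (1)
            return False
        if len(answers) != EXPECTED_ANSWERS:                   # rule (2)
            return False
        if response['seconds'] < MIN_SECONDS[dataset]:         # rule (3)
            return False
        if len(set(answers)) == 1:                             # rule (4): 10 identical labels
            return False
        digits = [int(a) for a in answers if str(a).isdigit()]
        if len(digits) == EXPECTED_ANSWERS and \
           all(b == a + 1 for a, b in zip(digits, digits[1:])):
            return False                                       # rule (4): "12345..."-style fill-ins
        return True

    # e.g. kept = [r for r in responses if keep_response(r, 'NIST')]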
@@ -414,20 +423,28 @@
 hidden layer) and deep SDAs.
 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
 
-{\bf Multi-Layer Perceptrons (MLP).}
-Whereas previous work had compared deep architectures to both shallow MLPs and
-SVMs, we only compared to MLPs here because of the very large datasets used
-(making the use of SVMs computationally challenging because of their quadratic
-scaling behavior). Preliminary experiments on training SVMs (libSVM) with subsets of the training
-set allowing the program to fit in memory yielded substantially worse results
-than those obtained with MLPs. For training on nearly a hundred million examples
-(with the perturbed data), the MLPs and SDA are much more convenient than
-classifiers based on kernel methods.
-The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
-exponentials) on the output layer for estimating $P(class | image)$.
-The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 
-Training examples are presented in minibatches of size 20. A constant learning
-rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
+{\bf Multi-Layer Perceptrons (MLP).}  Whereas previous work had compared
+deep architectures to both shallow MLPs and SVMs, we only compared to MLPs
+here because of the very large datasets used (making the use of SVMs
+computationally challenging because of their quadratic scaling
+behavior). Preliminary experiments on training SVMs (libSVM) with subsets
+of the training set small enough to fit in memory yielded substantially
+worse results than those obtained with MLPs.\footnote{RBF SVMs
+  trained on a 100k-example subset of NIST or NISTP, so as to fit in
+  memory, yielded 64\% test error or worse; online linear SVMs trained on
+  the whole of NIST or on 800k examples from NISTP yielded no better than
+  42\% error; slightly better results (41\% error) were obtained by
+  sparsifying the pixel intensities and projecting them to a second-order
+  polynomial (a very sparse vector). We expect that better results could
+  be obtained with an implementation allowing training on more examples
+  and a higher-order non-linear projection.}  For training on nearly a hundred million examples (with the
+perturbed data), MLPs and SDAs are much more convenient than classifiers
+based on kernel methods.  The MLP has a single hidden layer with $\tanh$
+activation functions, and softmax (normalized exponentials) on the output
+layer for estimating $P(class | image)$.  The number of hidden units is
+chosen from $\{300,500,800,1000,1500\}$.  Training examples are presented in
+minibatches of size 20. A constant learning rate was chosen among $\{0.001,
+0.01, 0.025, 0.075, 0.1, 0.5\}$.
 %through preliminary experiments (measuring performance on a validation set),
 %and $0.1$ (which was found to work best) was then selected for optimizing on
 %the whole training sets.
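The one-hidden-layer MLP described in the revised paragraph (tanh hidden units, softmax output estimating P(class | image), minibatches of 20, constant learning rate) can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation used for the experiments; the input size is an assumption, and the hidden-layer size and learning rate shown are single points of the grids quoted in the text.

    # Minimal NumPy sketch of the MLP baseline described above.
    import numpy as np

    rng = np.random.RandomState(0)
    n_in = 32 * 32          # input size is an assumption (e.g. 32x32 character images)
    n_hidden = 500          # one point of the grid {300, 500, 800, 1000, 1500}
    n_classes = 62          # 62-class character task
    lr, batch_size = 0.1, 20   # one point of the learning-rate grid; minibatch size from the text

    W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

    def forward(X):
        H = np.tanh(X @ W1 + b1)                       # tanh hidden layer
        logits = H @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)              # softmax: estimates P(class | image)
        return H, P

    def sgd_step(X, y):
        """One constant-learning-rate update on a minibatch (X, y)."""
        global W1, b1, W2, b2
        H, P = forward(X)
        dlogits = P.copy()
        dlogits[np.arange(len(y)), y] -= 1.0           # gradient of the NLL w.r.t. the logits
        dlogits /= len(y)
        dH = (dlogits @ W2.T) * (1.0 - H ** 2)         # back-prop through tanh
        W2 -= lr * (H.T @ dlogits); b2 -= lr * dlogits.sum(axis=0)
        W1 -= lr * (X.T @ dH);      b1 -= lr * dH.sum(axis=0)

    # A training loop would call sgd_step on successive minibatches of
    # `batch_size` examples drawn from the (possibly perturbed) training set.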
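The footnote's preprocessing for the linear SVM baseline (sparsify the pixel intensities, then project to a second-order polynomial, giving a very sparse vector) could look roughly like the sketch below. The intensity threshold and the feature-indexing scheme are assumptions for illustration, not the setup actually used; the resulting sparse vectors would then be fed to an online linear SVM.

    # Illustrative sketch of the sparse second-order feature map from the footnote.
    import numpy as np
    from scipy.sparse import csr_matrix

    def sparse_second_order(image, threshold=0.1):
        """Keep only 'ink' pixels, then add all pairwise products of the
        surviving intensities; the result is a very sparse feature vector."""
        x = np.asarray(image, dtype=np.float64).ravel()
        d = x.size
        idx = np.flatnonzero(x > threshold)            # sparsified first-order terms
        cols = list(idx)
        data = list(x[idx])
        for a in range(len(idx)):                      # second-order terms x_i * x_j, i <= j
            for b in range(a, len(idx)):
                cols.append(d + idx[a] * d + idx[b])
                data.append(x[idx[a]] * x[idx[b]])
        rows = np.zeros(len(cols), dtype=int)
        return csr_matrix((data, (rows, cols)), shape=(1, d + d * d))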