changeset 623:d44c78c90669

entered revisions for AMT and SVMs
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 09 Jan 2011 22:00:39 -0500
parents 09b7dee216f4
children 49933073590c
files writeup/aistats2011_revised.tex writeup/aistats_review_response.txt
diffstat 2 files changed, 35 insertions(+), 16 deletions(-)
--- a/writeup/aistats2011_revised.tex	Sun Jan 09 21:47:28 2011 -0500
+++ b/writeup/aistats2011_revised.tex	Sun Jan 09 22:00:39 2011 -0500
@@ -295,6 +295,15 @@
 because each image was classified by 3 different persons. 
 The average error of humans on the 62-class task NIST test set
 is 18.2\%, with a standard error of 0.1\%.
+We controlled noise in the labelling process by (1)
+requiring AMT workers to have a higher than normal average of accepted
+responses (>95\%) on other tasks, (2) discarding responses that were not
+complete (10 predictions), (3) discarding responses for which the
+time to predict was smaller than 3 seconds for NIST (the mean response time
+was 20 seconds) and 6 seconds for NISTP (average response time of
+45 seconds), and (4) discarding responses which were obviously wrong (10
+identical ones, or "12345..."). Overall, after such filtering, we kept
+approximately 95\% of the AMT workers' responses.
 
 %\vspace*{-3mm}
 \subsection{Data Sources}
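
For reference, a minimal sketch of the four filtering rules added above, with hypothetical field names for the collected responses (the paper does not show the actual pipeline, and the exact "12345..." pattern check is an assumption):

MIN_APPROVAL_RATE = 0.95                         # rule (1): worker quality
N_PREDICTIONS = 10                               # rule (2): a complete response
MIN_SECONDS = {"NIST": 3.0, "NISTP": 6.0}        # rule (3): per-dataset floor

def keep_response(resp):
    """resp is a dict with hypothetical keys: 'approval_rate' (the worker's
    acceptance rate on other AMT tasks), 'predictions' (list of predicted
    labels), 'seconds' (response time), and 'dataset' ('NIST' or 'NISTP')."""
    preds = [str(p) for p in resp["predictions"]]
    if resp["approval_rate"] <= MIN_APPROVAL_RATE:        # rule (1)
        return False
    if len(preds) != N_PREDICTIONS:                       # rule (2): incomplete
        return False
    if resp["seconds"] < MIN_SECONDS[resp["dataset"]]:    # rule (3): too fast
        return False
    if len(set(preds)) == 1:                              # rule (4): 10 identical
        return False
    if "".join(preds) in "12345678901234567890":          # rule (4): consecutive
        return False                                      # digits, "12345..."
    return True

# kept = [r for r in responses if keep_response(r)]   # retains ~95% of responses
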
@@ -414,20 +423,28 @@
 hidden layer) and deep SDAs.
 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
 
-{\bf Multi-Layer Perceptrons (MLP).}
-Whereas previous work had compared deep architectures to both shallow MLPs and
-SVMs, we only compared to MLPs here because of the very large datasets used
-(making the use of SVMs computationally challenging because of their quadratic
-scaling behavior). Preliminary experiments on training SVMs (libSVM) with subsets of the training
-set allowing the program to fit in memory yielded substantially worse results
-than those obtained with MLPs. For training on nearly a hundred million examples
-(with the perturbed data), the MLPs and SDA are much more convenient than
-classifiers based on kernel methods.
-The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
-exponentials) on the output layer for estimating $P(class | image)$.
-The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 
-Training examples are presented in minibatches of size 20. A constant learning
-rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
+{\bf Multi-Layer Perceptrons (MLP).}  Whereas previous work had compared
+deep architectures to both shallow MLPs and SVMs, we only compared to MLPs
+here because of the very large datasets used (making the use of SVMs
+computationally challenging because of their quadratic scaling
+behavior). Preliminary experiments on training SVMs (libSVM) with subsets
+of the training set allowing the program to fit in memory yielded
+substantially worse results than those obtained with MLPs.\footnote{RBF SVMs
+  trained with a subset of NISTP or NIST, 100k examples, to fit in memory,
+  yielded 64\% test error or worse; online linear SVMs trained on the whole
+  of NIST or 800k from NISTP yielded no better than 42\% error; slightly
+  better results were obtained by sparsifying the pixel intensities and
+  projecting to a second-order polynomial (a very sparse vector), still
+  41\% error. We expect that better results could be obtained with a
+  better implementation allowing for training with more examples and
+  a higher-order non-linear projection.}  For training on nearly a hundred million examples (with the
+perturbed data), the MLPs and SDA are much more convenient than classifiers
+based on kernel methods.  The MLP has a single hidden layer with $\tanh$
+activation functions, and softmax (normalized exponentials) on the output
+layer for estimating $P(class | image)$.  The number of hidden units is
+taken in $\{300,500,800,1000,1500\}$.  Training examples are presented in
+minibatches of size 20. A constant learning rate was chosen among $\{0.001,
+0.01, 0.025, 0.075, 0.1, 0.5\}$.
 %through preliminary experiments (measuring performance on a validation set),
 %and $0.1$ (which was found to work best) was then selected for optimizing on
 %the whole training sets.
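
For concreteness, a minimal numpy sketch (ours, not the authors' implementation) of the MLP just described: one tanh hidden layer, a softmax output estimating P(class | image), and plain minibatch SGD with a constant learning rate. The hidden-layer size and learning rate below are one point from the grids quoted above; the input size is an assumed example.

import numpy as np

class OneHiddenLayerMLP:
    def __init__(self, n_in, n_hidden=500, n_classes=62, lr=0.1, seed=0):
        rng = np.random.RandomState(seed)
        # small uniform init for the tanh layer, zeros for the softmax layer
        self.W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.zeros((n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)
        self.lr = lr

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)            # hidden layer
        logits = H @ self.W2 + self.b2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # softmax: P(class | image)
        return H, P

    def sgd_step(self, X, y):
        """One constant-learning-rate SGD step on the mean negative
        log-likelihood of a minibatch (X: n x n_in, y: integer labels)."""
        n = X.shape[0]
        H, P = self.forward(X)
        dlogits = P.copy()
        dlogits[np.arange(n), y] -= 1.0               # softmax + NLL gradient
        dlogits /= n
        dH = dlogits @ self.W2.T
        dpre = dH * (1.0 - H ** 2)                    # tanh'(a) = 1 - tanh(a)^2
        self.W2 -= self.lr * (H.T @ dlogits)
        self.b2 -= self.lr * dlogits.sum(axis=0)
        self.W1 -= self.lr * (X.T @ dpre)
        self.b1 -= self.lr * dpre.sum(axis=0)

# usage, with minibatches of 20 as in the paper (X_train, y_train assumed):
# mlp = OneHiddenLayerMLP(n_in=32*32)
# for i in range(0, len(X_train), 20):
#     mlp.sgd_step(X_train[i:i+20], y_train[i:i+20])
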
--- a/writeup/aistats_review_response.txt	Sun Jan 09 21:47:28 2011 -0500
+++ b/writeup/aistats_review_response.txt	Sun Jan 09 22:00:39 2011 -0500
@@ -28,7 +28,8 @@
 RBF SVM,     NISTP, 100k,  original,           74.73%,  56.57%,     64.22%
 
 The best results were obtained with the sparse quadratic input features, and
-training on the CLEAN data (NIST) rather than the perturbed data (NISTP).
+training on the CLEAN data (NIST) rather than the perturbed data (NISTP). 
+A summary of the above results was added to the revised paper.
 
 
 * Using distorted characters as the corruption process of the Denoising
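
To make the "sparse quadratic input features" result above concrete, here is a minimal sketch under our own assumptions (the binarization threshold, the index layout for pairwise products, and scikit-learn's SGDClassifier standing in for whatever online linear SVM was actually used): sparsify the pixel intensities, project to second-order products of the active pixels, and train with the hinge loss on batches that fit in memory.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier   # hinge loss = linear SVM

D = 32 * 32        # assumed number of input pixels per image
THRESH = 0.5       # assumed binarization threshold

def sparse_quadratic_indices(x):
    """Active first- and second-order feature indices for one image."""
    active = np.flatnonzero(x > THRESH)      # sparsified pixel intensities
    idx = list(active)                       # first-order terms
    for k, i in enumerate(active):           # second-order terms: products
        for j in active[k + 1:]:             # of pairs of active pixels
            idx.append(D + i * D + j)        # (each product is just 1)
    return idx

def to_csr(X):
    """Stack the sparse feature vectors of a batch into a CSR matrix."""
    rows, cols = [], []
    for r, x in enumerate(X):
        for c in sparse_quadratic_indices(x):
            rows.append(r)
            cols.append(c)
    data = np.ones(len(rows))
    return csr_matrix((data, (rows, cols)), shape=(len(X), D + D * D))

# usage: online training, one memory-sized batch at a time
# svm = SGDClassifier(loss="hinge")
# for X_batch, y_batch in batches:           # hypothetical data iterator
#     svm.partial_fit(to_csr(X_batch), y_batch, classes=np.arange(62))
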
@@ -59,7 +60,8 @@
 was 20 seconds) and 6 seconds for NISTP (average response time of
 45 seconds) (4) discarding responses which were obviously wrong (10
 identical ones, or "12345..."). Overall, after such filtering, we kept
-approximately 95% of the AMT workers' responses. We thank the reviewer for
+approximately 95% of the AMT workers' responses. The above paragraph
+was added to the revision. We thank the reviewer for
 the suggestion about multi-stage questionnaires; we will definitely
 consider this as an option the next time we perform this experiment. However,
 to be fair, if we were to do so, we should also consider the same