ift6266: comparison of writeup/aistats2011_revised.tex @ 623:d44c78c90669
entered revisions for AMT and SVMs
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
date | Sun, 09 Jan 2011 22:00:39 -0500 |
parents | 09b7dee216f4 |
children | 49933073590c |
622:09b7dee216f4 | 623:d44c78c90669 |
---|---|
293 Different human labelers sometimes provided a different label for the same | 293 Different human labelers sometimes provided a different label for the same |
294 example, and we were able to estimate the error variance due to this effect | 294 example, and we were able to estimate the error variance due to this effect |
295 because each image was classified by 3 different persons. | 295 because each image was classified by 3 different persons. |
296 The average error of humans on the 62-class task NIST test set | 296 The average error of humans on the 62-class task NIST test set |
297 is 18.2\%, with a standard error of 0.1\%. | 297 is 18.2\%, with a standard error of 0.1\%. |
298 We controlled noise in the labelling process by (1) |
299 requiring AMT workers to have a higher-than-normal average of accepted |
300 responses ($>$95\%) on other tasks, (2) discarding responses that were not |
301 complete (10 predictions), (3) discarding responses for which the |
302 time to predict was shorter than 3 seconds for NIST (the mean response time |
303 was 20 seconds) and shorter than 6 seconds for NISTP (average response time of |
304 45 seconds), and (4) discarding responses that were obviously wrong (10 |
305 identical answers, or ``12345...''). Overall, after such filtering, we kept |
306 approximately 95\% of the AMT workers' responses. |
298 | 307 |
299 %\vspace*{-3mm} | 308 %\vspace*{-3mm} |
300 \subsection{Data Sources} | 309 \subsection{Data Sources} |
301 \label{sec:sources} | 310 \label{sec:sources} |
302 %\vspace*{-2mm} | 311 %\vspace*{-2mm} |
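
To make the filtering rules above concrete, here is a minimal Python sketch of the four criteria. It is an illustration only, not code from the paper: the record fields (worker_approval_rate, predictions, response_time_s) are hypothetical, and only the thresholds (95% acceptance, 10 predictions, 3 s / 6 s minimum response times, the "identical or 12345" check) come from the text.

```python
# Illustrative sketch of the AMT response filtering described above.
# Field names are hypothetical; only the thresholds come from the text.

def keep_response(r, dataset="NIST"):
    """Return True if an AMT response record passes all four quality filters."""
    # (1) worker must have a higher-than-normal (>95%) acceptance rate on other tasks
    if r["worker_approval_rate"] <= 0.95:
        return False
    # (2) response must be complete: 10 predictions
    if len(r["predictions"]) != 10:
        return False
    # (3) response time must not be implausibly short:
    #     < 3 s for NIST (mean ~20 s), < 6 s for NISTP (mean ~45 s)
    min_time = 3.0 if dataset == "NIST" else 6.0
    if r["response_time_s"] < min_time:
        return False
    # (4) discard obviously wrong answers: 10 identical labels or a "12345..." sequence
    preds = [str(p) for p in r["predictions"]]
    if len(set(preds)) == 1 or "".join(preds).startswith("12345"):
        return False
    return True

# kept = [r for r in responses if keep_response(r, dataset="NISTP")]
```
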
412 | 421 |
413 The experiments are performed using MLPs (with a single | 422 The experiments are performed using MLPs (with a single |
414 hidden layer) and deep SDAs. | 423 hidden layer) and deep SDAs. |
415 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} | 424 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} |
416 | 425 |
417 {\bf Multi-Layer Perceptrons (MLP).} | 426 {\bf Multi-Layer Perceptrons (MLP).} Whereas previous work had compared |
418 Whereas previous work had compared deep architectures to both shallow MLPs and | 427 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs |
419 SVMs, we only compared to MLPs here because of the very large datasets used | 428 here because of the very large datasets used (making the use of SVMs |
420 (making the use of SVMs computationally challenging because of their quadratic | 429 computationally challenging because of their quadratic scaling |
421 scaling behavior). Preliminary experiments on training SVMs (libSVM) with subsets of the training | 430 behavior). Preliminary experiments on training SVMs (libSVM) with subsets |
422 set allowing the program to fit in memory yielded substantially worse results | 431 of the training set allowing the program to fit in memory yielded |
423 than those obtained with MLPs. For training on nearly a hundred million examples | 432 substantially worse results than those obtained with MLPs\footnote{RBF SVMs |
424 (with the perturbed data), the MLPs and SDA are much more convenient than | 433 trained on a 100k-example subset of NIST or NISTP (to fit in memory) |
425 classifiers based on kernel methods. | 434 yielded 64\% test error or worse; online linear SVMs trained on the whole |
426 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 435 of NIST or on 800k examples from NISTP yielded no better than 42\% error; slightly |
427 exponentials) on the output layer for estimating $P(class | image)$. | 436 better results were obtained by sparsifying the pixel intensities and |
428 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 437 projecting to a second-order polynomial (a very sparse vector), still |
429 Training examples are presented in minibatches of size 20. A constant learning | 438 41\% error. We expect that better results could be obtained with a |
430 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. | 439 better implementation allowing for training with more examples and |
440 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the | |
441 perturbed data), the MLPs and SDAs are much more convenient than classifiers |
442 based on kernel methods. The MLP has a single hidden layer with $\tanh$ | |
443 activation functions, and softmax (normalized exponentials) on the output | |
444 layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. The number of hidden units is |
445 chosen among $\{300,500,800,1000,1500\}$. Training examples are presented in |
446 minibatches of size 20. A constant learning rate was chosen among $\{0.001, | |
447 0.01, 0.025, 0.075, 0.1, 0.5\}$. | |
431 %through preliminary experiments (measuring performance on a validation set), | 448 %through preliminary experiments (measuring performance on a validation set), |
432 %and $0.1$ (which was found to work best) was then selected for optimizing on | 449 %and $0.1$ (which was found to work best) was then selected for optimizing on |
433 %the whole training sets. | 450 %the whole training sets. |
434 %\vspace*{-1mm} | 451 %\vspace*{-1mm} |
435 | 452 |
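
The footnote's "sparsify the pixel intensities and project to a second-order polynomial" step is only described in passing; the sketch below is one plausible reading of it, not the authors' code. scikit-learn's SGDClassifier with hinge loss stands in for the online linear SVM mentioned there, and the intensity threshold of 0.5 is an arbitrary illustrative value.

```python
# Sketch (illustrative, not the authors' code) of the footnote's idea:
# threshold pixel intensities so each image becomes sparse, expand the
# surviving pixels into (very sparse) second-order product features, and
# train an online linear SVM (hinge loss) on those features.
import numpy as np
from itertools import combinations_with_replacement
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier  # stand-in for the online linear SVM

def sparse_second_order(images, threshold=0.5):
    """images: (n, d) array in [0, 1]. Returns a CSR matrix of products
    x_i * x_j over the pixels that survive the threshold."""
    n, d = images.shape
    rows, cols, vals = [], [], []
    for r, x in enumerate(images):
        idx = np.flatnonzero(x > threshold)      # sparsification step
        for i, j in combinations_with_replacement(idx, 2):
            rows.append(r)
            cols.append(i * d + j)               # index of the (i, j) product feature
            vals.append(x[i] * x[j])
    return csr_matrix((vals, (rows, cols)), shape=(n, d * d))

# clf = SGDClassifier(loss="hinge")              # linear SVM trained online
# clf.partial_fit(sparse_second_order(batch_x), batch_y, classes=np.arange(62))
```

Because only pixels above the threshold survive, the number of product features per image stays small even though the nominal dimensionality is d*d, which is what makes the expanded vectors "very sparse".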
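
For concreteness, here is a minimal sketch of the single-hidden-layer MLP configuration described above, written in PyTorch purely as an illustration (the excerpt does not say which framework the authors used). The tanh/softmax structure, the hidden-unit and learning-rate grids, the minibatch size of 20, and the 62 classes are from the text; the 32x32 input size is an assumption.

```python
# Minimal sketch (assumed PyTorch implementation) of the single-hidden-layer
# MLP described above: tanh hidden units, softmax output for P(class | image),
# minibatches of size 20, and a constant learning rate.
import torch
import torch.nn as nn

N_INPUTS = 32 * 32       # assumption: flattened 32x32 grey-level images
N_CLASSES = 62           # 62-class task (digits + upper/lower-case letters)
BATCH_SIZE = 20          # minibatch size quoted in the text

HIDDEN_SIZES = [300, 500, 800, 1000, 1500]           # hidden-unit grid from the text
LEARNING_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def make_mlp(n_hidden):
    # The softmax is folded into nn.CrossEntropyLoss, which expects logits.
    return nn.Sequential(
        nn.Linear(N_INPUTS, n_hidden),
        nn.Tanh(),
        nn.Linear(n_hidden, N_CLASSES),
    )

def train_one(model, loader, lr, n_epochs=1):
    # A DataLoader with batch_size=BATCH_SIZE would supply the minibatches.
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # constant learning rate
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.view(x.size(0), -1)), y)
            loss.backward()
            opt.step()
    return model

# The pair (n_hidden, lr) would be selected on the NISTP validation set,
# as stated in the text.
```
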