ift6266: diff writeup/aistats2011_revised.tex @ 623:d44c78c90669
entered revisions for AMT and SVMs
author | Yoshua Bengio <bengioy@iro.umontreal.ca>
---|---
date | Sun, 09 Jan 2011 22:00:39 -0500
parents | 09b7dee216f4
children | 49933073590c
--- a/writeup/aistats2011_revised.tex	Sun Jan 09 21:47:28 2011 -0500
+++ b/writeup/aistats2011_revised.tex	Sun Jan 09 22:00:39 2011 -0500
@@ -295,6 +295,15 @@
 because each image was classified by 3 different persons.
 The average error of humans on the 62-class task NIST test set is 18.2\%,
 with a standard error of 0.1\%.
+We controlled noise in the labelling process by (1)
+requiring AMT workers to have a higher than normal rate of accepted
+responses ($>$95\%) on other tasks, (2) discarding responses that were not
+complete (10 predictions), (3) discarding responses for which the
+time to predict was smaller than 3 seconds for NIST (the mean response time
+was 20 seconds) and 6 seconds for NISTP (average response time of
+45 seconds), and (4) discarding responses which were obviously wrong (10
+identical ones, or ``12345...''). Overall, after such filtering, we kept
+approximately 95\% of the AMT workers' responses.
 %\vspace*{-3mm}
 \subsection{Data Sources}

@@ -414,20 +423,28 @@
 hidden layer) and deep SDAs.
 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation
 set error.}
-{\bf Multi-Layer Perceptrons (MLP).}
-Whereas previous work had compared deep architectures to both shallow MLPs and
-SVMs, we only compared to MLPs here because of the very large datasets used
-(making the use of SVMs computationally challenging because of their quadratic
-scaling behavior). Preliminary experiments on training SVMs (libSVM) with subsets of the training
-set allowing the program to fit in memory yielded substantially worse results
-than those obtained with MLPs. For training on nearly a hundred million examples
-(with the perturbed data), the MLPs and SDA are much more convenient than
-classifiers based on kernel methods.
-The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
-exponentials) on the output layer for estimating $P(class | image)$.
-The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
-Training examples are presented in minibatches of size 20. A constant learning
-rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
+{\bf Multi-Layer Perceptrons (MLP).} Whereas previous work had compared
+deep architectures to both shallow MLPs and SVMs, we only compared to MLPs
+here because of the very large datasets used (making the use of SVMs
+computationally challenging because of their quadratic scaling
+behavior). Preliminary experiments on training SVMs (libSVM) with subsets
+of the training set allowing the program to fit in memory yielded
+substantially worse results than those obtained with MLPs.\footnote{RBF SVMs
+  trained with a subset of NISTP or NIST, 100k examples, to fit in memory,
+  yielded 64\% test error or worse; online linear SVMs trained on the whole
+  of NIST or 800k from NISTP yielded no better than 42\% error; slightly
+  better results were obtained by sparsifying the pixel intensities and
+  projecting to a second-order polynomial (a very sparse vector), still
+  41\% error. We expect that better results could be obtained with a
+  better implementation allowing for training with more examples and
+  a higher-order non-linear projection.} For training on nearly a hundred
+million examples (with the perturbed data), the MLPs and SDA are much more
+convenient than classifiers based on kernel methods. The MLP has a single
+hidden layer with $\tanh$ activation functions, and softmax (normalized
+exponentials) on the output layer for estimating $P(class | image)$.
+The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
+Training examples are presented in minibatches of size 20. A constant
+learning rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$.
 %through preliminary experiments (measuring performance on a validation set),
 %and $0.1$ (which was found to work best) was then selected for optimizing on
 %the whole training sets.
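The four filtering rules added in the first hunk amount to a small decision procedure over raw AMT responses. The following Python sketch is only an illustration of that procedure, not the authors' actual script: the response format and the field names (`worker_approval_rate`, `predictions`, `response_time_s`) are assumptions.

```python
# Hypothetical sketch of the AMT response filtering described above.
# The response format and field names are assumptions, not the authors' data layout.

MIN_APPROVAL = 0.95                        # rule (1): acceptance rate on other tasks
MIN_TIME_S = {"NIST": 3.0, "NISTP": 6.0}   # rule (3): minimum plausible response time
N_PREDICTIONS = 10                         # rule (2): a complete response has 10 predictions

def is_obviously_wrong(predictions):
    """Rule (4): reject degenerate answers, e.g. 10 identical labels or '1','2','3',..."""
    if len(set(predictions)) == 1:
        return True
    if predictions == [str(d) for d in range(1, len(predictions) + 1)]:
        return True
    return False

def keep_response(response, dataset):
    """Apply the four filtering rules to a single AMT response."""
    return (response["worker_approval_rate"] > MIN_APPROVAL
            and len(response["predictions"]) == N_PREDICTIONS
            and response["response_time_s"] >= MIN_TIME_S[dataset]
            and not is_obviously_wrong(response["predictions"]))

def filter_responses(responses, dataset="NIST"):
    """Keep only the responses that pass all four checks (about 95% in the paper)."""
    return [r for r in responses if keep_response(r, dataset)]
```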
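The new footnote only outlines the SVM baselines: RBF SVMs on a memory-sized subset of roughly 100k examples, and online linear SVMs streamed over the full training set. A rough sketch of that kind of setup, using scikit-learn's libsvm-backed `SVC` and an `SGDClassifier` with hinge loss rather than whatever implementation was actually used, could look like the following; the subset size, chunk size, and default hyper-parameters are assumptions, and the sparse second-order polynomial projection mentioned in the footnote is omitted.

```python
# Rough sketch (not the authors' code) of the two SVM baselines outlined in the footnote.
# X is assumed to be a 2-D array of flattened images, y the integer class labels.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

def rbf_svm_on_subset(X, y, n_sub=100_000, seed=0):
    """RBF SVM (libsvm backend) trained on a random subset small enough to fit in memory."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=min(n_sub, len(X)), replace=False)
    return SVC(kernel="rbf").fit(X[idx], y[idx])

def online_linear_svm(X, y, chunk=10_000, n_epochs=1):
    """Online linear SVM (hinge loss) that streams over the whole training set."""
    clf = SGDClassifier(loss="hinge")
    classes = np.unique(y)
    for _ in range(n_epochs):
        for start in range(0, len(X), chunk):
            sl = slice(start, start + chunk)
            clf.partial_fit(X[sl], y[sl], classes=classes)
    return clf
```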
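The rewritten MLP paragraph pins down the architecture: one $\tanh$ hidden layer, a softmax output estimating $P(class | image)$, minibatches of 20 examples, and a constant learning rate. A minimal numpy sketch of that model follows, assuming 62 output classes and flattened 32x32 inputs (the input size is an assumption); the weight initialization and the particular grid values used below (500 hidden units, learning rate 0.1) are illustrative picks from the grids in the diff, not the settings selected in the paper.

```python
# Minimal numpy sketch of the single-hidden-layer MLP described above:
# tanh hidden layer, softmax output for P(class | image), minibatches of 20,
# constant learning rate. Values are single points from the paper's grids,
# chosen only for illustration.
import numpy as np

class TanhMLP:
    def __init__(self, n_in=32 * 32, n_hidden=500, n_out=62, lr=0.1, seed=0):
        rng = np.random.RandomState(seed)
        self.W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, X):
        """Return hidden activations and softmax class probabilities."""
        h = np.tanh(X @ self.W1 + self.b1)
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return h, e / e.sum(axis=1, keepdims=True)

    def sgd_step(self, X, y):
        """One minibatch update of the negative log-likelihood; y holds integer labels."""
        n = len(X)
        h, p = self.forward(X)
        d_logits = p.copy()
        d_logits[np.arange(n), y] -= 1.0              # gradient of softmax + NLL w.r.t. logits
        d_logits /= n
        d_h = (d_logits @ self.W2.T) * (1.0 - h ** 2) # tanh'(a) = 1 - tanh(a)^2
        self.W2 -= self.lr * (h.T @ d_logits)
        self.b2 -= self.lr * d_logits.sum(axis=0)
        self.W1 -= self.lr * (X.T @ d_h)
        self.b1 -= self.lr * d_h.sum(axis=0)

def train(model, X, y, batch_size=20, n_epochs=1):
    """Present the training examples in minibatches of `batch_size`."""
    for _ in range(n_epochs):
        for start in range(0, len(X), batch_size):
            sl = slice(start, start + batch_size)
            model.sgd_step(X[sl], y[sl])
```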