comparison writeup/aistats2011_revised.tex @ 623:d44c78c90669

entered revisions for AMT and SVMs
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 09 Jan 2011 22:00:39 -0500
parents 09b7dee216f4
children 49933073590c
Different human labelers sometimes provided a different label for the same
example, and we were able to estimate the error variance due to this effect
because each image was classified by 3 different persons.
The average error of humans on the 62-class task (NIST test set)
is 18.2\%, with a standard error of 0.1\%.
We controlled noise in the labelling process by (1)
requiring AMT workers with a higher-than-normal average of accepted
responses ($>$95\%) on other tasks; (2) discarding responses that were not
complete (10 predictions); (3) discarding responses for which the
time to predict was smaller than 3 seconds for NIST (the mean response time
was 20 seconds) and 6 seconds for NISTP (average response time of
45 seconds); and (4) discarding responses which were obviously wrong (10
identical ones, or ``12345...''). Overall, after such filtering, we kept
approximately 95\% of the AMT workers' responses.
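To make these filtering rules concrete, the following Python sketch summarizes the procedure; it is illustrative rather than the script actually used, and the response fields (worker approval rate, list of 10 predictions, response time, dataset name) are assumed names.

\begin{verbatim}
# Illustrative sketch of the AMT response filter (field names assumed).
from dataclasses import dataclass
from typing import List

@dataclass
class Response:
    worker_approval_rate: float  # fraction of the worker's previously accepted HITs
    predictions: List[str]       # the 10 character labels given in one response
    response_time: float         # seconds taken to answer
    dataset: str                 # "NIST" or "NISTP"

def is_trivial_sequence(preds):
    # Heuristic stand-in for the "12345..." pattern: an ascending run of digits.
    return (all(c.isdigit() for c in preds) and
            all(int(b) - int(a) == 1 for a, b in zip(preds, preds[1:])))

def keep_response(r: Response) -> bool:
    min_seconds = 3.0 if r.dataset == "NIST" else 6.0
    return (r.worker_approval_rate > 0.95           # (1) reliable workers only
            and len(r.predictions) == 10            # (2) complete responses only
            and r.response_time >= min_seconds      # (3) not implausibly fast
            and len(set(r.predictions)) > 1         # (4) not 10 identical labels
            and not is_trivial_sequence(r.predictions))  # (4) not "12345..."

def filter_responses(responses):
    # In our data, roughly 95% of responses pass these checks.
    return [r for r in responses if keep_response(r)]
\end{verbatim}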

%\vspace*{-3mm}
\subsection{Data Sources}
\label{sec:sources}
%\vspace*{-2mm}
...

The experiments are performed using MLPs (with a single
hidden layer) and deep SDAs.
\emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}

{\bf Multi-Layer Perceptrons (MLP).} Whereas previous work had compared
deep architectures to both shallow MLPs and SVMs, we only compared to MLPs
here because of the very large datasets used (making the use of SVMs
computationally challenging because of their quadratic scaling
behavior). Preliminary experiments on training SVMs (libSVM) with subsets
of the training set small enough for the program to fit in memory yielded
substantially worse results than those obtained with MLPs\footnote{RBF SVMs
trained on a 100k-example subset of NIST or NISTP (so as to fit in memory)
yielded 64\% test error or worse; online linear SVMs trained on the whole
of NIST or on 800k examples from NISTP yielded no better than 42\% error;
slightly better results were obtained by sparsifying the pixel intensities
and projecting to a second-order polynomial (a very sparse vector), still
41\% error. We expect that better results could be obtained with a
better implementation allowing for training with more examples and
a higher-order non-linear projection.} For training on nearly a hundred
million examples (with the perturbed data), the MLPs and SDAs are much more
convenient than classifiers based on kernel methods. The MLP has a single
hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating
$P(\mathrm{class} \mid \mathrm{image})$. The number of hidden units is
taken in $\{300,500,800,1000,1500\}$. Training examples are presented in
minibatches of size 20. A constant learning rate was chosen among $\{0.001,
0.01, 0.025, 0.075, 0.1, 0.5\}$.
%through preliminary experiments (measuring performance on a validation set),
%and $0.1$ (which was found to work best) was then selected for optimizing on
%the whole training sets.
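For concreteness, the computation just described can be written as follows
(a sketch in our notation, with $W_1,b_1$ and $W_2,b_2$ the hidden- and
output-layer parameters, and assuming the usual negative log-likelihood
training criterion):
\begin{align*}
h &= \tanh(W_1 x + b_1), \\
P(\mathrm{class}=i \mid \mathrm{image}~x) &= \frac{\exp\!\big((W_2 h + b_2)_i\big)}{\sum_j \exp\!\big((W_2 h + b_2)_j\big)},
\end{align*}
with parameters $\theta=(W_1,b_1,W_2,b_2)$ updated after each minibatch $B$ of 20 examples as
$\theta \leftarrow \theta - \epsilon \,\frac{1}{|B|}\sum_{(x,y)\in B}\nabla_\theta \big[-\log P(y \mid x)\big]$,
where the constant learning rate $\epsilon$ is chosen among the values listed above.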
%\vspace*{-1mm}
