ift6266: comparison of writeup/aistats2011_revised.tex @ 623:d44c78c90669
entered revisions for AMT and SVMs
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
date | Sun, 09 Jan 2011 22:00:39 -0500 |
parents | 09b7dee216f4 |
children | 49933073590c |
622:09b7dee216f4 | 623:d44c78c90669 |
---|---|
293 Different human labelers sometimes provided a different label for the same | 293 Different human labelers sometimes provided a different label for the same |
294 example, and we were able to estimate the error variance due to this effect | 294 example, and we were able to estimate the error variance due to this effect |
295 because each image was classified by 3 different persons. | 295 because each image was classified by 3 different persons. |
296 The average error of humans on the 62-class task NIST test set | 296 The average error of humans on the 62-class task NIST test set |
297 is 18.2\%, with a standard error of 0.1\%. | 297 is 18.2\%, with a standard error of 0.1\%. |
298 We controlled noise in the labelling process by (1) |
299 requiring AMT workers to have a higher-than-normal average of accepted |
300 responses ($>$95\%) on other tasks, (2) discarding responses that were not |
301 complete (10 predictions), (3) discarding responses for which the |
302 time to predict was shorter than 3 seconds for NIST (the mean response time |
303 was 20 seconds) and shorter than 6 seconds for NISTP (average response time of |
304 45 seconds), and (4) discarding responses that were obviously wrong (10 |
305 identical answers, or ``12345...''). Overall, after such filtering, we kept |
306 approximately 95\% of the AMT workers' responses. |
298 | 307 |
299 %\vspace*{-3mm} | 308 %\vspace*{-3mm} |
300 \subsection{Data Sources} | 309 \subsection{Data Sources} |
301 \label{sec:sources} | 310 \label{sec:sources} |
302 %\vspace*{-2mm} | 311 %\vspace*{-2mm} |
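
To make the filtering rules above concrete, here is a minimal Python sketch of the four criteria. It is an illustration only, not code from the paper: the record fields (worker_approval_rate, predictions, response_time_s) are hypothetical, and only the thresholds (95% acceptance, 10 predictions, 3 s / 6 s minimum response times, the "identical or 12345" check) come from the text.

```python
# Illustrative sketch of the AMT response filtering described above.
# Field names are hypothetical; only the thresholds come from the text.

def keep_response(r, dataset="NIST"):
    """Return True if an AMT response record passes all four quality filters."""
    # (1) worker must have a higher-than-normal (>95%) acceptance rate on other tasks
    if r["worker_approval_rate"] <= 0.95:
        return False
    # (2) response must be complete: 10 predictions
    if len(r["predictions"]) != 10:
        return False
    # (3) response time must not be implausibly short:
    #     < 3 s for NIST (mean ~20 s), < 6 s for NISTP (mean ~45 s)
    min_time = 3.0 if dataset == "NIST" else 6.0
    if r["response_time_s"] < min_time:
        return False
    # (4) discard obviously wrong answers: 10 identical labels or a "12345..." sequence
    preds = [str(p) for p in r["predictions"]]
    if len(set(preds)) == 1 or "".join(preds).startswith("12345"):
        return False
    return True

# kept = [r for r in responses if keep_response(r, dataset="NISTP")]
```
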
412 | 421 |
413 The experiments are performed using MLPs (with a single | 422 The experiments are performed using MLPs (with a single |
414 hidden layer) and deep SDAs. | 423 hidden layer) and deep SDAs. |
415 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} | 424 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} |
416 | 425 |
417 {\bf Multi-Layer Perceptrons (MLP).} | 426 {\bf Multi-Layer Perceptrons (MLP).} Whereas previous work had compared |
418 Whereas previous work had compared deep architectures to both shallow MLPs and | 427 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs |
419 SVMs, we only compared to MLPs here because of the very large datasets used | 428 here because of the very large datasets used (making the use of SVMs |
420 (making the use of SVMs computationally challenging because of their quadratic | 429 computationally challenging because of their quadratic scaling |
421 scaling behavior). Preliminary experiments on training SVMs (libSVM) with subsets of the training | 430 behavior). Preliminary experiments on training SVMs (libSVM) with subsets |
422 set allowing the program to fit in memory yielded substantially worse results | 431 of the training set allowing the program to fit in memory yielded |
423 than those obtained with MLPs. For training on nearly a hundred million examples | 432 substantially worse results than those obtained with MLPs\footnote{RBF SVMs |
424 (with the perturbed data), the MLPs and SDA are much more convenient than | 433 trained on a 100k-example subset of NIST or NISTP (to fit in memory) |
425 classifiers based on kernel methods. | 434 yielded 64\% test error or worse; online linear SVMs trained on the whole |
426 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 435 of NIST or on 800k examples from NISTP yielded no better than 42\% error; slightly |
427 exponentials) on the output layer for estimating $P(class | image)$. | 436 better results were obtained by sparsifying the pixel intensities and |
428 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 437 projecting to a second-order polynomial (a very sparse vector), still |
429 Training examples are presented in minibatches of size 20. A constant learning | 438 41\% error. We expect that better results could be obtained with a |
430 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$. | 439 better implementation allowing for training with more examples and |
440 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the | |
441 perturbed data), the MLPs and SDAs are much more convenient than classifiers |
442 based on kernel methods. The MLP has a single hidden layer with $\tanh$ | |
443 activation functions, and softmax (normalized exponentials) on the output | |
444 layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. The number of hidden units is |
445 chosen among $\{300,500,800,1000,1500\}$. Training examples are presented in |
446 minibatches of size 20. A constant learning rate was chosen among $\{0.001, | |
447 0.01, 0.025, 0.075, 0.1, 0.5\}$. | |
431 %through preliminary experiments (measuring performance on a validation set), | 448 %through preliminary experiments (measuring performance on a validation set), |
432 %and $0.1$ (which was found to work best) was then selected for optimizing on | 449 %and $0.1$ (which was found to work best) was then selected for optimizing on |
433 %the whole training sets. | 450 %the whole training sets. |
434 %\vspace*{-1mm} | 451 %\vspace*{-1mm} |
435 | 452 |
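
The footnote's "sparsify the pixel intensities and project to a second-order polynomial" step is only described in passing; the sketch below is one plausible reading of it, not the authors' code. scikit-learn's SGDClassifier with hinge loss stands in for the online linear SVM mentioned there, and the intensity threshold of 0.5 is an arbitrary illustrative value.

```python
# Sketch (illustrative, not the authors' code) of the footnote's idea:
# threshold pixel intensities so each image becomes sparse, expand the
# surviving pixels into (very sparse) second-order product features, and
# train an online linear SVM (hinge loss) on those features.
import numpy as np
from itertools import combinations_with_replacement
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier  # stand-in for the online linear SVM

def sparse_second_order(images, threshold=0.5):
    """images: (n, d) array in [0, 1]. Returns a CSR matrix of products
    x_i * x_j over the pixels that survive the threshold."""
    n, d = images.shape
    rows, cols, vals = [], [], []
    for r, x in enumerate(images):
        idx = np.flatnonzero(x > threshold)      # sparsification step
        for i, j in combinations_with_replacement(idx, 2):
            rows.append(r)
            cols.append(i * d + j)               # index of the (i, j) product feature
            vals.append(x[i] * x[j])
    return csr_matrix((vals, (rows, cols)), shape=(n, d * d))

# clf = SGDClassifier(loss="hinge")              # linear SVM trained online
# clf.partial_fit(sparse_second_order(batch_x), batch_y, classes=np.arange(62))
```

Because only pixels above the threshold survive, the number of product features per image stays small even though the nominal dimensionality is d*d, which is what makes the expanded vectors "very sparse".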
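
For concreteness, here is a minimal sketch of the single-hidden-layer MLP configuration described above, written in PyTorch purely as an illustration (the excerpt does not say which framework the authors used). The tanh/softmax structure, the hidden-unit and learning-rate grids, the minibatch size of 20, and the 62 classes are from the text; the 32x32 input size is an assumption.

```python
# Minimal sketch (assumed PyTorch implementation) of the single-hidden-layer
# MLP described above: tanh hidden units, softmax output for P(class | image),
# minibatches of size 20, and a constant learning rate.
import torch
import torch.nn as nn

N_INPUTS = 32 * 32       # assumption: flattened 32x32 grey-level images
N_CLASSES = 62           # 62-class task (digits + upper/lower-case letters)
BATCH_SIZE = 20          # minibatch size quoted in the text

HIDDEN_SIZES = [300, 500, 800, 1000, 1500]           # hidden-unit grid from the text
LEARNING_RATES = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def make_mlp(n_hidden):
    # The softmax is folded into nn.CrossEntropyLoss, which expects logits.
    return nn.Sequential(
        nn.Linear(N_INPUTS, n_hidden),
        nn.Tanh(),
        nn.Linear(n_hidden, N_CLASSES),
    )

def train_one(model, loader, lr, n_epochs=1):
    # A DataLoader with batch_size=BATCH_SIZE would supply the minibatches.
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # constant learning rate
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.view(x.size(0), -1)), y)
            loss.backward()
            opt.step()
    return model

# The pair (n_hidden, lr) would be selected on the NISTP validation set,
# as stated in the text.
```
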