comparison of writeup/nips2010_submission.tex @ 502:2b35a6e5ece4
description: "changements de Myriam" (changes from Myriam)
author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Tue, 01 Jun 2010 13:37:40 -0400
parents:  5927432d8b8d
children: a0e820f04f8e
and {\bf OCR data} (scanned machine printed characters). Once a character
is sampled from one of these sources (chosen randomly), a pipeline of
the above transformations and/or noise processes is applied to the
image.

We compare the best MLP (according to validation set error) that we found against
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}.
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language
processing \citep{SnowEtAl2008} and vision
\citep{SorokinAndForsyth2008,whitehill09}.
AMT users were presented
with 10 character images and asked to type the 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing
us to estimate inter-human variability.

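As a concrete illustration, inter-human variability can be estimated from
such triple-labeled data as the spread of the three labelers' error rates.
The sketch below (Python with numpy) is purely illustrative: the data
layout and variable names are hypothetical, not the actual AMT export
format or the script used for the paper.
{\small
\begin{verbatim}
# Hypothetical sketch: human error rate and inter-labeler spread,
# assuming answers[i] holds the three labels typed for image i.
import numpy as np

def human_error_stats(answers, truth):
    answers = np.asarray(answers)        # shape (n_images, 3)
    truth = np.asarray(truth)[:, None]   # shape (n_images, 1)
    errors = (answers != truth)          # per-image, per-labeler mistakes
    per_labeler = errors.mean(axis=0)    # error rate of each labeler
    return per_labeler.mean(), per_labeler.std(ddof=1)

# toy data: 4 images, 3 labelers each
answers = [('a','a','o'), ('3','3','3'), ('B','8','B'), ('z','z','z')]
truth   = ['a', '3', 'B', 'z']
mean, spread = human_error_stats(answers, truth)
print("human error: %.1f%% +/- %.1f%%" % (100*mean, 100*spread))
\end{verbatim}
}
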
\vspace*{-1mm}
\subsection{Data Sources}
\vspace*{-1mm}

%\begin{itemize}
[...]

\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}

The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
hidden layer and with Stacked Denoising Auto-Encoders (SDA).
All hyper-parameters are selected based on performance on the NISTP validation set.

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used
(making the use of SVMs computationally impractical, given their quadratic
or worse scaling with the number of training examples).
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: the number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

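The following is a minimal sketch (Python with numpy) of the training
procedure just described: one $\tanh$ hidden layer, a softmax output
layer, and minibatch SGD with a constant learning rate. It is not the
code used for the experiments, and the $32\times32$ input size is an
assumption made for illustration.
{\small
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32*32, 500, 62  # assumed input size; 62 classes
lr, batch_size = 0.1, 20                    # selected hyper-parameters

W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

def forward(X):
    H = np.tanh(X.dot(W1) + b1)                  # hidden layer
    A = H.dot(W2) + b2
    A -= A.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(A); P /= P.sum(axis=1, keepdims=True)  # softmax P(class|image)
    return H, P

def sgd_step(X, y):
    global W1, b1, W2, b2
    H, P = forward(X)
    G = P.copy(); G[np.arange(len(y)), y] -= 1.0  # gradient of NLL wrt logits
    G /= len(y)
    GH = G.dot(W2.T) * (1 - H**2)                 # backprop through tanh
    W2 -= lr * H.T.dot(G); b2 -= lr * G.sum(axis=0)
    W1 -= lr * X.T.dot(GH); b1 -= lr * GH.sum(axis=0)

# one pass over a toy random dataset, in minibatches of 20
X = rng.rand(200, n_in); y = rng.randint(0, n_classes, 200)
for i in range(0, len(X), batch_size):
    sgd_step(X[i:i+batch_size], y[i:i+batch_size])
\end{verbatim}
}
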
{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization, apparently setting parameters in the
basin of attraction of supervised gradient descent yielding better
[...]
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).
Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
% ADD AN IMAGE?
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising
[...]
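To make the building block concrete, here is a hedged numpy sketch of one
training step of a denoising auto-encoder with masking noise, tied weights,
and a cross-entropy reconstruction loss, in the style
of~\citet{VincentPLarochelleH2008}; the corruption level and other settings
are placeholders, not the values used in our experiments.
{\small
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
n_vis, n_hid, lr, corruption = 32*32, 500, 0.1, 0.25  # placeholder settings

W = rng.uniform(-0.05, 0.05, (n_vis, n_hid))
b_hid, b_vis = np.zeros(n_hid), np.zeros(n_vis)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dae_step(X):
    """One SGD step on a minibatch X with entries in [0, 1]."""
    global W, b_hid, b_vis
    mask = rng.binomial(1, 1.0 - corruption, X.shape)  # drop ~25% of pixels
    Xc = X * mask                                      # corrupted input
    H = sigmoid(Xc.dot(W) + b_hid)                     # encode
    R = sigmoid(H.dot(W.T) + b_vis)                    # decode (tied weights)
    Gr = (R - X) / len(X)     # grad of cross-entropy wrt pre-sigmoid output
    Gh = Gr.dot(W) * H * (1 - H)                       # backprop to encoder
    W -= lr * (Xc.T.dot(Gh) + Gr.T.dot(H))             # both uses of tied W
    b_hid -= lr * Gh.sum(axis=0); b_vis -= lr * Gr.sum(axis=0)

dae_step(rng.rand(20, n_vis))  # toy minibatch of 20 images
\end{verbatim}
}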

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}

Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
SDA2), along with the previous results on the digits NIST special database
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor search
[...]
significant way) but reaches human performance on both the 62-class task
and the 10-class (digits) task. In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by self-taught learning is greater for the SDA, and these
differences with the MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative change is measured as
$\frac{e_{\mathrm{orig}}}{e_{\mathrm{pert}}} - 1$, where $e_{\mathrm{orig}}$
is the error of the model trained on the original data and
$e_{\mathrm{pert}}$ that of the model trained with perturbed data.
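For instance, with purely illustrative numbers, an original-data model at
30\% error and a perturbed-data model at 24\% error would give a relative
improvement of
\[
\frac{0.30}{0.24} - 1 = 0.25 = 25\%.
\]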
The right side of
Figure~\ref{fig:improvements-charts} shows the relative improvement
brought by the use of a multi-task setting, in which the same model is
trained for more classes than the target classes of interest (i.e. training
with all 62 classes when the target classes are respectively the digits,
lower-case, or upper-case characters). Again, whereas the gain from the
[...]
setting is similar for the other two target classes (lower-case characters
and upper-case characters).

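One plausible way to score such a 62-way model on a restricted target task
is to limit the softmax prediction to the target classes. The snippet below
is a hypothetical illustration of that evaluation in numpy; the class
layout and the restriction of the argmax are assumptions of this sketch,
not details taken from our experimental setup.
{\small
\begin{verbatim}
import numpy as np

def subset_error(P, y, target_classes):
    """P: (n, 62) class probabilities; y: labels within the subset."""
    target = np.asarray(target_classes)
    pred = target[P[:, target].argmax(axis=1)]  # argmax over subset only
    return (pred != np.asarray(y)).mean()

# e.g. if digits occupy classes 0-9 of the output layer (assumed layout)
rng = np.random.RandomState(0)
P = rng.dirichlet(np.ones(62), size=100)  # dummy predicted probabilities
y = rng.randint(0, 10, size=100)
print("digits-only test error: %.2f" % subset_error(P, y, range(10)))
\end{verbatim}
}
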
\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Left: overall results of all models, on 3 different test sets
corresponding to the three datasets; 0 indicates training on NIST, 1 on
NISTP, and 2 on P07, and error bars indicate a 95\% confidence interval.
Right: error rates on NIST test digits only, along with the previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005},
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
\label{fig:error-rates-charts}
\end{figure}

%\vspace*{-1mm}
%\subsection{Perturbed Training Data More Helpful for SDA}
%\vspace*{-1mm}

%\vspace*{-1mm}
%\subsection{Multi-Task Learning Effects}
%\vspace*{-1mm}
[...]

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the conclusions are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset (as far as we know), in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP),
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP (from +5.6\% to -3.6\% relative change),
it was very significant for the SDA (from +13\% to +27\% relative change).
%\end{itemize}

Why would deep learners benefit more from the self-taught learning framework?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. Intermediate features that can be used in different
contexts can be estimated in a way that allows statistical strength to be
shared. Features extracted through many levels are more likely to
be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
Therefore, we hypothesize that both depth and unsupervised
pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease these factors apart.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

\newpage