ift6266: comparison writeup/nips2010_submission.tex @ 485:6beaf3328521
tables removed
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Mon, 31 May 2010 21:50:00 -0400 |
parents | 9a757d565e46 |
children | 877af97ee193 6c9ff48e15cd |
484:9a757d565e46 | 485:6beaf3328521 |
---|---|
459 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | 459 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. |
460 | 460 |
461 \vspace*{-1mm} | 461 \vspace*{-1mm} |
462 \section{Experimental Results} | 462 \section{Experimental Results} |
463 | 463 |
464 \vspace*{-1mm} | 464 %\vspace*{-1mm} |
465 \subsection{SDA vs MLP vs Humans} | 465 %\subsection{SDA vs MLP vs Humans} |
466 \vspace*{-1mm} | 466 %\vspace*{-1mm} |
467 | 467 |
468 We compare here the best MLP (according to validation set error) that we found against | 468 We compare the best MLP (according to validation set error) that we found against |
469 the best SDA (again according to validation set error), along with a precise estimate | 469 the best SDA (again according to validation set error), along with a precise estimate |
470 of human performance obtained via Amazon's Mechanical Turk (AMT) | 470 of human performance obtained via Amazon's Mechanical Turk (AMT) |
471 service\footnote{http://mturk.com}. AMT users are paid small amounts | 471 service\footnote{http://mturk.com}. |
472 of money to perform tasks for which human intelligence is required. | 472 %AMT users are paid small amounts |
473 Mechanical Turk has been used extensively in natural language | 473 %of money to perform tasks for which human intelligence is required. |
474 processing \citep{SnowEtAl2008} and vision | 474 %Mechanical Turk has been used extensively in natural language |
475 \citep{SorokinAndForsyth2008,whitehill09}. AMT users were presented | 475 %processing \citep{SnowEtAl2008} and vision |
476 %\citep{SorokinAndForsyth2008,whitehill09}. | |
477 AMT users were presented | |
476 with 10 character images and asked to type 10 corresponding ASCII | 478 with 10 character images and asked to type 10 corresponding ASCII |
477 characters. They were forced to make a hard choice among the | 479 characters. They were forced to make a hard choice among the |
478 62 or 10 character classes (all classes or digits only). | 480 62 or 10 character classes (all classes or digits only). |
479 Three users classified each image, allowing us | 481 Three users classified each image, allowing us |
480 to estimate inter-human variability (shown as $\pm$ in parentheses below). | 482 to estimate inter-human variability (shown as $\pm$ in parentheses below). |
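
The excerpt does not spell out the estimator behind the $\pm$ inter-human variability term. A minimal Python sketch of one plausible computation, treating the three answers collected per image as three rater columns -- an assumed data layout, not taken from the paper:

import numpy as np

def human_error_estimate(true_labels, votes):
    """Mean human error rate and inter-rater spread.

    true_labels: (n_images,) int array of gold classes.
    votes: (n_images, 3) int array of AMT answers, one column per
           answer slot (hypothetical layout; real raters varied per image).
    """
    per_rater_err = (votes != true_labels[:, None]).mean(axis=0)
    return per_rater_err.mean(), per_rater_err.std(ddof=1)

# Toy usage: 5 images, 3 answers each.
y = np.array([3, 7, 1, 0, 9])
v = np.array([[3, 3, 5], [7, 7, 7], [1, 2, 1], [0, 0, 0], [9, 4, 9]])
print(human_error_estimate(y, v))  # mean error and spread, here about (0.2, 0.2)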
481 | 483 |
482 Figure~\ref{fig:error-rates-charts} summarizes the results obtained. | 484 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, |
483 More detailed results and tables can be found in the appendix. | 485 comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1, |
484 | 486 SDA2), along with previous results on the NIST special database 19 digits |
485 \begin{table} | 487 test set from the |
488 literature, | |
489 respectively based on ARTMAP neural networks~% | |
490 \citep{Granger+al-2007}, fast nearest-neighbor search~% | |
491 \citep{Cortes+al-2000}, MLPs~% | |
492 \citep{Oliveira+al-2002}, and SVMs~% | |
493 \citep{Milgram+al-2005}. | |
494 More detailed and complete numerical results (figures and tables) | |
495 can be found in the appendix. The three kinds of models differ in the | |
496 training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), | |
497 or P07 (MLP2, SDA2). The deep learner not only outperformed | |
498 the shallow ones and previously published results | |
499 but also reached human performance on both the 62-class | |
500 task and the 10-class (digits) task. In addition, as shown | |
501 in the left side of Figure~\ref{fig:improvements-charts}, | |
502 the relative improvement in error rate brought by | |
503 self-taught learning is greater for the SDA. The left | |
504 side shows the improvement to the clean NIST test set error | |
505 brought by the use of out-of-distribution | |
506 examples (i.e. the perturbed examples from NISTP | |
507 or P07). The right side of Figure~\ref{fig:improvements-charts} | |
508 shows the relative improvement brought by the use | |
509 of a multi-task setting, in which the same model is trained | |
510 for more classes than the target classes of interest | |
511 (i.e. training with all 62 classes when the target classes | |
512 are respectively the digits, lower-case, or upper-case | |
513 characters). Again, whereas the gain is marginal | |
514 or negative for the MLP, it is substantial for the SDA. | |
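
The exact convention behind these relative improvements is only implied by the row labels of the appendix tables (e.g. MLP0/MLP2-1). Under that reading, a worked LaTeX definition -- an inference, not the authors' stated formula:

% Inferred from row labels such as MLP0/MLP2-1 (an assumption):
% baseline = model trained on NIST only (MLP0/SDA0),
% perturbed = same architecture trained on NISTP or P07 (MLP1/2, SDA1/2).
\[
  \Delta_{\mathrm{rel}} \;=\; \frac{e_{\mathrm{baseline}}}{e_{\mathrm{perturbed}}} - 1,
\]
% so $\Delta_{\mathrm{rel}} > 0$ exactly when training on the perturbed data
% lowered the error on the given test set, matching the table caption's
% sign convention.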
515 | |
516 | |
517 \begin{figure}[h] | |
518 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ | |
519 \caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature. } | |
520 \label{fig:error-rates-charts} | |
521 \end{figure} | |
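
The tables report $\pm$ one standard error while the chart's error bars show a 95\% confidence interval; assuming the usual binomial/normal approximation (not stated in the excerpt), the two are related by:

% Assumed relation between the reported std. err. and the 95% bars:
\[
  \mathrm{CI}_{95\%} \;=\; \hat{e} \,\pm\, 1.96\,\widehat{\mathrm{SE}}(\hat{e}),
  \qquad
  \widehat{\mathrm{SE}}(\hat{e}) \;=\; \sqrt{\hat{e}(1-\hat{e})/n},
\]
% where $\hat{e}$ is the measured test error rate and $n$ the number of
% test examples (a standard approximation, not taken from the paper).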
522 | |
523 %\vspace*{-1mm} | |
524 %\subsection{Perturbed Training Data More Helpful for SDAE} | |
525 %\vspace*{-1mm} | |
526 | |
527 %\vspace*{-1mm} | |
528 %\subsection{Multi-Task Learning Effects} | |
529 %\vspace*{-1mm} | |
530 | |
531 \iffalse | |
532 As previously seen, the SDA is better able to benefit from the | |
533 transformations applied to the data than the MLP. In this experiment we | |
534 define three tasks: recognizing digits (knowing that the input is a digit), | |
535 recognizing upper case characters (knowing that the input is one), and | |
536 recognizing lower case characters (knowing that the input is one). We | |
537 consider the digit classification task as the target task and we want to | |
538 evaluate whether training with the other tasks can help or hurt, and | |
539 whether the effect is different for MLPs versus SDAs. The goal is to find | |
540 out if deep learning can benefit more (or less) from multiple related tasks | |
541 (i.e. the multi-task setting) compared to a corresponding purely supervised | |
542 shallow learner. | |
543 | |
544 We use a single hidden layer MLP with 1000 hidden units, and an SDA | |
545 with 3 hidden layers (1000 hidden units per layer), pre-trained and | |
546 fine-tuned on NIST. | |
547 | |
548 Our results show that the MLP benefits marginally from the multi-task setting | |
549 in the case of digits (5\% relative improvement) but is actually hurt in the case | |
550 of characters (respectively 3\% and 4\% worse for lower- and upper-case characters). | |
551 On the other hand, the SDA benefited from the multi-task setting, with relative | |
552 error rate improvements of 27\%, 15\% and 13\% respectively for digits, | |
553 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | |
554 \fi | |
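
As a concrete reading of the multi-task protocol described above (one model trained on all 62 classes, then scored on a target subset whose membership is given), a minimal numpy sketch; the class-id layout and the restriction-by-masking step are assumptions, not the authors' code:

import numpy as np

DIGITS = np.arange(10)  # assumed ids 0-9 for the ten digit classes

def subset_error(logits, labels, subset):
    """Error rate when the input is known to belong to `subset`:
    restrict the 62-way scores to the allowed classes and take the
    argmax within them."""
    pred = subset[np.argmax(logits[:, subset], axis=1)]
    return float(np.mean(pred != labels))

# Toy usage: 4 examples with random 62-class scores, digit-only evaluation.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 62))
labels = np.array([0, 3, 9, 7])
print(subset_error(logits, labels, DIGITS))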
555 | |
556 | |
557 \begin{figure}[h] | |
558 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | |
559 \caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).} | |
560 \label{fig:improvements-charts} | |
561 \end{figure} | |
562 | |
563 \vspace*{-1mm} | |
564 \section{Conclusions} | |
565 \vspace*{-1mm} | |
566 | |
567 The conclusions are positive for all the questions asked in the introduction. | |
568 %\begin{itemize} | |
569 $\bullet$ %\item | |
570 Do the good results previously obtained with deep architectures on the | |
571 MNIST digits generalize to the setting of a much larger and richer (but similar) | |
572 dataset, the NIST special database 19, with 62 classes and around 800k examples? | |
573 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level | |
574 performance. | |
575 | |
576 $\bullet$ %\item | |
577 To what extent does the perturbation of input images (e.g. adding | |
578 noise, affine transformations, background images) make the resulting | |
579 classifier better not only on similarly perturbed images but also on | |
580 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | |
581 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | |
582 MLPs were helped by perturbed training examples when tested on perturbed input images, | |
583 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs | |
584 were very significantly boosted by these out-of-distribution examples. | |
585 | |
586 $\bullet$ %\item | |
587 Similarly, does the feature learning step in deep learning algorithms benefit more from | |
588 training with similar but different classes (i.e. a multi-task learning scenario) than | |
589 a corresponding shallow and purely supervised architecture? | |
590 Whereas the improvement due to the multi-task setting was marginal or | |
591 negative for the MLP, it was very significant for the SDA. | |
592 %\end{itemize} | |
593 | |
594 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | |
595 can be executed on-line at {\tt http://deep.host22.com}. | |
596 | |
597 | |
598 {\small | |
599 \bibliography{strings,ml,aigaion,specials} | |
600 %\bibliographystyle{plainnat} | |
601 \bibliographystyle{unsrtnat} | |
602 %\bibliographystyle{apalike} | |
603 } | |
604 | |
605 \newpage | |
606 | |
607 \centerline{APPENDIX FOR {\bf Deep Self-Taught Learning for Handwritten Character Recognition}} | |
608 | |
609 \vspace*{1cm} | |
610 | |
611 \begin{table}[h] | |
486 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + | 612 \caption{Overall comparison of error rates ($\pm$ std.err.) on 62 character classes (10 digits + |
487 26 lower + 26 upper), except for the last column -- digits only, between deep architecture with pre-training | 613 26 lower + 26 upper), except for the last column -- digits only, between deep architecture with pre-training |
488 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture | 614 (SDA=Stacked Denoising Autoencoder) and ordinary shallow architecture |
489 (MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07) | 615 (MLP=Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07) |
490 and using a validation set to select hyper-parameters and other training choices. | 616 and using a validation set to select hyper-parameters and other training choices. |
510 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline | 636 \citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline |
511 \end{tabular} | 637 \end{tabular} |
512 \end{center} | 638 \end{center} |
513 \end{table} | 639 \end{table} |
514 | 640 |
515 \begin{figure}[h] | 641 \begin{table}[h] |
516 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ | |
517 \caption{Charts corresponding to table \ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature. } | |
518 \label{fig:error-rates-charts} | |
519 \end{figure} | |
520 | |
521 \vspace*{-1mm} | |
522 \subsection{Perturbed Training Data More Helpful for SDAE} | |
523 \vspace*{-1mm} | |
524 | |
525 \begin{table} | |
526 \caption{Relative change in error rates due to the use of perturbed training data, | 642 \caption{Relative change in error rates due to the use of perturbed training data, |
527 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. | 643 either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models. |
528 A positive value indicates that training on the perturbed data helped for the | 644 A positive value indicates that training on the perturbed data helped for the |
529 given test set (the first 3 columns are on the 62-class tasks and the last one is | 645 given test set (the first 3 columns are on the 62-class tasks and the last one is |
530 on the clean 10-class digits). Clearly, the deep learning models did benefit more | 646 on the clean 10-class digits). Clearly, the deep learning models did benefit more |
541 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline | 657 MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline |
542 \end{tabular} | 658 \end{tabular} |
543 \end{center} | 659 \end{center} |
544 \end{table} | 660 \end{table} |
545 | 661 |
546 \vspace*{-1mm} | 662 \begin{table}[h] |
547 \subsection{Multi-Task Learning Effects} | |
548 \vspace*{-1mm} | |
549 | |
550 As previously seen, the SDA is better able to benefit from the | |
551 transformations applied to the data than the MLP. In this experiment we | |
552 define three tasks: recognizing digits (knowing that the input is a digit), | |
553 recognizing upper case characters (knowing that the input is one), and | |
554 recognizing lower case characters (knowing that the input is one). We | |
555 consider the digit classification task as the target task and we want to | |
556 evaluate whether training with the other tasks can help or hurt, and | |
557 whether the effect is different for MLPs versus SDAs. The goal is to find | |
558 out if deep learning can benefit more (or less) from multiple related tasks | |
559 (i.e. the multi-task setting) compared to a corresponding purely supervised | |
560 shallow learner. | |
561 | |
562 We use a single hidden layer MLP with 1000 hidden units, and an SDA | |
563 with 3 hidden layers (1000 hidden units per layer), pre-trained and | |
564 fine-tuned on NIST. | |
565 | |
566 Our results show that the MLP benefits marginally from the multi-task setting | |
567 in the case of digits (5\% relative improvement) but is actually hurt in the case | |
568 of characters (respectively 3\% and 4\% worse for lower- and upper-case characters). | |
569 On the other hand, the SDA benefited from the multi-task setting, with relative | |
570 error rate improvements of 27\%, 15\% and 13\% respectively for digits, | |
571 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | |
572 | |
573 \begin{table} | |
574 \caption{Test error rates and relative change in error rates due to the use of | 663 \caption{Test error rates and relative change in error rates due to the use of |
575 a multi-task setting, i.e., training on each task in isolation vs training | 664 a multi-task setting, i.e., training on each task in isolation vs training |
576 for all three tasks together, for MLPs vs SDAs. The SDA benefits much | 665 for all three tasks together, for MLPs vs SDAs. The SDA benefits much |
577 more from the multi-task setting. All experiments are performed on only the | 666 more from the multi-task setting. All experiments are performed on only the |
578 unperturbed NIST data, using validation error for model selection. | 667 unperturbed NIST data, using validation error for model selection. |
591 \end{tabular} | 680 \end{tabular} |
592 \end{center} | 681 \end{center} |
593 \end{table} | 682 \end{table} |
594 | 683 |
595 | 684 |
596 \begin{figure}[h] | |
597 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | |
598 \caption{Charts corresponding to tables \ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).} | |
599 \label{fig:improvements-charts} | |
600 \end{figure} | |
601 | |
602 \vspace*{-1mm} | |
603 \section{Conclusions} | |
604 \vspace*{-1mm} | |
605 | |
606 The conclusions are positive for all the questions asked in the introduction. | |
607 %\begin{itemize} | |
608 $\bullet$ %\item | |
609 Do the good results previously obtained with deep architectures on the | |
610 MNIST digits generalize to the setting of a much larger and richer (but similar) | |
611 dataset, the NIST special database 19, with 62 classes and around 800k examples? | |
612 Yes, the SDA systematically outperformed the MLP, in fact reaching human-level | |
613 performance. | |
614 | |
615 $\bullet$ %\item | |
616 To what extent does the perturbation of input images (e.g. adding | |
617 noise, affine transformations, background images) make the resulting | |
618 classifier better not only on similarly perturbed images but also on | |
619 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | |
620 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | |
621 MLPs were helped by perturbed training examples when tested on perturbed input images, | |
622 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs | |
623 were very significantly boosted by these out-of-distribution examples. | |
624 | |
625 $\bullet$ %\item | |
626 Similarly, does the feature learning step in deep learning algorithms benefit more from | |
627 training with similar but different classes (i.e. a multi-task learning scenario) than | |
628 a corresponding shallow and purely supervised architecture? | |
629 Whereas the improvement due to the multi-task setting was marginal or | |
630 negative for the MLP, it was very significant for the SDA. | |
631 %\end{itemize} | |
632 | |
633 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | |
634 can be executed on-line at {\tt http://deep.host22.com}. | |
635 | |
636 | |
637 {\small | |
638 \bibliography{strings,ml,aigaion,specials} | |
639 %\bibliographystyle{plainnat} | |
640 \bibliographystyle{unsrtnat} | |
641 %\bibliographystyle{apalike} | |
642 } | |
643 | |
644 \end{document} | 685 \end{document} |