comparison writeup/nips2010_submission.tex @ 485:6beaf3328521

tables removed
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Mon, 31 May 2010 21:50:00 -0400
parents 9a757d565e46
children 877af97ee193 6c9ff48e15cd
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

\vspace*{-1mm}
\section{Experimental Results}

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}

We compare the best MLP (according to validation set error) that we found against
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}.
%AMT users are paid small amounts
%of money to perform tasks for which human intelligence is required.
%Mechanical Turk has been used extensively in natural language
%processing \citep{SnowEtAl2008} and vision
%\citep{SorokinAndForsyth2008,whitehill09}.
AMT users were presented
with 10 character images and asked to type the 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing us
to estimate inter-human variability (shown as $\pm$ in parentheses below).

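As a rough sketch of how such inter-human variability can be quantified (the exact
convention is not specified here, so this is only an assumed illustration), let $e_k$
denote the error rate of labeler $k$ on the test images; with $K=3$ labelers one can
report the mean error together with its spread across labelers,
\[
\bar{e} = \frac{1}{K}\sum_{k=1}^{K} e_k, \qquad
s = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K} \left(e_k - \bar{e}\right)^2},
\]
quoting the human performance as $\bar{e} \pm s$.
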
Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
SDA2), along with previous results on the NIST special database 19 digits
test set from the literature, based respectively on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor
search~\citep{Cortes+al-2000}, MLPs~\citep{Oliveira+al-2002}, and
SVMs~\citep{Milgram+al-2005}.
More detailed and complete numerical results (figures and tables)
can be found in the appendix. The three kinds of models differ in the
training set used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1),
or P07 (MLP2, SDA2). The deep learner not only outperformed
the shallow ones and the previously published results,
but also reached human performance on both the 62-class
task and the 10-class (digits) task. In addition, as shown
on the left of Figure~\ref{fig:improvements-charts},
the relative improvement in error rate brought by
self-taught learning is greater for the SDA than for the MLP:
the left side shows the improvement on the clean NIST test set error
brought by the use of out-of-distribution
examples (i.e. the perturbed examples from NISTP
or P07). The right side of Figure~\ref{fig:improvements-charts}
shows the relative improvement brought by the use
of a multi-task setting, in which the same model is trained
for more classes than the target classes of interest
(i.e. training with all 62 classes when the target classes
are respectively the digits, lower-case, or upper-case
characters). Again, whereas the gain is marginal
or negative for the MLP, it is substantial for the SDA.

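For concreteness, the relative improvement reported here can be read with the usual
convention (assumed below) of a relative reduction in test error with respect to the
corresponding baseline (the NIST-only model for the self-taught comparison, the
single-task model for the multi-task comparison),
\[
\Delta_{\mathrm{rel}} \;=\; 100 \times \frac{e_{\mathrm{baseline}} - e_{\mathrm{variant}}}{e_{\mathrm{baseline}}}\;\%,
\]
so that a positive value means the out-of-distribution (or multi-task) training helped:
for example, an error rate going from 10\% down to 8\% corresponds to a 20\% relative
improvement.
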
\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Charts corresponding to Table~\ref{tab:sda-vs-mlp-vs-humans}. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from the literature.}
\label{fig:error-rates-charts}
\end{figure}
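A standard way to obtain such a 95\% confidence interval for an error rate $\hat{e}$
estimated from $n$ test examples (the usual binomial approximation; whether the figure
uses exactly this procedure is an assumption) is
\[
\hat{e} \pm 1.96\sqrt{\frac{\hat{e}\,(1-\hat{e})}{n}},
\]
which for instance gives roughly $\pm 0.8\%$ when $\hat{e}=20\%$ and $n=10\,000$.
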
%\vspace*{-1mm}
%\subsection{Perturbed Training Data More Helpful for SDAE}
%\vspace*{-1mm}

%\vspace*{-1mm}
%\subsection{Multi-Task Learning Effects}
%\vspace*{-1mm}

\iffalse
As previously seen, the SDA is better able to benefit from the
transformations applied to the data than the MLP. In this experiment we
define three tasks: recognizing digits (knowing that the input is a digit),
recognizing upper-case characters (knowing that the input is one), and
recognizing lower-case characters (knowing that the input is one). We
consider the digit classification task as the target task and we want to
evaluate whether training with the other tasks can help or hurt, and
whether the effect is different for MLPs versus SDAs. The goal is to find
out whether deep learning can benefit more (or less) from multiple related tasks
(i.e. the multi-task setting) than a corresponding purely supervised
shallow learner.

We use a single-hidden-layer MLP with 1000 hidden units, and an SDA
with 3 hidden layers (1000 hidden units per layer), pre-trained and
fine-tuned on NIST.

Our results show that the MLP benefits marginally from the multi-task setting
in the case of digits (5\% relative improvement) but is actually hurt in the case
of characters (respectively 3\% and 4\% worse for lower-case and upper-case characters).
On the other hand the SDA benefited from the multi-task setting, with relative
error rate improvements of 27\%, 15\% and 13\% respectively for digits,
lower-case and upper-case characters, as shown in Table~\ref{tab:multi-task}.
\fi

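To make the multi-task evaluation concrete: when a model trained over all 62 classes is
tested on a target subset of classes $\mathcal{C}$ (e.g. the 10 digits), a natural
protocol -- assumed here purely as an illustration -- is to restrict the prediction to
the target classes,
\[
\hat{y}(x) = \arg\max_{c \in \mathcal{C}} P(c \mid x),
\]
where $P(c \mid x)$ is the 62-way output of the classifier, and to compare the
resulting test error with that of the same architecture trained on $\mathcal{C}$ alone.
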
\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\
\caption{Charts corresponding to Tables~\ref{tab:perturbation-effect} (left) and \ref{tab:multi-task} (right).}
\label{fig:improvements-charts}
\end{figure}

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

The conclusions are positive for all the questions asked in the introduction.
%\begin{itemize}
$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA systematically outperformed the MLP, in fact reaching human-level
performance.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input images,
but were only marginally helped with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP, it was very significant for the SDA.
%\end{itemize}

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be run on-line at {\tt http://deep.host22.com}.


{\small
\bibliography{strings,ml,aigaion,specials}
%\bibliographystyle{plainnat}
\bibliographystyle{unsrtnat}
%\bibliographystyle{apalike}
}

\newpage

\centerline{APPENDIX FOR {\bf Deep Self-Taught Learning for Handwritten Character Recognition}}

\vspace*{1cm}

\begin{table}[h]
\caption{Overall comparison of error rates ($\pm$ std.~err.) on 62 character classes (10 digits +
26 lower-case + 26 upper-case), except for the last columns -- digits only -- between the deep architecture with pre-training
(SDA = Stacked Denoising Auto-encoder) and the ordinary shallow architecture
(MLP = Multi-Layer Perceptron). The models shown are all trained using perturbed data (NISTP or P07)
and using a validation set to select hyper-parameters and other training choices.
% ... (remainder of caption, label, and most table rows not shown in this excerpt) ...
\citep{Milgram+al-2005} & & & & 2.1\% $\pm$.12\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{table}[h]
\caption{Relative change in error rates due to the use of perturbed training data,
either using NISTP, for the MLP1/SDA1 models, or using P07, for the MLP2/SDA2 models.
A positive value indicates that training on the perturbed data helped for the
given test set (the first three columns are on the 62-class tasks and the last one is
on the clean 10-class digits). Clearly, the deep learning models did benefit more
% ... (remainder of caption, label, and most table rows not shown in this excerpt) ...
MLP0/MLP2-1 & -0.4\% & 49\% & 44\% & -29\% \\ \hline
\end{tabular}
\end{center}
\end{table}

\begin{table}[h]
\caption{Test error rates and relative change in error rates due to the use of
a multi-task setting, i.e., training on each task in isolation vs training
for all three tasks together, for MLPs vs SDAs. The SDA benefits much
more from the multi-task setting. All experiments are on the
unperturbed NIST data, using validation error for model selection.
% ... (remainder of caption, label, and table rows not shown in this excerpt) ...
\end{tabular}
\end{center}
\end{table}

\end{document}