comparison writeup/nips2010_submission.tex @ 502:2b35a6e5ece4

Myriam's changes
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 13:37:40 -0400
parents 5927432d8b8d
children a0e820f04f8e
and {\bf OCR data} (scanned machine-printed characters). Once a character
is sampled from one of these sources (chosen randomly), a pipeline of
the above transformations and/or noise processes is applied to the
image.

We compare the best MLP (according to validation set error) that we found
against the best SDA (again according to validation set error), along with
a precise estimate of human performance obtained via Amazon's Mechanical
Turk (AMT) service\footnote{http://mturk.com}.
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language
processing \citep{SnowEtAl2008} and vision
\citep{SorokinAndForsyth2008,whitehill09}.
AMT users were presented
with 10 character images and asked to type the 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing us
to estimate inter-human variability.

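For concreteness, such an estimate can be computed as in the following
minimal NumPy sketch (an illustration only, not the script actually used;
the function and variable names are ours), which treats the three answers
per image as coming from three users and measures the spread of their
error rates:
\begin{verbatim}
import numpy as np

def inter_human_variability(answers, reference):
    """answers: (n_images, 3) array, the three AMT answers per image;
    reference: (n_images,) array of reference labels.
    Returns the mean and spread of the three per-user error rates."""
    errors = (answers != reference[:, None])  # (n_images, 3) booleans
    per_user_error = errors.mean(axis=0)      # error rate of each user
    return per_user_error.mean(), per_user_error.std()
\end{verbatim}
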
\vspace*{-1mm}
\subsection{Data Sources}
\vspace*{-1mm}

%\begin{itemize}

\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}

The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
hidden layer and with Stacked Denoising Auto-Encoders (SDA).
All hyper-parameters are selected based on performance on the NISTP validation set.

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used
(making the use of SVMs computationally inconvenient because of their quadratic
scaling behavior with the number of training examples).
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: the number of hidden units, taken in
$\{300,500,800,1000,1500\}$, and the learning rate. The optimization procedure
is as follows: training examples are presented in minibatches of size 20, and a
constant learning rate is chosen in $\{10^{-3},0.01,0.025,0.075,0.1,0.5\}$
through preliminary experiments; 0.1 was selected.
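
As a minimal illustration of this procedure (a NumPy sketch under
simplifying assumptions, not the code used in our experiments), one such
minibatch update can be written as:
\begin{verbatim}
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_sgd_step(W1, b1, W2, b2, x, y, lr=0.1):
    """One minibatch update. x: (20, n_in) images; y: (20,) class indices."""
    h = np.tanh(x @ W1 + b1)              # hidden layer, tanh activations
    p = softmax(h @ W2 + b2)              # estimates P(class | image)
    # Gradient of the mean negative log-likelihood w.r.t. pre-softmax outputs.
    d_out = p
    d_out[np.arange(len(y)), y] -= 1.0
    d_out /= len(y)
    d_h = (d_out @ W2.T) * (1.0 - h**2)   # backprop through tanh
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (x.T @ d_h);   b1 -= lr * d_h.sum(axis=0)
\end{verbatim}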

{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization, apparently setting parameters in the
basin of attraction of supervised gradient descent yielding better
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).
Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
% ADD AN IMAGE?
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising
Auto-Encoder, the model is presented with a stochastically corrupted
version of its input and trained to reconstruct the original,
uncorrupted input.
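
To make the building block concrete, the following NumPy sketch (our own
simplified rendition assuming tied weights and masking noise, not the
tutorial code) performs one denoising auto-encoder update:
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_step(W, b_h, b_v, x, corruption=0.25, lr=0.1, rng=np.random):
    """One denoising auto-encoder update with tied weights.
    x: (batch, n_visible) inputs scaled to [0, 1]."""
    mask = rng.uniform(size=x.shape) > corruption
    x_tilde = x * mask                       # corrupt: zero random components
    h = sigmoid(x_tilde @ W + b_h)           # encode the corrupted input
    r = sigmoid(h @ W.T + b_v)               # reconstruct the clean input
    # Cross-entropy reconstruction loss; gradient w.r.t. pre-sigmoid r.
    d_r = (r - x) / len(x)
    d_h = (d_r @ W) * h * (1.0 - h)          # backprop through the encoder
    W -= lr * (x_tilde.T @ d_h + d_r.T @ h)  # tied weights: two gradient paths
    b_h -= lr * d_h.sum(axis=0)
    b_v -= lr * d_r.sum(axis=0)
\end{verbatim}
After pre-training, $W$ and the hidden biases initialize one hidden layer of
the deep network, and that layer's outputs become the inputs on which the
next denoising auto-encoder is trained.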

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}

Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
SDA2), along with the previous results on the digits NIST special database
19 test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor search
significant way) but reaches human performance on both the 62-class task
and the 10-class (digits) task. In addition, as shown on the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by self-taught learning is greater for the SDA, and these
differences with the MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative change is measured as
$\mathrm{error}_{\mathrm{orig}} / \mathrm{error}_{\mathrm{pert}} - 1$, where
$\mathrm{error}_{\mathrm{orig}}$ is the test error of the model trained on the
original data and $\mathrm{error}_{\mathrm{pert}}$ that of the model trained on
the perturbed data; for example, an error rate falling from 2\% to 1.6\%
corresponds to a 25\% relative improvement.
The right side of
Figure~\ref{fig:improvements-charts} shows the relative improvement
brought by the use of a multi-task setting, in which the same model is
trained for more classes than the target classes of interest (i.e. training
with all 62 classes when the target classes are respectively the digits,
lower-case, or upper-case characters). Again, whereas the gain from the
setting is similar for the other two target classes (lower-case characters
and upper-case characters).

\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Error bars indicate a 95\% confidence interval, and the suffixes 0, 1,
and 2 indicate training on NIST, NISTP, and P07 respectively.
Left: overall results of all models, on three different test sets
corresponding to the three datasets.
Right: error rates on NIST test digits only, along with the previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005},
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
\label{fig:error-rates-charts}
\end{figure}

%\vspace*{-1mm}
%\subsection{Perturbed Training Data More Helpful for SDA}
%\vspace*{-1mm}

%\vspace*{-1mm}
%\subsection{Multi-Task Learning Effects}
%\vspace*{-1mm}

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the conclusions are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset (as far as we know), in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP)
but were only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP (from +5.6\% to -3.6\% relative change),
it was very significant for the SDA (from +13\% to +27\% relative change).
%\end{itemize}

Why would deep learners benefit more from the self-taught learning framework?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. Intermediate features that can be used in different
contexts can be estimated in a way that allows them to share statistical
strength. Features extracted through many levels tend to
be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
Therefore, we hypothesize that both depth and unsupervised
pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease apart these factors.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

\newpage