comparison writeup/nips2010_submission.tex @ 505:a41a8925be70

merge
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Tue, 01 Jun 2010 10:55:08 -0700
parents e837ef6eef8c a0e820f04f8e
children b8e33d3d7f65 860c755ddcff
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.\\
{\bf Pinch.}
This GIMP filter is named ``Whirl and pinch'', but whirl was set to 0.
A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
For a square input image, think of drawing a circle of
radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to
that disk (the region inside the circle) will have its value recalculated by taking
the value of another ``source'' pixel in the original image. The position of
that source pixel is found on the line that goes through $C$ and $P$, but
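
For concreteness, the following is a minimal Python/NumPy sketch of this mapping, assuming the radial rescaling used by GIMP's whirl-and-pinch plugin with the whirl angle fixed at 0; the function name and the nearest-neighbor sampling are our own simplifications.

\begin{verbatim}
import numpy as np

def pinch(image, amount, radius=None):
    # Square grayscale image as a 2-D array; 'amount' is the pinch strength.
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0   # center point C
    r = radius if radius is not None else min(cx, cy)
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    d = np.sqrt(dy**2 + dx**2) / r          # normalized distance to C
    inside = (d > 0.0) & (d < 1.0)          # pixels within the disk
    factor = np.ones_like(d)
    # Assumed radial rescaling, following the GIMP whirl-and-pinch formula.
    factor[inside] = np.sin(np.pi / 2.0 * d[inside]) ** (-amount)
    # Source pixel: on the line through C and P, at the rescaled distance.
    sy = np.clip(np.round(cy + dy * factor), 0, h - 1).astype(int)
    sx = np.clip(np.round(cx + dx * factor), 0, w - 1).astype(int)
    return image[sy, sx]
\end{verbatim}
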
and {\bf OCR data} (scanned machine printed characters). Once a character
is sampled from one of these sources (chosen randomly), a pipeline of
the above transformations and/or noise processes is applied to the
image.

We compare the best MLP (according to validation set error) that we found against
the best SDA (again according to validation set error), along with a precise estimate
of human performance obtained via Amazon's Mechanical Turk (AMT)
service\footnote{http://mturk.com}.
AMT users are paid small amounts
of money to perform tasks for which human intelligence is required.
Mechanical Turk has been used extensively in natural language processing and vision.
%processing \citep{SnowEtAl2008} and vision
%\citep{SorokinAndForsyth2008,whitehill09}.
AMT users were presented
with 10 character images and asked to type 10 corresponding ASCII
characters. They were forced to make a hard choice among the
62 or 10 character classes (all classes or digits only).
Three users classified each image, allowing
us to estimate inter-human variability.

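One simple way to turn the three answers per image into an error estimate and a variability measure (our illustration; the exact estimator is not spelled out here) is:

\begin{verbatim}
import numpy as np

def human_error(answers, truth):
    # answers: (n_images, 3) array, one column per AMT user;
    # truth: (n_images,) array of reference classes.
    answers = np.asarray(answers)
    truth = np.asarray(truth)
    per_user = (answers != truth[:, None]).mean(axis=0)  # 3 error rates
    # Mean error across users, and its spread as inter-human variability.
    return per_user.mean(), per_user.std()
\end{verbatim}
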
\vspace*{-1mm}
\subsection{Data Sources}
\vspace*{-1mm}

%\begin{itemize}

\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}

The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
hidden layer and with Stacked Denoising Auto-Encoders (SDA).
All hyper-parameters are selected based on performance on the NISTP validation set.

{\bf Multi-Layer Perceptrons (MLP).}
Whereas previous work had compared deep architectures to both shallow MLPs and
SVMs, we only compared to MLPs here because of the very large datasets used
(making the use of SVMs computationally inconvenient because of their quadratic
scaling behavior).
The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
The hyper-parameters are the following: number of hidden units, taken in
$\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training
examples are presented in minibatches of size 20. A constant learning
rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments, and 0.1 was selected.

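The following NumPy sketch mirrors this setup (tanh hidden layer, softmax output, cross-entropy loss trained by constant-rate SGD on minibatches of 20); the weight initialization is our own choice, as it is not specified above.

\begin{verbatim}
import numpy as np

class MLP:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.RandomState(seed)
        bound = np.sqrt(6.0 / (n_in + n_hidden))   # assumed initialization
        self.W1 = rng.uniform(-bound, bound, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = np.zeros((n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        H = np.tanh(X @ self.W1 + self.b1)          # hidden layer
        A = H @ self.W2 + self.b2
        A -= A.max(axis=1, keepdims=True)           # numerical stability
        P = np.exp(A)
        P /= P.sum(axis=1, keepdims=True)           # P(class | image)
        return H, P

    def sgd_step(self, X, y, lr=0.1):
        # One update on a minibatch (X: 20 x n_in, y: 20 integer labels).
        n = X.shape[0]
        H, P = self.forward(X)
        dA = P.copy()
        dA[np.arange(n), y] -= 1.0                  # softmax cross-entropy
        dA /= n
        dH = dA @ self.W2.T * (1.0 - H**2)          # tanh derivative
        self.W2 -= lr * (H.T @ dA); self.b2 -= lr * dA.sum(axis=0)
        self.W1 -= lr * (X.T @ dH); self.b1 -= lr * dH.sum(axis=0)
\end{verbatim}

A model for the 62-class task would be instantiated as, e.g., {\tt MLP(n\_in, 1000, 62)} and updated with {\tt sgd\_step} on successive minibatches.
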
{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006},
enabling better generalization, apparently setting parameters in the
basin of attraction of supervised gradient descent yielding better
taking advantage of the expressive power and bias implicit in the
deep architecture (whereby complex concepts are expressed as
compositions of simpler ones through a deep hierarchy).
Here we chose to use the Denoising
Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
% ADD AN IMAGE?
these deep hierarchies of features, as it is very simple to train and
teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}),
provides immediate and efficient inference, and yielded results
comparable to or better than RBMs in a series of experiments
\citep{VincentPLarochelleH2008}. During training of a Denoising

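As a sketch of one such building block, the following implements the standard masking-noise formulation of \citet{VincentPLarochelleH2008}; the tied weights and the corruption level shown are our simplifications.

\begin{verbatim}
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoEncoder:
    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.RandomState(seed)
        bound = np.sqrt(6.0 / (n_visible + n_hidden))
        self.W = self.rng.uniform(-bound, bound, (n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)

    def sgd_step(self, X, corruption=0.25, lr=0.1):
        # X: minibatch with entries in [0, 1]; corruption level assumed.
        keep = self.rng.binomial(1, 1.0 - corruption, X.shape)
        X_tilde = X * keep                          # zero out a fraction
        H = sigmoid(X_tilde @ self.W + self.b_hid)  # encode
        Z = sigmoid(H @ self.W.T + self.b_vis)      # decode (tied weights)
        n = X.shape[0]
        dZ = (Z - X) / n        # cross-entropy loss against the clean X
        dH = (dZ @ self.W) * H * (1.0 - H)
        self.W -= lr * (X_tilde.T @ dH + dZ.T @ H)
        self.b_vis -= lr * dZ.sum(axis=0)
        self.b_hid -= lr * dH.sum(axis=0)
        return H                # representation fed to the next layer
\end{verbatim}

Layers are stacked greedily: the hidden representation of one trained auto-encoder becomes the input of the next, and the resulting deep MLP is then fine-tuned with supervised gradient descent.
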
%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}

Figure~\ref{fig:error-rates-charts} summarizes the results obtained,
comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1,
SDA2), along with the previous results on the NIST special database
19 digits test set from the literature, respectively based on ARTMAP neural
networks~\citep{Granger+al-2007}, fast nearest-neighbor search
significant way) but reaches human performance on both the 62-class task
and the 10-class (digits) task. In addition, as shown on the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by self-taught learning is greater for the SDA, and these
differences with the MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative change is measured by taking
(original model's error / perturbed-data model's error $-$ 1).
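In displayed form (the error rates in the example are hypothetical, chosen only to make the arithmetic concrete):
\[
\Delta_{\mathrm{rel}} = \frac{\epsilon_{\mathrm{original}}}{\epsilon_{\mathrm{perturbed}}} - 1,
\qquad \mathrm{e.g.} \quad \frac{0.30}{0.20} - 1 = 0.5,
\]
i.e. a 50\% relative improvement.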
The right side of
Figure~\ref{fig:improvements-charts} shows the relative improvement
brought by the use of a multi-task setting, in which the same model is
trained for more classes than the target classes of interest (i.e. training
with all 62 classes when the target classes are respectively the digits,
lower-case, or upper-case characters). Again, whereas the gain from the
setting is similar for the other two target classes (lower-case characters
and upper-case characters).

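One plausible reading of this evaluation (the exact mechanism is not spelled out here, so the following is an assumption of ours) is that the 62-class model's posterior is restricted and renormalized over the target subset at test time:

\begin{verbatim}
import numpy as np

def restrict_to_targets(P_all, target_classes):
    # P_all: (n, 62) posterior from the multi-task model;
    # target_classes: e.g. the indices of the 10 digit classes.
    P = P_all[:, target_classes]
    P = P / P.sum(axis=1, keepdims=True)   # renormalize over the subset
    return np.asarray(target_classes)[P.argmax(axis=1)]
\end{verbatim}
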
\begin{figure}[h]
\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\
\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on 3 different test sets corresponding to the three
datasets.
Right: error rates on NIST test digits only, along with the previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005},
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
\label{fig:error-rates-charts}
\end{figure}
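
For reference, a 95\% interval of the kind shown in the figure can be computed with the usual normal approximation to the binomial (our assumption; the text does not state which interval is used):

\begin{verbatim}
import math

def error_ci95(n_errors, n_test):
    # Normal-approximation 95% confidence interval for an error rate.
    p = n_errors / float(n_test)
    half = 1.96 * math.sqrt(p * (1.0 - p) / n_test)
    return p - half, p + half
\end{verbatim}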

%\vspace*{-1mm}
%\subsection{Perturbed Training Data More Helpful for SDA}
%\vspace*{-1mm}

%\vspace*{-1mm}
%\subsection{Multi-Task Learning Effects}
%\vspace*{-1mm}

\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the conclusions are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
Do the good results previously obtained with deep architectures on the
MNIST digits generalize to the setting of a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset (as far as we know), in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.

$\bullet$ %\item
To what extent does the perturbation of input images (e.g. adding
noise, affine transformations, background images) make the resulting
classifier better not only on similarly perturbed images but also on
the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution}
examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework?
MLPs were helped by perturbed training examples when tested on perturbed input
images (65\% relative improvement on NISTP)
but only marginally helped (5\% relative improvement on all classes)
or even hurt (10\% relative loss on digits)
with respect to clean examples. On the other hand, the deep SDAs
were very significantly boosted by these out-of-distribution examples.

$\bullet$ %\item
Similarly, does the feature learning step in deep learning algorithms benefit more
from training with similar but different classes (i.e. a multi-task learning scenario) than
a corresponding shallow and purely supervised architecture?
Whereas the improvement due to the multi-task setting was marginal or
negative for the MLP (from +5.6\% to -3.6\% relative change),
it was very significant for the SDA (from +13\% to +27\% relative change).
%\end{itemize}

Why would deep learners benefit more from the self-taught learning framework?
The key idea is that the lower layers of the predictor compute a hierarchy
of features that can be shared across tasks or across variants of the
input distribution. Intermediate features that can be used in different
contexts can be estimated in a way that allows them to share statistical
strength. Features extracted through many levels are likely to
be more abstract (as the experiments in~\citet{Goodfellow2009} suggest),
increasing the likelihood that they would be useful for a larger array
of tasks and input conditions.
Therefore, we hypothesize that both depth and unsupervised
pre-training play a part in explaining the advantages observed here, and future
experiments could attempt to tease apart these factors.

A Flash demo of the recognizer (where both the MLP and the SDA can be compared)
can be executed on-line at {\tt http://deep.host22.com}.

\newpage