ift6266: comparison writeup/nips2010_submission.tex @ 505:a41a8925be70
merge

author   | Dumitru Erhan <dumitru.erhan@gmail.com>
date     | Tue, 01 Jun 2010 10:55:08 -0700
parents  | e837ef6eef8c a0e820f04f8e
children | b8e33d3d7f65 860c755ddcff
504:e837ef6eef8c | 505:a41a8925be70 |
199 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | 199 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times |
200 \sqrt[3]{complexity}$.\\ | 200 \sqrt[3]{complexity}$.\\ |
201 {\bf Pinch.} | 201 {\bf Pinch.} |
202 This GIMP filter is named ``Whirl and | 202 This GIMP filter is named ``Whirl and |
203 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic | 203 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic |
204 surface and pressing or pulling on the center of the surface''~\citep{GIMP-manual}. | 204 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). |
205 For a square input image, think of drawing a circle of | 205 For a square input image, think of drawing a circle of |
206 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to | 206 radius $r$ around a center point $C$. Any point (pixel) $P$ belonging to |
207 that disk (region inside circle) will have its value recalculated by taking | 207 that disk (region inside circle) will have its value recalculated by taking |
208 the value of another ``source'' pixel in the original image. The position of | 208 the value of another ``source'' pixel in the original image. The position of |
209 that source pixel is found on the line that goes through $C$ and $P$, but | 209 that source pixel is found on the line that goes through $C$ and $P$, but |
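To make the source-pixel lookup concrete, here is a minimal Python sketch. It assumes a simple power-law radial remapping controlled by a hypothetical {\tt pinch\_amount} parameter; GIMP's actual Whirl-and-pinch filter uses its own remapping formula, so this illustrates the geometry, not the filter's exact code.
\begin{verbatim}
import numpy as np

def pinch(image, pinch_amount=0.5):
    # image: 2-D grayscale array; C is its center, r the inscribed radius.
    # pinch_amount is assumed to lie in (0, 1).
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = min(cy, cx)
    out = image.copy()
    for y in range(h):
        for x in range(w):
            dy, dx = y - cy, x - cx
            d = np.hypot(dy, dx)
            if 0 < d <= r:
                # Source pixel lies on the line through C and P, at distance
                # d * scale from C (assumed power-law remapping; it stays in
                # the disk because d * scale = r * (d/r)**(1 - p) <= r).
                scale = (d / r) ** (-pinch_amount)
                sy = int(round(cy + dy * scale))
                sx = int(round(cx + dx * scale))
                out[y, x] = image[sy, sx]
    return out
\end{verbatim}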
327 and {\bf OCR data} (scanned machine printed characters). Once a character | 327 and {\bf OCR data} (scanned machine printed characters). Once a character |
328 is sampled from one of these sources (chosen randomly), a pipeline of | 328 is sampled from one of these sources (chosen randomly), a pipeline of |
329 the above transformations and/or noise processes is applied to the | 329 the above transformations and/or noise processes is applied to the |
330 image. | 330 image. |
331 | 331 |
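In pseudocode form, the sampling loop described above might read as the sketch below. The source list, the per-step application probability, and the complexity argument are assumptions for illustration; the actual pipeline lives in the repository's data-generation modules.
\begin{verbatim}
import random

def generate_example(sources, transformations, complexity=0.7):
    # Pick a data source at random; each source returns (image, label).
    image, label = random.choice(sources)()
    # Apply the pipeline of transformations and/or noise processes.
    for transform in transformations:
        if random.random() < 0.5:   # stochastic application (assumed)
            image = transform(image, complexity)
    return image, label
\end{verbatim}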
332 We compare the best MLP (according to validation set error) that we found against | |
333 the best SDA (again according to validation set error), along with a precise estimate | |
334 of human performance obtained via Amazon's Mechanical Turk (AMT) | |
335 service\footnote{http://mturk.com}. | |
336 AMT users are paid small amounts | |
337 of money to perform tasks for which human intelligence is required. | |
338 Mechanical Turk has been used extensively in natural language processing and vision. | |
339 %processing \citep{SnowEtAl2008} and vision | |
340 %\citep{SorokinAndForsyth2008,whitehill09}. | |
342 AMT users were presented | |
343 with 10 character images and asked to type 10 corresponding ASCII | |
344 characters. They were forced to make a hard choice among the | |
345 62 or 10 character classes (all classes or digits only). | |
346 Three users classified each image, allowing | |
347 us to estimate inter-human variability. | |
348 | |
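Since each image is classified by three AMT users and the result figures below report 95\% confidence intervals, the human error estimates can be sketched as follows; the normal-approximation binomial interval is an assumption, as the text does not state which interval was used.
\begin{verbatim}
import math

def error_rate_ci(n_errors, n_trials, z=1.96):
    # ~95% normal-approximation confidence interval for an error rate
    # (assumed interval; the paper does not specify its exact method).
    p = n_errors / float(n_trials)
    half = z * math.sqrt(p * (1.0 - p) / n_trials)
    return p - half, p + half

low, high = error_rate_ci(180, 1000)   # e.g. 180 errors on 1000 characters
print("error rate in [%.3f, %.3f] at ~95%% confidence" % (low, high))
\end{verbatim}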
332 \vspace*{-1mm} | 349 \vspace*{-1mm} |
333 \subsection{Data Sources} | 350 \subsection{Data Sources} |
334 \vspace*{-1mm} | 351 \vspace*{-1mm} |
335 | 352 |
336 %\begin{itemize} | 353 %\begin{itemize} |
410 | 427 |
411 \vspace*{-1mm} | 428 \vspace*{-1mm} |
412 \subsection{Models and their Hyperparameters} | 429 \subsection{Models and their Hyperparameters} |
413 \vspace*{-1mm} | 430 \vspace*{-1mm} |
414 | 431 |
432 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single | |
433 hidden layer and with Stacked Denoising Auto-Encoders (SDA). | |
415 All hyper-parameters are selected based on performance on the NISTP validation set. | 434 All hyper-parameters are selected based on performance on the NISTP validation set. |
416 | 435 |
417 {\bf Multi-Layer Perceptrons (MLP).} | 436 {\bf Multi-Layer Perceptrons (MLP).} |
418 Whereas previous work had compared deep architectures to both shallow MLPs and | 437 Whereas previous work had compared deep architectures to both shallow MLPs and |
419 SVMs, we only compared to MLPs here because of the very large datasets used. | 438 SVMs, we only compared to MLPs here because of the very large datasets used |
439 (making the use of SVMs computationally inconvenient, since their training | |
440 time scales quadratically or worse with the number of examples). | |
420 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 441 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
421 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. | 442 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. |
422 The hyper-parameters are the following: number of hidden units, taken in | 443 The hyper-parameters are the following: number of hidden units, taken in |
423 $\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training | 444 $\{300,500,800,1000,1500\}$. The optimization procedure is as follows. Training |
424 examples are presented in minibatches of size 20. A constant learning | 445 examples are presented in minibatches of size 20. A constant learning |
425 rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$ | 446 rate is chosen in $\{10^{-3}, 0.01, 0.025, 0.075, 0.1, 0.5\}$ |
426 through preliminary experiments, and 0.1 was selected. | 447 through preliminary experiments, and 0.1 was selected. |
427 | 448 |
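The MLP just described can be summarized in a short numpy sketch: a single $\tanh$ hidden layer, a softmax output layer estimating $P(\mathrm{class} \mid \mathrm{image})$, minibatches of size 20, and a constant learning rate of 0.1. The experiments themselves used the project's Theano code; the sizes below are illustrative.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32 * 32, 500, 62  # illustrative sizes
lr, batch_size = 0.1, 20                      # values from the text

W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(x, y):
    # Forward pass: tanh hidden layer, softmax output = P(class | image).
    h = np.tanh(x.dot(W1) + b1)
    p = softmax(h.dot(W2) + b2)
    # Backward pass: gradient of the mean negative log-likelihood.
    d_out = (p - np.eye(n_classes)[y]) / len(y)
    d_h = d_out.dot(W2.T) * (1.0 - h ** 2)     # tanh'(a) = 1 - tanh(a)^2
    W2 -= lr * h.T.dot(d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * x.T.dot(d_h);   b1 -= lr * d_h.sum(axis=0)

x = rng.rand(batch_size, n_in)                 # one random minibatch
y = rng.randint(0, n_classes, batch_size)
sgd_step(x, y)
\end{verbatim}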
428 {\bf Stacked Denoising Auto-Encoders (SDAE).} | 449 {\bf Stacked Denoising Auto-Encoders (SDA).} |
429 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) | 450 Various auto-encoder variants and Restricted Boltzmann Machines (RBMs) |
430 can be used to initialize the weights of each layer of a deep MLP (with many hidden | 451 can be used to initialize the weights of each layer of a deep MLP (with many hidden |
431 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006} | 452 layers)~\citep{Hinton06,ranzato-07,Bengio-nips-2006} |
432 enabling better generalization; apparently, such initialization sets the parameters in the | 453 enabling better generalization; apparently, such initialization sets the parameters in the |
433 basin of attraction of supervised gradient descent yielding better | 454 basin of attraction of supervised gradient descent yielding better |
439 taking advantage of the expressive power and bias implicit in the | 460 taking advantage of the expressive power and bias implicit in the |
440 deep architecture (whereby complex concepts are expressed as | 461 deep architecture (whereby complex concepts are expressed as |
441 compositions of simpler ones through a deep hierarchy). | 462 compositions of simpler ones through a deep hierarchy). |
442 Here we chose to use the Denoising | 463 Here we chose to use the Denoising |
443 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | 464 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for |
465 % ADD AN IMAGE? | |
444 these deep hierarchies of features, as it is very simple to train and | 466 these deep hierarchies of features, as it is very simple to train and |
445 teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}), | 467 teach (see the tutorial and code at {\tt http://deeplearning.net/tutorial}), |
446 provides immediate and efficient inference, and yielded results | 468 provides immediate and efficient inference, and yielded results |
447 comparable to or better than RBMs in a series of experiments | 469 comparable to or better than RBMs in a series of experiments |
448 \citep{VincentPLarochelleH2008}. During training of a Denoising | 470 \citep{VincentPLarochelleH2008}. During training of a Denoising |
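As a complement to the cited tutorial, a minimal sketch of one denoising auto-encoder layer is given below: the input is corrupted by randomly masking pixels, encoded and decoded through sigmoids with tied weights, and trained by a gradient step on the cross-entropy between the reconstruction and the uncorrupted input. The corruption level and sizes are assumptions for illustration.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(1)
n_visible, n_hidden, lr = 32 * 32, 500, 0.1   # illustrative sizes

W = rng.uniform(-0.05, 0.05, (n_visible, n_hidden))  # tied enc/dec weights
b_h = np.zeros(n_hidden); b_v = np.zeros(n_visible)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, corruption=0.25):
    # Corrupt: zero out a random fraction of the inputs (masking noise).
    x_tilde = x * rng.binomial(1, 1.0 - corruption, x.shape)
    h = sigmoid(x_tilde.dot(W) + b_h)    # code (hidden representation)
    z = sigmoid(h.dot(W.T) + b_v)        # reconstruction of the clean x
    # Gradient of the mean cross-entropy between z and the clean input.
    d_z = (z - x) / len(x)
    d_h = d_z.dot(W) * h * (1.0 - h)
    W -= lr * (x_tilde.T.dot(d_h) + d_z.T.dot(h))  # encoder + decoder paths
    b_v -= lr * d_z.sum(axis=0); b_h -= lr * d_h.sum(axis=0)

x = rng.rand(20, n_visible)              # one minibatch of inputs in [0, 1]
dae_step(x)
\end{verbatim}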
468 | 490 |
469 %\vspace*{-1mm} | 491 %\vspace*{-1mm} |
470 %\subsection{SDA vs MLP vs Humans} | 492 %\subsection{SDA vs MLP vs Humans} |
471 %\vspace*{-1mm} | 493 %\vspace*{-1mm} |
472 | 494 |
473 We compare the best MLP (according to validation set error) that we found against | |
474 the best SDA (again according to validation set error), along with a precise estimate | |
475 of human performance obtained via Amazon's Mechanical Turk (AMT) | |
476 service\footnote{http://mturk.com}. | |
477 %AMT users are paid small amounts | |
478 %of money to perform tasks for which human intelligence is required. | |
479 %Mechanical Turk has been used extensively in natural language | |
480 %processing \citep{SnowEtAl2008} and vision | |
481 %\citep{SorokinAndForsyth2008,whitehill09}. | |
482 AMT users were presented | |
483 with 10 character images and asked to type 10 corresponding ASCII | |
484 characters. They were forced to make a hard choice among the | |
485 62 or 10 character classes (all classes or digits only). | |
486 Three users classified each image, allowing | |
487 us to estimate inter-human variability (shown as $\pm$ in parentheses below). | |
488 | |
489 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, | 495 Figure~\ref{fig:error-rates-charts} summarizes the results obtained, |
490 comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1, | 496 comparing Humans, three MLPs (MLP0, MLP1, MLP2) and three SDAs (SDA0, SDA1, |
491 SDA2), along with the previous results on the digits NIST special database | 497 SDA2), along with the previous results on the digits NIST special database |
492 19 test set from the literature respectively based on ARTMAP neural | 498 19 test set from the literature respectively based on ARTMAP neural |
493 networks~\citep{Granger+al-2007}, fast nearest-neighbor search | 499 networks~\citep{Granger+al-2007}, fast nearest-neighbor search |
501 significant way) but reaches human performance on both the 62-class task | 507 significant way) but reaches human performance on both the 62-class task |
502 and the 10-class (digits) task. In addition, as shown in the left of | 508 and the 10-class (digits) task. In addition, as shown in the left of |
503 Figure~\ref{fig:improvements-charts}, the relative improvement in error | 509 Figure~\ref{fig:improvements-charts}, the relative improvement in error |
504 rate brought by self-taught learning is greater for the SDA, and these | 510 rate brought by self-taught learning is greater for the SDA, and these |
505 differences with the MLP are statistically and qualitatively | 511 differences with the MLP are statistically and qualitatively |
506 significant. The left side of the figure shows the improvement to the clean | 512 significant. |
513 The left side of the figure shows the improvement to the clean | |
507 NIST test set error brought by the use of out-of-distribution examples | 514 NIST test set error brought by the use of out-of-distribution examples |
508 (i.e. the perturbed examples from NISTP or P07). The right side of | 515 (i.e. the perturbed examples from NISTP or P07). |
516 Relative change is measured as (original model's error / perturbed-data model's | |
517 error) $-$ 1; for example, an error going from 5\% to 4\% is a $5/4 - 1 = 25\%$ relative improvement. | |
518 The right side of | |
509 Figure~\ref{fig:improvements-charts} shows the relative improvement | 519 Figure~\ref{fig:improvements-charts} shows the relative improvement |
510 brought by the use of a multi-task setting, in which the same model is | 520 brought by the use of a multi-task setting, in which the same model is |
511 trained for more classes than the target classes of interest (i.e. training | 521 trained for more classes than the target classes of interest (i.e. training |
512 with all 62 classes when the target classes are respectively the digits, | 522 with all 62 classes when the target classes are respectively the digits, |
513 lower-case, or upper-case characters). Again, whereas the gain from the | 523 lower-case, or upper-case characters). Again, whereas the gain from the |
525 setting is similar for the other two target classes (lower-case characters | 535 setting is similar for the other two target classes (lower-case characters |
526 and upper-case characters). | 536 and upper-case characters). |
527 | 537 |
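One plausible way to score a model trained on all 62 classes against a narrower target (the mechanics are not spelled out in the text, so this is an assumption): restrict the softmax outputs to the target classes and take the argmax there.
\begin{verbatim}
import numpy as np

def subset_error_rate(probs, labels, subset):
    # probs: (n, 62) class probabilities from the 62-way model;
    # labels: true class indices; subset: target class indices
    # (e.g. the 10 digits). Predictions are restricted to the subset.
    subset = np.asarray(subset)
    pred = subset[np.argmax(probs[:, subset], axis=1)]
    return float(np.mean(pred != labels))
\end{verbatim}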
528 \begin{figure}[h] | 538 \begin{figure}[h] |
529 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ | 539 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ |
530 \caption{Charts corresponding to table 1 of Appendix I. Left: overall results; error bars indicate a 95\% confidence interval. Right: error rates on NIST test digits only, with results from literature. } | 540 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training |
541 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | |
542 of all models, on the three test sets corresponding to the three | |
543 datasets. | |
544 Right: error rates on NIST test digits only, along with the previous results from the | |
545 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002,Milgram+al-2005} | |
546 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | |
547 | |
531 \label{fig:error-rates-charts} | 548 \label{fig:error-rates-charts} |
532 \end{figure} | 549 \end{figure} |
533 | 550 |
534 %\vspace*{-1mm} | 551 %\vspace*{-1mm} |
535 %\subsection{Perturbed Training Data More Helpful for SDAE} | 552 %\subsection{Perturbed Training Data More Helpful for SDA} |
536 %\vspace*{-1mm} | 553 %\vspace*{-1mm} |
537 | 554 |
538 %\vspace*{-1mm} | 555 %\vspace*{-1mm} |
539 %\subsection{Multi-Task Learning Effects} | 556 %\subsection{Multi-Task Learning Effects} |
540 %\vspace*{-1mm} | 557 %\vspace*{-1mm} |
573 | 590 |
574 \vspace*{-1mm} | 591 \vspace*{-1mm} |
575 \section{Conclusions} | 592 \section{Conclusions} |
576 \vspace*{-1mm} | 593 \vspace*{-1mm} |
577 | 594 |
578 The conclusions are positive for all the questions asked in the introduction. | 595 We have found that the self-taught learning framework is more beneficial |
596 to a deep learner than to a traditional shallow and purely | |
597 supervised learner. More precisely, | |
598 the conclusions are positive for all the questions asked in the introduction. | |
579 %\begin{itemize} | 599 %\begin{itemize} |
580 | 600 |
581 $\bullet$ %\item | 601 $\bullet$ %\item |
582 Do the good results previously obtained with deep architectures on the | 602 Do the good results previously obtained with deep architectures on the |
583 MNIST digits generalize to the setting of a much larger and richer (but similar) | 603 MNIST digits generalize to the setting of a much larger and richer (but similar) |
584 dataset, the NIST special database 19, with 62 classes and around 800k examples? | 604 dataset, the NIST special database 19, with 62 classes and around 800k examples? |
585 Yes, the SDA systematically outperformed the MLP and all the previously | 605 Yes, the SDA {\bf systematically outperformed the MLP and all the previously |
586 published results on this dataset (as far as we know), in fact reaching human-level | 606 published results on this dataset (as far as we know), in fact reaching human-level |
587 performance. | 607 performance} at around 17\% error on the 62-class task and 1.4\% on the digits. |
588 | 608 |
589 $\bullet$ %\item | 609 $\bullet$ %\item |
590 To what extent does the perturbation of input images (e.g. adding | 610 To what extent does the perturbation of input images (e.g. adding |
591 noise, affine transformations, background images) make the resulting | 611 noise, affine transformations, background images) make the resulting |
592 classifier better not only on similarly perturbed images but also on | 612 classifier better not only on similarly perturbed images but also on |
593 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} | 613 the {\em original clean examples}? Do deep architectures benefit more from such {\em out-of-distribution} |
594 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? | 614 examples, i.e. do they benefit more from the self-taught learning~\citep{RainaR2007} framework? |
595 MLPs were helped by perturbed training examples when tested on perturbed input images, | 615 MLPs were helped by perturbed training examples when tested on perturbed input |
596 but only marginally helped with respect to clean examples. On the other hand, the deep SDAs | 616 images (65\% relative improvement on NISTP) |
617 but only marginally helped (5\% relative improvement on all classes) | |
618 or even hurt (10\% relative loss on digits) | |
619 with respect to clean examples . On the other hand, the deep SDAs | |
597 were very significantly boosted by these out-of-distribution examples. | 620 were very significantly boosted by these out-of-distribution examples. |
598 | 621 |
599 $\bullet$ %\item | 622 $\bullet$ %\item |
600 Similarly, does the feature learning step in deep learning algorithms benefit more | 623 Similarly, does the feature learning step in deep learning algorithms benefit more |
601 from training with similar but different classes (i.e. a multi-task learning scenario) than | 624 from training with similar but different classes (i.e. a multi-task learning scenario) than |
602 a corresponding shallow and purely supervised architecture? | 625 a corresponding shallow and purely supervised architecture? |
603 Whereas the improvement due to the multi-task setting was marginal or | 626 Whereas the improvement due to the multi-task setting was marginal or |
604 negative for the MLP, it was very significant for the SDA. | 627 negative for the MLP (from +5.6\% to -3.6\% relative change), |
628 it was very significant for the SDA (from +13\% to +27\% relative change). | |
605 %\end{itemize} | 629 %\end{itemize} |
630 | |
631 Why would deep learners benefit more from the self-taught learning framework? | |
632 The key idea is that the lower layers of the predictor compute a hierarchy | |
633 of features that can be shared across tasks or across variants of the | |
634 input distribution. Intermediate features that can be used in different | |
635 contexts can be estimated in a way that allows sharing statistical | |
636 strength. Features extracted through many levels are more likely to | |
637 be more abstract (as the experiments in~\citet{Goodfellow2009} suggest), | |
638 increasing the likelihood that they would be useful for a larger array | |
639 of tasks and input conditions. | |
640 Therefore, we hypothesize that both depth and unsupervised | |
641 pre-training play a part in explaining the advantages observed here, and future | |
642 experiments could attempt to tease apart these factors. | |
606 | 643 |
607 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) | 644 A Flash demo of the recognizer (where both the MLP and the SDA can be compared) |
608 can be executed on-line at {\tt http://deep.host22.com}. | 645 can be executed on-line at {\tt http://deep.host22.com}. |
609 | 646 |
610 \newpage | 647 \newpage |