comparison: writeup/nips2010_submission.tex @ 549:ef172f4a322a
commit message: "ca fitte" ("it fits")
author: Yoshua Bengio <bengioy@iro.umontreal.ca>
date: Wed, 02 Jun 2010 13:56:01 -0400
parents: 34cb28249de0
children: 662299f265ab
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
of letters in the test set, compared to the training set (in the latter, the letters are distributed
more like the natural distribution of letters in text).

%\item
{\bf Fonts.}
In order to have a good variety of sources we downloaded a large number of free fonts from:
{\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
% TODO: pointless to anonymize, it's not pointing to our work
Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts from which we can choose uniformly.
The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
directly as input to our models.
\vspace*{-1mm}

%\item
{\bf Captchas.}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the generated data, many different fonts are used to generate the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source.
%Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
\vspace*{-1mm}
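The complexity-scaled sampling of transformation parameters described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual generator code; the transformation names follow the text, but the maximum magnitudes are invented for the example.

```python
import random

def sample_transformation_params(complexity, rng=None):
    # Each transformation's strength scales with the user-supplied
    # complexity in [0, 1]; the magnitude bounds below are illustrative,
    # not the values used by the actual data source.
    rng = rng or random.Random(0)
    assert 0.0 <= complexity <= 1.0
    return {
        "slant": rng.uniform(-0.5, 0.5) * complexity,        # shear factor
        "rotation": rng.uniform(-30.0, 30.0) * complexity,   # degrees
        "translation": (rng.uniform(-3.0, 3.0) * complexity, # pixels
                        rng.uniform(-3.0, 3.0) * complexity),
        "distortion": rng.uniform(0.0, 1.0) * complexity,    # e.g. pinch strength
    }

params = sample_transformation_params(0.7)
```

At complexity $0$ every transformation collapses to the identity, which matches the idea of a single knob controlling how perturbed the generated characters are.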

%\item
{\bf OCR data.}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) was included as an
% ...
\vspace*{-1mm}

All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
%\begin{itemize}
\vspace*{-1mm}
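The 62-class labeling can be made concrete with a minimal sketch. It assumes the conventional ordering of the 10 digits followed by upper- and lower-case letters; the text does not specify the exact index convention, so this ordering is an assumption.

```python
import string

# Assumed index convention (digits, then upper-case, then lower-case letters);
# the paper does not spell out the exact ordering of the 62 classes.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

def label_to_char(label):
    """Map an integer class label in [0, 61] to its character."""
    return CLASSES[label]
```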

%\item
{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
\{651668 / 80000 / 82587\} \{training / validation / test\} examples.
\vspace*{-1mm}

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the transformation pipeline described in section \ref{s:perturbations}.
To generate each new example, a data source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
\vspace*{-1mm}
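The P07 sampling scheme above can be sketched directly: pick a source according to the stated mixture weights, then draw an independent complexity in $[0, 0.7]$ for each transformation. This is an illustrative sketch, not the actual pipeline code.

```python
import random

# Source-mixture weights as stated in the text:
# fonts 10%, captchas 25%, OCR 25%, NIST 40%.
SOURCES = ["fonts", "captcha", "ocr", "nist"]
WEIGHTS = [0.10, 0.25, 0.25, 0.40]

def sample_example_spec(num_transformations, rng):
    """Choose a data source and per-transformation complexities for one example."""
    source = rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
    complexities = [rng.uniform(0.0, 0.7) for _ in range(num_transformations)]
    return source, complexities

rng = random.Random(0)
source, complexities = sample_example_spec(num_transformations=5, rng=rng)
```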

%\item
{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
except that we only apply
transformations from slant to pinch. Therefore, the character is
% ...
The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
Training examples are presented in minibatches of size 20. A constant learning
rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ was then selected for optimizing on the whole training sets.
\vspace*{-1mm}
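The validation-based selection of the learning rate can be sketched as a simple grid search: train with each candidate rate and keep the one with the lowest validation error. The `train_and_validate` callable is a hypothetical stand-in for the actual training run.

```python
# Candidate constant learning rates, as listed in the text.
CANDIDATE_LRS = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def select_learning_rate(train_and_validate, candidates=CANDIDATE_LRS):
    # train_and_validate(lr) -> validation error rate (lower is better);
    # a hypothetical stand-in for training the model with that rate.
    scores = {lr: train_and_validate(lr) for lr in candidates}
    return min(scores, key=scores.get)

# Toy stand-in: pretend validation error is minimized at lr = 0.1.
best = select_learning_rate(lambda lr: abs(lr - 0.1))
```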


{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
% ...
compositions of simpler ones through a deep hierarchy).
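The per-layer denoising auto-encoder computation (corrupt $x$ into $\tilde{x}$, encode $y = f_\theta(\tilde{x})$, decode $z = g_{\theta'}(y)$, score the reconstruction of $x$) can be sketched in NumPy. This is a minimal illustration assuming sigmoid units, tied weights, masking corruption, and a cross-entropy reconstruction loss; the paper's actual architecture and hyperparameters may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_autoencoder_step(x, W, b, c, corruption=0.25):
    # Corrupt: zero out a random fraction of the inputs (masking noise).
    mask = rng.random(x.shape) > corruption
    x_tilde = x * mask
    y = sigmoid(x_tilde @ W + b)        # encoder f_theta
    z = sigmoid(y @ W.T + c)            # decoder g_theta' (tied weights, assumed)
    # Cross-entropy between the clean input x and the reconstruction z.
    eps = 1e-9
    loss = -np.mean(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
    return y, z, loss

x = rng.random((4, 32 * 32))                    # a batch of flattened 32x32 images
W = 0.01 * rng.standard_normal((32 * 32, 500))  # 500 hidden units (illustrative)
b = np.zeros(500)
c = np.zeros(32 * 32)
y, z, loss = denoising_autoencoder_step(x, W, b, c)
```

After pre-training, the encoder weights of each layer would initialize the corresponding layer of the deep MLP.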

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
\vspace*{-2mm}
\caption{Illustration of the computations and training criterion for the denoising
auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
the layer (i.e. raw input or output of previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
% ...
\vspace*{-1mm}

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
\vspace*{-3mm}
\caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on NIST and NISTP test sets.
Right: error rates on NIST test digits only, along with the previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
\label{fig:error-rates-charts}
\vspace*{-2mm}
\end{figure}


\section{Experimental Results}
\vspace*{-2mm}

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}
The models are either trained on NIST (MLP0 and SDA0),
% ...
significant way) but when trained with perturbed data
reaches human performance on both the 62-class task
and the 10-class (digits) task.

\begin{figure}[ht]
\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\vspace*{-3mm}
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
% ...
error rate improvements of 27\%, 15\% and 13\% respectively for digits,
lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
\fi


\vspace*{-2mm}
\section{Conclusions and Discussion}
\vspace*{-2mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the answers are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
{\bf Do the good results previously obtained with deep architectures on the
MNIST digits generalize to a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples}?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
