ift6266: comparison of writeup/nips2010_submission.tex @ 530:8fe77eac344f
Clarifying the experimental setup, typos here and there
author: Dumitru Erhan <dumitru.erhan@gmail.com>
date: Tue, 01 Jun 2010 18:18:01 -0700
parents: 07bc0ca8d246
children: 85f2337d47d2
revision 525:8b7e054d22bd (left) compared with revision 530:8fe77eac344f (right)
1 \documentclass{article} % For LaTeX2e | 1 \documentclass{article} % For LaTeX2e |
2 \usepackage{nips10submit_e,times} | 2 \usepackage{nips10submit_e,times} |
3 | 3 |
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} | 4 \usepackage{amsthm,amsmath,bbm} |
5 \usepackage[psamsfonts]{amssymb} | |
5 \usepackage{algorithm,algorithmic} | 6 \usepackage{algorithm,algorithmic} |
6 \usepackage[utf8]{inputenc} | 7 \usepackage[utf8]{inputenc} |
7 \usepackage{graphicx,subfigure} | 8 \usepackage{graphicx,subfigure} |
8 \usepackage[numbers]{natbib} | 9 \usepackage[numbers]{natbib} |
9 | 10 |
75 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which | 76 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which |
76 performed similarly or better than previously proposed Restricted Boltzmann | 77 performed similarly or better than previously proposed Restricted Boltzmann |
77 Machines in terms of unsupervised extraction of a hierarchy of features | 78 Machines in terms of unsupervised extraction of a hierarchy of features |
78 useful for classification. The principle is that each layer starting from | 79 useful for classification. The principle is that each layer starting from |
79 the bottom is trained to encode its input (the output of the previous | 80 the bottom is trained to encode its input (the output of the previous |
80 layer) and to reconstruct it from a corrupted version of it. After this | 81 layer) and to reconstruct it from a corrupted version. After this |
81 unsupervised initialization, the stack of denoising auto-encoders can be | 82 unsupervised initialization, the stack of denoising auto-encoders can be |
82 converted into a deep supervised feedforward neural network and fine-tuned by | 83 converted into a deep supervised feedforward neural network and fine-tuned by |
83 stochastic gradient descent. | 84 stochastic gradient descent. |
84 | 85 |
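To make the two training phases described above concrete, here is a minimal numpy sketch of greedy layer-wise pretraining with denoising auto-encoders. The masking corruption, squared-error reconstruction, layer sizes, learning rate and epoch counts are illustrative assumptions rather than the settings used in the paper, and the supervised fine-tuning stage is only indicated in a comment.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_denoising_layer(X, n_hidden, corruption=0.25, lr=0.01, epochs=10, seed=0):
    """Train one denoising auto-encoder (tied weights) on X; return its
    parameters and the hidden representation of the uncorrupted input."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    W = rng.uniform(-0.1, 0.1, size=(d, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(d)
    for _ in range(epochs):
        Xc = X * (rng.rand(n, d) > corruption)   # corrupt the input (masking noise)
        H = sigmoid(Xc @ W + b)                  # encode the corrupted input
        Z = sigmoid(H @ W.T + c)                 # reconstruct the clean input
        dZ = (Z - X) * Z * (1 - Z)               # grad of squared error w.r.t. decoder pre-activation
        dH = (dZ @ W) * H * (1 - H)
        W -= lr * (Xc.T @ dH + dZ.T @ H) / n     # tied weights: encoder + decoder contributions
        b -= lr * dH.mean(axis=0)
        c -= lr * dZ.mean(axis=0)
    return (W, b), sigmoid(X @ W + b)

def pretrain_stack(X, layer_sizes=(500, 500, 500)):
    """Greedy layer-wise pretraining: each layer encodes the output of the one below.
    The returned parameters initialize a deep feedforward network (plus a softmax
    output layer) that is then fine-tuned by stochastic gradient descent on the
    labels; the fine-tuning step itself is not shown here."""
    params, H = [], X
    for n_hidden in layer_sizes:
        layer, H = pretrain_denoising_layer(H, n_hidden)
        params.append(layer)
    return params

# toy usage on random binary "images"
X = (np.random.RandomState(1).rand(200, 32 * 32) > 0.5).astype(float)
stack = pretrain_stack(X, layer_sizes=(300, 200))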
85 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 86 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
95 The hypothesis explored here is that a deep hierarchy of features | 96 The hypothesis explored here is that a deep hierarchy of features |
96 may be better able to provide sharing of statistical strength | 97 may be better able to provide sharing of statistical strength |
97 between different regions in input space or different tasks, | 98 between different regions in input space or different tasks, |
98 as discussed in the conclusion. | 99 as discussed in the conclusion. |
99 | 100 |
100 % TODO: why we care to evaluate this relative advantage | |
101 | |
102 In this paper we ask the following questions: | 101 In this paper we ask the following questions: |
103 | 102 |
104 %\begin{enumerate} | 103 %\begin{enumerate} |
105 $\bullet$ %\item | 104 $\bullet$ %\item |
106 Do the good results previously obtained with deep architectures on the | 105 Do the good results previously obtained with deep architectures on the |
125 | 124 |
126 Our experimental results provide positive evidence towards all of these questions. | 125 Our experimental results provide positive evidence towards all of these questions. |
127 | 126 |
128 \vspace*{-1mm} | 127 \vspace*{-1mm} |
129 \section{Perturbation and Transformation of Character Images} | 128 \section{Perturbation and Transformation of Character Images} |
129 \label{s:perturbations} | |
130 \vspace*{-1mm} | 130 \vspace*{-1mm} |
131 | 131 |
132 This section describes the different transformations we used to stochastically | 132 This section describes the different transformations we used to stochastically |
133 transform source images in order to obtain data. More details can | 133 transform source images in order to obtain data. More details can |
134 be found in this technical report~\citep{ift6266-tr-anonymous}. | 134 be found in this technical report~\citep{ift6266-tr-anonymous}. |
140 There are two main parts in the pipeline. The first one, | 140 There are two main parts in the pipeline. The first one, |
141 from slant to pinch below, performs transformations. The second | 141 from slant to pinch below, performs transformations. The second |
142 part, from blur to contrast, adds different kinds of noise. | 142 part, from blur to contrast, adds different kinds of noise. |
143 | 143 |
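As a reading aid, the two-part structure just described can be summarized by the following control-flow sketch. The individual modules are reduced to identity placeholders and only the module names given in the text and the figure caption are used; the real deformation and noise operators are described in the rest of this section.

import numpy as np

def placeholder(image, complexity):
    # a real module would deform or corrupt the image, with a strength
    # controlled by the complexity parameter
    return image

# first part of the pipeline: deformations, applied from slant to pinch
DEFORMATIONS = [("slant", placeholder), ("thickness", placeholder),
                ("affine", placeholder), ("local_elastic", placeholder),
                ("pinch", placeholder)]
# second part: noise processes, applied from blur to contrast
# (the intermediate noise modules are omitted in this sketch)
NOISES = [("blur", placeholder), ("contrast", placeholder)]

def transform(image, complexity, add_noise=True):
    """Run one source image through the pipeline; add_noise=False keeps
    only the deformation part."""
    for _name, module in DEFORMATIONS + (NOISES if add_noise else []):
        image = module(image, complexity)
    return image

example = transform(np.zeros((32, 32)), complexity=0.3)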
144 \begin{figure}[ht] | 144 \begin{figure}[ht] |
145 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}} | 145 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}} |
146 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE | 146 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE |
147 \caption{Illustration of each transformation applied alone to the same image | 147 \caption{Illustration of each transformation applied alone to the same image |
148 of an upper-case h (top left). First row (from left to right): original image, slant, | 148 of an upper-case h (top left). First row (from left to right): original image, slant, |
149 thickness, affine transformation (translation, rotation, shear), | 149 thickness, affine transformation (translation, rotation, shear), |
150 local elastic deformation; second row (from left to right): | 150 local elastic deformation; second row (from left to right): |
352 Whereas much previous work on deep learning algorithms had been performed on | 352 Whereas much previous work on deep learning algorithms had been performed on |
353 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, | 353 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
354 with 60~000 examples, and variants involving 10~000 | 354 with 60~000 examples, and variants involving 10~000 |
355 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want | 355 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want |
356 to focus here on the case of much larger training sets, from 10 times | 356 to focus here on the case of much larger training sets, from 10 times |
357 to 1000 times larger. The larger datasets are obtained by first sampling from | 357 to 1000 times larger. |
358 | |
359 The first step in constructing the larger datasets is to sample from | |
358 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 360 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
359 and {\bf OCR data} (scanned machine printed characters). Once a character | 361 and {\bf OCR data} (scanned machine printed characters). Once a character |
360 is sampled from one of these sources (chosen randomly), a pipeline of | 362 is sampled from one of these sources (chosen randomly), the pipeline of |
361 the above transformations and/or noise processes is applied to the | 363 the transformations and/or noise processes described in section \ref{s:perturbations} |
362 image. | 364 is applied to the image. |
363 | 365 |
364 We compare the best MLP (according to validation set error) that we found against | 366 We compare the best MLP against |
365 the best SDA (again according to validation set error), along with a precise estimate | 367 the best SDA (both models' hyper-parameters are selected to minimize the validation set error), |
368 and compare both against a precise estimate |
366 of human performance obtained via Amazon's Mechanical Turk (AMT) | 369 of human performance obtained via Amazon's Mechanical Turk (AMT) |
367 service\footnote{http://mturk.com}. | 370 service (http://mturk.com). |
368 AMT users are paid small amounts | 371 AMT users are paid small amounts |
369 of money to perform tasks for which human intelligence is required. | 372 of money to perform tasks for which human intelligence is required. |
370 Mechanical Turk has been used extensively in natural language processing and vision. | 373 Mechanical Turk has been used extensively in natural language processing and vision. |
371 %processing \citep{SnowEtAl2008} and vision | 374 %processing \citep{SnowEtAl2008} and vision |
372 %\citep{SorokinAndForsyth2008,whitehill09}. | 375 %\citep{SorokinAndForsyth2008,whitehill09}. |
373 AMT users where presented | 376 AMT users were presented |
374 with 10 character images and asked to type 10 corresponding ASCII | 377 with 10 character images and asked to choose 10 corresponding ASCII |
375 characters. They were forced to make a hard choice among the | 378 characters. They were forced to make a hard choice among the |
376 62 or 10 character classes (all classes or digits only). | 379 62 or 10 character classes (all classes or digits only). |
377 Three users classified each image, allowing us | 380 Three users classified each image, allowing us |
378 to estimate inter-human variability. | 381 to estimate inter-human variability. A total of 2500 images per dataset were classified. |
379 | 382 |
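Since each test image receives three independent human labels, the numbers above can be obtained with an aggregation along the following lines; the exact rule used for the reported human error rates is not spelled out in this excerpt, so majority voting and pairwise disagreement are stated here as illustrative choices.

import numpy as np
from collections import Counter

def human_error_and_variability(labels, truth):
    """labels: (n_images, 3) array with the three AMT answers per image;
    truth: (n_images,) array of true classes (62 or 10 classes)."""
    labels, truth = np.asarray(labels), np.asarray(truth)
    majority = np.array([Counter(row).most_common(1)[0][0] for row in labels])
    error_rate = np.mean(majority != truth)
    # inter-human variability: average fraction of annotator pairs that disagree
    pairs = [(0, 1), (0, 2), (1, 2)]
    disagreement = np.mean([np.mean(labels[:, i] != labels[:, j]) for i, j in pairs])
    return error_rate, disagreement

# toy usage: 5 images, 3 annotators, classes encoded as integers in [0, 61]
labels = [[3, 3, 7], [10, 10, 10], [5, 4, 5], [61, 61, 0], [2, 2, 2]]
truth = [3, 10, 5, 61, 1]
print(human_error_and_variability(labels, truth))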
380 \vspace*{-1mm} | 383 \vspace*{-1mm} |
381 \subsection{Data Sources} | 384 \subsection{Data Sources} |
382 \vspace*{-1mm} | 385 \vspace*{-1mm} |
383 | 386 |
388 widely used for training and testing character | 391 widely used for training and testing character |
389 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. | 392 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. |
390 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand-checked classifications, | 393 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand-checked classifications, |
391 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | 394 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes |
392 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. | 395 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. |
393 The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended | 396 The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended |
394 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 397 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
395 for that purpose. We randomly split the remainder into a training set and a validation set for | 398 for that purpose. We randomly split the remainder into a training set and a validation set for |
396 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, | 399 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, |
397 and 82587 for testing. | 400 and 82587 for testing. |
398 Most previous work reporting performance on that dataset used only the digits. | 401 Most previous work reporting performance on that dataset used only the digits. |
404 more like the natural distribution of letters in text). | 407 more like the natural distribution of letters in text). |
405 | 408 |
406 %\item | 409 %\item |
407 {\bf Fonts.} | 410 {\bf Fonts.} |
408 In order to have a good variety of sources we downloaded a large number of free fonts from: | 411 In order to have a good variety of sources we downloaded a large number of free fonts from: |
409 {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} | 412 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. |
410 % TODO: pointless to anonymize, it's not pointing to our work | 413 % TODO: pointless to anonymize, it's not pointing to our work |
411 Including operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. | 414 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. |
412 The {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | 415 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, |
413 directly as input to our models. | 416 directly as input to our models. |
414 | 417 |
415 %\item | 418 %\item |
416 {\bf Captchas.} | 419 {\bf Captchas.} |
417 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for | 420 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for |
443 %\item | 446 %\item |
444 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. | 447 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. |
445 | 448 |
446 %\item | 449 %\item |
447 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | 450 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources |
448 and sending them through the above transformation pipeline. | 451 and sending them through the transformation pipeline described in section \ref{s:perturbations}. |
449 For each new example to generate, a source is selected with probability $10\%$ from the fonts, | 452 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, |
450 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | 453 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the |
451 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$. | 454 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. |
452 | 455 |
453 %\item | 456 %\item |
454 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion) | 457 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) |
455 except that we only apply | 458 except that we only apply |
456 transformations from slant to pinch. Therefore, the character is | 459 transformations from slant to pinch. Therefore, the character is |
457 transformed but no additional noise is added to the image, giving images | 460 transformed but no additional noise is added to the image, giving images |
458 closer to the NIST dataset. | 461 closer to the NIST dataset. |
459 %\end{itemize} | 462 %\end{itemize} |
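The construction of P07 and NISTP described in the last two items amounts to the sampling policy sketched below. Only the source probabilities, the complexity range and the P07/NISTP distinction come from the text; the source samplers, the pipeline function and the number of modules are placeholders (see also the pipeline sketch earlier in this comparison).

import numpy as np

rng = np.random.RandomState(0)

# data sources and their selection probabilities: 10% fonts, 25% captchas,
# 25% OCR data, 40% NIST
SOURCES = {"fonts": 0.10, "captcha": 0.25, "ocr": 0.25, "nist": 0.40}

def sample_raw_character(source):
    return np.zeros((32, 32))   # placeholder: fetch one character image from the source

def apply_pipeline(image, complexities, include_noise):
    return image                # placeholder: slant..pinch, then blur..contrast if include_noise

def generate_example(dataset="P07", n_modules=10):
    """One perturbed example for P07 (deformations + noise) or NISTP
    (deformations only, giving images closer to NIST)."""
    names, probs = zip(*SOURCES.items())
    source = rng.choice(names, p=probs)
    image = sample_raw_character(source)
    complexities = rng.uniform(0.0, 0.7, size=n_modules)   # one complexity per module, uniform in [0, 0.7]
    return apply_pipeline(image, complexities, include_noise=(dataset == "P07"))

p07_example = generate_example("P07")
nistp_example = generate_example("NISTP")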
462 \subsection{Models and their Hyperparameters} | 465 \subsection{Models and their Hyperparameters} |
463 \vspace*{-1mm} | 466 \vspace*{-1mm} |
464 | 467 |
465 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single | 468 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single |
466 hidden layer and with Stacked Denoising Auto-Encoders (SDA). | 469 hidden layer and with Stacked Denoising Auto-Encoders (SDA). |
467 All hyper-parameters are selected based on performance on the NISTP validation set. | 470 \emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.} |
468 | 471 |
469 {\bf Multi-Layer Perceptrons (MLP).} | 472 {\bf Multi-Layer Perceptrons (MLP).} |
470 Whereas previous work had compared deep architectures to both shallow MLPs and | 473 Whereas previous work had compared deep architectures to both shallow MLPs and |
471 SVMs, we only compared to MLPs here because of the very large datasets used | 474 SVMs, we only compared to MLPs here because of the very large datasets used |
472 (making the use of SVMs computationally inconvenient because of their quadratic | 475 (making the use of SVMs computationally challenging because of their quadratic |
473 scaling behavior). | 476 scaling behavior). |
474 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 477 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
475 exponentials) on the output layer for estimating $P(class | image)$. | 478 exponentials) on the output layer for estimating $P(class | image)$. |
476 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 479 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. |
477 The optimization procedure is as follows: training | 480 Training examples are presented in minibatches of size 20. A constant learning |
478 examples are presented in minibatches of size 20, a constant learning | 481 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ |
479 rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$ | |
480 through preliminary experiments (measuring performance on a validation set), | 482 through preliminary experiments (measuring performance on a validation set), |
481 and $0.1$ was then selected. | 483 and $0.1$ was then selected for optimizing on the whole training sets. |
482 | 484 |
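A minimal numpy version of the MLP just described: one tanh hidden layer, a softmax output estimating P(class | image), minibatches of size 20 and a 0.1 learning rate. The weight initialization, epoch count and toy data are illustrative assumptions, not the exact training setup.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_mlp(X, y, n_hidden=1000, n_classes=62, lr=0.1, batch=20, epochs=5, seed=0):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    W1 = rng.uniform(-0.05, 0.05, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            Xb, Yb = X[idx], Y[idx]
            H = np.tanh(Xb @ W1 + b1)             # hidden layer
            P = softmax(H @ W2 + b2)              # class probabilities
            dlogits = (P - Yb) / len(idx)         # gradient of the mean cross-entropy
            dH = (dlogits @ W2.T) * (1 - H ** 2)  # backprop through tanh
            W2 -= lr * H.T @ dlogits; b2 -= lr * dlogits.sum(axis=0)
            W1 -= lr * Xb.T @ dH;     b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    return softmax(np.tanh(X @ W1 + b1) @ W2 + b2).argmax(axis=1)

# toy usage on random 32x32 "images"
toy_rng = np.random.RandomState(1)
X = toy_rng.rand(200, 32 * 32); y = toy_rng.randint(0, 62, size=200)
params = train_mlp(X, y, n_hidden=300, epochs=2)
print(np.mean(predict(params, X) == y))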
483 \begin{figure}[ht] | 485 \begin{figure}[ht] |
484 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | 486 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} |
485 \caption{Illustration of the computations and training criterion for the denoising | 487 \caption{Illustration of the computations and training criterion for the denoising |
486 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | 488 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of |
505 distribution $P(x)$ and the conditional distribution of interest | 507 distribution $P(x)$ and the conditional distribution of interest |
506 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 508 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
507 taking advantage of the expressive power and bias implicit in the | 509 taking advantage of the expressive power and bias implicit in the |
508 deep architecture (whereby complex concepts are expressed as | 510 deep architecture (whereby complex concepts are expressed as |
509 compositions of simpler ones through a deep hierarchy). | 511 compositions of simpler ones through a deep hierarchy). |
512 | |
510 Here we chose to use the Denoising | 513 Here we chose to use the Denoising |
511 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | 514 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for |
512 these deep hierarchies of features, as it is very simple to train and | 515 these deep hierarchies of features, as it is very simple to train and |
513 teach (see Figure~\ref{fig:da}, as well as | 516 teach (see Figure~\ref{fig:da}, as well as |
514 tutorial and code there: {\tt http://deeplearning.net/tutorial}), | 517 tutorial and code at {\tt http://deeplearning.net/tutorial}), |
515 provides immediate and efficient inference, and yielded results | 518 provides immediate and efficient inference, and yielded results |
516 comparable to or better than RBMs in a series of experiments | 519 comparable to or better than RBMs in a series of experiments |
517 \citep{VincentPLarochelleH2008}. During training, a Denoising | 520 \citep{VincentPLarochelleH2008}. During training, a Denoising |
518 Auto-Encoder is presented with a stochastically corrupted version | 521 Auto-Encoder is presented with a stochastically corrupted version |
519 of the input and trained to reconstruct the uncorrupted input, | 522 of the input and trained to reconstruct the uncorrupted input, |
534 | 537 |
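For reference, the training criterion that Figure~\ref{fig:da} illustrates can be written, in the standard formulation of Vincent et al. (2008) cited above, as follows; the corruption process $q$ and the cross-entropy reconstruction loss shown here are the usual choices for binary-valued inputs and are given as the generic form rather than the exact configuration used in these experiments.

\begin{align*}
\tilde{x} &\sim q(\tilde{x} \mid x) && \text{(stochastic corruption of the input $x$)}\\
h &= s(W\tilde{x} + b) && \text{(hidden code computed from the corrupted input)}\\
\hat{x} &= s(W'h + c) && \text{(reconstruction)}\\
L(x,\hat{x}) &= -\sum_i \big[ x_i \log \hat{x}_i + (1-x_i)\log(1-\hat{x}_i) \big] && \text{(reconstruction cross-entropy, minimized by SGD)}
\end{align*}

where $s$ denotes the elementwise sigmoid.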
535 \vspace*{-1mm} | 538 \vspace*{-1mm} |
536 | 539 |
537 \begin{figure}[ht] | 540 \begin{figure}[ht] |
538 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} | 541 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} |
539 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training | 542 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained |
540 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | 543 on NIST, 1 on NISTP, and 2 on P07. Left: overall results |
541 of all models, on 3 different test sets corresponding to the three | 544 of all models, on 3 different test sets (NIST, NISTP, P07). |
542 datasets. | |
543 Right: error rates on NIST test digits only, along with the previous results from | 545 Right: error rates on NIST test digits only, along with the previous results from |
544 the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 546 the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
545 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | 547 respectively based on ART, nearest neighbors, MLPs, and SVMs.} |
546 | 548 |
547 \label{fig:error-rates-charts} | 549 \label{fig:error-rates-charts} |
691 In the original self-taught learning framework~\citep{RainaR2007}, the | 693 In the original self-taught learning framework~\citep{RainaR2007}, the |
692 out-of-sample examples were used as a source of unsupervised data, and | 694 out-of-sample examples were used as a source of unsupervised data, and |
693 experiments showed its positive effects in a \emph{limited labeled data} | 695 experiments showed its positive effects in a \emph{limited labeled data} |
694 scenario. However, many of the results by \citet{RainaR2007} (who used a | 696 scenario. However, many of the results by \citet{RainaR2007} (who used a |
695 shallow, sparse coding approach) suggest that the relative gain of self-taught | 697 shallow, sparse coding approach) suggest that the relative gain of self-taught |
696 learning diminishes as the number of labeled examples increases, (essentially, | 698 learning diminishes as the number of labeled examples increases (essentially, |
697 a ``diminishing returns'' scenario occurs). We note that, for deep | 699 a ``diminishing returns'' scenario occurs). We note that, for deep |
698 architectures, our experiments show that such a positive effect is accomplished | 700 architectures, our experiments show that such a positive effect is accomplished |
699 even in a scenario with a \emph{very large number of labeled examples}. | 701 even in a scenario with a \emph{very large number of labeled examples}. |
700 | 702 |
701 Why would deep learners benefit more from the self-taught learning framework? | 703 Why would deep learners benefit more from the self-taught learning framework? |