comparison writeup/nips2010_submission.tex @ 530:8fe77eac344f

Clarifying the experimental setup, typos here and there
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Tue, 01 Jun 2010 18:18:01 -0700
parents 07bc0ca8d246
children 85f2337d47d2
525:8b7e054d22bd 530:8fe77eac344f
1 \documentclass{article} % For LaTeX2e 1 \documentclass{article} % For LaTeX2e
2 \usepackage{nips10submit_e,times} 2 \usepackage{nips10submit_e,times}
3 3
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} 4 \usepackage{amsthm,amsmath,bbm}
5 \usepackage[psamsfonts]{amssymb}
5 \usepackage{algorithm,algorithmic} 6 \usepackage{algorithm,algorithmic}
6 \usepackage[utf8]{inputenc} 7 \usepackage[utf8]{inputenc}
7 \usepackage{graphicx,subfigure} 8 \usepackage{graphicx,subfigure}
8 \usepackage[numbers]{natbib} 9 \usepackage[numbers]{natbib}
9 10
75 Auto-Encoder~(SDA)~\citep{VincentPLarochelleH2008-very-small}, which 76 Auto-Encoder~(SDA)~\citep{VincentPLarochelleH2008-very-small}, which
76 performed similarly or better than previously proposed Restricted Boltzmann 77 performed similarly or better than previously proposed Restricted Boltzmann
77 Machines in terms of unsupervised extraction of a hierarchy of features 78 Machines in terms of unsupervised extraction of a hierarchy of features
78 useful for classification. The principle is that each layer starting from 79 useful for classification. The principle is that each layer starting from
79 the bottom is trained to encode its input (the output of the previous 80 the bottom is trained to encode its input (the output of the previous
80 layer) and to reconstruct it from a corrupted version of it. After this 81 layer) and to reconstruct it from a corrupted version. After this
81 unsupervised initialization, the stack of denoising auto-encoders can be 82 unsupervised initialization, the stack of denoising auto-encoders can be
82 converted into a deep supervised feedforward neural network and fine-tuned by 83 converted into a deep supervised feedforward neural network and fine-tuned by
83 stochastic gradient descent. 84 stochastic gradient descent.
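To make the layer-wise procedure concrete, here is a minimal NumPy sketch of one denoising auto-encoder layer trained by SGD; the masking noise, tied weights, squared-error loss, learning rate and initialization are our assumptions, and the actual implementation (which follows the Theano tutorial cited later in the paper) differs in detail.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoder:
    """One layer of the stack; a sketch with tied weights and masking noise."""
    def __init__(self, n_visible, n_hidden, corruption=0.25, lr=0.1, seed=0):
        self.rng = np.random.RandomState(seed)
        self.W = self.rng.uniform(-0.1, 0.1, size=(n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.corruption, self.lr = corruption, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_hid)

    def train_step(self, x):
        # Corrupt the input by zeroing a random subset of pixels, then
        # reconstruct the *uncorrupted* input from the corrupted version.
        x_tilde = x * self.rng.binomial(1, 1.0 - self.corruption, size=x.shape)
        h = self.encode(x_tilde)
        x_rec = sigmoid(h @ self.W.T + self.b_vis)
        # One SGD step on the squared reconstruction error.
        delta_vis = (x_rec - x) * x_rec * (1.0 - x_rec)
        delta_hid = (delta_vis @ self.W) * h * (1.0 - h)
        self.W -= self.lr * (np.outer(delta_vis, h) + np.outer(x_tilde, delta_hid))
        self.b_vis -= self.lr * delta_vis
        self.b_hid -= self.lr * delta_hid

# Greedy layer-wise pre-training: train layer 1 on raw inputs, then train
# layer 2 on the (uncorrupted) codes of layer 1, and so on; finally unroll
# the stack into a feedforward net with a softmax output and fine-tune by SGD.
rng = np.random.RandomState(1)
data = rng.rand(1000, 32 * 32)          # stand-in for 32x32 character images
layer1 = DenoisingAutoencoder(32 * 32, 500)
for x in data:
    layer1.train_step(x)
layer2 = DenoisingAutoencoder(500, 500)
for x in data:
    layer2.train_step(layer1.encode(x))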
84 85
85 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles 86 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles
95 The hypothesis explored here is that a deep hierarchy of features 96 The hypothesis explored here is that a deep hierarchy of features
96 may be better able to provide sharing of statistical strength 97 may be better able to provide sharing of statistical strength
97 between different regions in input space or different tasks, 98 between different regions in input space or different tasks,
98 as discussed in the conclusion. 99 as discussed in the conclusion.
99 100
100 % TODO: why we care to evaluate this relative advantage
101
102 In this paper we ask the following questions: 101 In this paper we ask the following questions:
103 102
104 %\begin{enumerate} 103 %\begin{enumerate}
105 $\bullet$ %\item 104 $\bullet$ %\item
106 Do the good results previously obtained with deep architectures on the 105 Do the good results previously obtained with deep architectures on the
125 124
126 Our experimental results provide positive evidence towards all of these questions. 125 Our experimental results provide positive evidence towards all of these questions.
127 126
128 \vspace*{-1mm} 127 \vspace*{-1mm}
129 \section{Perturbation and Transformation of Character Images} 128 \section{Perturbation and Transformation of Character Images}
129 \label{s:perturbations}
130 \vspace*{-1mm} 130 \vspace*{-1mm}
131 131
132 This section describes the different transformations we used to stochastically 132 This section describes the different transformations we used to stochastically
133 transform source images in order to obtain data. More details can 133 transform source images in order to obtain data. More details can
134 be found in this technical report~\citep{ift6266-tr-anonymous}. 134 be found in this technical report~\citep{ift6266-tr-anonymous}.
140 There are two main parts in the pipeline. The first one, 140 There are two main parts in the pipeline. The first one,
141 from slant to pinch below, performs transformations. The second 141 from slant to pinch below, performs transformations. The second
142 part, from blur to contrast, adds different kinds of noise. 142 part, from blur to contrast, adds different kinds of noise.
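The two-part structure of the pipeline can be sketched as follows. The operators below are deliberately simplified stand-ins (a crude shear, additive Gaussian noise and a contrast change) for the actual slant-to-pinch and blur-to-contrast modules described in the technical report, and the 32x32 grey-scale format and complexity parameterization are assumptions.

import numpy as np

def slant(img, complexity, rng):
    # Crude stand-in for the slant operator: shear rows horizontally.
    out = np.empty_like(img)
    max_shift = 0.3 * complexity * img.shape[1]
    for i, row in enumerate(img):
        out[i] = np.roll(row, int(round(max_shift * (i / img.shape[0] - 0.5))))
    return out

def gaussian_noise(img, complexity, rng):
    return np.clip(img + rng.normal(0.0, 0.3 * complexity, img.shape), 0.0, 1.0)

def contrast(img, complexity, rng):
    # Reduce contrast towards mid-grey as complexity grows (rng unused here).
    return 0.5 + (1.0 - 0.8 * complexity) * (img - 0.5)

TRANSFORMATIONS = [slant]            # first part: shape-changing operators
NOISES = [gaussian_noise, contrast]  # second part: noise operators

def perturb(img, complexity, rng):
    for op in TRANSFORMATIONS + NOISES:
        img = op(img, complexity, rng)
    return img

rng = np.random.RandomState(0)
original = rng.rand(32, 32)          # stand-in for a 32x32 source character image
perturbed = perturb(original, complexity=0.7, rng=rng)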
143 143
144 \begin{figure}[ht] 144 \begin{figure}[ht]
145 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}} 145 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}}
146 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE 146 % TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE
147 \caption{Illustration of each transformation applied alone to the same image 147 \caption{Illustration of each transformation applied alone to the same image
148 of an upper-case h (top left). First row (from left to right): original image, slant, 148 of an upper-case h (top left). First row (from left to right): original image, slant,
149 thickness, affine transformation (translation, rotation, shear), 149 thickness, affine transformation (translation, rotation, shear),
150 local elastic deformation; second row (from left to right): 150 local elastic deformation; second row (from left to right):
352 Whereas much previous work on deep learning algorithms had been performed on 352 Whereas much previous work on deep learning algorithms had been performed on
353 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, 353 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009},
354 with 60~000 examples, and variants involving 10~000 354 with 60~000 examples, and variants involving 10~000
355 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want 355 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want
356 to focus here on the case of much larger training sets, from 10 times to 356 to focus here on the case of much larger training sets, from 10 times to
357 1000 times larger. The larger datasets are obtained by first sampling from 357 1000 times larger.
358
359 The first step in constructing the larger datasets is to sample from
358 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, 360 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas},
359 and {\bf OCR data} (scanned machine printed characters). Once a character 361 and {\bf OCR data} (scanned machine printed characters). Once a character
360 is sampled from one of these sources (chosen randomly), a pipeline of 362 is sampled from one of these sources (chosen randomly), the pipeline of
361 the above transformations and/or noise processes is applied to the 363 the transformations and/or noise processes described in section \ref{s:perturbations}
362 image. 364 is applied to the image.
363 365
364 We compare the best MLP (according to validation set error) that we found against 366 We compare the best MLP against
365 the best SDA (again according to validation set error), along with a precise estimate 367 the best SDA (both models' hyper-parameters are selected to minimize the validation set error),
368 as well as against a precise estimate
366 of human performance obtained via Amazon's Mechanical Turk (AMT) 369 of human performance obtained via Amazon's Mechanical Turk (AMT)
367 service\footnote{http://mturk.com}. 370 service (http://mturk.com).
368 AMT users are paid small amounts 371 AMT users are paid small amounts
369 of money to perform tasks for which human intelligence is required. 372 of money to perform tasks for which human intelligence is required.
370 Mechanical Turk has been used extensively in natural language processing and vision. 373 Mechanical Turk has been used extensively in natural language processing and vision.
371 %processing \citep{SnowEtAl2008} and vision 374 %processing \citep{SnowEtAl2008} and vision
372 %\citep{SorokinAndForsyth2008,whitehill09}. 375 %\citep{SorokinAndForsyth2008,whitehill09}.
373 AMT users where presented 376 AMT users were presented
374 with 10 character images and asked to type 10 corresponding ASCII 377 with 10 character images and asked to choose 10 corresponding ASCII
375 characters. They were forced to make a hard choice among the 378 characters. They were forced to make a hard choice among the
376 62 or 10 character classes (all classes or digits only). 379 62 or 10 character classes (all classes or digits only).
377 Three users classified each image, allowing us 380 Three users classified each image, allowing us
378 to estimate inter-human variability. 381 to estimate inter-human variability. A total of 2500 images per dataset were classified.
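As an illustration, the human error estimate and the inter-human disagreement could be computed from such three-way answers as in the following sketch; the data layout, variable names and example labels are hypothetical.

from collections import Counter

# Hypothetical layout: for each image, its true label and the three AMT answers.
truth   = ['A', 'b', '3', 'Z']
answers = [['A', 'A', 'R'], ['b', 'b', 'b'], ['8', '3', '3'], ['Z', '2', 'Z']]

# Human error rate, counting each of the three answers per image separately.
n_wrong = sum(a != t for t, votes in zip(truth, answers) for a in votes)
human_error = n_wrong / (3 * len(truth))

# Inter-human variability: fraction of images on which the three users disagree.
disagreement = sum(len(Counter(votes)) > 1 for votes in answers) / len(answers)

print(human_error, disagreement)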
379 382
380 \vspace*{-1mm} 383 \vspace*{-1mm}
381 \subsection{Data Sources} 384 \subsection{Data Sources}
382 \vspace*{-1mm} 385 \vspace*{-1mm}
383 386
388 widely used for training and testing character 391 widely used for training and testing character
389 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. 392 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
390 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand-checked classifications, 393 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand-checked classifications,
391 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 394 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
392 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. 395 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
393 The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended 396 The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended
394 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 397 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
395 for that purpose. We randomly split the remainder into a training set and a validation set for 398 for that purpose. We randomly split the remainder into a training set and a validation set for
396 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, 399 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation,
397 and 82587 for testing. 400 and 82587 for testing.
398 Most of the performances previously reported on this dataset use only the digits. 401 Most of the performances previously reported on this dataset use only the digits.
404 more like the natural distribution of letters in text). 407 more like the natural distribution of letters in text).
405 408
406 %\item 409 %\item
407 {\bf Fonts.} 410 {\bf Fonts.}
408 In order to have a good variety of sources, we downloaded a large number of free fonts from: 411 In order to have a good variety of sources, we downloaded a large number of free fonts from:
409 {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} 412 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
410 % TODO: pointless to anonymize, it's not pointing to our work 413 % TODO: pointless to anonymize, it's not pointing to our work
411 Including operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. 414 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from.
412 The {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, 415 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
413 directly as input to our models. 416 directly as input to our models.
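A sketch of how a downloaded {\tt ttf} file might be turned into a character image using PIL; the font path, image size and glyph positioning below are placeholders, and the authors' actual rendering code may differ.

import random
from PIL import Image, ImageDraw, ImageFont

def render_char(ttf_path, char, size=32):
    # Render one character from the given font into a grey-scale image.
    font = ImageFont.truetype(ttf_path, size=size - 4)
    img = Image.new('L', (size, size), color=255)   # white background
    ImageDraw.Draw(img).text((2, 2), char, fill=0, font=font)
    return img

# Hypothetical usage: pick a font uniformly among the downloaded collection.
fonts = ['/path/to/fonts/SomeFont.ttf']             # placeholder for the 9817 fonts
image = render_char(random.choice(fonts), 'h')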
414 417
415 %\item 418 %\item
416 {\bf Captchas.} 419 {\bf Captchas.}
417 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for 420 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
443 %\item 446 %\item
444 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. 447 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}.
445 448
446 %\item 449 %\item
447 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources 450 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
448 and sending them through the above transformation pipeline. 451 and sending them through the transformation pipeline described in section \ref{s:perturbations}.
449 For each new example to generate, a source is selected with probability $10\%$ from the fonts, 452 For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
450 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the 453 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
451 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$. 454 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
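The generation of a P07 example can be summarized in the sketch below; the source samplers are placeholders standing in for the Fonts, Captcha, OCR and NIST readers, and the single stand-in operator only illustrates the per-operation complexity sampling.

import numpy as np

rng = np.random.RandomState(0)

# Placeholder samplers standing in for the four real data sources.
def sample_fonts():    return rng.rand(32, 32)
def sample_captchas(): return rng.rand(32, 32)
def sample_ocr():      return rng.rand(32, 32)
def sample_nist():     return rng.rand(32, 32)

SOURCES = [sample_fonts, sample_captchas, sample_ocr, sample_nist]
PROBS   = [0.10, 0.25, 0.25, 0.40]   # fonts, captchas, OCR, NIST

def generate_p07_example(pipeline, max_complexity=0.7):
    # Pick a data source according to the mixture above, then apply every
    # operator in the fixed order with its own complexity drawn in [0, 0.7].
    img = SOURCES[rng.choice(len(SOURCES), p=PROBS)]()
    for op in pipeline:
        img = op(img, rng.uniform(0.0, max_complexity))
    return img

# Example with one trivial stand-in operator (additive noise):
noise = lambda im, c: np.clip(im + rng.normal(0.0, c, im.shape), 0.0, 1.0)
example = generate_p07_example([noise])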
452 455
453 %\item 456 %\item
454 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion) 457 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
455 except that we only apply 458 except that we only apply
456 transformations from slant to pinch. Therefore, the character is 459 transformations from slant to pinch. Therefore, the character is
457 transformed but no additional noise is added to the image, giving images 460 transformed but no additional noise is added to the image, giving images
458 closer to the NIST dataset. 461 closer to the NIST dataset.
459 %\end{itemize} 462 %\end{itemize}
462 \subsection{Models and their Hyperparameters} 465 \subsection{Models and their Hyperparameters}
463 \vspace*{-1mm} 466 \vspace*{-1mm}
464 467
465 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single 468 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single
466 hidden layer and with Stacked Denoising Auto-Encoders (SDA). 469 hidden layer and with Stacked Denoising Auto-Encoders (SDA).
467 All hyper-parameters are selected based on performance on the NISTP validation set. 470 \emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.}
468 471
469 {\bf Multi-Layer Perceptrons (MLP).} 472 {\bf Multi-Layer Perceptrons (MLP).}
470 Whereas previous work had compared deep architectures to both shallow MLPs and 473 Whereas previous work had compared deep architectures to both shallow MLPs and
471 SVMs, we only compared to MLPs here because of the very large datasets used 474 SVMs, we only compared to MLPs here because of the very large datasets used
472 (making the use of SVMs computationally inconvenient because of their quadratic 475 (making the use of SVMs computationally challenging because of their quadratic
473 scaling behavior). 476 scaling behavior).
474 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized 477 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized
475 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$. 478 exponentials) on the output layer for estimating $P(\mathrm{class} \mid \mathrm{image})$.
476 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. 479 The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
477 The optimization procedure is as follows: training 480 Training examples are presented in minibatches of size 20. A constant learning
478 examples are presented in minibatches of size 20, a constant learning 481 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
479 rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
480 through preliminary experiments (measuring performance on a validation set), 482 through preliminary experiments (measuring performance on a validation set),
481 and $0.1$ was then selected. 483 and $0.1$ was then selected for training on the full training sets.
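A minimal NumPy sketch of the MLP just described (single $\tanh$ hidden layer, softmax output, minibatches of size 20, constant learning rate $0.1$); the weight initialization, the 32x32 input size and the synthetic data are our assumptions, not the setup actually used for the experiments.

import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32 * 32, 1000, 62   # n_hidden taken in {300,...,1500}
lr, batch_size = 0.1, 20

W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

def forward(X):
    H = np.tanh(X @ W1 + b1)                     # hidden layer
    logits = H @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    P = np.exp(logits)
    return H, P / P.sum(axis=1, keepdims=True)   # estimates P(class | image)

def sgd_step(X, y):
    H, P = forward(X)
    d_logits = (P - np.eye(n_classes)[y]) / len(X)   # grad of mean NLL w.r.t. logits
    dW2, db2 = H.T @ d_logits, d_logits.sum(axis=0)
    d_pre = (d_logits @ W2.T) * (1.0 - H ** 2)       # back through tanh
    dW1, db1 = X.T @ d_pre, d_pre.sum(axis=0)
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= lr * grad                           # constant-learning-rate SGD

# One pass over some synthetic data, in minibatches of size 20:
X = rng.rand(200, n_in)
y = rng.randint(0, n_classes, size=200)
for i in range(0, len(X), batch_size):
    sgd_step(X[i:i + batch_size], y[i:i + batch_size])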
482 484
483 \begin{figure}[ht] 485 \begin{figure}[ht]
484 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} 486 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
485 \caption{Illustration of the computations and training criterion for the denoising 487 \caption{Illustration of the computations and training criterion for the denoising
486 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of 488 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
505 distribution $P(x)$ and the conditional distribution of interest 507 distribution $P(x)$ and the conditional distribution of interest
506 $P(y|x)$ (like in semi-supervised learning), and on the other hand 508 $P(y|x)$ (like in semi-supervised learning), and on the other hand
507 taking advantage of the expressive power and bias implicit in the 509 taking advantage of the expressive power and bias implicit in the
508 deep architecture (whereby complex concepts are expressed as 510 deep architecture (whereby complex concepts are expressed as
509 compositions of simpler ones through a deep hierarchy). 511 compositions of simpler ones through a deep hierarchy).
512
510 Here we chose to use the Denoising 513 Here we chose to use the Denoising
511 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for 514 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for
512 these deep hierarchies of features, as it is very simple to train and 515 these deep hierarchies of features, as it is very simple to train and
513 teach (see Figure~\ref{fig:da}, as well as 516 teach (see Figure~\ref{fig:da}, as well as
514 tutorial and code there: {\tt http://deeplearning.net/tutorial}), 517 tutorial and code at {\tt http://deeplearning.net/tutorial}),
515 provides immediate and efficient inference, and yielded results 518 provides immediate and efficient inference, and yielded results
516 comparable to or better than RBMs in a series of experiments 519 comparable to or better than RBMs in a series of experiments
517 \citep{VincentPLarochelleH2008}. During training, a Denoising 520 \citep{VincentPLarochelleH2008}. During training, a Denoising
518 Auto-Encoder is presented with a stochastically corrupted version 521 Auto-Encoder is presented with a stochastically corrupted version
519 of the input and trained to reconstruct the uncorrupted input, 522 of the input and trained to reconstruct the uncorrupted input,
534 537
535 \vspace*{-1mm} 538 \vspace*{-1mm}
536 539
537 \begin{figure}[ht] 540 \begin{figure}[ht]
538 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} 541 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
539 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training 542 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
540 on NIST, 1 on NISTP, and 2 on P07. Left: overall results 543 on NIST, 1 on NISTP, and 2 on P07. Left: overall results
541 of all models, on 3 different test sets corresponding to the three 544 of all models, on 3 different test sets (NIST, NISTP, P07).
542 datasets.
543 Right: error rates on NIST test digits only, along with the previous results from 545 Right: error rates on NIST test digits only, along with the previous results from
544 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} 546 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
545 respectively based on ART, nearest neighbors, MLPs, and SVMs.} 547 respectively based on ART, nearest neighbors, MLPs, and SVMs.}
546 548
547 \label{fig:error-rates-charts} 549 \label{fig:error-rates-charts}
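The exact interval construction used for the error bars is not specified in this excerpt; a standard normal-approximation 95% interval for a test error rate would be computed as in the sketch below, where the error count in the example is made up.

import math

def error_ci95(n_errors, n_examples):
    # Normal-approximation 95% confidence interval for an error rate.
    p = n_errors / n_examples
    half = 1.96 * math.sqrt(p * (1.0 - p) / n_examples)
    return p - half, p + half

# e.g. a hypothetical 1500 errors on the 82587 NIST test examples:
print(error_ci95(1500, 82587))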
691 In the original self-taught learning framework~\citep{RainaR2007}, the 693 In the original self-taught learning framework~\citep{RainaR2007}, the
692 out-of-sample examples were used as a source of unsupervised data, and 694 out-of-sample examples were used as a source of unsupervised data, and
693 experiments showed its positive effects in a \emph{limited labeled data} 695 experiments showed its positive effects in a \emph{limited labeled data}
694 scenario. However, many of the results by \citet{RainaR2007} (who used a 696 scenario. However, many of the results by \citet{RainaR2007} (who used a
695 shallow, sparse coding approach) suggest that the relative gain of self-taught 697 shallow, sparse coding approach) suggest that the relative gain of self-taught
696 learning diminishes as the number of labeled examples increases, (essentially, 698 learning diminishes as the number of labeled examples increases (essentially,
697 a ``diminishing returns'' scenario occurs). We note that, for deep 699 a ``diminishing returns'' scenario occurs). We note that, for deep
698 architectures, our experiments show that such a positive effect is accomplished 700 architectures, our experiments show that such a positive effect is accomplished
699 even in a scenario with a \emph{very large number of labeled examples}. 701 even in a scenario with a \emph{very large number of labeled examples}.
700 702
701 Why would deep learners benefit more from the self-taught learning framework? 703 Why would deep learners benefit more from the self-taught learning framework?