ift6266: comparison of writeup/nips2010_submission.tex @ 533:22d5cd82d5f0
resolved conflict
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Tue, 01 Jun 2010 21:24:39 -0400 |
parents | 2e33885730cf 85f2337d47d2 |
children | 4d6493d171f6 47894d0ecbde |
532:2e33885730cf | 533:22d5cd82d5f0 |
---|---|
1 \documentclass{article} % For LaTeX2e | 1 \documentclass{article} % For LaTeX2e |
2 \usepackage{nips10submit_e,times} | 2 \usepackage{nips10submit_e,times} |
3 | 3 |
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} | 4 \usepackage{amsthm,amsmath,bbm} |
5 \usepackage[psamsfonts]{amssymb} | |
5 \usepackage{algorithm,algorithmic} | 6 \usepackage{algorithm,algorithmic} |
6 \usepackage[utf8]{inputenc} | 7 \usepackage[utf8]{inputenc} |
7 \usepackage{graphicx,subfigure} | 8 \usepackage{graphicx,subfigure} |
8 \usepackage[numbers]{natbib} | 9 \usepackage[numbers]{natbib} |
9 | 10 |
75 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which | 76 Auto-Encoder~(DEA)~\citep{VincentPLarochelleH2008-very-small}, which |
76 performed similarly or better than previously proposed Restricted Boltzmann | 77 performed similarly or better than previously proposed Restricted Boltzmann |
77 Machines in terms of unsupervised extraction of a hierarchy of features | 78 Machines in terms of unsupervised extraction of a hierarchy of features |
78 useful for classification. The principle is that each layer starting from | 79 useful for classification. The principle is that each layer starting from |
79 the bottom is trained to encode its input (the output of the previous | 80 the bottom is trained to encode its input (the output of the previous |
80 layer) and to reconstruct it from a corrupted version of it. After this | 81 layer) and to reconstruct it from a corrupted version. After this |
81 unsupervised initialization, the stack of denoising auto-encoders can be | 82 unsupervised initialization, the stack of denoising auto-encoders can be |
82 converted into a deep supervised feedforward neural network and fine-tuned by | 83 converted into a deep supervised feedforward neural network and fine-tuned by |
83 stochastic gradient descent. | 84 stochastic gradient descent. |
84 | 85 |
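The layer-wise pre-training described above (encode the input, reconstruct it from a corrupted version, then stack the layers and fine-tune) can be made concrete with a minimal single-layer sketch. This is not the code used in the paper (the experiments rely on the Theano implementation at http://deeplearning.net/tutorial); the layer sizes, corruption level and learning rate below are assumptions made only for illustration.

```python
# Minimal single-layer denoising auto-encoder sketch (NumPy), illustrative only.
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_visible, n_hidden = 32 * 32, 500          # e.g. 32x32 character images (assumed sizes)
W = rng.uniform(-0.1, 0.1, (n_visible, n_hidden))
b_hid = np.zeros(n_hidden)
b_vis = np.zeros(n_visible)

def dae_step(x, corruption=0.25, lr=0.1):
    """One SGD step on the denoising reconstruction cost for a minibatch x."""
    # Corrupt the input by zeroing a random subset of pixels.
    mask = rng.binomial(1, 1.0 - corruption, x.shape)
    x_tilde = x * mask
    # Encode the corrupted input, decode with tied weights (W, W^T).
    h = sigmoid(x_tilde @ W + b_hid)
    x_rec = sigmoid(h @ W.T + b_vis)
    # Cross-entropy reconstruction cost measured against the *clean* input.
    cost = -np.mean(np.sum(x * np.log(x_rec) + (1 - x) * np.log(1 - x_rec), axis=1))
    # Backpropagation (sigmoid output + cross-entropy gives x_rec - x at the output).
    d_rec = (x_rec - x) / x.shape[0]
    d_hid = (d_rec @ W) * h * (1.0 - h)
    W[...] -= lr * (x_tilde.T @ d_hid + d_rec.T @ h)
    b_hid[...] -= lr * d_hid.sum(axis=0)
    b_vis[...] -= lr * d_rec.sum(axis=0)
    return cost

# Example usage on a fake binary minibatch of 20 "images":
x = rng.binomial(1, 0.3, (20, n_visible)).astype(float)
print(dae_step(x))
```

After such pre-training, sigmoid(x @ W + b_hid) becomes the first hidden layer of the supervised network, which is then fine-tuned by stochastic gradient descent as the paragraph above describes.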
85 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles | 86 Self-taught learning~\citep{RainaR2007} is a paradigm that combines principles |
96 The hypothesis explored here is that a deep hierarchy of features | 97 The hypothesis explored here is that a deep hierarchy of features |
97 may be better able to provide sharing of statistical strength | 98 may be better able to provide sharing of statistical strength |
98 between different regions in input space or different tasks, | 99 between different regions in input space or different tasks, |
99 as discussed in the conclusion. | 100 as discussed in the conclusion. |
100 | 101 |
101 % TODO: why we care to evaluate this relative advantage | |
102 | |
103 In this paper we ask the following questions: | 102 In this paper we ask the following questions: |
104 | 103 |
105 %\begin{enumerate} | 104 %\begin{enumerate} |
106 $\bullet$ %\item | 105 $\bullet$ %\item |
107 Do the good results previously obtained with deep architectures on the | 106 Do the good results previously obtained with deep architectures on the |
126 | 125 |
127 Our experimental results provide positive evidence towards all of these questions. | 126 Our experimental results provide positive evidence towards all of these questions. |
128 | 127 |
129 \vspace*{-1mm} | 128 \vspace*{-1mm} |
130 \section{Perturbation and Transformation of Character Images} | 129 \section{Perturbation and Transformation of Character Images} |
| 130 \label{s:perturbations} |
131 \vspace*{-1mm} | 131 \vspace*{-1mm} |
132 | 132 |
133 This section describes the different transformations we used to stochastically | 133 This section describes the different transformations we used to stochastically |
134 transform source images in order to obtain data. More details can | 134 transform source images in order to obtain data. More details can |
135 be found in this technical report~\citep{ift6266-tr-anonymous}. | 135 be found in this technical report~\citep{ift6266-tr-anonymous}. |
141 There are two main parts in the pipeline. The first one, | 141 There are two main parts in the pipeline. The first one, |
142 from slant to pinch below, performs transformations. The second | 142 from slant to pinch below, performs transformations. The second |
143 part, from blur to contrast, adds different kinds of noise. | 143 part, from blur to contrast, adds different kinds of noise. |
144 | 144 |
145 \begin{figure}[ht] | 145 \begin{figure}[ht] |
146 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}} | 146 \centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/transfo.png}}} |
147 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE | 147 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE |
148 \caption{Illustration of each transformation applied alone to the same image | 148 \caption{Illustration of each transformation applied alone to the same image |
149 of an upper-case h (top left). First row (from left to right) : original image, slant, | 149 of an upper-case h (top left). First row (from left to right) : original image, slant, |
150 thickness, affine transformation (translation, rotation, shear), | 150 thickness, affine transformation (translation, rotation, shear), |
151 local elastic deformation; second row (from left to right) : | 151 local elastic deformation; second row (from left to right) : |
353 Whereas much previous work on deep learning algorithms had been performed on | 353 Whereas much previous work on deep learning algorithms had been performed on |
354 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, | 354 the MNIST digits classification task~\citep{Hinton06,ranzato-07-small,Bengio-nips-2006,Salakhutdinov+Hinton-2009}, |
355 with 60~000 examples, and variants involving 10~000 | 355 with 60~000 examples, and variants involving 10~000 |
356 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want | 356 examples~\citep{Larochelle-jmlr-toappear-2008,VincentPLarochelleH2008}, we want |
357 to focus here on the case of much larger training sets, from 10 times to | 357 to focus here on the case of much larger training sets, from 10 times to |
358 to 1000 times larger. The larger datasets are obtained by first sampling from | 358 to 1000 times larger. |
| 359 |
| 360 The first step in constructing the larger datasets is to sample from |
359 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, | 361 a {\em data source}: {\bf NIST} (NIST database 19), {\bf Fonts}, {\bf Captchas}, |
360 and {\bf OCR data} (scanned machine printed characters). Once a character | 362 and {\bf OCR data} (scanned machine printed characters). Once a character |
361 is sampled from one of these sources (chosen randomly), a pipeline of | 363 is sampled from one of these sources (chosen randomly), the pipeline of |
362 the above transformations and/or noise processes is applied to the | 364 the transformations and/or noise processes described in section \ref{s:perturbations} |
363 image. | 365 is applied to the image. |
364 | 366 |
365 We compare the best MLP (according to validation set error) that we found against | 367 We compare the best MLP against |
366 the best SDA (again according to validation set error), along with a precise estimate | 368 the best SDA (both models' hyper-parameters are selected to minimize the validation set error), |
| 369 along with a comparison against a precise estimate |
367 of human performance obtained via Amazon's Mechanical Turk (AMT) | 370 of human performance obtained via Amazon's Mechanical Turk (AMT) |
368 service\footnote{http://mturk.com}. | 371 service (http://mturk.com). |
369 AMT users are paid small amounts | 372 AMT users are paid small amounts |
370 of money to perform tasks for which human intelligence is required. | 373 of money to perform tasks for which human intelligence is required. |
371 Mechanical Turk has been used extensively in natural language processing and vision. | 374 Mechanical Turk has been used extensively in natural language processing and vision. |
372 %processing \citep{SnowEtAl2008} and vision | 375 %processing \citep{SnowEtAl2008} and vision |
373 %\citep{SorokinAndForsyth2008,whitehill09}. | 376 %\citep{SorokinAndForsyth2008,whitehill09}. |
374 AMT users where presented | 377 AMT users were presented |
375 with 10 character images and asked to type 10 corresponding ASCII | 378 with 10 character images and asked to choose 10 corresponding ASCII |
376 characters. They were forced to make a hard choice among the | 379 characters. They were forced to make a hard choice among the |
377 62 or 10 character classes (all classes or digits only). | 380 62 or 10 character classes (all classes or digits only). |
378 Three users classified each image, allowing | 381 Three users classified each image, allowing |
379 to estimate inter-human variability. | 382 to estimate inter-human variability. A total 2500 images/dataset were classified. |
380 | 383 |
381 \vspace*{-1mm} | 384 \vspace*{-1mm} |
382 \subsection{Data Sources} | 385 \subsection{Data Sources} |
383 \vspace*{-1mm} | 386 \vspace*{-1mm} |
384 | 387 |
389 widely used for training and testing character | 392 widely used for training and testing character |
390 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. | 393 recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}. |
391 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications, | 394 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications, |
392 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes | 395 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes |
393 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. | 396 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. |
394 The fourth partition, $hsf_4$, experimentally recognized to be the most difficult one, is the one recommended | 397 The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended |
395 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 398 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
396 for that purpose. We randomly split the remainder into a training set and a validation set for | 399 for that purpose. We randomly split the remainder into a training set and a validation set for |
397 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, | 400 model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, |
398 and 82587 for testing. | 401 and 82587 for testing. |
399 The performances reported by previous work on that dataset mostly use only the digits. | 402 The performances reported by previous work on that dataset mostly use only the digits. |
405 more like the natural distribution of letters in text). | 408 more like the natural distribution of letters in text). |
406 | 409 |
407 %\item | 410 %\item |
408 {\bf Fonts.} | 411 {\bf Fonts.} |
409 In order to have a good variety of sources we downloaded an important number of free fonts from: | 412 In order to have a good variety of sources we downloaded an important number of free fonts from: |
410 {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} | 413 {\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}. |
411 % TODO: pointless to anonymize, it's not pointing to our work | 414 % TODO: pointless to anonymize, it's not pointing to our work |
412 Including operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. | 415 Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts that we can choose uniformly from. |
413 The {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, | 416 The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image, |
414 directly as input to our models. | 417 directly as input to our models. |
415 | 418 |
416 %\item | 419 %\item |
417 {\bf Captchas.} | 420 {\bf Captchas.} |
418 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a python based captcha generator library) for | 421 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a python based captcha generator library) for |
444 %\item | 447 %\item |
445 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. | 448 {\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. |
446 | 449 |
447 %\item | 450 %\item |
448 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources | 451 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources |
449 and sending them through the above transformation pipeline. | 452 and sending them through the transformation pipeline described in section \ref{s:perturbations}. |
450 For each new example to generate, a source is selected with probability $10\%$ from the fonts, | 453 For each new example to generate, a data source is selected with probability $10\%$ from the fonts, |
451 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the | 454 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the |
452 order given above, and for each of them we sample uniformly a complexity in the range $[0,0.7]$. | 455 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$. |
453 | 456 |
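A minimal sketch of the P07 generation loop described above. Only the mixture proportions (10% fonts, 25% captchas, 25% OCR, 40% NIST) and the per-transformation complexity range [0, 0.7] come from the text; the source readers and transformation functions are hypothetical stubs standing in for the real pipeline.

```python
import random

# Hypothetical stand-ins for the real source readers; each returns (image, label).
def sample_from_fonts():    return "font-image", "A"
def sample_from_captchas(): return "captcha-image", "A"
def sample_from_ocr():      return "ocr-image", "A"
def sample_from_nist():     return "nist-image", "A"

def make_transform(name):
    # Stand-in for a real deformation/noise module applied with a given complexity.
    def transform(image, complexity):
        return f"{name}({image}, c={complexity:.2f})"
    return transform

# Transformations named in this excerpt, in pipeline order (deformations from
# slant to pinch, then the noise part from blur to contrast); the real pipeline
# contains additional steps not listed here.
TRANSFORMATIONS = [make_transform(n) for n in
                   ("slant", "thickness", "affine", "local_elastic", "pinch",
                    "blur", "contrast")]

SOURCES = [(0.10, sample_from_fonts), (0.25, sample_from_captchas),
           (0.25, sample_from_ocr), (0.40, sample_from_nist)]

def generate_p07_example():
    """Draw one perturbed example following the P07 recipe described above."""
    probs, samplers = zip(*SOURCES)
    image, label = random.choices(samplers, weights=probs, k=1)[0]()
    for transform in TRANSFORMATIONS:
        # Each transformation draws its own complexity uniformly in [0, 0.7].
        image = transform(image, random.uniform(0.0, 0.7))
    return image, label

print(generate_p07_example())
```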
454 %\item | 457 %\item |
455 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same sources proportion) | 458 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources) |
456 except that we only apply | 459 except that we only apply |
457 transformations from slant to pinch. Therefore, the character is | 460 transformations from slant to pinch. Therefore, the character is |
458 transformed but no additional noise is added to the image, giving images | 461 transformed but no additional noise is added to the image, giving images |
459 closer to the NIST dataset. | 462 closer to the NIST dataset. |
460 %\end{itemize} | 463 %\end{itemize} |
463 \subsection{Models and their Hyperparameters} | 466 \subsection{Models and their Hyperparameters} |
464 \vspace*{-1mm} | 467 \vspace*{-1mm} |
465 | 468 |
466 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single | 469 The experiments are performed with Multi-Layer Perceptrons (MLP) with a single |
467 hidden layer and with Stacked Denoising Auto-Encoders (SDA). | 470 hidden layer and with Stacked Denoising Auto-Encoders (SDA). |
468 All hyper-parameters are selected based on performance on the NISTP validation set. | 471 \emph{Note that all hyper-parameters are selected based on performance on the {\bf NISTP} validation set.} |
469 | 472 |
470 {\bf Multi-Layer Perceptrons (MLP).} | 473 {\bf Multi-Layer Perceptrons (MLP).} |
471 Whereas previous work had compared deep architectures to both shallow MLPs and | 474 Whereas previous work had compared deep architectures to both shallow MLPs and |
472 SVMs, we only compared to MLPs here because of the very large datasets used | 475 SVMs, we only compared to MLPs here because of the very large datasets used |
473 (making the use of SVMs computationally inconvenient because of their quadratic | 476 (making the use of SVMs computationally challenging because of their quadratic |
474 scaling behavior). | 477 scaling behavior). |
475 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized | 478 The MLP has a single hidden layer with $\tanh$ activation functions, and softmax (normalized |
476 exponentials) on the output layer for estimating $P(class | image)$. | 479 exponentials) on the output layer for estimating $P(class | image)$. |
477 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. | 480 The number of hidden units is taken in $\{300,500,800,1000,1500\}$. |
478 The optimization procedure is as follows: training | 481 Training examples are presented in minibatches of size 20. A constant learning |
479 examples are presented in minibatches of size 20, a constant learning | 482 rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ |
480 rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$ | |
481 through preliminary experiments (measuring performance on a validation set), | 483 through preliminary experiments (measuring performance on a validation set), |
482 and $0.1$ was then selected. | 484 and $0.1$ was then selected for optimizing on the whole training sets. |
483 | 485 |
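For concreteness, a minimal sketch of the MLP baseline described above: one tanh hidden layer, a softmax output estimating P(class | image), minibatches of size 20, and the selected learning rate of 0.1. The input size, hidden-layer size and initialization scheme below are illustrative assumptions; the actual experiments use a Theano implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 32 * 32, 1000, 62   # n_hidden taken from {300,...,1500}

W1 = rng.uniform(-0.05, 0.05, (n_in, n_hidden));     b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.05, 0.05, (n_hidden, n_classes)); b2 = np.zeros(n_classes)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(x, y, lr=0.1):
    """One SGD update on the negative log-likelihood for a minibatch (x, y)."""
    h = np.tanh(x @ W1 + b1)
    p = softmax(h @ W2 + b2)                     # estimates P(class | image)
    nll = -np.mean(np.log(p[np.arange(len(y)), y]))
    # Backpropagation.
    d_out = p.copy()
    d_out[np.arange(len(y)), y] -= 1.0
    d_out /= len(y)
    d_hid = (d_out @ W2.T) * (1.0 - h ** 2)      # derivative of tanh
    W2[...] -= lr * (h.T @ d_out);  b2[...] -= lr * d_out.sum(axis=0)
    W1[...] -= lr * (x.T @ d_hid);  b1[...] -= lr * d_hid.sum(axis=0)
    return nll

# Example usage: a minibatch of 20 random "images" and integer class labels.
x = rng.uniform(0.0, 1.0, (20, n_in))
y = rng.randint(0, n_classes, 20)
print(sgd_step(x, y))
```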
484 \begin{figure}[ht] | 486 \begin{figure}[ht] |
485 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} | 487 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} |
486 \caption{Illustration of the computations and training criterion for the denoising | 488 \caption{Illustration of the computations and training criterion for the denoising |
487 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of | 489 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of |
506 distribution $P(x)$ and the conditional distribution of interest | 508 distribution $P(x)$ and the conditional distribution of interest |
507 $P(y|x)$ (like in semi-supervised learning), and on the other hand | 509 $P(y|x)$ (like in semi-supervised learning), and on the other hand |
508 taking advantage of the expressive power and bias implicit in the | 510 taking advantage of the expressive power and bias implicit in the |
509 deep architecture (whereby complex concepts are expressed as | 511 deep architecture (whereby complex concepts are expressed as |
510 compositions of simpler ones through a deep hierarchy). | 512 compositions of simpler ones through a deep hierarchy). |
| 513 |
511 Here we chose to use the Denoising | 514 Here we chose to use the Denoising |
512 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for | 515 Auto-Encoder~\citep{VincentPLarochelleH2008} as the building block for |
513 these deep hierarchies of features, as it is very simple to train and | 516 these deep hierarchies of features, as it is very simple to train and |
514 explain (see Figure~\ref{fig:da}, as well as | 517 explain (see Figure~\ref{fig:da}, as well as |
515 tutorial and code there: {\tt http://deeplearning.net/tutorial}), | 518 tutorial and code there: {\tt http://deeplearning.net/tutorial}), |
535 | 538 |
536 \vspace*{-1mm} | 539 \vspace*{-1mm} |
537 | 540 |
538 \begin{figure}[ht] | 541 \begin{figure}[ht] |
539 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} | 542 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} |
540 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training | 543 \caption{Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained |
541 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | 544 on NIST, 1 on NISTP, and 2 on P07. Left: overall results |
542 of all models, on 3 different test sets corresponding to the three | 545 of all models, on 3 different test sets (NIST, NISTP, P07). |
543 datasets. | |
544 Right: error rates on NIST test digits only, along with the previous results from | 546 Right: error rates on NIST test digits only, along with the previous results from |
545 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | 547 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} |
546 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | 548 respectively based on ART, nearest neighbors, MLPs, and SVMs.} |
547 | 549 |
548 \label{fig:error-rates-charts} | 550 \label{fig:error-rates-charts} |
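The excerpt does not state how the 95\% confidence intervals in this figure are computed; assuming standard normal-approximation binomial intervals on an error rate $\hat p$ measured over $n$ test examples, they would take the form

\[ \hat p \;\pm\; 1.96\,\sqrt{\frac{\hat p\,(1-\hat p)}{n}}. \]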