comparison: writeup/nips2010_submission.tex @ 549:ef172f4a322a
commit message: "ca fitte" ("it fits")
author: Yoshua Bengio <bengioy@iro.umontreal.ca>
date: Wed, 02 Jun 2010 13:56:01 -0400
parents: 34cb28249de0
children: 662299f265ab
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
of letters in the test set, compared to the training set (in the latter, the letters are distributed
more like the natural distribution of letters in text).

%\item
{\bf Fonts.}
In order to have a good variety of sources we downloaded a large number of free fonts from:
{\tt http://cg.scs.carleton.ca/\textasciitilde luc/freefonts.html}.
% TODO: pointless to anonymize, it's not pointing to our work
Including the operating system's (Windows 7) fonts, there is a total of $9817$ different fonts from which we can choose uniformly.
The chosen {\tt ttf} file is either used as input of the Captcha generator (see next item) or, by producing a corresponding image,
directly as input to our models.
\vspace*{-1mm}

%\item
{\bf Captchas.}
The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator) for
generating characters of the same format as the NIST dataset. This software is based on
a random character class generator and various kinds of transformations similar to those described in the previous sections.
In order to increase the variability of the generated data, many different fonts are used to generate the characters.
Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
depending on the value of the complexity parameter provided by the user of the data source.
%Two levels of complexity are allowed and can be controlled via an easy to use facade class. %TODO: what's a facade class?
\vspace*{-1mm}
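The complexity-scaled sampling of transformation parameters described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual generator code; the transformation names follow the text, but the maximum magnitudes are invented for the example.

```python
import random

def sample_transformation_params(complexity, rng=None):
    # Each transformation's strength scales with the user-supplied
    # complexity in [0, 1]; the magnitude bounds below are illustrative,
    # not the values used by the actual data source.
    rng = rng or random.Random(0)
    assert 0.0 <= complexity <= 1.0
    return {
        "slant": rng.uniform(-0.5, 0.5) * complexity,        # shear factor
        "rotation": rng.uniform(-30.0, 30.0) * complexity,   # degrees
        "translation": (rng.uniform(-3.0, 3.0) * complexity, # pixels
                        rng.uniform(-3.0, 3.0) * complexity),
        "distortion": rng.uniform(0.0, 1.0) * complexity,    # e.g. pinch strength
    }

params = sample_transformation_params(0.7)
```

At complexity $0$ every transformation collapses to the identity, which matches the idea of a single knob controlling how perturbed the generated characters are.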

%\item
{\bf OCR data.}
A large set (2 million) of scanned, OCRed and manually verified machine-printed
characters (from various documents and books) was included as an
% ...
\vspace*{-1mm}

All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
%\begin{itemize}
\vspace*{-1mm}
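The 62-class labeling can be made concrete with a minimal sketch. It assumes the conventional ordering of the 10 digits followed by upper- and lower-case letters; the text does not specify the exact index convention, so this ordering is an assumption.

```python
import string

# Assumed index convention (digits, then upper-case, then lower-case letters);
# the paper does not spell out the exact ordering of the 62 classes.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

def label_to_char(label):
    """Map an integer class label in [0, 61] to its character."""
    return CLASSES[label]
```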

%\item
{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
\{651668 / 80000 / 82587\} \{training / validation / test\} examples.
\vspace*{-1mm}

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the transformation pipeline described in section \ref{s:perturbations}.
To generate each new example, a data source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
\vspace*{-1mm}
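The P07 sampling scheme above can be sketched directly: pick a source according to the stated mixture weights, then draw an independent complexity in $[0, 0.7]$ for each transformation. This is an illustrative sketch, not the actual pipeline code.

```python
import random

# Source-mixture weights as stated in the text:
# fonts 10%, captchas 25%, OCR 25%, NIST 40%.
SOURCES = ["fonts", "captcha", "ocr", "nist"]
WEIGHTS = [0.10, 0.25, 0.25, 0.40]

def sample_example_spec(num_transformations, rng):
    """Choose a data source and per-transformation complexities for one example."""
    source = rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
    complexities = [rng.uniform(0.0, 0.7) for _ in range(num_transformations)]
    return source, complexities

rng = random.Random(0)
source, complexities = sample_example_spec(num_transformations=5, rng=rng)
```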

%\item
{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
except that we only apply
transformations from slant to pinch. Therefore, the character is
% ...
The number of hidden units is taken in $\{300,500,800,1000,1500\}$.
Training examples are presented in minibatches of size 20. A constant learning
rate was chosen among $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ was then selected for optimizing on the whole training sets.
\vspace*{-1mm}
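The validation-based selection of the learning rate can be sketched as a simple grid search: train with each candidate rate and keep the one with the lowest validation error. The `train_and_validate` callable is a hypothetical stand-in for the actual training run.

```python
# Candidate constant learning rates, as listed in the text.
CANDIDATE_LRS = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]

def select_learning_rate(train_and_validate, candidates=CANDIDATE_LRS):
    # train_and_validate(lr) -> validation error rate (lower is better);
    # a hypothetical stand-in for training the model with that rate.
    scores = {lr: train_and_validate(lr) for lr in candidates}
    return min(scores, key=scores.get)

# Toy stand-in: pretend validation error is minimized at lr = 0.1.
best = select_learning_rate(lambda lr: abs(lr - 0.1))
```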


{\bf Stacked Denoising Auto-Encoders (SDA).}
Various auto-encoder variants and Restricted Boltzmann Machines (RBMs)
can be used to initialize the weights of each layer of a deep MLP (with many hidden
% ...
compositions of simpler ones through a deep hierarchy).
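The per-layer denoising auto-encoder computation (corrupt $x$ into $\tilde{x}$, encode $y = f_\theta(\tilde{x})$, decode $z = g_{\theta'}(y)$, score the reconstruction of $x$) can be sketched in NumPy. This is a minimal illustration assuming sigmoid units, tied weights, masking corruption, and a cross-entropy reconstruction loss; the paper's actual architecture and hyperparameters may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_autoencoder_step(x, W, b, c, corruption=0.25):
    # Corrupt: zero out a random fraction of the inputs (masking noise).
    mask = rng.random(x.shape) > corruption
    x_tilde = x * mask
    y = sigmoid(x_tilde @ W + b)        # encoder f_theta
    z = sigmoid(y @ W.T + c)            # decoder g_theta' (tied weights, assumed)
    # Cross-entropy between the clean input x and the reconstruction z.
    eps = 1e-9
    loss = -np.mean(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
    return y, z, loss

x = rng.random((4, 32 * 32))                    # a batch of flattened 32x32 images
W = 0.01 * rng.standard_normal((32 * 32, 500))  # 500 hidden units (illustrative)
b = np.zeros(500)
c = np.zeros(32 * 32)
y, z, loss = denoising_autoencoder_step(x, W, b, c)
```

After pre-training, the encoder weights of each layer would initialize the corresponding layer of the deep MLP.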

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
\vspace*{-2mm}
\caption{Illustration of the computations and training criterion for the denoising
auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
the layer (i.e. raw input or output of previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
% ...
\vspace*{-1mm}

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
\vspace*{-3mm}
\caption{SDAx are the {\bf deep} models. Error bars indicate a 95\% confidence interval. 0 indicates that the model was trained
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on NIST and NISTP test sets.
Right: error rates on NIST test digits only, along with the previous results from
the literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
respectively based on ART, nearest neighbors, MLPs, and SVMs.}
\label{fig:error-rates-charts}
\vspace*{-2mm}
\end{figure}


\section{Experimental Results}
\vspace*{-2mm}

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}
The models are either trained on NIST (MLP0 and SDA0),
% ...
significant way) but when trained with perturbed data
reaches human performance on both the 62-class task
and the 10-class (digits) task.

\begin{figure}[ht]
\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\vspace*{-3mm}
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
% ...
error rate improvements of 27\%, 15\% and 13\% respectively for digits,
lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
\fi


\vspace*{-2mm}
\section{Conclusions and Discussion}
\vspace*{-2mm}

We have found that the self-taught learning framework is more beneficial
to a deep learner than to a traditional shallow and purely
supervised learner. More precisely,
the answers are positive for all the questions asked in the introduction.
%\begin{itemize}

$\bullet$ %\item
{\bf Do the good results previously obtained with deep architectures on the
MNIST digits generalize to a much larger and richer (but similar)
dataset, the NIST special database 19, with 62 classes and around 800k examples}?
Yes, the SDA {\bf systematically outperformed the MLP and all the previously
published results on this dataset} (the ones that we are aware of), {\bf in fact reaching human-level
performance} at around 17\% error on the 62-class task and 1.4\% on the digits.
