Mercurial > ift6266
writeup/nips2010_submission.tex @ 534:4d6493d171f6 ("added all sizes")
author: Yoshua Bengio <bengioy@iro.umontreal.ca>
date: Tue, 01 Jun 2010 22:12:13 -0400
parent: 533:22d5cd82d5f0
children: caf7769ca19c, f0ee2212ea7c
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
The fourth partition (called $hsf_4$, 82587 examples),
experimentally recognized to be the most difficult one, is the one recommended
by NIST as a testing set, and is used for that purpose in our work as well as in some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
We randomly split the remainder (731668 examples) into a training set and a validation set for
model selection.
The performances reported by previous work on that dataset mostly use only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
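The random train/validation split described above can be sketched as follows; this is an illustrative reconstruction (the function name and seed are hypothetical, the sizes are the ones stated in the text), not the authors' actual preprocessing code.

```python
import random

def split_nist_remainder(indices, valid_size=80000, seed=0):
    """Randomly split the non-test NIST examples into training and
    validation sets. Sizes are from the paper; the seed and function
    name are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return shuffled[valid_size:], shuffled[:valid_size]

# After holding out hsf_4 (82587 test examples) from the 814255 total,
# 731668 examples remain; the split yields 651668 train / 80000 valid.
remainder = range(814255 - 82587)
train, valid = split_nist_remainder(remainder)
print(len(train), len(valid))  # → 651668 80000
```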
All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
%\begin{itemize}

%\item
{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
\{651668 / 80000 / 82587\} \{training / validation / test\} examples.

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the transformation pipeline described in section \ref{s:perturbations}.
To generate each new example, a data source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.

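The P07 sampling scheme above (source drawn with probabilities 10/25/25/40, then one complexity per transformation drawn uniformly in $[0,0.7]$) can be sketched as below; the names and the idea of returning a spec rather than an image are illustrative assumptions, not the actual pipeline code.

```python
import random

# Source mixture stated in the text: 10% fonts, 25% captchas,
# 25% OCR, 40% NIST.
SOURCES = ["fonts", "captchas", "ocr", "nist"]
WEIGHTS = [0.10, 0.25, 0.25, 0.40]

def sample_example_spec(rng, n_transformations):
    """Draw the recipe for one synthetic example: a data source and
    one complexity value per transformation, uniform in [0, 0.7]."""
    source = rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
    complexities = [rng.uniform(0.0, 0.7) for _ in range(n_transformations)]
    return source, complexities

rng = random.Random(0)
source, comps = sample_example_spec(rng, n_transformations=3)
```

Each generated example thus carries its own independently sampled complexity per transformation, which is what makes the perturbation level vary across the dataset.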
%\item
{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources),
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
%\end{itemize}

\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}