diff writeup/nips2010_submission.tex @ 534:4d6493d171f6

added all sizes
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 22:12:13 -0400
parents 22d5cd82d5f0
children caf7769ca19c f0ee2212ea7c
--- a/writeup/nips2010_submission.tex	Tue Jun 01 21:24:39 2010 -0400
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 22:12:13 2010 -0400
@@ -394,11 +394,11 @@
 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 
 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. 
-The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended 
+The fourth partition (called $hsf_4$, 82587 examples), 
+experimentally recognized to be the most difficult one, is the one recommended 
 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
-for that purpose. We randomly split the remainder into a training set and a validation set for
-model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, 
-and 82587 for testing.
+for that purpose. We randomly split the remainder (731668 examples) into a training set and a validation set for
+model selection. 
 The performances reported by previous work on that dataset mostly use only the digits.
 Here we use all the classes both in the training and testing phase. This is especially
 useful to estimate the effect of a multi-task setting.
@@ -445,7 +445,8 @@
 %\begin{itemize}
 
 %\item 
-{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}.
+{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
+\{651668 / 80000 / 82587\} \{training / validation / test\} examples.
 
 %\item 
 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
@@ -453,13 +454,15 @@
 For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
+It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 
 %\item 
 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
   except that we only apply
   transformations from slant to pinch. Therefore, the character is
   transformed but no additional noise is added to the image, giving images
-  closer to the NIST dataset.
+  closer to the NIST dataset. 
+It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 %\end{itemize}
 
 \vspace*{-1mm}