# HG changeset patch
# User Yoshua Bengio
# Date 1275444733 14400
# Node ID 4d6493d171f6ed246f104f3def1114ef600664b1
# Parent 22d5cd82d5f08cc0123105f955c7c16014e6be0e
added all sizes

diff -r 22d5cd82d5f0 -r 4d6493d171f6 writeup/nips2010_submission.tex
--- a/writeup/nips2010_submission.tex Tue Jun 01 21:24:39 2010 -0400
+++ b/writeup/nips2010_submission.tex Tue Jun 01 22:12:13 2010 -0400
@@ -394,11 +394,11 @@
 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
-The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended
+The fourth partition (called $hsf_4$, 82587 examples),
+experimentally recognized to be the most difficult one, is the one recommended
 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
-for that purpose. We randomly split the remainder into a training set and a validation set for
-model selection. The sizes of these data sets are: 651668 for training, 80000 for validation,
-and 82587 for testing.
+for that purpose. We randomly split the remainder (731668 examples) into a training set and a validation set for
+model selection.
 The performances reported by previous work on that dataset mostly use only the digits.
 Here we use all the classes both in the training and testing phase. This is especially
 useful to estimate the effect of a multi-task setting.
@@ -445,7 +445,8 @@
 
 %\begin{itemize}
 %\item
-{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}.
+{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
+\{651668 / 80000 / 82587\} \{training / validation / test\} examples.
 
 %\item
 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
@@ -453,13 +454,15 @@
 For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
+It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 
 %\item
 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of
 data sources) except that we only apply
 transformations from slant to pinch. Therefore, the character is
 transformed but no additional noise is added to the image, giving images
- closer to the NIST dataset.
+ closer to the NIST dataset.
+It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 %\end{itemize}
 
 \vspace*{-1mm}
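
Reviewer note: the NIST sizes quoted in this patch can be cross-checked against each other. The sketch below is only an illustration in Python, not part of the changeset or of the authors' pipeline. The constants are copied from the patch text, while `split_sizes_consistent` and `random_split` are hypothetical helper names introduced here, and the seeded shuffle is just one assumed way of performing the "random split" the paper mentions.

```python
import random

# Sizes quoted in the patch text (NIST special database 19).
TOTAL = 814255       # all digits and characters
TEST = 82587         # hsf_4 partition, used as the test set
REMAINDER = 731668   # what is left after removing hsf_4
TRAIN = 651668
VALID = 80000


def split_sizes_consistent():
    """Check that the quoted sizes add up."""
    return TOTAL - TEST == REMAINDER and TRAIN + VALID == REMAINDER


def random_split(n_examples, n_valid, seed=0):
    """Hypothetical helper: shuffle example indices and hold out n_valid of them.

    Returns (train_indices, valid_indices). The seed and the shuffle-based
    strategy are assumptions; the patch only says the split is random.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return indices[n_valid:], indices[:n_valid]
```

Calling `random_split(REMAINDER, VALID)` would give index lists of lengths 651668 and 80000, matching the training and validation sizes the patch adds to the NIST paragraph.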