changeset 538:f0ee2212ea7c

typos and stuff
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Tue, 01 Jun 2010 19:34:00 -0700
parents 47894d0ecbde (current diff) 4d6493d171f6 (diff)
children 84f42fe05594
files writeup/nips2010_submission.tex
diffstat 1 files changed, 10 insertions(+), 7 deletions(-) [+]
line wrap: on
line diff
--- a/writeup/nips2010_submission.tex	Tue Jun 01 18:28:43 2010 -0700
+++ b/writeup/nips2010_submission.tex	Tue Jun 01 19:34:00 2010 -0700
@@ -334,7 +334,7 @@
 
 \iffalse
 \begin{figure}[ht]
-\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}}\\
+\centerline{\resizebox{.9\textwidth}{!}{\includegraphics{images/example_t.png}}}\\
 \caption{Illustration of the pipeline of stochastic 
 transformations applied to the image of a lower-case \emph{t}
 (the upper left image). Each image in the pipeline (going from
@@ -394,11 +394,11 @@
 The dataset is composed of 814255 digits and characters (upper and lower cases), with hand checked classifications,
 extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes 
 corresponding to ``0''-``9'',``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity. 
-The fourth partition (called $hsf_4$), experimentally recognized to be the most difficult one, is the one recommended 
+The fourth partition (called $hsf_4$, with 82587 examples),
+experimentally found to be the most difficult one, is the one recommended
 by NIST as a testing set and is used in our work as well as some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
-for that purpose. We randomly split the remainder into a training set and a validation set for
-model selection. The sizes of these data sets are: 651668 for training, 80000 for validation, 
-and 82587 for testing.
+for that purpose. We randomly split the remainder (731668 examples) into a training set and a validation set for
+model selection. 
 The performances reported by previous work on that dataset mostly use only the digits.
 Here we use all the classes both in the training and testing phase. This is especially
 useful to estimate the effect of a multi-task setting.
@@ -445,7 +445,8 @@
 %\begin{itemize}
 
 %\item 
-{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}.
+{\bf NIST.} This is the raw NIST Special Database 19~\citep{Grother-1995}. It has
+\{651668 / 80000 / 82587\} \{training / validation / test\} examples.
 
 %\item 
 {\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
@@ -453,13 +454,15 @@
 For each new example to generate, a data source is selected with probability $10\%$ from the fonts,
 $25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
 order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
+It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 
 %\item 
 {\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources)
   except that we only apply
   transformations from slant to pinch. Therefore, the character is
   transformed but no additional noise is added to the image, giving images
-  closer to the NIST dataset.
+  closer to the NIST dataset.
+It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
 %\end{itemize}
 
 \vspace*{-1mm}
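The P07 generation scheme added in this hunk (a data source drawn with fixed proportions of 10% fonts, 25% captchas, 25% OCR, 40% NIST, then a complexity drawn uniformly from $[0,0.7]$ for each transformation) can be sketched roughly as below. This is a hedged illustration only: the source labels, the `TRANSFORMATIONS` list, and the function name are assumptions, not taken from the paper's actual pipeline code.

```python
import random

# Sketch of the P07 sampling spec described in the text (assumed names):
# 1) pick a data source with the stated mixture probabilities;
# 2) draw one complexity per transformation, uniform in [0, 0.7].
SOURCES = ["fonts", "captchas", "ocr", "nist"]
WEIGHTS = [0.10, 0.25, 0.25, 0.40]
TRANSFORMATIONS = ["slant", "thickness", "affine", "pinch"]  # hypothetical list

def sample_p07_spec(rng=random):
    """Return (source, {transformation: complexity}) for one generated example."""
    source = rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
    complexities = {t: rng.uniform(0.0, 0.7) for t in TRANSFORMATIONS}
    return source, complexities
```

NISTP would use the same spec with the noise-adding transformations dropped from the list, which is why its images stay closer to raw NIST.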