--- a/writeup/techreport.tex	Sat May 01 14:27:56 2010 -0400
+++ b/writeup/techreport.tex	Mon May 03 02:43:08 2010 -0400
@@ -248,8 +248,24 @@

 \begin{itemize}
 \item {\bf NIST}
+The NIST Special Database 19 (NIST19) \cite{Grother} is a widely used dataset for training and testing OCR systems. The dataset is
+composed of over 800 000 digits and characters (upper and lower case), with hand-checked classifications, extracted from
+handwritten sample forms of 3600 writers. Each character is labelled with one of 62 classes, corresponding to ``0''--``9'',
+``A''--``Z'' and ``a''--``z''. The dataset contains 8 series of varying complexity. The fourth series, $hsf_4$,
+found experimentally to be the most difficult one for classification, is recommended by NIST as the test set and is
+used in our work for that purpose. Most results reported in previous work on this dataset use only the digits.
+Here we use all 62 classes in both the training and testing phases.
+
+
 \item {\bf Fonts}
 \item {\bf Captchas}
+The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
+generating characters in the same format as the NIST dataset. The core of this data source consists of a random character
+generator and various kinds of transformations similar to those described in the previous sections.
+To increase the variability of the generated data, different fonts are used for generating the characters.
+Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character, with a strength
+that depends on the complexity parameter provided by the user of the data source. Two levels of complexity are
+available and can be selected via an easy-to-use facade class.
 \item {\bf OCR data}
 \end{itemize}
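
Reviewer note: the Captcha paragraph added in this hunk describes a random character generator, a choice of fonts, and transformations whose strength depends on a two-level complexity parameter exposed through a facade class. The following is a minimal sketch of that structure only, not the project's code and not the pycaptcha API. The names `CaptchaSource`, `Sample`, `LEVELS` and `FONTS`, the level numbers 1 and 2, and all numeric limits are assumptions made for illustration. No image is rendered; the sketch only samples the label, font, and transformation parameters.

```python
import random
import string
from dataclasses import dataclass

# The 62 classes named in the NIST paragraph: "0"-"9", "A"-"Z", "a"-"z".
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

# Hypothetical maximum magnitudes per complexity level (illustrative values only).
LEVELS = {
    1: {"slant": 0.2, "rotation_deg": 10.0, "shift_px": 2},
    2: {"slant": 0.5, "rotation_deg": 25.0, "shift_px": 5},
}

# Placeholder font names; a real data source would point at font files.
FONTS = ["font_a", "font_b", "font_c"]


@dataclass
class Sample:
    char: str            # the generated character
    label: int           # index of the character in CLASSES (0..61)
    font: str            # font chosen for rendering
    slant: float         # shear factor
    rotation_deg: float  # rotation angle in degrees
    shift: tuple         # (dx, dy) translation in pixels


class CaptchaSource:
    """Hypothetical facade: one constructor argument selects the complexity."""

    def __init__(self, complexity=1, seed=None):
        if complexity not in LEVELS:
            raise ValueError("complexity must be 1 or 2")
        self.limits = LEVELS[complexity]
        self.rng = random.Random(seed)

    def sample(self):
        """Draw a random character, a font, and transformation parameters."""
        lim = self.limits
        label = self.rng.randrange(len(CLASSES))
        return Sample(
            char=CLASSES[label],
            label=label,
            font=self.rng.choice(FONTS),
            slant=self.rng.uniform(-lim["slant"], lim["slant"]),
            rotation_deg=self.rng.uniform(-lim["rotation_deg"], lim["rotation_deg"]),
            shift=(
                self.rng.randint(-lim["shift_px"], lim["shift_px"]),
                self.rng.randint(-lim["shift_px"], lim["shift_px"]),
            ),
        )
```

For example, `CaptchaSource(complexity=2, seed=0).sample()` returns one `Sample` whose parameters stay within the level-2 limits. Keeping the limits in a single table keyed by level is one way to let the facade hide the individual transformation settings from the caller.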