# HG changeset patch
# User goldfinger
# Date 1272868988 14400
# Node ID e2fd928a7de0d36576709c728c7a1d0e10bca98f
# Parent 9fcd0215b8d5c59919ffb4d578e96552d5f06f74
added description of nist19 and captcha data sources

diff -r 9fcd0215b8d5 -r e2fd928a7de0 writeup/techreport.tex
--- a/writeup/techreport.tex Sat May 01 14:27:56 2010 -0400
+++ b/writeup/techreport.tex Mon May 03 02:43:08 2010 -0400
@@ -248,8 +248,24 @@
 
 \begin{itemize}
 \item {\bf NIST}
+The NIST Special Database 19 (NIST19) \cite{Grother} is a very widely used dataset for training and testing OCR systems. The dataset is
+composed of over 800 000 digits and letters (upper and lower case), with hand-checked classifications, extracted from
+handwritten sample forms of 3600 writers. Each character is labelled with one of 62 classes corresponding to "0"-"9",
+"A"-"Z" and "a"-"z". The dataset contains 8 series of different complexity. The fourth series, $hsf_4$,
+experimentally recognized to be the most difficult one for classification, is recommended by NIST as the test set and is
+used in our work for that purpose. Previous work on this dataset mostly reports performance on the digits only.
+Here we use all 62 classes in both the training and testing phases.
+
+
 \item {\bf Fonts}
 \item {\bf Captchas}
+The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
+generating characters in the same format as the NIST dataset. The core of this data source is composed of a random character
+generator and various kinds of transformations similar to those described in the previous sections.
+To increase the variability of the generated data, different fonts are used for generating the characters.
+Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
+depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
+allowed and can be controlled via an easy-to-use facade class.
 \item {\bf OCR data}
 \end{itemize}
 
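
The NIST paragraph in this patch describes 62 classes covering "0"-"9", "A"-"Z" and "a"-"z". A minimal sketch of one possible label encoding follows. The index order (digits, then upper case, then lower case) and the names `CLASSES`, `char_to_label`, `label_to_char` and `is_digit_label` are assumptions made for illustration; the patch does not say how the project actually numbers its classes.

```python
import string

# Assumed ordering: digits, then upper case, then lower case,
# i.e. the order in which the text lists the 62 classes.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase


def char_to_label(ch):
    """Map a single character to its class index (0..61)."""
    if len(ch) != 1 or ch not in CLASSES:
        raise ValueError("not one of the 62 classes: %r" % (ch,))
    return CLASSES.index(ch)


def label_to_char(label):
    """Map a class index (0..61) back to its character."""
    if not 0 <= label < len(CLASSES):
        raise ValueError("label out of range: %r" % (label,))
    return CLASSES[label]


def is_digit_label(label):
    """True for the digits-only subset that most previous work reports on."""
    return 0 <= label < len(string.digits)
```

Under this assumed ordering, restricting an evaluation to the digit classes amounts to keeping only the examples for which `is_digit_label` is true, while the full 62-class task uses every label.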