comparison writeup/techreport.tex @ 432:e2fd928a7de0

added description of nist19 and captcha data sources
author goldfinger
date Mon, 03 May 2010 02:43:08 -0400
parents 9fcd0215b8d5
children 310c730516af
428:9fcd0215b8d5 432:e2fd928a7de0
246 246
247 \subsubsection{Data Sources} 247 \subsubsection{Data Sources}
248 248
249 \begin{itemize} 249 \begin{itemize}
250 \item {\bf NIST} 250 \item {\bf NIST}
251 The NIST Special Database 19 (NIST19) \cite{Grother} is a very widely used dataset for training and testing OCR systems. The dataset is
252 composed of over 800,000 digits and characters (upper and lower case), with hand-checked classifications, extracted from
253 handwritten sample forms of 3600 writers. Each character is labelled with one of 62 classes corresponding to ``0''--``9'',
254 ``A''--``Z'' and ``a''--``z''. The dataset contains 8 series of different complexity. The fourth series, $hsf_4$,
255 experimentally recognized as the most difficult one for classification, is recommended by NIST as the test set and is
256 used in our work for that purpose. Most results previously reported on this dataset use only the digits.
257 Here we use all 62 classes in both the training and testing phases.
258
259
251 \item {\bf Fonts} 260 \item {\bf Fonts}
252 \item {\bf Captchas} 261 \item {\bf Captchas}
262 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for
263 generating characters in the same format as the NIST dataset. The core of this data source is composed of a random character
264 generator and various kinds of transformations similar to those described in the previous sections.
265 To increase the variability of the generated data, different fonts are used for generating the characters.
266 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character with a complexity
267 depending on the value of the complexity parameter provided by the user of the data source. Two levels of complexity are
268 allowed and can be controlled via an easy-to-use facade class.
253 \item {\bf OCR data} 269 \item {\bf OCR data}
254 \end{itemize} 270 \end{itemize}
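
The 62-class labelling described for NIST19 can be illustrated with a minimal Python sketch. The ordering of the classes (digits first, then upper case, then lower case) and the helper names `char_to_label` and `label_to_char` are assumptions made for illustration; the report does not state how the class indices are actually assigned.

```python
import string

# Assumed class ordering: "0"-"9", then "A"-"Z", then "a"-"z" (62 classes).
# The actual index assignment used with NIST19 in the report is not specified.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase


def char_to_label(ch):
    """Return the (hypothetical) class index of a single character."""
    return CLASSES.index(ch)


def label_to_char(label):
    """Return the character associated with a (hypothetical) class index."""
    return CLASSES[label]
```

Under this assumed ordering, a digits-only experiment would restrict the labels to the first 10 indices, while the full task uses all 62.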
255 271
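
The way the Captcha data source ties the strength of its transformations to a complexity parameter can be sketched as follows. This is not the \emph{pycaptcha} API or the authors' facade class: the level names, the magnitude limits in `LEVELS`, and the functions `random_character` and `sample_transform` are all hypothetical, and the sketch only draws the transformation parameters (slant, rotation, translation) without rendering or distorting an image.

```python
import random
import string

# Hypothetical maximum magnitudes for the two complexity levels.
# The report does not give the real values.
LEVELS = {
    "low": {"slant": 0.1, "rotation_deg": 5.0, "shift_px": 1},
    "high": {"slant": 0.3, "rotation_deg": 15.0, "shift_px": 3},
}

# Same 62 characters as the NIST classes.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def random_character(rng):
    """Pick one character uniformly at random from the 62-symbol alphabet."""
    return rng.choice(ALPHABET)


def sample_transform(level, rng):
    """Draw random transformation parameters bounded by the chosen level."""
    limits = LEVELS[level]
    return {
        "slant": rng.uniform(-limits["slant"], limits["slant"]),
        "rotation_deg": rng.uniform(-limits["rotation_deg"], limits["rotation_deg"]),
        "dx": rng.randint(-limits["shift_px"], limits["shift_px"]),
        "dy": rng.randint(-limits["shift_px"], limits["shift_px"]),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    print(random_character(rng), sample_transform("high", rng))
```

Keeping the per-level limits in a single table is one simple way to let a facade expose only the level name while the generator handles the individual parameters.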
256 \subsubsection{Data Sets} 272 \subsubsection{Data Sets}
257 \begin{itemize} 273 \begin{itemize}