Mercurial > ift6266
comparison writeup/techreport.tex @ 432:e2fd928a7de0
added description of nist19 and captcha data sources
author | goldfinger |
---|---|
date | Mon, 03 May 2010 02:43:08 -0400 |
parents | 9fcd0215b8d5 |
children | 310c730516af |
428:9fcd0215b8d5 | 432:e2fd928a7de0 |
---|---|
246 | 246 |
247 \subsubsection{Data Sources} | 247 \subsubsection{Data Sources} |
248 | 248 |
249 \begin{itemize} | 249 \begin{itemize} |
250 \item {\bf NIST} | 250 \item {\bf NIST} |
251 The NIST Special Database 19 (NIST19) \cite{Grother} is a widely used dataset for training and testing OCR systems. The dataset is | |
252 composed of over 800 000 digits and characters (upper and lower case), with hand-checked classifications, extracted from | |
253 handwritten sample forms of 3600 writers. Each character is labelled with one of 62 classes corresponding to ``0''--``9'', | |
254 ``A''--``Z'' and ``a''--``z''. The dataset contains 8 series of different complexity. The fourth series, $hsf_4$, | |
255 experimentally recognized to be the most difficult one for the classification task, is recommended by NIST as the test set and is | |
256 used in our work for that purpose. Previous work on this dataset mostly reports performance on the digits only. | |
257 Here we use all 62 classes in both the training and testing phases. | |
258 | |
259 | |
251 \item {\bf Fonts} | 260 \item {\bf Fonts} |
252 \item {\bf Captchas} | 261 \item {\bf Captchas} |
262 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for | |
263 generating characters in the same format as the NIST dataset. The core of this data source is composed of a random character | |
264 generator and various kinds of transformations similar to those described in the previous sections. | |
265 To increase the variability of the generated data, different fonts are used to generate the characters. | |
266 Transformations (slant, distortions, rotation, translation) are applied to each randomly generated character, with a complexity | |
267 that depends on the complexity parameter provided by the user of the data source. Two levels of complexity are | |
268 available and can be selected through an easy-to-use facade class. | |
253 \item {\bf OCR data} | 269 \item {\bf OCR data} |
254 \end{itemize} | 270 \end{itemize} |
255 | 271 |
256 \subsubsection{Data Sets} | 272 \subsubsection{Data Sets} |
257 \begin{itemize} | 273 \begin{itemize} |