comparison writeup/techreport.tex @ 477:534d4ecf1bd1
small description of the font added
author | Xavier Glorot <glorotxa@iro.umontreal.ca> |
---|---|
date | Sun, 30 May 2010 17:24:26 -0400 |
parents | 5fa1c653620c |
children | 6593e67381a3 |
476:db28764b8252 | 477:534d4ecf1bd1 |
---|---|
427 The performances reported by previous work on that dataset mostly use only the digits. | 427 The performances reported by previous work on that dataset mostly use only the digits. |
428 Here we use all the classes in both the training and testing phases. | 428 Here we use all the classes in both the training and testing phases. |
429 | 429 |
430 | 430 |
431 \item {\bf Fonts} | 431 \item {\bf Fonts} |
432 In order to have a good variety of sources we downloaded a large number of free fonts from {\tt http://anonymous.url.net}. | |
433 %real address {\tt http://cg.scs.carleton.ca/~luc/freefonts.html} | |
434 Together with the Windows 7 fonts, this adds up to a total of $9817$ different fonts, from which we sample uniformly. | |
435 Each ttf file is either used as input to the Captcha generator (see next item) or rendered to a corresponding image that is | |
436 used directly as input to our models. | |
437 %Guillaume are there other details I forgot on the font selection? | |
438 | |
432 \item {\bf Captchas} | 439 \item {\bf Captchas} |
433 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for | 440 The Captcha data source is an adaptation of the \emph{pycaptcha} library (a Python-based captcha generator library) for |
434 generating characters of the same format as the NIST dataset. The core of this data source is composed of a random character | 441 generating characters of the same format as the NIST dataset. The core of this data source is composed of a random character |
435 generator and various kinds of transformations similar to those described in the previous sections. | 442 generator and various kinds of transformations similar to those described in the previous sections. |
436 In order to increase the variability of the data generated, different fonts are used for generating the characters. | 443 In order to increase the variability of the data generated, different fonts are used for generating the characters. |
440 \item {\bf OCR data} | 447 \item {\bf OCR data} |
441 \end{itemize} | 448 \end{itemize} |
442 | 449 |
443 \subsubsection{Data Sets} | 450 \subsubsection{Data Sets} |
444 \begin{itemize} | 451 \begin{itemize} |
445 \item {\bf NIST} | |
446 \item {\bf P07} | 452 \item {\bf P07} |
447 The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. | 453 The dataset P07 is sampled with our transformation pipeline with a complexity parameter of $0.7$. |
448 For each new example to generate, we choose one source with the following probabilities: $0.1$ for the fonts, | 454 For each new example to generate, we choose one source with the following probabilities: $0.1$ for the fonts, |
449 $0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in order | 455 $0.25$ for the captchas, $0.25$ for OCR data and $0.4$ for NIST. We apply all the transformations in order |
450 and, for each of them, we sample a complexity uniformly in the range $[0,0.7]$. | 456 and, for each of them, we sample a complexity uniformly in the range $[0,0.7]$. |
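
The P07 sampling rule described in the last rows above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name `sample_p07_plan`, the source keys, and the placeholder transformation names are hypothetical. Only the mixture probabilities and the $[0,0.7]$ complexity range come from the text.

```python
import random

# Source mixture for P07, taken from the text (probabilities sum to 1).
# The dictionary keys are illustrative labels, not identifiers from the project.
SOURCE_PROBS = {"fonts": 0.1, "captchas": 0.25, "ocr": 0.25, "nist": 0.4}

# Upper bound of the per-transformation complexity range for P07.
MAX_COMPLEXITY = 0.7


def sample_p07_plan(transformations, rng):
    """Draw one generation plan: a data source plus a complexity per transformation.

    `transformations` is an ordered list of transformation names (hypothetical
    placeholders here); the order is preserved, since the text says the
    transformations are applied in order.
    """
    sources = list(SOURCE_PROBS)
    weights = [SOURCE_PROBS[s] for s in sources]
    # Pick one source according to the stated mixture probabilities.
    source = rng.choices(sources, weights=weights, k=1)[0]
    # Independently sample a complexity uniformly in [0, MAX_COMPLEXITY]
    # for every transformation.
    plan = [(t, rng.uniform(0.0, MAX_COMPLEXITY)) for t in transformations]
    return source, plan


# Example call with placeholder transformation names.
example_source, example_plan = sample_p07_plan(
    ["transform_a", "transform_b", "transform_c"], random.Random(0)
)
print(example_source, example_plan)
```

Passing an explicit `random.Random` instance keeps the draw reproducible. How the chosen source and complexities are then turned into an image is outside the scope of this sketch.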