Mercurial > ift6266
writeup/nips2010_submission.tex @ 534:4d6493d171f6 ("added all sizes")
author: Yoshua Bengio <bengioy@iro.umontreal.ca>
date: Tue, 01 Jun 2010 22:12:13 -0400
parent: 533:22d5cd82d5f0
children: caf7769ca19c, f0ee2212ea7c
widely used for training and testing character
recognition systems~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
The dataset is composed of 814255 digits and characters (upper and lower case), with hand-checked classifications,
extracted from handwritten sample forms of 3600 writers. The characters are labelled by one of the 62 classes
corresponding to ``0''-``9'', ``A''-``Z'' and ``a''-``z''. The dataset contains 8 parts (partitions) of varying complexity.
The fourth partition (called $hsf_4$, 82587 examples),
experimentally recognized to be the most difficult one, is the one recommended
by NIST as a testing set, and is used for that purpose in our work as well as in some previous work~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}.
We randomly split the remainder (731668 examples) into a training set and a validation set for
model selection.
The performances reported by previous work on that dataset mostly use only the digits.
Here we use all the classes, both in the training and testing phases. This is especially
useful to estimate the effect of a multi-task setting.
Note that the distribution of the classes in the NIST training and test sets differs
substantially, with relatively many more digits in the test set, and a more uniform distribution
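The random train/validation split described above can be sketched as follows; this is an illustrative reconstruction (the function name and seed are hypothetical, the sizes are the ones stated in the text), not the authors' actual preprocessing code.

```python
import random

def split_nist_remainder(indices, valid_size=80000, seed=0):
    """Randomly split the non-test NIST examples into training and
    validation sets. Sizes are from the paper; the seed and function
    name are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return shuffled[valid_size:], shuffled[:valid_size]

# After holding out hsf_4 (82587 test examples) from the 814255 total,
# 731668 examples remain; the split yields 651668 train / 80000 valid.
remainder = range(814255 - 82587)
train, valid = split_nist_remainder(remainder)
print(len(train), len(valid))  # → 651668 80000
```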
All data sets contain 32$\times$32 grey-level images (values in $[0,1]$) associated with a label
from one of the 62 character classes.
%\begin{itemize}

%\item
{\bf NIST.} This is the raw NIST special database 19~\citep{Grother-1995}. It has
\{651668 / 80000 / 82587\} \{training / validation / test\} examples.

%\item
{\bf P07.} This dataset is obtained by taking raw characters from all four of the above sources
and sending them through the transformation pipeline described in section \ref{s:perturbations}.
To generate each new example, a data source is selected with probability $10\%$ from the fonts,
$25\%$ from the captchas, $25\%$ from the OCR data and $40\%$ from NIST. We apply all the transformations in the
order given above, and for each of them we sample uniformly a \emph{complexity} in the range $[0,0.7]$.
It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.

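The P07 sampling scheme above (source drawn with probabilities 10/25/25/40, then one complexity per transformation drawn uniformly in $[0,0.7]$) can be sketched as below; the names and the idea of returning a spec rather than an image are illustrative assumptions, not the actual pipeline code.

```python
import random

# Source mixture stated in the text: 10% fonts, 25% captchas,
# 25% OCR, 40% NIST.
SOURCES = ["fonts", "captchas", "ocr", "nist"]
WEIGHTS = [0.10, 0.25, 0.25, 0.40]

def sample_example_spec(rng, n_transformations):
    """Draw the recipe for one synthetic example: a data source and
    one complexity value per transformation, uniform in [0, 0.7]."""
    source = rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
    complexities = [rng.uniform(0.0, 0.7) for _ in range(n_transformations)]
    return source, complexities

rng = random.Random(0)
source, comps = sample_example_spec(rng, n_transformations=3)
```

Each generated example thus carries its own independently sampled complexity per transformation, which is what makes the perturbation level vary across the dataset.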
%\item
{\bf NISTP.} This one is equivalent to P07 (complexity parameter of $0.7$ with the same proportions of data sources),
except that we only apply
transformations from slant to pinch. Therefore, the character is
transformed but no additional noise is added to the image, giving images
closer to the NIST dataset.
It has \{81920000 / 80000 / 20000\} \{training / validation / test\} examples.
%\end{itemize}

\vspace*{-1mm}
\subsection{Models and their Hyperparameters}
\vspace*{-1mm}