comparison writeup/nips2010_submission.tex @ 523:c778d20ab6f8

space adjustments
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 01 Jun 2010 16:06:32 -0400
parents d41926a68993
children 07bc0ca8d246
\usepackage{amsthm,amsmath,amssymb,bbold,bbm}
\usepackage{algorithm,algorithmic}
\usepackage[utf8]{inputenc}
\usepackage{graphicx,subfigure}
\usepackage[numbers]{natbib}

%\setlength\parindent{0mm}

\title{Deep Self-Taught Learning for Handwritten Character Recognition}
\author{The IFT6266 Gang}

\begin{document}

There are two main parts in the pipeline. The first one,
from slant to pinch below, performs transformations. The second
part, from blur to contrast, adds different kinds of noise.

\begin{figure}[ht]
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}}
% TODO: PUT THE NAME OF THE TRANSFORMATION NEXT TO EACH IMAGE
\caption{Illustration of each transformation applied alone to the same image
of an upper-case h (top left). First row (from left to right): original image, slant,
thickness, affine transformation (translation, rotation, shear),
local elastic deformation; second row (from left to right):
proportionally to its height: $shift = round(slant \times height)$.
The $slant$ coefficient can be negative or positive with equal probability
and its value is randomly sampled according to the complexity level:
$slant \sim U[0,complexity]$, so the
maximum displacement for the lowest or highest pixel line is
$round(complexity \times 32)$.
\vspace*{0mm}

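The slant operation can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the pipeline's actual code: the function name is ours, and the wrap-around border handling of \texttt{np.roll} is an assumption (the original's border policy is not specified here).

```python
import numpy as np

def slant_image(image, complexity, rng=np.random):
    """Shift each pixel row horizontally, proportionally to its height:
    shift = round(slant * row_index), with slant ~ +/- U[0, complexity].
    Border handling (np.roll wrap-around) is an assumption."""
    slant = rng.uniform(0, complexity) * rng.choice([-1, 1])
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        shift = int(round(slant * y))
        out[y] = np.roll(image[y], shift)
    return out
```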
{\bf Thickness.}
Morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
are applied. The neighborhood of each pixel is multiplied
element-wise with a {\em structuring element} matrix.
The pixel value is replaced by the maximum or the minimum of the resulting
element from a subset of the $n$ smallest structuring elements where $n$ is
$round(10 \times complexity)$ for dilation and $round(6 \times complexity)$
for erosion. A neutral element is always present in the set, and if it is
chosen no transformation is applied. Erosion allows only the six
smallest structuring elements because when the character is too thin it may
be completely erased.
\vspace*{0mm}

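The core morphological step can be sketched as follows, assuming non-negative pixel values (ink bright on dark). The function name and zero-padding at the borders are our assumptions; the original uses a library of structuring elements of increasing size.

```python
import numpy as np

def dilate(image, struct):
    """Grey-scale dilation: multiply each neighborhood element-wise with
    the structuring element and keep the maximum (erosion would keep the
    minimum). Borders are zero-padded (an assumption)."""
    h, w = image.shape
    sh, sw = struct.shape
    ph, pw = sh // 2, sw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), constant_values=0)
    out = np.empty_like(image)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + sh, x:x + sw]
            out[y, x] = (window * struct).max()
    return out
```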
{\bf Affine Transformations.}
A $2 \times 3$ affine transform matrix (with
6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level.
Each pixel $(x,y)$ of the output image takes the value of the pixel
nearest to $(ax+by+c,dx+ey+f)$ in the input image. This
The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to
forbid large rotations (to avoid confusing classes) but to give good
variability of the transformation: $a$ and $d$ $\sim U[1-3 \times
complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3
\times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times
complexity]$.
\vspace*{0mm}

{\bf Local Elastic Deformations.}
This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
which provides more details.
Two ``displacement'' fields are generated and applied, for horizontal
and vertical displacements of pixels.
multiplied by a constant $\alpha$ which controls the intensity of the
displacements (larger $\alpha$ translates into larger wiggles).
Each field is convolved with a 2D Gaussian kernel of
standard deviation $\sigma$. Visually, this results in a blur.
$\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times
\sqrt[3]{complexity}$.
\vspace*{0mm}

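A minimal sketch of the Simard-style deformation, under several assumptions: separable Gaussian smoothing, uniform raw fields in $[-1,1]$, nearest-pixel lookup with clipping at the borders, and a kernel radius capped to the image size. None of these details are specified by the text above.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    g = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def elastic_deform(image, complexity, rng=np.random):
    """Two random displacement fields, Gaussian-smoothed and scaled by
    alpha = complexity^(1/3) * 10, sigma = 10 - 7 * complexity^(1/3)."""
    alpha = complexity ** (1 / 3.0) * 10.0
    sigma = 10 - 7 * complexity ** (1 / 3.0)
    h, w = image.shape
    radius = min(int(3 * sigma) + 1, (min(h, w) - 1) // 2)  # cap (assumption)
    g = gaussian_kernel(sigma, radius)
    def smooth(field):
        field = np.apply_along_axis(lambda r: np.convolve(r, g, 'same'), 1, field)
        return np.apply_along_axis(lambda c: np.convolve(c, g, 'same'), 0, field)
    dx = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    dy = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    ys, xs = np.indices((h, w))
    sx = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    sy = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    return image[sy, sx]
```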
{\bf Pinch.}
This is a GIMP filter called ``Whirl and
pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
For a square input image, this is akin to drawing a circle of
{\bf Motion Blur.}
This is a ``linear motion blur'' in GIMP
terminology, with two parameters, $length$ and $angle$. The value of
a pixel in the final image is approximately the mean value of the first $length$ pixels
found by moving in the $angle$ direction.
Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
\vspace*{0mm}

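A direct (slow) sketch of this averaging; taking the absolute value of the normally sampled $length$ and rounding pixel steps are our assumptions, not GIMP's exact algorithm.

```python
import numpy as np

def motion_blur(image, complexity, rng=np.random):
    """Each pixel becomes the mean of the first `length` pixels met when
    stepping in the `angle` direction. length < 1 leaves the image as-is."""
    angle = np.deg2rad(rng.uniform(0, 360))
    length = int(abs(rng.normal(0, 3 * complexity)))  # abs(): assumption
    if length < 1:
        return image.copy()
    h, w = image.shape
    dy, dx = np.sin(angle), np.cos(angle)
    out = np.zeros_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            vals = []
            for k in range(length):
                yy, xx = int(round(y + k * dy)), int(round(x + k * dx))
                if 0 <= yy < h and 0 <= xx < w:
                    vals.append(image[yy, xx])
            out[y, x] = np.mean(vals) if vals else image[y, x]
    return out
```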
{\bf Occlusion.}
Selects a random rectangle from an {\em occluder} character
image and places it over the original {\em occluded} character
image. Pixels are combined by taking $\max(occluder, occluded)$,
i.e. the value closer to black. The rectangle corners
are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}).
This filter has a probability of 60\% of not being applied.
\vspace*{0mm}

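The max-combination can be sketched as below, assuming ink is encoded as high values. The rectangle-size and position sampling here is simplified to uniform draws; the paper samples positions from a normal distribution.

```python
import numpy as np

def occlude(occluded, occluder, complexity, rng=np.random):
    """Paste a random rectangle of the occluder over the occluded image,
    combining pixels with max (the value closer to black, assuming
    ink = high value). Uniform position sampling is a simplification."""
    h, w = occluded.shape
    rh = max(1, int(h * complexity * rng.uniform(0.2, 1.0)))
    rw = max(1, int(w * complexity * rng.uniform(0.2, 1.0)))
    sy, sx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    dy, dx = rng.randint(0, h - rh + 1), rng.randint(0, w - rw + 1)
    out = occluded.copy()
    patch = occluder[sy:sy + rh, sx:sx + rw]
    out[dy:dy + rh, dx:dx + rw] = np.maximum(out[dy:dy + rh, dx:dx + rw], patch)
    return out
```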
{\bf Pixel Permutation.}
This filter permutes neighbouring pixels. It first selects a fraction
$\frac{complexity}{3}$ of the image pixels at random. Each of them is then
sequentially exchanged with another pixel in its $V4$ neighbourhood. The
numbers of exchanges to the left, right, top, and bottom are equal, or
differ by at most 1 when the number of selected pixels is not a multiple of 4.
This filter has a probability of 80\% of not being applied.
\vspace*{0mm}

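A simplified sketch of the swap: each selected pixel is exchanged with a uniformly chosen in-bounds 4-neighbour, rather than balancing the four directions exactly as the text describes. Function name and selection with replacement are our simplifications.

```python
import numpy as np

def permute_pixels(image, complexity, rng=np.random):
    """Swap a fraction complexity/3 of the pixels with a random V4
    (4-connected) neighbour. Direction balancing is omitted (assumption)."""
    out = image.copy()
    h, w = out.shape
    n = int(h * w * complexity / 3)
    for _ in range(n):
        y, x = rng.randint(0, h), rng.randint(0, w)
        moves = [(dy, dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= y + dy < h and 0 <= x + dx < w]
        dy, dx = moves[rng.randint(0, len(moves))]
        out[y, x], out[y + dy, x + dx] = out[y + dy, x + dx], out[y, x]
    return out
```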
{\bf Gaussian Noise.}
This filter simply adds, to each pixel of the image independently, a
noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
It has a probability of 70\% of not being applied.
\vspace*{0mm}

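This one is a one-liner plus clipping; clipping to $[0,1]$ is our assumption about the pixel range, not stated above.

```python
import numpy as np

def add_gaussian_noise(image, complexity, rng=np.random):
    """Add i.i.d. noise ~ Normal(0, (complexity/10)^2) to every pixel;
    the [0, 1] clipping is an assumption about the pixel range."""
    noisy = image + rng.normal(0.0, complexity / 10.0, image.shape)
    return np.clip(noisy, 0.0, 1.0)
```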
{\bf Background Images.}
Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random
background behind the letter. The background is chosen by first selecting,
at random, an image from a set of images. Then a 32$\times$32 sub-region
of that image is chosen as the background image (by sampling position
adjustments are made. We first get the maximal values (i.e. maximal
intensity) for both the original image and the background image, $maximage$
and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$.
Each background pixel value is multiplied by $\frac{\max(maximage -
contrast, 0)}{maxbg}$ (a higher $contrast$ yields a darker
background). The output image pixels are $\max(background, original)$.
\vspace*{0mm}

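The blending step can be sketched directly from the formulas above; selecting the background patch itself is omitted, and the function name is ours.

```python
import numpy as np

def add_background(image, background, complexity, rng=np.random):
    """Scale the background by max(maximage - contrast, 0) / maxbg with
    contrast ~ U[complexity, 1], then combine with an element-wise max."""
    contrast = rng.uniform(complexity, 1.0)
    maximage = image.max()
    maxbg = background.max()
    bg = background * max(maximage - contrast, 0.0) / maxbg
    return np.maximum(bg, image)
```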
{\bf Salt and Pepper Noise.}
This filter adds noise $\sim U[0,1]$ to random subsets of pixels.
The fraction of selected pixels is $0.2 \times complexity$.
This filter has a probability of 75\% of not being applied.
\vspace*{0mm}

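A sketch under one notable assumption: the selected pixels are *replaced* by $U[0,1]$ values (the usual salt-and-pepper convention), whereas the text literally says the noise is added.

```python
import numpy as np

def salt_and_pepper(image, complexity, rng=np.random):
    """Replace a fraction 0.2 * complexity of the pixels by noise ~ U[0,1].
    Replacement (rather than addition) is an assumption."""
    out = image.copy()
    h, w = out.shape
    n = int(h * w * 0.2 * complexity)
    idx = rng.choice(h * w, size=n, replace=False)
    out.flat[idx] = rng.uniform(0.0, 1.0, size=n)
    return out
```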
{\bf Spatially Gaussian Noise.}
Different regions of the image are spatially smoothed.
The image is convolved with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12,12 + 20 \times
complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
averaging centers between the original image and the filtered one. We
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation: $\frac{image + filtered\_image
\times mask}{mask+1}$.
This filter has a probability of 75\% of not being applied.
\vspace*{0mm}

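The final element-wise blend is worth making concrete: where the accumulated mask is 0 the original pixel is kept, and larger mask values weight the filtered image more heavily. The function name is ours.

```python
import numpy as np

def blend_with_mask(image, filtered, mask):
    """Element-wise (image + filtered * mask) / (mask + 1): mask == 0
    keeps the original pixel; large mask values approach the filtered one."""
    return (image + filtered * mask) / (mask + 1.0)
```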
{\bf Scratches.}
The scratches module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among five thousand such 1 images. The 1 image is
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
by an amount controlled by $complexity$.
This filter is applied only 15\% of the time. When it is applied, 50\%
of the time, only one patch image is generated and applied. In 30\% of
cases, two patches are generated, and otherwise three patches are
generated. Each patch is applied by taking, at each of the $32 \times 32$
pixel locations, the maximal value of the patch or the original image.
\vspace*{0mm}

{\bf Grey Level and Contrast Changes.}
This filter changes the contrast and may invert the image polarity (white
on black to black on white). The contrast $C$ is defined here as the
difference between the maximum and the minimum pixel value of the image.
Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$).
The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
polarity is inverted with $0.5$ probability.

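The contrast change maps the image's value range onto the interval $[\frac{1-C}{2},\,1-\frac{1-C}{2}]$; a sketch (function name ours, non-constant input assumed):

```python
import numpy as np

def change_contrast(image, complexity, rng=np.random):
    """Sample C ~ U[1 - 0.85*complexity, 1], rescale the image into
    [(1-C)/2, 1 - (1-C)/2], and invert polarity with probability 0.5."""
    C = rng.uniform(1 - 0.85 * complexity, 1.0)
    lo, hi = (1 - C) / 2.0, 1 - (1 - C) / 2.0
    mn, mx = image.min(), image.max()          # assumes mx > mn
    out = lo + (image - mn) / (mx - mn) * (hi - lo)
    if rng.uniform(0, 1) < 0.5:
        out = 1.0 - out                        # polarity inversion
    return out
```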
\iffalse
\begin{figure}[ht]
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}}\\
\caption{Illustration of the pipeline of stochastic
transformations applied to the image of a lower-case \emph{t}
(the upper left image). Each image in the pipeline (going from
left to right, first top line, then bottom line) shows the result
of applying one of the modules in the pipeline. The last image
examples are presented in minibatches of size 20, a constant learning
rate is chosen in $\{10^{-3},0.01, 0.025, 0.075, 0.1, 0.5\}$
through preliminary experiments (measuring performance on a validation set),
and $0.1$ was then selected.

\begin{figure}[ht]
\centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}}
\caption{Illustration of the computations and training criterion for the denoising
auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of
the layer (i.e. raw input or output of previous layer)
is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$.
The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which
is compared to the uncorrupted input $x$ through the loss function
$L_H(x,z)$, whose expected value is approximately minimized during training
by tuning $\theta$ and $\theta'$.}
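The caption's computation path can be sketched in numpy. Masking corruption, sigmoid units, tied weights ($W' = W^T$), and the layer sizes are illustrative assumptions; only the $x \to \tilde{x} \to y \to z \to L_H(x,z)$ structure is taken from the text.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One denoising auto-encoder layer: corrupt x into x~, encode into y,
# decode into z, and compare z with the *uncorrupted* x via cross-entropy.
n_in, n_hid = 32 * 32, 500                     # illustrative sizes
W = rng.uniform(-0.01, 0.01, (n_in, n_hid))
b, b_prime = np.zeros(n_hid), np.zeros(n_in)

def forward(x, corruption=0.25):
    x_tilde = x * (rng.uniform(size=x.shape) > corruption)  # masking noise
    y = sigmoid(x_tilde @ W + b)               # encoder f_theta
    z = sigmoid(y @ W.T + b_prime)             # decoder g_theta' (tied W)
    eps = 1e-12
    L_H = -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))
    return z, L_H
```

Training would minimize the expected $L_H$ by gradient descent on $\theta = (W, b)$ and $\theta' = (b')$, which is omitted here.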
among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
of hidden layers but it was fixed to 3 based on previous work with
stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}.

\vspace*{-1mm}

\begin{figure}[ht]
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}}
\caption{Error bars indicate a 95\% confidence interval. 0 indicates training
on NIST, 1 on NISTP, and 2 on P07. Left: overall results
of all models, on 3 different test sets corresponding to the three
datasets.
Right: error rates on NIST test digits only, along with the previous results from
literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005}
respectively based on ART, nearest neighbors, MLPs, and SVMs.}

\label{fig:error-rates-charts}
\vspace*{-1mm}
\end{figure}


\section{Experimental Results}

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
%\vspace*{-1mm}
found in Appendix I of the supplementary material. The 3 kinds of model differ in the
training sets used: NIST only (MLP0, SDA0), NISTP (MLP1, SDA1), or P07
(MLP2, SDA2). The deep learner not only outperformed the shallow ones and
previously published performance (in a statistically and qualitatively
significant way) but reaches human performance on both the 62-class task
and the 10-class (digits) task.

\begin{figure}[ht]
\vspace*{-2mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
upper case, or lower-case). The deep learner (SDA) benefits more from
both self-taught learning scenarios, compared to the shallow MLP.}
\label{fig:improvements-charts}
\vspace*{-2mm}
\end{figure}

In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by self-taught learning is greater for the SDA, and these
differences with the MLP are statistically and qualitatively
significant.
The left side of the figure shows the improvement to the clean
NIST test set error brought by the use of out-of-distribution examples
(i.e. the perturbed examples from NISTP or P07).
Relative change is measured by taking
(original model's error / perturbed-data model's error $- 1$).
The right side of
Figure~\ref{fig:improvements-charts} shows the relative improvement
brought by the use of a multi-task setting, in which the same model is
trained for more classes than the target classes of interest (i.e. training
with all 62 classes when the target classes are respectively the digits,
lower-case, or upper-case characters). Again, whereas the gain from the
multi-task setting is marginal or negative for the MLP, it is substantial
comparing the correct digit class with the output class associated with the
maximum conditional probability among only the digit classes outputs. The
setting is similar for the other two target classes (lower case characters
and upper case characters).

%\vspace*{-1mm}
%\subsection{Perturbed Training Data More Helpful for SDA}
%\vspace*{-1mm}

%\vspace*{-1mm}
error rate improvements of 27\%, 15\% and 13\% respectively for digits,
lower and upper case characters, as shown in Table~\ref{tab:multi-task}.
\fi


\vspace*{-1mm}
\section{Conclusions}
\vspace*{-1mm}

We have found that the self-taught learning framework is more beneficial