Mercurial > ift6266
comparison writeup/nips2010_submission.tex @ 523:c778d20ab6f8
space adjustments
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Tue, 01 Jun 2010 16:06:32 -0400 |
parents | d41926a68993 |
children | 07bc0ca8d246 |
522:d41926a68993 | 523:c778d20ab6f8 |
---|---|
4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} | 4 \usepackage{amsthm,amsmath,amssymb,bbold,bbm} |
5 \usepackage{algorithm,algorithmic} | 5 \usepackage{algorithm,algorithmic} |
6 \usepackage[utf8]{inputenc} | 6 \usepackage[utf8]{inputenc} |
7 \usepackage{graphicx,subfigure} | 7 \usepackage{graphicx,subfigure} |
8 \usepackage[numbers]{natbib} | 8 \usepackage[numbers]{natbib} |
9 | |
10 %\setlength\parindent{0mm} | |
9 | 11 |
10 \title{Deep Self-Taught Learning for Handwritten Character Recognition} | 12 \title{Deep Self-Taught Learning for Handwritten Character Recognition} |
11 \author{The IFT6266 Gang} | 13 \author{The IFT6266 Gang} |
12 | 14 |
13 \begin{document} | 15 \begin{document} |
137 | 139 |
138 There are two main parts in the pipeline. The first one, | 140 There are two main parts in the pipeline. The first one, |
139 from slant to pinch below, performs transformations. The second | 141 from slant to pinch below, performs transformations. The second |
140 part, from blur to contrast, adds different kinds of noise. | 142 part, from blur to contrast, adds different kinds of noise. |
141 | 143 |
142 \begin{figure}[h] | 144 \begin{figure}[ht] |
143 \resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}\\ | 145 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/transfo.png}}} |
144 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE | 146 % TODO: METTRE LE NOM DE LA TRANSFO A COTE DE CHAQUE IMAGE |
145 \caption{Illustration of each transformation applied alone to the same image | 147 \caption{Illustration of each transformation applied alone to the same image |
146 of an upper-case h (top left). First row (from left to right): original image, slant, | 148 of an upper-case h (top left). First row (from left to right): original image, slant, |
147 thickness, affine transformation (translation, rotation, shear), | 149 thickness, affine transformation (translation, rotation, shear), |
148 local elastic deformation; second row (from left to right): | 150 local elastic deformation; second row (from left to right): |
161 proportionally to its height: $shift = round(slant \times height)$. | 163 proportionally to its height: $shift = round(slant \times height)$. |
162 The $slant$ coefficient can be negative or positive with equal probability | 164 The $slant$ coefficient can be negative or positive with equal probability |
163 and its value is randomly sampled according to the complexity level: | 165 and its value is randomly sampled according to the complexity level: |
164 $slant \sim U[0,complexity]$, so the | 166 $slant \sim U[0,complexity]$, so the |
165 maximum displacement for the lowest or highest pixel line is of | 167 maximum displacement for the lowest or highest pixel line is of |
166 $round(complexity \times 32)$.\\ | 168 $round(complexity \times 32)$. |
169 \vspace*{0mm} | |
170 | |
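The slant step above can be sketched in code. This is a minimal illustrative re-implementation (not the authors' pipeline code), assuming a square float image and an explicit random generator; the `rng` parameter and out-of-range fill value are assumptions:

```python
import numpy as np

def slant(image, complexity, rng=np.random.default_rng(0)):
    # shift each row horizontally by round(slant_coef * row_index);
    # slant_coef ~ U[0, complexity], sign chosen with equal probability
    h, w = image.shape
    slant_coef = rng.uniform(0, complexity) * rng.choice([-1, 1])
    out = np.zeros_like(image)  # pixels shifted in from outside stay blank
    for y in range(h):
        shift = int(round(slant_coef * y))
        for x in range(w):
            sx = x - shift
            if 0 <= sx < w:
                out[y, x] = image[y, sx]
    return out
```

With `complexity = 0` the coefficient is 0 and the image is returned unchanged, matching the stated maximum displacement of $round(complexity \times 32)$ for the extreme rows.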
167 {\bf Thickness.} | 171 {\bf Thickness.} |
168 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82} | 172 Morphological operators of dilation and erosion~\citep{Haralick87,Serra82} |
169 are applied. The neighborhood of each pixel is multiplied | 173 are applied. The neighborhood of each pixel is multiplied |
170 element-wise with a {\em structuring element} matrix. | 174 element-wise with a {\em structuring element} matrix. |
171 The pixel value is replaced by the maximum or the minimum of the resulting | 175 The pixel value is replaced by the maximum or the minimum of the resulting |
175 element from a subset of the $n$ smallest structuring elements where $n$ is | 179 element from a subset of the $n$ smallest structuring elements where $n$ is |
176 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$ | 180 $round(10 \times complexity)$ for dilation and $round(6 \times complexity)$ |
177 for erosion. A neutral element is always present in the set, and if it is | 181 for erosion. A neutral element is always present in the set, and if it is |
178 chosen no transformation is applied. Erosion allows only the six | 182 chosen no transformation is applied. Erosion allows only the six |
179 smallest structuring elements because when the character is too thin it may | 183 smallest structuring elements because when the character is too thin it may |
180 be completely erased.\\ | 184 be completely erased. |
185 \vspace*{0mm} | |
186 | |
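Grayscale dilation, the max case described above, can be sketched as below; erosion is the same loop with `min`. This is an illustrative sketch using a flat square structuring element (the paper samples structuring elements from a set, so the square shape here is an assumption):

```python
import numpy as np

def dilate(image, size):
    # each pixel becomes the maximum over its (size x size) neighborhood,
    # i.e. dilation with a flat square structuring element
    h, w = image.shape
    r = size // 2
    padded = np.pad(image, r, mode='constant')
    out = np.empty_like(image)
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + 2 * r + 1, x:x + 2 * r + 1].max()
    return out
```

Dilating a single ink pixel with `size=3` grows it into a 3-by-3 block, which is why the thickness module restricts erosion to small elements: the dual `min` operation shrinks strokes just as fast.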
181 {\bf Affine Transformations.} | 187 {\bf Affine Transformations.} |
182 A $2 \times 3$ affine transform matrix (with | 188 A $2 \times 3$ affine transform matrix (with |
183 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. | 189 6 parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$ level. |
184 Each pixel $(x,y)$ of the output image takes the value of the pixel | 190 Each pixel $(x,y)$ of the output image takes the value of the pixel |
185 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This | 191 nearest to $(ax+by+c,dx+ey+f)$ in the input image. This |
187 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to | 193 The marginal distributions of $(a,b,c,d,e,f)$ have been tuned by hand to |
188 forbid important rotations (not to confuse classes) but to give good | 194 forbid important rotations (not to confuse classes) but to give good |
189 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times | 195 variability of the transformation: $a$ and $d$ $\sim U[1-3 \times |
190 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 | 196 complexity,1+3 \times complexity]$, $b$ and $e$ $\sim U[-3 \times complexity,3 |
191 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times | 197 \times complexity]$ and $c$ and $f$ $\sim U[-4 \times complexity, 4 \times |
192 complexity]$.\\ | 198 complexity]$. |
199 \vspace*{0mm} | |
200 | |
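The nearest-pixel affine warp described above can be sketched directly from the definition. A minimal re-implementation for illustration (parameter sampling omitted; the out-of-bounds fill value is an assumption):

```python
import numpy as np

def affine(image, a, b, c, d, e, f):
    # output pixel (x, y) takes the value of the input pixel
    # nearest to (a*x + b*y + c, d*x + e*y + f)
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx = int(round(a * x + b * y + c))
            sy = int(round(d * x + e * y + f))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out
```

With $(a,b,c,d,e,f) = (1,0,0,0,1,0)$ this is the identity, and $c$ or $f$ alone produce pure translations, which matches the role of the sampled ranges above.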
193 {\bf Local Elastic Deformations.} | 201 {\bf Local Elastic Deformations.} |
194 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, | 202 This filter induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short}, |
195 which provides more details. | 203 which provides more details. |
196 Two ``displacement'' fields are generated and applied, for horizontal | 204 Two ``displacement'' fields are generated and applied, for horizontal |
197 and vertical displacements of pixels. | 205 and vertical displacements of pixels. |
200 multiplied by a constant $\alpha$ which controls the intensity of the | 208 multiplied by a constant $\alpha$ which controls the intensity of the |
201 displacements (larger $\alpha$ translates into larger wiggles). | 209 displacements (larger $\alpha$ translates into larger wiggles). |
202 Each field is convolved with a 2D Gaussian kernel of | 210 Each field is convolved with a 2D Gaussian kernel of |
203 standard deviation $\sigma$. Visually, this results in a blur. | 211 standard deviation $\sigma$. Visually, this results in a blur. |
204 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times | 212 $\alpha = \sqrt[3]{complexity} \times 10.0$ and $\sigma = 10 - 7 \times |
205 \sqrt[3]{complexity}$.\\ | 213 \sqrt[3]{complexity}$. |
214 \vspace*{0mm} | |
215 | |
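The elastic deformation can be sketched as follows, after Simard et al. (2003). This is an illustrative re-implementation, not the authors' code; the initial $U[-1,1]$ field values, the fixed kernel radius, and the border clamping are assumptions:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    ax = np.arange(-radius, radius + 1)
    k = np.exp(-ax ** 2 / (2 * sigma ** 2))
    k2 = np.outer(k, k)
    return k2 / k2.sum()

def elastic_deform(image, complexity, rng=np.random.default_rng(0)):
    # alpha scales the wiggles; sigma controls their smoothness,
    # both set from complexity as in the text
    alpha = complexity ** (1 / 3) * 10.0
    sigma = 10 - 7 * complexity ** (1 / 3)
    h, w = image.shape
    k = gaussian_kernel(sigma, radius=4)

    def smooth(field):  # convolve the displacement field with the kernel
        p = np.pad(field, 4, mode='edge')
        out = np.empty_like(field)
        for y in range(h):
            for x in range(w):
                out[y, x] = (p[y:y + 9, x:x + 9] * k).sum()
        return out

    dx = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    dy = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx = min(max(int(round(x + dx[y, x])), 0), w - 1)
            sy = min(max(int(round(y + dy[y, x])), 0), h - 1)
            out[y, x] = image[sy, sx]
    return out
```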
206 {\bf Pinch.} | 216 {\bf Pinch.} |
207 This is a GIMP filter called ``Whirl and | 217 This is a GIMP filter called ``Whirl and |
208 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic | 218 pinch'', but whirl was set to 0. A pinch is ``similar to projecting the image onto an elastic |
209 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). | 219 surface and pressing or pulling on the center of the surface'' (GIMP documentation manual). |
210 For a square input image, this is akin to drawing a circle of | 220 For a square input image, this is akin to drawing a circle of |
228 {\bf Motion Blur.} | 238 {\bf Motion Blur.} |
229 This is a ``linear motion blur'' in GIMP | 239 This is a ``linear motion blur'' in GIMP |
230 terminology, with two parameters, $length$ and $angle$. The value of | 240 terminology, with two parameters, $length$ and $angle$. The value of |
231 a pixel in the final image is approximately the mean value of the first $length$ pixels | 241 a pixel in the final image is approximately the mean value of the first $length$ pixels |
232 found by moving in the $angle$ direction. | 242 found by moving in the $angle$ direction. |
233 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.\\ | 243 Here $angle \sim U[0,360]$ degrees, and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$. |
244 \vspace*{0mm} | |
245 | |
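The motion-blur averaging can be sketched as below. This is an approximation of GIMP's linear motion blur for illustration only (nearest-pixel stepping and the out-of-bounds handling are assumptions):

```python
import numpy as np

def motion_blur(image, length, angle_deg):
    # each pixel becomes the mean of the `length` pixels found by
    # stepping from it in the `angle` direction
    if length < 1:
        return image.copy()
    h, w = image.shape
    theta = np.deg2rad(angle_deg)
    dx, dy = np.cos(theta), np.sin(theta)
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            vals = []
            for t in range(int(length)):
                sx = int(round(x + t * dx))
                sy = int(round(y + t * dy))
                if 0 <= sx < w and 0 <= sy < h:
                    vals.append(image[sy, sx])
            out[y, x] = np.mean(vals) if vals else 0.0
    return out
```

With `length = 1` every pixel averages only itself, so the image is unchanged; larger `length` smears ink along the sampled direction.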
234 {\bf Occlusion.} | 246 {\bf Occlusion.} |
235 Selects a random rectangle from an {\em occluder} character | 247 Selects a random rectangle from an {\em occluder} character |
236 image and places it over the original {\em occluded} character | 248 image and places it over the original {\em occluded} character |
237 image. Pixels are combined by taking $\max(occluder, occluded)$, | 249 image. Pixels are combined by taking $\max(occluder, occluded)$, |
238 i.e. the value closer to black. The rectangle corners | 250 i.e. the value closer to black. The rectangle corners |
239 are sampled so that larger complexity gives larger rectangles. | 251 are sampled so that larger complexity gives larger rectangles. |
240 The destination position in the occluded image is also sampled | 252 The destination position in the occluded image is also sampled |
241 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}). | 253 according to a normal distribution (see more details in~\citet{ift6266-tr-anonymous}). |
242 This filter has a probability of 60\% of not being applied.\\ | 254 This filter has a probability of 60\% of not being applied. |
255 \vspace*{0mm} | |
256 | |
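The max-combination paste can be sketched as follows; the rectangle and destination sampling are left to the caller since their exact distributions are given in the cited technical report. The argument names and clipping behaviour here are assumptions:

```python
import numpy as np

def occlude(occluded, occluder, rect, dest, p_skip=0.6,
            rng=np.random.default_rng(0)):
    # paste a rectangle cut from the occluder over the occluded image,
    # combining pixels with max (the value closer to black);
    # the filter is skipped with probability p_skip (60% in the paper)
    if rng.random() < p_skip:
        return occluded.copy()
    y0, x0, rh, rw = rect   # (row, col, height, width) in the occluder
    dy, dx = dest           # top-left destination in the occluded image
    out = occluded.copy()
    patch = occluder[y0:y0 + rh, x0:x0 + rw]
    region = out[dy:dy + rh, dx:dx + rw]  # may be clipped at the border
    out[dy:dy + rh, dx:dx + rw] = np.maximum(
        region, patch[:region.shape[0], :region.shape[1]])
    return out
```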
243 {\bf Pixel Permutation.} | 257 {\bf Pixel Permutation.} |
244 This filter permutes neighbouring pixels. It first selects a fraction | 258 This filter permutes neighbouring pixels. It first selects a fraction |
245 $\frac{complexity}{3}$ of the image pixels at random. Each of them is then | 259 $\frac{complexity}{3}$ of the image pixels at random. Each of them is then |
246 sequentially exchanged with another pixel in its $V4$ (4-connected) neighbourhood. | 260 sequentially exchanged with another pixel in its $V4$ (4-connected) neighbourhood. |
247 The swaps are balanced across the four directions (left, right, top, | 261 The swaps are balanced across the four directions (left, right, top, |
248 bottom): the counts per direction are equal, or differ by at most 1 | 262 bottom): the counts per direction are equal, or differ by at most 1 |
249 when the number of selected pixels is not a multiple of 4. | 263 when the number of selected pixels is not a multiple of 4. |
250 This filter has a probability of 80\% of not being applied.\\ | 264 This filter has a probability of 80\% of not being applied. |
265 \vspace*{0mm} | |
266 | |
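The balanced-direction swapping can be sketched by cycling through the four directions. This is an illustrative interpretation (selecting a fraction $complexity/3$ of the pixels, and skipping swaps that would leave the image, are assumptions):

```python
import numpy as np

def permute_pixels(image, complexity, rng=np.random.default_rng(0)):
    # swap each selected pixel with a 4-neighbour, cycling through the
    # directions so per-direction counts differ by at most one
    h, w = image.shape
    out = image.copy()
    n = int(h * w * complexity / 3)
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    ys = rng.integers(0, h, n)
    xs = rng.integers(0, w, n)
    for i, (y, x) in enumerate(zip(ys, xs)):
        dy, dx = directions[i % 4]
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w:  # skip swaps leaving the image
            out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
```

Because the filter only exchanges pixel values, the histogram of the image is preserved exactly; only local structure is scrambled.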
251 {\bf Gaussian Noise.} | 267 {\bf Gaussian Noise.} |
252 This filter simply adds, to each pixel of the image independently, a | 268 This filter simply adds, to each pixel of the image independently, a |
253 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. | 269 noise $\sim Normal(0,(\frac{complexity}{10})^2)$. |
254 It has a probability of 70\% of not being applied.\\ | 270 It has a probability of 70\% of not being applied. |
271 \vspace*{0mm} | |
272 | |
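This filter is a one-liner in code. The sketch below clips the result back into $[0,1]$, which the text does not specify and is therefore an assumption:

```python
import numpy as np

def add_gaussian_noise(image, complexity, p_skip=0.7,
                       rng=np.random.default_rng(0)):
    # add i.i.d. noise ~ Normal(0, (complexity/10)^2) to each pixel;
    # skipped with probability p_skip (70% in the paper)
    if rng.random() < p_skip:
        return image.copy()
    noise = rng.normal(0.0, complexity / 10.0, image.shape)
    return np.clip(image + noise, 0.0, 1.0)  # clipping is an assumption
```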
255 {\bf Background Images.} | 273 {\bf Background Images.} |
256 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random | 274 Following~\citet{Larochelle-jmlr-2009}, this transformation adds a random |
257 background behind the letter. The background is chosen by first selecting, | 275 background behind the letter. The background is chosen by first selecting, |
258 at random, an image from a set of images. Then a 32$\times$32 sub-region | 276 at random, an image from a set of images. Then a 32$\times$32 sub-region |
259 of that image is chosen as the background image (by sampling position | 277 of that image is chosen as the background image (by sampling position |
262 adjustments are made. We first get the maximal values (i.e. maximal | 280 adjustments are made. We first get the maximal values (i.e. maximal |
263 intensity) for both the original image and the background image, $maximage$ | 281 intensity) for both the original image and the background image, $maximage$ |
264 and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$. | 282 and $maxbg$. We also have a parameter $contrast \sim U[complexity, 1]$. |
265 Each background pixel value is multiplied by $\frac{\max(maximage - | 283 Each background pixel value is multiplied by $\frac{\max(maximage - |
266 contrast, 0)}{maxbg}$ (higher contrast yields a darker | 284 contrast, 0)}{maxbg}$ (higher contrast yields a darker |
267 background). The output image pixels are $\max(background, original)$.\\ | 285 background). The output image pixels are $\max(background, original)$. |
286 \vspace*{0mm} | |
287 | |
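Given an already-selected 32$\times$32 background patch, the blending rule above is straightforward; selecting the patch itself is omitted here. A sketch (the zero-`maxbg` guard is an assumption):

```python
import numpy as np

def add_background(image, bg, contrast):
    # scale the background by max(maximage - contrast, 0) / maxbg,
    # then combine with max(background, original)
    maximage = image.max()
    maxbg = bg.max()
    scale = max(maximage - contrast, 0.0) / maxbg if maxbg > 0 else 0.0
    return np.maximum(bg * scale, image)
```

At `contrast = 1` (the low-complexity end of $U[complexity, 1]$ when $maximage = 1$) the background is scaled to zero and the character is untouched; lower contrast lets more of the background through.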
268 {\bf Salt and Pepper Noise.} | 288 {\bf Salt and Pepper Noise.} |
269 This filter adds noise $\sim U[0,1]$ to random subsets of pixels. | 289 This filter adds noise $\sim U[0,1]$ to random subsets of pixels. |
270 The fraction of selected pixels is $0.2 \times complexity$. | 290 The fraction of selected pixels is $0.2 \times complexity$. |
271 This filter has a probability of not being applied at all of 75\%.\\ | 291 This filter has a probability of not being applied at all of 75\%. |
292 \vspace*{0mm} | |
293 | |
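A sketch of the salt-and-pepper step, reading $0.2 \times complexity$ as the fraction of pixels replaced (sampling without replacement is an assumption):

```python
import numpy as np

def salt_and_pepper(image, complexity, p_skip=0.75,
                    rng=np.random.default_rng(0)):
    # replace a fraction 0.2 * complexity of the pixels with U[0,1] noise;
    # skipped with probability p_skip (75% in the paper)
    if rng.random() < p_skip:
        return image.copy()
    out = image.copy()
    n = int(image.size * 0.2 * complexity)
    idx = rng.choice(image.size, size=n, replace=False)
    out.flat[idx] = rng.uniform(0, 1, n)
    return out
```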
272 {\bf Spatially Gaussian Smoothing.} | 294 {\bf Spatially Gaussian Smoothing.} |
273 Different regions of the image are spatially smoothed. | 295 Different regions of the image are spatially smoothed. |
274 The image is convolved with a symmetric Gaussian kernel of | 296 The image is convolved with a symmetric Gaussian kernel of |
275 size and variance chosen uniformly in the ranges $[12,12 + 20 \times | 297 size and variance chosen uniformly in the ranges $[12,12 + 20 \times |
276 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized | 298 complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized |
280 averaging centers between the original image and the filtered one. We | 302 averaging centers between the original image and the filtered one. We |
281 initialize to zero a mask matrix of the image size. For each selected pixel | 303 initialize to zero a mask matrix of the image size. For each selected pixel |
282 we add to the mask the averaging window centered to it. The final image is | 304 we add to the mask the averaging window centered to it. The final image is |
283 computed from the following element-wise operation: $\frac{image + filtered | 305 computed from the following element-wise operation: $\frac{image + filtered |
284 image \times mask}{mask+1}$. | 306 image \times mask}{mask+1}$. |
285 This filter has a probability of not being applied at all of 75\%.\\ | 307 This filter has a probability of not being applied at all of 75\%. |
308 \vspace*{0mm} | |
309 | |
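The masked blend between the original and the filtered image can be sketched as below. Kernel construction and center selection are taken as inputs; the explicit convolution and edge padding are illustrative assumptions:

```python
import numpy as np

def local_smooth(image, centers, window, kernel):
    # final = (image + filtered * mask) / (mask + 1), where mask
    # accumulates the averaging windows centered on selected pixels
    h, w = image.shape
    r = kernel.shape[0] // 2
    p = np.pad(image, r, mode='edge')
    filtered = np.empty_like(image)
    for y in range(h):
        for x in range(w):
            filtered[y, x] = (p[y:y + 2 * r + 1, x:x + 2 * r + 1]
                              * kernel).sum()
    mask = np.zeros_like(image)
    wr = window // 2
    for cy, cx in centers:
        mask[max(cy - wr, 0):cy + wr + 1,
             max(cx - wr, 0):cx + wr + 1] += 1.0
    return (image + filtered * mask) / (mask + 1.0)
```

With no selected centers the mask is zero everywhere and the element-wise formula reduces to the original image, so smoothing is confined to the chosen regions.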
286 {\bf Scratches.} | 310 {\bf Scratches.} |
287 The scratches module places line-like white patches on the image. The | 311 The scratches module places line-like white patches on the image. The |
288 lines are heavily transformed images of the digit ``1'' (one), chosen | 312 lines are heavily transformed images of the digit ``1'' (one), chosen |
289 at random among five thousand such 1 images. The 1 image is | 313 at random among five thousand such 1 images. The 1 image is |
290 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times | 314 randomly cropped and rotated by an angle $\sim Normal(0,(100 \times |
294 by an amount controlled by $complexity$. | 318 by an amount controlled by $complexity$. |
295 This filter is applied only 15\% of the time. When it is applied, 50\% | 319 This filter is applied only 15\% of the time. When it is applied, 50\% |
296 of the time, only one patch image is generated and applied. In 30\% of | 320 of the time, only one patch image is generated and applied. In 30\% of |
297 cases, two patches are generated, and otherwise three patches are | 321 cases, two patches are generated, and otherwise three patches are |
298 generated. The patch is applied by taking the maximal value on any given | 322 generated. The patch is applied by taking the maximal value on any given |
299 patch or the original image, for each of the $32 \times 32$ pixel locations.\\ | 323 patch or the original image, for each of the $32 \times 32$ pixel locations. |
324 \vspace*{0mm} | |
325 | |
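The patch-count sampling and max-composition can be sketched as follows. The cropping and rotation of the ``1'' images are not reproduced here; the `patches` argument stands in for those pre-transformed line-like patches (a deliberate simplification):

```python
import numpy as np

def add_scratches(image, patches, p_apply=0.15,
                  rng=np.random.default_rng(0)):
    # applied 15% of the time; when applied: 1 patch with prob. 50%,
    # 2 with prob. 30%, otherwise 3; combined with pixel-wise max
    if rng.random() >= p_apply:
        return image.copy()
    u = rng.random()
    n = 1 if u < 0.5 else (2 if u < 0.8 else 3)
    out = image.copy()
    for i in range(min(n, len(patches))):
        out = np.maximum(out, patches[i])
    return out
```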
300 {\bf Grey Level and Contrast Changes.} | 326 {\bf Grey Level and Contrast Changes.} |
301 This filter changes the contrast and may invert the image polarity (white | 327 This filter changes the contrast and may invert the image polarity (white |
302 on black to black on white). The contrast $C$ is defined here as the | 328 on black to black on white). The contrast $C$ is defined here as the |
303 difference between the maximum and the minimum pixel value of the image. | 329 difference between the maximum and the minimum pixel value of the image. |
304 Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$). | 330 Contrast $\sim U[1-0.85 \times complexity,1]$ (so contrast $\geq 0.15$). |
305 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The | 331 The image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The |
306 polarity is inverted with $0.5$ probability. | 332 polarity is inverted with $0.5$ probability. |
307 | 333 |
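The contrast change and polarity flip can be sketched directly from the definitions above. Normalizing from the image's own min/max before mapping into the target interval is an assumption about how ``normalized into'' is meant:

```python
import numpy as np

def change_contrast(image, complexity, rng=np.random.default_rng(0)):
    # C ~ U[1 - 0.85*complexity, 1]; map the image into
    # [(1-C)/2, 1-(1-C)/2]; invert polarity with probability 0.5
    C = rng.uniform(1 - 0.85 * complexity, 1)
    lo, hi = (1 - C) / 2, 1 - (1 - C) / 2
    mn, mx = image.min(), image.max()
    scaled = ((image - mn) / (mx - mn)) if mx > mn else np.zeros_like(image)
    out = lo + scaled * (hi - lo)
    if rng.random() < 0.5:
        out = 1.0 - out
    return out
```

The output range $hi - lo$ equals $C$, so even at maximum complexity the contrast never drops below $0.15$, as stated.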
308 \iffalse | 334 \iffalse |
309 \begin{figure}[h] | 335 \begin{figure}[ht] |
310 \resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}\\ | 336 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/example_t.png}}}\\ |
311 \caption{Illustration of the pipeline of stochastic | 337 \caption{Illustration of the pipeline of stochastic |
312 transformations applied to the image of a lower-case \emph{t} | 338 transformations applied to the image of a lower-case \emph{t} |
313 (the upper left image). Each image in the pipeline (going from | 339 (the upper left image). Each image in the pipeline (going from |
314 left to right, first top line, then bottom line) shows the result | 340 left to right, first top line, then bottom line) shows the result |
315 of applying one of the modules in the pipeline. The last image | 341 of applying one of the modules in the pipeline. The last image |
452 examples are presented in minibatches of size 20, a constant learning | 478 examples are presented in minibatches of size 20, a constant learning |
453 rate is chosen in $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ | 479 rate is chosen in $\{0.001, 0.01, 0.025, 0.075, 0.1, 0.5\}$ |
454 through preliminary experiments (measuring performance on a validation set), | 480 through preliminary experiments (measuring performance on a validation set), |
455 and $0.1$ was then selected. | 481 and $0.1$ was then selected. |
456 | 482 |
457 \begin{figure}[h] | 483 \begin{figure}[ht] |
458 \resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}} | 484 \centerline{\resizebox{0.8\textwidth}{!}{\includegraphics{images/denoising_autoencoder_small.pdf}}} |
459 \caption{Illustration of the computations and training criterion for the denoising | 485 \caption{Illustration of the computations and training criterion for the denoising |
460 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ | 486 auto-encoder used to pre-train each layer of the deep architecture. Input $x$ of |
487 the layer (i.e. raw input or output of previous layer) | |
461 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. | 488 is corrupted into $\tilde{x}$ and encoded into code $y$ by the encoder $f_\theta(\cdot)$. |
462 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which | 489 The decoder $g_{\theta'}(\cdot)$ maps $y$ to reconstruction $z$, which |
463 is compared to the uncorrupted input $x$ through the loss function | 490 is compared to the uncorrupted input $x$ through the loss function |
464 $L_H(x,z)$, whose expected value is approximately minimized during training | 491 $L_H(x,z)$, whose expected value is approximately minimized during training |
465 by tuning $\theta$ and $\theta'$.} | 492 by tuning $\theta$ and $\theta'$.} |
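The computation in the figure can be sketched as one training step. This is an illustrative re-implementation, not the authors' code: tied weights, sigmoid units, masking corruption, and cross-entropy for $L_H$ are common choices assumed here, and plain gradient descent stands in for the actual optimizer:

```python
import numpy as np

def dae_step(x, W, b, b_prime, corruption=0.25, lr=0.1,
             rng=np.random.default_rng(0)):
    # x -> x_tilde (masking noise) -> y = f(x_tilde) -> z = g(y);
    # one gradient step on the cross-entropy L_H(x, z), updating
    # (W, b, b_prime) in place; returns the pre-update loss
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    x_tilde = x * (rng.random(x.shape) >= corruption)
    y = sigmoid(W @ x_tilde + b)          # encoder f_theta
    z = sigmoid(W.T @ y + b_prime)        # decoder g_theta' (tied weights)
    dz = z - x                            # grad of L_H wrt decoder pre-act
    dy = (W @ dz) * y * (1 - y)           # backprop through the encoder
    W -= lr * (np.outer(dy, x_tilde) + np.outer(dz, y).T)
    b -= lr * dy
    b_prime -= lr * dz
    return -np.sum(x * np.log(z + 1e-9) + (1 - x) * np.log(1 - z + 1e-9))
```

Repeated steps on the same input drive the reconstruction $z$ toward $x$, i.e. the expected loss is approximately minimized by tuning $\theta$ and $\theta'$ as the caption states.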
504 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number | 531 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number |
505 of hidden layers but it was fixed to 3 based on previous work with | 532 of hidden layers but it was fixed to 3 based on previous work with |
506 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. | 533 stacked denoising auto-encoders on MNIST~\citep{VincentPLarochelleH2008}. |
507 | 534 |
508 \vspace*{-1mm} | 535 \vspace*{-1mm} |
536 | |
537 \begin{figure}[ht] | |
538 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}} | |
539 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training | |
540 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | |
541 of all models, on 3 different test sets corresponding to the three | |
542 datasets. | |
543 Right: error rates on NIST test digits only, along with the previous results from | |
544 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | |
545 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | |
546 | |
547 \label{fig:error-rates-charts} | |
548 \vspace*{-1mm} | |
549 \end{figure} | |
550 | |
551 | |
509 \section{Experimental Results} | 552 \section{Experimental Results} |
510 | 553 |
511 %\vspace*{-1mm} | 554 %\vspace*{-1mm} |
512 %\subsection{SDA vs MLP vs Humans} | 555 %\subsection{SDA vs MLP vs Humans} |
513 %\vspace*{-1mm} | 556 %\vspace*{-1mm} |
523 found in Appendix I of the supplementary material. The 3 kinds of model differ in the | 566 found in Appendix I of the supplementary material. The 3 kinds of model differ in the |
524 training sets used: NIST only (MLP0,SDA0), NISTP (MLP1, SDA1), or P07 | 567 training sets used: NIST only (MLP0,SDA0), NISTP (MLP1, SDA1), or P07 |
525 (MLP2, SDA2). The deep learner not only outperformed the shallow ones and | 568 (MLP2, SDA2). The deep learner not only outperformed the shallow ones and |
526 previously published performance (in a statistically and qualitatively | 569 previously published performance (in a statistically and qualitatively |
527 significant way) but also reached human performance on both the 62-class task | 570 significant way) but also reached human performance on both the 62-class task |
528 and the 10-class (digits) task. In addition, as shown in the left of | 571 and the 10-class (digits) task. |
529 Figure~\ref{fig:fig:improvements-charts}, the relative improvement in error | 572 |
573 \begin{figure}[ht] | |
574 \vspace*{-2mm} | |
575 \centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}} | |
576 \caption{Relative improvement in error rate due to self-taught learning. | |
577 Left: Improvement (or loss, when negative) | |
578 induced by out-of-distribution examples (perturbed data). | |
579 Right: Improvement (or loss, when negative) induced by multi-task | |
580 learning (training on all classes and testing only on either digits, | |
581 upper case, or lower-case). The deep learner (SDA) benefits more from | |
582 both self-taught learning scenarios, compared to the shallow MLP.} | |
583 \label{fig:improvements-charts} | |
584 \vspace*{-2mm} | |
585 \end{figure} | |
586 | |
587 In addition, as shown in the left of | |
588 Figure~\ref{fig:improvements-charts}, the relative improvement in error | |
530 rate brought by self-taught learning is greater for the SDA, and these | 589 rate brought by self-taught learning is greater for the SDA, and these |
531 differences with the MLP are statistically and qualitatively | 590 differences with the MLP are statistically and qualitatively |
532 significant. | 591 significant. |
533 The left side of the figure shows the improvement to the clean | 592 The left side of the figure shows the improvement to the clean |
534 NIST test set error brought by the use of out-of-distribution examples | 593 NIST test set error brought by the use of out-of-distribution examples |
535 (i.e. the perturbed examples from NISTP or P07). | 594 (i.e. the perturbed examples from NISTP or P07). |
536 Relative change is measured by taking | 595 Relative change is measured by taking |
537 (original model's error / perturbed-data model's error - 1). | 596 (original model's error / perturbed-data model's error - 1). |
538 The right side of | 597 The right side of |
539 Figure~\ref{fig:fig:improvements-charts} shows the relative improvement | 598 Figure~\ref{fig:improvements-charts} shows the relative improvement |
540 brought by the use of a multi-task setting, in which the same model is | 599 brought by the use of a multi-task setting, in which the same model is |
541 trained for more classes than the target classes of interest (i.e. training | 600 trained for more classes than the target classes of interest (i.e. training |
542 with all 62 classes when the target classes are respectively the digits, | 601 with all 62 classes when the target classes are respectively the digits, |
543 lower-case, or upper-case characters). Again, whereas the gain from the | 602 lower-case, or upper-case characters). Again, whereas the gain from the |
544 multi-task setting is marginal or negative for the MLP, it is substantial | 603 multi-task setting is marginal or negative for the MLP, it is substantial |
553 comparing the correct digit class with the output class associated with the | 612 comparing the correct digit class with the output class associated with the |
554 maximum conditional probability among only the digit classes outputs. The | 613 maximum conditional probability among only the digit classes outputs. The |
555 setting is similar for the other two target classes (lower case characters | 614 setting is similar for the other two target classes (lower case characters |
556 and upper case characters). | 615 and upper case characters). |
557 | 616 |
558 \begin{figure}[h] | |
559 \resizebox{.99\textwidth}{!}{\includegraphics{images/error_rates_charts.pdf}}\\ | |
560 \caption{Error bars indicate a 95\% confidence interval. 0 indicates training | |
561 on NIST, 1 on NISTP, and 2 on P07. Left: overall results | |
562 of all models, on 3 different test sets corresponding to the three | |
563 datasets. | |
564 Right: error rates on NIST test digits only, along with the previous results from | |
565 literature~\citep{Granger+al-2007,Cortes+al-2000,Oliveira+al-2002-short,Milgram+al-2005} | |
566 respectively based on ART, nearest neighbors, MLPs, and SVMs.} | |
567 | |
568 \label{fig:error-rates-charts} | |
569 \end{figure} | |
570 | |
571 %\vspace*{-1mm} | 617 %\vspace*{-1mm} |
572 %\subsection{Perturbed Training Data More Helpful for SDA} | 618 %\subsection{Perturbed Training Data More Helpful for SDA} |
573 %\vspace*{-1mm} | 619 %\vspace*{-1mm} |
574 | 620 |
575 %\vspace*{-1mm} | 621 %\vspace*{-1mm} |
600 error rate improvements of 27\%, 15\% and 13\% respectively for digits, | 646 error rate improvements of 27\%, 15\% and 13\% respectively for digits, |
601 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. | 647 lower and upper case characters, as shown in Table~\ref{tab:multi-task}. |
602 \fi | 648 \fi |
603 | 649 |
604 | 650 |
605 \begin{figure}[h] | |
606 \resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}\\ | |
607 \caption{Relative improvement in error rate due to self-taught learning. | |
608 Left: Improvement (or loss, when negative) | |
609 induced by out-of-distribution examples (perturbed data). | |
610 Right: Improvement (or loss, when negative) induced by multi-task | |
611 learning (training on all classes and testing only on either digits, | |
612 upper case, or lower-case). The deep learner (SDA) benefits more from | |
613 both self-taught learning scenarios, compared to the shallow MLP.} | |
614 \label{fig:improvements-charts} | |
615 \end{figure} | |
616 | |
617 \vspace*{-1mm} | 651 \vspace*{-1mm} |
618 \section{Conclusions} | 652 \section{Conclusions} |
619 \vspace*{-1mm} | 653 \vspace*{-1mm} |
620 | 654 |
621 We have found that the self-taught learning framework is more beneficial | 655 We have found that the self-taught learning framework is more beneficial |