% comparison: writeup/nips2010_submission.tex @ changeset 557:17d16700e0c8
% commit message: "more of Myriam's visuals" (translated from French)
% author: Yoshua Bengio <bengioy@iro.umontreal.ca>
% date: Thu, 03 Jun 2010 09:17:01 -0400
% parent: b6dfba0a110c   child: cf5a7ee2d892
a corresponding shallow and purely supervised architecture?
%\end{enumerate}

Our experimental results provide positive evidence towards all of these questions.
To achieve these results, we introduce in the next section a sophisticated system
for stochastically transforming character images and then explain the methodology.
The conclusion discusses
the more general question of why deep learners may benefit so much from
the self-taught learning framework.

\vspace*{-1mm}
\section{Perturbation and Transformation of Character Images}
\end{center}
%\vspace{.6cm}
%\end{minipage}%
%\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
\end{wrapfigure}
To change character {\bf thickness}, morphological operators of dilation and erosion~\citep{Haralick87,Serra82}
are applied. The neighborhood of each pixel is multiplied
element-wise with a {\em structuring element} matrix.
The pixel value is replaced by the maximum or the minimum of the resulting
matrix, respectively for dilation or erosion. Ten different structuring elements with
increasing dimensions (largest is $5\times5$) were used. For each image,
{\bf Slant}
\end{minipage}%
\hspace{0.3cm}\begin{minipage}[b]{0.83\linewidth}
%\centering
%\vspace*{-15mm}
To produce {\bf slant}, each row of the image is shifted
proportionally to its height: $shift = round(slant \times height)$,
with $slant \sim U[-complexity,complexity]$.
\vspace{1.5cm}
\end{minipage}
%\vspace*{-4mm}
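As a concrete illustration, the slant operation can be sketched in a few lines of NumPy. The function and argument names here are ours, not from the generation pipeline, and vacated pixels are simply filled with the background value 0:

```python
import numpy as np

def slant_image(img, slant_amt):
    """Shift each row horizontally by round(slant_amt * row_index).

    Minimal sketch: `slant_amt` would be drawn from
    U[-complexity, complexity]; vacated pixels are set to 0.
    """
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        shift = int(round(slant_amt * y))
        shift = max(-w, min(w, shift))  # clamp so slicing stays valid
        if shift >= 0:
            out[y, shift:] = img[y, :w - shift]
        else:
            out[y, :w + shift] = img[y, -shift:]
    return out
```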
%\centering
\begin{wrapfigure}[8]{l}{0.15\textwidth}
\vspace*{-6mm}
\begin{center}
\includegraphics[scale=.4]{images/Affine_only.PNG}\\
{\bf Affine Transformation}
\end{center}
\end{wrapfigure}
%\end{minipage}%
%\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
A $2 \times 3$ {\bf affine transform} matrix (with
parameters $(a,b,c,d,e,f)$) is sampled according to the $complexity$.
Output pixel $(x,y)$ takes the value of input pixel
nearest to $(ax+by+c,dx+ey+f)$,
producing scaling, translation, rotation and shearing.
Marginal distributions of $(a,b,c,d,e,f)$ have been tuned to
\end{center}
\end{wrapfigure}
%\end{minipage}%
%\hspace{-3mm}\begin{minipage}[b]{0.85\linewidth}
%\vspace*{-20mm}
The {\bf local elastic} deformation
module induces a ``wiggly'' effect in the image, following~\citet{SimardSP03-short},
which provides more details.
The intensity of the displacement fields is given by
$\alpha = \sqrt[3]{complexity} \times 10.0$; the fields are
convolved with a Gaussian 2D kernel (resulting in a blur) of
standard deviation $\sigma = 10 - 7 \times\sqrt[3]{complexity}$.
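A pure-NumPy sketch of this module, in the style of Simard et al. (2003). Only the formulas for $\alpha$ and $\sigma$ come from the text; the separable blur, edge padding, and nearest-neighbour sampling are our illustrative simplifications:

```python
import numpy as np

def elastic_deform(img, complexity, seed=None):
    """Random uniform displacement fields, Gaussian-smoothed, scaled
    by alpha = 10*complexity^(1/3); sigma = 10 - 7*complexity^(1/3)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    c = complexity ** (1.0 / 3.0)
    alpha, sigma = 10.0 * c, 10.0 - 7.0 * c

    # 1-D Gaussian kernel, applied separably for the 2-D blur
    r = max(1, int(3 * sigma))
    k = np.exp(-0.5 * (np.arange(-r, r + 1) / sigma) ** 2)
    k /= k.sum()

    def smooth(f):
        g = np.pad(f, r, mode="edge")
        g = np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 0, g)
        return np.apply_along_axis(lambda v: np.convolve(v, k, "valid"), 1, g)

    dy = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    dx = smooth(rng.uniform(-1, 1, (h, w))) * alpha
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    return img[sy, sx]
```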
\end{center}
\end{wrapfigure}
%\vspace{.6cm}
%\end{minipage}%
%\hspace{0.3cm}\begin{minipage}[b]{0.86\linewidth}
The {\bf pinch} module applies the ``Whirl and pinch'' GIMP filter with whirl set to 0.
A pinch is ``similar to projecting the image onto an elastic
surface and pressing or pulling on the center of the surface'' (GIMP documentation manual).
For a square input image, draw a radius-$r$ disk
around its center $C$. Any pixel $P$ belonging to
that disk has its value replaced by
the value of a ``source'' pixel in the original image,
on the line that goes through $C$ and $P$, but
at some other distance $d_2$. Let $d_1=distance(P,C)$; then
$d_2 = \sin\left(\frac{\pi{}d_1}{2r}\right)^{-pinch} \times d_1$,
where $pinch$ is a parameter of the filter.
The actual value is given by bilinear interpolation considering the pixels
around the (non-integer) source position thus found.
Here $pinch \sim U[-complexity, 0.7 \times complexity]$.
%\vspace{1.5cm}
%\end{minipage}
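The radial remapping above can be sketched as follows. This is an assumption-laden simplification of the GIMP filter: the disk is centered on the image center, and nearest-neighbour sampling stands in for the bilinear interpolation described in the text:

```python
import numpy as np

def pinch_filter(img, pinch_amt):
    """For each pixel inside the radius-r disk, sample the source pixel at
    distance d2 = sin(pi*d1/(2r))**(-pinch) * d1 along the ray from the
    center (nearest-neighbour sampling; illustrative names)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = min(cy, cx)
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    d1 = np.hypot(dy, dx)
    inside = (d1 < r) & (d1 > 0)
    scale = np.ones_like(d1)  # pixels outside the disk are unchanged
    scale[inside] = np.sin(np.pi * d1[inside] / (2 * r)) ** (-pinch_amt)
    sy = np.clip(np.round(cy + dy * scale).astype(int), 0, h - 1)
    sx = np.clip(np.round(cx + dx * scale).astype(int), 0, w - 1)
    return img[sy, sx]
```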
\includegraphics[scale=.4]{images/Motionblur_only.PNG}\\
{\bf Motion Blur}
\end{minipage}%
\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
%\vspace*{.5mm}
The {\bf motion blur} module is GIMP's ``linear motion blur'', which
has parameters $length$ and $angle$. The value of
a pixel in the final image is approximately the mean of the first $length$ pixels
found by moving in the $angle$ direction,
with $angle \sim U[0,360]$ degrees and $length \sim {\rm Normal}(0,(3 \times complexity)^2)$.
\vspace{5mm}
\end{minipage}
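A NumPy approximation of this averaging (not GIMP's exact implementation; steps that fall outside the image are simply dropped from the mean):

```python
import numpy as np

def motion_blur(img, length, angle_deg):
    """Each output pixel is the mean of up to `length` pixels stepped in
    the `angle_deg` direction, starting at the pixel itself."""
    h, w = img.shape
    theta = np.deg2rad(angle_deg)
    n = max(1, int(round(length)))
    out = np.zeros_like(img, dtype=float)
    count = np.zeros_like(out)
    ys, xs = np.mgrid[0:h, 0:w]
    for i in range(n):
        sy = np.round(ys + i * np.sin(theta)).astype(int)
        sx = np.round(xs + i * np.cos(theta)).astype(int)
        valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
        out[valid] += img[sy[valid], sx[valid]]
        count[valid] += 1
    return out / np.maximum(count, 1)
```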
{\bf Occlusion}
%\vspace{.5cm}
\end{minipage}%
\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
\vspace*{-18mm}
The {\bf occlusion} module selects a random rectangle from an {\em occluder} character
image and places it over the original {\em occluded}
image. Pixels are combined by taking $\max(occluder, occluded)$,
i.e.\ the value closer to black. The rectangle corners
are sampled so that larger complexity gives larger rectangles.
The destination position in the occluded image is also sampled
according to a normal distribution (more details in~\citet{ift6266-tr-anonymous}).
This module is skipped with probability 60\%.
%\vspace{7mm}
\end{minipage}
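The compositing step can be sketched as below; the random sampling of the rectangle and of the destination position is omitted, so only the element-wise max combination from the text is shown:

```python
import numpy as np

def occlude(occluded, occluder, top, left):
    """Paste an occluder patch over the occluded image at (top, left),
    combining pixels with max (the value closer to black here)."""
    out = occluded.astype(float)
    region = out[top:top + occluder.shape[0], left:left + occluder.shape[1]]
    np.maximum(region, occluder[:region.shape[0], :region.shape[1]], out=region)
    return out
```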

\vspace*{1mm}

\end{center}
\end{wrapfigure}
%\vspace{.5cm}
%\end{minipage}%
%\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
With the {\bf Gaussian smoothing} module,
different regions of the image are spatially smoothed by convolving
the image with a symmetric Gaussian kernel of
size and variance chosen uniformly in the ranges $[12,12 + 20 \times
complexity]$ and $[2,2 + 6 \times complexity]$. The result is normalized
between $0$ and $1$. We also create a symmetric weighted averaging window, of the
kernel size, with maximum value at the center. For each image we sample
averaging centers between the original image and the filtered one. We
initialize to zero a mask matrix of the image size. For each selected pixel
we add to the mask the averaging window centered on it. The final image is
computed from the following element-wise operation: $\frac{image + filtered\ image \times mask}{mask+1}$.
This module is skipped with probability 75\%.
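The mask-and-blend step can be sketched as follows, assuming the smoothed image and the averaging window have already been computed (the window-stamping mechanics are our reading of the text):

```python
import numpy as np

def blend_smoothed(img, filtered, centers, window):
    """Build a mask by stamping `window` at each sampled center, then
    blend element-wise: (img + filtered * mask) / (mask + 1)."""
    h, w = img.shape
    wh, ww = window.shape
    mask = np.zeros((h, w))
    for cy, cx in centers:
        y0, x0 = cy - wh // 2, cx - ww // 2
        y1, x1 = max(y0, 0), max(x0, 0)
        y2, x2 = min(y0 + wh, h), min(x0 + ww, w)
        mask[y1:y2, x1:x2] += window[y1 - y0:y2 - y0, x1 - x0:x2 - x0]
    return (img + filtered * mask) / (mask + 1)
```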
%\end{minipage}

\newpage

\vspace*{-9mm}
\end{center}
\end{wrapfigure}
%\end{minipage}%
%\hspace{-0cm}\begin{minipage}[t]{0.86\linewidth}
%\vspace*{-20mm}
This module {\bf permutes neighbouring pixels}. It first selects a
fraction $\frac{complexity}{3}$ of the pixels randomly in the image. Each of them is then
sequentially exchanged with another one in its $V4$ neighbourhood.
This module is skipped with probability 80\%.\\
\vspace*{1mm}
\end{minipage}
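A minimal sketch of the permutation, with one simplifying assumption of ours: neighbours wrap around at the image border rather than being rejected:

```python
import numpy as np

def permute_pixels(img, complexity, seed=None):
    """Sequentially exchange a fraction complexity/3 of the pixels with a
    random 4-neighbour (V4 neighbourhood); borders wrap for simplicity."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = out.shape
    n = int(h * w * complexity / 3.0)
    steps = ((-1, 0), (1, 0), (0, -1), (0, 1))  # V4 neighbourhood
    for _ in range(n):
        y, x = rng.integers(0, h), rng.integers(0, w)
        dy, dx = steps[rng.integers(0, 4)]
        ny, nx = (y + dy) % h, (x + dx) % w
        out[y, x], out[ny, nx] = out[ny, nx], out[y, x]
    return out
```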

\vspace{-3mm}

\begin{minipage}[t]{\linewidth}
\begin{wrapfigure}[7]{l}{0.15\textwidth}
%\vspace*{-3mm}
\begin{center}
\end{center}
\end{wrapfigure}
%\end{minipage}%
%\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
\vspace*{12mm}
The {\bf Gaussian noise} module simply adds, to each pixel of the image independently,
noise $\sim Normal(0,(\frac{complexity}{10})^2)$.
This module is skipped with probability 70\%.
%\vspace{1.1cm}
\end{minipage}

\vspace*{1.2cm}

\begin{minipage}[t]{\linewidth}
\begin{minipage}[t]{0.14\linewidth}
\centering
\includegraphics[scale=.4]{images/background_other_only.png}\\
{\small \bf Bg Image}
\end{minipage}%
\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
\vspace*{-18mm}
Following~\citet{Larochelle-jmlr-2009}, the {\bf background image} module adds a random
background image behind the letter, from a randomly chosen natural image,
with contrast adjustments depending on $complexity$, to preserve
more or less of the original character image.
%\vspace{.8cm}
\end{minipage}
\includegraphics[scale=.4]{images/Poivresel_only.PNG}\\
{\small \bf Salt \& Pepper}
\end{minipage}%
\hspace{0.3cm}\begin{minipage}[t]{0.83\linewidth}
\vspace*{-18mm}
The {\bf salt and pepper noise} module adds noise $\sim U[0,1]$ to random subsets of pixels.
The fraction of selected pixels is $0.2 \times complexity$.
This module is skipped with probability 75\%.
%\vspace{.9cm}
\end{minipage}
%\vspace{-.7cm}

\vspace{1mm}
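This module is simple enough to sketch directly (names are illustrative; the fraction $0.2 \times complexity$ of pixels is replaced with uniform noise):

```python
import numpy as np

def salt_pepper(img, complexity, seed=None):
    """Replace a fraction 0.2*complexity of the pixels with U[0,1] noise."""
    rng = np.random.default_rng(seed)
    out = img.astype(float)
    h, w = out.shape
    n = int(h * w * 0.2 * complexity)
    idx = rng.choice(h * w, size=n, replace=False)  # distinct pixels
    out.flat[idx] = rng.uniform(0, 1, n)
    return out
```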
%\end{minipage}%
\end{center}
\end{wrapfigure}
%\hspace{0.3cm}\begin{minipage}[t]{0.86\linewidth}
%\vspace{.4cm}
The {\bf scratches} module places line-like white patches on the image. The
lines are heavily transformed images of the digit ``1'' (one), chosen
at random among 500 such images of a ``1'',
randomly cropped and rotated by an angle $\sim Normal(0,(100 \times
complexity)^2)$ (in degrees), using bi-cubic interpolation.
Two passes of a grey-scale morphological erosion filter
are applied, reducing the width of the line
by an amount controlled by $complexity$.
This module is skipped with probability 85\%. The probabilities
of applying 1, 2, or 3 patches are (50\%, 30\%, 20\%).
\end{minipage}

\vspace*{2mm}

\begin{minipage}[t]{0.25\linewidth}
\centering
\hspace*{-16mm}\includegraphics[scale=.4]{images/Contrast_only.PNG}\\
{\bf Grey Level \& Contrast}
\end{minipage}%
\hspace{-12mm}\begin{minipage}[t]{0.82\linewidth}
\vspace*{-18mm}
The {\bf grey level and contrast} module changes the contrast by changing grey levels, and may invert the image polarity (white
to black and black to white). The contrast is $C \sim U[1-0.85 \times complexity,1]$,
so the image is normalized into $[\frac{1-C}{2},1-\frac{1-C}{2}]$. The
polarity is inverted with probability 50\%.
%\vspace{.7cm}
\end{minipage}
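The rescaling into $[\frac{1-C}{2}, 1-\frac{1-C}{2}]$ reduces to one affine map per pixel, as in this sketch (assuming input values in $[0,1]$; names are ours):

```python
import numpy as np

def grey_contrast(img, contrast, invert=False):
    """Rescale a [0,1] image into [(1-C)/2, 1-(1-C)/2] for contrast C,
    optionally inverting the polarity (white <-> black)."""
    out = (1.0 - contrast) / 2.0 + img * contrast
    return 1.0 - out if invert else out
```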
\label{fig:error-rates-charts}
\vspace*{-2mm}
\end{figure}

\begin{figure}[ht]
\vspace*{-3mm}
\centerline{\resizebox{.99\textwidth}{!}{\includegraphics{images/improvements_charts.pdf}}}
\vspace*{-3mm}
\caption{Relative improvement in error rate due to self-taught learning.
Left: Improvement (or loss, when negative)
induced by out-of-distribution examples (perturbed data).
Right: Improvement (or loss, when negative) induced by multi-task
learning (training on all classes and testing only on either digits,
upper case, or lower-case). The deep learner (SDA) benefits more from
both self-taught learning scenarios, compared to the shallow MLP.}
\label{fig:improvements-charts}
\vspace*{-2mm}
\end{figure}

\section{Experimental Results}
\vspace*{-2mm}

%\vspace*{-1mm}
%\subsection{SDA vs MLP vs Humans}
and the 10-class (digits) task.
17\% error (SDA1) or 18\% error (humans) may seem large, but a large
majority of the errors from humans and from SDA1 are from out-of-context
confusions (e.g.\ a vertical bar can be a ``1'', an ``l'' or an ``L'', and a
``c'' and a ``C'' are often indistinguishable).

In addition, as shown in the left of
Figure~\ref{fig:improvements-charts}, the relative improvement in error
rate brought by self-taught learning is greater for the SDA, and these
differences with the MLP are statistically and qualitatively