comparison writeup/aistats2011_cameraready.tex @ 631:510220effb14

corrections requested by the reviewer
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 19 Mar 2011 22:44:53 -0400
parents f55f1b1499c4
children 54e8958e963b
comparison
630:f55f1b1499c4 631:510220effb14
8 \usepackage{bbm} 8 \usepackage{bbm}
9 \usepackage[utf8]{inputenc} 9 \usepackage[utf8]{inputenc}
10 \usepackage[psamsfonts]{amssymb} 10 \usepackage[psamsfonts]{amssymb}
11 %\usepackage{algorithm,algorithmic} % not used after all 11 %\usepackage{algorithm,algorithmic} % not used after all
12 \usepackage{graphicx,subfigure} 12 \usepackage{graphicx,subfigure}
13 \usepackage[numbers]{natbib} 13 \usepackage{natbib}
14 14
15 \addtolength{\textwidth}{10mm} 15 \addtolength{\textwidth}{10mm}
16 \addtolength{\evensidemargin}{-5mm} 16 \addtolength{\evensidemargin}{-5mm}
17 \addtolength{\oddsidemargin}{-5mm} 17 \addtolength{\oddsidemargin}{-5mm}
18 18
428 428
429 The experiments are performed using MLPs (with a single 429 The experiments are performed using MLPs (with a single
430 hidden layer) and deep SDAs. 430 hidden layer) and deep SDAs.
431 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.} 431 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
432 432
433 {\bf Multi-Layer Perceptrons (MLP).} Whereas previous work had compared 433 {\bf Multi-Layer Perceptrons (MLP).} The MLP output estimates
434 \[
435 P({\rm class}|{\rm input}=x)
436 \]
437 with
438 \[
439 f(x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)),
440 \]
441 i.e., two layers, where
442 \[
443 p={\rm softmax}(a)
444 \]
445 means that
446 \[
447 p_i(x)=\exp(a_i)/\sum_j \exp(a_j)
448 \]
449 with $p_i(x)$ representing the probability
450 for class $i$, $\tanh$ is the element-wise
451 hyperbolic tangent, $b_i$ are parameter vectors, and $W_i$ are
452 parameter matrices (one per layer). The
453 number of rows of $W_1$ is called the number of hidden units (of the
454 single hidden layer, here), and
455 is one way to control capacity (the other main ways to control capacity are
456 the number of training iterations and, optionally, a regularization penalty
457 on the parameters, which was not used here because it did not help).
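As an illustrative aside (not from the paper, whose experiments used the Theano code at {\tt http://deeplearning.net/tutorial}; the input size, 500 hidden units and 62 classes below are assumptions made for the example), the two-layer computation $f(x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1x))$ can be sketched in NumPy as:
\begin{verbatim}
import numpy as np

def softmax(a):
    # p_i = exp(a_i) / sum_j exp(a_j); subtract max(a) for numerical stability
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mlp_forward(x, W1, b1, W2, b2):
    h = np.tanh(b1 + W1.dot(x))      # hidden layer; rows of W1 = number of hidden units
    return softmax(b2 + W2.dot(h))   # estimated P(class | input = x)

# assumed sizes: 32x32 input, 500 hidden units, 62 character classes
rng = np.random.RandomState(0)
W1, b1 = 0.01 * rng.randn(500, 1024), np.zeros(500)
W2, b2 = 0.01 * rng.randn(62, 500), np.zeros(62)
p = mlp_forward(rng.rand(1024), W1, b1, W2, b2)   # p is a length-62 probability vector
\end{verbatim}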
458 Whereas previous work had compared
434 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs 459 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs
435 here because of the very large datasets used (making the use of SVMs 460 here because of the very large datasets used (making the use of SVMs
436 computationally challenging because of their quadratic scaling 461 computationally challenging because of their quadratic scaling
437 behavior). Preliminary experiments on training SVMs (libSVM) with subsets 462 behavior). Preliminary experiments on training SVMs (libSVM) with subsets
438 of the training set allowing the program to fit in memory yielded 463 of the training set allowing the program to fit in memory yielded
446 better implementation allowing for training with more examples and 471 better implementation allowing for training with more examples and
447 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the 472 a higher-order non-linear projection.} For training on nearly a hundred million examples (with the
448 perturbed data), the MLPs and SDA are much more convenient than classifiers 473 perturbed data), the MLPs and SDA are much more convenient than classifiers
449 based on kernel methods. The MLP has a single hidden layer with $\tanh$ 474 based on kernel methods. The MLP has a single hidden layer with $\tanh$
450 activation functions, and softmax (normalized exponentials) on the output 475 activation functions, and softmax (normalized exponentials) on the output
451 layer for estimating $P(class | image)$. The number of hidden units is 476 layer for estimating $P({\rm class} | {\rm input})$. The number of hidden units is
452 taken in $\{300,500,800,1000,1500\}$. Training examples are presented in 477 taken in $\{300,500,800,1000,1500\}$. Training examples are presented in
453 minibatches of size 20. A constant learning rate was chosen among $\{0.001, 478 minibatches of size 20, i.e., the parameters are iteratively updated in the direction
479 of the mean gradient of the next 20 examples. A constant learning rate was chosen among $\{0.001,
454 0.01, 0.025, 0.075, 0.1, 0.5\}$. 480 0.01, 0.025, 0.075, 0.1, 0.5\}$.
455 %through preliminary experiments (measuring performance on a validation set), 481 %through preliminary experiments (measuring performance on a validation set),
456 %and $0.1$ (which was found to work best) was then selected for optimizing on 482 %and $0.1$ (which was found to work best) was then selected for optimizing on
457 %the whole training sets. 483 %the whole training sets.
458 %\vspace*{-1mm} 484 %\vspace*{-1mm}
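As a sketch of the training loop just described (minibatches of size 20, parameters moved along the mean gradient with a constant learning rate); {\tt grad\_fn} is an assumed helper returning the mean minibatch gradients, not something taken from the paper's code:
\begin{verbatim}
import numpy as np

def sgd_train(params, grad_fn, X, y, learning_rate=0.1,
              minibatch_size=20, n_epochs=1):
    # params: list of NumPy arrays, updated in place
    # grad_fn(params, Xb, yb): mean gradient of the loss over the minibatch
    n = len(X)
    for _ in range(n_epochs):
        for start in range(0, n - minibatch_size + 1, minibatch_size):
            Xb = X[start:start + minibatch_size]
            yb = y[start:start + minibatch_size]
            for p, g in zip(params, grad_fn(params, Xb, yb)):
                p -= learning_rate * g   # constant rate, e.g. chosen in {0.001,...,0.5}
    return params
\end{verbatim}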
484 tutorial and code there: {\tt http://deeplearning.net/tutorial}), 510 tutorial and code there: {\tt http://deeplearning.net/tutorial}),
485 provides efficient inference, and yielded results 511 provides efficient inference, and yielded results
486 comparable to or better than RBMs in a series of experiments 512 comparable to or better than RBMs in a series of experiments
487 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian 513 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian
488 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}. 514 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}.
489 During training, a Denoising 515 During its unsupervised training, a Denoising
490 Auto-encoder is presented with a stochastically corrupted version 516 Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$
491 of the input and trained to reconstruct the uncorrupted input, 517 of the input $x$ and trained to produce a reconstruction $z$
492 forcing the hidden units to represent the leading regularities in 518 of the uncorrupted input $x$. The need to denoise forces the
493 the data. Here we use the random binary masking corruption 519 hidden units $y$ to represent the leading regularities in
494 (which sets to 0 a random subset of the inputs). 520 the data. Following~\citep{VincentPLarochelleH2008-very-small},
495 Once it is trained, in a purely unsupervised way, 521 the hidden units' output $y$ is obtained through
496 its hidden units' activations can 522 \[
497 be used as inputs for training a second one, etc. 523 y={\rm sigm}(c+V x)
524 \]
525 where ${\rm sigm}(a)=1/(1+\exp(-a))$
526 and the reconstruction is
527 \[
528 z={\rm sigm}(d+V' y).
529 \]
530 We minimize the training
531 set average of the cross-entropy
532 reconstruction error
533 \[
534 L_H(x,z)=-\sum_i \left[ x_i \log z_i + (1-x_i) \log(1-z_i) \right].
535 \]
536 Here we use the random binary masking corruption
537 (which forms $\tilde{x}$ by setting a random subset of the elements of $x$ to 0 and
538 copying the rest).
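To make the above concrete, here is a small NumPy sketch of a single denoising step (our illustration, not the paper's implementation; we assume tied weights, i.e., the decoder matrix $V'$ is taken as $V^\top$ as in the {\tt deeplearning.net} tutorial, and the sizes are made up):
\begin{verbatim}
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def masking_corruption(x, fraction, rng):
    # set a randomly selected fraction of the elements of x to 0, copy the rest
    return x * (rng.rand(x.shape[0]) >= fraction)

def dae_loss(x, V, c, d, fraction, rng):
    x_tilde = masking_corruption(x, fraction, rng)   # corrupted input
    y = sigm(c + V.dot(x_tilde))                     # hidden code y = sigm(c + V x~)
    z = sigm(d + V.T.dot(y))                         # reconstruction z = sigm(d + V' y)
    # cross-entropy reconstruction error L_H(x, z)
    return -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

rng = np.random.RandomState(0)
x = rng.rand(1024)                                   # assumed 32x32 input in [0, 1]
V, c, d = 0.01 * rng.randn(500, 1024), np.zeros(500), np.zeros(1024)
print(dae_loss(x, V, c, d, fraction=0.2, rng=rng))
\end{verbatim}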
539 Once the first denoising auto-encoder is trained, its parameters can be used
540 to set the first layer of the deep MLP. The original data are then processed
541 through that first layer, and the outputs of the hidden units form a new
542 representation that can be used as input data for training a second denoising
543 auto-encoder, still in a purely unsupervised way.
544 This is repeated for the desired number of hidden layers.
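Schematically, the greedy layer-wise procedure can be sketched as follows (the helper {\tt train\_dae}, which trains one denoising auto-encoder without supervision and returns its encoder parameters, is an assumption of ours, not part of the paper):
\begin{verbatim}
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def greedy_pretrain(X, hidden_sizes, train_dae):
    # X: (n_examples, n_features) uncorrupted data
    # hidden_sizes: e.g. [1000, 1000, 1000] for 3 hidden layers
    layers, rep = [], X
    for n_hidden in hidden_sizes:
        V, c = train_dae(rep, n_hidden)   # unsupervised training of one DAE on 'rep'
        layers.append((V, c))
        rep = sigm(c + rep.dot(V.T))      # new representation = hidden-unit outputs
    return layers                         # used to initialize the deep MLP (see below)
\end{verbatim}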
498 After this unsupervised pre-training stage, the parameters 545 After this unsupervised pre-training stage, the parameters
499 are used to initialize a deep MLP, which is fine-tuned by 546 are used to initialize a deep MLP (similar to the above, but
500 the same standard procedure used to train them (see above). 547 with more layers), which is fine-tuned by
548 the same standard procedure (stochastic gradient descent)
549 used to train MLPs in general (see above).
550 The top-layer parameters of the deep MLP (the layer that outputs the
551 class probabilities and takes the top hidden layer as input) can
552 be initialized to 0.
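In code form (a sketch with shapes and names of our own choosing), the initialization described above amounts to copying the pre-trained encoder parameters into the hidden layers and zeroing the top softmax layer before supervised fine-tuning:
\begin{verbatim}
import numpy as np

def init_deep_mlp(pretrained_layers, n_classes):
    # pretrained_layers: list of (V, c) pairs from the stacked denoising auto-encoders
    params = [(V.copy(), c.copy()) for (V, c) in pretrained_layers]
    n_top_inputs = pretrained_layers[-1][0].shape[0]   # size of the top hidden layer
    W_out = np.zeros((n_classes, n_top_inputs))        # top-layer weights initialized to 0
    b_out = np.zeros(n_classes)
    params.append((W_out, b_out))
    return params   # then fine-tune all parameters by stochastic gradient descent
\end{verbatim}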
501 The SDA hyper-parameters are the same as for the MLP, with the addition of the 553 The SDA hyper-parameters are the same as for the MLP, with the addition of the
502 amount of corruption noise (we used the masking noise process, whereby a 554 amount of corruption noise (we used the masking noise process, whereby a
503 fixed proportion of the input values, randomly selected, are zeroed), and a 555 fixed proportion of the input values, randomly selected, are zeroed), and a
504 separate learning rate for the unsupervised pre-training stage (selected 556 separate learning rate for the unsupervised pre-training stage (selected
505 from the same set of values as above). The fraction of inputs corrupted was selected 557 from the same set of values as above). The fraction of inputs corrupted was selected
506 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number 558 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
507 of hidden layers, but it was fixed to 3 for most experiments, 559 of hidden layers, but it was fixed to 3 for our experiments,
508 based on previous work with 560 based on previous work with
509 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. 561 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}.
510 We also compared against 1 and against 2 hidden layers, in order 562 We also compared against 1 and against 2 hidden layers, in order
511 to disentangle the effect of depth from the effect of unsupervised 563 to disentangle the effect of depth from the effect of unsupervised
512 pre-training. 564 pre-training.
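For reference, the hyper-parameter grids quoted in this section can be enumerated as below (an illustrative listing only; the paper selects the configuration with the lowest NISTP validation error, and the search itself is not reproduced here):
\begin{verbatim}
from itertools import product

hidden_units      = [300, 500, 800, 1000, 1500]
learning_rates    = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]   # supervised training
pretrain_rates    = [0.001, 0.01, 0.025, 0.075, 0.1, 0.5]   # SDA pre-training (same set)
corruption_levels = [0.10, 0.20, 0.50]                      # fraction of masked inputs
n_hidden_layers   = [1, 2, 3]                               # 3 used for most experiments

sda_grid = list(product(hidden_units, learning_rates, pretrain_rates,
                        corruption_levels, n_hidden_layers))
print(len(sda_grid), "candidate SDA configurations")
\end{verbatim}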