changeset 631:510220effb14

corrections requested by reviewer
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 19 Mar 2011 22:44:53 -0400
parents f55f1b1499c4
children 13baba8a4522 54e8958e963b
files writeup/aistats2011_cameraready.tex
diffstat 1 files changed, 68 insertions(+), 16 deletions(-) [+]
line wrap: on
line diff
--- a/writeup/aistats2011_cameraready.tex	Thu Mar 17 15:51:43 2011 -0400
+++ b/writeup/aistats2011_cameraready.tex	Sat Mar 19 22:44:53 2011 -0400
@@ -10,7 +10,7 @@
 \usepackage[psamsfonts]{amssymb}
 %\usepackage{algorithm,algorithmic} % not used after all
 \usepackage{graphicx,subfigure}
-\usepackage[numbers]{natbib}
+\usepackage{natbib}
 
 \addtolength{\textwidth}{10mm}
 \addtolength{\evensidemargin}{-5mm}
@@ -430,7 +430,32 @@
 hidden layer) and deep SDAs.
 \emph{Hyper-parameters are selected based on the {\bf NISTP} validation set error.}
 
-{\bf Multi-Layer Perceptrons (MLP).}  Whereas previous work had compared
+{\bf Multi-Layer Perceptrons (MLP).}  The MLP output estimates 
+\[
+P({\rm class}|{\rm input}=x)
+\]
+with 
+\[
+f(x)={\rm softmax}(b_2+W_2\tanh(b_1+W_1 x)),
+\] 
+i.e., two layers, where 
+\[
+ p={\rm softmax}(a)
+\]
+means that 
+\[
+ p_i=\exp(a_i)/\sum_j \exp(a_j),
+\] 
+representing the probability 
+for class $i$; $\tanh$ is the element-wise
+hyperbolic tangent, $b_i$ are parameter vectors, and $W_i$ are 
+parameter matrices (one per layer). The
+number of rows of $W_1$ is called the number of hidden units (of the
+single hidden layer, here), and
+is one way to control capacity (the main other ways to control capacity are
+the number of training iterations and optionally a regularization penalty
+on the parameters, not used here because it did not help).
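+% decision rule spelled out for clarity; \hat{c} is notation introduced here
+Writing $\hat{c}(x)$ for the predicted class, the decision is taken to be the
+class with highest estimated probability, i.e.,
+\[
+ \hat{c}(x)=\arg\max_i f_i(x).
+\]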
+Whereas previous work had compared
 deep architectures to both shallow MLPs and SVMs, we only compared to MLPs
 here because of the very large datasets used (making the use of SVMs
 computationally challenging because of their quadratic scaling
@@ -448,9 +473,10 @@
 perturbed data), the MLPs and SDA are much more convenient than classifiers
 based on kernel methods.  The MLP has a single hidden layer with $\tanh$
 activation functions, and softmax (normalized exponentials) on the output
-layer for estimating $P(class | image)$.  The number of hidden units is
+layer for estimating $P({\rm class} | {\rm input})$.  The number of hidden units is
 taken in $\{300,500,800,1000,1500\}$.  Training examples are presented in
-minibatches of size 20. A constant learning rate was chosen among $\{0.001,
+minibatches of size 20, i.e., the parameters are iteratively updated in the direction
+opposite to the mean gradient of the loss over the next 20 examples. A constant learning rate was chosen among $\{0.001,
 0.01, 0.025, 0.075, 0.1, 0.5\}$.
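+% update rule added for concreteness; \theta, \epsilon and t_k are notation introduced here
+For concreteness, writing $\theta$ for the set of all parameters, $\epsilon$ for the learning
+rate, and taking the training criterion to be the negative conditional log-likelihood,
+each update has the form
+\[
+ \theta \leftarrow \theta - \epsilon \, \frac{1}{20} \sum_{k=1}^{20} \nabla_\theta \left( -\log f_{t_k}(x_k) \right),
+\]
+where $(x_k,t_k)$ are the input and target class of the $k$-th example in the minibatch,
+and $f_{t_k}(x_k)$ is the output probability assigned to the correct class.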
 %through preliminary experiments (measuring performance on a validation set),
 %and $0.1$ (which was found to work best) was then selected for optimizing on
@@ -486,25 +512,51 @@
 comparable or better than RBMs in a series of experiments
 \citep{VincentPLarochelleH2008-very-small}. It really corresponds to a Gaussian
 RBM trained by a Score Matching criterion~\citep{Vincent-SM-2010}.
-During training, a Denoising
-Auto-encoder is presented with a stochastically corrupted version
-of the input and trained to reconstruct the uncorrupted input,
-forcing the hidden units to represent the leading regularities in
-the data. Here we use the random binary masking corruption
-(which sets to 0 a random subset of the inputs).
- Once it is trained, in a purely unsupervised way, 
-its hidden units' activations can
-be used as inputs for training a second one, etc.
+During its unsupervised training, a Denoising
+Auto-encoder is presented with a stochastically corrupted version $\tilde{x}$
+of the input $x$ and trained to produce a reconstruction $z$ 
+of the uncorrupted input $x$. Because the network has to denoise, the
+hidden units $y$ are forced to represent the leading regularities in
+the data. Following~\citet{VincentPLarochelleH2008-very-small}, 
+the hidden units' output $y$ is obtained through 
+\[
+ y={\rm sigm}(c+V x)
+\]
+where ${\rm sigm}(a)=1/(1+\exp(-a))$
+and the reconstruction is 
+\[ 
+ z={\rm sigm}(d+V' y).
+\]
+We minimize the training
+set average of the cross-entropy
+reconstruction error 
+\[
+ L_H(x,z)= - \sum_i \left[ x_i \log z_i + (1-x_i) \log(1-z_i) \right].
+\]
+Here we use the random binary masking corruption,
+which forms $\tilde{x}$ by setting a random subset of the elements of $x$
+to 0 and copying the rest.
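+% \nu and S are notation introduced here to make the corruption process explicit
+Denoting by $\nu$ the corrupted fraction (a hyper-parameter, see below) and by $S$ a random
+subset containing a fraction $\nu$ of the input indices, this corresponds to
+\[
+ \tilde{x}_i = 0 \;\; {\rm if} \;\; i \in S, \qquad \tilde{x}_i = x_i \;\; {\rm otherwise}.
+\]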
+Once the first denoising auto-encoder is trained, its parameters can be used
+to set the first layer of the deep MLP. The original data are then processed
+through that first layer, and the outputs of the hidden units form a new
+representation that can be used as input data for training a second denoising
+auto-encoder, still in a purely unsupervised way.
+This is repeated for the desired number of hidden layers.
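+% layer superscripts are notation introduced here to summarize the stacking
+Writing $y^{(0)}=x$ and $(c^{(k)},V^{(k)})$ for the parameters of the $k$-th denoising
+auto-encoder, the representation produced at level $k$ is thus
+\[
+ y^{(k)}={\rm sigm}(c^{(k)}+V^{(k)} y^{(k-1)}),
+\]
+each auto-encoder being trained on the representation $y^{(k-1)}$ produced by the layers below it.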
 After this unsupervised pre-training stage, the parameters
-are used to initialize a deep MLP, which is fine-tuned by
-the same standard procedure used to train them (see above).
+are used to initialize a deep MLP (similar to the above, but
+with more layers), which is fine-tuned by
+the same standard procedure (stochastic gradient descent)
+used to train MLPs in general (see above).
+The parameters of the top layer of the deep MLP (the layer which outputs the
+class probabilities and takes the top hidden layer as input) can
+be initialized to 0.
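+% spelled out using the layer-wise notation introduced above; b_{\rm out} and W_{\rm out} are introduced here
+With the notation above and $\ell$ hidden layers, the fine-tuned network thus computes
+\[
+ f(x)={\rm softmax}(b_{\rm out}+W_{\rm out}\, y^{(\ell)}(x)),
+\]
+with $(b_{\rm out},W_{\rm out})$ initialized to 0 and the lower layers initialized from the
+pre-trained denoising auto-encoders.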
 The SDA hyper-parameters are the same as for the MLP, with the addition of the
 amount of corruption noise (we used the masking noise process, whereby a
 fixed proportion of the input values, randomly selected, are zeroed), and a
 separate learning rate for the unsupervised pre-training stage (selected
 from the same above set). The fraction of inputs corrupted was selected
 among $\{10\%, 20\%, 50\%\}$. Another hyper-parameter is the number
-of hidden layers but it was fixed to 3 for most experiments,
+of hidden layers, but it was fixed to 3 for our experiments,
 based on previous work with
 SDAs on MNIST~\citep{VincentPLarochelleH2008-very-small}. 
 We also compared against 1 and against 2 hidden layers, in order