comparison writeup/techreport.tex @ 411:4f69d915d142

Better description of the model parameters.
author Arnaud Bergeron <abergeron@gmail.com>
date Thu, 29 Apr 2010 13:18:15 -0400
parents 6330298791fb
children 1e9788ce1680
An MLP is a family of functions described by stacking layers of a function similar to
$$g(x) = \tanh(b+Wx)$$
The input, $x$, is a $d$-dimensional vector.
The output, $g(x)$, is an $m$-dimensional vector.
The parameter $W$ is an $m\times d$ matrix and is called the weight matrix.
The parameter $b$ is an $m$-vector and is called the bias vector.
The non-linearity (here $\tanh$) is applied element-wise to the output vector.
Usually the input is referred to as the input layer, and similarly for the output.
You can of course chain several such functions to obtain a more complex one.
Here is a common example:
$$f(x) = c + V\tanh(b+Wx)$$
In this case the intermediate layer corresponding to $\tanh(b+Wx)$ is called a hidden layer.
Here the output layer does not have the same non-linearity as the hidden layer.
This is a common case: depending on the task at hand, a specialized non-linearity is applied to the output layer only.
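As a concrete illustration, here is a minimal NumPy sketch of such a two-layer function; this is not the code used for the experiments, and the dimensions, the initialization scheme and the linear output layer are arbitrary assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
d, m, k = 784, 500, 10                # input, hidden and output sizes (arbitrary)
W = rng.uniform(-0.01, 0.01, (m, d))  # hidden-layer weight matrix
b = np.zeros(m)                       # hidden-layer bias vector
V = rng.uniform(-0.01, 0.01, (k, m))  # output-layer weight matrix
c = np.zeros(k)                       # output-layer bias vector

def f(x):
    """f(x) = c + V tanh(b + Wx); the inner part is the hidden layer g(x)."""
    h = np.tanh(b + W.dot(x))         # hidden layer, element-wise tanh
    return c + V.dot(h)               # output layer, no non-linearity here

y = f(rng.rand(d))                    # maps a d-vector to a k-vector
\end{verbatim}
A task-specific non-linearity, for instance a softmax for classification, would then be applied to the returned vector.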
If you put 3 or more hidden layers in such a network, you obtain what is called a deep MLP.
The parameters to adapt are the weight matrix and the bias vector for each layer.
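To make this explicit, a deep MLP can be sketched as a list of per-layer $(W, b)$ pairs; again this is only an illustration with arbitrary layer sizes, not the configuration used in the experiments.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
sizes = [784, 500, 500, 500, 10]      # input, three hidden layers, output (arbitrary)
layers = [(rng.uniform(-0.01, 0.01, (m, d)), np.zeros(m))
          for d, m in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Apply tanh(b + Wx) at every layer except the last, which stays linear."""
    for W, b in layers[:-1]:
        x = np.tanh(b + W.dot(x))
    W, b = layers[-1]
    return b + W.dot(x)
\end{verbatim}
Training then amounts to adapting every $(W, b)$ pair in this list.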
\subsubsection{Stacked Denoising Auto-Encoders (SDAE)}

Auto-encoders are essentially a way to initialize the weights of the network to enable better generalization.
Denoising auto-encoders are a variant in which the input is corrupted with random noise before the network tries to repair it.
The principle behind these initialization methods is that the network will learn the inherent relations between portions of the data and be able to represent them, thus helping with whatever task we want to perform.
An auto-encoder unit is formed of two MLP layers, the bottom one called the encoding layer and the top one the decoding layer.
Usually the top and bottom weight matrices are the transpose of each other and are kept tied this way.
The auto-encoder is trained as such and, once sufficiently trained, the corresponding MLP layer is initialized with the parameters of the encoding layer.
The other parameters are discarded.
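The sketch below illustrates one training update for such a unit; it is only an illustration, where the sigmoid units, masking noise, squared reconstruction error, sizes, learning rate and corruption level are all assumptions rather than the settings used in the experiments.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)
d, m = 784, 500                       # visible and hidden sizes (arbitrary)
W = rng.uniform(-0.01, 0.01, (m, d))  # encoding weights; decoding uses W.T (tied)
b, c = np.zeros(m), np.zeros(d)       # encoding and decoding biases
lr, noise = 0.1, 0.25                 # learning rate and corruption level (arbitrary)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x):
    """One stochastic gradient step on the squared reconstruction error."""
    global W, b, c
    x_tilde = x * (rng.rand(d) > noise)  # corrupt: randomly zero out some inputs
    h = sigmoid(b + W.dot(x_tilde))      # encoding layer
    z = sigmoid(c + W.T.dot(h))          # decoding layer (tied weights)
    dz = (z - x) * z * (1 - z)           # gradient at the decoder pre-activation
    dh = W.dot(dz) * h * (1 - h)         # gradient at the encoder pre-activation
    W -= lr * (np.outer(dh, x_tilde) + np.outer(h, dz))
    b -= lr * dh
    c -= lr * dz
\end{verbatim}
After training, only $W$ and $b$ would be kept to initialize the corresponding MLP layer; the decoding bias $c$ is discarded.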
The stacked version is an adaptation to deep MLPs where you initialize each layer with a denoising auto-encoder, starting from the bottom.
During the initialization, which is usually called pre-training, the bottom layer is treated as if it were an isolated auto-encoder.
The second and following layers receive the same treatment, except that they take as input the encoded version of the data that has gone through the layers before them.
For additional details see \cite{vincent:icml08}.
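To make the layer-wise procedure concrete, here is a rough sketch of the pre-training loop; \texttt{train\_dae} is a hypothetical placeholder standing in for training a denoising auto-encoder as above, and the toy data set and layer sizes are arbitrary.
\begin{verbatim}
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(data, m):
    # Placeholder: in practice this would train a denoising auto-encoder
    # on `data` and return its encoding parameters (W, b).
    W = rng.uniform(-0.01, 0.01, (m, data.shape[1]))
    return W, np.zeros(m)

# Greedy layer-wise pre-training: the bottom layer sees the raw data,
# each following layer sees the data encoded by the layers below it.
data = rng.rand(100, 784)               # toy data set (arbitrary)
layers = []
for m in [500, 500, 500]:               # hidden-layer sizes (arbitrary)
    W, b = train_dae(data, m)
    layers.append((W, b))               # these (W, b) initialize the deep MLP
    data = sigmoid(b + data.dot(W.T))   # encoded input for the next layer
\end{verbatim}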
\section{Experimental Results}

\subsection{SDA vs MLP}