comparison doc/v2_planning/layer_RP.txt @ 1231:5ef96142492b

some typos
author Razvan Pascanu <r.pascanu@gmail.com>
date Wed, 22 Sep 2010 20:17:35 -0400
parents 515033d4d3bf
children 32fc5f442dde

Proposal (RP)
=============

You construct your neural network by building a graph of connections
between layers, starting from the data. While you construct the graph,
different Theano formulas are put together to form your model.

Hard details are not set yet, but all members of the committee agreed
that this sounds like a good idea.

[...]


Global observations :
---------------------

1) Your graph can have multiple terminal nodes; in this case rbm1,
   rbm2 and learner, valid_err, test_err are all end nodes of the graph;

2) Any node is an "iterator"; when you call out.next() you get the next
   prediction; when you call err.next() you get the next error
   ( on the batch given by data.next() ). A short sketch of this usage
   is given right after this list.

3) Replace can replace any subgraph.

4) You can have MACROS or SUBROUTINES that already give you the graph for
   known components ( in my view CDk is such a macro, but simpler
   examples would be vanilla versions of MLP, DAA, DBN, LOGREG ).

5) Any node has the entire graph ( though arguably you don't use that
   graph too much ). Running such a node will in general be done by compiling
   the Theano expression up to that node ( if you don't already have this
   function ) and using the data object that you got initially. This Theano
   function is compiled only if you need it. You use the graph only to :
     * update the Theano expression in case some part of the subgraph has
       changed ( a hyper-parameter or a replace call )
     * collect the list of parameters of the model
     * collect the list of hyper-parameters ( my personal view - this
       would mostly be useful for a hyper-learner .. and not for day to
       day stuff, but I think it is something easy to provide and we should )
     * collect constraints on parameters ( I believe they can be represented
       in the graph as dependency links to other graphs that compute the
       constraints .. )

6) Registering parameters and hyper-parameters to the graph is the job of
   the transform and therefore of the user who implemented that
   transform; the same goes for initializing the parameters ( so if we have
   different ways to initialize the weight matrix, that might be a
   hyper-parameter with a default value ).



Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues :

Delayed or future values
-------------------------

Sometimes you might want future values of some nodes. For example you might
be interested in :

  y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version
of a node x is obtained by calling x.t(k), which will give you a node that
has the value x(t+k). k can be positive or negative ( a short sketch of this
usage follows the list below ).
In my view this can be done as follows :
 - a node is a class that points to :
    * a data object that feeds data
    * a theano expression up to that point
    * the entire graph that describes the model ( not the Theano graph !!! )
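
As a tiny illustration of the delayed-node idea sketched above ( x.t(k) is
the hypothetical call proposed here, nothing of it exists yet ) :

  # x is a node producing a sequence x(t); x.t(-1) is the same node one step back
  y = x - x.t(-1)        # y(t) = x(t) - x(t-1)
  print y.next()         # like any node, y is an iterator over its values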

[...]

W1*f( W2*data + b)

I think we can support that by doing the following :
  each node has :
    * a data object that feeds data
    * a theano expression up to that point
    * the entire graph that describes the model

Let x1 = W2*data + b
[...]

have an initial value, or a slice of some other node ( a time slice, that is ).
Then you call loop, giving an expression that starts from those primitives.

Similarly you can have foldl or map or anything else.

You would use this instead of writing scan, especially if the formula is
more complicated and you want to automatically collect parameters,
hyper-parameters and so on. A short sketch follows.
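
A rough sketch of what such a loop call could look like ( loop, node and the
keyword arguments are assumptions of this proposal, not an existing API;
underneath it would presumably build something like theano.scan ) :

  s0 = node(init = 0)                       # a primitive with an initial value
  s  = loop(lambda x_t, s_tm1: s_tm1 + x_t,
            sequences = x, initial = s0)    # running sum over the sequence fed by x
  print s.next()                            # parameters used inside the expression
                                            # would be collected via s.parameters()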

Optimizer
---------

Personally I would respect the findings of the optimization committee,
and have SGD require a Node that produces some error ( which can
[...]
but just an eval(). If you want the predictions you should use the
corresponding node ( before applying the error measure ). This was,
for example, **out** in my first example.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterating behaviour
of the node should produce.
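
A minimal sketch of that split, with SGD taking only an error node ( again a
hypothetical API, following my reading of the optimization committee's
interface ) :

  learner = SGD(err, lr = 0.01)   # err is a node producing a scalar error

  print learner.next()            # one update step; yields the training error
  print out.next()                # predictions come from the node *before*
                                  # the error measure, **out** in the first example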

Granularity
-----------

[...]

  apply( lambda x, y, z: x*y + z, inputs = x,
                                  hyperparams = [(name, 2)],
                                  params = [(name, theano.shared(..))] )

The order of the arguments in the lambda is nodes, params, hyper-params or so.
This would apply the theano expression but it would also register the
parameters. It is like creating a transform on the fly.

I think you can make it such that the result of the apply is
picklable, but not the apply operation itself. Meaning that in the graph, the
op doesn't actually store the lambda expression but a mini Theano graph.
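
One way to get that picklability ( a sketch only; apply and Node here are the
hypothetical pieces of this proposal, and I assume matrix-shaped inputs ) :
trace the lambda once on symbolic variables when the node is built, and keep
only the resulting Theano expression, which pickles, while the lambda itself
is thrown away.

  import theano.tensor as T

  def apply(fn, inputs, params = (), hyperparams = ()):
      x_sym = T.matrix('x')              # symbolic stand-in for the input node
      expr  = fn(x_sym, *params)         # call the lambda exactly once ...
      # ... and store only (x_sym, expr, params) on the node; fn is dropped,
      # so the node pickles even though the lambda would not
      return Node(inputs = inputs, expr = expr, expr_inputs = [x_sym],
                  params = params, hyperparams = hyperparams)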

Also names might be optional, so you could write hyperparams = [2,].


What this way of doing things would buy you, hopefully, is that you do not
need to worry about most of your model ( it would be just a few macros or
subroutines ).
You would do something like :

  rbm1, hidden1 = rbm_layer(data, 20)
  rbm2, hidden2 = rbm_layer(data, 20)

and then the part you care about :

  hidden3 = apply( lambda x, W: T.dot(x, W), inputs = hidden2,
                   params = theano.shared(scipy.sparse_CSR(..)) )

and after that you potentially still do what you did before :

  err   = cross_entropy(hidden3, target)
  grads = grad(err, err.parameters())
  ...

I do agree that some of the "transforms" that I have been writing here
and there are pretty low level, and maybe we don't need them. We might need
only somewhat higher-level transforms. My hope is that for now people think
about the approach and not about all the inner details ( like what transforms
we need and so on ) and see if they are comfortable with it or not.

Do we want to think in these terms? I think it is a bit better to have your
script like that than to take a normal Python class, hack it to change
something, and then either add a parameter to __init__ or create a new
version. It seems a bit more natural.



Anyhow, Guillaume, I'm working on a better answer :)


Params and hyperparams