
a first draft of layer committee
author Razvan Pascanu <r.pascanu@gmail.com>
date Wed, 22 Sep 2010 19:43:24 -0400
===============
Layer committee
===============

Members : RP, XG, AB, DWF

Proposal (RP)
=============

You construct your neural network by building a graph of connections
between layers, starting from the data. As you construct the graph,
the corresponding Theano formulas are put together to form your model.

The hard details are not set yet, but all members of the committee agreed
that this sounds like a good idea.


Example Code (RP):
------------------

# Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

h1   = sigmoid(dotW_b(train_x, n = 300))
rbm1 = CDk(h1, train_x, k = 5, sampler = binomial, cost = pseudolikelihood)

h2   = sigmoid(dotW_b(h1, n = 300))
rbm2 = CDk(h2, h1, k = 5, sampler = binomial, cost = pseudolikelihood)

out = sigmoid(dotW_b(h2, n = 10))

train_err = cross_entropy(out, train_y)

grads   = grad(train_err, train_err.parameters())
learner = SGD(train_err, grads)

valid_err = train_err.replace({train_x : valid_x, train_y : valid_y})
test_err  = train_err.replace({train_x : test_x,  train_y : test_y})

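None of the names above (dotW_b, CDk, cross_entropy, grad, SGD) exist yet; they are
exactly the transforms this proposal is about. Just to make the mapping to Theano
concrete, here is a rough sketch of what a transform like dotW_b followed by sigmoid
could expand to; the shapes, initialization and function name are made up, not a
committed interface:

    import numpy
    import theano
    import theano.tensor as T

    def dotW_b_sketch(x, n_in, n_out, rng=numpy.random):
        # Hypothetical expansion of dotW_b: allocate W and b as shared
        # variables and return the affine expression plus its parameters.
        W = theano.shared(rng.uniform(-0.01, 0.01, (n_in, n_out)), name='W')
        b = theano.shared(numpy.zeros(n_out), name='b')
        return T.dot(x, W) + b, [W, b]

    # Building the first layer of the example by hand (784 inputs is an
    # arbitrary choice):
    train_x = T.matrix('train_x')
    h1_pre, h1_params = dotW_b_sketch(train_x, n_in=784, n_out=300)
    h1 = T.nnet.sigmoid(h1_pre)
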
Global observations :
---------------------

1) Your graph can have multiple terminations; in this case rbm1, rbm2, learner,
valid_err and test_err are all end nodes of the graph;

2) Any node is an "iterator"; when you call out.next() you get the next prediction,
and when you call train_err.next() you get the next error ( on the batch given by the data ).

3) Replace can replace any subgraph.

4) You can have MACROS or SUBROUTINES that already give you the graph for known components ( in my
view CDk is such a macro, but simpler examples would be vanilla versions of MLP, DAA, DBN, LOGREG).

5) Any node has a pointer to the graph ( though arguably you don't use that graph that much). Running
such a node is in general done by compiling the Theano expression up to that node, and using the
data object that you got initially. This Theano function is compiled lazily, in the sense that it is
compiled when you try to iterate through the node (see the sketch after this list). You use the graph only to :
  * update the Theano expression in case some part of the subgraph has been changed
  * collect the list of parameters of the model
  * collect the list of hyper-parameters ( my personal view - this would mostly be useful for a
  hyper-learner and not on a day-to-day basis, but I think it is something easy to provide and we should)
  * collect constraints on parameters ( I believe they can be inserted in the graph .. things like L1
  and so on )

6) Registering parameters and hyper-parameters with the graph is the job of the transform, and therefore
of the user who implemented that transform; the same goes for initializing the parameters ( so if we have
different ways to initialize the weight matrix, that should be a hyper-parameter with a default value).

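To make observations 2) and 5) concrete, here is a minimal sketch of what such a node
wrapper could look like. Everything in it (the class name, attribute names, the way the
data object is consumed, the parameters_below method) is an assumption made for
illustration, not a committed design:

    import theano

    class Node(object):
        """Hypothetical node wrapper: holds a Theano expression, the data
        object feeding it, and a pointer to the whole model graph."""

        def __init__(self, expr, inputs, data, graph):
            self.expr   = expr      # Theano expression up to this node
            self.inputs = inputs    # Theano input variables
            self.data   = data      # data object yielding numpy batches
            self.graph  = graph     # the model graph (not the Theano graph!)
            self._fn    = None      # compiled lazily

        def function(self):
            # Lazy compilation: only build the Theano function when the
            # node is actually iterated over.
            if self._fn is None:
                self._fn = theano.function(self.inputs, self.expr)
            return self._fn

        def next(self):
            # Pull the next batch from the data object and evaluate the
            # expression on it ("any node is an iterator").
            batch = self.data.next()
            return self.function()(*batch)

        def parameters(self):
            # Collect parameters registered by the transforms below this
            # node (parameters_below is a hypothetical graph method).
            return self.graph.parameters_below(self)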


Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues:

Delayed or future values
------------------------

Sometimes you might want future values of some nodes. For example you might be interested in :

y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version of a node x is obtained by
calling x.t(k), which gives you a node that has the value x(t+k). k can be positive or negative.
In my view this can be done as follows :
  - a node is a class that points to :
     * a data object that feeds data
     * a theano expression up to that point
     * the entire graph that describes the model ( not the Theano graph !!!)
The only thing you need to do is to change the data object to reflect the
delay ( we might need to be able to pad it with 0?). You also need to create
a copy of the theano expression ( those are "new nodes" ), in the sense that
the starting theano tensors are different since they point to different data
(see the sketch below).

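A minimal sketch of the data-object side of this, assuming the data object is just
something with a next() method that yields 2-d numpy arrays of shape (time, features);
the class name and the zero-padding choice are made up for illustration:

    import numpy

    class DelayedData(object):
        """Wrap a data object so that it yields x(t+k) instead of x(t),
        zero-padding the rows that fall outside the batch."""

        def __init__(self, data, k):
            self.data = data
            self.k = k

        def next(self):
            x = self.data.next()
            shifted = numpy.zeros_like(x)
            k = self.k
            if k > 0:
                shifted[:-k] = x[k:]    # value at t becomes x(t+k)
            elif k < 0:
                shifted[-k:] = x[:k]    # value at t becomes x(t+k), k < 0
            else:
                shifted[:] = x
            return shifted

    # x.t(k) would then wrap x's data object in DelayedData(data, k) and
    # rebuild the Theano expression on a fresh input tensor.
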
Non-theano transformation ( or function or whatever)
----------------------------------------------------

Maybe you want to do something in the middle of your graph that is not
supported by Theano. Let's say you have a function f which you cannot write in
Theano. You want to do something like

W1*f( W2*data + b)

I think we can support that by doing the following :
each node has :
 * a data object that feeds data
 * a theano expression up to that point
 * the entire graph that describes the model

Let x1 = W2*data + b
up to here everything is fine ( we have a theano expression )
    dot(W2, tensor) + b,
where tensor is provided by the data object ( plus a dict of givens
and whatever else you need to compile the function).

When you apply f, what you do is create a node that behaves exactly like a
data object, in the sense that it provides a new tensor and a new dict of
givens.

So x2 = W1*f( W2*data + b)
will actually point to the expression
    dot(W1, tensor)
and to the data node f(W2*data + b).

What this means is that you basically compile two theano functions t1 and t2
and evaluate t2(f(t1(data))). So every time you have a non-Theano operation you
break the theano expression and start a new one (see the sketch below).

What you lose :
 - there is no optimization or anything between t1, t2 and f ( we don't
 support that)
 - if you are running things on the GPU, after t1 the data will be copied to the CPU and
 then probably back to the GPU - so using the GPU may not make sense anymore

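A concrete, if simplified, version of the t2(f(t1(data))) idea, written directly in
Theano with an arbitrary numpy function standing in for f; all names and shapes here
are illustrative, and the x*W convention is used instead of W*x:

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random.RandomState(0)

    W2 = theano.shared(rng.uniform(-0.1, 0.1, (5, 4)), name='W2')
    b  = theano.shared(numpy.zeros(4), name='b')
    W1 = theano.shared(rng.uniform(-0.1, 0.1, (4, 3)), name='W1')

    # t1 : the expression below the non-Theano function.
    data = T.matrix('data')
    t1 = theano.function([data], T.dot(data, W2) + b)

    # f : anything we cannot (or do not want to) express in Theano.
    def f(a):
        return numpy.clip(a, 0.0, 1.0)

    # t2 : the expression above f, starting from a fresh input tensor,
    # exactly like a new data object feeding the rest of the graph.
    h = T.matrix('h')
    t2 = theano.function([h], T.dot(h, W1))

    # Evaluating the whole chain: t2(f(t1(data))).
    out = t2(f(t1(rng.uniform(size=(2, 5)))))
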
Recurrent Things
----------------

I think that you can write a recurrent operation by first defining a
graph ( the recurrent relation ):

y_tm1 = recurrent_layer(init = zeros(50))
x_t   = slice(x, t = 0)
y     = loop( dotW_b(y_tm1, 50) + x_t, steps = 20)

This would basically give you all the information you need to add a scan op
to the theano expression of the result; it is just a different way of writing
things, which I think is more intuitive (see the scan sketch below).

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node ( a time slice, that is).
Then you call loop, giving it an expression that starts from those primitives.

Similarly you can have foldl or map or anything else.

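For reference, here is roughly what the recurrence above would have to produce under
the hood, written directly with theano.scan; recurrent_layer, slice and loop do not
exist yet, and the batch size and shapes are arbitrary assumptions:

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random.RandomState(0)

    W = theano.shared(rng.uniform(-0.1, 0.1, (50, 50)), name='W')
    b = theano.shared(numpy.zeros(50), name='b')

    x = T.tensor3('x')   # (time, batch, 50); the 20 steps come from the
                         # length of the sequence that is fed in

    def step(x_t, y_tm1):
        # one step of the recurrence: y(t) = dotW_b(y(t-1), 50) + x(t)
        return T.dot(y_tm1, W) + b + x_t

    y, updates = theano.scan(fn = step,
                             sequences = x,
                             outputs_info = T.zeros_like(x[0]))  # init = zeros(50)

    fn  = theano.function([x], y, updates = updates)
    out = fn(rng.uniform(size = (20, 3, 50)))
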
Optimizer
---------

Personally I would respect the findings of the optimization committee,
and have SGD require a Node that produces some error ( which can
be omitted) and the gradients. For this I would also have a grad
function, which would actually only call T.grad (see the sketch below).

What if you have a non-theano thing in the middle? I don't have any smart
solution besides ignoring any parameter that sits below the first
non-theano node and throwing a warning.

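A minimal sketch of what grad and SGD could reduce to on the Theano side, with a toy
cost standing in for train_err; the update rule and the learning rate are assumptions:

    import numpy
    import theano
    import theano.tensor as T

    # a toy cost with two registered parameters
    W = theano.shared(numpy.zeros((5, 3)), name='W')
    b = theano.shared(numpy.zeros(3), name='b')
    x = T.matrix('x')
    y = T.matrix('y')
    cost = T.sum((T.dot(x, W) + b - y) ** 2)
    params = [W, b]

    # grad : really just T.grad on the cost w.r.t. the registered parameters
    grads = T.grad(cost, params)

    # SGD : one compiled update step
    lr = 0.01
    updates = [(p, p - lr * g) for p, g in zip(params, grads)]
    sgd_step = theano.function([x, y], cost, updates = updates)

    # one training step on a random batch
    rng = numpy.random.RandomState(0)
    sgd_step(rng.uniform(size = (4, 5)), rng.uniform(size = (4, 3)))
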
Learner
-------

In my view the learner would not have both a predict() and an eval() method,
but just an eval(). If you want the predictions you should use the
corresponding node ( before applying the error measure ). This was
for example **out** in my first example.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterator behaviour
of the node should produce.

Granularity
-----------

Guillaume nicely pointed out that this library might be overkill,
in the sense that you have a dotW_b transform, and then you will need
a dotW_b_sparse transform and so on. Plus, each way of initializing a param
would result in many more transforms.

I don't have a perfect answer yet, but my argument goes like this :

You would have transforms for the most popular options ( dotW_b for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. More than decorating, you
can have a general apply transform that does something like :

apply( lambda x, y, z : x*y + z,
       inputs = x,
       hyperparams = [(name, 2)],
       params = [(name, theano.shared(..))])

The order of the arguments in the lambda is nodes, params, hyper-params or so.
This would apply the theano expression, but it will also register the
parameters. I think you can make it such that the result of the apply is
picklable, but not the apply itself; meaning that in the graph, the op doesn't
actually store the lambda expression but a mini theano graph (see the sketch below).

Also names might be optional, so you can write hyperparams = [2,].

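A rough sketch of how such an apply transform could register params and hyper-params
while keeping only the resulting Theano graph; the class, the function name and the
calling convention are all invented for illustration:

    class AppliedNode(object):
        """What apply() could return: the resulting Theano expression plus
        the params / hyper-params registered for it."""

        def __init__(self, expr, params, hyperparams):
            self.expr = expr
            self.params = params
            self.hyperparams = hyperparams

    def apply_transform(fn, inputs, params=(), hyperparams=()):
        # Call the user's lambda once, with nodes, then params, then
        # hyper-param values, and keep only the resulting Theano graph;
        # the lambda itself is not stored, so the node stays picklable.
        param_vars = [p for _, p in params]
        hyper_vals = [v for _, v in hyperparams]
        expr = fn(*(list(inputs) + param_vars + hyper_vals))
        return AppliedNode(expr, dict(params), dict(hyperparams))

    # Usage, mirroring the call above:
    #   node = apply_transform(lambda x, W, scale: scale * T.dot(x, W),
    #                          inputs = [some_input_expr],
    #                          params = [('W', theano.shared(...))],
    #                          hyperparams = [('scale', 2)])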

What this way of doing things would hopefully buy you is that you do not
need to worry about most of your model ( it would be just a few macros or
subroutines).
You would do something like :

rbm1, hidden1 = rbm_layer(data, 20)
rbm2, hidden2 = rbm_layer(hidden1, 20)

and then the part you care about :

hidden3 = apply( lambda x, W : T.dot(x, W),
                 inputs = hidden2,
                 params = theano.shared(scipy.sparse_CSR(..)))

and after that you potentially still do what you did before :

err   = cross_entropy(hidden3, target)
grads = grad(err, err.parameters())
...

I do agree that some of the "transforms" that I have been writing here
and there are pretty low level, and maybe we don't need them. We might need
only somewhat higher level transforms. My hope is that for now people think
of the approach and not of all the inner details ( like what transforms we need,
and so on) and see if they are comfortable with it or not.

Do we want to think in these terms? I think it is a bit better to have your
script like that than hacking into the DBN class to change that W to be
sparse.

Anyhow Guillaume, I'm working on a better answer :)


Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of that
class. I think there should be such a wrapper around parameters and
hyper-parameters as well (see the sketch below). By default those wrappers might
not provide any information. Later on, they can provide, for hyper-params for
example, a distribution. If, when inserting your hyper-param in the graph ( i.e.
when you call a given transform) you provide the distribution, then maybe a
hyper-learner could use it to sample from it.

For parameters you might define properties like freeze. It can be true or
false. Whenever it is set to true, the param is not adapted by the optimizer.
Changing this value, like changing most hyper-params, implies recompilation
of the graph.

I would have a special class of hyper-params which don't require
recompilation of the graph. Learning rate is an example. This info is also
given by the wrapper and by how the parameter is used.

It is up to the user and the "transform" implementer to wrap params and
hyper-params correspondingly. But I don't think this is too complicated.
The apply function above has a default behaviour; maybe you would have
a fourth type of argument, which is a hyper-param that doesn't require
recompilation. We could find a nice name for it.

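A sketch of what these wrappers could carry, purely as illustration; the field names
and the recompilation hook are assumptions:

    class Param(object):
        """Wrapper around a Theano shared variable holding a parameter."""

        def __init__(self, shared, name=None):
            self.shared  = shared
            self.name    = name or shared.name
            self._frozen = False

        @property
        def frozen(self):
            return self._frozen

        @frozen.setter
        def frozen(self, value):
            # Freezing / unfreezing changes which gradients the optimizer
            # computes, so the owning graph has to recompile.
            if value != self._frozen:
                self._frozen = value
                # self.graph.mark_dirty(self)  # hypothetical recompilation hook

    class HyperParam(object):
        """Wrapper around a hyper-parameter value.  `distribution` is
        optional; a hyper-learner could sample new values from it.
        `recompiles` says whether changing the value forces a graph
        recompilation (False for e.g. the learning rate)."""

        def __init__(self, value, name=None, distribution=None, recompiles=True):
            self.value = value
            self.name = name
            self.distribution = distribution
            self.recompiles = recompiles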

How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes ( or a param freezes) all regions of the graph affected get recompiled.
This is done by traversing the graph from the bottom node and reconstructing the
theano expression (see the sketch below).

The function that updates / re-constructs the graph is slightly more complex
if you have non-theano functions in the graph.

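A toy version of that traversal, assuming each node in the model graph knows its
parent nodes (`inputs`) and how to rebuild its own Theano expression from theirs
(`build_expr`); both attributes are invented for this sketch:

    def rebuild(node, cache=None):
        """Recursively reconstruct the Theano expression for `node`.

        Data nodes have no inputs and return a fresh input tensor;
        every other node asks its transform to rebuild its expression
        from the freshly rebuilt expressions of its parents."""
        if cache is None:
            cache = {}
        if node in cache:
            return cache[node]
        input_exprs = [rebuild(parent, cache) for parent in node.inputs]
        cache[node] = node.build_expr(input_exprs)
        return cache[node]
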
replace
-------

Replace replaces a part of the graph. The way it works in my view is that
if I write :

x = x1 + x2 + x3
y = x.replace({x2 : x5})

you would first copy the graph that is represented by x ( the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, and y will point to x1+x5+x3. Replace is not done
in place (see the sketch below).

I think of these Node classes as something lightweight, like theano variables.

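A sketch of a non-destructive replace on such a model graph, reusing the assumed
`inputs` attribute from the rebuild sketch above; params and hyper-params are shared
between the copies, as described:

    import copy

    def graph_replace(node, mapping, memo=None):
        """Return a copy of node's graph with the subgraphs in `mapping`
        swapped for their replacements.  Nodes are shallow-copied so the
        Param / HyperParam wrappers are shared, not cloned."""
        if memo is None:
            memo = {}
        if node in mapping:
            return mapping[node]
        if node in memo:
            return memo[node]
        new_node = copy.copy(node)          # shallow copy: params shared
        new_node.inputs = [graph_replace(parent, mapping, memo)
                           for parent in node.inputs]
        memo[node] = new_node
        return new_node

    # y = x.replace({x2 : x5}) would then amount to
    # graph_replace(x, {x2 : x5}), leaving x itself untouched.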

reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think though that it is a shame having that transformation graph and not
being able to use it to do this. It will make life so much easier when you
do deep auto-encoders. I wouldn't put it in the core library, but I would
have it in the DAA module. The way I see it you can either have something like

# generate your invertible transforms on the fly
fn  = create_transform(lambda : , params, hyper-params )
inv = create_transform(lambda : , params, hyper-params )
my_transform = couple_transforms( forward = fn, inv = inv)

# or have some already widely used such transforms in the daa submodule.

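A sketch of what couple_transforms might amount to; the class and method names are
invented and say nothing about the real interface:

    class CoupledTransform(object):
        """Pair a forward transform with its inverse so that a DAA module
        can reuse the encoding graph to build the reconstruction graph."""

        def __init__(self, forward, inv):
            self.forward = forward
            self.inv = inv

        def __call__(self, node):
            return self.forward(node)

        def reconstruct(self, node):
            return self.inv(node)
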
transforms
----------

In my view there will be quite a few such standard transforms. They
can be grouped by architecture, basic, sampler, optimizer and so on.

We do not need to provide all of them, just the ones we need. Research
on an architecture would actually lead to creating new such transforms in
the library.

There will definitely be a list of basic transforms in the beginning,
like :
  replace,
  search,
  get_param(name)
  get_params(..)

You can have, and should have, something like a switch ( that, based on a
hyper-parameter, replaces a part of the graph with another or not). This is
done by re-compiling the graph.


Constraints
-----------

Nodes can also keep track of constraints.

When you write

y = add_constraint(x, sum(x**2))

y is the same node as x, except that it also links to a second graph that
computes the constraint. Whenever you call grad, grad will also add to the
cost all the constraints attached to the graph (see the sketch below).

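A sketch of how grad could fold attached constraints into the cost before calling
T.grad; the way constraints are collected from the graph is an assumption:

    import theano.tensor as T

    def grad_with_constraints(cost_node, params):
        """Hypothetical grad transform: add every constraint expression
        attached to the graph to the cost, then differentiate."""
        total = cost_node.expr
        for constraint_expr in cost_node.graph.constraints():  # assumed accessor
            total = total + constraint_expr
        return T.grad(total, params)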