doc/v2_planning/layer_RP.txt @ 1237:32fc5f442dde

LAYER: slightly long but somewhat clearer rendering of what I have in mind
author Razvan Pascanu <r.pascanu@gmail.com>
date Thu, 23 Sep 2010 11:40:20 -0400
parents 5ef96142492b
children

Proposal (RP)
=============

You construct your neural network by constructing a graph of connections
between "layers" starting from data. While you construct the graph,
different theano formulas are put together to construct your model.

The idea would be that you need to describe exactly what you would draw
on the board if you are asked to draw the architecture. This would of
course be optional ( you will get macros that will return this graph
automatically for a well defined case). Things that are not neural networks,
and for which you wouldn't have any structure to draw, are just a box, for
example an SVM or PCA. This is in case you want to connect their output to
your network.

Hard details are not set yet, but all members of the committee agreed
that this sounds like a good idea.

  # Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

  h1   = sigmoid(dotW_b(train_x, n = 300))
  rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood)

  h2   = sigmoid(dotW_b(h1, n = 300))
  rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost = pseudolikelihood)

  out = sigmoid( dotW_b(h2, n = 10))

  train_err = cross_entropy( out, train_y)

  grads   = grad( train_err, train_err.parameters() )
  learner = SGD( train_err, train_err.parameters(), grads)

  valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
  test_err  = train_err.replace({ train_x : test_x , train_y : test_y})
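
For grounding only, here is a rough sketch of the raw Theano formulas that
the first few lines of this pseudo-code would assemble under the hood. The
dotW_b / sigmoid / cross_entropy names above are the proposed transforms; the
input size, the helper below and its signature are just my illustration, not
part of the proposal.

  import numpy, theano
  import theano.tensor as T

  rng = numpy.random.RandomState(123)
  train_x = T.matrix('train_x')     # a minibatch of inputs (assumed 784-dim)
  train_y = T.matrix('train_y')     # one-hot targets, 10 classes

  def dotW_b_expr(inp, n_in, n_out):
      # what the dotW_b transform would build and register: W, b and T.dot(inp, W) + b
      W = theano.shared(rng.uniform(-0.01, 0.01, (n_in, n_out)), name='W')
      b = theano.shared(numpy.zeros(n_out), name='b')
      return T.dot(inp, W) + b, [W, b]

  h1_pre, params_h1   = dotW_b_expr(train_x, 784, 300)
  h1                  = T.nnet.sigmoid(h1_pre)
  out_pre, params_out = dotW_b_expr(h1, 300, 10)
  out                 = T.nnet.sigmoid(out_pre)
  train_err           = T.mean(T.nnet.binary_crossentropy(out, train_y))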


Global observations:
--------------------

1) Your graph can have multiple terminal nodes; in this case rbm1,
rbm2, learner, valid_err and test_err are all end nodes of the graph;

2) Any node is an "iterator"; when you call out.next() you get
the next prediction, and when you call err.next() you get the next error
( on the batch given by data.next() ). There is a sketch of this behaviour
after this list of observations.

3) Replace can replace any subgraph or subgraphs with other
subgraphs/subgraph as long as there are the same number of input units
and output units ( there is a 1 to 1 mapping between those). I see replacing
several subgraphs as looping over the list of subgraphs to replace and
calling replace on each, nothing fancier. Since nodes in my view produce the
same interface (except parameter nodes and hyper-parameter nodes) this
constraint is not hard to respect, so it is up to the user to do a replace
that makes sense.

4) You can have MACROS or SUBROUTINES that already give you the graph for
known components ( in my view the CDk is such a macro, but simpler
examples will be vanilla versions of MLP, DAA, DBN, LOGREG). After
Guillaume pointed out a real shortcoming of the approach I've modified
a bit what you get from a macro .. look below.

5) Any node has the entire graph ( though arguably you don't use that
graph too much). Running such a node in general will be done by compiling
the Theano expression up to that node ( if you don't already have this
function), and using the data object that you get initially. This theano
[...]
constraints..)

6) Registering parameters and hyper-parameters to the graph is the job of
the transform and therefore of the user who implemented that
transform; the same for initializing the parameters ( so if we have
different ways to initialize the weight matrix, that might be a
hyperparameter with a default value or different transforms; to limit
the number of such transforms you can define a transform on the fly for
simple theano expressions ).
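
To make observations 2) and 5) concrete, here is a tiny sketch of what the
node wrapper could look like. Nothing here is a decided API; the Node class,
its data_stream iterator and the lazy compilation are only my illustration of
the idea.

  import theano

  class Node(object):
      """Illustrative wrapper: a theano expression plus the data feeding it."""
      def __init__(self, expr, inputs, data_stream, parameters=()):
          self.expr = expr                  # theano expression up to this node
          self.inputs = inputs              # list of theano input variables
          self.data_stream = data_stream    # python iterator yielding input batches
          self.parameters = list(parameters)
          self._fn = None                   # compiled lazily, as in observation 5)

      def next(self):
          # compile the theano expression up to this node only once
          if self._fn is None:
              self._fn = theano.function(self.inputs, self.expr)
          # evaluate it on the batch given by data.next()
          return self._fn(*next(self.data_stream))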



Detailed Proposal (RP)
======================

I will go through a list of scenarios and possible issues:

Delayed or future values
------------------------


This can be dropped if people think it is not useful.

Sometimes you might want future values of some nodes. For example you might
be interested in:

  y(t) = x(t) - x(t-1)
[...]

  y_tm1 = recurrent_layer(init = zeros(50))
  x_t   = slice(x, t = 0)
  y     = loop( dotW_b(y_tm1, 50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op
to your theano expression of the result node y; it is just a different way
of writing things .. which I think is more intuitive.

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node ( a time slice, that is).
A time slice is a special kind of node, which we should try to force people
not to use outside of a loop. If you use it anyway, it has some default
behaviour, for example it behaves exactly like a delayed node.
You call loop giving an expression that starts from those primitives and,
ta da, you have your recurrent expression in the graph.

Similarly you can have foldl or map or anything else.

You would use this instead of writing scan especially if the formulas are
more complicated and you want to automatically collect parameters,
hyper-parameters and so on. You could also just use the scan op directly,
through a general apply command, if you like that more.
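
For reference, the loop construct above would boil down to something like the
following theano.scan call. The shapes, the zero initialization and the exact
recurrence ( a plain dot + bias, standing in for dotW_b(y_tm1, 50) + x_t ) are
only an illustration of what the macro would generate.

  import numpy, theano
  import theano.tensor as T

  x = T.matrix('x')                                    # the sequence, one time step per row
  W = theano.shared(numpy.zeros((50, 50)), name='W')   # what dotW_b would register
  b = theano.shared(numpy.zeros((50,)), name='b')

  def step(x_t, y_tm1):
      # one step of the recurrence y(t) = dotW_b(y(t-1)) + x(t)
      return T.dot(y_tm1, W) + b + x_t

  y, updates = theano.scan(step,
                           sequences=[x],
                           outputs_info=[T.zeros((50,))])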

Optimizer
---------

Personally I would respect the findings of the optimization committee,
and have the SGD require a Node that produces some error ( which can
be omitted), the parameter nodes and the nodes that compute the gradients
for those parameters. For this I would also have the grad function, which
would actually only call T.grad.

What if you have a non-theano thing in the middle? I don't have any smart
solution besides ignoring any parameter that is below the first
non-theano node and throwing a warning.
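
As an illustration only ( this is not the optimization committee's interface,
just a strawman of mine ), the SGD learner could compile its update step along
these lines; the frozen flag anticipates the freeze property discussed further
down.

  import theano
  import theano.tensor as T

  def sgd_updates(cost, params, lr=0.1):
      # skip frozen parameters, as suggested for the freeze property
      params = [p for p in params if not getattr(p, 'frozen', False)]
      grads = T.grad(cost, params)          # all the grad transform would do
      return [(p, p - lr * g) for p, g in zip(params, grads)]

  # learner.next() would then roughly be:
  #   train_fn = theano.function([train_x, train_y], train_err,
  #                              updates=sgd_updates(train_err, params))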

Learner
-------

In my case I would not have a predict() and eval() method on the learner,
but just an eval(). If you want the predictions you should use the
corresponding node ( before applying the error measure ). This was
for example **out** in my first example. Note that eval() in this case is
the same as next() ( you might just have next for simplicity). The
only semantically important difference is that a call to next now has
side-effects, in the sense that the parameters are updated.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterating behaviour
of the node should produce.

[...]

I don't have a perfect answer yet, but my argument goes like this:

You would have transforms for the most popular options ( dotW_b for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. The formulas produced by
the formula committee might be a rich source of such functions to decorate.
More than decorating, you can have a general apply transform that does
something like:

  apply( lambda x,y,z: x*y+z,
         inputs = x,
         hyperparams = [(name, 2)],
         params = [(name, theano.shared(..))])

The order of the arguments in the lambda is nodes, params, hyper-params or so.
This would apply the theano expression but it will also register the
parameters. It is like creating a transform on the fly.
You should, or at least could, provide names for parameters; you might need
them later.

I think you can do it such that the result of the apply is
picklable, but not the general apply transform. What I mean is that
the output node does not store the lambda expression but some theano
graph (?), and it knows which are the inputs ( and where you can replace
them) so that you can link this little graph to the rest of the
theano expression. It is just an ugly hack given that you can not save
lambda expressions, but I'm open to other alternatives ..
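
A minimal sketch of what such an on-the-fly apply could do, reusing the toy
Node idea from earlier. The registration lists and the fact that the lambda is
evaluated immediately ( so that only the resulting theano graph has to be
pickled ) are my own guesses at the mechanics, not settled details.

  import numpy, theano
  import theano.tensor as T

  def apply_transform(fn, inputs, params=(), hyperparams=()):
      """Build the theano expression right away and register params / hyper-params."""
      inputs = list(inputs) if isinstance(inputs, (list, tuple)) else [inputs]
      param_vars = [p for _, p in params]       # (name, shared variable) pairs
      hyper_vals = [v for _, v in hyperparams]  # (name, value) pairs
      # call the lambda once: afterwards only the theano graph survives
      expr = fn(*(inputs + param_vars + hyper_vals))
      return {'expr': expr,
              'params': dict(params),
              'hyperparams': dict(hyperparams)}

  x = T.vector('x')
  W = theano.shared(numpy.ones((3,)), name='W')
  node = apply_transform(lambda inp, W, scale: inp * W * scale,
                         inputs = x, params = [('W', W)], hyperparams = [('scale', 2)])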


What this way of doing things would buy you, hopefully, is that you do not
need to worry about most of your model ( it would be just a few macros) that
will get you to the point you want to change, and then you do surgery on
that point. Compare this with hacking a class: it feels cleaner, because
what is up to the point you want to change is sort of separated from
what you change. Plus you could do this in your script, and you don't need
to create your local branch of the library where you hack the class, or
duplicate the class file under a different name ..
Once what you are doing becomes stable it can be converted into either a
different macro or a parameter to the initial macro.

** New part **

If this is not convincing enough, there is another point that I want to
make. While creating the graph you can optionally create a model object.
I would encourage most people to do that! I had this idea a long time ago,
but back then I used a singleton class as the "world", which could potentially
create a lot of issues. This is a nicer version of that.

This model class is optional but it can be extremely useful. What you do in
this model class is store the graph, together with different annotations
on that graph. What I would do is identify different subgraphs in the model
and register them under different names. For example if err is the node that
points to the graph that represents a DBN, that graph will be registered to
a model in which I have annotated which subgraphs represent the different
rbms, which one represents the logistic regression and so on. The model will
also have a list of all the input nodes and all the output nodes of the graph.
We could potentially use this model class to control some global defaults for
parameter initialization or hyper-parameters. This all might sound like
magic but it is actually easy to implement.

If you have such a model, which is just some annotations on the graph, this
approach makes it easy to change components of the graph based on their names.
For example I can replace rbm1 with a daa, because based on these annotations
I know which part is rbm1.

Why do I feel you need such a thing? It is just because you get the DBN by
calling a macro, and you don't have variables that point to different nodes
of your network, so you cannot say where a subgraph starts or ends. But
if a macro returns such a model, you can introspect what annotations you have.
There should also be standard conventions, but you could also, in the
interactive shell, look at:

  model.annotations(depth = 2)

This would print something like:

  'DBN'
     'rbm1'
        'hidden_layer1'
        'CDk_layer1'
     'rbm2'
        'hidden_layer2'
        'CDk_layer2'
     'logreg'
        'cross_entropy'

And then you can say:

  daa1 = daa(..)
  daa2 = daa(..)
  new_model = model.replace('rbm1', daa1, new_name = 'daa1')
  new_model = new_model.replace('rbm2', daa2, new_name = 'daa2')

and you get an SDAA.

What is the hierarchical structure? Well, in my view, if some subgraph
(annotated as S1) is part of another subgraph (annotated as S2), then
S1 is a child of S2 in this hierarchy of annotations. If they share
just a few nodes, but have nodes that are not shared, then they are on
the same level. We might want a flat space for the annotations, but I think
this simple convention can get us a lot.


So macros should in general return such models. It is up to you if you want
to ground the graph that you create in your script into a model or not. You
do so by manually adding nodes to the model. The annotations are also done
manually .. So this might be a bit annoying for a developer of a macro, but
I don't think it is cognitively complicated, and it would help a lot when
using the macros.

You can see how this annotation system easily becomes interesting. You can
also annotate parameters ( and it is not too overwhelming to do so when
you create the graph) and you can use this to collect all the
parameters that you annotated in some way and then do something to them.

The way I see it, a transform could just have an optional annotations
argument, and it would add that string to all parameters and hyper-parameters.
How much sense this makes is debatable, but I strongly believe it is not
complicated to implement ( I actually have something like this already
implemented, just that I use that singleton class, and I sort of made the
framework work mostly for DAA by making a few poor choices).
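
To show that the annotation machinery really is just bookkeeping, here is a
toy model class with the two operations used above. The internals, the method
names and the parent-based hierarchy are my own choices for the sketch, not
the proposed API.

  class ToyModel(object):
      """Maps annotation names to a node and its parent annotation."""
      def __init__(self):
          self.annotations = {}   # name -> {'node': ..., 'parent': ...}

      def annotate(self, name, node, parent=None):
          self.annotations[name] = {'node': node, 'parent': parent}

      def depth_of(self, name):
          parent = self.annotations[name]['parent']
          return 0 if parent is None else 1 + self.depth_of(parent)

      def show(self, depth=2):
          # print the annotation hierarchy down to the requested depth
          for name in self.annotations:
              if self.depth_of(name) < depth:
                  print('   ' * self.depth_of(name) + repr(name))

      def replace(self, old_name, new_node, new_name):
          # return a new model where the named subgraph points to new_node;
          # the actual graph surgery would be delegated to node.replace
          other = ToyModel()
          other.annotations = dict(self.annotations)
          entry = other.annotations.pop(old_name)
          other.annotations[new_name] = {'node': new_node, 'parent': entry['parent']}
          return other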


Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of that
class. I think there should be such a wrapper around parameters and
hyper-parameters as well. By default those wrappers might not provide
any information. But you can potentially add interesting information for
"graph"-aware transforms. For example you can add annotations so that a find
or replace function can collect all parameters or hyper-parameters for you,
so you can do some common thing to all of them (when it makes sense).

You could have a freeze property for parameters. If you change that property,
the theano function (where needed) for all nodes that follow this one is
recomputed. This property would be used by the parameter-collecting function
used to compute the gradient. If parameters are frozen they are ignored,
if not they are updated.

For hyper-parameters you would also have a different wrapper that would
contain, possibly, the distribution of that hyper-parameter for a
hyper-learner.

I would also treat the learning rate or noise_amounts as a special kind of
hyper-parameter. I would say that by default, if any hyper-parameter changes
its value, then the theano expressions need to be recompiled. If you are
dealing with this special type of hyper-parameter you don't need to do that.
This can be handled automatically for you, and I guess it will all boil down
to: is your hyper-parameter a theano shared variable or a theano tensor? If
so, we are dealing with the second type. So this kind of stuff can be detected
automatically.
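
Concretely, in plain Theano the distinction is just the following ( lr_shared
and lr_const are my own example names ): a hyper-parameter baked in as a
Python constant forces recompilation when it changes, while one held in a
shared variable can be changed afterwards without touching the compiled
function.

  import numpy, theano
  import theano.tensor as T

  x = T.vector('x')

  # "special" hyper-parameter: a shared variable, no recompilation needed
  lr_shared = theano.shared(numpy.asarray(0.1), name='lr')
  f1 = theano.function([x], lr_shared * x)
  lr_shared.set_value(numpy.asarray(0.01))   # f1 picks this up on the next call

  # ordinary hyper-parameter: a plain constant, changing it means recompiling
  lr_const = 0.1
  f2 = theano.function([x], lr_const * x)
  # to use lr_const = 0.01 you have to build f2 again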

How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes ( or a param freezes) all regions of the graph affected get recompiled.
This is done by traversing the graph from the bottom node and re-constructing
the theano expression. Where needed, this theano expression gets compiled.

This function that updates / re-constructs the graph is slightly more complex
if you have non-theano functions in the middle of the graph .. but not too
much in my view.
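
A rough sketch of that re-construction pass, assuming each node keeps its
parents and a callable that rebuilds its own theano expression from theirs
( both are assumptions of mine, not part of the proposal ):

  class GraphNode(object):
      """Toy node: parents plus a function building its theano expression."""
      def __init__(self, build_expr, parents=()):
          self.build_expr = build_expr      # fn(parent_exprs) -> theano expression
          self.parents = list(parents)
          self.compiled = None              # compiled theano function, if any

  def rebuild(node, cache=None):
      """Re-construct the theano expression for `node`, bottom-up."""
      cache = {} if cache is None else cache
      if id(node) not in cache:
          parent_exprs = [rebuild(p, cache) for p in node.parents]
          cache[id(node)] = node.build_expr(parent_exprs)
          node.compiled = None              # force lazy recompilation on next()
      return cache[id(node)]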

replace & find
--------------

Replace replaces a part of the graph. The way it works in my view is that
if I write:

  x = x1+x2+x3
  y = x.replace({x2: x5})

you would first copy the graph that is represented by x ( the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done
in place!

I think of these Node classes as something lightweight, like theano variables,
and creating copies is not harmful. Also params & shared variables are shared
between these graphs. If you want new params / shared variables we can offer
a copy / deepcopy command.

Replace (given that it starts from a model) can take string(s) that indicate
specific annotations.

Find does the same ( without the copying).


If you have two things named the same in the graph, you would return the first
one found in a breadth-first search starting from the top node. The idea is
that if you have all the weight matrices annotated as 'W' and you look for 'W'
starting from node hiddens2, you want the W of the second layer, and not of
the first.

I would support:

  model.replace( look_at, search_for, replace_with, annotate_as)
  replace( model, look_at, search_for, replace_with, annotate_as)
  node.replace( model, look_at, replace_with, annotate_as)

look_at, if it is a node, refers to the subgraph that has that node as its
final node, i.e. everything up to that point. If it is a string, you look
at the subgraph annotated by that string.

Of course we can optionally choose not to allow things to be annotated with
the same name, though I sort of like it. It makes a lot of things easy. For
a DBN I would have the annotations:

  DBN
    rbm1
      hidden
      CDk
    rbm2
      hidden
      CDk
    logreg

If I want to replace the first CDk with PCD I would do:

  pcd1 = PCD(..)
  model.replace(look_at = 'rbm1', search_for = 'CDk', replace_with = pcd1,
                annotate_as = 'PCD1')


The bottom line is:

I think having a graph, and having a way to search in that graph and replace
parts of it, is a very flexible and powerful way of doing things.
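
You can already get the flavour of replace on a bare Theano graph with the
givens argument of theano.function, which is roughly what
valid_err = train_err.replace({train_x: valid_x, ...}) would compile down to.
The variables below are assumed to come from the raw-Theano sketch near the
top of this file.

  import theano
  import theano.tensor as T

  valid_x = T.matrix('valid_x')
  valid_y = T.matrix('valid_y')

  # swap the training inputs for the validation inputs without rebuilding the graph
  valid_err_fn = theano.function([valid_x, valid_y], train_err,
                                 givens={train_x: valid_x, train_y: valid_y})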


reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think though that it is a shame having that transformation graph and not
being able to use it to do this. It will make life so much easier when you
do deep auto-encoders. I wouldn't put it in the core library, but I would
have it in the DAA module. For reconstruct to work you need to have inverse
transforms for the ones you use.

The way I see it you can either have something like

  # generate your invertible transforms on the fly
  fn  = create_transform(lambda : , params, hyper-params )
  inv = create_transform(lambda : , params, hyper-params )
  my_transform = couple_transforms( forward = fn, inv = inv)

and generate special transforms on the fly that have some pseudo-inverses
when you construct the graph. Maybe you can also have specific pre-defined
transforms for the most used cases, with specific names. Even more, I don't
see the harm in something as simple as dotW_b having an inverse defined ( as
using tied weights) in all cases, but you would only use it for the DAA.
It is just to reduce the number of names of transforms you have; it is like a
feature that doesn't hurt or help in 95% of the cases but helps in 5% of them.


But this is up for debate. The only reason I bring it up is to say that the
class that represents a transform should have an inverse method that by
default throws an exception.
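
For instance, the transform class could look something like the following
sketch of mine, with a tied-weights inverse for a hypothetical DotWb transform:

  import theano.tensor as T

  class Transform(object):
      def apply(self, *inputs):
          raise NotImplementedError

      def inverse(self, *outputs):
          # by default a transform is not invertible
          raise NotImplementedError("%s has no inverse" % type(self).__name__)

  class DotWb(Transform):
      """Affine transform with a pseudo-inverse through tied weights."""
      def __init__(self, W, b, c):
          self.W, self.b, self.c = W, b, c   # c is the bias of the inverse pass

      def apply(self, x):
          return T.dot(x, self.W) + self.b

      def inverse(self, h):
          # tied weights: reuse W transposed, as one would in a DAA
          return T.dot(h, self.W.T) + self.c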


transforms
----------

In my view there will be quite a few such standard transforms.
This can be annoying, but I think that if we group them by
architecture (MLP, DAA, RBM), samplers, optimizers and so on, it will be less
of a mess. This would be crucial for their documentation as well. These
categories should also come with macros. There will be, though, some basic
transforms that are available in the core ( like replace, find, things
related to annotating and creating a model, and collecting parameters and
hyper-parameters).

I also think that we can start small, by having just very few such transforms,
and add more as the library grows. We don't need many of these; most are
nice to have ..


Constraints
-----------

You can always add constraints. I think the easiest way to make this explicit
is to get a hand on the parameter or node on which you want to add the
constraint and do something like:

  add_constraint(on_what, what)

on_what can be a node, a parameter node, a list of nodes, a list of parameter
nodes, or an annotation string (given that you provided a model), and what is
a graph. In terms of the graph that you are creating, what this does is
create a dependency link from your main graph to that constraint graph.
This means that the grad function that computes the gradients with respect
to parameters will also (if there are such dependency links) add the gradient
of those parameters with respect to the output of that dependency graph.
There are some constraints on what a dependency graph can be, in the sense
that it should start from only one input ( the parameter / node) and it
should end in only one node that is a scalar.

From an implementation point of view, this can be done by just collecting a
list of constraint costs that will be added to the cost before calling
T.grad. But I like to think about it in terms of graphs linked through
dependency links.
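
In that simple implementation, add_constraint is little more than list
bookkeeping; a sketch under my own naming assumptions:

  import theano.tensor as T

  _constraints = []   # (parameter or node, scalar constraint expression)

  def add_constraint(on_what, what):
      # `what` is assumed to already be a scalar theano expression of on_what
      _constraints.append((on_what, what))

  def cost_with_constraints(cost):
      # add every registered constraint cost before T.grad is called
      return cost + sum(expr for _, expr in _constraints)

  # grads = T.grad(cost_with_constraints(train_err), params)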




Some general comments
---------------------

I think that what you get in the end is a very flexible framework, where
adding new things is just a matter of putting together a few transforms and
annotating the entire thing. In the worst case scenario you would need to
invent a transform, which I do believe could be quite painless.

The harder part to implement is the back-bone. It is not difficult in my
view, mostly slightly tedious. I had something like this implemented in a
matter of a week, though it was a bit less restrictive. I do believe though
that we should not oversimplify the backbone of the library just to make it
easy to implement; we should rather carefully consider what you get in
the end.


Connection to the architecture committee
----------------------------------------

I think that if you get such iterator objects that can produce either
the error or do an update step, it is easy to wrap them in a plug-in,
or to use them with the imperative language James proposed.

I actually have ideas ( using non-theano nodes) on how to break the algo at
points such that you can have different parts run on remote machines ..
though we might not want to support that ( using the plug-in system ..
though it might work with other systems that support the same idea).

I think it goes more naturally with the imperative language that James
proposed, because that would create a graph as well. His graph is
in general simpler ( it always has only one termination node) and
the nodes have a different interpretation (?), so I would use a different
node class for those. But when writing the code, using some syntactic sugar,
the difference can be blurred ( do we want this ?). I think that one
can come up with ways of making the approaches look alike and slightly
homogeneous.