doc/v2_planning/layer_RP.txt @ 1237:32fc5f442dde
LAYER: slightly long but somewhat clearer rendering of what I have in mind

author:  Razvan Pascanu <r.pascanu@gmail.com>
date:    Thu, 23 Sep 2010 11:40:20 -0400
parents: 5ef96142492b

Proposal (RP)
=============

You construct your neural network by constructing a graph of connections
between "layers" starting from data. While you construct the graph,
different theano formulas are put together to construct your model.

The idea is that you describe exactly what you would draw on the board if
you were asked to draw the architecture. This is of course optional (you
will get macros that return this graph automatically for the well defined
cases). Things that are not neural networks, and for which you wouldn't
have any structure to draw, are just a box (for example an SVM, or PCA),
in case you want to connect their output to your network.

Hard details are not set yet, but all members of the committee agreed
that this sounds like a good idea.

# Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

h1   = sigmoid(dotW_b(train_x, n = 300))
rbm1 = CDk(h1, train_x, k = 5, sampler = binomial, cost = pseudolikelihood)

h2   = sigmoid(dotW_b(h1, n = 300))
rbm2 = CDk(h2, h1, k = 5, sampler = binomial, cost = pseudolikelihood)

out = sigmoid(dotW_b(h2, n = 10))

train_err = cross_entropy(out, train_y)

grads   = grad(train_err, train_err.parameters())
learner = SGD(train_err, train_err.parameters(), grads)

valid_err = train_err.replace({train_x: valid_x, train_y: valid_y})
test_err  = train_err.replace({train_x: test_x,  train_y: test_y})



Global observations:
--------------------

1) Your graph can have multiple terminal nodes; in this case rbm1, rbm2,
learner, valid_err and test_err are all end nodes of the graph.

2) Any node is an "iterator": when you call out.next() you get the next
prediction; when you call err.next() you get the next error (on the batch
given by data.next()). (See the usage sketch after this list.)

3) Replace can replace any subgraph or subgraphs with other subgraphs, as
long as there are the same number of input and output units (there is a
1-to-1 mapping between them). I see replacing several subgraphs as simply
looping over the list of subgraphs and calling replace on each, nothing
fancier. Since nodes in my view expose the same interface (except parameter
nodes and hyper-parameter nodes) this constraint is not hard to respect, so
it is up to the user to do a replace that makes sense.

4) You can have MACROS or SUBROUTINES that already give you the graph for
known components (in my view CDk is such a macro, but simpler examples
would be vanilla versions of MLP, DAA, DBN, LOGREG). After Guillaume
pointed out a real shortcoming of the approach, I've modified a bit what
you get from a macro .. look below.

5) Any node has the entire graph (though arguably you don't use that graph
too much). Running such a node will in general be done by compiling the
theano expression up to that node (if you don't already have this function),
and using the data object that you get initially. This theano [...]
constraints..)

6) Registering parameters and hyper-parameters to the graph is the job of
the transform, and therefore of the user who implemented that transform;
the same goes for initializing the parameters (so if we have different ways
to initialize the weight matrix, that might be a hyper-parameter with a
default value, or different transforms; to keep the number of such
transforms down you can define a transform on the fly for simple theano
expressions).

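
To make observation 2 concrete, here is a minimal usage sketch. It assumes
the nodes built in the example above, that every call consumes one batch
from the data, and that calling next() on the learner performs one SGD
update as a side effect (n_epochs and n_train_batches are hypothetical):

    for epoch in range(n_epochs):
        for _ in range(n_train_batches):
            cost = learner.next()     # one SGD step; parameters get updated
        print('epoch %d valid error %f' % (epoch, valid_err.next()))
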

Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues:

Delayed or future values
------------------------

This can be dropped if people think it is not useful.

Sometimes you might want future values of some nodes. For example you might
be interested in:

y(t) = x(t) - x(t-1)

[...]
y_tm1 = recurrent_layer(init = zeros(50))
x_t   = slice(x, t = 0)
y     = loop(dotW_b(y_tm1, 50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op
to the theano expression of the result node y; it is just a different way
of writing things .. which I think is more intuitive.

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node (a time slice, that is).
A time slice is a special kind of node, which we should try to force people
not to use outside of a loop. If you do use it though, it has some default
behaviour, for example it behaves exactly like a delayed node.
You call loop giving an expression that starts from those primitives and,
ta da, you have your recurrent expression in the graph.

Similarly you can have foldl or map or anything else.

You would use this instead of writing scan, especially if the formulas are
more complicated and you want to automatically collect parameters,
hyper-parameters and so on. You could also just use the scan op through a
general apply command if you like that more.

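
For reference, here is a rough sketch of the raw theano.scan call that the
loop example above could be lowered to. This is only an assumed lowering;
W and b stand in for the shared variables that dotW_b would create:

    import numpy, theano
    import theano.tensor as T

    x = T.matrix('x')                                   # (n_steps, 50) inputs
    W = theano.shared(numpy.zeros((50, 50)), name='W')  # what dotW_b would own
    b = theano.shared(numpy.zeros(50), name='b')

    def step(x_t, y_tm1):
        # body of the loop: dotW_b(y_tm1, 50) + x_t
        return T.dot(y_tm1, W) + b + x_t

    # y_seq holds the 20 steps of the recurrence, starting from zeros(50)
    y_seq, updates = theano.scan(step,
                                 sequences=[x],
                                 outputs_info=[T.zeros((50,))],
                                 n_steps=20)
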
Optimizer
---------

Personally I would respect the findings of the optimization committee,
and have SGD require a node that produces some error (which can be
omitted), the parameter nodes, and nodes that compute gradients for
those parameters. For this I would also have the grad function, which
would actually only call T.grad.

What if you have a non-theano thing in the middle? I don't have any smart
solution besides ignoring any parameter that is below the first non-theano
node and throwing a warning.

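
A minimal sketch of what that grad function and an SGD node could look like
(the interface is an assumption, not settled design; data is taken to live
in shared variables so the compiled step function needs no explicit inputs):

    import theano
    import theano.tensor as T

    def grad(cost, params):
        # in this proposal grad is just a thin wrapper around T.grad
        return T.grad(cost, params)

    class SGD(object):
        def __init__(self, cost, params, grads, lr=0.1):
            # one update step: p <- p - lr * dcost/dp
            updates = [(p, p - lr * g) for p, g in zip(params, grads)]
            self._step = theano.function([], cost, updates=updates)

        def next(self):
            # returns the current cost and updates the parameters (side effect)
            return self._step()
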
Learner
-------

In my case I would not have a predict() and an eval() method on the learner,
but just eval(). If you want the predictions you should use the
corresponding node (before applying the error measure); this was for
example **out** in my first example. Note that eval() in this case is the
same as next() (you might have just next() for simplicity). The only
semantically important difference is that a call to next() now has
side-effects, in the sense that the parameters are updated.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterating behaviour
of the node should produce.

[...]

I don't have a perfect answer yet, but my argument goes like this:

You would have transforms for the most popular options (dotW_b for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. The formulas produced by
the formulas committee might be a rich source of such functions to decorate.
More than decorating, you can have a general apply transform that does
something like:

apply(lambda x, y, z: x*y + z, inputs = x,
      hyperparams = [(name, 2)],
      params = [(name, theano.shared(..))])
The order of the arguments in the lambda is nodes, params, hyper-params, or
so. This would apply the theano expression but it will also register the
parameters. It is like creating a transform on the fly. You should, or at
least could, provide names for parameters; you might need them later.

I think you can make it such that the result of the apply is picklable,
even if the general apply transform itself is not. What I mean is that the
output node does not store the lambda expression but some theano graph (?),
and it knows which are the inputs (and where you can replace them), so that
you can link this little graph to the rest of the theano expression. It is
just an ugly hack given that you cannot save lambda expressions, but I'm
open to other alternatives ..

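
As a sketch of that hack: the general apply transform calls the lambda once,
at graph-construction time, and the node only keeps the resulting theano
expression, which is picklable (Node here is just an illustrative stand-in):

    import collections

    Node = collections.namedtuple('Node', 'expr inputs params')  # stand-in

    def apply_(fn, inputs, params=(), hyperparams=()):
        # evaluate the lambda once on the theano variables; afterwards only
        # the resulting expression is stored, never the lambda itself
        expr = fn(*(list(inputs) + list(params) + list(hyperparams)))
        return Node(expr, list(inputs), list(params))
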

What this way of doing things would hopefully buy you is that you do not
need to worry about most of your model (it would be just a few macros) that
gets you to the point you want to change, and then you do surgery on that
point. Compare this with hacking a class: it feels cleaner, because what
comes before the point you want to change is sort of separated from what
you change. Plus you can do this in your script, and you don't need to
create a local branch of the library where you hack the class, or duplicate
the class file under a different name ..
Once what you are doing becomes stable it can be converted into either a
different macro or a parameter of the initial macro.

** New part **

If this is not convincing enough, there is another point that I want to
make. While creating the graph you can optionally create a model object.
I would encourage most people to do that! I had this idea a long time ago,
but back then I used a singleton class as the world, which could potentially
create a lot of issues. This is a nicer version of that.

This model class is optional but it can be extremely useful. What you do in
this model class is store the graph, together with different annotations on
that graph. What I would do is identify different subgraphs in the model and
register them under different names. For example, if err is the node that
points to the graph that represents a DBN, that graph will be registered to
a model in which I have annotated which subgraphs represent the different
RBMs, which represents the logistic regression and so on. The model will
also have a list of all the input nodes and all the output nodes of the
graph. We could potentially use this model class to control some global
defaults for parameter initialization or hyper-parameters. This all might
sound like magic but it is actually easy to implement.

If you have such a model, which is just some annotations on the graph, this
approach makes it easy to change components of the graph based on their
names. For example I can replace rbm1 with a daa, because based on these
annotations I know which part is rbm1.
299 | |
300 Why do I feel you need such a thing? It is just because you get the DBN by | |
301 calling a macro, and you don't have variables that point to different nodes | |
302 of your network so that you can define where a subgraph starts or not. But | |
303 if a graph returns such a model, you can introspect what annotations you have. | |
304 There should also be standard conventions, but you could also in the | |
305 interactive shell look at : | |
306 | |
307 model.annotations(depth = 2) | |
308 | |
309 This would print something like : | |

'DBN'
    'rbm1'
        'hidden_layer1'
        'CDk_layer1'
    'rbm2'
        'hidden_layer2'
        'CDk_layer2'
    'logreg'
        'cross_entropy'

And then you can say

daa1 = daa(..)
daa2 = daa(..)
new_model = model.replace('rbm1', daa1, new_name = 'daa1')
new_model = new_model.replace('rbm2', daa2, new_name = 'daa2')

and you get a SDAA.
What is the hierarchical structure? Well, in my view if some subgraph
(annotated as S1) is part of another subgraph (annotated as S2), then S1 is
a child of S2 in this hierarchy of annotations. If they share just a few
nodes, but each has nodes that are not shared, then they are on the same
level. We might want a flat space for the annotations, but I think this
simple convention can get us a lot.

So macros should in general return such models. It is up to you whether you
want to ground the graph that you create in your script into a model or not.
You do so by manually adding nodes to the model. The annotations are also
done manually .. So this might be a bit annoying for the developer of a
macro, but I don't think it is cognitively complicated, and it would help a
lot when using the macros.

You can see how this annotation system quickly becomes interesting. You can
also annotate parameters (and it is not too overwhelming to do so when you
create the graph) and you can use this to sort of collect all parameters
that you annotated in some way and then do something to them.
348 | |
349 The way I see it is just that a transform could have an optional annotations | |
350 argument and it will add that string to all parameters and hyper-parameters. | |
351 How much sense this makes is debatable, but I strongly believe that is not | |
352 complicated to implement ( I actually have something like this already | |
353 implemented, just that I use that single ton class, and I sort of made the | |
354 framework work mostly for DAA by making a few poor choices). | |
261 | 355 |
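
To back up the claim that this is easy to implement, here is a minimal
sketch of such a model class; the method names (add_node, annotate, find,
replace) are assumptions, not a settled API:

    class Model(object):
        """Stores the graph plus named annotations over its subgraphs."""

        def __init__(self):
            self.inputs = []
            self.outputs = []
            self.annotations = {}        # name -> (nodes, parent annotation)

        def add_node(self, node, input=False, output=False):
            if input:
                self.inputs.append(node)
            if output:
                self.outputs.append(node)

        def annotate(self, name, nodes, parent=None):
            self.annotations[name] = (list(nodes), parent)

        def find(self, name):
            return self.annotations[name][0]

        def replace(self, name, new_subgraph, new_name=None):
            # sketch only: copy the graph, swap the subgraph annotated `name`
            # for new_subgraph, and return the result as a new Model
            raise NotImplementedError
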

Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of
that class. I think there should be such a wrapper around parameters and
hyper-parameters as well. By default those wrappers might not provide any
information, but you can potentially add interesting information for
"graph"-aware transforms. For example you can add annotations that a find
or replace function can use to collect all parameters or hyper-parameters,
so that you can do some common thing to all of them (when it makes sense).

You could have a freeze property for parameters. If you change that
property, the theano functions (where needed) for all nodes that follow
this one are recomputed. This property would be used by the
parameter-collecting function used to compute the gradient: if parameters
are frozen they are ignored, otherwise they are updated.
280 I would have a special class of hyper-params which don't require | 374 |
281 recompilation of the graph. Learning rate is an example. This info is also | 375 For hyper-parameters you would also have a different wrapper that would |
282 given by the wrapper and by how the parameter is used. | 376 contain, possibly, the distribution of that hyper-parameters for a |
283 | 377 hyper-learner. |

I would also have the learning rate or noise amount as a somewhat strange
kind of hyper-parameter. I would say that by default, if any hyper-parameter
changes its value, then the theano expressions need to be recompiled; if you
are dealing with these strange types of hyper-parameters you don't need to
do that. This can be done for you automatically, and I guess it all boils
down to: is your hyper-parameter a theano shared variable or theano tensor?
If so, we are dealing with the second type. So this kind of thing can be
detected automatically.

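
A minimal sketch of those two wrappers, under the assumptions above (freeze
marks the parameter for the gradient-collecting function, and a
hyper-parameter backed by a theano shared variable does not force
recompilation):

    import theano

    class Param(object):
        def __init__(self, value, name=None, annotations=()):
            self.var = theano.shared(value, name=name)
            self.annotations = list(annotations)
            self.frozen = False

        def freeze(self, flag=True):
            # frozen params are skipped when collecting params for T.grad;
            # flipping this would also trigger recompilation downstream
            self.frozen = flag

    class HyperParam(object):
        def __init__(self, value, name=None, distribution=None, shared=False):
            # shared=True: the "strange" kind (e.g. learning rate) stored as
            # a theano shared variable, so changing it needs no recompilation
            self.value = theano.shared(value, name=name) if shared else value
            self.distribution = distribution  # for a hyper-learner to sample
            self.requires_recompile = not shared
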
How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes (or a param freezes), all regions of the graph affected get
recompiled. This is done by traversing the graph from the bottom node and
re-constructing the theano expression; where needed, this theano expression
gets compiled.

This function that updates / re-constructs the graph is slightly more
complex if you have non-theano functions in the middle of the graph .. but
not too much so, in my view.
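
A rough sketch of that re-construction pass, assuming every node stores its
inputs and a transform with a build_expr method (all names hypothetical):

    def rebuild(node, memo=None):
        # walk from the bottom node towards the data, re-constructing the
        # theano expression; already rebuilt nodes are reused via memo
        memo = {} if memo is None else memo
        if node not in memo:
            inputs = [rebuild(i, memo) for i in node.inputs]
            memo[node] = node.transform.build_expr(inputs)
        return memo[node]
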

replace & find
--------------

Replace replaces a part of the graph. The way it works, in my view, is that
if I write:

x = x1 + x2 + x3
y = x.replace({x2: x5})

you would first copy the graph that is represented by x (the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, and y will point to x1+x5+x3. Replace is not done
in place!

I think of these Node classes as something light-weight, like theano
variables, and creating copies is not harmful. Also, params & shared
variables are shared between these graphs. If you want new params / shared
variables we can offer a copy / deepcopy command.

Replace (given that it starts from a model) can take string(s) that indicate
specific annotations.

Find does the same (without the copying).
423 | |
424 | |
425 | |
426 If you have two things named the same in the graph you would return the first | |
427 one in a breadth search starting from the top node. The idea is that if you | |
428 have all the weight matrices annotated as 'W' and you look for 'W' starting | |
429 from node hiddens2, you want the W of the second layer, and not of the first. | |
430 | |
431 I wold support : | |
432 model.replace( look_at , search_for , replace_with, annotate_as) | |
433 replace(model , look_at , search_for , replace_with, annotate_as) | |
434 node.replace(model , look_at, replace_with, annotate_as) | |
435 | |
436 look_at if it is a node it reffers to the subgraph that has as a final | |
437 node that node. I.e. all up to that point. If it is a string, you would look | |
438 at the subgraph annotated by that string. | |
439 | |
440 Of course we can optionally choose not to allow things to be annotate with | |
441 the same name, though I sort of liked it. It makes a lot of things easy. For | |
442 a DBN I would have the annotations : | |
443 | |
444 DBN | |
445 rbm1 | |
446 hidden | |
447 CDk | |
448 rbm2 | |
449 hidden | |
450 CDk | |
451 logreg | |
452 | |
If I want to replace the first CDk with PCD I would do

pcd1 = PCD(..)
model.replace(look_at = 'rbm1', search_for = 'CDk', replace_with = pcd1,
              annotate_as = 'PCD1')

Bottom line is:

I think having a graph, and having a way to search in that graph and
replace parts of it, is a very flexible and powerful way of doing things.


reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think though that it would be a shame to have that transformation graph
and not be able to use it for this. It will make life so much easier when
you do deep auto-encoders. I wouldn't put it in the core library, but I
would have it in the DAA module. For reconstruct to work you need to have
inverse transforms for the ones you use.

The way I see it, you can either have something like

# generate your invertible transforms on the fly
fn = create_transform(lambda : , params, hyper_params)
inv = create_transform(lambda : , params, hyper_params)
my_transform = couple_transforms(forward = fn, inv = inv)

and generate special transforms on the fly that have some pseudo-inverse
when you construct the graph. Maybe you can also have specific pre-defined
transforms for the most used cases, with specific names. Even more, I don't
see the harm in something as simple as dotW_b having an inverse defined (as
in using tied weights) in all cases, even though you would only use it for
the DAA. It is just to reduce the number of names of transforms you have;
it is like a feature that doesn't hurt or help in 95% of cases but helps in
the other 5%.

But this is up for debate. The only reason I bring it up is to say that the
class that represents a transform should have an inverse method that by
default throws an exception.

transforms
----------

In my view there will be quite a few such standard transforms. This can be
annoying, but I think that if we group them by architecture (MLP, DAA, RBM),
samplers, optimizers and so on it will be less of a mess. This would be
crucial for their documentation as well. These categories should also come
with macros. There will though be some basic transforms that are available
at the core (like replace, find, things related to annotating and creating
a model, and collecting parameters and hyper-parameters).

I also think that we can start small by having just a few such transforms
and add more as the library grows. We don't need many of these; most are
nice to have ..


Constraints
-----------

You can always add constraints. I think the easiest way to make this
explicit is to get a hand on the parameter or node on which you want to add
the constraint and do something like

add_constraint(on_what, what)

on_what can be a node, a parameter node, a list of nodes, a list of
parameter nodes, or an annotation string (given that you provided a model),
and what is a graph. In terms of the graph that you are creating, what this
does is create a dependency link from your main graph to that constraint
graph. This means that the grad function that computes the gradients with
respect to the parameters will also (if there are such dependency links)
add the gradient of those parameters with respect to the output of that
dependency graph. There are some constraints on what a dependency graph can
be, in the sense that it should start from only one input (the parameter /
node) and it should end in only one node that is a scalar.

From an implementation point of view, this can be done by just collecting a
list of constraint costs that will be added to the cost before calling
T.grad. But I like to think about it in terms of graphs linked through
dependency links.

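
As a sketch of that simpler implementation view, collecting constraint
costs and folding them into the cost before T.grad (names are hypothetical):

    import theano.tensor as T

    constraint_costs = []              # filled in by add_constraint

    def add_constraint(on_what, what):
        # `what` is assumed to be a scalar theano expression depending only
        # on `on_what`; we just remember it as an extra cost term
        constraint_costs.append(what)

    def grad(cost, params):
        total_cost = cost + sum(constraint_costs)
        return T.grad(total_cost, params)
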


Some general comments
---------------------

I think that what you get in the end is a very flexible framework, where
adding new things is just a matter of putting together a few transforms and
annotating the entire thing. In the worst case scenario you would need to
invent a transform, which I do believe could be quite painless.

The harder part to implement is the backbone. It is not difficult in my
view, mostly slightly tedious. I had something like this implemented in a
matter of a week, though it was a bit less restrictive. I do believe though
that we should not oversimplify the backbone of the library just to make it
easy to implement; we should rather carefully consider what we get in the
end.


Connection to the architecture committee
----------------------------------------

I think that if you get such iterator objects, which can either produce the
error or do an update step, it is easy to wrap them in a plug-in, or use
them with the imperative language James proposed.

I actually have ideas (using non-theano nodes) for how to break the
algorithm at points such that you can have different parts run on remote
machines .. though we might not want to support that (using the plug-in
system .. though it might work with other systems that support the same
idea).

I think it goes more naturally with the imperative language that James
proposed, because that would create a graph as well. His graph is in
general simpler (it always has only one termination node) and the nodes
have a different interpretation (?), so I would use a different node class
for those. But when writing the code, using some syntactic sugar, the
difference can be blurred (do we want this?). I think that one can come up
with ways of making the approaches look alike and slightly homogeneous.