Mercurial > pylearn
doc/v2_planning/plugin_RP.py @ 1202:7fff3d5c7694
ARCHITECTURE/LAYER: an incomplete story about the plug-ins and the way of constructing models
author | pascanur |
---|---|
date | Mon, 20 Sep 2010 20:35:03 -0400 |
parents | fe6c25eb1e37 |
children | 681b5e7e3b81 |
'''

!!! Incomplete file .. many of the things I've set up to do are not done
yet !!!

============
Introduction
============

What this file talks about
==========================
 * Proposal for the layer committee
 * Proposal of how to deal with plug-ins (STEP 2)
 * Description of how to glue the two parts
 * Some personal beliefs and argumentation

The file will point out how:
 * to use the APIs the other committees proposed, or why and how they
   should change
 * it satisfies the listed requirements (or why it doesn't)
 * this approach might be better than others (or worse), to the best of
   my knowledge

Motivation for writing this file
================================

I wrote this file because:
 * it will probably answer most of the questions regarding my view,
   minimizing the time wasted on talks
 * presenting the entire interface helps see holes in the approach
 * it is here for everybody to read (easier dissemination of
   information)

=======
Concept
=======

I think any experiment that we (or anybody else) would want to run with
our library will be composed of two steps:

 * Step 1. Constructing (or choosing, or initializing) the model, the
   datasets, error measures, optimizers and so on (everything up to the
   iterative loop). I think this step has been covered by the different
   committees, but it is possibly glued together by the layer committee.

 * Step 2. Composing the iterative loops and performing them (this is
   what the architecture committee dealt with).

I believe there is a natural way of going from *Step 1* to *Step 2*,
which will be presented as Step 1.5.

Step 2
======

I will start with Step 2 (because I think that is more of a hot subject
right now). I will assume you have the right plugins at hand.
This is a DBN with early stopping and k-fold cross-validation:

.. code-block:: python
'''
data = load_mnist()
train_xy, valid_xy, test_xy = split(data, split =
            [(0, 40000), (40000, 50000), (50000, 60000)])
train_x, train_y = train_xy
valid_x, valid_y = valid_xy
test_x, test_y = test_xy

################# CONSTRUCTING THE MODEL ###################################
############################################################################

x0 = pca(train_x)

## Layer 1:
h1 = sigmoid(dotW_b(x0, units = 200), constraint = L1(coeff = 0.1))
x1 = recurrent_layer()
x1.t0 = x0
x1.value = binomial_sample(sigmoid(reconstruct(binomial_sample(h1), x0)))
cost = free_energy(train_x) - free_energy(x1.tp(5))
grads = [(g.var, T.grad(cost.var, g.var)) for g in cost.params]
pseudo_cost = sum([pl.sum(pl.abs(g)) for g in cost.params])
rbm1 = SGD(cost = pseudo_cost, grads = grads)

# Layer 2:
rbm2, h2 = rbm(h1, units = 200, k = 5, use = 'CD')
# Logreg
logreg, out = logreg(h2, units = 10)
train_err = mean_over(misclassification(argmax(out), train_y))
valid_err = train_err.replace({train_x: valid_x, train_y: valid_y})
test_err = train_err.replace({train_x: test_x, train_y: test_y})

##########################################################################
############### Constructing the training loop ###########################

sched = Schedular()


### Constructing Modes ###
pretrain_layer1 = sched.mode('pretrain0')
pretrain_layer2 = sched.mode('pretrain1')
early_stopper   = sched.mode('early')
valid0          = sched.mode('valid0')
kfolds          = sched.mode('kfolds')

# Construct modes dependency graph
valid0.include([pretrain_layer1, pretrain_layer2, early_stopper])
kfolds.include(valid0)

pretrain_layer1.act( on = valid0.begin(),        when = always())
pretrain_layer2.act( on = pretrain_layer1.end(), when = always())
early_stopper.act(   on = pretrain_layer2.end(), when = always())


# Construct counter plugin that keeps track of the number of epochs
@FnPlugin
def counter(self, msg):
    # a bit of a hack .. it would look more classic if you
    # started with a class instead
    if not hasattr(self, 'val'):
        self.val = 0

    if msg == Message('eod'):
        self.val += 1
        if self.val < 10:
            self.fire(Message('continue'))
        else:
            self.fire(Message('terminate'))


# Construct pre-training plugins
rbm1_plugin = plugin_wrapper(rbm1, sched = pretrain_layer1)
rbm1_plugin.listen(Message('init'), update_hyperparameters)
rbm2_plugin = plugin_wrapper(rbm2, sched = pretrain_layer2)
rbm2_plugin.listen(Message('init'), update_hyperparameters)
rbm1_counter = pretrain_layer1.register(counter)
rbm2_counter = pretrain_layer2.register(counter)


# Dependency graph for pre-training layer 0
rbm1_plugin.act( on = [ pretrain_layer1.begin(),
                        Message('continue') ],
                 when = always())
rbm1_counter.act( on = rbm1_plugin.eod(), when = always())


# Dependency graph for pre-training layer 1
rbm2_plugin.act( on = pretrain_layer2.begin(), when = always())
pretrain_layer2.stop( on = rbm2_plugin.eod(), when = always())


# Constructing fine-tuning plugins
learner = early_stopper.register(plugin_wrapper(logreg))
learner.listen(Message('init'), update_hyperparameters)
validation = early_stopper.register(plugin_wrapper(valid_err))
validation.listen(Message('init'), update_hyperparameters)
clock = early_stopper.register(sched.generate_clock())
early_stopper_plugin = early_stopper.register(early_stopper_plugin)

@FnPlugin
def save_weights(self, message):
    cPickle.dump(logreg, open('model.pkl', 'wb'))


learner.act(       on = early_stopper.begin(),          when = always())
learner.act(       on = learner.value(),                when = always())
validation.act(    on = clock.hour(),                   when = every(n = 1))
early_stopper.act( on = validation.value(),             when = always())
save_weights.act(  on = early_stopper.new_best_error(), when = always())

@FnPlugin
def kfolds_plugin(self, event):
    if not hasattr(self, 'n'):
        self.n = -1
        self.splits = [ [ (    0, 40000), (40000, 50000), (50000, 60000) ],
                        [ (10000, 50000), (50000, 60000), (    0, 10000) ],
                        [ (20000, 60000), (    0, 10000), (10000, 20000) ] ]
    if self.n < 2:
        self.n += 1
        msg = Message('new split')
        msg.data = (data.get_hyperparam('split'), self.splits[self.n])
        self.fire(msg)
    else:
        self.fire(Message('terminate'))


kfolds.register(kfolds_plugin)
kfolds_plugin.act( on = kfolds.begin(),       when = always())
kfolds_plugin.act( on = valid0.end(),         when = always())
valid0.act(        on = Message('new split'), when = always())

sched.include(kfolds)

sched.run()

'''
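None of the schedular machinery used above exists yet. To make the intended semantics concrete, here is a tiny, self-contained sketch of an event loop in which plugins register with ``act(on=..., when=...)`` and ``fire`` events at each other. The names (``Schedular``, ``act``, ``fire``, ``once``, ``always``) mirror the pseudo-code above, but the implementation is purely illustrative, not a proposed design:

```python
class Schedular:
    """Toy event loop: plugins register callbacks on named events."""
    def __init__(self):
        self.rules = []          # (event_name, predicate, callback) triples
        self.queue = []          # pending (event_name, value) pairs

    def act(self, callback, on, when):
        self.rules.append((on, when, callback))

    def fire(self, event, value=None):
        self.queue.append((event, value))

    def run(self):
        self.fire('begin')
        while self.queue:
            event, value = self.queue.pop(0)
            for on, when, callback in self.rules:
                if on == event and when():
                    callback(self, value)

def once():
    """Predicate that is true exactly one time."""
    state = {'fired': False}
    def pred():
        if state['fired']:
            return False
        state['fired'] = True
        return True
    return pred

def always():
    return lambda: True

# Wiring: a producer reacts to 'begin' once, a consumer to every 'stuff'.
log = []
sched = Schedular()
sched.act(lambda s, v: s.fire('stuff', value='hello'), on='begin', when=once())
sched.act(lambda s, v: log.append(v), on='stuff', when=always())
sched.run()
# log == ['hello']
```

The point of the sketch is only the shape of the dependency graph: firing an event walks the registered rules, so chains of plugins trigger each other without any global list of events.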
Notes:
  When a mode is registered to begin on a certain message, it will
  rebroadcast that message when it starts, only switching its type from
  whatever it was to 'init'. It will also send all 'init' messages of
  the mode in which it is included (or of the schedular).

  One might be able to shorten all this by having Macros that create
  modes and automatically register certain plugins to them; you can
  always add plugins to any mode afterwards.

Step 1
======

You start with the dataset, which you construct as the dataset committee
proposed. You continue constructing your model by applying
transformations, more or less like you would in Theano. When constructing
your model you also get a graph "behind the scenes". Note though that
this graph is totally different from the one Theano would create!
Let's start with an example:

.. code-block:: python

'''
data_x, data_y = GPU_transform(load_mnist())
output = sigmoid(dotW_b(data_x, 10))
err = cross_entropy(output, data_y)
learner = SGD(err)
'''

This shows how to create the learner behind the logistic regression,
but not the function that will compute the validation error or the test
error (or any other statistics). Before going into the details of what
all those transforms (or the results of applying one) mean, here is
another partial example, for an SdA:

.. code-block:: python

'''
## Layer 1:

data_x, data_y = GPU_transform(load_mnist())
noisy_data_x = gaussian_noise(data_x, amount = 0.1)
hidden1 = tanh(dotW_b(data_x, n_units = 200))
reconstruct1 = reconstruct(hidden1.replace(data_x, noisy_data_x),
                           noisy_data_x)
err1 = cross_entropy(reconstruct1, data_x)
learner1 = SGD(err1)

# Layer 2:
noisy_hidden1 = gaussian_noise(hidden1, amount = 0.1)
hidden2 = tanh(dotW_b(hidden1, n_units = 200))
reconstruct2 = reconstruct(hidden2.replace(hidden1, noisy_hidden1),
                           noisy_hidden1)
err2 = cross_entropy(reconstruct2, hidden1)
learner2 = SGD(err2)

# Top layer:

output = sigmoid(dotW_b(hidden2, n_units = 10))
err = cross_entropy(output, data_y)
learner = SGD(err)

'''

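The ``reconstruct`` calls in the SdA script presume that every transform knows its own inverse. As a toy illustration of that idea (the ``Transform`` and ``Node`` classes here are stand-ins I made up, not the proposed API), ``reconstruct`` simply walks back toward the leaf, applying the inverse of every transform in between:

```python
import math

class Transform:
    """Toy transform that also knows its inverse."""
    def __init__(self, fn, inv):
        self.fn, self.inv = fn, inv

    def __call__(self, node):
        return Node(transform=self, parent=node)

class Node:
    def __init__(self, transform=None, parent=None, value=None):
        self.transform, self.parent, self.value = transform, parent, value

    def eval(self):
        if self.transform is None:       # leaf node holding raw data
            return self.value
        return self.transform.fn(self.parent.eval())

def reconstruct(node, from_value):
    """Undo every transform between `node` and its leaf, outermost first."""
    while node.transform is not None:
        from_value = node.transform.inv(from_value)
        node = node.parent
    return from_value

scale2 = Transform(lambda x: 2 * x, lambda y: y / 2)
exp    = Transform(math.exp, math.log)

x = Node(value=3.0)
h = exp(scale2(x))        # h = exp(2 * x)
# reconstruct(h, h.eval()) applies log, then /2, and recovers 3.0
```

In the real proposal a transform without a known inverse would make ``reconstruct`` fail, which is exactly the behaviour one would want to detect early.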
What's going on here?
---------------------

By calling different "transforms" (we could call them ops or functions)
you decide what the architecture does. What you get back from applying
any of these transforms are nodes. You have different types of nodes
(which I will enumerate a bit later) but they all offer a basic
interface. That interface is the dataset API plus a few more methods
and/or attributes. There are also a few transforms that work on the
graph itself, which I think will be pretty useful:

 * .replace(dict) -> method; replaces the subgraphs given as keys with
       the ones given as values; throws an exception if that is
       impossible

 * reconstruct(dict) -> transform; tries to reconstruct the nodes given
       as keys starting from the nodes given as values, by going
       through the inverses of all transforms that are in between

 * .tm, .tp -> methods; return the nodes that correspond to the value
       at t-k or t+k

 * recurrent_layer -> function; creates a special type of node that is
       recurrent; the node has two important attributes that need to be
       specified before calling the node's iterator; those attributes
       are .t0, which represents the initial value, and .value, which
       should describe the recurrence relation

 * add_constraints -> transform; adds a constraint to a given node

 * data_listener -> function; creates a special node that listens for
       messages to get data; it should be used to decompose the
       architecture into modules that can run on different machines

 * switch(hyperparam, dict) -> transform; a lazy switch that allows you
       to construct the model by hyper-parameters

 * get_hyperparameter -> method; given a name it will return the first
       node, starting from the top, that is a hyper-parameter and has
       that name

 * get_parameter -> method; given a name it will return the first node,
       starting from the top, that is a parameter and has that name


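``.replace`` is essentially sub-graph substitution, the same operation the first script used to turn a training error into a validation error. A minimal sketch of the intended semantics (this ``Node`` class is a stand-in for illustration, not the proposed API):

```python
import operator

class Node:
    """Toy expression node: an op applied to input nodes (leaves hold constants)."""
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = list(inputs)

    def replace(self, mapping):
        """Return a copy of the graph with key sub-graphs swapped for the values."""
        if self in mapping:
            return mapping[self]
        return Node(self.op, [i.replace(mapping) for i in self.inputs])

    def eval(self):
        if not self.inputs:          # leaf: op holds a constant
            return self.op
        return self.op(*[i.eval() for i in self.inputs])

train_x = Node(2.0)                              # stands in for a dataset node
valid_x = Node(5.0)
err = Node(operator.mul, [train_x, Node(3.0)])   # err = train_x * 3

valid_err = err.replace({train_x: valid_x})      # same graph, different input
# err.eval() == 6.0, valid_err.eval() == 15.0
```

Note that ``replace`` returns a fresh graph and leaves the original untouched, which is what lets ``train_err``, ``valid_err`` and ``test_err`` coexist.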
Because every node provides the dataset API, it means you can iterate
over any of the nodes. They will produce the original dataset
transformed up to that point.

** NOTES **

1. This is not like a symbolic graph. When adding a transform you can
   get a warning straight away. This is because you start from the
   dataset, so you always have access to some data. Still, sometimes
   you would want the nodes to be lazy, i.e. not to try to compute
   everything until the graph is done.

2. You can still have complex Theano expressions. Each node has a
   Theano variable describing the graph up to that point plus,
   optionally, a compiled function over which you can iterate. We can
   use some on-demand mechanism to compile only when needed.

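The lazy, iterate-on-demand behaviour can be mimicked with plain generators: each node wraps its parent's iterator and applies its transform only when the next value is actually requested. Again, this is just a sketch of the semantics with made-up names, not the real node implementation:

```python
import math

class DataNode:
    """Toy node: iterating over it yields the parent's stream, transformed."""
    def __init__(self, source, fn=lambda v: v):
        self.source, self.fn = source, fn

    def __iter__(self):
        for v in self.source:
            yield self.fn(v)        # each value is computed only when requested

def sigmoid_node(node):
    return DataNode(node, lambda v: 1.0 / (1.0 + math.exp(-v)))

data = DataNode([0.0, 1.0, -1.0])                    # stands in for a dataset
out = sigmoid_node(DataNode(data, lambda v: 2 * v))  # sigmoid(2 * x)

first = next(iter(out))   # only now does any computation happen
# first == 0.5
```

Chaining nodes this way costs nothing until iteration starts, which is the behaviour point 1 above asks for when the graph is still being built.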
What types of nodes do you have?
--------------------------------

Note that this differentiation is more or less semantic, not
necessarily syntactic. It is just meant to help in understanding the
graph.

 * Data Nodes -- datasets are such nodes; the result of any simple
       transform is also a data node (like the result of a sigmoid,
       or of dotW_b)
 * Learner Nodes -- the same as data nodes, with the difference that
       they have side effects on the model; they update the weights
 * Apply Nodes -- used to connect input variables to the
       transformation/op node and to the output nodes
 * Dependency Nodes -- very similar to apply nodes, except that they
       connect constraint subgraphs to a model graph
 * Parameter Nodes -- when iterating over them they will only output
       the values of the parameters
 * Hyper-parameter Nodes -- very similar to parameter nodes; this is a
       semantic difference (they are not updated by any learner node)
 * Transform Nodes -- these nodes describe the mathematical function
       and, if there is one, the inverse of that transform; there would
       usually be two types of transforms, ones that use Theano and
       those that do not -- the distinction matters because those that
       use Theano can be composed

Each node is lazy, in the sense that unless you try to iterate over it,
it will not try to compute the next value.


Isn't this too low level?
-------------------------

I think that this way of writing and decomposing your neural network is
efficient and useful when designing such networks. Of course, when you
just want to run a classical SdA you shouldn't need to go through the
trouble of writing all that. I think we should have Macros for this.

 * Macro -- syntactically it looks just like a transform (i.e. a Python
   function), only that it actually applies multiple transforms to the
   input and might return several nodes (not just one).
   Example:

   learner, prediction, pretraining_learners = SdA(
                        input   = data_x,
                        target  = data_y,
                        hiddens = [200, 200],
                        noises  = [0.1, 0.1])

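A Macro is then nothing special: an ordinary Python function that applies several transforms and hands back more than one node. Sketching the ``SdA(...)`` call above with stand-in transforms (every name and tuple shape here is hypothetical, chosen only to show the control flow):

```python
def dotW_b(node, n_units):
    # stand-in transform: records structure instead of doing real math
    return ('dotW_b', node, n_units)

def tanh(node):
    return ('tanh', node)

def SGD(err):
    return ('SGD-learner', err)

def SdA_macro(input, hiddens):
    """Apply one layer per entry of `hiddens`; return every node a user may want."""
    pretraining_learners = []
    h = input
    for n in hiddens:
        h = tanh(dotW_b(h, n))
        pretraining_learners.append(SGD(('reconstruction-err', h)))
    prediction = dotW_b(h, 10)
    learner = SGD(('err', prediction))
    return learner, prediction, pretraining_learners

learner, prediction, pretrainers = SdA_macro(input='data_x', hiddens=[200, 200])
# len(pretrainers) == 2 : one pre-training learner per hidden layer
```

Because a macro is just a function over nodes, a user can always ignore it and wire the layers by hand when the canned architecture does not fit.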
How do you deal with loops?
---------------------------

When implementing architectures you sometimes need to loop, as for an
RNN or for CD, PCD, etc. Adding loops in such a scheme is always hard.
I borrowed the idea in the code below from PyBrain. You first construct
a shell layer, which you call a recurrent layer. Then you define its
functionality by giving the initial value and the recurrent step.
For example:

.. code-block:: python

'''
# sketch of writing an RNN
x = load_mnist()
y = recurrent_layer()
y.value = tanh(dotW(x, n = 50) + dotW(y.tm(1), 50))
y.t0 = zeros((50,))
out = dotW(y, 10)


# sketch of writing CD-k starting from x
x = recurrent_layer()
x.t0 = input_values
h = binomial_sample(sigmoid(dotW_b(x.tm(1))))
x.value = binomial_sample(sigmoid(reconstruct(h, x.tm(1))))
## the assumption is that the inverse of sigmoid is the identity fn
pseudo_cost = free_energy(x.tp(k)) - free_energy(x.t0)


'''
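The ``recurrent_layer`` shell can be mimicked with a small class: you set ``.t0`` and ``.value`` (here simply a function of the previous step) and ``.tp(k)`` unrolls the recurrence k steps on demand. A rough sketch under those assumptions, with a toy recurrence standing in for the CD-k chain:

```python
class RecurrentLayer:
    """Shell layer: configured after construction, unrolled on demand."""
    def __init__(self):
        self.t0 = None          # initial value, set by the user
        self.value = None       # recurrence: a function of the previous step

    def tp(self, k):
        """Unroll the recurrence k steps forward from t0."""
        v = self.t0
        for _ in range(k):
            v = self.value(v)
        return v

# Toy recurrence standing in for the sampling chain: x_t = x_{t-1} / 2
x = RecurrentLayer()
x.t0 = 8.0
x.value = lambda prev: prev / 2.0
# x.tp(3) == 1.0, x.tp(0) == 8.0
```

This is why the shell must exist before ``.value`` is written: the recurrence refers to the layer itself (``y.tm(1)`` in the RNN sketch), so the object has to be created first and filled in afterwards.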
How do I deal with constraints?
-------------------------------

Use the add_constraints transform. You are required to pass the
constraint transform together with the initial values of its
hyper-parameters (this part is still an open question).

How do I deal with other types of networks?
-------------------------------------------

(opaque transforms)

    new_data = PCA(data_x)

    svm_predictions = SVM(data_x)
    svm_learner = SVM_learner(svm_predictions)
    # Note that for the SVM this might be just syntactic sugar; we have
    # the two steps because we expect different interfaces for these
    # nodes

Step 1.5
========

There is a wrapper function called plugin. Once you call plugin on any
of the previous nodes you will get a plugin that follows a certain set
of conventions.

'''