pylearn changeset 1202:7fff3d5c7694
ARCHITECTURE/LAYER: an incomplete story about the plug-ins and way of constructing models
| author   | pascanur |
| date     | Mon, 20 Sep 2010 20:35:03 -0400 |
| parents  | 46527ae6db53 |
| children | 865936d8221b |
| files    | doc/v2_planning/plugin_RP.py |
| diffstat | 1 file changed, 406 insertions(+), 129 deletions(-) |
--- a/doc/v2_planning/plugin_RP.py  Mon Sep 20 17:05:15 2010 -0400
+++ b/doc/v2_planning/plugin_RP.py  Mon Sep 20 20:35:03 2010 -0400
@@ -1,163 +1,440 @@

'''

!!! Incomplete file .. many of the things I've set out to do are not done
yet !!!

============
Introduction
============

What this file talks about
==========================

* Proposal for the layer committee
* Proposal of how to deal with plug-ins (STEP 2)
* Description of how to glue the two parts
* Some personal beliefs and argumentation

The file will point out:

* how to use the APIs the other committees proposed, or why and how they
  should change
* how it satisfies the listed requirements (or why it doesn't)
* how this approach might be better than others (or worse), to the best of
  my knowledge

Motivation for writing this file
================================

I wrote this file because:

* it will probably answer most of the questions regarding my view,
  minimizing the time wasted on talks
* presenting the entire interface helps to see holes in the approach
* it is here for everybody to read (easier dissemination of information)


=======
Concept
=======

I think any experiment that we (or anybody else) would want to run with
our library will be composed of two steps:

* Step 1. Constructing (or choosing, or initializing) the model, the
  datasets, error measures, optimizers and so on (everything up to the
  iterative loop). I think this step has been covered by the different
  committees, and will possibly be glued together by the layer committee.

* Step 2. Composing the iterative loops and performing them (this is what
  the architecture committee dealt with).

I believe there is a natural way of going from *Step 1* to *Step 2*,
which would be presented as Step 1.5.

Step 2
======

I will start with Step 2 (because I think that is more of a hot subject
right now). I will assume you have the right plugins at hand.
This is a DBN with early stopping and ..

.. code-block:: python
'''
data = load_mnist()
train_xy, valid_xy, test_xy = split(data, split =
        [(0, 40000), (40000, 50000), (50000, 60000)])
train_x, train_y = train_xy
valid_x, valid_y = valid_xy
test_x, test_y = test_xy

################# CONSTRUCTING THE MODEL ###################################
############################################################################

x0 = pca(train_x)

## Layer 1:
h1 = sigmoid(dotW_b(x0, units = 200), constraint = L1(coeff = 0.1))
x1 = recurrent_layer()
x1.t0 = x0
x1.value = binomial_sample(sigmoid(reconstruct(binomial_sample(h1), x0)))
cost = free_energy(train_x) - free_energy(x1.tp(5))
grads = [(g.var, T.grad(cost.var, g.var)) for g in cost.params]
pseudo_cost = sum([pl.sum(pl.abs(g)) for g in cost.params])
rbm1 = SGD(cost = pseudo_cost, grads = grads)

# Layer 2:
rbm2, h2 = rbm(h1, units = 200, k = 5, use = 'CD')
# Logreg
logreg_learner, out = logreg(h2, units = 10)
train_err = mean_over(misclassification(argmax(out), train_y))
valid_err = train_err.replace({train_x: valid_x, train_y: valid_y})
test_err = train_err.replace({train_x: test_x, train_y: test_y})

###########################################################################
############### Constructing the training loop ###########################

sched = Schedular()


### Constructing Modes ###
pretrain_layer1 = sched.mode('pretrain0')
pretrain_layer2 = sched.mode('pretrain1')
early_stopper = sched.mode('early')
valid0 = sched.mode('valid0')
kfolds = sched.mode('kfolds')

# Construct the modes dependency graph
valid0.include([pretrain_layer1, pretrain_layer2, early_stopper])
kfolds.include(valid0)

pretrain_layer1.act(on = valid0.begin(), when = always())
pretrain_layer2.act(on = pretrain_layer1.end(), when = always())
early_stopper.act(on = pretrain_layer2.end(), when = always())

# Construct a counter plugin that keeps track of the number of epochs
@FnPlugin
def counter(self, msg):
    # a bit of a hack .. it will look more classic if you
    # start with a class instead
    if not hasattr(self, 'val'):
        self.val = 0

    if msg == Message('eod'):
        self.val += 1
        if self.val < 10:
            self.fire(Message('continue'))
        else:
            self.fire(Message('terminate'))


# Construct pre-training plugins
rbm1_plugin = plugin_wrapper(rbm1, sched = pretrain_layer1)
rbm1_plugin.listen(Message('init'), update_hyperparameters)
rbm2_plugin = plugin_wrapper(rbm2, sched = pretrain_layer2)
rbm2_plugin.listen(Message('init'), update_hyperparameters)
rbm1_counter = pretrain_layer1.register(counter)
rbm2_counter = pretrain_layer2.register(counter)


# Dependency graph for pre-training layer 0
rbm1_plugin.act(on = [pretrain_layer1.begin(), Message('continue')],
                when = always())
rbm1_counter.act(on = rbm1_plugin.eod(), when = always())


# Dependency graph for pre-training layer 1
rbm2_plugin.act(on = pretrain_layer2.begin(), when = always())
pretrain_layer2.stop(on = rbm2_plugin.eod(), when = always())


# Constructing fine-tuning plugins
learner = early_stopper.register(plugin_wrapper(logreg_learner))
learner.listen(Message('init'), update_hyperparameters)
validation = early_stopper.register(plugin_wrapper(valid_err))
validation.listen(Message('init'), update_hyperparameters)
clock = early_stopper.register(sched.generate_clock())
# early_stopper_plugin is assumed to be provided by the library
early_stopper_plugin = early_stopper.register(early_stopper_plugin)

@FnPlugin
def save_model(self, message):
    cPickle.dump(logreg_learner, open('model.pkl', 'wb'))


learner.act(on = early_stopper.begin(), when = always())
learner.act(on = learner.value(), when = always())
validation.act(on = clock.hour(), when = every(n = 1))
early_stopper.act(on = validation.value(), when = always())
save_model.act(on = early_stopper.new_best_error(), when = always())

@FnPlugin
def kfolds_plugin(self, event):
    if not hasattr(self, 'n'):
        self.n = -1
        self.splits = [[( 0, 40000), (40000, 50000), (50000, 60000)],
                       [(10000, 50000), (50000, 60000), ( 0, 10000)],
                       [(20000, 60000), ( 0, 10000), (10000, 20000)]]
    if self.n + 1 < len(self.splits):
        self.n += 1
        msg = Message('new split')
        msg.data = (data.get_hyperparam('split'), self.splits[self.n])
        self.fire(msg)
    else:
        self.fire(Message('terminate'))


kfolds.register(kfolds_plugin)
kfolds_plugin.act(on = kfolds.begin(), when = always())
kfolds_plugin.act(on = valid0.end(), when = always())
valid0.act(on = Message('new split'), when = always())

sched.include(kfolds)

sched.run()

'''

Notes:

  When a mode is registered to begin with a certain message, it will
  rebroadcast that message when it starts, only switching its type from
  whatever it was to 'init'. It will also send all 'init' messages of the
  mode in which it is included (or of the schedular).

  One might be able to shorten this by having Macros that create modes and
  automatically register certain plugins to them; you can always add more
  plugins to any mode afterwards.
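None of the classes used above exist yet. To make the conventions a bit more
concrete, here is a minimal plain-Python sketch of how a schedular, FnPlugin
and Message *could* behave. It is only an illustration of the behaviour the
script assumes, not the proposed implementation: modes, listen, every and the
clock are left out, and the dispatch loop, register and once/always below are
my own guesses.

.. code-block:: python

    class Message(object):
        def __init__(self, type, **data):
            self.type = type
            self.__dict__.update(data)
        def __eq__(self, other):
            return isinstance(other, Message) and self.type == other.type

    def always():
        return lambda msg: True

    def once():
        seen = [False]
        def when(msg):
            if seen[0]:
                return False
            seen[0] = True
            return True
        return when

    class Schedular(object):
        def __init__(self):
            self.listeners = []       # (plugin, message type, condition)
            self.queue = []
        def register(self, plugin):
            plugin.sched = self
            return plugin
        def fire(self, msg):
            self.queue.append(msg)
        def run(self):
            self.fire(Message('begin'))
            while self.queue:
                msg = self.queue.pop(0)
                if msg.type == 'terminate':
                    break
                for plugin, trigger, when in self.listeners:
                    if msg.type == trigger and when(msg):
                        plugin(msg)

    def FnPlugin(fn):
        # turn a plain function into a plugin object with .act and .fire
        class _Plugin(object):
            sched = None
            def __call__(self, msg):
                fn(self, msg)
            def act(self, on, when):
                self.sched.listeners.append((self, on.type, when))
            def fire(self, msg):
                self.sched.fire(msg)
        return _Plugin()

    # usage: a trivial producer/consumer pair
    received = []

    @FnPlugin
    def producer(self, msg):
        self.fire(Message('stuff', value = 'some text'))

    @FnPlugin
    def consumer(self, msg):
        received.append(msg.value)

    sched = Schedular()
    sched.register(producer)
    sched.register(consumer)
    producer.act(on = Message('begin'), when = once())
    consumer.act(on = Message('stuff'), when = always())
    sched.run()
    assert received == ['some text']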
Step 1
======

You start with the dataset, which you construct as the dataset committee
proposed. You continue constructing your model by applying transformations,
more or less like you would in Theano. When constructing your model you also
get a graph "behind the scenes". Note though that this graph is totally
different from the one Theano would create!
Let's start with an example:

.. code-block:: python

'''
    data_x, data_y = GPU_transform(load_mnist())
    output = sigmoid(dotW_b(data_x, 10))
    err = cross_entropy(output, data_y)
    learner = SGD(err)
'''

This shows how to create the learner behind the logistic regression,
but not the function that will compute the validation error or the test
error (or any other statistics). Before going into the details of what
all those transforms (or the results of applying one) mean, here is
another partial example, for an SdA:

.. code-block:: python

'''
    ## Layer 1:

    data_x, data_y = GPU_transform(load_mnist())
    noisy_data_x = gaussian_noise(data_x, amount = 0.1)
    hidden1 = tanh(dotW_b(data_x, n_units = 200))
    reconstruct1 = reconstruct(hidden1.replace({data_x: noisy_data_x}),
                               noisy_data_x)
    err1 = cross_entropy(reconstruct1, data_x)
    learner1 = SGD(err1)

    # Layer 2:
    noisy_hidden1 = gaussian_noise(hidden1, amount = 0.1)
    hidden2 = tanh(dotW_b(hidden1, n_units = 200))
    reconstruct2 = reconstruct(hidden2.replace({hidden1: noisy_hidden1}),
                               noisy_hidden1)
    err2 = cross_entropy(reconstruct2, hidden1)
    learner2 = SGD(err2)

    # Top layer:
    output = sigmoid(dotW_b(hidden2, n_units = 10))
    err = cross_entropy(output, data_y)
    learner = SGD(err)

'''
What's going on here?
---------------------

By calling different "transforms" (we could call them ops or functions)
you decide what the architecture does. What you get back from applying
any of these transforms are nodes. You have different types of nodes
(which I will enumerate a bit later) but they all offer a basic interface.
That interface is the dataset API + a few more methods and/or attributes.
There are also a few transforms that work on the graph that I think will
be pretty useful:

* .replace(dict) -> method; replaces the subgraphs given as keys with
                    the ones given as values; throws an exception if it
                    is impossible

* reconstruct(dict) -> transform; tries to reconstruct the nodes given as
                       keys starting from the nodes given as values by
                       going through the inverse of all transforms that
                       are in between

* .tm, .tp -> methods; return nodes that correspond to the value
              at t-k or t+k

* recurrent_layer -> function; creates a special type of node that is
                     recurrent; the node has two important attributes that
                     need to be specified before calling the node iterator;
                     those attributes are .t0, which represents the initial
                     value, and .value, which should describe the recurrent
                     relation

* add_constraints -> transform; adds a constraint to a given node

* data_listener -> function; creates a special node that listens for
                   messages to get data; it should be used to decompose
                   the architecture into modules that can run on different
                   machines

* switch(hyperparam, dict) -> transform; a lazy switch that allows you
                              to construct by hyper-parameters

* get_hyperparameter -> method; given a name it will return the first node,
                        starting from the top, that is a hyper-parameter and
                        has that name

* get_parameter -> method; given a name it will return the first node,
                   starting from the top, that is a parameter and has that
                   name


Because every node provides the dataset API, you can iterate over
any of the nodes. They will produce the original dataset transformed up
to that point.

** NOTES **

1. This is not like a symbolic graph. When adding a transform you can get a
warning right away. This is because you start from the dataset and you
always have access to some data. Though sometimes you would want to have
the nodes lazy, i.e. not try to compute everything until the graph is done.

2. You can still have complex Theano expressions. Each node has a Theano
variable describing the graph up to that point + optionally a compiled
function over which you can iterate. We can use some on_demand mechanism to
compile when needed.
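To make the node idea more concrete, here is a small illustrative sketch of
a node that offers iteration (the dataset API) plus .replace. It is my own
guess at the mechanics, not the proposed implementation; the Node class, the
dataset helper and the toy sigmoid below are made up, and real nodes would
also carry Theano variables, parameters and so on.

.. code-block:: python

    import math

    class Node(object):
        def __init__(self, inputs, fn):
            self.inputs = inputs      # parent nodes ([] for a dataset)
            self.fn = fn              # applied to the parents' values

        def __iter__(self):
            # the dataset API: iterating yields the original data
            # transformed up to this node
            if not self.inputs:
                for value in self.fn():
                    yield value
            else:
                for values in zip(*self.inputs):
                    yield self.fn(*values)

        def replace(self, mapping):
            # return a copy of the graph where the subgraphs given as keys
            # are swapped for the ones given as values
            if self in mapping:
                return mapping[self]
            if not self.inputs:
                return self
            return Node([p.replace(mapping) for p in self.inputs], self.fn)

    def dataset(values):
        return Node([], lambda: iter(values))

    def sigmoid(node):
        return Node([node], lambda x: 1.0 / (1.0 + math.exp(-x)))

    # usage: iterate over a transformed stream, or swap the input dataset
    train_x = dataset([-2.0, 0.0, 2.0])
    valid_x = dataset([1.0, -1.0])
    out = sigmoid(train_x)
    out_on_valid = out.replace({train_x: valid_x})
    train_values = list(out)            # sigmoid of every training value
    valid_values = list(out_on_valid)   # same transform on the validation set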
What types of nodes do you have
-------------------------------

Note that this differentiation is more or less semantical and not mandatory
syntactical. It is just to help understand the graph.

* Data Nodes -- datasets are such nodes; the result of any
                simple transform is also a data node (like the result
                of a sigmoid, or dotW_b)
* Learner Nodes -- they are the same as data nodes, with the
                   difference that they have side effects on the model;
                   they update the weights
* Apply Nodes -- they are used to connect input variables to
                 the transformation/op node and output nodes
* Dependency Nodes -- very similar to apply nodes, just that they connect
                      constraint subgraphs to a model graph
* Parameter Nodes -- when iterating over them they will only output
                     the values of the parameters
* Hyper-parameter Nodes -- very similar to parameter nodes; this is a
                           semantical difference (they are not updated by
                           any learner nodes)
* Transform Nodes -- these nodes describe the mathematical function and,
                     if there is one, the inverse of that transform; there
                     would usually be two types of transforms, ones that
                     use Theano and those that do not -- this is because
                     those that do can be composed

Each node is lazy, in the sense that unless you try to iterate on it, it
will not try to compute the next value.


Isn't this too low level ?
--------------------------

I think that way of writing and decomposing your neural network is
efficient and useful when writing such networks. Of course, when you
just want to run a classical SdA you shouldn't need to go through the
trouble of writing all that. I think we should have Macros for this.

* Macro -- syntactically it looks just like a transform (i.e. a python
  function), only that it actually applies multiple transforms to the input
  and might return several nodes (not just one). Example:

  learner, prediction, pretraining_learners = SdA(
                          input   = data_x,
                          target  = data_y,
                          hiddens = [200, 200],
                          noises  = [0.1, 0.1])
How do you deal with loops ?
----------------------------

When implementing architectures you sometimes need to loop, as for an RNN
or for CD, PCD etc. Adding loops in such a scheme is always hard. I borrowed
the idea in the code below from PyBrain. You first construct a shell layer
that you call a recurrent layer. Then you define its functionality by giving
the initial value and the recurrent step. For example:

.. code-block:: python

'''
    # sketch of writing a RNN
    x = load_mnist()
    y = recurrent_layer()
    y.value = tanh(dotW(x, n = 50) + dotW(y.tm(1), 50))
    y.t0 = zeros((50,))
    out = dotW(y, 10)


    # sketch of writing CD-k starting from x
    x = recurrent_layer()
    x.t0 = input_values
    h = binomial_sample(sigmoid(dotW_b(x.tm(1))))
    x.value = binomial_sample(sigmoid(reconstruct(h, x.tm(1))))
    ## the assumption is that the inverse of sigmoid is the identity fn
    pseudo_cost = free_energy(x.tp(k)) - free_energy(x.t0)
'''
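The recurrent_layer shell does not exist yet; as a rough illustration of the
two-phase idea (build an empty shell first, give it its initial value and its
step afterwards), here is a tiny stand-alone sketch. The RecurrentLayer class
and its unroll method are made up, and the recurrent step is given as a plain
callable rather than as an expression over y.tm(1), which the real nodes
would record symbolically.

.. code-block:: python

    class RecurrentLayer(object):
        # an empty shell; .t0 and .value are filled in by the user later
        def __init__(self):
            self.t0 = None         # initial value
            self.value = None      # callable: previous state -> next state

        def unroll(self, steps):
            # apply the recurrent relation a fixed number of times
            state = self.t0
            states = [state]
            for _ in range(steps):
                state = self.value(state)
                states.append(state)
            return states

    # usage: construct the shell first, define the recurrence afterwards
    y = RecurrentLayer()
    y.t0 = 0.0
    y.value = lambda prev: 0.5 * prev + 1.0
    assert y.unroll(3) == [0.0, 1.0, 1.5, 1.75]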
How do I deal with constraints ?
--------------------------------

Use add_constraints. You are required to pass a transform with its
hyper-parameters' initial values?

How do I deal with other types of networks ?
--------------------------------------------

(opaque transforms)

  new_data = PCA(data_x)

  svm_predictions = SVM(data_x)
  svm_learner = SVM_learner(svm_predictions)
  # Note that for the SVM this might be just syntactic sugar; we have the
  # two steps because we expect different interfaces for these nodes


Step 1.5
========

There is a wrapper function called plugin. Once you call plugin over any of
the previous nodes you will get a plugin that follows a certain set of
conventions.
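The exact set of conventions is not pinned down yet. Purely as an
illustrative guess (nothing below is part of the proposal, and the message
protocol is invented), the wrapper could react to an 'init' message by
(re)starting the underlying node, and fire 'eod' once the node is exhausted:

.. code-block:: python

    class PluginWrapper(object):
        def __init__(self, node, sched = None):
            self.node = node          # any iterable node from Step 1
            self.sched = sched
            self._it = None

        def fire(self, msg):
            if self.sched is not None:
                self.sched.fire(msg)

        def __call__(self, msg):
            if msg == 'init' or self._it is None:
                self._it = iter(self.node)    # (re)start the node
                return
            try:
                next(self._it)                # do one unit of work
            except StopIteration:
                self.fire('eod')              # tell the schedular we are done

    def plugin(node, sched = None):
        return PluginWrapper(node, sched)

    # usage (a plain list stands in for a node)
    p = plugin([1, 2, 3])
    p('init')
    p('step'); p('step'); p('step')
    p('step')   # the node is exhausted; 'eod' would be fired at a schedular

'''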