view doc/v2_planning/plugin_RP.py @ 1326:1b97fae7ea0d
added the parameter noise_value to binomial_noise formula.
author | Frederic Bastien <nouiz@nouiz.org> |
---|---|
date | Wed, 13 Oct 2010 15:04:42 -0400 |
parents | 681b5e7e3b81 |
children |
'''
!!! Incomplete file .. many of the things I've set up to do are not done yet !!!

============
Introduction
============

What this file talks about
==========================

* Proposal for the layer committee
* Proposal of how to deal with plug-ins (STEP 2)
* Description of how to glue the two parts
* Some personal beliefs and argumentation

The file will point out how:

* to use the APIs the other committees proposed, or why and how they should
  change
* it satisfies the listed requirements (or why it doesn't)
* this approach might be better than others (or worse), to the best of my
  knowledge

Motivation for writing this file
================================

I wrote this file because:

* It will probably answer most of the questions regarding my view,
  minimizing the time wasted on talks
* Presenting the entire interface helps to see holes in the approach
* It is here for everybody to read (easier dissemination of information)

=======
Concept
=======

I think any experiment that we (or anybody else) would want to run with
our library will be composed of two steps:

* Step 1. Constructing (or choosing or initializing) the model, the
  datasets, error measures, optimizers and so on (everything up to the
  iterative loop). I think this step has been covered by different
  committees, but possibly glued together by the layer committee.

* Step 2. Composing the iterative loops and performing them (this is what
  the architecture committee dealt with).

I believe there is a natural way of going from *Step 1* to *Step 2*,
which is presented as Step 1.5.

Step 2
======

I will start with Step 2 (because I think that is more of a hot subject
right now). I will assume you have the right plugins at hand.
This is a DBN with early stopping and ..

.. code-block:: python
'''
import cPickle
import theano.tensor as T

data = load_mnist()
train_xy, valid_xy, test_xy = split(data, split =
                                    [(0,40000),(40000,50000),(50000,60000)])
train_x, train_y = train_xy
valid_x, valid_y = valid_xy
test_x, test_y = test_xy

################# CONSTRUCTING THE MODEL ###################################
############################################################################

x0 = pca(train_x)

## Layer 1:
h1 = sigmoid(dotW_b(x0, units = 200), constraint = L1(coeff = 0.1))
x1 = recurrent_layer()
x1.t0 = x0
x1.value = binomial_sample(sigmoid(reconstruct(binomial_sample(h1), x0)))
cost = free_energy(train_x) - free_energy(x1.t(5))
grads = [(g.var, T.grad(cost.var, g.var)) for g in cost.params]
pseudo_cost = sum([pl.sum(pl.abs(g)) for g in cost.params])
rbm1 = SGD(cost = pseudo_cost, grads = grads)

# Layer 2:
rbm2, h2 = rbm(h1, units = 200, k = 5, use = 'CD')

# Logreg
logreg, out = logreg(h2, units = 10)
train_err = mean_over(missclassification(argmax(out), train_y))
valid_err = train_err.replace({train_x: valid_x, train_y: valid_y})
test_err = train_err.replace({train_x: test_x, train_y: test_y})

##########################################################################
############### Constructing the training loop ###########################

ca = Scheduler()

### Constructing Modes ###
# TODO: unfinished stub; pretrain_layer1 is used below as a mode
class pretrain_layer1():
    def register(self):
        pass

pretrain_layer2 = ca.mode('pretrain1')
early_stopper = ca.mode('early')
code_block = ca.mode('code_block')
kfolds = ca.mode('kfolds')

# Construct modes dependency graph
code_block.include([pretrain_layer1, pretrain_layer2, early_stopper])
kfolds.include(code_block)

pretrain_layer1.act(on = code_block.begin(), when = always())
pretrain_layer2.act(on = pretrain_layer1.end(), when = always())
early_stopper.act(on = pretrain_layer2.end(), when = always())


# Construct counter plugin that keeps track of number of epochs
@FnPlugin
def counter(self, msg):
    # a bit of a hack.. it will look more classic if you would
    # start with a class instead
    if not hasattr(self, 'val'):
        self.val = 0

    if msg == Message('eod'):
        self.val += 1
        if self.val < 10:
            self.fire(Message('continue'))
        else:
            self.fire(Message('terminate'))


# Construct pre-training plugins
rbm1_plugin = pretrain_layer1.include(plugin_wrapper(rbm1))
rbm2_plugin = pretrain_layer2.include(plugin_wrapper(rbm2))
rbm1_counter = pretrain_layer1.include(counter)
rbm2_counter = pretrain_layer2.include(counter)

rbm1_plugin.listen(Message('init'), update_hyperparameters)
rbm1_plugin.listen(Message('continue'), dataset_restart)
rbm2_plugin.listen(Message('init'), update_hyperparameters)
rbm2_plugin.listen(Message('continue'), dataset_restart)

# Dependency graph for pre-training layer 0
rbm1_plugin.act(on = [pretrain_layer1.begin(),
                      rbm1_plugin.value()],
                when = always())
rbm1_counter.act(on = rbm1_plugin.eod(), when = always())

# Dependency graph for pre-training layer 1
rbm2_plugin.act(on = [pretrain_layer2.begin(),
                      rbm2_plugin.value()],
                when = always())
pretrain_layer2.stop(on = rbm2_plugin.eod(), when = always())

# Constructing fine-tuning plugins
learner = early_stopper.include(plugin_wrapper(logreg))
validation = early_stopper.include(plugin_wrapper(valid_err))
clock = early_stopper.include(ca.generate_clock())
early_stopper_plugin = early_stopper.include(early_stopper_plugin)


def save_model(plugin):
    # cPickle.dump expects an open file object, not a file name
    with open('just_the_model.pkl', 'wb') as f:
        cPickle.dump(plugin.object, f)

learner.listen(Message('init'), update_hyperparameters)
validation.listen(Message('init'), update_hyperparameters)
validation.listen(early_stopper_plugin.new_best_score(), save_model)

learner.act(on = early_stopper.begin(), when = always())
learner.act(on = learner.value(), when = always())
validation.act(on = clock.hour(), when = every(n = 1))
early_stopper.act(on = validation.value(), when = always())


@FnPlugin
def kfolds_plugin(self, event):
    if not hasattr(self, 'n'):
        self.n = -1
        self.splits = [ [ (    0,40000),(40000,50000),(50000,60000) ],
                        [ (10000,50000),(50000,60000),(    0,10000) ],
                        [ (20000,60000),(    0,10000),(10000,20000) ] ]
    # stop once every split has been used
    if self.n < len(self.splits) - 1:
        self.n += 1
        msg = Message('new split')
        msg.data = (data.get_hyperparameter('split'), self.splits[self.n])
        self.fire(msg)
    else:
        self.fire(Message('terminate'))


kfolds.include(kfolds_plugin)
kfolds_plugin.act([kfolds.begin(), Message('new split')], when = always())
kfolds_plugin.act(code_block.end(), always())
code_block.act(Message('new split'), always())

ca.include(kfolds)
ca.run()

'''
Notes:
    When a mode is registered to begin with a certain message, it will
    rebroadcast that message when it starts, only switching the type from
    whatever it was to 'init'. It will also send all 'init' messages of the
    mode in which it is included (or of the scheduler).

    One might be able to shorten this by having Macros that create modes
    and automatically register certain plugins to them; you can always
    add plugins to any mode afterwards.

Step 1
======

You start with the dataset, which you construct as the dataset committee
proposed. You continue constructing your model by applying
transformations, more or less like you would in Theano. When constructing
your model you also get a graph "behind the scenes". Note though that
this graph is totally different from the one Theano would create!
Let's start with an example:

.. code-block:: python
'''

data_x, data_y = GPU_transform(load_mnist())
output = sigmoid(dotW_b(data_x,10))
err = cross_entropy(output, data_y)
learner = SGD(err)

'''
This shows how to create the learner behind the logistic regression,
but not the function that will compute the validation error or the test
error (or any other statistics). Before going into the details of what all
those transforms (or the results of applying one) mean, here is
another partial example for an SdA:

.. code-block:: python
'''

## Layer 1:
data_x, data_y = GPU_transform(load_mnist())
noisy_data_x = gaussian_noise(data_x, amount = 0.1)
hidden1 = tanh(dotW_b(data_x, n_units = 200))
reconstruct1 = reconstruct(hidden1.replace({data_x: noisy_data_x}),
                           noisy_data_x)
err1 = cross_entropy(reconstruct1, data_x)
learner1 = SGD(err1)

# Layer 2:
noisy_hidden1 = gaussian_noise(hidden1, amount = 0.1)
hidden2 = tanh(dotW_b(hidden1, n_units = 200))
reconstruct2 = reconstruct(hidden2.replace({hidden1: noisy_hidden1}),
                           noisy_hidden1)
err2 = cross_entropy(reconstruct2, hidden1)
learner2 = SGD(err2)

# Top layer:
output = sigmoid(dotW_b(hidden2, n_units = 10))
err = cross_entropy(output, data_y)
learner = SGD(err)

'''
What's going on here?
---------------------

By calling different "transforms" (we could call them ops or functions)
you decide what the architecture does. What you get back from applying
any of these transforms are nodes. There are different types of nodes
(which I will enumerate a bit later), but they all offer a basic
interface. That interface is the dataset API plus a few more methods
and/or attributes.
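
As a rough illustration of what "every transform returns a node with a
dataset-like interface" could mean, here is a minimal sketch. It assumes
the dataset API boils down to plain iteration, which is a simplification;
the names ``Node`` and ``transform`` are hypothetical and are not part of
any committee proposal.

```python
# Hypothetical sketch: Node and transform are illustrative names only.
import math


class Node(object):
    # A node wraps a source (a parent node or any iterable of examples)
    # and an optional per-example function.

    def __init__(self, source, fn=None):
        self.source = source
        self.fn = fn

    def __iter__(self):
        # Lazy: nothing is computed until somebody iterates over the node.
        for example in self.source:
            if self.fn is None:
                yield example
            else:
                yield self.fn(example)


def transform(fn):
    # Turn a per-example function into a transform that returns a new Node.
    def apply(node):
        return Node(node, fn)
    return apply


# Two toy transforms standing in for things like dotW_b and sigmoid.
double = transform(lambda v: 2.0 * v)
sigmoid = transform(lambda v: 1.0 / (1.0 + math.exp(-v)))

data = Node([0.0, 1.0, -1.0])      # the "dataset" is itself a node
hidden = sigmoid(double(data))     # applying transforms yields nodes

values = list(hidden)              # iterating produces the transformed data
originals = list(data)             # the dataset node is still iterable
```

Iterating over ``hidden`` walks the chain back to ``data`` and applies each
function on the fly, so intermediate nodes such as ``double(data)`` can be
iterated on their own as well.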

There are also a few transforms that work on the graph that I think will
be pretty useful:

* .replace(dict) -> method; replaces the subgraphs given as keys with
  the ones given as values; throws an exception if it is impossible

* replace(nodes, dict) -> function; calls replace on all given nodes
  with that dictionary

* reconstruct(dict) -> transform; tries to reconstruct the nodes given as
  keys starting from the nodes given as values, by going through the
  inverse of all transforms that are in between

* .tm, .tp -> methods; return nodes that correspond to the value
  at t-k or t+k

* recurrent_layer -> function; creates a special type of node that is
  recurrent; the node has two important attributes that need to be
  specified before calling the node iterator: .t0, which represents the
  initial value, and .value, which should describe the recurrent relation

* add_constraints -> transform; adds a constraint to a given node

* data_listener -> function; creates a special node that listens for
  messages to get data; it should be used to decompose the architecture
  into modules that can run on different machines

* switch(hyperparam, dict) -> transform; a lazy switch that allows you
  to construct by hyper-parameters

* get_hyperparameter(name) -> method; given a name, it will return the
  first node, starting from the top, that is a hyper-parameter and has
  that name

* get_parameter(name) -> method; given a name, it will return the first
  node, starting from the top, that is a parameter and has that name

* get_hyperparameters()

* get_parameters()

Because every node provides the dataset API, you can iterate over any of
the nodes. They will produce the original dataset transformed up to that
point.

** NOTES **

1. This is not like a symbolic graph. When adding a transform you can get
   a warning right away. This is because you start from the dataset and
   you always have access to some data. Sometimes, though, you would want
   the nodes to be lazy, i.e. not try to compute everything until the
   graph is done.

2. You can still have complex Theano expressions. Each node has a Theano
   variable describing the graph up to that point, plus optionally a
   compiled function over which you can iterate. We can use some
   on-demand mechanism to compile when needed.

What types of nodes do you have
--------------------------------

Note that this differentiation is more or less semantic and not
necessarily syntactic. It is just there to help in understanding the graph.

* Data Nodes -- datasets are such nodes; the result of any simple
  transform is also a data node (like the result of a sigmoid, or dotW_b)

* Learner Nodes -- they are the same as data nodes, with the difference
  that they have side effects on the model; they update the weights

* Apply Nodes -- they are used to connect input variables to the
  transformation/op node and output nodes

* Dependency Nodes -- very similar to apply nodes, except that they
  connect constraint subgraphs to a model graph

* Parameter Nodes -- when iterating over them they will only output the
  values of the parameters

* Hyper-parameter Nodes -- very similar to parameter nodes; this is a
  semantic difference (they are not updated by any learner nodes)

* Transform Nodes -- these nodes describe the mathematical function and,
  if there is one, the inverse of that transform; there would usually be
  two types of transforms, those that use Theano and those that do not;
  the distinction matters because those that do can be composed

Each node is lazy, in the sense that unless you try to iterate on it, it
will not try to compute the next value.

Isn't this too low level ?
--------------------------

I think this way of writing and decomposing your neural network is
efficient and useful when writing such networks. Of course, when you
just want to run a classical SdA you shouldn't need to go through the
trouble of writing all that. I think we should have Macros for this.

* Macro -- syntactically it looks just like a transform (i.e. a Python
  function), except that it actually applies multiple transforms to the
  input and might return several nodes (not just one).
  Example:

    learner, prediction, pretraining_learners = SdA(
        input = data_x,
        target = data_y,
        hiddens = [200,200],
        noises = [0.1,0.1])

How do you deal with loops ?
----------------------------

When implementing architectures you sometimes need to loop, as for RNN,
CD, PCD etc. Adding loops in such a scheme is always hard. I borrowed the
idea in the code below from PyBrain. You first construct a shell layer
that you call a recurrent layer. Then you define the functionality by
giving the initial value and the recurrent step. For example:

.. code-block:: python
'''

# sketch of writing a RNN
x = load_mnist()
y = recurrent_layer()
y.value = tanh(dotW(x, n=50).t(0) + dotW(y.t(-1),50))
y.t0 = zeros( (50,))
out = dotW(y,10)

# sketch of writing CDk starting from x
x = recurrent_layer()
x.t0 = input_values
h = binomial_sample( sigmoid( dotW_b(x.tm(1))))
x.value = binomial_sample( sigmoid( reconstruct(h, x.tm(1))))
## the assumption is that the inverse of sigmoid is the identity fn
pseudo_cost = free_energy(x.tp(k)) - free_energy(x.t0)

'''
How do I deal with constraints ?
--------------------------------

Use add_constraints. You are required to pass a transform with the
initial values of its hyper-parameters ?

How do I deal with other types of networks ?
--------------------------------------------

(opaque transforms)

    new_data = PCA(data_x)

    svm_predictions = SVM(data_x)
    svm_learner = SVM_learner(svm_predictions)
    # Note that for the SVM this might be just syntactic sugar; we have
    # the two steps because we expect different interfaces for these nodes

Step 1.5
========

There is a wrapper function called plugin. Once you call plugin over any
of the previous nodes you will get a plugin that follows a certain set of
conventions.
'''