changeset 1202:7fff3d5c7694

ARCHITECTURE/LAYER: an incomplete story about the plug-ins and the way of constructing models
author pascanur
date Mon, 20 Sep 2010 20:35:03 -0400
parents 46527ae6db53
children 865936d8221b
files doc/v2_planning/plugin_RP.py
diffstat 1 files changed, 406 insertions(+), 129 deletions(-) [+]
line wrap: on
line diff
--- a/doc/v2_planning/plugin_RP.py	Mon Sep 20 17:05:15 2010 -0400
+++ b/doc/v2_planning/plugin_RP.py	Mon Sep 20 20:35:03 2010 -0400
@@ -1,163 +1,440 @@
 '''
-=================================================
-Plugin system for interative algortithm Version B
-=================================================
+
+!!! Incomplete file: many of the things I set out to do are not done
+yet !!!
+
+============
+Introduction
+============
+
+What this file talks about
+==========================
+* A proposal for the layer committee
+* A proposal for how to deal with plug-ins (Step 2)
+* A description of how to glue the two parts together
+* Some personal beliefs and arguments
 
-After the meeting (September 16) we sort of stumbled on 
-two possible versions of the plug-in system. This represents
-the second version. It suffered a few changes after seeing 
-Olivier's code and talking to him.
+The file will point out:
+* how to use the APIs the other committees proposed, or why and how they
+should change
+* how it satisfies the listed requirements (or why it doesn't)
+* why this approach might be better than others (or worse), to the best of
+my knowledge
+
 
+Motivation for writing this file
+================================
+
+I wrote this file because:
+* it will probably answer most of the questions regarding my view,
+minimizing the time spent in discussions
+* presenting the entire interface helps to spot holes in the approach
+* it is here for everybody to read (easier dissemination of information)
+
+
+=======
 Concept
 =======
 
-The basic idea behind this version is not to have a list of all 
-possible events, but rather have plugin register to events.By 
-specifying what plugin listens to which event produced by what 
-plugin you define a sort of dependency graph. Structuring things
-in such a graph might make the script more intuitive when reading.
+I think any experiment that we (or anybody else) would want to run with
+our library will be composed of two steps:
+
+* Step 1. Constructing (or choosing, or initializing) the model, the
+datasets, error measures, optimizers and so on (everything up to the
+iterative loop). I think this step has been covered by the different
+committees, but possibly glued together by the layer committee.
+
+* Step 2. Composing the iterative loops and running them (this is what the
+architecture committee dealt with).
+
+I believe there is a natural way of going from *Step 1* to *Step 2*,
+which will be presented as Step 1.5.
+
+Step 2
+======
+
+I will start with Step 2 (because I think that is the hotter subject
+right now). I will assume you have the right plugins at hand.
+This is a DBN with early stopping and k-fold cross-validation:
+
+.. code-block:: python
+'''
+data = load_mnist()
+train_xy, valid_xy, test_xy = split(data, split =
+                                    [(0,40000),(40000,50000),(50000,60000)])
+train_x, train_y = train_xy
+valid_x, valid_y = valid_xy
+test_x,  test_y  = test_xy
+
+################# CONSTRUCTING THE MODEL ###################################
+############################################################################
+
+x0 = pca(train_x)
 
-I will first go through pseudo-code for two example and then enumerate
-my insights and concepts on the matter
+## Layer 1:
+h1       = sigmoid(dotW_b(x0,units = 200), constraint = L1( coeff = 0.1))
+x1       = recurrent_layer()
+x1.t0    = x0
+x1.value = binomial_sample(sigmoid( reconstruct( binomial_sample(h1), x0)))
+cost     = free_energy(x0) - free_energy(x1.tp(5))
+grads    = [ (g.var, T.grad(cost.var, g.var)) for g in cost.params ]
+pseudo_cost = sum([ pl.sum(pl.abs(g)) for g in cost.params])
+rbm1     = SGD( cost = pseudo_cost, grads = grads)
+
+# Layer 2:
+rbm2,h2    = rbm(h1, units = 200, k = 5, use= 'CD')
+# Logreg
+logreg,out = logreg(h2, units = 10)
+train_err  = mean_over(misclassification(argmax(out), train_y))
+valid_err  = train_err.replace({train_x:valid_x, train_y:valid_y})
+test_err   = train_err.replace({train_x: test_x, train_y: test_y})
+
+##########################################################################
+############### Constructing the training loop ###########################
+
+ca = Schedular()
+
+
+### Constructing Modes ###
+pretrain_layer1  = ca.mode('pretrain0')
+pretrain_layer2  = ca.mode('pretrain1')
+early_stopper    = ca.mode('early')
+valid0           = ca.mode('valid0')
+kfolds           = ca.mode('kfolds')
+
+# Construct the modes dependency graph
+valid0.include([ pretrain_layer1, pretrain_layer2, early_stopper])
+kfolds.include( valid0 )
+
+pretrain_layer1.act( on = valid0.begin(), when = always())
+pretrain_layer2.act( on = pretrain_layer1.end(), when = always())
+early_stopper.act( on = pretrain_layer2.end(), when = always())
 
 
-Example : Producer - Consumer that Guillaume described
-======================================================
+# Construct counter plugin that keeps track of number of epochs
+@FnPlugin
+def counter(self, msg):
+    # a bit of a hack .. it would look cleaner if you
+    # started from a class instead
+    if not hasattr(self, 'val'):
+        self.val = 0
+
+    if msg == Message('eod'):
+        self.val += 1
+    if self.val < 10:
+        self.fire(Message('continue'))
+    else:
+        self.fire(Message('terminate'))
+
+
+# Construct pre-training plugins
+rbm1_plugin = plugin_wrapper(rbm1, sched = pretrain_layer1)
+rbm1_plugin.listen(Message('init'), update_hyperparameters)
+rbm2_plugin = plugin_wrapper(rbm2, sched = pretrain_layer2)
+rbm2_plugin.listen(Message('init'), update_hyperparameters)
+rbm1_counter = pretrain_layer1.register(counter)
+rbm2_counter = pretrain_layer2.register(counter)
+
+
+# Dependency graph for pre-training layer 0
+rbm1_plugin.act( on = [ pretrain_layer1.begin(),
+                        Message('continue')     ],
+                 when = always())
+rbm1_counter.act( on = rbm1_plugin.eod(), when = always() )
+
+
+# Dependency graph for pre-training layer 1
+rbm2_plugin.act( on = pretrain_layer2.begin(), when = always())
+pretrain_layer2.stop( on = rbm2_plugin.eod(), when = always())
+
+
+# Constructing the fine-tuning plugins
+learner = early_stopper.register(plugin_wrapper(logreg))
+learner.listen(Message('init'), update_hyperparameters)
+validation = early_stopper.register( plugin_wrapper(valid_err))
+validation.listen(Message('init'), update_hyperparameters)
+clock = early_stopper.register( ca.generate_clock())
+early_stopper_plugin = early_stopper.register( early_stopper_plugin)
+
+@FnPlugin
+def save_model(self, message):
+    cPickle.dump(logreg, open('model.pkl','wb'))
+
+
+learner.act( on = early_stopper.begin(), when = always())
+learner.act( on = learner.value(), when = always())
+validation.act( on = clock.hour(), when = every(n = 1))
+early_stopper.act( on = validation.value(), when = always())
+save_model.act( on = early_stopper.new_best_error(), when = always())
+
+@FnPlugin
+def kfolds_plugin(self,event):
+    if not hasattr(self, 'n'):
+        self.n = -1
+        self.splits = [ [ (    0,40000),(40000,50000),(50000,60000) ],
+                        [ (10000,50000),(50000,60000),(    0,10000) ],
+                        [ (20000,60000),(    0,10000),(10000,20000) ] ]
+    if self.n < len(self.splits) - 1:
+        self.n += 1
+        msg = Message('new split')
+        msg.data = (data.get_hyperparameter('split'), self.splits[self.n])
+        self.fire(msg)
+    else:
+        self.fire(Message('terminate'))
+
+
+kfolds.register(kfolds_plugin)
+kfolds_plugin.act(kfolds.begin(), when = always())
+kfolds_plugin.act(valid0.end(), always() )
+valid0.act(Message('new split'), always() )
+
+ca.include(kfolds)
+
+ca.run()
+
+'''
+
+Notes:
+    when a mode is registered to begin on a certain message, it will
+rebroadcast that message when it starts, only switching its
+type from whatever it was to 'init'. It will also forward all the 'init'
+messages of the mode in which it is included (or of the schedular).
+
+    one might be able to shorten this by having Macros that create modes
+    and automatically register certain plugins to them; you can always
+    add more plugins to any mode afterwards (a hypothetical sketch of such
+    a macro is given below)
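+
+As an illustration only, here is how I imagine such a macro could look.
+Nothing below is settled API: the macro itself and its signature are
+assumptions; the mode, plugin and event names are the ones used in the
+example above.
+
+.. code-block:: python
+
+    # hypothetical macro: builds an early-stopping mode and wires into it
+    # the plugins that the hand-written example above registers one by one
+    def early_stopping_mode(sched, learner_node, valid_node, name = 'early'):
+        mode       = sched.mode(name)
+        learner    = mode.register(plugin_wrapper(learner_node))
+        validation = mode.register(plugin_wrapper(valid_node))
+        clock      = mode.register(sched.generate_clock())
+        learner.listen(Message('init'), update_hyperparameters)
+        validation.listen(Message('init'), update_hyperparameters)
+        # same dependency graph as written by hand above
+        learner.act( on = mode.begin(),       when = always())
+        learner.act( on = learner.value(),    when = always())
+        validation.act( on = clock.hour(),    when = every(n = 1))
+        mode.act( on = validation.value(),    when = always())
+        return mode, learner, validation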
+
+
+
+Step 1
+======
 
 
-.. code-block::
+You start with the dataset, constructed as the dataset committee
+proposed. You continue building your model by applying
+transformations, more or less like you would in Theano. While constructing
+your model you also get a graph "behind the scenes". Note though that
+this graph is totally different from the one Theano would create!
+Let's start with an example:
+
+.. code-block:: python
+
+'''
+    data_x, data_y = GPU_transform(load_mnist())
+    output         = sigmoid(dotW_b(data_x,10))
+    err            = cross_entropy(output, data_y)
+    learner        = SGD(err)
 '''
-    sch = Schedular()
+
+This shows how to create the learner behind the logistic regression,
+but not the function that will compute the validation error or the test
+error (or any other statistics). Before going into the details of what
+all those transforms (or the results of applying one) mean, here
+is another partial example, for an SdA:
+
+.. code-block:: python
+
+'''
+    ## Layer 1:
+
+    data_x,data_y = GPU_transform(load_mnist())
+    noisy_data_x  = gaussian_noise(data_x, amount = 0.1)
+    hidden1       = tanh(dotW_b(data_x, n_units = 200))
+    reconstruct1  = reconstruct(hidden1.replace({data_x: noisy_data_x}),
+                            noisy_data_x)
+    err1          = cross_entropy(reconstruct1, data_x)
+    learner1      = SGD(err1)
+
+    # Layer 2 :
+    noisy_hidden1 = gaussian_noise(hidden1, amount = 0.1)
+    hidden2       = tanh(dotW_b(hidden1, n_units = 200))
+    reconstruct2  = reconstruct(hidden2.replace({hidden1: noisy_hidden1}),
+                            noisy_hidden1)
+    err2          = cross_entropy(reconstruct2, hidden1)
+    learner2      = SGD(err2)
+
+    # Top layer:
 
-    @FnPlugin(sch)
-    def producer(self,event):
-        self.fire('stuff', value = 'some text')
+    output  = sigmoid(dotW_b(hidden2, n_units = 10))
+    err     = cross_entropy(output, data_y)
+    learner = SGD(err)
+
+'''
+
+What's going on here?
+---------------------
+
+By calling different "transforms" (we could call them ops or functions)
+you decide what the architecture does. What you get back from applying
+any of these transforms are nodes. You have different types of nodes
+(which I will enumerate a bit later) but they all offer a basic interface.
+That interface is the dataset API plus a few more methods and/or attributes.
+There are also a few transforms that work on the graph that I think will
+be pretty useful:
+
+* .replace(dict) -> method; replaces the subgraphs given as keys with 
+                    the ones given as values; throws an exception if it
+                    is impossible
+
+* reconstruct(dict) -> transform; tries to reconstruct the nodes given as
+                       keys starting from the nodes given as values by 
+                       going through the inverse of all transforms that 
+                       are in between
 
-    @FnPlugin(sch)
-    def consumer(self,event):
-        print event.value
+* .tm(k), .tp(k) -> methods; return nodes that correspond to the value
+                 at t-k or t+k respectively
+* recurrent_layer -> function; creates a special type of node that is 
+                     recurrent; the node has two important attributes that
+                     need to be specified before calling the node iterator;
+                     those attributes are .t0 which represents the initial 
+                     value and .value which should describe the recurrent
+                     relation
+* add_constraints -> transform; adds a constraint to a given node
+* data_listener -> function; creates a special node that listens for 
+                   messages to get data; it should be used to decompose
+                   the architecture in modules that can run on different
+                   machines
+
+* switch(hyperparam, dict) -> transform; a lazy switch that allows you
+                    to construct the graph conditionally on hyper-parameters
+
+* get_hyperparameter -> method; given a name, it will return the first node,
+                    searching from the top, that is a hyper-parameter and
+                    has that name
+* get_parameter  -> method; given a name, it will return the first node,
+                    searching from the top, that is a parameter and has
+                    that name
+
+
 
-    @FnPlugin(sch)
-    def prod_consumer(self,event):
-        print event.value
-        self.fire('stuff2', value = 'stuff')
+Because every node provides the dataset API, you can iterate over
+any of the nodes. They will produce the original dataset transformed up
+to that point (see the sketch below).
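+
+As a purely illustrative sketch of what this buys you (the node names 'err',
+'learner', 'data_x', 'data_y' are from the logistic regression example
+above; 'valid_x', 'valid_y' and the hyper-parameter name 'lrate' are made-up
+placeholders):
+
+.. code-block:: python
+
+    # iterating over any node yields the dataset transformed up to that node
+    for minibatch in err:
+        print minibatch
+
+    # swap the training inputs for hypothetical validation inputs
+    valid_err = err.replace({data_x: valid_x, data_y: valid_y})
+
+    # fetch a hyper-parameter node by name (searching from the top);
+    # iterating over it would simply output its current value
+    lrate = learner.get_hyperparameter('lrate')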
+
+**Notes**
+
+1. This is not like a symbolic graph. When adding a transform
+you can get a warning straight away. This is because you start from
+the dataset and you always have access to some data. Sometimes, though,
+you would want the nodes to be lazy, i.e. not to try to compute everything
+until the graph is done.
+
+2. You can still have complex Theano expressions. Each node has a
+theano variable describing the graph up to that point plus, optionally,
+a compiled function over which you can iterate. We can use some
+on-demand mechanism to compile only when needed.
+
+What types of nodes do you have?
+--------------------------------
+
+Note that this differentiation is mostly semantic, not necessarily
+syntactic. It is just meant to help in understanding the graph.
+
+
+* Data Nodes         -- datasets are such nodes; the result of any
+                simple transform is also a data node (like the result
+                of a sigmoid, or of dotW_b)
+* Learner Nodes      -- they are the same as data nodes, with the
+                difference that they have side effects on the model;
+                they update the weights
+* Apply Nodes        -- they are used to connect input variables to
+                the transform/op node and to the output nodes
+* Dependency Nodes   -- very similar to apply nodes, except that they
+                connect constraint subgraphs to a model graph
+* Parameter Nodes    -- when iterating over them they will only output
+                the values of the parameters
+* Hyper-parameter Nodes -- very similar to parameter nodes; the difference
+                is semantic (they are not updated by any learner node)
+* Transform Nodes       -- these nodes describe the mathematical function
+                and, if there is one, the inverse of that transform; there
+                would usually be two types of transforms: ones that use
+                theano and ones that do not -- the distinction matters
+                because those that use theano can be composed
+
+Each node is lazy, in the sense that unless you try to iterate over it, it
+will not try to compute the next value (see the sketch below).
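+
+As a minimal sketch of what I have in mind for a node -- the class layout
+and method names below are assumptions, not settled API; only .var is taken
+from the examples above -- every node would look roughly like this:
+
+.. code-block:: python
+
+    import theano
+    from itertools import izip
+
+    class Node(object):
+        # sketch: a node knows its input nodes and a theano expression
+        # describing the graph up to this point; the function is compiled
+        # only on demand, and iteration is lazy
+        def __init__(self, inputs, expression):
+            self.inputs = inputs
+            self.var    = expression      # theano variable
+            self._fn    = None            # compiled lazily, on demand
+
+        def fn(self):
+            if self._fn is None:
+                self._fn = theano.function([i.var for i in self.inputs],
+                                           self.var)
+            return self._fn
+
+        def __iter__(self):
+            # dataset API: produce the inputs' values pushed through this
+            # node, one minibatch at a time, only when asked for
+            for values in izip(*self.inputs):
+                yield self.fn()(*values)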
+
 
-    producer.act( on = Event('begin'), when = once() )
-    producer.act( on = Event('stuff'), when = always() )
-    consumer.act( on = Event('stuff'), when = always() )
-    prod_consumer.act( on = Event('stuff'), when = always() )
+Isn't this too low level?
+--------------------------
+
+I think this way of writing and decomposing your neural network is
+efficient and useful when designing new architectures. Of course, when you
+just want to run a classical SdA you shouldn't need to go through the
+trouble of writing all that. I think we should have Macros for this.
+
+* Macro -- syntactically it looks just like a transform (i.e. a python
+function), only that it actually applies multiple transforms to the input
+and might return several nodes (not just one).
+Example:
+
+
+learner, prediction, pretraining_learners = SdA(
+              input   = data_x,
+              target  = data_y,
+              hiddens = [200,200],
+              noises  = [0.1,0.1])
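+
+To make the idea concrete, here is a rough sketch of how such a macro could
+be written in terms of the transforms used in the SdA example above. The
+body and the signature are only an illustration, not a decided interface
+(the 10 output units are hard-coded exactly as in that example):
+
+.. code-block:: python
+
+    def SdA(input, target, hiddens, noises):
+        # sketch: build a stack of denoising autoencoders layer by layer,
+        # as in the hand-written SdA example, plus a supervised top layer
+        pretraining_learners = []
+        h = input
+        for n_units, noise in zip(hiddens, noises):
+            noisy_h    = gaussian_noise(h, amount = noise)
+            new_h      = tanh(dotW_b(h, n_units = n_units))
+            reconstr   = reconstruct(new_h.replace({h: noisy_h}), noisy_h)
+            layer_err  = cross_entropy(reconstr, h)
+            pretraining_learners += [ SGD(layer_err) ]
+            h = new_h
+        prediction = sigmoid(dotW_b(h, n_units = 10))
+        err        = cross_entropy(prediction, target)
+        learner    = SGD(err)
+        return learner, prediction, pretraining_learners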
+
+
+How do you deal with loops?
+----------------------------
 
-    sch.run()
+When implementing architectures you sometimes need to loop, as for
+RNNs or for CD, PCD etc. Adding loops in such a scheme is always hard.
+I borrowed the idea in the code below from PyBrain. You first construct
+a shell layer, the recurrent layer. Then you define its
+functionality by giving the initial value and the recurrent step.
+For example:
+
+.. code-block:: python
 
+'''
+    # sketch of writing an RNN
+    x = load_mnist()
+    y = recurrent_layer()
+    y.value = tanh(dotW(x, n=50) + dotW(y.tm(1),50))
+    y.t0 = zeros( (50,))
+    out = dotW(y,10)
+
+
+    # sketch of writing CDk starting from x
+    x       = recurrent_layer()
+    x.t0    = input_values
+    h       = binomial_sample( sigmoid( dotW_b(x.tm(1))))
+    x.value = binomial_sample( sigmoid( reconstruct(h, x.tm(1))))
+    ## the assumption is that the inverse of sigmoid is the identity fn
+    pseudo_cost = free_energy(x.tp(k)) - free_energy(x.t0)
 
 
 '''
-Example : Logistic regression
-=============================
 
-Task description
-----------------
+How do I deal with constraints?
+--------------------------------
 
-Apply a logistic regression network to some dataset. Use early stopping.
-Save the weights everytime a new best score is obtained. Print trainnig score 
-after each epoch.
+Use the add_constraints transform. You would pass it the constraint (itself
+a transform) together with initial values for its hyper-parameters (a
+hypothetical usage sketch follows).
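+
+As a hypothetical usage sketch (the exact signature of add_constraints is
+not settled; L1 and its coeff hyper-parameter are taken from the DBN
+example above):
+
+.. code-block:: python
+
+    h1 = tanh(dotW_b(data_x, n_units = 200))
+    # attach an L1 constraint to the layer; coeff is the constraint's
+    # hyper-parameter and 0.1 its initial value
+    h1 = add_constraints(h1, L1(coeff = 0.1))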
 
 
-Possible script
----------------
-
-Notes : This would look the same for any other architecture that does not
-imply pre-training ( i.e. deep networks). For example the mlp.
-
-.. code-block::
-'''
-
-sched = Schedular()
-
-# Data / Model Building : 
-# I skiped over how to design this part
-# though I have some ideas
-real_train_data, real_valid_data = load_mnist()
-model = logreg()
+How do I deal with other types of networks?
+--------------------------------------------
 
-# Main Plugins ( already provided in the library ); 
-# This wrappers also registers the plugin
-train_data = create_data_plugin( sched, data = real_train_data)
-train_model    = create_train_model(sched, model = model)
-validate_model = create_valid_model(sched, model = model, data = valid_data)
-early_stopper  = create_early_stopper(sched)
-
+Use opaque transforms, i.e. transforms whose internals are not expressed
+in this graph language:
 
-# On the fly plugins ( print random stuff); the main difference from my 
-# FnPlugin from Olivier's version is that it also register the plugin in sched
-@FnPlugin(sched)
-def print_error(self, event):
-    if event.type == Event('begin'):
-        self.value = []
-    elif event.type == train_model.error():
-        self.value += [event.value]
-    else event.type == train_data.eod():
-        print 'Error :', numpy.mean(self.value)
-
-@FnPlugin(sched)
-def save_model(self, event):
-    if event.type == early_stopper.new_best_error():
-        cPickle.dump(model.parameters(), open('best_params.pkl','wb'))
+new_data = PCA(data_x)
 
 
-# Create the dependency graph describing what does what
-train_data.act( on = sched.begin(), when = once() )
-train_data.act( on = Event('batch'),
-train_data.act( on = train_model.done(), when = always())
-train_model.act(on = train_data.batch(), when = always())
-validate_model.act(on = train_model.done(), when = every(n=10000))
-early_stopper.act(on = validate_model.error(), when = always())
-print_error.act( on = train_model.error(), when = always() )
-print_error.act( on = train_data.eod(), when = always() )
-save_model.act( on = eraly_stopper.new_best_errot(), when = always() )
+svm_predictions = SVM(data_x)
+svm_learner     = SVM_learner(svm_predictions)
+# Note that for the SVM this might be just syntactic sugar; we have the two
+# steps because we expect different interfaces for these nodes
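+
+As a hedged sketch of what an opaque transform could look like (the class
+and my_pca_function below are made up for illustration; nothing here is
+settled), a black-box function can be wrapped into a data node that simply
+applies it to whatever its input node produces:
+
+.. code-block:: python
+
+    class OpaqueNode(object):
+        # sketch: a data node with no theano expression and no inverse;
+        # it only knows how to apply a black-box function to its input
+        def __init__(self, input_node, fn):
+            self.input_node = input_node
+            self.fn         = fn
+        def __iter__(self):
+            for minibatch in self.input_node:
+                yield self.fn(minibatch)
+
+    def PCA(node):
+        # my_pca_function is assumed to exist somewhere; it is opaque to
+        # the graph, so it cannot be inverted or differentiated through
+        return OpaqueNode(node, my_pca_function)
+
+    new_data = PCA(data_x)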
 
-# Run the entire thing
-sched.run()
 
 
-'''
-Notes
-=====
-
- * I think we should have a FnPlugin decorator ( exactly like Olivier's) just
- that also attaches the new created plugin to the schedule. This way you 
- can create plugin on the fly ( as long as they are simple functions that
- print stuff, or compute simple statitics ).
- * I added a method act to a Plugin. You use that to create the dependency
- graph ( it could also be named listen to be more plugin like interface)
- * Plugins are obtained in 3 ways  :
-     - by wrapping a dataset / model or something similar
-     - by a function that constructs it from nothing
-     - by decorating a function
-   In all cases I would suggest then when creating them you should provide
-   the schedular as well, and the constructor also registers the plugin
+Step 1.5
+========
 
- * The plugin concept works well as long as the plugins are a bit towards
- heavy duty computation, disregarding printing plugins and such. If you have
- many small plugins this system might only introduce an overhead. I would 
- argue that using theano is restricted to each plugin. Therefore I would
- strongly suggest that the architecture to be done outside the schedular
- with a different approach.
-
- * I would suggest that the framework to be used only for the training loop
- (after you get the adapt function, compute error function) so is more about
- the meta-learner, hyper-learner learner level.
+There is a wrapper function called plugin_wrapper (used in the Step 2
+example above). Once you call it on any of the previous nodes you get back
+a plugin that follows a certain set of conventions (sketched below).
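+
+As a sketch of the conventions I have in mind (nothing here is settled; the
+method and message names are the ones already used in the Step 2 example
+above), wrapping a node would behave roughly like this:
+
+.. code-block:: python
+
+    learner_plugin = plugin_wrapper(learner, sched = early_stopper)
+
+    # on Message('init') the wrapped node can re-read its hyper-parameters
+    learner_plugin.listen(Message('init'), update_hyperparameters)
+
+    # each time it acts it advances the node's iterator by one step
+    # (one minibatch / one update); it then fires a value message,
+    # available as learner_plugin.value(), and it fires an end-of-dataset
+    # message, available as learner_plugin.eod(), when the underlying
+    # dataset is exhausted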
 
- * A general remark that I guess everyone will agree on. We should make 
- sure that implementing a new plugin is as easy/simple as possible. We 
- have to hide all the complexity in the schedular ( it is the part of the 
- code we will not need or we would rarely need to work on). 
-
- * I have not went into how to implement the different components, but 
- following Olivier's code I think that part would be more or less straight
- forward. 
-
- '''
-
-
-'''
+'''