view doc/v2_planning/plugin_RP.py @ 1272:ba25c6e4f55d

mcRBM working with whole learning algo in theano
author James Bergstra <bergstrj@iro.umontreal.ca>
date Sat, 04 Sep 2010 19:32:27 -0400
parents 681b5e7e3b81

'''

!!! Incomplete file .. many of the things I've set up to do are not done 
yet !!!

============
Introduction
============

What this file talks about
==========================
* Proposal for the layer committee
* Proposal of how to deal with plug-ins ( STEP 2)
* Description of how to glue the two parts
* Some personal beliefs and argumentation

The file will point out how:
* to use the APIs the other committees proposed, or why and how they
should change
* it satisfies the listed requirements (or why it doesn't)
* this approach might be better than others (or worse), to the best of
my knowledge


Motivation for writing this file
================================

I wrote this file because:
* It will probably answer most of the questions regarding my view,
minimizing the time wasted on talks
* Presenting the entire interface helps expose holes in the approach
* It is here for everybody to read (easier dissemination of information)


=======
Concept
=======

I think any experiment that we ( or anybody else ) would want to run with
our library will be composed of two steps :

* Step 1. Constructing (or choosing or initializing) the model, the
datasets, error measures, optimizers and so on (everything up to the
iterative loop). I think this step has been covered by different
committees, but is possibly glued together by the layer committee.

* Step 2. Compose the iterative loops and perform them ( this is what the
architecture committee dealt with)

I believe there is a natural way of going from *Step 1* to *Step 2* 
which would be presented as Step 1.5

Step 2
======

 I will start with step 2 ( because I think that is more of a hot subject 
 right now). I will assume you have the right plugins at hand.
 This is a DBN with early stopping and ..

.. code-block:: python
'''
data = load_mnist()
train_xy, valid_xy, test_xy = split(data, split =
                                    [(0,40000),(40000,50000),(50000,60000)])
train_x, train_y = train_xy
valid_x, valid_y = valid_xy
test_x,  test_y  = test_xy

################# CONSTRUCTING THE MODEL ###################################
############################################################################

x0 = pca(train_x)

## Layer 1:
h1       = sigmoid(dotW_b(x0,units = 200), constraint = L1( coeff = 0.1))
x1       = recurrent_layer()
x1.t0    = x0
x1.value = binomial_sample(sigmoid( reconstruct( binomial_sample(h1), x0)))
cost     = free_energy(train_x) - free_energy(x1.tp(5))
grads    = [ (g.var, T.grad(cost.var, g.var)) for g in cost.params ]
pseudo_cost = sum([ pl.sum(pl.abs(g)) for g in cost.params])
rbm1     = SGD( cost = pseudo_cost, grads = grads)

# Layer 2:
rbm2,h2    = rbm(h1, units = 200, k = 5, use= 'CD')
# Logreg
logreg,out = logreg(h2, units = 10)
train_err  = mean_over(missclassification(argmax(out), train_y))
valid_err  = train_err.replace({train_x:valid_x, train_y:valid_y})
test_err   = train_err.replace({train_x: test_x, train_y: test_y})

##########################################################################
############### Constructing the training loop ###########################

ca = Schedular()


### Constructing Modes ###
# pretrain_layer1 is assumed to be a plain mode like the others; the mode
# name 'pretrain0' is a placeholder
pretrain_layer1  = ca.mode('pretrain0')
pretrain_layer2  = ca.mode('pretrain1')
early_stopper    = ca.mode('early')
code_block       = ca.mode('code_block')
kfolds           = ca.mode('kfolds')

# Construct modes dependency graph
code_block.include([ pretrain_layer1, pretrain_layer2, early_stopper])
kfolds.include( code_block )

pretrain_layer1.act( on = code_block.begin(), when = always())
pretrain_layer2.act( on = pretrain_layer1.end(), when = always())
early_stopper.act  ( on = pretrain_layer2.end(), when = always())


# Construct counter plugin that keeps track of number of epochs
@FnPlugin
def counter(self, msg):
    # a bit of a hack.. it will look more classic if you would
    # start with a class instead
    if not hasattr(self, 'val'):
        self.val = 0

    if msg == Message('eod'):
        self.val += 1
    if self.val < 10:
        self.fire(Message('continue'))
    else:
        self.fire(Message('terminate'))


# Construct pre-training plugins
rbm1_plugin = pretrain_layer1.include(plugin_wrapper(rbm1))
rbm2_plugin = pretrain_layer2.include(plugin_wrapper(rbm2))
rbm1_counter = pretrain_layer1.include(counter)
rbm2_counter = pretrain_layer2.include(counter)

rbm1_plugin.listen(Message('init'), update_hyperparameters)
rbm1_plugin.listen(Message('continue'), dataset_restart)
rbm2_plugin.listen(Message('init'), update_hyperparameters)
rbm2_plugin.listen(Message('continue'), dataset_restart)


# Dependency graph for pre-training layer 0
rbm1_plugin.act( on = [ pretrain_layer1.begin() ,
                        rbm1_plugin.value()     ] ,
                 when = always())
rbm1_counter.act( on = rbm1_plugin.eod(), when = always() )


# Dependency graph for pre-training layer 1
rbm2_plugin.act( on = [ pretrain_layer2.begin() ,
                        rbm2_plugin.value()     ] ,
                      when = always())
pretrain_layer2.stop( on = rbm2_plugin.eod(), when = always())


# Constructing fine-tuning plugins
learner = early_stopper.include(plugin_wrapper(logreg))
validation = early_stopper.include( plugin_wrapper(valid_err))
clock = early_stopper.include( ca.generate_clock())
early_stopper_plugin = early_stopper.include( early_stopper_plugin)


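import cPickle

def save_model(plugin):
    # cPickle.dump expects an open file object, not a file name
    with open('just_the_model.pkl', 'wb') as f:
        cPickle.dump(plugin.object, f)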

learner.listen(Message('init'), update_hyperparameters)
validation.listen(Message('init'), update_hyperparameters)
validation.listen(early_stopper_plugin.new_best_score(), save_model)

learner.act( on = early_stopper.begin(), when = always())
learner.act( on = learner.value(), when = always())
validation.act( on = clock.hour(), when = every(n = 1))
early_stopper_plugin.act( on = validation.value(), when = always())

@FnPlugin
def kfolds_plugin(self,event):
    if not hasattr(self, 'n'):
        self.n = -1
        self.splits = [ [ (    0,40000),(40000,50000),(50000,60000) ],
                        [ (10000,50000),(50000,60000),(    0,10000) ],
                        [ (20000,60000),(    0,10000),(10000,20000) ] ]
    # stop before indexing past the last split
    if self.n < len(self.splits) - 1:
        self.n += 1
        msg = Message('new split')
        msg.data = (data.get_hyperparameter('split'),self.splits[self.n])
        self.fire(msg)
    else:
        self.fire(Message('terminate'))


kfolds.include(kfolds_plugin)
kfolds_plugin.act([kfolds.begin(), Message('new split')], when = always())
kfolds_plugin.act(code_block.end(), always() )
code_block.act(Message('new split'), always() )

ca.include(kfolds)

ca.run()

'''



Notes:
    When a mode is registered to begin on a certain message, it will
rebroadcast that message when it starts, changing only the message
type from whatever it was to 'init'. It will also send all the 'init'
messages of the mode in which it is included (or of the scheduler).

    One might be able to shorten this by having macros that create modes
    and automatically register certain plugins to them; you can always
    add plugins to any mode afterwards.
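
The rebroadcasting rule in the first note can be sketched in plain Python.
This is only a minimal sketch, not the proposed implementation: the
``Message`` and ``Mode`` classes, the ``data`` attribute, the ``start``
method and the order in which the 'init' messages are delivered are all
assumptions made for illustration.

```python
class Message(object):
    """Hypothetical message: a type tag plus an optional payload."""
    def __init__(self, type, data=None):
        self.type = type
        self.data = data


class Mode(object):
    """Hypothetical mode that forwards 'init' messages to its listeners."""
    def __init__(self, name, parent_inits=()):
        self.name = name
        # 'init' messages inherited from the enclosing mode (or scheduler)
        self.parent_inits = list(parent_inits)
        self.listeners = []

    def listen(self, callback):
        self.listeners.append(callback)

    def start(self, trigger):
        # Rebroadcast the triggering message with only its type changed.
        own_init = Message('init', trigger.data)
        for msg in self.parent_inits + [own_init]:
            for callback in self.listeners:
                callback(msg)


received = []
mode = Mode('code_block', parent_inits=[Message('init', {'lr': 0.1})])
mode.listen(lambda msg: received.append((msg.type, msg.data)))
mode.start(Message('new split', {'split': (0, 40000)}))
```

In this sketch a plugin listening for 'init' sees both the inherited
message and the payload of the 'new split' message that started the mode.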



Step 1
======


You start with the dataset, which you construct as the dataset committee
proposed. You continue constructing your model by applying
transformations, more or less like you would in Theano. When constructing
your model you also get a graph "behind the scenes". Note though that
this graph is totally different from the one Theano would create!
Let's start with an example:

.. code-block:: python

'''
    data_x, data_y = GPU_transform(load_mnist())
    output         = sigmoid(dotW_b(data_x,10))
    err            = cross_entropy(output, data_y)
    learner        = SGD(err)
'''

This shows how to create the learner behind the logistic regression,
but not the function that will compute the validation error or the test
error (or any other statistics). Before going into the details of what
all those transforms (or the results of applying one) mean, here
is another partial example, for an SdA:

.. code-block:: python

'''
    ## Layer 1:

    data_x,data_y = GPU_transform(load_mnist())
    noisy_data_x  = gaussian_noise(data_x, amount = 0.1)
    hidden1       = tanh(dotW_b(data_x, n_units = 200))
    reconstruct1  = reconstruct(hidden1.replace({data_x: noisy_data_x}),
                            noisy_data_x)
    err1          = cross_entropy(reconstruct1, data_x)
    learner1      = SGD(err1)

    # Layer 2 :
    noisy_hidden1 = gaussian_noise(hidden1, amount = 0.1)
    hidden2       = tanh(dotW_b(hidden1, n_units = 200))
    reconstruct2  = reconstruct(hidden2.replace({hidden1: noisy_hidden1}),
                            noisy_hidden1)
    err2          = cross_entropy(reconstruct2, hidden1)
    learner2      = SGD(err2)

    # Top layer:

    output  = sigmoid(dotW_b(hidden2, n_units = 10))
    err     = cross_entropy(output, data_y)
    learner = SGD(err)

'''

What's going on here?
---------------------

By calling different "transforms" (we could call them ops or functions)
you decide what the architecture does. What you get back from applying
any of these transforms are nodes. There are different types of nodes
(which I will enumerate a bit later) but they all offer a basic interface.
That interface is the dataset API plus a few more methods and/or attributes.
There are also a few transforms that work on the graph, which I think will
be pretty useful:

* .replace(dict) -> method; replaces the subgraphs given as keys with 
                    the ones given as values; throws an exception if it
                    is impossible

* replace(nodes, dict) -> function; calls replace on all the given nodes
                          with that dictionary

* reconstruct(dict) -> transform; tries to reconstruct the nodes given as
                       keys starting from the nodes given as values by 
                       going through the inverse of all transforms that 
                       are in between

* .tm, .tp    -> methods; returns nodes that correspond to the value 
                 at t-k or t+k 
* recurrent_layer -> function; creates a special type of node that is 
                     recurrent; the node has two important attributes that
                     need to be specified before calling the node iterator;
                     those attributes are .t0 which represents the initial 
                     value and .value which should describe the recurrent
                     relation
* add_constraints -> transform; adds a constraint to a given node
* data_listener -> function; creates a special node that listens for 
                   messages to get data; it should be used to decompose
                   the architecture in modules that can run on different
                   machines

* switch(hyperparam, dict) -> transform; a lazy switch that allows you
                    to construct the graph according to hyper-parameters

* get_hyperparameter(name) -> method; given a name it will return the first node
                    starting from top that is a hyper parameter and has 
                    that name
* get_parameter(name)  -> method; given a name it will return the first node 
                    starting from top that is a parameter and has that
                    name
* get_hyperparameters()
* get_parameters()




Because every node provides the dataset API it means you can iterate over
any of the nodes. They will produce the original dataset transformed up
to that point.
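
A minimal sketch of what "every node is iterable" and ``.replace(dict)``
could mean, with plain Python lists standing in for datasets. The ``Node``
class, its ``fn``, ``inputs`` and ``data`` attributes and the
``apply_transform`` helper are hypothetical names; the proposal does not
fix any of them, and the sketch leaves out the exception that ``replace``
should raise when a substitution is impossible.

```python
import math


class Node(object):
    """Hypothetical node: a source dataset or a transform of other nodes."""
    def __init__(self, fn=None, inputs=(), data=None):
        self.fn = fn
        self.inputs = tuple(inputs)
        self.data = data

    def __iter__(self):
        if self.fn is None:
            # source node: yield the raw examples
            return iter(self.data)
        # transform node: pull one example from each input and apply fn
        return (self.fn(*args) for args in zip(*self.inputs))

    def replace(self, mapping):
        """Return a graph in which the key nodes are swapped for the values."""
        if self in mapping:
            return mapping[self]
        if self.fn is None:
            return self
        return Node(self.fn, [i.replace(mapping) for i in self.inputs])


def apply_transform(fn, *inputs):
    return Node(fn, inputs)


train_x = Node(data=[0.0, 1.0, 2.0])
valid_x = Node(data=[10.0])
scaled = apply_transform(lambda v: 2 * v, train_x)
out = apply_transform(math.tanh, scaled)

train_out = list(out)                               # iterate the training graph
valid_out = list(out.replace({train_x: valid_x}))   # same graph, other data
```

The last line mirrors how ``valid_err`` is derived from ``train_err`` in
the Step 2 example: the graph is written once and re-pointed at other data.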

** NOTES **
1. This is not like a symbolic graph. When adding a transform
you can get a warning right away. This is because you start from
the dataset and you always have access to some data. Sometimes, though,
you would want the nodes to be lazy, i.e. not to compute anything
until the graph is complete.

2. You can still have complex Theano expressions. Each node has a 
theano variable describing the graph up to that point + optionally
a compiled function over which you can iterate. We can use some 
on_demand mechanism to compile when needed.
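
The on-demand mechanism in note 2 could be as simple as caching the
compiled function on first use. The sketch below uses a plain Python
callable in place of a compiled Theano function; ``LazyNode``, ``build``
and ``compile_count`` are hypothetical names used only for illustration.

```python
class LazyNode(object):
    """Hypothetical node that compiles its function only when first needed."""
    def __init__(self, build, source):
        self.build = build        # returns the (expensive) compiled function
        self.source = source
        self._compiled = None
        self.compile_count = 0    # how many times build() was actually called

    def _fn(self):
        if self._compiled is None:
            self._compiled = self.build()
            self.compile_count += 1
        return self._compiled

    def __iter__(self):
        fn = self._fn()
        for example in self.source:
            yield fn(example)


node = LazyNode(build=lambda: (lambda v: v + 1), source=[1, 2, 3])
count_before = node.compile_count   # nothing is compiled at construction
first = list(node)
second = list(node)                 # reuses the cached function
```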

What types of nodes do you have
--------------------------------

Note that this distinction is more or less semantic and not necessarily
syntactic. It is just meant to help in understanding the graph.


* Data Nodes         -- datasets are such nodes; the result of any 
                simple transform is also a data node ( like the result
                of a sigmoid, or dotW_b)
* Learner Nodes      --  they are the same as data nodes, with the 
                difference that they have side effects on the model;
                they update the weights
* Apply Nodes        -- they are used to connect input variables to 
                the transformation/op node and output nodes
* Dependency Nodes   -- very similar to apply nodes just that they connect
                constraints subgraphs to a model graph
* Parameter Nodes    -- when iterating over them they will only output
                the values of the parameters;
* Hyper-parameter Nodes -- very similar to parameter nodes; this is a 
                semantic difference (they are not updated by any
                learner node)
* Transform Nodes       -- these nodes describe the mathematical function
                and, if there is one, the inverse of that transform; there
                would usually be two types of transforms: ones that use
                theano and ones that do not -- this is because those that
                do can be composed

Each node is lazy, in the sense that unless you try to iterate on it, it 
will not try to compute the next value.
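
To make the difference between a data node and a learner node concrete,
here is a small sketch in plain Python. The ``Parameter`` and ``Learner``
classes and the one-weight regression problem are assumptions chosen for
illustration; only the idea that iterating a learner updates the model
comes from the description above.

```python
class Parameter(object):
    """Hypothetical parameter node: iterating yields its current value."""
    def __init__(self, value):
        self.value = value

    def __iter__(self):
        while True:
            yield self.value


class Learner(object):
    """Hypothetical learner node: a data node with a side effect."""
    def __init__(self, param, data, lr=0.1):
        self.param = param
        self.data = data
        self.lr = lr

    def __iter__(self):
        for x, target in self.data:
            err = self.param.value * x - target
            # side effect: one SGD step on the squared error 0.5 * err ** 2
            self.param.value -= self.lr * err * x
            yield err


w = Parameter(0.0)
learner = Learner(w, data=[(1.0, 2.0)] * 50)
steps = iter(learner)
before = w.value          # building the iterator has not touched the model
first_err = next(steps)   # the first update happens only now
```

Because the node is lazy, the weight changes only when a value is
requested from the iterator.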


Isn't this too low level ?
--------------------------

I think this way of writing and decomposing your neural network is 
efficient and useful when writing such networks. Of course when you
just want to run a classical SdA you shouldn't need to go through the 
trouble of writing all that. I think we should have Macros for this.

* Macro -- syntactically it looks just like a transform (i.e. a python
function) only that it actually applies multiple transforms to the input
and might return several nodes (not just one).
Example:


learner, prediction, pretraining_learners = SdA(
              input   = data_x,
              target  = data_y,
              hiddens = [200,200],
              noises  = [0.1,0.1])
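
A sketch of how such a macro could be written as an ordinary Python
function. The transforms below are stand-ins that only record what they
were applied to (they compute nothing), and the ``n_out`` argument and the
``'denoising_cost'`` tag are assumptions added for the example.

```python
# Stand-in transforms: each just records what it was applied to.
def dotW_b(x, n_units):
    return ('dotW_b', x, n_units)

def tanh(x):
    return ('tanh', x)

def sigmoid(x):
    return ('sigmoid', x)

def SGD(cost):
    return ('SGD', cost)


def SdA(input, target, hiddens, noises, n_out=10):
    """Hypothetical macro: several transforms in, several nodes out."""
    pretraining_learners = []
    layer_in = input
    for n_units, noise in zip(hiddens, noises):
        hidden = tanh(dotW_b(layer_in, n_units))
        # one denoising cost, and one learner, per hidden layer
        cost = ('denoising_cost', hidden, layer_in, noise)
        pretraining_learners.append(SGD(cost))
        layer_in = hidden
    prediction = sigmoid(dotW_b(layer_in, n_out))
    learner = SGD(('cross_entropy', prediction, target))
    return learner, prediction, pretraining_learners


learner, prediction, pretraining_learners = SdA(
    input='data_x', target='data_y', hiddens=[200, 200], noises=[0.1, 0.1])
```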


How do you deal with loops ?
----------------------------

When implementing architectures you sometimes need loops, as for
RNN, CD, PCD etc. Adding loops in such a scheme is always hard.
I borrowed the idea in the code below from PyBrain. You first construct
a shell layer that you call a recurrent layer. Then you define the 
functionality by giving the initial value and the recurrent step.
For example:

.. code-block:: python

'''
    # sketch of writing a RNN
    x = load_mnist()
    y = recurrent_layer()
    y.value = tanh(dotW(x, n=50).t(0) + dotW(y.t(-1),50))
    y.t0 = zeros( (50,))
    out = dotW(y,10)


    # sketch of writing CDk starting from x
    x       = recurrent_layer()
    x.t0    = input_values
    h       = binomial_sample( sigmoid( dotW_b(x.tm(1))))
    x.value = binomial_sample( sigmoid( reconstruct(h, x.tm(1))))
    ## the assumption is that the inverse of sigmoid is the identity fn
    pseudo_cost = free_energy(x.tp(k)) - free_energy(x.t0)


'''
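
The semantics of ``.t0`` and ``.tp(k)`` can be illustrated with a few
lines of plain Python. This is a simplified sketch: here the recurrent
relation is an ordinary function stored in a hypothetical ``step``
attribute, whereas in the proposal ``.value`` is a graph expression
written in terms of ``.tm(1)``.

```python
class RecurrentLayer(object):
    """Hypothetical recurrent node defined by .t0 and a step function."""
    def __init__(self):
        self.t0 = None      # initial value
        self.step = None    # maps the value at t-1 to the value at t

    def tp(self, k):
        """Value k steps after t0, obtained by applying the step k times."""
        value = self.t0
        for _ in range(k):
            value = self.step(value)
        return value


x = RecurrentLayer()
x.t0 = 1.0
x.step = lambda prev: 0.5 * prev
after_three = x.tp(3)   # 1.0 halved three times
```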

How do I deal with constraints ?
--------------------------------

Use the add_constraints transform. You are required to pass a transform
with its hyper-parameters' initial values ?


How do I deal with other types of networks ?
--------------------------------------------

(opaque transforms)

new_data = PCA(data_x)


svm_predictions = SVM(data_x)
svm_learner  = SVM_learner(svm_predictions)
# Note that for the SVM this might be just syntactic sugar; we have the two
# steps because we expect different interfaces for these nodes



Step 1.5
========

There is a wrapper function called plugin (the Step 2 example calls it
plugin_wrapper). Once you call plugin over any of the previous nodes
you will get a plugin that follows a certain set of conventions.
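
Since the conventions are not spelled out yet, the following is only a
guess at what such a wrapper might do, based on the ``value()`` and
``eod()`` events used in the Step 2 example. ``NodePlugin``, its ``fired``
list and the body of ``plugin`` are hypothetical.

```python
class NodePlugin(object):
    """Hypothetical plugin: each act() pulls one value from the wrapped node."""
    def __init__(self, node):
        self.node = node
        self.iterator = iter(node)
        self.fired = []          # stands in for messages sent to the scheduler

    def fire(self, kind, data=None):
        self.fired.append((kind, data))

    def act(self):
        try:
            self.fire('value', next(self.iterator))
        except StopIteration:
            # the wrapped node has no more data for this pass
            self.fire('eod')


def plugin(node):
    return NodePlugin(node)


p = plugin([0.3, 0.7])
for _ in range(3):
    p.act()
```

Any iterable works as the wrapped node here, which matches the idea that
every node exposes the dataset API.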

'''