view doc/v2_planning/layer_RP.txt @ 1284:1817485d586d

mcRBM - many changes incl. adding support for pooling matrix
author James Bergstra <bergstrj@iro.umontreal.ca>
date Wed, 15 Sep 2010 17:49:21 -0400
parents 32fc5f442dde
children
line wrap: on
line source

===============
Layer committee
===============

Members : RP, XG, AB, DWF

Proposal (RP)
=============

 You construct your neural network by constructing a graph of connections
 between "layers" starting from data. While you construct the graph,
 different theano formulas are put together to construct your model.

 The idea would be that you need to describe exactly what you would draw
 on the board if you are asked to draw the architecture. This would be of 
 course optional ( you will get macros that will return this graph
 automatically for a well defined case). Things that are not neural networks, 
 and you wouldn't have any structure to draw are just a box. For example a 
 SVM, or PCA. This in case you want to connect their output to your network.

 Hard details are not set yet, but all members of the committee agreed
 that this sound as a good idea.


Example Code (RP):
------------------

 # Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

 h1   = sigmoid(dotW_b(train_x, n = 300))
 rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood)


 h2 = sigmoid(dotW_b(h1, n = 300))
 rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost= pseudolikelihood)

 out = sigmoid( dotW_b(h2, n= 10))

 train_err = cross_entropy( out, train_y)

 grads   = grad( train_err, err.parameters() )
 learner = SGD( err, err.parameters(), grads)
 
 valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
 test_err  = train_err.replace({ train_x : test_x , train_y : test_y})



Global observations :
---------------------

  1) Your graph can have multiple terminal nodes; in this case rbm1, 
     rbm2 and learner, valid_err, test_err are all end nodes of the graph;

  2) Any node is an "iterator", when you would call out.next() you would get
    the next prediction;  when you call err.next() you will get next error
    ( on the batch given by the data.next() ).

  3) Replace can replace any subgraph or subgraphs with other
  subgraphs/subgraph as long as : there are the same number of input units 
  and output units ( there is a 1 to 1 maping from those). I see replacing
  subgraphs as looping over the list of subgraphs to replace and call replace
  on which nothing fancier. Since nodes in my view produce the same interface
  (execpt parameter nodes and hyper-parameter nodes) this constraint is not
  hard to respect, so is up to the user to do a replace that makes sense.

  4) You can have MACROS or SUBROUTINE that already give you the graph for 
  known components ( in my  view the CDk is such a macro, but simpler 
  examples will be vanilla versions of MLP, DAA, DBN, LOGREG). After 
  Guillaume pointed out a real shortcomming of the approach I've modified
  a bit what you get from a macro .. look below.

  5) Any node has the entire graph ( though arguably you don't use that 
  graph too much). Running such a node in general will be done by compiling 
  the Theano expression up to that node( if you don't already have this
  function), and using the data object that you get initially. This theano 
  function is compiled only if you need it. You use the graph only to : 
       * update the Theano expression in case some part of the subgraph has 
         changed (hyper-parameter or a replace call)
       * collect the list of parameters of the model
       * collect the list of hyper-parameters ( my personal view - this 
         would mostly be useful for a hyper learner .. and not for day to 
         day stuff, but I think is something easy to provide and we should )
       * collect constraints on parameters ( I believe they can be represented
         in the graph as dependency links to other graphs that compute the 
         constraints..)

  6) Registering parameters and hyper-parameters to the graph is the job of 
     the transform and therefore of the user who implemented that 
     transform; the same for initializing the parameters ( so if we have 
     different ways to initialize the weight matrix that might be a 
     hyperparameter with a default value or different transforms; to ease 
     the number of such transforms you can define a transform on the fly for
     simple theano expressions )



Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues : 

Delayed or feature values
-------------------------


This is can be dropped if people think is not useful.

Sometimes you might want future values of some nodes.  For example you might 
be interested in :

y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version 
a node x is obtained by calling x.t(k) which will give you a node that has 
the value x(t+k). k can be positive or negative.
In my view this can be done as follows :
  - a node is a class that points to : 
      * a data object that feeds data
      * a theano expression up to that point
      * the entire graph that describes the model ( not Theano graph !!!)
The only thing you need to do is to change the data object to reflect the
delay ( we might need to be able to pad it with 0?). You need also to create
a copy of the theano expression ( those are "new nodes" ) in the sense that 
the starting theano tensors are different since they point to different data.



Non-theano transformation ( or function or whatever)
----------------------------------------------------

Maybe you want to do something in the middle of your graph that is not Theano
supported. Let say you have a function f which you can not write in Theano.
You want to do something like


 W1*f( W2*data + b)

I think we can support that by doing the following :
each node has a:
   * a data object that feeds data
   * a theano expression up to that point
   * the entire graph that describes the model

Let x1 = W2*data + b
up to here everything is fine ( we have a theano expression )
   dot(W2, tensor) + b,
   where tensor is provided by the data object ( plus a dict of givens 
and whatever else you need to compile the function)

When you apply f, what you do you create a node that is exactly like the 
data object in the sense that it provides a new tensor and a new dict of
givens

so x2 = W1*f( W2*data+b)
 will actually point to the expression
    dot(W1, tensor)
 and to the data node f(W2*data+b)

what this means is that you basically compile two theano functions t1 and t2
and evaluate t2(f(t1(data))). So everytime you have a non theano operation you
break the theano expression and start a new one. 

What you loose :
  - there is no optimization or anything between t1,t2 and f ( we don't
    support that)
  - if you are running things on GPU, after t1, data will be copied on CPU and
    then probably again on GPU - so it doesn't make sense anymore



Recurrent Things
----------------

I think that you can write a recurrent operation by first defining a 
graph ( the recrrent relation ):

y_tm1 = recurrent_layer(init = zeros(50))
x_t   = slice(x, t=0)
y     = loop( dotW_b(y_tm1,50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op 
to your theano expression of the result node y, it is just a different way 
of writing things .. which I think is more intuitive. 

You create your primitives which are either a recurrent_layer that should
have a initial value, or a slice of some other node ( a time slice that is).
A tims slice is a special kind of node, which we should try to force people
not to use outside of a loop. If you use it though you have some default
behaviour like for example it behaves exactly like a delayed node.
You call loop giving a expression that starts from those primitives and 
ta da, you have your recurrent expression in the graph.

Similarly you can have foldl or map or anything else.

You would use this instead of writing scan especially if the formulas are
more complicated and you want to automatically collect parameters,
hyper-parameters and so on. You could also just use the scan op and 
using a general apply command if you like that more.

Optimizer
---------

 Personally I would respect the findings of the optimization committee,
 and have the SGD to require a Node that produces some error ( which can
 be omitted) and the parameter nodes and nodes that compute gradients for 
 those paramters. For this I would also have the grad function which would 
 actually only call T.grad. 

 If you have non-theano thing in the middle? I don't have any smart 
 solution besides ignoring any parameter that it is below the first 
 non-theano node and throw a warning.

Learner
-------

 In my case I would not have a predict() and eval() method of the learner,
 but just a eval(). If you want the predictions you should use the 
 corresponding node ( before applying the error measure ). This was 
 for example **out** in my first example. Note eval() in this case is
 the same as next(). ( you might just have next for simplicity). The 
 only semantically important difference is that a call to next has now
 side-effects in the sense that the parameters are updated.

 Of course we could require learners to be special nodes that also have
 a predict output. In that case I'm not sure what the iterating behaiour
 of the node should produce.

Granularity
-----------

Guillaume nicely pointed out that this library might be an overkill.
In the sense that you have a dotW_b transform, and then you will need
a dotW_b_sparse transform and so on. Plus way of initializing each param
would result in many more transforms.

I don't have a perfect answer yet, but my argument will go as this : 

you would have transforms for the most popular option ( dotW_b) for example.
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. The formulas produced by 
the formula committee might be a rich source of such function to decorate.
More then decoratting, you can have a general apply transform that does 
something like : 

apply( lambda x,y,z: x*y+z, inputs = x, 
                            hyperparams = [(name,2)], 
                            params = [(name,theano.shared(..)])
The order of the arguments in lambda is nodes, params, hyper-params or so.
This would apply the theano expression but it will also register the 
the parameters. It is like creating a transform on the fly.
You should, or could provide names for parameters, you might need them 
later.

I think you can do such that the result of the apply is 
pickable, but not the general apply transform. What I mean is that 
the output node does not store the lambda expression but some theano 
graph (?) and it know which are the input ( and when you can replace 
them so that you link this little graph to the rest of the 
theano expression. Is just an ugly hack given that you can not save 
lambda expressions, but I'm open to other alternatives ..


What this way of doing things would buy you hopefully is that you do not 
need to worry about most of your model ( would be just a few macros) that 
will get you to the point you want to change and then you do surgery on 
that point. Compare this with hacking a class, it feels cleaner, because
you what is up to that point you want to change is sort of separated from 
what you change. Plus you could do this in your script, and you don't need
to create your local branch of the library where you hack the class, or 
duplicate the class file under a different name .. 
Once what you are doing becomes stable it can be converted in either a
different macro or a parameter to the initial macro.

** New part **

If this is not convincing enough, there is another point that I want to 
make. While creating the graph you can optionally create a model object.
I will encourage most people to do that ! This idea I had a long time ago,
but then I used a singleton class as the world which could potentially create
a lot of issues. This is a nicer version of that. 

This model class is optional but it can be extremely useful. What you do in
this model class is to store the graph, together with different annotations 
on that graph. What I would do is identify different subgraphs in the model
and register them under different names. For example if err is the node that 
points to the graph that represents a DBN, that graph will be registerd to 
a model in which I have annotated which subgraphs represent the different
rbms, which represents the logistic regression and so on. The model will also
have a list of all the input nodes and all the output nodes of the graph. 
We could potentially use this model class to control some global default 
parameters initialization or hyper-parameters. This all might sound like 
magic but is actually easy to implement.

If you have such a model, which is just some annotations on the graph, this 
approach makes it easy to change components of the graph based on their names.
For example I can replace rbm1 with a daa, because based on these annotations
I know which part is rbm1. 

Why do I feel you need such a thing? It is just because you get the DBN by 
calling a macro, and you don't have variables that point to different nodes 
of your network so that you can define where a subgraph starts or not. But
if a graph returns such a model, you can introspect what annotations you have.
There should also be standard conventions, but you could also in the
interactive shell look at :

model.annotations(depth = 2)

This would print something like : 

 'DBN'
    'rbm1'
       'hidden_layer1'
       'CDk_layer1'
    'rbm2'
       'hidden_layer2'
       'CDk_layer2'
    'logreg'
       'cross_entropy'

And then you can say 

daa1 = daa(..)
daa2 = daa(..)
new_model = model.replace('rbm1', daa1, new_name = 'daa1')
new_model = new_model.replace('rbm2', daa2, new_name = 'daa2')

and you get a SDAA.
What is the hierarhical structure ? Well, in my view if some subgrah
(annotated as S1) is part of another subgraph (annotated as S2) then 
S1 is a child of S2 in this hierarchy of annotations. If they share 
just a few nodes, but have nodes that are not shared, then they are on 
the same level. We might one a flat space for the annotations, but I think
this simple convention can get as a lot.


So macros should in general return such models. It is up to you if you want to 
ground the graph that you create in your script into a model or not. You do 
so by manually adding nodes to the model. The annotations are also manually 
done .. So this might be a bit annoying for a developer of a macro, but I
don't think is cognitively complicated, and it would help a lot when using 
the macros.

You can see how this annotation system becomes easily interesting. You can
also annotate parameters ( and it is not too overwhelming to do so when
you create the graph as well) and you can use this to sort of collect all 
parameters that you annotated in some way and then do something to them.

The way I see it is just that a transform could have an optional annotations
argument and it will add that string to all parameters and hyper-parameters.
How much sense this makes is debatable, but I strongly believe that is not 
complicated to implement ( I actually have something like this already
implemented, just that I use that single ton class, and I sort of made the 
framework work mostly for DAA by making a few poor choices).


Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't wrote down all the details of that
class. I think there should be such a wrapper around parameters and 
hyper-parameters as well. By default those wrappers might not provide
any informtion. But you can potentially add interesting information for 
"graph" aware transforms. For example you can add annotations for a find
or replace function that will collect you all parameters or hyper-parameter
so you do some common thing to all of them (when it makes sense). 

You could have a freeze property for parameters. If you change that property
the theano function (where needed) for all nodes that follow this one is 
recomputed. This argument would be used by the collecting paramters function
used to compute the gradient. If parameters are frozen they are ignored,
if not they are updated. 

For hyper-parameters you would also have a different wrapper that would
contain, possibly, the distribution of that hyper-parameters for a
hyper-learner.

I would also have the learning rate or noise_amounts as some strange 
hyper-paramter. I would say by default, if any hyper-paramter changes its
value, then the theano expressions need to be recompiled. If you are dealing
with this strange types of hyper-parameters you don't need to do that. 
This can be automatically for you and I guess it will all boil down to,
is you hyper-paramter a theano shared variable or theano tensor ? If so we 
are dealing with the second type. So this kind of stuff can be detected
automatically.

How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param 
changes ( or a param freezes) all region of the graph affected get recompiled.
This is by traversing the graph from the bottom node and re-constructing the
theano expression. Where needed this theano expression get compiled.

This function that updates / re-constructs the graph is sligthly more complex
if you have non-theano functions in the middle of the graph .. but not too 
much in my view.

replace & find
--------------

Replace, replaces a part of the graph. The way it works in my view is that
if I write : 

x = x1+x2+x3
y = x.replace({x2:x5})

You would first copy the graph that is represented by x ( the params or 
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done 
inplace !

I think these Node classes as something light-weighted, like theano variables
and creating copy is not harmful. Also params & shared variables are shared 
between these graphs. If you want new params / shared variables we can offer
a copy / deepcopy command.

Replace (given that it starts from a model) can take string(s) that indicate
specific annotations.

Find does the same ( without the copying).



If you have two things named the same in the graph you would return the first 
one in a breadth search starting from the top node. The idea is that if you
have all the weight matrices annotated as 'W' and you look for 'W' starting 
from node hiddens2, you want the W of the second layer, and not of the first.

I wold support :
model.replace( look_at , search_for , replace_with, annotate_as)
replace(model , look_at  , search_for , replace_with, annotate_as)
node.replace(model  , look_at, replace_with, annotate_as)

look_at if it is a node it reffers to the subgraph that has as a final 
node that node. I.e. all up to that point. If it is a string, you would look
at the subgraph annotated by that string.

Of course we can optionally choose not to allow things to be annotate with 
the same name, though I sort of liked it. It makes a lot of things easy. For 
a DBN I would have the annotations : 

DBN
  rbm1
    hidden
    CDk
  rbm2
    hidden
    CDk
  logreg

If I want to change the first CDk with PCD I would do 

pcd1 = PCD (..)
model.replace(look_at='rbm1', search_for='CDk', replace_with=pcd1,
                annotate_as='PCD1')


Bottom line is : 

  I think having a graph and having a way to search in that graph and replace
  parts is a very flexible and powerful way of doing things.


reconstruct
-----------

This is something nice for DAA. It is definetely not useful for the rest.
I think though that is a shame having that transformation graph and not 
being able to use it to do this. It will make life so much easier when you
do deep auto-encoders. I wouldn't put it in the core library, but I would 
have in the DAA module. For reconstruct to work you need to have inverse 
transforms for the ones you use.

The way I see it you can either have something like

# generate your inversable transforms on the fly
fn  = create_transform(lambda : , params, hyper-params )
inv = create_transform(lambda : , params, hyper-params )
my_transform = couple_transforms( forward = fn, inv = inv)

and generate special transforms on the fly that have some pseudo-inverses
when you construct the graph. Maybe you can also have spcific pre-defined
transforms for the most used cases, whith specific names. Even more I don't
see the harm of something as simple as dotW_b to have a inverse defined ( as
using tied weights) in all cases, but you would only use it for the DAA. 
It just to reduce the number of names of transforms you have, is like a
feature that doesn't hurt or help in 95% of times but it helps in 5% of times.


But this is up to debate. The only reason I bring it up is to say that the 
class that represents a transform should have a inverse method that by 
default throws an exception.


transforms
----------

In my view there will be quite a few of such standard transforms. 
This can be annoying, but I think that if we group them by 
architectures (MLP, DAA, RBM), sampler, optimizers it will be less of a mess.
This would be crucial for their documentation as well. This categories should
also come with macros. There will be though some basic transforms that 
are available at the core ( like replace, find, things related to annotating
and creating a model, collecting parameters and hyper-paramters)

I also think that we can start small by having just very few such transforms
and add them as the library grows. We don't need many of this, most are 
nice to have ..


Constraints
-----------

You can always add constraints. I think the easier to make this explicit is to 
get a hand on the parameter or ndoe on which you want to add constraint and 
do something like 

add_constraint(on_what, what)

on_what can be a node, a parameter node, a list of nodes, a list of parameter 
nodes, an annotation string, given that you provided a model, and what is a 
graph. In terms of the graph that you are creating what this does is to 
create a dependency link from your main graph to that constraint graph.
This means that the grad function that computes the grad function that 
computes the gradients with respect to parameters will also (if there are 
such dependency links) add the gradient of those parameters with respect 
to the output of that dependency graph. There are some constraints on 
what a dependency graph can be, in the sense that it should start from only
one input ( the parameters / node) and it should end in only one node that 
is a scalar.

From an implementation point of view, this can be done by just collecting a 
list of constraints cost, that will be added to the cost before calling
T.grad. But I like to think about it in terms of graph linked through
dependency links.




Some general comments
---------------------

 I think that what you get in the end is a very flexible framework, where 
 adding new things is just a matter of putting together a few transforms and 
 annotating the entire thing. Worst case scenario you would need to invent a 
 transform, which I do believe could be quite painless.

 The harder part to implement is the back-bone. It is not difficult in my
 view, mostly sligthly tideous. I had something like this implemented in a
 matter of a week, though it was a bit less restrictive. I do believe though
 that we should not oversimplify the backbone of the library just to make it 
 easy to implement, but we should rather carefully consider what you get in
 the end


Connection to the architecture committee
-----------------------------------------

 I think that if you get such iterator objects that can produce either 
 the error, or do an update step it is easy to wrap them in a plug-in, 
 or use it with the imperative language James proposed. 

 I actually have ideas ( using non theano nodes) how to break the algo at 
 points such that you can have different parts run on remote machines ..
 though we might not want to support that ( using the plug-in system .. 
 though it might work with other systems that support the same idea)

 I think it goes more natural with the imperative language that James
 proposed, because that would create a graph as well. His graph is 
 in general simpler ( it always has only one termination node) where 
 the nodes have a different interpretation (?) so I would use a different
 node class on those. But from writing the code, using some syntactic sugar
 the difference can be blurred ( do we want this ?). I think that one 
 can come up with ways of making the approaches look alike and sligtly 
 homogeneous.