view doc/v2_planning/layer_RP.txt @ 1231:5ef96142492b

some typos
author Razvan Pascanu <r.pascanu@gmail.com>
date Wed, 22 Sep 2010 20:17:35 -0400
parents 515033d4d3bf
children 32fc5f442dde

===============
Layer committee
===============

Members : RP, XG, AB, DWF

Proposal (RP)
=============

 You construct your neural network by constructing a graph of connections
 between layers starting from data. While you construct the graph,
 different theano formulas are put together to construct your model.

 Hard details are not set yet, but all members of the committee agreed
 that this sounds like a good idea.


Example Code (RP):
------------------

 # Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

 h1   = sigmoid(dotW_b(train_x, n = 300))
 rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood)

 h2 = sigmoid(dotW_b(h1, n = 300))
 rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost= pseudolikelihood)

 out = sigmoid( dotW_b(h2, n= 10))

 train_err = cross_entropy( out, train_y)

 grads   = grad( train_err, train_err.parameters() )
 learner = SGD( train_err, grads)
 
 valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
 test_err  = train_err.replace({ train_x : test_x , train_y : test_y})



Global observations :
---------------------

  1) Your graph can have multiple terminal nodes; in this case rbm1, 
     rbm2 and learner, valid_err, test_err are all end nodes of the graph; 

  2) Any node is an "iterator": when you call out.next() you get the
    next prediction; when you call train_err.next() you get the next error
    ( on the batch given by data.next() ).

  3) Replace can replace any subgraph

  4) You can have MACROS or SUBROUTINES that already give you the graph for 
  known components ( in my view CDk is such a macro, but simpler 
  examples would be vanilla versions of MLP, DAA, DBN, LOGREG)

  5) Any node has access to the entire graph ( though arguably you don't use
  that graph too much). Running such a node will in general be done by
  compiling the Theano expression up to that node ( if you don't already have
  this function), and using the data object that you get initially. This
  Theano function is compiled only if you need it. You use the graph only to :
       * update the Theano expression in case some part of the subgraph has 
         changed (hyper-parameter or a replace call)
       * collect the list of parameters of the model
       * collect the list of hyper-parameters ( my personal view - this 
         would mostly be useful for a hyper learner .. and not for day to 
         day stuff, but I think it is something easy to provide and we should )
       * collect constraints on parameters ( I believe they can be represented
         in the graph as dependency links to other graphs that compute the 
         constraints..)

  6) Registering parameters and hyper-parameters to the graph is the job of 
     the transform and therefore of the user who implemented that 
     transform; the same goes for initializing the parameters ( so if we have 
     different ways to initialize the weight matrix, that choice might be a 
     hyperparameter with a default value)



Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues : 

Delayed or future values
------------------------

Sometimes you might want future or past values of some nodes. For example you
might be interested in :

y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version 
of a node x is obtained by calling x.t(k), which will give you a node that has 
the value x(t+k). k can be positive or negative.
In my view this can be done as follows :
  - a node is a class that points to : 
      * a data object that feeds data
      * a theano expression up to that point
      * the entire graph that describes the model ( not Theano graph !!!)
The only thing you need to do is to change the data object to reflect the
delay ( we might need to be able to pad it with 0?). You need also to create
a copy of the theano expression ( those are "new nodes" ) in the sense that 
the starting theano tensors are different since they point to different data.
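
A minimal sketch of the delay idea in plain Python, leaving Theano out
entirely. The class DataNode and its method at() are hypothetical names used
only for illustration; only t(k) comes from the text above. The sketch assumes
zero padding at the boundaries, which is still an open question here.

```python
class DataNode:
    """Hypothetical node wrapping a finite sequence of values."""

    def __init__(self, values, pad=0):
        self.values = list(values)
        self.pad = pad  # value returned outside the recorded range

    def at(self, t):
        # Out-of-range time steps fall back to the padding value.
        if 0 <= t < len(self.values):
            return self.values[t]
        return self.pad

    def t(self, k):
        # New node whose value at time t is this node's value at time t + k.
        n = len(self.values)
        return DataNode([self.at(i + k) for i in range(n)], pad=self.pad)


x = DataNode([1, 4, 9, 16])
x_prev = x.t(-1)  # x(t-1), padded with 0 at t = 0
y = [x.at(i) - x_prev.at(i) for i in range(4)]  # y(t) = x(t) - x(t-1)
```

Note that t(k) returns a new node and leaves x untouched, which matches the
remark that the delayed expression consists of "new nodes".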



Non-theano transformation ( or function or whatever)
----------------------------------------------------

Maybe you want to do something in the middle of your graph that is not
supported by Theano. Let's say you have a function f which you cannot write
in Theano. You want to do something like


 W1*f( W2*data + b)

I think we can support that by doing the following :
each node has :
   * a data object that feeds data
   * a theano expression up to that point
   * the entire graph that describes the model

Let x1 = W2*data + b
up to here everything is fine ( we have a theano expression )
   dot(W2, tensor) + b,
   where tensor is provided by the data object ( plus a dict of givens 
and whatever else you need to compile the function)

When you apply f, you create a node that behaves exactly like the 
data object, in the sense that it provides a new tensor and a new dict of
givens

so x2 = W1*f( W2*data+b)
 will actually point to the expression
    dot(W1, tensor)
 and to the data node f(W2*data+b)

What this means is that you basically compile two theano functions t1 and t2
and evaluate t2(f(t1(data))). So every time you have a non-theano operation you
break the theano expression and start a new one. 

What you lose :
  - there is no optimization or anything between t1,t2 and f ( we don't
    support that)
  - if you are running things on GPU, after t1, data will be copied on CPU and
    then probably again on GPU - so it doesn't make sense anymore
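
The t2(f(t1(data))) composition can be sketched as follows. This is only an
illustration under stated assumptions: compile_affine is a hypothetical
stand-in for compiling a Theano function, written in plain Python, and f is an
arbitrary Python function chosen for the example.

```python
def compile_affine(W, b=None):
    """Stand-in for compiling one Theano function: x -> W x (+ b)."""
    def fn(x):
        out = [sum(w * xi for w, xi in zip(row, x)) for row in W]
        if b is not None:
            out = [o + bi for o, bi in zip(out, b)]
        return out
    return fn


def f(x):
    # Arbitrary Python code, assumed not expressible as a Theano op.
    return sorted(x)


W2, b = [[1.0, 0.0], [0.0, -1.0]], [0.5, 0.5]
W1 = [[1.0, 2.0]]

t1 = compile_affine(W2, b)  # first expression: W2*data + b
t2 = compile_affine(W1)     # second expression starts from a fresh input


def run(data):
    # The non-Theano f sits between two separately compiled functions.
    return t2(f(t1(data)))


result = run([2.0, 3.0])
```

Nothing is optimized across the boundary: t1 and t2 know nothing about each
other, which is exactly the cost listed above.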



Recurrent Things
----------------

I think that you can write a recurrent operation by first defining a 
graph ( the recurrent relation ):

y_tm1 = recurrent_layer(init = zeros(50))
x_t   = slice(x, t=0)
y     = loop( dotW_b(y_tm1,50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op 
to the theano expression of the result; it is just a different way 
of writing things .. which I think is more intuitive. 

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node ( a time slice that is).
Then you call loop giving an expression that starts from those primitives.

Similarly you can have foldl or map or anything else.

You would use this instead of writing scan especially if the formula is 
more complicated and you want to automatically collect parameters,
hyper-parameters and so on.
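
To make the semantics concrete, here is a rough plain-Python stand-in for what
loop would compute. It is not the proposed API: the signature of loop below
(step, init, xs, steps) is an assumption made for the sketch, and scalars
replace tensors.

```python
def loop(step, init, xs, steps):
    """Hypothetical stand-in for the proposed loop(): apply `step` repeatedly.

    step(y_tm1, x_t) returns y_t; the list of all y_t is returned.
    """
    y_tm1 = init
    outputs = []
    for t in range(steps):
        y_tm1 = step(y_tm1, xs[t])  # x_t plays the role of slice(x, t)
        outputs.append(y_tm1)
    return outputs


w, b = 0.5, 1.0  # scalar stand-ins for the parameters of dotW_b
ys = loop(lambda y_tm1, x_t: w * y_tm1 + b + x_t,
          init=0.0, xs=[1.0, 2.0, 3.0], steps=3)
```

In the real proposal the step expression would be a graph of nodes, so the
parameters w and b could be collected automatically instead of being closed
over by hand.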

Optimizer
---------

 Personally I would respect the findings of the optimization committee,
 and have SGD require a Node that produces some error ( which can
 be omitted) and the gradients. For this I would also have the grad
 function, which would actually only call T.grad. 

 What if you have a non-theano thing in the middle? I don't have any smart 
 solution besides ignoring any parameter that is below the first 
 non-theano node and throwing a warning.

Learner
-------

 In my case I would not have both a predict() and an eval() method on the
 learner, but just an eval(). If you want the predictions you should use the 
 corresponding node ( before applying the error measure ). This was 
 for example **out** in my first example.

 Of course we could require learners to be special nodes that also have
 a predict output. In that case I'm not sure what the iterating behaviour
 of the node should produce.

Granularity
-----------

Guillaume nicely pointed out that this library might be overkill,
in the sense that you have a dotW_b transform, and then you will need
a dotW_b_sparse transform and so on. Plus, each way of initializing a param
would result in many more transforms.

I don't have a perfect answer yet, but my argument goes like this : 

you would have transforms for the most popular options ( dotW_b for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. More than decorating, you
can have a general apply transform that does something like : 

apply( lambda x,y,z: x*y+z, inputs = x, 
                            hyperparams = [(name,2)], 
                            params = [(name,theano.shared(..))])
The order of the arguments in lambda is nodes, params, hyper-params or so.
This would apply the theano expression but it will also register the 
parameters. It is like creating a transform on the fly.

I think you can arrange things such that the result of the apply is 
picklable, but not the apply operation itself. Meaning that in the graph, the
op doesn't actually store the lambda expression but a mini theano graph.

Also names might be optional, so you can write hyperparams = [2,]
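
A minimal sketch of such an apply, in plain Python with numbers standing in
for theano variables. The Node class and its attributes are hypothetical; the
argument order (nodes, then params, then hyper-params) follows the text above.
The sketch only shows the registration bookkeeping, not pickling or
compilation.

```python
class Node:
    """Hypothetical node: a value plus everything registered upstream of it."""

    def __init__(self, value):
        self.value = value
        self.params = {}
        self.hyperparams = {}

    def parameters(self):
        return sorted(self.params)


def apply(fn, inputs, params=(), hyperparams=()):
    """Call fn(nodes..., params..., hyperparams...) and register everything."""
    if isinstance(inputs, Node):
        inputs = [inputs]
    args = ([n.value for n in inputs]
            + [v for _, v in params]
            + [v for _, v in hyperparams])
    out = Node(fn(*args))
    for n in inputs:  # inherit what upstream nodes already registered
        out.params.update(n.params)
        out.hyperparams.update(n.hyperparams)
    out.params.update(params)
    out.hyperparams.update(hyperparams)
    return out


x = Node(3.0)
h = apply(lambda x, w, scale: x * w + scale, inputs=x,
          params=[("w", 2.0)], hyperparams=[("scale", 1.0)])
out = apply(lambda h, v: h * v, inputs=h, params=[("v", 10.0)])
```

Because each result carries the registrations of its inputs, out.parameters()
sees both w and v without the user listing them again.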


What this way of doing things would hopefully buy you is that you do not 
need to worry about most of your model ( it would be just a few macros or 
subroutines).
You would do something like :

rbm1,hidden1 = rbm_layer(data,20)
rbm2,hidden2 = rbm_layer(hidden1,20)

and then the part you care about :

hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2,
                 params = theano.shared(scipy.sparse.csr_matrix(..)))

and after that you potentially still do what you did before :

err = cross_entropy(hidden3, target)
grads = grad(err, err.parameters())
...

I do agree that some of the "transforms" that I have been writing here 
and there are pretty low level, and maybe we don't need them. We might need
only somewhat higher level transforms. My hope is that for now people think
about the approach and not about all the inner details ( like what transforms
we need and so on) and see if they are comfortable with it or not.

Do we want to think in these terms? I think it is a bit better to have
a normal python class, hack it to change something, and then either add
a parameter to init or create a new version. It seems a bit more natural.




Anyhow Guillaume I'm working on a better answer :)


Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of that
class. I think there should be such a wrapper around parameters and 
hyper-parameters as well. By default those wrappers might not provide
any information. Later on, they can provide, for hyper-params for example, a
distribution. If, when inserting your hyper-param in the graph ( i.e. when
you call a given transform), you provide the distribution, then maybe a
hyperlearner could use it to sample from it.

For parameters you might define properties like freeze. It can be true or 
false. Whenever it is set to true, the param is not adapted by the optimizer.
Changing this value, like changing most hyper-params, implies recompilation
of the graph.

I would have a special class of hyper-params which don't require 
recompilation of the graph. Learning rate is an example. This info is also
given by the wrapper and by how the parameter is used.

It is up to the user and the "transform" implementer to wrap params and 
hyper-params correspondingly. But I don't think this is too complicated.
The apply function above has a default behaviour; maybe you would have 
a fourth type of argument, which is a hyper-param that doesn't require 
recompilation. We could find a nice name for it.
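
The wrappers described in this section could look roughly like the sketch
below. All names here (Param, HyperParam, needs_recompile, sample, trainable)
are hypothetical and chosen for illustration; only freeze and the idea of an
attached distribution come from the text.

```python
import random


class Param:
    """Hypothetical wrapper around a model parameter."""

    def __init__(self, name, value, freeze=False):
        self.name = name
        self.value = value
        self.freeze = freeze  # frozen params are skipped by the optimizer


class HyperParam:
    """Hypothetical wrapper around a hyper-parameter."""

    def __init__(self, name, value, distribution=None, needs_recompile=True):
        self.name = name
        self.value = value
        self.distribution = distribution  # optional callable: rng -> value
        self.needs_recompile = needs_recompile

    def sample(self, rng):
        # Without a distribution the wrapper provides no extra information.
        if self.distribution is None:
            return self.value
        return self.distribution(rng)


def trainable(params):
    # What an optimizer would be allowed to adapt.
    return [p.name for p in params if not p.freeze]


params = [Param("W1", 0.1), Param("W2", 0.2, freeze=True)]
lr = HyperParam("lr", 0.01,
                distribution=lambda rng: 10 ** rng.uniform(-4, -1),
                needs_recompile=False)  # learning rate: no recompilation
n_hidden = HyperParam("n_hidden", 300)

rng = random.Random(0)
sampled_lr = lr.sample(rng)
```

A hyperlearner would call sample() on the wrappers that carry a distribution
and fall back to the default value for the others.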


How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param 
changes ( or a param freezes), all affected regions of the graph get
recompiled. This is done by traversing the graph from the bottom node and
constructing the theano expression.

The function that updates / re-constructs the graph is slightly more complex
if you have non-theano functions in the graph ..
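
One possible way to picture the recompile-on-change behaviour is a node that
caches its compiled function and drops the cache when a hyper-param changes.
This is a single-node sketch under assumptions: build stands in for
constructing and compiling the theano expression, and compile_count exists
only to make the behaviour observable. It does not show how a change would
propagate through a larger graph.

```python
class Node:
    """Hypothetical node that compiles lazily and recompiles after a change."""

    def __init__(self, build, hyperparams):
        self.build = build  # stand-in for building/compiling the expression
        self.hyperparams = dict(hyperparams)
        self._compiled = None
        self.compile_count = 0

    def set_hyperparam(self, name, value):
        self.hyperparams[name] = value
        self._compiled = None  # mark as stale; rebuilt on next use

    def __call__(self, x):
        if self._compiled is None:
            self._compiled = self.build(self.hyperparams)
            self.compile_count += 1
        return self._compiled(x)


def build_scale(hyperparams):
    scale = hyperparams["scale"]  # value is baked in at "compile" time
    return lambda x: scale * x


node = Node(build_scale, {"scale": 2})
first = node(3)
second = node(4)  # reuses the cached function
node.set_hyperparam("scale", 10)
third = node(3)   # triggers one recompilation
```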

replace
-------

Replace replaces a part of the graph. The way it works in my view is that
if I write : 

x = x1+x2+x3
y = x.replace({x2:x5})

You would first copy the graph that is represented by x ( the params or 
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done 
inplace.

I think of these Node classes as something lightweight, like theano variables.
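
The copy-then-substitute behaviour can be sketched on a tiny expression tree.
The Node class below is hypothetical and only supports addition; leaves() is a
helper added for the example. Leaves are shared between the original and the
copy, mirroring the remark that params and hyper-params are not copied.

```python
class Node:
    """Hypothetical lightweight graph node."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = tuple(children)

    def __add__(self, other):
        return Node(children=(self, other))

    def replace(self, mapping):
        # Not in place: interior nodes are rebuilt, leaves are shared.
        if self in mapping:
            return mapping[self]
        if not self.children:
            return self
        return Node(self.name, [c.replace(mapping) for c in self.children])

    def leaves(self):
        if not self.children:
            return [self.name]
        return [name for c in self.children for name in c.leaves()]


x1, x2, x3, x5 = Node("x1"), Node("x2"), Node("x3"), Node("x5")
x = x1 + x2 + x3
y = x.replace({x2: x5})
```

After the call, x still describes x1+x2+x3 while y describes x1+x5+x3.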


reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest. 
I think though that it is a shame to have that transformation graph and not 
be able to use it to do this. It will make life so much easier when you
do deep auto-encoders. I wouldn't put it in the core library, but I would 
have it in the DAA module. The way I see it you can either have something like

# generate your invertible transforms on the fly
fn  = create_transform(lambda : , params, hyper_params )
inv = create_transform(lambda : , params, hyper_params )
my_transform = couple_transforms( forward = fn, inv = inv)

# or use one of the already widely used transforms of this kind from the daa submodule.
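
As a rough illustration of why coupled transforms help, the sketch below walks
a stack of them forward and then backward to produce a reconstruction. The
dict returned by couple_transforms and the helper reconstruct are assumptions
of this sketch. The toy transforms are exact inverses of each other, whereas
in a real auto-encoder the inverse would be a learned decoder.

```python
def couple_transforms(forward, inv):
    """Hypothetical pairing of a transform with its reverse direction."""
    return {"forward": forward, "inv": inv}


def reconstruct(transforms, x):
    # Encode through every layer, then decode in reverse order.
    h = x
    for t in transforms:
        h = t["forward"](h)
    for t in reversed(transforms):
        h = t["inv"](h)
    return h


scale = couple_transforms(forward=lambda x: 2 * x, inv=lambda h: h / 2)
shift = couple_transforms(forward=lambda x: x + 1, inv=lambda h: h - 1)

x_hat = reconstruct([scale, shift], 3.0)
```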


transforms
----------

In my view there will be quite a few such standard transforms. They
can be grouped by architecture, basic, sampler, optimizer and so on. 

We do not need to provide all of them, just the ones we need. Research
on an architecture would actually lead to creating new transforms of this kind
in the library.

There will definitely be a list of basic transforms of this kind in the
beginning, like : 
  replace, 
  search, 
  get_param(name)
  get_params(..)

You can and should have something like a switch ( that, based on a 
hyper parameter, replaces a part of a graph with another or not). This is
done by re-compiling the graph. 


Constraints
-----------

Nodes can also keep track of constraints. 

When you write 

y = add_constraint(x, sum(x**2))

y is the same node as x, except that it also links to this second graph that
computes constraints. Whenever you call grad, grad will also add to the 
cost all constraints attached to the graph.
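
A small plain-Python sketch of the bookkeeping involved, with numbers in place
of theano expressions. The Node class and total_cost are hypothetical;
total_cost only shows the quantity that grad would differentiate and does not
compute any gradient.

```python
class Node:
    """Hypothetical node: a value plus the penalty terms attached to it."""

    def __init__(self, value, constraints=()):
        self.value = value
        self.constraints = tuple(constraints)


def add_constraint(node, penalty):
    # Same value as `node`, plus one more attached penalty term.
    return Node(node.value, node.constraints + (penalty,))


def total_cost(cost, node):
    # The cost plus every constraint attached to the node.
    return cost + sum(node.constraints)


x = Node([1.0, -2.0])
y = add_constraint(x, sum(v ** 2 for v in x.value))
total = total_cost(0.5, y)
```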