diff doc/v2_planning/layer_RP.txt @ 1229:515033d4d3bf

a first draft of layer committee
author Razvan Pascanu <r.pascanu@gmail.com>
date Wed, 22 Sep 2010 19:43:24 -0400
parents
children 5ef96142492b
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/layer_RP.txt	Wed Sep 22 19:43:24 2010 -0400
@@ -0,0 +1,348 @@
+===============
+Layer committee
+===============
+
+Members : RP, XG, AB, DWF
+
+Proposal (RP)
+=============
+
+ You construct your neural network by building a graph of connections
+ between layers, starting from the data. While you construct the graph,
+ different Theano formulas are put together to build your model.
+
+ The hard details are not set yet, but all members of the committee agreed
+ that this sounds like a good idea.
+
+
+Example Code (RP):
+------------------
+
+ # Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y
+
+ h1   = sigmoid(dotW_b(train_x, n = 300))
+ rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood)
+
+ h2 = sigmoid(dotW_b(h1, n = 300))
+ rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost= pseudolikelihood)
+
+ out = sigmoid( dotW_b(h2, n= 10))
+
+ train_err = cross_entropy( out, train_y)
+
+ grads   = grad( train_err, train_err.parameters() )
+ learner = SGD( train_err, grads)
+ 
+ valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
+ test_err  = train_err.replace({ train_x : test_x , train_y : test_y})
+
+
+
+Global observations :
+---------------------
+
+  1) Your graph can have multiple terminations; in this case rbm1, rbm2, learner, valid_err and
+  test_err are all end nodes of the graph.
+
+  2) Any node is an "iterator"; when you call out.next() you get the next prediction,
+  and when you call train_err.next() you get the next error ( on the batch given by the data object ).
+
+  3) Replace can replace any subgraph
+
+  4) You can have MACROS or SUBROUTINES that already give you the graph for known components ( in my
+  view CDk is such a macro, but simpler examples would be vanilla versions of MLP, DAA, DBN, LOGREG)
+
+  5) Any node has a pointer to the graph ( though arguably you don't use that graph that much). Running
+  such a node will in general be done by compiling the Theano expression up to that node, and using the
+  data object that you got initially. This theano function is compiled lazily, in the sense that it is
+  compiled only when you try to iterate through the node ( see the sketch after this list). You use the graph only to :
+       * update the Theano expression in case some part of the subgraph has been changed
+       * collect the list of parameters of the model
+       * collect the list of hyper-parameters ( my personal view - this would mostly be useful for a
+       hyper-learner .. and not on a day to day basis, but I think it is something easy to provide and we should)
+       * collect constraints on parameters ( I believe they can be inserted in the graph .. things like L1
+       and so on )
+
+  6) Registering parameters and hyper-parameters to the graph is the job of the transform, and therefore
+  of the user who implemented that transform; the same goes for initializing the parameters ( so if we have
+  different ways to initialize the weight matrix, the choice should be a hyper-parameter with a default value)
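+
+  A minimal sketch of what such a node wrapper could look like ( all class,
+  attribute and method names here are hypothetical, just to make observation 5
+  concrete) :
+
+      import theano
+
+      class Node(object):
+          """Wrapper around a theano expression plus the model graph."""
+          def __init__(self, data, expr, graph):
+              self.data  = data    # data object that feeds values in
+              self.expr  = expr    # theano expression up to this node
+              self.graph = graph   # the model graph ( not the Theano graph ! )
+              self._fn   = None    # compiled lazily
+
+          def next(self):
+              # compile only when we actually iterate through the node
+              if self._fn is None:
+                  self._fn = theano.function(self.graph.inputs(), self.expr,
+                                             givens=self.data.givens())
+              return self._fn(*self.data.next_batch())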
+
+
+
+Detailed Proposal (RP)
+======================
+
+I would go through a list of scenarios and possible issues : 
+
+Delayed or future values
+-------------------------
+
+Sometimes you might want future values of some nodes.  For example you might be interested in :
+
+y(t) = x(t) - x(t-1)
+
+You can get that by having a "delayed" version of a node. A delayed version of a node x is obtained by
+calling x.t(k), which will give you a node that has the value x(t+k). k can be positive or negative.
+In my view this can be done as follows :
+  - a node is a class that points to : 
+      * a data object that feeds data
+      * a theano expression up to that point
+      * the entire graph that describes the model ( not Theano graph !!!)
+The only thing you need to do is to change the data object to reflect the
+delay ( we might need to be able to pad it with 0s?). You also need to create
+a copy of the theano expression ( those are "new nodes" ), in the sense that
+the starting theano tensors are different since they point to different data.
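+
+A minimal sketch of the data-shifting part, using a plain numpy array as the
+"data object" ( the shift_data helper and the zero padding are assumptions,
+not a settled design) :
+
+    import numpy
+
+    def shift_data(x, k):
+        """Return x shifted by k time steps, padded with zeros."""
+        y = numpy.zeros_like(x)
+        if k > 0:
+            y[:-k] = x[k:]      # y[t] = x[t+k]
+        elif k < 0:
+            y[-k:] = x[:k]      # y[t] = x[t+k], with k negative
+        else:
+            y[:] = x
+        return y
+
+    # y(t) = x(t) - x(t-1) would then be built from x and its delayed copy,
+    # i.e. from the data fed to x and from shift_data(data, -1)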
+
+
+
+Non-theano transformation ( or function or whatever)
+----------------------------------------------------
+
+Maybe you want to do something in the middle of your graph that Theano does not
+support. Let's say you have a function f which you cannot write in Theano.
+You want to do something like
+
+
+ W1*f( W2*data + b)
+
+I think we can support that by doing the following :
+each node has :
+   * a data object that feeds data
+   * a theano expression up to that point
+   * the entire graph that describes the model
+
+Let x1 = W2*data + b
+up to here everything is fine ( we have a theano expression )
+   dot(W2, tensor) + b,
+   where tensor is provided by the data object ( plus a dict of givens 
+and whatever else you need to compile the function)
+
+When you apply f, what you do is create a node that behaves exactly like the
+data object, in the sense that it provides a new tensor and a new dict of
+givens
+
+so x2 = W1*f( W2*data+b)
+ will actually point to the expression
+    dot(W1, tensor)
+ and to the data node f(W2*data+b)
+
+What this means is that you basically compile two theano functions t1 and t2
+and evaluate t2(f(t1(data))). So every time you have a non-theano operation you
+break the theano expression and start a new one.
+
+What you lose :
+  - there is no optimization or anything between t1, t2 and f ( we don't
+    support that)
+  - if you are running things on the GPU, after t1 the data will be copied to
+    the CPU and then probably back to the GPU, so performance-wise it may no
+    longer make sense
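+
+A minimal sketch of the split evaluation ( the shapes, weights and the choice
+of numpy.tanh as the non-theano f are made up; only the t2(f(t1(data)))
+structure is the point) :
+
+    import numpy
+    import theano
+    import theano.tensor as T
+
+    x  = T.dvector('x')
+    W2 = theano.shared(numpy.random.randn(3, 5), name='W2')
+    b  = theano.shared(numpy.zeros(3), name='b')
+    t1 = theano.function([x], T.dot(W2, x) + b)   # theano expression up to f
+
+    h  = T.dvector('h')
+    W1 = theano.shared(numpy.random.randn(2, 3), name='W1')
+    t2 = theano.function([h], T.dot(W1, h))       # theano expression after f
+
+    f = numpy.tanh                                # stand-in non-theano function
+    out = t2(f(t1(numpy.random.randn(5))))        # t2(f(t1(data)))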
+
+
+
+Recurrent Things
+----------------
+
+I think that you can write a recurrent operation by first defining a
+graph ( the recurrent relation ) :
+
+y_tm1 = recurrent_layer(init = zeros(50))
+x_t   = slice(x, t=0)
+y     = loop( dotW_b(y_tm1,50) + x_t, steps = 20)
+
+This would basically give all the information you need to add a scan op
+to the theano expression of the resulting node; it is just a different way
+of writing things, which I think is more intuitive.
+
+You create your primitives, which are either a recurrent_layer that should
+have an initial value, or a slice of some other node ( a time slice, that is).
+Then you call loop, giving it an expression that starts from those primitives.
+
+Similarly you can have foldl or map or anything else.
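+
+A minimal sketch of what the loop above could compile down to, written with
+theano.scan directly ( the shapes and the dotW_b weights are made up) :
+
+    import numpy
+    import theano
+    import theano.tensor as T
+
+    x = T.dmatrix('x')       # ( time, 50), needs at least 20 time steps
+    W = theano.shared(numpy.random.randn(50, 50), name='W')
+    b = theano.shared(numpy.zeros(50), name='b')
+
+    def step(x_t, y_tm1):
+        # one step of dotW_b(y_tm1, 50) + x_t
+        return T.dot(y_tm1, W) + b + x_t
+
+    y, updates = theano.scan(step,
+                             sequences=[x],
+                             outputs_info=[T.zeros((50,), dtype='float64')],
+                             n_steps=20)
+    fn = theano.function([x], y, updates=updates)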
+
+Optimizer
+---------
+
+ Personally I would respect the findings of the optimization committee,
+ and have SGD require a Node that produces some error ( which can
+ be omitted) and the gradients. For this I would also have a grad
+ function which would actually only call T.grad.
+
+ What if you have a non-theano thing in the middle? I don't have any smart
+ solution besides ignoring any parameter that is below the first
+ non-theano node and throwing a warning.
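+
+ A minimal sketch of what grad and SGD could look like under this proposal
+ ( Node.parameters(), .inputs(), .expr and .var are all assumptions about the
+ node wrapper, not a settled interface) :
+
+     import theano
+     import theano.tensor as T
+
+     def grad(err_node, params):
+         # thin wrapper : really just calls T.grad on the node's expression
+         return T.grad(err_node.expr, [p.var for p in params])
+
+     def SGD(err_node, grads, lr=0.1):
+         params  = err_node.parameters()
+         updates = [(p.var, p.var - lr * g) for p, g in zip(params, grads)]
+         # iterating the learner would then run one step of these updates
+         return theano.function(err_node.inputs(), err_node.expr,
+                                updates=updates)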
+
+Learner
+-------
+
+ In my case I would not have both a predict() and an eval() method on the
+ learner, but just an eval(). If you want the predictions you should use the
+ corresponding node ( before applying the error measure ). This was,
+ for example, **out** in my first example.
+
+ Of course we could require learners to be special nodes that also have
+ a predict output. In that case I'm not sure what the iterator behaviour
+ of the node should produce.
+
+Granularity
+-----------
+
+Guillaume nicely pointed out that this library might be overkill,
+in the sense that you have a dotW_b transform, and then you will need
+a dotW_b_sparse transform and so on. Plus, each way of initializing a param
+would result in many more transforms.
+
+I don't have a perfect answer yet, but my argument goes like this :
+
+You would have transforms for the most popular options ( dotW_b for example).
+If you need something else you can always decorate a function that takes
+theano arguments and produces theano arguments. More than decorating, you
+can have a general apply transform that does something like :
+
+apply( lambda x,y,z: x*y+z, inputs = x,
+                            hyperparams = [(name, 2)],
+                            params = [(name, theano.shared(..))])
+
+The order of the arguments in the lambda is nodes, then params, then hyper-params, or so.
+This would apply the theano expression, but it would also register the
+parameters. I think you can arrange things such that the result of the apply is
+picklable, but not the apply itself; meaning that in the graph the op doesn't
+actually store the lambda expression but a mini theano graph.
+
+Also names might be optional, so you can write hyperparam = [2,]
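+
+A minimal sketch of what apply could do internally ( the Node class and the
+graph.register call are hypothetical, carried over from the node-wrapper
+sketch above) :
+
+    def apply(fn, inputs, params=(), hyperparams=()):
+        in_nodes = inputs if isinstance(inputs, (list, tuple)) else [inputs]
+        # build the theano sub-graph once, so the op stores a mini theano
+        # graph rather than the ( unpicklable) lambda itself
+        args = ([n.expr for n in in_nodes]
+                + [p for _, p in params]
+                + [h for _, h in hyperparams])
+        expr = fn(*args)
+        node = Node(data=in_nodes[0].data, expr=expr, graph=in_nodes[0].graph)
+        node.graph.register(params=params, hyperparams=hyperparams)
+        return node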
+
+
+What this way of doing things would hopefully buy you is that you do not
+need to worry about most of your model ( it would be just a few macros or
+subroutines).
+You would write something like :
+
+rbm1, hidden1 = rbm_layer(data, 20)
+rbm2, hidden2 = rbm_layer(hidden1, 20)
+
+and then the part you care about :
+
+hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2,
+                 params = theano.shared(scipy.sparse.csr_matrix(..)))
+
+and after that you potentially still do what you did before :
+
+err = cross_entropy(hidden3, target)
+grads = grad(err, err.parameters())
+...
+
+I do agree that some of the "transforms" that I have been writing here
+and there are pretty low level, and maybe we don't need them. We might need
+only somewhat higher-level transforms. My hope is that for now people think
+about the approach and not about all the inner details ( like what transforms
+we need, and so on) and see if they are comfortable with it or not.
+
+Do we want to think in these terms? I think it is a bit better to have your
+script written like that, than to hack into the DBN class to change that W
+to be sparse.
+
+Anyhow Guillaume I'm working on a better answer :)
+
+
+Params and hyperparams
+----------------------
+
+I think it is obvious from what I wrote above that there is a node wrapper
+around the theano expression. I haven't written down all the details of that
+class. I think there should be such a wrapper around parameters and
+hyper-parameters as well. By default those wrappers might not provide
+any information. Later on they can provide, for hyper-params for example, a
+distribution. If, when inserting your hyper-param in the graph ( i.e. when
+you call a given transform), you provide the distribution, then maybe a
+hyper-learner could use it to sample from it.
+
+For parameters you might define properties like freeze. It can be true or
+false. Whenever it is set to true, the param is not adapted by the optimizer.
+Changing this value, like changing most hyper-params, implies recompilation
+of the graph.
+
+I would have a special class of hyper-params which don't require
+recompilation of the graph. The learning rate is an example. This info is also
+given by the wrapper and by how the parameter is used.
+
+It is up to the user and the "transform" implementer to wrap params and
+hyper-params correspondingly. But I don't think this is too complicated.
+The apply function above has a default behaviour; maybe you would have
+a fourth type of argument, a hyper-param that doesn't require
+recompilation. We could find a nice name for it.
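+
+A minimal sketch of what the two wrappers could look like ( all names are
+hypothetical; the recompile-on-change machinery is only hinted at) :
+
+    import theano
+
+    class Param(object):
+        def __init__(self, value, name=None, frozen=False):
+            self.var    = theano.shared(value, name=name)
+            self.frozen = frozen     # if True, the optimizer skips this param
+
+    class HyperParam(object):
+        def __init__(self, value, name=None, distribution=None,
+                     needs_recompilation=True):
+            self.value        = value
+            self.name         = name
+            self.distribution = distribution   # e.g. for a hyper-learner to sample from
+            # learning-rate-like hyper-params would set this to False
+            self.needs_recompilation = needs_recompilation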
+
+
+How does this work?
+-------------------
+
+You always have a pointer to the entire graph. Whenever a hyper-param
+changes ( or a param freezes), all regions of the graph affected get recompiled.
+This is done by traversing the graph from the bottom node and reconstructing
+the theano expression.
+
+This function that updates / re-constructs the graph is slightly more complex
+if you have non-theano functions in the graph ..
+
+replace
+-------
+
+Replace replaces a part of the graph. The way it works, in my view, is that
+if I write :
+
+x = x1+x2+x3
+y = x.replace({x2:x5})
+
+You would first copy the graph that is represented by x ( the params or
+hyper-params are not copied) and then replace the subgraphs. I.e., x will
+still point to x1+x2+x3, while y will point to x1+x5+x3. Replace is not done
+in place.
+
+I think of these Node classes as something light-weight, like theano variables.
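+
+At the theano expression level, something very close to this already exists as
+theano.clone, so a node-level replace could be a thin wrapper around it; a
+minimal sketch :
+
+    import theano
+    import theano.tensor as T
+
+    x1, x2, x3, x5 = T.scalars('x1', 'x2', 'x3', 'x5')
+    x = x1 + x2 + x3
+    y = theano.clone(x, replace={x2: x5})   # x is untouched, y computes x1+x5+x3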
+
+
+reconstruct
+-----------
+
+This is something nice for the DAA. It is definitely not useful for the rest.
+I think though that it is a shame to have that transformation graph and not
+be able to use it for this. It will make life so much easier when you
+build deep auto-encoders. I wouldn't put it in the core library, but I would
+have it in the DAA module. The way I see it, you can either have something like
+
+# generate your invertible transforms on the fly
+fn  = create_transform( lambda .. : .. , params, hyper_params )
+inv = create_transform( lambda .. : .. , params, hyper_params )
+my_transform = couple_transforms( forward = fn, inv = inv)
+
+# or have some already widely used such transforms in the daa submodule.
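+
+A minimal sketch of what couple_transforms could return ( the class and its
+methods are an assumption) :
+
+    class CoupledTransform(object):
+        """Pairs a forward transform with its inverse, so a DAA module can
+        build the reconstruction path from the same graph."""
+        def __init__(self, forward, inv):
+            self.forward = forward
+            self.inv     = inv
+
+        def __call__(self, node):
+            return self.forward(node)
+
+        def reconstruct(self, node):
+            return self.inv(node)
+
+    def couple_transforms(forward, inv):
+        return CoupledTransform(forward, inv)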
+
+
+transforms
+----------
+
+In my view there will be quite a few such standard transforms. They
+can be grouped by architecture, basic, sampler, optimizer and so on.
+
+We do not need to provide all of them, just the ones we need. Research
+on an architecture would actually lead to creating new such transforms in
+the library.
+
+There will definitely be a list of basic such transforms in the beginning,
+like :
+  replace, 
+  search, 
+  get_param(name)
+  get_params(..)
+
+You can and should have something like a switch ( that, based on a
+hyper-parameter, replaces a part of the graph with another or not). This is
+done by re-compiling the graph.
+
+
+Constraints
+-----------
+
+Nodes can also keep track of constraints.
+
+When you write 
+
+y = add_constraint(x, sum(x**2))
+
+y is the same node as x, except that it also links to this second graph that
+computes the constraints. Whenever you call grad, it will also add to the
+cost all the constraints attached to the graph.
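+
+A minimal sketch of add_constraint ( the constraints attribute is an
+assumption; the grad wrapper from the Optimizer section would then simply add
+every expression in node.constraints to the cost before calling T.grad) :
+
+    def add_constraint(node, constraint_expr):
+        # y is the same node as x; it just gains a link to the constraint
+        # sub-graph, e.g. an L1 or L2 penalty expressed as a theano expression
+        node.constraints = getattr(node, 'constraints', []) + [constraint_expr]
+        return node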
+
+