diff doc/v2_planning/layer_RP.txt @ 1237:32fc5f442dde

LAYER: sligthly long but somewhat clearer rendering of what I have in mind
author Razvan Pascanu <r.pascanu@gmail.com>
date Thu, 23 Sep 2010 11:40:20 -0400
parents 5ef96142492b
children
line wrap: on
line diff
--- a/doc/v2_planning/layer_RP.txt	Thu Sep 23 11:21:48 2010 -0400
+++ b/doc/v2_planning/layer_RP.txt	Thu Sep 23 11:40:20 2010 -0400
@@ -8,9 +8,16 @@
 =============
 
  You construct your neural network by constructing a graph of connections
- between layers starting from data. While you construct the graph,
+ between "layers" starting from data. While you construct the graph,
  different theano formulas are put together to construct your model.
 
+ The idea would be that you need to describe exactly what you would draw
+ on the board if you are asked to draw the architecture. This would be of 
+ course optional ( you will get macros that will return this graph
+ automatically for a well defined case). Things that are not neural networks, 
+ and you wouldn't have any structure to draw are just a box. For example a 
+ SVM, or PCA. This in case you want to connect their output to your network.
+
  Hard details are not set yet, but all members of the committee agreed
  that this sound as a good idea.
 
@@ -23,6 +30,7 @@
  h1   = sigmoid(dotW_b(train_x, n = 300))
  rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood)
 
+
  h2 = sigmoid(dotW_b(h1, n = 300))
  rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost= pseudolikelihood)
 
@@ -31,7 +39,7 @@
  train_err = cross_entropy( out, train_y)
 
  grads   = grad( train_err, err.parameters() )
- learner = SGD( err, grads)
+ learner = SGD( err, err.parameters(), grads)
  
  valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
  test_err  = train_err.replace({ train_x : test_x , train_y : test_y})
@@ -42,17 +50,25 @@
 ---------------------
 
   1) Your graph can have multiple terminal nodes; in this case rbm1, 
-     rbm2 and learner, valid_err, test_err are all end nodes of the graph; 
+     rbm2 and learner, valid_err, test_err are all end nodes of the graph;
 
-  2) Any node is an "iterator", when you would call out.next() you would get 
-    the next prediction;  when you call err.next() you will get next error 
+  2) Any node is an "iterator", when you would call out.next() you would get
+    the next prediction;  when you call err.next() you will get next error
     ( on the batch given by the data.next() ).
 
-  3) Replace can replace any subgraph
+  3) Replace can replace any subgraph or subgraphs with other
+  subgraphs/subgraph as long as : there are the same number of input units 
+  and output units ( there is a 1 to 1 maping from those). I see replacing
+  subgraphs as looping over the list of subgraphs to replace and call replace
+  on which nothing fancier. Since nodes in my view produce the same interface
+  (execpt parameter nodes and hyper-parameter nodes) this constraint is not
+  hard to respect, so is up to the user to do a replace that makes sense.
 
   4) You can have MACROS or SUBROUTINE that already give you the graph for 
   known components ( in my  view the CDk is such a macro, but simpler 
-  examples will be vanilla versions of MLP, DAA, DBN, LOGREG)
+  examples will be vanilla versions of MLP, DAA, DBN, LOGREG). After 
+  Guillaume pointed out a real shortcomming of the approach I've modified
+  a bit what you get from a macro .. look below.
 
   5) Any node has the entire graph ( though arguably you don't use that 
   graph too much). Running such a node in general will be done by compiling 
@@ -72,8 +88,10 @@
   6) Registering parameters and hyper-parameters to the graph is the job of 
      the transform and therefore of the user who implemented that 
      transform; the same for initializing the parameters ( so if we have 
-     different way to initialize the weight matrix that might be a 
-     hyperparameter with a default value)
+     different ways to initialize the weight matrix that might be a 
+     hyperparameter with a default value or different transforms; to ease 
+     the number of such transforms you can define a transform on the fly for
+     simple theano expressions )
 
 
 
@@ -85,6 +103,9 @@
 Delayed or feature values
 -------------------------
 
+
+This is can be dropped if people think is not useful.
+
 Sometimes you might want future values of some nodes.  For example you might 
 be interested in :
 
@@ -159,26 +180,32 @@
 y     = loop( dotW_b(y_tm1,50) + x_t, steps = 20)
 
 This would basically give all the information you need to add a scan op 
-to your theano expression of the result op, it is just a different way 
+to your theano expression of the result node y, it is just a different way 
 of writing things .. which I think is more intuitive. 
 
 You create your primitives which are either a recurrent_layer that should
-have a initial value, or a slice of some other node ( a time slice that is)
-Then you call loop giving a expression that starts from those primitives.
+have a initial value, or a slice of some other node ( a time slice that is).
+A tims slice is a special kind of node, which we should try to force people
+not to use outside of a loop. If you use it though you have some default
+behaviour like for example it behaves exactly like a delayed node.
+You call loop giving a expression that starts from those primitives and 
+ta da, you have your recurrent expression in the graph.
 
 Similarly you can have foldl or map or anything else.
 
-You would use this instead of writing scan especially if the formula is 
+You would use this instead of writing scan especially if the formulas are
 more complicated and you want to automatically collect parameters,
-hyper-parameters and so on.
+hyper-parameters and so on. You could also just use the scan op and 
+using a general apply command if you like that more.
 
 Optimizer
 ---------
 
  Personally I would respect the findings of the optimization committee,
  and have the SGD to require a Node that produces some error ( which can
- be omitted) and the gradients. For this I would also have the grad
- function which would actually only call T.grad. 
+ be omitted) and the parameter nodes and nodes that compute gradients for 
+ those paramters. For this I would also have the grad function which would 
+ actually only call T.grad. 
 
  If you have non-theano thing in the middle? I don't have any smart 
  solution besides ignoring any parameter that it is below the first 
@@ -190,7 +217,10 @@
  In my case I would not have a predict() and eval() method of the learner,
  but just a eval(). If you want the predictions you should use the 
  corresponding node ( before applying the error measure ). This was 
- for example **out** in my first example.
+ for example **out** in my first example. Note eval() in this case is
+ the same as next(). ( you might just have next for simplicity). The 
+ only semantically important difference is that a call to next has now
+ side-effects in the sense that the parameters are updated.
 
  Of course we could require learners to be special nodes that also have
  a predict output. In that case I'm not sure what the iterating behaiour
@@ -208,8 +238,10 @@
 
 you would have transforms for the most popular option ( dotW_b) for example.
 If you need something else you can always decorate a function that takes
-theano arguments and produces theano arguments. More then decoratting you
-can have a general apply transform that does something like : 
+theano arguments and produces theano arguments. The formulas produced by 
+the formula committee might be a rich source of such function to decorate.
+More then decoratting, you can have a general apply transform that does 
+something like : 
 
 apply( lambda x,y,z: x*y+z, inputs = x, 
                             hyperparams = [(name,2)], 
@@ -217,47 +249,109 @@
 The order of the arguments in lambda is nodes, params, hyper-params or so.
 This would apply the theano expression but it will also register the 
 the parameters. It is like creating a transform on the fly.
+You should, or could provide names for parameters, you might need them 
+later.
 
 I think you can do such that the result of the apply is 
-pickable, but not the apply operation. Meaning that in the graph, the op 
-doesn't actually store the lambda expression but a mini theano graph.
-
-Also names might be optional, so you can write hyperparam = [2,]
+pickable, but not the general apply transform. What I mean is that 
+the output node does not store the lambda expression but some theano 
+graph (?) and it know which are the input ( and when you can replace 
+them so that you link this little graph to the rest of the 
+theano expression. Is just an ugly hack given that you can not save 
+lambda expressions, but I'm open to other alternatives ..
 
 
 What this way of doing things would buy you hopefully is that you do not 
-need to worry about most of your model ( would be just a few macros or 
-subrutines).
-you would do something like :
+need to worry about most of your model ( would be just a few macros) that 
+will get you to the point you want to change and then you do surgery on 
+that point. Compare this with hacking a class, it feels cleaner, because
+you what is up to that point you want to change is sort of separated from 
+what you change. Plus you could do this in your script, and you don't need
+to create your local branch of the library where you hack the class, or 
+duplicate the class file under a different name .. 
+Once what you are doing becomes stable it can be converted in either a
+different macro or a parameter to the initial macro.
 
-rbm1,hidden1 = rbm_layer(data,20)
-rbm2,hidden2 = rbm_layer(data,20)
+** New part **
 
-and then the part you care about :
+If this is not convincing enough, there is another point that I want to 
+make. While creating the graph you can optionally create a model object.
+I will encourage most people to do that ! This idea I had a long time ago,
+but then I used a singleton class as the world which could potentially create
+a lot of issues. This is a nicer version of that. 
 
-hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2, params =
-theano.shared(scipy.sparse_CSR(..)))
+This model class is optional but it can be extremely useful. What you do in
+this model class is to store the graph, together with different annotations 
+on that graph. What I would do is identify different subgraphs in the model
+and register them under different names. For example if err is the node that 
+points to the graph that represents a DBN, that graph will be registerd to 
+a model in which I have annotated which subgraphs represent the different
+rbms, which represents the logistic regression and so on. The model will also
+have a list of all the input nodes and all the output nodes of the graph. 
+We could potentially use this model class to control some global default 
+parameters initialization or hyper-parameters. This all might sound like 
+magic but is actually easy to implement.
 
-and after that you pottentially still do what you did before :
+If you have such a model, which is just some annotations on the graph, this 
+approach makes it easy to change components of the graph based on their names.
+For example I can replace rbm1 with a daa, because based on these annotations
+I know which part is rbm1. 
 
-err = cross_entropy(hidden3, target)
-grads = grad(err, err.paramters())
-...
+Why do I feel you need such a thing? It is just because you get the DBN by 
+calling a macro, and you don't have variables that point to different nodes 
+of your network so that you can define where a subgraph starts or not. But
+if a graph returns such a model, you can introspect what annotations you have.
+There should also be standard conventions, but you could also in the
+interactive shell look at :
+
+model.annotations(depth = 2)
+
+This would print something like : 
 
-I do agree that some of the "transforms" that I have been writing here 
-and there are pretty low level, and maybe we don't need them. We might need
-only somewhat higher level transforms. My hope is that for now people think
-of the approach and not about all inner details ( like what transforms we 
-need and so on) and see if they are comfortable with it or not.
+ 'DBN'
+    'rbm1'
+       'hidden_layer1'
+       'CDk_layer1'
+    'rbm2'
+       'hidden_layer2'
+       'CDk_layer2'
+    'logreg'
+       'cross_entropy'
+
+And then you can say 
 
-Do we want to think in this terms? I think is a bit better do have
-a normal python class, hacking it to change something and then either add
-a parameter to init or create a new version. It seems a bit more natural.
+daa1 = daa(..)
+daa2 = daa(..)
+new_model = model.replace('rbm1', daa1, new_name = 'daa1')
+new_model = new_model.replace('rbm2', daa2, new_name = 'daa2')
+
+and you get a SDAA.
+What is the hierarhical structure ? Well, in my view if some subgrah
+(annotated as S1) is part of another subgraph (annotated as S2) then 
+S1 is a child of S2 in this hierarchy of annotations. If they share 
+just a few nodes, but have nodes that are not shared, then they are on 
+the same level. We might one a flat space for the annotations, but I think
+this simple convention can get as a lot.
 
 
-
+So macros should in general return such models. It is up to you if you want to 
+ground the graph that you create in your script into a model or not. You do 
+so by manually adding nodes to the model. The annotations are also manually 
+done .. So this might be a bit annoying for a developer of a macro, but I
+don't think is cognitively complicated, and it would help a lot when using 
+the macros.
 
-Anyhow Guillaume I'm working on a better answer :)
+You can see how this annotation system becomes easily interesting. You can
+also annotate parameters ( and it is not too overwhelming to do so when
+you create the graph as well) and you can use this to sort of collect all 
+parameters that you annotated in some way and then do something to them.
+
+The way I see it is just that a transform could have an optional annotations
+argument and it will add that string to all parameters and hyper-parameters.
+How much sense this makes is debatable, but I strongly believe that is not 
+complicated to implement ( I actually have something like this already
+implemented, just that I use that single ton class, and I sort of made the 
+framework work mostly for DAA by making a few poor choices).
 
 
 Params and hyperparams
@@ -267,40 +361,44 @@
 around the theano expression. I haven't wrote down all the details of that
 class. I think there should be such a wrapper around parameters and 
 hyper-parameters as well. By default those wrappers might not provide
-any informtion. Later on, they can provide for hyper-params for example a
-distribution. If when inserting your hyper-param in the graph ( i.e. when
-you call a given transform) you provide the distribution then maybe a
-hyperlearner could use it to sample from it.
+any informtion. But you can potentially add interesting information for 
+"graph" aware transforms. For example you can add annotations for a find
+or replace function that will collect you all parameters or hyper-parameter
+so you do some common thing to all of them (when it makes sense). 
 
-For parameters you might define properties like freeze. It can be true or 
-false. Whenever it is set to true, the param is not adapted by the optimizer.
-Changing this value like changing most of hyper-params implies recompilation
-of the graph.
+You could have a freeze property for parameters. If you change that property
+the theano function (where needed) for all nodes that follow this one is 
+recomputed. This argument would be used by the collecting paramters function
+used to compute the gradient. If parameters are frozen they are ignored,
+if not they are updated. 
 
-I would have a special class of hyper-params which don't require 
-recompilation of the graph. Learning rate is an example. This info is also
-given by the wrapper and by how the parameter is used.
+For hyper-parameters you would also have a different wrapper that would
+contain, possibly, the distribution of that hyper-parameters for a
+hyper-learner.
 
-It is up to the user and "transform" implementer to wrap params and 
-hyper-params correspondingly. But I don't think this is to complicated.
-The apply function above has a default behaviour, maybe you would have 
-a forth type of argument which is hyper-param that doesn't require 
-compilation. We could find a nice name for it.
-
+I would also have the learning rate or noise_amounts as some strange 
+hyper-paramter. I would say by default, if any hyper-paramter changes its
+value, then the theano expressions need to be recompiled. If you are dealing
+with this strange types of hyper-parameters you don't need to do that. 
+This can be automatically for you and I guess it will all boil down to,
+is you hyper-paramter a theano shared variable or theano tensor ? If so we 
+are dealing with the second type. So this kind of stuff can be detected
+automatically.
 
 How does this work?
 -------------------
 
 You always have a pointer to the entire graph. Whenever a hyper-param 
 changes ( or a param freezes) all region of the graph affected get recompiled.
-This is by traversing the graph from the bottom node and constructing the
-theano expression.
+This is by traversing the graph from the bottom node and re-constructing the
+theano expression. Where needed this theano expression get compiled.
 
 This function that updates / re-constructs the graph is sligthly more complex
-if you have non-theano functions in the graph ..
+if you have non-theano functions in the middle of the graph .. but not too 
+much in my view.
 
-replace
--------
+replace & find
+--------------
 
 Replace, replaces a part of the graph. The way it works in my view is that
 if I write : 
@@ -311,61 +409,169 @@
 You would first copy the graph that is represented by x ( the params or 
 hyper-params are not copied) and then replace the subgraphs. I.e., x will
 still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done 
-inplace.
+inplace !
+
+I think these Node classes as something light-weighted, like theano variables
+and creating copy is not harmful. Also params & shared variables are shared 
+between these graphs. If you want new params / shared variables we can offer
+a copy / deepcopy command.
+
+Replace (given that it starts from a model) can take string(s) that indicate
+specific annotations.
+
+Find does the same ( without the copying).
+
+
+
+If you have two things named the same in the graph you would return the first 
+one in a breadth search starting from the top node. The idea is that if you
+have all the weight matrices annotated as 'W' and you look for 'W' starting 
+from node hiddens2, you want the W of the second layer, and not of the first.
+
+I wold support :
+model.replace( look_at , search_for , replace_with, annotate_as)
+replace(model , look_at  , search_for , replace_with, annotate_as)
+node.replace(model  , look_at, replace_with, annotate_as)
 
-I think these Node classes as something light-weighted, like theano variables.
+look_at if it is a node it reffers to the subgraph that has as a final 
+node that node. I.e. all up to that point. If it is a string, you would look
+at the subgraph annotated by that string.
+
+Of course we can optionally choose not to allow things to be annotate with 
+the same name, though I sort of liked it. It makes a lot of things easy. For 
+a DBN I would have the annotations : 
+
+DBN
+  rbm1
+    hidden
+    CDk
+  rbm2
+    hidden
+    CDk
+  logreg
+
+If I want to change the first CDk with PCD I would do 
+
+pcd1 = PCD (..)
+model.replace(look_at='rbm1', search_for='CDk', replace_with=pcd1,
+                annotate_as='PCD1')
+
+
+Bottom line is : 
+
+  I think having a graph and having a way to search in that graph and replace
+  parts is a very flexible and powerful way of doing things.
 
 
 reconstruct
 -----------
 
-This is something nice for DAA. It is definetely not useful for the rest. 
+This is something nice for DAA. It is definetely not useful for the rest.
 I think though that is a shame having that transformation graph and not 
 being able to use it to do this. It will make life so much easier when you
 do deep auto-encoders. I wouldn't put it in the core library, but I would 
-have in the DAA module. The way I see it you can either have something like
+have in the DAA module. For reconstruct to work you need to have inverse 
+transforms for the ones you use.
+
+The way I see it you can either have something like
 
 # generate your inversable transforms on the fly
 fn  = create_transform(lambda : , params, hyper-params )
 inv = create_transform(lambda : , params, hyper-params )
 my_transform = couple_transforms( forward = fn, inv = inv)
 
-# have some already widely used such transform in the daa submodule.
+and generate special transforms on the fly that have some pseudo-inverses
+when you construct the graph. Maybe you can also have spcific pre-defined
+transforms for the most used cases, whith specific names. Even more I don't
+see the harm of something as simple as dotW_b to have a inverse defined ( as
+using tied weights) in all cases, but you would only use it for the DAA. 
+It just to reduce the number of names of transforms you have, is like a
+feature that doesn't hurt or help in 95% of times but it helps in 5% of times.
+
+
+But this is up to debate. The only reason I bring it up is to say that the 
+class that represents a transform should have a inverse method that by 
+default throws an exception.
 
 
 transforms
 ----------
 
-In my view there will be quite a few of such standard transforms. They
-can be grouped by architecture, basic, sampler, optimizer and so on. 
-
-We do not need to provide all of them, just the ones we need. Researching
-on an architecture would actually lead in creating new such transforms in 
-the library.
+In my view there will be quite a few of such standard transforms. 
+This can be annoying, but I think that if we group them by 
+architectures (MLP, DAA, RBM), sampler, optimizers it will be less of a mess.
+This would be crucial for their documentation as well. This categories should
+also come with macros. There will be though some basic transforms that 
+are available at the core ( like replace, find, things related to annotating
+and creating a model, collecting parameters and hyper-paramters)
 
-There will be definetely a list of basic such transforms in the begining,
-like : 
-  replace, 
-  search, 
-  get_param(name)
-  get_params(..)
-
-You can have and should have something like a switch ( that based on a 
-hyper parameter replaces a part of a graph with another or not). This is
-done by re-compiling the graph. 
+I also think that we can start small by having just very few such transforms
+and add them as the library grows. We don't need many of this, most are 
+nice to have ..
 
 
 Constraints
 -----------
 
-Nodes also can also keep track of constraints. 
+You can always add constraints. I think the easier to make this explicit is to 
+get a hand on the parameter or ndoe on which you want to add constraint and 
+do something like 
 
-When you write 
+add_constraint(on_what, what)
 
-y = add_constraint(x, sum(x**2))
+on_what can be a node, a parameter node, a list of nodes, a list of parameter 
+nodes, an annotation string, given that you provided a model, and what is a 
+graph. In terms of the graph that you are creating what this does is to 
+create a dependency link from your main graph to that constraint graph.
+This means that the grad function that computes the grad function that 
+computes the gradients with respect to parameters will also (if there are 
+such dependency links) add the gradient of those parameters with respect 
+to the output of that dependency graph. There are some constraints on 
+what a dependency graph can be, in the sense that it should start from only
+one input ( the parameters / node) and it should end in only one node that 
+is a scalar.
 
-y is the same node as x, just that it also links to this second graph that
-computes constraints. Whenever you call grad, grad will also sum to the 
-cost all attached constraints to the graph.
+From an implementation point of view, this can be done by just collecting a 
+list of constraints cost, that will be added to the cost before calling
+T.grad. But I like to think about it in terms of graph linked through
+dependency links.
+
+
 
 
+Some general comments
+---------------------
+
+ I think that what you get in the end is a very flexible framework, where 
+ adding new things is just a matter of putting together a few transforms and 
+ annotating the entire thing. Worst case scenario you would need to invent a 
+ transform, which I do believe could be quite painless.
+
+ The harder part to implement is the back-bone. It is not difficult in my
+ view, mostly sligthly tideous. I had something like this implemented in a
+ matter of a week, though it was a bit less restrictive. I do believe though
+ that we should not oversimplify the backbone of the library just to make it 
+ easy to implement, but we should rather carefully consider what you get in
+ the end
+
+
+Connection to the architecture committee
+-----------------------------------------
+
+ I think that if you get such iterator objects that can produce either 
+ the error, or do an update step it is easy to wrap them in a plug-in, 
+ or use it with the imperative language James proposed. 
+
+ I actually have ideas ( using non theano nodes) how to break the algo at 
+ points such that you can have different parts run on remote machines ..
+ though we might not want to support that ( using the plug-in system .. 
+ though it might work with other systems that support the same idea)
+
+ I think it goes more natural with the imperative language that James
+ proposed, because that would create a graph as well. His graph is 
+ in general simpler ( it always has only one termination node) where 
+ the nodes have a different interpretation (?) so I would use a different
+ node class on those. But from writing the code, using some syntactic sugar
+ the difference can be blurred ( do we want this ?). I think that one 
+ can come up with ways of making the approaches look alike and sligtly 
+ homogeneous.