doc/v2_planning/layer_RP.txt @ 1237:32fc5f442dde
LAYER: slightly long but somewhat clearer rendering of what I have in mind

author:  Razvan Pascanu <r.pascanu@gmail.com>
date:    Thu, 23 Sep 2010 11:40:20 -0400
parents: 5ef96142492b

Proposal (RP)
=============

You construct your neural network by constructing a graph of connections
between "layers" starting from data. While you construct the graph,
different theano formulas are put together to construct your model.

The idea is that you describe exactly what you would draw on the board if
you were asked to draw the architecture. This is of course optional (you
will get macros that return this graph automatically for the well defined
cases). Things that are not neural networks, and for which you wouldn't
have any structure to draw, are just a box (for example an SVM, or PCA),
in case you want to connect their output to your network.

Hard details are not set yet, but all members of the committee agreed
that this sounds like a good idea.

# Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

h1   = sigmoid(dotW_b(train_x, n = 300))
rbm1 = CDk(h1, train_x, k = 5, sampler = binomial, cost = pseudolikelihood)

h2   = sigmoid(dotW_b(h1, n = 300))
rbm2 = CDk(h2, h1, k = 5, sampler = binomial, cost = pseudolikelihood)

out = sigmoid(dotW_b(h2, n = 10))

train_err = cross_entropy(out, train_y)

grads   = grad(train_err, train_err.parameters())
learner = SGD(train_err, train_err.parameters(), grads)

valid_err = train_err.replace({train_x: valid_x, train_y: valid_y})
test_err  = train_err.replace({train_x: test_x,  train_y: test_y})



Global observations:
--------------------

1) Your graph can have multiple terminal nodes; in this case rbm1, rbm2,
learner, valid_err and test_err are all end nodes of the graph.

2) Any node is an "iterator": when you call out.next() you get the next
prediction; when you call err.next() you get the next error (on the batch
given by data.next()). (See the usage sketch after this list.)

3) Replace can replace any subgraph or subgraphs with other subgraphs, as
long as there are the same number of input and output units (there is a
1-to-1 mapping between them). I see replacing several subgraphs as simply
looping over the list of subgraphs and calling replace on each, nothing
fancier. Since nodes in my view expose the same interface (except parameter
nodes and hyper-parameter nodes) this constraint is not hard to respect, so
it is up to the user to do a replace that makes sense.

4) You can have MACROS or SUBROUTINES that already give you the graph for
known components (in my view CDk is such a macro, but simpler examples
would be vanilla versions of MLP, DAA, DBN, LOGREG). After Guillaume
pointed out a real shortcoming of the approach, I've modified a bit what
you get from a macro .. look below.

5) Any node has the entire graph (though arguably you don't use that graph
too much). Running such a node will in general be done by compiling the
theano expression up to that node (if you don't already have this function),
and using the data object that you get initially. This theano [...]
constraints..)

6) Registering parameters and hyper-parameters to the graph is the job of
the transform, and therefore of the user who implemented that transform;
the same goes for initializing the parameters (so if we have different ways
to initialize the weight matrix, that might be a hyper-parameter with a
default value, or different transforms; to keep the number of such
transforms down you can define a transform on the fly for simple theano
expressions).

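
To make observation 2 concrete, here is a minimal usage sketch. It assumes
the nodes built in the example above, that every call consumes one batch
from the data, and that calling next() on the learner performs one SGD
update as a side effect (n_epochs and n_train_batches are hypothetical):

    for epoch in range(n_epochs):
        for _ in range(n_train_batches):
            cost = learner.next()     # one SGD step; parameters get updated
        print('epoch %d valid error %f' % (epoch, valid_err.next()))
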

Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues:

Delayed or future values
------------------------

This can be dropped if people think it is not useful.

Sometimes you might want future values of some nodes. For example you might
be interested in:

y(t) = x(t) - x(t-1)

[...]
y_tm1 = recurrent_layer(init = zeros(50))
x_t   = slice(x, t = 0)
y     = loop(dotW_b(y_tm1, 50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op
to the theano expression of the result node y; it is just a different way
of writing things .. which I think is more intuitive.

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node (a time slice, that is).
A time slice is a special kind of node, which we should try to force people
not to use outside of a loop. If you do use it though, it has some default
behaviour, for example it behaves exactly like a delayed node.
You call loop giving an expression that starts from those primitives and,
ta da, you have your recurrent expression in the graph.

Similarly you can have foldl or map or anything else.

You would use this instead of writing scan, especially if the formulas are
more complicated and you want to automatically collect parameters,
hyper-parameters and so on. You could also just use the scan op through a
general apply command if you like that more.

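
For reference, here is a rough sketch of the raw theano.scan call that the
loop example above could be lowered to. This is only an assumed lowering;
W and b stand in for the shared variables that dotW_b would create:

    import numpy, theano
    import theano.tensor as T

    x = T.matrix('x')                                   # (n_steps, 50) inputs
    W = theano.shared(numpy.zeros((50, 50)), name='W')  # what dotW_b would own
    b = theano.shared(numpy.zeros(50), name='b')

    def step(x_t, y_tm1):
        # body of the loop: dotW_b(y_tm1, 50) + x_t
        return T.dot(y_tm1, W) + b + x_t

    # y_seq holds the 20 steps of the recurrence, starting from zeros(50)
    y_seq, updates = theano.scan(step,
                                 sequences=[x],
                                 outputs_info=[T.zeros((50,))],
                                 n_steps=20)
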
Optimizer
---------

Personally I would respect the findings of the optimization committee,
and have SGD require a node that produces some error (which can be
omitted), the parameter nodes, and nodes that compute gradients for
those parameters. For this I would also have the grad function, which
would actually only call T.grad.

What if you have a non-theano thing in the middle? I don't have any smart
solution besides ignoring any parameter that is below the first non-theano
node and throwing a warning.

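
A minimal sketch of what that grad function and an SGD node could look like
(the interface is an assumption, not settled design; data is taken to live
in shared variables so the compiled step function needs no explicit inputs):

    import theano
    import theano.tensor as T

    def grad(cost, params):
        # in this proposal grad is just a thin wrapper around T.grad
        return T.grad(cost, params)

    class SGD(object):
        def __init__(self, cost, params, grads, lr=0.1):
            # one update step: p <- p - lr * dcost/dp
            updates = [(p, p - lr * g) for p, g in zip(params, grads)]
            self._step = theano.function([], cost, updates=updates)

        def next(self):
            # returns the current cost and updates the parameters (side effect)
            return self._step()
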
Learner
-------

In my case I would not have a predict() and an eval() method on the learner,
but just eval(). If you want the predictions you should use the
corresponding node (before applying the error measure); this was for
example **out** in my first example. Note that eval() in this case is the
same as next() (you might have just next() for simplicity). The only
semantically important difference is that a call to next() now has
side-effects, in the sense that the parameters are updated.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterating behaviour
of the node should produce.

[...]

I don't have a perfect answer yet, but my argument goes like this:

You would have transforms for the most popular options (dotW_b for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. The formulas produced by
the formulas committee might be a rich source of such functions to decorate.
More than decorating, you can have a general apply transform that does
something like:

apply(lambda x, y, z: x*y + z, inputs = x,
      hyperparams = [(name, 2)],
      params = [(name, theano.shared(..))])
The order of the arguments in the lambda is nodes, params, hyper-params, or
so. This would apply the theano expression but it will also register the
parameters. It is like creating a transform on the fly. You should, or at
least could, provide names for parameters; you might need them later.

I think you can make it such that the result of the apply is picklable,
even if the general apply transform itself is not. What I mean is that the
output node does not store the lambda expression but some theano graph (?),
and it knows which are the inputs (and where you can replace them), so that
you can link this little graph to the rest of the theano expression. It is
just an ugly hack given that you cannot save lambda expressions, but I'm
open to other alternatives ..

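
As a sketch of that hack: the general apply transform calls the lambda once,
at graph-construction time, and the node only keeps the resulting theano
expression, which is picklable (Node here is just an illustrative stand-in):

    import collections

    Node = collections.namedtuple('Node', 'expr inputs params')  # stand-in

    def apply_(fn, inputs, params=(), hyperparams=()):
        # evaluate the lambda once on the theano variables; afterwards only
        # the resulting expression is stored, never the lambda itself
        expr = fn(*(list(inputs) + list(params) + list(hyperparams)))
        return Node(expr, list(inputs), list(params))
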

What this way of doing things would hopefully buy you is that you do not
need to worry about most of your model (it would be just a few macros) that
gets you to the point you want to change, and then you do surgery on that
point. Compare this with hacking a class: it feels cleaner, because what
comes before the point you want to change is sort of separated from what
you change. Plus you can do this in your script, and you don't need to
create a local branch of the library where you hack the class, or duplicate
the class file under a different name ..
Once what you are doing becomes stable it can be converted into either a
different macro or a parameter of the initial macro.

** New part **

If this is not convincing enough, there is another point that I want to
make. While creating the graph you can optionally create a model object.
I would encourage most people to do that! I had this idea a long time ago,
but back then I used a singleton class as the world, which could potentially
create a lot of issues. This is a nicer version of that.

This model class is optional but it can be extremely useful. What you do in
this model class is store the graph, together with different annotations on
that graph. What I would do is identify different subgraphs in the model and
register them under different names. For example, if err is the node that
points to the graph that represents a DBN, that graph will be registered to
a model in which I have annotated which subgraphs represent the different
RBMs, which represents the logistic regression and so on. The model will
also have a list of all the input nodes and all the output nodes of the
graph. We could potentially use this model class to control some global
defaults for parameter initialization or hyper-parameters. This all might
sound like magic but it is actually easy to implement.

If you have such a model, which is just some annotations on the graph, this
approach makes it easy to change components of the graph based on their
names. For example I can replace rbm1 with a daa, because based on these
annotations I know which part is rbm1.
299 | |
300 Why do I feel you need such a thing? It is just because you get the DBN by | |
301 calling a macro, and you don't have variables that point to different nodes | |
302 of your network so that you can define where a subgraph starts or not. But | |
303 if a graph returns such a model, you can introspect what annotations you have. | |
304 There should also be standard conventions, but you could also in the | |
305 interactive shell look at : | |
306 | |
307 model.annotations(depth = 2) | |
308 | |
309 This would print something like : | |

'DBN'
    'rbm1'
        'hidden_layer1'
        'CDk_layer1'
    'rbm2'
        'hidden_layer2'
        'CDk_layer2'
    'logreg'
        'cross_entropy'

And then you can say

daa1 = daa(..)
daa2 = daa(..)
new_model = model.replace('rbm1', daa1, new_name = 'daa1')
new_model = new_model.replace('rbm2', daa2, new_name = 'daa2')

and you get a SDAA.
What is the hierarchical structure? Well, in my view if some subgraph
(annotated as S1) is part of another subgraph (annotated as S2), then S1 is
a child of S2 in this hierarchy of annotations. If they share just a few
nodes, but each has nodes that are not shared, then they are on the same
level. We might want a flat space for the annotations, but I think this
simple convention can get us a lot.

So macros should in general return such models. It is up to you whether you
want to ground the graph that you create in your script into a model or not.
You do so by manually adding nodes to the model. The annotations are also
done manually .. So this might be a bit annoying for the developer of a
macro, but I don't think it is cognitively complicated, and it would help a
lot when using the macros.

You can see how this annotation system quickly becomes interesting. You can
also annotate parameters (and it is not too overwhelming to do so when you
create the graph) and you can use this to sort of collect all parameters
that you annotated in some way and then do something to them.
348 | |
349 The way I see it is just that a transform could have an optional annotations | |
350 argument and it will add that string to all parameters and hyper-parameters. | |
351 How much sense this makes is debatable, but I strongly believe that is not | |
352 complicated to implement ( I actually have something like this already | |
353 implemented, just that I use that single ton class, and I sort of made the | |
354 framework work mostly for DAA by making a few poor choices). | |
261 | 355 |
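
To back up the claim that this is easy to implement, here is a minimal
sketch of such a model class; the method names (add_node, annotate, find,
replace) are assumptions, not a settled API:

    class Model(object):
        """Stores the graph plus named annotations over its subgraphs."""

        def __init__(self):
            self.inputs = []
            self.outputs = []
            self.annotations = {}        # name -> (nodes, parent annotation)

        def add_node(self, node, input=False, output=False):
            if input:
                self.inputs.append(node)
            if output:
                self.outputs.append(node)

        def annotate(self, name, nodes, parent=None):
            self.annotations[name] = (list(nodes), parent)

        def find(self, name):
            return self.annotations[name][0]

        def replace(self, name, new_subgraph, new_name=None):
            # sketch only: copy the graph, swap the subgraph annotated `name`
            # for new_subgraph, and return the result as a new Model
            raise NotImplementedError
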

Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of
that class. I think there should be such a wrapper around parameters and
hyper-parameters as well. By default those wrappers might not provide any
information, but you can potentially add interesting information for
"graph"-aware transforms. For example you can add annotations that a find
or replace function can use to collect all parameters or hyper-parameters,
so that you can do some common thing to all of them (when it makes sense).

You could have a freeze property for parameters. If you change that
property, the theano functions (where needed) for all nodes that follow
this one are recomputed. This property would be used by the
parameter-collecting function used to compute the gradient: if parameters
are frozen they are ignored, otherwise they are updated.
280 I would have a special class of hyper-params which don't require | 374 |
281 recompilation of the graph. Learning rate is an example. This info is also | 375 For hyper-parameters you would also have a different wrapper that would |
282 given by the wrapper and by how the parameter is used. | 376 contain, possibly, the distribution of that hyper-parameters for a |
283 | 377 hyper-learner. |

I would also have the learning rate or noise amount as a somewhat strange
kind of hyper-parameter. I would say that by default, if any hyper-parameter
changes its value, then the theano expressions need to be recompiled; if you
are dealing with these strange types of hyper-parameters you don't need to
do that. This can be done for you automatically, and I guess it all boils
down to: is your hyper-parameter a theano shared variable or theano tensor?
If so, we are dealing with the second type. So this kind of thing can be
detected automatically.

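
A minimal sketch of those two wrappers, under the assumptions above (freeze
marks the parameter for the gradient-collecting function, and a
hyper-parameter backed by a theano shared variable does not force
recompilation):

    import theano

    class Param(object):
        def __init__(self, value, name=None, annotations=()):
            self.var = theano.shared(value, name=name)
            self.annotations = list(annotations)
            self.frozen = False

        def freeze(self, flag=True):
            # frozen params are skipped when collecting params for T.grad;
            # flipping this would also trigger recompilation downstream
            self.frozen = flag

    class HyperParam(object):
        def __init__(self, value, name=None, distribution=None, shared=False):
            # shared=True: the "strange" kind (e.g. learning rate) stored as
            # a theano shared variable, so changing it needs no recompilation
            self.value = theano.shared(value, name=name) if shared else value
            self.distribution = distribution  # for a hyper-learner to sample
            self.requires_recompile = not shared
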
How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes (or a param freezes), all regions of the graph affected get
recompiled. This is done by traversing the graph from the bottom node and
re-constructing the theano expression; where needed, this theano expression
gets compiled.

This function that updates / re-constructs the graph is slightly more
complex if you have non-theano functions in the middle of the graph .. but
not too much so, in my view.
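
A rough sketch of that re-construction pass, assuming every node stores its
inputs and a transform with a build_expr method (all names hypothetical):

    def rebuild(node, memo=None):
        # walk from the bottom node towards the data, re-constructing the
        # theano expression; already rebuilt nodes are reused via memo
        memo = {} if memo is None else memo
        if node not in memo:
            inputs = [rebuild(i, memo) for i in node.inputs]
            memo[node] = node.transform.build_expr(inputs)
        return memo[node]
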

replace & find
--------------

Replace replaces a part of the graph. The way it works, in my view, is that
if I write:

x = x1 + x2 + x3
y = x.replace({x2: x5})

you would first copy the graph that is represented by x (the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, and y will point to x1+x5+x3. Replace is not done
in place!

I think of these Node classes as something light-weight, like theano
variables, and creating copies is not harmful. Also, params & shared
variables are shared between these graphs. If you want new params / shared
variables we can offer a copy / deepcopy command.

Replace (given that it starts from a model) can take string(s) that indicate
specific annotations.

Find does the same (without the copying).
423 | |
424 | |
425 | |
426 If you have two things named the same in the graph you would return the first | |
427 one in a breadth search starting from the top node. The idea is that if you | |
428 have all the weight matrices annotated as 'W' and you look for 'W' starting | |
429 from node hiddens2, you want the W of the second layer, and not of the first. | |
430 | |
431 I wold support : | |
432 model.replace( look_at , search_for , replace_with, annotate_as) | |
433 replace(model , look_at , search_for , replace_with, annotate_as) | |
434 node.replace(model , look_at, replace_with, annotate_as) | |
435 | |
436 look_at if it is a node it reffers to the subgraph that has as a final | |
437 node that node. I.e. all up to that point. If it is a string, you would look | |
438 at the subgraph annotated by that string. | |
439 | |
440 Of course we can optionally choose not to allow things to be annotate with | |
441 the same name, though I sort of liked it. It makes a lot of things easy. For | |
442 a DBN I would have the annotations : | |
443 | |
444 DBN | |
445 rbm1 | |
446 hidden | |
447 CDk | |
448 rbm2 | |
449 hidden | |
450 CDk | |
451 logreg | |
452 | |
If I want to replace the first CDk with PCD I would do

pcd1 = PCD(..)
model.replace(look_at = 'rbm1', search_for = 'CDk', replace_with = pcd1,
              annotate_as = 'PCD1')

Bottom line is:

I think having a graph, and having a way to search in that graph and
replace parts of it, is a very flexible and powerful way of doing things.


reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think though that it would be a shame to have that transformation graph
and not be able to use it for this. It will make life so much easier when
you do deep auto-encoders. I wouldn't put it in the core library, but I
would have it in the DAA module. For reconstruct to work you need to have
inverse transforms for the ones you use.

The way I see it, you can either have something like

# generate your invertible transforms on the fly
fn = create_transform(lambda : , params, hyper_params)
inv = create_transform(lambda : , params, hyper_params)
my_transform = couple_transforms(forward = fn, inv = inv)

and generate special transforms on the fly that have some pseudo-inverse
when you construct the graph. Maybe you can also have specific pre-defined
transforms for the most used cases, with specific names. Even more, I don't
see the harm in something as simple as dotW_b having an inverse defined (as
in using tied weights) in all cases, even though you would only use it for
the DAA. It is just to reduce the number of names of transforms you have;
it is like a feature that doesn't hurt or help in 95% of cases but helps in
the other 5%.

But this is up for debate. The only reason I bring it up is to say that the
class that represents a transform should have an inverse method that by
default throws an exception.

transforms
----------

In my view there will be quite a few such standard transforms. This can be
annoying, but I think that if we group them by architecture (MLP, DAA, RBM),
samplers, optimizers and so on it will be less of a mess. This would be
crucial for their documentation as well. These categories should also come
with macros. There will though be some basic transforms that are available
at the core (like replace, find, things related to annotating and creating
a model, and collecting parameters and hyper-parameters).

I also think that we can start small by having just a few such transforms
and add more as the library grows. We don't need many of these; most are
nice to have ..


Constraints
-----------

You can always add constraints. I think the easiest way to make this
explicit is to get a hand on the parameter or node on which you want to add
the constraint and do something like

add_constraint(on_what, what)

on_what can be a node, a parameter node, a list of nodes, a list of
parameter nodes, or an annotation string (given that you provided a model),
and what is a graph. In terms of the graph that you are creating, what this
does is create a dependency link from your main graph to that constraint
graph. This means that the grad function that computes the gradients with
respect to the parameters will also (if there are such dependency links)
add the gradient of those parameters with respect to the output of that
dependency graph. There are some constraints on what a dependency graph can
be, in the sense that it should start from only one input (the parameter /
node) and it should end in only one node that is a scalar.

From an implementation point of view, this can be done by just collecting a
list of constraint costs that will be added to the cost before calling
T.grad. But I like to think about it in terms of graphs linked through
dependency links.

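
As a sketch of that simpler implementation view, collecting constraint
costs and folding them into the cost before T.grad (names are hypothetical):

    import theano.tensor as T

    constraint_costs = []              # filled in by add_constraint

    def add_constraint(on_what, what):
        # `what` is assumed to be a scalar theano expression depending only
        # on `on_what`; we just remember it as an extra cost term
        constraint_costs.append(what)

    def grad(cost, params):
        total_cost = cost + sum(constraint_costs)
        return T.grad(total_cost, params)
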


Some general comments
---------------------

I think that what you get in the end is a very flexible framework, where
adding new things is just a matter of putting together a few transforms and
annotating the entire thing. In the worst case scenario you would need to
invent a transform, which I do believe could be quite painless.

The harder part to implement is the backbone. It is not difficult in my
view, mostly slightly tedious. I had something like this implemented in a
matter of a week, though it was a bit less restrictive. I do believe though
that we should not oversimplify the backbone of the library just to make it
easy to implement; we should rather carefully consider what we get in the
end.


Connection to the architecture committee
----------------------------------------

I think that if you get such iterator objects, which can either produce the
error or do an update step, it is easy to wrap them in a plug-in, or use
them with the imperative language James proposed.

I actually have ideas (using non-theano nodes) for how to break the
algorithm at points such that you can have different parts run on remote
machines .. though we might not want to support that (using the plug-in
system .. though it might work with other systems that support the same
idea).

I think it goes more naturally with the imperative language that James
proposed, because that would create a graph as well. His graph is in
general simpler (it always has only one termination node) and the nodes
have a different interpretation (?), so I would use a different node class
for those. But when writing the code, using some syntactic sugar, the
difference can be blurred (do we want this?). I think that one can come up
with ways of making the approaches look alike and slightly homogeneous.