Mercurial > pylearn
changeset 1237:32fc5f442dde
LAYER: sligthly long but somewhat clearer rendering of what I have in mind
author | Razvan Pascanu <r.pascanu@gmail.com> |
---|---|
date | Thu, 23 Sep 2010 11:40:20 -0400 |
parents | 23f63ecf0a9a |
children | 470beb000694 |
files | doc/v2_planning/layer_RP.txt |
diffstat | 1 files changed, 300 insertions(+), 94 deletions(-) |
--- a/doc/v2_planning/layer_RP.txt Thu Sep 23 11:21:48 2010 -0400
+++ b/doc/v2_planning/layer_RP.txt Thu Sep 23 11:40:20 2010 -0400
@@ -8,9 +8,16 @@
 =============

  You construct your neural network by constructing a graph of connections
- between layers starting from data. While you construct the graph,
+ between "layers" starting from data. While you construct the graph,
  different theano formulas are put together to construct your model.

+ The idea would be that you need to describe exactly what you would draw
+ on the board if you are asked to draw the architecture. This would be of
+ course optional ( you will get macros that will return this graph
+ automatically for a well defined case). Things that are not neural networks,
+ and you wouldn't have any structure to draw are just a box. For example a
+ SVM, or PCA. This in case you want to connect their output to your network.
+
  Hard details are not set yet, but all members of the committee agreed
  that this sound as a good idea.

@@ -23,6 +30,7 @@
  h1 = sigmoid(dotW_b(train_x, n = 300))
  rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood)
+ h2 = sigmoid(dotW_b(h1, n = 300))
  rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost= pseudolikelihood)

@@ -31,7 +39,7 @@
  train_err = cross_entropy( out, train_y)

  grads = grad( train_err, err.parameters() )
- learner = SGD( err, grads)
+ learner = SGD( err, err.parameters(), grads)

  valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
  test_err = train_err.replace({ train_x : test_x , train_y : test_y})
@@ -42,17 +50,25 @@
 ---------------------

  1) Your graph can have multiple terminal nodes; in this case rbm1,
- rbm2 and learner, valid_err, test_err are all end nodes of the graph;
+ rbm2 and learner, valid_err, test_err are all end nodes of the graph;

- 2) Any node is an "iterator", when you would call out.next() you would get
+ the next prediction; when you call err.next() you will get next error
  ( on the batch given by the data.next() ).

- 3) Replace can replace any subgraph
+ 3) Replace can replace any subgraph or subgraphs with other
+ subgraphs/subgraph as long as : there are the same number of input units
+ and output units ( there is a 1 to 1 maping from those). I see replacing
+ subgraphs as looping over the list of subgraphs to replace and call replace
+ on which nothing fancier. Since nodes in my view produce the same interface
+ (execpt parameter nodes and hyper-parameter nodes) this constraint is not
+ hard to respect, so is up to the user to do a replace that makes sense.

  4) You can have MACROS or SUBROUTINE that already give you the graph for
  known components ( in my view the CDk is such a macro, but simpler
- examples will be vanilla versions of MLP, DAA, DBN, LOGREG)
+ examples will be vanilla versions of MLP, DAA, DBN, LOGREG). After
+ Guillaume pointed out a real shortcomming of the approach I've modified
+ a bit what you get from a macro .. look below.

  5) Any node has the entire graph ( though arguably you don't use that
  graph too much). Running such a node in general will be done by compiling
@@ -72,8 +88,10 @@
  6) Registering parameters and hyper-parameters to the graph is the job of
  the transform and therefore of the user who implemented that
  transform; the same for initializing the parameters ( so if we have
- different way to initialize the weight matrix that might be a
- hyperparameter with a default value)
+ different ways to initialize the weight matrix that might be a
+ hyperparameter with a default value or different transforms; to ease
+ the number of such transforms you can define a transform on the fly for
+ simple theano expressions )

@@ -85,6 +103,9 @@

 Delayed or feature values
 -------------------------
+
+This is can be dropped if people think is not useful.
+
 Sometimes you might want future values of some nodes.
 For example you might be interested in :
@@ -159,26 +180,32 @@
 y = loop( dotW_b(y_tm1,50) + x_t, steps = 20)

 This would basically give all the information you need to add a scan op
-to your theano expression of the result op, it is just a different way
+to your theano expression of the result node y, it is just a different way
 of writing things .. which I think is more intuitive.

 You create your primitives which are either a recurrent_layer that should
-have a initial value, or a slice of some other node ( a time slice that is)
-Then you call loop giving a expression that starts from those primitives.
+have a initial value, or a slice of some other node ( a time slice that is).
+A tims slice is a special kind of node, which we should try to force people
+not to use outside of a loop. If you use it though you have some default
+behaviour like for example it behaves exactly like a delayed node.
+You call loop giving a expression that starts from those primitives and
+ta da, you have your recurrent expression in the graph.

 Similarly you can have foldl or map or anything else.

-You would use this instead of writing scan especially if the formula is
+You would use this instead of writing scan especially if the formulas are
 more complicated and you want to automatically collect parameters,
-hyper-parameters and so on.
+hyper-parameters and so on. You could also just use the scan op and
+using a general apply command if you like that more.

 Optimizer
 ---------

  Personally I would respect the findings of the optimization committee,
  and have the SGD to require a Node that produces some error ( which can
- be omitted) and the gradients. For this I would also have the grad
- function which would actually only call T.grad.
+ be omitted) and the parameter nodes and nodes that compute gradients for
+ those paramters. For this I would also have the grad function which would
+ actually only call T.grad.

  If you have non-theano thing in the middle?
  I don't have any smart solution besides ignoring any parameter that it
  is below the first
@@ -190,7 +217,10 @@
  In my case I would not have a predict() and eval() method of the learner,
  but just a eval(). If you want the predictions you should use the
  corresponding node ( before applying the error measure ). This was
- for example **out** in my first example.
+ for example **out** in my first example. Note eval() in this case is
+ the same as next(). ( you might just have next for simplicity). The
+ only semantically important difference is that a call to next has now
+ side-effects in the sense that the parameters are updated.

  Of course we could require learners to be special nodes that also have
  a predict output. In that case I'm not sure what the iterating behaiour
@@ -208,8 +238,10 @@
 you would have transforms for the most popular option ( dotW_b) for example.

 If you need something else you can always decorate a function that takes
-theano arguments and produces theano arguments. More then decoratting you
-can have a general apply transform that does something like :
+theano arguments and produces theano arguments. The formulas produced by
+the formula committee might be a rich source of such function to decorate.
+More then decoratting, you can have a general apply transform that does
+something like :

 apply( lambda x,y,z: x*y+z, inputs = x,
 hyperparams = [(name,2)],
@@ -217,47 +249,109 @@
 The order of the arguments in lambda is nodes, params, hyper-params or so.
 This would apply the theano expression but it will also register the
 the parameters. It is like creating a transform on the fly.
+You should, or could provide names for parameters, you might need them
+later.

 I think you can do such that the result of the apply is
-pickable, but not the apply operation. Meaning that in the graph, the op
-doesn't actually store the lambda expression but a mini theano graph.
-
-Also names might be optional, so you can write hyperparam = [2,]
+pickable, but not the general apply transform. What I mean is that
+the output node does not store the lambda expression but some theano
+graph (?) and it know which are the input ( and when you can replace
+them so that you link this little graph to the rest of the
+theano expression. Is just an ugly hack given that you can not save
+lambda expressions, but I'm open to other alternatives ..

 What this way of doing things would buy you hopefully is that you do not
-need to worry about most of your model ( would be just a few macros or
-subrutines).
-you would do something like :
+need to worry about most of your model ( would be just a few macros) that
+will get you to the point you want to change and then you do surgery on
+that point. Compare this with hacking a class, it feels cleaner, because
+you what is up to that point you want to change is sort of separated from
+what you change. Plus you could do this in your script, and you don't need
+to create your local branch of the library where you hack the class, or
+duplicate the class file under a different name ..
+Once what you are doing becomes stable it can be converted in either a
+different macro or a parameter to the initial macro.

-rbm1,hidden1 = rbm_layer(data,20)
-rbm2,hidden2 = rbm_layer(data,20)
+** New part **

-and then the part you care about :
+If this is not convincing enough, there is another point that I want to
+make. While creating the graph you can optionally create a model object.
+I will encourage most people to do that ! This idea I had a long time ago,
+but then I used a singleton class as the world which could potentially create
+a lot of issues. This is a nicer version of that.

-hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2, params =
-theano.shared(scipy.sparse_CSR(..)))
+This model class is optional but it can be extremely useful. What you do in
+this model class is to store the graph, together with different annotations
+on that graph. What I would do is identify different subgraphs in the model
+and register them under different names. For example if err is the node that
+points to the graph that represents a DBN, that graph will be registerd to
+a model in which I have annotated which subgraphs represent the different
+rbms, which represents the logistic regression and so on. The model will also
+have a list of all the input nodes and all the output nodes of the graph.
+We could potentially use this model class to control some global default
+parameters initialization or hyper-parameters. This all might sound like
+magic but is actually easy to implement.

-and after that you pottentially still do what you did before :
+If you have such a model, which is just some annotations on the graph, this
+approach makes it easy to change components of the graph based on their names.
+For example I can replace rbm1 with a daa, because based on these annotations
+I know which part is rbm1.

-err = cross_entropy(hidden3, target)
-grads = grad(err, err.paramters())
-...
+Why do I feel you need such a thing? It is just because you get the DBN by
+calling a macro, and you don't have variables that point to different nodes
+of your network so that you can define where a subgraph starts or not. But
+if a graph returns such a model, you can introspect what annotations you have.
+There should also be standard conventions, but you could also in the
+interactive shell look at :
+
+model.annotations(depth = 2)
+
+This would print something like :

-I do agree that some of the "transforms" that I have been writing here
-and there are pretty low level, and maybe we don't need them. We might need
-only somewhat higher level transforms. My hope is that for now people think
-of the approach and not about all inner details ( like what transforms we
-need and so on) and see if they are comfortable with it or not.
+ 'DBN'
+ 'rbm1'
+ 'hidden_layer1'
+ 'CDk_layer1'
+ 'rbm2'
+ 'hidden_layer2'
+ 'CDk_layer2'
+ 'logreg'
+ 'cross_entropy'
+
+And then you can say

-Do we want to think in this terms? I think is a bit better do have
-a normal python class, hacking it to change something and then either add
-a parameter to init or create a new version. It seems a bit more natural.
+daa1 = daa(..)
+daa2 = daa(..)
+new_model = model.replace('rbm1', daa1, new_name = 'daa1')
+new_model = new_model.replace('rbm2', daa2, new_name = 'daa2')
+
+and you get a SDAA.
+What is the hierarhical structure ? Well, in my view if some subgrah
+(annotated as S1) is part of another subgraph (annotated as S2) then
+S1 is a child of S2 in this hierarchy of annotations. If they share
+just a few nodes, but have nodes that are not shared, then they are on
+the same level. We might one a flat space for the annotations, but I think
+this simple convention can get as a lot.

-
+So macros should in general return such models. It is up to you if you want to
+ground the graph that you create in your script into a model or not. You do
+so by manually adding nodes to the model. The annotations are also manually
+done .. So this might be a bit annoying for a developer of a macro, but I
+don't think is cognitively complicated, and it would help a lot when using
+the macros.

-Anyhow Guillaume I'm working on a better answer :)
+You can see how this annotation system becomes easily interesting. You can
+also annotate parameters ( and it is not too overwhelming to do so when
+you create the graph as well) and you can use this to sort of collect all
+parameters that you annotated in some way and then do something to them.
+
+The way I see it is just that a transform could have an optional annotations
+argument and it will add that string to all parameters and hyper-parameters.
+How much sense this makes is debatable, but I strongly believe that is not
+complicated to implement ( I actually have something like this already
+implemented, just that I use that single ton class, and I sort of made the
+framework work mostly for DAA by making a few poor choices).


 Params and hyperparams
@@ -267,40 +361,44 @@
 around the theano expression. I haven't wrote down all the details of that
 class. I think there should be such a wrapper around parameters and
 hyper-parameters as well. By default those wrappers might not provide
-any informtion. Later on, they can provide for hyper-params for example a
-distribution. If when inserting your hyper-param in the graph ( i.e. when
-you call a given transform) you provide the distribution then maybe a
-hyperlearner could use it to sample from it.
+any informtion. But you can potentially add interesting information for
+"graph" aware transforms. For example you can add annotations for a find
+or replace function that will collect you all parameters or hyper-parameter
+so you do some common thing to all of them (when it makes sense).

-For parameters you might define properties like freeze. It can be true or
-false. Whenever it is set to true, the param is not adapted by the optimizer.
-Changing this value like changing most of hyper-params implies recompilation
-of the graph.
+You could have a freeze property for parameters. If you change that property
+the theano function (where needed) for all nodes that follow this one is
+recomputed. This argument would be used by the collecting paramters function
+used to compute the gradient. If parameters are frozen they are ignored,
+if not they are updated.

-I would have a special class of hyper-params which don't require
-recompilation of the graph. Learning rate is an example. This info is also
-given by the wrapper and by how the parameter is used.
+For hyper-parameters you would also have a different wrapper that would
+contain, possibly, the distribution of that hyper-parameters for a
+hyper-learner.

-It is up to the user and "transform" implementer to wrap params and
-hyper-params correspondingly. But I don't think this is to complicated.
-The apply function above has a default behaviour, maybe you would have
-a forth type of argument which is hyper-param that doesn't require
-compilation. We could find a nice name for it.
-
+I would also have the learning rate or noise_amounts as some strange
+hyper-paramter. I would say by default, if any hyper-paramter changes its
+value, then the theano expressions need to be recompiled. If you are dealing
+with this strange types of hyper-parameters you don't need to do that.
+This can be automatically for you and I guess it will all boil down to,
+is you hyper-paramter a theano shared variable or theano tensor ? If so we
+are dealing with the second type. So this kind of stuff can be detected
+automatically.

 How does this work?
 -------------------

 You always have a pointer to the entire graph. Whenever a hyper-param
 changes ( or a param freezes) all region of the graph affected get recompiled.
-This is by traversing the graph from the bottom node and constructing the
-theano expression.
+This is by traversing the graph from the bottom node and re-constructing the
+theano expression. Where needed this theano expression get compiled.

 This function that updates / re-constructs the graph is sligthly more complex
-if you have non-theano functions in the graph ..
+if you have non-theano functions in the middle of the graph .. but not too
+much in my view.

-replace
--------
+replace & find
+--------------

 Replace, replaces a part of the graph. The way it works in my view is that
 if I write :
@@ -311,61 +409,169 @@
 You would first copy the graph that is represented by x ( the params or
 hyper-params are not copied) and then replace the subgraphs.
 I.e., x will still point to x1+x2+x3, y will point to x1+x5+x3.
 Replace is not done
-inplace.
+inplace !
+
+I think these Node classes as something light-weighted, like theano variables
+and creating copy is not harmful. Also params & shared variables are shared
+between these graphs. If you want new params / shared variables we can offer
+a copy / deepcopy command.
+
+Replace (given that it starts from a model) can take string(s) that indicate
+specific annotations.
+
+Find does the same ( without the copying).
+
+
+
+If you have two things named the same in the graph you would return the first
+one in a breadth search starting from the top node. The idea is that if you
+have all the weight matrices annotated as 'W' and you look for 'W' starting
+from node hiddens2, you want the W of the second layer, and not of the first.
+
+I wold support :
+model.replace( look_at , search_for , replace_with, annotate_as)
+replace(model , look_at , search_for , replace_with, annotate_as)
+node.replace(model , look_at, replace_with, annotate_as)

-I think these Node classes as something light-weighted, like theano variables.
+look_at if it is a node it reffers to the subgraph that has as a final
+node that node. I.e. all up to that point. If it is a string, you would look
+at the subgraph annotated by that string.
+
+Of course we can optionally choose not to allow things to be annotate with
+the same name, though I sort of liked it. It makes a lot of things easy. For
+a DBN I would have the annotations :
+
+DBN
+ rbm1
+ hidden
+ CDk
+ rbm2
+ hidden
+ CDk
+ logreg
+
+If I want to change the first CDk with PCD I would do
+
+pcd1 = PCD (..)
+model.replace(look_at='rbm1', search_for='CDk', replace_with=pcd1,
+ annotate_as='PCD1')
+
+
+Bottom line is :
+
+ I think having a graph and having a way to search in that graph and replace
+ parts is a very flexible and powerful way of doing things.


 reconstruct
 -----------

-This is something nice for DAA. It is definetely not useful for the rest.
+This is something nice for DAA. It is definetely not useful for the rest.
 I think though that is a shame having that transformation graph and not
 being able to use it to do this. It will make life so much easier when you
 do deep auto-encoders. I wouldn't put it in the core library, but I would
-have in the DAA module. The way I see it you can either have something like
+have in the DAA module. For reconstruct to work you need to have inverse
+transforms for the ones you use.
+
+The way I see it you can either have something like

 # generate your inversable transforms on the fly
 fn = create_transform(lambda : , params, hyper-params )
 inv = create_transform(lambda : , params, hyper-params )
 my_transform = couple_transforms( forward = fn, inv = inv)

-# have some already widely used such transform in the daa submodule.
+and generate special transforms on the fly that have some pseudo-inverses
+when you construct the graph. Maybe you can also have spcific pre-defined
+transforms for the most used cases, whith specific names. Even more I don't
+see the harm of something as simple as dotW_b to have a inverse defined ( as
+using tied weights) in all cases, but you would only use it for the DAA.
+It just to reduce the number of names of transforms you have, is like a
+feature that doesn't hurt or help in 95% of times but it helps in 5% of times.
+
+
+But this is up to debate. The only reason I bring it up is to say that the
+class that represents a transform should have a inverse method that by
+default throws an exception.

 transforms
 ----------

-In my view there will be quite a few of such standard transforms. They
-can be grouped by architecture, basic, sampler, optimizer and so on.
-
-We do not need to provide all of them, just the ones we need. Researching
-on an architecture would actually lead in creating new such transforms in
-the library.
+In my view there will be quite a few of such standard transforms.
+This can be annoying, but I think that if we group them by
+architectures (MLP, DAA, RBM), sampler, optimizers it will be less of a mess.
+This would be crucial for their documentation as well. This categories should
+also come with macros. There will be though some basic transforms that
+are available at the core ( like replace, find, things related to annotating
+and creating a model, collecting parameters and hyper-paramters)

-There will be definetely a list of basic such transforms in the begining,
-like :
- replace,
- search,
- get_param(name)
- get_params(..)
-
-You can have and should have something like a switch ( that based on a
-hyper parameter replaces a part of a graph with another or not). This is
-done by re-compiling the graph.
+I also think that we can start small by having just very few such transforms
+and add them as the library grows. We don't need many of this, most are
+nice to have ..

 Constraints
 -----------

-Nodes also can also keep track of constraints.
+You can always add constraints. I think the easier to make this explicit is to
+get a hand on the parameter or ndoe on which you want to add constraint and
+do something like

-When you write
+add_constraint(on_what, what)

-y = add_constraint(x, sum(x**2))
+on_what can be a node, a parameter node, a list of nodes, a list of parameter
+nodes, an annotation string, given that you provided a model, and what is a
+graph. In terms of the graph that you are creating what this does is to
+create a dependency link from your main graph to that constraint graph.
+This means that the grad function that computes the grad function that
+computes the gradients with respect to parameters will also (if there are
+such dependency links) add the gradient of those parameters with respect
+to the output of that dependency graph. There are some constraints on
+what a dependency graph can be, in the sense that it should start from only
+one input ( the parameters / node) and it should end in only one node that
+is a scalar.

-y is the same node as x, just that it also links to this second graph that
-computes constraints. Whenever you call grad, grad will also sum to the
-cost all attached constraints to the graph.
+From an implementation point of view, this can be done by just collecting a
+list of constraints cost, that will be added to the cost before calling
+T.grad. But I like to think about it in terms of graph linked through
+dependency links.
+
+
+Some general comments
+---------------------
+
+ I think that what you get in the end is a very flexible framework, where
+ adding new things is just a matter of putting together a few transforms and
+ annotating the entire thing. Worst case scenario you would need to invent a
+ transform, which I do believe could be quite painless.
+
+ The harder part to implement is the back-bone. It is not difficult in my
+ view, mostly sligthly tideous. I had something like this implemented in a
+ matter of a week, though it was a bit less restrictive. I do believe though
+ that we should not oversimplify the backbone of the library just to make it
+ easy to implement, but we should rather carefully consider what you get in
+ the end
+
+
+Connection to the architecture committee
+-----------------------------------------
+
+ I think that if you get such iterator objects that can produce either
+ the error, or do an update step it is easy to wrap them in a plug-in,
+ or use it with the imperative language James proposed.
+
+ I actually have ideas ( using non theano nodes) how to break the algo at
+ points such that you can have different parts run on remote machines ..
+ though we might not want to support that ( using the plug-in system ..
+ though it might work with other systems that support the same idea)
+
+ I think it goes more natural with the imperative language that James
+ proposed, because that would create a graph as well. His graph is
+ in general simpler ( it always has only one termination node) where
+ the nodes have a different interpretation (?) so I would use a different
+ node class on those. But from writing the code, using some syntactic sugar
+ the difference can be blurred ( do we want this ?). I think that one
+ can come up with ways of making the approaches look alike and sligtly
+ homogeneous.
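
Editor's sketch (not part of the changeset): the "replace & find" behaviour described in the patch can be illustrated with a few lines of plain Python. This is only a rough sketch under stated assumptions. The Node class, its fields, and the helper _copy_graph are hypothetical names invented here; the find / replace signatures are simplified from the proposed model.replace(look_at, search_for, replace_with, annotate_as), with the look_at argument played by the top node. It shows three points from the text: the search is breadth first from the top node, replace works on a copy and is not in place, and params are shared between the two graphs.

```python
from collections import deque


class Node:
    """Hypothetical light-weight node: a name, input nodes, an annotation, params."""

    def __init__(self, name, inputs=(), annotation=None, params=None):
        self.name = name
        self.inputs = list(inputs)
        self.annotation = annotation
        self.params = params  # shared between graph copies, never copied


def find(top, search_for):
    """Breadth-first search from `top`; return the first node annotated `search_for`."""
    queue, seen = deque([top]), set()
    while queue:
        node = queue.popleft()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.annotation == search_for:
            return node
        queue.extend(node.inputs)
    return None


def _copy_graph(node, memo):
    """Copy the node structure only; the params object is shared with the original."""
    if id(node) in memo:
        return memo[id(node)]
    clone = Node(node.name, [], node.annotation, node.params)
    memo[id(node)] = clone
    clone.inputs = [_copy_graph(i, memo) for i in node.inputs]
    return clone


def replace(top, search_for, replace_with, annotate_as=None):
    """Return a new graph in which the first match is swapped for `replace_with`."""
    memo = {}
    new_top = _copy_graph(top, memo)
    target = find(new_top, search_for)
    if target is None:
        raise KeyError(search_for)
    # The replacement takes over the inputs of the node it replaces.
    new_node = Node(replace_with.name, target.inputs,
                    annotate_as or replace_with.annotation, replace_with.params)
    if target is new_top:
        return new_node
    for node in memo.values():
        node.inputs = [new_node if i is target else i for i in node.inputs]
    return new_top


# Two stacked hidden layers, both annotated 'hidden', feeding a logreg node.
x = Node("train_x")
h1 = Node("hidden1", [x], annotation="hidden", params={"W": [0.0]})
h2 = Node("hidden2", [h1], annotation="hidden", params={"W": [1.0]})
out = Node("logreg", [h2], annotation="logreg")

# Searching from the top finds the layer closest to the top first.
nearest = find(out, "hidden")

# Swap that layer in a copy; `out` itself is left untouched.
new_out = replace(out, "hidden", Node("daa2"), annotate_as="daa")
```

In this sketch find(out, "hidden") returns hidden2 rather than hidden1, which matches the convention in the text that a search starting near the top should return the match of the later layer. Whether a replacement node should inherit the inputs of the node it replaces is a design choice made here for brevity, not something the patch specifies.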