Mercurial > pylearn
view doc/v2_planning/layer_RP.txt @ 1272:ba25c6e4f55d
mcRBM working with whole learning algo in theano
author | James Bergstra <bergstrj@iro.umontreal.ca> |
---|---|
date | Sat, 04 Sep 2010 19:32:27 -0400 |
parents | 32fc5f442dde |
children |
line wrap: on
line source
=============== Layer committee =============== Members : RP, XG, AB, DWF Proposal (RP) ============= You construct your neural network by constructing a graph of connections between "layers" starting from data. While you construct the graph, different theano formulas are put together to construct your model. The idea would be that you need to describe exactly what you would draw on the board if you are asked to draw the architecture. This would be of course optional ( you will get macros that will return this graph automatically for a well defined case). Things that are not neural networks, and you wouldn't have any structure to draw are just a box. For example a SVM, or PCA. This in case you want to connect their output to your network. Hard details are not set yet, but all members of the committee agreed that this sound as a good idea. Example Code (RP): ------------------ # Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y h1 = sigmoid(dotW_b(train_x, n = 300)) rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood) h2 = sigmoid(dotW_b(h1, n = 300)) rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost= pseudolikelihood) out = sigmoid( dotW_b(h2, n= 10)) train_err = cross_entropy( out, train_y) grads = grad( train_err, err.parameters() ) learner = SGD( err, err.parameters(), grads) valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y}) test_err = train_err.replace({ train_x : test_x , train_y : test_y}) Global observations : --------------------- 1) Your graph can have multiple terminal nodes; in this case rbm1, rbm2 and learner, valid_err, test_err are all end nodes of the graph; 2) Any node is an "iterator", when you would call out.next() you would get the next prediction; when you call err.next() you will get next error ( on the batch given by the data.next() ). 3) Replace can replace any subgraph or subgraphs with other subgraphs/subgraph as long as : there are the same number of input units and output units ( there is a 1 to 1 maping from those). I see replacing subgraphs as looping over the list of subgraphs to replace and call replace on which nothing fancier. Since nodes in my view produce the same interface (execpt parameter nodes and hyper-parameter nodes) this constraint is not hard to respect, so is up to the user to do a replace that makes sense. 4) You can have MACROS or SUBROUTINE that already give you the graph for known components ( in my view the CDk is such a macro, but simpler examples will be vanilla versions of MLP, DAA, DBN, LOGREG). After Guillaume pointed out a real shortcomming of the approach I've modified a bit what you get from a macro .. look below. 5) Any node has the entire graph ( though arguably you don't use that graph too much). Running such a node in general will be done by compiling the Theano expression up to that node( if you don't already have this function), and using the data object that you get initially. This theano function is compiled only if you need it. You use the graph only to : * update the Theano expression in case some part of the subgraph has changed (hyper-parameter or a replace call) * collect the list of parameters of the model * collect the list of hyper-parameters ( my personal view - this would mostly be useful for a hyper learner .. and not for day to day stuff, but I think is something easy to provide and we should ) * collect constraints on parameters ( I believe they can be represented in the graph as dependency links to other graphs that compute the constraints..) 6) Registering parameters and hyper-parameters to the graph is the job of the transform and therefore of the user who implemented that transform; the same for initializing the parameters ( so if we have different ways to initialize the weight matrix that might be a hyperparameter with a default value or different transforms; to ease the number of such transforms you can define a transform on the fly for simple theano expressions ) Detailed Proposal (RP) ====================== I would go through a list of scenarios and possible issues : Delayed or feature values ------------------------- This is can be dropped if people think is not useful. Sometimes you might want future values of some nodes. For example you might be interested in : y(t) = x(t) - x(t-1) You can get that by having a "delayed" version of a node. A delayed version a node x is obtained by calling x.t(k) which will give you a node that has the value x(t+k). k can be positive or negative. In my view this can be done as follows : - a node is a class that points to : * a data object that feeds data * a theano expression up to that point * the entire graph that describes the model ( not Theano graph !!!) The only thing you need to do is to change the data object to reflect the delay ( we might need to be able to pad it with 0?). You need also to create a copy of the theano expression ( those are "new nodes" ) in the sense that the starting theano tensors are different since they point to different data. Non-theano transformation ( or function or whatever) ---------------------------------------------------- Maybe you want to do something in the middle of your graph that is not Theano supported. Let say you have a function f which you can not write in Theano. You want to do something like W1*f( W2*data + b) I think we can support that by doing the following : each node has a: * a data object that feeds data * a theano expression up to that point * the entire graph that describes the model Let x1 = W2*data + b up to here everything is fine ( we have a theano expression ) dot(W2, tensor) + b, where tensor is provided by the data object ( plus a dict of givens and whatever else you need to compile the function) When you apply f, what you do you create a node that is exactly like the data object in the sense that it provides a new tensor and a new dict of givens so x2 = W1*f( W2*data+b) will actually point to the expression dot(W1, tensor) and to the data node f(W2*data+b) what this means is that you basically compile two theano functions t1 and t2 and evaluate t2(f(t1(data))). So everytime you have a non theano operation you break the theano expression and start a new one. What you loose : - there is no optimization or anything between t1,t2 and f ( we don't support that) - if you are running things on GPU, after t1, data will be copied on CPU and then probably again on GPU - so it doesn't make sense anymore Recurrent Things ---------------- I think that you can write a recurrent operation by first defining a graph ( the recrrent relation ): y_tm1 = recurrent_layer(init = zeros(50)) x_t = slice(x, t=0) y = loop( dotW_b(y_tm1,50) + x_t, steps = 20) This would basically give all the information you need to add a scan op to your theano expression of the result node y, it is just a different way of writing things .. which I think is more intuitive. You create your primitives which are either a recurrent_layer that should have a initial value, or a slice of some other node ( a time slice that is). A tims slice is a special kind of node, which we should try to force people not to use outside of a loop. If you use it though you have some default behaviour like for example it behaves exactly like a delayed node. You call loop giving a expression that starts from those primitives and ta da, you have your recurrent expression in the graph. Similarly you can have foldl or map or anything else. You would use this instead of writing scan especially if the formulas are more complicated and you want to automatically collect parameters, hyper-parameters and so on. You could also just use the scan op and using a general apply command if you like that more. Optimizer --------- Personally I would respect the findings of the optimization committee, and have the SGD to require a Node that produces some error ( which can be omitted) and the parameter nodes and nodes that compute gradients for those paramters. For this I would also have the grad function which would actually only call T.grad. If you have non-theano thing in the middle? I don't have any smart solution besides ignoring any parameter that it is below the first non-theano node and throw a warning. Learner ------- In my case I would not have a predict() and eval() method of the learner, but just a eval(). If you want the predictions you should use the corresponding node ( before applying the error measure ). This was for example **out** in my first example. Note eval() in this case is the same as next(). ( you might just have next for simplicity). The only semantically important difference is that a call to next has now side-effects in the sense that the parameters are updated. Of course we could require learners to be special nodes that also have a predict output. In that case I'm not sure what the iterating behaiour of the node should produce. Granularity ----------- Guillaume nicely pointed out that this library might be an overkill. In the sense that you have a dotW_b transform, and then you will need a dotW_b_sparse transform and so on. Plus way of initializing each param would result in many more transforms. I don't have a perfect answer yet, but my argument will go as this : you would have transforms for the most popular option ( dotW_b) for example. If you need something else you can always decorate a function that takes theano arguments and produces theano arguments. The formulas produced by the formula committee might be a rich source of such function to decorate. More then decoratting, you can have a general apply transform that does something like : apply( lambda x,y,z: x*y+z, inputs = x, hyperparams = [(name,2)], params = [(name,theano.shared(..)]) The order of the arguments in lambda is nodes, params, hyper-params or so. This would apply the theano expression but it will also register the the parameters. It is like creating a transform on the fly. You should, or could provide names for parameters, you might need them later. I think you can do such that the result of the apply is pickable, but not the general apply transform. What I mean is that the output node does not store the lambda expression but some theano graph (?) and it know which are the input ( and when you can replace them so that you link this little graph to the rest of the theano expression. Is just an ugly hack given that you can not save lambda expressions, but I'm open to other alternatives .. What this way of doing things would buy you hopefully is that you do not need to worry about most of your model ( would be just a few macros) that will get you to the point you want to change and then you do surgery on that point. Compare this with hacking a class, it feels cleaner, because you what is up to that point you want to change is sort of separated from what you change. Plus you could do this in your script, and you don't need to create your local branch of the library where you hack the class, or duplicate the class file under a different name .. Once what you are doing becomes stable it can be converted in either a different macro or a parameter to the initial macro. ** New part ** If this is not convincing enough, there is another point that I want to make. While creating the graph you can optionally create a model object. I will encourage most people to do that ! This idea I had a long time ago, but then I used a singleton class as the world which could potentially create a lot of issues. This is a nicer version of that. This model class is optional but it can be extremely useful. What you do in this model class is to store the graph, together with different annotations on that graph. What I would do is identify different subgraphs in the model and register them under different names. For example if err is the node that points to the graph that represents a DBN, that graph will be registerd to a model in which I have annotated which subgraphs represent the different rbms, which represents the logistic regression and so on. The model will also have a list of all the input nodes and all the output nodes of the graph. We could potentially use this model class to control some global default parameters initialization or hyper-parameters. This all might sound like magic but is actually easy to implement. If you have such a model, which is just some annotations on the graph, this approach makes it easy to change components of the graph based on their names. For example I can replace rbm1 with a daa, because based on these annotations I know which part is rbm1. Why do I feel you need such a thing? It is just because you get the DBN by calling a macro, and you don't have variables that point to different nodes of your network so that you can define where a subgraph starts or not. But if a graph returns such a model, you can introspect what annotations you have. There should also be standard conventions, but you could also in the interactive shell look at : model.annotations(depth = 2) This would print something like : 'DBN' 'rbm1' 'hidden_layer1' 'CDk_layer1' 'rbm2' 'hidden_layer2' 'CDk_layer2' 'logreg' 'cross_entropy' And then you can say daa1 = daa(..) daa2 = daa(..) new_model = model.replace('rbm1', daa1, new_name = 'daa1') new_model = new_model.replace('rbm2', daa2, new_name = 'daa2') and you get a SDAA. What is the hierarhical structure ? Well, in my view if some subgrah (annotated as S1) is part of another subgraph (annotated as S2) then S1 is a child of S2 in this hierarchy of annotations. If they share just a few nodes, but have nodes that are not shared, then they are on the same level. We might one a flat space for the annotations, but I think this simple convention can get as a lot. So macros should in general return such models. It is up to you if you want to ground the graph that you create in your script into a model or not. You do so by manually adding nodes to the model. The annotations are also manually done .. So this might be a bit annoying for a developer of a macro, but I don't think is cognitively complicated, and it would help a lot when using the macros. You can see how this annotation system becomes easily interesting. You can also annotate parameters ( and it is not too overwhelming to do so when you create the graph as well) and you can use this to sort of collect all parameters that you annotated in some way and then do something to them. The way I see it is just that a transform could have an optional annotations argument and it will add that string to all parameters and hyper-parameters. How much sense this makes is debatable, but I strongly believe that is not complicated to implement ( I actually have something like this already implemented, just that I use that single ton class, and I sort of made the framework work mostly for DAA by making a few poor choices). Params and hyperparams ---------------------- I think it is obvious from what I wrote above that there is a node wrapper around the theano expression. I haven't wrote down all the details of that class. I think there should be such a wrapper around parameters and hyper-parameters as well. By default those wrappers might not provide any informtion. But you can potentially add interesting information for "graph" aware transforms. For example you can add annotations for a find or replace function that will collect you all parameters or hyper-parameter so you do some common thing to all of them (when it makes sense). You could have a freeze property for parameters. If you change that property the theano function (where needed) for all nodes that follow this one is recomputed. This argument would be used by the collecting paramters function used to compute the gradient. If parameters are frozen they are ignored, if not they are updated. For hyper-parameters you would also have a different wrapper that would contain, possibly, the distribution of that hyper-parameters for a hyper-learner. I would also have the learning rate or noise_amounts as some strange hyper-paramter. I would say by default, if any hyper-paramter changes its value, then the theano expressions need to be recompiled. If you are dealing with this strange types of hyper-parameters you don't need to do that. This can be automatically for you and I guess it will all boil down to, is you hyper-paramter a theano shared variable or theano tensor ? If so we are dealing with the second type. So this kind of stuff can be detected automatically. How does this work? ------------------- You always have a pointer to the entire graph. Whenever a hyper-param changes ( or a param freezes) all region of the graph affected get recompiled. This is by traversing the graph from the bottom node and re-constructing the theano expression. Where needed this theano expression get compiled. This function that updates / re-constructs the graph is sligthly more complex if you have non-theano functions in the middle of the graph .. but not too much in my view. replace & find -------------- Replace, replaces a part of the graph. The way it works in my view is that if I write : x = x1+x2+x3 y = x.replace({x2:x5}) You would first copy the graph that is represented by x ( the params or hyper-params are not copied) and then replace the subgraphs. I.e., x will still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done inplace ! I think these Node classes as something light-weighted, like theano variables and creating copy is not harmful. Also params & shared variables are shared between these graphs. If you want new params / shared variables we can offer a copy / deepcopy command. Replace (given that it starts from a model) can take string(s) that indicate specific annotations. Find does the same ( without the copying). If you have two things named the same in the graph you would return the first one in a breadth search starting from the top node. The idea is that if you have all the weight matrices annotated as 'W' and you look for 'W' starting from node hiddens2, you want the W of the second layer, and not of the first. I wold support : model.replace( look_at , search_for , replace_with, annotate_as) replace(model , look_at , search_for , replace_with, annotate_as) node.replace(model , look_at, replace_with, annotate_as) look_at if it is a node it reffers to the subgraph that has as a final node that node. I.e. all up to that point. If it is a string, you would look at the subgraph annotated by that string. Of course we can optionally choose not to allow things to be annotate with the same name, though I sort of liked it. It makes a lot of things easy. For a DBN I would have the annotations : DBN rbm1 hidden CDk rbm2 hidden CDk logreg If I want to change the first CDk with PCD I would do pcd1 = PCD (..) model.replace(look_at='rbm1', search_for='CDk', replace_with=pcd1, annotate_as='PCD1') Bottom line is : I think having a graph and having a way to search in that graph and replace parts is a very flexible and powerful way of doing things. reconstruct ----------- This is something nice for DAA. It is definetely not useful for the rest. I think though that is a shame having that transformation graph and not being able to use it to do this. It will make life so much easier when you do deep auto-encoders. I wouldn't put it in the core library, but I would have in the DAA module. For reconstruct to work you need to have inverse transforms for the ones you use. The way I see it you can either have something like # generate your inversable transforms on the fly fn = create_transform(lambda : , params, hyper-params ) inv = create_transform(lambda : , params, hyper-params ) my_transform = couple_transforms( forward = fn, inv = inv) and generate special transforms on the fly that have some pseudo-inverses when you construct the graph. Maybe you can also have spcific pre-defined transforms for the most used cases, whith specific names. Even more I don't see the harm of something as simple as dotW_b to have a inverse defined ( as using tied weights) in all cases, but you would only use it for the DAA. It just to reduce the number of names of transforms you have, is like a feature that doesn't hurt or help in 95% of times but it helps in 5% of times. But this is up to debate. The only reason I bring it up is to say that the class that represents a transform should have a inverse method that by default throws an exception. transforms ---------- In my view there will be quite a few of such standard transforms. This can be annoying, but I think that if we group them by architectures (MLP, DAA, RBM), sampler, optimizers it will be less of a mess. This would be crucial for their documentation as well. This categories should also come with macros. There will be though some basic transforms that are available at the core ( like replace, find, things related to annotating and creating a model, collecting parameters and hyper-paramters) I also think that we can start small by having just very few such transforms and add them as the library grows. We don't need many of this, most are nice to have .. Constraints ----------- You can always add constraints. I think the easier to make this explicit is to get a hand on the parameter or ndoe on which you want to add constraint and do something like add_constraint(on_what, what) on_what can be a node, a parameter node, a list of nodes, a list of parameter nodes, an annotation string, given that you provided a model, and what is a graph. In terms of the graph that you are creating what this does is to create a dependency link from your main graph to that constraint graph. This means that the grad function that computes the grad function that computes the gradients with respect to parameters will also (if there are such dependency links) add the gradient of those parameters with respect to the output of that dependency graph. There are some constraints on what a dependency graph can be, in the sense that it should start from only one input ( the parameters / node) and it should end in only one node that is a scalar. From an implementation point of view, this can be done by just collecting a list of constraints cost, that will be added to the cost before calling T.grad. But I like to think about it in terms of graph linked through dependency links. Some general comments --------------------- I think that what you get in the end is a very flexible framework, where adding new things is just a matter of putting together a few transforms and annotating the entire thing. Worst case scenario you would need to invent a transform, which I do believe could be quite painless. The harder part to implement is the back-bone. It is not difficult in my view, mostly sligthly tideous. I had something like this implemented in a matter of a week, though it was a bit less restrictive. I do believe though that we should not oversimplify the backbone of the library just to make it easy to implement, but we should rather carefully consider what you get in the end Connection to the architecture committee ----------------------------------------- I think that if you get such iterator objects that can produce either the error, or do an update step it is easy to wrap them in a plug-in, or use it with the imperative language James proposed. I actually have ideas ( using non theano nodes) how to break the algo at points such that you can have different parts run on remote machines .. though we might not want to support that ( using the plug-in system .. though it might work with other systems that support the same idea) I think it goes more natural with the imperative language that James proposed, because that would create a graph as well. His graph is in general simpler ( it always has only one termination node) where the nodes have a different interpretation (?) so I would use a different node class on those. But from writing the code, using some syntactic sugar the difference can be blurred ( do we want this ?). I think that one can come up with ways of making the approaches look alike and sligtly homogeneous.