# HG changeset patch
# User Razvan Pascanu
# Date 1285199004 14400
# Node ID 515033d4d3bf77d30d8788d961dc5dbe9a93afee
# Parent  86d802226a97a6b5766aef1fa959584bd6776698
a first draft of layer committee

diff -r 86d802226a97 -r 515033d4d3bf doc/v2_planning/layer_RP.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/layer_RP.txt	Wed Sep 22 19:43:24 2010 -0400

===============
Layer committee
===============

Members : RP, XG, AB, DWF

Proposal (RP)
=============

  You construct your neural network by constructing a graph of connections
  between layers, starting from the data. While you construct the graph,
  different Theano formulas are put together to construct your model.

  Hard details are not set yet, but all members of the committee agreed
  that this sounds like a good idea.


Example Code (RP):
------------------

  # Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

  h1   = sigmoid(dotW_b(train_x, n = 300))
  rbm1 = CDk(h1, train_x, k = 5, sampler = binomial, cost = pseudolikelihood)

  h2   = sigmoid(dotW_b(h1, n = 300))
  rbm2 = CDk(h2, h1, k = 5, sampler = binomial, cost = pseudolikelihood)

  out = sigmoid(dotW_b(h2, n = 10))

  train_err = cross_entropy(out, train_y)

  grads   = grad(train_err, train_err.parameters())
  learner = SGD(train_err, grads)

  valid_err = train_err.replace({train_x : valid_x, train_y : valid_y})
  test_err  = train_err.replace({train_x : test_x,  train_y : test_y})



Global observations :
---------------------

  1) Your graph can have multiple terminations; in this case rbm1, rbm2, learner,
     valid_err and test_err are all end nodes of the graph.

  2) Any node is an "iterator": when you call out.next() you get the next prediction,
     and when you call err.next() you get the next error (on the batch given by the data).
  3) Replace can replace any subgraph.

  4) You can have MACROS or SUBROUTINES that already give you the graph for known
     components (in my view CDk is such a macro, but simpler examples would be vanilla
     versions of MLP, DAA, DBN, LOGREG).

  5) Any node has a pointer to the graph (though arguably you don't use that graph
     that much). Running such a node will in general be done by compiling the Theano
     expression up to that node, and using the data object that you got initially.
     This Theano function is compiled lazily, in the sense that it is compiled when
     you try to iterate through the node. You use the graph only to :
       * update the Theano expression in case some part of the subgraph has been changed
       * collect the list of parameters of the model
       * collect the list of hyper-parameters (my personal view - this would mostly be
         useful for a hyper-learner and not on a day to day basis, but I think it is
         something easy to provide and we should)
       * collect constraints on parameters (I believe they can be inserted in the
         graph - things like L1 and so on)

  6) Registering parameters and hyper-parameters with the graph is the job of the
     transform, and therefore of the user who implemented that transform; the same
     goes for initializing the parameters (so if we have different ways to initialize
     the weight matrix, that should be a hyper-parameter with a default value).



Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues :

Delayed or future values
------------------------

Sometimes you might want future values of some nodes. For example you might be
interested in :

y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version of a node
x is obtained by calling x.t(k), which will give you a node that has the value x(t+k).
k can be positive or negative.
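As a sketch of the intended behaviour (the .t(k) accessor and the zero-padding convention are assumptions of this proposal, mocked here on plain Python lists):

```python
# Sketch: a "delayed" view of a node's data stream. x.t(k) yields the
# sequence shifted by k, padded with 0 where no value exists.

class Seq:
    def __init__(self, data):
        self.data = list(data)

    def t(self, k):
        # value at time t of the delayed node is x(t+k); pad with 0
        n = len(self.data)
        return Seq([self.data[t + k] if 0 <= t + k < n else 0
                    for t in range(n)])

x = Seq([1, 4, 9, 16])
x_tm1 = x.t(-1)                                   # x(t-1), zero-padded at the start
y = [a - b for a, b in zip(x.data, x_tm1.data)]   # y(t) = x(t) - x(t-1)
print(y)  # [1, 3, 5, 7]
```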
In my view this can be done as follows :
  - a node is a class that points to :
      * a data object that feeds data
      * a theano expression up to that point
      * the entire graph that describes the model (not the Theano graph !!!)
The only thing you need to do is to change the data object to reflect the
delay (we might need to be able to pad it with 0s?). You also need to create
a copy of the theano expression (those are "new nodes"), in the sense that
the starting theano tensors are different, since they point to different data.



Non-theano transformations (or functions or whatever)
-----------------------------------------------------

Maybe you want to do something in the middle of your graph that is not
supported by Theano. Say you have a function f which you cannot write in
Theano. You want to do something like


 W1*f( W2*data + b)

I think we can support that by doing the following :
each node has :
  * a data object that feeds data
  * a theano expression up to that point
  * the entire graph that describes the model

Let x1 = W2*data + b.
Up to here everything is fine (we have a theano expression) :
  dot(W2, tensor) + b,
where tensor is provided by the data object (plus a dict of givens
and whatever else you need to compile the function).

When you apply f, you create a node that is exactly like the
data object, in the sense that it provides a new tensor and a new dict of
givens.

So x2 = W1*f( W2*data+b)
will actually point to the expression
  dot(W1, tensor)
and to the data node f(W2*data+b).

What this means is that you basically compile two theano functions t1 and t2
and evaluate t2(f(t1(data))). So every time you have a non-theano operation you
break the theano expression and start a new one.
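A minimal sketch of this split, with plain Python functions standing in for the two separately compiled Theano functions (t1, t2, f and the weight values are all hypothetical placeholders):

```python
# Sketch: a non-Theano function f sandwiched between two "compiled" stages.
# t1 and t2 stand in for the two separately compiled Theano functions;
# here they are plain Python over lists of floats.

W2, b = 2.0, 1.0   # first affine stage (stand-in for dot(W2, x) + b)
W1 = 3.0           # second stage

def t1(data):
    # first compiled chunk: W2*data + b
    return [W2 * x + b for x in data]

def f(xs):
    # arbitrary non-Theano operation, e.g. a lookup or an external call
    return [abs(x) for x in xs]

def t2(xs):
    # second compiled chunk: W1 * (...)
    return [W1 * x for x in xs]

def evaluate(data):
    # the graph is evaluated as t2(f(t1(data))): the Theano expression
    # is broken at f and a new one is started after it
    return t2(f(t1(data)))

print(evaluate([1.0, -2.0]))  # [9.0, 9.0]
```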
What you lose :
  - there is no optimization or anything between t1, t2 and f (we don't
    support that)
  - if you are running things on the GPU, after t1 the data will be copied to
    the CPU and then probably back to the GPU - so it doesn't make sense anymore



Recurrent Things
----------------

I think that you can write a recurrent operation by first defining a
graph (the recurrent relation) :

y_tm1 = recurrent_layer(init = zeros(50))
x_t   = slice(x, t=0)
y     = loop( dotW_b(y_tm1,50) + x_t, steps = 20)

This would basically give you all the information you need to add a scan op
to the theano expression of the resulting op; it is just a different way
of writing things, which I think is more intuitive.

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node (a time slice, that is).
Then you call loop, giving an expression that starts from those primitives.

Similarly you can have foldl or map or anything else.

Optimizer
---------

  Personally I would respect the findings of the optimization committee,
  and have SGD require a Node that produces some error (which can
  be omitted) and the gradients. For this I would also have the grad
  function, which would actually only call T.grad.

  What if you have a non-theano thing in the middle? I don't have any smart
  solution besides ignoring any parameter that is below the first
  non-theano node and throwing a warning.

Learner
-------

  In my case I would not have separate predict() and eval() methods on the
  learner, but just an eval(). If you want the predictions you should use the
  corresponding node (before applying the error measure). This was
  for example **out** in my first example.

  Of course we could require learners to be special nodes that also have
  a predict output. In that case I'm not sure what the iterator behaviour
  of the node should produce.
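The recurrent_layer/slice/loop primitives from the section above can be mocked in plain Python (all names are hypothetical; a real implementation would emit a Theano scan op instead of a Python loop):

```python
# Sketch of the loop() primitive: unrolling y_t = step(y_tm1, x_t) in plain
# Python. A real implementation would build a theano scan op instead.

def loop(step, init, xs, steps):
    # step: the recurrent relation, y_t = step(y_tm1, x_t)
    # init: initial value of the recurrent state (recurrent_layer(init=...))
    # xs:   sequence of time slices (slice(x, t=...))
    y = init
    for t in range(steps):
        y = step(y, xs[t])
    return y

# y_t = 0.5 * y_tm1 + x_t, starting from 0
result = loop(lambda y_tm1, x_t: 0.5 * y_tm1 + x_t,
              init=0.0, xs=[1.0, 1.0, 1.0], steps=3)
print(result)  # 0.5*(0.5*(0.5*0 + 1) + 1) + 1 = 1.75
```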
Granularity
-----------

Guillaume nicely pointed out that this library might be overkill,
in the sense that you have a dotW_b transform, and then you will need
a dotW_b_sparse transform and so on. Plus, each way of initializing a param
would result in many more transforms.

I don't have a perfect answer yet, but my argument goes like this :

You would have transforms for the most popular options (dotW_b, for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. Beyond decorating, you
can have a general apply transform that does something like :

apply( lambda x,y,z: x*y+z,
       inputs = x,
       hyperparams = [(name,2)],
       params = [(name,theano.shared(..))])

The order of the arguments in the lambda is nodes, params, hyper-params, or so.
This would apply the theano expression, but it would also register the
parameters. I think you can arrange it such that the result of the apply is
picklable, but not the apply itself; meaning that in the graph, the op doesn't
actually store the lambda expression but a mini theano graph.

Also names might be optional, so you can write hyperparams = [2,].


What this way of doing things would hopefully buy you is that you do not
need to worry about most of your model (it would be just a few macros or
subroutines).
You would do something like :

rbm1,hidden1 = rbm_layer(data,20)
rbm2,hidden2 = rbm_layer(hidden1,20)

and then the part you care about :

hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2,
                 params = theano.shared(scipy.sparse_CSR(..)))

and after that you potentially still do what you did before :

err   = cross_entropy(hidden3, target)
grads = grad(err, err.parameters())
...

I do agree that some of the "transforms" that I have been writing here
and there are pretty low level, and maybe we don't need them. We might need
only somewhat higher level transforms.
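A plain-Python mock of such a general apply transform (the Node class and the registration scheme are hypothetical sketches, not a settled API):

```python
# Sketch: a general apply() that wraps an expression and registers its
# params / hyper-params on the resulting node.

class Node:
    def __init__(self, fn, inputs, params, hyperparams):
        self.fn = fn
        self.inputs = inputs            # upstream nodes or raw values
        self.params = params            # [(name, value)] pairs
        self.hyperparams = hyperparams  # [(name, value)] pairs

    def eval(self):
        # argument order: inputs, then params, then hyper-params
        args = [i.eval() if isinstance(i, Node) else i for i in self.inputs]
        args += [v for _, v in self.params]
        args += [v for _, v in self.hyperparams]
        return self.fn(*args)

def apply(fn, inputs, params=(), hyperparams=()):
    return Node(fn, list(inputs), list(params), list(hyperparams))

x = apply(lambda a: a, inputs=[3.0])
y = apply(lambda x, W, h: x * W + h,
          inputs=[x], params=[('W', 2.0)], hyperparams=[('h', 1.0)])
print(y.eval())   # 3.0*2.0 + 1.0 = 7.0
print(y.params)   # [('W', 2.0)]
```

The point of the sketch is that the parameters travel with the node, so a generic routine can later walk the graph and collect them.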
My hope is that for now people think
of the approach and not of all the inner details (like what transforms we need,
and so on) and see if they are comfortable with it or not.

Do we want to think in these terms? I think it is a bit better to have your
script like that than hacking into the DBN class to change that W to be
sparse.

Anyhow Guillaume, I'm working on a better answer :)


Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of that
class. I think there should be such a wrapper around parameters and
hyper-parameters as well. By default those wrappers might not provide
any information. Later on they can provide, for hyper-params for example, a
distribution. If, when inserting your hyper-param in the graph (i.e. when
you call a given transform), you provide the distribution, then maybe a
hyper-learner could use it to sample from it.

For parameters you might define properties like freeze. It can be true or
false. Whenever it is set to true, the param is not adapted by the optimizer.
Changing this value, like changing most hyper-params, implies recompilation
of the graph.

I would have a special class of hyper-params which don't require
recompilation of the graph. The learning rate is an example. This info is also
given by the wrapper and by how the parameter is used.

It is up to the user and the "transform" implementer to wrap params and
hyper-params correspondingly. But I don't think this is too complicated.
The apply function above has a default behaviour; maybe you would have
a fourth type of argument, which is a hyper-param that doesn't require
compilation. We could find a nice name for it.


How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes (or a param freezes) all affected regions of the graph get recompiled.
This is done by traversing the graph from the bottom node and constructing the
theano expression.

This function that updates / re-constructs the graph is slightly more complex
if you have non-theano functions in the graph.

replace
-------

Replace replaces a part of the graph. The way it works, in my view, is that
if I write :

x = x1+x2+x3
y = x.replace({x2:x5})

you would first copy the graph that is represented by x (the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, and y will point to x1+x5+x3. Replace is not done
in place.

I think of these Node classes as something lightweight, like theano variables.


reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think, though, that it is a shame having that transformation graph and not
being able to use it to do this. It will make life so much easier when you
do deep auto-encoders. I wouldn't put it in the core library, but I would
have it in the DAA module. The way I see it you can either have something like

# generate your invertible transforms on the fly
fn  = create_transform(lambda : , params, hyper-params )
inv = create_transform(lambda : , params, hyper-params )
my_transform = couple_transforms( forward = fn, inv = inv)

# or have some already widely used such transforms in the daa submodule.


transforms
----------

In my view there will be quite a few such standard transforms. They
can be grouped by architecture : basic, sampler, optimizer and so on.

We do not need to provide all of them, just the ones we need. Researching
an architecture would then actually lead to creating new such transforms in
the library.

There will definitely be a list of basic transforms in the beginning,
like :
  replace,
  search,
  get_param(name)
  get_params(..)
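A minimal mock of non-destructive replace on such lightweight nodes (the Node class and its 'add'/'leaf' ops are hypothetical placeholders):

```python
# Sketch: replace() copies the graph and swaps subgraphs; the original
# node is left untouched.

class Node:
    def __init__(self, op, children=(), value=None):
        self.op = op                  # 'add' or 'leaf'
        self.children = list(children)
        self.value = value

    def eval(self):
        if self.op == 'leaf':
            return self.value
        return sum(c.eval() for c in self.children)

    def replace(self, mapping):
        # non-destructive: returns a copied graph with subgraphs swapped
        if self in mapping:
            return mapping[self]
        return Node(self.op,
                    [c.replace(mapping) for c in self.children],
                    self.value)

x1, x2, x3, x5 = (Node('leaf', value=v) for v in (1, 2, 3, 5))
x = Node('add', [x1, x2, x3])    # x = x1 + x2 + x3
y = x.replace({x2: x5})          # y = x1 + x5 + x3

print(x.eval())  # 6  (x still points to x1+x2+x3)
print(y.eval())  # 9  (replace was not done in place)
```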
You can have, and should have, something like a switch (that, based on a
hyper-parameter, replaces a part of a graph with another, or not). This is
done by re-compiling the graph.


Constraints
-----------

Nodes can also keep track of constraints.

When you write

y = add_constraint(x, sum(x**2))

y is the same node as x, just that it also links to this second graph that
computes the constraints. Whenever you call grad, grad will also add to the
cost all the constraints attached to the graph.
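A plain-Python sketch of how add_constraint and grad could interact (a scalar toy example; add_constraint, grad and the Node class are hypothetical, with numerical differentiation standing in for T.grad):

```python
# Sketch: constraints are attached to a node, and grad() adds their
# gradients to the gradient of the cost.

class Node:
    def __init__(self, fn):
        self.fn = fn              # scalar cost as a function of the param w
        self.constraints = []     # attached penalty graphs

def add_constraint(x, penalty):
    # returns "the same node", now linked to the penalty graph
    x.constraints.append(penalty)
    return x

def grad(node, w, eps=1e-6):
    # total cost = cost + sum of attached constraints (e.g. an L2 penalty);
    # central finite difference stands in for T.grad
    def total(w):
        return node.fn(w) + sum(c(w) for c in node.constraints)
    return (total(w + eps) - total(w - eps)) / (2 * eps)

x = Node(lambda w: 3.0 * w)            # cost = 3w,        d/dw = 3
y = add_constraint(x, lambda w: w**2)  # L2-style penalty, d/dw = 2w
g = grad(y, w=2.0)                     # 3 + 2*2 = 7
print(round(g, 3))  # 7.0
```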