view doc/v2_planning/layer_RP.txt @ 1231:5ef96142492b
some typos
author | Razvan Pascanu <r.pascanu@gmail.com> |
---|---|
date | Wed, 22 Sep 2010 20:17:35 -0400 |
parents | 515033d4d3bf |
children | 32fc5f442dde |
===============
Layer committee
===============

Members : RP, XG, AB, DWF

Proposal (RP)
=============

You construct your neural network by constructing a graph of connections
between layers, starting from data. While you construct the graph, different
Theano formulas are put together to construct your model.

Hard details are not set yet, but all members of the committee agreed that
this sounds like a good idea.

Example Code (RP):
------------------

    # Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

    h1   = sigmoid(dotW_b(train_x, n = 300))
    rbm1 = CDk( h1, train_x, k=5, sampler = binomial, cost = pseudolikelihood)

    h2   = sigmoid(dotW_b(h1, n = 300))
    rbm2 = CDk( h2, h1, k=5, sampler = binomial, cost = pseudolikelihood)

    out  = sigmoid( dotW_b(h2, n = 10))

    train_err = cross_entropy( out, train_y)

    grads   = grad( train_err, train_err.parameters() )
    learner = SGD( train_err, grads)

    valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
    test_err  = train_err.replace({ train_x : test_x , train_y : test_y})

Global observations :
---------------------

 1) Your graph can have multiple terminal nodes; in this case rbm1, rbm2,
    learner, valid_err and test_err are all end nodes of the graph.

 2) Any node is an "iterator": when you call out.next() you get the next
    prediction; when you call train_err.next() you get the next error (on
    the batch given by data.next()).

 3) Replace can replace any subgraph.

 4) You can have MACROS or SUBROUTINES that already give you the graph for
    known components (in my view CDk is such a macro, but simpler examples
    would be vanilla versions of MLP, DAA, DBN, LOGREG).

 5) Any node has the entire graph (though arguably you don't use that graph
    too much). Running such a node will in general be done by compiling the
    Theano expression up to that node (if you don't already have this
    function) and using the data object that you get initially. This Theano
    function is compiled only if you need it.
    You use the graph only to :
       * update the Theano expression in case some part of the subgraph has
         changed (a hyper-parameter or a replace call)
       * collect the list of parameters of the model
       * collect the list of hyper-parameters (my personal view: this would
         mostly be useful for a hyper learner and not for day to day stuff,
         but I think it is something easy to provide and we should)
       * collect constraints on parameters (I believe they can be
         represented in the graph as dependency links to other graphs that
         compute the constraints)

 6) Registering parameters and hyper-parameters to the graph is the job of
    the transform, and therefore of the user who implemented that transform;
    the same goes for initializing the parameters (so if we have different
    ways to initialize the weight matrix, that might be a hyper-parameter
    with a default value).

Detailed Proposal (RP)
======================

I will go through a list of scenarios and possible issues :

Delayed or future values
------------------------

Sometimes you might want past or future values of some nodes. For example
you might be interested in :

    y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version
of a node x is obtained by calling x.t(k), which gives you a node that has
the value x(t+k). k can be positive or negative.
In my view this can be done as follows :
   - a node is a class that points to :
        * a data object that feeds data
        * a Theano expression up to that point
        * the entire graph that describes the model (not the Theano graph !!!)

The only thing you need to do is change the data object to reflect the
delay (we might need to be able to pad it with 0?). You also need to create
a copy of the Theano expression (those are "new nodes"), in the sense that
the starting Theano tensors are different since they point to different data.
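
As a rough illustration of the delay semantics only, here is a minimal
plain-Python sketch that works on lists rather than on nodes or Theano
tensors. The helper names delayed and first_difference are hypothetical
and are not part of the proposal; zero padding at the boundaries is an
assumption taken from the question raised above.

```python
def delayed(values, k, pad=0.0):
    """Return a list whose entry t holds values[t + k].

    Positions where t + k falls outside the sequence are filled with `pad`
    (zero padding is an assumption, not something the proposal fixes).
    """
    n = len(values)
    return [values[t + k] if 0 <= t + k < n else pad for t in range(n)]


def first_difference(x):
    """Compute y(t) = x(t) - x(t-1), with x(-1) taken to be the pad value."""
    x_prev = delayed(x, -1)
    return [a - b for a, b in zip(x, x_prev)]


x = [1.0, 4.0, 9.0, 16.0]
y = first_difference(x)
```

In the node-based design, the shifted list would correspond to the modified
data object, and the subtraction to the copied Theano expression.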

Non-theano transformation ( or function or whatever)
----------------------------------------------------

Maybe you want to do something in the middle of your graph that is not
supported by Theano. Let's say you have a function f which you cannot write
in Theano. You want to do something like

    W1*f( W2*data + b)

I think we can support that by doing the following :
each node has :

   * a data object that feeds data
   * a Theano expression up to that point
   * the entire graph that describes the model

Let x1 = W2*data + b.
Up to here everything is fine (we have a Theano expression)

    dot(W2, tensor) + b,

where tensor is provided by the data object (plus a dict of givens and
whatever else you need to compile the function).

When you apply f, what you do is create a node that is exactly like the
data object, in the sense that it provides a new tensor and a new dict of
givens.

So x2 = W1*f( W2*data+b) will actually point to the expression

    dot(W1, tensor)

and to the data node f(W2*data+b).

What this means is that you basically compile two Theano functions t1 and t2
and evaluate t2(f(t1(data))). So every time you have a non-Theano operation
you break the Theano expression and start a new one.

What you lose :
   - there is no optimization or anything between t1, t2 and f (we don't
     support that)
   - if you are running things on GPU, after t1 the data will be copied to
     CPU and then probably again to GPU, so it doesn't make sense anymore

Recurrent Things
----------------

I think that you can write a recurrent operation by first defining a graph
(the recurrent relation) :

    y_tm1 = recurrent_layer(init = zeros(50))
    x_t   = slice(x, t=0)
    y     = loop( dotW_b(y_tm1,50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op to
the Theano expression of the result; it is just a different way of writing
things, which I think is more intuitive.
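
To make the intended computation concrete, here is a minimal plain-Python
sketch of the recurrence that such a loop (or a scan op) would evaluate. It
uses scalars instead of 50-dimensional layers, and the loop signature, w and
b below are hypothetical stand-ins chosen for illustration; they are not the
proposed API, which takes a graph expression rather than a Python callable.

```python
def loop(step, init, inputs, steps):
    """Apply y_t = step(y_{t-1}, x_t) for `steps` iterations.

    Returns the list of all intermediate values y_1 .. y_steps.
    """
    outputs = []
    y_prev = init
    for t in range(steps):
        y_prev = step(y_prev, inputs[t])
        outputs.append(y_prev)
    return outputs


# Hypothetical scalar weight and bias standing in for dotW_b's parameters.
w, b = 0.5, 1.0
xs = [1.0, 2.0, 3.0]

# y_t = w * y_{t-1} + b + x_t, starting from y_0 = 0
ys = loop(lambda y_tm1, x_t: w * y_tm1 + b + x_t, init=0.0, inputs=xs, steps=3)
```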

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node (a time slice, that is).
Then you call loop, giving an expression that starts from those primitives.

Similarly you can have foldl or map or anything else.

You would use this instead of writing scan, especially if the formula is
more complicated and you want to automatically collect parameters,
hyper-parameters and so on.

Optimizer
---------

Personally I would respect the findings of the optimization committee, and
have SGD require a Node that produces some error (which can be omitted) and
the gradients. For this I would also have the grad function, which would
actually only call T.grad.

What if you have a non-Theano thing in the middle? I don't have any smart
solution besides ignoring any parameter that is below the first non-Theano
node and throwing a warning.

Learner
-------

In my case I would not have both a predict() and an eval() method on the
learner, but just an eval(). If you want the predictions you should use the
corresponding node (before applying the error measure). This was for
example **out** in my first example.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterating behaviour
of the node should produce.

Granularity
-----------

Guillaume nicely pointed out that this library might be overkill, in the
sense that you have a dotW_b transform, and then you will need a
dotW_b_sparse transform and so on. Plus, each way of initializing a param
would result in many more transforms.

I don't have a perfect answer yet, but my argument goes like this :

you would have transforms for the most popular options (dotW_b for example).
If you need something else you can always decorate a function that takes
Theano arguments and produces Theano arguments.
More than decorating, you can have a general apply transform that does
something like :

    apply( lambda x,y,z: x*y+z, inputs = x,
                                hyperparams = [(name,2)],
                                params = [(name,theano.shared(..))])

The order of the arguments in the lambda is nodes, params, hyper-params or
so. This would apply the Theano expression, but it will also register the
parameters. It is like creating a transform on the fly.

I think you can arrange things such that the result of the apply is
picklable, but not the apply operation itself. Meaning that in the graph,
the op doesn't actually store the lambda expression but a mini Theano graph.

Also names might be optional, so you can write hyperparam = [2,]

What this way of doing things would hopefully buy you is that you do not
need to worry about most of your model (it would be just a few macros or
subroutines).
You would do something like :

    rbm1,hidden1 = rbm_layer(data,20)
    rbm2,hidden2 = rbm_layer(data,20)

and then the part you care about :

    hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2, params =
              theano.shared(scipy.sparse_CSR(..)))

and after that you potentially still do what you did before :

    err = cross_entropy(hidden3, target)
    grads = grad(err, err.parameters())
    ...

I do agree that some of the "transforms" that I have been writing here
and there are pretty low level, and maybe we don't need them. We might need
only somewhat higher level transforms. My hope is that for now people think
about the approach and not about all the inner details (like what transforms
we need and so on) and see if they are comfortable with it or not.

Do we want to think in these terms? I think it is a bit better to have a
normal Python class, hack it to change something and then either add a
parameter to init or create a new version. It seems a bit more natural.

Anyhow Guillaume I'm working on a better answer :)

Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the Theano expression.
I haven't written down all the details of that class. I think there should
be such a wrapper around parameters and hyper-parameters as well. By default
those wrappers might not provide any information. Later on, they can provide,
for hyper-params for example, a distribution. If, when inserting your
hyper-param in the graph (i.e. when you call a given transform), you provide
the distribution, then maybe a hyperlearner could use it to sample from it.

For parameters you might define properties like freeze. It can be true or
false. Whenever it is set to true, the param is not adapted by the optimizer.
Changing this value, like changing most hyper-params, implies recompilation
of the graph.

I would have a special class of hyper-params which don't require
recompilation of the graph. Learning rate is an example. This info is also
given by the wrapper and by how the parameter is used.

It is up to the user and the "transform" implementer to wrap params and
hyper-params correspondingly. But I don't think this is too complicated.
The apply function above has a default behaviour; maybe you would have
a fourth type of argument, which is a hyper-param that doesn't require
compilation. We could find a nice name for it.

How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes (or a param freezes), all regions of the graph affected get
recompiled. This is done by traversing the graph from the bottom node and
constructing the Theano expression.

This function that updates / re-constructs the graph is slightly more
complex if you have non-Theano functions in the graph.

replace
-------

Replace replaces a part of the graph. The way it works in my view is that
if I write :

    x = x1+x2+x3
    y = x.replace({x2:x5})

you would first copy the graph that is represented by x (the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done
inplace.
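
The copy-then-substitute behaviour can be sketched with a toy expression
tree in plain Python. The Node class below is a hypothetical stand-in for
illustration only (it supports nothing but addition and has no Theano
expression or data object attached); it is not the proposed Node class.

```python
class Node:
    """A tiny expression node: either a named leaf or a sum of child nodes."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = tuple(children)

    def __add__(self, other):
        return Node(children=(self, other))

    def replace(self, mapping):
        """Return a copy of this graph with the nodes in `mapping` swapped.

        The original graph is left untouched (replace is not in place).
        """
        if self in mapping:
            return mapping[self]
        if not self.children:
            return self  # leaves (e.g. params) are shared, not copied
        return Node(children=[c.replace(mapping) for c in self.children])

    def leaves(self):
        """List the leaf names from left to right."""
        if not self.children:
            return [self.name]
        return [name for c in self.children for name in c.leaves()]


x1, x2, x3, x5 = (Node(n) for n in ("x1", "x2", "x3", "x5"))
x = x1 + x2 + x3
y = x.replace({x2: x5})
```

Here x still describes x1+x2+x3 after the call, while y describes x1+x5+x3
and shares the untouched leaves x1 and x3 with x.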
I think of these Node classes as something lightweight, like Theano
variables.

reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think though that it is a shame to have that transformation graph and not
be able to use it to do this. It will make life so much easier when you do
deep auto-encoders. I wouldn't put it in the core library, but I would
have it in the DAA module. The way I see it, you can either have something
like

    # generate your invertible transforms on the fly
    fn  = create_transform(lambda : , params, hyper-params )
    inv = create_transform(lambda : , params, hyper-params )
    my_transform = couple_transforms( forward = fn, inv = inv)

    # or have some already widely used such transforms in the daa submodule.

transforms
----------

In my view there will be quite a few such standard transforms. They
can be grouped by architecture, basic, sampler, optimizer and so on.

We do not need to provide all of them, just the ones we need. Researching
an architecture would actually lead to creating new such transforms in
the library.

There will definitely be a list of basic such transforms in the beginning,
like :
    replace,
    search,
    get_param(name)
    get_params(..)

You can have, and should have, something like a switch (that based on a
hyper-parameter replaces a part of a graph with another or not). This is
done by re-compiling the graph.

Constraints
-----------

Nodes can also keep track of constraints.

When you write

    y = add_constraint(x, sum(x**2))

y is the same node as x, except that it also links to this second graph that
computes the constraint. Whenever you call grad, grad will also add to the
cost all the constraints attached to the graph.
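
A minimal plain-Python sketch of this bookkeeping follows, assuming a
hypothetical CostNode class that stores a cost function and the constraint
functions attached to it. Unlike the proposal, add_constraint here returns
a new object sharing the same cost function, and total_cost only computes
the quantity that grad would then differentiate; no gradient is taken.

```python
class CostNode:
    """Holds a cost function plus any constraint functions attached to it."""

    def __init__(self, fn, constraints=()):
        self.fn = fn
        self.constraints = tuple(constraints)

    def value(self, x):
        """The node's own value, ignoring constraints."""
        return self.fn(x)

    def total_cost(self, x):
        """The cost plus the sum of all attached constraint terms."""
        return self.fn(x) + sum(c(x) for c in self.constraints)


def add_constraint(node, constraint):
    """Return a node with the same value and one more constraint attached."""
    return CostNode(node.fn, node.constraints + (constraint,))


cost = CostNode(lambda x: sum(x))
penalised = add_constraint(cost, lambda x: sum(v ** 2 for v in x))

point = [1.0, 2.0]
```

At point, penalised.value gives the same result as cost.value, while
penalised.total_cost additionally includes the squared-norm term.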