doc/v2_planning/layer_RP.txt @ 1229:515033d4d3bf -- "a first draft of layer committee"
author : Razvan Pascanu <r.pascanu@gmail.com>
date   : Wed, 22 Sep 2010 19:43:24 -0400

===============
Layer committee
===============

Members : RP, XG, AB, DWF

Proposal (RP)
=============

You construct your neural network by constructing a graph of connections
between layers, starting from the data. While you construct the graph,
different Theano formulas are put together to build your model.

Hard details are not set yet, but all members of the committee agreed
that this sounds like a good idea.


Example Code (RP):
------------------

# Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

h1   = sigmoid(dotW_b(train_x, n = 300))
rbm1 = CDk( h1, train_x, k = 5, sampler = binomial, cost = pseudolikelihood)

h2   = sigmoid(dotW_b(h1, n = 300))
rbm2 = CDk( h2, h1, k = 5, sampler = binomial, cost = pseudolikelihood)

out = sigmoid( dotW_b(h2, n = 10))

train_err = cross_entropy( out, train_y)

grads   = grad( train_err, train_err.parameters() )
learner = SGD( train_err, grads)

valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
test_err  = train_err.replace({ train_x : test_x,  train_y : test_y})


Global observations :
---------------------

1) Your graph can have multiple terminations; in this case rbm1, rbm2, learner, valid_err and
test_err are all end nodes of the graph.

2) Any node is an "iterator": when you call out.next() you get the next prediction;
when you call err.next() you get the next error ( on the batch given by the data ).
See the sketch after this list.

3) Replace can replace any subgraph.

4) You can have MACROS or SUBROUTINES that already give you the graph for known components ( in my
view CDk is such a macro, but simpler examples would be vanilla versions of MLP, DAA, DBN, LOGREG).

5) Any node has a pointer to the graph ( though arguably you don't use that graph that much). Running
such a node is in general done by compiling the Theano expression up to that node, and using the
data object that you get initially. This theano function is compiled lazily, in the sense that it is
compiled when you try to iterate through the node. You use the graph only to :
  * update the Theano expression in case some part of the subgraph has been changed
  * collect the list of parameters of the model
  * collect the list of hyper-parameters ( my personal view - this would mostly be useful for a
    hyper-learner and not on a day to day basis, but I think it is something easy to provide and we should)
  * collect constraints on parameters ( I believe they can be inserted in the graph .. things like L1
    and so on )

6) Registering parameters and hyper-parameters to the graph is the job of the transform and therefore
of the user who implemented that transform; the same goes for initializing the parameters ( so if we have
different ways to initialize the weight matrix, that should be a hyper-parameter with a default value).

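To make observation 2 concrete, here is a usage sketch of the proposed iterator
interface. Nothing below is an existing API; the nodes, the next() calls and the
lazy compilation are all part of the proposal.

    # hypothetical usage of the iterator behaviour of nodes
    pred = out.next()          # prediction for the next batch of the data object
    e    = train_err.next()    # error on the next batch; the theano function is
                               # compiled lazily on this first call

    for e in train_err:        # a node could also be looped over directly
        print(e)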

Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues :

Delayed or future values
------------------------

Sometimes you might want future values of some nodes. For example you might be interested in :

   y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version of a node x is obtained by
calling x.t(k), which will give you a node that has the value x(t+k). k can be positive or negative.
In my view this can be done as follows :
  - a node is a class that points to :
      * a data object that feeds data
      * a theano expression up to that point
      * the entire graph that describes the model ( not the Theano graph !!!)
The only thing you need to do is to change the data object to reflect the
delay ( we might need to be able to pad it with 0?). You also need to create
a copy of the theano expression ( those are "new nodes" ) in the sense that
the starting theano tensors are different, since they point to different data.

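A usage sketch of the delayed-node idea. x.t(k) is the proposed call; the
input_node helper and the zero-padding are assumptions of mine, purely for
illustration.

    # hypothetical : y(t) = x(t) - x(t-1)
    x = input_node(train_x)    # node fed directly by the data object
    y = x - x.t(-1)            # x.t(-1) uses a data object shifted by one step
                               # ( padding the first step with 0, if we allow that )
    y_future = x.t(5)          # a node whose value at time t is x(t+5)

    print(y.next())            # iterate as usual; only the data object differs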


Non-theano transformation ( or function or whatever)
----------------------------------------------------

Maybe you want to do something in the middle of your graph that is not
supported by Theano. Let's say you have a function f which you cannot write in Theano.
You want to do something like

   W1*f( W2*data + b)

I think we can support that by doing the following :
each node has :
  * a data object that feeds data
  * a theano expression up to that point
  * the entire graph that describes the model

Let x1 = W2*data + b
up to here everything is fine ( we have a theano expression )
   dot(W2, tensor) + b,
where tensor is provided by the data object ( plus a dict of givens
and whatever else you need to compile the function).

When you apply f, you create a node that is exactly like the
data object, in the sense that it provides a new tensor and a new dict of
givens.

So x2 = W1*f( W2*data + b)
will actually point to the expression
   dot(W1, tensor)
and to the data node f(W2*data + b).

What this means is that you basically compile two theano functions t1 and t2
and evaluate t2(f(t1(data))). So every time you have a non-theano operation you
break the theano expression and start a new one.

What you lose :
  - there is no optimization or anything between t1, t2 and f ( we don't
    support that)
  - if you are running things on GPU, after t1 the data will be copied to the CPU and
    then probably back to the GPU - so it doesn't make sense anymore

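A concrete Theano sketch of the t1 / f / t2 composition described above. The
shapes, the clip-based f and the vector input are made up for illustration;
the point is only that the graph is broken at f and glued back with plain
python.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX
    rng = np.random.RandomState(0)

    W2 = theano.shared(rng.uniform(size=(20, 10)).astype(floatX), name='W2')
    b  = theano.shared(np.zeros(20, dtype=floatX), name='b')
    W1 = theano.shared(rng.uniform(size=(5, 20)).astype(floatX), name='W1')

    # t1 : the theano expression up to the non-theano function f
    data = T.vector('data')
    t1 = theano.function([data], T.dot(W2, data) + b)

    # f : an arbitrary python/numpy function that Theano cannot express
    def f(a):
        return np.clip(a, 0.0, 1.0)

    # t2 : a fresh theano expression fed by a new input tensor ( the "new data object" )
    h  = T.vector('h')
    t2 = theano.function([h], T.dot(W1, h))

    x = rng.uniform(size=(10,)).astype(floatX)
    out = t2(f(t1(x)))    # evaluates W1*f( W2*data + b ), breaking the graph at f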


Recurrent Things
----------------

I think that you can write a recurrent operation by first defining a
graph ( the recurrent relation ) :

   y_tm1 = recurrent_layer(init = zeros(50))
   x_t   = slice(x, t = 0)
   y     = loop( dotW_b(y_tm1, 50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op
to the theano expression of the resulting op; it is just a different way
of writing things .. which I think is more intuitive.

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node ( a time slice, that is).
Then you call loop, giving an expression that starts from those primitives.

Similarly you can have foldl or map or anything else.

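For reference, a minimal sketch of what the loop above would roughly compile
down to with theano.scan. The 50-dimensional shapes, the row-per-time-step
layout of x and the use of the final state are assumptions of mine.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX

    x  = T.matrix('x')     # one 50-dim input row per time step ( needs >= 20 rows )
    W  = theano.shared(np.zeros((50, 50), dtype=floatX), name='W')
    b  = theano.shared(np.zeros(50, dtype=floatX), name='b')
    y0 = T.zeros((50,))    # init = zeros(50)

    def step(x_t, y_tm1):
        # the recurrent relation : y(t) = dotW_b(y(t-1), 50) + x(t)
        return T.dot(y_tm1, W) + b + x_t

    ys, updates = theano.scan(step, sequences=x, outputs_info=y0, n_steps=20)
    fn = theano.function([x], ys[-1], updates=updates)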
Optimizer
---------

Personally I would respect the findings of the optimization committee,
and have SGD require a Node that produces some error ( which can
be omitted) and the gradients. For this I would also have the grad
function, which would actually only call T.grad.

What if you have a non-theano thing in the middle? I don't have any smart
solution besides ignoring any parameter that is below the first
non-theano node and throwing a warning.

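As a plain-Theano sketch of what SGD(train_err, grads) would have to do under
the hood. The logistic-regression cost, the shapes and the fixed learning rate
are placeholders of my own; the point is that grad() reduces to T.grad and SGD
to an updates list.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX

    x = T.matrix('x')
    y = T.ivector('y')
    W = theano.shared(np.zeros((784, 10), dtype=floatX), name='W')
    b = theano.shared(np.zeros(10, dtype=floatX), name='b')

    p_y = T.nnet.softmax(T.dot(x, W) + b)
    err = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])   # the "error node"

    params = [W, b]
    grads  = T.grad(err, params)      # the proposed grad() would essentially do this

    lr = np.asarray(0.1, dtype=floatX)
    train = theano.function([x, y], err,
                            updates=[(p, p - lr * g) for p, g in zip(params, grads)])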
Learner
-------

In my case I would not have both a predict() and an eval() method on the learner,
but just an eval(). If you want the predictions you should use the
corresponding node ( before applying the error measure ). This was
for example **out** in my first example.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterator behaviour
of such a node should produce.

Granularity
-----------

Guillaume nicely pointed out that this library might be overkill,
in the sense that you have a dotW_b transform, and then you will need
a dotW_b_sparse transform and so on. Plus, every way of initializing each param
would result in many more transforms.

I don't have a perfect answer yet, but my argument goes like this :

You would have transforms for the most popular options ( dotW_b for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. More than decorating, you
can have a general apply transform that does something like :

   apply( lambda x, y, z: x*y + z, inputs = x,
          hyperparams = [(name, 2)],
          params = [(name, theano.shared(..))])

The order of the arguments in the lambda is nodes, params, hyper-params or so.
This would apply the theano expression, but it would also register the
parameters. I think you can do it such that the result of the apply is
picklable, but not the apply itself. Meaning that in the graph, the op doesn't
actually store the lambda expression but a mini theano graph.

Also names might be optional, so you can write hyperparams = [2,]

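A sketch of what the decorator route could look like. The transform decorator
below is hypothetical and minimal; it only shows one way of applying a plain
theano-level function while remembering which arguments were parameters and
hyper-parameters ( here stored on the variable's tag ).

    import numpy as np
    import theano
    import theano.tensor as T

    def transform(params=(), hyperparams=()):
        # hypothetical decorator : apply fn and record its (hyper-)parameters
        def decorate(fn):
            def wrapped(*inputs, **kwargs):
                out = fn(*inputs, **kwargs)
                out.tag.params = [kwargs[name] for name in params]
                out.tag.hyperparams = dict((name, kwargs[name]) for name in hyperparams)
                return out
            return wrapped
        return decorate

    @transform(params=['W'], hyperparams=['scale'])
    def scaled_dot(x, W=None, scale=2):
        return scale * T.dot(x, W)

    x = T.matrix('x')
    W = theano.shared(np.zeros((300, 40), dtype=theano.config.floatX), name='W')
    h = scaled_dot(x, W=W, scale=2)
    print(h.tag.params)       # [W] -- registered for the graph / optimizer to pick up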

What this way of doing things would hopefully buy you is that you do not
need to worry about most of your model ( it would be just a few macros or
subroutines).
You would do something like :

   rbm1, hidden1 = rbm_layer(data, 20)
   rbm2, hidden2 = rbm_layer(hidden1, 20)

and then the part you care about :

   hidden3 = apply( lambda x, W: T.dot(x, W), inputs = hidden2,
                    params = theano.shared(scipy.sparse_CSR(..)))

and after that you potentially still do what you did before :

   err   = cross_entropy(hidden3, target)
   grads = grad(err, err.parameters())
   ...

I do agree that some of the "transforms" that I have been writing here
and there are pretty low level, and maybe we don't need them. We might need
only somewhat higher level transforms. My hope is that for now people think
about the approach and not about all the inner details ( like what transforms
we need, and so on) and see if they are comfortable with it or not.

Do we want to think in these terms? I think it is a bit better to have your
script like that than to hack into the DBN class to change that W to be
sparse.

Anyhow Guillaume, I'm working on a better answer :)

Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of that
class. I think there should be such a wrapper around parameters and
hyper-parameters as well. By default those wrappers might not provide
any information. Later on they can provide, for hyper-params for example, a
distribution. If, when inserting your hyper-param in the graph ( i.e. when
you call a given transform), you provide the distribution, then maybe a
hyper-learner could use it to sample from it.

For parameters you might define properties like freeze. It can be true or
false. Whenever it is set to true, the param is not adapted by the optimizer.
Changing this value, like changing most hyper-params, implies recompilation
of the graph.

I would have a special class of hyper-params which don't require
recompilation of the graph. Learning rate is an example. This info is also
given by the wrapper and by how the parameter is used.

It is up to the user and the "transform" implementer to wrap params and
hyper-params correspondingly. But I don't think this is too complicated.
The apply function above has a default behaviour; maybe you would have
a fourth type of argument which is a hyper-param that doesn't require
compilation. We could find a nice name for it.

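A sketch of what such wrappers might carry. Param, HyperParam and all their
keyword names are hypothetical, purely to illustrate the information the
wrappers could expose to the optimizer and to a hyper-learner.

    import numpy as np
    import theano

    class Param(object):
        # hypothetical wrapper around a shared variable
        def __init__(self, var, freeze=False):
            self.var = var
            self.freeze = freeze           # True -> the optimizer skips it

    class HyperParam(object):
        # hypothetical wrapper around a hyper-parameter value
        def __init__(self, value, recompile=True, distribution=None):
            self.value = value
            self.recompile = recompile     # False for e.g. the learning rate
            self.distribution = distribution   # usable by a hyper-learner

    W  = Param(theano.shared(np.zeros((784, 300), dtype=theano.config.floatX)),
               freeze=False)
    lr = HyperParam(0.1, recompile=False)
    n_hidden = HyperParam(300, recompile=True,
                          distribution=('uniform_int', 100, 1000))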

How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes ( or a param freezes), all regions of the graph affected get recompiled.
This is done by traversing the graph from the bottom node and constructing the
theano expression.

The function that updates / re-constructs the graph is slightly more complex
if you have non-theano functions in the graph ..

replace
-------

Replace replaces a part of the graph. The way it works, in my view, is that
if I write :

   x = x1 + x2 + x3
   y = x.replace({x2 : x5})

you would first copy the graph that is represented by x ( the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done
in place.

I think of these Node classes as something light-weight, like theano variables.

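For comparison, Theano already behaves this way at the variable level; a
minimal check with theano.clone, which copies the graph and substitutes
without touching the original ( the scalar example is mine ) :

    import theano
    import theano.tensor as T

    x1, x2, x3, x5 = T.scalars('x1', 'x2', 'x3', 'x5')
    x = x1 + x2 + x3
    y = theano.clone(x, replace={x2: x5})   # a copy of the graph with x2 swapped for x5

    f = theano.function([x1, x2, x3], x)
    g = theano.function([x1, x5, x3], y)
    print(f(1, 2, 3), g(1, 5, 3))           # 6.0 8.0 ; x itself is untouched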

reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think, though, that it is a shame to have that transformation graph and not
be able to use it to do this. It will make life so much easier when you
do deep auto-encoders. I wouldn't put it in the core library, but I would
have it in the DAA module. The way I see it, you can either

   # generate your invertible transforms on the fly
   fn  = create_transform(lambda : , params, hyper-params )
   inv = create_transform(lambda : , params, hyper-params )
   my_transform = couple_transforms( forward = fn, inv = inv)

   # or have some already widely used such transforms in the daa submodule.


transforms
----------

In my view there will be quite a few such standard transforms. They
can be grouped by architecture, basic, sampler, optimizer and so on.

We do not need to provide all of them, just the ones we need. Researching
an architecture would actually lead to creating new such transforms in
the library.

There will definitely be a list of basic transforms in the beginning,
like :
   replace,
   search,
   get_param(name),
   get_params(..)

You can and should have something like a switch ( that, based on a
hyper-parameter, replaces a part of a graph with another or not). This is
done by re-compiling the graph. See the sketch below.

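A sketch of how such a switch might read in user code. switch here is the
proposed transform, HyperParam is the hypothetical wrapper sketched earlier,
and dotW_b / dotW_b_sparse are the transforms discussed in the Granularity
section; none of this is an existing API.

    # hypothetical : a part of the graph selected by a hyper-parameter
    use_sparse = HyperParam(False, recompile=True)

    h = switch(use_sparse,
               dotW_b_sparse(x, n = 300),   # used when use_sparse is True
               dotW_b(x, n = 300))          # used when use_sparse is False

    # flipping use_sparse re-compiles the affected region of the graph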

Constraints
-----------

Nodes can also keep track of constraints.

When you write

   y = add_constraint(x, sum(x**2))

y is the same node as x, just that it also links to this second graph that
computes the constraint. Whenever you call grad, grad will also add to the
cost all constraints attached to the graph.

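Spelled out on the first example, that means something like the following
( still the proposed API; the 0.1 weight on the penalty is my own ) :

   # hypothetical : attach an L2 penalty to the first hidden layer
   h1 = sigmoid(dotW_b(train_x, n = 300))
   h1 = add_constraint(h1, 0.1 * sum(h1**2))

   train_err = cross_entropy(out, train_y)
   grads = grad(train_err, train_err.parameters())
   # grad differentiates train_err + 0.1*sum(h1**2) : the attached constraint
   # is added to the cost automatically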