pylearn: diff doc/v2_planning/layer_RP.txt @ 1231:5ef96142492b

some typos

author    Razvan Pascanu <r.pascanu@gmail.com>
date      Wed, 22 Sep 2010 20:17:35 -0400
parents   515033d4d3bf
children  32fc5f442dde
--- a/doc/v2_planning/layer_RP.txt	Wed Sep 22 19:59:52 2010 -0400
+++ b/doc/v2_planning/layer_RP.txt	Wed Sep 22 20:17:35 2010 -0400
@@ -8,7 +8,7 @@
 =============
 
 You construct your neural network by constructing a graph of connections
-between layesrs starting from data. While you construct the graph,
+between layers starting from data. While you construct the graph,
 different theano formulas are put together to construct your model. Hard
 details are not set yet, but all members of the committee agreed
@@ -41,31 +41,39 @@
 Global observations :
 ---------------------
 
- 1) Your graph can have multiple terminations; in this case rbm1, rbm2 and learner, valid_err,
- test_err are all end nodes of the graph;
+ 1) Your graph can have multiple terminal nodes; in this case rbm1,
+ rbm2 and learner, valid_err, test_err are all end nodes of the graph;
 
- 2) Any node is an "iterator", when you would call out.next() you would get the next prediction;
- when you call err.next() you will get next error ( on the batch given by the data ).
+ 2) Any node is an "iterator", when you would call out.next() you would get
+ the next prediction; when you call err.next() you will get next error
+ ( on the batch given by the data.next() ).
 
  3) Replace can replace any subgraph
 
- 4) You can have MACROS or SUBROUTINE that already give you the graph for known components ( in my
- view the CDk is such a macro, but simpler examples will be vanilla versions of MLP, DAA, DBN, LOGREG)
+ 4) You can have MACROS or SUBROUTINE that already give you the graph for
+ known components ( in my view the CDk is such a macro, but simpler
+ examples will be vanilla versions of MLP, DAA, DBN, LOGREG)
 
- 5) Any node has a pointer at the graph ( though arguably you don't use that graph that much). Running
- such a node in general will be done by compiling the Theano expression up to that node, and using the
- data object that you get initially. This theano function is compiled lazy, in the sense that is compiled
- when you try to iterate through the node. You use the graph only to :
-     * update the Theano expression in case some part of the subgraph has been changed
-     * collect the list of parameters of the model
-     * collect the list of hyper-parameters ( my personal view - this would mostly be useful for a
-     hyper learner .. and not day to day basis, but I think is something easy to provide and we should)
-     * collect constraints on parameters ( I believe they can be inserted in the graph .. things like L1
-     and so on )
+ 5) Any node has the entire graph ( though arguably you don't use that
+ graph too much). Running such a node in general will be done by compiling
+ the Theano expression up to that node( if you don't already have this
+ function), and using the data object that you get initially. This theano
+ function is compiled only if you need it. You use the graph only to :
+     * update the Theano expression in case some part of the subgraph has
+       changed (hyper-parameter or a replace call)
+     * collect the list of parameters of the model
+     * collect the list of hyper-parameters ( my personal view - this
+       would mostly be useful for a hyper learner .. and not for day to
+       day stuff, but I think is something easy to provide and we should )
+     * collect constraints on parameters ( I believe they can be represented
+       in the graph as dependency links to other graphs that compute the
+       constraints..)
 
- 6) Registering parameters and hyper-parameters to the graph is the job of the transform and therefore
- to the user who implemented that transform; also initializing the parameters ( so if we have different way
- to initialize the weight matrix that should be a hyperparameter with a default value)
+ 6) Registering parameters and hyper-parameters to the graph is the job of
+ the transform and therefore of the user who implemented that
+ transform; the same for initializing the parameters ( so if we have
+ different way to initialize the weight matrix that might be a
+ hyperparameter with a default value)
@@ -77,12 +85,14 @@
 Delayed or feature values
 -------------------------
 
-Sometimes you might want future values of some nodes. For example you might be interested in :
+Sometimes you might want future values of some nodes. For example you might
+be interested in :
 
   y(t) = x(t) - x(t-1)
 
-You can get that by having a "delayed" version of a node. A delayed version a node x is obtained by
-calling x.t(k) which will give you a node that has the value x(t+k). k can be positive or negative.
+You can get that by having a "delayed" version of a node. A delayed version
+a node x is obtained by calling x.t(k) which will give you a node that has
+the value x(t+k). k can be positive or negative.
 
 In my view this can be done as follows :
   - a node is a class that points to :
     * a data object that feeds data
@@ -106,7 +116,7 @@
   W1*f( W2*data + b)
 
 I think we can support that by doing the following :
-each node has a :
+each node has a:
   * a data object that feeds data
   * a theano expression up to that point
   * the entire graph that describes the model
@@ -158,6 +168,10 @@
 
 Similarly you can have foldl or map or anything else.
 
+You would use this instead of writing scan especially if the formula is
+more complicated and you want to automatically collect parameters,
+hyper-parameters and so on.
+
 Optimizer
 ---------
@@ -179,7 +193,7 @@
 for example **out** in my first example.
 
 Of course we could require learners to be special nodes that also have
-a predict output. In that case I'm not sure what the iterator behaiour
+a predict output. In that case I'm not sure what the iterating behaiour
 of the node should produce.
 
 Granularity
@@ -202,24 +216,30 @@
   params = [(name,theano.shared(..)])
 
 The order of the arguments in lambda is nodes, params, hyper-params or so.
 This would apply the theano expression but it will also register the
-the parameters. I think you can do such that the result of the apply is
-pickable, but not the apply. Meaning that in the graph, the op doesn't
-actually store the lambda expression but a mini theano graph.
+the parameters. It is like creating a transform on the fly.
+
+I think you can do such that the result of the apply is
+pickable, but not the apply operation. Meaning that in the graph, the op
+doesn't actually store the lambda expression but a mini theano graph.
 
 Also names might be optional, so you can write hyperparam = [2,]
 
 What this way of doing things would buy you hopefully is that you do not
 need to worry about most of your model ( would be just a few macros or
-subrutines).
-you would do like :
+subrutines).
+you would do something like :
 
 rbm1,hidden1 = rbm_layer(data,20)
 rbm2,hidden2 = rbm_layer(data,20)
+
+and then the part you care about :
+
+hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2, params = theano.shared(scipy.sparse_CSR(..)))
+
+and after that you pottentially still do what you did before :
+
 err = cross_entropy(hidden3, target)
 grads = grad(err, err.paramters())
 ...
@@ -227,12 +247,15 @@
 
 I do agree that some of the "transforms" that I have been writing here and
 there are pretty low level, and maybe we don't need them. We might need
 only somewhat higher level transforms. My hope is that for now people think
-of the approach and not to all inner details ( like what transforms we need,
-and so on) and see if they are comfortable with it or not.
+of the approach and not about all inner details ( like what transforms we
+need and so on) and see if they are comfortable with it or not.
 
-Do we want to think in this terms? I think is a bit better do have your
-script like that, then hacking into the DBN class to change that W to be
-sparse.
+Do we want to think in this terms? I think is a bit better do have
+a normal python class, hacking it to change something and then either add
+a parameter to init or create a new version. It seems a bit more natural.
+
+
+
 
 Anyhow Guillaume I'm working on a better answer :)
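For concreteness, here is a minimal sketch of points 2) and 5) in the hunks
above: a node behaves like an iterator and compiles its theano function only
when you first ask it for a value. LazyNode and Batches are hypothetical
names used only for this illustration; they are not pylearn classes, and the
real proposal would also carry the surrounding graph and its parameters.

    # Rough sketch only: LazyNode / Batches are hypothetical names, not
    # pylearn classes; they just illustrate "compile only when you need it".
    import numpy
    import theano
    import theano.tensor as T


    class LazyNode(object):
        def __init__(self, inputs, expression, data_object):
            self.inputs = inputs          # symbolic inputs of the sub-graph
            self.expression = expression  # theano expression up to this node
            self.data = data_object       # the data object that feeds batches
            self._fn = None               # compiled lazily, on first use

        def next(self):
            if self._fn is None:          # compile the first time we iterate
                self._fn = theano.function(self.inputs, self.expression)
            return self._fn(self.data.next())


    class Batches(object):
        """Toy data object: returns a fresh 2x3 batch on every call."""
        def next(self):
            return numpy.ones((2, 3), dtype=theano.config.floatX)


    x = T.matrix('x')
    out = LazyNode([x], T.tanh(x).sum(axis=1), Batches())
    print(out.next())   # first call compiles the function, then evaluates a batch
    print(out.next())   # later calls reuse the compiled function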
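The delayed-value hunk ( y(t) = x(t) - x(t-1) obtained through x.t(k) ) can
be read as pairing each batch with a shifted copy of the same stream. A
minimal sketch, assuming a hypothetical delayed() helper over plain Python
iterators rather than the graph nodes discussed in the text:

    # Rough sketch only: delayed() is a hypothetical helper, not pylearn API;
    # it mimics what a node x paired with x.t(-1) would feed downstream.
    from collections import deque


    def delayed(stream, k):
        """Yield (x(t+k), x(t)) pairs for k < 0, i.e. pair each value with a past one."""
        assert k < 0, "this toy version only looks into the past"
        history = deque(maxlen=-k + 1)
        for x in stream:
            history.append(x)
            if len(history) == history.maxlen:
                yield history[0], history[-1]


    xs = [1.0, 4.0, 9.0, 16.0]
    # y(t) = x(t) - x(t-1) corresponds to k = -1
    ys = [x_t - x_tm1 for x_tm1, x_t in delayed(xs, k=-1)]
    print(ys)   # [3.0, 5.0, 7.0]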
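Finally, the "apply a lambda and register its parameters" hunks describe a
transform created on the fly. Below is a minimal illustration under the
assumption that the registry is just a list collected by a hypothetical
apply_transform() helper (the real proposal would record this in the graph
node itself); only standard Theano calls (theano.shared, T.dot,
theano.function) are used.

    # Rough sketch only: apply_transform and params_registry are hypothetical
    # stand-ins for the graph-side bookkeeping described in the text above.
    import numpy
    import theano
    import theano.tensor as T


    def apply_transform(fn, inputs, params, registry):
        """Build fn(inputs..., params...) and record the named params in registry."""
        registry.extend(params)
        return fn(*(list(inputs) + [p for _, p in params]))


    params_registry = []              # what the graph would collect for the learner

    hidden2 = T.matrix('hidden2')     # stands in for the rbm_layer output above
    W3 = theano.shared(numpy.ones((20, 5), dtype=theano.config.floatX), name='W3')

    # the part you care about: a one-off transform, parameters registered as a side effect
    hidden3 = apply_transform(lambda h, W: T.dot(h, W),
                              inputs=[hidden2],
                              params=[('W3', W3)],
                              registry=params_registry)

    f = theano.function([hidden2], hidden3)
    print(f(numpy.ones((3, 20), dtype=theano.config.floatX)))  # every entry is 20.0
    print([name for name, _ in params_registry])               # ['W3']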