comparison doc/v2_planning/layer_RP.txt @ 1231:5ef96142492b
some typos
author      Razvan Pascanu <r.pascanu@gmail.com>
date        Wed, 22 Sep 2010 20:17:35 -0400
parents     515033d4d3bf
children    32fc5f442dde
1230:31b72defb680 | 1231:5ef96142492b |
---|---|
6 | 6 |
7 Proposal (RP) | 7 Proposal (RP) |
8 ============= | 8 ============= |
9 | 9 |
10 You construct your neural network by constructing a graph of connections | 10 You construct your neural network by constructing a graph of connections |
11 between layesrs starting from data. While you construct the graph, | 11 between layers starting from data. While you construct the graph, |
12 different theano formulas are put together to construct your model. | 12 different theano formulas are put together to construct your model. |
13 | 13 |
14 Hard details are not set yet, but all members of the committee agreed | 14 Hard details are not set yet, but all members of the committee agreed |
15 that this sounds like a good idea. | 15 that this sounds like a good idea. |
16 | 16 |
39 | 39 |
40 | 40 |
41 Global observations : | 41 Global observations : |
42 --------------------- | 42 --------------------- |
43 | 43 |
44 1) Your graph can have multiple terminations; in this case rbm1, rbm2 and learner, valid_err, | 44 1) Your graph can have multiple terminal nodes; in this case rbm1, |
45 test_err are all end nodes of the graph; | 45 rbm2 and learner, valid_err, test_err are all end nodes of the graph; |
46 | 46 |
47 2) Any node is an "iterator"; when you call out.next() you get the next prediction; | 47 2) Any node is an "iterator"; when you call out.next() you get |
48 when you call err.next() you get the next error ( on the batch given by the data ). | 48 the next prediction; when you call err.next() you get the next error |
49 ( on the batch given by the data.next() ). | |
49 | 50 |
50 3) Replace can replace any subgraph | 51 3) Replace can replace any subgraph |
51 | 52 |
52 4) You can have MACROS or SUBROUTINE that already give you the graph for known components ( in my | 53 4) You can have MACROS or SUBROUTINE that already give you the graph for |
53 view the CDk is such a macro, but simpler examples will be vanilla versions of MLP, DAA, DBN, LOGREG) | 54 known components ( in my view the CDk is such a macro, but simpler |
54 | 55 examples will be vanilla versions of MLP, DAA, DBN, LOGREG) |
55 5) Any node has a pointer at the graph ( though arguably you don't use that graph that much). Running | 56 |
56 such a node in general will be done by compiling the Theano expression up to that node, and using the | 57 5) Any node has the entire graph ( though arguably you don't use that |
57 data object that you get initially. This theano function is compiled lazy, in the sense that is compiled | 58 graph too much). Running such a node in general will be done by compiling |
58 when you try to iterate through the node. You use the graph only to : | 59 the Theano expression up to that node( if you don't already have this |
59 * update the Theano expression in case some part of the subgraph has been changed | 60 function), and using the data object that you get initially. This theano |
60 * collect the list of parameters of the model | 61 function is compiled only if you need it. You use the graph only to : |
61 * collect the list of hyper-parameters ( my personal view - this would mostly be useful for a | 62 * update the Theano expression in case some part of the subgraph has |
62 hyper learner .. and not day to day basis, but I think is something easy to provide and we should) | 63 changed (hyper-parameter or a replace call) |
63 * collect constraints on parameters ( I believe they can be inserted in the graph .. things like L1 | 64 * collect the list of parameters of the model |
64 and so on ) | 65 * collect the list of hyper-parameters ( my personal view - this |
65 | 66 would mostly be useful for a hyper learner .. and not for day to |
66 6) Registering parameters and hyper-parameters to the graph is the job of the transform and therefore | 67 day stuff, but I think is something easy to provide and we should ) |
67 to the user who implemented that transform; also initializing the parameters ( so if we have different way | 68 * collect constraints on parameters ( I believe they can be represented |
68 to initialize the weight matrix that should be a hyperparameter with a default value) | 69 in the graph as dependency links to other graphs that compute the |
70 constraints..) | |
71 | |
72 6) Registering parameters and hyper-parameters to the graph is the job of | |
73 the transform and therefore of the user who implemented that | |
74 transform; the same for initializing the parameters ( so if we have | |
75 different way to initialize the weight matrix that might be a | |
76 hyperparameter with a default value) | |
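
To make points 2), 5) and 6) a bit more concrete, here is a rough sketch of
what such a node object could look like in Python. All names here are my own
illustrative assumptions, nothing is an agreed API; the data object is
assumed to expose a next() method that yields minibatches.

    import theano

    class Node(object):
        """One node of the model graph. It points to the data source, to the
        Theano expression built up to this point and to the whole model graph
        (a plain list here), and it compiles its Theano function lazily."""

        def __init__(self, data, input_var, expression, graph,
                     params=None, hyperparams=None):
            self.data = data                  # object whose next() yields minibatches
            self.input_var = input_var        # Theano input variable of the expression
            self.expression = expression      # Theano expression up to this node
            self.graph = graph                # list of all nodes of the model
            self.params = list(params or [])  # registered by the transform (point 6)
            self.hyperparams = dict(hyperparams or {})
            self._fn = None                   # compiled only when needed (point 5)
            graph.append(self)

        def parameters(self):
            # collect the parameters of every node reachable through the graph
            return [p for node in self.graph for p in node.params]

        def next(self):
            # lazy compilation: the Theano function is built on first iteration
            if self._fn is None:
                self._fn = theano.function([self.input_var], self.expression)
            return self._fn(self.data.next())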
69 | 77 |
70 | 78 |
71 | 79 |
72 Detailed Proposal (RP) | 80 Detailed Proposal (RP) |
73 ====================== | 81 ====================== |
75 I would go through a list of scenarios and possible issues : | 83 I would go through a list of scenarios and possible issues : |
76 | 84 |
77 Delayed or future values | 85 Delayed or future values |
78 ------------------------- | 86 ------------------------- |
79 | 87 |
80 Sometimes you might want future values of some nodes. For example you might be interested in : | 88 Sometimes you might want future values of some nodes. For example you might |
89 be interested in : | |
81 | 90 |
82 y(t) = x(t) - x(t-1) | 91 y(t) = x(t) - x(t-1) |
83 | 92 |
84 You can get that by having a "delayed" version of a node. A delayed version of a node x is obtained by | 93 You can get that by having a "delayed" version of a node. A delayed version |
85 calling x.t(k) which will give you a node that has the value x(t+k). k can be positive or negative. | 94 of a node x is obtained by calling x.t(k) which will give you a node that has |
95 the value x(t+k). k can be positive or negative. | |
86 In my view this can be done as follows : | 96 In my view this can be done as follows : |
87 - a node is a class that points to : | 97 - a node is a class that points to : |
88 * a data object that feeds data | 98 * a data object that feeds data |
89 * a theano expression up to that point | 99 * a theano expression up to that point |
90 * the entire graph that describes the model ( not Theano graph !!!) | 100 * the entire graph that describes the model ( not Theano graph !!!) |
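
Purely as an illustration of how x.t(k) could be realised on top of such a
node (reusing the Node sketch from the observations above; every name below
is an assumption), the delayed node keeps the same expression and the same
model graph and only shifts the stream of data it is fed:

    class DelayedData(object):
        """Wraps a data object so that next() returns the batch at time t+k.
        Assumes the underlying data object can be indexed by time via get(t)."""

        def __init__(self, data, k):
            self.data, self.k, self.t = data, k, 0

        def next(self):
            batch = self.data.get(self.t + self.k)   # x(t+k), k may be negative
            self.t += 1
            return batch

    def delayed(node, k):
        # what the proposal writes as x.t(k): same expression, same model graph,
        # but iterating it produces values shifted by k time steps
        return Node(DelayedData(node.data, k), node.input_var,
                    node.expression, node.graph)

With that, y(t) = x(t) - x(t-1) would roughly be a transform applied to x and
delayed(x, -1).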
104 | 114 |
105 | 115 |
106 W1*f( W2*data + b) | 116 W1*f( W2*data + b) |
107 | 117 |
108 I think we can support that by doing the following : | 118 I think we can support that by doing the following : |
109 each node has a : | 119 each node has a: |
110 * a data object that feeds data | 120 * a data object that feeds data |
111 * a theano expression up to that point | 121 * a theano expression up to that point |
112 * the entire graph that describes the model | 122 * the entire graph that describes the model |
113 | 123 |
114 Let x1 = W2*data + b | 124 Let x1 = W2*data + b |
156 have an initial value, or a slice of some other node ( a time slice that is) | 166 have an initial value, or a slice of some other node ( a time slice that is) |
157 Then you call loop giving an expression that starts from those primitives. | 167 Then you call loop giving an expression that starts from those primitives. |
158 | 168 |
159 Similarly you can have foldl or map or anything else. | 169 Similarly you can have foldl or map or anything else. |
160 | 170 |
171 You would use this instead of writing scan especially if the formula is | |
172 more complicated and you want to automatically collect parameters, | |
173 hyper-parameters and so on. | |
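
As a hedged sketch of what such a loop macro might reduce to (the wrapper
below is my own guess; only theano.scan is a real call), the user passes the
step formula plus named parameters, and the macro threads them through scan
and keeps them around so the resulting node can report its parameters:

    import numpy
    import theano
    import theano.tensor as T

    def loop(step, sequences, init, params):
        # params is a list of (name, initial value) pairs; they become shared
        # variables, are passed to scan as non_sequences, and would be
        # registered on the resulting node so parameters() can find them
        shared = [theano.shared(value, name=name) for name, value in params]
        outputs, _ = theano.scan(step,
                                 sequences=sequences,
                                 outputs_info=[init],
                                 non_sequences=shared)
        return outputs, shared

    # usage: the recurrence h(t) = tanh( x(t) W + h(t-1) V )
    x = T.matrix('x')                  # a time slice of some data node
    h0 = T.zeros((50,))                # primitive with an initial value
    h, params = loop(lambda x_t, h_tm1, W, V: T.tanh(T.dot(x_t, W) + T.dot(h_tm1, V)),
                     sequences=[x], init=h0,
                     params=[('W', numpy.zeros((20, 50))), ('V', numpy.zeros((50, 50)))])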
174 | |
161 Optimizer | 175 Optimizer |
162 --------- | 176 --------- |
163 | 177 |
164 Personally I would respect the findings of the optimization committee, | 178 Personally I would respect the findings of the optimization committee, |
165 and have the SGD require a Node that produces some error ( which can | 179 and have the SGD require a Node that produces some error ( which can |
177 but just an eval(). If you want the predictions you should use the | 191 but just an eval(). If you want the predictions you should use the |
178 corresponding node ( before applying the error measure ). This was | 192 corresponding node ( before applying the error measure ). This was |
179 for example **out** in my first example. | 193 for example **out** in my first example. |
180 | 194 |
181 Of course we could require learners to be special nodes that also have | 195 Of course we could require learners to be special nodes that also have |
182 a predict output. In that case I'm not sure what the iterator behaviour | 196 a predict output. In that case I'm not sure what the iterating behaviour |
183 of the node should produce. | 197 of the node should produce. |
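
To illustrate the intended division of labour (every name below is a
placeholder; sgd stands for whatever the optimization committee ends up
providing, and out/target are the nodes from the first example):

    # err is the node that produces the error; out is the node right before
    # the error measure and is the one to iterate for predictions
    err = cross_entropy(out, target)
    learner = sgd(err, lr=0.1)           # takes an error node, as argued above

    for step in range(1000):
        training_cost = learner.next()   # each next() is one eval()/update

    predictions = out.next()             # predictions come from out, not err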
184 | 198 |
185 Granularity | 199 Granularity |
186 ----------- | 200 ----------- |
187 | 201 |
200 apply( lambda x,y,z: x*y+z, inputs = x, | 214 apply( lambda x,y,z: x*y+z, inputs = x, |
201 hyperparams = [(name,2)], | 215 hyperparams = [(name,2)], |
202 params = [(name, theano.shared(..))]) | 216 params = [(name, theano.shared(..))]) |
203 The order of the arguments in lambda is nodes, params, hyper-params or so. | 217 The order of the arguments in lambda is nodes, params, hyper-params or so. |
204 This would apply the theano expression but it will also register | 218 This would apply the theano expression but it will also register |
205 the parameters. I think we can arrange that the result of the apply is | 219 the parameters. It is like creating a transform on the fly. |
206 picklable, but not the apply. Meaning that in the graph, the op doesn't | 220 |
207 actually store the lambda expression but a mini theano graph. | 221 I think we can arrange that the result of the apply is |
222 picklable, but not the apply operation. Meaning that in the graph, the op | |
223 doesn't actually store the lambda expression but a mini theano graph. | |
208 | 224 |
209 Also names might be optional, so you can write hyperparam = [2,] | 225 Also names might be optional, so you can write hyperparam = [2,] |
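
A possible reading of this apply transform, again only a sketch built on the
Node object sketched earlier (I call it apply_transform here just to avoid
confusion with Python's builtin; handling of optional names is omitted):

    def apply_transform(fn, inputs, params=(), hyperparams=()):
        # fn gets its arguments in the order nodes, params, hyper-params;
        # params and hyperparams are lists of (name, value) pairs
        nodes = inputs if isinstance(inputs, (list, tuple)) else [inputs]
        param_vars = [p for _, p in params]
        hyper_vals = [v for _, v in hyperparams]
        # calling fn once builds the mini theano graph; only that graph is kept
        # on the new node, so the node stays picklable even though fn is a lambda
        expression = fn(*([n.expression for n in nodes] + param_vars + hyper_vals))
        return Node(nodes[0].data, nodes[0].input_var, expression, nodes[0].graph,
                    params=param_vars, hyperparams=dict(hyperparams))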
210 | 226 |
211 | 227 |
212 What this way of doing things would buy you hopefully is that you do not | 228 What this way of doing things would buy you hopefully is that you do not |
213 need to worry about most of your model ( would be just a few macros or | 229 need to worry about most of your model ( would be just a few macros or |
214 subroutines). | 230 subroutines). |
215 you would do like : | 231 you would do something like : |
216 | 232 |
217 rbm1,hidden1 = rbm_layer(data,20) | 233 rbm1,hidden1 = rbm_layer(data,20) |
218 rbm2,hidden2 = rbm_layer(data,20) | 234 rbm2,hidden2 = rbm_layer(data,20) |
235 | |
219 and then the part you care about : | 236 and then the part you care about : |
237 | |
220 hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2, params = | 238 hidden3 = apply( lambda x,W: T.dot(x,W), inputs = hidden2, params = |
221 theano.shared(scipy.sparse_CSR(..))) | 239 theano.shared(scipy.sparse_CSR(..))) |
240 | |
222 and after that you potentially still do what you did before : | 241 and after that you potentially still do what you did before : |
242 | |
223 err = cross_entropy(hidden3, target) | 243 err = cross_entropy(hidden3, target) |
224 grads = grad(err, err.parameters()) | 244 grads = grad(err, err.parameters()) |
225 ... | 245 ... |
226 | 246 |
227 I do agree that some of the "transforms" that I have been writing here | 247 I do agree that some of the "transforms" that I have been writing here |
228 and there are pretty low level, and maybe we don't need them. We might need | 248 and there are pretty low level, and maybe we don't need them. We might need |
229 only somewhat higher level transforms. My hope is that for now people think | 249 only somewhat higher level transforms. My hope is that for now people think |
230 of the approach and not to all inner details ( like what transforms we need, | 250 of the approach and not about all inner details ( like what transforms we |
231 and so on) and see if they are comfortable with it or not. | 251 need and so on) and see if they are comfortable with it or not. |
232 | 252 |
233 Do we want to think in these terms? I think it is a bit better to have your | 253 Do we want to think in these terms? I think it is a bit better than having |
234 script like that, than hacking into the DBN class to change that W to be | 254 a normal python class, hacking it to change something and then either adding |
235 sparse. | 255 a parameter to init or creating a new version. It seems a bit more natural. |
256 | |
257 | |
258 | |
236 | 259 |
237 Anyhow Guillaume I'm working on a better answer :) | 260 Anyhow Guillaume I'm working on a better answer :) |
238 | 261 |
239 | 262 |
240 Params and hyperparams | 263 Params and hyperparams |