doc/v2_planning/layer_RP.txt @ 1229:515033d4d3bf -- "a first draft of layer committee"
author : Razvan Pascanu <r.pascanu@gmail.com>
date   : Wed, 22 Sep 2010 19:43:24 -0400

===============
Layer committee
===============

Members : RP, XG, AB, DWF

Proposal (RP)
=============

You construct your neural network by constructing a graph of connections
between layers, starting from the data. While you construct the graph,
different Theano formulas are put together to build your model.

Hard details are not set yet, but all members of the committee agreed
that this sounds like a good idea.


Example Code (RP):
------------------

# Assume you have the dataset as train_x, train_y, valid_x, valid_y, test_x, test_y

h1   = sigmoid(dotW_b(train_x, n = 300))
rbm1 = CDk( h1, train_x, k = 5, sampler = binomial, cost = pseudolikelihood)

h2   = sigmoid(dotW_b(h1, n = 300))
rbm2 = CDk( h2, h1, k = 5, sampler = binomial, cost = pseudolikelihood)

out = sigmoid( dotW_b(h2, n = 10))

train_err = cross_entropy( out, train_y)

grads   = grad( train_err, train_err.parameters() )
learner = SGD( train_err, grads)

valid_err = train_err.replace({ train_x : valid_x, train_y : valid_y})
test_err  = train_err.replace({ train_x : test_x,  train_y : test_y})


Global observations :
---------------------

1) Your graph can have multiple terminations; in this case rbm1, rbm2, learner, valid_err and
test_err are all end nodes of the graph.

2) Any node is an "iterator": when you call out.next() you get the next prediction;
when you call err.next() you get the next error ( on the batch given by the data ).
See the sketch after this list.

3) Replace can replace any subgraph.

4) You can have MACROS or SUBROUTINES that already give you the graph for known components ( in my
view CDk is such a macro, but simpler examples would be vanilla versions of MLP, DAA, DBN, LOGREG).

5) Any node has a pointer to the graph ( though arguably you don't use that graph that much). Running
such a node is in general done by compiling the Theano expression up to that node, and using the
data object that you get initially. This theano function is compiled lazily, in the sense that it is
compiled when you try to iterate through the node. You use the graph only to :
  * update the Theano expression in case some part of the subgraph has been changed
  * collect the list of parameters of the model
  * collect the list of hyper-parameters ( my personal view - this would mostly be useful for a
    hyper-learner and not on a day to day basis, but I think it is something easy to provide and we should)
  * collect constraints on parameters ( I believe they can be inserted in the graph .. things like L1
    and so on )

6) Registering parameters and hyper-parameters to the graph is the job of the transform and therefore
of the user who implemented that transform; the same goes for initializing the parameters ( so if we have
different ways to initialize the weight matrix, that should be a hyper-parameter with a default value).

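To make observation 2 concrete, here is a usage sketch of the proposed iterator
interface. Nothing below is an existing API; the nodes, the next() calls and the
lazy compilation are all part of the proposal.

    # hypothetical usage of the iterator behaviour of nodes
    pred = out.next()          # prediction for the next batch of the data object
    e    = train_err.next()    # error on the next batch; the theano function is
                               # compiled lazily on this first call

    for e in train_err:        # a node could also be looped over directly
        print(e)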

Detailed Proposal (RP)
======================

I would go through a list of scenarios and possible issues :

Delayed or future values
------------------------

Sometimes you might want future values of some nodes. For example you might be interested in :

   y(t) = x(t) - x(t-1)

You can get that by having a "delayed" version of a node. A delayed version of a node x is obtained by
calling x.t(k), which will give you a node that has the value x(t+k). k can be positive or negative.
In my view this can be done as follows :
  - a node is a class that points to :
      * a data object that feeds data
      * a theano expression up to that point
      * the entire graph that describes the model ( not the Theano graph !!!)
The only thing you need to do is to change the data object to reflect the
delay ( we might need to be able to pad it with 0?). You also need to create
a copy of the theano expression ( those are "new nodes" ) in the sense that
the starting theano tensors are different, since they point to different data.

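A usage sketch of the delayed-node idea. x.t(k) is the proposed call; the
input_node helper and the zero-padding are assumptions of mine, purely for
illustration.

    # hypothetical : y(t) = x(t) - x(t-1)
    x = input_node(train_x)    # node fed directly by the data object
    y = x - x.t(-1)            # x.t(-1) uses a data object shifted by one step
                               # ( padding the first step with 0, if we allow that )
    y_future = x.t(5)          # a node whose value at time t is x(t+5)

    print(y.next())            # iterate as usual; only the data object differs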


Non-theano transformation ( or function or whatever)
----------------------------------------------------

Maybe you want to do something in the middle of your graph that is not
supported by Theano. Let's say you have a function f which you cannot write in Theano.
You want to do something like

   W1*f( W2*data + b)

I think we can support that by doing the following :
each node has :
  * a data object that feeds data
  * a theano expression up to that point
  * the entire graph that describes the model

Let x1 = W2*data + b
up to here everything is fine ( we have a theano expression )
   dot(W2, tensor) + b,
where tensor is provided by the data object ( plus a dict of givens
and whatever else you need to compile the function).

When you apply f, you create a node that is exactly like the
data object, in the sense that it provides a new tensor and a new dict of
givens.

So x2 = W1*f( W2*data + b)
will actually point to the expression
   dot(W1, tensor)
and to the data node f(W2*data + b).

What this means is that you basically compile two theano functions t1 and t2
and evaluate t2(f(t1(data))). So every time you have a non-theano operation you
break the theano expression and start a new one.

What you lose :
  - there is no optimization or anything between t1, t2 and f ( we don't
    support that)
  - if you are running things on GPU, after t1 the data will be copied to the CPU and
    then probably back to the GPU - so it doesn't make sense anymore

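A concrete Theano sketch of the t1 / f / t2 composition described above. The
shapes, the clip-based f and the vector input are made up for illustration;
the point is only that the graph is broken at f and glued back with plain
python.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX
    rng = np.random.RandomState(0)

    W2 = theano.shared(rng.uniform(size=(20, 10)).astype(floatX), name='W2')
    b  = theano.shared(np.zeros(20, dtype=floatX), name='b')
    W1 = theano.shared(rng.uniform(size=(5, 20)).astype(floatX), name='W1')

    # t1 : the theano expression up to the non-theano function f
    data = T.vector('data')
    t1 = theano.function([data], T.dot(W2, data) + b)

    # f : an arbitrary python/numpy function that Theano cannot express
    def f(a):
        return np.clip(a, 0.0, 1.0)

    # t2 : a fresh theano expression fed by a new input tensor ( the "new data object" )
    h  = T.vector('h')
    t2 = theano.function([h], T.dot(W1, h))

    x = rng.uniform(size=(10,)).astype(floatX)
    out = t2(f(t1(x)))    # evaluates W1*f( W2*data + b ), breaking the graph at f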


Recurrent Things
----------------

I think that you can write a recurrent operation by first defining a
graph ( the recurrent relation ) :

   y_tm1 = recurrent_layer(init = zeros(50))
   x_t   = slice(x, t = 0)
   y     = loop( dotW_b(y_tm1, 50) + x_t, steps = 20)

This would basically give all the information you need to add a scan op
to the theano expression of the resulting op; it is just a different way
of writing things .. which I think is more intuitive.

You create your primitives, which are either a recurrent_layer that should
have an initial value, or a slice of some other node ( a time slice, that is).
Then you call loop, giving an expression that starts from those primitives.

Similarly you can have foldl or map or anything else.

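For reference, a minimal sketch of what the loop above would roughly compile
down to with theano.scan. The 50-dimensional shapes, the row-per-time-step
layout of x and the use of the final state are assumptions of mine.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX

    x  = T.matrix('x')     # one 50-dim input row per time step ( needs >= 20 rows )
    W  = theano.shared(np.zeros((50, 50), dtype=floatX), name='W')
    b  = theano.shared(np.zeros(50, dtype=floatX), name='b')
    y0 = T.zeros((50,))    # init = zeros(50)

    def step(x_t, y_tm1):
        # the recurrent relation : y(t) = dotW_b(y(t-1), 50) + x(t)
        return T.dot(y_tm1, W) + b + x_t

    ys, updates = theano.scan(step, sequences=x, outputs_info=y0, n_steps=20)
    fn = theano.function([x], ys[-1], updates=updates)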
Optimizer
---------

Personally I would respect the findings of the optimization committee,
and have SGD require a Node that produces some error ( which can
be omitted) and the gradients. For this I would also have the grad
function, which would actually only call T.grad.

What if you have a non-theano thing in the middle? I don't have any smart
solution besides ignoring any parameter that is below the first
non-theano node and throwing a warning.

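As a plain-Theano sketch of what SGD(train_err, grads) would have to do under
the hood. The logistic-regression cost, the shapes and the fixed learning rate
are placeholders of my own; the point is that grad() reduces to T.grad and SGD
to an updates list.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX

    x = T.matrix('x')
    y = T.ivector('y')
    W = theano.shared(np.zeros((784, 10), dtype=floatX), name='W')
    b = theano.shared(np.zeros(10, dtype=floatX), name='b')

    p_y = T.nnet.softmax(T.dot(x, W) + b)
    err = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])   # the "error node"

    params = [W, b]
    grads  = T.grad(err, params)      # the proposed grad() would essentially do this

    lr = np.asarray(0.1, dtype=floatX)
    train = theano.function([x, y], err,
                            updates=[(p, p - lr * g) for p, g in zip(params, grads)])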
Learner
-------

In my case I would not have both a predict() and an eval() method on the learner,
but just an eval(). If you want the predictions you should use the
corresponding node ( before applying the error measure ). This was
for example **out** in my first example.

Of course we could require learners to be special nodes that also have
a predict output. In that case I'm not sure what the iterator behaviour
of such a node should produce.

Granularity
-----------

Guillaume nicely pointed out that this library might be overkill,
in the sense that you have a dotW_b transform, and then you will need
a dotW_b_sparse transform and so on. Plus, every way of initializing each param
would result in many more transforms.

I don't have a perfect answer yet, but my argument goes like this :

You would have transforms for the most popular options ( dotW_b for example).
If you need something else you can always decorate a function that takes
theano arguments and produces theano arguments. More than decorating, you
can have a general apply transform that does something like :

   apply( lambda x, y, z: x*y + z, inputs = x,
          hyperparams = [(name, 2)],
          params = [(name, theano.shared(..))])

The order of the arguments in the lambda is nodes, params, hyper-params or so.
This would apply the theano expression, but it would also register the
parameters. I think you can do it such that the result of the apply is
picklable, but not the apply itself. Meaning that in the graph, the op doesn't
actually store the lambda expression but a mini theano graph.

Also names might be optional, so you can write hyperparams = [2,]

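A sketch of what the decorator route could look like. The transform decorator
below is hypothetical and minimal; it only shows one way of applying a plain
theano-level function while remembering which arguments were parameters and
hyper-parameters ( here stored on the variable's tag ).

    import numpy as np
    import theano
    import theano.tensor as T

    def transform(params=(), hyperparams=()):
        # hypothetical decorator : apply fn and record its (hyper-)parameters
        def decorate(fn):
            def wrapped(*inputs, **kwargs):
                out = fn(*inputs, **kwargs)
                out.tag.params = [kwargs[name] for name in params]
                out.tag.hyperparams = dict((name, kwargs[name]) for name in hyperparams)
                return out
            return wrapped
        return decorate

    @transform(params=['W'], hyperparams=['scale'])
    def scaled_dot(x, W=None, scale=2):
        return scale * T.dot(x, W)

    x = T.matrix('x')
    W = theano.shared(np.zeros((300, 40), dtype=theano.config.floatX), name='W')
    h = scaled_dot(x, W=W, scale=2)
    print(h.tag.params)       # [W] -- registered for the graph / optimizer to pick up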

What this way of doing things would hopefully buy you is that you do not
need to worry about most of your model ( it would be just a few macros or
subroutines).
You would do something like :

   rbm1, hidden1 = rbm_layer(data, 20)
   rbm2, hidden2 = rbm_layer(hidden1, 20)

and then the part you care about :

   hidden3 = apply( lambda x, W: T.dot(x, W), inputs = hidden2,
                    params = theano.shared(scipy.sparse_CSR(..)))

and after that you potentially still do what you did before :

   err   = cross_entropy(hidden3, target)
   grads = grad(err, err.parameters())
   ...

I do agree that some of the "transforms" that I have been writing here
and there are pretty low level, and maybe we don't need them. We might need
only somewhat higher level transforms. My hope is that for now people think
about the approach and not about all the inner details ( like what transforms
we need, and so on) and see if they are comfortable with it or not.

Do we want to think in these terms? I think it is a bit better to have your
script like that than to hack into the DBN class to change that W to be
sparse.

Anyhow Guillaume, I'm working on a better answer :)

Params and hyperparams
----------------------

I think it is obvious from what I wrote above that there is a node wrapper
around the theano expression. I haven't written down all the details of that
class. I think there should be such a wrapper around parameters and
hyper-parameters as well. By default those wrappers might not provide
any information. Later on they can provide, for hyper-params for example, a
distribution. If, when inserting your hyper-param in the graph ( i.e. when
you call a given transform), you provide the distribution, then maybe a
hyper-learner could use it to sample from it.

For parameters you might define properties like freeze. It can be true or
false. Whenever it is set to true, the param is not adapted by the optimizer.
Changing this value, like changing most hyper-params, implies recompilation
of the graph.

I would have a special class of hyper-params which don't require
recompilation of the graph. Learning rate is an example. This info is also
given by the wrapper and by how the parameter is used.

It is up to the user and the "transform" implementer to wrap params and
hyper-params correspondingly. But I don't think this is too complicated.
The apply function above has a default behaviour; maybe you would have
a fourth type of argument which is a hyper-param that doesn't require
compilation. We could find a nice name for it.

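A sketch of what such wrappers might carry. Param, HyperParam and all their
keyword names are hypothetical, purely to illustrate the information the
wrappers could expose to the optimizer and to a hyper-learner.

    import numpy as np
    import theano

    class Param(object):
        # hypothetical wrapper around a shared variable
        def __init__(self, var, freeze=False):
            self.var = var
            self.freeze = freeze           # True -> the optimizer skips it

    class HyperParam(object):
        # hypothetical wrapper around a hyper-parameter value
        def __init__(self, value, recompile=True, distribution=None):
            self.value = value
            self.recompile = recompile     # False for e.g. the learning rate
            self.distribution = distribution   # usable by a hyper-learner

    W  = Param(theano.shared(np.zeros((784, 300), dtype=theano.config.floatX)),
               freeze=False)
    lr = HyperParam(0.1, recompile=False)
    n_hidden = HyperParam(300, recompile=True,
                          distribution=('uniform_int', 100, 1000))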

How does this work?
-------------------

You always have a pointer to the entire graph. Whenever a hyper-param
changes ( or a param freezes), all regions of the graph affected get recompiled.
This is done by traversing the graph from the bottom node and constructing the
theano expression.

The function that updates / re-constructs the graph is slightly more complex
if you have non-theano functions in the graph ..

replace
-------

Replace replaces a part of the graph. The way it works, in my view, is that
if I write :

   x = x1 + x2 + x3
   y = x.replace({x2 : x5})

you would first copy the graph that is represented by x ( the params or
hyper-params are not copied) and then replace the subgraphs. I.e., x will
still point to x1+x2+x3, y will point to x1+x5+x3. Replace is not done
in place.

I think of these Node classes as something light-weight, like theano variables.

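For comparison, Theano already behaves this way at the variable level; a
minimal check with theano.clone, which copies the graph and substitutes
without touching the original ( the scalar example is mine ) :

    import theano
    import theano.tensor as T

    x1, x2, x3, x5 = T.scalars('x1', 'x2', 'x3', 'x5')
    x = x1 + x2 + x3
    y = theano.clone(x, replace={x2: x5})   # a copy of the graph with x2 swapped for x5

    f = theano.function([x1, x2, x3], x)
    g = theano.function([x1, x5, x3], y)
    print(f(1, 2, 3), g(1, 5, 3))           # 6.0 8.0 ; x itself is untouched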

reconstruct
-----------

This is something nice for DAA. It is definitely not useful for the rest.
I think, though, that it is a shame to have that transformation graph and not
be able to use it to do this. It will make life so much easier when you
do deep auto-encoders. I wouldn't put it in the core library, but I would
have it in the DAA module. The way I see it, you can either

   # generate your invertible transforms on the fly
   fn  = create_transform(lambda : , params, hyper-params )
   inv = create_transform(lambda : , params, hyper-params )
   my_transform = couple_transforms( forward = fn, inv = inv)

   # or have some already widely used such transforms in the daa submodule.


transforms
----------

In my view there will be quite a few such standard transforms. They
can be grouped by architecture, basic, sampler, optimizer and so on.

We do not need to provide all of them, just the ones we need. Researching
an architecture would actually lead to creating new such transforms in
the library.

There will definitely be a list of basic transforms in the beginning,
like :
   replace,
   search,
   get_param(name),
   get_params(..)

You can and should have something like a switch ( that, based on a
hyper-parameter, replaces a part of a graph with another or not). This is
done by re-compiling the graph. See the sketch below.

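A sketch of how such a switch might read in user code. switch here is the
proposed transform, HyperParam is the hypothetical wrapper sketched earlier,
and dotW_b / dotW_b_sparse are the transforms discussed in the Granularity
section; none of this is an existing API.

    # hypothetical : a part of the graph selected by a hyper-parameter
    use_sparse = HyperParam(False, recompile=True)

    h = switch(use_sparse,
               dotW_b_sparse(x, n = 300),   # used when use_sparse is True
               dotW_b(x, n = 300))          # used when use_sparse is False

    # flipping use_sparse re-compiles the affected region of the graph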

Constraints
-----------

Nodes can also keep track of constraints.

When you write

   y = add_constraint(x, sum(x**2))

y is the same node as x, just that it also links to this second graph that
computes the constraint. Whenever you call grad, grad will also add to the
cost all constraints attached to the graph.

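Spelled out on the first example, that means something like the following
( still the proposed API; the 0.1 weight on the penalty is my own ) :

   # hypothetical : attach an L2 penalty to the first hidden layer
   h1 = sigmoid(dotW_b(train_x, n = 300))
   h1 = add_constraint(h1, 0.1 * sum(h1**2))

   train_err = cross_entropy(out, train_y)
   grads = grad(train_err, train_err.parameters())
   # grad differentiates train_err + 0.1*sum(h1**2) : the attached constraint
   # is added to the cost automatically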