DataLearn: How to plug Datasets & Learners together?
======================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader?]

High-Level Objectives
---------------------

* Simple ML experiments should be simple to write
* More complex / advanced scenarios should be possible without being forced
  to work "outside" of this framework
* Computations should be optimized whenever possible
* Existing code (in any language) should be "wrappable" within this
  framework
* It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano to be able to take advantage of its efficient
computations. The general idea is that if we chain multiple processing
elements (think e.g. of a feature selection step followed by a PCA projection,
then a rescaling within a fixed bounded interval), the overall transformation
from input to output data can be represented by a Theano symbolic graph. When
one wants to access the actual numeric data, a function is compiled so as to
do these computations efficiently.
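
As a rough illustration of this idea (a sketch only: the column selection and
projection matrix below are made up for the example), such a chain of
transformations could be expressed as a single Theano graph and compiled once:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor

    x = theano.tensor.dmatrix('x')  # symbolic input data (samples x features)
    # Feature selection: keep only the first three columns (made up).
    selected = x[:, 0:3]
    # PCA projection onto a (made-up) precomputed projection matrix.
    pca_proj = theano.shared(numpy.random.randn(3, 2), name='pca_proj')
    projected = theano.tensor.dot(selected, pca_proj)
    # Rescaling within the fixed interval [0, 1].
    low, high = projected.min(), projected.max()
    rescaled = (projected - low) / (high - low)

    # A function is compiled only when numeric values are needed, which lets
    # Theano optimize the whole input-to-output graph at once.
    transform = theano.function([x], rescaled)
    numeric_output = transform(numpy.random.randn(5, 4))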

We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how this Theano-based implementation could be achieved
exactly. For this purpose, in the following, let us assume that a dataset is
simply a matrix whose rows represent individual samples, and columns
individual features. How to handle field names, non-tensor-like data, etc. is
a very important topic that is not yet discussed in this file.

A question we did not really discuss is whether datasets should be Theano
Variables. The advantage would be that they would fit directly within the
Theano framework, which may allow high level optimizations on data
transformations. However, we would lose the ability to combine Theano
expressions coded in individual datasets into a single graph. Currently, we
instead consider that a dataset has a member that is a Theano variable, and
this variable represents the data stored in the dataset. The same is done for
individual data samples.
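
As a minimal sketch of this convention (the class and attribute names are
hypothetical, not a settled API), a dataset wrapping a numpy matrix could
expose its data through such a ``variable`` member:

.. code-block:: python

    import numpy
    import theano

    class MatrixDataset(object):
        """Hypothetical dataset wrapping a (samples x features) numpy matrix."""

        def __init__(self, data):
            # The numeric data lives in a Theano shared variable; the
            # ``variable`` member is the symbolic handle on this data.
            self.variable = theano.shared(numpy.asarray(data), name='data')

        def __len__(self):
            # Number of samples (rows).
            return self.variable.get_value(borrow=True).shape[0]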

One issue with this approach is illustrated by the following example. Imagine
we want to iterate on samples in a dataset and do something with their
numeric value. We would want the code to be as close as possible to:

.. code-block:: python

    for sample in dataset:
        do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample
contains a ``variable`` member which is the variable representing this
sample's data):

.. code-block:: python

    def numeric_value(self):
        if self.function is None:
            # Compile function to output the numeric value stored in this
            # sample's variable.
            self.function = theano.function([], self.variable)
        return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

.. code-block:: python

    def numeric_value(self):
        if self.function_storage[0] is None:
            # Compile function to output the numeric value stored in this
            # sample's variable. This function takes as input the index of
            # the sample in the dataset, and is shared among all samples.
            self.function_storage[0] = theano.function(
                [self.symbolic_index], self.variable)
        return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by the action of
iterating over the dataset share the same ``function_storage``,
``symbolic_index`` and ``variable``: the first time we try to access the
numeric value of some sample, a function is compiled that takes the index as
input and outputs the numeric value of the sample at that index. The only
difference between samples is thus that they are given a different numeric
value for the index (``numeric_index``).
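
For instance, the iteration could be implemented along the following lines (a
sketch only; it assumes a dataset object like the hypothetical
``MatrixDataset`` above, i.e. anything exposing a ``variable`` member and
``__len__``, and Python 2's ``xrange`` as in the other examples):

.. code-block:: python

    import theano
    import theano.tensor

    class Sample(object):
        """Hypothetical sample sharing its compiled function with its siblings."""

        def __init__(self, variable, symbolic_index,
                     function_storage, numeric_index):
            self.variable = variable
            self.symbolic_index = symbolic_index
            self.function_storage = function_storage
            self.numeric_index = numeric_index

        def numeric_value(self):
            if self.function_storage[0] is None:
                # Compiled once, then reused by every sibling sample.
                self.function_storage[0] = theano.function(
                    [self.symbolic_index], self.variable)
            return self.function_storage[0](self.numeric_index)

    def iterate(dataset):
        """Yield samples that all share one symbolic index and one function slot."""
        symbolic_index = theano.tensor.lscalar('index')
        function_storage = [None]
        variable = dataset.variable[symbolic_index]
        for numeric_index in xrange(len(dataset)):
            yield Sample(variable, symbolic_index,
                         function_storage, numeric_index)

With such an iterator, the ``for sample in dataset`` loop shown earlier
compiles a single Theano function regardless of how many samples are visited.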

Another way to obtain the same result is to actually let the user take care of
compiling the function. It would allow the user to really control what is
being compiled, at the cost of having to write more code:

.. code-block:: python

    symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
    get_sample = theano.function([symbolic_index],
                                 dataset[symbolic_index].variable)
    for numeric_index in xrange(len(dataset)):
        do_something_with(get_sample(numeric_index))

Note that although the above example focused on how to iterate over a dataset,
it can be cast into a more generic problem, where some data (either dataset or
sample) is the result of some transformation applied to other data, which is
parameterized by parameters p1, p2, ..., pN (in the above example, we were
considering a sample that was obtained by taking the p1-th element in a
dataset). If we use different values for a subset Q of the parameters but keep
other parameters fixed, we would probably want to compile a single function
that takes as input all parameters in Q, while the other parameters are fixed.
Ideally it would be nice to let the user take control over what is being
compiled, while leaving the option of using a sensible default behavior for
those who do not want to worry about it. How to achieve this is still to be
determined.
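
One existing Theano mechanism that a sensible default behavior could build on
is the ``givens`` argument of ``theano.function``, which clamps some variables
to fixed values while the others remain inputs. A standalone sketch (the
parameters ``p1`` and ``p2`` below are purely illustrative):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor

    # A transformation parameterized by p1 (a row index) and p2 (a scaling
    # factor), applied to some symbolic data.
    data = theano.tensor.dmatrix('data')
    p1 = theano.tensor.lscalar('p1')
    p2 = theano.tensor.dscalar('p2')
    transformed = data[p1] * p2

    # Here Q = {p1}: p1 varies across calls while data and p2 stay fixed.
    # A single function is compiled for all values of p1.
    fixed_data = theano.shared(numpy.random.randn(100, 10), name='fixed_data')
    fixed_p2 = theano.shared(numpy.asarray(2.0), name='fixed_p2')
    f = theano.function([p1], transformed,
                        givens={data: fixed_data, p2: fixed_p2})
    results = [f(i) for i in xrange(100)]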

What About Learners?
--------------------

The discussion above only mentioned datasets, but not learners. The learning
part of a learner is not a main concern (currently). What matters most w.r.t.
what was discussed above is how a learner takes as input a dataset and outputs
another dataset that can be used with the dataset API.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as a
``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
and the classification error). We would want to be able to build a dataset
that contains some of these quantities computed on each sample in the input
dataset.

The Neural Network code would then look something like this:

.. code-block:: python

    class NeuralNetwork(Learner):

        @datalearn(..)
        def compute_prediction(self, sample):
            return softmax(theano.tensor.dot(self.weights, sample.input))

        @datalearn(..)
        def compute_nll(self, sample):
            return - log(self.compute_prediction(sample)[sample.target])

        @datalearn(..)
        def compute_penalized_nll(self, sample):
            return (self.compute_nll(sample) +
                    theano.tensor.sum(self.weights**2))

        @datalearn(..)
        def compute_class_error(self, sample):
            probabilities = self.compute_prediction(sample)
            predicted_class = theano.tensor.argmax(probabilities)
            return predicted_class != sample.target

        @datalearn(..)
        def compute_cost(self, sample):
            return theano.tensor.concatenate([
                self.compute_penalized_nll(sample),
                self.compute_nll(sample),
                self.compute_class_error(sample),
                ])

The ``@datalearn`` decorator would be responsible for allowing such a Learner
to be used e.g. like this:

.. code-block:: python

    nnet = NeuralNetwork()
    predict_dataset = nnet.compute_prediction(dataset)
    for sample in dataset:
        predict_sample = nnet.compute_prediction(sample)
    predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
    multiple_fields_dataset = ConcatDataSet([
        nnet.compute_prediction(dataset),
        nnet.compute_cost(dataset),
        ])

In the code above, if one wants to obtain the numeric value of an element of
``multiple_fields_dataset``, the Theano function being compiled would be able
to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.

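This relies on a property Theano already has: when one function is compiled
with several outputs, their common sub-expressions are computed only once per
call. A standalone sketch of this behavior, with made-up variables unrelated
to the API above:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor

    x = theano.tensor.dvector('x')
    w = theano.shared(numpy.random.randn(3, 5), name='w')
    target = theano.tensor.lscalar('target')

    # ``activation`` plays the role of ``prediction``: it appears in both
    # outputs below, but is evaluated only once per call.
    activation = theano.tensor.dot(w, x)
    nll_like = -activation[target]
    penalized = nll_like + theano.tensor.sum(w ** 2)

    f = theano.function([x, target], [activation, penalized])
    prediction_value, cost_value = f(numpy.random.randn(5), 1)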