DataLearn: How to plug Datasets & Learners together?
====================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader?]

High-Level Objectives
---------------------

* Simple ML experiments should be simple to write
* More complex / advanced scenarios should be possible without being forced
  to work "outside" of this framework
* Computations should be optimized whenever possible
* Existing code (in any language) should be "wrappable" within this
  framework
* It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano to take advantage of its efficient computations.
The general idea is that if we chain multiple processing elements (think
e.g. of a feature selection step followed by a PCA projection, then a
rescaling within a fixed bounded interval), the overall transformation from
input to output data can be represented by a Theano symbolic graph. When one
wants to access the actual numeric data, a function is compiled to perform
these computations efficiently.

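As a minimal sketch of this idea (not the proposed API; the projection
matrix ``W_pca`` and all shapes are made up for illustration), such a chain
maps onto a single symbolic graph compiled in one call:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random.RandomState(0)
    W_pca = theano.shared(rng.randn(5, 3))  # hypothetical projection matrix

    x = T.matrix('x')                      # rows = samples, cols = features
    selected = x[:, 0:5]                   # feature selection: keep 5 columns
    projected = T.dot(selected, W_pca)     # PCA projection step
    rescaled = T.clip(projected, -1., 1.)  # rescale into a bounded interval

    # A single function is compiled for the whole chain, so Theano can
    # optimize across all three steps at once.
    transform = theano.function([x], rescaled)
    numeric_output = transform(rng.randn(4, 10))
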
We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how exactly this Theano-based implementation could be
achieved. For this purpose, in the following, let us assume that a dataset
is simply a matrix whose rows represent individual samples, and whose
columns represent individual features. How to handle field names,
non-tensor-like data, etc. is a very important topic that is not yet
discussed in this file.

A question we did not really discuss is whether datasets should be Theano
Variables. The advantage would be that they would fit directly within the
Theano framework, which may allow high-level optimizations on data
transformations. However, we would lose the ability to combine Theano
expressions coded in individual datasets into a single graph. Currently, we
instead consider that a dataset has a member that is a Theano variable, and
this variable represents the data stored in the dataset. The same is done
for individual data samples.

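As a rough sketch of this convention (the class and member names here are
hypothetical), a dataset wrapping a numpy array might simply expose the
corresponding Theano variable:

.. code-block:: python

    import numpy
    import theano

    class ArrayDataset(object):
        """Hypothetical dataset: stores numeric data and exposes
        ``self.variable``, a Theano variable representing that data."""

        def __init__(self, data):
            self.data = numpy.asarray(data)            # numeric storage
            self.variable = theano.shared(self.data)   # symbolic handle

        def __len__(self):
            return len(self.data)
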
One issue with this approach is illustrated by the following example.
Imagine we want to iterate on samples in a dataset and do something with
their numeric value. We would want the code to be as close as possible to:

.. code-block:: python

    for sample in dataset:
        do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample
contains a ``variable`` member which is the variable representing this
sample's data):

.. code-block:: python

    def numeric_value(self):
        if self.function is None:
            # Compile a function to output the numeric value stored in
            # this sample's variable.
            self.function = theano.function([], self.variable)
        return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

.. code-block:: python

    def numeric_value(self):
        if self.function_storage[0] is None:
            # Compile a function to output the numeric value stored in
            # this sample's variable. This function takes as input the
            # index of the sample in the dataset, and is shared among
            # all samples.
            self.function_storage[0] = theano.function(
                [self.symbolic_index], self.variable)
        return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by the action of
iterating over the dataset share the same ``function_storage``,
``symbolic_index`` and ``variable``: the first time we try to access the
numeric value of some sample, a function is compiled that takes the index
as input and outputs the variable. The only difference between samples is
thus that they are given a different numeric value for the index
(``numeric_index``).

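For instance, the dataset's iterator could be where these shared objects
are created. In this hypothetical sketch, ``Sample`` simply stores the four
arguments it is given, and its ``numeric_value`` is implemented as above:

.. code-block:: python

    def __iter__(self):
        # All samples yielded below share the same function storage,
        # symbolic index and symbolic sample variable; only their
        # ``numeric_index`` differs.
        function_storage = [None]
        symbolic_index = theano.tensor.iscalar()
        variable = self.variable[symbolic_index]
        for numeric_index in xrange(len(self)):
            yield Sample(function_storage, symbolic_index, variable,
                         numeric_index)
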
Another way to obtain the same result is to actually let the user take care
of compiling the function. This would allow the user to really control what
is being compiled, at the cost of having to write more code:

.. code-block:: python

    symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
    get_sample = theano.function([symbolic_index],
                                 dataset[symbolic_index].variable)
    for numeric_index in xrange(len(dataset)):
        do_something_with(get_sample(numeric_index))

Note that although the above example focused on how to iterate over a
dataset, it can be cast into a more generic problem, where some data
(either a dataset or a sample) is the result of some transformation applied
to other data, parameterized by parameters p1, p2, ..., pN (in the above
example, we were considering a sample obtained by taking the p1-th element
of a dataset). If we use different values for a subset Q of the parameters
but keep the other parameters fixed, we would probably want to compile a
single function that takes as input all parameters in Q, while the other
parameters stay fixed. Ideally it would be nice to let the user take
control of what is being compiled, while leaving the option of a sensible
default behavior for those who do not want to worry about it. How to
achieve this is still to be determined.

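As a minimal sketch of this situation (the data and parameter values below
are made up), p1 varies across calls and thus becomes a function input,
while p2 is simply baked into the graph at compilation time:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    data = theano.shared(numpy.arange(20.).reshape(5, 4))

    p1 = T.iscalar('p1')    # in Q: varies, so it is a function input
    p2 = 0.5                # not in Q: fixed when the function is compiled

    transformed_sample = data[p1] * p2
    get_sample = theano.function([p1], transformed_sample)

    samples = [get_sample(i) for i in xrange(5)]
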
What About Learners?
--------------------

The discussion above only mentioned datasets, not learners. The learning
part of a learner is not a main concern (currently). What matters most
w.r.t. what was discussed above is how a learner takes a dataset as input
and outputs another dataset that can be used with the dataset API.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as
a ``cost`` vector (whose elements correspond to the penalized NLL, the NLL
alone, and the classification error). We would want to be able to build a
dataset that contains some of these quantities computed on each sample in
the input dataset.

The Neural Network code would then look something like this:

.. code-block:: python

    class NeuralNetwork(Learner):

        @datalearn(..)
        def compute_prediction(self, sample):
            return theano.tensor.nnet.softmax(
                theano.tensor.dot(self.weights, sample.input))

        @datalearn(..)
        def compute_nll(self, sample):
            return -theano.tensor.log(
                self.compute_prediction(sample)[sample.target])

        @datalearn(..)
        def compute_penalized_nll(self, sample):
            return (self.compute_nll(sample) +
                    theano.tensor.sum(self.weights**2))

        @datalearn(..)
        def compute_class_error(self, sample):
            probabilities = self.compute_prediction(sample)
            predicted_class = theano.tensor.argmax(probabilities)
            return theano.tensor.neq(predicted_class, sample.target)

        @datalearn(..)
        def compute_cost(self, sample):
            return theano.tensor.concatenate([
                self.compute_penalized_nll(sample),
                self.compute_nll(sample),
                self.compute_class_error(sample),
            ])

The ``@datalearn`` decorator would be responsible for allowing such a
Learner to be used e.g. like this:

.. code-block:: python

    nnet = NeuralNetwork()
    predict_dataset = nnet.compute_prediction(dataset)
    for sample in dataset:
        predict_sample = nnet.compute_prediction(sample)
    predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
    multiple_fields_dataset = ConcatDataSet([
        nnet.compute_prediction(dataset),
        nnet.compute_cost(dataset),
    ])

In the code above, if one wants to obtain the numeric value of an element
of ``multiple_fields_dataset``, the Theano function being compiled would be
able to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.

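As a rough illustration of why this works (with hypothetical shapes and
names), compiling several outputs in a single Theano function lets the
common sub-graph, here the prediction, be computed only once:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    weights = theano.shared(numpy.random.randn(3, 10))
    x = T.dvector('input')
    target = T.iscalar('target')

    # ``prediction`` is the sub-graph shared by all three outputs.
    prediction = T.nnet.softmax(T.dot(weights, x))[0]
    nll = -T.log(prediction[target])
    class_error = T.neq(T.argmax(prediction), target)

    # One compiled function: Theano computes ``prediction`` once and
    # reuses it for ``nll`` and ``class_error``.
    compute_all = theano.function([x, target],
                                  [prediction, nll, class_error])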