# HG changeset patch
# User Olivier Delalleau
# Date 1289511278 18000
# Node ID ffa2932a8cba46f7976af28a952d5c01a5287476
# Parent  26644a775a0d0bf33d826ce9feb5c08be1bd18c7
Added datalearn committee discussion file

diff -r 26644a775a0d -r ffa2932a8cba doc/v2_planning/datalearn.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/datalearn.txt	Thu Nov 11 16:34:38 2010 -0500
@@ -0,0 +1,185 @@

DataLearn: How to plug Datasets & Learners together?
====================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader?]

High-Level Objectives
---------------------

  * Simple ML experiments should be simple to write
  * More complex / advanced scenarios should be possible without being forced
    to work "outside" of this framework
  * Computations should be optimized whenever possible
  * Existing code (in any language) should be "wrappable" within this
    framework
  * It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano to take advantage of its efficient computations.
The general idea is that if we chain multiple processing elements (think
e.g. of a feature selection step followed by a PCA projection, then a
rescaling within a fixed bounded interval), the overall transformation from
input to output data can be represented by a single Theano symbolic graph.
When one wants to access the actual numeric data, a function is compiled to
perform these computations efficiently.
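As an illustration, here is a minimal sketch of this idea in plain Theano.
The three steps are stand-ins (feature selection is a column slice, the PCA
projection is faked with an identity matrix, and the rescaling is a clip);
only the chaining into a single compiled graph is the point:

    .. code-block:: python

        import theano
        import theano.tensor as T

        x = T.matrix('x')                       # symbolic dataset: rows = samples
        selected = x[:, 0:10]                   # stand-in for feature selection
        projected = T.dot(selected, T.eye(10))  # stand-in for a PCA projection
        rescaled = T.clip(projected, -1., 1.)   # rescaling within a fixed interval

        # The whole chain lives in one symbolic graph, so a single compiled
        # function evaluates it, and Theano may optimize across all three steps.
        transform = theano.function([x], rescaled)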
We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how this Theano-based implementation could be achieved
exactly. For this purpose, in the following, let us assume that a dataset is
simply a matrix whose rows represent individual samples and whose columns
represent individual features. How to handle field names, non-tensor-like
data, etc. is a very important topic that is not yet discussed in this file.

A question we did not really discuss is whether datasets should themselves be
Theano Variables. The advantage would be that they would fit directly within
the Theano framework, which may allow high-level optimizations on data
transformations. However, we would lose the ability to combine Theano
expressions coded in individual datasets into a single graph. Currently, we
instead consider that a dataset has a member that is a Theano variable, and
this variable represents the data stored in the dataset. The same is done for
individual data samples.

One issue with this approach is illustrated by the following example. Imagine
we want to iterate on samples in a dataset and do something with their
numeric value. We would want the code to be as close as possible to:

    .. code-block:: python

        for sample in dataset:
            do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample
contains a ``variable`` member which is the variable representing this
sample's data):

    .. code-block:: python

        def numeric_value(self):
            if self.function is None:
                # Compile a function that outputs the numeric value stored
                # in this sample's variable.
                self.function = theano.function([], self.variable)
            return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

    .. code-block:: python

        def numeric_value(self):
            if self.function_storage[0] is None:
                # Compile a function that outputs the numeric value stored
                # in this sample's variable. It takes as input the index of
                # the sample in the dataset, and is shared among all samples.
                self.function_storage[0] = theano.function(
                        [self.symbolic_index], self.variable)
            return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by iterating over the
dataset share the same ``function_storage``, ``symbolic_index`` and
``variable``: the first time we try to access the numeric value of some
sample, a function is compiled that takes the index as input and outputs the
variable. The only difference between samples is thus that they are given a
different numeric value for the index (``numeric_index``).

Another way to obtain the same result is to let the user take care of
compiling the function. This would give the user full control over what is
being compiled, at the cost of having to write more code:

    .. code-block:: python

        symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
        get_sample = theano.function([symbolic_index],
                                     dataset[symbolic_index].variable)
        for numeric_index in xrange(len(dataset)):
            do_something_with(get_sample(numeric_index))

Note that although the above example focused on how to iterate over a
dataset, it can be cast into a more generic problem: some data (either a
dataset or a sample) is the result of a transformation applied to other data,
parameterized by parameters p1, p2, ..., pN (in the above example, a sample
obtained by taking the p1-th element in a dataset). If we use different
values for a subset Q of the parameters while keeping the other parameters
fixed, we would probably want to compile a single function that takes as
input all parameters in Q, with the other parameters fixed inside the graph.
Ideally, the user would be able to control what is being compiled, while a
sensible default behavior remains available for those who do not want to
worry about it. How to achieve this is still to be determined.
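To make the generic problem concrete, here is a sketch in plain Theano
(names and shapes are made up for illustration): a transformation depends on
two parameters, a sample index ``p1`` that varies (i.e. belongs to Q) and a
scale factor ``p2`` that is held fixed, so one compiled function covers all
values of ``p1``:

    .. code-block:: python

        import numpy
        import theano
        import theano.tensor as T

        data = theano.shared(numpy.random.randn(100, 5))
        p1 = T.iscalar('p1')   # parameter in Q: passed as input to the function
        p2 = 2.0               # fixed parameter: a constant inside the graph

        # A single compiled function handles every value of p1, while p2 is
        # baked in; changing p2 would require compiling a new function.
        get_scaled_sample = theano.function([p1], data[p1] * p2)

        for numeric_index in xrange(3):
            print get_scaled_sample(numeric_index)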
What About Learners?
--------------------

The discussion above only mentioned datasets, not learners. The learning part
of a learner is not a main concern (currently). What matters most w.r.t. what
was discussed above is how a learner takes a dataset as input and outputs
another dataset that can be used with the dataset API.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as a
``cost`` vector (whose elements correspond to the penalized NLL, the NLL
alone, and the classification error). We would want to be able to build a
dataset that contains some of these quantities computed on each sample in the
input dataset.

The Neural Network code would then look something like this:

    .. code-block:: python

        class NeuralNetwork(Learner):

            @datalearn(..)
            def compute_prediction(self, sample):
                return softmax(theano.tensor.dot(self.weights, sample.input))

            @datalearn(..)
            def compute_nll(self, sample):
                return -theano.tensor.log(
                        self.compute_prediction(sample)[sample.target])

            @datalearn(..)
            def compute_penalized_nll(self, sample):
                return (self.compute_nll(sample) +
                        theano.tensor.sum(self.weights**2))

            @datalearn(..)
            def compute_class_error(self, sample):
                probabilities = self.compute_prediction(sample)
                predicted_class = theano.tensor.argmax(probabilities)
                return predicted_class != sample.target

            @datalearn(..)
            def compute_cost(self, sample):
                return theano.tensor.concatenate([
                        self.compute_penalized_nll(sample),
                        self.compute_nll(sample),
                        self.compute_class_error(sample),
                        ])

The ``@datalearn`` decorator would be responsible for allowing such a Learner
to be used e.g. like this:

    .. code-block:: python

        nnet = NeuralNetwork()
        predict_dataset = nnet.compute_prediction(dataset)
        for sample in dataset:
            predict_sample = nnet.compute_prediction(sample)
        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
        multiple_fields_dataset = ConcatDataSet([
                nnet.compute_prediction(dataset),
                nnet.compute_cost(dataset),
                ])

In the code above, if one wants to obtain the numeric value of an element of
``multiple_fields_dataset``, the Theano function being compiled would be able
to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.
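To illustrate why compiling both fields together can pay off, here is a
small self-contained sketch (made-up shapes, a hand-rolled softmax, and none
of the dataset API above is assumed): the two outputs share the same
symbolic subgraph, so the prediction is computed only once per call:

    .. code-block:: python

        import numpy
        import theano
        import theano.tensor as T

        rng = numpy.random.RandomState(0)
        weights = theano.shared(rng.randn(3, 10))              # 3 classes, 10 features
        inputs = theano.shared(rng.randn(5, 10))               # 5 samples
        targets = theano.shared(numpy.array([0, 1, 2, 0, 1]))  # class labels

        index = T.iscalar('index')
        activation = T.dot(weights, inputs[index])
        prediction = T.exp(activation) / T.exp(activation).sum()  # softmax by hand
        nll = -T.log(prediction[targets[index]])

        # Both outputs come from one compiled function; the shared
        # 'prediction' subgraph is evaluated only once per call.
        f = theano.function([index], [prediction, nll])
        prediction_value, nll_value = f(0)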