diff doc/v2_planning/datalearn.txt @ 1357:ffa2932a8cba

Added datalearn committee discussion file
author Olivier Delalleau <delallea@iro>
date Thu, 11 Nov 2010 16:34:38 -0500
parents
children 5db730bb0e8e
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/datalearn.txt	Thu Nov 11 16:34:38 2010 -0500
@@ -0,0 +1,185 @@
+DataLearn: How to plug Datasets & Learners together?
+=====================================================
+
+Participants
+------------
+- Yoshua
+- Razvan
+- Olivier D [leader?]
+
+High-Level Objectives
+---------------------
+
+   * Simple ML experiments should be simple to write
+   * More complex / advanced scenarios should be possible without being forced
+     to work "outside" of this framework
+   * Computations should be optimized whenever possible
+   * Existing code (in any language) should be "wrappable" within this
+     framework
+   * It should be possible to replace [parts of] this framework with C++ code
+
+Theano-Like Data Flow
+---------------------
+
+We want to rely on Theano in order to take advantage of its efficient
+computations. The general idea is that if we chain multiple processing
+elements (think e.g. of a feature selection step followed by a PCA projection,
+then a rescaling within a fixed bounded interval), the overall transformation
+from input to output data can be represented by a Theano symbolic graph. When
+one wants to access the actual numeric data, a function is compiled so as to
+do these computations efficiently.
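+
+As a purely illustrative, standalone sketch (the step names, shapes and the
+``proj`` input below are made up for this example and are not part of any
+proposed API), such a chain could be built and compiled as follows:
+
+    .. code-block:: python
+
+        import numpy
+        import theano
+        import theano.tensor as T
+
+        # Symbolic dataset: rows are samples, columns are features.
+        data = T.matrix('data')
+        proj = T.matrix('proj')   # projection matrix for the "PCA-like" step
+
+        selected = data[:, 0:5]            # hypothetical feature selection
+        projected = T.dot(selected, proj)  # projection step
+        rescaled = T.tanh(projected)       # rescaling into a bounded interval
+
+        # The whole chain is a single symbolic graph; it is compiled only
+        # when numeric values are needed, so Theano can optimize across all
+        # three steps at once.
+        f = theano.function([data, proj], rescaled)
+        output = f(numpy.random.rand(100, 10), numpy.random.rand(5, 3))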
+
+We discussed some specific API options for datasets and learners, which will
+be added to this file in the future, but a core question that we feel should
+be addressed first is how this Theano-based implementation could be achieved
+exactly. For this purpose, in the following, let us assume that a dataset is
+simply a matrix whose rows represent individual samples, and columns
+individual features. How to handle field names, non-tensor-like data, etc. is
+a very important topic that is not yet discussed in this file.
+
+A question we did not really discuss is whether datasets should be Theano
+Variables. The advantage would be that they would fit directly within the
+Theano framework, which may allow high level optimizations on data
+transformations. However, we would lose the ability to combine Theano
+expressions coded in individual datasets into a single graph. Currently, we
+instead consider that a dataset has a member that is a Theano variable, and
+this variable represents the data stored in the dataset. The same is done for
+individual data samples.
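+
+A minimal sketch of what this could look like (``MatrixDataset``, ``Sample``
+and their members are placeholder names, not a settled API):
+
+    .. code-block:: python
+
+        import numpy
+        import theano
+
+        class Sample(object):
+            """Hypothetical sample: its data is a Theano expression."""
+            def __init__(self, variable):
+                self.variable = variable
+
+        class MatrixDataset(object):
+            """Hypothetical dataset: rows are samples, columns features."""
+            def __init__(self, data):
+                self.data = numpy.asarray(data)
+                # Symbolic view of the stored data; building expressions
+                # from it does not trigger any computation.
+                self.variable = theano.shared(self.data)
+
+            def __getitem__(self, index):
+                # ``index`` may itself be symbolic, in which case the
+                # sample's data is an expression, not a numeric value.
+                return Sample(self.variable[index])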
+
+One issue with this approach is illustrated by the following example. Imagine
+we want to iterate on samples in a dataset and do something with their
+numeric value. We would want the code to be as close as possible to:
+
+    .. code-block:: python
+
+        for sample in dataset:
+            do_something_with(sample.numeric_value())
+
+A naive implementation of the sample API could be (assuming each sample
+contains a ``variable`` member which is the variable representing this
+sample's data):
+
+    .. code-block:: python
+
+        def numeric_value(self):
+            if self.function is None:
+                # Compile function to output the numeric value stored in this
+                # sample's variable.
+                self.function = theano.function([], self.variable)
+            return self.function()
+
+However, this is not a good idea, because it would trigger a new function
+compilation for each sample. Instead, we would want something like this:
+
+    .. code-block:: python
+
+        def numeric_value(self):
+            if self.function_storage[0] is None:
+                # Compile function to output the numeric value stored in this
+                # sample's variable. This function takes as input the index of
+                # the sample in the dataset, and is shared among all samples.
+                self.function_storage[0] = theano.function(
+                                        [self.symbolic_index], self.variable)
+            return self.function_storage[0](self.numeric_index)
+
+In the code above, we assume that all samples created by iterating over the
+dataset share the same ``function_storage``, ``symbolic_index`` and
+``variable``: the first time the numeric value of some sample is accessed, a
+single function is compiled that takes the index as input and outputs the
+corresponding data. The only difference between samples is thus the numeric
+value they provide for this index (``numeric_index``).
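+
+For concreteness, and continuing the hypothetical ``MatrixDataset`` /
+``Sample`` sketch given earlier (further assuming that ``Sample`` also
+stores the shared symbolic index, the shared function cache and its own
+numeric index), the dataset could set up this sharing in its ``__iter__``
+method:
+
+    .. code-block:: python
+
+        def __iter__(self):
+            # Everything below except ``numeric_index`` is created once and
+            # shared by all samples yielded by this iterator.
+            symbolic_index = theano.tensor.iscalar('index')
+            variable = self.variable[symbolic_index]
+            function_storage = [None]
+            for numeric_index in xrange(len(self.data)):
+                yield Sample(variable, symbolic_index, numeric_index,
+                             function_storage)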
+
+Another way to obtain the same result is to let the user take care of
+compiling the function. This gives the user full control over what is being
+compiled, at the cost of having to write more code:
+
+    .. code-block:: python
+
+        symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
+        get_sample = theano.function([symbolic_index],
+                                     dataset[symbolic_index].variable)
+        for numeric_index in xrange(len(dataset)):
+            do_something_with(get_sample(numeric_index))
+
+Note that although the above example focused on how to iterate over a
+dataset, it can be cast as a more generic problem, where some data (either a
+dataset or a sample) is the result of a transformation applied to other
+data, parameterized by parameters p1, p2, ..., pN (in the above example, the
+sample was obtained by taking the p1-th element of a dataset). If we use
+different values for a subset Q of the parameters while keeping the others
+fixed, we would probably want to compile a single function that takes all
+parameters in Q as inputs, with the remaining parameters fixed at
+compilation time. Ideally the user should be able to take control of what is
+being compiled, while a sensible default behavior remains available for
+those who do not want to worry about it. How to achieve this is still to be
+determined.
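+
+As a rough sketch of this idea in plain Theano (the transformation and the
+parameter names below are invented for this example), compiling a function
+over the subset Q = {p1} while p2 stays fixed could look like:
+
+    .. code-block:: python
+
+        import numpy
+        import theano
+        import theano.tensor as T
+
+        # A transformation parameterized by p1 (row index) and p2 (scale).
+        data = theano.shared(numpy.random.rand(50, 10))
+        p1 = T.iscalar('p1')
+        p2 = T.dscalar('p2')
+        transformed = data[p1] * p2
+
+        # Q = {p1, p2}: both parameters are inputs of the compiled function.
+        f_both = theano.function([p1, p2], transformed)
+
+        # Q = {p1}: p2 is fixed to a constant baked into the graph, and the
+        # resulting function is reused for every value of p1.
+        f_p1 = theano.function([p1], data[p1] * 2.0)
+
+        for i in xrange(3):
+            print f_p1(i), f_both(i, 2.0)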
+
+What About Learners?
+--------------------
+
+The discussion above only mentioned datasets, not learners. The learning
+part of a learner is not a main concern (currently). What matters most with
+respect to the discussion above is how a learner takes a dataset as input
+and outputs another dataset that can be used with the dataset API.
+
+A Learner may be able to compute various things. For instance, a Neural
+Network may output a ``prediction`` vector (whose elements correspond to
+estimated probabilities of each class in a classification task), as well as a
+``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
+and the classification error). We would want to be able to build a dataset
+that contains some of these quantities computed on each sample in the input
+dataset.
+
+The Neural Network code would then look something like this:
+
+    .. code-block:: python
+
+        class NeuralNetwork(Learner):
+
+            @datalearn(..)
+            def compute_prediction(self, sample):
+                return theano.tensor.nnet.softmax(
+                        theano.tensor.dot(self.weights, sample.input))
+
+            @datalearn(..)
+            def compute_nll(self, sample):
+                return -theano.tensor.log(
+                        self.compute_prediction(sample)[sample.target])
+
+            @datalearn(..)
+            def compute_penalized_nll(self, sample):
+                return (self.compute_nll(sample) +
+                        theano.tensor.sum(self.weights**2))
+
+            @datalearn(..)
+            def compute_class_error(self, sample):
+                probabilities = self.compute_prediction(sample)
+                predicted_class = theano.tensor.argmax(probabilities)
+                return predicted_class != sample.target
+
+            @datalearn(..)
+            def compute_cost(self, sample):
+                # Stack the three scalar costs into a single vector
+                # (``concatenate`` cannot join 0-d tensors).
+                return theano.tensor.stack(
+                        self.compute_penalized_nll(sample),
+                        self.compute_nll(sample),
+                        self.compute_class_error(sample),
+                        )
+            
+The ``@datalearn`` decorator would be responsible for allowing such a Learner
+to be used e.g. like this:
+
+    .. code-block:: python
+
+        nnet = NeuralNetwork()
+        predict_dataset = nnet.compute_prediction(dataset)
+        for sample in dataset:
+            predict_sample = nnet.compute_prediction(sample)
+        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
+        multiple_fields_dataset = ConcatDataSet([
+                nnet.compute_prediction(dataset),
+                nnet.compute_cost(dataset),
+                ])
+        
+In the code above, if one wants to obtain the numeric value of an element of
+``multiple_fields_dataset``, the Theano function being compiled would be able
+to optimize computations so that the simultaneous computation of
+``prediction`` and ``cost`` is done efficiently.
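+
+The kind of optimization alluded to above is something Theano already
+performs when several outputs of a single compiled function share
+sub-expressions; a standalone illustration (with made-up shapes and a
+hard-coded target class):
+
+    .. code-block:: python
+
+        import numpy
+        import theano
+        import theano.tensor as T
+
+        x = T.dmatrix('x')
+        W = theano.shared(numpy.random.rand(10, 3))
+
+        prediction = T.nnet.softmax(T.dot(x, W))  # shared sub-expression
+        nll = -T.log(prediction[0, 1])            # hypothetical target class
+        cost = nll + T.sum(W ** 2)
+
+        # One compiled function returning both quantities: the common
+        # ``prediction`` sub-graph is computed only once.
+        f = theano.function([x], [prediction, cost])
+        pred_value, cost_value = f(numpy.random.rand(1, 10))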
+