changeset 1376:e8fc563dad74

Rewrote the Theano-Like Data Flow section in datalearn.txt
author Olivier Delalleau <delallea@iro>
date Thu, 18 Nov 2010 14:00:49 -0500
parents 7b61bfda1dab
children 0665274b14af
files doc/v2_planning/datalearn.txt
diffstat 1 files changed, 234 insertions(+), 79 deletions(-) [+]
line wrap: on
line diff
--- a/doc/v2_planning/datalearn.txt	Thu Nov 18 11:45:10 2010 -0500
+++ b/doc/v2_planning/datalearn.txt	Thu Nov 18 14:00:49 2010 -0500
@@ -1,11 +1,14 @@
 DataLearn: How to plug Datasets & Learner together?
 ===================================================
 
+
 Participants
 ------------
 - Yoshua
 - Razvan
 - Olivier D [leader]
+- James
+
 
 High-Level Objectives
 ---------------------
@@ -18,6 +21,7 @@
      framework
    * It should be possible to replace [parts of] this framework with C++ code
 
+
 Theano-Like Data Flow
 ---------------------
 
@@ -37,6 +41,234 @@
 individual features. How to handle field names, non-tensor-like data, etc. is
 a very important topic that is not yet discussed in this file.
 
+The main idea in this proposal is to consider a Data object as a Theano
+Variable (we call 'data' an object that is either a single sample, or a
+collection of samples, i.e. a dataset). Because the Data API (for the Machine
+Learning user) may conflict with the Variable API, in the following we take
+the approach that a data object contains a Theano variable accessible through
+data.variable (instead of Data being a subclass of Variable). For instance, a
+basic way of printing the content of a dataset could be:
+
+    .. code-block:: python
+
+        dataset = NumpyDataset(some_numpy_array)  # View array as dataset.
+        index = theano.tensor.lscalar()
+        get_sample_value = theano.function([index], dataset[index].variable)
+        for i in xrange(len(dataset)):
+            print get_sample_value(i)
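+
+For concreteness, here is a minimal sketch of what such a wrapper could look
+like (the NumpyDataset class and the Data wrapper are hypothetical, shown
+only to illustrate the 'data object contains a Theano variable' idea):
+
+    .. code-block:: python
+
+        import theano
+
+        class Data(object):
+            # Hypothetical: a data object simply wraps a Theano variable.
+            def __init__(self, variable):
+                self.variable = variable
+
+        class NumpyDataset(object):
+            # Hypothetical: view a numpy array as a dataset whose rows
+            # are samples.
+            def __init__(self, array):
+                self.array = array
+                self.variable = theano.shared(array)
+
+            def __len__(self):
+                return len(self.array)
+
+            def __getitem__(self, index):
+                # 'index' may be a symbolic scalar: the sample's variable
+                # is then a symbolic row of the dataset's variable.
+                return Data(self.variable[index])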
+
+A helper function could also be provided for the common task of iterating
+over the numeric values found in a dataset, which would allow one to simply
+write:
+
+    .. code-block:: python
+
+        for sample_value in theano_iterate(dataset):
+            print sample_value
+
+where the theano_iterate function would take care of the extra work:
+
+    .. code-block:: python
+
+        def theano_iterate(dataset, index=None, condition=None,
+                           stop_exceptions=(IndexError, )):
+            if index is None:
+                index = theano.tensor.lscalar()
+            if condition is None:
+                condition = index < len(dataset)
+            # Compile a single function returning both the sample's value
+            # and whether iteration should continue.
+            get_value = theano.function([index],
+                                        [dataset[index].variable, condition])
+            i = 0
+            while True:
+                try:
+                    output, cond = get_value(i)
+                except stop_exceptions:
+                    # Typically an out-of-range index on the underlying
+                    # storage.
+                    break
+                i += 1
+                if cond:
+                    yield output
+                else:
+                    break
+
+Now imagine a similar situation (where we want to iterate over a dataset),
+except the dataset is now the result of some transformation parameterized by
+another Variable. For instance, suppose there exists a GetColumnDataset class
+such that GetColumnDataset(dataset, index_variable) is a dataset whose
+associated variable is dataset.variable[:, index_variable] (assuming here
+that dataset.variable is a matrix variable). One would like to write:
+
+    .. code-block:: python
+
+        for j in xrange(dataset.n_columns()):
+            print 'Printing column %s' % j
+            for sample_value in theano_iterate(GetColumnDataset(dataset, j)):
+                print sample_value
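+
+As an aside, GetColumnDataset itself could be as simple as the following
+sketch (hypothetical, reusing the Data wrapper from above); the resulting
+dataset's variable is built symbolically from its input's:
+
+    .. code-block:: python
+
+        class GetColumnDataset(object):
+            # Hypothetical: dataset selecting one column of a
+            # matrix-valued dataset. 'column' may be a symbolic index.
+            def __init__(self, dataset, column):
+                self.dataset = dataset
+                self.variable = dataset.variable[:, column]
+
+            def __len__(self):
+                return len(self.dataset)
+
+            def __getitem__(self, index):
+                return Data(self.variable[index])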
+
+Although the column-printing loop above would work, note that it would
+compile a new Theano function each time theano_iterate is called (one for
+each value of j), which may be a performance bottleneck. One way to avoid
+this is to bypass the helper function and manually compile a function that
+also takes the column index as an input parameter:
+
+    .. code-block:: python
+
+        sample_idx = theano.tensor.lscalar()
+        column_idx = theano.tensor.lscalar()
+        get_value = theano.function(
+            [sample_idx, column_idx],
+            GetColumnDataset(dataset, column_idx)[sample_idx].variable)
+        for j in xrange(dataset.n_columns()):
+            print 'Printing column %s' % j
+            for i in xrange(len(dataset)):
+                print get_value(i, j)
+
+It is however possible to keep using the helper function, provided it
+accepts an extra 'givens' argument that is forwarded to the Theano
+compilation step:
+
+    .. code-block:: python
+
+        def theano_iterate(dataset, index=None, condition=None,
+                           stop_exceptions=(IndexError, ),
+                           givens={}):
+            (...)
+            get_value = theano.function([index],
+                                        [dataset[index].variable, condition],
+                                        givens=givens)
+            (...)
+        
+        column_idx = theano.tensor.lscalar()
+        shared_column_idx = theano.shared(0)
+        iterate = theano_iterate(GetColumnDataset(dataset, column_idx),
+                                 givens={column_idx: shared_column_idx})
+        for j in xrange(dataset.n_columns()):
+            print 'Printing column %s' % j
+            shared_column_idx.value = j
+            for sample_value in iterate:
+                print sample_value
+
+Note there are a couple of oddities in the example above:
+   1. The way theano_iterate was written, it is not possible to iterate over
+      it more than once. This is easily fixed by making it an iterable
+      object, as sketched after this list.
+   2. It would make more sense here to remove 'column_idx' and directly use
+      GetColumnDataset(dataset, shared_column_idx), in which case there is no
+      need for the 'givens' keyword. But the goal here is to illustrate a
+      situation where one is given a dataset defined from a symbolic
+      variable, and we want to compute it for different numeric values of
+      this variable. This dataset may have been provided by code the user has
+      no control over, hence the need for 'givens' to replace the variable
+      with a shared one whose value can be updated between successive calls
+      to the same function.
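+
+Oddity 1 could be addressed with an iterable-object variant along these
+lines (a sketch only, compiling the function once in the constructor):
+
+    .. code-block:: python
+
+        class TheanoIterate(object):
+            # Sketch: iterable-object version of theano_iterate, so that
+            # the same compiled function can be re-used across multiple
+            # iterations.
+            def __init__(self, dataset, index=None, condition=None,
+                         stop_exceptions=(IndexError, ), givens={}):
+                if index is None:
+                    index = theano.tensor.lscalar()
+                if condition is None:
+                    condition = index < len(dataset)
+                self.stop_exceptions = stop_exceptions
+                self.get_value = theano.function(
+                        [index], [dataset[index].variable, condition],
+                        givens=givens)
+
+            def __iter__(self):
+                # Each call starts a fresh pass without recompiling.
+                i = 0
+                while True:
+                    try:
+                        output, cond = self.get_value(i)
+                    except self.stop_exceptions:
+                        break
+                    i += 1
+                    if cond:
+                        yield output
+                    else:
+                        break
+
+An instance can then be passed around and iterated over several times, with
+the Theano function compiled only once.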
+
+In summary:
+
+- Data (samples and datasets) are basically Theano Variables, and a data
+  transformation is an Op.
+- When writing code that requires a data object's numeric value, one has to
+  compile a Theano function to obtain it. This is done either manually or
+  through helper Pylearn functions for common tasks. In both cases, the user
+  should have enough control to be able to obtain an efficient
+  implementation.
+
+
+What About Learners?
+--------------------
+
+The discussion above only mentioned datasets, but not learners. The learning
+part of a learner is not a main concern (currently). What matters most
+w.r.t. the discussion above is how a learner takes a dataset as input and
+outputs another dataset that can be used with the dataset API.
+
+A Learner may be able to compute various things. For instance, a Neural
+Network may output a ``prediction`` vector (whose elements correspond to
+estimated probabilities of each class in a classification task), as well as a
+``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
+and the classification error). We would want to be able to build a dataset
+that contains some of these quantities computed on each sample in the input
+dataset.
+
+The Neural Network code would then look something like this:
+
+    .. code-block:: python
+
+        class NeuralNetwork(Learner):
+
+            # The decorator below is responsible for turning a function that
+            # takes a symbolic sample as input, and outputs a Theano variable,
+            # into a function that can also be applied on numeric sample data,
+            # or symbolic datasets.
+            # Other approaches than a decorator are possible (e.g. using
+            # different function names).
+            @datalearn(..)
+            def compute_prediction(self, sample):
+                return theano.tensor.nnet.softmax(
+                        theano.tensor.dot(self.weights, sample.input))
+
+            @datalearn(..)
+            def compute_nll(self, sample):
+                return -theano.tensor.log(
+                        self.compute_prediction(sample)[sample.target])
+
+            @datalearn(..)
+            def compute_penalized_nll(self, sample):
+                return (self.compute_nll(sample) +
+                        theano.tensor.sum(self.weights**2))
+
+            @datalearn(..)
+            def compute_class_error(self, sample):
+                probabilities = self.compute_prediction(sample)
+                predicted_class = theano.tensor.argmax(probabilities)
+                return predicted_class != sample.target
+
+            @datalearn(..)
+            def compute_cost(self, sample):
+                return theano.tensor.concatenate([
+                        self.compute_penalized_nll(sample),
+                        self.compute_nll(sample),
+                        self.compute_class_error(sample),
+                        ])
+            
+The ``@datalearn`` decorator would allow such a Learner to be used e.g. like
+this:
+
+    .. code-block:: python
+
+        nnet = NeuralNetwork()
+        # Symbolic dataset that represents the output on symbolic input data.
+        predict_dataset = nnet.compute_prediction(dataset)
+        for sample in dataset:
+            # Symbolic sample that represents the output on a single symbolic
+            # input sample.
+            predict_sample = nnet.compute_prediction(sample)
+        # Numeric prediction.
+        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
+        # Combining multiple symbolic outputs.
+        multiple_fields_dataset = ConcatDataSet([
+                nnet.compute_prediction(dataset),
+                nnet.compute_cost(dataset),
+                ])
+        
+In the code above, if one wants to obtain the numeric value of an element of
+``multiple_fields_dataset``, the Theano function being compiled should be able
+to optimize computations so that the simultaneous computation of
+``prediction`` and ``cost`` is done efficiently.
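+
+For concreteness, one possible (purely hypothetical) sketch of the
+``@datalearn`` decorator is given below. It only handles the
+numeric-dictionary case and passes symbolic inputs through unchanged; the
+Sample class, the dispatch rule, and dataset support (mapping the
+computation over all samples) are assumptions left out of this sketch:
+
+    .. code-block:: python
+
+        import theano
+        import theano.tensor
+
+        class Sample(object):
+            # Hypothetical sample object: each field (e.g. 'input',
+            # 'target') is exposed as an attribute holding a variable.
+            def __init__(self, **fields):
+                self.__dict__.update(fields)
+
+        def datalearn(**options):
+            def decorator(compute):
+                def wrapper(self, data):
+                    if isinstance(data, dict):
+                        # Numeric input: build symbolic placeholders for
+                        # each field, apply the symbolic computation, then
+                        # compile and evaluate. (A function cache would
+                        # avoid recompiling on every call.)
+                        names = sorted(data.keys())
+                        variables = [theano.tensor.vector(name=n)
+                                     for n in names]
+                        sample = Sample(**dict(zip(names, variables)))
+                        out = compute(self, sample)
+                        fn = theano.function(variables, out)
+                        return fn(*[data[n] for n in names])
+                    # Symbolic input (sample or dataset): return the
+                    # symbolic result directly.
+                    return compute(self, data)
+                return wrapper
+            return decorator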
+
+
+Open Problems
+-------------
+
+The above is not yet a practical proposal. Investigation of the following
+topics is still missing:
+
+- Datasets whose variables are not matrices (e.g. large datasets that do not
+  fit in memory, non-fixed-length vector samples, ...).
+- Field names.
+- Typical input / target / weight split.
+- Learners whose output on a dataset cannot be obtained by computing outputs
+  on individual samples (e.g. a Learner that ranks samples based on
+  pair-wise comparisons).
+- Code parallelization, stop & restart.
+- Modular C++ implementation without Theano.
+- ...
+
+
+Previous Introduction (deprecated)
+----------------------------------
+
 A question we did not discuss much is to which extent the architecture could
 be "theanified", i.e. whether a whole experiment could be defined as a Theano
 graph on which high level optimizations could be made possible, while also
@@ -122,85 +354,6 @@
 sensible behavior for those who do not want to worry about it. Whether this is
 possible / desirable is still to-be-determined.
 
-What About Learners?
---------------------
-
-The discussion above only mentioned datasets, but not learners. The learning
-part of a learner is not a main concern (currently). What matters most w.r.t.
-what was discussed above is how a learner takes as input a dataset and outputs
-another dataset that can be used with the dataset API.
-
-A Learner may be able to compute various things. For instance, a Neural
-Network may output a ``prediction`` vector (whose elements correspond to
-estimated probabilities of each class in a classification task), as well as a
-``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
-and the classification error). We would want to be able to build a dataset
-that contains some of these quantities computed on each sample in the input
-dataset.
-
-The Neural Network code would then look something like this:
-
-    .. code-block:: python
-
-        class NeuralNetwork(Learner):
-
-            # The decorator below is reponsible for turning a function that
-            # takes a symbolic sample as input, and outputs a Theano variable,
-            # into a function that can also be applied on numeric sample data,
-            # or symbolic datasets.
-            # Other approaches than a decorator are possible (e.g. using
-            # different function names).
-            @datalearn(..)
-            def compute_prediction(self, sample):
-                return softmax(theano.tensor.dot(self.weights, sample.input))
-
-            @datalearn(..)
-            def compute_nll(self, sample):
-                return - log(self.compute_prediction(sample)[sample.target])
-
-            @datalearn(..)
-            def compute_penalized_nll(self, sample):
-                return (self.compute_nll(self, sample) +
-                        theano.tensor.sum(self.weights**2))
-
-            @datalearn(..)
-            def compute_class_error(self, sample):
-                probabilities = self.compute_prediction(sample)
-                predicted_class = theano.tensor.argmax(probabilities)
-                return predicted_class != sample.target
-
-            @datalearn(..)
-            def compute_cost(self, sample):
-                return theano.tensor.concatenate([
-                        self.compute_penalized_nll(sample),
-                        self.compute_nll(sample),
-                        self.compute_class_error(sample),
-                        ])
-            
-The ``@datalearn`` decorator would allow such a Learner to be used e.g. like
-this:
-
-    .. code-block:: python
-
-        nnet = NeuralNetwork()
-        # Symbolic dataset that represents the output on symbolic input data.
-        predict_dataset = nnet.compute_prediction(dataset)
-        for sample in dataset:
-            # Symbolic sample that represents the output on a single symbolic
-            # input sample.
-            predict_sample = nnet.compute_prediction(sample)
-        # Numeric prediction.
-        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
-        # Combining multiple symbolic outputs.
-        multiple_fields_dataset = ConcatDataSet([
-                nnet.compute_prediction(dataset),
-                nnet.compute_cost(dataset),
-                ])
-        
-In the code above, if one wants to obtain the numeric value of an element of
-``multiple_fields_dataset``, the Theano function being compiled should be able
-to optimize computations so that the simultaneous computation of
-``prediction`` and ``cost`` is done efficiently.
 
 Discussion: Are Datasets Variables / Ops?
 -----------------------------------------
@@ -390,6 +543,7 @@
 and valid options.
 </Razvan comments>
 
+
 Discussion: Fixed Parameters vs. Function Arguments
 ---------------------------------------------------
 
@@ -534,6 +688,7 @@
   once. Maybe this can be solved at the Theano level with an efficient
   function cache?
 
+
 Discussion: Dataset as Learner Output
 -------------------------------------