diff doc/v2_planning/datalearn.txt @ 1392:2d3cbbb36178

author gdesjardins
date Mon, 20 Dec 2010 18:09:11 -0500
parents e3d02b0a05e3
--- a/doc/v2_planning/datalearn.txt	Mon Dec 20 18:08:48 2010 -0500
+++ b/doc/v2_planning/datalearn.txt	Mon Dec 20 18:09:11 2010 -0500
@@ -1,11 +1,14 @@
 DataLearn: How to plug Datasets & Learner together?
 - Yoshua
 - Razvan
 - Olivier D [leader]
+- James
 High-Level Objectives
@@ -18,6 +21,7 @@
    * It should be possible to replace [parts of] this framework with C++ code
 Theano-Like Data Flow
@@ -37,6 +41,235 @@
 individual features. How to handle field names, non-tensor-like data, etc. is
 a very important topic that is not yet discussed in this file.
+The main idea in this proposal is to consider some Data object as a Theano
+Variable (we call 'data' an object that is either a sample, or a collection of
+samples, i.e. a dataset). Because the Data API (for the Machine Learning user)
+may conflict with the Variable API, in the following we take the approach that
+a data object contains a Theano variable accessible through data.variable
+(instead of Data being a subclass of Variable). For instance a basic way of
+printing the content of a dataset could be:
+    .. code-block:: python
+        dataset = NumpyDataset(some_numpy_array)  # View array as dataset.
+        index = theano.tensor.lscalar()
+        get_sample_value = theano.function([index], dataset[index].variable)
+        for i in xrange(len(dataset)):
+            print get_sample_value(i)
+There may also exist a helper function for the common task of iterating over
+the numeric values found in a dataset, which would allow one to simply write:
+    .. code-block:: python
+        for sample_value in theano_iterate(dataset):
+            print sample_value
+where the theano_iterate function would take care of the extra work:
+    .. code-block:: python
+        def theano_iterate(dataset, index=None, condition=None,
+                           stop_exceptions=(IndexError, )):
+            if index is None:
+                index = theano.tensor.lscalar()
+            if condition is None:
+                condition = index < len(dataset)
+            get_value = theano.function([index],
+                                        [dataset[index].variable, condition])
+            i = 0
+            while True:
+                try:
+                    output, cond = get_value(i)
+                except stop_exceptions:
+                    break
+                i += 1
+                if cond:
+                    yield output
+                else:
+                    break
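The control flow of theano_iterate can be tried out without Theano. The sketch below is only an illustration under stated assumptions: ``compile_getter`` is a hypothetical stand-in for ``theano.function([index], [value, condition])``, the dataset is a plain list, and the code targets Python 3 rather than the Python 2 used in the proposal.

```python
def compile_getter(dataset, condition=None):
    """Hypothetical stand-in for theano.function([index], [value, condition])."""
    if condition is None:
        condition = lambda i: i < len(dataset)

    def get_value(i):
        # Indexing past the end raises IndexError, which the caller catches.
        return dataset[i], condition(i)

    return get_value


def iterate(dataset, condition=None, stop_exceptions=(IndexError,)):
    get_value = compile_getter(dataset, condition)  # built once per call
    i = 0
    while True:
        try:
            output, cond = get_value(i)
        except stop_exceptions:
            break
        i += 1
        if not cond:
            break
        yield output


samples = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
all_values = list(iterate(samples))
first_two = list(iterate(samples, condition=lambda i: i < 2))
```

Iteration ends either when the getter raises one of ``stop_exceptions`` or when the condition evaluates to false, whichever happens first.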
+Now imagine a similar situation (wanting to iterate over a dataset) where the
+dataset is the result of some transformation parameterized by another
+Variable. For instance, let's say there exists a GetColumnDataset class such
+that GetColumnDataset(dataset, index_variable) is a dataset whose associated
+variable is dataset.variable[:, index_variable] (assuming here that
+dataset.variable is a matrix variable). One would like to write:
+    .. code-block:: python
+        for j in xrange(dataset.n_columns()):
+            print 'Printing column %s' % j
+            for sample_value in theano_iterate(GetColumnDataset(dataset, j)):
+                print sample_value
+Although this would work, note that it would compile a new Theano function
+each time theano_iterate is called (one for each value of j), which may be a
+performance bottleneck. One way to avoid this is to just ignore the helper
+function and manually compile a function that also takes the column index as 
+input parameter:
+    .. code-block:: python
+        sample_idx = theano.tensor.lscalar()
+        column_idx = theano.tensor.lscalar()
+        get_value = theano.function(
+            [sample_idx, column_idx],
+            GetColumnDataset(dataset, column_idx)[sample_idx].variable)
+        for j in xrange(dataset.n_columns()):
+            print 'Printing column %s' % j
+            for i in xrange(len(dataset)):
+                print get_value(i, j)
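The difference in compilation count between the two styles can be made visible without Theano. In the sketch below, ``CountingCompiler`` is a hypothetical stand-in for ``theano.function`` that only counts how often it is called; the data and all names are assumptions made for illustration, and the code targets Python 3.

```python
class CountingCompiler:
    """Hypothetical stand-in for theano.function: counts 'compilations'."""

    def __init__(self):
        self.count = 0

    def __call__(self, fn):
        self.count += 1
        return fn


rows = [[1, 2, 3], [4, 5, 6]]  # 2 samples, 3 columns
n_columns = len(rows[0])

# Helper style: a new function is built for each column.
per_column = CountingCompiler()
for j in range(n_columns):
    get_value = per_column(lambda i, j=j: rows[i][j])
    column = [get_value(i) for i in range(len(rows))]

# Manual style: the column index is an input, so one function suffices.
once = CountingCompiler()
get_value = once(lambda i, j: rows[i][j])
columns = [[get_value(i, j) for i in range(len(rows))]
           for j in range(n_columns)]
```

With a real compiler the per-column style pays the compilation cost once per column, while the manual style pays it once in total.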
+It is however possible to use the helper function if it can accept an extra
+argument ('givens') to be provided to the theano compilation step:
+    .. code-block:: python
+        def theano_iterate(dataset, index=None, condition=None,
+                           stop_exceptions=(IndexError, ),
+                           givens={}):
+            (...)
+            get_value = theano.function([index],
+                                        [dataset[index].variable, condition],
+                                        givens=givens)
+            (...)
+        column_idx = theano.tensor.lscalar()
+        shared_column_idx = theano.shared(0)
+        iterate = theano_iterate(GetColumnDataset(dataset, column_idx),
+                                 givens={column_idx: shared_column_idx})
+        for j in xrange(dataset.n_columns()):
+            print 'Printing column %s' % j
+            shared_column_idx.value = j
+            for sample_value in iterate:
+                print sample_value
+Note there are a couple of oddities in the example above:
+   1. The way theano_iterate was written, it is not possible to iterate on it
+      more than once. This is easily fixed by making it an iterable object.
+   2. It would make more sense here to remove 'column_idx' and directly use
+      GetColumnDataset(dataset, shared_column_idx), in which case there is no
+      need to use the 'givens' keyword. But the goal here is to illustrate a
+      situation where one is given a dataset defined from a symbolic variable,
+      and we want to compute it for different numeric values of this variable.
+      This dataset may have been provided by code the user has no control over,
+      thus the need for 'givens' to replace the variable with a shared one
+      whose value can be updated between successive calls to the same
+      function.
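The fix mentioned in the first oddity could look roughly like the sketch below. It is not the proposal's implementation: ``SharedValue`` is a hypothetical stand-in for a Theano shared variable, ``ColumnIterable`` is a hypothetical replacement for the generator-based helper, and no Theano function is compiled (Python 3).

```python
class SharedValue:
    """Hypothetical stand-in for a Theano shared variable."""

    def __init__(self, value):
        self.value = value


class ColumnIterable:
    """Iterable (not a one-shot generator) over one column of `rows`.

    The getter is built once in __init__; the column index is read from
    `shared_idx` each time a value is fetched.
    """

    def __init__(self, rows, shared_idx):
        self.rows = rows
        self._get_value = lambda i: rows[i][shared_idx.value]

    def __iter__(self):
        # A fresh generator per call, so the object can be iterated again.
        for i in range(len(self.rows)):
            yield self._get_value(i)


rows = [[1, 2], [3, 4], [5, 6]]
shared_column_idx = SharedValue(0)
column = ColumnIterable(rows, shared_column_idx)

columns = []
for j in range(2):
    shared_column_idx.value = j
    columns.append(list(column))

# For contrast, a plain generator is exhausted after its first pass.
one_shot = (row[0] for row in rows)
first_pass = list(one_shot)
second_pass = list(one_shot)
```

Because ``__iter__`` returns a new generator on every call, the same object can be traversed once per column while the getter is built only once.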
+In summary:
+    - Data (samples and datasets) are basically Theano Variables, and a data
+      transformation an Op.
+    - When writing code that requires the numeric value of some data, one has to
+      compile a Theano function to obtain it. This is done either manually or
+      through helper Pylearn functions for common tasks. In both cases, the user
+      should have enough control to obtain an efficient implementation.
+What About Learners?
+The discussion above only mentioned datasets, but not learners. The learning
+part of a learner is not a main concern (currently). What matters most w.r.t.
+what was discussed above is how a learner takes as input a dataset and outputs
+another dataset that can be used with the dataset API.
+A Learner may be able to compute various things. For instance, a Neural
+Network may output a ``prediction`` vector (whose elements correspond to
+estimated probabilities of each class in a classification task), as well as a
+``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
+and the classification error). We would want to be able to build a dataset
+that contains some of these quantities computed on each sample in the input
+dataset.
+The Neural Network code would then look something like this:
+    .. code-block:: python
+        class NeuralNetwork(Learner):
+            # The decorator below is responsible for turning a function that
+            # takes a symbolic sample as input, and outputs a Theano variable,
+            # into a function that can also be applied on numeric sample data,
+            # or symbolic datasets.
+            # Other approaches than a decorator are possible (e.g. using
+            # different function names).
+            @datalearn
+            def compute_prediction(self, sample):
+                return softmax(theano.tensor.dot(self.weights, sample.input))
+            @datalearn
+            def compute_nll(self, sample):
+                return - log(self.compute_prediction(sample)[sample.target])
+            @datalearn
+            def compute_penalized_nll(self, sample):
+                return (self.compute_nll(sample) +
+                        theano.tensor.sum(self.weights**2))
+            @datalearn
+            def compute_class_error(self, sample):
+                probabilities = self.compute_prediction(sample)
+                predicted_class = theano.tensor.argmax(probabilities)
+                return predicted_class != sample.target
+            @datalearn
+            def compute_cost(self, sample):
+                return theano.tensor.concatenate([
+                        self.compute_penalized_nll(sample),
+                        self.compute_nll(sample),
+                        self.compute_class_error(sample),
+                        ])
+The ``@datalearn`` decorator would allow such a Learner to be used as follows:
+    .. code-block:: python
+        nnet = NeuralNetwork()
+        # Symbolic dataset that represents the output on symbolic input data.
+        predict_dataset = nnet.compute_prediction(dataset)
+        for sample in dataset:
+            # Symbolic sample that represents the output on a single symbolic
+            # input sample.
+            predict_sample = nnet.compute_prediction(sample)
+        # Numeric prediction.
+        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
+        # Combining multiple symbolic outputs.
+        multiple_fields_dataset = ConcatDataSet([
+                nnet.compute_prediction(dataset),
+                nnet.compute_cost(dataset),
+                ])
+In the code above, if one wants to obtain the numeric value of an element of
+``multiple_fields_dataset``, the Theano function being compiled should be able
+to optimize computations so that the simultaneous computation of
+``prediction`` and ``cost`` is done efficiently.
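One way a decorator such as ``@datalearn`` could dispatch on its argument is sketched below. This is a guess at the mechanism, not the proposal's design: it does not use Theano, a lazily evaluated ``MappedDataset`` stands in for a symbolic dataset, and ``Sample``, ``Dataset``, ``ListDataset``, ``MappedDataset`` and ``LinearLearner`` are all hypothetical names (Python 3).

```python
import functools


class Sample:
    """Hypothetical sample with attribute access to its fields."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


class Dataset:
    """Hypothetical marker base class for datasets."""


class ListDataset(Dataset):
    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return self.samples[i]


class MappedDataset(Dataset):
    """Lazy dataset: applies `fn` to a source sample only when indexed."""

    def __init__(self, fn, source):
        self.fn, self.source = fn, source

    def __len__(self):
        return len(self.source)

    def __getitem__(self, i):
        return self.fn(self.source[i])


def datalearn(method):
    @functools.wraps(method)
    def wrapper(self, data):
        if isinstance(data, Dataset):
            # Dataset in, dataset out: defer the per-sample computation.
            return MappedDataset(lambda s: wrapper(self, s), data)
        if isinstance(data, dict):
            # Numeric sample given as a dict of fields.
            data = Sample(**data)
        return method(self, data)
    return wrapper


class LinearLearner:
    """Toy learner used only to exercise the decorator."""

    def __init__(self, weights):
        self.weights = weights

    @datalearn
    def compute_prediction(self, sample):
        return sum(w * x for w, x in zip(self.weights, sample.input))

    @datalearn
    def compute_squared_error(self, sample):
        return (self.compute_prediction(sample) - sample.target) ** 2


learner = LinearLearner([1.0, 2.0])
dataset = ListDataset([Sample(input=[1.0, 1.0], target=3.0),
                       Sample(input=[0.0, 2.0], target=5.0)])
predictions = learner.compute_prediction(dataset)       # a MappedDataset
errors = learner.compute_squared_error(dataset)         # a MappedDataset
numeric = learner.compute_prediction({'input': [1.0, 0.5]})
```

In this sketch nothing is computed for a dataset until an element is requested, which loosely mirrors the idea that a symbolic dataset only yields numbers once a function is compiled and called.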
+Open Problems
+The above is not yet a practical proposal. Investigation of the following
+topics is still missing:
+    - Datasets whose variables are not matrices (e.g. large datasets that do not
+      fit in memory, non fixed-length vector samples, ...)
+    - Field names.
+    - Typical input / target / weight split.
+    - Learners whose output on a dataset cannot be obtained by computing outputs
+      on individual samples (e.g. a Learner that ranks samples based on pair-wise
+      comparisons).
+    - Code parallelization, stop & restart.
+    - Modular C++ implementation without Theano.
+    - How do we take care of model learning within such a Theano graph?
+    - ...
+Previous Introduction (deprecated)
 A question we did not discuss much is to which extent the architecture could
 be "theanified", i.e. whether a whole experiment could be defined as a Theano
 graph on which high level optimizations could be made possible, while also
@@ -122,85 +355,6 @@
 sensible behavior for those who do not want to worry about it. Whether this is
 possible / desirable is still to-be-determined.
-What About Learners?
-The discussion above only mentioned datasets, but not learners. The learning
-part of a learner is not a main concern (currently). What matters most w.r.t.
-what was discussed above is how a learner takes as input a dataset and outputs
-another dataset that can be used with the dataset API.
-A Learner may be able to compute various things. For instance, a Neural
-Network may output a ``prediction`` vector (whose elements correspond to
-estimated probabilities of each class in a classification task), as well as a
-``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
-and the classification error). We would want to be able to build a dataset
-that contains some of these quantities computed on each sample in the input
-dataset.
-The Neural Network code would then look something like this:
-    .. code-block:: python
-        class NeuralNetwork(Learner):
-            # The decorator below is responsible for turning a function that
-            # takes a symbolic sample as input, and outputs a Theano variable,
-            # into a function that can also be applied on numeric sample data,
-            # or symbolic datasets.
-            # Other approaches than a decorator are possible (e.g. using
-            # different function names).
-            @datalearn(..)
-            def compute_prediction(self, sample):
-                return softmax(theano.tensor.dot(self.weights, sample.input))
-            @datalearn(..)
-            def compute_nll(self, sample):
-                return - log(self.compute_prediction(sample)[sample.target])
-            @datalearn(..)
-            def compute_penalized_nll(self, sample):
-                return (self.compute_nll(sample) +
-                        theano.tensor.sum(self.weights**2))
-            @datalearn(..)
-            def compute_class_error(self, sample):
-                probabilities = self.compute_prediction(sample)
-                predicted_class = theano.tensor.argmax(probabilities)
-                return predicted_class != sample.target
-            @datalearn(..)
-            def compute_cost(self, sample):
-                return theano.tensor.concatenate([
-                        self.compute_penalized_nll(sample),
-                        self.compute_nll(sample),
-                        self.compute_class_error(sample),
-                        ])
-The ``@datalearn`` decorator would allow such a Learner to be used as follows:
-    .. code-block:: python
-        nnet = NeuralNetwork()
-        # Symbolic dataset that represents the output on symbolic input data.
-        predict_dataset = nnet.compute_prediction(dataset)
-        for sample in dataset:
-            # Symbolic sample that represents the output on a single symbolic
-            # input sample.
-            predict_sample = nnet.compute_prediction(sample)
-        # Numeric prediction.
-        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
-        # Combining multiple symbolic outputs.
-        multiple_fields_dataset = ConcatDataSet([
-                nnet.compute_prediction(dataset),
-                nnet.compute_cost(dataset),
-                ])
-In the code above, if one wants to obtain the numeric value of an element of
-``multiple_fields_dataset``, the Theano function being compiled should be able
-to optimize computations so that the simultaneous computation of
-``prediction`` and ``cost`` is done efficiently.
 Discussion: Are Datasets Variables / Ops?
@@ -264,6 +418,7 @@
 numeric function, and dataset in this case is the result of some
 computations on an initial dataset.
 I would differentiate the two approaches (1) and (2) as follows:
  - first of all whatever you can do with (1) you can do with (2)
  - approach (1) hides the fact that you are working with symbolic graphs.
    You apply functions to datasets, and when you want to see values a
@@ -390,6 +545,7 @@
 and valid options.
 </Razvan comments>
 Discussion: Fixed Parameters vs. Function Arguments
@@ -534,6 +690,7 @@
   once. Maybe this can be solved at the Theano level with an efficient
   function cache?
 Discussion: Dataset as Learner Output