# HG changeset patch
# User Olivier Delalleau
# Date 1289839893 18000
# Node ID 9474fb4ad10978b09ca6c533b34c4988c2219984
# Parent f945ed016c6811cb431d860eee7da99e143f3161
Refactored datalearn committee file to be easier to read

diff -r f945ed016c68 -r 9474fb4ad109 doc/v2_planning/datalearn.txt
--- a/doc/v2_planning/datalearn.txt Fri Nov 12 13:49:13 2010 -0500
+++ b/doc/v2_planning/datalearn.txt Mon Nov 15 11:51:33 2010 -0500
@@ -5,7 +5,7 @@
 ------------
 - Yoshua
 - Razvan
-- Olivier D [leader?]
+- Olivier D [leader]
 
 High-Level Objectives
 ---------------------
@@ -37,14 +37,182 @@
 individual features. How to handle field names, non-tensor-like data, etc. is
 a very important topic that is not yet discussed in this file.
 
-A question we did not really discuss is whether datasets should be Theano
-Variables. The advantage would be that they would fit directly within the
-Theano framework, which may allow high level optimizations on data
-transformations. However, we would lose the ability to combine Theano
-expressions coded in individual datasets into a single graph. Currently, we
-instead consider that a dataset has a member that is a Theano variable, and
-this variable represents the data stored in the dataset. The same is done for
-individual data samples.
+A question we did not discuss much is to what extent the architecture could
+be "theanified", i.e. whether a whole experiment could be defined as a Theano
+graph on which high level optimizations could be performed, while also
+relying on Theano to "run" the graph. The other option is to use a different
+mechanism, with underlying Theano graphs being built wherever possible to link
+the various components of an experiment together.
+
+For now, let us consider the latter option, where each dataset contains a
+pointer to a Theano variable that represents the data stored in this dataset.
+One issue with this approach is illustrated by the following example. Imagine
+we want to iterate over samples in a dataset and do something with their
+numeric value. We would want the code to be as close as possible to:
+
+    .. code-block:: python
+
+        for sample in dataset:
+            do_something_with(sample.numeric_value())
+
+A naive implementation of the sample API could be (assuming each sample also
+contains a ``variable`` member which is the variable representing this
+sample's data):
+
+    .. code-block:: python
+
+        def numeric_value(self):
+            if self.function is None:
+                # Compile function to output the numeric value stored in this
+                # sample's variable.
+                self.function = theano.function([], self.variable)
+            return self.function()
+
+However, this is not a good idea, because it would trigger a new function
+compilation for each sample. Instead, we would want something like this:
+
+    .. code-block:: python
+
+        def numeric_value(self):
+            if self.function_storage[0] is None:
+                # Compile function to output the numeric value stored in this
+                # sample's variable. This function takes as input the index of
+                # the sample in the dataset, and is shared among all samples.
+                self.function_storage[0] = theano.function(
+                    [self.symbolic_index], self.variable)
+            return self.function_storage[0](self.numeric_index)
+
+In the code above, we assume that all samples created by the action of
+iterating over the dataset share the same ``function_storage``,
+``symbolic_index`` and ``variable``: the first time we try to access the
+numeric value of some sample, a function is compiled that takes the index as
+input and outputs the variable.
+The only difference between samples is thus that they are given a different
+numeric value for the index (``numeric_index``).
+
+Another way to obtain the same result is to actually let the user take care of
+compiling the function. It would allow the user to really control what is
+being compiled, at the cost of having to write more code:
+
+    .. code-block:: python
+
+        symbolic_index = dataset.get_index() # Or just theano.tensor.iscalar()
+        get_sample = theano.function([symbolic_index],
+                                     dataset[symbolic_index].variable)
+        for numeric_index in xrange(len(dataset)):
+            do_something_with(get_sample(numeric_index))
+
+James comments: this is how I have written the last couple of projects, it's
+slightly verbose but it's clear and efficient.
+
+The code above may also be simplified by providing helper functions. In the
+example above, such a function could allow us to iterate over the numeric
+values of samples in a dataset while taking care of compiling the appropriate
+Theano function. See Discussion: Helper Functions below.
+
+Note that although the above example focused on how to iterate over a dataset,
+it can be cast into a more generic problem, where some data (either dataset or
+sample) is the result of some transformation applied to other data, which is
+parameterized by parameters p1, p2, ..., pN (in the above example, we were
+considering a sample that was obtained by taking the p1-th element in a
+dataset). If we use different values for a subset Q of the parameters but keep
+other parameters fixed, we would probably want to compile a single function
+that takes as input all parameters in Q, while other parameters are fixed. It
+may be nice to try and get the best of both worlds, letting the user take
+control over what is being compiled, while leaving the option of a sensible
+default behavior for those who do not want to worry about it. Whether this is
+possible / desirable is still to be determined.
+
+What About Learners?
+--------------------
+
+The discussion above only mentioned datasets, but not learners. The learning
+part of a learner is not the main concern for now. What matters most w.r.t.
+what was discussed above is how a learner takes a dataset as input and outputs
+another dataset that can be used with the dataset API.
+
+A Learner may be able to compute various things. For instance, a Neural
+Network may output a ``prediction`` vector (whose elements correspond to
+estimated probabilities of each class in a classification task), as well as a
+``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
+and the classification error). We would want to be able to build a dataset
+that contains some of these quantities computed on each sample in the input
+dataset.
+
+The Neural Network code would then look something like this:
+
+    .. code-block:: python
+
+        class NeuralNetwork(Learner):
+
+            # The decorator below is responsible for turning a function that
+            # takes a symbolic sample as input, and outputs a Theano variable,
+            # into a function that can also be applied to numeric sample data,
+            # or symbolic datasets.
+            # Other approaches than a decorator are possible (e.g. using
+            # different function names).
+            @datalearn(..)
+            def compute_prediction(self, sample):
+                return softmax(theano.tensor.dot(self.weights, sample.input))
+
+            @datalearn(..)
+            def compute_nll(self, sample):
+                return - log(self.compute_prediction(sample)[sample.target])
+
+            @datalearn(..)
+            def compute_penalized_nll(self, sample):
+                return (self.compute_nll(sample) +
+                        theano.tensor.sum(self.weights**2))
+
+            @datalearn(..)
+            def compute_class_error(self, sample):
+                probabilities = self.compute_prediction(sample)
+                predicted_class = theano.tensor.argmax(probabilities)
+                return predicted_class != sample.target
+
+            @datalearn(..)
+            def compute_cost(self, sample):
+                return theano.tensor.concatenate([
+                    self.compute_penalized_nll(sample),
+                    self.compute_nll(sample),
+                    self.compute_class_error(sample),
+                ])
+
+The ``@datalearn`` decorator would allow such a Learner to be used e.g. like
+this:
+
+    .. code-block:: python
+
+        nnet = NeuralNetwork()
+        # Symbolic dataset that represents the output on symbolic input data.
+        predict_dataset = nnet.compute_prediction(dataset)
+        for sample in dataset:
+            # Symbolic sample that represents the output on a single symbolic
+            # input sample.
+            predict_sample = nnet.compute_prediction(sample)
+        # Numeric prediction.
+        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
+        # Combining multiple symbolic outputs.
+        multiple_fields_dataset = ConcatDataSet([
+            nnet.compute_prediction(dataset),
+            nnet.compute_cost(dataset),
+        ])
+
+In the code above, if one wants to obtain the numeric value of an element of
+``multiple_fields_dataset``, the Theano function being compiled should be able
+to optimize computations so that the simultaneous computation of
+``prediction`` and ``cost`` is done efficiently.
+
+Discussion: Are Datasets Variables / Ops?
+-----------------------------------------
+
+OD wonders: Should datasets directly be Theano Variables, or should they be a
+different class of object containing a Theano Variable? The advantage of the
+former option would be that they would fit directly within the Theano
+framework, which may allow high level optimizations on data transformations.
+However, we would lose the ability to combine Theano expressions coded in
+individual datasets into a single graph. Currently, I instead consider that
+a dataset has a member that is a Theano variable, and this variable represents
+the data stored in the dataset. The same is done for individual data samples.
 
 James asks: Why would a Theano graph in which some nodes represent datasets
 give up the ability to combine Theano expressions coded in individual datasets?
 Wouldn't the dataset-variables/ops be like any other?
 
@@ -88,63 +256,9 @@
 of the dataset directly). Note that I'm mixing up Op/Variable here, because
 it's just not clear yet for me which would go where...
 
-One issue with this approach is illustrated by the following example. Imagine
-we want to iterate on samples in a dataset and do something with their
-numeric value. We would want the code to be as close as possible to:
-
-    .. code-block:: python
-
-        for sample in dataset:
-            do_something_with(sample.numeric_value())
-
-A naive implementation of the sample API could be (assuming each sample
-contains a ``variable`` member which is the variable representing this
-sample's data):
-
-    .. code-block:: python
-
-        def numeric_value(self):
-            if self.function is None:
-                # Compile function to output the numeric value stored in this
-                # sample's variable.
-                self.function = theano.function([], self.variable)
-            return self.function()
-
-However, this is not a good idea, because it would trigger a new function
-compilation for each sample. Instead, we would want something like this:
-
-    .. code-block:: python
-
-        def numeric_value(self):
-            if self.function_storage[0] is None:
-                # Compile function to output the numeric value stored in this
-                # sample's variable. This function takes as input the index of
-                # the sample in the dataset, and is shared among all samples.
-                self.function_storage[0] = theano.function(
-                    [self.symbolic_index], self.variable)
-            return self.function(self.numeric_index)
-
-In the code above, we assume that all samples created by the action of
-iterating over the dataset share the same ``function_storage``,
-``symbolic_index`` and ``variable``: the first time we try to access the numeric
-value of some sample, a function is compiled, that takes as input the index,
-and outputs the variable. The only difference between samples is thus that
-they are given a different numeric value for the index (``numeric_index``).
-
-Another way to obtain the same result is to actually let the user take care of
-compiling the function. It would allow the user to really control what is
-being compiled, at the cost of having to write more code:
-
-    .. code-block:: python
-
-        symbolic_index = dataset.get_index() # Or just theano.tensor.iscalar()
-        get_sample = theano.function([symbolic_index],
-                                     dataset[symbolic_index].variable)
-        for numeric_index in xrange(len(dataset))
-            do_something_with(get_sample(numeric_index))
-
-James comments: this is how I have written the last couple of projects, it's
-slightly verbose but it's clear and efficient.
+Discussion: Implicit / Explicit Function Compilation
+----------------------------------------------------
+
 : I assume that ``do_something_with`` is supposed to be some
 numeric function, and dataset in this case is the result of some
@@ -276,18 +390,8 @@
 and valid options.
 
-Note that although the above example focused on how to iterate over a dataset,
-it can be cast into a more generic problem, where some data (either dataset or
-sample) is the result of some transformation applied to other data, which is
-parameterized by parameters p1, p2, ..., pN (in the above example, we were
-considering a sample that was obtained by taking the p1-th element in a
-dataset). If we use different values for a subset Q of the parameters but keep
-other parameters fixed, we would probably want to compile a single function
-that takes as input all parameters in Q, while other parameters are fixed.
-Ideally it would be nice to let the user take control on what is being
-compiled, while leaving the option of using a default sensible behavior for
-those who do not want to worry about it. How to achieve this is still to be
-determined.
+Discussion: Fixed Parameters vs. Function Arguments
+---------------------------------------------------
 
 Razvan Comment: I thought about this a bit at the Pylearn level. In my
 original train of thought you would have the distinction between ``hand
@@ -309,6 +413,9 @@
 are possibly constant (e.g. holding some hyper-parameters constant for a
 while)?
 
+Discussion: Helper Functions
+----------------------------
+
 James: Another syntactic option for iterating over datasets is
 
     .. code-block:: python
@@ -330,13 +437,8 @@
 already compiled in the same program? (note that I am assuming here it is
 not efficient, but I may be wrong).
 
-What About Learners?
---------------------
-
-The discussion above only mentioned datasets, but not learners. The learning
-part of a learner is not a main concern (currently). What matters most w.r.t.
-what was discussed above is how a learner takes as input a dataset and outputs
-another dataset that can be used with the dataset API.
+Discussion: Dataset as Learner Output
+-------------------------------------
 
 James asks: What's wrong with simply passing the variables corresponding to
 the dataset to
@@ -352,67 +454,6 @@
 could also be instead different functions in the base Learner class if the
 decorator approach is considered ugly / confusing.
 
-A Learner may be able to compute various things. For instance, a Neural
-Network may output a ``prediction`` vector (whose elements correspond to
-estimated probabilities of each class in a classification task), as well as a
-``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
-and the classification error). We would want to be able to build a dataset
-that contains some of these quantities computed on each sample in the input
-dataset.
-
-The Neural Network code would then look something like this:
-
-    .. code-block:: python
-
-        class NeuralNetwork(Learner):
-
-            @datalearn(..)
-            def compute_prediction(self, sample):
-                return softmax(theano.tensor.dot(self.weights, sample.input))
-
-            @datalearn(..)
-            def compute_nll(self, sample):
-                return - log(self.compute_prediction(sample)[sample.target])
-
-            @datalearn(..)
-            def compute_penalized_nll(self, sample):
-                return (self.compute_nll(self, sample) +
-                        theano.tensor.sum(self.weights**2))
-
-            @datalearn(..)
-            def compute_class_error(self, sample):
-                probabilities = self.compute_prediction(sample)
-                predicted_class = theano.tensor.argmax(probabilities)
-                return predicted_class != sample.target
-
-            @datalearn(..)
-            def compute_cost(self, sample):
-                return theano.tensor.concatenate([
-                    self.compute_penalized_nll(sample),
-                    self.compute_nll(sample),
-                    self.compute_class_error(sample),
-                ])
-
-The ``@datalearn`` decorator would be responsible for allowing such a Learner
-to be used e.g. like this:
-
-    .. code-block:: python
-
-        nnet = NeuralNetwork()
-        predict_dataset = nnet.compute_prediction(dataset)
-        for sample in dataset:
-            predict_sample = nnet.compute_prediction(sample)
-        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
-        multiple_fields_dataset = ConcatDataSet([
-            nnet.compute_prediction(dataset),
-            nnet.compute_cost(dataset),
-        ])
-
-In the code above, if one wants to obtain the numeric value of an element of
-``multiple_fields_dataset``, the Theano function being compiled would be able
-to optimize computations so that the simultaneous computation of
-``prediction`` and ``cost`` is done efficiently.
-
 Razvan asks: What is predict_sample for ? What is predict_dataset? What I
 guess you mean is that the decorator is used to convert a function that takes
 a theano variable and outputs a theano variable into a class/function
@@ -433,3 +474,4 @@
 OD: Yes, you guessed right, the decorator's role is to do something different
 depending on the input to the function (see my reply to James above).
+
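+To make the intended behaviour of such a decorator more concrete, here is a
+rough sketch of how the dispatch on the input type could work. Everything in
+it is made up for illustration only: the ``Sample`` container, the rule that a
+``dict`` means numeric data, and the network itself. Decorator arguments,
+dataset handling and caching of the compiled function are all omitted (a new
+Theano function is compiled on every numeric call).
+
+    .. code-block:: python
+
+        import numpy
+        import theano
+        import theano.tensor as T
+
+        class Sample(object):
+            """Toy symbolic sample with a single ``input`` field."""
+            def __init__(self, variable):
+                self.input = variable
+
+        def datalearn(method):
+            """Return a symbolic output on symbolic samples, and a numeric
+            output on dicts of numeric data."""
+            def wrapper(self, sample):
+                if isinstance(sample, dict):
+                    # Numeric data: build a symbolic sample, apply the method,
+                    # then compile and evaluate the resulting expression.
+                    symbolic = Sample(T.dvector('input'))
+                    expression = method(self, symbolic)
+                    function = theano.function([symbolic.input], expression)
+                    return function(sample['input'])
+                # Symbolic sample: simply return the Theano expression.
+                return method(self, sample)
+            return wrapper
+
+        class NeuralNetwork(object):
+            def __init__(self, n_in, n_out):
+                rng = numpy.random.RandomState(0)
+                self.weights = theano.shared(rng.uniform(size=(n_out, n_in)))
+
+            @datalearn
+            def compute_prediction(self, sample):
+                activation = T.dot(self.weights, sample.input)
+                return T.exp(activation) / T.exp(activation).sum()
+
+        nnet = NeuralNetwork(n_in=10, n_out=3)
+        # Numeric call: returns a numpy array that sums to one.
+        print nnet.compute_prediction({'input': numpy.zeros(10)})
+        # Symbolic call: returns a Theano expression.
+        print nnet.compute_prediction(Sample(T.dvector('x')))
+
+Whether the real decorator should compile eagerly like this, cache compiled
+functions (as in the ``function_storage`` example above), or leave compilation
+entirely to the user is exactly the trade-off raised in Discussion: Implicit /
+Explicit Function Compilation.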