DataLearn: How to plug Datasets & Learner together?
===================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader?]

High-Level Objectives
---------------------

   * Simple ML experiments should be simple to write
   * More complex / advanced scenarios should be possible without being forced
     to work "outside" of this framework
   * Computations should be optimized whenever possible
   * Existing code (in any language) should be "wrappable" within this
     framework
   * It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano in order to take advantage of its efficient
computations. The general idea is that if we chain multiple processing
elements (think e.g. of a feature selection step followed by a PCA projection,
then a rescaling within a fixed bounded interval), the overall transformation
from input to output data can be represented by a Theano symbolic graph. When
one wants to access the actual numeric data, a function is compiled to perform
these computations efficiently.
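
A minimal sketch of this idea, with a simplified chain (feature selection
followed by centering and rescaling rather than a full PCA; all names below
are illustrative and not part of any agreed API):

    .. code-block:: python

        import numpy
        import theano
        import theano.tensor as T

        # Raw data stored as a shared variable: rows are samples, columns are
        # features.
        data = theano.shared(numpy.random.randn(100, 50), name='data')

        # Each processing step only extends the symbolic graph; nothing is
        # computed yet.
        selected = data[:, 0:10]                        # keep the first 10 features
        centered = selected - T.mean(selected, axis=0)  # center each feature
        rescaled = centered / (T.max(abs(centered)) + 1e-8)  # bounded interval

        # A single function is compiled for the whole chain only when numeric
        # values are actually needed.
        get_output = theano.function([], rescaled)
        output = get_output()  # numpy array, computed in one optimized pass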

We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how this Theano-based implementation could be achieved
exactly. For this purpose, in the following, let us assume that a dataset is
simply a matrix whose rows represent individual samples, and columns
individual features. How to handle field names, non-tensor-like data, etc. is
a very important topic that is not yet discussed in this file.

A question we did not really discuss is whether datasets should be Theano
Variables. The advantage would be that they would fit directly within the
Theano framework, which may allow high level optimizations on data
transformations. However, we would lose the ability to combine Theano
expressions coded in individual datasets into a single graph. Currently, we
instead consider that a dataset has a member that is a Theano variable, and
this variable represents the data stored in the dataset. The same is done for
individual data samples.

James asks: Why would a Theano graph in which some nodes represent datasets give
up the ability to combine Theano expressions coded in individual datasets?
Firstly, if you want to use Theano expressions and compiled functions to
implement the perform() method of an Op, you can do that.  Secondly, you can
just include those 'expressions coded in individual datasets' into the overall
graph.

One issue with this approach is illustrated by the following example. Imagine
we want to iterate over the samples in a dataset and do something with their
numeric values. We would want the code to be as close as possible to:

    .. code-block:: python

        for sample in dataset:
            do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample
contains a ``variable`` member which is the variable representing this
sample's data):

    .. code-block:: python

        def numeric_value(self):
            if self.function is None:
                # Compile function to output the numeric value stored in this
                # sample's variable.
                self.function = theano.function([], self.variable)
            return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

    .. code-block:: python

        def numeric_value(self):
            if self.function_storage[0] is None:
                # Compile function to output the numeric value stored in this
                # sample's variable. This function takes as input the index of
                # the sample in the dataset, and is shared among all samples.
                self.function_storage[0] = theano.function(
                                        [self.symbolic_index], self.variable)
            return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by iterating over the
dataset share the same ``function_storage``, ``symbolic_index`` and
``variable``: the first time we try to access the numeric value of some
sample, a function is compiled that takes the index as input and outputs the
variable. The only difference between samples is thus the numeric value they
assign to this index (``numeric_index``).
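
A minimal sketch of how a dataset could create such samples (the ``Dataset``
and ``Sample`` classes below are hypothetical, only meant to make the sharing
explicit):

    .. code-block:: python

        import theano
        import theano.tensor as T

        class Sample(object):

            def __init__(self, variable, symbolic_index, numeric_index,
                         function_storage):
                self.variable = variable                  # symbolic row of the dataset
                self.symbolic_index = symbolic_index      # symbolic index variable
                self.numeric_index = numeric_index        # this sample's position
                self.function_storage = function_storage  # shared one-element list

            def numeric_value(self):
                if self.function_storage[0] is None:
                    # Compiled once, then reused by every sample of the dataset.
                    self.function_storage[0] = theano.function(
                            [self.symbolic_index], self.variable)
                return self.function_storage[0](self.numeric_index)

        class Dataset(object):

            def __init__(self, data):
                # `data` is a numpy matrix: rows are samples, columns features.
                self.data = theano.shared(data, name='data')

            def __len__(self):
                return self.data.get_value(borrow=True).shape[0]

            def __iter__(self):
                symbolic_index = T.iscalar('index')
                variable = self.data[symbolic_index]
                function_storage = [None]
                for numeric_index in xrange(len(self)):
                    yield Sample(variable, symbolic_index, numeric_index,
                                 function_storage)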

Another way to obtain the same result is to let the user take care of
compiling the function. This gives the user full control over what is being
compiled, at the cost of having to write more code:

    .. code-block:: python

        symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
        get_sample = theano.function([symbolic_index],
                                     dataset[symbolic_index].variable)
        for numeric_index in xrange(len(dataset)):
            do_something_with(get_sample(numeric_index))

James comments: this is how I have written the last couple of projects; it's
slightly verbose, but it's clear and efficient.

Note that although the above example focused on how to iterate over a dataset,
it can be cast into a more generic problem, where some data (either dataset or
sample) is the result of some transformation applied to other data,
parameterized by parameters p1, p2, ..., pN (in the above example, we were
considering a sample obtained by taking the p1-th element of a dataset). If we
vary the values of a subset Q of these parameters while keeping the others
fixed, we would probably want to compile a single function that takes all
parameters in Q as inputs. Ideally the user would be able to take control of
what is being compiled, while a sensible default behavior remains available
for those who do not want to worry about it. How to achieve this is still to
be determined.
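
As a rough illustration of how a subset of parameters could be left free while
the rest are fixed, one could rely on Theano's ``givens`` mechanism (the
transformation and parameter names below are made up for illustration):

    .. code-block:: python

        import numpy
        import theano
        import theano.tensor as T

        data = theano.shared(numpy.random.randn(100, 50), name='data')

        # A transformation parameterized by p1 (a row index) and p2 (a scaling
        # factor).
        p1 = T.iscalar('p1')
        p2 = T.dscalar('p2')
        transformed = data[p1] * p2

        # Here Q = {p1}: only the row index varies, while p2 is fixed at 0.5
        # through `givens`, so the compiled function takes a single input.
        get_row = theano.function([p1], transformed,
                                  givens={p2: T.constant(0.5, dtype='float64')})

        for index in xrange(3):
            do_something_with(get_row(index))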


Another syntactic option for iterating over datasets is

    .. code-block:: python

        for sample in dataset.numeric_iterator(batchsize=10):
            do_something_with(sample)

The ``numeric_iterator`` method would create a symbolic batch index and
compile a single function that extracts the corresponding minibatch. Its
arguments could also specify which compile mode to use, any ``givens`` you
might want to apply, etc.
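
A rough sketch of what such an iterator could do, reusing the hypothetical
dataset internals sketched earlier (``self.data`` as a shared variable holding
the whole matrix):

    .. code-block:: python

        import theano
        import theano.tensor as T

        def numeric_iterator(self, batchsize=10, mode=None, givens=None):
            # A single function is compiled up front; each iteration only
            # changes the numeric value of the batch index.
            batch_index = T.iscalar('batch_index')
            batch = self.data[batch_index * batchsize:
                              (batch_index + 1) * batchsize]
            get_batch = theano.function([batch_index], batch,
                                        mode=mode, givens=givens)
            for i in xrange(len(self) // batchsize):
                yield get_batch(i)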


What About Learners?
--------------------

The discussion above only mentioned datasets, not learners. The learning part
of a learner is not a main concern for now. What matters most with respect to
the discussion above is how a learner takes a dataset as input and outputs
another dataset that can be used with the dataset API.

James asks:
What's wrong with simply passing the variables corresponding to the dataset to
the constructor of the learner?
That seems much more flexible, compact, and clear than the decorator.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as a
``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
and the classification error). We would want to be able to build a dataset
that contains some of these quantities computed on each sample in the input
dataset.

The Neural Network code would then look something like this:

    .. code-block:: python

        class NeuralNetwork(Learner):

            @datalearn(..)
            def compute_prediction(self, sample):
                return theano.tensor.nnet.softmax(
                        theano.tensor.dot(self.weights, sample.input))

            @datalearn(..)
            def compute_nll(self, sample):
                return -theano.tensor.log(
                        self.compute_prediction(sample)[sample.target])

            @datalearn(..)
            def compute_penalized_nll(self, sample):
                return (self.compute_nll(sample) +
                        theano.tensor.sum(self.weights**2))

            @datalearn(..)
            def compute_class_error(self, sample):
                probabilities = self.compute_prediction(sample)
                predicted_class = theano.tensor.argmax(probabilities)
                return predicted_class != sample.target

            @datalearn(..)
            def compute_cost(self, sample):
                return theano.tensor.concatenate([
                        self.compute_penalized_nll(sample),
                        self.compute_nll(sample),
                        self.compute_class_error(sample),
                        ])
            
The ``@datalearn`` decorator would be responsible for allowing such a Learner
to be used e.g. like this:

    .. code-block:: python

        nnet = NeuralNetwork()
        predict_dataset = nnet.compute_prediction(dataset)
        for sample in dataset:
            predict_sample = nnet.compute_prediction(sample)
        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
        multiple_fields_dataset = ConcatDataSet([
                nnet.compute_prediction(dataset),
                nnet.compute_cost(dataset),
                ])
        
In the code above, if one wants to obtain the numeric value of an element of
``multiple_fields_dataset``, the Theano function being compiled would be able
to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.
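
As a rough illustration of the kind of graph this would boil down to
(hand-built here, without the decorator machinery; all variable names and
shapes below are made up), a single compiled function with several outputs
lets Theano compute the shared ``prediction`` sub-graph only once:

    .. code-block:: python

        import numpy
        import theano
        import theano.tensor as T

        index = T.iscalar('index')
        data = theano.shared(numpy.random.randn(100, 10), name='data')
        targets = theano.shared(numpy.random.randint(0, 2, size=100),
                                name='targets')
        weights = theano.shared(numpy.random.randn(2, 10), name='weights')

        input_vector = data[index]
        target = targets[index]

        # `prediction` appears in both outputs below, but the compiled function
        # computes it only once.
        activation = T.dot(weights, input_vector)
        prediction = T.nnet.softmax(activation.dimshuffle('x', 0))[0]
        nll = -T.log(prediction[target])
        class_error = T.neq(T.argmax(prediction), target)

        get_all = theano.function([index], [prediction, nll, class_error])
        predict_numeric, nll_numeric, error_numeric = get_all(0)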