view doc/v2_planning/datalearn.txt @ 1390:746ebceeb46f

added comments to hmc code (old outstanding changes)
author gdesjardins
date Mon, 20 Dec 2010 18:08:04 -0500

DataLearn: How to plug Datasets & Learner together?
===================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader]

High-Level Objectives
---------------------

   * Simple ML experiments should be simple to write
   * More complex / advanced scenarios should be possible without being forced
     to work "outside" of this framework
   * Computations should be optimized whenever possible
   * Existing code (in any language) should be "wrappable" within this
     framework
   * It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano to be able to take advantage of its efficient
computations. The general idea is that if we chain multiple processing
elements (think e.g. of a feature selection step followed by a PCA projection,
then a rescaling within a fixed bounded interval), the overall transformation
from input to output data can be represented by a Theano symbolic graph. When
one wants to access the actual numeric data, a function is compiled so as to
do these computations efficiently.
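
As a concrete (if much simplified) illustration of this chaining idea, here is a small pure-Python sketch. The helper names are hypothetical and no real symbolic graph is built: composing the steps into a single callable stands in for compiling the whole Theano pipeline once.

```python
# Hypothetical sketch: each step describes one transformation; composing
# them yields a single callable, much as a chained Theano graph would be
# compiled into one function.

def select_features(indices):
    # Keep only the given feature columns of a sample.
    return lambda row: [row[i] for i in indices]

def rescale(lo, hi):
    # Linearly rescale a sample's features into [lo, hi].
    def step(row):
        mn, mx = min(row), max(row)
        span = (mx - mn) or 1.0
        return [lo + (x - mn) * (hi - lo) / span for x in row]
    return step

def compose(*steps):
    # Analogous to compiling the whole pipeline into a single function.
    def pipeline(row):
        for step in steps:
            row = step(row)
        return row
    return pipeline

pipeline = compose(select_features([0, 2, 3]), rescale(0.0, 1.0))
print(pipeline([3.0, 9.0, 1.0, 5.0]))  # -> [0.5, 0.0, 1.0]
```

In the real framework the intermediate steps would stay symbolic, so Theano could optimize across step boundaries before any numeric value is requested.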

We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how this Theano-based implementation could be achieved
exactly. For this purpose, in the following, let us assume that a dataset is
simply a matrix whose rows represent individual samples, and columns
individual features. How to handle field names, non-tensor-like data, etc. is
a very important topic that is not yet discussed in this file.

A question we did not discuss much is to which extent the architecture could
be "theanified", i.e. whether a whole experiment could be defined as a Theano
graph on which high level optimizations could be made possible, while also
relying on Theano to "run" the graph. The other option is to use a different
mechanism, with underlying Theano graphs being built wherever possible to link
the various components of an experiment together.

For now, let us consider the latter option, where each dataset contains a
pointer to a Theano variable that represents the data stored in this dataset.
One issue with this approach is illustrated by the following example. Imagine
we want to iterate on samples in a dataset and do something with their numeric
value. We would want the code to be as close as possible to:

    .. code-block:: python

        for sample in dataset:
            do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample also
contains a ``variable`` member which is the variable representing this
sample's data):

    .. code-block:: python

        def numeric_value(self):
            if self.function is None:
                # Compile function to output the numeric value stored in this
                # sample's variable.
                self.function = theano.function([], self.variable)
            return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

    .. code-block:: python

        def numeric_value(self):
            if self.function_storage[0] is None:
                # Compile a function that outputs the numeric value stored in
                # this sample's variable. The function takes as input the
                # index of the sample in the dataset, and is shared among all
                # samples.
                self.function_storage[0] = theano.function(
                                        [self.symbolic_index], self.variable)
            return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by the action of
iterating over the dataset share the same ``function_storage``,
``symbolic_index`` and ``variable``: the first time we try to access the
numeric value of some sample, a function is compiled that takes the index as
input and outputs the variable. The only difference between samples is thus
that they are given a different numeric value for the index
(``numeric_index``).
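
The sharing scheme can be made concrete with a self-contained sketch (plain Python; ``fake_compile`` stands in for ``theano.function``, and the class below is not the proposed API): because all samples hold the same one-slot ``function_storage`` list, only one compilation ever happens.

```python
compile_count = [0]

def fake_compile(data):
    # Stand-in for theano.function([symbolic_index], variable).
    compile_count[0] += 1
    return lambda i: data[i]

class Sample(object):
    def __init__(self, data, index, function_storage):
        self.data = data
        self.numeric_index = index
        # Shared mutable one-slot cache, common to all samples.
        self.function_storage = function_storage

    def numeric_value(self):
        if self.function_storage[0] is None:
            self.function_storage[0] = fake_compile(self.data)
        return self.function_storage[0](self.numeric_index)

data = [10, 20, 30]
storage = [None]
samples = [Sample(data, i, storage) for i in range(len(data))]
values = [s.numeric_value() for s in samples]
print(values, compile_count[0])  # -> [10, 20, 30] 1
```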

Another way to obtain the same result is to actually let the user take care of
compiling the function. It would allow the user to really control what is
being compiled, at the cost of having to write more code:

    .. code-block:: python

        symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
        get_sample = theano.function([symbolic_index],
                                     dataset[symbolic_index].variable)
        for numeric_index in xrange(len(dataset)):
            do_something_with(get_sample(numeric_index))

James comments: this is how I have written my last couple of projects; it's
slightly verbose, but it's clear and efficient.

The code above may also be simplified by providing helper functions. In the
example above, such a function could allow us to iterate on the numeric values
of samples in a dataset while taking care of compiling the appropriate Theano
function. See Discussion: Helper Functions below.

Note that although the above example focused on how to iterate over a dataset,
it can be cast into a more generic problem, where some data (either dataset or
sample) is the result of some transformation applied to other data, which is
parameterized by parameters p1, p2, ..., pN (in the above example, we were
considering a sample that was obtained by taking the p1-th element in a
dataset). If we use different values for a subset Q of the parameters but keep
other parameters fixed, we would probably want to compile a single function
that takes as input all parameters in Q, while other parameters are fixed. It
may be nice to try and get the best of both worlds, letting the user take
control on what is being compiled, while leaving the option of using a default
sensible behavior for those who do not want to worry about it. Whether this is
possible / desirable is still to-be-determined.
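
One hypothetical way to implement this "one compiled function per choice of fixed parameters" behavior is a cache keyed on the subset of parameters that are held fixed (and their values). The sketch below is plain Python, with a dictionary merge standing in for actual compilation.

```python
_cache = {}

def get_compiled(transform_name, fixed_params):
    # fixed_params: parameters frozen into the compiled function; all
    # remaining parameters would be runtime inputs of that function.
    key = (transform_name, tuple(sorted(fixed_params.items())))
    if key not in _cache:
        # Stand-in for an expensive theano.function compilation.
        _cache[key] = lambda **free: dict(fixed_params, **free)
    return _cache[key]

f1 = get_compiled('scale', {'p3': 1.5})
f2 = get_compiled('scale', {'p3': 1.5})   # same fixed subset: cache hit
f3 = get_compiled('scale', {'p3': 2.0})   # new fixed value: recompile
print(f1 is f2, f1 is f3)  # -> True False
```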

What About Learners?
--------------------

The discussion above only mentioned datasets, but not learners. The learning
part of a learner is not a main concern (currently). What matters most w.r.t.
what was discussed above is how a learner takes as input a dataset and outputs
another dataset that can be used with the dataset API.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as a
``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
and the classification error). We would want to be able to build a dataset
that contains some of these quantities computed on each sample in the input
dataset.

The Neural Network code would then look something like this:

    .. code-block:: python

        class NeuralNetwork(Learner):

            # The decorator below is responsible for turning a function that
            # takes a symbolic sample as input, and outputs a Theano variable,
            # into a function that can also be applied on numeric sample data,
            # or symbolic datasets.
            # Other approaches than a decorator are possible (e.g. using
            # different function names).
            @datalearn(..)
            def compute_prediction(self, sample):
                return softmax(theano.tensor.dot(self.weights, sample.input))

            @datalearn(..)
            def compute_nll(self, sample):
                return - log(self.compute_prediction(sample)[sample.target])

            @datalearn(..)
            def compute_penalized_nll(self, sample):
                return (self.compute_nll(sample) +
                        theano.tensor.sum(self.weights**2))

            @datalearn(..)
            def compute_class_error(self, sample):
                probabilities = self.compute_prediction(sample)
                predicted_class = theano.tensor.argmax(probabilities)
                return predicted_class != sample.target

            @datalearn(..)
            def compute_cost(self, sample):
                return theano.tensor.concatenate([
                        self.compute_penalized_nll(sample),
                        self.compute_nll(sample),
                        self.compute_class_error(sample),
                        ])
            
The ``@datalearn`` decorator would allow such a Learner to be used e.g. like
this:

    .. code-block:: python

        nnet = NeuralNetwork()
        # Symbolic dataset that represents the output on symbolic input data.
        predict_dataset = nnet.compute_prediction(dataset)
        for sample in dataset:
            # Symbolic sample that represents the output on a single symbolic
            # input sample.
            predict_sample = nnet.compute_prediction(sample)
        # Numeric prediction.
        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
        # Combining multiple symbolic outputs.
        multiple_fields_dataset = ConcatDataSet([
                nnet.compute_prediction(dataset),
                nnet.compute_cost(dataset),
                ])
        
In the code above, if one wants to obtain the numeric value of an element of
``multiple_fields_dataset``, the Theano function being compiled should be able
to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.
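
To give a rough idea of what such a decorator might do (this is a made-up plain-Python sketch with arbitrary type conventions, not the proposed implementation), it could simply dispatch on the type of its argument:

```python
def datalearn(fn):
    # Hypothetical convention: a list stands in for a (symbolic) dataset,
    # anything else (e.g. a dict of numeric data, or a single symbolic
    # sample) is handled directly.
    def wrapper(self, arg):
        if isinstance(arg, list):
            return [fn(self, sample) for sample in arg]
        return fn(self, arg)
    return wrapper

class Model(object):
    @datalearn
    def compute_prediction(self, sample):
        return sample['input'] * 2

m = Model()
print(m.compute_prediction({'input': 3}))                  # -> 6
print(m.compute_prediction([{'input': 1}, {'input': 2}]))  # -> [2, 4]
```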

Discussion: Are Datasets Variables / Ops?
-----------------------------------------

OD wonders: Should datasets directly be Theano Variables, or should they be a
different object subclass containing a Theano Variable?  The advantage of the
former option would be that they would fit directly within the Theano
framework, which may allow high level optimizations on data transformations.
However, we would lose the ability to combine Theano expressions coded in
individual datasets into a single graph. Currently, I have instead assumed
that a dataset has a member that is a Theano variable, and this variable
represents the data stored in the dataset. The same is done for individual
data samples.

James asks: Why would a Theano graph in which some nodes represent datasets give
up the ability to combine Theano expressions coded in individual datasets?
Firstly, if you want to use Theano expressions and compiled functions to
implement the perform() method of an Op, you can do that.  Secondly, you can
just include those 'expressions coded in individual datasets' into the overall
graph.

OD replies to James: What I had in mind is you would be forced to compile your
own function inside the perform() method of an Op. This seemed like a
potential problem to me because it would prevent Theano from seeing the whole
fine-grained graph and doing optimizations across multiple dataset
transformations (there may also be additional overhead from calling multiple
functions). But if you are saying it is possible to include 'expressions coded
in individual datasets' into the overall graph, then I guess this point is
moot. Would this be achieved with an optimization that replaces the dataset
node with its internal graph?

Razvan comments: 1) Having Theano expressions inside the perform of a Theano
Op can lead to issues. I know I had to deal with a few when implementing
Scan, which does exactly this. To be fair, these issues mostly come into
play when the inner graph has to interact with the outer graph, and most of
the time they can be solved. I guess all that I'm saying is that going that
way might lead to some headaches for developers, though some headaches
will probably be involved no matter what.
2) In my view (I'm not sure this is what Olivier was saying), the idea of
not putting the Dataset into a Variable is to keep the logic related to
loading data, dividing it into slices when running it on the GPU, and so on,
out of a Theano variable. In my view this logic goes into a DataSet class
that gives you shared variables, symbolic indices into those shared
variables, and also numeric indices. When looping through those numeric
indices, the dataset class can reload parts of the data into the
shared variable and so on.

OD replies to Razvan's point 2: I think what you are saying is another concern
I had, which was the fact that it may be confusing to mix the Variable/Op and
DataSet interfaces in the same class. I would indeed prefer to keep them
separate. However, it may be possible to come up with a system that would get
the best of both worlds (maybe by having the Op/Variable as members of
Dataset, and just asking the user building a theano graph to use these instead
of the dataset directly). Note that I'm mixing up Op/Variable here, because
it's just not clear yet to me which would go where...


Discussion: Implicit / Explicit Function Compilation
----------------------------------------------------

<Razvan comments>: I assume that ``do_something_with`` is supposed to be some
numeric function, and dataset in this case is the result of some
computations on an initial dataset.
I would differentiate the two approaches (1) and (2) as follows:
 - first of all, whatever you can do with (1) you can do with (2)
 - approach (1) hides the fact that you are working with symbolic graphs.
   You apply functions to datasets, and when you want to see values, a
   function is compiled under the hood and those values are computed for
   you. In approach (2) the fact that you deal with a symbolic graph is
   explicit, because you have to manually compile your functions.
 - approach (1) needs to use this function_storage trick, shared between
   certain nodes of the graph, to reduce the number of compilations, while
   in approach (2) we don't need to deal with the complexity of lazy
   compilation

OD comments: Well, to be fair, this means we put the burden of dealing with
the complexity of lazy compilation on the user (it's up to them to make sure
only one function is compiled).

 - approach (1) needs a replace function if you want to change the dataset.
   What you would do, once you have a "computational graph" (or pipeline, or
   whatever you call it), say ``graph``, is change the input with
   graph.replace({init_data_X: new_data_X}). In approach (2), init_data_X
   and new_data_X are the ``dataset``, so you would compile two different
   functions. I would re-write (2) -- to make the above more clear --
   as:

   .. code-block:: python

        symbolic_index = theano.tensor.iscalar()
        get_sample1 = theano.function([symbolic_index],
                        graph(dataset[symbolic_index]).variable)
        for numeric_index in xrange(len(dataset)):
            do_something_with(get_sample1(numeric_index))

        get_sample2 = theano.function([symbolic_index],
                        graph(new_dataset[symbolic_index]).variable)
                        ## Note: the dataset was replaced with new_dataset
        for numeric_index in xrange(len(new_dataset)):
            do_something_with(get_sample2(numeric_index))

        ######### FOR (1) you write: 

        for datapoint in graph:
            do_something_with( datapoint() )

        new_graph = graph.replace({dataset:dataset2})

        for datapoint in new_graph:
            do_something_with(datapoint())

OD comments: I don't really understand what 'graph' is in this code (it
appears in both approaches but is used differently). What I have in mind would
be closer to the first approach you describe (#2) with 'graph' removed, and
graph / new_graph replaced by dataset / new_dataset in the second one (#1).
You wouldn't need to call some graph.replace method: the graphs compiled for
iterating on 'dataset' and 'new_dataset' would be entirely separate (using two
different compiled functions, pretty much like #2).

RP answers: Yes, you are right. What I was trying to say is that if you have
two different datasets on which you want to apply the same pre-processing,
you can do that in both approaches. ``graph`` represents the pre-processing
steps in (2) and the end dataset (after preprocessing) in (1). So the idea
is that instead of making new_graph from scratch (re-applying all the
transforms on the original dataset) you can use replace. Or maybe the
__call__ (that compiles the function if needed) can get a givens dictionary
(that replaces datasets, or more). I only gave this argument because I
thought this is an issue people will raise. They will say: well, in (2)
the pipeline logic is separated from the data, so you can easily use the same
transformation with different data, while in (1) you write the
transformation rooted in a dataset, and if you want the same transformation
for a different dataset you have to re-write everything.

OD replies: Still not sure I understand. If you have a "graph" function that
takes a dataset as input and outputs a new dataset, you can use this same
function with both (1) and (2). With (2) it is:
    theano.function([index], graph(my_dataset)[index].variable)
while with (1) the same function is compiled implicitly with:
    for sample in graph(my_dataset):
        ...

RP answers: right. I was actually constructing this contrived example in my
mind, where you would do something like:
      i1 = f1(data)
      i2 = f2(i1)
      i3 = f3(i2)
      ...
      iN = fN(iN-1)
 and then you would say: wait, I want to do this on new_data as well. Oh no, I
 have to copy the entire block or whatever. That is so annoying. But actually
 you could just write:

     def my_f(data):
       i1 = f1(data)
       ...
       return iN

 and then just use that function, which is what you pointed out. I agree I'm
 not sure anymore about the point that I was trying to make. It's like: if
 you are a lazy programmer, and you write everything without functions, you
 can argue that you like (2) more because you only pass the dataset at the
 end and not at the beginning. But if (1) had the replace function this
 argument would fail. Though this only stands if you don't want to make
 a function out of your pipeline that takes the dataset as input, which now
 that I think about it is pretty silly not to do. Sorry for that.


 - in approach (1) the initial dataset object (the one that loads the data)
   decides if you will use shared variables and indices to deal with the
   dataset, or if you will use ``theano.tensor.matrix``, and not the user (at
   least not without hacking the code). Of course, whoever writes that class
   can add a flag to it to switch between behaviours that make sense.
   In approach (2) one is not forced to do this
   inside that class by construction, though by convention I would do it.
   So if you consider the one who writes that class as a developer, then
   in (2) the user can decide/deal with this, and not the developer.
   Though this is a fine line -- I would say the user would actually
   write that class as well, using some template.
   That is to say, (2) looks and feels more like working with Theano
   directly.

Bottom line, I think (1) puts more stress on the development of the library,
and hides Theano and some of the complexity from day-to-day usage.
In (2) everything is a bit more explicit, leaving the impression that you
have more control over the code, though I strongly feel that whatever can
be done in (2) can be done in (1). Traditionally I was more inclined
towards (1), but now I'm not so sure; I think both are equally interesting
and valid options.
</Razvan comments>

Discussion: Fixed Parameters vs. Function Arguments
---------------------------------------------------

Razvan Comment: I thought about this a bit at the Pylearn level. In my
original train of thought you would have the distinction between "hand
picked parameters", which I would call hyper-parameters, and learned
parameters. A transformation in this framework (an op, if you wish) could
take as inputs DataSet(s), DataField(s), Parameter(s) (which are the things
that the learner should adapt) and HyperParameter(s). All hyper-parameters
would turn into arguments of the compiled function (like the indices of each
of the dataset objects) and therefore they can be changed without
re-compilation. Or, in other words, this can be easily done by having new
types of Variables that would represent Parameters and Hyper-parameters.
And as an ending note, I would say that there are
hyper-parameters for which you need to recompile the Theano function, and
which cannot simply be arguments (so we would have yet another category?).

Yoshua's comments on RP's comments: I don't understand why we would
need to create these types. Isn't it just a matter for the programmer
to decide what are the inputs of the compiled function, and which
are possibly constant (e.g. holding some hyper-parameters constant 
for a while)?

RP answers: If we opt for this lazy compilation mechanism, the library needs
to know what to put into a shared variable and what to expect as input. The
programmer should give hints to the library by saying "this value will always
be constant", or "this is a hyper-parameter that I might want to change, and
when I do that I don't want to recompile everything, so make it an
argument". Even when the compilation is done by the user, it would be helpful
to have some function that collects all the parameters for you. What I mean
is that it would be nice to write something like

  corruption_layer_1 = Parameter(value=0.1, name='c1')
  # Followed by (many) lines of code
  f = function(results.inputs() + results.hyper_params(), result)


where results.hyper_params parses the graph, collects the hyper-parameters,
and returns them as a list of theano.Variables wrapped in theano.In, with
a default value and a name. You could call the function either as

    f()
or 
    f(c1 = 0.2)
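
A toy version of this collection mechanism could look as follows (plain Python; ``Parameter`` and ``function`` here are hypothetical stand-ins for graph parsing and ``theano.In``-style wrapping, respectively):

```python
class Parameter(object):
    # Stand-in for a hyper-parameter variable with a name and default value.
    def __init__(self, value, name):
        self.value = value
        self.name = name

def function(hyper_params, result_fn):
    # Stand-in for theano.function with theano.In-style named defaults:
    # each hyper-parameter can be overridden by keyword at call time.
    defaults = dict((p.name, p.value) for p in hyper_params)
    def f(**overrides):
        values = dict(defaults, **overrides)
        return result_fn(values)
    return f

c1 = Parameter(value=0.1, name='c1')
f = function([c1], lambda v: v['c1'] * 10)
print(f(), f(c1=0.2))  # -> 1.0 2.0
```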

OD comments: Here is a (hopefully simpler) suggestion to solve this problem.
Consider any data{set,point} obtained by a transformation of an existing
data{set,point} with parameters p1, p2, ..., pN. From the point of view of
theano variables, this is something like x2 = h(x1, p1=v1, ..., pN=vN) where
x1, x2 are variables and h is an Op. In addition v1 ... vN are also variables
since they are parameters of the transformation we may want to vary. This is
not, however, the way the user would build the graph, because being forced to
use variables for parameters is not user-friendly (IMO). Instead, someone
would write:
    d2 = t(d1, p1=w1, ..., pN=wN)
where d1, d2 are data{set,point}s, t is the transformation, and w1 ... wN are
numeric values of the parameters. Then t would build the piece of graph above,
so that when you ask d2.numeric_value(), a function computing x2 would be
compiled, that would take as input variables v1, ... vN.
Now, the problem is that this may not be fully optimized, since parameters are
assumed to be varying (so as not to be forced to recompile a different
function when the user calls t with different parameter values). My suggestion
is to make this the default behavior, but add an extra argument to t:
    d2 = t(d1, p1=w1, ..., pN=wN, constants=['p3', 'p5'])
The line above would do the same, except that the function being compiled
would use the constant values w3 and w5 for p3 and p5.
Razvan's example above would be written in a different way as follows:
    def f(c1=0.2):
        return transformK(..(transform2(transform1(input_data,
                                                  corruption_layer_1=c1))))
With this code you could create various transformed datasets by calling f
with different values for c1. The first time you call f(c1=0).numeric_value()
a Theano function is compiled that takes a `corruption_layer_1` input variable
(whose value is 0 when the function is called by `numeric_value`). If you call
f().numeric_value(), the same function is re-used (no need to compile it) with
this input set to 0.2.  If on another hand you want to compile a new function
for each new value of your `corruption_layer_1` parameter, you would instead
write:
    def f(c1=0.2):
        return transformK(..(transform2(transform1(input_data,
                                                  corruption_layer_1=c1,
                                                  constants=['corruption_layer_1']))))
This would be one way to have automatic lazy function cache / compilation
while still letting the user specify for which parameters a new function needs
to be compiled when their value changes.
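
A minimal sketch of this caching behavior (plain Python; the cache entry is a placeholder for a compiled Theano function, and the transform itself is a no-op):

```python
compilations = []

def transform(data, constants=(), **params):
    # Parameters listed in `constants` are frozen into the compiled
    # function, so each new frozen value triggers a new compilation;
    # all other parameters would be runtime inputs of a shared function.
    const_vals = tuple(sorted((name, params[name]) for name in constants))
    key = ('transform', const_vals)
    if key not in transform.cache:
        compilations.append(key)
        transform.cache[key] = True  # placeholder for a compiled function
    return data

transform.cache = {}

transform([1], p1=0.1)                     # compiles (p1 is an input)
transform([1], p1=0.2)                     # reuses the same function
transform([1], p1=0.1, constants=['p1'])   # compiles with p1 frozen at 0.1
transform([1], p1=0.3, constants=['p1'])   # compiles again: new frozen value
print(len(compilations))  # -> 3
```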

RP comment: What about the same trick that Theano uses, namely, if you want
a non-"default" behaviour you wrap the input in a dictionary? You would
write transform1(input_data,
                corruption_layer_1=In(value=c1, fixed=True)).
I have started to like this approach of passing extra info about an argument :).
Other than this, it sounds good to me.

OD replies: Yes, I guess it would make sense. The more I look at it, the more
it seems like it is very close to directly writing a Theano transform on some
variables.


Discussion: Helper Functions
----------------------------

James: Another syntactic option for iterating over datasets is

    .. code-block:: python

        for sample in dataset.numeric_iterator(batchsize=10):
            do_something_with(sample)

The numeric_iterator would create a symbolic batch index, and compile a single function
that extracts the corresponding minibatch.  The arguments to the
numeric_iterator function can also specify what compile mode to use, any givens
you might want to apply, etc.
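
A plain-Python sketch of such an iterator (a slicing closure stands in for the single compiled Theano function that takes the batch index as input):

```python
def numeric_iterator(data, batchsize):
    # Stand-in for theano.function([batch_index], minibatch_expression),
    # compiled once per call to numeric_iterator and reused for every batch.
    get_batch = lambda i: data[i * batchsize:(i + 1) * batchsize]
    n_batches = (len(data) + batchsize - 1) // batchsize
    for i in range(n_batches):
        yield get_batch(i)

batches = list(numeric_iterator(list(range(7)), batchsize=3))
print(batches)  # -> [[0, 1, 2], [3, 4, 5], [6]]
```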

Yoshua's comment to James' comment: I like that approach.

OD comments: Would there also be some kind of function cache to avoid
compiling the same function again if we re-iterate on the same dataset with
the same arguments? Maybe a more generic issue is: would there be a way for
Theano to be more efficient when re-compiling the same function that was
already compiled in the same program? (note that I am assuming here it is not
efficient, but I may be wrong). 

OD adds: After thinking more about it, this seems very close to my first
version where a function is automatically compiled "under the hood" when
iterating on a dataset and accessing the numeric value of a resulting
sample. The main differences are:
- In your version, the result is directly a numeric value, while in my version
  one would obtain symbolic samples and would need to call some method to
  obtain their numeric value. I think I like mine a bit better because it
  means you can use the same syntax to e.g. iterate on a dataset, whether you
  are interested in the symbolic representation of samples or in their numeric
  values. On the other hand, doing so could be less efficient, since you
  create an intermediate representation you may not use. The overhead does not
  seem like much to me, but I am not sure about that.
- In your version, you can provide to the function e.g. compile modes /
  givens. This could probably also be done in my version, although it makes it
  more difficult if you want to cache the function to avoid compiling it more
  than once (see next point).
- (Related to my first comment above) In your version it seems like a new
  function would be compiled every time the user calls e.g.
  'numeric_iterator', while in my version the function would be compiled only
  once. Maybe this can be solved at the Theano level with an efficient
  function cache?

Discussion: Dataset as Learner Output
-------------------------------------

James asks:
What's wrong with simply passing the variables corresponding to the dataset to
the constructor of the learner?
That seems much more flexible, compact, and clear than the decorator.

OD replies: Not sure I understand your idea here. We probably want a learner
to be able to compute its output on multiple datasets, without having to point
to these datasets within the learner itself (which seems cumbersome to me).
The point of the decorators is mostly to turn a single function (that outputs
a theano variable for the output computed on a single sample) into a function
that can compute symbolic datasets as well as numeric sample outputs. Those
could also be instead different functions in the base Learner class, if the
decorator approach is considered ugly / confusing.

Razvan asks: What is predict_sample for? What is predict_dataset? What I
guess you mean is that the decorator is used to convert a function that
takes a theano variable and outputs a theano variable into a class/function
that takes a DataField/DataSet and outputs a DataField/DataSet. It could
also register all those different functions, so that the Dataset that
you get out of the entire Learner (not out of one of the functions; this
Dataset is returned by __call__) would contain all those as fields.
I would use it like this:

.. code-block:: python

    nnet = NeuralNetwork()
    results = nnet(dataset)
    for datapoint in results:
        print datapoint.prediction, datapoint.nll, ...

Is this close to what you are suggesting?

OD: Yes, you guessed right, the decorator's role is to do something different
depending on the input to the function (see my reply to James above).