view doc/v2_planning/dataset.txt @ 1084:7e6e77d50eeb

dataset: I say the learner committee should take care of dataset as well
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 17:06:38 -0400
parents 4c00af69c164
children de456561ec40

Discussion of Function Specification for Dataset Types
======================================================

Some talking points from the September 2 meeting:

 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
 needs to be flexible enough to accommodate different (sub)tasks and views of
 the same underlying data.
 * Datasets as probability distributions from which one can sample.
    * That's not something I would consider to be a dataset-related problem to
        tackle now: a probability distribution in Pylearn would probably be a
        different kind of beast, and it should be easy enough to have a
        DatasetToDistribution class for instance, that would take care of viewing a
        dataset as a probability distribution. -- OD
 * Our specification should allow transparent handling of infinite datasets (or
 simply datasets which cannot fit in memory)
 * GPU/buffering issues.
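
As an illustration of the DatasetToDistribution idea mentioned above, here is a
minimal sketch, assuming a finite dataset that fits in a Python list. The class
body, the `sample` method and the `seed` argument are hypothetical and only show
the intent: the wrapper exposes the empirical distribution of the stored
examples.

```python
import random


class DatasetToDistribution:
    """Hypothetical wrapper: views a finite, indexable dataset as its
    empirical distribution (uniform over the stored examples)."""

    def __init__(self, examples, seed=None):
        self.examples = list(examples)
        self.rng = random.Random(seed)

    def sample(self, n=1):
        # Draw n examples uniformly, with replacement.
        return [self.rng.choice(self.examples) for _ in range(n)]


dist = DatasetToDistribution([(0.0, 'a'), (1.0, 'b'), (2.0, 'c')], seed=0)
draws = dist.sample(5)
```

The dataset itself is left untouched, which is in line with treating
distributions as a separate kind of object.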

Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides	
- mlpy: very primitive notions of data (simple 2D matrices)
- PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
    SequentialDataSet, ReinforcementDataSet, ... Each class is quite
    constrained and may have a different interface.
- MDP: Seems to have restrictions on the type of data being passed around, as
    well as its dimensionality ("Input array data is typically assumed to be
    two-dimensional and ordered such that observations of the same variable are
    stored on rows and different variables are stored on columns.")
- Orange: Data matrices, with names and types associated to each column.
  Basically there seems to be only one base dataset class that contains the
  data. Data points are lists (of values corresponding to each column).
- APGL: Hard to say how they deal with data from the documentation alone.
- Monte: Data is simply numpy arrays.
- scikits.learn: Dataset is a simple container with e.g. dataset.data being
    a 2D numpy array of input features, and dataset.target the target vector.
- Shogun: Vade Retro C++! (may be worth looking into their feature concept
    though).
- Any more worth looking at?

A few things that our dataset containers should support at a minimum:

    - streams, possibly infinite
    - task/views of the data for different problems
    - indexing & slicing 
    - pairs or triples or etc of examples
    - a 'distance/gram matrix' container (imagine that the data is given to you
      as a distance matrix)
    - multi-dimensional time-series (again, maybe with pairs/triples, maybe
      given to you as a distance matrix over time)
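
To make the "streams, possibly infinite" requirement concrete, here is a
minimal sketch, assuming a stream is simply a Python iterable of examples. The
names `counting_stream` and `batches` are hypothetical. The point is that
batching can be written without ever asking for the length of the data.

```python
import itertools


def counting_stream():
    """Hypothetical infinite dataset: yields (input, target) pairs forever."""
    for i in itertools.count():
        yield (float(i), i % 2)


def batches(stream, batch_size):
    """Group any iterable, finite or not, into lists of batch_size examples."""
    iterator = iter(stream)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # only reached for finite streams
        yield batch


# Take the first two batches of three examples from the infinite stream.
first_two = list(itertools.islice(batches(counting_stream(), 3), 2))
```

Indexing and slicing would still need a separate mechanism, since a pure
stream has no random access.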

Another question to consider is the following: how tightly should it integrate
with Theano? Do we want to be able to store data as shared variables or just
have an option for that? Theano + GPU constrains what we can do (in terms
of sizes, buffering, etc.): these are things we need to think about, but it's not
clear whether we should aim for building them into the interface.

Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?

There is then the question of how to approach the design of a Dataset class from
an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class' 
Dataset that doesn't implement any methods except a few setters/getters. The reason
to have the methods listed that way is to have a common 'specification', but classes
that inherit from Dataset need not implement every single method (only the ones
that are relevant) and can obviously implement other methods as appropriate. The
reason to have a common specification (as abstract as it might be) is to, well,
have a common specification that would make our code clearer and cleaner.

An example of what I (Dumi) am thinking in terms of concrete API:

class Dataset:
    def __init__(self):
        self.type = None
        self.in_memory = None
        self.inputs = None  # list of filepaths, or objects in memory, or...
        self.outputs = None
        # Set through the setters below; initialized here so that they
        # always exist.
        self.view_type = None
        self.n_classes = None
        self.batch_size = None

    def get_example(self, example_index):
        raise NotImplementedError()

    def get_next_example(self):
        raise NotImplementedError()

    def get_batch(self, batch_index):
        raise NotImplementedError()

    def get_next_batch(self):
        raise NotImplementedError()

    def get_slice(self, slice_object):
        raise NotImplementedError()

    def set_view(self, view_type):
        # Changing the view resets the number of classes.
        self.view_type = view_type
        self.n_classes = None

    def set_n_classes(self, n_classes):
        self.n_classes = n_classes

    def set_batch_size(self, batch_size):
        self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I think we should
just have a train dataset, a valid one and a test one instead or (if it's in one
big file or infinite stream) just handle the split ourselves (via slicing, for
instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
specification does not preclude more fine-grained 'splitting' of the data.

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

import numpy
import theano
import theano.tensor as T

class MNIST(Dataset):
    def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
        self.type = 'standard_xy'
        self.in_memory = True
        self.inputs = inputs  # load them or create
        self.outputs = outputs
        self.set_view('classification')
        self.set_n_classes(10)
        self.set_batch_size(20)
        self.n_batches = self._compute_n_batches()

    def get_batch(self, batch_index):
        # What is returned depends on the current view.
        x, y = self._fetch_batch(batch_index)
        if self.view_type == 'classification':
            return x, numpy.int32(y)
        elif self.view_type == 'density_estimation':
            return x
        else:
            raise NotImplementedError()

    def shared_data(self):
        # Assumes self.inputs and self.outputs hold the loaded arrays,
        # not filepaths.
        shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
        return shared_x, T.cast(shared_y, 'int32')

    def _compute_n_batches(self):
        pass  # placeholder, not specified here

    def _fetch_batch(self, batch_index):
        pass  # placeholder, not specified here

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that! 

So we'd use it as:

train_mnist = MNIST(inputs = ['train_x.npy'], outputs = ['train_y.npy'])
valid_mnist = MNIST(inputs = ['valid_x.npy'], outputs = ['valid_y.npy'])

x,y = train_mnist.get_batch(0)
train_mnist.set_view('density_estimation')
x = train_mnist.get_batch(0)

or

mnist_data = MNIST(inputs = ['x.npy'], outputs = ['y.npy'])
batches_train = range(int(mnist_data.n_batches*0.8))
batches_valid = range(int(mnist_data.n_batches*0.8),mnist_data.n_batches)

xt,yt = mnist_data.get_batch(batches_train[0])
xv,yv = mnist_data.get_batch(batches_valid[0])




COMMENTS
~~~~~~~~


JB asks: What may be passed as argument to the functions in Dataset, and what
can be expected in return?  Are there side effects (e.g. on the state of the
Dataset) associated with any of the functions?

JB asks: What properties are part of the Dataset API? What possible types can
they have, are they expected to be read-only or writeable?  What do they mean?


JB asks: What is a view?  Does set_view change the Dataset or return a new
Dataset with a certain view of the original (in which case call it get_view)?
Does the view imply the types of the return-value of functions like
get_batch?  What is the difference between the view and the subclasses of
Dataset in PyML?
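
To illustrate the set_view / get_view distinction raised here, a minimal
sketch, assuming data held in plain Python lists. `XYDataset` and its
`get_view` method are hypothetical and are not part of the proposal above. This
variant returns a new object that shares the underlying data, so the original
dataset is not modified.

```python
class XYDataset:
    """Hypothetical dataset holding inputs x and targets y in memory."""

    def __init__(self, x, y, view_type='classification'):
        self.x = x
        self.y = y
        self.view_type = view_type

    def get_view(self, view_type):
        # Return a new object sharing the same underlying lists;
        # the original dataset keeps its own view_type.
        return XYDataset(self.x, self.y, view_type)

    def get_example(self, index):
        # The view determines the type of the return value.
        if self.view_type == 'classification':
            return self.x[index], int(self.y[index])
        elif self.view_type == 'density_estimation':
            return self.x[index]
        raise NotImplementedError(self.view_type)


labelled = XYDataset([[0.1, 0.2], [0.3, 0.4]], [1, 0])
unlabelled = labelled.get_view('density_estimation')
```

With set_view, by contrast, any other code holding a reference to the same
dataset would see its return types change.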

JB asks:  Do container formats (I'm thinking of HDF5) offer features for fast
retrieval that we would like to expose via this interface?

JB asks: How would you recommend using this sort of dataset in a boosting
algorithm where points need to be re-weighted?
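
One possible answer, given only as a sketch: keep the weights outside the
dataset and resample examples in proportion to them. The `weighted_batch`
helper is hypothetical, and this is just one option (the weights could equally
be passed to the learner as an extra field).

```python
import random


def weighted_batch(examples, weights, batch_size, rng):
    """Draw a batch with probability proportional to per-example weights."""
    indices = rng.choices(range(len(examples)), weights=weights, k=batch_size)
    return [examples[i] for i in indices]


examples = ['easy', 'easy too', 'misclassified']
# After a boosting round, all the weight sits on the misclassified point.
batch = weighted_batch(examples, [0.0, 0.0, 1.0], 4, random.Random(0))
```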


JB asks: Do we want to provide for the possibility of feedback that modifies the
dataset?  For example, curriculum learning might be adaptive in this sense, or
if we wanted to provide a virtual world for an agent as a dataset then we need
to provide 'actions' to get the next batch.  Could this be done in the current
API?


Field names and attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~

OD: One important question is how to handle fields' names and characteristics.
For instance, it can be useful to know that the 3rd input field represents a
number of fingers, and is a non-negative discrete field whose numeric value is
meaningful (compared to, say, an integer index that would correspond to an
animal's category). We mentioned metadata during the meeting, but we did not
get into its details: that may be the place to put this kind of information.
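
A minimal sketch of what such per-field metadata could look like, assuming one
record per field. The `FieldInfo` type and its attribute names are purely
illustrative and were not discussed at the meeting.

```python
from collections import namedtuple

# Hypothetical per-field description; attribute names are illustrative only.
FieldInfo = namedtuple('FieldInfo', ['name', 'kind', 'ordered', 'minimum'])

metadata = [
    # Numeric value is meaningful: 4 fingers is "more" than 3.
    FieldInfo(name='n_fingers', kind='discrete', ordered=True, minimum=0),
    # Integer index only: category 4 is not "more" than category 3.
    FieldInfo(name='animal_category', kind='discrete', ordered=False, minimum=0),
]

# A learner could use this to decide which fields to treat as numeric.
numeric_fields = [f.name for f in metadata if f.ordered]
```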


Freeing memory
~~~~~~~~~~~~~~

OD: It is sometimes useful to be able to free memory used by previous
computations. A typical example is when you load the original dataset into
memory, then perform various processing steps, ending with a new dataset that
you also store in memory before feeding it to the learner. Unless you very
carefully design your code to avoid it, your original dataset will still
remain in memory (as well as maybe the results of some computations performed
along the way). So there may be a use for a `clear()` method that would be
called by the topmost dataset (the one doing the final memory caching), and
would be forwarded iteratively to previous datasets so as to get back all this
wasted memory space.
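
A minimal sketch of how such a `clear()` could be forwarded, assuming each
dataset keeps a reference to the dataset it was computed from. The
`CachedDataset` class and its `source` attribute are hypothetical.

```python
class CachedDataset:
    """Hypothetical dataset that caches data in memory and wraps a source."""

    def __init__(self, data, source=None):
        self.data = data
        self.source = source  # upstream dataset this one was computed from

    def clear(self):
        # Walk up the chain of upstream datasets and drop their caches,
        # keeping only our own.
        source = self.source
        while source is not None:
            source.data = None
            source = source.source


raw = CachedDataset([1.0, 2.0, 3.0])
scaled = CachedDataset([v / 3.0 for v in raw.data], source=raw)
final = CachedDataset([v - 0.5 for v in scaled.data], source=scaled)
final.clear()
```

Note that memory is only reclaimed if nothing else still references the
dropped data.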

What is a mini-batch?
~~~~~~~~~~~~~~~~~~~~~

This is a follow-up to the meeting's discussion about whether a mini-batch
returned by a dataset should be itself a dataset.

OD: During the meeting I was voting in favor of a 'yes', mostly because it
made sense to me (a mini-batch is a subset of a dataset and thus should be a
dataset), but now I tend towards 'no'. The main reason is it is not clear yet
what the dataset interface will be, so that it is hard to judge whether this
is a good idea (my main concern is how much additional work would be required by
the writer of a new dataset subclass). Anyway, maybe a first thing we could
think about is what we want a mini-batch to be. I think we can agree that we
would like to be able to do something like:
    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)
so that it should be ok for a mini-batch to be an object whose fields
(that should have the same name as those of the dataset) are numpy arrays.
More generally, we would like to be able to iterate on samples in a
mini-batch, or do random access on them, so a mini-batch should implement
__iter__ and __getitem__.
Besides this, is there any other typical use-case of a mini-batch? In
particular, is there any reason to want an infinite mini-batch? (in which case
we may need to revise our idea of what 'mini' means) Hopefully the answer to
that last question is no, as I think it would definitely keep things simpler,
since we could simply use numpy arrays (for numeric data) or lists (for
anything else) to store mini-batches' data. So I vote for 'no'.
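
A minimal sketch of such a mini-batch, assuming finite batches. Plain lists
stand in for numpy arrays to keep the example self-contained, and the
`MiniBatch` and `ListDataset` classes are hypothetical.

```python
class MiniBatch:
    """Hypothetical finite mini-batch with the same field names as the dataset."""

    def __init__(self, input, target):
        self.input = input
        self.target = target

    def __len__(self):
        return len(self.input)

    def __getitem__(self, index):
        # Random access to one sample.
        return self.input[index], self.target[index]

    def __iter__(self):
        # Iterate over samples as (input, target) pairs.
        return iter(zip(self.input, self.target))


class ListDataset:
    def __init__(self, input, target):
        self.input = input
        self.target = target

    def mini_batches(self, size):
        for start in range(0, len(self.input), size):
            yield MiniBatch(self.input[start:start + size],
                            self.target[start:start + size])


dataset = ListDataset([[0.0], [1.0], [2.0], [3.0], [4.0]], [0, 1, 0, 1, 0])
sizes = [len(mb) for mb in dataset.mini_batches(size=2)]
```

Here the mini-batch is not itself a dataset: it has no `mini_batches` method and
is always finite.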

A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

OD: This is more a high-level comment that may or may not be relevant
depending on how we get to plug our different classes together.
In PLearn (old C++ LISA ML library) we had *lots* of dataset subclasses doing
all sorts of fancy things, the majority of these classes taking as input
another dataset, and transforming it in some way (e.g. taking a subset of
samples, a subset of features, normalizing features, computing extra fields
given existing fields, etc.). I think right now our interface is heading in a
similar direction.
When you think about it, this kind of operation is equivalent to writing a
learner class that is trained on the input dataset, and whose output on this
same dataset is used to obtain an output dataset (note that the training phase
may do nothing, e.g. if the goal is only to filter out a predefined set of
samples).
If you push it even further, even a dataset that has no input dataset, say
e.g. a dataset view of a 2D numpy matrix, can be seen as the output of a
learner that was trained on nothing and whose output is computed on nothing
(but still outputs this 2D matrix).
In the small ML library I have been using at Ubisoft, the dataset class
actually inherits from learner, based on this point of view. Actually pretty
much all objects that are plugged together to make an experiment are learners.
The main advantage is everything has the same interface and the "plugging" of
the different parts can remain very simple. Confusion is avoided by the module
hierarchy to ensure objects with different behavior have different names.
Something like dataset.MatrixDataset would create a dataset from scratch (i.e.
a numpy matrix), process.FilterSamples would be something that does not need
to be trained, but needs an input dataset, and learner.NNet would be a usual
learning algorithm that must be trained on an input dataset, and computes an
output (possibly on the same dataset, possibly on another one).
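
A minimal sketch of this point of view, assuming a two-method interface. The
method names `train` and `compute` are hypothetical, and the class bodies only
mimic the roles of dataset.MatrixDataset and process.FilterSamples described
above.

```python
class Learner:
    """Hypothetical common interface: everything can be trained and queried."""

    def train(self, dataset=None):
        pass  # default: nothing to learn

    def compute(self, dataset=None):
        raise NotImplementedError()


class MatrixDataset(Learner):
    """A dataset built from scratch: trained on nothing, outputs its rows."""

    def __init__(self, rows):
        self.rows = rows

    def compute(self, dataset=None):
        return self.rows


class FilterSamples(Learner):
    """Needs an input dataset but no training: keeps rows passing a test."""

    def __init__(self, keep):
        self.keep = keep

    def compute(self, dataset=None):
        return [row for row in dataset if self.keep(row)]


source = MatrixDataset([[1, 0], [5, 1], [2, 1]])
positives = FilterSamples(lambda row: row[1] == 1).compute(source.compute())
```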

Ok, this is getting too long, I am definitely not saying we should do this,
but I think there is some close relationship between the usual data processing
we do and the learning process, so it may be worth thinking about how to put them
together in a coherent framework. For instance, in PLearn there was (something
like) a NormalizeVMatrix (think of it as a dataset subclass), but it could
not be used in a natural way to learn the normalization parameters on a
training set (e.g. mean and std of features) and normalize another dataset.
Instead you could use (something like) a
PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having both
ways to do (almost the) same thing can be confusing.
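
To illustrate the normalization case, here is a minimal sketch, assuming rows
of floats. `NormalizeLearner` here is a hypothetical stand-in and not the
PLearn class. The parameters are learned on a training set and then applied to
another dataset.

```python
import statistics


class NormalizeLearner:
    """Hypothetical learner whose 'training' estimates per-feature mean/std."""

    def train(self, rows):
        columns = list(zip(*rows))
        self.mean = [statistics.mean(c) for c in columns]
        # Fall back to 1.0 for constant features to avoid dividing by zero.
        self.std = [statistics.pstdev(c) or 1.0 for c in columns]
        return self

    def compute(self, rows):
        return [[(v - m) / s for v, m, s in zip(row, self.mean, self.std)]
                for row in rows]


train_rows = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
test_rows = [[2.0, 11.0]]
normalizer = NormalizeLearner().train(train_rows)
normalized_test = normalizer.compute(test_rows)
```

With a single code path like this, normalizing the training set itself is just
the special case where `compute` is called on the rows used for `train`.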