doc/v2_planning/dataset.txt -- revision 20a1af112a75
dataset: Looked into datasets from some other ML libraries
author: Olivier Delalleau <delallea@iro>
date:   Fri, 10 Sep 2010 12:11:10 -0400
Discussion of Function Specification for Dataset Types
=======================================================

Some talking points from the September 2 meeting:

 * Datasets as views/tasks (Pascal Vincent's idea): our dataset
   specification needs to be flexible enough to accommodate different
   (sub)tasks and views of the same underlying data.
 * Datasets as probability distributions from which one can sample.
    * That's not something I would consider to be a dataset-related
      problem to tackle now: a probability distribution in Pylearn would
      probably be a different kind of beast, and it should be easy enough
      to have a DatasetToDistribution class, for instance, that would take
      care of viewing a dataset as a probability distribution. -- OD
 * Our specification should allow transparent handling of infinite
   datasets (or simply datasets which cannot fit in memory).
 * GPU/buffering issues.

Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet,
  KernelData, PairDataSet, Aggregate. Ultimately, the learner decides.
- mlpy: very primitive notions of data.
- PyBrain: datasets are geared towards specific tasks:
  ClassificationDataSet, SequentialDataSet, ReinforcementDataSet, ...
  Each class is quite constrained and may have a different interface.
- MDP: seems to have restrictions on the type of data being passed around,
  as well as on its dimensionality ("Input array data is typically assumed
  to be two-dimensional and ordered such that observations of the same
  variable are stored on rows and different variables are stored on
  columns.")
- Orange: data matrices, with names and types associated with each column.
  Basically there seems to be only one base dataset class that contains
  the data. Data points are lists (of values corresponding to each
  column).
- (still going through the other ones)

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite
- tasks/views of the data for different problems
- indexing & slicing
- pairs or triples (etc.) of examples
- a 'distance/gram matrix' container (imagine that the data is given to
  you as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)

Another question to consider: how tightly should this integrate with
Theano? Do we want to be able to store data as shared variables, or just
have an option for that? Theano + GPU constrain what we can do (in terms
of sizes, buffering, etc.): these are things we need to think about, but
it's not clear whether we should aim for building them into the interface.

Task views of the data for different problems: how can we achieve this?
Should we simply have a set of standard dataset descriptors
('classification', 'regression', 'multi-label', 'density_estimation') and
a set_view method that changes the current dataset view type?

There is then the question of how to approach the design of a Dataset
class from an OOP perspective. So far, my (Dumi's) idea is to have an
almost 'abstract class' Dataset that doesn't implement any methods except
a few setters/getters. The reason to list the methods this way is to have
a common 'specification': classes that inherit from Dataset need not
implement every single method (only the ones that are relevant) and can
obviously implement other methods as appropriate.
The reason to have a common specification (as abstract as it might be) is
to, well, have a common specification that would make our code clearer and
cleaner.

An example of what I (Dumi) am thinking of in terms of a concrete API:

    class Dataset:
        def __init__(self):
            self.type = None
            self.in_memory = None
            self.inputs = None  # list of filepaths, or objects in memory, or ...
            self.outputs = None

        def get_example(self, example_index):
            raise NotImplementedError()

        def get_next_example(self):
            raise NotImplementedError()

        def get_batch(self, batch_index):
            raise NotImplementedError()

        def get_next_batch(self):
            raise NotImplementedError()

        def get_slice(self, slice_object):
            raise NotImplementedError()

        def set_view(self, view_type):
            self.view_type = view_type
            self.n_classes = None

        def set_n_classes(self, n_classes):
            self.n_classes = n_classes

        def set_batch_size(self, batch_size):
            self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I
think we should just have a train dataset, a valid one and a test one, or
(if the data comes as one big file or an infinite stream) just handle the
split ourselves (via slicing, for instance). I (Dumi) am of the opinion
that this keeps things cleaner, but the specification does not preclude
more fine-grained 'splitting' of the data.

A concrete implementation would look like this (we would have one class
per dataset that we use, and the class declaration contains essentially
everything there is to know about the dataset):

    # (assumes numpy, theano and theano.tensor as T have been imported)
    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            self.type = 'standard_xy'
            self.in_memory = True
            self.inputs = inputs    # load them or create
            self.outputs = outputs
            self.set_view('classification')
            self.set_n_classes(10)
            self.set_batch_size(20)
            self.n_batches = self._compute_n_batches()

        def get_batch(self, batch_index):
            x, y = self._fetch_batch(batch_index)
            if self.view_type == 'classification':
                return x, numpy.int32(y)
            elif self.view_type == 'density_estimation':
                return x
            else:
                raise NotImplementedError()

        def shared_data(self):
            shared_x = theano.shared(numpy.asarray(self.inputs,
                                     dtype=theano.config.floatX))
            shared_y = theano.shared(numpy.asarray(self.outputs,
                                     dtype=theano.config.floatX))
            return shared_x, T.cast(shared_y, 'int32')

        def _compute_n_batches(self):
            pass

        def _fetch_batch(self, batch_index):
            pass

But nothing stops you from defining get_train_batch, get_valid_batch and
stuff like that!

So we'd use it as:

    train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
    valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

    x, y = train_mnist.get_batch(0)
    train_mnist.set_view('density_estimation')
    x = train_mnist.get_batch(0)

or

    mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
    batches_train = range(int(mnist_data.n_batches * 0.8))
    batches_valid = range(int(mnist_data.n_batches * 0.8), mnist_data.n_batches)

    xt, yt = mnist_data.get_batch(batches_train[0])
    xv, yv = mnist_data.get_batch(batches_valid[0])

COMMENTS
~~~~~~~~

JB asks: What may be passed as argument to the functions in Dataset, and
what can be expected in return? Are there side effects (e.g. on the state
of the Dataset) associated with any of the functions?

JB asks: What properties are part of the Dataset API? What possible types
can they have, and are they expected to be read-only or writeable? What do
they mean?

JB asks: What is a view? Does set_view change the Dataset or return a new
Dataset with a certain view of the original (in which case call it
get_view)?
Does the view imply the types of the return values of functions like
get_batch? What is the difference between the view and the subclasses of
Dataset in PyML?

JB asks: Do container formats (I'm thinking of HDF5) offer features for
fast retrieval that we would like to expose via this interface? (See the
sketch after these comments.)

JB asks: How would you recommend using this sort of dataset in a boosting
algorithm, where points need to be re-weighted?

JB asks: Do we want to provide for the possibility of feedback that
modifies the dataset? For example, curriculum learning might be adaptive
in this sense, or if we wanted to provide a virtual world for an agent as
a dataset, then we would need to provide 'actions' to get the next batch.
Could this be done in the current API?
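
To make the HDF5 question a bit more concrete, here is a minimal, purely
illustrative sketch (not part of the proposal, and not a design decision)
of how a class mirroring the proposed get_batch/batch_size interface could
sit on top of an HDF5 file so that only the requested batch is read from
disk. The class name HDF5Dataset, the file layout (two HDF5 datasets named
'x' and 'y'), and the use of the h5py package are all assumptions made for
the example.

    # Illustrative sketch only -- assumes the h5py package and a file
    # containing two HDF5 datasets named 'x' (inputs) and 'y' (targets).
    import h5py
    import numpy

    class HDF5Dataset(object):
        """Mimics the get_batch/set_batch_size portion of the proposed
        Dataset API, but keeps the data on disk and reads one slice per
        batch request."""

        def __init__(self, path, batch_size=20):
            self.f = h5py.File(path, 'r')
            self.inputs = self.f['x']    # h5py datasets slice like numpy
            self.outputs = self.f['y']   # arrays, but stay on disk until sliced
            self.set_batch_size(batch_size)

        def set_batch_size(self, batch_size):
            self.batch_size = batch_size
            self.n_batches = len(self.inputs) // self.batch_size

        def get_batch(self, batch_index):
            start = batch_index * self.batch_size
            stop = start + self.batch_size
            # Only this slice is fetched; HDF5 does the chunked, buffered
            # retrieval underneath.
            x = self.inputs[start:stop]
            y = numpy.asarray(self.outputs[start:stop], dtype='int32')
            return x, y

Whether this kind of backend detail should be visible through the common
interface, or hidden inside individual subclasses like the MNIST example
above, is exactly the sort of thing the specification needs to pin down.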