view doc/v2_planning/dataset.txt @ 1429:b0141efbf6a2
fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
author | Frederic Bastien <nouiz@nouiz.org> |
---|---|
date | Tue, 08 Feb 2011 16:17:56 -0500 |
parents | 04b988fb00b6 |
children |
Discussion of Function Specification for Dataset Types
======================================================

Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.

    * That's not something I would consider to be a dataset-related problem to
      tackle now: a probability distribution in Pylearn would probably be a
      different kind of beast, and it should be easy enough to have a
      DatasetToDistribution class for instance, that would take care of viewing
      a dataset as a probability distribution. -- OD

* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory)
* GPU/buffering issues.

Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides
- mlpy: very primitive notions of data (simple 2D matrices)
- PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
  SequentialDataSet, ReinforcementDataSet, ... Each class is quite constrained
  and may have a different interface.
- MDP: Seems to have restrictions on the type of data being passed around, as
  well as its dimensionality ("Input array data is typically assumed to be
  two-dimensional and ordered such that observations of the same variable are
  stored on rows and different variables are stored on columns.")
- Orange: Data matrices, with names and types associated to each column.
  Basically there seems to be only one base dataset class that contains the
  data. Data points are lists (of values corresponding to each column).
- APGL: Hard to say how they deal with data from the documentation alone.
- Monte: Data is simply numpy arrays.
- scikits.learn: Dataset is a simple container with e.g. dataset.data being a
  2D numpy array of input features, and dataset.target the target vector.
- Shogun: Vade Retro C++! (may be worth looking into their feature concept
  though).
- Any more worth looking at?

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite
- task/views of the data for different problems
- indexing & slicing
- pairs, triples, etc. of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)

Another question to consider is the following: how tightly should it integrate
with Theano? Do we want to be able to store data as shared variables or just
have an option for that? Theano + GPU constrains things that we can do (in
terms of sizes, buffering, etc): these are things we need to think about, but
it's not clear whether we should aim for building them into the interface.

Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?

There is then the question of how to approach the design of a Dataset class
from an OOP perspective. So far, my (Dumi's) idea is to have an almost
'abstract class' Dataset that doesn't implement any methods except a few
setters/getters. The reason to have the methods listed that way is to have a
common 'specification', but classes that inherit from Dataset need not
implement every single method (only the ones that are relevant) and can
obviously implement other methods as appropriate. The reason to have a common
specification (as abstract as it might be) is to, well, have a common
specification that would make our code clearer and cleaner.

An example of what I (Dumi) am thinking in terms of concrete API:

    class Dataset:
        def __init__(self):
            self.type = None
            self.in_memory = None
            self.inputs = None  # list of filepaths, or objects in memory, or...
            self.outputs = None

        def get_example(self, example_index):
            raise NotImplementedError()

        def get_next_example(self):
            raise NotImplementedError()

        def get_batch(self, batch_index):
            raise NotImplementedError()

        def get_next_batch(self):
            raise NotImplementedError()

        def get_slice(self, slice_object):
            raise NotImplementedError()

        def set_view(self, view_type):
            self.view_type = view_type
            self.n_classes = None

        def set_n_classes(self, n_classes):
            self.n_classes = n_classes

        def set_batch_size(self, batch_size):
            self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I
think we should just have a train dataset, a valid one and a test one instead
or (if it's in one big file or infinite stream) just handle the split ourselves
(via slicing, for instance). I (Dumi) am of the opinion that it keeps things
cleaner, but the specification does not preclude more fine-grained 'splitting'
of the data.

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            self.type = 'standard_xy'
            self.in_memory = True
            self.inputs = inputs  # load them or create
            self.outputs = outputs
            self.set_view('classification')
            self.set_n_classes(10)
            self.set_batch_size(20)
            self.n_batches = self._compute_n_batches()

        def get_batch(self, batch_index):
            x, y = self._fetch_batch(batch_index)
            if self.view_type == 'classification':
                return x, numpy.int32(y)
            elif self.view_type == 'density_estimation':
                return x
            else:
                raise NotImplementedError()

        def shared_data(self):
            shared_x = theano.shared(numpy.asarray(self.inputs,
                                                   dtype=theano.config.floatX))
            shared_y = theano.shared(numpy.asarray(self.outputs,
                                                   dtype=theano.config.floatX))
            return shared_x, T.cast(shared_y, 'int32')

        def _compute_n_batches(self):
            pass

        def _fetch_batch(self, batch_index):
            pass

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that!

So we'd use it as:

    train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
    valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

    x, y = train_mnist.get_batch(0)
    train_mnist.set_view('density_estimation')
    x = train_mnist.get_batch(0)

or

    mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
    batches_train = range(int(mnist_data.n_batches * 0.8))
    batches_valid = range(int(mnist_data.n_batches * 0.8), mnist_data.n_batches)

    xt, yt = mnist_data.get_batch(batches_train[0])
    xv, yv = mnist_data.get_batch(batches_valid[0])

COMMENTS
~~~~~~~~

JB asks: How about asking datasets to also provide a visualization mechanism
for showing / playing individual examples from the dataset, but also other
external objects that are similar to dataset examples (e.g. filters from a
weight matrix that filters images). This doesn't have to be complicated, and it
can be shared between datasets that exist in one modality (e.g.
image datasets can all use an image-rendering method)

OD replies: Besides being able to display data without prior knowledge of
the kind of data inside a dataset, is there any reason to put this within the
dataset class? If not, it seems to me it may be more appropriate to have a way
for the dataset to describe the kind of data it holds, and keep the
visualization code separate from the dataset itself. It would make it easier
in particular to try different visualization systems, and description of the
data may turn out to be useful for other reasons (however, it also means we'd
need to come up with a good way to describe data, which could prove
difficult).

JB asks: What may be passed as argument to the functions in Dataset, and what
can be expected in return? Are there side effects (e.g. on the state of the
Dataset) associated with any of the functions?

JB asks: What properties are part of the Dataset API? What possible types can
they have, are they expected to be read-only or writeable? What do they mean?

JB asks: What is a view? Does set_view change the Dataset or return a new
Dataset with a certain view of the original (in which case call it get_view)?
Does the view imply the types of the return-value of functions like get_batch?
What is the difference between the view and the subclasses of Dataset in PyML?

JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
retrieval that we would like to expose via this interface?

JB asks: How would you recommend using this sort of dataset in a boosting
algorithm where points need to be re-weighted?

JB asks: Do we want to provide for the possibility of feedback that modifies
the dataset? For example, curriculum learning might be adaptive in this sense,
or if we wanted to provide a virtual world for an agent as a dataset then we
need to provide 'actions' to get the next batch. Could this be done in the
current API?
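
To make JB's set_view / get_view question concrete, here is a minimal sketch of
the non-mutating alternative, in which asking for a view returns a new object
that shares the underlying data and leaves the original untouched. The class
name `ArrayDataset` and its constructor arguments are hypothetical and are not
part of the proposal above; only the view names come from the discussion.

```python
class ArrayDataset(object):
    """Hypothetical in-memory dataset used only to illustrate get_view."""

    def __init__(self, inputs, targets, view_type='classification'):
        self.inputs = inputs
        self.targets = targets
        self.view_type = view_type

    def get_view(self, view_type):
        # Return a new dataset sharing the same data; `self` is not modified,
        # so the call has no side effect on the original object.
        return ArrayDataset(self.inputs, self.targets, view_type)

    def get_example(self, index):
        # The view determines the type of the returned value.
        if self.view_type == 'classification':
            return self.inputs[index], self.targets[index]
        elif self.view_type == 'density_estimation':
            return self.inputs[index]
        raise NotImplementedError(self.view_type)


labelled = ArrayDataset([[0.0, 1.0], [1.0, 0.0]], [1, 0])
unlabelled = labelled.get_view('density_estimation')
```

With this convention `labelled` keeps returning (input, target) pairs after
`unlabelled` has been created, which is one possible answer to the side-effect
question; whether the extra object is worth it is left open here.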

Field names and attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~

OD: One important question is how to handle fields' names and characteristics.
For instance, it can be useful to know that the 3rd input field represents a
number of fingers, and is a non-negative discrete field whose numeric value is
meaningful (compared to, say, an integer index that would correspond to an
animal's category). We mentioned metadata during the meeting, but we did not
get into its details: that may be the place to put this kind of information.

Freeing memory
~~~~~~~~~~~~~~

OD: It is sometimes useful to be able to free memory used by previous
computations. A typical example is when you load the original dataset in
memory, then perform various processing steps, ending with a new dataset that
you also store in memory before feeding it to the learner. Unless you very
carefully design your code to avoid it, your original dataset will still remain
in memory (as well as maybe the results of some computations performed along
the way). So there may be a use for a `clear()` method that would be called by
the topmost dataset (the one doing the final memory caching), and would be
forwarded iteratively to previous datasets so as to get back all this wasted
memory space.

What is a mini-batch?
~~~~~~~~~~~~~~~~~~~~~

This is a follow-up to the meeting's discussion about whether a mini-batch
returned by a dataset should be itself a dataset.

OD: During the meeting I was voting in favor of a 'yes', mostly because it
made sense to me (a mini-batch is a subset of a dataset and thus should be a
dataset), but now I tend towards 'no'. The main reason is that it is not clear
yet what the dataset interface will be, so it is hard to judge whether this is
a good idea (my main concern is how much additional work would be required by
the writer of a new dataset subclass). Anyway, maybe a first thing we could
think about is what we want a mini-batch to be.
I think we can agree that we would like to be able to do something like:

.. code-block:: python

    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)

so that it should be ok for a mini-batch to be an object whose fields (that
should have the same name as those of the dataset) are numpy arrays. More
generally, we would like to be able to iterate on samples in a mini-batch, or
do random access on them, so a mini-batch should implement __iter__ and
__getitem__.
Besides this, is there any other typical use-case of a mini-batch? In
particular, is there any reason to want an infinite mini-batch, or a very big
mini-batch that may not fit in memory? (in which case we may need to revise
our idea of what 'mini' means) Hopefully the answer to that last question is
no, as I think it would definitely keep things simpler, since we could simply
use numpy arrays (for numeric data) or lists (for anything else) to store
mini-batches' data. So I vote for 'no'.

YB: I agree that a mini-batch should definitely be safely assumed
to fit in memory. That makes it at least in principle semantically
different from a dataset. But barring that restriction, it might
share some of the properties of a dataset.

A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

OD: (this is hopefully a clearer re-write of the original version from
r7e6e77d50eeb, which I was not happy with).
There are typically three kinds of objects that spit out data:

1. Datasets that are loaded from disk or are able to generate data all by
   themselves (i.e. without any other dataset as input)
2. Datasets that transform their input dataset in a way that only depends on
   the input dataset (e.g. filtering samples or features, normalizing data,
   etc.)
3. Datasets that transform their input dataset in a way that is learned on a
   potentially different dataset (e.g. PCA when you want to learn the
   projection space on the training set in order to transform both the
   training and test sets).
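
The difference between kinds 2 and 3 can be sketched in plain Python with a
normalization step. The class names `DropFirstColumn` and `MeanCentering` and
the `train` / `transform` method names are hypothetical, chosen only for this
illustration; nothing here is an agreed interface.

```python
class DropFirstColumn(object):
    """Kind 2: the output depends only on the input dataset."""

    def train(self, rows):
        pass  # nothing to learn

    def transform(self, rows):
        return [row[1:] for row in rows]


class MeanCentering(object):
    """Kind 3: parameters are learned on one dataset, then applied to others."""

    def __init__(self):
        self.mean = None

    def train(self, rows):
        # Learn the per-column mean of the rows given for training.
        n = float(len(rows))
        self.mean = [sum(column) / n for column in zip(*rows)]

    def transform(self, rows):
        # Apply the shift that was learned in train() to any rows.
        return [[x - m for x, m in zip(row, self.mean)] for row in rows]


train_rows = [[1.0, 10.0], [3.0, 30.0]]
test_rows = [[2.0, 25.0]]

centering = MeanCentering()
centering.train(train_rows)                      # learned on the training set only
train_centered = centering.transform(train_rows)
test_centered = centering.transform(test_rows)   # same shift reused on test data
```

Note that both classes expose the same two methods, and that the kind-2 class
is simply the case where `train` does nothing.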

My impression currently is that we would use dataset subclasses to handle 1
and 2. However, 3 requires a learner framework, so you would need to have
something like a LearnerOutputDataset(trained_learner, dataset).

Note however that 2 is a special case of 3 (where training does nothing), and
1 is a special case of 2 (where we do not care about being given an input
dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
by LearnerOutputDataset.

The main advantages I find in this approach (that I have been using at
Ubisoft) are:

- You only need to learn how to subclass the learner class. The only dataset
  class is LearnerOutputDataset, which you could just name Dataset.
- You do not have different ways to achieve the same result (having to figure
  out which one is most appropriate).
- Upgrading code from 2 to 3 is more straightforward. Such a situation can
  happen e.g. if you write some code that normalizes your input dataset
  (situation 2), then realize later you would like to be able to normalize new
  datasets using the same parameters (e.g. same shift & rescaling), which
  requires situation 3.
- It can make your life easier when thinking about how to plug things together
  (something that has not been discussed yet), because the interfaces of the
  various components are less varied.

I am not saying that we should necessarily do it this way, but I think it is
worth at least keeping in mind this close relationship between simple
processing and learning, and thinking about what the benefits / drawbacks are
in keeping them separate in the class hierarchy.

RP: I actually like this idea of having the dataset implement the same
interface as the learner (or actually a subset of the interface). I hope
people decide to do this.

Support for shared variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RP asks: What is the status of having the dataset support copying data on the
GPU (by storing data in shared variables)? Have you decided to include this
feature or not?
I think that the strongest selling point of Theano is that it runs on GPU
transparently, and I see this as a good selling point for the library as well.
Plus we intend to move more and more towards running things on GPU. If the
dataset object does not support this feature we will need to find hacks
around it.

OD: I have like zero experience with GPU so hopefully someone else can answer
this. But the way I see it, hopefully it could work by having some dataset
object that would take care of storing its input data into a shared variable.

OD (continued): After thinking a bit more about it, I am not sure that would
work. I definitely need to look at some code doing it to get a better
understanding of it, but my feeling is that you need your learner to be
written in a specific way to achieve this, in which case it may be up to the
learner to take its input data and store it into a shared variable.

RP comment: Yes, the dataset object alone cannot handle this, the issue is
somewhere between the dataset and the learner. Or in other words, every time
you change the data you need to recompile your theano function. So the learner
cannot only get data from the dataset, it needs to get a shared variable. The
learner should also be aware when the dataset is changed, to recompile its
internal functions. I'm not sure which is the best way to do this. My personal
feeling is that the dataset should be part of the learner. The learner should
provide a function use_dataset (or replace_dataset). When this function is
called, all the theano functions in the learner get recompiled based on shared
variables that the dataset object provides. It sort of fits very well in the
framework that I have in mind, which was scattered around in learner.txt and
some of my previous emails. I think it shares a lot with James's concepts,
since it follows quite closely the concepts behind Theano.

OD asks: Ok, so why would the dataset have to be responsible for providing a
shared variable?
Why wouldn't the learner just create this shared variable internally and copy
into it the data provided by the dataset?

RP replies: Sure, the learner could take care of all this. Note though that
the learner should take care to divide the dataset into chunks that fit in the
GPU memory (in case of a large dataset) and then take care of updating the
shared variables according to the current chunk. Personally I feel like all
this data division, management and so on should be done by the dataset. It
feels more natural that way. For example assume you have a dataset that is
composed of a time series and some static data (carre-tech heart beat data is
a good example). The static data is small enough so that you could always
store it on the GPU, and you would only need to split the time series. For the
learner to do this (since it gets the same interface from any dataset object)
would require something like an 'if <this case> then' branch, while for the
dataset it is just a different class. But I'm happy to have all this GPU stuff
sent to the learner as well if everybody else believes that is better.

FB comment: I don't understand why you would need to recompile the theano
function. There are two cases. In the first, the data is in a shared variable:
you can directly change the data in the shared variable without recompiling
the theano function. The second case is when the dataset is in an ordinary
theano variable. In that case, the first step in the theano function will be
to transfer the dataset to the gpu before computation. If the data changes at
each call, that will be as efficient as changing the data manually every time
in the shared variable.

AB: I have an idea about this which kind of fits in the "building a
theano op" thing that we talked about at the last meeting.

We can just build a theano Op that wraps dataset objects and takes
care of the details of transferring data to the GPU or otherwise.

I have a prototype interface/implementation in the shared_dataset.py
file in this directory.

OD: I like AB's approach.
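
FB's first case (updating a shared variable in place, without rebuilding the
function that reads it) combined with RP's point about chunking can be
sketched without Theano. In the sketch below `SharedBuffer` is only a
pure-Python stand-in that mimics the `get_value` / `set_value` pair of a Theano
shared variable, and `iter_chunks` and `compile_sum` are hypothetical helpers;
none of this is the prototype in shared_dataset.py.

```python
class SharedBuffer(object):
    """Pure-Python stand-in for a shared variable (assumption, not Theano)."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


def iter_chunks(rows, chunk_size):
    """Yield consecutive chunks of at most `chunk_size` rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def compile_sum(shared):
    # Built once; each call reads whatever the buffer currently holds.
    def total():
        return sum(shared.get_value())
    return total


data = list(range(10))
shared = SharedBuffer([])
total = compile_sum(shared)        # "compiled" a single time

partial_sums = []
for chunk in iter_chunks(data, chunk_size=4):
    shared.set_value(chunk)        # swap in the next chunk, no rebuild needed
    partial_sums.append(total())
```

Whether the loop over chunks lives in the dataset (RP's preference) or in the
learner (OD's question) does not change the mechanism shown here.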

Data API proposal by Olivier D
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A single sample containing multiple fields (e.g. an input and a target part)
is an object s that you can manipulate as follows:

.. code-block:: python

    import operator
    from functools import reduce
    from itertools import izip  # Python 2; the builtin zip is lazy in Python 3.

    # Obtain actual data stored within `s` (e.g. a numpy vector). There is no
    # guarantee that modifying the resulting data object will actually update
    # the data stored in `s`.
    data = s()
    # Create a sample that sees a field of `s`.
    input_part = s.input
    # Obtain actual input data (e.g. as a numpy vector).
    input_data = input_part()
    # Create a sample that sees the i-th element of the data stored in `s`.
    ith = s[i]
    # This should not fail.
    assert ith() == s()[i]
    # You could also select a range.
    i_to_j = s[i:j]
    assert i_to_j() == s()[i:j]
    # And actually do pretty much anything you want with __getitem__, as long
    # as the underlying data stored in the sample supports it (for instance,
    # here it should be at least a 3D tensor).
    fancy_selection = s[i, :, j:k]
    assert fancy_selection() == s()[i, :, j:k]
    # Write some value (e.g. a numpy vector) into the sample. May raise an
    # exception if the sample is in read-only mode.
    s._write(val)
    # Shortcut to write data into a field (same as `s.input._write(val)`).
    s.input = val
    # Basic mathematical operators.
    s *= val
    s += val
    s -= val
    s /= val
    # Replace a field. Note that this is different from `s.input = val`
    # because here `new_input` is a sample, not a numeric value: the current
    # `s.input` will not be written to, instead it makes `s.input` point
    # towards a different sample. This may lead to confusion, so a different
    # syntax may be better (e.g. s._set_field('input', new_input)).
    s.input = new_input
    # The equality of two samples is defined by the equality of their
    # underlying data.
    def __eq__(self, other):
        return self() == other()
    # Iterate on fields (open question: should they be ordered?).
    fields = dict([(name, sample) for name, sample in s._iter_fields()])
    assert fields['input'] == s.input
    # Iterating on a sample yields samples that see consecutive elements.
    for sample, value in izip(s, s()):
        assert sample() == value
    # The length of a sample is the same as that of its underlying data.
    assert len(s) == len(s())
    # The shape of a sample is the same as that of its underlying data.
    # Note that it only makes sense for tensor-like data.
    assert s._shape() == s().shape
    # The size of a sample is the product of its shape elements.
    assert s._size() == reduce(operator.__mul__, s._shape())

All sample methods should start with '_', to differentiate them from the
sample's fields. This is a bit awkward, but I like the `sample.field` syntax
compared to something like "sample.get_field('field')", which makes code less
readable, especially when combining with sub_fields, e.g. `sample.input.x1`
vs. sample.get_field('input').get_field('x1').

The extension from sample to dataset is actually to use the same class, but
with the convention that the first "dimension" in the data seen by the dataset
corresponds to the samples' indices in the dataset.

.. code-block:: python

    import numpy

    # Return data stored in dataset `d` (e.g. a numpy matrix).
    data = d()
    # Return the i-th sample in the dataset.
    s = d[i]
    # Data should match!
    assert data[i] == s()
    # Return a subset of the dataset.
    sub_data = d[i:j]
    # Advanced indexing.
    sub_data = d[some_list_of_indices]
    # Dataset that sees the input part only.
    input_part = d.input
    # Dataset such that its i-th element is data[i][something] (see the sample
    # examples for what `something` may be).
    some_sub_data = d[:, something]
    # The following should not fail.
    assert d[i, something] == d[i][something] # == some_sub_data[i]
    # You can also write into a dataset.
    d._write(val)
    d.input = val
    # Center dataset in-place (requires `d` not to be read-only).
    d -= numpy.mean(d())
    # The length of a dataset is its number of samples.
    n_samples = len(d)
    # The width of a dataset (if it exists) is the length of its samples.
    assert d._shape()[1] == len(d[0]) # == d._width() (shortcut)
    # Iterating on a dataset yields individual samples.
    for i, sample in enumerate(d):
        assert d[i] == sample
    # It is allowed for a dataset to hold heterogeneous data. For instance
    # you could have len(d.data1) != len(d.data2)
    # A sample in the dataset is not required to inherit all the dataset's
    # fields, for instance in the case above you could decide that the dataset
    # sees the same data as its first sub-dataset, i.e. d[i] == d.data1[i]

There remain some fuzzy points. For instance, are fields allowed to overlap?
(e.g. so that one could write both s.pos_3d to get the 3d vector coordinate of
sample s, and s.x to get the x coordinate without being forced to go through
s.pos_3d.x). What are the fields of s[i:j] if the (i, j) range does not
exactly match a subset of fields? How do we handle metadata? (e.g. if we want
to describe the dataset to say it contains 28x28 image data, so that an
algorithm for filter visualization can automatically deal with it)

Now, on to some use cases.

.. code-block:: python

    import numpy
    from itertools import izip  # Python 2; the builtin zip is lazy in Python 3.

    # Mini-batches.
    mb_dataset = d._minibatches(batch_size=5)
    # The mini-batch dataset views samples that are mini-batches.
    assert mb_dataset[0]() == d[0:5]() # As long as len(d) >= 5.

    # Shuffling samples.
    random_indices = numpy.random.permutation(len(d))
    shuffled_dataset = d[random_indices]

    # Typical linear regression with stochastic gradient descent.
    n_inputs = d.input._width()
    n_targets = d.target._width()
    weights = numpy.zeros((n_inputs, n_targets))
    bias = numpy.zeros(n_targets)
    mb_dataset = d._minibatches(batch_size=10)
    # Note: it is important to get the number of inputs / targets
    # before converting to minibatches, because
    #   mb_dataset.input._width() == 10
    # since this is the length of a minibatch matrix. However you
    # could still do the following, which is less readable:
    #   n_inputs = mb_dataset.input._shape()[2]
    # You could also wait until you see the first sample to create
    # the parameters (this would actually be a better way to do it, since
    # it avoids calling the _width method).
    for input, target in izip(mb_dataset.input, mb_dataset.target):
        cost = (numpy.dot(input(), weights) + bias - target())**2
        # Update weights and bias depending on cost....

A few more points:

- Infinite datasets could be used (would just need to define a convention
  on what __len__ should do).
- It is also ok to have datasets that do not support random access (so the
  only way to access samples is through iteration).
- Ideally, data should be deterministic (i.e. __call__() should always
  return the same thing). It would probably be up to the user to be super
  careful if he decides to use a non-deterministic dataset.
- About the "task vs. dataset" distinction. This could be achieved by
  associating to a task the names of the fields it requires (e.g. "input"
  and "target" for the regression task), and if the dataset does not
  already define these fields, using a dataset wrapper that does it (saying
  for instance that "input" is the concatenation of "x1" and "x2", and
  "target" is "y", for a dataset whose fields are x1, x2 and y).

RP comments:

- I like this approach. I think having overlapping fields might be useful.
  I would add that I was thinking of a way to look at one's results. It is
  something I've been faced with: say you run 500 jobs and then you want to
  understand those jobs' results. Looking just at the best performing one
  seems a waste, and there is a lot more information you can extract from
  your results if you are able to generate certain plots or statistics. To
  do this you would need to get the data in ipython (or something quite
  similar) where you have available the needed functions to plot different
  things, generate different tables.
  The point that I was trying to make is that you can get those results in
  something that has this very API that Olivier described. This way both
  your input data and your results will be in the same form and whatever
  visualization functions you have for your results you can use on your
  data as well. For this you would need a bit more flexibility, in the
  sense that if you have some data d, you should be able to put constraints
  on it, like d.some_field == 5 meaning all entries in d that have
  some_field == 5, or d.some_field > 5. You would also not use psql anymore
  but this console, which would collect the results for you from sql, and
  give them to you as a data object.

OD replies: Actually this should be doable with (almost) what I wrote above,
due to the way numpy redefines ==, >, etc. (which btw should break some of my
assertions above, since I had forgotten about this). If you replace e.g. my
implementation of __eq__ above by the following:

.. code-block:: python

    def __eq__(self, other):
        return other == self()

Here, `self` is a dataset that represents some numpy vector data. Then whether
`other` is another dataset or a numpy vector or some scalar, this will return
a numpy boolean vector (the result of the comparison made by numpy).
We may support boolean vectors in advanced indexing, so you could do
d[d.some_field == 5] and obtain the subset of `d` whose samples have
`some_field` set to 5. Same could be done with __lt__, __le__, etc.
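
A minimal sketch of the selection OD describes, written in plain Python so
that it runs without numpy. `Field` and `Table` are hypothetical stand-ins for
the proposed sample / dataset class, and a list of booleans plays the role of
the numpy boolean vector; the real proposal would rely on numpy's own
comparison and indexing behaviour instead.

```python
class Field(object):
    """Hypothetical column of values; comparisons return a boolean mask."""

    def __init__(self, values):
        self.values = list(values)

    def __call__(self):
        return self.values

    def __eq__(self, other):
        return [v == other for v in self.values]

    def __gt__(self, other):
        return [v > other for v in self.values]


class Table(object):
    """Hypothetical dataset made of named fields of equal length."""

    def __init__(self, **fields):
        self._fields = dict((name, Field(vals)) for name, vals in fields.items())

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails: expose fields
        # with the `d.some_field` syntax.
        try:
            return self.__dict__['_fields'][name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, mask):
        # Keep the rows whose mask entry is True, for every field.
        kept = dict((name, [v for v, keep in zip(field(), mask) if keep])
                    for name, field in self._fields.items())
        return Table(**kept)


d = Table(some_field=[5, 7, 5], score=[0.1, 0.2, 0.3])
equal_to_five = d[d.some_field == 5]
above_five = d[d.some_field > 5]
```

The same pattern extends to __lt__, __le__ and the other comparison operators.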