pylearn: doc/v2_planning/dataset.txt @ 1190:9ff2242a817b
fix rst syntax errors/warnings

author:   Frederic Bastien <nouiz@nouiz.org>
date:     Fri, 17 Sep 2010 21:14:41 -0400
parents:  d9550c27a192
children: 7dfc3d3052ea
comparison: 1189:0e12ea6ba661 → 1190:9ff2242a817b

======================================================

Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.
* That's not something I would consider to be a dataset-related problem to
  tackle now: a probability distribution in Pylearn would probably be a
  different kind of beast, and it should be easy enough to have a
  DatasetToDistribution class, for instance, that would take care of viewing a
  dataset as a probability distribution (see the sketch after this list). -- OD
* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory).
* GPU/buffering issues.
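
The DatasetToDistribution idea above could start out as a thin wrapper. Here
is a minimal sketch, assuming the wrapped dataset supports ``len()`` and
integer indexing (only the class name comes from the bullet above; the
uniform-sampling strategy and the ``sample`` method are illustrative
assumptions, not a settled API):

.. code-block:: python

    import numpy as np

    class DatasetToDistribution(object):
        """Hypothetical view of a dataset as a probability distribution."""

        def __init__(self, dataset, rng=None):
            self.dataset = dataset
            # Assumption: uniform sampling over samples, with a seedable RNG.
            self.rng = rng if rng is not None else np.random.RandomState(0)

        def sample(self, n=1):
            # Draw n sample indices uniformly at random, with replacement.
            indices = self.rng.randint(len(self.dataset), size=n)
            return [self.dataset[i] for i in indices]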

Committee: DE, OB, OD, AB, PV
Leader: DE

[...]

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

.. code-block:: python

    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            self.type = 'standard_xy'
            self.in_memory = True
            self.inputs = inputs    # load them or create
            self.outputs = outputs
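
Usage would then presumably reduce to instantiating the class (this assumes
the ``Dataset`` base class takes care of actually loading the ``.npy`` files;
the defaults come from the snippet above):

.. code-block:: python

    # Hypothetical usage: the defaults point at the MNIST training arrays.
    mnist = MNIST()
    assert mnist.type == 'standard_xy'
    assert mnist.in_memory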

[...]

what the dataset interface will be, so that it is hard to judge whether this
is a good idea (my main concern is how much additional work would be required
by the writer of a new dataset subclass). Anyway, maybe a first thing we could
think about is what we want a mini-batch to be. I think we can agree that we
would like to be able to do something like:

.. code-block:: python

    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)

so it should be OK for a mini-batch to be an object whose fields (which
should have the same names as those of the dataset) are numpy arrays. More
generally, we would like to be able to iterate over the samples in a
mini-batch, or access them randomly, so a mini-batch should implement
``__iter__`` and ``__getitem__``.
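
For concreteness, a minimal mini-batch object satisfying these requirements
might look as follows (the ``MiniBatch`` name and the keyword-argument
constructor are assumptions, not a settled interface):

.. code-block:: python

    import numpy as np

    class MiniBatch(object):
        """Hypothetical container whose fields mirror the dataset's."""

        def __init__(self, **fields):
            # Each field (e.g. input, target) is a numpy array whose first
            # dimension indexes samples, so mb.input and mb.target work.
            self.__dict__.update(fields)
            self._fields = fields

        def __len__(self):
            return len(next(iter(self._fields.values())))

        def __getitem__(self, i):
            # Random access: the i-th sample, one row per field.
            return dict((name, arr[i]) for name, arr in self._fields.items())

        def __iter__(self):
            # Iterate over individual samples.
            for i in range(len(self)):
                yield self[i]

    mb = MiniBatch(input=np.zeros((10, 784)), target=np.zeros(10))
    for sample in mb:
        pass  # each sample is a dict with 'input' and 'target' entries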

[...]

~~~~~~~~~~~~~~~~~~~~~~

OD: (this is hopefully a clearer re-write of the original version from
r7e6e77d50eeb, which I was not happy with).
There are typically three kinds of objects that spit out data:

1. Datasets that are loaded from disk or are able to generate data all by
   themselves (i.e. without any other dataset as input).
2. Datasets that transform their input dataset in a way that only depends on
   the input dataset (e.g. filtering samples or features, normalizing data,
   etc.).
3. Datasets that transform their input dataset in a way that is learned on a
   potentially different dataset (e.g. PCA, when you want to learn the
   projection space on the training set in order to transform both the
   training and test sets).

My impression currently is that we would use dataset subclasses to handle 1
and 2. However, 3 requires a learner framework, so you would need to have
something like a LearnerOutputDataset(trained_learner, dataset).

Note however that 2 is a special case of 3 (where training does nothing), and
1 is a special case of 2 (where there is no input dataset). Thus you could
decide to also implement 1 and 2 as learners wrapped by LearnerOutputDataset.
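
A rough sketch of such a wrapper, to make the idea concrete (the constructor
signature comes from the paragraph above; the ``transform`` method on the
learner and the indexing protocol are assumptions):

.. code-block:: python

    class LearnerOutputDataset(object):
        """Hypothetical dataset view of a trained learner's output."""

        def __init__(self, trained_learner, dataset):
            self.learner = trained_learner
            self.dataset = dataset

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, i):
            # Transform samples lazily, so datasets that cannot fit in
            # memory (or are infinite) remain usable.
            return self.learner.transform(self.dataset[i])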
304 | 312 |
The main advantages I find in this approach (which I have been using at
Ubisoft) are:

- You only need to learn how to subclass the learner class. The only dataset
  class is LearnerOutputDataset, which you could just name Dataset.
- You do not have different ways to achieve the same result (having to figure
  out which one is most appropriate).
- Upgrading code from 2 to 3 is more straightforward. Such a situation can