Mercurial > pylearn
diff doc/v2_planning/dataset.txt @ 1190:9ff2242a817b
fix rst syntax errors/warnings
author    Frederic Bastien <nouiz@nouiz.org>
date      Fri, 17 Sep 2010 21:14:41 -0400
parents   d9550c27a192
children  7dfc3d3052ea
--- a/doc/v2_planning/dataset.txt	Fri Sep 17 20:55:18 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 17 21:14:41 2010 -0400
@@ -4,8 +4,8 @@
 Some talking points from the September 2 meeting:
 
 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
-  needs to be flexible enough to accommodate different (sub)tasks and views of
-  the same underlying data.
+  needs to be flexible enough to accommodate different (sub)tasks and views of
+  the same underlying data.
 * Datasets as probability distributions from which one can sample.
   * That's not something I would consider to be a dataset-related problem to
     tackle now: a probability distribution in Pylearn would probably be a
@@ -13,7 +13,7 @@
     DatasetToDistribution class for instance, that would take care of viewing a
     dataset as a probability distribution. -- OD
 * Our specification should allow transparent handling of infinite datasets (or
-  simply datasets which cannot fit in memory)
+  simply datasets which cannot fit in memory)
 * GPU/buffering issues.
 
 Commiteee: DE, OB, OD, AB, PV
@@ -117,7 +117,9 @@
 dataset that we use, and the class declaration contains essentially everything
 there is to know about the dataset):
 
-class MNIST(Dataset):
+.. code-block:: python
+
+  class MNIST(Dataset):
     def __init__(self,inputs=['train_x.npy'],outputs=['train_y.npy']):
         self.type='standard_xy'
         self.in_memory = True
@@ -259,8 +261,12 @@
 the writer of a new dataset subclass). Anyway, maybe a first thing we could
 think about is what we want a mini-batch to be. I think we can agree that we
 would like to be able to do something like:
+
+.. code-block:: python
+
     for mb in dataset.mini_batches(size=10):
         learner.update(mb.input, mb.target)
+
 so that it should be ok for a mini-batch to be an object whose fields (that
 should have the same name as those of the dataset) are numpy arrays.
 More generally, we would like to be able to iterate on samples in a
@@ -285,6 +291,7 @@
 OD: (this is hopefully a clearer re-write of the original version from
 r7e6e77d50eeb, which I was not happy with).
 There are typically three kinds of objects that spit out data:
+
 1. Datasets that are loaded from disk or are able to generate data all by
    themselves (i.e. without any other dataset as input)
 2. Datasets that transform their input dataset in a way that only depends on
@@ -293,6 +300,7 @@
    potentially different dataset (e.g. PCA when you want to learn the
    projection space on the training set in order to transform both the
    training and test sets).
+
 My impression currently is that we would use dataset subclasses to handle 1
 and 2. However, 3 requires a learner framework, so you would need to have
 something like a LearnerOutputDataset(trained_learner, dataset).
@@ -304,6 +312,7 @@
 The main advantages I find in this approach (that I have been using at
 Ubisoft) are:
+
 - You only need to learn how to subclass the learner class. The only dataset
   class is LearnerOutputDataset, which you could just name Dataset.
 - You do not have different ways to achieve the same result (having to figure
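
To make the mini-batch interface discussed in the diff concrete, here is a minimal sketch. Only `dataset.mini_batches(size=10)`, the `mb.input` / `mb.target` fields, and the fact that those fields are numpy arrays come from the text above; the `MiniBatch` and `InMemoryDataset` class names are hypothetical illustrations, not part of any agreed design.

```python
import numpy as np

class MiniBatch:
    # Hypothetical container whose fields mirror those of the dataset,
    # as the discussion proposes (same names, numpy array values).
    def __init__(self, input, target):
        self.input = input
        self.target = target

class InMemoryDataset:
    # Hypothetical in-memory dataset exposing the proposed mini_batches API.
    def __init__(self, input, target):
        self.input = np.asarray(input)
        self.target = np.asarray(target)

    def mini_batches(self, size):
        # Yield successive fixed-size mini-batches; each field is a
        # numpy array slice of the underlying data.
        for start in range(0, len(self.input), size):
            yield MiniBatch(self.input[start:start + size],
                            self.target[start:start + size])

# Usage mirroring the snippet in the diff:
dataset = InMemoryDataset(np.zeros((50, 2)), np.arange(50))
for mb in dataset.mini_batches(size=10):
    print(mb.input.shape, mb.target.shape)  # learner.update(mb.input, mb.target)
```

One design point worth noting: because `mini_batches` is a generator, the same loop shape could in principle also serve the infinite or out-of-memory datasets mentioned earlier, since batches are produced lazily rather than materialized up front.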