pylearn: doc/v2_planning/dataset.txt @ 1190:9ff2242a817b
fix rst syntax errors/warnings

author:   Frederic Bastien <nouiz@nouiz.org>
date:     Fri, 17 Sep 2010 21:14:41 -0400
parents:  d9550c27a192
children: 7dfc3d3052ea
comparison: 1189:0e12ea6ba661 → 1190:9ff2242a817b

======================================================

Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.
* That's not something I would consider to be a dataset-related problem to
  tackle now: a probability distribution in Pylearn would probably be a
  different kind of beast, and it should be easy enough to have a
  DatasetToDistribution class, for instance, that would take care of viewing a
  dataset as a probability distribution (see the sketch after this list). -- OD
* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory).
* GPU/buffering issues.
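
The DatasetToDistribution idea above could start out as a thin wrapper. Here
is a minimal sketch, assuming the wrapped dataset supports ``len()`` and
integer indexing (only the class name comes from the bullet above; the
uniform-sampling strategy and the ``sample`` method are illustrative
assumptions, not a settled API):

.. code-block:: python

    import numpy as np

    class DatasetToDistribution(object):
        """Hypothetical view of a dataset as a probability distribution."""

        def __init__(self, dataset, rng=None):
            self.dataset = dataset
            # Assumption: uniform sampling over samples, with a seedable RNG.
            self.rng = rng if rng is not None else np.random.RandomState(0)

        def sample(self, n=1):
            # Draw n sample indices uniformly at random, with replacement.
            indices = self.rng.randint(len(self.dataset), size=n)
            return [self.dataset[i] for i in indices]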

Committee: DE, OB, OD, AB, PV
Leader: DE

[...]

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

.. code-block:: python

    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            self.type = 'standard_xy'
            self.in_memory = True
            self.inputs = inputs    # load them or create
            self.outputs = outputs
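
Usage would then presumably reduce to instantiating the class (this assumes
the ``Dataset`` base class takes care of actually loading the ``.npy`` files;
the defaults come from the snippet above):

.. code-block:: python

    # Hypothetical usage: the defaults point at the MNIST training arrays.
    mnist = MNIST()
    assert mnist.type == 'standard_xy'
    assert mnist.in_memory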

[...]

what the dataset interface will be, so that it is hard to judge whether this
is a good idea (my main concern is how much additional work would be required
by the writer of a new dataset subclass). Anyway, maybe a first thing we could
think about is what we want a mini-batch to be. I think we can agree that we
would like to be able to do something like:

.. code-block:: python

    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)

so it should be OK for a mini-batch to be an object whose fields (which
should have the same names as those of the dataset) are numpy arrays. More
generally, we would like to be able to iterate over the samples in a
mini-batch, or access them randomly, so a mini-batch should implement
``__iter__`` and ``__getitem__``.
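
For concreteness, a minimal mini-batch object satisfying these requirements
might look as follows (the ``MiniBatch`` name and the keyword-argument
constructor are assumptions, not a settled interface):

.. code-block:: python

    import numpy as np

    class MiniBatch(object):
        """Hypothetical container whose fields mirror the dataset's."""

        def __init__(self, **fields):
            # Each field (e.g. input, target) is a numpy array whose first
            # dimension indexes samples, so mb.input and mb.target work.
            self.__dict__.update(fields)
            self._fields = fields

        def __len__(self):
            return len(next(iter(self._fields.values())))

        def __getitem__(self, i):
            # Random access: the i-th sample, one row per field.
            return dict((name, arr[i]) for name, arr in self._fields.items())

        def __iter__(self):
            # Iterate over individual samples.
            for i in range(len(self)):
                yield self[i]

    mb = MiniBatch(input=np.zeros((10, 784)), target=np.zeros(10))
    for sample in mb:
        pass  # each sample is a dict with 'input' and 'target' entries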

[...]

~~~~~~~~~~~~~~~~~~~~~~

OD: (this is hopefully a clearer re-write of the original version from
r7e6e77d50eeb, which I was not happy with).
There are typically three kinds of objects that spit out data:

1. Datasets that are loaded from disk or are able to generate data all by
   themselves (i.e. without any other dataset as input).
2. Datasets that transform their input dataset in a way that only depends on
   the input dataset (e.g. filtering samples or features, normalizing data,
   etc.).
3. Datasets that transform their input dataset in a way that is learned on a
   potentially different dataset (e.g. PCA, when you want to learn the
   projection space on the training set in order to transform both the
   training and test sets).

My impression currently is that we would use dataset subclasses to handle 1
and 2. However, 3 requires a learner framework, so you would need to have
something like a LearnerOutputDataset(trained_learner, dataset).

Note however that 2 is a special case of 3 (where training does nothing), and
1 is a special case of 2 (where there is no input dataset). Thus you could
decide to also implement 1 and 2 as learners wrapped by LearnerOutputDataset.
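
A rough sketch of such a wrapper, to make the idea concrete (the constructor
signature comes from the paragraph above; the ``transform`` method on the
learner and the indexing protocol are assumptions):

.. code-block:: python

    class LearnerOutputDataset(object):
        """Hypothetical dataset view of a trained learner's output."""

        def __init__(self, trained_learner, dataset):
            self.learner = trained_learner
            self.dataset = dataset

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, i):
            # Transform samples lazily, so datasets that cannot fit in
            # memory (or are infinite) remain usable.
            return self.learner.transform(self.dataset[i])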
304 | 312 |
The main advantages I find in this approach (which I have been using at
Ubisoft) are:

- You only need to learn how to subclass the learner class. The only dataset
  class is LearnerOutputDataset, which you could just name Dataset.
- You do not have different ways to achieve the same result (having to figure
  out which one is most appropriate).
- Upgrading code from 2 to 3 is more straightforward. Such a situation can