comparison doc/v2_planning/dataset.txt @ 1190:9ff2242a817b

fix rst syntax errors/warnings
author Frederic Bastien <nouiz@nouiz.org>
date Fri, 17 Sep 2010 21:14:41 -0400
parents d9550c27a192
children 7dfc3d3052ea
======================================================

Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.
* That's not something I would consider to be a dataset-related problem to
  tackle now: a probability distribution in Pylearn would probably be a
  different kind of beast, and it should be easy enough to have a
  DatasetToDistribution class for instance, that would take care of viewing a
  dataset as a probability distribution. -- OD
* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory)
* GPU/buffering issues.

Committee: DE, OB, OD, AB, PV
Leader: DE

[...]

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

.. code-block:: python

    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            self.type = 'standard_xy'
            self.in_memory = True
            self.inputs = inputs    # load them or create
            self.outputs = outputs
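
For reference, one possible (purely hypothetical) sketch of what the ``Dataset``
base class assumed above could provide; the real interface is precisely what
this document is trying to pin down, so this is only to make the example
self-contained:

.. code-block:: python

    import numpy as np

    class Dataset(object):
        """Hypothetical base class: one subclass per dataset we use."""

        def load(self):
            # Load each declared field from disk when it is given as a
            # filename; keep it unchanged when it is already an array.
            self.inputs = [np.load(f) if isinstance(f, str) else f
                           for f in self.inputs]
            self.outputs = [np.load(f) if isinstance(f, str) else f
                            for f in self.outputs]
            return self

With such a base class, using the dataset could be as simple as
``data = MNIST().load()``, but again, the names and behaviour here are
assumptions, not part of the proposal.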
[...]

what the dataset interface will be, so that it is hard to judge whether this
is a good idea (my main concern is how much additional work would be required
by the writer of a new dataset subclass). Anyway, maybe a first thing we could
think about is what we want a mini-batch to be. I think we can agree that we
would like to be able to do something like:

.. code-block:: python

    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)

so that it should be ok for a mini-batch to be an object whose fields
(that should have the same name as those of the dataset) are numpy arrays.
More generally, we would like to be able to iterate on samples in a
mini-batch, or do random access on them, so a mini-batch should implement
__iter__ and __getitem__.
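
To make this concrete, here is a minimal sketch of such a mini-batch object,
assuming numpy arrays as the underlying storage; the names ``MiniBatch`` and
``mini_batches`` are only illustrative, not a settled API:

.. code-block:: python

    class MiniBatch(object):
        """Illustrative mini-batch: one numpy array per named field."""

        def __init__(self, **fields):
            # e.g. MiniBatch(input=x[0:10], target=y[0:10])
            self.__dict__.update(fields)
            self._fields = fields

        def __len__(self):
            # All fields are assumed to hold the same number of samples.
            return len(next(iter(self._fields.values())))

        def __getitem__(self, i):
            # Random access: the i-th sample of every field.
            return tuple(v[i] for v in self._fields.values())

        def __iter__(self):
            # Iterate over the individual samples of the mini-batch.
            for i in range(len(self)):
                yield self[i]

    def mini_batches(inputs, targets, size=10):
        """Yield successive MiniBatch objects of `size` samples."""
        for start in range(0, len(inputs), size):
            yield MiniBatch(input=inputs[start:start + size],
                            target=targets[start:start + size])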
[...]
~~~~~~~~~~~~~~~~~~~~~~

OD: (this is hopefully a clearer re-write of the original version from
r7e6e77d50eeb, which I was not happy with).
There are typically three kinds of objects that spit out data:

1. Datasets that are loaded from disk or are able to generate data all by
   themselves (i.e. without any other dataset as input)
2. Datasets that transform their input dataset in a way that only depends on
   the input dataset (e.g. filtering samples or features, normalizing data,
   etc.; see the sketch after this list)
3. Datasets that transform their input dataset in a way that is learned on a
   potentially different dataset (e.g. PCA when you want to learn the projection
   space on the training set in order to transform both the training and test
   sets).

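As an illustration of case 2 implemented as a dataset subclass, here is a
minimal sketch (the class name and the assumed ``Dataset`` base class are
hypothetical, purely to make the idea concrete):

.. code-block:: python

    import numpy as np

    class NormalizedDataset(Dataset):
        """Sketch of case 2: a dataset defined only as a transformation
        of its input dataset (feature-wise standardization)."""

        def __init__(self, dataset):
            data = np.asarray(list(dataset))   # materialize the input samples
            self.mean = data.mean(axis=0)
            self.std = data.std(axis=0)
            self.data = (data - self.mean) / self.std

        def __iter__(self):
            return iter(self.data)
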
My impression currently is that we would use dataset subclasses to handle 1
and 2. However, 3 requires a learner framework, so you would need to have
something like a LearnerOutputDataset(trained_learner, dataset).

Note however that 2 is a special case of 3 (where training does nothing), and
[...]
dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
by LearnerOutputDataset.

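A minimal sketch of what such a wrapper could look like follows; the method
names ``train`` and ``apply`` are assumptions made for illustration, not a
decided learner API:

.. code-block:: python

    class LearnerOutputDataset(object):
        """Sketch: view the output of a trained learner applied to a
        dataset as a dataset itself."""

        def __init__(self, trained_learner, dataset):
            self.learner = trained_learner
            self.dataset = dataset

        def __iter__(self):
            # Every sample of the wrapped dataset is transformed on the fly.
            for sample in self.dataset:
                yield self.learner.apply(sample)

    class IdentityLearner(object):
        """Case 2 as a degenerate case of 3: training does nothing and
        apply() is a fixed (here, identity) transformation."""

        def train(self, dataset):
            pass

        def apply(self, sample):
            return sample
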
The main advantages I find in this approach (that I have been using at
Ubisoft) are:

- You only need to learn how to subclass the learner class. The only dataset
  class is LearnerOutputDataset, which you could just name Dataset.
- You do not have different ways to achieve the same result (having to figure
  out which one is most appropriate).
- Upgrading code from 2 to 3 is more straightforward. Such a situation can