pylearn: comparison of doc/v2_planning/dataset.txt @ 1086:65ac0f493830

dataset: Some clarifications on my comments

| author   | Olivier Delalleau <delallea@iro> |
| --- | --- |
| date     | Fri, 10 Sep 2010 22:22:02 -0400 |
| parents  | de456561ec40 |
| children | f15216356522 |

| 1085:de456561ec40 | 1086:65ac0f493830 |
| --- | --- |
| 249 (that should have the same name as those of the dataset) are numpy arrays. | 249 (that should have the same name as those of the dataset) are numpy arrays. |
| 250 More generally, we would like to be able to iterate on samples in a | 250 More generally, we would like to be able to iterate on samples in a |
| 251 mini-batch, or do random access on them, so a mini-batch should implement | 251 mini-batch, or do random access on them, so a mini-batch should implement |
| 252 __iter__ and __getitem__. | 252 __iter__ and __getitem__. |
| 253 Besides this, is there any other typical use-case of a mini-batch? In | 253 Besides this, is there any other typical use-case of a mini-batch? In |
| 254 particular, is there any reason to want an infinite mini-batch? (in which case | 254 particular, is there any reason to want an infinite mini-batch, or a very big |
| 255 we may need to revise our idea of what 'mini' means) Hopefully the answer to | 255 mini-batch that may not fit in memory? (in which case we may need to revise |
| 256 that last question is no, as I think it would definitely keep things simpler, | 256 our idea of what 'mini' means) Hopefully the answer to that last question is |
| 257 since we could simply use numpy arrays (for numeric data) or lists (for | 257 no, as I think it would definitely keep things simpler, since we could simply |
| 258 anything else) to store mini-batches' data. So I vote for 'no'. | 258 use numpy arrays (for numeric data) or lists (for anything else) to store |
|  | 259 mini-batches' data. So I vote for 'no'. |
| 259 | 260 |
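
The passage above (new lines 250-258) suggests that a mini-batch only needs to support iteration and random access, and can store its data in numpy arrays (for numeric fields) or plain lists (for anything else). As a rough illustration of that idea, here is a minimal sketch of such an object; the class name MiniBatch and the field names are hypothetical, not an existing pylearn API:

```python
import numpy as np


class MiniBatch(object):
    """Minimal sketch of a mini-batch: a fixed-size, in-memory collection of
    samples that supports iteration and random access, backed by numpy arrays
    (for numeric fields) or plain lists (for anything else)."""

    def __init__(self, fields):
        # `fields` maps a field name (same names as in the dataset) to a numpy
        # array or list whose first dimension indexes samples.
        self.fields = fields
        lengths = set(len(v) for v in fields.values())
        assert len(lengths) == 1, "all fields must have the same number of samples"
        self.n_samples = lengths.pop()

    def __len__(self):
        return self.n_samples

    def __getitem__(self, i):
        # Random access: return the i-th sample as a dict of field values.
        return dict((name, values[i]) for name, values in self.fields.items())

    def __iter__(self):
        # Iterate over the samples in order.
        for i in range(self.n_samples):
            yield self[i]


# Hypothetical usage: a 10-sample mini-batch with two fields.
batch = MiniBatch({'input': np.random.rand(10, 5), 'target': np.arange(10)})
assert len(batch) == 10
first = batch[0]                 # random access to one sample
count = sum(1 for sample in batch)   # iteration over samples
```
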
| 1085:de456561ec40 | 1086:65ac0f493830 |
| --- | --- |
| 260 A dataset is a learner | 261 A dataset is a learner |
| 261 ~~~~~~~~~~~~~~~~~~~~~~ | 262 ~~~~~~~~~~~~~~~~~~~~~~ |
| 262 | 263 |
| 263 OD: (this is hopefully a clearer re-write of the original version from | 264 OD: (this is hopefully a clearer re-write of the original version from |
| 264 r7e6e77d50eeb, which I was not happy with). | 265 r7e6e77d50eeb, which I was not happy with). |
| 265 There are typically three kinds of objects that spit out data: | 266 There are typically three kinds of objects that spit out data: |
| 266 1. Datasets that are loaded from disk or are able to generate data all by | 267 1. Datasets that are loaded from disk or are able to generate data all by |
| 267 themselves (i.e. without any other dataset as input) | 268 themselves (i.e. without any other dataset as input) |
| 268 2. Datasets that transform their input dataset in some way (e.g. filtering | 269 2. Datasets that transform their input dataset in a way that only depends on |
| 269 samples or features, normalizing data, etc.) | 270 the input dataset (e.g. filtering samples or features, normalizing data, etc.) |
| 270 3. Datasets that are the output of a transformation whose parameters are | 271 3. Datasets that transform their input dataset in a way that is learned on a |
| 271 learned on a potentially different dataset (e.g. PCA when you want to learn the | 272 potentially different dataset (e.g. PCA when you want to learn the projection |
| 272 projection space on the training set in order to transform both the training | 273 space on the training set in order to transform both the training and test |
| 273 and test sets). | 274 sets). |
| 274 My impression currently is that we would use dataset subclasses to handle 1 | 275 My impression currently is that we would use dataset subclasses to handle 1 |
| 275 and 2. However, 3 requires a learner framework, so you would need to have | 276 and 2. However, 3 requires a learner framework, so you would need to have |
| 276 something like a LearnerOutputDataset(trained_learner, dataset). | 277 something like a LearnerOutputDataset(trained_learner, dataset). |
| 277 | 278 |
| 278 Note however that 2 is a special case of 3 (where training does nothing), and | 279 Note however that 2 is a special case of 3 (where training does nothing), and |
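
To make the third case concrete: the text proposes wrapping a trained learner and a dataset in something like LearnerOutputDataset(trained_learner, dataset). Below is a rough sketch under the assumption that a trained learner exposes a compute_output(sample) method mapping one input sample to one output sample; that method name, and the per-sample interface, are assumptions rather than a settled pylearn API:

```python
class LearnerOutputDataset(object):
    """Sketch of case 3: a dataset produced by a transformation whose
    parameters were learned, possibly on a different dataset.

    Assumes `trained_learner.compute_output(sample)` maps one input sample to
    one output sample; this method name is an assumption, not an agreed-upon
    pylearn interface."""

    def __init__(self, trained_learner, dataset):
        self.trained_learner = trained_learner
        self.dataset = dataset

    def __iter__(self):
        # Transform samples lazily, one at a time.
        for sample in self.dataset:
            yield self.trained_learner.compute_output(sample)

    def __getitem__(self, i):
        # Random access mirrors the underlying dataset's random access.
        return self.trained_learner.compute_output(self.dataset[i])


# Hypothetical usage, following the PCA example from the text (names are
# illustrative only):
#   pca = SomePcaLearner(num_components=10)
#   pca.train(train_set)
#   train_proj = LearnerOutputDataset(pca, train_set)
#   test_proj = LearnerOutputDataset(pca, test_set)
#
# Case 2 (a fixed, non-learned transformation such as normalization or
# filtering) then falls out as the special case where training does nothing.
```
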