comparison doc/v2_planning/dataset.txt @ 1086:65ac0f493830

dataset: Some clarifications on my comments
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 22:22:02 -0400
parents de456561ec40
children f15216356522
(that should have the same name as those of the dataset) are numpy arrays.
More generally, we would like to be able to iterate on samples in a
mini-batch, or do random access on them, so a mini-batch should implement
__iter__ and __getitem__.
Besides this, is there any other typical use-case of a mini-batch? In
particular, is there any reason to want an infinite mini-batch, or a very big
mini-batch that may not fit in memory? (in which case we may need to revise
our idea of what 'mini' means) Hopefully the answer to that last question is
no, as I think it would definitely keep things simpler, since we could simply
use numpy arrays (for numeric data) or lists (for anything else) to store
mini-batches' data. So I vote for 'no'.
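As an illustration of the 'no' option, here is a minimal sketch (hypothetical
names, not an agreed-upon API) of a finite mini-batch whose fields are numpy
arrays, implementing both __iter__ and __getitem__::

    import numpy as np

    class MiniBatch(object):
        """A named collection of equal-length fields, each a numpy array
        (or a plain list for non-numeric data), supporting per-sample
        iteration and random access."""

        def __init__(self, **fields):
            self.fields = fields
            lengths = set(len(v) for v in fields.values())
            assert len(lengths) == 1, "all fields must hold the same number of samples"
            self.n_samples = lengths.pop()

        def __len__(self):
            return self.n_samples

        def __getitem__(self, i):
            # Random access: the i-th sample, as a dict of field values.
            return dict((name, value[i]) for name, value in self.fields.items())

        def __iter__(self):
            # Iterate over individual samples.
            for i in range(self.n_samples):
                yield self[i]

    # Usage: all fields share the sample dimension; iteration yields one
    # sample at a time.
    batch = MiniBatch(input=np.random.randn(3, 4), target=np.array([0, 1, 0]))
    for sample in batch:
        print(sample['input'].shape, sample['target'])

With this restriction, anything that does not fit in memory stays on the
dataset side, and a mini-batch remains a plain in-memory container.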

A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

OD: (this is hopefully a clearer re-write of the original version from
r7e6e77d50eeb, which I was not happy with).
There are typically three kinds of objects that spit out data:
1. Datasets that are loaded from disk or are able to generate data all by
themselves (i.e. without any other dataset as input)
2. Datasets that transform their input dataset in a way that only depends on
the input dataset (e.g. filtering samples or features, normalizing data, etc.)
3. Datasets that transform their input dataset in a way that is learned on a
potentially different dataset (e.g. PCA when you want to learn the projection
space on the training set in order to transform both the training and test
sets).
My impression currently is that we would use dataset subclasses to handle 1
and 2. However, 3 requires a learner framework, so you would need to have
something like a LearnerOutputDataset(trained_learner, dataset).
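For case 3, such a LearnerOutputDataset could be little more than a wrapper
that applies an already-trained learner to each sample of its input dataset.
Here is a rough sketch under assumed names (the compute_output method and the
toy PCA learner are hypothetical, used only to illustrate the train/test
scenario above)::

    import numpy as np

    class LearnerOutputDataset(object):
        """Case 3: a dataset obtained by applying a trained learner to an
        input dataset (samples are transformed lazily, on access)."""

        def __init__(self, trained_learner, dataset):
            self.trained_learner = trained_learner
            self.dataset = dataset

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, i):
            return self.trained_learner.compute_output(self.dataset[i])

        def __iter__(self):
            for sample in self.dataset:
                yield self.trained_learner.compute_output(sample)

    class PCA(object):
        """Toy learner: learns a 2-dimensional projection space."""

        def train(self, data):
            self.mean = data.mean(axis=0)
            _, _, vt = np.linalg.svd(data - self.mean, full_matrices=False)
            self.components = vt[:2]

        def compute_output(self, sample):
            return np.dot(self.components, sample - self.mean)

    # The projection is learned on the training set only; the same trained
    # learner then transforms both the training and test sets.
    train_data = np.random.randn(100, 10)
    test_data = np.random.randn(20, 10)
    pca = PCA()
    pca.train(train_data)
    train_pca = LearnerOutputDataset(pca, train_data)
    test_pca = LearnerOutputDataset(pca, test_data)

Whether such a dataset computes its output lazily (as in this sketch) or
caches it is a separate design decision.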

Note however that 2 is a special case of 3 (where training does nothing), and