comparison doc/v2_planning/dataset.txt @ 1083:4c00af69c164

dataset: Asking what we want from mini-batches
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 16:31:43 -0400
parents f9f72ae84313
children 7e6e77d50eeb
along the way). So there may be a use for a `clear()` method that would be
called by the topmost dataset (the one doing the final memory caching), and
would be forwarded iteratively to previous datasets so as to get back all this
wasted memory space.

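A minimal sketch of how such a forwarded `clear()` might look, assuming a
hypothetical chained-dataset wrapper with a `_source` attribute (the previous
dataset in the chain) and a `_cache` dict — none of these names appear in this
document:

```python
class CachedDataset(object):
    """Hypothetical dataset wrapper that caches data computed from a
    source dataset and can give that memory back on demand."""

    def __init__(self, source=None):
        self._source = source   # previous dataset in the chain, or None
        self._cache = {}        # whatever intermediate data we kept

    def clear(self):
        # Drop our own cached data, then forward the request down the
        # chain so every previous dataset frees its memory too.
        self._cache.clear()
        if self._source is not None:
            self._source.clear()


# The topmost dataset calls clear() once, and the call propagates
# iteratively back through the chain:
base = CachedDataset()
mid = CachedDataset(source=base)
top = CachedDataset(source=mid)
top._cache['x'] = [0] * 1000
base._cache['y'] = [1] * 1000
top.clear()
assert base._cache == {} and top._cache == {}
```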
What is a mini-batch?
~~~~~~~~~~~~~~~~~~~~~

This is a follow-up to the meeting's discussion about whether a mini-batch
returned by a dataset should itself be a dataset.

OD: During the meeting I was voting in favor of a 'yes', mostly because it
made sense to me (a mini-batch is a subset of a dataset and thus should be a
dataset), but now I tend towards 'no'. The main reason is that it is not yet
clear what the dataset interface will be, so it is hard to judge whether this
is a good idea (my main concern is how much additional work would be required
from the writer of a new dataset subclass). Anyway, maybe the first thing we
could think about is what we want a mini-batch to be. I think we can agree
that we would like to be able to do something like:

    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)

so it should be ok for a mini-batch to be an object whose fields (which
should have the same names as those of the dataset) are numpy arrays.
More generally, we would like to be able to iterate over the samples in a
mini-batch, or access them randomly, so a mini-batch should implement
__iter__ and __getitem__.
Besides this, is there any other typical use-case for a mini-batch? In
particular, is there any reason to want an infinite mini-batch? (In which
case we may need to revise our idea of what 'mini' means.) Hopefully the
answer to that last question is no, as I think it would definitely keep
things simpler, since we could simply use numpy arrays (for numeric data) or
lists (for anything else) to store a mini-batch's data. So I vote for 'no'.
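As a sketch of what such a non-dataset mini-batch object could look like —
the field names `input` and `target`, the keyword-argument constructor, and
everything else here are assumptions based only on the loop above, not an
agreed-upon interface:

```python
import numpy as np

class MiniBatch(object):
    """Plain container for mini-batch data: one numpy array per field.

    Iterating or indexing yields per-sample tuples of field values,
    with fields taken in sorted name order."""

    def __init__(self, **fields):
        # Each field (e.g. 'input', 'target') becomes an attribute
        # holding a numpy array whose first dimension indexes samples.
        self._names = sorted(fields)
        for name, values in fields.items():
            setattr(self, name, np.asarray(values))

    def __len__(self):
        return len(getattr(self, self._names[0]))

    def __getitem__(self, i):
        # Random access: the i-th sample of every field.
        return tuple(getattr(self, name)[i] for name in self._names)

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


# Hypothetical usage mirroring the loop in the text:
mb = MiniBatch(input=np.arange(20).reshape(10, 2),
               target=np.arange(10))
assert len(mb) == 10
first_input, first_target = mb[0]   # random access on samples
```

Since the fields are finite numpy arrays, this container is necessarily
finite, which matches the hope above that infinite mini-batches are not
needed.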