# HG changeset patch
# User Olivier Delalleau
# Date 1284150703 14400
# Node ID 4c00af69c164c105a486d70ac2fc401c1a9e6c2b
# Parent f9f72ae84313026b6db95a409d9cc4d0804e3a61
dataset: Asking what we want from mini-batches

diff -r f9f72ae84313 -r 4c00af69c164 doc/v2_planning/dataset.txt
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 15:36:23 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 16:31:43 2010 -0400
@@ -229,3 +229,31 @@
 would be forwarded iteratively to previous datasets so as to get back all this
 wasted memory space.
 
+What is a mini-batch?
+~~~~~~~~~~~~~~~~~~~~~
+
+This is a follow-up to the meeting's discussion about whether a mini-batch
+returned by a dataset should be itself a dataset.
+
+OD: During the meeting I was voting in favor of a 'yes', mostly because it
+made sense to me (a mini-batch is a subset of a dataset and thus should be a
+dataset), but now I tend towards 'no'. The main reason is it is not clear yet
+what the dataset interface will be, so that it is hard to judge whether this
+is a good idea (my main concern is how much additional work would be required by
+the writer of a new dataset subclass). Anyway, maybe a first thing we could
+think about is what we want a mini-batch to be. I think we can agree that we
+would like to be able to do something like:
+    for mb in dataset.mini_batches(size=10):
+        learner.update(mb.input, mb.target)
+so that it should be ok for a mini-batch to be an object whose fields
+(that should have the same name as those of the dataset) are numpy arrays.
+More generally, we would like to be able to iterate on samples in a
+mini-batch, or do random access on them, so a mini-batch should implement
+__iter__ and __getitem__.
+Besides this, is there any other typical use-case of a mini-batch? In
+particular, is there any reason to want an infinite mini-batch? (in which case
+we may need to revise our idea of what 'mini' means) Hopefully the answer to
+that last question is no, as I think it would definitely keep things simpler,
+since we could simply use numpy arrays (for numeric data) or lists (for
+anything else) to store mini-batches' data. So I vote for 'no'.
+
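
Illustration (not part of the changeset): the mini-batch described in the note
could be sketched roughly as below. This is only a sketch under assumptions:
the class name MiniBatch, the free function mini_batches (standing in for
dataset.mini_batches) and the dict-of-fields layout are hypothetical and do not
come from any agreed dataset interface. Plain lists are used so the example has
no dependencies; numpy arrays should behave the same way here, since only
len(), integer indexing and slicing are used.

```python
class MiniBatch(object):
    """Hypothetical finite mini-batch: named fields of equal length."""

    def __init__(self, **fields):
        lengths = set(len(values) for values in fields.values())
        if len(lengths) > 1:
            raise ValueError("all fields must have the same length")
        self._fields = fields
        self._size = lengths.pop() if lengths else 0

    def __getattr__(self, name):
        # Only called when normal lookup fails: expose each field under
        # the dataset's own field name (mb.input, mb.target).
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name)

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        # Random access: one sample as a dict of field name -> value.
        if not -self._size <= index < self._size:
            raise IndexError(index)
        return dict((name, values[index])
                    for name, values in self._fields.items())

    def __iter__(self):
        # Iterate over the samples of this (finite) mini-batch.
        for i in range(self._size):
            yield self[i]


def mini_batches(fields, size):
    """Yield consecutive MiniBatch objects of at most `size` samples.

    `fields` is an assumed stand-in for a dataset: a dict mapping each
    field name to a sequence (list or numpy array) of samples.
    """
    if not fields:
        return
    n = len(next(iter(fields.values())))
    for start in range(0, n, size):
        yield MiniBatch(**dict((name, values[start:start + size])
                               for name, values in fields.items()))


# Toy data: three samples with an 'input' and a 'target' field.
data = {"input": [[0, 1], [2, 3], [4, 5]], "target": [0, 1, 0]}
batches = list(mini_batches(data, size=2))
```

With this layout a learner could be fed with "for mb in mini_batches(data,
size=2): learner.update(mb.input, mb.target)", where learner is whatever object
consumes the fields. Keeping mini-batches finite is what lets plain arrays or
lists back every field.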