Mercurial > pylearn
changeset 1083:4c00af69c164
dataset: Asking what we want from mini-batches
author | Olivier Delalleau <delallea@iro> |
---|---|
date | Fri, 10 Sep 2010 16:31:43 -0400 |
parents | f9f72ae84313 |
children | 7e6e77d50eeb |
files | doc/v2_planning/dataset.txt |
diffstat | 1 files changed, 28 insertions(+), 0 deletions(-) |
--- a/doc/v2_planning/dataset.txt Fri Sep 10 15:36:23 2010 -0400
+++ b/doc/v2_planning/dataset.txt Fri Sep 10 16:31:43 2010 -0400
@@ -229,3 +229,31 @@
 would be forwarded iteratively to previous datasets so as to get back all this
 wasted memory space.
 
+What is a mini-batch?
+~~~~~~~~~~~~~~~~~~~~~
+
+This is a follow-up to the meeting's discussion about whether a mini-batch
+returned by a dataset should be itself a dataset.
+
+OD: During the meeting I was voting in favor of a 'yes', mostly because it
+made sense to me (a mini-batch is a subset of a dataset and thus should be a
+dataset), but now I tend towards 'no'. The main reason is it is not clear yet
+what the dataset interface will be, so that it is hard to judge whether this
+is a good idea (my main concern is how much additional work would be required by
+the writer of a new dataset subclass). Anyway, maybe a first thing we could
+think about is what we want a mini-batch to be. I think we can agree that we
+would like to be able to do something like:
+    for mb in dataset.mini_batches(size=10):
+        learner.update(mb.input, mb.target)
+so that it should be ok for a mini-batch to be an object whose fields
+(that should have the same name as those of the dataset) are numpy arrays.
+More generally, we would like to be able to iterate on samples in a
+mini-batch, or do random access on them, so a mini-batch should implement
+__iter__ and __getitem__.
+Besides this, is there any other typical use-case of a mini-batch? In
+particular, is there any reason to want an infinite mini-batch? (in which case
+we may need to revise our idea of what 'mini' means) Hopefully the answer to
+that last question is no, as I think it would definitely keep things simpler,
+since we could simply use numpy arrays (for numeric data) or lists (for
+anything else) to store mini-batches' data. So I vote for 'no'.
+
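To make the proposal in the added text more concrete, here is a minimal sketch of what a finite mini-batch that is not itself a dataset could look like. It is not part of the changeset, and nothing in it reflects an agreed pylearn interface: the class names MiniBatch and ArrayDataset, the choice to return a sample as a dict, and the decision to let the last mini-batch be smaller than `size` are all assumptions made for illustration. Only `mini_batches(size=...)`, the field names `input` and `target`, and the `__iter__` / `__getitem__` requirements come from the text above.

```python
import numpy as np


class MiniBatch(object):
    """Hypothetical finite mini-batch whose fields are numpy arrays or lists."""

    def __init__(self, **fields):
        lengths = set(len(value) for value in fields.values())
        if len(lengths) != 1:
            raise ValueError("need at least one field, all of the same length")
        self._fields = fields
        self._length = lengths.pop()
        # Expose each field as an attribute, e.g. mb.input and mb.target.
        for name, value in fields.items():
            setattr(self, name, value)

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        # Random access to one sample. Returning a dict is an assumption;
        # the planning note does not say what a single sample should be.
        return dict((name, value[i]) for name, value in self._fields.items())

    def __iter__(self):
        # Iterate over the samples of the mini-batch, one at a time.
        for i in range(self._length):
            yield self[i]


class ArrayDataset(object):
    """Hypothetical in-memory dataset, only here to produce mini-batches."""

    def __init__(self, **fields):
        self._all = MiniBatch(**fields)  # reuse the length validation

    def mini_batches(self, size):
        if size < 1:
            raise ValueError("size must be a positive integer")
        # Slice every field the same way; the last batch may be shorter.
        for start in range(0, len(self._all), size):
            yield MiniBatch(**dict(
                (name, value[start:start + size])
                for name, value in self._all._fields.items()))


if __name__ == "__main__":
    dataset = ArrayDataset(input=np.arange(50.0).reshape(25, 2),
                           target=np.arange(25))
    for mb in dataset.mini_batches(size=10):
        print(mb.input.shape, mb.target.shape)
```

Because the fields are plain numpy arrays or lists, someone writing a new dataset subclass would only have to slice their own storage. That is the simplification the 'no' vote relies on, and it only works if mini-batches are finite.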