changeset 1083:4c00af69c164

dataset: Asking what we want from mini-batches
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 16:31:43 -0400
parents f9f72ae84313
children 7e6e77d50eeb
files doc/v2_planning/dataset.txt
diffstat 1 files changed, 28 insertions(+), 0 deletions(-)
line diff
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 15:36:23 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 16:31:43 2010 -0400
@@ -229,3 +229,31 @@
 would be forwarded iteratively to previous datasets so as to get back all this
 wasted memory space.
 
+What is a mini-batch?
+~~~~~~~~~~~~~~~~~~~~~
+
+This is a follow-up to the meeting's discussion about whether a mini-batch
+returned by a dataset should be itself a dataset.
+
+OD: During the meeting I voted in favor of 'yes', mostly because it made
+sense to me (a mini-batch is a subset of a dataset and thus should be a
+dataset), but now I tend towards 'no'. The main reason is that it is not
+yet clear what the dataset interface will be, so it is hard to judge
+whether this is a good idea (my main concern is how much additional work
+would be required from the writer of a new dataset subclass). Anyway,
+maybe the first thing we should think about is what we want a mini-batch
+to be. I think we can agree that we would like to be able to do something
+like:
+    for mb in dataset.mini_batches(size=10):
+        learner.update(mb.input, mb.target)
+so it should be ok for a mini-batch to be an object whose fields (which
+should have the same names as those of the dataset) are numpy arrays.
+More generally, we would like to be able to iterate over the samples in a
+mini-batch, or access them randomly, so a mini-batch should implement
+__iter__ and __getitem__.
+Besides this, are there any other typical use cases for a mini-batch? In
+particular, is there any reason to want an infinite mini-batch (in which
+case we may need to revise our idea of what 'mini' means)? Hopefully the
+answer to that last question is no, as it would definitely keep things
+simpler: we could simply use numpy arrays (for numeric data) or lists
+(for anything else) to store a mini-batch's data. So I vote for 'no'.
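+For concreteness, here is a minimal sketch of what such a mini-batch
+object could look like. The MiniBatch class and the standalone
+mini_batches() helper are hypothetical names, not a proposed API; the only
+assumption is that the dataset's fields can be sliced as parallel numpy
+arrays:
+
+    import numpy as np
+
+    class MiniBatch(object):
+        """Container whose fields (same names as the dataset's, e.g.
+        'input' and 'target') are numpy arrays of equal length."""
+        def __init__(self, **fields):
+            self._names = sorted(fields)
+            for name, array in fields.items():
+                setattr(self, name, array)
+
+        def __len__(self):
+            # All fields are assumed to hold the same number of samples.
+            return len(getattr(self, self._names[0]))
+
+        def __getitem__(self, i):
+            # Random access: the i-th sample of every field, as a tuple.
+            return tuple(getattr(self, name)[i] for name in self._names)
+
+        def __iter__(self):
+            # Iterate over samples, one tuple per sample.
+            for i in range(len(self)):
+                yield self[i]
+
+    # Hypothetical helper slicing two parallel arrays into mini-batches
+    # (a real dataset would presumably expose this as a method instead):
+    def mini_batches(inputs, targets, size=10):
+        for start in range(0, len(inputs), size):
+            yield MiniBatch(input=inputs[start:start + size],
+                            target=targets[start:start + size])
+
+    X = np.arange(100).reshape(50, 2)
+    y = np.arange(50)
+    for mb in mini_batches(X, y, size=10):
+        print(mb.input.shape, mb.target.shape)  # (10, 2) (10,)
+        first_input, first_target = mb[0]       # random access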
+