doc/v2_planning/dataset.txt @ 1083:4c00af69c164
dataset: Asking what we want from mini-batches

author:   Olivier Delalleau <delallea@iro>
date:     Fri, 10 Sep 2010 16:31:43 -0400
parents:  f9f72ae84313
children: 7e6e77d50eeb

along the way). So there may be a use for a `clear()` method that would be
called by the topmost dataset (the one doing the final memory caching), and
would be forwarded iteratively to previous datasets so as to get back all this
wasted memory space.

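A minimal sketch of how this forwarding could look (the class and attribute
names below, e.g. `CachedDataset` and `_previous`, are assumptions made for
illustration, not a settled interface):

    class CachedDataset(object):
        """Hypothetical dataset wrapper that caches data computed by the
        previous dataset in the pipeline."""

        def __init__(self, previous=None):
            self._previous = previous
            self._cache = {}

        def clear(self):
            # Free our own cached data, then forward the call so that every
            # dataset earlier in the chain releases its memory too.
            self._cache.clear()
            if self._previous is not None:
                self._previous.clear()
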
What is a mini-batch?
~~~~~~~~~~~~~~~~~~~~~

This is a follow-up to the meeting's discussion about whether a mini-batch
returned by a dataset should itself be a dataset.

OD: During the meeting I was voting in favor of a 'yes', mostly because it
made sense to me (a mini-batch is a subset of a dataset and thus should be a
dataset), but now I tend towards 'no'. The main reason is that it is not clear
yet what the dataset interface will be, so it is hard to judge whether this is
a good idea (my main concern is how much additional work would be required
from the writer of a new dataset subclass). Anyway, maybe a first thing we
could think about is what we want a mini-batch to be. I think we can agree
that we would like to be able to do something like:

    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)

so it should be ok for a mini-batch to be an object whose fields (which should
have the same names as those of the dataset) are numpy arrays. More generally,
we would like to be able to iterate on the samples in a mini-batch, or do
random access on them, so a mini-batch should implement __iter__ and
__getitem__. Besides this, is there any other typical use case of a
mini-batch? In particular, is there any reason to want an infinite mini-batch?
(In which case we may need to revise our idea of what 'mini' means.) Hopefully
the answer to that last question is no, as I think it would definitely keep
things simpler, since we could then use plain numpy arrays (for numeric data)
or lists (for anything else) to store a mini-batch's data. So I vote for 'no'.
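
A minimal sketch of what such a mini-batch object could look like (the
`MiniBatch` class and the `mini_batches` helper below are assumptions for
illustration; only the field names and the __iter__ / __getitem__
requirements come from the discussion above):

    import numpy as np

    class MiniBatch(object):
        """Hypothetical mini-batch: one attribute per dataset field, each
        holding a numpy array whose first dimension indexes samples."""

        def __init__(self, **fields):
            self._field_names = sorted(fields)
            self.__dict__.update(fields)

        def __len__(self):
            # All fields are assumed to hold the same number of samples.
            return len(getattr(self, self._field_names[0]))

        def __iter__(self):
            # Iterate over the individual samples of the mini-batch.
            for i in range(len(self)):
                yield self[i]

        def __getitem__(self, i):
            # Random access: return the i-th sample of every field.
            return tuple(getattr(self, name)[i] for name in self._field_names)

    def mini_batches(fields, size):
        """Hypothetical helper slicing whole-dataset arrays into MiniBatch
        objects of at most `size` samples each."""
        n_samples = len(fields[sorted(fields)[0]])
        for start in range(0, n_samples, size):
            yield MiniBatch(**dict((name, arr[start:start + size])
                                   for name, arr in fields.items()))

    # Usage, matching the loop sketched above:
    x = np.arange(20).reshape(10, 2)   # 10 samples, 2 input features
    y = np.arange(10)                  # 10 targets
    for mb in mini_batches({'input': x, 'target': y}, size=4):
        assert len(mb) == len(mb.input) == len(mb.target)
        first_input, first_target = mb[0]   # random access within the batch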