# HG changeset patch
# User Olivier Delalleau
# Date 1284171722 14400
# Node ID 65ac0f493830568d48182df51b5d01da5176bde3
# Parent  de456561ec400dd53e6240a02c80bbdd8af312f8
dataset: Some clarifications on my comments

diff -r de456561ec40 -r 65ac0f493830 doc/v2_planning/dataset.txt
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 20:24:51 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 22:22:02 2010 -0400
@@ -251,11 +251,12 @@
 mini-batch, or do random access on them, so a mini-batch should implement
 __iter__ and __getitem__.
 Besides this, is there any other typical use-case of a mini-batch? In
-particular, is there any reason to want an infinite mini-batch? (in which case
-we may need to revise our idea of what 'mini' means) Hopefully the answer to
-that last question is no, as I think it would definitely keep things simpler,
-since we could simply use numpy arrays (for numeric data) or lists (for
-anything else) to store mini-batches' data. So I vote for 'no'.
+particular, is there any reason to want an infinite mini-batch, or a very big
+mini-batch that may not fit in memory? (in which case we may need to revise
+our idea of what 'mini' means) Hopefully the answer to that last question is
+no, as I think it would definitely keep things simpler, since we could simply
+use numpy arrays (for numeric data) or lists (for anything else) to store
+mini-batches' data. So I vote for 'no'.
 
 A dataset is a learner
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -265,12 +266,12 @@
 There are typically three kinds of objects that spit out data:
 1. Datasets that are loaded from disk or are able to generate data all by
    themselves (i.e. without any other dataset as input)
-2. Datasets that transform their input dataset in some way (e.g. filtering
-   samples or features, normalizing data, etc.)
-3. Datasets that are the output of a transformation whose parameters are
-   learned on a potentially different dataset (e.g. PCA when you want to learn the
-   projection space on the training set in order to transform both the training
-   and test sets).
+2. Datasets that transform their input dataset in a way that only depends on
+   the input dataset (e.g. filtering samples or features, normalizing data, etc.)
+3. Datasets that transform their input dataset in a way that is learned on a
+   potentially different dataset (e.g. PCA when you want to learn the projection
+   space on the training set in order to transform both the training and test
+   sets).
 My impression currently is that we would use dataset subclasses to handle 1
 and 2. However, 3 requires a learner framework, so you would need to have
 something like a LearnerOutputDataset(trained_learner, dataset).
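
A rough illustration of the two interfaces discussed in this change (not part
of the changeset itself). The class and method names below, including
MiniBatch and the trained learner's transform method, are hypothetical
sketches of what "a mini-batch implementing __iter__ and __getitem__ backed by
a numpy array" and "LearnerOutputDataset(trained_learner, dataset)" could look
like::

    import numpy as np


    class MiniBatch(object):
        """Sketch of a finite mini-batch stored as a numpy array.

        Supports iteration over samples and random access, i.e. the
        __iter__ / __getitem__ interface discussed in the patch."""

        def __init__(self, data):
            # One row per sample.
            self.data = np.asarray(data)

        def __iter__(self):
            # Iterate over individual samples (rows).
            return iter(self.data)

        def __getitem__(self, index):
            # Random access to a sample or a slice of samples.
            return self.data[index]

        def __len__(self):
            return len(self.data)


    class LearnerOutputDataset(object):
        """Sketch of case 3: a dataset obtained by applying a trained
        learner (e.g. a PCA fitted on the training set) to another
        dataset."""

        def __init__(self, trained_learner, dataset):
            self.trained_learner = trained_learner
            self.dataset = dataset

        def __iter__(self):
            # Transform samples of the wrapped dataset on the fly; assumes
            # the trained learner exposes some 'transform' method
            # (hypothetical).
            for sample in self.dataset:
                yield self.trained_learner.transform(sample)


    batch = MiniBatch(np.arange(6).reshape(3, 2))
    first_sample = batch[0]              # random access
    all_sums = [s.sum() for s in batch]  # iteration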