changeset 1086:65ac0f493830

dataset: Some clarifications on my comments
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 22:22:02 -0400
parents de456561ec40
children 8c448829db30
files doc/v2_planning/dataset.txt
diffstat 1 files changed, 12 insertions(+), 11 deletions(-)
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 20:24:51 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 22:22:02 2010 -0400
@@ -251,11 +251,12 @@
 mini-batch, or do random access on them, so a mini-batch should implement
 __iter__ and __getitem__.
 Besides this, is there any other typical use-case of a mini-batch? In
-particular, is there any reason to want an infinite mini-batch? (in which case
-we may need to revise our idea of what 'mini' means) Hopefully the answer to
-that last question is no, as I think it would definitely keep things simpler,
-since we could simply use numpy arrays (for numeric data) or lists (for
-anything else) to store mini-batches' data. So I vote for 'no'.
+particular, is there any reason to want an infinite mini-batch, or a very big
+mini-batch that may not fit in memory? (in which case we may need to revise
+our idea of what 'mini' means) Hopefully the answer to that last question is
+no, as I think it would definitely keep things simpler, since we could simply
+use numpy arrays (for numeric data) or lists (for anything else) to store
+mini-batches' data. So I vote for 'no'.
 
 A dataset is a learner
 ~~~~~~~~~~~~~~~~~~~~~~
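[Editor's note, not part of the patch: the paragraph above argues that a mini-batch only needs iteration and random access over in-memory data. A minimal sketch of what that could look like, assuming numpy arrays for numeric data; the class and its name are hypothetical, not an agreed API:

    import numpy as np

    class MiniBatch(object):
        """Hypothetical minimal mini-batch: a thin wrapper around a numpy
        array (or a list, for non-numeric data) supporting iteration over
        samples and random access to them."""

        def __init__(self, data):
            # `data` is assumed to fit in memory (the 'no' vote above).
            self.data = data

        def __iter__(self):
            # Iterate over individual samples (rows of the array /
            # items of the list).
            return iter(self.data)

        def __getitem__(self, index):
            # Random access to one sample (or a slice of samples).
            return self.data[index]

    # Example usage:
    batch = MiniBatch(np.arange(12).reshape(4, 3))  # 4 samples, 3 features
    first_sample = batch[0]
    for sample in batch:
        pass  # process each sample
]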
@@ -265,12 +266,12 @@
 There are typically three kinds of objects that spit out data:
 1. Datasets that are loaded from disk or are able to generate data all by
    themselves (i.e. without any other dataset as input)
-2. Datasets that transform their input dataset in some way (e.g. filtering
-   samples or features, normalizing data, etc.)
-3. Datasets that are the output of a transformation whose parameters are
-   learned on a potentially different dataset (e.g. PCA when you want to learn the
-   projection space on the training set in order to transform both the training
-   and test sets).
+2. Datasets that transform their input dataset in a way that only depends on
+   the input dataset (e.g. filtering samples or features, normalizing data, etc.)
+3. Datasets that transform their input dataset in a way that is learned on a
+   potentially different dataset (e.g. PCA when you want to learn the projection
+   space on the training set in order to transform both the training and test
+   sets).
 My impression currently is that we would use dataset subclasses to handle 1
 and 2. However, 3 requires a learner framework, so you would need to have
 something like a LearnerOutputDataset(trained_learner, dataset).
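[Editor's note, not part of the patch: the LearnerOutputDataset(trained_learner, dataset) idea mentioned above could look roughly like the sketch below. The `transform` method on the learner is an assumption for illustration, not an existing interface:

    class LearnerOutputDataset(object):
        """Hypothetical sketch of case 3: a dataset that is the output of a
        trained learner applied to another dataset (e.g. a PCA projection
        learned on the training set, then applied to both train and test)."""

        def __init__(self, trained_learner, dataset):
            self.trained_learner = trained_learner
            self.dataset = dataset

        def __iter__(self):
            # Lazily transform each sample of the input dataset.
            for sample in self.dataset:
                yield self.trained_learner.transform(sample)

    # Assumed usage, with `pca` a learner already fit on the training set:
    #   train_out = LearnerOutputDataset(pca, train_set)
    #   test_out  = LearnerOutputDataset(pca, test_set)
]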