comparison doc/v2_planning/dataset.txt @ 1086:65ac0f493830

dataset: Some clarifications on my comments
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 22:22:02 -0400
parents de456561ec40
children f15216356522
(that should have the same name as those of the dataset) are numpy arrays.
More generally, we would like to be able to iterate on samples in a
mini-batch, or do random access on them, so a mini-batch should implement
__iter__ and __getitem__.
Besides this, is there any other typical use-case of a mini-batch? In
particular, is there any reason to want an infinite mini-batch, or a very big
mini-batch that may not fit in memory? (in which case we may need to revise
our idea of what 'mini' means) Hopefully the answer to that last question is
no, as I think it would definitely keep things simpler, since we could simply
use numpy arrays (for numeric data) or lists (for anything else) to store
mini-batches' data. So I vote for 'no'.
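As an illustration of the 'no' option, here is a minimal sketch (hypothetical
names, not an agreed-upon API) of a finite mini-batch whose fields are numpy
arrays, implementing both __iter__ and __getitem__::

    import numpy as np

    class MiniBatch(object):
        """A named collection of equal-length fields, each a numpy array
        (or a plain list for non-numeric data), supporting per-sample
        iteration and random access."""

        def __init__(self, **fields):
            self.fields = fields
            lengths = set(len(v) for v in fields.values())
            assert len(lengths) == 1, "all fields must hold the same number of samples"
            self.n_samples = lengths.pop()

        def __len__(self):
            return self.n_samples

        def __getitem__(self, i):
            # Random access: the i-th sample, as a dict of field values.
            return dict((name, value[i]) for name, value in self.fields.items())

        def __iter__(self):
            # Iterate over individual samples.
            for i in range(self.n_samples):
                yield self[i]

    # Usage: all fields share the sample dimension; iteration yields one
    # sample at a time.
    batch = MiniBatch(input=np.random.randn(3, 4), target=np.array([0, 1, 0]))
    for sample in batch:
        print(sample['input'].shape, sample['target'])

With this restriction, anything that does not fit in memory stays on the
dataset side, and a mini-batch remains a plain in-memory container.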

A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

OD: (this is hopefully a clearer re-write of the original version from
r7e6e77d50eeb, which I was not happy with).
There are typically three kinds of objects that spit out data:
1. Datasets that are loaded from disk or are able to generate data all by
themselves (i.e. without any other dataset as input)
2. Datasets that transform their input dataset in a way that only depends on
the input dataset (e.g. filtering samples or features, normalizing data, etc.)
3. Datasets that transform their input dataset in a way that is learned on a
potentially different dataset (e.g. PCA when you want to learn the projection
space on the training set in order to transform both the training and test
sets).
My impression currently is that we would use dataset subclasses to handle 1
and 2. However, 3 requires a learner framework, so you would need to have
something like a LearnerOutputDataset(trained_learner, dataset).
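For case 3, such a LearnerOutputDataset could be little more than a wrapper
that applies an already-trained learner to each sample of its input dataset.
Here is a rough sketch under assumed names (the compute_output method and the
toy PCA learner are hypothetical, used only to illustrate the train/test
scenario above)::

    import numpy as np

    class LearnerOutputDataset(object):
        """Case 3: a dataset obtained by applying a trained learner to an
        input dataset (samples are transformed lazily, on access)."""

        def __init__(self, trained_learner, dataset):
            self.trained_learner = trained_learner
            self.dataset = dataset

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, i):
            return self.trained_learner.compute_output(self.dataset[i])

        def __iter__(self):
            for sample in self.dataset:
                yield self.trained_learner.compute_output(sample)

    class PCA(object):
        """Toy learner: learns a 2-dimensional projection space."""

        def train(self, data):
            self.mean = data.mean(axis=0)
            _, _, vt = np.linalg.svd(data - self.mean, full_matrices=False)
            self.components = vt[:2]

        def compute_output(self, sample):
            return np.dot(self.components, sample - self.mean)

    # The projection is learned on the training set only; the same trained
    # learner then transforms both the training and test sets.
    train_data = np.random.randn(100, 10)
    test_data = np.random.randn(20, 10)
    pca = PCA()
    pca.train(train_data)
    train_pca = LearnerOutputDataset(pca, train_data)
    test_pca = LearnerOutputDataset(pca, test_data)

Whether such a dataset computes its output lazily (as in this sketch) or
caches it is a separate design decision.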

Note however that 2 is a special case of 3 (where training does nothing), and