diff doc/v2_planning/dataset.txt @ 1084:7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
author:   Olivier Delalleau <delallea@iro>
date:     Fri, 10 Sep 2010 17:06:38 -0400
parents:  4c00af69c164
children: de456561ec40
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 16:31:43 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 17:06:38 2010 -0400
@@ -257,3 +257,46 @@
 since we could simply use numpy arrays (for numeric data) or lists (for
 anything else) to store mini-batches' data. So I vote for 'no'.
+
+A dataset is a learner
+~~~~~~~~~~~~~~~~~~~~~~
+
+OD: This is more of a high-level comment that may or may not be relevant,
+depending on how we end up plugging our different classes together.
+In PLearn (the old C++ lisa ML library) we had *lots* of dataset subclasses
+doing all sorts of fancy things. The majority of these classes took another
+dataset as input and transformed it in some way (e.g. taking a subset of
+samples, a subset of features, normalizing features, computing extra fields
+from existing fields, etc.). I think our current interface is heading in a
+similar direction.
+When you think about it, this kind of operation is equivalent to writing a
+learner class that is trained on the input dataset, and whose output on this
+same dataset is used to obtain an output dataset (note that the training
+phase may do nothing, e.g. if the goal is only to filter out a predefined
+set of samples).
+If you push it even further, even a dataset that has no input dataset, say
+e.g. a dataset view of a 2D numpy matrix, can be seen as the output of a
+learner that was trained on nothing and whose output is computed on nothing
+(but still outputs this 2D matrix).
+In the small ML library I have been using at Ubisoft, the dataset class
+actually inherits from learner, based on this point of view. In fact, pretty
+much all objects that are plugged together to make an experiment are
+learners. The main advantage is that everything has the same interface, so
+the "plugging" of the different parts can remain very simple. Confusion is
+avoided by the module hierarchy, which ensures that objects with different
+behavior have different names.
+Something like dataset.MatrixDataset would create a dataset from scratch
+(i.e. from a numpy matrix), process.FilterSamples would be something that
+does not need to be trained but needs an input dataset, and learner.NNet
+would be a usual learning algorithm that must be trained on an input
+dataset and computes an output (possibly on the same dataset, possibly on
+another one).
+
+Ok, this is getting too long. I am definitely not saying we should do this,
+but I think there is a close relationship between the usual data processing
+we do and the learning process, so it may be worth thinking about how to
+put them together in a coherent framework. For instance, in PLearn there
+was (something like) a NormalizeVMatrix (think of it as a dataset
+subclass), but it could not be used in a natural way to learn the
+normalization parameters on a training set (e.g. mean and std of features)
+and then normalize another dataset. Instead you had to use (something like)
+a PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having
+two ways to do (almost) the same thing can be confusing.
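The "everything is a learner" idea in the diff above could be sketched as follows. This is only an illustration of the proposal, not code from pylearn or PLearn; the `train`/`compute_output` method names and the `keep_idx` parameter are assumptions, while `MatrixDataset` and `FilterSamples` echo the class names mentioned in the text.

```python
import numpy as np

class Learner:
    """Hypothetical common interface: datasets, processors and learning
    algorithms all share it, so "plugging" components stays simple."""
    def train(self, input_dataset=None):
        # Default: no training needed (e.g. a raw data source or a
        # fixed filtering step).
        return self

    def compute_output(self, input_dataset=None):
        raise NotImplementedError

class MatrixDataset(Learner):
    """A dataset "from scratch": trained on nothing, its output computed
    on nothing, but it still outputs a 2D numpy matrix."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def compute_output(self, input_dataset=None):
        return self.data

class FilterSamples(Learner):
    """Needs no training, but requires an input dataset: its output is a
    predefined subset of the input's samples."""
    def __init__(self, keep_idx):
        self.keep_idx = keep_idx  # assumed: row indices to keep

    def compute_output(self, input_dataset=None):
        return input_dataset.compute_output()[self.keep_idx]

# Plugging components together through the single shared interface:
raw = MatrixDataset([[1., 2.], [3., 4.], [5., 6.]])
subset = FilterSamples([0, 2]).compute_output(raw)
```

Here `subset` holds rows 0 and 2 of the raw matrix; a trainable component (the text's `learner.NNet`) would simply override `train` as well.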
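The normalization example above can also be made concrete: a learner-style `NormalizeLearner` fits its parameters on one dataset and can then be applied to another, which is exactly what a plain dataset subclass could not do naturally. Again a hedged sketch: the class name comes from the text, but the method names and array-based interface are assumptions.

```python
import numpy as np

class NormalizeLearner:
    """Learns per-feature mean and std on a training set, then normalizes
    any dataset with those same parameters."""
    def train(self, train_data):
        # Training phase: compute normalization parameters per feature.
        self.mean = train_data.mean(axis=0)
        self.std = train_data.std(axis=0)
        return self

    def compute_output(self, data):
        # Apply the parameters learned on the training set.
        return (data - self.mean) / self.std

# Fit on a training set, then normalize a different dataset.
train = np.array([[0., 10.], [2., 20.], [4., 30.]])
other = np.array([[2., 20.], [4., 30.]])
norm = NormalizeLearner().train(train)
out = norm.compute_output(other)
```

The first row of `other` equals the training mean, so it maps to zeros; the normalized training set itself has mean 0 and std 1 per feature.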