# HG changeset patch
# User Olivier Delalleau
# Date 1284152798 14400
# Node ID 7e6e77d50eeb2d460df9103ffceea952fed891c8
# Parent  4c00af69c164c105a486d70ac2fc401c1a9e6c2b
dataset: I say the learner committee should take care of dataset as well

diff -r 4c00af69c164 -r 7e6e77d50eeb doc/v2_planning/dataset.txt
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 16:31:43 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 17:06:38 2010 -0400
@@ -257,3 +257,46 @@
 since we could simply use numpy arrays (for numeric data) or lists (for
 anything else) to store mini-batches' data. So I vote for 'no'.
 
+A dataset is a learner
+~~~~~~~~~~~~~~~~~~~~~~
+
+OD: This is more of a high-level comment that may or may not be relevant
+depending on how we end up plugging our different classes together.
+In PLearn (the old C++ LISA ML library) we had *lots* of dataset subclasses
+doing all sorts of fancy things. The majority of these classes took another
+dataset as input and transformed it in some way (e.g. taking a subset of
+samples, a subset of features, normalizing features, computing extra fields
+from existing ones, etc.). I think our current interface is heading in a
+similar direction.
+When you think about it, this kind of operation is equivalent to writing a
+learner class that is trained on the input dataset, and whose output on this
+same dataset is used to obtain an output dataset (note that the training
+phase may do nothing, e.g. if the goal is only to filter out a predefined
+set of samples).
+If you push it even further, a dataset that has no input dataset, e.g. a
+dataset view of a 2D numpy matrix, can be seen as the output of a learner
+that was trained on nothing and whose output is computed on nothing (but
+still outputs this 2D matrix).
+In the small ML library I have been using at Ubisoft, the dataset class
+actually inherits from the learner class, based on this point of view. In
+fact, pretty much all objects that are plugged together to build an
+experiment are learners.
+The main advantage is that everything has the same interface, so the
+"plugging" of the different parts can remain very simple. Confusion is
+avoided through the module hierarchy, which ensures that objects with
+different behavior have different names: something like
+dataset.MatrixDataset would create a dataset from scratch (i.e. from a
+numpy matrix), process.FilterSamples would be something that does not need
+to be trained but needs an input dataset, and learner.NNet would be a
+typical learning algorithm that must be trained on an input dataset and
+computes an output (possibly on the same dataset, possibly on another one).
+
+Ok, this is getting too long. I am definitely not saying we should do this,
+but I think there is a close relationship between the usual data processing
+we do and the learning process itself, so it may be worth thinking about
+how to put them together in a coherent framework. For instance, in PLearn
+there was (something like) a NormalizeVMatrix (think of it as a dataset
+subclass), but it could not be used in a natural way to learn the
+normalization parameters on a training set (e.g. the mean and standard
+deviation of each feature) and then normalize another dataset. Instead you
+had to use (something like) a
+PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having two
+ways to do (almost) the same thing can be confusing.
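+
+To make this more concrete, here is a rough Python sketch of what such a
+unified interface could look like. To be clear, all names here (Learner,
+MatrixDataset, FilterSamples, NormalizeLearner) are made up for the sake of
+illustration; this is only one possible way to organize things, not a
+proposal for the actual API::
+
+  import numpy as np
+
+  class Learner(object):
+      # Everything, datasets included, shares this single interface.
+      def train(self, input_dataset=None):
+          pass  # default: the training phase does nothing
+      def compute_output(self, input_dataset=None):
+          raise NotImplementedError
+
+  class MatrixDataset(Learner):
+      # A dataset built from scratch: "trained" on nothing, computes its
+      # output on nothing, but still outputs a 2D numpy matrix.
+      def __init__(self, data):
+          self.data = data
+      def compute_output(self, input_dataset=None):
+          return self.data
+
+  class FilterSamples(Learner):
+      # Needs no training, but needs an input dataset: keeps only a
+      # predefined subset of its samples.
+      def __init__(self, indices):
+          self.indices = indices
+      def compute_output(self, input_dataset=None):
+          return input_dataset.compute_output()[self.indices]
+
+  class NormalizeLearner(Learner):
+      # Here the training phase actually does something: it learns the
+      # normalization parameters (per-feature mean and std).
+      def train(self, input_dataset=None):
+          data = input_dataset.compute_output()
+          self.mean = data.mean(axis=0)
+          self.std = data.std(axis=0)
+      def compute_output(self, input_dataset=None):
+          return (input_dataset.compute_output() - self.mean) / self.std
+
+  # The NormalizeVMatrix problem mentioned above then goes away: the same
+  # object is trained on one dataset and applied to another.
+  train_set = MatrixDataset(np.random.randn(100, 5))
+  test_set = MatrixDataset(np.random.randn(50, 5))
+  norm = NormalizeLearner()
+  norm.train(train_set)
+  normalized_test = norm.compute_output(test_set)
+
+In this sketch compute_output returns a plain numpy array to keep things
+short; presumably it would instead return a dataset object, so that the
+different pieces can keep being chained together.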