# HG changeset patch
# User Olivier Delalleau
# Date 1284164691 14400
# Node ID de456561ec400dd53e6240a02c80bbdd8af312f8
# Parent 7e6e77d50eeb2d460df9103ffceea952fed891c8
dataset: Rewrote my rambling about the links between dataset and learner

diff -r 7e6e77d50eeb -r de456561ec40 doc/v2_planning/dataset.txt
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 17:06:38 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 20:24:51 2010 -0400
@@ -260,43 +260,42 @@
 A dataset is a learner
 ~~~~~~~~~~~~~~~~~~~~~~
 
-OD: This is more a high-level comment that may or may not be relevant
-depending on how we get to plug our different classes together.
-In PLearn (old C++ lisa ML library) we had *lots* of dataset subclasses doing
-all sorts of fancy things, the majority of these classes taking as input
-another dataset, and transforming it in some way (e.g. taking a subset of
-samples, a subset of features, normalizing features, computing extra fields
-given existing fields, etc.). I think right now our interface is heading in a
-similar direction.
-When you think about it, this kind of operation is equivalent to writing a
-learner class that is trained on the input dataset, and whose output on this
-same dataset is used to obtain an output dataset (note that the training phase
-may do nothing, e.g. if the goal is only to filter out a predefined set of
-samples).
-If you push it even further, even a dataset that has no input dataset, say
-e.g. a dataset view of a 2D numpy matrix, can be seen as the output of a
-learner that was trained on nothing and whose output is computed on nothing
-(but still outputs this 2D matrix).
-In the small ML library I have been using at Ubisoft, the dataset class
-actually inherits from learner, based on this point of view. Actually pretty
-much all objects that are plugged together to make an experiment are learners.
-The main advantage is everything has the same interface and the "plugging" of
-the different parts can remain very simple. Confusion is avoided by the module
-hierarchy to ensure objects with different behavior have different names.
-Something like dataset.MatrixDataset would create a dataset from scratch (i.e.
-a numpy matrix), process.FilterSamples would be something that does not need
-to be trained, but needs an input dataset, and learner.NNet would be a usual
-learning algorithm that must be trained on an input dataset, and computes an
-output (possibly on the same dataset, possibly on another one).
+OD: (this is hopefully a clearer re-write of the original version from
+r7e6e77d50eeb, which I was not happy with).
+There are typically three kinds of objects that spit out data:
+1. Datasets that are loaded from disk or are able to generate data all by
+   themselves (i.e. without any other dataset as input)
+2. Datasets that transform their input dataset in some way (e.g. filtering
+   samples or features, normalizing data, etc.)
+3. Datasets that are the output of a transformation whose parameters are
+   learned on a potentially different dataset (e.g. PCA when you want to learn
+   the projection space on the training set in order to transform both the
+   training and test sets).
+My impression currently is that we would use dataset subclasses to handle 1
+and 2. However, 3 requires a learner framework, so you would need something
+like a LearnerOutputDataset(trained_learner, dataset) (see the sketch below).
+
+Note however that 2 is a special case of 3 (where training does nothing), and
+1 is a special case of 2 (where we do not care about being given an input
+dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
+by LearnerOutputDataset.
 
-Ok, this is getting too long, I am definitely not saying we should do this,
-but I think there is some close relationship between the usual data processing
-we do and the learning process, so it may be worth thinking how to put them
-together in a coherent framework. For instance, in PLearn there was (something
-like) a NormalizeVMatrix (think of it as a dataset subclass), but it could
-not be used in a natural way to learn the normalization parameters on a
-training set (e.g. mean and std of features) and normalize another dataset.
-Instead you could use (something like) a
-PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having both
-ways to do (almost the) same thing can be confusing.
+The main advantages I find in this approach (which I have been using at
+Ubisoft) are:
+- You only need to learn how to subclass the learner class. The only dataset
+  class is LearnerOutputDataset, which you could just name Dataset.
+- You do not have multiple ways to achieve the same result (and thus no need
+  to figure out which one is most appropriate).
+- Upgrading code from 2 to 3 is more straightforward. Such a situation can
+  happen e.g. if you write some code that normalizes your input dataset
+  (situation 2), then realize later you would like to be able to normalize new
+  datasets using the same parameters (e.g. same shift & rescaling), which
+  requires situation 3.
+- It can make your life easier when thinking about how to plug things together
+  (something that has not been discussed yet), because the interfaces of the
+  various components are less varied.
+I am not saying that we should necessarily do it this way, but I think it is
+worth at least keeping in mind this close relationship between simple
+processing and learning, and thinking about the benefits / drawbacks of
+keeping them separate in the class hierarchy.
 
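To make the LearnerOutputDataset idea above more concrete, here is a minimal
Python sketch. It is only an illustration under assumed names: Learner,
LearnerOutputDataset, MatrixLearner, NormalizeLearner and their train /
compute_output methods are all hypothetical and do not exist in any current
code. MatrixLearner covers case 1, and NormalizeLearner covers case 2 when
trained and applied on the same dataset and case 3 when applied to a
different one::

    # Hypothetical sketch only: all class and method names below are made up
    # for illustration and do not exist anywhere yet.

    import numpy


    class Learner(object):
        """Base class: everything that produces data is a learner."""

        def train(self, dataset):
            """Learn parameters from `dataset` (may be a no-op)."""
            pass

        def compute_output(self, dataset):
            """Return the output computed on `dataset` (a 2D numpy array)."""
            raise NotImplementedError


    class LearnerOutputDataset(object):
        """The single dataset class: the output of a (trained) learner
        applied to an input dataset (which may be None, as in case 1)."""

        def __init__(self, learner, dataset=None):
            self.learner = learner
            self.dataset = dataset

        @property
        def data(self):
            return self.learner.compute_output(self.dataset)


    class MatrixLearner(Learner):
        """Case 1: trained on nothing, simply outputs a fixed 2D matrix."""

        def __init__(self, matrix):
            self.matrix = matrix

        def compute_output(self, dataset):
            return self.matrix


    class NormalizeLearner(Learner):
        """Cases 2 and 3: learn per-feature mean / std on one dataset and
        use them to normalize the same dataset (2) or another one (3)."""

        def train(self, dataset):
            self.mean = dataset.data.mean(axis=0)
            self.std = dataset.data.std(axis=0)

        def compute_output(self, dataset):
            return (dataset.data - self.mean) / self.std


    # Usage: learn the normalization parameters on the training set, then
    # re-use them to normalize both the training and the test sets.
    train_set = LearnerOutputDataset(MatrixLearner(numpy.random.randn(100, 5)))
    test_set = LearnerOutputDataset(MatrixLearner(numpy.random.randn(20, 5)))

    normalizer = NormalizeLearner()
    normalizer.train(train_set)
    normalized_train = LearnerOutputDataset(normalizer, train_set)
    normalized_test = LearnerOutputDataset(normalizer, test_set)

Under this convention, the 2 -> 3 upgrade mentioned in the advantages list
amounts to training the same NormalizeLearner once and wrapping it around
several datasets, instead of rewriting the normalization code as a new
dataset subclass.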