changeset 1085:de456561ec40

dataset: Rewrote my rambling about the links between dataset and learner
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 20:24:51 -0400
parents 7e6e77d50eeb
children 65ac0f493830
files doc/v2_planning/dataset.txt
diffstat 1 files changed, 37 insertions(+), 38 deletions(-)
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 17:06:38 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 20:24:51 2010 -0400
@@ -260,43 +260,42 @@
 A dataset is a learner
 ~~~~~~~~~~~~~~~~~~~~~~
 
-OD: This is more a high-level comment that may or may not be relevant
-depending on how we get to plug our different classes together.
-In PLearn (old C++ lisa ML library) we had *lots* of dataset subclasses doing
-all sorts of fancy things, the majority of these classes taking as input
-another dataset, and transforming it in some way (e.g. taking a subset of
-samples, a subset of features, normalizing features, computing extra fields
-given existing fields, etc.). I think right now our interface is heading in a
-similar direction.
-When you think about it, this kind of operation is equivalent to writing a
-learner class that is trained on the input dataset, and whose output on this
-same dataset is used to obtain an output dataset (note that the training phase
-may do nothing, e.g. if the goal is only to filter out a predefined set of
-samples).
-If you push it even further, even a dataset that has no input dataset, say
-e.g. a dataset view of a 2D numpy matrix, can be seen as the output of a
-learner that was trained on nothing and whose output is computed on nothing
-(but still outputs this 2D matrix).
-In the small ML library I have been using at Ubisoft, the dataset class
-actually inherits from learner, based on this point of view. Actually pretty
-much all objects that are plugged together to make an experiment are learners.
-The main advantage is everything has the same interface and the "plugging" of
-the different parts can remain very simple. Confusion is avoided by the module
-hierarchy to ensure objects with different behavior have different names.
-Something like dataset.MatrixDataset would create a dataset from scratch (i.e.
-a numpy matrix), process.FilterSamples would be something that does not need
-to be trained, but needs an input dataset, and learner.NNet would be a usual
-learning algorithm that must be trained on an input dataset, and computes an
-output (possibly on the same dataset, possibly on another one).
+OD: (this is hopefully a clearer rewrite of the original version from
+r7e6e77d50eeb, which I was not happy with).
+There are typically three kinds of objects that spit out data:
+1. Datasets that are loaded from disk or are able to generate data all by
+   themselves (i.e. without any other dataset as input)
+2. Datasets that transform their input dataset in some way (e.g. filtering
+   samples or features, normalizing data, etc.)
+3. Datasets that are the output of a transformation whose parameters are
+   learned on a potentially different dataset (e.g. PCA when you want to learn the
+   projection space on the training set in order to transform both the training
+   and test sets).
+My current impression is that we would use dataset subclasses to handle 1
+and 2. However, 3 requires a learner framework, so you would need to have
+something like a LearnerOutputDataset(trained_learner, dataset).
+
+Note however that 2 is a special case of 3 (where training does nothing), and
+1 is a special case of 2 (where we do not care about being given an input
+dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
+by LearnerOutputDataset.
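+
+Very roughly, here is the kind of thing I have in mind (a toy Python sketch
+with made-up names -- Learner, LearnerOutputDataset, MatrixLearner,
+FilterSamples -- none of which is meant to be a final interface):
+
+    import numpy
+
+    class Learner(object):
+        def train(self, dataset=None):
+            pass                      # default: training does nothing
+        def compute_output(self, dataset=None):
+            raise NotImplementedError
+
+    class LearnerOutputDataset(object):
+        # The only dataset class: the output of a (trained) learner
+        # applied to an (optional) input dataset.
+        def __init__(self, learner, dataset=None):
+            self.learner = learner
+            self.dataset = dataset
+        def data(self):
+            return self.learner.compute_output(self.dataset)
+
+    class MatrixLearner(Learner):
+        # Kind 1: generates data all by itself (here, from a numpy matrix).
+        def __init__(self, matrix):
+            self.matrix = matrix
+        def compute_output(self, dataset=None):
+            return self.matrix
+
+    class FilterSamples(Learner):
+        # Kind 2: needs no training, but transforms an input dataset.
+        def __init__(self, indices):
+            self.indices = indices
+        def compute_output(self, dataset):
+            return dataset.data()[self.indices]
+
+    raw = LearnerOutputDataset(MatrixLearner(numpy.random.randn(4, 3)))
+    subset = LearnerOutputDataset(FilterSamples([0, 2]), raw)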
 
-Ok, this is getting too long, I am definitely not saying we should do this,
-but I think there is some close relationship between the usual data processing
-we do and the learning process, so it may be worth thinking how to put them
-together in a coherent framework. For instance, in PLearn there was (something
-like) a NormalizeVMatrix (think of it as a dataset subclass), but it could
-not be used in a natural way to learn the normalization parameters on a
-training set (e.g. mean and std of features) and normalize another dataset.
-Instead you could use (something like) a
-PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having both
-ways to do (almost the) same thing can be confusing.
+The main advantages I find in this approach (that I have been using at
+Ubisoft) are:
+- You only need to learn how to subclass the learner class. The only dataset
+  class is LearnerOutputDataset, which you could just name Dataset.
+- You do not have multiple ways to achieve the same result (and thus do not
+  have to figure out which one is most appropriate).
+- Upgrading code from 2 to 3 is more straightforward. Such a situation can
+  happen e.g. if you write some code that normalizes your input dataset
+  (situation 2), then realize later you would like to be able to normalize new
+  datasets using the same parameters (e.g. same shift & rescaling), which
+  requires situation 3 (see the sketch after this list).
+- It can make your life easier when thinking about how to plug things together
+  (something that has not been discussed yet), because the interfaces of the
+  various components are less varied.
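+
+For instance, re-using the toy classes sketched above, the normalization
+example would go from 2 to 3 simply by giving the (again made-up)
+NormalizeLearner a real training phase, without adding any new dataset class:
+
+    class NormalizeLearner(Learner):
+        # Learns the shift & rescaling on one dataset, then applies it
+        # to any other dataset afterwards.
+        def train(self, dataset):
+            data = dataset.data()
+            self.mean = data.mean(axis=0)
+            self.std = data.std(axis=0)
+        def compute_output(self, dataset):
+            return (dataset.data() - self.mean) / self.std
+
+    train_set = LearnerOutputDataset(MatrixLearner(numpy.random.randn(100, 5)))
+    test_set = LearnerOutputDataset(MatrixLearner(numpy.random.randn(20, 5)))
+
+    norm = NormalizeLearner()
+    norm.train(train_set)        # learn mean / std on the training set only
+    normalized_train = LearnerOutputDataset(norm, train_set)
+    normalized_test = LearnerOutputDataset(norm, test_set)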
 
+I am not saying that we should necessarily do it this way, but I think it is
+worth at least keeping in mind this close relationship between simple
+processing and learning, and thinking about what the benefits / drawbacks are
+of keeping them separate in the class hierarchy.