diff doc/v2_planning/dataset.txt @ 1084:7e6e77d50eeb

dataset: I say the learner committee should take care of dataset as well
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 17:06:38 -0400
parents 4c00af69c164
children de456561ec40
--- a/doc/v2_planning/dataset.txt	Fri Sep 10 16:31:43 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 10 17:06:38 2010 -0400
@@ -257,3 +257,46 @@
 since we could simply use numpy arrays (for numeric data) or lists (for
 anything else) to store mini-batches' data. So I vote for 'no'.
 
+A dataset is a learner
+~~~~~~~~~~~~~~~~~~~~~~
+
+OD: This is more a high-level comment that may or may not be relevant
+depending on how we get to plug our different classes together.
+In PLearn (the old C++ LISA ML library) we had *lots* of dataset subclasses
+doing all sorts of fancy things, most of them taking another dataset as input
+and transforming it in some way (e.g. taking a subset of samples, a subset of
+features, normalizing features, computing extra fields from existing fields,
+etc.). I think our current interface is heading in a similar direction.
+
+When you think about it, this kind of operation is equivalent to writing a
+learner class that is trained on the input dataset, and whose output on this
+same dataset is used to obtain an output dataset (note that the training phase
+may do nothing, e.g. if the goal is only to filter out a predefined set of
+samples).
+If you push it even further, even a dataset that has no input dataset, e.g. a
+dataset view of a 2D numpy matrix, can be seen as the output of a learner that
+was trained on nothing and whose output is computed on nothing (but still
+outputs this 2D matrix).
+
+In the small ML library I have been using at Ubisoft, the dataset class
+actually inherits from learner, based on this point of view. In fact, pretty
+much all objects that are plugged together to make an experiment are learners.
+The main advantage is that everything has the same interface and the
+"plugging" of the different parts can remain very simple. Confusion is avoided
+by the module hierarchy, which ensures that objects with different behavior
+have different names.
+Something like dataset.MatrixDataset would create a dataset from scratch (i.e.
+from a numpy matrix), process.FilterSamples would be something that does not
+need to be trained but needs an input dataset, and learner.NNet would be a
+typical learning algorithm that must be trained on an input dataset before it
+can compute an output (possibly on the same dataset, possibly on another one).
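+
+A minimal sketch of what such a shared interface could look like (all class
+and method names below are hypothetical, just to illustrate the idea rather
+than propose an actual API; datasets are assumed to behave like numpy
+arrays)::
+
+    class Learner(object):
+        def train(self, input_dataset=None):
+            # Learn whatever parameters are needed (may be a no-op).
+            pass
+
+        def compute_output(self, input_dataset=None):
+            # Return a new dataset computed from `input_dataset`.
+            raise NotImplementedError
+
+    class MatrixDataset(Learner):
+        # A "source" dataset: trained on nothing, output computed on nothing.
+        def __init__(self, matrix):
+            self.matrix = matrix
+
+        def compute_output(self, input_dataset=None):
+            return self.matrix
+
+    class FilterSamples(Learner):
+        # Needs no training, but requires an input dataset to transform.
+        def __init__(self, indices):
+            self.indices = indices
+
+        def compute_output(self, input_dataset=None):
+            return input_dataset[self.indices]
+
+    class NNet(Learner):
+        # A typical learning algorithm: must be trained before its output
+        # can be computed on some (possibly different) dataset.
+        def train(self, input_dataset=None):
+            # Fit the network on `input_dataset` (details omitted).
+            pass
+
+        def compute_output(self, input_dataset=None):
+            # Return the network's outputs on `input_dataset` (omitted).
+            pass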
+
+Ok, this is getting too long, and I am definitely not saying we should do
+this, but I think there is some close relationship between the usual data
+processing we do and the learning process, so it may be worth thinking about
+how to put them together in a coherent framework. For instance, in PLearn
+there was (something like) a NormalizeVMatrix (think of it as a dataset
+subclass), but it could not be used in a natural way to learn the
+normalization parameters on a training set (e.g. mean and std of features)
+and then normalize another dataset. Instead you could use (something like) a
+PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having two
+ways to do (almost) the same thing can be confusing.
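+
+For instance (hypothetical names again, reusing the Learner base class from
+the sketch above, and still assuming datasets behave like numpy arrays), a
+single normalization learner could cover both use cases::
+
+    import numpy
+
+    class NormalizeLearner(Learner):
+        # Learns per-feature mean and std on one dataset, and can then
+        # normalize any other dataset with these parameters.
+        def train(self, input_dataset=None):
+            self.mean = input_dataset.mean(axis=0)
+            self.std = input_dataset.std(axis=0)
+
+        def compute_output(self, input_dataset=None):
+            return (input_dataset - self.mean) / self.std
+
+    # Learn the normalization on a training set, then apply it to a test set.
+    train_set = numpy.random.randn(100, 5)
+    test_set = numpy.random.randn(20, 5)
+    normalize = NormalizeLearner()
+    normalize.train(train_set)
+    normalized_test_set = normalize.compute_output(test_set)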
+