comparison doc/v2_planning/dataset.txt @ 1084:7e6e77d50eeb

dataset: I say the learner committee should take care of dataset as well
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 17:06:38 -0400
parents 4c00af69c164
children de456561ec40
we may need to revise our idea of what 'mini' means). Hopefully the answer to
that last question is no, as I think it would definitely keep things simpler,
since we could simply use numpy arrays (for numeric data) or lists (for
anything else) to store mini-batches' data. So I vote for 'no'.

A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

OD: This is more of a high-level comment that may or may not be relevant
depending on how we get to plug our different classes together.
In PLearn (the old C++ lisa ML library) we had *lots* of dataset subclasses
doing all sorts of fancy things, the majority of these classes taking as
input another dataset and transforming it in some way (e.g. taking a subset
of samples, a subset of features, normalizing features, computing extra
fields from existing fields, etc.). I think our current interface is heading
in a similar direction.
When you think about it, this kind of operation is equivalent to writing a
learner class that is trained on the input dataset, and whose output on this
same dataset is used to obtain an output dataset (note that the training
phase may do nothing, e.g. if the goal is only to filter out a predefined
set of samples).
If you push it even further, even a dataset that has no input dataset, say
a dataset view of a 2D numpy matrix, can be seen as the output of a learner
that was trained on nothing and whose output is computed on nothing (but
still outputs this 2D matrix).
In the small ML library I have been using at Ubisoft, the dataset class
actually inherits from learner, based on this point of view. In fact, pretty
much all objects that are plugged together to make an experiment are
learners. The main advantage is that everything has the same interface, so
the "plugging" of the different parts can remain very simple. Confusion is
avoided by the module hierarchy, which ensures that objects with different
behavior have different names. Something like dataset.MatrixDataset would
create a dataset from scratch (i.e. from a numpy matrix), process.FilterSamples
would be something that does not need to be trained but needs an input
dataset, and learner.NNet would be a usual learning algorithm that must be
trained on an input dataset and computes an output (possibly on the same
dataset, possibly on another one).
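
To make this concrete, below is a minimal sketch of what such a unified
interface could look like. All names and signatures are made up for
illustration (only MatrixDataset, FilterSamples and NNet echo the examples
above); this is just one way the idea could be expressed in Python::

    import numpy

    class Learner(object):
        """Everything is a learner: one train / compute_output interface."""
        def train(self, input_dataset=None):
            pass  # default: no training needed
        def compute_output(self, input_dataset=None):
            raise NotImplementedError

    class MatrixDataset(Learner):
        """A dataset built from scratch: trained on nothing, computed on
        nothing, but still outputs its 2D numpy matrix."""
        def __init__(self, matrix):
            self.matrix = matrix
        def compute_output(self, input_dataset=None):
            return self.matrix

    class FilterSamples(Learner):
        """Needs no training, but needs an input dataset: keeps only a
        predefined subset of the samples."""
        def __init__(self, indices):
            self.indices = indices
        def compute_output(self, input_dataset=None):
            return input_dataset.compute_output()[self.indices]

    # Hypothetical usage: keep rows 0 and 2 of a random 4x3 matrix.
    data = MatrixDataset(numpy.random.randn(4, 3))
    subset = FilterSamples([0, 2])
    output = subset.compute_output(data)  # 2x3 numpy array

A learner.NNet class would then simply be one more Learner subclass whose
train() method actually does something.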

Ok, this is getting too long. I am definitely not saying we should do this,
but I think there is a close relationship between the usual data processing
we do and the learning process, so it may be worth thinking about how to put
them together in a coherent framework. For instance, in PLearn there was
(something like) a NormalizeVMatrix (think of it as a dataset subclass), but
it could not be used in a natural way to learn the normalization parameters
on a training set (e.g. mean and std of features) and then normalize another
dataset. Instead you could use (something like) a
PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having both
ways to do (almost) the same thing can be confusing.
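
In the unified view, a single class could both learn the normalization
parameters and apply them, along these lines (again a purely hypothetical
sketch, reusing the made-up interface from the sketch above)::

    class NormalizeLearner(Learner):
        """Normalization as a learner: learn mean and std on one dataset,
        then use them to normalize any other dataset."""
        def train(self, input_dataset=None):
            data = input_dataset.compute_output()
            self.mean = data.mean(axis=0)
            self.std = data.std(axis=0)
        def compute_output(self, input_dataset=None):
            return (input_dataset.compute_output() - self.mean) / self.std

    # Hypothetical usage: learn on a training set, normalize a test set.
    train_set = MatrixDataset(numpy.random.randn(100, 5))
    test_set = MatrixDataset(numpy.random.randn(10, 5))
    normalize = NormalizeLearner()
    normalize.train(train_set)
    normalized_test = normalize.compute_output(test_set)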