comparison doc/v2_planning/dataset.txt @ 1085:de456561ec40

dataset: Rewrote my rambling about the links between dataset and learner
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 20:24:51 -0400
parents 7e6e77d50eeb
children 65ac0f493830
  anything else) to store mini-batches' data. So I vote for 'no'.

  A dataset is a learner
  ~~~~~~~~~~~~~~~~~~~~~~

- OD: This is more a high-level comment that may or may not be relevant
- depending on how we get to plug our different classes together.
- In PLearn (the old C++ lisa ML library) we had *lots* of dataset subclasses
- doing all sorts of fancy things, the majority of these classes taking another
- dataset as input and transforming it in some way (e.g. taking a subset of
- samples, a subset of features, normalizing features, computing extra fields
- from existing fields, etc.). I think right now our interface is heading in a
- similar direction.
- When you think about it, this kind of operation is equivalent to writing a
- learner class that is trained on the input dataset, and whose output on this
- same dataset is used to obtain an output dataset (note that the training
- phase may do nothing, e.g. if the goal is only to filter out a predefined
- set of samples).
- If you push it even further, even a dataset that has no input dataset, e.g.
- a dataset view of a 2D numpy matrix, can be seen as the output of a learner
- that was trained on nothing and whose output is computed on nothing (but
- still outputs this 2D matrix).
- In the small ML library I have been using at Ubisoft, the dataset class
- actually inherits from the learner class, based on this point of view.
- Actually, pretty much all objects that are plugged together to make an
- experiment are learners. The main advantage is that everything has the same
- interface and the "plugging" of the different parts can remain very simple.
- Confusion is avoided by the module hierarchy, which ensures that objects
- with different behavior have different names. Something like
- dataset.MatrixDataset would create a dataset from scratch (i.e. from a numpy
- matrix), process.FilterSamples would be something that does not need to be
- trained but needs an input dataset, and learner.NNet would be a usual
- learning algorithm that must be trained on an input dataset and computes an
- output (possibly on the same dataset, possibly on another one).
-
- Ok, this is getting too long. I am definitely not saying we should do this,
- but I think there is a close relationship between the usual data processing
- we do and the learning process, so it may be worth thinking about how to put
- them together in a coherent framework. For instance, in PLearn there was
- (something like) a NormalizeVMatrix (think of it as a dataset subclass), but
- it could not be used in a natural way to learn the normalization parameters
- on a training set (e.g. mean and std of features) and normalize another
- dataset. Instead you could use (something like) a
- PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having both
- ways to do (almost) the same thing can be confusing.
+ OD: (this is hopefully a clearer re-write of the original version from
+ r7e6e77d50eeb, which I was not happy with.)
+ There are typically three kinds of objects that spit out data:
+ 1. Datasets that are loaded from disk or are able to generate data all by
+    themselves (i.e. without any other dataset as input).
+ 2. Datasets that transform their input dataset in some way (e.g. filtering
+    samples or features, normalizing data, etc.).
+ 3. Datasets that are the output of a transformation whose parameters are
+    learned on a potentially different dataset (e.g. PCA, when you want to
+    learn the projection space on the training set in order to transform both
+    the training and test sets).
+ My impression currently is that we would use dataset subclasses to handle 1
+ and 2. However, 3 requires a learner framework, so you would need something
+ like a LearnerOutputDataset(trained_learner, dataset).
+
+ Note however that 2 is a special case of 3 (where training does nothing),
+ and 1 is a special case of 2 (where we do not care about being given an
+ input dataset). Thus you could decide to also implement 1 and 2 as learners
+ wrapped by LearnerOutputDataset.
+
+ The main advantages I find in this approach (which I have been using at
+ Ubisoft) are:
+ - You only need to learn how to subclass the learner class. The only dataset
+   class is LearnerOutputDataset, which you could just name Dataset.
+ - You do not have different ways to achieve the same result (and thus never
+   have to figure out which one is most appropriate).
+ - Upgrading code from 2 to 3 is more straightforward. Such a situation can
+   happen e.g. if you write some code that normalizes your input dataset
+   (situation 2), then realize later you would like to be able to normalize
+   new datasets using the same parameters (e.g. the same shift & rescaling),
+   which requires situation 3.
+ - It can make your life easier when thinking about how to plug things
+   together (something that has not been discussed yet), because the
+   interfaces of the various components are less varied.
+
+ I am not saying that we should necessarily do it this way, but I think it is
+ worth at least keeping in mind this close relationship between simple
+ processing and learning, and thinking about the benefits / drawbacks of
+ keeping them separate in the class hierarchy.
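To make the LearnerOutputDataset proposal in the new version above more concrete, here is a minimal numpy sketch of how situations 2/3 could fit together. Only the name LearnerOutputDataset comes from the text; the Learner base class and its train / compute_output methods, and the NormalizeLearner example, are hypothetical illustrations, not an agreed interface.

```python
import numpy as np


class Learner(object):
    """Hypothetical base class: train on a dataset, then transform datasets."""

    def train(self, dataset):
        # Default: training does nothing (situation 2 is the special
        # case of situation 3 where there are no parameters to learn).
        return self

    def compute_output(self, dataset):
        raise NotImplementedError


class NormalizeLearner(Learner):
    """Situation 3: learn a per-feature shift & rescaling on one dataset,
    then apply it to any other dataset."""

    def train(self, dataset):
        data = np.asarray(dataset)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def compute_output(self, dataset):
        return (np.asarray(dataset) - self.mean) / self.std


class LearnerOutputDataset(object):
    """The single dataset class: the output of a trained learner applied
    to an input dataset."""

    def __init__(self, trained_learner, dataset):
        self.data = trained_learner.compute_output(dataset)

    def __iter__(self):
        return iter(self.data)


# Learn normalization parameters on the training set, then reuse them
# to normalize the test set (the 2 -> 3 upgrade scenario from the text).
train_set = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])
test_set = np.array([[2.0, 4.0]])

normalizer = NormalizeLearner().train(train_set)
normalized_test = LearnerOutputDataset(normalizer, test_set)
```

The point of the sketch is that the same two-step pattern (train, then wrap) covers plain transformations too: a filtering step would simply leave train() as the do-nothing default.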