comparison doc/v2_planning/dataset.txt @ 1085:de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
author:   Olivier Delalleau <delallea@iro>
date:     Fri, 10 Sep 2010 20:24:51 -0400
parents:  7e6e77d50eeb
children: 65ac0f493830
comparing 1084:7e6e77d50eeb with 1085:de456561ec40

(unchanged context)

anything else) to store mini-batches' data. So I vote for 'no'.

A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

(removed in 1085:de456561ec40)

OD: This is more of a high-level comment that may or may not be relevant,
depending on how we end up plugging our different classes together.
In PLearn (the old C++ LISA ML library) we had *lots* of dataset subclasses
doing all sorts of fancy things, the majority of these classes taking another
dataset as input and transforming it in some way (e.g. taking a subset of
samples, a subset of features, normalizing features, computing extra fields
from existing fields, etc.). I think our interface is currently heading in a
similar direction.
When you think about it, this kind of operation is equivalent to writing a
learner class that is trained on the input dataset, and whose output on this
same dataset is used to obtain an output dataset (note that the training phase
may do nothing, e.g. if the goal is only to filter out a predefined set of
samples).
If you push it even further, even a dataset that has no input dataset, say
e.g. a dataset view of a 2D numpy matrix, can be seen as the output of a
learner that was trained on nothing and whose output is computed on nothing
(but still outputs this 2D matrix).
In the small ML library I have been using at Ubisoft, the dataset class
actually inherits from learner, based on this point of view. In fact, pretty
much all objects that are plugged together to make an experiment are learners.
The main advantage is that everything has the same interface, so the
"plugging" of the different parts can remain very simple. Confusion is avoided
by the module hierarchy, which ensures that objects with different behavior
have different names. Something like dataset.MatrixDataset would create a
dataset from scratch (i.e. from a numpy matrix), process.FilterSamples would
be something that does not need to be trained but needs an input dataset, and
learner.NNet would be a usual learning algorithm that must be trained on an
input dataset and computes an output (possibly on the same dataset, possibly
on another one).

OK, this is getting too long. I am definitely not saying we should do this,
but I think there is a close relationship between the usual data processing we
do and the learning process, so it may be worth thinking about how to put them
together in a coherent framework. For instance, in PLearn there was (something
like) a NormalizeVMatrix (think of it as a dataset subclass), but it could not
be used in a natural way to learn the normalization parameters on a training
set (e.g. mean and std of features) and then normalize another dataset.
Instead you could use (something like) a
PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having two
ways to do (almost) the same thing can be confusing.
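[Editor's note: the removed text above sketches an "everything is a learner"
layout. Below is a minimal illustration of what that could look like; the
class names follow the examples in the text, but the train / use methods and
all other details are assumptions, not actual pylearn or PLearn API::

    import numpy

    class Learner(object):
        """Common interface: every pluggable component is a learner."""

        def train(self, input_dataset=None):
            # Default training phase does nothing (e.g. for plain
            # transformations or datasets built from scratch).
            pass

        def use(self, input_dataset=None):
            raise NotImplementedError

    class MatrixDataset(Learner):
        """A dataset from scratch: trained on nothing, computed on nothing,
        but it still outputs the wrapped 2D numpy matrix."""

        def __init__(self, matrix):
            self.matrix = matrix

        def use(self, input_dataset=None):
            return self.matrix

    class FilterSamples(Learner):
        """A transformation that needs no training, only an input dataset:
        it keeps the rows listed in `row_indices`."""

        def __init__(self, row_indices):
            self.row_indices = row_indices

        def use(self, input_dataset=None):
            return input_dataset.use()[self.row_indices]

    data = MatrixDataset(numpy.arange(20).reshape(10, 2))
    subset = FilterSamples([0, 2, 4]).use(data)   # 3 x 2 matrix

Everything, including the raw dataset, answers to the same two calls, which is
the single-interface advantage the text describes.]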

(added in 1085:de456561ec40)

OD: (This is hopefully a clearer re-write of the original version from
r7e6e77d50eeb, which I was not happy with.)
There are typically three kinds of objects that spit out data:

1. Datasets that are loaded from disk or are able to generate data all by
   themselves (i.e. without any other dataset as input).
2. Datasets that transform their input dataset in some way (e.g. filtering
   samples or features, normalizing data, etc.).
3. Datasets that are the output of a transformation whose parameters are
   learned on a potentially different dataset (e.g. PCA, when you want to
   learn the projection space on the training set in order to transform both
   the training and test sets).

My current impression is that we would use dataset subclasses to handle 1 and
2. However, 3 requires a learner framework, so you would need something like a
LearnerOutputDataset(trained_learner, dataset).

Note however that 2 is a special case of 3 (where training does nothing), and
1 is a special case of 2 (where we do not care about being given an input
dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
by LearnerOutputDataset.

The main advantages I find in this approach (which I have been using at
Ubisoft) are:

- You only need to learn how to subclass the learner class. The only dataset
  class is LearnerOutputDataset, which you could simply name Dataset.
- You do not have different ways to achieve the same result, so there is no
  need to figure out which one is most appropriate.
- Upgrading code from 2 to 3 is more straightforward. Such a situation can
  happen e.g. if you write some code that normalizes your input dataset
  (situation 2), then realize later that you would like to normalize new
  datasets using the same parameters (e.g. the same shift & rescaling), which
  requires situation 3 (see the sketch after this section).
- It can make your life easier when thinking about how to plug things together
  (something that has not been discussed yet), because the interfaces of the
  various components are less varied.

I am not saying that we should necessarily do it this way, but I think it is
worth at least keeping in mind this close relationship between simple
processing and learning, and thinking about the benefits / drawbacks of
keeping them separate in the class hierarchy.
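[Editor's note: to make the added text's proposal concrete, here is a minimal
sketch of the normalization example upgraded from situation 2 to situation 3.
Only the LearnerOutputDataset(trained_learner, dataset) signature comes from
the text; NormalizeLearner and its train / compute_output methods are
assumptions for illustration::

    import numpy

    class NormalizeLearner(object):
        """Learns a per-feature shift & rescaling on a training dataset."""

        def train(self, dataset):
            data = numpy.asarray(dataset)
            self.mean = data.mean(axis=0)
            self.std = data.std(axis=0)

        def compute_output(self, dataset):
            return (numpy.asarray(dataset) - self.mean) / self.std

    class LearnerOutputDataset(object):
        """Dataset view of a trained learner's output on an input dataset."""

        def __init__(self, trained_learner, dataset):
            self.learner = trained_learner
            self.dataset = dataset

        def data(self):
            return self.learner.compute_output(self.dataset)

    # Learn the parameters once on the training set, then re-use them to
    # normalize both the training and test sets (situation 3).
    train_data = numpy.random.RandomState(0).randn(100, 5) * 3.0 + 1.0
    test_data = numpy.random.RandomState(1).randn(20, 5) * 3.0 + 1.0
    normalizer = NormalizeLearner()
    normalizer.train(train_data)
    normalized_train = LearnerOutputDataset(normalizer, train_data)
    normalized_test = LearnerOutputDataset(normalizer, test_data)

Because the shift and rescaling are stored in the trained learner, the same
learner can be wrapped around any number of datasets; a plain normalizing
dataset subclass (situation 2) cannot take that step, which is exactly the
upgrade problem mentioned in the third bullet of the added text.]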