comparison doc/v2_planning/dataset.txt @ 1084:7e6e77d50eeb

dataset: I say the learner committee should take care of dataset as well

author:   Olivier Delalleau <delallea@iro>
date:     Fri, 10 Sep 2010 17:06:38 -0400
parents:  4c00af69c164
children: de456561ec40
we may need to revise our idea of what 'mini' means). Hopefully the answer to
that last question is no, as I think it would definitely keep things simpler,
since we could simply use numpy arrays (for numeric data) or lists (for
anything else) to store mini-batches' data. So I vote for 'no'.

A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

OD: This is more of a high-level comment that may or may not be relevant
depending on how we end up plugging our different classes together.
In PLearn (the old C++ LISA ML library) we had *lots* of dataset subclasses
doing all sorts of fancy things, the majority of these classes taking another
dataset as input and transforming it in some way (e.g. taking a subset of
samples, a subset of features, normalizing features, computing extra fields
from existing fields, etc.). I think our interface is currently heading in a
similar direction.
When you think about it, this kind of operation is equivalent to writing a
learner class that is trained on the input dataset, and whose output on this
same dataset is used to obtain an output dataset (note that the training
phase may do nothing, e.g. if the goal is only to filter out a predefined
set of samples).
If you push it even further, even a dataset that has no input dataset, say
a dataset view of a 2D numpy matrix, can be seen as the output of a learner
that was trained on nothing and whose output is computed on nothing (but
still outputs this 2D matrix).
In the small ML library I have been using at Ubisoft, the dataset class
actually inherits from learner, based on this point of view. In fact, pretty
much all objects that are plugged together to make an experiment are
learners. The main advantage is that everything has the same interface, so
the "plugging" of the different parts can remain very simple. Confusion is
avoided by the module hierarchy, which ensures that objects with different
behavior have different names. Something like dataset.MatrixDataset would
create a dataset from scratch (i.e. from a numpy matrix),
process.FilterSamples would be something that does not need to be trained
but needs an input dataset, and learner.NNet would be a usual learning
algorithm that must be trained on an input dataset and computes an output
(possibly on the same dataset, possibly on another one).
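
To make the idea concrete, here is a minimal Python sketch of such a unified
interface. All names here (`Learner`, `train`, `compute_outputs`,
`MatrixDataset`, `FilterSamples`) are hypothetical illustrations for this
discussion, not an existing pylearn or PLearn API:

```python
import numpy as np

class Learner:
    """Common interface: everything, datasets included, is a learner."""
    def train(self, input_dataset=None):
        pass  # default: no training needed
    def compute_outputs(self, input_dataset=None):
        raise NotImplementedError

class MatrixDataset(Learner):
    """A dataset built from scratch: trained on nothing, computed on
    nothing, but still outputs its 2D matrix."""
    def __init__(self, data):
        self.data = np.asarray(data)
    def compute_outputs(self, input_dataset=None):
        return self.data

class FilterSamples(Learner):
    """Needs no training, but transforms an input dataset."""
    def __init__(self, keep_indices):
        self.keep_indices = keep_indices
    def compute_outputs(self, input_dataset=None):
        # Keep only a predefined subset of the input dataset's samples.
        return input_dataset.compute_outputs()[self.keep_indices]

raw = MatrixDataset([[1., 2.], [3., 4.], [5., 6.]])
subset = FilterSamples(keep_indices=[0, 2])
print(subset.compute_outputs(raw))  # rows 0 and 2 of the original matrix
```

A learner.NNet would then be one more subclass whose train() actually does
something, and the "plugging" stays the same for all three kinds of objects.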

Ok, this is getting too long; I am definitely not saying we should do this,
but I think there is a close relationship between the usual data processing
we do and the learning process, so it may be worth thinking about how to put
them together in a coherent framework. For instance, in PLearn there was
(something like) a NormalizeVMatrix (think of it as a dataset subclass), but
it could not be used in a natural way to learn the normalization parameters
on a training set (e.g. the mean and std of features) and then normalize
another dataset. Instead you had to use (something like)
PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having two
ways to do (almost) the same thing can be confusing.
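
As a sketch of how the NormalizeVMatrix awkwardness goes away when
normalization is simply a learner: fit it on one dataset, then apply it to
another. The names (`NormalizeLearner`, `train`, `compute_outputs`) are
hypothetical, and plain numpy arrays stand in for datasets:

```python
import numpy as np

class NormalizeLearner:
    """Hypothetical learner: estimates per-feature mean/std on a training
    set, then normalizes any dataset with those parameters."""
    def train(self, data):
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
    def compute_outputs(self, data):
        return (data - self.mean) / self.std

train_set = np.array([[0., 10.], [2., 30.]])
test_set = np.array([[1., 20.]])

norm = NormalizeLearner()
norm.train(train_set)  # parameters come from the training set only
print(norm.compute_outputs(test_set))  # test set scaled with train stats
# -> [[0. 0.]]
```

The same object covers both use cases that were distinct in PLearn: applied
to its own training set it behaves like NormalizeVMatrix, and applied to a
different dataset it behaves like the PLearnerOutputVMatrix construction.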