pylearn: doc/v2_planning/dataset.txt @ 1337:7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev

:author: Olivier Delalleau <delallea@iro>
:date: Thu, 21 Oct 2010 13:00:49 -0400
I have a prototype interface/implementation in the shared_dataset.py
file in this directory.

OD: I like AB's approach.


Data API proposal by Olivier D
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A single sample containing multiple fields (e.g. an input and a target part)
is an object `s` that you can manipulate as follows:

.. code-block:: python

    # Obtain actual data stored within `s` (e.g. a numpy vector). There is no
    # guarantee that modifying the resulting data object will actually update
    # the data stored in `s`.
    data = s()
    # Create a sample that sees a field of `s`.
    input_part = s.input
    # Obtain actual input data (e.g. as a numpy vector).
    input_data = input_part()
    # Create a sample that sees the i-th element of the data stored in `s`.
    ith = s[i]
    # This should not fail.
    assert ith() == s()[i]
    # You could also select a range.
    i_to_j = s[i:j]
    assert i_to_j() == s()[i:j]
    # And actually do pretty much anything you want with __getitem__, as long
    # as the underlying data stored in the sample supports it (for instance,
    # here it should be at least a 3D tensor).
    fancy_selection = s[i, :, j:k]
    assert fancy_selection() == s()[i, :, j:k]
    # Write some value (e.g. a numpy vector) into the sample. May raise an
    # exception if the sample is in read-only mode.
    s._write(val)
    # Shortcut to write data into a field (same as `s.input._write(val)`).
    s.input = val
    # Basic mathematical operators.
    s *= val
    s += val
    s -= val
    s /= val
    # Replace a field. Note that this is different from `s.input = val`
    # because here `new_input` is a sample, not a numeric value: the current
    # `s.input` will not be written to; instead, `s.input` is made to point
    # towards a different sample. This may lead to confusion, so a different
    # syntax may be better (e.g. `s._set_field('input', new_input)`).
    s.input = new_input
    # The equality of two samples is defined by the equality of their
    # underlying data.
    def __eq__(self, other):
        return self() == other()
    # Iterate on fields (open question: should they be ordered?).
    fields = dict((name, sample) for name, sample in s._iter_fields())
    assert fields['input'] == s.input
    # Iterating on a sample yields samples that see consecutive elements.
    for sample, value in izip(s, s()):
        assert sample() == value
    # The length of a sample is the same as that of its underlying data.
    assert len(s) == len(s())
    # The shape of a sample is the same as that of its underlying data.
    # Note that this only makes sense for tensor-like data.
    assert s._shape() == s().shape
    # The size of a sample is the product of its shape elements.
    assert s._size() == reduce(operator.mul, s._shape())

All sample methods should start with '_', to differentiate them from the
sample's fields. This is a bit awkward, but I like the `sample.field` syntax
compared to something like `sample.get_field('field')`, which makes code less
readable, especially when combined with sub-fields, e.g. `sample.input.x1`
vs. `sample.get_field('input').get_field('x1')`.

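To make the `_`-prefix convention concrete, here is a hypothetical minimal
sketch of such a sample class. The class name, constructor signature and
`fields` dict are illustrative assumptions, not part of the proposal:

.. code-block:: python

    import numpy

    class Sample(object):
        # Fields are plain attributes (`s.input`); methods carry a leading
        # underscore so they cannot clash with field names.
        def __init__(self, data, fields=None):
            self._data = data
            self._fields = fields if fields is not None else {}

        def __call__(self):
            # Return the underlying data; no guarantee that it is a copy.
            return self._data

        def __getattr__(self, name):
            # Only invoked when normal attribute lookup fails, i.e. for
            # fields, so `_data`, `_fields` and methods resolve as usual.
            try:
                return self.__dict__['_fields'][name]
            except KeyError:
                raise AttributeError(name)

        def _iter_fields(self):
            return iter(self._fields.items())

    # A sample with an `input` field.
    input_part = Sample(numpy.arange(3))
    s = Sample(numpy.arange(5), fields={'input': input_part})
    assert s.input is input_part
    assert (s.input() == numpy.arange(3)).all()

Note that `__getattr__` keeps ordinary attributes (and all `_`-prefixed
methods) working unchanged, since Python only calls it when regular lookup
fails.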
The extension from sample to dataset actually uses the same class, but
with the convention that the first "dimension" in the data seen by the
dataset corresponds to the samples' indices in the dataset.

.. code-block:: python

    # Return data stored in dataset `d` (e.g. a numpy matrix).
    data = d()
    # Return the i-th sample in the dataset.
    s = d[i]
    # Data should match!
    assert data[i] == s()
    # Return a subset of the dataset.
    sub_data = d[i:j]
    # Advanced indexing.
    sub_data = d[some_list_of_indices]
    # Dataset that sees the input part only.
    input_part = d.input
    # Dataset such that its i-th element is data[i][something] (see the sample
    # examples for what `something` may be).
    some_sub_data = d[:, something]
    # The following should not fail.
    assert d[i, something] == d[i][something]  # == some_sub_data[i]
    # You can also write into a dataset.
    d._write(val)
    d.input = val
    # Center dataset in-place (requires `d` not to be read-only).
    d -= numpy.mean(d())
    # The length of a dataset is its number of samples.
    n_samples = len(d)
    # The width of a dataset (if it exists) is the length of its samples.
    assert d._shape()[1] == len(d[0])  # == d._width() (shortcut)
    # Iterating on a dataset yields individual samples.
    for i, sample in enumerate(d):
        assert d[i] == sample
    # It is allowed for a dataset to hold heterogeneous data. For instance
    # you could have
    len(d.data1) != len(d.data2)
    # A sample in the dataset is not required to inherit all the dataset's
    # fields; for instance, in the case above you could decide that the
    # dataset sees the same data as its first sub-dataset, i.e.
    d[i] == d.data1[i]

There remain some fuzzy points. For instance, are fields allowed to overlap?
(e.g. so that one could write both `s.pos_3d` to get the 3d vector coordinate
of sample `s`, and `s.x` to get the x coordinate without being forced to go
through `s.pos_3d.x`). What are the fields of `s[i:j]` if the (i, j) range
does not exactly match a subset of fields? How do we handle metadata? (e.g.
if we want to describe the dataset to say it contains 28x28 image data, so
that an algorithm for filter visualization can automatically deal with it)

Now, on to some use cases.

.. code-block:: python

    # Mini-batches.
    mb_dataset = d._minibatches(batch_size=5)
    # The mini-batch dataset views samples that are mini-batches.
    assert mb_dataset[0]() == d[0:5]()  # As long as len(d) >= 5.

    # Shuffling samples.
    random_indices = range(len(d))
    # Note: `numpy.random.shuffle` shuffles in-place and returns None.
    numpy.random.shuffle(random_indices)
    shuffled_dataset = d[random_indices]

    # Typical linear regression with stochastic gradient descent.
    n_inputs = d.input._width()
    n_targets = d.target._width()
    weights = numpy.zeros((n_inputs, n_targets))
    bias = numpy.zeros(n_targets)
    mb_dataset = d._minibatches(batch_size=10)
    # Note: it is important to get the number of inputs / targets
    # before converting to minibatches, because
    #   mb_dataset.input._width() == 10
    # since this is the length of a minibatch matrix. However you
    # could still do the following, which is less readable:
    #   n_inputs = mb_dataset.input._shape()[2]
    # You could also wait until you see the first sample to create
    # the parameters (this would actually be a better way to do it, since
    # it avoids calling the _width method).
    for input, target in izip(mb_dataset.input, mb_dataset.target):
        cost = (numpy.dot(input(), weights) + bias - target())**2
        # Update weights and bias depending on cost....

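The SGD loop above cannot run as-is, since the proposed API is not
implemented; the following stand-alone sketch plays out the same use case on
plain numpy arrays (all names and hyper-parameters here are made up for
illustration, with array slices standing in for `mb_dataset` samples):

.. code-block:: python

    import numpy

    rng = numpy.random.RandomState(0)
    n_samples, n_inputs, n_targets, batch_size = 100, 3, 2, 10

    # Synthetic regression data standing in for `d.input` / `d.target`.
    true_weights = rng.randn(n_inputs, n_targets)
    inputs = rng.randn(n_samples, n_inputs)
    targets = numpy.dot(inputs, true_weights)

    weights = numpy.zeros((n_inputs, n_targets))
    bias = numpy.zeros(n_targets)
    learning_rate = 0.1

    for epoch in range(50):
        # Each slice plays the role of one minibatch sample.
        for start in range(0, n_samples, batch_size):
            x = inputs[start:start + batch_size]
            t = targets[start:start + batch_size]
            error = numpy.dot(x, weights) + bias - t
            # Gradient of the mean squared error w.r.t. weights and bias.
            weights -= learning_rate * numpy.dot(x.T, error) / len(x)
            bias -= learning_rate * error.mean(axis=0)

    # On this noiseless data the fit should become essentially exact.
    residual = numpy.abs(numpy.dot(inputs, weights) + bias - targets).max()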
A few more points:

- Infinite datasets could be used (we would just need to define a convention
  for what __len__ should do).
- It is also OK to have datasets that do not support random access (so the
  only way to access samples is through iteration).
- Ideally, data should be deterministic (i.e. __call__() should always
  return the same thing). It would be up to the user to be extra careful
  if they decide to use a non-deterministic dataset.
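
As an illustration of the first two points, a dataset that is both infinite
and iteration-only could simply be backed by a deterministic generator; the
helper below is a made-up example, not proposed API:

.. code-block:: python

    import itertools

    def infinite_dataset(seed=0):
        # Endless, sequential-access-only stream of samples: there is no
        # meaningful __len__ or __getitem__. A fixed linear congruential
        # recurrence keeps the stream deterministic, so iterating twice
        # from the same seed yields the same samples.
        state = seed
        while True:
            state = (state * 1103515245 + 12345) % (2 ** 31)
            yield state

    first = list(itertools.islice(infinite_dataset(), 3))
    again = list(itertools.islice(infinite_dataset(), 3))
    assert first == again  # same seed => same stream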