comparison doc/v2_planning/dataset.txt @ 1337:7dfc3d3052ea

Added proposal for dataset API as discussed on pylearn-dev
author Olivier Delalleau <delallea@iro>
date Thu, 21 Oct 2010 13:00:49 -0400
parents 9ff2242a817b
children 91637815b7ca

I have a prototype interface/implementation in the shared_dataset.py
file in this directory.

OD: I like AB's approach.


Data API proposal by Olivier D
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A single sample containing multiple fields (e.g. an input and a target part)
is an object s that you can manipulate as follows:

.. code-block:: python

    # Obtain actual data stored within `s` (e.g. a numpy vector). There is no
    # guarantee that modifying the resulting data object will actually update
    # the data stored in `s`.
    data = s()
    # Create a sample that sees a field of `s`.
    input_part = s.input
    # Obtain actual input data (e.g. as a numpy vector).
    input_data = input_part()
    # Create a sample that sees the i-th element of the data stored in `s`.
    ith = s[i]
    # This should not fail.
    assert ith() == s()[i]
    # You could also select a range.
    i_to_j = s[i:j]
    assert i_to_j() == s()[i:j]
    # And actually do pretty much anything you want with __getitem__, as long
    # as the underlying data stored in the sample supports it (for instance,
    # here it should be at least a 3D tensor).
    fancy_selection = s[i, :, j:k]
    assert fancy_selection() == s()[i, :, j:k]
    # Write some value (e.g. a numpy vector) into the sample. May raise an
    # exception if the sample is in read-only mode.
    s._write(val)
    # Shortcut to write data into a field (same as `s.input._write(val)`).
    s.input = val
    # Basic mathematical operators.
    s *= val
    s += val
    s -= val
    s /= val
    # Replace a field. Note that this is different from `s.input = val`
    # because here `new_input` is a sample, not a numeric value: the current
    # `s.input` will not be written to; instead, it makes `s.input` point
    # towards a different sample. This may lead to confusion, so a different
    # syntax may be better (e.g. s._set_field('input', new_input)).
    s.input = new_input
    # The equality of two samples is defined by the equality of their
    # underlying data.
    def __eq__(self, other):
        return self() == other()
    # Iterate on fields (open question: should they be ordered?).
    fields = dict([(name, sample) for name, sample in s._iter_fields()])
    assert fields['input'] == s.input
    # Iterating on a sample yields samples that see consecutive elements.
    for sample, value in izip(s, s()):
        assert sample() == value
    # The length of a sample is the same as that of its underlying data.
    assert len(s) == len(s())
    # The shape of a sample is the same as that of its underlying data.
    # Note that it only makes sense for tensor-like data.
    assert s._shape() == s().shape
    # The size of a sample is the product of its shape elements.
    assert s._size() == reduce(operator.__mul__, s._shape())

All sample methods should start with '_', to differentiate them from the
sample's fields. This is a bit awkward, but I like the `sample.field` syntax
compared to something like "sample.get_field('field')", which makes code less
readable, especially when combined with sub-fields, e.g. `sample.input.x1`
vs. `sample.get_field('input').get_field('x1')`.

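A minimal sketch of one possible implementation of this convention (the
`Sample` class below and its storage layout are assumptions made for
illustration, not part of the proposal): fields live in a dictionary and are
resolved through `__getattr__`, so only underscore-prefixed names are
reserved for methods.

.. code-block:: python

    import numpy

    class Sample(object):
        """Hypothetical sketch: fields are exposed as attributes, while every
        real method starts with '_' so it cannot clash with a field name."""

        def __init__(self, data, fields=None):
            self.__dict__['_data'] = data            # underlying storage (e.g. numpy array)
            self.__dict__['_fields'] = fields or {}  # field name -> sub-sample

        def __call__(self):
            # Return the actual data seen by this sample.
            return self._data

        def __getattr__(self, name):
            # `s.input` returns the sub-sample registered under 'input'.
            try:
                return self.__dict__['_fields'][name]
            except KeyError:
                raise AttributeError(name)

        def __getitem__(self, item):
            # `s[i]` (or fancier indexing) sees the corresponding piece of data.
            return Sample(self._data[item])

        def __len__(self):
            return len(self._data)

        def __eq__(self, other):
            # Equality is defined by the underlying data.
            return numpy.all(self() == other())

        def _iter_fields(self):
            return self.__dict__['_fields'].items()

        def _shape(self):
            return self._data.shape

Writing into a field (`s.input = val`) would additionally require a
`__setattr__` that dispatches to the field's `_write`; it is left out to keep
the sketch short.
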
The extension from sample to dataset is actually to use the same class, but
with the convention that the first "dimension" in the data seen by the dataset
corresponds to the samples' indices in the dataset.

.. code-block:: python

    # Return data stored in dataset `d` (e.g. a numpy matrix).
    data = d()
    # Return the i-th sample in the dataset.
    s = d[i]
    # Data should match!
    assert data[i] == s()
    # Return a subset of the dataset.
    sub_data = d[i:j]
    # Advanced indexing.
    sub_data = d[some_list_of_indices]
    # Dataset that sees the input part only.
    input_part = d.input
    # Dataset such that its i-th element is data[i][something] (see the sample
    # examples for what `something` may be).
    some_sub_data = d[:, something]
    # The following should not fail.
    assert d[i, something] == d[i][something]  # == some_sub_data[i]
    # You can also write into a dataset.
    d._write(val)
    d.input = val
    # Center dataset in-place (requires `d` not to be read-only).
    d -= numpy.mean(d())
    # The length of a dataset is its number of samples.
    n_samples = len(d)
    # The width of a dataset (if it exists) is the length of its samples.
    assert d._shape()[1] == len(d[0])  # == d._width() (shortcut)
    # Iterating on a dataset yields individual samples.
    for i, sample in enumerate(d):
        assert d[i] == sample
    # It is allowed for a dataset to hold heterogeneous data. For instance
    # you could have
    len(d.data1) != len(d.data2)
    # A sample in the dataset is not required to inherit all the dataset's
    # fields, for instance in the case above you could decide that the dataset
    # sees the same data as its first sub-dataset, i.e.
    d[i] == d.data1[i]

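To make the sample/dataset unification concrete, here is how the hypothetical
`Sample` class sketched above could directly serve as a dataset over a numpy
matrix (the split of columns into an input and a target field is an arbitrary
assumption for the example):

.. code-block:: python

    import numpy

    # A 5-sample dataset whose rows are samples; the first 3 columns are
    # treated as the input field and the last one as the target field.
    matrix = numpy.random.randn(5, 4)
    d = Sample(matrix, fields={'input': Sample(matrix[:, :3]),
                               'target': Sample(matrix[:, 3:])})

    assert len(d) == 5                 # number of samples
    assert d[2] == Sample(matrix[2])   # the i-th sample sees the i-th row
    assert d.input._shape() == (5, 3)  # fields share the sample axis
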
There remain some fuzzy points. For instance, are fields allowed to overlap?
(e.g. so that one could write both s.pos_3d to get the 3D coordinate vector of
sample s, and s.x to get the x coordinate without being forced to go through
s.pos_3d.x). What are the fields of s[i:j] if the (i, j) range does not
exactly match a subset of fields? How do we handle metadata? (e.g. if we want
to describe the dataset as containing 28x28 image data, so that an algorithm
for filter visualization can automatically deal with it.)

Now, on to some use cases.

.. code-block:: python

    # Mini-batches.
    mb_dataset = d._minibatches(batch_size=5)
    # The mini-batch dataset views samples that are mini-batches.
    assert mb_dataset[0]() == d[0:5]()  # As long as len(d) >= 5.

    # Shuffling samples.
    random_indices = range(len(d))
    numpy.random.shuffle(random_indices)
    shuffled_dataset = d[random_indices]

    # Typical linear regression with stochastic gradient descent.
    n_inputs = d.input._width()
    n_targets = d.target._width()
    weights = numpy.zeros((n_inputs, n_targets))
    bias = numpy.zeros(n_targets)
    mb_dataset = d._minibatches(batch_size=10)
    # Note: it is important to get the number of inputs / targets
    # before converting to minibatches, because
    #     mb_dataset.input._width() == 10
    # since this is the length of a minibatch matrix. However you
    # could still do the following, which is less readable:
    #     n_inputs = mb_dataset.input._shape()[2]
    # You could also wait until you see the first sample to create
    # the parameters (this would actually be a better way to do it, since
    # it avoids calling the _width method).
    for input, target in izip(mb_dataset.input, mb_dataset.target):
        cost = (numpy.dot(input(), weights) + bias - target())**2
        # Update weights and bias depending on cost....

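How `_minibatches` would build its view is not spelled out above; as a rough
sketch only (a plain generator rather than a full dataset object, reusing the
hypothetical `Sample`-backed dataset from the previous sketch), it could
start out as:

.. code-block:: python

    def iter_minibatches(d, batch_size):
        # Hypothetical helper: yield samples that each see `batch_size`
        # consecutive rows of `d` (the last batch may be shorter).
        for start in range(0, len(d), batch_size):
            yield d[start:start + batch_size]

    # With the Sample-backed dataset sketched earlier:
    for batch in iter_minibatches(d, batch_size=2):
        assert batch._shape()[0] <= 2

A real `_minibatches` would instead return another dataset object, so that
`mb_dataset.input` and friends keep working as in the regression example
above.
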
A few more points:
- Infinite datasets could be used (we would just need to define a convention
  on what __len__ should do).
- It is also ok to have datasets that do not support random access, so that
  the only way to access samples is through iteration (a rough sketch of such
  a stream-only dataset follows below).
- Ideally, data should be deterministic (i.e. __call__() should always
  return the same thing). It would probably be up to the user to be super
  careful if they decide to use a non-deterministic dataset.

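As a sketch of the "iteration only" case (the class name and construction are
assumptions for illustration, and `Sample` is the hypothetical class from
above), a stream-backed dataset could expose `__iter__` and nothing else:

.. code-block:: python

    import itertools
    import numpy

    class StreamDataset(object):
        """Hypothetical sketch of a dataset without random access: samples
        can only be obtained by iterating, and the stream may be infinite."""

        def __init__(self, make_iterator):
            # `make_iterator` is a callable returning a fresh iterator over
            # samples, so the dataset can be iterated over several times.
            self._make_iterator = make_iterator

        def __iter__(self):
            return self._make_iterator()

    # Example: an infinite stream of constant samples; take only the first 5.
    stream = StreamDataset(
        lambda: (Sample(numpy.ones(3)) for _ in itertools.count()))
    first_five = list(itertools.islice(stream, 5))
    assert len(first_five) == 5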