518
James Bergstra <bergstrj@iro.umontreal.ca>
"""The dataset-from-descriptor mechanism."""

_factory = {}

def add_dataset_factory(tok0, fn):
    """Add `fn` as the handler for descriptors whose first token is `tok0`.

    :returns: None

    """
    if tok0 in _factory:
        raise Exception('Identifier already in use:', tok0)
    else:
        _factory[tok0] = fn

def dataset_factory(tok0):
    """Register a function as the handler for a given kind of dataset, identified by `tok0`.

    When someone calls dataset('kind_of_dataset option1 option2, etc.', approx=1),
    then the handler registered for 'kind_of_dataset' will be called with the same arguments as
    dataset.

    .. code-block:: python

        @dataset_factory('MNIST')
        def mnist_related_dataset(descr, **kwargs):
            ...

    :returns: `decorator`
    """
    def decorator(fn):
        add_dataset_factory(tok0, fn)
        return fn
    return decorator

def dataset(descr, **kwargs):
    """Return the dataset described by `descr`.

    :param descr: a dataset identifier
    :type descr: str
    :returns: `Dataset`

    """
    tok0 = descr.split()[0]
    fn = _factory[tok0]
    return fn(descr, **kwargs)


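# A minimal sketch of how the registry above might be used. The registry functions are
# restated in compact form only so the example runs on its own. The 'TOY' identifier, the
# `toy_dataset` handler and its option syntax are hypothetical, invented for illustration.

```python
# Compact restatement of the registry so this sketch is self-contained.
_factory = {}

def add_dataset_factory(tok0, fn):
    if tok0 in _factory:
        raise Exception('Identifier already in use:', tok0)
    _factory[tok0] = fn

def dataset_factory(tok0):
    def decorator(fn):
        add_dataset_factory(tok0, fn)
        return fn
    return decorator

def dataset(descr, **kwargs):
    # Dispatch on the first whitespace-separated token of the descriptor.
    tok0 = descr.split()[0]
    return _factory[tok0](descr, **kwargs)

# Hypothetical handler: it receives the full descriptor string plus any keyword arguments.
@dataset_factory('TOY')
def toy_dataset(descr, **kwargs):
    options = descr.split()[1:]  # everything after the identifier token
    return {'kind': 'TOY', 'options': options, 'kwargs': kwargs}

result = dataset('TOY small shuffled', seed=0)
```

# Note that the handler is passed the whole descriptor, including its first token, so it
# is responsible for parsing any options that follow the identifier.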
class Dataset(object):
    """Dataset is a generic container for pylearn datasets.

    It is not intended to put any restriction whatsoever on its contents.

    It is intended to encourage certain conventions, described below.  Conventions should arise
    naturally among datasets in PyLearn.  When a few datasets adhere to a new convention, then
    describe it here and make it more official.

    If no particular convention applies, create your own object to store the dataset, and
    assign it to the `data` attribute.
    """
    data = None

    """
    SIMPLE REGRESSION / CLASSIFICATION
    ----------------------------------

    In this setting, you are aiming to do vector classification or vector regression
    where your train, valid and test sets fit in memory.
    The convention is to put your data into numpy ndarray instances.  Put training data in the
    `train` attribute, validation data in the `valid` attribute and test data in the `test`
    attribute.
    Each of those attributes should be an instance that defines at least two attributes: `x` for the
    input matrix and `y` for the target matrix.  The `x` ndarray should be one example per
    leading index (row for matrices).
    The `y` ndarray should be one target per leading index (entry for vectors, row for matrices).
    If `y` is a classification target, then it should be a vector with numpy dtype 'int32'.

    If there are weights associated with different examples, then create a 'weights' attribute whose
    value is a vector with one floating-point value (typically double-precision) per example.

    If the task is classification, then the classes should be mapped to the integers
    0,1,...,N-1.
    The number of classes (here, N) should be stored in the `n_classes` attribute.

    """
    train = None  # instance with .x, .y

    valid = None  # instance with .x, .y

    test = None  # instance with .x, .y

    n_classes = None  # int

    """
    WHEN INPUTS ARE FIXED-SIZE GREYSCALE IMAGES
    -------------------------------------------

    In this setting we typically encode images as vectors, by enumerating the pixel values in
    left-to-right, top-to-bottom order.  Pixel values should be in floating-point, and
    normalized between 0 and 1.

    The shape of the images should be recorded in the `img_shape` attribute as a tuple (rows,
    cols).

    """

    img_shape = None  # (rows, cols)


    """
    TIMESERIES
    ----------

    When dealing with examples which are themselves timeseries, put each example timeseries in a
    tensor and make a list of them.  Generally use tensors, and resort to lists or arrays
    wherever different
    """

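# A minimal sketch of a dataset that follows the classification and greyscale-image
# conventions described in the class above. The `Split` holder, the
# `make_toy_image_dataset` helper and the random pixel data are hypothetical, invented for
# illustration. The `Dataset` class here is a stand-in so the example runs on its own.

```python
import numpy as np

class Split(object):
    """Hypothetical holder for one split: `x` inputs and `y` targets."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

class Dataset(object):
    """Stand-in for the generic container; attributes follow the conventions."""
    data = None
    train = None
    valid = None
    test = None
    n_classes = None
    img_shape = None

def make_toy_image_dataset(n_train=6, n_valid=2, n_test=2,
                           rows=4, cols=3, n_classes=2, seed=0):
    rng = np.random.RandomState(seed)

    def split(n):
        # Random 8-bit greyscale images, one image per leading index.
        images = rng.randint(0, 256, size=(n, rows, cols))
        # Row-major reshape enumerates pixels left-to-right, top-to-bottom;
        # dividing by 255 normalizes the values to the range [0, 1].
        x = images.reshape(n, rows * cols).astype('float64') / 255.0
        # Classification targets: an int32 vector with values in 0..n_classes-1.
        y = rng.randint(0, n_classes, size=n).astype('int32')
        return Split(x, y)

    d = Dataset()
    d.train, d.valid, d.test = split(n_train), split(n_valid), split(n_test)
    d.n_classes = n_classes
    d.img_shape = (rows, cols)
    return d

toy = make_toy_image_dataset()
# Recover the 2-D image from its vector encoding using the recorded shape.
first_image = toy.train.x[0].reshape(toy.img_shape)
```

# Recording `img_shape` is what lets a consumer undo the flattening, since the vector
# length alone does not determine the number of rows and columns.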