annotate dataset.py @ 36:438440ba0627

Rewriting dataset.py completely
author bengioy@zircon.iro.umontreal.ca
date Tue, 22 Apr 2008 18:03:11 -0400
parents 46c5c90019c2
children 73c4212ba5b3
rev   line source
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
1
12
ff4e551490f1 Added LookupList type in lookup_list.py and used it to keep order
bengioy@esprit.iro.umontreal.ca
parents: 11
diff changeset
2 from lookup_list import LookupList
ff4e551490f1 Added LookupList type in lookup_list.py and used it to keep order
bengioy@esprit.iro.umontreal.ca
parents: 11
diff changeset
3 Example = LookupList
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
4 from misc import *
26
672fe4b23032 Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents: 23
diff changeset
5 import copy
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
6
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
7 class AbstractFunction (Exception): """Derived class must override this function"""
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
8 class NotImplementedYet (NotImplementedError): """Work in progress, this should eventually be implemented"""
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
9
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
10 class DataSet(object):
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
11 """A virtual base class for datasets.
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
12
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
13 A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
14 with learning algorithms (for training and testing them): rows/records are called examples, and
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
15 columns/attributes are called fields. The field value for a particular example can be an arbitrary
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
16 python object, which depends on the particular dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
17
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
18 We call a DataSet a 'stream' when its length is unbounded (len(dataset)==float("infinity")).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
19
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
20 A DataSet is a generator of iterators; these iterators can run through the
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
21 examples or the fields in a variety of ways. A DataSet need not necessarily have a finite
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
22 or known length, so this class can be used to interface to a 'stream' which
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
23 feeds on-line learning (however, as noted below, some operations are not
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
24 feasible or not recommended on streams).
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
25
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
26 To iterate over examples, there are several possibilities:
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
27 * for example in dataset([field1, field2,field3, ...]):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
28 * for val1,val2,val3 in dataset([field1, field2,field3]):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
29 * for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
30 * for example in dataset:
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
31 Each of these is documented below. All of these iterators are expected
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
32 to provide, in addition to the usual 'next()' method, a 'next_index()' method
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
33 which returns a non-negative integer pointing to the position of the next
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
34 example that will be returned by 'next()' (or of the first example in the
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
35 next minibatch returned). This is important because these iterators
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
36 can wrap around the dataset in order to do multiple passes through it,
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
37 in possibly irregular ways if the minibatch size is not a divisor of the
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
38 dataset length.
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
39
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
40 To iterate over fields, one can do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
41 * for fields in dataset.fields()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
42 * for fields in dataset(field1,field2,...).fields() to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
43 * for fields in dataset.fields(field1,field2,...) to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
44 and each of these fields is iterable over the examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
45 * for field_examples in dataset.fields():
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
46 for example_value in field_examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
47 ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
48 but when the dataset is a stream (unbounded length), it is not recommended to do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
49 such things because the underlying dataset may refuse to access the different fields in
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
50 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
51 The result of fields() is a DataSetFields object, which iterates over fields,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
52 and whose elements are iterable over examples. A DataSetFields object can
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
53 be turned back into a DataSet with its examples() method:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
54 dataset2 = dataset1.fields().examples()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
55 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
56
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
57 Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content.
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
58
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
59 Note: The content of a field can be of any type. Field values can also be 'missing'
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
60 (e.g. to handle semi-supervised learning), and in the case of numeric (numpy array)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
61 fields (i.e. an ArrayFieldsDataSet), NaN plays the role of a missing value.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
62
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
63 Dataset elements can be indexed and sub-datasets (with a subset
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
64 of examples) can be extracted. These operations are not supported
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
65 by default in the case of streams.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
66
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
67 * dataset[:n] returns a dataset with the n first examples.
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
68
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
69 * dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
70
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
71 * dataset[i] returns an Example.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
72
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
73 * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
74
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
75 Datasets can be concatenated either vertically (increasing the length) or
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
76 horizontally (augmenting the set of fields), if they are compatible, using
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
77 the following operations (with the same basic semantics as numpy.hstack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
78 and numpy.vstack):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
79
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
80 * dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3])
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
81
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
82 creates a new dataset whose list of fields is the concatenation of the list of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
83 fields of the argument datasets. This only works if they all have the same length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
84
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
85 * dataset1 + dataset2 + dataset3 == dataset.vstack([dataset1,dataset2,dataset3])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
86
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
87 creates a new dataset that concatenates the examples from the argument datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
88 (and whose length is the sum of the lengths of the argument datasets). This only
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
89 works if they all have the same fields.
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
90
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
91 According to the same logic, and viewing a DataSetFields object associated to
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
92 a DataSet as a kind of transpose of it, fields1 + fields2 concatenates fields of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
93 a DataSetFields fields1 and fields2, and fields1 | fields2 concatenates their
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
94 examples.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
95
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
96
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
97 A DataSet sub-class should always redefine the following methods:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
98 * __len__ if it is not a stream
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
99 * __getitem__ may not be feasible with some streams
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
100 * fieldNames
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
101 * minibatches
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
102 * valuesHStack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
103 * valuesVStack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
104 For efficiency of implementation, a sub-class might also want to redefine
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
105 * hasFields
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
106 """
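The indexing and concatenation rules described in the docstring above can be illustrated with a small stand-in. This is only a sketch in modern Python, not the module's implementation: `ToyDataSet` and its `columns` attribute are hypothetical names, examples are returned as plain dicts rather than `Example` objects, and streams are not handled.

```python
class ToyDataSet:
    """Hypothetical in-memory stand-in for a DataSet: a dict of equal-length columns."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all fields must have the same number of examples")
        self.columns = {name: list(values) for name, values in columns.items()}

    def __len__(self):
        return len(next(iter(self.columns.values()), []))

    def __getitem__(self, index):
        if isinstance(index, int):
            # dataset[i] returns a single example (here a plain dict).
            return {name: values[index] for name, values in self.columns.items()}
        if isinstance(index, slice):
            # dataset[i1:i2:s] returns a sub-dataset.
            return ToyDataSet({n: v[index] for n, v in self.columns.items()})
        # dataset[[i1, i2, ...]] returns a sub-dataset with the listed examples.
        return ToyDataSet({n: [v[i] for i in index] for n, v in self.columns.items()})

    def __or__(self, other):
        # Horizontal concatenation: same length, the field lists are joined.
        if len(self) != len(other):
            raise ValueError("hstack requires datasets of the same length")
        merged = dict(self.columns)
        merged.update(other.columns)
        return ToyDataSet(merged)

    def __add__(self, other):
        # Vertical concatenation: same fields, the examples are appended.
        if list(self.columns) != list(other.columns):
            raise ValueError("vstack requires datasets with the same fields")
        return ToyDataSet({n: v + other.columns[n] for n, v in self.columns.items()})


inputs = ToyDataSet({"x": [1, 2, 3, 4]})
targets = ToyDataSet({"y": [10, 20, 30, 40]})
both = inputs | targets   # 4 examples, fields x and y
doubled = both + both     # 8 examples, fields x and y
```

Here `both[1]` is the single example `{"x": 2, "y": 20}`, while `both[0:4:2]` and `both[[3, 0]]` are again datasets, matching the distinction the docstring draws between `dataset[i]` and the slicing forms.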
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
107
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
108 infinity = float("infinity")
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
109
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
110 def __init__(self):
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
111 pass
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
112
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
113 class MinibatchToSingleExampleIterator(object):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
114 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
115 Converts the result of a minibatch iterator with minibatch_size==1 into
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
116 single-example values in the result. Therefore the result of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
117 iterating on the dataset itself gives a sequence of single examples
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
118 (whereas the result of iterating over minibatches gives in each
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
119 Example field an iterable object over the individual examples in
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
120 the minibatch).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
121 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
122 def __init__(self, minibatch_iterator):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
123 self.minibatch_iterator = minibatch_iterator
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
124 def __iter__(self): #makes for loop work
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
125 return self
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
126 def next(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
127 return self.minibatch_iterator.next()[0]
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
128 def next_index(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
129 return self.minibatch_iterator.next_index()
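A rough modern-Python sketch of the same unwrapping idea follows. It assumes, for simplicity, that each minibatch is a plain list of examples rather than the DataSetFields object the source describes; `SizeOneBatches` and `SingleExamples` are hypothetical names, and `__next__` replaces the Python 2 `next()` method used in the listing.

```python
class SizeOneBatches:
    """Hypothetical minibatch iterator: every batch is a list holding one example."""

    def __init__(self, examples):
        self.examples = examples
        self.position = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.position >= len(self.examples):
            raise StopIteration
        batch = [self.examples[self.position]]
        self.position += 1
        return batch

    def next_index(self):
        # Position of the example that the next call to __next__ will return.
        return self.position


class SingleExamples:
    """Unwraps size-one minibatches, in the spirit of MinibatchToSingleExampleIterator."""

    def __init__(self, minibatch_iterator):
        self.minibatch_iterator = minibatch_iterator

    def __iter__(self):
        return self

    def __next__(self):
        # Take the only element of the size-one batch.
        return next(self.minibatch_iterator)[0]

    def next_index(self):
        # Delegate position tracking to the wrapped iterator.
        return self.minibatch_iterator.next_index()


singles = SingleExamples(SizeOneBatches(["a", "b", "c"]))
first = next(singles)             # "a"
upcoming = singles.next_index()   # 1
rest = list(singles)              # ["b", "c"]
```

The wrapper holds no position of its own; both iteration and `next_index()` are forwarded, so the two iterators cannot drift apart.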
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
130
3
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
131 def __iter__(self):
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
132 """Supports the syntax "for i in dataset: ..."
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
133
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
134 Using this syntax, "i" will be an Example instance (or equivalent) with
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
135 all the fields of DataSet self. Every field of "i" will give access to
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
136 a field of a single example. Fields should be accessible via
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
137 i["fieldname"] or i[3] (in the order defined by the elements of the
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
138 Example returned by this iterator), but the derived class is free
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
139 to accept any type of identifier, and add extra functionality to the iterator.
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
140
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
141 The default implementation calls the minibatches iterator and extracts the first example of each field.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
142 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
143 return DataSet.MinibatchToSingleExampleIterator(self.minibatches(None, minibatch_size = 1))
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
144
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
145 minibatches_fieldnames = None
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
146 minibatches_minibatch_size = 1
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
147 minibatches_n_batches = None
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
148 def minibatches(self,
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
149 fieldnames = minibatches_fieldnames,
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
150 minibatch_size = minibatches_minibatch_size,
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
151 n_batches = minibatches_n_batches):
6
d5738b79089a Removed MinibatchIterator and instead made minibatch_size a field of all DataSets,
bengioy@bengiomac.local
parents: 5
diff changeset
152 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
153 Return an iterator that supports three forms of syntax:
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
154
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
155 for i in dataset.minibatches(None,**kwargs): ...
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
156
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
157 for i in dataset.minibatches([f1, f2, f3],**kwargs): ...
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
158
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
159 for i1, i2, i3 in dataset.minibatches([f1, f2, f3],**kwargs): ...
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
160
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
161 Using the first two syntaxes, "i" will be an indexable object, such as a list,
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
162 tuple, or Example instance. In both cases, i[k] is a list-like container
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
163 of a batch of current examples. In the second case, i[0] is
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
164 a list-like container of the f1 field of a batch of current examples, i[1] is
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
165 a list-like container of the f2 field, etc.
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
166
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
167 Using the first syntax, all the fields will be returned in "i".
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
168 Beware that some datasets may not support this syntax, if the number
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
169 of fields is infinite (i.e. field values may be computed "on demand").
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
170
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
171 Using the third syntax, i1, i2, i3 will be list-like containers of the
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
172 f1, f2, and f3 fields of a batch of examples on each loop iteration.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
173
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
174 The minibatches iterator is expected to return upon each call to next()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
175 a DataSetFields object, which is a LookupList (indexed by the field names) whose
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
176 elements are iterable over the minibatch examples, and which keeps a pointer to
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
177 a sub-dataset that can be used to iterate over the individual examples
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
178 in the minibatch. Hence a minibatch can be converted back to a regular
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
179 dataset or its fields can be looked at individually (and possibly iterated over).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
180
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
181 PARAMETERS
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
182 - fieldnames (list of any type, default None):
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
183 The loop variables i1, i2, i3 (in the example above) should contain the
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
184 f1, f2, and f3 fields of the current batch of examples. If None, the
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
185 derived class can choose a default, e.g. all fields.
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
186
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
187 - minibatch_size (integer, default 1)
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
188 On every iteration, the variables i1, i2, i3 will have
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
189 exactly minibatch_size elements. e.g. len(i1) == minibatch_size
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
190
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
191 - n_batches (integer, default None)
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
192 The iterator will loop exactly this many times, and then stop. If None,
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
193 the derived class can choose a default. If (-1), then the returned
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
194 iterator should support looping indefinitely.
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
195
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
196 Note: A list-like container is something like a tuple, list, numpy.ndarray or
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
197 any other object that supports integer indexing and slicing.
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
198
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
199 """
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
200 raise AbstractFunction()
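A minimal sketch of the minibatch_size and n_batches semantics described in the docstring above, written against plain Python lists rather than a DataSet. The name iter_minibatches, the reading of n_batches=None as a single pass over the data, and the wrap-around used for n_batches=-1 are assumptions made for illustration; a derived class is free to choose otherwise.

```python
from itertools import islice

def iter_minibatches(fields, minibatch_size=1, n_batches=None):
    """Yield one list of per-field slices per minibatch (hypothetical helper).

    fields: list of equal-length, sliceable containers, one per field.
    n_batches=None is taken here to mean a single pass; -1 loops forever.
    """
    length = len(fields[0])
    assert all(len(f) == length for f in fields)
    assert 0 < minibatch_size <= length
    if n_batches is None:
        n_batches = length // minibatch_size  # assumed default: one full pass
    start, produced = 0, 0
    while n_batches == -1 or produced < n_batches:
        if start + minibatch_size > length:
            start = 0  # wrap around so every batch has exactly minibatch_size items
        yield [f[start:start + minibatch_size] for f in fields]
        start += minibatch_size
        produced += 1

x = [0, 1, 2, 3, 4]
y = ['a', 'b', 'c', 'd', 'e']
# One pass: two full batches of size 2 (the leftover example is dropped here).
batches = list(iter_minibatches([x, y], minibatch_size=2))
# Endless iterator: the caller decides when to stop.
endless = list(islice(iter_minibatches([x, y], minibatch_size=2, n_batches=-1), 3))
```

With n_batches=-1 the generator never terminates on its own, so the caller has to bound it, as islice does here.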
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
201
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
202
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
203 def __len__(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
204 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
205 len(dataset) returns the number of examples in the dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
206 By default, a DataSet is a 'stream', i.e. it has an unbounded (infinite) length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
207 Sub-classes which implement finite-length datasets should redefine this method.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
208 Some methods only make sense for finite-length datasets, and will perform
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
209 assert len(dataset)<DataSet.infinity
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
210 in order to check the finiteness of the dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
211 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
212 return infinity
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
213
26
672fe4b23032 Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents: 23
diff changeset
214 def hasFields(self,*fieldnames):
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
215 """
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
216 Return true if the given field name (or field names, if multiple arguments are
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
217 given) is recognized by the DataSet (i.e. can be used as a field name in one
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
218 of the iterators).
29
46c5c90019c2 Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents: 28
diff changeset
219
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
220 The default implementation may be inefficient (O(# fields in dataset)), as it calls the fieldNames()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
221 method. Many datasets may store their field names in a dictionary, which would allow more efficiency.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
222 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
223 return len(unique_elements_list_intersection(fieldnames,self.fieldNames()))>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
224
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
225 def fieldNames(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
226 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
227 Return the list of field names that are supported by the iterators,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
228 and for which hasFields(fieldname) would return True.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
229 """
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
230 raise AbstractFunction()
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
231
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
232 def __call__(self,*fieldnames):
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
233 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
234 Return a dataset that sees only the fields whose names are specified.
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
235 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
236 assert self.hasFields(*fieldnames)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
237 return self.fields(*fieldnames).examples()
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
238
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
239 def fields(self,*fieldnames):
29
46c5c90019c2 Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents: 28
diff changeset
240 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
241 Return a DataSetFields object associated with this dataset.
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
242 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
243 return DataSetFields(self,*fieldnames)
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
244
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
245 def __getitem__(self,i):
28
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
246 """
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
247 dataset[i] returns the (i+1)-th example of the dataset.
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
248 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1.
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
249 dataset[i:j:s] returns the subdataset with examples i,i+s,i+2s,... up to but excluding j.
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
250 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in.
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
251
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
252 Note that some stream datasets may be unable to implement slicing/indexing
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
253 because they can only iterate through examples one or a minibatch at a time
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
254 and do not actually store or keep past (or future) examples.
28
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
255 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
256 raise NotImplementedError()
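The four indexing forms listed in the __getitem__ docstring can be illustrated by computing which example indices each one covers. This is only a sketch on integer positions for a finite-length dataset; the helper name selected_examples is hypothetical and is not part of the DataSet interface.

```python
def selected_examples(key, length):
    """Return the example indices that dataset[key] would cover (hypothetical helper)."""
    if isinstance(key, slice):
        # slice.indices clips start/stop to the dataset length and fills in defaults
        return list(range(*key.indices(length)))
    if isinstance(key, int):
        return [key]
    return list(key)  # an explicit list of indices, kept in the given order

single = selected_examples(3, 10)                  # dataset[3]
contiguous = selected_examples(slice(2, 5), 10)    # dataset[2:5]
strided = selected_examples(slice(1, 8, 3), 10)    # dataset[1:8:3]
listed = selected_examples([7, 0, 2], 10)          # dataset[[7,0,2]]
```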
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
257
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
258 def valuesHStack(self,fieldnames,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
259 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
260 Return a value that corresponds to concatenating (horizontally) several field values.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
261 This can be useful to merge some fields. The implementation of this operation is likely
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
262 to involve a copy of the original values. When the values are numpy arrays, the
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
263 result should be numpy.hstack(values). If it makes sense, this operation should
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
264 work as well when each value corresponds to multiple examples in a minibatch
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
265 e.g. if each value is a Ni-vector and a minibatch of length L is a LxNi matrix,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
266 then the result should be a Lx(N1+N2+..) matrix equal to numpy.hstack(values).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
267 The default is to use numpy.hstack for numpy.ndarray values, and a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
268 pointing to the original values for other data types.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
269 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
270 all_numpy=True
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
271 for value in fieldvalues:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
272 if type(value) is not numpy.ndarray:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
273 all_numpy=False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
274 if all_numpy:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
275 return numpy.hstack(fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
276 # the default implementation of horizontal stacking is to put values in a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
277 return fieldvalues
26
672fe4b23032 Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents: 23
diff changeset
278
672fe4b23032 Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents: 23
diff changeset
279
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
280 def valuesVStack(self,fieldname,values):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
281 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
282 Return a value that corresponds to concatenating (vertically) several values of the
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
283 same field. This can be important to build a minibatch out of individual examples. This
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
284 is likely to involve a copy of the original values. When the values are numpy arrays, the
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
285 result should be numpy.vstack(values).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
286 The default is to use numpy.vstack for numpy.ndarray values, and a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
287 pointing to the original values for other data types.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
288 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
289 all_numpy=True
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
290 for value in values:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
291 if type(value) is not numpy.ndarray:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
292 all_numpy=False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
293 if all_numpy:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
294 return numpy.vstack(values)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
295 # the default implementation of vertical stacking is to put values in a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
296 return values
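The default rules of valuesHStack and valuesVStack above can be sketched as one stand-alone function. The name stack_values and the sample arrays are illustrative assumptions; the rule itself (stack only when every value is exactly a numpy.ndarray, otherwise return the list unchanged) mirrors the two methods.

```python
import numpy

def stack_values(values, stack):
    """Apply `stack` when every value is exactly a numpy.ndarray, else keep the list."""
    if all(type(v) is numpy.ndarray for v in values):
        return stack(values)
    return values

# Two fields of a minibatch of length L=4: a 4x3 matrix and a 4x2 matrix.
f1 = numpy.zeros((4, 3))
f2 = numpy.ones((4, 2))
merged = stack_values([f1, f2], numpy.hstack)   # L x (N1+N2) matrix

# Three single examples of the same field, each a 3-vector.
rows = [numpy.arange(3), numpy.arange(3) + 10, numpy.arange(3) + 20]
minibatch = stack_values(rows, numpy.vstack)    # one row per example

# Non-array values fall back to the original list.
labels = stack_values(['cat', 'dog'], numpy.vstack)
```

Because the check uses the exact type, subclasses of numpy.ndarray take the list fallback as well.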
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
297
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
298 def __or__(self,other):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
299 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
300 dataset1 | dataset2 returns a dataset whose list of fields is the concatenation of the list of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
301 fields of the argument datasets. This only works if they all have the same length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
302 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
303 return HStackedDataSet([self,other])
3
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
304
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
305 def __add__(self,other):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
306 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
307 dataset1 + dataset2 is a dataset that concatenates the examples from the argument datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
308 (and whose length is the sum of the length of the argument datasets). This only
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
309 works if they all have the same fields.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
310 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
311 return VStackedDataSet([self,other])
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
312
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
313 def hstack(datasets):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
314 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
315 hstack([dataset1,dataset2,...]) returns dataset1 | dataset2 | ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
316 which is a dataset whose fields list is the concatenation of the fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
317 of the individual datasets.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
318 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
319 assert len(datasets)>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
320 if len(datasets)==1:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
321 return datasets[0]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
322 return HStackedDataSet(datasets)
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
323
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
324 def vstack(datasets):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
325 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
326 vstack([dataset1,dataset2,...]) returns dataset1 + dataset2 + ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
327 which is a dataset which iterates first over the examples of dataset1, then
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
328 over those of dataset2, etc.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
329 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
330 assert len(datasets)>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
331 if len(datasets)==1:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
332 return datasets[0]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
333 return VStackedDataSet(datasets)
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
334
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
335
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
336 class DataSetFields(LookupList):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
337 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
338 Although a DataSet iterates over examples (like rows of a matrix), an associated
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
339 DataSetFields iterates over fields (like columns of a matrix), and can be understood
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
340 as a transpose of the associated dataset.
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
341
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
342 To iterate over fields, one can do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
343 * for fields in dataset.fields()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
344 * for fields in dataset(field1,field2,...).fields() to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
345 * for fields in dataset.fields(field1,field2,...) to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
346 and each of these fields is iterable over the examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
347 * for field_examples in dataset.fields():
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
348 for example_value in field_examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
349 ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
350 but when the dataset is a stream (unbounded length), it is not recommended to do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
351 such things because the underlying dataset may refuse to access the different fields in
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
352 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
353 The result of fields() is a DataSetFields object, which iterates over fields,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
354 and whose elements are iterable over examples. A DataSetFields object can
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
355 be turned back into a DataSet with its examples() method:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
356 dataset2 = dataset1.fields().examples()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
357 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
358 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
359 def __init__(self,dataset,*fieldnames):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
360 self.dataset=dataset
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
361 assert dataset.hasFields(*fieldnames)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
362 LookupList.__init__(self,dataset.fieldNames(),
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
363 dataset.minibatches(fieldnames if len(fieldnames)>0 else dataset.fieldNames(),minibatch_size=len(dataset)).next())
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
364 def examples(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
365 return self.dataset
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
366
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
367 def __or__(self,other):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
368 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
369 fields1 | fields2 is a DataSetFields whose list of examples is the concatenation
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
370 of the list of examples of DataSetFields fields1 and fields2.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
371 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
372 return (self.examples() + other.examples()).fields()
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
373
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
374 def __add__(self,other):
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
375 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
376 fields1 + fields2 is a DataSetFields whose list of fields is the concatenation
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
377 of the fields of DataSetFields fields1 and fields2.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
378 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
379 return (self.examples() | other.examples()).fields()
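The "transpose" relationship between a DataSet (rows of examples) and its DataSetFields (columns of fields) can be sketched with plain Python containers. A dict stands in for the LookupList here, and the field names 'x' and 'y' are made up for illustration.

```python
# Row view: one tuple per example, as an example iterator would produce them.
examples = [(1, 'a'), (2, 'b'), (3, 'c')]
fieldnames = ['x', 'y']

# Column view: one list per field, each iterable over the examples.
fields = dict(zip(fieldnames, map(list, zip(*examples))))

# Transposing the column view gives the row view back.
rebuilt = list(zip(*(fields[name] for name in fieldnames)))
```

Building the column view requires reading every example, which is why it only makes sense for finite-length datasets.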
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
380
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
381 class MinibatchDataSet(DataSet):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
382 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
383 Turn a LookupList of same-length fields into an example-iterable dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
384 Each element of the lookup-list should be iterable and sliceable, and all elements should have the same length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
385 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
386 def __init__(self,fields_lookuplist,values_vstack=DataSet().valuesVStack,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
387 values_hstack=DataSet().valuesHStack):
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
388 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
389 The user can (and generally should) also provide values_vstack(fieldname,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
390 and values_hstack(fieldnames,fieldvalues) functions behaving with the same
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
391 semantics as the DataSet methods of the same name (but without the self argument).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
392 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
393 self.fields=fields_lookuplist
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
394 assert len(fields_lookuplist)>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
395 self.length=len(fields_lookuplist[0])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
396 for field in fields_lookuplist[1:]:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
397 assert self.length==len(field)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
398 self.values_vstack=values_vstack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
399 self.values_hstack=values_hstack
3
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
400
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
401 def __len__(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
402 return self.length
28
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
403
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
404 def __getitem__(self,i):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
405 return Example(self.fields.keys(),[field[i] for field in self.fields])
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
406
29
46c5c90019c2 Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents: 28
diff changeset
407 def fieldNames(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
408 return self.fields.keys()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
409
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
410 def hasField(self,*fieldnames):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
411 for fieldname in fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
412 if fieldname not in self.fields:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
413 return False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
414 return True
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
415
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
416 def minibatches(self,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
417 fieldnames = minibatches_fieldnames,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
418 minibatch_size = minibatches_minibatch_size,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
419 n_batches = minibatches_n_batches):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
420 class Iterator(object):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
421 def __init__(self,ds):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
422 self.ds=ds
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
423 self.next_example=0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
424 self.n_batches_done=0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
425 assert minibatch_size > 0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
426 if minibatch_size > ds.length:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
427 raise NotImplementedError()
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
428 def __iter__(self):
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
429 return self
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
430 def next_index(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
431 return self.next_example
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
432 def next(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
433 upper = self.next_example+minibatch_size
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
434 if upper<=self.ds.length:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
435 minibatch = Example(self.ds.fields.keys(),
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
436 [field[self.next_example:upper]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
437 for field in self.ds.fields])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
438 else: # we must concatenate (vstack) the bottom and top parts of our minibatch
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
439 minibatch = Example(self.ds.fields.keys(),
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
440 [self.ds.valuesVStack(name,[value[self.next_example:],
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
441 value[0:upper-self.ds.length]])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
442 for name,value in self.ds.fields.items()])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
443 self.next_example+=minibatch_size
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
444 self.n_batches_done+=1
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
445 if n_batches:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
446 if self.n_batches_done==n_batches:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
447 raise StopIteration
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
448 if self.next_example>=self.ds.length:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
449 self.next_example-=self.ds.length
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
450 else:
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
451 if self.next_example>=self.ds.length:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
452 raise StopIteration
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
453 return DataSetFields(MinibatchDataSet(minibatch),fieldnames)
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
454
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
455 return Iterator(self)
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
456
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
457 def valuesVStack(self,fieldname,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
458 return self.values_vstack(fieldname,fieldvalues)
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
459
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
460 def valuesHStack(self,fieldnames,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
461 return self.values_hstack(fieldnames,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
462
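The wrap-around behaviour of the minibatch iterator above can be sketched on a plain Python list. This is only an illustration under stated assumptions: `wraparound_minibatches` is a hypothetical helper, not part of dataset.py, it uses list concatenation in place of `valuesVStack`, and it is written for Python 3 whereas the listing uses Python 2 idioms.

```python
def wraparound_minibatches(field, minibatch_size, n_batches):
    """Yield n_batches slices of `field`, wrapping to the start when needed.

    Hypothetical helper: a plain list stands in for one dataset field.
    """
    length = len(field)
    # Same precondition as the listing: a batch cannot exceed the dataset.
    assert 0 < minibatch_size <= length
    start = 0
    for _ in range(n_batches):
        upper = start + minibatch_size
        if upper <= length:
            batch = field[start:upper]
        else:
            # Concatenate the tail of the data with the head (the "vstack" step).
            batch = field[start:] + field[:upper - length]
        yield batch
        # Equivalent to subtracting the length once the end has been passed.
        start = upper % length


batches = list(wraparound_minibatches([0, 1, 2, 3, 4], 2, 4))
```

With five examples and a batch size of two, the third batch straddles the end of the data and is built from the last example followed by the first.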
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
463 class HStackedDataSet(DataSet):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
464 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
465 A DataSet that wraps several datasets and shows a view that includes all their fields,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
466 i.e. whose list of fields is the concatenation of their lists of fields.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
467
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
468 If a field name is found in more than one of the datasets, then either an error is
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
469 raised or the fields are renamed (either by prefixing the __name__ attribute
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
470 of the dataset + ".", if it exists, or by suffixing the dataset index in the argument list).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
471
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
472 TODO: automatically detect a chain of stacked datasets due to A | B | C | D ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
473 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
474 def __init__(self,datasets,accept_nonunique_names=False):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
475 DataSet.__init__(self)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
476 self.datasets=datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
477 self.accept_nonunique_names=accept_nonunique_names
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
478 self.fieldname2dataset={}
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
479
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
480 def rename_field(fieldname,dataset,i):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
481 if hasattr(dataset,"__name__"):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
482 return dataset.__name__ + "." + fieldname
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
483 return fieldname+"."+str(i)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
484
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
485 # make sure all datasets have the same length and unique field names
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
486 self.length=None
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
487 names_to_change=[]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
488 for i in xrange(len(datasets)):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
489 dataset = datasets[i]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
490 length=len(dataset)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
491 if self.length:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
492 assert self.length==length
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
493 else:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
494 self.length=length
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
495 for fieldname in dataset.fieldNames():
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
496 if fieldname in self.fieldname2dataset: # name conflict!
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
497 if accept_nonunique_names:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
498 fieldname=rename_field(fieldname,dataset,i)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
499 names_to_change.append((fieldname,i))
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
500 else:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
501 raise ValueError("Incompatible datasets: non-unique field name = "+fieldname)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
502 self.fieldname2dataset[fieldname]=i
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
503 for fieldname,i in names_to_change:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
504 del self.fieldname2dataset[fieldname]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
505 self.fieldname2dataset[rename_field(fieldname,self.datasets[i],i)]=i
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
506
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
507 def hasField(self,*fieldnames):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
508 for fieldname in fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
509 if not fieldname in self.fieldname2dataset:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
510 return False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
511 return True
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
512
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
513 def fieldNames(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
514 return self.fieldname2dataset.keys()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
515
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
516 def minibatches(self,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
517 fieldnames = minibatches_fieldnames,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
518 minibatch_size = minibatches_minibatch_size,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
519 n_batches = minibatches_n_batches):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
520
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
521 class Iterator(object):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
522 def __init__(self,hsds,iterators):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
523 self.hsds=hsds
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
524 self.iterators=iterators
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
525 def __iter__(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
526 return self
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
527 def next_index(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
528 return self.iterators[0].next_index()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
529 def next(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
530 # concatenate all the fields of the minibatches
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
531 minibatch = reduce(LookupList.__add__,[iterator.next() for iterator in self.iterators])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
532 # and return a DataSetFields whose dataset is the transpose (=examples()) of this minibatch
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
533 return DataSetFields(MinibatchDataSet(minibatch,self.hsds.valuesVStack,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
534 self.hsds.valuesHStack),
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
535 fieldnames if fieldnames else self.hsds.fieldNames())
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
536
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
537 assert not fieldnames or self.hasField(*fieldnames)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
538 # find out which underlying datasets are necessary to service the required fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
539 # and construct corresponding minibatch iterators
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
540 if fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
541 datasets=set([])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
542 fields_in_dataset=dict([(dataset,[]) for dataset in self.datasets])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
543 for fieldname in fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
544 dataset=self.datasets[self.fieldname2dataset[fieldname]]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
545 datasets.add(dataset)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
546 fields_in_dataset[dataset].append(fieldname)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
547 datasets=list(datasets)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
548 iterators=[dataset.minibatches(fields_in_dataset[dataset],minibatch_size,n_batches)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
549 for dataset in datasets]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
550 else:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
551 datasets=self.datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
552 iterators=[dataset.minibatches(None,minibatch_size,n_batches) for dataset in datasets]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
553 return Iterator(self,iterators)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
554
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
555
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
556 def valuesVStack(self,fieldname,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
557 return self.datasets[self.fieldname2dataset[fieldname]].valuesVStack(fieldname,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
558
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
559 def valuesHStack(self,fieldnames,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
560 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
561 We will use the sub-dataset associated with the first fieldname in the fieldnames list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
562 to do the work, hoping that it can cope with the other values (i.e. won't care
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
563 about the incompatible fieldnames). Hence this heuristic will always work if
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
564 all the fieldnames are of the same sub-dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
565 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
566 return self.datasets[self.fieldname2dataset[fieldnames[0]]].valuesHStack(fieldnames,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
567
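The field-name conflict rule described in the HStackedDataSet docstring can be sketched without any dataset objects. This is a simplified illustration, not the class's actual code: `merge_field_names` and its `names` argument are hypothetical, only the later of two conflicting fields is renamed, and the code targets Python 3.

```python
def merge_field_names(field_lists, names=None, accept_nonunique_names=False):
    """Map each (possibly renamed) field name to the index of its dataset.

    field_lists: one list of field names per dataset.
    names: optional per-dataset names (stand-in for a __name__ attribute).
    """
    owner = {}
    for i, fields in enumerate(field_lists):
        for fieldname in fields:
            if fieldname in owner:  # name conflict
                if not accept_nonunique_names:
                    raise ValueError(
                        "Incompatible datasets: non-unique field name = " + fieldname)
                name = names[i] if names and names[i] else None
                if name:
                    # Prefix with the dataset name when one is available.
                    fieldname = name + "." + fieldname
                else:
                    # Otherwise suffix with the dataset index.
                    fieldname = fieldname + "." + str(i)
            owner[fieldname] = i
    return owner


by_index = merge_field_names([["x", "y"], ["y", "z"]], accept_nonunique_names=True)
by_name = merge_field_names([["x", "y"], ["y", "z"]], names=[None, "b"],
                            accept_nonunique_names=True)
```

The resulting dictionary plays the role of `fieldname2dataset`: looking up a field name gives the index of the sub-dataset that serves it.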
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
568 class VStackedDataSet(DataSet):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
569 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
570 A DataSet that wraps several datasets and shows a view that includes all their examples,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
571 in the order provided. This clearly assumes that they all have the same field names
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
572 and all (except possibly the last one) are of finite length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
573
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
574 TODO: automatically detect a chain of stacked datasets due to A + B + C + D ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
575 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
576 def __init__(self,datasets):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
577 self.datasets=datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
578 self.length=0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
579 self.index2dataset={}
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
580 # we use this map from row index to dataset index for constant-time random access of examples,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
581 # to avoid having to search for the appropriate dataset each time a slice is asked for
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
582 for k,dataset in enumerate(datasets[0:-1]):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
583 L=len(dataset)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
584 assert L<DataSet.infinity
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
585 for i in xrange(L):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
586 self.index2dataset[self.length+i]=k
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
587 self.length+=L
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
588 self.last_start=self.length
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
589 self.length+=len(datasets[-1])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
590
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
591
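The row-index table built in the VStackedDataSet constructor can be sketched from the dataset lengths alone. The two helpers below are hypothetical and written for Python 3. They assume, as the docstring does, that only the last dataset may be unbounded, so it gets no table entries and is identified by `last_start` instead.

```python
def build_index2dataset(lengths):
    """Return (index2dataset, last_start) for datasets of the given lengths.

    Every dataset except the last gets one dictionary entry per row. Rows at
    or beyond last_start belong to the last dataset.
    """
    index2dataset = {}
    total = 0
    for k, length in enumerate(lengths[:-1]):
        for i in range(length):
            index2dataset[total + i] = k
        total += length
    return index2dataset, total


def dataset_of_row(row, index2dataset, last_start, n_datasets):
    """Constant-time lookup of the dataset index that holds a global row."""
    if row >= last_start:
        return n_datasets - 1
    return index2dataset[row]


table, last_start = build_index2dataset([2, 3, 4])
```

The table costs one entry per row of every dataset but the last, which is the price paid for avoiding a search on each access.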
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
592 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None):
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
593 """
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
594 Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
595 user to define a set of fields as the 'input' field and a set of fields
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
596 as the 'target' field. Optionally, a single weight_field can also be defined.
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
597 """
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
598 args = ((input_fields,'input'),(target_fields,'target'))
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
599 if weight_field: args+=(([weight_field],'weight'),)
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
600 return src_dataset.merge_fields(*args)
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
601
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
602
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
603
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
604
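The argument tuple assembled in supervised_learning_dataset relies on a detail of Python tuple syntax: appending one (fields, name) pair needs a one-element tuple with a trailing comma, otherwise the pair's two members are appended separately. A minimal sketch, where `build_merge_args` is a hypothetical stand-in that only builds the arguments and does not call `merge_fields`:

```python
def build_merge_args(input_fields, target_fields, weight_field=None):
    """Build the (fields, merged_name) pairs for the supervised-learning view."""
    args = ((input_fields, 'input'), (target_fields, 'target'))
    if weight_field:
        # The trailing comma makes this a 1-tuple holding one pair.
        args += (([weight_field], 'weight'),)
    return args


with_weight = build_merge_args(['x1', 'x2'], ['y'], 'w')
without_weight = build_merge_args(['x1', 'x2'], ['y'])

# Without the trailing comma the pair is flattened into two separate items.
flattened = ((['x'], 'input'),) + ((['w'], 'weight'))
```

Each element of the result is one (list of source fields, merged field name) pair, so the weighted case yields three pairs and the unweighted case two.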