annotate dataset.py @ 44:5a85fda9b19b

Fixed some more iterator bugs
author bengioy@grenat.iro.umontreal.ca
date Mon, 28 Apr 2008 13:52:54 -0400
parents e92244f30116
children a5c70dc42972
rev   line source
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
1
12
ff4e551490f1 Added LookupList type in lookup_list.py and used it to keep order
bengioy@esprit.iro.umontreal.ca
parents: 11
diff changeset
2 from lookup_list import LookupList
ff4e551490f1 Added LookupList type in lookup_list.py and used it to keep order
bengioy@esprit.iro.umontreal.ca
parents: 11
diff changeset
3 Example = LookupList
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
4 from misc import unique_elements_list_intersection
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
5 from string import join
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
6 from sys import maxint
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
7
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
8 class AbstractFunction (Exception): """Derived class must override this function"""
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
9 class NotImplementedYet (NotImplementedError): """Work in progress, this should eventually be implemented"""
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
10 #class UnboundedDataSet (Exception): """Trying to obtain length of unbounded dataset (a stream)"""
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
11
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
12 class DataSet(object):
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
13 """A virtual base class for datasets.
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
14
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
15 A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
16 with learning algorithms (for training and testing them): rows/records are called examples, and
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
17 columns/attributes are called fields. The field value for a particular example can be an arbitrary
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
18 python object, which depends on the particular dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
19
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
20 We call a DataSet a 'stream' when its length is unbounded (otherwise its __len__ method
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
21 should raise an UnboundedDataSet exception).
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
22
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
23 A DataSet is a generator of iterators; these iterators can run through the
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
24 examples or the fields in a variety of ways. A DataSet need not necessarily have a finite
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
25 or known length, so this class can be used to interface to a 'stream' which
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
26 feeds on-line learning (however, as noted below, some operations are not
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
27 feasible or not recommanded on streams).
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
28
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
29 To iterate over examples, there are several possibilities:
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
30 * for example in dataset([field1, field2,field3, ...]):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
31 * for val1,val2,val3 in dataset([field1, field2,field3]):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
32 * for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N):
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
33 * for mini1,mini2,mini3 in dataset.minibatches([field1, field2, ...],minibatch_size=N):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
34 * for example in dataset:
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
35 Each of these is documented below. All of these iterators are expected
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
36 to provide, in addition to the usual 'next()' method, a 'next_index()' method
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
37 which returns a non-negative integer pointing to the position of the next
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
38 example that will be returned by 'next()' (or of the first example in the
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
39 next minibatch returned). This is important because these iterators
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
40 can wrap around the dataset in order to do multiple passes through it,
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
41 in possibly unregular ways if the minibatch size is not a divisor of the
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
42 dataset length.
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
43
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
44 To iterate over fields, one can do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
45 * for fields in dataset.fields()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
46 * for fields in dataset(field1,field2,...).fields() to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
47 * for fields in dataset.fields(field1,field2,...) to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
48 and each of these fields is iterable over the examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
49 * for field_examples in dataset.fields():
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
50 for example_value in field_examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
51 ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
52 but when the dataset is a stream (unbounded length), it is not recommanded to do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
53 such things because the underlying dataset may refuse to access the different fields in
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
54 an unsynchronized ways. Hence the fields() method is illegal for streams, by default.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
55 The result of fields() is a DataSetFields object, which iterates over fields,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
56 and whose elements are iterable over examples. A DataSetFields object can
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
57 be turned back into a DataSet with its examples() method:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
58 dataset2 = dataset1.fields().examples()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
59 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
60
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
61 Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content.
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
62
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
63 Note: The content of a field can be of any type. Field values can also be 'missing'
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
64 (e.g. to handle semi-supervised learning), and in the case of numeric (numpy array)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
65 fields (i.e. an ArrayFieldsDataSet), NaN plays the role of a missing value.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
66
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
67 Dataset elements can be indexed and sub-datasets (with a subset
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
68 of examples) can be extracted. These operations are not supported
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
69 by default in the case of streams.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
70
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
71 * dataset[:n] returns a dataset with the n first examples.
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
72
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
73 * dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
74
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
75 * dataset[i] returns an Example.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
76
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
77 * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
78
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
79 * dataset['key'] returns a property associated with the given 'key' string.
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
80 If 'key' is a fieldname, then the VStacked field values (iterable over
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
81 field values) for that field is returned. Other keys may be supported
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
82 by different dataset subclasses. The following key names are should be supported:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
83 - 'description': a textual description or name for the dataset
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
84 - '<fieldname>.type': a type name or value for a given <fieldname>
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
85
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
86 Datasets can be concatenated either vertically (increasing the length) or
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
87 horizontally (augmenting the set of fields), if they are compatible, using
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
88 the following operations (with the same basic semantics as numpy.hstack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
89 and numpy.vstack):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
90
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
91 * dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3])
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
92
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
93 creates a new dataset whose list of fields is the concatenation of the list of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
94 fields of the argument datasets. This only works if they all have the same length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
95
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
96 * dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3])
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
97
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
98 creates a new dataset that concatenates the examples from the argument datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
99 (and whose length is the sum of the length of the argument datasets). This only
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
100 works if they all have the same fields.
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
101
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
102 According to the same logic, and viewing a DataSetFields object associated to
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
103 a DataSet as a kind of transpose of it, fields1 + fields2 concatenates fields of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
104 a DataSetFields fields1 and fields2, and fields1 | fields2 concatenates their
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
105 examples.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
106
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
107 A dataset can hold arbitrary key-value pairs that may be used to access meta-data
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
108 or other properties of the dataset or associated with the dataset or the result
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
109 of a computation stored in a dataset. These can be accessed through the [key] syntax
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
110 when key is a string (or more specifically, neither an integer, a slice, nor a list).
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
111
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
112 A DataSet sub-class should always redefine the following methods:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
113 * __len__ if it is not a stream
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
114 * fieldNames
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
115 * minibatches_nowrap (called by DataSet.minibatches())
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
116 * valuesHStack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
117 * valuesVStack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
118 For efficiency of implementation, a sub-class might also want to redefine
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
119 * hasFields
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
120 * __getitem__ may not be feasible with some streams
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
121 * __iter__
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
122 """
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
123
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
124 def __init__(self,description=None,field_types=None):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
125 if description is None:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
126 # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)"
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
127 description = type(self).__name__ + " ( " + join([x.__name__ for x in type(self).__bases__]) + " )"
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
128 self.description=description
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
129 self.field_types=field_types
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
130
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
131 class MinibatchToSingleExampleIterator(object):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
132 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
133 Converts the result of minibatch iterator with minibatch_size==1 into
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
134 single-example values in the result. Therefore the result of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
135 iterating on the dataset itself gives a sequence of single examples
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
136 (whereas the result of iterating over minibatches gives in each
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
137 Example field an iterable object over the individual examples in
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
138 the minibatch).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
139 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
140 def __init__(self, minibatch_iterator):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
141 self.minibatch_iterator = minibatch_iterator
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
142 self.minibatch = None
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
143 def __iter__(self): #makes for loop work
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
144 return self
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
145 def next(self):
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
146 size1_minibatch = self.minibatch_iterator.next()
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
147 if not self.minibatch:
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
148 self.minibatch = Example(size1_minibatch.keys(),[value[0] for value in size1_minibatch.values()])
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
149 else:
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
150 self.minibatch._values = [value[0] for value in size1_minibatch.values()]
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
151 return self.minibatch
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
152
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
153 def next_index(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
154 return self.minibatch_iterator.next_index()
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
155
3
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
156 def __iter__(self):
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
157 """Supports the syntax "for i in dataset: ..."
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
158
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
159 Using this syntax, "i" will be an Example instance (or equivalent) with
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
160 all the fields of DataSet self. Every field of "i" will give access to
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
161 a field of a single example. Fields should be accessible via
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
162 i["fielname"] or i[3] (in the order defined by the elements of the
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
163 Example returned by this iterator), but the derived class is free
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
164 to accept any type of identifier, and add extra functionality to the iterator.
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
165
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
166 The default implementation calls the minibatches iterator and extracts the first example of each field.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
167 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
168 return DataSet.MinibatchToSingleExampleIterator(self.minibatches(None, minibatch_size = 1))
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
169
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
170
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
171 class MinibatchWrapAroundIterator(object):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
172 """
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
173 An iterator for minibatches that handles the case where we need to wrap around the
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
174 dataset because n_batches*minibatch_size > len(dataset). It is constructed from
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
175 a dataset that provides a minibatch iterator that does not need to handle that problem.
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
176 This class is a utility for dataset subclass writers, so that they do not have to handle
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
177 this issue multiple times, nor check that fieldnames are valid, nor handle the
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
178 empty fieldnames (meaning 'use all the fields').
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
179 """
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
180 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
181 self.dataset=dataset
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
182 self.fieldnames=fieldnames
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
183 self.minibatch_size=minibatch_size
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
184 self.n_batches=n_batches
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
185 self.n_batches_done=0
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
186 self.next_row=offset
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
187 self.L=len(dataset)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
188 assert offset+minibatch_size<=self.L
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
189 ds_nbatches = (self.L-offset)/minibatch_size
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
190 if n_batches is not None:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
191 ds_nbatches = max(n_batches,ds_nbatches)
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
192 if fieldnames:
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
193 assert dataset.hasFields(*fieldnames)
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
194 else:
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
195 fieldnames=dataset.fieldNames()
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
196 self.iterator = dataset.minibatches_nowrap(fieldnames,minibatch_size,ds_nbatches,offset)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
197
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
198 def __iter__(self):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
199 return self
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
200
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
201 def next_index(self):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
202 return self.next_row
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
203
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
204 def next(self):
43
e92244f30116 Corrected iterator logic errors
bengioy@grenat.iro.umontreal.ca
parents: 42
diff changeset
205 if self.n_batches and self.n_batches_done==self.n_batches:
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
206 raise StopIteration
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
207 upper = self.next_row+self.minibatch_size
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
208 if upper <=self.L:
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
209 minibatch = self.iterator.next()
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
210 else:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
211 if not self.n_batches:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
212 raise StopIteration
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
213 # we must concatenate (vstack) the bottom and top parts of our minibatch
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
214 # first get the beginning of our minibatch (top of dataset)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
215 first_part = self.dataset.minibatches_nowrap(fieldnames,self.L-self.next_row,1,self.next_row).next()
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
216 second_part = self.dataset.minibatches_nowrap(fieldnames,upper-self.L,1,0).next()
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
217 minibatch = Example(self.fieldnames,
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
218 [self.dataset.valuesVStack(name,[first_part[name],second_part[name]])
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
219 for name in self.fieldnames])
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
220 self.next_row=upper
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
221 self.n_batches_done+=1
43
e92244f30116 Corrected iterator logic errors
bengioy@grenat.iro.umontreal.ca
parents: 42
diff changeset
222 if upper >= self.L and self.n_batches:
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
223 self.next_row -= self.L
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
224 return minibatch
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
225
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
226
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
227 minibatches_fieldnames = None
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
228 minibatches_minibatch_size = 1
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
229 minibatches_n_batches = None
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
230 def minibatches(self,
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
231 fieldnames = minibatches_fieldnames,
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
232 minibatch_size = minibatches_minibatch_size,
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
233 n_batches = minibatches_n_batches,
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
234 offset = 0):
6
d5738b79089a Removed MinibatchIterator and instead made minibatch_size a field of all DataSets,
bengioy@bengiomac.local
parents: 5
diff changeset
235 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
236 Return an iterator that supports three forms of syntax:
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
237
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
238 for i in dataset.minibatches(None,**kwargs): ...
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
239
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
240 for i in dataset.minibatches([f1, f2, f3],**kwargs): ...
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
241
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
242 for i1, i2, i3 in dataset.minibatches([f1, f2, f3],**kwargs): ...
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
243
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
244 Using the first two syntaxes, "i" will be an indexable object, such as a list,
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
245 tuple, or Example instance. In both cases, i[k] is a list-like container
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
246 of a batch of current examples. In the second case, i[0] is
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
247 list-like container of the f1 field of a batch current examples, i[1] is
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
248 a list-like container of the f2 field, etc.
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
249
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
250 Using the first syntax, all the fields will be returned in "i".
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
251 Using the third syntax, i1, i2, i3 will be list-like containers of the
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
252 f1, f2, and f3 fields of a batch of examples on each loop iteration.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
253
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
254 The minibatches iterator is expected to return upon each call to next()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
255 a DataSetFields object, which is a LookupList (indexed by the field names) whose
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
256 elements are iterable over the minibatch examples, and which keeps a pointer to
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
257 a sub-dataset that can be used to iterate over the individual examples
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
258 in the minibatch. Hence a minibatch can be converted back to a regular
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
259 dataset or its fields can be looked at individually (and possibly iterated over).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
260
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
261 PARAMETERS
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
262 - fieldnames (list of any type, default None):
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
263 The loop variables i1, i2, i3 (in the example above) should contain the
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
264 f1, f2, and f3 fields of the current batch of examples. If None, the
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
265 derived class can choose a default, e.g. all fields.
16
813723310d75 commenting
bergstrj@iro.umontreal.ca
parents: 15 11
diff changeset
266
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
267 - minibatch_size (integer, default 1)
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
268 On every iteration, the variables i1, i2, i3 will have
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
269 exactly minibatch_size elements. e.g. len(i1) == minibatch_size
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
270
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
271 - n_batches (integer, default None)
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
272 The iterator will loop exactly this many times, and then stop. If None,
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
273 the derived class can choose a default. If (-1), then the returned
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
274 iterator should support looping indefinitely.
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
275
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
276 - offset (integer, default 0)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
277 The iterator will start at example 'offset' in the dataset, rather than the default.
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
278
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
279 Note: A list-like container is something like a tuple, list, numpy.ndarray or
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
280 any other object that supports integer indexing and slicing.
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
281
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
282 """
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
283 return DataSet.MinibatchWrapAroundIterator(self,fieldnames,minibatch_size,n_batches,offset)
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
284
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
285 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
286 """
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
287 This is the minibatches iterator generator that sub-classes must define.
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
288 It does not need to worry about wrapping around multiple times across the dataset,
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
289 as this is handled by MinibatchWrapAroundIterator when DataSet.minibatches() is called.
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
290 The next() method of the returned iterator does not even need to worry about
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
291 the termination condition (as StopIteration will be raised by DataSet.minibatches
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
292 before an improper call to minibatches_nowrap's next() is made).
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
293 That next() method can assert that its next row will always be within [0,len(dataset)).
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
294 The iterator returned by minibatches_nowrap does not need to implement
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
295 a next_index() method either, as this will be provided by MinibatchWrapAroundIterator.
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
296 """
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
297 raise AbstractFunction()
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
298
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
299 def __len__(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
300 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
301 len(dataset) returns the number of examples in the dataset.
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
302 By default, a DataSet is a 'stream', i.e. it has an unbounded length (raises UnboundedDataSet).
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
303 Sub-classes which implement finite-length datasets should redefine this method.
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
304 Some methods only make sense for finite-length datasets.
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
305 """
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
306 raise UnboundedDataSet()
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
307
26
672fe4b23032 Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents: 23
diff changeset
308 def hasFields(self,*fieldnames):
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
309 """
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
310 Return true if the given field name (or field names, if multiple arguments are
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
311 given) is recognized by the DataSet (i.e. can be used as a field name in one
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
312 of the iterators).
29
46c5c90019c2 Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents: 28
diff changeset
313
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
314 The default implementation may be inefficient (O(# fields in dataset)), as it calls the fieldNames()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
315 method. Many datasets may store their field names in a dictionary, which would allow more efficiency.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
316 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
317 return len(unique_elements_list_intersection(fieldnames,self.fieldNames()))>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
318
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
319 def fieldNames(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
320 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
321 Return the list of field names that are supported by the iterators,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
322 and for which hasFields(fieldname) would return True.
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
323 """
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
324 raise AbstractFunction()
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
325
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
326 def __call__(self,*fieldnames):
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
327 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
328 Return a dataset that sees only the fields whose name are specified.
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
329 """
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
330 assert self.hasFields(*fieldnames)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
331 return self.fields(*fieldnames).examples()
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
332
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
333 def fields(self,*fieldnames):
29
46c5c90019c2 Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents: 28
diff changeset
334 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
335 Return a DataSetFields object associated with this dataset.
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
336 """
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
337 return DataSetFields(self,*fieldnames)
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
338
2
3fddb1c8f955 Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents: 1
diff changeset
339 def __getitem__(self,i):
28
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
340 """
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
341 dataset[i] returns the (i+1)-th example of the dataset.
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
342 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1.
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
343 dataset[i:j:s] returns the subdataset with examples i,i+2,i+4...,j-2.
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
344 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in.
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
345 dataset['key'] returns a property associated with the given 'key' string.
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
346 If 'key' is a fieldname, then the VStacked field values (iterable over
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
347 field values) for that field is returned. Other keys may be supported
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
348 by different dataset subclasses. The following key names are encouraged:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
349 - 'description': a textual description or name for the dataset
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
350 - '<fieldname>.type': a type name or value for a given <fieldname>
1
2cd82666b9a7 Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents: 0
diff changeset
351
39
c682c6e9bf93 Minor edits
bengioy@esprit.iro.umontreal.ca
parents: 38
diff changeset
352 Note that some stream datasets may be unable to implement random access, i.e.
c682c6e9bf93 Minor edits
bengioy@esprit.iro.umontreal.ca
parents: 38
diff changeset
353 arbitrary slicing/indexing
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
354 because they can only iterate through examples one or a minibatch at a time
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
355 and do not actually store or keep past (or future) examples.
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
356
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
357 The default implementation of getitem uses the minibatches iterator
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
358 to obtain one example, one slice, or a list of examples. It may not
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
359 always be the most efficient way to obtain the result, especially if
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
360 the data are actually stored in a memory array.
28
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
361 """
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
362 # check for an index
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
363 if type(i) is int:
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
364 return DataSet.MinibatchToSingleExampleIterator(
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
365 self.minibatches(minibatch_size=1,n_batches=1,offset=i)).next()
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
366 rows=None
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
367 # or a slice
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
368 if type(i) is slice:
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
369 if not i.start: i.start=0
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
370 if not i.step: i.step=1
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
371 if i.step is 1:
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
372 return self.minibatches(minibatch_size=i.stop-i.start,n_batches=1,offset=i.start).next().examples()
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
373 rows = range(i.start,i.stop,i.step)
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
374 # or a list of indices
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
375 elif type(i) is list:
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
376 rows = i
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
377 if rows is not None:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
378 fields_values = zip(*[self[row] for row in rows])
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
379 return DataSet.MinibatchDataSet(
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
380 Example(self.fieldNames(),[ self.valuesVStack(fieldname,field_values)
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
381 for fieldname,field_values
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
382 in zip(self.fieldNames(),fields_values)]))
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
383 # else check for a fieldname
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
384 if self.hasFields(i):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
385 return self.minibatches(fieldnames=[i],minibatch_size=len(self),n_batches=1,offset=0).next()[0]
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
386 # else we are trying to access a property of the dataset
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
387 assert i in self.__dict__ # else it means we are trying to access a non-existing property
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
388 return self.__dict__[i]
22
b6b36f65664f Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents: 20
diff changeset
389
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
390 def valuesHStack(self,fieldnames,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
391 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
392 Return a value that corresponds to concatenating (horizontally) several field values.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
393 This can be useful to merge some fields. The implementation of this operation is likely
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
394 to involve a copy of the original values. When the values are numpy arrays, the
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
395 result should be numpy.hstack(values). If it makes sense, this operation should
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
396 work as well when each value corresponds to multiple examples in a minibatch
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
397 e.g. if each value is a Ni-vector and a minibatch of length L is a LxNi matrix,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
398 then the result should be a Lx(N1+N2+..) matrix equal to numpy.hstack(values).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
399 The default is to use numpy.hstack for numpy.ndarray values, and a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
400 pointing to the original values for other data types.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
401 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
402 all_numpy=True
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
403 for value in fieldvalues:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
404 if not type(value) is numpy.ndarray:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
405 all_numpy=False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
406 if all_numpy:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
407 return numpy.hstack(fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
408 # the default implementation of horizontal stacking is to put values in a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
409 return fieldvalues
26
672fe4b23032 Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents: 23
diff changeset
410
672fe4b23032 Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents: 23
diff changeset
411
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
412 def valuesVStack(self,fieldname,values):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
413 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
414 Return a value that corresponds to concatenating (vertically) several values of the
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
415 same field. This can be important to build a minibatch out of individual examples. This
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
416 is likely to involve a copy of the original values. When the values are numpy arrays, the
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
417 result should be numpy.vstack(values).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
418 The default is to use numpy.vstack for numpy.ndarray values, and a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
419 pointing to the original values for other data types.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
420 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
421 all_numpy=True
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
422 for value in values:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
423 if not type(value) is numpy.ndarray:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
424 all_numpy=False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
425 if all_numpy:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
426 return numpy.vstack(values)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
427 # the default implementation of vertical stacking is to put values in a list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
428 return values
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
429
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
430 def __or__(self,other):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
431 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
432 dataset1 | dataset2 returns a dataset whose list of fields is the concatenation of the list of
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
433 fields of the argument datasets. This only works if they all have the same length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
434 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
435 return HStackedDataSet(self,other)
3
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
436
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
437 def __and__(self,other):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
438 """
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
439 dataset1 & dataset2 is a dataset that concatenates the examples from the argument datasets
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
440 (and whose length is the sum of the length of the argument datasets). This only
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
441 works if they all have the same fields.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
442 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
443 return VStackedDataSet(self,other)
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
444
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
445 def hstack(datasets):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
446 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
447 hstack(dataset1,dataset2,...) returns dataset1 | datataset2 | ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
448 which is a dataset whose fields list is the concatenation of the fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
449 of the individual datasets.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
450 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
451 assert len(datasets)>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
452 if len(datasets)==1:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
453 return datasets[0]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
454 return HStackedDataSet(datasets)
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
455
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
456 def vstack(datasets):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
457 """
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
458 vstack(dataset1,dataset2,...) returns dataset1 & datataset2 & ...
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
459 which is a dataset which iterates first over the examples of dataset1, then
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
460 over those of dataset2, etc.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
461 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
462 assert len(datasets)>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
463 if len(datasets)==1:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
464 return datasets[0]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
465 return VStackedDataSet(datasets)
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
466
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
467 class FieldsSubsetDataSet(DataSet):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
468 """
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
469 A sub-class of DataSet that selects a subset of the fields.
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
470 """
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
471 def __init__(self,src,fieldnames):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
472 self.src=src
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
473 self.fieldnames=fieldnames
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
474 assert src.hasFields(*fieldnames)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
475 self.valuesHStack = src.valuesHStack
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
476 self.valuesVStack = src.valuesVStack
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
477
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
478 def __len__(self): return len(self.src)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
479
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
480 def fieldNames(self):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
481 return self.fieldnames
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
482
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
483 def __iter__(self):
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
484 class FieldsSubsetIterator(object):
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
485 def __init__(self,ds):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
486 self.ds=ds
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
487 self.src_iter=ds.src.__iter__()
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
488 self.example=None
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
489 def __iter__(self): return self
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
490 def next(self):
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
491 complete_example = self.src_iter.next()
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
492 if self.example:
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
493 self.example._values=[complete_example[field]
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
494 for field in self.ds.fieldnames]
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
495 else:
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
496 self.example=Example(self.ds.fieldnames,
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
497 [complete_example[field] for field in self.ds.fieldnames])
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
498 return self.example
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
499 return FieldsSubsetIterator(self)
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
500
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
501 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
502 assert self.hasFields(*fieldnames)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
503 return self.src.minibatches_nowrap(fieldnames,minibatch_size,n_batches,offset)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
504 def __getitem__(self,i):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
505 return FieldsSubsetDataSet(self.src[i],self.fieldnames)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
506
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
507
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
508 class DataSetFields(LookupList):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
509 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
510 Although a DataSet iterates over examples (like rows of a matrix), an associated
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
511 DataSetFields iterates over fields (like columns of a matrix), and can be understood
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
512 as a transpose of the associated dataset.
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
513
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
514 To iterate over fields, one can do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
515 * for fields in dataset.fields()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
516 * for fields in dataset(field1,field2,...).fields() to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
517 * for fields in dataset.fields(field1,field2,...) to select a subset of fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
518 and each of these fields is iterable over the examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
519 * for field_examples in dataset.fields():
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
520 for example_value in field_examples:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
521 ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
522 but when the dataset is a stream (unbounded length), it is not recommanded to do
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
523 such things because the underlying dataset may refuse to access the different fields in
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
524 an unsynchronized ways. Hence the fields() method is illegal for streams, by default.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
525 The result of fields() is a DataSetFields object, which iterates over fields,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
526 and whose elements are iterable over examples. A DataSetFields object can
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
527 be turned back into a DataSet with its examples() method:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
528 dataset2 = dataset1.fields().examples()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
529 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
530
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
531 DataSetFields can be concatenated vertically or horizontally. To be consistent with
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
532 the syntax used for DataSets, the | concatenates the fields and the & concatenates
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
533 the examples.
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
534 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
535 def __init__(self,dataset,*fieldnames):
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
536 if not fieldnames:
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
537 fieldnames=dataset.fieldNames()
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
538 elif fieldnames is not dataset.fieldNames():
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
539 dataset = FieldsSubsetDataSet(dataset,fieldnames)
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
540 assert dataset.hasFields(*fieldnames)
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
541 self.dataset=dataset
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
542 minibatch_iterator = dataset.minibatches(fieldnames,
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
543 minibatch_size=len(dataset),
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
544 n_batches=1)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
545 minibatch=minibatch_iterator.next()
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
546 LookupList.__init__(self,fieldnames,minibatch)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
547
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
548 def examples(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
549 return self.dataset
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
550
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
551 def __or__(self,other):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
552 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
553 fields1 | fields2 is a DataSetFields that whose list of examples is the concatenation
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
554 of the list of examples of DataSetFields fields1 and fields2.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
555 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
556 return (self.examples() + other.examples()).fields()
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
557
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
558 def __and__(self,other):
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
559 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
560 fields1 + fields2 is a DataSetFields that whose list of fields is the concatenation
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
561 of the fields of DataSetFields fields1 and fields2.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
562 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
563 return (self.examples() | other.examples()).fields()
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
564
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
565
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
566 class MinibatchDataSet(DataSet):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
567 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
568 Turn a LookupList of same-length fields into an example-iterable dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
569 Each element of the lookup-list should be an iterable and sliceable, all of the same length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
570 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
571 def __init__(self,fields_lookuplist,values_vstack=DataSet().valuesVStack,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
572 values_hstack=DataSet().valuesHStack):
17
759d17112b23 more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
parents: 16 12
diff changeset
573 """
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
574 The user can (and generally should) also provide values_vstack(fieldname,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
575 and a values_hstack(fieldnames,fieldvalues) functions behaving with the same
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
576 semantics as the DataSet methods of the same name (but without the self argument).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
577 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
578 self.fields=fields_lookuplist
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
579 assert len(fields_lookuplist)>0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
580 self.length=len(fields_lookuplist[0])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
581 for field in fields_lookuplist[1:]:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
582 assert self.length==len(field)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
583 self.values_vstack=values_vstack
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
584 self.values_hstack=values_hstack
3
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
585
378b68d5c4ad Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents: 2
diff changeset
586 def __len__(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
587 return self.length
28
541a273bc89f Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents: 26
diff changeset
588
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
589 def __getitem__(self,i):
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
590 return DataSetFields(MinibatchDataSet(
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
591 Example(self.fields.keys(),[field[i] for field in self.fields])),self.fields)
11
be128b9127c8 Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents: 9
diff changeset
592
29
46c5c90019c2 Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents: 28
diff changeset
593 def fieldNames(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
594 return self.fields.keys()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
595
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
596 def hasFields(self,*fieldnames):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
597 for fieldname in fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
598 if fieldname not in self.fields:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
599 return False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
600 return True
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
601
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
602 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
603 class Iterator(object):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
604 def __init__(self,ds):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
605 self.ds=ds
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
606 self.next_example=offset
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
607 assert minibatch_size > 0
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
608 if offset+minibatch_size > ds.length:
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
609 raise NotImplementedError()
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
610 def __iter__(self):
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
611 return self
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
612 def next(self):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
613 upper = next_example+minibatch_size
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
614 assert upper<=self.ds.length
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
615 minibatch = Example(self.ds.fields.keys(),
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
616 [field[next_example:upper]
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
617 for field in self.ds.fields])
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
618 self.next_example+=minibatch_size
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
619 return DataSetFields(MinibatchDataSet(minibatch),fieldnames)
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
620
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
621 return Iterator(self)
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
622
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
623 def valuesVStack(self,fieldname,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
624 return self.values_vstack(fieldname,fieldvalues)
20
266c68cb6136 Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents: 19
diff changeset
625
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
626 def valuesHStack(self,fieldnames,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
627 return self.values_hstack(fieldnames,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
628
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
629 class HStackedDataSet(DataSet):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
630 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
631 A DataSet that wraps several datasets and shows a view that includes all their fields,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
632 i.e. whose list of fields is the concatenation of their lists of fields.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
633
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
634 If a field name is found in more than one of the datasets, then either an error is
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
635 raised or the fields are renamed (either by prefixing the __name__ attribute
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
636 of the dataset + ".", if it exists, or by suffixing the dataset index in the argument list).
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
637
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
638 TODO: automatically detect a chain of stacked datasets due to A | B | C | D ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
639 """
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
640 def __init__(self,datasets,accept_nonunique_names=False,description=None,field_types=None):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
641 DataSet.__init__(self,description,field_types)
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
642 self.datasets=datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
643 self.accept_nonunique_names=accept_nonunique_names
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
644 self.fieldname2dataset={}
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
645
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
646 def rename_field(fieldname,dataset,i):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
647 if hasattr(dataset,"__name__"):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
648 return dataset.__name__ + "." + fieldname
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
649 return fieldname+"."+str(i)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
650
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
651 # make sure all datasets have the same length and unique field names
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
652 self.length=None
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
653 names_to_change=[]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
654 for i in xrange(len(datasets)):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
655 dataset = datasets[i]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
656 length=len(dataset)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
657 if self.length:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
658 assert self.length==length
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
659 else:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
660 self.length=length
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
661 for fieldname in dataset.fieldNames():
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
662 if fieldname in self.fieldname2dataset: # name conflict!
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
663 if accept_nonunique_names:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
664 fieldname=rename_field(fieldname,dataset,i)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
665 names2change.append((fieldname,i))
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
666 else:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
667 raise ValueError("Incompatible datasets: non-unique field name = "+fieldname)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
668 self.fieldname2dataset[fieldname]=i
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
669 for fieldname,i in names_to_change:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
670 del self.fieldname2dataset[fieldname]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
671 self.fieldname2dataset[rename_field(fieldname,self.datasets[i],i)]=i
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
672
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
673 def hasFields(self,*fieldnames):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
674 for fieldname in fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
675 if not fieldname in self.fieldname2dataset:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
676 return False
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
677 return True
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
678
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
679 def fieldNames(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
680 return self.fieldname2dataset.keys()
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
681
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
682 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
683
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
684 class HStackedIterator(object):
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
685 def __init__(self,hsds,iterators):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
686 self.hsds=hsds
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
687 self.iterators=iterators
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
688 def __iter__(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
689 return self
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
690 def next(self):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
691 # concatenate all the fields of the minibatches
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
692 minibatch = reduce(LookupList.__add__,[iterator.next() for iterator in self.iterators])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
693 # and return a DataSetFields whose dataset is the transpose (=examples()) of this minibatch
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
694 return DataSetFields(MinibatchDataSet(minibatch,self.hsds.valuesVStack,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
695 self.hsds.valuesHStack),
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
696 fieldnames if fieldnames else hsds.fieldNames())
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
697
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
698 assert self.hasfields(fieldnames)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
699 # find out which underlying datasets are necessary to service the required fields
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
700 # and construct corresponding minibatch iterators
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
701 if fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
702 datasets=set([])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
703 fields_in_dataset=dict([(dataset,[]) for dataset in datasets])
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
704 for fieldname in fieldnames:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
705 dataset=self.datasets[self.fieldnames2dataset[fieldname]]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
706 datasets.add(dataset)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
707 fields_in_dataset[dataset].append(fieldname)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
708 datasets=list(datasets)
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
709 iterators=[dataset.minibatches(fields_in_dataset[dataset],minibatch_size,n_batches,offset)
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
710 for dataset in datasets]
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
711 else:
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
712 datasets=self.datasets
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
713 iterators=[dataset.minibatches(None,minibatch_size,n_batches,offset) for dataset in datasets]
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
714 return HStackedIterator(self,iterators)
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
715
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
716
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
717 def valuesVStack(self,fieldname,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
718 return self.datasets[self.fieldname2dataset[fieldname]].valuesVStack(fieldname,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
719
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
720 def valuesHStack(self,fieldnames,fieldvalues):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
721 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
722 We will use the sub-dataset associated with the first fieldname in the fieldnames list
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
723 to do the work, hoping that it can cope with the other values (i.e. won't care
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
724 about the incompatible fieldnames). Hence this heuristic will always work if
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
725 all the fieldnames are of the same sub-dataset.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
726 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
727 return self.datasets[self.fieldname2dataset[fieldnames[0]]].valuesHStack(fieldnames,fieldvalues)
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
728
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
729 class VStackedDataSet(DataSet):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
730 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
731 A DataSet that wraps several datasets and shows a view that includes all their examples,
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
732 in the order provided. This clearly assumes that they all have the same field names
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
733 and all (except possibly the last one) are of finite length.
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
734
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
735 TODO: automatically detect a chain of stacked datasets due to A + B + C + D ...
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
736 """
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
737 def __init__(self,datasets):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
738 self.datasets=datasets
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
739 self.length=0
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
740 self.index2dataset={}
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
741 assert len(datasets)>0
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
742 fieldnames = datasets[-1].fieldNames()
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
743 self.datasets_start_row=[]
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
744 # We use this map from row index to dataset index for constant-time random access of examples,
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
745 # to avoid having to search for the appropriate dataset each time and slice is asked for.
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
746 for dataset,k in enumerate(datasets[0:-1]):
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
747 try:
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
748 L=len(dataset)
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
749 except UnboundedDataSet:
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
750 print "All VStacked datasets (except possibly the last) must be bounded (have a length)."
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
751 assert False
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
752 for i in xrange(L):
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
753 self.index2dataset[self.length+i]=k
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
754 self.datasets_start_row.append(self.length)
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
755 self.length+=L
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
756 assert dataset.fieldNames()==fieldnames
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
757 self.datasets_start_row.append(self.length)
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
758 self.length+=len(datasets[-1])
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
759 # If length is very large, we should use a more memory-efficient mechanism
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
760 # that does not store all indices
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
761 if self.length>1000000:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
762 # 1 million entries would require about 60 meg for the index2dataset map
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
763 # TODO
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
764 print "A more efficient mechanism for index2dataset should be implemented"
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
765
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
766 def __len__(self):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
767 return self.length
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
768
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
769 def fieldNames(self):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
770 return self.datasets[0].fieldNames()
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
771
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
772 def hasFields(self,*fieldnames):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
773 return self.datasets[0].hasFields(*fieldnames)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
774
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
775 def locate_row(self,row):
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
776 """Return (dataset_index, row_within_dataset) for global row number"""
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
777 dataset_index = self.index2dataset[row]
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
778 row_within_dataset = self.datasets_start_row[dataset_index]
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
779 return dataset_index, row_within_dataset
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
780
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
781 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
782
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
783 class VStackedIterator(object):
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
784 def __init__(self,vsds):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
785 self.vsds=vsds
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
786 self.next_row=offset
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
787 self.next_dataset_index,self.next_dataset_row=self.vsds.locate_row(offset)
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
788 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
789 self.next_iterator(vsds.datasets[0],offset,n_batches)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
790
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
791 def next_iterator(self,dataset,starting_offset,batches_left):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
792 L=len(dataset)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
793 ds_nbatches = (L-starting_offset)/minibatch_size
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
794 if batches_left is not None:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
795 ds_nbatches = max(batches_left,ds_nbatches)
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
796 if minibatch_size>L:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
797 ds_minibatch_size=L
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
798 n_left_in_mb=minibatch_size-L
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
799 ds_nbatches=1
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
800 else:
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
801 n_left_in_mb=0
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
802 return dataset.minibatches(fieldnames,minibatch_size,ds_nbatches,starting_offset), \
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
803 L-(starting_offset+ds_nbatches*minibatch_size), n_left_in_mb
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
804
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
805 def move_to_next_dataset(self):
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
806 if self.n_left_at_the_end_of_ds>0:
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
807 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
808 self.next_iterator(vsds.datasets[self.next_dataset_index],
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
809 self.n_left_at_the_end_of_ds,1)
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
810 else:
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
811 self.next_dataset_index +=1
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
812 if self.next_dataset_index==len(self.vsds.datasets):
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
813 self.next_dataset_index = 0
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
814 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
815 self.next_iterator(vsds.datasets[self.next_dataset_index],starting_offset,n_batches)
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
816
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
817 def __iter__(self):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
818 return self
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
819
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
820 def next(self):
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
821 dataset=self.vsds.datasets[self.next_dataset_index]
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
822 mb = self.next_iterator.next()
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
823 if self.n_left_in_mb:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
824 extra_mb = []
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
825 while self.n_left_in_mb>0:
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
826 self.move_to_next_dataset()
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
827 extra_mb.append(self.next_iterator.next())
40
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
828 examples = Example(names,
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
829 [dataset.valuesVStack(name,
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
830 [mb[name]]+[b[name] for b in extra_mb])
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
831 for name in fieldnames])
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
832 mb = DataSetFields(MinibatchDataSet(examples),fieldnames)
88fd1cce08b9 replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents: 39
diff changeset
833
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
834 self.next_row+=minibatch_size
38
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
835 self.next_dataset_row+=minibatch_size
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
836 if self.next_row+minibatch_size>len(dataset):
d637ad8f7352 Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents: 37
diff changeset
837 self.move_to_next_dataset()
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
838 return examples
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
839 return VStackedIterator(self)
37
73c4212ba5b3 Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents: 36
diff changeset
840
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
841 class ArrayFieldsDataSet(DataSet):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
842 """
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
843 Virtual super-class of datasets whose field values are numpy array,
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
844 thus defining valuesHStack and valuesVStack for sub-classes.
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
845 """
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
846 def __init__(self,description=None,field_types=None):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
847 DataSet.__init__(self,description,field_types)
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
848 def valuesHStack(self,fieldnames,fieldvalues):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
849 """Concatenate field values horizontally, e.g. two vectors
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
850 become a longer vector, two matrices become a wider matrix, etc."""
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
851 return numpy.hstack(fieldvalues)
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
852 def valuesVStack(self,fieldname,values):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
853 """Concatenate field values vertically, e.g. two vectors
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
854 become a two-row matrix, two matrices become a longer matrix, etc."""
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
855 return numpy.vstack(values)
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
856
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
857 class ArrayDataSet(ArrayFieldsDataSet):
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
858 """
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
859 An ArrayDataSet stores the fields as groups of columns in a numpy tensor,
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
860 whose first axis iterates over examples, second axis determines fields.
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
861 If the underlying array is N-dimensional (has N axes), then the field
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
862 values are (N-2)-dimensional objects (i.e. ordinary numbers if N=2).
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
863 """
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
864
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
865 """
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
866 Construct an ArrayDataSet from the underlying numpy array (data) and
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
867 a map (fields_columns) from fieldnames to field columns. The columns of a field are specified
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
868 using the standard arguments for indexing/slicing: integer for a column index,
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
869 slice for an interval of columns (with possible stride), or iterable of column indices.
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
870 """
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
871 def __init__(self, data_array, fields_columns):
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
872 self.data=data_array
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
873 self.fields_columns=fields_columns
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
874
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
875 # check consistency and complete slices definitions
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
876 for fieldname, fieldcolumns in self.fields_columns.items():
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
877 if type(fieldcolumns) is int:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
878 assert fieldcolumns>=0 and fieldcolumns<data_array.shape[1]
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
879 elif type(fieldcolumns) is slice:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
880 start,step=None,None
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
881 if not fieldcolumns.start:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
882 start=0
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
883 if not fieldcolumns.step:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
884 step=1
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
885 if start or step:
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
886 self.fields_columns[fieldname]=slice(start,fieldcolumns.stop,step)
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
887 elif hasattr(fieldcolumns,"__iter__"): # something like a list
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
888 for i in fieldcolumns:
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
889 assert i>=0 and i<data_array.shape[1]
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
890
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
891 def fieldNames(self):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
892 return self.fields_columns.keys()
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
893
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
894 def __len__(self):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
895 return len(self.data)
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
896
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
897 #def __getitem__(self,i):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
898 # """More efficient implementation than the default"""
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
899
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
900 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
901 class ArrayDataSetIterator(object):
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
902 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
903 if fieldnames is None: fieldnames = dataset.fieldNames()
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
904 # store the resulting minibatch in a lookup-list of values
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
905 self.minibatch = LookupList(fieldnames,[0]*len(fieldnames))
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
906 self.dataset=dataset
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
907 self.minibatch_size=minibatch_size
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
908 assert offset>=0 and offset<len(dataset.data)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
909 assert offset+minibatch_size<=len(dataset.data)
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
910 self.current=offset
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
911 def __iter__(self):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
912 return self
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
913 def next(self):
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
914 sub_data = self.dataset.data[self.current:self.current+self.minibatch_size]
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
915 self.minibatch._values = [sub_data[:,self.dataset.fields_columns[f]] for f in self.minibatch._names]
43
e92244f30116 Corrected iterator logic errors
bengioy@grenat.iro.umontreal.ca
parents: 42
diff changeset
916 self.current+=self.minibatch_size
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
917 return self.minibatch
42
9b68774fcc6b Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents: 41
diff changeset
918
44
5a85fda9b19b Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents: 43
diff changeset
919 return ArrayDataSetIterator(self,fieldnames,minibatch_size,n_batches,offset)
41
283e95c15b47 Added ArrayDataSet
bengioy@grenat.iro.umontreal.ca
parents: 40
diff changeset
920
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
921 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None):
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
922 """
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
923 Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
924 user to define a set of fields as the 'input' field and a set of fields
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
925 as the 'target' field. Optionally, a single weight_field can also be defined.
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
926 """
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
927 args = ((input_fields,'input'),(output_fields,'target'))
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
928 if weight_field: args+=(([weight_field],'weight'))
36
438440ba0627 Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents: 29
diff changeset
929 return src_dataset.merge_fields(*args)
23
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
930
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
931
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
932
526e192b0699 Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents: 22
diff changeset
933