comparison dataset.py @ 71:5b699b31770a

merge
author James Bergstra <bergstrj@iro.umontreal.ca>
date Fri, 02 May 2008 18:19:35 -0400
parents dde1fb1b63ba
children 2b6656b2ef52 40476a7746e8
comparing 70:76e5c0f37165 (left column) with 71:5b699b31770a (right column)
1 1
2 from lookup_list import LookupList 2 from lookup_list import LookupList
3 Example = LookupList 3 Example = LookupList
4 import copy 4 from misc import unique_elements_list_intersection
5 from string import join
6 from sys import maxint
7 import numpy
5 8
6 class AbstractFunction (Exception): """Derived class must override this function""" 9 class AbstractFunction (Exception): """Derived class must override this function"""
7 10 class NotImplementedYet (NotImplementedError): """Work in progress, this should eventually be implemented"""
11
8 class DataSet(object): 12 class DataSet(object):
9 """A virtual base class for datasets. 13 """A virtual base class for datasets.
10 14
15 A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction
16 with learning algorithms (for training and testing them): rows/records are called examples, and
17 columns/attributes are called fields. The field value for a particular example can be an arbitrary
18 python object, which depends on the particular dataset.
19
20 We call a DataSet a 'stream' when its length is unbounded (in which case its __len__ method
21 should return sys.maxint).
22
11 A DataSet is a generator of iterators; these iterators can run through the 23 A DataSet is a generator of iterators; these iterators can run through the
12 examples in a variety of ways. A DataSet need not necessarily have a finite 24 examples or the fields in a variety of ways. A DataSet need not necessarily have a finite
13 or known length, so this class can be used to interface to a 'stream' which 25 or known length, so this class can be used to interface to a 'stream' which
14 feeds on-line learning. 26 feeds on-line learning (however, as noted below, some operations are not
27 feasible or not recommended on streams).
15 28
16 To iterate over examples, there are several possibilities: 29 To iterate over examples, there are several possibilities:
17 - for example in dataset.zip([field1, field2,field3, ...]) 30 * for example in dataset([field1, field2,field3, ...]):
18 - for val1,val2,val3 in dataset.zip([field1, field2,field3]) 31 * for val1,val2,val3 in dataset([field1, field2,field3]):
19 - for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N) 32 * for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N):
20 - for example in dataset 33 * for mini1,mini2,mini3 in dataset.minibatches([field1, field2, field3], minibatch_size=N):
34 * for example in dataset:
35 print example['x']
36 * for x,y,z in dataset:
21 Each of these is documented below. All of these iterators are expected 37 Each of these is documented below. All of these iterators are expected
22 to provide, in addition to the usual 'next()' method, a 'next_index()' method 38 to provide, in addition to the usual 'next()' method, a 'next_index()' method
23 which returns a non-negative integer pointing to the position of the next 39 which returns a non-negative integer pointing to the position of the next
24 example that will be returned by 'next()' (or of the first example in the 40 example that will be returned by 'next()' (or of the first example in the
25 next minibatch returned). This is important because these iterators 41 next minibatch returned). This is important because these iterators
26 can wrap around the dataset in order to do multiple passes through it, 42 can wrap around the dataset in order to do multiple passes through it,
27 in possibly irregular ways if the minibatch size is not a divisor of the 43 in possibly irregular ways if the minibatch size is not a divisor of the
28 dataset length. 44 dataset length.
29 45
46 To iterate over fields, one can do
47 * for field in dataset.fields():
48 for field_value in field: # iterate over the values associated to that field for all the dataset examples
49 * for field in dataset(field1,field2,...).fields() to select a subset of fields
50 * for field in dataset.fields(field1,field2,...) to select a subset of fields
51 and each of these fields is iterable over the examples:
52 * for field_examples in dataset.fields():
53 for example_value in field_examples:
54 ...
55 but when the dataset is a stream (unbounded length), it is not recommended to do
56 such things because the underlying dataset may refuse to access the different fields in
57 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
58 The result of fields() is a DataSetFields object, which iterates over fields,
59 and whose elements are iterable over examples. A DataSetFields object can
60 be turned back into a DataSet with its examples() method:
61 dataset2 = dataset1.fields().examples()
62 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
63
30 Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content. 64 Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content.
31 65
32 Note: The content of a field can be of any type. 66 Note: The content of a field can be of any type. Field values can also be 'missing'
33 67 (e.g. to handle semi-supervised learning), and in the case of numeric (numpy array)
34 Note: A dataset can recognize a potentially infinite number of field names (i.e. the field 68 fields (i.e. an ArrayFieldsDataSet), NaN plays the role of a missing value.
35 values can be computed on-demand, when particular field names are used in one of the 69 For non-numeric field values, None plays the role of a missing value.
36 iterators). 70
37 71 Dataset elements can be indexed and sub-datasets (with a subset
38 Datasets of finite length should be sub-classes of FiniteLengthDataSet. 72 of examples) can be extracted. These operations are not supported
39 73 by default in the case of streams.
40 Datasets whose elements can be indexed and whose sub-datasets (with a subset 74
41 of examples) can be extracted should be sub-classes of 75 * dataset[:n] returns a dataset with the n first examples.
42 SliceableDataSet. 76
43 77 * dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s.
44 Datasets with a finite number of fields should be sub-classes of 78
45 FiniteWidthDataSet. 79 * dataset[i] returns an Example.
46 """ 80
47 81 * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.
48 def __init__(self): 82
49 pass 83 * dataset[fieldname] an iterable over the values of the field fieldname across
50 84 the dataset (the iterable is obtained by default by calling valuesVStack
51 class Iterator(LookupList): 85 over the values for individual examples).
52 def __init__(self, ll): 86
53 LookupList.__init__(self, ll.keys(), ll.values()) 87 * dataset.<property> returns the value of a property associated with
54 self.ll = ll 88 the name <property>. The following properties should be supported:
89 - 'description': a textual description or name for the dataset
90 - 'fieldtypes': a list of types (one per field)
91
92 Datasets can be concatenated either vertically (increasing the length) or
93 horizontally (augmenting the set of fields), if they are compatible, using
94 the following operations (with the same basic semantics as numpy.hstack
95 and numpy.vstack):
96
97 * dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3])
98
99 creates a new dataset whose list of fields is the concatenation of the list of
100 fields of the argument datasets. This only works if they all have the same length.
101
102 * dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3])
103
104 creates a new dataset that concatenates the examples from the argument datasets
105 (and whose length is the sum of the length of the argument datasets). This only
106 works if they all have the same fields.
107
108 According to the same logic, and viewing a DataSetFields object associated to
109 a DataSet as a kind of transpose of it, fields1 & fields2 concatenates fields of
110 a DataSetFields fields1 and fields2, and fields1 | fields2 concatenates their
111 examples.
112
113 A dataset can hold arbitrary key-value pairs that may be used to access meta-data
114 or other properties of the dataset or associated with the dataset or the result
115 of a computation stored in a dataset. These can be accessed through the [key] syntax
116 when key is a string (or more specifically, neither an integer, a slice, nor a list).
117
118 A DataSet sub-class should always redefine the following methods:
119 * __len__ if it is not a stream
120 * fieldNames
121 * minibatches_nowrap (called by DataSet.minibatches())
122 * valuesHStack
123 * valuesVStack
124 For efficiency of implementation, a sub-class might also want to redefine
125 * hasFields
126 * __getitem__ may not be feasible with some streams
127 * __iter__
128 """
129
130 def __init__(self,description=None,fieldtypes=None):
131 if description is None:
132 # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)"
133 description = type(self).__name__ + " ( " + join([x.__name__ for x in type(self).__bases__]) + " )"
134 self.description=description
135 self.fieldtypes=fieldtypes
136
137 class MinibatchToSingleExampleIterator(object):
138 """
139 Converts the result of minibatch iterator with minibatch_size==1 into
140 single-example values in the result. Therefore the result of
141 iterating on the dataset itself gives a sequence of single examples
142 (whereas the result of iterating over minibatches gives in each
143 Example field an iterable object over the individual examples in
144 the minibatch).
145 """
146 def __init__(self, minibatch_iterator):
147 self.minibatch_iterator = minibatch_iterator
148 self.minibatch = None
55 def __iter__(self): #makes for loop work 149 def __iter__(self): #makes for loop work
56 return self 150 return self
57 def next(self): 151 def next(self):
58 self.ll.next() 152 size1_minibatch = self.minibatch_iterator.next()
59 self._values = [v[0] for v in self.ll._values] 153 if not self.minibatch:
60 return self 154 self.minibatch = Example(size1_minibatch.keys(),[value[0] for value in size1_minibatch.values()])
155 else:
156 self.minibatch._values = [value[0] for value in size1_minibatch.values()]
157 return self.minibatch
158
61 def next_index(self): 159 def next_index(self):
62 return self.ll.next_index() 160 return self.minibatch_iterator.next_index()
63 161
64 def __iter__(self): 162 def __iter__(self):
65 """Supports the syntax "for i in dataset: ..." 163 """Supports the syntax "for i in dataset: ..."
66 164
67 Using this syntax, "i" will be an Example instance (or equivalent) with 165 Using this syntax, "i" will be an Example instance (or equivalent) with
68 all the fields of DataSet self. Every field of "i" will give access to 166 all the fields of DataSet self. Every field of "i" will give access to
69 a field of a single example. Fields should be accessible via 167 a field of a single example. Fields should be accessible via
70 i["fielname"] or i[3] (in the order defined by the elements of the 168 i["fielname"] or i[3] (in the order defined by the elements of the
71 Example returned by this iterator), but the derived class is free 169 Example returned by this iterator), but the derived class is free
72 to accept any type of identifier, and add extra functionality to the iterator. 170 to accept any type of identifier, and add extra functionality to the iterator.
73 """ 171
74 return DataSet.Iterator(self.minibatches(None, minibatch_size = 1)) 172 The default implementation calls the minibatches iterator and extracts the first example of each field.
75 173 """
76 def zip(self, *fieldnames): 174 return DataSet.MinibatchToSingleExampleIterator(self.minibatches(None, minibatch_size = 1))
77 """ 175
78 Supports two forms of syntax: 176
79 177 class MinibatchWrapAroundIterator(object):
80 for i in dataset.zip([f1, f2, f3]): ... 178 """
81 179 An iterator for minibatches that handles the case where we need to wrap around the
82 for i1, i2, i3 in dataset.zip([f1, f2, f3]): ... 180 dataset because n_batches*minibatch_size > len(dataset). It is constructed from
83 181 a dataset that provides a minibatch iterator that does not need to handle that problem.
84 Using the first syntax, "i" will be an indexable object, such as a list, 182 This class is a utility for dataset subclass writers, so that they do not have to handle
85 tuple, or Example instance, such that on every iteration, i[0] is the f1 183 this issue multiple times, nor check that fieldnames are valid, nor handle the
86 field of the current example, i[1] is the f2 field, and so on. 184 empty fieldnames (meaning 'use all the fields').
87 185 """
88 Using the second syntax, i1, i2, i3 will contain the the contents of the 186 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset):
89 f1, f2, and f3 fields of a single example on each loop iteration. 187 self.dataset=dataset
90 188 self.fieldnames=fieldnames
91 The derived class may accept fieldname arguments of any type. 189 self.minibatch_size=minibatch_size
92 190 self.n_batches=n_batches
93 """ 191 self.n_batches_done=0
94 return DataSet.Iterator(self.minibatches(fieldnames, minibatch_size = 1)) 192 self.next_row=offset
193 self.L=len(dataset)
194 assert offset+minibatch_size<=self.L
195 ds_nbatches = (self.L-offset)/minibatch_size
196 if n_batches is not None:
197 ds_nbatches = max(n_batches,ds_nbatches)
198 if fieldnames:
199 assert dataset.hasFields(*fieldnames)
200 else:
201 fieldnames=dataset.fieldNames()
202 self.iterator = dataset.minibatches_nowrap(fieldnames,minibatch_size,ds_nbatches,offset)
203
204 def __iter__(self):
205 return self
206
207 def next_index(self):
208 return self.next_row
209
210 def next(self):
211 if self.n_batches and self.n_batches_done==self.n_batches:
212 raise StopIteration
213 upper = self.next_row+self.minibatch_size
214 if upper <=self.L:
215 minibatch = self.iterator.next()
216 else:
217 if not self.n_batches:
218 raise StopIteration
219 # we must concatenate (vstack) the bottom and top parts of our minibatch
220 # first get the beginning of our minibatch (top of dataset)
221 first_part = self.dataset.minibatches_nowrap(self.fieldnames,self.L-self.next_row,1,self.next_row).next()
222 second_part = self.dataset.minibatches_nowrap(self.fieldnames,upper-self.L,1,0).next()
223 minibatch = Example(self.fieldnames,
224 [self.dataset.valuesVStack(name,[first_part[name],second_part[name]])
225 for name in self.fieldnames])
226 self.next_row=upper
227 self.n_batches_done+=1
228 if upper >= self.L and self.n_batches:
229 self.next_row -= self.L
230 return minibatch
231
95 232
96 minibatches_fieldnames = None 233 minibatches_fieldnames = None
97 minibatches_minibatch_size = 1 234 minibatches_minibatch_size = 1
98 minibatches_n_batches = None 235 minibatches_n_batches = None
99 def minibatches(self, 236 def minibatches(self,
100 fieldnames = minibatches_fieldnames, 237 fieldnames = minibatches_fieldnames,
101 minibatch_size = minibatches_minibatch_size, 238 minibatch_size = minibatches_minibatch_size,
102 n_batches = minibatches_n_batches): 239 n_batches = minibatches_n_batches,
103 """ 240 offset = 0):
104 Supports three forms of syntax: 241 """
242 Return an iterator that supports three forms of syntax:
105 243
106 for i in dataset.minibatches(None,**kwargs): ... 244 for i in dataset.minibatches(None,**kwargs): ...
107 245
108 for i in dataset.minibatches([f1, f2, f3],**kwargs): ... 246 for i in dataset.minibatches([f1, f2, f3],**kwargs): ...
109 247
114 of a batch of current examples. In the second case, i[0] is 252 of a batch of current examples. In the second case, i[0] is
115 list-like container of the f1 field of a batch of current examples, i[1] is 253 list-like container of the f1 field of a batch of current examples, i[1] is
116 a list-like container of the f2 field, etc. 254 a list-like container of the f2 field, etc.
117 255
118 Using the first syntax, all the fields will be returned in "i". 256 Using the first syntax, all the fields will be returned in "i".
119 Beware that some datasets may not support this syntax, if the number
120 of fields is infinite (i.e. field values may be computed "on demand").
121
122 Using the third syntax, i1, i2, i3 will be list-like containers of the 257 Using the third syntax, i1, i2, i3 will be list-like containers of the
123 f1, f2, and f3 fields of a batch of examples on each loop iteration. 258 f1, f2, and f3 fields of a batch of examples on each loop iteration.
259
260 The minibatches iterator is expected to return upon each call to next()
261 a DataSetFields object, which is a LookupList (indexed by the field names) whose
262 elements are iterable over the minibatch examples, and which keeps a pointer to
263 a sub-dataset that can be used to iterate over the individual examples
264 in the minibatch. Hence a minibatch can be converted back to a regular
265 dataset or its fields can be looked at individually (and possibly iterated over).
124 266
125 PARAMETERS 267 PARAMETERS
126 - fieldnames (list of any type, default None): 268 - fieldnames (list of any type, default None):
127 The loop variables i1, i2, i3 (in the example above) should contain the 269 The loop variables i1, i2, i3 (in the example above) should contain the
128 f1, f2, and f3 fields of the current batch of examples. If None, the 270 f1, f2, and f3 fields of the current batch of examples. If None, the
135 - n_batches (integer, default None) 277 - n_batches (integer, default None)
136 The iterator will loop exactly this many times, and then stop. If None, 278 The iterator will loop exactly this many times, and then stop. If None,
137 the derived class can choose a default. If (-1), then the returned 279 the derived class can choose a default. If (-1), then the returned
138 iterator should support looping indefinitely. 280 iterator should support looping indefinitely.
139 281
282 - offset (integer, default 0)
283 The iterator will start at example 'offset' in the dataset, rather than the default.
284
140 Note: A list-like container is something like a tuple, list, numpy.ndarray or 285 Note: A list-like container is something like a tuple, list, numpy.ndarray or
141 any other object that supports integer indexing and slicing. 286 any other object that supports integer indexing and slicing.
142 287
143 """ 288 """
289 return DataSet.MinibatchWrapAroundIterator(self,fieldnames,minibatch_size,n_batches,offset)
290
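An editorial usage sketch of the call forms described above, assuming a hypothetical finite dataset ds with fields 'x' and 'y' (not part of either revision):

for batch in ds.minibatches(None, minibatch_size=10):
    pass                             # batch carries all fields of 10 examples
for batch in ds.minibatches(['x', 'y'], minibatch_size=10):
    x_batch = batch['x']             # or batch[0]
for x_batch, y_batch in ds.minibatches(['x', 'y'], minibatch_size=10,
                                       n_batches=5, offset=20):
    pass                             # exactly 5 minibatches, starting at example 20,
                                     # wrapping around the end of ds if necessary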
291 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
292 """
293 This is the minibatches iterator generator that sub-classes must define.
294 It does not need to worry about wrapping around multiple times across the dataset,
295 as this is handled by MinibatchWrapAroundIterator when DataSet.minibatches() is called.
296 The next() method of the returned iterator does not even need to worry about
297 the termination condition (as StopIteration will be raised by DataSet.minibatches
298 before an improper call to minibatches_nowrap's next() is made).
299 That next() method can assert that its next row will always be within [0,len(dataset)).
300 The iterator returned by minibatches_nowrap does not need to implement
301 a next_index() method either, as this will be provided by MinibatchWrapAroundIterator.
302 """
144 raise AbstractFunction() 303 raise AbstractFunction()
304
305 def __len__(self):
306 """
307 len(dataset) returns the number of examples in the dataset.
308 By default, a DataSet is a 'stream', i.e. it has an unbounded length (sys.maxint).
309 Sub-classes which implement finite-length datasets should redefine this method.
310 Some methods only make sense for finite-length datasets.
311 """
312 return maxint
313
314 def is_unbounded(self):
315 """
316 Tests whether a dataset is unbounded (e.g. a stream).
317 """
318 return len(self)==maxint
145 319
146 def hasFields(self,*fieldnames): 320 def hasFields(self,*fieldnames):
147 """ 321 """
148 Return true if the given field name (or field names, if multiple arguments are 322 Return true if the given field name (or field names, if multiple arguments are
149 given) is recognized by the DataSet (i.e. can be used as a field name in one 323 given) is recognized by the DataSet (i.e. can be used as a field name in one
150 of the iterators). 324 of the iterators).
325
326 The default implementation may be inefficient (O(# fields in dataset)), as it calls the fieldNames()
327 method. Many datasets may store their field names in a dictionary, which would allow more efficiency.
328 """
329 return len(unique_elements_list_intersection(fieldnames,self.fieldNames()))==len(fieldnames)
330
331 def fieldNames(self):
332 """
333 Return the list of field names that are supported by the iterators,
334 and for which hasFields(fieldname) would return True.
151 """ 335 """
152 raise AbstractFunction() 336 raise AbstractFunction()
153 337
154 338 def __call__(self,*fieldnames):
155 def merge_fields(self,*specifications): 339 """
156 """ 340 Return a dataset that sees only the fields whose name are specified.
157 Return a new dataset that maps old fields (of self) to new fields (of the returned 341 """
158 dataset). The minimal syntax that should be supported is the following: 342 assert self.hasFields(*fieldnames)
159 new_field_specifications = [new_field_spec1, new_field_spec2, ...] 343 return self.fields(*fieldnames).examples()
160 new_field_spec = ([old_field1, old_field2, ...], new_field) 344
161 In general both old_field and new_field should be strings, but some datasets may also 345 def fields(self,*fieldnames):
162 support additional indexing schemes within each field (e.g. column slice 346 """
163 of a matrix-like field). 347 Return a DataSetFields object associated with this dataset.
164 """ 348 """
165 raise AbstractFunction() 349 return DataSetFields(self,*fieldnames)
166
167 def merge_field_values(self,*field_value_pairs):
168 """
169 Return the value that corresponds to merging the values of several fields,
170 given as arguments (field_name, field_value) pairs with self.hasField(field_name).
171 This may be used by implementations of merge_fields.
172 Raise a ValueError if the operation is not possible.
173 """
174 fieldnames,fieldvalues = zip(*field_value_pairs)
175 raise ValueError("Unable to merge values of these fields:"+repr(fieldnames))
176
177 def examples2minibatch(self,examples):
178 """
179 Combine a list of Examples into a minibatch. A minibatch is an Example whose fields
180 are iterable over the examples of the minibatch.
181 """
182 raise AbstractFunction()
183
184 def rename(self,rename_dict):
185 """
186 Changes a dataset into one that renames fields, using a dictionnary that maps old field
187 names to new field names. The only fields visible by the returned dataset are those
188 whose names are keys of the rename_dict.
189 """
190 self_class = self.__class__
191 class SelfRenamingDataSet(RenamingDataSet,self_class):
192 pass
193 self.__class__ = SelfRenamingDataSet
194 # set the rename_dict and src fields
195 SelfRenamingDataSet.__init__(self,self,rename_dict)
196 return self
197
198 def apply_function(self,function, input_fields, output_fields, copy_inputs=True, accept_minibatches=True, cache=True):
199 """
200 Changes a dataset into one that contains as fields the results of applying
201 the given function (example-wise) to the specified input_fields. The
202 function should return a sequence whose elements will be stored in
203 fields whose names are given in the output_fields list. If copy_inputs
204 is True then the resulting dataset will also contain the fields of self.
205 If accept_minibatches, then the function may be called
206 with minibatches as arguments (what is returned by the minibatches
207 iterator). In any case, the computations may be delayed until the examples
208 of the resulting dataset are requested. If cache is True, then
209 once the output fields for some examples have been computed, then
210 are cached (to avoid recomputation if the same examples are again
211 requested).
212 """
213 self_class = self.__class__
214 class SelfApplyFunctionDataSet(ApplyFunctionDataSet,self_class):
215 pass
216 self.__class__ = SelfApplyFunctionDataSet
217 # set the required additional fields
218 ApplyFunctionDataSet.__init__(self,self,function, input_fields, output_fields, copy_inputs, accept_minibatches, cache)
219 return self
220
221
222 class FiniteLengthDataSet(DataSet):
223 """
224 Virtual interface for datasets that have a finite length (number of examples),
225 and thus recognize a len(dataset) call.
226 """
227 def __init__(self):
228 DataSet.__init__(self)
229
230 def __len__(self):
231 """len(dataset) returns the number of examples in the dataset."""
232 raise AbstractFunction()
233
234 def __call__(self,fieldname_or_fieldnames):
235 """
236 Extract one or more fields. This may be an expensive operation when the
237 dataset is large. It is not the recommanded way to access individual values
238 (use the iterators instead). If the argument is a string fieldname, then the result
239 is a sequence (iterable object) of values for that field, for the whole dataset. If the
240 argument is a list of field names, then the result is a 'batch', i.e., an Example with keys
241 corresponding to the given field names and values being iterable objects over the
242 individual example values.
243 """
244 if type(fieldname_or_fieldnames) is string:
245 minibatch = self.minibatches([fieldname_or_fieldnames],len(self)).next()
246 return minibatch[fieldname_or_fieldnames]
247 return self.minibatches(fieldname_or_fieldnames,len(self)).next()
248
249 class SliceableDataSet(DataSet):
250 """
251 Virtual interface, a subclass of DataSet for datasets which are sliceable
252 and whose individual elements can be accessed, generally respecting the
253 python semantics for [spec], where spec is either a non-negative integer
254 (for selecting one example), a python slice(start,stop,step) for selecting a regular
255 sub-dataset comprising examples start,start+step,start+2*step,...,n (with n<stop), or a
256 sequence (e.g. a list) of integers [i1,i2,...,in] for selecting
257 an arbitrary subset of examples. This is useful for obtaining
258 sub-datasets, e.g. for splitting a dataset into training and test sets.
259 """
260 def __init__(self):
261 DataSet.__init__(self)
262
263 def minibatches(self,
264 fieldnames = DataSet.minibatches_fieldnames,
265 minibatch_size = DataSet.minibatches_minibatch_size,
266 n_batches = DataSet.minibatches_n_batches):
267 """
268 If the n_batches is empty, we want to see all the examples possible
269 for the given minibatch_size (possibly missing a few at the end of the dataset).
270 """
271 # substitute the defaults:
272 if n_batches is None: n_batches = len(self) / minibatch_size
273 return DataSet.Iterator(self, fieldnames, minibatch_size, n_batches)
274 350
275 def __getitem__(self,i): 351 def __getitem__(self,i):
276 """ 352 """
277 dataset[i] returns the (i+1)-th example of the dataset. 353 dataset[i] returns the (i+1)-th example of the dataset.
278 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. 354 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1.
279 dataset[i:j:s] returns the subdataset with examples i,i+s,i+2*s,... (stopping before j). 355 dataset[i:j:s] returns the subdataset with examples i,i+s,i+2*s,... (stopping before j).
280 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in. 356 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in.
281 """ 357 dataset['key'] returns a property associated with the given 'key' string.
282 raise AbstractFunction() 358 If 'key' is a fieldname, then the VStacked field values (iterable over
283 359 field values) for that field are returned. Other keys may be supported
284 def __getslice__(self,*slice_args): 360 by different dataset subclasses. The following key names are encouraged:
285 """ 361 - 'description': a textual description or name for the dataset
286 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. 362 - '<fieldname>.type': a type name or value for a given <fieldname>
287 dataset[i:j:s] returns the subdataset with examples i,i+2,i+4...,j-2. 363
288 """ 364 Note that some stream datasets may be unable to implement random access, i.e.
289 raise AbstractFunction() 365 arbitrary slicing/indexing
366 because they can only iterate through examples one at a time (or one minibatch at a time)
291 367 and do not actually store or keep past (or future) examples.
292 class FiniteWidthDataSet(DataSet): 368
293 """ 369 The default implementation of getitem uses the minibatches iterator
294 Virtual interface for datasets that have a finite width (number of fields), 370 to obtain one example, one slice, or a list of examples. It may not
295 and thus return a list of fieldNames. 371 always be the most efficient way to obtain the result, especially if
296 """ 372 the data are actually stored in a memory array.
297 def __init__(self): 373 """
298 DataSet.__init__(self) 374 # check for an index
299 375 if type(i) is int:
300 def hasFields(self,*fields): 376 return DataSet.MinibatchToSingleExampleIterator(
301 has_fields=True 377 self.minibatches(minibatch_size=1,n_batches=1,offset=i)).next()
302 fieldnames = self.fieldNames() 378 rows=None
303 for name in fields: 379 # or a slice
304 if name not in fieldnames: 380 if type(i) is slice:
305 has_fields=False 381 start = i.start if i.start else 0
306 return has_fields 382 step = i.step if i.step else 1
307 383 if step == 1:
384 return self.minibatches(minibatch_size=i.stop-start,n_batches=1,offset=start).next().examples()
385 rows = range(start,i.stop,step)
386 # or a list of indices
387 elif type(i) is list:
388 rows = i
389 if rows is not None:
390 examples = [self[row] for row in rows]
391 fields_values = zip(*examples)
392 return MinibatchDataSet(
393 Example(self.fieldNames(),[ self.valuesVStack(fieldname,field_values)
394 for fieldname,field_values
395 in zip(self.fieldNames(),fields_values)]))
396 # else check for a fieldname
397 if self.hasFields(i):
398 return self.minibatches(fieldnames=[i],minibatch_size=len(self),n_batches=1,offset=0).next()[0]
399 # else we are trying to access a property of the dataset
400 assert i in self.__dict__ # else it means we are trying to access a non-existing property
401 return self.__dict__[i]
402
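An editorial sketch of the indexing behaviours implemented above, for a hypothetical finite dataset ds with fields 'x' and 'y' (not part of either revision):

example = ds[3]              # a single Example, obtained through a size-1 minibatch
subset  = ds[0:100:2]        # a dataset holding examples 0,2,...,98
chosen  = ds[[1, 5, 7]]      # a dataset holding examples 1, 5 and 7
x_all   = ds['x']            # the values of field 'x', vertically stacked over ds
descr   = ds['description']  # a property looked up in ds.__dict__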
403 def valuesHStack(self,fieldnames,fieldvalues):
404 """
405 Return a value that corresponds to concatenating (horizontally) several field values.
406 This can be useful to merge some fields. The implementation of this operation is likely
407 to involve a copy of the original values. When the values are numpy arrays, the
408 result should be numpy.hstack(values). If it makes sense, this operation should
409 work as well when each value corresponds to multiple examples in a minibatch
410 e.g. if each value is a Ni-vector and a minibatch of length L is a LxNi matrix,
411 then the result should be a Lx(N1+N2+..) matrix equal to numpy.hstack(values).
412 The default is to use numpy.hstack for numpy.ndarray values, and a list
413 pointing to the original values for other data types.
414 """
415 all_numpy=True
416 for value in fieldvalues:
417 if not type(value) is numpy.ndarray:
418 all_numpy=False
419 if all_numpy:
420 return numpy.hstack(fieldvalues)
421 # the default implementation of horizontal stacking is to put values in a list
422 return fieldvalues
423
424
425 def valuesVStack(self,fieldname,values):
426 """
427 Return a value that corresponds to concatenating (vertically) several values of the
428 same field. This can be important to build a minibatch out of individual examples. This
429 is likely to involve a copy of the original values. When the values are numpy arrays, the
430 result should be numpy.vstack(values).
431 The default is to use numpy.vstack for numpy.ndarray values, and a list
432 pointing to the original values for other data types.
433 """
434 all_numpy=True
435 for value in values:
436 if not type(value) is numpy.ndarray:
437 all_numpy=False
438 if all_numpy:
439 return numpy.vstack(values)
440 # the default implementation of vertical stacking is to put values in a list
441 return values
442
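For instance, with the default implementations above (an editorial sketch assuming numpy array field values; not part of either revision):

import numpy
ds = DataSet()
v1 = numpy.ones((1, 3))
v2 = numpy.zeros((1, 3))
rows  = ds.valuesVStack('x', [v1, v2])          # a 2x3 array, one row per example
wide  = ds.valuesHStack(['x', 'y'], [v1, v2])   # a 1x6 array, fields side by side
other = ds.valuesVStack('label', ['a', 'b'])    # non-numpy values: the list itself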
443 def __or__(self,other):
444 """
445 dataset1 | dataset2 returns a dataset whose list of fields is the concatenation of the list of
446 fields of the argument datasets. This only works if they all have the same length.
447 """
448 return HStackedDataSet(self,other)
449
450 def __and__(self,other):
451 """
452 dataset1 & dataset2 is a dataset that concatenates the examples from the argument datasets
453 (and whose length is the sum of the length of the argument datasets). This only
454 works if they all have the same fields.
455 """
456 return VStackedDataSet(self,other)
457
458 def hstack(datasets):
459 """
460 hstack(dataset1,dataset2,...) returns dataset1 | dataset2 | ...
461 which is a dataset whose fields list is the concatenation of the fields
462 of the individual datasets.
463 """
464 assert len(datasets)>0
465 if len(datasets)==1:
466 return datasets[0]
467 return HStackedDataSet(datasets)
468
469 def vstack(datasets):
470 """
471 vstack(dataset1,dataset2,...) returns dataset1 & dataset2 & ...
472 which is a dataset which iterates first over the examples of dataset1, then
473 over those of dataset2, etc.
474 """
475 assert len(datasets)>0
476 if len(datasets)==1:
477 return datasets[0]
478 return VStackedDataSet(datasets)
479
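An editorial sketch of the two concatenation operators defined above, with hypothetical datasets ds_inputs/ds_targets of equal length and ds_train/ds_valid with identical fields (not part of either revision):

wide  = ds_inputs | ds_targets             # union of fields (an HStackedDataSet)
tall  = ds_train & ds_valid                # concatenated examples (a VStackedDataSet)
wide2 = hstack([ds_inputs, ds_targets])    # equivalent to the | chain
tall2 = vstack([ds_train, ds_valid])       # equivalent to the & chain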
480 class FieldsSubsetDataSet(DataSet):
481 """
482 A sub-class of DataSet that selects a subset of the fields.
483 """
484 def __init__(self,src,fieldnames):
485 self.src=src
486 self.fieldnames=fieldnames
487 assert src.hasFields(*fieldnames)
488 self.valuesHStack = src.valuesHStack
489 self.valuesVStack = src.valuesVStack
490
491 def __len__(self): return len(self.src)
492
308 def fieldNames(self): 493 def fieldNames(self):
309 """Return the list of field names that are supported by the iterators, 494 return self.fieldnames
310 and for which hasFields(fieldname) would return True.""" 495
311 raise AbstractFunction() 496 def __iter__(self):
312 497 class FieldsSubsetIterator(object):
313 498 def __init__(self,ds):
314 class RenamingDataSet(FiniteWidthDataSet): 499 self.ds=ds
315 """A DataSet that wraps another one, and makes it look like the field names 500 self.src_iter=ds.src.__iter__()
316 are different 501 self.example=None
317 502 def __iter__(self): return self
318 Renaming is done by a dictionary that maps new names to the old ones used in 503 def next(self):
319 self.src. 504 complete_example = self.src_iter.next()
320 """ 505 if self.example:
321 def __init__(self, src, rename_dct): 506 self.example._values=[complete_example[field]
322 DataSet.__init__(self) 507 for field in self.ds.fieldnames]
323 self.src = src 508 else:
324 self.rename_dct = copy.copy(rename_dct) 509 self.example=Example(self.ds.fieldnames,
510 [complete_example[field] for field in self.ds.fieldnames])
511 return self.example
512 return FieldsSubsetIterator(self)
513
514 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
515 assert self.hasFields(*fieldnames)
516 return self.src.minibatches_nowrap(fieldnames,minibatch_size,n_batches,offset)
517 def __getitem__(self,i):
518 return FieldsSubsetDataSet(self.src[i],self.fieldnames)
519
520
521 class DataSetFields(LookupList):
522 """
523 Although a DataSet iterates over examples (like rows of a matrix), an associated
524 DataSetFields iterates over fields (like columns of a matrix), and can be understood
525 as a transpose of the associated dataset.
526
527 To iterate over fields, one can do
528 * for fields in dataset.fields()
529 * for fields in dataset(field1,field2,...).fields() to select a subset of fields
530 * for fields in dataset.fields(field1,field2,...) to select a subset of fields
531 and each of these fields is iterable over the examples:
532 * for field_examples in dataset.fields():
533 for example_value in field_examples:
534 ...
535 but when the dataset is a stream (unbounded length), it is not recommended to do
536 such things because the underlying dataset may refuse to access the different fields in
537 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
538 The result of fields() is a DataSetFields object, which iterates over fields,
539 and whose elements are iterable over examples. A DataSetFields object can
540 be turned back into a DataSet with its examples() method:
541 dataset2 = dataset1.fields().examples()
542 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
543
544 DataSetFields can be concatenated vertically or horizontally. To be consistent with
545 the syntax used for DataSets, the | concatenates the fields and the & concatenates
546 the examples.
547 """
548 def __init__(self,dataset,*fieldnames):
549 original_dataset=dataset
550 if not fieldnames:
551 fieldnames=dataset.fieldNames()
552 elif not fieldnames==dataset.fieldNames():
553 dataset = FieldsSubsetDataSet(dataset,fieldnames)
554 assert dataset.hasFields(*fieldnames)
555 self.dataset=dataset
556
557 if isinstance(dataset,MinibatchDataSet):
558 LookupList.__init__(self,fieldnames,list(dataset._fields))
559 elif isinstance(original_dataset,MinibatchDataSet):
560 LookupList.__init__(self,fieldnames,
561 [original_dataset._fields[field]
562 for field in fieldnames])
563 else:
564 minibatch_iterator = dataset.minibatches(fieldnames,
565 minibatch_size=len(dataset),
566 n_batches=1)
567 minibatch=minibatch_iterator.next()
568 LookupList.__init__(self,fieldnames,minibatch)
569
570 def examples(self):
571 return self.dataset
572
573 def __or__(self,other):
574 """
575 fields1 | fields2 is a DataSetFields whose list of examples is the concatenation
576 of the list of examples of DataSetFields fields1 and fields2.
577 """
578 return (self.examples() & other.examples()).fields()
579
580 def __and__(self,other):
581 """
582 fields1 & fields2 is a DataSetFields whose list of fields is the concatenation
583 of the fields of DataSetFields fields1 and fields2.
584 """
585 return (self.examples() | other.examples()).fields()
586
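An editorial round-trip sketch relating a dataset and its DataSetFields view, for a hypothetical finite dataset ds (not part of either revision):

fields = ds.fields('x', 'y')        # a DataSetFields: a LookupList of per-field iterables
for field_values in fields:         # iterate over the selected fields...
    for value in field_values:      # ...then over the examples of each field
        pass
ds2 = fields.examples()             # back to a DataSet over the same examples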
587
588 class MinibatchDataSet(DataSet):
589 """
590 Turn a LookupList of same-length fields into an example-iterable dataset.
591 Each element of the lookup-list should be an iterable and sliceable, all of the same length.
592 """
593 def __init__(self,fields_lookuplist,values_vstack=DataSet().valuesVStack,
594 values_hstack=DataSet().valuesHStack):
595 """
596 The user can (and generally should) also provide values_vstack(fieldname,fieldvalues)
597 and values_hstack(fieldnames,fieldvalues) functions behaving with the same
598 semantics as the DataSet methods of the same name (but without the self argument).
599 """
600 self._fields=fields_lookuplist
601 assert len(fields_lookuplist)>0
602 self.length=len(fields_lookuplist[0])
603 for field in fields_lookuplist[1:]:
604 assert self.length==len(field)
605 self.values_vstack=values_vstack
606 self.values_hstack=values_hstack
607
608 def __len__(self):
609 return self.length
610
611 def __getitem__(self,i):
612 if type(i) in (int,slice,list):
613 return DataSetFields(MinibatchDataSet(
614 Example(self._fields.keys(),[field[i] for field in self._fields])),*self._fields.keys())
615 if self.hasFields(i):
616 return self._fields[i]
617 assert i in self.__dict__ # else it means we are trying to access a non-existing property
618 return self.__dict__[i]
325 619
326 def fieldNames(self): 620 def fieldNames(self):
327 return self.rename_dct.keys() 621 return self._fields.keys()
328 622
329 def minibatches(self, 623 def hasFields(self,*fieldnames):
330 fieldnames = DataSet.minibatches_fieldnames, 624 for fieldname in fieldnames:
331 minibatch_size = DataSet.minibatches_minibatch_size, 625 if fieldname not in self._fields.keys():
332 n_batches = DataSet.minibatches_n_batches): 626 return False
333 dct = self.rename_dct 627 return True
334 new_fieldnames = [dct.get(f, f) for f in fieldnames] 628
335 return self.src.minibatches(new_fieldnames, minibatches_size, n_batches) 629 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
336 630 class Iterator(object):
337 631 def __init__(self,ds):
338 # we may want ArrayDataSet defined in another python file 632 self.ds=ds
339 633 self.next_example=offset
340 import numpy 634 assert minibatch_size > 0
341 635 if offset+minibatch_size > ds.length:
342 def as_array_dataset(dataset): 636 raise NotImplementedError()
343 # Generally datasets can be efficient by making data fields overlap, but
344 # this function doesn't know which fields overlap. So, it should check if
345 # dataset supports an as_array_dataset member function, and return that if
346 # possible.
347 if hasattr(dataset, 'as_array_dataset'):
348 return dataset.as_array_dataset()
349
350 raise NotImplementedError
351
352 # Make ONE big minibatch with all the examples, to separate the fields.
353 n_examples = len(dataset)
354 batch = dataset.minibatches( minibatch_size = len(dataset)).next()
355
356 # Each field of the underlying dataset must be convertible to a numpy array of the same type
357 # currently just double, but should use the smallest compatible dtype
358 n_fields = len(batch)
359 fieldnames = batch.fields.keys()
360 total_width = 0
361 type = None
362 fields = LookupList()
363 for i in xrange(n_fields):
364 field = array(batch[i])
365 assert field.shape[0]==n_examples
366 width = field.shape[1]
367 start=total_width
368 total_width += width
369 fields[fieldnames[i]]=slice(start,total_width,1)
370 # many complicated things remain to be done:
371 # - find common dtype
372 # - decide what to do with extra dimensions if not the same in all fields
373 # - try to see if we can avoid the copy?
374
375 class ArrayDataSet(FiniteLengthDataSet,FiniteWidthDataSet,SliceableDataSet):
376 """
377 An ArrayDataSet behaves like a numpy array but adds the notion of named fields
378 from DataSet (and the ability to view the values of multiple fields as an 'Example').
379 It is a fixed-length and fixed-width dataset
380 in which each element is a fixed dimension numpy array or a number, hence the whole
381 dataset corresponds to a numpy array. Fields
382 must correspond to a slice of array columns or to a list of column numbers.
383 If the dataset has fields,
384 each 'example' is just a one-row ArrayDataSet, otherwise it is a numpy array.
385 Any dataset can also be converted to a numpy array (losing the notion of fields
386 by the numpy.array(dataset) call.
387 """
388
389 class Iterator(LookupList):
390 """An iterator over a finite dataset that implements wrap-around"""
391 def __init__(self, dataset, fieldnames, minibatch_size, next_max):
392 if fieldnames is None: fieldnames = dataset.fieldNames()
393 LookupList.__init__(self, fieldnames, [0]*len(fieldnames))
394 self.dataset=dataset
395 self.minibatch_size=minibatch_size
396 self.next_count = 0
397 self.next_max = next_max
398 self.current = -self.minibatch_size
399 assert minibatch_size > 0
400 if minibatch_size >= len(dataset):
401 raise NotImplementedError()
402
403 def __iter__(self): #makes for loop work
404 return self
405
406 @staticmethod
407 def matcat(a, b):
408 a0, a1 = a.shape
409 b0, b1 = b.shape
410 assert a1 == b1
411 assert a.dtype is b.dtype
412 rval = numpy.empty( (a0 + b0, a1), dtype=a.dtype)
413 rval[:a0,:] = a
414 rval[a0:,:] = b
415 return rval
416
417 def next_index(self):
418 n_rows = self.dataset.data.shape[0]
419 next_i = self.current+self.minibatch_size
420 if next_i >= n_rows:
421 next_i -= n_rows
422 return next_i
423
424 def next(self):
425
426 #check for end-of-loop
427 self.next_count += 1
428 if self.next_count == self.next_max:
429 raise StopIteration
430
431 #determine the first and last elements of the minibatch slice we'll return
432 n_rows = self.dataset.data.shape[0]
433 self.current = self.next_index()
434 upper = self.current + self.minibatch_size
435
436 data = self.dataset.data
437
438 if upper <= n_rows:
439 #this is the easy case, we only need once slice
440 dataview = data[self.current:upper]
441 else:
442 # the minibatch wraps around the end of the dataset
443 dataview = data[self.current:]
444 upper -= n_rows
445 assert upper > 0
446 dataview = self.matcat(dataview, data[:upper])
447
448 self._values = [dataview[:, self.dataset.fields[f]]\
449 for f in self._names]
450 return self
451
452
453 def __init__(self, data, fields=None):
454 """
455 There are two ways to construct an ArrayDataSet: (1) from an
456 existing dataset (which may result in a copy of the data in a numpy array),
457 or (2) from a numpy.array (the data argument), along with an optional description
458 of the fields (a LookupList of column slices (or column lists) indexed by field names).
459 """
460 self.data=data
461 self.fields=fields
462 rows, cols = data.shape
463
464 if fields:
465 for fieldname,fieldslice in fields.items():
466 assert type(fieldslice) is int or isinstance(fieldslice,slice) or hasattr(fieldslice,"__iter__")
467 if hasattr(fieldslice,"__iter__"): # is a sequence
468 for i in fieldslice:
469 assert type(i) is int
470 elif isinstance(fieldslice,slice):
471 # make sure fieldslice.start and fieldslice.step are defined
472 start=fieldslice.start
473 step=fieldslice.step
474 if not start:
475 start=0
476 if not step:
477 step=1
478 if not fieldslice.start or not fieldslice.step:
479 fields[fieldname] = fieldslice = slice(start,fieldslice.stop,step)
480 # and coherent with the data array
481 assert fieldslice.start >= 0 and fieldslice.stop <= cols
482
483 def minibatches(self,
484 fieldnames = DataSet.minibatches_fieldnames,
485 minibatch_size = DataSet.minibatches_minibatch_size,
486 n_batches = DataSet.minibatches_n_batches):
487 """
488 If the fieldnames list is None, it means that we want to see ALL the fields.
489
490 If the n_batches is None, we want to see all the examples possible
491 for the given minibatch_size (possibly missing some near the end).
492 """
493 # substitute the defaults:
494 if n_batches is None: n_batches = len(self) / minibatch_size
495 return ArrayDataSet.Iterator(self, fieldnames, minibatch_size, n_batches)
496
497 def fieldNames(self):
498 """Return the list of field names that are supported by getattr and hasField."""
499 return self.fields.keys()
500
501 def __len__(self):
502 """len(dataset) returns the number of examples in the dataset."""
503 return len(self.data)
504
505 def __getitem__(self,i):
506 """
507 dataset[i] returns the (i+1)-th Example of the dataset.
508 If there are no fields the result is just a numpy array (for the i-th row of the dataset data matrix).
509 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1.
510 dataset[i:j:s] returns the subdataset with examples i,i+2,i+4...,j-2.
511 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in.
512 """
513 if self.fields:
514 fieldnames,fieldslices=zip(*self.fields.items())
515 return Example(self.fields.keys(),[self.data[i,fieldslice] for fieldslice in self.fields.values()])
516 else:
517 return self.data[i]
518
519 def __getslice__(self,*args):
520 """
521 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1.
522 dataset[i:j:s] returns the subdataset with examples i,i+2,i+4...,j-2.
523 """
524 return ArrayDataSet(self.data.__getslice__(*args), fields=self.fields)
525
526 def indices_of_unique_columns_used(self):
527 """
528 Return the unique indices of the columns actually used by the fields, and a boolean
529 that signals (if True) that used columns overlap. If they do then the
530 indices are not repeated in the result.
531 """
532 columns_used = numpy.zeros((self.data.shape[1]),dtype=bool)
533 overlapping_columns = False
534 for field_slice in self.fields.values():
535 if sum(columns_used[field_slice])>0: overlapping_columns=True
536 columns_used[field_slice]=True
537 return [i for i,used in enumerate(columns_used) if used],overlapping_columns
538
539 def slice_of_unique_columns_used(self):
540 """
541 Return None if the indices_of_unique_columns_used do not form a slice. If they do,
542 return that slice. It means that the columns used can be extracted
543 from the data array without making a copy. If the fields overlap
544 but their unique columns used form a slice, still return that slice.
545 """
546 columns_used,overlapping_columns = self.indices_of_columns_used()
547 mappable_to_one_slice = True
548 if not overlapping_fields:
549 start=0
550 while start<len(columns_used) and not columns_used[start]:
551 start+=1
552 stop=len(columns_used)
553 while stop>0 and not columns_used[stop-1]:
554 stop-=1
555 step=0
556 i=start
557 while i<stop:
558 j=i+1
559 while j<stop and not columns_used[j]:
560 j+=1
561 if step:
562 if step!=j-i:
563 mappable_to_one_slice = False
564 break
565 else:
566 step = j-i
567 i=j
568 return slice(start,stop,step)
569
570 class ApplyFunctionDataSet(FiniteWidthDataSet):
571 """
572 A dataset that contains as fields the results of applying
573 a given function (example-wise) to specified input_fields of a source
574 dataset. The function should return a sequence whose elements will be stored in
575 fields whose names are given in the output_fields list. If copy_inputs
576 is True then the resulting dataset will also contain the fields of the source.
577 dataset. If accept_minibatches, then the function expects
578 minibatches as arguments (what is returned by the minibatches
579 iterator). In any case, the computations may be delayed until the examples
580 of self are requested. If cache is True, then
581 once the output fields for some examples have been computed, then
582 are cached (to avoid recomputation if the same examples are again requested).
583 """
584 def __init__(src,function, input_fields, output_fields, copy_inputs=True, accept_minibatches=True, cache=True, compute_now=False):
585 DataSet.__init__(self)
586 self.src=src
587 self.function=function
588 assert src.hasFields(input_fields)
589 self.input_fields=input_fields
590 self.output_fields=output_fields
591 assert not (copy_inputs and compute_now and not hasattr(src,'fieldNames'))
592 self.copy_inputs=copy_inputs
593 self.accept_minibatches=accept_minibatches
594 self.cache=cache
595 self.compute_now=compute_now
596 if compute_now:
597 assert hasattr(src,'__len__') and len(src)>=0
598 fieldnames = output_fields
599 if copy_inputs: fieldnames = src.fieldNames() + output_fields
600 if accept_minibatches:
601 # make a single minibatch with all the inputs
602 inputs = src.minibatches(input_fields,len(src)).next()
603 # and apply the function to it, and transpose into a list of examples (field values, actually)
604 self.cached_examples = zip(*Example(output_fields,function(*inputs)))
605 else:
606 # compute a list with one tuple per example, with the function outputs
607 self.cached_examples = [ function(input) for input in src.zip(input_fields) ]
608 elif cache:
609 # maybe a fixed-size array kind of structure would be more efficient than a list
610 # in the case where src is FiniteDataSet. -YB
611 self.cached_examples = []
612
613 def fieldNames(self):
614 if self.copy_inputs:
615 return self.output_fields + self.src.fieldNames()
616 return self.output_fields
617
618 def minibatches(self,
619 fieldnames = DataSet.minibatches_fieldnames,
620 minibatch_size = DataSet.minibatches_minibatch_size,
621 n_batches = DataSet.minibatches_n_batches):
622
623 class Iterator(LookupList):
624
625 def __init__(self,dataset):
626 if fieldnames is None:
627 assert hasattr(dataset,"fieldNames")
628 fieldnames = dataset.fieldNames()
629 self.example_index=0
630 LookupList.__init__(self, fieldnames, [0]*len(fieldnames))
631 self.dataset=dataset
632 self.src_iterator=self.src.minibatches(list(set.union(set(fieldnames),set(dataset.input_fields))),
633 minibatch_size,n_batches)
634 self.fieldnames_not_in_input = []
635 if self.copy_inputs:
636 self.fieldnames_not_in_input = filter(lambda x: not x in dataset.input_fields, fieldnames)
637
638 def __iter__(self): 637 def __iter__(self):
639 return self 638 return self
640 639 def next(self):
641 def next_index(self): 640 upper = self.next_example+minibatch_size
642 return self.src_iterator.next_index() 641 assert upper<=self.ds.length
642 minibatch = Example(self.ds._fields.keys(),
643 [field[self.next_example:upper]
644 for field in self.ds._fields])
645 self.next_example+=minibatch_size
646 return DataSetFields(MinibatchDataSet(minibatch),*fieldnames)
647
648 return Iterator(self)
649
650 def valuesVStack(self,fieldname,fieldvalues):
651 return self.values_vstack(fieldname,fieldvalues)
652
653 def valuesHStack(self,fieldnames,fieldvalues):
654 return self.values_hstack(fieldnames,fieldvalues)
655
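An editorial construction sketch for MinibatchDataSet, assuming numpy array fields (not part of either revision):

import numpy
fields = Example(['x', 'y'],
                 [numpy.random.rand(4, 2), numpy.arange(4)])
mbds = MinibatchDataSet(fields)     # a 4-example dataset with fields 'x' and 'y'
for ex in mbds:                     # iterate over its 4 examples
    print ex['x'], ex['y']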
656 class HStackedDataSet(DataSet):
657 """
658 A DataSet that wraps several datasets and shows a view that includes all their fields,
659 i.e. whose list of fields is the concatenation of their lists of fields.
660
661 If a field name is found in more than one of the datasets, then either an error is
662 raised or the fields are renamed (either by prefixing the __name__ attribute
663 of the dataset + ".", if it exists, or by suffixing the dataset index in the argument list).
664
665 TODO: automatically detect a chain of stacked datasets due to A | B | C | D ...
666 """
667 def __init__(self,datasets,accept_nonunique_names=False,description=None,field_types=None):
668 DataSet.__init__(self,description,field_types)
669 self.datasets=datasets
670 self.accept_nonunique_names=accept_nonunique_names
671 self.fieldname2dataset={}
672
673 def rename_field(fieldname,dataset,i):
674 if hasattr(dataset,"__name__"):
675 return dataset.__name__ + "." + fieldname
676 return fieldname+"."+str(i)
643 677
678 # make sure all datasets have the same length and unique field names
679 self.length=None
680 names_to_change=[]
681 for i in xrange(len(datasets)):
682 dataset = datasets[i]
683 length=len(dataset)
684 if self.length:
685 assert self.length==length
686 else:
687 self.length=length
688 for fieldname in dataset.fieldNames():
689 if fieldname in self.fieldname2dataset: # name conflict!
690 if accept_nonunique_names:
691 fieldname=rename_field(fieldname,dataset,i)
692 names_to_change.append((fieldname,i))
693 else:
694 raise ValueError("Incompatible datasets: non-unique field name = "+fieldname)
695 self.fieldname2dataset[fieldname]=i
696 for fieldname,i in names_to_change:
697 del self.fieldname2dataset[fieldname]
698 self.fieldname2dataset[rename_field(fieldname,self.datasets[i],i)]=i
699
700 def hasFields(self,*fieldnames):
701 for fieldname in fieldnames:
702 if not fieldname in self.fieldname2dataset:
703 return False
704 return True
705
706 def fieldNames(self):
707 return self.fieldname2dataset.keys()
708
709 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
710
711 class HStackedIterator(object):
712 def __init__(self,hsds,iterators):
713 self.hsds=hsds
714 self.iterators=iterators
715 def __iter__(self):
716 return self
644 def next(self): 717 def next(self):
645 example_index = self.src_iterator.next_index() 718 # concatenate all the fields of the minibatches
646 src_examples = self.src_iterator.next() 719 minibatch = reduce(LookupList.__add__,[iterator.next() for iterator in self.iterators])
647 if self.dataset.copy_inputs: 720 # and return a DataSetFields whose dataset is the transpose (=examples()) of this minibatch
648 function_inputs = [src_examples[field_name] for field_name in self.dataset.input_fields] 721 return DataSetFields(MinibatchDataSet(minibatch,self.hsds.valuesVStack,
722 self.hsds.valuesHStack),
723 *(fieldnames if fieldnames else self.hsds.fieldNames()))
724
725 assert fieldnames is None or self.hasFields(*fieldnames)
726 # find out which underlying datasets are necessary to service the required fields
727 # and construct corresponding minibatch iterators
728 if fieldnames:
729 datasets=set([])
730 fields_in_dataset={}
731 for fieldname in fieldnames:
732 dataset=self.datasets[self.fieldname2dataset[fieldname]]
733 datasets.add(dataset)
734 fields_in_dataset.setdefault(dataset,[]).append(fieldname)
735 datasets=list(datasets)
736 iterators=[dataset.minibatches(fields_in_dataset[dataset],minibatch_size,n_batches,offset)
737 for dataset in datasets]
738 else:
739 datasets=self.datasets
740 iterators=[dataset.minibatches(None,minibatch_size,n_batches,offset) for dataset in datasets]
741 return HStackedIterator(self,iterators)
742
743
744 def valuesVStack(self,fieldname,fieldvalues):
745 return self.datasets[self.fieldname2dataset[fieldname]].valuesVStack(fieldname,fieldvalues)
746
747 def valuesHStack(self,fieldnames,fieldvalues):
748 """
749 We will use the sub-dataset associated with the first fieldname in the fieldnames list
750 to do the work, hoping that it can cope with the other values (i.e. won't care
751 about the incompatible fieldnames). Hence this heuristic will always work if
752 all the fieldnames are of the same sub-dataset.
753 """
754 return self.datasets[self.fieldname2dataset[fieldnames[0]]].valuesHStack(fieldnames,fieldvalues)
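To make the stacking and field-dispatch behaviour above concrete, here is a small usage sketch. It is illustrative only: it assumes this module is importable as 'dataset' and relies on the ArrayDataSet class defined further below.

# Hypothetical usage sketch -- not part of the module.
import numpy
from dataset import ArrayDataSet, HStackedDataSet

left  = ArrayDataSet(numpy.ones((4, 3)),  {'x': slice(0, 2), 'y': 2})
right = ArrayDataSet(numpy.zeros((4, 2)), {'z': slice(0, 2)})

hs = HStackedDataSet([left, right])        # 4 examples with fields 'x', 'y' and 'z'
for x, z in hs.minibatches(['x', 'z'], minibatch_size=2):
    pass                                   # 'x' is served by 'left', 'z' by 'right'

# With accept_nonunique_names=True, a field named 'x' appearing in both
# sub-datasets would be renamed 'x.1' (suffix = dataset index), or
# '<dataset.__name__>.x' when the sub-dataset defines a __name__.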
755
756 class VStackedDataSet(DataSet):
757 """
758 A DataSet that wraps several datasets and shows a view that includes all their examples,
759 in the order provided. This clearly assumes that they all have the same field names
760 and all (except possibly the last one) are of finite length.
761
762 TODO: automatically detect a chain of stacked datasets due to A + B + C + D ...
763 """
764 def __init__(self,datasets):
765 self.datasets=datasets
766 self.length=0
767 self.index2dataset={}
768 assert len(datasets)>0
769 fieldnames = datasets[-1].fieldNames()
770 self.datasets_start_row=[]
771 # We use this map from row index to dataset index for constant-time random access of examples,
772 # to avoid having to search for the appropriate dataset each time a slice is asked for.
773 for k,dataset in enumerate(datasets[0:-1]):
774 assert not dataset.is_unbounded() # All VStacked datasets (except possibly the last) must be bounded (have a length).
775 L=len(dataset)
776 for i in xrange(L):
777 self.index2dataset[self.length+i]=k
778 self.datasets_start_row.append(self.length)
779 self.length+=L
780 assert dataset.fieldNames()==fieldnames
781 self.datasets_start_row.append(self.length)
782 self.length+=len(datasets[-1])
783 # If length is very large, we should use a more memory-efficient mechanism
784 # that does not store all indices
785 if self.length>1000000:
786 # 1 million entries would require about 60 meg for the index2dataset map
787 # TODO
788 print "A more efficient mechanism for index2dataset should be implemented"
789
790 def __len__(self):
791 return self.length
792
793 def fieldNames(self):
794 return self.datasets[0].fieldNames()
795
796 def hasFields(self,*fieldnames):
797 return self.datasets[0].hasFields(*fieldnames)
798
799 def locate_row(self,row):
800 """Return (dataset_index, row_within_dataset) for global row number"""
801 dataset_index = self.index2dataset.get(row,len(self.datasets)-1) # rows of the last dataset are not in the map
802 row_within_dataset = row-self.datasets_start_row[dataset_index]
803 return dataset_index, row_within_dataset
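As a worked example of this row-addressing scheme (a sketch, not part of the class): suppose the VStackedDataSet wraps two datasets of lengths 3 and 5.

# index2dataset covers only the rows of the non-final datasets: {0:0, 1:0, 2:0}
# datasets_start_row == [0, 3] and len(self) == 8.
# locate_row(1) -> (0, 1)               row 1 of the first dataset
# locate_row(4) -> (1, 4-3) == (1, 1)   row 1 of the second (last) dataset,
#                                       found via the .get() fallback above.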
804
805 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
806
807 class VStackedIterator(object):
808 def __init__(self,vsds):
809 self.vsds=vsds
810 self.next_row=offset
811 self.next_dataset_index,self.next_dataset_row=self.vsds.locate_row(offset)
812 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \
813 self.next_iterator(vsds.datasets[self.next_dataset_index],self.next_dataset_row,n_batches)
814
815 def next_iterator(self,dataset,starting_offset,batches_left):
816 L=len(dataset)
817 ds_nbatches = (L-starting_offset)/minibatch_size
818 if batches_left is not None:
819 ds_nbatches = min(batches_left,ds_nbatches)
820 if minibatch_size>L:
821 ds_minibatch_size=L
822 n_left_in_mb=minibatch_size-L
823 ds_nbatches=1
649 else: 824 else:
650 function_inputs = src_examples 825 n_left_in_mb=0
651 if self.dataset.cached_examples: 826 return dataset.minibatches(fieldnames,minibatch_size,ds_nbatches,starting_offset), \
652 cache_len=len(self.cached_examples) 827 L-(starting_offset+ds_nbatches*minibatch_size), n_left_in_mb
653 if example_index<cache_len+minibatch_size: 828
654 outputs_list = self.cached_examples[example_index:example_index+minibatch_size] 829 def move_to_next_dataset(self):
655 # convert the minibatch list of examples 830 if self.n_left_at_the_end_of_ds>0:
656 # into a list of fields each of which iterate over the minibatch 831 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \
657 outputs = zip(*outputs_list) 832 self.next_iterator(self.vsds.datasets[self.next_dataset_index],
658 else: 833 self.n_left_at_the_end_of_ds,1)
659 outputs = self.dataset.function(*function_inputs)
660 if self.dataset.cache:
661 # convert the list of fields, each of which can iterate over the minibatch
662 # into a list of examples in the minibatch (each of which is a list of field values)
663 outputs_list = zip(*outputs)
664 # copy the outputs_list into the cache
665 for i in xrange(cache_len,example_index):
666 self.cached_examples.append(None)
667 self.cached_examples += outputs_list
668 else: 834 else:
669 outputs = self.dataset.function(*function_inputs) 835 self.next_dataset_index +=1
836 if self.next_dataset_index==len(self.vsds.datasets):
837 self.next_dataset_index = 0
838 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \
839 self.next_iterator(self.vsds.datasets[self.next_dataset_index],offset,n_batches)
670 840
671 return Example(self.fieldnames_not_in_input+self.dataset.output_fields, 841 def __iter__(self):
672 [src_examples[field_name] for field_name in self.fieldnames_not_in_input]+outputs) 842 return self
673 843
674 844 def next(self):
675 for fieldname in fieldnames: 845 dataset=self.vsds.datasets[self.next_dataset_index]
676 assert fieldname in self.output_fields or self.src.hasFields(fieldname) 846 mb = self.current_iterator.next()
677 return Iterator(self) 847 if self.n_left_in_mb:
678 848 extra_mb = []
679 849 while self.n_left_in_mb>0:
850 self.move_to_next_dataset()
851 extra_mb.append(self.current_iterator.next())
852 examples = Example(fieldnames,
853 [dataset.valuesVStack(name,
854 [mb[name]]+[b[name] for b in extra_mb])
855 for name in fieldnames])
856 mb = DataSetFields(MinibatchDataSet(examples),fieldnames)
857
858 self.next_row+=minibatch_size
859 self.next_dataset_row+=minibatch_size
860 if self.next_dataset_row+minibatch_size>len(dataset):
861 self.move_to_next_dataset()
862 return mb
863 return VStackedIterator(self)
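A short usage sketch of VStackedDataSet (illustrative only; it assumes the ArrayDataSet class defined below, and that both parts expose the same field names):

import numpy
from dataset import ArrayDataSet, VStackedDataSet

part1 = ArrayDataSet(numpy.zeros((3, 2)), {'x': 0, 'y': 1})
part2 = ArrayDataSet(numpy.ones((5, 2)),  {'x': 0, 'y': 1})

vs = VStackedDataSet([part1, part2])    # 8 examples, fields 'x' and 'y'
assert len(vs) == 8
# Minibatches that straddle the boundary between part1 and part2 are rebuilt
# with valuesVStack, as done in VStackedIterator.next() above.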
864
865 class ArrayFieldsDataSet(DataSet):
866 """
867 Virtual super-class of datasets whose field values are numpy arrays,
868 thus defining valuesHStack and valuesVStack for sub-classes.
869 """
870 def __init__(self,description=None,field_types=None):
871 DataSet.__init__(self,description,field_types)
872 def valuesHStack(self,fieldnames,fieldvalues):
873 """Concatenate field values horizontally, e.g. two vectors
874 become a longer vector, two matrices become a wider matrix, etc."""
875 return numpy.hstack(fieldvalues)
876 def valuesVStack(self,fieldname,values):
877 """Concatenate field values vertically, e.g. two vectors
878 become a two-row matrix, two matrices become a longer matrix, etc."""
879 return numpy.vstack(values)
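The intent of these two helpers can be checked directly with numpy (a small self-contained demonstration):

import numpy
a = numpy.array([1., 2.])
b = numpy.array([3., 4., 5.])
numpy.hstack([a, b])      # array([1., 2., 3., 4., 5.]) -- a longer vector
c = numpy.array([1., 2., 3.])
d = numpy.array([4., 5., 6.])
numpy.vstack([c, d])      # a 2x3 matrix with one row per original vector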
880
881 class ArrayDataSet(ArrayFieldsDataSet):
882 """
883 An ArrayDataSet stores the fields as groups of columns in a numpy tensor,
884 whose first axis iterates over examples, second axis determines fields.
885 If the underlying array is N-dimensional (has N axes), then the field
886 values are (N-2)-dimensional objects (i.e. ordinary numbers if N=2).
887 """
888
889 def __init__(self, data_array, fields_columns):
890 """
891 Construct an ArrayDataSet from the underlying numpy array (data) and
892 a map (fields_columns) from fieldnames to field columns. The columns of a field are specified
893 using the standard arguments for indexing/slicing: integer for a column index,
894 slice for an interval of columns (with possible stride), or iterable of column indices.
895 """
896 self.data=data_array
897 self.fields_columns=fields_columns
898
899 # check consistency and complete slices definitions
900 for fieldname, fieldcolumns in self.fields_columns.items():
901 if type(fieldcolumns) is int:
902 assert fieldcolumns>=0 and fieldcolumns<data_array.shape[1]
903 elif type(fieldcolumns) is slice:
904 start,step=fieldcolumns.start,fieldcolumns.step
905 if start is None:
906 start=0
907 if step is None:
908 step=1
909 # rebuild the slice with explicit start and step so later code can rely on them
910 self.fields_columns[fieldname]=slice(start,fieldcolumns.stop,step)
911 elif hasattr(fieldcolumns,"__iter__"): # something like a list
912 for i in fieldcolumns:
913 assert i>=0 and i<data_array.shape[1]
914
915 def fieldNames(self):
916 return self.fields_columns.keys()
917
918 def __len__(self):
919 return len(self.data)
920
921 def __getitem__(self,i):
922 """More efficient implementation than the default __getitem__"""
923 fieldnames=self.fields_columns.keys()
924 if type(i) is int:
925 return Example(fieldnames,
926 [self.data[i,self.fields_columns[f]] for f in fieldnames])
927 if type(i) in (slice,list):
928 return MinibatchDataSet(Example(fieldnames,
929 [self.data[i,self.fields_columns[f]] for f in fieldnames]))
930 # else check for a fieldname
931 if self.hasFields(i):
932 return Example([i],[self.data[:,self.fields_columns[i]]])
933 # else we are trying to access a property of the dataset
934 assert i in self.__dict__ # else it means we are trying to access a non-existing property
935 return self.__dict__[i]
936
937
938 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset):
939 class ArrayDataSetIterator(object):
940 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset):
941 if fieldnames is None: fieldnames = dataset.fieldNames()
942 # store the resulting minibatch in a lookup-list of values
943 self.minibatch = LookupList(fieldnames,[0]*len(fieldnames))
944 self.dataset=dataset
945 self.minibatch_size=minibatch_size
946 assert offset>=0 and offset<len(dataset.data)
947 assert offset+minibatch_size<=len(dataset.data)
948 self.current=offset
949 def __iter__(self):
950 return self
951 def next(self):
952 sub_data = self.dataset.data[self.current:self.current+self.minibatch_size]
953 self.minibatch._values = [sub_data[:,self.dataset.fields_columns[f]] for f in self.minibatch._names]
954 self.current+=self.minibatch_size
955 return self.minibatch
956
957 return ArrayDataSetIterator(self,fieldnames,minibatch_size,n_batches,offset)
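A usage sketch of ArrayDataSet (illustrative; the field names and array shape are arbitrary):

import numpy
from dataset import ArrayDataSet

data = numpy.random.random_sample((10, 4))
ds = ArrayDataSet(data, {'input': slice(0, 3), 'target': 3})

assert len(ds) == 10
ex  = ds[2]       # an Example with fields 'input' (length-3 vector) and 'target' (scalar)
sub = ds[0:4]     # a MinibatchDataSet over the first 4 examples
for x, y in ds.minibatches(['input', 'target'], minibatch_size=5):
    pass          # x is a 5x3 array, y a length-5 array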
958
959
960 class CachedDataSet(DataSet):
961 """
962 Wrap a dataset whose values are computationally expensive to obtain
963 (e.g. because they involve some computation, or disk access),
964 so that repeated accesses to the same example are done cheaply,
965 by caching every example value that has been accessed at least once.
966
967 Optionally, for a finite-length dataset, all the values can be computed
968 (and cached) upon construction of the CachedDataSet, rather than at the
969 first access.
970 """
971
972 class ApplyFunctionDataSet(DataSet):
973 """
974 A dataset that contains as fields the results of applying a given function
975 example-wise or minibatch-wise to all the fields of an input dataset.
976 The output of the function should be an iterable (e.g. a list or a LookupList)
977 over the resulting values. In minibatch mode, the function is expected
978 to work on minibatches (it takes a minibatch as input and returns a minibatch
979 as output).
980
981 The function is applied each time an example or a minibatch is accessed.
982 To avoid re-doing computation, wrap this dataset inside a CachedDataSet.
983 """
984
985
680 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None): 986 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None):
681 """ 987 """
682 Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the 988 Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the
683 user to define a set of fields as the 'input' field and a set of fields 989 user to define a set of fields as the 'input' field and a set of fields
684 as the 'target' field. Optionally, a single weight_field can also be defined. 990 as the 'target' field. Optionally, a single weight_field can also be defined.
685 """ 991 """
686 args = ((input_fields,'input'),(output_fields,'target')) 992 args = ((input_fields,'input'),(target_fields,'target'))
687 if weight_field: args+=(([weight_field],'weight')) 993 if weight_field: args+=(([weight_field],'weight'),)
688 return src_dataset.rename(*args) 994 return src_dataset.merge_fields(*args)
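For example (assuming the ArrayDataSet class above and that the base DataSet class provides merge_fields), raw columns can be regrouped into the conventional 'input'/'target' fields like this:

import numpy
from dataset import ArrayDataSet, supervised_learning_dataset

raw = ArrayDataSet(numpy.random.random_sample((20, 5)),
                   {'x1': slice(0, 2), 'x2': slice(2, 4), 'y': 4})
train = supervised_learning_dataset(raw, input_fields=['x1', 'x2'], target_fields=['y'])
# 'train' exposes merged fields 'input' and 'target', which is what most
# supervised learning algorithms in this framework expect.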
689 995
690 996
691 997
692 998