pylearn: comparison dataset.py @ 71:5b699b31770a
merge
author | James Bergstra <bergstrj@iro.umontreal.ca> |
---|---|
date | Fri, 02 May 2008 18:19:35 -0400 |
parents | dde1fb1b63ba |
children | 2b6656b2ef52 40476a7746e8 |
70:76e5c0f37165 | 71:5b699b31770a |
---|---|
1 | 1 |
2 from lookup_list import LookupList | 2 from lookup_list import LookupList |
3 Example = LookupList | 3 Example = LookupList |
4 import copy | 4 from misc import unique_elements_list_intersection |
5 from string import join | |
6 from sys import maxint | |
7 import numpy | |
5 | 8 |
6 class AbstractFunction (Exception): """Derived class must override this function""" | 9 class AbstractFunction (Exception): """Derived class must override this function""" |
7 | 10 class NotImplementedYet (NotImplementedError): """Work in progress, this should eventually be implemented""" |
11 | |
8 class DataSet(object): | 12 class DataSet(object): |
9 """A virtual base class for datasets. | 13 """A virtual base class for datasets. |
10 | 14 |
15 A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction | |
16 with learning algorithms (for training and testing them): rows/records are called examples, and | |
17 columns/attributes are called fields. The field value for a particular example can be an arbitrary | |
18 python object, which depends on the particular dataset. | |
19 | |
20 We call a DataSet a 'stream' when its length is unbounded (otherwise its __len__ method | |
21 should return sys.maxint). | |
22 | |
11 A DataSet is a generator of iterators; these iterators can run through the | 23 A DataSet is a generator of iterators; these iterators can run through the |
12 examples in a variety of ways. A DataSet need not necessarily have a finite | 24 examples or the fields in a variety of ways. A DataSet need not necessarily have a finite |
13 or known length, so this class can be used to interface to a 'stream' which | 25 or known length, so this class can be used to interface to a 'stream' which |
14 feeds on-line learning. | 26 feeds on-line learning (however, as noted below, some operations are not |
27 feasible or not recommended on streams). | |
15 | 28 |
16 To iterate over examples, there are several possibilities: | 29 To iterate over examples, there are several possibilities: |
17 - for example in dataset.zip([field1, field2,field3, ...]) | 30 * for example in dataset([field1, field2,field3, ...]): |
18 - for val1,val2,val3 in dataset.zip([field1, field2,field3]) | 31 * for val1,val2,val3 in dataset([field1, field2,field3]): |
19 - for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N) | 32 * for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N): |
20 - for example in dataset | 33 * for mini1,mini2,mini3 in dataset.minibatches([field1, field2, field3], minibatch_size=N): |
34 * for example in dataset: | |
35 print example['x'] | |
36 * for x,y,z in dataset: | |
21 Each of these is documented below. All of these iterators are expected | 37 Each of these is documented below. All of these iterators are expected |
22 to provide, in addition to the usual 'next()' method, a 'next_index()' method | 38 to provide, in addition to the usual 'next()' method, a 'next_index()' method |
23 which returns a non-negative integer pointing to the position of the next | 39 which returns a non-negative integer pointing to the position of the next |
24 example that will be returned by 'next()' (or of the first example in the | 40 example that will be returned by 'next()' (or of the first example in the |
25 next minibatch returned). This is important because these iterators | 41 next minibatch returned). This is important because these iterators |
26 can wrap around the dataset in order to do multiple passes through it, | 42 can wrap around the dataset in order to do multiple passes through it, |
27 in possibly irregular ways if the minibatch size is not a divisor of the | 43 in possibly irregular ways if the minibatch size is not a divisor of the |
28 dataset length. | 44 dataset length. |
29 | 45 |
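For illustration, a usage sketch of these iterator forms (editorial addition; `my_dataset` is a hypothetical instance of some concrete DataSet subclass with fields 'x' and 'y'):

```python
# every field of every example, accessed by name
for example in my_dataset:
    print example['x'], example['y']

# a view restricted to some fields (DataSet.__call__), unpacked per example
for x, y in my_dataset('x', 'y'):
    print x, y

# minibatches: each loop variable holds minibatch_size values of one field
for x_batch, y_batch in my_dataset.minibatches(['x', 'y'], minibatch_size=20):
    print len(x_batch), len(y_batch)
```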
46 To iterate over fields, one can do | |
47 * for field in dataset.fields(): | |
48 for field_value in field: # iterate over the values associated to that field for all the dataset examples | |
49 * for field in dataset(field1,field2,...).fields() to select a subset of fields | |
50 * for field in dataset.fields(field1,field2,...) to select a subset of fields | |
51 and each of these fields is iterable over the examples: | |
52 * for field_examples in dataset.fields(): | |
53 for example_value in field_examples: | |
54 ... | |
55 but when the dataset is a stream (unbounded length), it is not recommended to do | |
56 such things because the underlying dataset may refuse to access the different fields in | |
57 an unsynchronized way. Hence the fields() method is illegal for streams, by default. | |
58 The result of fields() is a DataSetFields object, which iterates over fields, | |
59 and whose elements are iterable over examples. A DataSetFields object can | |
60 be turned back into a DataSet with its examples() method: | |
61 dataset2 = dataset1.fields().examples() | |
62 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1). | |
63 | |
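A corresponding sketch of the field-wise view (hypothetical finite dataset `d` with fields 'x' and 'y'; fields() is not meant for unbounded streams):

```python
for field in d.fields():        # iterate over fields (a DataSetFields object)
    for value in field:         # then over that field's value for each example
        print value

# select a subset of fields, then turn the transpose back into a DataSet
d2 = d.fields('x', 'y').examples()   # d2 behaves like d restricted to 'x' and 'y'
```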
30 Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content. | 64 Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content. |
31 | 65 |
32 Note: The content of a field can be of any type. | 66 Note: The content of a field can be of any type. Field values can also be 'missing' |
33 | 67 (e.g. to handle semi-supervised learning), and in the case of numeric (numpy array) |
34 Note: A dataset can recognize a potentially infinite number of field names (i.e. the field | 68 fields (i.e. an ArrayFieldsDataSet), NaN plays the role of a missing value. |
35 values can be computed on-demand, when particular field names are used in one of the | 69 For non-numeric fields, None plays the role of a missing value. |
36 iterators). | 70 |
37 | 71 Dataset elements can be indexed and sub-datasets (with a subset |
38 Datasets of finite length should be sub-classes of FiniteLengthDataSet. | 72 of examples) can be extracted. These operations are not supported |
39 | 73 by default in the case of streams. |
40 Datasets whose elements can be indexed and whose sub-datasets (with a subset | 74 |
41 of examples) can be extracted should be sub-classes of | 75 * dataset[:n] returns a dataset with the n first examples. |
42 SliceableDataSet. | 76 |
43 | 77 * dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s. |
44 Datasets with a finite number of fields should be sub-classes of | 78 |
45 FiniteWidthDataSet. | 79 * dataset[i] returns an Example. |
46 """ | 80 |
47 | 81 * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in. |
48 def __init__(self): | 82 |
49 pass | 83 * dataset[fieldname] an iterable over the values of the field fieldname across |
50 | 84 the dataset (the iterable is obtained by default by calling valuesVStack |
51 class Iterator(LookupList): | 85 over the values for individual examples). |
52 def __init__(self, ll): | 86 |
53 LookupList.__init__(self, ll.keys(), ll.values()) | 87 * dataset.<property> returns the value of a property associated with |
54 self.ll = ll | 88 the name <property>. The following properties should be supported: |
89 - 'description': a textual description or name for the dataset | |
90 - 'fieldtypes': a list of types (one per field) | |
91 | |
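An indexing sketch of the operations above (hypothetical finite dataset `d` with a field 'x' and at least 10 examples):

```python
first_ten   = d[:10]         # sub-dataset with the first 10 examples
every_other = d[0:10:2]      # sub-dataset with examples 0,2,4,6,8
third       = d[3]           # a single Example
chosen      = d[[1, 5, 7]]   # sub-dataset with examples 1, 5 and 7
x_values    = d['x']         # iterable over the 'x' value of every example
print d.description          # one of the supported properties
```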
92 Datasets can be concatenated either vertically (increasing the length) or | |
93 horizontally (augmenting the set of fields), if they are compatible, using | |
94 the following operations (with the same basic semantics as numpy.hstack | |
95 and numpy.vstack): | |
96 | |
97 * dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3]) | |
98 | |
99 creates a new dataset whose list of fields is the concatenation of the list of | |
100 fields of the argument datasets. This only works if they all have the same length. | |
101 | |
102 * dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3]) | |
103 | |
104 creates a new dataset that concatenates the examples from the argument datasets | |
105 (and whose length is the sum of the length of the argument datasets). This only | |
106 works if they all have the same fields. | |
107 | |
108 According to the same logic, and viewing a DataSetFields object associated to | |
109 a DataSet as a kind of transpose of it, fields1 & fields2 concatenates fields of | |
110 a DataSetFields fields1 and fields2, and fields1 | fields2 concatenates their | |
111 examples. | |
112 | |
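A concatenation sketch (d1, d2 and d3 are assumed to be compatible datasets; hstack and vstack are the module-level helpers defined further down in this file):

```python
wider  = d1 | d2 | d3             # same length, fields concatenated (cf. numpy.hstack)
longer = d1 & d2 & d3             # same fields, examples concatenated (cf. numpy.vstack)

# equivalent function forms
wider  = hstack([d1, d2, d3])
longer = vstack([d1, d2, d3])
```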
113 A dataset can hold arbitrary key-value pairs that may be used to access meta-data | |
114 or other properties of the dataset or associated with the dataset or the result | |
115 of a computation stored in a dataset. These can be accessed through the [key] syntax | |
116 when key is a string (or more specifically, neither an integer, a slice, nor a list). | |
117 | |
118 A DataSet sub-class should always redefine the following methods: | |
119 * __len__ if it is not a stream | |
120 * fieldNames | |
121 * minibatches_nowrap (called by DataSet.minibatches()) | |
122 * valuesHStack | |
123 * valuesVStack | |
124 For efficiency of implementation, a sub-class might also want to redefine | |
125 * hasFields | |
126 * __getitem__ may not be feasible with some streams | |
127 * __iter__ | |
128 """ | |
129 | |
130 def __init__(self,description=None,fieldtypes=None): | |
131 if description is None: | |
132 # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)" | |
133 description = type(self).__name__ + " ( " + " ".join([x.__name__ for x in type(self).__bases__]) + " )" | |
134 self.description=description | |
135 self.fieldtypes=fieldtypes | |
136 | |
137 class MinibatchToSingleExampleIterator(object): | |
138 """ | |
139 Converts the result of minibatch iterator with minibatch_size==1 into | |
140 single-example values in the result. Therefore the result of | |
141 iterating on the dataset itself gives a sequence of single examples | |
142 (whereas the result of iterating over minibatches gives in each | |
143 Example field an iterable object over the individual examples in | |
144 the minibatch). | |
145 """ | |
146 def __init__(self, minibatch_iterator): | |
147 self.minibatch_iterator = minibatch_iterator | |
148 self.minibatch = None | |
55 def __iter__(self): #makes for loop work | 149 def __iter__(self): #makes for loop work |
56 return self | 150 return self |
57 def next(self): | 151 def next(self): |
58 self.ll.next() | 152 size1_minibatch = self.minibatch_iterator.next() |
59 self._values = [v[0] for v in self.ll._values] | 153 if not self.minibatch: |
60 return self | 154 self.minibatch = Example(size1_minibatch.keys(),[value[0] for value in size1_minibatch.values()]) |
155 else: | |
156 self.minibatch._values = [value[0] for value in size1_minibatch.values()] | |
157 return self.minibatch | |
158 | |
61 def next_index(self): | 159 def next_index(self): |
62 return self.ll.next_index() | 160 return self.minibatch_iterator.next_index() |
63 | 161 |
64 def __iter__(self): | 162 def __iter__(self): |
65 """Supports the syntax "for i in dataset: ..." | 163 """Supports the syntax "for i in dataset: ..." |
66 | 164 |
67 Using this syntax, "i" will be an Example instance (or equivalent) with | 165 Using this syntax, "i" will be an Example instance (or equivalent) with |
68 all the fields of DataSet self. Every field of "i" will give access to | 166 all the fields of DataSet self. Every field of "i" will give access to |
69 a field of a single example. Fields should be accessible via | 167 a field of a single example. Fields should be accessible via |
70 i["fieldname"] or i[3] (in the order defined by the elements of the | 168 i["fieldname"] or i[3] (in the order defined by the elements of the |
71 Example returned by this iterator), but the derived class is free | 169 Example returned by this iterator), but the derived class is free |
72 to accept any type of identifier, and add extra functionality to the iterator. | 170 to accept any type of identifier, and add extra functionality to the iterator. |
73 """ | 171 |
74 return DataSet.Iterator(self.minibatches(None, minibatch_size = 1)) | 172 The default implementation calls the minibatches iterator and extracts the first example of each field. |
75 | 173 """ |
76 def zip(self, *fieldnames): | 174 return DataSet.MinibatchToSingleExampleIterator(self.minibatches(None, minibatch_size = 1)) |
77 """ | 175 |
78 Supports two forms of syntax: | 176 |
79 | 177 class MinibatchWrapAroundIterator(object): |
80 for i in dataset.zip([f1, f2, f3]): ... | 178 """ |
81 | 179 An iterator for minibatches that handles the case where we need to wrap around the |
82 for i1, i2, i3 in dataset.zip([f1, f2, f3]): ... | 180 dataset because n_batches*minibatch_size > len(dataset). It is constructed from |
83 | 181 a dataset that provides a minibatch iterator that does not need to handle that problem. |
84 Using the first syntax, "i" will be an indexable object, such as a list, | 182 This class is a utility for dataset subclass writers, so that they do not have to handle |
85 tuple, or Example instance, such that on every iteration, i[0] is the f1 | 183 this issue multiple times, nor check that fieldnames are valid, nor handle the |
86 field of the current example, i[1] is the f2 field, and so on. | 184 empty fieldnames (meaning 'use all the fields'). |
87 | 185 """ |
88 Using the second syntax, i1, i2, i3 will contain the contents of the | 186 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset): |
89 f1, f2, and f3 fields of a single example on each loop iteration. | 187 self.dataset=dataset |
90 | 188 self.fieldnames=fieldnames |
91 The derived class may accept fieldname arguments of any type. | 189 self.minibatch_size=minibatch_size |
92 | 190 self.n_batches=n_batches |
93 """ | 191 self.n_batches_done=0 |
94 return DataSet.Iterator(self.minibatches(fieldnames, minibatch_size = 1)) | 192 self.next_row=offset |
193 self.L=len(dataset) | |
194 assert offset+minibatch_size<=self.L | |
195 ds_nbatches = (self.L-offset)/minibatch_size | |
196 if n_batches is not None: | |
197 ds_nbatches = max(n_batches,ds_nbatches) | |
198 if fieldnames: | |
199 assert dataset.hasFields(*fieldnames) | |
200 else: | |
201 fieldnames=dataset.fieldNames() | |
202 self.iterator = dataset.minibatches_nowrap(fieldnames,minibatch_size,ds_nbatches,offset) | |
203 | |
204 def __iter__(self): | |
205 return self | |
206 | |
207 def next_index(self): | |
208 return self.next_row | |
209 | |
210 def next(self): | |
211 if self.n_batches and self.n_batches_done==self.n_batches: | |
212 raise StopIteration | |
213 upper = self.next_row+self.minibatch_size | |
214 if upper <=self.L: | |
215 minibatch = self.iterator.next() | |
216 else: | |
217 if not self.n_batches: | |
218 raise StopIteration | |
219 # we must concatenate (vstack) the bottom and top parts of our minibatch | |
220 # first get the beginning of our minibatch (top of dataset) | |
221 first_part = self.dataset.minibatches_nowrap(self.fieldnames,self.L-self.next_row,1,self.next_row).next() | |
222 second_part = self.dataset.minibatches_nowrap(self.fieldnames,upper-self.L,1,0).next() | |
223 minibatch = Example(self.fieldnames, | |
224 [self.dataset.valuesVStack(name,[first_part[name],second_part[name]]) | |
225 for name in self.fieldnames]) | |
226 self.next_row=upper | |
227 self.n_batches_done+=1 | |
228 if upper >= self.L and self.n_batches: | |
229 self.next_row -= self.L | |
230 return minibatch | |
231 | |
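A small sketch of the wrap-around behaviour implemented above (hypothetical finite dataset `d` of length 10 with a numeric field 'x'):

```python
# With len(d) == 10, minibatch_size == 4 and n_batches == 3, the iterator
# yields rows [0..3], [4..7], and then rows [8,9] vstacked with rows [0,1]:
# the last minibatch wraps around the end of the dataset via valuesVStack.
for batch in d.minibatches(['x'], minibatch_size=4, n_batches=3):
    print len(batch['x'])      # prints 4 three times
```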
95 | 232 |
96 minibatches_fieldnames = None | 233 minibatches_fieldnames = None |
97 minibatches_minibatch_size = 1 | 234 minibatches_minibatch_size = 1 |
98 minibatches_n_batches = None | 235 minibatches_n_batches = None |
99 def minibatches(self, | 236 def minibatches(self, |
100 fieldnames = minibatches_fieldnames, | 237 fieldnames = minibatches_fieldnames, |
101 minibatch_size = minibatches_minibatch_size, | 238 minibatch_size = minibatches_minibatch_size, |
102 n_batches = minibatches_n_batches): | 239 n_batches = minibatches_n_batches, |
103 """ | 240 offset = 0): |
104 Supports three forms of syntax: | 241 """ |
242 Return an iterator that supports three forms of syntax: | |
105 | 243 |
106 for i in dataset.minibatches(None,**kwargs): ... | 244 for i in dataset.minibatches(None,**kwargs): ... |
107 | 245 |
108 for i in dataset.minibatches([f1, f2, f3],**kwargs): ... | 246 for i in dataset.minibatches([f1, f2, f3],**kwargs): ... |
109 | 247 |
114 of a batch of current examples. In the second case, i[0] is | 252 of a batch of current examples. In the second case, i[0] is |
115 list-like container of the f1 field of a batch of current examples, i[1] is | 253 list-like container of the f1 field of a batch of current examples, i[1] is |
116 a list-like container of the f2 field, etc. | 254 a list-like container of the f2 field, etc. |
117 | 255 |
118 Using the first syntax, all the fields will be returned in "i". | 256 Using the first syntax, all the fields will be returned in "i". |
119 Beware that some datasets may not support this syntax, if the number | |
120 of fields is infinite (i.e. field values may be computed "on demand"). | |
121 | |
122 Using the third syntax, i1, i2, i3 will be list-like containers of the | 257 Using the third syntax, i1, i2, i3 will be list-like containers of the |
123 f1, f2, and f3 fields of a batch of examples on each loop iteration. | 258 f1, f2, and f3 fields of a batch of examples on each loop iteration. |
259 | |
260 The minibatches iterator is expected to return upon each call to next() | |
261 a DataSetFields object, which is a LookupList (indexed by the field names) whose | |
262 elements are iterable over the minibatch examples, and which keeps a pointer to | |
263 a sub-dataset that can be used to iterate over the individual examples | |
264 in the minibatch. Hence a minibatch can be converted back to a regular | |
265 dataset or its fields can be looked at individually (and possibly iterated over). | |
124 | 266 |
125 PARAMETERS | 267 PARAMETERS |
126 - fieldnames (list of any type, default None): | 268 - fieldnames (list of any type, default None): |
127 The loop variables i1, i2, i3 (in the example above) should contain the | 269 The loop variables i1, i2, i3 (in the example above) should contain the |
128 f1, f2, and f3 fields of the current batch of examples. If None, the | 270 f1, f2, and f3 fields of the current batch of examples. If None, the |
135 - n_batches (integer, default None) | 277 - n_batches (integer, default None) |
136 The iterator will loop exactly this many times, and then stop. If None, | 278 The iterator will loop exactly this many times, and then stop. If None, |
137 the derived class can choose a default. If (-1), then the returned | 279 the derived class can choose a default. If (-1), then the returned |
138 iterator should support looping indefinitely. | 280 iterator should support looping indefinitely. |
139 | 281 |
282 - offset (integer, default 0) | |
283 The iterator will start at example 'offset' in the dataset, rather than the default. | |
284 | |
140 Note: A list-like container is something like a tuple, list, numpy.ndarray or | 285 Note: A list-like container is something like a tuple, list, numpy.ndarray or |
141 any other object that supports integer indexing and slicing. | 286 any other object that supports integer indexing and slicing. |
142 | 287 |
143 """ | 288 """ |
289 return DataSet.MinibatchWrapAroundIterator(self,fieldnames,minibatch_size,n_batches,offset) | |
290 | |
291 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): | |
292 """ | |
293 This is the minibatches iterator generator that sub-classes must define. | |
294 It does not need to worry about wrapping around multiple times across the dataset, | |
295 as this is handled by MinibatchWrapAroundIterator when DataSet.minibatches() is called. | |
296 The next() method of the returned iterator does not even need to worry about | |
297 the termination condition (as StopIteration will be raised by DataSet.minibatches | |
298 before an improper call to minibatches_nowrap's next() is made). | |
299 That next() method can assert that its next row will always be within [0,len(dataset)). | |
300 The iterator returned by minibatches_nowrap does not need to implement | |
301 a next_index() method either, as this will be provided by MinibatchWrapAroundIterator. | |
302 """ | |
144 raise AbstractFunction() | 303 raise AbstractFunction() |
304 | |
305 def __len__(self): | |
306 """ | |
307 len(dataset) returns the number of examples in the dataset. | |
308 By default, a DataSet is a 'stream', i.e. it has an unbounded length (sys.maxint). | |
309 Sub-classes which implement finite-length datasets should redefine this method. | |
310 Some methods only make sense for finite-length datasets. | |
311 """ | |
312 return sys.maxint | |
313 | |
314 def is_unbounded(self): | |
315 """ | |
316 Tests whether a dataset is unbounded (e.g. a stream). | |
317 """ | |
318 return len(self)==sys.maxint | |
145 | 319 |
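A short sketch of the stream convention (StreamOfExamples is a hypothetical subclass that keeps the default __len__):

```python
import sys

s = StreamOfExamples()           # hypothetical unbounded dataset
assert len(s) == sys.maxint      # streams report sys.maxint as their length
assert s.is_unbounded()
```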
146 def hasFields(self,*fieldnames): | 320 def hasFields(self,*fieldnames): |
147 """ | 321 """ |
148 Return true if the given field name (or field names, if multiple arguments are | 322 Return true if the given field name (or field names, if multiple arguments are |
149 given) is recognized by the DataSet (i.e. can be used as a field name in one | 323 given) is recognized by the DataSet (i.e. can be used as a field name in one |
150 of the iterators). | 324 of the iterators). |
325 | |
326 The default implementation may be inefficient (O(# fields in dataset)), as it calls the fieldNames() | |
327 method. Many datasets may store their field names in a dictionary, which would allow more efficiency. | |
328 """ | |
329 return len(unique_elements_list_intersection(fieldnames,self.fieldNames()))>0 | |
330 | |
331 def fieldNames(self): | |
332 """ | |
333 Return the list of field names that are supported by the iterators, | |
334 and for which hasFields(fieldname) would return True. | |
151 """ | 335 """ |
152 raise AbstractFunction() | 336 raise AbstractFunction() |
153 | 337 |
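An introspection sketch (hypothetical dataset `d` with fields 'x' and 'y'):

```python
print d.fieldNames()      # e.g. ['x', 'y']
print d.hasFields('x')    # True
print d.hasFields('z')    # False: 'z' is not a known field
```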
154 | 338 def __call__(self,*fieldnames): |
155 def merge_fields(self,*specifications): | 339 """ |
156 """ | 340 Return a dataset that sees only the fields whose name are specified. |
157 Return a new dataset that maps old fields (of self) to new fields (of the returned | 341 """ |
158 dataset). The minimal syntax that should be supported is the following: | 342 assert self.hasFields(*fieldnames) |
159 new_field_specifications = [new_field_spec1, new_field_spec2, ...] | 343 return self.fields(*fieldnames).examples() |
160 new_field_spec = ([old_field1, old_field2, ...], new_field) | 344 |
161 In general both old_field and new_field should be strings, but some datasets may also | 345 def fields(self,*fieldnames): |
162 support additional indexing schemes within each field (e.g. column slice | 346 """ |
163 of a matrix-like field). | 347 Return a DataSetFields object associated with this dataset. |
164 """ | 348 """ |
165 raise AbstractFunction() | 349 return DataSetFields(self,*fieldnames) |
166 | |
167 def merge_field_values(self,*field_value_pairs): | |
168 """ | |
169 Return the value that corresponds to merging the values of several fields, | |
170 given as arguments (field_name, field_value) pairs with self.hasField(field_name). | |
171 This may be used by implementations of merge_fields. | |
172 Raise a ValueError if the operation is not possible. | |
173 """ | |
174 fieldnames,fieldvalues = zip(*field_value_pairs) | |
175 raise ValueError("Unable to merge values of these fields:"+repr(fieldnames)) | |
176 | |
177 def examples2minibatch(self,examples): | |
178 """ | |
179 Combine a list of Examples into a minibatch. A minibatch is an Example whose fields | |
180 are iterable over the examples of the minibatch. | |
181 """ | |
182 raise AbstractFunction() | |
183 | |
184 def rename(self,rename_dict): | |
185 """ | |
186 Changes a dataset into one that renames fields, using a dictionnary that maps old field | |
187 names to new field names. The only fields visible by the returned dataset are those | |
188 whose names are keys of the rename_dict. | |
189 """ | |
190 self_class = self.__class__ | |
191 class SelfRenamingDataSet(RenamingDataSet,self_class): | |
192 pass | |
193 self.__class__ = SelfRenamingDataSet | |
194 # set the rename_dict and src fields | |
195 SelfRenamingDataSet.__init__(self,self,rename_dict) | |
196 return self | |
197 | |
198 def apply_function(self,function, input_fields, output_fields, copy_inputs=True, accept_minibatches=True, cache=True): | |
199 """ | |
200 Changes a dataset into one that contains as fields the results of applying | |
201 the given function (example-wise) to the specified input_fields. The | |
202 function should return a sequence whose elements will be stored in | |
203 fields whose names are given in the output_fields list. If copy_inputs | |
204 is True then the resulting dataset will also contain the fields of self. | |
205 If accept_minibatches, then the function may be called | |
206 with minibatches as arguments (what is returned by the minibatches | |
207 iterator). In any case, the computations may be delayed until the examples | |
208 of the resulting dataset are requested. If cache is True, then | |
209 once the output fields for some examples have been computed, then | |
210 are cached (to avoid recomputation if the same examples are again | |
211 requested). | |
212 """ | |
213 self_class = self.__class__ | |
214 class SelfApplyFunctionDataSet(ApplyFunctionDataSet,self_class): | |
215 pass | |
216 self.__class__ = SelfApplyFunctionDataSet | |
217 # set the required additional fields | |
218 ApplyFunctionDataSet.__init__(self,self,function, input_fields, output_fields, copy_inputs, accept_minibatches, cache) | |
219 return self | |
220 | |
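A usage sketch for this (pre-merge) method, assuming a dataset `d` with a numeric field 'x'; the function returns one value per declared output field:

```python
# add a field 'x2' holding the square of 'x', keeping the original fields
d2 = d.apply_function(lambda x: (x * x,), ['x'], ['x2'],
                      copy_inputs=True, accept_minibatches=False)
```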
221 | |
222 class FiniteLengthDataSet(DataSet): | |
223 """ | |
224 Virtual interface for datasets that have a finite length (number of examples), | |
225 and thus recognize a len(dataset) call. | |
226 """ | |
227 def __init__(self): | |
228 DataSet.__init__(self) | |
229 | |
230 def __len__(self): | |
231 """len(dataset) returns the number of examples in the dataset.""" | |
232 raise AbstractFunction() | |
233 | |
234 def __call__(self,fieldname_or_fieldnames): | |
235 """ | |
236 Extract one or more fields. This may be an expensive operation when the | |
237 dataset is large. It is not the recommended way to access individual values | |
238 (use the iterators instead). If the argument is a string fieldname, then the result | |
239 is a sequence (iterable object) of values for that field, for the whole dataset. If the | |
240 argument is a list of field names, then the result is a 'batch', i.e., an Example with keys | |
241 corresponding to the given field names and values being iterable objects over the | |
242 individual example values. | |
243 """ | |
244 if type(fieldname_or_fieldnames) is str: | |
245 minibatch = self.minibatches([fieldname_or_fieldnames],len(self)).next() | |
246 return minibatch[fieldname_or_fieldnames] | |
247 return self.minibatches(fieldname_or_fieldnames,len(self)).next() | |
248 | |
249 class SliceableDataSet(DataSet): | |
250 """ | |
251 Virtual interface, a subclass of DataSet for datasets which are sliceable | |
252 and whose individual elements can be accessed, generally respecting the | |
253 python semantics for [spec], where spec is either a non-negative integer | |
254 (for selecting one example), a python slice(start,stop,step) for selecting a regular | |
255 sub-dataset comprising examples start,start+step,start+2*step,...,n (with n<stop), or a | |
256 sequence (e.g. a list) of integers [i1,i2,...,in] for selecting | |
257 an arbitrary subset of examples. This is useful for obtaining | |
258 sub-datasets, e.g. for splitting a dataset into training and test sets. | |
259 """ | |
260 def __init__(self): | |
261 DataSet.__init__(self) | |
262 | |
263 def minibatches(self, | |
264 fieldnames = DataSet.minibatches_fieldnames, | |
265 minibatch_size = DataSet.minibatches_minibatch_size, | |
266 n_batches = DataSet.minibatches_n_batches): | |
267 """ | |
268 If the n_batches is empty, we want to see all the examples possible | |
269 for the given minibatch_size (possibly missing a few at the end of the dataset). | |
270 """ | |
271 # substitute the defaults: | |
272 if n_batches is None: n_batches = len(self) / minibatch_size | |
273 return DataSet.Iterator(self, fieldnames, minibatch_size, n_batches) | |
274 | 350 |
275 def __getitem__(self,i): | 351 def __getitem__(self,i): |
276 """ | 352 """ |
277 dataset[i] returns the (i+1)-th example of the dataset. | 353 dataset[i] returns the (i+1)-th example of the dataset. |
278 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. | 354 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. |
279 dataset[i:j:s] returns the subdataset with examples i,i+s,i+2*s,...,j-s. | 355 dataset[i:j:s] returns the subdataset with examples i,i+s,i+2*s,...,j-s. |
280 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in. | 356 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in. |
281 """ | 357 dataset['key'] returns a property associated with the given 'key' string. |
282 raise AbstractFunction() | 358 If 'key' is a fieldname, then the VStacked field values (iterable over |
283 | 359 field values) for that field is returned. Other keys may be supported |
284 def __getslice__(self,*slice_args): | 360 by different dataset subclasses. The following key names are encouraged: |
285 """ | 361 - 'description': a textual description or name for the dataset |
286 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. | 362 - '<fieldname>.type': a type name or value for a given <fieldname> |
287 dataset[i:j:s] returns the subdataset with examples i,i+2,i+4...,j-2. | 363 |
288 """ | 364 Note that some stream datasets may be unable to implement random access, i.e. |
289 raise AbstractFunction() | 365 arbitrary slicing/indexing |
290 | 366 because they can only iterate through examples one or a minibatch at a time |
291 | 367 and do not actually store or keep past (or future) examples. |
292 class FiniteWidthDataSet(DataSet): | 368 |
293 """ | 369 The default implementation of getitem uses the minibatches iterator |
294 Virtual interface for datasets that have a finite width (number of fields), | 370 to obtain one example, one slice, or a list of examples. It may not |
295 and thus return a list of fieldNames. | 371 always be the most efficient way to obtain the result, especially if |
296 """ | 372 the data are actually stored in a memory array. |
297 def __init__(self): | 373 """ |
298 DataSet.__init__(self) | 374 # check for an index |
299 | 375 if type(i) is int: |
300 def hasFields(self,*fields): | 376 return DataSet.MinibatchToSingleExampleIterator( |
301 has_fields=True | 377 self.minibatches(minibatch_size=1,n_batches=1,offset=i)).next() |
302 fieldnames = self.fieldNames() | 378 rows=None |
303 for name in fields: | 379 # or a slice |
304 if name not in fieldnames: | 380 if type(i) is slice: |
305 has_fields=False | 381 if not i.start: i.start=0 |
306 return has_fields | 382 if not i.step: i.step=1 |
307 | 383 if i.step is 1: |
384 return self.minibatches(minibatch_size=i.stop-i.start,n_batches=1,offset=i.start).next().examples() | |
385 rows = range(i.start,i.stop,i.step) | |
386 # or a list of indices | |
387 elif type(i) is list: | |
388 rows = i | |
389 if rows is not None: | |
390 examples = [self[row] for row in rows] | |
391 fields_values = zip(*examples) | |
392 return MinibatchDataSet( | |
393 Example(self.fieldNames(),[ self.valuesVStack(fieldname,field_values) | |
394 for fieldname,field_values | |
395 in zip(self.fieldNames(),fields_values)])) | |
396 # else check for a fieldname | |
397 if self.hasFields(i): | |
398 return self.minibatches(fieldnames=[i],minibatch_size=len(self),n_batches=1,offset=0).next()[0] | |
399 # else we are trying to access a property of the dataset | |
400 assert i in self.__dict__ # else it means we are trying to access a non-existing property | |
401 return self.__dict__[i] | |
402 | |
403 def valuesHStack(self,fieldnames,fieldvalues): | |
404 """ | |
405 Return a value that corresponds to concatenating (horizontally) several field values. | |
406 This can be useful to merge some fields. The implementation of this operation is likely | |
407 to involve a copy of the original values. When the values are numpy arrays, the | |
408 result should be numpy.hstack(values). If it makes sense, this operation should | |
409 work as well when each value corresponds to multiple examples in a minibatch | |
410 e.g. if each value is a Ni-vector and a minibatch of length L is a LxNi matrix, | |
411 then the result should be a Lx(N1+N2+..) matrix equal to numpy.hstack(values). | |
412 The default is to use numpy.hstack for numpy.ndarray values, and a list | |
413 pointing to the original values for other data types. | |
414 """ | |
415 all_numpy=True | |
416 for value in fieldvalues: | |
417 if not type(value) is numpy.ndarray: | |
418 all_numpy=False | |
419 if all_numpy: | |
420 return numpy.hstack(fieldvalues) | |
421 # the default implementation of horizontal stacking is to put values in a list | |
422 return fieldvalues | |
423 | |
424 | |
425 def valuesVStack(self,fieldname,values): | |
426 """ | |
427 Return a value that corresponds to concatenating (vertically) several values of the | |
428 same field. This can be important to build a minibatch out of individual examples. This | |
429 is likely to involve a copy of the original values. When the values are numpy arrays, the | |
430 result should be numpy.vstack(values). | |
431 The default is to use numpy.vstack for numpy.ndarray values, and a list | |
432 pointing to the original values for other data types. | |
433 """ | |
434 all_numpy=True | |
435 for value in values: | |
436 if not type(value) is numpy.ndarray: | |
437 all_numpy=False | |
438 if all_numpy: | |
439 return numpy.vstack(values) | |
440 # the default implementation of vertical stacking is to put values in a list | |
441 return values | |
442 | |
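A sketch of the default stacking behaviour (assumes the module imports numpy, as the pre-merge revision did; the base class is instantiated directly just to reach the default methods):

```python
import numpy

d = DataSet(description="defaults only")
a = numpy.ones((2, 3))
b = numpy.zeros((2, 3))

print d.valuesVStack('x', [a, b]).shape           # (4, 3): examples stacked
print d.valuesHStack(['x', 'y'], [a, b]).shape    # (2, 6): fields stacked
print d.valuesVStack('label', ['cat', 'dog'])     # non-numpy values: ['cat', 'dog']
```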
443 def __or__(self,other): | |
444 """ | |
445 dataset1 | dataset2 returns a dataset whose list of fields is the concatenation of the list of | |
446 fields of the argument datasets. This only works if they all have the same length. | |
447 """ | |
448 return HStackedDataSet(self,other) | |
449 | |
450 def __and__(self,other): | |
451 """ | |
452 dataset1 & dataset2 is a dataset that concatenates the examples from the argument datasets | |
453 (and whose length is the sum of the length of the argument datasets). This only | |
454 works if they all have the same fields. | |
455 """ | |
456 return VStackedDataSet(self,other) | |
457 | |
458 def hstack(datasets): | |
459 """ | |
460 hstack(dataset1,dataset2,...) returns dataset1 | datataset2 | ... | |
461 which is a dataset whose fields list is the concatenation of the fields | |
462 of the individual datasets. | |
463 """ | |
464 assert len(datasets)>0 | |
465 if len(datasets)==1: | |
466 return datasets[0] | |
467 return HStackedDataSet(datasets) | |
468 | |
469 def vstack(datasets): | |
470 """ | |
471 vstack(dataset1,dataset2,...) returns dataset1 & datataset2 & ... | |
472 which is a dataset which iterates first over the examples of dataset1, then | |
473 over those of dataset2, etc. | |
474 """ | |
475 assert len(datasets)>0 | |
476 if len(datasets)==1: | |
477 return datasets[0] | |
478 return VStackedDataSet(datasets) | |
479 | |
480 class FieldsSubsetDataSet(DataSet): | |
481 """ | |
482 A sub-class of DataSet that selects a subset of the fields. | |
483 """ | |
484 def __init__(self,src,fieldnames): | |
485 self.src=src | |
486 self.fieldnames=fieldnames | |
487 assert src.hasFields(*fieldnames) | |
488 self.valuesHStack = src.valuesHStack | |
489 self.valuesVStack = src.valuesVStack | |
490 | |
491 def __len__(self): return len(self.src) | |
492 | |
308 def fieldNames(self): | 493 def fieldNames(self): |
309 """Return the list of field names that are supported by the iterators, | 494 return self.fieldnames |
310 and for which hasFields(fieldname) would return True.""" | 495 |
311 raise AbstractFunction() | 496 def __iter__(self): |
312 | 497 class FieldsSubsetIterator(object): |
313 | 498 def __init__(self,ds): |
314 class RenamingDataSet(FiniteWidthDataSet): | 499 self.ds=ds |
315 """A DataSet that wraps another one, and makes it look like the field names | 500 self.src_iter=ds.src.__iter__() |
316 are different | 501 self.example=None |
317 | 502 def __iter__(self): return self |
318 Renaming is done by a dictionary that maps new names to the old ones used in | 503 def next(self): |
319 self.src. | 504 complete_example = self.src_iter.next() |
320 """ | 505 if self.example: |
321 def __init__(self, src, rename_dct): | 506 self.example._values=[complete_example[field] |
322 DataSet.__init__(self) | 507 for field in self.ds.fieldnames] |
323 self.src = src | 508 else: |
324 self.rename_dct = copy.copy(rename_dct) | 509 self.example=Example(self.ds.fieldnames, |
510 [complete_example[field] for field in self.ds.fieldnames]) | |
511 return self.example | |
512 return FieldsSubsetIterator(self) | |
513 | |
514 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): | |
515 assert self.hasFields(*fieldnames) | |
516 return self.src.minibatches_nowrap(fieldnames,minibatch_size,n_batches,offset) | |
517 def __getitem__(self,i): | |
518 return FieldsSubsetDataSet(self.src[i],self.fieldnames) | |
519 | |
520 | |
521 class DataSetFields(LookupList): | |
522 """ | |
523 Although a DataSet iterates over examples (like rows of a matrix), an associated | |
524 DataSetFields iterates over fields (like columns of a matrix), and can be understood | |
525 as a transpose of the associated dataset. | |
526 | |
527 To iterate over fields, one can do | |
528 * for fields in dataset.fields() | |
529 * for fields in dataset(field1,field2,...).fields() to select a subset of fields | |
530 * for fields in dataset.fields(field1,field2,...) to select a subset of fields | |
531 and each of these fields is iterable over the examples: | |
532 * for field_examples in dataset.fields(): | |
533 for example_value in field_examples: | |
534 ... | |
535 but when the dataset is a stream (unbounded length), it is not recommended to do | |
536 such things because the underlying dataset may refuse to access the different fields in | |
537 an unsynchronized way. Hence the fields() method is illegal for streams, by default. | |
538 The result of fields() is a DataSetFields object, which iterates over fields, | |
539 and whose elements are iterable over examples. A DataSetFields object can | |
540 be turned back into a DataSet with its examples() method: | |
541 dataset2 = dataset1.fields().examples() | |
542 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1). | |
543 | |
544 DataSetFields can be concatenated vertically or horizontally. To be consistent with | |
545 the syntax used for DataSets, the | concatenates the fields and the & concatenates | |
546 the examples. | |
547 """ | |
548 def __init__(self,dataset,*fieldnames): | |
549 original_dataset=dataset | |
550 if not fieldnames: | |
551 fieldnames=dataset.fieldNames() | |
552 elif not fieldnames==dataset.fieldNames(): | |
553 dataset = FieldsSubsetDataSet(dataset,fieldnames) | |
554 assert dataset.hasFields(*fieldnames) | |
555 self.dataset=dataset | |
556 | |
557 if isinstance(dataset,MinibatchDataSet): | |
558 LookupList.__init__(self,fieldnames,list(dataset._fields)) | |
559 elif isinstance(original_dataset,MinibatchDataSet): | |
560 LookupList.__init__(self,fieldnames, | |
561 [original_dataset._fields[field] | |
562 for field in fieldnames]) | |
563 else: | |
564 minibatch_iterator = dataset.minibatches(fieldnames, | |
565 minibatch_size=len(dataset), | |
566 n_batches=1) | |
567 minibatch=minibatch_iterator.next() | |
568 LookupList.__init__(self,fieldnames,minibatch) | |
569 | |
570 def examples(self): | |
571 return self.dataset | |
572 | |
573 def __or__(self,other): | |
574 """ | |
575 fields1 | fields2 is a DataSetFields whose list of examples is the concatenation | |
576 of the list of examples of DataSetFields fields1 and fields2. | |
577 """ | |
578 return (self.examples() + other.examples()).fields() | |
579 | |
580 def __and__(self,other): | |
581 """ | |
582 fields1 & fields2 is a DataSetFields whose list of fields is the concatenation | |
583 of the fields of DataSetFields fields1 and fields2. | |
584 """ | |
585 return (self.examples() | other.examples()).fields() | |
586 | |
587 | |
588 class MinibatchDataSet(DataSet): | |
589 """ | |
590 Turn a LookupList of same-length fields into an example-iterable dataset. | |
591 Each element of the lookup-list should be an iterable and sliceable, all of the same length. | |
592 """ | |
593 def __init__(self,fields_lookuplist,values_vstack=DataSet().valuesVStack, | |
594 values_hstack=DataSet().valuesHStack): | |
595 """ | |
596 The user can (and generally should) also provide values_vstack(fieldname,fieldvalues) | |
597 and a values_hstack(fieldnames,fieldvalues) functions behaving with the same | |
598 semantics as the DataSet methods of the same name (but without the self argument). | |
599 """ | |
600 self._fields=fields_lookuplist | |
601 assert len(fields_lookuplist)>0 | |
602 self.length=len(fields_lookuplist[0]) | |
603 for field in fields_lookuplist[1:]: | |
604 assert self.length==len(field) | |
605 self.values_vstack=values_vstack | |
606 self.values_hstack=values_hstack | |
607 | |
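A construction sketch for MinibatchDataSet (assumes numpy; wraps two same-length containers into a 3-example dataset):

```python
import numpy

fields = Example(['x', 'y'],
                 [numpy.arange(6).reshape(3, 2), numpy.array([0, 1, 0])])
d = MinibatchDataSet(fields)
print len(d)                   # 3
for example in d:
    print example['x'], example['y']
```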
608 def __len__(self): | |
609 return self.length | |
610 | |
611 def __getitem__(self,i): | |
612 if type(i) in (int,slice,list): | |
613 return DataSetFields(MinibatchDataSet( | |
614 Example(self._fields.keys(),[field[i] for field in self._fields])),self._fields) | |
615 if self.hasFields(i): | |
616 return self._fields[i] | |
617 assert i in self.__dict__ # else it means we are trying to access a non-existing property | |
618 return self.__dict__[i] | |
325 | 619 |
326 def fieldNames(self): | 620 def fieldNames(self): |
327 return self.rename_dct.keys() | 621 return self._fields.keys() |
328 | 622 |
329 def minibatches(self, | 623 def hasFields(self,*fieldnames): |
330 fieldnames = DataSet.minibatches_fieldnames, | 624 for fieldname in fieldnames: |
331 minibatch_size = DataSet.minibatches_minibatch_size, | 625 if fieldname not in self._fields.keys(): |
332 n_batches = DataSet.minibatches_n_batches): | 626 return False |
333 dct = self.rename_dct | 627 return True |
334 new_fieldnames = [dct.get(f, f) for f in fieldnames] | 628 |
335 return self.src.minibatches(new_fieldnames, minibatch_size, n_batches) | 629 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): |
336 | 630 class Iterator(object): |
337 | 631 def __init__(self,ds): |
338 # we may want ArrayDataSet defined in another python file | 632 self.ds=ds |
339 | 633 self.next_example=offset |
340 import numpy | 634 assert minibatch_size > 0 |
341 | 635 if offset+minibatch_size > ds.length: |
342 def as_array_dataset(dataset): | 636 raise NotImplementedError() |
343 # Generally datasets can be efficient by making data fields overlap, but | |
344 # this function doesn't know which fields overlap. So, it should check if | |
345 # dataset supports an as_array_dataset member function, and return that if | |
346 # possible. | |
347 if hasattr(dataset, 'as_array_dataset'): | |
348 return dataset.as_array_dataset() | |
349 | |
350 raise NotImplementedError | |
351 | |
352 # Make ONE big minibatch with all the examples, to separate the fields. | |
353 n_examples = len(dataset) | |
354 batch = dataset.minibatches( minibatch_size = len(dataset)).next() | |
355 | |
356 # Each field of the underlying dataset must be convertible to a numpy array of the same type | |
357 # currently just double, but should use the smallest compatible dtype | |
358 n_fields = len(batch) | |
359 fieldnames = batch.fields.keys() | |
360 total_width = 0 | |
361 type = None | |
362 fields = LookupList() | |
363 for i in xrange(n_fields): | |
364 field = array(batch[i]) | |
365 assert field.shape[0]==n_examples | |
366 width = field.shape[1] | |
367 start=total_width | |
368 total_width += width | |
369 fields[fieldnames[i]]=slice(start,total_width,1) | |
370 # many complicated things remain to be done: | |
371 # - find common dtype | |
372 # - decide what to do with extra dimensions if not the same in all fields | |
373 # - try to see if we can avoid the copy? | |
374 | |
375 class ArrayDataSet(FiniteLengthDataSet,FiniteWidthDataSet,SliceableDataSet): | |
376 """ | |
377 An ArrayDataSet behaves like a numpy array but adds the notion of named fields | |
378 from DataSet (and the ability to view the values of multiple fields as an 'Example'). | |
379 It is a fixed-length and fixed-width dataset | |
380 in which each element is a fixed dimension numpy array or a number, hence the whole | |
381 dataset corresponds to a numpy array. Fields | |
382 must correspond to a slice of array columns or to a list of column numbers. | |
383 If the dataset has fields, | |
384 each 'example' is just a one-row ArrayDataSet, otherwise it is a numpy array. | |
385 Any dataset can also be converted to a numpy array (losing the notion of fields | |
386 by the numpy.array(dataset) call. | |
387 """ | |
388 | |
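A construction sketch for this (left-column, pre-merge) ArrayDataSet, assuming fields are given as named column slices of one numpy matrix and that LookupList supports the items() call the constructor makes:

```python
import numpy

data = numpy.random.randn(100, 5)
fields = LookupList(['input', 'target'], [slice(0, 4), slice(4, 5)])
d = ArrayDataSet(data, fields)

print len(d)                 # 100 examples
print d.fieldNames()         # ['input', 'target']
print d[0]['input'].shape    # (4,): the input columns of the first example
```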
389 class Iterator(LookupList): | |
390 """An iterator over a finite dataset that implements wrap-around""" | |
391 def __init__(self, dataset, fieldnames, minibatch_size, next_max): | |
392 if fieldnames is None: fieldnames = dataset.fieldNames() | |
393 LookupList.__init__(self, fieldnames, [0]*len(fieldnames)) | |
394 self.dataset=dataset | |
395 self.minibatch_size=minibatch_size | |
396 self.next_count = 0 | |
397 self.next_max = next_max | |
398 self.current = -self.minibatch_size | |
399 assert minibatch_size > 0 | |
400 if minibatch_size >= len(dataset): | |
401 raise NotImplementedError() | |
402 | |
403 def __iter__(self): #makes for loop work | |
404 return self | |
405 | |
406 @staticmethod | |
407 def matcat(a, b): | |
408 a0, a1 = a.shape | |
409 b0, b1 = b.shape | |
410 assert a1 == b1 | |
411 assert a.dtype is b.dtype | |
412 rval = numpy.empty( (a0 + b0, a1), dtype=a.dtype) | |
413 rval[:a0,:] = a | |
414 rval[a0:,:] = b | |
415 return rval | |
416 | |
417 def next_index(self): | |
418 n_rows = self.dataset.data.shape[0] | |
419 next_i = self.current+self.minibatch_size | |
420 if next_i >= n_rows: | |
421 next_i -= n_rows | |
422 return next_i | |
423 | |
424 def next(self): | |
425 | |
426 #check for end-of-loop | |
427 self.next_count += 1 | |
428 if self.next_count == self.next_max: | |
429 raise StopIteration | |
430 | |
431 #determine the first and last elements of the minibatch slice we'll return | |
432 n_rows = self.dataset.data.shape[0] | |
433 self.current = self.next_index() | |
434 upper = self.current + self.minibatch_size | |
435 | |
436 data = self.dataset.data | |
437 | |
438 if upper <= n_rows: | |
439 #this is the easy case, we only need once slice | |
440 dataview = data[self.current:upper] | |
441 else: | |
442 # the minibatch wraps around the end of the dataset | |
443 dataview = data[self.current:] | |
444 upper -= n_rows | |
445 assert upper > 0 | |
446 dataview = self.matcat(dataview, data[:upper]) | |
447 | |
448 self._values = [dataview[:, self.dataset.fields[f]]\ | |
449 for f in self._names] | |
450 return self | |
451 | |
452 | |
453 def __init__(self, data, fields=None): | |
454 """ | |
455 There are two ways to construct an ArrayDataSet: (1) from an | |
456 existing dataset (which may result in a copy of the data in a numpy array), | |
457 or (2) from a numpy.array (the data argument), along with an optional description | |
458 of the fields (a LookupList of column slices (or column lists) indexed by field names). | |
459 """ | |
460 self.data=data | |
461 self.fields=fields | |
462 rows, cols = data.shape | |
463 | |
464 if fields: | |
465 for fieldname,fieldslice in fields.items(): | |
466 assert type(fieldslice) is int or isinstance(fieldslice,slice) or hasattr(fieldslice,"__iter__") | |
467 if hasattr(fieldslice,"__iter__"): # is a sequence | |
468 for i in fieldslice: | |
469 assert type(i) is int | |
470 elif isinstance(fieldslice,slice): | |
471 # make sure fieldslice.start and fieldslice.step are defined | |
472 start=fieldslice.start | |
473 step=fieldslice.step | |
474 if not start: | |
475 start=0 | |
476 if not step: | |
477 step=1 | |
478 if not fieldslice.start or not fieldslice.step: | |
479 fields[fieldname] = fieldslice = slice(start,fieldslice.stop,step) | |
480 # and coherent with the data array | |
481 assert fieldslice.start >= 0 and fieldslice.stop <= cols | |
482 | |
483 def minibatches(self, | |
484 fieldnames = DataSet.minibatches_fieldnames, | |
485 minibatch_size = DataSet.minibatches_minibatch_size, | |
486 n_batches = DataSet.minibatches_n_batches): | |
487 """ | |
488 If the fieldnames list is None, it means that we want to see ALL the fields. | |
489 | |
490 If the n_batches is None, we want to see all the examples possible | |
491 for the given minibatch_size (possibly missing some near the end). | |
492 """ | |
493 # substitute the defaults: | |
494 if n_batches is None: n_batches = len(self) / minibatch_size | |
495 return ArrayDataSet.Iterator(self, fieldnames, minibatch_size, n_batches) | |
496 | |
497 def fieldNames(self): | |
498 """Return the list of field names that are supported by getattr and hasField.""" | |
499 return self.fields.keys() | |
500 | |
501 def __len__(self): | |
502 """len(dataset) returns the number of examples in the dataset.""" | |
503 return len(self.data) | |
504 | |
505 def __getitem__(self,i): | |
506 """ | |
507 dataset[i] returns the (i+1)-th Example of the dataset. | |
508 If there are no fields the result is just a numpy array (for the i-th row of the dataset data matrix). | |
509 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. | |
510 dataset[i:j:s] returns the subdataset with examples i,i+2,i+4...,j-2. | |
511 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in. | |
512 """ | |
513 if self.fields: | |
514 fieldnames,fieldslices=zip(*self.fields.items()) | |
515 return Example(self.fields.keys(),[self.data[i,fieldslice] for fieldslice in self.fields.values()]) | |
516 else: | |
517 return self.data[i] | |
518 | |
519 def __getslice__(self,*args): | |
520 """ | |
521 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. | |
522 dataset[i:j:s] returns the subdataset with examples i,i+2,i+4...,j-2. | |
523 """ | |
524 return ArrayDataSet(self.data.__getslice__(*args), fields=self.fields) | |
525 | |
526 def indices_of_unique_columns_used(self): | |
527 """ | |
528 Return the unique indices of the columns actually used by the fields, and a boolean | |
529 that signals (if True) that used columns overlap. If they do then the | |
530 indices are not repeated in the result. | |
531 """ | |
532 columns_used = numpy.zeros((self.data.shape[1]),dtype=bool) | |
533 overlapping_columns = False | |
534 for field_slice in self.fields.values(): | |
535 if sum(columns_used[field_slice])>0: overlapping_columns=True | |
536 columns_used[field_slice]=True | |
537 return [i for i,used in enumerate(columns_used) if used],overlapping_columns | |
538 | |
539 def slice_of_unique_columns_used(self): | |
540 """ | |
541 Return None if the indices_of_unique_columns_used do not form a slice. If they do, | |
542 return that slice. It means that the columns used can be extracted | |
543 from the data array without making a copy. If the fields overlap | |
544 but their unique columns used form a slice, still return that slice. | |
545 """ | |
546 columns_used,overlapping_columns = self.indices_of_unique_columns_used() | |
547 mappable_to_one_slice = True | |
548 if not overlapping_columns: | |
549 start=0 | |
550 while start<len(columns_used) and not columns_used[start]: | |
551 start+=1 | |
552 stop=len(columns_used) | |
553 while stop>0 and not columns_used[stop-1]: | |
554 stop-=1 | |
555 step=0 | |
556 i=start | |
557 while i<stop: | |
558 j=i+1 | |
559 while j<stop and not columns_used[j]: | |
560 j+=1 | |
561 if step: | |
562 if step!=j-i: | |
563 mappable_to_one_slice = False | |
564 break | |
565 else: | |
566 step = j-i | |
567 i=j | |
568 return slice(start,stop,step) | |
569 | |
570 class ApplyFunctionDataSet(FiniteWidthDataSet): | |
571 """ | |
572 A dataset that contains as fields the results of applying | |
573 a given function (example-wise) to specified input_fields of a source | |
574 dataset. The function should return a sequence whose elements will be stored in | |
575 fields whose names are given in the output_fields list. If copy_inputs | |
576 is True then the resulting dataset will also contain the fields of the source. | |
577 dataset. If accept_minibatches, then the function expects | |
578 minibatches as arguments (what is returned by the minibatches | |
579 iterator). In any case, the computations may be delayed until the examples | |
580 of self are requested. If cache is True, then | |
581 once the output fields for some examples have been computed, then | |
582 are cached (to avoid recomputation if the same examples are again requested). | |
583 """ | |
584 def __init__(src,function, input_fields, output_fields, copy_inputs=True, accept_minibatches=True, cache=True, compute_now=False): | |
585 DataSet.__init__(self) | |
586 self.src=src | |
587 self.function=function | |
588 assert src.hasFields(input_fields) | |
589 self.input_fields=input_fields | |
590 self.output_fields=output_fields | |
591 assert not (copy_inputs and compute_now and not hasattr(src,'fieldNames')) | |
592 self.copy_inputs=copy_inputs | |
593 self.accept_minibatches=accept_minibatches | |
594 self.cache=cache | |
595 self.compute_now=compute_now | |
596 if compute_now: | |
597 assert hasattr(src,'__len__') and len(src)>=0 | |
598 fieldnames = output_fields | |
599 if copy_inputs: fieldnames = src.fieldNames() + output_fields | |
600 if accept_minibatches: | |
601 # make a single minibatch with all the inputs | |
602 inputs = src.minibatches(input_fields,len(src)).next() | |
603 # and apply the function to it, and transpose into a list of examples (field values, actually) | |
604 self.cached_examples = zip(*Example(output_fields,function(*inputs))) | |
605 else: | |
606 # compute a list with one tuple per example, with the function outputs | |
607 self.cached_examples = [ function(input) for input in src.zip(input_fields) ] | |
608 elif cache: | |
609 # maybe a fixed-size array kind of structure would be more efficient than a list | |
610 # in the case where src is FiniteDataSet. -YB | |
611 self.cached_examples = [] | |
612 | |
613 def fieldNames(self): | |
614 if self.copy_inputs: | |
615 return self.output_fields + self.src.fieldNames() | |
616 return self.output_fields | |
617 | |
618 def minibatches(self, | |
619 fieldnames = DataSet.minibatches_fieldnames, | |
620 minibatch_size = DataSet.minibatches_minibatch_size, | |
621 n_batches = DataSet.minibatches_n_batches): | |
622 | |
623 class Iterator(LookupList): | |
624 | |
625 def __init__(self,dataset): | |
626 if fieldnames is None: | |
627 assert hasattr(dataset,"fieldNames") | |
628 fieldnames = dataset.fieldNames() | |
629 self.example_index=0 | |
630 LookupList.__init__(self, fieldnames, [0]*len(fieldnames)) | |
631 self.dataset=dataset | |
632 self.src_iterator=self.src.minibatches(list(set.union(set(fieldnames),set(dataset.input_fields))), | |
633 minibatch_size,n_batches) | |
634 self.fieldnames_not_in_input = [] | |
635 if self.copy_inputs: | |
636 self.fieldnames_not_in_input = filter(lambda x: not x in dataset.input_fields, fieldnames) | |
637 | |
638 def __iter__(self): | 637 def __iter__(self): |
639 return self | 638 return self |
640 | 639 def next(self): |
641 def next_index(self): | 640 upper = self.next_example+minibatch_size |
642 return self.src_iterator.next_index() | 641 assert upper<=self.ds.length |
642 minibatch = Example(self.ds._fields.keys(), | |
643 [field[self.next_example:upper] | |
644 for field in self.ds._fields]) | |
645 self.next_example+=minibatch_size | |
646 return DataSetFields(MinibatchDataSet(minibatch),*fieldnames) | |
647 | |
648 return Iterator(self) | |
649 | |
650 def valuesVStack(self,fieldname,fieldvalues): | |
651 return self.values_vstack(fieldname,fieldvalues) | |
652 | |
653 def valuesHStack(self,fieldnames,fieldvalues): | |
654 return self.values_hstack(fieldnames,fieldvalues) | |
655 | |
656 class HStackedDataSet(DataSet): | |
657 """ | |
658 A DataSet that wraps several datasets and shows a view that includes all their fields, | |
659 i.e. whose list of fields is the concatenation of their lists of fields. | |
660 | |
661 If a field name is found in more than one of the datasets, then either an error is | |
662 raised or the fields are renamed (either by prefixing with the dataset's __name__ attribute | |
663 followed by ".", if it exists, or by suffixing with the dataset's index in the argument list). | |
664 | |
665 TODO: automatically detect a chain of stacked datasets due to A | B | C | D ... | |
666 """ | |
667 def __init__(self,datasets,accept_nonunique_names=False,description=None,field_types=None): | |
668 DataSet.__init__(self,description,field_types) | |
669 self.datasets=datasets | |
670 self.accept_nonunique_names=accept_nonunique_names | |
671 self.fieldname2dataset={} | |
672 | |
673 def rename_field(fieldname,dataset,i): | |
674 if hasattr(dataset,"__name__"): | |
675 return dataset.__name__ + "." + fieldname | |
676 return fieldname+"."+str(i) | |
643 | 677 |
678 # make sure all datasets have the same length and unique field names | |
679 self.length=None | |
680 names_to_change=[] | |
681 for i in xrange(len(datasets)): | |
682 dataset = datasets[i] | |
683 length=len(dataset) | |
684 if self.length: | |
685 assert self.length==length | |
686 else: | |
687 self.length=length | |
688 for fieldname in dataset.fieldNames(): | |
689 if fieldname in self.fieldname2dataset: # name conflict! | |
690 if accept_nonunique_names: | |
691 names_to_change.append((fieldname,self.fieldname2dataset[fieldname])) # also rename the first occurrence | |
692 fieldname=rename_field(fieldname,dataset,i) | |
693 else: | |
694 raise ValueError("Incompatible datasets: non-unique field name = "+fieldname) | |
695 self.fieldname2dataset[fieldname]=i | |
696 for fieldname,i in names_to_change: | |
697 del self.fieldname2dataset[fieldname] | |
698 self.fieldname2dataset[rename_field(fieldname,self.datasets[i],i)]=i | |
699 | |
700 def hasFields(self,*fieldnames): | |
701 for fieldname in fieldnames: | |
702 if not fieldname in self.fieldname2dataset: | |
703 return False | |
704 return True | |
705 | |
706 def fieldNames(self): | |
707 return self.fieldname2dataset.keys() | |
708 | |
709 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): | |
710 | |
711 class HStackedIterator(object): | |
712 def __init__(self,hsds,iterators): | |
713 self.hsds=hsds | |
714 self.iterators=iterators | |
715 def __iter__(self): | |
716 return self | |
644 def next(self): | 717 def next(self): |
645 example_index = self.src_iterator.next_index() | 718 # concatenate all the fields of the minibatches |
646 src_examples = self.src_iterator.next() | 719 minibatch = reduce(LookupList.__add__,[iterator.next() for iterator in self.iterators]) |
647 if self.dataset.copy_inputs: | 720 # and return a DataSetFields whose dataset is the transpose (=examples()) of this minibatch |
648 function_inputs = [src_examples[field_name] for field_name in self.dataset.input_fields] | 721 return DataSetFields(MinibatchDataSet(minibatch,self.hsds.valuesVStack, |
722 self.hsds.valuesHStack), | |
723 fieldnames if fieldnames else self.hsds.fieldNames()) | |
724 | |
725 assert fieldnames is None or self.hasFields(*fieldnames) | |
726 # find out which underlying datasets are necessary to service the required fields | |
727 # and construct corresponding minibatch iterators | |
728 if fieldnames: | |
729 datasets=set([]) | |
730 fields_in_dataset=dict([(dataset,[]) for dataset in self.datasets]) | |
731 for fieldname in fieldnames: | |
732 dataset=self.datasets[self.fieldname2dataset[fieldname]] | |
733 datasets.add(dataset) | |
734 fields_in_dataset[dataset].append(fieldname) | |
735 datasets=list(datasets) | |
736 iterators=[dataset.minibatches(fields_in_dataset[dataset],minibatch_size,n_batches,offset) | |
737 for dataset in datasets] | |
738 else: | |
739 datasets=self.datasets | |
740 iterators=[dataset.minibatches(None,minibatch_size,n_batches,offset) for dataset in datasets] | |
741 return HStackedIterator(self,iterators) | |
742 | |
743 | |
744 def valuesVStack(self,fieldname,fieldvalues): | |
745 return self.datasets[self.fieldname2dataset[fieldname]].valuesVStack(fieldname,fieldvalues) | |
746 | |
747 def valuesHStack(self,fieldnames,fieldvalues): | |
748 """ | |
749 We use the sub-dataset associated with the first fieldname in the fieldnames list | |
750 to do the work, hoping that it can cope with values coming from the other sub-datasets | |
751 (i.e. that it won't care about field names it does not know). This heuristic is | |
752 guaranteed to work when all the fieldnames belong to the same sub-dataset. | |
753 """ | |
754 return self.datasets[self.fieldname2dataset[fieldnames[0]]].valuesHStack(fieldnames,fieldvalues) | |
755 | |
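A usage sketch of the horizontal stacking described above (assuming this module is importable as `dataset`; the field names are hypothetical, and the two sources must have the same length and distinct field names):

    import numpy
    from dataset import ArrayDataSet, HStackedDataSet   # assumed import path

    inputs  = ArrayDataSet(numpy.random.rand(100,2), {'x':slice(0,2)})
    targets = ArrayDataSet(numpy.random.rand(100,1), {'y':0})

    # a view exposing the union of the fields of both datasets
    xy = HStackedDataSet([inputs,targets])
    assert sorted(xy.fieldNames()) == ['x','y']

    for batch in xy.minibatches(['x','y'], minibatch_size=10):
        print batch['x'].shape, batch['y'].shape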
756 class VStackedDataSet(DataSet): | |
757 """ | |
758 A DataSet that wraps several datasets and shows a view that includes all their examples, | |
759 in the order provided. This clearly assumes that they all have the same field names | |
760 and all (except possibly the last one) are of finite length. | |
761 | |
762 TODO: automatically detect a chain of stacked datasets due to A + B + C + D ... | |
763 """ | |
764 def __init__(self,datasets): | |
765 self.datasets=datasets | |
766 self.length=0 | |
767 self.index2dataset={} | |
768 assert len(datasets)>0 | |
769 fieldnames = datasets[-1].fieldNames() | |
770 self.datasets_start_row=[] | |
771 # We use this map from row index to dataset index for constant-time random access of examples, | |
772 # to avoid having to search for the appropriate dataset each time a slice is asked for. | |
773 for k,dataset in enumerate(datasets[0:-1]): | |
774 assert not dataset.is_unbounded() # all VStacked datasets (except possibly the last) must be bounded (have a length) | |
775 L=len(dataset) | |
776 for i in xrange(L): | |
777 self.index2dataset[self.length+i]=k | |
778 self.datasets_start_row.append(self.length) | |
779 self.length+=L | |
780 assert dataset.fieldNames()==fieldnames | |
781 self.datasets_start_row.append(self.length) | |
782 self.length+=len(datasets[-1]) | |
783 # If length is very large, we should use a more memory-efficient mechanism | |
784 # that does not store all indices | |
785 if self.length>1000000: | |
786 # 1 million entries would require about 60 meg for the index2dataset map | |
787 # TODO | |
788 print "A more efficient mechanism for index2dataset should be implemented" | |
789 | |
790 def __len__(self): | |
791 return self.length | |
792 | |
793 def fieldNames(self): | |
794 return self.datasets[0].fieldNames() | |
795 | |
796 def hasFields(self,*fieldnames): | |
797 return self.datasets[0].hasFields(*fieldnames) | |
798 | |
799 def locate_row(self,row): | |
800 """Return (dataset_index, row_within_dataset) for global row number""" | |
801 dataset_index = self.index2dataset[row] | |
802 row_within_dataset = row - self.datasets_start_row[dataset_index] | |
803 return dataset_index, row_within_dataset | |
804 | |
805 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): | |
806 | |
807 class VStackedIterator(object): | |
808 def __init__(self,vsds): | |
809 self.vsds=vsds | |
810 self.next_row=offset | |
811 self.next_dataset_index,self.next_dataset_row=self.vsds.locate_row(offset) | |
812 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \ | |
813 self.next_iterator(vsds.datasets[0],offset,n_batches) | |
814 | |
815 def next_iterator(self,dataset,starting_offset,batches_left): | |
816 L=len(dataset) | |
817 ds_nbatches = (L-starting_offset)/minibatch_size | |
818 if batches_left is not None: | |
819 ds_nbatches = min(batches_left,ds_nbatches) | |
820 if minibatch_size>L: | |
821 ds_minibatch_size=L | |
822 n_left_in_mb=minibatch_size-L | |
823 ds_nbatches=1 | |
649 else: | 824 else: |
650 function_inputs = src_examples | 825 n_left_in_mb=0 |
651 if self.dataset.cached_examples: | 826 return dataset.minibatches(fieldnames,min(minibatch_size,L),ds_nbatches,starting_offset), \
652 cache_len=len(self.dataset.cached_examples) | 827 L-(starting_offset+ds_nbatches*minibatch_size), n_left_in_mb
653 if example_index+minibatch_size<=cache_len: | 828 
654 outputs_list = self.dataset.cached_examples[example_index:example_index+minibatch_size] | 829 def move_to_next_dataset(self):
655 # convert the minibatch list of examples | 830 if self.n_left_at_the_end_of_ds>0: |
656 # into a list of fields each of which iterate over the minibatch | 831 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \ |
657 outputs = zip(*outputs_list) | 832 self.next_iterator(self.vsds.datasets[self.next_dataset_index],
658 else: | 833 self.n_left_at_the_end_of_ds,1) |
659 outputs = self.dataset.function(*function_inputs) | |
660 if self.dataset.cache: | |
661 # convert the list of fields, each of which can iterate over the minibatch | |
662 # into a list of examples in the minibatch (each of which is a list of field values) | |
663 outputs_list = zip(*outputs) | |
664 # copy the outputs_list into the cache | |
665 for i in xrange(cache_len,example_index): | |
666 self.dataset.cached_examples.append(None) | |
667 self.dataset.cached_examples += outputs_list | |
668 else: | 834 else: |
669 outputs = self.dataset.function(*function_inputs) | 835 self.next_dataset_index +=1 |
836 if self.next_dataset_index==len(self.vsds.datasets): | |
837 self.next_dataset_index = 0 | |
838 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \ | |
839 self.next_iterator(self.vsds.datasets[self.next_dataset_index],0,n_batches) | |
670 | 840 |
671 return Example(self.fieldnames_not_in_input+self.dataset.output_fields, | 841 def __iter__(self): |
672 [src_examples[field_name] for field_name in self.fieldnames_not_in_input]+outputs) | 842 return self |
673 | 843 |
674 | 844 def next(self): |
675 for fieldname in fieldnames: | 845 dataset=self.vsds.datasets[self.next_dataset_index] |
676 assert fieldname in self.output_fields or self.src.hasFields(fieldname) | 846 mb = self.current_iterator.next()
677 return Iterator(self) | 847 if self.n_left_in_mb: |
678 | 848 extra_mb = [] |
679 | 849 while self.n_left_in_mb>0: |
850 self.move_to_next_dataset() | |
851 extra_mb.append(self.current_iterator.next()) | |
852 examples = Example(fieldnames, | |
853 [dataset.valuesVStack(name, | |
854 [mb[name]]+[b[name] for b in extra_mb]) | |
855 for name in fieldnames]) | |
856 mb = DataSetFields(MinibatchDataSet(examples),fieldnames) | |
857 | |
858 self.next_row+=minibatch_size | |
859 self.next_dataset_row+=minibatch_size | |
860 if self.next_dataset_row+minibatch_size>len(dataset): | |
861 self.move_to_next_dataset() | |
862 return mb | |
863 return VStackedIterator(self) | |
864 | |
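A usage sketch of vertical stacking (assuming the module is importable as `dataset` and that the stacked finite datasets report is_unbounded() as False; the field names are hypothetical):

    import numpy
    from dataset import ArrayDataSet, VStackedDataSet   # assumed import path

    fields = {'x':slice(0,2), 'y':2}
    d1 = ArrayDataSet(numpy.random.rand(60,3), fields)
    d2 = ArrayDataSet(numpy.random.rand(40,3), fields)

    d = VStackedDataSet([d1,d2])      # examples of d1 followed by those of d2
    assert len(d) == 100
    print d.locate_row(75)            # -> (1, 15): global row 75 is row 15 of d2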
865 class ArrayFieldsDataSet(DataSet): | |
866 """ | |
867 Virtual super-class of datasets whose field values are numpy arrays, | |
868 thus defining valuesHStack and valuesVStack for sub-classes. | |
869 """ | |
870 def __init__(self,description=None,field_types=None): | |
871 DataSet.__init__(self,description,field_types) | |
872 def valuesHStack(self,fieldnames,fieldvalues): | |
873 """Concatenate field values horizontally, e.g. two vectors | |
874 become a longer vector, two matrices become a wider matrix, etc.""" | |
875 return numpy.hstack(fieldvalues) | |
876 def valuesVStack(self,fieldname,values): | |
877 """Concatenate field values vertically, e.g. two vectors | |
878 become a two-row matrix, two matrices become a longer matrix, etc.""" | |
879 return numpy.vstack(values) | |
880 | |
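Concretely, the two methods above reduce to numpy.hstack/numpy.vstack on the field values; a small standalone illustration:

    import numpy

    # valuesHStack: field values of one example, side by side -> a longer vector
    print numpy.hstack([numpy.array([1.,2.]), numpy.array([3.])])              # [ 1.  2.  3.]

    # valuesVStack: the same field across several examples -> a taller matrix
    print numpy.vstack([numpy.array([[1.,2.]]), numpy.array([[3.,4.]])])       # [[ 1.  2.] [ 3.  4.]]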
881 class ArrayDataSet(ArrayFieldsDataSet): | |
882 """ | |
883 An ArrayDataSet stores the fields as groups of columns in a numpy tensor, | |
884 whose first axis iterates over examples, second axis determines fields. | |
885 If the underlying array is N-dimensional (has N axes), then the field | |
886 values are (N-2)-dimensional objects (i.e. ordinary numbers if N=2). | |
887 """ | |
888 | |
889 def __init__(self, data_array, fields_columns): | |
890 """ | |
891 Construct an ArrayDataSet from the underlying numpy array (data_array) and | |
892 a map (fields_columns) from fieldnames to field columns. The columns of a field are specified | |
893 using the standard arguments for indexing/slicing: integer for a column index, | |
894 slice for an interval of columns (with possible stride), or iterable of column indices. | |
895 """ | |
896 self.data=data_array | |
897 self.fields_columns=fields_columns | |
898 | |
899 # check consistency and complete slices definitions | |
900 for fieldname, fieldcolumns in self.fields_columns.items(): | |
901 if type(fieldcolumns) is int: | |
902 assert fieldcolumns>=0 and fieldcolumns<data_array.shape[1] | |
903 elif type(fieldcolumns) is slice: | |
904 start,step=fieldcolumns.start,fieldcolumns.step | |
905 if start is None: | |
906 start=0 | |
907 if step is None: | |
908 step=1 | |
909 if (start,step)!=(fieldcolumns.start,fieldcolumns.step): | |
910 self.fields_columns[fieldname]=slice(start,fieldcolumns.stop,step) | |
911 elif hasattr(fieldcolumns,"__iter__"): # something like a list | |
912 for i in fieldcolumns: | |
913 assert i>=0 and i<data_array.shape[1] | |
914 | |
915 def fieldNames(self): | |
916 return self.fields_columns.keys() | |
917 | |
918 def __len__(self): | |
919 return len(self.data) | |
920 | |
921 def __getitem__(self,i): | |
922 """More efficient implementation than the default __getitem__""" | |
923 fieldnames=self.fields_columns.keys() | |
924 if type(i) is int: | |
925 return Example(fieldnames, | |
926 [self.data[i,self.fields_columns[f]] for f in fieldnames]) | |
927 if type(i) in (slice,list): | |
928 return MinibatchDataSet(Example(fieldnames, | |
929 [self.data[i,self.fields_columns[f]] for f in fieldnames])) | |
930 # else check for a fieldname | |
931 if self.hasFields(i): | |
932 return Example([i],[self.data[:,self.fields_columns[i]]]) | |
933 # else we are trying to access a property of the dataset | |
934 assert i in self.__dict__ # else it means we are trying to access a non-existing property | |
935 return self.__dict__[i] | |
936 | |
937 | |
938 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): | |
939 class ArrayDataSetIterator(object): | |
940 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset): | |
941 if fieldnames is None: fieldnames = dataset.fieldNames() | |
942 # store the resulting minibatch in a lookup-list of values | |
943 self.minibatch = LookupList(fieldnames,[0]*len(fieldnames)) | |
944 self.dataset=dataset | |
945 self.minibatch_size=minibatch_size | |
946 assert offset>=0 and offset<len(dataset.data) | |
947 assert offset+minibatch_size<=len(dataset.data) | |
948 self.current=offset | |
949 def __iter__(self): | |
950 return self | |
951 def next(self): | |
952 sub_data = self.dataset.data[self.current:self.current+self.minibatch_size] | |
953 self.minibatch._values = [sub_data[:,self.dataset.fields_columns[f]] for f in self.minibatch._names] | |
954 self.current+=self.minibatch_size | |
955 return self.minibatch | |
956 | |
957 return ArrayDataSetIterator(self,fieldnames,minibatch_size,n_batches,offset) | |
958 | |
959 | |
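A usage sketch of ArrayDataSet (assuming the module is importable as `dataset`; the field names and shapes are hypothetical):

    import numpy
    from dataset import ArrayDataSet   # assumed import path

    data = numpy.random.rand(100,4)
    # 'input' spans columns 0..2, 'target' is column 3
    ds = ArrayDataSet(data, {'input':slice(0,3), 'target':3})

    print len(ds)                # 100
    print ds[0]['input'].shape   # (3,) -- one example's 'input' value
    for batch in ds.minibatches(['input','target'], minibatch_size=20):
        print batch['input'].shape, batch['target'].shape   # (20, 3) (20,)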
960 class CachedDataSet(DataSet): | |
961 """ | |
962 Wrap a dataset whose values are expensive to obtain | |
963 (e.g. because they require computation or disk access), | |
964 so that repeated accesses to the same example are cheap, | |
965 by caching every example value that has been accessed at least once. | |
966 | |
967 Optionally, for a finite-length dataset, all the values can be computed | |
968 (and cached) upon construction of the CachedDataSet, rather than at the | |
969 first access. | |
970 """ | |
971 | |
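The class body is still to be written; as a rough sketch of the caching idea only (not the eventual implementation, which will have to follow the full DataSet interface), the core mechanism amounts to memoizing example accesses:

    class ExampleCacheSketch(object):
        """Illustrative only: memoize example access by index on top of a wrapped dataset."""
        def __init__(self, dataset):
            self.dataset = dataset
            self.cache = {}
        def __len__(self):
            return len(self.dataset)
        def __getitem__(self, i):
            if i not in self.cache:
                self.cache[i] = self.dataset[i]   # the expensive access happens only once per index
            return self.cache[i]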
972 class ApplyFunctionDataSet(DataSet): | |
973 """ | |
974 A dataset that contains as fields the results of applying a given function | |
975 example-wise or minibatch-wise to all the fields of an input dataset. | |
976 The output of the function should be an iterable (e.g. a list or a LookupList) | |
977 over the resulting values. In minibatch mode, the function is expected | |
978 to work on minibatches (it takes a minibatch as input and returns a minibatch | |
979 as output). | |
980 | |
981 The function is applied each time an example or a minibatch is accessed. | |
982 To avoid re-doing computation, wrap this dataset inside a CachedDataSet. | |
983 """ | |
984 | |
985 | |
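The contract on the applied function can be illustrated standalone: example-wise it maps one example's field values to an iterable of output field values; minibatch-wise it maps arrays (one per field, indexed over the minibatch) to arrays (the field names below are hypothetical):

    import numpy

    # example-wise: scalar field values in, an iterable of output values out
    def squared_error(x,y):
        return [(x-y)**2]

    # minibatch-wise: one array per field in, one array per output field out
    def squared_error_minibatch(x,y):
        return [numpy.square(x-y)]

    print squared_error(3.,1.)                                                  # [4.0]
    print squared_error_minibatch(numpy.array([3.,0.]), numpy.array([1.,2.]))   # [array([ 4.,  4.])]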
680 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None): | 986 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None): |
681 """ | 987 """ |
682 Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the | 988 Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the |
683 user to define a set of fields as the 'input' field and a set of fields | 989 user to define a set of fields as the 'input' field and a set of fields |
684 as the 'target' field. Optionally, a single weight_field can also be defined. | 990 as the 'target' field. Optionally, a single weight_field can also be defined. |
685 """ | 991 """ |
686 args = ((input_fields,'input'),(target_fields,'target')) | 992 args = ((input_fields,'input'),(target_fields,'target'))
687 if weight_field: args+=(([weight_field],'weight'),) | 993 if weight_field: args+=(([weight_field],'weight'),)
688 return src_dataset.rename(*args) | 994 return src_dataset.merge_fields(*args) |
689 | 995 |
690 | 996 |
691 | 997 |
692 | 998 |