comparison dataset.py @ 245:c702abb7f875

merged
author James Bergstra <bergstrj@iro.umontreal.ca>
date Mon, 02 Jun 2008 17:09:58 -0400
parents c8f19a9eb10f
children 7e6edee187e3 4ad6bc9b4f03
comparing 244:3156a9976183 with 245:c702abb7f875
@@ -45,18 +45,18 @@
 A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction
 with learning algorithms (for training and testing them): rows/records are called examples, and
 columns/attributes are called fields. The field value for a particular example can be an arbitrary
 python object, which depends on the particular dataset.

-We call a DataSet a 'stream' when its length is unbounded (otherwise its __len__ method
+We call a DataSet a 'stream' when its length is unbounded (in which case its __len__ method
 should return sys.maxint).

 A DataSet is a generator of iterators; these iterators can run through the
 examples or the fields in a variety of ways. A DataSet need not necessarily have a finite
 or known length, so this class can be used to interface to a 'stream' which
 feeds on-line learning (however, as noted below, some operations are not
-feasible or not recommanded on streams).
+feasible or not recommended on streams).

 To iterate over examples, there are several possibilities:
 - for example in dataset:
 - for val1,val2,... in dataset:
 - for example in dataset(field1, field2, field3, ...):
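The three iteration styles listed in this hunk can be sketched as follows. This is a minimal illustration, not code from the module: the dataset construction is omitted, the field names 'x' and 'y' are hypothetical, and keyed access on an example is assumed.

    # Minimal sketch of the three iteration styles over a bounded DataSet
    # named `dataset`, with hypothetical fields 'x' and 'y'.
    for example in dataset:             # iterate over whole examples
        print example['x'], example['y']

    for x, y in dataset:                # unpack the field values directly
        print x, y

    for example in dataset('x', 'y'):   # iterate over a subset of the fields
        print example['x'], example['y']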
@@ -79,11 +79,11 @@
 - for field in dataset.fields(field1,field2,...) to select a subset of fields
 and each of these fields is iterable over the examples:
 - for field_examples in dataset.fields():
     for example_value in field_examples:
         ...
-but when the dataset is a stream (unbounded length), it is not recommanded to do
+but when the dataset is a stream (unbounded length), it is not recommended to do
 such things because the underlying dataset may refuse to access the different fields in
 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
 The result of fields() is a L{DataSetFields} object, which iterates over fields,
 and whose elements are iterable over examples. A DataSetFields object can
 be turned back into a DataSet with its examples() method::
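A corresponding sketch of the field-wise access this hunk documents; the field names are again hypothetical, and the dataset is assumed to be bounded, since fields() is illegal on streams by default.

    # Field-wise traversal; only sensible on a bounded (non-stream) DataSet.
    fields = dataset.fields('x', 'y')   # a DataSetFields over two fields
    for field_examples in fields:       # one iterable per selected field
        for example_value in field_examples:
            print example_value
    # A DataSetFields object can be turned back into a DataSet:
    dataset2 = fields.examples()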
@@ -597,11 +597,11 @@
 * for fields in dataset.fields(field1,field2,...) to select a subset of fields
 and each of these fields is iterable over the examples:
 * for field_examples in dataset.fields():
     for example_value in field_examples:
         ...
-but when the dataset is a stream (unbounded length), it is not recommanded to do
+but when the dataset is a stream (unbounded length), it is not recommended to do
 such things because the underlying dataset may refuse to access the different fields in
 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
 The result of fields() is a DataSetFields object, which iterates over fields,
 and whose elements are iterable over examples. A DataSetFields object can
 be turned back into a DataSet with its examples() method:
@@ -1014,27 +1014,29 @@
         return len(self.data)

     def __getitem__(self,key):
         """More efficient implementation than the default __getitem__"""
         fieldnames=self.fields_columns.keys()
+        values=self.fields_columns.values()
         if type(key) is int:
             return Example(fieldnames,
-                           [self.data[key,self.fields_columns[f]] for f in fieldnames])
+                           [self.data[key,col] for col in values])
         if type(key) is slice:
             return MinibatchDataSet(Example(fieldnames,
-                                            [self.data[key,self.fields_columns[f]] for f in fieldnames]))
+                                            [self.data[key,col] for col in values]))
         if type(key) is list:
             for i in range(len(key)):
                 if self.hasFields(key[i]):
                     key[i]=self.fields_columns[key[i]]
             return MinibatchDataSet(Example(fieldnames,
                 # we must index a list key differently, as numpy
                 # doesn't support self.data[[i1,...],[i2,...]]
                 # when there are more than two i1 and i2
-                [self.data[key,:][:,self.fields_columns[f]]
-                 if isinstance(self.fields_columns[f],list) else
-                 self.data[key,self.fields_columns[f]] for f in fieldnames]),
+                [self.data[key,:][:,col]
+                 if isinstance(col,list) else
+                 self.data[key,col] for col in values]),
+

                 self.valuesVStack,self.valuesHStack)

         # else check for a fieldname
         if self.hasFields(key):
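The comment in this hunk alludes to numpy's fancy-indexing semantics: indexing with two lists pairs the indices element-wise rather than taking their cross product, which is why the code selects rows first and columns second. A small self-contained sketch of the distinction (the array and indices are made up for illustration):

    import numpy

    data = numpy.arange(16).reshape(4, 4)
    rows = [0, 2, 3]
    cols = [1, 3]

    # data[rows, cols] pairs indices element-wise: it requires
    # len(rows) == len(cols) and returns a 1-D array of single
    # elements, not the sub-matrix we want.
    sub = data[rows, :][:, cols]    # rows first, then columns: shape (3, 2)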
@@ -1052,19 +1054,22 @@
                 self.dataset=dataset
                 self.minibatch_size=minibatch_size
                 assert offset>=0 and offset<len(dataset.data)
                 assert offset+minibatch_size<=len(dataset.data)
                 self.current=offset
+                self.columns = [self.dataset.fields_columns[f]
+                                for f in self.minibatch._names]
             def __iter__(self):
                 return self
             def next(self):
                 #@todo: we suppose that we need to stop only when minibatch_size == 1.
                 # Otherwise, MinibatchWrapAroundIterator does it.
                 if self.current>=self.dataset.data.shape[0]:
                     raise StopIteration
                 sub_data = self.dataset.data[self.current]
-                self.minibatch._values = [sub_data[self.dataset.fields_columns[f]] for f in self.minibatch._names]
+                self.minibatch._values = [sub_data[c] for c in self.columns]
+
                 self.current+=self.minibatch_size
                 return self.minibatch

         return ArrayDataSetIterator2(self,self.fieldNames(),1,0,0)

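The insertions in this last hunk follow the same pattern as the __getitem__ change: the field-name-to-column lookups are hoisted out of next() into __init__, so each step of the iteration only indexes with precomputed columns. In isolation the pattern looks roughly like this; the class and argument names are illustrative, not the module's API.

    # Sketch of the hoisting pattern: resolve field -> column mappings once
    # at construction time instead of on every call to next().
    class RowIterator(object):
        def __init__(self, data, fields_columns, names):
            self.data = data
            self.current = 0
            # computed once; next() no longer does per-field dict lookups
            self.columns = [fields_columns[f] for f in names]
        def __iter__(self):
            return self
        def next(self):
            if self.current >= self.data.shape[0]:
                raise StopIteration
            row = self.data[self.current]
            self.current += 1
            return [row[c] for c in self.columns]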