pylearn: dataset.py @ 245:c702abb7f875 (comparison with 244:3156a9976183)

changeset    245:c702abb7f875
description  merged
author       James Bergstra <bergstrj@iro.umontreal.ca>
date         Mon, 02 Jun 2008 17:09:58 -0400
parents      c8f19a9eb10f
children     7e6edee187e3 4ad6bc9b4f03
@@ -45,18 +45,18 @@
     A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction
     with learning algorithms (for training and testing them): rows/records are called examples, and
     columns/attributes are called fields. The field value for a particular example can be an arbitrary
     python object, which depends on the particular dataset.
 
-    We call a DataSet a 'stream' when its length is unbounded (otherwise its __len__ method
+    We call a DataSet a 'stream' when its length is unbounded (in which case its __len__ method
     should return sys.maxint).
 
     A DataSet is a generator of iterators; these iterators can run through the
     examples or the fields in a variety of ways. A DataSet need not necessarily have a finite
     or known length, so this class can be used to interface to a 'stream' which
     feeds on-line learning (however, as noted below, some operations are not
-    feasible or not recommanded on streams).
+    feasible or not recommended on streams).
 
     To iterate over examples, there are several possibilities:
     - for example in dataset:
     - for val1,val2,... in dataset:
     - for example in dataset(field1, field2,field3, ...):
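The iteration patterns listed in this docstring can be illustrated with a small stand-in class. This is only a sketch of the documented behaviour, not the pylearn classes themselves, and the field names 'x' and 'y' are invented for the demo:

import numpy

class ToyDataSet(object):
    def __init__(self, names, columns):
        self.names = list(names)        # field names, in order
        self.columns = list(columns)    # one 1-D numpy array per field
    def __call__(self, *names):         # dataset(field1, field2, ...)
        cols = [self.columns[self.names.index(n)] for n in names]
        return ToyDataSet(names, cols)
    def __iter__(self):                 # one example (a tuple of values) per row
        for row in zip(*self.columns):
            yield row

d = ToyDataSet(['x', 'y'], [numpy.arange(4), numpy.arange(4) ** 2])
for x, y in d:          # for val1,val2,... in dataset:
    print(x + y)
for (x,) in d('x'):     # for example in dataset(field1, ...):
    print(x)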
@@ -79,11 +79,11 @@
     - for field in dataset.fields(field1,field2,...) to select a subset of fields
     and each of these fields is iterable over the examples:
     - for field_examples in dataset.fields():
         for example_value in field_examples:
             ...
-    but when the dataset is a stream (unbounded length), it is not recommanded to do
+    but when the dataset is a stream (unbounded length), it is not recommended to do
     such things because the underlying dataset may refuse to access the different fields in
     an unsynchronized ways. Hence the fields() method is illegal for streams, by default.
     The result of fields() is a L{DataSetFields} object, which iterates over fields,
     and whose elements are iterable over examples. A DataSetFields object can
     be turned back into a DataSet with its examples() method::
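The fields() pattern reads the data the other way around: each element of the fields view is itself iterable over the per-example values of one field. A minimal stand-in for that access pattern (not the real DataSetFields class; names are invented):

import numpy

columns = {'x': numpy.arange(4), 'y': numpy.arange(4) ** 2}

def fields_view(names):                  # stand-in for dataset.fields(...)
    return [columns[n] for n in names]   # each field iterates over examples

for field_examples in fields_view(['x', 'y']):
    for example_value in field_examples:
        print(example_value)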
@@ -597,11 +597,11 @@
     * for fields in dataset.fields(field1,field2,...) to select a subset of fields
     and each of these fields is iterable over the examples:
     * for field_examples in dataset.fields():
         for example_value in field_examples:
             ...
-    but when the dataset is a stream (unbounded length), it is not recommanded to do
+    but when the dataset is a stream (unbounded length), it is not recommended to do
     such things because the underlying dataset may refuse to access the different fields in
     an unsynchronized ways. Hence the fields() method is illegal for streams, by default.
     The result of fields() is a DataSetFields object, which iterates over fields,
     and whose elements are iterable over examples. A DataSetFields object can
     be turned back into a DataSet with its examples() method:
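Both docstrings note that a DataSetFields object can be turned back into a DataSet with its examples() method. Over plain sequences, that round-trip is just a transposition; a sketch with invented data:

import numpy

fields = {'x': numpy.arange(3), 'y': numpy.arange(3) * 10}
names = sorted(fields)
# fields view -> examples view: transpose columns into per-row tuples
examples = list(zip(*[fields[n] for n in names]))
assert examples[1] == (1, 10)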
@@ -1014,27 +1014,29 @@
         return len(self.data)
 
     def __getitem__(self,key):
         """More efficient implementation than the default __getitem__"""
         fieldnames=self.fields_columns.keys()
+        values=self.fields_columns.values()
         if type(key) is int:
             return Example(fieldnames,
-                           [self.data[key,self.fields_columns[f]] for f in fieldnames])
+                           [self.data[key,col] for col in values])
         if type(key) is slice:
             return MinibatchDataSet(Example(fieldnames,
-                                            [self.data[key,self.fields_columns[f]] for f in fieldnames]))
+                                            [self.data[key,col] for col in values]))
         if type(key) is list:
             for i in range(len(key)):
                 if self.hasFields(key[i]):
                     key[i]=self.fields_columns[key[i]]
             return MinibatchDataSet(Example(fieldnames,
                                             #we must separate differently for list as numpy
                                             # doesn't support self.data[[i1,...],[i2,...]]
                                             # when their is more then two i1 and i2
-                                            [self.data[key,:][:,self.fields_columns[f]]
-                                             if isinstance(self.fields_columns[f],list) else
-                                             self.data[key,self.fields_columns[f]] for f in fieldnames]),
+                                            [self.data[key,:][:,col]
+                                             if isinstance(col,list) else
+                                             self.data[key,col] for col in values]),
+
 
                                             self.valuesVStack,self.valuesHStack)
 
         # else check for a fieldname
         if self.hasFields(key):
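The __getitem__ rewrite in this hunk hoists the per-field dict lookup out of each indexing expression: for a dict that is not modified in between, keys() and values() return entries in matching order, so iterating the precomputed values gives the same column indices without a lookup per field name. The same idea in isolation, with invented field names:

import numpy

data = numpy.arange(20).reshape(4, 5)
fields_columns = {'input': slice(0, 4), 'target': 4}

fieldnames = fields_columns.keys()
values = fields_columns.values()   # matches keys() order for an unmodified dict

row = 2
old = [data[row, fields_columns[f]] for f in fieldnames]  # lookup per field
new = [data[row, col] for col in values]                  # reuse resolved columns
assert all(numpy.all(a == b) for a, b in zip(old, new))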
@@ -1052,19 +1054,22 @@
                self.dataset=dataset
                self.minibatch_size=minibatch_size
                assert offset>=0 and offset<len(dataset.data)
                assert offset+minibatch_size<=len(dataset.data)
                self.current=offset
+                self.columns = [self.dataset.fields_columns[f]
+                                for f in self.minibatch._names]
            def __iter__(self):
                return self
            def next(self):
                #@todo: we suppose that we need to stop only when minibatch_size == 1.
                # Otherwise, MinibatchWrapAroundIterator do it.
                if self.current>=self.dataset.data.shape[0]:
                    raise StopIteration
                sub_data = self.dataset.data[self.current]
-                self.minibatch._values = [sub_data[self.dataset.fields_columns[f]] for f in self.minibatch._names]
+                self.minibatch._values = [sub_data[c] for c in self.columns]
+
                self.current+=self.minibatch_size
                return self.minibatch
 
        return ArrayDataSetIterator2(self,self.fieldNames(),1,0,0)
 
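The iterator change is the same hoisting applied to next(): field names are resolved to column indices once in __init__, so each next() call only indexes into the data. A self-contained sketch of that pattern (invented names; this is not ArrayDataSetIterator2 itself):

import numpy

class RowIterator(object):
    def __init__(self, data, fields_columns, names):
        self.data = data
        self.current = 0
        # resolved once here, instead of once per next() call
        self.columns = [fields_columns[f] for f in names]
    def __iter__(self):
        return self
    def __next__(self):
        if self.current >= self.data.shape[0]:
            raise StopIteration
        row = self.data[self.current]
        self.current += 1
        return [row[c] for c in self.columns]
    next = __next__   # Python 2 spelling, as in the patched code

data = numpy.arange(12).reshape(3, 4)
for x, y in RowIterator(data, {'x': slice(0, 3), 'y': 3}, ['x', 'y']):
    print(y)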