comparison dataset.py @ 57:1aabd2e2bb5f

Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Tue, 29 Apr 2008 17:45:16 -0400
parents 1729ad44f175
children 9165d86855ab
comparison 56:1729ad44f175 -> 57:1aabd2e2bb5f
 
 * dataset[i] returns an Example.
 
 * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.
 
-* dataset['key'] returns a property associated with the given 'key' string.
-  If 'key' is a fieldname, then the VStacked field values (iterable over
-  field values) for that field is returned. Other keys may be supported
-  by different dataset subclasses. The following key names should be supported:
+* dataset[fieldname] returns an iterable over the values of the field
+  fieldname across the dataset (the iterable is obtained by default by
+  calling valuesVStack over the values for individual examples).
+
+* dataset.<property> returns the value of a property associated with
+  the name <property>. The following properties should be supported:
   - 'description': a textual description or name for the dataset
-  - '<fieldname>.type': a type name or value for a given <fieldname>
+  - 'fieldtypes': a list of types (one per field)
 
 Datasets can be concatenated either vertically (increasing the length) or
 horizontally (augmenting the set of fields), if they are compatible, using
 the following operations (with the same basic semantics as numpy.hstack
 and numpy.vstack):
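The indexing conventions above can be sketched with a toy in-memory dataset. This is a hypothetical stand-in for illustration only, not the DataSet class in this module (the name ToyDataSet and the dict-based Example are made up):

```python
class ToyDataSet:
    """Hypothetical stand-in illustrating the indexing conventions above."""
    def __init__(self, rows, fieldnames):
        self.rows = [list(r) for r in rows]     # one row per example
        self.fieldnames = list(fieldnames)      # one name per field

    def __getitem__(self, key):
        if isinstance(key, str):
            # dataset[fieldname]: iterable over that field's values
            # across the whole dataset (cf. valuesVStack)
            j = self.fieldnames.index(key)
            return [row[j] for row in self.rows]
        if isinstance(key, list):
            # dataset[[i1,i2,...,in]]: a dataset with examples i1,i2,...,in
            return ToyDataSet([self.rows[i] for i in key], self.fieldnames)
        # dataset[i]: a single Example, here a fieldname -> value mapping
        return dict(zip(self.fieldnames, self.rows[key]))

d = ToyDataSet([[1, 2], [3, 4], [5, 6]], ["x", "y"])
```

With this sketch, `d[1]` yields one example, `d["y"]` all values of field `y`, and `d[[0, 2]]` a two-example sub-dataset.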
 * hasFields
 * __getitem__ may not be feasible with some streams
 * __iter__
 """
 
-    def __init__(self,description=None,field_types=None):
+    def __init__(self,description=None,fieldtypes=None):
         if description is None:
             # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)"
             description = type(self).__name__ + " ( " + " ".join([x.__name__ for x in type(self).__bases__]) + " )"
         self.description=description
-        self.field_types=field_types
+        self.fieldtypes=fieldtypes
 
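Standalone, the default-description logic reads as below. Note the original's bare join(...) needs `from string import join` on Python 2; `" ".join` as used here is the portable spelling. The class names are invented for the demonstration:

```python
class Base1: pass
class Base2: pass

class MyDataSet(Base1, Base2):
    def __init__(self, description=None, fieldtypes=None):
        if description is None:
            # default: "<DataSetType> ( <SuperClass1> <SuperClass2> ... )"
            description = (type(self).__name__ + " ( "
                           + " ".join(x.__name__ for x in type(self).__bases__)
                           + " )")
        self.description = description
        self.fieldtypes = fieldtypes

default_desc = MyDataSet().description   # "MyDataSet ( Base1 Base2 )"
```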
 class MinibatchToSingleExampleIterator(object):
     """
     Converts the result of minibatch iterator with minibatch_size==1 into
     single-example values in the result. Therefore the result of
...
             self.minibatch._values = [sub_data[:,self.dataset.fields_columns[f]] for f in self.minibatch._names]
             self.current+=self.minibatch_size
             return self.minibatch
 
         return ArrayDataSetIterator(self,fieldnames,minibatch_size,n_batches,offset)
 
+
+class CachedDataSet(DataSet):
+    """
+    Wrap a dataset whose values are computationally expensive to obtain
+    (e.g. because they involve some computation, or disk access),
+    so that repeated accesses to the same example are done cheaply,
+    by caching every example value that has been accessed at least once.
+
+    Optionally, for a finite-length dataset, all the values can be computed
+    (and cached) upon construction of the CachedDataSet, rather than at the
+    first access.
+    """
+
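The caching behaviour the docstring describes can be sketched as below. This is a hypothetical wrapper, not the eventual CachedDataSet implementation; CachingWrapper and CountingList are invented names:

```python
class CachingWrapper:
    """Hypothetical sketch of the caching behaviour, not the real class."""
    def __init__(self, dataset, precompute=False):
        self.dataset = dataset
        self.cache = {}
        if precompute:
            # finite-length dataset: compute everything at construction
            for i in range(len(dataset)):
                self.cache[i] = dataset[i]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        if i not in self.cache:
            # the expensive access happens at most once per example
            self.cache[i] = self.dataset[i]
        return self.cache[i]

class CountingList(list):
    """Counts raw accesses, to make the caching visible."""
    accesses = 0
    def __getitem__(self, i):
        CountingList.accesses += 1
        return list.__getitem__(self, i)

cached = CachingWrapper(CountingList([10, 20, 30]))
first, second = cached[1], cached[1]   # underlying data read only once
```

The second lookup is served from the cache, so the wrapped dataset sees a single access.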
+class ApplyFunctionDataSet(DataSet):
+    """
+    A dataset that contains as fields the results of applying a given function
+    example-wise or minibatch-wise to all the fields of an input dataset.
+    The output of the function should be an iterable (e.g. a list or a LookupList)
+    over the resulting values. In minibatch mode, the function is expected
+    to work on minibatches (it takes a minibatch as input and returns a
+    minibatch as output).
+
+    The function is applied each time an example or a minibatch is accessed.
+    To avoid redoing the computation, wrap this dataset inside a CachedDataSet.
+    """
+
+
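A minimal sketch of the example-wise case, under the assumption of a simple indexable dataset (ApplyFunctionWrapper is a hypothetical name, not this module's class):

```python
class ApplyFunctionWrapper:
    """Hypothetical sketch: apply fn to each example at every access."""
    def __init__(self, dataset, fn):
        self.dataset = dataset
        self.fn = fn

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        # recomputed on every access; wrap in a caching dataset to avoid that
        return self.fn(self.dataset[i])

    def __iter__(self):
        return (self.fn(example) for example in self.dataset)

doubled = ApplyFunctionWrapper([1, 2, 3], lambda x: 2 * x)
```

Because the function is reapplied on each access, composing this with a caching wrapper (as the docstring suggests) trades memory for repeated computation.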
 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None):
     """
     Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the
     user to define a set of fields as the 'input' field and a set of fields
     as the 'target' field. Optionally, a single weight_field can also be defined.
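The field split this wrapper enforces can be sketched per-example as follows. This hypothetical helper (as_supervised is an invented name) only illustrates the input/target/weight partition; the real supervised_learning_dataset wraps a whole DataSet:

```python
def as_supervised(example, input_fields, target_fields, weight_field=None):
    """Hypothetical helper: split one example's fields into input,
    target, and optional weight, as described in the docstring above."""
    inputs = {f: example[f] for f in input_fields}
    targets = {f: example[f] for f in target_fields}
    weight = example[weight_field] if weight_field is not None else None
    return inputs, targets, weight

example = {"x1": 1.0, "x2": 2.0, "y": 0, "w": 0.5}
inputs, targets, weight = as_supervised(example, ["x1", "x2"], ["y"], "w")
```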