Mercurial repository: pylearn
comparison: dataset.py @ 57:1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet

author:   Yoshua Bengio <bengioy@iro.umontreal.ca>
date:     Tue, 29 Apr 2008 17:45:16 -0400
parents:  1729ad44f175
children: 9165d86855ab
comparing 56:1729ad44f175 with 57:1aabd2e2bb5f
@@ -78,16 +78,18 @@
 
 * dataset[i] returns an Example.
 
 * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.
 
-* dataset['key'] returns a property associated with the given 'key' string.
-  If 'key' is a fieldname, then the VStacked field values (iterable over
-  field values) for that field is returned. Other keys may be supported
-  by different dataset subclasses. The following key names are should be supported:
+* dataset[fieldname] an iterable over the values of the field fieldname across
+  the dataset (the iterable is obtained by default by calling valuesVStack
+  over the values for individual examples).
+
+* dataset.<property> returns the value of a property associated with
+  the name <property>. The following properties should be supported:
    - 'description': a textual description or name for the dataset
-   - '<fieldname>.type': a type name or value for a given <fieldname>
+   - 'fieldtypes': a list of types (one per field)
 
 Datasets can be concatenated either vertically (increasing the length) or
 horizontally (augmenting the set of fields), if they are compatible, using
 the following operations (with the same basic semantics as numpy.hstack
 and numpy.vstack):
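The new docstring describes three indexing behaviors. A toy sketch of those semantics, assuming a dict-of-lists storage; `ToyDataSet` is hypothetical and not the pylearn `DataSet` class:

```python
# Illustrative sketch of the indexing semantics in the docstring above;
# ToyDataSet is a hypothetical stand-in, not pylearn's DataSet.
class ToyDataSet:
    def __init__(self, fields):
        # fields: dict mapping each fieldname to a list of per-example values
        self.fields = fields

    def __getitem__(self, key):
        if isinstance(key, int):
            # dataset[i] -> an Example: one value per field
            return {name: values[key] for name, values in self.fields.items()}
        if isinstance(key, list):
            # dataset[[i1,i2,...,in]] -> a dataset restricted to those examples
            return ToyDataSet({name: [values[i] for i in key]
                               for name, values in self.fields.items()})
        # dataset[fieldname] -> iterable over that field's values across the
        # dataset (pylearn obtains this via valuesVStack; a list stands in here)
        return self.fields[key]

d = ToyDataSet({"x": [1, 2, 3], "y": [10, 20, 30]})
```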
@@ -123,16 +125,16 @@
     * hasFields
     * __getitem__ may not be feasible with some streams
     * __iter__
     """
 
-    def __init__(self,description=None,field_types=None):
+    def __init__(self,description=None,fieldtypes=None):
         if description is None:
             # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)"
             description = type(self).__name__ + " ( " + join([x.__name__ for x in type(self).__bases__]) + " )"
         self.description=description
-        self.field_types=field_types
+        self.fieldtypes=field_types
 
 class MinibatchToSingleExampleIterator(object):
     """
     Converts the result of minibatch iterator with minibatch_size==1 into
     single-example values in the result. Therefore the result of
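The `join(...)` call in this constructor is Python 2's `string.join`, whose default separator is a single space. A sketch of the same default-description logic in modern Python; the base classes are hypothetical stand-ins, and the sketch applies the `fieldtypes` rename consistently:

```python
# Sketch of the default-description logic from __init__ above. The original
# join(...) is Python 2's string.join (space-separated by default); the base
# classes here are hypothetical, not pylearn's.
class ArrayLike: pass
class Streamable: pass

class MyDataSet(ArrayLike, Streamable):
    def __init__(self, description=None, fieldtypes=None):
        if description is None:
            # by default: "<DataSetType> ( <SuperClass1> <SuperClass2> ... )"
            description = (type(self).__name__ + " ( "
                           + " ".join(x.__name__ for x in type(self).__bases__)
                           + " )")
        self.description = description
        self.fieldtypes = fieldtypes

d = MyDataSet()
```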
@@ -942,6 +944,19 @@
                 self.minibatch._values = [sub_data[:,self.dataset.fields_columns[f]] for f in self.minibatch._names]
                 self.current+=self.minibatch_size
                 return self.minibatch
 
         return ArrayDataSetIterator(self,fieldnames,minibatch_size,n_batches,offset)
 
+
+class CachedDataSet(DataSet):
+    """
+    Wrap a dataset whose values are computationally expensive to obtain
+    (e.g. because they involve some computation, or disk access),
+    so that repeated accesses to the same example are done cheaply,
+    by caching every example value that has been accessed at least once.
+
+    Optionally, for finite-length dataset, all the values can be computed
+    (and cached) upon construction of the CachedDataSet, rather at the
+    first access.
+    """
+
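CachedDataSet is added here with only a docstring; a minimal sketch of the caching behavior it describes, including the optional precompute-on-construction mode. All names below are illustrative, not the eventual pylearn implementation:

```python
# Minimal sketch of CachedDataSet's documented behavior: memoize example
# access so repeated reads are cheap. Hypothetical names throughout.
class CachingWrapper:
    def __init__(self, dataset, precompute=False):
        self.dataset = dataset
        self.cache = {}
        if precompute:  # only sensible for finite-length datasets
            for i in range(len(dataset)):
                self.cache[i] = dataset[i]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        if i not in self.cache:      # first access pays the full cost
            self.cache[i] = self.dataset[i]
        return self.cache[i]         # repeated accesses hit the cache

class CountingSource:
    """Stands in for an expensive dataset; counts raw accesses."""
    def __init__(self, data):
        self.data = data
        self.accesses = 0
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        self.accesses += 1
        return self.data[i]

src = CountingSource([10, 20, 30])
cached = CachingWrapper(src)
```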
@@ -947,0 +963,14 @@
+class ApplyFunctionDataSet(DataSet):
+    """
+    A dataset that contains as fields the results of applying a given function
+    example-wise or minibatch-wise to all the fields of an input dataset.
+    The output of the function should be an iterable (e.g. a list or a LookupList)
+    over the resulting values. In minibatch mode, the function is expected
+    to work on minibatches (takes a minibatch in input and returns a minibatch
+    in output).
+
+    The function is applied each time an example or a minibatch is accessed.
+    To avoid re-doing computation, wrap this dataset inside a CachedDataSet.
+    """
+
+
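ApplyFunctionDataSet is likewise docstring-only at this revision. A sketch of the on-the-fly mapping it describes: the function runs on every access (which is why the docstring suggests wrapping the result in a CachedDataSet). Names are hypothetical, not the eventual pylearn API:

```python
# Sketch of ApplyFunctionDataSet's documented example-wise mode: the function
# maps one example to an iterable of resulting field values, and is re-applied
# on every access. Hypothetical names, not pylearn's implementation.
class ApplyFunctionWrapper:
    def __init__(self, dataset, fn):
        self.dataset = dataset   # any indexable dataset of examples
        self.fn = fn             # maps one example -> iterable of field values

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        # applied each time an example is accessed (no caching here)
        return self.fn(self.dataset[i])

squares = ApplyFunctionWrapper([1, 2, 3], lambda x: [x, x * x])
```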
 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None):
     """
     Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the
     user to define a set of fields as the 'input' field and a set of fields
     as the 'target' field. Optionally, a single weight_field can also be defined.
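A sketch of the field grouping that supervised_learning_dataset's docstring describes, applied to a single example; the helper below is illustrative only, not pylearn's wrapper:

```python
# Illustrative helper showing the docstring's field grouping: designated
# fields become 'input' and 'target', with an optional single weight field.
# group_fields is a hypothetical name, not part of pylearn.
def group_fields(example, input_fields, target_fields, weight_field=None):
    grouped = {
        "input":  [example[f] for f in input_fields],
        "target": [example[f] for f in target_fields],
    }
    if weight_field is not None:
        grouped["weight"] = example[weight_field]
    return grouped

ex = {"x0": 1.0, "x1": 2.0, "y": 3.0, "w": 0.5}
g = group_fields(ex, ["x0", "x1"], ["y"], "w")
```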