diff dataset.py @ 72:2b6656b2ef52
Changed docs slightly
| author | Joseph Turian <turian@iro.umontreal.ca> |
|---|---|
| date | Fri, 02 May 2008 18:36:47 -0400 |
| parents | dde1fb1b63ba |
| children | 69f97aad3faf |
```diff
--- a/dataset.py	Fri May 02 18:19:35 2008 -0400
+++ b/dataset.py	Fri May 02 18:36:47 2008 -0400
@@ -27,29 +27,29 @@
     feasible or not recommanded on streams).
 
     To iterate over examples, there are several possibilities:
-    * for example in dataset([field1, field2,field3, ...]):
-    * for val1,val2,val3 in dataset([field1, field2,field3]):
-    * for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N):
-    * for mini1,mini2,mini3 in dataset.minibatches([field1, field2, field3], minibatch_size=N):
-    * for example in dataset:
+    - for example in dataset([field1, field2,field3, ...]):
+    - for val1,val2,val3 in dataset([field1, field2,field3]):
+    - for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N):
+    - for mini1,mini2,mini3 in dataset.minibatches([field1, field2, field3], minibatch_size=N):
+    - for example in dataset::
         print example['x']
-    * for x,y,z in dataset:
-    Each of these is documented below. All of these iterators are expected
-    to provide, in addition to the usual 'next()' method, a 'next_index()' method
-    which returns a non-negative integer pointing to the position of the next
-    example that will be returned by 'next()' (or of the first example in the
-    next minibatch returned). This is important because these iterators
-    can wrap around the dataset in order to do multiple passes through it,
-    in possibly unregular ways if the minibatch size is not a divisor of the
-    dataset length.
+    - for x,y,z in dataset:
+    Each of these is documented below. All of these iterators are expected
+    to provide, in addition to the usual 'next()' method, a 'next_index()' method
+    which returns a non-negative integer pointing to the position of the next
+    example that will be returned by 'next()' (or of the first example in the
+    next minibatch returned). This is important because these iterators
+    can wrap around the dataset in order to do multiple passes through it,
+    in possibly unregular ways if the minibatch size is not a divisor of the
+    dataset length.
 
     To iterate over fields, one can do
-    * for field in dataset.fields():
+    - for field in dataset.fields():
         for field_value in field: # iterate over the values associated to that field for all the dataset examples
-    * for field in dataset(field1,field2,...).fields() to select a subset of fields
-    * for field in dataset.fields(field1,field2,...) to select a subset of fields
+    - for field in dataset(field1,field2,...).fields() to select a subset of fields
+    - for field in dataset.fields(field1,field2,...) to select a subset of fields
     and each of these fields is iterable over the examples:
-    * for field_examples in dataset.fields():
+    - for field_examples in dataset.fields():
         for example_value in field_examples:
             ...
     but when the dataset is a stream (unbounded length), it is not recommanded to do
@@ -57,7 +57,7 @@
     an unsynchronized ways. Hence the fields() method is illegal for streams, by default.
     The result of fields() is a DataSetFields object, which iterates over fields,
     and whose elements are iterable over examples. A DataSetFields object can
-    be turned back into a DataSet with its examples() method:
+    be turned back into a DataSet with its examples() method::
       dataset2 = dataset1.fields().examples()
     and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
 
@@ -72,20 +72,20 @@
     of examples) can be extracted. These operations are not supported by default
     in the case of streams.
 
-    * dataset[:n] returns a dataset with the n first examples.
+    - dataset[:n] returns a dataset with the n first examples.
 
-    * dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s.
+    - dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s.
 
-    * dataset[i] returns an Example.
+    - dataset[i] returns an Example.
 
-    * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.
+    - dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.
 
-    * dataset[fieldname] an iterable over the values of the field fieldname across
-      the dataset (the iterable is obtained by default by calling valuesVStack
-      over the values for individual examples).
+    - dataset[fieldname] an iterable over the values of the field fieldname across
+      the dataset (the iterable is obtained by default by calling valuesVStack
+      over the values for individual examples).
 
-    * dataset.<property> returns the value of a property associated with
-      the name <property>. The following properties should be supported:
+    - dataset.<property> returns the value of a property associated with
+      the name <property>. The following properties should be supported:
       - 'description': a textual description or name for the dataset
       - 'fieldtypes': a list of types (one per field)
@@ -94,12 +94,12 @@
     the following operations (with the same basic semantics as numpy.hstack
     and numpy.vstack):
 
-    * dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3])
+    - dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3])
     creates a new dataset whose list of fields is the concatenation of the list of
     fields of the argument datasets. This only works if they all have the same length.
-    * dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3])
+    - dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3])
     creates a new dataset that concatenates the examples from the argument datasets
     (and whose length is the sum of the length of the argument datasets). This only
@@ -116,15 +116,15 @@
     when key is a string (or more specifically, neither an integer, a slice, nor a list).
 
     A DataSet sub-class should always redefine the following methods:
-    * __len__ if it is not a stream
-    * fieldNames
-    * minibatches_nowrap (called by DataSet.minibatches())
-    * valuesHStack
-    * valuesVStack
+    - __len__ if it is not a stream
+    - fieldNames
+    - minibatches_nowrap (called by DataSet.minibatches())
+    - valuesHStack
+    - valuesVStack
     For efficiency of implementation, a sub-class might also want to redefine
-    * hasFields
-    * __getitem__ may not be feasible with some streams
-    * __iter__
+    - hasFields
+    - __getitem__ may not be feasible with some streams
+    - __iter__
     """
 
     def __init__(self,description=None,fieldtypes=None):
```
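The docstring in the first hunk specifies an iterator contract (a 'next_index()' method, and wrap-around when the minibatch size does not divide the dataset length) without showing it in action. Here is a minimal, self-contained sketch of that contract; `ToyDataSet` and its internals are hypothetical illustrations, not pylearn's implementation:

```python
class ToyDataSet:
    """Illustrative stand-in for a DataSet: examples are {field: value} dicts."""

    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def minibatches(self, fieldnames, minibatch_size, n_batches):
        """Yield {field: [values]} minibatches, wrapping around the dataset
        when minibatch_size is not a divisor of the dataset length."""
        index = 0
        for _ in range(n_batches):
            # 'index' plays the role of next_index(): the position of the
            # first example in the minibatch about to be produced.
            rows = [(index + k) % len(self) for k in range(minibatch_size)]
            yield {f: [self.examples[r][f] for r in rows] for f in fieldnames}
            index = (index + minibatch_size) % len(self)


data = ToyDataSet([{'x': i, 'y': 2 * i} for i in range(5)])
for batch in data.minibatches(['x', 'y'], minibatch_size=2, n_batches=4):
    print(batch)  # the third minibatch wraps around: rows 4 and 0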
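The fields()/examples() round trip described in the second hunk can be pictured with a similarly hypothetical sketch, reusing `ToyDataSet` from above; the real DataSetFields class is richer than this:

```python
class ToyFields:
    """Hypothetical stand-in for DataSetFields: iterates over fields,
    and each element is iterable over the examples."""

    def __init__(self, dataset, fieldnames):
        self.dataset = dataset
        self.fieldnames = fieldnames

    def __iter__(self):
        # One element per field: the values of that field across all examples.
        for f in self.fieldnames:
            yield [example[f] for example in self.dataset.examples]

    def examples(self):
        # Turn the fields view back into a dataset; by default the round trip
        # is the identity, so dataset.fields().examples() behaves like dataset.
        return self.dataset
```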
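The indexing rules in the third hunk (integer gives an Example, slice or index list gives a sub-dataset, string gives the values of a field) amount to a type dispatch in __getitem__. A hedged sketch in terms of the hypothetical `ToyDataSet`:

```python
def toy_getitem(dataset, key):
    """Sketch of the documented __getitem__ dispatch (not pylearn's code)."""
    if isinstance(key, int):
        return dataset.examples[key]              # dataset[i] -> an Example
    if isinstance(key, slice):
        return ToyDataSet(dataset.examples[key])  # dataset[:n], dataset[i1:i2:s]
    if isinstance(key, list):
        return ToyDataSet([dataset.examples[i] for i in key])  # dataset[[i1,i2,...]]
    # Neither an integer, a slice, nor a list: treat key as a field name and
    # return the values of that field across the whole dataset.
    return [example[key] for example in dataset.examples]
```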
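Finally, the fourth hunk documents `|` and `&` as synonyms for hstack and vstack. Under the same toy representation they could be wired up as below (illustrative only; pylearn's valuesHStack/valuesVStack handle the per-field stacking):

```python
def toy_hstack(datasets):
    """dataset1 | dataset2: concatenate the field lists; all argument
    datasets must have the same length."""
    assert len({len(d) for d in datasets}) == 1, "lengths must match"
    merged = [{} for _ in range(len(datasets[0]))]
    for d in datasets:
        for row, example in zip(merged, d.examples):
            row.update(example)
    return ToyDataSet(merged)


def toy_vstack(datasets):
    """dataset1 & dataset2: concatenate the examples; the result's length
    is the sum of the argument datasets' lengths (fields must match)."""
    return ToyDataSet([e for d in datasets for e in d.examples])
```

A dataset class would then simply delegate: __or__ calls toy_hstack on [self, other], and __and__ calls toy_vstack.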