dataset.py @ 72:2b6656b2ef52

Changed docs slightly
author Joseph Turian <turian@iro.umontreal.ca>
date Fri, 02 May 2008 18:36:47 -0400
parents dde1fb1b63ba
children 69f97aad3faf
or known length, so this class can be used to interface to a 'stream' which
feeds on-line learning (however, as noted below, some operations are not
feasible or not recommended on streams).

To iterate over examples, there are several possibilities:

- for example in dataset([field1, field2, field3, ...]):
- for val1,val2,val3 in dataset([field1, field2, field3]):
- for minibatch in dataset.minibatches([field1, field2, ...], minibatch_size=N):
- for mini1,mini2,mini3 in dataset.minibatches([field1, field2, field3], minibatch_size=N):
- for example in dataset::

    print example['x']

- for x,y,z in dataset:

Each of these is documented below. All of these iterators are expected
to provide, in addition to the usual 'next()' method, a 'next_index()' method
which returns a non-negative integer pointing to the position of the next
example that will be returned by 'next()' (or of the first example in the
next minibatch returned). This is important because these iterators
can wrap around the dataset in order to do multiple passes through it,
in possibly irregular ways if the minibatch size is not a divisor of the
dataset length.
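The iteration styles above can be sketched with a minimal in-memory stand-in (ToyDataSet is hypothetical, not the real DataSet class; the wraparound and next_index() machinery are omitted for brevity):

```python
# Toy sketch of example-wise and minibatch iteration (hypothetical class,
# not the real DataSet API).
class ToyDataSet:
    def __init__(self, fields):
        self.fields = fields                      # {fieldname: list of values}
        self.length = len(next(iter(fields.values())))

    def __iter__(self):                           # one example (dict) at a time
        for i in range(self.length):
            yield {name: values[i] for name, values in self.fields.items()}

    def minibatches(self, fieldnames, minibatch_size):
        # one list of per-field slices per minibatch; no wraparound here
        for start in range(0, self.length, minibatch_size):
            yield [self.fields[name][start:start + minibatch_size]
                   for name in fieldnames]

d = ToyDataSet({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
xs = [example['x'] for example in d]              # example-wise iteration
mb = list(d.minibatches(['x', 'y'], minibatch_size=2))
```

Here `xs` collects the 'x' value of each example in order, and `mb` holds two minibatches of two examples each.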

To iterate over fields, one can do

- for field in dataset.fields():

    for field_value in field: # iterate over the values associated to that field for all the dataset examples

- for field in dataset(field1,field2,...).fields() to select a subset of fields
- for field in dataset.fields(field1,field2,...) to select a subset of fields

and each of these fields is iterable over the examples:

- for field_examples in dataset.fields():

    for example_value in field_examples:
        ...

but when the dataset is a stream (unbounded length), it is not recommended to do
such things because the underlying dataset may refuse to access the different fields in
an unsynchronized way. Hence the fields() method is illegal for streams, by default.
The result of fields() is a DataSetFields object, which iterates over fields,
and whose elements are iterable over examples. A DataSetFields object can
be turned back into a DataSet with its examples() method::

    dataset2 = dataset1.fields().examples()

and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).
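For a finite in-memory dataset, the field-wise view described above is essentially a transposition of the example-wise view; the round trip through fields() and examples() can be illustrated with plain Python (toy data, not the DataSetFields API):

```python
# Toy illustration: examples are rows, fields are columns.
examples = [(1, 4), (2, 5), (3, 6)]      # three (x, y) examples
by_field = list(zip(*examples))           # field-wise view: all x's, all y's
rebuilt = list(zip(*by_field))            # transposing twice recovers examples
```

Transposing twice yields the original rows, mirroring `dataset1.fields().examples()` behaving like `dataset1`.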

Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content.

Dataset elements can be indexed and sub-datasets (with a subset
of examples) can be extracted. These operations are not supported
by default in the case of streams.

- dataset[:n] returns a dataset with the n first examples.

- dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s.

- dataset[i] returns an Example.

- dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.

- dataset[fieldname] returns an iterable over the values of the field fieldname across
  the dataset (the iterable is obtained by default by calling valuesVStack
  over the values for individual examples).

- dataset.<property> returns the value of a property associated with
  the name <property>. The following properties should be supported:

  - 'description': a textual description or name for the dataset
  - 'fieldtypes': a list of types (one per field)

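The indexing cases above suggest that __getitem__ dispatches on the type of its key; a hedged sketch with a toy stand-in class (ToyDS and its dict-of-rows storage are hypothetical, not the actual implementation):

```python
# Hypothetical dispatch on key type for dataset[key] (toy class,
# not the real DataSet code).
class ToyDS:
    def __init__(self, rows):
        self.rows = rows                          # list of {fieldname: value}

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, key):
        if isinstance(key, slice):                # dataset[i1:i2:s] -> sub-dataset
            return ToyDS(self.rows[key])
        if isinstance(key, int):                  # dataset[i] -> one Example
            return self.rows[key]
        if isinstance(key, list):                 # dataset[[i1,i2,...]] -> sub-dataset
            return ToyDS([self.rows[i] for i in key])
        # anything else (a string) names a field: iterate over its values
        return [row[key] for row in self.rows]

d = ToyDS([{'x': i, 'y': 10 * i} for i in range(5)])
```

The string case is checked last, matching the note below that a key which is neither an integer, a slice, nor a list is treated as a field or property name.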
Datasets can be concatenated either vertically (increasing the length) or
horizontally (augmenting the set of fields), if they are compatible, using
the following operations (with the same basic semantics as numpy.hstack
and numpy.vstack):

- dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3])

  creates a new dataset whose list of fields is the concatenation of the list of
  fields of the argument datasets. This only works if they all have the same length.

- dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3])

  creates a new dataset that concatenates the examples from the argument datasets
  (and whose length is the sum of the length of the argument datasets). This only
  works if they all have the same fields.

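A minimal sketch of how the | and & operators could carry these hstack/vstack semantics, assuming a toy dict-of-lists container (MiniDataSet is hypothetical, not the real class):

```python
# Toy operator sketch: | merges fields (hstack-like), & appends examples
# (vstack-like). Hypothetical class, not the actual DataSet code.
class MiniDataSet:
    def __init__(self, fields):
        self.fields = fields                      # {fieldname: list of values}

    def length(self):
        return len(next(iter(self.fields.values())))

    def __or__(self, other):
        # horizontal: only legal when both operands have the same length
        assert all(len(v) == self.length() for v in other.fields.values())
        merged = dict(self.fields)
        merged.update(other.fields)
        return MiniDataSet(merged)

    def __and__(self, other):
        # vertical: only legal when both operands have the same field names
        assert set(self.fields) == set(other.fields)
        return MiniDataSet({k: self.fields[k] + other.fields[k]
                            for k in self.fields})

d1 = MiniDataSet({'x': [1, 2]})
d2 = MiniDataSet({'y': [3, 4]})
wide = d1 | d2                         # 2 examples, fields x and y
tall = d1 & MiniDataSet({'x': [5]})    # 3 examples, field x
```

The assertions mirror the compatibility conditions stated above: | requires equal lengths, & requires identical field sets.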
or other properties of the dataset or associated with the dataset or the result
of a computation stored in a dataset. These can be accessed through the [key] syntax
when key is a string (or more specifically, neither an integer, a slice, nor a list).

A DataSet sub-class should always redefine the following methods:

- __len__ if it is not a stream
- fieldNames
- minibatches_nowrap (called by DataSet.minibatches())
- valuesHStack
- valuesVStack

For efficiency of implementation, a sub-class might also want to redefine:

- hasFields
- __getitem__ (may not be feasible with some streams)
- __iter__
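For illustration, a toy list-backed sub-class covering the always-redefined methods might look as follows (the bodies, and the exact minibatches_nowrap signature, are assumptions for the sketch, not taken from the actual code base):

```python
# Illustrative skeleton only: the methods the docstring says a sub-class
# should redefine, with toy dict-of-lists bodies (hypothetical).
class ListDataSet:
    def __init__(self, columns):
        self.columns = columns                    # {fieldname: list of values}

    def __len__(self):                            # not available for streams
        return len(next(iter(self.columns.values())))

    def fieldNames(self):
        return list(self.columns)

    def minibatches_nowrap(self, fieldnames, minibatch_size, n_batches, offset):
        # single pass, no wraparound; DataSet.minibatches() would add wrapping
        # (n_batches handling omitted in this toy version)
        for start in range(offset, len(self), minibatch_size):
            yield [self.columns[f][start:start + minibatch_size]
                   for f in fieldnames]

    def valuesHStack(self, fieldnames, fieldvalues):
        # glue the values of several fields side by side (toy: one flat list)
        return [v for vals in fieldvalues for v in vals]

    def valuesVStack(self, fieldname, values):
        # stack the per-example values of one field (toy: a plain list)
        return list(values)

ds = ListDataSet({'x': [1, 2, 3], 'y': [4, 5, 6]})
```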
    """

    def __init__(self,description=None,fieldtypes=None):
        if description is None:
            # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)"