Mercurial > pylearn
dataset.py @ 72:2b6656b2ef52
Changed docs slightly
author:   Joseph Turian <turian@iro.umontreal.ca>
date:     Fri, 02 May 2008 18:36:47 -0400
parents:  dde1fb1b63ba
children: 69f97aad3faf
comparing 71:5b699b31770a to 72:2b6656b2ef52
or known length, so this class can be used to interface to a 'stream' which
feeds on-line learning (however, as noted below, some operations are not
feasible or not recommended on streams).

To iterate over examples, there are several possibilities:
- for example in dataset([field1, field2, field3, ...]):
- for val1,val2,val3 in dataset([field1, field2, field3]):
- for minibatch in dataset.minibatches([field1, field2, ...], minibatch_size=N):
- for mini1,mini2,mini3 in dataset.minibatches([field1, field2, field3], minibatch_size=N):
- for example in dataset::
    print example['x']
- for x,y,z in dataset:
Each of these is documented below. All of these iterators are expected
to provide, in addition to the usual 'next()' method, a 'next_index()' method
which returns a non-negative integer pointing to the position of the next
example that will be returned by 'next()' (or of the first example in the
next minibatch returned). This is important because these iterators
can wrap around the dataset in order to do multiple passes through it,
possibly in irregular ways if the minibatch size is not a divisor of the
dataset length.

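The iteration and wrap-around contract above can be sketched with a toy, list-backed dataset. This is only an illustration of the behavior the docstring describes, not pylearn's actual implementation; the class name and minibatch signature here are assumptions.

```python
# Toy sketch of the iteration contract: row-wise example iteration, plus
# minibatches that wrap around when the minibatch size does not divide
# the dataset length. Not pylearn's real DataSet class.

class ToyDataSet:
    """List-backed dataset mapping field names to equal-length columns."""
    def __init__(self, **fields):
        self.fields_ = fields
        self.length = len(next(iter(fields.values())))

    def __len__(self):
        return self.length

    def __iter__(self):
        # 'for example in dataset' yields one {fieldname: value} dict per example
        for i in range(self.length):
            yield {name: col[i] for name, col in self.fields_.items()}

    def minibatches(self, fieldnames, minibatch_size, n_batches):
        # Each batch is a list of per-field value lists; indices wrap
        # modulo the dataset length, so passes may be irregular.
        i = 0
        for _ in range(n_batches):
            batch = [[self.fields_[f][(i + k) % self.length]
                      for k in range(minibatch_size)]
                     for f in fieldnames]
            i = (i + minibatch_size) % self.length
            yield batch

d = ToyDataSet(x=[0, 1, 2, 3, 4], y=[0, 10, 20, 30, 40])
print([ex['x'] for ex in d])   # [0, 1, 2, 3, 4]
# Minibatch size 2 does not divide length 5, so the third batch wraps around:
print(list(d.minibatches(['x'], minibatch_size=2, n_batches=3)))
# [[[0, 1]], [[2, 3]], [[4, 0]]]
```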
To iterate over fields, one can do
- for field in dataset.fields():
    for field_value in field: # iterate over the values associated to that field for all the dataset examples
- for field in dataset(field1,field2,...).fields() to select a subset of fields
- for field in dataset.fields(field1,field2,...) to select a subset of fields
and each of these fields is iterable over the examples:
- for field_examples in dataset.fields():
    for example_value in field_examples:
        ...
but when the dataset is a stream (unbounded length), it is not recommended to do
such things, because the underlying dataset may refuse to access the different fields in
an unsynchronized way. Hence the fields() method is illegal for streams, by default.
The result of fields() is a DataSetFields object, which iterates over fields,
and whose elements are iterable over examples. A DataSetFields object can
be turned back into a DataSet with its examples() method::
    dataset2 = dataset1.fields().examples()
and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1).

Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content.

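The column-wise fields() view, and the round-trip back to row-wise examples, can be sketched as follows. The ToyFields class is hypothetical, standing in for the DataSetFields object the docstring describes.

```python
# Hedged sketch of fields(): one iterable of values per field, column-wise
# over the whole dataset, with an examples() method that turns the view
# back into row-wise examples. Not pylearn's DataSetFields.

data = {'x': [1, 2, 3], 'y': ['a', 'b', 'c']}

class ToyFields:
    def __init__(self, columns):
        self.columns = columns

    def __iter__(self):
        # one entry per field; each entry iterates over example values
        return iter(self.columns.values())

    def examples(self):
        # rebuild row-wise examples from the column-wise view
        n = len(next(iter(self.columns.values())))
        return [{f: col[i] for f, col in self.columns.items()}
                for i in range(n)]

fields_view = ToyFields(data)
print([list(col) for col in fields_view])   # [[1, 2, 3], ['a', 'b', 'c']]
print(fields_view.examples()[0])            # {'x': 1, 'y': 'a'}
```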

Dataset elements can be indexed and sub-datasets (with a subset
of examples) can be extracted. These operations are not supported
by default in the case of streams.

- dataset[:n] returns a dataset with the n first examples.

- dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s.

- dataset[i] returns an Example.

- dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in.

- dataset[fieldname] returns an iterable over the values of the field fieldname across
  the dataset (the iterable is obtained by default by calling valuesVStack
  over the values for individual examples).

- dataset.<property> returns the value of a property associated with
  the name <property>. The following properties should be supported:
    - 'description': a textual description or name for the dataset
    - 'fieldtypes': a list of types (one per field)

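The indexing cases above can be sketched against a plain list of examples. The dispatch helper below is an assumption illustrating the docstring's four key types (slice, int, list, field name), not DataSet.__getitem__ itself.

```python
# Sketch of the indexing semantics: slices and index lists return
# sub-datasets, an int returns a single Example, and a field name
# returns that field's values across the dataset.

examples = [{'x': i, 'y': i * 10} for i in range(6)]

def getitem(examples, key):
    if isinstance(key, slice):    # dataset[:n], dataset[i1:i2:s] -> sub-dataset
        return examples[key]
    if isinstance(key, list):     # dataset[[i1,i2,...in]] -> sub-dataset
        return [examples[i] for i in key]
    if isinstance(key, int):      # dataset[i] -> a single Example
        return examples[key]
    # dataset[fieldname] -> the field's value for each example
    return [ex[key] for ex in examples]

print(len(getitem(examples, slice(None, 3))))   # 3  (the n first examples)
print(getitem(examples, slice(0, 6, 2)))        # examples 0, 2, 4
print(getitem(examples, [5, 1]))                # examples 5 and 1
print(getitem(examples, 'x'))                   # [0, 1, 2, 3, 4, 5]
```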
Datasets can be concatenated either vertically (increasing the length) or
horizontally (augmenting the set of fields), if they are compatible, using
the following operations (with the same basic semantics as numpy.hstack
and numpy.vstack):

- dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3])

  creates a new dataset whose list of fields is the concatenation of the lists of
  fields of the argument datasets. This only works if they all have the same length.

- dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3])

  creates a new dataset that concatenates the examples from the argument datasets
  (and whose length is the sum of the lengths of the argument datasets). This only
  works if they all have the same fields.

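The | (hstack-like) and & (vstack-like) semantics can be sketched with dicts of equal-length columns. The operator overloads here are assumptions about the interface described above, not pylearn's code.

```python
# Minimal sketch: | concatenates field lists (same length required),
# & concatenates examples (same fields required).

class ToyDS:
    def __init__(self, **cols):
        self.cols = cols

    def __len__(self):
        return len(next(iter(self.cols.values())))

    def __or__(self, other):
        # hstack: merge the fields; only valid for equal-length datasets
        assert len(self) == len(other)
        return ToyDS(**{**self.cols, **other.cols})

    def __and__(self, other):
        # vstack: append examples; only valid for identical field sets
        assert set(self.cols) == set(other.cols)
        return ToyDS(**{f: self.cols[f] + other.cols[f] for f in self.cols})

a = ToyDS(x=[1, 2])
b = ToyDS(y=[3, 4])
c = ToyDS(x=[5])
print(sorted((a | b).cols))   # ['x', 'y']  (more fields, same length)
print((a & c).cols['x'])      # [1, 2, 5]   (more examples, same fields)
```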
or other properties of the dataset, or associated with the dataset, or the result
of a computation stored in a dataset. These can be accessed through the [key] syntax
when key is a string (or more specifically, neither an integer, a slice, nor a list).

A DataSet sub-class should always redefine the following methods:
- __len__ if it is not a stream
- fieldNames
- minibatches_nowrap (called by DataSet.minibatches())
- valuesHStack
- valuesVStack
For efficiency of implementation, a sub-class might also want to redefine
- hasFields
- __getitem__ (may not be feasible with some streams)
- __iter__
"""

    def __init__(self,description=None,fieldtypes=None):
        if description is None:
            # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)"