Mercurial > pylearn
annotate dataset.py @ 62:23bf2c9eb7b3
bugfix
author | Frederic Bastien <bastienf@iro.umontreal.ca> |
---|---|
date | Fri, 02 May 2008 10:14:01 -0400 |
parents | a8b70a9117ad |
children | 863da25a60f1 |
rev | line source |
---|---|
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
1 |
12
ff4e551490f1
Added LookupList type in lookup_list.py and used it to keep order
bengioy@esprit.iro.umontreal.ca
parents:
11
diff
changeset
|
2 from lookup_list import LookupList |
3 Example = LookupList |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
4 from misc import unique_elements_list_intersection |
5 from string import join |
6 from sys import maxint |
45
a5c70dc42972
Test functions for dataset.py
bengioy@grenat.iro.umontreal.ca
parents:
44
diff
changeset
|
7 import numpy |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
8 |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
9 class AbstractFunction (Exception): """Derived class must override this function""" |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
10 class NotImplementedYet (NotImplementedError): """Work in progress, this should eventually be implemented""" |
11 |
1
2cd82666b9a7
Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents:
0
diff
changeset
|
12 class DataSet(object): |
16 | 13 """A virtual base class for datasets. |
14 | |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
15 A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction |
16 with learning algorithms (for training and testing them): rows/records are called examples, and |
17 columns/attributes are called fields. The field value for a particular example can be an arbitrary |
18 python object, which depends on the particular dataset. |
19 |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
20 We call a DataSet a 'stream' when its length is unbounded (in which case its __len__ method |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
21 should return sys.maxint). |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
22 |
16 | 23 A DataSet is a generator of iterators; these iterators can run through the |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
24 examples or the fields in a variety of ways. A DataSet need not necessarily have a finite |
16 | 25 or known length, so this class can be used to interface to a 'stream' which |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
26 feeds on-line learning (however, as noted below, some operations are not |
27 feasible or not recommended on streams). |
16 | 28 |
29 To iterate over examples, there are several possibilities: | |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
30 * for example in dataset([field1, field2,field3, ...]): |
31 * for val1,val2,val3 in dataset([field1, field2,field3]): |
32 * for minibatch in dataset.minibatches([field1, field2, ...],minibatch_size=N): |
50 | 33 * for mini1,mini2,mini3 in dataset.minibatches([field1, field2, field3], minibatch_size=N): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
34 * for example in dataset: |
46
c5b07e87b0cb
comments modif made by Yoshua
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
45
diff
changeset
|
35 print example['x'] |
36 * for x,y,z in dataset: |
23
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
37 Each of these is documented below. All of these iterators are expected |
38 to provide, in addition to the usual 'next()' method, a 'next_index()' method |
39 which returns a non-negative integer pointing to the position of the next |
40 example that will be returned by 'next()' (or of the first example in the |
41 next minibatch returned). This is important because these iterators |
42 can wrap around the dataset in order to do multiple passes through it, |
43 in possibly irregular ways if the minibatch size is not a divisor of the |
44 dataset length. |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
45 |
46 To iterate over fields, one can do |
46
c5b07e87b0cb
comments modif made by Yoshua
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
45
diff
changeset
|
47 * for field in dataset.fields(): |
48 for field_value in field: # iterate over the values associated to that field for all the dataset examples |
50 | 49 * for field in dataset(field1,field2,...).fields() to select a subset of fields |
50 * for field in dataset.fields(field1,field2,...) to select a subset of fields | |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
51 and each of these fields is iterable over the examples: |
52 * for field_examples in dataset.fields(): |
53 for example_value in field_examples: |
54 ... |
55 but when the dataset is a stream (unbounded length), it is not recommended to do |
56 such things because the underlying dataset may refuse to access the different fields in |
57 an unsynchronized way. Hence the fields() method is illegal for streams, by default. |
58 The result of fields() is a DataSetFields object, which iterates over fields, |
59 and whose elements are iterable over examples. A DataSetFields object can |
60 be turned back into a DataSet with its examples() method: |
61 dataset2 = dataset1.fields().examples() |
62 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1). |
63 |
16 | 64 Note: Fields are not mutually exclusive, i.e. two fields can overlap in their actual content. |
65 | |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
66 Note: The content of a field can be of any type. Field values can also be 'missing' |
67 (e.g. to handle semi-supervised learning), and in the case of numeric (numpy array) |
46
c5b07e87b0cb
comments modif made by Yoshua
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
45
diff
changeset
|
68 fields (i.e. an ArrayFieldsDataSet), NaN plays the role of a missing value. |
69 For non-numeric fields, None plays the role of a missing value. |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
70 |
71 Dataset elements can be indexed and sub-datasets (with a subset |
72 of examples) can be extracted. These operations are not supported |
73 by default in the case of streams. |
74 |
75 * dataset[:n] returns a dataset with the n first examples. |
16 | 76 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
77 * dataset[i1:i2:s] returns a dataset with the examples i1,i1+s,...i2-s. |
78 |
79 * dataset[i] returns an Example. |
80 |
81 * dataset[[i1,i2,...in]] returns a dataset with examples i1,i2,...in. |
82 |
57
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
83 * dataset[fieldname] an iterable over the values of the field fieldname across |
84 the dataset (the iterable is obtained by default by calling valuesVStack |
85 over the values for individual examples). |
86 |
87 * dataset.<property> returns the value of a property associated with |
88 the name <property>. The following properties should be supported: |
41 | 89 - 'description': a textual description or name for the dataset |
57
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
90 - 'fieldtypes': a list of types (one per field) |
41 | 91 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
92 Datasets can be concatenated either vertically (increasing the length) or |
93 horizontally (augmenting the set of fields), if they are compatible, using |
94 the following operations (with the same basic semantics as numpy.hstack |
95 and numpy.vstack): |
96 |
97 * dataset1 | dataset2 | dataset3 == dataset.hstack([dataset1,dataset2,dataset3]) |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
98 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
99 creates a new dataset whose list of fields is the concatenation of the list of |
100 fields of the argument datasets. This only works if they all have the same length. |
101 |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
102 * dataset1 & dataset2 & dataset3 == dataset.vstack([dataset1,dataset2,dataset3]) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
103 |
104 creates a new dataset that concatenates the examples from the argument datasets |
105 (and whose length is the sum of the length of the argument datasets). This only |
106 works if they all have the same fields. |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
107 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
108 According to the same logic, and viewing a DataSetFields object associated to |
46
c5b07e87b0cb
comments modif made by Yoshua
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
45
diff
changeset
|
109 a DataSet as a kind of transpose of it, fields1 & fields2 concatenates fields of |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
110 a DataSetFields fields1 and fields2, and fields1 | fields2 concatenates their |
111 examples. |
112 |
41 | 113 A dataset can hold arbitrary key-value pairs that give access to meta-data
114 or other properties associated with the dataset, or to the result | |
115 of a computation stored in the dataset. These can be accessed through the [key] syntax | |
116 when key is a string (or more specifically, neither an integer, a slice, nor a list). | |
117 | |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
118 A DataSet sub-class should always redefine the following methods: |
119 * __len__ if it is not a stream |
120 * fieldNames |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
121 * minibatches_nowrap (called by DataSet.minibatches()) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
122 * valuesHStack |
123 * valuesVStack |
124 For efficiency of implementation, a sub-class might also want to redefine |
125 * hasFields |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
126 * __getitem__ may not be feasible with some streams |
127 * __iter__ |
2
3fddb1c8f955
Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents:
1
diff
changeset
|
128 """ |
1
2cd82666b9a7
Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents:
0
diff
changeset
|
129 |
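The `|` and `&` semantics described in the docstring can be illustrated with a small stand-alone sketch. `ToyDataSet` is a hypothetical class, not the pylearn `DataSet`; it stores each field as a plain Python list and only mimics the compatibility rules stated above (equal lengths for `|`, identical field names for `&`).

```python
class ToyDataSet(object):
    """Hypothetical stand-in: ordered field names, one list of values per field."""

    def __init__(self, names, columns):
        assert len(names) == len(columns)
        assert len(set(len(c) for c in columns)) <= 1, "all fields need the same length"
        self.names = list(names)
        self.columns = [list(c) for c in columns]

    def __len__(self):
        return len(self.columns[0]) if self.columns else 0

    def __or__(self, other):
        # Horizontal concatenation: more fields, same examples.
        assert len(self) == len(other), "hstack needs equal lengths"
        return ToyDataSet(self.names + other.names, self.columns + other.columns)

    def __and__(self, other):
        # Vertical concatenation: same fields, more examples.
        assert self.names == other.names, "vstack needs identical fields"
        return ToyDataSet(self.names,
                          [a + b for a, b in zip(self.columns, other.columns)])


xs = ToyDataSet(["x"], [[1, 2, 3]])
ys = ToyDataSet(["y"], [[10, 20, 30]])
wide = xs | ys   # fields x and y, still 3 examples
tall = xs & xs   # field x only, 6 examples
```

The sketch does not check for duplicate field names in `|`; whether the real class allows them is not stated in the docstring.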
57
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
130 def __init__(self,description=None,fieldtypes=None): |
41 | 131 if description is None: |
132 # by default return "<DataSetType>(<SuperClass1>,<SuperClass2>,...)" | |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
133 description = type(self).__name__ + " ( " + join([x.__name__ for x in type(self).__bases__]) + " )" |
41 | 134 self.description=description |
60 | 135 self.fieldtypes=fieldtypes |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
136 |
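As a rough illustration of the default description built in `__init__`, the following sketch reproduces the same string construction in a form that also runs on Python 3, where `string.join` no longer exists. `Base` and `Derived` are hypothetical names used only for this example.

```python
class Base(object):
    def __init__(self, description=None):
        if description is None:
            # Same layout as above: "<Type> ( <Base1> <Base2> ... )"
            bases = " ".join(b.__name__ for b in type(self).__bases__)
            description = type(self).__name__ + " ( " + bases + " )"
        self.description = description


class Derived(Base):
    pass


default_description = Derived().description            # "Derived ( Base )"
custom_description = Derived("my data").description     # "my data"
```

Note that the base-class names end up separated by spaces, not by commas as the comment in the original code suggests.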
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
137 class MinibatchToSingleExampleIterator(object): |
138 """ |
139 Converts the result of minibatch iterator with minibatch_size==1 into |
140 single-example values in the result. Therefore the result of |
141 iterating on the dataset itself gives a sequence of single examples |
142 (whereas the result of iterating over minibatches gives in each |
143 Example field an iterable object over the individual examples in |
144 the minibatch). |
145 """ |
146 def __init__(self, minibatch_iterator): |
147 self.minibatch_iterator = minibatch_iterator |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
148 self.minibatch = None |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
149 def __iter__(self): #makes for loop work |
150 return self |
151 def next(self): |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
152 size1_minibatch = self.minibatch_iterator.next() |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
153 if self.minibatch is None: |
154 self.minibatch = Example(size1_minibatch.keys(),[value[0] for value in size1_minibatch.values()]) |
155 else: |
156 self.minibatch._values = [value[0] for value in size1_minibatch.values()] |
157 return self.minibatch |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
158 |
23
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
159 def next_index(self): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
160 return self.minibatch_iterator.next_index() |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
161 |
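The conversion performed by `MinibatchToSingleExampleIterator` can be sketched with ordinary dictionaries. This is a simplified illustration, not the class above: `single_examples` is a hypothetical helper, an `OrderedDict` stands in for `Example`, and a new mapping is created for every example instead of reusing one object and overwriting its values.

```python
from collections import OrderedDict


def single_examples(minibatches):
    """Yield one example per size-1 minibatch by taking element 0 of every field."""
    for minibatch in minibatches:
        yield OrderedDict((name, values[0]) for name, values in minibatch.items())


# Two size-1 minibatches with fields "x" and "y".
batches = [
    OrderedDict([("x", [1.0]), ("y", [0])]),
    OrderedDict([("x", [2.0]), ("y", [1])]),
]
examples = list(single_examples(batches))
```

Reusing a single `Example` object, as the original code does, avoids an allocation per example, but it also means a caller that keeps a reference to a previous example will see its values change on the next call to `next()`.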
3
378b68d5c4ad
Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents:
2
diff
changeset
|
162 def __iter__(self): |
16 | 163 """Supports the syntax "for i in dataset: ..." |
1
2cd82666b9a7
Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents:
0
diff
changeset
|
164 |
16 | 165 Using this syntax, "i" will be an Example instance (or equivalent) with |
166 all the fields of DataSet self. Every field of "i" will give access to | |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
167 a field of a single example. Fields should be accessible via |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
168 i["fieldname"] or i[3] (in the order defined by the elements of the |
169 Example returned by this iterator), but the derived class is free |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
170 to accept any type of identifier, and add extra functionality to the iterator. |
16 | 171 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
172 The default implementation calls the minibatches iterator and extracts the first example of each field. |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
173 """ |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
174 return DataSet.MinibatchToSingleExampleIterator(self.minibatches(None, minibatch_size = 1)) |
2
3fddb1c8f955
Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents:
1
diff
changeset
|
175 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
176 |
177 class MinibatchWrapAroundIterator(object): |
178 """ |
179 An iterator for minibatches that handles the case where we need to wrap around the |
180 dataset because n_batches*minibatch_size > len(dataset). It is constructed from |
181 a dataset that provides a minibatch iterator that does not need to handle that problem. |
182 This class is a utility for dataset subclass writers, so that they do not have to handle |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
183 this issue multiple times, nor check that fieldnames are valid, nor handle the |
184 empty fieldnames (meaning 'use all the fields'). |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
185 """ |
186 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset): |
187 self.dataset=dataset |
188 self.fieldnames=fieldnames |
189 self.minibatch_size=minibatch_size |
190 self.n_batches=n_batches |
191 self.n_batches_done=0 |
192 self.next_row=offset |
193 self.L=len(dataset) |
194 assert offset+minibatch_size<=self.L |
195 ds_nbatches = (self.L-offset)//minibatch_size |
196 if n_batches is not None: |
197 ds_nbatches = min(n_batches,ds_nbatches) |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
198 if fieldnames: |
199 assert dataset.hasFields(*fieldnames) |
200 else: |
201 fieldnames=self.fieldnames=dataset.fieldNames() |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
202 self.iterator = dataset.minibatches_nowrap(fieldnames,minibatch_size,ds_nbatches,offset) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
203 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
204 def __iter__(self): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
205 return self |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
206 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
207 def next_index(self): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
208 return self.next_row |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
209 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
210 def next(self): |
43
e92244f30116
Corrected iterator logic errors
bengioy@grenat.iro.umontreal.ca
parents:
42
diff
changeset
|
211 if self.n_batches and self.n_batches_done==self.n_batches: |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
212 raise StopIteration |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
213 upper = self.next_row+self.minibatch_size |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
214 if upper <=self.L: |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
215 minibatch = self.iterator.next() |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
216 else: |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
217 if not self.n_batches: |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
218 raise StopIteration |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
219 # we must concatenate (vstack) the bottom and top parts of our minibatch |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
220 # first get the beginning of our minibatch (top of dataset) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
221 first_part = self.dataset.minibatches_nowrap(self.fieldnames,self.L-self.next_row,1,self.next_row).next() |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
222 second_part = self.dataset.minibatches_nowrap(self.fieldnames,upper-self.L,1,0).next() |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
223 minibatch = Example(self.fieldnames, |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
224 [self.dataset.valuesVStack(name,[first_part[name],second_part[name]]) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
225 for name in self.fieldnames]) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
226 self.next_row=upper |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
227 self.n_batches_done+=1 |
43
e92244f30116
Corrected iterator logic errors
bengioy@grenat.iro.umontreal.ca
parents:
42
diff
changeset
|
228 if upper >= self.L and self.n_batches: |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
229 self.next_row -= self.L |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
230 return minibatch |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
231 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
232 |
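The wrap-around step in next() above can be sketched on a plain Python list. This is only an illustration under stated assumptions: `wraparound_minibatches` is a hypothetical helper, not part of dataset.py, it uses list concatenation in place of valuesVStack, and it is written to also run on Python 3.

```python
def wraparound_minibatches(rows, minibatch_size, n_batches, offset=0):
    """Yield n_batches lists of minibatch_size rows, wrapping past the end.

    Hypothetical helper: `rows` is a plain list standing in for a dataset.
    """
    length = len(rows)
    assert 0 <= offset < length and 0 < minibatch_size <= length
    next_row = offset
    for _ in range(n_batches):
        upper = next_row + minibatch_size
        if upper <= length:
            # The whole minibatch fits before the end of the data.
            batch = rows[next_row:upper]
        else:
            # Tail of the data followed by its head (the "vstack" of two parts).
            batch = rows[next_row:] + rows[:upper - length]
        # Bring the cursor back into [0, length) after a wrap.
        next_row = upper % length
        yield batch


batches = list(wraparound_minibatches(list(range(5)), minibatch_size=2, n_batches=4))
```

With five rows and minibatches of two, the third minibatch is the one that straddles the end of the data and is stitched together from the last row and the first row.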
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
233 minibatches_fieldnames = None |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
234 minibatches_minibatch_size = 1 |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
235 minibatches_n_batches = None |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
236 def minibatches(self, |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
237 fieldnames = minibatches_fieldnames, |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
238 minibatch_size = minibatches_minibatch_size, |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
239 n_batches = minibatches_n_batches, |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
240 offset = 0): |
6
d5738b79089a
Removed MinibatchIterator and instead made minibatch_size a field of all DataSets,
bengioy@bengiomac.local
parents:
5
diff
changeset
|
241 """ |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
242 Return an iterator that supports three forms of syntax: |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
243 |
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
244 for i in dataset.minibatches(None,**kwargs): ... |
16 | 245 |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
246 for i in dataset.minibatches([f1, f2, f3],**kwargs): ... |
16 | 247 |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
248 for i1, i2, i3 in dataset.minibatches([f1, f2, f3],**kwargs): ... |
16 | 249 |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
250 Using the first two syntaxes, "i" will be an indexable object, such as a list, |
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
251 tuple, or Example instance. In both cases, i[k] is a list-like container |
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
252 of a batch of current examples. In the second case, i[0] is |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
253 a list-like container of the f1 field of a batch of current examples, i[1] is |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
254 a list-like container of the f2 field, etc. |
2
3fddb1c8f955
Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents:
1
diff
changeset
|
255 |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
256 Using the first syntax, all the fields will be returned in "i". |
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
257 Using the third syntax, i1, i2, i3 will be list-like containers of the |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
258 f1, f2, and f3 fields of a batch of examples on each loop iteration. |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
259 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
260 The minibatches iterator is expected to return upon each call to next() |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
261 a DataSetFields object, which is a LookupList (indexed by the field names) whose |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
262 elements are iterable over the minibatch examples, and which keeps a pointer to |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
263 a sub-dataset that can be used to iterate over the individual examples |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
264 in the minibatch. Hence a minibatch can be converted back to a regular |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
265 dataset or its fields can be looked at individually (and possibly iterated over). |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
266 |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
267 PARAMETERS |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
268 - fieldnames (list of any type, default None): |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
269 The loop variables i1, i2, i3 (in the example above) should contain the |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
270 f1, f2, and f3 fields of the current batch of examples. If None, the |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
271 derived class can choose a default, e.g. all fields. |
16 | 272 |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
273 - minibatch_size (integer, default 1) |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
274 On every iteration, the variables i1, i2, i3 will have |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
275 exactly minibatch_size elements, e.g. len(i1) == minibatch_size |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
276 |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
277 - n_batches (integer, default None) |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
278 The iterator will loop exactly this many times, and then stop. If None, |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
279 the derived class can choose a default. If (-1), then the returned |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
280 iterator should support looping indefinitely. |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
281 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
282 - offset (integer, default 0) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
283 The iterator will start at example 'offset' in the dataset, rather than at the first example (the default). |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
284 |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
285 Note: A list-like container is something like a tuple, list, numpy.ndarray or |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
286 any other object that supports integer indexing and slicing. |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
287 |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
288 """ |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
289 return DataSet.MinibatchWrapAroundIterator(self,fieldnames,minibatch_size,n_batches,offset) |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
290 |
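The three loop syntaxes described in the minibatches() docstring can be tried on a toy stand-in. Everything here is an assumption made for illustration: `ToyBatch` and `toy_minibatches` are hypothetical names, the data is a mapping from field name to a list, and no wrap-around is attempted. The sketch is written to run on Python 3.

```python
from collections import OrderedDict


class ToyBatch(object):
    """Hypothetical stand-in for Example/LookupList: index by position or by field name."""

    def __init__(self, names, values):
        self._fields = OrderedDict(zip(names, values))

    def __getitem__(self, key):
        if isinstance(key, int):
            return list(self._fields.values())[key]
        return self._fields[key]

    def __iter__(self):
        # Iterating yields one list-like container per field, which is
        # what makes "for i1, i2 in ..." unpacking work.
        return iter(self._fields.values())


def toy_minibatches(data, fieldnames=None, minibatch_size=1, n_batches=None, offset=0):
    """Yield ToyBatch objects over `data`, a mapping from field name to a list of values."""
    names = list(fieldnames) if fieldnames else list(data)
    length = len(data[names[0]])
    if n_batches is None:
        # Assumed default: as many full minibatches as fit without wrapping.
        n_batches = (length - offset) // minibatch_size
    for b in range(n_batches):
        lo = offset + b * minibatch_size
        yield ToyBatch(names, [data[name][lo:lo + minibatch_size] for name in names])


data = OrderedDict([("x", [1, 2, 3, 4]), ("y", [10, 20, 30, 40])])

# First syntax: fieldnames=None, every field is available in "i".
all_fields = [i["y"] for i in toy_minibatches(data, None, minibatch_size=2)]
# Second syntax: explicit field list, "i" indexed by position.
by_position = [i[0] for i in toy_minibatches(data, ["y", "x"], minibatch_size=2)]
# Third syntax: one loop variable per requested field.
unpacked = [(i1, i2) for i1, i2 in toy_minibatches(data, ["x", "y"], minibatch_size=2)]
```

In each case a loop variable holds a list-like container with minibatch_size entries, one per example in the current minibatch.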
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
291 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
292 """ |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
293 This is the minibatches iterator generator that sub-classes must define. |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
294 It does not need to worry about wrapping around multiple times across the dataset, |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
295 as this is handled by MinibatchWrapAroundIterator when DataSet.minibatches() is called. |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
296 The next() method of the returned iterator does not even need to worry about |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
297 the termination condition (as StopIteration will be raised by DataSet.minibatches |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
298 before an improper call to minibatches_nowrap's next() is made). |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
299 That next() method can assert that its next row will always be within [0,len(dataset)). |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
300 The iterator returned by minibatches_nowrap does not need to implement |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
301 a next_index() method either, as this will be provided by MinibatchWrapAroundIterator. |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
302 """ |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
303 raise AbstractFunction() |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
304 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
305 def __len__(self): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
306 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
307 len(dataset) returns the number of examples in the dataset. |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
308 By default, a DataSet is a 'stream', i.e. it has an unbounded length (sys.maxint). |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
309 Sub-classes which implement finite-length datasets should redefine this method. |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
310 Some methods only make sense for finite-length datasets. |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
311 """ |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
312 return maxint |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
313 |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
314 def is_unbounded(self): |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
315 """ |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
316 Tests whether a dataset is unbounded (e.g. a stream). |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
317 """ |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
318 return len(self)==maxint |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
319 |
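The "unbounded length as a sentinel" convention used by __len__ and is_unbounded can be sketched as follows. The class names are hypothetical, and the sketch assumes Python 3, where sys.maxsize plays the role that sys.maxint plays in the listing.

```python
import sys

# Assumption: sys.maxsize is used as the Python 3 counterpart of sys.maxint.
UNBOUNDED = sys.maxsize


class StreamDataSet(object):
    """Hypothetical base class: a stream reports the sentinel length."""

    def __len__(self):
        return UNBOUNDED

    def is_unbounded(self):
        return len(self) == UNBOUNDED


class FiniteDataSet(StreamDataSet):
    """Hypothetical finite subclass: it only needs to redefine __len__."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)


stream_is_unbounded = StreamDataSet().is_unbounded()
finite_is_unbounded = FiniteDataSet([1, 2, 3]).is_unbounded()
```

Because is_unbounded() is defined in terms of len(), a finite subclass gets the right answer without overriding it.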
26
672fe4b23032
Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents:
23
diff
changeset
|
320 def hasFields(self,*fieldnames): |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
321 """ |
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
322 Return true if the given field name (or field names, if multiple arguments are |
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
323 given) is recognized by the DataSet (i.e. can be used as a field name in one |
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
324 of the iterators). |
29
46c5c90019c2
Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents:
28
diff
changeset
|
325 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
326 The default implementation may be inefficient (O(# fields in dataset)), as it calls the fieldNames() |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
327 method. Many datasets may store their field names in a dictionary, which would allow more efficiency. |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
328 """ |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
329 return len(unique_elements_list_intersection(fieldnames,self.fieldNames()))>0 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
330 |
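The hasFields docstring notes that a dataset keeping its field names in a dictionary could answer more efficiently. A minimal sketch of that idea follows, with a hypothetical `FieldIndex` class and a set for constant-time membership tests. Note one assumption that differs from the listing: this sketch returns True only when every requested name is known, whereas the default implementation above returns True as soon as the intersection is non-empty.

```python
class FieldIndex(object):
    """Hypothetical holder of field names with constant-time membership tests."""

    def __init__(self, fieldnames):
        self._names = list(fieldnames)   # preserves the declared order
        self._lookup = set(self._names)  # fast "is this a field?" checks

    def fieldNames(self):
        return list(self._names)

    def hasFields(self, *fieldnames):
        # Assumption: all requested names must be recognized.
        return bool(fieldnames) and all(name in self._lookup for name in fieldnames)


index = FieldIndex(["input", "target", "weight"])
both_known = index.hasFields("input", "target")
one_unknown = index.hasFields("input", "missing")
```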
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
331 def fieldNames(self): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
332 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
333 Return the list of field names that are supported by the iterators, |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
334 and for which hasFields(fieldname) would return True. |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
335 """ |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
336 raise AbstractFunction() |
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
337 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
338 def __call__(self,*fieldnames): |
23
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
339 """ |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
340 Return a dataset that sees only the fields whose names are specified. |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
341 """ |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
342 assert self.hasFields(*fieldnames) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
343 return self.fields(*fieldnames).examples() |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
344 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
345 def fields(self,*fieldnames): |
29
46c5c90019c2
Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents:
28
diff
changeset
|
346 """ |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
347 Return a DataSetFields object associated with this dataset. |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
348 """ |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
349 return DataSetFields(self,*fieldnames) |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
350 |
2
3fddb1c8f955
Rewrote DataSet interface and created FiniteDataSet interface.
bengioy@bengiomac.local
parents:
1
diff
changeset
|
351 def __getitem__(self,i): |
28
541a273bc89f
Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents:
26
diff
changeset
|
352 """ |
541a273bc89f
Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents:
26
diff
changeset
|
353 dataset[i] returns the (i+1)-th example of the dataset. |
541a273bc89f
Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents:
26
diff
changeset
|
354 dataset[i:j] returns the subdataset with examples i,i+1,...,j-1. |
541a273bc89f
Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents:
26
diff
changeset
|
355 dataset[i:j:s] returns the subdataset with examples i,i+s,i+2s,... (all indices below j). |
541a273bc89f
Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents:
26
diff
changeset
|
356 dataset[[i1,i2,..,in]] returns the subdataset with examples i1,i2,...,in. |
41 | 357 dataset['key'] returns a property associated with the given 'key' string. |
358 If 'key' is a fieldname, then the VStacked field values (iterable over | |
359 field values) for that field are returned. Other keys may be supported | |
360 by different dataset subclasses. The following key names are encouraged: | |
361 - 'description': a textual description or name for the dataset | |
362 - '<fieldname>.type': a type name or value for a given <fieldname> | |
1
2cd82666b9a7
Added statscollector and started writing dataset and learner.
bengioy@esprit.iro.umontreal.ca
parents:
0
diff
changeset
|
363 |
39 | 364 Note that some stream datasets may be unable to implement random access, i.e. |
365 arbitrary slicing/indexing | |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
366 because they can only iterate through examples one at a time or one minibatch at a time |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
367 and do not actually store or keep past (or future) examples. |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
368 |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
369 The default implementation of getitem uses the minibatches iterator |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
370 to obtain one example, one slice, or a list of examples. It may not |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
371 always be the most efficient way to obtain the result, especially if |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
372 the data are actually stored in a memory array. |
28
541a273bc89f
Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents:
26
diff
changeset
|
373 """ |
41 | 374 # check for an index |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
375 if type(i) is int: |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
376 return DataSet.MinibatchToSingleExampleIterator( |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
377 self.minibatches(minibatch_size=1,n_batches=1,offset=i)).next() |
41 | 378 rows=None |
379 # or a slice | |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
380 if type(i) is slice: |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
381 # slice attributes are read-only, so normalize them into local variables |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
382 start,stop,step = i.indices(len(self)) |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
383 if step == 1: |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
384 return self.minibatches(minibatch_size=stop-start,n_batches=1,offset=start).next().examples() |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
385 rows = range(start,stop,step) |
41 | 386 # or a list of indices |
387 elif type(i) is list: | |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
388 rows = i |
41 | 389 if rows is not None: |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
390 examples = [self[row] for row in rows] |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
391 fields_values = zip(*examples) |
45
a5c70dc42972
Test functions for dataset.py
bengioy@grenat.iro.umontreal.ca
parents:
44
diff
changeset
|
392 return MinibatchDataSet( |
41 | 393 Example(self.fieldNames(),[ self.valuesVStack(fieldname,field_values) |
394 for fieldname,field_values | |
395 in zip(self.fieldNames(),fields_values)])) | |
396 # else check for a fieldname | |
397 if self.hasFields(i): | |
398 return self.minibatches(fieldnames=[i],minibatch_size=len(self),n_batches=1,offset=0).next()[0] | |
399 # else we are trying to access a property of the dataset | |
400 assert i in self.__dict__ # else it means we are trying to access a non-existing property | |
401 return self.__dict__[i] | |
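The slice branch of __getitem__ above needs concrete start, stop and step values. Python slice objects are immutable, so one common approach is to compute the defaults into local values with slice.indices(length). The following is a minimal sketch of that idea only; the helper name slice_to_rows is hypothetical and does not exist in dataset.py.

```python
def slice_to_rows(s, length):
    """Hypothetical helper: expand a slice into explicit row indices.

    slice.indices(length) fills in missing start/stop/step values and
    clips them to the given length, without modifying the slice itself.
    """
    start, stop, step = s.indices(length)
    return list(range(start, stop, step))


# Rows selected by [2::3] in a dataset of 10 examples.
rows = slice_to_rows(slice(2, None, 3), 10)
# A fully open slice [:] selects every row.
all_rows = slice_to_rows(slice(None, None, None), 4)
```

Note that this only applies to datasets with a known finite length; an unbounded dataset would need a different treatment of a missing stop.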
22
b6b36f65664f
Created virtual sub-classes of DataSet: {Finite{Length,Width},Sliceable}DataSet,
bengioy@esprit.iro.umontreal.ca
parents:
20
diff
changeset
|
402 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
403 def valuesHStack(self,fieldnames,fieldvalues): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
404 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
405 Return a value that corresponds to concatenating (horizontally) several field values. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
406 This can be useful to merge some fields. The implementation of this operation is likely |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
407 to involve a copy of the original values. When the values are numpy arrays, the |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
408 result should be numpy.hstack(values). If it makes sense, this operation should |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
409 work as well when each value corresponds to multiple examples in a minibatch |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
410 e.g. if each value is a Ni-vector and a minibatch of length L is a LxNi matrix, |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
411 then the result should be a Lx(N1+N2+..) matrix equal to numpy.hstack(values). |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
412 The default is to use numpy.hstack for numpy.ndarray values, and a list |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
413 pointing to the original values for other data types. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
414 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
415 all_numpy=True |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
416 for value in fieldvalues: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
417 if not type(value) is numpy.ndarray: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
418 all_numpy=False |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
419 if all_numpy: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
420 return numpy.hstack(fieldvalues) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
421 # the default implementation of horizontal stacking is to put values in a list |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
422 return fieldvalues |
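To illustrate the default behaviour described in the valuesHStack docstring, here is a small standalone sketch. The function name values_hstack and the sample arrays are assumptions made for the illustration; only numpy.hstack and the list fallback come from the code above.

```python
import numpy


def values_hstack(fieldvalues):
    """Standalone sketch of the default horizontal stacking rule."""
    # Same test as the method above: every value must be exactly an ndarray.
    if all(type(value) is numpy.ndarray for value in fieldvalues):
        return numpy.hstack(fieldvalues)
    # Otherwise fall back to a list that points to the original values.
    return fieldvalues


# Two fields of a minibatch of length L=4, with N1=2 and N2=3 columns.
field1 = numpy.zeros((4, 2))
field2 = numpy.ones((4, 3))
merged = values_hstack([field1, field2])      # an L x (N1+N2) matrix

# A non-array value disables stacking; the input list is returned as is.
mixed = values_hstack([field1, "label"])
```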
26
672fe4b23032
Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents:
23
diff
changeset
|
423 |
672fe4b23032
Fixed dataset errors so that _test_dataset.py works again.
bengioy@grenat.iro.umontreal.ca
parents:
23
diff
changeset
|
424 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
425 def valuesVStack(self,fieldname,values): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
426 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
427 Return a value that corresponds to concatenating (vertically) several values of the |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
428 same field. This can be important to build a minibatch out of individual examples. This |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
429 is likely to involve a copy of the original values. When the values are numpy arrays, the |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
430 result should be numpy.vstack(values). |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
431 The default is to use numpy.vstack for numpy.ndarray values, and a list |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
432 pointing to the original values for other data types. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
433 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
434 all_numpy=True |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
435 for value in values: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
436 if not type(value) is numpy.ndarray: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
437 all_numpy=False |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
438 if all_numpy: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
439 return numpy.vstack(values) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
440 # the default implementation of vertical stacking is to put values in a list |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
441 return values |
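The rows branch of __getitem__ uses valuesVStack to turn a list of individual examples into one minibatch, field by field. The sketch below mimics that flow under simplifying assumptions: examples are plain tuples, the result is a plain dict, and the names values_vstack, examples and fieldnames are hypothetical.

```python
import numpy


def values_vstack(values):
    """Standalone sketch of the default vertical stacking rule."""
    values = list(values)
    if all(type(value) is numpy.ndarray for value in values):
        return numpy.vstack(values)
    # Non-array values are simply kept in a list.
    return values


fieldnames = ["x", "y"]
# Two examples, each holding one value per field.
examples = [
    (numpy.array([0.0, 1.0]), "a"),
    (numpy.array([2.0, 3.0]), "b"),
]

# zip(*examples) transposes the examples into one column per field,
# and each column is then stacked into a minibatch value.
columns = zip(*examples)
minibatch = dict((name, values_vstack(column))
                 for name, column in zip(fieldnames, columns))
```

Here minibatch["x"] is a 2x2 array, while minibatch["y"] stays a list because its values are not numpy arrays.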
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
442 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
443 def __or__(self,other): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
444 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
445 dataset1 | dataset2 returns a dataset whose list of fields is the concatenation of the list of |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
446 fields of the argument datasets. This only works if they all have the same length. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
447 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
448 return HStackedDataSet(self,other) |
3
378b68d5c4ad
Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents:
2
diff
changeset
|
449 |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
450 def __and__(self,other): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
451 """ |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
452 dataset1 & dataset2 is a dataset that concatenates the examples from the argument datasets |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
453 (and whose length is the sum of the length of the argument datasets). This only |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
454 works if they all have the same fields. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
455 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
456 return VStackedDataSet(self,other) |
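The two operators above can be pictured with a toy container: | joins fields of datasets that have the same length, and & joins examples of datasets that have the same fields. The ToyTable class below is a hypothetical stand-in written for this illustration; it is not HStackedDataSet or VStackedDataSet and says nothing about how those classes are implemented.

```python
class ToyTable(object):
    """Hypothetical stand-in for a dataset: field name -> list of values."""

    def __init__(self, fields):
        self.fields = dict(fields)
        lengths = set(len(values) for values in self.fields.values())
        assert len(lengths) == 1, "all fields must have the same length"
        self.length = lengths.pop()

    def __or__(self, other):
        # Horizontal stacking: same number of examples, more fields.
        assert self.length == other.length
        merged = dict(self.fields)
        merged.update(other.fields)
        return ToyTable(merged)

    def __and__(self, other):
        # Vertical stacking: same fields, more examples.
        assert sorted(self.fields) == sorted(other.fields)
        return ToyTable((name, self.fields[name] + other.fields[name])
                        for name in self.fields)


inputs = ToyTable({"x": [1, 2]})
targets = ToyTable({"y": [3, 4]})
wide = inputs | targets                 # fields x and y, 2 examples
tall = inputs & ToyTable({"x": [5]})    # field x only, 3 examples
```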
23
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
457 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
458 def hstack(datasets): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
459 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
460 hstack([dataset1,dataset2,...]) returns dataset1 | dataset2 | ... |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
461 which is a dataset whose fields list is the concatenation of the fields |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
462 of the individual datasets. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
463 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
464 assert len(datasets)>0 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
465 if len(datasets)==1: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
466 return datasets[0] |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
467 return HStackedDataSet(datasets) |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
468 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
469 def vstack(datasets): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
470 """ |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
471 vstack([dataset1,dataset2,...]) returns dataset1 & dataset2 & ... |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
472 which is a dataset that iterates first over the examples of dataset1, then |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
473 over those of dataset2, etc. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
474 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
475 assert len(datasets)>0 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
476 if len(datasets)==1: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
477 return datasets[0] |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
478 return VStackedDataSet(datasets) |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
479 |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
480 class FieldsSubsetDataSet(DataSet): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
481 """ |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
482 A sub-class of DataSet that selects a subset of the fields. |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
483 """ |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
484 def __init__(self,src,fieldnames): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
485 self.src=src |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
486 self.fieldnames=fieldnames |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
487 assert src.hasFields(*fieldnames) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
488 self.valuesHStack = src.valuesHStack |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
489 self.valuesVStack = src.valuesVStack |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
490 |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
491 def __len__(self): return len(self.src) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
492 |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
493 def fieldNames(self): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
494 return self.fieldnames |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
495 |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
496 def __iter__(self): |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
497 class FieldsSubsetIterator(object): |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
498 def __init__(self,ds): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
499 self.ds=ds |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
500 self.src_iter=ds.src.__iter__() |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
501 self.example=None |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
502 def __iter__(self): return self |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
503 def next(self): |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
504 complete_example = self.src_iter.next() |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
505 if self.example: |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
506 self.example._values=[complete_example[field] |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
507 for field in self.ds.fieldnames] |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
508 else: |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
509 self.example=Example(self.ds.fieldnames, |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
510 [complete_example[field] for field in self.ds.fieldnames]) |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
511 return self.example |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
512 return FieldsSubsetIterator(self) |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
513 |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
514 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
515 assert self.hasFields(*fieldnames) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
516 return self.src.minibatches_nowrap(fieldnames,minibatch_size,n_batches,offset) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
517 def __getitem__(self,i): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
518 return FieldsSubsetDataSet(self.src[i],self.fieldnames) |
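One consequence of the FieldsSubsetIterator above is that it creates a single Example on the first call to next() and then overwrites its values on later calls, so every iteration returns the same object. The sketch below reproduces that pattern with plain dicts to show the effect. The class ReusingIterator and the sample rows are hypothetical and are not part of dataset.py.

```python
class ReusingIterator(object):
    """Hypothetical iterator that projects rows onto a subset of fields
    and reuses one result object, like FieldsSubsetIterator does."""

    def __init__(self, rows, fieldnames):
        self.rows = iter(rows)
        self.fieldnames = fieldnames
        self.example = None

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self.rows)
        values = [row[name] for name in self.fieldnames]
        if self.example is None:
            self.example = {"values": values}
        else:
            # Overwrite the values of the object returned earlier.
            self.example["values"] = values
        return self.example

    next = __next__  # Python 2 spelling of the same method


rows = [{"x": 1, "y": 2, "z": 3}, {"x": 4, "y": 5, "z": 6}]

# Keeping the yielded objects keeps the same object twice.
collected = list(ReusingIterator(rows, ["x", "z"]))

# Copying the values during iteration preserves each example.
copies = [list(example["values"])
          for example in ReusingIterator(rows, ["x", "z"])]
```

Callers that need to keep examples beyond one loop step would therefore copy the values, as in the second loop.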
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
519 |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
520 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
521 class DataSetFields(LookupList): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
522 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
523 Although a DataSet iterates over examples (like rows of a matrix), an associated |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
524 DataSetFields iterates over fields (like columns of a matrix), and can be understood |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
525 as a transpose of the associated dataset. |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
526 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
527 To iterate over fields, one can do |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
528 * for fields in dataset.fields() |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
529 * for fields in dataset(field1,field2,...).fields() to select a subset of fields |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
530 * for fields in dataset.fields(field1,field2,...) to select a subset of fields |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
531 and each of these fields is iterable over the examples: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
532 * for field_examples in dataset.fields(): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
533 for example_value in field_examples: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
534 ... |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
535 but when the dataset is a stream (unbounded length), it is not recommended to do |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
536 such things because the underlying dataset may refuse to access the different fields in |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
537 an unsynchronized way. Hence the fields() method is illegal for streams, by default. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
538 The result of fields() is a DataSetFields object, which iterates over fields, |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
539 and whose elements are iterable over examples. A DataSetFields object can |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
540 be turned back into a DataSet with its examples() method: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
541 dataset2 = dataset1.fields().examples() |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
542 and dataset2 should behave exactly like dataset1 (in fact by default dataset2==dataset1). |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
543 |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
544 DataSetFields can be concatenated vertically or horizontally. To be consistent with |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
545 the syntax used for DataSets, the | concatenates the fields and the & concatenates |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
546 the examples. |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
547 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
548 def __init__(self,dataset,*fieldnames): |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
549 if not fieldnames: |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
550 fieldnames=dataset.fieldNames() |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
551 elif fieldnames is not dataset.fieldNames(): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
552 dataset = FieldsSubsetDataSet(dataset,fieldnames) |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
553 assert dataset.hasFields(*fieldnames) |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
554 self.dataset=dataset |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
555 minibatch_iterator = dataset.minibatches(fieldnames, |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
556 minibatch_size=len(dataset), |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
557 n_batches=1) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
558 minibatch=minibatch_iterator.next() |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
559 LookupList.__init__(self,fieldnames,minibatch) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
560 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
561 def examples(self): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
562 return self.dataset |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
563 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
564 def __or__(self,other): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
565 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
566 fields1 | fields2 is a DataSetFields whose list of fields is the concatenation |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
567 of the fields of DataSetFields fields1 and fields2. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
568 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
569 return (self.examples() | other.examples()).fields() |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
570 |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
571 def __and__(self,other): |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
572 """ |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
573 fields1 & fields2 is a DataSetFields whose list of examples is the concatenation |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
574 of the list of examples of DataSetFields fields1 and fields2. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
575 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
576 return (self.examples() & other.examples()).fields() |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
577 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
578 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
579 class MinibatchDataSet(DataSet): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
580 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
581 Turn a LookupList of same-length fields into an example-iterable dataset. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
582 Each element of the lookup-list should be iterable and sliceable, and all elements must have the same length. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
583 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
584 def __init__(self,fields_lookuplist,values_vstack=DataSet().valuesVStack, |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
585 values_hstack=DataSet().valuesHStack): |
17
759d17112b23
more comments, looping ArrayDataSet iterator, bugfixes to lookup_list, more tests
bergstrj@iro.umontreal.ca
diff
changeset
|
586 """ |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
587 The user can (and generally should) also provide values_vstack(fieldname,fieldvalues) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
588 and values_hstack(fieldnames,fieldvalues) functions that behave with the same |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
589 semantics as the DataSet methods of the same name (but without the self argument). |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
590 """ |
61
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
591 self._fields=fields_lookuplist |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
592 assert len(fields_lookuplist)>0 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
593 self.length=len(fields_lookuplist[0]) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
594 for field in fields_lookuplist[1:]: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
595 assert self.length==len(field) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
596 self.values_vstack=values_vstack |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
597 self.values_hstack=values_hstack |
3
378b68d5c4ad
Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents:
2
diff
changeset
|
598 |
378b68d5c4ad
Added first (untested) version of ArrayDataSet
bengioy@bengiomac.local
parents:
2
diff
changeset
|
599 def __len__(self): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
600 return self.length |
28
541a273bc89f
Removed __array__ method from dataset, whose
bengioy@grenat.iro.umontreal.ca
parents:
26
diff
changeset
|
601 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
602 def __getitem__(self,i): |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
603 if type(i) in (int,slice,list): |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
604 return DataSetFields(MinibatchDataSet( |
61
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
605 Example(self._fields.keys(),[field[i] for field in self._fields])),self._fields) |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
606 if self.hasFields(i): |
61
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
607 return self._fields[i] |
55
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
608 assert i in self.__dict__ # else it means we are trying to access a non-existing property |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
609 return self.__dict__[i] |
11
be128b9127c8
Debugged (to the extent of my tests) the new version of dataset
bengioy@esprit.iro.umontreal.ca
parents:
9
diff
changeset
|
610 |
29
46c5c90019c2
Changed apply_function so that it propagates methods of the source.
bengioy@grenat.iro.umontreal.ca
parents:
28
diff
changeset
|
611 def fieldNames(self): |
61
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
612 return self._fields.keys() |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
613 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
614 def hasFields(self,*fieldnames): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
615 for fieldname in fieldnames: |
61
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
616 if fieldname not in self._fields.keys(): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
617 return False |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
618 return True |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
619 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
620 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
621 class Iterator(object): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
622 def __init__(self,ds): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
623 self.ds=ds |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
624 self.next_example=offset |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
625 assert minibatch_size > 0 |
41 | 626 if offset+minibatch_size > ds.length: |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
627 raise NotImplementedError() |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
628 def __iter__(self): |
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
629 return self |
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
630 def next(self): |
61
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
631 upper = self.next_example+minibatch_size |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
632 assert upper<=self.ds.length |
61
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
633 minibatch = Example(self.ds._fields.keys(), |
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
634 [field[self.next_example:upper] |
a8b70a9117ad
bugfix: in MinibatchDataSet renamed the class variable fields to _fields as parent class have a function called field.
Frederic Bastien <bastienf@iro.umontreal.ca>
parents:
60
diff
changeset
|
635 for field in self.ds._fields]) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
636 self.next_example+=minibatch_size |
62 | 637 return DataSetFields(MinibatchDataSet(minibatch),*fieldnames) |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
638 |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
639 return Iterator(self) |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
640 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
641 def valuesVStack(self,fieldname,fieldvalues): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
642 return self.values_vstack(fieldname,fieldvalues) |
20
266c68cb6136
Minor editions, plus adding untested ApplyFunctionDataset for GradientLearner in the works.
bengioy@bengiomac.local
parents:
19
diff
changeset
|
643 |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
644 def valuesHStack(self,fieldnames,fieldvalues): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
645 return self.values_hstack(fieldnames,fieldvalues) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
646 |
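A minimal, self-contained sketch of the no-wraparound minibatch slicing done by MinibatchDataSet above. It assumes the fields are held in a plain dict of equal-length sequences rather than a LookupList, and the standalone function and its `ValueError` are illustrative choices, not part of dataset.py. Unlike the iterator above, it checks the whole request up front instead of asserting on each batch.

```python
def minibatches_nowrap(fields, minibatch_size, n_batches, offset=0):
    """Yield n_batches dicts, each holding one slice per field, without wrapping."""
    assert minibatch_size > 0
    lengths = set(len(values) for values in fields.values())
    assert len(lengths) == 1, "all fields must have the same length"
    length = lengths.pop()
    # Refuse requests that would run past the end (no wraparound).
    if offset + minibatch_size * n_batches > length:
        raise ValueError("requested minibatches exceed the dataset length")
    for b in range(n_batches):
        start = offset + b * minibatch_size
        # Slice every field over the same row range.
        yield dict((name, values[start:start + minibatch_size])
                   for name, values in fields.items())


# Hypothetical data: two fields of six examples each.
fields = {"x": [0, 1, 2, 3, 4, 5], "y": ["a", "b", "c", "d", "e", "f"]}
batches = list(minibatches_nowrap(fields, minibatch_size=2, n_batches=2, offset=1))
```

Each yielded dict plays the role of the `minibatch` Example built in `next()`: one slice per field, all covering the same rows.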
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
647 class HStackedDataSet(DataSet): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
648 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
649 A DataSet that wraps several datasets and shows a view that includes all their fields, |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
650 i.e. whose list of fields is the concatenation of their lists of fields. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
651 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
652 If a field name is found in more than one of the datasets, then either an error is |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
653 raised or the fields are renamed (either by prefixing the __name__ attribute |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
654 of the dataset + ".", if it exists, or by suffixing the dataset index in the argument list). |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
655 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
656 TODO: automatically detect a chain of stacked datasets due to A | B | C | D ... |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
657 """ |
41 | 658 def __init__(self,datasets,accept_nonunique_names=False,description=None,field_types=None): |
659 DataSet.__init__(self,description,field_types) | |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
660 self.datasets=datasets |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
661 self.accept_nonunique_names=accept_nonunique_names |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
662 self.fieldname2dataset={} |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
663 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
664 def rename_field(fieldname,dataset,i): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
665 if hasattr(dataset,"__name__"): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
666 return dataset.__name__ + "." + fieldname |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
667 return fieldname+"."+str(i) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
668 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
669 # make sure all datasets have the same length and unique field names |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
670 self.length=None |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
671 names_to_change=[] |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
672 for i in xrange(len(datasets)): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
673 dataset = datasets[i] |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
674 length=len(dataset) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
675 if self.length: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
676 assert self.length==length |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
677 else: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
678 self.length=length |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
679 for fieldname in dataset.fieldNames(): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
680 if fieldname in self.fieldname2dataset: # name conflict! |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
681 if accept_nonunique_names: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
682 fieldname=rename_field(fieldname,dataset,i) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
683 names_to_change.append((fieldname,i)) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
684 else: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
685 raise ValueError("Incompatible datasets: non-unique field name = "+fieldname) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
686 self.fieldname2dataset[fieldname]=i |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
687 for fieldname,i in names_to_change: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
688 del self.fieldname2dataset[fieldname] |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
689 self.fieldname2dataset[rename_field(fieldname,self.datasets[i],i)]=i |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
690 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
691 def hasFields(self,*fieldnames): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
692 for fieldname in fieldnames: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
693 if not fieldname in self.fieldname2dataset: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
694 return False |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
695 return True |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
696 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
697 def fieldNames(self): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
698 return self.fieldname2dataset.keys() |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
699 |
41 | 700 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
701 |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
702 class HStackedIterator(object): |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
703 def __init__(self,hsds,iterators): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
704 self.hsds=hsds |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
705 self.iterators=iterators |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
706 def __iter__(self): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
707 return self |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
708 def next(self): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
709 # concatenate all the fields of the minibatches |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
710 minibatch = reduce(LookupList.__add__,[iterator.next() for iterator in self.iterators]) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
711 # and return a DataSetFields whose dataset is the transpose (=examples()) of this minibatch |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
712 return DataSetFields(MinibatchDataSet(minibatch,self.hsds.valuesVStack, |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
713 self.hsds.valuesHStack), |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
714 fieldnames if fieldnames else self.hsds.fieldNames()) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
715 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
716 assert not fieldnames or self.hasFields(*fieldnames) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
717 # find out which underlying datasets are necessary to service the required fields |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
718 # and construct corresponding minibatch iterators |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
719 if fieldnames: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
720 datasets=set([]) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
721 fields_in_dataset=dict([(dataset,[]) for dataset in self.datasets]) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
722 for fieldname in fieldnames: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
723 dataset=self.datasets[self.fieldname2dataset[fieldname]] |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
724 datasets.add(dataset) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
725 fields_in_dataset[dataset].append(fieldname) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
726 datasets=list(datasets) |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
727 iterators=[dataset.minibatches(fields_in_dataset[dataset],minibatch_size,n_batches,offset) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
728 for dataset in datasets] |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
729 else: |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
730 datasets=self.datasets |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
731 iterators=[dataset.minibatches(None,minibatch_size,n_batches,offset) for dataset in datasets] |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
732 return HStackedIterator(self,iterators) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
733 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
734 |
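The grouping step in minibatches_nowrap above (find which underlying datasets are needed and which fields each must supply) can be sketched in isolation. This is a simplified illustration, assuming a plain dict from field name to dataset index like `fieldname2dataset`; the helper name `group_fields_by_dataset` is hypothetical and does not exist in dataset.py.

```python
def group_fields_by_dataset(fieldnames, fieldname2dataset):
    """Map each needed dataset index to the requested field names it owns."""
    fields_in_dataset = {}
    for fieldname in fieldnames:
        if fieldname not in fieldname2dataset:
            raise KeyError("unknown field: " + fieldname)
        # Create the per-dataset list on first use, then append in request order.
        fields_in_dataset.setdefault(fieldname2dataset[fieldname], []).append(fieldname)
    return fields_in_dataset


# Hypothetical mapping: three stacked datasets, four fields in total.
fieldname2dataset = {"x": 0, "y": 0, "target": 1, "weight": 2}
groups = group_fields_by_dataset(["y", "target", "x"], fieldname2dataset)
```

Only datasets that own at least one requested field appear in the result, so no iterator needs to be built for dataset 2 in this example.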
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
735 def valuesVStack(self,fieldname,fieldvalues): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
736 return self.datasets[self.fieldname2dataset[fieldname]].valuesVStack(fieldname,fieldvalues) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
737 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
738 def valuesHStack(self,fieldnames,fieldvalues): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
739 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
740 We will use the sub-dataset associated with the first fieldname in the fieldnames list |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
741 to do the work, hoping that it can cope with the other values (i.e. won't care |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
742 about the incompatible fieldnames). Hence this heuristic will always work if |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
743 all the fieldnames are of the same sub-dataset. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
744 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
745 return self.datasets[self.fieldname2dataset[fieldnames[0]]].valuesHStack(fieldnames,fieldvalues) |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
746 |
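The name-conflict rule described in the HStackedDataSet docstring can be sketched as follows. This is a simplified illustration under stated assumptions: field names are given as plain lists, optional dataset names are passed in a hypothetical `names` argument instead of being read from a `__name__` attribute, and only the later occurrence of a repeated name is renamed. The function `merge_field_names` is not part of dataset.py.

```python
def merge_field_names(field_lists, names=None, accept_nonunique_names=False):
    """Return {merged_field_name: dataset_index}, renaming or rejecting repeats."""
    names = names or [None] * len(field_lists)
    merged = {}
    for i, fieldnames in enumerate(field_lists):
        for fieldname in fieldnames:
            if fieldname in merged:  # name conflict
                if not accept_nonunique_names:
                    raise ValueError(
                        "Incompatible datasets: non-unique field name = " + fieldname)
                if names[i] is not None:
                    # Prefix with the dataset name when one is available.
                    fieldname = names[i] + "." + fieldname
                else:
                    # Otherwise suffix the dataset index.
                    fieldname = fieldname + "." + str(i)
            merged[fieldname] = i
    return merged


# Hypothetical field lists: "y" appears in both datasets.
merged = merge_field_names([["x", "y"], ["y", "z"]], accept_nonunique_names=True)
named = merge_field_names([["x", "y"], ["y", "z"]], names=[None, "test"],
                          accept_nonunique_names=True)
```

With `accept_nonunique_names=False` (the default, as in the constructor above) the same input raises `ValueError` instead of renaming.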
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
747 class VStackedDataSet(DataSet): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
748 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
749 A DataSet that wraps several datasets and shows a view that includes all their examples, |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
750 in the order provided. This clearly assumes that they all have the same field names |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
751 and all (except possibly the last one) are of finite length. |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
752 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
753 TODO: automatically detect a chain of stacked datasets due to A + B + C + D ... |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
754 """ |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
755 def __init__(self,datasets): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
756 self.datasets=datasets |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
757 self.length=0 |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
758 self.index2dataset={} |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
759 assert len(datasets)>0 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
760 fieldnames = datasets[-1].fieldNames() |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
761 self.datasets_start_row=[] |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
762 # We use this map from row index to dataset index for constant-time random access of examples, |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
763 # to avoid having to search for the appropriate dataset each time a slice is asked for. |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
764 for k,dataset in enumerate(datasets[0:-1]): |
48
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
765 assert not dataset.is_unbounded() # All VStacked datasets (except possibly the last) must be bounded (have a length). |
b6730f9a336d
Fixing MinibatchDataSet getitem
bengioy@grenat.iro.umontreal.ca
parents:
46
diff
changeset
|
766 L=len(dataset) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
767 for i in xrange(L): |
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
768 self.index2dataset[self.length+i]=k |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
769 self.datasets_start_row.append(self.length) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
770 self.length+=L |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
771 assert dataset.fieldNames()==fieldnames |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
772 self.datasets_start_row.append(self.length) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
773 self.length+=len(datasets[-1]) |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
774 # If length is very large, we should use a more memory-efficient mechanism |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
775 # that does not store all indices |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
776 if self.length>1000000: |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
777 # 1 million entries would require about 60 meg for the index2dataset map |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
778 # TODO |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
779 print "A more efficient mechanism for index2dataset should be implemented" |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
780 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
781 def __len__(self): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
782 return self.length |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
783 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
784 def fieldNames(self): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
785 return self.datasets[0].fieldNames() |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
786 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
787 def hasFields(self,*fieldnames): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
788 return self.datasets[0].hasFields(*fieldnames) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
789 |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
790 def locate_row(self,row): |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
791 """Return (dataset_index, row_within_dataset) for global row number""" |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
792 dataset_index = self.index2dataset[row] |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
793 row_within_dataset = row - self.datasets_start_row[dataset_index] |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
794 return dataset_index, row_within_dataset |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
795 |
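The comments in `__init__` above note that a more memory-efficient mechanism than the per-row `index2dataset` map should eventually be implemented. One possible approach, sketched here as an assumption rather than the author's planned design, is a binary search over the list of start rows (the role played by `datasets_start_row`) using the standard `bisect` module. Lookup becomes O(log n) in the number of datasets and no per-row storage is needed.

```python
from bisect import bisect_right


def locate_row(datasets_start_row, row):
    """Return (dataset_index, row_within_dataset) for a global row number."""
    if row < 0:
        raise IndexError("row must be non-negative")
    # Index of the last dataset whose first global row is <= row.
    dataset_index = bisect_right(datasets_start_row, row) - 1
    return dataset_index, row - datasets_start_row[dataset_index]


# Hypothetical layout: three datasets starting at global rows 0, 3 and 5.
starts = [0, 3, 5]
first = locate_row(starts, 0)
middle = locate_row(starts, 4)
last = locate_row(starts, 5)
```

No upper bound is checked, which matches the case where the last stacked dataset has no known length.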
41 | 796 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
797 |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
798 class VStackedIterator(object): |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
799 def __init__(self,vsds): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
800 self.vsds=vsds |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
801 self.next_row=offset |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
802 self.next_dataset_index,self.next_dataset_row=self.vsds.locate_row(offset) |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
803 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \ |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
804 self.next_iterator(vsds.datasets[0],offset,n_batches) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
805 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
806 def next_iterator(self,dataset,starting_offset,batches_left): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
807 L=len(dataset) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
808 ds_nbatches = (L-starting_offset)/minibatch_size |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
809 if batches_left is not None: |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
810 ds_nbatches = min(batches_left,ds_nbatches) |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
811 if minibatch_size>L: |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
812 ds_minibatch_size=L |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
813 n_left_in_mb=minibatch_size-L |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
814 ds_nbatches=1 |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
815 else: |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
816 n_left_in_mb=0 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
817 return dataset.minibatches(fieldnames,minibatch_size,ds_nbatches,starting_offset), \ |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
818 L-(starting_offset+ds_nbatches*minibatch_size), n_left_in_mb |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
819 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
820 def move_to_next_dataset(self): |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
821 if self.n_left_at_the_end_of_ds>0: |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
822 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \ |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
823 self.next_iterator(self.vsds.datasets[self.next_dataset_index], |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
824 self.n_left_at_the_end_of_ds,1) |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
825 else: |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
826 self.next_dataset_index +=1 |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
827 if self.next_dataset_index==len(self.vsds.datasets): |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
828 self.next_dataset_index = 0 |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
829 self.current_iterator,self.n_left_at_the_end_of_ds,self.n_left_in_mb= \ |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
830 self.next_iterator(self.vsds.datasets[self.next_dataset_index],0,n_batches) |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
831 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
832 def __iter__(self): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
833 return self |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
834 |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
835 def next(self): |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
836 dataset=self.vsds.datasets[self.next_dataset_index] |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
837 mb = self.current_iterator.next() |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
838 if self.n_left_in_mb: |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
839 extra_mb = [] |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
840 while self.n_left_in_mb>0: |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
841 self.move_to_next_dataset() |
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
842 extra_mb.append(self.current_iterator.next()) |
40
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
843 examples = Example(fieldnames, |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
844 [dataset.valuesVStack(name, |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
845 [mb[name]]+[b[name] for b in extra_mb]) |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
846 for name in fieldnames]) |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
847 mb = DataSetFields(MinibatchDataSet(examples),fieldnames) |
88fd1cce08b9
replaced infinity for length by raise UnboundedDataSet and use & instead of + to concatenate datasets
bengioy@esprit.iro.umontreal.ca
parents:
39
diff
changeset
|
848 |
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
849 self.next_row+=minibatch_size |
38
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
850 self.next_dataset_row+=minibatch_size |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
851 if self.next_row+minibatch_size>len(dataset): |
d637ad8f7352
Finished first untested version of VStackedDataset
bengioy@esprit.iro.umontreal.ca
parents:
37
diff
changeset
|
852 self.move_to_next_dataset() |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
853 return examples |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
854 return VStackedIterator(self) |
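The idea behind VStackedIterator, completing a minibatch with rows from the next dataset when the current one runs out, can be sketched on plain numpy arrays. This is a simplified illustration in Python 3, not the class above: `vstacked_minibatches_sketch` is a hypothetical name, the datasets are assumed to be 2-D arrays with the same number of columns, there is no wrap-around, and a trailing incomplete minibatch is dropped.

```python
import numpy

def vstacked_minibatches_sketch(arrays, minibatch_size):
    """Yield blocks of minibatch_size rows taken across a list of arrays."""
    pending = []    # pieces of the minibatch being assembled
    n_pending = 0   # number of rows collected so far
    for array in arrays:
        start = 0
        while start < len(array):
            # Take what is still missing, or whatever is left in this array.
            take = min(minibatch_size - n_pending, len(array) - start)
            pending.append(array[start:start + take])
            n_pending += take
            start += take
            if n_pending == minibatch_size:
                yield numpy.vstack(pending)
                pending, n_pending = [], 0

a = numpy.arange(6).reshape(3, 2)       # 3 examples
b = numpy.arange(6, 16).reshape(5, 2)   # 5 examples
batches = list(vstacked_minibatches_sketch([a, b], 4))
```

Here the first minibatch holds all three rows of `a` plus the first row of `b`, stacked with numpy.vstack in the same way valuesVStack does for array-valued fields.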
37
73c4212ba5b3
Factored the minibatch-writing code into an iterator class inside DataSet
bengioy@esprit.iro.umontreal.ca
parents:
36
diff
changeset
|
855 |
41 | 856 class ArrayFieldsDataSet(DataSet): |
857 """ | |
858 Virtual super-class of datasets whose field values are numpy arrays, | |
859 thus defining valuesHStack and valuesVStack for sub-classes. | |
860 """ | |
861 def __init__(self,description=None,field_types=None): | |
862 DataSet.__init__(self,description,field_types) | |
863 def valuesHStack(self,fieldnames,fieldvalues): | |
864 """Concatenate field values horizontally, e.g. two vectors | |
865 become a longer vector, two matrices become a wider matrix, etc.""" | |
866 return numpy.hstack(fieldvalues) | |
867 def valuesVStack(self,fieldname,values): | |
868 """Concatenate field values vertically, e.g. two vectors | |
869 become a two-row matrix, two matrices become a longer matrix, etc.""" | |
870 return numpy.vstack(values) | |
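The shapes produced by the two stacking helpers can be checked directly with numpy. The arrays below are made-up illustration data, not values from the file.

```python
import numpy

a = numpy.array([1.0, 2.0])
b = numpy.array([3.0, 4.0])

# Horizontal stacking: two vectors become one longer vector.
longer_vector = numpy.hstack([a, b])
# Vertical stacking: two vectors become a two-row matrix.
two_row_matrix = numpy.vstack([a, b])

m1 = numpy.zeros((2, 3))
m2 = numpy.ones((2, 2))
m3 = numpy.ones((4, 3))

# Two matrices with the same number of rows become a wider matrix.
wider = numpy.hstack([m1, m2])
# Two matrices with the same number of columns become a longer matrix.
taller = numpy.vstack([m1, m3])
```

Horizontal stacking requires matching row counts and vertical stacking requires matching column counts; numpy raises ValueError otherwise.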
871 | |
872 class ArrayDataSet(ArrayFieldsDataSet): | |
873 """ | |
874 An ArrayDataSet stores the fields as groups of columns in a numpy tensor, | |
875 whose first axis iterates over examples, second axis determines fields. | |
876 If the underlying array is N-dimensional (has N axes), then the field | |
877 values are (N-2)-dimensional objects (i.e. ordinary numbers if N=2). | |
878 """ | |
879 | |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
880 def __init__(self, data_array, fields_columns): |
55
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
881 """ |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
882 Construct an ArrayDataSet from the underlying numpy array (data) and |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
883 a map (fields_columns) from fieldnames to field columns. The columns of a field are specified |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
884 using the standard arguments for indexing/slicing: integer for a column index, |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
885 slice for an interval of columns (with possible stride), or iterable of column indices. |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
886 """ |
41 | 887 self.data=data_array |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
888 self.fields_columns=fields_columns |
41 | 889 |
890 # check consistency and complete slices definitions | |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
891 for fieldname, fieldcolumns in self.fields_columns.items(): |
41 | 892 if type(fieldcolumns) is int: |
893 assert fieldcolumns>=0 and fieldcolumns<data_array.shape[1] | |
894 elif type(fieldcolumns) is slice: | |
895 start,step=fieldcolumns.start,fieldcolumns.step | |
896 if not start: | |
897 start=0 | |
898 if not step: | |
899 step=1 | |
900 if start!=fieldcolumns.start or step!=fieldcolumns.step: | |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
901 self.fields_columns[fieldname]=slice(start,fieldcolumns.stop,step) |
41 | 902 elif hasattr(fieldcolumns,"__iter__"): # something like a list |
903 for i in fieldcolumns: | |
904 assert i>=0 and i<data_array.shape[1] | |
905 | |
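The three kinds of column specification accepted by the constructor (integer, slice, iterable of indices) map directly onto numpy indexing of the second axis. The sketch below uses made-up data and hypothetical helper names (`complete_slice`, `field_values`); it illustrates the convention and is not the class itself.

```python
import numpy

def complete_slice(s):
    """Fill in a missing start (0) and step (1), keeping explicit values."""
    return slice(s.start or 0, s.stop, s.step or 1)

def field_values(data, fields_columns, name):
    """Return the columns of one field for every example (first axis)."""
    return data[:, fields_columns[name]]

data = numpy.arange(12.0).reshape(3, 4)   # 3 examples, 4 columns
fields_columns = {
    "x": complete_slice(slice(None, 2)),  # interval of columns 0..1
    "y": 2,                               # single column index
    "z": [1, 3],                          # explicit list of columns
}

x = field_values(data, fields_columns, "x")
y = field_values(data, fields_columns, "y")
z = field_values(data, fields_columns, "z")
```

An integer specification drops the column axis, so `y` has one number per example, while slices and lists keep a two-dimensional result.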
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
906 def fieldNames(self): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
907 return self.fields_columns.keys() |
41 | 908 |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
909 def __len__(self): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
910 return len(self.data) |
41 | 911 |
55
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
912 def __getitem__(self,i): |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
913 """More efficient implementation than the default __getitem__""" |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
914 fieldnames=self.fields_columns.keys() |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
915 if type(i) is int: |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
916 return Example(fieldnames, |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
917 [self.data[i,self.fields_columns[f]] for f in fieldnames]) |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
918 if type(i) in (slice,list): |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
919 return MinibatchDataSet(Example(fieldnames, |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
920 [self.data[i,self.fields_columns[f]] for f in fieldnames])) |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
921 # else check for a fieldname |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
922 if self.hasFields(i): |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
923 return Example([i],[self.data[:,self.fields_columns[i]]]) |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
924 # else we are trying to access a property of the dataset |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
925 assert i in self.__dict__ # else it means we are trying to access a non-existing property |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
926 return self.__dict__[i] |
66619ce44497
Efficient implementation of getitem for ArrayDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
48
diff
changeset
|
927 |
41 | 928 |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
929 def minibatches_nowrap(self,fieldnames,minibatch_size,n_batches,offset): |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
930 class ArrayDataSetIterator(object): |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
931 def __init__(self,dataset,fieldnames,minibatch_size,n_batches,offset): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
932 if fieldnames is None: fieldnames = dataset.fieldNames() |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
933 # store the resulting minibatch in a lookup-list of values |
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
934 self.minibatch = LookupList(fieldnames,[0]*len(fieldnames)) |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
935 self.dataset=dataset |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
936 self.minibatch_size=minibatch_size |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
937 assert offset>=0 and offset<len(dataset.data) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
938 assert offset+minibatch_size<=len(dataset.data) |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
939 self.current=offset |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
940 def __iter__(self): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
941 return self |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
942 def next(self): |
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
943 sub_data = self.dataset.data[self.current:self.current+self.minibatch_size] |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
944 self.minibatch._values = [sub_data[:,self.dataset.fields_columns[f]] for f in self.minibatch._names] |
43
e92244f30116
Corrected iterator logic errors
bengioy@grenat.iro.umontreal.ca
parents:
42
diff
changeset
|
945 self.current+=self.minibatch_size |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
946 return self.minibatch |
42
9b68774fcc6b
Testing basic functionality and removing obvious bugs
bengioy@grenat.iro.umontreal.ca
parents:
41
diff
changeset
|
947 |
44
5a85fda9b19b
Fixed some more iterator bugs
bengioy@grenat.iro.umontreal.ca
parents:
43
diff
changeset
|
948 return ArrayDataSetIterator(self,fieldnames,minibatch_size,n_batches,offset) |
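A non-wrapping minibatch iterator over an array can also be written as a generator that stops after n_batches minibatches or when the data runs out. This is a Python 3 sketch under stated assumptions, not the iterator above: `minibatches_sketch` is a hypothetical name, and plain dicts stand in for LookupList.

```python
import numpy

def minibatches_sketch(data, fields_columns, minibatch_size, n_batches, offset=0):
    """Yield at most n_batches minibatches as {fieldname: values} dicts."""
    assert 0 <= offset and offset + minibatch_size <= len(data)
    current = offset
    for _ in range(n_batches):
        if current + minibatch_size > len(data):
            break  # no wrap-around: stop instead of returning a short batch
        sub_data = data[current:current + minibatch_size]
        yield {name: sub_data[:, cols] for name, cols in fields_columns.items()}
        current += minibatch_size

data = numpy.arange(20).reshape(10, 2)   # 10 examples, 2 columns
fields = {"x": 0, "y": 1}
batches = list(minibatches_sketch(data, fields, minibatch_size=4, n_batches=5))
```

With 10 examples and a minibatch size of 4, only two full minibatches exist, so the generator ends early even though five were requested.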
57
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
949 |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
950 |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
951 class CachedDataSet(DataSet): |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
952 """ |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
953 Wrap a dataset whose values are computationally expensive to obtain |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
954 (e.g. because they involve some computation, or disk access), |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
955 so that repeated accesses to the same example are done cheaply, |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
956 by caching every example value that has been accessed at least once. |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
957 |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
958 Optionally, for a finite-length dataset, all the values can be computed |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
959 (and cached) upon construction of the CachedDataSet, rather than at the |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
960 first access. |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
961 """ |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
962 |
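Since CachedDataSet is only documented here, the behaviour its docstring describes can be sketched independently of the DataSet interface. The classes below are hypothetical illustrations: `CachedExamplesSketch` caches by integer index only, and `CountingSource` is a stand-in for an expensive dataset.

```python
class CachedExamplesSketch:
    """Cache every example of `source` the first time it is accessed."""

    def __init__(self, source, precompute=False):
        self.source = source
        self.cache = {}
        if precompute:
            # Only meaningful for finite-length sources.
            for i in range(len(source)):
                self[i]

    def __len__(self):
        return len(self.source)

    def __getitem__(self, i):
        if i not in self.cache:
            self.cache[i] = self.source[i]
        return self.cache[i]

class CountingSource:
    """Stand-in for an expensive dataset; counts how often it is read."""

    def __init__(self, n):
        self.n = n
        self.calls = 0

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        self.calls += 1
        return i * i

lazy_source = CountingSource(5)
lazy = CachedExamplesSketch(lazy_source)
first, second = lazy[2], lazy[2]      # the source is read only once

eager_source = CountingSource(5)
eager = CachedExamplesSketch(eager_source, precompute=True)
```

The `precompute` flag corresponds to the optional mode in which all values are computed at construction time rather than at first access.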
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
963 class ApplyFunctionDataSet(DataSet): |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
964 """ |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
965 A dataset that contains as fields the results of applying a given function |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
966 example-wise or minibatch-wise to all the fields of an input dataset. |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
967 The output of the function should be an iterable (e.g. a list or a LookupList) |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
968 over the resulting values. In minibatch mode, the function is expected |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
969 to work on minibatches (takes a minibatch in input and returns a minibatch |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
970 in output). |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
971 |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
972 The function is applied each time an example or a minibatch is accessed. |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
973 To avoid re-doing computation, wrap this dataset inside a CachedDataSet. |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
974 """ |
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
975 |
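The example-wise mode described in the ApplyFunctionDataSet docstring can be sketched with a generator. The names `apply_examplewise` and `sum_and_product` are hypothetical, and examples are assumed to be plain tuples of field values rather than DataSet examples.

```python
def apply_examplewise(examples, function, output_names):
    """Lazily yield one {name: value} dict per input example.

    `function` receives the fields of one example as positional arguments
    and must return an iterable with one value per output name.
    """
    for example in examples:
        yield dict(zip(output_names, function(*example)))

def sum_and_product(x, y):
    return [x + y, x * y]

rows = [(1, 2), (3, 4)]
results = list(apply_examplewise(rows, sum_and_product, ["sum", "prod"]))
```

Because the generator recomputes the function on every pass, iterating twice costs twice as much, which is the situation the docstring suggests handling with a caching wrapper.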
1aabd2e2bb5f
Added empty classes with doc: CachedDataSet and ApplyFunctionDataSet
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
56
diff
changeset
|
976 |
23
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
977 def supervised_learning_dataset(src_dataset,input_fields,target_fields,weight_field=None): |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
978 """ |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
979 Wraps an arbitrary DataSet into one for supervised learning tasks by forcing the |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
980 user to define a set of fields as the 'input' field and a set of fields |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
981 as the 'target' field. Optionally, a single weight_field can also be defined. |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
982 """ |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
983 args = ((input_fields,'input'),(target_fields,'target')) |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
984 if weight_field: args+=(([weight_field],'weight'),) |
36
438440ba0627
Rewriting dataset.py completely
bengioy@zircon.iro.umontreal.ca
parents:
29
diff
changeset
|
985 return src_dataset.merge_fields(*args) |
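The argument tuple built by supervised_learning_dataset can be examined on its own. The helper below is a hypothetical sketch that only constructs the (fields, new_name) pairs; it does not call merge_fields, whose behaviour is not shown in this part of the file.

```python
def supervised_args_sketch(input_fields, target_fields, weight_field=None):
    """Build the (fields, merged_name) pairs for a supervised dataset."""
    args = ((input_fields, 'input'), (target_fields, 'target'))
    if weight_field:
        # The trailing comma makes this a one-element tuple, so exactly
        # one (fields, name) pair is appended.
        args += (([weight_field], 'weight'),)
    return args

plain = supervised_args_sketch(['x1', 'x2'], ['y'])
weighted = supervised_args_sketch(['x1', 'x2'], ['y'], weight_field='w')
```

Without the trailing comma, the list and the string 'weight' would be appended as two separate elements instead of one pair.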
23
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
986 |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
987 |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
988 |
526e192b0699
Working on ApplyFunctionDataSet, added constraint that
bengioy@esprit.iro.umontreal.ca
parents:
22
diff
changeset
|
989 |