comparison dataset.py @ 245:c702abb7f875

merged
author James Bergstra <bergstrj@iro.umontreal.ca>
date Mon, 02 Jun 2008 17:09:58 -0400
parents c8f19a9eb10f
children 7e6edee187e3 4ad6bc9b4f03
comparing 244:3156a9976183 with 245:c702abb7f875
@@ -45,18 +45,18 @@
 A DataSet can be seen as a generalization of a matrix, meant to be used in conjunction
 with learning algorithms (for training and testing them): rows/records are called examples, and
 columns/attributes are called fields. The field value for a particular example can be an arbitrary
 python object, which depends on the particular dataset.

-We call a DataSet a 'stream' when its length is unbounded (otherwise its __len__ method
+We call a DataSet a 'stream' when its length is unbounded (in which case its __len__ method
 should return sys.maxint).

 A DataSet is a generator of iterators; these iterators can run through the
 examples or the fields in a variety of ways. A DataSet need not necessarily have a finite
 or known length, so this class can be used to interface to a 'stream' which
 feeds on-line learning (however, as noted below, some operations are not
-feasible or not recommanded on streams).
+feasible or not recommended on streams).

 To iterate over examples, there are several possibilities:
 - for example in dataset:
 - for val1,val2,... in dataset:
 - for example in dataset(field1, field2, field3, ...):
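The three iteration styles listed in this hunk can be sketched as follows. This is a minimal illustration, not code from the module: the dataset construction is omitted, the field names 'x' and 'y' are hypothetical, and keyed access on an example is assumed.

    # Minimal sketch of the three iteration styles over a bounded DataSet
    # named `dataset`, with hypothetical fields 'x' and 'y'.
    for example in dataset:             # iterate over whole examples
        print example['x'], example['y']

    for x, y in dataset:                # unpack the field values directly
        print x, y

    for example in dataset('x', 'y'):   # iterate over a subset of the fields
        print example['x'], example['y']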
@@ -79,11 +79,11 @@
 - for field in dataset.fields(field1,field2,...) to select a subset of fields
 and each of these fields is iterable over the examples:
 - for field_examples in dataset.fields():
     for example_value in field_examples:
         ...
-but when the dataset is a stream (unbounded length), it is not recommanded to do
+but when the dataset is a stream (unbounded length), it is not recommended to do
 such things because the underlying dataset may refuse to access the different fields in
 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
 The result of fields() is a L{DataSetFields} object, which iterates over fields,
 and whose elements are iterable over examples. A DataSetFields object can
 be turned back into a DataSet with its examples() method::
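A corresponding sketch of the field-wise access this hunk documents; the field names are again hypothetical, and the dataset is assumed to be bounded, since fields() is illegal on streams by default.

    # Field-wise traversal; only sensible on a bounded (non-stream) DataSet.
    fields = dataset.fields('x', 'y')   # a DataSetFields over two fields
    for field_examples in fields:       # one iterable per selected field
        for example_value in field_examples:
            print example_value
    # A DataSetFields object can be turned back into a DataSet:
    dataset2 = fields.examples()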
@@ -597,11 +597,11 @@
 * for fields in dataset.fields(field1,field2,...) to select a subset of fields
 and each of these fields is iterable over the examples:
 * for field_examples in dataset.fields():
     for example_value in field_examples:
         ...
-but when the dataset is a stream (unbounded length), it is not recommanded to do
+but when the dataset is a stream (unbounded length), it is not recommended to do
 such things because the underlying dataset may refuse to access the different fields in
 an unsynchronized way. Hence the fields() method is illegal for streams, by default.
 The result of fields() is a DataSetFields object, which iterates over fields,
 and whose elements are iterable over examples. A DataSetFields object can
 be turned back into a DataSet with its examples() method:
@@ -1014,27 +1014,29 @@
         return len(self.data)

     def __getitem__(self,key):
         """More efficient implementation than the default __getitem__"""
         fieldnames=self.fields_columns.keys()
+        values=self.fields_columns.values()
         if type(key) is int:
             return Example(fieldnames,
-                           [self.data[key,self.fields_columns[f]] for f in fieldnames])
+                           [self.data[key,col] for col in values])
         if type(key) is slice:
             return MinibatchDataSet(Example(fieldnames,
-                                            [self.data[key,self.fields_columns[f]] for f in fieldnames]))
+                                            [self.data[key,col] for col in values]))
         if type(key) is list:
             for i in range(len(key)):
                 if self.hasFields(key[i]):
                     key[i]=self.fields_columns[key[i]]
             return MinibatchDataSet(Example(fieldnames,
                 # we must index a list key differently, as numpy
                 # doesn't support self.data[[i1,...],[i2,...]]
                 # when there are more than two i1 and i2
-                [self.data[key,:][:,self.fields_columns[f]]
-                 if isinstance(self.fields_columns[f],list) else
-                 self.data[key,self.fields_columns[f]] for f in fieldnames]),
+                [self.data[key,:][:,col]
+                 if isinstance(col,list) else
+                 self.data[key,col] for col in values]),
+

                 self.valuesVStack,self.valuesHStack)

         # else check for a fieldname
         if self.hasFields(key):
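The comment in this hunk alludes to numpy's fancy-indexing semantics: indexing with two lists pairs the indices element-wise rather than taking their cross product, which is why the code selects rows first and columns second. A small self-contained sketch of the distinction (the array and indices are made up for illustration):

    import numpy

    data = numpy.arange(16).reshape(4, 4)
    rows = [0, 2, 3]
    cols = [1, 3]

    # data[rows, cols] pairs indices element-wise: it requires
    # len(rows) == len(cols) and returns a 1-D array of single
    # elements, not the sub-matrix we want.
    sub = data[rows, :][:, cols]    # rows first, then columns: shape (3, 2)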
@@ -1052,19 +1054,22 @@
                 self.dataset=dataset
                 self.minibatch_size=minibatch_size
                 assert offset>=0 and offset<len(dataset.data)
                 assert offset+minibatch_size<=len(dataset.data)
                 self.current=offset
+                self.columns = [self.dataset.fields_columns[f]
+                                for f in self.minibatch._names]
             def __iter__(self):
                 return self
             def next(self):
                 #@todo: we suppose that we need to stop only when minibatch_size == 1.
                 # Otherwise, MinibatchWrapAroundIterator does it.
                 if self.current>=self.dataset.data.shape[0]:
                     raise StopIteration
                 sub_data = self.dataset.data[self.current]
-                self.minibatch._values = [sub_data[self.dataset.fields_columns[f]] for f in self.minibatch._names]
+                self.minibatch._values = [sub_data[c] for c in self.columns]
+
                 self.current+=self.minibatch_size
                 return self.minibatch

         return ArrayDataSetIterator2(self,self.fieldNames(),1,0,0)

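The insertions in this last hunk follow the same pattern as the __getitem__ change: the field-name-to-column lookups are hoisted out of next() into __init__, so each step of the iteration only indexes with precomputed columns. In isolation the pattern looks roughly like this; the class and argument names are illustrative, not the module's API.

    # Sketch of the hoisting pattern: resolve field -> column mappings once
    # at construction time instead of on every call to next().
    class RowIterator(object):
        def __init__(self, data, fields_columns, names):
            self.data = data
            self.current = 0
            # computed once; next() no longer does per-field dict lookups
            self.columns = [fields_columns[f] for f in names]
        def __iter__(self):
            return self
        def next(self):
            if self.current >= self.data.shape[0]:
                raise StopIteration
            row = self.data[self.current]
            self.current += 1
            return [row[c] for c in self.columns]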