comparison doc/v2_planning/dataset.txt @ 1047:1b61cbe0810b

A very rough draft of ideas, to kick-start things
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Wed, 08 Sep 2010 14:13:43 -0400
parents a154c9b68239
children a474fabd1f37
  simply datasets which cannot fit in memory)
* GPU/buffering issues.

Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides.
- mlpy: very primitive notions of data
- (still going through the other ones)

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite (see the sketch after this list)
- task/views of the data for different problems
- indexing & slicing
- pairs, triples, etc. of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)
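
As a very rough sketch of what the 'streams, possibly infinite' point could mean
in code (everything here is hypothetical: the StreamDataset name, the
iterator-based constructor, and the choice to disallow indexing are just one way
to do it):

import numpy

class StreamDataset(object):
    """Hypothetical container for a (possibly infinite) stream of examples."""
    def __init__(self, example_iterator):
        self.type = 'stream'
        self.in_memory = False
        self._iterator = example_iterator

    def get_next_example(self):
        # pull the next example from the underlying iterator/generator
        return next(self._iterator)

    def get_example(self, example_index):
        # random access does not make sense for an infinite stream
        raise NotImplementedError('streams do not support indexing')

# usage: wrap a generator that yields examples forever
def random_examples(dim=5, seed=0):
    rng = numpy.random.RandomState(seed)
    while True:
        yield rng.uniform(size=dim)

stream = StreamDataset(random_examples())
x = stream.get_next_example()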

Another question to consider is the following: how tightly should this integrate
with Theano? Do we want to be able to store data as shared variables, or just
have an option for that? Theano + GPU constrains what we can do (in terms
of sizes, buffering, etc.): these are things we need to think about, but it's not
clear whether we should aim for building them into the interface.
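
To make the 'store data as shared variables' option concrete, here is a minimal
sketch (the helper name as_shared_xy is made up; it follows the same pattern as
the shared_data method further down):

import numpy
import theano
import theano.tensor as T

def as_shared_xy(x, y, borrow=True):
    # Wrap numpy arrays in Theano shared variables so they can live on the GPU.
    # Labels are stored as floatX and cast back to int32, the usual trick for
    # keeping integer targets in GPU memory.
    shared_x = theano.shared(numpy.asarray(x, dtype=theano.config.floatX),
                             borrow=borrow)
    shared_y = theano.shared(numpy.asarray(y, dtype=theano.config.floatX),
                             borrow=borrow)
    return shared_x, T.cast(shared_y, 'int32')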

Task views of the data for different problems: how can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?
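
For instance (a hypothetical sketch only; the descriptor tuple and the
validation in set_view are just one possible answer to the question above):

# the standard dataset descriptors mentioned above
STANDARD_VIEWS = ('classification', 'regression',
                  'multi-label', 'density_estimation')

class ViewedDataset(object):
    # hypothetical base holding only the view-related state
    def set_view(self, view_type):
        if view_type not in STANDARD_VIEWS:
            raise ValueError('unknown view type: %s' % view_type)
        self.view_type = view_type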

There is then the question of how to approach the design of a Dataset class from
an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
Dataset that doesn't implement any methods except a few setters/getters. The reason
to list the methods that way is to have a common 'specification', but classes
that inherit from Dataset need not implement every single method (only the ones
that are relevant) and can obviously implement other methods as appropriate. The
point of a common specification (as abstract as it might be) is, well, to make
our code clearer and cleaner.

An example of what I (Dumi) am thinking of in terms of a concrete API:

class Dataset:
    def __init__(self):
        self.type = None
        self.in_memory = None
        self.inputs = None  # list of filepaths, or objects in memory, or...
        self.outputs = None

    def get_example(self, example_index):
        raise NotImplementedError()

    def get_next_example(self):
        raise NotImplementedError()

    def get_batch(self, batch_index):
        raise NotImplementedError()

    def get_next_batch(self):
        raise NotImplementedError()

    def get_slice(self, slice_object):
        raise NotImplementedError()

    def set_view(self, view_type):
        # switching views invalidates any previously set number of classes
        self.view_type = view_type
        self.n_classes = None

    def set_n_classes(self, n_classes):
        self.n_classes = n_classes

    def set_batch_size(self, batch_size):
        self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I think we
should just have a train dataset, a valid one, and a test one instead, or (if it's
all in one big file or an infinite stream) just handle the split ourselves (via
slicing, for instance). I (Dumi) am of the opinion that this keeps things cleaner,
but the specification does not preclude more fine-grained 'splitting' of the data.
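
To make the slicing idea concrete, a split could look roughly like this (a sketch
only: it assumes the concrete class implements get_slice and that get_slice
returns a new Dataset restricted to the selected examples; the 80/20 split and
the example count are made up):

full_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])    # concrete class, defined below
n_examples = 60000                                        # assumed size, for illustration
cut = int(n_examples * 0.8)
train_data = full_data.get_slice(slice(0, cut))           # first 80% for training
valid_data = full_data.get_slice(slice(cut, n_examples))  # remaining 20% for validation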

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

import numpy
import theano
import theano.tensor as T

class MNIST(Dataset):
    def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
        self.type = 'standard_xy'
        self.in_memory = True
        self.inputs = inputs  # load them or create
        self.outputs = outputs
        self.set_view('classification')
        self.set_n_classes(10)
        self.set_batch_size(20)
        self.n_batches = self._compute_n_batches()

    def get_batch(self, batch_index):
        x, y = self._fetch_batch(batch_index)
        if self.view_type == 'classification':
            return x, numpy.int32(y)
        elif self.view_type == 'density_estimation':
            return x
        else:
            raise NotImplementedError()

    def shared_data(self):
        shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
        return shared_x, T.cast(shared_y, 'int32')

    def _compute_n_batches(self):
        pass

    def _fetch_batch(self, batch_index):
        pass
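
As a rough idea of what the two helpers above could do (an assumption on my part:
contiguous, equally-sized batches over in-memory arrays):

    def _compute_n_batches(self):
        # number of contiguous batches covering all examples
        return int(numpy.ceil(len(self.inputs) / float(self.batch_size)))

    def _fetch_batch(self, batch_index):
        # slice out one contiguous batch from the in-memory arrays
        start = batch_index * self.batch_size
        stop = start + self.batch_size
        return self.inputs[start:stop], self.outputs[start:stop]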

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that! (See the sketch at the end of this draft for one way that could look.)

So we'd use it as:

train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

x, y = train_mnist.get_batch(0)
train_mnist.set_view('density_estimation')
x = train_mnist.get_batch(0)

or

mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
batches_train = range(int(mnist_data.n_batches * 0.8))
batches_valid = range(int(mnist_data.n_batches * 0.8), mnist_data.n_batches)

xt, yt = mnist_data.get_batch(batches_train[0])
xv, yv = mnist_data.get_batch(batches_valid[0])
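
Following up on the get_train_batch/get_valid_batch remark above, such
convenience wrappers could be thin delegations to get_batch (a sketch only; the
subclass name and the 80/20 batch split are just for illustration):

class MNISTWithSplits(MNIST):
    def get_train_batch(self, i):
        # batches [0, 0.8 * n_batches) are treated as the training split
        return self.get_batch(i)

    def get_valid_batch(self, i):
        # batches [0.8 * n_batches, n_batches) are the validation split
        offset = int(self.n_batches * 0.8)
        return self.get_batch(offset + i)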