  simply datasets which cannot fit in memory)
* GPU/buffering issues.

Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides.
- mlpy: very primitive notions of data
- (still going through the other ones)

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite (a rough sketch of this case follows the list)
- task/views of the data for different problems
- indexing & slicing
- pairs or triples or etc of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)
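
To make the 'streams, possibly infinite' case concrete, here is a rough
sketch (an illustration only, not a proposal): StreamDataset and random_pairs
are hypothetical names, and the only assumption is that an infinite stream
supports sequential access, not indexing or slicing.

import itertools
import random


class StreamDataset(object):
    """Wraps a (possibly infinite) generator of examples; only sequential
    access makes sense here, so there is no get_example or get_slice."""

    def __init__(self, example_generator, batch_size=20):
        self.stream = example_generator
        self.batch_size = batch_size

    def get_next_example(self):
        return next(self.stream)

    def get_next_batch(self):
        # take the next batch_size examples off the stream
        return list(itertools.islice(self.stream, self.batch_size))


def random_pairs():
    # an infinite stream of (x, y) pairs, standing in for real data
    while True:
        yield ([random.random() for _ in range(5)], random.randint(0, 9))


stream = StreamDataset(random_pairs())
x, y = stream.get_next_example()
minibatch = stream.get_next_batch()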

Another question to consider is the following: how tightly should this
integrate with Theano? Do we want to be able to store data as shared
variables, or just have an option for that? Theano + GPU constrains what we
can do (in terms of sizes, buffering, etc.): these are things we need to
think about, but it's not clear whether we should aim to build them into the
interface.
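
For reference, the usual pattern for keeping a dataset in a Theano shared
variable and slicing minibatches out of it on the GPU looks roughly like the
sketch below; the array shapes, batch_size and the cost expression are made
up for illustration.

import numpy
import theano
import theano.tensor as T

# made-up data: 1000 examples of dimension 784, stored as floatX so the
# shared variable can live on the GPU
data_x = numpy.random.rand(1000, 784).astype(theano.config.floatX)
shared_x = theano.shared(data_x, borrow=True)

batch_size = 20
index = T.lscalar('index')    # minibatch index
x = T.matrix('x')
cost = T.sum(x ** 2)          # placeholder standing in for a model cost

# 'givens' slices the minibatch out of the shared variable, so no data has
# to be transferred from host to GPU at each call
get_cost = theano.function(
    [index], cost,
    givens={x: shared_x[index * batch_size:(index + 1) * batch_size]})

cost_of_first_batch = get_cost(0)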

Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?
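
If we do go the descriptor route, the mechanism could be as small as the
sketch below; the tuple of descriptor names and the validating set_view are
just one possible convention, not a settled design.

# one possible convention: a fixed set of standard view descriptors that
# set_view validates against
STANDARD_VIEWS = ('classification', 'regression', 'multi_label',
                  'density_estimation')


class ViewMixin(object):
    view_type = None

    def set_view(self, view_type):
        if view_type not in STANDARD_VIEWS:
            raise ValueError('unknown view type: %s' % view_type)
        self.view_type = view_type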

There is then the question of how to approach the design of a Dataset class
from an OOP perspective. So far, my (Dumi's) idea is to have an almost
'abstract class' Dataset that doesn't implement any methods except a few
setters/getters. Listing the methods this way gives us a common
'specification'; classes that inherit from Dataset need not implement every
single method (only the ones that are relevant) and can obviously implement
other methods as appropriate. The point of a common specification (as
abstract as it might be) is simply to make our code clearer and cleaner.

An example of what I (Dumi) am thinking of, in terms of a concrete API:

class Dataset:
    def __init__(self):
        self.type = None
        self.in_memory = None
        self.inputs = None   # list of filepaths, or objects in memory, or...
        self.outputs = None
        # view-related attributes, normally set through the setters below
        self.view_type = None
        self.n_classes = None
        self.batch_size = None

    def get_example(self, example_index):
        raise NotImplementedError()

    def get_next_example(self):
        raise NotImplementedError()

    def get_batch(self, batch_index):
        raise NotImplementedError()

    def get_next_batch(self):
        raise NotImplementedError()

    def get_slice(self, slice_object):
        raise NotImplementedError()

    def set_view(self, view_type):
        # changing the view invalidates any view-specific metadata
        self.view_type = view_type
        self.n_classes = None

    def set_n_classes(self, n_classes):
        self.n_classes = n_classes

    def set_batch_size(self, batch_size):
        self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I
think we should just have a train dataset, a valid one, and a test one, or
(if it's in one big file or an infinite stream) handle the split ourselves
(via slicing, for instance). I (Dumi) am of the opinion that this keeps
things cleaner, but the specification does not preclude more fine-grained
'splitting' of the data.

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

import numpy
import theano
import theano.tensor as T


class MNIST(Dataset):
    def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
        Dataset.__init__(self)
        self.type = 'standard_xy'
        self.in_memory = True
        self.inputs = inputs     # load them or create the in-memory arrays here
        self.outputs = outputs
        self.set_view('classification')
        self.set_n_classes(10)
        self.set_batch_size(20)
        self.n_batches = self._compute_n_batches()

    def get_batch(self, batch_index):
        x, y = self._fetch_batch(batch_index)
        if self.view_type == 'classification':
            return x, numpy.int32(y)
        elif self.view_type == 'density_estimation':
            return x
        else:
            raise NotImplementedError()

    def shared_data(self):
        # store everything as floatX so it can live on the GPU; the labels
        # are cast back to int32 when used as targets
        shared_x = theano.shared(numpy.asarray(self.inputs,
                                                dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(self.outputs,
                                                dtype=theano.config.floatX))
        return shared_x, T.cast(shared_y, 'int32')

    def _compute_n_batches(self):
        # stub: derive the number of minibatches from the data and batch_size
        pass

    def _fetch_batch(self, batch_index):
        # stub: read the requested minibatch from memory or from disk
        pass

But nothing stops you from defining get_train_batch, get_valid_batch and
stuff like that!

So we'd use it as:

train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

x, y = train_mnist.get_batch(0)
train_mnist.set_view('density_estimation')
x = train_mnist.get_batch(0)

or

mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
batches_train = range(int(mnist_data.n_batches * 0.8))
batches_valid = range(int(mnist_data.n_batches * 0.8), mnist_data.n_batches)

xt, yt = mnist_data.get_batch(batches_train[0])
xv, yv = mnist_data.get_batch(batches_valid[0])
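
And if one prefers the explicit get_train_batch / get_valid_batch style
mentioned above, a subclass could add it on top of the plain batch interface.
The sketch below is only an illustration: SplitMNIST is a hypothetical name
and the 80/20 split is arbitrary.

class SplitMNIST(MNIST):
    def __init__(self, **kwargs):
        MNIST.__init__(self, **kwargs)
        # fix an 80/20 train/valid split over the minibatches once
        n_train = int(self.n_batches * 0.8)
        self._train_batches = range(n_train)
        self._valid_batches = range(n_train, self.n_batches)

    def get_train_batch(self, i):
        return self.get_batch(self._train_batches[i])

    def get_valid_batch(self, i):
        return self.get_batch(self._valid_batches[i])


data = SplitMNIST(inputs=['x.npy'], outputs=['y.npy'])
xt, yt = data.get_train_batch(0)
xv, yv = data.get_valid_batch(0)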