annotate doc/v2_planning/dataset.txt @ 1047:1b61cbe0810b

A very rough draft of ideas, to kick-start things
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Wed, 08 Sep 2010 14:13:43 -0400
parents a154c9b68239
children a474fabd1f37
rev   line source
1002
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
diff changeset
1 Discussion of Function Specification for Dataset Types
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
diff changeset
2 ======================================================
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
diff changeset
3
1008
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
4 Some talking points from the September 2 meeting:
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
5
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
6 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
7 needs to be flexible enough to accommodate different (sub)tasks and views of
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
8 the same underlying data.
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
9 * Datasets as probability distributions from which one can sample.
1023
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
10 * That's not something I would consider to be a dataset-related problem to
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
11 tackle now: a probability distribution in Pylearn would probably be a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
12 different kind of beast, and it should be easy enough to have a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
13 DatasetToDistribution class for instance, that would take care of viewing a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
14 dataset as a probability distribution. -- OD
1008
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
15 * Our specification should allow transparent handling of infinite datasets (or
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
16 simply datasets which cannot fit in memory)
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
17 * GPU/buffering issues.
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
18
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
19 Commiteee: DE, OB, OD, AB, PV
1030
a154c9b68239 dataset: Dumi confirmed as leader
Olivier Delalleau <delallea@iro>
parents: 1023
diff changeset
20 Leader: DE
1047
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
21
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
22 Some ideas from existing ML libraries:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
23
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
25 PairDataSet, Aggregate. Ultimately, the learner decides
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
26 - mlpy: very primitive notions of data
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
27 - (still going through the other ones)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
28
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
29 A few things that our dataset containers should support at a minimum:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
30
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
31 - streams, possibly infinite
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
32 - task/views of the data for different problems
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
33 - indexing & slicing
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
34 - pairs or triples or etc of examples
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
35 - a 'distance/gram matrix' container (imagine that the data is given to you
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
36 as a distance matrix)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
37 - multi-dimensional time-series (again, maybe with pairs/triples, maybe
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
38 given to you as a distance matrix over time)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
39
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
40 Another question to consider is the following: how tight should it integrate
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
41 with Theano? Do we want to be able to store data as shared variables or just
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
42 have an option for that? Theano + GPU constrains things that we can do (in terms
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
43 of sizes, buffering, etc): these are things we need to think about, but it's not
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
44 clear whether we should aim for building them into the interface.
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
45
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
46 Task views of the data for different problems: How can we achieve this? Should
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
47 we simply have a set of standard dataset descriptors ('classification',
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
48 'regression', 'multi-label', 'density_estimation') and have a set_view method
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
49 that changes the current dataset view type?
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
50
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
51 There is then the question of how to approach the design of a Dataset class from
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
52 an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
53 Dataset that doesn't implement any methods except a few setters/getters. The reason
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
54 to have the methods listed that way is to have a common 'specification', but classes
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
55 that inherit from Dataset need not implement every single method (only the ones
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
56 that are relevant) and can obviously implement other methods as appropriate. The
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
57 reason to have a common specification (as abstract as it might be) is to, well,
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
58 have a common specification that would make our code clearer and cleaner.
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
59
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
60 An example of what I (Dumi) am thinking in terms of concrete API:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
61
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
62 class Dataset:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
63 def __init__(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
64 self.type = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
65 self.in_memory = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
66 self.inputs = None # list of filepaths, or objects in memory, or...
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
67 self.outputs = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
68
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
69 def get_example(self,example_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
70 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
71
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
72 def get_next_example(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
73 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
74
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
75 def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
76 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
77
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
78 def get_next_batch(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
79 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
80
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
81 def get_slice(self,slice_object):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
82 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
83
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
84 def set_view(self,view_type):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
85 self.view_type = view_type
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
86 self.n_classes = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
87
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
88 def set_n_classes(self,n_classes):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
89 self.n_classes = n_classes
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
90
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
91 def set_batch_size(self,batch_size):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
92 self.batch_size = batch_size
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
93
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
94 You will note that there is no notion of train/valid/test in this class: I think we should
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
95 just have a train dataset, a valid one and a test one instead or (if it's in one
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
96 big file or infinite stream) just handle the split ourselves (via slicing, for
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
97 instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
98 specification does not preclude more fine-grained 'splitting' of the data.
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
99
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
100 A concrete implementation would look like this (we would have one class per
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
101 dataset that we use, and the class declaration contains essentially everything
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
102 there is to know about the dataset):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
103
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
104 class MNIST(Dataset):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
105 def __init__(self,inputs=['train_x.npy'],outputs=['train_y.npy']):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
106 self.type='standard_xy'
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
107 self.in_memory = True
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
108 self.inputs = inputs # load them or create
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
109 self.outputs = outputs
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
110 self.set_view('classification')
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
111 self.set_n_classes(10)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
112 self.set_batch_size(20)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
113 self.n_batches = self._compute_n_batches()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
114
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
115 def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
116 x,y = self._fetch_batch(batch_index)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
117 if self.view_type == 'classification':
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
118 return x,numpy.int32(y)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
119 elif self.view_type == 'density_estimation':
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
120 return x
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
121 else:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
122 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
123
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
124 def shared_data(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
125 shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
126 shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
127 return shared_x, T.cast(shared_y, 'int32')
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
128
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
129 def _compute_n_batches(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
130 pass
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
131
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
132 def _fetch_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
133 pass
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
134
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
135 But nothing stops you from defining get_train_batch, get_valid_batch and stuff
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
136 like that!
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
137
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
138 So we'd use it as:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
139
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
140 train_mnist = MNIST(inputs = ['train_x.npy'], outputs = ['train_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
141 valid_mnist = MNIST(inputs = ['valid_x.npy'], outputs = ['valid_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
142
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
143 x,y = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
144 train_mnist.set_view('density_estimation')
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
145 x = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
146
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
147 or
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
148
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
149 mnist_data = MNIST(inputs = ['x.npy'], outputs = ['y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
150 batches_train = range(int(mnist_data.n_batches*0.8))
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
151 batches_valid = range(int(mnist_data.n_batches*0.8),mnist_data.n_batches)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
152
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
153 xt,yt = mnist_data.get_batch(batches_train[0])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
154 xv,yv = mnist_data.get_batch(batches_valid[0])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
155