annotate doc/v2_planning/dataset.txt @ 1116:18a092001752

An idea about Datasets and GPU.
author Arnaud Bergeron <abergeron@gmail.com>
date Tue, 14 Sep 2010 14:20:31 -0400
parents 4797a4cb73e1
children c1943feada10

Discussion of Function Specification for Dataset Types
======================================================

Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.
    * That's not something I would consider to be a dataset-related problem to
      tackle now: a probability distribution in Pylearn would probably be a
      different kind of beast, and it should be easy enough to have a
      DatasetToDistribution class for instance, that would take care of viewing a
      dataset as a probability distribution. -- OD
* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory).
* GPU/buffering issues.

Committee: DE, OB, OD, AB, PV
Leader: DE
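
OD's DatasetToDistribution suggestion above could be sketched roughly as
follows. This is only an illustration: the constructor arguments, the sample
method, and the assumption that the wrapped dataset supports len() and integer
indexing are all hypothetical, not part of any agreed Pylearn API.

```python
import random


class DatasetToDistribution(object):
    """View a finite, indexable dataset as its empirical distribution.

    Hypothetical sketch: assumes `dataset` supports len() and dataset[i].
    """

    def __init__(self, dataset, seed=None):
        self.dataset = dataset
        self.rng = random.Random(seed)  # private RNG, so sampling is reproducible

    def sample(self, n=1):
        # Draw n examples uniformly, with replacement.
        size = len(self.dataset)
        return [self.dataset[self.rng.randrange(size)] for _ in range(n)]


examples = [(0.0, 'a'), (1.0, 'b'), (2.0, 'c')]
dist = DatasetToDistribution(examples, seed=0)
draws = dist.sample(5)
```

Keeping this as a wrapper leaves the dataset interface itself free of any
sampling logic, which seems to be the point of the comment.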

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides
- mlpy: very primitive notions of data (simple 2D matrices)
- PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
  SequentialDataSet, ReinforcementDataSet, ... Each class is quite
  constrained and may have a different interface.
- MDP: Seems to have restrictions on the type of data being passed around, as
  well as its dimensionality ("Input array data is typically assumed to be
  two-dimensional and ordered such that observations of the same variable are
  stored on rows and different variables are stored on columns.")
- Orange: Data matrices, with names and types associated to each column.
  Basically there seems to be only one base dataset class that contains the
  data. Data points are lists (of values corresponding to each column).
- APGL: Hard to say how they deal with data from the documentation alone.
- Monte: Data is simply numpy arrays.
- scikits.learn: Dataset is a simple container with e.g. dataset.data being
  a 2D numpy array of input features, and dataset.target the target vector.
- Shogun: Vade Retro C++! (may be worth looking into their feature concept
  though).
- Any more worth looking at?

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite
- task/views of the data for different problems
- indexing & slicing
- pairs, triples, etc. of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)
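
The first and third requirements above (possibly infinite streams, indexing &
slicing) pull in different directions: a stream has no length, so only forward
slicing is meaningful. A minimal sketch, assuming a hypothetical StreamDataset
wrapper around a Python iterator (none of these names come from the
discussion):

```python
import itertools


class StreamDataset(object):
    """Hypothetical wrapper around a (possibly infinite) iterator of examples."""

    def __init__(self, iterator):
        self._it = iter(iterator)

    def get_next_example(self):
        return next(self._it)

    def get_slice(self, slice_object):
        # Lazily consume the stream; negative indices would need the full
        # length, which an infinite stream does not have (islice rejects them).
        return list(itertools.islice(self._it, slice_object.start,
                                     slice_object.stop, slice_object.step))


stream = StreamDataset(itertools.count())   # 0, 1, 2, ... forever
first = stream.get_next_example()           # consumes the first example
chunk = stream.get_slice(slice(0, 6, 2))    # relative to the current position
```

Note that slices here are relative to the current stream position rather than
absolute, which is one of the semantics the specification would have to pin
down.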

Another question to consider is the following: how tightly should it integrate
with Theano? Do we want to be able to store data as shared variables or just
have an option for that? Theano + GPU constrains the things that we can do (in
terms of sizes, buffering, etc.): these are things we need to think about, but
it's not clear whether we should aim for building them into the interface.

Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?

There is then the question of how to approach the design of a Dataset class from
an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
Dataset that doesn't implement any methods except a few setters/getters. The reason
to have the methods listed that way is to have a common 'specification', but classes
that inherit from Dataset need not implement every single method (only the ones
that are relevant) and can obviously implement other methods as appropriate. The
reason to have a common specification (as abstract as it might be) is to, well,
have a common specification that would make our code clearer and cleaner.

An example of what I (Dumi) am thinking in terms of concrete API:

class Dataset:
    def __init__(self):
        self.type = None
        self.in_memory = None
        self.inputs = None # list of filepaths, or objects in memory, or...
        self.outputs = None

    def get_example(self, example_index):
        raise NotImplementedError()

    def get_next_example(self):
        raise NotImplementedError()

    def get_batch(self, batch_index):
        raise NotImplementedError()

    def get_next_batch(self):
        raise NotImplementedError()

    def get_slice(self, slice_object):
        raise NotImplementedError()

    def set_view(self, view_type):
        self.view_type = view_type
        self.n_classes = None

    def set_n_classes(self, n_classes):
        self.n_classes = n_classes

    def set_batch_size(self, batch_size):
        self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I think we should
just have a train dataset, a valid one and a test one instead, or (if it's in one
big file or infinite stream) just handle the split ourselves (via slicing, for
instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
specification does not preclude more fine-grained 'splitting' of the data.
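
Handling the split via slicing, as suggested above, might look like the
following sketch. It assumes the examples sit in an ordinary Python sequence
and that an 80/20 cut is wanted; the helper name split_by_slicing is made up
for illustration.

```python
def split_by_slicing(examples, train_fraction=0.8):
    """Return (train, valid) as two contiguous slices of `examples`."""
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]


data = list(range(10))
train, valid = split_by_slicing(data)
```

A contiguous cut like this only makes sense if the examples are already in a
suitable order; shuffling first would be a separate decision.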

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

import numpy
import theano
import theano.tensor as T

class MNIST(Dataset):
    def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
        self.type = 'standard_xy'
        self.in_memory = True
        self.inputs = inputs # load them or create
        self.outputs = outputs
        self.set_view('classification')
        self.set_n_classes(10)
        self.set_batch_size(20)
        self.n_batches = self._compute_n_batches()

    def get_batch(self, batch_index):
        x, y = self._fetch_batch(batch_index)
        if self.view_type == 'classification':
            return x, numpy.int32(y)
        elif self.view_type == 'density_estimation':
            return x
        else:
            raise NotImplementedError()

    def shared_data(self):
        # assumes self.inputs / self.outputs hold the loaded data, not file paths
        shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
        return shared_x, T.cast(shared_y, 'int32')

    def _compute_n_batches(self):
        pass # stub: should return the number of batches

    def _fetch_batch(self, batch_index):
        pass # stub: should return the (x, y) pair for this batch
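
The two stubs at the end of the class are left empty above. One possible way
to fill them in, assuming the inputs and outputs have already been loaded into
equal-length in-memory sequences (plain lists here, standing in for numpy
arrays), is sketched below. The class name InMemoryBatches is hypothetical,
and whether a trailing partial batch should be kept is an open choice; this
sketch keeps it.

```python
class InMemoryBatches(object):
    """Hypothetical stand-in for the batching part of the MNIST class."""

    def __init__(self, inputs, outputs, batch_size):
        assert len(inputs) == len(outputs)
        self.inputs = inputs
        self.outputs = outputs
        self.batch_size = batch_size
        self.n_batches = self._compute_n_batches()

    def _compute_n_batches(self):
        # Ceiling division, so a final partial batch is counted.
        return (len(self.inputs) + self.batch_size - 1) // self.batch_size

    def _fetch_batch(self, batch_index):
        if not 0 <= batch_index < self.n_batches:
            raise IndexError('batch_index out of range')
        start = batch_index * self.batch_size
        stop = start + self.batch_size
        return self.inputs[start:stop], self.outputs[start:stop]


batches = InMemoryBatches(inputs=list(range(50)),
                          outputs=[i % 10 for i in range(50)],
                          batch_size=20)
x, y = batches._fetch_batch(2)  # the last, partial batch
```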

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that!

So we'd use it as:

train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

x, y = train_mnist.get_batch(0)
train_mnist.set_view('density_estimation')
x = train_mnist.get_batch(0)

or

mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
batches_train = range(int(mnist_data.n_batches*0.8))
batches_valid = range(int(mnist_data.n_batches*0.8), mnist_data.n_batches)

xt, yt = mnist_data.get_batch(batches_train[0])
xv, yv = mnist_data.get_batch(batches_valid[0])


COMMENTS
~~~~~~~~

JB asks: What may be passed as arguments to the functions in Dataset, and what
can be expected in return? Are there side effects (e.g. on the state of the
Dataset) associated with any of the functions?
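
To make the side-effect question concrete: in the draft API,
get_batch(batch_index) can be a pure function of its argument, whereas
get_next_batch necessarily mutates some cursor on the Dataset. The following
sketch (class and attribute names are hypothetical) shows the difference.

```python
class CursorDataset(object):
    """Hypothetical dataset contrasting stateless and stateful access."""

    def __init__(self, examples, batch_size):
        self.examples = examples
        self.batch_size = batch_size
        self._cursor = 0  # hidden state, touched only by get_next_batch

    def get_batch(self, batch_index):
        # No side effect: the same index always returns the same batch.
        start = batch_index * self.batch_size
        return self.examples[start:start + self.batch_size]

    def get_next_batch(self):
        # Side effect: advances the cursor, so repeated calls differ.
        batch = self.get_batch(self._cursor)
        self._cursor += 1
        return batch


ds = CursorDataset(list(range(6)), batch_size=2)
same = (ds.get_batch(1), ds.get_batch(1))
seq = (ds.get_next_batch(), ds.get_next_batch())
```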

JB asks: What properties are part of the Dataset API? What possible types can
they have? Are they expected to be read-only or writeable? What do they mean?

JB asks: What is a view? Does set_view change the Dataset or return a new
Dataset with a certain view of the original (in which case call it get_view)?
Does the view imply the types of the return-value of functions like
get_batch? What is the difference between the view and the subclasses of
Dataset in PyML?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
192
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
193 JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
194 retrieval that we would like to expose via this interface?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
195
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
196 JB asks: How would you recommend using this sort of dataset in a boosting
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
197 algorithm where points need to be re-weighted?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
198
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
199
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
200 JB asks: Do we want to provide for the possibility of feedback that modifies the
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
201 dataset? For example, curriculum learning might be adaptive in this sense, or
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
202 if we wanted to provide a virtual world for an agent as a dataset then we need
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
203 to provide 'actions' to get the next batch. Could this be done in the current
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
204 API?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
205
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
206
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
207 Field names and attributes
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
208 ~~~~~~~~~~~~~~~~~~~~~~~~~~
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
209
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
210 OD: One important question is how to handle fields' names and characteristics.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
211 For instance, it can be useful to know that the 3rd input field represents a
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
212 number of fingers, and is a non-negative discrete field whose numeric value is
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
213 meaningful (compared to, say, an integer index that would correspond to an
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
214 animal's category). We mentioned metadata during the meeting, but we did not
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
215 get into its details: that may be the place to put this kind of information.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
216
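A minimal sketch of what such per-field metadata could look like. The `FieldInfo` class, its attribute names, and the `can_average` helper are all hypothetical illustrations, not part of any agreed Dataset API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical per-field metadata record (illustration only)."""
    name: str
    kind: str                          # "continuous", "discrete" or "categorical"
    ordered: bool                      # True if the numeric value itself is meaningful
    min_value: Optional[float] = None  # lower bound, when one is known


# A finger count: discrete, non-negative, and numerically meaningful.
n_fingers = FieldInfo(name="n_fingers", kind="discrete", ordered=True, min_value=0)

# An integer index standing for an animal's category: the number is only a label.
animal = FieldInfo(name="animal", kind="categorical", ordered=False)


def can_average(field):
    """Averaging only makes sense when the numeric value carries meaning."""
    return field.ordered
```

A learner or preprocessing step could consult such records to decide, for instance, whether a field should be one-hot encoded or used as a number.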
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
217
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
218 Freeing memory
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
219 ~~~~~~~~~~~~~~
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
220
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
221 OD: It is sometimes useful to be able to free memory used by previous
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
222 computations. A typical example is when you load in memory the original
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
223 dataset, then perform various processing steps, ending with a new dataset that
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
224 you also store in memory before feeding it to the learner. Unless you very
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
225 carefully design your code to avoid it, your original dataset will still
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
226 remain in memory (as well as maybe the results of some computations performed
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
227 along the way). So there may be a use for a `clear()` method that would be
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
228 called by the topmost dataset (the one doing the final memory caching), and
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
229 would be forwarded iteratively to previous datasets so as to get back all this
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
230 wasted memory space.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
231
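A rough sketch of the `clear()` idea, under the assumption that each dataset keeps a reference to its input dataset and caches its own result. The class names (`Dataset`, `Loaded`, `Scaled`, `MemoryCached`) and the `data()` / `compute()` methods are hypothetical; only `clear()` comes from the comment above.

```python
class Dataset:
    """Hypothetical base class: a dataset may wrap a `source` dataset."""

    def __init__(self, source=None):
        self.source = source
        self._cache = None

    def compute(self):
        raise NotImplementedError

    def data(self):
        # Compute once, then serve the cached result.
        if self._cache is None:
            self._cache = self.compute()
        return self._cache

    def clear(self):
        # Drop our own cached data, then forward the call upstream.
        self._cache = None
        if self.source is not None:
            self.source.clear()


class Loaded(Dataset):
    def compute(self):
        return list(range(5))  # stands in for data loaded from disk


class Scaled(Dataset):
    def compute(self):
        return [2 * x for x in self.source.data()]


class MemoryCached(Dataset):
    """Topmost dataset: keeps the final result and frees everything before it."""

    def compute(self):
        result = list(self.source.data())
        self.source.clear()
        return result


raw = Loaded()
top = MemoryCached(Scaled(raw))
final = top.data()
```

After `top.data()` returns, only the topmost dataset still holds data; the original and intermediate copies can be garbage-collected.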
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
232 What is a mini-batch?
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
233 ~~~~~~~~~~~~~~~~~~~~~
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
234
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
235 This is a follow-up to the meeting's discussion about whether a mini-batch
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
236 returned by a dataset should be itself a dataset.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
237
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
238 OD: During the meeting I was voting in favor of a 'yes', mostly because it
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
239 made sense to me (a mini-batch is a subset of a dataset and thus should be a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
240 dataset), but now I tend towards 'no'. The main reason is it is not clear yet
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
241 what the dataset interface will be, so that it is hard to judge whether this
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
242 is a good idea (my main concern is how much additional work would be required by
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
243 the writer of a new dataset subclass). Anyway, maybe a first thing we could
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
244 think about is what we want a mini-batch to be. I think we can agree that we
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
245 would like to be able to do something like:
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
246     for mb in dataset.mini_batches(size=10):
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
247         learner.update(mb.input, mb.target)
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
248 so that it should be ok for a mini-batch to be an object whose fields
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
249 (that should have the same names as those of the dataset) are numpy arrays.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
250 More generally, we would like to be able to iterate on samples in a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
251 mini-batch, or do random access on them, so a mini-batch should implement
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
252 __iter__ and __getitem__.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
253 Besides this, is there any other typical use-case of a mini-batch? In
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
254 particular, is there any reason to want an infinite mini-batch, or a very big
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
255 mini-batch that may not fit in memory? (in which case we may need to revise
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
256 our idea of what 'mini' means) Hopefully the answer to that last question is
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
257 no, as I think it would definitely keep things simpler, since we could simply
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
258 use numpy arrays (for numeric data) or lists (for anything else) to store
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
259 mini-batches' data. So I vote for 'no'.
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
260
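A minimal sketch of a mini-batch that is not itself a dataset: a plain object whose named fields are numpy arrays and which implements `__iter__` and `__getitem__`. The `MiniBatch` class and the free function `mini_batches` are hypothetical; in particular, returning each sample as a dict of field values is just one possible choice.

```python
import numpy as np


class MiniBatch:
    """Hypothetical mini-batch: named fields backed by numpy arrays."""

    def __init__(self, **fields):
        lengths = {len(v) for v in fields.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must have the same number of samples")
        self._names = tuple(fields)
        for name, value in fields.items():
            setattr(self, name, np.asarray(value))

    def __len__(self):
        return len(getattr(self, self._names[0]))

    def __getitem__(self, i):
        # Random access: one sample as a dict of field values.
        return {name: getattr(self, name)[i] for name in self._names}

    def __iter__(self):
        # Iterate over samples, one at a time.
        return (self[i] for i in range(len(self)))


def mini_batches(input, target, size):
    """Yield consecutive MiniBatch objects; the last one may be smaller."""
    for start in range(0, len(input), size):
        yield MiniBatch(input=input[start:start + size],
                        target=target[start:start + size])


x = np.arange(50, dtype=float).reshape(25, 2)
y = np.arange(25)
batches = list(mini_batches(x, y, size=10))
```

Since every field is a finite in-memory array, nothing here supports infinite or larger-than-memory mini-batches, which matches the 'no' vote above.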
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
261 A dataset is a learner
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
262 ~~~~~~~~~~~~~~~~~~~~~~
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
263
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
264 OD: (this is hopefully a clearer re-write of the original version from
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
265 r7e6e77d50eeb, which I was not happy with).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
266 There are typically three kinds of objects that spit out data:
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
267 1. Datasets that are loaded from disk or are able to generate data all by
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
268 themselves (i.e. without any other dataset as input)
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
269 2. Datasets that transform their input dataset in a way that only depends on
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
270 the input dataset (e.g. filtering samples or features, normalizing data, etc.)
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
271 3. Datasets that transform their input dataset in a way that is learned on a
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
272 potentially different dataset (e.g. PCA when you want to learn the projection
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
273 space on the training set in order to transform both the training and test
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
274 sets).
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
275 My impression currently is that we would use dataset subclasses to handle 1
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
276 and 2. However, 3 requires a learner framework, so you would need to have
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
277 something like a LearnerOutputDataset(trained_learner, dataset).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
278
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
279 Note however that 2 is a special case of 3 (where training does nothing), and
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
280 1 is a special case of 2 (where we do not care about being given an input
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
281 dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
282 by LearnerOutputDataset.
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
283
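A sketch of how the three kinds could all go through a single wrapper. Only the name `LearnerOutputDataset(trained_learner, dataset)` comes from the text above; the `Learner` interface (`train` / `transform`), the `data()` method, and the example learners are assumptions made for illustration.

```python
import numpy as np


class Learner:
    """Hypothetical minimal learner interface: train, then transform."""

    def train(self, data):
        return self  # default: nothing to learn

    def transform(self, data):
        raise NotImplementedError


class FromArray(Learner):
    """Kind 1: produces data by itself and ignores any input dataset."""

    def __init__(self, array):
        self.array = array

    def transform(self, data):
        return self.array


class KeepColumns(Learner):
    """Kind 2: depends only on its input, so training does nothing."""

    def __init__(self, columns):
        self.columns = columns

    def transform(self, data):
        return data[:, self.columns]


class Standardize(Learner):
    """Kind 3: shift and scale are learned on one dataset, reused on others."""

    def train(self, data):
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + 1e-8  # avoid division by zero
        return self

    def transform(self, data):
        return (data - self.mean) / self.std


class LearnerOutputDataset:
    def __init__(self, trained_learner, dataset):
        self.learner = trained_learner
        self.dataset = dataset

    def data(self):
        source = None if self.dataset is None else self.dataset.data()
        return self.learner.transform(source)


train = LearnerOutputDataset(FromArray(np.array([[1., 10.], [3., 30.]])), None)
test = LearnerOutputDataset(FromArray(np.array([[2., 20.]])), None)

scaler = Standardize().train(train.data())     # parameters learned on train only
train_std = LearnerOutputDataset(scaler, train)
test_std = LearnerOutputDataset(scaler, test)  # same shift and rescaling reused
first_col = LearnerOutputDataset(KeepColumns([0]), train)
```

Moving from kind 2 to kind 3 then only means giving a learner a non-trivial `train`; the code that plugs components together is unchanged.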
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
284 The main advantages I find in this approach (that I have been using at
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
285 Ubisoft) are:
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
286 - You only need to learn how to subclass the learner class. The only dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
287 class is LearnerOutputDataset, which you could just name Dataset.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
288 - You do not have different ways to achieve the same result (having to figure
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
289 out which one is most appropriate).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
290 - Upgrading code from 2 to 3 is more straightforward. Such a situation can
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
291 happen e.g. if you write some code that normalizes your input dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
292 (situation 2), then realize later you would like to be able to normalize new
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
293 datasets using the same parameters (e.g. same shift & rescaling), which
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
294 requires situation 3.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
295 - It can make your life easier when thinking about how to plug things together
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
296 (something that has not been discussed yet), because the interfaces of the
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
297 various components are less varied.
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
298
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
299 I am not saying that we should necessarily do it this way, but I think it is
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
300 worth at least keeping in mind this close relationship between simple
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
301 processing and learning, and thinking about the benefits / drawbacks
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
302 of keeping them separate in the class hierarchy.
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
303
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
304 RP: I actually like this idea of having the dataset implement the same
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
305 interface as the learner (or actually a subset of the interface).
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
306 I hope people decide to do this.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
307
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
308 Support for shared variables
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
309 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
310
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
311 RP asks: What is the status of having the dataset support copying data
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
312 on the GPU (by storing data in shared variables)? Have you decided to
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
313 include this feature or not? I think that the strongest selling point of
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
314 Theano is that it runs on GPU transparently, and I see this as a good
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
315 selling point for the library as well. Plus we intend to move more and
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
316 more towards running things on GPU. If the dataset object does not support
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
317 this feature, we will need to find hacks around it.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
318
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
319 OD: I have like zero experience with GPU so hopefully someone else can answer
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
320 this. But the way I see it, hopefully it could work by having some dataset
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
321 object that would take care of storing its input data into a shared variable.
1094
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
322 OD (continued): After thinking a bit more about it, I am not sure that would
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
323 work. I definitely need to look at some code doing it to get a better
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
324 understanding of it, but my feeling is that you need your learner to be
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
325 written in a specific way to achieve this, in which case it may be up to the
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
326 learner to take its input data and store it into a shared variable.
1104
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
327
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
328 RP comment: Yes, the dataset object alone cannot handle this; the issue is somewhere
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
329 between the dataset and the learner. Or in other words, every time you change
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
330 the data you need to recompile your Theano function. So the learner cannot
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
331 just get data from the dataset; it needs to get a shared variable. The learner
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
332 should also be aware when the dataset is changed, to recompile its internal
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
333 functions. I'm not sure which is the best way to do this. My personal feeling
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
334 is that the dataset should be part of the learner. The learner should provide
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
335 a function use_dataset (or replace_dataset). When this function is called,
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
336 all the Theano functions in the learner get recompiled based on shared
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
337 variables that the dataset object provides. It sort of fits very well in the
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
338 framework that I have in mind, which was scattered around in learner.txt
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
339 and some of my previous emails. I think it shares a lot with James's concepts,
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
340 since it follows quite closely the concepts behind Theano.
1105
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
341
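A toy sketch of the `use_dataset` idea, written in plain Python so that it runs without Theano. `SharedBuffer` is only a stand-in for a shared variable, and "recompiling" is imitated by rebuilding a closure; the class names, the `shared_input` attribute, and the `n_compilations` counter are hypothetical.

```python
class SharedBuffer:
    """Stand-in for a shared variable: a mutable container holding data."""

    def __init__(self, value):
        self.value = value


class InMemoryDataset:
    """Hypothetical dataset that exposes its data through a shared buffer."""

    def __init__(self, inputs):
        self.shared_input = SharedBuffer(inputs)


class MeanLearner:
    def __init__(self):
        self.n_compilations = 0
        self._mean_fn = None

    def use_dataset(self, dataset):
        # Rebuild ("recompile") internal functions around the dataset's buffer.
        buf = dataset.shared_input
        self.n_compilations += 1
        self._mean_fn = lambda: sum(buf.value) / len(buf.value)

    def mean(self):
        return self._mean_fn()


learner = MeanLearner()
learner.use_dataset(InMemoryDataset([1.0, 2.0, 3.0]))
mean_a = learner.mean()
learner.use_dataset(InMemoryDataset([10.0, 20.0]))
mean_b = learner.mean()
```

The point of the sketch is only the control flow: the learner never copies the data itself, and swapping datasets is the single place where its internal functions are rebuilt.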
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
342 OD asks: Ok, so why would the dataset have to be responsible for providing a
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
343 shared variable? Why wouldn't the learner just create this shared variable
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
344 internally and copy into it the data provided by the dataset?
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
345
1109
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
346 RP replies: Sure, the learner could take care of all this. Note though that the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
347 learner should take care to divide the dataset into chunks that fit in the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
348 GPU memory (in case of a large dataset) and then take care of updating the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
349 shared variables according to the current chunk. Personally I feel like all
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
350 this data division, management and so on should be done by the dataset.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
351 It feels more natural that way. For example, assume you have a dataset that
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
352 is composed of a time series and some static data (carre-tech heart beat
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
353 data is a good example). The static data is small enough so that you could
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
354 always store it on the GPU, and you would only need to split the time series.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
355 For the learner to do this (since it gets the same interface from any
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
356 dataset object) would mean adding an "if <this case> then" branch, while for the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
357 dataset it is just a different class. But I'm happy to have all this GPU stuff
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
358 sent to the learner as well if everybody else believes that is better.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
359
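A minimal sketch of the split RP describes, with the dataset rather than the
learner handling chunking and shared-variable updates. Everything here is
hypothetical and for illustration only: SharedBuffer is a plain-Python
stand-in for a Theano shared variable (it only mimics get_value/set_value and
performs no GPU transfer), and ChunkedTimeSeriesDataset, load_chunk and
train_step are invented names, not part of any agreed interface.

```python
class SharedBuffer:
    """Hypothetical stand-in for a Theano shared variable."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


class ChunkedTimeSeriesDataset:
    """Hypothetical dataset: static data stays resident, the series is chunked."""

    def __init__(self, series, static, chunk_size):
        self.series = series
        self.chunk_size = chunk_size
        # Small static data: loaded once and never replaced.
        self.static = SharedBuffer(static)
        # Large time series: only the current chunk is held in the buffer.
        self.chunk = SharedBuffer(series[:chunk_size])

    def n_chunks(self):
        # Ceiling division, so a shorter final chunk is still counted.
        return (len(self.series) + self.chunk_size - 1) // self.chunk_size

    def load_chunk(self, i):
        start = i * self.chunk_size
        self.chunk.set_value(self.series[start:start + self.chunk_size])


def train_step(dataset):
    # Learner-side code: it only reads the two buffers and has no
    # special case for how the dataset manages its memory.
    offset = dataset.static.get_value()
    return sum(x + offset for x in dataset.chunk.get_value())


dataset = ChunkedTimeSeriesDataset(series=list(range(10)), static=1, chunk_size=4)
totals = []
for i in range(dataset.n_chunks()):
    dataset.load_chunk(i)
    totals.append(train_step(dataset))
```

A dataset with different memory needs would be a different class exposing the
same two buffers, so the learner-side function would not need to change.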
1110
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
360 FB comment: I don't understand why you would need to recompile the theano function.
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
361 There are 2 cases. In the first, the data is in a shared variable. You can directly change the data
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
362 in the shared variable without recompiling the theano fct. The second case is when
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
363 the dataset is in an ordinary theano variable. In that case, the first step in the
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
364 theano fct will be to transfer the dataset to the GPU before computation. If the data
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
365 changes at each call, that will be as efficient as changing the data manually every time
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
366 in the shared variable.
1116
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
367
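The two cases in FB's comment can be sketched without Theano itself. The code
below is only an illustration under stated assumptions: SharedBuffer is a
hypothetical stand-in for a shared variable, the two compile_* helpers stand
in for building a compiled function, and compile_count is an invented counter
used to show that neither case needs a second compilation. In Theano terms,
the first case would correspond to a variable created with theano.shared and
updated with set_value, the second to an ordinary function input.

```python
compile_count = 0


class SharedBuffer:
    """Hypothetical stand-in for a Theano shared variable."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


def compile_sum_over_shared(shared):
    """Case 1: the function reads its data from a shared container."""
    global compile_count
    compile_count += 1
    return lambda: sum(shared.get_value())


def compile_sum_over_argument():
    """Case 2: the data is passed in as an argument on every call."""
    global compile_count
    compile_count += 1
    return lambda data: sum(data)


batches = [[1, 2], [3, 4], [5, 6]]

# Case 1: compile once, then swap the contents of the shared container.
shared = SharedBuffer(batches[0])
f_shared = compile_sum_over_shared(shared)
shared_results = []
for batch in batches:
    shared.set_value(batch)
    shared_results.append(f_shared())

# Case 2: compile once, then hand each batch to the function directly.
f_arg = compile_sum_over_argument()
arg_results = [f_arg(batch) for batch in batches]
```

Both loops move new data in on every call and both give the same results; the
functions are built exactly once each.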
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
368 AB: I have an idea about this which kind of fits in the "building a
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
369 theano op" thing that we talked about at the last meeting.
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
370
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
371 We could have a specialized theano op that takes a dataset and returns
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
372 chunks of it with an index using the standard Dataset interface. The
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
373 code to transfer to the GPU or whatever goes in that Op and we don't
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
374 need to change the dataset interface.
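
A rough sketch of the shape AB's idea could take, written as a plain Python
callable rather than a real Theano Op. All names are hypothetical:
DatasetChunkOp is invented for illustration, the "standard Dataset interface"
is assumed here to be just len() and slicing (the actual interface is still
under discussion), and the transfer argument is a placeholder for whatever
device-copy code would live inside the op.

```python
class DatasetChunkOp:
    """Hypothetical op-like object: maps a chunk index to a chunk of a dataset."""

    def __init__(self, dataset, chunk_size, transfer=tuple):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.dataset = dataset
        self.chunk_size = chunk_size
        # Placeholder for the transfer step (e.g. a copy to the GPU).
        # The default just copies the chunk into a tuple.
        self.transfer = transfer

    def __call__(self, index):
        start = index * self.chunk_size
        if index < 0 or start >= len(self.dataset):
            raise IndexError("chunk index out of range")
        # Only len() and slicing are used on the dataset, so the dataset
        # itself needs no GPU-specific methods.
        chunk = self.dataset[start:start + self.chunk_size]
        return self.transfer(chunk)


get_chunk = DatasetChunkOp(list(range(7)), chunk_size=3)
first = get_chunk(0)
last = get_chunk(2)
```

Because the transfer logic sits inside the op, changing how data reaches the
device would mean changing the op, not the dataset classes or the learner.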