annotate doc/v2_planning/dataset.txt @ 1373:90116fb3636b

fix and URL that was moved.
author Frederic Bastien <nouiz@nouiz.org>
date Wed, 17 Nov 2010 13:21:21 -0500
parents 04b988fb00b6
children
rev   line source
Discussion of Function Specification for Dataset Types
======================================================

Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.

  * That's not something I would consider to be a dataset-related problem to
    tackle now: a probability distribution in Pylearn would probably be a
    different kind of beast, and it should be easy enough to have a
    DatasetToDistribution class for instance, that would take care of viewing a
    dataset as a probability distribution. -- OD

* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory)
* GPU/buffering issues.

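To make OD's suggestion concrete, here is a minimal sketch of what a DatasetToDistribution wrapper might look like. Only the class name comes from the discussion; the constructor arguments, the ``sample`` method and the choice of the empirical distribution (uniform sampling with replacement over the stored examples) are assumptions made for illustration.

```python
import numpy as np


class DatasetToDistribution:
    """View a finite in-memory dataset as its empirical distribution.

    Hypothetical sketch: only the class name comes from the discussion.
    """

    def __init__(self, examples, seed=0):
        self.examples = np.asarray(examples)
        self.rng = np.random.default_rng(seed)

    def sample(self, n_samples=1):
        # Draw row indices uniformly with replacement, i.e. sample from the
        # distribution that puts mass 1/N on each stored example.
        idx = self.rng.integers(0, len(self.examples), size=n_samples)
        return self.examples[idx]


data = np.arange(12.0).reshape(6, 2)   # six toy examples, two features each
dist = DatasetToDistribution(data, seed=0)
draws = dist.sample(4)                 # four examples drawn from the dataset
```

The dataset class itself stays untouched here, which is the point of the comment: sampling lives in a separate object.
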
Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides
- mlpy: very primitive notions of data (simple 2D matrices)
- PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
  SequentialDataSet, ReinforcementDataSet, ... Each class is quite
  constrained and may have a different interface.
- MDP: Seems to have restrictions on the type of data being passed around, as
  well as its dimensionality ("Input array data is typically assumed to be
  two-dimensional and ordered such that observations of the same variable are
  stored on rows and different variables are stored on columns.")
- Orange: Data matrices, with names and types associated to each column.
  Basically there seems to be only one base dataset class that contains the
  data. Data points are lists (of values corresponding to each column).
- APGL: Hard to say how they deal with data from the documentation alone.
- Monte: Data is simply numpy arrays.
- scikits.learn: Dataset is a simple container with e.g. dataset.data being
  a 2D numpy array of input features, and dataset.target the target vector.
- Shogun: Vade Retro C++! (may be worth looking into their feature concept
  though).
- Any more worth looking at?

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite
- task/views of the data for different problems
- indexing & slicing
- pairs, triples, etc. of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)

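As an illustration of the first item, here is a small sketch of a possibly infinite stream, assuming a plain Python generator is an acceptable representation. The names ``noisy_line_stream`` and ``take_batch`` are hypothetical. It also shows why streams sit awkwardly next to indexing and slicing: a stream can only be consumed in order, so "slicing" means taking the next few examples rather than random access.

```python
import itertools
import random


def noisy_line_stream(seed=0):
    """Hypothetical infinite stream of (x, y) examples with y = 2x + noise."""
    rng = random.Random(seed)
    while True:
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + rng.gauss(0.0, 0.1)


def take_batch(stream, batch_size):
    """Consume up to `batch_size` examples from any iterator."""
    return list(itertools.islice(stream, batch_size))


stream = noisy_line_stream()
first_batch = take_batch(stream, 5)
second_batch = take_batch(stream, 5)   # continues where the first one stopped
```

The same ``take_batch`` helper works on a finite iterator, in which case the last batch may simply be shorter.
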
Another question to consider is the following: how tightly should our dataset
containers integrate with Theano? Do we want to be able to store data as shared
variables or just have an option for that? Theano + GPU constrains things that
we can do (in terms of sizes, buffering, etc.): these are things we need to
think about, but it's not clear whether we should aim for building them into
the interface.

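To give the buffering concern some shape, here is a rough sketch that keeps only one fixed-size chunk of the data in a buffer and serves minibatches from it. Everything in it is an assumption for illustration: the class name ``ChunkedBatches`` is made up, and the NumPy buffer merely stands in for a GPU-resident shared variable, so no Theano call is involved.

```python
import numpy as np


class ChunkedBatches:
    """Hypothetical helper: serve minibatches from one buffered chunk at a time."""

    def __init__(self, data, chunk_size, batch_size):
        assert chunk_size % batch_size == 0, "a chunk must hold whole batches"
        self.data = data            # stands in for storage too big for the device
        self.chunk_size = chunk_size
        self.batch_size = batch_size
        self.chunk_start = None     # index of the first example in the buffer
        self.buffer = None          # stands in for a GPU-resident shared variable

    def get_batch(self, batch_index):
        start = batch_index * self.batch_size
        chunk_start = (start // self.chunk_size) * self.chunk_size
        if chunk_start != self.chunk_start:
            # Buffer miss: copy the enclosing chunk (the expensive transfer).
            stop = chunk_start + self.chunk_size
            self.buffer = np.array(self.data[chunk_start:stop])
            self.chunk_start = chunk_start
        offset = start - chunk_start
        return self.buffer[offset:offset + self.batch_size]


data = np.arange(100).reshape(50, 2)
batches = ChunkedBatches(data, chunk_size=20, batch_size=5)
```

Whether this kind of logic belongs inside the dataset interface or in a separate wrapper is exactly the open question raised above.
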
Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?

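One possible reading of this question, sketched under assumptions: the descriptor strings and the ``set_view`` name come from the paragraph above, while the ``ViewedDataset`` class, the ``STANDARD_VIEWS`` tuple, the ``get_example`` method and the decision to reject unknown descriptors are hypothetical.

```python
STANDARD_VIEWS = ('classification', 'regression', 'multi-label',
                  'density_estimation')


class ViewedDataset:
    """Hypothetical holder of (x, y) pairs with a switchable view type."""

    def __init__(self, x, y, view_type='classification'):
        self.x, self.y = x, y
        self.set_view(view_type)

    def set_view(self, view_type):
        # Reject anything outside the agreed list of descriptors.
        if view_type not in STANDARD_VIEWS:
            raise ValueError('unknown view type: %r' % (view_type,))
        self.view_type = view_type

    def get_example(self, index):
        # Density estimation ignores targets; the other views return pairs.
        if self.view_type == 'density_estimation':
            return self.x[index]
        return self.x[index], self.y[index]


ds = ViewedDataset(x=[[0.1, 0.2], [0.3, 0.4]], y=[0, 1])
pair = ds.get_example(1)
ds.set_view('density_estimation')
inputs_only = ds.get_example(1)
```

Note that the return type of ``get_example`` changes with the view, which callers have to be aware of.
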
There is then the question of how to approach the design of a Dataset class from
an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
Dataset that doesn't implement any methods except a few setters/getters. The reason
to have the methods listed that way is to have a common 'specification', but classes
that inherit from Dataset need not implement every single method (only the ones
that are relevant) and can obviously implement other methods as appropriate. The
reason to have a common specification (as abstract as it might be) is to, well,
have a common specification that would make our code clearer and cleaner.

An example of what I (Dumi) am thinking in terms of concrete API:

.. code-block:: python

    class Dataset:
        """Common specification; subclasses override only what is relevant."""

        def __init__(self):
            self.type = None
            self.in_memory = None
            self.inputs = None  # list of filepaths, or objects in memory, or...
            self.outputs = None

        def get_example(self, example_index):
            raise NotImplementedError()

        def get_next_example(self):
            raise NotImplementedError()

        def get_batch(self, batch_index):
            raise NotImplementedError()

        def get_next_batch(self):
            raise NotImplementedError()

        def get_slice(self, slice_object):
            raise NotImplementedError()

        def set_view(self, view_type):
            # Changing the view resets n_classes; call set_n_classes afterwards.
            self.view_type = view_type
            self.n_classes = None

        def set_n_classes(self, n_classes):
            self.n_classes = n_classes

        def set_batch_size(self, batch_size):
            self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I think we should
just have a train dataset, a valid one and a test one instead or (if it's in one
big file or infinite stream) just handle the split ourselves (via slicing, for
instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
specification does not preclude more fine-grained 'splitting' of the data.

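A small sketch of the "handle the split ourselves via slicing" option, assuming the data fits in memory as NumPy arrays. The ``get_slice`` name follows the API draft; the ``ArrayDataset`` class is hypothetical and is written standalone (it does not inherit from Dataset) so the snippet runs on its own. The 80/20 ratio is just an example.

```python
import numpy as np


class ArrayDataset:
    """Hypothetical in-memory dataset exposing only slicing."""

    def __init__(self, inputs, outputs):
        self.inputs = np.asarray(inputs)
        self.outputs = np.asarray(outputs)

    def __len__(self):
        return len(self.inputs)

    def get_slice(self, slice_object):
        # Return a new dataset restricted to the requested examples.
        return ArrayDataset(self.inputs[slice_object],
                            self.outputs[slice_object])


full = ArrayDataset(np.arange(20).reshape(10, 2), np.arange(10) % 2)
n_train = int(len(full) * 0.8)
train = full.get_slice(slice(0, n_train))       # first 80% of the examples
valid = full.get_slice(slice(n_train, None))    # remaining 20%
```

Because ``get_slice`` returns another dataset object, the train and valid parts can be passed around exactly like the full dataset.
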
A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            self.type = 'standard_xy'
            self.in_memory = True
            self.inputs = inputs  # load them or create
            self.outputs = outputs
            self.set_view('classification')
            self.set_n_classes(10)
            self.set_batch_size(20)
            self.n_batches = self._compute_n_batches()

        def get_batch(self, batch_index):
            x, y = self._fetch_batch(batch_index)
            if self.view_type == 'classification':
                return x, numpy.int32(y)
            elif self.view_type == 'density_estimation':
                return x
            else:
                raise NotImplementedError()

        def shared_data(self):
            # Both arrays are stored as floatX; the labels are cast back to
            # int32 symbolically. Assumes inputs/outputs hold loaded arrays.
            shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
            shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
            return shared_x, T.cast(shared_y, 'int32')

        def _compute_n_batches(self):
            pass  # left unimplemented in this draft

        def _fetch_batch(self, batch_index):
            pass  # left unimplemented in this draft

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that!

So we'd use it as:

.. code-block:: python

    train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
    valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

    x, y = train_mnist.get_batch(0)
    train_mnist.set_view('density_estimation')
    x = train_mnist.get_batch(0)

or

.. code-block:: python

    mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
    batches_train = range(int(mnist_data.n_batches * 0.8))
    batches_valid = range(int(mnist_data.n_batches * 0.8), mnist_data.n_batches)

    xt, yt = mnist_data.get_batch(batches_train[0])
    xv, yv = mnist_data.get_batch(batches_valid[0])

COMMENTS
~~~~~~~~

1120
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
180 JB asks: How about asking datasets to also provide a visualization mechanism
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
181 for showing / playing individual examples from the dataset, but also other
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
182 external objects that are similar to dataset examples (e.g. filters from a
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
183 weight matrix that filters images). This doesn't have to be complicated, and it
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
184 can be shared between datasets that exist in one modality (e.g. image datasets
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
185 can all use an image-rendering method).
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
186
1131
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
187 OD replies: Besides being able to display data without prior knowledge of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
188 kind of data inside a dataset, is there any reason to put this within the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
189 dataset class? If not, it seems to me it may be more appropriate to have a way
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
190 for the dataset to describe the kind of data it holds, and keep the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
191 visualization code separate from the dataset itself. It would make it easier
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
192 in particular to try different visualization systems, and description of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
193 data may turn out to be useful for other reasons (however, it also means we'd
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
194 need to come up with a good way to describe data, which could prove
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
195 difficult).
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
196
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
197 JB asks: What may be passed as arguments to the functions in Dataset, and what
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
198 can be expected in return? Are there side effects (e.g. on the state of the
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
199 Dataset) associated with any of the functions?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
200
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
201 JB asks: What properties are part of the Dataset API? What possible types can
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
202 they have? Are they expected to be read-only or writeable? What do they mean?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
203
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
204
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
205 JB asks: What is a view? Does set_view change the Dataset or return a new
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
206 Dataset with a certain view of the original (in which case call it get_view)?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
207 Does the view imply the types of the return-value of functions like
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
208 get_batch? What is the difference between the view and the subclasses of
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
209 Dataset in PyML?
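
One possible reading of the set_view / get_view distinction, as a minimal sketch. The class name ViewableDataset, the dict-of-lists storage, and the exact signatures are assumptions made for illustration; nothing here is an agreed API. In this reading the view decides which fields get_batch returns, set_view mutates the dataset in place, and get_view returns a new object that shares the same storage.

```python
class ViewableDataset:
    """Hypothetical dataset whose view selects which fields get_batch returns."""

    def __init__(self, fields, view=None):
        # fields: dict mapping a field name to a list of per-sample values.
        self._fields = fields
        self._view = tuple(view) if view is not None else tuple(fields)

    def set_view(self, names):
        # Mutating variant: changes this object, so every holder of it is affected.
        self._view = tuple(names)

    def get_view(self, names):
        # Non-mutating variant: a new dataset sharing the same underlying storage.
        return ViewableDataset(self._fields, names)

    def get_batch(self, indices):
        # The view determines which fields come back, and in what order.
        return tuple([self._fields[name][i] for i in indices]
                     for name in self._view)


ds = ViewableDataset({"x": [10, 20, 30], "y": [0, 1, 0]})
inputs_only = ds.get_view(["x"])   # ds itself is left untouched
print(inputs_only.get_batch([0, 2]))
print(ds.get_batch([0, 2]))
```

With the get_view variant, code that already holds a reference to the original dataset never sees its return types change, which is one argument for preferring it.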
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
210
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
211 JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
212 retrieval that we would like to expose via this interface?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
213
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
214 JB asks: How would you recommend using this sort of dataset in a boosting
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
215 algorithm where points need to be re-weighted?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
216
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
217
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
218 JB asks: Do we want to provide for the possibility of feedback that modifies the
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
219 dataset? For example, curriculum learning might be adaptive in this sense, or
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
220 if we wanted to provide a virtual world for an agent as a dataset then we need
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
221 to provide 'actions' to get the next batch. Could this be done in the current
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
222 API?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
223
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
224
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
225 Field names and attributes
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
226 ~~~~~~~~~~~~~~~~~~~~~~~~~~
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
227
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
228 OD: One important question is how to handle fields' names and characteristics.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
229 For instance, it can be useful to know that the 3rd input field represents a
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
230 number of fingers, and is a non-negative discrete field whose numeric value is
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
231 meaningful (compared to, say, an integer index that would correspond to an
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
232 animal's category). We mentioned metadata during the meeting, but we did not
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
233 get into its details: that may be the place to put this kind of thing.
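
To make the question concrete, here is a minimal sketch of what per-field metadata could look like. FieldKind, FieldSpec and numeric_fields are hypothetical names invented for this example; the meeting did not settle on any such structure.

```python
from dataclasses import dataclass
from enum import Enum


class FieldKind(Enum):
    """Hypothetical vocabulary for describing what a field's values mean."""
    CONTINUOUS = "continuous"
    DISCRETE_ORDERED = "discrete, numeric value is meaningful"
    CATEGORICAL = "discrete, numeric value is only a label"


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one field of a dataset."""
    name: str
    kind: FieldKind
    non_negative: bool = False


def numeric_fields(specs):
    # Fields whose numeric value can be used directly (e.g. rescaled),
    # as opposed to category indices that would need a different encoding.
    return [s.name for s in specs if s.kind is not FieldKind.CATEGORICAL]


specs = [
    FieldSpec("n_fingers", FieldKind.DISCRETE_ORDERED, non_negative=True),
    FieldSpec("animal_category", FieldKind.CATEGORICAL),
]
print(numeric_fields(specs))
```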
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
234
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
235
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
236 Freeing memory
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
237 ~~~~~~~~~~~~~~
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
238
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
239 OD: It is sometimes useful to be able to free memory used by previous
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
240 computations. A typical example is when you load in memory the original
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
241 dataset, then perform various processing steps, ending with a new dataset that
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
242 you also store in memory before feeding it to the learner. Unless you very
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
243 carefully design your code to avoid it, your original dataset will still
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
244 remain in memory (as well as maybe the results of some computations performed
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
245 along the way). So there may be a use for a `clear()` method that would be
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
246 called by the topmost dataset (the one doing the final memory caching), and
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
247 would be forwarded iteratively to previous datasets so as to get back all this
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
248 wasted memory space.
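
A minimal sketch of how such a forwarded `clear()` could work, assuming each dataset keeps a reference to its source and caches its own result. All class names (Dataset, InMemory, Scaled, MemoryCache) and the compute() / data() methods are hypothetical and only serve the illustration.

```python
class Dataset:
    """Hypothetical base class: a dataset optionally built on a source dataset."""

    def __init__(self, source=None):
        self.source = source
        self._cache = None

    def compute(self):
        # Subclasses produce their data here.
        raise NotImplementedError

    def data(self):
        # Compute on first use, then serve the cached result.
        if self._cache is None:
            self._cache = self.compute()
        return self._cache

    def clear(self):
        # Drop this dataset's cache, then forward the call to the source.
        self._cache = None
        if self.source is not None:
            self.source.clear()


class InMemory(Dataset):
    """Stands in for the original dataset loaded in memory."""
    def compute(self):
        return list(range(5))


class Scaled(Dataset):
    """An intermediate processing step."""
    def __init__(self, source, factor):
        super().__init__(source)
        self.factor = factor

    def compute(self):
        return [self.factor * x for x in self.source.data()]


class MemoryCache(Dataset):
    """Topmost dataset: keeps the final result and clears everything upstream."""
    def compute(self):
        result = list(self.source.data())
        self.source.clear()
        return result


raw = InMemory()
top = MemoryCache(Scaled(raw, factor=2))
print(top.data())
```

Note that in Python this only releases memory if nothing else still holds a reference to the cleared data.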
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
249
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
250 What is a mini-batch?
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
251 ~~~~~~~~~~~~~~~~~~~~~
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
252
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
253 This is a follow-up to the meeting's discussion about whether a mini-batch
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
254 returned by a dataset should be itself a dataset.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
255
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
256 OD: During the meeting I was voting in favor of a 'yes', mostly because it
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
257 made sense to me (a mini-batch is a subset of a dataset and thus should be a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
258 dataset), but now I tend towards 'no'. The main reason is it is not clear yet
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
259 what the dataset interface will be, so that it is hard to judge whether this
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
260 is a good idea (my main concern is how much additional work would be required by
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
261 the writer of a new dataset subclass). Anyway, maybe a first thing we could
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
262 think about is what we want a mini-batch to be. I think we can agree that we
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
263 would like to be able to do something like:
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
264
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
265 .. code-block:: python
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
266
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
267     for mb in dataset.mini_batches(size=10):
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
268         learner.update(mb.input, mb.target)
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
269
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
270 so that it should be ok for a mini-batch to be an object whose fields
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
271 (that should have the same name as those of the dataset) are numpy arrays.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
272 More generally, we would like to be able to iterate on samples in a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
273 mini-batch, or do random access on them, so a mini-batch should implement
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
274 __iter__ and __getitem__.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
275 Besides this, is there any other typical use-case of a mini-batch? In
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
276 particular, is there any reason to want an infinite mini-batch, or a very big
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
277 mini-batch that may not fit in memory? (in which case we may need to revise
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
278 our idea of what 'mini' means) Hopefully the answer to that last question is
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
279 no, as I think it would definitely keep things simpler, since we could simply
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
280 use numpy arrays (for numeric data) or lists (for anything else) to store
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
281 mini-batches' data. So I vote for 'no'.
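
As a rough sketch of the 'no' option, a mini-batch could be a plain in-memory object with named fields that supports iteration and random access. The MiniBatch class and the free function mini_batches below are hypothetical; plain lists are used so the example has no dependencies, but numpy arrays would support the same len(), slicing and indexing.

```python
class MiniBatch:
    """Hypothetical mini-batch: named fields holding equally long sequences."""

    def __init__(self, **fields):
        lengths = {len(values) for values in fields.values()}
        if len(lengths) > 1:
            raise ValueError("all fields must have the same number of samples")
        self._fields = fields
        self._size = lengths.pop() if lengths else 0
        # Expose each field as an attribute, e.g. mb.input and mb.target.
        for name, values in fields.items():
            setattr(self, name, values)

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        # Random access: one sample as a {field name: value} dict.
        if not -self._size <= index < self._size:
            raise IndexError(index)
        return {name: values[index] for name, values in self._fields.items()}

    def __iter__(self):
        # Iterate over samples in order.
        return (self[i] for i in range(self._size))


def mini_batches(fields, size):
    """Yield MiniBatch objects of at most `size` samples from a dict of fields."""
    if not fields:
        return
    n = len(next(iter(fields.values())))
    for start in range(0, n, size):
        yield MiniBatch(**{name: values[start:start + size]
                           for name, values in fields.items()})


data = {"input": [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], "target": [1, 1, 0]}
for mb in mini_batches(data, size=2):
    print(len(mb), mb.input, mb.target)
```

Nothing in this object requires the full dataset interface, which is the point of the 'no' vote: the writer of a new dataset subclass only has to produce field containers.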
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
282
1124
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
283 YB: I agree that a mini-batch should definitely be safely assumed
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
284 to fit in memory. That makes it at least in principle semantically
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
285 different from a dataset. But barring that restriction, it might
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
286 share some of the properties of a dataset.
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
287
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
288 A dataset is a learner
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
289 ~~~~~~~~~~~~~~~~~~~~~~
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
290
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
291 OD: (this is hopefully a clearer re-write of the original version from
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
292 r7e6e77d50eeb, which I was not happy with).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
293 There are typically three kinds of objects that spit out data:
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
294
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
295 1. Datasets that are loaded from disk or are able to generate data all by
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
296 themselves (i.e. without any other dataset as input)
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
297 2. Datasets that transform their input dataset in a way that only depends on
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
298 the input dataset (e.g. filtering samples or features, normalizing data, etc.)
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
299 3. Datasets that transform their input dataset in a way that is learned on a
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
300 potentially different dataset (e.g. PCA when you want to learn the projection
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
301 space on the training set in order to transform both the training and test
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
302 sets).
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
303
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
304 My impression currently is that we would use dataset subclasses to handle 1
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
305 and 2. However, 3 requires a learner framework, so you would need to have
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
306 something like a LearnerOutputDataset(trained_learner, dataset).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
307
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
308 Note however that 2 is a special case of 3 (where training does nothing), and
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
309 1 is a special case of 2 (where we do not care about being given an input
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
310 dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
311 by LearnerOutputDataset.
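
A minimal sketch of this idea, assuming a learner exposes train() and a per-sample compute_output() (both method names are made up here, as are the ShiftRescale and Square learners). Only the LearnerOutputDataset(trained_learner, dataset) call shape comes from the text above.

```python
class Learner:
    """Hypothetical learner interface: train on a dataset, then map samples."""

    def train(self, dataset):
        # Default: nothing to learn (this is what makes case 2 a special case of 3).
        return self

    def compute_output(self, sample):
        raise NotImplementedError


class ShiftRescale(Learner):
    """Case 3: learns a shift and a scale on one dataset, applies them to any."""

    def train(self, dataset):
        values = list(dataset)
        self.shift = min(values)
        self.scale = (max(values) - self.shift) or 1.0
        return self

    def compute_output(self, sample):
        return (sample - self.shift) / self.scale


class Square(Learner):
    """Case 2: a fixed transform, so the inherited train() does nothing."""

    def compute_output(self, sample):
        return sample * sample


class LearnerOutputDataset:
    """Dataset whose samples are a trained learner's outputs on another dataset."""

    def __init__(self, trained_learner, dataset):
        self.learner = trained_learner
        self.dataset = dataset

    def __iter__(self):
        return (self.learner.compute_output(s) for s in self.dataset)


train_set = [2.0, 4.0, 6.0]
test_set = [3.0, 8.0]
normalizer = ShiftRescale().train(train_set)
print(list(LearnerOutputDataset(normalizer, train_set)))
print(list(LearnerOutputDataset(normalizer, test_set)))
```

The same trained normalizer is reused on the test set, which is exactly what a dataset-only normalization (case 2) could not express.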
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
312
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
313 The main advantages I find in this approach (that I have been using at
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
314 Ubisoft) are:
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
315
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
316 - You only need to learn how to subclass the learner class. The only dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
317 class is LearnerOutputDataset, which you could just name Dataset.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
318 - You do not have different ways to achieve the same result (having to figure
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
319 out which one is most appropriate).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
320 - Upgrading code from 2 to 3 is more straightforward. Such a situation can
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
321 happen e.g. if you write some code that normalizes your input dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
322 (situation 2), then realize later you would like to be able to normalize new
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
323 datasets using the same parameters (e.g. same shift & rescaling), which
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
324 requires situation 3.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
325 - It can make your life easier when thinking about how to plug things together
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
326 (something that has not been discussed yet), because the interfaces of the
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
327 various components are less varied.
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
328
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
329 I am not saying that we should necessarily do it this way, but I think it is
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
330 worth at least keeping in mind this close relationship between simple
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
331 processing and learning, and thinking about the benefits / drawbacks
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
332 in keeping them separate in the class hierarchy.
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
333
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
334 RP: I actually like this idea of having the dataset implement the same
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
335 interface as the learner (or actually a subset of the interface).
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
336 I hope people decide to do this.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
337
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
338 Support for shared variables
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
339 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
340
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
341 RP asks: What is the status of having the dataset support copying data
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
342 on the GPU (by storing data in shared variables)? Have you decided to
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
343 include this feature or not? I think that the strongest selling point of
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
344 Theano is that it runs on GPU transparently, and I see this as a good
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
345 selling point for the library as well. Plus we intend to move more and
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
346 more towards running things on GPU. If the dataset object does not support
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
347 this feature, we will need to find hacks around it.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
348
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
349 OD: I have like zero experience with GPU so hopefully someone else can answer
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
350 this. But the way I see it, hopefully it could work by having some dataset
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
351 object that would take care of storing its input data into a shared variable.
1094
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
352 OD (continued): After thinking a bit more about it, I am not sure that would
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
353 work. I definitely need to look at some code doing it to get a better
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
354 understanding of it, but my feeling is that you need your learner to be
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
355 written in a specific way to achieve this, in which case it may be up to the
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
356 learner to take its input data and store it into a shared variable.
1104
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
357
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
358 RP comment: Yes, the dataset object alone cannot handle this; the issue is somewhere
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
359 between the dataset and the learner. Or in other words, every time you change
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
360 the data you need to recompile your theano function. So the learner can not
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
361 only get data from the dataset, it needs to get a shared variable. The learner
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
362 should also be aware when the dataset is changed, to recompile its internal
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
363 functions. I'm not sure which is the best way to do this. My personal feeling
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
364 is that the dataset should be part of the learner. The learner should provide
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
365 a function use_dataset (or replace_dataset). When this function is called,
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
366 all the theano functions in the learner get recompiled based on shared
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
367 variables that the dataset object provides. It sort of fits very well in the
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
368 framework that I have in mind, which was scattered around in the learner.txt
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
369 and some of my previous emails. I think it shares a lot with James' concepts,
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
370 since it follows quite closely the concepts behind Theano.
1105
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
371
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
372 OD asks: Ok, so why would the dataset have to be responsible for providing a
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
373 shared variable? Why wouldn't the learner just create this shared variable
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
374 internally and copy into it the data provided by the dataset?
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
375
1109
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
376 RP replies: Sure, the learner could take care of all this. Note though that the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
377 learner should take care to divide the dataset into chunks that fit in the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
378 GPU memory (in case of a large dataset) and then take care of updating the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
379 shared variables according to the current chunk. Personally I feel like all
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
380 this data division, management and so on should be done by the dataset.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
381 It feels more natural that way. For example assume you have a dataset that
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
382 is composed of a time series and some static data (carre-tech heart beat
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
383 data is a good example). The static data is small enough so that you could
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
384 always store on the GPU, and you would only need to split the time series.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
385 For the learner to do this (since it gets the same interface from any
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
386 dataset object) would be like adding an "if <this case> then", while for the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
387 dataset it is just a different class. But I'm happy to have all this GPU stuff
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
388 sent to the learner as well if everybody else believes that is better.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
389
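A rough sketch of the division RP describes, under loud assumptions:
``ChunkedDataset``, ``resident``, ``chunks`` and the element budget are
hypothetical names invented for this note, plain Python lists stand in for GPU
memory, and nothing here is Theano or pylearn API. The small static part stays
resident, and only the time series is split.

```python
class ChunkedDataset(object):
    """Hypothetical dataset that hides chunking from the learner."""

    def __init__(self, static, series, max_elements):
        # `max_elements` stands in for the GPU memory budget (assumption).
        if len(static) >= max_elements:
            raise ValueError("static data alone exceeds the budget")
        self.resident = list(static)   # always kept "on the device"
        self._series = list(series)
        # Room left for time-series elements once the static part is stored.
        self._chunk_len = max_elements - len(self.resident)

    def chunks(self):
        """Yield consecutive pieces of the time series that fit the budget."""
        for start in range(0, len(self._series), self._chunk_len):
            yield self._series[start:start + self._chunk_len]


d = ChunkedDataset(static=[0.5, 1.5], series=range(10), max_elements=6)
pieces = list(d.chunks())
```

With this layout the learner only iterates over ``d.chunks()``. Deciding what
stays resident and what gets split is a property of the dataset class, which
is the argument made above.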
1110
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
390 FB comment: I don't understand why you would need to recompile the theano function.
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
391 There are 2 cases. In the first, the data is in a shared variable. You can directly change the data
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
392 in the shared variable without recompiling the theano fct. The second case is when
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
393 the dataset is in an ordinary theano variable. In that case, the first step in the
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
394 theano fct will be to transfer the dataset to the gpu before computation. If the data
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
395 changes at each call, that will be as efficient as changing the data manually every time
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
396 in the shared variable.
1116
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
397
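To illustrate FB's first case without depending on Theano, here is a sketch in
which a small container plays the role of a shared variable. ``Shared`` and
``build_sum`` are stand-ins written for this note (assumptions), not Theano's
implementation; Theano's own shared variables offer a ``set_value`` method with
a similar role. The point is only that a function built once keeps reading from
the container, so replacing the container's contents does not require
rebuilding the function.

```python
class Shared(object):
    """Stand-in for a shared variable: a mutable box holding data."""

    def __init__(self, value):
        self.value = value

    def set_value(self, value):
        # Replace the stored data in place; holders of the box see the change.
        self.value = value


def build_sum(shared):
    """Pretend "compilation": build a function bound to `shared` once."""
    def f():
        # Reads whatever the container currently holds.
        return sum(shared.value)
    return f


data = Shared([1, 2, 3])
f = build_sum(data)        # built a single time
first = f()
data.set_value([10, 20])   # new chunk of data, same function object
second = f()
```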
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
398 AB: I have an idea about this which kind of fits in the "building a
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
399 theano op" thing that we talked about at the last meeting.
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
400
1117
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
401 We can just build a theano Op that wraps dataset objects and takes
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
402 care of the details of transferring data to the GPU or otherwise.
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
403
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
404 I have a prototype interface/implementation in the shared_dataset.py
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
405 file in this directory.
1127
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue
Olivier Delalleau <delallea@iro>
parents: 1124
diff changeset
406
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue
Olivier Delalleau <delallea@iro>
parents: 1124
diff changeset
407 OD: I like AB's approach.
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue
Olivier Delalleau <delallea@iro>
parents: 1124
diff changeset
408
1337
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
409
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
410 Data API proposal by Olivier D
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
411 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
412
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
413 A single sample containing multiple fields (e.g. an input and a target part)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
414 is an object s that you can manipulate as follows:
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
415
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
416 .. code-block:: python
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
417
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
418 # Obtain actual data stored within `s` (e.g. a numpy vector). There is no
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
419 # guarantee that modifying the resulting data object will actually update
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
420 # the data stored in `s`.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
421 data = s()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
422 # Create a sample that sees a field of `s`.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
423 input_part = s.input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
424 # Obtain actual input data (e.g. as a numpy vector).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
425 input_data = input_part()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
426 # Create a sample that sees the i-th element of the data stored in `s`.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
427 ith = s[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
428 # This should not fail.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
429 assert ith() == s()[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
430 # You could also select a range.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
431 i_to_j = s[i:j]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
432 assert i_to_j() == s()[i:j]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
433 # And actually do pretty much anything you want with __getitem__, as long
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
434 # as the underlying data stored in the sample supports it (for instance,
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
435 # here it should be at least a 3D tensor).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
436 fancy_selection = s[i, :, j:k]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
437 assert fancy_selection() == s()[i, :, j:k]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
438 # Write some value (e.g. a numpy vector) into the sample. May raise an
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
439 # exception if the sample is in read-only mode.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
440 s._write(val)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
441 # Shortcut to write data into a field (same as `s.input._write(val)`).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
442 s.input = val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
443 # Basic mathematical operators.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
444 s *= val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
445 s += val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
446 s -= val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
447 s /= val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
448 # Replace a field. Note that this is different from `s.input = val`
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
449 # because here `new_input` is a sample, not a numeric value: the current
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
450 # `s.input` will not be written to, instead it makes `s.input` point
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
451 # towards a different sample. This may lead to confusion, so a different
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
452 # syntax may be better (e.g. s._set_field('input', new_input)).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
453 s.input = new_input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
454 # The equality of two samples is defined by the equality of their
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
455 # underlying data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
456 def __eq__(self, other):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
457 return self() == other()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
458 # Iterate on fields (open question: should they be ordered?).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
459 fields = dict([(name, sample) for name, sample in s._iter_fields()])
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
460 assert fields['input'] == s.input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
461 # Iterating on a sample yields samples that see consecutive elements.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
462 for sample, value in izip(s, s()):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
463 assert sample() == value
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
464 # The length of a sample is the same as that of its underlying data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
465 assert len(s) == len(s())
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
466 # The shape of a sample is the same as that of its underlying data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
467 # Note that it only makes sense for tensor-like data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
468 assert s._shape() == s().shape
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
469 # The size of a sample is the product of its shape elements.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
470 assert s._size() == reduce(operator.__mul__, s._shape())
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
471
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
472 All sample methods should start with '_', to differentiate them from the
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
473 sample's fields. This is a bit awkward, but I like the `sample.field` syntax
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
474 compared to something like "sample.get_field('field')", which makes code less
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
475 readable, especially when combining with sub_fields, e.g. `sample.input.x1`
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
476 vs. sample.get_field('input').get_field('x1').
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
477
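One possible way to obtain the ``sample.field`` syntax while keeping methods
under a leading ``_`` is sketched below. It is only an illustration under
assumptions: the data is a plain Python list, each field is a named slice of
that list, and the constructor signature and the sorted field order are
choices made for this example, not part of the proposal.

```python
class Sample(object):
    """Hypothetical sample wrapping list-like data with named fields.

    `fields` maps a field name to the slice of the data it covers.
    Methods start with '_' so that they cannot clash with field names.
    """

    def __init__(self, data, fields=None):
        # Write through __dict__ because __setattr__ is reserved for fields.
        self.__dict__["_data"] = data
        self.__dict__["_fields"] = dict(fields or {})

    def __call__(self):
        # Return the underlying data.
        return self._data

    def __getattr__(self, name):
        # Reached only when regular attribute lookup fails: a field access.
        if name not in self.__dict__["_fields"]:
            raise AttributeError(name)
        return Sample(self._data[self._fields[name]])

    def __setattr__(self, name, val):
        # `s.input = val` writes `val` into the part of the data that the
        # field covers.
        self._data[self._fields[name]] = val

    def __getitem__(self, index):
        return Sample(self._data[index])

    def __eq__(self, other):
        return self() == other()

    def __len__(self):
        return len(self._data)

    def _iter_fields(self):
        # Sorted by name: an arbitrary answer to the ordering question.
        for name in sorted(self._fields):
            yield name, getattr(self, name)


s = Sample([1.0, 2.0, 3.0, 0.0],
           fields={"input": slice(0, 3), "target": slice(3, 4)})
input_data = s.input()
s.target = [1.0]
```

Because ``__getattr__`` is only consulted after normal lookup fails, a method
named ``_write`` or ``_shape`` can never be shadowed by a field unless the
field name itself starts with an underscore, which the convention rules out.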
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
478 The extension from sample to dataset is actually to use the same class, but
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
479 with the convention that the first "dimension" in the data seen by the dataset
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
480 corresponds to the samples' indices in the dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
481
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
482 .. code-block:: python
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
483
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
484 # Return data stored in dataset `d` (e.g. a numpy matrix).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
485 data = d()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
486 # Return the i-th sample in the dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
487 s = d[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
488 # Data should match!
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
489 assert data[i] == s()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
490 # Return a subset of the dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
491 sub_data = d[i:j]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
492 # Advanced indexing.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
493 sub_data = d[some_list_of_indices]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
494 # Dataset that sees the input part only.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
495 input_part = d.input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
496 # Dataset such that its i-th element is data[i][something] (see the sample
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
497 # examples for what `something` may be).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
498 some_sub_data = d[:, something]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
499 # The following should not fail.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
500 assert d[i, something] == d[i][something] # == some_sub_data[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
501 # You can also write into a dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
502 d._write(val)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
503 d.input = val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
504 # Center dataset in-place (requires `d` not to be read-only).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
505 d -= numpy.mean(d())
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
506 # The length of a dataset is its number of samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
507 n_samples = len(d)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
508 # The width of a dataset (if it exists) is the length of its samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
509 assert d._shape()[1] == len(d[0]) # == d._width() (shortcut)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
510 # Iterating on a dataset yields individual samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
511 for i, sample in enumerate(d):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
512 assert d[i] == sample
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
513 # It is allowed for a dataset to hold heterogeneous data. For instance
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
514 # you could have
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
515 len(d.data1) != len(d.data2)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
516 # A sample in the dataset is not required to inherit all the dataset's
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
517 # fields, for instance in the case above you could decide that the dataset
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
518 # sees the same data as its first sub-dataset, i.e.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
519 d[i] == d.data1[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
520
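The indexing rules listed above can be checked on a toy implementation. The
class below is a sketch under assumptions, not the proposed implementation:
data is a list of rows, only one- and two-part indices are handled, and fields
and writing are left out. As in the proposal, the object returned by ``d[i]``
is an instance of the same class.

```python
class Dataset(object):
    """Hypothetical dataset: the first index selects samples (rows)."""

    def __init__(self, rows):
        self._rows = rows

    def __call__(self):
        return self._rows

    def __getitem__(self, index):
        if isinstance(index, tuple):
            first, something = index   # two-part index only
            if isinstance(first, int):
                # d[i, something] is defined as d[i][something].
                return self[first][something]
            # Slice or list on the first axis: apply `something` to each row.
            return Dataset([row[something] for row in self[first]()])
        if isinstance(index, list):
            # Advanced indexing with a list of sample indices.
            return Dataset([self._rows[i] for i in index])
        return Dataset(self._rows[index])

    def __eq__(self, other):
        return self() == other()

    def __len__(self):
        return len(self._rows)

    def __iter__(self):
        return (self[i] for i in range(len(self)))


d = Dataset([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
column = d[:, 1]
```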
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
521 There remain some fuzzy points. For instance, are fields allowed to overlap?
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
522 (e.g. so that one could write both s.pos_3d to get the 3d vector coordinate of
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
523 sample s, and s.x to get the x coordinate without being forced to go through
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
524 s.pos_3d.x). What are the fields of s[i:j] if the (i, j) range does not
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
525 exactly match a subset of fields? How do we handle metadata? (e.g. if we want
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
526 to describe the dataset to say it contains 28x28 image data, so that an
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
527 algorithm for filter visualization can automatically deal with it.)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
528
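One possible answer to the overlapping-fields question, as a minimal sketch only: fields could be named views into a single underlying vector, so that two names may cover the same entries. The `Sample` class, its constructor arguments and the `label` field are hypothetical and not part of the proposal.

```python
import numpy


class Sample(object):
    """Hypothetical sample whose fields are named views of one data vector."""

    def __init__(self, data, fields):
        self._data = numpy.asarray(data)
        # Maps a field name to an integer index or a slice into the vector.
        # Nothing prevents two fields from covering the same entries.
        self._fields = fields

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        fields = self.__dict__.get('_fields', {})
        if name not in fields:
            raise AttributeError(name)
        return self._data[fields[name]]


s = Sample([1.0, 2.0, 3.0, 9.0],
           {'pos_3d': slice(0, 3), 'x': 0, 'label': 3})
position = s.pos_3d  # the full 3d coordinate vector
x_coord = s.x        # the x coordinate, without going through s.pos_3d
```

With this layout, the fields of a slice that cuts through `pos_3d` would still be undefined, so the second question in the paragraph above remains open.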
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
529 Now, on to some use cases.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
530
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
531 .. code-block:: python
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
532
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
533 # Mini-batches.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
534 mb_dataset = d._minibatches(batch_size=5)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
535 # The mini-batch dataset views samples that are mini-batches.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
536 assert mb_dataset[0]() == d[0:5]() # As long as len(d) >= 5.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
537
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
538 # Shuffling samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
539 random_indices = numpy.arange(len(d))
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
540 numpy.random.shuffle(random_indices)  # Shuffles in place and returns None.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
541 shuffled_dataset = d[random_indices]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
542
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
543 # Typical linear regression with stochastic gradient descent.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
544 n_inputs = d.input._width()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
545 n_targets = d.target._width()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
546 weights = numpy.zeros((n_inputs, n_targets))
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
547 bias = numpy.zeros(n_targets)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
548 mb_dataset = d._minibatches(batch_size=10)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
549 # Note: it is important to get the number of inputs / targets
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
550 # before converting to minibatches, because
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
551 # mb_dataset.input._width() == 10
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
552 # since this is the length of a minibatch matrix. However you
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
553 # could still do the following, which is less readable:
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
554 # n_inputs = mb_dataset.input._shape()[2]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
555 # You could also wait until you see the first sample to create
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
556 # the parameters (this would actually be a better way to do it, since
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
557 # it avoids calling the _width method).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
558 for input, target in izip(mb_dataset.input, mb_dataset.target):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
559 cost = (numpy.dot(input(), weights) + bias - target())**2
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
560 # Update weights and bias depending on cost....
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
561
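For reference, the regression use case can be run end to end with plain numpy arrays standing in for the proposed dataset objects. This is only a sketch: the `minibatches` helper, the learning rate, the number of epochs and the synthetic data are assumptions made for illustration, and the update rule is ordinary mini-batch gradient descent on the mean squared error.

```python
import numpy


def minibatches(inputs, targets, batch_size):
    """Yield successive (input, target) mini-batches (hypothetical helper)."""
    for start in range(0, len(inputs), batch_size):
        stop = start + batch_size
        yield inputs[start:stop], targets[start:stop]


def sgd_linear_regression(inputs, targets, batch_size=10, lr=0.1,
                          n_epochs=200):
    # Sizes are read from the data before it is split into mini-batches.
    n_inputs = inputs.shape[1]
    n_targets = targets.shape[1]
    weights = numpy.zeros((n_inputs, n_targets))
    bias = numpy.zeros(n_targets)
    for _ in range(n_epochs):
        for x, y in minibatches(inputs, targets, batch_size):
            error = numpy.dot(x, weights) + bias - y
            # Gradient of the squared error, averaged over the mini-batch.
            weights -= lr * 2 * numpy.dot(x.T, error) / len(x)
            bias -= lr * 2 * error.mean(axis=0)
    return weights, bias


# Noise-free synthetic data generated from known parameters.
rng = numpy.random.RandomState(0)
inputs = rng.uniform(-1, 1, size=(100, 3))
true_weights = numpy.array([[1.0], [-2.0], [0.5]])
targets = numpy.dot(inputs, true_weights) + 0.3
weights, bias = sgd_linear_regression(inputs, targets)
```

Because the data is noise-free, the learned `weights` and `bias` should end up close to the parameters used to generate it.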
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
562 A few more points:
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
563 - Infinite datasets could be used (would just need to define a convention
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
564 on what __len__ should do).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
565 - It is also ok to have datasets that do not support random access (so the
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
566 only way to access samples is through iteration).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
567 - Ideally, data should be deterministic (i.e. __call__() should always
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
568 return the same thing). It would probably be up to the user to be super
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
569 careful if they decide to use a non-deterministic dataset.
1338
91637815b7ca Added a comment on the dataset vs. task issue
Olivier Delalleau <delallea@iro>
parents: 1337
diff changeset
570 - About the "task vs. dataset" distinction. This could be achieved by
91637815b7ca Added a comment on the dataset vs. task issue
Olivier Delalleau <delallea@iro>
parents: 1337
diff changeset
571 associating to a task the names of the fields it requires (e.g. "input"
91637815b7ca Added a comment on the dataset vs. task issue
Olivier Delalleau <delallea@iro>
parents: 1337
diff changeset
572 and "target" for the regression task), and if the dataset does not
91637815b7ca Added a comment on the dataset vs. task issue
Olivier Delalleau <delallea@iro>
parents: 1337
diff changeset
573 already define these fields, using a dataset wrapper that does so
91637815b7ca Added a comment on the dataset vs. task issue
Olivier Delalleau <delallea@iro>
parents: 1337
diff changeset
574 (saying for instance that "input" is the concatenation of "x1" and "x2",
91637815b7ca Added a comment on the dataset vs. task issue
Olivier Delalleau <delallea@iro>
parents: 1337
diff changeset
575 and "target" is "y", for a dataset whose fields are x1, x2 and y).
1337
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents: 1190
diff changeset
576
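The wrapper idea in the last point could look roughly like the sketch below. It assumes that each field is stored as a 2D numpy array with one row per sample; the `FieldMappingDataset` name and its constructor are hypothetical, chosen only to illustrate how a task's field names could be mapped onto a dataset's own fields.

```python
import numpy


class FieldMappingDataset(object):
    """Hypothetical wrapper exposing task fields built from base fields."""

    def __init__(self, fields, mapping):
        # fields: base field name -> array of shape (n_samples, width).
        # mapping: task field name -> list of base field names, which are
        # concatenated column-wise to form the task field.
        self._fields = fields
        self._mapping = mapping

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        mapping = self.__dict__.get('_mapping', {})
        if name not in mapping:
            raise AttributeError(name)
        parts = [self._fields[base] for base in mapping[name]]
        return numpy.concatenate(parts, axis=1)


base_fields = {
    'x1': numpy.array([[1.0], [2.0]]),
    'x2': numpy.array([[3.0, 4.0], [5.0, 6.0]]),
    'y': numpy.array([[0.0], [1.0]]),
}
# The regression task requires "input" and "target".
task_view = FieldMappingDataset(
    base_fields, {'input': ['x1', 'x2'], 'target': ['y']})
```

Here `task_view.input` is the concatenation of `x1` and `x2`, and `task_view.target` is `y`, so a learner written against the task's field names never needs to know the underlying layout.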
1339
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
577
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
578 RP comments:
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
579 - I like this approach. I think having overlapping fields might be useful.
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
580 I would add that I was thinking of a way to look at one's results. It is
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
581 something I've been faced with, say you run 500 jobs and then you want to
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
582 understand those jobs' results. Looking just at the best performing seems a waste, and
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
583 there is a lot more information you can extract from your results if you are
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
584 able to generate certain plots or statistics. To do this you would need to
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
585 get the data in ipython (or something quite similar) where you have available
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
586 the needed functions to plot different things, generate different tables. The
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
587 point that I was trying to make is that you can get those results in
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
588 something that has this very API that Olivier described. This way both
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
589 your input data and your results will be in the same form and whatever
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
590 visualization functions you have for your results you can use on your data as
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
591 well. For this you would need a bit more flexibility, in the sense that if
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
592 you have some data d, you should be able to put constraints on it, like
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
593 d.some_field == 5 means all entries in d that have some_field == 5, or
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
594 d.some_field > 5. You would also not use psql anymore but this console,
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
595 which would collect the results for you from sql, and give them to you as
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
596 a data object.
158493f8dff9 comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1338
diff changeset
597
1340
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
598 OD replies: Actually this should be doable with (almost) what I wrote above,
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
599 due to the way numpy redefines ==, >, etc. (which btw should break some of my
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
600 assertions above, since I had forgotten about this). If you replace e.g. my
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
601 implementation of __eq__ above by the following:
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
602
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
603 .. code-block:: python
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
604
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
605 def __eq__(self, other):
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
606 return other == self()
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
607
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
608 Here, `self` is a dataset that represents some numpy vector data. Then whether
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
609 `other` is another dataset or a numpy vector or some scalar, this will return
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
610 a numpy boolean vector (the result of the comparison made by numpy). We may
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
611 support boolean vectors in advanced indexing, so you could do
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
612 d[d.some_field == 5]
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
613 and obtain the subset of `d` whose samples have `some_field` set to 5.
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
614 The same could be done with __lt__, __le__, etc.
04b988fb00b6 Reply to Razvan
Olivier Delalleau <delallea@iro>
parents: 1339
diff changeset
615
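The selection mechanism described in this reply can be tried with a small stand-in class. This is a sketch under assumptions, not the proposed implementation: `VectorDataset` is a hypothetical name, it wraps a single numpy vector, and it unwraps `other` explicitly rather than relying on `other == self()`.

```python
import operator

import numpy


class VectorDataset(object):
    """Hypothetical dataset wrapping a single numpy vector."""

    def __init__(self, data):
        self._data = numpy.asarray(data)

    def __call__(self):
        # Return the underlying numpy data.
        return self._data

    def __len__(self):
        return len(self._data)

    def _compare(self, other, op):
        # Unwrap other datasets so numpy performs the comparison.
        if isinstance(other, VectorDataset):
            other = other()
        return op(self(), other)

    # Comparisons return numpy boolean vectors, not a single bool.
    # Note: defining __eq__ makes instances unhashable in Python 3.
    def __eq__(self, other):
        return self._compare(other, operator.eq)

    def __lt__(self, other):
        return self._compare(other, operator.lt)

    def __gt__(self, other):
        return self._compare(other, operator.gt)

    def __getitem__(self, index):
        # numpy indexing accepts boolean masks, index lists and slices.
        return VectorDataset(self._data[index])


some_field = VectorDataset([3, 5, 7, 5])
other_field = VectorDataset([10, 20, 30, 40])
mask = some_field == 5
selected = other_field[mask]
above_four = some_field[some_field > 4]
```

As the reply notes, a comparison that returns a boolean vector can no longer be used directly in an `assert`; such checks would need `numpy.all(...)` or similar.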