annotate doc/v2_planning/dataset.txt @ 1199:98954d8cb92d

v2planning - modifs to plugin_JB
author James Bergstra <bergstrj@iro.umontreal.ca>
date Mon, 20 Sep 2010 02:56:11 -0400
parents 9ff2242a817b
children 7dfc3d3052ea
Discussion of Function Specification for Dataset Types
======================================================

Some talking points from the September 2 meeting:

 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
   needs to be flexible enough to accommodate different (sub)tasks and views of
   the same underlying data.
 * Datasets as probability distributions from which one can sample.
    * That's not something I would consider to be a dataset-related problem to
      tackle now: a probability distribution in Pylearn would probably be a
      different kind of beast, and it should be easy enough to have a
      DatasetToDistribution class for instance, that would take care of viewing a
      dataset as a probability distribution. -- OD
 * Our specification should allow transparent handling of infinite datasets (or
   simply datasets which cannot fit in memory)
 * GPU/buffering issues.

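The DatasetToDistribution idea from OD's comment could be sketched roughly as follows. Only the class name comes from the discussion; the constructor arguments, the ``sample`` method, and the choice of uniform sampling with replacement (i.e. the empirical distribution of a finite dataset) are assumptions made for illustration.

```python
import random


class DatasetToDistribution(object):
    """Hypothetical wrapper: view a finite dataset as its empirical distribution."""

    def __init__(self, examples, seed=None):
        # `examples` is assumed to be a finite sequence supporting len()
        # and integer indexing.
        self.examples = examples
        self.rng = random.Random(seed)

    def sample(self, n=1):
        # Draw n examples uniformly at random, with replacement.
        size = len(self.examples)
        return [self.examples[self.rng.randrange(size)] for _ in range(n)]


dist = DatasetToDistribution([(0.1, 'a'), (0.2, 'b'), (0.3, 'c')], seed=0)
draws = dist.sample(5)
```

Keeping the wrapper outside the dataset class matches OD's point that a distribution is a different kind of object from a dataset.
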
Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides
- mlpy: very primitive notions of data (simple 2D matrices)
- PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
  SequentialDataSet, ReinforcementDataSet, ... Each class is quite
  constrained and may have a different interface.
- MDP: Seems to have restrictions on the type of data being passed around, as
  well as its dimensionality ("Input array data is typically assumed to be
  two-dimensional and ordered such that observations of the same variable are
  stored on rows and different variables are stored on columns.")
- Orange: Data matrices, with names and types associated to each column.
  Basically there seems to be only one base dataset class that contains the
  data. Data points are lists (of values corresponding to each column).
- APGL: Hard to say how they deal with data from the documentation alone.
- Monte: Data is simply numpy arrays.
- scikits.learn: Dataset is a simple container with e.g. dataset.data being
  a 2D numpy array of input features, and dataset.target the target vector.
- Shogun: Vade Retro C++! (may be worth looking into their feature concept
  though).
- Any more worth looking at?

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite
- task/views of the data for different problems
- indexing & slicing
- pairs, triples, etc. of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)

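To make the "streams, possibly infinite" requirement concrete, here is a minimal sketch that batches any iterable lazily, so an endless source never has to fit in memory. The names ``iter_batches`` and ``infinite_stream`` are hypothetical and not part of any proposed API.

```python
import itertools


def iter_batches(stream, batch_size):
    """Yield lists of at most batch_size consecutive examples from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # a finite stream is exhausted
        yield batch


def infinite_stream():
    # Hypothetical endless source of (input, target) pairs.
    for i in itertools.count():
        yield (i, i % 2)


# Only the batches actually requested are ever materialized.
first_two = list(itertools.islice(iter_batches(infinite_stream(), 3), 2))
```

The same function works unchanged on a finite sequence, which is one argument for specifying datasets in terms of iteration rather than random access.
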
Another question to consider is the following: how tightly should it integrate
with Theano? Do we want to be able to store data as shared variables or just
have an option for that? Theano + GPU constrains the things that we can do (in
terms of sizes, buffering, etc.): these are things we need to think about, but
it's not clear whether we should aim for building them into the interface.

Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?

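One possible way to make the set of standard descriptors explicit is a table that maps each view name to a function of the stored inputs and targets. In the sketch below only the view names and ``set_view`` come from the text; the ``ViewedData`` class, the ``VIEWS`` table, and what each view returns are illustrative assumptions ('multi-label' is left out for brevity).

```python
# Hypothetical table of standard views: each maps (inputs, targets) to what a
# learner working on that task would receive.
VIEWS = {
    'classification': lambda x, y: (x, [int(t) for t in y]),
    'regression': lambda x, y: (x, [float(t) for t in y]),
    'density_estimation': lambda x, y: x,
}


class ViewedData(object):
    def __init__(self, inputs, targets, view_type='classification'):
        self.inputs = inputs
        self.targets = targets
        self.set_view(view_type)

    def set_view(self, view_type):
        # Reject unknown descriptors early instead of failing at access time.
        if view_type not in VIEWS:
            raise ValueError('unknown view type: %r' % (view_type,))
        self.view_type = view_type

    def get(self):
        return VIEWS[self.view_type](self.inputs, self.targets)


data = ViewedData([[0.0, 1.0], [1.0, 0.0]], [1.0, 0.0])
xy = data.get()
data.set_view('density_estimation')
x_only = data.get()
```

A table like this keeps the list of supported views in one place, at the cost of assuming that every view can be computed from the same stored inputs and targets.
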
There is then the question of how to approach the design of a Dataset class from
an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
Dataset that doesn't implement any methods except a few setters/getters. The reason
to have the methods listed that way is to have a common 'specification', but classes
that inherit from Dataset need not implement every single method (only the ones
that are relevant) and can obviously implement other methods as appropriate. The
reason to have a common specification (as abstract as it might be) is to, well,
have a common specification that would make our code clearer and cleaner.

An example of what I (Dumi) am thinking in terms of concrete API:

.. code-block:: python

    class Dataset:
        def __init__(self):
            self.type = None
            self.in_memory = None
            self.inputs = None   # list of filepaths, or objects in memory, or...
            self.outputs = None
            # Defined here so that they exist before any setter is called.
            self.view_type = None
            self.n_classes = None
            self.batch_size = None

        def get_example(self, example_index):
            raise NotImplementedError()

        def get_next_example(self):
            raise NotImplementedError()

        def get_batch(self, batch_index):
            raise NotImplementedError()

        def get_next_batch(self):
            raise NotImplementedError()

        def get_slice(self, slice_object):
            raise NotImplementedError()

        def set_view(self, view_type):
            self.view_type = view_type
            # Changing the view resets the number of classes.
            self.n_classes = None

        def set_n_classes(self, n_classes):
            self.n_classes = n_classes

        def set_batch_size(self, batch_size):
            self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I think we should
just have a train dataset, a valid one and a test one instead, or (if it's in one
big file or infinite stream) just handle the split ourselves (via slicing, for
instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
specification does not preclude more fine-grained 'splitting' of the data.

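Handling the split ourselves via slicing might look like the sketch below. ``get_slice`` and its ``slice_object`` argument come from the API above; the ``ListDataset`` class, its list-backed storage, and the 80/20 proportions are assumptions chosen for illustration.

```python
class ListDataset(object):
    """Hypothetical in-memory dataset backed by two parallel lists."""

    def __init__(self, inputs, outputs):
        assert len(inputs) == len(outputs)
        self.inputs = inputs
        self.outputs = outputs

    def __len__(self):
        return len(self.inputs)

    def get_slice(self, slice_object):
        # Return a new dataset restricted to the requested examples.
        return ListDataset(self.inputs[slice_object], self.outputs[slice_object])


full = ListDataset(list(range(10)), [i % 2 for i in range(10)])
n_train = int(len(full) * 0.8)
train = full.get_slice(slice(0, n_train))
valid = full.get_slice(slice(n_train, None))
```

Because ``get_slice`` returns another dataset, the train and valid parts expose the same interface as the original, and no train/valid/test notion is needed in the class itself.
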
A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            self.type = 'standard_xy'
            self.in_memory = True
            self.inputs = inputs    # load them or create
            self.outputs = outputs
            self.set_view('classification')
            self.set_n_classes(10)
            self.set_batch_size(20)
            self.n_batches = self._compute_n_batches()

        def get_batch(self, batch_index):
            x, y = self._fetch_batch(batch_index)
            if self.view_type == 'classification':
                return x, numpy.int32(y)
            elif self.view_type == 'density_estimation':
                return x
            else:
                raise NotImplementedError()

        def shared_data(self):
            # Assumes self.inputs / self.outputs hold loaded arrays, not file paths.
            shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
            shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
            return shared_x, T.cast(shared_y, 'int32')

        def _compute_n_batches(self):
            pass    # stub, left unimplemented in this draft

        def _fetch_batch(self, batch_index):
            pass    # stub, left unimplemented in this draft

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that!

So we'd use it as:

.. code-block:: python

    train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
    valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

    x, y = train_mnist.get_batch(0)
    train_mnist.set_view('density_estimation')
    x = train_mnist.get_batch(0)

or

.. code-block:: python

    mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
    batches_train = range(int(mnist_data.n_batches * 0.8))
    batches_valid = range(int(mnist_data.n_batches * 0.8), mnist_data.n_batches)

    xt, yt = mnist_data.get_batch(batches_train[0])
    xv, yv = mnist_data.get_batch(batches_valid[0])

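The draft leaves ``_compute_n_batches`` and ``_fetch_batch`` as stubs. A minimal sketch of how they might be filled in for data already in memory is given below. The ``InMemoryBatches`` class is a hypothetical stand-in for MNIST that uses plain lists instead of numpy arrays, and counting a final partial batch is an assumption, not something the draft specifies.

```python
class InMemoryBatches(object):
    """Hypothetical stand-in for the MNIST class, holding data already in memory."""

    def __init__(self, inputs, outputs, batch_size=20):
        self.inputs = inputs
        self.outputs = outputs
        self.batch_size = batch_size
        self.n_batches = self._compute_n_batches()

    def _compute_n_batches(self):
        # Ceiling division, so a final partial batch is counted too.
        return (len(self.inputs) + self.batch_size - 1) // self.batch_size

    def _fetch_batch(self, batch_index):
        if not 0 <= batch_index < self.n_batches:
            raise IndexError('batch index out of range')
        start = batch_index * self.batch_size
        stop = start + self.batch_size
        return self.inputs[start:stop], self.outputs[start:stop]

    def get_batch(self, batch_index):
        return self._fetch_batch(batch_index)


data = InMemoryBatches(list(range(105)), [i % 10 for i in range(105)])
batches_train = range(int(data.n_batches * 0.8))
batches_valid = range(int(data.n_batches * 0.8), data.n_batches)
xt, yt = data.get_batch(batches_train[0])
xv, yv = data.get_batch(batches_valid[0])
```

With 105 examples and a batch size of 20 this gives six batches, the last one holding only five examples, so code that consumes batches should not assume a fixed batch length.
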
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
174
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
175
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
176
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
177 COMMENTS
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
178 ~~~~~~~~
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
179
1120
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
180 JB asks: How about asking datasets to also provide a visualization mechanism
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
181 for showing / playing individual examples from the dataset, but also other
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
182 external objects that are similar to dataset examples (e.g. filters from a
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
183 weight matrix that filters images). This doesn't have to be complicated, and it
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
184 can be shared between datasets that exist in one modality (e.g. image datasets
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
185 can all use an image-rendering method).
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
186
1131
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
187 OD replies: Besides being able to display data without prior knowledge of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
188 kind of data inside a dataset, is there any reason to put this within the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
189 dataset class? If not, it seems to me it may be more appropriate to have a way
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
190 for the dataset to describe the kind of data it holds, and keep the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
191 visualization code separate from the dataset itself. It would make it easier
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
192 in particular to try different visualization systems, and description of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
193 data may turn out to be useful for other reasons (however, it also means we'd
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
194 need to come up with a good way to describe data, which could prove
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
195 difficult).
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
196
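A minimal sketch of the separation OD describes, where the dataset only declares the kind of data it holds and the rendering code lives elsewhere. All names here (``kind``, ``VISUALIZERS``, ``register_visualizer``, ``visualize``) are hypothetical and not part of any agreed API; the text renderer merely stands in for a real image-rendering method.

```python
# Hypothetical registry mapping a data kind to a rendering function.
VISUALIZERS = {}


def register_visualizer(kind):
    """Decorator that registers a rendering function for one kind of data."""
    def decorator(func):
        VISUALIZERS[kind] = func
        return func
    return decorator


class ImageDataset:
    # The dataset only describes what it holds; it does not know how to draw it.
    kind = "image"

    def __init__(self, examples):
        self.examples = examples


@register_visualizer("image")
def render_image(example):
    # Stand-in renderer: draw a 2-D list of 0/1 pixels as text.
    return "\n".join("".join("#" if px else "." for px in row) for row in example)


def visualize(dataset, obj):
    """Render obj (an example, or something example-like such as a filter)
    with the visualizer registered for the dataset's kind."""
    return VISUALIZERS[dataset.kind](obj)


mnist_like = ImageDataset([[[0, 1], [1, 0]]])
print(visualize(mnist_like, mnist_like.examples[0]))
```

Because ``visualize`` takes any object of the right shape, the same call could also be used for external objects such as filters from a weight matrix, which is the case JB raises.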
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
197 JB asks: What may be passed as argument to the functions in Dataset, and what
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
198 can be expected in return? Are there side effects (e.g. on the state of the
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
199 Dataset) associated with any of the functions?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
200
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
201 JB asks: What properties are part of the Dataset API? What possible types can
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
202 they have, are they expected to be read-only or writeable? What do they mean?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
203
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
204
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
205 JB asks: What is a view? Does set_view change the Dataset or return a new
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
206 Dataset with a certain view of the original (in which case call it get_view)?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
207 Does the view imply the types of the return-value of functions like
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
208 get_batch? What is the difference between the view and the subclasses of
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
209 Dataset in PyML?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
210
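To make the ``set_view`` / ``get_view`` distinction concrete, here is a small sketch of the second option, where asking for a view returns a new object and leaves the original untouched. The class and its ``field_names`` helper are assumptions made for illustration only; the actual semantics of a view are exactly what the question above leaves open.

```python
class Dataset:
    def __init__(self, fields):
        # fields: dict mapping a field name to a list of per-example values
        self._fields = fields

    def get_view(self, *names):
        """Return a NEW Dataset exposing only the named fields.

        The original is not modified, and the underlying lists are shared
        rather than copied.
        """
        return Dataset({name: self._fields[name] for name in names})

    def field_names(self):
        return sorted(self._fields)


full = Dataset({"input": [[0.1, 0.2], [0.3, 0.4]], "target": [0, 1]})
unsupervised = full.get_view("input")
print(unsupervised.field_names())  # the view only exposes "input"
print(full.field_names())          # the original still has both fields
```

A ``set_view`` that mutated ``full`` in place would instead change what every other holder of ``full`` sees, which is the side-effect concern raised earlier.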
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
211 JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
212 retrieval that we would like to expose via this interface?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
213
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
214 JB asks: How would you recommend using this sort of dataset in a boosting
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
215 algorithm where points need to be re-weighted?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
216
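One possible answer, sketched under the assumption that the weights are kept outside the dataset and that batches are requested by example index (as with ``get_batch`` in the example further up). The helpers ``weighted_batches`` and ``reweight`` are hypothetical, and the doubling rule is a placeholder rather than any particular boosting update.

```python
import random


def weighted_batches(n_examples, weights, batch_size, n_batches, seed=0):
    """Yield lists of example indices drawn with probability proportional
    to the given weights (sampling with replacement)."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    for _ in range(n_batches):
        yield rng.choices(indices, weights=weights, k=batch_size)


def reweight(weights, misclassified, factor=2.0):
    """Scale up the weight of misclassified examples, then renormalise
    so that the weights sum to one."""
    raw = [w * factor if i in misclassified else w for i, w in enumerate(weights)]
    total = sum(raw)
    return [w / total for w in raw]


weights = [0.25, 0.25, 0.25, 0.25]
weights = reweight(weights, misclassified={3})
batches = list(weighted_batches(4, weights, batch_size=2, n_batches=3))
```

With this arrangement the dataset itself stays read-only; only the index sampler changes between boosting rounds.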
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
217
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
218 JB asks: Do we want to provide for the possibility of feedback that modifies the
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
219 dataset? For example, curriculum learning might be adaptive in this sense, or
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
220 if we wanted to provide a virtual world for an agent as a dataset then we need
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
221 to provide 'actions' to get the next batch. Could this be done in the current
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
222 API?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
223
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
224
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
225 Field names and attributes
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
226 ~~~~~~~~~~~~~~~~~~~~~~~~~~
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
227
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
228 OD: One important question is how to handle fields' names and characteristics.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
229 For instance, it can be useful to know that the 3rd input field represents a
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
230 number of fingers, and is a non-negative discrete field whose numeric value is
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
231 meaningful (compared to, say, an integer index that would correspond to an
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
232 animal's category). We mentioned metadata during the meeting, but we did not
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
233 get into its details: that may be the place to put this kind of information.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
234
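As a rough illustration of what such per-field metadata might look like, the sketch below records the properties mentioned above (name, meaning, discreteness, whether the numeric value is meaningful, lower bound). The ``FieldInfo`` class, its attribute names, and the ``needs_one_hot`` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldInfo:
    name: str
    description: str
    discrete: bool
    ordered: bool                    # True when the numeric value itself is meaningful
    minimum: Optional[float] = None  # lower bound, if any


fields = [
    FieldInfo("n_fingers", "number of fingers",
              discrete=True, ordered=True, minimum=0),
    FieldInfo("animal", "index of the animal's category",
              discrete=True, ordered=False, minimum=0),
]


def needs_one_hot(info):
    """A discrete field whose value is only an index is a candidate for
    one-hot encoding; a count such as n_fingers is not."""
    return info.discrete and not info.ordered


print([f.name for f in fields if needs_one_hot(f)])
```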
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
235
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
236 Freeing memory
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
237 ~~~~~~~~~~~~~~
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
238
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
239 OD: It is sometimes useful to be able to free memory used by previous
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
240 computations. A typical example is when you load in memory the original
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
241 dataset, then perform various processing steps, ending with a new dataset that
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
242 you also store in memory before feeding it to the learner. Unless you very
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
243 carefully design your code to avoid it, your original dataset will still
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
244 remain in memory (as well as maybe the results of some computations performed
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
245 along the way). So there may be a use for a `clear()` method that would be
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
246 called by the topmost dataset (the one doing the final memory caching), and
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
247 would be forwarded iteratively to previous datasets so as to get back all this
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
248 wasted memory space.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
249
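A minimal sketch of how such a ``clear()`` could be forwarded down a chain of datasets. Only the ``clear()`` name comes from the text above; the classes, ``values()`` and ``free_upstream()`` are assumptions made for illustration.

```python
class InMemoryDataset:
    """Stands in for the original dataset loaded in memory."""

    def __init__(self, data):
        self.data = list(data)

    def values(self):
        return self.data

    def clear(self):
        self.data = None


class CachedTransform:
    """A dataset that applies func to its source and caches the result."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self._cache = None

    def values(self):
        if self._cache is None:
            self._cache = [self.func(x) for x in self.source.values()]
        return self._cache

    def clear(self):
        # Drop our own cache, then forward the request to the previous dataset.
        self._cache = None
        self.source.clear()

    def free_upstream(self):
        """Called on the topmost dataset: keep our cache, free everything else."""
        self.values()  # make sure the cache is filled before inputs are dropped
        self.source.clear()


raw = InMemoryDataset([1, 2, 3])
scaled = CachedTransform(raw, lambda x: x * 2)
final = CachedTransform(scaled, lambda x: x + 1)
final.free_upstream()
print(final.values())  # still served from the topmost cache
```

After ``free_upstream()`` the intermediate datasets can no longer be read, which is the price of recovering their memory.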
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
250 What is a mini-batch?
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
251 ~~~~~~~~~~~~~~~~~~~~~
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
252
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
253 This is a follow-up to the meeting's discussion about whether a mini-batch
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
254 returned by a dataset should be itself a dataset.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
255
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
256 OD: During the meeting I was voting in favor of a 'yes', mostly because it
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
257 made sense to me (a mini-batch is a subset of a dataset and thus should be a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
258 dataset), but now I tend towards 'no'. The main reason is it is not clear yet
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
259 what the dataset interface will be, so that it is hard to judge whether this
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
260 is a good idea (my main concern is how much additional work would be required by
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
261 the writer of a new dataset subclass). Anyway, maybe a first thing we could
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
262 think about is what we want a mini-batch to be. I think we can agree that we
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
263 would like to be able to do something like:
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
264
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
265 .. code-block:: python
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
266
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
267 for mb in dataset.mini_batches(size=10):
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
268 learner.update(mb.input, mb.target)
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
269
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
270 so that it should be ok for a mini-batch to be an object whose fields
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
271 (that should have the same name as those of the dataset) are numpy arrays.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
272 More generally, we would like to be able to iterate on samples in a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
273 mini-batch, or do random access on them, so a mini-batch should implement
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
274 __iter__ and __getitem__.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
275 Besides this, is there any other typical use-case of a mini-batch? In
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
276 particular, is there any reason to want an infinite mini-batch, or a very big
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
277 mini-batch that may not fit in memory? (in which case we may need to revise
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
278 our idea of what 'mini' means) Hopefully the answer to that last question is
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
279 no, as I think it would definitely keep things simpler, since we could simply
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
280 use numpy arrays (for numeric data) or lists (for anything else) to store
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
281 mini-batches' data. So I vote for 'no'.
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
282
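A sketch of the kind of mini-batch object described above: fields are numpy arrays, and the object supports ``__iter__`` and ``__getitem__`` over samples. The ``MiniBatch`` class and the free function ``mini_batches`` are hypothetical; in the discussion ``mini_batches`` is a method of the dataset, and returning each sample as a dict is just one possible choice.

```python
import numpy as np


class MiniBatch:
    """In-memory batch whose fields are numpy arrays of equal length."""

    def __init__(self, **fields):
        lengths = {len(v) for v in fields.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must hold the same number of samples")
        self._names = tuple(fields)
        for name, value in fields.items():
            setattr(self, name, np.asarray(value))

    def __len__(self):
        return len(getattr(self, self._names[0]))

    def __getitem__(self, i):
        # Random access to one sample, returned as {field name: value}.
        return {name: getattr(self, name)[i] for name in self._names}

    def __iter__(self):
        return (self[i] for i in range(len(self)))


def mini_batches(inputs, targets, size):
    """Yield consecutive MiniBatch objects of at most `size` samples."""
    for start in range(0, len(inputs), size):
        yield MiniBatch(input=inputs[start:start + size],
                        target=targets[start:start + size])


inputs = np.arange(12.0).reshape(6, 2)
targets = np.array([0, 1, 0, 1, 0, 1])
batches = list(mini_batches(inputs, targets, size=4))
for mb in batches:
    print(mb.input.shape, mb.target.shape)
```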
1124
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
283 YB: I agree that a mini-batch should definitely be safely assumed
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
284 to fit in memory. That makes it at least in principle semantically
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
285 different from a dataset. But barring that restriction, it might
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
286 share some of the properties of a dataset.
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
287
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
288 A dataset is a learner
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
289 ~~~~~~~~~~~~~~~~~~~~~~
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
290
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
291 OD: (this is hopefully a clearer re-write of the original version from
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
292 r7e6e77d50eeb, which I was not happy with).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
293 There are typically three kinds of objects that spit out data:
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
294
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
295 1. Datasets that are loaded from disk or are able to generate data all by
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
296 themselves (i.e. without any other dataset as input)
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
297 2. Datasets that transform their input dataset in a way that only depends on
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
298 the input dataset (e.g. filtering samples or features, normalizing data, etc.)
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
299 3. Datasets that transform their input dataset in a way that is learned on a
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
300 potentially different dataset (e.g. PCA when you want to learn the projection
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
301 space on the training set in order to transform both the training and test
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
302 sets).
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
303
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
304 My impression currently is that we would use dataset subclasses to handle 1
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
305 and 2. However, 3 requires a learner framework, so you would need to have
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
306 something like a LearnerOutputDataset(trained_learner, dataset).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
307
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
308 Note however that 2 is a special case of 3 (where training does nothing), and
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
309 1 is a special case of 2 (where we do not care about being given an input
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
310 dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
311 by LearnerOutputDataset.
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
312
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
313 The main advantages I find in this approach (that I have been using at
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
314 Ubisoft) are:
1190
9ff2242a817b fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents: 1131
diff changeset
315
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
316 - You only need to learn how to subclass the learner class. The only dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
317 class is LearnerOutputDataset, which you could just name Dataset.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
318 - You do not have different ways to achieve the same result (having to figure
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
319 out which one is most appropriate).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
320 - Upgrading code from 2 to 3 is more straightforward. Such a situation can
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
321 happen e.g. if you write some code that normalizes your input dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
322 (situation 2), then realize later you would like to be able to normalize new
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
323 datasets using the same parameters (e.g. same shift & rescaling), which
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
324 requires situation 3.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
325 - It can make your life easier when thinking about how to plug things together
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
326 (something that has not been discussed yet), because the interfaces of the
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
327 various components are less varied.
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
328
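The idea can be sketched as follows, with situation 2 written as a learner whose training does nothing and situation 3 as a learner that fits shift and rescaling parameters. Only ``LearnerOutputDataset(trained_learner, dataset)`` comes from the text above; the ``Learner`` base class, its ``train`` / ``transform`` methods, and the two subclasses are assumptions for illustration.

```python
class Learner:
    def train(self, data):
        """Default: nothing to learn (situation 2)."""

    def transform(self, example):
        raise NotImplementedError


class Doubler(Learner):
    """Situation 2: a fixed transformation, so training does nothing."""

    def transform(self, x):
        return 2 * x


class Standardizer(Learner):
    """Situation 3: the shift and rescaling are learned on a training set."""

    def train(self, data):
        n = len(data)
        self.mean = sum(data) / n
        variance = sum((x - self.mean) ** 2 for x in data) / n
        self.scale = variance ** 0.5 or 1.0  # avoid dividing by zero

    def transform(self, x):
        return (x - self.mean) / self.scale


class LearnerOutputDataset:
    """Dataset made of the outputs of a trained learner on another dataset."""

    def __init__(self, trained_learner, dataset):
        self.learner = trained_learner
        self.dataset = dataset

    def __iter__(self):
        return (self.learner.transform(x) for x in self.dataset)


train_set = [0.0, 2.0]
test_set = [3.0]

standardizer = Standardizer()
standardizer.train(train_set)
train_out = list(LearnerOutputDataset(standardizer, train_set))
test_out = list(LearnerOutputDataset(standardizer, test_set))  # same parameters reused
doubled = list(LearnerOutputDataset(Doubler(), test_set))
```

Here the test set is normalized with the parameters learned on the training set, which is the upgrade from situation 2 to situation 3 mentioned in the list above.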
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
329 I am not saying that we should necessarily do it this way, but I think it is
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
330 worth at least keeping in mind this close relationship between simple
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
331 processing and learning, and thinking about what are the benefits / drawbacks
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
332 in keeping them separate in the class hierarchy.
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
333
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
334 RP: I actually like this idea of having the dataset implement the same
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
335 interface as the learner (or actually a subset of the interface).
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
336 I hope people decide to do this.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
337
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
338 Support for shared variables
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
339 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
340
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
341 RP asks: What is the status of having the dataset support copying data
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
342 on the GPU (by storing data in shared variables)? Have you decided to
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
343 include this feature or not? I think that the strongest selling point of
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
344 Theano is that it runs on GPU transparently, and I see this as a good
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
345 selling point for the library as well. Plus we intend to move more and
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
346 more towards running things on GPU. If the dataset object does not support
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
347 this feature we will need to find hacks around it.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
348
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
349 OD: I have like zero experience with GPU so hopefully someone else can answer
350 this. But the way I see it, hopefully it could work by having some dataset
351 object that would take care of storing its input data into a shared variable.
1094
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
352 OD (continued): After thinking a bit more about it, I am not sure that would
353 work. I definitely need to look at some code doing it to get a better
354 understanding of it, but my feeling is that you need your learner to be
355 written in a specific way to achieve this, in which case it may be up to the
356 learner to take its input data and store it into a shared variable.
1104
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
357
358 RP comment: Yes, the dataset object alone cannot handle this; the issue is somewhere
359 between the dataset and the learner. Or in other words, every time you change
360 the data, you need to recompile your Theano function. So the learner cannot
361 just get data from the dataset; it needs to get a shared variable. The learner
362 should also be aware when the dataset is changed, to recompile its internal
363 functions. I'm not sure which is the best way to do this. My personal feeling
364 is that the dataset should be part of the learner. The learner should provide
365 a function use_dataset (or replace_dataset). When this function is called,
366 all the theano functions in the learner get recompiled based on shared
367 variables that the dataset object provides. It sort of fits very well in the
368 framework that I have in mind, which was scattered around in the learner.txt
369 and some of my previous emails. I think it shares a lot with James's concepts,
370 since it follows quite closely the concepts behind Theano.
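
A minimal sketch of the use_dataset idea described above, written in plain
Python rather than Theano so that it runs anywhere. SharedBuffer, ToyDataset,
ToyLearner and the shared_inputs attribute are hypothetical names invented for
illustration, and "recompiling" is modelled simply as rebuilding a closure
around the buffer that the dataset provides.

```python
class SharedBuffer:
    """Hypothetical stand-in for a shared variable (not Theano's API)."""

    def __init__(self, value):
        self._value = list(value)

    def get_value(self):
        return list(self._value)

    def set_value(self, value):
        self._value = list(value)


class ToyDataset:
    """Hypothetical dataset that exposes its inputs as a shared buffer."""

    def __init__(self, inputs):
        self.shared_inputs = SharedBuffer(inputs)


class ToyLearner:
    """Hypothetical learner whose internal functions are bound to a dataset."""

    def __init__(self):
        self.compile_count = 0
        self._score = None

    def use_dataset(self, dataset):
        # Rebuild ("recompile") the internal function around the buffer
        # provided by the new dataset.
        shared = dataset.shared_inputs

        def score():
            return sum(shared.get_value())

        self._score = score
        self.compile_count += 1

    def score(self):
        if self._score is None:
            raise RuntimeError("call use_dataset() first")
        return self._score()


learner = ToyLearner()
learner.use_dataset(ToyDataset([1, 2, 3]))
first_score = learner.score()
learner.use_dataset(ToyDataset([4, 5]))
second_score = learner.score()
```

In this sketch the rebuild happens only when the dataset object itself is
replaced, which is the event the learner would need to be told about.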
1105
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
371
372 OD asks: Ok, so why would the dataset have to be responsible for providing a
373 shared variable? Why wouldn't the learner just create this shared variable
374 internally and copy into it the data provided by the dataset?
375
1109
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
376 RP replies: Sure, the learner could take care of all this. Note though that the
377 learner should take care to divide the dataset into chunks that fit in the
378 GPU memory (in case of a large dataset) and then take care of updating the
379 shared variables according to the current chunk. Personally I feel like all
380 this data division, management and so on should be done by the dataset.
381 It feels more natural that way. For example assume you have a dataset that
382 is composed of a time series and some static data (carre-tech heart beat
383 data is a good example). The static data is small enough so that you could
384 always store it on the GPU, and you would only need to split the time series.
385 For the learner to do this (since it gets the same interface from any
386 dataset object) would be like adding an "if <this case> then", while for the
387 dataset it is just a different class. But I'm happy to have all this GPU stuff
388 sent to the learner as well if everybody else believes that is better.
389
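
A minimal sketch of the chunking responsibility discussed above, again in
plain Python rather than Theano. ChunkedDataset, SharedBuffer, iter_chunks and
max_items are hypothetical names invented for illustration; max_items stands
in for whatever fits in GPU memory. The small static data is loaded once and
stays resident, while only the time series is split and swapped into a buffer
chunk by chunk.

```python
class SharedBuffer:
    """Hypothetical stand-in for a shared variable (not Theano's API)."""

    def __init__(self, value):
        self._value = list(value)

    def get_value(self):
        return list(self._value)

    def set_value(self, value):
        self._value = list(value)


class ChunkedDataset:
    """Hypothetical dataset that owns both the buffers and the chunking policy."""

    def __init__(self, series, static, max_items):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._series = list(series)
        self._max_items = max_items
        self.static = SharedBuffer(static)  # small: loaded once, stays resident
        self.chunk = SharedBuffer([])       # large: refilled one chunk at a time

    def iter_chunks(self):
        # Refill the same buffer with each successive slice of the series.
        for start in range(0, len(self._series), self._max_items):
            self.chunk.set_value(self._series[start:start + self._max_items])
            yield self.chunk


def total(dataset):
    """Learner-side loop that never needs to know how the data is split."""
    bias = sum(dataset.static.get_value())
    acc = 0
    for chunk in dataset.iter_chunks():
        acc += sum(x + bias for x in chunk.get_value())
    return acc


ds = ChunkedDataset(series=range(10), static=[1], max_items=4)
sizes = [len(c.get_value()) for c in ds.iter_chunks()]
result = total(ds)
```

With the policy inside the dataset class, a dataset with a different layout
would be a different class rather than an extra branch in the learner.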
1110
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
390 FB comment: I don't understand why you would need to recompile the theano function.
391 There are 2 cases. In the first, the data is in a shared variable. You can directly change the data
392 in the shared variable without recompiling the theano function. The second case is when
393 the dataset is in an ordinary theano variable. In that case, the first step in the
394 theano function will be to transfer the dataset to the GPU before computation. If the data
395 changes at each call, that will be as efficient as changing the data manually every time
396 in the shared variable.
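
A minimal sketch of the first case above, in plain Python rather than Theano.
SharedBuffer and compile_mean are hypothetical names invented for
illustration; the point is only that a function built once around a shared
container sees new contents after set_value without being rebuilt.

```python
class SharedBuffer:
    """Hypothetical stand-in for a shared variable (not Theano's API)."""

    def __init__(self, value):
        self._value = list(value)

    def get_value(self):
        return list(self._value)

    def set_value(self, value):
        self._value = list(value)


def compile_mean(shared):
    """Build the function once; it keeps a reference to the container,
    not a copy of its contents."""

    def mean():
        data = shared.get_value()
        return sum(data) / len(data)

    return mean


data = SharedBuffer([1.0, 2.0, 3.0])
mean_fn = compile_mean(data)   # built a single time
first = mean_fn()
data.set_value([10.0, 20.0])   # swap the contents, keep the same function
second = mean_fn()
```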
1116
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
397
398 AB: I have an idea about this which kind of fits in the "building a
399 theano op" thing that we talked about at the last meeting.
400
1117
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
401 We can just build a theano Op that wraps dataset objects and takes
402 care of the details of transferring data to the GPU or otherwise.
403
404 I have a prototype interface/implementation in the shared_dataset.py
405 file in this directory.
1127
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue
Olivier Delalleau <delallea@iro>
parents: 1124
diff changeset
406
407 OD: I like AB's approach.
408