annotate doc/v2_planning/dataset.txt @ 1170:53340a8df1fa

coding_style: Started to write full code sample
author Olivier Delalleau <delallea@iro>
date Fri, 17 Sep 2010 14:37:00 -0400
parents d9550c27a192
children 9ff2242a817b
rev   line source
1002
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
diff changeset
1 Discussion of Function Specification for Dataset Types
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
diff changeset
2 ======================================================
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
diff changeset
3
1008
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
4 Some talking points from the September 2 meeting:
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
5
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
6 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
7 needs to be flexible enough to accommodate different (sub)tasks and views of
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
8 the same underlying data.
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
9 * Datasets as probability distributions from which one can sample.
1023
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
10 * That's not something I would consider to be a dataset-related problem to
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
11 tackle now: a probability distribution in Pylearn would probably be a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
12 different kind of beast, and it should be easy enough to have a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
13 DatasetToDistribution class for instance, that would take care of viewing a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents: 1008
diff changeset
14 dataset as a probability distribution. -- OD
1008
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
15 * Our specification should allow transparent handling of infinite datasets (or
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
16 simply datasets which cannot fit in memory)
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
17 * GPU/buffering issues.
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
18
a5886b394bda Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1002
diff changeset
19 Commiteee: DE, OB, OD, AB, PV
1030
a154c9b68239 dataset: Dumi confirmed as leader
Olivier Delalleau <delallea@iro>
parents: 1023
diff changeset
20 Leader: DE
1047
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
21
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
22 Some ideas from existing ML libraries:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
23
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
25 PairDataSet, Aggregate. Ultimately, the learner decides
1077
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
26 - mlpy: very primitive notions of data (simple 2D matrices)
1076
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
27 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
28 SequentialDataSet, ReinforcementDataSet, ... Each class is quite
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
29 constrained and may have a different interface.
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
30 - MDP: Seems to have restrictions on the type of data being passed around, as
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
31 well as its dimensionality ("Input array data is typically assumed to be
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
32 two-dimensional and ordered such that observations of the same variable are
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
33 stored on rows and different variables are stored on columns.")
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
34 - Orange: Data matrices, with names and types associated to each column.
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
35 Basically there seems to be only one base dataset class that contains the
20a1af112a75 dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents: 1054
diff changeset
36 data. Data points are lists (of values corresponding to each column).
1077
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
37 - APGL: Hard to say how they deal with data from the documentation alone.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
38 - Monte: Data is simply numpy arrays.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
39 - scikits.learn: Dataset is a simple container with e.g. dataset.data being
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
40 a 2D numpy array of input features, and dataset.target the target vector.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
41 - Shogun: Vade Retro C++! (may be worth looking into their feature concept
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
42 though).
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents: 1076
diff changeset
43 - Any more worth looking at?
1047
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
44
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
45 A few things that our dataset containers should support at a minimum:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
46
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
47 - streams, possibly infinite
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
48 - task/views of the data for different problems
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
49 - indexing & slicing
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
50 - pairs or triples or etc of examples
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
51 - a 'distance/gram matrix' container (imagine that the data is given to you
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
52 as a distance matrix)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
53 - multi-dimensional time-series (again, maybe with pairs/triples, maybe
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
54 given to you as a distance matrix over time)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
55
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
56 Another question to consider is the following: how tight should it integrate
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
57 with Theano? Do we want to be able to store data as shared variables or just
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
58 have an option for that? Theano + GPU constrains things that we can do (in terms
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
59 of sizes, buffering, etc): these are things we need to think about, but it's not
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
60 clear whether we should aim for building them into the interface.
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
61
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
62 Task views of the data for different problems: How can we achieve this? Should
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
63 we simply have a set of standard dataset descriptors ('classification',
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
64 'regression', 'multi-label', 'density_estimation') and have a set_view method
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
65 that changes the current dataset view type?
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
66
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
67 There is then the question of how to approach the design of a Dataset class from
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
68 an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
69 Dataset that doesn't implement any methods except a few setters/getters. The reason
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
70 to have the methods listed that way is to have a common 'specification', but classes
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
71 that inherit from Dataset need not implement every single method (only the ones
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
72 that are relevant) and can obviously implement other methods as appropriate. The
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
73 reason to have a common specification (as abstract as it might be) is to, well,
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
74 have a common specification that would make our code clearer and cleaner.
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
75
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
76 An example of what I (Dumi) am thinking in terms of concrete API:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
77
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
78 class Dataset:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
79 def __init__(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
80 self.type = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
81 self.in_memory = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
82 self.inputs = None # list of filepaths, or objects in memory, or...
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
83 self.outputs = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
84
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
85 def get_example(self,example_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
86 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
87
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
88 def get_next_example(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
89 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
90
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
91 def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
92 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
93
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
94 def get_next_batch(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
95 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
96
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
97 def get_slice(self,slice_object):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
98 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
99
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
100 def set_view(self,view_type):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
101 self.view_type = view_type
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
102 self.n_classes = None
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
103
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
104 def set_n_classes(self,n_classes):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
105 self.n_classes = n_classes
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
106
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
107 def set_batch_size(self,batch_size):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
108 self.batch_size = batch_size
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
109
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
110 You will note that there is no notion of train/valid/test in this class: I think we should
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
111 just have a train dataset, a valid one and a test one instead or (if it's in one
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
112 big file or infinite stream) just handle the split ourselves (via slicing, for
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
113 instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
114 specification does not preclude more fine-grained 'splitting' of the data.
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
115
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
116 A concrete implementation would look like this (we would have one class per
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
117 dataset that we use, and the class declaration contains essentially everything
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
118 there is to know about the dataset):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
119
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
120 class MNIST(Dataset):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
121 def __init__(self,inputs=['train_x.npy'],outputs=['train_y.npy']):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
122 self.type='standard_xy'
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
123 self.in_memory = True
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
124 self.inputs = inputs # load them or create
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
125 self.outputs = outputs
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
126 self.set_view('classification')
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
127 self.set_n_classes(10)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
128 self.set_batch_size(20)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
129 self.n_batches = self._compute_n_batches()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
130
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
131 def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
132 x,y = self._fetch_batch(batch_index)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
133 if self.view_type == 'classification':
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
134 return x,numpy.int32(y)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
135 elif self.view_type == 'density_estimation':
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
136 return x
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
137 else:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
138 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
139
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
140 def shared_data(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
141 shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
142 shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
143 return shared_x, T.cast(shared_y, 'int32')
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
144
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
145 def _compute_n_batches(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
146 pass
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
147
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
148 def _fetch_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
149 pass
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
150
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
151 But nothing stops you from defining get_train_batch, get_valid_batch and stuff
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
152 like that!
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
153
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
154 So we'd use it as:
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
155
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
156 train_mnist = MNIST(inputs = ['train_x.npy'], outputs = ['train_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
157 valid_mnist = MNIST(inputs = ['valid_x.npy'], outputs = ['valid_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
158
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
159 x,y = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
160 train_mnist.set_view('density_estimation')
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
161 x = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
162
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
163 or
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
164
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
165 mnist_data = MNIST(inputs = ['x.npy'], outputs = ['y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
166 batches_train = range(int(mnist_data.n_batches*0.8))
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
167 batches_valid = range(int(mnist_data.n_batches*0.8),mnist_data.n_batches)
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
168
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
169 xt,yt = mnist_data.get_batch(batches_train[0])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
170 xv,yv = mnist_data.get_batch(batches_valid[0])
1b61cbe0810b A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 1030
diff changeset
171
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
172
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
173
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
174
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
175 COMMENTS
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
176 ~~~~~~~~
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
177
1120
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
178 JB asks: How about asking datasets to also provide a visualization mechanism
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
179 for showing / playing individual examples from the dataset, but also other
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
180 external objects that are similar to dataset examples (e.g. filters from a
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
181 weight matrix that filters images). This doesn't have to be complicated, and it
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
182 can be shared between datasets that exist in one modality (e.g. image datasets
27d0ef195e1d v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1117
diff changeset
183 can all use an image-rending method)
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
184
1131
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
185 OD replies: Besides being able to display data without prior knowledge of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
186 kind of data inside a dataset, is there any reason to put this within the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
187 dataset class? If not, it seems to me it may be more appropriate to have a way
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
188 for the dataset to describe the kind of data it holds, and keep the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
189 visualization code separate from the dataset itself. It would make it easier
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
190 in particular to try different visualization systems, and description of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
191 data may turn out to be useful for other reasons (however, it also means we'd
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
192 need to come up with a good way to describe data, which could prove
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
193 difficult).
d9550c27a192 dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents: 1127
diff changeset
194
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
195 JB asks: What may be passed as argument to the functions in Dataset, and what
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
196 can be expected in return? Are there side effects (e.g. on the state of the
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
197 Dataset) associated with any of the functions?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
198
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
199 JB asks: What properties are part of the Dataset API? What possible types can
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
200 they have, are they expected to be read-only or writeable? What do they mean?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
201
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
202
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
203 JB asks: What is a view? Does set_view change the Dataset or return a new
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
204 Dataset with a certain view of the original (in which case call it get_view)?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
205 Does the view imply the types of the return-value of functions like
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
206 get_batch? What is the difference between the view and the subclasses of
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
207 Dataset in PyML?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
208
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
209 JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
210 retrieval that we would like to expose via this interface?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
211
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
212 JB asks: How would you recommend using this sort of dataset in a boosting
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
213 algorithm where points need to be re-weighted.
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
214
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
215
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
216 JB asks: Do we want to provide for the possibility of feedback that modifies the
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
217 dataset? For example, curriculum learning might be adaptive in this sense, or
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
218 if we wanted to provide a virtual world for an agent as a dataset then we need
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
219 to provide 'actions' to get the next batch. Could this be done in the current
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
220 API?
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
221
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
222
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
223 Field names and attributes
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
224 ~~~~~~~~~~~~~~~~~~~~~~~~~~
1054
a474fabd1f37 v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1047
diff changeset
225
1082
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
226 OD: One important question is how to handle fields' names and characteristics.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
227 For instance, it can be useful to know that the 3rd input field represents a
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
228 number of fingers, and is a non-negative discrete field whose numeric value is
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
229 meaningful (compared, to, say, an integer index that would correspond to an
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
230 animal's category). We mentioned metadata during the meeting, but we did not
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
231 get into its details: that may be a place where to put this kind of things.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
232
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
233
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
234 Freeing memory
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
235 ~~~~~~~~~~~~~~
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
236
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
237 OD: It is sometimes useful to be able to free memory used by previous
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
238 computations. A typical example is when you load in memory the original
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
239 dataset, then perform various processing steps, ending with a new dataset that
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
240 you also store in memory before feeding it to the learner. Unless you very
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
241 carefully design your code to avoid it, your original dataset will still
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
242 remain in memory (as well as maybe the results of some computations performed
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
243 along the way). So there may be a use for a `clear()` method that would be
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
244 called by the topmost dataset (the one doing the final memory caching), and
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
245 would be forwarded iteratively to previous datasets so as to get back all this
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
246 wasted memory space.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents: 1077
diff changeset
247
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
248 What is a mini-batch?
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
249 ~~~~~~~~~~~~~~~~~~~~~
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
250
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
251 This is a follow-up to the meeting's discussion about whether a mini-batch
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
252 returned by a dataset should be itself a dataset.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
253
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
254 OD: During the meeting I was voting in favor of a 'yes', mostly because it
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
255 made sense to me (a mini-batch is a subset of a dataset and thus should be a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
256 dataset), but now I tend towards 'no'. The main reason is it is not clear yet
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
257 what the dataset interface will be, so that it is hard to judge whether this
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
258 is good idea (my main concern is how much additional work would be required by
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
259 the writer of a new dataset subclass). Anyway, maybe a first thing we could
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
260 think about is what we want a mini-batch to be. I think we can agree that we
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
261 would like to be able to do something like:
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
262 for mb in dataset.mini_batches(size=10):
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
263 learner.update(mb.input, mb.target)
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
264 so that it should be ok for a mini-batch to be an object whose fields
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
265 (that should have the same name as those of the dataset) are numpy arrays.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
266 More generally, we would like to be able to iterate on samples in a
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
267 mini-batch, or do random access on them, so a mini-batch should implement
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
268 __iter__ and __getitem__.
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
269 Besides this, is there any other typical use-case of a mini-batch? In
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
270 particular, is there any reason to want an infinite mini-batch, or a very big
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
271 mini-batch that may not fit in memory? (in which case we may need to revise
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
272 our idea of what 'mini' means) Hopefully the answer to that last question is
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
273 no, as I think it would definitely keep things simpler, since we could simply
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
274 use numpy arrays (for numeric data) or lists (for anything else) to store
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
275 mini-batches' data. So I vote for 'no'.
1083
4c00af69c164 dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents: 1082
diff changeset
276
1124
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
277 YB: I agree that a mini-batch should definitely be safely assumed
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
278 to fit in memory. That makes it at least in principle semantically
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
279 different from a dataset. But barring that restriction, it might
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
280 share of the properties of a dataset.
0f184b5e7a3f YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 1120
diff changeset
281
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
282 A dataset is a learner
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
283 ~~~~~~~~~~~~~~~~~~~~~~
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
284
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
285 OD: (this is hopefully a clearer re-write of the original version from
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
286 r7e6e77d50eeb, which I was not happy with).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
287 There are typically three kinds of objects that spit out data:
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
288 1. Datasets that are loaded from disk or are able to generate data all by
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
289 themselves (i.e. without any other dataset as input)
1086
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
290 2. Datasets that transform their input dataset in a way that only depends on
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
291 the input dataset (e.g. filtering samples or features, normalizing data, etc.)
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
292 3. Datasets that transform their input dataset in a way that is learned on a
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
293 potentially different dataset (e.g. PCA when you want to learn the projection
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
294 space on the training set in order to transform both the training and test
65ac0f493830 dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents: 1085
diff changeset
295 sets).
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
296 My impression currently is that we would use dataset subclasses to handle 1
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
297 and 2. However, 3 requires a learner framework, so you would need to have
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
298 something like a LearnerOutputDataset(trained_learner, dataset).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
299
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
300 Note however that 2 is a special case of 3 (where training does nothing), and
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
301 1 is a special case of 2 (where we do not care about being given an input
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
302 dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
303 by LearnerOutputDataset.
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
304
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
305 The main advantages I find in this approach (that I have been using at
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
306 Ubisoft) are:
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
307 - You only need to learn how to subclass the learner class. The only dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
308 class is LearnerOutputDataset, which you could just name Dataset.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
309 - You do not have different ways to achieve the same result (having to figure
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
310 out which one is most appropriate).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
311 - Upgrading code from 2 to 3 is more straighforward. Such a situation can
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
312 happen e.g. if you write some code that normalizes your input dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
313 (situation 2), then realize later you would like to be able to normalize new
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
314 datasets using the same parameters (e.g. same shift & rescaling), which
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
315 requires situation 3.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
316 - It can make your life easier when thinking about how to plug things together
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
317 (something that has not been discussed yet), because the interfaces of the
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
318 various components are less varied.
1084
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents: 1083
diff changeset
319
1085
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
320 I am not saying that we should necessarily do it this way, but I think it is
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
321 worth at least keeping in mind this close relationship between simple
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
322 processing and learning, and thinking about what are the benefits / drawbacks
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents: 1084
diff changeset
323 in keeping them separate in the class hierarchy.
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
324
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
325 RP: I actually like this idea of having the dataset implement the same
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
326 interface as the learner ( or actually a subset of the interface .. ).
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
327 I hope people decide to do this.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
328
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
329 Support for shared variables
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
330 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1089
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
331
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
332 RP asks: What is the status of having the dataset support copying data
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
333 on the GPU ( by storing data in shared variables) ? Have you decided to
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
334 include this feature or not ? I think that the strongest selling point of
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
335 Theano is that it runs on GPU transperently, and I see this as a good
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
336 selling point for the library as well. Plus we intend to move more and
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
337 more towards running things on GPU. If the dataset object does not support
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
338 this feature we will need to find hacks around it ..
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1086
diff changeset
339
1091
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
340 OD: I have like zero experience with GPU so hopefully someone else can answer
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
341 this. But the way I see it, hopefully it could work by having some dataset
319de699fb67 dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents: 1089
diff changeset
342 object that would take care of storing its input data into a shared variable.
1094
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
343 OD (continued): After thinking a bit more about it, I am not sure that would
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
344 work. I definitely need to look at some code doing it to get a better
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
345 understanding of it, but my feeling is that you need your learner to be
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
346 written in a specific way to achieve this, in which case it may be up to the
75175e2e697d dataset: Continued comment about GPU and shared variables
Olivier Delalleau <delallea@iro>
parents: 1091
diff changeset
347 learner to take its input data and store it into a shared variable.
1104
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
348
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
349 RP comment: Yes, the dataset object alone can not handle this, the issue is somewhere
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
350 between the dataset and the learner. Or in other words, everytime you change
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
351 the data you need to recompile your theano function. So the learner can not
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
352 only get data from the dataset, it needs to get a shared variable. The learner
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
353 should also be aware when the dataset is changed, to recompile its internal
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
354 functions. I'm not sure which is the best wa to do this. My personal feeling
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
355 is that the dataset should be part of the learner. The lerner should provide
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
356 a function use_dataset ( or replace_dataset). When this function is called,
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
357 all the theano functions in the learner get recompiled based on shared
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
358 variables that the dataset object provides. It sort of fits very well in the
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
359 framework that I have in mind, which was spattered around in the learner.txt
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
360 and some of my previous emails. I think it shares a lot with James concepts,
5e6d7d9e803a a comment on the GPU issue for datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1094
diff changeset
361 since it follows quite closely the concepts behind Theano.
1105
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
362
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
363 OD asks: Ok, so why would the dataset have to be responsible for providing a
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
364 shared variable? Why wouldn't the learner just create this shared variable
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
365 internally and copy into it the data provided by the dataset?
546bd0ccb0e4 dataset: Question about shared variables
Olivier Delalleau <delallea@iro>
parents: 1104
diff changeset
366
1109
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
367 RP replies: Sure, the learner could take care of all this. Note though that the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
368 learner should take care to divide the dataset into chunks that fit in the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
369 GPU memory ( in case of a large dataset) and then take care of updating the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
370 shared variables acording to the current chunk. Personally I feel like all
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
371 this data division, management and so on should be done by the dataset.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
372 It feels more natural that way. For example assume you have a dataset that
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
373 is composed of a time series and some static data ( carre-tech heart beat
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
374 data is a good example). The static data is small enough so that you could
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
375 always store on the GPU, and you would only need to split the time series.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
376 For the learner to do this ( since it gets the same interface from any
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
377 dataset object) would be like and if <this case> then, while for the
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
378 dataset is just a different class. But I'm happy to have all this GPU stuff
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
379 send to the learner as well if everybody else believe that is better.
29b48deb6a84 reply/comment regarding the GPU and datasets
Razvan Pascanu <r.pascanu@gmail.com>
parents: 1105
diff changeset
380
1110
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
381 FB comment: I don't understand why you would need to recompile the theano function.
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
382 Their is 2 cases, the data is in a shared variable. You can directly change the data
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
383 in the shared variable without recompiling the theano fct. The second case is when
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
384 the dataset is in an ordinary theano variable. In that case, the first step in the
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
385 theano fct will be to transfer the dataset to the gpu before computation. If the data
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
386 change at each call, that will be as efficient as changing the data manually every time
4797a4cb73e1 added comment to dataset.
Frederic Bastien <nouiz@nouiz.org>
parents: 1109
diff changeset
387 in the shared variable.
1116
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
388
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
389 AB: I have an idea about this which kind of fits in the "building a
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
390 theano op" thing that we talked about at the last meeting.
18a092001752 An idea about Datasets and GPU.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1110
diff changeset
391
1117
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
392 We can just build a theano Op that wraps dataset objects and takes
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
393 care of the details of tranferring data to the GPU or otherwise.
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
394
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
395 I have a prototype interface/implemantation in the shared_dataset.py
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out.
Arnaud Bergeron <abergeron@gmail.com>
parents: 1116
diff changeset
396 file in this directory.
1127
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue
Olivier Delalleau <delallea@iro>
parents: 1124
diff changeset
397
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue
Olivier Delalleau <delallea@iro>
parents: 1124
diff changeset
398 OD: I like AB's approach.
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue
Olivier Delalleau <delallea@iro>
parents: 1124
diff changeset
399