annotate doc/v2_planning/dataset.txt @ 1507:2a6a6f16416c
fix import.
author | Frederic Bastien <nouiz@nouiz.org> |
---|---|
date | Mon, 12 Sep 2011 11:45:41 -0400 |
parents | 04b988fb00b6 |
Discussion of Function Specification for Dataset Types
======================================================

Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.
    * That's not something I would consider to be a dataset-related problem to
      tackle now: a probability distribution in Pylearn would probably be a
      different kind of beast, and it should be easy enough to have a
      DatasetToDistribution class for instance, that would take care of viewing a
      dataset as a probability distribution. -- OD
* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory).
* GPU/buffering issues.

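The DatasetToDistribution idea mentioned by OD could look roughly like the following. This is only a minimal sketch: it assumes the wrapped dataset exposes ``__len__`` and ``get_example``, and ``ListDataset``, ``sample`` and the ``seed`` argument are hypothetical names introduced for illustration, not part of any agreed interface.

```python
import random


class ListDataset(object):
    """Hypothetical stand-in for a finite, in-memory dataset."""

    def __init__(self, examples):
        self.examples = list(examples)

    def __len__(self):
        return len(self.examples)

    def get_example(self, example_index):
        return self.examples[example_index]


class DatasetToDistribution(object):
    """Views a finite dataset as its empirical distribution.

    Sampling draws examples uniformly at random, with replacement.
    """

    def __init__(self, dataset, seed=None):
        self.dataset = dataset
        self.rng = random.Random(seed)

    def sample(self, n_samples=1):
        n = len(self.dataset)
        return [self.dataset.get_example(self.rng.randrange(n))
                for _ in range(n_samples)]


dist = DatasetToDistribution(ListDataset([10, 20, 30]), seed=0)
samples = dist.sample(5)
```

Keeping the wrapper separate from the dataset, as OD suggests, means the dataset specification itself does not need a sampling method.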
Committee: DE, OB, OD, AB, PV
Leader: DE

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides
- mlpy: very primitive notions of data (simple 2D matrices)
- PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
  SequentialDataSet, ReinforcementDataSet, ... Each class is quite
  constrained and may have a different interface.
- MDP: Seems to have restrictions on the type of data being passed around, as
  well as its dimensionality ("Input array data is typically assumed to be
  two-dimensional and ordered such that observations of the same variable are
  stored on rows and different variables are stored on columns.")
- Orange: Data matrices, with names and types associated to each column.
  Basically there seems to be only one base dataset class that contains the
  data. Data points are lists (of values corresponding to each column).
- APGL: Hard to say how they deal with data from the documentation alone.
- Monte: Data is simply numpy arrays.
- scikits.learn: Dataset is a simple container with e.g. dataset.data being
  a 2D numpy array of input features, and dataset.target the target vector.
- Shogun: Vade Retro C++! (may be worth looking into their feature concept
  though).
- Any more worth looking at?

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite
- task/views of the data for different problems
- indexing & slicing
- pairs, triples, etc. of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)

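To make the first and fourth items concrete, here is a small sketch of how a possibly infinite stream could be batched lazily and grouped into pairs. The function names (``infinite_stream``, ``batches``, ``pairs``) are made up for this illustration, and the sketch assumes Python 3, where ``zip`` is lazy.

```python
import itertools


def infinite_stream():
    """Hypothetical infinite dataset: yields (x, y) examples forever."""
    for i in itertools.count():
        yield (i, i % 2)


def batches(stream, batch_size):
    """Lazily group any iterable (finite or not) into lists of batch_size."""
    iterator = iter(stream)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def pairs(stream):
    """Yield non-overlapping pairs of consecutive examples."""
    iterator = iter(stream)
    # Both arguments are the same iterator, so zip consumes two items per pair.
    return zip(iterator, iterator)


first_two_batches = list(itertools.islice(batches(infinite_stream(), 3), 2))
first_pair = next(iter(pairs(infinite_stream())))
```

Nothing here ever asks the stream for its length, which is what lets the same code serve finite and infinite datasets.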
Another question to consider is the following: how tightly should the dataset
interface integrate with Theano? Do we want to be able to store data as shared
variables or just have an option for that? Theano + GPU constrains what we can
do (in terms of sizes, buffering, etc.): these are things we need to think
about, but it's not clear whether we should aim to build them into the
interface.

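One way to keep buffering out of the interface is to put it in a wrapper. The sketch below is plain Python and does not touch Theano; ``BufferedReader`` and its attributes are hypothetical names, and the refill step only marks where a shared variable might be updated in a Theano-backed version.

```python
class BufferedReader(object):
    """Hypothetical wrapper keeping at most buffer_size examples in memory."""

    def __init__(self, example_iterator, buffer_size):
        self.iterator = iter(example_iterator)
        self.buffer_size = buffer_size
        self.buffer = []
        self.n_refills = 0

    def _refill(self):
        # In a Theano setting, this is where a shared variable holding the
        # current chunk could be overwritten (an assumption, not shown here).
        self.buffer = []
        for example in self.iterator:
            self.buffer.append(example)
            if len(self.buffer) == self.buffer_size:
                break
        self.n_refills += 1

    def get_next_example(self):
        if not self.buffer:
            self._refill()
        if not self.buffer:
            raise EOFError('dataset exhausted')
        return self.buffer.pop(0)


reader = BufferedReader(range(5), buffer_size=2)
values = [reader.get_next_example() for _ in range(5)]
```

The caller only sees ``get_next_example``; the buffer size can then be tuned to the device without changing the dataset specification.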
Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?

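A minimal sketch of that option, assuming the four descriptors above form a closed set. ``STANDARD_VIEWS`` and ``ViewedDataset`` are hypothetical names, and raising ``ValueError`` on an unknown descriptor is a choice made for this illustration only.

```python
STANDARD_VIEWS = ('classification', 'regression', 'multi-label',
                  'density_estimation')


class ViewedDataset(object):
    """Hypothetical dataset holding inputs x and targets y as lists."""

    def __init__(self, x, y, view_type='classification'):
        self.x = x
        self.y = y
        self.set_view(view_type)

    def set_view(self, view_type):
        # Reject descriptors outside the standard set.
        if view_type not in STANDARD_VIEWS:
            raise ValueError('unknown view type: %r' % (view_type,))
        self.view_type = view_type

    def get_example(self, example_index):
        if self.view_type == 'density_estimation':
            # Unsupervised view: targets are hidden.
            return self.x[example_index]
        # Supervised views return an (input, target) pair.
        return self.x[example_index], self.y[example_index]


data = ViewedDataset(x=[[0.0, 1.0], [1.0, 0.0]], y=[1, 0])
supervised = data.get_example(0)
data.set_view('density_estimation')
unsupervised = data.get_example(0)
```

One trade-off worth noting: because set_view mutates the object, two consumers cannot hold different views of the same dataset instance at once. Returning a new view object would avoid that, at the cost of a slightly larger interface.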
There is then the question of how to approach the design of a Dataset class from
an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
Dataset that doesn't implement any methods except a few setters/getters. The reason
to list the methods that way is to have a common 'specification': classes
that inherit from Dataset need not implement every single method (only the ones
that are relevant) and can obviously implement other methods as appropriate. A
common specification (as abstract as it might be) gives every dataset the same
method names and signatures, which would make our code clearer and cleaner.

An example of what I (Dumi) am thinking in terms of concrete API:

.. code-block:: python

    class Dataset(object):
        def __init__(self):
            self.type = None
            self.in_memory = None
            self.inputs = None  # list of filepaths, or objects in memory, or...
            self.outputs = None
            # Defaults for the attributes managed by the setters below.
            self.view_type = None
            self.n_classes = None
            self.batch_size = None

        def get_example(self, example_index):
            raise NotImplementedError()

        def get_next_example(self):
            raise NotImplementedError()

        def get_batch(self, batch_index):
            raise NotImplementedError()

        def get_next_batch(self):
            raise NotImplementedError()

        def get_slice(self, slice_object):
            raise NotImplementedError()

        def set_view(self, view_type):
            # Changing the view resets n_classes; call set_n_classes afterwards.
            self.view_type = view_type
            self.n_classes = None

        def set_n_classes(self, n_classes):
            self.n_classes = n_classes

        def set_batch_size(self, batch_size):
            self.batch_size = batch_size

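To illustrate the "implement only what is relevant" idea, here is a sketch of a subclass that supports indexing and slicing but not streaming. The base class is a trimmed copy of the one above so the example runs on its own; ``InMemoryDataset`` is a hypothetical name.

```python
class Dataset(object):
    """Trimmed copy of the specification, so this sketch runs alone."""

    def get_example(self, example_index):
        raise NotImplementedError()

    def get_next_example(self):
        raise NotImplementedError()

    def get_slice(self, slice_object):
        raise NotImplementedError()


class InMemoryDataset(Dataset):
    """Hypothetical subclass: implements only indexing and slicing."""

    def __init__(self, inputs, outputs):
        self.in_memory = True
        self.inputs = inputs
        self.outputs = outputs

    def get_example(self, example_index):
        return self.inputs[example_index], self.outputs[example_index]

    def get_slice(self, slice_object):
        return self.inputs[slice_object], self.outputs[slice_object]


toy = InMemoryDataset(inputs=[[0, 0], [0, 1], [1, 0], [1, 1]],
                      outputs=[0, 1, 1, 0])
example = toy.get_example(2)
first_half = toy.get_slice(slice(0, 2))

# Methods left unimplemented fall through to the base class.
try:
    toy.get_next_example()
    streaming_supported = True
except NotImplementedError:
    streaming_supported = False
```

A learner can therefore probe for a capability by calling the method and catching NotImplementedError, although an explicit capability flag might be friendlier.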
You will note that there is no notion of train/valid/test in this class: I think we should
just have a train dataset, a valid one and a test one instead, or (if it's in one
big file or infinite stream) just handle the split ourselves (via slicing, for
instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
specification does not preclude more fine-grained 'splitting' of the data.

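Handling the split via slicing could be as simple as the sketch below. ``split_by_slicing`` and its default fractions are assumptions made for illustration; the resulting slice objects could be passed to a ``get_slice`` method like the one in the specification.

```python
def split_by_slicing(n_examples, train_fraction=0.5, valid_fraction=0.25):
    """Return (train, valid, test) slice objects covering range(n_examples).

    Whatever is left after the train and valid parts goes to test.
    """
    n_train = int(n_examples * train_fraction)
    n_valid = int(n_examples * valid_fraction)
    train = slice(0, n_train)
    valid = slice(n_train, n_train + n_valid)
    test = slice(n_train + n_valid, n_examples)
    return train, valid, test


examples = list(range(8))
train, valid, test = split_by_slicing(len(examples))
train_set = examples[train]
valid_set = examples[valid]
test_set = examples[test]
```

The split lives entirely outside the dataset class, which is the point of the argument above. For an infinite stream, only the train and valid boundaries would be meaningful.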
A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    class MNIST(Dataset):
        def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
            Dataset.__init__(self)
            self.type = 'standard_xy'
            self.in_memory = True
            # Assumption: each path names a .npy array; several files are
            # concatenated along the first (example) axis.
            self.inputs = numpy.concatenate([numpy.load(f) for f in inputs])
            self.outputs = numpy.concatenate([numpy.load(f) for f in outputs])
            self.set_view('classification')
            self.set_n_classes(10)
            self.set_batch_size(20)
            self.n_batches = self._compute_n_batches()

        def get_batch(self, batch_index):
            x, y = self._fetch_batch(batch_index)
            if self.view_type == 'classification':
                return x, numpy.int32(y)
            elif self.view_type == 'density_estimation':
                return x
            else:
                raise NotImplementedError()

        def shared_data(self):
            shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
            shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
            return shared_x, T.cast(shared_y, 'int32')

        def _compute_n_batches(self):
            # Number of complete batches; a trailing partial batch is dropped.
            return len(self.inputs) // self.batch_size

        def _fetch_batch(self, batch_index):
            start = batch_index * self.batch_size
            stop = start + self.batch_size
            return self.inputs[start:stop], self.outputs[start:stop]

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that!

So we'd use it as:

.. code-block:: python

    train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
    valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

    x, y = train_mnist.get_batch(0)
    train_mnist.set_view('density_estimation')
    x = train_mnist.get_batch(0)

or

.. code-block:: python

    mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
    batches_train = range(int(mnist_data.n_batches * 0.8))
    batches_valid = range(int(mnist_data.n_batches * 0.8), mnist_data.n_batches)

    xt, yt = mnist_data.get_batch(batches_train[0])
    xv, yv = mnist_data.get_batch(batches_valid[0])


COMMENTS
~~~~~~~~

1120
27d0ef195e1d
v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1117
diff
changeset
|
180 JB asks: How about asking datasets to also provide a visualization mechanism |
27d0ef195e1d
v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1117
diff
changeset
|
181 for showing / playing individual examples from the dataset, but also other |
27d0ef195e1d
v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1117
diff
changeset
|
182 external objects that are similar to dataset examples (e.g. filters from a |
27d0ef195e1d
v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1117
diff
changeset
|
183 weight matrix that filters images). This doesn't have to be complicated, and it |
27d0ef195e1d
v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1117
diff
changeset
|
184 can be shared between datasets that exist in one modality (e.g. image datasets |
27d0ef195e1d
v2planning - added comment to dataset re: visualization
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1117
diff
changeset
|
185 can all use an image-rendering method). |
1054
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
186 |
1131
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
187 OD replies: Besides being able to display data without prior knowledge of the |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
188 kind of data inside a dataset, is there any reason to put this within the |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
189 dataset class? If not, it seems to me it may be more appropriate to have a way |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
190 for the dataset to describe the kind of data it holds, and keep the |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
191 visualization code separate from the dataset itself. It would make it easier |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
192 in particular to try different visualization systems, and description of the |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
193 data may turn out to be useful for other reasons (however, it also means we'd |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
194 need to come up with a good way to describe data, which could prove |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
195 difficult). |
d9550c27a192
dataset: Reply about being able to plot samples from a dataset
Olivier Delalleau <delallea@iro>
parents:
1127
diff
changeset
|
196 |
1054
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
197 JB asks: What may be passed as argument to the functions in Dataset, and what |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
198 can be expected in return? Are there side effects (e.g. on the state of the |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
199 Dataset) associated with any of the functions? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
200 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
201 JB asks: What properties are part of the Dataset API? What possible types can |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
202 they have, are they expected to be read-only or writeable? What do they mean? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
203 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
204 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
205 JB asks: What is a view? Does set_view change the Dataset or return a new |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
206 Dataset with a certain view of the original (in which case call it get_view)? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
207 Does the view imply the types of the return-value of functions like |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
208 get_batch? What is the difference between the view and the subclasses of |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
209 Dataset in PyML? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
210 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
211 JB asks: Do container formats (I'm thinking of HDF5) offer features for fast |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
212 retrieval that we would like to expose via this interface? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
213 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
214 JB asks: How would you recommend using this sort of dataset in a boosting |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
215 algorithm where points need to be re-weighted? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
216 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
217 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
218 JB asks: Do we want to provide for the possibility of feedback that modifies the |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
219 dataset? For example, curriculum learning might be adaptive in this sense, or |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
220 if we wanted to provide a virtual world for an agent as a dataset then we need |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
221 to provide 'actions' to get the next batch. Could this be done in the current |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
222 API? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
223 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
224 |
1082
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
225 Field names and attributes |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
226 ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
1054
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
227 |
1082
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
228 OD: One important question is how to handle fields' names and characteristics. |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
229 For instance, it can be useful to know that the 3rd input field represents a |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
230 number of fingers, and is a non-negative discrete field whose numeric value is |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
231 meaningful (compared to, say, an integer index that would correspond to an |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
232 animal's category). We mentioned metadata during the meeting, but we did not |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
233 get into its details: that may be the place to put this kind of thing. |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
234 |
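As a rough illustration of the field-metadata idea above, the sketch below attaches a small descriptor to each field. The names ``FieldInfo``, ``kind``, ``ordered`` and ``numeric_fields`` are hypothetical and were not discussed at the meeting; they only show what such metadata might record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical per-field metadata record (not an agreed API)."""
    name: str
    kind: str                        # e.g. "count", "category", "real"
    ordered: bool                    # True when the numeric value itself is meaningful
    minimum: Optional[float] = None  # lower bound, if any


# The two examples from the text: a finger count and an animal category index.
fields = [
    FieldInfo(name="n_fingers", kind="count", ordered=True, minimum=0),
    FieldInfo(name="animal", kind="category", ordered=False),
]


def numeric_fields(infos):
    """Return the names of fields whose numeric value is meaningful."""
    return [f.name for f in infos if f.ordered]


selected = numeric_fields(fields)
```

A learner could use such a record to decide, for example, whether a field should be one-hot encoded or fed in as a number.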
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
235 |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
236 Freeing memory |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
237 ~~~~~~~~~~~~~~ |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
238 |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
239 OD: It is sometimes useful to be able to free memory used by previous |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
240 computations. A typical example is when you load in memory the original |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
241 dataset, then perform various processing steps, ending with a new dataset that |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
242 you also store in memory before feeding it to the learner. Unless you very |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
243 carefully design your code to avoid it, your original dataset will still |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
244 remain in memory (as well as maybe the results of some computations performed |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
245 along the way). So there may be a use for a `clear()` method that would be |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
246 called by the topmost dataset (the one doing the final memory caching), and |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
247 would be forwarded iteratively to previous datasets so as to get back all this |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
248 wasted memory space. |
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
249 |
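A minimal sketch of how such a ``clear()`` call might propagate, assuming each dataset keeps a reference to the dataset it was built from. The class names ``Dataset``, ``Transformed`` and ``MemoryCached`` and the ``source`` attribute are hypothetical, and plain lists stand in for the real data.

```python
class Dataset:
    """Hypothetical dataset that holds its data in memory."""

    def __init__(self, data, source=None):
        self.data = data
        self.source = source  # upstream dataset this one was built from

    def clear(self):
        """Drop the data held here, then forward the call upstream."""
        self.data = None
        if self.source is not None:
            self.source.clear()


class Transformed(Dataset):
    """Intermediate processing step that stores its own result."""

    def __init__(self, source, transform):
        super().__init__([transform(x) for x in source.data], source=source)


class MemoryCached(Transformed):
    """Topmost dataset: caches the final result, then frees everything upstream."""

    def __init__(self, source, transform):
        super().__init__(source, transform)
        source.clear()  # forwarded through the whole chain of sources


raw = Dataset([1.0, 2.0, 3.0])
shifted = Transformed(raw, lambda x: x - 2.0)
final = MemoryCached(shifted, lambda x: x * 10.0)
```

After the last line only ``final`` still holds data. The memory is actually reclaimed only if nothing else references the intermediate results, which is why the explicit call is useful in the first place.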
1083
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
250 What is a mini-batch? |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
251 ~~~~~~~~~~~~~~~~~~~~~ |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
252 |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
253 This is a follow-up to the meeting's discussion about whether a mini-batch |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
254 returned by a dataset should be itself a dataset. |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
255 |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
256 OD: During the meeting I was voting in favor of a 'yes', mostly because it |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
257 made sense to me (a mini-batch is a subset of a dataset and thus should be a |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
258 dataset), but now I tend towards 'no'. The main reason is it is not clear yet |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
259 what the dataset interface will be, so that it is hard to judge whether this |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
260 is a good idea (my main concern is how much additional work would be required by |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
261 the writer of a new dataset subclass). Anyway, maybe a first thing we could |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
262 think about is what we want a mini-batch to be. I think we can agree that we |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
263 would like to be able to do something like: |
1190
9ff2242a817b
fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents:
1131
diff
changeset
|
264 |
9ff2242a817b
fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents:
1131
diff
changeset
|
265 .. code-block:: python |
9ff2242a817b
fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents:
1131
diff
changeset
|
266 |
1083
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
267 for mb in dataset.mini_batches(size=10): |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
268 learner.update(mb.input, mb.target) |
1190
9ff2242a817b
fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents:
1131
diff
changeset
|
269 |
1083
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
270 so that it should be ok for a mini-batch to be an object whose fields |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
271 (that should have the same name as those of the dataset) are numpy arrays. |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
272 More generally, we would like to be able to iterate on samples in a |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
273 mini-batch, or do random access on them, so a mini-batch should implement |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
274 __iter__ and __getitem__. |
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
275 Besides this, is there any other typical use-case of a mini-batch? In |
1086
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
276 particular, is there any reason to want an infinite mini-batch, or a very big |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
277 mini-batch that may not fit in memory? (in which case we may need to revise |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
278 our idea of what 'mini' means). Hopefully the answer to that last question is |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
279 no, as I think it would definitely keep things simpler, since we could simply |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
280 use numpy arrays (for numeric data) or lists (for anything else) to store |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
281 mini-batches' data. So I vote for 'no'. |
1083
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
282 |
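To make the proposal above concrete, here is a minimal sketch of a mini-batch that is not itself a dataset: an in-memory object with named fields plus ``__iter__`` and ``__getitem__``. The ``MiniBatch`` class and the free function ``mini_batches`` are hypothetical. Plain lists are used to keep the sketch dependency-free; numpy arrays could be substituted, since only ``len``, indexing and slicing are used.

```python
class MiniBatch:
    """Hypothetical mini-batch: named fields stored as in-memory sequences."""

    def __init__(self, **fields):
        lengths = {len(values) for values in fields.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must hold the same number of samples")
        self._fields = fields
        for name, values in fields.items():
            setattr(self, name, values)  # enables mb.input, mb.target

    def __len__(self):
        return len(next(iter(self._fields.values())))

    def __getitem__(self, i):
        """Random access: return sample i as a dict of field values."""
        return {name: values[i] for name, values in self._fields.items()}

    def __iter__(self):
        """Iterate over the samples of the mini-batch."""
        return (self[i] for i in range(len(self)))


def mini_batches(inputs, targets, size):
    """Yield consecutive mini-batches of at most `size` samples."""
    for start in range(0, len(inputs), size):
        yield MiniBatch(input=inputs[start:start + size],
                        target=targets[start:start + size])


inputs = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
targets = [0, 1, 0]
batches = list(mini_batches(inputs, targets, size=2))
```

With this shape, a loop such as ``for mb in mini_batches(...): learner.update(mb.input, mb.target)`` works without the mini-batch having to implement the full dataset interface.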
1124
0f184b5e7a3f
YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
1120
diff
changeset
|
283 YB: I agree that a mini-batch should definitely be safely assumed |
0f184b5e7a3f
YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
1120
diff
changeset
|
284 to fit in memory. That makes it at least in principle semantically |
0f184b5e7a3f
YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
1120
diff
changeset
|
285 different from a dataset. But barring that restriction, it might |
0f184b5e7a3f
YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
1120
diff
changeset
|
286 share some of the properties of a dataset. |
0f184b5e7a3f
YB: comment on minibatches for dataset.txt
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
1120
diff
changeset
|
287 |
1084
7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents:
1083
diff
changeset
|
288 A dataset is a learner |
7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents:
1083
diff
changeset
|
289 ~~~~~~~~~~~~~~~~~~~~~~ |
7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents:
1083
diff
changeset
|
290 |
1085
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
291 OD: (this is hopefully a clearer re-write of the original version from |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
292 r7e6e77d50eeb, which I was not happy with). |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
293 There are typically three kinds of objects that spit out data: |
1190
9ff2242a817b
fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents:
1131
diff
changeset
|
294 |
1085
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
295 1. Datasets that are loaded from disk or are able to generate data all by |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
296 themselves (i.e. without any other dataset as input) |
1086
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
297 2. Datasets that transform their input dataset in a way that only depends on |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
298 the input dataset (e.g. filtering samples or features, normalizing data, etc.) |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
299 3. Datasets that transform their input dataset in a way that is learned on a |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
300 potentially different dataset (e.g. PCA when you want to learn the projection |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
301 space on the training set in order to transform both the training and test |
65ac0f493830
dataset: Some clarifications on my comments
Olivier Delalleau <delallea@iro>
parents:
1085
diff
changeset
|
302 sets). |
1190
9ff2242a817b
fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents:
1131
diff
changeset
|
303 |
1085
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
304 My impression currently is that we would use dataset subclasses to handle 1 |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
305 and 2. However, 3 requires a learner framework, so you would need to have |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
306 something like a LearnerOutputDataset(trained_learner, dataset). |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
307 |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
308 Note however that 2 is a special case of 3 (where training does nothing), and |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
309 1 is a special case of 2 (where we do not care about being given an input |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
310 dataset). Thus you could decide to also implement 1 and 2 as learners wrapped |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
311 by LearnerOutputDataset. |
1084
7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents:
1083
diff
changeset
|
312 |
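The relationship between cases 2 and 3 can be sketched as follows. Only the name ``LearnerOutputDataset(trained_learner, dataset)`` comes from the text; the ``Learner`` interface (``train`` and ``compute_output``) and the two example learners are assumptions made for illustration, and plain lists of numbers stand in for datasets.

```python
class Learner:
    """Hypothetical minimal learner interface (an assumption, not an agreed API)."""

    def train(self, dataset):
        return self  # by default, training does nothing (case 2)

    def compute_output(self, sample):
        raise NotImplementedError


class LearnerOutputDataset:
    """Dataset whose samples are a trained learner's outputs on another dataset."""

    def __init__(self, trained_learner, dataset):
        self.learner = trained_learner
        self.dataset = dataset

    def __iter__(self):
        return (self.learner.compute_output(x) for x in self.dataset)


class Double(Learner):
    """Case 2: a fixed transformation, so train() is inherited as a no-op."""

    def compute_output(self, sample):
        return 2.0 * sample


class Center(Learner):
    """Case 3: the shift is learned on one dataset and reused on others."""

    def __init__(self):
        self.mean = 0.0

    def train(self, dataset):
        values = list(dataset)
        self.mean = sum(values) / len(values)
        return self

    def compute_output(self, sample):
        return sample - self.mean


train_set = [1.0, 2.0, 3.0]
test_set = [4.0, 6.0]

centering = Center().train(train_set)  # learns mean == 2.0
centered_train = list(LearnerOutputDataset(centering, train_set))
centered_test = list(LearnerOutputDataset(centering, test_set))
doubled_test = list(LearnerOutputDataset(Double().train(train_set), test_set))
```

Both transformations are plugged in the same way; the only difference is whether ``train`` does any work.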
1085
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
313 The main advantages I find in this approach (that I have been using at |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
314 Ubisoft) are: |
1190
9ff2242a817b
fix rst syntax errors/warnings
Frederic Bastien <nouiz@nouiz.org>
parents:
1131
diff
changeset
|
315 |
1085
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
316 - You only need to learn how to subclass the learner class. The only dataset |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
317 class is LearnerOutputDataset, which you could just name Dataset. |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
318 - You do not have different ways to achieve the same result (having to figure |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
319 out which one is most appropriate). |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
320 - Upgrading code from 2 to 3 is more straightforward. Such a situation can |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
321 happen e.g. if you write some code that normalizes your input dataset |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
322 (situation 2), then realize later you would like to be able to normalize new |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
323 datasets using the same parameters (e.g. same shift & rescaling), which |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
324 requires situation 3. |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
325 - It can make your life easier when thinking about how to plug things together |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
326 (something that has not been discussed yet), because the interfaces of the |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
327 various components are less varied. |
1084
7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents:
1083
diff
changeset
|
328 |
1085
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
329 I am not saying that we should necessarily do it this way, but I think it is |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
330 worth at least keeping in mind this close relationship between simple |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
331 processing and learning, and thinking about the benefits and drawbacks |
de456561ec40
dataset: Rewrote my rambling about the links between dataset and learner
Olivier Delalleau <delallea@iro>
parents:
1084
diff
changeset
|
332 of keeping them separate in the class hierarchy. |
1089
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
333 |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
334 RP: I actually like this idea of having the dataset implement the same |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
335 interface as the learner (or actually a subset of the interface). |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
336 I hope people decide to do this. |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
337 |
1091
319de699fb67
dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents:
1089
diff
changeset
|
338 Support for shared variables |
319de699fb67
dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents:
1089
diff
changeset
|
339 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
1089
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
340 |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
341 RP asks: What is the status of having the dataset support copying data |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
342 on the GPU (by storing data in shared variables)? Have you decided to |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
343 include this feature or not? I think that the strongest selling point of |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
344 Theano is that it runs on GPU transperently, and I see this as a good |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
345 selling point for the library as well. Plus we intend to move more and |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
346 more towards running things on GPU. If the dataset object does not support |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
347 this feature we will need to find hacks around it .. |
f15216356522
Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ?
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1086
diff
changeset
|
348 |
1091
319de699fb67
dataset: Reply to GPU question
Olivier Delalleau <delallea@iro>
parents:
1089
diff
changeset
|
OD: I have like zero experience with GPU so hopefully someone else can answer
this. But the way I see it, hopefully it could work by having some dataset
object that would take care of storing its input data into a shared variable.

OD (continued): After thinking a bit more about it, I am not sure that would
work. I definitely need to look at some code doing it to get a better
understanding of it, but my feeling is that you need your learner to be
written in a specific way to achieve this, in which case it may be up to the
learner to take its input data and store it into a shared variable.

RP comment: Yes, the dataset object alone cannot handle this; the issue lies
somewhere between the dataset and the learner. In other words, every time you
change the data you need to recompile your Theano function. So the learner
cannot just get data from the dataset, it needs to get a shared variable. The
learner should also be aware of when the dataset changes, so that it can
recompile its internal functions. I'm not sure which is the best way to do
this. My personal feeling is that the dataset should be part of the learner.
The learner should provide a function use_dataset (or replace_dataset). When
this function is called, all the Theano functions in the learner get
recompiled based on shared variables that the dataset object provides. It
fits very well in the framework that I have in mind, which was scattered
around learner.txt and some of my previous emails. I think it shares a lot
with James's concepts, since it follows quite closely the concepts behind
Theano.

OD asks: OK, so why would the dataset have to be responsible for providing a
shared variable? Why wouldn't the learner just create this shared variable
internally and copy into it the data provided by the dataset?

RP replies: Sure, the learner could take care of all this. Note though that the
learner would then have to divide the dataset into chunks that fit in GPU
memory (in the case of a large dataset) and update the shared variables
according to the current chunk. Personally, I feel that all this data
division, management and so on should be done by the dataset. It feels more
natural that way. For example, assume you have a dataset that is composed of a
time series and some static data (carre-tech heart beat data is a good
example). The static data is small enough that you could always store it on
the GPU, and you would only need to split the time series. For the learner to
do this (since it gets the same interface from any dataset object), it would
need a special case of the form "if <this case> then ...", whereas for the
dataset it is just a different class. But I'm happy to have all this GPU stuff
sent to the learner as well if everybody else believes that is better.

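The division RP describes could look roughly like the sketch below. It is
only an illustration under stated assumptions: plain Python lists stand in for
GPU shared variables, and the class name ``ChunkedTimeSeriesDataset``, its
``chunks()`` method and the ``max_rows`` budget are hypothetical names, not
part of any agreed API. The static data is kept whole while the time series is
served in pieces that respect the budget:

```python
class ChunkedTimeSeriesDataset(object):
    """Hypothetical dataset: small static data plus a large time series."""

    def __init__(self, static, series, max_rows):
        # `static` is assumed small enough to stay on the device at all times.
        self.static = list(static)
        self.series = list(series)
        # `max_rows` stands in for "what fits in GPU memory" (an assumption).
        self.max_rows = max_rows

    def chunks(self):
        """Yield (static, series_chunk) pairs covering the whole series."""
        for start in range(0, len(self.series), self.max_rows):
            yield self.static, self.series[start:start + self.max_rows]


dataset = ChunkedTimeSeriesDataset(
    static=[1.0, 2.0], series=range(10), max_rows=4)
chunk_sizes = [len(chunk) for _, chunk in dataset.chunks()]
```

With this split, a learner only ever iterates over ``chunks()``; the decision
about what to keep resident and what to split stays inside the dataset class.
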
FB comment: I don't understand why you would need to recompile the Theano
function. There are two cases. In the first, the data is in a shared variable:
you can directly change the data in the shared variable without recompiling
the Theano function. In the second, the dataset is in an ordinary Theano
variable: the first step in the Theano function will then be to transfer the
dataset to the GPU before computation. If the data changes at each call, that
will be as efficient as changing the data manually every time in the shared
variable.

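FB's first case can be illustrated without Theano itself. In the sketch below,
``Shared`` is a hypothetical stand-in that only mimics the ``get_value`` /
``set_value`` methods of a Theano shared variable, and ``Learner``,
``use_dataset`` and ``n_compilations`` are illustrative names. The point it
tries to show is that a function bound to the container, rather than to the
data, does not need to be rebuilt when the contents change:

```python
class Shared(object):
    """Hypothetical stand-in for a Theano shared variable (not Theano)."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


class Learner(object):
    """Hypothetical learner whose function is bound to a shared container."""

    def __init__(self):
        self.n_compilations = 0
        self._mean_fn = None

    def use_dataset(self, shared_data):
        # "Compile" once: the function closes over the container, not the data.
        self.n_compilations += 1

        def mean_fn():
            values = shared_data.get_value()
            return sum(values) / float(len(values))

        self._mean_fn = mean_fn

    def mean(self):
        return self._mean_fn()


data = Shared([1.0, 2.0, 3.0])
learner = Learner()
learner.use_dataset(data)
first = learner.mean()
data.set_value([10.0, 20.0])  # swap in new data; use_dataset is not called again
second = learner.mean()
```

Here ``use_dataset`` is only needed when the learner is pointed at a different
container; replacing the contents of the same container is enough otherwise.
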
AB: I have an idea about this which kind of fits in the "building a
Theano Op" thing that we talked about at the last meeting.

We can just build a Theano Op that wraps dataset objects and takes
care of the details of transferring data to the GPU or otherwise.

I have a prototype interface/implementation in the shared_dataset.py
file in this directory.

OD: I like AB's approach.


Data API proposal by Olivier D
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A single sample containing multiple fields (e.g. an input and a target part)
is an object `s` that you can manipulate as follows:

.. code-block:: python

    import operator
    from functools import reduce

    # Obtain actual data stored within `s` (e.g. a numpy vector). There is no
    # guarantee that modifying the resulting data object will actually update
    # the data stored in `s`.
    data = s()
    # Create a sample that sees a field of `s`.
    input_part = s.input
    # Obtain actual input data (e.g. as a numpy vector).
    input_data = input_part()
    # Create a sample that sees the i-th element of the data stored in `s`.
    ith = s[i]
    # This should not fail.
    assert ith() == s()[i]
    # You could also select a range.
    i_to_j = s[i:j]
    assert i_to_j() == s()[i:j]
    # And actually do pretty much anything you want with __getitem__, as long
    # as the underlying data stored in the sample supports it (for instance,
    # here it should be at least a 3D tensor).
    fancy_selection = s[i, :, j:k]
    assert fancy_selection() == s()[i, :, j:k]
    # Write some value (e.g. a numpy vector) into the sample. May raise an
    # exception if the sample is in read-only mode.
    s._write(val)
    # Shortcut to write data into a field (same as `s.input._write(val)`).
    s.input = val
    # Basic mathematical operators.
    s *= val
    s += val
    s -= val
    s /= val
    # Replace a field. Note that this is different from `s.input = val`
    # because here `new_input` is a sample, not a numeric value: the current
    # `s.input` will not be written to, instead it makes `s.input` point
    # towards a different sample. This may lead to confusion, so a different
    # syntax may be better (e.g. s._set_field('input', new_input)).
    s.input = new_input
    # The equality of two samples is defined by the equality of their
    # underlying data.
    def __eq__(self, other):
        return self() == other()
    # Iterate on fields (open question: should they be ordered?).
    fields = dict([(name, sample) for name, sample in s._iter_fields()])
    assert fields['input'] == s.input
    # Iterating on a sample yields samples that see consecutive elements.
    for sample, value in zip(s, s()):
        assert sample() == value
    # The length of a sample is the same as that of its underlying data.
    assert len(s) == len(s())
    # The shape of a sample is the same as that of its underlying data.
    # Note that it only makes sense for tensor-like data.
    assert s._shape() == s().shape
    # The size of a sample is the product of its shape elements.
    assert s._size() == reduce(operator.__mul__, s._shape())

All sample methods should start with '_', to differentiate them from the
sample's fields. This is a bit awkward, but I like the `sample.field` syntax
compared to something like `sample.get_field('field')`, which makes code less
readable, especially when combining with sub-fields, e.g. `sample.input.x1`
vs. `sample.get_field('input').get_field('x1')`.

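One possible way to get the `sample.field` syntax while keeping methods and
fields apart is sketched below. This is a minimal illustration, not the
proposed implementation: the class name ``Sample``, its constructor arguments
and the use of plain lists instead of numpy arrays are all assumptions. It
relies on Python only calling ``__getattr__`` when normal attribute lookup
fails, and on reserving every name starting with '_' for methods:

```python
class Sample(object):
    """Hypothetical sketch of a sample with fields (not pylearn code)."""

    def __init__(self, data, fields=None):
        self._data = data
        self._fields = dict(fields or {})

    def __call__(self):
        # s() returns the underlying data.
        return self._data

    def __getattr__(self, name):
        # Only reached when normal lookup fails. Names starting with '_'
        # are reserved for methods, so they are never treated as fields.
        if name.startswith('_') or name not in self._fields:
            raise AttributeError(name)
        return self._fields[name]

    def __getitem__(self, key):
        # s[i] and s[i:j] are samples that see part of the data.
        return Sample(self._data[key])

    def __len__(self):
        return len(self._data)

    def __eq__(self, other):
        # Equality is defined by equality of the underlying data.
        return self() == other()

    def _iter_fields(self):
        # Sorted by name here; whether fields are ordered is an open question.
        return iter(sorted(self._fields.items()))


x = Sample([1.0, 2.0, 3.0])
y = Sample([0])
s = Sample([1.0, 2.0, 3.0, 0], fields={'input': x, 'target': y})
```

With these definitions ``s.input()`` returns the input data and
``s._iter_fields()`` lists the fields, so a field could even be named
``iter_fields`` without clashing with the method.
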
The extension from sample to dataset is actually to use the same class, but
with the convention that the first "dimension" in the data seen by the dataset
corresponds to the samples' indices in the dataset.

482 .. code-block:: python |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
483 |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
484 # Return data stored in dataset `d` (e.g. a numpy matrix). |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
485 data = d() |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
486 # Return the i-th sample in the dataset. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
487 s = d[i] |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
488 # Data should match! |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
489 assert data[i] == s() |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
490 # Return a subset of the dataset. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
491 sub_data = d[i:j] |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
492 # Advanced indexing. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
493 sub_data = d[some_list_of_indices] |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
494 # Dataset that sees the input part only. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
495 input_part = d.input |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
496 # Dataset such that its i-th element is data[i][something] (see the sample |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
497 # examples for what `something` may be). |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
498 some_sub_data = d[:, something] |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
499 # The following should not fail. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
500 assert d[i, something] == d[i][something] # == some_sub_data[i] |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
501 # You can also write into a dataset. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
502 d._write(val) |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
503 d.input = val |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
504 # Center dataset in-place (requires `d` not to be read-only). |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
505 d -= numpy.mean(d()) |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
506 # The length of a dataset is its number of samples. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
507 n_samples = len(d) |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
508 # The width of a dataset (if it exists) is the length of its samples. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
509 assert d._shape()[1] == len(d[0]) # == d._width() (shortcut) |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
510 # Iterating on a dataset yields individual samples. |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
511 for i, sample in enumerate(d): |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
512 assert d[i] == sample |
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
513 # It is allowed for a dataset to hold heterogeneous data. For instance |
514 # you could have |
515 len(d.data1) != len(d.data2) |
516 # A sample in the dataset is not required to inherit all the dataset's |
517 # fields, for instance in the case above you could decide that the dataset |
518 # sees the same data as its first sub-dataset, i.e. |
519 d[i] == d.data1[i] |
520 |
521 There remain some fuzzy points. For instance, are fields allowed to overlap? |
522 (e.g. so that one could write both s.pos_3d to get the 3d vector coordinate of |
523 sample s, and s.x to get the x coordinate without being forced to go through |
524 s.pos_3d.x). What are the fields of s[i:j] if the (i, j) range does not |
525 exactly match a subset of fields? How do we handle metadata? (e.g. if we want |
526 to describe the dataset to say it contains 28x28 image data, so that an |
527 algorithm for filter visualization can automatically deal with it) |
528 |
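One way to picture overlapping fields is a sample whose field names map to
possibly overlapping slices of a single underlying vector. The sketch below is
only an illustration of that idea: the `Sample` class and its `fields` mapping
are hypothetical and not part of the proposal.

```python
import numpy


class Sample(object):
    """Hypothetical sample: named fields are slices of one data vector."""

    def __init__(self, data, fields):
        self._data = numpy.asarray(data)
        # `fields` maps a field name to a slice; slices are allowed to overlap.
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        fields = self.__dict__.get('_fields', {})
        if name in fields:
            return self._data[fields[name]]
        raise AttributeError(name)


s = Sample([1.0, 2.0, 3.0, 7.0],
           {'pos_3d': slice(0, 3), 'x': slice(0, 1), 'label': slice(3, 4)})
```

With this layout `s.x` and `s.pos_3d` share the first element, so the x
coordinate is reachable without going through `s.pos_3d`.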
529 Now, on to some use cases. |
530 |
531 .. code-block:: python |
532 |
533 # Mini-batches. |
534 mb_dataset = d._minibatches(batch_size=5) |
535 # The mini-batch dataset views samples that are mini-batches. |
536 assert mb_dataset[0]() == d[0:5]() # As long as len(d) >= 5. |
537 |
538 # Shuffling samples. |
539 random_indices = range(len(d)) |
540 random_indices = numpy.random.permutation(random_indices) |
541 shuffled_dataset = d[random_indices] |
542 |
543 # Typical linear regression with stochastic gradient descent. |
544 n_inputs = d.input._width() |
545 n_targets = d.target._width() |
546 weights = numpy.zeros((n_inputs, n_targets)) |
547 bias = numpy.zeros(n_targets) |
548 mb_dataset = d._minibatches(batch_size=10) |
549 # Note: it is important to get the number of inputs / targets |
550 # before converting to minibatches, because |
551 # mb_dataset.input._width() == 10 |
552 # since this is the length of a minibatch matrix. However you |
553 # could still do the following, which is less readable: |
554 # n_inputs = mb_dataset.input._shape()[2] |
555 # You could also wait until you see the first sample to create |
556 # the parameters (this would actually be a better way to do it, since |
557 # it avoids calling the _width method). |
558 for input, target in izip(mb_dataset.input, mb_dataset.target): |
559 cost = (numpy.dot(input(), weights) + bias - target())**2 |
560 # Update weights and bias depending on cost.... |
561 |
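To make the regression loop above concrete, here is a minimal sketch in which
plain numpy arrays stand in for the dataset fields. The `minibatches` helper,
the synthetic data, and the learning rate are assumptions made for
illustration; the proposal itself leaves the update step unspecified.

```python
import numpy


def minibatches(inputs, targets, batch_size):
    """Yield consecutive (input, target) mini-batches (hypothetical helper)."""
    for start in range(0, len(inputs), batch_size):
        yield (inputs[start:start + batch_size],
               targets[start:start + batch_size])


rng = numpy.random.RandomState(0)
inputs = rng.uniform(-1, 1, size=(200, 3))
true_weights = numpy.array([[1.0], [-2.0], [0.5]])
targets = numpy.dot(inputs, true_weights) + 0.3  # Noise-free linear targets.

# Read the widths before batching, as recommended in the comments above.
n_inputs, n_targets = inputs.shape[1], targets.shape[1]
weights = numpy.zeros((n_inputs, n_targets))
bias = numpy.zeros(n_targets)
learning_rate = 0.1  # Assumed value.

for epoch in range(200):
    for x, y in minibatches(inputs, targets, batch_size=10):
        error = numpy.dot(x, weights) + bias - y
        # Gradient of the mean squared error over the mini-batch.
        weights -= learning_rate * 2 * numpy.dot(x.T, error) / len(x)
        bias -= learning_rate * 2 * error.mean(axis=0)
```

Because the targets are noise-free, the learned `weights` and `bias` should
approach `true_weights` and 0.3.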
562 A few more points: |
563 - Infinite datasets could be used (would just need to define a convention |
564 on what __len__ should do). |
565 - It is also ok to have datasets that do not support random access (so the |
566 only way to access samples is through iteration). |
567 - Ideally, data should be deterministic (i.e. __call__() should always |
568 return the same thing). It would probably be up to the user to be super |
569 careful if they decide to use a non-deterministic dataset. |
1338
91637815b7ca
Added a comment on the dataset vs. task issue
Olivier Delalleau <delallea@iro>
parents:
1337
diff
changeset
|
570 - About the "task vs. dataset" distinction. This could be achieved by |
571 associating to a task the names of the fields it requires (e.g. "input" |
572 and "target" for the regression task), and if the dataset does not |
573 already define these fields, using a dataset wrapper that does so |
574 (saying for instance that "input" is the concatenation of "x1" and "x2", |
575 and "target" is "y", for a dataset whose fields are x1, x2 and y). |
1337
7dfc3d3052ea
Added proposal for dataset API as discussed on pylearn-dev
Olivier Delalleau <delallea@iro>
parents:
1190
diff
changeset
|
576 |
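The wrapper idea in the last point could look like the following sketch.
`FieldWrapper` and the dictionary-of-arrays dataset are hypothetical stand-ins,
used only to show how the field names a task requires could be mapped onto the
fields a dataset already has.

```python
import numpy


class FieldWrapper(object):
    """Hypothetical wrapper exposing task fields built from dataset fields."""

    def __init__(self, fields, mapping):
        self._fields = fields    # Field name -> 2D array (samples x width).
        self._mapping = mapping  # Task field -> list of source field names.

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        mapping = self.__dict__.get('_mapping', {})
        if name not in mapping:
            raise AttributeError(name)
        # Concatenate the source fields column-wise.
        return numpy.hstack([self._fields[f] for f in mapping[name]])


raw = {'x1': numpy.ones((4, 2)), 'x2': numpy.zeros((4, 3)),
       'y': numpy.arange(4.0).reshape(4, 1)}
task = FieldWrapper(raw, {'input': ['x1', 'x2'], 'target': ['y']})
```

Here `task.input` is the concatenation of `x1` and `x2`, and `task.target` is
`y`, matching the regression example given in the text.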
1339
158493f8dff9
comment on dataset proposal by Olivier
Razvan Pascanu <r.pascanu@gmail.com>
parents:
1338
diff
changeset
|
577 |
578 RP comments: |
579 - I like this approach. I think having overlapping fields might be useful. |
580 I would add that I was thinking of a way to look at one's results. It is |
581 something I've been faced with, say you run 500 jobs and then you want to |
582 understand those jobs' results. Looking just at the best performing seems a waste, and |
583 there is a lot more information you can extract from your results if you are |
584 able to generate certain plots or statistics. To do this you would need to |
585 get the data in ipython (or something quite similar) where you have available |
586 the needed functions to plot different things, generate different tables. The |
587 point that I was trying to make is that you can get those results in |
588 something that has this very API that Olivier described. This way both |
589 your input data and your results will be in the same form and whatever |
590 visualization functions you have for your results you can use on your data as |
591 well. For this you would need a bit more flexibility, in the sense that if |
592 you have some data d, you should be able to put constraints on it, like |
593 d.some_field == 5 means all entries in d that have some_field == 5, or |
594 d.some_field > 5. You would also not use psql anymore but this console, |
595 which would collect the results for you from sql, and give them to you as |
596 a data object. |
597 |
1340 | 598 OD replies: Actually this should be doable with (almost) what I wrote above, |
599 due to the way numpy redefines ==, >, etc. (which btw should break some of my | |
600 assertions above, since I had forgotten about this). If you replace e.g. my | |
601 implementation of __eq__ above by the following: | |
602 | |
603 .. code-block:: python | |
604 | |
605 def __eq__(self, other): | |
606 return other == self() | |
607 | |
608 Here, `self` is a dataset that represents some numpy vector data. Then whether | |
609 `other` is another dataset or a numpy vector or some scalar, this will return | |
610 a numpy boolean vector (the result of the comparison made by numpy). We may | |
611 support boolean vectors in advanced indexing, so you could do | |
612 d[d.some_field == 5] | |
613 and obtain the subset of `d` whose samples have `some_field` set to 5. | |
614 Same could be done with __lt__, __le__, etc. | |
615 |
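A minimal sketch of this idea, assuming a toy `VectorDataset` class
(hypothetical, not the proposal's implementation) that wraps a numpy vector.
Unlike the one-line `__eq__` above, it unwraps `other` explicitly when it is
also a dataset, and it compares the dataset itself rather than a named field.

```python
import numpy


class VectorDataset(object):
    """Toy dataset wrapping a numpy vector (hypothetical)."""

    def __init__(self, data):
        self._data = numpy.asarray(data)

    def __call__(self):
        # Calling a dataset returns its underlying data.
        return self._data

    def __eq__(self, other):
        # Unwrap `other` if it is also a dataset, then let numpy compare.
        if isinstance(other, VectorDataset):
            other = other()
        return self() == other

    def __lt__(self, other):
        if isinstance(other, VectorDataset):
            other = other()
        return self() < other

    def __getitem__(self, index):
        # numpy handles integers, slices and boolean masks alike.
        return VectorDataset(self._data[index])

    def __len__(self):
        return len(self._data)


d = VectorDataset([5, 3, 5, 8])
mask = d == 5     # Element-wise comparison: a numpy boolean vector.
subset = d[mask]  # Dataset holding only the samples equal to 5.
```

Note that once `==` returns a vector, statements such as
`assert d[i] == sample` need `numpy.all(...)` or a similar reduction whenever
samples are not scalars.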