Mercurial > pylearn
annotate doc/v2_planning/dataset.txt @ 1084:7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
author | Olivier Delalleau <delallea@iro> |
---|---|
date | Fri, 10 Sep 2010 17:06:38 -0400 |
parents | 4c00af69c164 |
children | de456561ec40 |
rev | line source |
---|---|
1002
f82093bf4405
adding learner.txt and dataset.txt in v2_planning/
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
diff
changeset
|
Discussion of Function Specification for Dataset Types
======================================================

1008
a5886b394bda
Updating with talking points from Sept. 2 discussion
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
1002
diff
changeset
|
Some talking points from the September 2 meeting:

* Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
  needs to be flexible enough to accommodate different (sub)tasks and views of
  the same underlying data.
* Datasets as probability distributions from which one can sample.
1023
fb6cae14fd07
dataset: Comment about viewing a dataset as a distribution
Olivier Delalleau <delallea@iro>
parents:
1008
diff
changeset
|
    * That's not something I would consider to be a dataset-related problem to
      tackle now: a probability distribution in Pylearn would probably be a
      different kind of beast, and it should be easy enough to have a
      DatasetToDistribution class for instance, that would take care of viewing a
      dataset as a probability distribution. -- OD
* Our specification should allow transparent handling of infinite datasets (or
  simply datasets which cannot fit in memory)
* GPU/buffering issues.
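
The DatasetToDistribution idea mentioned by OD above could be sketched roughly
as follows. This is only an illustration under stated assumptions: the
constructor arguments, the sample method and the toy data are all hypothetical,
and "distribution" here just means the empirical distribution (uniform sampling
with replacement from the stored examples).

```python
import random


class DatasetToDistribution(object):
    """Hypothetical wrapper: view a finite, indexable dataset as a distribution.

    Sampling draws stored examples uniformly at random, with replacement.
    """

    def __init__(self, examples, seed=None):
        self.examples = list(examples)
        if not self.examples:
            raise ValueError("cannot sample from an empty dataset")
        # A private generator keeps sampling reproducible for a given seed.
        self.rng = random.Random(seed)

    def sample(self, n=1):
        """Return a list of n examples drawn with replacement."""
        return [self.rng.choice(self.examples) for _ in range(n)]


# Toy (input, label) examples, made up for illustration.
toy_examples = [(0.0, 'a'), (1.0, 'b'), (2.0, 'c')]
dist = DatasetToDistribution(toy_examples, seed=0)
samples = dist.sample(5)
```

Keeping this as a separate wrapper means the dataset itself needs no sampling
method, which is the separation OD seems to be arguing for.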

Committee: DE, OB, OD, AB, PV
1030
a154c9b68239
dataset: Dumi confirmed as leader
Olivier Delalleau <delallea@iro>
parents:
1023
diff
changeset
|
Leader: DE
1047
1b61cbe0810b
A very rough draft of ideas, to kick-start things
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
1030
diff
changeset
|

Some ideas from existing ML libraries:

- PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
  PairDataSet, Aggregate. Ultimately, the learner decides
1077
5c14d2ffcbb3
dataset: Looked into a few more existing ML libraries
Olivier Delalleau <delallea@iro>
parents:
1076
diff
changeset
|
- mlpy: very primitive notions of data (simple 2D matrices)
1076
20a1af112a75
dataset: Looked into datasets from some other ML libraries
Olivier Delalleau <delallea@iro>
parents:
1054
diff
changeset
|
- PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
  SequentialDataSet, ReinforcementDataSet, ... Each class is quite
  constrained and may have a different interface.
- MDP: Seems to have restrictions on the type of data being passed around, as
  well as its dimensionality ("Input array data is typically assumed to be
  two-dimensional and ordered such that observations of the same variable are
  stored on rows and different variables are stored on columns.")
- Orange: Data matrices, with names and types associated to each column.
  Basically there seems to be only one base dataset class that contains the
  data. Data points are lists (of values corresponding to each column).
- APGL: Hard to say how they deal with data from the documentation alone.
- Monte: Data is simply numpy arrays.
- scikits.learn: Dataset is a simple container with e.g. dataset.data being
  a 2D numpy array of input features, and dataset.target the target vector.
- Shogun: Vade Retro C++! (may be worth looking into their feature concept
  though).
- Any more worth looking at?

A few things that our dataset containers should support at a minimum:

- streams, possibly infinite
- task/views of the data for different problems
- indexing & slicing
- pairs, triples, etc. of examples
- a 'distance/gram matrix' container (imagine that the data is given to you
  as a distance matrix)
- multi-dimensional time-series (again, maybe with pairs/triples, maybe
  given to you as a distance matrix over time)
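
To make the first and fourth requirements in the list above a bit more concrete,
here is a minimal sketch using plain Python generators. Nothing in it comes from
Pylearn: the function names (infinite_stream, batches, consecutive_pairs) and
the toy data are assumptions made purely for illustration.

```python
import itertools


def infinite_stream():
    """Yield (x, y) examples forever; here y is simply x squared."""
    for i in itertools.count():
        yield (i, i * i)


def batches(stream, batch_size):
    """Group any iterable, finite or not, into lists of batch_size examples."""
    iterator = iter(stream)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # the underlying stream is exhausted
        yield batch


def consecutive_pairs(stream):
    """Yield pairs of neighbouring examples: (e0, e1), (e1, e2), ..."""
    iterator = iter(stream)
    try:
        previous = next(iterator)
    except StopIteration:
        return  # empty stream: no pairs
    for current in iterator:
        yield (previous, current)
        previous = current


# Only the requested prefix of the infinite stream is ever materialized.
first_two_batches = list(itertools.islice(batches(infinite_stream(), 3), 2))
first_pairs = list(itertools.islice(consecutive_pairs(infinite_stream()), 2))
```

Because everything is lazy, the same helpers work for a dataset that does not
fit in memory; indexing and slicing, on the other hand, would need a different
mechanism for a true stream.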

Another question to consider is the following: how tightly should the dataset
integrate with Theano? Do we want to be able to store data as shared variables or
just have an option for that? Theano + GPU constrains what we can do (in terms
of sizes, buffering, etc.): these are things we need to think about, but it's not
clear whether we should aim to build them into the interface.

Task views of the data for different problems: How can we achieve this? Should
we simply have a set of standard dataset descriptors ('classification',
'regression', 'multi-label', 'density_estimation') and have a set_view method
that changes the current dataset view type?
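
As a rough illustration of the descriptor idea, the sketch below contrasts a
set_view that changes the object in place with a variant that returns a new
object sharing the same data. The class ToyDataset, the get_view method and the
toy data are hypothetical, and only three of the four descriptors are handled to
keep it short.

```python
class ToyDataset(object):
    """Hypothetical in-memory dataset holding inputs x and targets y."""

    VIEWS = ('classification', 'regression', 'density_estimation')

    def __init__(self, x, y, view_type='classification'):
        self._check(view_type)
        self.x = x
        self.y = y
        self.view_type = view_type

    def _check(self, view_type):
        if view_type not in self.VIEWS:
            raise ValueError("unknown view type: %r" % (view_type,))

    def set_view(self, view_type):
        """Mutating variant: changes what this object returns from now on."""
        self._check(view_type)
        self.view_type = view_type

    def get_view(self, view_type):
        """Non-mutating variant: a new object that shares the same data."""
        return ToyDataset(self.x, self.y, view_type)

    def get_example(self, index):
        if self.view_type == 'classification':
            return self.x[index], int(self.y[index])
        if self.view_type == 'regression':
            return self.x[index], float(self.y[index])
        return self.x[index]  # density_estimation: inputs only


data = ToyDataset(x=[[0.1, 0.2], [0.3, 0.4]], y=[1, 0])
density = data.get_view('density_estimation')  # data itself is unchanged
```

With the non-mutating variant, two learners can hold different views of the same
data at once; with set_view, whoever calls it last decides what everyone sees.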

There is then the question of how to approach the design of a Dataset class from
an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
Dataset that doesn't implement any methods except a few setters/getters. The reason
to have the methods listed that way is to have a common 'specification', but classes
that inherit from Dataset need not implement every single method (only the ones
that are relevant) and can obviously implement other methods as appropriate. The
reason to have a common specification (as abstract as it might be) is to, well,
have a common specification that would make our code clearer and cleaner.

An example of what I (Dumi) am thinking in terms of concrete API:

class Dataset(object):
    def __init__(self):
        self.type = None
        self.in_memory = None
        self.inputs = None  # list of filepaths, or objects in memory, or...
        self.outputs = None
        # Set through the setters below; initialized so they can be read safely.
        self.view_type = None
        self.n_classes = None
        self.batch_size = None

    def get_example(self, example_index):
        raise NotImplementedError()

    def get_next_example(self):
        raise NotImplementedError()

    def get_batch(self, batch_index):
        raise NotImplementedError()

    def get_next_batch(self):
        raise NotImplementedError()

    def get_slice(self, slice_object):
        raise NotImplementedError()

    def set_view(self, view_type):
        self.view_type = view_type
        self.n_classes = None  # changing the view resets the class count

    def set_n_classes(self, n_classes):
        self.n_classes = n_classes

    def set_batch_size(self, batch_size):
        self.batch_size = batch_size

You will note that there is no notion of train/valid/test in this class: I think we
should just have a train dataset, a valid one and a test one instead, or (if it's
in one big file or infinite stream) just handle the split ourselves (via slicing,
for instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
specification does not preclude more fine-grained 'splitting' of the data.

A concrete implementation would look like this (we would have one class per
dataset that we use, and the class declaration contains essentially everything
there is to know about the dataset):

import numpy
import theano
import theano.tensor as T

class MNIST(Dataset):
    def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
        self.type = 'standard_xy'
        self.in_memory = True
        self.inputs = inputs  # load them or create
        self.outputs = outputs
        self.set_view('classification')
        self.set_n_classes(10)
        self.set_batch_size(20)
        self.n_batches = self._compute_n_batches()

    def get_batch(self, batch_index):
        x, y = self._fetch_batch(batch_index)
        if self.view_type == 'classification':
            return x, numpy.int32(y)
        elif self.view_type == 'density_estimation':
            return x
        else:
            raise NotImplementedError()

    def shared_data(self):
        # Assumes self.inputs and self.outputs already hold the loaded arrays.
        shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
        # Targets are stored as floatX and cast to int32 for use as labels.
        return shared_x, T.cast(shared_y, 'int32')

    def _compute_n_batches(self):
        pass  # placeholder: left unspecified in this draft

    def _fetch_batch(self, batch_index):
        pass  # placeholder: left unspecified in this draft

But nothing stops you from defining get_train_batch, get_valid_batch and stuff
like that!

So we'd use it as:

train_mnist = MNIST(inputs=['train_x.npy'], outputs=['train_y.npy'])
valid_mnist = MNIST(inputs=['valid_x.npy'], outputs=['valid_y.npy'])

x, y = train_mnist.get_batch(0)
train_mnist.set_view('density_estimation')
x = train_mnist.get_batch(0)

or

mnist_data = MNIST(inputs=['x.npy'], outputs=['y.npy'])
batches_train = range(int(mnist_data.n_batches*0.8))
batches_valid = range(int(mnist_data.n_batches*0.8), mnist_data.n_batches)

xt, yt = mnist_data.get_batch(batches_train[0])
xv, yv = mnist_data.get_batch(batches_valid[0])
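
The draft leaves _compute_n_batches and _fetch_batch empty, so the usage above
cannot run as written. The following is one possible way to fill them in for
data that is already in memory, using plain Python lists instead of .npy files.
The class name InMemoryDataset, the ceiling-division batch count and the toy
data are assumptions for illustration, not part of the proposal.

```python
class InMemoryDataset(object):
    """Hypothetical stand-in for a Dataset subclass with data held in memory."""

    def __init__(self, inputs, outputs, batch_size=2):
        if len(inputs) != len(outputs):
            raise ValueError("inputs and outputs must have the same length")
        self.inputs = inputs
        self.outputs = outputs
        self.batch_size = batch_size
        self.view_type = 'classification'
        self.n_batches = self._compute_n_batches()

    def _compute_n_batches(self):
        # Ceiling division, so that a final partial batch is counted too.
        return (len(self.inputs) + self.batch_size - 1) // self.batch_size

    def _fetch_batch(self, batch_index):
        if not 0 <= batch_index < self.n_batches:
            raise IndexError("batch index out of range: %d" % batch_index)
        start = batch_index * self.batch_size
        stop = start + self.batch_size
        return self.inputs[start:stop], self.outputs[start:stop]

    def get_batch(self, batch_index):
        x, y = self._fetch_batch(batch_index)
        if self.view_type == 'classification':
            return x, [int(label) for label in y]
        elif self.view_type == 'density_estimation':
            return x
        raise NotImplementedError()


# Ten toy examples with alternating labels, split 80/20 by batch index.
data = InMemoryDataset(inputs=list(range(10)),
                       outputs=[i % 2 for i in range(10)])
split = int(data.n_batches * 0.8)
batches_train = range(split)
batches_valid = range(split, data.n_batches)

xt, yt = data.get_batch(batches_train[0])
xv, yv = data.get_batch(batches_valid[0])
```

Splitting by batch index, as in the usage above, only gives a clean 80/20 split
when the number of batches divides evenly; splitting by example index would be
an alternative.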
1054
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|

COMMENTS
~~~~~~~~

JB asks: What may be passed as argument to the functions in Dataset, and what
can be expected in return? Are there side effects (e.g. on the state of the
Dataset) associated with any of the functions?
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
182 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
183 JB asks: What properties are part of the Dataset API? What possible types can |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
184 they have, are they expected to be read-only or writeable? What do they mean? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
185 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
186 |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
187 JB asks: What is a view? Does set_view change the Dataset or return a new |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
188 Dataset with a certain view of the original (in which case call it get_view)? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
189 Does the view imply the types of the return-value of functions like |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
190 get_batch? What is the difference between the view and the subclasses of |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
191 Dataset in PyML? |
a474fabd1f37
v2_planning dataset - added questions
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
1047
diff
changeset
|
192 |
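The set_view / get_view distinction raised in the question above can be made
concrete with a small sketch. Everything in it is hypothetical: `ToyDataset`,
its `view` attribute and the bodies of both methods are assumptions made for
illustration only, not part of any agreed Dataset API. Plain lists are used so
the sketch has no dependencies.

```python
class ToyDataset:
    """Hypothetical dataset: rows of values plus the names of visible columns."""

    def __init__(self, rows, columns, view=None):
        self.rows = rows                      # shared between views, never copied
        self.columns = list(columns)          # names of all columns
        self.view = list(view) if view is not None else list(columns)

    def set_view(self, names):
        """In-place variant: changes what *this* object returns from now on."""
        self.view = list(names)

    def get_view(self, names):
        """Functional variant: returns a new dataset over the same rows."""
        return ToyDataset(self.rows, self.columns, view=names)

    def get_batch(self, start, stop):
        """Return rows [start, stop), restricted to the columns in the view."""
        idx = [self.columns.index(name) for name in self.view]
        return [[row[i] for i in idx] for row in self.rows[start:stop]]


rows = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
full = ToyDataset(rows, ["x0", "x1", "y"])

inputs_only = full.get_view(["x0", "x1"])    # `full` is left unchanged
batch_full = full.get_batch(0, 2)
batch_inputs = inputs_only.get_batch(0, 2)

full.set_view(["y"])                          # side effect on `full` itself
batch_after_set = full.get_batch(0, 2)
```

With the set_view variant every holder of a reference to `full` sees the
change, which is exactly the kind of side effect the first question asks
about. The get_view variant avoids that at the cost of one extra (cheap)
object per view.
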
JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
retrieval that we would like to expose via this interface?

JB asks: How would you recommend using this sort of dataset in a boosting
algorithm where points need to be re-weighted?


JB asks: Do we want to provide for the possibility of feedback that modifies the
dataset? For example, curriculum learning might be adaptive in this sense, or
if we wanted to provide a virtual world for an agent as a dataset then we would
need to provide 'actions' to get the next batch. Could this be done in the
current API?


1082
f9f72ae84313
dataset: Added a couple points we did not have time to discuss during meeting
Olivier Delalleau <delallea@iro>
parents:
1077
diff
changeset
|
Field names and attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~

OD: One important question is how to handle fields' names and characteristics.
For instance, it can be useful to know that the 3rd input field represents a
number of fingers, and is a non-negative discrete field whose numeric value is
meaningful (compared to, say, an integer index that would correspond to an
animal's category). We mentioned metadata during the meeting, but we did not
get into its details: that may be the place to put this kind of information.


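As a rough illustration of what such metadata could look like, here is a
minimal sketch. The `FieldMeta` record, its attribute names and the three
`kind` values are all assumptions invented for this example; the meeting did
not settle on any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMeta:
    """Hypothetical per-field metadata record."""
    name: str
    kind: str                     # "continuous", "discrete" or "categorical"
    non_negative: bool = False

    @property
    def numeric_value_is_meaningful(self):
        # A category index is only a label; arithmetic on it makes no sense.
        return self.kind != "categorical"


# Two integer-valued input fields that must nevertheless be treated differently.
metadata = {
    "n_fingers": FieldMeta("n_fingers", kind="discrete", non_negative=True),
    "animal": FieldMeta("animal", kind="categorical"),
}


def fields_safe_to_normalize(meta):
    """Names of the fields whose values can be averaged, scaled, etc."""
    return [m.name for m in meta.values() if m.numeric_value_is_meaningful]


normalizable = fields_safe_to_normalize(metadata)
```

A processing step could then consult the metadata instead of guessing from
the dtype, since both fields above would be stored as integers.
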
Freeing memory
~~~~~~~~~~~~~~

OD: It is sometimes useful to be able to free memory used by previous
computations. A typical example is when you load the original dataset into
memory, then perform various processing steps, ending with a new dataset that
you also store in memory before feeding it to the learner. Unless you very
carefully design your code to avoid it, your original dataset will still
remain in memory (as well as maybe the results of some computations performed
along the way). So there may be a use for a `clear()` method that would be
called by the topmost dataset (the one doing the final memory caching), and
would be forwarded iteratively to previous datasets so as to get back all this
wasted memory space.

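The forwarding of `clear()` described above could be sketched as follows. Only
the name `clear()` comes from the text; the `Dataset` base class, `Loaded`,
`Mapped`, `MemoryCached` and the `data()` / `compute()` methods are
hypothetical names chosen for this illustration.

```python
class Dataset:
    """Hypothetical base class: a dataset optionally built on a `source`."""

    def __init__(self, source=None):
        self.source = source
        self._cache = None

    def compute(self):
        raise NotImplementedError

    def data(self):
        # Compute on first access and keep the result in memory.
        if self._cache is None:
            self._cache = self.compute()
        return self._cache

    def clear(self):
        # Release our own memory, then forward the call to earlier datasets.
        self._cache = None
        if self.source is not None:
            self.source.clear()


class Loaded(Dataset):
    """Original dataset; `loader` stands in for reading from disk."""

    def __init__(self, loader):
        super().__init__()
        self.loader = loader

    def compute(self):
        return self.loader()


class Mapped(Dataset):
    """Processing step applying `fn` to every sample of its source."""

    def __init__(self, source, fn):
        super().__init__(source)
        self.fn = fn

    def compute(self):
        return [self.fn(x) for x in self.source.data()]


class MemoryCached(Dataset):
    """Topmost dataset: keeps the final result and frees everything upstream."""

    def compute(self):
        result = self.source.data()
        self.source.clear()       # forwarded down the chain; `result` survives
        return result


raw = Loaded(lambda: list(range(5)))
doubled = Mapped(raw, lambda x: 2 * x)
shifted = Mapped(doubled, lambda x: x + 1)
final = MemoryCached(shifted)

final_data = final.data()
```

After `final.data()` returns, only the topmost dataset still holds data; the
intermediate results are recomputed on demand if someone asks for them again.
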
1083
4c00af69c164
dataset: Asking what we want from mini-batches
Olivier Delalleau <delallea@iro>
parents:
1082
diff
changeset
|
What is a mini-batch?
~~~~~~~~~~~~~~~~~~~~~

This is a follow-up to the meeting's discussion about whether a mini-batch
returned by a dataset should itself be a dataset.

OD: During the meeting I was voting in favor of a 'yes', mostly because it
made sense to me (a mini-batch is a subset of a dataset and thus should be a
dataset), but now I tend towards 'no'. The main reason is that it is not yet
clear what the dataset interface will be, so it is hard to judge whether this
is a good idea (my main concern is how much additional work would be required
of the writer of a new dataset subclass). Anyway, maybe a first thing we could
think about is what we want a mini-batch to be. I think we can agree that we
would like to be able to do something like:
    for mb in dataset.mini_batches(size=10):
        learner.update(mb.input, mb.target)
so it should be ok for a mini-batch to be an object whose fields (which
should have the same names as those of the dataset) are numpy arrays.
More generally, we would like to be able to iterate on samples in a
mini-batch, or do random access on them, so a mini-batch should implement
__iter__ and __getitem__.
Besides this, is there any other typical use-case of a mini-batch? In
particular, is there any reason to want an infinite mini-batch? (in which case
we may need to revise our idea of what 'mini' means) Hopefully the answer to
that last question is no, as I think it would definitely keep things simpler,
since we could simply use numpy arrays (for numeric data) or lists (for
anything else) to store mini-batches' data. So I vote for 'no'.

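A minimal sketch of a mini-batch along these lines is given below. Only
`mini_batches(size=...)`, the `input` / `target` field names, `__iter__` and
`__getitem__` come from the text; the `MiniBatch` and `ToyDataset` classes and
the choice of returning each sample as a dict are assumptions. Lists are used
so the sketch runs without dependencies; numpy arrays slice and index the same
way here.

```python
class MiniBatch:
    """Hypothetical mini-batch: a finite set of samples stored field by field."""

    def __init__(self, **fields):
        lengths = {len(values) for values in fields.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must hold the same number of samples")
        self._names = tuple(fields)
        for name, values in fields.items():
            setattr(self, name, values)       # gives mb.input, mb.target, ...

    def __len__(self):
        return len(getattr(self, self._names[0]))

    def __getitem__(self, i):
        """Random access: sample i as a dict mapping field name to value."""
        return {name: getattr(self, name)[i] for name in self._names}

    def __iter__(self):
        return (self[i] for i in range(len(self)))


class ToyDataset:
    """Hypothetical in-memory dataset with the same field names."""

    def __init__(self, **fields):
        self._fields = fields
        self._n = len(next(iter(fields.values())))

    def mini_batches(self, size):
        # Consecutive, non-overlapping slices; the last one may be shorter.
        for start in range(0, self._n, size):
            yield MiniBatch(**{name: values[start:start + size]
                               for name, values in self._fields.items()})


dataset = ToyDataset(input=[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
                     target=[0, 1, 0])
batches = list(dataset.mini_batches(size=2))
```

Note that `MiniBatch` does not inherit from the dataset class, in line with the
'no' vote above: it only needs fields, a length, iteration and indexing.
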
1084
7e6e77d50eeb
dataset: I say the learner committee should take care of dataset as well
Olivier Delalleau <delallea@iro>
parents:
1083
diff
changeset
|
A dataset is a learner
~~~~~~~~~~~~~~~~~~~~~~

OD: This is more of a high-level comment that may or may not be relevant
depending on how we get to plug our different classes together.
In PLearn (the old C++ LISA ML library) we had *lots* of dataset subclasses
doing all sorts of fancy things, the majority of these classes taking as input
another dataset, and transforming it in some way (e.g. taking a subset of
samples, a subset of features, normalizing features, computing extra fields
given existing fields, etc.). I think right now our interface is heading in a
similar direction.
When you think about it, this kind of operation is equivalent to writing a
learner class that is trained on the input dataset, and whose output on this
same dataset is used to obtain an output dataset (note that the training phase
may do nothing, e.g. if the goal is only to filter out a predefined set of
samples).
If you push it even further, even a dataset that has no input dataset, e.g. a
dataset view of a 2D numpy matrix, can be seen as the output of a learner that
was trained on nothing and whose output is computed on nothing (but still
outputs this 2D matrix).
In the small ML library I have been using at Ubisoft, the dataset class
actually inherits from learner, based on this point of view. Actually pretty
much all objects that are plugged together to make an experiment are learners.
The main advantage is that everything has the same interface and the
"plugging" of the different parts can remain very simple. Confusion is avoided
by using the module hierarchy to ensure that objects with different behavior
have different names. Something like dataset.MatrixDataset would create a
dataset from scratch (i.e. from a numpy matrix), process.FilterSamples would
be something that does not need to be trained, but needs an input dataset, and
learner.NNet would be a usual learning algorithm that must be trained on an
input dataset, and computes an output (possibly on the same dataset, possibly
on another one).

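To make the "everything is a learner" point of view concrete, here is a
minimal sketch. It is not the Ubisoft library nor a proposed pylearn API: the
`Learner` base class, the `train` / `output` method names and `MeanPredictor`
(a trivial stand-in for something like learner.NNet) are assumptions; only the
names MatrixDataset and FilterSamples come from the text.

```python
class Learner:
    """Hypothetical common interface: train on a dataset, then produce an output."""

    def train(self, dataset=None):
        return self                           # default: nothing to learn

    def output(self, dataset=None):
        raise NotImplementedError


class MatrixDataset(Learner):
    """Trained on nothing, output computed on nothing: it just yields its rows."""

    def __init__(self, rows):
        self.rows = rows

    def output(self, dataset=None):
        return self.rows


class FilterSamples(Learner):
    """Needs an input dataset but no training: keeps the rows accepted by `keep`."""

    def __init__(self, keep):
        self.keep = keep

    def output(self, dataset):
        return [row for row in dataset if self.keep(row)]


class MeanPredictor(Learner):
    """Stand-in for a real learning algorithm: must be trained before use."""

    def __init__(self):
        self.mean = None

    def train(self, dataset):
        targets = [row[-1] for row in dataset]    # last column is the target
        self.mean = sum(targets) / len(targets)
        return self

    def output(self, dataset):
        return [self.mean for _ in dataset]


source = MatrixDataset([[1.0, 2.0], [2.0, -1.0], [3.0, 4.0]])
positive = FilterSamples(lambda row: row[-1] > 0).output(source.output())
predictions = MeanPredictor().train(positive).output(positive)
```

All three objects are plugged together through the same two methods, which is
the simplicity argument made above; the class names carry the information
about which of them actually need data or training.
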
Ok, this is getting too long. I am definitely not saying we should do this,
but I think there is a close relationship between the usual data processing
we do and the learning process, so it may be worth thinking about how to put
them together in a coherent framework. For instance, in PLearn there was
(something like) a NormalizeVMatrix (think of it as a dataset subclass), but
it could not be used in a natural way to learn the normalization parameters on
a training set (e.g. mean and std of features) and normalize another dataset.
Instead you could use (something like) a
PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having two
ways to do (almost) the same thing can be confusing.

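The normalization example can be sketched in a few lines. This is not PLearn
code: the class below only borrows the name NormalizeLearner from the text,
and its `train` / `output` methods, the use of the population standard
deviation and the fallback to 1.0 for constant features are all assumptions.

```python
import statistics


class NormalizeLearner:
    """Hypothetical learner whose parameters are per-feature mean and std."""

    def train(self, rows):
        columns = list(zip(*rows))
        self.mean = [statistics.mean(col) for col in columns]
        # Population std; fall back to 1.0 so constant features do not divide by 0.
        self.std = [statistics.pstdev(col) or 1.0 for col in columns]
        return self

    def output(self, rows):
        return [[(x - m) / s for x, m, s in zip(row, self.mean, self.std)]
                for row in rows]


train_set = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
test_set = [[2.0, 11.0], [6.0, 9.0]]

# Parameters are learned on the training set only, then reused on other data.
normalizer = NormalizeLearner().train(train_set)
normalized_test = normalizer.output(test_set)
```

Because the statistics live in the learner rather than in a dataset wrapper,
the same object normalizes the training set, a validation set or a test set
without recomputing anything.
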