pylearn: doc/v2_planning/dataset.txt annotate

annotate doc/v2_planning/dataset.txt @ 1084:7e6e77d50eeb

dataset: I say the learner committee should take care of dataset as well

author	Olivier Delalleau <delallea@iro>
date	Fri, 10 Sep 2010 17:06:38 -0400
parents	4c00af69c164
children	de456561ec40

rev	line source
1002 f82093bf4405 adding learner.txt and dataset.txt in v2_planning/ Yoshua Bengio <bengioy@iro.umontreal.ca> parents: diff changeset	1 Discussion of Function Specification for Dataset Types
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/ Yoshua Bengio <bengioy@iro.umontreal.ca> parents: diff changeset	2 ======================================================
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/ Yoshua Bengio <bengioy@iro.umontreal.ca> parents: diff changeset	3
1008 a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	4 Some talking points from the September 2 meeting:
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	5
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	6 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	7 needs to be flexible enough to accommodate different (sub)tasks and views of
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	8 the same underlying data.
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	9 * Datasets as probability distributions from which one can sample.
1023 fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	10 * That's not something I would consider to be a dataset-related problem to
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	11 tackle now: a probability distribution in Pylearn would probably be a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	12 different kind of beast, and it should be easy enough to have a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	13 DatasetToDistribution class for instance, that would take care of viewing a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	14 dataset as a probability distribution. -- OD
1008 a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	15 * Our specification should allow transparent handling of infinite datasets (or
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	16 simply datasets which cannot fit in memory)
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	17 * GPU/buffering issues.
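As a rough illustration of OD's comment above, a DatasetToDistribution wrapper could be as small as the sketch below. It assumes the wrapped dataset supports len() and integer indexing, and that "viewing a dataset as a distribution" means sampling from its empirical distribution (uniformly, with replacement). The sample() method and the rng argument are hypothetical names, not part of any agreed API.

```python
import random


class DatasetToDistribution(object):
    """Hypothetical wrapper: the empirical distribution of a finite dataset."""

    def __init__(self, dataset, rng=None):
        # Assumption: `dataset` supports len() and integer indexing.
        self.dataset = dataset
        self.rng = rng if rng is not None else random.Random()

    def sample(self, n=1):
        # Draw n examples uniformly at random, with replacement.
        size = len(self.dataset)
        return [self.dataset[self.rng.randrange(size)] for _ in range(n)]


examples = [(0.1, 0), (0.7, 1), (0.9, 1)]
dist = DatasetToDistribution(examples, rng=random.Random(0))
draws = dist.sample(5)
```

Keeping this outside the Dataset class means the dataset interface itself does not need a sampling method.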
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	18
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	19 Committee: DE, OB, OD, AB, PV
1030 a154c9b68239 dataset: Dumi confirmed as leader Olivier Delalleau <delallea@iro> parents: 1023 diff changeset	20 Leader: DE
1047 1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	21
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	22 Some ideas from existing ML libraries:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	23
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	25 PairDataSet, Aggregate. Ultimately, the learner decides
1077 5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	26 - mlpy: very primitive notions of data (simple 2D matrices)
1076 20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	27 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	28 SequentialDataSet, ReinforcementDataSet, ... Each class is quite
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	29 constrained and may have a different interface.
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	30 - MDP: Seems to have restrictions on the type of data being passed around, as
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	31 well as its dimensionality ("Input array data is typically assumed to be
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	32 two-dimensional and ordered such that observations of the same variable are
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	33 stored on rows and different variables are stored on columns.")
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	34 - Orange: Data matrices, with names and types associated to each column.
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	35 Basically there seems to be only one base dataset class that contains the
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	36 data. Data points are lists (of values corresponding to each column).
1077 5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	37 - APGL: Hard to say how they deal with data from the documentation alone.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	38 - Monte: Data is simply numpy arrays.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	39 - scikits.learn: Dataset is a simple container with e.g. dataset.data being
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	40 a 2D numpy array of input features, and dataset.target the target vector.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	41 - Shogun: Vade Retro C++! (may be worth looking into their feature concept
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	42 though).
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	43 - Any more worth looking at?
1047 1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	44
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	45 A few things that our dataset containers should support at a minimum:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	46
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	47 - streams, possibly infinite
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	48 - task/views of the data for different problems
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	49 - indexing & slicing
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	50 - pairs, triples, etc. of examples
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	51 - a 'distance/gram matrix' container (imagine that the data is given to you
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	52 as a distance matrix)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	53 - multi-dimensional time-series (again, maybe with pairs/triples, maybe
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	54 given to you as a distance matrix over time)
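To make the "streams, possibly infinite" requirement concrete, here is a minimal sketch of one way to batch a stream without ever knowing its length. Both noise_stream and iter_batches are hypothetical helpers invented for this illustration; the only assumption is that a stream can be exposed as a Python iterator.

```python
import itertools
import random


def noise_stream(seed=0):
    """Hypothetical infinite stream of (x, y) examples."""
    rng = random.Random(seed)
    while True:
        x = rng.random()
        yield x, int(x > 0.5)


def iter_batches(examples, batch_size):
    """Group any iterable of examples (finite or not) into lists."""
    iterator = iter(examples)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # finite source exhausted
        yield batch


# Only the batches that are actually requested are ever generated.
first_three = list(itertools.islice(iter_batches(noise_stream(), 4), 3))
```

Note that a stream like this supports get_next_batch-style access but not random indexing or slicing, which is one reason subclasses may only implement part of the interface.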
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	55
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	56 Another question to consider is the following: how tightly should it integrate
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	57 with Theano? Do we want to be able to store data as shared variables or just
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	58 have an option for that? Theano + GPU constrains things that we can do (in terms
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	59 of sizes, buffering, etc): these are things we need to think about, but it's not
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	60 clear whether we should aim for building them into the interface.
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	61
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	62 Task views of the data for different problems: How can we achieve this? Should
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	63 we simply have a set of standard dataset descriptors ('classification',
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	64 'regression', 'multi-label', 'density_estimation') and have a set_view method
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	65 that changes the current dataset view type?
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	66
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	67 There is then the question of how to approach the design of a Dataset class from
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	68 an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	69 Dataset that doesn't implement any methods except a few setters/getters. The reason
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	70 to have the methods listed that way is to have a common 'specification', but classes
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	71 that inherit from Dataset need not implement every single method (only the ones
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	72 that are relevant) and can obviously implement other methods as appropriate. The
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	73 reason to have a common specification (as abstract as it might be) is to, well,
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	74 have a common specification that would make our code clearer and cleaner.
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	75
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	76 An example of what I (Dumi) am thinking in terms of concrete API:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	77
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	78 class Dataset:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	79     def __init__(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	80         self.type = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	81         self.in_memory = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	82         self.inputs = None # list of filepaths, or objects in memory, or...
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	83         self.outputs = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	84
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	85     def get_example(self,example_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	86         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	87
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	88     def get_next_example(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	89         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	90
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	91     def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	92         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	93
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	94     def get_next_batch(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	95         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	96
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	97     def get_slice(self,slice_object):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	98         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	99
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	100     def set_view(self,view_type):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	101         self.view_type = view_type
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	102         self.n_classes = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	103
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	104     def set_n_classes(self,n_classes):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	105         self.n_classes = n_classes
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	106
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	107     def set_batch_size(self,batch_size):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	108         self.batch_size = batch_size
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	109
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	110 You will note that there is no notion of train/valid/test in this class: I think we should
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	111 just have a train dataset, a valid one and a test one instead or (if it's in one
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	112 big file or infinite stream) just handle the split ourselves (via slicing, for
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	113 instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	114 specification does not preclude more fine-grained 'splitting' of the data.
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	115
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	116 A concrete implementation would look like this (we would have one class per
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	117 dataset that we use, and the class declaration contains essentially everything
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	118 there is to know about the dataset):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	119
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	120 class MNIST(Dataset):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	121     def __init__(self,inputs=['train_x.npy'],outputs=['train_y.npy']):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	122         self.type='standard_xy'
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	123         self.in_memory = True
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	124         self.inputs = inputs # load them or create
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	125         self.outputs = outputs
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	126         self.set_view('classification')
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	127         self.set_n_classes(10)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	128         self.set_batch_size(20)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	129         self.n_batches = self._compute_n_batches()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	130
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	131     def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	132         x,y = self._fetch_batch(batch_index)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	133         if self.view_type == 'classification':
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	134             return x,numpy.int32(y)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	135         elif self.view_type == 'density_estimation':
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	136             return x
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	137         else:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	138             raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	139
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	140     def shared_data(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	141         shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	142         shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	143         return shared_x, T.cast(shared_y, 'int32')
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	144
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	145     def _compute_n_batches(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	146         pass
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	147
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	148     def _fetch_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	149         pass
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	150
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	151 But nothing stops you from defining get_train_batch, get_valid_batch and stuff
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	152 like that!
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	153
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	154 So we'd use it as:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	155
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	156 train_mnist = MNIST(inputs = ['train_x.npy'], outputs = ['train_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	157 valid_mnist = MNIST(inputs = ['valid_x.npy'], outputs = ['valid_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	158
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	159 x,y = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	160 train_mnist.set_view('density_estimation')
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	161 x = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	162
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	163 or
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	164
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	165 mnist_data = MNIST(inputs = ['x.npy'], outputs = ['y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	166 batches_train = range(int(mnist_data.n_batches*0.8))
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	167 batches_valid = range(int(mnist_data.n_batches*0.8),mnist_data.n_batches)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	168
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	169 xt,yt = mnist_data.get_batch(batches_train[0])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	170 xv,yv = mnist_data.get_batch(batches_valid[0])
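
The _compute_n_batches and _fetch_batch helpers are left as `pass` in the MNIST draft. The sketch below shows one way they might be filled in, so that the 80/20 batch split can actually be run. InMemoryDataset is a hypothetical stand-in class, it assumes the data is already loaded into sliceable sequences (lists here, though NumPy arrays slice the same way), and dropping an incomplete final batch is an arbitrary choice made for this illustration.

```python
class InMemoryDataset(object):
    """Hypothetical stand-in for the MNIST draft, holding data in memory."""

    def __init__(self, inputs, outputs, batch_size=20):
        assert len(inputs) == len(outputs)
        self.inputs = inputs
        self.outputs = outputs
        self.batch_size = batch_size
        self.view_type = 'classification'
        self.n_batches = self._compute_n_batches()

    def _compute_n_batches(self):
        # Assumption: an incomplete final batch is dropped.
        return len(self.inputs) // self.batch_size

    def _fetch_batch(self, batch_index):
        if not 0 <= batch_index < self.n_batches:
            raise IndexError('batch index out of range: %d' % batch_index)
        start = batch_index * self.batch_size
        stop = start + self.batch_size
        return self.inputs[start:stop], self.outputs[start:stop]

    def get_batch(self, batch_index):
        x, y = self._fetch_batch(batch_index)
        if self.view_type == 'classification':
            return x, y
        elif self.view_type == 'density_estimation':
            return x
        raise NotImplementedError(self.view_type)


# 100 toy examples, 5 batches of 20: 4 batches for training, 1 for validation.
data = InMemoryDataset(list(range(100)), [i % 10 for i in range(100)])
n_train = int(data.n_batches * 0.8)
batches_train = range(n_train)
batches_valid = range(n_train, data.n_batches)
xv, yv = data.get_batch(batches_valid[0])
```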
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	171
1054 a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	172
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	173
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	174
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	175 COMMENTS
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	176 ~~~~~~~~
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	177
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	178
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	179 JB asks: What may be passed as argument to the functions in Dataset, and what
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	180 can be expected in return? Are there side effects (e.g. on the state of the
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	181 Dataset) associated with any of the functions?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	182
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	183 JB asks: What properties are part of the Dataset API? What possible types can
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	184 they have? Are they expected to be read-only or writeable? What do they mean?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	185
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	186
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	187 JB asks: What is a view? Does set_view change the Dataset or return a new
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	188 Dataset with a certain view of the original (in which case call it get_view)?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	189 Does the view imply the types of the return-value of functions like
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	190 get_batch? What is the difference between the view and the subclasses of
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	191 Dataset in PyML?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	192
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	193 JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	194 retrieval that we would like to expose via this interface?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	195
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	196 JB asks: How would you recommend using this sort of dataset in a boosting
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	197 algorithm where points need to be re-weighted?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	198
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	199
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	200 JB asks: Do we want to provide for the possibility of feedback that modifies the
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	201 dataset? For example, curriculum learning might be adaptive in this sense, or
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	202 if we wanted to provide a virtual world for an agent as a dataset then we need
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	203 to provide 'actions' to get the next batch. Could this be done in the current
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	204 API?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	205
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	206
1082 f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	207 Field names and attributes
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	208 ~~~~~~~~~~~~~~~~~~~~~~~~~~
1054 a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	209
1082 f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	210 OD: One important question is how to handle fields' names and characteristics.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	211 For instance, it can be useful to know that the 3rd input field represents a
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	212 number of fingers, and is a non-negative discrete field whose numeric value is
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	213 meaningful (compared to, say, an integer index that would correspond to an
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	214 animal's category). We mentioned metadata during the meeting, but we did not
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	215 get into its details: that may be the place to put this kind of thing.
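
A rough sketch of what such per-field metadata could look like. Nothing here was agreed upon: the `FieldInfo` class, its attribute names and the "discrete"/"continuous" vocabulary are all hypothetical, chosen only to illustrate the fingers vs. animal-category example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical description of a single dataset field."""
    name: str
    kind: str                          # assumed vocabulary: "discrete" or "continuous"
    ordered: bool                      # True if the numeric value itself is meaningful
    min_value: Optional[float] = None  # lower bound, if known


# A count: discrete, non-negative, and its numeric value is meaningful.
fingers = FieldInfo(name="n_fingers", kind="discrete", ordered=True, min_value=0)

# A category index: discrete, but the integer is only a label.
animal = FieldInfo(name="animal_category", kind="discrete", ordered=False)


def needs_one_hot(field):
    """A learner could use the metadata to decide how to encode a field."""
    return field.kind == "discrete" and not field.ordered
```

With this kind of metadata a learner could, for instance, one-hot encode `animal` while feeding `fingers` in as a plain number.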
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	216
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	217
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	218 Freeing memory
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	219 ~~~~~~~~~~~~~~
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	220
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	221 OD: It is sometimes useful to be able to free memory used by previous
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	222 computations. A typical example is when you load in memory the original
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	223 dataset, then perform various processing steps, ending with a new dataset that
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	224 you also store in memory before feeding it to the learner. Unless you very
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	225 carefully design your code to avoid it, your original dataset will still
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	226 remain in memory (as well as maybe the results of some computations performed
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	227 along the way). So there may be a use for a `clear()` method that would be
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	228 called by the topmost dataset (the one doing the final memory caching), and
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	229 would be forwarded iteratively to previous datasets so as to get back all this
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	230 wasted memory space.
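
A minimal sketch of how such a `clear()` could be forwarded along a chain of datasets. Only the name `clear()` comes from the text above. The `Dataset` base class, its `source` and `_cache` attributes, and the `Loaded`, `Mapped` and `MemoryCached` classes are assumptions made up for the illustration.

```python
class Dataset:
    """Hypothetical base class: a dataset may wrap a previous (source) dataset."""

    def __init__(self, source=None):
        self.source = source
        self._cache = None

    def compute(self):
        raise NotImplementedError

    def data(self):
        # Compute once, then keep the result in memory.
        if self._cache is None:
            self._cache = self.compute()
        return self._cache

    def clear(self):
        # Drop our own cached data, then forward the call to the previous dataset.
        self._cache = None
        if self.source is not None:
            self.source.clear()


class Loaded(Dataset):
    """Original dataset, loaded in memory through a user-supplied callable."""

    def __init__(self, loader):
        super().__init__()
        self.loader = loader

    def compute(self):
        return self.loader()


class Mapped(Dataset):
    """One processing step applied to every sample of the source dataset."""

    def __init__(self, source, fn):
        super().__init__(source)
        self.fn = fn

    def compute(self):
        return [self.fn(x) for x in self.source.data()]


class MemoryCached(Dataset):
    """Topmost dataset: stores the final result, then frees everything upstream."""

    def compute(self):
        result = list(self.source.data())
        self.source.clear()
        return result


raw = Loaded(lambda: [1, 2, 3])
scaled = Mapped(raw, lambda x: 2 * x)
top = MemoryCached(scaled)
final = top.data()   # after this call, raw and scaled no longer hold any cached data
```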
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	231
1083 4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	232 What is a mini-batch?
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	233 ~~~~~~~~~~~~~~~~~~~~~
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	234
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	235 This is a follow-up to the meeting's discussion about whether a mini-batch
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	236 returned by a dataset should be itself a dataset.
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	237
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	238 OD: During the meeting I was voting in favor of a 'yes', mostly because it
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	239 made sense to me (a mini-batch is a subset of a dataset and thus should be a
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	240 dataset), but now I tend towards 'no'. The main reason is it is not clear yet
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	241 what the dataset interface will be, so that it is hard to judge whether this
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	242 is a good idea (my main concern is how much additional work would be required by
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	243 the writer of a new dataset subclass). Anyway, maybe a first thing we could
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	244 think about is what we want a mini-batch to be. I think we can agree that we
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	245 would like to be able to do something like:
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	246     for mb in dataset.mini_batches(size=10):
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	247         learner.update(mb.input, mb.target)
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	248 so that it should be ok for a mini-batch to be an object whose fields
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	249 (that should have the same name as those of the dataset) are numpy arrays.
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	250 More generally, we would like to be able to iterate on samples in a
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	251 mini-batch, or do random access on them, so a mini-batch should implement
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	252 __iter__ and __getitem__.
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	253 Besides this, is there any other typical use-case of a mini-batch? In
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	254 particular, is there any reason to want an infinite mini-batch? (in which case
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	255 we may need to revise our idea of what 'mini' means) Hopefully the answer to
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	256 that last question is no, as I think it would definitely keep things simpler,
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	257 since we could simply use numpy arrays (for numeric data) or lists (for
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	258 anything else) to store mini-batches' data. So I vote for 'no'.
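
To make the proposal concrete, here is a small sketch of a finite mini-batch whose fields are numpy arrays and which supports `__iter__` and `__getitem__`. The `MiniBatch` and `ArrayDataset` classes are hypothetical, and so is the choice of returning each sample as a tuple of field values. Only `mini_batches(size=...)`, `input` and `target` are taken from the loop above.

```python
import numpy as np


class MiniBatch:
    """Hypothetical mini-batch: a finite set of samples stored as numpy arrays."""

    def __init__(self, **fields):
        # Field names are meant to match those of the parent dataset.
        self._names = tuple(fields)
        for name, values in fields.items():
            setattr(self, name, np.asarray(values))

    def __len__(self):
        return len(getattr(self, self._names[0]))

    def __getitem__(self, i):
        # Random access: the i-th sample, as a tuple of field values.
        return tuple(getattr(self, name)[i] for name in self._names)

    def __iter__(self):
        # Iterate over samples, one at a time.
        return (self[i] for i in range(len(self)))


class ArrayDataset:
    """Hypothetical in-memory dataset with an 'input' and a 'target' field."""

    def __init__(self, input, target):
        self.input = np.asarray(input)
        self.target = np.asarray(target)

    def mini_batches(self, size):
        # The last mini-batch may be smaller than `size`.
        for start in range(0, len(self.input), size):
            stop = start + size
            yield MiniBatch(input=self.input[start:stop],
                            target=self.target[start:stop])


dataset = ArrayDataset(np.arange(50).reshape(25, 2), np.arange(25))
batches = list(dataset.mini_batches(size=10))
```

Since a mini-batch here is just a thin wrapper around arrays, the writer of a new dataset subclass would only have to produce the arrays, not a full dataset object.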
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	259
1084 7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	260 A dataset is a learner
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	261 ~~~~~~~~~~~~~~~~~~~~~~
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	262
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	263 OD: This is more a high-level comment that may or may not be relevant
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	264 depending on how we get to plug our different classes together.
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	265 In PLearn (the old C++ LISA ML library) we had lots of dataset subclasses doing
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	266 all sorts of fancy things, the majority of these classes taking as input
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	267 another dataset, and transforming it in some way (e.g. taking a subset of
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	268 samples, a subset of features, normalizing features, computing extra fields
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	269 given existing fields, etc.). I think right now our interface is heading in a
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	270 similar direction.
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	271 When you think about it, this kind of operation is equivalent to writing a
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	272 learner class that is trained on the input dataset, and whose output on this
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	273 same dataset is used to obtain an output dataset (note that the training phase
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	274 may do nothing, e.g. if the goal is only to filter out a predefined set of
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	275 samples).
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	276 If you push it even further, even a dataset that has no input dataset, say
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	277 e.g. a dataset view of a 2D numpy matrix, can be seen as the output of a
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	278 learner that was trained on nothing and whose output is computed on nothing
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	279 (but still outputs this 2D matrix).
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	280 In the small ML library I have been using at Ubisoft, the dataset class
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	281 actually inherits from learner, based on this point of view. Actually pretty
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	282 much all objects that are plugged together to make an experiment are learners.
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	283 The main advantage is everything has the same interface and the "plugging" of
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	284 the different parts can remain very simple. Confusion is avoided by using the
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	285 module hierarchy to ensure objects with different behavior have different names.
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	286 Something like dataset.MatrixDataset would create a dataset from scratch (i.e.
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	287 a numpy matrix), process.FilterSamples would be something that does not need
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	288 to be trained, but needs an input dataset, and learner.NNet would be a usual
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	289 learning algorithm that must be trained on an input dataset, and computes an
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	290 output (possibly on the same dataset, possibly on another one).
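
A toy illustration of this "everything is a learner" point of view. It is not the Ubisoft library's actual interface, which is not described here: the `train`/`output` method names, the `run` helper and the `MeanPredictor` class (a trivial stand-in for something like learner.NNet) are assumptions, and the three kinds of objects are put in a single file instead of separate dataset / process / learner modules.

```python
import numpy as np


class Learner:
    """Hypothetical common interface: train on a dataset, then compute an output."""

    def train(self, data=None):
        return self  # by default there is nothing to learn

    def output(self, data=None):
        raise NotImplementedError


class MatrixDataset(Learner):
    """A dataset from scratch: trained on nothing, outputs its own matrix."""

    def __init__(self, matrix):
        self.matrix = np.asarray(matrix)

    def output(self, data=None):
        return self.matrix


class FilterSamples(Learner):
    """Needs an input dataset but no training: keeps a predefined set of rows."""

    def __init__(self, keep):
        self.keep = list(keep)

    def output(self, data):
        return data[self.keep]


class MeanPredictor(Learner):
    """A usual learner: must be trained before it can compute an output."""

    def train(self, data):
        self.mean = data.mean(axis=0)
        return self

    def output(self, data):
        # Predict the training mean for every sample.
        return np.tile(self.mean, (len(data), 1))


def run(pipeline):
    """Plug the parts together: each stage is trained on, and applied to,
    the output of the previous one."""
    data = None
    for stage in pipeline:
        data = stage.train(data).output(data)
    return data


result = run([MatrixDataset([[1, 2], [3, 4], [5, 6]]),
              FilterSamples([0, 2]),
              MeanPredictor()])
```

The "plugging" code in `run` does not need to know which stage is a dataset, a processing step or a learning algorithm, which is the simplicity argued for above.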
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	291
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	292 Ok, this is getting too long, I am definitely not saying we should do this,
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	293 but I think there is some close relationship between the usual data processing
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	294 we do and the learning process, so it may be worth thinking about how to put them
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	295 together in a coherent framework. For instance, in PLearn there was (something
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	296 like) a NormalizeVMatrix (think of it as a dataset subclass), but it could
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	297 not be used in a natural way to learn the normalization parameters on a
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	298 training set (e.g. mean and std of features) and normalize another dataset.
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	299 Instead you could use (something like) a
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	300 PLearnerOutputVMatrix(learner=NormalizeLearner(train_on=....)). Having both
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	301 two ways to do (almost) the same thing can be confusing.
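
For illustration, here is what the learner-style normalization could look like in Python. This is a sketch, not PLearn's code: the `train`/`output` method names are assumptions, and replacing a zero std by 1 is a choice made here to keep the example safe on constant features.

```python
import numpy as np


class NormalizeLearner:
    """Hypothetical learner: learns per-feature mean and std, then rescales data."""

    def train(self, data):
        data = np.asarray(data, dtype=float)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        # Assumption: leave constant features unscaled instead of dividing by zero.
        self.std[self.std == 0] = 1.0
        return self

    def output(self, data):
        return (np.asarray(data, dtype=float) - self.mean) / self.std


train_set = np.array([[0.0, 10.0], [2.0, 30.0]])
test_set = np.array([[1.0, 20.0], [3.0, 40.0]])

# Parameters are learned on the training set only...
normalizer = NormalizeLearner().train(train_set)
# ...and can then be applied to any other dataset.
test_normalized = normalizer.output(test_set)
```

Because the normalization parameters live in a trained object, the same object normalizes the training set and any other dataset, so a separate dataset subclass for this case is not needed.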
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	302
