pylearn: doc/v2_planning/dataset.txt annotate

annotate doc/v2_planning/dataset.txt @ 1429:b0141efbf6a2

fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.

author	Frederic Bastien <nouiz@nouiz.org>
date	Tue, 08 Feb 2011 16:17:56 -0500
parents	04b988fb00b6
children

rev	line source
1002 f82093bf4405 adding learner.txt and dataset.txt in v2_planning/ Yoshua Bengio <bengioy@iro.umontreal.ca> parents: diff changeset	1 Discussion of Function Specification for Dataset Types
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/ Yoshua Bengio <bengioy@iro.umontreal.ca> parents: diff changeset	2 ======================================================
f82093bf4405 adding learner.txt and dataset.txt in v2_planning/ Yoshua Bengio <bengioy@iro.umontreal.ca> parents: diff changeset	3
1008 a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	4 Some talking points from the September 2 meeting:
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	5
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	6 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	7 needs to be flexible enough to accommodate different (sub)tasks and views of
9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	8 the same underlying data.
1008 a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	9 * Datasets as probability distributions from which one can sample.
1023 fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	10 * That's not something I would consider to be a dataset-related problem to
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	11 tackle now: a probability distribution in Pylearn would probably be a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	12 different kind of beast, and it should be easy enough to have a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	13 DatasetToDistribution class for instance, that would take care of viewing a
fb6cae14fd07 dataset: Comment about viewing a dataset as a distribution Olivier Delalleau <delallea@iro> parents: 1008 diff changeset	14 dataset as a probability distribution. -- OD
1008 a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	15 * Our specification should allow transparent handling of infinite datasets (or
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	16 simply datasets which cannot fit in memory)
1008 a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	17 * GPU/buffering issues.
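
A minimal sketch of the DatasetToDistribution idea from OD's comment above, assuming a dataset is any object that supports len() and integer indexing. Only the class name comes from the comment; the sample method, the seed argument, and the choice of uniform sampling with replacement are assumptions made for illustration.

```python
import random


class DatasetToDistribution:
    """View a finite, indexable dataset as an empirical distribution.

    Hypothetical sketch: `dataset` only needs __len__ and __getitem__.
    """

    def __init__(self, dataset, seed=None):
        self.dataset = dataset
        # Private RNG so that sampling is reproducible for a given seed.
        self._rng = random.Random(seed)

    def sample(self, n=1):
        """Draw n examples uniformly at random, with replacement."""
        size = len(self.dataset)
        return [self.dataset[self._rng.randrange(size)] for _ in range(n)]


examples = [(0.0, 'a'), (1.0, 'b'), (2.0, 'c')]
dist = DatasetToDistribution(examples, seed=0)
draws = dist.sample(5)
```

Keeping the wrapper outside the dataset class matches the comment's point: the dataset interface does not need to know anything about sampling.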
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	18
a5886b394bda Updating with talking points from Sept. 2 discussion Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1002 diff changeset	19 Committee: DE, OB, OD, AB, PV
1030 a154c9b68239 dataset: Dumi confirmed as leader Olivier Delalleau <delallea@iro> parents: 1023 diff changeset	20 Leader: DE
1047 1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	21
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	22 Some ideas from existing ML libraries:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	23
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	25 PairDataSet, Aggregate. Ultimately, the learner decides
1077 5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	26 - mlpy: very primitive notions of data (simple 2D matrices)
1076 20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	27 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	28 SequentialDataSet, ReinforcementDataSet, ... Each class is quite
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	29 constrained and may have a different interface.
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	30 - MDP: Seems to have restrictions on the type of data being passed around, as
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	31 well as its dimensionality ("Input array data is typically assumed to be
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	32 two-dimensional and ordered such that observations of the same variable are
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	33 stored on rows and different variables are stored on columns.")
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	34 - Orange: Data matrices, with names and types associated to each column.
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	35 Basically there seems to be only one base dataset class that contains the
20a1af112a75 dataset: Looked into datasets from some other ML libraries Olivier Delalleau <delallea@iro> parents: 1054 diff changeset	36 data. Data points are lists (of values corresponding to each column).
1077 5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	37 - APGL: Hard to say how they deal with data from the documentation alone.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	38 - Monte: Data is simply numpy arrays.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	39 - scikits.learn: Dataset is a simple container with e.g. dataset.data being
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	40 a 2D numpy array of input features, and dataset.target the target vector.
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	41 - Shogun: Vade Retro C++! (may be worth looking into their feature concept
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	42 though).
5c14d2ffcbb3 dataset: Looked into a few more existing ML libraries Olivier Delalleau <delallea@iro> parents: 1076 diff changeset	43 - Any more worth looking at?
1047 1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	44
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	45 A few things that our dataset containers should support at a minimum:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	46
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	47 - streams, possibly infinite
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	48 - task/views of the data for different problems
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	49 - indexing & slicing
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	50 - pairs, triples, etc. of examples
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	51 - a 'distance/gram matrix' container (imagine that the data is given to you
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	52 as a distance matrix)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	53 - multi-dimensional time-series (again, maybe with pairs/triples, maybe
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	54 given to you as a distance matrix over time)
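
Two of the requirements above (possibly infinite streams, and pairs of examples) can be sketched with plain Python generators. This is only an illustration under stated assumptions: the names noisy_line_stream, take_batch and pairs are hypothetical, and the toy data (points near the line y = 2x) stands in for a real source.

```python
import itertools
import random


def noisy_line_stream(seed=0):
    """Yield (x, y) examples forever; no example is kept in memory."""
    rng = random.Random(seed)
    while True:
        x = rng.random()
        yield x, 2.0 * x + rng.gauss(0.0, 0.1)


def take_batch(stream, batch_size):
    """Pull the next batch_size examples from a (possibly infinite) stream."""
    return list(itertools.islice(stream, batch_size))


def pairs(stream):
    """Lazily group consecutive examples into non-overlapping pairs."""
    iterator = iter(stream)
    return zip(iterator, iterator)


stream = noisy_line_stream()
first_batch = take_batch(stream, 4)
second_batch = take_batch(stream, 4)  # continues where the first batch stopped
first_pair = next(pairs(noisy_line_stream()))
```

A stream like this has no length and no random access, which is one reason get_next_batch and get_batch may not both make sense for every dataset.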
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	55
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	56 Another question to consider is the following: how tightly should it integrate
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	57 with Theano? Do we want to be able to store data as shared variables or just
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	58 have an option for that? Theano + GPU constrains things that we can do (in terms
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	59 of sizes, buffering, etc): these are things we need to think about, but it's not
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	60 clear whether we should aim for building them into the interface.
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	61
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	62 Task views of the data for different problems: How can we achieve this? Should
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	63 we simply have a set of standard dataset descriptors ('classification',
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	64 'regression', 'multi-label', 'density_estimation') and have a set_view method
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	65 that changes the current dataset view type?
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	66
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	67 There is then the question of how to approach the design of a Dataset class from
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	68 an OOP perspective. So far, my (Dumi's) idea is to have an almost 'abstract class'
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	69 Dataset that doesn't implement any methods except a few setters/getters. The reason
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	70 to have the methods listed that way is to have a common 'specification', but classes
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	71 that inherit from Dataset need not implement every single method (only the ones
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	72 that are relevant) and can obviously implement other methods as appropriate. The
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	73 reason to have a common specification (as abstract as it might be) is to, well,
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	74 have a common specification that would make our code clearer and cleaner.
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	75
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	76 An example of what I (Dumi) am thinking in terms of concrete API:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	77
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	78 class Dataset:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	79     def __init__(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	80         self.type = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	81         self.in_memory = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	82         self.inputs = None # list of filepaths, or objects in memory, or...
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	83         self.outputs = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	84
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	85     def get_example(self,example_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	86         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	87
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	88     def get_next_example(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	89         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	90
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	91     def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	92         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	93
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	94     def get_next_batch(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	95         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	96
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	97     def get_slice(self,slice_object):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	98         raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	99
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	100     def set_view(self,view_type):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	101         self.view_type = view_type
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	102         self.n_classes = None
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	103
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	104     def set_n_classes(self,n_classes):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	105         self.n_classes = n_classes
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	106
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	107     def set_batch_size(self,batch_size):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	108         self.batch_size = batch_size
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	109
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	110 You will note that there is no notion of train/valid/test in this class: I think we should
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	111 just have a train dataset, a valid one and a test one instead or (if it's in one
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	112 big file or infinite stream) just handle the split ourselves (via slicing, for
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	113 instance). I (Dumi) am of the opinion that it keeps things cleaner, but the
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	114 specification does not preclude more fine-grained 'splitting' of the data.
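
As a rough illustration of handling the split through slicing, here is a sketch that assumes get_slice returns a new dataset object, so that each split is itself a dataset. ListDataset is a hypothetical in-memory class introduced only for this example; it is not part of the proposed API.

```python
class ListDataset:
    """Hypothetical in-memory dataset holding a list of examples."""

    def __init__(self, examples):
        self.examples = list(examples)

    def __len__(self):
        return len(self.examples)

    def get_slice(self, slice_object):
        # Return a new dataset so that a split can be used like any other dataset.
        return ListDataset(self.examples[slice_object])


full = ListDataset(range(10))
n_train = int(len(full) * 0.8)  # first 80% for training, the rest for validation
train = full.get_slice(slice(0, n_train))
valid = full.get_slice(slice(n_train, None))
```

With this convention the learner never sees a train/valid/test flag; it only ever receives a dataset.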
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	115
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	116 A concrete implementation would look like this (we would have one class per
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	117 dataset that we use, and the class declaration contains essentially everything
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	118 there is to know about the dataset):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	119
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	120 .. code-block:: python
9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	121
9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	122     class MNIST(Dataset):
1047 1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	123         def __init__(self,inputs=['train_x.npy'],outputs=['train_y.npy']):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	124             self.type='standard_xy'
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	125             self.in_memory = True
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	126             self.inputs = inputs # load them or create
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	127             self.outputs = outputs
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	128             self.set_view('classification')
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	129             self.set_n_classes(10)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	130             self.set_batch_size(20)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	131             self.n_batches = self._compute_n_batches()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	132
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	133         def get_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	134             x,y = self._fetch_batch(batch_index)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	135             if self.view_type == 'classification':
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	136                 return x,numpy.int32(y)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	137             elif self.view_type == 'density_estimation':
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	138                 return x
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	139             else:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	140                 raise NotImplementedError()
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	141
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	142         def shared_data(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	143             shared_x = theano.shared(numpy.asarray(self.inputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	144             shared_y = theano.shared(numpy.asarray(self.outputs, dtype=theano.config.floatX))
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	145             return shared_x, T.cast(shared_y, 'int32')
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	146
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	147         def _compute_n_batches(self):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	148             pass
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	149
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	150         def _fetch_batch(self,batch_index):
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	151             pass
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	152
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	153 But nothing stops you from defining get_train_batch, get_valid_batch and stuff
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	154 like that!
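
The MNIST sketch above leaves _compute_n_batches and _fetch_batch as stubs. One possible way to fill them in for data already held in memory is shown here, assuming that x and y are equal-length sequences supporting slicing (Python lists here; numpy arrays would behave the same way). InMemoryBatches is a hypothetical stand-in class, and keeping a smaller final batch rather than dropping it is an arbitrary choice.

```python
class InMemoryBatches:
    """Hypothetical stand-in for the batching helpers of the MNIST sketch."""

    def __init__(self, x, y, batch_size):
        if len(x) != len(y):
            raise ValueError('x and y must have the same number of examples')
        self.x = x
        self.y = y
        self.batch_size = batch_size
        self.n_batches = self._compute_n_batches()

    def _compute_n_batches(self):
        # Ceiling division: the last batch may hold fewer than batch_size examples.
        return (len(self.x) + self.batch_size - 1) // self.batch_size

    def _fetch_batch(self, batch_index):
        if not 0 <= batch_index < self.n_batches:
            raise IndexError('batch_index out of range')
        start = batch_index * self.batch_size
        stop = start + self.batch_size
        return self.x[start:stop], self.y[start:stop]


data = InMemoryBatches(list(range(25)), list(range(100, 125)), batch_size=10)
x_last, y_last = data._fetch_batch(data.n_batches - 1)
```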
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	155
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	156 So we'd use it as:
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	157
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	158 train_mnist = MNIST(inputs = ['train_x.npy'], outputs = ['train_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	159 valid_mnist = MNIST(inputs = ['valid_x.npy'], outputs = ['valid_y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	160
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	161 x,y = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	162 train_mnist.set_view('density_estimation')
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	163 x = train_mnist.get_batch(0)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	164
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	165 or
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	166
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	167 mnist_data = MNIST(inputs = ['x.npy'], outputs = ['y.npy'])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	168 batches_train = range(int(mnist_data.n_batches*0.8))
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	169 batches_valid = range(int(mnist_data.n_batches*0.8),mnist_data.n_batches)
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	170
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	171 xt,yt = mnist_data.get_batch(batches_train[0])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	172 xv,yv = mnist_data.get_batch(batches_valid[0])
1b61cbe0810b A very rough draft of ideas, to kick-start things Dumitru Erhan <dumitru.erhan@gmail.com> parents: 1030 diff changeset	173
1054 a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	174
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	175
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	176
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	177 COMMENTS
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	178 ~~~~~~~~
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	179
1120 27d0ef195e1d v2planning - added comment to dataset re: visualization James Bergstra <bergstrj@iro.umontreal.ca> parents: 1117 diff changeset	180 JB asks: How about asking datasets to also provide a visualization mechanism
27d0ef195e1d v2planning - added comment to dataset re: visualization James Bergstra <bergstrj@iro.umontreal.ca> parents: 1117 diff changeset	181 for showing / playing individual examples from the dataset, but also other
27d0ef195e1d v2planning - added comment to dataset re: visualization James Bergstra <bergstrj@iro.umontreal.ca> parents: 1117 diff changeset	182 external objects that are similar to dataset examples (e.g. filters from a
27d0ef195e1d v2planning - added comment to dataset re: visualization James Bergstra <bergstrj@iro.umontreal.ca> parents: 1117 diff changeset	183 weight matrix that filters images). This doesn't have to be complicated, and it
27d0ef195e1d v2planning - added comment to dataset re: visualization James Bergstra <bergstrj@iro.umontreal.ca> parents: 1117 diff changeset	184 can be shared between datasets that exist in one modality (e.g. image datasets
27d0ef195e1d v2planning - added comment to dataset re: visualization James Bergstra <bergstrj@iro.umontreal.ca> parents: 1117 diff changeset	185 can all use an image-rendering method).
1054 a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	186
1131 d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	187 OD replies: Besides being able to display data without prior knowledge of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	188 kind of data inside a dataset, is there any reason to put this within the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	189 dataset class? If not, it seems to me it may be more appropriate to have a way
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	190 for the dataset to describe the kind of data it holds, and keep the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	191 visualization code separate from the dataset itself. It would make it easier
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	192 in particular to try different visualization systems, and description of the
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	193 data may turn out to be useful for other reasons (however, it also means we'd
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	194 need to come up with a good way to describe data, which could prove
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	195 difficult).
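A minimal sketch of the separation OD describes, where the dataset only describes its data and the visualization code lives elsewhere. All names here (``DataKind``, ``describe()``, ``RENDERERS``, ``render_example``) are hypothetical and only illustrate the idea; they are not part of any agreed API.

```python
# Hypothetical sketch: the dataset describes its data; rendering is kept separate.
from dataclasses import dataclass


@dataclass(frozen=True)
class DataKind:
    """Assumed description of what a dataset holds (illustrative only)."""
    modality: str   # e.g. "image" or "audio"
    shape: tuple    # per-example shape, e.g. (28, 28)


class ToyImageDataset:
    """Stand-in dataset that knows its data kind but not how to draw it."""

    def __init__(self, examples, shape):
        self.examples = examples
        self._kind = DataKind(modality="image", shape=shape)

    def describe(self):
        return self._kind


def render_image_as_text(example, kind):
    """Toy renderer: draw a flat list of pixel values as rows of characters."""
    rows, cols = kind.shape
    lines = []
    for r in range(rows):
        row = example[r * cols:(r + 1) * cols]
        lines.append("".join("#" if v > 0.5 else "." for v in row))
    return "\n".join(lines)


# Renderers are registered per modality, outside the dataset classes, so a
# different visualization system can be swapped in without touching them.
RENDERERS = {"image": render_image_as_text}


def render_example(example, kind):
    return RENDERERS[kind.modality](example, kind)


data = ToyImageDataset(examples=[[0.0, 1.0, 1.0, 0.0]], shape=(2, 2))
picture = render_example(data.examples[0], data.describe())
# The same renderer also accepts external objects shaped like an example,
# such as a filter taken from a weight matrix.
filter_picture = render_example([0.9, 0.1, 0.2, 0.8], data.describe())
```

The hard part OD points out remains open in this sketch: deciding what a general-purpose data description should contain.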
d9550c27a192 dataset: Reply about being able to plot samples from a dataset Olivier Delalleau <delallea@iro> parents: 1127 diff changeset	196
1054 a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	197 JB asks: What may be passed as argument to the functions in Dataset, and what
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	198 can be expected in return? Are there side effects (e.g. on the state of the
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	199 Dataset) associated with any of the functions?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	200
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	201 JB asks: What properties are part of the Dataset API? What possible types can
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	202 they have, are they expected to be read-only or writeable? What do they mean?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	203
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	204
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	205 JB asks: What is a view? Does set_view change the Dataset or return a new
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	206 Dataset with a certain view of the original (in which case call it get_view)?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	207 Does the view imply the types of the return-value of functions like
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	208 get_batch? What is the difference between the view and the subclasses of
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	209 Dataset in PyML?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	210
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	211 JB asks: Do container formats (I'm thinking of HDF5) offer features for fast
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	212 retrieval that we would like to expose via this interface?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	213
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	214 JB asks: How would you recommend using this sort of dataset in a boosting
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	215 algorithm where points need to be re-weighted?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	216
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	217
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	218 JB asks: Do we want to provide for the possibility of feedback that modifies the
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	219 dataset? For example, curriculum learning might be adaptive in this sense, or
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	220 if we wanted to provide a virtual world for an agent as a dataset then we need
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	221 to provide 'actions' to get the next batch. Could this be done in the current
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	222 API?
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	223
a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	224
1082 f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	225 Field names and attributes
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	226 ~~~~~~~~~~~~~~~~~~~~~~~~~~
1054 a474fabd1f37 v2_planning dataset - added questions James Bergstra <bergstrj@iro.umontreal.ca> parents: 1047 diff changeset	227
1082 f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	228 OD: One important question is how to handle fields' names and characteristics.
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	229 For instance, it can be useful to know that the 3rd input field represents a
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	230 number of fingers, and is a non-negative discrete field whose numeric value is
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	231 meaningful (compared to, say, an integer index that would correspond to an
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	232 animal's category). We mentioned metadata during the meeting, but we did not
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	233 get into its details: that may be the place to put this kind of thing.
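One possible shape for such per-field metadata, as a sketch only. The ``FieldInfo`` class, its attributes, and the ``needs_one_hot`` helper are assumptions made for illustration, not something decided during the meeting.

```python
# Hypothetical per-field metadata; names and attributes are illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldInfo:
    name: str
    dtype: str                         # e.g. "int" or "float"
    discrete: bool                     # takes values in a countable set
    ordered: bool                      # True when the numeric value is meaningful
    min_value: Optional[float] = None  # lower bound, if any


FIELDS = [
    # A count: discrete, non-negative, and its numeric value is meaningful.
    FieldInfo("n_fingers", "int", discrete=True, ordered=True, min_value=0),
    # A category index: discrete, but the integer is only a label.
    FieldInfo("animal_category", "int", discrete=True, ordered=False),
]


def needs_one_hot(field):
    """A learner might one-hot encode fields whose integers are mere labels."""
    return field.discrete and not field.ordered


one_hot_fields = [f.name for f in FIELDS if needs_one_hot(f)]
```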
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	234
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	235
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	236 Freeing memory
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	237 ~~~~~~~~~~~~~~
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	238
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	239 OD: It is sometimes useful to be able to free memory used by previous
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	240 computations. A typical example is when you load the original dataset in
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	241 memory, then perform various processing steps, ending with a new dataset that
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	242 you also store in memory before feeding it to the learner. Unless you very
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	243 carefully design your code to avoid it, your original dataset will still
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	244 remain in memory (as well as maybe the results of some computations performed
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	245 along the way). So there may be a use for a `clear()` method that would be
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	246 called by the topmost dataset (the one doing the final memory caching), and
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	247 would be forwarded iteratively to previous datasets so as to get back all this
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	248 wasted memory space.
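A rough sketch of how such a ``clear()`` call could propagate through a chain of datasets. The class names (``InMemorySource``, ``Transformed``, ``FinalCache``) and the ``data()`` accessor are hypothetical, and the forwarding is written recursively for brevity.

```python
# Hypothetical sketch of clear() being forwarded to previous datasets.

class InMemorySource:
    """Stand-in for the original dataset loaded in memory."""

    def __init__(self, values):
        self._values = list(values)

    def data(self):
        return self._values

    def clear(self):
        # Nothing upstream: just drop our own storage.
        self._values = None


class Transformed:
    """Processing step that stores its result the first time it is computed."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn
        self._result = None

    def data(self):
        if self._result is None:
            self._result = [self.fn(v) for v in self.source.data()]
        return self._result

    def clear(self):
        # Drop our own intermediate result, then forward to the previous dataset.
        self._result = None
        self.source.clear()


class FinalCache:
    """Topmost dataset: keeps the final result and frees everything upstream."""

    def __init__(self, source):
        self._cache = source.data()
        source.clear()

    def data(self):
        return self._cache


raw = InMemorySource([1, 2, 3])
scaled = Transformed(raw, lambda v: v * 2)
shifted = Transformed(scaled, lambda v: v + 1)
final = FinalCache(shifted)
```

In this sketch the upstream datasets can no longer produce data once they have been cleared, which is one of the trade-offs a real design would need to settle.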
f9f72ae84313 dataset: Added a couple points we did not have time to discuss during meeting Olivier Delalleau <delallea@iro> parents: 1077 diff changeset	249
1083 4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	250 What is a mini-batch?
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	251 ~~~~~~~~~~~~~~~~~~~~~
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	252
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	253 This is a follow-up to the meeting's discussion about whether a mini-batch
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	254 returned by a dataset should be itself a dataset.
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	255
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	256 OD: During the meeting I was voting in favor of a 'yes', mostly because it
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	257 made sense to me (a mini-batch is a subset of a dataset and thus should be a
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	258 dataset), but now I tend towards 'no'. The main reason is it is not clear yet
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	259 what the dataset interface will be, so that it is hard to judge whether this
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	260 is a good idea (my main concern is how much additional work would be required by
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	261 the writer of a new dataset subclass). Anyway, maybe a first thing we could
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	262 think about is what we want a mini-batch to be. I think we can agree that we
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	263 would like to be able to do something like:
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	264
9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	265 .. code-block:: python
9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	266
1083 4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	267 for mb in dataset.mini_batches(size=10):
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	268     learner.update(mb.input, mb.target)
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	269
1083 4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	270 so that it should be ok for a mini-batch to be an object whose fields
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	271 (that should have the same name as those of the dataset) are numpy arrays.
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	272 More generally, we would like to be able to iterate on samples in a
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	273 mini-batch, or do random access on them, so a mini-batch should implement
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	274 __iter__ and __getitem__.
4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	275 Besides this, is there any other typical use-case of a mini-batch? In
1086 65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	276 particular, is there any reason to want an infinite mini-batch, or a very big
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	277 mini-batch that may not fit in memory? (in which case we may need to revise
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	278 our idea of what 'mini' means) Hopefully the answer to that last question is
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	279 no, as I think it would definitely keep things simpler, since we could simply
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	280 use numpy arrays (for numeric data) or lists (for anything else) to store
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	281 mini-batches' data. So I vote for 'no'.
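The requirements listed above (named fields, iteration over samples, random access) can be sketched as a small in-memory object. The ``MiniBatch`` class and the ``mini_batches`` helper below are hypothetical; plain lists are used to keep the sketch dependency-free, and numpy arrays would be sliced the same way.

```python
# Hypothetical in-memory mini-batch: one sequence per field, all of equal length.

class MiniBatch:
    def __init__(self, **fields):
        lengths = {len(values) for values in fields.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must hold the same number of samples")
        self._fields = fields
        self._n = lengths.pop()
        # Expose each field under its dataset name, e.g. mb.input and mb.target.
        for name, values in fields.items():
            setattr(self, name, values)

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        # Random access to one sample, returned as a field-name -> value mapping.
        return {name: values[i] for name, values in self._fields.items()}

    def __iter__(self):
        # Iterate over the samples of the mini-batch.
        return (self[i] for i in range(self._n))


def mini_batches(fields, size):
    """Yield consecutive MiniBatch objects of at most `size` samples."""
    n = len(next(iter(fields.values())))
    for start in range(0, n, size):
        yield MiniBatch(**{k: v[start:start + size] for k, v in fields.items()})


toy = {"input": [[0], [1], [2], [3], [4]], "target": [0, 1, 0, 1, 0]}
batches = list(mini_batches(toy, size=2))
```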
1083 4c00af69c164 dataset: Asking what we want from mini-batches Olivier Delalleau <delallea@iro> parents: 1082 diff changeset	282
1124 0f184b5e7a3f YB: comment on minibatches for dataset.txt Yoshua Bengio <bengioy@iro.umontreal.ca> parents: 1120 diff changeset	283 YB: I agree that a mini-batch should definitely be safely assumed
0f184b5e7a3f YB: comment on minibatches for dataset.txt Yoshua Bengio <bengioy@iro.umontreal.ca> parents: 1120 diff changeset	284 to fit in memory. That makes it at least in principle semantically
0f184b5e7a3f YB: comment on minibatches for dataset.txt Yoshua Bengio <bengioy@iro.umontreal.ca> parents: 1120 diff changeset	285 different from a dataset. But barring that restriction, it might
0f184b5e7a3f YB: comment on minibatches for dataset.txt Yoshua Bengio <bengioy@iro.umontreal.ca> parents: 1120 diff changeset	286 share some of the properties of a dataset.
0f184b5e7a3f YB: comment on minibatches for dataset.txt Yoshua Bengio <bengioy@iro.umontreal.ca> parents: 1120 diff changeset	287
1084 7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	288 A dataset is a learner
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	289 ~~~~~~~~~~~~~~~~~~~~~~
7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	290
1085 de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	291 OD: (this is hopefully a clearer re-write of the original version from
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	292 r7e6e77d50eeb, which I was not happy with).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	293 There are typically three kinds of objects that spit out data:
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	294
1085 de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	295 1. Datasets that are loaded from disk or are able to generate data all by
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	296 themselves (i.e. without any other dataset as input)
1086 65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	297 2. Datasets that transform their input dataset in a way that only depends on
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	298 the input dataset (e.g. filtering samples or features, normalizing data, etc.)
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	299 3. Datasets that transform their input dataset in a way that is learned on a
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	300 potentially different dataset (e.g. PCA when you want to learn the projection
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	301 space on the training set in order to transform both the training and test
65ac0f493830 dataset: Some clarifications on my comments Olivier Delalleau <delallea@iro> parents: 1085 diff changeset	302 sets).
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	303
1085 de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	304 My impression currently is that we would use dataset subclasses to handle 1
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	305 and 2. However, 3 requires a learner framework, so you would need to have
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	306 something like a LearnerOutputDataset(trained_learner, dataset).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	307
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	308 Note however that 2 is a special case of 3 (where training does nothing), and
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	309 1 is a special case of 2 (where we do not care about being given an input
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	310 dataset). Thus you could decide to also implement 1 and 2 as learners wrapped
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	311 by LearnerOutputDataset.
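To make the relationship concrete, here is a minimal sketch of a ``LearnerOutputDataset`` wrapping both a fixed transform (situation 2) and a learned one (situation 3). The ``Learner`` interface shown (``train`` and ``compute_output``) and the two example learners are assumptions for illustration; the actual learner API was still under discussion.

```python
# Hypothetical sketch: datasets of kinds 2 and 3 expressed as wrapped learners.

class Learner:
    """Assumed minimal interface: train on a dataset, then map single samples."""

    def train(self, dataset):
        pass

    def compute_output(self, sample):
        raise NotImplementedError


class Rescale(Learner):
    """Situation 2: a fixed transform, so training does nothing."""

    def __init__(self, factor):
        self.factor = factor

    def compute_output(self, sample):
        return [x * self.factor for x in sample]


class CenterOnMean(Learner):
    """Situation 3: the shift is learned on one dataset and reused on others."""

    def __init__(self):
        self.mean = None

    def train(self, dataset):
        rows = list(dataset)
        n_features = len(rows[0])
        self.mean = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]

    def compute_output(self, sample):
        return [x - m for x, m in zip(sample, self.mean)]


class LearnerOutputDataset:
    """Dataset whose samples are the learner's outputs on another dataset."""

    def __init__(self, trained_learner, dataset):
        self.learner = trained_learner
        self.dataset = dataset

    def __iter__(self):
        return (self.learner.compute_output(s) for s in self.dataset)


train_set = [[1.0, 10.0], [3.0, 30.0]]
test_set = [[2.0, 25.0]]

centering = CenterOnMean()
centering.train(train_set)  # parameters are learned on the training set only
centered_train = list(LearnerOutputDataset(centering, train_set))
centered_test = list(LearnerOutputDataset(centering, test_set))
doubled = list(LearnerOutputDataset(Rescale(2.0), test_set))
```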
1084 7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	312
1085 de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	313 The main advantages I find in this approach (that I have been using at
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	314 Ubisoft) are:
1190 9ff2242a817b fix rst syntax errors/warnings Frederic Bastien <nouiz@nouiz.org> parents: 1131 diff changeset	315
1085 de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	316 - You only need to learn how to subclass the learner class. The only dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	317 class is LearnerOutputDataset, which you could just name Dataset.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	318 - You do not have different ways to achieve the same result (having to figure
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	319 out which one is most appropriate).
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	320 - Upgrading code from 2 to 3 is more straightforward. Such a situation can
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	321 happen e.g. if you write some code that normalizes your input dataset
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	322 (situation 2), then realize later you would like to be able to normalize new
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	323 datasets using the same parameters (e.g. same shift & rescaling), which
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	324 requires situation 3.
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	325 - It can make your life easier when thinking about how to plug things together
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	326 (something that has not been discussed yet), because the interfaces of the
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	327 various components are less varied.
1084 7e6e77d50eeb dataset: I say the learner committee should take care of dataset as well Olivier Delalleau <delallea@iro> parents: 1083 diff changeset	328
1085 de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	329 I am not saying that we should necessarily do it this way, but I think it is
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	330 worth at least keeping in mind this close relationship between simple
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	331 processing and learning, and thinking about the benefits / drawbacks
de456561ec40 dataset: Rewrote my rambling about the links between dataset and learner Olivier Delalleau <delallea@iro> parents: 1084 diff changeset	332 of keeping them separate in the class hierarchy.
1089 f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	333
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	334 RP: I actually like this idea of having the dataset implement the same
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	335 interface as the learner (or actually a subset of the interface).
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	336 I hope people decide to do this.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	337
1091 319de699fb67 dataset: Reply to GPU question Olivier Delalleau <delallea@iro> parents: 1089 diff changeset	338 Support for shared variables
319de699fb67 dataset: Reply to GPU question Olivier Delalleau <delallea@iro> parents: 1089 diff changeset	339 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1089 f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	340
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	341 RP asks: What is the status of having the dataset support copying data
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	342 on the GPU (by storing data in shared variables)? Have you decided to
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	343 include this feature or not? I think that the strongest selling point of
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	344 Theano is that it runs on GPU transparently, and I see this as a good
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	345 selling point for the library as well. Plus we intend to move more and
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	346 more towards running things on GPU. If the dataset object does not support
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	347 this feature we will need to find hacks around it.
f15216356522 Did the dataset committee decide to include some GPU support ( use shared variables ) atleast in some cases ? Razvan Pascanu <r.pascanu@gmail.com> parents: 1086 diff changeset	348
1091 319de699fb67 dataset: Reply to GPU question Olivier Delalleau <delallea@iro> parents: 1089 diff changeset	349 OD: I have like zero experience with GPU so hopefully someone else can answer
319de699fb67 dataset: Reply to GPU question Olivier Delalleau <delallea@iro> parents: 1089 diff changeset	350 this. But the way I see it, hopefully it could work by having some dataset
319de699fb67 dataset: Reply to GPU question Olivier Delalleau <delallea@iro> parents: 1089 diff changeset	351 object that would take care of storing its input data into a shared variable.
1094 75175e2e697d dataset: Continued comment about GPU and shared variables Olivier Delalleau <delallea@iro> parents: 1091 diff changeset	352 OD (continued): After thinking a bit more about it, I am not sure that would
75175e2e697d dataset: Continued comment about GPU and shared variables Olivier Delalleau <delallea@iro> parents: 1091 diff changeset	353 work. I definitely need to look at some code doing it to get a better
75175e2e697d dataset: Continued comment about GPU and shared variables Olivier Delalleau <delallea@iro> parents: 1091 diff changeset	354 understanding of it, but my feeling is that you need your learner to be
75175e2e697d dataset: Continued comment about GPU and shared variables Olivier Delalleau <delallea@iro> parents: 1091 diff changeset	355 written in a specific way to achieve this, in which case it may be up to the
75175e2e697d dataset: Continued comment about GPU and shared variables Olivier Delalleau <delallea@iro> parents: 1091 diff changeset	356 learner to take its input data and store it into a shared variable.
1104 5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	357
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	358 RP comment: Yes, the dataset object alone cannot handle this; the issue is somewhere
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	359 between the dataset and the learner. Or in other words, every time you change
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	360 the data you need to recompile your Theano function. So the learner cannot
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	361 only get data from the dataset; it needs to get a shared variable. The learner
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	362 should also be aware when the dataset is changed, to recompile its internal
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	363 functions. I'm not sure which is the best way to do this. My personal feeling
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	364 is that the dataset should be part of the learner. The learner should provide
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	365 a function use_dataset (or replace_dataset). When this function is called,
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	366 all the Theano functions in the learner get recompiled based on shared
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	367 variables that the dataset object provides. It sort of fits very well in the
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	368 framework that I have in mind, which was scattered around in the learner.txt
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	369 and some of my previous emails. I think it shares a lot with James's concepts,
5e6d7d9e803a a comment on the GPU issue for datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1094 diff changeset	370 since it follows quite closely the concepts behind Theano.
1105 546bd0ccb0e4 dataset: Question about shared variables Olivier Delalleau <delallea@iro> parents: 1104 diff changeset	371
546bd0ccb0e4 dataset: Question about shared variables Olivier Delalleau <delallea@iro> parents: 1104 diff changeset	372 OD asks: Ok, so why would the dataset have to be responsible for providing a
546bd0ccb0e4 dataset: Question about shared variables Olivier Delalleau <delallea@iro> parents: 1104 diff changeset	373 shared variable? Why wouldn't the learner just create this shared variable
546bd0ccb0e4 dataset: Question about shared variables Olivier Delalleau <delallea@iro> parents: 1104 diff changeset	374 internally and copy into it the data provided by the dataset?
546bd0ccb0e4 dataset: Question about shared variables Olivier Delalleau <delallea@iro> parents: 1104 diff changeset	375
1109 29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	376 RP replies: Sure, the learner could take care of all this. Note though that the
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	377 learner should take care to divide the dataset into chunks that fit in the
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	378 GPU memory (in case of a large dataset) and then take care of updating the
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	379 shared variables according to the current chunk. Personally I feel like all
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	380 this data division, management and so on should be done by the dataset.
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	381 It feels more natural that way. For example assume you have a dataset that
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	382 is composed of a time series and some static data (carre-tech heart beat
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	383 data is a good example). The static data is small enough so that you could
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	384 always store on the GPU, and you would only need to split the time series.
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	385 For the learner to do this (since it gets the same interface from any
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	386 dataset object) would require something like `if <this case> then ...`, while for the
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	387 dataset it is just a different class. But I'm happy to have all this GPU stuff
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	388 sent to the learner as well if everybody else believes that is better.
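
The division of labour RP argues for can be sketched in plain Python. This is only an illustration under stated assumptions: `iter_chunks` and `ChunkedTimeSeries` are hypothetical names, lists stand in for tensors, and no GPU or Theano call is involved. The point is that the dataset class, not the learner, knows that the static part stays resident while the time series is split.

```python
def iter_chunks(n_samples, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_samples)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_samples, chunk_size):
        yield start, min(start + chunk_size, n_samples)


class ChunkedTimeSeries:
    """Hypothetical dataset: small static data plus a large time series."""

    def __init__(self, static, series, chunk_size):
        self.static = static          # small: would be stored on the device once
        self.series = series          # large: handed out chunk by chunk
        self.chunk_size = chunk_size

    def chunks(self):
        for start, stop in iter_chunks(len(self.series), self.chunk_size):
            # In a GPU setting, this is where a shared variable would be refilled.
            yield self.static, self.series[start:stop]


ds = ChunkedTimeSeries(static={"age": 42}, series=list(range(10)), chunk_size=4)
chunk_lengths = [len(part) for _, part in ds.chunks()]
```

A learner consuming `ds.chunks()` never needs an `if <this case>` branch: a dataset with a different layout would simply be a different class exposing the same method.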
29b48deb6a84 reply/comment regarding the GPU and datasets Razvan Pascanu <r.pascanu@gmail.com> parents: 1105 diff changeset	389
1110 4797a4cb73e1 added comment to dataset. Frederic Bastien <nouiz@nouiz.org> parents: 1109 diff changeset	390 FB comment: I don't understand why you would need to recompile the Theano function.
4797a4cb73e1 added comment to dataset. Frederic Bastien <nouiz@nouiz.org> parents: 1109 diff changeset	391 There are 2 cases. In the first, the data is in a shared variable: you can directly change the data
4797a4cb73e1 added comment to dataset. Frederic Bastien <nouiz@nouiz.org> parents: 1109 diff changeset	392 in the shared variable without recompiling the Theano function. The second case is when
4797a4cb73e1 added comment to dataset. Frederic Bastien <nouiz@nouiz.org> parents: 1109 diff changeset	393 the dataset is in an ordinary Theano variable. In that case, the first step in the
4797a4cb73e1 added comment to dataset. Frederic Bastien <nouiz@nouiz.org> parents: 1109 diff changeset	394 Theano function will be to transfer the dataset to the GPU before computation. If the data
4797a4cb73e1 added comment to dataset. Frederic Bastien <nouiz@nouiz.org> parents: 1109 diff changeset	395 changes at each call, that will be as efficient as changing the data manually every time
4797a4cb73e1 added comment to dataset. Frederic Bastien <nouiz@nouiz.org> parents: 1109 diff changeset	396 in the shared variable.
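
FB's first case can be illustrated with a plain-Python analogy. The `SharedBuffer` class below is a hypothetical stand-in, not Theano's shared variable, and `compile_mean` only imitates compilation by building a closure once. What it shows is the reasoning: a function bound to a container keeps working when the container's contents are replaced.

```python
class SharedBuffer:
    """Plain-Python stand-in for a shared variable (not Theano's class)."""

    def __init__(self, value):
        self._value = list(value)

    def get_value(self):
        return self._value

    def set_value(self, new_value):
        self._value = list(new_value)


def compile_mean(buffer):
    """Pretend 'compilation': build the function once, bound to the buffer."""
    def mean():
        data = buffer.get_value()
        return sum(data) / len(data)
    return mean


buf = SharedBuffer([1.0, 2.0, 3.0])
mean_fn = compile_mean(buf)   # built a single time
first = mean_fn()
buf.set_value([10.0, 20.0])   # swap the data held by the buffer
second = mean_fn()            # same function object, new data
```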
1116 18a092001752 An idea about Datasets and GPU. Arnaud Bergeron <abergeron@gmail.com> parents: 1110 diff changeset	397
18a092001752 An idea about Datasets and GPU. Arnaud Bergeron <abergeron@gmail.com> parents: 1110 diff changeset	398 AB: I have an idea about this which kind of fits in the "building a
18a092001752 An idea about Datasets and GPU. Arnaud Bergeron <abergeron@gmail.com> parents: 1110 diff changeset	399 theano op" thing that we talked about at the last meeting.
18a092001752 An idea about Datasets and GPU. Arnaud Bergeron <abergeron@gmail.com> parents: 1110 diff changeset	400
1117 c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out. Arnaud Bergeron <abergeron@gmail.com> parents: 1116 diff changeset	401 We can just build a theano Op that wraps dataset objects and takes
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out. Arnaud Bergeron <abergeron@gmail.com> parents: 1116 diff changeset	402 care of the details of transferring data to the GPU or otherwise.
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out. Arnaud Bergeron <abergeron@gmail.com> parents: 1116 diff changeset	403
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out. Arnaud Bergeron <abergeron@gmail.com> parents: 1116 diff changeset	404 I have a prototype interface/implementation in the shared_dataset.py
c1943feada10 Proposal for theano dataset wrapper. The details still have to be worked out. Arnaud Bergeron <abergeron@gmail.com> parents: 1116 diff changeset	405 file in this directory.
1127 7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue Olivier Delalleau <delallea@iro> parents: 1124 diff changeset	406
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue Olivier Delalleau <delallea@iro> parents: 1124 diff changeset	407 OD: I like AB's approach.
7207f86a661f dataset: Comment on AB's idea to handle the GPU/shared variable issue Olivier Delalleau <delallea@iro> parents: 1124 diff changeset	408
1337 7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	409
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	410 Data API proposal by Olivier D
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	411 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	412
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	413 A single sample containing multiple fields (e.g. an input and a target part)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	414 is an object s that you can manipulate as follows:
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	415
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	416 .. code-block:: python
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	417
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	418 # Obtain actual data stored within `s` (e.g. a numpy vector). There is no
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	419 # guarantee that modifying the resulting data object will actually update
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	420 # the data stored in `s`.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	421 data = s()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	422 # Create a sample that sees a field of `s`.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	423 input_part = s.input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	424 # Obtain actual input data (e.g. as a numpy vector).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	425 input_data = input_part()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	426 # Create a sample that sees the i-th element of the data stored in `s`.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	427 ith = s[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	428 # This should not fail.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	429 assert ith() == s()[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	430 # You could also select a range.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	431 i_to_j = s[i:j]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	432 assert i_to_j() == s()[i:j]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	433 # And actually do pretty much anything you want with __getitem__, as long
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	434 # as the underlying data stored in the sample supports it (for instance,
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	435 # here it should be at least a 3D tensor).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	436 fancy_selection = s[i, :, j:k]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	437 assert fancy_selection() == s()[i, :, j:k]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	438 # Write some value (e.g. a numpy vector) into the sample. May raise an
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	439 # exception if the sample is in read-only mode.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	440 s._write(val)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	441 # Shortcut to write data into a field (same as `s.input._write(val)`).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	442 s.input = val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	443 # Basic mathematical operators.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	444 s *= val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	445 s += val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	446 s -= val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	447 s /= val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	448 # Replace a field. Note that this is different from `s.input = val`
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	449 # because here `new_input` is a sample, not a numeric value: the current
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	450 # `s.input` will not be written to, instead it makes `s.input` point
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	451 # towards a different sample. This may lead to confusion, so a different
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	452 # syntax may be better (e.g. s._set_field('input', new_input)).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	453 s.input = new_input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	454 # The equality of two samples is defined by the equality of their
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	455 # underlying data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	456 def __eq__(self, other):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	457     return self() == other()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	458 # Iterate on fields (open question: should they be ordered?).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	459 fields = dict([(name, sample) for name, sample in s._iter_fields()])
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	460 assert fields['input'] == s.input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	461 # Iterating on a sample yields samples that see consecutive elements.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	462 for sample, value in izip(s, s()):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	463     assert sample() == value
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	464 # The length of a sample is the same as that of its underlying data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	465 assert len(s) == len(s())
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	466 # The shape of a sample is the same as that of its underlying data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	467 # Note that it only makes sense for tensor-like data.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	468 assert s._shape() == s().shape
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	469 # The size of a sample is the product of its shape elements.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	470 assert s._size() == reduce(operator.__mul__, s._shape())
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	471
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	472 All sample methods should start with '_', to differentiate them from the
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	473 sample's fields. This is a bit awkward, but I like the `sample.field` syntax
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	474 compared to something like "sample.get_field('field')", which makes code less
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	475 readable, especially when combining with sub_fields, e.g. `sample.input.x1`
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	476 vs. sample.get_field('input').get_field('x1').
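
To make the `sample.field` syntax and the underscore convention concrete, here is a minimal read-only sketch. It rests on assumptions that are not part of the proposal: the underlying data is a flat Python list, each field is a slice of that list, and field access returns a copy rather than a writable view (so `_write` is left out). The `Sample` constructor signature is hypothetical.

```python
class Sample:
    """Hypothetical read-only sample over a list, with fields as slices."""

    def __init__(self, data, fields=None):
        self._data = data                  # underlying storage
        self._fields = dict(fields or {})  # field name -> slice of the data

    def __call__(self):
        # s() returns the underlying data.
        return self._data

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for field names.
        fields = self.__dict__.get("_fields", {})
        if name in fields:
            return Sample(self._data[fields[name]])
        raise AttributeError(name)

    def __getitem__(self, key):
        # s[i] and s[i:j] are samples that see part of the data.
        return Sample(self._data[key])

    def __eq__(self, other):
        # Equality is defined by the underlying data.
        return isinstance(other, Sample) and self() == other()

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        return (Sample(value) for value in self._data)

    def _iter_fields(self):
        # Methods start with '_' so they cannot clash with field names.
        for name in self._fields:
            yield name, getattr(self, name)


s = Sample([0.1, 0.2, 0.3, 1.0],
           fields={"input": slice(0, 3), "target": slice(3, 4)})
input_data = s.input()
fields = dict(s._iter_fields())
```

Because `__getattr__` is only consulted after normal attribute lookup fails, any public method name would shadow a field of the same name, which is the practical reason for the leading underscore.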
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	477
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	478 The extension from sample to dataset is actually to use the same class, but
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	479 with the convention that the first "dimension" in the data seen by the dataset
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	480 corresponds to the samples' indices in the dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	481
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	482 .. code-block:: python
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	483
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	484 # Return data stored in dataset `d` (e.g. a numpy matrix).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	485 data = d()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	486 # Return the i-th sample in the dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	487 s = d[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	488 # Data should match!
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	489 assert data[i] == s()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	490 # Return a subset of the dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	491 sub_data = d[i:j]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	492 # Advanced indexing.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	493 sub_data = d[some_list_of_indices]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	494 # Dataset that sees the input part only.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	495 input_part = d.input
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	496 # Dataset such that its i-th element is data[i][something] (see the sample
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	497 # examples for what `something` may be).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	498 some_sub_data = d[:, something]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	499 # The following should not fail.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	500 assert d[i, something] == d[i][something] # == some_sub_data[i]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	501 # You can also write into a dataset.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	502 d._write(val)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	503 d.input = val
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	504 # Center dataset in-place (requires `d` not to be read-only).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	505 d -= numpy.mean(d())
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	506 # The length of a dataset is its number of samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	507 n_samples = len(d)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	508 # The width of a dataset (if it exists) is the length of its samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	509 assert d._shape()[1] == len(d[0]) # == d._width() (shortcut)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	510 # Iterating on a dataset yields individual samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	511 for i, sample in enumerate(d):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	512     assert d[i] == sample
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	513 # It is allowed for a dataset to hold heterogeneous data. For instance
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	514 # you could have
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	515 len(d.data1) != len(d.data2)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	516 # A sample in the dataset is not required to inherit all the dataset's
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	517 # fields, for instance in the case above you could decide that the dataset
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	518 # sees the same data as its first sub-dataset, i.e.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	519 d[i] == d.data1[i]
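
The convention that the first dimension indexes samples can be sketched as follows. This is an illustration only, with several simplifying assumptions: rows are plain Python lists, fields are column slices, and `d[i]` returns the raw row rather than a sample object as the proposal specifies. The `Dataset` constructor signature is hypothetical.

```python
class Dataset:
    """Hypothetical dataset: a list of rows with named column fields."""

    def __init__(self, rows, fields=None):
        self._rows = rows
        self._fields = dict(fields or {})   # field name -> column slice

    def __call__(self):
        return self._rows

    def __len__(self):
        return len(self._rows)              # number of samples

    def __getitem__(self, key):
        if isinstance(key, tuple):          # d[i, something] or d[:, something]
            first, rest = key
            if isinstance(first, slice):
                return Dataset([row[rest] for row in self._rows[first]])
            return self._rows[first][rest]
        if isinstance(key, slice):          # d[i:j] is a sub-dataset
            return Dataset(self._rows[key], self._fields)
        if isinstance(key, list):           # d[some_list_of_indices]
            return Dataset([self._rows[i] for i in key], self._fields)
        return self._rows[key]              # d[i]: raw row, to keep this short

    def __iter__(self):
        return iter(self._rows)

    def __getattr__(self, name):
        fields = self.__dict__.get("_fields", {})
        if name in fields:
            return Dataset([row[fields[name]] for row in self._rows])
        raise AttributeError(name)

    def _width(self):
        return len(self._rows[0])


d = Dataset([[1, 2, 0], [3, 4, 1], [5, 6, 0]],
            fields={"input": slice(0, 2), "target": slice(2, 3)})
inputs = d.input()
```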
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	520
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	521 There remain some fuzzy points. For instance, are fields allowed to overlap?
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	522 (e.g. so that one could write both s.pos_3d to get the 3d vector coordinate of
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	523 sample s, and s.x to get the x coordinate without being forced to go through
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	524 s.pos_3d.x). What are the fields of s[i:j] if the (i, j) range does not
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	525 exactly match a subset of fields? How do we handle metadata? (e.g. if we want
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	526 to describe the dataset to say it contains 28x28 image data, so that an
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	527 algorithm for filter visualization can automatically deal with it).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	528
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	529 Now, on to some use cases.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	530
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	531 .. code-block:: python
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	532
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	533 # Mini-batches.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	534 mb_dataset = d._minibatches(batch_size=5)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	535 # The mini-batch dataset views samples that are mini-batches.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	536 assert mb_dataset[0]() == d[0:5]() # As long as len(d) >= 5.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	537
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	538 # Shuffling samples.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	539 random_indices = numpy.arange(len(d))
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	540 numpy.random.shuffle(random_indices)  # Shuffles in place; returns None.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	541 shuffled_dataset = d[random_indices]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	542
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	543 # Typical linear regression with stochastic gradient descent.
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	544 n_inputs = d.input._width()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	545 n_targets = d.target._width()
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	546 weights = numpy.zeros((n_inputs, n_targets))
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	547 bias = numpy.zeros(n_targets)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	548 mb_dataset = d._minibatches(batch_size=10)
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	549 # Note: it is important to get the number of inputs / targets
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	550 # before converting to minibatches, because
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	551 # mb_dataset.input._width() == 10
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	552 # since this is the length of a minibatch matrix. However you
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	553 # could still do the following, which is less readable:
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	554 # n_inputs = mb_dataset.input._shape()[2]
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	555 # You could also wait until you see the first sample to create
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	556 # the parameters (this would actually be a better way to do it, since
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	557 # it avoids calling the _width method).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	558 for input, target in izip(mb_dataset.input, mb_dataset.target):
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	559     cost = (numpy.dot(input(), weights) + bias - target())**2
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	560     # Update weights and bias depending on cost....
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	561
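The update step left out of the loop above could look roughly like the following sketch. It uses plain numpy arrays in place of the proposed dataset fields, since that API is not implemented here; the function name `sgd_linear_regression` and the `learning_rate` and `n_epochs` parameters are assumptions made for illustration only.

```python
import numpy


def sgd_linear_regression(inputs, targets, batch_size=10, learning_rate=0.01,
                          n_epochs=100):
    """Fit targets ~ dot(inputs, weights) + bias with mini-batch SGD.

    `inputs` and `targets` are 2-d arrays with one row per sample; they
    stand in for d.input and d.target (hypothetical mapping).
    """
    n_inputs = inputs.shape[1]
    n_targets = targets.shape[1]
    weights = numpy.zeros((n_inputs, n_targets))
    bias = numpy.zeros(n_targets)
    for _ in range(n_epochs):
        for start in range(0, len(inputs), batch_size):
            x = inputs[start:start + batch_size]
            y = targets[start:start + batch_size]
            error = numpy.dot(x, weights) + bias - y
            # Gradient of the squared error, averaged over the mini-batch.
            weights -= learning_rate * 2 * numpy.dot(x.T, error) / len(x)
            bias -= learning_rate * 2 * error.mean(axis=0)
    return weights, bias
```

Reading the number of inputs and targets from the array shapes corresponds to the "wait until you see the first sample" option mentioned in the comments above.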
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	562 A few more points:
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	563 - Infinite datasets could be used (would just need to define a convention
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	564 on what __len__ should do).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	565 - It is also ok to have datasets that do not support random access (so the
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	566 only way to access samples is through iteration).
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	567 - Ideally, data should be deterministic (i.e. __call__() should always
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	568 return the same thing). It would probably be up to the user to be super
7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	569 careful if they decide to use a non-deterministic dataset.
1338 91637815b7ca Added a comment on the dataset vs. task issue Olivier Delalleau <delallea@iro> parents: 1337 diff changeset	570 - About the "task vs. dataset" distinction. This could be achieved by
91637815b7ca Added a comment on the dataset vs. task issue Olivier Delalleau <delallea@iro> parents: 1337 diff changeset	571 associating to a task the names of the fields it requires (e.g. "input"
91637815b7ca Added a comment on the dataset vs. task issue Olivier Delalleau <delallea@iro> parents: 1337 diff changeset	572 and "target" for the regression task), and if the dataset does not
91637815b7ca Added a comment on the dataset vs. task issue Olivier Delalleau <delallea@iro> parents: 1337 diff changeset	573 already define these fields, using a dataset wrapper that does so
91637815b7ca Added a comment on the dataset vs. task issue Olivier Delalleau <delallea@iro> parents: 1337 diff changeset	574 (saying for instance that "input" is the concatenation of "x1" and "x2",
91637815b7ca Added a comment on the dataset vs. task issue Olivier Delalleau <delallea@iro> parents: 1337 diff changeset	575 and "target" is "y", for a dataset whose fields are x1, x2 and y).
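
As an illustration of the last point, here is a minimal sketch of such a wrapper. It assumes the underlying dataset can be represented as a dictionary of 2-d numpy arrays; the class name `FieldMappingWrapper` and its `mapping` argument are hypothetical and not part of the proposal.

```python
import numpy


class FieldMappingWrapper(object):
    """Expose task-level fields built from a dataset's existing fields.

    Hypothetical helper: `fields` maps field names to 2-d numpy arrays
    (one row per sample), and `mapping` maps each task field name to the
    list of underlying fields to concatenate column-wise.
    """

    def __init__(self, fields, mapping):
        self._fields = fields
        self._mapping = mapping

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith('_') or name not in self._mapping:
            raise AttributeError(name)
        parts = [self._fields[source] for source in self._mapping[name]]
        return numpy.concatenate(parts, axis=1)


# A dataset whose fields are x1, x2 and y, viewed as a regression task.
raw = {'x1': numpy.array([[1.0], [2.0]]),
       'x2': numpy.array([[3.0, 4.0], [5.0, 6.0]]),
       'y': numpy.array([[0.0], [1.0]])}
task_view = FieldMappingWrapper(raw, {'input': ['x1', 'x2'],
                                      'target': ['y']})
```

With this wrapper, `task_view.input` is the column-wise concatenation of x1 and x2, and `task_view.target` is y, so a regression task that only knows about "input" and "target" can consume the dataset unchanged.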
1337 7dfc3d3052ea Added proposal for dataset API as discussed on pylearn-dev Olivier Delalleau <delallea@iro> parents: 1190 diff changeset	576
1339 158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	577
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	578 RP comments:
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	579 - I like this approach. I think having overlapping fields might be useful.
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	580 I would add that I was thinking of a way to look at one's results. It is
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	581 something I've been faced with: say you run 500 jobs and then you want to
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	582 understand those jobs' results. Looking only at the best-performing one seems a waste, and
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	583 there is a lot more information you can extract from your results if you are
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	584 able to generate certain plots or statistics. To do this you would need to
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	585 get the data in ipython (or something quite similar) where you have available
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	586 the needed functions to plot different things, generate different tables. The
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	587 point that I was trying to make is that you can get those results in
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	588 something that has this very API that Olivier described. This way both
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	589 your input data and your results will be in the same form and whatever
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	590 visualization functions you have for your results you can use on your data as
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	591 well. For this you would need a bit more flexibility, in the sense that if
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	592 you have some data d, you should be able to put constraints on it, like
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	593 d.some_field == 5 means all entries in d that have some_field == 5, or
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	594 d.some_field > 5. You would also not use psql anymore but this console,
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	595 which would collect the results for you from sql, and give them to you as
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	596 a data object.
158493f8dff9 comment on dataset proposal by Olivier Razvan Pascanu <r.pascanu@gmail.com> parents: 1338 diff changeset	597
1340 04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	598 OD replies: Actually this should be doable with (almost) what I wrote above,
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	599 due to the way numpy redefines ==, >, etc. (which btw should break some of my
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	600 assertions above, since I had forgotten about this). If you replace e.g. my
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	601 implementation of __eq__ above by the following:
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	602
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	603 .. code-block:: python
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	604
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	605 def __eq__(self, other):
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	606 return other == self()
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	607
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	608 Here, `self` is a dataset that represents some numpy vector data. Then whether
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	609 `other` is another dataset or a numpy vector or some scalar, this will return
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	610 a numpy boolean vector (the result of the comparison made by numpy). We may
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	611 support boolean vectors in advanced indexing, so you could do
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	612 d[d.some_field == 5]
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	613 and obtain the subset of `d` whose samples have `some_field` set to 5.
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	614 Same could be done with __lt__, __le__, etc.
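
A minimal sketch of how this filtering could work is given below. `VectorField` and `TinyDataset` are hypothetical stand-ins for the proposed dataset classes, and the comparison operators unwrap their argument explicitly, which is a small variation on the `__eq__` shown above.

```python
import numpy


def _unwrap(value):
    # Turn a VectorField into its numpy data; leave anything else alone.
    return value() if isinstance(value, VectorField) else value


class VectorField(object):
    """Hypothetical stand-in for a dataset field holding a numpy vector."""

    def __init__(self, values):
        self._values = numpy.asarray(values)

    def __call__(self):
        return self._values

    # Comparisons are delegated to numpy and return boolean vectors.
    def __eq__(self, other):
        return self._values == _unwrap(other)

    def __gt__(self, other):
        return self._values > _unwrap(other)


class TinyDataset(object):
    """Hypothetical dataset made of equally long named vectors."""

    def __init__(self, **fields):
        self._fields = dict((k, numpy.asarray(v)) for k, v in fields.items())

    def __getattr__(self, name):
        if name.startswith('_') or name not in self._fields:
            raise AttributeError(name)
        return VectorField(self._fields[name])

    def __len__(self):
        return len(next(iter(self._fields.values())))

    def __getitem__(self, mask):
        # A boolean mask selects the matching samples in every field.
        return TinyDataset(**dict((k, v[mask])
                                  for k, v in self._fields.items()))


d = TinyDataset(some_field=[5, 3, 5, 8], score=[0.1, 0.2, 0.3, 0.4])
subset = d[d.some_field == 5]
```

Here `subset` holds the two samples whose `some_field` equals 5, and `d[d.some_field > 5]` would select the last sample in the same way.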
04b988fb00b6 Reply to Razvan Olivier Delalleau <delallea@iro> parents: 1339 diff changeset	615
