comparison doc/v2_planning/dataset.txt @ 1077:5c14d2ffcbb3

dataset: Looked into a few more existing ML libraries
author Olivier Delalleau <delallea@iro>
date Fri, 10 Sep 2010 12:48:32 -0400
parents 20a1af112a75
children f9f72ae84313
1076:20a1af112a75 1077:5c14d2ffcbb3
21 21
22 Some ideas from existing ML libraries: 22 Some ideas from existing ML libraries:
23 23
24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData, 24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
25 PairDataSet, Aggregate. Ultimately, the learner decides 25 PairDataSet, Aggregate. Ultimately, the learner decides
26 - mlpy: very primitive notions of data 26 - mlpy: very primitive notions of data (simple 2D matrices)
27 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet, 27 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
28 SequentialDataSet, ReinforcementDataSet, ... Each class is quite 28 SequentialDataSet, ReinforcementDataSet, ... Each class is quite
29 constrained and may have a different interface. 29 constrained and may have a different interface.
30 - MDP: Seems to have restrictions on the type of data being passed around, as 30 - MDP: Seems to have restrictions on the type of data being passed around, as
31 well as its dimensionality ("Input array data is typically assumed to be 31 well as its dimensionality ("Input array data is typically assumed to be
32 two-dimensional and ordered such that observations of the same variable are 32 two-dimensional and ordered such that observations of the same variable are
33 stored on rows and different variables are stored on columns.") 33 stored on rows and different variables are stored on columns.")
34 - Orange: Data matrices, with names and types associated to each column. 34 - Orange: Data matrices, with names and types associated to each column.
35 Basically there seems to be only one base dataset class that contains the 35 Basically there seems to be only one base dataset class that contains the
36 data. Data points are lists (of values corresponding to each column). 36 data. Data points are lists (of values corresponding to each column).
37 - (still going through the other ones) 37 - APGL: Hard to say how they deal with data from the documentation alone.
38 - Monte: Data is simply numpy arrays.
39 - scikits.learn: Dataset is a simple container with e.g. dataset.data being
40 a 2D numpy array of input features, and dataset.target the target vector.
41 - Shogun: Vade Retro C++! (may be worth looking into their feature concept
42 though).
43 - Any more worth looking at?
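
To make the scikits.learn entry above concrete, here is a minimal Python sketch of that kind of container. The class name SimpleDataset and its methods are hypothetical and are not scikits.learn's actual API. Plain lists stand in for the 2D numpy array of input features and the target vector, so that the sketch has no dependencies.

```python
class SimpleDataset(object):
    """Hypothetical container: `data` holds one row of input features per
    example, `target` holds the matching target values."""

    def __init__(self, data, target):
        # Every example needs exactly one target value.
        if len(data) != len(target):
            raise ValueError("data and target must have the same length")
        self.data = data
        self.target = target

    def __len__(self):
        # Number of examples in the dataset.
        return len(self.data)

    def __getitem__(self, i):
        # One example as an (input features, target) pair.
        return self.data[i], self.target[i]


dataset = SimpleDataset(data=[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
                        target=[0, 1, 0])
features, label = dataset[1]
```

The sketch relies only on len() and integer indexing, so numpy arrays could be passed in place of the lists. Note that a fixed-length container like this one does not by itself cover requirements such as possibly infinite streams.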
38 44
39 A few things that our dataset containers should support at a minimum: 45 A few things that our dataset containers should support at a minimum:
40 46
41 - streams, possibly infinite 47 - streams, possibly infinite
42 - task/views of the data for different problems 48 - task/views of the data for different problems