comparison doc/v2_planning/dataset.txt @ 1077:5c14d2ffcbb3
dataset: Looked into a few more existing ML libraries
author | Olivier Delalleau <delallea@iro> |
---|---|
date | Fri, 10 Sep 2010 12:48:32 -0400 |
parents | 20a1af112a75 |
children | f9f72ae84313 |
1076:20a1af112a75 | 1077:5c14d2ffcbb3 |
---|---|
21 | 21 |
22 Some ideas from existing ML libraries: | 22 Some ideas from existing ML libraries: |
23 | 23 |
24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData, | 24 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData, |
25 PairDataSet, Aggregate. Ultimately, the learner decides | 25 PairDataSet, Aggregate. Ultimately, the learner decides |
26 - mlpy: very primitive notions of data | 26 - mlpy: very primitive notions of data (simple 2D matrices) |
27 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet, | 27 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet, |
28 SequentialDataSet, ReinforcementDataSet, ... Each class is quite | 28 SequentialDataSet, ReinforcementDataSet, ... Each class is quite |
29 constrained and may have a different interface. | 29 constrained and may have a different interface. |
30 - MDP: Seems to have restrictions on the type of data being passed around, as | 30 - MDP: Seems to have restrictions on the type of data being passed around, as |
31 well as its dimensionality ("Input array data is typically assumed to be | 31 well as its dimensionality ("Input array data is typically assumed to be |
32 two-dimensional and ordered such that observations of the same variable are | 32 two-dimensional and ordered such that observations of the same variable are |
33 stored on rows and different variables are stored on columns.") | 33 stored on rows and different variables are stored on columns.") |
34 - Orange: Data matrices, with names and types associated to each column. | 34 - Orange: Data matrices, with names and types associated to each column. |
35 Basically there seems to be only one base dataset class that contains the | 35 Basically there seems to be only one base dataset class that contains the |
36 data. Data points are lists (of values corresponding to each column). | 36 data. Data points are lists (of values corresponding to each column). |
37 - (still going through the other ones) | 37 - APGL: Hard to say how they deal with data from the documentation alone. |
 | 38 - Monte: Data is simply numpy arrays. |
 | 39 - scikits.learn: Dataset is a simple container with e.g. dataset.data being |
 | 40 a 2D numpy array of input features, and dataset.target the target vector. |
 | 41 - Shogun: Vade Retro C++! (may be worth looking into their feature concept |
 | 42 though). |
 | 43 - Any more worth looking at? |
38 | 44 |
39 A few things that our dataset containers should support at a minimum: | 45 A few things that our dataset containers should support at a minimum: |
40 | 46 |
41 - streams, possibly infinite | 47 - streams, possibly infinite |
42 - task/views of the data for different problems | 48 - task/views of the data for different problems |
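
The two requirements visible in this hunk (possibly infinite streams, and task-specific views of the same data) can be illustrated with a minimal sketch. This is not pylearn code and not a proposed design: the names `StreamDataset`, `view`, `take` and `noisy_line` are hypothetical, chosen only to show how a lazily iterated container could expose different field subsets to different tasks.

```python
import itertools
import random


class StreamDataset:
    """Hypothetical container over a (possibly infinite) stream of examples.

    Each example is a dict mapping field names to values.
    """

    def __init__(self, make_iterator):
        # make_iterator: zero-argument callable returning a fresh iterator,
        # so the dataset can be traversed more than once.
        self._make_iterator = make_iterator

    def __iter__(self):
        return self._make_iterator()

    def view(self, *fields):
        """Return a new dataset exposing only the requested fields."""
        def make():
            return ({f: ex[f] for f in fields} for ex in self._make_iterator())
        return StreamDataset(make)

    def take(self, n):
        """Materialize the first n examples; safe on infinite streams."""
        return list(itertools.islice(self, n))


def noisy_line(seed=0):
    """Infinite stream of examples with fields 'x', 'y' and 'label'."""
    rng = random.Random(seed)
    for i in itertools.count():
        x = float(i)
        y = 2.0 * x + rng.gauss(0.0, 0.1)
        yield {"x": x, "y": y, "label": int(y > 5.0)}


full = StreamDataset(noisy_line)
regression = full.view("x", "y")           # view for a regression task
classification = full.view("x", "label")   # view for a classification task

first_three = regression.take(3)
```

Because views are built lazily on top of the underlying iterator, nothing is materialized until `take` is called, which is what makes the infinite case workable in this sketch.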