# HG changeset patch
# User Olivier Delalleau
# Date 1284137312 14400
# Node ID 5c14d2ffcbb3164ee8639be5e5625db865e56ba1
# Parent 20a1af112a75ed9661f27f4bef7cfba335a75248
dataset: Looked into a few more existing ML libraries

diff -r 20a1af112a75 -r 5c14d2ffcbb3 doc/v2_planning/dataset.txt
--- a/doc/v2_planning/dataset.txt Fri Sep 10 12:11:10 2010 -0400
+++ b/doc/v2_planning/dataset.txt Fri Sep 10 12:48:32 2010 -0400
@@ -23,7 +23,7 @@
 - PyML: notion of dataset containers: VectorDataSet, SparseDataSet, KernelData,
   PairDataSet, Aggregate.
   Ultimately, the learner decides
-- mlpy: very primitive notions of data
+- mlpy: very primitive notions of data (simple 2D matrices)
 - PyBrain: Datasets are geared towards specific tasks: ClassificationDataSet,
   SequentialDataSet, ReinforcementDataSet, ...
   Each class is quite constrained and may have a different interface.
@@ -34,7 +34,13 @@
 - Orange: Data matrices, with names and types associated to each column.
   Basically there seems to be only one base dataset class that contains the
   data. Data points are lists (of values corresponding to each column).
-- (still going through the other ones)
+- APGL: Hard to say how they deal with data from the documentation alone.
+- Monte: Data is simply numpy arrays.
+- scikits.learn: Dataset is a simple container with e.g. dataset.data being
+  a 2D numpy array of input features, and dataset.target the target vector.
+- Shogun: Vade Retro C++! (may be worth looking into their feature concept
+  though).
+- Any more worth looking at?
 
 A few things that our dataset containers should support at a minimum:
 