pylearn: changeset 1207:53937045f6c7

Pasted content of email sent by Ian about existing python ML libraries

author:   Olivier Delalleau <delallea@iro>
date:     Tue, 21 Sep 2010 10:58:14 -0400
parents:  203569655069
children: 0186805a93e7
files:    doc/v2_planning/existing_python_ml_libraries.txt
diffstat: 1 file changed, 101 insertions(+), 0 deletions(-)
--- a/doc/v2_planning/existing_python_ml_libraries.txt	Tue Sep 21 10:51:38 2010 -0400
+++ b/doc/v2_planning/existing_python_ml_libraries.txt	Tue Sep 21 10:58:14 2010 -0400
@@ -23,3 +23,104 @@

Also check out
http://scipy.org/Topical_Software#head-fc5493250d285f5c634e51be7ba0f80d5f4d6443
- scipy.org's "topical software" section on Artificial Intelligence
and Machine Learning


Email sent by IG to lisa_labo
-----------------------------

The Existing Libraries committee has finished meeting. We have three
sets of recommendations:

1. Recommendations for designing pylearn based on features we like
   from other libraries
2. Recommendations for distributing pylearn alongside other libraries
3. Recommendations for implementations to wrap

1. Features we liked from other libraries include:

- Most major libraries, such as MDP, PyML, scikit.learn, and PyBrain,
  offer some way of building a DAG that specifies a feedforward
  architecture (Monte Python does this and allows backprop as well).
  We will probably use a similar structure, but with more features on
  top of it, such as joint training. One nice feature of MDP is the
  ability to visualize this structure in an HTML document. (A sketch
  of this style of API follows the list below.)
- Dataset abstractions handled by transformer nodes: rather than
  defining several "views" or "casts" of datasets, most of these
  libraries (particularly MDP and scikit.learn) let you feed in
  whatever kind of data you want and have the processing nodes in
  your DAG format the data correctly for the later parts of the DAG.
  This makes it easy to use several transformed versions of a dataset
  (e.g., images chopped up into small patches) without pylearn having
  to include functionality for every possible transformation. (See
  the second sketch after the list.)
- MDP and scikit.learn both have a clearly defined inheritance
  structure, with a small number of root-level superclasses exposing
  most of the functionality of the library through their method
  signatures.
- Checkpoints: MDP lets the user register arbitrary callbacks to run
  at various points during training or processing. This is mainly
  intended for saving state for crash recovery, but it could have
  other uses, such as visualizing the evolution of the weights over
  time. (See the third sketch after the list.)
- MDP includes an interface for learners to declare that they can
  learn in parallel, i.e., the same object can look at different data
  on different CPU cores. This is not useful for SGD-based models but
  could be nice for PCA/SFA-type models (which is most of what MDP
  implements).
- Monte Python has humorously named classes, such as the 'gym', the
  package that contains all of the 'trainers'.
- PyML has support for sparse datasets.
- PyML has an 'aggregate dataset' that can combine other datasets.
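
To make the DAG style concrete, here is a minimal sketch in the
spirit of MDP's Flow API. The node classes below exist in MDP, but
treat the exact signatures as approximate rather than authoritative::

    import mdp
    import numpy as np

    x = np.random.randn(200, 10)   # toy data: 200 samples, 10 features

    # A feedforward DAG: PCA followed by a quadratic expansion.
    # The Flow trains each node in sequence on the data flowing
    # through the nodes before it.
    flow = mdp.Flow([mdp.nodes.PCANode(output_dim=5),
                     mdp.nodes.PolynomialExpansionNode(2)])
    flow.train(x)
    y = flow.execute(x)            # run the whole pipeline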
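
The transformer-node idea does not need any particular library to
illustrate; the PatchExtractor class below is hypothetical, but it
shows how a dataset transformation can live inside the DAG instead
of inside the dataset class::

    import numpy as np

    class PatchExtractor(object):
        """Hypothetical node: turns a batch of images into a batch of
        flattened patches, so later nodes never need to know about
        the original image layout."""

        def __init__(self, patch_shape):
            self.patch_shape = patch_shape

        def transform(self, images):
            ph, pw = self.patch_shape
            patches = []
            for img in images:
                for i in range(0, img.shape[0] - ph + 1, ph):
                    for j in range(0, img.shape[1] - pw + 1, pw):
                        patches.append(img[i:i + ph, j:j + pw].ravel())
            return np.array(patches)

    # 10 images of 28x28 -> (490, 16): 49 non-overlapping 4x4 patches
    # per image.
    x = np.random.randn(10, 28, 28)
    patches = PatchExtractor((4, 4)).transform(x)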
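
The checkpoint mechanism is simple enough to sketch generically; the
training loop and callback below are hypothetical interfaces, not
MDP's actual ones::

    import pickle

    def train(model, dataset, n_epochs, callbacks=()):
        # Hypothetical training loop that fires checkpoint callbacks
        # after every epoch.
        for epoch in range(n_epochs):
            model.update(dataset)
            for callback in callbacks:
                callback(model, epoch)

    def save_state(model, epoch):
        # Crash recovery: dump the full model state each epoch.
        with open('model_epoch_%d.pkl' % epoch, 'wb') as f:
            pickle.dump(model, f)

The same hook could just as easily record weight norms or render
filter images for later inspection.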

2. Recommendations for distributing pylearn

pylearn appears to be the most ambitious of all existing Python
machine learning projects. There is no established machine learning
library whose scope is broad enough for us to contribute to it rather
than developing our own.

Some libraries are frequently cited in the literature and
well-respected; libsvm is one example. We should wrap these libraries
so that pylearn users can run experiments with the most
well-established and credible implementations available. (A sketch of
what such a wrapper might look like follows the list in section 3.)

Wrapping third-party libraries may present licensing issues. We
expect to release pylearn under the BSD license (so that business
partners such as Ubisoft can use it in shipped products), but much of
the code we want to wrap may be released under the GPL or some other
license that prevents inclusion in a BSD project. We therefore
propose to keep only core functionality in pylearn itself and to put
most implementations of actual algorithms into separate packages. One
package could provide a set of BSD-licensed plugins developed by us
or based on wrapping BSD-licensed third-party libraries, and another
package could provide a set of GPL-licensed plugins developed by
wrapping GPL'ed code.

3. Recommendations for implementations to wrap

shogun:
    large-scale kernel learning (mostly SVMs). shogun itself wraps
    other libraries we should definitely be interested in, such as
    libsvm (because it is well-established) and others that get
    state-of-the-art performance or are good for extremely large
    datasets, etc.
milk:
    k-means
    SVMs with arbitrary Python types for kernel arguments
pybrain:
    LSTMs
mlpy:
    feature selection
mdp:
    ICA
    LLE
scikit.learn:
    lasso
    nearest neighbor
    isomap
    various metrics
    mean shift
    cross-validation
    LDA
    HMMs
Yet Another Python Graph Library:
    graph similarity functions that could be useful if we want to
    learn with graphs as data
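
As a concrete illustration of the wrapping recommendation, here is a
minimal sketch assuming libsvm's bundled Python bindings (svmutil);
the pylearn-side class name and interface are hypothetical::

    from svmutil import svm_train, svm_predict

    class LibSVMClassifier(object):
        """Hypothetical pylearn plugin that delegates to libsvm."""

        def __init__(self, options='-c 1 -t 2'):   # C=1, RBF kernel
            self.options = options
            self.model = None

        def train(self, inputs, targets):
            # libsvm's bindings expect plain lists: one label and one
            # feature vector (a list of floats) per example.
            self.model = svm_train(targets, inputs, self.options)

        def predict(self, inputs):
            # svm_predict takes target labels for accuracy reporting;
            # pass dummies when only predictions are needed.
            labels, _, _ = svm_predict([0] * len(inputs), inputs,
                                       self.model)
            return labels

Keeping such wrappers in a separate plugin package would also isolate
any licensing constraints from the BSD-licensed core, as proposed in
section 2.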