view doc/v2_planning/existing_python_ml_libraries.txt @ 1482:be4a49a65333

modified Nade dataset to use new config.get_filepath_in_roots mechanism
author gdesjardins
date Tue, 05 Jul 2011 10:56:40 -0400
parents f5e9c00a67d7
children

Committee members: GD, DWF, IG, DE

This committee will investigate the possibility of interfacing with and/or
borrowing from existing Python machine learning libraries.
Some questions that we need to answer:

 * How much should we try to interface with other libraries? 
 * What parts can we and should we implement ourselves and what should we leave
   to the other libraries?

Preliminary list of libraries to look at:

 * PyBrain: Razvan
 * MDP: Ian
 * Orange (http://www.ailab.si/orange/): Ian (but could trade)
 * PyML (http://pyml.sourceforge.net/)
 * mlpy (https://mlpy.fbk.eu/): Dumitru
 * APGL (http://packages.python.org/apgl/): Dumitru
 * MontePython (http://montepython.sourceforge.net/): Guillaume (but could trade)
 * Shogun python bindings
 * libsvm python bindings: Ian (but could trade)
 * scikits.learn: Guillaume (but could trade)

Also check out http://scipy.org/Topical_Software#head-fc5493250d285f5c634e51be7ba0f80d5f4d6443
- scipy.org's "topical software" section on Artificial Intelligence and Machine Learning


Email sent by IG to lisa_labo
-----------------------------

The Existing Libraries committee has finished meeting. We have three
sets of recommendations:
1. Recommendations for designing pylearn based on features we like
from other libraries
2. Recommendations for distributing pylearn with other libraries
3. Recommendations for implementations to wrap

1. Features we liked from other libraries include:
- Most major libraries, such as MDP, PyML, scikits.learn, and PyBrain,
offer some way of building a DAG that specifies a feedforward
architecture (MontePython does this and allows backprop as well). We
will probably have a similar structure but with more features on top
of it, such as joint training. One nice feature of MDP is the ability
to visualize this structure in an HTML document.
- Dataset abstractions handled by transformer nodes: Rather than
defining several "views" or "casts" of datasets, most of these
libraries (particularly MDP and scikits.learn) let you put in
whatever kind of data you want, and then have the processing nodes in
your DAG format the data correctly for the later parts of the DAG.
This makes it easy to use several transformed versions of a dataset
(for example, images chopped up into small patches) without pylearn
having to include functionality for all of these possible
transformations.
- MDP and scikits.learn both have a clearly defined inheritance
structure, with a small number of root-level superclasses exposing
most of the functionality of the library through their method
signatures.
- Checkpoints: MDP allows the user to specify arbitrary callbacks to
run at various points during training or processing. This is mainly
designed to let the user save state for crash recovery, but could
have other uses, such as visualizing the evolution of the weights
over time.
- MDP includes an interface for learners to declare that they can learn
in parallel, i.e., the same object can look at different data on
different CPU cores. This is not useful for SGD-based models but could
be nice for PCA/SFA-type models (which is most of what MDP
implements).
- MontePython has humorously named classes, such as the 'gym', which
is the package that contains all of the 'trainers'.
- PyML has support for sparse datasets.
- PyML has an 'aggregate dataset' that can combine other datasets.
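
To make the node, transformer, and checkpoint ideas above concrete,
here is a minimal sketch in plain Python. All names (Node, Flow,
PatchExtractor, MeanCenterer, the checkpoint argument) are hypothetical
and chosen for illustration; they are not meant to reproduce the API of
MDP or any other library, and a linear chain stands in for a general
DAG.

```python
class Node:
    """Hypothetical root class: every step exposes train() and execute()."""

    def train(self, data):
        # Stateless nodes have nothing to learn.
        pass

    def execute(self, data):
        raise NotImplementedError


class PatchExtractor(Node):
    """Chops each row (a flat list) into consecutive fixed-size patches."""

    def __init__(self, patch_size):
        self.patch_size = patch_size

    def execute(self, data):
        patches = []
        for row in data:
            last_start = len(row) - self.patch_size
            for start in range(0, last_start + 1, self.patch_size):
                patches.append(row[start:start + self.patch_size])
        return patches


class MeanCenterer(Node):
    """Learns per-column means in train() and subtracts them in execute()."""

    def __init__(self):
        self.means = None

    def train(self, data):
        n = len(data)
        self.means = [sum(column) / n for column in zip(*data)]

    def execute(self, data):
        return [[x - m for x, m in zip(row, self.means)] for row in data]


class Flow:
    """A linear chain of nodes, the simplest case of a feedforward DAG."""

    def __init__(self, nodes, checkpoint=None):
        self.nodes = nodes
        self.checkpoint = checkpoint  # optional callback(index, node)

    def train(self, data):
        for index, node in enumerate(self.nodes):
            node.train(data)
            if self.checkpoint is not None:
                # A real callback might pickle the node for crash recovery.
                self.checkpoint(index, node)
            data = node.execute(data)  # the next node sees transformed data

    def execute(self, data):
        for node in self.nodes:
            data = node.execute(data)
        return data


trained = []  # the checkpoint callback records which nodes finished training
flow = Flow([PatchExtractor(2), MeanCenterer()],
            checkpoint=lambda index, node: trained.append(type(node).__name__))
images = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
flow.train(images)
centered = flow.execute(images)
```

The dataset itself stays a plain list of rows; the patch extraction
lives in a node, so the core needs no special "patch view" of the data.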

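The remark about parallel learning can be illustrated with a small
sketch. It assumes a learner whose state consists of additive
statistics, so that copies trained on different chunks can be merged
exactly; this is the property that PCA-like models have and that
SGD-based models lack. The class and its fork/join methods are
hypothetical, and the workers run sequentially here for simplicity.

```python
class MeanEstimator:
    """Hypothetical learner whose state is a pair of mergeable sums."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def train(self, chunk):
        self.total += sum(chunk)
        self.count += len(chunk)

    def fork(self):
        # A fresh, empty copy that a worker can train on its own chunk.
        return MeanEstimator()

    def join(self, other):
        # Merging is exact because sums and counts are additive.
        self.total += other.total
        self.count += other.count

    def mean(self):
        return self.total / self.count


chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

master = MeanEstimator()
workers = [master.fork() for _ in chunks]
for worker, chunk in zip(workers, chunks):
    worker.train(chunk)  # each call could run on a separate core
for worker in workers:
    master.join(worker)
parallel_mean = master.mean()

# Reference result from a single learner that sees all the data.
sequential = MeanEstimator()
for chunk in chunks:
    sequential.train(chunk)
sequential_mean = sequential.mean()
```

An SGD learner has no such join step, since each update depends on the
parameters produced by the previous one.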

2. Recommendations for distributing pylearn

pylearn appears to be the most ambitious of all existing python
machine learning projects. There is no established machine learning
library whose scope is broad enough for us to contribute to that
library rather than developing our own.

Some libraries are frequently cited in the literature and
well-respected. One example is libsvm. We should wrap these libraries
so that pylearn users can run experiments with the most
well-established and credible implementations possible.

Wrapping third-party libraries may raise licensing issues.
We expect to release pylearn under the BSD license (so that business
partners such as Ubisoft can use it in shipped products), but much of
the code we want to wrap may be released under the GPL or some other
license that prevents inclusion in a BSD project. We therefore propose
to keep only core functionality in pylearn itself, and put most
implementations of actual algorithms into separate packages. One
package could provide a set of BSD-licensed plugins developed by us or
based on wrapping BSD-licensed third-party libraries, and another
package could provide a set of GPL-licensed plugins developed by
wrapping GPL'ed code.
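
One way to picture this split is a small registry in the core that
holds no algorithm code itself and lets separately distributed packages
register their plugins together with a license tag. The sketch below is
only an illustration under that assumption: PluginRegistry, the plugin
classes, and the license labels are all hypothetical placeholders and
say nothing about the actual license of any real library.

```python
class PluginRegistry:
    """Hypothetical core-side registry; it contains no algorithms itself."""

    def __init__(self):
        self._plugins = {}

    def register(self, name, factory, license_name):
        # Each plugin package calls this when it is imported.
        self._plugins[name] = (factory, license_name)

    def names(self, allowed_licenses=None):
        # List plugins, optionally restricted to a set of licenses.
        return sorted(
            name
            for name, (_, license_name) in self._plugins.items()
            if allowed_licenses is None or license_name in allowed_licenses
        )

    def create(self, name, *args, **kwargs):
        factory, _ = self._plugins[name]
        return factory(*args, **kwargs)


class PlaceholderKMeans:
    """Stands in for an algorithm shipped in a BSD-licensed plugin package."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters


class PlaceholderWrappedSVM:
    """Stands in for a wrapper shipped in a GPL-licensed plugin package."""


registry = PluginRegistry()
registry.register("kmeans", PlaceholderKMeans, "BSD")
registry.register("wrapped_svm", PlaceholderWrappedSVM, "GPL")

# A product that can only ship BSD code restricts itself to BSD plugins.
shippable = registry.names(allowed_licenses={"BSD"})
everything = registry.names()
model = registry.create("kmeans", n_clusters=3)
```

With this layout, a user who cannot accept GPL terms simply does not
install the GPL plugin package, and the core remains unaffected.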

3. Recommendations for implementations to wrap

* shogun:
    * large-scale kernel learning (mostly SVMs). Shogun wraps other
      libraries we should definitely be interested in, such as libsvm
      (because it is well-established) and others that get
      state-of-the-art performance or are good for extremely large
      datasets, etc.
* milk:
    * k-means
    * SVMs with arbitrary python types for kernel arguments
* pybrain:
    * LSTM
* mlpy:
    * feature selection
* mdp:
    * ICA
    * LLE
* scikits.learn:
    * lasso
    * nearest neighbor
    * isomap
    * various metrics
    * mean shift
    * cross validation
    * LDA
    * HMMs
* APGL (Another Python Graph Library):
    * graph similarity functions that could be useful if we want to
      learn with graphs as data