changeset 1207:53937045f6c7

Pasted content of email sent by Ian about existing python ML libraries
author Olivier Delalleau <delallea@iro>
date Tue, 21 Sep 2010 10:58:14 -0400
parents 203569655069
children 0186805a93e7
files doc/v2_planning/existing_python_ml_libraries.txt
diffstat 1 files changed, 101 insertions(+), 0 deletions(-)
--- a/doc/v2_planning/existing_python_ml_libraries.txt	Tue Sep 21 10:51:38 2010 -0400
+++ b/doc/v2_planning/existing_python_ml_libraries.txt	Tue Sep 21 10:58:14 2010 -0400
@@ -23,3 +23,104 @@
 
 Also check out http://scipy.org/Topical_Software#head-fc5493250d285f5c634e51be7ba0f80d5f4d6443
 - scipy.org's ``topical software'' section on Artificial Intelligence and Machine Learning
+
+
+Email sent by IG to lisa_labo
+-----------------------------
+
+The Existing Libraries committee has finished meeting. We have three
+sets of recommendations:
+1. Recommendations for designing pylearn based on features we like
+from other libraries
+2. Recommendations for distributing pylearn with other libraries
+3. Recommendations for implementations to wrap
+
+1. Features we liked from other libraries include:
+-Most major libraries such as MDP, PyML, scikit.learn, and pybrain
+offer some way of making a DAG that specifies a feedforward
+architecture (Monte Python does this and allows backprop as well). We
+will probably have a similar structure but with more features on top
+of it, such as joint training. One nice feature of MDP is the ability
+to visualize this structure in an HTML document.
+-Dataset abstractions handled by transformer nodes: Rather than
+defining several "views" or "casts" of datasets, most of these
+libraries (particularly mdp and scikit.learn) allow you to put in
+whatever kind of data you want, and then have your processing nodes in
+your DAG format the data correctly for the later parts of the DAG.
+This makes it easy to use several transformed versions of the dataset
+(like chopping images up into small patches) without pylearn having to
+include functionality for all of these possible transformations.
+-mdp and scikit.learn both have a clearly defined inheritance
+structure, with a small number of root-level superclasses exposing
+most of the functionality of the library through their method
+signatures.
+-checkpoints: mdp allows the user to specify arbitrary callbacks to
+run at various points during training or processing. This is mainly
+designed for the user to be able to save state for crash recovery
+purposes, but could have other uses like visualizing the evolution of
+the weights over time.
+-mdp includes an interface for learners to declare that they can learn
+in parallel, i.e. the same object can look at different data on
+different CPU cores. This is not useful for SGD-based models but could
+be nice for PCA/SFA-type models (which is most of what mdp
+implements).
+-Monte Python has humorously named classes, such as the 'gym', which
+is the package that contains all of the 'trainers'.
+-pyml has support for sparse datasets
+-pyml has an 'aggregate dataset' that can combine other datasets
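The DAG-of-transformer-nodes and checkpoint ideas above can be sketched as a
minimal Python illustration. All class names and signatures here are
hypothetical, invented for this sketch; they are not the actual API of MDP,
scikit.learn, or pylearn.

```python
# Sketch: a feedforward chain of transformer nodes with checkpoint
# callbacks, in the spirit of MDP flows and scikit.learn pipelines.
# Every name below is a hypothetical illustration, not a real API.

class Node:
    """Root-level superclass: one fit/transform signature for everything."""
    def fit(self, data):
        return self          # stateless nodes simply ignore fitting

    def transform(self, data):
        raise NotImplementedError


class PatchExtractor(Node):
    """Transformer node: chops each 'image' (a flat list) into fixed-size
    patches, so later nodes never need a special 'patched' dataset view."""
    def __init__(self, patch_size):
        self.patch_size = patch_size

    def transform(self, data):
        return [img[i:i + self.patch_size]
                for img in data
                for i in range(0, len(img), self.patch_size)]


class MeanCenter(Node):
    """Trainable node: learns the global mean during fit, subtracts it."""
    def fit(self, data):
        flat = [x for row in data for x in row]
        self.mean = sum(flat) / len(flat)
        return self

    def transform(self, data):
        return [[x - self.mean for x in row] for row in data]


class Flow(Node):
    """Feedforward chain; an optional callback runs after each node
    finishes training, e.g. to save state for crash recovery."""
    def __init__(self, nodes, checkpoint=None):
        self.nodes = nodes
        self.checkpoint = checkpoint

    def fit(self, data):
        for i, node in enumerate(self.nodes):
            node.fit(data)
            data = node.transform(data)
            if self.checkpoint is not None:
                self.checkpoint(i, node)
        return self

    def transform(self, data):
        for node in self.nodes:
            data = node.transform(data)
        return data


flow = Flow([PatchExtractor(patch_size=2), MeanCenter()],
            checkpoint=lambda i, node: print("trained node", i))
patches = flow.fit([[1.0, 2.0, 3.0, 4.0]]).transform([[1.0, 2.0, 3.0, 4.0]])
```

The point of the sketch is the shape, not the algorithms: a small set of
root-level superclasses carrying the whole interface, data transformations
living inside the DAG rather than in dataset "views", and checkpoints as
arbitrary callbacks.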
+
+
+2. Recommendations for distributing pylearn
+
+pylearn appears to be the most ambitious of all existing python
+machine learning projects. There is no established machine learning
+library whose scope is broad enough for us to contribute to that
+library rather than developing our own.
+
+Some libraries are frequently cited in the literature and
+well-respected. One example is libsvm. We should wrap these libraries
+so that pylearn users can run experiments with the most
+well-established and credible implementations possible.
+
+Wrapping 3rd party libraries may present some issues with licensing.
+We expect to release pylearn under the BSD license (so that business
+partners such as Ubisoft can use it in shipped products), but much of
+the code we want to wrap may be released under the GPL or some other
+license that prevents inclusion in a BSD project. We therefore propose
+to keep only core functionality in pylearn itself, and put most
+implementations of actual algorithms into separate packages. One
+package could provide a set of BSD licensed plugins developed by us or
+based on wrapping BSD licensed 3rd party libraries, and another
+package could provide a set of GPL licensed plugins developed by
+wrapping GPL'ed code.
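One way the core/plugin split might look, as a rough sketch: the registry,
the decorator, and the package names below are assumptions made for
illustration, not a proposed pylearn API.

```python
# Hypothetical sketch of the license split: a BSD-licensed core ships
# only a registry plus lookup, while algorithm wrappers (BSD ones in one
# plugin package, GPL ones in another) register themselves at import
# time.  All names here are illustrative only.

_REGISTRY = {}

def register(name, license):
    """Decorator a plugin package uses to expose an implementation."""
    def wrap(cls):
        _REGISTRY[name] = (cls, license)
        return cls
    return wrap

def get_learner(name, allow_gpl=True):
    """Core lookup; lets BSD-only deployments refuse GPL plugins."""
    cls, lic = _REGISTRY[name]
    if lic == "GPL" and not allow_gpl:
        raise LookupError("%s is GPL-licensed" % name)
    return cls()

# --- would live in a separate 'pylearn_gpl_plugins' package ---
@register("svm", license="GPL")
class LibSVMWrapper:
    def fit(self, data):
        return self   # a real wrapper would call into libsvm here
```

With this layout, `get_learner("svm")` works for users who accept GPL
dependencies, while a product build can pass `allow_gpl=False` and be
guaranteed that only BSD-compatible code is loaded.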
+
+3. Recommendations for implementations to wrap
+
+shogun:
+      large-scale kernel learning (mostly SVMs). This wraps other
+libraries we should definitely be interested in, such as libsvm
+(because it is well-established) and others that achieve
+state-of-the-art performance or scale well to extremely large
+datasets.
+milk:
+      k-means
+      SVMs with arbitrary Python types for kernel arguments
+pybrain:
+      LSTM
+mlpy:
+      feature selection
+mdp:
+      ICA
+      LLE
+scikit.learn:
+      lasso
+      nearest neighbor
+      isomap
+      various metrics
+      mean shift
+      cross validation
+      LDA
+      HMMs
+Yet Another Python Graph Library:
+      graph similarity functions that could be useful if we want to
+learn with graphs as data
+