view doc/v2_planning/existing_python_ml_libraries.txt @ 1408:2993b2a5c1af
allow to load UTLC transfer label data.
author | Frederic Bastien <nouiz@nouiz.org> |
---|---|
date | Fri, 28 Jan 2011 11:00:11 -0500 |
parents | f5e9c00a67d7 |
children |
Committee members: GD, DWF, IG, DE

This committee will investigate the possibility of interfacing with and/or
borrowing from other existing Python machine learning libraries. Some
questions that we need to answer:

 * How much should we try to interface with other libraries?
 * What parts can and should we implement ourselves, and what should we
   leave to the other libraries?

Preliminary list of libraries to look at:

 * Pybrain: Razvan
 * MDP: Ian
 * Orange (http://www.ailab.si/orange/): Ian (but could trade)
 * PyML (http://pyml.sourceforge.net/)
 * mlpy (https://mlpy.fbk.eu/): Dumitru
 * APGL (http://packages.python.org/apgl/): Dumitru
 * MontePython (http://montepython.sourceforge.net/): Guillaume (but could
   trade)
 * Shogun python bindings
 * libsvm python bindings: Ian (but could trade)
 * scikits.learn: Guillaume (but could trade)

Also check out scipy.org's "topical software" section on Artificial
Intelligence and Machine Learning:
http://scipy.org/Topical_Software#head-fc5493250d285f5c634e51be7ba0f80d5f4d6443

Email sent by IG to lisa_labo
-----------------------------

The Existing Libraries committee has finished meeting. We have three sets of
recommendations:

1. Recommendations for designing pylearn based on features we like from
   other libraries
2. Recommendations for distributing pylearn with other libraries
3. Recommendations for implementations to wrap

1. Features we liked from other libraries include:

-Most major libraries such as MDP, PyML, scikits.learn, and PyBrain offer
 some way of making a DAG that specifies a feedforward architecture
 (MontePython does this and allows backprop as well). We will probably have
 a similar structure but with more features on top of it, such as joint
 training. One nice feature of MDP is the ability to visualize this
 structure in an HTML document.
-Dataset abstractions handled by transformer nodes: Rather than defining
 several "views" or "casts" of datasets, most of these libraries
 (particularly MDP and scikits.learn) allow you to put in whatever kind of
 data you want, and then have the processing nodes in your DAG format the
 data correctly for the later parts of the DAG. This makes it easy to use
 several transformed versions of the dataset (like chopping images up into
 small patches) without pylearn having to include functionality for all of
 these possible transformations.
-MDP and scikits.learn both have a clearly defined inheritance structure,
 with a small number of root-level superclasses exposing most of the
 functionality of the library through their method signatures.
-Checkpoints: MDP allows the user to specify arbitrary callbacks to run at
 various points during training or processing. This is mainly designed for
 the user to be able to save state for crash recovery purposes, but could
 have other uses like visualizing the evolution of the weights over time.
-MDP includes an interface for learners to declare that they can learn in
 parallel, i.e. the same object can look at different data on different CPU
 cores. This is not useful for SGD-based models but could be nice for
 PCA/SFA type models (which is most of what MDP implements).
-MontePython has humorously named classes, such as the 'gym', which is the
 package that contains all of the 'trainers'.
-PyML has support for sparse datasets.
-PyML has an 'aggregate dataset' that can combine other datasets.

2. Recommendations for distributing pylearn

pylearn appears to be the most ambitious of all existing Python machine
learning projects. There is no established machine learning library whose
scope is broad enough for us to contribute to that library rather than
developing our own.

Some libraries are frequently cited in the literature and well-respected.
One example is libsvm.
We should wrap these libraries so that pylearn users can run experiments with
the most well-established and credible implementations possible.

Wrapping 3rd party libraries may present some issues with licensing. We
expect to release pylearn under the BSD license (so that business partners
such as Ubisoft can use it in shipped products), but much of the code we
want to wrap may be released under the GPL or some other license that
prevents inclusion in a BSD project. We therefore propose to keep only core
functionality in pylearn itself, and put most implementations of actual
algorithms into separate packages. One package could provide a set of
BSD-licensed plugins developed by us or based on wrapping BSD-licensed 3rd
party libraries, and another package could provide a set of GPL-licensed
plugins developed by wrapping GPL'ed code.

3. Recommendations for implementations to wrap

 * shogun:
     * large scale kernel learning (mostly SVMs). This wraps other libraries
       we should definitely be interested in, such as libsvm (because it is
       well-established) and others that get state-of-the-art performance
       or are good for extremely large datasets, etc.
 * milk:
     * k-means
     * SVMs with arbitrary python types for kernel arguments
 * pybrain:
     * LSTM
 * mlpy:
     * feature selection
 * mdp:
     * ICA
     * LLE
 * scikits.learn:
     * lasso
     * nearest neighbor
     * isomap
     * various metrics
     * mean shift
     * cross validation
     * LDA
     * HMMs
 * Yet Another Python Graph Library:
     * graph similarity functions that could be useful if we want to learn
       with graphs as data
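
Illustration (not part of the committee's email): the transformer-node and
checkpoint ideas from section 1 can be sketched in a few lines of plain
Python. Every name below (Node, PatchNode, CenterNode, Flow, checkpoint) is
hypothetical and is not the API of MDP, scikits.learn, or pylearn. A linear
chain stands in for a general DAG, and 1-D lists stand in for images.

```python
# Hypothetical sketch: these classes are illustrative assumptions only,
# not code from pylearn, MDP, or scikits.learn.


class Node:
    """One processing step; subclasses override train() and execute()."""

    def train(self, data):
        pass  # stateless nodes have nothing to learn

    def execute(self, data):
        raise NotImplementedError


class PatchNode(Node):
    """Formats data for later nodes: cuts each row into fixed-width patches."""

    def __init__(self, width):
        self.width = width

    def execute(self, data):
        patches = []
        for row in data:
            for start in range(0, len(row) - self.width + 1, self.width):
                patches.append(row[start:start + self.width])
        return patches


class CenterNode(Node):
    """Learns the per-column mean in train() and subtracts it in execute()."""

    def __init__(self):
        self.mean = None

    def train(self, data):
        n = len(data)
        self.mean = [sum(col) / n for col in zip(*data)]

    def execute(self, data):
        return [[x - m for x, m in zip(row, self.mean)] for row in data]


class Flow:
    """A linear chain of nodes, the simplest case of a feedforward DAG."""

    def __init__(self, nodes, checkpoint=None):
        self.nodes = nodes
        self.checkpoint = checkpoint  # user callback: checkpoint(index, node)

    def train(self, data):
        for index, node in enumerate(self.nodes):
            node.train(data)
            if self.checkpoint is not None:
                self.checkpoint(index, node)
            data = node.execute(data)  # feed transformed data to the next node
        return data

    def execute(self, data):
        for node in self.nodes:
            data = node.execute(data)
        return data


trained = []  # a real checkpoint might save the node for crash recovery
flow = Flow([PatchNode(width=2), CenterNode()],
            checkpoint=lambda index, node: trained.append(type(node).__name__))
raw = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
flow.train(raw)
result = flow.execute(raw)
```

The dataset itself stays untouched: the patch extraction lives in a node, so
the same raw data can feed several differently configured chains. The
checkpoint callback here only records which node was trained, but it is the
hook where state saving or weight visualization would go.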