comparison doc/v2_planning/existing_python_ml_libraries.txt @ 1207:53937045f6c7

Pasted content of email sent by Ian about existing python ML libraries
author Olivier Delalleau <delallea@iro>
date Tue, 21 Sep 2010 10:58:14 -0400
* libsvm python bindings Ian (but could trade)
* scikits.learn Guillaume (but could trade)

Also check out http://scipy.org/Topical_Software#head-fc5493250d285f5c634e51be7ba0f80d5f4d6443
- scipy.org's ``topical software'' section on Artificial Intelligence and Machine Learning


Email sent by IG to lisa_labo
-----------------------------

The Existing Libraries committee has finished meeting. We have three
sets of recommendations:
1. Recommendations for designing pylearn based on features we like
from other libraries
2. Recommendations for distributing pylearn with other libraries
3. Recommendations for implementations to wrap

1. Features we liked from other libraries include:
-Most major libraries, such as MDP, PyML, scikit.learn, and pybrain,
offer some way of making a DAG that specifies a feedforward
architecture (Monte Python does this and allows backprop as well). We
will probably have a similar structure but with more features on top
of it, such as joint training. One nice feature of MDP is the ability
to visualize this structure in an HTML document.
-Dataset abstractions handled by transformer nodes: rather than
defining several "views" or "casts" of datasets, most of these
libraries (particularly MDP and scikit.learn) let you put in
whatever kind of data you want, and then have the processing nodes in
your DAG format the data correctly for the later parts of the DAG.
This makes it easy to use several transformed versions of the dataset
(like chopping images up into small patches) without pylearn having to
include functionality for all of these possible transformations.
-MDP and scikit.learn both have a clearly defined inheritance
structure, with a small number of root-level superclasses exposing
most of the functionality of the library through their method
signatures.
-Checkpoints: MDP allows the user to specify arbitrary callbacks to
run at various points during training or processing. This is mainly
designed to let the user save state for crash recovery
purposes, but it could have other uses, like visualizing the evolution
of the weights over time.
-MDP includes an interface for learners to declare that they can learn
in parallel, i.e. the same object can look at different data on
different CPU cores. This is not useful for SGD-based models but could
be nice for PCA/SFA-type models (which is most of what MDP
implements).
-Monte Python has humorously named classes, such as the 'gym', which
is the package that contains all of the 'trainers'.
-PyML has support for sparse datasets.
-PyML has an 'aggregate dataset' that can combine other datasets.
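The "DAG of transformer nodes" and checkpoint-callback ideas above can be sketched in a few lines. This is a hypothetical illustration, not MDP's or scikit.learn's actual API: the `Node`, `Flow`, `PatchExtractor`, and `MeanCenter` names are invented here, and the real libraries' interfaces differ.

```python
# Hypothetical sketch of a linear DAG of transformer nodes with an
# optional checkpoint callback; names are invented, not any real API.

class Node:
    """Base class: a node transforms data, and may optionally learn."""
    def train(self, data):
        pass  # stateless nodes have nothing to learn

    def execute(self, data):
        raise NotImplementedError

class PatchExtractor(Node):
    """Stateless transformer: chop each 'image' (a flat list) into patches."""
    def __init__(self, patch_size):
        self.patch_size = patch_size

    def execute(self, data):
        n = self.patch_size
        return [img[i:i + n] for img in data for i in range(0, len(img), n)]

class MeanCenter(Node):
    """Learning node: subtract the mean value seen during training."""
    def __init__(self):
        self.mean = 0.0

    def train(self, data):
        flat = [x for row in data for x in row]
        self.mean = sum(flat) / len(flat)

    def execute(self, data):
        return [[x - self.mean for x in row] for row in data]

class Flow:
    """Train nodes in order, feeding each the output of its predecessors;
    a checkpoint callback (e.g. for crash recovery) runs after each node."""
    def __init__(self, nodes, checkpoint=None):
        self.nodes = nodes
        self.checkpoint = checkpoint

    def train(self, data):
        for node in self.nodes:
            node.train(data)
            data = node.execute(data)
            if self.checkpoint is not None:
                self.checkpoint(node)  # e.g. pickle node state to disk
        return data

    def execute(self, data):
        for node in self.nodes:
            data = node.execute(data)
        return data

images = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
flow = Flow([PatchExtractor(2), MeanCenter()],
            checkpoint=lambda node: None)
flow.train(images)
patches = flow.execute(images)
```

The point of the pattern is that pylearn's core never needs a "patch view" of a dataset: patch extraction is just another node, and later nodes see whatever format their predecessors emit.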


2. Recommendations for distributing pylearn

pylearn appears to be the most ambitious of all existing python
machine learning projects. There is no established machine learning
library whose scope is broad enough for us to contribute to that
library rather than developing our own.

Some libraries are frequently cited in the literature and
well-respected. One example is libsvm. We should wrap these libraries
so that pylearn users can run experiments with the most
well-established and credible implementations possible.

Wrapping 3rd-party libraries may present some issues with licensing.
We expect to release pylearn under the BSD license (so that business
partners such as Ubisoft can use it in shipped products), but much of
the code we want to wrap may be released under the GPL or some other
license that prevents inclusion in a BSD project. We therefore propose
to keep only core functionality in pylearn itself, and to put most
implementations of actual algorithms into separate packages. One
package could provide a set of BSD-licensed plugins developed by us or
based on wrapping BSD-licensed 3rd-party libraries, and another
package could provide a set of GPL-licensed plugins developed by
wrapping GPL'ed code.
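One way the package split could work in practice is for core pylearn to define only a plugin registry and discover whichever license-specific packages happen to be installed. A minimal sketch, assuming hypothetical package names ("pylearn_bsd_plugins", "pylearn_gpl_plugins") that are not real distributions:

```python
# Sketch of optional plugin-package discovery; the package names below
# are hypothetical stand-ins for the proposed BSD and GPL plugin sets.

import importlib

PLUGIN_PACKAGES = ["pylearn_bsd_plugins", "pylearn_gpl_plugins"]

def load_plugins():
    """Return a {name: learner_class} dict from every installed
    plugin package.

    Because the GPL'ed wrappers live in a separate, optional package,
    the core (BSD) distribution never ships GPL code itself: users who
    want those wrappers install the GPL package alongside pylearn.
    """
    registry = {}
    for pkg_name in PLUGIN_PACKAGES:
        try:
            pkg = importlib.import_module(pkg_name)
        except ImportError:
            continue  # that plugin package is simply not installed
        # assume each plugin package exposes a LEARNERS dict
        registry.update(getattr(pkg, "LEARNERS", {}))
    return registry

learners = load_plugins()  # empty if no plugin package is present
```

The design choice here is that licensing is resolved at install time rather than in the codebase: the import boundary is the license boundary.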

3. Recommendations for implementations to wrap

shogun:
  large-scale kernel learning (mostly SVMs). This wraps other
  libraries we should definitely be interested in, such as libsvm
  (because it is well-established) and others that get state-of-the-art
  performance or are good for extremely large datasets, etc.
milk:
  k-means
  SVMs with arbitrary python types for kernel arguments
pybrain:
  LSTM
mlpy:
  feature selection
mdp:
  ICA
  LLE
scikit.learn:
  lasso
  nearest neighbor
  isomap
  various metrics
  mean shift
  cross-validation
  LDA
  HMMs
Yet Another Python Graph Library:
  graph similarity functions that could be useful if we want to
  learn with graphs as data
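Whatever gets wrapped, each wrapper would likely be a thin adapter that translates the third party's calling conventions into a single pylearn-style interface. A sketch of that pattern, using an invented `ThirdPartySVM` stub in place of a real binding (libsvm's actual python API differs, so only the adapter shape is the point here):

```python
# Adapter-pattern sketch for wrapping a third-party learner behind a
# common interface; all names here are hypothetical, not a real API.

class ThirdPartySVM:
    """Stub for an external library with its own calling conventions."""
    def __init__(self):
        self.model = None

    def do_training(self, labels, vectors, options):
        # stand-in for real training: memorize the majority label
        self.model = max(set(labels), key=labels.count)

    def do_prediction(self, vectors):
        return [self.model for _ in vectors]

class Learner:
    """Hypothetical common pylearn-style interface."""
    def train(self, inputs, targets):
        raise NotImplementedError

    def predict(self, inputs):
        raise NotImplementedError

class WrappedSVM(Learner):
    """Adapter: map the common interface onto the wrapped library,
    hiding its argument order and option-string conventions."""
    def __init__(self, options=""):
        self._impl = ThirdPartySVM()
        self.options = options

    def train(self, inputs, targets):
        self._impl.do_training(targets, inputs, self.options)

    def predict(self, inputs):
        return self._impl.do_prediction(inputs)

svm = WrappedSVM()
svm.train([[0.0], [1.0], [1.1]], [0, 1, 1])
preds = svm.predict([[0.5]])
```

Because every wrapper presents the same `train`/`predict` surface, experiments can swap a libsvm-backed model for a shogun- or milk-backed one without changing the surrounding code.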