view doc/v2_planning/architecture.txt @ 1166:aee22eb2c117

added index file for the v2_planning doc.
author Frederic Bastien <nouiz@nouiz.org>
date Fri, 17 Sep 2010 13:57:27 -0400
parents ae5ba6206fd3
children f111f8c2a280
line wrap: on
line source

====================
Pylearn Architecture
====================


SE + VM Approach
=================

One avenue for the basic design of the library is to follow the Symbolic
Expression (SE) structure + virtual machine (VM) pattern that worked for Theano.

The main things for the library to provide would be:

- a few VMs, some of which can run programs in parallel across processors,
  hosts, and networks [R6,R8];

- MLA components as either individual Expressions (similar to Ops) or as
  subgraphs of SEs [R5,R7,R10,R11]

- machine learning algorithms including their training and testing in the form
  of python functions that build SE graphs.[R1,R8].

This design addresses R2 (modularity) because swapping components is literally implemented by
swapping subgraphs.

The design addresses R9 (algorithmic efficiency) because we can write
Theano-style graph transformations to recognize special cases of component
combinations.

The design addresses R3 if we make the additional decision that the VMs (at
least sometimes) cache the return value of program function calls.  This cache
serves as a database of experimental results, indexed by the functions that
originally computed them.  I think this is a very natural scheme for organizing
experiment results, and ensuring experiment reproducibility [R1].
At the same time, this is a clean and simple API behind which experiments can be
saved using a number of database technologies.

APIs vs. lambda
----------------

Modularity in general is achieved when pieces can be substituted one for the
other.

In an object-oriented design, modularity is achieved by agreeing on interface
APIs, but in a functional design there is another possibility: the lambda.

In an SE these pieces are expression [applications] and the subgraphs they form.
A subgraph is characterized syntactically within the program by its arguments
and its return values.  A lambda function allows the User to create new
Expression types from arbitrary subgraphs with very few keystrokes.  When a
lambda is available and easy to use, there is much less pressure on the
expression library to follow calling and return conventions strictly.

Of course, the closer are two subgraphs in terms of their inputs, outputs, and
semantics, the easier it is to substitute one for the other.  As library
designers, we should still aim for compatibility of similar algorithms.  It's
just not essential to choose an API that will guarantee a match, or indeed to
choose any explicit API at all.

YB: I agree that lambdas are more flexible, but from the user's point of
view it is really important to know what can swap with what, so that they
can easily plug-and-play. So even if informal, something in the spirit
of an API must be described somewhere, and components should declare
either formally or through comments what functionality 'type' 
they can take on.

Encapsulation vs. linearity
---------------------------

A while ago, the Apstat crew went to fight "encapsulation" to propose instead
a more "linearized" approach to experiment design. I must admit I didn't
really understand the deep motivations behind this, and after practicing both
styles (encapsulation for PLearn / Theano, linearity @ ARL / Ubisoft), I still
don't. I do find, however, some not-so-deep-but-still-significant advantages
to the linear version, which hopefully can be made clear (along with a
clarification of what the h*** am I talking about) in the following example:

   * Linear version:
    my_experiment = pipeline([
        data,
        filter_samples,
        PCA,
        k_fold_split,
        neural_net,
        evaluation,
    ])

   * Encapsulated version:
    my_experiment = evaluation(
        data=PCA(filter_samples(data)),
        split=k_fold_split,
        model=neural_net)

What I like in the linear version is it is much more easily human-readable
(once you know what it means): you just follow the flow of the experiment by
reading through a single list.
On the other hand, the encapsulated version requires some deeper analysis to
understand what is going on and in which order.
Also, commenting out parts of the processing is simpler in the first case (it
takes a single # in front of an element).
However, linearity tends to break when the experiment is actually not linear,
i.e. the graph of object dependencies is more complex (*).

I'm just bringing this up because it may be nice to be able to provide the
user with the most intuitive way to design experiments. I actually don't think
those approaches are mutually exclusive, and it could be possible for the
underlying system to use the more flexible / powerful encapsulated
representation, while having the option to write simple scripts in a form that
is easier to understand and manipulate.

It could also be worth discussing this issue with Xavier / Christian /
Nicolas.

(*) Note that I cheated a bit in my example above: the graph from the
encapsulated version is not a simple chain, so it is not obvious how to
convert it into the pipeline given in the linear version. It's still possible
though, but this is probably not the place to get into the details.

RP comment : The way I see it, you could always have everything using the
encapsulation paradigm ( which as you pointed out is a bit more powerful) and
then have linear shortcuts ( functions that take a list of functions and some
inputs and apply them in some order). You will not be able to have a one case
cover all pipeline function, but I think it is sufficient to offer such
options (linear functions) for a few widely used cases ..


Jobman Compatibility Approach
=============================

One basic approach for the library is to provide a set of components that are
compatible with remote execution.  The emphasis could be not so much on
standardizing the roles and APIs of components, so much as ensuring that they
can be glued together and supports parallel execution on one or more CPUs or
clusters.

In this approach we would provide a proxy for asynchronous execution
(e.g. "pylearn.call(fn, args, kwargs, backend=default_backend)"), which would
come with constraints on what fn, args, and kwargs can be.  Specifically, they
must be picklable, and there are benefits (e.g. automatic function call caching)
associated with them being hashable as well.