view doc/v2_planning/architecture.txt @ 1201:46527ae6db53

architecture: Clarified what I meant about saving the model
author Olivier Delalleau <delallea@iro>
date Mon, 20 Sep 2010 17:05:15 -0400
parents 9ff2242a817b
children b9d0a326e3e7
line wrap: on
line source

====================
Pylearn Architecture
====================


SE + VM Approach
=================

One avenue for the basic design of the library is to follow the Symbolic
Expression (SE) structure + virtual machine (VM) pattern that worked for Theano.

The main things for the library to provide would be:

- a few VMs, some of which can run programs in parallel across processors,
  hosts, and networks [R6,R8];

- MLA components as either individual Expressions (similar to Ops) or as
  subgraphs of SEs [R5,R7,R10,R11]

- machine learning algorithms including their training and testing in the form
  of python functions that build SE graphs.[R1,R8].

This design addresses R2 (modularity) because swapping components is literally implemented by
swapping subgraphs.

The design addresses R9 (algorithmic efficiency) because we can write
Theano-style graph transformations to recognize special cases of component
combinations.

The design addresses R3 if we make the additional decision that the VMs (at
least sometimes) cache the return value of program function calls.  This cache
serves as a database of experimental results, indexed by the functions that
originally computed them.  I think this is a very natural scheme for organizing
experiment results, and ensuring experiment reproducibility [R1].
At the same time, this is a clean and simple API behind which experiments can be
saved using a number of database technologies.

APIs vs. lambda
----------------

Modularity in general is achieved when pieces can be substituted one for the
other.

In an object-oriented design, modularity is achieved by agreeing on interface
APIs, but in a functional design there is another possibility: the lambda.

In an SE these pieces are expression [applications] and the subgraphs they form.
A subgraph is characterized syntactically within the program by its arguments
and its return values.  A lambda function allows the User to create new
Expression types from arbitrary subgraphs with very few keystrokes.  When a
lambda is available and easy to use, there is much less pressure on the
expression library to follow calling and return conventions strictly.

Of course, the closer are two subgraphs in terms of their inputs, outputs, and
semantics, the easier it is to substitute one for the other.  As library
designers, we should still aim for compatibility of similar algorithms.  It's
just not essential to choose an API that will guarantee a match, or indeed to
choose any explicit API at all.

YB: I agree that lambdas are more flexible, but from the user's point of
view it is really important to know what can swap with what, so that they
can easily plug-and-play. So even if informal, something in the spirit
of an API must be described somewhere, and components should declare
either formally or through comments what functionality 'type' 
they can take on.

Encapsulation vs. linearity
---------------------------

A while ago, the Apstat crew went to fight "encapsulation" to propose instead
a more "linearized" approach to experiment design. I must admit I didn't
really understand the deep motivations behind this, and after practicing both
styles (encapsulation for PLearn / Theano, linearity @ ARL / Ubisoft), I still
don't. I do find, however, some not-so-deep-but-still-significant advantages
to the linear version, which hopefully can be made clear (along with a
clarification of what the h*** am I talking about) in the following example:

   * Linear version:

.. code-block:: python

    my_experiment = pipeline([
        data,
        filter_samples,
        PCA,
        k_fold_split,
        neural_net,
        evaluation,
    ])

   * Encapsulated version:

.. code-block:: python

    my_experiment = evaluation(
        data=PCA(filter_samples(data)),
        split=k_fold_split,
        model=neural_net)

What I like in the linear version is it is much more easily human-readable
(once you know what it means): you just follow the flow of the experiment by
reading through a single list.
On the other hand, the encapsulated version requires some deeper analysis to
understand what is going on and in which order.
Also, commenting out parts of the processing is simpler in the first case (it
takes a single # in front of an element).
However, linearity tends to break when the experiment is actually not linear,
i.e. the graph of object dependencies is more complex (*).

I'm just bringing this up because it may be nice to be able to provide the
user with the most intuitive way to design experiments. I actually don't think
those approaches are mutually exclusive, and it could be possible for the
underlying system to use the more flexible / powerful encapsulated
representation, while having the option to write simple scripts in a form that
is easier to understand and manipulate.

It could also be worth discussing this issue with Xavier / Christian /
Nicolas.

(*) Note that I cheated a bit in my example above: the graph from the
encapsulated version is not a simple chain, so it is not obvious how to
convert it into the pipeline given in the linear version. It's still possible
though, but this is probably not the place to get into the details.

RP comment : The way I see it, you could always have everything using the
encapsulation paradigm ( which as you pointed out is a bit more powerful) and
then have linear shortcuts ( functions that take a list of functions and some
inputs and apply them in some order). You will not be able to have a one case
cover all pipeline function, but I think it is sufficient to offer such
options (linear functions) for a few widely used cases ..


Jobman Compatibility Approach
=============================

One basic approach for the library is to provide a set of components that are
compatible with remote execution.  The emphasis could be not so much on
standardizing the roles and APIs of components, so much as ensuring that they
can be glued together and supports parallel execution on one or more CPUs or
clusters.

In this approach we would provide a proxy for asynchronous execution
(e.g. "pylearn.call(fn, args, kwargs, backend=default_backend)"), which would
come with constraints on what fn, args, and kwargs can be.  Specifically, they
must be picklable, and there are benefits (e.g. automatic function call caching)
associated with them being hashable as well.


Benchmark
=========

During the general meeting on sept. 17th, we agreed to produce at least pseudo-code (if
possible, actual code) for the following model:
A Deep Belief Net (with greedy layerwise pre-training, and supervised
fine-tuning), with preprocessing of the data, double cross-validation, and
save/load of the model.

The different approach to be tested are:
    - Plugins with a global scheduler driving the experiment (Razvan's team)
    - Objects, with basic hooks at predefined places (Pascal L.'s team)
    - Existing objects and code (including dbi and Jobman), with some more
      pieces to tie things together (Fred B.)

OD comments: We were in a hurry to close the meeting and I did not have time
to really explain what I meant when I suggested we should add the requirement
of saving the final "best" model. What I had in mind is a typical "applied ML"
experiment, i.e. the following approach that hopefully can be understood just
by writing it down in the form of a processing pipeline. The double cross
validation step, whose goal is to obtain an estimate of the generalization
error of our final model, is:
    data -> k_fold_outer(preprocessing -> k_fold_inner(dbn -> evaluate) -> select_best -> retrain_on_all_data -> evaluate) -> evaluate
Once this is done, the model we want to save is obtained by doing
    data -> preprocessing -> k_fold(dbn -> evaluate) -> select_best -> retrain_on_all_data
and we save
    preprocessing -> best_model_selected