view doc/v2_planning/architecture.txt @ 1157:9686c0d9689d

Quick implementation of the Dataset Api we propose.
author Arnaud Bergeron <abergeron@gmail.com>
date Fri, 17 Sep 2010 12:01:12 -0400
parents 4eda3f52ebef
children e5306f5626d4
line wrap: on
line source

====================
Pylearn Architecture
====================


Basic Design Approach
=====================

I propose that the basic design of the library follow the Symbolic Expression
(SE) structure + virtual machine (VM) pattern that worked for Theano.

So the main things for the library to provide would be:

- a few VMs, some of which can run programs in parallel across processors,
  hosts, and networks [R6,R8];

- MLA components as either individual Expressions (similar to Ops) or as
  subgraphs of SEs [R5,R7,R10,R11]

- machine learning algorithms including their training and testing in the form
  of python functions that build SE graphs.[R1,R8].

This design addresses R2 (modularity) because swapping components is literally implemented by
swapping subgraphs.

The design addresses R9 (algorithmic efficiency) because we can write
Theano-style graph transformations to recognize special cases of component
combinations.

The design addresses R3 if we make the additional decision that the VMs (at
least sometimes) cache the return value of program function calls.  This cache
serves as a database of experimental results, indexed by the functions that
originally computed them.  I think this is a very natural scheme for organizing
experiment results, and ensuring experiment reproducibility [R1].
At the same time, this is a clean and simple API behind which experiments can be
saved using a number of database technologies.

APIs vs. lambda
----------------

Modularity in general is achieved when pieces can be substituted one for the
other.

In an object-oriented design, modularity is achieved by agreeing on interface
APIs, but in a functional design there is another possibility: the lambda.

In an SE these pieces are expression [applications] and the subgraphs they form.
A subgraph is characterized syntactically within the program by its arguments
and its return values.  A lambda function allows the User to create new
Expression types from arbitrary subgraphs with very few keystrokes.  When a
lambda is available and easy to use, there is much less pressure on the
expression library to follow calling and return conventions strictly.

Of course, the closer are two subgraphs in terms of their inputs, outputs, and
semantics, the easier it is to substitute one for the other.  As library
designers, we should still aim for compatibility of similar algorithms.  It's
just not essential to choose an API that will guarantee a match, or indeed to
choose any explicit API at all.