--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/architecture.txt	Mon Sep 13 09:42:36 2010 -0400
@@ -0,0 +1,60 @@
+====================
+Pylearn Architecture
+====================
+
+
+Basic Design Approach
+=====================
+
+I propose that the basic design of the library follow the Symbolic Expression
+(SE) structure + virtual machine (VM) pattern that worked for Theano.
+
+So the main things for the library to provide would be:
+
+- a few VMs, some of which can run programs in parallel across processors,
+  hosts, and networks [R6,R8];
+
+- MLA components as either individual Expressions (similar to Ops) or as
+  subgraphs of SEs [R5,R7,R10,R11];
+
+- machine learning algorithms, including their training and testing, in the
+  form of Python functions that build SE graphs [R1,R8].
+
+This design addresses R2 (modularity) because swapping components is implemented
+literally by swapping subgraphs.
+
+The design addresses R9 (algorithmic efficiency) because we can write
+Theano-style graph transformations to recognize special cases of component
+combinations.
+
+The design addresses R3 if we make the additional decision that the VMs (at
+least sometimes) cache the return value of program function calls.  This cache
+serves as a database of experimental results, indexed by the functions that
+originally computed them.  I think this is a very natural scheme for organizing
+experiment results and for ensuring experiment reproducibility [R1].
+At the same time, this is a clean and simple API behind which experiments can be
+saved using a number of database technologies.
+
+APIs vs. lambda
+----------------
+
+Modularity in general is achieved when pieces can be substituted for one
+another.
+
+In an object-oriented design, modularity is achieved by agreeing on interface
+APIs, but in a functional design there is another possibility: the lambda.
+
+In an SE, these pieces are expression applications and the subgraphs they form.
+A subgraph is characterized syntactically within the program by its arguments
+and its return values.  A lambda function allows the User to create new
+Expression types from arbitrary subgraphs with very few keystrokes.  When a
+lambda is available and easy to use, there is much less pressure on the
+expression library to follow calling and return conventions strictly.
+
+Of course, the closer two subgraphs are in terms of their inputs, outputs, and
+semantics, the easier it is to substitute one for the other.  As library
+designers, we should still aim for compatibility of similar algorithms.  It's
+just not essential to choose an API that will guarantee a match, or indeed to
+choose any explicit API at all.
+
+
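
The SE + VM pattern proposed in architecture.txt, including the result cache and
the lambda-based substitution of subgraphs, could look roughly like the sketch
below.  It is only an illustration under stated assumptions: `Apply`,
`CachingVM`, and `pipeline` are hypothetical names invented here, not Pylearn or
Theano APIs, and an in-memory dict stands in for whatever database backend a
real VM would use.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Apply:
    """Hypothetical SE node: the application of `fn` to `args`.

    Frozen dataclasses are hashable and compare structurally, so two
    graphs built the same way are equal and can share a cache entry.
    """
    fn: Callable
    args: Tuple = ()


class CachingVM:
    """Hypothetical VM that evaluates a graph and caches every result."""

    def __init__(self):
        self.cache = {}   # stand-in for a database of experiment results
        self.calls = 0    # number of functions actually executed

    def run(self, node):
        if not isinstance(node, Apply):
            return node                   # constants evaluate to themselves
        if node in self.cache:
            return self.cache[node]       # result indexed by its expression
        values = [self.run(arg) for arg in node.args]
        self.calls += 1
        result = node.fn(*values)
        self.cache[node] = result
        return result


def add(a, b):
    return a + b


def mul(a, b):
    return a * b


def square(x):
    return x * x


def pipeline(preprocess, x):
    """Build a graph; `preprocess` is any function returning a subgraph."""
    return Apply(add, (preprocess(x), 1))


# Two interchangeable components, each defined by a lambda over a subgraph.
scale = lambda x: Apply(mul, (x, 2))
sq = lambda x: Apply(square, (x,))

vm = CachingVM()
r1 = vm.run(pipeline(scale, 3))        # (3 * 2) + 1
r2 = vm.run(pipeline(sq, 3))           # (3 * 3) + 1
r1_again = vm.run(pipeline(scale, 3))  # structurally equal graph: cache hit
print(r1, r2, r1_again, vm.calls)
```

Swapping `scale` for `sq` replaces one subgraph with another without either
component declaring a shared interface, and the third call is answered from the
cache because its graph is structurally equal to the first.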
--- a/doc/v2_planning/requirements.txt	Mon Sep 13 09:38:49 2010 -0400
+++ b/doc/v2_planning/requirements.txt	Mon Sep 13 09:42:36 2010 -0400
@@ -72,32 +72,8 @@

 R12. support infinite datasets (i.e. generated on the fly)

-R13. from a given evaluation experimental setup, be able to save a model that
-  can be used "in production" (e.g. say you try many combinations of
-  preprocessing, models and associated hyper-parameters, and want to easily be
-  able to recover the full "processing pipeline" that performs best, to be
-  used on future "real" test data)
-
-Basic Design Approach
-=====================
-
-An ability to drive parallel computations is essential in addressing [R6,R8].
+R13. apply trained models "in production".
+  - e.g. say you try many combinations of preprocessing, models and associated
+    hyper-parameters, and want to easily be able to recover the full "processing
+    pipeline" that performs best, and use it on real/test data later.

-The basic design approach for the library is to implement
-- a few virtual machines (VMs), some of which can run programs that can be
-  parallelized across processors, hosts, and networks.
-- MLAs in a Symbolic Expression language (similar to Theano) as required by
-  [R5,R7,R8]
-
-MLAs are typically specified by Symbolic programs that are compiled to these
-instructions, but some MLAs may be implemented in these instructions directly.
-Symbolic programs are naturally modularized by sub-expressions [R2] and can be
-optimized automatically (like in Theano) to address [R9].
-
-A VM that caches instruction return values serves as
-- a reliable record of what jobs were run [R1]
-- a database of intermediate results that can be analyzed after the
-  model-training jobs have completed [R3]
-- a clean API to several possible storage and execution backends.
-
-