# HG changeset patch
# User James Bergstra
# Date 1284385356 14400
# Node ID 4eda3f52ebef34e6b80659d020bcd74d9a4b46e7
# Parent 8be7928cc1aa306b0ef125de4666516a126decb2
v2planning - revs to requirements, added architecture

diff -r 8be7928cc1aa -r 4eda3f52ebef doc/v2_planning/architecture.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/architecture.txt	Mon Sep 13 09:42:36 2010 -0400
@@ -0,0 +1,60 @@
+====================
+Pylearn Architecture
+====================
+
+
+Basic Design Approach
+=====================
+
+I propose that the basic design of the library follow the Symbolic Expression
+(SE) structure + virtual machine (VM) pattern that worked for Theano.
+
+So the main things for the library to provide would be:
+
+- a few VMs, some of which can run programs in parallel across processors,
+  hosts, and networks [R6,R8];
+
+- MLA components as either individual Expressions (similar to Ops) or as
+  subgraphs of SEs [R5,R7,R10,R11];
+
+- machine learning algorithms, including their training and testing, in the
+  form of Python functions that build SE graphs [R1,R8].
+
+This design addresses R2 (modularity) because swapping components is literally
+implemented by swapping subgraphs.
+
+The design addresses R9 (algorithmic efficiency) because we can write
+Theano-style graph transformations to recognize special cases of component
+combinations.
+
+The design addresses R3 if we make the additional decision that the VMs (at
+least sometimes) cache the return values of program function calls. This cache
+serves as a database of experimental results, indexed by the functions that
+originally computed them. I think this is a very natural scheme for organizing
+experiment results and ensuring experiment reproducibility [R1]. At the same
+time, it provides a clean and simple API behind which experiments can be saved
+using a number of database technologies.
+
+APIs vs. lambda
+---------------
+
+Modularity in general is achieved when pieces can be substituted one for the
+other.
+
+In an object-oriented design, modularity is achieved by agreeing on interface
+APIs, but in a functional design there is another possibility: the lambda.
+
+In an SE, these pieces are expression applications and the subgraphs they form.
+A subgraph is characterized syntactically within the program by its arguments
+and its return values. A lambda function allows the user to create new
+Expression types from arbitrary subgraphs with very few keystrokes. When a
+lambda is available and easy to use, there is much less pressure on the
+expression library to follow calling and return conventions strictly.
+
+Of course, the closer two subgraphs are in terms of their inputs, outputs, and
+semantics, the easier it is to substitute one for the other. As library
+designers, we should still aim for compatibility of similar algorithms. It's
+just not essential to choose an API that will guarantee a match, or indeed to
+choose any explicit API at all.
+
+
diff -r 8be7928cc1aa -r 4eda3f52ebef doc/v2_planning/requirements.txt
--- a/doc/v2_planning/requirements.txt	Mon Sep 13 09:38:49 2010 -0400
+++ b/doc/v2_planning/requirements.txt	Mon Sep 13 09:42:36 2010 -0400
@@ -72,32 +72,8 @@
 R12. support infinite datasets (i.e. generated on the fly)
 
-R13. from a given evaluation experimental setup, be able to save a model that
-    can be used "in production" (e.g. say you try many combinations of
-    preprocessing, models and associated hyper-parameters, and want to easily be
-    able to recover the full "processing pipeline" that performs best, to be
-    used on future "real" test data)
-
-Basic Design Approach
-=====================
-
-An ability to drive parallel computations is essential in addressing [R6,R8].
+R13. apply trained models "in production".
+  - e.g. say you try many combinations of preprocessing, models and associated
+    hyper-parameters, and want to easily be able to recover the full "processing
+    pipeline" that performs best, and use it on real/test data later.
 
-The basic design approach for the library is to implement
-- a few virtual machines (VMs), some of which can run programs that can be
-  parallelized across processors, hosts, and networks.
-- MLAs in a Symbolic Expression language (similar to Theano) as required by
-  [R5,R7,R8]
-
-MLAs are typically specified by Symbolic programs that are compiled to these
-instructions, but some MLAs may be implemented in these instructions directly.
-Symbolic programs are naturally modularized by sub-expressions [R2] and can be
-optimized automatically (like in Theano) to address [R9].
-
-A VM that caches instruction return values serves as
-- a reliable record of what jobs were run [R1]
-- a database of intermediate results that can be analyzed after the
-  model-training jobs have completed [R3]
-- a clean API to several possible storage and execution backends.
-
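As a rough illustration of the SE + caching-VM pattern described in
architecture.txt above, here is a minimal Python sketch. Every name in it
(``Apply``, ``sym``, ``CachingVM``, ``center``, ``train_knn``) is a hypothetical
stand-in rather than a Pylearn or Theano API, and a real VM would persist its
cache in one of the database backends mentioned above rather than an in-memory
dict::

    # Hypothetical sketch of the SE-graph + caching-VM pattern; not real Pylearn code.
    import hashlib
    import pickle

    class Apply(object):
        """A node in a symbolic-expression (SE) graph: a function applied to inputs."""
        def __init__(self, fn, inputs):
            self.fn = fn
            self.inputs = inputs        # constants or other Apply nodes (subgraphs)

    def sym(fn):
        """Wrap a plain function so that calling it builds graph nodes instead of computing."""
        return lambda *inputs: Apply(fn, inputs)

    class CachingVM(object):
        """Evaluate an SE graph, memoizing every Apply by (function name, evaluated inputs).

        The cache doubles as a record of what was run [R1] and a database of
        intermediate results that can be inspected after the jobs finish [R3].
        """
        def __init__(self):
            self.cache = {}             # stand-in for a persistent experiment database

        def run(self, node):
            if not isinstance(node, Apply):
                return node             # constants evaluate to themselves
            args = tuple(self.run(i) for i in node.inputs)
            key = hashlib.sha1(pickle.dumps((node.fn.__name__, args))).hexdigest()
            if key not in self.cache:
                self.cache[key] = node.fn(*args)
            return self.cache[key]

    # Two toy "MLA components" written as plain functions, then lifted to Expressions.
    def center(data):
        mean = sum(data) / float(len(data))
        return tuple(x - mean for x in data)

    def train_knn(data, k):
        return {"k": k, "data": data}   # stand-in for a fitted model

    center_e, train_e = sym(center), sym(train_knn)

    # A lambda glues two components into a reusable subgraph; swapping a component
    # means swapping the corresponding sub-expression [R2].
    centered_knn = lambda data, k: train_e(center_e(data), k)

    vm = CachingVM()
    model = vm.run(centered_knn((1.0, 2.0, 3.0, 4.0), 3))   # computed and cached
    model = vm.run(centered_knn((1.0, 2.0, 3.0, 4.0), 3))   # served from the cache

Keying the cache on the function and its already-evaluated inputs is what would
make re-running an experiment script cheap and reproducible; swapping the dict
for an on-disk store is the point where the storage backends discussed above
would plug in.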