view doc/v2_planning/architecture.txt @ 1293:879a5633bb52
A small addendum about the 'import A as B' moratorium.
author: David Warde-Farley <wardefar@iro.umontreal.ca>
date: Fri, 01 Oct 2010 11:27:41 -0400
parents: b9d0a326e3e7
====================
Pylearn Architecture
====================

SE + VM Approach
=================

One avenue for the basic design of the library is to follow the Symbolic
Expression (SE) structure + virtual machine (VM) pattern that worked for Theano.

The main things for the library to provide would be:

- a few VMs, some of which can run programs in parallel across processors,
  hosts, and networks [R6,R8];

- MLA components as either individual Expressions (similar to Ops) or as
  subgraphs of SEs [R5,R7,R10,R11];

- machine learning algorithms including their training and testing in the form
  of python functions that build SE graphs [R1,R8].

This design addresses R2 (modularity) because swapping components is literally
implemented by swapping subgraphs. The design addresses R9 (algorithmic
efficiency) because we can write Theano-style graph transformations to recognize
special cases of component combinations.

The design addresses R3 if we make the additional decision that the VMs (at
least sometimes) cache the return value of program function calls. This cache
serves as a database of experimental results, indexed by the functions that
originally computed them. I think this is a very natural scheme for organizing
experiment results, and ensuring experiment reproducibility [R1]. At the same
time, this is a clean and simple API behind which experiments can be saved using
a number of database technologies.

APIs vs. lambda
----------------

Modularity in general is achieved when pieces can be substituted one for the
other.

In an object-oriented design, modularity is achieved by agreeing on interface
APIs, but in a functional design there is another possibility: the lambda.

In an SE these pieces are expression [applications] and the subgraphs they form.
A subgraph is characterized syntactically within the program by its arguments
and its return values. A lambda function allows the User to create new
Expression types from arbitrary subgraphs with very few keystrokes.
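
As a minimal sketch of the lambda idea (not Theano's or Pylearn's actual API),
the toy code below wraps a subgraph into a new expression type. The names
``Var``, ``Apply``, ``evaluate`` and ``lam`` are hypothetical, and ``evaluate``
is only a recursive interpreter standing in for a VM.

```python
import operator


class Var:
    """Symbolic input: a leaf of the expression graph (hypothetical class)."""

    def __init__(self, name):
        self.name = name


class Apply:
    """Application of an expression (here a plain callable) to input nodes."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs


def evaluate(node, env):
    """Toy stand-in for a VM: recursively evaluate a graph.

    `env` maps Var objects to concrete values.
    """
    if isinstance(node, Var):
        return env[node]
    if isinstance(node, Apply):
        return node.fn(*(evaluate(i, env) for i in node.inputs))
    return node  # anything else is treated as a constant


def lam(args, output):
    """Wrap the subgraph leading from `args` to `output` as a new expression."""

    def compute(*values):
        # Bind the formal arguments to concrete values and run the subgraph.
        return evaluate(output, dict(zip(args, values)))

    def expression(*inputs):
        # Calling the new expression adds one Apply node to the caller's graph.
        return Apply(compute, *inputs)

    return expression


x, w, b = Var("x"), Var("w"), Var("b")
affine = lam([x, w, b], Apply(operator.add, Apply(operator.mul, x, w), b))
rectify = lam([x], Apply(max, x, 0.0))
identity = lam([x], x)

# Two interchangeable "layers": swapping the nonlinearity is swapping a subgraph.
rectified_layer = lam([x, w, b], rectify(affine(x, w, b)))
linear_layer = lam([x, w, b], identity(affine(x, w, b)))
```

Both layers take the same arguments and return one value, so either can be
dropped into a larger graph without any declared interface; nothing beyond the
argument list was agreed on in advance.
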
When a lambda is available and easy to use, there is much less pressure on the
expression library to follow calling and return conventions strictly.

Of course, the closer two subgraphs are in terms of their inputs, outputs, and
semantics, the easier it is to substitute one for the other. As library
designers, we should still aim for compatibility of similar algorithms. It's
just not essential to choose an API that will guarantee a match, or indeed to
choose any explicit API at all.

YB: I agree that lambdas are more flexible, but from the user's point of
view it is really important to know what can swap with what, so that they
can easily plug-and-play. So even if informal, something in the spirit
of an API must be described somewhere, and components should declare
either formally or through comments what functionality 'type'
they can take on.

Encapsulation vs. linearity
---------------------------

A while ago, the Apstat crew went to fight "encapsulation" to propose instead
a more "linearized" approach to experiment design. I must admit I didn't
really understand the deep motivations behind this, and after practicing both
styles (encapsulation for PLearn / Theano, linearity @ ARL / Ubisoft), I still
don't. I do find, however, some not-so-deep-but-still-significant advantages
to the linear version, which hopefully can be made clear (along with a
clarification of what the h*** I am talking about) in the following example:

* Linear version:

.. code-block:: python

    my_experiment = pipeline([
        data,
        filter_samples,
        PCA,
        k_fold_split,
        neural_net,
        evaluation,
    ])

* Encapsulated version:

.. code-block:: python

    my_experiment = evaluation(
        data=PCA(filter_samples(data)),
        split=k_fold_split,
        model=neural_net)

What I like in the linear version is that it is much more easily human-readable
(once you know what it means): you just follow the flow of the experiment by
reading through a single list.
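
A minimal sketch of what a ``pipeline`` helper for the linear version could
look like, assuming the first element is the input and every later element is
a one-argument callable. The stages below are toy stand-ins, not Pylearn
components, and the helper handles strictly linear chains only.

```python
from functools import reduce


def pipeline(stages):
    """Feed the first element through the remaining stages, left to right."""
    first, *rest = stages
    return reduce(lambda value, stage: stage(value), rest, first)


# Toy stand-ins for the stages named in the text (all hypothetical).
data = [3.0, -1.0, 4.0, -1.0, 5.0]


def filter_samples(samples):
    """Drop negative samples."""
    return [s for s in samples if s >= 0]


def rescale(samples):
    """Divide by the largest value (assumes non-negative samples)."""
    largest = max(samples)
    return [s / largest for s in samples]


def evaluation(samples):
    """Report the mean of the processed samples."""
    return sum(samples) / len(samples)


linear_result = pipeline([
    data,
    filter_samples,
    rescale,  # commenting out this single line removes the stage
    evaluation,
])

# The same computation written in the encapsulated style.
encapsulated_result = evaluation(rescale(filter_samples(data)))
```

Both spellings compute the same value here; the difference is only in how the
script reads and how easily a stage can be removed.
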
On the other hand, the encapsulated version requires some deeper analysis to
understand what is going on and in which order.
Also, commenting out parts of the processing is simpler in the first case (it
takes a single # in front of an element).
However, linearity tends to break when the experiment is actually not linear,
i.e. the graph of object dependencies is more complex (*).

I'm just bringing this up because it may be nice to be able to provide the user
with the most intuitive way to design experiments. I actually don't think those
approaches are mutually exclusive, and it could be possible for the underlying
system to use the more flexible / powerful encapsulated representation, while
having the option to write simple scripts in a form that is easier to
understand and manipulate.

It could also be worth discussing this issue with Xavier / Christian / Nicolas.

(*) Note that I cheated a bit in my example above: the graph from the
encapsulated version is not a simple chain, so it is not obvious how to convert
it into the pipeline given in the linear version. It's still possible though,
but this is probably not the place to get into the details.

RP comment : The way I see it, you could always have everything using the
encapsulation paradigm (which as you pointed out is a bit more powerful) and
then have linear shortcuts (functions that take a list of functions and some
inputs and apply them in some order). You will not be able to have a single
pipeline function that covers all cases, but I think it is sufficient to offer
such options (linear functions) for a few widely used cases.

Jobman Compatibility Approach
=============================

One basic approach for the library is to provide a set of components that are
compatible with remote execution. The emphasis could be not so much on
standardizing the roles and APIs of components as on ensuring that they can be
glued together and support parallel execution on one or more CPUs or
clusters.
In this approach we would provide a proxy for asynchronous execution
(e.g. "pylearn.call(fn, args, kwargs, backend=default_backend)"), which would
come with constraints on what fn, args, and kwargs can be. Specifically, they
must be picklable, and there are benefits (e.g. automatic function call caching)
associated with them being hashable as well.

Benchmark
=========

During the general meeting on Sept. 17th, we agreed to produce at least
pseudo-code (if possible, actual code) for the following model:

A Deep Belief Net (with greedy layerwise pre-training, and supervised
fine-tuning), with preprocessing of the data, double cross-validation,
and save/load of the model.

The different approaches to be tested are:

- Plugins with a global scheduler driving the experiment (Razvan's team)

- Objects, with basic hooks at predefined places (Pascal L.'s team)

- Existing objects and code (including dbi and Jobman), with some more
  pieces to tie things together (Fred B.)

OD comments: We were in a hurry to close the meeting and I did not have time to
really explain what I meant when I suggested we should add the requirement of
saving the final "best" model. What I had in mind is a typical "applied ML"
experiment, i.e. the following approach that hopefully can be understood just
by writing it down in the form of a processing pipeline. The double cross
validation step, whose goal is to obtain an estimate of the generalization
error of our final model, is:

    data -> k_fold_outer(preprocessing -> k_fold_inner(dbn -> evaluate) -> select_best -> retrain_on_all_data -> evaluate)

Once this is done, the model we want to save is obtained by doing

    data -> preprocessing -> k_fold(dbn -> evaluate) -> select_best -> retrain_on_all_data

and we save

    preprocessing -> best_model_selected
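
The two pipelines above can be sketched in plain Python as follows. This is
only an illustration under loud assumptions: a one-dimensional ridge regression
stands in for the DBN, rescaling ``x`` stands in for the preprocessing, and
every function name is hypothetical rather than part of Pylearn.

```python
import pickle


def k_fold(items, k):
    """Yield k (train, test) splits; fold i tests on every k-th item from i."""
    for i in range(k):
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test


def fit_preprocessing(pairs):
    """Stand-in preprocessing: learn a scale for x from the given pairs."""
    return max(abs(x) for x, _ in pairs) or 1.0


def preprocess(scale, pairs):
    """Apply a previously learned scale to the inputs."""
    return [(x / scale, y) for x, y in pairs]


def train(pairs, penalty):
    """Stand-in for the DBN: ridge regression y = w * x (assumes some x != 0)."""
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / (sxx + penalty)


def evaluate(w, pairs):
    """Mean squared error of the model on (x, y) pairs."""
    return sum((y - w * x) ** 2 for x, y in pairs) / len(pairs)


def select_and_retrain(raw, penalties, k):
    """preprocessing -> k_fold(train -> evaluate) -> select_best -> retrain."""
    scale = fit_preprocessing(raw)
    pairs = preprocess(scale, raw)

    def cv_error(penalty):
        errors = [evaluate(train(tr, penalty), te) for tr, te in k_fold(pairs, k)]
        return sum(errors) / len(errors)

    best = min(penalties, key=cv_error)
    return scale, best, train(pairs, best)  # retrain on all the data


def double_cross_validation(raw, penalties, k_outer, k_inner):
    """Estimate the generalization error of the whole selection procedure."""
    errors = []
    for tr, te in k_fold(raw, k_outer):
        scale, _, w = select_and_retrain(tr, penalties, k_inner)
        errors.append(evaluate(w, preprocess(scale, te)))
    return sum(errors) / len(errors)


raw_data = [(float(x), 2.0 * x) for x in range(1, 13)]  # noiseless y = 2x
penalties = [0.0, 1.0, 10.0]

# Step 1: estimate the generalization error of the final model.
estimate = double_cross_validation(raw_data, penalties, k_outer=3, k_inner=2)

# Step 2: select and retrain on all data, then save preprocessing + model.
scale, best_penalty, best_model = select_and_retrain(raw_data, penalties, k=3)
saved = pickle.dumps({"preprocessing": scale, "model": best_model})
```

The outer loop never looks at its test fold while selecting hyperparameters,
which is what makes ``estimate`` a fair estimate for the model saved in step 2.
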