doc/v2_planning/arch_FB.txt @ 1377:0665274b14af

Minor fixes for better Sphinx doc output
author Olivier Delalleau <delallea@iro>
date Thu, 18 Nov 2010 14:14:48 -0500

Current framework and proposed extensions
=========================================

This proposal is complementary to PL's hook system and OB's checkpoint system, and could serve as part of the backend of James' system. I don't remember/know the other proposals well enough to compare.

Assumptions I make:

* The Dataset, Learner and Layers committees have done their work.
    * That means we have an easier way to build a learning model.
* The checkpoint question is solved: we ignore it (short jobs), don't care, use manual checkpoints, use structured checkpoints with an example, or use OB's system.

Example MLP
-----------

* Select the hyper-parameter search space with `jobman sqlschedules`.
* Dispatch the jobs with dbidispatch
* *Manually* (fixable) reset job statuses to START.
   * I started implementing this, but I will change the syntax to make it more generic.
* *Manually* relaunch crashed jobs.
* *Manually* (fixable) analyse/visualise the results. (We need to start those meetings at some point.)
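For illustration, the grid expansion that `jobman sqlschedules` performs over the search space can be sketched in plain Python. This is a hedged sketch: `expand_grid` and its parameter names are hypothetical, not jobman's actual API.

```python
# Hypothetical sketch of hyper-parameter grid expansion: every
# combination of the listed values becomes one scheduled job.
# (Illustrative only -- not jobman's real interface.)
from itertools import product

def expand_grid(**space):
    """Yield one flat dict of parameters per point in the grid."""
    names = sorted(space)
    for values in product(*(space[n] for n in names)):
        yield dict(zip(names, values))

# 2 learning rates x 3 hidden-layer sizes -> 6 jobs
jobs = list(expand_grid(lr=[0.1, 0.01], n_hid=[100, 500, 1000]))
```

Each dict in `jobs` would correspond to one row inserted in the jobman database.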

Example MLP + cross-validation
------------------------------

* Modify the dataset interface to accept two new hyper-parameters: `nb_cross_fold=X` and `id_cross_fold`.
* Schedule all of the folds to run in parallel with `jobman sqlschedules`.
* *Manually* (fixable) reset job statuses to START.
* *Manually* relaunch crashed jobs.
* *Manually* (fixable) analyse/visualise the results.
   * Those tools need to understand the concept of cross-validation.
* *Manually* (fixable with the proposal below) launch a retrain on the full dataset with the best hyper-parameters.
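The proposed `nb_cross_fold`/`id_cross_fold` interface could behave as sketched below. `cross_fold_split` is an illustrative name for this note, not an existing Pylearn or jobman function.

```python
# Hedged sketch of the proposed dataset interface: given the total
# number of folds (nb_cross_fold) and which fold this job handles
# (id_cross_fold), return the train/valid example indices.
def cross_fold_split(n_examples, nb_cross_fold, id_cross_fold):
    fold_size = n_examples // nb_cross_fold
    start = id_cross_fold * fold_size
    # the last fold absorbs the remainder
    if id_cross_fold == nb_cross_fold - 1:
        end = n_examples
    else:
        end = start + fold_size
    valid = list(range(start, end))
    train = [i for i in range(n_examples) if i < start or i >= end]
    return train, valid

# fold 2 of 3 over 10 examples: valid = [6..9], train = [0..5]
train, valid = cross_fold_split(10, nb_cross_fold=3, id_cross_fold=2)
```

Scheduling one job per `id_cross_fold` value then runs all folds in parallel.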


Example DBN
-----------

* *Concept* JOB PHASE. A DBN has an unsupervised and a supervised phase.
   * We suppose the job script has a parameter telling it which phase it should run.
* *Jobman Extension* We can extend jobman to handle dependencies between jobs.
    * Proposed syntax:

.. code-block:: bash

      jobman sqlschedule p0={{}} ... -- p1={{}} ... -- p2=...

* The parameters before the first `--` tell on which jobs the new jobs depend (this allows depending on several jobs at once).
* The parameters between the two `--` tell that we want to create a new group of jobs for all those jobs.
* The parameters after the second `--` tell which new jobs to create for each new group of jobs.

* *Jobman Extension* create `jobman dispatch`
    * This will dispatch new jobs to run on the cluster with dbidispatch once a job has all of its dependencies finished.
* *Jobman Extension* create `jobman monitor`
    * This repeatedly calls `jobman condor_check` to find jobs that may have crashed, and prints them on the screen. It needs to filter the output of condor_check.
    * We can create other `jobman CLUSTER_check` commands for mammouth, colosse, angel, ...
* *Jobman Extension* when we change the status of a job to START in jobman, also change the status of the jobs that depend on it.
* *Jobman Extension* determine whether a job finished correctly or not.
   * If a job did not finish correctly, don't start the jobs that depend on it.
* *Jobman Policy* All changes to the db should be doable through jobman commands.
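The two dependency-related extensions (dispatch-when-ready and cascading resets) can be sketched as follows. This is illustrative Python, not jobman internals; the status names and the `deps` table are assumptions made for this note.

```python
# Minimal sketch (not jobman code) of two proposed extensions:
# - `jobman dispatch` starts a job only once all its dependencies
#   finished correctly;
# - resetting a job to START also resets every job downstream of it.
DONE, START = "DONE", "START"

status = {"p0": DONE, "p1": START, "p2": START}
deps = {"p0": [], "p1": ["p0"], "p2": ["p1"]}  # p2 depends on p1, etc.

def dispatchable(job):
    """A job may be dispatched when it is waiting and its deps are done."""
    return status[job] == START and all(status[d] == DONE for d in deps[job])

def reset(job):
    """Reset a job and, transitively, every job that depends on it."""
    status[job] = START
    for other, other_deps in deps.items():
        if job in other_deps:
            reset(other)
```

With this table, `dispatchable("p1")` holds but `dispatchable("p2")` does not until `p1` finishes; `reset("p0")` puts all three jobs back to START.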

* *Manually* relaunch crashed jobs.
* *Manually* (fixable) analyse/visualise the results.
   * Those tools need to understand the concept of job phases, or be agnostic to it.


* *Cross-validation retrain* can be done with an additional phase using the extensions above.
    * The new job needs to know how to determine the best hyper-parameters from the results.


* This can be extended to double cross-validation.
   * The dataset must support double cross-validation.
   * We create more phases in jobman.
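The phase mechanism amounts to enforcing an order on the per-job phases (e.g. unsupervised pre-training, then supervised fine-tuning, then the cross-validation retrain). A toy sketch, with made-up phase names and no real training:

```python
# Illustrative phase ordering for the DBN example above. In the real
# setting, the phase name would arrive as a jobman parameter and each
# branch would run actual training code.
ORDER = ["unsupervised", "supervised", "retrain"]

def run_phase(phase, state):
    """Run one phase, refusing to start before its predecessors are done."""
    idx = ORDER.index(phase)
    done = state.setdefault("done", [])
    if done != ORDER[:idx]:
        raise RuntimeError("phase %r run out of order" % phase)
    done.append(phase)
    return state

state = {}
for phase in ORDER:
    state = run_phase(phase, state)
```

The dependency extension above would guarantee this ordering across jobs; the check inside `run_phase` is only a safety net.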

Hyper-parameter search in Pylearn
---------------------------------

In some cases we would want the hyper-parameter search to be done inside Pylearn. This adds a dependency on jobman. We can finish/verify how jobman works with sqlite, so that an installed db server is not required: sqlite is included in Python 2.5, and jobman already requires Python 2.5. We could also make jobman's dependency on SQLAlchemy optional when sqlite is used, to limit the number of dependencies.
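As a sanity check of that claim, the standard-library `sqlite3` module alone is enough to keep a small job table without a db server or SQLAlchemy. The table layout below is invented for illustration, not jobman's actual schema:

```python
# Sketch of the sqlite-backed mode: sqlite3 ships with Python since
# 2.5, so no external db server or SQLAlchemy is needed for this.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would persist the db
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, params TEXT)"
)
conn.execute(
    "INSERT INTO jobs (status, params) VALUES (?, ?)", ("START", "lr=0.01")
)
conn.commit()
rows = conn.execute("SELECT status, params FROM jobs").fetchall()
```

Limiting jobman to this kind of plain-`sqlite3` access in sqlite mode is what would let the SQLAlchemy dependency become optional.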