view doc/v2_planning/plugin.txt @ 1517:a6e634b83d88

allow to read filetensor compressed with bz2
author Frederic Bastien <nouiz@nouiz.org>
date Wed, 09 May 2012 11:56:28 -0400
parents a1957faecc9b


======================================
Plugin system for iterative algorithms
======================================

I would like to propose a plugin system for iterative algorithms in
Pylearn. Basically, it would be useful to be able to sandwich
arbitrary behavior in-between two training iterations of an algorithm
(whenever applicable). I believe many mechanisms are best implemented
this way: early stopping, saving checkpoints, tracking statistics,
real time visualization, remote control of the process, or even
interlacing the training of several models and making them interact
with each other.

So here is the proposal: essentially, a plugin would be a (schedule,
timeline, function) tuple.

Schedule
========

The schedule is a function that takes two "times", t1 and t2, and
returns True if the plugin should be run in-between these times. The
times are expressed in a "timeline" unit described below (e.g. "real
time" or "iterations"). The reason why we check a time range [t1, t2]
rather than some discrete time t is that we do not necessarily want to
schedule plugins on iteration numbers. For instance, we could want to
run a plugin every second, or every minute, and then [t1, t2] would be
the start time and end time of the last iteration, and we would run the
plugin whenever a new second started in that range (but still on
training iteration boundaries). Alternatively, we could want to run a
plugin every n examples seen, but if we use mini-batches, the nth
example might fall squarely in the middle of a batch.

I've implemented a somewhat elaborate schedule system. `each(10)`
produces a schedule that returns True whenever a multiple of 10 is in
the time range. `at(17, 153)` produces one that returns True when 17
or 153 is in the time range. Schedules can be combined and negated,
e.g. `each(10) & ~at(20, 30)` (execute at each multiple of 10, except
at 20 and 30). This gives a lot of flexibility as to when you want to
do things.
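
To make the combinators concrete, here is a minimal sketch of how such
schedules could be written. It is not the code in plugin.py: the
`Schedule` class, the closed-range convention for [t1, t2], and the `|`
operator are assumptions made for illustration.

```python
class Schedule:
    """Wraps a predicate over a closed time range [t1, t2] (assumed convention)."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, t1, t2):
        return self.predicate(t1, t2)

    def __and__(self, other):
        return Schedule(lambda t1, t2: self(t1, t2) and other(t1, t2))

    def __or__(self, other):
        return Schedule(lambda t1, t2: self(t1, t2) or other(t1, t2))

    def __invert__(self):
        return Schedule(lambda t1, t2: not self(t1, t2))


def each(n):
    # The largest multiple of n that is <= t2 must not fall before t1.
    return Schedule(lambda t1, t2: (t2 // n) * n >= t1)


def at(*times):
    # True when any of the listed times lies inside [t1, t2].
    return Schedule(lambda t1, t2: any(t1 <= t <= t2 for t in times))


schedule = each(10) & ~at(20, 30)
```

With a closed range, a time that sits exactly on the boundary between
two consecutive ranges would match both, so a real implementation would
probably want a half-open range instead.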

Timeline
========

This would be a string indicating on what "timeline" the schedule is
supposed to operate. For instance, there could be a "real time"
timeline, an "algorithm time" timeline, an "iterations" timeline, a
"number of examples" timeline, and so on. This means you can schedule
some action to be executed every actual second, or every second of
training time (ignoring time spent executing plugins), or every
discrete iteration, or every n examples processed. This might be
feature bloat (it was an afterthought to my original design, anyway),
but I think that there are circumstances where each of these options
is the best one.
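
As a rough illustration of why the timeline matters, the sketch below
computes the range covered by a single training iteration on three of
these timelines and checks a "every n units" schedule against each. The
function names, the 0-based iteration index, and the 1-based example
count are all assumptions; the "algorithm time" timeline is left out
for brevity.

```python
def iteration_ranges(iteration, batch_size, wall_start, wall_end):
    """Return the [t1, t2] range covered by one iteration on each timeline."""
    return {
        "iterations": (iteration, iteration),
        # Iteration 0 sees examples 1..batch_size, iteration 1 the next batch, etc.
        "examples": (iteration * batch_size + 1, (iteration + 1) * batch_size),
        "real time": (wall_start, wall_end),
    }


def contains_multiple(n, t1, t2):
    """True if some multiple of n lies in the closed range [t1, t2]."""
    return (t2 // n) * n >= t1


ranges = iteration_ranges(iteration=3, batch_size=32, wall_start=1.9, wall_end=2.3)

# The 100th example falls inside this batch (examples 97..128).
run_on_examples = contains_multiple(100, *ranges["examples"])
# A new second (t = 2) started while this iteration was running.
run_on_seconds = contains_multiple(1, *ranges["real time"])
# Iteration 3 is not a multiple of 10.
run_on_iterations = contains_multiple(10, *ranges["iterations"])
```

The same iteration therefore triggers different plugins depending on
the timeline their schedule is attached to.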

Function
========

The plugin function would receive some object containing the time
range, a flag indicating whether the training has started, a flag
indicating whether the training is done (which the plugin can set in
order to stop training), as well as anything pertinent about the model.
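
For instance, a plugin function that stops training after a fixed
amount of time could look like the sketch below. The attribute names
(`t1`, `t2`, `started`, `done`, `model`) and the `stop_after` helper are
hypothetical; the proposal does not fix the layout of the object.

```python
from types import SimpleNamespace


def stop_after(max_time):
    """Build a plugin function that ends training once t2 reaches max_time."""
    def plugin(state):
        if state.started and state.t2 >= max_time:
            # Setting the flag is how the plugin asks the trainer to stop.
            state.done = True
    return plugin


# Stand-in for the object the training loop would pass to the plugin.
state = SimpleNamespace(t1=99, t2=100, started=True, done=False, model=None)
stop_after(100)(state)
```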

Implementation
==============

I have implemented the feature in plugin.py, in this directory. Simply
run `python plugin.py` to test it.



===============
Revised version
===============

Taking into account ideas thrown around during the September 16
meeting, I (OB) have made the following modifications to my original
proposal:

Event objects
=============

In the revised framework, an Event is a generic object which can
contain any attributes you want, with one privileged attribute, the
'type' attribute, which is a string. I expect the following attributes
to be used widely:

* type: this is a string describing the abstract semantics of this
  event ("tick", "second", "millisecond", "batch", etc.)

* issuer: a pointer to the plugin that issued this event. This allows
  for fine-grained filtering in the case where several plugins can
  fire the same event type.

* time: an integer or float index on an abstract timeline. For
  instance, the "tick" event would have a "time" field, which would be
  increased by one every time the event is fired. Pretty much all
  recurrent events should include this.

* data: some data associated with the event. Presumably it doesn't
  have to be named "data", and more than one data field could be given.

The basic idea is that it should be possible to say: "I want this
plugin to be executed every tenth time an event of this type is fired
by this plugin", or any subset of these conditions.
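
A minimal sketch of such an Event object, assuming it simply stores
whatever keyword arguments it is given (the actual class in plugin.py
may differ):

```python
class Event:
    """Generic event: a mandatory string 'type' plus arbitrary attributes."""

    def __init__(self, type, **attributes):
        self.type = type
        self.__dict__.update(attributes)

    def __repr__(self):
        fields = ", ".join(
            "%s=%r" % (name, value)
            for name, value in vars(self).items()
            if name != "type"
        )
        return "Event(%r%s)" % (self.type, ", " + fields if fields else "")


tick = Event("tick", time=3, issuer=None)
```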

Matching events
===============

When registering a plugin, you specify a sort of "abstract event" that
an event must "match" in order to be fed to the plugin. This can be
done by simply instantiating an event with the fields you want to
match. I think a few examples explain the idea best
(sch.schedule_plugin = add a plugin to the scheduler):

# Print the error on every parameter update (learner given in the event)
sch.schedule_plugin(Event("parameter_update"), PrintError())
# Print the reconstruction error of daa0 whenever it does a parameter update
sch.schedule_plugin(Event("parameter_update", issuer = daa0), PrintReconstructionError())
# Save the learner every 10 minutes
sch.schedule_plugin(Event("minute", time = each(10)), Save(learner))

The events given as first argument to schedule_plugin are not real
events: they are "template events" meant to be *matched* against the
real events that will be fired. If the terminology is confusing, it
would not be a problem to use another class with a better name (for
example, On("minute", time = each(10)) could be clearer than
Event(...), I don't know).

Note that fields in these Event objects can be a special kind of
object, a Matcher, which allows filtering events based on arbitrary
conditions. My Schedule objects (each, at, etc.) now inherit from
Matcher. You could easily have a matcher that matches issuers that are
instances of a certain class, or one that matches every single
event (I have an example of the latter in plugin.py).
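
The matching rule could be sketched as follows. This is an assumed
implementation, not the one in plugin.py: the `matches` function, the
`Matcher.matches` method name, and the `InstanceOf` matcher are
hypothetical, and `each` is simplified to test a single time value
rather than a range.

```python
class Event:
    def __init__(self, type, **attributes):
        self.type = type
        self.__dict__.update(attributes)


class Matcher:
    """Base class for template fields that test a condition, not equality."""

    def matches(self, value):
        raise NotImplementedError


class each(Matcher):
    def __init__(self, n):
        self.n = n

    def matches(self, value):
        return value % self.n == 0


class InstanceOf(Matcher):
    def __init__(self, cls):
        self.cls = cls

    def matches(self, value):
        return isinstance(value, self.cls)


_MISSING = object()


def matches(template, event):
    """True if every field set on the template is satisfied by the event."""
    for name, expected in vars(template).items():
        actual = getattr(event, name, _MISSING)
        if actual is _MISSING:
            return False
        if isinstance(expected, Matcher):
            if not expected.matches(actual):
                return False
        elif expected != actual:
            return False
    return True


template = Event("minute", time=each(10))
```

Fields that the template leaves unset are ignored, which is what lets a
bare `Event("parameter_update")` match updates from any issuer.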

Plugins
=======

The plugin class would have the following methods:

* attach(scheduler): tell the plugin that it is being scheduled by the
  scheduler, store the scheduler in self. The method can return self,
  or a copy of itself.

* fire(type, **attributes): adds Event(type, issuer = self, **attributes)
  to the event queue of self.scheduler
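
A sketch of these two methods, under the assumption that the scheduler
exposes a queue(event) method. `RecordingScheduler` is a hypothetical
stand-in used only to show what fire() hands over.

```python
class Event:
    def __init__(self, type, **attributes):
        self.type = type
        self.__dict__.update(attributes)


class Plugin:
    def attach(self, scheduler):
        # Remember the scheduler so that fire() knows where to queue events.
        self.scheduler = scheduler
        return self

    def fire(self, type, **attributes):
        self.scheduler.queue(Event(type, issuer=self, **attributes))


class RecordingScheduler:
    """Stand-in scheduler (hypothetical) that only stores queued events."""

    def __init__(self):
        self.events = []

    def queue(self, event):
        self.events.append(event)


sch = RecordingScheduler()
plugin = Plugin().attach(sch)
plugin.fire("batch", time=1)
```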

Scheduler
=========

A Scheduler would have a schedule_plugin(event_template, plugin)
method to add plugins, a queue(event) method to queue a new event, and
it would be callable.

My current version proceeds as follows:

* Fire Event("begin"). Somewhat equivalent to "tick" at time 0, but I
  find it cleaner to have a special event to mark the beginning of the
  event loop.
* Infinite loop
  * Fire Event("tick", time = <iteration#>)
  * Loop until the queue is empty
    * Pop event, execute all plugins that respond to it
    * Check if event.type == "terminate". If so, stop.
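
The loop above can be sketched as follows. This is a simplified
illustration rather than the code in plugin.py: plugins are assumed to
be plain callables taking the event, template matching is reduced to
field equality, and the tick counter is assumed to start at 1.

```python
from collections import deque


class Event:
    def __init__(self, type, **attributes):
        self.type = type
        self.__dict__.update(attributes)


_MISSING = object()


def matches(template, event):
    # Equality-only matching; Matcher objects are left out of this sketch.
    return all(
        getattr(event, name, _MISSING) == value
        for name, value in vars(template).items()
    )


class Scheduler:
    def __init__(self):
        self.plugins = []      # list of (template, plugin) pairs
        self.events = deque()  # FIFO queue of pending events

    def schedule_plugin(self, event_template, plugin):
        self.plugins.append((event_template, plugin))

    def queue(self, event):
        self.events.append(event)

    def __call__(self):
        self.queue(Event("begin"))
        iteration = 0
        while True:
            iteration += 1
            self.queue(Event("tick", time=iteration))
            while self.events:
                event = self.events.popleft()
                for template, plugin in self.plugins:
                    if matches(template, event):
                        plugin(event)
                if event.type == "terminate":
                    return


sch = Scheduler()
seen = []


def record(event):
    seen.append((event.type, getattr(event, "time", None)))


def stop_at_three(event):
    if event.time == 3:
        sch.queue(Event("terminate"))


sch.schedule_plugin(Event("begin"), record)
sch.schedule_plugin(Event("tick"), record)
sch.schedule_plugin(Event("tick"), stop_at_three)
sch()
```

Plugins may queue further events while they run; those are processed
before the next tick is fired.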

Varia
=====

I've made a very simple implementation of a DispatchPlugin which, upon
reception of an event, dispatches it to its "on_<event.type>" method
(or calls a fallback). It seems nice. However, in order for it to work
reliably, it has to be registered on all events, and I'm not sure it
can scale well to more complex problems where the source of events is
important.
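
The dispatching idea fits in a few lines. The sketch below assumes
plugins are called with the event and that the fallback is a method
named `fallback`; the `Logger` subclass is only there to show the
behavior.

```python
from types import SimpleNamespace


class DispatchPlugin:
    def __call__(self, event):
        # Look up on_<event.type>; use the fallback when it is not defined.
        handler = getattr(self, "on_" + event.type, self.fallback)
        return handler(event)

    def fallback(self, event):
        pass


class Logger(DispatchPlugin):
    def __init__(self):
        self.log = []

    def on_begin(self, event):
        self.log.append("started")

    def on_tick(self, event):
        self.log.append("tick %d" % event.time)

    def fallback(self, event):
        self.log.append("ignored " + event.type)


logger = Logger()
logger(SimpleNamespace(type="begin"))
logger(SimpleNamespace(type="tick", time=1))
logger(SimpleNamespace(type="batch"))
```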

Implementation
==============

See plugin.py.