Mercurial > pylearn


Discussion of Function Specification for Learner Types
======================================================

In its most abstract form, a learner is an object with the
following semantics:

* A learner has named hyper-parameters that control how it learns (these can be viewed
as options of the constructor, or might be set directly by a user)

* A learner also has an internal state that depends on what it has learned.

* A learner reads and produces data, so the definition of learner is
intimately linked to the definition of dataset (and task).

* A learner has one or more 'train' or 'adapt' functions by which
it is given a sample of data (typically either the whole training set, or
a mini-batch, which contains as a special case a single 'example'). Learners
interface with datasets in order to obtain data. These functions cause the
learner to change its internal state and take advantage to some extent
of the data provided. The 'train' function should take charge of
completely exploiting the dataset, as specified per the hyper-parameters,
so that it would typically be called only once. An 'adapt' function
is meant for learners that can operate in an 'online' setting where
data continually arrive and the control loop (when to stop) is to
be managed outside of it. For most intents and purposes, the
'train' function could also handle the 'online' case by providing
the controlled iterations over the dataset (which would then be
seen as a stream of examples).
    * learner.train(dataset)
    * learner.adapt(data)

* Different types of learners can then exploit their internal state
in order to perform various computations after training is completed,
or in the middle of training, e.g.,

   * y=learner.predict(x)
     for learners that see (x,y) pairs during training and predict y given x,
     or for learners that see only x's and learn a transformation of it (i.e. feature extraction).
     Here and below, x and y are tensor-like objects whose first index iterates
     over particular examples in a batch or minibatch of examples.

   * p=learner.probability(examples)
     p=learner.log_probability(examples)
     for learners that can estimate probability density or probability functions,
     note that example could be a pair (x,y) for learners that expect each example
     to represent such a pair. The second form is provided in case the example
     is high-dimensional and computations in the log-domain are numerically preferable.
     The first dimension of examples or of x and y is an index over a minibatch or a dataset.

   * p=learner.free_energy(x)
     for learners that can estimate a log unnormalized probability; the output has the same length as the input.

   * c=learner.costs(examples)
     returns a matrix of costs (one row per example, i.e., again the output has the same length
     as the input), the first column of which represents the cost whose expectation
     we wish to minimize over new samples from the unknown underlying data distribution.


Some learners may be able to handle x's and y's that contain missing values.

* For convenience, some of these operations could be bundled, e.g.

    * [prediction,costs] = learner.predict_and_adapt((x,y))

* Some learners could include in their internal state not only what they
have learned but some information about recently seen examples that conditions
the expected distribution of upcoming examples. In that case, they might
be used, e.g. in an online setting as follows:
     for (x,y) in data_stream:
        [prediction,costs]=learner.predict((x,y))
        accumulate_statistics(prediction,costs)

* In some cases, each example is itself a (possibly variable-size) sequence
or other variable-size object (e.g. an image, or a video)


James's idea for Learner Interface
===================================

Theory:
-------

Think about the unfolding of a learning algorithm as exploring a path in a vast
directed graph.

There are some source nodes, which are potential initial conditions for the
learning algorithm.

At any node, there are a number of outgoing labeled edges that represent
distinct directions of exploration: like "allocate a model with N hidden units",
or "set the l1 weight decay on such-and-such units to 0.1" or "adapt for T
iterations" or "refresh the GPU dataset memory with the next batch of data".

Not all nodes have the same outgoing edge labels.  The dataset, model, and
optimization algorithm implementations may each have their various
hyper-parameters with various restrictions on what values they can take, and
when they can be changed.

Every move in this graph incurs some storage and computational expense, and
explores the graph.

Learners typically engage in goal-directed exploration of this graph - for
example to find the node with the best validation-set performance given a
certain computational budget.  We might often be interested in the best node
found.

The predict(), log_probability(), free_energy() etc correspond to costs that we
can measure at any particular node (at some computational expense) to see how we
are doing in our exploration.

Many semantically distinct components come into the definition of this graph:
the model (e.g. DAA) the dataset (e.g. an online one), the inference and
learning strategy.   I'm not sure what to call this graph than an 'experiment
graph'... so I'll go with that for now.


Use Cases
----------

Early stopping
~~~~~~~~~~~~~~

Early stopping can be implemented as a learner that progresses along a
particular kind of edge (e.g. "train more") until a stopping criterion (in terms
of a cost computed from nodes along the path) is met.


Grid Search
~~~~~~~~~~~

Grid search is a learner policy that can be implemented in an experiment graph
where all paths have the form:

(   "set param 0 to X",
    "set param 1 to Y",
    ... ,
    "set param N to Z",
    adapt,
    [early stop...],
    test)

It would explore all paths of this form and then return the best node.


Stagewise learning of DBNs combined with early stopping and grid search
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This would be a learner that is effective for experiment graphs that reflect the
greedy-stagewise optimization of DBNs.


Boosting
~~~~~~~~

Given an ExperimentGraph that permits re-weighting of examples, it is
straightforward to write a meta-ExperimentGraph around it that implements AdaBoost.
A meta-meta-ExperimentGraph around that that does early-stopping would complete
the picture and make a useful boosting implementation.


Implementation Details / API
----------------------------

TODO: PUT IN TERMINOLOGY OF LEARNER, HYPER-LEARNER.

TODO: SEPARATE DISCUSSION OF PERSISTENT STORAGE FROM LEARNER INTERFACE.

TODO: API describing hyperparameters (categorical, integer, bounds on values, etc.)

TODO: use-case - show how we could use the optimizer from
      http://www.cs.ubc.ca/labs/beta/Projects/ParamILS/

ExperimentGraph
~~~~~~~~~~~~~~~

One API that needs to be defined for this perspective to be practical is the
ExperimentGraph.  I'll present it in terms of global functions, but an
object-oriented things probably makes more sense in the code itself.


    def explored_nodes(graph):
       """Return iterator over explored nodes (ints? objects?)"""

    def forget_nodes(graph, nodes):
       """Clear the nodes from memory (save space)"""

    def all_edges_from(graph, node):
       """Return iterator over all possible edges

       Edges might be parametric - like "set learn_rate to (float)"

       Edges might contain a reference to their 'from' end... not sure.

       """
    def explored_edges_from(graph, node):
        """Return the edges that have been explored
        """

    def add_node(graph, new_node):
        """add a node.  It may be serialized."""

    def add_edge(graph, edge):
        """add edge, it may be serialize"""

    def connect(graph, from_node, to_node, edge):
        """
        to_node = None for un-explored edge
        """

It makes sense to have one ExperimentGraph implementation for each storage
mechanism - Memory, JobMan, sqlite, couchdb, mongodb, etc.

The nodes should be serializable objects (like the 'learner' objects in Yoshua's
text above, so that you can do node.learner.predict() if the edge leading to
`node` trained something new).

The nodes could also contain the various costs (train, valid, test), and other
experiment statistics that are node-specific.


Some implementations might also include functions for asynchronous updating of
the ExperimentGraph:


ExperimentGraphEdge
~~~~~~~~~~~~~~~~~~~

The ExperimentGraph is primarily a dictionary container for nodes and edges.
An ExperimentGraphEdge implementation is the model-dependent component that
actually interprets the edges as computations.

    def estimate_compute_time(graph, node, edge):
       """Return an estimated walltime expense for the computation"""

    def compute_edge(graph, node, edge, async=False, priority=1):
       """Run the computations assocated with this graph edge, and store the
       resulting 'to_node' to the graph when complete.

       If async is True, the function doesn't return until the graph is updated
       with `to_node`.

       The priority is used by implementations that use cluster software or
       something to manage a worker pool that computes highest-priority edges
       first.

       """

    def list_compute_queue(graph):
        """Return edges scheduled for exploration (and maybe a handle for
        where/when they started running and other backend details)
        """

Different implementations of ExperimentGraphExplorer will correspond to
different experiments.  There can also be ExperimentGraphExplorer
implementations that are proxies, and perform the computations in different
threads, or across ssh, or cluster software.


Learner
~~~~~~~

A learner is a program that implements a policy for graph exploration by
exploiting the ExperimentGraph and ExperimentGraphEdge interfaces.

The convenience of the API hinges on the extent to which we can implement
policies that work on different experiment-graphs (where the labels on the edges
and semantics are different).  The use-cases above make me optimistic that it
will work sufficiently well to be worth doing in the absence of better ideas.
author	James Bergstra <bergstrj@iro.umontreal.ca>
date	Tue, 07 Sep 2010 12:29:27 -0400
parents	88b296cfba50
children	38cc6e075d9b