view doc/v2_planning/learner.txt @ 1142:4f03a9a743dc

Merged
author Olivier Delalleau <delallea@iro>
date Thu, 16 Sep 2010 13:28:50 -0400
parents f082a6c0b008
children 7a8dcf87d780
line wrap: on
line source


Comittee: AB, PL, GM, IG, RP, NB, PV
Leader: ?

Discussion of Function Specification for Learner Types
======================================================

In its most abstract form, a learner is an object with the
following semantics:

* A learner has named hyper-parameters that control how it learns (these can be viewed
as options of the constructor, or might be set directly by a user)

* A learner also has an internal state that depends on what it has learned.

* A learner reads and produces data, so the definition of learner is
intimately linked to the definition of dataset (and task).

* A learner has one or more 'train' or 'adapt' functions by which
it is given a sample of data (typically either the whole training set, or
a mini-batch, which contains as a special case a single 'example'). Learners
interface with datasets in order to obtain data. These functions cause the
learner to change its internal state and take advantage to some extent
of the data provided. The 'train' function should take charge of
completely exploiting the dataset, as specified per the hyper-parameters,
so that it would typically be called only once. An 'adapt' function
is meant for learners that can operate in an 'online' setting where
data continually arrive and the control loop (when to stop) is to
be managed outside of it. For most intents and purposes, the
'train' function could also handle the 'online' case by providing
the controlled iterations over the dataset (which would then be
seen as a stream of examples).
    * learner.train(dataset)
    * learner.adapt(data)

* Different types of learners can then exploit their internal state
in order to perform various computations after training is completed,
or in the middle of training, e.g.,

   * y=learner.predict(x)
     for learners that see (x,y) pairs during training and predict y given x,
     or for learners that see only x's and learn a transformation of it (i.e. feature extraction).
     Here and below, x and y are tensor-like objects whose first index iterates
     over particular examples in a batch or minibatch of examples.

   * p=learner.probability(examples)
     p=learner.log_probability(examples)
     for learners that can estimate probability density or probability functions,
     note that example could be a pair (x,y) for learners that expect each example
     to represent such a pair. The second form is provided in case the example
     is high-dimensional and computations in the log-domain are numerically preferable.
     The first dimension of examples or of x and y is an index over a minibatch or a dataset.

   * p=learner.free_energy(x)
     for learners that can estimate a log unnormalized probability; the output has the same length as the input.

   * c=learner.costs(examples)
     returns a matrix of costs (one row per example, i.e., again the output has the same length
     as the input), the first column of which represents the cost whose expectation
     we wish to minimize over new samples from the unknown underlying data distribution.
     

Some learners may be able to handle x's and y's that contain missing values.

* For convenience, some of these operations could be bundled, e.g.

    * [prediction,costs] = learner.predict_and_adapt((x,y))

* Some learners could include in their internal state not only what they
have learned but some information about recently seen examples that conditions
the expected distribution of upcoming examples. In that case, they might
be used, e.g. in an online setting as follows:
     for (x,y) in data_stream:
        [prediction,costs]=learner.predict((x,y))
        accumulate_statistics(prediction,costs)

* In some cases, each example is itself a (possibly variable-size) sequence
or other variable-size object (e.g. an image, or a video)







        


James's idea for Learner Interface
===================================

Theory:
-------

Think about the unfolding of a learning algorithm as exploring a path in a vast
directed graph.

There are some source nodes, which are potential initial conditions for the
learning algorithm.

At any node, there are a number of outgoing labeled edges that represent
distinct directions of exploration: like "allocate a model with N hidden units",
or "set the l1 weight decay on such-and-such units to 0.1" or "adapt for T
iterations" or "refresh the GPU dataset memory with the next batch of data".

Not all nodes have the same outgoing edge labels.  The dataset, model, and
optimization algorithm implementations may each have their various
hyper-parameters with various restrictions on what values they can take, and
when they can be changed.

Every move in this graph incurs some storage and computational expense, and
explores the graph.

Learners typically engage in goal-directed exploration of this graph - for
example to find the node with the best validation-set performance given a
certain computational budget.  We might often be interested in the best node
found.

The predict(), log_probability(), free_energy() etc correspond to costs that we
can measure at any particular node (at some computational expense) to see how we
are doing in our exploration.

Many semantically distinct components come into the definition of this graph:
the model (e.g. DAA) the dataset (e.g. an online one), the inference and
learning strategy.   I'm not sure what to call this graph than an 'experiment
graph'... so I'll go with that for now.





Use Cases
----------

Early stopping
~~~~~~~~~~~~~~

Early stopping can be implemented as a learner that progresses along a
particular kind of edge (e.g. "train more") until a stopping criterion (in terms
of a cost computed from nodes along the path) is met.


Grid Search
~~~~~~~~~~~

Grid search is a learner policy that can be implemented in an experiment graph
where all paths have the form:

(   "set param 0 to X",
    "set param 1 to Y", 
    ... , 
    "set param N to Z", 
    adapt, 
    [early stop...],
    test)

It would explore all paths of this form and then return the best node.


Stagewise learning of DBNs combined with early stopping and grid search
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This would be a learner that is effective for experiment graphs that reflect the
greedy-stagewise optimization of DBNs.


Boosting
~~~~~~~~

Given an ExperimentGraph that permits re-weighting of examples, it is
straightforward to write a meta-ExperimentGraph around it that implements AdaBoost.
A meta-meta-ExperimentGraph around that that does early-stopping would complete
the picture and make a useful boosting implementation.


Using External Hyper-Parameter Optimization Software
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TODO: use-case - show how we could use the optimizer from
http://www.cs.ubc.ca/labs/beta/Projects/ParamILS/


Implementation Details / API
----------------------------

Learner
~~~~~~~
    An object that allows us to explore the graph discussed above.  Specifically, it represents
    an explored node in that graph.

    def active_instructions()
        """ Return a list/set of Instruction instances (see below) that the Learner is prepared
        to handle.
        """

    def copy(), deepcopy()
        """ Learners should be serializable """


    To make the implementation easier, I found it was helpful to introduce a string-valued
    `fsa_state` member attribute and associate methods to these states.  That made it
    syntactically easy to build relatively complex finite-state transition graphs to describe
    which instructions were active at which times in the life-cycle of a learner.


Instruction
~~~~~~~~~~~
    An object that represents a potential edge in the graph discussed above.  It is an
    operation that a learner can perform.

    arg_types
        """a list of Type object (see below) indicating what args are required by execute"""

    def execute(learner, args, kwargs):
        """ Perform some operation on the learner (follow an edge in the graph discussed above)
        and modify the learner in-place.  Calling execute 'moves' the learner from one node in
        the graph along an edge.  To have the old learner as well, it must be copied prior to
        calling execute().
        """

    def expense(learner, args, kwargs, resource_type='CPUtime'):
        """ Return an estimated cost of performing this instruction (calling execute), in time,
        space, number of computers, disk requierement, etc.
        """

Type
~~~~
    An object that describes a parameter domain for a call to Instruction.execute.
    It is not necessary that a Type specifies exactly which arguments are legal, but it should
    `include` all legal arguments, and exclude as many illegal ones as possible.

    def includes(value):
        """return True if value is a legal argument"""


    To make things a bit more practical, there are some Type subclasses like Int, Float, Str,
    ImageDataset, SgdOptimizer, that include additional attributes (e.g. min, max, default) so
    that automatic graph exploration algorithms can generate legal arguments with reasonable
    efficiency.



The proxy pattern is a powerful way to combine learners. Especially when proxy Learner
instances also introduce Proxy Instruction classes.

For example, it is straightforward to implement a hyper-learner by implementing a Learner with
another learner (sub-learner) as a member attribute.  The hyper-learner makes some
modifications to the instruction_set() return value of the sub-learner, typically to introduce
more powerful instructions and hide simpler ones.

It is less straightforward, but consistent with the design to implement a Learner that
encompasses job management.  Such a learner would retain the semantics of the
instruction_set of the sub-learner, but would replace the Instruction objects themselves with
Instructions that arranged for remote procedure calls (e.g. jobman, multiprocessing, bqtools,
etc.)  Such a learner would replace synchronous instructions (return on completion) with
asynchronous ones (return after scheduling) and the active instruction set would also change
asynchronously, but neither of these things is inconsistent with the Learner API.


TODO - Experiment API?
~~~~~~~~~~~~~~~~~~~~~~

I feel like something is missing from the API - and that is an interface to the graph structure
discussed above.  The nodes in this graph are natural places to store meta-information for
visualization, statistics-gathering etc.   But none of the APIs above corresponds to the graph
itself. In other words, there is no API through which to attach information to nodes.  It is
not good to say that the Learner instance *is* the node because (a) learner instances change
during graph exploration and (b) learner instances are big, and we don't want to have to keep a
whole saved model just to attach meta-info e.g. validation score.    Choosing this API spills
over into other committees, so we should get their feedback about how to resolve
it.  Maybe we need an 'Experiment' API to stand for this graph?


TODO: Validation & Monitoring Costs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even if we do have the Experiment API as a structure to hang validation and
monitoring results, what should be the mechanism for extracting those results.
The Learner API is not right because extracting a monitoring cost doesn't change
the model, doesn't change the legal instructions/edges etc.  Maybe we should use
a similar mechanism to Instruction, called something like Measurement?  Any node
/ learner can report the list of instructions (for moving) and the list of
measurements (and the cost of computing them too)


TODO - Parameter Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

YB asks: it seems to me that what we really need from "Type" is not just
testing that a value is legal, but more practically a function that specifies the
prior distribution for the hyper-parameter, i.e., how to sample from it,
and possibly some representation of it that could be used to infer
a posterior (such as an unnormalized log-density or log-probability).
Having the min and max and default limits us to the uniform distribution,
which may not always be appropriate. For example sometimes we'd like 
Gaussian (-infty to infty) or Exponential (0 to infty) or Poisson (non-negative integers).
For that reason, I think that "Type" is not a very good name.
How about "Prior" or "Density" or something like that?

JB replies: I agree that being able to choose (and update) distributions over
these values is important. I don't think the Type structure is the right place
to handle it though.  The challenge is to allow those distributions to change
for a variety of reasons - e.g. the sampling distribution on the capacity
variables is affected by the size of the dataset, it is also affected by
previous experience in general as well as experiments on that particular
dataset.  I'm not sure that the 'Type' structure is right to deal with this.
Also, even with a strategy for handling these distributions, I believe a simple
mechanism for rejecting insane values might be useful.

So how should we handle it?  Hmmm...


Comments
~~~~~~~~

OD asks: (I hope it's ok to leave comments even though I'm not in committee... I'm
interested to see how the learner interface is shaping up so I'll be keeping
an eye on this file)
I'm wondering what's the benefit of such an API compared to simply defining a
new method for each instruction. It seems to me that typically, the 'execute'
method would end up being something like
    if instruction == 'do_x':
        self.do_x(..)
    elif instruction == 'do_y':
        self.do_y(..)
    ...
so why not directly call do_x / do_y instead?


JB replies: I agree with you, and in the implementation of a Learner I suggest
using Python decorators to get the best of both worlds:
    
    class NNet(Learner):

        ...

        @Instruction.new(arg_types=(Float(min=-8, max=-1, default=-4),))
        def set_log_lr(self, log_lr):
            self.lr.value = numpy.exp(log_lr)          
  
        ...

The Learner base class can implement a instruction_set() that walks through the
methods of 'self' and pick out the ones that have corresponding instructions.
But anyone can call the method normally.  The NNet class can also have methods
that are not instructions.

OD replies: Ok thanks. I'm still unsure what is the end goal, but I'll keep
watching while you guys work on it, and hopefully it'll become clearer for me ;)

RP asks: James correct me if I'm wrong, but I think each instruction has a execute
command. The job of the learner is to traverse the graph and for each edge
that it decides to cross to call the execute of that edge. Maybe James has 
something else in mind, but this was my understanding.

JB replies: close, but let me make a bit of a clarification.  The job of a
Learner is simply to implement the API of a Learner - to list what edges are
available and to be able to cross them if asked. The code *using* the Learner
(client) decides which edges to cross.  The client may also be a Learner, but
maybe not.



Just another view/spin on the same idea (Razvan)
================================================


My idea is probably just a spin off from what James wrote. It is an extension
of what I send on the mailing list some time ago.

Big Picture
-----------

What do we care about ?
~~~~~~~~~~~~~~~~~~~~~~~

This is the list of the main points that I have in mind :

 * Re-usability 
 * Extensibility
 * Simplicity or easily readable code ( connected to re-usability )
 * Modularity ( connected to extensibility )
 * Fast to write code ( - sort of comes out of simplicity)
 * Efficient code


Composition
~~~~~~~~~~~

To me this reads as code generated by composing pieces. Imagine this :
you start of with something primitive that I will call a "variable", which
probably is a very unsuitable name. And then you compose those intial
"variables" or transform them through several "functions". Each such 
"function" hides some logic, that you as the user don't care about. 
You can have low-level or micro "functions" and high-level or macro 
"functions", where a high-level function is just a certain compositional
pattern of low-level "functions". There are several classes of "functions"
and "variables" that can be inter-changable. This is how modularity is 
obtained, by chainging between functions from a certain class.

Now when you want to research something, what you do is first select
the step you want to look into. If you are lucky you can re-write this
step as certain decomposition of low-level transformations ( there can be
multiple such decompositions). If not you have to implement such a 
decompositions acording to your needs. Pick the low-level transformations you want 
to change and write new versions that implement your logic. 

I think the code will be easy to read, because it is just applying a fixed
set of transformations, one after the other. The one who writes the code can
decide how explicit he wants to write things by switching between high-level
and low-level functions.

I think the code this way is re-usable, because you can just take this chain
of transformation and replace the one you care about, without looking into 
the rest. 

You get this fractal property of the code. Zooming in, you always get just 
a set of functions applied to a set of variables. In the begining those might 
not be there, and you would have to create new "low level" decompositions,
maybe even new "variables" that get data between those decompositions.

The thing with variables here, is that I don't want this "functions" to have
a state. All the information is passed along through these variables. This 
way understanding the graph is easy, debugging it is also easier ( then having
all these hidden states ..)

Note that while doing so we might ( and I strongly think we should) create 
a (symbolic) DAG of operations. ( this is where it becomes what James was saying).
In such a DAG the "variables" will the nodes and the functions will be edges.
I think having a DAG is useful in many ways (all this are things that one
might think about implementing in a far future, I'm not proposing to implement
them unless we want to use them - like the reconstruction ):
  * there exist the posibility of writing optimizations ( theano style ) 
  * there exist the posibility to add global view utility functions ( like 
  a reconstruction function for SdA - extremely low level here), or global 
  view diagnostic tools
  * the posibility of creating a GUI ( where you just create the Graph by
  picking transforms and variables from a list ) or working interactively
  and then generating code that will reproduce the graph 
  * you can view the graph and different granularity levels to understand
  things ( global diagnostics)

We should have a taxonomy of possible classes of functions and possible
classes of variables, but those should not be exclusive. We can work at a high
level for now, and decompose those high level functions to lower level when 
we need to. We can introduce new classes of functions or intermediate
variables between those low level functions.


Similarities with James' idea
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 As I said before, this is I think just another view on what James proposed.
 The learner in his case is the module that traverses the graph of this
 operations, which makes sense here as well. 
 
 The  'execute' command in his api is just applying a function to some variables in 
 my case. 

 The learner keeps track of the graph that is formed I think in both cases.
 
 His view is a bit more  general. I see the graph as fully created by the user, 
 and the learner just has to go from the start to the end. In his case the
 traversal is conditioned on some policies. I think these ideas can be mixed / 
 united. What I would see in my case to have this functionality is something
 similar to the lazy linker for Theano.
 


JB asks: There is definitely a strong theme of graphs in both suggestions,
furthermore graphs that have heavy-duty nodes and light-weight edges.  But I
don't necessarily think that we're proposing the same thing.  One difference is
that the graph I talked about would be infinite in most cases of interest, so
it's not going to be representable by Theano's data structures (even with lazy
if).  Another difference is that the graph I had in mind doesn't feel fractal -
it would be very common for a graph edge to be atomic.  A proxy pattern, such as
in a hyper-learner would create a notion of being able to zoom in, but other
than that, i'm not sure what you mean.

RP replies: I've been thinking about my idea a bit and yes, it might be 
quite different from what James has in mind, though there are plently of common
elements. I might have exagerated a bit with the zooming in, so in some cases
you will end up with atomic edges, though my hope is that is not most of the
edges.

I think I should go into mode details when answering this question because 
I feel I have not explained things sufficiently clear. Note, in many places
I replaced the word "function" by "transform".

Think of the learner as an object that traverses a DAG of steps created by the 
user. On this DAG the learner can potentially do a lot of cool stuff, but we
won't care about that for now. The DAG can be infinite in principle, and what 
the learner does is just to go on the path described by the user ( and here
described is not through heuristics like in James case, but by giving the list 
of edges it needs to follow). A potential cool thing the learner can do is to 
regard the path given by the user as a suggestion ( or some form of heuristic) 
and try to improve it. This would be much closer to what James has in mind,
and I definetely think is a cool way to go about it.

Now this path in the graph is given by the user by composing subgraphs or
adding nodes to the graph. Or (expressing this in a more simple way) by applying 
functions to variables. Any such function will introduce an edge ( or a subgraph) that 
will connect the vertices corresponding to the input variables to the vertices
corresponding to the output variables. The variables store the state of the
learner. These functions are state-less, I think if you would give them states
you will make this approach really ugly (I might be wrong). 
The variables would contain informations required by the function, like
number of layers, on how many cores to run, cluster configurations, and so on.

Now about the zooming part, that James asked. I might have exagerated a bit,
is not that you can zoom in on any part infinitely. You will end up with
things that are atomic. The idea is that any such "transformation" or edge 
has the potential to be split up in several "transformations". This offers 
(in my view) a way of solving the time constraints of our project. We can 
start by difining a coarse division in segments. For now we can have 
a structure transform that makes a list of parameters into a deep 
network of some type, then a learner transform that adds SGD + pre-training 
on top of network, and then early stopper on top of that, and then a 
run_on_cluster on that.We would probably want something more finely grained 
even from the start .. this is just to prove my point. When any of us 
starts experimenting with a certain sub-step of this process ( like the
structure) we will split that transform into several ( like ones that create 
a layer and so on) that make sense for that case, and then start working on 
the low level transform that we cares ( like the layer) introducing new 
versions of it. I think we can not find a universal split that will cover 
all of our cases, so I think we should allow different such splits. The one
who researches should look at what low-level transforms are available and use
those if they make sense, if not he would have to create a different split. 
Creating a different split might involve a lot of work and taking care of
several issues so it should be done with care.

I'll give an example from where I started thinking this way. Let say we want 
to do the SdA with auxiliary inputs that encourages separation of the features 
in the hidden layer that Yoshua was saying ( I had an attempt
at it some time ago for speech but I never eneded up finishing that project).

You start up with something like : 

learner = Learner()
# This will create the learner that will traverse our graph. We might 
# want it to be a function ``execute``, I just randomly picked this option. 
#I have no preference of this detail for now .. this is mostly work in progress

data  = someSpeechData(path = 'some path')
# This is such a transform that will generate from the string representing the
# path a dataset variable ( that will contain all informations you need to
# access data). This will probably be the object the datasets comittee will
# provide. Note, you might need to provide more information then the path, but
# you can easily see how to do that. All these stuff start from simple
# variables like path, batch size and so on and return a complex heavy duty
# variable (node).


model = earlyStopping(pretrain(SdA(layers = [524, 500, 500,27], noise = [0.1,0.1]),data, epochs = 10), data)
# This is a composition of two transforms. The SdA transform starts from the
# info about layers and corruption /noise for each layer and construct a SdA.
# This is a high level transform, so it will take care of defining all
# details, like pre-training, defining the cost and so on. Note that maybe it will
# require some more parameters .. you can assume that for anything else there
# is a default value that the SdA will use. earlyStopping is yet another
# transform that takes a model ( that we know how to train ) and some data,
# and does early stoppign on it. For bravity I did not provide all the
# information required like patience and so on. The SdA only knows how to do a
# step of training. Same holds for pretrain. It will loop over the layers of
# SdA and will train each one. 

steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
# This will lunch the wanted jobs. getPropertiesAndRanges will get from a
# model all knobs that need to be turn, and their ranges and will uniformly
# sample from them in each jobs. getCluterInfo will return a variable
# containing informations about the cluster ( I added this for simplicity, it
# should probably be replaced with something like username, password,
# clusterpath or whatever).

learner.execute(steps)
# As an option, each of this output variables could contain the entire graph
# until that point. We could also have this in a different way .. this is
# adhoc at the moment


Now this is a coarse vanila SdA which is not what we wanted. We do not have a
way of incorporating our auxiliary information in this. So what we have to do
is split/change the SdA transform. We would re-write it as :


arch = SdA(layers = [524, 500, 500, 27], noise = [0.1,0.1])
model = earlyStopping(pretrain(arch,data,epochs = 10)
...

And then re-write things like : 

arch = SGD( cross_entropy( logreg( DAAlayer( [DAAlayer([524,500],0.1),500],0.1))))


We would re-write the DAAlayer as : 

layer0 = DAAlayer([524,500],0.1)
layer1 = cross_entropy(reconstruct( tanh(dotW_b( layer0,500)),noise = 0.1))

At this point of detail, we can start inserting our new stuff in as follows : 


input = empty_layer(600)
# empty layer is a wrapper ; if I would to write dotW_b(200,500) which means
# go from a layer of 200 units to a one of 500 by multiplying with a matrix
# and adding a bias, what I would mean is dotW_b( empty_layer(200), 500). 
# an implementation of empty_layer could be just theano.tensor.vector()
# where we add the size tag ( we will need it later)


hidden0_mfcc = dotW_b(input[0:524],100)
hidden0_noise = dotW_b(input[0:560],50)
hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]),50)
hidden0 = tanh(join( layer0_mfcc, layer0_noise, layer0_speakerID))
layer0 = cross_entropy( reconstruct( hidden0, noise = 0.1))

and so on. Hopefully you got what I mean by spliting a transform, or zooming
in. When doing all this we did not change anything about the early stopping or
lunching jobs on the cluster. In the same manner, if one would like to look
into how jobs are send to the cluster, it could just expand that part. Note
that if we wanted to do something else we might have split the DAA
differently. 

The key of this approach is to identify such low level units that can be
shared by  90% of our architectures, and the splits that make most sense
from a functional point of view that will cover the main points where people
will like to change things. This will ensure that almost all the time we have
the wanted low-level bits that we want to write our code into, and most of the
time we will only work on one of that bit. There will definetely be cases when 
whatever we have will not be sufficient or convinient. In that case some
effort has to be invested by the user to create a different decomposition of
the problem in the elements he need. 

I've been thinking about this a bit, and it definetely works in for deep
networks and theano ( the approach was inspired by theano). From what James
said, I think that other stuff might be possible to incorporate, at least as
atomic transforms if not in any other way.

TODO: one has to give some thought of this low-level transform, to find a
suitable set of them ( and variables) so that would end up most of the time 
re-using things and not creating new things.

NOTES: there are some other implementation details missing of what this state
variables should contain. I did not want to clutter this with what tricks
could be used to get this transparent interface. I have a few of them in mind
though.. 
there is a lot of hardcoded values in this example. Usually each transform
that takes an input should "know" which of these inputs are tunable and mark
them as such. The order of the input in this example is important as well. 
This can be easily solved at the expense of a few more lines of code that 
I did not want to write.