Mercurial > pylearn
view doc/v2_planning/learner.txt @ 1052:84f62533e7a8
v2planning learner - reply to comments
author | James Bergstra <bergstrj@iro.umontreal.ca> |
---|---|
date | Wed, 08 Sep 2010 15:43:32 -0400 |
parents | f1732269bce8 |
children | 390166ace9e5 |
line wrap: on
line source
Comittee: AB, PL, GM, IG, RP, NB, PV Leader: ? Discussion of Function Specification for Learner Types ====================================================== In its most abstract form, a learner is an object with the following semantics: * A learner has named hyper-parameters that control how it learns (these can be viewed as options of the constructor, or might be set directly by a user) * A learner also has an internal state that depends on what it has learned. * A learner reads and produces data, so the definition of learner is intimately linked to the definition of dataset (and task). * A learner has one or more 'train' or 'adapt' functions by which it is given a sample of data (typically either the whole training set, or a mini-batch, which contains as a special case a single 'example'). Learners interface with datasets in order to obtain data. These functions cause the learner to change its internal state and take advantage to some extent of the data provided. The 'train' function should take charge of completely exploiting the dataset, as specified per the hyper-parameters, so that it would typically be called only once. An 'adapt' function is meant for learners that can operate in an 'online' setting where data continually arrive and the control loop (when to stop) is to be managed outside of it. For most intents and purposes, the 'train' function could also handle the 'online' case by providing the controlled iterations over the dataset (which would then be seen as a stream of examples). * learner.train(dataset) * learner.adapt(data) * Different types of learners can then exploit their internal state in order to perform various computations after training is completed, or in the middle of training, e.g., * y=learner.predict(x) for learners that see (x,y) pairs during training and predict y given x, or for learners that see only x's and learn a transformation of it (i.e. feature extraction). Here and below, x and y are tensor-like objects whose first index iterates over particular examples in a batch or minibatch of examples. * p=learner.probability(examples) p=learner.log_probability(examples) for learners that can estimate probability density or probability functions, note that example could be a pair (x,y) for learners that expect each example to represent such a pair. The second form is provided in case the example is high-dimensional and computations in the log-domain are numerically preferable. The first dimension of examples or of x and y is an index over a minibatch or a dataset. * p=learner.free_energy(x) for learners that can estimate a log unnormalized probability; the output has the same length as the input. * c=learner.costs(examples) returns a matrix of costs (one row per example, i.e., again the output has the same length as the input), the first column of which represents the cost whose expectation we wish to minimize over new samples from the unknown underlying data distribution. Some learners may be able to handle x's and y's that contain missing values. * For convenience, some of these operations could be bundled, e.g. * [prediction,costs] = learner.predict_and_adapt((x,y)) * Some learners could include in their internal state not only what they have learned but some information about recently seen examples that conditions the expected distribution of upcoming examples. In that case, they might be used, e.g. in an online setting as follows: for (x,y) in data_stream: [prediction,costs]=learner.predict((x,y)) accumulate_statistics(prediction,costs) * In some cases, each example is itself a (possibly variable-size) sequence or other variable-size object (e.g. an image, or a video) James's idea for Learner Interface =================================== Theory: ------- Think about the unfolding of a learning algorithm as exploring a path in a vast directed graph. There are some source nodes, which are potential initial conditions for the learning algorithm. At any node, there are a number of outgoing labeled edges that represent distinct directions of exploration: like "allocate a model with N hidden units", or "set the l1 weight decay on such-and-such units to 0.1" or "adapt for T iterations" or "refresh the GPU dataset memory with the next batch of data". Not all nodes have the same outgoing edge labels. The dataset, model, and optimization algorithm implementations may each have their various hyper-parameters with various restrictions on what values they can take, and when they can be changed. Every move in this graph incurs some storage and computational expense, and explores the graph. Learners typically engage in goal-directed exploration of this graph - for example to find the node with the best validation-set performance given a certain computational budget. We might often be interested in the best node found. The predict(), log_probability(), free_energy() etc correspond to costs that we can measure at any particular node (at some computational expense) to see how we are doing in our exploration. Many semantically distinct components come into the definition of this graph: the model (e.g. DAA) the dataset (e.g. an online one), the inference and learning strategy. I'm not sure what to call this graph than an 'experiment graph'... so I'll go with that for now. Use Cases ---------- Early stopping ~~~~~~~~~~~~~~ Early stopping can be implemented as a learner that progresses along a particular kind of edge (e.g. "train more") until a stopping criterion (in terms of a cost computed from nodes along the path) is met. Grid Search ~~~~~~~~~~~ Grid search is a learner policy that can be implemented in an experiment graph where all paths have the form: ( "set param 0 to X", "set param 1 to Y", ... , "set param N to Z", adapt, [early stop...], test) It would explore all paths of this form and then return the best node. Stagewise learning of DBNs combined with early stopping and grid search ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This would be a learner that is effective for experiment graphs that reflect the greedy-stagewise optimization of DBNs. Boosting ~~~~~~~~ Given an ExperimentGraph that permits re-weighting of examples, it is straightforward to write a meta-ExperimentGraph around it that implements AdaBoost. A meta-meta-ExperimentGraph around that that does early-stopping would complete the picture and make a useful boosting implementation. Using External Hyper-Parameter Optimization Software ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TODO: use-case - show how we could use the optimizer from http://www.cs.ubc.ca/labs/beta/Projects/ParamILS/ Implementation Details / API ---------------------------- Learner ~~~~~~~ An object that allows us to explore the graph discussed above. Specifically, it represents an explored node in that graph. def active_instructions() """ Return a list/set of Instruction instances (see below) that the Learner is prepared to handle. """ def copy(), deepcopy() """ Learners should be serializable """ To make the implementation easier, I found it was helpful to introduce a string-valued `fsa_state` member attribute and associate methods to these states. That made it syntactically easy to build relatively complex finite-state transition graphs to describe which instructions were active at which times in the life-cycle of a learner. Instruction ~~~~~~~~~~~ An object that represents a potential edge in the graph discussed above. It is an operation that a learner can perform. arg_types """a list of Type object (see below) indicating what args are required by execute""" def execute(learner, args, kwargs): """ Perform some operation on the learner (follow an edge in the graph discussed above) and modify the learner in-place. Calling execute 'moves' the learner from one node in the graph along an edge. To have the old learner as well, it must be copied prior to calling execute(). """ def expense(learner, args, kwargs, resource_type='CPUtime'): """ Return an estimated cost of performing this instruction (calling execute), in time, space, number of computers, disk requierement, etc. """ Type ~~~~ An object that describes a parameter domain for a call to Instruction.execute. It is not necessary that a Type specifies exactly which arguments are legal, but it should `include` all legal arguments, and exclude as many illegal ones as possible. def includes(value): """return True if value is a legal argument""" To make things a bit more practical, there are some Type subclasses like Int, Float, Str, ImageDataset, SgdOptimizer, that include additional attributes (e.g. min, max, default) so that automatic graph exploration algorithms can generate legal arguments with reasonable efficiency. The proxy pattern is a powerful way to combine learners. Especially when proxy Learner instances also introduce Proxy Instruction classes. For example, it is straightforward to implement a hyper-learner by implementing a Learner with another learner (sub-learner) as a member attribute. The hyper-learner makes some modifications to the instruction_set() return value of the sub-learner, typically to introduce more powerful instructions and hide simpler ones. It is less straightforward, but consistent with the design to implement a Learner that encompasses job management. Such a learner would retain the semantics of the instruction_set of the sub-learner, but would replace the Instruction objects themselves with Instructions that arranged for remote procedure calls (e.g. jobman, multiprocessing, bqtools, etc.) Such a learner would replace synchronous instructions (return on completion) with asynchronous ones (return after scheduling) and the active instruction set would also change asynchronously, but neither of these things is inconsistent with the Learner API. TODO ~~~~ I feel like something is missing from the API - and that is an interface to the graph structure discussed above. The nodes in this graph are natural places to store meta-information for visualization, statistics-gathering etc. But none of the APIs above corresponds to the graph itself. In other words, there is no API through which to attach information to nodes. It is not good to say that the Learner instance *is* the node because (a) learner instances change during graph exploration and (b) learner instances are big, and we don't want to have to keep a whole saved model just to attach meta-info e.g. validation score. Choosing this API spills over into other committees, so we should get their feedback about how to resolve it. Comments ~~~~~~~~ OD asks: (I hope it's ok to leave comments even though I'm not in committee... I'm interested to see how the learner interface is shaping up so I'll be keeping an eye on this file) I'm wondering what's the benefit of such an API compared to simply defining a new method for each instruction. It seems to me that typically, the 'execute' method would end up being something like if instruction == 'do_x': self.do_x(..) elif instruction == 'do_y': self.do_y(..) ... so why not directly call do_x / do_y instead? JB replies: I agree with you, and in the implementation of a Learner I suggest using Python decorators to get the best of both worlds: class NNet(Learner): ... @Instruction.new(arg_types=(Float(min=-8, max=-1, default=-4),)) def set_log_lr(self, log_lr): self.lr.value = numpy.exp(log_lr) ... The Learner base class can implement a instruction_set() that walks through the methods of 'self' and pick out the ones that have corresponding instructions. But anyone can call the method normally. The NNet class can also have methods that are not instructions. RP asks: James correct me if I'm wrong, but I think each instruction has a execute command. The job of the learner is to traverse the graph and for each edge that it decides to cross to call the execute of that edge. Maybe James has something else in mind, but this was my understanding. JB replies: close, but let me make a bit of a clarification. The job of a Learner is simply to implement the API of a Learner - to list what edges are available and to be able to cross them if asked. The code *using* the Learner (client) decides which edges to cross. The client may also be a Learner, but maybe not. Just another view/spin on the same idea (Razvan) ================================================ My idea is probably just a spin off from what James wrote. It is an extension of what I send on the mailing list some time ago. Big Picture ----------- What do we care about ? ~~~~~~~~~~~~~~~~~~~~~~~ This is the list of the main points that I have in mind : * Re-usability * Extensibility * Simplicity or easily readable code ( connected to re-usability ) * Modularity ( connected to extensibility ) * Fast to write code ( - sort of comes out of simplicity) * Efficient code Composition ~~~~~~~~~~~ To me this reads as code generated by composing pieces. Imagine this : you start of with something primitive that I will call a "variable", which probably is a very unsuitable name. And then you compose those intial "variables" or transform them through several "functions". Each such "function" hides some logic, that you as the user don't care about. You can have low-level or micro "functions" and high-level or macro "functions", where a high-level function is just a certain compositional pattern of low-level "functions". There are several classes of "functions" and "variables" that can be inter-changable. This is how modularity is obtained, by chainging between functions from a certain class. Now when you want to research something, what you do is first select the step you want to look into. If you are lucky you can re-write this step as certain decomposition of low-level transformations ( there can be multiple such decompositions). If not you have to implement such a decompositions acording to your needs. Pick the low-level transformations you want to change and write new versions that implement your logic. I think the code will be easy to read, because it is just applying a fixed set of transformations, one after the other. The one who writes the code can decide how explicit he wants to write things by switching between high-level and low-level functions. I think the code this way is re-usable, because you can just take this chain of transformation and replace the one you care about, without looking into the rest. You get this fractal property of the code. Zooming in, you always get just a set of functions applied to a set of variables. In the begining those might not be there, and you would have to create new "low level" decompositions, maybe even new "variables" that get data between those decompositions. The thing with variables here, is that I don't want this "functions" to have a state. All the information is passed along through these variables. This way understanding the graph is easy, debugging it is also easier ( then having all these hidden states ..) Note that while doing so we might ( and I strongly think we should) create a (symbolic) DAG of operations. ( this is where it becomes what James was saying). In such a DAG the "variables" will the nodes and the functions will be edges. I think having a DAG is useful in many ways (all this are things that one might think about implementing in a far future, I'm not proposing to implement them unless we want to use them - like the reconstruction ): * there exist the posibility of writing optimizations ( theano style ) * there exist the posibility to add global view utility functions ( like a reconstruction function for SdA - extremely low level here), or global view diagnostic tools * the posibility of creating a GUI ( where you just create the Graph by picking transforms and variables from a list ) or working interactively and then generating code that will reproduce the graph * you can view the graph and different granularity levels to understand things ( global diagnostics) We should have a taxonomy of possible classes of functions and possible classes of variables, but those should not be exclusive. We can work at a high level for now, and decompose those high level functions to lower level when we need to. We can introduce new classes of functions or intermediate variables between those low level functions. Similarities with James' idea ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ As I said before, this is I think just another view on what James proposed. The learner in his case is the module that traverses the graph of this operations, which makes sense here as well. The 'execute' command in his api is just applying a function to some variables in my case. The learner keeps track of the graph that is formed I think in both cases. His view is a bit more general. I see the graph as fully created by the user, and the learner just has to go from the start to the end. In his case the traversal is conditioned on some policies. I think these ideas can be mixed / united. What I would see in my case to have this functionality is something similar to the lazy linker for Theano. JB asks: There is definitely a strong theme of graphs in both suggestions, furthermore graphs that have heavy-duty nodes and light-weight edges. But I don't necessarily think that we're proposing the same thing. One difference is that the graph I talked about would be infinite in most cases of interest, so it's not going to be representable by Theano's data structures (even with lazy if). Another difference is that the graph I had in mind doesn't feel fractal - it would be very common for a graph edge to be atomic. A proxy pattern, such as in a hyper-learner would create a notion of being able to zoom in, but other than that, i'm not sure what you mean.