changeset 1061:5d96bfef8d6e

Merged
author Olivier Delalleau <delallea@iro>
date Thu, 09 Sep 2010 13:01:30 -0400
parents b4ccf6b43f27 (current diff) f082a6c0b008 (diff)
children 64720cdca3d3
files
diffstat 2 files changed, 255 insertions(+), 5 deletions(-) [+]
line wrap: on
line diff
--- a/doc/v2_planning/learner.txt	Thu Sep 09 12:01:49 2010 -0400
+++ b/doc/v2_planning/learner.txt	Thu Sep 09 13:01:30 2010 -0400
@@ -256,8 +256,8 @@
 asynchronously, but neither of these things is inconsistent with the Learner API.
 
 
-TODO
-~~~~
+TODO - Experiment API?
+~~~~~~~~~~~~~~~~~~~~~~
 
 I feel like something is missing from the API - and that is an interface to the graph structure
 discussed above.  The nodes in this graph are natural places to store meta-information for
@@ -266,10 +266,24 @@
 not good to say that the Learner instance *is* the node because (a) learner instances change
 during graph exploration and (b) learner instances are big, and we don't want to have to keep a
 whole saved model just to attach meta-info e.g. validation score.    Choosing this API spills
-over into other committees, so we should get their feedback about how to resolve it.
+over into other committees, so we should get their feedback about how to resolve
+it.  Maybe we need an 'Experiment' API to stand for this graph?
+
+
+TODO: Validation & Monitoring Costs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Comments
-~~~~~~~~
+Even if we do have the Experiment API as a structure on which to hang
+validation and monitoring results, what should be the mechanism for extracting
+those results?  The Learner API is not the right place, because extracting a
+monitoring cost doesn't change the model and doesn't change the legal
+instructions/edges etc.  Maybe we should use a mechanism similar to
+Instruction, called something like Measurement?  Any node / learner could then
+report the list of instructions (for moving) and the list of measurements
+(along with the cost of computing each).
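A rough sketch of what such a Measurement mechanism might look like next to Instruction; every name below (Measurement, Node, the caching behaviour) is an assumption for illustration, not a settled API:

```python
# Hypothetical sketch only: a node in the experiment graph reports both the
# instructions for moving along edges and the measurements it can compute.

class Measurement:
    """A named, side-effect-free quantity a node can compute on demand."""
    def __init__(self, name, fn, cost_estimate):
        self.name = name                    # e.g. 'validation_error'
        self.fn = fn                        # callable node -> float; must not mutate the node
        self.cost_estimate = cost_estimate  # rough compute cost (e.g. seconds)

class Node:
    """A node in the hypothetical Experiment graph."""
    def __init__(self, instructions=(), measurements=()):
        self.instructions = list(instructions)  # edges: how to move to other nodes
        self.measurements = list(measurements)  # what can be measured here
        self.results = {}                       # cached measurement results

    def measure(self, name):
        m = next(m for m in self.measurements if m.name == name)
        if name not in self.results:
            # measuring changes neither the model nor the legal edges,
            # it only fills in meta-information attached to this node
            self.results[name] = m.fn(self)
        return self.results[name]

node = Node(measurements=[Measurement('validation_error', lambda n: 0.25,
                                      cost_estimate=1.0)])
err = node.measure('validation_error')
```

The point of the sketch is only the separation: instructions mutate position in the graph, measurements attach read-only meta-information to a node.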
+
+
+TODO - Parameter Distributions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 YB asks: it seems to me that what we really need from "Type" is not just
 testing that a value is legal, but more practically a function that specifies the
@@ -282,6 +296,22 @@
 For that reason, I think that "Type" is not a very good name.
 How about "Prior" or "Density" or something like that?
 
+JB replies: I agree that being able to choose (and update) distributions over
+these values is important.  I don't think the Type structure is the right
+place to handle it, though.  The challenge is to allow those distributions to
+change for a variety of reasons - e.g. the sampling distribution on the
+capacity variables is affected by the size of the dataset, and it is also
+affected by previous experience in general, as well as by experiments on that
+particular dataset.  Also, even with a strategy for handling these
+distributions, I believe a simple mechanism for rejecting insane values might
+be useful.
+
+So how should we handle it?  Hmmm...
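One tentative direction: keep the distribution object separate from Type, allow it to be re-fit from experience, and give it hard sanity bounds for rejecting insane values. Every name below (ParamPrior, refit, is_sane) is hypothetical, just to make the two concerns concrete:

```python
import random

# Hypothetical sketch: a sampling distribution over one hyper-parameter that
# can be updated from past experiments, plus a cheap sanity filter that is
# independent of the current prior.

class ParamPrior:
    def __init__(self, low, high, hard_min=None, hard_max=None):
        self.low, self.high = low, high   # current sampling range (the "prior")
        self.hard_min = hard_min          # sanity bounds: never legal outside these
        self.hard_max = hard_max

    def is_sane(self, value):
        """Cheap rejection of insane values, regardless of the prior."""
        if self.hard_min is not None and value < self.hard_min:
            return False
        if self.hard_max is not None and value > self.hard_max:
            return False
        return True

    def sample(self, rng=random):
        """Rejection-sample until we draw a sane value."""
        while True:
            v = rng.uniform(self.low, self.high)
            if self.is_sane(v):
                return v

    def refit(self, good_values):
        # crude update from experience: shrink the range toward values that
        # worked well on previous experiments
        self.low, self.high = min(good_values), max(good_values)
```

The design point is that `refit` can react to dataset size or past experiments without touching the sanity bounds, which stay fixed.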
+
+
+Comments
+~~~~~~~~
+
 OD asks: (I hope it's ok to leave comments even though I'm not in committee... I'm
 interested to see how the learner interface is shaping up so I'll be keeping
 an eye on this file)
@@ -446,3 +476,180 @@
 in a hyper-learner would create a notion of being able to zoom in, but other
 than that, i'm not sure what you mean.
 
+RP replies: I've been thinking about my idea a bit and yes, it might be
+quite different from what James has in mind, though there are plenty of common
+elements. I might have exaggerated a bit with the zooming in, so in some cases
+you will end up with atomic edges, though my hope is that this is not true of
+most of the edges.
+
+I think I should go into more detail when answering this question, because
+I feel I have not explained things sufficiently clearly. Note that in many
+places I have replaced the word "function" by "transform".
+
+Think of the learner as an object that traverses a DAG of steps created by the
+user. On this DAG the learner can potentially do a lot of cool stuff, but we
+won't care about that for now. The DAG can be infinite in principle, and what
+the learner does is just follow the path described by the user (and here
+"described" is not through heuristics as in James's case, but by giving the
+list of edges it needs to follow). A potentially cool thing the learner can do
+is to regard the path given by the user as a suggestion (or some form of
+heuristic) and try to improve it. This would be much closer to what James has
+in mind, and I definitely think it is a cool way to go about it.
+
+Now this path in the graph is given by the user by composing subgraphs or
+adding nodes to the graph -- or, expressed more simply, by applying functions
+to variables. Any such function will introduce an edge (or a subgraph) that
+will connect the vertices corresponding to the input variables to the vertices
+corresponding to the output variables. The variables store the state of the
+learner. These functions are stateless; I think if you gave them state you
+would make this approach really ugly (I might be wrong).
+The variables would contain the information required by the function, like the
+number of layers, how many cores to run on, cluster configurations, and so on.
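The stateless-transform idea could be sketched as follows; Variable, the `transform` decorator and the edge-recording scheme are all illustrative assumptions, not proposed names:

```python
# Illustrative sketch: a Variable holds state (a node in the DAG), while a
# transform is a pure function whose application records an edge from the
# input variables to a new output variable.

class Variable:
    def __init__(self, value, produced_by=None, inputs=()):
        self.value = value
        self.produced_by = produced_by  # name of the transform (edge) that made this node
        self.inputs = list(inputs)      # predecessor variables in the DAG

def transform(fn):
    """Wrap a pure function so that applying it builds graph structure."""
    def apply(*in_vars):
        # fn keeps no state of its own; all state lives in the variables
        out_value = fn(*(v.value for v in in_vars))
        return Variable(out_value, produced_by=fn.__name__, inputs=in_vars)
    return apply

@transform
def double(x):
    return 2 * x

n_layers = Variable(3)
doubled = double(n_layers)   # introduces an edge n_layers -> doubled
```

Because the function itself is stateless, the same transform can appear on many edges of the DAG without any hidden coupling between them.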
+
+Now about the zooming part that James asked about. I might have exaggerated a
+bit; it is not that you can zoom in on any part infinitely. You will end up
+with things that are atomic. The idea is that any such "transformation" or
+edge has the potential to be split up into several "transformations". This
+offers (in my view) a way of solving the time constraints of our project. We
+can start by defining a coarse division into segments. For now we can have
+a structure transform that turns a list of parameters into a deep
+network of some type, then a learner transform that adds SGD + pre-training
+on top of the network, then an early stopper on top of that, and then a
+run_on_cluster on top of that. We would probably want something more finely
+grained even from the start .. this is just to prove my point. When any of us
+starts experimenting with a certain sub-step of this process (like the
+structure), we will split that transform into several (like ones that create
+a layer and so on) that make sense for that case, and then start working on
+the low-level transform that we care about (like the layer), introducing new
+versions of it. I think we cannot find a universal split that will cover
+all of our cases, so I think we should allow different such splits. The
+researcher should look at what low-level transforms are available and use
+those if they make sense; if not, he would have to create a different split.
+Creating a different split might involve a lot of work and taking care of
+several issues, so it should be done with care.
+
+I'll give an example from where I started thinking this way. Let's say we want
+to do the SdA with auxiliary inputs that encourage separation of the features
+in the hidden layer, as Yoshua was saying (I had an attempt
+at it some time ago for speech but I never ended up finishing that project).
+
+You start up with something like:
+
+learner = Learner()
+# This will create the learner that will traverse our graph. We might
+# want it to be a function ``execute``; I just randomly picked this option.
+# I have no preference on this detail for now .. this is mostly work in
+# progress.
+
+data  = someSpeechData(path = 'some path')
+# This is a transform that will generate, from the string representing the
+# path, a dataset variable (that will contain all the information you need to
+# access the data). This will probably be the object the datasets committee
+# will provide. Note, you might need to provide more information than the
+# path, but you can easily see how to do that. All this stuff starts from
+# simple variables like the path, batch size and so on, and returns a complex
+# heavy-duty variable (node).
+
+
+model = earlyStopping(pretrain(SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1]), data, epochs = 10), data)
+# This is a composition of two transforms. The SdA transform starts from the
+# info about layers and corruption/noise for each layer and constructs an SdA.
+# This is a high-level transform, so it will take care of defining all
+# details, like pre-training, defining the cost and so on. Note that maybe it
+# will require some more parameters .. you can assume that for anything else
+# there is a default value that the SdA will use. earlyStopping is yet another
+# transform that takes a model (that we know how to train) and some data,
+# and does early stopping on it. For brevity I did not provide all the
+# information required, like patience and so on. The SdA only knows how to do
+# a step of training. The same holds for pretrain. It will loop over the
+# layers of the SdA and train each one.
+
+steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
+# This will launch the wanted jobs. getPropertiesAndRanges will get from a
+# model all the knobs that need to be turned, and their ranges, and will
+# uniformly sample from them in each job. getClusterInfo will return a
+# variable containing information about the cluster (I added this for
+# simplicity; it should probably be replaced with something like username,
+# password, clusterpath or whatever).
+
+learner.execute(steps)
+# As an option, each of these output variables could contain the entire graph
+# up to that point. We could also have this in a different way .. this is
+# ad hoc at the moment.
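As a toy stand-in for what the getPropertiesAndRanges + cluster step above might do, here is the per-job uniform sampling of knobs; the function names and the dict-of-ranges format are assumptions made only for this sketch:

```python
import random

# Hypothetical sketch: gather the tunable knobs of a model with their ranges,
# then draw one uniform sample of every knob per job.

def get_properties_and_ranges(model_knobs):
    # model_knobs: {name: (low, high)}. In the real system the model itself
    # would report which of its inputs are tunable and over what range.
    return model_knobs

def sample_jobs(ranges, n_jobs, rng=random):
    """One configuration dict per job, each knob drawn uniformly from its range."""
    jobs = []
    for _ in range(n_jobs):
        jobs.append({name: rng.uniform(lo, hi)
                     for name, (lo, hi) in ranges.items()})
    return jobs

ranges = get_properties_and_ranges({'lr': (1e-4, 1e-1), 'noise': (0.0, 0.5)})
jobs = sample_jobs(ranges, n_jobs=20)
```

Each entry of `jobs` would then be handed to one cluster job, together with the cluster-access variable.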
+
+
+Now this is a coarse vanilla SdA, which is not what we wanted. We do not have
+a way of incorporating our auxiliary information into this. So what we have to
+do is split/change the SdA transform. We would re-write it as:
+
+
+arch = SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1])
+model = earlyStopping(pretrain(arch, data, epochs = 10), data)
+...
+
+And then re-write things like:
+
+arch = SGD(cross_entropy(logreg(DAAlayer([DAAlayer([524, 500], 0.1), 500], 0.1))))
+
+
+We would re-write the DAAlayer as:
+
+layer0 = DAAlayer([524, 500], 0.1)
+layer1 = cross_entropy(reconstruct(tanh(dotW_b(layer0, 500)), noise = 0.1))
+
+At this level of detail, we can start inserting our new stuff as follows:
+
+
+input = empty_layer(600)
+# empty_layer is a wrapper; if I were to write dotW_b(200, 500), which means
+# go from a layer of 200 units to one of 500 by multiplying with a matrix
+# and adding a bias, what I would mean is dotW_b(empty_layer(200), 500).
+# An implementation of empty_layer could be just theano.tensor.vector(),
+# where we add the size tag (we will need it later).
+
+
+hidden0_mfcc = dotW_b(input[0:524], 100)
+hidden0_noise = dotW_b(input[0:560], 50)
+hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]), 50)
+hidden0 = tanh(join(hidden0_mfcc, hidden0_noise, hidden0_speakerID))
+layer0 = cross_entropy(reconstruct(hidden0, noise = 0.1))
+
+and so on. Hopefully you got what I mean by splitting a transform, or zooming
+in. When doing all this we did not change anything about the early stopping or
+launching jobs on the cluster. In the same manner, if one would like to look
+into how jobs are sent to the cluster, one could just expand that part. Note
+that if we wanted to do something else we might have split the DAA
+differently.
+
+The key of this approach is to identify such low-level units that can be
+shared by 90% of our architectures, and the splits that make the most sense
+from a functional point of view, covering the main points where people
+will want to change things. This will ensure that almost all the time we have
+the low-level bits that we want to write our code in, and most of the
+time we will only work on one of those bits. There will definitely be cases
+when whatever we have will not be sufficient or convenient. In that case some
+effort has to be invested by the user to create a different decomposition of
+the problem into the elements he needs.
+
+I've been thinking about this a bit, and it definitely works for deep
+networks and Theano (the approach was inspired by Theano). From what James
+said, I think that other stuff might be possible to incorporate, at least as
+atomic transforms if not in any other way.
+
+TODO: one has to give some thought to these low-level transforms, to find a
+suitable set of them (and variables) so that one would end up most of the
+time re-using things and not creating new things.
+
+NOTES: there are some other implementation details missing about what these
+state variables should contain. I did not want to clutter this with the
+tricks that could be used to get this transparent interface. I have a few of
+them in mind though..
+There are a lot of hardcoded values in this example. Usually each transform
+that takes an input should "know" which of these inputs are tunable and mark
+them as such. The order of the inputs in this example is important as well.
+This can be easily solved at the expense of a few more lines of code that
+I did not want to write.
+
+
+
+
+
--- a/doc/v2_planning/optimization.txt	Thu Sep 09 12:01:49 2010 -0400
+++ b/doc/v2_planning/optimization.txt	Thu Sep 09 13:01:30 2010 -0400
@@ -39,3 +39,46 @@
 
 
 
+
+Proposal for API
+================
+
+Stick to the same style of API that we've used for SGD so far.  I think it has
+worked well.  It takes theano expressions as inputs and returns theano
+expressions as results.  The caller is responsible for building those
+expressions into a callable function that does the minimization (and other
+things too maybe).
+
+
+def stochastic_gradientbased_optimization_updates(parameters, cost=None, grads=None, **kwargs):
+   """
+   :param parameters: list or tuple of Theano variables (typically shared
+       vars) that we want the algorithm to optimize iteratively.
+
+   :param cost: scalar-valued Theano variable that computes a noisy estimate
+       of the cost (what are the conditions on the noise?).  The cost is
+       ignored if grads are given.
+
+   :param grads: list or tuple of Theano variables representing the gradients
+       on the corresponding parameters.  These default to tensor.grad(cost,
+       parameters).
+
+   :param kwargs: algorithm-dependent arguments
+
+   :returns: a list of pairs (v, new_v) that indicate the value (new_v) each
+      variable (v) should take in order to carry out the optimization
+      procedure.
+
+      The first section of the return value list corresponds to the terms in
+      `parameters`, and the optimization algorithm can return additional
+      update expressions afterward.  This list of pairs can be passed directly
+      to the dict() constructor to create a dictionary such that
+      dct[v] == new_v.
+   """
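A plain-Python stand-in can make the (v, new_v) contract concrete. There is no Theano here, so `grads` must be passed explicitly (in the real API they would default to tensor.grad(cost, parameters)); the name `sgd_updates` and the `lr` kwarg are illustrative only:

```python
# Sketch of the proposed contract with plain numbers standing in for Theano
# variables: return (v, new_v) pairs, parameters first, extras afterward.

def sgd_updates(parameters, cost=None, grads=None, lr=0.01):
    if grads is None:
        # the real version would compute tensor.grad(cost, parameters);
        # without Theano we cannot derive gradients from a cost expression
        raise NotImplementedError("grads must be given in this sketch")
    updates = [(p, p - lr * g) for p, g in zip(parameters, grads)]
    # an algorithm with auxiliary state (e.g. momentum buffers) would append
    # more (v, new_v) pairs here, after the parameter updates
    return updates

params = [1.0, 2.0]
grads = [0.5, -1.0]
updates = sgd_updates(params, grads=grads, lr=0.1)
new_values = dict(updates)   # dct[v] == new_v, as the docstring promises
```

Note how the return value feeds directly into dict(), which is exactly the shape Theano's `updates` argument to `theano.function` expects.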
+
+
+Why not a class interface with an __init__ that takes the kwargs, and an
+updates() method that returns the updates?  It would be wrong for auxiliary
+shared variables to be involved in two updates, so the interface should not
+encourage separate methods for those two steps.
+
+
+