# HG changeset patch
# User Razvan Pascanu
# Date 1284046104 14400
# Node ID 19033ef1636d0918ffe341e31fd2577279d062d1
# Parent  bc3f7834db83042670503be377b9d033e809d0c1
some more details on my approach

diff -r bc3f7834db83 -r 19033ef1636d doc/v2_planning/learner.txt
--- a/doc/v2_planning/learner.txt	Wed Sep 08 20:45:17 2010 -0400
+++ b/doc/v2_planning/learner.txt	Thu Sep 09 11:28:24 2010 -0400
@@ -446,3 +446,180 @@
 in a hyper-learner would create a notion of being able to zoom in, but
 other than that, i'm not sure what you mean.
+RP replies: I've been thinking about my idea a bit and yes, it might be
+quite different from what James has in mind, though there are plenty of
+common elements. I might have exaggerated a bit about the zooming in: in
+some cases you will end up with atomic edges, though my hope is that this
+will not be true for most edges.
+
+I think I should go into more detail when answering this question, because
+I feel I have not explained things sufficiently clearly. Note that in many
+places I have replaced the word "function" by "transform".
+
+Think of the learner as an object that traverses a DAG of steps created by
+the user. On this DAG the learner can potentially do a lot of cool stuff,
+but we won't care about that for now. The DAG can be infinite in principle,
+and what the learner does is just follow the path described by the user
+(and here "described" is not through heuristics as in James's case, but by
+giving the list of edges it needs to follow). A potentially cool thing the
+learner could do is to regard the path given by the user as a suggestion
+(or some form of heuristic) and try to improve it. This would be much
+closer to what James has in mind, and I definitely think it is a cool way
+to go about it.
+
+Now, this path in the graph is given by the user by composing subgraphs or
+adding nodes to the graph, or (expressing this more simply) by applying
+functions to variables.
+Any such function will introduce an edge (or a subgraph) that connects the
+vertices corresponding to the input variables to the vertices corresponding
+to the output variables. The variables store the state of the learner. The
+functions themselves are stateless; I think giving them state would make
+this approach really ugly (I might be wrong). The variables would contain
+the information required by the function, like the number of layers, how
+many cores to run on, cluster configurations, and so on.
+
+Now, about the zooming part that James asked about. I might have
+exaggerated a bit: it is not the case that you can zoom in on any part
+infinitely; you will end up with things that are atomic. The idea is that
+any such "transformation" or edge has the potential to be split up into
+several transformations. This offers (in my view) a way of dealing with the
+time constraints of our project. We can start by defining a coarse division
+into segments. For now we can have a structure transform that turns a list
+of parameters into a deep network of some type, then a learner transform
+that adds SGD + pre-training on top of the network, then an early stopper
+on top of that, and then a run_on_cluster on top of that. We would probably
+want something more finely grained even from the start .. this is just to
+prove my point. When any of us starts experimenting with a certain sub-step
+of this process (like the structure), we will split that transform into
+several (like ones that create a layer, and so on) that make sense for that
+case, and then start working on the low-level transform we care about (like
+the layer), introducing new versions of it. I do not think we can find a
+universal split that will cover all of our cases, so I think we should
+allow different such splits. The researcher should look at what low-level
+transforms are available and use those if they make sense; if not, he would
+have to create a different split.
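To make the stateless-transform idea concrete, here is a minimal Python sketch (all names hypothetical; this is not a proposed API): variables carry the learner's state, transforms are stateless functions, and applying a transform records an edge from the input variables to the output variable, which is how the DAG gets built.

```python
# Hypothetical sketch: stateless transforms connecting state-carrying variables.

class Variable:
    """Holds a piece of the learner's state (data, parameters, config)."""
    def __init__(self, value, producer=None, inputs=()):
        self.value = value
        self.producer = producer   # name of the transform that created this variable
        self.inputs = inputs       # parent variables: the incoming DAG edges

def transform(fn):
    """Wrap a stateless function so that applying it records a DAG edge."""
    def apply(*inputs):
        out = fn(*(v.value for v in inputs))
        return Variable(out, producer=fn.__name__, inputs=inputs)
    return apply

@transform
def double(x):
    return 2 * x

@transform
def add(x, y):
    return x + y

a = Variable(3)
b = double(a)   # introduces edge a -> b
c = add(a, b)   # introduces edges (a, b) -> c
print(c.value)      # 9
print(c.producer)   # 'add'
```

The transforms themselves hold no state; everything an edge needs lives in the variables it connects, which is what would let a learner walk (or rewrite) the graph afterwards.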
+Creating a different split might involve a lot of work and requires taking
+care of several issues, so it should be done with care.
+
+I'll give an example of where I started thinking this way. Let's say we
+want to do the SdA with auxiliary inputs that encourage separation of the
+features in the hidden layer, as Yoshua was suggesting (I had an attempt at
+it some time ago for speech, but I never ended up finishing that project).
+
+You start with something like:
+
+learner = Learner()
+# This will create the learner that will traverse our graph. We might
+# want it to be a function ``execute``; I just randomly picked this option.
+# I have no preference on this detail for now .. this is mostly work in
+# progress.
+
+data = someSpeechData(path = 'some path')
+# This is a transform that will generate, from the string representing the
+# path, a dataset variable (which will contain all the information you need
+# to access the data). This will probably be the object the datasets
+# committee will provide. Note that you might need to provide more
+# information than the path, but you can easily see how to do that. All of
+# this starts from simple variables like the path, batch size and so on,
+# and returns a complex heavy-duty variable (node).
+
+
+model = earlyStopping(pretrain(SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1]), data, epochs = 10), data)
+# This is a composition of two transforms. The SdA transform starts from
+# the info about layers and corruption/noise for each layer and constructs
+# an SdA. This is a high-level transform, so it will take care of defining
+# all the details, like pre-training, defining the cost and so on. Note
+# that it may require some more parameters; you can assume that for
+# anything else there is a default value that the SdA will use.
+# earlyStopping is yet another transform that takes a model (that we know
+# how to train) and some data, and does early stopping on it.
+For brevity I did not provide all the information required, like patience
+and so on. The SdA only knows how to do a step of training. The same holds
+for pretrain: it will loop over the layers of the SdA and train each one.
+
+steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
+# This will launch the wanted jobs. getPropertiesAndRanges will get from a
+# model all the knobs that need to be turned, together with their ranges,
+# and will sample uniformly from them for each job. getClusterInfo will
+# return a variable containing information about the cluster (I added this
+# for simplicity; it should probably be replaced with something like
+# username, password, clusterpath or whatever).
+
+learner.execute(steps)
+# As an option, each of these output variables could contain the entire
+# graph up to that point. We could also do this in a different way .. this
+# is ad hoc at the moment.
+
+
+Now this is a coarse vanilla SdA, which is not what we wanted: we do not
+have a way of incorporating our auxiliary information into it. So what we
+have to do is split/change the SdA transform. We would re-write it as:
+
+
+arch = SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1])
+model = earlyStopping(pretrain(arch, data, epochs = 10), data)
+...
+
+And then re-write things like:
+
+arch = SGD( cross_entropy( logreg( DAAlayer( [DAAlayer([524,500],0.1),500],0.1))))
+
+
+We would re-write the DAAlayer as:
+
+layer0 = DAAlayer([524,500],0.1)
+layer1 = cross_entropy(reconstruct( tanh(dotW_b( layer0,500)),noise = 0.1))
+
+At this point of detail, we can start inserting our new stuff as follows:
+
+
+input = empty_layer(600)
+# empty_layer is a wrapper; if I were to write dotW_b(200,500), which means
+# go from a layer of 200 units to one of 500 by multiplying with a matrix
+# and adding a bias, what I would mean is dotW_b( empty_layer(200), 500).
+
+# An implementation of empty_layer could be just theano.tensor.vector(),
+# where we add the size tag (we will need it later).
+
+
+hidden0_mfcc = dotW_b(input[0:524],100)
+hidden0_noise = dotW_b(input[0:560],50)
+hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]),50)
+hidden0 = tanh(join( hidden0_mfcc, hidden0_noise, hidden0_speakerID))
+layer0 = cross_entropy( reconstruct( hidden0, noise = 0.1))
+
+and so on. Hopefully you see what I mean by splitting a transform, or
+zooming in. In doing all this we did not change anything about the early
+stopping or launching jobs on the cluster. In the same manner, if someone
+would like to look into how jobs are sent to the cluster, they could just
+expand that part. Note that if we wanted to do something else we might have
+split the DAA differently.
+
+The key to this approach is to identify low-level units that can be shared
+by 90% of our architectures, and the splits that make the most sense from a
+functional point of view, covering the main points where people will want
+to change things. This will ensure that almost all the time we have the
+low-level bits we want to write our code in, and most of the time we will
+only work on one such bit. There will definitely be cases when whatever we
+have will not be sufficient or convenient. In that case some effort has to
+be invested by the user to create a different decomposition of the problem
+into the elements he needs.
+
+I've been thinking about this a bit, and it definitely works for deep
+networks and Theano (the approach was inspired by Theano). From what James
+said, I think other stuff might be possible to incorporate, at least as
+atomic transforms if not in any other way.
+
+TODO: one has to give some thought to these low-level transforms, to find a
+suitable set of them (and variables), so that one would end up re-using
+things most of the time rather than creating new things.
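The splitting/zooming idea above can be sketched in a few lines of Python. Everything here is hypothetical (transforms are modeled as nested tuples purely for illustration, and the names mirror the pseudocode above but are not a real API): a coarse DAAlayer transform is defined as a composition of finer transforms, so a user who needs a variant can re-compose the finer pieces without touching anything else.

```python
# Hypothetical sketch of splitting a coarse transform into finer ones.
# Transforms are represented as nested tuples purely for illustration.

def dotW_b(layer, n_out):
    # affine transform: multiply by a weight matrix and add a bias
    return ('dotW_b', layer, n_out)

def tanh(layer):
    return ('tanh', layer)

def reconstruct(layer, noise):
    return ('reconstruct', layer, noise)

def cross_entropy(layer):
    return ('cross_entropy', layer)

def empty_layer(size):
    # placeholder input with a size tag
    return ('input', size)

def DAAlayer(layer, n_out, noise):
    # Coarse transform, expressed as a composition of the finer ones above;
    # this composition is the "split" of the denoising-autoencoder layer.
    return cross_entropy(reconstruct(tanh(dotW_b(layer, n_out)), noise))

# Using the coarse transform directly:
layer0 = DAAlayer(empty_layer(524), 500, 0.1)

# "Zooming in": a user who wants a variant re-composes the finer pieces,
# e.g. swapping tanh for a (hypothetical) sigmoid, leaving the rest intact.
variant = cross_entropy(reconstruct(('sigmoid', dotW_b(empty_layer(524), 500)), 0.1))
```

The point is that `layer0` and `variant` share every piece except the one the researcher actually changed, which is what keeps the early stopping and cluster parts of the graph untouched.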
+
+NOTES: there are some other implementation details missing about what these
+state variables should contain. I did not want to clutter this with the
+tricks that could be used to get this transparent interface; I have a few
+of them in mind, though.
+There are a lot of hardcoded values in this example. Usually each transform
+that takes an input should "know" which of its inputs are tunable and mark
+them as such. The order of the inputs in this example is important as well.
+This can easily be solved at the expense of a few more lines of code that I
+did not want to write.
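One way the "mark inputs as tunable" note could work is sketched below. This is an assumption on my part, not a committed design: `get_properties_and_ranges` plays the role of the `getPropertiesAndRanges` transform from the example above, and the `tunable`/`fixed` wrappers are made-up helpers.

```python
# Hypothetical sketch: tagging transform inputs as tunable so that
# something like getPropertiesAndRanges can discover the knobs and
# their ranges automatically.

def tunable(value, range_):
    """Mark an input as a knob that a hyper-parameter search may turn."""
    return {'value': value, 'range': range_, 'tunable': True}

def fixed(value):
    """Mark an input as not subject to tuning."""
    return {'value': value, 'tunable': False}

def get_properties_and_ranges(params):
    """Collect every tunable knob and its range from a parameter dict."""
    return {name: p['range'] for name, p in params.items() if p['tunable']}

# Inputs for a hypothetical SdA transform: architecture is fixed,
# noise levels and learning rate are tunable.
sda_params = {
    'layers': fixed([524, 500, 500, 27]),
    'noise':  tunable([0.1, 0.1], ([0.0, 0.0], [0.5, 0.5])),
    'lr':     tunable(0.01, (1e-4, 1.0)),
}

knobs = get_properties_and_ranges(sda_params)
# knobs now maps only 'noise' and 'lr' to their ranges; a cluster transform
# could sample uniformly from these ranges for each of its n_jobs.
```

Passing inputs by name like this would also remove the dependence on input order mentioned in the note above.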