# HG changeset patch
# User James Bergstra <bergstrj@iro.umontreal.ca>
# Date 1284047437 14400
# Node ID f082a6c0b0086efca078fc8c50d605d9e776f4ab
# Parent  e342de3ae485263ee1474bea5c9cde1e79ca4264# Parent  19033ef1636d0918ffe341e31fd2577279d062d1
merged v2planning learner

diff -r e342de3ae485 -r f082a6c0b008 doc/v2_planning/learner.txt
--- a/doc/v2_planning/learner.txt	Thu Sep 09 11:49:57 2010 -0400
+++ b/doc/v2_planning/learner.txt	Thu Sep 09 11:50:37 2010 -0400
@@ -476,3 +476,180 @@
 in a hyper-learner would create a notion of being able to zoom in, but other
 than that, i'm not sure what you mean.
 
+RP replies: I've been thinking about my idea a bit and yes, it might be
+quite different from what James has in mind, though there are plenty of common
+elements. I might have exaggerated a bit with the zooming in, so in some cases
+you will end up with atomic edges, though my hope is that this is not the case
+for most of the edges.
+
+I think I should go into more detail when answering this question because
+I feel I have not explained things sufficiently clearly. Note, in many places
+I replaced the word "function" by "transform".
+
+Think of the learner as an object that traverses a DAG of steps created by the 
+user. On this DAG the learner can potentially do a lot of cool stuff, but we
+won't care about that for now. The DAG can be infinite in principle, and what 
+the learner does is just follow the path described by the user (and here
+"described" means not through heuristics, as in James' case, but by giving the
+list of edges it needs to follow). A potentially cool thing the learner can do
+is to regard the path given by the user as a suggestion (or some form of
+heuristic) and try to improve it. This would be much closer to what James has
+in mind, and I definitely think it is a cool way to go about it.
+
+Now this path in the graph is given by the user by composing subgraphs or
+adding nodes to the graph. Or (expressing this in a simpler way) by applying
+functions to variables. Any such function will introduce an edge (or a subgraph)
+that will connect the vertices corresponding to the input variables to the
+vertices corresponding to the output variables. The variables store the state
+of the learner. These functions are stateless; I think that if you gave them
+state you would make this approach really ugly (I might be wrong).
+The variables would contain the information required by the function, like
+the number of layers, how many cores to run on, cluster configuration, and so on.
+
+Now about the zooming part that James asked about. I might have exaggerated a
+bit; it is not that you can zoom in on any part infinitely. You will end up with
+things that are atomic. The idea is that any such "transformation" or edge
+has the potential to be split up into several "transformations". This offers
+(in my view) a way of dealing with the time constraints of our project. We can
+start by defining a coarse division into segments. For now we can have
+a structure transform that turns a list of parameters into a deep
+network of some type, then a learner transform that adds SGD + pre-training
+on top of the network, then an early stopper on top of that, and then a
+run_on_cluster on that. We would probably want something more finely grained
+even from the start; this is just to illustrate my point. When any of us
+starts experimenting with a certain sub-step of this process (like the
+structure), we will split that transform into several (like ones that create
+a layer and so on) that make sense for that case, and then start working on
+the low-level transform that we care about (like the layer), introducing new
+versions of it. I think we cannot find a universal split that will cover
+all of our cases, so I think we should allow different such splits. Whoever
+does the research should look at what low-level transforms are available and use
+those if they make sense; if not, they would have to create a different split.
+Creating a different split might involve a lot of work and taking care of
+several issues, so it should be done with care.
+
+I'll give an example from where I started thinking this way. Let's say we want
+to do the SdA with auxiliary inputs that encourages separation of the features
+in the hidden layer, which Yoshua was talking about (I had an attempt
+at it some time ago for speech but I never ended up finishing that project).
+
+You start up with something like : 
+
+learner = Learner()
+# This will create the learner that will traverse our graph. We might
+# want it to be a function ``execute`` instead; I just randomly picked this
+# option. I have no preference on this detail for now; this is work in progress.
+
+data  = someSpeechData(path = 'some path')
+# This is a transform that will generate, from the string representing the
+# path, a dataset variable (that will contain all the information you need to
+# access the data). This will probably be the object the datasets committee will
+# provide. Note, you might need to provide more information than the path, but
+# you can easily see how to do that. All of these start from simple
+# variables like path, batch size and so on, and return a complex heavy-duty
+# variable (node).
+
+
+model = earlyStopping(pretrain(SdA(layers = [524, 500, 500,27], noise = [0.1,0.1]),data, epochs = 10), data)
+# This is a composition of several transforms. The SdA transform starts from the
+# info about layers and corruption/noise for each layer and constructs an SdA.
+# This is a high-level transform, so it will take care of defining all the
+# details, like pre-training, defining the cost and so on. Note that it may
+# require some more parameters; you can assume that for anything else there
+# is a default value that the SdA will use. earlyStopping is yet another
+# transform that takes a model (that we know how to train) and some data,
+# and does early stopping on it. For brevity I did not provide all the
+# information required, like patience and so on. The SdA only knows how to do a
+# step of training. The same holds for pretrain: it will loop over the layers
+# of the SdA and train each one.
+
+steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
+# This will launch the wanted jobs. getPropertiesAndRanges will get from a
+# model all the knobs that need to be turned, and their ranges, and will
+# uniformly sample from them in each job. getClusterInfo will return a variable
+# containing information about the cluster (I added this for simplicity; it
+# should probably be replaced with something like username, password,
+# clusterpath or whatever).
+
+learner.execute(steps)
+# As an option, each of these output variables could contain the entire graph
+# up to that point. We could also do this in a different way; this is
+# ad hoc at the moment.
+
+
+Now this is a coarse vanilla SdA, which is not what we wanted. We do not have a
+way of incorporating our auxiliary information in this. So what we have to do
+is split/change the SdA transform. We would re-write it as :
+
+
+arch = SdA(layers = [524, 500, 500, 27], noise = [0.1,0.1])
+model = earlyStopping(pretrain(arch, data, epochs = 10), data)
+...
+
+And then re-write things like : 
+
+arch = SGD( cross_entropy( logreg( DAAlayer( [DAAlayer([524,500],0.1),500],0.1))))
+
+
+We would re-write the DAAlayer as : 
+
+layer0 = DAAlayer([524,500],0.1)
+layer1 = cross_entropy(reconstruct( tanh(dotW_b( layer0,500)),noise = 0.1))
+
+At this point of detail, we can start inserting our new stuff in as follows : 
+
+
+input = empty_layer(600)
+# empty_layer is a wrapper; if I were to write dotW_b(200,500), which means
+# go from a layer of 200 units to one of 500 by multiplying with a matrix
+# and adding a bias, what I would mean is dotW_b( empty_layer(200), 500).
+# An implementation of empty_layer could be just theano.tensor.vector()
+# where we add the size tag (we will need it later).
+
+
+hidden0_mfcc = dotW_b(input[0:524],100)
+hidden0_noise = dotW_b(input[0:560],50)
+hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]),50)
+hidden0 = tanh(join( hidden0_mfcc, hidden0_noise, hidden0_speakerID))
+layer0 = cross_entropy( reconstruct( hidden0, noise = 0.1))
+
+and so on. Hopefully you got what I mean by splitting a transform, or zooming
+in. When doing all this we did not change anything about the early stopping or
+launching jobs on the cluster. In the same manner, if one would like to look
+into how jobs are sent to the cluster, one could just expand that part. Note
+that if we wanted to do something else we might have split the DAA
+differently.
+
+The key to this approach is to identify low-level units that can be
+shared by 90% of our architectures, and the splits that make the most sense
+from a functional point of view and that cover the main points where people
+will want to change things. This will ensure that almost all the time we have
+the low-level bits that we want to write our code in, and most of the
+time we will only work on one of those bits. There will definitely be cases when
+whatever we have will not be sufficient or convenient. In that case some
+effort has to be invested by the user to create a different decomposition of
+the problem into the elements they need.
+
+I've been thinking about this a bit, and it definitely works for deep
+networks and theano (the approach was inspired by theano). From what James
+said, I think that other stuff might be possible to incorporate, at least as
+atomic transforms if not in any other way.
+
+TODO: one has to give some thought to these low-level transforms, to find a
+suitable set of them (and variables) so that we would end up most of the time
+re-using things and not creating new things.
+
+NOTES: there are some other implementation details missing about what these state
+variables should contain. I did not want to clutter this with the tricks that
+could be used to get this transparent interface. I have a few of them in mind
+though.
+There are a lot of hardcoded values in this example. Usually each transform
+that takes an input should "know" which of these inputs are tunable and mark
+them as such. The order of the inputs in this example is important as well.
+This can be easily solved at the expense of a few more lines of code that
+I did not want to write.
+
+
+
+
+