comparison doc/v2_planning/learner.txt @ 1059:f082a6c0b008

merged v2planning learner
author James Bergstra <bergstrj@iro.umontreal.ca>
date Thu, 09 Sep 2010 11:50:37 -0400
parents e342de3ae485 19033ef1636d
children 7a8dcf87d780
if). Another difference is that the graph I had in mind doesn't feel fractal -
it would be very common for a graph edge to be atomic. A proxy pattern, such as
in a hyper-learner, would create a notion of being able to zoom in, but other
than that, I'm not sure what you mean.

RP replies: I've been thinking about my idea a bit and yes, it might be
quite different from what James has in mind, though there are plenty of common
elements. I might have exaggerated a bit with the zooming in, so in some cases
you will end up with atomic edges, though my hope is that this will not be the
case for most edges.

I think I should go into more detail when answering this question, because
I feel I have not explained things sufficiently clearly. Note that in many
places I have replaced the word "function" with "transform".

Think of the learner as an object that traverses a DAG of steps created by the
user. On this DAG the learner can potentially do a lot of cool stuff, but we
won't care about that for now. The DAG can be infinite in principle, and what
the learner does is just to follow the path described by the user (and here
"described" does not mean through heuristics as in James's case, but by giving
the list of edges it needs to follow). A potentially cool thing the learner
can do is to regard the path given by the user as a suggestion (or some form
of heuristic) and try to improve it. This would be much closer to what James
has in mind, and I definitely think it is a cool way to go about it.

Now this path in the graph is given by the user by composing subgraphs or
adding nodes to the graph, or (to express this more simply) by applying
functions to variables. Any such function will introduce an edge (or a
subgraph) that connects the vertices corresponding to the input variables to
the vertices corresponding to the output variables. The variables store the
state of the learner. These functions are stateless; I think that giving them
state would make this approach really ugly (I might be wrong). The variables
would contain the information required by the function, like the number of
layers, how many cores to run on, cluster configurations, and so on.

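To make this concrete, here is a minimal sketch of the variables-as-nodes /
transforms-as-edges idea in plain Python. The names (Variable, Transform,
Learner.execute) and the structure are invented purely for illustration; this
is not a proposed API, just one way the bookkeeping could look.

class Variable(object):
    # a node of the DAG: holds a piece of learner state (a path, a dataset,
    # a model, ...) plus a record of the edge that produced it
    def __init__(self, value=None, producer=None):
        self.value = value
        self.producer = producer

class Transform(object):
    # an edge builder: stateless, so its output depends only on its inputs
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *inputs):
        # applying a transform only extends the DAG; nothing is computed yet
        return Variable(producer=(self, inputs))

class Learner(object):
    # walks the path of edges described by the user and evaluates it
    def execute(self, var):
        if var.producer is None:
            return var.value
        transform, inputs = var.producer
        var.value = transform.fn(*[self.execute(v) for v in inputs])
        return var.value

In this picture, the things applied in the example further down
(someSpeechData, SdA, earlyStopping, cluster, ...) would be objects of roughly
this kind, and the graph is built simply by calling them on variables.
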
Now, about the zooming part that James asked about. I might have exaggerated a
bit; it is not that you can zoom in on any part infinitely. You will end up
with things that are atomic. The idea is that any such "transformation" or
edge has the potential to be split up into several "transformations". This
offers (in my view) a way of dealing with the time constraints of our project.
We can start by defining a coarse division into segments. For now we could
have a structure transform that turns a list of parameters into a deep
network of some type, then a learner transform that adds SGD + pre-training
on top of the network, then an early stopper on top of that, and then a
run_on_cluster on top of that. We would probably want something more finely
grained even from the start .. this is just to prove my point. When any of us
starts experimenting with a certain sub-step of this process (like the
structure), we will split that transform into several (like ones that create
a layer and so on) that make sense for that case, and then start working on
the low-level transform that we care about (like the layer), introducing new
versions of it. I do not think we can find a universal split that will cover
all of our cases, so I think we should allow different such splits. Whoever is
doing the research should look at what low-level transforms are available and
use those if they make sense; if not, they would have to create a different
split. Creating a different split might involve a lot of work and taking care
of several issues, so it should be done with care.

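As a purely illustrative sketch (the CoarseTransform class and the expand hook
are made up here, not something that exists or that is being proposed as-is),
the splitting could be supported by letting a coarse edge advertise its own
finer-grained decomposition:

class CoarseTransform(object):
    # behaves like a single atomic edge, but can optionally be "zoomed into"
    def __init__(self, fn, expansion=None):
        self.fn = fn                # the atomic behaviour of the edge
        self.expansion = expansion  # optional list of finer transforms

    def expand(self):
        # return the finer split if one exists, otherwise stay atomic
        return self.expansion if self.expansion is not None else [self]

# e.g. a coarse "build the whole deep network" edge could declare a split like
#   CoarseTransform(build_network,
#                   expansion=[make_layer, stack_layers, add_output_layer])
# (build_network, make_layer, etc. are placeholder names for this sketch)

Different splits of the same coarse edge would then just be different
expansion lists.
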
I'll give an example of where I started thinking this way. Let's say we want
to do the SdA with auxiliary inputs that encourage separation of the features
in the hidden layer, which Yoshua was talking about (I had an attempt at it
some time ago for speech, but I never ended up finishing that project).

You would start with something like:

learner = Learner()
# This will create the learner that will traverse our graph. We might
# want it to be a function ``execute``; I just randomly picked this option.
# I have no preference about this detail for now .. this is mostly work in
# progress.

data = someSpeechData(path = 'some path')
# This is a transform that will generate, from the string representing the
# path, a dataset variable (one that will contain all the information you need
# to access the data). This will probably be the object the datasets committee
# will provide. Note, you might need to provide more information than the
# path, but you can easily see how to do that. All of these start from simple
# variables like path, batch size and so on and return a complex heavy-duty
# variable (node).


model = earlyStopping(pretrain(SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1]), data, epochs = 10), data)
# This is a composition of two transforms. The SdA transform starts from the
# info about layers and corruption/noise for each layer and constructs an SdA.
# This is a high-level transform, so it will take care of defining all the
# details, like pre-training, defining the cost and so on. Note that maybe it
# will require some more parameters .. you can assume that for anything else
# there is a default value that the SdA will use. earlyStopping is yet another
# transform that takes a model (that we know how to train) and some data,
# and does early stopping on it. For brevity I did not provide all the
# information required, like patience and so on. The SdA only knows how to do
# a step of training. The same holds for pretrain; it will loop over the
# layers of the SdA and train each one.

steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
# This will launch the wanted jobs. getPropertiesAndRanges will get from a
# model all the knobs that need to be turned, and their ranges, and will
# sample uniformly from them in each job. getClusterInfo will return a
# variable containing information about the cluster (I added this for
# simplicity; it should probably be replaced with something like username,
# password, clusterpath or whatever).

learner.execute(steps)
# As an option, each of these output variables could contain the entire graph
# up to that point. We could also do this in a different way .. this is
# ad hoc at the moment.

Now this is a coarse vanilla SdA, which is not what we wanted. We do not have
a way of incorporating our auxiliary information in this, so what we have to
do is split/change the SdA transform. We would re-write it as:

arch = SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1])
model = earlyStopping(pretrain(arch, data, epochs = 10), data)
...

And then re-write things like:

arch = SGD(cross_entropy(logreg(DAAlayer([DAAlayer([524, 500], 0.1), 500], 0.1))))

We would re-write the DAAlayer as:

layer0 = DAAlayer([524, 500], 0.1)
layer1 = cross_entropy(reconstruct(tanh(dotW_b(layer0, 500)), noise = 0.1))

At this level of detail, we can start inserting our new stuff as follows:

input = empty_layer(600)
# empty_layer is a wrapper; if I were to write dotW_b(200, 500), which means
# go from a layer of 200 units to one of 500 by multiplying by a matrix and
# adding a bias, what I would really mean is dotW_b(empty_layer(200), 500).
# An implementation of empty_layer could be just theano.tensor.vector(),
# where we add the size tag (we will need it later).

hidden0_mfcc = dotW_b(input[0:524], 100)
hidden0_noise = dotW_b(input[0:560], 50)
hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]), 50)
hidden0 = tanh(join(hidden0_mfcc, hidden0_noise, hidden0_speakerID))
layer0 = cross_entropy(reconstruct(hidden0, noise = 0.1))

and so on. Hopefully you get what I mean by splitting a transform, or zooming
in. While doing all this we did not change anything about the early stopping
or launching jobs on the cluster. In the same manner, if one would like to
look into how jobs are sent to the cluster, one could just expand that part.
Note that if we wanted to do something else, we might have split the DAA
differently.

The key to this approach is to identify low-level units that can be shared by
90% of our architectures, and the splits that make the most sense from a
functional point of view and cover the main points where people will want to
change things. This will ensure that almost all the time we have the low-level
bits that we want to write our code into, and most of the time we will only
work on one of those bits. There will definitely be cases when whatever we
have will not be sufficient or convenient. In that case some effort has to be
invested by the user to create a different decomposition of the problem into
the elements they need.

I've been thinking about this a bit, and it definitely works for deep
networks and theano (the approach was inspired by theano). From what James
said, I think that other stuff might be possible to incorporate, at least as
atomic transforms if not in any other way.

TODO: one has to give some thought to these low-level transforms, to find a
suitable set of them (and of variables), so that most of the time we end up
re-using things rather than creating new ones.

NOTES: there are some implementation details still missing about what these
state variables should contain. I did not want to clutter this with the tricks
that could be used to get this transparent interface; I have a few of them in
mind though.
There are a lot of hardcoded values in this example. Usually each transform
that takes an input should "know" which of its inputs are tunable and mark
them as such. The order of the inputs in this example matters as well. This
can easily be solved at the expense of a few more lines of code that I did not
want to write.
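
A rough sketch, purely for illustration, of what marking inputs as tunable
could look like (the Tunable wrapper and collect_tunables helper are invented
names, and this is not how getPropertiesAndRanges above is actually defined):

class Tunable(object):
    # marks one input of a transform as a knob with a range to sample from
    def __init__(self, default, low, high):
        self.default = default
        self.low = low
        self.high = high

def collect_tunables(kwargs):
    # gather every tunable knob (name -> (low, high)) from a transform's
    # inputs; something like getPropertiesAndRanges could be built on this
    return dict((name, (v.low, v.high))
                for name, v in kwargs.items()
                if isinstance(v, Tunable))

# e.g. one could imagine calling the SdA transform with
#   SdA(layers = [524, 500, 500, 27], noise = Tunable(0.1, 0.0, 0.5))
# so that the cluster step samples 'noise' uniformly from (0.0, 0.5) per job.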