doc/v2_planning/learner.txt @ 1059:f082a6c0b008 (merged v2planning learner)
author:   James Bergstra <bergstrj@iro.umontreal.ca>
date:     Thu, 09 Sep 2010 11:50:37 -0400
parents:  e342de3ae485 19033ef1636d
children: 7a8dcf87d780
if). Another difference is that the graph I had in mind doesn't feel fractal -
it would be very common for a graph edge to be atomic. A proxy pattern, such as
in a hyper-learner, would create a notion of being able to zoom in, but other
than that I'm not sure what you mean.
RP replies: I've been thinking about my idea a bit and yes, it might be
quite different from what James has in mind, though there are plenty of common
elements. I might have exaggerated a bit with the zooming in; in some cases
you will end up with atomic edges, though my hope is that most edges will not
be atomic.

I think I should go into more detail when answering this question, because I
feel I have not explained things sufficiently clearly. Note that in many
places I have replaced the word "function" with "transform".

Think of the learner as an object that traverses a DAG of steps created by the
user. On this DAG the learner can potentially do a lot of cool stuff, but we
won't care about that for now. The DAG can be infinite in principle, and what
the learner does is simply follow the path described by the user (and here
"described" means not through heuristics as in James's case, but by giving the
list of edges it needs to follow). A potentially cool thing the learner could
do is to regard the path given by the user as a suggestion (or some form of
heuristic) and try to improve it. This would be much closer to what James has
in mind, and I definitely think it is a cool way to go about it.
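
Just to make the traversal part concrete, here is a minimal sketch of what I
have in mind (all the names, Edge / Learner / execute, are placeholders I made
up for illustration; nothing is decided):

class Edge(object):
    # one step of the DAG: a stateless transform applied to named variables
    def __init__(self, transform, inputs, outputs):
        self.transform = transform   # plain callable, keeps no state
        self.inputs = inputs         # names of the input variables
        self.outputs = outputs       # names of the output variables

class Learner(object):
    def execute(self, edges, state):
        # follow the path given by the user, updating the variables (state)
        for edge in edges:
            args = [state[name] for name in edge.inputs]
            results = edge.transform(*args)
            if len(edge.outputs) == 1:
                results = (results,)
            for name, value in zip(edge.outputs, results):
                state[name] = value
        return state

path = [Edge(lambda x: x * 2, inputs=['a'], outputs=['b']),
        Edge(lambda b: b + 1, inputs=['b'], outputs=['c'])]
print(Learner().execute(path, {'a': 3}))   # {'a': 3, 'b': 6, 'c': 7}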

Now this path in the graph is given by the user by composing subgraphs or
adding nodes to the graph, or (to put it more simply) by applying functions to
variables. Any such function introduces an edge (or a subgraph) that connects
the vertices corresponding to the input variables to the vertices
corresponding to the output variables. The variables store the state of the
learner. The functions themselves are stateless; I think giving them state
would make this approach really ugly (I might be wrong). The variables would
contain the information required by the function, like the number of layers,
how many cores to run on, cluster configuration, and so on.
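
Mechanically, applying a transform could do nothing more than build the graph:
the output variable remembers which (stateless) function produced it and from
which input variables. A rough, made-up sketch (Variable and transform are
invented names; the real variables would hold dataset objects, configurations,
models, etc.):

class Variable(object):
    # a node in the DAG; holds a piece of the learner's state (a path, a
    # batch size, a whole model, ...) plus the edge that produced it
    def __init__(self, value, producer=None, inputs=()):
        self.value = value
        self.producer = producer     # the stateless function that made it
        self.inputs = inputs         # the variables it was computed from

def transform(fn):
    # wrap a stateless function so that applying it to variables records an
    # edge (or subgraph) from the input vertices to the output vertex
    def apply(*variables):
        out_value = fn(*[v.value for v in variables])
        return Variable(out_value, producer=fn, inputs=variables)
    return apply

# building the path then looks like applying ordinary functions to variables
dotW_b = transform(lambda n_in, n_out: ('dotW_b', n_in, n_out))  # placeholder
layer = dotW_b(Variable(524), Variable(500))
print(layer.value)             # ('dotW_b', 524, 500)
print(layer.inputs[0].value)   # 524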

Now, about the zooming part that James asked about. I might have exaggerated a
bit; it is not that you can zoom in on any part infinitely. You will end up
with things that are atomic. The idea is that any such "transformation" or
edge has the potential to be split up into several "transformations". This
offers (in my view) a way of dealing with the time constraints of our project.
We can start by defining a coarse division into segments. For now we could
have a structure transform that turns a list of parameters into a deep network
of some type, then a learner transform that adds SGD + pre-training on top of
the network, then an early stopper on top of that, and then a run_on_cluster
on top of that. We would probably want something more finely grained even from
the start; this is just to make my point. When any of us starts experimenting
with a certain sub-step of this process (like the structure), we will split
that transform into several (like ones that create a layer and so on) that
make sense for that case, and then start working on the low-level transform we
care about (like the layer), introducing new versions of it. I think we cannot
find a universal split that will cover all of our cases, so I think we should
allow different such splits. Whoever does the research should look at what
low-level transforms are available and use those if they make sense; if not,
they would have to create a different split. Creating a different split might
involve a lot of work and taking care of several issues, so it should be done
with care.

I'll give an example from where I started thinking this way. Let's say we want
to do the SdA with auxiliary inputs that encourage separation of the features
in the hidden layer, the one Yoshua was talking about (I had an attempt at it
some time ago for speech but never ended up finishing that project).
You start up with something like:

learner = Learner()
# This will create the learner that will traverse our graph. We might
# want it to be a function ``execute``; I just randomly picked this option.
# I have no preference on this detail for now; this is mostly work in
# progress.

data = someSpeechData(path = 'some path')
# This is a transform that will generate, from the string representing the
# path, a dataset variable (one that contains all the information you need to
# access the data). This will probably be the object the datasets committee
# will provide. Note, you might need to provide more information than the
# path, but you can easily see how to do that. All of these start from simple
# variables like path, batch size and so on and return a complex, heavy-duty
# variable (node).

model = earlyStopping(pretrain(SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1]), data, epochs = 10), data)
# This is a composition of several transforms. The SdA transform starts from
# the info about layers and corruption/noise for each layer and constructs an
# SdA. This is a high-level transform, so it will take care of defining all
# the details, like pre-training, defining the cost and so on. Note that it
# may require some more parameters; you can assume that for anything else
# there is a default value that the SdA will use. earlyStopping is yet another
# transform that takes a model (that we know how to train) and some data, and
# does early stopping on it. For brevity I did not provide all the required
# information, like patience and so on. The SdA only knows how to do a step of
# training. The same holds for pretrain: it will loop over the layers of the
# SdA and train each one.

steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
# This will launch the wanted jobs. getPropertiesAndRanges will get from a
# model all the knobs that need to be turned, and their ranges, and will
# uniformly sample from them for each job. getClusterInfo will return a
# variable containing information about the cluster (I added this for
# simplicity; it should probably be replaced with something like username,
# password, cluster path or whatever).

learner.execute(steps)
# As an option, each of these output variables could contain the entire graph
# up to that point. We could also do this in a different way; this is ad hoc
# at the moment.
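
Just to illustrate the "uniformly sample the knobs" part of the comment above,
here is one way the cluster transform could turn ranges into the 20 job
configurations (the range format and sample_jobs are hypothetical):

import random

def sample_jobs(ranges, n_jobs, seed=0):
    # ranges: knob name -> (low, high), i.e. the kind of thing
    # getPropertiesAndRanges could return; one uniform draw per knob per job
    rng = random.Random(seed)
    return [dict((name, rng.uniform(lo, hi))
                 for name, (lo, hi) in ranges.items())
            for _ in range(n_jobs)]

# hypothetical knobs extracted from the model above
knob_ranges = {'learning_rate': (1e-4, 1e-1), 'noise': (0.0, 0.5)}
jobs = sample_jobs(knob_ranges, n_jobs=20)   # 20 parameter settings, one per job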


Now, this is a coarse vanilla SdA, which is not what we wanted. We do not have
a way of incorporating our auxiliary information into this. So what we have to
do is split/change the SdA transform. We would re-write it as:


arch = SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1])
model = earlyStopping(pretrain(arch, data, epochs = 10), data)
...

And then re-write the arch as:

arch = SGD(cross_entropy(logreg(DAAlayer([DAAlayer([524, 500], 0.1), 500], 0.1))))


We would then re-write the DAAlayer as:

layer0 = DAAlayer([524, 500], 0.1)
layer1 = cross_entropy(reconstruct(tanh(dotW_b(layer0, 500)), noise = 0.1))

At this level of detail, we can start inserting our new stuff as follows:


input = empty_layer(600)
# empty_layer is a wrapper; if I were to write dotW_b(200, 500), which means
# go from a layer of 200 units to one of 500 by multiplying with a matrix and
# adding a bias, what I would really mean is dotW_b(empty_layer(200), 500).
# An implementation of empty_layer could be just theano.tensor.vector(),
# where we add the size tag (we will need it later).


hidden0_mfcc = dotW_b(input[0:524], 100)
hidden0_noise = dotW_b(input[0:560], 50)
hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]), 50)
hidden0 = tanh(join(hidden0_mfcc, hidden0_noise, hidden0_speakerID))
layer0 = cross_entropy(reconstruct(hidden0, noise = 0.1))

and so on. Hopefully you get what I mean by splitting a transform, or zooming
in. When doing all this we did not change anything about the early stopping or
launching jobs on the cluster. In the same manner, if someone would like to
look into how jobs are sent to the cluster, they could just expand that part.
Note that if we wanted to do something else, we might have split the DAA
differently.

The key to this approach is to identify low-level units that can be shared by
90% of our architectures, and the splits that make the most sense from a
functional point of view and cover the main points where people will want to
change things. This will ensure that almost all the time we have the low-level
bits we want to write our code against, and most of the time we will only be
working on one of those bits. There will definitely be cases where whatever we
have is not sufficient or convenient. In that case some effort has to be
invested by the user to create a different decomposition of the problem into
the elements they need.

I've been thinking about this a bit, and it definitely works for deep
networks and theano (the approach was inspired by theano). From what James
said, I think other stuff could be incorporated as well, at least as atomic
transforms if not in any other way.

TODO: one has to give some thought to these low-level transforms, to find a
suitable set of them (and of variables) such that most of the time we end up
re-using things rather than creating new ones.

NOTES: There are some other implementation details missing about what these
state variables should contain. I did not want to clutter this with the tricks
that could be used to get this transparent interface; I have a few of them in
mind though.
There are a lot of hardcoded values in this example. Usually each transform
that takes an input should "know" which of those inputs are tunable and mark
them as such. The order of the inputs in this example matters as well. Both
issues can easily be solved at the expense of a few more lines of code that I
did not want to write; a rough sketch of what that could look like follows.
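
For example (purely illustrative; Tunable, dotW_b_spec and tunable_inputs are
names I am making up on the spot): passing inputs by name removes the
dependence on their order, and a small marker class is enough to flag which
ones are knobs.

class Tunable(object):
    # marker wrapping a default value together with its allowed range, so a
    # transform can advertise which of its inputs are knobs
    def __init__(self, default, low, high):
        self.default, self.low, self.high = default, low, high

# a transform declares its inputs by name (no reliance on argument order)
# and marks the tunable ones explicitly
dotW_b_spec = {
    'n_out': 500,                               # fixed structural input
    'noise': Tunable(0.1, 0.0, 0.5),            # knob: default plus range
    'learning_rate': Tunable(0.01, 1e-4, 1e-1), # another knob
}

def tunable_inputs(spec):
    # what a getPropertiesAndRanges-style helper could collect per transform
    return dict((name, v) for name, v in spec.items() if isinstance(v, Tunable))

print(sorted(tunable_inputs(dotW_b_spec)))   # ['learning_rate', 'noise']

A getPropertiesAndRanges-style transform could then just walk the graph and
gather these markers from every edge it finds.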