pylearn: comparison of doc/v2_planning/learner.txt @ 1056:19033ef1636d
"some more details on my approach"

author:   Razvan Pascanu <r.pascanu@gmail.com>
date:     Thu, 09 Sep 2010 11:28:24 -0400
parents:  bc3f7834db83
children: f082a6c0b008
if). Another difference is that the graph I had in mind doesn't feel fractal -
it would be very common for a graph edge to be atomic. A proxy pattern, such as
in a hyper-learner, would create a notion of being able to zoom in, but other
than that, I'm not sure what you mean.

RP replies: I've been thinking about my idea a bit and yes, it might be
quite different from what James has in mind, though there are plenty of common
elements. I might have exaggerated a bit with the zooming in, so in some cases
you will end up with atomic edges, though my hope is that this is not true of
most of the edges.

I think I should go into more detail when answering this question, because
I feel I have not explained things sufficiently clearly. Note that in many places
I have replaced the word "function" by "transform".

Think of the learner as an object that traverses a DAG of steps created by the
user. On this DAG the learner can potentially do a lot of cool stuff, but we
won't care about that for now. The DAG can be infinite in principle, and what
the learner does is just go along the path described by the user (and here
"described" is not through heuristics as in James's case, but by giving the list
of edges it needs to follow). A potential cool thing the learner could do is to
regard the path given by the user as a suggestion (or some form of heuristic)
and try to improve it. That would be much closer to what James has in mind,
and I definitely think it is a cool way to go about it.

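To make the "learner just walks the user's path" idea concrete, here is a minimal
sketch; the names Variable, Transform and execute are placeholders of mine, not a
proposed interface:

# Minimal sketch only: placeholder names, no real API implied.

class Variable(object):
    """A node of the DAG; it stores the state produced so far."""
    def __init__(self, value, produced_by=None):
        self.value = value
        self.produced_by = produced_by   # the edge (transform application) that made it

class Transform(object):
    """A stateless edge: it consumes input variables and returns new ones."""
    def __call__(self, *inputs):
        raise NotImplementedError

class Learner(object):
    def execute(self, path, start):
        """Follow the list of edges given by the user, in order."""
        current = start
        for transform in path:
            current = transform(current)
        return current
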
Now this path in the graph is given by the user by composing subgraphs or
adding nodes to the graph. Or, to put it more simply, by applying
functions to variables. Any such function will introduce an edge (or a subgraph) that
connects the vertices corresponding to the input variables to the vertices
corresponding to the output variables. The variables store the state of the
learner. These functions are stateless; I think if you gave them state
you would make this approach really ugly (I might be wrong).
The variables would contain the information required by the function, like
the number of layers, how many cores to run on, cluster configurations, and so on.

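To make the stateless-transform / stateful-variable split concrete, a toy sketch
(purely illustrative; representing a variable as a plain dict of state is my own
shortcut, not part of any agreed interface):

# Toy illustration of "state lives in variables, not in transforms".
# The state field used here (a list of layer sizes) is made up for the example.

def add_layer(var, n_units):
    """Stateless transform: read the old state, return a *new* variable."""
    new_var = dict(var)                          # copy; never mutate the input
    new_var['layers'] = var.get('layers', []) + [n_units]
    return new_var

arch = {'layers': []}                            # a variable is just stored state
arch = add_layer(arch, 500)                      # each application adds an edge
arch = add_layer(arch, 500)
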
Now, about the zooming part that James asked about. I might have exaggerated a bit;
it is not that you can zoom in on any part infinitely. You will end up with
things that are atomic. The idea is that any such "transformation" or edge
has the potential to be split up into several "transformations". This offers
(in my view) a way of solving the time constraints of our project. We can
start by defining a coarse division into segments. For now we can have
a structure transform that turns a list of parameters into a deep
network of some type, then a learner transform that adds SGD + pre-training
on top of the network, then an early stopper on top of that, and then a
run_on_cluster on top of that. We would probably want something more finely grained
even from the start .. this is just to prove my point. When any of us
starts experimenting with a certain sub-step of this process (like the
structure), we will split that transform into several (like ones that create
a layer and so on) that make sense for that case, and then start working on
the low-level transform that we care about (like the layer), introducing new
versions of it. I think we cannot find a universal split that will cover
all of our cases, so I think we should allow different such splits. Whoever
does the research should look at what low-level transforms are available and use
those if they make sense; if not, they would have to create a different split.
Creating a different split might involve a lot of work and taking care of
several issues, so it should be done with care.

I'll give an example from where I started thinking this way. Let's say we want
to do the SdA with auxiliary inputs that encourages separation of the features
in the hidden layer, the one Yoshua was talking about (I had an attempt
at it some time ago for speech but I never ended up finishing that project).

You start up with something like:

learner = Learner()
# This will create the learner that will traverse our graph. We might
# want it to be a function ``execute``; I just randomly picked this option.
# I have no preference on this detail for now .. this is mostly work in progress.

data = someSpeechData(path = 'some path')
# This is a transform that will generate, from the string representing the
# path, a dataset variable (that will contain all the information you need to
# access the data). This will probably be the object the datasets committee will
# provide. Note, you might need to provide more information than the path, but
# you can easily see how to do that. All this stuff starts from simple
# variables like path, batch size and so on and returns a complex heavy-duty
# variable (node).

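Just to illustrate the "simple variables in, heavy-duty variable out" shape of
such a transform, a hypothetical sketch (the fields below are placeholders; the
real dataset object would come from the datasets committee):

# Hypothetical sketch of a data-loading transform; the dict it returns is a
# stand-in for whatever dataset object the datasets committee provides.

def someSpeechData(path, batch_size=32):
    """Transform: simple inputs (a path, a batch size) -> a dataset variable."""
    dataset = {
        'path': path,              # where the raw data lives
        'batch_size': batch_size,  # how it should be iterated over
        # in reality: file handles, normalization stats, iterators, ...
    }
    return dataset
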
model = earlyStopping(pretrain(SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1]), data, epochs = 10), data)
# This is a composition of two transforms. The SdA transform starts from the
# info about layers and corruption/noise for each layer and constructs an SdA.
# This is a high-level transform, so it will take care of defining all the
# details, like pre-training, defining the cost and so on. Note that maybe it will
# require some more parameters .. you can assume that for anything else there
# is a default value that the SdA will use. earlyStopping is yet another
# transform that takes a model (that we know how to train) and some data,
# and does early stopping on it. For brevity I did not provide all the
# information required, like patience and so on. The SdA only knows how to do a
# step of training. The same holds for pretrain. It will loop over the layers of
# the SdA and train each one.

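Since the SdA only knows how to do one step of training, earlyStopping would
itself be just another transform that loops over that step; something in this
spirit (the train_step/valid_error methods and the patience rule are assumptions
made for the sake of the example):

# Sketch only: the model is assumed to expose a single training step and a
# validation error; the patience rule below is one arbitrary choice among many.

def earlyStopping(model, data, patience=10):
    best_err, waited = float('inf'), 0
    while waited < patience:
        model.train_step(data)             # the model only knows one step
        err = model.valid_error(data)
        if err < best_err:
            best_err, waited = err, 0      # improvement: reset patience
        else:
            waited += 1                    # no improvement: lose patience
    return model
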
steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
# This will launch the wanted jobs. getPropertiesAndRanges will get from a
# model all the knobs that need to be turned, together with their ranges, and
# will sample uniformly from them for each job. getClusterInfo will return a
# variable containing information about the cluster (I added this for
# simplicity; it should probably be replaced with something like username,
# password, clusterpath or whatever).

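The sampling side of this could be as simple as the sketch below (the
{name: (low, high)} format for the ranges is made up for this example; in
reality it would come from whatever tunable-parameter tags the transforms
attach, see the note at the end):

import random

def sampleJobs(ranges, n_jobs=20):
    """Draw one uniform sample of every tunable knob for each job."""
    jobs = []
    for _ in range(n_jobs):
        jobs.append(dict((name, random.uniform(low, high))
                         for name, (low, high) in ranges.items()))
    return jobs

# e.g. sampleJobs({'learning_rate': (1e-4, 1e-1), 'noise': (0.0, 0.5)}, n_jobs=3)
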
learner.execute(steps)
# As an option, each of these output variables could contain the entire graph
# up to that point. We could also do this in a different way .. this is
# ad hoc at the moment.

Now this is a coarse vanilla SdA, which is not what we wanted. We do not have a
way of incorporating our auxiliary information into this. So what we have to do
is split/change the SdA transform. We would rewrite it as:

arch = SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1])
model = earlyStopping(pretrain(arch, data, epochs = 10), data)
...

And then rewrite things like:

arch = SGD(cross_entropy(logreg(DAAlayer([DAAlayer([524, 500], 0.1), 500], 0.1))))

We would rewrite the DAAlayer as:

layer0 = DAAlayer([524, 500], 0.1)
layer1 = cross_entropy(reconstruct(tanh(dotW_b(layer0, 500)), noise = 0.1))

At this point of detail, we can start inserting our new stuff as follows:

input = empty_layer(600)
# empty_layer is a wrapper; if I were to write dotW_b(200, 500), which means
# go from a layer of 200 units to one of 500 by multiplying with a matrix
# and adding a bias, what I would mean is dotW_b(empty_layer(200), 500).
# An implementation of empty_layer could be just theano.tensor.vector(),
# where we add the size tag (we will need it later).

hidden0_mfcc = dotW_b(input[0:524], 100)
hidden0_noise = dotW_b(input[0:560], 50)
hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]), 50)
hidden0 = tanh(join(hidden0_mfcc, hidden0_noise, hidden0_speakerID))
layer0 = cross_entropy(reconstruct(hidden0, noise = 0.1))

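For what it's worth, a Theano-flavoured sketch of what empty_layer and dotW_b
could look like underneath (the tag.size trick and these exact signatures are my
assumptions, not a fixed interface; weights are initialized to zero only to keep
the sketch short):

import numpy
import theano
import theano.tensor as T

def empty_layer(n_units):
    v = T.vector('input')
    v.tag.size = n_units                      # remember the width for later transforms
    return v

def dotW_b(layer, n_out):
    n_in = layer.tag.size
    W = theano.shared(numpy.zeros((n_in, n_out)), name='W')
    b = theano.shared(numpy.zeros(n_out), name='b')
    out = T.dot(layer, W) + b                 # affine map from n_in to n_out units
    out.tag.size = n_out
    return out
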
and so on. Hopefully you get what I mean by splitting a transform, or zooming
in. While doing all this we did not change anything about the early stopping or
launching jobs on the cluster. In the same manner, if one wanted to look
into how jobs are sent to the cluster, one could just expand that part. Note
that if we wanted to do something else we might have split the DAA
differently.

The key to this approach is to identify the low-level units that can be
shared by 90% of our architectures, and the splits that make the most sense
from a functional point of view, covering the main points where people
will want to change things. This will ensure that almost all the time we have
the low-level bits that we want to write our code into, and most of the
time we will only be working on one such bit. There will definitely be cases when
whatever we have will not be sufficient or convenient. In that case some
effort has to be invested by the user to create a different decomposition of
the problem into the elements they need.

I've been thinking about this a bit, and it definitely works for deep
networks and Theano (the approach was inspired by Theano). From what James
said, I think that other stuff might be possible to incorporate, at least as
atomic transforms if not in any other way.

TODO: one has to give some thought to these low-level transforms, to find a
suitable set of them (and of variables), so that most of the time we end up
re-using things rather than creating new ones.

NOTES: there are some other implementation details missing about what these
state variables should contain. I did not want to clutter this with the tricks
that could be used to get this transparent interface. I have a few of them in
mind though..
There are a lot of hardcoded values in this example. Usually each transform
that takes an input should "know" which of these inputs are tunable and mark
them as such. The order of the inputs in this example is important as well.
This can easily be solved at the expense of a few more lines of code that
I did not want to write.