comparison doc/v2_planning/learner.txt @ 1056:19033ef1636d

some more details on my approach
author Razvan Pascanu <r.pascanu@gmail.com>
date Thu, 09 Sep 2010 11:28:24 -0400
parents bc3f7834db83
children f082a6c0b008
if). Another difference is that the graph I had in mind doesn't feel fractal -
it would be very common for a graph edge to be atomic. A proxy pattern, such as
in a hyper-learner, would create a notion of being able to zoom in, but other
than that, I'm not sure what you mean.

RP replies: I've been thinking about my idea a bit and yes, it might be
quite different from what James has in mind, though there are plenty of common
elements. I might have exaggerated a bit with the zooming in, so in some cases
you will end up with atomic edges, though my hope is that this will not be the
case for most edges.

I think I should go into more detail when answering this question, because
I feel I have not explained things sufficiently clearly. Note that in many
places I replaced the word "function" with "transform".

Think of the learner as an object that traverses a DAG of steps created by the
user. On this DAG the learner can potentially do a lot of cool stuff, but we
won't care about that for now. The DAG can in principle be infinite, and what
the learner does is simply follow the path described by the user (and here
"described" is not through heuristics as in James's case, but by giving the
list of edges it needs to follow). A potentially cool thing the learner could
do is to regard the path given by the user as a suggestion (or some form of
heuristic) and try to improve it. This would be much closer to what James has
in mind, and I definitely think it is a cool way to go about it.

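Just to make the traversal idea concrete, here is a minimal sketch of such a
learner. All the names below are mine, picked only for illustration; this is
not a proposed API.

class Variable(object):
    # A node of the DAG: it only holds state, it has no behaviour of its own.
    def __init__(self, value=None):
        self.value = value

class Learner(object):
    # The learner does nothing clever here: it just follows the path the user
    # gave it, one edge at a time.
    def execute(self, path):
        # ``path`` is an ordered list of edges (transform, inputs, output),
        # where ``transform`` is a stateless callable.
        for transform, inputs, output in path:
            output.value = transform(*[var.value for var in inputs])
        return path[-1][2]
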
Now, this path in the graph is given by the user by composing subgraphs or
adding nodes to the graph, or (to put it more simply) by applying functions to
variables. Any such function will introduce an edge (or a subgraph) that
connects the vertices corresponding to the input variables to the vertices
corresponding to the output variables. The variables store the state of the
learner. These functions are stateless; I think that giving them state would
make this approach really ugly (I might be wrong). The variables would contain
the information required by the function, like the number of layers, how many
cores to run on, the cluster configuration, and so on.

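Continuing the hypothetical sketch above (and reusing its Variable class),
"applying a function to variables" could simply mean recording a new edge and
returning the output variable; nothing is computed until the learner walks the
graph:

class Transform(object):
    # A stateless edge (or subgraph) of the DAG. All state lives in the
    # Variables it connects.
    def __init__(self, fn, graph):
        self.fn = fn          # the stateless computation this edge performs
        self.graph = graph    # the list of edges built up so far

    def __call__(self, *inputs):
        output = Variable()
        self.graph.append((self.fn, list(inputs), output))
        return output

A user-facing transform such as the SdA or earlyStopping used below would then
just be something like this, wrapping the corresponding stateless function.
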
Now, about the zooming part that James asked about. I might have exaggerated a
bit; it is not that you can zoom in on any part infinitely. You will end up
with things that are atomic. The idea is that any such "transformation" or
edge has the potential to be split up into several "transformations". This
offers (in my view) a way of dealing with the time constraints of our project.
We can start by defining a coarse division into segments. For now we can have
a structure transform that turns a list of parameters into a deep network of
some type, then a learner transform that adds SGD + pre-training on top of the
network, then an early stopper on top of that, and then a run_on_cluster on
top of that. We would probably want something more finely grained even from
the start ... this is just to prove my point. When any of us starts
experimenting with a certain sub-step of this process (like the structure) we
will split that transform into several (like ones that create a layer and so
on) that make sense for that case, and then start working on the low-level
transform that we care about (like the layer), introducing new versions of it.
I think we cannot find a universal split that will cover all of our cases, so
I think we should allow different such splits. Whoever is doing the research
should look at what low-level transforms are available and use those if they
make sense; if not, they would have to create a different split. Creating a
different split might involve a lot of work and taking care of several issues,
so it should be done with care.

I'll give an example from where I started thinking this way. Let's say we want
to do the SdA with auxiliary inputs that encourage separation of the features
in the hidden layer, which Yoshua was talking about (I had an attempt at it
some time ago for speech, but I never ended up finishing that project).

You start with something like:

learner = Learner()
# This will create the learner that will traverse our graph. We might want it
# to be a function ``execute``; I just randomly picked this option. I have no
# preference about this detail for now ... this is mostly work in progress.

data = someSpeechData(path = 'some path')
# This is a transform that will generate, from the string representing the
# path, a dataset variable (one that contains all the information you need to
# access the data). This will probably be the object the datasets committee
# will provide. Note that you might need to provide more information than the
# path, but you can easily see how to do that. All of these transforms start
# from simple variables like the path, the batch size and so on, and return a
# complex heavy-duty variable (node).


model = earlyStopping(pretrain(SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1]), data, epochs = 10), data)
# This is a composition of several transforms. The SdA transform starts from
# the info about the layers and the corruption/noise for each layer and
# constructs an SdA. This is a high-level transform, so it will take care of
# defining all the details, like pre-training, defining the cost and so on.
# Note that it may require some more parameters ... you can assume that for
# anything else there is a default value that the SdA will use. earlyStopping
# is yet another transform that takes a model (that we know how to train) and
# some data, and does early stopping on it. For brevity I did not provide all
# the required information, like the patience and so on. The SdA only knows
# how to do a step of training. The same holds for pretrain: it will loop over
# the layers of the SdA and train each one.

steps = cluster(model, getPropertiesAndRanges(model), n_jobs = 20, cluster_info = getClusterInfo())
# This will launch the wanted jobs. getPropertiesAndRanges will get from a
# model all the knobs that need to be turned, together with their ranges, and
# will uniformly sample from them for each job. getClusterInfo will return a
# variable containing information about the cluster (I added this for
# simplicity; it should probably be replaced with something like username,
# password, clusterpath or whatever).

learner.execute(steps)
# As an option, each of these output variables could contain the entire graph
# up to that point. We could also arrange this in a different way ... this is
# ad hoc at the moment.

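Since getPropertiesAndRanges is only named above, not defined, here is a rough
sketch of what it could do. Purely for illustration it assumes that tunable
variables carry a hypothetical ``tunable_range`` tag (see the notes at the end
about marking tunable inputs), and it takes the list of edges behind the model
rather than the model variable itself, just to keep it short.

import random

def getPropertiesAndRanges(edges):
    # Collect every input variable that was marked as a knob, with its range.
    knobs = {}
    for transform, inputs, output in edges:
        for var in inputs:
            if getattr(var, 'tunable_range', None) is not None:
                knobs[var] = var.tunable_range
    return knobs

def sample_job(knobs):
    # One cluster job corresponds to one uniform sample of every knob.
    return dict((var, random.uniform(low, high))
                for var, (low, high) in knobs.items())
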
Now, this is a coarse vanilla SdA, which is not what we wanted. We do not have
a way of incorporating our auxiliary information in this. So what we have to
do is split/change the SdA transform. We would rewrite it as:


arch = SdA(layers = [524, 500, 500, 27], noise = [0.1, 0.1])
model = earlyStopping(pretrain(arch, data, epochs = 10), data)
...

And then rewrite things like:

arch = SGD(cross_entropy(logreg(DAAlayer([DAAlayer([524, 500], 0.1), 500], 0.1))))


We would rewrite the DAAlayer as:

layer0 = DAAlayer([524, 500], 0.1)
layer1 = cross_entropy(reconstruct(tanh(dotW_b(layer0, 500)), noise = 0.1))

At this point of detail, we can start inserting our new stuff as follows:

input = empty_layer(600)
# empty_layer is a wrapper; if I were to write dotW_b(200, 500), which means
# going from a layer of 200 units to one of 500 by multiplying with a matrix
# and adding a bias, what I would mean is dotW_b(empty_layer(200), 500).
# An implementation of empty_layer could be just theano.tensor.vector(),
# where we add the size tag (we will need it later).
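# A minimal sketch of such a wrapper, assuming the Theano API; the size is
# simply attached to the variable's .tag scratchpad so later transforms can
# read it:
#
#     import theano.tensor as T
#
#     def empty_layer(n_units):
#         v = T.vector()
#         v.tag.size = n_units   # remember the width for later transforms
#         return v
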
hidden0_mfcc = dotW_b(input[0:524], 100)
hidden0_noise = dotW_b(input[0:560], 50)
hidden0_speakerID = dotW_b(join(input[0:524], input[560:600]), 50)
hidden0 = tanh(join(hidden0_mfcc, hidden0_noise, hidden0_speakerID))
layer0 = cross_entropy(reconstruct(hidden0, noise = 0.1))

and so on. Hopefully you get what I mean by splitting a transform, or zooming
in. While doing all this we did not change anything about the early stopping
or about launching jobs on the cluster. In the same manner, if someone wanted
to look into how jobs are sent to the cluster, they could just expand that
part. Note that if we wanted to do something else we might have split the DAA
differently.

The key to this approach is to identify the low-level units that can be
shared by 90% of our architectures, and the splits that make the most sense
from a functional point of view, covering the main points where people will
want to change things. This will ensure that almost all the time we have the
low-level bits that we want to write our code into, and most of the time we
will only be working on one of those bits. There will definitely be cases
where whatever we have will not be sufficient or convenient. In that case some
effort has to be invested by the user to create a different decomposition of
the problem into the elements they need.

I've been thinking about this a bit, and it definitely works for deep
networks and Theano (the approach was inspired by Theano). From what James
said, I think that other stuff might be possible to incorporate, at least as
atomic transforms if not in any other way.

TODO: one has to give some thought to these low-level transforms, to find a
suitable set of them (and of variables), so that most of the time we end up
re-using things rather than creating new things.

NOTES: there are some other implementation details missing, such as what these
state variables should contain. I did not want to clutter this with the tricks
that could be used to get this transparent interface; I have a few of them in
mind though.
There are a lot of hardcoded values in this example. Usually each transform
that takes an input should "know" which of these inputs are tunable and mark
them as such. The order of the inputs in this example matters as well. This
can easily be solved at the expense of a few more lines of code that I did not
want to write.
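
As a rough illustration of the "mark them as such" part (a hypothetical
helper, tying into the getPropertiesAndRanges sketch earlier):

def tunable(variable, low, high):
    # Mark a Variable as a knob the cluster transform is allowed to turn;
    # getPropertiesAndRanges would then pick it up together with its range.
    variable.tunable_range = (low, high)
    return variable

# e.g., inside the definition of a transform such as SdA (made-up usage):
#   noise = tunable(Variable(0.1), 0.0, 0.5)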