# HG changeset patch
# User Razvan Pascanu
# Date 1289533201 18000
# Node ID 7548dc1b163c239ac838cb5779e7969ed232add3
# Parent  f81b3b6f969800412284b122e2421f71e960b3d6
Some questions/suggestions for datalearn

diff -r f81b3b6f9698 -r 7548dc1b163c doc/v2_planning/datalearn.txt
--- a/doc/v2_planning/datalearn.txt	Thu Nov 11 18:08:05 2010 -0500
+++ b/doc/v2_planning/datalearn.txt	Thu Nov 11 22:40:01 2010 -0500
@@ -53,6 +53,24 @@
 just include those 'expressions coded in individual datasets' into the
 overall graph.
 
+Razvan comments: 1) Having Theano expressions inside the ``perform`` of a
+Theano Op can lead to issues; I had to deal with a few of them when
+implementing Scan, which does exactly this. To be fair, these issues mostly
+come into play when the inner graph has to interact with the outer graph,
+and most of the time they can be solved. All I am saying is that going down
+that road may mean some headaches for developers, though some headaches
+will be involved no matter what.
+2) In my view (I am not sure this is what Olivier was saying), the point of
+not putting the Dataset into a Variable is to keep the logic related to
+loading the data, splitting it into slices when running on the GPU, and so
+on, out of the Theano variable. This logic belongs in a DataSet class that
+gives you shared variables, symbolic indices into those shared variables,
+and also numeric indices. When looping over the numeric indices, the
+dataset class can reload parts of the data into the shared variable, and so
+on.
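+
+A minimal sketch of what I have in mind for such a DataSet class (the names
+``DataSet``, ``chunk_size`` and ``numeric_indices`` are only placeholders
+chosen for illustration):
+
+.. code-block:: python
+
+    import theano
+    import theano.tensor as T
+
+    class DataSet(object):
+        """Pair a shared variable holding a chunk of data with indices."""
+
+        def __init__(self, data, chunk_size):
+            self.data = data              # the full numeric data
+            self.chunk_size = chunk_size
+            # the chunk currently visible to Theano lives in a shared variable
+            self.shared = theano.shared(data[:chunk_size], name='data_chunk')
+            # symbolic index, and the symbolic sample built from it
+            self.index = T.iscalar('index')
+            self.variable = self.shared[self.index]
+
+        def numeric_indices(self):
+            """Yield numeric indices, reloading the shared variable chunk
+            by chunk so that only part of the data has to sit on the GPU."""
+            for start in xrange(0, len(self.data), self.chunk_size):
+                chunk = self.data[start:start + self.chunk_size]
+                self.shared.set_value(chunk, borrow=True)
+                for i in xrange(len(chunk)):
+                    yield i  # index within the chunk held by ``self.shared``
+
+    # Usage would roughly look like:
+    #   dataset = DataSet(numpy_array, chunk_size=1000)
+    #   f = theano.function([dataset.index], some_graph(dataset.variable))
+    #   for i in dataset.numeric_indices():
+    #       do_something_with(f(i))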
+
+
 One issue with this approach is illustrated by the following example. Imagine
 we want to iterate on samples in a dataset and do something with their
 numeric value. We would want the code to be as close as possible to:
 
@@ -111,6 +129,75 @@
 James comments: this is how I have written the last couple of projects, it's
 slightly verbose but it's clear and efficient.
 
+Razvan: I assume that ``do_something_with`` is supposed to be some numeric
+function, and that ``dataset`` in this case is the result of some
+computations on an initial dataset.
+I would differentiate the two approaches (1) and (2) as follows:
+ - first of all, whatever you can do with (1) you can do with (2);
+ - approach (1) hides the fact that you are working with symbolic graphs.
+   You apply functions to datasets, and when you want to see values a
+   function is compiled under the hood and those values are computed for
+   you. In approach (2) the fact that you deal with a symbolic graph is
+   explicit, because you have to compile your functions manually;
+ - approach (1) needs the ``function_storage`` trick, shared between
+   certain nodes of the graph, to reduce the number of compilations, while
+   in approach (2) we do not need to deal with the complexity of lazy
+   compilation;
+ - approach (1) needs a replace function if you want to change the dataset.
+   Once you have a "computational graph" (or pipeline, or whatever you call
+   it), say ``graph``, to change the input you would do
+   ``graph.replace({init_data_X: new_data_X})``. In approach (2),
+   ``init_data_X`` and ``new_data_X`` are the datasets themselves, so you
+   would simply compile two different functions. To make this clearer, I
+   would re-write (2) as:
+
+   .. code-block:: python
+
+      symbolic_index = theano.tensor.iscalar()
+      get_sample1 = theano.function(
+          [symbolic_index], graph(dataset[symbolic_index]).variable)
+      for numeric_index in xrange(len(dataset)):
+          do_something_with(get_sample1(numeric_index))
+
+      get_sample2 = theano.function(
+          [symbolic_index], graph(new_dataset[symbolic_index]).variable)
+      ## Note: the dataset was replaced with new_dataset
+      for numeric_index in xrange(len(new_dataset)):
+          do_something_with(get_sample2(numeric_index))
+
+      ######### FOR (1) you write:
+
+      for datapoint in graph:
+          do_something_with(datapoint())
+
+      new_graph = graph.replace({dataset: dataset2})
+
+      for datapoint in new_graph:
+          do_something_with(datapoint())
+
+ - in approach (1) it is the initial dataset object (the one that loads the
+   data), and not the user, that decides whether you will use shared
+   variables and indices to deal with the dataset or a plain
+   ``theano.tensor.matrix`` (at least not without hacking the code). Of
+   course, whoever writes that class can add a flag to switch between
+   behaviours. In approach (2) one is not forced by construction to make
+   this choice inside that class, though by convention I would do it there.
+   So if you think of the person who writes that class as a developer, then
+   in (2) it is the user, and not the developer, who decides/deals with
+   this. It is a fine line though -- I would say the user would actually
+   write that class as well, using some template. That is to say, (2) looks
+   and feels more like working with Theano directly.
+
+Bottom line: I think (1) puts more stress on the development of the
+library, and hides Theano and some of the complexity from day-to-day usage.
+In (2) everything is a bit more explicit, leaving the impression that you
+have more control over the code, though I strongly feel that whatever can
+be done in (2) can be done in (1). Traditionally I was more inclined
+towards (1), but now I am not so sure; I think both are equally interesting
+and valid options.
+
+
 Note that although the above example focused on how to iterate over a dataset,
 it can be cast into a more generic problem, where some data (either dataset or
 sample) is the result of some transformation applied to other data, which is
@@ -124,6 +211,19 @@
 those who do not want to worry about it. How to achieve this is still to be
 determined.
 
+Razvan Comment: I thought about this a bit at the Pylearn level. In my
+original train of thought there would be a distinction between hand-picked
+parameters, which I would call hyper-parameters, and learned parameters. A
+transformation in this framework (an op, if you wish) could take as inputs
+DataSet(s), DataField(s), Parameter(s) (the things the learner should
+adapt) and HyperParameter(s). All hyper-parameters would turn into
+arguments of the compiled function (like the indices of each of the dataset
+objects), and therefore they could be changed without re-compilation. In
+other words, this can easily be done by having new types of Variables that
+represent Parameters and Hyper-parameters.
+As an ending note, there are hyper-parameters for which you do need to
+recompile the theano function and which cannot just be function arguments
+(so we would have yet another category?).
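+
+To make the distinction concrete, here is a small sketch (the variable
+names are only illustrative): the learned Parameter lives in a shared
+variable, while the HyperParameter becomes an argument of the compiled
+function, so it can be changed between calls without re-compilation.
+
+.. code-block:: python
+
+    import numpy
+    import theano
+    import theano.tensor as T
+
+    x = T.matrix('x')                        # symbolic data (dataset slice)
+    w = theano.shared(numpy.zeros(5), 'w')   # learned Parameter
+    lr = T.scalar('lr')                      # HyperParameter
+
+    cost = (T.dot(x, w) ** 2).sum()
+    train = theano.function([x, lr], cost,
+                            updates=[(w, w - lr * T.grad(cost, w))])
+
+    batch = numpy.random.rand(3, 5)
+    train(batch, 0.1)    # the learning rate changes between calls ...
+    train(batch, 0.01)   # ... without recompiling ``train``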
 
 Another syntactic option for iterating over datasets is
 
@@ -212,3 +312,21 @@
 to optimize computations so that the simultaneous computation of
 ``prediction`` and ``cost`` is done efficiently.
 
+Razvan asks: What is predict_sample for? What is predict_dataset for? What
+I guess you mean is that the decorator is used to convert a function that
+takes a theano variable and outputs a theano variable into a class/function
+that takes a DataField/DataSet and outputs a DataField/DataSet. It could
+also register all those different functions, so that the Dataset you get
+out of the entire Learner (not out of one of the functions; this Dataset is
+returned by __call__) would contain all of them as fields.
+I would use it like this:
+
+.. code-block:: python
+
+    nnet = NeuralNetwork()
+    results = nnet(dataset)
+    for datapoint in results:
+        print datapoint.prediction, datapoint.nll, ...
+
+Is this close to what you are suggesting?
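+
+If so, here is a rough sketch of what I imagine the decorator doing (the
+names ``datalearn`` and ``DataField`` and the minimal classes below are all
+made up, just to make the question concrete):
+
+.. code-block:: python
+
+    import numpy
+    import theano
+    import theano.tensor as T
+
+    class DataField(object):
+        """Minimal stand-in: wraps a Theano variable."""
+        def __init__(self, variable, name=None):
+            self.variable = variable
+            self.name = name
+
+    def datalearn(fn):
+        """Lift a function on Theano variables to one on DataFields."""
+        def wrapper(self, field):
+            out_var = fn(self, field.variable)   # build the symbolic output
+            # the Learner could also register ``out_var`` under
+            # ``fn.__name__`` so that the Dataset returned by __call__
+            # exposes it as a field
+            return DataField(out_var, name=fn.__name__)
+        return wrapper
+
+    class NeuralNetwork(object):
+        def __init__(self, n_in=5, n_out=3):
+            self.W = theano.shared(numpy.zeros((n_in, n_out)), 'W')
+            self.b = theano.shared(numpy.zeros(n_out), 'b')
+
+        @datalearn
+        def prediction(self, x):
+            # ``x`` is a plain Theano variable here
+            return T.nnet.softmax(T.dot(x, self.W) + self.b)
+
+    # e.g. NeuralNetwork().prediction(DataField(T.matrix('x'))) returns a
+    # DataField whose ``variable`` is the symbolic prediction.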