view doc/v2_planning/datalearn.txt @ 1364:01157763c2d7
Reply to Razvan
author | Olivier Delalleau <delallea@iro> |
---|---|
date | Fri, 12 Nov 2010 11:36:30 -0500 |
parents | 18b2ebec6bca |
children | 049b99f4b323 |
DataLearn: How to plug Datasets & Learner together?
===================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader?]

High-Level Objectives
---------------------

* Simple ML experiments should be simple to write
* More complex / advanced scenarios should be possible without being forced
  to work "outside" of this framework
* Computations should be optimized whenever possible
* Existing code (in any language) should be "wrappable" within this
  framework
* It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano to be able to take advantage of its efficient
computations. The general idea is that if we chain multiple processing elements
(think e.g. of a feature selection step followed by a PCA projection, then a
rescaling within a fixed bounded interval), the overall transformation from
input to output data can be represented by a Theano symbolic graph. When one
wants to access the actual numeric data, a function is compiled so as to do
these computations efficiently.

We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how this Theano-based implementation could be achieved
exactly. For this purpose, in the following, let us assume that a dataset is
simply a matrix whose rows represent individual samples, and columns
individual features. How to handle field names, non-tensor-like data, etc. is
a very important topic that is not yet discussed in this file.

A question we did not really discuss is whether datasets should be Theano
Variables. The advantage would be that they would fit directly within the
Theano framework, which may allow high-level optimizations on data
transformations. However, we would lose the ability to combine Theano
expressions coded in individual datasets into a single graph.
Currently, we instead consider that a dataset has a member that is a Theano
variable, and this variable represents the data stored in the dataset. The same
is done for individual data samples.

James asks: Why would a Theano graph in which some nodes represent datasets
give up the ability to combine Theano expressions coded in individual datasets?
Firstly, if you want to use Theano expressions and compiled functions to
implement the perform() method of an Op, you can do that. Secondly, you can
just include those 'expressions coded in individual datasets' into the overall
graph.

OD replies to James: What I had in mind is you would be forced to compile your
own function inside the perform() method of an Op. This seemed like a potential
problem to me because it would prevent Theano from seeing the whole
fine-grained graph and doing optimizations across multiple dataset
transformations (there may also be additional overhead from calling multiple
functions). But if you are saying it is possible to include 'expressions coded
in individual datasets' into the overall graph, then I guess this point is
moot. Would this be achieved with an optimization that replaces the dataset
node with its internal graph?

Razvan comments:

1) Having Theano expressions inside the perform of a Theano Op can lead to
   issues. I know I had to deal with a few when implementing Scan, which does
   exactly this. Well, to be fair, these issues mostly come into play when the
   inner graph has to interact with the outer graph, and most of the time they
   can be solved. I guess all that I'm saying is that going that way might
   lead to some headaches for developers, though I guess some headaches will
   be involved no matter what.

2) In my view (I'm not sure this is what Olivier was saying) the idea of not
   putting the Dataset into a Variable is to not put the logic related to
   loading data, dividing it into slices when running it on the GPU and so on
   into a Theano variable. In my view this logic goes into a DataSet class
   that gives you shared variables, symbolic indices into those shared
   variables, and also numeric indices. When looping through those numeric
   indices, the dataset class can reload parts of the data into the shared
   variable and so on.

OD replies to Razvan's point 2: I think what you are saying is another concern
I had, which was the fact it may be confusing to mix in the same class the
Variable/Op and DataSet interfaces. I would indeed prefer to keep them
separate. However, it may be possible to come up with a system that would get
the best of both worlds (maybe by having the Op/Variable as members of
Dataset, and just asking the user building a Theano graph to use these instead
of the dataset directly). Note that I'm mixing up Op/Variable here, because
it's just not clear yet for me which would go where...

One issue with this approach is illustrated by the following example. Imagine
we want to iterate on samples in a dataset and do something with their
numeric value. We would want the code to be as close as possible to:

.. code-block:: python

    for sample in dataset:
        do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample
contains a ``variable`` member which is the variable representing this
sample's data):

.. code-block:: python

    def numeric_value(self):
        if self.function is None:
            # Compile function to output the numeric value stored in this
            # sample's variable.
            self.function = theano.function([], self.variable)
        return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

.. code-block:: python

    def numeric_value(self):
        if self.function_storage[0] is None:
            # Compile function to output the numeric value stored in this
            # sample's variable. This function takes as input the index of
            # the sample in the dataset, and is shared among all samples.
            self.function_storage[0] = theano.function(
                [self.symbolic_index], self.variable)
        # Call the shared compiled function with this sample's own index.
        return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by the action of
iterating over the dataset share the same ``function_storage``,
``symbolic_index`` and ``variable``: the first time we try to access the
numeric value of some sample, a function is compiled that takes the index as
input and outputs the variable. The only difference between samples is thus
that they are given a different numeric value for the index
(``numeric_index``).

Another way to obtain the same result is to actually let the user take care of
compiling the function. It would allow the user to really control what is
being compiled, at the cost of having to write more code:

.. code-block:: python

    symbolic_index = dataset.get_index() # Or just theano.tensor.iscalar()
    get_sample = theano.function([symbolic_index],
                                 dataset[symbolic_index].variable)
    for numeric_index in xrange(len(dataset)):
        do_something_with(get_sample(numeric_index))

James comments: this is how I have written the last couple of projects, it's
slightly verbose but it's clear and efficient.

<Razvan comments>: I assume that ``do_something_with`` is supposed to be some
numeric function, and dataset in this case is the result of some computations
on an initial dataset. I would differentiate the two approaches, (1) where
compilation happens lazily inside ``numeric_value`` and (2) where the user
compiles the function, as:

- first of all, whatever you can do with (1) you can do with (2)
- approach (1) hides the fact that you are working with symbolic graphs. You
  apply functions to datasets, and when you want to see values a function is
  compiled under the hood and those values are computed for you. In approach
  (2) the fact that you deal with a symbolic graph is explicit because you
  have to manually compile your functions.
- approach (1) needs to use this function_storage trick shared between
  certain nodes of the graph to reduce the number of compilations, while in
  approach (2) we don't need to deal with the complexity of lazy compilation

  OD comments: Well, to be fair, it means we put the burden of dealing with
  the complexity of lazy compilation on the user (it's up to him to make sure
  he compiles only one function).

- approach (1) needs a replace function if you want to change the dataset.
  What you would do is, once you have a "computational graph" or pipeline or
  whatever you call it, say ``graph``, to change the input you would do
  ``graph.replace({init_data_X: new_data_X})``. In approach (2) the
  init_data_X and new_data_X is the ``dataset``, so you would compile two
  different functions. Well, I would re-write (2) -- to make the above more
  clear -- as:

  .. code-block:: python

      symbolic_index = theano.tensor.iscalar()
      get_sample1 = theano.function(
          [symbolic_index], graph(dataset[symbolic_index]).variable)
      for numeric_index in xrange(len(dataset)):
          do_something_with(get_sample1(numeric_index))

      get_sample2 = theano.function(
          [symbolic_index], graph(new_dataset[symbolic_index]).variable)
      ## Note: the dataset was replaced with new_dataset
      for numeric_index in xrange(len(new_dataset)):
          do_something_with(get_sample2(numeric_index))

      ######### FOR (1) you write:

      for datapoint in graph:
          do_something_with(datapoint())

      new_graph = graph.replace({dataset: new_dataset})

      for datapoint in new_graph:
          do_something_with(datapoint())

  OD comments: I don't really understand what 'graph' is in this code (it
  appears in both approaches but is used differently). What I have in mind
  would be more with 'graph' removed in the first approach you describe (#2),
  and graph / new_graph replaced by dataset / new_dataset in the second one
  (#1). You wouldn't need to call some graph.replace method: the graphs
  compiled for iterating on 'dataset' and 'new_dataset' would be entirely
  separate (using two different compiled functions, pretty much like #2).

  RP answers: Yes, you are right. What I was trying to say is that if you have
  two different datasets on which you want to apply the same pre-processing,
  you can do that in both approaches. ``graph`` represents the pre-processing
  steps in (2) and the end dataset (after preprocessing) in (1). So the idea
  is that instead of making new_graph from scratch (re-applying all the
  transforms on the original dataset) you can use replace. Or maybe the
  ``__call__`` (that compiles the function if needed) can get a givens
  dictionary (that replaces datasets or more). I only gave this argument
  because I thought this would be an issue people would raise. They will say:
  well, in (2) the pipeline logic is separated from the data, so you can use
  the same transformation with different data easily, while in (1) you write
  the transformation rooted in a dataset, and if you want the same
  transformation for a different dataset you have to re-write everything.

  OD replies: Still not sure I understand. If you have a "graph" function that
  takes a dataset as input and outputs a new dataset, you can use this same
  function with both (1) and (2). With (2) it is:

  .. code-block:: python

      theano.function([index], graph(my_dataset)[index].variable)

  while with (1) the same function is compiled implicitly with:

  .. code-block:: python

      for sample in graph(my_dataset):
          ...

- in approach (1) the initial dataset object (the one that loads the data),
  and not the user, decides whether you will use shared variables and indices
  to deal with the dataset or ``theano.tensor.matrix`` (at least not without
  hacking the code). Of course whoever writes that class can add a flag to it
  to switch between behaviours that make sense. In approach (2) one is not
  forced to do this inside that class by construction, though by convention I
  would do it. So if you consider the one who writes that class as a
  developer, then in (2) the user can decide/deal with this and not the
  developer. Though this is a fine line -- I would say the user would actually
  write that class as well using some template. That is to say, (2) looks and
  feels more like working with Theano directly.

Bottom line, I think (1) puts more stress on the development of the library,
and hides Theano and some of the complexity for day-to-day usage. In (2)
everything is a bit more explicit, leaving the impression that you have more
control over the code, though I strongly feel that whatever can be done in (2)
can be done in (1). Traditionally I was more inclined towards (1) but now I'm
not that sure; I think both are equally interesting and valid options.
</Razvan comments>

Note that although the above example focused on how to iterate over a dataset,
it can be cast into a more generic problem, where some data (either dataset or
sample) is the result of some transformation applied to other data, which is
parameterized by parameters p1, p2, ..., pN (in the above example, we were
considering a sample that was obtained by taking the p1-th element in a
dataset). If we use different values for a subset Q of the parameters but keep
other parameters fixed, we would probably want to compile a single function
that takes as input all parameters in Q, while other parameters are fixed.
Ideally it would be nice to let the user take control of what is being
compiled, while leaving the option of using a sensible default behavior for
those who do not want to worry about it. How to achieve this is still to be
determined.

Razvan Comment: I thought about this a bit at the Pylearn level. In my
original train of thought you would have the distinction between ``hand
picked parameters``, which I would call hyper-parameters, and learned
parameters.
A transformation in this framework (an op if you wish) could take as inputs
DataSet(s), DataField(s), Parameter(s) (which are the things that the learner
should adapt) and HyperParameter(s). All hyper-parameters will turn into
arguments of the compiled function (like the indices of each of the dataset
objects) and therefore they can be changed without re-compilation. Or in other
words, this can be easily done by having new types of Variables that would
represent Parameters and Hyper-parameters. And as an ending note I would say
that there are hyper-parameters for which you need to recompile the Theano
function and which cannot be just parameters (so we would have yet another
category?).

James: Another syntactic option for iterating over datasets is

.. code-block:: python

    for sample in dataset.numeric_iterator(batchsize=10):
        do_something_with(sample)

The numeric_iterator would create a symbolic batch index, and compile a single
function that extracts the corresponding minibatch. The arguments to the
numeric_iterator function can also specify what compile mode to use, any
givens you might want to apply, etc.

OD comments: Would there also be some kind of function cache to avoid
compiling the same function again if we re-iterate on the same dataset with
the same arguments? Maybe a more generic issue is: would there be a way for
Theano to be more efficient when re-compiling the same function that was
already compiled in the same program? (note that I am assuming here it is not
efficient, but I may be wrong).

What About Learners?
--------------------

The discussion above only mentioned datasets, but not learners. The learning
part of a learner is not a main concern (currently). What matters most w.r.t.
what was discussed above is how a learner takes as input a dataset and outputs
another dataset that can be used with the dataset API.

James asks: What's wrong with simply passing the variables corresponding to
the dataset to the constructor of the learner?
That seems much more flexible, compact, and clear than the decorator.

OD replies: Not sure I understand your idea here. We probably want a learner
to be able to compute its output on multiple datasets, without having to point
to these datasets within the learner itself (which seems cumbersome to me).
The point of the decorators is mostly to turn a single function (that outputs
a Theano variable for the output computed on a single sample) into a function
that can compute symbolic datasets as well as numeric sample outputs. Those
could also be instead different functions in the base Learner class if the
decorator approach is considered ugly / confusing.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as a
``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
and the classification error). We would want to be able to build a dataset
that contains some of these quantities computed on each sample in the input
dataset.

The Neural Network code would then look something like this:

.. code-block:: python

    class NeuralNetwork(Learner):

        @datalearn(..)
        def compute_prediction(self, sample):
            return softmax(theano.tensor.dot(self.weights, sample.input))

        @datalearn(..)
        def compute_nll(self, sample):
            return - log(self.compute_prediction(sample)[sample.target])

        @datalearn(..)
        def compute_penalized_nll(self, sample):
            return (self.compute_nll(sample) +
                    theano.tensor.sum(self.weights**2))

        @datalearn(..)
        def compute_class_error(self, sample):
            probabilities = self.compute_prediction(sample)
            predicted_class = theano.tensor.argmax(probabilities)
            # Symbolic inequality: a plain ``!=`` would be evaluated by
            # Python instead of being added to the graph.
            return theano.tensor.neq(predicted_class, sample.target)

        @datalearn(..)
        def compute_cost(self, sample):
            return theano.tensor.concatenate([
                    self.compute_penalized_nll(sample),
                    self.compute_nll(sample),
                    self.compute_class_error(sample),
                    ])

The ``@datalearn`` decorator would be responsible for allowing such a Learner
to be used e.g. like this:

.. code-block:: python

    nnet = NeuralNetwork()
    predict_dataset = nnet.compute_prediction(dataset)
    for sample in dataset:
        predict_sample = nnet.compute_prediction(sample)
    predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
    multiple_fields_dataset = ConcatDataSet([
            nnet.compute_prediction(dataset),
            nnet.compute_cost(dataset),
            ])

In the code above, if one wants to obtain the numeric value of an element of
``multiple_fields_dataset``, the Theano function being compiled would be able
to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.

Razvan asks: What is predict_sample for? What is predict_dataset? What I guess
you mean is that the decorator is used to convert a function that takes a
Theano variable and outputs a Theano variable into a class/function that
takes a DataField/DataSet and outputs a DataField/DataSet. It could also
register all those different functions, so that the Dataset you get out of
the entire Learner (returned by ``__call__``, not by one of the individual
functions) would contain all of them as fields. I would use it like this:

.. code-block:: python

    nnet = NeuralNetwork()
    results = nnet(dataset)
    for datapoint in results:
        print datapoint.prediction, datapoint.nll, ...

Is this close to what you are suggesting?

OD: Yes, you guessed right, the decorator's role is to do something different
depending on the input to the function (see my reply to James above).
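
To make the "do something different depending on the input" idea concrete,
here is a minimal sketch of such a dispatching decorator. It is only an
illustration under strong assumptions: it uses plain Python values instead of
Theano variables, the decorator takes no arguments (unlike the
``@datalearn(..)`` form above, whose arguments are not specified in this
file), and ``Sample``, ``LazyDataSet`` and ``ThresholdLearner`` are
hypothetical names invented for the example, not part of any existing API.

```python
import functools


class Sample(dict):
    """Hypothetical sample: a mapping from field name to value."""


class LazyDataSet(object):
    """Hypothetical dataset applying `fn` to each sample of `source` on demand."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __len__(self):
        return len(self.source)

    def __iter__(self):
        # Nothing is computed until someone actually iterates.
        return (self.fn(sample) for sample in self.source)


def datalearn(method):
    """Let a per-sample method also accept a whole dataset.

    Assumption of this sketch: any dict (including Sample) is a single
    sample, and anything else is an iterable dataset.
    """
    @functools.wraps(method)
    def wrapper(self, data):
        if isinstance(data, dict):
            # Single sample: compute the output directly.
            return method(self, data)
        # Dataset: return a new dataset whose samples are computed lazily.
        return LazyDataSet(data, lambda sample: method(self, sample))
    return wrapper


class ThresholdLearner(object):
    """Toy learner standing in for NeuralNetwork."""

    def __init__(self, threshold):
        self.threshold = threshold

    @datalearn
    def compute_prediction(self, sample):
        # Predict class 1 when the inputs sum to more than the threshold.
        return int(sum(sample['input']) > self.threshold)


dataset = [Sample(input=[0.0, 0.2]), Sample(input=[0.9, 0.8])]
learner = ThresholdLearner(threshold=1.0)
predict_dataset = learner.compute_prediction(dataset)
predict_numeric = learner.compute_prediction({'input': [2.0, 0.0]})
```

In this sketch ``predict_dataset`` is itself a dataset that can be iterated
over, while ``predict_numeric`` is a plain value. In the actual proposal the
per-sample method would return a Theano variable, and the dataset branch
would presumably build a symbolic dataset rather than map a Python function
over the samples.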