# HG changeset patch
# User Olivier Delalleau
# Date 1289839893 18000
# Node ID 9474fb4ad10978b09ca6c533b34c4988c2219984
# Parent f945ed016c6811cb431d860eee7da99e143f3161
Refactored datalearn committee file to be easier to read

diff -r f945ed016c68 -r 9474fb4ad109 doc/v2_planning/datalearn.txt
--- a/doc/v2_planning/datalearn.txt Fri Nov 12 13:49:13 2010 -0500
+++ b/doc/v2_planning/datalearn.txt Mon Nov 15 11:51:33 2010 -0500
@@ -5,7 +5,7 @@
 ------------
 - Yoshua
 - Razvan
-- Olivier D [leader?]
+- Olivier D [leader]
 
 High-Level Objectives
 ---------------------
@@ -37,14 +37,182 @@
 individual features. How to handle field names, non-tensor-like data, etc. is
 a very important topic that is not yet discussed in this file.
 
-A question we did not really discuss is whether datasets should be Theano
-Variables. The advantage would be that they would fit directly within the
-Theano framework, which may allow high level optimizations on data
-transformations. However, we would lose the ability to combine Theano
-expressions coded in individual datasets into a single graph. Currently, we
-instead consider that a dataset has a member that is a Theano variable, and
-this variable represents the data stored in the dataset. The same is done for
-individual data samples.
+A question we did not discuss much is to what extent the architecture could
+be "theanified", i.e. whether a whole experiment could be defined as a Theano
+graph on which high level optimizations could be performed, while also
+relying on Theano to "run" the graph. The other option is to use a different
+mechanism, with underlying Theano graphs being built wherever possible to link
+the various components of an experiment together.
+
+For now, let us consider the latter option, where each dataset contains a
+pointer to a Theano variable that represents the data stored in this dataset.
+One issue with this approach is illustrated by the following example. Imagine
+we want to iterate over samples in a dataset and do something with their
+numeric value. We would want the code to be as close as possible to:
+
+    .. code-block:: python
+
+        for sample in dataset:
+            do_something_with(sample.numeric_value())
+
+A naive implementation of the sample API could be (assuming each sample also
+contains a ``variable`` member which is the variable representing this
+sample's data):
+
+    .. code-block:: python
+
+        def numeric_value(self):
+            if self.function is None:
+                # Compile function to output the numeric value stored in this
+                # sample's variable.
+                self.function = theano.function([], self.variable)
+            return self.function()
+
+However, this is not a good idea, because it would trigger a new function
+compilation for each sample. Instead, we would want something like this:
+
+    .. code-block:: python
+
+        def numeric_value(self):
+            if self.function_storage[0] is None:
+                # Compile function to output the numeric value stored in this
+                # sample's variable. This function takes as input the index of
+                # the sample in the dataset, and is shared among all samples.
+                self.function_storage[0] = theano.function(
+                    [self.symbolic_index], self.variable)
+            return self.function_storage[0](self.numeric_index)
+
+In the code above, we assume that all samples created by the action of
+iterating over the dataset share the same ``function_storage``,
+``symbolic_index`` and ``variable``: the first time we try to access the
+numeric value of some sample, a function is compiled that takes the index as
+input and outputs the variable.
+The only difference between samples is thus that they are given a different
+numeric value for the index (``numeric_index``).
+
+Another way to obtain the same result is to actually let the user take care of
+compiling the function. It would allow the user to really control what is
+being compiled, at the cost of having to write more code:
+
+    .. code-block:: python
+
+        symbolic_index = dataset.get_index() # Or just theano.tensor.iscalar()
+        get_sample = theano.function([symbolic_index],
+                                     dataset[symbolic_index].variable)
+        for numeric_index in xrange(len(dataset)):
+            do_something_with(get_sample(numeric_index))
+
+James comments: this is how I have written the last couple of projects, it's
+slightly verbose but it's clear and efficient.
+
+The code above may also be simplified by providing helper functions. In the
+example above, such a function could allow us to iterate over the numeric
+values of samples in a dataset while taking care of compiling the appropriate
+Theano function. See Discussion: Helper Functions below.
+
+Note that although the above example focused on how to iterate over a dataset,
+it can be cast into a more generic problem, where some data (either dataset or
+sample) is the result of some transformation applied to other data, which is
+parameterized by parameters p1, p2, ..., pN (in the above example, we were
+considering a sample that was obtained by taking the p1-th element in a
+dataset). If we use different values for a subset Q of the parameters but keep
+other parameters fixed, we would probably want to compile a single function
+that takes as input all parameters in Q, while other parameters are fixed. It
+may be nice to try and get the best of both worlds, letting the user take
+control over what is being compiled, while leaving the option of a sensible
+default behavior for those who do not want to worry about it. Whether this is
+possible / desirable is still to be determined.
+
+What About Learners?
+--------------------
+
+The discussion above only mentioned datasets, but not learners. The learning
+part of a learner is not the main concern for now. What matters most w.r.t.
+what was discussed above is how a learner takes a dataset as input and outputs
+another dataset that can be used with the dataset API.
+
+A Learner may be able to compute various things. For instance, a Neural
+Network may output a ``prediction`` vector (whose elements correspond to
+estimated probabilities of each class in a classification task), as well as a
+``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
+and the classification error). We would want to be able to build a dataset
+that contains some of these quantities computed on each sample in the input
+dataset.
+
+The Neural Network code would then look something like this:
+
+    .. code-block:: python
+
+        class NeuralNetwork(Learner):
+
+            # The decorator below is responsible for turning a function that
+            # takes a symbolic sample as input, and outputs a Theano variable,
+            # into a function that can also be applied to numeric sample data,
+            # or symbolic datasets.
+            # Other approaches than a decorator are possible (e.g. using
+            # different function names).
+            @datalearn(..)
+            def compute_prediction(self, sample):
+                return softmax(theano.tensor.dot(self.weights, sample.input))
+
+            @datalearn(..)
+            def compute_nll(self, sample):
+                return - log(self.compute_prediction(sample)[sample.target])
+
+            @datalearn(..)
+            def compute_penalized_nll(self, sample):
+                return (self.compute_nll(sample) +
+                        theano.tensor.sum(self.weights**2))
+
+            @datalearn(..)
+            def compute_class_error(self, sample):
+                probabilities = self.compute_prediction(sample)
+                predicted_class = theano.tensor.argmax(probabilities)
+                return predicted_class != sample.target
+
+            @datalearn(..)
+            def compute_cost(self, sample):
+                return theano.tensor.concatenate([
+                    self.compute_penalized_nll(sample),
+                    self.compute_nll(sample),
+                    self.compute_class_error(sample),
+                ])
+
+The ``@datalearn`` decorator would allow such a Learner to be used e.g. like
+this:
+
+    .. code-block:: python
+
+        nnet = NeuralNetwork()
+        # Symbolic dataset that represents the output on symbolic input data.
+        predict_dataset = nnet.compute_prediction(dataset)
+        for sample in dataset:
+            # Symbolic sample that represents the output on a single symbolic
+            # input sample.
+            predict_sample = nnet.compute_prediction(sample)
+        # Numeric prediction.
+        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
+        # Combining multiple symbolic outputs.
+        multiple_fields_dataset = ConcatDataSet([
+            nnet.compute_prediction(dataset),
+            nnet.compute_cost(dataset),
+        ])
+
+In the code above, if one wants to obtain the numeric value of an element of
+``multiple_fields_dataset``, the Theano function being compiled should be able
+to optimize computations so that the simultaneous computation of
+``prediction`` and ``cost`` is done efficiently.
+
+Discussion: Are Datasets Variables / Ops?
+-----------------------------------------
+
+OD wonders: Should datasets directly be Theano Variables, or should they be a
+different class of object containing a Theano Variable? The advantage of the
+former option would be that they would fit directly within the Theano
+framework, which may allow high level optimizations on data transformations.
+However, we would lose the ability to combine Theano expressions coded in
+individual datasets into a single graph. Currently, I instead consider that
+a dataset has a member that is a Theano variable, and this variable represents
+the data stored in the dataset. The same is done for individual data samples.
 
 James asks: Why would a Theano graph in which some nodes represent datasets
 give up the ability to combine Theano expressions coded in individual datasets?
 Wouldn't the dataset-variables/ops be like any other?
 
@@ -88,63 +256,9 @@
 of the dataset directly). Note that I'm mixing up Op/Variable here, because
 it's just not clear yet for me which would go where...
 
-One issue with this approach is illustrated by the following example. Imagine
-we want to iterate on samples in a dataset and do something with their
-numeric value. We would want the code to be as close as possible to:
-
-    .. code-block:: python
-
-        for sample in dataset:
-            do_something_with(sample.numeric_value())
-
-A naive implementation of the sample API could be (assuming each sample
-contains a ``variable`` member which is the variable representing this
-sample's data):
-
-    .. code-block:: python
-
-        def numeric_value(self):
-            if self.function is None:
-                # Compile function to output the numeric value stored in this
-                # sample's variable.
-                self.function = theano.function([], self.variable)
-            return self.function()
-
-However, this is not a good idea, because it would trigger a new function
-compilation for each sample. Instead, we would want something like this:
-
-    .. code-block:: python
-
-        def numeric_value(self):
-            if self.function_storage[0] is None:
-                # Compile function to output the numeric value stored in this
-                # sample's variable. This function takes as input the index of
-                # the sample in the dataset, and is shared among all samples.
-                self.function_storage[0] = theano.function(
-                    [self.symbolic_index], self.variable)
-            return self.function(self.numeric_index)
-
-In the code above, we assume that all samples created by the action of
-iterating over the dataset share the same ``function_storage``,
-``symbolic_index`` and ``variable``: the first time we try to access the numeric
-value of some sample, a function is compiled, that takes as input the index,
-and outputs the variable. The only difference between samples is thus that
-they are given a different numeric value for the index (``numeric_index``).
-
-Another way to obtain the same result is to actually let the user take care of
-compiling the function. It would allow the user to really control what is
-being compiled, at the cost of having to write more code:
-
-    .. code-block:: python
-
-        symbolic_index = dataset.get_index() # Or just theano.tensor.iscalar()
-        get_sample = theano.function([symbolic_index],
-                                     dataset[symbolic_index].variable)
-        for numeric_index in xrange(len(dataset))
-            do_something_with(get_sample(numeric_index))
-
-James comments: this is how I have written the last couple of projects, it's
-slightly verbose but it's clear and efficient.
+Discussion: Implicit / Explicit Function Compilation
+----------------------------------------------------
+
 : I assume that ``do_something_with`` is supposed to be some
 numeric function, and dataset in this case is the result of some
@@ -276,18 +390,8 @@
 and valid options.
 
-Note that although the above example focused on how to iterate over a dataset,
-it can be cast into a more generic problem, where some data (either dataset or
-sample) is the result of some transformation applied to other data, which is
-parameterized by parameters p1, p2, ..., pN (in the above example, we were
-considering a sample that was obtained by taking the p1-th element in a
-dataset). If we use different values for a subset Q of the parameters but keep
-other parameters fixed, we would probably want to compile a single function
-that takes as input all parameters in Q, while other parameters are fixed.
-Ideally it would be nice to let the user take control on what is being
-compiled, while leaving the option of using a default sensible behavior for
-those who do not want to worry about it. How to achieve this is still to be
-determined.
+Discussion: Fixed Parameters vs. Function Arguments
+---------------------------------------------------
 
 Razvan Comment: I thought about this a bit at the Pylearn level. In my
 original train of thought you would have the distinction between ``hand
@@ -309,6 +413,9 @@
 are possibly constant (e.g. holding some hyper-parameters constant for a
 while)?
 
+Discussion: Helper Functions
+----------------------------
+
 James: Another syntactic option for iterating over datasets is
 
     .. code-block:: python
@@ -330,13 +437,8 @@
 already compiled in the same program? (note that I am assuming here it is
 not efficient, but I may be wrong).
 
-What About Learners?
---------------------
-
-The discussion above only mentioned datasets, but not learners. The learning
-part of a learner is not a main concern (currently). What matters most w.r.t.
-what was discussed above is how a learner takes as input a dataset and outputs
-another dataset that can be used with the dataset API.
+Discussion: Dataset as Learner Output
+-------------------------------------
 
 James asks: What's wrong with simply passing the variables corresponding to
 the dataset to
@@ -352,67 +454,6 @@
 could also be instead different functions in the base Learner class if the
 decorator approach is considered ugly / confusing.
 
-A Learner may be able to compute various things. For instance, a Neural
-Network may output a ``prediction`` vector (whose elements correspond to
-estimated probabilities of each class in a classification task), as well as a
-``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
-and the classification error). We would want to be able to build a dataset
-that contains some of these quantities computed on each sample in the input
-dataset.
-
-The Neural Network code would then look something like this:
-
-    .. code-block:: python
-
-        class NeuralNetwork(Learner):
-
-            @datalearn(..)
-            def compute_prediction(self, sample):
-                return softmax(theano.tensor.dot(self.weights, sample.input))
-
-            @datalearn(..)
-            def compute_nll(self, sample):
-                return - log(self.compute_prediction(sample)[sample.target])
-
-            @datalearn(..)
-            def compute_penalized_nll(self, sample):
-                return (self.compute_nll(self, sample) +
-                        theano.tensor.sum(self.weights**2))
-
-            @datalearn(..)
-            def compute_class_error(self, sample):
-                probabilities = self.compute_prediction(sample)
-                predicted_class = theano.tensor.argmax(probabilities)
-                return predicted_class != sample.target
-
-            @datalearn(..)
-            def compute_cost(self, sample):
-                return theano.tensor.concatenate([
-                    self.compute_penalized_nll(sample),
-                    self.compute_nll(sample),
-                    self.compute_class_error(sample),
-                ])
-
-The ``@datalearn`` decorator would be responsible for allowing such a Learner
-to be used e.g. like this:
-
-    .. code-block:: python
-
-        nnet = NeuralNetwork()
-        predict_dataset = nnet.compute_prediction(dataset)
-        for sample in dataset:
-            predict_sample = nnet.compute_prediction(sample)
-        predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
-        multiple_fields_dataset = ConcatDataSet([
-            nnet.compute_prediction(dataset),
-            nnet.compute_cost(dataset),
-        ])
-
-In the code above, if one wants to obtain the numeric value of an element of
-``multiple_fields_dataset``, the Theano function being compiled would be able
-to optimize computations so that the simultaneous computation of
-``prediction`` and ``cost`` is done efficiently.
-
 Razvan asks: What is predict_sample for ? What is predict_dataset? What I
 guess you mean is that the decorator is used to convert a function that takes
 a theano variable and outputs a theano variable into a class/function
@@ -433,3 +474,4 @@
 OD: Yes, you guessed right, the decorator's role is to do something different
 depending on the input to the function (see my reply to James above).
+
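+To make the intended behaviour of such a decorator more concrete, here is a
+rough sketch of how the dispatch on the input type could work. Everything in
+it is made up for illustration only: the ``Sample`` container, the rule that a
+``dict`` means numeric data, and the network itself. Decorator arguments,
+dataset handling and caching of the compiled function are all omitted (a new
+Theano function is compiled on every numeric call).
+
+    .. code-block:: python
+
+        import numpy
+        import theano
+        import theano.tensor as T
+
+        class Sample(object):
+            """Toy symbolic sample with a single ``input`` field."""
+            def __init__(self, variable):
+                self.input = variable
+
+        def datalearn(method):
+            """Return a symbolic output on symbolic samples, and a numeric
+            output on dicts of numeric data."""
+            def wrapper(self, sample):
+                if isinstance(sample, dict):
+                    # Numeric data: build a symbolic sample, apply the method,
+                    # then compile and evaluate the resulting expression.
+                    symbolic = Sample(T.dvector('input'))
+                    expression = method(self, symbolic)
+                    function = theano.function([symbolic.input], expression)
+                    return function(sample['input'])
+                # Symbolic sample: simply return the Theano expression.
+                return method(self, sample)
+            return wrapper
+
+        class NeuralNetwork(object):
+            def __init__(self, n_in, n_out):
+                rng = numpy.random.RandomState(0)
+                self.weights = theano.shared(rng.uniform(size=(n_out, n_in)))
+
+            @datalearn
+            def compute_prediction(self, sample):
+                activation = T.dot(self.weights, sample.input)
+                return T.exp(activation) / T.exp(activation).sum()
+
+        nnet = NeuralNetwork(n_in=10, n_out=3)
+        # Numeric call: returns a numpy array that sums to one.
+        print nnet.compute_prediction({'input': numpy.zeros(10)})
+        # Symbolic call: returns a Theano expression.
+        print nnet.compute_prediction(Sample(T.dvector('x')))
+
+Whether the real decorator should compile eagerly like this, cache compiled
+functions (as in the ``function_storage`` example above), or leave compilation
+entirely to the user is exactly the trade-off raised in Discussion: Implicit /
+Explicit Function Compilation.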