# HG changeset patch
# User Olivier Delalleau
# Date 1290106849 18000
# Node ID e8fc563dad740afb04b730c824d9afb25ff75dac
# Parent 7b61bfda1dab0f20f1ed32a07a3cbea410af7434
Rewrote the Theano-Like Data Flow section in datalearn.txt

diff -r 7b61bfda1dab -r e8fc563dad74 doc/v2_planning/datalearn.txt
--- a/doc/v2_planning/datalearn.txt	Thu Nov 18 11:45:10 2010 -0500
+++ b/doc/v2_planning/datalearn.txt	Thu Nov 18 14:00:49 2010 -0500
@@ -1,11 +1,14 @@
 DataLearn: How to plug Datasets & Learner together?
 ===================================================
+
 Participants
 ------------
 - Yoshua
 - Razvan
 - Olivier D [leader]
+- James
+
 High-Level Objectives
 ---------------------
@@ -18,6 +21,7 @@
   framework
 * It should be possible to replace [parts of] this framework with C++ code
 
+
 Theano-Like Data Flow
 ---------------------
@@ -37,6 +41,234 @@
 individual features. How to handle field names, non-tensor-like data, etc.
 is a very important topic that is not yet discussed in this file.
 
+The main idea in this proposal is to consider some Data object as a Theano
+Variable (we call 'data' an object that is either a sample or a collection
+of samples, i.e. a dataset). Because the Data API (for the Machine Learning
+user) may conflict with the Variable API, in the following we take the
+approach that a data object contains a Theano variable accessible through
+data.variable (instead of Data being a subclass of Variable). For instance,
+a basic way of printing the content of a dataset could be:
+
+  .. code-block:: python
+
+      dataset = NumpyDataset(some_numpy_array)  # View array as dataset.
+      index = theano.tensor.lscalar()
+      get_sample_value = theano.function([index], dataset[index].variable)
+      for i in xrange(len(dataset)):
+          print get_sample_value(i)
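+
+For concreteness, here is one way such a NumpyDataset wrapper could be
+implemented. This is only an illustrative sketch, not part of the proposal:
+it assumes the whole array fits in memory and that each row is one sample,
+and the Sample helper class and attribute names used below are made up for
+the example.
+
+  .. code-block:: python
+
+      import numpy
+      import theano
+      import theano.tensor
+
+      class Sample(object):
+          # Hypothetical wrapper holding the symbolic variable of one sample.
+          def __init__(self, variable):
+              self.variable = variable
+
+      class NumpyDataset(object):
+          # Hypothetical dataset wrapping a numpy array: the array is stored
+          # in a Theano shared variable, which also serves as the dataset's
+          # symbolic content through self.variable (one row per sample).
+          def __init__(self, array):
+              array = numpy.asarray(array)
+              self.n_samples = len(array)
+              self.variable = theano.shared(array)
+
+          def __len__(self):
+              return self.n_samples
+
+          def __getitem__(self, index):
+              # 'index' may be a Python integer or a symbolic scalar such as
+              # theano.tensor.lscalar(): either way, indexing the symbolic
+              # matrix yields the symbolic row for that sample.
+              return Sample(self.variable[index])
+
+With such a wrapper, the printing example above compiles a single Theano
+function that extracts row ``i`` of the underlying shared array.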
+
+There may also exist a helper function for the common task of iterating
+over the numeric values found in a dataset, which would allow one to simply
+write:
+
+  .. code-block:: python
+
+      for sample_value in theano_iterate(dataset):
+          print sample_value
+
+where the theano_iterate function would take care of the extra work:
+
+  .. code-block:: python
+
+      def theano_iterate(dataset, index=None, condition=None,
+                         stop_exceptions=(IndexError, )):
+          if index is None:
+              index = theano.tensor.lscalar()
+          if condition is None:
+              condition = index < len(dataset)
+          get_value = theano.function(
+                  [index], [dataset[index].variable, condition])
+          i = 0
+          while True:
+              try:
+                  output, cond = get_value(i)
+              except stop_exceptions:
+                  break
+              i += 1
+              if cond:
+                  yield output
+              else:
+                  break
+
+Now imagine a similar situation (we still want to iterate over a dataset),
+except that the dataset is the result of some transformation parameterized
+by another Variable. For instance, let's say there exists a
+GetColumnDataset class such that GetColumnDataset(dataset, index_variable)
+is a dataset whose associated variable is
+dataset.variable[:, index_variable] (assuming here that dataset.variable is
+a matrix variable). One would like to write:
+
+  .. code-block:: python
+
+      for j in xrange(dataset.n_columns()):
+          print 'Printing column %s' % j
+          for sample_value in theano_iterate(GetColumnDataset(dataset, j)):
+              print sample_value
+
+Although this would work, note that it would compile a new Theano function
+each time theano_iterate is called (one for each value of j), which may be
+a performance bottleneck. One way to avoid this is to bypass the helper
+function and manually compile a function that also takes the column index
+as an input parameter:
+
+  .. code-block:: python
+
+      sample_idx = theano.tensor.lscalar()
+      column_idx = theano.tensor.lscalar()
+      get_value = theano.function(
+          [sample_idx, column_idx],
+          GetColumnDataset(dataset, column_idx)[sample_idx].variable)
+      for j in xrange(dataset.n_columns()):
+          print 'Printing column %s' % j
+          for i in xrange(len(dataset)):
+              print get_value(i, j)
+
+It is however still possible to use the helper function, provided it
+accepts an extra argument ('givens') that is forwarded to the Theano
+compilation step:
+
+  .. code-block:: python
+
+      def theano_iterate(dataset, index=None, condition=None,
+                         stop_exceptions=(IndexError, ),
+                         givens={}):
+          (...)
+          get_value = theano.function(
+                  [index], [dataset[index].variable, condition],
+                  givens=givens)
+          (...)
+
+      column_idx = theano.tensor.lscalar()
+      shared_column_idx = theano.shared(0)
+      iterate = theano_iterate(GetColumnDataset(dataset, column_idx),
+                               givens={column_idx: shared_column_idx})
+      for j in xrange(dataset.n_columns()):
+          print 'Printing column %s' % j
+          shared_column_idx.value = j
+          for sample_value in iterate:
+              print sample_value
+
+Note that there are a couple of oddities in the example above:
+
+  1. As theano_iterate is written, the value it returns cannot be iterated
+     over more than once. This is easily fixed by making it an iterable
+     object (a sketch of such an object follows this list).
+  2. It would make more sense here to remove 'column_idx' and directly use
+     GetColumnDataset(dataset, shared_column_idx), in which case there is
+     no need for the 'givens' keyword. But the goal here is to illustrate a
+     situation where one is given a dataset defined from a symbolic
+     variable, and we want to compute it for different numeric values of
+     this variable. Such a dataset may have been provided by code the user
+     has no control over, hence the need for 'givens' to replace the
+     variable with a shared one whose value can be updated between
+     successive calls to the same function.
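+
+Regarding the first oddity above, here is a sketch of what an iterable
+replacement for theano_iterate could look like. The class name and its
+exact interface are illustrative only; the point is that the Theano
+function is compiled once, while each call to __iter__ starts a new pass
+over the dataset.
+
+  .. code-block:: python
+
+      class TheanoIterator(object):
+          # Hypothetical iterable version of theano_iterate: the Theano
+          # function is compiled only once, in the constructor, and each
+          # __iter__ call starts a fresh pass over the dataset.
+          def __init__(self, dataset, index=None, condition=None,
+                       stop_exceptions=(IndexError, ), givens={}):
+              if index is None:
+                  index = theano.tensor.lscalar()
+              if condition is None:
+                  condition = index < len(dataset)
+              self.stop_exceptions = stop_exceptions
+              self.get_value = theano.function(
+                      [index], [dataset[index].variable, condition],
+                      givens=givens)
+
+          def __iter__(self):
+              i = 0
+              while True:
+                  try:
+                      output, cond = self.get_value(i)
+                  except self.stop_exceptions:
+                      break
+                  i += 1
+                  if cond:
+                      yield output
+                  else:
+                      break
+
+In the column example above, 'iterate' could then be such an iterator and
+be re-used for every value of j while compiling a single Theano function.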
+
+In summary:
+
+- Data (samples and datasets) are basically Theano Variables, and a data
+  transformation is an Op.
+- When writing code that requires the numeric value of some data, one has
+  to compile a Theano function to obtain it. This is done either manually
+  or through helper Pylearn functions for common tasks. In both cases, the
+  user should have enough control to be able to obtain an efficient
+  implementation.
+
+
+What About Learners?
+--------------------
+
+The discussion above only mentioned datasets, but not learners. The
+learning part of a learner is not the main concern here (currently). What
+matters most w.r.t. what was discussed above is how a learner takes a
+dataset as input and outputs another dataset that can be used with the
+dataset API.
+
+A Learner may be able to compute various things. For instance, a Neural
+Network may output a ``prediction`` vector (whose elements correspond to
+estimated probabilities of each class in a classification task), as well as
+a ``cost`` vector (whose elements correspond to the penalized NLL, the NLL
+alone and the classification error). We would want to be able to build a
+dataset that contains some of these quantities computed on each sample in
+the input dataset.
+
+The Neural Network code would then look something like this:
+
+  .. code-block:: python
+
+      class NeuralNetwork(Learner):
+
+          # The decorator below is responsible for turning a function that
+          # takes a symbolic sample as input, and outputs a Theano
+          # variable, into a function that can also be applied on numeric
+          # sample data, or on symbolic datasets.
+          # Other approaches than a decorator are possible (e.g. using
+          # different function names).
+          @datalearn(..)
+          def compute_prediction(self, sample):
+              return softmax(theano.tensor.dot(self.weights, sample.input))
+
+          @datalearn(..)
+          def compute_nll(self, sample):
+              return -log(self.compute_prediction(sample)[sample.target])
+
+          @datalearn(..)
+          def compute_penalized_nll(self, sample):
+              return (self.compute_nll(sample) +
+                      theano.tensor.sum(self.weights**2))
+
+          @datalearn(..)
+          def compute_class_error(self, sample):
+              probabilities = self.compute_prediction(sample)
+              predicted_class = theano.tensor.argmax(probabilities)
+              return predicted_class != sample.target
+
+          @datalearn(..)
+          def compute_cost(self, sample):
+              return theano.tensor.concatenate([
+                  self.compute_penalized_nll(sample),
+                  self.compute_nll(sample),
+                  self.compute_class_error(sample),
+                  ])
+
+The ``@datalearn`` decorator would allow such a Learner to be used e.g.
+like this:
+
+  .. code-block:: python
+
+      nnet = NeuralNetwork()
+      # Symbolic dataset that represents the output on symbolic input data.
+      predict_dataset = nnet.compute_prediction(dataset)
+      for sample in dataset:
+          # Symbolic sample that represents the output on a single symbolic
+          # input sample.
+          predict_sample = nnet.compute_prediction(sample)
+      # Numeric prediction.
+      predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
+      # Combining multiple symbolic outputs.
+      multiple_fields_dataset = ConcatDataSet([
+          nnet.compute_prediction(dataset),
+          nnet.compute_cost(dataset),
+          ])
+
+In the code above, if one wants to obtain the numeric value of an element
+of ``multiple_fields_dataset``, the Theano function being compiled should
+be able to optimize computations so that the simultaneous computation of
+``prediction`` and ``cost`` is done efficiently.
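+
+The proposal deliberately leaves the implementation of ``@datalearn`` open.
+Purely as an illustration of the dispatching it would have to perform, here
+is one possible sketch. It assumes that numeric samples are dicts mapping
+field names to numpy values, and that symbolic samples expose their fields
+as attributes; the SymbolicSample helper, the caching scheme and the way
+input variables are created are inventions of this sketch, not part of the
+proposal.
+
+  .. code-block:: python
+
+      class SymbolicSample(object):
+          # Hypothetical container exposing symbolic fields as attributes
+          # (sample.input, sample.target, ...).
+          def __init__(self, fields):
+              self.__dict__.update(fields)
+
+      def datalearn(**options):
+          def decorator(method):
+              cache = {}
+
+              def wrapper(self, sample):
+                  if not isinstance(sample, dict):
+                      # Symbolic sample or dataset: simply build and return
+                      # the symbolic output (how it gets wrapped back into
+                      # a sample / dataset object is not addressed here).
+                      return method(self, sample)
+                  # Numeric sample: compile a Theano function on first use
+                  # (cached per learner / method / set of fields), then
+                  # apply it to the numeric values.
+                  names = sorted(sample)
+                  key = (id(self), method.__name__, tuple(names))
+                  if key not in cache:
+                      fields = {}
+                      for name in names:
+                          value = numpy.asarray(sample[name])
+                          # Create a symbolic variable matching the numeric
+                          # value's dtype and number of dimensions.
+                          fields[name] = theano.tensor.TensorType(
+                                  str(value.dtype),
+                                  (False,) * value.ndim)(name)
+                      cache[key] = theano.function(
+                              [fields[name] for name in names],
+                              method(self, SymbolicSample(fields)))
+                  return cache[key](
+                          *[numpy.asarray(sample[name]) for name in names])
+              return wrapper
+          return decorator
+
+Note that this sketch assumes each method uses exactly the fields provided
+in the numeric dict; handling missing or extra fields, as well as symbolic
+datasets (as opposed to single samples), would require more machinery.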
+
+
+Open Problems
+-------------
+
+The above is not yet a practical proposal. Investigation of the following
+topics is still missing:
+
+- Datasets whose variables are not matrices (e.g. large datasets that do not
+  fit in memory, non fixed-length vector samples, ...)
+- Field names.
+- Typical input / target / weight split.
+- Learners whose output on a dataset cannot be obtained by computing outputs
+  on individual samples (e.g. a Learner that ranks samples based on pair-wise
+  comparisons).
+- Code parallelization, stop & restart.
+- Modular C++ implementation without Theano.
+- ...
+
+
+Previous Introduction (deprecated)
+----------------------------------
+
 A question we did not discuss much is to which extent the architecture could
 be "theanified", i.e. whether a whole experiment could be defined as a Theano
 graph on which high level optimizations could be made possible, while also
@@ -122,85 +354,6 @@
 sensible behavior for those who do not want to worry about it. Whether this
 is possible / desirable is still to-be-determined.
 
-What About Learners?
---------------------
-
-The discussion above only mentioned datasets, but not learners. The learning
-part of a learner is not a main concern (currently). What matters most w.r.t.
-what was discussed above is how a learner takes as input a dataset and outputs
-another dataset that can be used with the dataset API.
-
-A Learner may be able to compute various things. For instance, a Neural
-Network may output a ``prediction`` vector (whose elements correspond to
-estimated probabilities of each class in a classification task), as well as a
-``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
-and the classification error). We would want to be able to build a dataset
-that contains some of these quantities computed on each sample in the input
-dataset.
-
-The Neural Network code would then look something like this:
-
-  .. code-block:: python
-
-      class NeuralNetwork(Learner):
-
-          # The decorator below is reponsible for turning a function that
-          # takes a symbolic sample as input, and outputs a Theano variable,
-          # into a function that can also be applied on numeric sample data,
-          # or symbolic datasets.
-          # Other approaches than a decorator are possible (e.g. using
-          # different function names).
-          @datalearn(..)
-          def compute_prediction(self, sample):
-              return softmax(theano.tensor.dot(self.weights, sample.input))
-
-          @datalearn(..)
-          def compute_nll(self, sample):
-              return - log(self.compute_prediction(sample)[sample.target])
-
-          @datalearn(..)
-          def compute_penalized_nll(self, sample):
-              return (self.compute_nll(self, sample) +
-                      theano.tensor.sum(self.weights**2))
-
-          @datalearn(..)
-          def compute_class_error(self, sample):
-              probabilities = self.compute_prediction(sample)
-              predicted_class = theano.tensor.argmax(probabilities)
-              return predicted_class != sample.target
-
-          @datalearn(..)
-          def compute_cost(self, sample):
-              return theano.tensor.concatenate([
-                  self.compute_penalized_nll(sample),
-                  self.compute_nll(sample),
-                  self.compute_class_error(sample),
-                  ])
-
-The ``@datalearn`` decorator would allow such a Learner to be used e.g. like
-this:
-
-  .. code-block:: python
-
-      nnet = NeuralNetwork()
-      # Symbolic dataset that represents the output on symbolic input data.
-      predict_dataset = nnet.compute_prediction(dataset)
-      for sample in dataset:
-          # Symbolic sample that represents the output on a single symbolic
-          # input sample.
-          predict_sample = nnet.compute_prediction(sample)
-      # Numeric prediction.
-      predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
-      # Combining multiple symbolic outputs.
-      multiple_fields_dataset = ConcatDataSet([
-          nnet.compute_prediction(dataset),
-          nnet.compute_cost(dataset),
-          ])
-
-In the code above, if one wants to obtain the numeric value of an element of
-``multiple_fields_dataset``, the Theano function being compiled should be able
-to optimize computations so that the simultaneous computation of
-``prediction`` and ``cost`` is done efficiently.
 
 Discussion: Are Datasets Variables / Ops?
 -----------------------------------------
@@ -390,6 +543,7 @@
 and valid options.
 
+
 Discussion: Fixed Parameters vs. Function Arguments
 ---------------------------------------------------
@@ -534,6 +688,7 @@
 once. Maybe this can be solved at the Theano level with an efficient function
 cache?
 
+
 Discussion: Dataset as Learner Ouptut
 -------------------------------------