DataLearn: How to plug Datasets & Learners together?
====================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader?]

High-Level Objectives
---------------------

* Simple ML experiments should be simple to write
* More complex / advanced scenarios should be possible without being forced
  to work "outside" of this framework
* Computations should be optimized whenever possible
* Existing code (in any language) should be "wrappable" within this
  framework
* It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano to take advantage of its efficient computations.
The general idea is that if we chain multiple processing elements (think
e.g. of a feature selection step followed by a PCA projection, then a
rescaling within a fixed bounded interval), the overall transformation from
input to output data can be represented by a Theano symbolic graph. When one
wants to access the actual numeric data, a function is compiled to perform
these computations efficiently.

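As a minimal sketch of this idea (not the proposed API; the projection
matrix ``W_pca`` and all shapes are made up for illustration), such a chain
maps onto a single symbolic graph compiled in one call:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    rng = numpy.random.RandomState(0)
    W_pca = theano.shared(rng.randn(5, 3))  # hypothetical projection matrix

    x = T.matrix('x')                      # rows = samples, cols = features
    selected = x[:, 0:5]                   # feature selection: keep 5 columns
    projected = T.dot(selected, W_pca)     # PCA projection step
    rescaled = T.clip(projected, -1., 1.)  # rescale into a bounded interval

    # A single function is compiled for the whole chain, so Theano can
    # optimize across all three steps at once.
    transform = theano.function([x], rescaled)
    numeric_output = transform(rng.randn(4, 10))
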
We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how exactly this Theano-based implementation could be
achieved. For this purpose, in the following, let us assume that a dataset
is simply a matrix whose rows represent individual samples, and whose
columns represent individual features. How to handle field names,
non-tensor-like data, etc. is a very important topic that is not yet
discussed in this file.

A question we did not really discuss is whether datasets should be Theano
Variables. The advantage would be that they would fit directly within the
Theano framework, which may allow high-level optimizations on data
transformations. However, we would lose the ability to combine Theano
expressions coded in individual datasets into a single graph. Currently, we
instead consider that a dataset has a member that is a Theano variable, and
this variable represents the data stored in the dataset. The same is done
for individual data samples.

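As a rough sketch of this convention (the class and member names here are
hypothetical), a dataset wrapping a numpy array might simply expose the
corresponding Theano variable:

.. code-block:: python

    import numpy
    import theano

    class ArrayDataset(object):
        """Hypothetical dataset: stores numeric data and exposes
        ``self.variable``, a Theano variable representing that data."""

        def __init__(self, data):
            self.data = numpy.asarray(data)            # numeric storage
            self.variable = theano.shared(self.data)   # symbolic handle

        def __len__(self):
            return len(self.data)
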
One issue with this approach is illustrated by the following example.
Imagine we want to iterate on samples in a dataset and do something with
their numeric value. We would want the code to be as close as possible to:

.. code-block:: python

    for sample in dataset:
        do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample
contains a ``variable`` member which is the variable representing this
sample's data):

.. code-block:: python

    def numeric_value(self):
        if self.function is None:
            # Compile a function to output the numeric value stored in
            # this sample's variable.
            self.function = theano.function([], self.variable)
        return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

.. code-block:: python

    def numeric_value(self):
        if self.function_storage[0] is None:
            # Compile a function to output the numeric value stored in
            # this sample's variable. This function takes as input the
            # index of the sample in the dataset, and is shared among
            # all samples.
            self.function_storage[0] = theano.function(
                [self.symbolic_index], self.variable)
        return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by the action of
iterating over the dataset share the same ``function_storage``,
``symbolic_index`` and ``variable``: the first time we try to access the
numeric value of some sample, a function is compiled that takes the index
as input and outputs the variable. The only difference between samples is
thus that they are given a different numeric value for the index
(``numeric_index``).

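For instance, the dataset's iterator could be where these shared objects
are created. In this hypothetical sketch, ``Sample`` simply stores the four
arguments it is given, and its ``numeric_value`` is implemented as above:

.. code-block:: python

    def __iter__(self):
        # All samples yielded below share the same function storage,
        # symbolic index and symbolic sample variable; only their
        # ``numeric_index`` differs.
        function_storage = [None]
        symbolic_index = theano.tensor.iscalar()
        variable = self.variable[symbolic_index]
        for numeric_index in xrange(len(self)):
            yield Sample(function_storage, symbolic_index, variable,
                         numeric_index)
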
Another way to obtain the same result is to actually let the user take care
of compiling the function. This would allow the user to really control what
is being compiled, at the cost of having to write more code:

.. code-block:: python

    symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
    get_sample = theano.function([symbolic_index],
                                 dataset[symbolic_index].variable)
    for numeric_index in xrange(len(dataset)):
        do_something_with(get_sample(numeric_index))

Note that although the above example focused on how to iterate over a
dataset, it can be cast into a more generic problem, where some data
(either a dataset or a sample) is the result of some transformation applied
to other data, parameterized by parameters p1, p2, ..., pN (in the above
example, we were considering a sample obtained by taking the p1-th element
of a dataset). If we use different values for a subset Q of the parameters
but keep the other parameters fixed, we would probably want to compile a
single function that takes as input all parameters in Q, while the other
parameters stay fixed. Ideally it would be nice to let the user take
control of what is being compiled, while leaving the option of a sensible
default behavior for those who do not want to worry about it. How to
achieve this is still to be determined.

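As a minimal sketch of this situation (the data and parameter values below
are made up), p1 varies across calls and thus becomes a function input,
while p2 is simply baked into the graph at compilation time:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    data = theano.shared(numpy.arange(20.).reshape(5, 4))

    p1 = T.iscalar('p1')    # in Q: varies, so it is a function input
    p2 = 0.5                # not in Q: fixed when the function is compiled

    transformed_sample = data[p1] * p2
    get_sample = theano.function([p1], transformed_sample)

    samples = [get_sample(i) for i in xrange(5)]
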
What About Learners?
--------------------

The discussion above only mentioned datasets, not learners. The learning
part of a learner is not a main concern (currently). What matters most
w.r.t. what was discussed above is how a learner takes a dataset as input
and outputs another dataset that can be used with the dataset API.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as
a ``cost`` vector (whose elements correspond to the penalized NLL, the NLL
alone, and the classification error). We would want to be able to build a
dataset that contains some of these quantities computed on each sample in
the input dataset.

The Neural Network code would then look something like this:

.. code-block:: python

    class NeuralNetwork(Learner):

        @datalearn(..)
        def compute_prediction(self, sample):
            return theano.tensor.nnet.softmax(
                theano.tensor.dot(self.weights, sample.input))

        @datalearn(..)
        def compute_nll(self, sample):
            return -theano.tensor.log(
                self.compute_prediction(sample)[sample.target])

        @datalearn(..)
        def compute_penalized_nll(self, sample):
            return (self.compute_nll(sample) +
                    theano.tensor.sum(self.weights**2))

        @datalearn(..)
        def compute_class_error(self, sample):
            probabilities = self.compute_prediction(sample)
            predicted_class = theano.tensor.argmax(probabilities)
            return theano.tensor.neq(predicted_class, sample.target)

        @datalearn(..)
        def compute_cost(self, sample):
            return theano.tensor.concatenate([
                self.compute_penalized_nll(sample),
                self.compute_nll(sample),
                self.compute_class_error(sample),
            ])

The ``@datalearn`` decorator would be responsible for allowing such a
Learner to be used e.g. like this:

.. code-block:: python

    nnet = NeuralNetwork()
    predict_dataset = nnet.compute_prediction(dataset)
    for sample in dataset:
        predict_sample = nnet.compute_prediction(sample)
    predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
    multiple_fields_dataset = ConcatDataSet([
        nnet.compute_prediction(dataset),
        nnet.compute_cost(dataset),
    ])

In the code above, if one wants to obtain the numeric value of an element
of ``multiple_fields_dataset``, the Theano function being compiled would be
able to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.

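As a rough illustration of why this works (with hypothetical shapes and
names), compiling several outputs in a single Theano function lets the
common sub-graph, here the prediction, be computed only once:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    weights = theano.shared(numpy.random.randn(3, 10))
    x = T.dvector('input')
    target = T.iscalar('target')

    # ``prediction`` is the sub-graph shared by all three outputs.
    prediction = T.nnet.softmax(T.dot(weights, x))[0]
    nll = -T.log(prediction[target])
    class_error = T.neq(T.argmax(prediction), target)

    # One compiled function: Theano computes ``prediction`` once and
    # reuses it for ``nll`` and ``class_error``.
    compute_all = theano.function([x, target],
                                  [prediction, nll, class_error])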