changeset 1117:c1943feada10
Proposal for theano dataset wrapper. The details still have to be worked out.
author | Arnaud Bergeron <abergeron@gmail.com>
date | Tue, 14 Sep 2010 15:22:48 -0400
parents | 18a092001752
children | 8cc324f388ba
files | doc/v2_planning/dataset.txt doc/v2_planning/shared_dataset.py
diffstat | 2 files changed, 52 insertions(+), 4 deletions(-)
--- a/doc/v2_planning/dataset.txt	Tue Sep 14 14:20:31 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Tue Sep 14 15:22:48 2010 -0400
@@ -368,7 +368,8 @@
 AB: I have an idea about this which kind of fits in the "building a
 theano op" thing that we talked about at the last meeting.
 
-We could have a specialezed theano op that takes a dataset and returns
-chunks of it with a index using the standard Dataset interface. The
-code to transfer to the GPU or whatever goes in that Op and we don't
-need to change to dataset interface.
+We can just build a theano Op that wraps dataset objects and takes
+care of the details of transferring data to the GPU or elsewhere.
+
+I have a prototype interface/implementation in the shared_dataset.py
+file in this directory.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/shared_dataset.py	Tue Sep 14 15:22:48 2010 -0400
@@ -0,0 +1,47 @@
+import theano
+
+# This is not final and may not even run for now. It is just to give
+# a feeling of what the interface could look like.
+
+def shared_dataset(dataset, mem_size):
+    if dataset.total_size > mem_size:
+        return OnlineDataset(dataset)
+    else:
+        return MemoryDataset(dataset)
+
+class MemoryDataset(theano.Op):
+    def __init__(self, dataset):
+        self.input = theano.shared(dataset.input)
+        self.output = theano.shared(dataset.output)
+        self.batch_size = dataset.batch_size
+
+    def make_node(self, idx):
+        idx_ = theano.tensor.as_tensor_variable(idx)
+        return theano.Apply(self,
+                            inputs=[idx_],
+                            outputs=[self.input.type(),
+                                     self.output.type()])
+
+    def perform(self, node, inputs, output_storage):
+        idx, = inputs  # a plain integer here, not a symbolic variable
+        output_storage[0][0] = self.input.get_value(borrow=True)[idx*self.batch_size:(idx+1)*self.batch_size]
+        output_storage[1][0] = self.output.get_value(borrow=True)[idx*self.batch_size:(idx+1)*self.batch_size]
+
+class OnlineDataset(theano.Op):
+    def __init__(self, dataset):
+        self.dataset = dataset
+
+    def make_node(self, idx):
+        idx_ = theano.tensor.as_tensor_variable(idx)
+        return theano.Apply(self,
+                            inputs=[idx_],
+                            outputs=[theano.tensor.fmatrix(),
+                                     theano.tensor.fmatrix()])
+        # fix this so it's not fmatrix(),
+        # but whatever the dataset outputs
+
+    def perform(self, node, inputs, output_storage):
+        idx, = inputs  # again a plain integer at runtime
+        b = self.dataset.get_batch(int(idx))
+        output_storage[0][0] = b.input
+        output_storage[1][0] = b.output
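For context, here is a rough usage sketch (not part of the changeset) of how the wrapper might be driven from a compiled theano function. DummyDataset is a hypothetical stand-in that only exists to provide the attributes the prototype expects (input, output, total_size, batch_size); a real model would build its own expression on top of the two outputs instead of the dummy cost used here.

import numpy
import theano
import theano.tensor as T

from shared_dataset import shared_dataset

class DummyDataset(object):
    # hypothetical dataset object with the interface the prototype assumes
    def __init__(self):
        self.input = numpy.random.rand(100, 5).astype('float32')
        self.output = numpy.random.rand(100, 1).astype('float32')
        self.total_size = self.input.nbytes + self.output.nbytes
        self.batch_size = 10

dataset = DummyDataset()
# small enough to fit in memory, so this returns a MemoryDataset
batch_src = shared_dataset(dataset, mem_size=1 << 30)

idx = T.iscalar('idx')   # minibatch index
x, y = batch_src(idx)    # symbolic input/output of one minibatch
cost = x.mean() + y.mean()  # dummy cost, stands in for a real model

f = theano.function([idx], cost)
for i in range(10):
    print(f(i))

The point of the interface is that the compiled function only ever takes a scalar minibatch index; all the data movement is hidden inside the Op's perform(), so the dataset interface itself never has to change.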