doc/v2_planning/datalearn.txt @ 1361:7548dc1b163c

Some questions/suggestions for datalearn

author:   Razvan Pascanu <r.pascanu@gmail.com>
date:     Thu, 11 Nov 2010 22:40:01 -0500
parents:  5db730bb0e8e
children: 6b9673d72a41

Firstly, if you want to use Theano expressions and compiled functions to
implement the perform() method of an Op, you can do that. Secondly, you can
just include those 'expressions coded in individual datasets' into the
overall graph.

Razvan comments: 1) Having Theano expressions inside the perform() of a
Theano Op can lead to issues. I know I had to deal with a few when
implementing Scan, which does exactly this. To be fair, these issues mostly
come into play when the inner graph has to interact with the outer graph,
and most of the time they can be solved. All I'm saying is that going this
way might lead to some headaches for developers, though I guess some
headaches will be involved no matter what.
2) In my view (I'm not sure this is what Olivier was saying), the idea of
not putting the Dataset into a Variable is to keep the logic related to
loading data, dividing it into slices when running on the GPU, and so on,
out of the Theano variable. In my view this logic goes into a DataSet class
that gives you shared variables, symbolic indices into those shared
variables, and also numeric indices. When looping through those numeric
indices, the dataset class can reload parts of the data into the shared
variable, and so on.
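
As a rough illustration of that division of labour, here is a minimal
sketch of what such a DataSet class could look like (``ChunkedDataSet`` and
its methods are hypothetical names, not an existing Pylearn API):

.. code-block:: python

    import theano
    import theano.tensor as T

    class ChunkedDataSet(object):
        """Sketch only: hold one chunk of the data in a shared variable
        and reload other chunks on demand."""
        def __init__(self, data, chunk_size):
            self.data = data
            self.chunk_size = chunk_size
            # Shared variable holding the chunk currently on the device.
            self.shared = theano.shared(data[:chunk_size], name='chunk')
            # Symbolic index into the shared variable; symbolic
            # expressions can be built on top of self.variable.
            self.symbolic_index = T.iscalar('index')
            self.variable = self.shared[self.symbolic_index]

        def load_chunk(self, chunk):
            # Reload another slice of the data into the shared variable,
            # e.g. between two inner loops over numeric indices.
            start = chunk * self.chunk_size
            self.shared.set_value(self.data[start:start + self.chunk_size])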

One issue with this approach is illustrated by the following example.
Imagine we want to iterate on samples in a dataset and do something with
their numeric value. We would want the code to be as close as possible to:

.. code-block:: python

    # [...]
    for numeric_index in xrange(len(dataset)):
        do_something_with(get_sample(numeric_index))

James comments: this is how I have written the last couple of projects;
it's slightly verbose, but it's clear and efficient.

<Razvan comments>: I assume that ``do_something_with`` is supposed to be
some numeric function, and that ``dataset`` in this case is the result of
some computations on an initial dataset. I would differentiate the two
approaches (1) and (2) as follows:

- first of all, whatever you can do with (1) you can do with (2);
- approach (1) hides the fact that you are working with symbolic graphs:
  you apply functions to datasets, and when you want to see values, a
  function is compiled under the hood and those values are computed for
  you. In approach (2) the fact that you deal with a symbolic graph is
  explicit, because you have to compile your functions manually;
- approach (1) needs the function_storage trick, shared between certain
  nodes of the graph, to reduce the number of compilations, while in
  approach (2) we do not need to deal with the complexity of lazy
  compilation;
- approach (1) needs a replace function if you want to change the dataset:
  once you have a "computational graph" (or pipeline, or whatever you call
  it), say ``graph``, you would change the input by doing
  ``graph.replace({init_data_X: new_data_X})`` (see the sketch just after
  this list). In approach (2), ``init_data_X`` and ``new_data_X`` are the
  ``dataset`` itself, so you would compile two different functions. To make
  this clearer, I would rewrite (2) as:

.. code-block:: python

    symbolic_index = theano.tensor.iscalar()
    get_sample1 = theano.function([symbolic_index],
                                  graph(dataset[symbolic_index]).variable)
    for numeric_index in xrange(len(dataset)):
        do_something_with(get_sample1(numeric_index))

    get_sample2 = theano.function([symbolic_index],
                                  graph(new_dataset[symbolic_index]).variable)
    # Note: the dataset was replaced with new_dataset.
    for numeric_index in xrange(len(new_dataset)):
        do_something_with(get_sample2(numeric_index))

    # For approach (1) you would write:

    for datapoint in graph:
        do_something_with(datapoint())

    new_graph = graph.replace({dataset: new_dataset})

    for datapoint in new_graph:
        do_something_with(datapoint())

- in approach (1) the initial dataset object (the one that loads the data)
  decides whether you will use shared variables and indices to deal with
  the dataset, or ``theano.tensor.matrix``; the user does not (at least not
  without hacking the code). Of course, whoever writes that class can add a
  flag to switch between the behaviours that make sense. In approach (2)
  one is not forced by construction to make this choice inside that class,
  though by convention I would do it. So if you consider the one who writes
  that class a developer, then in (2) it is the user, not the developer,
  who decides/deals with this. Though this is a fine line: I would say the
  user would actually write that class as well, using some template. That
  is to say, (2) looks and feels more like working with Theano directly.
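
A minimal sketch of what ``graph.replace`` could boil down to, assuming a
Theano version that exposes ``theano.clone`` (the helper name
``replace_inputs`` and the stand-in variables are hypothetical):

.. code-block:: python

    import theano
    import theano.tensor as T

    def replace_inputs(output_variable, replacements):
        # Rebuild the symbolic graph with the given substitutions applied,
        # which is essentially what graph.replace would have to do.
        return theano.clone(output_variable, replace=replacements)

    # Usage sketch, with plain variables standing in for datasets:
    init_data_X = T.matrix('init_data_X')
    new_data_X = T.matrix('new_data_X')
    out = T.tanh(init_data_X).sum()
    new_out = replace_inputs(out, {init_data_X: new_data_X})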

Bottom line, I think (1) puts more stress on the development of the
library, and hides Theano and some of the complexity for day-to-day usage.
In (2) everything is a bit more explicit, leaving the impression that you
have more control over the code, though I strongly feel that whatever can
be done in (2) can be done in (1). Traditionally I was more inclined
towards (1), but now I'm not that sure; I think both are equally
interesting and valid options.

</Razvan comments>

Note that although the above example focused on how to iterate over a
dataset, it can be cast into a more generic problem, where some data
(either dataset or sample) is the result of some transformation applied to
other data, which is parameterized by parameters p1, p2, ..., pN (in the
above example, we were [...]

Ideally it would be nice to let the user take control of what is being
compiled, while leaving the option of a sensible default behavior for those
who do not want to worry about it. How to achieve this is still to be
determined.

Razvan comment: I thought about this a bit at the Pylearn level. In my
original train of thought you would have a distinction between
"hand-picked" parameters, which I would call hyper-parameters, and learned
parameters. A transformation in this framework (an op, if you wish) could
take as inputs DataSet(s), DataField(s), Parameter(s) (the things that the
learner should adapt) and HyperParameter(s). All hyper-parameters would
turn into arguments of the compiled function (like the indices of each of
the dataset objects), and therefore they can be changed without
re-compilation. In other words, this can easily be done by having new types
of Variables that represent Parameters and Hyper-parameters. As an ending
note, I would say that there are hyper-parameters for which you need to
recompile the Theano function, and which therefore cannot be just function
arguments (so we would need yet another category?).
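
A minimal sketch of this Parameter / HyperParameter distinction in plain
Theano (the variables here are illustrative, not an existing Pylearn API):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')                             # data
    w = theano.shared(numpy.zeros(5), name='w')   # Parameter: learned
    lr = T.scalar('lr')                           # HyperParameter: argument
    cost = T.sqr(T.dot(x, w)).mean()
    train = theano.function([x, lr], cost,
                            updates=[(w, w - lr * T.grad(cost, w))])
    # lr can now change between calls without recompiling, e.g.
    # train(batch, 0.1) then train(batch, 0.01). A structural
    # hyper-parameter (say, the number of hidden units) would still
    # force recompilation, hence the extra category mentioned above.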

Another syntactic option for iterating over datasets is:

.. code-block:: python

    # [...]

In the code above, if one wants to obtain the numeric value of an element
of ``multiple_fields_dataset``, the Theano function being compiled would be
able to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.
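
In plain Theano terms, this is the gain from compiling one function with
several outputs: shared sub-graphs are computed once. A small sketch (the
variables are illustrative, not taken from the elided code above):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.vector('x')
    y = T.scalar('y')
    W = theano.shared(numpy.zeros((3, 4)), name='W')
    hidden = T.tanh(T.dot(x, W))    # common sub-graph
    prediction = hidden.sum()
    cost = T.sqr(prediction - y)
    # One compiled function with both outputs lets Theano compute
    # `hidden` only once for prediction and cost.
    f = theano.function([x, y], [prediction, cost])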

Razvan asks: What is predict_sample for? What is predict_dataset? What I
guess you mean is that the decorator is used to convert a function that
takes a Theano variable and outputs a Theano variable into a class/function
that takes a DataField/DataSet and outputs a DataField/DataSet. It could
also register all those different functions, so that the Dataset you get
out of the entire Learner (not out of one of the functions; this Dataset is
returned by __call__) would contain all of those as fields. I would use it
like this:

.. code-block:: python

    nnet = NeuralNetwork()
    results = nnet(dataset)
    for datapoint in results:
        print datapoint.prediction, datapoint.nll, ...

Is this close to what you are suggesting?