diff doc/v2_planning/datalearn.txt @ 1361:7548dc1b163c

Some question/suggestions to datalearn
author Razvan Pascanu <r.pascanu@gmail.com>
date Thu, 11 Nov 2010 22:40:01 -0500
parents 5db730bb0e8e
children 6b9673d72a41
--- a/doc/v2_planning/datalearn.txt	Thu Nov 11 18:08:05 2010 -0500
+++ b/doc/v2_planning/datalearn.txt	Thu Nov 11 22:40:01 2010 -0500
@@ -53,6 +53,24 @@
 just include those 'expressions coded in individual datasets' into the overall
 graph.
 
+Razvan comments: 1) Having Theano expressions inside the perform of a Theano
+Op can lead to issues. I know I had to deal with a few when implementing
+Scan, which does exactly this. To be fair, these issues mostly come into
+play when the inner graph has to interact with the outer graph, and most of
+the time they can be solved. All I'm saying is that going that way
+might cause developers some headaches, though I guess some headaches
+will be involved no matter what.
+2) In my view (I'm not sure this is what Olivier was saying) the idea of
+not putting the Dataset into a Variable is to keep the logic related to
+loading data, dividing it into slices when running on the GPU, and so on
+out of a Theano variable. In my view this logic goes into a DataSet class
+that gives you shared variables, symbolic indices into those shared
+variables, and also numeric indices. When looping through those numeric
+indices, the DataSet class can reload parts of the data into the
+shared variable and so on.
+
+
+
 One issue with this approach is illustrated by the following example. Imagine
 we want to iterate on samples in a dataset and do something with their
 numeric value. We would want the code to be as close as possible to:
@@ -111,6 +129,75 @@
 James comments: this is how I have written the last couple of projects, it's
 slightly verbose but it's clear and efficient.
 
+<Razvan comments>: I assume that ``do_something_with`` is supposed to be some
+numeric function, and dataset in this case is the result of some
+computations on an initial dataset.
+I would differentiate the two approaches (1) and (2) as follows:
+ - first of all whatever you can do with (1) you can do with (2)
+ - approach (1) hides the fact that you are working with symbolic graphs.
+   You apply functions to datasets, and when you want to see values a
+   function is compiled under the hood and those values are computed for
+   you. In approach (2) the fact that you deal with a symbolic graph is
+   explicit because you have to manually compile your functions.
+ - approach (1) needs to use this function_storage trick shared between
+   certain nodes of the graph to reduce the number of compilations, while in
+   approach (2) we don't need to deal with the complexity of lazy
+   compilation.
+ - approach (1) needs a replace function if you want to change the dataset.
+   Once you have a "computational graph" or pipeline
+   or whatever you call it, say ``graph``, to change the input you would call
+   ``graph.replace({init_data_X: new_data_X})``. In approach (2), init_data_X
+   and new_data_X each play the role of ``dataset``, so you would compile two
+   different functions. To make this clearer, I would rewrite (2)
+   as:
+
+   .. code-block:: python
+
+        symbolic_index = theano.tensor.iscalar()
+        get_sample1 = theano.function( [symbolic_index],
+                        graph( dataset[symbolic_index] ).variable)
+        for numeric_index in xrange(len(dataset)):
+            do_something_with(get_sample1(numeric_index))
+
+        get_sample2 = theano.function( [symbolic_index],
+                        graph( new_dataset[symbolic_index] ).variable)
+                        ## Note: the dataset was replaced with new_dataset
+        for numeric_index in xrange(len(new_dataset)):
+            do_something_with(get_sample2(numeric_index))
+
+        ######### FOR (1) you write: 
+
+        for datapoint in graph:
+            do_something_with( datapoint() )
+
+        new_graph = graph.replace({dataset: new_dataset})
+
+        for datapoint in new_graph:
+            do_something_with(datapoint())
+        
+ - in approach (1) the initial dataset object (the one that loads the data)
+   decides whether you will use shared variables and indices to deal with the
+   dataset or ``theano.tensor.matrix``; the user does not decide (at
+   least not without hacking the code). Of course whoever writes that class
+   can add a flag to it to switch between behaviours that make sense.
+   In approach (2) one is not forced to do this
+   inside that class by construction, though by convention I would do it.
+   So if you consider the one who writes that class as a developer, then
+   in (2) the user can decide/deal with this and not the developer.
+   Though this is a fine line -- I would say the user would actually
+   write that class as well, using some template.
+   That is to say, (2) looks and feels more like working with Theano
+   directly.
+
+Bottom line, I think (1) puts more stress on the development of the library,
+and hides Theano and some of the complexity for day-to-day usage.
+In (2) everything is a bit more explicit, leaving the impression that you
+have more control over the code, though I strongly feel that whatever can
+be done in (2) can be done in (1). Traditionally I was more inclined
+towards (1), but now I'm not so sure; I think both are equally interesting
+and valid options.
+</Razvan comments>
+
 Note that although the above example focused on how to iterate over a dataset,
 it can be cast into a more generic problem, where some data (either dataset or
 sample) is the result of some transformation applied to other data, which is
@@ -124,6 +211,19 @@
 those who do not want to worry about it. How to achieve this is still to be
 determined.
 
+Razvan Comment: I thought about this a bit at the Pylearn level. In my
+original train of thought you would have the distinction between ``hand
+picked parameters``, which I would call hyper-parameters, and learned
+parameters. A transformation in this framework (an op if you wish) could
+take as inputs DataSet(s), DataField(s), Parameter(s) (which are the things
+that the learner should adapt) and HyperParameter(s). All hyper-parameters
+will turn into arguments of the compiled function (like the indices of each
+of the dataset objects) and therefore they can be changed without
+re-compilation. In other words, this can easily be done by having new
+types of Variables that would represent Parameters and Hyper-parameters.
+As a closing note, I would say that there are
+hyper-parameters for which you need to recompile the Theano function and
+that cannot be just parameters (so we would have yet another category?).
 
 Another syntactic option for iterating over datasets is
 
@@ -212,3 +312,21 @@
 to optimize computations so that the simultaneous computation of
 ``prediction`` and ``cost`` is done efficiently.
 
+Razvan asks: What is predict_sample for? What is predict_dataset? What I
+guess you mean is that the decorator is used to convert a function that
+takes a Theano variable and outputs a Theano variable into a class/function
+that takes a DataField/DataSet and outputs a DataField/DataSet. It could
+also register all those different functions, so that the Dataset that
+you get out of the entire Learner, not out of one of the functions (this
+Dataset is returned by __call__), would contain all those as fields.
+I would use it like this:
+
+.. code-block:: python
+
+    nnet = NeuralNetwork()
+    results = nnet(dataset)
+    for datapoint in results:
+        print datapoint.prediction, datapoint.nll, ...
+
+Is this close to what you are suggesting?
+