DataLearn: How to plug Datasets & Learner together?
===================================================

Participants
------------
- Yoshua
- Razvan
- Olivier D [leader?]

High-Level Objectives
---------------------

* Simple ML experiments should be simple to write
* More complex / advanced scenarios should be possible without being forced
  to work "outside" of this framework
* Computations should be optimized whenever possible
* Existing code (in any language) should be "wrappable" within this
  framework
* It should be possible to replace [parts of] this framework with C++ code

Theano-Like Data Flow
---------------------

We want to rely on Theano to be able to take advantage of its efficient
computations. The general idea is that if we chain multiple processing
elements (think e.g. of a feature selection step followed by a PCA projection,
then a rescaling within a fixed bounded interval), the overall transformation
from input to output data can be represented by a Theano symbolic graph. When
one wants to access the actual numeric data, a function is compiled so as to
do these computations efficiently.

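To make the chaining idea concrete, here is a minimal sketch (the variable
names are made up, and the PCA matrix is simply assumed to be given) of how
such a pipeline could be expressed symbolically and then compiled as a single
function:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')          # symbolic dataset: rows = samples, columns = features
    pca_W = T.matrix('pca_W')  # PCA projection matrix, assumed given here

    selected = x[:, 0:5]                                         # feature selection
    projected = T.dot(selected - selected.mean(axis=0), pca_W)   # PCA projection
    rescaled = T.tanh(projected)                                 # rescale into (-1, 1)

    # A single function is compiled for the whole input -> output
    # transformation, so Theano can optimize across all three steps at once.
    transform = theano.function([x, pca_W], rescaled)

    data = numpy.random.rand(10, 20).astype(theano.config.floatX)
    components = numpy.random.rand(5, 3).astype(theano.config.floatX)
    output = transform(data, components)
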
We discussed some specific API options for datasets and learners, which will
be added to this file in the future, but a core question that we feel should
be addressed first is how this Theano-based implementation could be achieved
exactly. For this purpose, in the following, let us assume that a dataset is
simply a matrix whose rows represent individual samples, and columns
individual features. How to handle field names, non-tensor-like data, etc. is
a very important topic that is not yet discussed in this file.

A question we did not really discuss is whether datasets should be Theano
Variables. The advantage would be that they would fit directly within the
Theano framework, which may allow high level optimizations on data
transformations. However, we would lose the ability to combine Theano
expressions coded in individual datasets into a single graph. Currently, we
instead consider that a dataset has a member that is a Theano variable, and
this variable represents the data stored in the dataset. The same is done for
individual data samples.

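For illustration only, a minimal sketch of this convention (class and
attribute names are hypothetical, not a settled API): the dataset is not
itself a Variable, but it owns one, and indexing it symbolically yields
samples that own their own expressions.

.. code-block:: python

    import theano.tensor as T

    class Sample(object):
        def __init__(self, variable):
            # Symbolic expression representing this sample's data.
            self.variable = variable

    class MatrixDataset(object):
        def __init__(self, name='data'):
            # The dataset is a plain Python object; only this member is a
            # Theano variable (rows = samples, columns = features).
            self.variable = T.matrix(name)

        def __getitem__(self, index):
            # Symbolic indexing builds an expression on the dataset's
            # variable, so chained transformations stay in one graph.
            return Sample(self.variable[index])
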
James asks: Why would a Theano graph in which some nodes represent datasets give
up the ability to combine Theano expressions coded in individual datasets?
Firstly, if you want to use Theano expressions and compiled functions to
implement the perform() method of an Op, you can do that. Secondly, you can
just include those 'expressions coded in individual datasets' into the overall
graph.

OD replies to James: What I had in mind is you would be forced to compile your
own function inside the perform() method of an Op. This seemed like a
potential problem to me because it would prevent Theano from seeing the whole
fine-grained graph and do optimizations across multiple dataset
transformations (there may also be additional overhead from calling multiple
functions). But if you are saying it is possible to include 'expressions coded
in individual datasets' into the overall graph, then I guess this point is
moot. Would this be achieved with an optimization that replaces the dataset
node with its internal graph?

Razvan comments: 1) Having Theano expressions inside the perform() of a Theano
Op can lead to issues. I know I had to deal with a few when implementing
Scan, which does exactly this. Well, to be fair, these issues mostly come into
play when the inner graph has to interact with the outer graph, and most of
the time they can be solved. I guess all that I'm saying is that going that way
might lead to some headaches for developers, though I guess some headaches
will be involved no matter what.
2) In my view (I'm not sure this is what Olivier was saying) the idea of
not putting the Dataset into a Variable is to not put the logic related to
loading data, dividing it into slices when running it on the GPU and so on
into a Theano variable. In my view this logic goes into a DataSet class
that gives you shared variables, symbolic indices into those shared
variables, and also numeric indices. When looping through those numeric
indices, the dataset class can reload parts of the data into the
shared variable and so on.

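For illustration, a rough sketch of the kind of DataSet class Razvan
describes (all names are hypothetical, and the chunked reloading is only
stubbed in as a comment):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    class SharedDataSet(object):
        def __init__(self, data):
            data = numpy.asarray(data, dtype=theano.config.floatX)
            # Shared variable holding the chunk currently on the device.
            self.shared = theano.shared(data, name='dataset')
            # Symbolic index used to address one sample in that chunk.
            self.symbolic_index = T.iscalar('index')

        def symbolic_sample(self):
            # Expression for the sample addressed by the symbolic index.
            return self.shared[self.symbolic_index]

        def numeric_indices(self):
            # Numeric indices to loop over; a real implementation could
            # reload other chunks of a large dataset into ``self.shared``
            # from here.
            for i in xrange(self.shared.get_value(borrow=True).shape[0]):
                yield i
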
OD replies to Razvan's point 2: I think what you are saying is another concern
I had, which was the fact it may be confusing to mix in the same class the
Variable/Op and DataSet interfaces. I would indeed prefer to keep them
separate. However, it may be possible to come up with a system that would get
the best of both worlds (maybe by having the Op/Variable as members of
Dataset, and just asking the user building a Theano graph to use these instead
of the dataset directly). Note that I'm mixing up Op/Variable here, because
it's just not clear yet to me which would go where...

One issue with this approach is illustrated by the following example. Imagine
we want to iterate on samples in a dataset and do something with their
numeric value. We would want the code to be as close as possible to:

.. code-block:: python

    for sample in dataset:
        do_something_with(sample.numeric_value())

A naive implementation of the sample API could be (assuming each sample
contains a ``variable`` member which is the variable representing this
sample's data):

.. code-block:: python

    def numeric_value(self):
        if self.function is None:
            # Compile function to output the numeric value stored in this
            # sample's variable.
            self.function = theano.function([], self.variable)
        return self.function()

However, this is not a good idea, because it would trigger a new function
compilation for each sample. Instead, we would want something like this:

.. code-block:: python

    def numeric_value(self):
        if self.function_storage[0] is None:
            # Compile function to output the numeric value stored in this
            # sample's variable. This function takes as input the index of
            # the sample in the dataset, and is shared among all samples.
            self.function_storage[0] = theano.function(
                [self.symbolic_index], self.variable)
        return self.function_storage[0](self.numeric_index)

In the code above, we assume that all samples created by the action of
iterating over the dataset share the same ``function_storage``,
``symbolic_index`` and ``variable``: the first time we try to access the numeric
value of some sample, a function is compiled, that takes as input the index,
and outputs the variable. The only difference between samples is thus that
they are given a different numeric value for the index (``numeric_index``).

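As an illustration of that sharing, the dataset's ``__iter__`` could create
every sample with the same storage and only a different numeric index. This
is a sketch with hypothetical names, assuming the dataset already has a
``variable`` member and a length, and that its data lives behind a shared
variable so the compiled function needs no extra inputs:

.. code-block:: python

    import theano.tensor as T

    class Sample(object):
        def __init__(self, function_storage, symbolic_index, variable,
                     numeric_index):
            # numeric_value() as defined above compiles into
            # function_storage[0] on first use, then all samples reuse it.
            self.function_storage = function_storage
            self.symbolic_index = symbolic_index
            self.variable = variable
            self.numeric_index = numeric_index

    class DataSet(object):
        def __iter__(self):
            # One storage cell, one symbolic index and one symbolic sample
            # expression, shared by every Sample produced by this loop;
            # only the numeric index differs between samples.
            function_storage = [None]
            symbolic_index = T.iscalar('index')
            variable = self.variable[symbolic_index]
            for i in xrange(len(self)):
                yield Sample(function_storage, symbolic_index, variable, i)
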
Another way to obtain the same result is to actually let the user take care of
compiling the function. It would allow the user to really control what is
being compiled, at the cost of having to write more code:

.. code-block:: python

    symbolic_index = dataset.get_index()  # Or just theano.tensor.iscalar()
    get_sample = theano.function([symbolic_index],
                                 dataset[symbolic_index].variable)
    for numeric_index in xrange(len(dataset)):
        do_something_with(get_sample(numeric_index))

James comments: this is how I have written the last couple of projects, it's
slightly verbose but it's clear and efficient.

<Razvan comments>: I assume that ``do_something_with`` is supposed to be some
numeric function, and dataset in this case is the result of some
computations on an initial dataset.
I would differentiate the two approaches (1) and (2) as:

  - first of all whatever you can do with (1) you can do with (2)
  - approach (1) hides the fact that you are working with symbolic graphs.
    You apply functions to datasets, and when you want to see values a
    function is compiled under the hood and those values are computed for
    you. In approach (2) the fact that you deal with a symbolic graph is
    explicit because you have to manually compile your functions.
  - approach (1) needs to use this function_storage trick shared between
    certain nodes of the graph to reduce the number of compilations, while in
    approach (2) we don't need to deal with the complexity of lazy
    compilation.

OD comments: Well, to be fair, it means we put the burden of dealing with the
complexity of lazy compilation on the user (it's up to him to make sure he
compiles only one function).

  - approach (1) needs a replace function if you want to change the dataset.
    What you would do is, once you have a "computational graph" or pipeline
    or whatever you call it, say ``graph``, to change the input you would do
    graph.replace({init_data_X: new_data_X}). In approach (2) the init_data_X
    and new_data_X is the ``dataset``, so you would compile two different
    functions. Well, I would re-write (2) -- to make the above more clear --
    as:

    .. code-block:: python

        symbolic_index = theano.tensor.iscalar()
        get_sample1 = theano.function([symbolic_index],
                          graph(dataset[symbolic_index]).variable)
        for numeric_index in xrange(len(dataset)):
            do_something_with(get_sample1(numeric_index))

        get_sample2 = theano.function([symbolic_index],
                          graph(new_dataset[symbolic_index]).variable)
        ## Note: the dataset was replaced with new_dataset
        for numeric_index in xrange(len(new_dataset)):
            do_something_with(get_sample2(numeric_index))

        ######### FOR (1) you write:

        for datapoint in graph:
            do_something_with(datapoint())

        new_graph = graph.replace({dataset: dataset2})

        for datapoint in new_graph:
            do_something_with(datapoint())

OD comments: I don't really understand what is 'graph' in this code (it
appears in both approaches but is used differently). What I have in mind would
be more with 'graph' removed in the first approach you describe (#2), and
graph / new_graph replaced by dataset / new_dataset in the second one (#1).
You wouldn't need to call some graph.replace method: the graphs compiled for
iterating on 'dataset' and 'new_dataset' would be entirely separate (using two
different compiled functions, pretty much like #2).

RP answers: Yes, you are right. What I was trying to say is that if you have two
different datasets on which you want to apply the same pre-processing, you
can do that in both approaches. ``graph`` represents the pre-processing
steps in (2) and the end dataset (after preprocessing) in (1). So the idea
is that instead of making new_graph from scratch (re-applying all the
transforms on the original dataset) you can use replace. Or maybe the
__call__ (that compiles the function if needed) can get a givens dictionary
(that replaces datasets or more). I only gave this argument because I
thought this would be an issue people will raise. They will say, well, in (2)
the pipeline logic is separated from the data, so you can use the same
transformation with different data easily, while in (1) you write the
transformation rooted in a dataset, and if you want the same transformation
for a different dataset you have to re-write everything.

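To illustrate the ``givens`` idea RP mentions, Theano already lets the same
symbolic pipeline be rooted in different data at compilation time; a sketch
only, with made-up names and a trivial stand-in for the pipeline:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    index = T.iscalar('index')
    x = T.matrix('x')                   # generic symbolic dataset
    pipeline_output = T.tanh(x[index])  # stands for graph(x[index])

    data1 = theano.shared(numpy.random.rand(100, 5).astype(theano.config.floatX))
    data2 = theano.shared(numpy.random.rand(50, 5).astype(theano.config.floatX))

    # Same symbolic pipeline, two compiled functions, one per dataset.
    get_sample1 = theano.function([index], pipeline_output, givens={x: data1})
    get_sample2 = theano.function([index], pipeline_output, givens={x: data2})
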
OD replies: Still not sure I understand. If you have a "graph" function that
takes a dataset as input and outputs a new dataset, you can use this same
function with both (1) and (2). With (2) it is:
    theano.function([index], graph(my_dataset)[index].variable)
while with (1) the same function is compiled implicitly with:
    for sample in graph(my_dataset):
        ...

RP answers: right. I was actually constructing this stupid example in my mind,
where you would do something like:
    i1 = f1(data)
    i2 = f2(i1)
    i3 = f3(i2)
    ...
    iN = fN(iN-1)
and then you would say .. wait, I want to do this on new_data as well. Oh no, I
have to copy the entire block or whatever. That is so annoying. But actually you
could just write:

    def my_f(data):
        i1 = f1(data)
        ...
        return iN

and then just use that function, which is what you pointed out. I agree I'm
not sure anymore about the point that I was trying to make. It's like: if you are
a lazy programmer, and you write everything without functions, you can
argue that you like (2) more because you only pass the dataset at the end
and not at the beginning. But if (1) had the replace function, this
argument would fail. Though this only stands if you don't want to make
a function out of your pipeline that takes the dataset as input, which now
that I think about it is pretty stupid not to do. Sorry for that.

  - in approach (1) the initial dataset object (the one that loads the data)
    decides if you will use shared variables and indices to deal with the
    dataset or if you will use ``theano.tensor.matrix``, and not the user (at
    least not without hacking the code). Of course whoever writes that class
    can add a flag to it to switch between behaviours that make sense.
    In approach (2) one is not forced to do this
    inside that class by construction, though by convention I would do it.
    So if you consider the one who writes that class as a developer, then
    in (2) the user can decide/deal with this and not the developer.
    Though this is a fine line -- I would say the user would actually
    write that class as well, using some template.
    That is to say, (2) looks and feels more like working with Theano
    directly.

Bottom line, I think (1) puts more stress on the development of the library,
and hides Theano and some of the complexity for day to day usage.
In (2) everything is a bit more explicit, leaving the impression that you
have more control over the code, though I strongly feel that whatever can
be done in (2) can be done in (1). Traditionally I was more inclined
towards (1) but now I'm not that sure, I think both are equally interesting
and valid options.
</Razvan comments>

Note that although the above example focused on how to iterate over a dataset,
it can be cast into a more generic problem, where some data (either dataset or
sample) is the result of some transformation applied to other data, which is
parameterized by parameters p1, p2, ..., pN (in the above example, we were
considering a sample that was obtained by taking the p1-th element in a
dataset). If we use different values for a subset Q of the parameters but keep
other parameters fixed, we would probably want to compile a single function
that takes as input all parameters in Q, while other parameters are fixed.
Ideally it would be nice to let the user take control of what is being
compiled, while leaving the option of using a default sensible behavior for
those who do not want to worry about it. How to achieve this is still to be
determined.

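As a small illustration of "compile once over the varying parameters Q"
(hypothetical names; here p1, the sample index, varies while p2 stays fixed):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    data = theano.shared(
        numpy.random.rand(100, 5).astype(theano.config.floatX))

    p1 = T.iscalar('p1')  # varying parameter (in Q): the sample index
    p2 = 0.5              # fixed parameter: baked into the graph as a constant

    transformed = T.tanh(data[p1] * p2)

    # One function compiled over the varying parameters only; changing p2
    # would require recompiling (or turning it into an input instead).
    f = theano.function([p1], transformed)
    values = [f(i) for i in xrange(3)]
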
Razvan Comment: I thought about this a bit at the Pylearn level. In my
original train of thought you would have the distinction between ``hand
picked parameters``, which I would call hyper-parameters, and learned
parameters. A transformation in this framework (an op, if you wish) could
take as inputs DataSet(s), DataField(s), Parameter(s) (which are the things
that the learner should adapt) and HyperParameter(s). All hyper-parameters
will turn into arguments of the compiled function (like the indices of each
of the dataset objects) and therefore they can be changed without
re-compilation. Or in other words this can be easily done by having new
types of Variables that would represent Parameters and Hyper-parameters.
And as an ending note I would say that there are
hyper-parameters for which you need to recompile the Theano function and
which can not be just parameters (so we would have yet another category?).

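In plain Theano terms (a sketch, not the proposed Parameter / HyperParameter
Variable types), the distinction roughly maps onto shared variables versus
inputs of the compiled function:

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')
    # Learned parameter: a shared variable the learner updates in place.
    W = theano.shared(numpy.zeros((5, 3), dtype=theano.config.floatX), name='W')
    # Hyper-parameter: an argument of the compiled function, changeable at
    # each call without recompilation.
    l2_coeff = T.scalar('l2_coeff')

    cost = (T.dot(x, W) ** 2).sum() + l2_coeff * (W ** 2).sum()
    cost_fn = theano.function([x, l2_coeff], cost)
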
James: Another syntactic option for iterating over datasets is

.. code-block:: python

    for sample in dataset.numeric_iterator(batchsize=10):
        do_something_with(sample)

The numeric_iterator would create a symbolic batch index, and compile a single
function that extracts the corresponding minibatch. The arguments to the
numeric_iterator function can also specify what compile mode to use, any givens
you might want to apply, etc.

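One possible shape for such an iterator (a sketch only, not James's
implementation; it assumes the dataset's data lives in a Theano shared
variable ``self.shared``):

.. code-block:: python

    import theano
    import theano.tensor as T

    def numeric_iterator(self, batchsize=10, mode=None, givens=None):
        index = T.iscalar('batch_index')
        batch = self.shared[index * batchsize:(index + 1) * batchsize]
        # A single function is compiled here, then reused for every minibatch.
        get_batch = theano.function([index], batch, mode=mode,
                                    givens=givens if givens is not None else {})
        n_batches = self.shared.get_value(borrow=True).shape[0] // batchsize
        for i in xrange(n_batches):
            yield get_batch(i)
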
OD comments: Would there also be some kind of function cache to avoid
compiling the same function again if we re-iterate on the same dataset with
the same arguments? Maybe a more generic issue is: would there be a way for
Theano to be more efficient when re-compiling the same function that was
already compiled in the same program? (note that I am assuming here it is not
efficient, but I may be wrong).

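A minimal sketch of the kind of cache OD asks about (hypothetical; it only
avoids re-compiling within one program, keyed on the dataset identity and the
iterator arguments):

.. code-block:: python

    import theano
    import theano.tensor as T

    _function_cache = {}

    def get_minibatch_function(dataset, batchsize):
        key = (id(dataset), batchsize)
        if key not in _function_cache:
            index = T.iscalar('index')
            batch = dataset.shared[index * batchsize:(index + 1) * batchsize]
            _function_cache[key] = theano.function([index], batch)
        return _function_cache[key]
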
What About Learners?
--------------------

The discussion above only mentioned datasets, but not learners. The learning
part of a learner is not a main concern (currently). What matters most w.r.t.
what was discussed above is how a learner takes as input a dataset and outputs
another dataset that can be used with the dataset API.

James asks:
What's wrong with simply passing the variables corresponding to the dataset to
the constructor of the learner?
That seems much more flexible, compact, and clear than the decorator.

OD replies: Not sure I understand your idea here. We probably want a learner
to be able to compute its output on multiple datasets, without having to point
to these datasets within the learner itself (which seems cumbersome to me).
The point of the decorators is mostly to turn a single function (that outputs
a Theano variable for the output computed on a single sample) into a function
that can compute symbolic datasets as well as numeric sample outputs. Those
could instead also be different functions in the base Learner class if the
decorator approach is considered ugly / confusing.

A Learner may be able to compute various things. For instance, a Neural
Network may output a ``prediction`` vector (whose elements correspond to
estimated probabilities of each class in a classification task), as well as a
``cost`` vector (whose elements correspond to the penalized NLL, the NLL alone
and the classification error). We would want to be able to build a dataset
that contains some of these quantities computed on each sample in the input
dataset.

The Neural Network code would then look something like this:

.. code-block:: python

    class NeuralNetwork(Learner):

        @datalearn(..)
        def compute_prediction(self, sample):
            # Class probabilities for a single (symbolic) sample.
            return softmax(theano.tensor.dot(self.weights, sample.input))

        @datalearn(..)
        def compute_nll(self, sample):
            # Negative log-likelihood of the sample's target class.
            return -log(self.compute_prediction(sample)[sample.target])

        @datalearn(..)
        def compute_penalized_nll(self, sample):
            # NLL plus an L2 penalty on the weights.
            return (self.compute_nll(sample) +
                    theano.tensor.sum(self.weights**2))

        @datalearn(..)
        def compute_class_error(self, sample):
            # 1 if the most probable class differs from the target, else 0.
            probabilities = self.compute_prediction(sample)
            predicted_class = theano.tensor.argmax(probabilities)
            return predicted_class != sample.target

        @datalearn(..)
        def compute_cost(self, sample):
            # All cost-related quantities gathered into a single vector.
            return theano.tensor.concatenate([
                self.compute_penalized_nll(sample),
                self.compute_nll(sample),
                self.compute_class_error(sample),
            ])

The ``@datalearn`` decorator would be responsible for allowing such a Learner
to be used e.g. like this:

.. code-block:: python

    nnet = NeuralNetwork()
    # Dataset in -> (symbolic) dataset of predictions out.
    predict_dataset = nnet.compute_prediction(dataset)
    for sample in dataset:
        # Symbolic sample in -> symbolic prediction out.
        predict_sample = nnet.compute_prediction(sample)
    # Numeric values in -> numeric prediction out.
    predict_numeric = nnet.compute_prediction({'input': numpy.zeros(10)})
    multiple_fields_dataset = ConcatDataSet([
        nnet.compute_prediction(dataset),
        nnet.compute_cost(dataset),
    ])

In the code above, if one wants to obtain the numeric value of an element of
``multiple_fields_dataset``, the Theano function being compiled would be able
to optimize computations so that the simultaneous computation of
``prediction`` and ``cost`` is done efficiently.

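This is the same kind of optimization Theano already performs when a single
compiled function returns several outputs sharing part of their graph. A
minimal standalone illustration (independent of the datalearn API, variable
names made up):

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as T

    x = T.vector('input')
    w = theano.shared(numpy.zeros((3, 10)), name='weights')
    activation = T.dot(w, x)             # sub-graph shared by both outputs
    prediction = T.exp(activation) / T.sum(T.exp(activation))
    nll = -T.log(prediction[0])          # e.g. NLL if the target class is 0
    # A single compiled function returning both outputs: the shared part of
    # the graph ('activation', 'prediction') is computed only once per call.
    f = theano.function([x], [prediction, nll])
    pred_value, nll_value = f(numpy.ones(10))
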
Razvan asks: What is predict_sample for? What is predict_dataset? What I
guess you mean is that the decorator is used to convert a function that
takes a theano variable and outputs a theano variable into a class/function
that takes a DataField/DataSet and outputs a DataField/DataSet. It could
also register all those different functions, so that the Dataset you get out
of the entire Learner (not out of one of the functions; this Dataset is
returned by ``__call__``) would contain all those as fields.
I would use it like this:

.. code-block:: python

    nnet = NeuralNetwork()
    results = nnet(dataset)
    for datapoint in results:
        print datapoint.prediction, datapoint.nll, ...

Is this close to what you are suggesting?

OD: Yes, you guessed right, the decorator's role is to do something different
depending on the input to the function (see my reply to James above).