comparison doc/v2_planning/use_cases.txt @ 1106:21d25bed2ce9

use_cases: Comment about using predefined dataset dimensions
author Olivier Delalleau <delallea@iro>
date Mon, 13 Sep 2010 22:44:37 -0400
parents b422cbaddc52
children 0e12ea6ba661
1105:546bd0ccb0e4 1106:21d25bed2ce9
95 algorithms, etc.) can be swapped. 95 algorithms, etc.) can be swapped.
96 96
97 - there are no APIs for things which are not passed as arguments (i.e. the logic 97 - there are no APIs for things which are not passed as arguments (i.e. the logic
98 of the whole program is not exposed via some uber-API). 98 of the whole program is not exposed via some uber-API).
99 99
100 OD comments: I didn't have time to look closely at the details, but overall I
101 like the general feel of it. At least I'd expect us to need something like
102 that to be able to handle the multiple use cases we want to support. I must
103 say I'm a bit worried, though, that it could become scary pretty fast for a
104 newcomer, with 'lambda functions' and 'virtual machines'.
105 Anyway, one point I would like to comment on is the line that creates the
106 linear classifier. I hope that, as much as possible, we can avoid the need to
107 specify dataset dimensions / number of classes in algorithm constructors. I
108 regularly had issues in PLearn with the fact that we had, for instance, to give the
109 number of inputs when creating a neural network. I much prefer when this kind
110 of thing can be figured out at runtime:
111 - Any parameter you can get rid of is a significant gain in
112 user-friendliness.
113 - It's not always easy to know in advance e.g. the dimension of your input
114 dataset. Imagine for instance this dataset is obtained in a first step
115 by going through a PCA whose number of output dimensions is set so as to
116 keep 90% of the variance.
117 - It seems to me that it better fits the idea of a symbolic graph: my intuition
118 (that may be very different from what you actually have in mind) is to
119 see an experiment as a symbolic graph, which you instantiate when you
120 provide the input data. One advantage of this point of view is that it makes
121 it natural to re-use the same block components on various datasets /
122 splits, something we often want to do.
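To make the point about runtime-inferred dimensions concrete, here is a minimal sketch in Python with NumPy. It is not the API proposed in this document: `pca_keep_variance` and `LazyLinearClassifier` are hypothetical names invented for illustration, and the softmax training loop is just one simple way to fit a linear classifier. The only thing it is meant to show is that the constructor takes no input dimension or number of classes; both are read from the data when `fit()` is called, so the PCA step can decide its output dimension on its own.

```python
import numpy as np


def pca_keep_variance(X, fraction=0.9):
    """Project X onto the fewest principal components that retain at
    least `fraction` of the variance. The number of output dimensions
    is therefore only known once the data has been seen."""
    Xc = X - X.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(explained, fraction)) + 1, len(s))
    return Xc @ Vt[:k].T


class LazyLinearClassifier:
    """Hypothetical linear (softmax) classifier: no dataset dimensions
    in the constructor, they are inferred inside fit()."""

    def __init__(self, lr=0.1, n_epochs=200):
        self.lr = lr
        self.n_epochs = n_epochs
        self.W = None  # allocated in fit(), once shapes are known
        self.b = None

    def fit(self, X, y):
        n_inputs = X.shape[1]                       # inferred from the data
        self.classes_, y_idx = np.unique(y, return_inverse=True)
        n_classes = len(self.classes_)              # inferred from the labels
        self.W = np.zeros((n_inputs, n_classes))
        self.b = np.zeros(n_classes)
        onehot = np.eye(n_classes)[y_idx]
        for _ in range(self.n_epochs):
            logits = X @ self.W + self.b
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)
            grad = (p - onehot) / len(X)            # cross-entropy gradient
            self.W -= self.lr * (X.T @ grad)
            self.b -= self.lr * grad.sum(axis=0)
        return self

    def predict(self, X):
        return self.classes_[np.argmax(X @ self.W + self.b, axis=1)]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

# The PCA output dimension is decided by the 90% variance criterion.
Z = pca_keep_variance(X, fraction=0.9)
clf = LazyLinearClassifier().fit(Z, y)

# The same block is reused, unchanged, on a dataset with other dimensions.
X3 = rng.normal(size=(60, 4))
y3 = np.arange(60) % 3
clf3 = LazyLinearClassifier().fit(X3, y3)
```

Under these assumptions the same component description can be instantiated on different datasets or splits without editing any constructor argument, which is the re-use property mentioned in the last bullet above.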
100 123
101 K-fold cross validation of a classifier 124 K-fold cross validation of a classifier
102 --------------------------------------- 125 ---------------------------------------
103 126
104 splits = kfold_cross_validate( 127 splits = kfold_cross_validate(