# HG changeset patch
# User Olivier Delalleau
# Date 1284432277 14400
# Node ID 21d25bed2ce92e88b44f0d349ad93efa47bde476
# Parent  546bd0ccb0e4c837f05eefd63e320c1497accee1
use_cases: Comment about using predefined dataset dimensions

diff -r 546bd0ccb0e4 -r 21d25bed2ce9 doc/v2_planning/use_cases.txt
--- a/doc/v2_planning/use_cases.txt	Mon Sep 13 22:06:23 2010 -0400
+++ b/doc/v2_planning/use_cases.txt	Mon Sep 13 22:44:37 2010 -0400
@@ -97,6 +97,29 @@
 - there are no APIs for things which are not passed as arguments (i.e. the
   logic of the whole program is not exposed via some uber-API).
 
+OD comments: I didn't have time to look closely at the details, but overall I
+like the general feel of it. At least I'd expect us to need something like
+that to be able to handle the multiple use cases we want to support. I must
+say, though, that I'm a bit worried it could quickly become scary to a
+newcomer, with 'lambda functions' and 'virtual machines'.
+Anyway, one point I would like to comment on is the line that creates the
+linear classifier. I hope that, as much as possible, we can avoid the need to
+specify dataset dimensions / number of classes in algorithm constructors. I
+regularly had issues in PLearn with the fact that we had, for instance, to
+give the number of inputs when creating a neural network. I much prefer it
+when this kind of thing can be figured out at runtime:
+  - Any parameter you can get rid of is a significant gain in
+    user-friendliness.
+  - It's not always easy to know in advance e.g. the dimension of your input
+    dataset. Imagine for instance that this dataset is obtained in a first
+    step by going through a PCA whose number of output dimensions is set so
+    as to keep 90% of the variance.
+  - It seems to me this fits better with the idea of a symbolic graph: my
+    intuition (which may be very different from what you actually have in
+    mind) is to see an experiment as a symbolic graph, which you instantiate
+    when you provide the input data. One advantage of this point of view is
+    that it makes it natural to re-use the same block components on various
+    datasets / splits, something we often want to do.
 
 K-fold cross validation of a classifier
 ----------------------------------------
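
The runtime-inference idea in the comment above can be made concrete with a
small sketch. The snippet below is not part of the patch: the class name
LinearClassifier and its fit/predict methods are hypothetical, and it is only
one way (plain multinomial logistic regression with numpy) to illustrate the
point. The constructor takes no dataset-dependent sizes; the weights are
allocated the first time fit() sees the data, so the same object works
unchanged after e.g. a PCA whose output dimension is only known at runtime.

    import numpy as np

    class LinearClassifier(object):
        """Multiclass linear classifier; sizes are inferred from the data."""

        def __init__(self, learning_rate=0.1, n_epochs=100):
            # No input dimension / number of classes passed here.
            self.learning_rate = learning_rate
            self.n_epochs = n_epochs
            self.W = None
            self.b = None

        def fit(self, X, y):
            X = np.asarray(X, dtype=float)
            y = np.asarray(y, dtype=int)
            if self.W is None:
                n_features = X.shape[1]        # inferred from the dataset
                n_classes = int(y.max()) + 1   # inferred from the labels
                self.W = np.zeros((n_features, n_classes))
                self.b = np.zeros(n_classes)
            onehot = np.eye(self.W.shape[1])[y]
            # Gradient descent on the average cross-entropy loss.
            for _ in range(self.n_epochs):
                scores = X.dot(self.W) + self.b
                scores -= scores.max(axis=1, keepdims=True)
                probs = np.exp(scores)
                probs /= probs.sum(axis=1, keepdims=True)
                grad = probs - onehot
                self.W -= self.learning_rate * X.T.dot(grad) / len(X)
                self.b -= self.learning_rate * grad.mean(axis=0)
            return self

        def predict(self, X):
            X = np.asarray(X, dtype=float)
            return np.argmax(X.dot(self.W) + self.b, axis=1)

With this convention the classifier can sit at the end of a pipeline (a PCA
keeping 90% of the variance, k-fold splits, etc.) and be re-used on datasets
of different dimensionality without the user ever spelling out the sizes.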