Mercurial > pylearn
changeset 1106:21d25bed2ce9
use_cases: Comment about using predefined dataset dimensions
author    Olivier Delalleau <delallea@iro>
date      Mon, 13 Sep 2010 22:44:37 -0400
parents   546bd0ccb0e4
children  e5306f5626d4
files     doc/v2_planning/use_cases.txt
diffstat  1 files changed, 23 insertions(+), 0 deletions(-)
--- a/doc/v2_planning/use_cases.txt	Mon Sep 13 22:06:23 2010 -0400
+++ b/doc/v2_planning/use_cases.txt	Mon Sep 13 22:44:37 2010 -0400
@@ -97,6 +97,29 @@
 - there are no APIs for things which are not passed as arguments (i.e. the
   logic of the whole program is not exposed via some uber-API).
 
+OD comments: I didn't have time to look closely at the details, but overall I
+like the general feel of it. At least I'd expect us to need something like
+that to be able to handle the multiple use cases we want to support. I must
+say I'm a bit worried though that it could become scary pretty fast to the
+newcomer, with 'lambda functions' and 'virtual machines'.
+Anyway, one point I would like to comment on is the line that creates the
+linear classifier. I hope that, as much as possible, we can avoid the need to
+specify dataset dimensions / number of classes in algorithm constructors. I
+regularly had issues in PLearn with the fact we had for instance to give the
+number of inputs when creating a neural network. I much prefer when this kind
+of thing can be figured out at runtime:
+    - Any parameter you can get rid of is a significant gain in
+      user-friendliness.
+    - It's not always easy to know in advance e.g. the dimension of your input
+      dataset. Imagine for instance this dataset is obtained in a first step
+      by going through a PCA whose number of output dimensions is set so as to
+      keep 90% of the variance.
+    - It seems to me it fits better the idea of a symbolic graph: my intuition
+      (that may be very different from what you actually have in mind) is to
+      see an experiment as a symbolic graph, which you instantiate when you
+      provide the input data. One advantage of this point of view is it makes
+      it natural to re-use the same block components on various datasets /
+      splits, something we often want to do.
 
 K-fold cross validation of a classifier
 ---------------------------------------
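The deferred-dimension idea argued for in the comment above can be sketched in plain Python. This is a hypothetical illustration, not pylearn's actual API: the class names (`PCA`, `LinearClassifier`) and their methods are made up for the example. Neither constructor takes a dataset dimension or class count; the PCA picks its number of output dimensions at fit time so as to keep 90% of the variance, and the classifier infers its input dimension and number of classes from the first batch of data it sees, so the same unconfigured blocks can be reused on different datasets or splits.

```python
import numpy as np

class PCA:
    """Keeps enough principal components to retain `var_kept` of the
    variance. The output dimension is decided at fit time, not passed
    to the constructor."""
    def __init__(self, var_kept=0.90):
        self.var_kept = var_kept

    def fit(self, X):
        self.mean = X.mean(axis=0)
        # SVD of centered data; squared singular values are component variances
        _, s, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = int(np.searchsorted(ratio, self.var_kept)) + 1
        self.components = Vt[:k]          # k chosen from the data itself
        return self

    def transform(self, X):
        return (X - self.mean) @ self.components.T

class LinearClassifier:
    """Softmax regression trained by gradient descent. No input dimension
    or class count in the constructor: both are inferred in fit()."""
    def __init__(self, lr=0.1, n_iter=100):
        self.lr, self.n_iter = lr, n_iter

    def fit(self, X, y):
        n_in = X.shape[1]                 # inferred, not user-supplied
        n_classes = int(y.max()) + 1      # inferred, not user-supplied
        self.W = np.zeros((n_in, n_classes))
        self.b = np.zeros(n_classes)
        onehot = np.eye(n_classes)[y]
        for _ in range(self.n_iter):
            scores = X @ self.W + self.b
            p = np.exp(scores - scores.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            self.W -= self.lr * X.T @ (p - onehot) / len(X)
            self.b -= self.lr * (p - onehot).mean(axis=0)
        return self

    def predict(self, X):
        return (X @ self.W + self.b).argmax(axis=1)

# The same un-parameterized blocks work on any dataset/split:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)
pca = PCA(var_kept=0.90).fit(X)
Z = pca.transform(X)                      # output dim chosen at fit time
clf = LinearClassifier().fit(Z, y)        # dims/classes inferred here
acc = (clf.predict(Z) == y).mean()
```

Because no shapes are baked in at construction, `PCA()` and `LinearClassifier()` can be wired together before any data exists and instantiated later against each train split, which is exactly the symbolic-graph-style reuse the comment asks for.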