view doc/v2_planning/use_cases.txt @ 1233:91c285e30364
Added another code review tool to look into
author | Olivier Delalleau <delallea@iro> |
---|---|
date | Thu, 23 Sep 2010 10:10:00 -0400 |
parents | 0e12ea6ba661 |
Use Cases (Functional Requirements)
===================================

These use cases exhibit pseudo-code for some of the sorts of tasks listed in
the requirements (requirements.txt).


Evaluate a classifier on MNIST
-------------------------------

The evaluation of a classifier on MNIST requires iterating over examples in
some set (e.g. validation, test) and comparing the model's prediction with the
correct answer. The score of the classifier is the number of correct
predictions divided by the total number of predictions.

To perform this calculation, the user should specify:

- the classifier (e.g. a function operating on weights loaded from disk)
- the dataset (e.g. MNIST)
- the subset of examples on which to evaluate (e.g. test set)

For example:

    vm.call(classification_accuracy(
        function = classifier,
        examples = MNIST.validation_iterator))

The user types very little beyond the description of the fields necessary for
the computation: there is no boilerplate. The `MNIST.validation_iterator` must
respect a protocol that remains to be worked out.

The `vm.call` is a compilation & execution step, as opposed to the
symbolic-graph building performed by the `classification_accuracy` call.


Train a linear classifier on MNIST
----------------------------------

The training of a linear classifier requires specification of

- problem dimensions (e.g. number of inputs, number of classes)
- parameter initialization method
- regularization
- dataset
- schedule for obtaining training examples (e.g. batch, online, minibatch,
  weighted examples)
- algorithm for adapting parameters (e.g. SGD, conjugate gradient)
- a stopping criterion (may be in terms of validation examples)

Often the dataset determines the problem dimensions.

Often the training examples and validation examples come from the same set
(e.g. a large matrix of all examples) but this is not necessarily the case.

There are many ways that the training could be configured, but here is one:

.. code-block:: python

    vm.call(
        halflife_stopper(
            # OD: is n_hidden supposed to be n_classes instead?
            initial_model=random_linear_classifier(MNIST.n_inputs, MNIST.n_hidden, r_seed=234432),
            burnin=100,
            score_fn = vm_lambda(('learner_obj',),
                classification_accuracy(
                    examples=MNIST.validation_dataset,
                    function=as_classifier('learner_obj'))),
            step_fn = vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters = vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.training_dataset,
                            batchsize=100,
                            loop=True)),
                    momentum=0.9,
                    anneal_at_iter=50,
                    n_iter=100))))  # step_fn goes through lots of examples (e.g. an epoch)

Although I expect this specific code might have to change quite a bit in a
final version, I want to draw attention to a few aspects of it:

- we build a symbolic expression graph that contains the whole program, not
  just the learning algorithm

- the configuration language allows for callable objects (e.g. functions,
  curried functions) to be arguments

- there is a lambda function-constructor (vm_lambda) we can use in this
  language

- APIs and protocols are at work in establishing conventions for
  parameter-passing so that sub-expressions (e.g. datasets, optimization
  algorithms, etc.) can be swapped.

- there are no APIs for things which are not passed as arguments (i.e. the
  logic of the whole program is not exposed via some uber-API).

OD comments: I didn't have time to look closely at the details, but overall I
like the general feel of it. At least I'd expect us to need something like
that to be able to handle the multiple use cases we want to support. I must
say I'm a bit worried though that it could become scary pretty fast to the
newcomer, with 'lambda functions' and 'virtual machines'.
Anyway, one point I would like to comment on is the line that creates the
linear classifier. I hope that, as much as possible, we can avoid the need to
specify dataset dimensions / number of classes in algorithm constructors.
I regularly had issues in PLearn with the fact that we had, for instance, to
give the number of inputs when creating a neural network. I much prefer when
this kind of thing can be figured out at runtime:

- Any parameter you can get rid of is a significant gain in
  user-friendliness.
- It's not always easy to know in advance e.g. the dimension of your input
  dataset. Imagine for instance this dataset is obtained in a first step
  by going through a PCA whose number of output dimensions is set so as to
  keep 90% of the variance.
- It seems to me that this fits the idea of a symbolic graph better: my
  intuition (that may be very different from what you actually have in
  mind) is to see an experiment as a symbolic graph, which you instantiate
  when you provide the input data. One advantage of this point of view is
  that it makes it natural to re-use the same block components on various
  datasets / splits, something we often want to do.


K-fold cross validation of a classifier
---------------------------------------

.. code-block:: python

    splits = kfold_cross_validate(
        # OD: What would these parameters mean?
        indexlist = range(1000),
        train = 8,
        valid = 1,
        test = 1,
    )

    trained_models = [
        halflife_early_stopper(
            initial_model=alloc_model('param1', 'param2'),
            burnin=100,
            score_fn = vm_lambda(('learner_obj',),
                classification_error(
                    function=as_classifier('learner_obj'),
                    dataset=MNIST.subset(validation_set))),
            step_fn = vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters = vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.subset(train_set),
                            batchsize=100,
                            loop=True)),
                    n_iter=100)))
        for (train_set, validation_set, test_set) in splits]

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

I want to draw attention to the fact that the call method treats the
expression tree as one big lambda expression, with potentially free variables
that must be assigned - here the 'param1' and 'param2' arguments to
`alloc_model`. There is no need to have separate compile and run steps like in
Theano because these functions are expected to be long-running, and called
once.


Analyze the results of the K-fold cross validation
--------------------------------------------------

It often happens that a user doesn't know what statistics to compute *before*
running a bunch of learning jobs, but only afterward. This can be done by
extending the symbolic program, and calling the extended function.

    vm.call(
        [pylearn.min(pylearn_getattr(model, 'weights')) for model in trained_models],
        param1=1, param2=2)

If this is run after the previous calls:

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

then it should run very quickly, because the `vm` can cache the return values
of the trained_models when param1=1 and param2=2.
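
The caching behaviour described above can be illustrated with a toy evaluator.
This is only a sketch under stated assumptions: `ToyVM`, `Var`, `Apply`,
`free_vars`, `train` and `get_weights` are hypothetical names invented for
this example, not pylearn or Theano APIs. The cache is keyed on the node
object plus the bindings of the free variables that node depends on, which is
one possible design and not necessarily the one the final `vm` would use.

```python
# Hypothetical sketch: none of these names are pylearn APIs.

class Var:
    """A free variable, bound by keyword when the program is called."""
    def __init__(self, name):
        self.name = name


class Apply:
    """A symbolic node: a function applied to sub-expressions or constants."""
    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args


def free_vars(expr):
    """Return the names of the free variables an expression depends on."""
    if isinstance(expr, Var):
        return frozenset([expr.name])
    if isinstance(expr, Apply):
        return frozenset().union(*(free_vars(a) for a in expr.args))
    return frozenset()  # constants have no free variables


class ToyVM:
    def __init__(self):
        self._cache = {}

    def call(self, expr, **bindings):
        if isinstance(expr, list):
            return [self.call(e, **bindings) for e in expr]
        if isinstance(expr, Var):
            return bindings[expr.name]
        if not isinstance(expr, Apply):
            return expr  # plain constant
        # Key on the node itself (hashed by identity) and on the values of
        # the free variables it uses; bound values must be hashable.
        used = tuple(sorted((n, bindings[n]) for n in free_vars(expr)))
        key = (expr, used)
        if key not in self._cache:
            args = [self.call(a, **bindings) for a in expr.args]
            self._cache[key] = expr.fn(*args)
        return self._cache[key]


train_calls = []

def train(fold, p1, p2):
    """Stand-in for a long-running training job on one fold."""
    train_calls.append((fold, p1, p2))
    return {"weights": [fold + p1, fold + p2]}

def get_weights(model):
    return model["weights"]


trained_models = [Apply(train, fold, Var("param1"), Var("param2"))
                  for fold in range(3)]

vm = ToyVM()
vm.call(trained_models, param1=1, param2=2)
vm.call(trained_models, param1=3, param2=4)
calls_after_training = len(train_calls)

# Extend the program afterwards: the training nodes are reused from the cache.
stats = [Apply(min, Apply(get_weights, model)) for model in trained_models]
min_weights = vm.call(stats, param1=1, param2=2)
```

In this sketch the second-stage call only evaluates the new `min` and
`get_weights` nodes; `train` is not invoked again, because each training node
was already evaluated with param1=1 and param2=2.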