diff doc/v2_planning/use_cases.txt @ 1093:a65598681620
v2planning - initial commit of use_cases, requirements
author:   James Bergstra <bergstrj@iro.umontreal.ca>
date:     Sun, 12 Sep 2010 21:45:22 -0400
parents:
children: 8be7928cc1aa
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/use_cases.txt	Sun Sep 12 21:45:22 2010 -0400
@@ -0,0 +1,157 @@

Use Cases (Functional Requirements)
===================================

These use cases give pseudo-code for some of the sorts of tasks listed in the
requirements (requirements.txt).


Evaluate a classifier on MNIST
------------------------------

The evaluation of a classifier on MNIST requires iterating over examples in some
set (e.g. validation, test) and comparing the model's prediction with the
correct answer.  The score of the classifier is the number of correct
predictions divided by the total number of predictions.

To perform this calculation, the user should specify:

- the classifier (e.g. a function operating on weights loaded from disk)
- the dataset (e.g. MNIST)
- the subset of examples on which to evaluate (e.g. the test set)

For example:

    vm.call(classification_accuracy(
        function=classifier,
        examples=MNIST.validation_iterator))

The user types little beyond a description of the fields necessary for the
computation; there is no boilerplate.  The `MNIST.validation_iterator` must
respect a protocol that remains to be worked out.

The `vm.call` is a compilation and execution step, as opposed to the
symbolic-graph building performed by the `classification_accuracy` call.


Train a linear classifier on MNIST
----------------------------------

The training of a linear classifier requires specification of:

- problem dimensions (e.g. number of inputs, number of classes)
- parameter initialization method
- regularization
- dataset
- schedule for obtaining training examples (e.g. batch, online, minibatch,
  weighted examples)
- algorithm for adapting parameters (e.g. SGD, conjugate gradient)
- a stopping criterion (which may be in terms of validation examples)

Often the dataset determines the problem dimensions.

Often the training examples and validation examples come from the same set
(e.g. a large matrix of all examples), but this is not necessarily the case.

There are many ways that the training could be configured, but here is one:

    vm.call(
        halflife_stopper(
            initial_model=random_linear_classifier(MNIST.n_inputs, MNIST.n_classes, r_seed=234432),
            burnin=100,
            score_fn=vm_lambda(('learner_obj',),
                classification_accuracy(
                    examples=MNIST.validation_dataset,
                    function=as_classifier('learner_obj'))),
            step_fn=vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters=vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.training_dataset,
                            batchsize=100,
                            loop=True)),
                    momentum=0.9,
                    anneal_at_iter=50,
                    n_iter=100))))  # step_fn goes through lots of examples (e.g. an epoch)

Although I expect this specific code might have to change quite a bit in a final
version, I want to draw attention to a few aspects of it:

- we build a symbolic expression graph that contains the whole program, not just
  the learning algorithm

- the configuration language allows callable objects (e.g. functions, curried
  functions) to be arguments

- there is a lambda-function constructor (`vm_lambda`) that we can use in this
  language

- APIs and protocols are at work in establishing conventions for
  parameter-passing, so that sub-expressions (e.g. datasets, optimization
  algorithms, etc.) can be swapped.

- there are no APIs for things which are not passed as arguments (i.e. the logic
  of the whole program is not exposed via some uber-API).
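To make these aspects concrete, here is a toy sketch of the build-graph-then-call
pattern in plain Python.  It is not the proposed pylearn API: `Call`, `VM`,
`accuracy`, and the string-as-free-variable convention are stand-ins invented for
illustration, loosely mirroring how the pseudo-code passes names like
'learner_obj' and binds them at call time.

    # Toy sketch of the build-graph-then-call pattern; NOT the proposed API.
    # `Call` and `VM` are hypothetical stand-ins for whatever machinery backs
    # `classification_accuracy(...)` and `vm.call(...)`.

    class Call(object):
        """Record a function application without executing it."""
        def __init__(self, fn, **kwargs):
            self.fn = fn
            self.kwargs = kwargs

    class VM(object):
        """Walk an expression graph and execute it, given free-variable bindings."""
        def call(self, expr, **bindings):
            if isinstance(expr, Call):
                kwargs = dict((k, self.call(v, **bindings))
                              for k, v in expr.kwargs.items())
                return expr.fn(**kwargs)
            if isinstance(expr, str) and expr in bindings:
                return bindings[expr]   # a string names a free variable,
                                        # mirroring 'learner_obj' above
            return expr                 # literal leaf: datasets, numbers, ...

    # Toy stand-ins for the classifier and the validation iterator.
    def accuracy(function, examples):
        return sum(function(x) == y for (x, y) in examples) / float(len(examples))

    validation_examples = [(-2.0, 0), (-0.5, 0), (1.0, 1), (3.0, 1)]

    # Building the graph does no computation ...
    expr = Call(accuracy,
                function='classifier',          # free variable, bound at call time
                examples=validation_examples)

    # ... vm.call is the single compile-and-run step.
    vm = VM()
    print(vm.call(expr, classifier=lambda x: int(x > 0.0)))   # prints 1.0

The point of the sketch is only that `vm.call` is the single compile-and-run
step: nothing is computed while the expression is being assembled.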

K-fold cross validation of a classifier
---------------------------------------

    splits = kfold_cross_validate(
        indexlist=range(1000),
        train=8,
        valid=1,
        test=1,
        )

    trained_models = [
        halflife_early_stopper(
            initial_model=alloc_model('param1', 'param2'),
            burnin=100,
            score_fn=vm_lambda(('learner_obj',),
                classification_error(
                    function=as_classifier('learner_obj'),
                    dataset=MNIST.subset(validation_set))),
            step_fn=vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters=vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.subset(train_set),
                            batchsize=100,
                            loop=True)),
                    n_iter=100)))
        for (train_set, validation_set, test_set) in splits]

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

I want to draw attention to the fact that the call method treats the expression
tree as one big lambda expression, with potentially free variables that must be
assigned - here the 'param1' and 'param2' arguments to `alloc_model`.  There is
no need for separate compile and run steps as in Theano, because these functions
are expected to be long-running and called once.


Analyze the results of the K-fold cross validation
--------------------------------------------------

It often happens that a user doesn't know what statistics to compute *before*
running a bunch of learning jobs, but only afterward.  This can be done by
extending the symbolic program and calling the extended function:

    vm.call(
        [pylearn.min(model.weights) for model in trained_models],
        param1=1, param2=2)

If this is run after the previous calls

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

then it should run very quickly, because the `vm` can cache the return values of
the `trained_models` when `param1=1` and `param2=2`.
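A toy sketch of how that caching could work, continuing the hypothetical
`Call`-node convention from the earlier sketch (none of this is the proposed
API): each `Call` node's result is memoized per set of free-variable bindings,
so an extended program re-uses the already-evaluated `trained_models`
sub-graphs instead of re-running them.  `CachingVM`, `train_model`, and
`min_weight` are illustrative names only.

    # Toy sketch of per-node result caching; NOT the proposed pylearn API.
    # `Call` is the same hypothetical node type as in the earlier sketch, and
    # `CachingVM` memoizes each node per set of free-variable bindings.

    class Call(object):
        """Record a function application without executing it."""
        def __init__(self, fn, **kwargs):
            self.fn = fn
            self.kwargs = kwargs

    class CachingVM(object):
        def __init__(self):
            self.cache = {}  # (node id, frozen bindings) -> computed value

        def call(self, expr, **bindings):
            frozen = tuple(sorted(bindings.items()))
            return self._eval(expr, bindings, frozen)

        def _eval(self, expr, bindings, frozen):
            if isinstance(expr, list):
                return [self._eval(e, bindings, frozen) for e in expr]
            if isinstance(expr, str) and expr in bindings:
                return bindings[expr]          # free variable
            if isinstance(expr, Call):
                key = (id(expr), frozen)
                if key not in self.cache:      # evaluate each node at most once
                    kwargs = dict((k, self._eval(v, bindings, frozen))
                                  for k, v in expr.kwargs.items())
                    self.cache[key] = expr.fn(**kwargs)
                return self.cache[key]
            return expr                        # literal leaf

    # Illustrative stand-ins for the training jobs and the later analysis.
    def train_model(param1, param2, fold):
        print("training fold %d" % fold)       # visible (expensive) work
        return [param1 + fold, param2 - fold]  # pretend these are weights

    def min_weight(model):
        return min(model)

    trained_models = [Call(train_model, param1='param1', param2='param2', fold=i)
                      for i in range(3)]

    vm = CachingVM()
    vm.call(trained_models, param1=1, param2=2)   # trains all three "models"

    # Extending the program re-uses the cached models: no "training fold ..."
    # lines are printed by this second call.
    print(vm.call([Call(min_weight, model=m) for m in trained_models],
                  param1=1, param2=2))            # prints [1, 1, 0]

Keying the cache on node identity plus the bindings is just one possible policy;
hashing the structure of the sub-graph instead would allow sharing across
separately-built but identical expressions.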