doc/v2_planning/use_cases.txt @ 1104:5e6d7d9e803a
changeset message: "a comment on the GPU issue for datasets"
author:   Razvan Pascanu <r.pascanu@gmail.com>
date:     Mon, 13 Sep 2010 20:21:23 -0400
parents:  b422cbaddc52
children: 21d25bed2ce9
Use Cases (Functional Requirements)
===================================

These use cases exhibit pseudo-code for some of the sorts of tasks listed in
the requirements (requirements.txt).


Evaluate a classifier on MNIST
------------------------------

The evaluation of a classifier on MNIST requires iterating over examples in
some set (e.g. validation, test) and comparing the model's prediction with
the correct answer.  The score of the classifier is the number of correct
predictions divided by the total number of predictions.

To perform this calculation, the user should specify:

- the classifier (e.g. a function operating on weights loaded from disk)
- the dataset (e.g. MNIST)
- the subset of examples on which to evaluate (e.g. the test set)

For example::

    vm.call(classification_accuracy(
        function=classifier,
        examples=MNIST.validation_iterator))

The user types very little beyond the description of the fields necessary for
the computation; there is no boilerplate.  The `MNIST.validation_iterator`
must respect a protocol that remains to be worked out.

The `vm.call` is a compilation & execution step, as opposed to the
symbolic-graph building performed by the `classification_accuracy` call.


Train a linear classifier on MNIST
----------------------------------

The training of a linear classifier requires specification of:

- problem dimensions (e.g. number of inputs, number of classes)
- parameter initialization method
- regularization
- dataset
- schedule for obtaining training examples (e.g. batch, online, minibatch,
  weighted examples)
- algorithm for adapting parameters (e.g. SGD, conjugate gradient)
- a stopping criterion (may be in terms of validation examples)

Often the dataset determines the problem dimensions.  Often the training
examples and validation examples come from the same set (e.g. a large matrix
of all examples), but this is not necessarily the case.
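To make the ingredients in the list above concrete, here is a minimal sketch
in plain Python of an accuracy score function and a burn-in early-stopping
loop in the spirit of the `halflife_stopper` configured below.  All names and
signatures here are hypothetical illustrations, not an actual pylearn API,
and the "half-life" stopping test is only a guess at the intended criterion::

```python
def classification_accuracy(function, examples):
    # Fraction of (input, label) pairs that `function` labels correctly.
    pairs = list(examples)
    return sum(function(x) == y for x, y in pairs) / len(pairs)


def halflife_stopper(initial_model, burnin, score_fn, step_fn, max_steps=100):
    # Step the model repeatedly; after `burnin` steps start scoring, keep the
    # best-scoring model seen, and stop early if the score falls below half
    # of the best so far (a guess at what "halflife" means here).
    model, best_score, best_model = initial_model, float('-inf'), initial_model
    for t in range(max_steps):
        model = step_fn(model)
        if t < burnin:
            continue
        score = score_fn(model)
        if score > best_score:
            best_score, best_model = score, model
        elif best_score > 0 and score < best_score / 2:
            break
    return best_model


# Toy usage: the "model" is a scalar threshold; each step nudges it upward.
examples = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
score = lambda m: classification_accuracy(lambda x: int(x > m), examples)
best = halflife_stopper(initial_model=0.0, burnin=1,
                        score_fn=score, step_fn=lambda m: m + 0.05,
                        max_steps=40)
```

In a real configuration, `step_fn` would run a chunk of SGD (e.g. an epoch of
minibatches) and `score_fn` would evaluate on held-out validation examples.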
There are many ways that the training could be configured, but here is one::

    vm.call(halflife_stopper(
        # OD: is n_hidden supposed to be n_classes instead?
        initial_model=random_linear_classifier(
            MNIST.n_inputs, MNIST.n_hidden, r_seed=234432),
        burnin=100,
        score_fn=vm_lambda(('learner_obj',),
            classification_accuracy(
                examples=MNIST.validation_dataset,
                function=as_classifier('learner_obj'))),
        step_fn=vm_lambda(('learner_obj',),
            sgd_step_fn(
                parameters=vm_getattr('learner_obj', 'params'),
                cost_and_updates=classif_nll('learner_obj',
                    example_stream=minibatches(
                        source=MNIST.training_dataset,
                        batchsize=100,
                        loop=True)),
                momentum=0.9,
                anneal_at_iter=50,
                n_iter=100))))  # step_fn goes through lots of examples (e.g. an epoch)

Although I expect this specific code might have to change quite a bit in a
final version, I want to draw attention to a few aspects of it:

- we build a symbolic expression graph that contains the whole program, not
  just the learning algorithm
- the configuration language allows for callable objects (e.g. functions,
  curried functions) to be arguments
- there is a lambda function-constructor (`vm_lambda`) we can use in this
  language
- APIs and protocols are at work in establishing conventions for
  parameter-passing, so that sub-expressions (e.g. datasets, optimization
  algorithms, etc.) can be swapped
- there are no APIs for things which are not passed as arguments (i.e. the
  logic of the whole program is not exposed via some uber-API)


K-fold cross validation of a classifier
---------------------------------------

::

    splits = kfold_cross_validate(
        # OD: What would these parameters mean?
        indexlist=range(1000),
        train=8,
        valid=1,
        test=1)

    trained_models = [
        halflife_early_stopper(
            initial_model=alloc_model('param1', 'param2'),
            burnin=100,
            score_fn=vm_lambda(('learner_obj',),
                classification_error(
                    function=as_classifier('learner_obj'),
                    dataset=MNIST.subset(validation_set))),
            step_fn=vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters=vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.subset(train_set),
                            batchsize=100,
                            loop=True)),
                    n_iter=100)))
        for (train_set, validation_set, test_set) in splits]

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

I want to draw attention to the fact that the call method treats the
expression tree as one big lambda expression, with potentially free variables
that must be assigned - here the 'param1' and 'param2' arguments to
`alloc_model`.  There is no need to have separate compile and run steps like
in Theano, because these functions are expected to be long-running and called
once.


Analyze the results of the K-fold cross validation
--------------------------------------------------

It often happens that a user doesn't know what statistics to compute *before*
running a bunch of learning jobs, but only afterward.  This can be done by
extending the symbolic program, and calling the extended function::

    vm.call(
        [pylearn.min(pylearn_getattr(model, 'weights'))
         for model in trained_models],
        param1=1, param2=2)

If this is run after the previous calls::

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

then it should run very quickly, because the `vm` can cache the return values
of the `trained_models` when param1=1 and param2=2.
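As one possible answer to the OD comment about `kfold_cross_validate`'s
parameters: `train=8, valid=1, test=1` could mean that `indexlist` is cut into
8+1+1 equal blocks, with each of the resulting ten folds rotating which blocks
play the train, validation, and test roles.  A hypothetical sketch of that
reading (the function name and semantics are guesses, not an actual pylearn
API)::

```python
def kfold_cross_validate(indexlist, train, valid, test):
    # Cut `indexlist` into train+valid+test contiguous blocks of equal size
    # (any remainder indices are dropped), then yield one
    # (train, valid, test) split per fold by rotating the blocks.
    k = train + valid + test
    blocksize = len(indexlist) // k
    blocks = [indexlist[i * blocksize:(i + 1) * blocksize] for i in range(k)]
    for fold in range(k):
        rotated = blocks[fold:] + blocks[:fold]
        train_set = sum(rotated[:train], [])
        valid_set = sum(rotated[train:train + valid], [])
        test_set = sum(rotated[train + valid:], [])
        yield train_set, valid_set, test_set


# Usage matching the pseudo-code above: 1000 indices, 8/1/1 split ratio,
# giving 10 folds of 800 train / 100 validation / 100 test indices each.
splits = list(kfold_cross_validate(list(range(1000)), train=8, valid=1, test=1))
```

Under this reading, the list comprehension over `splits` above would train one
model per fold, each seeing a different 800/100/100 partition of the data.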