doc/v2_planning/use_cases.txt @ 1104:5e6d7d9e803a

a comment on the GPU issue for datasets
author Razvan Pascanu <r.pascanu@gmail.com>
date Mon, 13 Sep 2010 20:21:23 -0400


Use Cases (Functional Requirements)
===================================

These use cases exhibit pseudo-code for some of the tasks listed in the
requirements document (requirements.txt).


Evaluate a classifier on MNIST
-------------------------------

The evaluation of a classifier on MNIST requires iterating over examples in some
set (e.g. validation, test) and comparing the model's prediction with the
correct answer.  The score of the classifier is the number of correct
predictions divided by the total number of predictions.
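
As a point of reference, the whole computation is simple enough to write in
plain Python (a sketch only; it assumes `examples` yields `(features, label)`
pairs, and the names are illustrative, not part of any proposed API):

```python
def classification_accuracy(function, examples):
    """Fraction of examples for which `function` predicts the correct label.

    `examples` is any iterable of (features, label) pairs.
    """
    n_correct = 0
    n_total = 0
    for features, label in examples:
        if function(features) == label:
            n_correct += 1
        n_total += 1
    return n_correct / n_total
```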

To perform this calculation, the user should specify:
- the classifier (e.g. a function operating on weights loaded from disk)
- the dataset (e.g. MNIST)
- the subset of examples on which to evaluate (e.g. test set)

For example:

    vm.call(classification_accuracy(
       function = classifier,
       examples = MNIST.validation_iterator))


Beyond describing the fields necessary for the computation, the user types very
little; there is no boilerplate.  The `MNIST.validation_iterator` must
respect a protocol that remains to be worked out.

The `vm.call` is a compilation & execution step, as opposed to the
symbolic-graph building performed by the `classification_accuracy` call.
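
One way to picture this split between graph construction and evaluation is a
deferred-call sketch (entirely hypothetical names; the real `vm` would compile
the graph rather than interpret it):

```python
class Expr:
    """A node in the symbolic expression graph: a function plus its arguments."""
    def __init__(self, fn, kwargs):
        self.fn = fn
        self.kwargs = kwargs

def symbolic(fn):
    """Make calling `fn` build an Expr instead of running the computation."""
    def build(**kwargs):
        return Expr(fn, kwargs)
    return build

class VM:
    """Toy interpreter: `call` walks the graph, evaluating sub-expressions first."""
    def call(self, expr):
        if not isinstance(expr, Expr):
            return expr
        kwargs = {name: self.call(arg) for name, arg in expr.kwargs.items()}
        return expr.fn(**kwargs)
```

In this picture, `classification_accuracy` would be a `symbolic`-wrapped
function: calling it merely records what to compute, and `vm.call` does the
work.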



Train a linear classifier on MNIST
----------------------------------

The training of a linear classifier requires specification of

- problem dimensions (e.g. number of inputs, number of classes)
- parameter initialization method
- regularization
- dataset
- schedule for obtaining training examples (e.g. batch, online, minibatch,
  weighted examples)
- algorithm for adapting parameters (e.g. SGD, Conj. Grad)
- a stopping criterion (may be in terms of validation examples)

Often the dataset determines the problem dimensions.

Often the training examples and validation examples come from the same set (e.g.
a large matrix of all examples) but this is not necessarily the case.

There are many ways that the training could be configured, but here is one:


    vm.call(
        halflife_stopper(
            # n_classes rather than n_hidden: a linear classifier maps inputs directly to classes
            initial_model=random_linear_classifier(MNIST.n_inputs, MNIST.n_classes, r_seed=234432),
            burnin=100,
            score_fn = vm_lambda(('learner_obj',),
                classification_accuracy(
                    examples=MNIST.validation_dataset,
                    function=as_classifier('learner_obj'))),

            step_fn = vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters = vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.training_dataset,
                            batchsize=100,
                            loop=True)),
                    momentum=0.9,
                    anneal_at_iter=50,
                    n_iter=100)))  # step_fn goes through lots of examples (e.g. an epoch)
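
The `minibatches` stream used above could, for an in-memory dataset, be as
simple as the following sketch (the name and signature come from the
pseudo-code; the exact semantics are an assumption):

```python
def minibatches(source, batchsize, loop=False):
    """Yield slices of `source` of length `batchsize` (the last may be shorter).

    With loop=True, cycle over the dataset indefinitely, as a step_fn that
    runs for a fixed number of iterations would need.
    """
    while True:
        for start in range(0, len(source), batchsize):
            yield source[start:start + batchsize]
        if not loop:
            return
```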

Although I expect this specific code might have to change quite a bit in a final
version, I want to draw attention to a few aspects of it:

- we build a symbolic expression graph that contains the whole program, not just
  the learning algorithm

- the configuration language allows for callable objects (e.g. functions,
  curried functions) to be arguments

- there is a lambda function-constructor (vm_lambda) we can use in this language

- APIs and protocols are at work in establishing conventions for
  parameter-passing so that sub-expressions (e.g. datasets, optimization
  algorithms, etc.) can be swapped.

- there are no APIs for things which are not passed as arguments (i.e. the logic
  of the whole program is not exposed via some uber-API).


K-fold cross validation of a classifier
---------------------------------------

    splits = kfold_cross_validate(
        # OD: What would these parameters mean?
        indexlist = range(1000),
        train = 8,
        valid = 1,
        test = 1,
    )

    trained_models = [
        halflife_early_stopper(
            initial_model=alloc_model('param1', 'param2'),
            burnin=100,
            score_fn = vm_lambda(('learner_obj',),
                classification_error(
                    function=as_classifier('learner_obj'),
                    dataset=MNIST.subset(validation_set))),
            step_fn = vm_lambda(('learner_obj',),
                    sgd_step_fn(
                        parameters = vm_getattr('learner_obj', 'params'),
                        cost_and_updates=classif_nll('learner_obj', 
                            example_stream=minibatches(
                                source=MNIST.subset(train_set),
                                batchsize=100,
                                loop=True)),
                        n_iter=100)))
        for (train_set, validation_set, test_set) in splits]

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)
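
The `kfold_cross_validate` parameters are an open question (see the inline
comment), but one plausible reading is that `train`, `valid`, and `test` count
folds: the index list is cut into `train + valid + test` equal folds, and each
split rotates which folds play each role.  Under that assumption:

```python
def kfold_cross_validate(indexlist, train, valid, test):
    """One reading of the parameters: `train`, `valid`, `test` count folds.

    The indices are cut into train + valid + test equal folds; each split
    rotates which folds serve as training, validation, and test data.
    """
    indexlist = list(indexlist)
    n_folds = train + valid + test
    fold_size = len(indexlist) // n_folds
    folds = [indexlist[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    splits = []
    for shift in range(n_folds):
        rotated = folds[shift:] + folds[:shift]
        train_set = [i for fold in rotated[:train] for i in fold]
        valid_set = [i for fold in rotated[train:train + valid] for i in fold]
        test_set = [i for fold in rotated[train + valid:] for i in fold]
        splits.append((train_set, valid_set, test_set))
    return splits
```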

I want to draw attention to the fact that the `call` method treats the expression
tree as one big lambda expression, with potentially free variables that must be
assigned: here the 'param1' and 'param2' arguments to `alloc_model`.  There is
no need for separate compile and run steps as in Theano, because these
functions are expected to be long-running and called only once.


Analyze the results of the K-fold cross validation
--------------------------------------------------

It often happens that a user doesn't know what statistics to compute *before*
running a bunch of learning jobs, but only afterward.  This can be done by
extending the symbolic program and calling the extended function.

    vm.call(
        [pylearn.min(pylearn_getattr(model, 'weights')) for model in trained_models], 
        param1=1, param2=2)

If this is run after the previous calls:

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

Then it should run very quickly, because the `vm` can cache the return values of
the trained_models when param1=1 and param2=2.
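
A sketch of that caching behaviour (hypothetical; a real keying scheme would
need structural equality of sub-expressions, not Python object identity):

```python
class Node:
    """Stand-in for a symbolic sub-expression; counts how often it is evaluated."""
    def __init__(self, fn):
        self.fn = fn
        self.n_evals = 0

    def evaluate(self, bindings):
        self.n_evals += 1          # a real evaluation might be a long training run
        return self.fn(**bindings)

class VM:
    """Caches results per (sub-expression, free-variable bindings) pair."""
    def __init__(self):
        self._cache = {}

    def call(self, node, **bindings):
        key = (id(node), tuple(sorted(bindings.items())))
        if key not in self._cache:
            self._cache[key] = node.evaluate(bindings)
        return self._cache[key]
```

With this scheme, re-running an extended program under the same bindings hits
the cache instead of retraining.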