view doc/v2_planning/use_cases.txt @ 1276:822c7691a759

pca - added pca_whiten_inverse function
author James Bergstra <bergstrj@iro.umontreal.ca>
date Tue, 14 Sep 2010 22:35:51 -0400
parents 0e12ea6ba661
children
line wrap: on
line source


Use Cases (Functional Requirements)
===================================

These use cases exhibit pseudo-code for some of the sorts of tasks listed in the
requirements (requirements.txt)


Evaluate a classifier on MNIST
-------------------------------

The evaluation of a classifier on MNIST requires iterating over examples in some
set (e.g. validation, test) and comparing the model's prediction with the
correct answer.  The score of the classifier is the number of correct
predictions divided by the total number of predictions.

To perform this calculation, the user should specify:
- the classifier (e.g. a function operating on weights loaded from disk)
- the dataset (e.g. MNIST)
- the subset of examples on which to evaluate (e.g. test set)

For example:

    vm.call(classification_accuracy(
       function = classifier,
       examples = MNIST.validation_iterator))


The user types very few things beyond the description of the fields necessary
for the computation, no boilerplate.  The `MNIST.validation_iterator` must
respect a protocol that remains to be worked out.

The `vm.call` is a compilation & execution step, as opposed to the
symbolic-graph building performed by the `classification_accuracy` call.



Train a linear classifier on MNIST
----------------------------------

The training of a linear classifier requires specification of

- problem dimensions (e.g. n. of inputs, n. of classes)
- parameter initialization method
- regularization
- dataset
- schedule for obtaining training examples (e.g. batch, online, minibatch,
  weighted examples)
- algorithm for adapting parameters (e.g. SGD, Conj. Grad)
- a stopping criterion (may be in terms of validation examples)

Often the dataset determines the problem dimensions.

Often the training examples and validation examples come from the same set (e.g.
a large matrix of all examples) but this is not necessarily the case.

There are many ways that the training could be configured, but here is one:

.. code-block:: python

  vm.call(
    halflife_stopper(
        # OD: is n_hidden supposed to be n_classes instead?
        initial_model=random_linear_classifier(MNIST.n_inputs, MNIST.n_hidden, r_seed=234432),
        burnin=100,
        score_fn = vm_lambda(('learner_obj',),
            classification_accuracy(
                examples=MNIST.validation_dataset,
                function=as_classifier('learner_obj'))),

        step_fn = vm_lambda(('learner_obj',),
            sgd_step_fn(
                parameters = vm_getattr('learner_obj', 'params'),
                cost_and_updates=classif_nll('learner_obj', 
                    example_stream=minibatches(
                        source=MNIST.training_dataset,
                        batchsize=100,
                        loop=True)),
                momentum=0.9,
                anneal_at_iter=50,
                n_iter=100)))  #step_fn goes through lots of examples (e.g. an epoch)

Although I expect this specific code might have to change quite a bit in a final
version, I want to draw attention to a few aspects of it:

- we build a symbolic expression graph that contains the whole program, not just
  the learning algorithm

- the configuration language allows for callable objects (e.g. functions,
  curried functions) to be arguments

- there is a lambda function-constructor (vm_lambda) we can use in this language

- APIs and protocols are at work in establishing conventions for
  parameter-passing so that sub-expressions (e.g. datasets, optimization
  algorithms, etc.) can be swapped.

- there are no APIs for things which are not passed as arguments (i.e. the logic
  of the whole program is not exposed via some uber-API).

OD comments: I didn't have time to look closely at the details, but overall I
like the general feel of it. At least I'd expect us to need something like
that to be able to handle the multiple use cases we want to support. I must
say I'm a bit worried though that it could become scary pretty fast to the
newcomer, with 'lambda functions' and 'virtual machines'.
Anyway, one point I would like to comment on is the line that creates the
linear classifier. I hope that, as much as possible, we can avoid the need to
specify dataset dimensions / number of classes in algorithm constructors. I
regularly had issues in PLearn with the fact we had for instance to give the
number of inputs when creating a neural network. I much prefer when this kind
of thing can be figured out at runtime:

- Any parameter you can get rid of is a significant gain in
  user-friendliness.
- It's not always easy to know in advance e.g. the dimension of your input
  dataset. Imagine for instance this dataset is obtained in a first step
  by going through a PCA whose number of output dimensions is set so as to
  keep 90% of the variance.
- It seems to me it fits better the idea of a symbolic graph: my intuition
  (that may be very different from what you actually have in mind) is to
  see an experiment as a symbolic graph, which you instantiate when you
  provide the input data. One advantage of this point of view is it makes
  it natural to re-use the same block components on various datasets /
  splits, something we often want to do.

K-fold cross validation of a classifier
---------------------------------------

.. code-block:: python

    splits = kfold_cross_validate(
        # OD: What would these parameters mean?
        indexlist = range(1000)
        train = 8,
        valid = 1,
        test = 1,
    )

    trained_models = [
        halflife_early_stopper(
            initial_model=alloc_model('param1', 'param2'),
            burnin=100,
            score_fn = vm_lambda(('learner_obj',),
                classification_error(
                    function=as_classifier('learner_obj'),
                    dataset=MNIST.subset(validation_set))),
            step_fn = vm_lambda(('learner_obj',),
                    sgd_step_fn(
                        parameters = vm_getattr('learner_obj', 'params'),
                        cost_and_updates=classif_nll('learner_obj', 
                            example_stream=minibatches(
                                source=MNIST.subset(train_set),
                                batchsize=100,
                                loop=True)),
                        n_iter=100)))
        for (train_set, validation_set, test_set) in splits]

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

I want to  draw attention to the fact that the call method treats the expression
tree as one big lambda expression, with potentially free variables that must be
assigned - here the 'param1' and 'param2' arguments to `alloc_model`.  There is
no need to have separate compile and run steps like in Theano because these
functions are expected to be long-running, and called once.


Analyze the results of the K-fold cross validation
--------------------------------------------------

It often happens that a user doesn't know what statistics to compute *before*
running a bunch of learning jobs, but only afterward.  This can be done by
extending the symbolic program, and calling the extended function.

    vm.call(
        [pylearn.min(pylearn_getattr(model, 'weights')) for model in trained_models], 
        param1=1, param2=2)

If this is run after the previous calls:

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

Then it should run very quickly, because the `vm` can cache the return values of
the trained_models when param1=1 and param2=2.