Use Cases (Functional Requirements)
===================================

These use cases exhibit pseudo-code for some of the sorts of tasks listed in the
requirements (requirements.txt).


Evaluate a classifier on MNIST
------------------------------

The evaluation of a classifier on MNIST requires iterating over examples in some
set (e.g. validation, test) and comparing the model's prediction with the
correct answer. The score of the classifier is the number of correct
predictions divided by the total number of predictions.

To perform this calculation, the user should specify:
- the classifier (e.g. a function operating on weights loaded from disk)
- the dataset (e.g. MNIST)
- the subset of examples on which to evaluate (e.g. test set)

For example:

    vm.call(classification_accuracy(
        function=classifier,
        examples=MNIST.validation_iterator))


Beyond describing the fields the computation needs, the user types very little;
there is no boilerplate. The `MNIST.validation_iterator` must respect a
protocol that remains to be worked out.
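
For concreteness, here is an eager (non-symbolic) sketch of what such an
iterator protocol might support. The convention that the iterator yields
(input, label) pairs, and every name below, are assumptions rather than a
committed design:

    # Hypothetical sketch only: assume the iterator yields (input, label) pairs
    # and that `function` maps an input to a predicted label.
    def classification_accuracy_eager(function, examples):
        n_correct = 0
        n_total = 0
        for x, y in examples:                 # the assumed iterator protocol
            n_correct += int(function(x) == y)
            n_total += 1
        return n_correct / float(n_total)     # correct predictions / total

    # Toy usage with three hand-made examples:
    toy_examples = [([0.0, 1.0], 1), ([1.0, 0.0], 0), ([0.9, 0.1], 1)]
    toy_classifier = lambda x: int(x[1] > x[0])
    print(classification_accuracy_eager(toy_classifier, toy_examples))  # -> 2/3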

The `vm.call` is a compilation & execution step, as opposed to the
symbolic-graph building performed by the `classification_accuracy` call.
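
To illustrate that distinction, here is one way the two steps could be
separated in plain Python. Everything below (Expr, VM, the eager evaluation
inside `call`) is a made-up stand-in for whatever the real compilation
machinery turns out to be:

    # Hypothetical sketch of the graph-building / call split; Expr and VM are
    # invented for illustration, not a proposed API.
    class Expr(object):
        """A deferred computation: a callable plus recorded keyword arguments."""
        def __init__(self, fn, kwargs):
            self.fn = fn
            self.kwargs = kwargs

    def classification_accuracy(**kwargs):
        # Calling this only builds a graph node; no evaluation happens yet.
        def compute(function, examples):
            hits = [int(function(x) == y) for (x, y) in examples]
            return sum(hits) / float(len(hits))
        return Expr(compute, kwargs)

    class VM(object):
        def call(self, expr):
            # A real vm would compile/optimize here; this sketch just evaluates.
            return expr.fn(**expr.kwargs)

    vm = VM()
    acc = vm.call(classification_accuracy(
        function=lambda x: int(x[1] > x[0]),
        examples=[([0.0, 1.0], 1), ([1.0, 0.0], 0)]))   # -> 1.0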


Train a linear classifier on MNIST
----------------------------------

Training a linear classifier requires the specification of:

- problem dimensions (e.g. number of inputs, number of classes)
- parameter initialization method
- regularization
- dataset
- schedule for obtaining training examples (e.g. batch, online, minibatch,
  weighted examples); see the sketch after this list
- algorithm for adapting parameters (e.g. SGD, conjugate gradient)
- a stopping criterion (may be in terms of validation examples)
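
Purely as an illustration of the minibatch schedule mentioned above, something
like the following generator would do. The name `minibatches` reappears in the
example below, but its actual semantics are still open:

    # Hypothetical sketch of a minibatch schedule: cycle through the source,
    # yielding lists of `batchsize` (x, y) pairs, looping forever when loop=True.
    def minibatches(source, batchsize, loop=False):
        while True:
            for start in range(0, len(source), batchsize):
                yield source[start:start + batchsize]
            if not loop:
                break

    # Toy usage: six examples in batches of two gives three batches.
    toy_source = [([float(i)], i % 2) for i in range(6)]
    batches = list(minibatches(toy_source, batchsize=2))
    print(len(batches))   # -> 3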

Often the dataset determines the problem dimensions.

Often the training examples and validation examples come from the same set
(e.g. a large matrix of all examples), but this is not necessarily the case.

There are many ways that the training could be configured, but here is one:

    vm.call(
        halflife_stopper(
            initial_model=random_linear_classifier(MNIST.n_inputs, MNIST.n_classes, r_seed=234432),
            burnin=100,
            score_fn=vm_lambda(('learner_obj',),
                classification_accuracy(
                    examples=MNIST.validation_dataset,
                    function=as_classifier('learner_obj'))),
            step_fn=vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters=vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.training_dataset,
                            batchsize=100,
                            loop=True)),
                    momentum=0.9,
                    anneal_at_iter=50,
                    n_iter=100))))  # step_fn goes through lots of examples (e.g. an epoch)

Although I expect this specific code might have to change quite a bit in a
final version, I want to draw attention to a few aspects of it:

- we build a symbolic expression graph that contains the whole program, not
  just the learning algorithm

- the configuration language allows for callable objects (e.g. functions,
  curried functions) to be arguments

- there is a lambda function-constructor (vm_lambda) we can use in this
  language

- APIs and protocols are at work in establishing conventions for
  parameter-passing so that sub-expressions (e.g. datasets, optimization
  algorithms, etc.) can be swapped.

- there are no APIs for things which are not passed as arguments (i.e. the
  logic of the whole program is not exposed via some uber-API).
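
On the question of how the score_fn / step_fn lambdas get used, one rough
reading of a driver such as `halflife_stopper` is sketched below. The stopping
rule, the one-argument learner interface, and all names are assumptions on my
part, not a settled design:

    # Hypothetical sketch: alternate step_fn and score_fn, keep the best-scoring
    # model, and stop once no improvement has been seen for as long again as it
    # took to find the best score (after a burn-in period).
    def halflife_stopper(initial_model, burnin, score_fn, step_fn):
        model = initial_model
        best_model, best_score, best_iter = model, score_fn(model), 0
        i = 0
        while i < burnin or i < 2 * best_iter:
            model = step_fn(model)        # e.g. one epoch of SGD over minibatches
            score = score_fn(model)       # e.g. accuracy on validation examples
            if score > best_score:
                best_model, best_score, best_iter = model, score, i
            i += 1
        return best_model, best_score

    # Toy usage: the "model" is a single number nudged toward 5.0 by step_fn.
    model, score = halflife_stopper(
        initial_model=0.0, burnin=10,
        score_fn=lambda m: -abs(5.0 - m),
        step_fn=lambda m: m + 0.5)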


K-fold cross validation of a classifier
---------------------------------------

    splits = kfold_cross_validate(
        indexlist=range(1000),
        train=8,
        valid=1,
        test=1,
        )

    trained_models = [
        halflife_early_stopper(
            initial_model=alloc_model('param1', 'param2'),
            burnin=100,
            score_fn=vm_lambda(('learner_obj',),
                graph=classification_error(
                    function=as_classifier('learner_obj'),
                    dataset=MNIST.subset(validation_set))),
            step_fn=vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters=vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.subset(train_set),
                            batchsize=100,
                            loop=True)),
                    n_iter=100)))
        for (train_set, validation_set, test_set) in splits]

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

I want to draw attention to the fact that the call method treats the expression
tree as one big lambda expression, with potentially free variables that must be
assigned - here the 'param1' and 'param2' arguments to `alloc_model`. There is
no need for separate compile and run steps as in Theano, because these
functions are expected to be long-running and called once.
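
A rough picture of that calling convention, with every name below (Free,
evaluate, the tuple-based expression nodes) invented purely for illustration:

    # Hypothetical sketch of call-time binding of free variables.
    class Free(object):
        def __init__(self, name):
            self.name = name

    def evaluate(expr, bindings):
        if isinstance(expr, Free):
            return bindings[expr.name]        # substitute the free variable
        if isinstance(expr, tuple):           # a (function, arg, arg, ...) node
            fn = expr[0]
            return fn(*[evaluate(arg, bindings) for arg in expr[1:]])
        return expr                           # a literal leaf

    # A toy expression tree with two free variables, bound only at call time,
    # in the spirit of vm.call(trained_models, param1=..., param2=...):
    expr = (lambda a, b: a + b, Free('param1'), Free('param2'))
    print(evaluate(expr, dict(param1=1, param2=2)))   # -> 3
    print(evaluate(expr, dict(param1=3, param2=4)))   # -> 7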


Analyze the results of the K-fold cross validation
--------------------------------------------------

It often happens that a user doesn't know what statistics to compute *before*
running a bunch of learning jobs, but only afterward. This can be done by
extending the symbolic program, and calling the extended function.

    vm.call(
        [pylearn.min(model.weights) for model in trained_models],
        param1=1, param2=2)

If this is run after the previous calls:

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

Then it should run very quickly, because the `vm` can cache the return values
of the trained_models when param1=1 and param2=2.
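
Caching of that kind could amount to little more than memoizing on the
expression plus its bindings. A sketch under that assumption (the key scheme
and every name below are invented; a real vm would presumably cache
sub-expressions, so that the extended program reuses the already-trained
models, whereas this sketch only caches whole calls to keep the idea visible):

    # Hypothetical sketch: memoize call results on (expression, bindings).
    class CachingVM(object):
        def __init__(self):
            self._cache = {}

        def call(self, expr_fn, **bindings):
            key = (id(expr_fn), tuple(sorted(bindings.items())))
            if key not in self._cache:
                self._cache[key] = expr_fn(**bindings)   # the long-running work
            return self._cache[key]

    # Toy usage: the second call with param1=1, param2=2 is a cache hit.
    vm = CachingVM()
    train = lambda param1, param2: ('trained model', param1, param2)
    vm.call(train, param1=1, param2=2)
    vm.call(train, param1=1, param2=2)   # returns immediately from the cache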