changeset 1093:a65598681620

v2planning - initial commit of use_cases, requirements
author James Bergstra <bergstrj@iro.umontreal.ca>
date Sun, 12 Sep 2010 21:45:22 -0400
parents aab9c261361c
children 520fcaa45692
files doc/v2_planning/requirements.txt doc/v2_planning/use_cases.txt
diffstat 2 files changed, 256 insertions(+), 0 deletions(-)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/requirements.txt	Sun Sep 12 21:45:22 2010 -0400
@@ -0,0 +1,99 @@
+============
+Requirements
+============
+
+
+Application Requirements
+========================
+
+Terminology and Abbreviations:
+------------------------------
+
+MLA - machine learning algorithm
+
+learning problem - a machine learning application typically characterized by a
+dataset (possibly with dataset folds), one or more functions to be learned from
+the data, and one or more metrics to evaluate those functions.  Learning
+problems are the benchmarks for empirical model comparison.
+
+n. of - number of
+
+SGD - stochastic gradient descent
+
+Users:
+------
+
+- New master's and PhD students in the lab should be able to quickly move into
+  'production' mode without having to reinvent the wheel.
+
+- Students in the two ML classes, who should be able to play with the library
+  to explore new ML variants. This means some APIs (e.g. the Experiment level)
+  must be really well documented and conceptually simple.
+
+- Researchers outside the lab (who might study and experiment with our
+  algorithms)
+
+- Partners outside the lab (e.g. Bell, Ubisoft) with closed-source commercial
+  projects.
+
+Uses:
+-----
+
+R1. reproduce previous work (our own and others')
+
+R2. explore MLA variants by swapping components (e.g. optimization algorithm,
+  dataset, hyper-parameters).
+
+R3. analyze experimental results (e.g. plotting training curves, finding best
+  models, marginalizing across hyper-parameter choices)
+
+R4. disseminate (or serve as platform for disseminating) our own published algorithms
+
+R5. provide implementations of common MLA components (e.g. classifiers, datasets,
+  optimization algorithms, meta-learning algorithms)
+
+R6. drive large-scale parallelizable computations (e.g. grid search, bagging,
+  random search)
+
+R7. provide implementations of standard pre-processing algorithms (e.g. PCA,
+  stemming, Mel-scale spectrograms, GIST features, etc.)
+
+R8. provide high performance suitable for large-scale experiments
+
+R9. be able to use the most efficient algorithms in special case combinations of
+  learning algorithm components (e.g. when there is a fast k-fold validation
+  algorithm for a particular model family, the library should not require users
+  to rewrite their standard k-fold validation script to use it)
+
+R10. support experiments on a variety of datasets (e.g. movies, images, text,
+    sound, reinforcement learning?)
+
+R11. support efficient computations on datasets larger than RAM and GPU memory
+
+R12. support infinite datasets (i.e. generated on the fly)
+
+
+
+Basic Design Approach
+=====================
+
+An ability to drive parallel computations is essential in addressing [R6,R8].
+
+The basic design approach for the library is to implement:
+
+- a few virtual machines (VMs), some of which can run programs that can be
+  parallelized across processors, hosts, and networks;
+- MLAs in a Symbolic Expression language (similar to Theano), as required by
+  [R5,R7,R8].
+
+MLAs are typically specified as symbolic programs that are compiled to VM
+instructions, but some MLAs may be implemented directly in those instructions.
+Symbolic programs are naturally modularized by sub-expressions [R2] and can be
+optimized automatically (as in Theano) to address [R9].
+
+A VM that caches instruction return values serves as:
+
+- a reliable record of what jobs were run [R1];
+- a database of intermediate results that can be analyzed after the
+  model-training jobs have completed [R3];
+- a clean API to several possible storage and execution backends.
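+
+A minimal sketch of this idea, using hypothetical names (Expr, CachingVM) that
+are not part of any existing API; a real implementation would need a canonical,
+persistent cache key rather than object identity:
+
+    # Hypothetical sketch only: a VM that memoizes instruction return values.
+    class Expr(object):
+        """A symbolic graph node: an operation plus its arguments."""
+        def __init__(self, fn, *args):
+            self.fn, self.args = fn, args
+
+    class CachingVM(object):
+        def __init__(self):
+            # The cache doubles as a record of which jobs were run [R1] and as
+            # a store of intermediate results for later analysis [R3]; its
+            # storage could be swapped for different backends.
+            self.cache = {}
+
+        def call(self, expr):
+            if not isinstance(expr, Expr):
+                return expr                  # literal arguments pass through
+            key = id(expr)                   # placeholder key for this sketch
+            if key not in self.cache:
+                args = [self.call(a) for a in expr.args]
+                self.cache[key] = expr.fn(*args)
+            return self.cache[key]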
+
+
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/use_cases.txt	Sun Sep 12 21:45:22 2010 -0400
@@ -0,0 +1,157 @@
+
+Use Cases (Functional Requirements)
+===================================
+
+These use cases give pseudo-code for some of the tasks listed in the
+requirements (requirements.txt).
+
+
+Evaluate a classifier on MNIST
+-------------------------------
+
+The evaluation of a classifier on MNIST requires iterating over examples in some
+set (e.g. validation, test) and comparing the model's prediction with the
+correct answer.  The score of the classifier is the number of correct
+predictions divided by the total number of predictions.
+
+To perform this calculation, the user should specify:
+
+- the classifier (e.g. a function operating on weights loaded from disk)
+- the dataset (e.g. MNIST)
+- the subset of examples on which to evaluate (e.g. test set)
+
+For example:
+
+    vm.call(classification_accuracy(
+       function = classifier,
+       examples = MNIST.validation_iterator))
+
+
+The user types very little beyond a description of the fields necessary for the
+computation; there is no boilerplate.  The `MNIST.validation_iterator` must
+respect a protocol that remains to be worked out; one possibility is sketched
+below.
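+
+Purely as a strawman (none of these names or signatures are settled), such a
+protocol might require no more than iteration over (input, target) pairs plus a
+length, which is all an eager accuracy computation would need:
+
+    # Hypothetical strawman protocol for a validation-set iterator.
+    class ValidationIterator(object):
+        """Iterates over (input, target) pairs; len() gives the n. of examples."""
+        def __init__(self, inputs, targets):
+            self.inputs, self.targets = inputs, targets
+        def __len__(self):
+            return len(self.inputs)
+        def __iter__(self):
+            return iter(zip(self.inputs, self.targets))
+
+    # An eager (non-symbolic) accuracy computation needs nothing more than this:
+    def eager_classification_accuracy(function, examples):
+        n_correct = sum(1 for x, y in examples if function(x) == y)
+        return n_correct / float(len(examples))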
+
+The `vm.call` invocation is a compilation and execution step, as opposed to the
+symbolic graph-building performed by the `classification_accuracy` call.
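+
+A toy, self-contained illustration of that split (hypothetical names again):
+calling `classification_accuracy` only records what to compute, and the work
+happens inside `vm.call`:
+
+    # Hypothetical sketch: graph-building is cheap; vm.call does the work.
+    class Expr(object):
+        def __init__(self, fn, **kwargs):
+            self.fn, self.kwargs = fn, kwargs
+
+    def classification_accuracy(function, examples):
+        # Build a node describing the computation; nothing is evaluated here.
+        return Expr(compute_accuracy, function=function, examples=examples)
+
+    def compute_accuracy(function, examples):
+        n_correct = sum(1 for x, y in examples if function(x) == y)
+        return n_correct / float(len(examples))
+
+    class VM(object):
+        def call(self, expr):
+            # Compile/execute step: in this sketch, just apply the operation.
+            return expr.fn(**expr.kwargs)
+
+    vm = VM()
+    score = vm.call(classification_accuracy(
+        function=lambda x: int(x > 0),              # toy stand-in classifier
+        examples=[(-1.0, 0), (0.5, 1), (2.0, 1)]))  # toy stand-in dataset
+    assert score == 1.0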
+
+
+
+Train a linear classifier on MNIST
+----------------------------------
+
+The training of a linear classifier requires specification of:
+
+- problem dimensions (e.g. n. of inputs, n. of classes)
+- parameter initialization method
+- regularization
+- dataset
+- schedule for obtaining training examples (e.g. batch, online, minibatch,
+  weighted examples)
+- algorithm for adapting parameters (e.g. SGD, Conj. Grad)
+- a stopping criterion (may be in terms of validation examples)
+
+Often the dataset determines the problem dimensions.
+
+Often the training examples and validation examples come from the same set (e.g.
+a large matrix of all examples) but this is not necessarily the case.
+
+There are many ways that the training could be configured, but here is one:
+
+
+vm.call(
+    halflife_stopper(
+        initial_model=random_linear_classifier(MNIST.n_inputs, MNIST.n_classes, r_seed=234432),
+        burnin=100,
+        score_fn = vm_lambda(('learner_obj',),
+            classification_accuracy(
+                examples=MNIST.validation_dataset,
+                function=as_classifier('learner_obj'))),
+        step_fn = vm_lambda(('learner_obj',),
+            sgd_step_fn(
+                parameters = vm_getattr('learner_obj', 'params'),
+                cost_and_updates=classif_nll('learner_obj', 
+                    example_stream=minibatches(
+                        source=MNIST.training_dataset,
+                        batchsize=100,
+                        loop=True)),
+                momentum=0.9,
+                anneal_at_iter=50,
+                n_iter=100)))  #step_fn goes through lots of examples (e.g. an epoch)
+
+Although I expect this specific code might have to change quite a bit in a final
+version, I want to draw attention to a few aspects of it:
+
+- we build a symbolic expression graph that contains the whole program, not just
+  the learning algorithm
+
+- the configuration language allows for callable objects (e.g. functions,
+  curried functions) to be arguments
+
+- there is a lambda function-constructor (vm_lambda) we can use in this
+  language (a possible reading of it is sketched after this list)
+
+- APIs and protocols are at work in establishing conventions for
+  parameter-passing so that sub-expressions (e.g. datasets, optimization
+  algorithms, etc.) can be swapped.
+
+- there are no APIs for things which are not passed as arguments (i.e. the logic
+  of the whole program is not exposed via some uber-API).
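+
+As a very rough, hypothetical reading of vm_lambda and vm_getattr (not a
+committed design): a lambda node pairs formal parameter names with a body
+expression in which those names appear as string placeholders, and the VM
+substitutes actual values at application time.
+
+    # Hypothetical sketch only; these constructs remain to be designed.
+    class Expr(object):
+        def __init__(self, fn, **kwargs):
+            self.fn, self.kwargs = fn, kwargs
+
+    class VMLambda(object):
+        """Formal parameter names plus a body expression that refers to those
+        names as string placeholders such as 'learner_obj'."""
+        def __init__(self, param_names, body):
+            self.param_names, self.body = param_names, body
+
+    def vm_lambda(param_names, body):
+        return VMLambda(param_names, body)
+
+    def vm_getattr(obj_expr, attr):
+        # Deferred attribute access, resolved by the VM at evaluation time.
+        return Expr(lambda obj: getattr(obj, attr), obj=obj_expr)
+
+    class VM(object):
+        def apply(self, lam, **actuals):
+            """Evaluate lam.body with each formal name bound to a value."""
+            return self.call(lam.body, actuals)
+
+        def call(self, expr, bindings):
+            if isinstance(expr, str) and expr in bindings:
+                return bindings[expr]        # substitute a formal parameter
+            if not isinstance(expr, Expr):
+                return expr
+            kwargs = dict((k, self.call(v, bindings))
+                          for k, v in expr.kwargs.items())
+            return expr.fn(**kwargs)
+
+    # e.g. VM().apply(vm_lambda(('learner_obj',),
+    #                           vm_getattr('learner_obj', 'params')),
+    #                 learner_obj=some_model) would return some_model.params.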
+
+
+K-fold cross validation of a classifier
+---------------------------------------
+
+    splits = kfold_cross_validate(
+        indexlist = range(1000),
+        train = 8,
+        valid = 1,
+        test = 1,
+    )
+
+    trained_models = [
+        halflife_early_stopper(
+            initial_model=alloc_model('param1', 'param2'),
+            burnin=100,
+            score_fn = vm_lambda(('learner_obj',),
+                graph=classification_error(
+                    function=as_classifier('learner_obj'),
+                    dataset=MNIST.subset(validation_set))),
+            step_fn = vm_lambda(('learner_obj',),
+                    sgd_step_fn(
+                        parameters = vm_getattr('learner_obj', 'params'),
+                        cost_and_updates=classif_nll('learner_obj', 
+                            example_stream=minibatches(
+                                source=MNIST.subset(train_set),
+                                batchsize=100,
+                                loop=True)),
+                        n_iter=100)))
+        for (train_set, validation_set, test_set) in splits]
+
+    vm.call(trained_models, param1=1, param2=2)
+    vm.call(trained_models, param1=3, param2=4)
+
+I want to draw attention to the fact that the call method treats the expression
+tree as one big lambda expression, with potentially free variables that must be
+assigned - here the 'param1' and 'param2' arguments to `alloc_model`.  There is
+no need for separate compile and run steps as in Theano, because these
+functions are expected to be long-running and called only once.
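+
+A toy sketch of that idea (hypothetical, and eager where the real thing would
+stay symbolic): the expression tree carries named free variables, and each
+`vm.call` supplies bindings for them before evaluation.
+
+    # Hypothetical sketch: free variables in the tree, bound at call time.
+    class Free(object):
+        """A named free variable, e.g. Free('param1')."""
+        def __init__(self, name):
+            self.name = name
+
+    class Expr(object):
+        def __init__(self, fn, *args):
+            self.fn, self.args = fn, args
+
+    class VM(object):
+        def call(self, expr, **bindings):
+            if isinstance(expr, Free):
+                return bindings[expr.name]   # the caller must supply a value
+            if isinstance(expr, list):
+                return [self.call(e, **bindings) for e in expr]
+            if isinstance(expr, Expr):
+                args = [self.call(a, **bindings) for a in expr.args]
+                return expr.fn(*args)
+            return expr
+
+    # alloc_model(Free('param1'), Free('param2')) would stay unevaluated until
+    # a call such as vm.call(trained_models, param1=1, param2=2) provides the
+    # missing values.
+    vm = VM()
+    expr = Expr(lambda a, b: a + b, Free('param1'), Free('param2'))
+    assert vm.call([expr, expr], param1=1, param2=2) == [3, 3]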
+
+
+Analyze the results of the K-fold cross validation
+--------------------------------------------------
+
+It often happens that a user doesn't know what statistics to compute *before*
+running a bunch of learning jobs, but only afterward.  This can be done by
+extending the symbolic program and calling the extended function.
+
+    vm.call(
+        [pylearn.min(model.weights) for model in trained_models], 
+        param1=1, param2=2)
+
+If this is run after the previous calls:
+
+    vm.call(trained_models, param1=1, param2=2)
+    vm.call(trained_models, param1=3, param2=4)
+
+Then it should run very quickly, because the `vm` can reuse the cached return
+values of the `trained_models` computed when param1=1 and param2=2.
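+
+Continuing the free-variable sketch above (still hypothetical), a small cache
+keyed on expression identity and bindings is enough to make re-evaluation of an
+extended program cheap when the bindings are unchanged:
+
+    # Hypothetical: memoize on (expression identity, bindings), so sub-results
+    # computed for param1=1, param2=2 are reused when the program is extended.
+    class CachingVM(VM):
+        def __init__(self):
+            self.cache = {}
+        def call(self, expr, **bindings):
+            key = (id(expr), tuple(sorted(bindings.items())))
+            if key not in self.cache:
+                self.cache[key] = VM.call(self, expr, **bindings)
+            return self.cache[key]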
+
+