# HG changeset patch
# User James Bergstra
# Date 1284342322 14400
# Node ID a6559868162055883e0fdf524ce9ea8ca17b7c58
# Parent aab9c261361c42ee791342d1dfb6b124ca16e975
v2planning - initial commit of use_cases, requirements

diff -r aab9c261361c -r a65598681620 doc/v2_planning/requirements.txt
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/requirements.txt  Sun Sep 12 21:45:22 2010 -0400

============
Requirements
============


Application Requirements
========================

Terminology and Abbreviations:
------------------------------

MLA - machine learning algorithm

learning problem - a machine learning application, typically characterized by a
dataset (possibly dataset folds), one or more functions to be learned from the
data, and one or more metrics to evaluate those functions.  Learning problems
are the benchmarks for empirical model comparison.

n. of - number of

SGD - stochastic gradient descent

Users:
------

- New masters and PhD students in the lab should be able to move quickly into
  'production' mode without having to reinvent the wheel.

- Students in the two ML classes should be able to play with the library to
  explore new ML variants.  This means that some APIs (e.g. the Experiment
  level) must be very well documented and conceptually simple.

- Researchers outside the lab (who might study and experiment with our
  algorithms).

- Partners outside the lab (e.g. Bell, Ubisoft) with closed-source commercial
  projects.

Uses:
-----

R1. reproduce previous work (our own and others')

R2. explore MLA variants by swapping components (e.g. optimization algorithm,
    dataset, hyper-parameters)

R3. analyze experimental results (e.g. plotting training curves, finding the
    best models, marginalizing across hyper-parameter choices)

R4. disseminate (or serve as a platform for disseminating) our own published
    algorithms

R5. provide implementations of common MLA components (e.g. classifiers,
    datasets, optimization algorithms, meta-learning algorithms)

R6. drive large-scale parallelizable computations (e.g. grid search, bagging,
    random search)

R7. provide implementations of standard pre-processing algorithms (e.g. PCA,
    stemming, Mel-scale spectrograms, GIST features)

R8. provide high performance suitable for large-scale experiments

R9. use the most efficient algorithms in special-case combinations of learning
    algorithm components (e.g. when there is a fast k-fold validation algorithm
    for a particular model family, the library should not require users to
    rewrite their standard k-fold validation script in order to use it)

R10. support experiments on a variety of datasets (e.g. movies, images, text,
     sound, reinforcement learning?)

R11. support efficient computations on datasets larger than RAM and GPU memory

R12. support infinite datasets (i.e. generated on the fly)



Basic Design Approach
=====================

The ability to drive parallel computations is essential in addressing [R6, R8].

The basic design approach for the library is to implement

- a few virtual machines (VMs), some of which can run programs that can be
  parallelized across processors, hosts, and networks;

- MLAs in a symbolic expression language (similar to Theano), as required by
  [R5, R7, R8].

MLAs are typically specified as symbolic programs that are compiled to VM
instructions, but some MLAs may be implemented directly in those instructions.
Symbolic programs are naturally modularized by sub-expressions [R2] and can be
optimized automatically (as in Theano) to address [R9].

A VM that caches instruction return values serves as

- a reliable record of what jobs were run [R1];

- a database of intermediate results that can be analyzed after the
  model-training jobs have completed [R3];

- a clean API to several possible storage and execution backends.
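
To make the 'compiled to VM instructions' idea concrete, here is a toy sketch
of one possible shape for it.  It is not part of the proposed API - the names
`Apply`, `compile_program`, and `VM` are invented for illustration only: an
expression graph is linearized into a flat instruction list, and the VM
executes that list, which is also the natural hook for adding caching or for
distributing independent instructions across hosts.

    # Hypothetical sketch, not part of this patch.  An Apply node records a
    # function and its argument nodes; "compilation" is a topological sort
    # into a flat instruction list; the VM just executes that list.

    class Apply:
        def __init__(self, fn, *inputs):
            self.fn = fn
            self.inputs = inputs          # other Apply nodes or literal values


    def compile_program(output):
        """Linearize the graph below `output` into an instruction list."""
        order, seen = [], set()

        def visit(node):
            if isinstance(node, Apply) and id(node) not in seen:
                seen.add(id(node))
                for i in node.inputs:
                    visit(i)
                order.append(node)

        visit(output)
        return order


    class VM:
        """Executes compiled instructions one by one; the per-node results
        dict is where caching or remote execution could be plugged in."""

        def call(self, output):
            results = {}
            for node in compile_program(output):
                args = [results[id(i)] if isinstance(i, Apply) else i
                        for i in node.inputs]
                results[id(node)] = node.fn(*args)
            return results[id(output)]


    if __name__ == '__main__':
        import operator
        a = Apply(operator.add, 1, 2)       # 1 + 2
        b = Apply(operator.mul, a, 10)      # (1 + 2) * 10
        print(VM().call(b))                 # -> 30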

diff -r aab9c261361c -r a65598681620 doc/v2_planning/use_cases.txt
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/use_cases.txt     Sun Sep 12 21:45:22 2010 -0400

Use Cases (Functional Requirements)
===================================

These use cases give pseudo-code for some of the kinds of tasks listed in the
requirements (requirements.txt).


Evaluate a classifier on MNIST
------------------------------

Evaluating a classifier on MNIST requires iterating over the examples in some
set (e.g. validation, test) and comparing the model's prediction with the
correct answer.  The score of the classifier is the number of correct
predictions divided by the total number of predictions.

To perform this calculation, the user should specify:

- the classifier (e.g. a function operating on weights loaded from disk)
- the dataset (e.g. MNIST)
- the subset of examples on which to evaluate (e.g. the test set)

For example:

    vm.call(classification_accuracy(
        function=classifier,
        examples=MNIST.validation_iterator))

The user types very little beyond the description of the fields necessary for
the computation - no boilerplate.  The `MNIST.validation_iterator` must respect
a protocol that remains to be worked out.

The `vm.call` is a compilation & execution step, as opposed to the
symbolic-graph building performed by the `classification_accuracy` call.



Train a linear classifier on MNIST
----------------------------------

Training a linear classifier requires the specification of

- problem dimensions (e.g. n. of inputs, n. of classes)
- a parameter initialization method
- regularization
- a dataset
- a schedule for obtaining training examples (e.g. batch, online, minibatch,
  weighted examples)
- an algorithm for adapting parameters (e.g. SGD, conjugate gradient)
- a stopping criterion (possibly expressed in terms of validation examples)

Often the dataset determines the problem dimensions.

Often the training examples and validation examples come from the same set
(e.g. one large matrix of all examples), but this is not necessarily the case.

There are many ways that the training could be configured, but here is one:

    vm.call(
        halflife_stopper(
            initial_model=random_linear_classifier(MNIST.n_inputs,
                                                   MNIST.n_classes,
                                                   r_seed=234432),
            burnin=100,
            score_fn=vm_lambda(('learner_obj',),
                classification_accuracy(
                    examples=MNIST.validation_dataset,
                    function=as_classifier('learner_obj'))),
            step_fn=vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters=vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.training_dataset,
                            batchsize=100,
                            loop=True)),
                    momentum=0.9,
                    anneal_at_iter=50,
                    n_iter=100))))  # step_fn goes through many examples (e.g. an epoch)

Although I expect this specific code will have to change quite a bit in a final
version, I want to draw attention to a few aspects of it:

- we build a symbolic expression graph that contains the whole program, not
  just the learning algorithm

- the configuration language allows callable objects (e.g. functions, curried
  functions) to be arguments

- there is a lambda function-constructor (`vm_lambda`) we can use in this
  language

- APIs and protocols are at work in establishing conventions for
  parameter-passing, so that sub-expressions (e.g. datasets, optimization
  algorithms, etc.) can be swapped

- there are no APIs for things which are not passed as arguments (i.e. the
  logic of the whole program is not exposed via some uber-API).
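
For concreteness, here is a toy sketch of how a `vm_lambda` and its free
variables could be represented.  This is an illustration only, not the proposed
implementation: `Symbol`, `Lambda`, and `substitute` are invented names, and
the tuple/dict structure merely stands in for real graph nodes.

    # Hypothetical sketch: a Lambda node holds parameter names and a body
    # expression; free occurrences of a parameter appear as Symbol nodes, and
    # applying the lambda substitutes concrete values for them.

    class Symbol:
        """A named free variable inside a symbolic expression."""
        def __init__(self, name):
            self.name = name


    class Lambda:
        """Result of vm_lambda(param_names, body): a first-class sub-expression."""
        def __init__(self, param_names, body):
            self.param_names = param_names
            self.body = body

        def apply(self, **bindings):
            return substitute(self.body, bindings)


    def substitute(expr, bindings):
        """Recursively replace Symbol nodes with bound values."""
        if isinstance(expr, Symbol):
            return bindings[expr.name]
        if isinstance(expr, tuple):
            return tuple(substitute(e, bindings) for e in expr)
        if isinstance(expr, dict):
            return {k: substitute(v, bindings) for k, v in expr.items()}
        return expr   # literals and already-built objects pass through


    def vm_lambda(param_names, body):
        return Lambda(param_names, body)


    if __name__ == '__main__':
        # ('classification_accuracy', {...}) stands in for a real graph node.
        score_fn = vm_lambda(('learner_obj',),
                             ('classification_accuracy',
                              {'function': ('as_classifier', Symbol('learner_obj')),
                               'examples': 'MNIST.validation_dataset'}))
        print(score_fn.apply(learner_obj='a trained model'))

A real design would substitute into proper expression-graph nodes rather than
tuples, but the point is that the lambda is itself just data in the graph, so
an early stopper or a k-fold driver can call it without knowing how the learner
was configured.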

K-fold cross-validation of a classifier
-----------------------------------------

    splits = kfold_cross_validate(
        indexlist=range(1000),
        train=8,
        valid=1,
        test=1,
        )

    trained_models = [
        halflife_early_stopper(
            initial_model=alloc_model('param1', 'param2'),
            burnin=100,
            score_fn=vm_lambda(('learner_obj',),
                classification_error(
                    function=as_classifier('learner_obj'),
                    dataset=MNIST.subset(validation_set))),
            step_fn=vm_lambda(('learner_obj',),
                sgd_step_fn(
                    parameters=vm_getattr('learner_obj', 'params'),
                    cost_and_updates=classif_nll('learner_obj',
                        example_stream=minibatches(
                            source=MNIST.subset(train_set),
                            batchsize=100,
                            loop=True)),
                    n_iter=100)))
        for (train_set, validation_set, test_set) in splits]

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

I want to draw attention to the fact that the call method treats the expression
tree as one big lambda expression, with potentially free variables that must be
assigned - here the 'param1' and 'param2' arguments to `alloc_model`.  There is
no need for separate compile and run steps as in Theano, because these
functions are expected to be long-running and called only once.


Analyze the results of the K-fold cross-validation
----------------------------------------------------

It often happens that a user does not know what statistics to compute *before*
running a bunch of learning jobs, but only afterward.  This can be done by
extending the symbolic program and calling the extended function:

    vm.call(
        [pylearn.min(model.weights) for model in trained_models],
        param1=1, param2=2)

If this is run after the previous calls

    vm.call(trained_models, param1=1, param2=2)
    vm.call(trained_models, param1=3, param2=4)

then it should run very quickly, because the `vm` can cache the return values of
the trained_models when param1=1 and param2=2.
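
To illustrate why that last call can be fast, here is a toy sketch of a vm
whose cache is keyed by both the expression and the free-variable bindings.
The names (`Free`, `Call`, `CachingVM`, `expensive_train`) are invented for
illustration and are not the library's API; a real design would also key each
sub-expression only on the bindings it actually depends on.

    # Hypothetical sketch of binding-keyed caching: re-evaluating the same
    # sub-expression under the same bindings is a cache hit, so extending the
    # program does not re-run the training jobs.

    class Free:
        """A free variable (e.g. 'param1') to be bound at vm.call time."""
        def __init__(self, name):
            self.name = name


    class Call:
        """A function applied to sub-expressions, free variables, or literals."""
        def __init__(self, fn, *args):
            self.fn, self.args = fn, args


    class CachingVM:
        def __init__(self):
            self.cache = {}

        def call(self, expr, **bindings):
            if isinstance(expr, list):
                return [self.call(e, **bindings) for e in expr]
            if isinstance(expr, Free):
                return bindings[expr.name]
            if not isinstance(expr, Call):
                return expr
            key = (id(expr), tuple(sorted(bindings.items())))
            if key not in self.cache:
                args = [self.call(a, **bindings) for a in expr.args]
                self.cache[key] = expr.fn(*args)
            return self.cache[key]


    if __name__ == '__main__':
        def expensive_train(p1, p2):
            print('training with', p1, p2)     # stands in for a long job
            return {'params': (p1, p2)}

        trained = [Call(expensive_train, Free('param1'), Free('param2'))]
        vm = CachingVM()
        vm.call(trained, param1=1, param2=2)   # trains
        vm.call(trained, param1=3, param2=4)   # trains under the new bindings

        # Extending the program: only the new summary node runs; the model
        # trained with param1=1, param2=2 comes straight from the cache.
        summary = [Call(lambda m: min(m['params']), t) for t in trained]
        print(vm.call(summary, param1=1, param2=2))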