diff doc/v2_planning/requirements.txt @ 1093:a65598681620
v2planning - initial commit of use_cases, requirements
author:   James Bergstra <bergstrj@iro.umontreal.ca>
date:     Sun, 12 Sep 2010 21:45:22 -0400
children: 2bbc294fa5ac
============
Requirements
============


Application Requirements
========================

Terminology and Abbreviations:
------------------------------

MLA - machine learning algorithm

learning problem - a machine learning application typically characterized by a
dataset (possibly with dataset folds), one or more functions to be learned from
the data, and one or more metrics to evaluate those functions. Learning
problems are the benchmarks for empirical model comparison.

n. of - number of

SGD - stochastic gradient descent

Users:
------

- New masters and PhD students in the lab should be able to quickly move into
  'production' mode without having to reinvent the wheel.

- Students in the two ML classes should be able to play with the library to
  explore new ML variants. This means some APIs (e.g. the Experiment level)
  must be really well documented and conceptually simple.

- Researchers outside the lab (who might study and experiment with our
  algorithms).

- Partners outside the lab (e.g. Bell, Ubisoft) with closed-source commercial
  projects.

Uses:
-----

R1. reproduce previous work (our own and others')

R2. explore MLA variants by swapping components (e.g. optimization algorithm,
    dataset, hyper-parameters)

R3. analyze experimental results (e.g. plotting training curves, finding the
    best models, marginalizing across hyper-parameter choices)

R4. disseminate (or serve as a platform for disseminating) our own published
    algorithms

R5. provide implementations of common MLA components (e.g. classifiers,
    datasets, optimization algorithms, meta-learning algorithms)

R6. drive large-scale parallelizable computations (e.g. grid search, bagging,
    random search)

R7. provide implementations of standard pre-processing algorithms (e.g. PCA,
    stemming, Mel-scale spectrograms, GIST features, etc.)

R8. provide high performance suitable for large-scale experiments

R9. be able to use the most efficient algorithms in special-case combinations
    of learning algorithm components (e.g. when there is a fast k-fold
    validation algorithm for a particular model family, the library should not
    require users to rewrite their standard k-fold validation script to use it)

R10. support experiments on a variety of datasets (e.g. movies, images, text,
     sound, reinforcement learning?)

R11. support efficient computations on datasets larger than RAM and GPU memory

R12. support infinite datasets (i.e. generated on the fly)


Basic Design Approach
=====================

An ability to drive parallel computations is essential in addressing [R6, R8].

The basic design approach for the library is to implement

- a few virtual machines (VMs), some of which can run programs that can be
  parallelized across processors, hosts, and networks, and
- MLAs in a Symbolic Expression language (similar to Theano), as required by
  [R5, R7, R8].

MLAs are typically specified by Symbolic programs that are compiled to VM
instructions, but some MLAs may be implemented in those instructions directly.
Symbolic programs are naturally modularized by sub-expressions [R2] and can be
optimized automatically (as in Theano) to address [R9].
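To make the modularity concrete, here is a small illustrative sketch in Theano
(the existing library the design is modelled on). It is not pylearn API: the
variable names, the logistic-regression model, and the 784/10 shapes are made
up for the example. The point is that the model, cost, and optimization
sub-expressions are separate pieces of the symbolic graph, so swapping one of
them yields a different MLA variant [R2], while graph compilation applies
optimizations automatically [R9]::

    import numpy
    import theano
    import theano.tensor as T

    x = T.matrix('x')    # minibatch of inputs
    y = T.ivector('y')   # integer class labels
    w = theano.shared(numpy.zeros((784, 10)), name='w')
    b = theano.shared(numpy.zeros(10), name='b')

    # model sub-expression -- replacing this with an MLP expression would
    # change the MLA variant without touching the rest of the program
    p_y = T.nnet.softmax(T.dot(x, w) + b)

    # cost sub-expression: mean negative log-likelihood of the true labels
    cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

    # optimization sub-expression: plain SGD updates; a different optimizer
    # would just supply a different list of (shared_variable, new_value) pairs
    lr = 0.1
    updates = [(p, p - lr * T.grad(cost, p)) for p in (w, b)]

    # compilation optimizes the whole graph automatically
    train = theano.function([x, y], cost, updates=updates)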
A VM that caches instruction return values (sketched below) serves as

- a reliable record of what jobs were run [R1],
- a database of intermediate results that can be analyzed after the
  model-training jobs have completed [R3], and
- a clean API to several possible storage and execution backends.
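As a rough illustration of the caching idea, the sketch below memoizes
instruction calls keyed by their arguments. Every name in it (``CachingVM``,
``run``, ``record``) is hypothetical and only stands in for whatever storage
and execution backends the real VM would use::

    import hashlib
    import pickle

    class CachingVM(object):
        """Toy VM that caches instruction return values."""

        def __init__(self):
            # job key -> (instruction name, args, kwargs, return value)
            self.results = {}

        def _key(self, instruction, args, kwargs):
            # stable fingerprint of the call; a real backend might store this
            # in a database or on a shared filesystem instead of in memory
            payload = pickle.dumps(
                (instruction.__name__, args, sorted(kwargs.items())))
            return hashlib.sha1(payload).hexdigest()

        def run(self, instruction, *args, **kwargs):
            key = self._key(instruction, args, kwargs)
            if key not in self.results:    # only new jobs are executed
                value = instruction(*args, **kwargs)
                self.results[key] = (instruction.__name__, args, kwargs, value)
            return self.results[key][-1]

        def record(self):
            # doubles as a record of what was run [R1] and a database of
            # intermediate results for later analysis [R3]
            return list(self.results.values())

    if __name__ == '__main__':
        def train(dataset, lr):
            # stand-in for a real model-training instruction
            return {'dataset': dataset, 'lr': lr, 'valid_error': 0.12}

        vm = CachingVM()
        vm.run(train, 'mnist', lr=0.1)   # executed
        vm.run(train, 'mnist', lr=0.1)   # served from the cache, not re-run
        print(vm.record())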