# HG changeset patch
# User James Bergstra
# Date 1281551757 14400
# Node ID cafa16bfc7df7b741f3e9a49b4c8b64766715914
# Parent 1529c84e460f57c3044d0808ea5c632b74621324
additions to v2_planning

diff -r 1529c84e460f -r cafa16bfc7df doc/v2_planning.txt
--- a/doc/v2_planning.txt Wed Aug 11 13:16:51 2010 -0400
+++ b/doc/v2_planning.txt Wed Aug 11 14:35:57 2010 -0400
@@ -63,11 +63,111 @@
 ================================================
+Theano Symbolic Expressions for ML
+----------------------------------
+
+We could make this a submodule of pylearn: ``pylearn.nnet``.
+
+There are a number of ideas floating around for how to handle classes /
+modules (LeDeepNet, pylearn.shared.layers, pynnet) so let's implement as much
+math as possible in global functions with no classes. There are no models in
+the wish list that require more than a few vectors and matrices to parametrize.
+Global functions are more reusable than classes.
+
+
+Data access
+-----------
+
+A general interface to datasets from the perspective of an experiment driver
+(e.g. kfold) is to see them as a function that maps index (typically integer)
+to example (whose type and nature depends on the dataset; it could for
+instance be an (image, label) pair). This interface permits iterating over
+the dataset, shuffling the dataset, and splitting it into folds. For
+efficiency, it is nice if the dataset interface supports looking up several
+index values at once, because looking up many examples at once can sometimes
+be faster than looking each one up in turn.
+
+Some datasets may not support random access (e.g. a random number stream) and
+that's fine if an exception is raised. The user will see a NotImplementedError
+or similar, and try something else.
+
+
+A more intuitive interface for many datasets (or subsets) is to load them as
+matrices or lists of examples. This format is more convenient to work with at
+an ipython shell, for example.
+It is not good to provide only the "dataset
+as a function" view of a dataset. Even if a dataset is very large, it is nice
+to have a standard way to get some representative examples in a convenient
+structure, to be able to play with them in ipython.
+
+
+Another thing to consider related to datasets is that there are a number of
+other efforts to have standard ML datasets, and we should be aware of them,
+and compatible with them when it's easy:
+ - mldata.org (they have a file format, not sure how many use it)
+ - weka (ARFF file format)
+ - scikits.learn
+ - hdf5 / pytables
+
+
+pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
+folder that is assumed to have a standard form across different installations.
+That's where the data files are. The correct format of this folder is currently
+defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
+better to document in pylearn what the contents of this folder should be as
+much as possible. It should be possible to rebuild this tree from information
+found in pylearn.
+
+
+
+Model Selection & Hyper-Parameter Optimization
+----------------------------------------------
+
+Driving a distributed computing job for a long time to optimize
+hyper-parameters using one or more clusters is the goal here.
+Although there might be some library-type code to write here, I think of this
+more as an application template. The user would use python code to describe
+the experiment to run and the hyper-parameter space to search. Then this
+application-driver would take control of scheduling jobs and running them on
+various computers... I'm imagining a potentially ugly brute of a hack that's
+not necessarily something we will want to expose at a low-level for reuse.
+
+
+Python scripts for common ML algorithms
+---------------------------------------
+
+The script aspect of this feature request makes me think that what would be
+good here is more tutorial-type scripts.
+And the existing tutorials could
+potentially be rewritten to use some of the pylearn.nnet expressions. More
+tutorials / demos would be great.
+
 Functional Specifications
 =========================
+TODO: Put these into different text files so that this one does not become a monster. For each thing with a functional spec (e.g. datasets library, optimization library) make a separate file.
+
+
+pylearn.nnet
+------------
+
+Submodule with functions for building layers, calculating classification
+errors, cross-entropies with various distributions, free energies. This
+module would include for the most part global functions, Theano Ops and Theano
+optimizations.
+
+Indexing Convention
+~~~~~~~~~~~~~~~~~~~
+
+Something to decide on - Fortran-style or C-style indexing. Although we have
+often used C-style indexing in the past (for efficiency in C!) this is no
+longer an issue with numpy because the physical layout is independent of the
+indexing order. The fact remains that Fortran-style indexing follows linear
+algebra conventions, while C-style indexing does not. If a global function
+includes a lot of math derivations, it would be *really* nice if the code used
+the same convention for the orientation of matrices, and endlessly annoying to
+have to be always transposing everything.
+
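
Editorial note (not part of the changeset): the claim in "Indexing Convention" that numpy's physical layout is independent of the indexing order can be checked with a small sketch. The variable names below are illustrative only and do not come from pylearn; the sketch assumes float64 arrays and only shows that the same logical matrix can be stored row-major or column-major without changing how it is indexed or what a matrix-vector product returns. It does not settle which orientation convention pylearn.nnet should adopt.

```python
import numpy as np

# One logical 2x3 matrix, stored with two different physical layouts.
a_c = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]], order="C")  # row-major memory
a_f = np.asfortranarray(a_c)                  # column-major copy

# Indexing is identical in both cases: a[i, j] is row i, column j.
same_element = a_c[1, 2] == a_f[1, 2]

# Only the strides (bytes to step along each axis) differ.
itemsize = a_c.itemsize
strides_c = a_c.strides   # (3 * itemsize, itemsize)
strides_f = a_f.strides   # (itemsize, 2 * itemsize)

# Math written against the logical shape gives the same result either way.
x = np.array([1.0, 0.0, -1.0])
y_c = a_c @ x
y_f = a_f @ x

# Transposing is a view on the same buffer, not a copy.
a_t = a_c.T
transpose_is_view = a_t.base is a_c
```

Because the transpose is only a view, choosing the orientation that matches the written derivations costs nothing in memory layout terms; the remaining cost is the readability issue the text describes.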