view doc/v2_planning.txt @ 945:cafa16bfc7df
additions to v2_planning
author | James Bergstra <bergstrj@iro.umontreal.ca> |
---|---|
date | Wed, 11 Aug 2010 14:35:57 -0400 |
parents | 939806d33183 |
children | 7c4504a4ce1a |
Motivation
==========

Yoshua:
-------

We are missing a *Theano Machine Learning library*.

The deep learning tutorials do a good job but they lack the following features, which I would like to see in a ML library:

- a well-organized collection of Theano symbolic expressions (formulas) for handling most of what is needed either in implementing existing well-known ML and deep learning algorithms or for creating new variants (without having to start from scratch each time), that is the mathematical core,

- a well-organized collection of python modules to help with the following:

  - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)

  - generic utility code for optimization

    - stochastic gradient descent variants
    - early stopping variants
    - interfacing to generic 2nd order optimization methods
    - 2nd order methods tailored to work on minibatches
    - optimizers for sparse coefficients / parameters

  - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)

  - generic code for performance estimation and experimental statistics

  - visualization tools (using existing python libraries) and examples for all of the above

  - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them

  [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.]

- a well-documented set of python scripts using the above library to show how to run the most common ML algorithms (possibly with examples showing how to run multiple experiments with many different models and collect statistical comparative results). This is particularly important for pure users to adopt Theano in their ML application work.


Ideally, there would be one person in charge of this project, making sure a coherent and easy-to-read design is developed, along with many helping hands (to implement the various helper modules, formulae, and learning algorithms).


James:
-------

I am interested in the design and implementation of the "well-organized collection of Theano symbolic expressions..."

I would like to explore algorithms for hyper-parameter optimization, following up on some "high-throughput" work. I'm most interested in the "generic code for model selection and hyper-parameter optimization..." and "generic code for performance estimation...".

I have some experience with the data-access requirements, and some lessons I'd like to share on that, but no time to work on that aspect of things.

I will continue to contribute to the "well-documented set of python scripts using the above to showcase common ML algorithms...". I have an Olshausen&Field-style sparse coding script that could be polished up. I am also implementing the mcRBM and I'll be able to add that when it's done.


Suggestions for how to tackle various desiderata
================================================


Theano Symbolic Expressions for ML
----------------------------------

We could make this a submodule of pylearn: ``pylearn.nnet``.

There are a number of ideas floating around for how to handle classes / modules (LeDeepNet, pylearn.shared.layers, pynnet), so let's implement as much math as possible in global functions with no classes. There are no models in the wish list that require more than a few vectors and matrices to parametrize. Global functions are more reusable than classes.


Data access
-----------

A general interface to datasets from the perspective of an experiment driver (e.g. kfold) is to see them as a function that maps index (typically integer) to example (whose type and nature depend on the dataset; it could for instance be an (image, label) pair).
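
To make the index-to-example view concrete, here is a minimal sketch in plain Python. The class name ``ListDataset`` and the toy data are hypothetical, chosen only for illustration; nothing here is an existing pylearn API.

```python
class ListDataset(object):
    """Hypothetical dataset seen as a function: integer index -> example."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __len__(self):
        return len(self._examples)

    def __call__(self, index):
        # One index in, one example out; here an (image, label) pair.
        return self._examples[index]


# Toy data: each "image" is a short list of pixel values.
dataset = ListDataset([
    ([0, 0, 1], 'a'),
    ([0, 1, 0], 'b'),
    ([1, 0, 0], 'c'),
])

# An experiment driver only needs the length and the index -> example map.
image, label = dataset(1)
all_labels = [dataset(i)[1] for i in range(len(dataset))]
```

The driver never touches the storage behind the dataset; it only chooses which indices to request and in what order.
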
This interface permits iterating over the dataset, shuffling the dataset, and splitting it into folds. For efficiency, it is nice if the dataset interface supports looking up several index values at once, because looking up many examples at once can sometimes be faster than looking each one up in turn. Some datasets may not support random access (e.g. a random number stream) and that's fine if an exception is raised. The user will see a NotImplementedError or similar, and try something else.

A more intuitive interface for many datasets (or subsets) is to load them as matrices or lists of examples. This format is more convenient to work with at an ipython shell, for example. It is not good to provide only the "dataset as a function" view of a dataset. Even if a dataset is very large, it is nice to have a standard way to get some representative examples in a convenient structure, to be able to play with them in ipython.

Another thing to consider related to datasets is that there are a number of other efforts to have standard ML datasets, and we should be aware of them, and compatible with them when it's easy:

- mldata.org (they have a file format, not sure how many use it)
- weka (ARFF file format)
- scikits.learn
- hdf5 / pytables

pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem folder that is assumed to have a standard form across different installations. That's where the data files are. The correct format of this folder is currently defined implicitly by the contents of /data/lisa/data at DIRO, but it would be better to document in pylearn what the contents of this folder should be as much as possible. It should be possible to rebuild this tree from information found in pylearn.


Model Selection & Hyper-Parameter Optimization
----------------------------------------------

Driving a distributed computing job for a long time to optimize hyper-parameters using one or more clusters is the goal here.
Although there might be some library-type code to write here, I think of this more as an application template. The user would use python code to describe the experiment to run and the hyper-parameter space to search. Then this application-driver would take control of scheduling jobs and running them on various computers... I'm imagining a potentially ugly brute of a hack that's not necessarily something we will want to expose at a low-level for reuse.


Python scripts for common ML algorithms
---------------------------------------

The script aspect of this feature request makes me think that what would be good here is more tutorial-type scripts. And the existing tutorials could potentially be rewritten to use some of the pylearn.nnet expressions. More tutorials / demos would be great.


Functional Specifications
=========================

TODO: Put these into different text files so that this one does not become a monster. For each thing with a functional spec (e.g. datasets library, optimization library) make a separate file.


pylearn.nnet
------------

Submodule with functions for building layers, calculating classification errors, cross-entropies with various distributions, free energies. This module would include for the most part global functions, Theano Ops and Theano optimizations.

Indexing Convention
~~~~~~~~~~~~~~~~~~~

Something to decide on - Fortran-style or C-style indexing. Although we have often used c-style indexing in the past (for efficiency in c!) this is no longer an issue with numpy because the physical layout is independent of the indexing order. The fact remains that Fortran-style indexing follows linear algebra conventions, while c-style indexing does not. If a global function includes a lot of math derivations, it would be *really* nice if the code used the same convention for the orientation of matrices, and endlessly annoying to have to be always transposing everything.
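
As a small illustration of the two points above, the following numpy sketch first stores the same matrix in C order and in Fortran order and indexes both identically, then writes one affine map in each orientation. The function names ``affine_examples_in_rows`` and ``affine_examples_in_columns`` are hypothetical and not part of pylearn, and reading "orientation" as "examples in rows versus examples in columns" is an assumption about what the convention would cover.

```python
import numpy as np

# The same logical matrix stored with two physical layouts.
a_c = np.arange(6, dtype='float64').reshape(2, 3)  # row-major (C order)
a_f = np.asfortranarray(a_c)                       # column-major (Fortran order)

# Indexing a[i, j] is identical for both; only the strides differ.
same_values = bool((a_c == a_f).all())
strides_c, strides_f = a_c.strides, a_f.strides


def affine_examples_in_rows(x, W, b):
    """x is (n_examples, n_in), W is (n_in, n_out): computes x W + b."""
    return np.dot(x, W) + b


def affine_examples_in_columns(W, x, b):
    """x is (n_in, n_examples), W is (n_out, n_in): computes W x + b."""
    return np.dot(W, x) + b[:, np.newaxis]


rng = np.random.RandomState(0)
x = rng.randn(4, 3)   # 4 examples with 3 inputs each
W = rng.randn(3, 2)
b = rng.randn(2)

y_rows = affine_examples_in_rows(x, W, b)         # shape (4, 2)
y_cols = affine_examples_in_columns(W.T, x.T, b)  # shape (2, 4)
```

Both functions compute the same numbers; the second reads like the usual ``W x + b`` of a derivation, while the first needs every formula transposed before it can be coded.
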