diff doc/v2_planning/main_plan.txt @ 1001:660d784d14c7
moved planning into its directory
author | Yoshua Bengio <bengioy@iro.umontreal.ca> |
---|---|
date | Wed, 01 Sep 2010 12:18:08 -0400 |
parents | doc/v2_planning.txt@216f4ce969b2 |
children | 2e515be92a0e |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/main_plan.txt Wed Sep 01 12:18:08 2010 -0400
@@ -0,0 +1,241 @@
+
+Motivation
+==========
+
+Yoshua:
+-------
+
+We are missing a *Theano Machine Learning library*.
+
+The deep learning tutorials do a good job but they lack the following features, which I would like to see in a ML library:
+
+ - a well-organized collection of Theano symbolic expressions (formulas) for handling most of
+   what is needed either in implementing existing well-known ML and deep learning algorithms or
+   for creating new variants (without having to start from scratch each time), that is the
+   mathematical core,
+
+ - a well-organized collection of python modules to help with the following:
+      - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
+      - generic utility code for optimization
+         - stochastic gradient descent variants
+         - early stopping variants
+         - interfacing to generic 2nd order optimization methods
+         - 2nd order methods tailored to work on minibatches
+         - optimizers for sparse coefficients / parameters
+      - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
+      - generic code for performance estimation and experimental statistics
+      - visualization tools (using existing python libraries) and examples for all of the above
+      - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them
+
+   [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.]
+
+ - a well-documented set of python scripts using the above library to show how to run the most
+   common ML algorithms (possibly with examples showing how to run multiple experiments with
+   many different models and collect statistical comparative results). This is particularly
+   important for pure users to adopt Theano in their ML application work.
+
+Ideally, there would be one person in charge of this project, making sure a coherent and
+easy-to-read design is developed, along with many helping hands (to implement the various
+helper modules, formulae, and learning algorithms).
+
+
+James:
+-------
+
+I am interested in the design and implementation of the "well-organized collection of Theano
+symbolic expressions..."
+
+I would like to explore algorithms for hyper-parameter optimization, following up on some
+"high-throughput" work. I'm most interested in the "generic code for model selection and
+hyper-parameter optimization..." and "generic code for performance estimation...".
+
+I have some experience with the data-access requirements, and some lessons I'd like to share
+on that, but no time to work on that aspect of things.
+
+I will continue to contribute to the "well-documented set of python scripts using the above to
+showcase common ML algorithms...". I have an Olshausen&Field-style sparse coding script that
+could be polished up. I am also implementing the mcRBM and I'll be able to add that when it's
+done.
+
+
+
+Suggestions for how to tackle various desiderata
+================================================
+
+
+Theano Symbolic Expressions for ML
+----------------------------------
+
+We could make this a submodule of pylearn: ``pylearn.nnet``.
+
+Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just
+about neural nets, and that this is a collection of formulas (expressions), rather than
+completely self-contained classes for learners. We could have a "nnet.py" file for
+neural nets, though.
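To make "a collection of formulas rather than self-contained classes" concrete, here is a minimal sketch of what such formulas might look like. It is an illustration only: the function names (``linear_output``, ``squared_error``, ``l1_penalty``) are hypothetical, and NumPy arrays stand in for the Theano symbolic variables that a real version would operate on.

```python
import numpy as np


def linear_output(x, W, b):
    """Affine map applied to a minibatch x with one example per row."""
    return np.dot(x, W) + b


def squared_error(output, target):
    """Mean over examples (rows) of the summed squared difference."""
    return ((output - target) ** 2).sum(axis=1).mean()


def l1_penalty(weights):
    """Sum of absolute values, usable as a sparsity penalty."""
    return np.abs(weights).sum()


# Formulas compose by ordinary function calls; no class is needed
# to hold the parameters W and b.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[1.0], [0.0]])
b = np.array([0.5])
target = np.array([[1.5], [2.5]])

cost = squared_error(linear_output(x, W, b), target) + 0.1 * l1_penalty(W)
```

Each function takes its parameters as explicit arguments, so the same formula can be reused with any way of storing or sharing those parameters.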
+
+There are a number of ideas floating around for how to handle classes /
+modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn) so let's implement as much
+math as possible in global functions with no classes. There are no models in
+the wish list that require more than a few vectors and matrices to parametrize.
+Global functions are more reusable than classes.
+
+
+Data access
+-----------
+
+A general interface to datasets from the perspective of an experiment driver
+(e.g. kfold) is to see them as a function that maps index (typically integer)
+to example (whose type and nature depend on the dataset, it could for
+instance be an (image, label) pair). This interface permits iterating over
+the dataset, shuffling the dataset, and splitting it into folds. For
+efficiency, it is nice if the dataset interface supports looking up several
+index values at once, because looking up many examples at once can sometimes
+be faster than looking each one up in turn. In particular, looking up
+a consecutive block of indices, or a slice, should be well supported.
+
+Some datasets may not support random access (e.g. a random number stream) and
+that's fine if an exception is raised. The user will see a NotImplementedError
+or similar, and try something else. We might want to have a way to test
+whether a dataset is random-access or not without having to load an example.
+
+
+A more intuitive interface for many datasets (or subsets) is to load them as
+matrices or lists of examples. This format is more convenient to work with at
+an ipython shell, for example. It is not good to provide only the "dataset
+as a function" view of a dataset. Even if a dataset is very large, it is nice
+to have a standard way to get some representative examples in a convenient
+structure, to be able to play with them in ipython.
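The "dataset as a function from index to example" view described above could be sketched as follows. This is one possible shape under stated assumptions, not an existing pylearn API: the names ``ArrayDataset``, ``RandomStream``, ``random_access`` and ``kfold_indices`` are all hypothetical.

```python
import numpy as np


class ArrayDataset:
    """Hypothetical in-memory dataset mapping an index to an (input, label) pair."""

    # A class-level flag lets a driver test for random access
    # without loading any example.
    random_access = True

    def __init__(self, inputs, labels):
        self.inputs = np.asarray(inputs)
        self.labels = np.asarray(labels)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # An int returns one example; a slice or a list of ints
        # returns a whole batch in a single NumPy lookup.
        if isinstance(index, list):
            index = np.asarray(index, dtype=int)
        return self.inputs[index], self.labels[index]


class RandomStream:
    """Hypothetical stream dataset: iteration works, indexing does not."""

    random_access = False

    def __init__(self, seed=0):
        self.rng = np.random.RandomState(seed)

    def __iter__(self):
        while True:
            yield self.rng.uniform()

    def __getitem__(self, index):
        raise NotImplementedError("stream dataset: iterate instead of indexing")


def kfold_indices(n_examples, n_folds):
    """Split range(n_examples) into n_folds nearly equal lists of indices."""
    return [fold.tolist() for fold in np.array_split(np.arange(n_examples), n_folds)]


data = ArrayDataset(np.arange(12).reshape(6, 2), [0, 1, 0, 1, 0, 1])
one_input, one_label = data[2]        # single example
block_inputs, block_labels = data[1:4]  # consecutive block of examples
folds = kfold_indices(len(data), 3)
fold_inputs, fold_labels = data[folds[0]]  # all examples of the first fold
```

Because the in-memory variant keeps plain arrays in ``inputs`` and ``labels``, it also covers the "load it as a matrix" use at an interactive shell.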
+
+
+Another thing to consider related to datasets is that there are a number of
+other efforts to have standard ML datasets, and we should be aware of them,
+and compatible with them when it's easy:
+ - mldata.org (they have a file format, not sure how many use it)
+ - weka (ARFF file format)
+ - scikits.learn
+ - hdf5 / pytables
+
+
+pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
+folder that is assumed to have a standard form across different installations.
+That's where the data files are. The correct format of this folder is currently
+defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
+better to document in pylearn what the contents of this folder should be as
+much as possible. It should be possible to rebuild this tree from information
+found in pylearn.
+
+Yoshua (about ideas proposed by Pascal Vincent a while ago):
+
+ - we may want to distinguish between datasets and tasks: a task defines
+   not just the data but also things like what is the input and what is the
+   target (for supervised learning), and *importantly* a set of performance metrics
+   that make sense for this task (e.g. those used by papers solving a particular
+   task, or reported for a particular benchmark)
+
+ - we should discuss a few "standards" that datasets and tasks may comply with, such as
+    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
+      (with a convention for the semi-supervised case when only the input or only the target is observed)
+    - "input" for unsupervised learning
+    - conventions for missing-valued components inside input or target
+    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
+    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
+    - how error metrics are specified
+        * example-level statistics (e.g. classification error)
+        * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
+
+
+Model Selection & Hyper-Parameter Optimization
+----------------------------------------------
+
+Driving a distributed computing job for a long time to optimize
+hyper-parameters using one or more clusters is the goal here.
+Although there might be some library-type code to write here, I think of this
+more as an application template. The user would use python code to describe
+the experiment to run and the hyper-parameter space to search. Then this
+application-driver would take control of scheduling jobs and running them on
+various computers... I'm imagining a potentially ugly brute of a hack that's
+not necessarily something we will want to expose at a low-level for reuse.
+
+Yoshua: We want both the library-defined driver that takes instructions about how to generate
+new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which
+to sample them), and examples showing how to use it in typical cases.
+Note that sometimes we just want to find the best configuration of hyper-parameters,
+but sometimes we want to do more subtle analysis. Often a combination of both.
+In this respect it could be useful for the user to distinguish hyper-parameters about
+which scientific questions are asked (e.g. depth of an architecture) from
+hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate).
+This can influence both the sampling of configurations (we want to make sure that all
+combinations of question-driving hyper-parameters are covered) and the analysis
+of results (we may want to estimate ANOVAs, averages, or quantiles over
+the non-question-driving hyper-parameters).
+
+Python scripts for common ML algorithms
+---------------------------------------
+
+The script aspect of this feature request makes me think that what would be
+good here is more tutorial-type scripts. And the existing tutorials could
+potentially be rewritten to use some of the pylearn.nnet expressions. More
+tutorials / demos would be great.
+
+Yoshua: agreed that we could write them as tutorials, but note how the
+spirit would be different from the current deep learning tutorials: we would
+not mind using library code as much as possible instead of trying to flatten
+out everything in the interest of pedagogical simplicity. Instead, these
+tutorials should be meant to illustrate not the algorithms but *how to take
+advantage of the library*. They could also be used as *BLACK BOX* implementations
+by people who don't want to dig lower and just want to run experiments.
+
+Functional Specifications
+=========================
+
+TODO:
+Put these into different text files so that this one does not become a monster.
+For each thing with a functional spec (e.g. datasets library, optimization library) make a
+separate file.
+
+
+
+pylearn.formulas
+----------------
+
+Directory with functions for building layers, calculating classification
+errors, cross-entropies with various distributions, free energies, etc. This
+module would include for the most part global functions, Theano Ops and Theano
+optimizations.
+
+Yoshua: I would break it down into module files, e.g.:
+
+pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error,
+abs. error, various sparsity penalties (L1, Student)
+
+pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA
+
+pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
+layers which could be plugged with various costs & penalties, and stacked
+
+pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants
+
+pylearn.formulas.noise: formulas for corruption processes
+
+pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling
+
+pylearn.formulas.trees: formulas for decision trees
+
+pylearn.formulas.boosting: formulas for boosting variants
+
+etc.
+
+Fred: It seems that the DeepANN git repository by Xavier G. has part of this as functions.
+
+Indexing Convention
+~~~~~~~~~~~~~~~~~~~
+
+Something to decide on - Fortran-style or C-style indexing. Although we have
+often used C-style indexing in the past (for efficiency in C!) this is no
+longer an issue with numpy because the physical layout is independent of the
+indexing order. The fact remains that Fortran-style indexing follows linear
+algebra conventions, while C-style indexing does not. If a global function
+includes a lot of math derivations, it would be *really* nice if the code used
+the same convention for the orientation of matrices, and endlessly annoying to
+have to be always transposing everything.
+
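The remark that numpy's physical layout is independent of the indexing order can be checked with a small sketch. It only illustrates the memory-layout point (it does not settle which orientation convention to adopt), and the variable names are chosen for illustration.

```python
import numpy as np

# The same 2x3 matrix stored row-major (C order) and column-major (Fortran order).
c_order = np.arange(6, dtype=np.float64).reshape(2, 3)
f_order = np.asfortranarray(c_order)

# Element access a[i, j] and the math are identical for both layouts.
same_values = np.array_equal(c_order, f_order)
v = np.array([1.0, 2.0, 3.0])
same_product = np.allclose(np.dot(c_order, v), np.dot(f_order, v))

# Only the strides (bytes to step along each axis) differ.
c_strides = c_order.strides  # (24, 8): rows are contiguous
f_strides = f_order.strides  # (8, 16): columns are contiguous

# Transposing returns a view with swapped strides; no data is copied.
transpose_shares_memory = np.shares_memory(c_order, c_order.T)
```

Since a transpose is only a change of strides, the choice of convention is mostly about keeping the code readable next to the math, not about speed.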