# HG changeset patch # User Yoshua Bengio # Date 1283357888 14400 # Node ID 660d784d14c7b3188b7c9142cf196e488fd73a3d # Parent d4a14c6c36e07de09a8bad62514687e346c28f44 moved planning into its directory diff -r d4a14c6c36e0 -r 660d784d14c7 doc/v2_planning.txt --- a/doc/v2_planning.txt Tue Aug 24 19:24:54 2010 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,241 +0,0 @@ - -Motivation -========== - -Yoshua: -------- - -We are missing a *Theano Machine Learning library*. - -The deep learning tutorials do a good job but they lack the following features, which I would like to see in a ML library: - - - a well-organized collection of Theano symbolic expressions (formulas) for handling most of - what is needed either in implementing existing well-known ML and deep learning algorithms or - for creating new variants (without having to start from scratch each time), that is the - mathematical core, - - - a well-organized collection of python modules to help with the following: - - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.) - - generic utility code for optimization - - stochastic gradient descent variants - - early stopping variants - - interfacing to generic 2nd order optimization methods - - 2nd order methods tailored to work on minibatches - - optimizers for sparse coefficients / parameters - - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman) - - generic code for performance estimation and experimental statistics - - visualization tools (using existing python libraries) and examples for all of the above - - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) 
which use them - - [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.] - - - a well-documented set of python scripts using the above library to show how to run the most - common ML algorithms (possibly with examples showing how to run multiple experiments with - many different models and collect statistical comparative results). This is particularly - important for pure users to adopt Theano in the ML application work. - -Ideally, there would be one person in charge of this project, making sure a coherent and -easy-to-read design is developed, along with many helping hands (to implement the various -helper modules, formulae, and learning algorithms). - - -James: -------- - -I am interested in the design and implementation of the "well-organized collection of Theano -symbolic expressions..." - -I would like to explore algorithms for hyper-parameter optimization, following up on some -"high-throughput" work. I'm most interested in the "generic code for model selection and -hyper-parameter optimization..." and "generic code for performance estimation...". - -I have some experiences with the data-access requirements, and some lessons I'd like to share -on that, but no time to work on that aspect of things. - -I will continue to contribute to the "well-documented set of python scripts using the above to -showcase common ML algorithms...". I have an Olshausen&Field-style sparse coding script that -could be polished up. I am also implementing the mcRBM and I'll be able to add that when it's -done. - - - -Suggestions for how to tackle various desiderata -================================================ - - -Theano Symbolic Expressions for ML ----------------------------------- - -We could make this a submodule of pylearn: ``pylearn.nnet``. 
- -Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just -about neural nets, and that this is a collection of formulas (expressions), rather than -completely self-contained classes for learners. We could have a "nnet.py" file for -neural nets, though. - -There are a number of ideas floating around for how to handle classes / -modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn) so lets implement as much -math as possible in global functions with no classes. There are no models in -the wish list that require than a few vectors and matrices to parametrize. -Global functions are more reusable than classes. - - -Data access ------------ - -A general interface to datasets from the perspective of an experiment driver -(e.g. kfold) is to see them as a function that maps index (typically integer) -to example (whose type and nature depends on the dataset, it could for -instance be an (image, label) pair). This interface permits iterating over -the dataset, shuffling the dataset, and splitting it into folds. For -efficiency, it is nice if the dataset interface supports looking up several -index values at once, because looking up many examples at once can sometimes -be faster than looking each one up in turn. In particular, looking up -a consecutive block of indices, or a slice, should be well supported. - -Some datasets may not support random access (e.g. a random number stream) and -that's fine if an exception is raised. The user will see a NotImplementedError -or similar, and try something else. We might want to have a way to test -that a dataset is random-access or not without having to load an example. - - -A more intuitive interface for many datasets (or subsets) is to load them as -matrices or lists of examples. This format is more convenient to work with at -an ipython shell, for example. It is not good to provide only the "dataset -as a function" view of a dataset. 
Even if a dataset is very large, it is nice -to have a standard way to get some representative examples in a convenient -structure, to be able to play with them in ipython. - - -Another thing to consider related to datasets is that there are a number of -other efforts to have standard ML datasets, and we should be aware of them, -and compatible with them when it's easy: - - mldata.org (they have a file format, not sure how many use it) - - weka (ARFF file format) - - scikits.learn - - hdf5 / pytables - - -pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem -folder that is assumed to have a standard form across different installations. -That's where the data files are. The correct format of this folder is currently -defined implicitly by the contents of /data/lisa/data at DIRO, but it would be -better to document in pylearn what the contents of this folder should be as -much as possible. It should be possible to rebuild this tree from information -found in pylearn. - -Yoshua (about ideas proposed by Pascal Vincent a while ago): - - - we may want to distinguish between datasets and tasks: a task defines - not just the data but also things like what is the input and what is the - target (for supervised learning), and *importantly* a set of performance metrics - that make sense for this task (e.g. those used by papers solving a particular - task, or reported for a particular benchmark) - - - we should discuss about a few "standards" that datasets and tasks may comply to, such as - - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks - (with a convention for the semi-supervised case when only the input or only the target is observed) - - "input" for unsupervised learning - - conventions for missing-valued components inside input or target - - how examples that are sequences are treated (e.g. 
the input or the target is a sequence) - - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous) - - how error metrics are specified - * example-level statistics (e.g. classification error) - * dataset-level statistics (e.g. ROC curve, mean and standard error of error) - - -Model Selection & Hyper-Parameter Optimization ----------------------------------------------- - -Driving a distributed computing job for a long time to optimize -hyper-parameters using one or more clusters is the goal here. -Although there might be some library-type code to write here, I think of this -more as an application template. The user would use python code to describe -the experiment to run and the hyper-parameter space to search. Then this -application-driver would take control of scheduling jobs and running them on -various computers... I'm imagining a potentially ugly brute of a hack that's -not necessarily something we will want to expose at a low-level for reuse. - -Yoshua: We want both the library-defined driver that takes instructions about how to generate -new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which -to sample them), and examples showing how to use it in typical cases. -Note that sometimes we just want to find the best configuration of hyper-parameters, -but sometimes we want to do more subtle analysis. Often a combination of both. -In this respect it could be useful for the user to define hyper-parameters over -which scientific questions are sought (e.g. depth of an architecture) vs -hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate). -This can influence both the sampling of configurations (we want to make sure that all -combinations of question-driving hyper-parameters are covered) and the analysis -of results (we may be willing to estimate ANOVAs or averaging or quantiles over -the non-question-driving hyper-parameters). 
- -Python scripts for common ML algorithms ---------------------------------------- - -The script aspect of this feature request makes me think that what would be -good here is more tutorial-type scripts. And the existing tutorials could -potentially be rewritten to use some of the pylearn.nnet expressions. More -tutorials / demos would be great. - -Yoshua: agreed that we could write them as tutorials, but note how the -spirit would be different from the current deep learning tutorials: we would -not mind using library code as much as possible instead of trying to flatten -out everything in the interest of pedagogical simplicity. Instead, these -tutorials should be meant to illustrate not the algorithms but *how to take -advantage of the library*. They could also be used as *BLACK BOX* implementations -by people who don't want to dig lower and just want to run experiments. - -Functional Specifications -========================= - -TODO: -Put these into different text files so that this one does not become a monster. -For each thing with a functional spec (e.g. datasets library, optimization library) make a -separate file. - - - -pylearn.formulas ----------------- - -Directory with functions for building layers, calculating classification -errors, cross-entropies with various distributions, free energies, etc. This -module would include for the most part global functions, Theano Ops and Theano -optimizations. - -Yoshua: I would break it down in module files, e.g.: - -pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error, -abs. 
error, various sparsity penalties (L1, Student) - -pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA - -pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions, -layers which could be plugged with various costs & penalties, and stacked - -pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants - -pylearn.formulas.noise: formulas for corruption processes - -pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling - -pylearn.formulas.trees: formulas for decision trees - -pylearn.formulas.boosting: formulas for boosting variants - -etc. - -Fred: It seam that the DeepANN git repository by Xavier G. have part of this as function. - -Indexing Convention -~~~~~~~~~~~~~~~~~~~ - -Something to decide on - Fortran-style or C-style indexing. Although we have -often used c-style indexing in the past (for efficiency in c!) this is no -longer an issue with numpy because the physical layout is independent of the -indexing order. The fact remains that Fortran-style indexing follows linear -algebra conventions, while c-style indexing does not. If a global function -includes a lot of math derivations, it would be *really* nice if the code used -the same convention for the orientation of matrices, and endlessly annoying to -have to be always transposing everything. - diff -r d4a14c6c36e0 -r 660d784d14c7 doc/v2_planning/main_plan.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/v2_planning/main_plan.txt Wed Sep 01 12:18:08 2010 -0400 @@ -0,0 +1,241 @@ + +Motivation +========== + +Yoshua: +------- + +We are missing a *Theano Machine Learning library*. 
+ +The deep learning tutorials do a good job but they lack the following features, which I would like to see in a ML library: + + - a well-organized collection of Theano symbolic expressions (formulas) for handling most of + what is needed either in implementing existing well-known ML and deep learning algorithms or + for creating new variants (without having to start from scratch each time), that is the + mathematical core, + + - a well-organized collection of python modules to help with the following: + - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.) + - generic utility code for optimization + - stochastic gradient descent variants + - early stopping variants + - interfacing to generic 2nd order optimization methods + - 2nd order methods tailored to work on minibatches + - optimizers for sparse coefficients / parameters + - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman) + - generic code for performance estimation and experimental statistics + - visualization tools (using existing python libraries) and examples for all of the above + - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them + + [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.] + + - a well-documented set of python scripts using the above library to show how to run the most + common ML algorithms (possibly with examples showing how to run multiple experiments with + many different models and collect statistical comparative results). This is particularly + important for pure users to adopt Theano in the ML application work. 
+ +Ideally, there would be one person in charge of this project, making sure a coherent and +easy-to-read design is developed, along with many helping hands (to implement the various +helper modules, formulae, and learning algorithms). + + +James: +------- + +I am interested in the design and implementation of the "well-organized collection of Theano +symbolic expressions..." + +I would like to explore algorithms for hyper-parameter optimization, following up on some +"high-throughput" work. I'm most interested in the "generic code for model selection and +hyper-parameter optimization..." and "generic code for performance estimation...". + +I have some experience with the data-access requirements, and some lessons I'd like to share +on that, but no time to work on that aspect of things. + +I will continue to contribute to the "well-documented set of python scripts using the above to +showcase common ML algorithms...". I have an Olshausen&Field-style sparse coding script that +could be polished up. I am also implementing the mcRBM and I'll be able to add that when it's +done. + + + +Suggestions for how to tackle various desiderata +================================================ + + +Theano Symbolic Expressions for ML +---------------------------------- + +We could make this a submodule of pylearn: ``pylearn.nnet``. + +Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just +about neural nets, and that this is a collection of formulas (expressions), rather than +completely self-contained classes for learners. We could have a "nnet.py" file for +neural nets, though. + +There are a number of ideas floating around for how to handle classes / +modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn) so let's implement as much +math as possible in global functions with no classes. There are no models in +the wish list that require more than a few vectors and matrices to parametrize. +Global functions are more reusable than classes. 
+ + +Data access +----------- + +A general interface to datasets from the perspective of an experiment driver +(e.g. kfold) is to see them as a function that maps index (typically integer) +to example (whose type and nature depends on the dataset, it could for +instance be an (image, label) pair). This interface permits iterating over +the dataset, shuffling the dataset, and splitting it into folds. For +efficiency, it is nice if the dataset interface supports looking up several +index values at once, because looking up many examples at once can sometimes +be faster than looking each one up in turn. In particular, looking up +a consecutive block of indices, or a slice, should be well supported. + +Some datasets may not support random access (e.g. a random number stream) and +that's fine if an exception is raised. The user will see a NotImplementedError +or similar, and try something else. We might want to have a way to test +that a dataset is random-access or not without having to load an example. + + +A more intuitive interface for many datasets (or subsets) is to load them as +matrices or lists of examples. This format is more convenient to work with at +an ipython shell, for example. It is not good to provide only the "dataset +as a function" view of a dataset. Even if a dataset is very large, it is nice +to have a standard way to get some representative examples in a convenient +structure, to be able to play with them in ipython. + + +Another thing to consider related to datasets is that there are a number of +other efforts to have standard ML datasets, and we should be aware of them, +and compatible with them when it's easy: + - mldata.org (they have a file format, not sure how many use it) + - weka (ARFF file format) + - scikits.learn + - hdf5 / pytables + + +pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem +folder that is assumed to have a standard form across different installations. +That's where the data files are. 
The correct format of this folder is currently +defined implicitly by the contents of /data/lisa/data at DIRO, but it would be +better to document in pylearn what the contents of this folder should be as +much as possible. It should be possible to rebuild this tree from information +found in pylearn. + +Yoshua (about ideas proposed by Pascal Vincent a while ago): + + - we may want to distinguish between datasets and tasks: a task defines + not just the data but also things like what is the input and what is the + target (for supervised learning), and *importantly* a set of performance metrics + that make sense for this task (e.g. those used by papers solving a particular + task, or reported for a particular benchmark) + + - we should discuss a few "standards" that datasets and tasks may comply with, such as + - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks + (with a convention for the semi-supervised case when only the input or only the target is observed) + - "input" for unsupervised learning + - conventions for missing-valued components inside input or target + - how examples that are sequences are treated (e.g. the input or the target is a sequence) + - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous) + - how error metrics are specified + * example-level statistics (e.g. classification error) + * dataset-level statistics (e.g. ROC curve, mean and standard error of error) + + +Model Selection & Hyper-Parameter Optimization +---------------------------------------------- + +Driving a distributed computing job for a long time to optimize +hyper-parameters using one or more clusters is the goal here. +Although there might be some library-type code to write here, I think of this +more as an application template. The user would use python code to describe +the experiment to run and the hyper-parameter space to search. 
Then this +application-driver would take control of scheduling jobs and running them on +various computers... I'm imagining a potentially ugly brute of a hack that's +not necessarily something we will want to expose at a low-level for reuse. + +Yoshua: We want both the library-defined driver that takes instructions about how to generate +new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which +to sample them), and examples showing how to use it in typical cases. +Note that sometimes we just want to find the best configuration of hyper-parameters, +but sometimes we want to do more subtle analysis. Often a combination of both. +In this respect it could be useful for the user to define hyper-parameters over +which scientific questions are sought (e.g. depth of an architecture) vs +hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate). +This can influence both the sampling of configurations (we want to make sure that all +combinations of question-driving hyper-parameters are covered) and the analysis +of results (we may be willing to estimate ANOVAs or averaging or quantiles over +the non-question-driving hyper-parameters). + +Python scripts for common ML algorithms +--------------------------------------- + +The script aspect of this feature request makes me think that what would be +good here is more tutorial-type scripts. And the existing tutorials could +potentially be rewritten to use some of the pylearn.nnet expressions. More +tutorials / demos would be great. + +Yoshua: agreed that we could write them as tutorials, but note how the +spirit would be different from the current deep learning tutorials: we would +not mind using library code as much as possible instead of trying to flatten +out everything in the interest of pedagogical simplicity. Instead, these +tutorials should be meant to illustrate not the algorithms but *how to take +advantage of the library*. 
They could also be used as *BLACK BOX* implementations +by people who don't want to dig lower and just want to run experiments. + +Functional Specifications +========================= + +TODO: +Put these into different text files so that this one does not become a monster. +For each thing with a functional spec (e.g. datasets library, optimization library) make a +separate file. + + + +pylearn.formulas +---------------- + +Directory with functions for building layers, calculating classification +errors, cross-entropies with various distributions, free energies, etc. This +module would include for the most part global functions, Theano Ops and Theano +optimizations. + +Yoshua: I would break it down into module files, e.g.: + +pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error, +abs. error, various sparsity penalties (L1, Student) + +pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA + +pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions, +layers which could be plugged with various costs & penalties, and stacked + +pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants + +pylearn.formulas.noise: formulas for corruption processes + +pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling + +pylearn.formulas.trees: formulas for decision trees + +pylearn.formulas.boosting: formulas for boosting variants + +etc. + +Fred: It seems that the DeepANN git repository by Xavier G. has part of this as functions. + +Indexing Convention +~~~~~~~~~~~~~~~~~~~ + +Something to decide on - Fortran-style or C-style indexing. Although we have +often used C-style indexing in the past (for efficiency in C!) this is no +longer an issue with numpy because the physical layout is independent of the +indexing order. 
The fact remains that Fortran-style indexing follows linear +algebra conventions, while C-style indexing does not. If a global function +includes a lot of math derivations, it would be *really* nice if the code used +the same convention for the orientation of matrices, and endlessly annoying to +have to be always transposing everything. +