changeset 1001:660d784d14c7

moved planning into its directory
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 01 Sep 2010 12:18:08 -0400
parents d4a14c6c36e0
children f82093bf4405
files doc/v2_planning.txt doc/v2_planning/main_plan.txt
diffstat 2 files changed, 241 insertions(+), 241 deletions(-)
--- a/doc/v2_planning.txt	Tue Aug 24 19:24:54 2010 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,241 +0,0 @@
-
-Motivation
-==========
-
-Yoshua:
--------
-
-We are missing a *Theano Machine Learning library*.
-
-The deep learning tutorials do a good job, but they lack the following features, which I would like to see in an ML library:
-
- - a well-organized collection of Theano symbolic expressions (formulas) for handling most of
-   what is needed either in implementing existing well-known ML and deep learning algorithms or
-   in creating new variants (without having to start from scratch each time); in other words, the
-   mathematical core,
-
- - a well-organized collection of python modules to help with the following:
-      - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
-      - generic utility code for optimization
-             - stochastic gradient descent variants
-             - early stopping variants
-             - interfacing to generic 2nd order optimization methods
-             - 2nd order methods tailored to work on minibatches
-             - optimizers for sparse coefficients / parameters
-      - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
-      - generic code for performance estimation and experimental statistics
-      - visualization tools (using existing python libraries) and examples for all of the above
-      - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them
-
-   [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.]
-
- - a well-documented set of python scripts using the above library to show how to run the most
-   common ML algorithms (possibly with examples showing how to run multiple experiments with
-   many different models and collect statistical comparative results). This is particularly
-   important for getting pure users to adopt Theano for ML application work.
-
-Ideally, there would be one person in charge of this project, making sure a coherent and
-easy-to-read design is developed, along with many helping hands (to implement the various
-helper modules, formulae, and learning algorithms).
-
-
-James:
--------
-
-I am interested in the design and implementation of the "well-organized collection of Theano
-symbolic expressions..."
-
-I would like to explore algorithms for hyper-parameter optimization, following up on some
-"high-throughput" work.  I'm most interested in the "generic code for model selection and
-hyper-parameter optimization..." and "generic code for performance estimation...".  
-
-I have some experience with the data-access requirements, and some lessons I'd like to share
-on that, but no time to work on that aspect of things.
-
-I will continue to contribute to the "well-documented set of python scripts using the above to
-showcase common ML algorithms...".  I have an Olshausen&Field-style sparse coding script that
-could be polished up.  I am also implementing the mcRBM and I'll be able to add that when it's
-done.
-
-
-
-Suggestions for how to tackle various desiderata
-================================================
-
-
-Theano Symbolic Expressions for ML
-----------------------------------
-
-We could make this a submodule of pylearn: ``pylearn.nnet``.  
-
-Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just 
-about neural nets, and that this is a collection of formulas (expressions), rather than
-completely self-contained classes for learners. We could have a "nnet.py" file for
-neural nets, though.
-
-There are a number of ideas floating around for how to handle classes /
-modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn), so let's implement as much
-math as possible in global functions with no classes.  There are no models in
-the wish list that require more than a few vectors and matrices to parametrize.
-Global functions are more reusable than classes.
-
-
-Data access 
------------
-
-A general interface to datasets from the perspective of an experiment driver
-(e.g. kfold) is to see them as a function that maps an index (typically an integer)
-to an example (whose type and nature depend on the dataset; it could, for
-instance, be an (image, label) pair).  This interface permits iterating over
-the dataset, shuffling the dataset, and splitting it into folds.  For
-efficiency, it is nice if the dataset interface supports looking up several
-index values at once, because looking up many examples at once can sometimes
-be faster than looking each one up in turn. In particular, looking up
-a consecutive block of indices, or a slice, should be well supported.
-
-Some datasets may not support random access (e.g. a random number stream), and
-that's fine as long as an exception is raised. The user will see a NotImplementedError
-or similar, and try something else. We might want a way to test whether
-a dataset supports random access without having to load an example.
-
-
-A more intuitive interface for many datasets (or subsets) is to load them as
-matrices or lists of examples.  This format is more convenient to work with in
-an IPython shell, for example.  It is not good to provide only the "dataset
-as a function" view of a dataset.  Even if a dataset is very large, it is nice
-to have a standard way to get some representative examples in a convenient
-structure, to be able to play with them in IPython.
-
-
-Another thing to consider related to datasets is that there are a number of
-other efforts to provide standard ML datasets; we should be aware of them,
-and be compatible with them when it's easy:
- - mldata.org    (they have a file format, not sure how many use it)
- - weka          (ARFF file format)
- - scikits.learn 
- - hdf5 / pytables
-
-
-pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
-folder that is assumed to have a standard form across different installations.
-That's where the data files are.  The correct format of this folder is currently
-defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
-better to document in pylearn, as much as possible, what the contents of this
-folder should be.  It should be possible to rebuild this tree from information
-found in pylearn.
-
-Yoshua (about ideas proposed by Pascal Vincent a while ago): 
-
-  - we may want to distinguish between datasets and tasks: a task defines
-  not just the data but also things like what the input is and what the
-  target is (for supervised learning), and *importantly* a set of performance metrics
-  that make sense for this task (e.g. those used by papers solving a particular
-  task, or reported for a particular benchmark)
-
-  - we should discuss a few "standards" that datasets and tasks may comply with, such as
-    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
-      (with a convention for the semi-supervised case when only the input or only the target is observed)
-    - "input" for unsupervised learning
-    - conventions for missing-valued components inside input or target 
-    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
-    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
-    - how error metrics are specified
-        * example-level statistics (e.g. classification error)
-        * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
-
-
-Model Selection & Hyper-Parameter Optimization
-----------------------------------------------
-
-The goal here is to drive a long-running distributed computation that optimizes
-hyper-parameters using one or more clusters.
-Although there might be some library-type code to write here, I think of this
-more as an application template.  The user would use python code to describe
-the experiment to run and the hyper-parameter space to search.  Then this
-application-driver would take control of scheduling jobs and running them on
-various computers... I'm imagining a potentially ugly brute of a hack that's
-not necessarily something we will want to expose at a low level for reuse.
-
-Yoshua: We want both the library-defined driver that takes instructions about how to generate
-new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which 
-to sample them), and examples showing how to use it in typical cases.
-Note that sometimes we just want to find the best configuration of hyper-parameters,
-but sometimes we want to do more subtle analysis. Often a combination of both.
-In this respect it could be useful for the user to define hyper-parameters over
-which scientific questions are sought (e.g. depth of an architecture) vs
-hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate).
-This can influence both the sampling of configurations (we want to make sure that all
-combinations of question-driving hyper-parameters are covered) and the analysis
-of results (we may want to estimate ANOVAs, or compute averages or quantiles over
-the non-question-driving hyper-parameters).
-
-Python scripts for common ML algorithms
----------------------------------------
-
-The script aspect of this feature request makes me think that what would be
-good here is more tutorial-type scripts.  And the existing tutorials could
-potentially be rewritten to use some of the pylearn.nnet expressions.   More
-tutorials / demos would be great.
-
-Yoshua: agreed that we could write them as tutorials, but note how the
-spirit would be different from the current deep learning tutorials: we would
-use library code as much as possible, rather than flattening everything
-out in the interest of pedagogical simplicity. These tutorials should be
-meant to illustrate not the algorithms but *how to take advantage of the
-library*. They could also be used as *BLACK BOX* implementations
-by people who don't want to dig lower and just want to run experiments.
-
-Functional Specifications
-=========================
-
-TODO: 
-Put these into different text files so that this one does not become a monster.
-For each thing with a functional spec (e.g. datasets library, optimization library) make a
-separate file.
-
-
-
-pylearn.formulas
-----------------
-
-Directory with functions for building layers, calculating classification
-errors, cross-entropies with various distributions, free energies, etc.  This
-module would consist mostly of global functions, Theano Ops, and Theano
-optimizations.
-
-Yoshua: I would break it down into module files, e.g.:
-
-pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error, 
-abs. error, various sparsity penalties (L1, Student)
-
-pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA
-
-pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
-layers which could be plugged with various costs & penalties, and stacked
-
-pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants
-
-pylearn.formulas.noise: formulas for corruption processes
-
-pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling
-
-pylearn.formulas.trees: formulas for decision trees
-
-pylearn.formulas.boosting: formulas for boosting variants
-
-etc.
-
-Fred: It seems that the DeepANN git repository by Xavier G. already has part of this implemented as functions.
-
-Indexing Convention
-~~~~~~~~~~~~~~~~~~~
-
-Something to decide on: Fortran-style or C-style indexing.  Although we have
-often used C-style indexing in the past (for efficiency in C!), this is no
-longer an issue with numpy, because the physical layout is independent of the
-indexing order.  The fact remains that Fortran-style indexing follows linear
-algebra conventions, while C-style indexing does not.  If a global function
-includes a lot of math derivations, it would be *really* nice if the code used
-the same convention for the orientation of matrices, and endlessly annoying to
-have to transpose everything all the time.
-
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning/main_plan.txt	Wed Sep 01 12:18:08 2010 -0400
@@ -0,0 +1,241 @@
+
+Motivation
+==========
+
+Yoshua:
+-------
+
+We are missing a *Theano Machine Learning library*.
+
+The deep learning tutorials do a good job, but they lack the following features, which I would like to see in an ML library:
+
+ - a well-organized collection of Theano symbolic expressions (formulas) for handling most of
+   what is needed either in implementing existing well-known ML and deep learning algorithms or
+   in creating new variants (without having to start from scratch each time); in other words, the
+   mathematical core,
+
+ - a well-organized collection of python modules to help with the following:
+      - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
+      - generic utility code for optimization
+             - stochastic gradient descent variants
+             - early stopping variants
+             - interfacing to generic 2nd order optimization methods
+             - 2nd order methods tailored to work on minibatches
+             - optimizers for sparse coefficients / parameters
+      - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
+      - generic code for performance estimation and experimental statistics
+      - visualization tools (using existing python libraries) and examples for all of the above
+      - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them
+
+   [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.]
+
+ - a well-documented set of python scripts using the above library to show how to run the most
+   common ML algorithms (possibly with examples showing how to run multiple experiments with
+   many different models and collect statistical comparative results). This is particularly
+   important for getting pure users to adopt Theano for ML application work.
+
+Ideally, there would be one person in charge of this project, making sure a coherent and
+easy-to-read design is developed, along with many helping hands (to implement the various
+helper modules, formulae, and learning algorithms).
+
+
+James:
+-------
+
+I am interested in the design and implementation of the "well-organized collection of Theano
+symbolic expressions..."
+
+I would like to explore algorithms for hyper-parameter optimization, following up on some
+"high-throughput" work.  I'm most interested in the "generic code for model selection and
+hyper-parameter optimization..." and "generic code for performance estimation...".  
+
+I have some experience with the data-access requirements, and some lessons I'd like to share
+on that, but no time to work on that aspect of things.
+
+I will continue to contribute to the "well-documented set of python scripts using the above to
+showcase common ML algorithms...".  I have an Olshausen&Field-style sparse coding script that
+could be polished up.  I am also implementing the mcRBM and I'll be able to add that when it's
+done.
+
+
+
+Suggestions for how to tackle various desiderata
+================================================
+
+
+Theano Symbolic Expressions for ML
+----------------------------------
+
+We could make this a submodule of pylearn: ``pylearn.nnet``.  
+
+Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just 
+about neural nets, and that this is a collection of formulas (expressions), rather than
+completely self-contained classes for learners. We could have a "nnet.py" file for
+neural nets, though.
+
+There are a number of ideas floating around for how to handle classes /
+modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn), so let's implement as much
+math as possible in global functions with no classes.  There are no models in
+the wish list that require more than a few vectors and matrices to parametrize.
+Global functions are more reusable than classes.
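+
+For example, a "formula" could simply be a global function mapping Theano
+variables to a Theano expression.  A minimal sketch (the names are only
+illustrative, not a settled API)::
+
+    import theano.tensor as T
+
+    def logistic_layer(x, W, b):
+        """Affine transformation followed by a sigmoid non-linearity."""
+        return T.nnet.sigmoid(T.dot(x, W) + b)
+
+    def nll_multiclass(p_y_given_x, y):
+        """Mean negative log-likelihood of the correct class labels y."""
+        return -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])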
+
+
+Data access 
+-----------
+
+A general interface to datasets from the perspective of an experiment driver
+(e.g. kfold) is to see them as a function that maps an index (typically an integer)
+to an example (whose type and nature depend on the dataset; it could, for
+instance, be an (image, label) pair).  This interface permits iterating over
+the dataset, shuffling the dataset, and splitting it into folds.  For
+efficiency, it is nice if the dataset interface supports looking up several
+index values at once, because looking up many examples at once can sometimes
+be faster than looking each one up in turn. In particular, looking up
+a consecutive block of indices, or a slice, should be well supported.
+
+Some datasets may not support random access (e.g. a random number stream), and
+that's fine as long as an exception is raised. The user will see a NotImplementedError
+or similar, and try something else. We might want a way to test whether
+a dataset supports random access without having to load an example.
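+
+A rough sketch of what such an interface could look like (class and attribute
+names here are hypothetical, only meant to make the idea concrete)::
+
+    class Dataset(object):
+        """Minimal 'dataset as a function' interface."""
+
+        # lets callers check for random access without loading an example
+        random_access = True
+
+        def __getitem__(self, index):
+            """Map an int, a slice, or a sequence of ints to example(s)."""
+            raise NotImplementedError
+
+    class StreamDataset(Dataset):
+        """E.g. a random number stream: iteration only, no indexing."""
+
+        random_access = False
+
+        def __getitem__(self, index):
+            raise NotImplementedError("this dataset does not support random access")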
+
+
+A more intuitive interface for many datasets (or subsets) is to load them as
+matrices or lists of examples.  This format is more convenient to work with in
+an IPython shell, for example.  It is not good to provide only the "dataset
+as a function" view of a dataset.  Even if a dataset is very large, it is nice
+to have a standard way to get some representative examples in a convenient
+structure, to be able to play with them in IPython.
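+
+For interactive use, a couple of trivial helpers along these lines (purely
+illustrative, assuming the dataset supports len() and integer indexing) would
+go a long way::
+
+    import numpy
+
+    def head(dataset, n=10):
+        """Return the first n examples as a list, e.g. to poke at in IPython."""
+        return [dataset[i] for i in range(min(n, len(dataset)))]
+
+    def as_matrix(dataset, n=None):
+        """Stack the first n examples (or all of them) into a numpy array."""
+        n = len(dataset) if n is None else min(n, len(dataset))
+        return numpy.asarray([dataset[i] for i in range(n)])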
+
+
+Another thing to consider related to datasets is that there are a number of
+other efforts to provide standard ML datasets; we should be aware of them,
+and be compatible with them when it's easy:
+ - mldata.org    (they have a file format, not sure how many use it)
+ - weka          (ARFF file format)
+ - scikits.learn 
+ - hdf5 / pytables
+
+
+pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
+folder that is assumed to have a standard form across different installations.
+That's where the data files are.  The correct format of this folder is currently
+defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
+better to document in pylearn, as much as possible, what the contents of this
+folder should be.  It should be possible to rebuild this tree from information
+found in pylearn.
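+
+The lookup itself is simple; roughly (this is just the idea, not the actual
+pylearn code)::
+
+    import os
+
+    def data_root():
+        """Root of the standard data folder, e.g. /data/lisa/data at DIRO."""
+        return os.environ.get('DATA_ROOT', '/data/lisa/data')
+
+    # hypothetical dataset sub-folder
+    mnist_dir = os.path.join(data_root(), 'mnist')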
+
+Yoshua (about ideas proposed by Pascal Vincent a while ago): 
+
+  - we may want to distinguish between datasets and tasks: a task defines
+  not just the data but also things like what the input is and what the
+  target is (for supervised learning), and *importantly* a set of performance metrics
+  that make sense for this task (e.g. those used by papers solving a particular
+  task, or reported for a particular benchmark)
+
+  - we should discuss a few "standards" that datasets and tasks may comply with, such as the following (a small sketch is given after this list):
+    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
+      (with a convention for the semi-supervised case when only the input or only the target is observed)
+    - "input" for unsupervised learning
+    - conventions for missing-valued components inside input or target 
+    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
+    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
+    - how error metrics are specified
+        * example-level statistics (e.g. classification error)
+        * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
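+
+A sketch of what such a standard could look like in code (field and class
+names here are only a starting point for discussion)::
+
+    class Task(object):
+        """A dataset plus conventions: input/target fields and performance metrics."""
+
+        def __init__(self, dataset, input_field='input', target_field='target',
+                     metrics=None):
+            self.dataset = dataset
+            self.input_field = input_field
+            self.target_field = target_field
+            # e.g. {'classification_error': some_example_level_statistic};
+            # dataset-level statistics (ROC curve, ...) could be a second dict
+            self.metrics = metrics or {}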
+
+
+Model Selection & Hyper-Parameter Optimization
+----------------------------------------------
+
+The goal here is to drive a long-running distributed computation that optimizes
+hyper-parameters using one or more clusters.
+Although there might be some library-type code to write here, I think of this
+more as an application template.  The user would use python code to describe
+the experiment to run and the hyper-parameter space to search.  Then this
+application-driver would take control of scheduling jobs and running them on
+various computers... I'm imagining a potentially ugly brute of a hack that's
+not necessarily something we will want to expose at a low level for reuse.
+
+Yoshua: We want both the library-defined driver that takes instructions about how to generate
+new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which 
+to sample them), and examples showing how to use it in typical cases.
+Note that sometimes we just want to find the best configuration of hyper-parameters,
+but sometimes we want to do more subtle analysis. Often a combination of both.
+In this respect it could be useful for the user to define hyper-parameters over
+which scientific questions are sought (e.g. depth of an architecture) vs
+hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate).
+This can influence both the sampling of configurations (we want to make sure that all
+combinations of question-driving hyper-parameters are covered) and the analysis
+of results (we may want to estimate ANOVAs, or compute averages or quantiles over
+the non-question-driving hyper-parameters).
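+
+A sketch of the sampling side, with question-driving hyper-parameters
+enumerated exhaustively and the others drawn from a prior (all names are
+hypothetical)::
+
+    import itertools, random
+
+    def configurations(question_grid, nuisance_priors, n_samples_per_cell):
+        """Cross every combination of question-driving values with random
+        draws of the remaining hyper-parameters."""
+        names = sorted(question_grid)
+        for values in itertools.product(*(question_grid[n] for n in names)):
+            for _ in range(n_samples_per_cell):
+                config = dict(zip(names, values))
+                for name, sample in nuisance_priors.items():
+                    config[name] = sample()
+                yield config
+
+    # e.g. depth is a scientific question, the learning rate is marginalized over
+    configs = configurations({'depth': [1, 2, 3]},
+                             {'lr': lambda: 10 ** random.uniform(-4, -1)},
+                             n_samples_per_cell=5)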
+
+Python scripts for common ML algorithms
+---------------------------------------
+
+The script aspect of this feature request makes me think that what would be
+good here is more tutorial-type scripts.  And the existing tutorials could
+potentially be rewritten to use some of the pylearn.nnet expressions.   More
+tutorials / demos would be great.
+
+Yoshua: agreed that we could write them as tutorials, but note how the
+spirit would be different from the current deep learning tutorials: we would
+use library code as much as possible, rather than flattening everything
+out in the interest of pedagogical simplicity. These tutorials should be
+meant to illustrate not the algorithms but *how to take advantage of the
+library*. They could also be used as *BLACK BOX* implementations
+by people who don't want to dig lower and just want to run experiments.
+
+Functional Specifications
+=========================
+
+TODO: 
+Put these into different text files so that this one does not become a monster.
+For each thing with a functional spec (e.g. datasets library, optimization library) make a
+separate file.
+
+
+
+pylearn.formulas
+----------------
+
+Directory with functions for building layers, calculating classification
+errors, cross-entropies with various distributions, free energies, etc.  This
+module would consist mostly of global functions, Theano Ops, and Theano
+optimizations.
+
+Yoshua: I would break it down into module files, e.g.:
+
+pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error, 
+abs. error, various sparsity penalties (L1, Student)
+
+pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA
+
+pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
+layers which could be plugged with various costs & penalties, and stacked
+
+pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants
+
+pylearn.formulas.noise: formulas for corruption processes
+
+pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling
+
+pylearn.formulas.trees: formulas for decision trees
+
+pylearn.formulas.boosting: formulas for boosting variants
+
+etc.
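+
+To make the flavour of these modules concrete, pylearn.formulas.costs could
+start with something like this (signatures are only a proposal)::
+
+    import theano.tensor as T
+
+    def binary_crossentropy(output, target):
+        """Cross-entropy between predictions in (0, 1) and binary targets."""
+        return -T.mean(target * T.log(output) + (1 - target) * T.log(1 - output))
+
+    def squared_error(output, target):
+        """Mean squared error."""
+        return T.mean((output - target) ** 2)
+
+    def l1_penalty(params):
+        """L1 sparsity penalty over a list of parameter variables."""
+        return sum(abs(p).sum() for p in params)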
+
+Fred: It seems that the DeepANN git repository by Xavier G. already has part of this implemented as functions.
+
+Indexing Convention
+~~~~~~~~~~~~~~~~~~~
+
+Something to decide on: Fortran-style or C-style indexing.  Although we have
+often used C-style indexing in the past (for efficiency in C!), this is no
+longer an issue with numpy, because the physical layout is independent of the
+indexing order.  The fact remains that Fortran-style indexing follows linear
+algebra conventions, while C-style indexing does not.  If a global function
+includes a lot of math derivations, it would be *really* nice if the code used
+the same convention for the orientation of matrices, and endlessly annoying to
+have to transpose everything all the time.
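+
+To be concrete about the numpy point: storage order and indexing are decoupled,
+so the decision is purely notational, for example::
+
+    import numpy
+
+    a = numpy.zeros((3, 4), order='C')   # row-major storage
+    b = numpy.zeros((3, 4), order='F')   # column-major storage
+    # same indexing convention regardless of the underlying memory layout
+    assert a[1, 2] == b[1, 2] == 0.0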
+