diff doc/v2_planning.txt @ 949:d944e1c26a57

merge
author gdesjardins
date Mon, 16 Aug 2010 10:39:36 -0400
parents 216f4ce969b2
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/v2_planning.txt	Mon Aug 16 10:39:36 2010 -0400
@@ -0,0 +1,241 @@
+
+Motivation
+==========
+
+Yoshua:
+-------
+
+We are missing a *Theano Machine Learning library*.
+
+The deep learning tutorials do a good job, but they lack the following features, which I would like to see in an ML library:
+
+ - a well-organized collection of Theano symbolic expressions (formulas) for handling most of
+   what is needed either in implementing existing well-known ML and deep learning algorithms or
+   for creating new variants (without having to start from scratch each time), that is the
+   mathematical core,
+
+ - a well-organized collection of python modules to help with the following:
+      - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
+      - generic utility code for optimization
+             - stochastic gradient descent variants
+             - early stopping variants
+             - interfacing to generic 2nd order optimization methods
+             - 2nd order methods tailored to work on minibatches
+             - optimizers for sparse coefficients / parameters
+      - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
+      - generic code for performance estimation and experimental statistics
+      - visualization tools (using existing python libraries) and examples for all of the above
+      - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them
+
+   [Note that many of us already use some instance of all of the above, but each
+   of us tends to reinvent the wheel, and newcomers don't benefit from a shared
+   knowledge base.]
+
+ - a well-documented set of python scripts using the above library to show how to run the most
+   common ML algorithms (possibly with examples showing how to run multiple experiments with
+   many different models and collect statistical comparative results). This is
+   particularly important for getting pure users to adopt Theano for ML
+   application work.
+
+Ideally, there would be one person in charge of this project, making sure a coherent and
+easy-to-read design is developed, along with many helping hands (to implement the various
+helper modules, formulae, and learning algorithms).
+
+
+James:
+-------
+
+I am interested in the design and implementation of the "well-organized collection of Theano
+symbolic expressions..."
+
+I would like to explore algorithms for hyper-parameter optimization, following up on some
+"high-throughput" work.  I'm most interested in the "generic code for model selection and
+hyper-parameter optimization..." and "generic code for performance estimation...".  
+
+I have some experience with the data-access requirements, and some lessons I'd
+like to share on that, but no time to work on that aspect of things.
+
+I will continue to contribute to the "well-documented set of python scripts using the above to
+showcase common ML algorithms...".  I have an Olshausen & Field-style sparse coding script that
+could be polished up.  I am also implementing the mcRBM and I'll be able to add that when it's
+done.
+
+
+
+Suggestions for how to tackle various desiderata
+================================================
+
+
+Theano Symbolic Expressions for ML
+----------------------------------
+
+We could make this a submodule of pylearn: ``pylearn.nnet``.  
+
+Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just 
+about neural nets, and that this is a collection of formulas (expressions), rather than
+completely self-contained classes for learners. We could have a "nnet.py" file for
+neural nets, though.
+
+There are a number of ideas floating around for how to handle classes /
+modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn), so let's implement
+as much math as possible in global functions with no classes.  None of the
+models on the wish list require more than a few vectors and matrices to
+parametrize.
+Global functions are more reusable than classes.
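+
+As a strawman, such a formula could be as simple as the following (a minimal
+sketch: only Theano is assumed, and the function names are hypothetical, not
+existing pylearn API)::
+
+    import theano.tensor as T
+
+    def sigmoid_layer(x, W, b):
+        """Affine transform followed by a sigmoid: one reusable building block."""
+        return T.nnet.sigmoid(T.dot(x, W) + b)
+
+    def softmax_classifier(x, W, b):
+        """Class probabilities of a linear classifier."""
+        return T.nnet.softmax(T.dot(x, W) + b)
+
+Because these are plain functions of symbolic variables, they compose freely:
+stacking layers or swapping activation functions needs no class hierarchy.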
+
+
+Data access 
+-----------
+
+A general interface to datasets from the perspective of an experiment driver
+(e.g. kfold) is to see them as a function that maps an index (typically an
+integer) to an example (whose type and nature depend on the dataset; it could,
+for instance, be an (image, label) pair).  This interface permits iterating over
+the dataset, shuffling the dataset, and splitting it into folds.  For
+efficiency, it is nice if the dataset interface supports looking up several
+index values at once, because looking up many examples at once can sometimes
+be faster than looking each one up in turn. In particular, looking up
+a consecutive block of indices, or a slice, should be well supported.
+
+Some datasets may not support random access (e.g. a random number stream); that
+is fine as long as an exception is raised: the user sees a NotImplementedError
+or similar and tries something else. We might want a way to test whether a
+dataset supports random access without having to load an example.
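+
+A minimal sketch of this interface, with hypothetical class names (not
+existing pylearn code)::
+
+    import numpy
+
+    class MemoryDataset(object):
+        """Random-access dataset: maps an index, slice or index array to example(s)."""
+        def __init__(self, examples):
+            self.examples = numpy.asarray(examples)
+        def __len__(self):
+            return len(self.examples)
+        def __getitem__(self, idx):
+            # ints, slices and index arrays all work, so block lookup is cheap
+            return self.examples[idx]
+
+    class StreamDataset(object):
+        """Sequential-only dataset (e.g. a random number stream)."""
+        def __iter__(self):
+            while True:
+                yield numpy.random.uniform(size=10)
+        def __getitem__(self, idx):
+            raise NotImplementedError("no random access for this dataset")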
+
+
+A more intuitive interface for many datasets (or subsets) is to load them as
+matrices or lists of examples.  This format is more convenient to work with at
+an ipython shell, for example.  It is not good to provide only the "dataset
+as a function" view of a dataset.  Even if a dataset is very large, it is nice
+to have a standard way to get some representative examples in a convenient
+structure, to be able to play with them in ipython.
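+
+With slice support as in the MemoryDataset sketch above, that convenient view
+is a one-liner at the ipython prompt::
+
+    x = dataset[0:100]    # numpy array of the first 100 examples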
+
+
+Another thing to consider related to datasets is that there are a number of
+other efforts to have standard ML datasets, and we should be aware of them,
+and compatible with them when it's easy:
+ - mldata.org    (they have a file format, not sure how many use it)
+ - weka          (ARFF file format)
+ - scikits.learn 
+ - hdf5 / pytables
+
+
+pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
+folder that is assumed to have a standard form across different installations.
+That's where the data files are.  The correct format of this folder is currently
+defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
+better to document in pylearn, as much as possible, what the contents of this
+folder should be.  Ideally, it should be possible to rebuild this tree from
+information found in pylearn.
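+
+For illustration, locating a data file would then look like this (the
+subdirectory layout shown is hypothetical)::
+
+    import os
+
+    data_root = os.environ['DATA_ROOT']    # e.g. /data/lisa/data at DIRO
+    path = os.path.join(data_root, 'mnist', 'mnist.pkl.gz')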
+
+Yoshua (about ideas proposed by Pascal Vincent a while ago): 
+
+  - we may want to distinguish between datasets and tasks: a task defines
+  not just the data but also things like what the input is and what the
+  target is (for supervised learning), and *importantly* a set of performance metrics
+  that make sense for this task (e.g. those used by papers solving a particular
+  task, or reported for a particular benchmark)
+
+  - we should discuss a few "standards" that datasets and tasks may comply
+    to (a strawman sketch follows this list), such as
+    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
+      (with a convention for the semi-supervised case when only the input or only the target is observed)
+    - "input" for unsupervised learning
+    - conventions for missing-valued components inside input or target 
+    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
+    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
+    - how error metrics are specified
+        * example-level statistics (e.g. classification error)
+        * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
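+
+The strawman sketch mentioned above (all names hypothetical, not a committed
+design)::
+
+    class Task(object):
+        """Binds a dataset to input/target conventions and task-level metrics."""
+        def __init__(self, dataset, input_fields, target_fields, metrics):
+            self.dataset = dataset
+            self.input_fields = input_fields    # e.g. ('pixels',)
+            self.target_fields = target_fields  # e.g. ('label',); () if unsupervised
+            self.metrics = metrics              # e.g. {'error': zero_one_loss_fn}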
+
+
+Model Selection & Hyper-Parameter Optimization
+----------------------------------------------
+
+The goal here is to drive a long-running distributed computation that
+optimizes hyper-parameters on one or more clusters.
+Although there might be some library-type code to write here, I think of this
+more as an application template.  The user would use python code to describe
+the experiment to run and the hyper-parameter space to search.  Then this
+application-driver would take control of scheduling jobs and running them on
+various computers... I'm imagining a potentially ugly brute of a hack that's
+not necessarily something we will want to expose at a low-level for reuse.
+
+Yoshua: We want both the library-defined driver that takes instructions about how to generate
+new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which 
+to sample them), and examples showing how to use it in typical cases.
+Note that sometimes we just want to find the best configuration of hyper-parameters,
+but sometimes we want to do more subtle analysis. Often a combination of both.
+In this respect it could be useful for the user to define hyper-parameters over
+which scientific questions are sought (e.g. depth of an architecture) vs
+hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate).
+This can influence both the sampling of configurations (we want to make sure that all
+combinations of question-driving hyper-parameters are covered) and the analysis
+of results (we may want to estimate ANOVAs, or compute averages or quantiles
+over the non-question-driving hyper-parameters).
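+
+A sketch of that distinction in code (hypothetical names, standard library
+only): question-driving hyper-parameters are covered exhaustively, the others
+are drawn from a prior::
+
+    import itertools, random
+
+    question = {'depth': [1, 2, 3, 4]}               # cover every combination
+    nuisance = {'lr': lambda: 10 ** random.uniform(-4, -1),
+                'batch_size': lambda: random.choice([16, 32, 64, 128])}
+
+    def sample_configs(n_per_question=10):
+        for combo in itertools.product(*question.values()):
+            base = dict(zip(question.keys(), combo))
+            for _ in range(n_per_question):
+                cfg = dict(base)
+                cfg.update((name, draw()) for name, draw in nuisance.items())
+                yield cfg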
+
+Python scripts for common ML algorithms
+---------------------------------------
+
+The script aspect of this feature request makes me think that what would be
+good here is more tutorial-type scripts.  And the existing tutorials could
+potentially be rewritten to use some of the pylearn.nnet expressions.   More
+tutorials / demos would be great.
+
+Yoshua: agreed that we could write them as tutorials, but note how the
+spirit would be different from the current deep learning tutorials: we would
+not mind using library code as much as possible, instead of flattening
+everything out in the interest of pedagogical simplicity.  These tutorials
+should be meant to illustrate not the algorithms but *how to take
+advantage of the library*. They could also be used as *BLACK BOX* implementations
+by people who don't want to dig lower and just want to run experiments.
+
+Functional Specifications
+=========================
+
+TODO: 
+Put these into different text files so that this one does not become a monster.
+For each thing with a functional spec (e.g. datasets library, optimization library) make a
+separate file.
+
+
+
+pylearn.formulas
+----------------
+
+Directory with functions for building layers, calculating classification
+errors, cross-entropies with various distributions, free energies, etc.  This
+module would consist mostly of global functions, Theano Ops, and Theano
+optimizations.
+
+Yoshua: I would break it down into module files, e.g.:
+
+pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error, 
+abs. error, various sparsity penalties (L1, Student)
+
+pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA
+
+pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
+layers which could be plugged with various costs & penalties, and stacked
+
+pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants
+
+pylearn.formulas.noise: formulas for corruption processes
+
+pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling
+
+pylearn.formulas.trees: formulas for decision trees
+
+pylearn.formulas.boosting: formulas for boosting variants
+
+etc.
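+
+For concreteness, pylearn.formulas.costs might start out like this (a sketch;
+only Theano is assumed)::
+
+    import theano.tensor as T
+
+    def cross_entropy(output, target):
+        """Mean binary cross-entropy; `output` holds probabilities in (0, 1)."""
+        return T.mean(T.nnet.binary_crossentropy(output, target))
+
+    def squared_error(output, target):
+        return T.mean((output - target) ** 2)
+
+    def l1_penalty(param):
+        return T.sum(abs(param))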
+
+Fred: It seems that the DeepANN git repository by Xavier G. has part of this
+implemented as functions.
+
+Indexing Convention
+~~~~~~~~~~~~~~~~~~~
+
+Something to decide on: Fortran-style or C-style indexing.  Although we have
+often used C-style indexing in the past (for efficiency in C!), this is no
+longer an issue with numpy, because the physical layout is independent of the
+indexing order.  The fact remains that Fortran-style indexing follows linear
+algebra conventions, while C-style indexing does not.  If a global function
+includes a lot of math derivations, it would be *really* nice if the code used
+the same convention for the orientation of matrices, and endlessly annoying to
+have to transpose everything all the time.
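+
+To make the annoyance concrete, here is the same affine map under each
+convention (a sketch, Theano only)::
+
+    import theano.tensor as T
+
+    x = T.matrix('x'); W = T.matrix('W'); b = T.vector('b')
+
+    # rows are examples (C-style habit): x is (n_examples, n_in), W is (n_in, n_out)
+    h_rows = T.dot(x, W) + b
+
+    # columns are examples (linear-algebra convention): x is (n_in, n_examples),
+    # W is (n_out, n_in), and b must broadcast across columns
+    h_cols = T.dot(W, x) + b.dimshuffle(0, 'x')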
+