
--- a/doc/v2_planning.txt	Wed Aug 11 13:16:51 2010 -0400
+++ b/doc/v2_planning.txt	Wed Aug 11 14:35:57 2010 -0400
@@ -63,11 +63,111 @@
 ================================================


+Theano Symbolic Expressions for ML
+----------------------------------
+
+We could make this a submodule of pylearn: ``pylearn.nnet``.
+
+There are a number of ideas floating around for how to handle classes /
+modules (LeDeepNet, pylearn.shared.layers, pynnet) so let's implement as much
+math as possible in global functions with no classes.  There are no models in
+the wish list that require more than a few vectors and matrices to parametrize.
+Global functions are more reusable than classes.
+
+
+Data access
+-----------
+
+A general interface to datasets from the perspective of an experiment driver
+(e.g. kfold) is to see them as a function that maps index (typically integer)
+to example (whose type and nature depend on the dataset; for instance, it could
+be an (image, label) pair).  This interface permits iterating over
+the dataset, shuffling the dataset, and splitting it into folds.  For
+efficiency, it is nice if the dataset interface supports looking up several
+index values at once, because looking up many examples at once can sometimes
+be faster than looking each one up in turn.
+
+Some datasets may not support random access (e.g. a random number stream) and
+that's fine as long as an exception is raised. The user will see a
+NotImplementedError or similar, and try something else.
+
+
+A more intuitive interface for many datasets (or subsets) is to load them as
+matrices or lists of examples.  This format is more convenient to work with at
+an ipython shell, for example.  It is not good to provide only the "dataset
+as a function" view of a dataset.  Even if a dataset is very large, it is nice
+to have a standard way to get some representative examples in a convenient
+structure, to be able to play with them in ipython.
+
+
+Another thing to consider related to datasets is that there are a number of
+other efforts to have standard ML datasets, and we should be aware of them,
+and compatible with them when it's easy:
+ - mldata.org    (they have a file format, not sure how many use it)
+ - weka          (ARFF file format)
+ - scikits.learn
+ - hdf5 / pytables
+
+
+pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
+folder that is assumed to have a standard form across different installations.
+That's where the data files are.  The correct format of this folder is currently
+defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
+better to document in pylearn what the contents of this folder should be as
+much as possible.  It should be possible to rebuild this tree from information
+found in pylearn.
+
+
+
+Model Selection & Hyper-Parameter Optimization
+----------------------------------------------
+
+Driving a distributed computing job for a long time to optimize
+hyper-parameters using one or more clusters is the goal here.
+Although there might be some library-type code to write here, I think of this
+more as an application template.  The user would use python code to describe
+the experiment to run and the hyper-parameter space to search.  Then this
+application-driver would take control of scheduling jobs and running them on
+various computers... I'm imagining a potentially ugly brute of a hack that's
+not necessarily something we will want to expose at a low-level for reuse.
+
+
+Python scripts for common ML algorithms
+---------------------------------------
+
+The script aspect of this feature request makes me think that what would be
+good here is more tutorial-type scripts.  And the existing tutorials could
+potentially be rewritten to use some of the pylearn.nnet expressions.   More
+tutorials / demos would be great.
+

 Functional Specifications
 =========================

+TODO:
 Put these into different text files so that this one does not become a monster.
 For each thing with a functional spec (e.g. datasets library, optimization library) make a
 separate file.

+
+
+pylearn.nnet
+------------
+
+Submodule with functions for building layers, calculating classification
+errors, cross-entropies with various distributions, free energies.  This
+module would include for the most part global functions, Theano Ops and Theano
+optimizations.
+
+Indexing Convention
+~~~~~~~~~~~~~~~~~~~
+
+Something to decide on - Fortran-style or C-style indexing.  Although we have
+often used C-style indexing in the past (for efficiency in C!) this is no
+longer an issue with numpy because the physical layout is independent of the
+indexing order.  The fact remains that Fortran-style indexing follows linear
+algebra conventions, while C-style indexing does not.  If a global function
+includes a lot of math derivations, it would be *really* nice if the code used
+the same convention for the orientation of matrices, and endlessly annoying to
+have to be always transposing everything.
+
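A minimal sketch of the numpy point made under "Indexing Convention": the same
logical matrix can be stored row-major or column-major without changing how it
is indexed.  The variable names are illustrative assumptions and not part of
pylearn; only standard numpy calls are used.

```python
import numpy as np

# The same logical 2x3 matrix stored with two different physical layouts.
a_c = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64, order='C')  # row-major
a_f = np.asfortranarray(a_c)                                         # column-major copy

# Indexing is identical: a[i, j] means row i, column j in both cases.
same_values = np.array_equal(a_c, a_f)
same_element = a_c[1, 2] == a_f[1, 2]

# Only the strides (bytes to step along each axis) differ.
strides_c = a_c.strides
strides_f = a_f.strides

# Transposing returns a view with swapped strides; no data is copied.
a_t = a_c.T
transpose_is_view = np.shares_memory(a_c, a_t)
```

Because a transpose is only a change of strides, the choice of convention
is mostly about how readable the code is next to the math, not about
memory layout.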