view doc/v2_planning/main_plan.txt @ 1047:1b61cbe0810b

A very rough draft of ideas, to kick-start things
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Wed, 08 Sep 2010 14:13:43 -0400
parents 2e515be92a0e
children bc246542d6ff


Motivation
==========

Yoshua (points discussed Thursday Sept 2, 2010 at LISA tea-talk)
----------------------------------------------------------------

****** Why we need to get better organized in our code-writing ******

- current state of affairs on top of Theano is anarchic and does not lend itself to easy code re-use
- the lab is growing and will continue to grow significantly, and more people outside the lab are using Theano
- we have new industrial partners and funding sources that demand deliverables, and more/better collectively organized efforts

*** Who can take advantage of this ***

- us, directly, taking advantage of the different advances made by different researchers in the lab to yield better models
- us, easier to compare different models and different datasets with different metrics on different computing platforms available to us
- future us, new students, able to quickly move into 'production' mode without having to reinvent the wheel 
- students in the two ML classes, able to play with the library to explore new ML variants
- other ML researchers in academia, able to play with our algorithms, try new variants, cite our papers
- non-ML users in or out of academia, and our user-partners


*** Move with care ***

- Write down use-cases, examples for each type of module, do not try to be TOO general
- Want to keep ease of exploring and flexibility, not create a prison
- Too many constraints can lead to paralysis, especially in a C++-style object-oriented model
- Too few guidelines lead to code components that are not interchangeable
- Poor coding practices lead to buggy, spaghetti code

*** What ***

- define standards
- write up a few instances of each basic type (dataset, learner, optimizer, hyper-parameter exploration boilerplate, etc.), enough to implement some of the basic algorithms we use often (e.g. those in the tutorials)
- let the library grow according to our needs 
- keep tight reins on it to control quality 

*** Content and Form ***

We need to establish guidelines and conventions for 

 * Content: what are the re-usable components? define conventions or API for each, make sure they fit with each other
 * Form: social engineering, coding practices and conventions, code review, incentives

Yoshua:
-------

We are missing a *Theano Machine Learning library*.

The deep learning tutorials do a good job but they lack the following features, which I would like to see in an ML library:

 - a well-organized collection of Theano symbolic expressions (formulas) for handling most of
   what is needed either in implementing existing well-known ML and deep learning algorithms or
   for creating new variants (without having to start from scratch each time), that is the
   mathematical core,

 - a well-organized collection of python modules to help with the following:
      - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
      - generic utility code for optimization
             - stochastic gradient descent variants
             - early stopping variants
             - interfacing to generic 2nd order optimization methods
             - 2nd order methods tailored to work on minibatches
             - optimizers for sparse coefficients / parameters
      - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
      - generic code for performance estimation and experimental statistics
      - visualization tools (using existing python libraries) and examples for all of the above
      - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them

   [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.]

 - a well-documented set of python scripts using the above library to show how to run the most
   common ML algorithms (possibly with examples showing how to run multiple experiments with
   many different models and collect statistical comparative results). This is particularly
   important for pure users to adopt Theano in their ML application work.

Ideally, there would be one person in charge of this project, making sure a coherent and
easy-to-read design is developed, along with many helping hands (to implement the various
helper modules, formulae, and learning algorithms).


James:
-------

I am interested in the design and implementation of the "well-organized collection of Theano
symbolic expressions..."

I would like to explore algorithms for hyper-parameter optimization, following up on some
"high-throughput" work.  I'm most interested in the "generic code for model selection and
hyper-parameter optimization..." and "generic code for performance estimation...".  

I have some experience with the data-access requirements, and some lessons I'd like to share
on that, but no time to work on that aspect of things.

I will continue to contribute to the "well-documented set of python scripts using the above to
showcase common ML algorithms...".  I have an Olshausen&Field-style sparse coding script that
could be polished up.  I am also implementing the mcRBM and I'll be able to add that when it's
done.



Suggestions for how to tackle various desiderata
================================================


Theano Symbolic Expressions for ML
----------------------------------

We could make this a submodule of pylearn: ``pylearn.nnet``.  

Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just 
about neural nets, and that this is a collection of formulas (expressions), rather than
completely self-contained classes for learners. We could have a "nnet.py" file for
neural nets, though.

There are a number of ideas floating around for how to handle classes /
modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn) so let's implement as much
math as possible in global functions with no classes.  There are no models in
the wish list that require more than a few vectors and matrices to parametrize.
Global functions are more reusable than classes.
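
A minimal sketch of what "global functions with no classes" could look like.
It is written with NumPy rather than Theano so that it runs standalone, and
the names (affine, tanh_layer, softmax) are hypothetical, not existing pylearn
API.  It also assumes examples are stored as rows, which is not a decision
this document has made.

```python
import numpy as np

def affine(x, W, b):
    """Affine map for a minibatch x of shape (n_examples, n_in)."""
    return np.dot(x, W) + b

def tanh_layer(x, W, b):
    """One hidden layer; the caller owns W and b, there is no object state."""
    return np.tanh(affine(x, W, b))

def softmax(a):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
x = rng.randn(4, 3)                       # 4 examples, 3 inputs each
W1, b1 = rng.randn(3, 5), np.zeros(5)     # hidden layer parameters
W2, b2 = rng.randn(5, 2), np.zeros(2)     # output layer parameters

# A two-layer classifier is just a composition of the functions above.
p = softmax(affine(tanh_layer(x, W1, b1), W2, b2))
```

Because the parameters are plain arguments, the same functions can be reused
by any of the class / module designs listed above.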


Data access 
-----------

A general interface to datasets from the perspective of an experiment driver
(e.g. kfold) is to see them as a function that maps index (typically integer)
to example (whose type and nature depends on the dataset, it could for
instance be an (image, label) pair).  This interface permits iterating over
the dataset, shuffling the dataset, and splitting it into folds.  For
efficiency, it is nice if the dataset interface supports looking up several
index values at once, because looking up many examples at once can sometimes
be faster than looking each one up in turn. In particular, looking up
a consecutive block of indices, or a slice, should be well supported.

Some datasets may not support random access (e.g. a random number stream) and
that's fine if an exception is raised. The user will see a NotImplementedError
or similar, and try something else. We might want to have a way to test
whether a dataset is random-access without having to load an example.
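
A rough sketch of the two cases described above, under assumed names: the
classes ArrayDataset and StreamDataset, the random_access attribute and the
kfold_indices helper are all hypothetical, chosen only for illustration.

```python
import numpy as np

class ArrayDataset(object):
    """Hypothetical random-access dataset of (input, label) pairs."""
    random_access = True   # can be checked without loading an example

    def __init__(self, inputs, labels):
        self.inputs = np.asarray(inputs)
        self.labels = np.asarray(labels)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # index may be an int, a slice, or an array of ints;
        # NumPy serves all three with a single lookup.
        return self.inputs[index], self.labels[index]

class StreamDataset(object):
    """Hypothetical sequential-only dataset, e.g. a random number stream."""
    random_access = False

    def __init__(self, seed=0):
        self.rng = np.random.RandomState(seed)

    def __iter__(self):
        return self

    def __next__(self):
        return self.rng.rand()

    next = __next__        # Python 2 iterator protocol

    def __getitem__(self, index):
        raise NotImplementedError("this dataset does not support random access")

def kfold_indices(n_examples, n_folds, seed=0):
    """Shuffle the index range and split it into n_folds index arrays."""
    perm = np.random.RandomState(seed).permutation(n_examples)
    return np.array_split(perm, n_folds)

data = ArrayDataset(np.arange(20).reshape(10, 2), np.arange(10) % 2)
single = data[3]                            # one example
block = data[2:5]                           # a consecutive block of indices
folds = kfold_indices(len(data), 5)
fold_inputs, fold_labels = data[folds[0]]   # several indices at once
```

An experiment driver such as kfold only needs len() and indexing, so it can
check random_access up front and fall back to plain iteration otherwise.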


A more intuitive interface for many datasets (or subsets) is to load them as
matrices or lists of examples.  This format is more convenient to work with at
an ipython shell, for example.  It is not good to provide only the "dataset
as a function" view of a dataset.  Even if a dataset is very large, it is nice
to have a standard way to get some representative examples in a convenient
structure, to be able to play with them in ipython.


Another thing to consider related to datasets is that there are a number of
other efforts to have standard ML datasets, and we should be aware of them,
and compatible with them when it's easy:
 - mldata.org    (they have a file format, not sure how many use it)
 - weka          (ARFF file format)
 - scikits.learn 
 - hdf5 / pytables


pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
folder that is assumed to have a standard form across different installations.
That's where the data files are.  The correct format of this folder is currently
defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
better to document in pylearn what the contents of this folder should be as
much as possible.  It should be possible to rebuild this tree from information
found in pylearn.
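
One way to make the folder layout explicit, sketched under assumptions: the
helper names and the EXPECTED_FILES manifest below are hypothetical (including
the example path in it), and only the DATA_ROOT variable comes from the text
above.

```python
import os

def data_root():
    """Return the dataset folder named by the DATA_ROOT environment variable."""
    root = os.environ.get("DATA_ROOT")
    if root is None:
        raise EnvironmentError("DATA_ROOT is not set")
    return root

def dataset_path(*parts):
    """Join a relative dataset location onto the data root."""
    return os.path.join(data_root(), *parts)

# Hypothetical manifest: listing the expected tree in code documents it and
# makes it possible to check an installation.  The entry is only an example.
EXPECTED_FILES = [("mnist", "mnist.pkl.gz")]

def missing_files():
    """Return the manifest entries that are absent under the data root."""
    return [p for p in EXPECTED_FILES if not os.path.exists(dataset_path(*p))]
```

A manifest of this kind could later be extended with download locations or
checksums, so that the tree can be rebuilt from information found in pylearn.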

Yoshua (about ideas proposed by Pascal Vincent a while ago): 

  - we may want to distinguish between datasets and tasks: a task defines
  not just the data but also things like what is the input and what is the
  target (for supervised learning), and *importantly* a set of performance metrics
  that make sense for this task (e.g. those used by papers solving a particular
  task, or reported for a particular benchmark)

  - we should discuss a few "standards" that datasets and tasks may comply with, such as
    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
      (with a convention for the semi-supervised case when only the input or only the target is observed)
    - "input" for unsupervised learning
    - conventions for missing-valued components inside input or target 
    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
    - how error metrics are specified
        * example-level statistics (e.g. classification error)
        * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
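
To make the dataset / task distinction concrete, here is a small sketch.
Everything in it is an assumption made for illustration: the Task class, the
use of dicts with "input" and "target" keys, and the convention that an
unobserved target is stored as None and skipped when scoring.

```python
import numpy as np

def classification_error(prediction, target):
    """Example-level statistic: 1 where the prediction is wrong, else 0."""
    return (np.asarray(prediction) != np.asarray(target)).astype(float)

def mean_and_stderr(per_example_values):
    """Dataset-level statistic: mean and standard error of the mean."""
    v = np.asarray(per_example_values, dtype=float)
    return v.mean(), v.std(ddof=1) / np.sqrt(len(v))

class Task(object):
    """Hypothetical task: examples with named fields, plus the metrics
    that make sense for them."""
    def __init__(self, examples, example_metrics, dataset_metrics):
        self.examples = examples                # list of dicts
        self.example_metrics = example_metrics  # name -> f(predictions, targets)
        self.dataset_metrics = dataset_metrics  # name -> f(per-example values)

    def evaluate(self, predict):
        # Assumed convention: examples with a missing target are not scored.
        labeled = [e for e in self.examples if e["target"] is not None]
        preds = [predict(e["input"]) for e in labeled]
        targets = [e["target"] for e in labeled]
        results = {}
        for name, metric in self.example_metrics.items():
            values = metric(preds, targets)
            for dname, dmetric in self.dataset_metrics.items():
                results[(name, dname)] = dmetric(values)
        return results

examples = [
    {"input": 0.2, "target": 0},
    {"input": 0.9, "target": 1},
    {"input": 0.6, "target": 0},
    {"input": 0.4, "target": None},   # unlabeled (semi-supervised case)
]
task = Task(examples,
            {"classification_error": classification_error},
            {"mean_and_stderr": mean_and_stderr})
scores = task.evaluate(lambda x: int(x > 0.5))
```

The same examples could back several tasks, each with its own choice of
target field and metrics.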


Model Selection & Hyper-Parameter Optimization
----------------------------------------------

Driving a distributed computing job for a long time to optimize
hyper-parameters using one or more clusters is the goal here.
Although there might be some library-type code to write here, I think of this
more as an application template.  The user would use python code to describe
the experiment to run and the hyper-parameter space to search.  Then this
application-driver would take control of scheduling jobs and running them on
various computers... I'm imagining a potentially ugly brute of a hack that's
not necessarily something we will want to expose at a low-level for reuse.

Yoshua: We want both the library-defined driver that takes instructions about how to generate
new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which 
to sample them), and examples showing how to use it in typical cases.
Note that sometimes we just want to find the best configuration of hyper-parameters,
but sometimes we want to do more subtle analysis. Often a combination of both.
In this respect it could be useful for the user to distinguish the hyper-parameters
that a scientific question is about (e.g. depth of an architecture) from the
hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate).
This can influence both the sampling of configurations (we want to make sure that all
combinations of question-driving hyper-parameters are covered) and the analysis
of results (we may want to compute ANOVAs, averages, or quantiles over
the non-question-driving hyper-parameters).
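
A sketch of the sampling and analysis split described above.  The function
names, the dict-based configuration format and the log-uniform prior on the
learning rate are assumptions made for illustration; scheduling the jobs
(e.g. with jobman) is left out entirely.

```python
import itertools
import numpy as np

def sample_configurations(question_grid, nuisance_priors, n_samples, seed=0):
    """Cover every combination of question-driving hyper-parameters, and
    draw n_samples settings of the remaining ones from their priors."""
    rng = np.random.RandomState(seed)
    names = sorted(question_grid)
    configs = []
    for values in itertools.product(*(question_grid[n] for n in names)):
        for _ in range(n_samples):
            config = dict(zip(names, values))
            for name, prior in sorted(nuisance_priors.items()):
                config[name] = prior(rng)
            configs.append(config)
    return configs

def best_per_question(configs, errors, question_name):
    """Maximize over the nuisance hyper-parameters: keep the lowest error
    observed for each value of the question-driving hyper-parameter."""
    best = {}
    for config, err in zip(configs, errors):
        key = config[question_name]
        best[key] = min(err, best.get(key, float("inf")))
    return best

question_grid = {"depth": [1, 2, 3]}
nuisance_priors = {
    # log-uniform prior over [1e-4, 1e-1]
    "learning_rate": lambda rng: float(10 ** rng.uniform(-4, -1)),
}
configs = sample_configurations(question_grid, nuisance_priors, n_samples=4)
```

Replacing min with a mean or a quantile in best_per_question gives the
marginalizing variants of the analysis.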

Python scripts for common ML algorithms
---------------------------------------

The script aspect of this feature request makes me think that what would be
good here is more tutorial-type scripts.  And the existing tutorials could
potentially be rewritten to use some of the pylearn.nnet expressions.   More
tutorials / demos would be great.

Yoshua: agreed that we could write them as tutorials, but note how the
spirit would be different from the current deep learning tutorials: we would
not mind using library code as much as possible instead of trying to flatten
out everything in the interest of pedagogical simplicity. Instead, these
tutorials should be meant to illustrate not the algorithms but *how to take
advantage of the library*. They could also be used as *BLACK BOX* implementations
by people who don't want to dig lower and just want to run experiments.

Functional Specifications
=========================

TODO: 
Put these into different text files so that this one does not become a monster.
For each thing with a functional spec (e.g. datasets library, optimization library) make a
separate file.



pylearn.formulas
----------------

Directory with functions for building layers, calculating classification
errors, cross-entropies with various distributions, free energies, etc.  This
module would include for the most part global functions, Theano Ops and Theano
optimizations.

Yoshua: I would break it down into module files, e.g.:

pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error, 
abs. error, various sparsity penalties (L1, Student)

pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA

pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
layers which could be plugged with various costs & penalties, and stacked

pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants

pylearn.formulas.noise: formulas for corruption processes

pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling

pylearn.formulas.trees: formulas for decision trees

pylearn.formulas.boosting: formulas for boosting variants

etc.

Fred: It seems that the DeepANN git repository by Xavier G. already has part of this as functions.
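
As an illustration of the kind of content a module like pylearn.formulas.costs
might hold, here is a short sketch.  It uses NumPy instead of Theano so that
it runs standalone, and the function names and signatures are hypothetical.

```python
import numpy as np

def squared_error(output, target):
    """Per-example squared error, summed over output dimensions."""
    return ((output - target) ** 2).sum(axis=1)

def binary_cross_entropy(output, target, eps=1e-12):
    """Per-example cross-entropy for outputs interpreted as probabilities."""
    output = np.clip(output, eps, 1 - eps)   # avoid log(0)
    return -(target * np.log(output)
             + (1 - target) * np.log(1 - output)).sum(axis=1)

def l1_penalty(weights):
    """L1 sparsity penalty on a parameter array."""
    return np.abs(weights).sum()
```

Each function returns per-example values where that makes sense, leaving the
choice of averaging or summing over the minibatch to the caller.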

Indexing Convention
~~~~~~~~~~~~~~~~~~~

Something to decide on: Fortran-style or C-style indexing.  Although we have
often used C-style indexing in the past (for efficiency in C!) this is no
longer an issue with numpy because the physical layout is independent of the
indexing order.  The fact remains that Fortran-style indexing follows linear
algebra conventions, while c-style indexing does not.  If a global function
includes a lot of math derivations, it would be *really* nice if the code used
the same convention for the orientation of matrices, and endlessly annoying to
have to be always transposing everything.
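
A small NumPy illustration of both points.  It assumes that the choice being
discussed is whether examples are stored as columns (so that code reads like
the usual y = W x) or as rows; that reading is an interpretation, not
something this document states.

```python
import numpy as np

a_c = np.arange(6).reshape(2, 3)      # C (row-major) physical layout
a_f = np.asfortranarray(a_c)          # same values, column-major layout

# Indexing is identical for both arrays; only the strides differ.
same_values = bool((a_c == a_f).all())
layouts = (a_c.flags["C_CONTIGUOUS"], a_f.flags["F_CONTIGUOUS"])

# Linear-algebra convention: examples are columns, so y = W x.
W = np.arange(6.0).reshape(2, 3)      # maps 3 inputs to 2 outputs
X_cols = np.ones((3, 4))              # 4 examples stored as columns
Y_cols = np.dot(W, X_cols)            # shape (2, 4)

# Examples-as-rows convention: the same computation needs transposes.
X_rows = X_cols.T                     # 4 examples stored as rows
Y_rows = np.dot(X_rows, W.T)          # shape (4, 2)
```

Both conventions compute the same numbers, so the decision is about
readability of the formulas rather than speed.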