comparison doc/v2_planning.txt @ 946:7c4504a4ce1a

additions to formulas, data access, hyper-params, scripts
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 11 Aug 2010 21:32:31 -0400
parents cafa16bfc7df
children 216f4ce969b2
Theano Symbolic Expressions for ML
----------------------------------

We could make this a submodule of pylearn: ``pylearn.nnet``.

Yoshua: I would use a different name, e.g., "pylearn.formulas", to emphasize that it is not just
about neural nets, and that this is a collection of formulas (expressions), rather than
completely self-contained classes for learners. We could have a "nnet.py" file for
neural nets, though.

There are a number of ideas floating around for how to handle classes /
modules (LeDeepNet, pylearn.shared.layers, pynnet), so let's implement as much
math as possible in global functions with no classes. There are no models in
the wish list that require more than a few vectors and matrices to parametrize.
Global functions are more reusable than classes.
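
For instance, a layer and its cost could be written as plain functions over Theano
variables. A minimal sketch of that style; the function names and signatures are
only illustrations, not a proposed API::

    # Math as global functions over Theano symbolic variables (names are
    # illustrative assumptions, not an agreed pylearn interface).
    import theano.tensor as T

    def sigmoid_layer(x, W, b):
        """Symbolic output of a sigmoid hidden layer: sigmoid(x W + b)."""
        return T.nnet.sigmoid(T.dot(x, W) + b)

    def softmax_layer(x, W, b):
        """Symbolic class probabilities of a softmax output layer."""
        return T.nnet.softmax(T.dot(x, W) + b)

    def nll(probs, targets):
        """Mean negative log-likelihood of integer `targets` under `probs`."""
        return -T.mean(T.log(probs)[T.arange(targets.shape[0]), targets])

    # Callers own the parameters (e.g. shared variables) and just compose
    # expressions:
    #   p_y = softmax_layer(sigmoid_layer(x, W1, b1), W2, b2)
    #   cost = nll(p_y, y)
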
[...]

to example (whose type and nature depends on the dataset; it could for
instance be an (image, label) pair). This interface permits iterating over
the dataset, shuffling the dataset, and splitting it into folds. For
efficiency, it is nice if the dataset interface supports looking up several
index values at once, because looking up many examples at once can sometimes
be faster than looking each one up in turn. In particular, looking up
a consecutive block of indices, or a slice, should be well supported.

Some datasets may not support random access (e.g. a random number stream), and
that's fine if an exception is raised. The user will see a NotImplementedError
or similar, and try something else. We might want a way to test whether a
dataset supports random access without having to load an example.

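A rough sketch of the indexing behaviour discussed above; the class names, the
``supports_random_access`` flag and the methods are hypothetical, not an agreed
pylearn interface::

    class MemoryDataset(object):
        """Dataset backed by an in-memory list of examples."""

        supports_random_access = True   # cheap capability check, no example loaded

        def __init__(self, examples):
            self.examples = list(examples)

        def __len__(self):
            return len(self.examples)

        def __getitem__(self, idx):
            # Accept a single index, a slice (consecutive block), or a list of
            # indices, so callers can fetch many examples in one call.
            if isinstance(idx, (int, slice)):
                return self.examples[idx]
            return [self.examples[i] for i in idx]

    class StreamDataset(object):
        """Dataset that can only be iterated over (e.g. a random number stream)."""

        supports_random_access = False

        def __getitem__(self, idx):
            # No random access: callers catch this and fall back to iteration.
            raise NotImplementedError("this dataset does not support random access")
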
A more intuitive interface for many datasets (or subsets) is to load them as
matrices or lists of examples. This format is more convenient to work with in
an ipython shell, for example. It is not good to provide only the "dataset
[...]

defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
better to document in pylearn what the contents of this folder should be as
much as possible. It should be possible to rebuild this tree from information
found in pylearn.

Yoshua (about ideas proposed by Pascal Vincent a while ago):

- we may want to distinguish between datasets and tasks: a task defines
  not just the data but also things like what is the input and what is the
  target (for supervised learning), and *importantly* a set of performance metrics
  that make sense for this task (e.g. those used by papers solving a particular
  task, or reported for a particular benchmark)

- we should discuss a few "standards" that datasets and tasks may conform to,
  such as the following (a rough sketch is given after this list):

  - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
    (with a convention for the semi-supervised case when only the input or only the target is observed)
  - "input" for unsupervised learning
  - conventions for missing-valued components inside input or target
  - how examples that are sequences are treated (e.g. the input or the target is a sequence)
  - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
  - how error metrics are specified
    * example-level statistics (e.g. classification error)
    * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
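
A rough sketch of what such a task/dataset convention could look like; the ``Task``
class, the field names and the metric are illustrative assumptions, not an agreed
interface::

    import numpy as np

    def classification_error(predictions, targets):
        """Example-level metric: fraction of misclassified examples."""
        return float(np.mean(np.asarray(predictions) != np.asarray(targets)))

    class Task(object):
        """Bundles a dataset with input/target conventions and its metrics."""

        def __init__(self, name, examples, metrics):
            self.name = name
            self.examples = examples    # each example is a dict, see below
            self.metrics = metrics      # dict: metric name -> callable

    # Each example carries "input" and "target" fields; target=None marks the
    # unlabeled part of a semi-supervised dataset.
    examples = [
        {"input": np.array([0.2, 0.7]), "target": 1},
        {"input": np.array([0.9, 0.1]), "target": None},   # unlabeled
    ]
    task = Task("toy_classification", examples,
                metrics={"classification_error": classification_error})
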


Model Selection & Hyper-Parameter Optimization
----------------------------------------------

[...]

the experiment to run and the hyper-parameter space to search. Then this
application-driver would take control of scheduling jobs and running them on
various computers... I'm imagining a potentially ugly brute of a hack that's
not necessarily something we will want to expose at a low-level for reuse.

Yoshua: We want both a library-defined driver that takes instructions about how to generate
new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which
to sample them), and examples showing how to use it in typical cases.
Note that sometimes we just want to find the best configuration of hyper-parameters,
but sometimes we want to do more subtle analysis; often we want a combination of both.
In this respect it could be useful for the user to distinguish hyper-parameters about
which scientific questions are asked (e.g. depth of an architecture) from
hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate).
This can influence both the sampling of configurations (we want to make sure that all
combinations of question-driving hyper-parameters are covered) and the analysis
of results (we may want to estimate ANOVAs, averages, or quantiles over
the non-question-driving hyper-parameters).

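A sketch of the kind of driver this suggests, with question-driving hyper-parameters
crossed exhaustively and the others sampled from user-supplied priors; the function
name and its arguments are assumptions, not a proposed interface::

    import itertools
    import random

    def sample_configurations(question_grid, nuisance_priors, n_samples_per_cell):
        """Yield hyper-parameter configurations as dicts.

        question_grid      -- dict: name -> list of values, crossed exhaustively
                              (e.g. {"depth": [1, 2, 3]})
        nuisance_priors    -- dict: name -> zero-argument callable returning a sample
                              (e.g. {"lr": lambda: 10 ** random.uniform(-4, -1)})
        n_samples_per_cell -- nuisance draws per grid cell, so that every combination
                              of question-driving values is covered
        """
        names = sorted(question_grid)
        for cell in itertools.product(*(question_grid[n] for n in names)):
            for _ in range(n_samples_per_cell):
                config = dict(zip(names, cell))
                config.update((name, prior()) for name, prior in nuisance_priors.items())
                yield config

    # Usage sketch (run_experiment is a user-supplied function):
    #   for config in sample_configurations({"depth": [1, 2, 3]},
    #                                       {"lr": lambda: 10 ** random.uniform(-4, -1)},
    #                                       n_samples_per_cell=5):
    #       run_experiment(config)
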
Python scripts for common ML algorithms
---------------------------------------

The script aspect of this feature request makes me think that what would be
good here is more tutorial-type scripts. And the existing tutorials could
potentially be rewritten to use some of the pylearn.nnet expressions. More
tutorials / demos would be great.

Yoshua: agreed that we could write them as tutorials, but note how the
spirit would be different from the current deep learning tutorials: we would
not mind using library code as much as possible instead of trying to flatten
out everything in the interest of pedagogical simplicity. Instead, these
tutorials should be meant to illustrate not the algorithms but *how to take
advantage of the library*. They could also be used as *BLACK BOX* implementations
by people who don't want to dig lower and just want to run experiments.

Functional Specifications
=========================

TODO:
[...]
For each thing with a functional spec (e.g. datasets library, optimization library) make a
separate file.


pylearn.formulas
----------------

Directory with functions for building layers, calculating classification
errors, cross-entropies with various distributions, free energies, etc. This
module would consist mostly of global functions, Theano Ops and Theano
optimizations.

Yoshua: I would break it down into module files (a short sketch of one of these follows
the list), e.g.:

pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies,
    squared error, abs. error, various sparsity penalties (L1, Student)

pylearn.formulas.linear: formulas for linear classifiers, linear regression, factor analysis, PCA

pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
    layers which could be plugged with various costs & penalties, and stacked

pylearn.formulas.ae: formulas for auto-encoders, denoising auto-encoder variants, and corruption processes

pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling

pylearn.formulas.trees: formulas for decision trees

pylearn.formulas.boosting: formulas for boosting variants

etc.

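As a purely illustrative example of what one of these files might contain, a sketch of
a few formulas that pylearn.formulas.costs could export (the function names are
assumptions; only the Theano calls are real)::

    import theano.tensor as T

    def binary_cross_entropy(output, target):
        """Mean cross-entropy between Bernoulli probabilities and 0/1 targets."""
        return T.mean(T.nnet.binary_crossentropy(output, target))

    def squared_error(output, target):
        """Mean squared error, summed over output dimensions."""
        return T.mean(T.sum((output - target) ** 2, axis=1))

    def l1_penalty(*params):
        """L1 sparsity penalty over any number of parameter tensors."""
        return sum(T.sum(abs(p)) for p in params)
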
Indexing Convention
~~~~~~~~~~~~~~~~~~~

Something to decide on - Fortran-style or C-style indexing. Although we have