changeset 946:7c4504a4ce1a

additions to formulas, data access, hyper-params, scripts
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Wed, 11 Aug 2010 21:32:31 -0400
parents cafa16bfc7df
children 216f4ce969b2
files doc/v2_planning.txt
diffstat 1 files changed, 70 insertions(+), 6 deletions(-)
--- a/doc/v2_planning.txt	Wed Aug 11 14:35:57 2010 -0400
+++ b/doc/v2_planning.txt	Wed Aug 11 21:32:31 2010 -0400
@@ -68,6 +68,11 @@
 
 We could make this a submodule of pylearn: ``pylearn.nnet``.  
 
+Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just 
+about neural nets, and that this is a collection of formulas (expressions), rather than
+completely self-contained classes for learners. We could have a "nnet.py" file for
+neural nets, though.
+
 There are a number of ideas floating around for how to handle classes /
 modules (LeDeepNet, pylearn.shared.layers, pynnet), so let's implement as much
 math as possible in global functions with no classes.  There are no models in
@@ -85,11 +90,13 @@
 the dataset, shuffling the dataset, and splitting it into folds.  For
 efficiency, it is nice if the dataset interface supports looking up several
 index values at once, because looking up many examples at once can sometimes
-be faster than looking each one up in turn.
+be faster than looking each one up in turn. In particular, looking up
+a consecutive block of indices, or a slice, should be well supported.
 
 Some datasets may not support random access (e.g. a random number stream) and
 that's fine if an exception is raised. The user will see a NotImplementedError
-or similar, and try something else.
+or similar, and try something else. We might want a way to test whether a
+dataset supports random access without having to load an example.
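+
+A minimal sketch of what such an interface might look like (the class names,
+the has_random_access flag, and the method choices here are hypothetical,
+not a settled API)::
+
+    import numpy as np
+
+    class ArrayDataset(object):
+        """Hypothetical dataset backed by a numpy array."""
+        has_random_access = True  # queryable without loading an example
+
+        def __init__(self, data):
+            self.data = np.asarray(data)
+
+        def __len__(self):
+            return len(self.data)
+
+        def __getitem__(self, key):
+            # key may be an int, a slice, or a sequence of indices;
+            # slices / consecutive blocks map to cheap numpy views.
+            return self.data[key]
+
+    class StreamDataset(object):
+        """Hypothetical stream of examples: iteration only."""
+        has_random_access = False
+
+        def __init__(self, make_iterator):
+            self._make_iterator = make_iterator
+
+        def __iter__(self):
+            return self._make_iterator()
+
+        def __getitem__(self, key):
+            raise NotImplementedError("no random access for streams")
+
+A caller could then inspect dataset.has_random_access to choose between
+index-based minibatch lookup and plain sequential iteration.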
 
 
 A more intuitive interface for many datasets (or subsets) is to load them as
@@ -117,6 +124,24 @@
 much as possible.  It should be possible to rebuild this tree from information
 found in pylearn.
 
+Yoshua (about ideas proposed by Pascal Vincent a while ago): 
+
+  - we may want to distinguish between datasets and tasks: a task defines
+  not just the data but also things like which part of each example is the
+  input and which is the target (for supervised learning), and *importantly*
+  a set of performance metrics that make sense for this task (e.g. those
+  used by papers solving a particular task, or reported for a particular
+  benchmark)
+
+  - we should discuss a few "standards" that datasets and tasks may comply
+    with (see the sketch after this list), such as:
+    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
+      (with a convention for the semi-supervised case when only the input or only the target is observed)
+    - "input" for unsupervised learning
+    - conventions for missing-valued components inside the input or target
+    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
+    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
+    - how error metrics are specified
+        * example-level statistics (e.g. classification error)
+        * dataset-level statistics (e.g. ROC curve, mean and standard error of the error)
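+
+A minimal sketch of these conventions (the dict fields, the use of None for
+the unobserved part in the semi-supervised case, and the Task class are all
+hypothetical, for discussion only)::
+
+    import numpy as np
+
+    # An example as a dict with "input" and "target" fields; a missing
+    # target (semi-supervised case) is None, and missing-valued components
+    # inside a field could be NaN.
+    labeled = {'input': np.array([0.2, 0.7]), 'target': 1}
+    unlabeled = {'input': np.array([0.5, 0.1]), 'target': None}
+
+    def classification_error(prediction, example):
+        """Example-level statistic: 0/1 loss on a single example."""
+        return float(prediction != example['target'])
+
+    class Task(object):
+        """A dataset plus the metrics that make sense for it."""
+        def __init__(self, dataset, example_metrics, dataset_metrics):
+            self.dataset = dataset                  # iterable of example dicts
+            self.example_metrics = example_metrics  # e.g. [classification_error]
+            self.dataset_metrics = dataset_metrics  # e.g. a ROC-curve function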
 
 
 Model Selection & Hyper-Parameter Optimization
@@ -131,6 +156,18 @@
 various computers... I'm imagining a potentially ugly brute of a hack that's
 not necessarily something we will want to expose at a low level for reuse.
 
+Yoshua: We want both a library-defined driver that takes instructions about how
+to generate new hyper-parameter combinations (e.g. implicitly providing a prior
+distribution from which to sample them), and examples showing how to use it in
+typical cases. Note that sometimes we just want to find the best configuration
+of hyper-parameters, but sometimes we want to do a more subtle analysis; often
+we want a combination of both. In this respect it could be useful for the user
+to distinguish hyper-parameters about which scientific questions are asked
+(e.g. the depth of an architecture) from hyper-parameters that we would like
+to marginalize or maximize over (e.g. the learning rate). This distinction can
+influence both the sampling of configurations (we want to make sure that all
+combinations of question-driving hyper-parameters are covered) and the analysis
+of results (we may want to estimate ANOVAs, averages, or quantiles over the
+non-question-driving hyper-parameters).
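+
+A hedged sketch of this distinction (the variable names and the
+grid-plus-sampling strategy are illustrative, not a committed design)::
+
+    import itertools
+    import random
+
+    # Grid over the question-driving hyper-parameters so that every
+    # combination is covered; sample the others from user-given priors.
+    question_driving = {'depth': [1, 2, 3]}
+    marginalized = {'learning_rate': lambda rng: 10 ** rng.uniform(-4, -1)}
+
+    def sample_configurations(n_per_cell, seed=0):
+        rng = random.Random(seed)
+        for values in itertools.product(*question_driving.values()):
+            cell = dict(zip(question_driving.keys(), values))
+            for _ in range(n_per_cell):
+                config = dict(cell)
+                for name, prior in marginalized.items():
+                    config[name] = prior(rng)
+                yield config
+
+The analysis stage could then group results by the question-driving
+hyper-parameters and report averages or quantiles over the marginalized
+ones, e.g. mean test error per depth, taken over the sampled learning rates.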
 
 Python scripts for common ML algorithms
 ---------------------------------------
@@ -140,6 +177,13 @@
 potentially be rewritten to use some of the pylearn.nnet expressions.   More
 tutorials / demos would be great.
 
+Yoshua: agreed that we could write them as tutorials, but note how the
+spirit would differ from the current deep learning tutorials: we would
+use library code as much as possible instead of flattening everything
+out in the interest of pedagogical simplicity. These tutorials would be
+meant to illustrate not the algorithms themselves but *how to take
+advantage of the library*. They could also serve as *BLACK BOX*
+implementations for people who don't want to dig lower and just want to
+run experiments.
 
 Functional Specifications
 =========================
@@ -151,14 +195,34 @@
 
 
 
-pylearn.nnet
-------------
+pylearn.formulas
+----------------
 
-Submodule with functions for building layers, calculating classification
-errors, cross-entropies with various distributions, free energies.  This
+Directory with functions for building layers, calculating classification
+errors, cross-entropies with various distributions, free energies, etc.  This
 directory would contain, for the most part, global functions, Theano Ops and
 Theano optimizations.
 
+Yoshua: I would break it down into module files, e.g.:
+
+pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies,
+squared error, absolute error, various sparsity penalties (L1, Student)
+
+pylearn.formulas.linear: formulas for linear classifiers, linear regression, factor analysis, PCA
+
+pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
+layers that can be combined with various costs & penalties, and stacked
+
+pylearn.formulas.ae: formulas for auto-encoders, denoising auto-encoder variants, and corruption processes
+
+pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling
+
+pylearn.formulas.trees: formulas for decision trees
+
+pylearn.formulas.boosting: formulas for boosting variants
+
+etc.
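+
+To make the spirit concrete, here is a sketch of what two entries in
+pylearn.formulas.costs and pylearn.formulas.nnet might look like (the
+function names are illustrative, not a settled API)::
+
+    import theano.tensor as T
+
+    def binary_cross_entropy(output, target):
+        """Elementwise cross-entropy between a predicted probability
+        in (0, 1) and a binary target; the caller decides over which
+        axes to sum or average."""
+        return -(target * T.log(output) +
+                 (1.0 - target) * T.log(1.0 - output))
+
+    def sigmoid_layer(x, W, b):
+        """Affine transformation followed by a sigmoid nonlinearity."""
+        return T.nnet.sigmoid(T.dot(x, W) + b)
+
+Being plain global functions over Theano expressions, these can be freely
+composed, stacked, and plugged with different costs and penalties.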
+
 Indexing Convention
 ~~~~~~~~~~~~~~~~~~~