changeset 965:bf54637bb994

merge
author James Bergstra <bergstrj@iro.umontreal.ca>
date Fri, 20 Aug 2010 09:31:39 -0400
parents 6a778bca0dec (current diff) d944e1c26a57 (diff)
children e88d7b7d53ed
files
diffstat 2 files changed, 129 insertions(+), 7 deletions(-)
--- a/doc/v2_planning.txt	Fri Aug 20 09:31:24 2010 -0400
+++ b/doc/v2_planning.txt	Fri Aug 20 09:31:39 2010 -0400
@@ -68,8 +68,13 @@
 
 We could make this a submodule of pylearn: ``pylearn.nnet``.  
 
+Yoshua: I would use a different name, e.g., "pylearn.formulas" to emphasize that it is not just 
+about neural nets, and that this is a collection of formulas (expressions), rather than
+completely self-contained classes for learners. We could have a "nnet.py" file for
+neural nets, though.
+
 There are a number of ideas floating around for how to handle classes /
-modules (LeDeepNet, pylearn.shared.layers, pynnet) so lets implement as much
+modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn) so let's implement as much
 math as possible in global functions with no classes.  There are no models in
 the wish list that require more than a few vectors and matrices to parametrize.
 Global functions are more reusable than classes.
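+
+As a rough illustration of the style this implies (the names and signature below are
+hypothetical, not an agreed API), a layer would be a plain function returning a
+Theano expression rather than a method on a model class::
+
+    import theano.tensor as T
+
+    def sigmoid_layer(x, W, b):
+        """Affine transform followed by a sigmoid, as a reusable expression."""
+        return T.nnet.sigmoid(T.dot(x, W) + b)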
@@ -85,11 +90,13 @@
 the dataset, shuffling the dataset, and splitting it into folds.  For
 efficiency, it is nice if the dataset interface supports looking up several
 index values at once, because looking up many examples at once can sometimes
-be faster than looking each one up in turn.
+be faster than looking each one up in turn. In particular, looking up
+a consecutive block of indices, or a slice, should be well supported.
 
 Some datasets may not support random access (e.g. a random number stream) and
 that's fine if an exception is raised. The user will see a NotImplementedError
-or similar, and try something else.
+or similar, and try something else. We might also want a way to test
+whether a dataset supports random access without having to load an example.
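+
+A minimal sketch of what such an interface could look like (the names are hypothetical
+and unrelated to the existing pylearn.datasets classes)::
+
+    class IndexedDataset(object):
+        def __getitem__(self, idx):
+            """Accept a single index, a sequence of indices, or a slice;
+            raise NotImplementedError for stream-like datasets."""
+            raise NotImplementedError()
+
+        def has_random_access(self):
+            """Report whether indexed lookup is supported, without loading an example."""
+            return False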
 
 
 A more intuitive interface for many datasets (or subsets) is to load them as
@@ -117,6 +124,24 @@
 much as possible.  It should be possible to rebuild this tree from information
 found in pylearn.
 
+Yoshua (about ideas proposed by Pascal Vincent a while ago): 
+
+  - we may want to distinguish between datasets and tasks: a task defines
+  not just the data but also which part of each example is the input and which is the
+  target (for supervised learning), and *importantly* a set of performance metrics
+  that make sense for this task (e.g. those used by papers addressing a particular
+  task, or those reported for a particular benchmark)
+
+  - we should discuss a few "standards" that datasets and tasks may comply with (see the sketch after this list), such as
+    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
+      (with a convention for the semi-supervised case when only the input or only the target is observed)
+    - "input" for unsupervised learning
+    - conventions for missing-valued components inside input or target 
+    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
+    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
+    - how error metrics are specified
+        * example-level statistics (e.g. classification error)
+        * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
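+
+A purely illustrative sketch of what a "task" object complying with such standards
+might carry (the names are hypothetical)::
+
+    class Task(object):
+        """A dataset plus the conventions needed to evaluate models on it."""
+        def __init__(self, dataset, input_fields, target_fields, metrics):
+            self.dataset = dataset              # the underlying examples
+            self.input_fields = input_fields    # e.g. ('input',)
+            self.target_fields = target_fields  # e.g. ('target',); empty if unsupervised
+            self.metrics = metrics              # e.g. {'classification_error': fn, 'ROC': fn}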
 
 
 Model Selection & Hyper-Parameter Optimization
@@ -131,6 +156,18 @@
 various computers... I'm imagining a potentially ugly brute of a hack that's
 not necessarily something we will want to expose at a low-level for reuse.
 
+Yoshua: We want both the library-defined driver that takes instructions about how to generate
+new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which 
+to sample them), and examples showing how to use it in typical cases.
+Note that sometimes we just want to find the best configuration of hyper-parameters,
+but sometimes we want to do more subtle analysis. Often a combination of both.
+In this respect it could be useful for the user to distinguish the hyper-parameters
+about which scientific questions are asked (e.g. depth of an architecture) from the
+hyper-parameters we would like to marginalize/maximize over (e.g. learning rate).
+This can influence both the sampling of configurations (we want to make sure that all
+combinations of question-driving hyper-parameters are covered) and the analysis
+of results (we may want to estimate ANOVAs, averages, or quantiles over
+the non-question-driving hyper-parameters).
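+
+A small sketch of how a driver could treat the two kinds of hyper-parameters
+(a hypothetical helper, not a proposed API): cross the question-driving ones
+exhaustively and sample the others from their prior::
+
+    import itertools
+    import random
+
+    def sample_configs(question_grid, nuisance_priors, n_samples_each, seed=0):
+        """question_grid: dict name -> list of values to cross exhaustively.
+        nuisance_priors: dict name -> function drawing one value given an rng."""
+        rng = random.Random(seed)
+        for combo in itertools.product(*question_grid.values()):
+            base = dict(zip(question_grid.keys(), combo))
+            for _ in range(n_samples_each):
+                cfg = dict(base)
+                for name, draw in nuisance_priors.items():
+                    cfg[name] = draw(rng)
+                yield cfg
+
+For example, sample_configs({'depth': [1, 2, 3]}, {'lr': lambda r: 10 ** r.uniform(-4, -1)}, 5)
+would cover every depth while drawing five learning rates per depth.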
 
 Python scripts for common ML algorithms
 ---------------------------------------
@@ -140,6 +177,13 @@
 potentially be rewritten to use some of the pylearn.nnet expressions.   More
 tutorials / demos would be great.
 
+Yoshua: agreed that we could write them as tutorials, but note how the
+spirit would be different from the current deep learning tutorials: we would
+use library code as much as possible rather than flattening everything out in
+the interest of pedagogical simplicity. These tutorials would be meant to
+illustrate not the algorithms themselves but *how to take
+advantage of the library*. They could also be used as *BLACK BOX* implementations
+by people who don't want to dig lower and just want to run experiments.
 
 Functional Specifications
 =========================
@@ -151,14 +195,38 @@
 
 
 
-pylearn.nnet
-------------
+pylearn.formulas
+----------------
 
-Submodule with functions for building layers, calculating classification
-errors, cross-entropies with various distributions, free energies.  This
+Directory with functions for building layers, calculating classification
+errors, cross-entropies with various distributions, free energies, etc.  This
 module would include for the most part global functions, Theano Ops and Theano
 optimizations.
 
+Yoshua: I would break it down in module files, e.g.:
+
+pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error, 
+abs. error, various sparsity penalties (L1, Student)
+
+pylearn.formulas.linear: formulas for linear classifier, linear regression, factor analysis, PCA
+
+pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
+layers which could be plugged with various costs & penalties, and stacked
+
+pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants
+
+pylearn.formulas.noise: formulas for corruption processes
+
+pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling
+
+pylearn.formulas.trees: formulas for decision trees
+
+pylearn.formulas.boosting: formulas for boosting variants
+
+etc.
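+
+As one illustrative example of the kind of function pylearn.formulas.costs could
+hold (the name and signature are hypothetical)::
+
+    import theano.tensor as T
+
+    def binary_crossentropy(output, target):
+        """Mean cross-entropy between predicted probabilities and binary targets."""
+        return -T.mean(target * T.log(output) + (1 - target) * T.log(1 - output))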
+
+Fred: It seems that the DeepANN git repository by Xavier G. has part of this implemented as functions.
+
 Indexing Convention
 ~~~~~~~~~~~~~~~~~~~
 
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pylearn/datasets/test_modes.py	Fri Aug 20 09:31:39 2010 -0400
@@ -0,0 +1,54 @@
+from pylearn.datasets import Dataset
+import numpy
+
+def neal94_AC(p=0.01, size=10000, seed=238904, w=(.25, .25, .25, .25)):
+    """
+    Generates the dataset used in [Desjardins et al, AISTATS 2010]. The dataset
+    is composed of 4x4 binary images with four basic modes: full black, full
+    white, and [black,white] and [white,black] images. Modes are created by
+    drawing each pixel from the 4 basic modes with a bit-flip probability p.
+    
+    :param p: bit-flip probability for each pixel; a scalar or a list (one value per mode)
+    :param size: total size of the dataset
+    :param seed: seed used to draw random samples
+    :param w: weight of each mode within the dataset
+    """
+
+    # can modify the p-value separately for each mode
+    if not isinstance(p, (list,tuple)):
+        p = [p for i in w]
+
+    rng = numpy.random.RandomState(seed)
+
+    # mode 1: black image
+    B = numpy.zeros((1,16))
+    # mode 2: white image
+    W = numpy.ones((1,16))
+    # mode 3: white image with black stripe in left-hand side of image
+    BW = numpy.ones((4,4))
+    BW[:, :2] = 0
+    BW = BW.reshape(1,16)
+    # mode 4: white image with black stripe in right-hand side of image
+    WB = numpy.zeros((4,4))
+    WB[:, :2] = 1
+    WB = WB.reshape(1,16)
+
+    modes = [B,W,BW,WB]
+    data = numpy.zeros((0,16))
+    
+    # create permutations of basic modes with bitflip prob p
+    for i, m in enumerate(modes):
+        n = int(size * w[i])  # number of examples drawn from this mode
+        bitflip = rng.binomial(1,p[i],size=(n,16))
+        d = numpy.abs(numpy.repeat(m, n, axis=0) - bitflip)
+        data = numpy.vstack((data,d))
+
+    y = numpy.zeros((size,1))
+    
+    dset = Dataset()
+    dset.train = Dataset.Obj(x=data, y=y)
+    dset.test = None
+    dset.img_shape = (4,4)
+
+    return dset
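+
+
+if __name__ == '__main__':
+    # Illustrative smoke test (an editor's sketch, not part of the original changeset):
+    # draw a small dataset and check the expected shapes.
+    d = neal94_AC(size=1000)
+    assert d.train.x.shape == (1000, 16)
+    assert d.train.y.shape == (1000, 1)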