# HG changeset patch
# User Frederic Bastien
# Date 1284772481 14400
# Node ID 9ff2242a817b6f8ceacd81416fdb2973124ceadb
# Parent  0e12ea6ba6612903bb646baf2eb67f3df93afe7a
fix rst syntax errors/warnings

diff -r 0e12ea6ba661 -r 9ff2242a817b doc/v2_planning/architecture.txt
--- a/doc/v2_planning/architecture.txt	Fri Sep 17 20:55:18 2010 -0400
+++ b/doc/v2_planning/architecture.txt	Fri Sep 17 21:14:41 2010 -0400
@@ -76,6 +76,9 @@
 clarification of what the h*** am I talking about) in the following example:
 
 * Linear version:
+
+.. code-block:: python
+
   my_experiment = pipeline([
       data,
       filter_samples,
@@ -86,6 +89,9 @@
   ])
 
 * Encapsulated version:
+
+.. code-block:: python
+
   my_experiment = evaluation(
       data=PCA(filter_samples(data)),
       split=k_fold_split,
diff -r 0e12ea6ba661 -r 9ff2242a817b doc/v2_planning/dataset.txt
--- a/doc/v2_planning/dataset.txt	Fri Sep 17 20:55:18 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 17 21:14:41 2010 -0400
@@ -4,8 +4,8 @@
 Some talking points from the September 2 meeting:
 
 * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
-   needs to be flexible enough to accommodate different (sub)tasks and views of
-   the same underlying data.
+  needs to be flexible enough to accommodate different (sub)tasks and views of
+  the same underlying data.
 * Datasets as probability distributions from which one can sample.
 * That's not something I would consider to be a dataset-related problem to
   tackle now: a probability distribution in Pylearn would probably be a
@@ -13,7 +13,7 @@
   DatasetToDistribution class for instance, that would take care of viewing a
   dataset as a probability distribution. -- OD
 * Our specification should allow transparent handling of infinite datasets (or
-   simply datasets which cannot fit in memory)
+  simply datasets which cannot fit in memory)
 * GPU/buffering issues.
 
 Commiteee: DE, OB, OD, AB, PV
@@ -117,7 +117,9 @@
 dataset that we use, and the class declaration contains essentially everything
 there is to know about the dataset):
 
-class MNIST(Dataset):
+.. code-block:: python
+
+  class MNIST(Dataset):
     def __init__(self,inputs=['train_x.npy'],outputs=['train_y.npy']):
         self.type='standard_xy'
         self.in_memory = True
@@ -259,8 +261,12 @@
 the writer of a new dataset subclass). Anyway, maybe a first thing we could
 think about is what we want a mini-batch to be. I think we can agree that we
 would like to be able to do something like:
+
+.. code-block:: python
+
   for mb in dataset.mini_batches(size=10):
       learner.update(mb.input, mb.target)
+
 so that it should be ok for a mini-batch to be an object whose fields (that
 should have the same name as those of the dataset) are numpy arrays. More
 generally, we would like to be able to iterate on samples in a
@@ -285,6 +291,7 @@
 OD: (this is hopefully a clearer re-write of the original version from
 r7e6e77d50eeb, which I was not happy with). There are typically three kinds
 of objects that spit out data:
+
 1. Datasets that are loaded from disk or are able to generate data all by
    themselves (i.e. without any other dataset as input)
 2. Datasets that transform their input dataset in a way that only depends on
@@ -293,6 +300,7 @@
    potentially different dataset (e.g. PCA when you want to learn the
    projection space on the training set in order to transform both the
    training and test sets).
+
 My impression currently is that we would use dataset subclasses to handle 1
 and 2. However, 3 requires a learner framework, so you would need to have
 something like a LearnerOutputDataset(trained_learner, dataset).
@@ -304,6 +312,7 @@
 
 The main advantages I find in this approach (that I have been using at
 Ubisoft) are:
+
 - You only need to learn how to subclass the learner class. The only dataset
   class is LearnerOutputDataset, which you could just name Dataset.
 - You do not have different ways to achieve the same result (having to figure
diff -r 0e12ea6ba661 -r 9ff2242a817b doc/v2_planning/plugin_architecture_GD.txt
--- a/doc/v2_planning/plugin_architecture_GD.txt	Fri Sep 17 20:55:18 2010 -0400
+++ b/doc/v2_planning/plugin_architecture_GD.txt	Fri Sep 17 21:14:41 2010 -0400
@@ -3,11 +3,13 @@
 
 The "central authority" (CA) is the glue which takes care of interfacing
 plugins with one another. It has 3 basic roles:
+
 * it maintains a list of "registered" or "active" plugins
 * it receives and queues the various messages sent by the plugins
 * dispatches the messages to the recipient, based on various "events"
 
 Events can take different forms:
+
 * the CA can trigger various events based on running time
 * can be linked to messages emitted by the various plugins. Events can be
   triggered based on the frequency of such messages.
@@ -26,13 +28,15 @@
 
 James and OB to python-ize this :)
 
-class MessageX(Message):
+.. code-block:: python
+
+  class MessageX(Message):
     """
     A message is basically a data container. This could very well be replaced
     by a generic Python object.
     """
 
-class Plugin(object):
+  class Plugin(object):
     """
     The base plugin object doesn't do much. It contains a reference to the
     CA (upon plugin being registered with the CA), provides boilerplate code
@@ -92,7 +96,7 @@
             callback(message)
 
 
-class ProducerPlugin(Plugin):
+  class ProducerPlugin(Plugin):
 
     def dostuff():
         """
@@ -108,7 +112,7 @@
         ca.send(msga)   # ask CA to forward to other plugins
 
 
-class ConsumerPlugin(Plugin):
+  class ConsumerPlugin(Plugin):
 
     @handler(MessageA)
     def func(msga):
@@ -119,7 +123,7 @@
         # do something with message A
 
 
-class ConsumerProducerPlugin(Plugin):
+  class ConsumerProducerPlugin(Plugin):
 
     @handler(MessageA)
     def func(msga):
@@ -138,7 +142,7 @@
 
 
 
-class CentralAuthority(object):
+  class CentralAuthority(object):
 
     active_plugins = []  # contains a list of registered plugins
 
@@ -211,7 +215,9 @@
 =======================
 
 
-def main():
+.. code-block:: python
+
+  def main():
     ca = CentralAuthority()
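The plugin hunks above document an event-driven protocol: plugins register with a
CentralAuthority, declare handlers for message classes via a ``@handler`` decorator,
and exchange ``Message`` objects through ``ca.send()``. A minimal, self-contained
sketch of that pattern follows. Only the names (``Message``, ``Plugin``, ``handler``,
``CentralAuthority``, ``active_plugins``, ``ca.send``) come from
plugin_architecture_GD.txt; the ``register`` method, the ``self`` parameters, the
``value`` payload, and the dispatch-by-introspection logic are illustrative
assumptions, not the project's actual implementation.

.. code-block:: python

    class Message(object):
        """A message is basically a data container."""
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)

    class MessageA(Message):
        pass

    def handler(message_class):
        """Decorator marking a plugin method as a handler for a message class."""
        def decorate(func):
            func._handles = message_class
            return func
        return decorate

    class Plugin(object):
        """Base plugin: holds a reference to the CA once registered."""
        def __init__(self):
            self.ca = None

    class CentralAuthority(object):
        """Glue object: keeps registered plugins and dispatches messages."""
        def __init__(self):
            self.active_plugins = []  # list of registered plugins

        def register(self, plugin):
            plugin.ca = self
            self.active_plugins.append(plugin)

        def send(self, message):
            # Forward the message to every handler declared for its class.
            for plugin in self.active_plugins:
                for name in dir(plugin):
                    method = getattr(plugin, name)
                    if callable(method) and getattr(method, '_handles', None) is type(message):
                        method(message)

    class ConsumerPlugin(Plugin):
        @handler(MessageA)
        def func(self, msga):
            # do something with message A
            print('got MessageA with value', msga.value)

    def main():
        ca = CentralAuthority()
        ca.register(ConsumerPlugin())
        ca.send(MessageA(value=42))

    if __name__ == '__main__':
        main()

Running ``main()`` prints the handled message, which is enough to exercise the
register/send/handler round trip the draft describes.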