diff doc/v2_planning/dataset.txt @ 1190:9ff2242a817b

fix rst syntax errors/warnings
author Frederic Bastien <nouiz@nouiz.org>
date Fri, 17 Sep 2010 21:14:41 -0400
parents d9550c27a192
children 7dfc3d3052ea
line diff
--- a/doc/v2_planning/dataset.txt	Fri Sep 17 20:55:18 2010 -0400
+++ b/doc/v2_planning/dataset.txt	Fri Sep 17 21:14:41 2010 -0400
@@ -4,8 +4,8 @@
 Some talking points from the September 2 meeting:
 
  * Datasets as views/tasks (Pascal Vincent's idea): our dataset specification
- needs to be flexible enough to accommodate different (sub)tasks and views of
- the same underlying data.
+   needs to be flexible enough to accommodate different (sub)tasks and views of
+   the same underlying data.
  * Datasets as probability distributions from which one can sample.
     * That's not something I would consider to be a dataset-related problem to
         tackle now: a probability distribution in Pylearn would probably be a
@@ -13,7 +13,7 @@
         DatasetToDistribution class for instance, that would take care of viewing a
         dataset as a probability distribution. -- OD
  * Our specification should allow transparent handling of infinite datasets (or
- simply datasets which cannot fit in memory)
+   simply datasets which cannot fit in memory)
  * GPU/buffering issues.
 
 Committee: DE, OB, OD, AB, PV
@@ -117,7 +117,9 @@
 dataset that we use, and the class declaration contains essentially everything
 there is to know about the dataset):
 
-class MNIST(Dataset):
+.. code-block:: python
+
+  class MNIST(Dataset):
     def  __init__(self,inputs=['train_x.npy'],outputs=['train_y.npy']):
         self.type='standard_xy'
         self.in_memory = True
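One possible reading of such a declarative subclass, written out as a self-contained sketch (the ``Dataset`` base class, the field names and the eager ``numpy.load`` calls are assumptions for illustration, not part of the proposal):

.. code-block:: python

  import numpy as np

  class MNIST(Dataset):  # ``Dataset`` base class assumed to exist
      def __init__(self, inputs=['train_x.npy'], outputs=['train_y.npy']):
          self.type = 'standard_xy'
          self.in_memory = True
          # Hypothetical completion: load the arrays eagerly, so that the
          # class declaration alone says where the data comes from.
          self.input = np.load(inputs[0])
          self.target = np.load(outputs[0])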
@@ -259,8 +261,12 @@
 the writer of a new dataset subclass). Anyway, maybe a first thing we could
 think about is what we want a mini-batch to be. I think we can agree that we
 would like to be able to do something like:
+
+.. code-block:: python
+
     for mb in dataset.mini_batches(size=10):
         learner.update(mb.input, mb.target)
+
 so that it should be ok for a mini-batch to be an object whose fields
 (that should have the same name as those of the dataset) are numpy arrays.
 More generally, we would like to be able to iterate on samples in a
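As a rough illustration of that interface, a mini-batch can simply be a small container whose attributes are numpy arrays sliced from the dataset's fields (the ``MiniBatch`` class and the standalone ``mini_batches`` helper below are hypothetical, not settled API):

.. code-block:: python

  class MiniBatch(object):
      """Container whose fields mirror those of the dataset."""
      def __init__(self, **fields):
          self.__dict__.update(fields)

  def mini_batches(inputs, targets, size=10):
      """Yield objects with ``.input`` and ``.target`` numpy-array fields."""
      for start in range(0, len(inputs), size):
          yield MiniBatch(input=inputs[start:start + size],
                          target=targets[start:start + size])

  # usage sketch:
  #   for mb in mini_batches(x, y, size=10):
  #       learner.update(mb.input, mb.target)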
@@ -285,6 +291,7 @@
 OD: (this is hopefully a clearer re-write of the original version from
 r7e6e77d50eeb, which I was not happy with).
 There are typically three kinds of objects that spit out data:
+
 1. Datasets that are loaded from disk or are able to generate data all by
    themselves (i.e. without any other dataset as input)
 2. Datasets that transform their input dataset in a way that only depends on
@@ -293,6 +300,7 @@
    potentially different dataset (e.g. PCA when you want to learn the projection
    space on the training set in order to transform both the training and test
    sets).
+
 My impression currently is that we would use dataset subclasses to handle 1
 and 2. However, 3 requires a learner framework, so you would need to have
 something like a LearnerOutputDataset(trained_learner, dataset).
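A possible sketch of that wrapper, assuming a dataset can be iterated over sample by sample and that a trained learner exposes a ``compute_output`` method (both interfaces are assumptions, not settled design):

.. code-block:: python

  class LearnerOutputDataset(Dataset):
      """Dataset whose samples are the outputs of a trained learner
      applied to another dataset (case 3 above)."""
      def __init__(self, trained_learner, dataset):
          self.learner = trained_learner
          self.dataset = dataset

      def __iter__(self):
          for sample in self.dataset:
              # Assumed learner interface: map one input sample to an output.
              yield self.learner.compute_output(sample)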
@@ -304,6 +312,7 @@
 
 The main advantages I find in this approach (that I have been using at
 Ubisoft) are:
+
 - You only need to learn how to subclass the learner class. The only dataset
   class is LearnerOutputDataset, which you could just name Dataset.
 - You do not have different ways to achieve the same result (having to figure