changeset 911:fdb63e4e042d

Had forgotten to hg add SeriesTable .txt doc
author fsavard
date Thu, 18 Mar 2010 12:29:36 -0400
parents 8837535006f1
children 0354b682c289
files doc/images/logo_pylearn_200x57.png doc/images/vitables_example_series.png doc/seriestables.txt
diffstat 3 files changed, 222 insertions(+), 0 deletions(-) [+]
line wrap: on
line diff
Binary file doc/images/logo_pylearn_200x57.png has changed
Binary file doc/images/vitables_example_series.png has changed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/seriestables.txt	Thu Mar 18 12:29:36 2010 -0400
@@ -0,0 +1,222 @@
+.. SeriesTables documentation master file, created by
+   sphinx-quickstart on Wed Mar 10 17:56:41 2010.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Introduction to ``SeriesTables``
+--------------------------------
+
+SeriesTables was created to make it easier to **record scalar data series**, notably the **evolution of errors (training, validation, test) during training**. I foresee other common use cases, such as **recording basic statistics (mean, min/max, variance) of parameters** during training, to diagnose problems.
+
+I also think that if such recording is easily accessible, it might lead us to record other statistics, such as statistics about activations in the network (e.g. to diagnose unit saturation problems).
+
+Each **element of a series is indexed and timestamped**. By default, for example, the index is named "epoch", which means that an epoch number is stored with each row (this can easily be customized). By default, the timestamp at row creation time is also stored, along with the CPU clock() time. This allows plotting error series against either epoch number or training time.
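+To make these defaults concrete, here is a minimal plain-Python sketch (hypothetical, not the library's actual code) of the fields stored with each appended value; ``time.process_time()`` stands in here for the ``clock()`` CPU timer:

```python
import time

def make_row(epoch, error):
    # Sketch of the columns stored per row: the "epoch" index,
    # the recorded value, and the two default time columns.
    return {
        "epoch": epoch,                   # index column
        "error": error,                   # recorded value
        "timestamp": time.time(),         # wall-clock time at row creation
        "cpuclock": time.process_time(),  # CPU time
    }

row = make_row(1, 32.0)
print(sorted(row))  # ['cpuclock', 'epoch', 'error', 'timestamp']
```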
+
+Series are saved in HDF5 files, which I'll introduce briefly.
+
+Introduction to PyTables and HDF5
+---------------------------------
+
+HDF5_ is a file format intended for storing large numerical datasets. In practice, for our purposes, you'll create a single ``.h5`` file, in which many tables, corresponding to different series, will be stored. Datasets in a single file are organized hierarchically, in the equivalent of "folders", called "groups". The "files" in the analogy would be our tables.
+
+.. _HDF5: http://www.hdfgroup.org/HDF5/
+
+A useful property of HDF5 is that metadata is stored along with the data itself. Notably, the table names and column names are kept inside the file. We can also attach more complex data, such as a title, or even complex objects (which will be pickled), as attributes.
+
+PyTables_ is a Python library for working with the HDF5 format.
+
+.. _PyTables: http://www.pytables.org/moin/HowToUse
+
+Here's a basic Python session in which I create a new file and store a few rows in a single table:
+
+>>> import tables
+>>> 
+>>> hdf5_file = tables.openFile("mytables.h5", "w")
+>>> 
+>>> # Create a new subgroup under the root group "/"
+... mygroup = hdf5_file.createGroup("/", "mygroup")
+>>> 
+>>> # Define the type of data we want to store
+... class MyDescription(tables.IsDescription):
+...     int_column_1 = tables.Int32Col(pos=0)
+...     float_column_1 = tables.Float32Col(pos=1)
+... 
+>>> # Create a table under mygroup
+... mytable = hdf5_file.createTable("/mygroup", "mytable", MyDescription)
+>>> 
+>>> newrow = mytable.row
+>>> 
+>>> # a first row
+... newrow["int_column_1"] = 15
+>>> newrow["float_column_1"] = 30.0
+>>> newrow.append()
+>>> 
+>>> # and a second row
+... newrow["int_column_1"] = 16
+>>> newrow["float_column_1"] = 32.0
+>>> newrow.append()
+>>> 
+>>> # make sure we write to disk
+... hdf5_file.flush()
+>>> 
+>>> hdf5_file.close()
+
+
+And here's a session in which I reload the data and explore it:
+
+>>> import tables
+>>> 
+>>> hdf5_file = tables.openFile("mytables.h5", "r")
+>>> 
+>>> mytable = hdf5_file.getNode("/mygroup", "mytable")
+>>> 
+>>> # tables can be "sliced" this way
+... mytable[0:2]
+array([(15, 30.0), (16, 32.0)], 
+      dtype=[('int_column_1', '<i4'), ('float_column_1', '<f4')])
+>>> 
+>>> # or we can access columns individually
+... mytable.cols.int_column_1[0:2]
+array([15, 16], dtype=int32)
+
+
+Using ``SeriesTables``: a basic example
+---------------------------------------
+
+Here's a very basic example usage:
+
+>>> import tables
+>>> from pylearn.io.seriestables import *
+>>> 
+>>> tables_file = tables.openFile("series.h5", "w")
+>>> 
+>>> error_series = ErrorSeries(error_name="validation_error", \
+...                         table_name="validation_error", \
+...                         hdf5_file=tables_file)
+>>> 
+>>> error_series.append((1,), 32.0)
+>>> error_series.append((2,), 28.0)
+>>> error_series.append((3,), 26.0)
+
+I can then open the file ``series.h5``, which will contain a table named ``validation_error`` with a column named ``epoch`` and another named ``validation_error``. There will also be ``timestamp`` and ``cpuclock`` columns, as this is the default behavior. The table rows will correspond to the data added with ``append()`` above.
+
+Indices
+.......
+
+You may notice that the first parameter of ``append()`` is a tuple. This is because the *index* may have multiple levels. The index is what gives the rows their order.
+
+In the default case for ``ErrorSeries``, the index only has an "epoch" level, so the tuple has a single element. But in the ``ErrorSeries(...)`` constructor you can specify the ``index_names`` parameter, e.g. ``('epoch','minibatch')``, which lets you use both the epoch and the minibatch as the index.
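+Since indices are tuples, a multi-level index orders rows the way Python orders tuples, level by level. A small self-contained sketch (plain Python, with made-up index values, not the library's code):

```python
# Hypothetical (epoch, minibatch) indices attached to error values.
rows = [
    ((1, 0), 35.0),
    ((0, 1), 40.0),
    ((1, 1), 33.0),
    ((0, 0), 42.0),
]

# Tuples compare lexicographically: first by epoch, then by minibatch.
ordered = sorted(rows)
print([index for index, _ in ordered])
# [(0, 0), (0, 1), (1, 0), (1, 1)]
```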
+
+
+Summary of the most useful classes
+----------------------------------
+
+By default, for each of these series, there are also columns for the timestamp and the CPU clock() value at the time ``append()`` is called. This can be changed with the ``store_timestamp`` and ``store_cpuclock`` parameters of their constructors.
+
+ErrorSeries
+  This records a single 32-bit floating point value, along with an index, in a new table.
+
+AccumulatorSeriesWrapper
+  This wraps another Series and calls its ``append()`` method once its own ``append()`` has been called N times, N being a parameter given when constructing the ``AccumulatorSeriesWrapper``. A simple use case: say you want to store the mean of the training error every 100 minibatches. You create an ``ErrorSeries``, wrap it with an accumulator, and then call the wrapper's ``append()`` for every minibatch. It collects the errors, waits until it has 100 of them, takes their mean (with ``numpy.mean``), stores it in the ``ErrorSeries``, and starts over.
+  Other "reducing" functions can be used instead of the mean.
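+The accumulate-then-reduce behaviour described above can be sketched as follows (a hypothetical reimplementation for illustration, not the real ``AccumulatorSeriesWrapper``):

```python
import numpy

class AccumulatorSketch:
    # Hypothetical sketch of the accumulate-then-reduce logic.
    def __init__(self, base_append, reduce_every, reduce_function=numpy.mean):
        self.base_append = base_append    # append() of the wrapped series
        self.reduce_every = reduce_every  # N: reduce after N calls
        self.reduce_function = reduce_function
        self.buffer = []

    def append(self, index, value):
        self.buffer.append(value)
        if len(self.buffer) == self.reduce_every:
            # reduce the collected values, store once, start over
            self.base_append(index, self.reduce_function(self.buffer))
            self.buffer = []

stored = []
acc = AccumulatorSketch(lambda idx, v: stored.append((idx, v)), reduce_every=4)
for i, err in enumerate([4.0, 2.0, 6.0, 8.0, 1.0]):
    acc.append((0, i), err)
# Only one reduced value was stored: index (0, 3), mean 5.0;
# the fifth error is still waiting in the buffer.
```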
+
+BasicStatisticsSeries
+  This stores the mean, min, max and standard deviation of the arrays you pass to its ``append()`` method. This is useful, notably, to see how the weights (and other parameters) evolve during training without actually storing the parameters themselves.
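+As an illustration, the four statistics could be computed like this (a hypothetical helper using ``numpy``, not the library's actual code):

```python
import numpy

def basic_statistics(array):
    # The four per-array statistics described above.
    a = numpy.asarray(array)
    return {"mean": a.mean(), "min": a.min(), "max": a.max(), "std": a.std()}

stats = basic_statistics([1.0, 2.0, 3.0, 4.0])
# mean 2.5, min 1.0, max 4.0, std ~= 1.118
```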
+
+SharedParamsStatisticsWrapper
+  This wraps a few BasicStatisticsSeries. It is specifically designed so you can pass it a list of shared (as in ``theano.shared``) parameter arrays. Each array gets its own table, under a new HDF5 group. You can name each table, e.g. "layer1_b", "layer1_W", etc.
+
+Example of real usage
+---------------------
+
+The following is a function where I create the series used to record errors and statistics about parameters in a stacked denoising autoencoder script:
+
+.. code-block:: python
+
+	def create_series(num_hidden_layers):
+
+		# Replace series we don't want to save with DummySeries, e.g.
+		# series['training_error'] = DummySeries()
+
+		series = {}
+
+		basedir = os.getcwd()
+
+		h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")
+
+		# training error is accumulated over 100 minibatches,
+		# then the mean is computed and saved in the training_base series
+		training_base = \
+					ErrorSeries(error_name="training_error",
+						table_name="training_error",
+						hdf5_file=h5f,
+						index_names=('epoch','minibatch'),
+						title="Training error (mean over 100 minibatches)")
+
+		# this series wraps training_base, performs accumulation
+		series['training_error'] = \
+					AccumulatorSeriesWrapper(base_series=training_base,
+						reduce_every=100)
+
+		# valid and test are not accumulated/mean, saved directly
+		series['validation_error'] = \
+					ErrorSeries(error_name="validation_error",
+						table_name="validation_error",
+						hdf5_file=h5f,
+						index_names=('epoch',))
+
+		series['test_error'] = \
+					ErrorSeries(error_name="test_error",
+						table_name="test_error",
+						hdf5_file=h5f,
+						index_names=('epoch',))
+
+		# next we want to store the parameters statistics
+		# so first we create the names for each table, based on 
+		# position of each param in the array
+		param_names = []
+		for i in range(num_hidden_layers):
+			param_names += ['layer%d_W'%i, 'layer%d_b'%i, 'layer%d_bprime'%i]
+		param_names += ['logreg_layer_W', 'logreg_layer_b']
+
+		series['params'] = SharedParamsStatisticsWrapper(
+							new_group_name="params",
+							base_group="/",
+							arrays_names=param_names,
+							hdf5_file=h5f,
+							index_names=('epoch',))
+
+		return series
+
+Then, here's an example of ``append()`` usage for each of these series, wrapped in pseudocode:
+
+.. code-block:: python
+
+	series = create_series(num_hidden_layers=3)
+	
+	...
+
+	for epoch in range(num_epochs):
+		for mb_index in range(num_minibatches):
+			train_error = finetune(mb_index)
+			series['training_error'].append((epoch, mb_index), train_error)
+
+		valid_error = compute_validation_error()
+		series['validation_error'].append((epoch,), valid_error)
+
+		test_error = compute_test_error()
+		series['test_error'].append((epoch,), test_error)
+
+		# suppose all_params is a list [layer1_W, layer1_b, ...]
+		# where each element is a shared (as in theano.shared) array
+		series['params'].append((epoch,), all_params)
+
+Visualizing in vitables
+-----------------------
+
+vitables_ is a program with which you can easily explore an HDF5 ``.h5`` file. Here's a screenshot in which I visualize the series produced by the preceding example:
+
+.. _vitables: http://vitables.berlios.de/
+
+.. image:: images/vitables_example_series.png