view doc/seriestables.txt @ 1140:7d2e65249bf9

coding_style: Closed some open questions for which a decision was reached during meeting
author Olivier Delalleau <delallea@iro>
date Thu, 16 Sep 2010 13:14:19 -0400
parents 34d1cd516f76

.. SeriesTables documentation master file, created by
   sphinx-quickstart on Wed Mar 10 17:56:41 2010.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Introduction to ``SeriesTables``
--------------------------------

SeriesTables was created to make it easier to **record scalar data series**, most notably the **evolution of errors (training, validation, test) during training**. I foresee other common use cases, such as **recording basic statistics (mean, min/max, variance) of parameters** during training to diagnose problems.

I also think that if such recording is easily accessible, it might lead us to record other statistics, such as stats concerning activations in the network (e.g. to diagnose unit saturation problems).

Each **element of a series is indexed and timestamped**. By default, for example, the index is named "epoch", which means that an epoch number is stored with each row (this can easily be customized). By default, the timestamp at row creation time is also stored, along with the CPU clock() time. This makes it possible to plot an error series against either epoch or training time.
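
To make the row layout concrete, here is a minimal sketch in plain Python of what one indexed, timestamped element might look like. It is not the SeriesTables implementation: ``make_row`` is a hypothetical helper, the column names simply follow the defaults described in this document, and ``time.process_time()`` is assumed as a stand-in for the CPU clock() value.

```python
import time


def make_row(index, value, index_names=("epoch",), value_name="validation_error"):
    """Build one series element as a dict (hypothetical helper, not SeriesTables API)."""
    if len(index) != len(index_names):
        raise ValueError("index must have one entry per index name")
    row = {
        "timestamp": time.time(),         # wall-clock time at row creation
        "cpuclock": time.process_time(),  # CPU time, standing in for clock()
    }
    row.update(zip(index_names, index))   # one column per index level
    row[value_name] = float(value)
    return row


row = make_row((1,), 32.0)
print(sorted(row))  # the column names of this row
```

Storing both a wall-clock timestamp and a CPU time per row is what allows the same series to be plotted against epoch, elapsed time, or compute time.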

Series are saved in HDF5 files, which I'll introduce briefly.

Introduction to PyTables and HDF5
---------------------------------

HDF5_ is a file format intended for storage of big numerical datasets. For our purposes, you'll create a single ``.h5`` file, in which many tables, corresponding to different series, will be stored. Datasets in a single file are organized hierarchically, in the equivalent of "folders" called "groups". The "files" in the analogy would be our tables.

.. _HDF5: http://www.hdfgroup.org/HDF5/

A useful property of HDF5 is that metadata is stored along with the data itself. Notably, we have the table names and column names inside the file. We can also attach more complex data, such as title, or even complex objects (which will be pickled), as attributes.

PyTables_ is a Python library to use the HDF5 format.

.. _PyTables: http://www.pytables.org/moin/HowToUse

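Here's a basic Python session in which I create a new file and store a few rows in a single table. It uses the PyTables 2.x method names; PyTables 3.0 and later spell them ``open_file``, ``create_group``, ``create_table`` and ``get_node``: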

>>> import tables
>>> 
>>> hdf5_file = tables.openFile("mytables.h5", "w")
>>> 
>>> # Create a new subgroup under the root group "/"
... mygroup = hdf5_file.createGroup("/", "mygroup")
>>> 
>>> # Define the type of data we want to store
... class MyDescription(tables.IsDescription):
...     int_column_1 = tables.Int32Col(pos=0)
...     float_column_1 = tables.Float32Col(pos=1)
... 
>>> # Create a table under mygroup
... mytable = hdf5_file.createTable("/mygroup", "mytable", MyDescription)
>>> 
>>> newrow = mytable.row
>>> 
>>> # a first row
... newrow["int_column_1"] = 15
>>> newrow["float_column_1"] = 30.0
>>> newrow.append()
>>> 
>>> # and a second row
... newrow["int_column_1"] = 16
>>> newrow["float_column_1"] = 32.0
>>> newrow.append()
>>> 
>>> # make sure we write to disk
... hdf5_file.flush()
>>> 
>>> hdf5_file.close()


And here's a session in which I reload the data and explore it:

>>> import tables
>>> 
>>> hdf5_file = tables.openFile("mytables.h5", "r")
>>> 
>>> mytable = hdf5_file.getNode("/mygroup", "mytable")
>>> 
>>> # tables can be "sliced" this way
... mytable[0:2]
array([(15, 30.0), (16, 32.0)], 
      dtype=[('int_column_1', '<i4'), ('float_column_1', '<f4')])
>>> 
>>> # or we can access columns individually
... mytable.cols.int_column_1[0:2]
array([15, 16], dtype=int32)


Using ``SeriesTables``: a basic example
---------------------------------------

Here's a very basic usage example:

>>> import tables
>>> from pylearn.io.seriestables import *
>>> 
>>> tables_file = tables.openFile("series.h5", "w")
>>> 
>>> error_series = ErrorSeries(error_name="validation_error", \
...                         table_name="validation_error", \
...                         hdf5_file=tables_file)
>>> 
>>> error_series.append((1,), 32.0)
>>> error_series.append((2,), 28.0)
>>> error_series.append((3,), 26.0)

I can then open the file ``series.h5``, which will contain a table named ``validation_error`` with a column named ``epoch`` and another named ``validation_error``. There will also be ``timestamp`` and ``cpuclock`` columns, as this is the default behavior. The table rows will correspond to the data added with ``append()`` above.

Indices
.......

You may notice that the first parameter in ``append()`` is a tuple. This is because the *index* may have multiple levels. The index is a way for rows to have an order.

In the default case for ErrorSeries, the index only has an "epoch", so the tuple only has one element. But in the ErrorSeries(...) constructor, you could have specified the ``index_names`` parameter, e.g. ``('epoch','minibatch')``, which would allow you to specify both the epoch and the minibatch as index.
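
As an illustration of why a tuple works well as a multi-level index, the sketch below sorts a few rows keyed by ``(epoch, minibatch)``. The values are made up, and this only demonstrates the ordering idea in plain Python; it does not show how SeriesTables stores the index internally.

```python
# Index tuples of the form (epoch, minibatch); the error values are made up.
rows = [
    ((1, 0), 0.90),
    ((0, 1), 1.20),
    ((0, 0), 1.50),
    ((1, 1), 0.80),
]

# Python compares tuples element by element, so sorting on the index
# orders rows by epoch first and by minibatch within each epoch.
ordered = sorted(rows, key=lambda row: row[0])
indices = [index for index, _ in ordered]
print(indices)
```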


Summary of the most useful classes
----------------------------------

By default, for each of these series, there are also columns for timestamp and CPU clock() value when ``append()`` is called. This can be changed with the ``store_timestamp`` and ``store_cpuclock`` parameters of their constructors.

ErrorSeries
  This records one floating point (32 bit) value along with an index in a new table. 

AccumulatorSeriesWrapper
  This wraps another Series and calls its ``append()`` method when its own ``append()`` has been called N times, N being a parameter when constructing the ``AccumulatorSeriesWrapper``. A simple use case: say you want to store the mean of the training error every 100 minibatches. You create an ErrorSeries, wrap it with an Accumulator and then call its ``append()`` for every minibatch. It will collect the errors, wait until it has 100, then take the mean (with ``numpy.mean``) and store it in the ErrorSeries, and start over again.
  Other "reducing" functions can be used instead of "mean".

BasicStatisticsSeries
  This stores the mean, the min, the max and the standard deviation of arrays you pass to its ``append()`` method. This is useful, notably, to see how the weights (and other parameters) evolve during training without actually storing the parameters themselves.

SharedParamsStatisticsWrapper
  This wraps a few BasicStatisticsSeries. It is specifically designed so you can pass it a list of shared (as in theano.shared) parameter arrays. Each array will get its own table, under a new HDF5 group. You can name each table, e.g. "layer1_b", "layer1_W", etc.
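
The accumulate-then-reduce behavior described for ``AccumulatorSeriesWrapper`` can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not the pylearn code: ``RecordingSeries`` and ``Accumulator`` are hypothetical classes, ``statistics.mean`` replaces ``numpy.mean`` to keep the example dependency-free, and forwarding the index of the last accumulated element is an assumption.

```python
import statistics


class RecordingSeries:
    """Hypothetical stand-in for a real series: keeps appended rows in a list."""

    def __init__(self):
        self.rows = []

    def append(self, index, value):
        self.rows.append((index, value))


class Accumulator:
    """Forward a reduced value to `base_series` every `reduce_every` appends."""

    def __init__(self, base_series, reduce_every, reduce_function=statistics.mean):
        self.base_series = base_series
        self.reduce_every = reduce_every
        self.reduce_function = reduce_function
        self._buffer = []

    def append(self, index, value):
        self._buffer.append(value)
        if len(self._buffer) == self.reduce_every:
            # Assumption: the reduced value is stored under the index of the
            # last element that was accumulated.
            self.base_series.append(index, self.reduce_function(self._buffer))
            self._buffer = []


base = RecordingSeries()
wrapped = Accumulator(base, reduce_every=100)
for minibatch in range(250):
    wrapped.append((0, minibatch), float(minibatch))

# Two full groups of 100 were reduced; the last 50 values are still buffered.
print(base.rows)
```

Passing a different ``reduce_function`` (``max``, for instance) swaps the reduction without touching the calling loop.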

Example of real usage
---------------------

The following is a function where I create the series used to record errors and statistics about parameters in a stacked denoising autoencoder script:

.. code-block:: python

	import os

	import tables
	from pylearn.io.seriestables import *

	def create_series(num_hidden_layers):

		# Replace series we don't want to save with DummySeries, e.g.
		# series['training_error'] = DummySeries()

		series = {}

		basedir = os.getcwd()

		h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

		# training error is accumulated over 100 minibatches,
		# then the mean is computed and saved in the training_base series
		training_base = \
					ErrorSeries(error_name="training_error",
						table_name="training_error",
						hdf5_file=h5f,
						index_names=('epoch','minibatch'),
						title="Training error (mean over 100 minibatches)")

		# this series wraps training_base, performs accumulation
		series['training_error'] = \
					AccumulatorSeriesWrapper(base_series=training_base,
						reduce_every=100)

		# valid and test are not accumulated/mean, saved directly
		series['validation_error'] = \
					ErrorSeries(error_name="validation_error",
						table_name="validation_error",
						hdf5_file=h5f,
						index_names=('epoch',))

		series['test_error'] = \
					ErrorSeries(error_name="test_error",
						table_name="test_error",
						hdf5_file=h5f,
						index_names=('epoch',))

		# next we want to store the parameters statistics
		# so first we create the names for each table, based on 
		# position of each param in the array
		param_names = []
		for i in range(num_hidden_layers):
			param_names += ['layer%d_W'%i, 'layer%d_b'%i, 'layer%d_bprime'%i]
		param_names += ['logreg_layer_W', 'logreg_layer_b']

		
		series['params'] = SharedParamsStatisticsWrapper(
							new_group_name="params",
							base_group="/",
							arrays_names=param_names,
							hdf5_file=h5f,
							index_names=('epoch',))

		return series
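
The comment about ``DummySeries`` in the function above relies on a simple idea: a series object that accepts ``append()`` calls and discards them, so the training loop does not need to change when a series is switched off. Here is a rough sketch of that idea. ``NoOpSeries`` and ``ListSeries`` are hypothetical classes written for this illustration, not the pylearn ones.

```python
class NoOpSeries:
    """Hypothetical stand-in for DummySeries: accepts appends and discards them."""

    def append(self, index, element):
        pass


class ListSeries:
    """Hypothetical recording series, used only for this illustration."""

    def __init__(self):
        self.rows = []

    def append(self, index, element):
        self.rows.append((index, element))


series = {
    "training_error": NoOpSeries(),    # not recorded
    "validation_error": ListSeries(),  # recorded
}

# The loop is identical whether or not a given series is being saved.
for epoch in range(3):
    series["training_error"].append((epoch,), 1.0 / (epoch + 1))
    series["validation_error"].append((epoch,), 2.0 / (epoch + 1))

print(len(series["validation_error"].rows))
```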

Then, here's an example of append() usage for each of these series, wrapped in pseudocode:

.. code-block:: python

	series = create_series(num_hidden_layers=3)
	
	...

	for epoch in range(num_epochs):
		for mb_index in range(num_minibatches):
			train_error = finetune(mb_index)
			series['training_error'].append((epoch, mb_index), train_error)

		valid_error = compute_validation_error()
		series['validation_error'].append((epoch,), valid_error)

		test_error = compute_test_error()
		series['test_error'].append((epoch,), test_error)

		# suppose all_params is a list [layer1_W, layer1_b, ...]
		# where each element is a shared (as in theano.shared) array
		series['params'].append((epoch,), all_params)

Other targets for appending (e.g. printing to stdout)
-----------------------------------------------------

SeriesTables was created with an HDF5 file in mind, but often, for debugging,
it's useful to be able to redirect the series elsewhere, notably the standard
output. A mechanism was added to do just that.

To use it, create an ``AppendTarget`` instance (or more than one) and pass it
to the Series constructor through the ``other_targets`` parameter. For example,
to print every appended row to the standard output, use ``StdoutAppendTarget``.

If you want to skip appending to the HDF5 file entirely, this is also
possible. You simply specify ``skip_hdf5_append=True`` in the constructor. You
still need to pass in a valid HDF5 file, though, even though nothing will be
written to it (for, err, legacy reasons).

Here's an example:

.. code-block:: python

	import os

	import tables
	from pylearn.io.seriestables import *

	def create_series(num_hidden_layers):

		# Replace series we don't want to save with DummySeries, e.g.
		# series['training_error'] = DummySeries()

		series = {}

		basedir = os.getcwd()

		h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

		# Here we create the new target, with a message prepended
		# before every row is printed to stdout
		stdout_target = \
			StdoutAppendTarget( \
				prepend='\n-----------------\nValidation error',
				indent_str='\t')

		# Notice here we won't even write to the HDF5 file
		series['validation_error'] = \
			ErrorSeries(error_name="validation_error",
				table_name="validation_error",
				hdf5_file=h5f,
				index_names=('epoch',),
				other_targets=[stdout_target],
				skip_hdf5_append=True)

		return series

		
Now calls to ``series['validation_error'].append()`` will print output like
the following to stdout::

	----------------
	Validation error
		timestamp : 1271202144
		cpuclock : 0.12
		epoch : 1
		validation_error : 30.0

	----------------
	Validation error
		timestamp : 1271202144
		cpuclock : 0.12
		epoch : 2
		validation_error : 26.0
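
To show roughly what such a target does, here is a minimal sketch of a printing target in plain Python. ``PrintTarget`` is a hypothetical class, not the pylearn ``StdoutAppendTarget``, and the assumption that a target receives each row as a mapping from column names to values is mine.

```python
import contextlib
import io


class PrintTarget:
    """Hypothetical stand-in for a stdout target; not the pylearn class."""

    def __init__(self, prepend="", indent_str="\t"):
        self.prepend = prepend
        self.indent_str = indent_str

    def append(self, row):
        # `row` maps column names to values, in the order they should print.
        print(self.prepend)
        for name, value in row.items():
            print("%s%s : %s" % (self.indent_str, name, value))


target = PrintTarget(prepend="\n----------------\nValidation error")

# Capture stdout so the printed text can be inspected afterwards.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    target.append({"epoch": 1, "validation_error": 30.0})

output = captured.getvalue()
print(output)
```

Keeping the target separate from the series means the same ``append()`` call can feed the HDF5 table, the console, or both.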


Visualizing in vitables
-----------------------

vitables_ is a program with which you can easily explore an HDF5 ``.h5`` file. Here's a screenshot in which I visualize series produced for the preceding example:

.. _vitables: http://vitables.berlios.de/

.. image:: images/vitables_example_series.png