.. doc/seriestables.txt, revision 1415:234e5e48d60d
   author: James Bergstra <bergstrj@iro.umontreal.ca>
   date: Thu, 03 Feb 2011 18:07:04 -0500
   changeset message: added datasets.tinyimages.rebuild_numpy_file method
   to build 5GB memmappable file of all images
.. SeriesTables documentation master file, created by
   sphinx-quickstart on Wed Mar 10 17:56:41 2010.
   You can adapt this file completely to your liking, but it should at
   least contain the root `toctree` directive.

Introduction to ``SeriesTables``
--------------------------------

SeriesTables was created to make it easier to **record scalar data
series**, such as, notably, the **evolution of errors (training,
validation, test) during training**.

There are other common use cases I foresee, such as **recording basic
statistics (mean, min/max, variance) of parameters** during training, to
diagnose problems. I also think that if such recording is easily
accessible, it might lead us to record other statistics, such as stats
concerning activations in the network (e.g. to diagnose unit saturation
problems).

Each **element of a series is indexed and timestamped**. By default, for
example, the index is named "epoch", which means that an epoch number is
stored with each row (but this can easily be customized). By default,
the timestamp at row-creation time is also stored, along with the CPU
``clock()`` time. This allows plotting error series against either epoch
or training time.

Series are saved in HDF5 files, which I'll introduce briefly.

Introduction to PyTables and HDF5
---------------------------------

HDF5_ is a file format intended for the storage of big numerical
datasets. In practice, for our purposes, you'll create a single ``.h5``
file in which many tables, corresponding to different series, will be
stored. Datasets in a single file are organized hierarchically, in the
equivalent of "folders", called "groups". The "files" in the analogy
would be our tables.

.. _HDF5: http://www.hdfgroup.org/HDF5/

A useful property of HDF5 is that metadata is stored along with the data
itself. Notably, the table names and column names are kept inside the
file. We can also attach more complex data, such as a title, or even
complex objects (which will be pickled), as attributes.
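For instance, attaching such metadata and reading it back could look
like the sketch below. The file name ``meta_demo.h5``, the column
layout and the ``experiment`` attribute are made up for illustration;
the snake_case method names (``open_file``, ``create_table``, ...) are
those of PyTables 3.x, which renamed the camelCase ``openFile``-style
names used elsewhere in this document.

```python
import os
import tempfile

import tables

class ValidationError(tables.IsDescription):
    epoch = tables.Int32Col(pos=0)
    error = tables.Float32Col(pos=1)

path = os.path.join(tempfile.mkdtemp(), "meta_demo.h5")

h5f = tables.open_file(path, "w")
table = h5f.create_table("/", "validation_error", ValidationError,
                         title="Validation error over epochs")
# Arbitrary objects can be hung on the table as attributes; anything
# PyTables cannot store natively (like this dict) is pickled for us.
table.attrs.experiment = {"learning_rate": 0.01, "batch_size": 100}
h5f.close()

# Reopen the file: the metadata travels with the data.
h5f = tables.open_file(path, "r")
table = h5f.get_node("/", "validation_error")
title_read = table.title                           # "Validation error over epochs"
batch_size = table.attrs.experiment["batch_size"]  # 100
h5f.close()
```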
PyTables_ is a Python library for using the HDF5 format.

.. _PyTables: http://www.pytables.org/moin/HowToUse

Here's a basic Python session in which I create a new file and store a
few rows in a single table:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "w")
>>>
>>> # Create a new subgroup under the root group "/"
... mygroup = hdf5_file.createGroup("/", "mygroup")
>>>
>>> # Define the type of data we want to store
... class MyDescription(tables.IsDescription):
...     int_column_1 = tables.Int32Col(pos=0)
...     float_column_1 = tables.Float32Col(pos=1)
...
>>> # Create a table under mygroup
... mytable = hdf5_file.createTable("/mygroup", "mytable", MyDescription)
>>>
>>> newrow = mytable.row
>>>
>>> # a first row
... newrow["int_column_1"] = 15
>>> newrow["float_column_1"] = 30.0
>>> newrow.append()
>>>
>>> # and a second row
... newrow["int_column_1"] = 16
>>> newrow["float_column_1"] = 32.0
>>> newrow.append()
>>>
>>> # make sure we write to disk
... hdf5_file.flush()
>>>
>>> hdf5_file.close()

And here's a session in which I reload the data and explore it:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "r")
>>>
>>> mytable = hdf5_file.getNode("/mygroup", "mytable")
>>>
>>> # tables can be "sliced" this way
... mytable[0:2]
array([(15, 30.0), (16, 32.0)],
      dtype=[('int_column_1', '<i4'), ('float_column_1', '<f4')])
>>>
>>> # or we can access columns individually
... mytable.cols.int_column_1[0:2]
array([15, 16], dtype=int32)

Using ``SeriesTables``: a basic example
---------------------------------------

Here's a very basic usage example:

>>> import tables
>>> from pylearn.io.seriestables import *
>>>
>>> tables_file = tables.openFile("series.h5", "w")
>>>
>>> error_series = ErrorSeries(error_name="validation_error", \
...                            table_name="validation_error", \
...                            hdf5_file=tables_file)
>>>
>>> error_series.append((1,), 32.0)
>>> error_series.append((2,), 28.0)
>>> error_series.append((3,), 26.0)

I can then open the file ``series.h5``, which will contain a table named
``validation_error`` with a column named ``epoch`` and another named
``validation_error``. There will also be ``timestamp`` and ``cpuclock``
columns, as this is the default behavior. The table rows correspond to
the data added with ``append()`` above.

Indices
.......

You may notice that the first parameter of ``append()`` is a tuple. This
is because the *index* may have multiple levels. The index is a way for
rows to have an order. In the default case for ErrorSeries, the index
has only an "epoch" level, so the tuple contains a single element. But
in the ``ErrorSeries(...)`` constructor, you could have specified the
``index_names`` parameter, e.g. ``('epoch', 'minibatch')``, which would
allow you to specify both the epoch and the minibatch as the index.

Summary of the most useful classes
----------------------------------

By default, each of these series also stores the timestamp and the CPU
``clock()`` value when ``append()`` is called. This can be changed with
the ``store_timestamp`` and ``store_cpuclock`` parameters of their
constructors.

ErrorSeries
    This records one floating-point (32 bit) value along with an index
    in a new table.

AccumulatorSeriesWrapper
    This wraps another series and calls its ``append()`` method when its
    own ``append()`` has been called N times, N being a parameter given
    when constructing the ``AccumulatorSeriesWrapper``. A simple use
    case: say you want to store the mean of the training error every 100
    minibatches. You create an ErrorSeries, wrap it with an accumulator
    and then call the accumulator's ``append()`` for every minibatch. It
    will collect the errors, wait until it has 100, then take the mean
    (with ``numpy.mean``), store it in the ErrorSeries, and start over
    again. Other "reducing" functions can be used instead of "mean".
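The buffer-and-reduce bookkeeping behind such a wrapper is simple. Here
is a plain-Python sketch of the idea — this is *not* the library's
actual code; ``MeanAccumulator``, its parameters and the callback it
forwards to are all made up for illustration:

```python
class MeanAccumulator(object):
    """Sketch of the accumulator idea: buffer values and, every
    `reduce_every` appends, reduce them (here with a mean) and
    forward the result to a base series' append()."""

    def __init__(self, base_append, reduce_every, reduce_function=None):
        self.base_append = base_append  # e.g. an ErrorSeries.append bound method
        self.reduce_every = reduce_every
        self.reduce_function = \
            reduce_function or (lambda xs: sum(xs) / float(len(xs)))
        self.buffer = []

    def append(self, index, value):
        self.buffer.append(value)
        if len(self.buffer) == self.reduce_every:
            # the index of the *last* buffered value labels the reduced row
            self.base_append(index, self.reduce_function(self.buffer))
            self.buffer = []

# Collect into a list instead of an HDF5 table, just to show the flow:
rows = []
acc = MeanAccumulator(lambda idx, v: rows.append((idx, v)), reduce_every=3)
for mb, err in enumerate([3.0, 2.0, 1.0, 6.0, 4.0, 2.0]):
    acc.append((0, mb), err)
# rows is now [((0, 2), 2.0), ((0, 5), 4.0)]
```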
BasicStatisticsSeries
    This stores the mean, the min, the max and the standard deviation of
    the arrays you pass to its ``append()`` method. This is useful,
    notably, to see how the weights (and other parameters) evolve during
    training without actually storing the parameters themselves.

SharedParamsStatisticsWrapper
    This wraps a few BasicStatisticsSeries. It is specifically designed
    so you can pass it a list of shared (as in ``theano.shared``)
    parameter arrays. Each array gets its own table, under a new HDF5
    group. You can name each table, e.g. "layer1_b", "layer1_W", etc.

Example of real usage
---------------------

The following is a function where I create the series used to record
errors and statistics about parameters in a stacked denoising
autoencoder script:

.. code-block:: python

    def create_series(num_hidden_layers):
        # Replace series we don't want to save with DummySeries, e.g.
        # series['training_error'] = DummySeries()

        series = {}

        basedir = os.getcwd()

        h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

        # training error is accumulated over 100 minibatches, then the
        # mean is computed and saved in the training_base series
        training_base = \
            ErrorSeries(error_name="training_error",
                        table_name="training_error",
                        hdf5_file=h5f,
                        index_names=('epoch', 'minibatch'),
                        title="Training error (mean over 100 minibatches)")

        # this series wraps training_base, performs accumulation
        series['training_error'] = \
            AccumulatorSeriesWrapper(base_series=training_base,
                                     reduce_every=100)

        # valid and test are not accumulated/averaged, they are saved directly
        series['validation_error'] = \
            ErrorSeries(error_name="validation_error",
                        table_name="validation_error",
                        hdf5_file=h5f,
                        index_names=('epoch',))

        series['test_error'] = \
            ErrorSeries(error_name="test_error",
                        table_name="test_error",
                        hdf5_file=h5f,
                        index_names=('epoch',))

        # next we want to store the parameter statistics, so first we
        # create the names for each table, based on the position of
        # each param in the array
        param_names = []
        for i in range(num_hidden_layers):
            param_names += ['layer%d_W' % i, 'layer%d_b' % i,
                            'layer%d_bprime' % i]
        param_names += ['logreg_layer_W', 'logreg_layer_b']

        series['params'] = SharedParamsStatisticsWrapper(
                                new_group_name="params",
                                base_group="/",
                                arrays_names=param_names,
                                hdf5_file=h5f,
                                index_names=('epoch',))

        return series

Then, here's an example of ``append()`` usage for each of these series,
wrapped in pseudocode:

.. code-block:: python

    series = create_series(num_hidden_layers=3)

    ...

    for epoch in range(num_epochs):
        for mb_index in range(num_minibatches):
            train_error = finetune(mb_index)
            series['training_error'].append((epoch, mb_index), train_error)

        valid_error = compute_validation_error()
        series['validation_error'].append((epoch,), valid_error)

        test_error = compute_test_error()
        series['test_error'].append((epoch,), test_error)

        # suppose all_params is a list [layer1_W, layer1_b, ...]
        # where each element is a shared (as in theano.shared) array
        series['params'].append((epoch,), all_params)

Other targets for appending (e.g. printing to stdout)
-----------------------------------------------------

SeriesTables was created with an HDF5 file in mind, but often, for
debugging, it's useful to be able to redirect the series elsewhere,
notably to the standard output. A mechanism was added to do just that.

You create an ``AppendTarget`` instance (or more than one) and pass it
as an argument to the series constructor. For example, to print every
appended row to the standard output, you use ``StdoutAppendTarget``.

If you want to skip appending to the HDF5 file entirely, this is also
possible. You simply specify ``skip_hdf5_append=True`` in the
constructor. You still need to pass in a valid HDF5 file, though, even
though nothing will be written to it (for, err, legacy reasons).

Here's an example:

.. code-block:: python

    def create_series(num_hidden_layers):
        # Replace series we don't want to save with DummySeries, e.g.
        # series['training_error'] = DummySeries()

        series = {}

        basedir = os.getcwd()

        h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

        # Here we create the new target, with a message prepended
        # before every row is printed to stdout
        stdout_target = \
            StdoutAppendTarget(
                prepend='\n-----------------\nValidation error',
                indent_str='\t')

        # Notice here we won't even write to the HDF5 file
        series['validation_error'] = \
            ErrorSeries(error_name="validation_error",
                        table_name="validation_error",
                        hdf5_file=h5f,
                        index_names=('epoch',),
                        other_targets=[stdout_target],
                        skip_hdf5_append=True)

        return series

Now calls to ``series['validation_error'].append()`` will print outputs
like this to stdout::

    ----------------
    Validation error
        timestamp : 1271202144
        cpuclock : 0.12
        epoch : 1
        validation_error : 30.0

    ----------------
    Validation error
        timestamp : 1271202144
        cpuclock : 0.12
        epoch : 2
        validation_error : 26.0

Visualizing in vitables
-----------------------

vitables_ is a program with which you can easily explore an HDF5 ``.h5``
file. Here's a screenshot in which I visualize series produced for the
preceding example:

.. _vitables: http://vitables.berlios.de/

.. image:: images/vitables_example_series.png
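Besides browsing the file in vitables, the series can of course be read
back programmatically with PyTables, e.g. to plot an error curve. The
sketch below recreates, by hand, a toy table with the same layout as the
``validation_error`` table from the basic example (minus the default
``timestamp``/``cpuclock`` columns), then pulls the columns back out;
the file name is made up, and the snake_case method names are those of
PyTables 3.x rather than the camelCase names used above.

```python
import os
import tempfile

import tables

class ValidationError(tables.IsDescription):
    epoch = tables.Int32Col(pos=0)
    validation_error = tables.Float32Col(pos=1)

path = os.path.join(tempfile.mkdtemp(), "series_demo.h5")
h5f = tables.open_file(path, "w")
table = h5f.create_table("/", "validation_error", ValidationError)

# Write the same three rows the ErrorSeries example appended
row = table.row
for epoch, err in [(1, 32.0), (2, 28.0), (3, 26.0)]:
    row["epoch"] = epoch
    row["validation_error"] = err
    row.append()
h5f.flush()

# Columns come back as numpy arrays, ready for plotting,
# e.g. pylab.plot(epochs, errors)
epochs = table.cols.epoch[:]
errors = table.cols.validation_error[:]
h5f.close()
```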