comparison doc/seriestables.txt @ 911:fdb63e4e042d

Had forgotten to hg add SeriesTable .txt doc

author fsavard
date Thu, 18 Mar 2010 12:29:36 -0400
.. SeriesTables documentation master file, created by
   sphinx-quickstart on Wed Mar 10 17:56:41 2010.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Introduction to ``SeriesTables``
--------------------------------

SeriesTables was created to make it easier to **record scalar data series**, notably the **evolution of errors (training, validation, test) during training**. I foresee other common use cases, such as **recording basic statistics (mean, min/max, variance) of parameters** during training, to diagnose problems.

I also think that if such recording is easily accessible, it might lead us to record other statistics, such as statistics about activations in the network (e.g. to diagnose unit saturation problems).

Each **element of a series is indexed and timestamped**. By default, the index is named "epoch", which means that an epoch number is stored with each row (this can easily be customized). By default, the timestamp at row creation time is also stored, along with the CPU clock() time. This makes it possible to plot error series against either epoch number or training time.

Series are saved in HDF5 files, a format I'll introduce briefly.

Introduction to PyTables and HDF5
---------------------------------

HDF5_ is a file format intended for storing large numerical datasets. In practice, for our purposes, you'll create a single ``.h5`` file in which many tables, corresponding to different series, are stored. Datasets within a single file are organized hierarchically, in the equivalent of "folders", which are called "groups"; the "files" in this analogy are our tables.

.. _HDF5: http://www.hdfgroup.org/HDF5/

A useful property of HDF5 is that metadata is stored along with the data itself. Notably, the table names and column names live inside the file. We can also attach more complex data, such as a title, or even complex Python objects (which will be pickled), as attributes.
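
For instance, attributes can be attached like this (a short doctest-style sketch using the ``mytable`` object from the session below; the attribute names here are made up for illustration):

>>> mytable.attrs.experiment_name = "my first experiment"
>>> mytable.attrs.hyperparams = {'learning_rate': 0.01}  # pickled automatically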

PyTables_ is a Python library for working with the HDF5 format.

.. _PyTables: http://www.pytables.org/moin/HowToUse

Here's a basic Python session in which I create a new file and store a few rows in a single table:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "w")
>>>
>>> # Create a new subgroup under the root group "/"
... mygroup = hdf5_file.createGroup("/", "mygroup")
>>>
>>> # Define the type of data we want to store
... class MyDescription(tables.IsDescription):
...     int_column_1 = tables.Int32Col(pos=0)
...     float_column_1 = tables.Float32Col(pos=1)
...
>>> # Create a table under mygroup
... mytable = hdf5_file.createTable("/mygroup", "mytable", MyDescription)
>>>
>>> newrow = mytable.row
>>>
>>> # a first row
... newrow["int_column_1"] = 15
>>> newrow["float_column_1"] = 30.0
>>> newrow.append()
>>>
>>> # and a second row
... newrow["int_column_1"] = 16
>>> newrow["float_column_1"] = 32.0
>>> newrow.append()
>>>
>>> # make sure we write to disk
... hdf5_file.flush()
>>>
>>> hdf5_file.close()


And here's a session in which I reload the data and explore it:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "r")
>>>
>>> mytable = hdf5_file.getNode("/mygroup", "mytable")
>>>
>>> # tables can be "sliced" this way
... mytable[0:2]
array([(15, 30.0), (16, 32.0)],
      dtype=[('int_column_1', '<i4'), ('float_column_1', '<f4')])
>>>
>>> # or we can access columns individually
... mytable.cols.int_column_1[0:2]
array([15, 16], dtype=int32)
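
Rows can also be iterated over directly, which is sometimes more convenient than slicing; a small sketch in the same session (the output shown is what I'd expect, formatting may vary slightly):

>>> for row in mytable:
...     print row["int_column_1"], row["float_column_1"]
...
15 30.0
16 32.0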


Using ``SeriesTables``: a basic example
---------------------------------------

Here's a very basic usage example:

>>> import tables
>>> from pylearn.io.seriestables import *
>>>
>>> tables_file = tables.openFile("series.h5", "w")
>>>
>>> error_series = ErrorSeries(error_name="validation_error",
...                            table_name="validation_error",
...                            hdf5_file=tables_file)
>>>
>>> error_series.append((1,), 32.0)
>>> error_series.append((2,), 28.0)
>>> error_series.append((3,), 26.0)

I can then open the file ``series.h5``, which will contain a table named ``validation_error`` with a column named ``epoch`` and another named ``validation_error``. There will also be ``timestamp`` and ``cpuclock`` columns, as this is the default behavior. The table rows will correspond to the data added with ``append()`` above.
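
As a quick check, the series can be read back with the same PyTables calls as in the earlier session (a sketch; I'm assuming the table is created under the root group, and the exact dtypes may differ):

>>> import tables
>>> h5f = tables.openFile("series.h5", "r")
>>> series_table = h5f.getNode("/", "validation_error")
>>> series_table.cols.epoch[:]
array([1, 2, 3])
>>> series_table.cols.validation_error[:]
array([ 32.,  28.,  26.], dtype=float32)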

Indices
.......

You may notice that the first parameter of ``append()`` is a tuple. This is because the *index* may have multiple levels; the index is what gives the rows their order.

In the default case for ErrorSeries, the index has a single level, "epoch", so the tuple contains only one element. But you can specify the ``index_names`` parameter in the ErrorSeries(...) constructor, e.g. ``('epoch','minibatch')``, to use both the epoch and the minibatch as the index.
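
For example, a two-level version of the series above could be created and fed like this (a sketch reusing the ``tables_file`` handle from the session above):

>>> training_series = ErrorSeries(error_name="training_error",
...                               table_name="training_error",
...                               hdf5_file=tables_file,
...                               index_names=('epoch', 'minibatch'))
>>> training_series.append((1, 35), 27.0)  # epoch 1, minibatch 35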


Summary of the most useful classes
----------------------------------

By default, each of these series also stores columns for the timestamp and the CPU ``clock()`` value at the time ``append()`` is called. This can be changed with the ``store_timestamp`` and ``store_cpuclock`` parameters of their constructors, as sketched below.
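
For example (a sketch reusing the ``ErrorSeries`` constructor and the ``tables_file`` handle shown earlier):

>>> error_series = ErrorSeries(error_name="validation_error",
...                            table_name="validation_error",
...                            hdf5_file=tables_file,
...                            store_timestamp=False,
...                            store_cpuclock=False)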

ErrorSeries
    This records one floating-point (32 bit) value along with an index in a new table.

AccumulatorSeriesWrapper
    This wraps another Series and calls its ``append()`` method once its own ``append()`` has been called N times, N being a parameter given when constructing the ``AccumulatorSeriesWrapper``. A simple use case: say you want to store the mean of the training error every 100 minibatches. You create an ErrorSeries, wrap it with an Accumulator and then call its ``append()`` for every minibatch. It will collect the errors, wait until it has 100, take their mean (with ``numpy.mean``), store it in the ErrorSeries, and start over again. Other "reducing" functions can be used instead of the mean; see the sketch after this list.

BasicStatisticsSeries
    This stores the mean, the min, the max and the standard deviation of arrays you pass to its ``append()`` method. This is useful, notably, to see how the weights (and other parameters) evolve during training without actually storing the parameters themselves.

SharedParamsStatisticsWrapper
    This wraps a few BasicStatisticsSeries. It is specifically designed so you can pass it a list of shared (as in ``theano.shared``) parameter arrays. Each array gets its own table, under a new HDF5 group. You can name each table, e.g. "layer1_b", "layer1_W", etc.
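
Here is a short sketch of the accumulator pattern, using only parameters that also appear in the full example below (``base_series`` and ``reduce_every``), and again the ``tables_file`` handle from earlier:

>>> training_base = ErrorSeries(error_name="training_error",
...                             table_name="training_error",
...                             hdf5_file=tables_file,
...                             index_names=('epoch', 'minibatch'))
>>> training_series = AccumulatorSeriesWrapper(base_series=training_base,
...                                            reduce_every=100)
>>> # each call accumulates; every 100th call stores the mean
... training_series.append((1, 35), 27.0)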


Example of real usage
---------------------

The following is a function in which I create the series used to record errors and statistics about parameters in a stacked denoising autoencoder script:

.. code-block:: python

    import os
    import tables
    from pylearn.io.seriestables import *

    def create_series(num_hidden_layers):

        # Replace series we don't want to save with DummySeries, e.g.
        # series['training_error'] = DummySeries()

        series = {}

        basedir = os.getcwd()

        h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

        # training error is accumulated over 100 minibatches,
        # then the mean is computed and saved in the training_base series
        training_base = \
            ErrorSeries(error_name="training_error",
                        table_name="training_error",
                        hdf5_file=h5f,
                        index_names=('epoch', 'minibatch'),
                        title="Training error (mean over 100 minibatches)")

        # this series wraps training_base and performs the accumulation
        series['training_error'] = \
            AccumulatorSeriesWrapper(base_series=training_base,
                                     reduce_every=100)

        # valid and test errors are not accumulated/averaged,
        # they are saved directly
        series['validation_error'] = \
            ErrorSeries(error_name="validation_error",
                        table_name="validation_error",
                        hdf5_file=h5f,
                        index_names=('epoch',))

        series['test_error'] = \
            ErrorSeries(error_name="test_error",
                        table_name="test_error",
                        hdf5_file=h5f,
                        index_names=('epoch',))

        # next we want to store the parameter statistics, so first we
        # create a table name for each parameter, based on its position
        # in the params list
        param_names = []
        for i in range(num_hidden_layers):
            param_names += ['layer%d_W' % i, 'layer%d_b' % i,
                            'layer%d_bprime' % i]
        param_names += ['logreg_layer_W', 'logreg_layer_b']

        series['params'] = SharedParamsStatisticsWrapper(
                               new_group_name="params",
                               base_group="/",
                               arrays_names=param_names,
                               hdf5_file=h5f,
                               index_names=('epoch',))

        return series

Then, here's an example of ``append()`` usage for each of these series, embedded in the pseudocode of a training loop:

.. code-block:: python

    series = create_series(num_hidden_layers=3)

    ...

    for epoch in range(num_epochs):
        for mb_index in range(num_minibatches):
            train_error = finetune(mb_index)
            series['training_error'].append((epoch, mb_index), train_error)

        valid_error = compute_validation_error()
        series['validation_error'].append((epoch,), valid_error)

        test_error = compute_test_error()
        series['test_error'].append((epoch,), test_error)

        # suppose all_params is a list [layer1_W, layer1_b, ...]
        # where each element is a shared (as in theano.shared) array
        series['params'].append((epoch,), all_params)
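
As in the PyTables session at the beginning, it is a good idea to flush and close the file once training is done, to make sure everything is written to disk; a short sketch, assuming the ``h5f`` handle created in ``create_series()`` is kept accessible (it is not returned by the function above, so this is illustrative):

.. code-block:: python

    h5f.flush()
    h5f.close()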

Visualizing in vitables
-----------------------

vitables_ is a program with which you can easily explore an HDF5 ``.h5`` file. Here's a screenshot in which I visualize the series produced in the preceding example:

.. _vitables: http://vitables.berlios.de/

.. image:: images/vitables_example_series.png