annotate doc/seriestables.txt @ 1143:fa1715e759e3
Added API file for coding style committee (now we just need to fill it)
author | Olivier Delalleau <delallea@iro> |
---|---|
date | Thu, 16 Sep 2010 14:17:34 -0400 |
parents | 34d1cd516f76 |
children |
.. SeriesTables documentation master file, created by
   sphinx-quickstart on Wed Mar 10 17:56:41 2010.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Introduction to ``SeriesTables``
--------------------------------

SeriesTables was created to make it easier to **record scalar data series**, most notably the **evolution of errors (training, validation, test) during training**. I foresee other common use cases, such as **recording basic statistics (mean, min/max, variance) of parameters** during training, to diagnose problems.

I also think that if such recording is easily accessible, it might lead us to record other statistics, such as statistics on activations in the network (e.g. to diagnose unit saturation problems).

Each **element of a series is indexed and timestamped**. By default, for example, the index is named "epoch", which means that an epoch number is stored with each row (this is easy to customize). By default, the timestamp at row creation time is also stored, along with the CPU clock() time. This makes it possible to plot error series against either epoch or training time.

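To make the row layout described above concrete, here is a small standard-library sketch of what one recorded element could contain. It is not how SeriesTables builds rows internally: ``make_row`` is a hypothetical helper, the column names are taken from this document, and ``time.process_time()`` stands in for the CPU clock() value (``time.clock()`` no longer exists in recent Python versions).

```python
import time


def make_row(index_names, index, value_name, value):
    """Build one series element as a plain dict (illustrative only)."""
    if len(index_names) != len(index):
        raise ValueError("index must have one entry per index name")
    row = {
        "timestamp": time.time(),          # wall-clock time at row creation
        "cpuclock": time.process_time(),   # CPU time used by this process
    }
    row.update(zip(index_names, index))    # e.g. {"epoch": 1}
    row[value_name] = value
    return row


# One element of a "validation_error" series, indexed by epoch only
row = make_row(("epoch",), (1,), "validation_error", 32.0)
print(sorted(row))
```

Keeping both a wall-clock and a CPU-time column in every row is what allows the same series to be plotted against epoch, elapsed time, or compute time.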
Series are saved in HDF5 files, which I'll introduce briefly.

Introduction to PyTables and HDF5
---------------------------------

HDF5_ is a file format intended for storing large numerical datasets. In practice, for our purposes, you'll create a single ``.h5`` file in which many tables, corresponding to different series, will be stored. Datasets in a single file are organized hierarchically, in the equivalent of "folders" called "groups". The "files" in the analogy would be our tables.

.. _HDF5: http://www.hdfgroup.org/HDF5/

A useful property of HDF5 is that metadata is stored along with the data itself. Notably, the table names and column names are stored inside the file. We can also attach more complex data as attributes, such as a title or even arbitrary objects (which will be pickled).

PyTables_ is a Python library for working with the HDF5 format.

.. _PyTables: http://www.pytables.org/moin/HowToUse

Here's a basic Python session in which I create a new file and store a few rows in a single table:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "w")
>>>
>>> # Create a new subgroup under the root group "/"
... mygroup = hdf5_file.createGroup("/", "mygroup")
>>>
>>> # Define the type of data we want to store
... class MyDescription(tables.IsDescription):
...     int_column_1 = tables.Int32Col(pos=0)
...     float_column_1 = tables.Float32Col(pos=1)
...
>>> # Create a table under mygroup
... mytable = hdf5_file.createTable("/mygroup", "mytable", MyDescription)
>>>
>>> newrow = mytable.row
>>>
>>> # a first row
... newrow["int_column_1"] = 15
>>> newrow["float_column_1"] = 30.0
>>> newrow.append()
>>>
>>> # and a second row
... newrow["int_column_1"] = 16
>>> newrow["float_column_1"] = 32.0
>>> newrow.append()
>>>
>>> # make sure we write to disk
... hdf5_file.flush()
>>>
>>> hdf5_file.close()


And here's a session in which I reload the data and explore it:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "r")
>>>
>>> mytable = hdf5_file.getNode("/mygroup", "mytable")
>>>
>>> # tables can be "sliced" this way
... mytable[0:2]
array([(15, 30.0), (16, 32.0)],
      dtype=[('int_column_1', '<i4'), ('float_column_1', '<f4')])
>>>
>>> # or we can access columns individually
... mytable.cols.int_column_1[0:2]
array([15, 16], dtype=int32)


Using ``SeriesTables``: a basic example
---------------------------------------

Here's a very basic usage example:

>>> import tables
>>> from pylearn.io.seriestables import *
>>>
>>> tables_file = tables.openFile("series.h5", "w")
>>>
>>> error_series = ErrorSeries(error_name="validation_error",
...                            table_name="validation_error",
...                            hdf5_file=tables_file)
>>>
>>> error_series.append((1,), 32.0)
>>> error_series.append((2,), 28.0)
>>> error_series.append((3,), 26.0)

I can then open the file ``series.h5``, which will contain a table named ``validation_error`` with a column named ``epoch`` and another named ``validation_error``. There will also be ``timestamp`` and ``cpuclock`` columns, as this is the default behavior. The table rows will correspond to the data added with ``append()`` above.

Indices
.......

You may notice that the first parameter of ``append()`` is a tuple. This is because the *index* may have multiple levels. The index is what gives rows an order.

In the default case for ErrorSeries, the index only has an "epoch" level, so the tuple only has one element. In the ErrorSeries(...) constructor, however, you can specify the ``index_names`` parameter, e.g. ``('epoch','minibatch')``, in which case each call to ``append()`` takes both the epoch and the minibatch as its index.


Summary of the most useful classes
----------------------------------

By default, each of these series also has columns for the timestamp and the CPU clock() value at the time append() is called. This can be changed with the store_timestamp and store_cpuclock parameters of their constructors.

ErrorSeries
    This records one floating point (32 bit) value along with an index in a new table.

AccumulatorSeriesWrapper
    This wraps another Series and calls its ``append()`` method once its own ``append()`` has been called N times, N being a parameter given when constructing the ``AccumulatorSeriesWrapper``. A simple use case: say you want to store the mean of the training error every 100 minibatches. You create an ErrorSeries, wrap it with an Accumulator and then call its ``append()`` for every minibatch. It will collect the errors, wait until it has 100, then take the mean (with ``numpy.mean``), store it in the ErrorSeries, and start over again.
    Other "reducing" functions can be used instead of "mean".

BasicStatisticsSeries
    This stores the mean, the min, the max and the standard deviation of arrays you pass to its ``append()`` method. This is useful, notably, to see how the weights (and other parameters) evolve during training without actually storing the parameters themselves.

SharedParamsStatisticsWrapper
    This wraps a few BasicStatisticsSeries. It is specifically designed so you can pass it a list of shared (as in theano.shared) parameter arrays. Each array will get its own table, under a new HDF5 group. You can name each table, e.g. "layer1_b", "layer1_W", etc.

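The accumulate-then-reduce behaviour described for ``AccumulatorSeriesWrapper`` can be sketched in plain Python as follows. This is only an illustration of the idea, not the pylearn implementation: ``ListSeries`` and ``AccumulateEveryN`` are hypothetical stand-ins, the ``reduce_function`` parameter name is an assumption, and storing the reduced value under the index of the last element received is also an assumption.

```python
from statistics import mean


class ListSeries:
    """Stand-in for a real series: just remembers what was appended."""

    def __init__(self):
        self.rows = []

    def append(self, index, value):
        self.rows.append((index, value))


class AccumulateEveryN:
    """Forward one reduced value to base_series every reduce_every appends."""

    def __init__(self, base_series, reduce_every, reduce_function=mean):
        self.base_series = base_series
        self.reduce_every = reduce_every
        self.reduce_function = reduce_function
        self._buffer = []

    def append(self, index, value):
        self._buffer.append(value)
        if len(self._buffer) == self.reduce_every:
            # Assumption: the reduced value is stored under the index of
            # the last element received, then the buffer starts over.
            self.base_series.append(index, self.reduce_function(self._buffer))
            self._buffer = []


base = ListSeries()
acc = AccumulateEveryN(base, reduce_every=3)
for mb_index, err in enumerate([4.0, 2.0, 3.0, 1.0, 1.0, 4.0]):
    acc.append((0, mb_index), err)
print(base.rows)
```

Passing ``max`` or ``min`` as the reducing function gives the other kinds of summaries mentioned above without changing the calling code.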
Example of real usage
---------------------

The following is a function where I create the series used to record errors and statistics about parameters in a stacked denoising autoencoder script:

.. code-block:: python

    import os

    import tables
    from pylearn.io.seriestables import *

    def create_series(num_hidden_layers):

        # Replace series we don't want to save with DummySeries, e.g.
        # series['training_error'] = DummySeries()

        series = {}

        basedir = os.getcwd()

        h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

        # training error is accumulated over 100 minibatches,
        # then the mean is computed and saved in the training_base series
        training_base = \
            ErrorSeries(error_name="training_error",
                        table_name="training_error",
                        hdf5_file=h5f,
                        index_names=('epoch','minibatch'),
                        title="Training error (mean over 100 minibatches)")

        # this series wraps training_base, performs accumulation
        series['training_error'] = \
            AccumulatorSeriesWrapper(base_series=training_base,
                                     reduce_every=100)

        # valid and test are not accumulated/mean, saved directly
        series['validation_error'] = \
            ErrorSeries(error_name="validation_error",
                        table_name="validation_error",
                        hdf5_file=h5f,
                        index_names=('epoch',))

        series['test_error'] = \
            ErrorSeries(error_name="test_error",
                        table_name="test_error",
                        hdf5_file=h5f,
                        index_names=('epoch',))

        # next we want to store the parameters statistics
        # so first we create the names for each table, based on
        # position of each param in the array
        param_names = []
        for i in range(num_hidden_layers):
            param_names += ['layer%d_W'%i, 'layer%d_b'%i, 'layer%d_bprime'%i]
        param_names += ['logreg_layer_W', 'logreg_layer_b']


        series['params'] = SharedParamsStatisticsWrapper(
                                new_group_name="params",
                                base_group="/",
                                arrays_names=param_names,
                                hdf5_file=h5f,
                                index_names=('epoch',))

        return series

Then, here's an example of append() usage for each of these series, wrapped in pseudocode:

.. code-block:: python

    series = create_series(num_hidden_layers=3)

    ...

    for epoch in range(num_epochs):
        for mb_index in range(num_minibatches):
            train_error = finetune(mb_index)
            series['training_error'].append((epoch, mb_index), train_error)

        valid_error = compute_validation_error()
        series['validation_error'].append((epoch,), valid_error)

        test_error = compute_test_error()
        series['test_error'].append((epoch,), test_error)

        # suppose all_params is a list [layer1_W, layer1_b, ...]
        # where each element is a shared (as in theano.shared) array
        series['params'].append((epoch,), all_params)

929
34d1cd516f76
Added other targets (printing to stdout, notably) to seriestables, and corresponding doc
fsavard
parents:
911
diff
changeset
|
Other targets for appending (e.g. printing to stdout)
-----------------------------------------------------

SeriesTables was created with an HDF5 file in mind, but often, for debugging,
it's useful to be able to redirect the series elsewhere, notably to the standard
output. A mechanism was added to do just that.

To use it, create an ``AppendTarget`` instance (or more than one) and pass it
to the Series constructor through the ``other_targets`` argument. For example,
to print every appended row to the standard output, use StdoutAppendTarget.

If you want to skip appending to the HDF5 file entirely, this is also
possible: simply specify ``skip_hdf5_append=True`` in the constructor. You
still need to pass in a valid HDF5 file, though, even though nothing will be
written to it (for legacy reasons).

Here's an example:

.. code-block:: python

    import os

    import tables
    from pylearn.io.seriestables import *

    def create_series(num_hidden_layers):

        # Replace series we don't want to save with DummySeries, e.g.
        # series['training_error'] = DummySeries()

        series = {}

        basedir = os.getcwd()

        h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

        # Here we create the new target, with a message prepended
        # before every row is printed to stdout
        stdout_target = StdoutAppendTarget(
            prepend='\n-----------------\nValidation error',
            indent_str='\t')

        # Notice here we won't even write to the HDF5 file
        series['validation_error'] = \
            ErrorSeries(error_name="validation_error",
                        table_name="validation_error",
                        hdf5_file=h5f,
                        index_names=('epoch',),
                        other_targets=[stdout_target],
                        skip_hdf5_append=True)

        return series


Now calls to series['validation_error'].append() will print output like the
following to stdout::

    ----------------
    Validation error
        timestamp : 1271202144
        cpuclock : 0.12
        epoch : 1
        validation_error : 30.0

    ----------------
    Validation error
        timestamp : 1271202144
        cpuclock : 0.12
        epoch : 2
        validation_error : 26.0


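To give a feel for what such a target does with each row, here is a minimal stand-alone sketch that reproduces the output format shown above. It does not use or reproduce the real ``AppendTarget`` interface: ``format_row`` and ``PrintTarget`` are hypothetical names, and the single-argument ``append(row)`` signature is an assumption made for illustration.

```python
def format_row(row, prepend="", indent_str="\t"):
    """Render one row as text: an optional header, then one field per line."""
    lines = [prepend] if prepend else []
    for name, value in row.items():
        lines.append("%s%s : %s" % (indent_str, name, value))
    return "\n".join(lines)


class PrintTarget:
    """Hypothetical target: prints every row it receives to stdout."""

    def __init__(self, prepend="", indent_str="\t"):
        self.prepend = prepend
        self.indent_str = indent_str

    def append(self, row):
        print(format_row(row, self.prepend, self.indent_str))


target = PrintTarget(prepend="\n----------------\nValidation error")
target.append({"epoch": 1, "validation_error": 30.0})
```

Separating the formatting from the printing keeps the sketch easy to check, and the same ``append()`` shape could just as well write to a log file instead of stdout.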
Visualizing in vitables
-----------------------

vitables_ is a program with which you can easily explore an HDF5 ``.h5`` file. Here's a screenshot in which I visualize series produced for the preceding example:

.. _vitables: http://vitables.berlios.de/

.. image:: images/vitables_example_series.png