Mercurial > ift6266
annotate datasets/ftfile.py @ 266:1e4e60ddadb1
Merge. Oh, and in the last commit I forgot to mention that I added code to handle isolating different clones, so that experiments can run while the code is being modified.
author | fsavard |
---|---|
date | Fri, 19 Mar 2010 10:56:16 -0400 |
parents | 966272e7f14b |
children | a92ec9939e4f |
rev | line source |
---|---|
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
1 from pylearn.io.filetensor import _read_header, _prod |
178
938bd350dbf0
Make the datasets iterators return theano shared slices with the appropriate types.
Arnaud Bergeron <abergeron@gmail.com>
parents:
177
diff
changeset
|
2 import numpy, theano |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
3 from dataset import DataSet |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
4 from dsetiter import DataIterator |
178
938bd350dbf0
Make the datasets iterators return theano shared slices with the appropriate types.
Arnaud Bergeron <abergeron@gmail.com>
parents:
177
diff
changeset
|
5 from itertools import izip, imap |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
6 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
7 class FTFile(object): |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
8 def __init__(self, fname, scale=1, dtype=None): |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
9 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
10 Tests: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
11 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
12 """ |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
13 self.file = open(fname, 'rb') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
14 self.magic_t, self.elsize, _, self.dim, _ = _read_header(self.file, False) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
15 self.size = self.dim[0] |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
16 self.scale = scale |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
17 self.dtype = dtype |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
18 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
19 def skip(self, num): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
20 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
21 Skips `num` items in the file. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
22 |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
23 If `num` is negative, skips size + num items (all but the last -num). |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
24 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
25 Tests: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
26 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
27 >>> f.size |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
28 58646 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
29 >>> f.elsize |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
30 4 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
31 >>> f.file.tell() |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
32 20 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
33 >>> f.skip(1000) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
34 >>> f.file.tell() |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
35 4020 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
36 >>> f.size |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
37 57646 |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
38 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
39 >>> f.size |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
40 58646 |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
41 >>> f.file.tell() |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
42 20 |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
43 >>> f.skip(-1000) |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
44 >>> f.file.tell() |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
45 230604 |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
46 >>> f.size |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
47 1000 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
48 """ |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
49 if num < 0: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
50 num += self.size |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
51 if num < 0: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
52 raise ValueError('Skipping past the start of the file') |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
53 if num >= self.size: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
54 self.size = 0 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
55 else: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
56 self.size -= num |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
57 f_start = self.file.tell() |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
58 self.file.seek(f_start + (self.elsize * _prod(self.dim[1:]) * num)) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
59 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
60 def read(self, num): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
61 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
62 Reads `num` elements from the file and returns the result as a |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
63 numpy array. A read that runs past the end of the file is truncated. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
64 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
65 Tests: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
66 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
67 >>> f.read(1) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
68 array([6], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
69 >>> f.read(10) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
70 array([7, 4, 7, 5, 6, 4, 8, 0, 9, 6], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
71 >>> f.skip(58630) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
72 >>> f.read(10) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
73 array([9, 2, 4, 2, 8], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
74 >>> f.read(10) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
75 array([], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
76 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_data.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
77 >>> f.read(1) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
78 array([[0, 0, 0, ..., 0, 0, 0]], dtype=uint8) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
79 """ |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
80 if num > self.size: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
81 num = self.size |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
82 self.dim[0] = num |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
83 self.size -= num |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
84 res = numpy.fromfile(self.file, dtype=self.magic_t, count=_prod(self.dim)).reshape(self.dim) |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
85 if self.dtype is not None: |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
86 res = res.astype(self.dtype) |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
87 if self.scale != 1: |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
88 res /= self.scale |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
89 return res |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
90 |
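The bookkeeping that `skip` and `read` perform on `size` and on the file offset can be modelled without a real `.ft` file. The sketch below is an illustration only and is not part of `ftfile.py`: the class name `Cursor` and its `pos` and `row_items` attributes are hypothetical, while the 20-byte header, the 4-byte element size and the 58646-item count are the values shown in the doctests above.

```python
class Cursor(object):
    """Hypothetical stand-in for FTFile that tracks only counts and offsets."""

    def __init__(self, size, elsize, header=20, row_items=1):
        self.size = size            # items still available
        self.elsize = elsize        # bytes per element
        self.row_items = row_items  # elements per item, i.e. product of dim[1:]
        self.pos = header           # byte offset, just past the header

    def skip(self, num):
        # A negative count is measured from the end: keep the last -num items.
        if num < 0:
            num += self.size
            if num < 0:
                raise ValueError('Skipping past the start of the file')
        # Same size update as FTFile.skip: never let the count go negative.
        if num >= self.size:
            self.size = 0
        else:
            self.size -= num
        self.pos += self.elsize * self.row_items * num

    def read(self, num):
        # The last read is truncated to whatever is left.
        if num > self.size:
            num = self.size
        self.size -= num
        self.pos += self.elsize * self.row_items * num
        return num  # how many items the real read would return


c = Cursor(size=58646, elsize=4)
c.skip(-1000)
print(c.pos, c.size)  # 230604 1000, the values in the skip(-1000) doctest
```

The same model reproduces the `read` doctest: after `read(1)`, `read(10)` and `skip(58630)` only five items remain, so the next `read(10)` yields five and the one after that yields none.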
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
91 class FTSource(object): |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
92 def __init__(self, file, skip=0, size=None, maxsize=None, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
93 dtype=None, scale=1): |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
94 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
95 Create a data source from a possible subset of a .ft file. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
96 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
97 Parameters: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
98 `file` -- (string) the filename |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
99 `skip` -- (int, optional) number of examples to skip from |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
100 the start of the file. If negative, skips |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
101 filesize + skip examples. |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
102 `size` -- (int, optional) truncates number of examples |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
103 read (after skipping). If negative truncates to |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
104 filesize + size (also after skipping). |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
105 `maxsize` -- (int, optional) the maximum number of examples to read from the file |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
106 `dtype` -- (dtype, optional) convert the data to this |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
107 dtype after reading. |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
108 `scale` -- (number, optional) scale (that is divide) the |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
109 data by this number (after dtype conversion, if |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
110 any). |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
111 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
112 Tests: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
113 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft') |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
114 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1000) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
115 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=10) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
116 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=100, size=120) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
117 """ |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
118 self.file = file |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
119 self.skip = skip |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
120 self.size = size |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
121 self.dtype = dtype |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
122 self.scale = scale |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
123 self.maxsize = maxsize |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
124 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
125 def open(self): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
126 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
127 Returns an FTFile that corresponds to this dataset. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
128 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
129 Tests: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
130 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft') |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
131 >>> f = s.open() |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
132 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
133 >>> len(s.open().read(2)) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
134 1 |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
135 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
136 >>> s.open().size |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
137 1000 |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
138 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646, size=1) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
139 >>> s.open().size |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
140 1 |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
141 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=-10) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
142 >>> s.open().size |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
143 58636 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
144 """ |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
145 f = FTFile(self.file, scale=self.scale, dtype=self.dtype) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
146 if self.skip != 0: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
147 f.skip(self.skip) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
148 if self.size is not None and self.size < f.size: |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
149 if self.size < 0: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
150 f.size += self.size |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
151 if f.size < 0: |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
152 f.size = 0 |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
153 else: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
154 f.size = self.size |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
155 if self.maxsize is not None and f.size > self.maxsize: |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
156 f.size = self.maxsize |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
157 return f |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
158 |
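How `skip`, `size` and `maxsize` combine in `FTSource.open` is easier to see as a pure function of the item count. The following is a minimal sketch, assuming only the logic visible in `FTFile.skip` and `FTSource.open` above; the function name `effective_size` and its `total` parameter are hypothetical and do not exist in the module.

```python
def effective_size(total, skip=0, size=None, maxsize=None):
    """Hypothetical helper: how many examples an opened source would expose.

    `total` is the number of examples stored in the file.
    """
    remaining = total

    # Skipping, as in FTFile.skip: a negative skip keeps the last -skip items.
    if skip < 0:
        skip += remaining
        if skip < 0:
            raise ValueError('Skipping past the start of the file')
    remaining = 0 if skip >= remaining else remaining - skip

    # Truncation, as in FTSource.open: a negative size drops items from the end.
    if size is not None and size < remaining:
        if size < 0:
            remaining = max(remaining + size, 0)
        else:
            remaining = size

    # The cap is applied last, after skip and size.
    if maxsize is not None and remaining > maxsize:
        remaining = maxsize
    return remaining


print(effective_size(58646, size=-10))  # 58636, as in the size=-10 doctest
```

With the 58646-example file used in the doctests, `skip=57646` leaves 1000 examples and adding `size=1` leaves one, which agrees with the expected values listed in the `open` docstring.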
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
159 class FTData(object): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
160 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
161 This is a list of FTSources. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
162 """ |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
163 def __init__(self, datafiles, labelfiles, skip=0, size=None, maxsize=None, |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
164 inscale=1, indtype=None, outscale=1, outdtype=None): |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
165 if maxsize is not None: |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
166 maxsize /= len(datafiles) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
167 self.inputs = [FTSource(f, skip, size, maxsize, scale=inscale, dtype=indtype) |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
168 for f in datafiles] |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
169 self.outputs = [FTSource(f, skip, size, maxsize, scale=outscale, dtype=outdtype) |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
170 for f in labelfiles] |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
171 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
172 def open_inputs(self): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
173 return [f.open() for f in self.inputs] |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
174 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
175 def open_outputs(self): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
176 return [f.open() for f in self.outputs] |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
177 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
178 |
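`FTData.__init__` divides `maxsize` by the number of data files before handing it to each `FTSource`, so the cap is a per-file share of the total. A minimal sketch of that split follows, assuming integer inputs; the helper name `per_file_maxsize` is hypothetical, and `//` is used so the result is an integer under both Python 2 and Python 3.

```python
def per_file_maxsize(maxsize, nfiles):
    """Hypothetical helper: share of a total example budget given to each file."""
    if maxsize is None:
        return None  # no cap was requested
    # Floor division matches what `/=` does for ints under Python 2.
    return maxsize // nfiles


share = per_file_maxsize(1000, 3)
print(share, share * 3)  # 333 999
```

Because the division rounds down, the files together can expose slightly fewer examples than the requested `maxsize`.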
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
179 class FTDataSet(DataSet): |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
180 def __init__(self, train_data, train_lbl, test_data, test_lbl, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
181 valid_data=None, valid_lbl=None, indtype=None, outdtype=None, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
182 inscale=1, outscale=1, maxsize=None): |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
183 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
184 Defines a DataSet from a bunch of files. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
185 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
186 Parameters: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
187 `train_data` -- list of train data files |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
188 `train_lbl` -- list of train label files (same length as `train_data`) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
189 `test_data`, `test_lbl` -- same thing as train, but for |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
190 test. The number of files |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
191 can differ from train. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
192 `valid_data`, `valid_lbl` -- same thing again for validation. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
193 (optional) |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
194 `indtype`, `outdtype`, -- see FTSource.__init__() |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
195 `inscale`, `outscale` (optional) |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
196 `maxsize` -- maximum size of the set returned |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
197 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
198 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
199 If `valid_data` and `valid_lbl` are not supplied, then a sample |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
200 approximately equal in size to the test set is taken from the train |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
201 set. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
202 """ |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
203 if valid_data is None: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
204 total_valid_size = min(s for s in (sum(FTFile(td).size for td in test_data), maxsize) if s is not None)  # skip the cap when maxsize is None |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
205 valid_size = total_valid_size // len(train_data) |
214
1faae5079522
The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
181
diff
changeset
|
206 self._train = FTData(train_data, train_lbl, size=-valid_size, |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
207 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
208 indtype=indtype, outdtype=outdtype, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
209 maxsize=maxsize) |
214
1faae5079522
The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
181
diff
changeset
|
210 self._valid = FTData(train_data, train_lbl, skip=-valid_size, |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
211 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
212 indtype=indtype, outdtype=outdtype, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
213 maxsize=maxsize) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
214 else: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
215 self._train = FTData(train_data, train_lbl, maxsize=maxsize, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
216 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
217 indtype=indtype, outdtype=outdtype) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
218 self._valid = FTData(valid_data, valid_lbl, maxsize=maxsize, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
219 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
220 indtype=indtype, outdtype=outdtype) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
221 self._test = FTData(test_data, test_lbl, maxsize=maxsize, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
222 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
223 indtype=indtype, outdtype=outdtype) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
224 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
225 def _return_it(self, batchsize, bufsize, ftdata): |
177
be714ac9bcbd
Use izip(), not zip() to return a lazy iterator. (datasets)
Arnaud Bergeron <abergeron@gmail.com>
parents:
173
diff
changeset
|
226 return izip(DataIterator(ftdata.open_inputs(), batchsize, bufsize), |
181
f0f47b045cbf
Remove a stray cast in the FTDataSet code and export the ocr dataset.
Arnaud Bergeron <abergeron@gmail.com>
parents:
180
diff
changeset
|
227 DataIterator(ftdata.open_outputs(), batchsize, bufsize)) |