annotate datasets/ftfile.py @ 266:1e4e60ddadb1

Merge. Ah, and in the last commit I forgot to mention that I added code to handle isolating different clones, so that experiments can be run while the code is being modified.
author fsavard
date Fri, 19 Mar 2010 10:56:16 -0400
parents 966272e7f14b
children a92ec9939e4f
rev   line source
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
1 from pylearn.io.filetensor import _read_header, _prod
178
938bd350dbf0 Make the datasets iterators return theano shared slices with the appropriate types.
Arnaud Bergeron <abergeron@gmail.com>
parents: 177
diff changeset
2 import numpy, theano
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
3 from dataset import DataSet
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
4 from dsetiter import DataIterator
178
938bd350dbf0 Make the datasets iterators return theano shared slices with the appropriate types.
Arnaud Bergeron <abergeron@gmail.com>
parents: 177
diff changeset
5 from itertools import izip, imap
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
6
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
7 class FTFile(object):
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
8 def __init__(self, fname, scale=1, dtype=None):
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
9 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
10 Tests:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
11 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
12 """
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
13 self.file = open(fname, 'rb')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
14 self.magic_t, self.elsize, _, self.dim, _ = _read_header(self.file, False)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
15 self.size = self.dim[0]
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
16 self.scale = scale
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
17 self.dtype = dtype
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
18
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
19 def skip(self, num):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
20 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
21 Skips `num` items in the file.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
22
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
23 If `num` is negative, skips size + num items, leaving the last -num items.
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
24
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
25 Tests:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
26 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
27 >>> f.size
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
28 58646
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
29 >>> f.elsize
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
30 4
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
31 >>> f.file.tell()
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
32 20
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
33 >>> f.skip(1000)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
34 >>> f.file.tell()
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
35 4020
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
36 >>> f.size
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
37 57646
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
38 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
39 >>> f.size
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
40 58646
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
41 >>> f.file.tell()
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
42 20
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
43 >>> f.skip(-1000)
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
44 >>> f.file.tell()
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
45 230604
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
46 >>> f.size
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
47 1000
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
48 """
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
49 if num < 0:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
50 num += self.size
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
51 if num < 0:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
52 raise ValueError('Skipping past the start of the file')
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
53 if num >= self.size:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
54 self.size = 0
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
55 else:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
56 self.size -= num
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
57 f_start = self.file.tell()
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
58 self.file.seek(f_start + (self.elsize * _prod(self.dim[1:]) * num))
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
59
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
60 def read(self, num):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
61 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
62 Reads `num` elements from the file and returns the result as a
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
63 numpy array. The last read is truncated to the remaining items.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
64
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
65 Tests:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
66 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
67 >>> f.read(1)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
68 array([6], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
69 >>> f.read(10)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
70 array([7, 4, 7, 5, 6, 4, 8, 0, 9, 6], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
71 >>> f.skip(58630)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
72 >>> f.read(10)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
73 array([9, 2, 4, 2, 8], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
74 >>> f.read(10)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
75 array([], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
76 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_data.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
77 >>> f.read(1)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
78 array([[0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
79 """
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
80 if num > self.size:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
81 num = self.size
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
82 self.dim[0] = num
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
83 self.size -= num
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
84 res = numpy.fromfile(self.file, dtype=self.magic_t, count=_prod(self.dim)).reshape(self.dim)
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
85 if self.dtype is not None:
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
86 res = res.astype(self.dtype)
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
87 if self.scale != 1:
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
88 res /= self.scale
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
89 return res
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
90
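The offset arithmetic in `FTFile.skip` can be checked without access to a .ft file. The following is a minimal standalone sketch, not part of the module: `resolve_skip` and its parameters are hypothetical names, and the 20-byte header position and 4-byte element size are taken from the doctests above.

```python
from functools import reduce
from operator import mul


def resolve_skip(num, size, elsize, row_shape, pos):
    """Mirror the bookkeeping of FTFile.skip (hypothetical helper).

    num       -- items to skip; negative means "keep only the last -num items"
    size      -- items still available in the file
    elsize    -- bytes per element
    row_shape -- shape of one item, i.e. dim[1:] (empty tuple for labels)
    pos       -- current byte position in the file
    Returns (new_pos, new_size).
    """
    if num < 0:
        num += size
        if num < 0:
            raise ValueError('Skipping past the start of the file')
    # Skipping everything (or more) leaves an empty file view.
    new_size = 0 if num >= size else size - num
    # One item occupies elsize bytes times the product of its dimensions.
    row_bytes = elsize * reduce(mul, row_shape, 1)
    return pos + row_bytes * num, new_size


# Values from the skip() doctests: 58646 labels, 4 bytes each, header ends at 20.
print(resolve_skip(1000, 58646, 4, (), 20))
print(resolve_skip(-1000, 58646, 4, (), 20))
```

With the doctest values, a positive skip moves the position forward by `num` items, while a negative skip moves it so that only the last `-num` items remain.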
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
91 class FTSource(object):
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
92 def __init__(self, file, skip=0, size=None, maxsize=None,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
93 dtype=None, scale=1):
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
94 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
95 Create a data source from a possible subset of a .ft file.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
96
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
97 Parameters:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
98 `file` -- (string) the filename
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
99 `skip` -- (int, optional) number of examples to skip from
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
100 the start of the file. If negative, skips
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
101 filesize + skip.
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
102 `size` -- (int, optional) truncates number of examples
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
103 read (after skipping). If negative, truncates to
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
104 filesize + size (also after skipping).
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
105 `maxsize` -- (int, optional) the maximum number of examples to read from the file
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
106 `dtype` -- (dtype, optional) convert the data to this
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
107 dtype after reading.
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
108 `scale` -- (number, optional) scale (that is divide) the
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
109 data by this number (after dtype conversion, if
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
110 any).
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
111
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
112 Tests:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
113 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft')
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
114 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1000)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
115 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=10)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
116 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=100, size=120)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
117 """
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
118 self.file = file
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
119 self.skip = skip
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
120 self.size = size
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
121 self.dtype = dtype
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
122 self.scale = scale
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
123 self.maxsize = maxsize
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
124
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
125 def open(self):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
126 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
127 Returns an FTFile that corresponds to this dataset.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
128
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
129 Tests:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
130 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft')
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
131 >>> f = s.open()
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
132 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
133 >>> len(s.open().read(2))
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
134 1
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
135 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
136 >>> s.open().size
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
137 1000
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
138 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646, size=1)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
139 >>> s.open().size
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
140 1
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
141 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=-10)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
142 >>> s.open().size
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
143 58636
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
144 """
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
145 f = FTFile(self.file, scale=self.scale, dtype=self.dtype)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
146 if self.skip != 0:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
147 f.skip(self.skip)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
148 if self.size is not None and self.size < f.size:
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
149 if self.size < 0:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
150 f.size += self.size
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
151 if f.size < 0:
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
152 f.size = 0
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
153 else:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
154 f.size = self.size
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
155 if self.maxsize is not None and f.size > self.maxsize:
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
156 f.size = self.maxsize
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
157 return f
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
158
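The way `FTSource.open` combines `size` and `maxsize` after skipping can be summarised in a few lines. This is a hedged standalone sketch of that logic only: `resolve_size` is a hypothetical name, and `available` stands for `f.size` once the skip has been applied.

```python
def resolve_size(available, size=None, maxsize=None):
    """Mirror the size handling in FTSource.open (hypothetical helper).

    available -- examples left after skipping
    size      -- requested size; negative means "drop this many from the end"
    maxsize   -- hard cap applied last
    """
    result = available
    if size is not None and size < available:
        if size < 0:
            # Negative size shortens the view, but never below zero.
            result = max(available + size, 0)
        else:
            result = size
    if maxsize is not None and result > maxsize:
        result = maxsize
    return result


# Cases taken from the open() doctests above.
print(resolve_size(58646, size=-10))
print(resolve_size(1000, size=1))
```

A `size` larger than what is available is ignored, so the cap from `maxsize` is the only limit that can apply in that case.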
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
159 class FTData(object):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
160 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
161 This is a list of FTSources.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
162 """
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
163 def __init__(self, datafiles, labelfiles, skip=0, size=None, maxsize=None,
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
164 inscale=1, indtype=None, outscale=1, outdtype=None):
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
165 if maxsize is not None:
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
166 maxsize /= len(datafiles)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
167 self.inputs = [FTSource(f, skip, size, maxsize, scale=inscale, dtype=indtype)
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
168 for f in datafiles]
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
169 self.outputs = [FTSource(f, skip, size, maxsize, scale=outscale, dtype=outdtype)
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
170 for f in labelfiles]
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
171
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
172 def open_inputs(self):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
173 return [f.open() for f in self.inputs]
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
174
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
175 def open_outputs(self):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
176 return [f.open() for f in self.outputs]
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
177
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
178
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
179 class FTDataSet(DataSet):
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
180 def __init__(self, train_data, train_lbl, test_data, test_lbl,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
181 valid_data=None, valid_lbl=None, indtype=None, outdtype=None,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
182 inscale=1, outscale=1, maxsize=None):
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
183 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
184 Defines a DataSet from a bunch of files.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
185
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
186 Parameters:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
187 `train_data` -- list of train data files
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
188 `train_lbl` -- list of train label files (same length as `train_data`)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
189 `test_data`, `test_lbl` -- same thing as train, but for
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
190 test. The number of files
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
191 can differ from train.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
192 `valid_data`, `valid_lbl` -- same thing again for validation.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
193 (optional)
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
194 `indtype`, `outdtype`, -- see FTSource.__init__()
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
195 `inscale`, `outscale` (optional)
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
196 `maxsize` -- maximum size of the set returned
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
197
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
198
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
199 If `valid_data` and `valid_lbl` are not supplied then a sample
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
200 approximately equal in size to the test set is taken from the train
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
201 set.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
202 """
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
203 if valid_data is None:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
204 total_valid_size = min(sum(FTFile(td).size for td in test_data), maxsize)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
205 valid_size = total_valid_size // len(train_data)
214
1faae5079522 The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 181
diff changeset
206 self._train = FTData(train_data, train_lbl, size=-valid_size,
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
207 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
208 indtype=indtype, outdtype=outdtype,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
209 maxsize=maxsize)
214
1faae5079522 The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 181
diff changeset
210 self._valid = FTData(train_data, train_lbl, skip=-valid_size,
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
211 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
212 indtype=indtype, outdtype=outdtype,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
213 maxsize=maxsize)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
214 else:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
215 self._train = FTData(train_data, train_lbl, maxsize=maxsize,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
216 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
217 indtype=indtype, outdtype=outdtype)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
218 self._valid = FTData(valid_data, valid_lbl, maxsize=maxsize,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
219 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
220 indtype=indtype, outdtype=outdtype)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
221 self._test = FTData(test_data, test_lbl, maxsize=maxsize,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
222 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
223 indtype=indtype, outdtype=outdtype)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
224
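The arithmetic behind the automatic validation split in the constructor above can be sketched in isolation. This is a minimal illustration, not the module's code: `plan_validation_split` and `split_tail` are hypothetical helpers, and the reading of `size=-valid_size` as "everything except the last `valid_size` examples of each file" and `skip=-valid_size` as "only the last `valid_size` examples" is an assumption, since `FTData` itself is not shown in this part of the file.

```python
def plan_validation_split(test_sizes, n_train_files, maxsize):
    """Mirror the constructor's arithmetic for the automatic validation split.

    test_sizes    -- number of examples in each test file
    n_train_files -- number of train files the validation sample is drawn from
    maxsize       -- upper bound on the size of any returned set
    """
    # The validation set aims for the size of the test set, capped by maxsize.
    total_valid_size = min(sum(test_sizes), maxsize)
    # Each train file contributes an equal share; integer division drops the
    # remainder, which is why the result is only "approximately" equal.
    per_file = total_valid_size // n_train_files
    return total_valid_size, per_file


def split_tail(examples, valid_size):
    """ASSUMED semantics of size=-n / skip=-n: hold out the last n examples."""
    if valid_size == 0:
        return list(examples), []
    return list(examples[:-valid_size]), list(examples[-valid_size:])


total, per_file = plan_validation_split([1000], 3, 10**9)
train_part, valid_part = split_tail(list(range(10)), 3)
```

With a single 1000-example test file and three train files, each train file gives up 333 examples, so the validation set holds 999 examples rather than 1000.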
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
225 def _return_it(self, batchsize, bufsize, ftdata):
177
be714ac9bcbd Use izip(), not zip() to return a lazy iterator. (datasets)
Arnaud Bergeron <abergeron@gmail.com>
parents: 173
diff changeset
226 return izip(DataIterator(ftdata.open_inputs(), batchsize, bufsize),
181
f0f47b045cbf Remove a stray cast in the FTDataSet code and export the ocr dataset.
Arnaud Bergeron <abergeron@gmail.com>
parents: 180
diff changeset
227 DataIterator(ftdata.open_outputs(), batchsize, bufsize))
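The lazy pairing done by `_return_it` can be illustrated with a small stand-in. This is a sketch under stated assumptions: `batches` and `paired_batches` are hypothetical names, `batches` only imitates the batching role of `DataIterator` (it ignores `bufsize` and does not produce theano shared slices), and Python 3's built-in `zip` is used as the lazy counterpart of Python 2's `itertools.izip`.

```python
def batches(items, batchsize):
    """Yield successive lists of at most `batchsize` items, one at a time."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batchsize:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def paired_batches(inputs, outputs, batchsize):
    """Pair input and output batches lazily, as izip() does in _return_it."""
    # zip() in Python 3 (izip() in Python 2) pulls one batch from each side
    # per step, so the data is never materialised all at once.
    return zip(batches(inputs, batchsize), batches(outputs, batchsize))


pairs = paired_batches(iter(range(5)), iter("abcde"), 2)
first = next(pairs)
rest = list(pairs)
```

Because both sides are consumed in lockstep, inputs and labels stay aligned as long as the two streams have the same length and use the same batch size, which is why the docstring requires the label list to match the data list.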