annotate datasets/ftfile.py @ 624:49933073590c

added jmlr_review1.txt and jmlr_review2.txt
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sun, 13 Mar 2011 18:25:25 -0400
parents 337253b82409
children
rev   line source
615
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
1 from itertools import izip
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
2 import os
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
3
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
4 import numpy
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
5 from pylearn.io.filetensor import _read_header, _prod
615
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
6
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
7 from dataset import DataSet
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
8 from dsetiter import DataIterator
615
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
9
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
10
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
11 class FTFile(object):
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
12 def __init__(self, fname, scale=1, dtype=None):
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
13 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
14 Tests:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
15 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
16 """
615
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
17 if os.path.exists(fname):
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
18 self.file = open(fname, 'rb')
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
19 self.magic_t, self.elsize, _, self.dim, _ = _read_header(self.file, False)
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
20 self.gz=False
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
21 else:
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
22 import gzip
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
23 self.file = gzip.open(fname+'.gz','rb')
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
24 self.magic_t, self.elsize, _, self.dim, _ = _read_header(self.file, False, True)
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
25 self.gz=True
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
26
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
27 self.size = self.dim[0]
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
28 self.scale = scale
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
29 self.dtype = dtype
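A minimal sketch of the plain-file / gzip fallback used in `FTFile.__init__` above. The helper name `open_maybe_gzipped` is hypothetical and not part of the listing; it only illustrates the "try `fname`, else `fname + '.gz'`" rule and leaves out the header parsing done by `_read_header`.

```python
import gzip
import os


def open_maybe_gzipped(fname):
    """Open fname if it exists, otherwise fall back to fname + '.gz'.

    Returns (file_object, is_gzipped), mirroring the self.file / self.gz
    pair set in the listing.
    """
    if os.path.exists(fname):
        return open(fname, 'rb'), False
    # As in the listing, the .gz path is not checked beforehand;
    # gzip.open raises an error if that file is missing too.
    return gzip.open(fname + '.gz', 'rb'), True
```

The returned flag matters later because a gzip stream has to be read into memory before it can be converted to an array, whereas a plain file can be read directly.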
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
30
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
31 def skip(self, num):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
32 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
33 Skips `num` items in the file.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
34
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
35 If `num` is negative, skips size + num items (all but the last -num).
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
36
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
37 Tests:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
38 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
39 >>> f.size
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
40 58646
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
41 >>> f.elsize
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
42 4
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
43 >>> f.file.tell()
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
44 20
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
45 >>> f.skip(1000)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
46 >>> f.file.tell()
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
47 4020
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
48 >>> f.size
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
49 57646
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
50 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
51 >>> f.size
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
52 58646
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
53 >>> f.file.tell()
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
54 20
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
55 >>> f.skip(-1000)
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
56 >>> f.file.tell()
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
57 230604
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
58 >>> f.size
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
59 1000
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
60 """
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
61 if num < 0:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
62 num += self.size
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
63 if num < 0:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
64 raise ValueError('Skipping past the start of the file')
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
65 if num >= self.size:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
66 self.size = 0
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
67 else:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
68 self.size -= num
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
69 f_start = self.file.tell()
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
70 self.file.seek(f_start + (self.elsize * _prod(self.dim[1:]) * num))
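The arithmetic in `skip` can be checked without the data file. The sketch below is a hypothetical pure function (`skip_plan` is not in the listing) that takes the values the method reads from `self` and returns the new remaining size and byte offset; the numbers in the usage lines are taken from the doctest above.

```python
def skip_plan(size, num, elsize, item_shape=(), pos=0):
    """Return (new_size, new_pos) after skipping num items.

    size       -- items remaining before the skip
    num        -- items to skip; negative values count from the end
    elsize     -- bytes per element
    item_shape -- shape of one item (dim[1:] in the listing)
    pos        -- current byte offset in the file
    """
    if num < 0:
        num += size
        if num < 0:
            raise ValueError('Skipping past the start of the file')
    # Number of elements in one item (product of the trailing dimensions).
    elems_per_item = 1
    for d in item_shape:
        elems_per_item *= d
    new_size = 0 if num >= size else size - num
    return new_size, pos + elsize * elems_per_item * num


# Values from the doctest: 58646 labels, 4 bytes each, 20-byte header.
print(skip_plan(58646, 1000, 4, (), 20))
print(skip_plan(58646, -1000, 4, (), 20))
```

A negative `num` is first turned into the equivalent positive skip, which is why `skip(-1000)` leaves exactly 1000 items.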
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
71
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
72 def read(self, num):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
73 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
74 Reads `num` elements from the file and returns the result as a
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
75 numpy array. The last read is truncated to the items that remain.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
76
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
77 Tests:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
78 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
79 >>> f.read(1)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
80 array([6], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
81 >>> f.read(10)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
82 array([7, 4, 7, 5, 6, 4, 8, 0, 9, 6], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
83 >>> f.skip(58630)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
84 >>> f.read(10)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
85 array([9, 2, 4, 2, 8], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
86 >>> f.read(10)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
87 array([], dtype=int32)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
88 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_data.ft')
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
89 >>> f.read(1)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
90 array([[0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
91 """
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
92 if num > self.size:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
93 num = self.size
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
94 self.dim[0] = num
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
95 self.size -= num
615
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
96 if self.gz:
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
97 d = self.file.read(_prod(self.dim)*self.elsize)
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
98 res = numpy.fromstring(d, dtype=self.magic_t, count=_prod(self.dim)).reshape(self.dim)
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
99 else:
337253b82409 repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents: 614
diff changeset
100 res = numpy.fromfile(self.file, dtype=self.magic_t, count=_prod(self.dim)).reshape(self.dim)
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
101 if self.dtype is not None:
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
102 res = res.astype(self.dtype)
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
103 if self.scale != 1:
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
104 res /= self.scale
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
105 return res
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
106
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
107 class FTSource(object):
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
108 def __init__(self, file, skip=0, size=None, maxsize=None,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
109 dtype=None, scale=1):
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
110 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
111 Create a data source from a possible subset of a .ft file.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
112
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
113 Parameters:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
114 `file` -- (string) the filename
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
115 `skip` -- (int, optional) number of examples to skip from
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
116 the start of the file. If negative, skips
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
117 filesize + skip (keeps the last -skip examples).
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
118 `size` -- (int, optional) truncates number of examples
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
119 read (after skipping). If negative, truncates to
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
120 filesize + size (also after skipping).
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
121 `maxsize` -- (int, optional) the maximum number of examples to read from the file
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
122 `dtype` -- (dtype, optional) convert the data to this
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
123 dtype after reading.
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
124 `scale` -- (number, optional) scale (that is, divide) the
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
125 data by this number (after dtype conversion, if
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
126 any).
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
127
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
128 Tests:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
129 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft')
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
130 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1000)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
131 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=10)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
132 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=100, size=120)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
133 """
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
134 self.file = file
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
135 self.skip = skip
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
136 self.size = size
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
137 self.dtype = dtype
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
138 self.scale = scale
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
139 self.maxsize = maxsize
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
140
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
141 def open(self):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
142 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
143 Returns an FTFile that corresponds to this dataset.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
144
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
145 Tests:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
146 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft')
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
147 >>> f = s.open()
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
148 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
149 >>> len(s.open().read(2))
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
150 1
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
151 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
152 >>> s.open().size
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
153 1000
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
154 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646, size=1)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
155 >>> s.open().size
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
156 1
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
157 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=-10)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
158 >>> s.open().size
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
159 58636
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
160 """
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
161 f = FTFile(self.file, scale=self.scale, dtype=self.dtype)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
162 if self.skip != 0:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
163 f.skip(self.skip)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
164 if self.size is not None and self.size < f.size:
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
165 if self.size < 0:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
166 f.size += self.size
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
167 if f.size < 0:
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
168 f.size = 0
173
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
169 else:
954185d6002a Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents: 163
diff changeset
170 f.size = self.size
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
171 if self.maxsize is not None and f.size > self.maxsize:
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
172 f.size = self.maxsize
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
173 return f
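How `skip`, `size` and `maxsize` combine in `FTSource.open` can be sketched as a single pure function. `effective_size` is a hypothetical name introduced for illustration; it follows the order of the checks in the listing (skip first, then `size`, then `maxsize`) and is checked against the sizes quoted in the doctest above.

```python
def effective_size(filesize, skip=0, size=None, maxsize=None):
    """Number of examples left readable after applying skip, size and maxsize."""
    remaining = filesize
    if skip != 0:
        if skip < 0:
            # A negative skip counts from the end of the file.
            skip += remaining
            if skip < 0:
                raise ValueError('Skipping past the start of the file')
        remaining = 0 if skip >= remaining else remaining - skip
    if size is not None and size < remaining:
        if size < 0:
            # A negative size drops that many examples from the end.
            remaining = max(remaining + size, 0)
        else:
            remaining = size
    if maxsize is not None and remaining > maxsize:
        remaining = maxsize
    return remaining


print(effective_size(58646, skip=57646))
print(effective_size(58646, size=-10))
```

Because `maxsize` is applied last, it caps the result regardless of how `skip` and `size` were given.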
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
174
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
175 class FTData(object):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
176 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
177 Holds two lists of FTSources: one for inputs and one for outputs.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
178 """
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
179 def __init__(self, datafiles, labelfiles, skip=0, size=None, maxsize=None,
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
180 inscale=1, indtype=None, outscale=1, outdtype=None):
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
181 if maxsize is not None:
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
182 maxsize /= len(datafiles)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
183 self.inputs = [FTSource(f, skip, size, maxsize, scale=inscale, dtype=indtype)
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
184 for f in datafiles]
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
185 self.outputs = [FTSource(f, skip, size, maxsize, scale=outscale, dtype=outdtype)
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
186 for f in labelfiles]
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
187
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
188 def open_inputs(self):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
189 return [f.open() for f in self.inputs]
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
190
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
191 def open_outputs(self):
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
192 return [f.open() for f in self.outputs]
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
193
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
194
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
195 class FTDataSet(DataSet):
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
196 def __init__(self, train_data, train_lbl, test_data, test_lbl,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
197 valid_data=None, valid_lbl=None, indtype=None, outdtype=None,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
198 inscale=1, outscale=1, maxsize=None):
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
199 r"""
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
200 Defines a DataSet from a bunch of files.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
201
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
202 Parameters:
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
203 `train_data` -- list of train data files
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
204 `train_lbl` -- list of train label files (same length as `train_data`)
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
205 `test_data`, `test_lbl` -- same thing as train, but for
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
206 test. The number of files
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
207 can differ from train.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
208 `valid_data`, `valid_lbl` -- same thing again for validation.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
209 (optional)
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
210 `indtype`, `outdtype`, -- see FTSource.__init__()
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
211 `inscale`, `outscale` (optional)
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
212 `maxsize` -- maximum size of the set returned
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 178
diff changeset
213
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
214
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
215 If `valid_data` and `valid_lbl` are not supplied then a sample
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
216 approximately equal in size to the test set is taken from the train
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
217 set.
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
218 """
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
219 if valid_data is None:
271
a92ec9939e4f fixed a problem with maxsize when not provided
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 257
diff changeset
220 total_valid_size = sum(FTFile(td).size for td in test_data)
a92ec9939e4f fixed a problem with maxsize when not provided
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 257
diff changeset
221 if maxsize is not None:
a92ec9939e4f fixed a problem with maxsize when not provided
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents: 257
diff changeset
222 total_valid_size = min(total_valid_size, maxsize)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
223 valid_size = total_valid_size // len(train_data)
214
1faae5079522 The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 181
diff changeset
224 self._train = FTData(train_data, train_lbl, size=-valid_size,
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
225 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
226 indtype=indtype, outdtype=outdtype,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
227 maxsize=maxsize)
214
1faae5079522 The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 181
diff changeset
228 self._valid = FTData(train_data, train_lbl, skip=-valid_size,
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
229 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
230 indtype=indtype, outdtype=outdtype,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
231 maxsize=maxsize)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
232 else:
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
233 self._train = FTData(train_data, train_lbl, maxsize=maxsize,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
234 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
235 indtype=indtype, outdtype=outdtype)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
236 self._valid = FTData(valid_data, valid_lbl, maxsize=maxsize,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
237 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
238 indtype=indtype, outdtype=outdtype)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
239 self._test = FTData(test_data, test_lbl, maxsize=maxsize,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
240 inscale=inscale, outscale=outscale,
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 214
diff changeset
241 indtype=indtype, outdtype=outdtype)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
242
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
243 def _return_it(self, batchsize, bufsize, ftdata):
177
be714ac9bcbd Use izip(), not zip() to return a lazy iterator. (datasets)
Arnaud Bergeron <abergeron@gmail.com>
parents: 173
diff changeset
244 return izip(DataIterator(ftdata.open_inputs(), batchsize, bufsize),
181
f0f47b045cbf Remove a stray cast in the FTDataSet code and export the ocr dataset.
Arnaud Bergeron <abergeron@gmail.com>
parents: 180
diff changeset
245 DataIterator(ftdata.open_outputs(), batchsize, bufsize))
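The train/validation split in `FTDataSet.__init__` above can be sketched in isolation. The snippet below is only an illustration, not the module's code: `split_sizes` mirrors the `valid_size` arithmetic, while `take` is a hypothetical stand-in for how `FTSource` might treat negative values, under the assumption that a negative `size` means "all but the last n" and a negative `skip` means "only the last n". Plain lists replace the filetensor files.

```python
def split_sizes(test_sizes, n_train_files, maxsize=None):
    """Per-train-file validation size, mirroring the arithmetic in __init__."""
    total_valid_size = sum(test_sizes)
    if maxsize is not None:
        total_valid_size = min(total_valid_size, maxsize)
    # Integer division: each train file gives up the same number of examples.
    return total_valid_size // n_train_files


def take(examples, skip=0, size=None):
    """Hypothetical skip/size handling (assumed semantics, not FTSource itself)."""
    n = len(examples)
    if skip < 0:
        # Assumed: negative skip keeps only the last -skip examples.
        skip = max(n + skip, 0)
    rest = examples[skip:]
    if size is None:
        return rest
    if size < 0:
        # Assumed: negative size drops the last -size examples.
        return rest[:max(len(rest) + size, 0)]
    return rest[:size]


examples = list(range(10))
valid_size = 3
train_part = take(examples, size=-valid_size)   # first seven examples
valid_part = take(examples, skip=-valid_size)   # last three examples
```

Under these assumptions the two calls partition each train file: the training view stops where the validation view starts, so no example appears in both sets.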