annotate datasets/ftfile.py @ 647:47af8a002530 tip
changed Theano to ift6266 and remove numpy as we do not use code from numpy in this repository
author | Razvan Pascanu <r.pascanu@gmail.com> |
---|---|
date | Wed, 17 Oct 2012 09:26:14 -0400 |
parents | 337253b82409 |
children |
rev | line source |
---|---|
615
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
1 from itertools import izip |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
2 import os |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
3 |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
4 import numpy |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
5 from pylearn.io.filetensor import _read_header, _prod |
615
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
6 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
7 from dataset import DataSet |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
8 from dsetiter import DataIterator |
615
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
9 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
10 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
11 class FTFile(object): |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
12 def __init__(self, fname, scale=1, dtype=None): |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
13 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
14 Tests: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
15 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
16 """ |
615
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
17 if os.path.exists(fname): |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
18 self.file = open(fname, 'rb') |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
19 self.magic_t, self.elsize, _, self.dim, _ = _read_header(self.file, False) |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
20 self.gz=False |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
21 else: |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
22 import gzip |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
23 self.file = gzip.open(fname+'.gz','rb') |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
24 self.magic_t, self.elsize, _, self.dim, _ = _read_header(self.file, False, True) |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
25 self.gz=True |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
26 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
27 self.size = self.dim[0] |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
28 self.scale = scale |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
29 self.dtype = dtype |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
30 |
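The constructor above opens `fname` directly when it exists and otherwise falls back to `fname + '.gz'`. The following is a minimal standalone sketch of that fallback only; `open_maybe_gz` is a hypothetical helper name that does not appear in this module, and the header parsing done by `_read_header` is deliberately left out.

```python
import gzip
import os


def open_maybe_gz(fname):
    """Return (file_object, is_gzipped) for fname or its .gz sibling.

    Hypothetical helper mirroring the branch in FTFile.__init__:
    the plain file wins when both variants exist.
    """
    if os.path.exists(fname):
        return open(fname, 'rb'), False
    # If fname + '.gz' is missing as well, gzip.open raises an I/O error.
    return gzip.open(fname + '.gz', 'rb'), True
```

Returning the flag alongside the file object matches how the class keeps `self.gz` around so that `read` can later pick a gzip-compatible read path.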
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
31 def skip(self, num): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
32 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
33 Skips `num` items in the file. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
34 |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
35 If `num` is negative, skips size + num items, so that only the last -num remain. |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
36 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
37 Tests: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
38 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
39 >>> f.size |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
40 58646 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
41 >>> f.elsize |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
42 4 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
43 >>> f.file.tell() |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
44 20 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
45 >>> f.skip(1000) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
46 >>> f.file.tell() |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
47 4020 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
48 >>> f.size |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
49 57646 |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
50 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
51 >>> f.size |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
52 58646 |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
53 >>> f.file.tell() |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
54 20 |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
55 >>> f.skip(-1000) |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
56 >>> f.file.tell() |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
57 230604 |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
58 >>> f.size |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
59 1000 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
60 """ |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
61 if num < 0: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
62 num += self.size |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
63 if num < 0: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
64 raise ValueError('Skipping past the start of the file') |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
65 if num >= self.size: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
66 self.size = 0 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
67 else: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
68 self.size -= num |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
69 f_start = self.file.tell() |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
70 self.file.seek(f_start + (self.elsize * _prod(self.dim[1:]) * num)) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
71 |
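The offset arithmetic in `skip` can be checked against the doctest values without opening a file. This is a sketch under stated assumptions: `skip_target` is a hypothetical function, not part of the module, and `row_items` stands in for `_prod(self.dim[1:])`, taken to be 1 for the one-dimensional labels file.

```python
def skip_target(pos, size, elsize, row_items, num):
    """Return (new_pos, new_size) after skipping `num` items.

    pos       -- current byte offset (20 right after the header in the doctest)
    size      -- number of items still available
    elsize    -- bytes per element
    row_items -- elements per item, i.e. the product of dim[1:]
    """
    if num < 0:
        num += size  # negative counts are taken relative to the end
        if num < 0:
            raise ValueError('Skipping past the start of the file')
    new_size = 0 if num >= size else size - num
    return pos + elsize * row_items * num, new_size


# Values taken from the doctest above (labels file: 1-D, 4-byte elements).
print(skip_target(20, 58646, 4, 1, 1000))   # (4020, 57646)
print(skip_target(20, 58646, 4, 1, -1000))  # (230604, 1000)
```

As in the original method, the seek target is computed from `num` even when `num` is at least `size`; only the remaining size is clamped to zero.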
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
72 def read(self, num): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
73 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
74 Reads `num` elements from the file and returns the result as a |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
75 numpy array. The last read is truncated to the items that remain. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
76 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
77 Tests: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
78 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_labels.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
79 >>> f.read(1) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
80 array([6], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
81 >>> f.read(10) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
82 array([7, 4, 7, 5, 6, 4, 8, 0, 9, 6], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
83 >>> f.skip(58630) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
84 >>> f.read(10) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
85 array([9, 2, 4, 2, 8], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
86 >>> f.read(10) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
87 array([], dtype=int32) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
88 >>> f = FTFile('/data/lisa/data/nist/by_class/digits/digits_test_data.ft') |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
89 >>> f.read(1) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
90 array([[0, 0, 0, ..., 0, 0, 0]], dtype=uint8) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
91 """ |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
92 if num > self.size: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
93 num = self.size |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
94 self.dim[0] = num |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
95 self.size -= num |
615
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
96 if self.gz: |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
97 d = self.file.read(_prod(self.dim)*self.elsize) |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
98 res = numpy.fromstring(d, dtype=self.magic_t, count=_prod(self.dim)).reshape(self.dim) |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
99 else: |
337253b82409
repair the class/fct that allow to read pnist07 and others by allowing them to read gziped file.
Frederic Bastien <nouiz@nouiz.org>
parents:
614
diff
changeset
|
100 res = numpy.fromfile(self.file, dtype=self.magic_t, count=_prod(self.dim)).reshape(self.dim) |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
101 if self.dtype is not None: |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
102 res = res.astype(self.dtype) |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
103 if self.scale != 1: |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
104 res /= self.scale |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
105 return res |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
106 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
107 class FTSource(object): |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
108 def __init__(self, file, skip=0, size=None, maxsize=None, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
109 dtype=None, scale=1): |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
110 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
111 Create a data source from a .ft file or a subset of one. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
112 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
113 Parameters: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
114 `file` -- (string) the filename |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
115 `skip` -- (int, optional) number of examples to skip from |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
116 the start of the file. If negative, skips |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
117 filesize + skip (all but the last -skip). |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
118 `size` -- (int, optional) truncates number of examples |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
119 read (after skipping). If negative truncates to |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
120 filesize + size (also after skipping). |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
121 `maxsize` -- (int, optional) the maximum number of examples to read |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
122 `dtype` -- (dtype, optional) convert the data to this |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
123 dtype after reading. |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
124 `scale` -- (number, optional) scale (that is divide) the |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
125 data by this number (after dtype conversion, if |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
126 any). |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
127 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
128 Tests: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
129 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft') |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
130 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1000) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
131 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=10) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
132 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=100, size=120) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
133 """ |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
134 self.file = file |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
135 self.skip = skip |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
136 self.size = size |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
137 self.dtype = dtype |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
138 self.scale = scale |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
139 self.maxsize = maxsize |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
140 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
141 def open(self): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
142 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
143 Returns an FTFile that corresponds to this dataset. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
144 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
145 Tests: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
146 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft') |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
147 >>> f = s.open() |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
148 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=1) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
149 >>> len(s.open().read(2)) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
150 1 |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
151 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
152 >>> s.open().size |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
153 1000 |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
154 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', skip=57646, size=1) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
155 >>> s.open().size |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
156 1 |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
157 >>> s = FTSource('/data/lisa/data/nist/by_class/digits/digits_test_data.ft', size=-10) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
158 >>> s.open().size |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
159 58636 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
160 """ |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
161 f = FTFile(self.file, scale=self.scale, dtype=self.dtype) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
162 if self.skip != 0: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
163 f.skip(self.skip) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
164 if self.size is not None and self.size < f.size: |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
165 if self.size < 0: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
166 f.size += self.size |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
167 if f.size < 0: |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
168 f.size = 0 |
173
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
169 else: |
954185d6002a
Take the validation set at the end of the training set files rather than at the beginning.
Arnaud Bergeron <abergeron@gmail.com>
parents:
163
diff
changeset
|
170 f.size = self.size |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
171 if self.maxsize is not None and f.size > self.maxsize: |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
172 f.size = self.maxsize |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
173 return f |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
174 |
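How `skip`, `size` and `maxsize` combine in `FTSource.open` is easier to follow as a pure function over item counts. The sketch below is an illustration, not the module's code: `effective_size` is a hypothetical name, and the 58646-example file size is taken from the doctests above.

```python
def effective_size(file_size, skip=0, size=None, maxsize=None):
    """Number of examples an opened source would expose.

    Follows the order used in FTSource.open: skip first, then apply
    `size`, then cap with `maxsize`.
    """
    remaining = file_size
    if skip != 0:
        if skip < 0:
            skip += remaining  # negative skip counts from the end
            if skip < 0:
                raise ValueError('Skipping past the start of the file')
        remaining = max(remaining - skip, 0)
    if size is not None and size < remaining:
        if size < 0:
            # A negative size drops that many examples from the end.
            remaining = max(remaining + size, 0)
        else:
            remaining = size
    if maxsize is not None and remaining > maxsize:
        remaining = maxsize
    return remaining


print(effective_size(58646, skip=57646))          # 1000
print(effective_size(58646, skip=57646, size=1))  # 1
print(effective_size(58646, size=-10))            # 58636
```

A `size` larger than what remains after skipping is ignored, which is why the `self.size < f.size` guard comes before the sign check in the original.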
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
175 class FTData(object): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
176 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
177 This is a list of FTSources. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
178 """ |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
179 def __init__(self, datafiles, labelfiles, skip=0, size=None, maxsize=None, |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
180 inscale=1, indtype=None, outscale=1, outdtype=None): |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
181 if maxsize is not None: |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
182 maxsize /= len(datafiles) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
183 self.inputs = [FTSource(f, skip, size, maxsize, scale=inscale, dtype=indtype) |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
184 for f in datafiles] |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
185 self.outputs = [FTSource(f, skip, size, maxsize, scale=outscale, dtype=outdtype) |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
186 for f in labelfiles] |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
187 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
188 def open_inputs(self): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
189 return [f.open() for f in self.inputs] |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
190 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
191 def open_outputs(self): |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
192 return [f.open() for f in self.outputs] |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
193 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
194 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
195 class FTDataSet(DataSet): |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
196 def __init__(self, train_data, train_lbl, test_data, test_lbl, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
197 valid_data=None, valid_lbl=None, indtype=None, outdtype=None, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
198 inscale=1, outscale=1, maxsize=None): |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
199 r""" |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
200 Defines a DataSet from a bunch of files. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
201 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
202 Parameters: |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
203 `train_data` -- list of train data files |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
204 `train_lbl` -- list of train label files (same length as `train_data`) |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
205 `test_data`, `test_lbl` -- same thing as train, but for |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
206 test. The number of files |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
207 can differ from train. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
208 `valid_data`, `valid_lbl` -- same thing again for validation. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
209 (optional) |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
210 `indtype`, `outdtype`, -- see FTSource.__init__() |
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
211 `inscale`, `outscale` (optional) |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
212 `maxsize` -- maximum size of the set returned |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
178
diff
changeset
|
213 |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
214 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
215 If `valid_data` and `valid_lbl` are not supplied then a sample |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
216 approximately equal in size to the test set is taken from the train |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
217 set. |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
218 """ |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
219 if valid_data is None: |
271
a92ec9939e4f
fixed a problem with maxsize when not provided
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
257
diff
changeset
|
220 total_valid_size = sum(FTFile(td).size for td in test_data) |
a92ec9939e4f
fixed a problem with maxsize when not provided
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
257
diff
changeset
|
221 if maxsize is not None: |
a92ec9939e4f
fixed a problem with maxsize when not provided
Yoshua Bengio <bengioy@iro.umontreal.ca>
parents:
257
diff
changeset
|
222 total_valid_size = min(total_valid_size, maxsize) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
223 valid_size = total_valid_size // len(train_data) |
214
1faae5079522
The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
181
diff
changeset
|
224 self._train = FTData(train_data, train_lbl, size=-valid_size, |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
225 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
226 indtype=indtype, outdtype=outdtype, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
227 maxsize=maxsize) |
214
1faae5079522
The in/outscale parameters were not passed to FTData
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
181
diff
changeset
|
228 self._valid = FTData(train_data, train_lbl, skip=-valid_size, |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
229 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
230 indtype=indtype, outdtype=outdtype, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
231 maxsize=maxsize) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
232 else: |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
233 self._train = FTData(train_data, train_lbl, maxsize=maxsize, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
234 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
235 indtype=indtype, outdtype=outdtype) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
236 self._valid = FTData(valid_data, valid_lbl, maxsize=maxsize, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
237 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
238 indtype=indtype, outdtype=outdtype) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
239 self._test = FTData(test_data, test_lbl, maxsize=maxsize, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
240 inscale=inscale, outscale=outscale, |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
214
diff
changeset
|
241 indtype=indtype, outdtype=outdtype) |
163
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
242 |
4b28d7382dbf
Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
243 def _return_it(self, batchsize, bufsize, ftdata): |
177
be714ac9bcbd
Use izip(), not zip() to return a lazy iterator. (datasets)
Arnaud Bergeron <abergeron@gmail.com>
parents:
173
diff
changeset
|
244 return izip(DataIterator(ftdata.open_inputs(), batchsize, bufsize), |
181
f0f47b045cbf
Remove a stray cast in the FTDataSet code and export the ocr dataset.
Arnaud Bergeron <abergeron@gmail.com>
parents:
180
diff
changeset
|
245 DataIterator(ftdata.open_outputs(), batchsize, bufsize)) |