annotate pylearn/io/amat.py @ 1223:621e03253f0c
mini-bug in taglist
author | boulanni <nicolas_boulanger@hotmail.com> |
---|---|
date | Wed, 22 Sep 2010 15:08:39 -0400 |
parents | e4a92dce13fe |
children |
rev | line source |
---|---|
606
e4a92dce13fe
added comments to amat
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
537
diff
changeset
|
1 """load PLearn AMat files |
2 |
3 |
4 An AMat file is an ascii format for dense matrices. |
5 |
6 The format is not precisely defined, so I'll describe here a single recipe for making a valid |
7 file. |
8 |
9 .. code-block:: text |
10 |
11     #size: <rows> <cols> |
12     #sizes: <input cols> <target cols> <weight cols> <extra cols 0> <extra cols 1> <extra cols ...> |
13     number number number .... |
14     number number number .... |
15 |
16 |
17 Tabs and spaces are both valid delimiters. Newlines separate consecutive rows. |
18 |
19 """ |
266
6e69fb91f3c0
initial commit of amat
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
diff
changeset
|
20 |
21 import sys, numpy, array |
22 |
23 class AMat: |
24     """DataSource to access a plearn amat file as a periodic unrandomized stream. |
25 |
26     Attributes: |
27 |
470
bd937e845bbb
new stuff: algorithms/logistic_regression, datasets/MNIST
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
266
diff
changeset
|
28     input -- all columns of input |
29     target -- all columns of target |
30     weight -- all columns of weight |
31     extra -- all columns of extra |
266
|
32 |
33     all -- the entire data contents of the amat file |
34     n_examples -- the number of training examples in the file |
35 |
36     AMat stands for Ascii Matri[x,ces] |
37 |
38     """ |
39 |
40     marker_size = '#size:' |
41     marker_sizes = '#sizes:' |
42     marker_col_names = '#:' |
43 |
44     def __init__(self, path, head=None, update_interval=0, ofile=sys.stdout): |
45 |
46         """Load the amat at <path> into memory. |
47 |
48         path - str: location of amat file |
49         head - int: stop reading after this many data rows |
50         update_interval - int: print '.' to ofile every <this many> lines |
51         ofile - file: print status, msgs, etc. to this file |
52 |
53         """ |
54         self.all = None |
55         self.input = None |
56         self.target = None |
57         self.weight = None |
58         self.extra = None |
59 |
60         self.header = False |
61         self.header_size = None |
62         self.header_rows = None |
63         self.header_cols = None |
64         self.header_sizes = None |
65         self.header_col_names = [] |
66 |
67         data_started = False |
68         data = array.array('d') |
69 |
70         f = open(path) |
71         n_data_lines = 0 |
72         len_float_line = None |
73 |
74         for i,line in enumerate(f): |
75             if n_data_lines == head: |
76                 #we've read enough data, |
77                 # break even if there's more in the file |
78                 break |
79             if len(line) == 0 or line == '\n': |
80                 continue |
81             if line[0] == '#': |
82                 if not data_started: |
83                     #the condition means that the file has a header, and we're on |
84                     # some header line |
85                     self.header = True |
86                     if line.startswith(AMat.marker_size): |
87                         info = line[len(AMat.marker_size):] |
88                         self.header_size = [int(s) for s in info.split()] |
89                         self.header_rows, self.header_cols = self.header_size |
90                     if line.startswith(AMat.marker_col_names): |
91                         info = line[len(AMat.marker_col_names):] |
92                         self.header_col_names = info.split() |
93                     elif line.startswith(AMat.marker_sizes): |
94                         info = line[len(AMat.marker_sizes):] |
95                         self.header_sizes = [int(s) for s in info.split()] |
96             else: |
97                 #the first non-commented line tells us that the header is done |
98                 data_started = True |
99                 float_line = [float(s) for s in line.split()] |
100                 if len_float_line is None: |
101                     len_float_line = len(float_line) |
102                     if (self.header_cols is not None) \ |
103                             and self.header_cols != len_float_line: |
104                         sys.stderr.write( |
105                                 'WARNING: header declared %i cols but first line has %i, using %i\n' |
106                                 % (self.header_cols, len_float_line, len_float_line)) |
107                 else: |
108                     if len_float_line != len(float_line): |
109                         raise IOError('wrong line length', i, line) |
110                 data.extend(float_line) |
111                 n_data_lines += 1 |
112 |
113                 if update_interval > 0 and (ofile is not None) \ |
114                         and n_data_lines % update_interval == 0: |
115                     ofile.write('.') |
116                     ofile.flush() |
117 |
118         if update_interval > 0 and ofile is not None: |
119             ofile.write('\n') |
120         f.close() |
121 |
122         # convert from array.array to numpy.ndarray |
123         nshape = (len(data) // len_float_line, len_float_line) |
124         self.all = numpy.frombuffer(data).reshape(nshape) |
125         self.n_examples = self.all.shape[0] |
126 |
127         # assign |
128         if self.header_sizes is not None: |
129             if len(self.header_sizes) > 4: |
130                 sys.stderr.write('WARNING: ignoring sizes after 4th in %s\n' % path) |
131             leftmost = 0 |
132             #here we make use of the fact that if header_sizes has len < 4 |
133             # the loop will exit before 4 iterations |
134             attrlist = ['input', 'target', 'weight', 'extra'] |
135             for attr, ncols in zip(attrlist, self.header_sizes): |
136                 setattr(self, attr, self.all[:, leftmost:leftmost+ncols]) |
137                 leftmost += ncols |
138 |
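
As a rough illustration of the recipe in the module docstring and of the column split that `__init__` performs with `#sizes:`, here is a minimal sketch in current Python. `write_amat` and `split_columns` are hypothetical helpers, not part of pylearn. Reading the file back with `numpy.loadtxt` is only a stand-in for the `AMat` class; it relies on `loadtxt` treating lines that start with `#` as comments by default.

```python
import os
import tempfile

import numpy


def write_amat(path, matrix, sizes):
    """Hypothetical helper: write `matrix` following the docstring recipe."""
    matrix = numpy.asarray(matrix, dtype=float)
    rows, cols = matrix.shape
    with open(path, "w") as f:
        # Header: total shape, then the width of each column group.
        f.write("#size: %i %i\n" % (rows, cols))
        f.write("#sizes: %s\n" % " ".join(str(s) for s in sizes))
        # One matrix row per line, values separated by spaces.
        for row in matrix:
            f.write(" ".join(repr(float(v)) for v in row) + "\n")


def split_columns(matrix, sizes):
    """Hypothetical helper: slice columns left to right into named groups,
    in the same order AMat uses (input, target, weight, extra)."""
    names = ["input", "target", "weight", "extra"]
    groups, leftmost = {}, 0
    # zip stops at the shorter sequence, so fewer than 4 sizes is fine.
    for name, ncols in zip(names, sizes):
        groups[name] = matrix[:, leftmost:leftmost + ncols]
        leftmost += ncols
    return groups


data = numpy.array([[0.0, 1.0, 2.0, 1.0],
                    [3.0, 4.0, 5.0, 0.0]])
path = os.path.join(tempfile.mkdtemp(), "example.amat")
write_amat(path, data, sizes=[3, 1])

# The '#size:' and '#sizes:' lines are skipped as comments here, so the
# group widths are passed in again by hand.
loaded = numpy.loadtxt(path, ndmin=2)
groups = split_columns(loaded, [3, 1])
```

With sizes `[3, 1]`, the first three columns end up under `input` and the last one under `target`; `weight` and `extra` are simply absent, which mirrors how the loop in `__init__` leaves those attributes at `None`.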