annotate pylearn/io/amat.py @ 1223:621e03253f0c

mini-bug in taglist
author boulanni <nicolas_boulanger@hotmail.com>
date Wed, 22 Sep 2010 15:08:39 -0400
parents e4a92dce13fe
children
rev   line source
606
e4a92dce13fe added comments to amat
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 537
diff changeset
"""Load PLearn AMat files.


An AMat file is an ASCII format for dense matrices.

The format is not precisely defined, so I'll describe here a single recipe for making a valid
file.

.. code-block:: text

    #size: <rows> <cols>
    #sizes: <input cols> <target cols> <weight cols> <extra cols 0> <extra cols 1> <extra cols ...>
    number number number ....
    number number number ....


Tabs and spaces are both valid delimiters. Newlines separate consecutive rows.

"""
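As an illustration of the recipe in the docstring above, here is a minimal sketch that writes a small file in that layout and reads the raw lines back. The helpers `write_amat` and `read_lines` are hypothetical names introduced for this example and are not part of pylearn; the `.amat` suffix and the choice of a single space as delimiter are likewise assumptions.

```python
import os
import tempfile


def write_amat(path, rows, sizes):
    """Write `rows` (equal-length lists of numbers) following the recipe.

    `sizes` gives the column counts for input, target, weight and extra.
    Hypothetical helper for illustration only.
    """
    n_cols = len(rows[0])
    with open(path, 'w') as f:
        f.write('#size: %i %i\n' % (len(rows), n_cols))
        f.write('#sizes: %s\n' % ' '.join(str(s) for s in sizes))
        for row in rows:
            # Spaces are used here; tabs would be an equally valid delimiter.
            f.write(' '.join(repr(float(v)) for v in row) + '\n')


def read_lines(path):
    """Return the file's lines without their trailing newlines."""
    with open(path) as f:
        return [line.rstrip('\n') for line in f]


rows = [[1.0, 2.0, 0.0], [3.0, 4.0, 1.0]]
fd, path = tempfile.mkstemp(suffix='.amat')
os.close(fd)
try:
    # Two input columns, one target column, no weight or extra columns.
    write_amat(path, rows, sizes=[2, 1, 0, 0])
    lines = read_lines(path)
finally:
    os.remove(path)
```

The two header lines come first and every following line is one matrix row, which is the order the loader in this file expects when it decides that the header has ended.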
266
6e69fb91f3c0 initial commit of amat
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
diff changeset

import sys, numpy, array

class AMat:
    """DataSource to access a plearn amat file as a periodic unrandomized stream.

    Attributes:

470
bd937e845bbb new stuff: algorithms/logistic_regression, datasets/MNIST
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 266
diff changeset
    input -- all columns of input
    target -- all columns of target
    weight -- all columns of weight
    extra -- all columns of extra
266
6e69fb91f3c0 initial commit of amat
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
diff changeset

    all -- the entire data contents of the amat file
    n_examples -- the number of training examples in the file

    AMat stands for Ascii Matri[x,ces]

    """

    marker_size = '#size:'
    marker_sizes = '#sizes:'
    marker_col_names = '#:'

    def __init__(self, path, head=None, update_interval=0, ofile=sys.stdout):

        """Load the amat at <path> into memory.

        path - str: location of amat file
        head - int: stop reading after this many data rows
        update_interval - int: print '.' to ofile every <this many> lines
        ofile - file: print status, msgs, etc. to this file

        """
        self.all = None
        self.input = None
        self.target = None
        self.weight = None
        self.extra = None

        self.header = False
        self.header_size = None
        self.header_rows = None
        self.header_cols = None
        self.header_sizes = None
        self.header_col_names = []

        data_started = False
        data = array.array('d')

        f = open(path)
        n_data_lines = 0
        len_float_line = None

        for i,line in enumerate(f):
            if n_data_lines == head:
                #we've read enough data,
                # break even if there's more in the file
                break
            if len(line) == 0 or line == '\n':
                continue
            if line[0] == '#':
                if not data_started:
                    #the condition means that the file has a header, and we're on
                    # some header line
                    self.header = True
                    if line.startswith(AMat.marker_size):
                        info = line[len(AMat.marker_size):]
                        self.header_size = [int(s) for s in info.split()]
                        self.header_rows, self.header_cols = self.header_size
                    if line.startswith(AMat.marker_col_names):
                        info = line[len(AMat.marker_col_names):]
                        self.header_col_names = info.split()
                    elif line.startswith(AMat.marker_sizes):
                        info = line[len(AMat.marker_sizes):]
                        self.header_sizes = [int(s) for s in info.split()]
            else:
                #the first non-commented line tells us that the header is done
                data_started = True
                float_line = [float(s) for s in line.split()]
                if len_float_line is None:
                    len_float_line = len(float_line)
                    if (self.header_cols is not None) \
                            and self.header_cols != len_float_line:
                        #apply the format arguments with %, so the counts are
                        # substituted into the message
                        sys.stderr.write(
                            'WARNING: header declared %i cols but first line has %i, using %i\n'
                            % (self.header_cols, len_float_line, len_float_line))
                else:
                    if len_float_line != len(float_line):
                        raise IOError('wrong line length', i, line)
                data.extend(float_line)
                n_data_lines += 1

                if update_interval > 0 and (ofile is not None) \
                        and n_data_lines % update_interval == 0:
                    ofile.write('.')
                    ofile.flush()

        if update_interval > 0 and (ofile is not None):
            ofile.write('\n')
        f.close()

        # convert from array.array to numpy.ndarray
        # (integer division, so the row count stays an int)
        nshape = (len(data) // len_float_line, len_float_line)
        self.all = numpy.frombuffer(data).reshape(nshape)
        self.n_examples = self.all.shape[0]

        # assign
        if self.header_sizes is not None:
            if len(self.header_sizes) > 4:
                sys.stderr.write('WARNING: ignoring sizes after 4th in %s\n' % path)
            leftmost = 0
            #here we make use of the fact that if header_sizes has len < 4
            # the loop will exit before 4 iterations
            attrlist = ['input', 'target', 'weight', 'extra']
            for attr, ncols in zip(attrlist, self.header_sizes):
                setattr(self, attr, self.all[:, leftmost:leftmost+ncols])
                leftmost += ncols

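The column assignment above relies on `zip` stopping at the shorter of its two arguments. The sketch below reproduces that running-offset logic on plain Python lists so it can be tried without numpy; `split_columns` is a hypothetical helper written for this illustration and does not exist in pylearn.

```python
def split_columns(rows, sizes, names=('input', 'target', 'weight', 'extra')):
    """Slice each row into consecutive column groups, one per entry in `sizes`."""
    groups = {}
    leftmost = 0
    # zip() stops at the shorter sequence: fewer than four sizes yields fewer
    # groups, and any sizes after the fourth are silently dropped.
    for name, ncols in zip(names, sizes):
        groups[name] = [row[leftmost:leftmost + ncols] for row in rows]
        leftmost += ncols
    return groups


rows = [[1.0, 2.0, 0.0], [3.0, 4.0, 1.0]]
# Only two sizes are declared, so only 'input' and 'target' are produced.
groups = split_columns(rows, sizes=[2, 1])
```

In the class itself the groups that receive no size keep their initial value of `None`, whereas this sketch simply leaves them out of the returned dictionary.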