annotate featuremap.py @ 363:9e84e8a20a75

Added to misc.file
author Joseph Turian <turian@gmail.com>
date Thu, 03 Jul 2008 17:52:11 -0400
parents 18702ceb2096
children
rev   line source
356
18702ceb2096 Added more functions
Joseph Turian <turian@iro.umontreal.ca>
parents:
diff changeset
"""
Feature mapping.

A feature map is identified by a unique name, e.g. "parsing features, experiment 35".
This unique name also determines the name of the on-disk version of the feature map.

@todo: This should be rewritten to be more Pythonic. Perhaps use a class?
@todo: Maybe look at older C++ Id/Vocab code? Id could have a __str__ method
@todo: Clearer documentation.
@todo: Create an fmap directory
@todo: Use cPickle, not pickle

@todo: Autosynchronize mode: Each time a new entry is added
to a L{FeatureMap}, the on-disk version of the feature map is
updated. Alternatively, synchronize to disk when the object is destroyed.
"""

from common import myopen
import pickle

# We want this map to be a singleton
name_to_fmap = {}

def get(name=None, synchronize=True):
    """
    Get the L{FeatureMap} registered under this name, creating it on first use.
    """
    global name_to_fmap
    if name not in name_to_fmap:
        # Create a new L{FeatureMap}
        name_to_fmap[name] = FeatureMap(name, synchronize)
    fmap = name_to_fmap[name]
    assert fmap.name == name
    assert fmap.synchronize == synchronize
    return fmap
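
The lookup-or-create pattern used by get() can be tried in isolation. The following is a minimal sketch in Python 3 (the listing itself is Python 2-era code). `Entry`, `_registry`, and `get_entry` are hypothetical names invented for this example, not part of featuremap.py, and disk synchronization is left out.

```python
# Hypothetical stand-ins for illustration only; not part of featuremap.py.
class Entry:
    def __init__(self, name, synchronize):
        self.name = name
        self.synchronize = synchronize

_registry = {}  # name -> Entry, shared by every caller in the process

def get_entry(name, synchronize=False):
    """Return the Entry registered under `name`, creating it on first use."""
    if name not in _registry:
        _registry[name] = Entry(name, synchronize)
    entry = _registry[name]
    # Like get(), refuse a request whose flag disagrees with the flag
    # the entry was first created with.
    assert entry.synchronize == synchronize
    return entry

a = get_entry("parsing features, experiment 35")
b = get_entry("parsing features, experiment 35")
print(a is b)  # True: both names resolve to the same object
```

Because the registry is keyed by name, two callers that ask for the same name share one object, which is what makes the map behave like a per-name singleton.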

def free_memory():
    """
    Free the memory associated with all feature maps.
    """
    global name_to_fmap
    name_to_fmap = {}

# Note: inside this module, this class shadows the builtin KeyError.
class KeyError(Exception):
    """Exception raised for keys missing from a readonly FeatureMap.

    Attributes:
        name -- Name of the FeatureMap raising the error.
        key -- Key not present.
    """
    def __init__(self, name, key):
        self.name = name
        self.key = key
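
To illustrate when such an exception would be raised, here is a minimal Python 3 sketch of a string-to-ID map with a readonly switch. `MissingKey` and `TinyMap` are hypothetical names chosen for this example (the sketch deliberately avoids reusing the builtin name KeyError), and persistence is omitted.

```python
class MissingKey(Exception):
    """Hypothetical analogue of the exception; carries the map name and the key."""
    def __init__(self, name, key):
        super().__init__("map %r has no key %r" % (name, key))
        self.name = name
        self.key = key

class TinyMap:
    """Hypothetical string <-> ID map; IDs are assigned densely from 0."""
    def __init__(self, name):
        self.name = name
        self.map = {}          # feature string -> ID
        self.reverse_map = {}  # ID -> feature string
        self.readonly = False

    def id(self, s):
        if s not in self.map:
            if self.readonly:
                raise MissingKey(self.name, s)
            new_id = len(self.map)  # next unused ID
            self.map[s] = new_id
            self.reverse_map[new_id] = s
        return self.map[s]

    def str(self, i):
        return self.reverse_map[i]

m = TinyMap("demo")
print(m.id("word=the"), m.id("word=cat"), m.id("word=the"))  # 0 1 0
m.readonly = True
try:
    m.id("word=dog")
except MissingKey as e:
    print(e.name, e.key)  # demo word=dog
```

Freezing the map this way is useful once the feature set is fixed: unseen features surface as an explicit error instead of silently growing the ID space.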


class FeatureMap:
    """
    Map from a feature string to a numerical ID (starting from 0).

    If synchronize is False, the feature map is considered temporary
    and we never actually synchronize it with disk. It expires with the
    lifetime of this execution.

    @warning: Do not construct this directly. Instead, use the module-level get() function.
    @todo: More documentation
    """

    # name = None
    # synchronize = True
    # map = {}
    # readonly = False # If True, then each time we look for an ID
    #                  # that is not present we throw a KeyError
    def __init__(self, name=None, synchronize=True):
        self.name = name
        self.synchronize = synchronize
        self.map = {}          # feature string -> ID
        self.reverse_map = {}  # ID -> feature string
        self.readonly = False

        # There must be a name provided, or we cannot perform synchronization
        assert self.name or not self.synchronize

        if self.synchronize:
            # Try loading map from disk
            self.load()

    def exists(self, str):
        """ Return True iff this str is in the map """
        return str in self.map

    def id(self, str):
        """ Get the ID for this string. Add a new ID if none is available """
        # @todo: Don't want to synchronize every add, this may be too slow.
        if str not in self.map:
            if self.readonly: raise KeyError(self.name, str)
            l = self.len
            self.map[str] = l
            self.reverse_map[l] = str
            assert l+1 == self.len
            return l
        else: return self.map[str]

    def str(self, id):
        """ Get the string for this ID. """
        return self.reverse_map[id]

    # This next function should just convert a list to a list
    # def ids(self, lst):
    #     """ Get the IDs for the elements of a list. Return the ID numbers of these keys as a map. """
    #     idset = {}
    #     for k in lst:
    #         try:
    #             idset[self.id(k)] = True
    #         except KeyError, e:
    #             print "Feature map '%s' does not contain key '%s'. Skipping..." % (e.name, e.key)
    #     return idset

    len = property(lambda self: len(self.map), doc="Number of different feature IDs")
    filename = property(lambda self: "fmap.%s.pkl.gz" % self.name, doc="The on-disk file synchronized to this feature map.")

    def load(self):
        """ Load the map from disk. """
        assert self.synchronize
        try:
            f = myopen(self.filename, "rb")
            (self.map, self.reverse_map) = pickle.load(f)
        except IOError: print "Could not open %s" % self.filename

    def dump(self):
        """ Dump the map to disk. """
        assert self.synchronize
        f = myopen(self.filename, "wb")
        pickle.dump((self.map, self.reverse_map), f)
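
The load()/dump() round trip can be tried with the standard library alone. The sketch below is Python 3 and assumes that `myopen` (imported from `common`, whose source is not shown on this page) behaves like `gzip.open` for a `.gz` filename; that is an assumption, and the helper names `dump_maps` and `load_maps` are hypothetical.

```python
import gzip
import os
import pickle
import tempfile

def dump_maps(path, fmap, reverse_map):
    """Write both dictionaries to a gzip-compressed pickle."""
    # The with-block closes the file, which flushes the compressed stream.
    with gzip.open(path, "wb") as f:
        pickle.dump((fmap, reverse_map), f)

def load_maps(path):
    """Read the dictionaries back; return two empty maps if the file is missing."""
    try:
        with gzip.open(path, "rb") as f:
            return pickle.load(f)
    except IOError:  # alias of OSError in Python 3; covers a missing file
        return {}, {}

path = os.path.join(tempfile.mkdtemp(), "fmap.demo.pkl.gz")
print(load_maps(path))  # ({}, {})
dump_maps(path, {"word=the": 0}, {0: "word=the"})
print(load_maps(path))  # ({'word=the': 0}, {0: 'word=the'})
```

Storing the forward and reverse dictionaries as one tuple keeps them consistent on disk: both are written, and later read, in a single pickle operation.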