# HG changeset patch
# User Joseph Turian
# Date 1226995070 18000
# Node ID 34ee3aff3e8f67ad1752dffece080a501e9b9ef6
# Parent 90a76a8238e8727a3347f94328b9ffc66b88464c
Improved embedding word preprocessing.

diff -r 90a76a8238e8 -r 34ee3aff3e8f embeddings/process.py
--- a/embeddings/process.py	Tue Nov 18 00:32:39 2008 -0500
+++ b/embeddings/process.py	Tue Nov 18 02:57:50 2008 -0500
@@ -67,9 +67,10 @@
     elif origw == "-RSB-": w = "]"
     else:
         w = origw
-    w = string.lower(w)
-    w = slashre.sub("/", w)
-    w = numberre.sub("NUMBER", w)
+    if w not in __word_to_embedding:
+        w = string.lower(w)
+        w = slashre.sub("/", w)
+        w = numberre.sub("NUMBER", w)
     if w not in __word_to_embedding:
         # sys.stderr.write("Word not in vocabulary, using %s: %s (original %s)\n" % (UNKNOWN, w, origw))
         w = UNKNOWN