
11. TUNING THE STANDARDIZER

A standardization can be incorrect, in terms of correctly parsing and representing an address form, and yet be acceptable in terms of matching. As long as the reference records and user records can be properly linked, it doesn't much matter how those records are expressed.

Standardization becomes truly problematic when

  1. the standardization that would serve as a linkage to the right reference record isn't produced.
  2. a reference record cannot be standardized at all.

The first problem won't become evident until the match phase, whereas the second problem is usually discovered during the build phase. The process of dealing with them, however, is similar. The easiest remedy is often to modify your reference or user data, but this only suffices for small-scale, transient problems. The other solution is to change the behaviour of the standardizer. This is done by changing the standardization files.

If the standardizer produces a problematic standardization or fails to produce a standardization at all, and if editing the relevant user or reference attribute records isn't feasible, then we have two principal options:

  1. Adding, deleting or changing rules ( See Changing the Rules).
  2. Adding, deleting or changing lexical entries ( See Changing the Lexicons).

11.1 Testing Standardizations

To test the standardizer, invoke PAGC from the command line with the -t flag. Enter the MICRO and MACRO portions of the address as prompted and the best standardization will be printed - or you will learn that the program can't standardize it:

[your_name@localhost whatcom]$ pagc -t
Standardization test. Type "exit" to quit:
MICRO:123 Ta Ta Lost Dog Rd
MACRO:Anywhere BC V0V 0V0
No standardization of MICRO 123 Ta Ta Lost Dog Rd

For more detailed information, invoke PAGC with both the -t and -v switches. For each of the MICRO and MACRO portions of the address, the raw input tokenization candidates and raw standardizations will be printed, followed by the rule statistics for the combined address. However, if PAGC fails to find a standardization for either the MICRO or the MACRO, it will not produce the rule statistics.

Raw input tokenization candidates

For each position in the input, the standardized text is printed along with the associated input token (number and name).

Raw standardizations

For each successful standardization the score is given, followed by the content. The content consists of the position in the input, the input token (number and name) selected, the standardized text, and the output token (postal attribute) assigned, with its number and name.

Rule Statistics

The rule statistics will include all the rules used to form a successful standardization candidate. For each rule the following data is given: the rule number, the rule type, the rank, how that rank is weighted, and the number of times the rule was hit out of the total number of rule hits.

Example of -t -v output.

MICRO:123 Ta Ta Lost Dog Rd
MACRO:Anywhere BC V0V 0V0
Input tokenization candidates:
        (0) std: ANYWHERE, tok: 1 (WORD)
        (1) std: BRITISH COLUMBIA, tok: 11 (PROV)
        (1) std: BRITISH COLUMBIA, tok: 1 (WORD)
        (1) std: BRITISH COLUMBIA, tok: 6 (ROAD)
        (2) std: V0V, tok: 27 (PCH)
        (2) std: V0V, tok: 23 (MIXED)
        (3) std: 0V0, tok: 26 (PCT)
        (3) std: 0V0, tok: 23 (MIXED)
Raw standardization 1 with score 0.950000:
        (0) Input 1 (WORD) text ANYWHERE mapped to output 10 (CITY)
        (1) Input 11 (PROV) text BRITISH COLUMBIA mapped to output 11 (PROV)
        (2) Input 27 (PCH) text V0V mapped to output 13 (POSTAL)
        (3) Input 26 (PCT) text 0V0 mapped to output 13 (POSTAL)
Raw standardization 2 with score 0.675000:
        (0) Input 1 (WORD) text ANYWHERE mapped to output 10 (CITY)
        (1) Input 1 (WORD) text BRITISH COLUMBIA mapped to output 10 (CITY)
        (2) Input 27 (PCH) text V0V mapped to output 13 (POSTAL)
        (3) Input 26 (PCT) text 0V0 mapped to output 13 (POSTAL)
Standardization of an address has failed
Input tokenization candidates:
        (0) std: 123, tok: 0 (NUMBER)
        (1) std: TA, tok: 21 (DOUBLE)
        (2) std: TA, tok: 21 (DOUBLE)
        (3) std: LOST, tok: 1 (WORD)
        (4) std: DOG, tok: 1 (WORD)
        (5) std: ROAD, tok: 2 (TYPE)
No standardization of MICRO 123 Ta Ta Lost Dog Rd

See The Statistics File for details on the contents of rule statistics output.

The information gleaned from this output can be put to use in correcting standardizations.

In the above case, we can tell from the tokenization, and by consulting the lexicon, that the only lexicon entry being applied is:

"1","RD",2,"ROAD"

Because the words LOST and DOG are not in the lexicon, they will be interpreted as WORD input tokens. TA, however, because it is only 2 letters long, will be interpreted as a DOUBLE.

Checking the rules, we find that there is no rule to translate DOUBLE DOUBLE WORD TYPE to STREET STREET STREET SUFTYP. This is done by searching (using your text editor's find option, grep, or some other search method) for the input string.

Recall that a string of WORD tokens in the same field will compress into one:

DOUBLE DOUBLE WORD TYPE.

Looking up the token numbers, the following input token sequence is constructed:

21 21 1 2 -1

The negative one terminates the input tokens. This is the string we search for.

Searching, we discover that there is no such rule.
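
For example, assuming each rule in rules.txt occupies a single line beginning with its input token sequence ( See Rule Records for the exact layout of your copy), the search could be performed from the shell:

[your_name@localhost whatcom]$ grep '21 21 1 2 -1' rules.txt
[your_name@localhost whatcom]$

The empty result confirms that no rule contains this input token sequence.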

So, we have two options: add TA TA to the lexicon or add a new rule to the file rules.txt.

11.2 Changing the Lexicon or Gazetteer

If you find an abbreviation, word or phrase that is not being recognized by the standardizer, check the Lexicon and Gazetteer to see if an entry exists there. If not, you can add it. (See the format for lexicon entries). If an entry exists but is not being applied, you can change it; if it is interfering with a correct standardization, you can delete it. Be aware that any alteration may have consequences for other standardizations. If you wish to restrict the scope of your changes to a particular reference, place the files in the same directory as the reference shapeset before building.

As an example, consider the phrase "FS RD". This is used in British Columbia to denote "FOREST SERVICE ROAD" and functions as a street type (TYPE). To add this to the lexicon, we look for the lookup key FS RD. Not finding any standardizations of this key, we add an entry:

"1,","FS RD",2,"FOREST SERVICE ROAD"

The 1 is the definition number, the FS RD is the lookup key, the 2 is the input token (TYPE=2) and "FOREST SERVICE ROAD" is the standardization that will be used.

It is advisable to edit these files with a text editor. Some spreadsheets that handle comma-delimited files will misrepresent some values.

The precedence in applying input tokens to rules is established not by the definition number but by the order of loading. The Gazetteer is loaded before the Lexicon. The first standardization/input token pair that occurs in the file for a given lookup key will be the first to which the rules are applied. The files are arranged in order of lookup key for the user's convenience; except for the order of standardizations for the same lookup key, this arrangement has no effect on rule application.
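
For instance, if the Lexicon contained two definitions of the same lookup key (these entries are purely illustrative):

"1","ST",2,"STREET"
"2","ST",1,"SAINT"

the TYPE standardization STREET, occurring first in the file, would be offered to the rules before the WORD standardization SAINT, regardless of the definition numbers.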

As another example of a Lexicon change, let's suppose we want to add "TA TA" to lexicon.csv.

Scroll to the appropriate place in the file. The file is in alphabetical order to facilitate editing.

Before the edit:

"1","SW",22,"SOUTHWEST"
"1","SWP",1,"SWAMP"
"1","TANK TRAIL",2,"TANK TRAIL"
"1","TEN",1,"10"
"2","TEN",0,"10"

There is no previous entry for "TA TA", so give it the definition number "1". The lookup key, the phrase we want the standardizer to recognize, is "TA TA". This should be recognized as a word, so, looking up WORD in the input tokens ( See Input Tokens) and finding that its token number is 1, give it the value 1. The standardization, the phrase we want the standardizer to emit when it finds the lookup key, is the same as the lookup key, "TA TA".

After the edit:

"1","SW",22,"SOUTHWEST"
"1","SWP",1,"SWAMP"
"1","TA TA",1,"TA TA"
"1","TANK TRAIL",2,"TANK TRAIL"
"1","TEN",1,"10"
"2","TEN",0,"10"

Save the file in the location you prefer and exit. If you want to make this change a global one, save it to the PAGC installation directory (usually /usr/local/share/pagc/ - you may need superuser (su) privileges to do this). If you want to keep it specific to shapesets in a particular directory, save it to that directory. Or, if you want it to be applied every time you invoke PAGC from a particular working directory (and there is no other copy in the reference shapeset directory you work on), save it in the working directory.
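
For example, assuming the edited lexicon.csv is in your home directory and your reference shapeset lives in /home/your_name/whatcom (both paths are illustrative), the copies might look like this:

[your_name@localhost ~]$ su -c 'cp lexicon.csv /usr/local/share/pagc/lexicon.csv'
[your_name@localhost ~]$ cp lexicon.csv /home/your_name/whatcom/lexicon.csv

The first command makes the change global; the second keeps it specific to the shapesets in /home/your_name/whatcom.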

11.3 Changing the Rules File

It may be that an address does not standardize properly because of a missing rule. After using the -t -v switches on the address 123 Ta Ta Lost Dog Road ( See the testing example), we see that it is tokenized as NUMBER DOUBLE DOUBLE WORD WORD TYPE. The NUMBER will be handled by a CIVIC_C rule, so we need an ARC_C rule to cover DOUBLE DOUBLE WORD WORD TYPE. We look for DOUBLE DOUBLE WORD TYPE (the two WORD tokens are treated as one by the standardizer) in the rules and find there isn't one.

The format of a rule is described below ( See Rule Records). To construct the rule, we start with the input token segment we built when searching for the missing rule ( See the testing example):

21 21 1 2 -1

The rest of the rule consists of the output segment, type and rank.

The output tokens ( See Postal Attributes) we want the input segment to be mapped to are:

STREET STREET STREET SUFTYP.

Looking up the token numbers, the following output token sequence is constructed:

5 5 5 6 -1

The number for an ARC_C rule type is 2. The rank can be (arbitrarily) assigned a value of 9. The complete rule therefore is:

21 21 1 2 -1 5 5 5 6 -1 2 9 

This rule establishes the requisite mapping of input tokens to output tokens in a rule of type ARC_C with an arbitrarily assigned rank of 9. Because there are no rules with the same input tokens with which this rule will collide, the rank isn't too important - any rank at all would result in a standardization. However, you don't want it to be too high in case it gets applied in some future standardization in the wrong context and overrides a more appropriate rule.

Note that in the above example there are two words in the input, but only one WORD input token in the rule. Multiple WORD/STOPWORD sequences that map to the same output token are represented in the rules by a single WORD token.

One thing to consider is that often a single rule will not be sufficient. You may need to add similar rules to cover the presence or absence of (to mention the most probable attributes) a PREDIR or SUFTYP.
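
As a sketch of what one such variant might look like, assuming that the directional input token (DIRECT) is number 22, as suggested by the "SW" lexicon entry shown earlier, and that the PREDIR output token is number 2 (verify both against the Input Tokens and Postal Attributes tables in your copy), a rule covering an address such as 123 N Ta Ta Lost Dog Rd would prefix a DIRECT/PREDIR pair to the rule we just built:

22 21 21 1 2 -1 2 5 5 5 6 -1 2 9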

The rule file included with the distribution was generated by a C program. The source of this program, genrule.c, should be included with the distribution along with the source of another program, collide.c. The collide program takes an amended rules.txt file and produces the C header file gamma.h and a report, collide.txt. If the number of rules in rules.txt is increased, PAGC must be recompiled with the new gamma.h header. The contents of collide.txt list those rules in the file that take the same string of input tokens but emit a different string of postal attributes.
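
A minimal sketch of that workflow, assuming collide.c compiles to a standalone program that reads rules.txt from the current directory and that PAGC itself is rebuilt with make (your distribution's build procedure may differ), might look like this:

[your_name@localhost pagc-src]$ cc -o collide collide.c
[your_name@localhost pagc-src]$ ./collide
[your_name@localhost pagc-src]$ less collide.txt
[your_name@localhost pagc-src]$ make

After checking collide.txt for unintended collisions, the rebuild picks up the regenerated gamma.h.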

