
10. MATCHING

In attempting a match, PAGC first tries to find a perfect match (an exact match on each attribute, giving a maximum score). The indices are searched ( See Indices for the search order), and for each index the records returned are compared to the target, scored and placed on a list in order of score. To eliminate duplicate reads, a hash-accessed list is maintained of the records already read for each standardization.

10.1 Scoring the Reference Records

The method of scoring candidates is a modified version of the standard Fellegi-Sunter method. As in that method, each attribute in the reference schema has two associated values, the match weight m and the mismatch weight u. A reference record is compared with the user's record attribute by attribute, and each attribute contributes to the total a value determined by m and u. If the standardizations of the user and reference records agree on an attribute, its contribution is log( m / u ). If they don't agree, the contribution is log( ( 1 - m ) / ( 1 - u ) ). The sum of the contributions from all schema attributes is the total score of the reference record candidate.

The Default Match and Mismatch Weights

The default match and mismatch weights that the program uses are as follows.

HOUSE

match = 0.999 mismatch = .05.

STREET

match = 0.9 mismatch = .01.

CITY

match = 0.9 mismatch = .1.

POSTAL

match = 0.9 mismatch = .1.

SUFDIR

match = 0.85 mismatch = .1.

SUFTYP

match = 0.85 mismatch = .1.

PREDIR

match = 0.8 mismatch = .1.

PRETYP

match = 0.7 mismatch = .1.

QUALIF

match = 0.7 mismatch = .1.

These default weights can be reset in the schema table. If you wish to change the defaults, you will need to use a schema table, even though the program would recognize your schema without one. In Fellegi-Sunter terms, the match weight m is the probability that an attribute agrees when the two records really do refer to the same address, and the mismatch weight u is the probability that it agrees by chance when they do not. In theory these values should be determined empirically, but in practice the default values appear to suffice.
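The contribution formulas and the default weights listed above can be tried out with a short Python sketch. It is only an illustration: the natural logarithm is an assumption (the base is not stated here), the names `DEFAULT_WEIGHTS`, `contribution` and `raw_score` are hypothetical, and the sketch produces a raw sum of logs, whereas the scores PAGC displays in its candidate lists fall between 0 and 1 and are evidently scaled in a way this section does not describe.

```python
import math

# Default (match weight m, mismatch weight u) pairs as listed in this section.
DEFAULT_WEIGHTS = {
    "HOUSE": (0.999, 0.05),
    "STREET": (0.9, 0.01),
    "CITY": (0.9, 0.1),
    "POSTAL": (0.9, 0.1),
    "SUFDIR": (0.85, 0.1),
    "SUFTYP": (0.85, 0.1),
    "PREDIR": (0.8, 0.1),
    "PRETYP": (0.7, 0.1),
    "QUALIF": (0.7, 0.1),
}


def contribution(m, u, agrees):
    """Contribution of one attribute: log(m/u) on agreement,
    log((1-m)/(1-u)) on disagreement (natural log assumed)."""
    if agrees:
        return math.log(m / u)
    return math.log((1 - m) / (1 - u))


def raw_score(agreement, weights=DEFAULT_WEIGHTS):
    """Sum the contributions for a dict of {attribute: agrees?}."""
    return sum(
        contribution(*weights[attr], agrees)
        for attr, agrees in agreement.items()
    )


if __name__ == "__main__":
    # A candidate agreeing on everything but the house number.
    print(raw_score({"HOUSE": False, "STREET": True, "SUFTYP": True}))
```

Note how the small mismatch weight for STREET makes agreement on the street name worth far more than agreement on any other single attribute.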

Scoring modifications

Agreement for a particular attribute is determined by the comparison type ( See Comparison Types) associated with that attribute in the reference schema. However, the method of comparison demanded by the type is modified and elaborated in order to introduce similarity comparison and to handle attribute redirection. The specifics of these modifications follow.

CHAR_SINGLE

In fields of this comparison type, where there is no attribute re-direction, the user and reference values are first examined for an exact match. If there is an exact match, the match weight is applied. If not, the two strings are examined for similarity and given a similarity measure. This value is then interpolated into the interval between the match and mismatch weights to determine the value to add to the score.
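The exact interpolation formula is not given here. One plausible reading, sketched below in Python purely as an assumption, is a linear interpolation between the disagreement contribution and the agreement contribution, driven by a similarity measure between 0 and 1. The function name `interpolated_contribution` is hypothetical.

```python
import math


def interpolated_contribution(m, u, similarity):
    """Linearly interpolate between the disagreement contribution
    (similarity 0.0) and the agreement contribution (similarity 1.0).

    Linear interpolation is an assumption of this sketch, not a
    statement of what PAGC computes internally.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie between 0 and 1")
    agree = math.log(m / u)
    disagree = math.log((1 - m) / (1 - u))
    return disagree + similarity * (agree - disagree)


if __name__ == "__main__":
    # A near miss on a CITY name (weights 0.9 and 0.1) still earns credit.
    print(interpolated_contribution(0.9, 0.1, 0.8))
```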

POSTAL

The standardizer ensures that postal codes for the user and the reference are both valid forms. However, the code of one may be of a different length than that of the other. One may, for instance, have both the zip and zip+4 while the other has just the zip. Whichever of the two is longer will be truncated, for the purposes of the comparison, to the length of the shorter. The two are first examined for an exact match. An exact match gets the match weight. If there is no exact match, they are examined for similarity. The similarity measure produced is then interpolated into the interval between the match and mismatch weights, and this value is added to the score. If the reference is using block faces, the left and right postal codes may differ. Both will be compared to the user postal code and the better of the two scores will be taken.
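The truncation and better-of-two rules can be sketched as follows. The similarity measure used by PAGC for postal codes is not named here, so Python's `difflib.SequenceMatcher` stands in for it as an assumption, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def postal_similarity(user_code, ref_code):
    """Compare two postal codes after truncating the longer one to the
    length of the shorter. Returns 1.0 for an exact match, otherwise a
    similarity between 0 and 1 (SequenceMatcher is a stand-in measure)."""
    n = min(len(user_code), len(ref_code))
    if n == 0:
        return 0.0  # nothing to compare against
    a, b = user_code[:n], ref_code[:n]
    if a == b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def best_postal_similarity(user_code, left_code, right_code):
    """Score the user code against both block faces and keep the better."""
    return max(
        postal_similarity(user_code, left_code),
        postal_similarity(user_code, right_code),
    )


if __name__ == "__main__":
    print(postal_similarity("98248-1234", "98248"))
    print(best_postal_similarity("98248", "98230", "98248"))
```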

HOUSE

House numbers are converted to integers and are examined to see, in the case of NUMBER_INTERVAL_LEFT_RIGHT, if they fall within the range of either the left or right block face. A success for either will result in the addition of the match weight. If neither, then simple transpositions are tested, and a similarity measure (determined by Jaro's algorithm) is used to interpolate into the range between the match and mismatch weights. Provision is also made for address parity. Parity is a common (although undoubtedly not universal) method of distinguishing opposite block faces. That is, the even numbers will be on one side of the street, and the odds on the other. A non-match on parity (if there are numbers on only one side, for example) will slightly decrease the score of an address range. However, a parity mismatch will be ignored in geocoding - a warning is issued in the error file and the address is geocoded as if the parity corresponded. The arithmetic proximity of an address range is also used in scoring. That is, the ranges closer to the target number will score higher. This is useful in selecting records to edit and to look for range continuations.
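The range and parity tests described above might look like the following Python sketch. It covers only the interval test and the parity check, not the transposition test or the proximity scoring. Block faces are passed as (from, to) tuples, with `None` standing in for a blank face (shown as -1 in the candidate lists); these conventions and the function names are assumptions of the sketch.

```python
def in_range(number, range_from, range_to):
    """True if number lies within the range, whichever end is lower."""
    low, high = sorted((range_from, range_to))
    return low <= number <= high


def parity_matches(number, range_from, range_to):
    """True if number has the parity of the range. A range whose two
    ends differ in parity cannot be tested and is treated as a match."""
    if range_from % 2 != range_to % 2:
        return True
    return number % 2 == range_from % 2


def house_fit(number, left, right):
    """Return (side, parity_ok) for the block face the number fits,
    preferring a face whose parity agrees, or (None, False)."""
    fits = []
    for side, face in (("left", left), ("right", right)):
        if face is not None and in_range(number, *face):
            fits.append((side, parity_matches(number, *face)))
    fits.sort(key=lambda fit: not fit[1])  # parity-consistent faces first
    return fits[0] if fits else (None, False)


if __name__ == "__main__":
    # 15342 BUENA VISTA against candidate (0) of the example further on.
    print(house_fit(15342, (15367, 15367), (15324, 15366)))
```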

Redirected Attributes

The use of a redirection strategy ( See Redirection) to deal with the problem of differing user and reference schemas complicates comparison of those attributes to which other attributes are redirected. It may be, for example, that when SUFDIRs are redirected to PREDIRs, a SUFDIR in the user address (10th Ave West) corresponds exactly with a PREDIR in the reference (West 10th Ave), in which case we want these values to be considered a match. However, suppose you have a name with both a PREDIR and a SUFDIR in the reference, say West 10th Ave East. With an exact comparison, the agreement on the West would count for nothing. The fewer attributes in the schema, the more acute this problem becomes. Therefore, a similarity measure is used when redirection occurs. If an exact match can't be obtained, then Jaro's algorithm is used to weight the number of common attributes and the order in which they appear.
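To make the idea concrete, here is a Python sketch of the standard Jaro similarity written so that it works on any sequence, whether the characters of a string or a list of attribute values. How PAGC actually assembles the values it passes to Jaro's algorithm is not described here, so the token lists in the usage example are an assumption.

```python
def jaro(a, b):
    """Standard Jaro similarity between two sequences (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    # Elements match only if equal and no further apart than this window.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, item in enumerate(a):
        start = max(0, i - window)
        stop = min(len(b), i + window + 1)
        for j in range(start, stop):
            if not b_hit[j] and b[j] == item:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched elements that appear in a different order.
    a_matched = [x for x, hit in zip(a, a_hit) if hit]
    b_matched = [y for y, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) // 2
    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3


if __name__ == "__main__":
    # User has only WEST; the reference has both WEST and EAST.
    print(jaro(["WEST"], ["WEST", "EAST"]))
```

In the West 10th Ave East case, the shared WEST earns partial credit (5/6 in this sketch) instead of being lost to an all-or-nothing comparison.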

10.2 Reference Candidate Selection

If a candidate reference record achieves the maximum score, and POSTAL is included in the reference schema, the candidate is immediately accepted, searching is terminated and the location geocoded. If POSTAL is not part of the schema, then searching continues until all candidates with scores above the cutoff have been found. If there is more than one with a maximum score, then all of those candidates are passed to the user for selection. If there is only one with a maximum score, it is accepted without the user's intervention and geocoded.

The reason for using the presence or absence of the POSTAL attribute as a criterion is that if you have a large-area reference without POSTAL (as with a metropolitan Statscan file that hasn't been merged with PCCF), you may have a number of identically named streets into which an address might legitimately fit. These collisions presumably disappear in the presence of zip or postal codes.

The Candidate Selection Interaction

If there are no candidates with a maximum score, regardless of the presence of reference POSTAL, then the user is given a numbered list (in groups of ten) of the one hundred (maximum) highest scoring candidates for selection.
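The selection rules of this subsection can be summarized in a Python sketch. It assumes that candidates arrive as (score, record) pairs while the indices are searched; the function name, the return labels and the parameter names are all hypothetical.

```python
def select_candidates(scored, max_score, cutoff, has_postal, limit=100):
    """Apply the acceptance rules to an iterable of (score, record).

    Returns ("accept", [record]) when a record is taken automatically,
    ("ask_user", records) when the user must choose, or ("none", []).
    """
    kept = []
    for score, record in scored:
        if score < cutoff:
            continue
        if has_postal and score >= max_score:
            # With POSTAL in the schema, the first perfect match ends the search.
            return "accept", [record]
        kept.append((score, record))
    if not kept:
        return "none", []
    kept.sort(key=lambda pair: pair[0], reverse=True)
    perfect = [record for score, record in kept if score >= max_score]
    if len(perfect) == 1:
        return "accept", perfect
    if perfect:
        return "ask_user", perfect
    # No perfect match: offer the highest scoring candidates, up to the limit.
    return "ask_user", [record for _, record in kept[:limit]]


if __name__ == "__main__":
    stream = [(0.8, "a"), (1.0, "b"), (1.0, "c")]
    print(select_candidates(stream, 1.0, 0.5, has_postal=False))
```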

Each line of the list gives the rank number of the candidate, the address ranges of both blockfaces (denoting a blank by -1), the postal address attributes delimited by pipes, the score, and the shape number in the original reference shapeset. The following is the fifth item of a list, with the two address ranges of 4569 to 4707 and 4612 to 4664 for a street named Lougheed Highway. It received a score of .975334 when matched with the target address and was derived from record 14380 in the reference file:

(5): 4569-4707 & 4612-4664|LOUGHEED|HIGHWAY| 0.975334 (shp 14380)
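If you capture a session transcript and want to work with the candidate lines afterwards, a line of this form can be taken apart with a regular expression. The Python sketch below is an assumption based only on the layout shown above (it does not handle the variant that appends postal codes and omits the shape number), and the names `CANDIDATE` and `parse_candidate` are hypothetical.

```python
import re

# rank, left range, right range, pipe-delimited attributes, score, shape number
CANDIDATE = re.compile(
    r"\((?P<rank>\d+)\):\s*"
    r"(?P<left_from>-?\d+)-(?P<left_to>-?\d+)\s*&\s*"
    r"(?P<right_from>-?\d+)-(?P<right_to>-?\d+)"
    r"\|(?P<attrs>.*)\|\s*(?P<score>\d*\.\d+)\s*"
    r"\(shp (?P<shp>\d+)\)"
)


def parse_candidate(line):
    """Return the parts of a candidate line as a dict, or None."""
    match = CANDIDATE.match(line.strip())
    if match is None:
        return None
    return {
        "rank": int(match["rank"]),
        "left": (int(match["left_from"]), int(match["left_to"])),
        "right": (int(match["right_from"]), int(match["right_to"])),
        "attributes": match["attrs"].split("|"),
        "score": float(match["score"]),
        "shape": int(match["shp"]),
    }


if __name__ == "__main__":
    print(parse_candidate(
        "(5): 4569-4707 & 4612-4664|LOUGHEED|HIGHWAY| 0.975334 (shp 14380)"
    ))
```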

The user may peruse the list of candidates and either make a selection by entering the rank number of the candidate - or not. The user may quit by entering "q" or exit the program by entering "x". The user can go to the start of the list with "s", the end of the list with "e", the previous 10 candidates with "b" or the next 10 candidates with "n".

Example:

Unstandardized user row 22: 15342 BUENA VISTA

Numbers 0 to 9 of 29 items:
(0): 15367-15367 & 15324-15366|BUENA VISTA|AVENUE| 0.822127 (shp 55846)
(1): 15171-15235 & 15284-15284|BUENA VISTA|AVENUE| 0.819762 (shp 55840)
(2): 15237-15301 & 15286-15320|BUENA VISTA|AVENUE| 0.813156 (shp 55842)
(3): 15391-15391 & 15368-15410|BUENA VISTA|AVENUE| 0.811525 (shp 55848)
(4): 15303-15365 & 15322-15322|BUENA VISTA|AVENUE| 0.801738 (shp 55844)
(5): 15409-15447 & 15412-15444|BUENA VISTA|AVENUE| 0.794806 (shp 55886)
(6): 15463-15463 & 15446-15470|BUENA VISTA|AVENUE| 0.779719 (shp 55885)
(7): 15479-15505 & 15472-15506|BUENA VISTA|AVENUE| 0.769117 (shp 55853)
(8): 15139-15169 & 15114-15178|BUENA VISTA|AVENUE| 0.755253 (shp 55868)
(9): 15515-15537 & 15516-15518|BUENA VISTA|AVENUE| 0.751583 (shp 55854)
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
"*" followed by item number see documenttion.
"+" or "-" followed by number see documentation
> 

Failure to Match

If no selection is made, then the user's address is not geocoded and a null record is added to the shape file. If the program is left before the user's attribute file has been completely matched, the results are undefined.

The unstandardized address was: 5830 Golden Eagle Dr
 
98248
 
Numbers 0 to 3 of 4 items:
(0): 5398-5300 & 5399-5301|GOLDEN EAGLE|LANE|98230 & 98230 0.792023
(1): 5789-5799 & -1--1|GOLDEN EAGLE|DRIVE|98248 & 98248 0.642193
(2): 5781-5787 & -1--1|GOLDEN EAGLE|DRIVE|98248 & 98248 0.642193
(3): 5771-5779 & 5772-5780|GOLDEN EAGLE|DRIVE|98248 & 98248 0.642193
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
>

None of the candidate blockfaces are correct. The only thing to do is type "q" (enter). When you type "q", the following entry will appear in the error log:

Because no match found/selected:
User address row 31 was not geocoded.
The unstandardized address was: 5830 Golden Eagle Dr

98248

Standardized as:


House Address:    5830
Street Name:      GOLDEN EAGLE
Suffix Type:      DRIVE
Postal/Zip Code:  98248

In the above case, had you selected one of the candidates despite the fact that the user address will not interpolate, you would see a "Because 5830 doesn't fit..." entry in the error log instead.

The list of reference candidates presented to the user for selection will frequently consist of many address ranges into which it is impossible to interpolate the civic number. This is because the civic number contributes only part of the matching weight and it is possible to gain a fair score even without agreement on the HOUSE attribute. However, selection of any of these candidates is futile, since the program cannot make the necessary calculations to position them.

One problem that may be encountered is the discovery that there is no blockface with a correct address range among the candidates presented, and the only correct response is to bypass the record. When this occurs, it is most often due to the fact that either the reference is outdated or that the user record actually falls outside the reference area.

The next subsection describes PAGC's strategy for circumventing some omissions in the original reference.

Editing Candidates

It is sometimes the case that it is obvious that the record you want to select is present in the candidate list, but is defective in some way - in the address ranges - that will prevent geocoding. It is possible to edit any candidate. The changes made will not alter the original reference file or even the standardized pgx file. The changes last only until the current record has been processed. However, an entry will be made to the imp file for later reference or use.

To edit a record:

When the list of candidates comes up, type the number of the candidate you want to edit, preceded by an asterisk. For instance, if you want to edit candidate 0, type: *0(enter).

Numbers 0 to 9 of 100 items:
(0): 4859-4859 & 4874-4874|LOUGHEED|HIGHWAY| 1.000000 (shp 14816)
(1): 4315-4345 & 4336-4388|LOUGHEED|HIGHWAY| 1.000000 (shp 14301)
(2): 2152-2294 & 2241-2293|DOUGLAS|ROAD| 1.000000 (shp 15554)
(3): 2330-2390 & 2303-2345|DOUGLAS|ROAD| 1.000000 (shp 15748)
(4): 4813-4831 & 4728-4828|LOUGHEED|HIGHWAY| 0.980390 (shp 14690)
(5): 4569-4707 & 4612-4664|LOUGHEED|HIGHWAY| 0.975334 (shp 14380)
(6): 5257-5695 & 5258-5650|LOUGHEED|HIGHWAY| 0.971990 (shp 14933)
(7): 4129-4267 & 4120-4280|LOUGHEED|HIGHWAY| 0.957922 (shp 14430)
(8): 5757-5997 & 5750-6000|LOUGHEED|HIGHWAY| 0.951887 (shp 15300)
(9): 3771-3951 & 4080-4118|LOUGHEED|HIGHWAY| 0.951316 (shp 14650)
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
"*" followed by item number see documenttion.
"+" or "-" followed by number see documentation
> *0

You will then get a list of the attribute fields, each associated with a number. Use the number to select the field to edit. For example, suppose ADDR_FM_LE is the field name of the field you want to edit, and it is labelled 0. Suppose you want to change it from the 4859 shown to 4959. Type: 0(enter). You are then prompted to enter a new value. Type 4959 (enter).

Numbers 0 to 6 of 7 items:
(0) ADDR_FM_LE:   4859
(1) ADDR_TO_LE:   4859
(2) ADDR_FM_RG:   4874
(3) ADDR_TO_RG:   4874
(4) DIRECTION: _ 
(5) NAME: Lougheed                                                              
(6) TYPE: HWY   
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
> 0
ADDR_FM_LE:   4859
Change to (type in new value):
4959

Numbers 0 to 6 of 7 items:
(0) ADDR_FM_LE:   4959
(1) ADDR_TO_LE:   4859
(2) ADDR_FM_RG:   4874
(3) ADDR_TO_RG:   4874
(4) DIRECTION: _ 
(5) NAME: Lougheed                                                              
(6) TYPE: HWY   
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
> q

Repeat to make changes in other fields. When you are finished with your changes, type q(enter). You will be returned to the candidate list.

The alterations in the record will not be shown in the candidate list. Instead, for that record, there is a prompt telling you to use the asterisk method to view the contents. The record is not re-scored based on the altered information. It will remain in the same position in the candidate list that it occupied before the edit.

Unstandardized user row 4: #205 4941 LOUGHEED HWY
Numbers 0 to 9 of 100 items:
(0): USE * TO VIEW RAW REF DATA 1.000000 (shp 14816)
(1): 4315-4345 & 4336-4388|LOUGHEED|HIGHWAY| 1.000000 (shp 14301)
(2): 2152-2294 & 2241-2293|DOUGLAS|ROAD| 1.000000 (shp 15554)
(3): 2330-2390 & 2303-2345|DOUGLAS|ROAD| 1.000000 (shp 15748)
(4): 4813-4831 & 4728-4828|LOUGHEED|HIGHWAY| 0.980390 (shp 14690)
(5): 4569-4707 & 4612-4664|LOUGHEED|HIGHWAY| 0.975334 (shp 14380)
(6): 5257-5695 & 5258-5650|LOUGHEED|HIGHWAY| 0.971990 (shp 14933)
(7): 4129-4267 & 4120-4280|LOUGHEED|HIGHWAY| 0.957922 (shp 14430)
(8): 5757-5997 & 5750-6000|LOUGHEED|HIGHWAY| 0.951887 (shp 15300)
(9): 3771-3951 & 4080-4118|LOUGHEED|HIGHWAY| 0.951316 (shp 14650)
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
"*" followed by item number see documenttion.
"+" or "-" followed by number see documentation
> 

The program still will not geocode an arc if it can't put the target address into the range. It's up to the user to make sure that the alterations will accommodate the target.

Impute File

If an altered record is selected, the changes are output to a file, in the same directory as the reference shapeset, with the reference base name and an .imp extension. Each field changed for a row is given a separate record. Here is a file with one record, recording the change of field 16 (named ADDR_FM_RG) in shape 47172 to the value 6500:

<imputes>
  <impute>
    <row>47172</row>
    <field><number>16</number><name>ADDR_FM_RG</name></field>
    <new_value> 6500</new_value>
  </impute>
</imputes>

No changes are made to the original reference shapeset or to the standardized representation (the pgx file). The impute file can be used to make changes later in the original reference shapeset - using an xbase editor - if desired. To incorporate these changes in the pgx file, the reference will need to be rebuilt.
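Because the impute file is plain XML, the recorded edits can be listed with a few lines of Python before you apply them by hand in an xbase editor. This sketch uses the standard library's `xml.etree.ElementTree` and relies only on the element names visible in the sample above; the function name `read_imputes` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<imputes> <impute><row>47172</row> "
    "<field><number>16</number><name>ADDR_FM_RG</name></field> "
    "<new_value> 6500</new_value></impute> </imputes>"
)


def read_imputes(xml_text):
    """Return one dict per <impute> record in the given XML text."""
    root = ET.fromstring(xml_text)
    edits = []
    for impute in root.findall("impute"):
        edits.append({
            "row": int(impute.findtext("row")),
            "field_number": int(impute.findtext("field/number")),
            "field_name": impute.findtext("field/name"),
            # Values may carry padding, as in the sample, so strip it.
            "new_value": impute.findtext("new_value").strip(),
        })
    return edits


if __name__ == "__main__":
    for edit in read_imputes(SAMPLE):
        print(edit)
```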

Finding the Next or Previous Range

It is sometimes the case that the block that should contain your target address does not appear in the list of candidates. However, the one just up or just down is present. PAGC provides a method of finding that record and adding (or moving) it to the beginning of the candidate list.

To search for the next arc either up or down from an address range:

Select the candidate from the candidate list that you want to go up from. Say you have a candidate, labelled 4, with a range of 4850-4900 and you want the arc that would correspond to 4900-4950. That is, you want to go up. Type: +4(enter). The program searches the ix4 index for all arcs in the reference (including those that were not standardized) that terminate at the same point as the selected candidate. It then finds the most likely arc in terms of similarity in angle of direction, whether in the candidate list, in the standardized records, or in the unstandardized records, and tells you which arc it was and the degrees of divergence in angle. Otherwise it tells you that it couldn't find an arc (the road network terminates here). The divergence in angle is useful information: the closer it is to 0, the more likely it is that the selected arc is a continuation.

Unstandardized user row 4: #205 4941 LOUGHEED HWY

Numbers 0 to 9 of 100 items:
(0): 4813-4831 & 4728-4828|LOUGHEED|HIGHWAY| 0.980390 (shp 14690)
(1): 4569-4707 & 4612-4664|LOUGHEED|HIGHWAY| 0.975334 (shp 14380)
(2): 5257-5695 & 5258-5650|LOUGHEED|HIGHWAY| 0.971990 (shp 14933)
(3): 4129-4267 & 4120-4280|LOUGHEED|HIGHWAY| 0.957922 (shp 14430)
(4): 4859-4859 & 4874-4874|LOUGHEED|HIGHWAY| 0.957555 (shp 14816)
(5): 5757-5997 & 5750-6000|LOUGHEED|HIGHWAY| 0.951887 (shp 15300)
(6): 3771-3951 & 4080-4118|LOUGHEED|HIGHWAY| 0.951316 (shp 14650)
(7): 6001-6001 & 6150-6150|LOUGHEED|HIGHWAY| 0.941652 (shp 15479)
(8): 6011-16145 & 6012-26106|LOUGHEED|HIGHWAY| 0.941244 (shp 39298)
(9): 4749-4785 & 4756-4756|LOUGHEED|HIGHWAY| 0.921264 (shp 14553)
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
"*" followed by item number see documenttion.
"+" or "-" followed by number see documentation
> +0
Selecting arc 14816 with an angle divergence of .0054 degrees

The selected arc is given a perfect score so that it will appear at the top or near the top of the candidate list (all the perfect scores will congregate at the beginning of the list). You can search around using - or +. If you find the arc you want, you can edit it using the asterisk method.

Unstandardized user row 4: #205 4941 LOUGHEED HWY

Numbers 0 to 9 of 100 items:
(0): 4859-4859 & 4874-4874|LOUGHEED|HIGHWAY| 1.000000 (shp 14816)
(1): 4813-4831 & 4728-4828|LOUGHEED|HIGHWAY| 0.980390 (shp 14690)
(2): 4569-4707 & 4612-4664|LOUGHEED|HIGHWAY| 0.975334 (shp 14380)
(3): 5257-5695 & 5258-5650|LOUGHEED|HIGHWAY| 0.971990 (shp 14933)
(4): 4129-4267 & 4120-4280|LOUGHEED|HIGHWAY| 0.957922 (shp 14430)
(5): 5757-5997 & 5750-6000|LOUGHEED|HIGHWAY| 0.951887 (shp 15300)
(6): 3771-3951 & 4080-4118|LOUGHEED|HIGHWAY| 0.951316 (shp 14650)
(7): 6001-6001 & 6150-6150|LOUGHEED|HIGHWAY| 0.941652 (shp 15479)
(8): 6011-16145 & 6012-26106|LOUGHEED|HIGHWAY| 0.941244 (shp 39298)
(9): 4749-4785 & 4756-4756|LOUGHEED|HIGHWAY| 0.921264 (shp 14553)
Select by item number, or q to quit, n for next,
b for back, s for start, e for end, x exit program
"*" followed by item number see documenttion.
"+" or "-" followed by number see documentation
>

10.3 Match Phase Products

The match phase produces three files, each with the same base name and in the same directory as the user attribute table, but with the extensions .shx, .shp and .err. A fourth file is produced, with the base name of the reference shapeset and in the same directory as the reference shapeset, to record any edits made during candidate selection ( See The Impute File).

The Shapeset Produced

The target products of PAGC are the two shape files (USER_ATTRIBUTE_TABLE_NAME.shp and USER_ATTRIBUTE_TABLE_NAME.shx) which, together with the input user attribute xbase file (USER_ATTRIBUTE_TABLE.dbf), complete the point shapeset. These may then be combined with the reference shapeset in a GIS program to produce a map.

Each record of the shapeset consists of the position data coupled with the address data. When a reference record has been linked to a user address record, the civic number from the user's record is interpolated into the range of the appropriate blockface from the reference record and positioned at the global offset distance from the arc. The method of interpolation is to calculate the total length of the blockface by summing the lengths of each segment of the arc, and then to place the user's civic number along that length in proportion to its position within the address range, where the span of the range is the highest number minus the lowest. The latitude and longitude in decimal degrees are calculated from the values in the reference shapefile and a point shape record is written to the USER_NAME shapeset.
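The interpolation can be illustrated with a Python sketch. It is not PAGC's code: it measures segment lengths with plain planar distance, assumes the vertices are listed from the low end of the range to the high end, and places a single-number range at the start of the arc. The function name `interpolate` is hypothetical.

```python
import math


def interpolate(points, low, high, number):
    """Place number along the arc given by points (a list of (x, y)
    vertices, low end first) in proportion to its place in low..high."""
    fraction = 0.0 if high == low else (number - low) / (high - low)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("number does not fit the address range")
    segments = list(zip(points, points[1:]))
    lengths = [math.dist(p, q) for p, q in segments]
    # Distance to travel along the arc from its first vertex.
    target = fraction * sum(lengths)
    for (p, q), length in zip(segments, lengths):
        if length > 0 and target <= length:
            t = target / length
            return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
        target -= length
    return points[-1]


if __name__ == "__main__":
    arc = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
    # Three quarters of the way along a 20-unit arc.
    print(interpolate(arc, 100, 200, 175))
```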

The program currently uses only meters for distance (which is really only useful here for calculating the offset of the address from the street), and decimal degrees for the coordinate system. When the program is loaded, the Haversine algorithm is used to get an average meters per degree for the reference file's bounding box. A single global offset distance is currently used to position the address point at a 90 degree angle from the arc. This is 11 meters for the NUMBER_INTERVAL_LEFT_RIGHT comparison type and 0 meters for the NUMBER_LEFT_RIGHT comparison type.
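As a rough illustration of these two steps, the Python sketch below derives meters per degree for a bounding box with the haversine formula and then shifts a point at a right angle to a segment. The mean Earth radius, the (longitude, latitude) point order, the choice of measuring through the middle of the box and all function names are assumptions of the sketch, not details taken from PAGC.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def meters_per_degree(min_lon, min_lat, max_lon, max_lat):
    """Average (meters per degree of latitude, per degree of longitude)
    measured through the middle of a non-degenerate bounding box."""
    mid_lat = (min_lat + max_lat) / 2
    mid_lon = (min_lon + max_lon) / 2
    per_lat = haversine_m(min_lat, mid_lon, max_lat, mid_lon) / (max_lat - min_lat)
    per_lon = haversine_m(mid_lat, min_lon, mid_lat, max_lon) / (max_lon - min_lon)
    return per_lat, per_lon


def offset_point(point, seg_start, seg_end, meters, per_lat, per_lon, left=True):
    """Move point (lon, lat) by meters at 90 degrees to the segment,
    to the left or right of the direction seg_start -> seg_end."""
    dx = (seg_end[0] - seg_start[0]) * per_lon  # eastward, in meters
    dy = (seg_end[1] - seg_start[1]) * per_lat  # northward, in meters
    length = math.hypot(dx, dy)
    nx, ny = (-dy / length, dx / length) if left else (dy / length, -dx / length)
    return (point[0] + nx * meters / per_lon, point[1] + ny * meters / per_lat)


if __name__ == "__main__":
    per_lat, per_lon = meters_per_degree(-123.0, 49.0, -122.0, 50.0)
    print(per_lat, per_lon)
    print(offset_point((-123.0, 49.005), (-123.0, 49.0), (-123.0, 49.01),
                       11.0, per_lat, per_lon))
```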

The Match Error Log

Also produced in the match phase is an error log which will note the errors encountered by the program in doing the matching. The match phase error file takes the same name as the user attribute table and is placed in the same directory as the user shapeset. It takes the extension ".err". It will record all rows in the user attribute table that were not geocoded. It is opened after the match phase has set itself up. The form of the error report is:

First, a reason message ( See Reason Message) for the error is given, followed by the message, "User address row N was not geocoded. The unstandardized address was: STRING", where STRING is the concatenated MICRO fields, followed by the concatenated MACRO fields from row N of the user's attribute table. It is followed by the standardizer's best standardization of the address. If the error is a No Standardization error, this best standardization is undefined and should be ignored.

Match Log Error Messages

Reason Message. The reasons for the matching errors are as follows:

No Standardization

"No standardization of STRING for row N". STRING will be either the concatenated MACRO fields or MICRO fields from row N of the user's attribute table. The standardizer was unable to standardize the address given. No attempt to match this address will be made. The reason message is "Because row did not standardize".

No Candidates

Reason message: "Because no candidates above cutoff found:".

Parity Mismatch

"Warning: Parity Mismatch for user row N" The user row house address is of a different parity from the reference block range it has been geocoded into. That is, the house (civic) address is even and the reference is odd, or vice versa.

User Rejection

If the failure to geocode is due to the user rejecting all of the candidates found, the entry will be preceded by the message "Because no match found/selected:".

No Interval Fit

The reason message is "Because HOUSE_NUMBER doesn't fit between either FROM_LEFT and TO_LEFT or FROM_RIGHT and TO_RIGHT:", where HOUSE_NUMBER is the HOUSE attribute of the standardization selected, and FROM_LEFT, TO_LEFT, FROM_RIGHT and TO_RIGHT are the address ranges for the left and right sides of the reference record selected by the user.

Example Log Entry

An example of a match log error listing:

No standardization of Benson Rd PO Box 910
 for row 28
Because row did not standardize
User address row 28 was not geocoded.
The unstandardized address was: Benson Rd PO Box 910

98281

Standardized as:


Postal/Zip Code:  98281

This error shows an unstandardized address at row 28 of the user attribute table. Because there is no HOUSE attribute present, standardization failed for the MICRO portion of the address.
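Since every failed row produces a "User address row N was not geocoded." line preceded by a reason message beginning with "Because", the error file lends itself to a quick summary. The Python sketch below relies only on those two patterns as they appear in the examples in this section; the function name `failed_rows` is hypothetical.

```python
import re

ROW = re.compile(r"User address row (\d+) was not geocoded\.")


def failed_rows(log_text):
    """Return (row_number, reason) pairs found in a match error log.
    The reason is the most recent line starting with "Because"."""
    results = []
    reason = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("Because"):
            reason = line.rstrip(":")
        match = ROW.search(line)
        if match:
            results.append((int(match.group(1)), reason))
            reason = None
    return results


if __name__ == "__main__":
    sample = (
        "No standardization of Benson Rd PO Box 910\n"
        " for row 28\n"
        "Because row did not standardize\n"
        "User address row 28 was not geocoded.\n"
    )
    print(failed_rows(sample))
```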

