6. THE REFERENCE SHAPESET

The reference data must be in the form of a shapeset. A shapeset is a set of three files, named REFERENCE_SHAPESET_NAME.shx, REFERENCE_SHAPESET_NAME.shp and REFERENCE_SHAPESET_NAME.dbf. The first two files contain the positional data of the shapes, whereas the third, an xbase table, is the reference attribute table. It contains the street names, address ranges and other data. The shapeset format is in the public domain and can be read and produced by many GIS programs.

The coordinates used in the shapefile are assumed by the program to be in decimal degrees.

In the build phase, however, it is only the shapeset's xbase attribute table that is used by the program. The postal address related information is extracted from this file, indexed, and standardized in preparation for matching against user attribute files.

6.1 Reference Attribute Table Data Values

The reference table data is assumed to be represented in 7-bit ASCII characters. Latin-1 characters that may exist in some schemas are modified, for standardization purposes, by removing diacritical marks and accents.
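A minimal sketch of this kind of accent folding, written in Python for illustration only. It assumes the data has already been decoded from Latin-1 into a Python string. The function name strip_diacritics is hypothetical, and PAGC's own substitution table may differ, for instance in how it treats characters such as ß or Æ, which have no base-letter decomposition and are left unchanged here.

```python
import unicodedata


def strip_diacritics(text):
    """Fold accented letters to their unaccented base letters."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, strip_diacritics("Montréal") yields "Montreal".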

In Statistics Canada schemas, if there exists an ARC_GROUP field for a record and the value does not begin with the letter "A", the record is bypassed as a non-addressable feature. In all other schemas, including newer Statscan RNF files, a record is bypassed only if it has a blank street name field.

In every field except the address range fields, a value is regarded as blank if it is in fact blank or if it begins with an underscore.
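The blank test can be sketched in a couple of lines of Python. The name is_blank is hypothetical, and treating a whitespace-only value as "in fact blank" is an assumption based on how xbase pads character fields.

```python
def is_blank(value):
    """Return True if a non-address-range field value counts as blank."""
    # Assumption: a value holding only padding spaces is blank.
    return value.strip() == "" or value.startswith("_")
```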

HOUSE

In fields which represent part of an address range, a civic number with the value of 0 is interpreted literally, except when using a Statistics Canada schema, in which case it is interpreted as a blank. Non-numeric characters are ignored. The first string of digits encountered is interpreted as the civic number. If a fraction is included, the number is rounded up. Any initial fraction is rounded up to 1. Milepost addresses, coordinate-style addresses, and addresses that depend on box numbers, rural routes or post office boxes cannot be handled by this version.
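The civic number rules above can be sketched as follows. This is an illustration in Python, not PAGC's parser: the function name parse_civic_number and the statscan flag are hypothetical, and it assumes fractions are written with a slash (as in "12 1/2"), which the text does not specify. None stands for a blank.

```python
import re


def parse_civic_number(raw, statscan=False):
    """Extract a civic number from an address range field, or None if blank."""
    first = re.search(r"\d+", raw)
    if first is None:
        return None  # no digits at all: nothing to interpret
    rest = raw[first.end():]
    # An initial fraction such as "1/2" is rounded up to 1.
    if re.match(r"\s*/\s*\d+", rest):
        return 1
    number = int(first.group())
    # A trailing fraction such as "12 1/2" rounds the number up.
    if re.match(r"\s+\d+\s*/\s*\d+", rest):
        number += 1
    # A literal 0 is kept, except under a Statistics Canada schema.
    if number == 0 and statscan:
        return None
    return number
```

Under these assumptions "No. 45B" gives 45, "12 1/2" gives 13, and "0" gives 0 unless statscan is set.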

CITY, PROV, POSTAL, NATION

If a reference field of this type has a value for a block face, but there is no corresponding address range, it is omitted from the standardized reference.

6.2 Standardization of a Reference Record

The unstandardized reference record is parsed, according to the schema, into a MICRO, a MACRO left and a MACRO right portion and each portion is standardized separately. The AddressRange Numbers are omitted from the MICRO portion when it is sent to the standardizer.

The standardizer uses only rules of the ARC_C class (see rule types) to standardize the MICRO portion of the record. These rules differ from the MICRO_C class in that they do not possess HOUSE output symbols. That is, a MICRO_C rule is like a CIVIC_C and an ARC_C rule combined.

The standardizer creates multiple, ranked standardizations of the MICRO portion of the record. If no standardization can be found, an error message is logged and the record is bypassed. Otherwise, each standardization is examined to determine how closely it corresponds to the unstandardized data. In particular, a standardization is downgraded if an attribute present in the reference data is missing (except for those attributes not explicitly included in the reference schema) or if an attribute missing in the reference data appears in the standardization. The best standardization with none of these kinds of errors is accepted. If there is none, an error message is logged, and the best standardization with the fewest of these errors is accepted. The standardization used in such a case is also written to the error log.
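One way to picture the selection step is the sketch below. It is an interpretation rather than PAGC's code: each candidate is reduced to the set of attribute names it produces, candidates are assumed to arrive ranked best first, and only attributes that belong to the schema are counted as discrepancies. All names are hypothetical.

```python
def count_discrepancies(produced, present, schema):
    """Count schema attributes that the candidate drops or invents."""
    missing = (present & schema) - produced      # in the data, not in the result
    unexpected = (produced & schema) - present   # in the result, not in the data
    return len(missing) + len(unexpected)


def choose_standardization(ranked, present, schema):
    """Return (index, error_count) of the accepted candidate, or None."""
    if not ranked:
        return None  # no standardization found: the record is bypassed
    # Fewest discrepancies wins; the original rank breaks ties.
    index = min(
        range(len(ranked)),
        key=lambda i: (count_discrepancies(ranked[i], present, schema), i),
    )
    return index, count_discrepancies(ranked[index], present, schema)
```

A non-zero error count in the result corresponds to the case where the choice is error logged.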

The standardization accepted may produce attributes not present in the reference schema. Some schemas are rather sparse in attributes. These extra-schema attributes will be dealt with in the match phase by the redirection strategy (see Redirection).

The left and right sides of the MACRO (see Micro/Macro) are first examined to see if the feature is a MACRO boundary, that is, whether they differ. If they do not, only one side is standardized. If a blockface MACRO, stipulated by the schema, is absent even though an address range exists for that blockface, an error is logged. If no standardization is found, an error is logged and the record is bypassed. The standardizer uses only rules of the MACRO_C class to standardize these portions of the record.

6.3 Reference Attribute Schema

There can be many different representations of postal address attribute data in reference attribute tables. PAGC must be able to determine this representation before it can standardize and index. It can either attempt to determine the schema by probing the field names in the reference attribute table, or the user can provide it with the schema beforehand. The program needs to know the postal attribute (see Postal Attributes) associated with each field and the comparison type (see Comparison Types). It does not need the BLDNG, UNITH, UNITT, BOXH, BOXT, RR or UNKNWN attributes, and these should not be included in schemas even if present in the reference.

If no schema table is provided on the command line with the -s switch, the program attempts to discover the schema from the reference table's field names. The program first probes the reference attribute table for either the US Tigerline format (by looking for a "FEDIRP" field) or a Statistics Canada format (by looking for an "ADDR_FM_LE" field). If either of these fields is discovered, it probes for additional address fields in case these have been added.

The case (upper, lower, mixed) of the field names is not important, nor is the order in which they appear in the file.
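The first probe can be sketched like this. The function name detect_builtin_schema and its return values are hypothetical. The Statistics Canada field is spelled ADDR_FM_LE here, matching the address range fields listed for the built-in Statscan schema.

```python
def detect_builtin_schema(field_names):
    """Guess a built-in schema from the reference table's field names."""
    # Case and order of the field names are irrelevant.
    names = {name.strip().upper() for name in field_names}
    if "FEDIRP" in names:
        return "tiger"
    if "ADDR_FM_LE" in names:
        return "statscan"
    return None  # fall back to probing the general field name lists
```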

Field names not part of a schema or on any of the probe lists are ignored. Reference Attribute tables often contain data that is irrelevant to postal address geocoding.

Built-in Schemas

Certain entire schemas are built into the program. In particular:

Tiger

PAGC is set up to recognize the US Census Tiger Line format in the reference attribute table without a schema table. It looks for the address range fields FRADDL, TOADDL, FRADDR, TOADDR and interprets them as the NUMBER_INTERVAL_LEFT_RIGHT comparison type. It also looks for the CHAR_SINGLE fields FEDIRP, FENAME, FEDIRS, FETYPE, corresponding to the PREDIR, STREET, SUFDIR, SUFTYP attributes respectively. It also looks for the POSTAL_LEFT_RIGHT fields ZIPL, ZIPR.

Statscan

PAGC is set up to recognize the Statistics Canada road network format. It looks for the address range fields ADDR_FM_LE, ADDR_TO_LE, ADDR_FM_RG, ADDR_TO_RG and the CHAR_SINGLE fields DIRECTION, NAME, TYPE, corresponding to the PREDIR, STREET, SUFTYPE attributes. Note that it has no POSTAL or indeed any other MACRO attribute. Note too that the DIRECTION and TYPE fields are actually ambiguous. A SUFTYPE in Anglophone Canada may be more reasonably assumed to be a PRETYPE in Francophone Canada.

Because of the possibility that both of these formats can be extended by adding additional fields (a CITY field, for example, or POSTAL_LEFT_RIGHT fields for Statscan), the program will also probe for field names for attributes absent from the schema.

Reference Field Names

The field names (with attribute and comparison types) that can be recognized in reference attribute tables by field name probing are listed here. It is an error to have more than one set of field names for the same attribute.

If you wish your reference attribute table to be recognized properly by PAGC, you can always change your xbase field names to correspond.

Beware of non-address field names that may conflict with field names PAGC recognizes. In particular, NAME is interpreted as a CHAR_SINGLE STREET name. You will need a schema table if such conflicts exist.

6.4 Schema Xbase File

The schema file is an xbase file that describes the schema of the reference shapeset. If the program is unable to discover the schema of the reference attribute table by probing the field names, it will be necessary for the user to provide the program with the switch -sSCHEMA_TABLE_NAME along with the -b switch. The xbase (.dbf) table consists of six to eight columns. The extension (.dbf) need not be provided. It will have as many rows as the number of active postal attributes that are incorporated.

Schema field structure

The schema xbase (dbf) file consists of six fields:


Field Number   Field Name   Type   Size
1              ATTRIB       C      8
2              COMPARE      C      35
3              NAME1        C      25
4              NAME2        C      25
5              NAME3        C      25
6              NAME4        C      25
The Structure of a Schema Table

An additional two fields can also, if desired, be appended for the purpose of overriding the default weights (see the default matching weights):


Field Number   Field Name   Type   Size
7              M            F      10
8              U            F      10
Optional match weight fields

The Attrib Field

In the schema table there will be one row for each active postal attribute (see Postal Attributes). In other words, the postal attribute is the key for the row: there is one attribute per row, one row per attribute. The HOUSE attribute is used, for example, for the fields associated with address ranges, the POSTAL attribute for zip/postal codes, STREET for the street name (such as Tenth in West Tenth Ave), PREDIR for a predirectional (such as West in West Tenth Ave), SUFTYP for the posttype (such as Ave in West Tenth Ave), etc.

The Compare Field

With each attribute there is associated a comparison type. Each comparison type specifies the method by which the relevant fields in the reference record will be compared to the corresponding elements of the user's address record. The comparison types that PAGC defines are listed below.

Comparison Types

NO_COMPARISON

This comparison type is not used in schema tables. It is used internally for redirection (see Redirection).

CHAR_SINGLE

The reference and target fields are compared by matching the character string letter for letter. CHAR_SINGLE comparisons may also be extended to string similarity comparisons. This is the most common comparison type for fields other than POSTAL or HOUSE attribute fields. PRETYPE, PREDIR, QUALIF, STREET, SUFDIR, and SUFTYPE will ordinarily use this comparison type.

CHAR_LEFT_RIGHT

The reference has left and right fields, either of which can be matched to the target. This is for MACRO fields such as CITY that may be different for the left and right blockfaces of a street. String similarity measures may be used here too.

NUMBER_SINGLE

The reference and target match on a single number.

NUMBER_INTERVAL

The target number must fall between the two numbers (from and to) in the reference.

NUMBER_INTERVAL_LEFT_RIGHT

The reference has four numbers, a from-to interval on the left and one on the right. The target number must fall within either one of the intervals. This is the most common of the NUMBER comparison types used in postal address geocoding. This comparison type may also include a similarity comparison (transpositions only).
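The interval test for NUMBER_INTERVAL and NUMBER_INTERVAL_LEFT_RIGHT can be sketched as below. The function names are hypothetical. The sketch assumes that a range may be stored in descending order and that a blank bound (None) disables that side. It leaves out the transposition-based similarity comparison.

```python
def in_interval(target, start, end):
    """NUMBER_INTERVAL: is target between the from and to numbers?"""
    if start is None or end is None:
        return False  # assumption: a blank bound means no range on this side
    low, high = sorted((start, end))  # tolerate descending ranges
    return low <= target <= high


def match_left_right(target, from_left, to_left, from_right, to_right):
    """NUMBER_INTERVAL_LEFT_RIGHT: list the sides whose range holds target."""
    sides = []
    if in_interval(target, from_left, to_left):
        sides.append("left")
    if in_interval(target, from_right, to_right):
        sides.append("right")
    return sides
```

With a left range of 100 to 198 and a right range of 201 to 299, a target of 150 matches only the left side.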

POSTAL_SINGLE

Reference and target match letter for letter on a single postal/zip code field.

POSTAL_SPLIT

This is the comparison type used when the reference splits the postal code into two separate fields (e.g. zip, zip4 or fsa, ldu).

POSTAL_LEFT_RIGHT

The comparison type used for a postal code with both left and right fields, either of which can be matched to the target. This is the most common POSTAL type used in postal address geocoding.

POSTAL_LEFT_RIGHT_SPLIT

This comparison type combines the previous two.

The Name Fields

The four fields named NAME1, NAME2, NAME3 and NAME4 give the field names in which the components of the comparison type will be found in the reference file. The field names in the schema fields NAME1 .. NAME4 are expected in the order given below. In other words, the names should appear in the order specified even if that is not the order in which they appear in the record structure.

CHAR_LEFT_RIGHT and POSTAL_LEFT_RIGHT

LEFT > RIGHT

POSTAL_SPLIT

GENERAL > SPECIFIC (e.g. ZIP BEFORE ZIP+4)

NUMBER_INTERVAL

FROM > TO

NUMBER_INTERVAL_LEFT_RIGHT

FROMLEFT > TOLEFT > FROMRIGHT > TORIGHT

POSTAL_INTERVAL_LEFT_RIGHT

FROMLEFT > TOLEFT > FROMRIGHT > TORIGHT

POSTAL_LEFT_RIGHT_SPLIT

GENERAL_LEFT > SPECIFIC_LEFT > GENERAL_RIGHT > SPECIFIC_RIGHT
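The ordering rules can be held in a small lookup table that turns the NAME1 .. NAME4 values of a schema row into named roles. This is a sketch covering the types defined under Comparison Types. The names NAME_ORDER and bind_names, and the role name VALUE used for the single-field types, are hypothetical.

```python
NAME_ORDER = {
    "CHAR_LEFT_RIGHT": ("LEFT", "RIGHT"),
    "POSTAL_LEFT_RIGHT": ("LEFT", "RIGHT"),
    "POSTAL_SPLIT": ("GENERAL", "SPECIFIC"),
    "NUMBER_INTERVAL": ("FROM", "TO"),
    "NUMBER_INTERVAL_LEFT_RIGHT": ("FROMLEFT", "TOLEFT", "FROMRIGHT", "TORIGHT"),
    "POSTAL_LEFT_RIGHT_SPLIT": (
        "GENERAL_LEFT", "SPECIFIC_LEFT", "GENERAL_RIGHT", "SPECIFIC_RIGHT",
    ),
}


def bind_names(compare, names):
    """Map the non-blank NAME1..NAME4 values of a row to their roles."""
    # Assumption: single-field comparison types use NAME1 only.
    roles = NAME_ORDER.get(compare, ("VALUE",))
    filled = [name for name in names if name]
    if len(filled) != len(roles):
        raise ValueError(f"{compare} expects {len(roles)} field name(s)")
    return dict(zip(roles, filled))
```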

The M and U Fields

You may also, if you wish, add fields to set the matching weights for each attribute. Add a field named M for the match weight and/or a field named U for the mismatch weight. These are the weights applied to the matches between the user's address and the reference records, and they represent the probability of random false negatives and random false positives. If these fields are absent, the default values (see matching default values) for each attribute are used. If either field is present but has a blank or 0.0 value for any attribute, that attribute will be given the default value. The default value is used, therefore, unless the field is present in the schema table and the non-blank value in the field is greater than 0.0 but less than 1.0.
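The fallback rule amounts to a range check, sketched here. The name effective_weight is hypothetical, None stands for an absent field or a blank value, and the default values themselves are not reproduced.

```python
def effective_weight(value, default):
    """Return the schema weight if it is usable, otherwise the default."""
    if value is None:
        return default  # field absent or value blank
    if 0.0 < value < 1.0:
        return value    # only values strictly between 0 and 1 override
    return default
```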

Example Schema Table

Example: the schema file freetig.dbf, which describes the free shapefile distributions of TigerLine files, has the following 6 records:


ATTRIB   COMPARE                      NAME1    NAME2    NAME3    NAME4
HOUSE    NUMBER_INTERVAL_LEFT_RIGHT   Fraddl   Toaddl   Fraddr   Toaddr
PREDIR   CHAR_SINGLE                  Fedirp
STREET   CHAR_SINGLE                  Fename
SUFTYP   CHAR_SINGLE                  Fetype
SUFDIR   CHAR_SINGLE                  Fedirs
POSTAL   POSTAL_LEFT_RIGHT            Zipl     Zipr
freetig.dbf


ATTRIB   COMPARE                      NAME1    NAME2    NAME3    NAME4    M
HOUSE    NUMBER_INTERVAL_LEFT_RIGHT   Fraddl   Toaddl   Fraddr   Toaddr   .99
PREDIR   CHAR_SINGLE                  Fedirp
STREET   CHAR_SINGLE                  Fename
SUFTYP   CHAR_SINGLE                  Fetype
SUFDIR   CHAR_SINGLE                  Fedirs
POSTAL   POSTAL_LEFT_RIGHT            Zipl     Zipr
The above table with an M field
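Before typing such a table into an xbase editor, its rows can be written down as plain data and checked against the field sizes and the one-row-per-attribute rule. This is a sketch only: it does not write a .dbf file, and the names FIELD_WIDTHS, FREETIG and check_schema_rows are hypothetical.

```python
FIELD_WIDTHS = {"ATTRIB": 8, "COMPARE": 35,
                "NAME1": 25, "NAME2": 25, "NAME3": 25, "NAME4": 25}

FREETIG = [
    {"ATTRIB": "HOUSE", "COMPARE": "NUMBER_INTERVAL_LEFT_RIGHT",
     "NAME1": "Fraddl", "NAME2": "Toaddl", "NAME3": "Fraddr", "NAME4": "Toaddr"},
    {"ATTRIB": "PREDIR", "COMPARE": "CHAR_SINGLE", "NAME1": "Fedirp"},
    {"ATTRIB": "STREET", "COMPARE": "CHAR_SINGLE", "NAME1": "Fename"},
    {"ATTRIB": "SUFTYP", "COMPARE": "CHAR_SINGLE", "NAME1": "Fetype"},
    {"ATTRIB": "SUFDIR", "COMPARE": "CHAR_SINGLE", "NAME1": "Fedirs"},
    {"ATTRIB": "POSTAL", "COMPARE": "POSTAL_LEFT_RIGHT",
     "NAME1": "Zipl", "NAME2": "Zipr"},
]


def check_schema_rows(rows):
    """Return a list of problems found in the schema rows (empty if none)."""
    problems = []
    seen = set()
    for row in rows:
        attrib = row["ATTRIB"]
        if attrib in seen:
            problems.append(f"duplicate attribute {attrib}")
        seen.add(attrib)
        for field, width in FIELD_WIDTHS.items():
            # Missing NAME fields are treated as blank.
            if len(row.get(field, "")) > width:
                problems.append(f"{attrib}: {field} exceeds {width} characters")
    return problems
```

Calling check_schema_rows(FREETIG) returns an empty list for the rows shown above.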

