Although much available data is already geocoded, in many situations you will find data in a table that you want to plot on a map. One example is the survey data from this chapter that lists street addresses, ZIP Codes, and the home states of the people surveyed. Using these location attributes, you can map (geocode) their locations. Another example is transaction data collected by organizations (and perhaps you are working for such an organization). Because transactions, such as delivery of an appliance, often occur at a location, it’s useful to map these locations to analyze the market served. Another example of transaction data is records of surgery patients in a hospital. In this case, it’s useful to map the residences of patients to identify the service area of the hospital.
A geocode is data that identifies a unique location—a point, line, or polygon—on planet Earth. For example, the geocode 4800 Forbes Ave, Pittsburgh, PA 15213, identifies a unique point. The ZIP Code, 15213, is also a geocode identifying a unique polygon, and Pittsburgh, PA, is a geocode identifying a unique county subdivision polygon. If you have such source data, you can use ArcGIS Pro’s geocoding algorithm to map corresponding points by matching to reference data. Reference data is a feature class with existing locations—for example, street centerlines with street address attributes or ZIP Code polygons with a ZIP Code attribute.
The source data that you will geocode at the beginning of this chapter comes from a survey that an arts organization took at its annual art show. The survey includes the residence addresses of attendees. The arts organization wants to analyze the locations of attendees to better target future marketing efforts. You’ll first geocode by ZIP Code only and then by street address.
The problem with geocoding is that source data suppliers (for example, survey respondents for the survey data) and data entry workers can write or type anything they want, including misspellings, abbreviations, omissions, and place-names such as “University of Pittsburgh” instead of an address. Organizations new to GIS often have typed addresses with notes included in address fields such as “333 W Pine Ave, watch out for vicious dog” that make geocoding a challenge. Consequently, exact matching of source to reference data is not possible. Instead, you must use “fuzzy matching” (a kind of matching used in computer science to make matches that are approximate instead of 100 percent accurate). For example, the address “123 Fleet St” may be on a street map, and a data entry worker may have typed “123 Fleat” for source data, with a misspelling and without the “St.” A fuzzy-matching algorithm might determine that address is close enough to the correct address and plot the residence at 123 Fleet St.
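The idea of approximate matching can be sketched in a few lines of Python. This is only an illustration using the standard library's difflib similarity ratio, not the method ArcGIS Pro uses; the reference addresses and the 0.6 cutoff are assumptions made up for the example.

```python
from difflib import get_close_matches

# Hypothetical reference addresses (stand-ins for a street centerline layer).
reference = ["123 FLEET ST", "123 FLEMING ST", "500 MAIN ST"]

def fuzzy_match(source_address, candidates, cutoff=0.6):
    """Return the closest reference address, or None if nothing clears the cutoff."""
    cleaned = source_address.upper().strip()
    matches = get_close_matches(cleaned, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A misspelled address without its street type still finds a close candidate.
best = fuzzy_match("123 Fleat", reference)

# A place-name has no close candidate among street addresses.
none_found = fuzzy_match("University of Pittsburgh", reference)
```

Raising the cutoff makes the matcher stricter (fewer but safer matches), and lowering it makes the matcher more permissive.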
A rule-based expert software system can make fuzzy matches; ArcGIS Pro’s geocoding software is such a system. The system attempts to use the thought processes and rules that an expert would use to accomplish a complex and ambiguous task. In this case, the expert system attempts to mimic what a resourceful mail delivery person would do, using their expert knowledge to get a badly addressed piece of mail to the right address. These expert system components are used in ArcGIS Pro’s geocoding:
To account for spelling errors, an algorithm computes a Soundex key, which is a code assigned to names that sound alike (for example, “Fleet” and “Fleat” both have Soundex key F430), and identifies candidate matches on the basis of matching source and reference street addresses. (Look up “Soundex key” in your web search engine.) The algorithm starts with a score of 100 for each case and subtracts penalty points for each problem encountered. If the end score is greater than the threshold set as a parameter by default or by the user, a reference location is a candidate for matching. The candidate with the maximum score is chosen as the estimated location. If there is a tie, one of the tying locations is arbitrarily assigned as the match unless the user chooses to not accept ties (50 percent of which are incorrect).
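The following is a minimal sketch of the two ideas just described: a Soundex key for finding candidates and a penalty-based score for choosing among them. The Soundex function follows the standard American Soundex rules (one letter plus three digits). The penalty values, the threshold of 80, and the record fields are hypothetical and are not ArcGIS Pro's actual rules.

```python
# Standard Soundex digit groups.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex key of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]

def score(source, ref):
    """Start at 100 and subtract hypothetical penalties for each problem."""
    points = 100
    if source["name"] != ref["name"]:
        points -= 10  # assumed penalty: spelling differs
    if source["type"] != ref["type"]:
        points -= 5   # assumed penalty: street type missing or different
    return points

def match(source, reference, threshold=80, accept_ties=True):
    """Return (score, record) for the best candidate, or None."""
    candidates = [r for r in reference
                  if r["number"] == source["number"]
                  and soundex(r["name"]) == soundex(source["name"])]
    scored = [(score(source, r), r) for r in candidates]
    scored = [(s, r) for s, r in scored if s > threshold]
    if not scored:
        return None
    top = max(s for s, _ in scored)
    winners = [r for s, r in scored if s == top]
    if len(winners) > 1 and not accept_ties:
        return None
    return top, winners[0]  # on a tie, the first winner is taken arbitrarily

reference = [
    {"number": 123, "name": "FLEET", "type": "ST"},
    {"number": 123, "name": "MAIN", "type": "ST"},
]
source = {"number": 123, "name": "FLEAT", "type": ""}
result = match(source, reference)
```

Here the misspelled source record keeps a score above the threshold, so the Fleet St record is returned as the match.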
You can geocode with any number of data types. For this chapter, you will geocode using street centerlines and ZIP Codes. People generally will disclose their ZIP Codes in surveys and get them right, so the results are complete and accurate, albeit only at the ZIP Code level. You can download a US Census Bureau ZIP Code map for the entire United States for geocoding nationwide. Often, a ZIP Code may be the only available data type and will suffice for marketing purposes. Service, product delivery, and other location-based needs require more precise locations. Street centerlines are sufficient for many purposes (but certainly not for locating in-ground natural gas and other lines during construction digging). You can easily use geocoding with ZIP Codes and street centerlines with free map layers downloaded from the Internet (see chapter 5). However, cities and states perform many other kinds of geocoding, often more precisely, using land parcel centroids with street addresses provided by many city governments. Note that Esri provides the highly accurate and current ArcGIS Online World Geocoding Service. If you are in a class, however, check with your instructor before using this service, because using the service consumes credits that must be purchased.
Dual-range maps, available from the Census Bureau’s TIGER/Line data and from vendors, are widely used for geocoding but are limited by only having house numbers on the left and right for the beginning and end of each one-block-long street segment. Consequently, addresses within blocks are linearly interpolated (for example, 150 Main St. is plotted halfway along the street segment with ranges 100 to 198 and 101 to 199) and are not exact locations.
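Linear interpolation along a dual-range segment can be sketched as follows. The segment dictionary, its coordinates, and the function name are assumptions for illustration; a real street segment is a polyline with real-world coordinates, and a geocoder may also offset the point to the correct side of the street.

```python
def locate_on_segment(house_number, segment):
    """Interpolate a house number along a straight street segment.

    Returns (side, fraction, (x, y)), or None if the number fits neither range.
    """
    (x0, y0), (x1, y1) = segment["start"], segment["end"]
    for side in ("left", "right"):
        lo, hi = segment[side]
        same_parity = house_number % 2 == lo % 2
        if same_parity and min(lo, hi) <= house_number <= max(lo, hi):
            # Fraction of the way from the start of the segment to its end.
            frac = 0.5 if hi == lo else (house_number - lo) / (hi - lo)
            point = (x0 + frac * (x1 - x0), y0 + frac * (y1 - y0))
            return side, frac, point
    return None

# Hypothetical one-block segment of Main St with made-up coordinates.
main_st = {
    "start": (0.0, 0.0),
    "end": (98.0, 0.0),
    "left": (100, 198),   # even-numbered side
    "right": (101, 199),  # odd-numbered side
}

side, frac, point = locate_on_segment(150, main_st)
```

For 150 Main St, the fraction is (150 - 100) / (198 - 100), or about 0.51, which is why the address plots roughly halfway along the block regardless of where the building actually stands.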
Generally, not all source data records are matched when geocoding. A performance measure for geocoding is the match rate: the percentage of source addresses that get matched and plotted using the reference data. For the denominator of the match rate, use only the records that are actually addresses. That is, take the total number of records in the source data and subtract all records that are not addresses (records that are blank, that neither start with a house number nor name a street intersection, and so on).
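A match rate with an adjusted denominator can be computed as in this sketch. The rules for deciding what counts as an address (a leading house number, or an ampersand marking an intersection) and the sample records are simplifying assumptions.

```python
import re

def is_address(text):
    """Rough test: nonblank, and either starts with a house number or is an intersection."""
    text = text.strip()
    if not text:
        return False
    starts_with_number = re.match(r"\d+\s+\S", text) is not None
    is_intersection = "&" in text
    return starts_with_number or is_intersection

def match_rate(records):
    """Percentage of address records that were matched; non-addresses are excluded."""
    addresses = [r for r in records if is_address(r["address"])]
    if not addresses:
        return 0.0
    matched = sum(1 for r in addresses if r["matched"])
    return 100.0 * matched / len(addresses)

# Hypothetical geocoding results.
records = [
    {"address": "123 Fleet St", "matched": True},
    {"address": "", "matched": False},                          # blank: excluded
    {"address": "University of Pittsburgh", "matched": False},  # place-name: excluded
    {"address": "Forbes Ave & Craig St", "matched": True},
    {"address": "500 Main St", "matched": False},
]

rate = match_rate(records)
```

With these records, two of the three usable addresses match, so the rate is about 66.7 percent rather than the 40 percent you would get by dividing by all five records.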
Unfortunately, there is no way to judge if a match is truly correct. Organizations that critically depend on geocoding (such as 911 emergency calls for services from police, fire, and ambulance responders) review nonmatches and incorrect matches to improve their maps and procedures for obtaining correct source data from callers. The sensitivity analysis in assignment 8-3 on the book resource web page offers some insight into match accuracy. In that assignment, you loosen thresholds in matching rules (changeable through tool parameters) until the additional addresses that get matched at each looser setting are clearly errors. In general, the default settings for locator files perform well in the sensitivity analysis.
In this tutorial, you will geocode survey data collected by a Pittsburgh arts organization that holds an event each year attended by people from across the three-state region of Pennsylvania, Ohio, and Maryland, and beyond. To save space on your computer’s hard drive, these exercises will use ZIP Code polygons from only those three states plus West Virginia (the relevant region for Pittsburgh) instead of the entire country (which takes 0.5 GB of disk space for ZIP Codes).
Recall that a geocoding locator is a set of files that stores parameters and other data for the geocoding process.
Important note: Do not select https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer or any other such URL for Input Address Locator unless you have permission from your instructor or employer. The organizational account that you are using would be billed for using the geocoding service, and you might be billed!
The match rate, 99.7 percent, is extremely high and well above the threshold needed for marketing decision-making or other management purposes, so the Attendees map could be used without any changes. But for practice, you’ll rematch the unmatched records to bring the match rate to 100 percent. You’ll correct the ZIP Code in one record and then pick approximate points for the two ZIP Codes that are not in the reference data (in practice, you would find these points by looking up the ZIP Codes elsewhere).
Now you have attendees’ survey data geocoded to ZIP Code centroids, generally with many attendees at center points of ZIP Codes. For symbolizing attendees, next you will count the number of attendees in each ZIP Code and plot size-graduated symbols, with symbol size increasing as the number of attendees increases. You could do this work manually in a couple of steps, but the Collect Events tool will do the job in one step.
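The counting step can be pictured with a short sketch: group coincident points, count them, and scale a symbol size by the count. This is not the Collect Events tool itself; the coordinates and the size range are made-up values for illustration.

```python
from collections import Counter

# Hypothetical geocoded attendees, each placed at a ZIP Code centroid (x, y).
attendee_points = [
    (-79.95, 40.44),
    (-79.95, 40.44),
    (-79.95, 40.44),
    (-80.02, 40.47),
]

# One entry per unique location, with the number of coincident points.
event_counts = Counter(attendee_points)

def symbol_size(count, max_count, min_size=4.0, max_size=24.0):
    """Scale symbol size linearly with count (assumed size range, in points)."""
    if max_count <= 1:
        return min_size
    return min_size + (max_size - min_size) * (count - 1) / (max_count - 1)

max_count = max(event_counts.values())
sizes = {pt: symbol_size(n, max_count) for pt, n in event_counts.items()}
```

The centroid with three attendees gets the largest symbol, and the centroid with one attendee gets the smallest.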
This tutorial starts with the data from tutorial 8-1 but only includes records that have street addresses and are in Allegheny County, which includes Pittsburgh. Allegheny County is the local market for the arts event, and more detailed location data on attendees is desirable for marketing there. So you’ll geocode by street address to place a unique point on the map for each attendee in the county. You’ll use the same workflow as with ZIP Code matching: build a locator (but this time using street centerlines as the reference data), geocode the source data of survey respondents, and rematch some of the unmatched records.
Note that the locator you are building is called “dual-range locator” because the address style is dual ranges. The data records of this address style have the beginning and ending house number for both sides of the street, both the even- and odd-numbered sides.
The tool automatically finds all but one of the essential fields needed for geocoding (indicated by *) from the reference data. Select FullName for Street Name. Scroll down in the field map to see that the tool found ZIPL and ZIPR, the ZIP Codes on the left and right side of each street segment. Streets on the borders of ZIP Codes have a different ZIP Code on each side, hence the left and right ZIP Code fields. Street names are unique within a ZIP Code and within a city, but not across a county. Allegheny County has many cities, and several of these cities may have streets with the same street name and house numbers, such as a “100 Main Street.” Including a ZIP Code in the matching process guarantees that you will find the correct Main Street and location.
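To see why the ZIP Code matters, consider this small sketch. The field names FullName, ZIPL, and ZIPR come from the reference data described above; the segment records, their ZIP Code values, and the helper function are hypothetical.

```python
# Hypothetical street segments from different cities in the same county.
segments = [
    {"segment_id": 1, "FullName": "MAIN ST", "ZIPL": "15215", "ZIPR": "15215"},
    {"segment_id": 2, "FullName": "MAIN ST", "ZIPL": "15106", "ZIPR": "15106"},
    {"segment_id": 3, "FullName": "FORBES AVE", "ZIPL": "15213", "ZIPR": "15217"},
]

def find_candidates(street_name, zip_code=None):
    """Return segments with the given name, optionally restricted by ZIP Code."""
    return [
        s for s in segments
        if s["FullName"] == street_name
        and (zip_code is None or zip_code in (s["ZIPL"], s["ZIPR"]))
    ]

without_zip = find_candidates("MAIN ST")        # ambiguous: two segments
with_zip = find_candidates("MAIN ST", "15106")  # a single segment remains
```

Without a ZIP Code, the street name alone leaves two candidate segments; adding the ZIP Code narrows the candidates to one.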
There are only 67 unmatched addresses to complete (your number may be slightly different depending on your version of Pro). The match is high enough now for marketing decision-making, but for practice, you’ll attempt to rematch nine unmatched records. The research for fixing unmatched addresses is already done and made available in the step 3 table, where Address is the address from the survey data; Comment is the result of research using the US Post Office’s ZIP Code lookup web page, an online mapping website, and the TIGER Streets attribute table; and Rematch is the action determined to be taken. As you go through the tutorial, you will use this workflow and table as references:
This road is only two blocks long. You’ll just edit the first row that has house number ranges 49 to 99 and 2 to 4. Given the survey address of 1 Bayard Rd, it’s reasonable to modify the LFromAdd field value from 49 to 1 and keep the value 99 for LToAdd. For the same row and RFromAdd and RToAdd fields, use values 2 and 98, respectively.
Make edits for Mary Ann St, Pittsburgh, PA 15203 as noted in the table. Save your edits. Make sure that no streets are selected.
Considering that you just changed the street attributes used in geocoding with the Streets reference data, you must rebuild the streets locator to incorporate the changes.
In the Rematch Addresses pane, the unmatched addresses are presented in the same order as the table you previously used. First up is 100 Rudolph Lane, which according to the research results in the table, is a bad address.
This record has a perfect match score, 100, because of the street map editing you did earlier.
You will likely find place-names instead of addresses for some records in a table that is supposed to have street addresses. For example, instead of typing the address “115 Federal St,” Pittsburgh police commonly enter the place-name “PNC Park,” the name of the Pittsburgh Pirates’ baseball stadium at that location. For cases with place-names, ArcGIS Pro’s geocoding algorithm can use an alias table, which you must create, to replace aliases with street addresses. You can easily data mine unmatched records in geocoded data for place-names (sort by address and look for place-names). Then, with a number of place-names in hand, you must look up their addresses and create a table with place-names in one column, addresses in a second column, and ZIP Codes in a third column. You can use Microsoft Excel or Notepad to do this work.
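The substitution that an alias table performs can be sketched as a simple lookup. The column names, the place-names, and the addresses below are invented for illustration and do not reproduce the tutorial's Alias.csv or how ArcGIS Pro applies alias tables internally.

```python
import csv
import io

# Hypothetical alias table with the three columns described above.
ALIAS_CSV = """PlaceName,Address,ZIP
EXAMPLE TOWER,100 MAIN ST,15222
SAMPLE PLAZA,250 OAK AVE,15219
"""

def load_aliases(text):
    """Read the alias table into a dictionary keyed by uppercase place-name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["PlaceName"].strip().upper(): (row["Address"], row["ZIP"])
            for row in reader}

def resolve(address, zip_code, aliases):
    """Replace a place-name with its address and ZIP Code; pass other input through."""
    return aliases.get(address.strip().upper(), (address, zip_code))

aliases = load_aliases(ALIAS_CSV)
replaced = resolve("Example Tower", "", aliases)
unchanged = resolve("200 Grant St", "15219", aliases)
```

Records that already hold a street address pass through unchanged, so the same lookup can be applied to every record before geocoding.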
Suppose that the map you are about to open of Pittsburgh’s central business district (CBD) is for a food-catering service that delivers lunches to the area’s business employees. The catering service needs a master layer of its customers’ businesses to create the best routes for cutting delivery costs (you’ll solve such problems in chapter 13). In this tutorial, you will take the preliminary step of setting up delivery stops. Note that the data here has relatively few records, so you could edit the source data directly to replace place-names with correct addresses. Alias tables pay off when you have hundreds or thousands of records to geocode regularly. In those cases, automatically replacing place-names saves time and work.
Create a dual-ranges locator, named CBDLocatorNoAlias, using CBDStreets as the reference data and Clients.csv as the source data. Remember to look down the field map and select FullName for Street Name. Then geocode Clients.csv to produce a point layer named ClientsNoAliases. A partial listing of Clients.csv is shown. The selected four records have place-names (names of buildings) instead of addresses and are the only records that will not geocode at this time, because you’re not yet using an alias table. Check to see that the four highlighted records are the only ones not matched, as shown.
The figure that follows has the contents of Alias.csv. If you geocoded this table, you’d find that all three addresses match. So you can expect that geocoding with this alias table will result in 100 percent matches for the client data.
Geocode Clients.csv using the CBDStreets_CreateAddressLoca locator to produce ClientsAliasTable. You’ll find that 100 percent of the clients’ records match, which is what you’d need to carry out optimal routing of deliveries. Between researching unmatched source records and using alias tables, you can always geocode 100 percent of records that have usable addresses. When you finish, save and close your project.
This chapter has three assignments to complete that you can download from this book’s resource web page, esri.com/gist1arcgispro: