Maps sometimes require more than a visualization of spatial data for users to answer questions and solve problems. The right data might be on the map, but analytical methods offer the best answers or solutions to a problem. This chapter covers four spatial analytical methods: buffers, service areas, facility location models, and clustering. The application areas include identifying illegal drug dealing within drug-free zones around schools, estimating a so-called “gravity model” for the fraction of youths intending to use public swimming pools as a function of the distance to the nearest pool from their residences, determining which public swimming pools to keep open during a budget crisis, and determining spatial patterns of arrested person attributes for serious violent crimes.
This chapter introduces a new spatial data type, the network dataset, which is used for estimating travel distance or time on a street network. ArcGIS Online provides accurate network analysis with network datasets for much of the world. Because these services consume credits that must be purchased, check with your instructor to see whether you have credits available. For instructional purposes, this chapter uses approximate network datasets built from TIGER street centerlines, which are free to use.
A buffer is a polygon surrounding map features of a feature class. As the analyst, you specify a buffer’s radius, and the Buffer tool sweeps that radius around each feature, remaining perpendicular to the feature, to create buffers. For points, buffers are circles; for lines, they’re corridors with rounded end points; and for polygons, they’re enlarged polygons with rounded vertices.
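If you prefer to script this step rather than use the geoprocessing pane, the following minimal arcpy sketch runs the Buffer tool. The geodatabase path and feature class names are illustrative assumptions, not the tutorial’s exact project contents.

```python
# Minimal arcpy sketch of the Buffer tool (path and names are assumptions).
import arcpy

arcpy.env.workspace = r"C:\EsriPress\Chapter9\Chapter9.gdb"  # hypothetical geodatabase

# Sweep a 1,000-foot radius around each school point; point buffers are circles.
arcpy.analysis.Buffer(
    in_features="Schools",
    out_feature_class="Schools_Buffer",
    buffer_distance_or_field="1000 Feet",
)
```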
Generally, you use buffers to find what’s near (proximate to) the features being buffered. For schools symbolized as points, 1,000-foot school buffers define drug-free zones. Any person convicted of dealing illicit drugs within such a zone receives a mandatory increase in jail sentence, intended to deter selling drugs to children. You’ll analyze drug-free zones in this tutorial.
Another example of spatial analysis with buffers is “food deserts,” often defined as areas of a city that are more than a mile from the nearest grocery store. Often, the persons living in food deserts are poor and from minority populations, and it’s straightforward to analyze affected populations using selection by location. You will select the populations within buffers and then analyze the selected features, for example, by using ArcGIS Pro’s ability to summarize data. Additionally, the book resource web page gives you the opportunity to explore this topic further. Assignment 9-2 addresses a problem analogous to food deserts, “urgent health care deserts” in Pittsburgh: areas more than a mile from the nearest urgent health care facility. Assignment 9-4 responds to food deserts by finding the best locations for additional farmers’ markets in Washington, DC.
Sometimes, buffers are exactly the right tool. One such example is the drug-free zone, for which federal and state laws prescribe a buffer radius, generally 1,000 feet, around school properties. Other times, buffers are less accurate but provide a quick estimate. One such case is geographic access by youths to public swimming pools in Pittsburgh, because pool users travel the street network to get to the pools. Pittsburgh has an irregular street network because of its rivers and hilly terrain, so even though some youths appear to be close to a pool on a map, they may have no direct route to it. In this case, you’ll need a network model that uses travel distance on a street network dataset. Buffer-like areas estimated with street networks are called service areas, and you’ll work with them in tutorial 9-3 to analyze public swimming pools in Pittsburgh.
In this tutorial, you’ll buffer Pittsburgh schools to find illicit drug dealing arrests within drug-free school zones, the 1,000-foot buffers of schools. Drug arrests often occur at the scene of drug dealing, so arrest locations, which correspond to the locations of illegal drug sales violations, are relevant for this exercise’s analysis.
If drug violations occurred randomly in Pittsburgh, then for any given area within Pittsburgh, you’d expect the fraction of such crimes to equal that area’s fraction of Pittsburgh’s area. Run the Buffer tool again with Schools as input, Schools_Buffer_Dissolved as output, a 1,000-foot radius, and this time with the option “Dissolve all output features into a single feature.” Then divide the area of the resulting buffer by the area of Pittsburgh. Both areas are in square feet and so are large numbers. You can copy and paste them from the Shape_Area attributes of the two feature classes into Microsoft Excel to do the division. You should get 398,645,239/1,627,099,663 = 0.245, or 24.5 percent, but earlier you found that a substantially higher fraction, 33.8 percent, of drug violations occurred in drug-free zones. Although not a definitive result, this should make you suspicious that drug-free zones are not working in Pittsburgh. A better estimate than the one you just made, however, uses only Pittsburgh’s area with human activity, instead of all of Pittsburgh, as the divisor. You’d have to subtract the area of rivers, cemeteries, steep hillsides, and so on, which for Pittsburgh is about 50 percent of its area. Then you’d expect 49 percent of drug arrests to be in the drug-free zone buffers, which is considerably larger than the 33.8 percent found. So perhaps the law reduces drug dealing near schools. Save your project.
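You can also script the dissolved buffer and the area comparison. The following sketch assumes the Schools and Pittsburgh feature classes reside in a project geodatabase (the path is hypothetical) and reads areas with a search cursor.

```python
# Dissolved school buffer plus the area-ratio check (path and names are assumptions).
import arcpy

arcpy.env.workspace = r"C:\EsriPress\Chapter9\Chapter9.gdb"  # hypothetical geodatabase

# Buffer schools again, this time dissolving all buffers into a single feature.
arcpy.analysis.Buffer(
    in_features="Schools",
    out_feature_class="Schools_Buffer_Dissolved",
    buffer_distance_or_field="1000 Feet",
    dissolve_option="ALL",
)

def total_area(feature_class):
    """Sum polygon areas (square feet, given a state plane coordinate system)."""
    with arcpy.da.SearchCursor(feature_class, ["SHAPE@AREA"]) as cursor:
        return sum(row[0] for row in cursor)

# Expected fraction of violations in drug-free zones if violations were random.
ratio = total_area("Schools_Buffer_Dissolved") / total_area("Pittsburgh")
print(f"{ratio:.3f}")  # about 0.245 with the tutorial's data
```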
You can get a list of all drug violations within school buffers with the names of the schools included. If a drug violation is in more than one school buffer, you can get the names of all the schools. You’ll use a spatial join of school buffers to drug violations to get this information, which could be passed along to police and the courts.
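A scripted version of this spatial join might look like the following arcpy sketch; the feature class names are assumptions standing in for the tutorial’s layers.

```python
# Spatial join: attach school (buffer) names to drug violation points (names are assumptions).
import arcpy

arcpy.env.workspace = r"C:\EsriPress\Chapter9\Chapter9.gdb"  # hypothetical geodatabase

# JOIN_ONE_TO_MANY writes one output row per violation/buffer pair, so a violation
# inside two overlapping school buffers gets a row for each school's name.
arcpy.analysis.SpatialJoin(
    target_features="DrugViolations",
    join_features="Schools_Buffer",
    out_feature_class="DrugViolations_Schools",
    join_operation="JOIN_ONE_TO_MANY",
    join_type="KEEP_COMMON",       # keep only violations that fall inside a buffer
    match_option="INTERSECT",
)
```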
A multiple-ring buffer for a point looks like a bull’s-eye target, with a center circle and rings extending out. You can configure the center circle and each ring to be separate polygons, thereby allowing you to select other features within given distance ranges from the buffered feature.
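Scripted, a multiple-ring buffer looks like the sketch below; the pool layer name, output name, and distance field are illustrative assumptions, and the half-mile and one-mile distances follow the access definitions used next.

```python
# Multiple-ring buffer around pools (layer and field names are assumptions).
import arcpy

arcpy.env.workspace = r"C:\EsriPress\Chapter9\Chapter9.gdb"  # hypothetical geodatabase

# Half-mile and one-mile rings; dissolving by distance ("ALL") yields one polygon
# per distance band, so the center circles and outer rings can be selected separately.
arcpy.analysis.MultipleRingBuffer(
    "Pools",           # input point features
    "Pools_Rings",     # output feature class
    [0.5, 1.0],        # ring distances
    "Miles",           # distance unit
    "ToPool",          # field storing each ring's distance
    "ALL",             # dissolve by ring distance
)
```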
During a budget crisis, Pittsburgh officials permanently closed 16 of 32 public swimming pools. You’ll estimate the number of youths, ages 5 to 17, living at different distances from the nearest swimming pool for all 32 pools versus the 16 that were kept open. Youths living within a half mile of the nearest open pool are considered to have good access to pools, while youths living from a half to one mile from the nearest pool are considered to have fair access. Youths living farther than one mile from the nearest pool are considered to have poor access (borrowing from the definition of food deserts). In tutorial 9-4, you’ll make more precise access estimates based on travel time across the street network of Pittsburgh from youth residences to the nearest pool.
Next, find good or fair youth access for pools that remained open. Select open pools using Select by Attributes with the criterion Open Is Equal to 1. Create the same multiple-ring buffers for open pools. Perform a spatial join of the block centroids to the new buffers, and get the new totals for good and fair access. You’ll find that 10,726 youths have good access and 20,450 have fair access, compared with 21,833 who had good access and 20,725 who had fair access when all pools were open. Why do you suppose that fair access remained so high? Maybe many residences that had good access now have only fair access, or maybe with all pools open a given residence tended to have access to more than one pool, whereas with pools closed there’s still fair access but to only one pool. In any event, now only 63.7 percent of the youths have good or fair access, compared with 87 percent when all pools were open. Save your project.
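As a quick arithmetic check, you can reproduce these percentages from the reported counts; the total youth population below is implied by the 87 percent figure rather than taken directly from the data.

```python
# Check of the access percentages, using the counts reported in the text.
good_open, fair_open = 10_726, 20_450   # 16 pools kept open
good_all, fair_all = 21_833, 20_725     # all 32 pools open

total_youths = (good_all + fair_all) / 0.87          # implied youth population, about 48,900
share_open = (good_open + fair_open) / total_youths
print(f"Good or fair access with 16 pools: {share_open:.1%}")  # about 63.7%
```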
Service areas are similar to buffers but are based on travel over a network, usually a street network dataset. If a point, say for a retail store, has a five-minute service area constructed using ArcGIS Pro’s Network Analysis tools, then anyone residing in the service area has, at most, a five-minute trip to the store. If you have permission from your instructor to use an ArcGIS Online service that consumes credits or otherwise have access to your own ArcGIS Online credits, you could use an ArcGIS Online network service, which would be much more accurate than the free, TIGER-based network datasets that you’ll use in this chapter. Nevertheless, you will use the PittsburghStreets_ND network dataset provided in Tutorial9-4.aprx so that your results match the tutorial results.
In this tutorial, you’ll use service areas to estimate a so-called “gravity model” of geography (also known as the “spatial interaction model”), which assumes that the farther apart two features are, the less attraction between them. The falloff in attraction with distance is often nonlinear and rapid, as in Newton’s gravity model for physical objects where the denominator of attraction is distance squared. The application of this tutorial is a continuation of the pool case study, based on a random sample of youths owning pool tags (which allow admission to any Pittsburgh public pool). To scale the random sample up to the full population of youths with pool tags, you will need to multiply estimates by 11.3. With service areas, you could use distance or travel time to estimate a gravity model. Here, you’ll use time (minutes).
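The scaling and use-rate arithmetic for one service-area ring can be written compactly, as in this sketch; the sampled counts and youth population shown are placeholders, not tutorial values.

```python
# Use rate for one service-area ring (placeholder inputs, not tutorial results).
SAMPLE_SCALE = 11.3   # scales the random sample of pool-tag holders to the full population

def use_rate(sampled_tag_holders: int, youth_population: int) -> float:
    """Estimated fraction of a ring's youths who hold pool tags."""
    return (sampled_tag_holders * SAMPLE_SCALE) / youth_population

print(use_rate(sampled_tag_holders=40, youth_population=2_500))  # 0.1808, about 18 percent
```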
This tutorial uses the following multiple-step workflow:
Note for the technically advanced student: You can make more accurate estimates of the average travel times for each service area ring of step 6 by using an origin–destination (OD) cost matrix calculated with a Network Analyst tool (search for “OD cost matrix analysis” in ArcGIS Pro help). You can configure this matrix to record the shortest-path distance or time between each demand point (origin) and supply point (destination). In the current problem, block centroids are origins and pools are destinations. Manipulating the OD matrix, say in Microsoft Excel, can yield exact averages (or medians) of travel times for the rings instead of assuming that the midpoints of the service area rings are the average travel times.
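If you export the OD cost matrix’s lines table, the manipulation can also be done in Python rather than Excel. This sketch assumes hypothetical column names (OriginID, DestinationID, Total_Minutes) and illustrative ring edges; adjust both to match your exported table and service-area breaks.

```python
# Average travel time to the nearest pool by ring, from an exported OD table.
import pandas as pd

od = pd.read_csv("od_cost_matrix.csv")   # hypothetical export of the OD lines table

# Shortest travel time from each block centroid (origin) to any pool (destination).
nearest = od.groupby("OriginID")["Total_Minutes"].min()

# Summarize within illustrative ring breaks instead of assuming ring midpoints.
rings = pd.cut(nearest, bins=[0, 2, 4, 6, 8, 10], include_lowest=True)
print(nearest.groupby(rings).agg(["mean", "median", "count"]))
```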
Next are steps 3 and 4 of the workflow—counting youths with pool tags and summing up youth population, all by service area polygons.
Next are the final steps, 5 and 6, of the workflow—calculating and plotting use rate.
It’s easy in ArcGIS Pro to make a scatterplot of Use Rate versus Average Time to visualize the estimated gravity model data points.
Suppose that you are an analyst for an organization that owns several facilities in a city, and you are asked to find the best locations for new facilities. The classic problem of this kind is to locate facilities for a chain of retail stores in an urban area, but other examples are federally qualified health centers (FQHCs) as studied in chapter 1 and the public swimming pools earlier in this chapter. In the most general case, your organization has existing facilities that will remain open, a set of competitor facilities, and a set of potential new locations from which you want to determine a best subset of a specified size. Another case is where a subset of existing facilities must be closed, as with the Pittsburgh swimming pools, for which you want to determine the appropriate 16 of 32 facilities to close. Yet another case is where there are no existing facilities, and you want to locate one new facility.
ArcGIS Pro’s Location-Allocation model in the Network Analysis collection of models handles these sorts of facility location problems. Inputs are a network dataset, locations of facility types (existing, competitor, and potential new sites), demand points, and a gravity model. Demand is represented by polygon centroids (blocks, block groups, tracts, ZIP Codes, and so on) for which you have data on the target population, generally from the US Census Bureau, such as youth population. Resistance to flow in the network, called impedance, is the distance or time traveled along shortest paths to facilities, and the gravity model describes how demand falls off with impedance. Several optimization models are available within the Location-Allocation model. You’ll use the Maximize Attendance model, which combines a gravity model (for which you supply parameter values) with a network optimization model that selects the specified number of new facility locations that maximize attendance.
In this tutorial, you’ll run a model to choose the best 16 out of 32 swimming pools to keep open using geographic access (distance from the nearest pool) as the criterion.
The Location-Allocation model is straightforward to use, but first you must fit a gravity model to the five points of the scatterplot at the end of tutorial 9-3. In other words, tutorial 9-3 produced data points, but the Location-Allocation model needs a gravity-model function fitted to those points as input. Three functional forms for a gravity model are available in the Location-Allocation model: linear, exponential, and power. The power form does not work well for cases with short travel times, such as the swimming pools (a few minutes), so it’s not discussed further here. The linear form is easy to understand and use and is based on an impedance cutoff (10 minutes for the swimming pools). It estimates that 100 percent of the target population living at a pool uses it (of course no one lives at a pool, but some live nearby, and nearly 100 percent of them use the pool), 75 percent uses the pool at a quarter of the cutoff (2.5 minutes), 50 percent at half the cutoff (5 minutes), 25 percent at three-quarters of the cutoff (7.5 minutes), and 0 percent at the cutoff or beyond (10 minutes or more). If C = cutoff in minutes and T = impedance in minutes, then as a percentage, Use Rate = 100 × (1 − T/C) for T between 0 and C, and 0 otherwise.
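The linear form can be checked with a few lines of Python; the function below simply encodes the formula just given.

```python
# Linear gravity (impedance) function: Use Rate = 100 * (1 - T/C) for 0 <= T < C, else 0.
def linear_use_rate(t_minutes: float, cutoff: float = 10.0) -> float:
    """Percentage of the target population expected to use the pool."""
    return 100.0 * (1.0 - t_minutes / cutoff) if 0 <= t_minutes < cutoff else 0.0

print([linear_use_rate(t) for t in (0, 2.5, 5, 7.5, 10)])  # [100.0, 75.0, 50.0, 25.0, 0.0]
```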
Exponential is the most applicable gravity model for the swimming pool case, because it declines rapidly as travel time increases, and it generalizes to other cases very well. The Microsoft Excel worksheet, Exponential.xlsx, available in Chapter9\Data, provides a method of fitting the exponential model to the results of tutorial 9-3.
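If you prefer Python to the Excel worksheet, the same kind of fit can be sketched with numpy by regressing the logarithm of use rate on average time; the five (time, use rate) points below are placeholders, so substitute the values you produced in tutorial 9-3.

```python
# Fit the exponential gravity model, use_rate = a * exp(-beta * t), by a log-linear
# regression. The data points are placeholders, not the tutorial's results.
import numpy as np

time_minutes = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
use_rate = np.array([0.20, 0.12, 0.07, 0.04, 0.02])

slope, intercept = np.polyfit(time_minutes, np.log(use_rate), 1)
beta, a = -slope, np.exp(intercept)
print(f"beta = {beta:.3f}, a = {a:.3f}")
```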
It takes a minute or so for ArcGIS Pro to load the 7,493 demand points.
Officials had closed half (eight) of the 16 pools in the model’s optimal solution. Remember, however, that officials had criteria in addition to geographic access for selecting pools to close or keep open, and they didn’t have GIS-based location analysis.
Remove the Location-Allocation group layer from the Contents pane, and set up a new Location-Allocation model to estimate pool use for the pools that officials left open. First, select open pools with the clause Open Is Equal to 1. Then start a new Location-Allocation model, and import Pools as facilities with FacilityType (under Property) set to Required. Then import Demand Points just as you did in the previous steps (Weight = Age_5_17, and Cutoff_Minutes = Cutoff_Minutes). Select Towards Facilities, and type 16 for the number of facilities. Set the problem type to Maximize Attendance and the function to exponential with 0.25 for beta. Run the model. Sum DemandWeight for Facilities to find 25,044 youths estimated to use the open pools. Dividing by 2 gives 12,522 as the best estimate of pool users, compared with 13,804 from the optimal solution, only 1,282 fewer (9.3 percent). Ultimately, although officials chose a much different set of pools to keep open than the model-based optimum, not much was given up in terms of potential users. That’s good news for youths living in Pittsburgh from a policy point of view.
If you have time, make one more model run, this time using all 32 pools. You’ll find that the model estimates 30,218 users, compared with 27,607 for the optimal solution and 25,044 for the remaining 16 pools. Why do 32 pools produce only a small gain over 16 pools? Probably because, with 32 pools, too many pools competed for the same users. So it appears that officials did not make a bad decision, provided that the network dataset is accurate enough and travel to pools is primarily by driving on the road network.
The goal of data mining is exploration: finding hidden structure of interest or value in large and complex datasets. Data clustering, a branch of data mining, finds clusters of data points that are close to each other but distant from the points of other clusters. If your data points were 2D and graphed as a scatterplot, it would be easy to draw boundaries around points and call them clusters; you’d perform as well as any clustering method. The problem arises when the points lie in more than three dimensions, where you can no longer see them, and that’s where clustering methods are valuable. In this tutorial, you’ll cluster crimes using four attributes (dimensions): the severity of the crime plus the race, age, and gender of the arrested person. A better attribute than race would be the poverty status of arrested persons, an underlying cause of criminal behavior, but police do not collect that data.
A limitation of clustering, however it is done, is that there is no way of knowing the “true” clusters in real data against which to compare the clusters a method finds. You take what clustering methods provide, and if you get information or ideas from cluster patterns in your data, you can confirm them or determine their predictability using additional models or other methods. Clustering is purely an exploratory method. There are no right or wrong clusters, only more or less informative ones.
This tutorial uses a simple method called k-means clustering, which partitions a dataset with n observations and p variables into k clusters. You’ll use a dataset with n = 303 observations, p = 4 variables for clustering, and k = 5 clusters desired. K-means is a heuristic algorithm, as are all clustering methods: a set of repeated steps that produces good, if not optimal, results. For this tutorial’s data, the algorithm starts with a procedure that selects five four-dimensional (4D) observations, called seeds, as initial centroids. Then each observation is assigned to its nearest centroid, on the basis of Euclidean (straight-line) distance in the 4D cluster variable space. Next, a new centroid is calculated for each cluster, and then all observations are reassigned to the nearest centroid. These steps are repeated until the cluster centroids do not move appreciably. So k-means clustering is a common-sense, mechanistic method.
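The steps just described can be written in a few lines of numpy. This is a bare-bones sketch for intuition, not the implementation inside ArcGIS Pro’s tool, and it assumes X is an n-by-p array of already-rescaled cluster variables.

```python
# Bare-bones k-means: seed, assign to nearest centroid, recompute, repeat.
import numpy as np

def k_means(X: np.ndarray, k: int = 5, iterations: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # initial seeds
    for _ in range(iterations):
        # Euclidean distance from every observation to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)                        # nearest-centroid assignment
        # Recompute centroids (sketch assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```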
Because it uses the distance between numerical points as its basis, k-means assumes that all attributes are equally important for clustering. To meet this assumption, all input attributes must be scaled to a similar magnitude and range. Generally, you can use standardized variables (for each variable, subtract its mean and divide by its standard deviation) to accomplish scale parity, but other ways of rescaling are acceptable, too. It is the range, or the relative distances between observations, that matters in k-means clustering. The data used in this tutorial includes numerical (interval or ratio), ordinal, and nominal class data. K-means clustering is intended for numerical data only because of its use of distance in cluster variable space. Nevertheless, with rescaling, it’s possible to get informative clusters when including nonnumerical data. Next, the discussion turns to the specific case at hand and how to rescale its attributes.
The data used in this tutorial is serious violent crimes from a summer in Pittsburgh, mapped as points. The crimes are ranked by severity using FBI hierarchy numbers (1 = murder, 2 = rape, 3 = robbery, and 4 = aggravated assault), with murder, of course, being the most serious. Clearly, the nature of crimes should be important for their clustering. So the first assumption you must make is that the distance between crime types, such as the distance of 3 between 1 for murder and 4 for aggravated assault (attempted or actual serious bodily harm to another person), is meaningful for clustering purposes. The criminal justice system agrees on the order, and for clustering, you can accept the “distances” or change them using your judgment. For this tutorial, you’ll leave them as listed here.
Next, consider the single numerical attribute available, the age of an arrested person. Crime is generally committed by young adults, tapering off at older ages. For the serious violent crimes studied here, age varies between 13 and 65 (a range of 52) with a mean of 29.7. Together with crime severity, age would dominate clustering because of its much greater range. The remedy is to standardize age, which then varies from −2.3 to 2.7 (a range of 5), whereas crime severity has a range of 3. So then both attributes are fairly equal in determining clusters.
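Standardization is a one-line calculation once the ages are in an array; the values below are illustrative, not the arrest data.

```python
# Standardize age: subtract the mean and divide by the standard deviation.
import numpy as np

age = np.array([13, 19, 24, 29, 35, 47, 65], dtype=float)   # illustrative ages
age_std = (age - age.mean()) / age.std()
print(age_std.round(2))   # now on a scale comparable to crime severity (1-4)
```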
Finally, there are two nominal attributes: race (black or white) and gender (male or female). These can be encoded as binary attributes: race (0 = white, 1 = black) and gender (0 = male, 1 = female). As a binary indicator variable, race has a mean, which is the fraction of arrestees who are black, and similarly, the mean of gender is the fraction of arrestees who are female. As encoded here, these variables would have somewhat lesser roles than the previous two, but not by much. If you wanted to increase the importance of the binary variables for clustering, you could encode them as (0, 2) or (0, 4) indicators. You’ll leave them as (0, 1) variables, which makes the interpretation of clustering results easier.
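In pandas, the binary encoding is a simple comparison; the column and category names below are assumptions about the arrest table.

```python
# Encode the nominal attributes as 0/1 indicators (column names are assumptions).
import pandas as pd

arrests = pd.DataFrame({"Race": ["White", "Black", "Black"],
                        "Gender": ["Male", "Female", "Male"]})
arrests["RaceBinary"] = (arrests["Race"] == "Black").astype(int)       # 0 = white, 1 = black
arrests["GenderBinary"] = (arrests["Gender"] == "Female").astype(int)  # 0 = male, 1 = female
print(arrests)
```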
One last point is that you must choose the number of clusters instead of having k-means clustering find an optimal number for you. That’s the case for most clustering methods. For the crime data, experimentation with three through six clusters resulted in five clusters being the most informative, so you’ll run with five clusters.
In summary, each observation is a 4D vector (crime, standardized age, gender, race); for example, (1, −0.364, 0, 0) is a murder committed by an arrested 25-year-old white male (age 25 standardizes to −0.364). The clusters found by k-means exist in the 4D space in which the observations lie. Each cluster is characterized by its centroid, with corresponding means and standard deviations of each cluster variable.
The k-means algorithm is in a geoprocessing tool called Grouping Analysis, which you’ll run next.
Each group (or cluster) has a font color matching its mapped points for SS_GROUP. Note that your groups may have different numbers and colors than those reported here, but the statistics and everything else will be the same. One issue is that the standardized age needs to be unstandardized. The mean of age is 30.3, and its standard deviation is 12.3. So for the Group 1 (blue) standardized value of −0.7542, calculate the unstandardized age as 30.3 + (−0.7542) × 12.3 = 20. Each group’s mean values are summarized as shown. It’s difficult to find terms to describe the mean crime hierarchy numbers, so the table’s range from least serious to most serious describes results in terms of harm done.
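Unstandardizing a reported group mean is the reverse of the earlier calculation; the helper below uses the age mean and standard deviation reported in the text.

```python
# Convert a group's standardized mean age back to years (mean and SD from the text).
AGE_MEAN, AGE_SD = 30.3, 12.3

def unstandardize_age(z: float) -> float:
    """Undo the (age - mean) / sd standardization for a reported group mean."""
    return AGE_MEAN + z * AGE_SD
```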
These results show moderately interesting patterns and one anomalous group, Group 2. With only three crimes in Group 2, you can’t rely on such a small result. Group 1 represents young black males committing a range of serious violent crimes. Group 3 is middle-aged persons of either race committing a range of crimes. Group 4 represents middle-aged persons of either race committing aggravated assaults (FBI hierarchy 4). Finally, Group 5, the largest, is composed mainly of young black persons, mostly committing aggravated assaults.
Next, you can see if there are any spatial patterns for these groups.
This chapter has five assignments to complete that you can download from this book’s resource web page, esri.com/gist1arcgispro: