When people first start to work with Cascading, one frequent question is “Where can I get large data sets to use for examples?” For great sources of data sets, look toward Open Data. Many governments at the city, state, and federal levels have initiatives to make much of their data openly available. Open Data gives a community greater visibility into how its government functions. The general idea is that people within the community—entrepreneurs, students, social groups, etc.—will find novel ways to leverage that data. In turn, the results of those efforts benefit the public good.
Here are some good examples of Open Data and other publicly available repositories:
The sample app discussed in this chapter was originally developed for a graduate engineering seminar at Carnegie Mellon University. The intent was to create an example Cascading app based on the Open Data initiative by the City of Palo Alto. Many thanks for help with this project go to Dr. Stuart Evans, CMU Distinguished Service Professor; Jonathan Reichental, CIO for the City of Palo Alto; and Diego May, CEO of Junar, the company that provided the data infrastructure for this initiative and many others.
Thinking about Palo Alto and its Open Data initiative, a few ideas come to mind. The city is generally quite a pleasant place: the weather is temperate, there are lots of parks with enormous trees, most of downtown is quite walkable, and it’s not particularly crowded. On a summer day in Palo Alto, one of the last things anybody really wants is to be stuck in an office on a long phone call. Instead people walk outside and take their calls, probably heading toward a favorite espresso bar or a popular frozen yogurt shop. On a hot summer day in Palo Alto, knowing a nice route to walk in the shade would be great. There must be a smartphone app for that—but as of late 2012, there wasn’t!
In this chapter, we’ll build the example Cascading workflow behind that smartphone app as a case study. The sample app, shown in both Java and Clojure, powers a mobile data API.
Imagine a mobile app that leverages the city’s municipal data to personalize recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.” This app shows the process of structuring data as a workflow, progressing from raw sources through successive refinements until we obtain the data products for that recommender. The results are personalized based on the neighborhoods where a person tends to walk.
To download source code, first connect to a directory on your computer where you have a few gigabytes of available disk space, and then use Git to clone the source code repo:
$ git clone git://github.com/Cascading/CoPA.git
Once that download completes, change into the newly cloned directory. Source code is shown in both Cascading (Java) and Cascalog (Clojure). We’ll work through the Cascalog example, and its source is located in the src/main/clj/copa/core.clj file.
The City of Palo Alto has its Open Data portal available online. It publishes a wide range of different data sets: budget history, census data, geographic information systems (GIS) as shown in Figure 8-1, building permits, utility consumption rates, street sweeping schedules, creek levels, etc.
For this app, we use parts of the GIS export—in particular the location data about trees and roads. Most governments track components of their infrastructure using GIS. ArcGIS is a popular software platform for that kind of work. Palo Alto exports its GIS data, which you can download from the portal on Junar. A copy is also included in the data/copa.csv file.
Take a look at one of the tree records in the GIS export:
$ cat data/copa.csv | grep "HAWTHORNE AV 22"
"Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl"
," Private: -1 Tree ID: 412 Street_Name: HAWTHORNE AV
Situs Number: 115 Tree Site: 1 Species: Liquidambar styraciflua
Source: davey tree Protected: Designated: Heritage:
Appraised Value: Hardscape: None Identifier: 474
Active Numeric: 1 Location Feature ID: 18583
Provisional: Install Date: "
,"37.446001565119,-122.167713417554,0.0 "
,"Point"
Clearly that is an example of unstructured data. Our next step is to structure those kinds of records into tuple streams that we can use in our workflow.
Looking at the source code located in the src/main/clj/copa/core.clj file, the first several lines define a Clojure namespace for importing required libraries:
(ns copa.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)]
        [date-clj])
  (:require [clojure.string :as s]
            [cascalog [ops :as c]]
            [clojure-csv.core :as csv]
            [geohash.core :as geo])
  (:gen-class))
Next, there are two functions that begin to parse and structure the raw data from the GIS export:
(def parse-csv
  "parse complex CSV format in the unclean GIS export"
  (comp first csv/parse-csv))

(defn load-gis
  "Parse GIS csv data"
  [in trap]
  (<- [?blurb ?misc ?geo ?kind]
      ((hfs-textline in) ?line)
      (parse-csv ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))
The GIS data is exported in comma-separated values (CSV) format. There are missing values and other errors in the export, so we need to handle the parsing specially. The load-gis function reads each line using the hfs-textline tap, then parses those lines into tuples using the csv/parse-csv Clojure library. A trap collects any data lines that are not formatted properly. In this case the trapped data does not contain much information, so we simply ignore it.
One side note about process: in data science work, we typically encounter an 80/20 rule, such that 80% of the time and costs go toward cleaning up the data, while 20% of the time and costs get spent on the science used to obtain actionable insights. Better tools and frameworks help to balance and reduce those costs. It’s true in this app that most of the code is needed for data preparation, while the recommender portion is only a few lines. Even so, Cascalog helps make that data preparation process relatively simple. Here we invoke the principle of “Specify what you require, not how to achieve it.” In just a few lines of Clojure, we state the requirement to derive four fields (blurb, misc, geo, kind) from the GIS export, and trap (discard) records that fail to follow that pattern.
Next we need to focus on structuring the tree data. Looking at the example record shown previously, the tree has several properties listed: a unique identifier (412), a street address (115 Hawthorne Av), a species name (Liquidambar styraciflua), etc., plus its geo coordinates. Our goal is to find a quiet shady spot in which to walk and take a cell phone call. We definitely know the location of each tree, but what can we determine about shade? Given the tree species, we could look up average height and use that as an estimator for shade. So the next step is to use a regular expression to parse the tree properties, such as address and species, from the misc field:
(defn re-seq-chunks [pattern s]
  (rest (first (re-seq pattern s))))

(def parse-tree
  "parses the special fields in the tree format"
  (partial re-seq-chunks
    #"^\s+Private\:\s+(\S+)\s+Tree ID\:\s+(\d+)\s+.*Situs Number\:\s+(\d+)\s+Tree Site\:\s+(\d+)\s+Species\:\s+(\S.*\S)\s+Source.*"))
Great, now we begin to have some structured data about trees:
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
We can use the species name to join with a table of tree species metadata and look up average height, along with inferring other valuable data. Take a look in the data/meta_tree.tsv file to see the metadata about trees, which was derived from Wikipedia.org, Calflora.org, USDA.gov, etc. The species Liquidambar styraciflua, commonly known as the American sweetgum, grows to a height between 20 and 35 meters.
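The get-trees subquery shown shortly calls an avg helper to collapse that range into a single height estimate per tree. A minimal sketch of the idea (the repo’s own definition may differ) looks like this:
;; minimal sketch of an `avg` helper; the CoPA repo defines its own
(defn avg [min-height max-height]
  (/ (+ min-height max-height) 2.0))

(avg 20.0 35.0)
;; => 27.5, the ?avg_height value shown below for tree 412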
The next section of code completes our definition of a data product about trees. The geo-tree function parses the geo coordinates: latitude, longitude, and altitude. The trees-fields definition lists the fields used to describe trees throughout the app; other fields get discarded. The get-trees function is the subquery used to filter, merge, and refine the estimators about trees.
(def geo-tree
  "parses geolocation for tree format"
  (partial re-seq-chunks #"^(\S+),(\S+),(\S+)\s*$"))

(def trees-fields
  ["?blurb" "?tree_id" "?situs" "?tree_site"
   "?species" "?wikipedia" "?calflora" "?avg_height"
   "?tree_lat" "?tree_lng" "?tree_alt" "?geohash"])

(defn get-trees [src tree-meta trap]
  "subquery to parse/filter the tree data"
  (<- trees-fields
      (src ?blurb ?misc ?geo ?kind)
      (re-matches #"^\s+Private.*Tree ID.*" ?misc)
      (parse-tree ?misc :> ?priv ?tree_id ?situs ?tree_site ?raw_species)
      ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
      (tree-meta ?species ?wikipedia ?calflora ?min_height ?max_height)
      (avg ?min_height ?max_height :> ?avg_height)
      (geo-tree ?geo :> ?tree_lat ?tree_lng ?tree_alt)
      ((c/each read-string) ?tree_lat ?tree_lng :> ?lat ?lng)
      (geo/encode ?lat ?lng geo-precision :> ?geohash)
      (:trap (hfs-textline trap))))
Note the call (re-matches #"^\s+Private.*Tree ID.*" ?misc) early in the subquery. This regular expression filters records about trees out of the GIS tuple stream, which creates a branch in the Cascading flow diagram. After calling parse-tree to get the tree properties from the raw data, we use ((c/comp s/trim s/lower-case) ?raw_species :> ?species) to normalize the species name; in other words, we force it to lowercase and strip surrounding whitespace so that it can be used in a join. The call to tree-meta performs that join. Next, the call to avg estimates the height for each tree. This is a rough approximation, but good enough to produce a reasonable “shade” metric.
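As a quick check, the same species normalization can be run at the REPL using plain clojure.core functions (a sketch outside Cascalog; the subquery composes the same string functions with Cascalog’s c/comp):
;; normalize a raw species string the same way the subquery does,
;; using clojure.core/comp in place of Cascalog's c/comp
((comp clojure.string/trim clojure.string/lower-case)
 "Liquidambar styraciflua ")
;; => "liquidambar styraciflua"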
The last few lines clean up the geolocation coordinates. First these coordinates are parsed, then converted from strings to decimal numbers. Then geo/encode uses the coordinates to create a “geohash” index. A geohash is a string that gives an approximate location. In this case, the six-digit geohash 9q9jh0 identifies a five-block square in which tree 412 is located. That’s a good enough approximation to join with other data about that location, later in the workflow. Finally, the fields defined in trees-fields for tree 412 get structured this way:
?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
?tree_id     412
?situs       115
?tree_site   1
?species     liquidambar styraciflua
?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
?avg_height  27.5
?tree_lat    37.446001565119
?tree_lng    -122.167713417554
?tree_alt    0.0
?geohash     9q9jh0
At this point we have a data product for trees. Figure 8-2 shows a conceptual flow diagram for the part of the workflow that structured this data.
Next we repeat many of the same steps for the road data. The GIS export is more complex for roads than for trees because the roads are described per block, with each block divided into segments. Effectively, there is a new segment recorded for every turn in the road. Road data also includes metrics about traffic rates, pavement age and type, etc. Our goal is to find a quiet shady spot in which to walk and take a cell phone call. So we can leverage the road data per segment in a couple of ways. Let’s create one estimator to describe how quiet each segment is based on comparing the traffic types and rates. Then we’ll create another estimator to describe the shade based on comparing how the pavement reflects sunlight.
(def roads-fields
  ["?road_name" "?bike_lane" "?bus_route" "?truck_route" "?albedo"
   "?road_lat" "?road_lng" "?road_alt" "?geohash"
   "?traffic_count" "?traffic_index" "?traffic_class"
   "?paving_length" "?paving_width" "?paving_area" "?surface_type"])

(defn get-roads [src road-meta trap]
  "subquery to parse/filter the road data"
  (<- roads-fields
      (src ?road_name ?misc ?geo ?kind)
      (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
      (parse-road ?misc :>
        ?traffic_count ?traffic_index ?traffic_class
        ?paving_length ?paving_width ?paving_area ?surface_type
        ?overlay_year_str ?bike_lane ?bus_route ?truck_route)
      (road-meta ?surface_type ?albedo_new ?albedo_worn)
      ((c/each read-string) ?overlay_year_str :> ?overlay_year)
      (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
      (bigram ?geo :> ?pt0 ?pt1)
      (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
      ;; why filter for min? because there are geo duplicates..
      ((c/each c/min) ?lat ?lng ?alt :> ?road_lat ?road_lng ?road_alt)
      (geo/encode ?road_lat ?road_lng geo-precision :> ?geohash)
      (:trap (hfs-textline trap))))
Similar to the processing for trees, the get-roads function is the subquery used to filter, merge, and refine the estimators about roads. The roads-fields definition lists the fields used to describe roads throughout the app; other fields get discarded. The regular expression (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc) filters records about roads out of the GIS tuple stream, creating a branch. We use some metadata about roads, in this case just to infer metrics about how the pavement reflects sunlight. As pavement ages, its albedo properties change. So we parse the surface_type and overlay_year, then call road-meta to join with the metadata, and from there estimate an albedo value that describes how much a road segment reflects sunlight. Note that there are some duplicates in the geo coordinates for road segments. We use (c/each c/min) to take the minimum value for each segment, reducing the segment list to unique values. Then we use geo/encode to create a six-digit geohash for each segment.
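The workflow calls an estimate-albedo function for this step, and its definition is not shown here. As a rough, hypothetical sketch of the idea, one could interpolate between the new and worn albedo values based on the age of the paving overlay:
;; hypothetical sketch only -- the actual estimate-albedo in the CoPA
;; repo may be defined differently
(defn estimate-albedo [overlay-year albedo-new albedo-worn]
  (let [age  (- 2012 overlay-year)              ; assumes the export is circa 2012
        wear (min 1.0 (max 0.0 (/ age 10.0)))]  ; assume ~10 years to wear fully
    (+ albedo-new (* wear (- albedo-worn albedo-new)))))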
Great—now we have another data product, for roads. Figure 8-3 shows a conceptual flow diagram for the part of the workflow that structured and enriched this data.
The following tuple shows the road segment located nearest to tree 412—note that the geohash matches, because they are within the same bounding box:
?blurb          Hawthorne Avenue from Alma Street to High Street
?traffic_count  3110
?traffic_class  local residential
?surface_type   asphalt concrete
?albedo         0.12
?min_lat        37.446140860599854
?min_lng        -122.1674652295435
?min_alt        0.0
?geohash        9q9jh0
A good next step is to use an analytics tool such as R to analyze and visualize the data about trees and roads. We do that step to perform calibration and testing of the data products so far. Take a look at the src/scripts/copa.R file, which is an R script to analyze tree and road data.
For example, Figure 8-4 shows a chart for the distribution of tree species in Palo Alto. American sweetgum (Liquidambar styraciflua) is the most common tree.
Also, there’s a density plot/bar chart of estimated tree heights, most of which are in the 10- to 30-meter range. Palo Alto is known for many tall eucalyptus and sequoia trees (the city name translates to “Tall Stick”), and these show up on the right side of the density plot—great for lots of shade. Overall, the distribution of trees shows a wide range of estimated heights, which helps confirm that our approximation is reasonable to use.
library(ggplot2)

dat_folder <- "~/src/concur/CoPA/out"
d <- read.table(file=paste(dat_folder, "tree/part-00000", sep="/"),
                sep="\t", quote="", na.strings="NULL",
                header=FALSE, encoding="UTF8")
colnames(d) <- c("blurb", "tree_id", "situs", "tree_site", "species",
                 "wikipedia", "calflora", "avg_height",
                 "tree_lat", "tree_lng", "tree_alt", "geohash")

# plot density for estimated tree heights
m <- ggplot(d, aes(x=avg_height))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y=..density.., fill=..count..)) + geom_density()

# which are the most popular trees?
t <- sort(table(d$species), decreasing=TRUE)
trees <- head(as.data.frame.table(t), n=20)
colnames(trees) <- c("species", "count")
trees
Looking at Figure 8-5 for analysis of the road data, most of the road segments are classified as local residential. There are also arteries and collectors (busy roads) plus truck routes that are likely to be noisier. We also see a distribution with a relatively long tail for traffic counts. Using traffic classes and traffic counts as estimators seems reasonable.
d <- read.table(file=paste(dat_folder, "road/part-00000", sep="/"),
                sep="\t", quote="", na.strings="NULL",
                header=FALSE, encoding="UTF8")
colnames(d) <- c("road_name", "bike_lane", "bus_route", "truck_route",
                 "albedo", "road_lat", "road_lng", "road_alt", "geohash",
                 "traffic_count", "traffic_index", "traffic_class",
                 "paving_length", "paving_width", "paving_area", "surface_type")

t <- sort(table(d$surface_type), decreasing=TRUE)
roads <- head(as.data.frame.table(t), n=20)
colnames(roads) <- c("surface_type", "count")
roads

summary(d$traffic_class)
t <- sort(table(d$traffic_class), decreasing=TRUE)
roads <- head(as.data.frame.table(t), n=20)
colnames(roads) <- c("traffic_class", "count")
roads

summary(d$traffic_count)
plot(ecdf(d$traffic_count))

m <- ggplot(d, aes(x=traffic_count))
m <- m + ggtitle("Traffic Count Density")
m + geom_histogram(aes(y=..density.., fill=..count..)) + geom_density()
Because we are working with GIS data, the attributes that tie together tree data, road data, and GPS tracks are obviously the geo coordinates: latitude, longitude, and altitude. Much of Palo Alto is relatively flat and not far above sea level, because it is close to San Francisco Bay, so to keep the code a bit simpler we can ignore altitude. However, we’ll need to perform large-scale joins and queries based on latitude and longitude. Those are problematic at scale: the coordinates are decimal values and the lookups are range queries, both of which make parallelization difficult. So we’ve used a geohash as an approximate location, a kind of bounding box: it combines the decimal values for latitude and longitude into a single string. That makes joins and queries much simpler and makes the app more reasonable to parallelize. Effectively we cut the entire map of Palo Alto into bounding boxes and then compute for each bounding box in parallel.
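As a minimal, plain-Clojure sketch of that bounding-box idea (not part of the workflow code), grouping records by their geohash lets each cell be evaluated independently, and in parallel:
;; group records by geohash, then evaluate each cell on its own;
;; a small stand-in for what the Cascalog joins do at scale
(defn per-cell
  [f records]                          ; records: maps that carry a :geohash key
  (into {}
        (pmap (fn [[gh cell]] [gh (f cell)])
              (group-by :geohash records))))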
There can be problems with this approach. For instance, what if the center of a road segment sits right on the boundary between two geohash squares? We might end up with joins that reference only half the trees near that road segment. There are a number of more interesting algorithms to use for spatial indexing; R-trees are one common approach. The general idea would be to join a given road segment with the trees in its bounding box plus the neighboring bounding boxes, then apply a better algorithm within those collections of data. The problem remains reasonably constrained and can still be parallelized.
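As a rough, hypothetical sketch of that widening step (none of these helpers appear in the CoPA source), a road segment could gather candidate trees from its own geohash cell plus its neighbors, with the neighbor lookup supplied by whatever geohash library is at hand:
;; hypothetical sketch: collect trees from a road segment's geohash cell
;; and the adjacent cells, so segments near a cell boundary are not missed
(defn candidate-trees
  [trees-by-geohash neighbors-fn road-geohash]
  ;; trees-by-geohash: map of geohash -> seq of tree records
  ;; neighbors-fn: assumed helper that returns the adjacent geohash cells
  (mapcat trees-by-geohash
          (cons road-geohash (neighbors-fn road-geohash))))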
In this sample app, we simply consider each geohash value as a kind of “bucket.” Imagine that all the data points that fall into the same bucket get evaluated together. Figure 8-6 shows how each block of a road is divided into road segments.
Our app analyzes each road segment as a data tuple, calculating a center point for each. We use a geohash value to construct a bounding box around that center point, then join the data to collect metrics for all the trees nearby, as Figure 8-7 shows.
The join occurs in the get-shade function, where both the roads and trees tuples reference the ?geohash field:
(defn tree-distance
  [tree_lat tree_lng road_lat road_lng]
  "calculates distance from a tree to the midpoint of a road segment"
  (let [y (- tree_lat road_lat)
        x (- tree_lng road_lng)]
    (Math/sqrt (+ (Math/pow y 2.0) (Math/pow x 2.0)))))

(defn get-shade
  [trees roads]
  "subquery to join the tree and road estimates, to maximize for shade"
  (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt
       ?road_metric ?tree_metric]
      ((select-fields roads ["?road_name" "?albedo" "?road_lat" "?road_lng"
                             "?road_alt" "?geohash" "?traffic_count"
                             "?traffic_class"])
       ?road_name ?albedo ?road_lat ?road_lng ?road_alt
       ?geohash ?traffic_count ?traffic_class)
      (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      ((select-fields trees ["?avg_height" "?tree_lat" "?tree_lng"
                             "?tree_alt" "?geohash"])
       ?height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (> ?height 2.0) ;; limit to trees which are higher than people
      (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      (<= ?distance 25.0) ;; one block radius (not in meters)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
This approach is inclusive, so we get more data than we need. Let’s filter out the trees that won’t contribute much shade. The call to (> ?height 2.0) limits the trees to those that are taller than people, i.e., those that actually provide shade. The tree-distance function calculates a distance-to-midpoint from each tree to the road segment’s center point; note that this distance is not in meters. The call to (<= ?distance 25.0) limits the trees to those within roughly a one-block radius. Together, these filters discard trees that are too small or too far away to provide shade.
The next step is a trick borrowed from physics. We calculate a sum of moments based on tree height and distance-to-midpoint, then use that as an estimator for shade. The dimensions of this calculation are not particularly important, so long as we get a distribution of estimator values to use for ranking. The R script in src/scripts/metrics.R shows some analysis and visualization of this sum of moments. Based on the median of its distribution, we use 200000.0 to scale the estimator, making its values simpler to understand and compare with other metrics.
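Outside of Cascalog, the same estimator can be written as a plain Clojure function. This is a sketch for clarity only, reusing the tree-distance function above and the thresholds and scaling constant from the subquery:
;; shade estimator for one road segment: sum of height/distance "moments"
;; over the nearby, tall-enough trees, scaled by the same 200000.0 constant
(defn shade-estimate
  [trees road-lat road-lng]            ; trees: maps with :height :lat :lng
  (/ (reduce + 0.0
             (for [{:keys [height lat lng]} trees
                   :let [d (tree-distance lat lng road-lat road-lng)]
                   :when (and (> height 2.0) (<= d 25.0))]
               (/ height d)))
     200000.0))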
The road-metric function calculates metrics for comparing road segments. We have three properties known about each road segment that can be used to create estimators:
(defn road-metric
  [traffic_class traffic_count albedo]
  "calculates a metric for comparing road segments"
  [[(condp = traffic_class
      "local residential"       1.0
      "local business district" 0.5
      0.0)
    (-> traffic_count (/ 200.0) (Math/log) (/ 5.0))
    (- 1.0 albedo)]])
First, the traffic class has two values, local residential and local business district, which represent reasonably quiet places to walk, while the other possible values are relatively busy and noisy. So we map the traffic_class labels to numeric values. Second, the traffic counts get scaled based on their distribution, again to make their values simpler to understand and compare with other metrics. Third, the albedo value needs a sign change but otherwise works directly as an estimator.
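For example, calling road-metric directly with numeric values taken from the road segment shown earlier reproduces the ?road_metric vector that appears in the sample tuple below:
(road-metric "local residential" 3110 0.12)
;; => [[1.0 0.5488121277250486 0.88]]
;;    quiet traffic class, scaled log of the traffic count, and 1 - albedo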
In practice, we might train a predictive model—such as a decision tree—to compare these estimators. That could help incorporate customer feedback, QA for the data, etc. Having three estimators to compare road segments—to rank the final results—works well enough for this example. The following tuple shows the resulting metrics for the road segment located nearest to tree 412:
?road_name    Hawthorne Avenue from Alma Street to High Street
?geohash      9q9jh0
?road_lat     37.446140860599854
?road_lng     -122.1674652295435
?road_alt     0.0
?road_metric  [1.0 0.5488121277250486 0.88]
?tree_metric  4.36321007861036
Figure 8-8 shows the conceptual flow diagram for merging the tree and road metrics to calculate estimators for each road segment.
The steps in our app so far have structured the Open Data (GIS export), merged it with curated metadata, then calculated metrics to use for ranking recommendations. That unit of work created data products about quiet shady spots in Palo Alto in which to walk and take a cell phone call. Given different data sources, the same approach could be used for GIS export from other cities. Of course the distribution of geohash values would change, but the business logic would remain the same. In other words, the same workflow could scale to include many different cities in parallel—potentially, even worldwide.
Our next step is to incorporate the machine data component, namely the log files collected from GPS tracks on smartphones. This data serves to personalize the app, selecting recommendations for the road segments nearest to where the app’s users tend to walk. For this example, we had people walk around Palo Alto with their iPhones recording GPS tracks. Then those files were downloaded and formatted as logs. The data/gps.csv file shows a sample. Each tuple has a timestamp (date), a unique identifier for the user (uuid), geo coordinates, plus measurements for movement at that point. The get-gps function is a Cascalog subquery that parses those logs:
(defn get-gps
  [gps_logs trap]
  "subquery to aggregate and rank GPS tracks per user"
  (<- [?uuid ?geohash ?gps_count ?recent_visit]
      (gps_logs
        ?date ?uuid ?gps_lat ?gps_lng ?alt
        ?speed ?heading ?elapsed ?distance)
      (read-string ?gps_lat :> ?lat)
      (read-string ?gps_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (c/count :> ?gps_count)
      (date-num ?date :> ?visit)
      (c/max ?visit :> ?recent_visit)))
The function calculates a geohash, then aggregates some of the other values to create estimators. For instance, the call to (c/count :> ?gps_count) counts the number of visits, per user, to the same location. That provides an estimator for the “popularity” of each location. The call to (c/max ?visit :> ?recent_visit) aggregates the timestamps, finding the most recent visit per user, per location. That provides an estimator for the “recency” of each location. Given the identifiers uuid and geohash, plus the two metrics gps_count and recent_visit, we can join the GPS data with the data product for road segments to apply a form of behavioral targeting.
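For clarity, here is a plain-Clojure sketch (outside Cascalog) of that per-user, per-location aggregation:
;; for each [uuid geohash] pair, count the visits and keep the latest timestamp
(defn summarize-tracks
  [points]                             ; points: maps with :uuid :geohash :visit
  (for [[[uuid geohash] pts] (group-by (juxt :uuid :geohash) points)]
    {:uuid uuid
     :geohash geohash
     :gps-count (count pts)
     :recent-visit (apply max (map :visit pts))}))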
Figure 8-9 shows the conceptual flow diagram for preparing the GPS tracks data.
The following data shows some of the results near our geohash 9q9jh0 example. Note how the 9q9 prefix identifies neighboring geohash values:
?uuid                             ?geohash  ?gps_count  ?recent_visit
342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
342ac6fd3f5f44c6b97724d618d587cf  9q9hv6    1           1972376691180
342ac6fd3f5f44c6b97724d618d587cf  9q9hv8    18          1972376691028
342ac6fd3f5f44c6b97724d618d587cf  9q9hv9    7           1972376691101
342ac6fd3f5f44c6b97724d618d587cf  9q9hvb    22          1972376691010
342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348
Great, now we have a data product about areas in Palo Alto that are known to be walkable. Over time, a production app might use this evidence to optimize the workflow.
The last part of this app is the actual recommender. As mentioned earlier, most of the code in the workflow is used for data preparation—recall the 80/20 rule about that. When it comes to the actual recommender, that’s just a few lines of code:
(defn get-reco
  [tracks shades]
  "subquery to recommend road segments based on GPS tracks"
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      (tracks :>> gps-fields)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))
Mostly this involves a join on the geohash fields, then collecting road segment metrics for each user based on the uuid field. Due to the sparseness of geo coordinates in practice, that join is likely to be efficient. For example, if the mobile app using this data gains millions of users, then the road segment data could be placed on the righthand side (RHS) of the join. Each geohash covers about a five-block square, which implies at most a few hundred road segments. That allows for a HashJoin, a replicated join that runs more efficiently in parallel at scale.
At this point, we have recommendations to feed into a data API for a mobile app. In other words, per uuid value we have a set of recommended road segments. Each road segment has metrics for aggregate tree shade, road reflection, traffic class, and traffic rate, in addition to the personalization metrics of recency and popularity for walking near that location.
Recommenders generally combine multiple signals, such as the six metrics we have for each road segment. Then they rank the metrics to personalize results. Some people might prefer recency of visit, others might prefer as little traffic as possible. By providing a tuple of those metrics to the end use case, the mobile app could allow people to adjust their own preferences. In the case of our earlier example, the recommender results nearest to tree 412 are as shown in Table 8-1.
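One possible way for the mobile app to apply such preferences, shown here as a hypothetical sketch that is not part of the CoPA source, is a weighted combination of the metrics for each recommended segment:
;; rank recommended segments by a weighted combination of their metrics;
;; choosing the weights (and their signs) is up to the app and the user
(defn rank-recos
  [weights recos]                      ; recos: maps with :tree-metric :road-metric
  (sort-by (fn [{:keys [tree-metric road-metric]}]
             (reduce + (map * weights (cons tree-metric road-metric))))
           > recos))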
Table 8-1. Example results from recommender
Label | Value
---|---
tree | 413 site 2
addr | 115 Hawthorne Ave
species | Liquidambar styraciflua
geohash | 9q9jh0
lat/lng | 37.446, -122.168
est. height | 23
shade metric | 4.363
traffic | local residential, light traffic
visit recency | 1972376952532
That spot happens to be a short walk away from my train stop. Two huge American sweetgum trees provide ample amounts of shade on a quiet block of Hawthorne Avenue, which is a great place to walk and take a phone call on a hot summer day in Palo Alto. (It’s also not far from a really great fro-yo shop.)
The build script in project.clj looks much like the build in Chapter 5:
(defproject cascading-copa "0.1.0-SNAPSHOT"
  :description "City of Palo Alto Open Data recommender in Cascalog"
  :url "https://github.com/Cascading/CoPA"
  :license {:name "Apache License, Version 2.0"
            :url "http://www.apache.org/licenses/LICENSE-2.0"
            :distribution :repo}
  :uberjar-name "copa.jar"
  :aot [copa.core]
  :main copa.core
  :min-lein-version "2.0.0"
  :source-paths ["src/main/clj"]
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [cascalog "1.10.1-SNAPSHOT"]
                 [cascalog-more-taps "0.3.1-SNAPSHOT"]
                 [clojure-csv/clojure-csv "2.0.0-alpha2"]
                 [org.clojars.sunng/geohash "1.0.1"]
                 [date-clj "1.0.1"]]
  :exclusions [org.clojure/clojure]
  :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
             :provided {:dependencies [[org.apache.hadoop/hadoop-core
                                        "0.20.2-dev"]]}})
To build this sample app from a command line, run Leiningen:
$ lein clean
$ lein uberjar
That builds a “fat jar” that includes all the libraries for the Cascalog app. Next, we clear any previous output directory (required by Hadoop), then run the app in standalone mode:
$ rm -rf out/
$ hadoop jar ./target/copa.jar \
    data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv \
    out/trap out/park out/tree out/road out/shade out/gps out/reco
The recommender results will be in partition files in the out/reco/ directory. A gist on GitHub shows building and running this app. If your results look similar, you should be good to go.
Alternatively, if you want to run this app on the Amazon AWS cloud, the steps are the same as for Example 3 in Scalding: Word Count with Customized Operations. First you’ll need to sign up for the EMR and S3 services, and also have your credentials set up in the local configuration—for example, in your ~/.aws_cred/ directory. Edit the emr.sh Bash script to use one of your S3 buckets, and then run that script from your command line.
This workflow illustrates some of the key points of building Enterprise data workflows:
Another important point is to consider what kinds of data sources were used and what value each contributed. This app shows how to combine three major categories of data:
Open Data practices are relatively recent and evolving rapidly. Ultimately these will include the process of curation, incorporating metadata and ontologies, to help make community uses simpler and more immediate.
Of course, there are plenty of criticisms about this app and ways in which it might be improved. We made assumptions about badly formatted data, simply throwing it away. Some of the tree species names have spelling errors or misclassifications that could be cleaned up and provided back to the City of Palo Alto to improve its GIS. Certainly there are more sophisticated ways to handle the geospatial work. Arguably, this app was intended as a base to build upon for student projects. The workflow can be extended to include more data sources and produce different kinds of recommendations.
As an example of extending the app, the data products could be even more valuable if there were estimators for ambient noise levels based on time and location. So how could we get that? This app infers noise from data about road segments: traffic classes, traffic rates. We could take it a step further and adjust the traffic rates using statistical models based on time of day, and perhaps infer from bus lines, train schedules, etc. We might be able to pull in data from other APIs, such as Google Maps. Thinking a bit more broadly, we might be able to purchase aggregate data from other sources, such as business security networks, where cameras have audio feeds. Or perhaps we could sample audio levels from mobile devices, in exchange for some kind of credits. Large telecoms use techniques like that to build their location services.
Some of the extensions that have been suggested so far include the following:
Quite a large number of data APIs are available that could be leveraged to extend this app:
The leverage for Open Data is about evolving feedback loops. This area represents a greenfield for new approaches, new data sources, and new use cases. Overall, the app shown here provides an interesting example to use for think-out-of-the-box exercises. Fork it on GitHub and show us a new twist.