When people first start to work with Cascading, one frequent question is “Where can I get large data sets to use for examples?” For great sources of data sets, look toward Open Data. Many governments at the city, state, and federal levels have initiatives to make much of their data openly available. Open Data gives a community greater visibility into how its government functions. The general idea is that people within the community—entrepreneurs, students, social groups, etc.—will find novel ways to leverage that data. In turn, the results of those efforts benefit the public good.
Here are some good examples of Open Data and other publicly available repositories:
The sample app discussed in this chapter was originally developed for a graduate engineering seminar at Carnegie Mellon University. The intent was to create an example Cascading app based on the Open Data initiative by the City of Palo Alto. Many thanks for help with this project go to Dr. Stuart Evans, CMU Distinguished Service Professor; Jonathan Reichental, CIO for the City of Palo Alto; and Diego May, CEO of Junar, the company that provided the data infrastructure for this initiative and many others.
Thinking about Palo Alto and its Open Data initiative, a few ideas come to mind. The city is generally quite a pleasant place: the weather is temperate, there are lots of parks with enormous trees, most of downtown is quite walkable, and it’s not particularly crowded. On a summer day in Palo Alto, one of the last things anybody really wants is to be stuck in an office on a long phone call. Instead people walk outside and take their calls, probably heading toward a favorite espresso bar or a popular frozen yogurt shop. On a hot summer day in Palo Alto, knowing a nice route to walk in the shade would be great. There must be a smartphone app for that—but as of late 2012, there wasn’t!
In this chapter, we’ll build the example Cascading workflow behind that smartphone app as a case study. The sample app, shown in both Java and Clojure, powers a mobile data API.
Imagine a mobile app that leverages the city’s municipal data to personalize recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.” This app shows the process of structuring data as a workflow, progressing from raw sources through successive refinements until we obtain the data products for that recommender. The results are personalized based on the neighborhoods where a person tends to walk.
To download source code, first connect to a directory on your computer where you have a few gigabytes of available disk space, and then use Git to clone the source code repo:
$ git clone git://github.com/Cascading/CoPA.git
Once that download completes, change into the newly cloned directory. Source code is shown in both Cascading (Java) and Cascalog (Clojure). We’ll work through the Cascalog example, and its source is located in the src/main/clj/copa/core.clj file.
The City of Palo Alto has its Open Data portal available online. It publishes a wide range of different data sets: budget history, census data, geographic information systems (GIS) as shown in Figure 8-1, building permits, utility consumption rates, street sweeping schedules, creek levels, etc.
For this app, we use parts of the GIS export—in particular the location data about trees and roads. Most governments track components of their infrastructure using GIS. ArcGIS is a popular software platform for that kind of work. Palo Alto exports its GIS data, which you can download from the portal on Junar. A copy is also included in the data/copa.csv file.
Take a look at one of the tree records in the GIS export:
$ cat data/copa.csv | grep "HAWTHORNE AV 22"
"Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl"
," Private: -1 Tree ID: 412 Street_Name: HAWTHORNE AV
Situs Number: 115 Tree Site: 1 Species: Liquidambar styraciflua
Source: davey tree Protected: Designated: Heritage:
Appraised Value: Hardscape: None Identifier: 474
Active Numeric: 1 Location Feature ID: 18583
Provisional: Install Date: "
,"37.446001565119,-122.167713417554,0.0 "
,"Point"
Clearly that is an example of unstructured data. Our next step is to structure those kinds of records into tuple streams that we can use in our workflow.
Looking at the source code located in the src/main/clj/copa/core.clj file, the first several lines define a Clojure namespace for importing required libraries:
(ns copa.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)]
        [date-clj])
  (:require [clojure.string :as s]
            [cascalog [ops :as c]]
            [clojure-csv.core :as csv]
            [geohash.core :as geo])
  (:gen-class))
Next, there are two functions that begin to parse and structure the raw data from the GIS export:
(def parse-csv
  "parse complex CSV format in the unclean GIS export"
  (comp first csv/parse-csv))

(defn load-gis
  "Parse GIS csv data"
  [in trap]
  (<- [?blurb ?misc ?geo ?kind]
      ((hfs-textline in) ?line)
      (parse-csv ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))
The GIS data is exported in comma-separated values (CSV) format. There are missing values and other errors in the export, so we need to handle the parsing specially. The load-gis function reads each line using the hfs-textline tap, then parses those lines into tuples using the csv/parse-csv Clojure library. A trap collects any data lines that are not formatted properly. In this case the trapped data does not contain much information, so we simply ignore it.
One side note about process: in data science work, we typically encounter an 80/20 rule, such that 80% of the time and costs go toward cleaning up the data, while 20% of the time and costs get spent on the science used to obtain actionable insights. Better tools and frameworks help to balance and reduce those costs. It’s true in this app that most of the code is needed for data preparation, while the recommender portion is only a few lines. Even so, Cascalog helps make that data preparation process relatively simple. Here we invoke the principle of “Specify what you require, not how to achieve it.” In just a few lines of Clojure, we state the requirement to derive four fields (blurb, misc, geo, kind) from the GIS export, and trap (discard) records that fail to follow that pattern.
Next we need to focus on structuring the tree data. Looking at the example record shown previously, the tree has several properties listed: a unique identifier (412), a street address (115 Hawthorne Av), a species name (Liquidambar styraciflua), etc., plus its geo coordinates. Our goal is to find a quiet shady spot in which to walk and take a cell phone call. We definitely know the location of each tree, but what can we determine about shade? Given the tree species, we could look up average height and use that as an estimator for shade. So the next step is to use a regular expression to parse the tree properties, such as address and species, from the misc field:
(defn re-seq-chunks [pattern s]
  (rest (first (re-seq pattern s))))

(def parse-tree
  "parses the special fields in the tree format"
  (partial re-seq-chunks
    #"^\s+Private\:\s+(\S+)\s+Tree ID\:\s+(\d+)\s+.*Situs Number\:\s+(\d+)\s+Tree Site\:\s+(\d+)\s+Species\:\s+(\S.*\S)\s+Source.*"))
Great, now we begin to have some structured data about trees:
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
We can use the species name to join with a table of tree species metadata and look up average height, along with inferring other valuable data. Take a look in the data/meta_tree.tsv file to see the metadata about trees, which was derived from Wikipedia.org, Calflora.org, USDA.gov, etc. The species Liquidambar styraciflua, commonly known as the American sweetgum, grows to a height between 20 and 35 meters.
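The get-trees subquery shown shortly calls an avg helper to collapse that range into a single height estimate per tree. A minimal sketch of the idea (the repo’s own definition may differ) looks like this:
;; minimal sketch of an `avg` helper; the CoPA repo defines its own
(defn avg [min-height max-height]
  (/ (+ min-height max-height) 2.0))

(avg 20.0 35.0)
;; => 27.5, the ?avg_height value shown below for tree 412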
The next section of code completes our definition of a data product about trees. The geo-tree function parses the geo coordinates: latitude, longitude, and altitude. The trees-fields definition lists the fields used to describe trees throughout the app; other fields get discarded. The get-trees function is the subquery used to filter, merge, and refine the estimators about trees.
(def geo-tree
  "parses geolocation for tree format"
  (partial re-seq-chunks #"^(\S+),(\S+),(\S+)\s*$"))

(def trees-fields
  ["?blurb" "?tree_id" "?situs" "?tree_site"
   "?species" "?wikipedia" "?calflora" "?avg_height"
   "?tree_lat" "?tree_lng" "?tree_alt" "?geohash"])

(defn get-trees [src tree-meta trap]
  "subquery to parse/filter the tree data"
  (<- trees-fields
      (src ?blurb ?misc ?geo ?kind)
      (re-matches #"^\s+Private.*Tree ID.*" ?misc)
      (parse-tree ?misc :> ?priv ?tree_id ?situs ?tree_site ?raw_species)
      ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
      (tree-meta ?species ?wikipedia ?calflora ?min_height ?max_height)
      (avg ?min_height ?max_height :> ?avg_height)
      (geo-tree ?geo :> ?tree_lat ?tree_lng ?tree_alt)
      ((c/each read-string) ?tree_lat ?tree_lng :> ?lat ?lng)
      (geo/encode ?lat ?lng geo-precision :> ?geohash)
      (:trap (hfs-textline trap))))
Note the call (re-matches #"^\s+Private.*Tree ID.*" ?misc) early in the subquery. This regular expression filters records about trees out of the GIS tuple stream, which creates a branch in the Cascading flow diagram. After calling parse-tree to get the tree properties from the raw data, we use ((c/comp s/trim s/lower-case) ?raw_species :> ?species) to normalize the species name; in other words, we force it to lowercase and strip surrounding whitespace so that it can be used in a join. The call to tree-meta performs that join. Next, the call to avg estimates the height for each tree. This is a rough approximation, but good enough to produce a reasonable “shade” metric.
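As a quick check, the same species normalization can be run at the REPL using plain clojure.core functions (a sketch outside Cascalog; the subquery composes the same string functions with Cascalog’s c/comp):
;; normalize a raw species string the same way the subquery does,
;; using clojure.core/comp in place of Cascalog's c/comp
((comp clojure.string/trim clojure.string/lower-case)
 "Liquidambar styraciflua ")
;; => "liquidambar styraciflua"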
The last few lines clean up the geolocation coordinates. First these coordinates are parsed, then converted from strings to decimal numbers. Then geo/encode uses the coordinates to create a “geohash” index. A geohash is a string that gives an approximate location. In this case, the six-digit geohash 9q9jh0 identifies a five-block square in which tree 412 is located. That’s a good enough approximation to join with other data about that location, later in the workflow. Finally, the fields defined in trees-fields for tree 412 get structured this way:
?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
?tree_id     412
?situs       115
?tree_site   1
?species     liquidambar styraciflua
?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
?avg_height  27.5
?tree_lat    37.446001565119
?tree_lng    -122.167713417554
?tree_alt    0.0
?geohash     9q9jh0
At this point we have a data product for trees. Figure 8-2 shows a conceptual flow diagram for the part of the workflow that structured this data.
Next we repeat many of the same steps for the road data. The GIS export is more complex for roads than for trees because the roads are described per block, with each block divided into segments. Effectively, there is a new segment recorded for every turn in the road. Road data also includes metrics about traffic rates, pavement age and type, etc. Our goal is to find a quiet shady spot in which to walk and take a cell phone call. So we can leverage the road data per segment in a couple of ways. Let’s create one estimator to describe how quiet each segment is based on comparing the traffic types and rates. Then we’ll create another estimator to describe the shade based on comparing how the pavement reflects sunlight.
(def roads-fields
  ["?road_name" "?bike_lane" "?bus_route" "?truck_route" "?albedo"
   "?road_lat" "?road_lng" "?road_alt" "?geohash"
   "?traffic_count" "?traffic_index" "?traffic_class"
   "?paving_length" "?paving_width" "?paving_area" "?surface_type"])

(defn get-roads [src road-meta trap]
  "subquery to parse/filter the road data"
  (<- roads-fields
      (src ?road_name ?misc ?geo ?kind)
      (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
      (parse-road ?misc :>
        ?traffic_count ?traffic_index ?traffic_class
        ?paving_length ?paving_width ?paving_area ?surface_type
        ?overlay_year_str ?bike_lane ?bus_route ?truck_route)
      (road-meta ?surface_type ?albedo_new ?albedo_worn)
      ((c/each read-string) ?overlay_year_str :> ?overlay_year)
      (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
      (bigram ?geo :> ?pt0 ?pt1)
      (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
      ;; why filter for min? because there are geo duplicates..
      ((c/each c/min) ?lat ?lng ?alt :> ?road_lat ?road_lng ?road_alt)
      (geo/encode ?road_lat ?road_lng geo-precision :> ?geohash)
      (:trap (hfs-textline trap))))
Similar to the processing for trees, the get-roads function is the subquery used to filter, merge, and refine the estimators about roads. The roads-fields definition lists the fields used to describe roads throughout the app; other fields get discarded. The regular expression (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc) filters records about roads out of the GIS tuple stream, creating a branch. We use some metadata about roads, in this case just to infer metrics about how the pavement reflects sunlight. As pavement ages, its albedo properties change. So we parse the surface_type and overlay_year, then call road-meta to join with the metadata, and from there estimate an albedo value that describes how much a road segment reflects sunlight. Note that there are some duplicates in the geo coordinates for road segments. We use (c/each c/min) to take the minimum value for each segment, reducing the segment list to unique values. Then we use geo/encode to create a six-digit geohash for each segment.
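The workflow calls an estimate-albedo function for this step, and its definition is not shown here. As a rough, hypothetical sketch of the idea, one could interpolate between the new and worn albedo values based on the age of the paving overlay:
;; hypothetical sketch only -- the actual estimate-albedo in the CoPA
;; repo may be defined differently
(defn estimate-albedo [overlay-year albedo-new albedo-worn]
  (let [age  (- 2012 overlay-year)              ; assumes the export is circa 2012
        wear (min 1.0 (max 0.0 (/ age 10.0)))]  ; assume ~10 years to wear fully
    (+ albedo-new (* wear (- albedo-worn albedo-new)))))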
Great—now we have another data product, for roads. Figure 8-3 shows a conceptual flow diagram for the part of the workflow that structured and enriched this data.
The following tuple shows the road segment located nearest to tree 412—note that the geohash matches, because they are within the same bounding box:
?blurb          Hawthorne Avenue from Alma Street to High Street
?traffic_count  3110
?traffic_class  local residential
?surface_type   asphalt concrete
?albedo         0.12
?min_lat        37.446140860599854
?min_lng        -122.1674652295435
?min_alt        0.0
?geohash        9q9jh0
A good next step is to use an analytics tool such as R to analyze and visualize the data about trees and roads. We do that step to perform calibration and testing of the data products so far. Take a look at the src/scripts/copa.R file, which is an R script to analyze tree and road data.
For example, Figure 8-4 shows a chart for the distribution of tree species in Palo Alto. American sweetgum (Liquidambar styraciflua) is the most common tree.
Also, there’s a density plot/bar chart of estimated tree heights, most of which are in the 10- to 30-meter range. Palo Alto is known for many tall eucalyptus and sequoia trees (the city name translates to “Tall Stick”), and these show up on the right side of the density plot—great for lots of shade. Overall, the distribution of trees shows a wide range of estimated heights, which helps confirm that our approximation is reasonable to use.
library(ggplot2)

dat_folder <- "~/src/concur/CoPA/out"
d <- read.table(file=paste(dat_folder, "tree/part-00000", sep="/"),
                sep="\t", quote="", na.strings="NULL",
                header=FALSE, encoding="UTF8")
colnames(d) <- c("blurb", "tree_id", "situs", "tree_site", "species",
                 "wikipedia", "calflora", "avg_height",
                 "tree_lat", "tree_lng", "tree_alt", "geohash")

# plot density for estimated tree heights
m <- ggplot(d, aes(x=avg_height))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y=..density.., fill=..count..)) + geom_density()

# which are the most popular trees?
t <- sort(table(d$species), decreasing=TRUE)
trees <- head(as.data.frame.table(t), n=20)
colnames(trees) <- c("species", "count")
trees
Looking at Figure 8-5 for analysis of the road data, most of the road segments are classified as local residential. There are also arteries and collectors (busy roads) plus truck routes that are likely to be noisier. We also see a distribution with a relatively long tail for traffic counts. Using traffic classes and traffic counts as estimators seems reasonable.
d <- read.table(file=paste(dat_folder, "road/part-00000", sep="/"),
                sep="\t", quote="", na.strings="NULL",
                header=FALSE, encoding="UTF8")
colnames(d) <- c("road_name", "bike_lane", "bus_route", "truck_route",
                 "albedo", "road_lat", "road_lng", "road_alt", "geohash",
                 "traffic_count", "traffic_index", "traffic_class",
                 "paving_length", "paving_width", "paving_area", "surface_type")

t <- sort(table(d$surface_type), decreasing=TRUE)
roads <- head(as.data.frame.table(t), n=20)
colnames(roads) <- c("surface_type", "count")
roads

summary(d$traffic_class)
t <- sort(table(d$traffic_class), decreasing=TRUE)
roads <- head(as.data.frame.table(t), n=20)
colnames(roads) <- c("traffic_class", "count")
roads

summary(d$traffic_count)
plot(ecdf(d$traffic_count))

m <- ggplot(d, aes(x=traffic_count))
m <- m + ggtitle("Traffic Count Density")
m + geom_histogram(aes(y=..density.., fill=..count..)) + geom_density()
Because we are working with GIS data, the attributes that tie together tree data, road data, and GPS tracks are obviously the geo coordinates: latitude, longitude, and altitude. Much of Palo Alto is relatively flat and not far above sea level, because it is close to San Francisco Bay, so to keep the code a bit simpler we can ignore altitude. However, we’ll need to perform large-scale joins and queries based on latitude and longitude. Those are problematic at scale: the coordinates are decimal values and the lookups are range queries, both of which make parallelization difficult. So we’ve used a geohash as an approximate location, a kind of bounding box: it combines the decimal values for latitude and longitude into a single string. That makes joins and queries much simpler and makes the app more reasonable to parallelize. Effectively we cut the entire map of Palo Alto into bounding boxes and then compute for each bounding box in parallel.
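As a minimal, plain-Clojure sketch of that bounding-box idea (not part of the workflow code), grouping records by their geohash lets each cell be evaluated independently, and in parallel:
;; group records by geohash, then evaluate each cell on its own;
;; a small stand-in for what the Cascalog joins do at scale
(defn per-cell
  [f records]                          ; records: maps that carry a :geohash key
  (into {}
        (pmap (fn [[gh cell]] [gh (f cell)])
              (group-by :geohash records))))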
There can be problems with this approach. For instance, what if the center of a road segment sits right on the boundary between two geohash squares? We might end up with joins that reference only half the trees near that road segment. There are a number of more interesting algorithms to use for spatial indexing; R-trees are one common approach. The general idea would be to join a given road segment with the trees in its bounding box plus the neighboring bounding boxes, then apply a better algorithm within those collections of data. The problem remains reasonably constrained and can still be parallelized.
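As a rough, hypothetical sketch of that widening step (none of these helpers appear in the CoPA source), a road segment could gather candidate trees from its own geohash cell plus its neighbors, with the neighbor lookup supplied by whatever geohash library is at hand:
;; hypothetical sketch: collect trees from a road segment's geohash cell
;; and the adjacent cells, so segments near a cell boundary are not missed
(defn candidate-trees
  [trees-by-geohash neighbors-fn road-geohash]
  ;; trees-by-geohash: map of geohash -> seq of tree records
  ;; neighbors-fn: assumed helper that returns the adjacent geohash cells
  (mapcat trees-by-geohash
          (cons road-geohash (neighbors-fn road-geohash))))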
In this sample app, we simply consider each geohash value as a kind of “bucket.” Imagine that all the data points that fall into the same bucket get evaluated together. Figure 8-6 shows how each block of a road is divided into road segments.
Our app analyzes each road segment as a data tuple, calculating a center point for each. We use a geohash value to construct a bounding box around that center point, then join the data to collect metrics for all the trees nearby, as Figure 8-7 shows.
The join occurs in the get-shade function, where both the roads and trees tuples reference the ?geohash field:
(defn tree-distance
  [tree_lat tree_lng road_lat road_lng]
  "calculates distance from a tree to the midpoint of a road segment"
  (let [y (- tree_lat road_lat)
        x (- tree_lng road_lng)]
    (Math/sqrt (+ (Math/pow y 2.0) (Math/pow x 2.0)))))

(defn get-shade
  [trees roads]
  "subquery to join the tree and road estimates, to maximize for shade"
  (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt
       ?road_metric ?tree_metric]
      ((select-fields roads ["?road_name" "?albedo" "?road_lat" "?road_lng"
                             "?road_alt" "?geohash" "?traffic_count"
                             "?traffic_class"])
       ?road_name ?albedo ?road_lat ?road_lng ?road_alt
       ?geohash ?traffic_count ?traffic_class)
      (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      ((select-fields trees ["?avg_height" "?tree_lat" "?tree_lng"
                             "?tree_alt" "?geohash"])
       ?height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (> ?height 2.0) ;; limit to trees which are higher than people
      (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      (<= ?distance 25.0) ;; one block radius (not in meters)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
This approach is inclusive, so we get more data than we need. Let’s filter out the trees that won’t contribute much shade. The call to (> ?height 2.0) limits the trees to those that are taller than people, i.e., those that actually provide shade. The tree-distance function calculates a distance-to-midpoint from each tree to the road segment’s center point; note that this distance is not in meters. The call to (<= ?distance 25.0) limits the trees to those within roughly a one-block radius. Together, these filters discard trees that are too small or too far away to provide shade.
The next step is a trick borrowed from physics. We calculate a sum of moments based on tree height and distance-to-midpoint, then use that as an estimator for shade. The dimensions of this calculation are not particularly important, so long as we get a distribution of estimator values to use for ranking. The R script in src/scripts/metrics.R shows some analysis and visualization of this sum of moments. Based on the median of its distribution, we use 200000.0 to scale the estimator, making its values simpler to understand and compare with other metrics.
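Outside of Cascalog, the same estimator can be written as a plain Clojure function. This is a sketch for clarity only, reusing the tree-distance function above and the thresholds and scaling constant from the subquery:
;; shade estimator for one road segment: sum of height/distance "moments"
;; over the nearby, tall-enough trees, scaled by the same 200000.0 constant
(defn shade-estimate
  [trees road-lat road-lng]            ; trees: maps with :height :lat :lng
  (/ (reduce + 0.0
             (for [{:keys [height lat lng]} trees
                   :let [d (tree-distance lat lng road-lat road-lng)]
                   :when (and (> height 2.0) (<= d 25.0))]
               (/ height d)))
     200000.0))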
The road-metric function calculates metrics for comparing road segments. We have three properties known about each road segment that can be used to create estimators:
(defn road-metric
  [traffic_class traffic_count albedo]
  "calculates a metric for comparing road segments"
  [[(condp = traffic_class
      "local residential"       1.0
      "local business district" 0.5
      0.0)
    (-> traffic_count (/ 200.0) (Math/log) (/ 5.0))
    (- 1.0 albedo)]])
First, the traffic class has two values, local residential and local business district, which represent reasonably quiet places to walk, while the other possible values are relatively busy and noisy. So we map the traffic_class labels to numeric values. Second, the traffic counts get scaled based on their distribution, again to make their values simpler to understand and compare with other metrics. Third, the albedo value needs a sign change but otherwise works directly as an estimator.
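For example, calling road-metric directly with numeric values taken from the road segment shown earlier reproduces the ?road_metric vector that appears in the sample tuple below:
(road-metric "local residential" 3110 0.12)
;; => [[1.0 0.5488121277250486 0.88]]
;;    quiet traffic class, scaled log of the traffic count, and 1 - albedo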
In practice, we might train a predictive model—such as a decision tree—to compare these estimators. That could help incorporate customer feedback, QA for the data, etc. Having three estimators to compare road segments—to rank the final results—works well enough for this example. The following tuple shows the resulting metrics for the road segment located nearest to tree 412:
?road_name    Hawthorne Avenue from Alma Street to High Street
?geohash      9q9jh0
?road_lat     37.446140860599854
?road_lng     -122.1674652295435
?road_alt     0.0
?road_metric  [1.0 0.5488121277250486 0.88]
?tree_metric  4.36321007861036
Figure 8-8 shows the conceptual flow diagram for merging the tree and road metrics to calculate estimators for each road segment.
The steps in our app so far have structured the Open Data (GIS export), merged it with curated metadata, then calculated metrics to use for ranking recommendations. That unit of work created data products about quiet shady spots in Palo Alto in which to walk and take a cell phone call. Given different data sources, the same approach could be used for GIS export from other cities. Of course the distribution of geohash values would change, but the business logic would remain the same. In other words, the same workflow could scale to include many different cities in parallel—potentially, even worldwide.
Our next step is to incorporate the machine data component, namely the log files collected from GPS tracks on smartphones. This data serves to personalize the app, selecting recommendations for the road segments nearest to where the app’s users tend to walk. For this example, we had people walk around Palo Alto with their iPhones recording GPS tracks. Then those files were downloaded and formatted as logs. The data/gps.csv file shows a sample. Each tuple has a timestamp (date), a unique identifier for the user (uuid), geo coordinates, plus measurements for movement at that point. The get-gps function is a Cascalog subquery that parses those logs:
(defn get-gps
  [gps_logs trap]
  "subquery to aggregate and rank GPS tracks per user"
  (<- [?uuid ?geohash ?gps_count ?recent_visit]
      (gps_logs
        ?date ?uuid ?gps_lat ?gps_lng ?alt
        ?speed ?heading ?elapsed ?distance)
      (read-string ?gps_lat :> ?lat)
      (read-string ?gps_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (c/count :> ?gps_count)
      (date-num ?date :> ?visit)
      (c/max ?visit :> ?recent_visit)))
The function calculates a geohash, then aggregates some of the other values to create estimators. For instance, the call to (c/count :> ?gps_count) counts the number of visits, per user, to the same location. That provides an estimator for the “popularity” of each location. The call to (c/max ?visit :> ?recent_visit) aggregates the timestamps, finding the most recent visit per user, per location. That provides an estimator for the “recency” of each location. Given the identifiers uuid and geohash, plus the two metrics gps_count and recent_visit, we can join the GPS data with the data product for road segments to apply a form of behavioral targeting.
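For clarity, here is a plain-Clojure sketch (outside Cascalog) of that per-user, per-location aggregation:
;; for each [uuid geohash] pair, count the visits and keep the latest timestamp
(defn summarize-tracks
  [points]                             ; points: maps with :uuid :geohash :visit
  (for [[[uuid geohash] pts] (group-by (juxt :uuid :geohash) points)]
    {:uuid uuid
     :geohash geohash
     :gps-count (count pts)
     :recent-visit (apply max (map :visit pts))}))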
Figure 8-9 shows the conceptual flow diagram for preparing the GPS tracks data.
The following data shows some of the results near our geohash 9q9jh0 example. Note how the 9q9 prefix identifies neighboring geohash values:
?uuid                             ?geohash  ?gps_count  ?recent_visit
342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
342ac6fd3f5f44c6b97724d618d587cf  9q9hv6    1           1972376691180
342ac6fd3f5f44c6b97724d618d587cf  9q9hv8    18          1972376691028
342ac6fd3f5f44c6b97724d618d587cf  9q9hv9    7           1972376691101
342ac6fd3f5f44c6b97724d618d587cf  9q9hvb    22          1972376691010
342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348
Great, now we have a data product about areas in Palo Alto that are known to be walkable. Over time, a production app might use this evidence to optimize the workflow.
The last part of this app is the actual recommender. As mentioned earlier, most of the code in the workflow is used for data preparation—recall the 80/20 rule about that. When it comes to the actual recommender, that’s just a few lines of code:
(defn get-reco
  [tracks shades]
  "subquery to recommend road segments based on GPS tracks"
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      (tracks :>> gps-fields)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))
Mostly this involves a join on the geohash fields, then collecting road segment metrics for each user based on the uuid field. Due to the sparseness of geo coordinates in practice, that join is likely to be efficient. For example, if the mobile app using this data gains millions of users, then the road segment data could be placed on the righthand side (RHS) of the join. Each geohash covers about a five-block square, which implies at most a few hundred road segments. That allows for a HashJoin, a replicated join that runs more efficiently in parallel at scale.
At this point, we have recommendations to feed into a data API for a mobile app. In other words, per uuid value we have a set of recommended road segments. Each road segment has metrics for aggregate tree shade, road reflection, traffic class, and traffic rate, in addition to the personalization metrics of recency and popularity for walking near that location.
Recommenders generally combine multiple signals, such as the six metrics we have for each road segment. Then they rank the metrics to personalize results. Some people might prefer recency of visit, others might prefer as little traffic as possible. By providing a tuple of those metrics to the end use case, the mobile app could allow people to adjust their own preferences. In the case of our earlier example, the recommender results nearest to tree 412 are as shown in Table 8-1.
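One possible way for the mobile app to apply such preferences, shown here as a hypothetical sketch that is not part of the CoPA source, is a weighted combination of the metrics for each recommended segment:
;; rank recommended segments by a weighted combination of their metrics;
;; choosing the weights (and their signs) is up to the app and the user
(defn rank-recos
  [weights recos]                      ; recos: maps with :tree-metric :road-metric
  (sort-by (fn [{:keys [tree-metric road-metric]}]
             (reduce + (map * weights (cons tree-metric road-metric))))
           > recos))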
Table 8-1. Example results from recommender
Label | Value
---|---
tree | 413 site 2
addr | 115 Hawthorne Ave
species | Liquidambar styraciflua
geohash | 9q9jh0
lat/lng | 37.446, -122.168
est. height | 23
shade metric | 4.363
traffic | local residential, light traffic
visit recency | 1972376952532
That spot happens to be a short walk away from my train stop. Two huge American sweetgum trees provide ample amounts of shade on a quiet block of Hawthorne Avenue, which is a great place to walk and take a phone call on a hot summer day in Palo Alto. (It’s also not far from a really great fro-yo shop.)
The build script in project.clj looks much like the build in Chapter 5:
(defproject cascading-copa "0.1.0-SNAPSHOT"
  :description "City of Palo Alto Open Data recommender in Cascalog"
  :url "https://github.com/Cascading/CoPA"
  :license {:name "Apache License, Version 2.0"
            :url "http://www.apache.org/licenses/LICENSE-2.0"
            :distribution :repo}
  :uberjar-name "copa.jar"
  :aot [copa.core]
  :main copa.core
  :min-lein-version "2.0.0"
  :source-paths ["src/main/clj"]
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [cascalog "1.10.1-SNAPSHOT"]
                 [cascalog-more-taps "0.3.1-SNAPSHOT"]
                 [clojure-csv/clojure-csv "2.0.0-alpha2"]
                 [org.clojars.sunng/geohash "1.0.1"]
                 [date-clj "1.0.1"]]
  :exclusions [org.clojure/clojure]
  :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
             :provided {:dependencies [[org.apache.hadoop/hadoop-core
                                        "0.20.2-dev"]]}})
To build this sample app from a command line, run Leiningen:
$ lein clean
$ lein uberjar
That builds a “fat jar” that includes all the libraries for the Cascalog app. Next, we clear any previous output directory (required by Hadoop), then run the app in standalone mode:
$ rm -rf out/
$ hadoop jar ./target/copa.jar \
    data/copa.csv data/meta_tree.tsv data/meta_road.tsv data/gps.csv \
    out/trap out/park out/tree out/road out/shade out/gps out/reco
The recommender results will be in partition files in the out/reco/ directory. A gist on GitHub shows building and running this app. If your results look similar, you should be good to go.
Alternatively, if you want to run this app on the Amazon AWS cloud, the steps are the same as for Example 3 in Scalding: Word Count with Customized Operations. First you’ll need to sign up for the EMR and S3 services, and also have your credentials set up in the local configuration—for example, in your ~/.aws_cred/ directory. Edit the emr.sh Bash script to use one of your S3 buckets, and then run that script from your command line.
This workflow illustrates some of the key points of building Enterprise data workflows:
Another important point is to consider what kinds of data sources were used and what value each contributed. This app shows how to combine three major categories of data:
Open Data practices are relatively recent and evolving rapidly. Ultimately these will include the process of curation, incorporating metadata and ontologies, to help make community uses simpler and more immediate.
Of course, there are plenty of criticisms about this app and ways in which it might be improved. We made assumptions about badly formatted data, simply throwing it away. Some of the tree species names have spelling errors or misclassifications that could be cleaned up and provided back to the City of Palo Alto to improve its GIS. Certainly there are more sophisticated ways to handle the geospatial work. Arguably, this app was intended as a base to build upon for student projects. The workflow can be extended to include more data sources and produce different kinds of recommendations.
As an example of extending the app, the data products could be even more valuable if there were estimators for ambient noise levels based on time and location. So how could we get that? This app infers noise from data about road segments: traffic classes, traffic rates. We could take it a step further and adjust the traffic rates using statistical models based on time of day, and perhaps infer from bus lines, train schedules, etc. We might be able to pull in data from other APIs, such as Google Maps. Thinking a bit more broadly, we might be able to purchase aggregate data from other sources, such as business security networks, where cameras have audio feeds. Or perhaps we could sample audio levels from mobile devices, in exchange for some kind of credits. Large telecoms use techniques like that to build their location services.
Some of the extensions that have been suggested so far include the following:
Quite a large number of data APIs are available that could be leveraged to extend this app:
The leverage for Open Data is about evolving feedback loops. This area represents a greenfield for new approaches, new data sources, and new use cases. Overall, the app shown here provides an interesting example to use for think-out-of-the-box exercises. Fork it on GitHub and show us a new twist.