CHAPTER 9

BIG DATA QUALITY

This chapter includes contributions from David Borean (InfoTrellis), Roger Rea (IBM), Randy Schnier (IBM), and Brian Williams (IBM).

Data quality management is a discipline that includes the methods to measure, improve, and certify the quality and integrity of an organization’s data. Because of the extreme volume, velocity, and variety of big data, its quality needs to be handled differently than in traditional information governance programs. Table 9.1 contrasts traditional and big data quality programs.

Given the differences highlighted in Table 9.1, a big data quality program has to take a somewhat different approach than traditional projects while still adhering to key best practices. Specifically, the big data governance program should adopt the following practices to address data quality issues:

9.1 Work with business stakeholders to establish and measure confidence intervals for the quality of big data.

9.2 Leverage semi-structured and unstructured data to improve the quality of sparsely populated structured data.

9.3 Use streaming analytics to address data quality issues in-memory without landing interim results to disk.

9.4 Appoint data stewards accountable to the information governance council for improving the metrics over time.

Each of these sub-steps is discussed in detail in the rest of this chapter.

Table 9.1: Traditional versus Big Data Quality Programs
Dimension | Traditional Data Quality | Big Data Quality
Frequency of processing | Processing is batch-oriented. | Processing is both real-time and batch-oriented.
Variety of data | Data format is largely structured. | Data format may be structured, semi-structured, or unstructured.
Confidence levels | Data needs to be in pristine condition for analytics in the data warehouse. | “Noise” needs to be filtered out, but data needs to be “good enough.” Poor data quality might or might not impede analytics to glean business insights.
Timing of data cleansing | Data is cleansed prior to loading into the data warehouse. | Data may be loaded as is because the critical data elements and relationships might not be fully understood. The volume and velocity of data might require streaming, in-memory analytics to cleanse data, thus reducing storage requirements.
Critical data elements | Data quality is assessed for critical data elements such as customer address. | Data may be quasi- or ill-defined and subject to further exploration, hence critical data elements may change iteratively.
Location of analysis | Data moves to the data quality and analytics engines. | Data quality and analytics engines may move to the data, to ensure an acceptable processing speed.
Stewardship | Stewards can manage a high percentage of the data. | Stewards can manage a smaller percentage of data, due to high volumes and/or velocity.

9.1 Work with Business Stakeholders to Establish and Measure Confidence Intervals for the Quality of Big Data

The big data governance program needs to identify the critical data elements that drive the success of the program. In a traditional business intelligence project, data quality is addressed upfront: the team can profile the structured fields in an internal data source and resolve any quality issues it discovers before the data is consumed. A big data initiative, by contrast, might have to deal with data quality challenges on the fly. In addition, big data projects often rely on external data, which makes it difficult to discover quality issues in a timely manner. Notwithstanding all of this, it is still possible to use historical experience to describe the types of issues to expect with the quality of external big data.

We’ll discuss Twitter data in the rest of this section. Consider a simple Tweet:

Best day ever. Went mountain biking today. Will get one for my husband too.

The business intelligence team can likely infer the following attributes about the person who Tweeted this message, with a high degree of confidence:

• The person is married, and the spouse is male (“my husband”).
• The person went mountain biking and intends to buy a mountain bike for the spouse.
• The sentiment of the Tweet is positive (“Best day ever”).

The team should be able to get more insight as this person generates additional Tweets.

The big data governance program should collaborate with business stakeholders to develop a quality requirements matrix. The matrix identifies the critical data elements, data quality concerns, and business rules. Business stakeholders must be involved in this activity to provide a level of specificity on how the critical data elements are actually being used. In many cases, there might be more than one consumer for the same data. Case Study 9.1 describes the issues relating to Twitter data quality at a hypothetical company called Acme Corporation.

Case Study 9.1: Twitter data quality at Acme Corporation

Table 9.2 provides a sample quality requirements matrix for Twitter data at a hypothetical company called Acme Corporation. The measurement algorithm is greatly simplified and is for illustration purposes only.

As with traditional information intensive projects, the quality of big data needs to be “fit for purpose.” The acceptable threshold of big data quality depends on how it is to be used in analytics and operational processes. Acme also established confidence intervals for Twitter data, shown in Table 9.3.

Table 9.2: Sample Quality Requirements Matrix for Twitter Data at Acme Corporation
Critical data element: Tweet timestamp
Data quality concerns:
• The format of timestamps in Tweets is inconsistent with the standard format, which can cause issues in analytical queries when joining to other datasets.
Data quality business rules:
• Reformat all timestamps to YYYY-MM-DD HH:MM:SS.

Critical data element: User name
Data quality concerns:
• The user name from the profile is not the person’s real name in 40 to 50 percent of Twitter accounts. This user name is a useful element in matching the Tweet to a customer record in MDM. Note that we are referring to the name from the user profile, as opposed to the Twitter handle (screen_name).
• How representative is the data for reputation analysis?
Data quality business rules:
• If non-name characters such as numbers and symbols exist in the user name, then the confidence level is zero percent.
• If the user name is one word, such as a first or last name, then the confidence level is 25 percent.
• If the user name is two or three words, then the confidence level is 50 percent.
• If the words are verified as people’s names using a name library, then the confidence level is 99 percent.
Note: The rules for matching Tweets to internal customer or product MDM records are beyond the scope of this discussion.

Critical data element: References to Acme Corporation
Data quality concerns:
• Is the Tweet about Acme Corporation, or is it noise that needs to be filtered out?
Data quality business rules:
• If the Tweet contains “@Acme,” then the confidence level is 99 percent.
• If the Tweet contains “Acme” and Acme product names, then the confidence level is 75 percent.
• If the Tweet is on the ignore list, then the confidence level is zero percent.

Critical data element: Location
Data quality concerns:
• Location data from Tweets is important. It can be used to answer questions such as, “Are users in the southeastern United States more likely to be disgruntled than those in other parts of the country?”
• Although the user profile often contains city and state information, it is not validated by Twitter and might not contain valid information.
Data quality business rules:
• Extract the Tweet.user.location string from the Tweet metadata and validate the city and state names. If the location is not recognizable, then flag it as unknown.
Table 9.3: Confidence Intervals for Twitter Data
Category | Confidence Level
Very high | 90% - 99%
High | 80% - 89%
Medium | 70% - 79%
Low | 69% or below
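
To make the measurement algorithm concrete, here is a minimal Python sketch of the user name rules from Table 9.2 and the category mapping from Table 9.3. The name library, function names, and sample names are hypothetical placeholders, and the rules are the simplified, illustrative ones above rather than a production matching algorithm.

import re

# Hypothetical name library; in practice, this would be a licensed name dictionary.
KNOWN_NAMES = {"jane", "doe", "john", "smith"}

def user_name_confidence(user_name):
    """Score a Twitter profile name using the illustrative rules in Table 9.2."""
    words = user_name.strip().split()
    # Rule: non-name characters such as numbers and symbols -> zero percent.
    if re.search(r"[^A-Za-z\s'\-]", user_name):
        return 0
    # Rule: all words verified against the name library -> 99 percent.
    if words and all(w.lower() in KNOWN_NAMES for w in words):
        return 99
    # Rule: a single word (first or last name only) -> 25 percent.
    if len(words) == 1:
        return 25
    # Rule: two or three words -> 50 percent.
    if len(words) in (2, 3):
        return 50
    return 0

def confidence_category(level):
    """Map a confidence level to the categories in Table 9.3."""
    if level >= 90:
        return "Very high"
    if level >= 80:
        return "High"
    if level >= 70:
        return "Medium"
    return "Low"

level = user_name_confidence("Jane Doe")
print(level, confidence_category(level))  # -> 99 Very high (both words are in the hypothetical name library)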

In this example, the critical data elements for Twitter data are the Tweet timestamp, the user name, references to Acme Corporation, and the location, as detailed in Table 9.2.

Case Study 9.2 describes the challenges faced by the social listening department at a high-end retailer.

Case Study 9.2: The social listening department at a high-end retailer

The social listening department at a high-end retailer had to address senior management’s concerns about whether Twitter users were in a different demographic from their customers, who were primarily female and older than 30 years. The social listening department conducted marketing surveys and found, to their surprise, that the demographics of their Twitter users were actually very similar to those of their traditional customers. Armed with the survey data, the social listening department was able to get more attention and budget from senior management.

9.2 Leverage Semi-Structured and Unstructured Data to Improve the Quality of Sparsely Populated Structured Data

The big data governance program might be faced with situations where the structured data is sparsely populated. In that case, the big data governance team might leverage semi-structured and unstructured sources to improve data quality. Case Study 18.1 in chapter 18 provides an example of this, in which the analytics team at a hospital system used unstructured data from electronic medical records, patient physicals, physicians’ notes, and discharge summaries to obtain better insight into smoking status, drug and alcohol abuse, and residence in an assisted living facility. The analytics team used this data to predict the likelihood of a patient being readmitted to a hospital within 30 days of treatment for congestive heart failure.
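
As a hedged illustration of this technique (not the hospital’s actual method), assume the physicians’ notes are available as plain text; a simple keyword rule could then populate a sparsely filled smoking-status field. A production system would use a clinical natural language processing engine, and the note text and phrases below are hypothetical.

import re

# Hypothetical note text; a real system would read discharge summaries or physicians' notes.
note = "Patient is a 62-year-old male. He denies alcohol use. Former smoker, quit 5 years ago."

def infer_smoking_status(text):
    """Derive a structured smoking-status value from free-text clinical notes (illustrative only)."""
    t = text.lower()
    if re.search(r"\b(non-?smoker|never smoked|denies smoking)\b", t):
        return "never"
    if re.search(r"\b(former smoker|quit smoking|ex-?smoker)\b", t):
        return "former"
    if re.search(r"\b(current smoker|smokes|tobacco use)\b", t):
        return "current"
    return "unknown"  # leave the structured field unpopulated rather than guess

print(infer_smoking_status(note))  # -> "former"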

9.3 Use Streaming Analytics to Address Data Quality Issues In-Memory Without Landing Interim Results to Disk

Streaming analytics can analyze high volumes of data in real time without landing interim results to disk. Streaming applications need to consider two aspects of the underlying data: temporal alignment and the rate of arrival.

Before building a streaming application, big data teams need to understand the characteristics of the data. Taking the Twitter example in Case Study 9.1, the big data team would use the Twitter API to download a sample set of Tweets for further analysis. The profiling of streaming data is similar to that in traditional data projects: both need to understand the characteristics of the underlying data, such as the frequency of null values. However, the profiling of streaming data also needs to consider two additional aspects of the source data (a brief profiling sketch follows the list below):

  1. Temporal alignment—Streaming applications need to discover the temporal offset when joining, correlating, and matching data from different sources. For example, a streaming application that needs to combine data from two sensors needs to know that one sensor generates events every second, while the other generates events every three seconds.
  2. Rate of arrival—Streaming applications need to understand the rate of arrival of data:
    • Does the data arrive continuously?
    • Are there bursts in the data?
    • Are there gaps in the arrival of data?
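
As a minimal sketch of this kind of profiling, assuming event timestamps have already been extracted from a downloaded sample of the feed, the following Python fragment summarizes the rate of arrival (continuous flow, bursts, and gaps) before any streaming application is built. The sample timestamps are hypothetical.

from datetime import datetime

# Hypothetical sample of event arrival timestamps taken from the source feed.
arrivals = [
    "2012-06-01 09:00:00", "2012-06-01 09:00:01", "2012-06-01 09:00:02",
    "2012-06-01 09:00:05", "2012-06-01 09:00:05", "2012-06-01 09:00:12",
]
times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in arrivals]

# Inter-arrival gaps in seconds reveal whether data arrives continuously,
# in bursts (many zero-second gaps), or with gaps (large values).
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]

print("average gap: %.1f seconds" % (sum(gaps) / len(gaps)))
print("largest gap: %.0f seconds" % max(gaps))
print("bursts (events in the same second): %d" % gaps.count(0.0))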

This section includes three case studies that describe the use of streaming applications. Case Study 9.3 describes a hypothetical streaming application that monitors sensor data in a school building.

Case Study 9.3: A streaming application that monitors sensor data in a school building

A simple streaming application correlates data from motion and temperature sensors in a school building. The big data team has profiled the data to understand its characteristics. Data from the motion sensors arrives every 30 seconds, while data from the temperature sensors arrives every 60 seconds. The streaming application uses this information to conduct a temporal alignment of the motion and temperature sensor data that arrive at different intervals. It accomplishes this by creating a window during which it holds both types of sensor events in memory, so that it can match the two streams of data.
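
A minimal sketch of that temporal alignment, assuming each reading has already been parsed into a (seconds, room, value) tuple; a production system would use a streaming engine’s windowing operators rather than plain Python, and the sample readings are hypothetical.

# Motion events arrive every 30 seconds, temperature events every 60 seconds.
# Hold both streams in a window and pair each temperature reading with the
# most recent motion reading for the same room that falls inside the window.

WINDOW_SECONDS = 60  # window size chosen to cover the slower (60-second) stream

motion_events = [(0, "A", True), (30, "A", False), (60, "A", True)]   # (seconds, room, motion detected)
temperature_events = [(0, "A", 75.0), (60, "A", 76.0)]                # (seconds, room, degrees F)

def align(temps, motions, window):
    """Join the two streams on room within a time window (illustrative, batch-style)."""
    joined = []
    for t_time, t_room, temp in temps:
        candidates = [m for m in motions
                      if m[1] == t_room and 0 <= t_time - m[0] <= window]
        if candidates:
            m_time, _, moving = max(candidates)  # most recent motion reading in the window
            joined.append((t_room, t_time, temp, moving))
    return joined

print(align(temperature_events, motion_events, WINDOW_SECONDS))
# -> [('A', 0, 75.0, True), ('A', 60, 76.0, True)]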

The streaming application also uses reference data indicating that room A is a classroom and room B is the boiler room. The streaming application stores temperature data in Hadoop every 10 minutes. The analytics team has used Hadoop to build a normalized model of expected temperatures; for example, the boiler room averages 65 degrees at 3:00 a.m., and the classroom averages 75 degrees at 9:00 a.m. (Hadoop is discussed in detail in chapter 21, on big data reference architecture.)

Finally, the streaming application uses the available data to generate alerts, such as when sensor data does not arrive for five minutes, the temperature of the boiler room rises to 75 degrees at 3:00 a.m., or the motion sensor detects movement in the classroom at 5:00 a.m.
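
The alerting logic itself can be expressed as a few simple rules over the aligned readings, as in the hedged sketch below; the thresholds restate the examples above, and the baseline lookup stands in for the normalized model built in Hadoop.

# Hypothetical baseline from the normalized model: (room, hour) -> expected degrees F.
BASELINE = {("boiler", 3): 65.0, ("classroom", 9): 75.0}

def alerts(room, hour, temperature, seconds_since_last_event, motion_detected):
    """Return the alert conditions from Case Study 9.3 that apply to one aligned reading."""
    raised = []
    if seconds_since_last_event >= 300:                         # no sensor data for five minutes
        raised.append("sensor silent for 5 minutes")
    expected = BASELINE.get((room, hour))
    if expected is not None and temperature - expected >= 10:   # e.g., boiler room at 75 degrees at 3:00 a.m.
        raised.append("temperature well above baseline")
    if room == "classroom" and motion_detected and hour == 5:   # movement in the classroom at 5:00 a.m.
        raised.append("unexpected motion")
    return raised

print(alerts("boiler", 3, 75.0, 60, False))  # -> ['temperature well above baseline']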

Case Study 9.4 describes the usage of streaming technologies to monitor real-time network performance at a large wireless telecommunications provider.

Case Study 9.4: The use of streaming technologies to monitor real-time network performance at a wireless telecommunications provider

A large wireless telecommunications provider served customers in multiple metropolitan areas across the country. The provider needed to analyze call detail records (CDRs), Internet protocol detail records (IPDRs) for web usage, and SMS detail records (collectively referred to as xDRs) in real time to troubleshoot poorly performing cells.

The provider wanted to use this real-time information to accomplish the following business objectives:

  1. Analyze customer call data to serve up location-dependent advertisements.
  2. Identify possible network problems so that it could, for instance, initiate capital expenditure requests to upgrade poorly performing wireless towers several months before they became bottlenecks.
  3. Provide customer service representatives with the latest information when a subscriber called with a service problem.

The provider’s current architecture, however, made it increasingly unable to proactively address customer and network issues, due to the increased volumes of xDRs driven by 3G technologies. The team knew that this problem would only get worse with the transition to 4G.

The technical team implemented streaming technologies to address these issues. In doing so, the team had to resolve a data quality challenge: xDRs and their corresponding call quality records arrived from the network at different times and had to be matched before they could be analyzed.

To address this situation, the technical team used the concept of streaming windows (buffering based on time or count of records received) to match xDRs and call quality characteristics. The team had to make certain trade-offs. For example, the team found that if the window was too large, the system ran out of memory. On the other hand, if the window was too small, some of the call quality records would tend to fall out of it before they were matched with their corresponding xDR. By making repeated runs of a representative set of data from each market and analyzing the number of call quality records matched versus the number of “orphan” call quality records produced, the team was able to arrive at an optimum window size for each market.
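
A hedged sketch of that tuning exercise, assuming a representative sample of arrival times for call quality records and their corresponding xDRs: it replays the sample with several candidate window sizes and reports matched versus orphan records, the trade-off the team balanced against memory. The sample data and function names are hypothetical.

# Hypothetical sample: arrival times (seconds) of each call quality record and of the
# xDR it should match, taken from a representative set of data for one market.
sample = [(0.0, 1.5), (2.0, 2.2), (5.0, 9.8), (7.0, 7.1), (10.0, 18.0)]  # (quality_arrival, xdr_arrival)

def evaluate(window_seconds, pairs):
    """Count call quality records matched within the window versus orphans that fall out of it."""
    matched = sum(1 for q, x in pairs if abs(x - q) <= window_seconds)
    return matched, len(pairs) - matched

for window in (1, 5, 10):
    matched, orphans = evaluate(window, sample)
    print("window %2ds: %d matched, %d orphan (larger windows hold more records in memory)"
          % (window, matched, orphans))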

Case Study 9.5 discusses the governance of time series data in a neonatal intensive care unit.

Case Study 9.5: The governance of time series data in a neonatal intensive care unit¹

A hospital leveraged streaming technologies to monitor the health of newborn babies in the neonatal intensive care unit. With this approach, the hospital was able to predict the onset of disease a full 24 hours before symptoms appeared. From a big data governance perspective, the hospital had to establish multiple policies to govern this time series data.

9.4 Appoint Data Stewards Accountable to the Information Governance Council for Improving the Metrics Over Time

The big data governance program needs to appoint stewards who will be accountable for the quality of the underlying data. In the context of big data quality, stewards have three main responsibilities:

  1. Set and refine business rules and confidence intervals for big data quality.

    Once the critical data elements, confidence intervals, and measurement algorithms have been established, stewards must define the business rules for big data quality. Stewards must continuously refine these business rules as they gain more experience with the dataset.

  2. Address big data quality issues.

    Stewardship over big data takes a different form than for other types of data, such as master data and reference data. Stewards will likely be able to make quality decisions on only a small percentage of the big data, given the sheer volumes involved. In other cases such as streaming analytics, there might not be the opportunity for manual intervention, given the high velocity of the incoming data and the need to turn around responses in milliseconds. As a result, proper criteria must be established for big data records that require stewards’ attention. As discussed in chapter 5 on roadmaps, claims data is a type of big transactional data within health plans (insurers). Case Study 9.6 discusses claims data stewardship at a large health plan.

  3. Report on trends in big data quality.

    The big data governance program should produce reports to help stewards validate and improve data quality. These reports should give the information governance council visibility into how quality is trending over defined periods. Basic elements of these reports can include the critical data elements, the associated quality metrics and confidence levels, and how those metrics trend over the reporting period.

Case Study 9.6: Claims data stewardship at a large health plan

There are many dates associated with healthcare claims. These dates include date of service, claims receipt date, and claims payment date. The claims data stewards at a large U.S. health plan ran periodic checks on claims data that arrived from various contractors for dental, vision, and family planning.

The health plan reported several key metrics to the Centers for Medicare and Medicaid Services (CMS). One such metric was the timeliness of claims payments. To improve data quality to support this metric, the claims data stewards developed a business rule that flagged all records where the date of claim payment was earlier than the date of claim receipt.

The actuaries also needed accurate claims dates to calculate the amounts needed in reserve based on a calculation called Incurred But Not Reported (IBNR). IBNR is the total amount owed by the insurer to all policyholders who have incurred a loss, but have not submitted a claim. The claims data stewards developed business rules to flag all claims where the receipt date was prior to the service date.
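
A minimal sketch of those two business rules, assuming each claim record carries service, receipt, and payment dates; the field names and sample records are hypothetical.

from datetime import date

# Hypothetical claim records; field names are illustrative.
claims = [
    {"id": 1, "service": date(2012, 3, 1), "receipt": date(2012, 3, 5), "payment": date(2012, 3, 20)},
    {"id": 2, "service": date(2012, 3, 10), "receipt": date(2012, 3, 8), "payment": date(2012, 3, 25)},
    {"id": 3, "service": date(2012, 3, 2), "receipt": date(2012, 3, 6), "payment": date(2012, 3, 4)},
]

def flag_claim(claim):
    """Apply the two stewardship rules: payment before receipt, and receipt before service."""
    flags = []
    if claim["payment"] < claim["receipt"]:
        flags.append("payment date earlier than receipt date")
    if claim["receipt"] < claim["service"]:
        flags.append("receipt date earlier than service date")
    return flags

for claim in claims:
    print(claim["id"], flag_claim(claim))
# Claim 1 passes; claim 2 is flagged for receipt before service; claim 3 for payment before receipt.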

Summary

Big data quality needs to be handled differently than quality in traditional projects: it might need to be addressed in real time, the data is often poorly defined, confidence levels need to be established, and data stewards might be in a position to handle only a subset of the data.


1. Rea, Roger. “IBM InfoSphere Streams: Redefining Real Time Analytical Processing.” IBM, 2010.