CHAPTER 9

BIG DATA QUALITY

This chapter includes contributions from David Borean (InfoTrellis), Roger Rea (IBM), Randy Schnier (IBM), and Brian Williams (IBM).

Data quality management is a discipline that includes the methods to measure, improve, and certify the quality and integrity of an organization’s data. Because of the extreme volume, velocity, and variety of big data, its quality needs to be handled differently than in traditional information governance programs. Table 9.1 contrasts traditional and big data quality programs.

Given the differences highlighted in Table 9.1, a big data quality program has to take a somewhat different approach than traditional projects while still adhering to key best practices. Specifically, the big data governance program should adopt the following practices to address data quality issues:

9.1 Work with business stakeholders to establish and measure confidence intervals for the quality of big data.

9.2 Leverage semi-structured and unstructured data to improve the quality of sparsely populated structured data.

9.3 Use streaming analytics to address data quality issues in-memory without landing interim results to disk.

9.4 Appoint data stewards accountable to the information governance council for improving the metrics over time.

Each of these sub-steps is discussed in detail in the rest of this chapter.

Table 9.1: Traditional versus Big Data Quality Programs
Dimension | Traditional Data Quality | Big Data Quality
Frequency of processing | Processing is batch-oriented. | Processing is both real-time and batch-oriented.
Variety of data | Data format is largely structured. | Data format may be structured, semi-structured, or unstructured.
Confidence levels | Data needs to be in pristine condition for analytics in the data warehouse. | “Noise” needs to be filtered out, but data needs to be “good enough.” Poor data quality might or might not impede analytics to glean business insights.
Timing of data cleansing | Data is cleansed prior to loading into the data warehouse. | Data may be loaded as is because the critical data elements and relationships might not be fully understood. The volume and velocity of data might require streaming, in-memory analytics to cleanse data, thus reducing storage requirements.
Critical data elements | Data quality is assessed for critical data elements such as customer address. | Data may be quasi- or ill-defined and subject to further exploration, hence critical data elements may change iteratively.
Location of analysis | Data moves to the data quality and analytics engines. | Data quality and analytics engines may move to the data, to ensure an acceptable processing speed.
Stewardship | Stewards can manage a high percentage of the data. | Stewards can manage a smaller percentage of data, due to high volumes and/or velocity.

9.1 Work with Business Stakeholders to Establish and Measure Confidence Intervals for the Quality of Big Data

The big data governance program needs to identify the critical data elements that drive the success of the program. In a traditional business intelligence project, data quality is addressed upfront: the team can profile the structured fields in an internal data source and resolve any quality issues it discovers before the data is consumed. A big data initiative, by contrast, might have to deal with data quality challenges on the fly. In addition, big data projects often rely on external data, which makes it difficult to discover quality issues in a timely manner. Notwithstanding all of this, it is still possible to use historical experience to describe the types of issues to expect with the quality of external big data.

We’ll discuss Twitter data in the rest of this section. Consider a simple Tweet:

Best day ever. Went mountain biking today. Will get one for my husband too.

The business intelligence team can likely infer the following attributes about the person who Tweeted this message, with a high degree of confidence:

• The person is married, and the spouse is male (“my husband”).
• The person went mountain biking and intends to buy a mountain bike for the spouse.
• The sentiment of the Tweet is positive (“Best day ever”).

The team should be able to get more insight as this person generates additional Tweets.

The big data governance program should collaborate with business stakeholders to develop a quality requirements matrix. The matrix identifies the critical data elements, data quality concerns, and business rules. Business stakeholders must be involved in this activity to provide a level of specificity on how the critical data elements are actually being used. In many cases, there might be more than one consumer for the same data. Case Study 9.1 describes the issues relating to Twitter data quality at a hypothetical company called Acme Corporation.

Case Study 9.1: Twitter data quality at Acme Corporation

Table 9.2 provides a sample quality requirements matrix for Twitter data at a hypothetical company called Acme Corporation. The measurement algorithm is greatly simplified and is for illustration purposes only.

As with traditional information intensive projects, the quality of big data needs to be “fit for purpose.” The acceptable threshold of big data quality depends on how it is to be used in analytics and operational processes. Acme also established confidence intervals for Twitter data, shown in Table 9.3.

Table 9.2: Sample Quality Requirements Matrix for Twitter Data at Acme Corporation
Critical data element: Tweet timestamp
Data quality concerns:
• The format of timestamps in Tweets is inconsistent with the standard format, which can cause issues in analytical queries when joining to other datasets.
Data quality business rules:
• Reformat all timestamps to YYYY-MM-DD HH:MM:SS.

Critical data element: User name
Data quality concerns:
• The user name from the profile is not the person’s real name in 40 to 50 percent of Twitter accounts. This user name is a useful element in matching the Tweet to a customer record in MDM. Note that we are referring to the name from the user profile, as opposed to the Twitter handle (screen_name).
• How representative is the data for reputation analysis?
Data quality business rules:
• If non-name characters such as numbers and symbols exist in the user name, then the confidence level is zero percent.
• If the user name is one word, such as a first or last name, then the confidence level is 25 percent.
• If the user name is two or three words, then the confidence level is 50 percent.
• If the words are verified as people’s names using a name library, then the confidence level is 99 percent.
Note: The rules for matching Tweets to internal customer or product MDM records are beyond the scope of this discussion.

Critical data element: References to Acme Corporation
Data quality concerns:
• Is the Tweet about Acme Corporation, or is it noise that needs to be filtered out?
Data quality business rules:
• If the Tweet contains “@Acme,” then the confidence level is 99 percent.
• If the Tweet contains “Acme” and Acme product names, then the confidence level is 75 percent.
• If the Tweet is on the ignore list, then the confidence level is zero percent.

Critical data element: Location
Data quality concerns:
• Location data from Tweets is important. It can be used to answer questions such as, “Are users in the southeastern United States more likely to be disgruntled than those in other parts of the country?”
• Although the user profile often contains city and state information, it is not validated by Twitter and might not contain valid information.
Data quality business rules:
• Extract the Tweet.user.location string from the Tweet metadata and validate the city and state names. If the location is not recognizable, then flag it as unknown.
Table 9.3: Confidence Intervals for Twitter Data
Category | Confidence Level
Very high | 90% - 99%
High | 80% - 89%
Medium | 70% - 79%
Low | 69% or below
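
To make the measurement algorithm concrete, here is a minimal Python sketch of the user name rules from Table 9.2 and the category mapping from Table 9.3. The name library, function names, and sample names are hypothetical placeholders, and the rules are the simplified, illustrative ones above rather than a production matching algorithm.

import re

# Hypothetical name library; in practice, this would be a licensed name dictionary.
KNOWN_NAMES = {"jane", "doe", "john", "smith"}

def user_name_confidence(user_name):
    """Score a Twitter profile name using the illustrative rules in Table 9.2."""
    words = user_name.strip().split()
    # Rule: non-name characters such as numbers and symbols -> zero percent.
    if re.search(r"[^A-Za-z\s'\-]", user_name):
        return 0
    # Rule: all words verified against the name library -> 99 percent.
    if words and all(w.lower() in KNOWN_NAMES for w in words):
        return 99
    # Rule: a single word (first or last name only) -> 25 percent.
    if len(words) == 1:
        return 25
    # Rule: two or three words -> 50 percent.
    if len(words) in (2, 3):
        return 50
    return 0

def confidence_category(level):
    """Map a confidence level to the categories in Table 9.3."""
    if level >= 90:
        return "Very high"
    if level >= 80:
        return "High"
    if level >= 70:
        return "Medium"
    return "Low"

level = user_name_confidence("Jane Doe")
print(level, confidence_category(level))  # -> 99 Very high (both words are in the hypothetical name library)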

In this example, the critical data elements for Twitter data are the Tweet timestamp, the user name, references to Acme Corporation, and the location, as detailed in Table 9.2.

Case Study 9.2 describes the challenges faced by the social listening department at a high-end retailer.

Case Study 9.2: The social listening department at a high-end retailer

The social listening department at a high-end retailer had to address senior management’s concerns about whether Twitter users were in a different demographic from their customers, who were primarily female and older than 30 years. The social listening department conducted marketing surveys and found, to their surprise, that the demographics of their Twitter users were actually very similar to those of their traditional customers. Armed with the survey data, the social listening department was able to get more attention and budget from senior management.

9.2 Leverage Semi-Structured and Unstructured Data to Improve the Quality of Sparsely Populated Structured Data

The big data governance program might be faced with situations where the structured data is sparsely populated. In that case, the big data governance team might leverage semi-structured and unstructured sources to improve data quality. Case Study 18.1 in chapter 18 provides an example of this, in which the analytics team at a hospital system used unstructured data from electronic medical records, patient physicals, physicians’ notes, and discharge summaries to obtain better insight into smoking status, drug and alcohol abuse, and residence in an assisted living facility. The analytics team used this data to predict the likelihood of a patient being readmitted to a hospital within 30 days of treatment for congestive heart failure.
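
As a hedged illustration of this technique (not the hospital’s actual method), assume the physicians’ notes are available as plain text; a simple keyword rule could then populate a sparsely filled smoking-status field. A production system would use a clinical natural language processing engine, and the note text and phrases below are hypothetical.

import re

# Hypothetical note text; a real system would read discharge summaries or physicians' notes.
note = "Patient is a 62-year-old male. He denies alcohol use. Former smoker, quit 5 years ago."

def infer_smoking_status(text):
    """Derive a structured smoking-status value from free-text clinical notes (illustrative only)."""
    t = text.lower()
    if re.search(r"\b(non-?smoker|never smoked|denies smoking)\b", t):
        return "never"
    if re.search(r"\b(former smoker|quit smoking|ex-?smoker)\b", t):
        return "former"
    if re.search(r"\b(current smoker|smokes|tobacco use)\b", t):
        return "current"
    return "unknown"  # leave the structured field unpopulated rather than guess

print(infer_smoking_status(note))  # -> "former"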

9.3 Use Streaming Analytics to Address Data Quality Issues In-Memory Without Landing Interim Results to Disk

Streaming analytics can analyze high volumes of data in real time without landing interim results to disk. Streaming applications need to consider two aspects of the underlying data: temporal alignment and the rate of arrival.

Before building a streaming application, big data teams need to understand the characteristics of the data. Taking the Twitter example in Case Study 9.1, the big data team would use the Twitter API to download a sample set of Tweets for further analysis. The profiling of streaming data is similar to that in traditional data projects: both need to understand the characteristics of the underlying data, such as the frequency of null values. However, the profiling of streaming data also needs to consider two additional aspects of the source data (a brief profiling sketch follows the list below):

  1. Temporal alignment—Streaming applications need to discover the temporal offset when joining, correlating, and matching data from different sources. For example, a streaming application that needs to combine data from two sensors needs to know that one sensor generates events every second, while the other generates events every three seconds.
  2. Rate of arrival—Streaming applications need to understand the rate of arrival of data:
    • Does the data arrive continuously?
    • Are there bursts in the data?
    • Are there gaps in the arrival of data?
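
As a minimal sketch of this kind of profiling, assuming event timestamps have already been extracted from a downloaded sample of the feed, the following Python fragment summarizes the rate of arrival (continuous flow, bursts, and gaps) before any streaming application is built. The sample timestamps are hypothetical.

from datetime import datetime

# Hypothetical sample of event arrival timestamps taken from the source feed.
arrivals = [
    "2012-06-01 09:00:00", "2012-06-01 09:00:01", "2012-06-01 09:00:02",
    "2012-06-01 09:00:05", "2012-06-01 09:00:05", "2012-06-01 09:00:12",
]
times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in arrivals]

# Inter-arrival gaps in seconds reveal whether data arrives continuously,
# in bursts (many zero-second gaps), or with gaps (large values).
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]

print("average gap: %.1f seconds" % (sum(gaps) / len(gaps)))
print("largest gap: %.0f seconds" % max(gaps))
print("bursts (events in the same second): %d" % gaps.count(0.0))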

This section includes three case studies that describe the use of streaming applications. Case Study 9.3 describes a hypothetical streaming application that monitors sensor data in a school building.

Case Study 9.3: A streaming application that monitors sensor data in a school building

A simple streaming application correlates data from motion and temperature sensors in a school building. The big data team has profiled the data to understand its characteristics. Data from the motion sensors arrives every 30 seconds, while data from the temperature sensors arrives every 60 seconds. The streaming application uses this information to conduct a temporal alignment of the motion and temperature sensor data that arrive at different intervals. It accomplishes this by creating a window during which it holds both types of sensor events in memory, so that it can match the two streams of data.
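
A minimal sketch of that temporal alignment, assuming each reading has already been parsed into a (seconds, room, value) tuple; a production system would use a streaming engine’s windowing operators rather than plain Python, and the sample readings are hypothetical.

# Motion events arrive every 30 seconds, temperature events every 60 seconds.
# Hold both streams in a window and pair each temperature reading with the
# most recent motion reading for the same room that falls inside the window.

WINDOW_SECONDS = 60  # window size chosen to cover the slower (60-second) stream

motion_events = [(0, "A", True), (30, "A", False), (60, "A", True)]   # (seconds, room, motion detected)
temperature_events = [(0, "A", 75.0), (60, "A", 76.0)]                # (seconds, room, degrees F)

def align(temps, motions, window):
    """Join the two streams on room within a time window (illustrative, batch-style)."""
    joined = []
    for t_time, t_room, temp in temps:
        candidates = [m for m in motions
                      if m[1] == t_room and 0 <= t_time - m[0] <= window]
        if candidates:
            m_time, _, moving = max(candidates)  # most recent motion reading in the window
            joined.append((t_room, t_time, temp, moving))
    return joined

print(align(temperature_events, motion_events, WINDOW_SECONDS))
# -> [('A', 0, 75.0, True), ('A', 60, 76.0, True)]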

The streaming application also uses reference data indicating that room A is a classroom and room B is the boiler room. The streaming application stores temperature data in Hadoop every 10 minutes. The analytics team has used Hadoop to build a normalized model of expected temperatures; for example, the boiler room averages 65 degrees at 3:00 a.m., and the classroom averages 75 degrees at 9:00 a.m. (Hadoop is discussed in detail in chapter 21, on big data reference architecture.)

Finally, the streaming application uses the available data to generate alerts, such as when sensor data does not arrive for five minutes, the temperature of the boiler room rises to 75 degrees at 3:00 a.m., or the motion sensor detects movement in the classroom at 5:00 a.m.
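
The alerting logic itself can be expressed as a few simple rules over the aligned readings, as in the hedged sketch below; the thresholds restate the examples above, and the baseline lookup stands in for the normalized model built in Hadoop.

# Hypothetical baseline from the normalized model: (room, hour) -> expected degrees F.
BASELINE = {("boiler", 3): 65.0, ("classroom", 9): 75.0}

def alerts(room, hour, temperature, seconds_since_last_event, motion_detected):
    """Return the alert conditions from Case Study 9.3 that apply to one aligned reading."""
    raised = []
    if seconds_since_last_event >= 300:                         # no sensor data for five minutes
        raised.append("sensor silent for 5 minutes")
    expected = BASELINE.get((room, hour))
    if expected is not None and temperature - expected >= 10:   # e.g., boiler room at 75 degrees at 3:00 a.m.
        raised.append("temperature well above baseline")
    if room == "classroom" and motion_detected and hour == 5:   # movement in the classroom at 5:00 a.m.
        raised.append("unexpected motion")
    return raised

print(alerts("boiler", 3, 75.0, 60, False))  # -> ['temperature well above baseline']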

Case Study 9.4 describes the usage of streaming technologies to monitor real-time network performance at a large wireless telecommunications provider.

Case Study 9.4: The use of streaming technologies to monitor real-time network performance at a wireless telecommunications provider

A large wireless telecommunications provider served customers in multiple metropolitan areas across the country. The provider needed to analyze call detail records (CDRs), Internet protocol detail records (IPDRs) for web usage, and SMS detail records (collectively referred to as xDRs) in real time to troubleshoot poorly performing cells.

The provider wanted to use this real-time information to accomplish the following business objectives:

  1. Analyze customer call data to serve up location-dependent advertisements.
  2. Identify possible network problems so that it could, for instance, initiate capital expenditure requests to upgrade poorly performing wireless towers several months before they became bottlenecks.
  3. Provide customer service representatives with the latest information when a subscriber called with a service problem.

The provider’s current architecture, however, made it increasingly unable to proactively address customer and network issues, due to the increased volumes of xDRs driven by 3G technologies. The team knew that this problem would only get worse with the transition to 4G.

The technical team implemented streaming technologies to address these issues. In doing so, the team had to resolve a data quality challenge: xDRs and their corresponding call quality records arrived from the network at different times and had to be matched before they could be analyzed.

To address this situation, the technical team used the concept of streaming windows (buffering based on time or count of records received) to match xDRs and call quality characteristics. The team had to make certain trade-offs. For example, the team found that if the window was too large, the system ran out of memory. On the other hand, if the window was too small, some of the call quality records would tend to fall out of it before they were matched with their corresponding xDR. By making repeated runs of a representative set of data from each market and analyzing the number of call quality records matched versus the number of “orphan” call quality records produced, the team was able to arrive at an optimum window size for each market.
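
A hedged sketch of that tuning exercise, assuming a representative sample of arrival times for call quality records and their corresponding xDRs: it replays the sample with several candidate window sizes and reports matched versus orphan records, the trade-off the team balanced against memory. The sample data and function names are hypothetical.

# Hypothetical sample: arrival times (seconds) of each call quality record and of the
# xDR it should match, taken from a representative set of data for one market.
sample = [(0.0, 1.5), (2.0, 2.2), (5.0, 9.8), (7.0, 7.1), (10.0, 18.0)]  # (quality_arrival, xdr_arrival)

def evaluate(window_seconds, pairs):
    """Count call quality records matched within the window versus orphans that fall out of it."""
    matched = sum(1 for q, x in pairs if abs(x - q) <= window_seconds)
    return matched, len(pairs) - matched

for window in (1, 5, 10):
    matched, orphans = evaluate(window, sample)
    print("window %2ds: %d matched, %d orphan (larger windows hold more records in memory)"
          % (window, matched, orphans))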

Case Study 9.5 discusses the governance of time series data in a neonatal intensive care unit.

Case Study 9.5: The governance of time series data in a neonatal intensive care unit¹

A hospital leveraged streaming technologies to monitor the health of newborn babies in the neonatal intensive care unit. With this approach, the hospital was able to predict the onset of disease a full 24 hours before symptoms appeared. From a big data governance perspective, the hospital had to establish multiple policies to govern this time series data.

9.4 Appoint Data Stewards Accountable to the Information Governance Council for Improving the Metrics Over Time

The big data governance program needs to appoint stewards who will be accountable for the quality of the underlying data. In the context of big data quality, stewards have three main responsibilities:

  1. Set and refine business rules and confidence intervals for big data quality.

    Once the critical data elements, confidence intervals, and measurement algorithms have been established, stewards must define the business rules for big data quality. Stewards must continuously refine these business rules as they gain more experience with the dataset.

  2. Address big data quality issues.

    Stewardship over big data takes a different form than for other types of data, such as master data and reference data. Stewards will likely be able to make quality decisions on only a small percentage of the big data, given the sheer volumes involved. In other cases such as streaming analytics, there might not be the opportunity for manual intervention, given the high velocity of the incoming data and the need to turn around responses in milliseconds. As a result, proper criteria must be established for big data records that require stewards’ attention. As discussed in chapter 5 on roadmaps, claims data is a type of big transactional data within health plans (insurers). Case Study 9.6 discusses claims data stewardship at a large health plan.

  3. Report on trends in big data quality.

    The big data governance program should produce reports to help stewards validate and improve data quality. These reports should give the information governance council visibility into how quality is trending over defined periods. Basic elements of these reports can include the critical data elements, the associated quality metrics and confidence levels, and how those metrics trend over the reporting period.

Case Study 9.6: Claims data stewardship at a large health plan

There are many dates associated with healthcare claims. These dates include date of service, claims receipt date, and claims payment date. The claims data stewards at a large U.S. health plan ran periodic checks on claims data that arrived from various contractors for dental, vision, and family planning.

The health plan reported several key metrics to the Centers for Medicare and Medicaid Services (CMS). One such metric was the timeliness of claims payments. To improve data quality to support this metric, the claims data stewards developed a business rule that flagged all records where the date of claim payment was earlier than the date of claim receipt.

The actuaries also needed accurate claims dates to calculate the amounts needed in reserve based on a calculation called Incurred But Not Reported (IBNR). IBNR is the total amount owed by the insurer to all policyholders who have incurred a loss, but have not submitted a claim. The claims data stewards developed business rules to flag all claims where the receipt date was prior to the service date.
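
A minimal sketch of those two business rules, assuming each claim record carries service, receipt, and payment dates; the field names and sample records are hypothetical.

from datetime import date

# Hypothetical claim records; field names are illustrative.
claims = [
    {"id": 1, "service": date(2012, 3, 1), "receipt": date(2012, 3, 5), "payment": date(2012, 3, 20)},
    {"id": 2, "service": date(2012, 3, 10), "receipt": date(2012, 3, 8), "payment": date(2012, 3, 25)},
    {"id": 3, "service": date(2012, 3, 2), "receipt": date(2012, 3, 6), "payment": date(2012, 3, 4)},
]

def flag_claim(claim):
    """Apply the two stewardship rules: payment before receipt, and receipt before service."""
    flags = []
    if claim["payment"] < claim["receipt"]:
        flags.append("payment date earlier than receipt date")
    if claim["receipt"] < claim["service"]:
        flags.append("receipt date earlier than service date")
    return flags

for claim in claims:
    print(claim["id"], flag_claim(claim))
# Claim 1 passes; claim 2 is flagged for receipt before service; claim 3 for payment before receipt.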

Summary

Big data quality needs to be handled differently than quality in traditional projects: it might need to be addressed in real time, the data is often poorly defined, confidence levels need to be established, and data stewards might be in a position to handle only a subset of the data.


1. Rea, Roger. “IBM InfoSphere Streams: Redefining Real Time Analytical Processing.” IBM, 2010.