This chapter includes contributions from Bob Leo (IBM).
As we discussed in chapter 1, big data exposes the natural tensions across different functions. The big data governance program needs to adopt the following best practices to improve organizational alignment:
6.2 Determine the appropriate mix of new and existing roles.
6.3 Appoint big data stewards as appropriate.
6.4 Add big data responsibilities to traditional information governance roles as appropriate.
Each best practice is discussed in detail in the following pages.
A responsibility assignment (RACI) matrix can demonstrate how different organizations might be engaged in big data governance. “RACI” stands for the following:
Responsible: The person or team that performs the work to complete the activity.
Accountable: The person or team that is ultimately answerable for the activity and signs off on the completed work.
Consulted: The people whose input is sought before or during the activity.
Informed: The people who are kept up to date on progress and outcomes.
Case Study 6.1 demonstrates the use of process mapping and a RACI matrix at a large health plan. To set the context for this case study, a brief primer on claim codes will be helpful.
Health plans use claim codes to reimburse providers and hospitals, to benchmark costs and quality of service, and to offer care management services that reduce medical costs.
Health plans require their providers (doctors) to include the appropriate ICD-9, ICD-10, and CPT codes when submitting claims:
ICD-9: International Classification of Diseases, Ninth Revision, codes that describe the member’s diagnosis.
ICD-10: The tenth revision of the ICD standard, which greatly expands the number and granularity of codes.
CPT: Current Procedural Terminology codes, maintained by the American Medical Association, that describe the procedure performed.
A large health plan processed 500 million claims per year. Each claims record contained approximately 600 attributes, in addition to unstructured text. The health plan decided to focus on claims data governance because it spent about 85 cents of every premium dollar on claims, which was the norm in the industry. The business intelligence department conducted analytics on claims data, and these analytics drove several downstream activities, including care management. For example, if an elderly member (patient) made a number of doctor visits for “ankle pain,” a nurse from healthcare services would call the person to consider treatment for arthritis. This proactive approach would improve the member’s quality of life while also reducing medical costs for the plan (insurer).
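To make the care-management trigger concrete, here is a minimal sketch in Python. It assumes a simplified claims extract; the column names (member_id, diagnosis_code, service_date), the visit threshold, and the sample codes are illustrative only, not the health plan’s actual schema or rules.

```python
from collections import defaultdict
from datetime import date

# Illustrative rule: flag members with repeated visits for the same diagnosis
# so that healthcare services can consider proactive outreach.
# A real rule would also bound the time window and use clinical criteria.
VISIT_THRESHOLD = 3  # hypothetical threshold for outreach

claims = [
    {"member_id": "M001", "diagnosis_code": "719.47", "service_date": date(2012, 1, 5)},   # ankle pain
    {"member_id": "M001", "diagnosis_code": "719.47", "service_date": date(2012, 2, 9)},
    {"member_id": "M001", "diagnosis_code": "719.47", "service_date": date(2012, 3, 2)},
    {"member_id": "M002", "diagnosis_code": "401.9",  "service_date": date(2012, 1, 20)},  # hypertension
]

def members_for_outreach(claims, threshold=VISIT_THRESHOLD):
    """Return (member_id, diagnosis_code) pairs with at least `threshold` visits."""
    counts = defaultdict(int)
    for claim in claims:
        counts[(claim["member_id"], claim["diagnosis_code"])] += 1
    return [key for key, count in counts.items() if count >= threshold]

print(members_for_outreach(claims))  # [('M001', '719.47')]
```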
The business intelligence department noticed that a number of entries in the diagnosis code field were not ICD-9 codes. Upon profiling the data, the business intelligence team determined that the field included both ICD-9 (diagnosis) codes and CPT codes. The business intelligence team then met with the network management team, which was responsible for managing provider (doctor) relationships. After many meetings, it became clear that the network management team had allowed doctors to use either ICD-9 codes or CPT codes, despite stringent guidelines that the field was intended only for ICD-9 codes. As a result, the claims reports showed inconsistent data, and the health plan ended up devoting scarce nurse resources to low-risk patients.
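The kind of format profiling the team performed can be illustrated with a short sketch. The regular expressions below cover common ICD-9 and CPT formats, and the field values are invented; this is a simplified stand-in for the team’s actual profiling tools, not a description of them.

```python
import re

# Rough format checks: ICD-9 diagnosis codes are 3 digits (or V/E prefixed)
# with an optional decimal portion; CPT procedure codes are 5 digits.
# In practice, codes are often stored without the decimal point, so a real
# profile would also validate values against ICD-9 and CPT reference lists.
ICD9_PATTERN = re.compile(r"^(V\d{2}|E\d{3}|\d{3})(\.\d{1,2})?$")
CPT_PATTERN = re.compile(r"^\d{5}$")

def profile_diagnosis_field(values):
    """Bucket raw diagnosis-field values by the format they appear to follow."""
    buckets = {"icd9": 0, "looks_like_cpt": 0, "other": 0}
    for value in values:
        v = value.strip().upper()
        if ICD9_PATTERN.match(v):
            buckets["icd9"] += 1
        elif CPT_PATTERN.match(v):
            buckets["looks_like_cpt"] += 1
        else:
            buckets["other"] += 1
    return buckets

sample = ["719.47", "401.9", "99214", "V70.0", "90658", "bad-value"]
print(profile_diagnosis_field(sample))
# {'icd9': 3, 'looks_like_cpt': 2, 'other': 1}
```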
The business intelligence team also conducted text analytics on the freeform text fields in the claims documents. The team compared the results with the reference data for CPT codes and found several anomalies. For example, the free-form text seemed to indicate that the procedure was “flu shot” but the CPT code was “99214,” which may be used for a physical. The conclusion was that providers might have been inadvertently entering incorrect procedure codes into the claims documents. Additionally, the business intelligence team analyzed text such as “chronic congestion” and “blood sugar monitoring” to determine that members might be candidates for disease management programs for asthma and diabetes, respectively.
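A minimal sketch of this text-versus-code consistency check appears below. The CPT reference entries are paraphrased placeholders (a real check would use the licensed CPT reference set), and real text analytics would be far more sophisticated than this keyword comparison.

```python
# Tiny, illustrative slice of CPT reference data (descriptions paraphrased).
CPT_REFERENCE = {
    "90658": "influenza virus vaccine (flu shot)",
    "99214": "office or other outpatient visit, established patient",
}

def text_matches_code(free_text, cpt_code, reference=CPT_REFERENCE):
    """Crude consistency check: does any word from the claim's free text
    appear in the reference description for the billed CPT code?"""
    description = reference.get(cpt_code, "")
    words = {w for w in free_text.lower().split() if len(w) > 3}
    return any(w in description for w in words)

# The anomaly from the case study: free text says "flu shot" but the code is 99214.
print(text_matches_code("flu shot administered", "99214"))  # False -> flag for review
print(text_matches_code("flu shot administered", "90658"))  # True
```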
Figure 6.1 describes a simple process to administer big claims transaction data at the health plan.
There are a number of actors in the claims administration process:
Providers (doctors and hospitals), who submit claims
Members (patients), who receive care
Claims administration, which processes and pays claims
Information security, which safeguards protected health information
Business intelligence, which profiles and analyzes claims data
Network management, which manages provider relationships
Medical informatics and healthcare services, which use claims analytics for care management
The health plan had to establish a number of policies to govern its big claims transaction data. A mapping of the key activities from Figure 6.1 to the big data governance policies is described in Table 6.1.
Table 6.1: Big Data Governance Policies for Big Claims Transaction Data at a Large Health Plan

| Seq. | Activity | Big Data Governance Policy |
| --- | --- | --- |
| 4 | Process claim | The United States Health Insurance Portability and Accountability Act (HIPAA) safeguards the security and privacy of protected health information (PHI). The information security team established database monitoring to ensure that only authorized personnel could access claims records. For example, the insurer did not want the claims data for its senior executives and high-profile celebrities to be randomly accessed by database administrators out of idle curiosity. (A sketch of such an audit-log check follows this table.) |
| 7 | Analyze claims text | As discussed earlier, the health plan used text analytics to uncover inconsistencies, such as when the procedure was “flu shot” but the CPT code was “99214,” which may be used for a physical. The health plan used reference data for ICD-9 and CPT codes to support these analytics. (We discuss reference data in chapter 11, on master data integration.) |
| 9 | Discover anomalies in the data | The business intelligence team noticed that a number of entries in the diagnosis code field were not ICD-9 codes. |
| 10 | Profile data | Upon profiling the data, the business intelligence team determined that the diagnosis code field incorrectly included both ICD-9 codes and CPT codes, despite stringent guidelines that the field was intended only for ICD-9 codes. As a result, the claims reports showed inconsistent data. |
| 11 | Communicate procedure codes to providers | The big data governance program established a policy that the network management team would ask providers to use only ICD-9 codes in the diagnosis code fields of claims documents. |
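The monitoring policy in row 4 could be supported, in part, by reviewing database audit logs. The sketch below flags access to designated high-profile member records by users who are not on an authorized list; all user names, member IDs, and fields are hypothetical, and production monitoring tools would work at a much larger scale.

```python
# Hypothetical audit-log review supporting the monitoring policy in row 4:
# flag any access to designated high-profile member records by users who
# are not on the authorized list for those records.
VIP_MEMBERS = {"M900", "M901"}              # e.g., senior executives, celebrities
AUTHORIZED_USERS = {"claims_batch", "audit_svc"}

audit_log = [
    {"user": "claims_batch", "member_id": "M900", "action": "read"},
    {"user": "dba_jsmith",   "member_id": "M901", "action": "read"},
    {"user": "dba_jsmith",   "member_id": "M123", "action": "read"},
]

def suspicious_accesses(audit_log):
    """Return audit entries touching VIP records by non-authorized users."""
    return [
        entry for entry in audit_log
        if entry["member_id"] in VIP_MEMBERS and entry["user"] not in AUTHORIZED_USERS
    ]

for entry in suspicious_accesses(audit_log):
    print(f"ALERT: {entry['user']} accessed VIP member {entry['member_id']}")
# ALERT: dba_jsmith accessed VIP member M901
```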
As summarized in Table 6.2, the business intelligence team developed a RACI matrix to understand overall roles and responsibilities. The business intelligence team was accountable for the quality of the data that drove claims analytics, while the network management team was responsible for ensuring that providers used the codes consistently. In addition, the medical informatics, healthcare services, and claims administration departments were consulted because they used the claims information on a day-to-day basis.
Once an information governance program has reached a certain state of maturity, it will already have a number of existing roles, such as a chief data officer, an information governance officer, and data stewards. The organization needs to determine if the existing roles should also assume responsibility for big data. Alternatively, the organization may appoint new roles that are specifically focused on big data. There is no right or wrong answer. Every organization needs to balance the revenues, costs, and risks while making these trade-offs. Section 6.3 discusses the roles and responsibilities of big data stewards. Section 6.4 discusses the additional responsibilities that might need to be assumed by existing information governance roles. While we do our best to provide insight, many of these roles are still emerging, and their specific responsibilities will continue to evolve.
A data steward ideally reports into the business and, by virtue of his or her deep subject matter expertise, is responsible for improving the trustworthiness and privacy of data as an enterprise asset. Organizations may choose to appoint big data stewards or to extend the roles and responsibilities of existing stewards. This section describes the roles for stewards who are responsible for specific types of big data.
These stewards are responsible for clickstream data and social media. Here are the roles and responsibilities of these data stewards:
The deployment patterns for web and social media data stewards are still very immature. However, here are a few emerging developments:
These stewards are responsible for data from utility smart meters, equipment sensors, RFID, and the like. Here are the roles and responsibilities of these stewards:
Here are some examples of M2M data stewards:
Oil and gas companies consume vast amounts of high-velocity data in structured, semi-structured, and unstructured formats to measure production. In many cases, this data comes from sensors on rigs and other platforms. The production data is reported to partners and governments, usually on a daily, monthly, or yearly basis. These figures are also reported quarterly to the stock market and can have a material impact on the stock price of publicly traded companies. In addition, incorrect production data can affect joint venture accounting. For example, all the leases in the North Sea are owned by joint ventures between two or more companies, so incorrect production data can result in inaccurate profit sharing within the joint venture. Finally, companies pay taxes based on the metered production of oil and gas, which heightens the importance of this data.
A large upstream oil and gas company decided to implement an information governance program around production volumetric data. The information governance program started with a focus on the metadata around the production reports. The program defined 50 key information artifacts such as “well” that had a major impact on the consistency of definitions across reports. The information governance program then defined a number of child terms, including “well origin,” “well completion,” “wellbore,” and “wellbore completion.” The information governance program leveraged the Professional Petroleum Data Management (PPDM) Association model for well data and definitions.
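To illustrate how such a glossary hierarchy might be captured, here is a minimal sketch using the parent and child terms named above. The definitions shown are placeholders rather than the actual PPDM-aligned definitions, and the structure is only one simple way a business glossary could be represented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GlossaryTerm:
    """One business term in the production-volumetrics glossary."""
    name: str
    definition: str
    children: List["GlossaryTerm"] = field(default_factory=list)

well = GlossaryTerm(
    name="well",
    definition="placeholder definition, aligned with the PPDM model in practice",
    children=[
        GlossaryTerm("well origin", "placeholder definition"),
        GlossaryTerm("well completion", "placeholder definition"),
        GlossaryTerm("wellbore", "placeholder definition"),
        GlossaryTerm("wellbore completion", "placeholder definition"),
    ],
)

def print_hierarchy(term, indent=0):
    """Print the term hierarchy so business and IT can review it together."""
    print("  " * indent + term.name)
    for child in term.children:
        print_hierarchy(child, indent + 1)

print_hierarchy(well)
```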
The company had a highly complex and matrixed organization, consisting of geographically based production business units including the Gulf of Mexico, West Africa, Canada, the North Sea, and Asia. Each business unit had its own definitions for production volumetrics. Ultimately, the organization determined that the shared services organization was the overall sponsor for production volumetrics. Because the shared services organization encompassed reservoir management, facilities engineering, and drill completion, it already carried some degree of accountability for production volumetrics.
Given the complex nature of the enterprise, the stewards for production volumetrics actually reported into IT. However, the definitions for critical business terms had to be approved by a cross-functional team consisting of IT and business executives.
These stewards are responsible for big transaction data such as insurance claims, telecommunications CDRs, and consumer packaged goods DSRs (demand signal repositories). Here are the roles and responsibilities of big transaction data stewards:
Here are some examples of big transaction data stewards:
Finally, the organization decided to have the claims stewards report into claims administration. The claims stewards also had dotted-line responsibility into the information governance council that had senior representation from business intelligence, network management, medical informatics, healthcare services, and claims administration.
The use of biometric data outside of the law enforcement and intelligence communities is very limited. Hence, we can only speculate whether organizations will ever deploy biometric data stewards. A possible candidate might be human resources for biometric data that is collected when employees log into applications.
Organizations are only beginning to use their treasure troves of human-generated data for analytics. Once again, the deployment patterns for human-generated data stewards are largely unknown today. A possible candidate might be the customer service function, which analyzes voice data for quality assurance and operating efficiencies.
Many organizations are hiring data scientists to get their arms around big data. Data science is a discipline that combines math, programming, and scientific insight.3 In Building Data Science Teams, D. J. Patil, former head of data products and chief scientist at LinkedIn, says that the best data scientists have the following characteristics:4
Technical expertise: deep expertise in some scientific discipline
Curiosity: a desire to go beneath the surface of a problem and distill it into a clear set of hypotheses that can be tested
Storytelling: the ability to use data to tell a story and to communicate it effectively
Cleverness: the ability to look at a problem in different, creative ways
Because data scientists grapple with poor data quality and inconsistent definitions on a daily basis, they may also act as big data stewards.
This section describes the additional responsibilities that may be assumed by existing information governance roles.
Many organizations, especially those in the financial services industry, are appointing chief data officers who are responsible for the trustworthiness of information at the enterprise level. Chief data officers are accountable for enterprise information governance programs, and it is highly likely that they will bring big data within their purview as well.
Many organizations have appointed owners for information governance who may also perform other functions, including enterprise data architecture, business intelligence, risk management, and information security. However, information governance positions are increasingly being staffed on a full-time basis, as organizations recognize the value of information as an enterprise asset. The information governance officer may need to assume the following additional responsibilities to govern big data:
The information governance council consists of the executive sponsors for the program. The council sets the overall direction for the information governance program, provides alignment within the organization across business and IT, and acts as a tiebreaker in case of disagreements. Depending on the underlying business problem, the chief information officer, the vice president of information management, the chief information security officer, the chief risk officer, or some other executive will chair the information governance council. The council may also include functional representation from finance, legal, and HR, as well as representatives from various lines of business that have a stake in information as an enterprise asset. These executives are the overall champions for the information governance program and ensure buy-in across the organization. The information governance council may meet monthly or quarterly.
Sample big data topics on the agenda of the information governance council may include the following:
The information governance working group is the next level down from the council. The working group runs the information governance program on a day-to-day basis and is responsible for oversight of the data stewardship community. The information governance officer may chair the working group, which includes middle management and may meet weekly or monthly. The working group discusses topics similar to those on the council’s agenda, but at a greater level of granularity.
The customer data steward may need to assume the following additional responsibilities to govern big data:
The materials data steward may need to assume additional responsibilities to govern big data. Case Study 11.3 in chapter 11 describes the use of web content to improve the quality of product data at an information services provider.
The asset data steward’s responsibilities may also extend to big data. For example, an asset steward needs to standardize equipment naming conventions across plants and operating units to facilitate the optimal usage of sensor data. When machine sensor data predicts the failure of a pump at a plant, the operations team might need to replace similar pumps at other plants. However, this can only happen if the pumps have similar naming conventions.
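As a simple illustration of why consistent naming matters, the sketch below normalizes plant-specific pump tags to a common model identifier so that a failure prediction for one pump can be mapped to similar pumps at other plants. The tag formats and model numbers are invented; a real asset registry would rely on governed naming standards rather than pattern matching.

```python
import re

# Invented, plant-specific tags for what is physically the same pump model;
# inconsistent conventions like these make cross-plant analysis difficult.
equipment = [
    {"plant": "Plant A", "tag": "PUMP-XJ200-01"},
    {"plant": "Plant B", "tag": "xj200_pump_7"},
    {"plant": "Plant C", "tag": "P-XJ200/3"},
    {"plant": "Plant C", "tag": "COMPRESSOR-C55-1"},
]

def pump_model(tag):
    """Extract a normalized pump model (e.g., 'XJ200') from a plant-specific tag."""
    match = re.search(r"XJ\d{3}", tag.upper())
    return match.group(0) if match else None

def similar_pumps(failing_tag, equipment):
    """Find pumps at other plants that share the failing pump's model."""
    model = pump_model(failing_tag)
    return [
        item for item in equipment
        if model is not None
        and item["tag"] != failing_tag
        and pump_model(item["tag"]) == model
    ]

# Sensor analytics predicts a failure for Plant A's pump; find its peers elsewhere.
print(similar_pumps("PUMP-XJ200-01", equipment))
# [{'plant': 'Plant B', 'tag': 'xj200_pump_7'}, {'plant': 'Plant C', 'tag': 'P-XJ200/3'}]
```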
Data stewards should ideally report into the business. Because data stewards may report into different business units or functions, the organization should appoint a chief data steward to ensure consistency across the various stewardship roles and to foster a sense of community. The chief data steward should oversee stewards across the entire information governance program, including big data.
The data custodian is responsible for the repositories and technical infrastructure to manage big data. IT is often the custodian for big data and other traditional data types.
Using a combination of new and existing roles, the information governance organization needs to extend its charter and membership to address both big data and traditional data types. Figure 6.2 describes the information governance organization at a health plan. At the bottom, the stewardship community includes a steward for big claims transaction data. At the middle level, the information governance working group includes a representative from claims administration. At the highest level, claims administration is represented by the chief operating officer, to whom it reports.
As shown in Case Study 6.3, executive sponsorship for big data governance in government may be across multiple agencies and elected officials. (This case study has been disguised.)
State governments in the U.S. are building longitudinal data warehouses with information from preschool, early learning programs, K-12 education, post-secondary education, and work force institutions to address pressing public policy questions. The source data for these so-called “P20W” data warehouses comes from a variety of agencies led by different elected officials or independent boards. In one state, a small research group located in the governor’s budget office is responsible for building the P20W longitudinal data warehouse. The research group has a number of statutory partners across the government, including the following:
The success of the P20W initiative depends on careful information governance, based on proper data sharing agreements and adherence to the differing laws and rules that govern the release of data from each organization. Executive sponsorship and buy-in across the multiple organizations are critical to successfully building and operating a longitudinal P20W data warehouse. Building that level of executive sponsorship requires an understanding that data from the source systems is an asset of the state, not just of the contributing organization.
Most of the content in the P20W data warehouse is not big data in the context of this book. The data has limited volume, is largely structured, and is only updated on a monthly basis. However, we anticipate that the information governance program will have to grapple with a number of questions relating to big data. For example, the P20W data warehouse does not always have the full post-graduation employment history of the students (“constituents”). The key question is whether the P20W team can use social media like LinkedIn and Facebook to update the employment status for these constituents. These rapidly evolving privacy issues will pose a continuing challenge to the P20W team.
The information governance organization needs to extend its charter to assume responsibility for big data. The information governance organization also needs to either extend the responsibilities of existing roles, add new roles, or both.
1. Zhang, Jane. “Why We Need 1,170 Codes for Angioplasty.” The Wall Street Journal, November 11, 2008. http://online.wsj.com/article/SB122636897819516185.html.
2. “Procedural and Diagnosis Coding Must Be Linked by Medical Necessity.” University of Florida, College of Medicine Office of Compliance, September 20, 1999. http://www.med.ufl.edu/complian/q&a/cpt-codes.html.
3. Dumbill, Edd. “What Is Big Data?” O’Reilly Radar, January 11, 2012. http://radar.oreilly.com/2012/01/what-is-big-data.html.
4. Patil, D. J. Building Data Science Teams. O’Reilly Media, 2011.