CHAPTER 2

THE BIG DATA GOVERNANCE FRAMEWORK

This chapter provides a framework for big data governance. As shown in Figure 2.1, this framework consists of three dimensions:

We will discuss each dimension in this chapter.

Image

Figure 2.1: A framework for big data governance.

2.1 Big Data Types

As shown in Figure 2.2, big data can be broadly classified into five types.

Let’s consider each type in more detail:

  1. Web and social media—This includes clickstream and social media data such as Facebook, Twitter, LinkedIn, and blogs. Big data governance programs will increasingly be required to integrate this data with master data and with core business processes such as customer loyalty programs. The big data governance program needs to establish policies regarding the acceptable use of social media data, especially since regulations and precedents are continually evolving. The program also needs to establish guidelines regarding the acceptable use of cookies, especially third-party cookies, to track users and to personalize their web interactions. Metadata is also critical to web and social media. For example, two sites might measure the term “unique visitors” differently for clickstream analytics. One site might measure unique visitors within a month, while the other might measure unique visitors within a week.

    Image

    Figure 2.2: Big data types.

  2. Machine-to-machine data—Machine-to-machine, or M2M, refers to technologies that allow both wireless and wired systems to communicate with other devices. M2M uses a device such as a sensor or meter to capture an event such as speed, temperature, pressure, flow, or salinity. This event is relayed through a wireless, wired, or hybrid network to an application that translates the captured event into meaningful information. M2M communications create the so-called “internet of things.” The big data governance program needs to establish a number of policies around M2M data. For example, the program needs to draw up guidelines around the acceptable use of geolocation and RFID data that can be used to build a profile of individuals and potentially violate their privacy. The program needs to establish retention policies around the massive volumes of M2M data, which can easily overwhelm IT budgets if not properly controlled. The big data governance program needs to address any data quality concerns such as RFID read rates in environments with high moisture content and lots of congestion. Finally, the big data governance program needs to secure the Supervisory Control and Data Acquisition (SCADA) infrastructure from vulnerability to cyber attacks.
  3. Big transaction data—This includes healthcare claims, telecommunications CDRs, and utility billing records. Big transaction data is increasingly available in semi-structured and unstructured formats. Information governance challenges such as metadata, data quality, privacy, and information lifecycle management also apply to this data.
  4. Biometrics—Biometric recognition, or biometrics, refers to the automatic identification of a person based on his or her anatomical or behavioral characteristics or traits.1 Anatomical data is created from the physical characteristics of a person including a fingerprint, an iris, a retina, a face, an outline of a hand, an ear shape, a voice pattern, DNA—even body odor. Behavioral data includes handwriting and keystroke analysis.2 Advances in technology have vastly increased the available biometric data. Law enforcement, the legal system, and intelligence agencies have been using this information for a long time. However, biometric data is increasingly available in the commercial arena, where it can be combined with other types of data such as social media. This opens up new business opportunities as well as several governance issues relating to privacy and data retention.
  5. Human-generated data—Human beings generate vast quantities of data such as call center agents’ notes, voice recordings, email, paper documents, surveys, and electronic medical records. This data might contain sensitive information that needs to be masked. It might also contain insights that can improve the quality of structured data sets and integrate with MDM. In addition to dealing with these issues, organizations need to establish policies regarding the retention period for this data to adhere to regulations and manage storage costs.

2.2 Information Governance Disciplines

The seven core disciplines of information governance also apply to big data:

  1. Organization—The information governance organization needs to consider adding big data to its overall framework, including the charter, organization structure, and roles and responsibilities. The information governance council might seek new members who can provide a unique perspective on big data, such as data scientists. It might also decide to appoint stewards for social media, RFID, and other types of big data. Finally, the information governance program might add additional responsibilities to the job descriptions of existing stewards. For example, the customer data steward might be accountable for the Twitter handles and Facebook accounts within the master data repository.
  2. Metadata—The big data governance program needs to integrate big data with the enterprise metadata repository. This involves the following activities:
    • Include big data terms within the business glossary. For example, add the term “unique visitor” to support clickstream analytics.
    • Import technical metadata from Hadoop into the metadata repository.
    • Ensure that the data lineage administrator is able to import flows from Hadoop into the technical metadata repository.
    • Manage data lineage and impact analysis within the big data environment.
  3. Privacy—As far back as 1890, Louis Brandeis (later a justice of the United States Supreme Court) and Samuel Warren published an article called “The Right to Privacy” in the Harvard Law Review. This article defined privacy as the “right to be left alone.”3 Subsequent regulations and legislation around the world have formalized and expanded this theory of privacy. Big data governance needs to identify sensitive data and establish policies regarding its acceptable use. These policies need to consider regulations that vary by big data type, industry, and country. Given the many headlines on the subject, the big data governance program needs to establish guidelines regarding the acceptable use of social media and geolocation data, if applicable.
  4. Data quality—Data quality management is a discipline that includes the methods to measure, improve, and certify the quality and integrity of an organization’s data. Because of its extreme volume, velocity, and variety, big data quality needs to be handled differently than traditional data types. For example, big data quality might need to be handled in real-time and address issues relating to semi-structured and unstructured data. Big data needs to be “good enough” because poor data quality does not necessarily impede the analytics that are required to derive business insights.
  5. Business process integration—The program needs to identify key business processes that require big data. The program then needs to define key policies to support the governance of big data. For example, drilling and production are key processes within oil and gas. The big data governance program must establish policies around the retention period for sensor data such as temperature, flow, pressure, and salinity on an oil rig. Not only is this data costly to store, but it might also be required by regulators to justify an operator’s actions in case of an oil spill. In another example, a retailer might establish a policy that it will access a customer’s Facebook profile, including his or her list of friends, only if it has obtained informed consent via a Facebook app. The retailer will obtain the informed consent from the customer in exchange for discounts on certain products as part of an overall loyalty program.
  6. Master data integration—The big data governance program needs to establish policies regarding the integration of big data into the master data management environment. As discussed above, a retailer needs to first define policies for the acceptable use of social media. The retailer then needs to deploy the appropriate data stewardship policies and tools to determine if the “Susie Smith” on Facebook is the same as the “Susan Smith” in the customer master.
  7. Information lifecycle management—Because of the massive increase in big data volumes, organizations will be challenged to understand the regulatory and business requirements that determine what data to retain in operational and analytical systems, what data to archive, and what data to delete. Without a high level of specificity regarding the legal and regulatory obligations of information, IT must manage all data as if it had high value and ongoing obligations, or the company faces a very high risk of improper disposal. With IT budgets continuing to be under pressure, over-managing information is a gross waste of capital resources. The program needs to expand the retention schedule to include big data based on regulations and business needs. The big data governance team needs to create pointers to the physical repositories of big data to facilitate records retention and e-discovery activities. The big data governance program needs to leverage compression and archiving policies, tools, and best practices to reduce storage costs and to improve application performance. Finally, the organization needs to defensibly dispose of big data that is no longer required based on regulations and business needs.

2.3 Industry and Functional Scenarios for Big Data Governance

Before going any further, it’s important to discuss experimentation in big data projects where the killer use case might not be known upfront. Tom Deutsch argues that organizations should adopt experimentation as a strategic process, along with a tolerance (and even encouragement) of failure as a way of advancing the business. This tolerance of experimentation is rooted in the understanding that failure is often the single best way to learn something. Deutsch further argues that organizations that are not trying and failing at validating their analytic hypotheses are actually not being aggressive enough in working with their data.4 Notwithstanding all of this, big data needs to be governed with specific reference to the analytic and operational requirements for applications that vary by industry and function. We discuss these scenarios by industry and function in the following pages.

Healthcare Industry, Scenario 1

Solution: Sentiment analysis
Big data type: Web and social media (health plans)
Disciplines: Privacy

Because of privacy regulations such as the United States Health Insurance Portability and Accountability Act (HIPAA), health plans are somewhat limited in what they can do online.

If someone posts a complaint on Twitter, the health plan might want to post a limited response and then move the conversation offline.

Healthcare Industry, Scenario 2

Solution: Patient monitoring
Big data type: M2M data (healthcare providers)
Disciplines: Data quality, information lifecycle management, privacy

A hospital leveraged streaming technologies to monitor the health of newborn babies in the neonatal intensive care unit. Using streaming technologies, the hospital was able to predict the onset of disease a full 24 hours before the onset of symptoms. The application depended on large volumes of time series data. However, the time series data was sometimes missing when a patient moved, which caused a lead to disengage and stop providing readings. In these situations, the streaming platform used linear and polynomial regressions to fill in gaps in the historical time series data. The hospital also stored both the original and modified readings in the event of a lawsuit or medical inquiry. Finally, the hospital established policies around safeguarding protected health information.

Healthcare Industry, Scenario 3

Solution: Claims analytics
Big data type: Big transaction data (health plans)
Disciplines: Data quality

A large health plan processed over 500 million claims per year, with each claims record consisting of 600 to 1,000 attributes. The plan was using predictive analytics to determine if certain proactive care was required for a small subset of members. However, the business intelligence team found that physicians were using inconsistent procedure codes to submit claims, which limited the effectiveness of the predictive analytics. The business intelligence team also analyzed the text within claims documents. The team used terms such as “chronic congestion” and “blood sugar monitoring” to determine that members might be candidates for disease management programs for asthma and diabetes, respectively.

Healthcare Industry, Scenario 4

Solution: Employee credentialing
Big data type: Biometrics (healthcare providers)
Disciplines: Privacy

The use of biometric information continues to expand in the marketplace and the workplace. For example, hospital staff members might use biometrics so that they do not have to retype their user names and passwords. Instead, they can use a biometric scan to quickly log into the system when they arrive at a bed or station. Hospitals need to work with legal counsel, however, to conduct privacy impact assessments regarding the use and retention of biometric data about their employees.

Healthcare Industry, Scenario 5

Solution: Predictive modeling based on electronic medical records
Big data type: Human-generated data (healthcare providers)
Disciplines: Data quality

The analytics department at a hospital built a predictive model, based on 150 variables and 20,000 patient encounters, to determine the likelihood that a patient would be readmitted within 30 days of treatment for congestive heart failure. The analytics team identified “smoking status” as a critical variable in their predictive models. At the outset, only 25 percent of the structured data around smoking status was populated with binary yes/no answers. However, the analytics team was able to increase the population rate for smoking status to 85 percent of the encounters by using content analytics based on electronic medical records containing doctors’ notes, discharge summaries, and patient physicals. In summary, the analytics team was able to improve the quality of sparsely populated structured data by using unstructured data sources.

Utilities Industry

Solution: Smart meters
Big data type: M2M data
Disciplines: Privacy, information lifecycle management

Several utilities are rolling out smart meters to measure the consumption of water, gas, and electricity at regular intervals of an hour or less. These smart meters generate copious amounts of “interval” data that need to be governed appropriately. Utilities need to safeguard the privacy of this interval data because it can potentially point to a subscriber’s household activities, as well as the comings and goings from his or her home. In addition, utilities need to establish policies for the archival and deletion of interval data to reduce storage costs.

Retail Industry, Scenario 1

Solution: Facebook loyalty app
Big data type: Web and social media
Disciplines: Privacy, master data integration

The marketing department at a retailer might want to use master data on customers, products, employees, and store locations to enrich its Facebook app. The success of the Facebook app depends on a strong foundation of master data management and policies pertaining to social media governance. For example, the retailer needs to adhere to the Facebook Platform Policies by not using data on a customer’s friends outside of the context of the app. In addition, the retailer needs to leverage a consistent set of identifiers to link a customer’s Facebook profile with his or her MDM record. Finally, the retailer needs to establish a robust product hierarchy to enable product comparisons. As a simple example, the retailer would need to know that a customer who purchased a “Whirlpool GX5FHDXVY” already had a product in the “refrigerator” hierarchy.

Retail Industry, Scenario 2

Solution: RFID tags
Big data type: M2M data
Disciplines: Privacy

Organizations invest in radio frequency identification (RFID) technologies to track the movement of materials across the supply chain, from the manufacturer to the distribution center and the store. However, RFID tags can create privacy issues if they are coupled with personally identifiable information. Consider an example of a retailer that does not deactivate RFID tags at the point of sale. In this situation, another retailer might be able to scan the RFID tags if a shopper is in the proximity of their outlet, to tailor marketing messages in a manner that might violate the shopper’s privacy.

Retail Industry, Scenario 3

Solution: Personalized messaging based on facial recognition and social media
Big data type: Web and social media, biometrics
Disciplines: Privacy, business process integration

A recent report from the U.S. Federal Trade Commission (FTC) reviews the potential uses of facial recognition technology in combination with social media.5 Today, retailers use facial detection software in digital signs to analyze the age and gender of viewers and deliver targeted advertisements.6 Facial detection does not uniquely identify an individual. Instead, it detects human faces and determines gender and approximate age range. In the future, digital signs and kiosks placed in supermarkets, transit stations, and college campuses could capture images of viewers and, with facial recognition software, match those faces to online identities and return advertisements based on websites that specific individuals have visited or the publicly available information contained in their social media profiles.

Retailers could also implement loyalty programs, ask users to associate a photo with the account, and then use the combined data to link the consumer to other online accounts or their in-store actions. This would enable the retailer to glean information about the consumer’s purchase habits, interests, and even movements, which could be used to offer discounts on particular products or otherwise market to the consumer. The ability of facial recognition technology to identify consumers based solely on a photograph, create linkages between the offline and online world, and compile highly detailed dossiers of information makes it especially important for companies using this technology to implement the appropriate privacy policies.

Telecommunications Industry, Scenario 1

Solution: Customer churn analytics
Big data type: Web and social media, big transaction data
Disciplines: Privacy, master data integration, data quality

Telecommunications operators build detailed customer churn models that include big transaction data such as call detail records (CDRs) and social media. However, the overall value of the churn models also depends on the quality of traditional attributes of customer master data, such as date of birth, gender, location, and income. A large operator wanted to implement a predictive analytics strategy around churn management. Analyzing subscribers’ calling patterns has empirically proved to be an effective way to predict churn. The operator decided that it needed to outsource its churn analytics to an overseas vendor. Because these CDRs needed to be shipped to the vendor on a daily basis, there was significant concern about safeguarding the privacy of customer data. After the appropriate deliberation, the operator decided to mask sensitive data such as subscriber name because the calling and receiving telephone numbers were the primary fields of value for churn analytics.

Telecommunications Industry, Scenario 2

Solution: Location-based services
Big data type: M2M data
Disciplines: Privacy

Let’s consider an example of the marketing department at a telecommunications operator that wants to unlock new sources of revenue. The marketing team wants to sell geolocation data to third parties, who can offer coupons to subscribers based on their proximity to certain retailers. However, the privacy department is concerned about the reputational and regulatory risks associated with sharing subscriber geolocation data. The big data governance program needs to balance the revenue potential from a new revenue source with the privacy risks involved.

Insurance Industry, Scenario 1

Solution: Claims investigation, underwriting
Big data type: Web and social media
Disciplines: Privacy, business process integration

Many insurance carriers now use social media to investigate claims. However, most regulators still do not permit insurers to use social media to set rates on policies during the underwriting process. For example, can a life insurer use the fact that an applicant’s Facebook profile indicates that she is a skydiver to increase her premiums because of her higher risk profile?

Insurance Industry, Scenario 2

Solution: Vehicle telematics
Big data type: M2M data
Disciplines: Information lifecycle management

An insurer instituted a pilot that offered lower rates to policyholders in exchange for the ability to place on-board sensors on motor vehicles. These sensors gathered telematics data that monitored the driving behavior of policyholders. The insurer was overwhelmed with a large amount of telematics data. It had to establish a policy regarding the retention period for the data.

Insurance Industry, Scenario 3

Solution: Claims
processing  
Big data type: Big transaction data
Disciplines: Master data integration, business process integration

Marginal improvements in claims processing and analytics can have outsized improvements on an insurer’s bottom-line. Big claims transaction data is highly dependent on high-quality reference data such as product codes, fire protection classes, and building codes. This data might be dispersed in paper format and spreadsheets. The lack of a centralized library of reference data can make it very hard to price policies and process claims because the data is in the heads of actuaries and underwriters.

Insurance Industry, Scenario 4

Solution: Underwriting
Big data type: Biometrics
Disciplines: Privacy

DNA splicing technology has become sufficiently commoditized that some companies now offer low-cost DNA-mapping services. The result is an explosion in the available DNA datasets, which opens up new opportunities as well as significant privacy issues. Most states in the U.S. have specific laws that govern the privacy of genetic information regarding insurance. The Association of British Insurers has announced a moratorium on the use of genetic testing for any type of insurance other than life insurance over £500,000. Above that amount, insurers will not use adverse predictive genetic test results unless the government has specifically approved the test.7

Oil and Gas Industry, Scenario 1

Solution: Geospatial and seismic analysis
Big data type: M2M data
Disciplines: Metadata

Oil and gas companies have big data initiatives to analyze geospatial and seismic data to discover new energy reserves and to extend the life of existing reservoirs. However, oil and gas companies need to ensure that this data is of the appropriate quality. For example, an oil company bought the same seismic data twice from an external vendor because of inconsistent naming conventions in its internal repositories. This is a metadata challenge relating to big data governance.

Oil and Gas Industry, Scenario 2

Solution: Rig and environmental monitoring
Big data type: M2M data
Disciplines: Information lifecycle management

In the past, an oil rig might have had only about 1,000 sensors, of which only about 10 fed databases that would be purged every two weeks due to capacity limitations. Today, a typical facility might have more than 30,000 sensors. Oil and gas companies also need to retain sensor data for a longer period. This information needs to be maintained well after the lifetime of the rig itself, to demonstrate adherence to environmental regulations. As a result, this information might need to be stored for 50 to 70 years, and up to 100 years in some cases. While storage is cheap, it is not free. The big data governance program needs to establish retention schedules for sensor data, and archiving policies to move information on to cheaper storage, if possible.

Consumer Packaged Goods Industry

Solution: Demand Signal Repository (DSR)
Big data type: Big transaction data
Disciplines: Business process integration, master data integration, data quality

Consumer packaged goods companies implement DSRs, which are business intelligence environments that capture, aggregate, and cleanse all customer-related data. DSRs support core processes such as forecasting and optimizing stock-keeping unit (SKU) level assortments, vendor managed inventory programs (where the manufacturer manages the inventory of products at the retailer), trade promotions, product development, and consumer segmentation. DSRs leverage multiple data sources, including Point of Sale Transaction Logs (POS TLogs) that represent highly granular transaction data from retailers. DSRs have to deal with multiple product data quality issues, including product descriptions and Universal Product Codes (UPCs) that are inconsistent for the same product across retailers. These data quality issues might result in inaccurate forecasts based on misleading demand signals.

Banking Industry, Scenario 1

Solution: Risk management
Big data type: Web and social media (web content)
Disciplines: Master data integration

Risk management departments need to update their customer hierarchies based on up-to-date financial information. As an example, when Tata Motors acquired Jaguar, the risk management department needed to update the risk hierarchy for Tata Motors to also include any exposure to Jaguar. In another example, a bank has developed an economic hierarchy to aggregate its overall exposure to a car manufacturer, its tier-one and tier-two suppliers, and the employees of the manufacturer and its suppliers. The risk management department might want to update its economic hierarchy in the event of consolidation between suppliers. All these hierarchies depend on up-to-date financial information. The risk management department might use text analytics to crawl through unstructured financial information, such as U.S. Securities and Exchange Commission (SEC) 10-K and 10-Q filings, and U.S. Federal Deposit Insurance Corporation (FDIC) call reports to dynamically update changes in company ownership, directors, and officers within its MDM hierarchies.

Banking Industry, Scenario 2

Solution: Credit, collections
Big data type: Web and social media (social media)
Disciplines: Privacy

Banks need to consider regulations such as the United States Fair Credit Reporting Act (FCRA) when using social media for credit decisions. In addition, collections departments need to adhere to regulations such as the United States Fair Debt Collection Practices Act (FDCPA), which are designed to prevent collectors from harassing debtors or infringing upon their privacy, including within social media.

Railroads Industry

Solution: Preventive maintenance
Big data type: M2M data
Disciplines: Data quality, information lifecycle management

Sensors on a modern train record more than 1,000 different types of mechanical and electrical events. These include operational events such as “opening door” or “train is braking,” warning events such as “line voltage frequency is out of range” or “compression is low in compressor X,” and failure events such as “pantograph is out of order” or “inverter lockout.” The preventive maintenance team uses predictive models to identify events that are highly correlated with preceding events.

Consider an example where failure event 1245 is preceded by warning event 2389 in 90 percent of the cases. In that example, the operations team needs to issue a work order for preventive maintenance whenever warning event 2389 is logged in the system. If the railroad has trains in its fleet from different manufacturers, the sensor data from different trains might first need to be standardized before it can be used. If a particular part fails on one train, the operations department might want to inspect similar parts on other trains. However, that might be difficult if the same part is named differently across trains. Finally, retention of sensor data is driven by business and safety-related regulatory requirements.

Education Industry

Solution: Longitudinal data warehouse
Big data type: Web and social media
Disciplines: Privacy

State governments in the U.S. are building longitudinal data warehouses with information from preschool, early learning programs, K-12 education, post-secondary education, and work force institutions to address pressing public policy questions. The source data for these so-called “P20W” data warehouses comes from a variety of agencies led by different elected officials or independent boards. Many P20W data warehouses do not always have the full post-graduation employment history for all students. The key question is whether the P20W team can use social media like LinkedIn and Facebook to update the employment status for these students. From a big data governance perspective, the P20W team will have to grapple with privacy issues that are constantly evolving.

Customer Service Function

Solution: Call monitoring, analytics of call center agents‘ notes
Big data type: Human generated
Disciplines: Master data integration, privacy

Given intense financial pressure, customer service departments are always looking for avenues to reduce operating expenses based on specific metrics such as average handle time, self-service, and first call resolution. The customer service department might leverage text analytics on call center agents’ notes to discern specific trends such as a high percentage of calls relating to a similar complaint. The customer service department might also integrate call data with customer master data to gain insight into the following:

Customer service departments also analyze voice recordings to improve operational efficiency and to support agent training. Before using this data, customer service departments need to mask the portions of the voice recordings that contain sensitive information such as Social Security number, account number, name, and address.

Information Technology Function

Solution: Log analytics
Big data type: M2M data
Disciplines: Metadata

IT departments are turning to big data to analyze application logs for slivers of insight that can improve the performance of systems. Because application vendors’ log files are in different formats, they need to be standardized first, before anything useful can be done with them.

Marketing Function

Solution: Sentiment analysis
Big data type: Web and social media
Disciplines: Master data integration, data quality, privacy

The marketing department might want to use Twitter feeds for “sentiment analysis,” to determine what users are saying about the company and its products or services. The analytics team needs to determine if, for example, references to “@Acme” and “Acme” both refer to “Acme Corporation.” Integrating sentiment analysis with a customer’s profile can also be challenging. In addition to privacy issues, the Twitter handle might reveal the user name in only 50 to 60 percent of cases. Finally, marketing might need to answer the following question, “Do we really believe that Twitter sentiment analysis is representative if users are younger and more affluent than our typical customers?”

Plant Operations Function

Solution: Operations management
Big data type: M2M data
Disciplines: Metadata

A large automotive facility makes extensive use of robots on its assembly line. The operations department at the factory would like to compare the operating performance of different robots. However, this might be difficult because the error codes generated by each robot are different, since their software versions are inconsistent.

Human Resources Function

Solution: Employment pre-screening
Big data type: Web and social media
Disciplines: Privacy

Regulations in many jurisdictions prohibit employers from using information such as a candidate’s age, marital status, religion, race, and sexual orientation in hiring decisions. Social media such as LinkedIn, Twitter, and Facebook are rich sources of such information. If an employer prescreens a prospective employee by consulting their LinkedIn profile, Tweets, or Facebook page, it becomes problematic to prove later on that they did not use this information to discriminate against a prospective employee. As of the publication of this book, the U.S. states of Maryland, Minnesota, Illinois, Washington, Massachusetts, and New Jersey either have passed or are moving towards legislation that prohibits employers from using social media to prescreen job candidates.8 The German government drafted legislation in late 2010 to prohibit employers from screening the Facebook profiles of job candidates. Although Canada does not yet have specific legislation targeted at social media, it does have tough privacy laws and privacy commissioners at the Federal and provincial levels. Obviously, legislation, regulations, and case law governing the use of social media about employees and job candidates vary by country, state, and province and are continually evolving. Employers need to work with legal counsel to draft specific policies tailored to their unique situations.

Information Security Function

Solution: Network analytics
Big data type: M2M data
Disciplines: Metadata

Security Information and Event Management (SIEM) tools aggregate log data from systems, applications, network elements, and security devices across the enterprise. SIEMs correlate the aggregated information to determine if a security incident might be taking place. SIEMs then determine the origin of the security incident and apply recourse techniques to interrupt the flow of information to prevent further data leakage.

According to a report by Jon Olstik, security intelligence demands more data.9 The report mentions that early SIEMs collected event and log data and then steadily added other data sources like NetFlow, packet capture, database activity monitoring, identity, and access management. Large enterprises now regularly collect gigabytes or even terabytes of data for security intelligence, investigations, and forensics. Finally, the report indicates that many existing tools cannot meet these scalability needs and makes the case for big data technologies.

From a big data governance perspective, security professionals need to grapple with inconsistent nomenclature across network elements such as firewalls, virtual private networks, bridges, and routers from different vendors. It is highly likely that the log files from two network elements will refer to the same event using different codes. Security professionals need to normalize these event codes prior to SIEM analytics.

Summary

There are five different types of big data: web and social media, M2M, big transaction data, biometrics, and human generated. In addition, there are seven information governance disciplines: organization, metadata, privacy, data quality, business process integration, master data integration, and information lifecycle management. This chapter provides a framework so that organizations can tailor their governance programs by big data type, information governance discipline, industry, and function.

Image

1. “An Overview of Biometric Recognition.” http://biometrics.cse.msu.edu/info.html/.

2. “Biometrics in the workplace.” Data Protection Commissioner of Ireland. http://dataprotection.ie/viewdoc.asp?DocID=244.

3. Warren, Samuel and Brandeis, Louis. “The Right to Privacy.” Harvard Law Review, Vol. IV No. 5, December 15, 1890.

4. Deutsch, Tom. “Failure Is Not a Four-Letter Word.” IBM Data Management, April 17, 2012. http://ibmdatamag.com/2012/04/failure-is-not-a-four-letter-word/.

5. Reproduced from Protecting Consumer Privacy in an Era of Rapid Change: Recommendations for Businesses and Policymakers. Federal Trade Commission report, March 2012.

6. Li, Shan and Sarno, David. “Advertisers Start Using Facial Recognition to Tailor Pitches.” Los Angeles Times, August 21, 2011. http://articles.latimes.com/2011/aug/21/business/la-fi-facial-recognition-20110821.

7. Association of British Insurers news release, April 5, 2011. http://www.abi.org.uk/Media/Releases/2011/04/Insurance_Genetics_Moratorium_extended_to_2017.aspx.

8. Stern, Joanna. “Maryland Bill Bans Employers From Facebook Passwords.” ABC News, April 11, 2012. http://abcnews.go.com/blogs/technology/2012/04/maryland-bill-bans-employers-fromfacebook-passwords/.

9. Olstik, Jon. “Insecure About Security.” February 13, 2012.