This chapter includes contributions from Rani Hublou (IBM), Deidre Paknad (IBM), and Brian M. Williams (IBM).
Because of the massive increase in data volumes, organizations are challenged to understand the regulatory and business requirements that determine what data to retain in operational and analytical systems, what data to archive, and what data to delete. Without specific knowledge of the legal and regulatory obligations attached to its information, IT must manage all data as if it had high value and ongoing obligations, or the company faces substantial risk from improper disposal. With IT budgets under continued pressure, over-managing information is a gross waste of capital resources.1
According to the 2010 Gartner study “IT Metrics: IT Spending and Staff Report,” IT costs are 3.5 percent of revenue and are under significant pressure, with 61 percent of these costs being a function of information volume. The big data governance program needs to establish policies that govern the lifecycle of big data to reduce legal risk and IT costs.
The best practices to manage the lifecycle of big data are as follows:
12.1 Understand data retention requirements for each big data type by industry and jurisdiction.
12.2 Document legal holds and support eDiscovery requests.
12.3 Compress and archive big data to reduce IT costs and improve application performance.
12.4 Manage the lifecycle of real-time, streaming data.
12.5 Retain social media records to comply with regulations and support eDiscovery requests.
12.6 Defensibly dispose of big data no longer required based on regulations and business needs.
These sub-steps are discussed in detail in the remainder of this chapter.
Every country, state, and province has unique regulations relating to data retention. As a first step, the big data governance program needs to understand the retention requirements for each big data type by industry and jurisdiction. As shown in Case Study 12.1, this is no easy task.
Because telecommunications data contains a wealth of knowledge about people’s relationships and locations, it is extremely valuable to law enforcement and counter-terrorism efforts. As a result, many countries have enacted legislation and regulations that govern the retention of this data. We discuss retention requirements across several different jurisdictions.
The United Kingdom Home Office has issued a voluntary code of practice for the retention of telecommunications data, based on the Anti-Terrorism, Crime and Security Act 2001, for the purpose of safeguarding national security.2
The European Union has determined that member states must require the retention of telecommunications data for six to 24 months.3 However, member states, including Italy, Germany, and France, have implemented these requirements inconsistently.
Italy has enacted its own retention regulations for telecommunications data.4
The German Bundestag passed legislation requiring the retention of telecommunications data for six months. However, the German Constitutional Court invalidated the law as an unconstitutional infringement of privacy rights.5
France requires internet and telephony operators to retain data for one year.6
The United States does not have any data retention legislation comparable to that of the European Union.
Thailand requires that all internet, access point, and telecommunications operators retain computer traffic data for 90 days.7 Computer traffic data is defined to include location, traffic, IP address allocations, and URLs.8
India has long required that ISPs sign the “License Agreement for Provision of Internet Services” prior to commencing operations. The agreement specifies that ISPs retain commercial records with regard to the communications exchanged on their networks for at least one year. The Information Technology Act of 2009 enacted a data retention mandate, but the specific regulations have not yet been promulgated. However, industry insiders say that SMS messages, telephone call logs, email headers, and web requests are typically archived for anywhere from three months to a year.9
As of the publication of this book, Australia does not have retention requirements for telecommunications data, although the government has been privately exploring the viability of doing so.10
Most corporations and entities are subject to litigation and governmental investigations that require them to preserve potential evidence. Large entities might have hundreds or thousands of open legal matters with varying obligations for data. For example, the eDiscovery for the first phase of the trial in the BP oil spill case exceeded nine million documents and 15 terabytes of data.11 A typical legal matter lasts three years and many last five or more years.
The information governance program needs to control legal risk and manage costs while communicating obligations to information custodians, gathering evidence, and analyzing results. As the use of big data becomes more prevalent across the enterprise, so will its use in legal matters. For example, drilling companies that are sued for oil spills might need to produce sensor data from the rig to demonstrate that they exercised appropriate caution when dealing with the associated events.
Organizations need to compress and archive their big data at rest to reduce storage costs and to improve application performance. Big data at rest includes smart meter readings, sensor data, RFID data, and web logs that might reside in relational databases, file systems, NoSQL databases, and Hadoop. Of course, organizations need to consider any local regulations that apply. For example, the German Data Access and Digital Signature Authentication Law (GDPdU) covers all tax-related documents, such as invoices and financial statements. The GDPdU requires companies that produce or receive tax documents in electronic form to retain the same format during the archival process.12 In other words, companies are not allowed to convert structured data into another format, such as PDF, when archiving it. Otherwise, a tax auditor would need to go through hundreds of PDFs to determine whether a German company has paid its taxes correctly.
Case Study 4.3 in chapter 4 describes the financial benefits associated with archiving and compressing smart meter data at a European utility. Because Hadoop avoids data loss by replicating the same data across multiple nodes in a cluster, organizations should also consider it for fault-tolerant data archiving, as shown in Case Study 12.2.
Table 12.1 describes the economics of Hadoop as a data archiving solution at a mid-sized company. The organization calculated the annual cost of existing data storage at $20,000 per terabyte, including systems administrators and software tools. The organization calculated that it could save $11,000 per terabyte despite the data replication that is inherent to Hadoop. Nevertheless, the organization needs to weigh other considerations as well. For example, a non-Hadoop, file-based approach to archiving might be immutable, which means that the data cannot be changed. That might make it the preferable approach from a compliance perspective. The organization also needs to consider the level of compression that it can achieve when archiving data in Hadoop and non-Hadoop environments.
Table 12.1: The Economics of Hadoop as a Data Archiving Solution at a Mid-sized Company

| A. | Annual cost per terabyte for existing data storage | $20,000 |
| B. | Annual cost per terabyte for data storage within Hadoop | $3,000 |
| C. | Number of times that data is replicated in Hadoop | 3 |
| D. | Annual cost per terabyte of Hadoop storage (B x C) | $9,000 |
| E. | Annual storage cost savings per terabyte (A - D) | $11,000 |
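The arithmetic behind Table 12.1 can be sketched in a few lines. The figures are the illustrative ones from the table, not benchmarks, and the replication factor of three matches the HDFS default:

```python
# Illustrative cost model from Table 12.1 (example figures, not benchmarks).
existing_cost_per_tb = 20_000   # A: annual cost per TB of existing storage
hadoop_cost_per_tb = 3_000      # B: annual cost per TB within Hadoop
replication_factor = 3          # C: HDFS default replication

effective_hadoop_cost = hadoop_cost_per_tb * replication_factor  # D = B x C
savings_per_tb = existing_cost_per_tb - effective_hadoop_cost    # E = A - D

print(f"Hadoop cost per TB: ${effective_hadoop_cost:,}")   # $9,000
print(f"Savings per TB:     ${savings_per_tb:,}")          # $11,000
```

The key point the model makes explicit is that Hadoop's raw storage price must be multiplied by the replication factor before comparing it with existing storage costs.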
As discussed in chapter 21 on big data reference architecture, LZO and Gzip have been the traditional techniques to compress data within a Hadoop environment. However, technologies like RainStor® now exist to significantly improve compression ratios within Hadoop.
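As a minimal illustration of why compression matters for archival economics, the standard library's gzip module (the same algorithm family as Hadoop's Gzip codec) can demonstrate the ratios achievable on repetitive, record-oriented data. The sample records below are synthetic, CDR-style rows invented for the example:

```python
import gzip

# Synthetic, repetitive call-detail-style records -- the kind of
# record-oriented data that compresses well in an archive.
records = "\n".join(
    f"2012-01-{day:02d},+15551234567,+15557654321,{sec}s"
    for day in range(1, 31)
    for sec in range(60)
).encode("utf-8")

compressed = gzip.compress(records)
ratio = len(records) / len(compressed)
print(f"raw={len(records)} bytes, gzip={len(compressed)} bytes, "
      f"ratio={ratio:.1f}x")
```

Real-world ratios depend heavily on the data; columnar and dictionary-based technologies such as RainStor claim substantially higher compression than general-purpose codecs on structured records.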
Information lifecycle management (ILM) is turned on its head in the context of real-time, streaming data. When data is arriving at high velocity, big data teams need to know what data is valuable and what needs to be persisted. If the streaming analytics application can make this determination “in the moment,” then it can apply ILM policies to data in motion.
For example, a streaming application might analyze sensor readings every tenth of a millisecond and store the readings every second. However, when sensor readings begin to indicate anomalous behavior, the streaming analytics application might store every reading up to and after the event. Consider Case Study 12.3, where a network monitoring system analyzes streaming data for abnormal events.
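A minimal sketch of this sampling policy follows; the anomaly threshold, the sampling interval, and the simple greater-than test standing in for the real analytics are all assumptions for illustration:

```python
ANOMALY_THRESHOLD = 100.0     # hypothetical limit for a "normal" reading
NORMAL_SAMPLE_EVERY = 10_000  # persist every 10,000th reading
                              # (~once per second at one reading per 0.1 ms)

def filter_readings(readings):
    """Yield only the readings worth persisting: a periodic sample during
    normal operation, every reading once an anomaly is observed."""
    anomalous = False
    for i, value in enumerate(readings):
        if value > ANOMALY_THRESHOLD:
            anomalous = True  # switch to full-fidelity capture
        if anomalous or i % NORMAL_SAMPLE_EVERY == 0:
            yield i, value
```

During normal operation, only one reading in 10,000 is persisted; from the first anomalous reading onward, everything is, which is exactly the "in the moment" ILM decision described above.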
A network monitoring system analyzes NetFlow data from different routers. Each NetFlow record contains statistical information from network routers, such as the source IP address, the destination IP address, and the number of bytes and packets. The network monitoring application profiles the data in real time and compares it with historical norms. It might observe an increase in traffic to a social media website at 9:00 a.m., when employees begin their workdays. However, suppose it notices an abnormally large volume of outgoing network traffic to a previously unknown destination. That might be a sign of exfiltration (data leaving the company’s network).
The network monitoring system accomplishes real-time analytics by keeping a portion of network history in memory. The security operations team needs to determine how much data should live in memory. For example, it might decide to keep two hours’ worth of NetFlow records in memory and persist the readings to disk every minute for historical analysis.
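One way to sketch the in-memory history described in the case study: keep recent NetFlow records in a deque bounded by age, and flag destinations whose outbound byte counts are either previously unseen or far above the historical norm. The two-hour window and the tenfold "abnormal" multiplier are assumptions taken from, or invented to illustrate, the case study:

```python
from collections import deque

WINDOW_SECONDS = 2 * 60 * 60   # keep two hours of history in memory
SPIKE_FACTOR = 10.0            # hypothetical "abnormal" multiplier

class FlowWindow:
    """A sliding, in-memory window of NetFlow records."""

    def __init__(self):
        self.window = deque()  # entries: (timestamp, dest_ip, bytes_out)

    def add(self, ts, dest_ip, bytes_out):
        self.window.append((ts, dest_ip, bytes_out))
        # Evict records older than the two-hour window.
        while self.window and self.window[0][0] < ts - WINDOW_SECONDS:
            self.window.popleft()

    def is_spike(self, dest_ip, bytes_out):
        """Flag traffic to an unknown destination, or traffic that far
        exceeds the in-window average for that destination."""
        history = [b for _, d, b in self.window if d == dest_ip]
        if not history:
            return True  # previously unknown destination
        avg = sum(history) / len(history)
        return bytes_out > SPIKE_FACTOR * avg
```

A production system would persist the window to disk every minute, as the case study notes, so that flagged events can later be correlated with full historical data.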
As more and more communications move into social media, they will increasingly become subject to legal discovery. This means that companies also need to exercise the appropriate governance over archiving and retention of social media. According to Green, there have been a number of court decisions in the United States dealing with the discovery of social media content. The United States Stored Communications Act of 1986 limits service providers from divulging the contents of users’ communications and data. Recent court decisions have held that the act is applicable to legal discovery of social media content as well. However, while service providers may be forbidden from divulging social media content, individuals may still be compelled to produce it.13
The United States Financial Industry Regulatory Authority (FINRA) issued Regulatory Notice 10-06 in 2010, which requires financial institutions to retain communications with customers through social media sites. According to Green, social media creates specific challenges around controlling the retention and preservation of content. Data is generally outside of corporate control and is subject to the retention policies of the service provider. This has given rise to the development of tools and services specifically for social media archiving and retention, which should be considered alongside similar initiatives for email and other content.14 We list some of these tools in chapter 21, on big data reference architecture.
Many organizations believe that keeping data forever is a good response to legal requirements. Actually, the converse is true. Any data, whether in electronic or paper format, is subject to legal discovery if it exists anywhere in the organization, whether in a storage cabinet, an employee’s desk drawer, a server, or on a thumb drive. In a 2010 survey by the Compliance, Governance and Oversight Council titled “Information Governance Benchmark Report in Global 1000 Companies,” 75 percent of respondents cited the inability to defensibly dispose of data as their greatest challenge. Many highlighted massive legacy data as a financial drag on the business and a compliance hazard. The big data governance program needs to establish policies that require the deletion of big data based on the retention schedule, unless it is subject to legal holds. As an example, if the retention schedule requires telecommunications CDRs to be kept for two years, then all records should be deleted after that period, except for the subset that is subject to legal holds.
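The CDR disposal policy just described can be sketched as a simple filter. The two-year retention period comes from the example; the record fields and the legal-hold lookup are illustrative:

```python
from datetime import date, timedelta

RETENTION = timedelta(days=2 * 365)  # two-year retention schedule for CDRs

def records_to_dispose(records, legal_hold_ids, today):
    """Return the IDs eligible for defensible disposal: records past the
    retention period AND not subject to any legal hold."""
    return [
        rec_id
        for rec_id, created in records
        if today - created > RETENTION and rec_id not in legal_hold_ids
    ]

# Hypothetical records: one expired, one expired but on legal hold,
# one still within the retention period.
cdrs = [
    ("cdr-001", date(2009, 1, 1)),
    ("cdr-002", date(2009, 1, 1)),
    ("cdr-003", date(2011, 12, 1)),
]
print(records_to_dispose(cdrs, {"cdr-002"}, date(2012, 1, 1)))  # ['cdr-001']
```

The essential governance point is the order of the tests: a record must clear both the retention schedule and the legal-hold check before it can be defensibly deleted.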
The lifecycle of big data needs to be managed for data at rest and for data in motion. By managing the lifecycle of big data, organizations can reduce IT costs, improve application performance, defensibly dispose of information, respond to eDiscovery requests, and address emerging issues relating to social media.
1. “2010 Information Governance Benchmark Report in Global 1000 Companies.” The Compliance, Governance and Oversight Council.
2. “Retention of Communications Data” under “Part 11: Anti-Terrorism, Crime and Security Act 2001, Voluntary Code of Practice,” United Kingdom Home Office. http://en.wikipedia.org/wiki/Telecommunications_data_retention#Home_Office_Voluntary_Code_of_Practice_on_Data_Retention.
3. Directive 2006/24/EC of the European Parliament and of the Council, March 15, 2006.
4. “Italy decrees data retention until 31 December 2007.” EDRIGram, August 10, 2005. http://www.edri.org/edrigram/number3.16/Italy.
5. “German court orders stored telecoms data deletion.” BBC News, March 2, 2010. http://news.bbc.co.uk/2/hi/8545772.stm.
6. http://www.edri.org/book/export/html/844.
7. “Data Retention Mandates: A Threat to Privacy, Free Expression and Business Development.” The Center for Democracy and Technology, October 2011.
8. Ibid.
9. Abraham, Sunil. “Does the Government want to enter our homes?” The Center for Internet and Society, August 13, 2010, http://www.cis-india.org/advocacy/igov/blog/government-enter-homes.
10. Grubb, Ben. “No Minister: 90% of web snoop document censored to stop ‘premature unnecessary debate.’” The Sydney Morning Herald, July 23, 2010. http://www.smh.com.au/technology/technology-news/no-minister-90-of-web-snoop-document-censored-to-stop--premature-unnecessary-debate-20100722-10mxo.html.
11. http://legaltalkmedia.com/LTN/TDD/TDD_051412_BPOil.mp3.
12. http://www.cgrey.be/rm_map/Germany.html.
13. Green, Steven W. “Social Media and E-Discovery: Impact and Influence.” Bloomberg Law Reports, January 24, 2012.
14. Ibid.