This chapter includes contributions from Rani Hublou (IBM), Deidre Paknad (IBM), and Brian M. Williams (IBM).
Because of the massive increase in data volumes, organizations are challenged to understand the regulatory and business requirements that determine what data to retain in operational and analytical systems, what data to archive, and what data to delete. Without specific knowledge of the legal and regulatory obligations attached to its information, IT must manage all data as if it had high value and ongoing obligations, or the company faces substantial risk from improper disposal. With IT budgets under continued pressure, over-managing information is a gross waste of capital resources.1
According to the 2010 Gartner study “IT Metrics: IT Spending and Staff Report,” IT costs are 3.5 percent of revenue and are under significant pressure, with 61 percent of these costs being a function of information volume. The big data governance program needs to establish policies that govern the lifecycle of big data to reduce legal risk and IT costs.
The best practices to manage the lifecycle of big data are as follows:
12.1 Understand data retention requirements for each big data type by industry and jurisdiction.
12.2 Document legal holds and support eDiscovery requests.
12.3 Compress and archive big data to reduce IT costs and improve application performance.
12.4 Manage the lifecycle of real-time, streaming data.
12.5 Retain social media records to comply with regulations and support eDiscovery requests.
12.6 Defensibly dispose of big data no longer required based on regulations and business needs.
These sub-steps are discussed in detail in the remainder of this chapter.
Every country, state, and province has unique regulations relating to data retention. As a first step, the big data governance program needs to understand the retention requirements for each big data type by industry and jurisdiction. As shown in Case Study 12.1, this is no easy task.
Because telecommunications data contains a wealth of knowledge about people’s relationships and locations, it is extremely valuable to law enforcement and counter-terrorism efforts. As a result, many countries have enacted legislation and regulations that govern the retention of this data. We discuss retention requirements across several different jurisdictions.
The United Kingdom Home Office has issued a voluntary code of practice for the retention of telecommunications data, based on the Anti-Terrorism, Crime and Security Act 2001, for the purpose of safeguarding national security.2
The European Union has determined that member states must require the retention of telecommunications data for six to 24 months.3 However, member states, including Italy, Germany, and France, have implemented these requirements inconsistently.
Italy has enacted its own retention regulations for telecommunications data.4
The German Bundestag passed legislation requiring the retention of telecommunications data for six months. However, the German Constitutional Court invalidated the law as an unconstitutional infringement of privacy rights.5
France requires internet and telephony operators to retain data for one year.6
The United States does not have any data retention legislation comparable to that of the European Union.
Thailand requires that all internet, access point, and telecommunications operators retain computer traffic data for 90 days.7 Computer traffic data is defined to include location, traffic, IP address allocations, and URLs.8
India has long required that ISPs sign the “License Agreement for Provision of Internet Services” prior to commencing operations. The agreement specifies that ISPs retain commercial records with regard to the communications exchanged on their networks for at least one year. The Information Technology Act of 2009 enacted a data retention mandate, but the specific regulations have not yet been promulgated. However, industry insiders say that SMS messages, telephone call logs, email headers, and web requests are typically archived for anywhere from three months to a year.9
As of the publication of this book, Australia does not have retention requirements for telecommunications data, although the government has been privately exploring the viability of doing so.10
Most corporations and entities are subject to litigation and governmental investigations that require them to preserve potential evidence. Large entities might have hundreds or thousands of open legal matters with varying obligations for data. For example, the eDiscovery for the first phase of the trial in the BP oil spill case exceeded nine million documents and 15 terabytes of data.11 A typical legal matter lasts three years and many last five or more years.
The information governance program needs to control legal risk and manage costs while communicating obligations to information custodians, gathering evidence, and analyzing results. As the use of big data becomes more prevalent across the enterprise, so will its use in legal matters. For example, drilling companies that are sued for oil spills might need to produce sensor data from the rig to demonstrate that they exercised appropriate caution when dealing with the associated events.
Organizations need to compress and archive their big data at rest to reduce storage costs and to improve application performance. Big data at rest includes smart meter readings, sensor data, RFID data, and web logs that might reside in relational databases, file systems, NoSQL databases, and Hadoop. Of course, organizations need to consider any local regulations that apply. For example, the German Data Access and Digital Signature Authentication Law (GDPdU) covers all tax-related documents, such as invoices and financial statements. The GDPdU requires companies that produce or receive tax documents in electronic form to retain the same format during the archival process.12 In other words, companies are not allowed to convert structured data into another format, such as PDF, when archiving it. Otherwise, a tax auditor would need to go through hundreds of PDFs to determine whether a German company has paid its taxes correctly.
Case Study 4.3 in chapter 4 describes the financial benefits associated with archiving and compressing smart meter data at a European utility. Because Hadoop avoids data loss by replicating the same data across multiple nodes in a cluster, organizations should also consider it for fault-tolerant data archiving, as shown in Case Study 12.2.
Table 12.1 describes the economics of Hadoop as a data archiving solution at a mid-sized company. The organization calculated the annual cost of existing data storage at $20,000 per terabyte, including systems administrators and software tools. The organization calculated that it could save $11,000 per terabyte despite the data replication that is inherent to Hadoop. Nevertheless, the organization needs to weigh other considerations as well. For example, a non-Hadoop, file-based approach to archiving might be immutable, which means that the data cannot be changed. That might make it the preferable approach from a compliance perspective. The organization also needs to consider the level of compression that it can achieve when archiving data in Hadoop and non-Hadoop environments.
Table 12.1: The Economics of Hadoop as a Data Archiving Solution at a Mid-sized Company

| A. | Annual cost per terabyte for existing data storage | $20,000 |
| B. | Annual cost per terabyte for data storage within Hadoop | $3,000 |
| C. | Number of times that data is replicated in Hadoop | 3 |
| D. | Annual cost per terabyte of Hadoop storage (B x C) | $9,000 |
| E. | Annual storage cost savings per terabyte (A - D) | $11,000 |
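The arithmetic behind Table 12.1 can be sketched in a few lines. The figures are the illustrative ones from the table, not benchmarks, and the replication factor of three matches the HDFS default:

```python
# Illustrative cost model from Table 12.1 (example figures, not benchmarks).
existing_cost_per_tb = 20_000   # A: annual cost per TB of existing storage
hadoop_cost_per_tb = 3_000      # B: annual cost per TB within Hadoop
replication_factor = 3          # C: HDFS default replication

effective_hadoop_cost = hadoop_cost_per_tb * replication_factor  # D = B x C
savings_per_tb = existing_cost_per_tb - effective_hadoop_cost    # E = A - D

print(f"Hadoop cost per TB: ${effective_hadoop_cost:,}")   # $9,000
print(f"Savings per TB:     ${savings_per_tb:,}")          # $11,000
```

The key point the model makes explicit is that Hadoop's raw storage price must be multiplied by the replication factor before comparing it with existing storage costs.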
As discussed in chapter 21 on big data reference architecture, LZO and Gzip have been the traditional techniques to compress data within a Hadoop environment. However, technologies like RainStor® now exist to significantly improve compression ratios within Hadoop.
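As a minimal illustration of why compression matters for archival economics, the standard library's gzip module (the same algorithm family as Hadoop's Gzip codec) can demonstrate the ratios achievable on repetitive, record-oriented data. The sample records below are synthetic, CDR-style rows invented for the example:

```python
import gzip

# Synthetic, repetitive call-detail-style records -- the kind of
# record-oriented data that compresses well in an archive.
records = "\n".join(
    f"2012-01-{day:02d},+15551234567,+15557654321,{sec}s"
    for day in range(1, 31)
    for sec in range(60)
).encode("utf-8")

compressed = gzip.compress(records)
ratio = len(records) / len(compressed)
print(f"raw={len(records)} bytes, gzip={len(compressed)} bytes, "
      f"ratio={ratio:.1f}x")
```

Real-world ratios depend heavily on the data; columnar and dictionary-based technologies such as RainStor claim substantially higher compression than general-purpose codecs on structured records.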
Information lifecycle management (ILM) is turned on its head in the context of real-time, streaming data. When data is arriving at high velocity, big data teams need to know what data is valuable and what needs to be persisted. If the streaming analytics application can make this determination “in the moment,” then it can apply ILM policies to data in motion.
For example, a streaming application might analyze sensor readings every tenth of a millisecond and store the readings every second. However, when sensor readings begin to indicate anomalous behavior, the streaming analytics application might store every reading up to and after the event. Consider Case Study 12.3, where a network monitoring system analyzes streaming data for abnormal events.
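A minimal sketch of this sampling policy follows; the anomaly threshold, the sampling interval, and the simple greater-than test standing in for the real analytics are all assumptions for illustration:

```python
ANOMALY_THRESHOLD = 100.0     # hypothetical limit for a "normal" reading
NORMAL_SAMPLE_EVERY = 10_000  # persist every 10,000th reading
                              # (~once per second at one reading per 0.1 ms)

def filter_readings(readings):
    """Yield only the readings worth persisting: a periodic sample during
    normal operation, every reading once an anomaly is observed."""
    anomalous = False
    for i, value in enumerate(readings):
        if value > ANOMALY_THRESHOLD:
            anomalous = True  # switch to full-fidelity capture
        if anomalous or i % NORMAL_SAMPLE_EVERY == 0:
            yield i, value
```

During normal operation, only one reading in 10,000 is persisted; from the first anomalous reading onward, everything is, which is exactly the "in the moment" ILM decision described above.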
A network monitoring system analyzes NetFlow data from different routers. Each NetFlow record contains statistical information from network routers, such as the source IP address, the destination IP address, and the number of bytes and packets. The network monitoring application profiles the data in real time and compares it with historical norms. It might observe an increase in traffic to a social media website at 9:00 a.m., when employees begin their workdays. However, suppose it notices an abnormally large volume of outgoing network traffic to a previously unknown destination. That might be a sign of exfiltration (data leaving the company’s network).
The network monitoring system accomplishes real-time analytics by keeping a portion of network history in memory. The security operations team needs to determine how much data should live in memory. For example, it might decide to keep two hours’ worth of NetFlow records in memory and persist the readings to disk every minute for historical analysis.
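One way to sketch the in-memory history described in the case study: keep recent NetFlow records in a deque bounded by age, and flag destinations whose outbound byte counts are either previously unseen or far above the historical norm. The two-hour window and the tenfold "abnormal" multiplier are assumptions taken from, or invented to illustrate, the case study:

```python
from collections import deque

WINDOW_SECONDS = 2 * 60 * 60   # keep two hours of history in memory
SPIKE_FACTOR = 10.0            # hypothetical "abnormal" multiplier

class FlowWindow:
    """A sliding, in-memory window of NetFlow records."""

    def __init__(self):
        self.window = deque()  # entries: (timestamp, dest_ip, bytes_out)

    def add(self, ts, dest_ip, bytes_out):
        self.window.append((ts, dest_ip, bytes_out))
        # Evict records older than the two-hour window.
        while self.window and self.window[0][0] < ts - WINDOW_SECONDS:
            self.window.popleft()

    def is_spike(self, dest_ip, bytes_out):
        """Flag traffic to an unknown destination, or traffic that far
        exceeds the in-window average for that destination."""
        history = [b for _, d, b in self.window if d == dest_ip]
        if not history:
            return True  # previously unknown destination
        avg = sum(history) / len(history)
        return bytes_out > SPIKE_FACTOR * avg
```

A production system would persist the window to disk every minute, as the case study notes, so that flagged events can later be correlated with full historical data.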
As more and more communications move into social media, they will increasingly become subject to legal discovery. This means that companies also need to exercise the appropriate governance over archiving and retention of social media. According to Green, there have been a number of court decisions in the United States dealing with the discovery of social media content. The United States Stored Communications Act of 1986 limits service providers from divulging the contents of users’ communications and data. Recent court decisions have held that the act is applicable to legal discovery of social media content as well. However, while service providers may be forbidden from divulging social media content, individuals may still be compelled to produce it.13
The United States Financial Industry Regulatory Authority (FINRA) issued Regulatory Notice 10-06 in 2010, which requires financial institutions to retain communications with customers through social media sites. According to Green, social media creates specific challenges around controlling the retention and preservation of content. Data is generally outside of corporate control and is subject to the retention policies of the service provider. This has given rise to the development of tools and services specifically for social media archiving and retention, which should be considered alongside similar initiatives for email and other content.14 We list some of these tools in chapter 21, on big data reference architecture.
Many organizations believe that keeping data forever is a good response to legal requirements. Actually, the converse is true. Any data, whether in electronic or paper format, is subject to legal discovery if it exists anywhere in the organization, whether in a storage cabinet, an employee’s desk drawer, a server, or on a thumb drive. In a 2010 survey by the Compliance, Governance and Oversight Council titled “Information Governance Benchmark Report in Global 1000 Companies,” 75 percent of respondents cited the inability to defensibly dispose of data as their greatest challenge. Many highlighted massive legacy data as a financial drag on the business and a compliance hazard. The big data governance program needs to establish policies that require the deletion of big data based on the retention schedule, unless it is subject to legal holds. As an example, if the retention schedule requires telecommunications CDRs to be kept for two years, then all records should be deleted after that period, except for the subset that is subject to legal holds.
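The CDR disposal policy just described can be sketched as a simple filter. The two-year retention period comes from the example; the record fields and the legal-hold lookup are illustrative:

```python
from datetime import date, timedelta

RETENTION = timedelta(days=2 * 365)  # two-year retention schedule for CDRs

def records_to_dispose(records, legal_hold_ids, today):
    """Return the IDs eligible for defensible disposal: records past the
    retention period AND not subject to any legal hold."""
    return [
        rec_id
        for rec_id, created in records
        if today - created > RETENTION and rec_id not in legal_hold_ids
    ]

# Hypothetical records: one expired, one expired but on legal hold,
# one still within the retention period.
cdrs = [
    ("cdr-001", date(2009, 1, 1)),
    ("cdr-002", date(2009, 1, 1)),
    ("cdr-003", date(2011, 12, 1)),
]
print(records_to_dispose(cdrs, {"cdr-002"}, date(2012, 1, 1)))  # ['cdr-001']
```

The essential governance point is the order of the tests: a record must clear both the retention schedule and the legal-hold check before it can be defensibly deleted.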
The lifecycle of big data needs to be managed for data at rest and for data in motion. By managing the lifecycle of big data, organizations can reduce IT costs, improve application performance, defensibly dispose of information, respond to eDiscovery requests, and address emerging issues relating to social media.
1. “2010 Information Governance Benchmark Report in Global 1000 Companies.” The Compliance, Governance and Oversight Council.
2. “Retention of Communications Data” under “Part 11: Anti-Terrorism, Crime and Security Act 2001, Voluntary Code of Practice,” United Kingdom Home Office. http://en.wikipedia.org/wiki/Telecommunications_data_retention#Home_Office_Voluntary_Code_of_Practice_on_Data_Retention.
3. Directive 2006/24/EC of the European Parliament and of the Council, March 15, 2006.
4. “Italy decrees data retention until 31 December 2007.” EDRIGram, August 10, 2005. http://www.edri.org/edrigram/number3.16/Italy.
5. “German court orders stored telecoms data deletion.” BBC News, March 2, 2010. http://news.bbc.co.uk/2/hi/8545772.stm.
6. http://www.edri.org/book/export/html/844.
7. “Data Retention Mandates: A Threat to Privacy, Free Expression and Business Development.” The Center for Democracy and Technology, October 2011.
8. Ibid.
9. Abraham, Sunil. “Does the Government want to enter our homes?” The Center for Internet and Society, August 13, 2010, http://www.cis-india.org/advocacy/igov/blog/government-enter-homes.
10. Grubb, Ben. “No Minister: 90% of web snoop document censored to stop ‘premature unnecessary debate.’” The Sydney Morning Herald, July 23, 2010. http://www.smh.com.au/technology/technology-news/no-minister-90-of-web-snoop-document-censored-to-stop--premature-unnecessary-debate-20100722-10mxo.html.
11. http://legaltalkmedia.com/LTN/TDD/TDD_051412_BPOil.mp3.
12. http://www.cgrey.be/rm_map/Germany.html.
13. Green, Steven W. “Social Media and E-Discovery: Impact and Influence.” Bloomberg Law Reports, January 24, 2012.
14. Ibid.