Chapter 19

Security and Governance for Big Data Environments

In This Chapter

Considering security requirements for big data

Governing big data

Approaching a big data ecosystem

Extending governance for big data analytics

Developing a secure big data environment

Maintaining a safe big data environment

In many areas of data management, people assume that the data being leveraged for analysis and planning is well vetted and secure. It has typically been through a data-cleansing and -profiling process so that it can be trusted. The world of big data introduces a new set of obstacles that make security and governance harder. Many individuals and organizations working with big data assume that they do not have to worry about security or governance, so little thought or planning goes into either. This all changes, however, when those big data sources become operational. In this chapter, we present the issues that you need to think about and plan for when you begin to leverage big data sources as part of your analysis and planning process.

Security in Context with Big Data

While companies are very concerned about the security and governance of their data in general, they are often unprepared for the complexities of managing big data.

Information governance is the capability to create an information resource that can be trusted by employees, partners, and customers, as well as government organizations.

Big data analysis is often conducted with a vast array of data that might come from many unvetted sources. Additionally, your organization needs to be aware of the security and governance policies that apply to various big data sources. Your organization might be looking to determine the importance of large amounts of new data culled from many different unstructured or semi-structured sources. Does your newly sourced data contain protected health information (PHI) covered by the Health Insurance Portability and Accountability Act (HIPAA), or personally identifiable information (PII) such as names and addresses? Once you acquire the data, you will expose your company to compliance issues if the data is not managed securely. Some of this data will not be needed and must be properly disposed of. The data that remains will need to be secured and governed. Therefore, whatever your information management strategy is, you will need a well-defined security strategy.

Security is something you can never really relax about because the state of the art is constantly evolving. A governance strategy needs to go hand in hand with your security strategy. The combination of security and governance will ensure accountability by all parties involved in your information management deployment. Managing the security of information needs to be viewed as a shared responsibility across the organization. You can implement all the latest technical security controls and still face security risks if your end users don’t have a clear understanding of their role in keeping all the data that they are working with secure.

Assessing the risk for the business

Big data is becoming critical to business executives who are trying to understand new product direction and customer requirements or understand the health of their overall environment. However, if the data from a variety of sources introduces security risks into the company, unintended consequences can endanger the company. You have a lot to consider, and understanding security is a moving target, especially with the introduction of big data into the data management landscape. Ultimately, education is key to ensuring that everyone in the organization has an understanding of his or her roles and responsibilities with regard to security.

Risks lurking inside big data

While security and governance are corporate-wide issues that companies have to focus on, some differences specific to big data are worth remembering. For example, if you are collecting data from unstructured data sources such as social media sites, you have to make sure that viruses or bogus links are not buried in the content. If you take this data and make it part of your analytics system, you could be putting your company at risk. Also, keep in mind what the original source of this data might be. An unstructured data source that has interesting commentary about the type of customer you are trying to understand may also include extraneous noise. You need to know the nature of this data source. Has the data been verified? Is it secure and vetted against intrusion? The more reputable social media sites, for example, will watch closely for patterns of malicious behavior and delete those accounts before they cause damage. This requires a level of sophisticated big data analysis that not all sites are capable of. Your organization may discover a wonderful site and select its data as part of your big data platform, only to learn that the site has been hacked. The consequences can be serious. Not all security threats are deliberate, either. You don’t want to incorporate a big data source that includes sensitive personally identifiable information that could put your customers and your company’s reputation at risk.
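One way to picture this kind of vetting is a simple screen that flags obvious personal data or links before a new source is merged into your analytics platform. The following is a minimal sketch, not a complete scanner: the pattern names, the `screen_text` function, and the sample post are all hypothetical, and a real screen would need far broader coverage (and would not, by itself, detect malware).

```python
import re

# Illustrative patterns only; these names and expressions are assumptions.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "url": re.compile(r"https?://\S+"),
}


def screen_text(text: str) -> dict:
    """Count matches of each pattern so a reviewer can decide what to quarantine."""
    counts = {}
    for name, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            counts[name] = len(hits)
    return counts


# Hypothetical social media post pulled from an unvetted source.
post = "Great service! Reach me at pat@example.com or see http://example.test/win"
findings = screen_text(post)
if findings:
    print("hold for review:", findings)
```

A screen like this only tells you that something needs a closer look. Deciding whether to mask, discard, or keep the flagged content remains a governance decision.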

Understanding Data Protection Options

Some experts believe that different kinds of data require different forms of protection and that, in some cases in a cloud environment, data encryption might, in fact, be overkill. You could encrypt everything. You could encrypt data, for example, when you write it to your own hard drive, when you send it to a cloud provider, and when you store it in a cloud provider’s database. You could encrypt at every layer.

Encrypting everything in a comprehensive way reduces your exposure; however, encryption carries a performance penalty and adds management overhead. For example, many experts advise managing your own keys rather than letting a cloud provider do so, and that can become complicated. Keeping track of too many keys can be a nightmare. Additionally, encrypting everything can create other issues. For example, if you’re trying to encrypt data in a database, you will have to protect the data as it’s moving (point-to-point encryption) and also while it’s being stored in the database. This procedure can be costly and complicated. Also, even when you think you’ve encrypted everything and you’re safe, that may not be the case.

One of the long-standing weaknesses with encryption strategies is that your data is at risk before and after it’s encrypted. For example, in a major data breach at Hannaford Supermarkets in 2008, the hackers hid in the network for months and were able to steal payment data when customers used their credit card at the point of sale. This breach took place before the data was encrypted.

Maintaining a large number of keys can be impractical, and managing the storing, archiving, and accessing of the keys is difficult. To alleviate this problem, generate and compute encryption keys as needed to reduce complexity and improve security.
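One common way to compute keys as needed is key derivation: you protect a single master key and recompute each per-record key from it on demand, so the individual keys never have to be stored. The sketch below uses HKDF (RFC 5869) built from Python’s standard `hmac` and `hashlib` modules. It is an illustration under assumptions, not a prescribed method: the `key_for_record` helper, the dataset and record names, and the use of `os.urandom` as a stand-in for a properly managed master key are all hypothetical.

```python
import hashlib
import hmac
import json
import os


def hkdf_sha256(master_key: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """Derive `length` bytes from master_key using HKDF (RFC 5869) with SHA-256."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: condense the master key into a fixed-size pseudorandom key.
    prk = hmac.new(salt or b"\x00" * hash_len, master_key, hashlib.sha256).digest()
    # Expand: stretch the pseudorandom key, binding the output to `info`.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def key_for_record(master_key: bytes, dataset: str, record_id: str) -> bytes:
    """Recompute a per-record key on demand instead of storing it (hypothetical helper)."""
    info = json.dumps([dataset, record_id]).encode("utf-8")
    return hkdf_sha256(master_key, info)


# Assumption: in practice the master key would live in a key management system.
master_key = os.urandom(32)
claims_key = key_for_record(master_key, "claims", "42")
```

With this approach, only the master key needs to be stored, archived, and access-controlled. The trade-off is that the master key becomes a single point of failure, so it needs correspondingly strong protection.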

Here are some other available data-safeguarding techniques:

Data anonymization: When data is anonymized, you remove all data that can be uniquely tied to an individual (such as a person’s name, Social Security number, or credit card number). Although this technique can protect personal identity, and hence privacy, you need to be really careful about how much information you strip out. If you don’t remove enough, hackers can still figure out whom the data pertains to.

Tokenization: This technique protects sensitive data by replacing it with random tokens or alias values that mean nothing to someone who gains unauthorized access to this data. This technique decreases the chance that thieves could do anything with the data. Tokenization can protect credit card information, passwords, personal information, and so on. Some experts argue that it’s more secure than encryption.

Cloud database controls: In this technique, access controls are built into the database to protect the whole database so that each piece of data doesn’t need to be encrypted.
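To make the first two techniques more concrete, here is a minimal sketch of anonymization and tokenization. Everything in it is hypothetical: `TokenVault` is an in-memory stand-in for what would normally be a separately secured token service, and the sample record and field names are invented for illustration.

```python
from __future__ import annotations

import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to sensitive values."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only callers with access to the vault can recover the original value.
        return self._token_to_value[token]


def anonymize(record: dict, direct_identifiers: set[str]) -> dict:
    """Return a copy of the record without fields that directly identify a person."""
    return {k: v for k, v in record.items() if k not in direct_identifiers}


vault = TokenVault()
record = {"name": "Pat Example", "card_number": "4111111111111111", "zip": "02134", "total": 58.20}
safe = anonymize(record, {"name"})
safe["card_number"] = vault.tokenize(record["card_number"])
```

Notice that the anonymized record still carries a zip code. Whether fields like that are enough to re-identify someone is exactly the judgment call the anonymization item warns about.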

The Data Governance Challenge

Data governance is important to your company no matter what your data sources are or how they are managed. In the traditional world of data warehouses or relational database management, it is likely that your company has well-understood rules about how data needs to be protected. For example, in the healthcare world, it is critical to keep patient data private. You may be able to store and analyze data about patients as long as names, Social Security numbers, and other personal data are masked. You have to make sure that unauthorized individuals cannot access private or restricted data. What happens when you flood your environment with big data that comes from a variety of sources? Some of these sources will be commercial third-party vendors that have carefully vetted the data and masked out sensitive data.

However, it is quite likely that the big data sources may be insecure and unprotected, and include a lot of personal data. During initial processing of this data, you will probably analyze lots of data that will not turn out to be relevant to your organization. Therefore, you don’t want to invest resources to protect and govern data that you do not intend to retain. However, if sensitive personal data passes across your network, you may expose your company to unanticipated compliance requirements. For data that is truly exploratory, with unknown contents, it might be safer to perform the initial analysis in a “walled” environment that is internal but segmented, or in the cloud.

Finally, after you decide that a subset of that data is going to be analyzed more deeply so that results may be incorporated into your business process, it is important to institute a process of carefully applying governance requirements to that data.

What issues should you consider when you incorporate these unvetted sources into your environment? Consider the following:

Determine beforehand who is allowed to access new data sources initially as well as after the data has been analyzed and understood.

Understand how this data will be segregated from other companies’ data.

Understand what your responsibilities are when you leverage the data. If the data is privately owned, you have to make sure that you are adhering to contracts or rules of use. Some data may be linked to a usage contract with a vendor.

Understand where your data will be physically located. You may include data that is linked to customers or prospects in specific countries that have strict privacy requirements. You need to be aware of the details of these sources to avoid violating regulations.

Understand how your data needs to be treated if it is physically moved from one location to another. Are you going to store some of this data with a cloud provider? What type of promises will that provider offer in terms of where the data will be stored, and how well it will be secured?

Just because you have created a security and governance process for your traditional data sources doesn’t mean that you can assume that employees and partners will expand those rules to new data sources. You need to consider two key issues: visibility of the data and the trust of those working with the data.

Visibility: While the business analysts and partners you are working with may be eager to use these new data sources, you may not be aware of how this data will be used and controlled. In other words, you may have limited visibility into resources that run outside of your control. This situation is especially troublesome if you need to ensure that your provider is following compliance regulations or laws. It is also a concern when you use a cloud provider to manage that data, because low-cost storage can make it tempting to place data there.

Unvetted employees: Although your company may go through an extensive background check on all of its employees, you’re now trusting that no malicious insiders work in various business units outside of IT. You also have to assume that your cloud provider has diligently checked its employees. This concern is real because close to 50 percent of security breaches are caused by insiders (or by people getting help from insiders). If your company is going to use these new data sources in a highly distributed manner, you need to have a plan to deal with inside as well as outside threats.

You have a responsibility to make sure that your new big data sources do not open your company to unanticipated threats or governance risks. It is your responsibility to have good security, governance processes, and education in place across your entire information management environment.

As with any technology life cycle, you need a process for assessing whether your organization and all of its constituents are ready to follow security and governance requirements. You may already have processes for data security, privacy, and governance in place for your existing structured databases and data warehouses. These processes need to be extended for your big data implementation.

For example, is the chief security officer of the company aware of the new data sources being used in the various businesses? Is it clear how you are allowed to use third-party data sources? If you begin to incorporate proprietary data into your big data environment, your company may be violating copyright rules. When you create a big data environment that brings in a lot of new data sources, have you exposed private data that should be masked? At the same time, are you adhering to the data privacy policies of the different countries that you are operating in?

Auditing your big data process

At the end of the day, you have to be able to demonstrate to internal and external auditors that you are meeting the rules necessary to support the operations of the business. You will need a way to show logs or other evidence that the data you are using is secure and clean. You will need to explain the sources of that data. Will you be able to validate the results so that you minimize the risk to the company? You may have to prove that you have archived the data that you are using to make decisions and run the business. This may be well managed for your traditional databases and your data warehouse, but your unstructured big data sources may not yet have been added to this process.
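One way to produce log evidence that auditors can trust is to make the log tamper-evident, so that any later change to an entry can be detected. The sketch below chains each entry to the previous one with a SHA-256 hash. It is an illustration only: the field names, the `append_entry` and `verify` functions, and the sample actors and sources are hypothetical, and a production audit trail would also need secure storage and access controls.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(entry: dict) -> str:
    """Hash every field except the stored hash itself."""
    payload = {k: v for k, v in entry.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, actor: str, action: str, source: str) -> dict:
    """Record who did what to which data source, linked to the prior entry."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "source": source,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry["hash"] = _digest(entry)
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Return True only if no entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev or entry["hash"] != _digest(entry):
            return False
        prev = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "etl_job", "ingest", "customer_forum_feed")
append_entry(audit_log, "data_steward", "mask_pii", "customer_forum_feed")
```

Because each entry records the source it touched, the same log can also help you explain where a given set of data came from.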

Although external auditors may not check the data in your data warehouse for consistency with external big data sources, your internal process will dictate that these sources be well synchronized. For example, the data warehouse will have a clear set of master data definitions, but the big data sources may not have documented metadata. Therefore, it is important that external data sources be managed in a way that metadata definitions are codified so that you can have a set of consistent metadata across these sources. Thinking through this process can make the difference between business success and failure.

Identifying the key stakeholders

One of the characteristics of big data is that it is typically tied to specific business initiatives. For example, the Marketing organization wants to be able to use the huge volumes of data generated by social media sites such as Facebook, Twitter, and so on. Operations teams will want to manage their supply chain leveraging RFID data. The Human Resources department will be eager to keep track of what employees are publishing on social media sites to make sure that they are not violating internal and external regulations. A medical claims department will want to keep track of the regulations determining how patient claim information within health insurance records is managed so that privacy rules are not violated. All of these constituents may reside within the same company, so it is critical that everyone has a common understanding of what the rules are and that the infrastructure is in place to keep the company consistently safe.

Putting the Right Organizational Structure in Place

Typically, companies begin their journey to big data by starting with an experiment to see whether big data can play an important role in defining and impacting business strategy. However, after it becomes clear that big data will have a strategic role as part of the information management environment, you have to make sure that the right structure is in place to support and protect the organization.

Before you establish policies, you first have to know what you are dealing with. For example, are you going to involve transactional systems, social media data, or machine-generated data? Do you intend to combine information from these different sources as part of your data analytics strategy? If you are planning to move forward with more than an isolated experiment, you will need to update your governance strategy so that you are prepared to manage a new variety of data in ways that are safe.

Preparing for stewardship and management of risk

No matter what your information management strategy is, you need to make sure that you have the right level of oversight. This is simply a best practice in general and does not change when you add big data to the mix. However, you may need to implement data stewardship differently with the addition of big data sources. For example, you might need to have a different individual monitor social media data because it has a different origin and different structure than traditional relational data. This new data steward role needs to be carefully defined so that the individual selected can work across the business units that find this type of data most relevant to how they are analyzing the business. For example, the data steward needs to understand or have access to the right people who understand the company’s data retention policy as well as the requirements for masking out personal data no matter where that data originates.

Setting the right governance and quality policies

The way that an organization deals with big data is an ongoing cycle and not a one-time project. The risk to the business can be serious if rules and processes are not applied consistently. Data quality should also be approached from a governance standpoint. When you think about policy, here are some of the key elements that need to be codified to protect your organization:

Determine the best practices that your peers have implemented, and document consistent policies so that everyone has the same understanding of what is required.

Compare your policies with the governance requirements for your own business and your industry. Update your policies if you find oversights.

Do you have a policy about the length of time that you must hold on to information? Do these policies apply to the data you are collecting from external sources, such as customer discussion groups and social media sites?

What is the importance of the data sources that you are bringing into the business? Do you have quality standards in place so that a set of data is only used for decision making if it is proven to be clean and well documented? It is easy to get caught up in the excitement of leveraging big data to conduct the type of analysis that was never achievable before. But if that analysis leads to incorrect conclusions, your business will be at risk. Even data coming from sensors could be impacted by extraneous data that will cause an organization to come to the wrong conclusion.
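A retention policy is easier to apply consistently when it is written down in a form that software can check. The following is a minimal sketch of that idea. The source types and retention periods are invented for illustration and are not recommendations, and the `is_past_retention` function is hypothetical.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, keyed by source type.
RETENTION_DAYS = {
    "social_media": 90,
    "customer_forum": 365,
    "transactions": 7 * 365,
}


def is_past_retention(source_type: str, collected_on: date, today: date) -> bool:
    """Return True when a record has been held longer than its policy allows."""
    if source_type not in RETENTION_DAYS:
        # No policy on file: surface the gap rather than guessing a default.
        raise KeyError(f"no retention policy defined for {source_type!r}")
    return today - collected_on > timedelta(days=RETENTION_DAYS[source_type])
```

Raising an error for a source type with no policy is a deliberate choice in this sketch: a new external source should prompt a policy decision rather than silently inherit one.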

Developing a Well-Governed and Secure Big Data Environment

A thoughtful approach to security can succeed in mitigating many security risks, and in a big data environment that approach starts with assessing your current state.

A great place to begin is by answering a set of questions that can help you form your data security strategy. Here are a few important questions to consider:

Have you evaluated your own traditional data security approach?

How do you control access rights to the data in your applications, your databases, and your warehouse, both those within your company and those from third-party sources? Who has the right to access existing data resources as well as the new big data sources you are introducing? How do you ensure that only the right identities gain access to your applications and information?

Can you identify data vulnerabilities and risks and then correct any weaknesses?

Do you have a way of tracking your security risk over time so that you can easily share updated information with those who need it?

Is your overall infrastructure protected at all times from external security threats? If not, this could be the weak link that could seriously impact the security of your data.

Do you maintain your own keys if you are using encryption, or do you get them from a trusted, reliable provider? Do you use standard algorithms? Have you applied this standard to new data sources that you have determined are critical to your business?

Are you able to monitor and quantify security risks in real time?

Can you implement security and governance policies consistently across all types of data sources, including ones that reside in a cloud environment?

Can you protect all your data no matter where it’s stored?

Can you satisfy auditing and reporting requirements for data wherever it resides?

Can you meet the compliance requirements of your industry?

What are your disaster recovery plans? How do you ensure service continuity for all your critical data sources?
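Several of these questions come down to whether access rules are explicit and applied the same way to every data source. As a rough sketch of what that can look like, the example below checks requests against a single policy table and denies anything that is not explicitly granted. The roles, source names, and `is_allowed` function are hypothetical and stand in for whatever identity and access management system you actually use.

```python
# Hypothetical policy: (role, data source) -> actions that role may perform.
POLICY = {
    ("analyst", "social_media_raw"): frozenset({"read"}),
    ("data_steward", "social_media_raw"): frozenset({"read", "mask", "delete"}),
    ("analyst", "claims_masked"): frozenset({"read"}),
}


def is_allowed(roles, source: str, action: str) -> bool:
    """Deny by default; allow only if some role is explicitly granted the action."""
    return any(action in POLICY.get((role, source), frozenset()) for role in roles)


# An analyst can read the raw social media feed but cannot delete from it.
can_read = is_allowed(["analyst"], "social_media_raw", "read")
can_delete = is_allowed(["analyst"], "social_media_raw", "delete")
```

Keeping traditional and big data sources in the same policy table is one way to avoid the situation where new sources quietly fall outside your existing rules.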