6
Data Classification and Categorization
In this chapter you will
•  Learn basic terminology associated with data classification and categorization
•  Discover the basic approaches to data classification and categorization
•  Examine data ownership issues
•  Examine data labeling issues and types of data
•  Explore elements of the data lifecycle associated with software security
One of the key elements in security is identifying which assets are critical from a security point of view. Data is one of the key assets in an enterprise; it is the item that many criminals seek when breaking into systems, and it has both tangible and intangible value to the enterprise. Managing the data asset portfolio is a challenge: which elements need how much protection, and from whom?
Enterprise data, by its very name, is data that flows through an organization, providing value to aspects of the business. A typical enterprise has multiple data flows, with data activities on different sets of data having various lifecycles. Data is created, accessed, transmitted, stored, manipulated, and deleted. Different business processes deal with these activities across different data flows, creating a complex web of data manipulation and storage. In all but the smallest organizations, maintaining a complete understanding of all of the data flows and the business implications is virtually impossible. To manage this situation, the problem is broken into pieces, where data is classified and labeled, and responsibility for management is distributed with the data flows.
Data Classification
Data can be classified in several different ways, each with a level of significance for security. Data can be classified as to its state, its use, or its importance from a security perspective. As these are overlapping schemes, it is important to understand all aspects of data classification before determining how data should be handled as part of the development process. Data classification is a risk management tool, with the objective of reducing the costs associated with protecting data. One of the tenets of security is to match the level of protection and cost of security to the value of the asset under protection. Data classification is one of the tools used to align protection and asset value in the enterprise.
Data classification can be simple or fairly complex, depending on the size and scope of the enterprise. A small enterprise may be sufficiently covered with a simple strategy, whereas a large enterprise may have a wide range of data protection needs. In large enterprises, it may be desirable to actually determine separate, differing protection needs based on data attributes such as confidentiality, integrity, and availability. These attributes may be expanded to include specific compliance requirements.
One way of looking at data-related security is to take a data-centric view of the security process. Always examining what protections are needed for the data across the entire lifecycle can reveal weaknesses in enterprise protection schemes.
Data States
Data can be considered a static item that exists in a particular state at any given time. For purposes of development and security, these states are
•  At rest, or being stored
•  Being created
•  Being transmitted from one location to another
•  Being changed or deleted
In addition, one should consider where the data is currently residing:
•  On permanent media (hard drive, CD/DVD)
•  On removable or remote media (USB, cloud/hosted storage)
•  In RAM on a machine
When considering data states, it is easy to expand this idea to the information lifecycle management (ILM) model, which includes generation, retention, and disposal. This is covered later in this chapter.
Data Usage
Data can also be classified as to how it is going to be used in a system. This is meant to align the data with how it is used in the business and provide clues as to how it should be shared, if appropriate. The classifications include
    •   Internal data    Data initialized in the application, used in an internal representation, or computed within the application itself
    •   Input data    Data read into a system and possibly stored in an internal representation or used in a computation and stored
    •   Output data    Data written to an output destination following processing
In addition, data can be considered security sensitive, marked as containing personally identifiable information (PII) or to be hidden. These categories include
    •   Security-sensitive data    A subset of data of high value to an attacker
    •   PII data    Data that contains PII elements
    •   Hidden data    Data that should be concealed to protect it from unauthorized disclosure using obfuscation techniques
Data Risk Impact
Data can be classified as to the specific risk associated with the loss of the data. This classification is typically labeled high, medium, or low, although additional caveats such as PII or compliance may be added to flag elements that raise privacy or compliance issues if disclosed.
Data that is labeled high risk is data that if disclosed or lost could result in a severe or catastrophic adverse effect on assets, operations, or people. The definition of severe will vary from firm to firm in financial terms, as what is severe for a small firm may be of no consequence to a multinational firm.
Data that is labeled medium risk is data that if disclosed would have serious consequences. Low-risk data is data that has limited, if any, consequences if lost or disclosed. Each firm, as part of its data management plan, needs to determine the appropriate definitions of severe, serious, and limited, both from a dollar loss point of view and from an operational impact and people impact point of view.
Additional labels, such as PII, compliance-related, or for official use only, can be used to alert the development team as to specialized requirements associated with data elements.
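The risk labels and additional caveats described above can be sketched in code. The following Python fragment is only an illustration under stated assumptions: the names RiskLevel, Caveat, DataElement, and needs_special_handling are hypothetical, and the rule used to flag an element is one possible choice, not a prescribed one.

```python
from dataclasses import dataclass, field
from enum import Enum, IntEnum


class RiskLevel(IntEnum):
    """Risk associated with loss or disclosure (LOW < MEDIUM < HIGH)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class Caveat(Enum):
    """Additional labels that alert developers to specialized requirements."""
    PII = "PII"
    COMPLIANCE = "compliance-related"
    FOUO = "for official use only"


@dataclass(frozen=True)
class DataElement:
    name: str
    risk: RiskLevel
    caveats: frozenset = field(default_factory=frozenset)


def needs_special_handling(element: DataElement) -> bool:
    """Assumed rule: flag high-risk elements and anything carrying a caveat."""
    return element.risk is RiskLevel.HIGH or bool(element.caveats)


# Hypothetical data elements, invented for this example.
customer_ssn = DataElement("customer_ssn", RiskLevel.HIGH, frozenset({Caveat.PII}))
lunch_menu = DataElement("lunch_menu", RiskLevel.LOW)
```

Using an ordered enumeration for the risk level lets later code compare levels directly, while the caveats remain an independent set of flags.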
Data Ownership
Data does not really belong to people in the enterprise; it is actually the property of the enterprise or company itself. That said, the enterprise has limited methods of acting except through the actions of its employees, contractors, and other agents. For practical reasons, data will be assigned to people in a form of ownership or stewardship role. Ownership is a business-driven issue: the factors behind data ownership responsibilities are business reasons.
Data Owner
Data owners act in the interests of the enterprise in managing certain aspects of data. The data owner is the party who determines who has specific levels of access associated with specific data elements: who can read, who can write, who can change, delete, and so on. The owner is not to be confused with the custodian, the person who actually has the responsibility for making the change. A good example is in the case of database records. The owner of the data for the master chart of accounts in the accounting system may be the chief financial officer (CFO), but the ability to directly change it may reside with a database administrator (DBA).
   
EXAM TIP   Data owners are responsible for defining data classification, defining authorized users and access criteria, and defining the appropriate security controls and making sure they are implemented and operational.
This brings forth the issue of data custodians, or people who have the ability to directly interact with the data. Data owners define the requirements, while data custodians are responsible for implementing the desired actions.
Data Custodian
Data custodians support the business use of the data in the enterprise. As such, they are responsible for ensuring that the processes safely transport, manipulate, and store the data. Data custodians are aware of the data management policies issued by the data owners and are responsible for ensuring that during operations these rules and regulations are followed.
   
EXAM TIP   Data custodians are responsible for maintaining defined security controls, managing authorized users and access controls, and performing operational tasks such as backups and data retention and disposal.
Data custodians may not require access to read the data elements. They do need appropriate access to apply policies to the data elements. Without appropriate segregation of data controls to ensure custodians can only manage the data without actually reading the data, confidentiality is exposed to a larger set of people, a situation that may or may not be desired.
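The split between owner and custodian can be illustrated with a small sketch. This is a simplified model, not a real access control system: AccessPolicy, PolicyStore, and the user names are hypothetical, and the check that only the owner may change a policy is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Access rules for one data element, as defined by its data owner."""
    owner: str
    readers: set = field(default_factory=set)
    writers: set = field(default_factory=set)


class PolicyStore:
    """Holds owner-defined policies and answers access questions."""

    def __init__(self):
        self._policies = {}

    def define(self, element, policy, acting_user):
        """Owner action: set or replace the policy for a data element."""
        current = self._policies.get(element)
        # Simplification: the first definition trusts the owner named in the policy.
        expected_owner = current.owner if current else policy.owner
        if acting_user != expected_owner:
            raise PermissionError(f"{acting_user} does not own {element}")
        self._policies[element] = policy

    def can_read(self, element, user):
        policy = self._policies.get(element)
        return policy is not None and (user == policy.owner or user in policy.readers)

    def can_write(self, element, user):
        policy = self._policies.get(element)
        return policy is not None and user in policy.writers


store = PolicyStore()
store.define(
    "chart_of_accounts",
    AccessPolicy(owner="cfo", readers={"controller"}, writers={"dba"}),
    acting_user="cfo",
)
```

In this sketch the DBA, acting as custodian, can change the data but is not granted read access, which mirrors the segregation of controls discussed above.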
Labeling
Because data can exist in the enterprise for an extended period of time, it is important to label the data in a manner that can ensure it is properly handled. For data in the enterprise, metadata fields, which hold data about the data, can be used to support data labeling. The metadata can be used to support the protection of the data by providing a means to ensure a label describing the importance of the data is coupled with it.
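One way to couple a label with the data is to store both in the same object. The sketch below assumes a simple JSON envelope; the field names (owner, sensitivity, impact) and the functions label_record and read_label are hypothetical choices for illustration only.

```python
import json


def label_record(payload, *, owner, sensitivity, impact):
    """Wrap a payload with metadata so the label is stored alongside the data."""
    return {
        "metadata": {"owner": owner, "sensitivity": sensitivity, "impact": impact},
        "data": payload,
    }


def read_label(record):
    """Return the label; treat unlabeled data as an error rather than guessing."""
    try:
        return record["metadata"]
    except KeyError:
        raise ValueError("record has no classification label") from None


# Hypothetical payroll record, invented for this example.
record = label_record(
    {"employee_id": 1001, "salary": 85000},
    owner="HR director",
    sensitivity="restricted",
    impact="high",
)

# The label survives serialization because it is part of the stored object.
restored = json.loads(json.dumps(record))
```

Rejecting unlabeled records, rather than assuming a default, is a design choice that forces classification to happen before data is handled.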
Sensitivity
Data can have different levels of sensitivity within an organization. Payroll data can be sensitive, with employees having restricted access. But some employees, based on position, may need specific access to this type of data. A manager has a business reason to see and interact with salary and performance data for people under his or her direct management, but not others. HR personnel have business reasons to access data such as this, although in these cases the access may not be just by job title or position, but also by current job task. Understanding and properly managing sensitive data can prevent issues should it become public knowledge or disclosed. The commonsense approach is built around business purpose—if someone has a legitimate business purpose, they should have appropriate access. If not, then they should not have access. The challenge is in defining the circumstances and building the procedures and systems to manage data according to sensitivity. Fortunately, the range of sensitive data is typically limited in most organizations.
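The business-purpose rule for salary data can be sketched as a simple check. The reporting structure, the HR task assignments, and the function may_view_salary are all hypothetical; a production system would draw these from authoritative HR and workflow sources.

```python
# Hypothetical reporting structure: employee -> direct manager.
MANAGER_OF = {"alice": "carol", "bob": "carol", "carol": "dave"}

# Hypothetical HR assignments: HR user -> employees covered by the current task.
HR_ACTIVE_TASKS = {"hank": {"alice"}}


def may_view_salary(viewer: str, subject: str) -> bool:
    """Return True only when the viewer has a business purpose for access."""
    if viewer == subject:
        return True  # assumption: employees may see their own record
    if MANAGER_OF.get(subject) == viewer:
        return True  # direct manager of the subject
    if subject in HR_ACTIVE_TASKS.get(viewer, set()):
        return True  # HR access tied to the current job task, not just job title
    return False
```

Note that a manager two levels up (dave, in this example) is not granted access, because the rule in the text covers only people under direct management.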
Impact
The impact that data can have when improperly handled is a much wider concern than sensitivity. Virtually all data in the enterprise can, and should, be classified by impact. Data can be classified by the impact the organization would suffer in the event of data loss, disclosure, or alteration. Impact is a business-driven function, and although highly qualitative in nature, if the levels of high, medium, and low impact are clearly defined, then the application of the impact designation is fairly straightforward:
   
•  Typically, three levels are used: high, medium (or moderate), and low.
•  Separate levels of impact may be defined by data element for each attribute. For example, a specific data element could have high impact for confidentiality and high for integrity, but low for availability.
EXAM TIP   NIST FIPS 199 and SP 800-18 provide a framework for classifying data based on impacts across the three standard dimensions: confidentiality, integrity, and availability.
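Per-attribute impact levels can be represented as a small record. The sketch below is illustrative: Impact and SecurityCategory are hypothetical names, and the overall() method, which takes the highest of the three levels, is a simplifying assumption added for the example rather than something this chapter prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass(frozen=True)
class SecurityCategory:
    """Impact level for each attribute of a single data element."""
    confidentiality: Impact
    integrity: Impact
    availability: Impact

    def overall(self) -> Impact:
        # Assumption: summarize with the highest per-attribute level.
        return max(self.confidentiality, self.integrity, self.availability)


# The example from the text: high confidentiality and integrity, low availability.
element = SecurityCategory(
    confidentiality=Impact.HIGH,
    integrity=Impact.HIGH,
    availability=Impact.LOW,
)
```

Keeping the three attributes separate preserves the detail needed to choose controls, since a data element with low availability impact may not justify the same redundancy as one rated high.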
The first step in impact analysis is defining the levels of high, medium, and low. The idea behind the high level is to set the bar high enough that only a reasonably small number of data elements are included. The exception to this is when the levels are set with some specific criteria associated with people. The differentiators for the separation of high, medium, and low can be based on impact to people, impact on customers, and financial impact. Table 6-1 shows a typical breakdown of activity.
Table 6-1    Summary of Impact Level Definitions
Each organization needs to define its own financial limits—a dollar loss that would be catastrophic to some organizations is a rounding error to others. The same issue revolves around customer-related issues—what is severe in some industries is insignificant in others. Each organization needs to completely define each of the levels summarized in Table 6-1 for its own use.
Types of Data
Data can come in many forms and it can be separated into two main types: structured and unstructured. Databases hold a lot of enterprise data, yet many studies have shown that the largest quantity of information is unstructured data in elements such as office documents, spreadsheets, and emails. The type of data can play a role in determining the appropriate method of securing it.
Structured
Structured data has defined structures and is managed via those structures. The most common form of structured data is that stored in databases. Other forms of structured data include formatted file structures, Extensible Markup Language (XML) data, and certain types of text files, such as log files. The structure allows a parser to go through the data, sort, and search.
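As a brief illustration of how structure enables parsing, sorting, and searching, the following sketch reads a small XML document with Python's standard library. The XML content, element names, and the load_parts function are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical parts list, invented for this example.
PARTS_XML = """<parts>
  <part id="P-200"><name>Gasket</name><price>1.25</price></part>
  <part id="P-100"><name>Bearing</name><price>4.50</price></part>
  <part id="P-300"><name>Bolt</name><price>0.10</price></part>
</parts>"""


def load_parts(xml_text):
    """Parse the XML into a list of dictionaries, one per part."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": part.get("id"),
            "name": part.findtext("name"),
            "price": float(part.findtext("price")),
        }
        for part in root.iter("part")
    ]


parts = load_parts(PARTS_XML)

# Sort by price, then search for parts under a price threshold.
by_price = sorted(parts, key=lambda p: p["price"])
cheap = [p["name"] for p in parts if p["price"] < 2.0]
```

Because every record follows the same layout, a few lines of code can locate any value; the same question asked of free-form documents would require dedicated discovery tools.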
Unstructured
Unstructured data is the rest of the data in a system. Although it may be structured per some application such as Microsoft Word, the structure is irregular and not easily parsed and searched. It is also more difficult to modify outside the originating application. Unstructured data makes up the vast majority of data in most firms, but its unstructured nature makes it more difficult to navigate and manage. A good example of this is examining the number that represents the sales totals for the previous quarter. This can be found in databases, in the sales application system, in Word documents and PDFs describing the previous quarter’s performance, and in emails between senior executives. When searching for specific data items, some of these sources are easily navigated (the structured ones, such as financial scorecards), while it is virtually impossible to find items in emails and word processing or PDF documents without the use of enterprise data archival and discovery tools.
Data Lifecycle
Data in the enterprise has a lifecycle. It can be created, used, stored, and even destroyed. Although data storage devices have come down in price, the total cost of storing data in a system is still a significant resource issue. Data that is stored must also be managed from a backup and business continuity/disaster recovery perspective. Managing the data lifecycle is a data owner’s responsibility. Ensuring the correct sets of data are properly retained is a business function, one that is best defined by the data owner.
Generation
Data can be generated in the enterprise in many ways. It can be generated in a system as a result of operations, or it can be generated as a function of some input. Regardless of the path to generation, data that is generated has to be managed at this point—is it going to be persistent or not? If the data is going to be persistent, then it needs to be classified and have the appropriate protection and destruction policies assigned. If it is not going to be persistent—that is, it is some form of temporary display or calculation—then these steps are not necessary.
Retention
Data that is going to be persistent in the system, or stored, must have a series of items defined: who is the data owner, what is the purpose of storing the data, what levels of protection will be needed, and how long will it need to be stored? These are just some of the myriad of questions that need answers. Data that is retained in an enterprise becomes subject to the full treatment of data owner and data custodian responsibilities. The protection schemes need to be designed, not just for the primary storage, but for alternative forms of storage as well, such as backups, copies for disaster recovery (DR) sites, and legal hold archives.
System logs are an important element to consider in both security and retention. Data in log files can contain sensitive information, thus necessitating protection, and appropriate retention schemes need to be devised. Log files are important elements in legal hold, e-discovery, and many compliance elements. Proper log data security and planning of lifecycle elements are important for CSSLPs to consider throughout the development process.
Disposal
Data destruction serves two primary purposes in the enterprise. First, it serves to conserve resources by ceasing to spend resources on data retention for elements that have no further business purpose. Second, it can serve to limit data-based liability in specific legal situations. The length of storage required is set by two factors: business purpose and compliance. Once data has reached its end of life as defined by both of these requirements, it is the data custodian’s task to ensure it is appropriately disposed of from all appropriate sources. Alternative sources, such as backups, copies for DR sites, data warehouse history, and other copies, need to be managed. Legal hold data is managed separately and is not subject to normal disposal procedures.
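The disposal decision can be sketched as a small calculation. The record fields, the function names, and the retention periods below are hypothetical; the sketch assumes that data must be kept for the longer of the business and compliance periods and that legal hold blocks normal disposal.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RetainedRecord:
    name: str
    created: date
    business_retention_days: int
    compliance_retention_days: int
    legal_hold: bool = False
    # Every location holding a copy must be covered at disposal time.
    locations: tuple = ("primary", "backup", "dr_site", "data_warehouse")


def disposal_date(record: RetainedRecord) -> date:
    """End of life: the longer of the business and compliance periods."""
    days = max(record.business_retention_days, record.compliance_retention_days)
    return record.created + timedelta(days=days)


def eligible_for_disposal(record: RetainedRecord, today: date) -> bool:
    """Legal hold data is excluded from normal disposal procedures."""
    if record.legal_hold:
        return False
    return today >= disposal_date(record)


# Hypothetical record, invented for this example.
invoice = RetainedRecord("invoice_2020_0001", date(2020, 1, 1), 365, 730)
```

When a record becomes eligible, the custodian would work through each entry in locations so that backups and other copies are not left behind.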
Chapter Review
In this chapter, you examined the effects and implications of data classification and categorization. Data can be classified by a variety of means, but the principal ones are all linked to business requirements and impacts. The aspects of data classification are part of an overall information lifecycle management approach that is implemented by roles such as data custodians as defined by data owners.
Data labeling to assist in managing data protection due to impact requirements was covered, as were the lifecycle elements: generation, retention, and disposal. The interconnected nature of data classification and ownership on the information lifecycle management approach was covered as well.
Quick Tips
•  Data classification is a risk management tool, with the objective of reducing the costs associated with protecting data.
•  Data is typically in one of four states: being stored (at rest), being created, being transmitted from one place to another, and being processed (changed or deleted).
•  Data management responsibilities are typically split between data owners and data custodians.
•  Data can be characterized and labeled based on its relative sensitivity and business impact.
•  Data can be categorized by its formatting and structure.
Questions
To further help you prepare for the CSSLP exam, and to provide you with a feel for your level of preparedness, answer the following questions and then check your answers against the list of correct answers found at the end of the chapter.
   1.   The party that determines which users or groups should have access to specific data elements is:
          A.   Data custodian
          B.   Data manager
          C.   System administrator
          D.   Data owner
   2.   HR and payroll data should be classified by which methodology?
          A.   Utility
          B.   Impact
          C.   Structured
          D.   Sensitivity
   3.   Which of the following would not be considered structured data?
          A.   Excel spreadsheet of parts prices
          B.   Oracle database of customer orders
          C.   XML file of parts and descriptions
          D.   Log file of VPN failures
   4.   Which of the following is not a stage of the data lifecycle?
          A.   Retention
          B.   Disposal
          C.   Sharing
          D.   Generation
   5.   The party responsible for defining data classification is:
          A.   Data custodian
          B.   Senior manager (CIO)
          C.   Security management
          D.   Data owner
   6.   To match the level of protection desired for data, which of the following elements is used?
          A.   Data classification
          B.   Impact analysis
          C.   Data usage
          D.   Security rules
   7.   Which of the following is not a type of data in a system?
          A.   Security sensitive
          B.   PII
          C.   Hidden
          D.   Encrypted
   8.   When deleting data at the end of its life, consideration should be given to copies. Which of the following copies is not necessary to specifically manage?
          A.   Shadow copies
          B.   Backups
          C.   DR sites (hot sites)
          D.   Data warehouse history
   9.   Managing authorized users and access controls for data is a responsibility of:
          A.   Security analyst/technician
          B.   Data owner
          C.   System administrator
          D.   Data custodian
10.   The standard categories of risk associated with impact analysis include:
          A.   Financial impact, people impact, security impact
          B.   Time impact, people impact, financial impact
          C.   Financial impact, people impact, customer impact
          D.   Time impact, customer impact, people impact
11.   Data retention is primarily driven by what?
          A.   Business requirements
          B.   Security requirements
          C.   Storage space requirements
          D.   Government regulation
12.   If the loss of confidentiality of a data element would have no effect on the enterprise, this data element would be in which risk category?
          A.   High
          B.   Low
          C.   Safe
          D.   Moderate or medium
13.   Retention requirements for data in a system are determined by:
          A.   Business requirements
          B.   Storage space
          C.   Data sensitivity
          D.   Data impact
14.   Data classification is performed at which stage of the lifecycle model?
          A.   Data retention
          B.   Disposal
          C.   Generation
          D.   Data reduction
15.   The party responsible for performing operational tasks associated with data retention and disposal is:
          A.   Backup operator
          B.   Data owner
          C.   Data custodian
          D.   Security personnel
Answers
   1.  D. The data owner is the party who determines who has specific levels of access associated with specific data elements.
   2.  D. HR and payroll data can be sensitive to operations and need to be controlled.
   3.  A. Microsoft Office files are considered unstructured data.
   4.  C. The stages of the lifecycle are generation, retention, and disposal.
   5.  D. Data owners are responsible for defining data classification.
   6.  A. Data classification is one of the tools used to align protection and asset value in the enterprise. Data classification is more inclusive than the other answers.
   7.  D. Although data can be encrypted to protect it, this is considered a method, not a type of data.
   8.  A. Shadow copies will be cleaned up by the operating system as part of the deletion and recycle process.
   9.  D. Data custodians are responsible for managing authorized users and access controls for data. Data owners define the relationship; the custodians enforce the operation.
10.  C. The standard categories used in impact analysis are impact on people, finances, and customers.
11.  A. The primary driver is business requirements. A business may have legitimate business requirements that exceed compliance requirements.
12.  B. Low risk includes limited or no consequence associated with failure.
13.  A. Business requirements define storage requirements. Business requirements must take all business factors, including compliance, into account.
14.  C. Data classification efforts need to be performed every time a new data type is created.
15.  C. Data custodians are responsible for performing the operational tasks associated with data retention and disposal.