Chapter 9

Security Operations

IN THIS CHAPTER

check Understanding investigations

check Applying security operations concepts and controls

check Responding to incidents

check Preparing for disasters

check Keeping facilities and personnel safe

The Security Operations domain covers lots of essential security concepts and builds on many of the other security domains, including Security and Risk Management (Chapter 3), Asset Security (Chapter 4), Security Architecture and Engineering (Chapter 5), and Communication and Network Security (Chapter 6). Security operations encompasses the routine activities that occur across many of the CISSP domains. This domain represents 13 percent of the CISSP certification exam.

Understand and Support Investigations

Conducting investigations for various purposes is an important function for security professionals. You must understand evidence collection and handling procedures, reporting and documentation requirements, various investigative processes, and digital forensics tools and techniques. Successful conclusions in investigations depend heavily on proficiency in these skills.

Evidence collection and handling

Evidence is information presented in a court of law to confirm or dispel a fact that’s under contention, such as the commission of a crime, the violation of policy, or an ethics matter. A case can’t be brought to trial or other legal proceeding without sufficient evidence to support the case. Thus, properly gathering and protecting evidence is one of the most important and most difficult tasks that an investigator must master.

Important evidence collection and handling topics covered on the CISSP exam include the types of evidence, rules of evidence, admissibility of evidence, chain of custody, and the evidence lifecycle.

Types of evidence

Sources of legal evidence that you can present in a court of law generally fall into one of four major categories:

  • Direct evidence: Oral testimony or a written statement based on information gathered through a witness’s five senses (in other words, an eyewitness account) that proves or disproves a specific fact or issue.
  • Real (or physical) evidence: Tangible objects from the actual crime, such as the tools or weapons used and any stolen or damaged property. May also include visual or audio surveillance tapes generated during or after the event. Physical evidence from a computer crime is not always available.
  • Documentary evidence: Includes originals and copies of business records, computer-generated and computer-stored records, manuals, policies, standards, procedures, and log files. Most evidence presented in a computer crime case is documentary evidence. The hearsay rule (which we discuss in the section “Hearsay rule,” later in this chapter) is an extremely important test of documentary evidence that must be understood and applied to this type of evidence.
  • Demonstrative evidence: Used to aid the court’s understanding of a case. Opinions are considered demonstrative evidence and may be either expert (based on personal expertise and facts) or non-expert (based on facts only). Other examples of demonstrative evidence include models, simulations, charts, and illustrations.

Other types of evidence that may fall into one or more of the above major categories include

  • Best evidence: Original, unaltered evidence, which is preferred by the court over secondary evidence. Read more about this evidence in the section “Best evidence rule,” later in this chapter.
  • Secondary evidence: A duplicate or copy of evidence, such as a tape backup, screen capture, or photograph.
  • Corroborative evidence: Supports or substantiates other evidence presented in a case.
  • Conclusive evidence: Incontrovertible and irrefutable — you know, the smoking gun.
  • Circumstantial evidence: Relevant facts that you can’t directly or conclusively connect to other events, but about which a reasonable person can make a reasonable inference.

Rules of evidence

Important rules of evidence for computer crime cases include the best evidence rule and the hearsay evidence rule. The CISSP candidate must understand both of these rules and their applicability to evidence in computer crime cases.

BEST EVIDENCE RULE

The best evidence rule, defined in the Federal Rules of Evidence, states that “to prove the content of a writing, recording, or photograph, the original writing, recording, or photograph is [ordinarily] required.”

However, the Federal Rules of Evidence define an exception to this rule as “[i]f data are stored in a computer or similar device, any printout or other output readable by sight, shown to reflect the data accurately, is an ‘original’.”

Thus, data extracted from a computer — if that data is a fair and accurate representation of the original data — satisfies the best evidence rule and may normally be introduced into court proceedings as such.

HEARSAY RULE

Hearsay evidence is evidence that’s not based on personal, first-hand knowledge of a witness, but rather comes from other sources. Under the Federal Rules of Evidence, hearsay evidence is normally not admissible in court. This rule exists to prevent unreliable testimony from improperly influencing the outcome of a trial.

Business records, including computer records, have traditionally, and perhaps mistakenly, been considered hearsay evidence by most courts because these records cannot be proven accurate and reliable. Getting computer records admitted as evidence is therefore one of the most significant obstacles for a prosecutor to overcome in a computer crime case.

tip A prosecutor may be able to introduce computer records as best evidence, rather than hearsay evidence, which we discuss in the preceding section.

Several courts have acknowledged that the hearsay rules are applicable to computer-stored records containing human statements but are not applicable to computer-generated records untouched by human hands.

Perhaps the most successful and commonly applied test of admissibility for computer records, in general, has been the business records exception, established in the U.S. Federal Rules of Evidence, for records of regularly conducted activity, meeting the following criteria:

  • Made at or near the time (contemporaneously) that the act occurred.
  • Made by a person who has knowledge of the business process or from information transmitted by a person who has knowledge of the business process.
  • Made and relied on during the regular conduct of business or in the furtherance of the business, as verified by the custodian or other witness familiar with the records’ use.
  • Kept for motives that tend to assure their accuracy.
  • In the custody of the witness on a regular basis (as required by the chain of evidence).

tip The chain of evidence establishes accountability for the handling of evidence throughout the evidence lifecycle. See the section “Chain of custody and the evidence lifecycle” later in this chapter.

Admissibility of evidence

Because computer-generated evidence can sometimes be easily manipulated, altered, or tampered with, and because it’s not easily and commonly understood, this type of evidence is usually considered suspect in a court of law. In order to be admissible, evidence must be

  • Relevant: It must tend to prove or disprove facts that are relevant and material to the case.
  • Reliable: It must be reasonably proven that what is presented as evidence is what was originally collected and that the evidence itself is reliable. This is accomplished, in part, through proper evidence handling and the chain of custody. (We discuss this in the upcoming section “Chain of custody and the evidence lifecycle.”)
  • Legally permissible: It must be obtained through legal means. Evidence that’s not legally permissible may include evidence obtained through the following means:
    • Illegal search and seizure: Law enforcement personnel must obtain a prior court order; however, non–law enforcement personnel, such as a supervisor or system administrator, may be able to conduct an authorized search under some circumstances.
    • Illegal wiretaps or phone taps: Anyone conducting wiretaps or phone taps must obtain a prior court order.
    • Entrapment or enticement: Entrapment encourages someone to commit a crime that the individual may have had no intention of committing. Conversely, enticement lures someone toward certain evidence (a honey pot, if you will) after that individual has already committed a crime. Enticement isn’t necessarily illegal, but it does raise certain ethical arguments and may not be admissible in court.
    • Coercion: Coerced testimony or confessions are not legally permissible. Coercion involves compelling a person to involuntarily provide evidence through the use of threats, violence (torture), bribery, trickery, or intimidation.
    • Unauthorized or improper monitoring: Active monitoring must be properly authorized and conducted in a standard manner; users must be notified that they may be subject to monitoring.

Chain of custody and the evidence lifecycle

The chain of custody (or chain of evidence) provides accountability and protection for evidence throughout its entire lifecycle and includes the following information, which is normally kept in an evidence log:

  • Persons involved (Who): Identify any and all individual(s) who discovered, collected, seized, analyzed, stored, preserved, transported, or otherwise controlled the evidence. Also identify any witnesses or other individuals present during any of the above actions.
  • Description of evidence (What): Ensure that all evidence is completely and uniquely described.
  • Location of evidence (Where): Provide specific information about the evidence’s location when it is discovered, analyzed, stored, or transported.
  • Date/Time (When): Record the date and time that evidence is discovered, collected, seized, analyzed, stored, or transported. Also, record date and time information for any evidence log entries associated with the evidence.
  • Methods used (How): Provide specific information about how evidence was discovered, collected, stored, preserved, or transported.

Any time that evidence changes possession or is transferred to a different media type, it must be properly recorded in the evidence log to maintain the chain of custody.
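
The who/what/where/when/how details above lend themselves to a simple structured record. Here is a minimal sketch, in Python, of how an evidence log entry and its chain of custody might be modeled; the names, case number, and actions are hypothetical, and a real evidence log would follow the forms prescribed by your organization and jurisdiction.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CustodyEvent:
    """One chain-of-custody entry: who did what with the evidence, when, where, and how."""
    timestamp: datetime   # When
    handler: str          # Who
    action: str           # What was done (collected, analyzed, transported, and so on)
    location: str         # Where
    method: str           # How (tools and procedures used)


@dataclass
class EvidenceItem:
    """A single piece of evidence and its complete custody history."""
    case_number: str
    description: str      # What (make, model, serial number, condition, preexisting damage)
    custody_log: list[CustodyEvent] = field(default_factory=list)

    def record(self, handler: str, action: str, location: str, method: str) -> None:
        """Append a custody event; every handling action or transfer must be logged."""
        self.custody_log.append(
            CustodyEvent(datetime.now(timezone.utc), handler, action, location, method)
        )


# Hypothetical example: seizing a laptop and transferring it to the evidence locker.
laptop = EvidenceItem("2024-0042", "Laptop, serial ABC123, minor scratch on lid")
laptop.record("J. Smith", "collected", "Office 3B, headquarters",
              "Powered off, sealed in antistatic evidence bag")
laptop.record("J. Smith", "transferred to evidence locker", "Evidence Room 1",
              "Tamper-evident seal #5512, logged with custodian")
```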

Law enforcement officials must strictly adhere to chain of custody requirements, and this adherence is highly recommended for anyone else involved in collecting or seizing evidence. Security professionals and incident response teams must fully understand and follow chain of custody principles and procedures, no matter how minor or insignificant a security incident may initially appear. In both cases, chain of custody serves to prove that digital evidence has not been modified at any point in the forensic examination and analysis.

Even properly trained law enforcement officials sometimes make crucial mistakes in evidence handling and safekeeping. Most attorneys won’t understand the technical aspects of the evidence that you may present in a case, but they will definitely know evidence-handling rules and will most certainly scrutinize your actions in this area. Improperly handled evidence, no matter how conclusive or damaging, will likely be inadmissible in a court of law.

The evidence lifecycle describes the various phases of evidence, from its initial discovery to its final disposition.

The evidence lifecycle has the following five stages:

  • Collection and identification
  • Analysis
  • Storage, preservation, and transportation
  • Presentation in court
  • Final disposition — for example, return to owner or destroy (if it is a copy)

The following sections explain more about each stage.

COLLECTION AND IDENTIFICATION

Collecting evidence involves taking that evidence into custody. Unfortunately, evidence can’t always be collected and must instead be seized. Many legal issues are involved in seizing computers and other electronic evidence. The publication Searching and Seizing Computers and Obtaining Evidence in Criminal Investigations (3rd edition, 2009), published by the U.S. Department of Justice (DOJ) Computer Crime and Intellectual Property Section (CCIPS), provides comprehensive guidance on this subject. Find this publication available for download at www.justice.gov/sites/default/files/criminal-ccips/legacy/2015/01/14/ssmanual2009.pdf.

In general, law enforcement officials can search and/or seize computers and other electronic evidence under any of four circumstances:

  • Voluntary or consensual: The owner of the computer or electronic evidence can freely surrender the evidence.
  • Subpoena: A court issues a subpoena to an individual, ordering that individual to deliver the evidence to the court.
  • Search warrant or Anton Piller order: A search warrant is issued to a law enforcement official by the court, allowing that official to search and seize specific evidence. An Anton Piller order is a court order that allows the premises to be searched and evidence seized without prior warning, usually to prevent the possible destruction of evidence.
  • Exigent circumstances: If probable cause exists and the destruction of evidence is imminent, that evidence may be searched or seized without a warrant.

When evidence is collected, it must be properly marked and identified. This ensures that it can later be properly presented in court as actual evidence gathered from the scene or incident. The collected evidence must be recorded in an evidence log with the following information:

  • A description of the particular piece of evidence including any specific information, such as make, model, serial number, physical appearance, material condition, and preexisting damage.
  • The name(s) of the person(s) who discovered and collected the evidence.
  • The exact date and time, specific location, and circumstances of the discovery/collection.

Additionally, the evidence must be marked, using the following guidelines:

  • Mark the evidence: If possible without damaging the evidence, mark the actual piece of evidence with the collecting individual’s initials, the date, and the case number (if known). Seal the evidence in an appropriate container and mark the container with the same information.
  • Use an evidence tag: If the actual evidence cannot be marked, attach an evidence tag with the same information as above, seal the evidence and tag in an appropriate container, and again mark the container with the same information.
  • Seal the evidence: Seal the container with evidence tape and mark the tape in a manner that will clearly indicate any tampering.
  • Protect the evidence: Use extreme caution when collecting and marking evidence to ensure that it’s not damaged. If you’re using plastic bags for evidence containers, be sure that they’re static free to protect magnetic media.

Always collect and mark evidence in a consistent manner so that you can easily identify evidence and describe your collection and identification techniques to an opposing attorney in court, if necessary.

ANALYSIS

Analysis involves examining the evidence for information pertinent to the case. Analysis should be conducted with extreme caution, by properly trained and experienced personnel only, to ensure the evidence is not altered, damaged, or destroyed.

STORAGE, PRESERVATION, AND TRANSPORTATION

All evidence must be properly stored in a secure facility and preserved to prevent damage or contamination from various hazards, including intense heat or cold, extreme humidity, water, magnetic fields, and vibration. Evidence that’s not properly protected may be inadmissible in court, and the party responsible for collection and storage may be liable. Care must also be exercised during transportation to ensure that evidence is not lost, temporarily misplaced, damaged, or destroyed.

PRESENTATION IN COURT

Evidence to be presented in court must continue to follow the chain of custody and be handled with the same care as at all other times in the evidence lifecycle. This process continues throughout the trial until all testimony related to the evidence is completed and the trial has concluded or the case is settled or dismissed.

FINAL DISPOSITION

After the conclusion of the trial or other disposition, evidence is normally returned to its proper owner. However, under some circumstances, certain evidence may be ordered destroyed, such as contraband, drugs, or drug paraphernalia. Any evidence obtained through a search warrant is legally under the control of the court, possibly requiring the original owner to petition the court for its return.

Reporting and documentation

As described in the preceding section, complete and accurate recordkeeping is critical to each investigation. The investigation report is intended to be a complete record of the investigation and usually includes the following:

  • Incident investigators, including their qualifications and contact information.
  • Names of parties interviewed, including their role, involvement, and contact information.
  • List of all evidence collected, including chain(s) of custody.
  • Tools used to examine or process evidence, including versions.
  • Samples and sampling methodologies used, if applicable.
  • Computers used to examine, process, or store evidence, including a description of configuration.
  • Root-cause analysis of incident, if applicable.
  • Conclusions and opinions of investigators.
  • Hearings or proceedings.
  • Parties to whom the report is delivered.

Investigative techniques

An investigation should begin immediately upon report of an alleged computer crime, policy violation, or incident. Any incident should be handled, at least initially, as a computer crime investigation or policy violation until a preliminary investigation determines otherwise. Different investigative techniques may be required, depending upon the goal of the investigation or applicable laws and regulations. For example, incident handling requires expediency to contain any potential damage as quickly as possible. A root cause analysis requires in-depth examination to determine what happened, how it happened, and how to prevent the same thing from happening again. However, in all cases, proper evidence collection and handling is essential. Even if a preliminary investigation determines that a security incident was not the result of criminal activity, you should always handle any potential evidence properly, in case either further legal proceedings are anticipated or a crime is later uncovered during the course of a full investigation. The CISSP candidate should be familiar with the general steps of the investigative process:

  1. Detect and contain an incident.

    Early detection is critical to a successful investigation. Unfortunately, computer-related incidents usually involve passive or reactive detection techniques (such as the review of audit trails and accidental discovery), which often leave a cold evidence trail. Containment minimizes further loss or damage. The computer incident response team (CIRT), which we discuss later in this chapter, is the team that is normally responsible for conducting an investigation. The CIRT should be notified (or activated) as quickly as possible after a computer crime is detected or suspected.

  2. Notify management.

    Management must be notified of any investigations as soon as possible. Knowledge of the investigations should be limited to as few people as possible, on a need-to-know basis. Out-of-band communication methods (reporting in person) should be used to ensure that an intruder does not intercept sensitive communications about the investigation.

  3. Conduct a preliminary investigation.

    This preliminary investigation determines whether an incident or crime actually occurred. Most incidents turn out to be honest mistakes rather than malicious conduct. This step includes reviewing the complaint or report, inspecting damage, interviewing witnesses, examining logs, and identifying further investigation requirements.

  4. Determine whether the organization should disclose that the crime occurred.

    First, and most importantly, determine whether laws or regulations require the organization to disclose a crime or incident. Next, by coordinating with a public relations or public affairs official of the organization, determine whether the organization wants to disclose this information.

  5. Conduct the investigation.

    Conducting the investigation involves three activities:

    1. Identify potential suspects.

      Potential suspects include insiders and outsiders to the organization. One standard discriminator to help determine or eliminate potential suspects is the MOM test: Did the suspect have the Motive, Opportunity, and Means? The Motive might relate to financial gain, revenge, or notoriety. A suspect had Opportunity if he or she had access, whether as an authorized user for an unauthorized purpose or as an unauthorized user — due to the existence of a security weakness or vulnerability — for an unauthorized purpose. And Means relates to whether the suspect had the necessary tools and skills to commit the crime.

    2. Identify potential witnesses.

      Determine whom you want interviewed and who conducts the interviews. Be careful not to alert any potential suspects to the investigation; focus on obtaining facts, not opinions, in witness statements.

    3. Prepare for search and seizure.

      Identify the types of systems and evidence that you plan to search or seize, designate and train the search and seizure team members (normally members of the Computer Incident Response Team, or CIRT), obtain and serve proper search warrants (if required), and determine potential risk to the system during a search and seizure effort.

  6. Report your findings.

    The results of the investigation, including evidence, should be reported to management and turned over to proper law enforcement officials or prosecutors, as appropriate.

remember MOM stands for Motive, Opportunity, and Means.

Digital forensics tools, tactics, and procedures

Digital forensics is the science of conducting a computer incident investigation to determine what has happened and who is responsible, and to collect legally admissible evidence for use in subsequent legal proceedings, such as a criminal investigation, internal investigation, or lawsuit.

Proper forensic analysis and investigation requires in-depth knowledge of hardware (such as endpoint devices and networking equipment), operating systems (including desktop, server, mobile device, and other device operating systems, like routers, switches, and load balancers), applications, databases, and software programming languages, as well as knowledge and experience using sophisticated forensics tools and toolkits.

The types of forensic data-gathering techniques include

  • Hard drive forensics. Here, specialized tools are used to create one or more forensically identical copies of a computer’s hard drive. A device called a write blocker is typically used to prevent any possible alterations to the original drive. Cryptographic checksums can be used to verify that a forensic copy is an exact duplicate of the original (see the sketch after this list).

    Tools are then used to examine the contents of the hard drive in order to determine

    • Last known state of the computer
    • History of files accessed
    • History of files created
    • History of files deleted
    • History of programs executed
    • History of web sites visited by a browser
    • History of attempts by the user to remove evidence
  • Live forensics. Here, specialized tools are used to examine a running system, including:

    • Running processes
    • Currently open files
    • Contents of main storage (RAM)
    • Keystrokes
    • Communications traffic in/out of the computer

    Live forensics is difficult to perform because the tools used to collect information can also affect the system being examined.
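
Here is a minimal sketch of the checksum verification mentioned in the hard drive forensics bullet above, using Python’s standard hashlib module. The image paths are hypothetical, and real forensic suites typically compute and record these hashes automatically during imaging.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 digest of a file (such as a disk image), reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical paths; the original drive would be attached through a write blocker.
original_hash = sha256_of("/evidence/original_drive.img")
copy_hash = sha256_of("/evidence/working_copy.img")

if original_hash == copy_hash:
    print(f"Verified: forensic copy matches the original ({original_hash})")
else:
    print("MISMATCH: the working copy is not an exact duplicate of the original")
```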

Understand Requirements for Investigation Types

The purpose of an investigation is to determine what happened and who is responsible, and to collect evidence that supports this hypothesis. Closely related to, but distinctly different from, investigations is incident management (discussed in detail later in this chapter). Incident management determines what happened, contains and assesses damage, and restores normal operations.

Investigations and incident management must often be conducted simultaneously in a well-coordinated and controlled manner to ensure that the initial actions of either activity don’t destroy evidence or cause further damage to the organization’s assets. For this reason, it’s important that Computer Incident Response Teams (CIRTs), Computer Emergency Response Teams (CERTs), and Computer Security Incident Response Teams (CSIRTs) be properly trained and qualified to secure a computer-related crime scene or incident while preserving evidence. Ideally, the CIRT includes individuals who will actually be conducting the investigation.

Consider the analogy of a police patrol officer who discovers a murder victim. It’s important that the officer quickly assesses the safety of the situation and secures the crime scene, but at the same time, the officer must be careful not to disturb or destroy any evidence. The homicide detective’s job is to gather and analyze the evidence. Ideally, but rarely, the homicide detective would be the individual who discovers the murder victim, allowing her to assess the safety of the situation, secure the crime scene, and begin collecting evidence. Think of yourself as a CSI-SSP!

Requirements differ among the various investigation types, including administrative, criminal, civil, and regulatory investigations, in areas such as evidence standards, burden of proof, and reporting obligations.

Various industry standards and guidelines provide guidance for conducting investigations. These include the American Bar Association’s (ABA) Best Practices in Internal Investigations, various best practice guidelines and toolkits published by the U.S. Department of Justice (DOJ), and ASTM International’s Standard Practice for Computer Forensics (ASTM E2763).

Conduct Logging and Monitoring Activities

Event logging is an essential part of an organization’s IT operations. Increasingly, organizations are implementing centralized log collection systems that often serve as security information and event management (SIEM) platforms.

Intrusion detection and prevention

Intrusion detection is a passive technique used to detect unauthorized activity on a network. An intrusion detection system is frequently called an IDS. Three types of IDSs used today are

  • Network-based intrusion detection (NIDS): Consists of a separate device attached to a network that listens to all network traffic by using various methods (which we describe later in this section) to detect anomalous activity.
  • Host-based intrusion detection (HIDS): Runs on an individual host and monitors activity on that host, including the network traffic destined for that host and, commonly, system logs and file integrity.
  • Wireless intrusion detection (WIDS): This is another type of network intrusion detection that focuses on wireless intrusion by scanning for rogue access points.

Both network- and host-based IDSs use one or more of the following detection methods:

  • Signature-based: A signature-based IDS compares observed network traffic with a list of patterns in a signature file. It reliably detects a known set of attacks, but if an intruder changes the patterns used in an attack, the attack may slip by the IDS undetected. The other downside of a signature-based IDS is that the signature file must be frequently updated.
  • Reputation-based: Closely akin to signature-based detection, reputation-based alerting is all about detecting when communications and other activities involve known-malicious domains and IP networks. Some IDSs update themselves several times daily, adding to their lists of known-malicious domains and IP addresses. Then, when any activity is associated with a known-malicious domain or IP address, the IDS can create an alert that lets personnel know about the activity.
  • Anomaly-based: An anomaly-based IDS monitors all the traffic over the network and builds traffic profiles. Over time, the IDS will report deviations from the profiles that it has built. The upside of anomaly-based IDSs is that there are no signature files to periodically update. The downside is that you may have a high volume of false-positives. Behavior-based and heuristics-based IDSs are similar to anomaly-based IDSs and share many of the same advantages. Rather than detecting anomalies to normal traffic patterns, behavior-based and heuristics-based systems attempt to recognize and learn potential attack patterns.
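
The following sketch illustrates, in greatly simplified form, the difference between signature matching and anomaly detection. The signatures, traffic counts, and scoring are invented for illustration and bear little resemblance to a production IDS.

```python
import re
from statistics import mean, stdev

# Hypothetical signature list: patterns associated with known attacks.
SIGNATURES = {
    "SQL injection attempt": re.compile(r"(?i)union\s+select|or\s+1=1"),
    "Directory traversal": re.compile(r"\.\./\.\./"),
}


def signature_match(payload: str) -> list[str]:
    """Signature-based detection: flag payloads that match known attack patterns."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]


def anomaly_score(history_per_minute: list[int], current: int) -> float:
    """Anomaly-based detection: how far the current request rate deviates
    (in standard deviations) from the baseline traffic profile."""
    baseline, spread = mean(history_per_minute), stdev(history_per_minute)
    return abs(current - baseline) / spread if spread else 0.0


print(signature_match("GET /search?q=' OR 1=1 --"))        # ['SQL injection attempt']
print(anomaly_score([40, 42, 38, 45, 41], current=400))    # large score suggests an anomaly
```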

Intrusion detection doesn’t stop intruders, but intrusion prevention does … or, at least, it slows them down. Intrusion prevention systems (IPSs) are newer and more common systems than IDSs, and IPSs are designed to detect and block intrusions. An intrusion prevention system is simply an IDS that can take action, such as dropping a connection or blocking a port, when an intrusion is detected.

remember Intrusion detection looks for known attacks and/or anomalous behavior on a network or host.

See Chapter 6 for more on intrusion detection and intrusion prevention systems.

Security information and event management

Security information and event management (SIEM) solutions provide real-time collection, analysis, correlation, and presentation of security logs and alerts generated by various network sources (such as firewalls, IDS/IPS, routers, switches, servers, and workstations).

A SIEM solution can be software- or appliance-based, and may be hosted and managed either internally or by a managed security service provider.

A SIEM requires a lot of up-front configuration and tuning, so that only the most important, actionable events are brought to the attention of staff members in the organization. However, it’s worth the effort: a SIEM combs through millions, or billions, of events daily, and presents only the most important few, actionable events so that security teams can take appropriate action.

Many SIEM platforms also have the ability to accept threat intelligence feeds from various vendors including the SIEM manufacturers. This permits the SIEM to automatically adjust its detection and blocking capabilities for the most up-to-date threats.
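
As a simplified illustration of SIEM-style correlation, the following sketch raises an alert when several failed logins from one source are followed by a successful login for the same account within a short window. The normalized event format, addresses, and thresholds are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-normalized events, as a SIEM might store them after collection.
events = [
    {"time": datetime(2024, 5, 1, 9, 0, 5), "source": "10.0.0.7", "type": "login_failure", "user": "admin"},
    {"time": datetime(2024, 5, 1, 9, 0, 9), "source": "10.0.0.7", "type": "login_failure", "user": "admin"},
    {"time": datetime(2024, 5, 1, 9, 0, 14), "source": "10.0.0.7", "type": "login_failure", "user": "admin"},
    {"time": datetime(2024, 5, 1, 9, 0, 20), "source": "10.0.0.7", "type": "login_success", "user": "admin"},
]


def correlate_brute_force(events, threshold=3, window=timedelta(minutes=5)):
    """Alert when a source accumulates `threshold` failed logins and then succeeds
    for the same account within the time window."""
    failures = defaultdict(list)
    alerts = []
    for e in sorted(events, key=lambda e: e["time"]):
        key = (e["source"], e["user"])
        if e["type"] == "login_failure":
            failures[key].append(e["time"])
        elif e["type"] == "login_success":
            recent = [t for t in failures[key] if e["time"] - t <= window]
            if len(recent) >= threshold:
                alerts.append(f"Possible brute force: {e['user']} from {e['source']} at {e['time']}")
    return alerts


print(correlate_brute_force(events))
```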

Continuous monitoring

Continuous monitoring technology collects and reports security data in near real time. Continuous monitoring components may include

  • Discovery: Ongoing inventory of network and information assets, including hardware, software, and sensitive data.
  • Assessment: Automatic scanning and baselining of information assets to identify and prioritize vulnerabilities.
  • Threat intelligence: Feeds from one or more outside organizations that produce high-quality, actionable data.
  • Audit: Nearly real-time evaluation of device configurations and compliance with established policies and regulatory requirements.
  • Patching: Automatic security patch installation and software updating.
  • Reporting: Aggregating, analyzing and correlating log information and alerts.
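
A minimal sketch of the discovery and assessment components might look like the following, assuming a hypothetical approved-inventory baseline and scan results. Real continuous monitoring platforms draw on live scanners, agents, and configuration management databases rather than hard-coded dictionaries.

```python
# Hypothetical data: what an automated discovery scan found versus the approved inventory.
approved_inventory = {
    "web01": {"os": "Ubuntu 22.04", "patch_level": "2024-05"},
    "db01": {"os": "Windows Server 2022", "patch_level": "2024-05"},
}
discovered = {
    "web01": {"os": "Ubuntu 22.04", "patch_level": "2024-03"},   # behind on patches
    "db01": {"os": "Windows Server 2022", "patch_level": "2024-05"},
    "dev99": {"os": "Ubuntu 20.04", "patch_level": "unknown"},   # not in the inventory
}


def continuous_monitoring_report(approved, discovered, current_patch="2024-05"):
    """Flag unknown (possibly rogue) assets and assets that have fallen behind on patching."""
    findings = []
    for host, attrs in discovered.items():
        if host not in approved:
            findings.append(f"{host}: not in approved inventory (possible rogue asset)")
        elif attrs["patch_level"] != current_patch:
            findings.append(f"{host}: patch level {attrs['patch_level']} is behind {current_patch}")
    for host in approved:
        if host not in discovered:
            findings.append(f"{host}: in inventory but not discovered (offline or decommissioned?)")
    return findings


for finding in continuous_monitoring_report(approved_inventory, discovered):
    print(finding)
```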

Egress monitoring

Egress monitoring (or extrusion detection) is the process of monitoring outbound traffic to discover potential data leakage (or loss). Modern cyberattacks employ various stealth techniques to avoid detection as long as possible for the purpose of data theft. These techniques may include the use of encryption (such as SSL/TLS) and steganography (discussed in Chapter 4).

Data loss prevention (DLP) systems are often used to detect the exfiltration of sensitive data, such as personally identifiable information (PII) or protected health information (PHI), in e-mail messages, data uploads, PNG or JPEG images, and other forms of communication. These technologies often perform deep packet inspection (DPI) to decrypt and inspect outbound traffic that is TLS encrypted.

DLP systems can also be used to disable removable media drive interfaces on servers and workstations, and also to encrypt data written onto removable media.

Static DLP tools are used to discover sensitive and proprietary data in databases, file servers, and other data storage systems.
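
Conceptually, the content-inspection portion of a DLP system boils down to matching outbound content against patterns and policies. The following is a deliberately simplistic sketch with illustrative regular expressions; production DLP products rely on far more robust techniques, such as exact data matching, document fingerprinting, and check-digit validation, in addition to deep packet inspection.

```python
import re

# Illustrative patterns only; real DLP policies are far more precise.
PII_PATTERNS = {
    "U.S. Social Security number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Possible payment card number": re.compile(r"\b(?:\d[ -]?){15,16}\b"),
}


def inspect_outbound(message: str) -> list[str]:
    """Return the names of sensitive-data patterns found in an outbound message."""
    return [label for label, pattern in PII_PATTERNS.items() if pattern.search(message)]


outbound = "Hi, the customer's SSN is 123-45-6789; please update the record."
hits = inspect_outbound(outbound)
if hits:
    print(f"Blocked outbound message; detected: {', '.join(hits)}")
```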

Securely Provisioning Resources

An organization’s information architecture is dynamic and constantly changing. As a result, its security posture is also dynamic and constantly changing. Provisioning (and decommissioning) of various information resources can have significant impacts (both direct and indirect) on the organization’s security posture. For example, an application may either directly introduce new vulnerabilities into an environment or integrate with a database in a way that compromises the integrity of the database. For these reasons, security planning and analysis must be an integral part of every organization’s resource provisioning processes and must continue throughout the lifecycle of all resources, from initial provisioning through final decommissioning.

Understand and Apply Foundational Security Operations Concepts

Fundamental security operations concepts that need to be well understood and managed include the principles of need-to-know and least privilege, separation of duties and responsibilities, monitoring of special privileges, job rotation, information lifecycle management and service-level agreements.

Need-to-know and least privilege

The concept of need-to-know states that only people with a valid business justification should have access to specific information or functions. In addition to having a need-to-know, an individual must have an appropriate security clearance level in order for access to be granted. Conversely, an individual with the appropriate security clearance level, but without a need-to-know, should not be granted access.

One of the most difficult challenges in managing need-to-know is implementing controls that actually enforce it. Also, information owners need to be able to distinguish “I need to know” from “I want to know,” “I want to feel important,” and “I’m just curious.”

Need-to-know is closely related to the concept of least privilege and can help organizations implement least privilege in a practical manner.

The principle of least privilege states that persons should have the capability to perform only the tasks (or have access to only the data) that are required to perform their primary jobs, and no more.

To give an individual more privileges and access than required invites trouble. Offering the capability to perform more than the job requires may become a temptation that results, sooner or later, in an abuse of privilege.

For example, giving a user full permissions on a network share, rather than just read and modify rights to a specific directory, opens the door not only for abuse of those privileges (for example, reading or copying other sensitive information on the network share) but also for costly mistakes (accidentally deleting a file — or the entire directory!). As a starting point, organizations should approach permissions with a “deny all” mentality, then add needed permissions as required.
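
Here is a minimal sketch of the “deny all, then grant as needed” approach: access is allowed only if it has been explicitly granted, and everything else is denied by default. The users, shares, and permissions shown are hypothetical.

```python
# Hypothetical access grants: permissions are listed explicitly;
# anything not listed is denied by default (the "deny all" starting point).
GRANTS = {
    ("alice", "/shares/finance/reports"): {"read"},
    ("bob", "/shares/finance/reports"): {"read", "modify"},
}


def is_allowed(user: str, resource: str, action: str) -> bool:
    """Default deny: return True only if the action was explicitly granted."""
    return action in GRANTS.get((user, resource), set())


print(is_allowed("alice", "/shares/finance/reports", "read"))    # True
print(is_allowed("alice", "/shares/finance/reports", "delete"))  # False: never granted
print(is_allowed("carol", "/shares/finance/reports", "read"))    # False: no grant at all
```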

tip Least privilege is also closely related to separation of duties and responsibilities, described in the following section. Distributing the duties and responsibilities for a given job function among several people means that those individuals require fewer privileges on a system or resource.

remember The principle of least privilege states that people should have the fewest privileges necessary to allow them to perform their tasks.

Several important concepts associated with need to know and least privilege include

  • Entitlement. When a new user account is provisioned in an organization, the permissions granted to that account must be appropriate for the level of access required by the user. In too many organizations, human resources simply instructs the IT department to give a new user “whatever so-and-so (another user in the same department) has access to”. Instead, entitlement needs to be based on the principle of least privilege.
  • Aggregation. When people transfer between jobs and/or departments within an organization (see the section on job rotations later in this chapter), they often need different access and privileges to do their new jobs. Far too often, organizational security processes do not adequately ensure that access rights which are no longer required by an individual are actually revoked. Instead, individuals accumulate privileges, and over a period of many years an employee can have far more access and privileges than they actually need. This is known as aggregation, and it’s the antithesis of least privilege!

    Privilege creep is another term commonly used for this phenomenon (see the sketch after this list).

  • Transitive trust. Trust relationships (in the context of security domains) are often established within, and between, organizations to facilitate ease of access and collaboration. A trust relationship enables subjects (such as users or processes) in one security domain to access objects (such as servers or applications) in another security domain (see Chapter 5 and Chapter 7 to learn more about objects and subjects). A transitive trust extends access privileges to the subdomains of a security domain (analogous to inheriting permissions to subdirectories within a parent directory structure). Instead, a nontransitive trust should be implemented by requiring access to each subdomain to be explicitly granted based on the principle of least privilege, rather than inherited.
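
One practical countermeasure to aggregation is a periodic entitlement review that compares what a user actually holds against the baseline for the user’s current role. The following sketch, with hypothetical roles and entitlements, flags the excess for review or revocation.

```python
# Hypothetical role baselines and a user who has transferred between departments.
ROLE_BASELINES = {
    "accounts_payable": {"erp:enter_invoice", "erp:view_vendor"},
    "payroll": {"hr:view_timesheets", "hr:run_payroll"},
}

user = {
    "name": "dana",
    "current_role": "payroll",
    # Entitlements accumulated over the years, including leftovers from a prior role.
    "entitlements": {"hr:view_timesheets", "hr:run_payroll", "erp:enter_invoice", "erp:view_vendor"},
}


def find_privilege_creep(user: dict) -> set[str]:
    """Return entitlements the user holds beyond the baseline for the current role."""
    baseline = ROLE_BASELINES[user["current_role"]]
    return user["entitlements"] - baseline


excess = find_privilege_creep(user)
print(f"{user['name']}: review or revoke {sorted(excess)}")
```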

Separation of duties and responsibilities

The concept of separation (or segregation) of duties and responsibilities ensures that no single individual has complete authority and control of a critical system or process. This practice promotes security in the following ways:

  • Reduces opportunities for fraud or abuse: In order for fraud or abuse to occur, two or more individuals must collude or be complicit in the performance of their duties.
  • Reduces mistakes: Because two or more individuals perform the process, mistakes are less likely to occur or mistakes are more quickly detected and corrected.
  • Reduces dependence on individuals: Critical processes are accomplished by groups of individuals or teams. Multiple individuals should be trained on different parts of the process (for example, through job rotation, discussed in the following section) to help ensure that the absence of an individual doesn’t unnecessarily delay or impede successful completion of a step in the process.

Here are some common examples of separation of duties and responsibilities within organizations:

  • A bank assigns the first three numbers of a six-number safe combination to one employee and the second three numbers to another employee. A single employee isn’t permitted to have all six numbers, so a lone employee is unable to gain access to the safe and steal its contents.
  • An accounting department might separate record entry and internal auditing functions, or accounts payable and check disbursing functions.
  • A system administrator is responsible for setting up new accounts and assigning permissions, which a security administrator then verifies.
  • A programmer develops software code, but a separate individual is responsible for testing and validation, and yet another individual is responsible for loading the code on production systems.
  • Destruction of classified materials may require two individuals to complete or witness the destruction.
  • Disposal of assets may require an approval signature by the office manager and verification by building security.

In smaller organizations, separation of duties and responsibilities can be difficult to implement because of limited personnel and resources.

Privileged account management

Privileged entity controls are the mechanisms, generally built into computer operating systems and network devices, that give privileged access to hardware, software, and data. In UNIX and Windows, the controls that permit privileged functions reside in the operating system. Operating systems for servers, desktop computers, and many other devices use the concept of modes of execution to define privilege levels for various user accounts, applications, and processes that run on a system. For instance, the UNIX root account and Windows Server Enterprise, Domain, and Local Administrator account roles have elevated rights that allow those accounts to install software, view the entire file system and, in some cases, directly access the OS kernel and memory.

Specialized tools are used to monitor and record activities performed by privileged and administrative users. This helps to ensure accountability on the part of each administrator and aids in troubleshooting, through the ability to view actions performed by administrators.

System or network administrators typically use privileged accounts to perform operating system and utility management functions. Supervisor or Administrator mode should be used only for system administration purposes. Unfortunately, many organizations allow system and network administrators to use these privileged accounts or roles as their normal user accounts even when they aren’t doing work that requires this level of access. Yet another horrible security practice is to allow administrators to share a single “administrator” or “root” account.

warning System or network administrators occasionally grant root or administrator privileges to normal applications as a matter of convenience, rather than spending the time to figure out exactly what privileges the application actually requires, and then creating an account role for the application with only those privileges. Allowing a normal application these privileges is a serious mistake because applications that run in privileged mode bypass some or all security controls, which could lead to unexpected application behavior. For instance, any user of a payroll application could view or change anyone's data because the application running in privileged mode was never told no by the operating system. Further, if an application running in privileged mode is compromised by an attacker, the attacker may then inherit privileged access for the entire system.

tip Hackers specifically target Supervisor and other privileged modes, because those modes have a great deal of power over systems. The use of Supervisor mode should be limited wherever possible, especially on end-user workstations.

Job rotation

Job rotation (or rotation of duties) is another effective security control that gives many benefits to an organization. Similar to the concept of separation of duties and responsibilities, job rotations involve regularly (or randomly) transferring key personnel into different positions or departments within an organization, with or without notice. Job rotations accomplish several important organizational objectives:

  • Reduce opportunities for fraud or abuse. Regular job rotations can accomplish this objective in the following two ways:
    • People hesitate to set up the means for periodically or routinely stealing corporate information because they know that they could be moved to another shift or task at almost any time.
    • People don’t work with each other long enough to form collusive relationships that could damage the company.
  • Eliminate single points of failure. By ensuring that numerous people within an organization or department know how to perform several different job functions, an organization can reduce dependence on individuals and thereby eliminate single points of failure when an individual is absent, incapacitated, no longer employed with the organization, or otherwise unavailable to perform a critical job function.
  • Promote professional growth. Through cross-training opportunities, job rotations can help an individual’s professional growth and career development, and reduce monotony and/or fatigue.

Job rotations can also include changing workers’ workstations and work locations, which can keep would-be saboteurs off balance and less likely to commit sabotage.

As with the practice of separation of duties, small organizations can have difficulty implementing job rotations.

Information lifecycle

The information lifecycle refers to the activities related to the introduction, use, and disposal of information in an organization. The phases in the information lifecycle typically are

  • Plan. Development of formal plans on how to create and use information.
  • Creation. Information is created, collected, received, or captured in some way.
  • Store. Information is stored in an information system.
  • Use. Information is used, maintained, and perhaps disseminated.
  • Protection. Information is protected according to its criticality and sensitivity.
  • Disposal. Information at the end of its service life is discarded. Sensitive information will be erased using techniques to prevent its recovery.

tip The European Union’s General Data Protection Regulation (GDPR) and other privacy regulations bring to light the steps in the information lifecycle, by giving data subjects legal rights regarding use of information about them.

Service-level agreements

Users of business- or mission-critical information systems need to know whether their systems or services will function when they need them, and users need to know more than “Is it up?” or “Is it down again?” Their customers, and others, hold users accountable for getting their work done in a timely and accurate manner, so consequently, those users need to know whether they can depend on their systems and services to help them deliver as promised.

The service-level agreement (SLA) is a quasi-legal document (it’s a real legal document when it is included in or referenced by a contract) that pledges the system or service performs to a set of minimum standards, such as

  • Hours of availability: The wall-clock hours that the system or service will be available for users. This could be 24 x 7 (24 hours per day, 7 days per week) or something more limited, such as daily from 4:00 a.m. to 12:00 p.m. Availability specifications may also cite maintenance windows (for instance, Sundays from 2:00 a.m. to 4:00 a.m.) when users can expect the system or service to be down for testing, upgrades, and maintenance.
  • Average and peak number of concurrent users: The maximum number of users who can use the system or service at the same time.
  • Transaction throughput: The number of transactions that the system or service can perform or support in a given time period. Usually, throughput is expressed as transactions per second, per minute, or per hour.
  • Transaction accuracy: The accuracy of transactions that the system or service performs. Generally, this is related to complex calculations (such as calculating sales tax) and accuracy of location data.
  • Data storage capacity: The amount of data that the users can store in the system or service (such as cloud storage). Capacity may be expressed in raw terms (megabytes or gigabytes) or in numbers of transactions.
  • Response times: The maximum periods of time (in seconds) that key transactions take. Response times for long processes (such as nightly runs, batch jobs, and so on) also should be covered in the SLA.
  • Service desk response and resolution times: The amount of time (usually in hours) that a service desk (or help desk) will take to respond to requests for support and resolve any issues.
  • Mean Time Between Failures (MTBF): The amount of time, typically measured in (thousands of) hours, that a component (such as a server hard drive) or system is expected to continuously operate before experiencing a failure.
  • Mean Time to Restore Service (MTRS): The amount of time, typically measured in minutes or hours, that it is expected to take in order to restore a system or service to normal operation after a failure has occurred.
  • Security incident response times: The amount of time (usually in hours or days) between the realization of a security incident and any required notifications to data owners and other affected parties.
  • Escalation process during times of failure: When things go wrong, how quickly the service provider will contact the customer, as well as what steps the provider will take to restore service.

remember Availability is one of the three tenets of information security (Confidentiality, Integrity, and Availability, discussed in Chapter 3). Therefore, SLAs are an important security document.

Because the SLA is a quantified statement, the service provider and the user alike can take measurements to see how well the service provider is meeting the SLA’s standards. This measurement, which is sometimes accompanied by analysis, is frequently called a scorecard.

tip Operational-level agreements (OLAs) and underpinning contracts (UCs) are important SLA supporting documents. An OLA is essentially an SLA between the different interdependent groups that are responsible for the terms of the SLA, for example, a Service Desk and the Desktop Support team. UCs are used to manage third-party relationships with entities that help support the SLA, such as an external service provider or vendor.

Finally, for an SLA to be meaningful, it needs to have teeth! How will the SLA be enforced, and what will happen when violations occur? What are the escalation procedures? Will any penalties or service credits be paid in the event of a violation? If so, how will penalties or credits be calculated?

tip Internal SLAs (and OLAs), such as those between an IT department and its users, typically don’t provide penalties or service credits for service violations. Internal SLAs are structured more as a commitment between IT and the user community, and are useful for managing service expectations. Clearly defined escalation procedures (who gets notified of a problem, and when and how it goes up the chain of command) are critical in an internal SLA.

tip SLAs rarely, if ever, provide meaningful financial penalties for service violations. For example, an hour of Internet downtime might legitimately cost an e-commerce company $10,000 of business. But most service providers will typically only provide a credit equivalent to the amount paid for the lost hour of Internet service (a few hundred dollars). This may seem incredibly disproportionate, but consider it from the service provider’s perspective. That same credit has to be given to all of their customers that experienced the outage. Thus, an outage could potentially cost the service provider hundreds of thousands of dollars. If service providers were legally obligated to reimburse every customer for their actual losses, it’s fair to guess that no one would be in the business of providing Internet service (or it would cost a few thousand dollars a month for a T-1 circuit). Instead, look for such penalties as an early termination clause that lets you get out of a long-term contract if your service provider repeatedly fails to meet its service level obligations.

Apply Resource Protection Techniques

Resource protection is the broad category of controls that protect information assets and information infrastructure. Resources that require protection include storage media and the organization’s hardware and software assets, which are discussed in the following sections.

Media management

Media management refers to a broad category of controls that are used to manage information classification and physical media. Data classification refers to the tasks of marking information according to its sensitivity, as well as the subsequent handling, storage, transmission, and disposal procedures that accompany each classification level. Physical media is similarly marked; likewise, controls specify handling, storage, and disposal procedures. See Chapter 4 to learn more about data classification.

Sensitive information such as financial records, employee data, and information about customers must be clearly marked, properly handled and stored, and appropriately destroyed in accordance with established organizational policies, standards, and procedures:

  • Marking: How an organization identifies sensitive information, whether electronic or hard copy. For example, a marking might read PRIVILEGED AND CONFIDENTIAL. See Chapter 4 for a more detailed discussion of data classification.
  • Handling: The organization should have established procedures for handling sensitive information. These procedures detail how employees can transport, transmit, and use such information, as well as any applicable restrictions.
  • Protection: This involves two components:
    • The physical protection of the actual media, such as locked cabinets and secured vehicles.
    • The logical protection of information on media, such as encryption.
  • Storage and Backup: Similar to handling, the organization must have procedures and requirements specifying how sensitive information must be stored and backed up.
  • Retention: Most organizations are bound by various laws and regulations to collect and store certain information, as well as to keep it for minimum and/or maximum specified periods of time. An organization must be aware of legal requirements and ensure that it’s in compliance with all applicable regulations. Records retention policies should cover any electronic records that may be located on file servers, document management systems, databases, e-mail systems, archives, and records management systems, as well as paper copies and backup media stored at off-site facilities. Organizations that want to retain information longer than required by law should firmly establish why such information should be kept longer. Nowadays, just having information can be a liability, so this should be the exception rather than the norm.
  • Destruction: Sooner or later, an organization must destroy sensitive information. The organization must have procedures detailing how to destroy sensitive information that has been previously retained, regardless of whether the data is in hard copy or saved as an electronic file.

warning At the opposite end of the records retention spectrum, many organizations now destroy records (including backup media) as soon as legally permissible in order to limit the scope (and cost) of any future discovery requests or litigation. Before implementing any such draconian retention policies that severely restrict your organization’s retention periods, you should fully understand the negative implications such a policy has for your disaster recovery capabilities. Also, consult with your organization’s legal counsel to ensure that you’re in full compliance with all applicable laws and regulations. Although extremely short retention policies and practices may be prudent for limiting future discovery requests or litigation, they’re illegal for limiting pending discovery requests or litigation (or even records that you have a reasonable expectation may become the subject of future litigation). In such cases, don’t destroy pertinent records — otherwise you go to jail. You go directly to jail! You don’t pass Go, you don’t collect $200, and (oh, yeah) you don’t pass the CISSP exam, either — or even remain eligible for CISSP certification!

Hardware and software asset management

Maintaining a complete and accurate inventory with configuration information about all of an organization’s hardware and software information assets is an important security operations function.

Without this information, managing vulnerabilities becomes a truly daunting challenge. With popular trends such as “bring your own device” becoming more commonplace in many organizations, it is critical that organizations work with their information security leaders and end users to ensure that all devices and applications that are used are known to — and appropriately managed by — the organization. This allows any inherent risks to be known — and addressed.

Conduct Incident Management

The formal process of detecting, responding to, and fixing a security problem is known as incident management (also known as security incident management).

warning Do not confuse the concept of incident management, described herein, with the more general concept of incident management as defined by the Information Technology Infrastructure Library’s (ITIL) Service Management best practices.

Incident management includes the following steps:

  1. Preparation. Incident management begins before an incident actually occurs. Preparation is the key to quick and successful incident management. A well-documented and regularly practiced incident management (or incident response) plan ensures effective preparation. The plan should include:
    • Response procedures: Include detailed procedures that address different contingencies and situations.
    • Response authority: Clearly define roles, responsibilities, and levels of authority for all members of the Computer Incident Response Team (CIRT).
    • Available resources: Identify people, tools, and external resources (consultants and law enforcement agents) that are available to the CIRT. Training should include use of these resources, when possible.
    • Legal review: The incident response plan should be evaluated by appropriate legal counsel to determine compliance with applicable laws and to determine whether its provisions are enforceable and defensible.
  2. Detection. Detecting that a security incident or event has occurred is the first and, often, most difficult step in incident management. Detection may occur through automated monitoring and alerting systems, or as the result of a reported security incident (such as a lost or stolen mobile device). Under the best of circumstances, detection occurs in real time, such as when anti-malware software discovers malware on a computer. More often, a security incident may go undetected for quite some time (months or even years), as in the case of a sophisticated "low and slow" cyberattack. Determining whether a security incident has occurred is similar to the detection and containment step in the investigative process (discussed earlier in this chapter) and includes defining what constitutes a security incident for your organization.
  3. Response. Upon determining that an incident has occurred, it's important to immediately begin detailed documentation of every action taken throughout the incident management process. You should also identify the appropriate alert level. (Ask questions such as "Is this an isolated incident or a system-wide event?" and "Has personal or sensitive data been compromised?" and "What laws may have been violated?") The answers will help you determine who to notify and whether to activate the entire incident response team or only certain members. Next, notify the appropriate people about the incident, both incident response team members and management. All contact information should be documented before an incident, and all notifications and contacts during an incident should be documented in the incident log (see the logging sketch after this list).
  4. Mitigation. The purpose of this step is to contain the incident and minimize further loss or damage. For example, you may need to eradicate a virus, deny access, or disable services in order to halt the incident in progress.
  5. Reporting. This step requires assessing the incident and reporting the results to appropriate management personnel and authorities (if applicable). The assessment includes determining the scope and cause of damage, as well as the responsible (or liable) party.
  6. Recovery. Recovering normal operations involves eradicating any components of the incident (for example, removing malware from a system or disabling e-mail service on a stolen mobile device). Think of recovery as returning a system to its pre-incident state.
  7. Remediation. Remediation may include rebuilding systems, repairing vulnerabilities, improving safeguards, and restoring data and services. Do this step in accordance with a business continuity plan (BCP) that properly identifies recovery priorities.
  8. Lessons learned. The final phase of incident management requires evaluating the effectiveness of your incident management plan and identifying any lessons learned — which should include not only what went wrong, but also what went right.
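
As promised in Step 3, here's a minimal Python sketch of an incident record that timestamps every action and notification. The class name and fields are hypothetical; real incident response platforms provide far richer case management, but the principle of logging who did what, and when, is the same.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class IncidentRecord:
        """Hypothetical incident record for documenting response actions."""
        incident_id: str
        alert_level: str                  # for example, "isolated" or "system-wide"
        actions: list = field(default_factory=list)

        def log_action(self, who, what):
            # Every notification, decision, and containment step gets a timestamp.
            self.actions.append((datetime.now(timezone.utc).isoformat(), who, what))

    incident = IncidentRecord("IR-001", alert_level="isolated")
    incident.log_action("on-call analyst", "Notified CIRT lead and activated core team")
    incident.log_action("CIRT lead", "Disabled compromised account to contain incident")
    for entry in incident.actions:
        print(entry)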

remember Investigations and incident management follow similar steps but have different purposes: The distinguishing characteristic of an investigation is the gathering of evidence for possible prosecution, whereas incident management focuses on containing the damage and returning to normal operations.

Operate and Maintain Detective and Preventive Measures

Detective and preventive security measures include various security technologies and techniques, such as firewalls, intrusion detection and prevention systems, and anti-malware software.

Implement and Support Patch and Vulnerability Management

Software bugs and flaws inevitably exist in operating systems, database management systems, and various applications, and are continually discovered by researchers. Many of these bugs and flaws are security vulnerabilities that could permit an attacker to control a target system and subsequently access sensitive data or critical functions. Patch and vulnerability management is the process of regularly assessing, testing, installing, and verifying fixes and patches for software bugs and flaws as they are discovered.

To perform patch and vulnerability management, follow these basic steps:

  1. Subscribe to security advisories from vendors and third-party organizations.
  2. Perform periodic security scans of internal and external infrastructure to identify systems and applications with insecure configurations and missing patches.
  3. Perform a risk analysis on each advisory and missing patch to determine its applicability and risk to your organization (see the prioritization sketch after these steps).
  4. Develop a plan to either install the security patch or to perform another workaround, if any is available.

    You should base your decision on which solution best eliminates the vulnerability or reduces risk to an acceptable level.

  5. Test the security patch or workaround in a test environment.

    This process involves making sure that stated functions still work properly and that no unexpected side-effects arise as a result of installing the patch or workaround.

  6. Install the security patch in the production environment.
  7. Verify that the patch is properly installed and that systems still perform properly.
  8. Update all relevant documentation to include any changes made or patches installed.
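
Here's the prioritization sketch mentioned in Step 3. It's a minimal Python example; the fields and weighting are illustrative assumptions rather than a standard formula, but the idea of ranking advisories by severity, exposure, and exploit availability is exactly what a risk analysis produces.

    # Illustrative advisory data; real inputs come from vendor advisories
    # and vulnerability scan results.
    advisories = [
        {"id": "ADV-001", "cvss": 9.8, "affected_hosts": 42, "exploit_public": True},
        {"id": "ADV-002", "cvss": 5.3, "affected_hosts": 3,  "exploit_public": False},
    ]

    def risk_score(adv):
        """Combine severity, exposure, and exploit availability into one number."""
        score = adv["cvss"] * adv["affected_hosts"]
        if adv["exploit_public"]:
            score *= 2   # a publicly available exploit raises the urgency
        return score

    # Patch (or mitigate) the highest-risk advisories first.
    for adv in sorted(advisories, key=risk_score, reverse=True):
        print(adv["id"], round(risk_score(adv), 1))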

Understand and Participate in Change Management Processes

Change management is the business process used to control architectural and configuration changes in a production environment. Instead of just making changes to systems and the way that they relate to each other, change management is a formal process of request, design, review, approval, implementation, and recordkeeping.

Configuration management is the closely related process of actively managing the configuration of every system, device, and application and then thoroughly documenting those configurations.
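
To illustrate the request-through-recordkeeping flow described above, here's a minimal Python sketch of a change request that can only move forward one state at a time. The state names mirror the steps in the paragraph; everything else is an assumption for illustration.

    from enum import Enum

    class ChangeState(Enum):
        REQUESTED = 1
        DESIGNED = 2
        REVIEWED = 3
        APPROVED = 4
        IMPLEMENTED = 5
        RECORDED = 6

    # Each state may advance only to the next one; skipping review or
    # approval is rejected, which is the whole point of formal change management.
    NEXT_STATE = {
        ChangeState.REQUESTED: ChangeState.DESIGNED,
        ChangeState.DESIGNED: ChangeState.REVIEWED,
        ChangeState.REVIEWED: ChangeState.APPROVED,
        ChangeState.APPROVED: ChangeState.IMPLEMENTED,
        ChangeState.IMPLEMENTED: ChangeState.RECORDED,
    }

    def advance(current, proposed):
        if NEXT_STATE.get(current) != proposed:
            raise ValueError(f"Cannot move from {current.name} to {proposed.name}")
        return proposed

    state = ChangeState.REQUESTED
    state = advance(state, ChangeState.DESIGNED)      # allowed
    # advance(state, ChangeState.IMPLEMENTED)         # raises: review and approval skipped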

Implement Recovery Strategies

Developing and implementing effective backup and recovery strategies are critical for ensuring the availability of systems and data. Other techniques and strategies are commonly implemented to ensure the availability of critical systems, even in the event of an outage or disaster.

Backup storage strategies

Backups are performed for a variety of reasons that center around a basic principle: sometimes things go wrong and we need to get our data back. In order to cover all reasonable scenarios, backup storage strategies often involve the following:

  • Secure offsite storage. Store backup media at a remote location, far enough away so that the remote location is not directly affected by the same events (weather, natural disasters, man-made disasters), but close enough so that backup media can be retrieved in a reasonable period of time.
  • Transport via secure courier. This can discourage or prevent theft of backup media while it is in transit to a remote location.
  • Backup media encryption. This helps to prevent any unauthorized third party from being able to recover data from backup media (see the encryption sketch after this list).
  • Data replication. Sending data to an offsite or remote data center, or cloud-based storage provider, in near real-time.
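
Here's the backup media encryption sketch referenced above, using the third-party Python cryptography package's Fernet recipe. It reads the whole archive into memory, so treat it as an illustration of the principle (encrypt before the media leaves your control, and store the key separately from the media) rather than a production backup tool.

    from cryptography.fernet import Fernet   # pip install cryptography

    # Generate the key once and store it separately from the backup media;
    # anyone who obtains both the media and the key can recover the data.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    with open("backup.tar", "rb") as infile:          # hypothetical backup archive
        ciphertext = fernet.encrypt(infile.read())

    with open("backup.tar.enc", "wb") as outfile:     # this is what goes offsite
        outfile.write(ciphertext)

    # Restoring later: Fernet(key).decrypt(ciphertext) returns the original bytes.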

Recovery site strategies

These strategies include the following:

  • Hot site: A fully functional data center or other facility that is always up and ready, with near real-time replication of production systems and data.
  • Cold site: A data center or facility that may have some recovery equipment available, but not configured, and no backup data onsite.
  • Warm site: Some hardware and connectivity are prepositioned and configured, plus an offsite copy of backup data.

Selecting a recovery site strategy has everything to do with cost and service level. The faster you want to recover data processing operations in a remote location, the more you will have to spend in order to build a site that is “ready to go” at the speed you require.

In a nutshell: Speed costs.

Multiple processing sites

Many large organizations operate multiple data centers for critical systems, with real-time replication and load balancing between the various sites. This is the ultimate solution for large commercial sites that have little or no tolerance for downtime. Indeed, a well-engineered multi-site application can survive the loss of an entire data center without customers ever knowing that anything is wrong.

System resilience, high availability, quality of service, and fault tolerance

System resilience, high availability, quality of service (QoS), and fault tolerance are similar characteristics that are engineered into a system to make it as reliable as possible:

  • System resilience. This includes eliminating single points of failure in system designs and building fail-safes into critical systems.
  • High availability. This typically consists of clustered systems and databases configured in either active-active mode (both systems are running and immediately available) or active-passive mode (one system is active while the other is on standby but can take over, usually within a matter of seconds). In an active-passive cluster, a failover mechanism automatically switches the "active" role from one server in the cluster to another (see the failover sketch after this list).
  • Quality of service. A mechanism by which systems prioritize certain services or traffic to ensure that they're always available or perform at a required level. For example, Voice over Internet Protocol (VoIP) traffic typically is prioritized to ensure that sufficient network bandwidth is always available, avoiding any delay or degradation of voice quality. Services that are less sensitive to delays (such as web browsing or file downloads) are given a lower priority in such cases.
  • Fault tolerance. This includes engineered redundancies in critical components, such as multiple power supplies, multiple network interfaces, and RAID (redundant array of independent disks) configured storage systems.
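
Here's the failover sketch referenced in the high availability bullet: a minimal Python illustration in which the standby node is promoted after the active node misses several consecutive heartbeats. The heartbeat sequence, threshold, and node names are assumptions; real clustering software adds quorum, fencing, and sub-second health checks.

    MISSED_LIMIT = 3   # consecutive missed heartbeats before promoting the standby

    def choose_active(heartbeats, active="node-a", passive="node-b"):
        """Walk a sequence of heartbeat results (True = healthy) and fail over
        once the active node misses MISSED_LIMIT heartbeats in a row."""
        missed = 0
        for ok in heartbeats:
            if ok:
                missed = 0
            else:
                missed += 1
                if missed >= MISSED_LIMIT:
                    active, passive = passive, active   # promote the standby node
                    missed = 0
        return active

    # Three consecutive failures trigger a failover to node-b.
    print(choose_active([True, True, False, False, False, True]))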

Implement Disaster Recovery (DR) Processes

A variety of disasters can beset an organization’s business operations. They fall into two main categories: natural and man-made.

In many cases, formal methodologies are used to predict the likelihood of a particular disaster. For example, 50-year flood plain is a term that you've probably heard to describe the maximum physical limits of a river flood that's likely to occur once in a 50-year period. The likelihood of any given natural disaster, such as an earthquake, flood, hurricane, tornado, or wildfire, depends greatly on local and regional geography.

Many of these occurrences may have secondary effects, such as utility outages, communications failures, and transportation disruptions; often these secondary effects have a bigger impact on business operations, sometimes over a wider area than the initial disaster (for instance, a landslide in a rural area can topple power transmission lines, which results in a citywide blackout).

As if natural disasters weren't enough, man-made disasters, resulting from both deliberate and accidental acts, can also disrupt business operations. Examples include fires, hazardous material spills, utility failures, civil disturbances, and terrorism.

tip For a more complete reference on disaster recovery planning, we recommend IT Disaster Recovery Planning For Dummies.

Disasters can affect businesses in a lot of ways, some obvious and others not so obvious: damaged facilities and equipment, interrupted utilities and communications, and unavailable personnel, to name a few. No such list is ever complete, but it should help you think about all the ways a disaster can affect your organization.

Response

Emergency response teams must be prepared for every reasonably possible scenario. Members of these teams need a variety of specialized training to deal with such things as water and smoke damage, structural damage, flooding, and hazardous materials.

Organizations must document all the types of responses so that the response teams know what to do. The emergency response documentation consists of two major parts: how to respond to each type of incident, and the most up-to-date facts about the facilities and equipment that the organization uses.

In other words, you want your teams to know how to deal with water damage, smoke damage, structural damage, hazardous materials, and many other things. Your teams also need to know everything about every company facility: where to find utility entrances, electrical equipment, HVAC equipment, fire control, elevators, communications, data closets, and so on, as well as which vendors maintain and service them. And you need experts who know about the materials and construction of the buildings themselves. Those experts might be your own employees, outside consultants, or a little of both.

remember It is the DRP team’s responsibility to identify the experts needed for all phases of emergency response.

Responding to an emergency branches into two activities: salvage and recovery. Tangential to these activities is preparing financially for the costs associated with salvage and recovery.

Salvage

The salvage team is concerned with restoring full functionality to the damaged facility. This restoration includes several activities:

  • Damage assessment: Arrange a thorough examination of the facility to identify the full extent and nature of the damage. Frequently, outside experts, such as structural engineers, perform this inspection.
  • Salvage assets: Remove assets, such as computer equipment, records, furniture, inventory, and so on, from the facility.
  • Cleaning: Thoroughly clean the facility to eliminate smoke damage, water damage, debris, and more. Outside companies that specialize in these services frequently perform this job.
  • Restoring the facility to operational readiness: Complete repairs, and restock and reequip the facility to return it to pre-disaster readiness. At this point, the facility is ready for business functions to resume there.

remember The salvage team is primarily concerned with the restoration of a facility and its return to operational readiness.

Recovery

Recovery involves providing the BCP team (yes, the BCP team; recovery involves both BCP and DRP) with the logistics, supplies, and coordination needed to get alternate functional sites up and running. This activity should be heavily scripted, with lots of procedures and checklists, in order to ensure that every detail is handled.

Financial readiness

The salvage and recovery operations can cost a lot of money. The organization must prepare for potentially large expenses (at least several times the normal monthly operating cost) to restore operations to the original facility.

Financial readiness can take several forms, including:

  • Insurance: An organization may purchase an insurance policy that pays for the replacement of damaged assets and perhaps even some of the other costs associated with conducting emergency operations.
  • Cash reserves: An organization may set aside cash to purchase assets for emergency use, as well as to use for emergency operations costs.
  • Line of credit: An organization may establish a line of credit, prior to a disaster, to be used to purchase assets or pay for emergency operations should a disaster occur.
  • Pre-purchased assets: An organization may choose to purchase assets to be used for disaster recovery purposes in advance, and store those assets at or near a location where they will be utilized in the event of a disaster.
  • Letters of agreement: An organization may wish to establish legal agreements that would be enacted in a disaster. These may include use of emergency work locations (such as nearby hotels), use of fleet vehicles, appropriation of computers used by lower-priority systems, and so on.
  • Standby assets: An organization can use existing assets as items to be re-purposed in the event of a disaster. For example, a computer system that is used for software testing could be quickly re-used for production operations if a disaster strikes.

Personnel

People are the most important resource in any organization. As such, disaster response must place human life above all other considerations when developing disaster response plans and when emergency responders are taking action after a disaster strikes. In terms of life safety, organizations can do several things to ensure safety of personnel:

  • Evacuation plans. Personnel need to know how to safely evacuate a building or work center. Signs should be clearly posted, and drills routinely held, so that personnel can practice exiting the building or work center calmly and safely. For organizations with large numbers of customers or visitors, additional measures need to be taken so that persons unfamiliar with evacuation routes and procedures can safely exit the facilities.
  • First aid. Organizations need to have plenty of first aid supplies on hand, including longer-term supplies in the event that a natural disaster prevents paramedics from being able to respond. Personnel need to be trained in first aid and CPR so that they can respond in a disaster, especially when communications and/or transportation are cut off.
  • Emergency supplies. For disasters that require personnel to shelter in place, organizations need to stock emergency water, food, blankets and other necessities in the event that personnel are stranded at work locations for more than a few hours.

remember Personnel are the most important resource in any organization.

Communications

A critical component of the DRP is the communications plan. Employees need to be notified about closed facilities and any special work instructions (such as an alternate location to report for work). The planning team needs to realize that one or more of the usual means of communications may have also been adversely affected by the same event that damaged business facilities. For example, if a building has been damaged, the voice-mail system that people would try to call into so that they could check messages and get workplace status might not be working.

Organizations need to anticipate the effects of an event when considering emergency communications. For instance, you need to establish in advance two or more ways to locate each important staff member. These ways may include landlines, cell phones, spouses’ cell phones, and alternate contact numbers (such as neighbors or relatives).

tip Text messaging is often an effective means of communication, even when mobile communications systems are congested.

Many organizations’ emergency operations plans include the use of audio conference bridges so that personnel can discuss operational issues hour by hour throughout the event. Instead of relying on a single provider (which you might not be able to reach because of communications problems or because it’s affected by the same event), organizations should have a second (and maybe even a third) audio conference provider established. Emergency communications documentation needs to include dial-in information for both (or all three) conference systems.

In addition to internal communications, the DRP must address external communications to ensure that customers, investors, government, and media are provided with accurate and timely information.

Assessment

When a disaster strikes, an organization’s DRP needs to include procedures to assess damage to buildings and equipment.

First, the response team needs to examine buildings and equipment, to determine which assets are a total loss, which are repairable, and which are still usable (although not necessarily in their current location).

For such events as floods, fires and earthquakes, a professional building inspector usually will need to examine a building to see whether it is fit for occupation. If not, then the next step is determining whether a limited number of personnel will be permitted to enter the building to retrieve needed assets.

Once assessment has been completed, assets can be divided into three categories:

  • Salvage. These are assets that are a total loss and cannot be repaired. In some cases, components can be removed to repair other assets.
  • Repair. Some assets can be repaired and returned to service.
  • Reuse. Undamaged assets can be placed back into service, although this may require them to be moved to an alternate work location if the building cannot be occupied.

Restoration

The ultimate objective of the disaster recovery team is the restoration of work facilities with their required assets, so that business may return to normal. Depending on the nature of the event, restoration may take the form of building repair, building replacement, or permanent relocation to a different building.

Similarly, the assets used in each building may need to undergo their own restoration, whether that takes the form of replacement, repair, or simply placing them back into service in whatever location is chosen.

Prior to full restoration, business operations may be conducted in temporary facilities, possibly by alternate personnel who may be other employees or contractors hired to fill in and help out. These temporary facilities may be located either near the original facilities or some distance away. The circumstances of the event will dictate some of these matters, as well as the organization’s plans for temporary business operations.

Training and awareness

An organization’s ability to effectively respond to a disaster is highly dependent on its advance preparations. In addition to developing high-quality, workable disaster recovery and business continuity plans that are kept up to date, the next most important task is making sure that employees and other needed personnel are periodically trained in the actual response and continuity procedures. Training and practice help to reinforce understanding of proper response procedures, giving the organization the best chance of surviving a disaster.

An important part of training is the participation in various types of testing, which is discussed in the following section.

Test Disaster Recovery Plans

By the time an organization has created a DRP, it has probably spent hundreds of hours and possibly tens (or hundreds) of thousands of dollars on consulting fees. You’d think that after making such a big investment, it would test the DRP to make sure that it really works when an actual disaster strikes!

The following sections outline DRP testing methods.

Read-through

A read-through (or checklist) test is a detailed review of DRP documents, performed by individuals on their own. The purpose of a read-through test is to identify inaccuracies, errors, and omissions in DRP documentation.

It’s easy to coordinate this type of test, because each person who performs the test does it whenever their schedule permits (provided they complete it before any deadlines).

By itself, a document review is an insufficient way to test a DRP; however, it’s a logical starting place. You should perform one or more of the other DR tests described in the following sections shortly after you do a read-through test.

Walkthrough or tabletop

A walkthrough (or tabletop or structured walkthrough) test is a team approach to the read-through test. Here, several business and technology experts in the organization gather to “walk” through the DRP. A moderator or facilitator leads participants to discuss each step in the DRP so that they can identify issues and opportunities for making the DRP more accurate and complete. Group discussions usually help to identify issues that people will not find when working on their own. Often the participants want to perform the review in a fancy mountain or oceanside retreat, where they can think much more clearly! (Yeah, right.)

During a walkthrough test, the facilitator writes down “parking lot” issues (items to be considered at a later time, written down now so they will not be forgotten) on a whiteboard or flipchart while the group identifies those issues. These are action items that will serve to make improvements to the DRP. Each action item needs to have an accountable person assigned, as well as a completion date, so that the action items will be completed in a reasonable time. Depending upon the extent of the changes, a follow-up walkthrough may need to be conducted at a later time.

tip A walkthrough test usually requires two or more hours to complete.

Simulation

In a simulation test, all the designated disaster recovery personnel practice going through the motions associated with a real recovery. In a simulation, the team doesn’t actually perform any recovery or alternate processing.

An organization that plans to perform a simulation test appoints a facilitator who develops a disaster scenario, using a type of disaster that’s likely to occur in the region. For instance, an organization in San Francisco might choose an earthquake scenario, and an organization in Miami could choose a hurricane.

In a simple simulation, the facilitator reads out announcements as if they’re news briefs. Such announcements describe an unfolding scenario and can also include information about the organization’s status at the time. An example announcement might read like this:

It is 8:15 a.m. local time, and a magnitude 7.1 earthquake has just occurred, fifteen miles from company headquarters. Building One is heavily damaged and some people are seriously injured. Building Two (the one containing the organization’s computer systems) is damaged and personnel are unable to enter the building. Electric power is out, and the generator has not started because of an unknown problem that may be earthquake related. Executives Jeff Johnson and Sarah Smith (CIO and CFO) are backpacking on the Appalachian Trail and cannot be reached.

The disaster-simulation team, meeting in a conference room, discusses emergency response procedures and how the response might unfold. They consider the conditions described to them and identify any issues that could impact an actual disaster response.

The simulation facilitator makes additional announcements throughout the simulation. Just like in a real disaster, the team doesn’t know everything right away — instead, news trickles in. In the simulation, the facilitator reads scripted statements that, um, simulate the way that information flows in a real disaster.

A more realistic simulation can be held at the organization’s emergency response center, where some resources that support emergency response may be available. Another idea is to hold the simulation on a day that is not announced ahead of time, so that responders will be genuinely surprised and possibly be less prepared to respond.

tip Remember to test your backup media to make sure that you can actually restore data from backups!

Parallel

A parallel test involves performing all the steps of a real recovery, except that you keep the real, live production systems running. The actual production systems run in parallel with the disaster recovery systems. The parallel test is very time-consuming, but it does test the accuracy of the applications because analysts compare data on the test recovery systems with production data.

The technical architecture of the target application determines how a parallel test needs to be conducted. The general principle of a parallel test is that the disaster recovery system (meaning the system that remains on standby until a real disaster occurs, at which time the organization presses it into production service) processes work at the same time that the primary system continues its normal work. Precisely how this is accomplished depends on technical details. For a system that operates on batches of data, those batches can be copied to the DR system for processing there, and the results can be compared for accuracy and timeliness.
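
For a batch-oriented application, the comparison can be as simple as hashing each environment's output and checking that the digests match. The record format and sample data in this minimal Python sketch are assumptions for illustration; the point is that a parallel test produces two result sets that must agree.

    import hashlib

    def digest(records):
        """Hash a batch of output records so primary and DR results can be
        compared without shipping the full data set around."""
        h = hashlib.sha256()
        for record in sorted(records):
            h.update(record.encode("utf-8"))
        return h.hexdigest()

    # Hypothetical outputs from the same batch run in both environments.
    primary_output = ["acct=1001,balance=250.00", "acct=1002,balance=90.10"]
    dr_output      = ["acct=1001,balance=250.00", "acct=1002,balance=90.10"]

    if digest(primary_output) == digest(dr_output):
        print("DR batch results match production")
    else:
        print("Mismatch: investigate DR configuration or replication lag")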

Highly interactive applications are more difficult to test in a strictly parallel test. Instead, it might be necessary to record user interactions on the live system and then “play back” those interactions using an application testing tool. Then responses, accuracy, and timing can be verified after the test to verify whether the DR system worked properly.

While a parallel test may be difficult to set up, its results can provide a good indication of whether disaster recovery systems will perform during a disaster. Also, the risks associated with a parallel test are low, since a failure of the DR system will not impact real business transactions.

remember The parallel test includes loading data onto recovery systems without taking production systems down.

Full interruption (or cutover)

A full interruption (or cutover) test is similar to a parallel test except that in a full interruption test, a function’s primary systems are actually shut off or disconnected. A full interruption test is the ultimate test of a disaster recovery plan because one or more of the business’s critical functions actually depends upon the availability, integrity, and accuracy of the recovery systems.

A full interruption test should be performed only after successful walkthroughs and at least one parallel test. In a full interruption test, backup systems process the full production workload and support all primary and ancillary functions, including:

  • User access
  • Administrative access
  • Integrations to other applications
  • Support
  • Reporting
  • … And whatever else the main production environment needs to support

remember A full interruption test is the ultimate test of the ability for a disaster recovery system to perform properly in a real disaster, but it’s also the test with the highest risk and cost.

Participate in Business Continuity (BC) Planning and Exercises

Business continuity and disaster recovery planning are closely related but distinctly different activities. As described in Chapter 3, business continuity focuses on keeping a business running after a disaster or other event has occurred, while disaster recovery deals with restoring the organization and its affected processes and capabilities back to normal operations.

tip If you don’t recall the similarities and differences between business continuity and disaster recovery planning, we strongly recommend that you refer back to Chapter 3!

Security professionals need to take an active role in their organization’s business continuity planning activities and related exercises. As a CISSP, you’ll be a recognized expert in the area of business continuity and disaster recovery, and you will need to contribute your specialized knowledge and experience to help your organization develop and implement effective and comprehensive business continuity and disaster recovery plans.

Implement and Manage Physical Security

Physical security is yet another important aspect of the security professional’s responsibilities. Important physical security concepts and technologies are covered extensively in Chapter 5 and Chapter 7.

As with other information security concepts, ensuring physical security requires appropriate controls at the physical perimeter (this includes the building exterior, parking areas, and common grounds) and internal security controls to (most importantly) protect personnel, as well as to protect other physical and information assets from various threats, such as fire, flooding, severe weather, civil disturbances, terrorism, criminal activity, and workplace violence.

Address Personnel Safety and Security Concerns

Security professionals contribute to the safety and security of personnel by helping their organizations develop and implement effective personnel security policies (discussed in Chapter 3), and through physical security measures (discussed in the preceding section, as well as Chapter 5 and Chapter 7).

remember Saving human lives is the first priority in any life-threatening situation.