Chapter 9
IN THIS CHAPTER
Understanding investigations
Applying security operations concepts and controls
Responding to incidents
Preparing for disasters
Keeping facilities and personnel safe
The Security Operations domain covers lots of essential security concepts and builds on many of the other security domains, including Security and Risk Management (Chapter 3), Asset Security (Chapter 4), Security Architecture and Engineering (Chapter 5), and Communication and Network Security (Chapter 6). Security operations encompasses the routine activities that occur across many of the CISSP domains. This domain represents 13 percent of the CISSP certification exam.
Conducting investigations for various purposes is an important function for security professionals. You must understand evidence collection and handling procedures, reporting and documentation requirements, various investigative processes, and digital forensics tools and techniques. Successful conclusions in investigations depend heavily on proficiency in these skills.
Evidence is information presented in a court of law to confirm or dispel a fact that’s under contention, such as the commission of a crime, the violation of policy, or an ethics matter. A case can’t be brought to trial or other legal proceeding without sufficient evidence to support the case. Thus, properly gathering and protecting evidence is one of the most important and most difficult tasks that an investigator must master.
Important evidence collection and handling topics covered on the CISSP exam include the types of evidence, rules of evidence, admissibility of evidence, chain of custody, and the evidence lifecycle.
Sources of legal evidence that you can present in a court of law generally fall into one of four major categories:
Other types of evidence that may fall into one or more of the above major categories include
Important rules of evidence for computer crime cases include the best evidence rule and the hearsay evidence rule. The CISSP candidate must understand both of these rules and their applicability to evidence in computer crime cases.
The best evidence rule, defined in the Federal Rules of Evidence, states that “to prove the content of a writing, recording, or photograph, the original writing, recording, or photograph is [ordinarily] required.”
However, the Federal Rules of Evidence define an exception to this rule as “[i]f data are stored in a computer or similar device, any printout or other output readable by sight, shown to reflect the data accurately, is an ‘original’.”
Thus, data extracted from a computer — if that data is a fair and accurate representation of the original data — satisfies the best evidence rule and may normally be introduced into court proceedings as such.
Hearsay evidence is evidence that’s not based on personal, first-hand knowledge of a witness, but rather comes from other sources. Under the Federal Rules of Evidence, hearsay evidence is normally not admissible in court. This rule exists to prevent unreliable testimony from improperly influencing the outcome of a trial.
Business records, including computer records, have traditionally, and perhaps mistakenly, been considered hearsay evidence by most courts because these records cannot be proven accurate and reliable. One of the most significant obstacles a prosecutor must overcome in a computer crime case is getting computer records admitted as evidence.
Several courts have acknowledged that the hearsay rules are applicable to computer-stored records containing human statements but are not applicable to computer-generated records untouched by human hands.
Perhaps the most successful and commonly applied test of admissibility for computer records, in general, has been the business records exception, established in the U.S. Federal Rules of Evidence, for records of regularly conducted activity, meeting the following criteria:
Because computer-generated evidence can sometimes be easily manipulated, altered, or tampered with, and because it’s not easily and commonly understood, this type of evidence is usually considered suspect in a court of law. In order to be admissible, evidence must be
The chain of custody (or chain of evidence) provides accountability and protection for evidence throughout its entire lifecycle and includes the following information, which is normally kept in an evidence log:
Any time that evidence changes possession or is transferred to a different media type, it must be properly recorded in the evidence log to maintain the chain of custody.
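To make the chain-of-custody idea concrete, here is a minimal, hypothetical sketch of an evidence-log record as a data structure; the field names are illustrative, not a prescribed standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CustodyTransfer:
    """One entry in the evidence log, recording a change of possession."""
    released_by: str   # person or evidence locker releasing the item
    received_by: str   # person or evidence locker taking possession
    timestamp: str     # when the transfer occurred (UTC, ISO 8601)
    purpose: str       # why the transfer happened (e.g., "forensic imaging")

@dataclass
class EvidenceItem:
    """A single piece of evidence and its full custody history."""
    item_id: str
    description: str
    collected_by: str
    collected_at: str
    custody_log: list = field(default_factory=list)

    def transfer(self, released_by, received_by, purpose):
        # Every change of possession is appended, never overwritten,
        # so the history remains a complete, ordered record.
        self.custody_log.append(CustodyTransfer(
            released_by=released_by,
            received_by=received_by,
            timestamp=datetime.now(timezone.utc).isoformat(),
            purpose=purpose,
        ))
```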
Law enforcement officials must strictly adhere to chain of custody requirements, and this adherence is highly recommended for anyone else involved in collecting or seizing evidence. Security professionals and incident response teams must fully understand and follow chain of custody principles and procedures, no matter how minor or insignificant a security incident may initially appear. In both cases, chain of custody serves to prove that digital evidence has not been modified at any point in the forensic examination and analysis.
Even properly trained law enforcement officials sometimes make crucial mistakes in evidence handling and safekeeping. Most attorneys won’t understand the technical aspects of the evidence that you may present in a case, but they will definitely know evidence-handling rules and will most certainly scrutinize your actions in this area. Improperly handled evidence, no matter how conclusive or damaging, will likely be inadmissible in a court of law.
The evidence lifecycle describes the various phases of evidence, from its initial discovery to its final disposition.
The evidence lifecycle has the following five stages:
The following sections explain more about each stage.
Collecting evidence involves taking that evidence into custody. Unfortunately, evidence can’t always be collected voluntarily and must instead be seized. Many legal issues are involved in seizing computers and other electronic evidence. The publication Searching and Seizing Computers and Obtaining Evidence in Criminal Investigations (3rd edition, 2009), published by the U.S. Department of Justice (DOJ) Computer Crime and Intellectual Property Section (CCIPS), provides comprehensive guidance on this subject. Find this publication available for download at www.justice.gov/sites/default/files/criminal-ccips/legacy/2015/01/14/ssmanual2009.pdf.
In general, law enforcement officials can search and/or seize computers and other electronic evidence under any of four circumstances:
When evidence is collected, it must be properly marked and identified. This ensures that it can later be properly presented in court as actual evidence gathered from the scene or incident. The collected evidence must be recorded in an evidence log with the following information:
Additionally, the evidence must be marked, using the following guidelines:
Always collect and mark evidence in a consistent manner so that you can easily identify evidence and describe your collection and identification techniques to an opposing attorney in court, if necessary.
Analysis involves examining the evidence for information pertinent to the case. Analysis should be conducted with extreme caution, by properly trained and experienced personnel only, to ensure the evidence is not altered, damaged, or destroyed.
All evidence must be properly stored in a secure facility and preserved to prevent damage or contamination from various hazards, including intense heat or cold, extreme humidity, water, magnetic fields, and vibration. Evidence that’s not properly protected may be inadmissible in court, and the party responsible for collection and storage may be liable. Care must also be exercised during transportation to ensure that evidence is not lost, temporarily misplaced, damaged, or destroyed.
Evidence to be presented in court must continue to follow the chain of custody and be handled with the same care as at all other times in the evidence lifecycle. This process continues throughout the trial until all testimony related to the evidence is completed and the trial has concluded or the case is settled or dismissed.
After the conclusion of the trial or other disposition, evidence is normally returned to its proper owner. However, under some circumstances, certain evidence may be ordered destroyed, such as contraband, drugs, or drug paraphernalia. Any evidence obtained through a search warrant is legally under the control of the court, possibly requiring the original owner to petition the court for its return.
As described in the preceding section, complete and accurate recordkeeping is critical to each investigation. An investigation’s report is intended to be a complete record of an investigation, and usually includes the following:
An investigation should begin immediately upon report of an alleged computer crime, policy violation, or incident. Any incident should be handled, at least initially, as a computer crime investigation or policy violation until a preliminary investigation determines otherwise. Different investigative techniques may be required, depending upon the goal of the investigation or applicable laws and regulations. For example, incident handling requires expediency to contain any potential damage as quickly as possible. A root cause analysis requires in-depth examination to determine what happened, how it happened, and how to prevent the same thing from happening again. However, in all cases, proper evidence collection and handling is essential. Even if a preliminary investigation determines that a security incident was not the result of criminal activity, you should always handle any potential evidence properly, in case either further legal proceedings are anticipated or a crime is later uncovered during the course of a full investigation. The CISSP candidate should be familiar with the general steps of the investigative process:
Detect and contain an incident.
Early detection is critical to a successful investigation. Unfortunately, computer-related incidents usually involve passive or reactive detection techniques (such as the review of audit trails and accidental discovery), which often leave a cold evidence trail. Containment minimizes further loss or damage. The computer incident response team (CIRT), which we discuss later in this chapter, is normally responsible for conducting an investigation. The CIRT should be notified (or activated) as quickly as possible after a computer crime is detected or suspected.
Notify management.
Management must be notified of any investigations as soon as possible. Knowledge of the investigations should be limited to as few people as possible, on a need-to-know basis. Out-of-band communication methods (such as reporting in person) should be used to ensure that an intruder does not intercept sensitive communications about the investigation.
Conduct a preliminary investigation.
This preliminary investigation determines whether an incident or crime actually occurred. Most incidents turn out to be honest mistakes rather than malicious conduct. This step includes reviewing the complaint or report, inspecting damage, interviewing witnesses, examining logs, and identifying further investigation requirements.
Determine whether the organization should disclose that the crime occurred.
First, and most importantly, determine whether laws or regulations require the organization to disclose a crime or incident. Next, by coordinating with a public relations or public affairs official of the organization, determine whether the organization wants to disclose this information.
Conduct the investigation.
Conducting the investigation involves three activities:
Identify potential suspects.
Potential suspects include insiders and outsiders to the organization. One standard discriminator to help determine or eliminate potential suspects is the MOM test: Did the suspect have the Motive, Opportunity, and Means? The Motive might relate to financial gain, revenge, or notoriety. A suspect had Opportunity if he or she had access, whether as an authorized user for an unauthorized purpose or as an unauthorized user — due to the existence of a security weakness or vulnerability — for an unauthorized purpose. And Means relates to whether the suspect had the necessary tools and skills to commit the crime.
Identify potential witnesses.
Determine whom you want interviewed and who conducts the interviews. Be careful not to alert any potential suspects to the investigation; focus on obtaining facts, not opinions, in witness statements.
Prepare for search and seizure.
Identify the types of systems and evidence that you plan to search or seize, designate and train the search and seizure team members (normally members of the CIRT), obtain and serve proper search warrants (if required), and determine potential risk to the system during a search and seizure effort.
Report your findings.
The results of the investigation, including evidence, should be reported to management and turned over to proper law enforcement officials or prosecutors, as appropriate.
Digital forensics is the science of conducting a computer incident investigation to determine what has happened and who is responsible, and to collect legally admissible evidence for use in subsequent legal proceedings, such as a criminal investigation, internal investigation, or lawsuit.
Proper forensic analysis and investigation requires in-depth knowledge of hardware (such as endpoint devices and networking equipment), operating systems (including desktop, server, mobile device, and other device operating systems, like routers, switches, and load balancers), applications, databases, and software programming languages, as well as knowledge and experience using sophisticated forensics tools and toolkits.
The types of forensic data-gathering techniques include
Hard drive forensics. Here, specialized tools are used to create one or more forensically identical copies of a computer’s hard drive. A device called a write blocker is typically used to prevent any possible alterations to the original drive. Cryptographic checksums can be used to verify that a forensic copy is an exact duplicate of the original.
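As a minimal sketch of the checksum step (assuming the original drive and the forensic copy are both available as image files; the paths are illustrative), the verification might look like this:

```python
import hashlib

def sha256_of_image(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a disk image, reading in chunks
    so that multi-gigabyte images don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The copy is forensically sound only if both digests match exactly.
original = sha256_of_image("/evidence/original.dd")   # illustrative path
duplicate = sha256_of_image("/evidence/copy.dd")      # illustrative path
assert original == duplicate, "Forensic copy does not match the original!"
```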
Tools are then used to examine the contents of the hard drive in order to determine
Live forensics. Here, specialized tools are used to examine a running system, including:
Live forensics is difficult to perform because the tools used to collect information can also affect the system being examined.
The purpose of an investigation is to determine what happened and who is responsible, and to collect evidence that supports this hypothesis. Closely related to, but distinctly different from, investigations is incident management (discussed in detail later in this chapter). Incident management determines what happened, contains and assesses damage, and restores normal operations.
Investigations and incident management must often be conducted simultaneously in a well-coordinated and controlled manner to ensure that the initial actions of either activity don’t destroy evidence or cause further damage to the organization’s assets. For this reason, it’s important that Computer Incident Response Teams (CIRTs), also known as Computer Emergency Response Teams (CERTs) or Computer Security Incident Response Teams (CSIRTs), be properly trained and qualified to secure a computer-related crime scene or incident while preserving evidence. Ideally, the CIRT includes individuals who will actually be conducting the investigation.
Consider the analogy of a police patrolman who discovers a murder victim. It’s important that the patrolman quickly assess the safety of the situation and secure the crime scene, but at the same time, he must be careful not to disturb or destroy any evidence. The homicide detective’s job is to gather and analyze the evidence. Ideally, but rarely, the homicide detective would be the individual who discovers the murder victim, allowing her to assess the safety of the situation, secure the crime scene, and begin collecting evidence. Think of yourself as a CSI-SSP!
Different requirements for various investigation types include
Various industry standards and guidelines provide guidance for conducting investigations. These include the American Bar Association’s (ABA) Best Practices in Internal Investigations, various best practice guidelines and toolkits published by the U.S. Department of Justice (DOJ), and ASTM International’s Standard Practice for Computer Forensics (ASTM E2763).
Event logging is an essential part of an organization’s IT operations. Increasingly, organizations are implementing centralized log collection systems that often serve as security information and event management (SIEM) platforms.
Intrusion detection is a passive technique used to detect unauthorized activity on a network. An intrusion detection system is frequently called an IDS. Three types of IDSs used today are
Both network- and host-based IDSs use a couple of methods:
Intrusion detection doesn’t stop intruders, but intrusion prevention does … or, at least, it slows them down. Intrusion prevention systems (IPSs) are newer and more common than IDSs, and are designed to detect and block intrusions. An intrusion prevention system is simply an IDS that can take action, such as dropping a connection or blocking a port, when an intrusion is detected.
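As a toy illustration of the distinction (not how production IDS/IPS engines are built), a signature-based sensor looks for known-bad patterns in traffic, and the prevention variant adds a blocking action; the signatures below are invented for illustration:

```python
import re

# Hypothetical signatures: patterns that indicate known attack traffic.
SIGNATURES = {
    "sql-injection": re.compile(r"('|%27)\s*(or|OR)\s*1=1"),
    "path-traversal": re.compile(r"\.\./\.\./"),
}

def inspect(payload: str) -> list[str]:
    """IDS behavior: passively report which signatures matched."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]

def inspect_and_block(payload: str) -> bool:
    """IPS behavior: the same detection, plus an action -- here,
    returning False tells the caller to drop the connection."""
    matches = inspect(payload)
    for name in matches:
        print(f"ALERT: {name} detected")   # detection (IDS)
    return not matches                     # block on any match (IPS)
```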
See Chapter 6 for more on intrusion detection and intrusion prevention systems.
Security information and event management (SIEM) solutions provide real-time collection, analysis, correlation, and presentation of security logs and alerts generated by various network sources (such as firewalls, IDS/IPS, routers, switches, servers, and workstations).
A SIEM solution can be software- or appliance-based, and may be hosted and managed either internally or by a managed security service provider.
A SIEM requires a lot of up-front configuration and tuning, so that only the most important, actionable events are brought to the attention of staff members in the organization. However, it’s worth the effort: a SIEM combs through millions, or billions, of events daily, and presents only the most important few, actionable events so that security teams can take appropriate action.
Many SIEM platforms also have the ability to accept threat intelligence feeds from various vendors including the SIEM manufacturers. This permits the SIEM to automatically adjust its detection and blocking capabilities for the most up-to-date threats.
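As a hedged sketch of what correlation means in practice, consider a rule that raises a single actionable alert only when many related low-level events (here, failed logins from one source address) occur within a short window; the window and threshold values are invented for illustration:

```python
from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(minutes=5)   # illustrative correlation window
THRESHOLD = 10                  # illustrative failed-login count

def correlate_failed_logins(events):
    """events: iterable of (timestamp, source_ip) pairs for failed logins,
    assumed sorted by timestamp. Yields one alert per offending source."""
    recent = defaultdict(list)   # source_ip -> timestamps inside the window
    alerted = set()
    for ts, src in events:
        recent[src].append(ts)
        # Discard events that have aged out of the correlation window.
        recent[src] = [t for t in recent[src] if ts - t <= WINDOW]
        if len(recent[src]) >= THRESHOLD and src not in alerted:
            alerted.add(src)
            yield f"Possible brute-force attack from {src}"
```

This is the tuning the text describes: millions of raw events go in, and only the few correlated, actionable alerts come out.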
Continuous monitoring technology collects and reports security data in near real time. Continuous monitoring components may include
Egress monitoring (or extrusion detection) is the process of monitoring outbound traffic to discover potential data leakage (or loss). Modern cyberattacks employ various stealth techniques to avoid detection as long as possible for the purpose of data theft. These techniques may include the use of encryption (such as SSL/TLS) and steganography (discussed in Chapter 4).
Data loss prevention (DLP) systems are often used to detect the exfiltration of sensitive data, such as personally identifiable information (PII) or protected health information (PHI), in e-mail messages, data uploads, PNG or JPEG images, and other forms of communication. These technologies often perform deep packet inspection (DPI) to decrypt and inspect outbound traffic that is TLS encrypted.
DLP systems can also be used to disable removable media drive interfaces on servers and workstations, and also to encrypt data written onto removable media.
Static DLP tools are used to discover sensitive and proprietary data in databases, file servers, and other data storage systems.
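Real DLP systems are far more sophisticated, but as a minimal sketch of the pattern-matching idea behind them, an outbound message can be scanned for strings shaped like U.S. Social Security or payment card numbers; the patterns below are simplified for illustration:

```python
import re

# Simplified, illustrative patterns -- production DLP uses validated
# detectors (e.g., Luhn checks for card numbers), not bare regexes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def scan_outbound(text: str) -> list[str]:
    """Return the names of any sensitive-data patterns found in the text."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

if scan_outbound("My SSN is 078-05-1120"):
    print("Blocking message: possible PII exfiltration")
```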
An organization’s information architecture is dynamic and constantly changing. As a result, its security posture is also dynamic and constantly changing. Provisioning (and decommissioning) of various information resources can have significant impacts (both direct and indirect) on the organization’s security posture. For example, an application may either directly introduce new vulnerabilities into an environment or integrate with a database in a way that compromises the integrity of the database. For these reasons, security planning and analysis must be an integral part of every organization’s resource provisioning processes, as well as throughout the lifecycle of all resources. Important security considerations include
Asset inventory. Maintaining a complete and accurate inventory is critical to ensure that all potential vulnerabilities and risks in an environment can be identified, assessed, and addressed. Indeed, so many other critical security processes are dependent upon sound asset inventory that asset inventory is one of the most important (but mundane) activities in IT organizations.
Asset inventory is the first control found in the well-known CIS 20 Controls. Asset inventory appears first because it is fundamental to most other security controls.
Virtual assets. Virtual machine sprawl has increasingly become an issue for organizations with the popularity of virtualization technology and software-defined networking (SDN). Virtual machines (VMs) can be (and often are) provisioned in a matter of minutes, but aren’t always properly decommissioned when they’re no longer needed. Dormant VMs often aren’t backed up and can go unpatched for many months. This exposes the organization to increased risk from unpatched security vulnerabilities.
Of particular concern to security professionals is the implementation of VMs without proper review and approvals. This was not a problem before virtualization, because organizations had other checks and balances in place to prevent the implementation of unauthorized systems (namely, the purchasing process). But VMs can be implemented unilaterally, often without the knowledge or involvement of other personnel within the organization.
Cloud assets. As more organizations adopt cloud strategies that include software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS) solutions, it’s important to keep track of these assets. Ultimately, an organization is responsible for the security and privacy of its applications and data — not the cloud service provider. Issues of data residency and trans-border data flow need to be considered.
A new class of security tools known as cloud access security brokers (CASBs) can detect access to, and usage of, cloud-based services. These tools give the organization more visibility into its sanctioned and unsanctioned use of cloud services. Many CASB systems, in cooperation with cloud services, can be used to control the use of cloud services.
Fundamental security operations concepts that need to be well understood and managed include the principles of need-to-know and least privilege, separation of duties and responsibilities, monitoring of special privileges, job rotation, information lifecycle management, and service-level agreements.
The concept of need-to-know states that only people with a valid business justification should have access to specific information or functions. In addition to having a need-to-know, an individual must have an appropriate security clearance level in order for access to be granted. Conversely, an individual with the appropriate security clearance level, but without a need-to-know, should not be granted access.
One of the most difficult challenges in managing need-to-know is implementing controls that actually enforce it. Also, information owners need to be able to distinguish “I need to know” from “I want to know,” “I want to feel important,” and “I’m just curious.”
Need-to-know is closely related to the concept of least privilege and can help organizations implement least privilege in a practical manner.
The principle of least privilege states that persons should have the capability to perform only the tasks (or have access to only the data) that are required to perform their primary jobs, and no more.
To give an individual more privileges and access than required invites trouble. Offering the capability to perform more than the job requires may become a temptation that results, sooner or later, in an abuse of privilege.
For example, giving a user full permissions on a network share, rather than just read and modify rights to a specific directory, opens the door not only for abuse of those privileges (for example, reading or copying other sensitive information on the network share) but also for costly mistakes (accidentally deleting a file — or the entire directory!). As a starting point, organizations should approach permissions with a “deny all” mentality, then add needed permissions as required.
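Here is a minimal sketch of that “deny all” approach, using a hypothetical grant table: access is refused unless an explicit grant exists for that user, resource, and right:

```python
# Hypothetical grant table: (user, resource) -> set of allowed rights.
GRANTS = {
    ("alice", "//fileserver/finance/reports"): {"read", "modify"},
}

def is_allowed(user: str, resource: str, right: str) -> bool:
    """Default deny: anything not explicitly granted is refused."""
    return right in GRANTS.get((user, resource), set())

print(is_allowed("alice", "//fileserver/finance/reports", "read"))    # True
print(is_allowed("alice", "//fileserver/finance/reports", "delete"))  # False
print(is_allowed("bob", "//fileserver/finance/reports", "read"))      # False
```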
Several important concepts associated with need-to-know and least privilege include
Aggregation. When people transfer between jobs and/or departments within an organization (see the section on job rotations later in this chapter), they often need different access and privileges to do their new jobs. Far too often, organizational security processes do not adequately ensure that access rights which are no longer required by an individual are actually revoked. Instead, individuals accumulate privileges, and over a period of many years an employee can have far more access and privileges than they actually need. This is known as aggregation, and it’s the antithesis of least privilege!
Privilege creep is another term commonly used to describe this phenomenon.
The concept of separation (or segregation) of duties and responsibilities ensures that no single individual has complete authority and control of a critical system or process. This practice promotes security in the following ways:
Here are some common examples of separation of duties and responsibilities within organizations:
In smaller organizations, separation of duties and responsibilities can be difficult to implement because of limited personnel and resources.
Privileged entity controls are the mechanisms, generally built into computer operating systems and network devices, that give privileged access to hardware, software, and data. In UNIX and Windows, the controls that permit privileged functions reside in the operating system. Operating systems for servers, desktop computers, and many other devices use the concept of modes of execution to define privilege levels for various user accounts, applications, and processes that run on a system. For instance, the UNIX root account and the Windows Server Enterprise, Domain, and Local Administrator account roles have elevated rights that allow those accounts to install software, view the entire file system, and, in some cases, directly access the OS kernel and memory.
Specialized tools are used to monitor and record activities performed by privileged and administrative users. This helps to ensure accountability on the part of each administrator and aids in troubleshooting, through the ability to view actions performed by administrators.
System or network administrators typically use privileged accounts to perform operating system and utility management functions. Supervisor or Administrator mode should be used only for system administration purposes. Unfortunately, many organizations allow system and network administrators to use these privileged accounts or roles as their normal user accounts, even when they aren’t doing work that requires this level of access. Yet another horrible security practice is allowing administrators to share a single “administrator” or “root” account.
Job rotation (or rotation of duties) is another effective security control that gives many benefits to an organization. Similar to the concept of separation of duties and responsibilities, job rotations involve regularly (or randomly) transferring key personnel into different positions or departments within an organization, with or without notice. Job rotations accomplish several important organizational objectives:
Job rotations can also include changing workers’ workstations and work locations, which can keep would-be saboteurs off balance and make them less likely to commit sabotage.
As with the practice of separation of duties, small organizations can have difficulty implementing job rotations.
The information lifecycle refers to the activities related to the introduction, use, and disposal of information in an organization. The phases in the information lifecycle typically are
Users of business- or mission-critical information systems need to know whether their systems or services will function when they need them, and users need to know more than “Is it up?” or “Is it down again?” Their customers, and others, hold users accountable for getting their work done in a timely and accurate manner, so consequently, those users need to know whether they can depend on their systems and services to help them deliver as promised.
The service-level agreement (SLA) is a quasi-legal document (it’s a real legal document when it is included in or referenced by a contract) that pledges the system or service performs to a set of minimum standards, such as
Because the SLA is a quantified statement, the service provider and the user alike can take measurements to see how well the service provider is meeting the SLA’s standards. This measurement, which is sometimes accompanied by analysis, is frequently called a scorecard.
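As a simple example of the arithmetic behind such a scorecard (the 99.9 percent target and the downtime figure are illustrative), measured availability is just uptime divided by total time for the period:

```python
def availability_pct(downtime_minutes: float, days: int = 30) -> float:
    """Measured availability for the period, as a percentage."""
    total_minutes = days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

SLA_TARGET = 99.9  # illustrative SLA: at most ~43 minutes down per 30 days

measured = availability_pct(downtime_minutes=55)
print(f"Measured: {measured:.3f}%  Target: {SLA_TARGET}%  "
      f"{'MET' if measured >= SLA_TARGET else 'VIOLATED'}")
```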
Finally, for an SLA to be meaningful, it needs to have teeth! How will the SLA be enforced, and what will happen when violations occur? What are the escalation procedures? Will any penalties or service credits be paid in the event of a violation? If so, how will penalties or credits be calculated?
Resource protection is the broad category of controls that protect information assets and information infrastructure. Resources that require protection include
Media management refers to a broad category of controls that are used to manage information classification and physical media. Data classification refers to the tasks of marking information according to its sensitivity, as well as the subsequent handling, storage, transmission, and disposal procedures that accompany each classification level. Physical media is similarly marked; likewise, controls specify handling, storage, and disposal procedures. See Chapter 4 to learn more about data classification.
Sensitive information such as financial records, employee data, and information about customers must be clearly marked, properly handled and stored, and appropriately destroyed in accordance with established organizational policies, standards, and procedures:
For example, a sensitive document might be marked PRIVILEGED AND CONFIDENTIAL. See Chapter 4 for a more detailed discussion of data classification.

Maintaining a complete and accurate inventory, with configuration information about all of an organization’s hardware and software information assets, is an important security operations function.
Without this information, managing vulnerabilities becomes a truly daunting challenge. With popular trends such as “bring your own device” becoming more commonplace in many organizations, it is critical that organizations work with their information security leaders and end users to ensure that all devices and applications that are used are known to — and appropriately managed by — the organization. This allows any inherent risks to be known — and addressed.
The formal process of detecting, responding to, and fixing a security problem is known as incident management (also known as security incident management).
Incident management includes the following steps:
Detective and preventive security measures include various security technologies and techniques, including:
Software bugs and flaws inevitably exist in operating systems, database management systems, and various applications, and are continually discovered by researchers. Many of these bugs and flaws are security vulnerabilities that could permit an attacker to control a target system and subsequently access sensitive data or critical functions. Patch and vulnerability management is the process of regularly assessing, testing, installing, and verifying fixes and patches for software bugs and flaws as they are discovered.
To perform patch and vulnerability management, follow these basic steps:
Develop a plan to either install the security patch or to perform another workaround, if any is available.
You should base your decision on which solution best eliminates the vulnerability or reduces risk to an acceptable level.
Test the security patch or workaround in a test environment.
This process involves making sure that stated functions still work properly and that no unexpected side-effects arise as a result of installing the patch or workaround.
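As a toy sketch of the assessment activity that precedes these steps (the inventory and advisory data below are invented for illustration), vulnerability management ultimately comes down to comparing what’s installed against what’s known to be vulnerable:

```python
# Hypothetical inventory and advisory feed.
INSTALLED = {"openssl": "1.1.1k", "nginx": "1.18.0"}
ADVISORIES = {"openssl": {"vulnerable": "1.1.1k", "fixed": "1.1.1l"}}

def assess():
    """Flag installed packages that match a known-vulnerable version."""
    for package, version in INSTALLED.items():
        advisory = ADVISORIES.get(package)
        if advisory and version == advisory["vulnerable"]:
            yield (package, version, advisory["fixed"])

for package, version, fixed in assess():
    print(f"{package} {version} is vulnerable; plan to test and deploy {fixed}")
```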
Change management is the business process used to control architectural and configuration changes in a production environment. Rather than allowing changes to systems, and to the way they relate to one another, to be made on an ad hoc basis, change management imposes a formal process of request, design, review, approval, implementation, and recordkeeping.

Configuration management is the closely related process of actively managing the configuration of every system, device, and application, and then thoroughly documenting those configurations.

Change management is the approval-based process that ensures that only approved changes are implemented.
Developing and implementing effective backup and recovery strategies are critical for ensuring the availability of systems and data. Other techniques and strategies are commonly implemented to ensure the availability of critical systems, even in the event of an outage or disaster.
Backups are performed for a variety of reasons that center around a basic principle: sometimes things go wrong and we need to get our data back. In order to cover all reasonable scenarios, backup storage strategies often involve the following:
Recovery site strategies include hot sites (a fully functional data center or other facility that is always up and ready, with near real-time replication of production systems and data), cold sites (a data center or facility that may have some recovery equipment available, but not configured, and no backup data onsite), and warm sites (some hardware and connectivity prepositioned and configured, plus an offsite copy of backup data).
Selecting a recovery site strategy has everything to do with cost and service level. The faster you want to recover data processing operations in a remote location, the more you will have to spend in order to build a site that is “ready to go” at the speed you require.
In a nutshell: Speed costs.
Many large organizations operate multiple data centers for critical systems with real-time replication and load balancing between the various sites. This is the ultimate solution for large commercial sites that have little or no tolerance for downtime. Indeed, a well-engineered multi-site application can suffer even significant whole-data-center outages without customers even knowing anything is wrong.
System resilience, high availability, quality of service (QoS), and fault tolerance are similar characteristics that are engineered into a system to make it as reliable as possible:
A variety of disasters can beset an organization’s business operations. They fall into two main categories: natural and man-made.
In many cases, formal methodologies are used to predict the likelihood of a particular disaster. For example, 50-year flood plain is a term that you’ve probably heard to describe the maximum physical limits of a river flood that’s likely to occur once in a 50-year period. The likelihood of each of the following disasters depends greatly on local and regional geography:
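To illustrate the arithmetic behind such likelihood estimates: a “50-year” event has, by definition, roughly a 2 percent chance of occurring in any single year, and the chance of seeing at least one such event over a longer planning horizon compounds accordingly, as this sketch shows:

```python
def prob_at_least_one(return_period_years: float, horizon_years: int) -> float:
    """Probability of at least one event over the horizon, assuming
    independent years with annual probability 1 / return_period."""
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** horizon_years

# A "50-year" flood is far from a once-in-a-lifetime rarity:
print(f"{prob_at_least_one(50, 30):.0%}")  # ~45% over 30 years
print(f"{prob_at_least_one(50, 50):.0%}")  # ~64% over 50 years
```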
Many of these occurrences may have secondary effects; often these secondary effects have a bigger impact on business operations, sometimes in a wider area than the initial disaster (for instance, a landslide in a rural area can topple power transmission lines, which results in a citywide blackout). Some of these effects are
As if natural disasters weren’t enough, man-made disasters can also disrupt business operations, whether the result of deliberate or accidental acts:
Disasters can affect businesses in a lot of ways — some obvious, and others not so obvious.
The list above isn’t complete, but should help you think about all the ways a disaster can affect your organization.
Emergency response teams must be prepared for every reasonably possible scenario. Members of these teams need a variety of specialized training to deal with such things as water and smoke damage, structural damage, flooding, and hazardous materials.
Organizations must document all the types of responses so that the response teams know what to do. The emergency response documentation consists of two major parts: how to respond to each type of incident, and the most up-to-date facts about the facilities and equipment that the organization uses.
In other words, you want your teams to know how to deal with water damage, smoke damage, structural damage, hazardous materials, and many other things. Your teams also need to know everything about every company facility: Where to find utility entrances, electrical equipment, HVAC equipment, fire control, elevators, communications, data closets, and so on; which vendors maintain and service them; and so on. And you need experts who know about the materials and construction of the buildings themselves. Those experts might be your own employees, outside consultants, or a little of both.
Responding to an emergency branches into two activities: salvage and recovery. Tangential to this is preparing financially for the costs associated with salvage and recovery.
The salvage team is concerned with restoring full functionality to the damaged facility. This restoration includes several activities:
Recovery comprises equipping the BCP team (yes, the BCP team — recovery involves both BCP and DRP) with any logistics, supplies, or coordination in order to get alternate functional sites up and running. This activity should be heavily scripted, with lots of procedures and checklists in order to ensure that every detail is handled.
The salvage and recovery operations can cost a lot of money. The organization must prepare for potentially large expenses (at least several times the normal monthly operating cost) to restore operations to the original facility.
Financial readiness can take several forms, including:
People are the most important resource in any organization. As such, disaster response must place human life above all other considerations when developing disaster response plans and when emergency responders are taking action after a disaster strikes. In terms of life safety, organizations can do several things to ensure safety of personnel:
A critical component of the DRP is the communications plan. Employees need to be notified about closed facilities and any special work instructions (such as an alternate location to report for work). The planning team needs to realize that one or more of the usual means of communications may have also been adversely affected by the same event that damaged business facilities. For example, if a building has been damaged, the voice-mail system that people would try to call into so that they could check messages and get workplace status might not be working.
Organizations need to anticipate the effects of an event when considering emergency communications. For instance, you need to establish in advance two or more ways to locate each important staff member. These ways may include landlines, cell phones, spouses’ cell phones, and alternate contact numbers (such as neighbors or relatives).
Many organizations’ emergency operations plans include the use of audio conference bridges so that personnel can discuss operational issues hour by hour throughout the event. Instead of relying on a single provider (which you might not be able to reach because of communications problems or because it’s affected by the same event), organizations should have a second (and maybe even a third) audio conference provider established. Emergency communications documentation needs to include dial-in information for both (or all three) conference systems.
In addition to internal communications, the DRP must address external communications to ensure that customers, investors, government, and media are provided with accurate and timely information.
When a disaster strikes, an organization’s DRP needs to include procedures to assess damage to buildings and equipment.
First, the response team needs to examine buildings and equipment, to determine which assets are a total loss, which are repairable, and which are still usable (although not necessarily in their current location).
For such events as floods, fires, and earthquakes, a professional building inspector usually will need to examine a building to see whether it is fit for occupation. If not, then the next step is determining whether a limited number of personnel will be permitted to enter the building to retrieve needed assets.
Once assessment has been completed, assets can be divided into three categories:
The ultimate objective of the disaster recovery team is the restoration of work facilities with their required assets, so that business may return to normal. Depending on the nature of the event, restoration may take the form of building repair, building replacement, or permanent relocation to a different building.
Similarly, assets used in each building may need to undergo their own restoration, whether that takes the form of replacement, repair, or simply placing it back into service in whatever location is chosen.
Prior to full restoration, business operations may be conducted in temporary facilities, possibly by alternate personnel who may be other employees or contractors hired to fill in and help out. These temporary facilities may be located either near the original facilities or some distance away. The circumstances of the event will dictate some of these matters, as well as the organization’s plans for temporary business operations.
An organization’s ability to effectively respond to a disaster is highly dependent on its advance preparations. In addition to the development of high quality, workable disaster recovery and business continuity plans that are kept up to date, the next most important part is making sure that employees and other needed personnel are periodically trained in the actual response and continuity procedures. Training and practice helps to reinforce understanding of proper response procedures, giving the organization the best chance at surviving the disaster.
An important part of training is the participation in various types of testing, which is discussed in the following section.
By the time that an organization has created a DRP, it’s probably spent hundreds of hours and possibly tens (or hundreds) of thousands of dollars on consulting fees. You’d think that after making such a big investment, they’d test the DRP to make sure that it really works when an actual disaster strikes!
The following sections outline DRP testing methods.
A read-through (or checklist) test is a detailed review of DRP documents, performed by individuals on their own. The purpose of a read-through test is to identify inaccuracies, errors, and omissions in DRP documentation.
It’s easy to coordinate this type of test, because each person who performs the test does it when his or her schedule permits (provided they complete it before any deadlines).
By itself, a document review is an insufficient way to test a DRP; however, it’s a logical starting place. You should perform one or more of the other DR tests described in the following sections shortly after you do a read-through test.
A walkthrough (or tabletop or structured walkthrough) test is a team approach to the read-through test. Here, several business and technology experts in the organization gather to “walk” through the DRP. A moderator or facilitator leads participants to discuss each step in the DRP so that they can identify issues and opportunities for making the DRP more accurate and complete. Group discussions usually help to identify issues that people will not find when working on their own. Often the participants want to perform the review in a fancy mountain or oceanside retreat, where they can think much more clearly! (Yeah, right.)
During a walkthrough test, the facilitator writes down “parking lot” issues (items to be considered at a later time, written down now so they will not be forgotten) on a whiteboard or flipchart while the group identifies those issues. These are action items that will serve to make improvements to the DRP. Each action item needs to have an accountable person assigned, as well as a completion date, so that the action items will be completed in a reasonable time. Depending upon the extent of the changes, a follow-up walkthrough may need to be conducted at a later time.
In a simulation test, all the designated disaster recovery personnel practice going through the motions associated with a real recovery. In a simulation, the team doesn’t actually perform any recovery or alternate processing.
An organization that plans to perform a simulation test appoints a facilitator who develops a disaster scenario, using a type of disaster that’s likely to occur in the region. For instance, an organization in San Francisco might choose an earthquake scenario, and an organization in Miami could choose a hurricane.
In a simple simulation, the facilitator reads out announcements as if they’re news briefs. Such announcements describe an unfolding scenario and can also include information about the organization’s status at the time. An example announcement might read like this:
It is 8:15 a.m. local time, and a magnitude 7.1 earthquake has just occurred, fifteen miles from company headquarters. Building One is heavily damaged and some people are seriously injured. Building Two (the one containing the organization’s computer systems) is damaged and personnel are unable to enter the building. Electric power is out, and the generator has not started because of an unknown problem that may be earthquake related. Executives Jeff Johnson and Sarah Smith (CIO and CFO) are backpacking on the Appalachian Trail and cannot be reached.
The disaster-simulation team, meeting in a conference room, discusses emergency response procedures and how the response might unfold. They consider the conditions described to them and identify any issues that could impact an actual disaster response.
The simulation facilitator makes additional announcements throughout the simulation. Just like in a real disaster, the team doesn’t know everything right away — instead, news trickles in. In the simulation, the facilitator reads scripted statements that, um, simulate the way that information flows in a real disaster.
A more realistic simulation can be held at the organization’s emergency response center, where some resources that support emergency response may be available. Another idea is to hold the simulation on a day that is not announced ahead of time, so that responders will be genuinely surprised and possibly be less prepared to respond.
A parallel test involves performing all the steps of a real recovery, except that you keep the real, live production systems running. The actual production systems run in parallel with the disaster recovery systems. The parallel test is very time-consuming, but it does test the accuracy of the applications because analysts compare data on the test recovery systems with production data.
The technical architecture of the target application determines how a parallel test needs to be conducted. The general principle of a parallel test is that the disaster recovery system (meaning the system that remains on standby until a real disaster occurs, at which time, the organization presses it into production service) runs process work at the same time that the primary system continues its normal work. Precisely how this is accomplished depends on technical details. For a system that operates on batches of data, those batches can be copied to the DR system for processing there, and results can be compared for accuracy and timeliness.
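As a minimal sketch of that comparison step (the file paths and line-oriented record format are hypothetical), the accuracy check can be as simple as counting records that differ between the two outputs:

```python
from itertools import zip_longest

def compare_batch_outputs(prod_path: str, dr_path: str) -> int:
    """Count records that differ between the production and DR output
    files, including records present in one file but not the other."""
    mismatches = 0
    with open(prod_path) as prod, open(dr_path) as dr:
        for lineno, (p, d) in enumerate(zip_longest(prod, dr), start=1):
            if p != d:
                mismatches += 1
                print(f"Record {lineno} differs")
    return mismatches

# Zero mismatches suggests the DR system processed the batch correctly.
compare_batch_outputs("/prod/output/batch.dat", "/dr/output/batch.dat")
```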
Highly interactive applications are more difficult to test in a strictly parallel test. Instead, it might be necessary to record user interactions on the live system and then “play back” those interactions using an application testing tool. Then responses, accuracy, and timing can be verified after the test to verify whether the DR system worked properly.
While a parallel test may be difficult to set up, its results can provide a good indication of whether disaster recovery systems will perform during a disaster. Also, the risks associated with a parallel test are low, since a failure of the DR system will not impact real business transactions.
A full interruption (or cutover) test is similar to a parallel test except that in a full interruption test, a function’s primary systems are actually shut off or disconnected. A full interruption test is the ultimate test of a disaster recovery plan because one or more of the business’s critical functions actually depends upon the availability, integrity, and accuracy of the recovery systems.
A full interruption test should be performed only after successful walkthroughs and at least one parallel test. In a full interruption test, backup systems process the full production workload and all primary and ancillary functions, including:
Business continuity and disaster recovery planning are closely related but distinctly different activities. As described in Chapter 3, business continuity focuses on keeping a business running after a disaster or other event has occurred, while disaster recovery deals with restoring the organization and its affected processes and capabilities back to normal operations.
Security professionals need to take an active role in their organization’s business continuity planning activities and related exercises. As a CISSP, you’ll be a recognized expert in the area of business continuity and disaster recovery, and you will need to contribute your specialized knowledge and experience to help your organization develop and implement effective and comprehensive business continuity and disaster recovery plans.
Physical security is yet another important aspect of the security professional’s responsibilities. Important physical security concepts and technologies are covered extensively in Chapter 5 and Chapter 7.
As with other information security concepts, ensuring physical security requires appropriate controls at the physical perimeter (this includes the building exterior, parking areas, and common grounds) and internal security controls to (most importantly) protect personnel, as well as to protect other physical and information assets from various threats, such as fire, flooding, severe weather, civil disturbances, terrorism, criminal activity, and workplace violence.
Security professionals contribute to the safety and security of personnel by helping their organizations develop and implement effective personnel security policies (discussed in Chapter 3), and through physical security measures (discussed in the preceding section, as well as Chapter 5 and Chapter 7).