Chapter 11
Business Continuity via Information Security and People Power

Disasters can happen. Natural disasters might strike in mere seconds or minutes to disrupt your information systems and your ability to conduct business so totally that you go out of business. An advanced persistent threat's attacks on your systems might take months to achieve the same cumulative effect.

Without a well-considered business continuity plan and without management and leadership committing the resources that the plan needs to make it real, chances are that major disruptions will put your organization out of business. It takes a solid, well-prepared team of people to take a badly disrupted organization, one that's been hit by a disaster-sized major information systems security attack, and get it back on its feet again.

In doing so, we'll have to go beyond Layer 7 of the OSI model and get into the people-centric functions, features, and protocols that are where the business operates. It's at these rarified, nontechnical layers that SSCPs may face their organization's weakest links in their overall information security chain of defenses. It's also above Layer 7 that the SSCP has the greatest opportunity to help the organization pivot away from seeing its people as the “weakest link” in its security processes and instead see them as perhaps the organization's greatest and most reliable strength after a major disruption has occurred.

Perhaps more than other information security activities, business continuity and disaster recovery must place a greater emphasis and a sharper focus on safety, security, and privacy. Complex attacks that drastically disrupt the organization take months of concentrated effort to recover from; and from the first moment of that recovery, somebody has to work hard to ensure that fragile infrastructures, systems, assets, people, and resources are protected from further attack, accidental damage, or other harm.

Let's take what we've explored thus far and put all of that administrative, logical, and physical CIANA+PS thinking into the context of managing how people and systems get ready to survive to operate. As we do that, we'll also see how security assessments, both continuous and as deliberate, scheduled activities, can contribute to planning for and maintaining business continuity.

What Is a Disaster?

The leaders of communities or even nations know that disasters involve large-scale disruption to economies, infrastructures, and lives; they often include injuries and deaths. They see disaster recovery as putting communities, markets, economies, and lives back together again. Chief executive officers and the directors of major companies see their workforce idled, possibly injured, and traumatized, and their capital equipment and facilities damaged and unusable; at its worst, they see the end of the organization as a viable business concern. But the information systems and information technology risk managers see a disaster as merely the loss of IT, communications, and connectivity services, in a way that disrupts the plans and activities of the organization. IT managers think just in terms of getting the Internet back up and running, the factory automation working again, and the phones back on.

Those are not incompatible viewpoints. But the time for keeping them separate, as if in worlds apart, is past.

As a cybersecurity professional, you'll need to be able to support both of these views and reconcile them to some extent in your day-to-day practice. The larger-scale viewpoint can be seen in action at the Disaster Recovery Institute International's website, https://drii.org, among many others. It provides purpose and can drive awareness and education efforts as you work with planners and managers to build concrete cyber disaster recovery plans and the larger, surrounding set of business continuity plans.

You'll also need to further expand on this IT-centric view of disaster recovery planning and business continuity by including the many different operational technology aspects of your organization, its supply chains, and its workforce. COVID-19 has shown that organizations can and will rapidly adapt their architectures by embracing greater degrees of remote access, as well as expanding their use of different IoT and OT systems, in order to survive and continue to operate. The traditional IT-focused view, which is formally part of the SSCP's domain of practice, will therefore be used as the starting point for this chapter's exploration of BC/DR planning and execution; but as we do, we'll necessarily expand that somewhat to embrace the world of operational technologies and other information processing by referring to it as information systems (IS) disaster recovery planning and then place that planning in the larger context of disruptions, disasters, and recovering from them.

Surviving to Operate: Plan for It!

Stop and think about the conditions and environment that you are working in or living in, right now, this very moment. The chances are very good that right at this moment, none of the organizations of your life—your place of work, the schools you and others in your family are attending, the markets you shop in, and other community activities—is in the middle of a major disruption; just as a statement of probabilities, none of them is going through a disaster right now or trying to recover from it. This presents you with a choice: seize this golden opportunity to make sure that your personal plans for disaster recovery and business continuity, and those for each of the organizations that are essential to your life, are in the best possible state of readiness they can be in; or, let complacency take hold, enjoy the moment, and spend your energy doing something else.

Smaller organizations—the SOHO and SMB operations and many entrepreneurial ones—postpone continuity planning, often indefinitely. Larger ones may postpone such planning as well. And many will not start planning for continuity and recovery until after they have somehow survived their first major disruption.

As the on-scene SSCP, you've got the professional responsibility to understand how well (or how poorly) the organizations you work with or serve have this aspect of their information security posture put together. Let's start on that path by defining what to look for.

Business Continuity

Business continuity is the set of core or essential activities that an organization needs to rapidly restore to operational status after a major disruption of any kind. Core activities can be either inward-facing (they deliver services to employees or members of the organization) or outward-facing (they deliver services to customers, external stakeholders, or the community or marketplace at large). Many larger-scale organizational planning and governance processes normally identify what these core, essential, strategic, or vital activities are; typically, those decisions and supporting rationale are captured in the business impact analysis (BIA) or in a similar knowledge base and planning product.

Business continuity includes every aspect of the organization's purpose, goals, or mission, and the people, systems, technologies, and resources that are required to achieve the required degree of continuity of operations. Making payroll commitments to employees, for example, requires the ability to actually get the money needed to pay those employees; this is often far beyond the scope of what the payroll department, with its payroll system as a set of software and data, can achieve on its own in the midst of recovering from a major disruption.

Business continuity plans, therefore, should lay out the steps or processes necessary to get these core activities back into useful operational service. These plans also identify the resources assigned to the plan, such as people, funding, third-party services, and work locations, and they identify the responsible managers or organizational units charged with carrying out these plans. Some of these activities are binary: either they are fully working or they are not. Others may have a range of services or outcomes they deliver, and it's up to the business continuity planners to set the aim points for these services. Automated logistics operations, for example, might first need to operate at reduced throughput to support a limited number of manufacturing operations (for a few products, rather than for the company's entire product line).

Ideally, these plans have a smooth demarcation or hand-over point where incident response activities logically end and where business continuity restoration begins. Quite often this hand-over also involves a different set of people and organizational functions taking up the responsibilities and authorities for task-level direction and decision making.

IS Disaster Recovery Plans

Information systems disaster recovery includes those activities and functions necessary to bring IT, OT, communications, and other information processing capabilities necessary for the designated core services back into safe, secure, and reliable operating condition. The degree of capability or capacity of each information service needed can also be either binary (all or nothing) or incremental, based on the needs of the core activities and services being brought back into operational use.

IS disaster recovery plans (IS DRPs) provide the plans, processes, and task flows necessary to bring the required information services and systems back into operational use. As with the business continuity plans, the IS DRPs should identify the resources assigned to the plan, such as people, funding, third-party services, and work locations, and identify the responsible managers or organizational units charged with carrying out these plans.

Large, complex architectures often require that sets of systems and capabilities be brought back up as groups, and then within each group, there may also be a natural order for bringing services back to operation, one or a few at a time. An information-driven manufacturing operation, for example, might require a recovery sequence that starts with essential physical support services (power, HVAC, and security) before restoring operation of the OT systems' supervisory control and management systems. These in turn can then control and monitor bringing up the interfaces that control inward logistics to the manufacturing area, selected robotics, computer-aided design and manufacturing (CADAM) or other systems, and then the outward logistics.
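
Bringing services back in groups, each in its natural order, is at heart a dependency-ordering problem. The following is a minimal Python sketch of that idea, not a prescribed method. The dependency map and its service names are hypothetical, loosely modeled on the manufacturing example; a real map would come from the organization's own architecture and BIA. It uses the standard library's graphlib module to group services into restart "waves."

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be running first.
# Names are illustrative only, loosely following the manufacturing example.
DEPENDS_ON = {
    "power": [],
    "hvac": ["power"],
    "physical_security": ["power"],
    "ot_supervisory_control": ["power", "hvac", "physical_security"],
    "inward_logistics": ["ot_supervisory_control"],
    "robotics": ["ot_supervisory_control"],
    "cadam": ["ot_supervisory_control"],
    "outward_logistics": ["inward_logistics", "robotics", "cadam"],
}


def recovery_waves(depends_on):
    """Return groups of services; each group can be restored once all
    earlier groups are back in operation."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

Services within one wave do not depend on each other, so a recovery team could work on them in parallel if staffing allows. A circular dependency in the map is reported as an error, which is itself a useful planning finding.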

Recovery planning must also consider and reflect the need for continued containment, isolation, or other enhanced cybersecurity operations. Depending upon the nature of the incidents that led to the disruption, the bare bones of the underlying IT systems—the CPUs, memory, disks, storage area networks—may need to be replaced with brand new equipment, then reloaded from “golden images” of their OS, applications, and initial data, before those systems can then be updated to reflect current operational data.

Plans, More Plans, and Triage

Figure 11.1 puts many different planning processes in a loosely arranged hierarchy; while many formal frameworks, such as those from NIST, ISO, and ITIL, offer sage advice, no one specific set of plans in any particular relationship is the most correct or most compliant with law and regulation. As a result, this figure shows no direct connecting lines or arrows from one plan to another; they all mesh together in the context of business planning and risk management planning. What this figure does illustrate, however, is that there are many interrelated and mutually supportive planning tasks or processes that organizations can and should use to be better prepared to adapt, survive, and overcome the anomalies. As an SSCP, you won't need to have deep knowledge of each of these plans or the planning processes that produce them. You will, however, serve your employers or clients best when you can offer advice and assistance that helps them meet their CIANA+PS needs by protecting their information, information systems, IT infrastructures, and people from harm.

FIGURE 11.1 Continuity of operations planning and supporting planning processes

Each of these layers of planning is (or should be) driven by the business impact analysis (BIA), which took the results of the risk assessment process to produce a prioritized approach to which risks, leading to which impacts to the organization, were the most important, urgent, or compelling to protect against. We've already looked briefly at the business continuity plan (BCP) and the disaster recovery plan (DRP); let's now look in more detail at some of the other planning processes that SSCPs will typically participate in, and the plans those processes produce:

  • Contingency operations planning takes business continuity considerations a few steps further by examining and selecting how to provide alternate means of getting business operations up and running again. This can embrace a variety of approaches, depending on the nature of the business logic in question:
    • Alternate work locations for employees to use
    • Alternate communications systems, internal and external, to keep employees, stakeholders, customers, or partners in touch, informed, and engaged
    • Information backup, archive, and restore capabilities, whether for physical backup of information and key documents or digital backups
    • Alternate processing capabilities
    • Alternate storage, support, and logistics processes
    • Temporary staffing, financial, and other key considerations
  • Critical asset protection planning looks at the protection required for strategic, high-value or high-risk assets in order to prevent significant loss of value, utility, or availability of these assets to serve the organization's needs. As you saw in Chapter 3, “Integrated Information Risk Management,” these can be people, intellectual property, databases, assembly lines, or almost anything that is hard to replace and almost impossible to carry on business without.
  • Physical security and safety planning focuses on preventing unauthorized physical access to the organization's premises, property, systems, and people; it also addresses fire, environmental, or other hazards that might cause human injury or death, cause property damage, or otherwise reduce the value of the organization and its ability to function. It works to identify safety hazards and reduce accidents. (Chapter 4, “Operationalizing Risk Mitigation,” identified key approaches to physical risk mitigation controls with which SSCPs should be familiar.)

Finally, we as SSCPs come back to the information security incident response planning processes, as shown in Chapter 10, “Incident Response and Recovery.” That planning process rightly focuses our attention on detecting IT and information systems events (or anomalies) that might be security incidents in the making, characterizing them, notifying appropriate organizational managers and leaders, and working through containment, eradication, and recovery tasks as we respond to such incidents.

The conclusion is inescapable: planning is what keeps us prepared so that we can respond, but our planning has to be multifaceted and allow us to look at our organization, our operations, our information architectures, and our risks across the whole spectrum of business strategic, tactical, and operational concerns and details.

It's important to make a distinction here between plans and planning. Plans are sets of tasks, objectives, resources, constraints, schedules, and success criteria, brought together in a coherent way to show us what we need to do and how we do it to achieve a set of goals. Planning is a process—an activity that people do to gather all of that information, understand it, and put it to use. Planning is iterative; you do it over and over again, and each time through, you learn more about the objectives, the tasks, the constraints; you learn more about what “success” (or “failure”) really means in the context of the planning you're doing. In the worst of all worlds, plans become documents that sit on shelves; they are taken down every year, dusted off, thumbed through, and put back on the shelf with minor updates perhaps. These plans are not living documents; they are useless. Plans that people use every day become living documents through use; they stay alive, current, and real, because the people served by those plans take each step of those plans and develop detailed procedures that they then use on the job to accomplish the intent of the plan.

In a very real sense, the planning you'll do to meet the CIANA+PS needs of your organization or business does not and should not end until that organization or business does. Ongoing, continuous planning is in touch with what the knowledge workers and knowledge-seeking workers on your team are doing, every day, in every aspect of their jobs.

Timelines for BC/DR Planning and Action

Chapter 3 introduced the concepts of qualitative and quantitative risk assessment. Three of those “magic numbers” fundamentally shape the timelines for planning and executing continuity and recovery tasks by how they translate senior management's risk tolerance (or appetite) into a sense of urgency, when it comes to continuity and recovery planning:

  • Maximum allowable outage (MAO)—The greatest time period that business operations can be allowed to be disrupted by this risk event.
  • Recovery time objective (RTO)—The time by which the systems must be restored to normal operational function after the occurrence of this risk event.
  • Recovery point objective (RPO)—The maximum allowable latency or lag between having all data current versus the state of the data as a result of the risk event. The shorter the RPO, the more frequently data needs to be backed up. Longer RPOs reflect a willingness to operate on restored systems, handling new data (new business transactions) while still working to restore ones lost by the event.

Figure 11.2 puts these into perspective, and in a very oversimplified way; it may be helpful to think of this as the timeline for one major core activity, or a set of very closely related core activities, rather than the company or organization as a whole. Navigating through this figure we see that the “boom”—the moment when the organization painfully detects the start of the incident—is in the middle of the timeline. Let's divide this chart into five time periods, running from past into the future:

  • Historical activity is to the left of everything on this chart. IS disaster recovery cannot see that far back in time, so to speak. Data retention policies have probably deleted that data; software images and operational procedural knowledge would need too many updates applied to bring it current.
  • Archived, recoverable operations represent the most valuable assets that the BC/DR plans need to produce, manage, and then use when and if boom happens. These include database backups (full or partial), transaction logs, systems images, installation kits for new software or updates, operational procedures, and other know-how about business operational activities. This may cover days, weeks, or months; recovering older data may be necessary for reloading or rebuilding a data warehouse, but is not generally needed for day-to-day operations.
  • Unnoticed but compromised operations begin the moment an intruder is first able to gain access to your information systems (both the digital and nondigital, human-based ones). Routinely, it seems, attackers are overlapping activities in their cyber kill chain, finding and exfiltrating copies of data while they continue to locate further assets of opportunity or interest within your infrastructure and systems. Given that the industry average time to detect an intrusion is still well in excess of six months, an alarming amount of valuable information can be surreptitiously copied by attackers. For all you know, during this time the attackers are financing their ongoing operations by selling your data (perhaps thinly disguised as legitimate-seeming market research reports or supporting data).
  • Core activity recovery tasks begin during incident detection and response; they continue until the elements needed for core activities are back in working order. For planning purposes, the maximum allowable outage period shows when these recovery operations must be completed so that operational users (internal, external, or both) can start using them to get useful work done again. The recovery time objective, which is generally shorter than the MAO, does not include getting end users back into the workplace, or other user-level tasks necessary to get back to work.
  • Full recovery includes the continued restoration and recovery of those less-urgent business activities and the systems and assets that support them.

FIGURE 11.2 Timelines for incident response, recovery, and continuity

You'll note one of those planning numbers, the recovery point objective, seems to be missing from this figure. It is represented by the relative spacing of the various systems and database backups, full or partial. Ideally, these happen frequently enough that the amount of time needed to reload from the last known good backup, and then use transaction logs or other records to re-accomplish lost work, is no longer than the RPO expressed as a duration.
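
One way to make that spacing concrete is to compare the gaps between backups with the RPO. The following Python sketch is a simplification under stated assumptions: the backup schedule and the eight-hour RPO are hypothetical, and it looks only at the spacing between backups, ignoring restore time and transaction log replay.

```python
from datetime import datetime, timedelta


def gaps_exceeding_rpo(backup_times, rpo, now):
    """Return (start, end) pairs where the time between consecutive
    backups, or since the most recent backup, is longer than the RPO."""
    points = sorted(backup_times) + [now]
    return [
        (earlier, later)
        for earlier, later in zip(points, points[1:])
        if later - earlier > rpo
    ]


if __name__ == "__main__":
    # Hypothetical schedule: backups at 00:00, 06:00, and 18:00; RPO of 8 hours.
    day = datetime(2024, 1, 15)
    backups = [day, day + timedelta(hours=6), day + timedelta(hours=18)]
    now = day + timedelta(hours=20)
    for start, end in gaps_exceeding_rpo(backups, timedelta(hours=8), now):
        print(f"Exposure window: {start:%H:%M} to {end:%H:%M}")
```

Any window this reports is a period in which a disruption could cost more work than the stated RPO allows; closing it means either more frequent backups or a renegotiated RPO.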

Looking more closely at MAO, it's important to remember that this will probably vary process by process, activity by activity. A hospital, for example, might be willing to tolerate days or weeks for the MAO for its patient billing systems, but require MAOs of just a few minutes for emergency room, surgical suites, and intensive-care patient monitoring and support systems.
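
Because these targets vary activity by activity, it can help to record them per activity and check them for consistency. The Python sketch below assumes a hypothetical RecoveryTargets record and made-up values based on the hospital example; real targets come from the BIA. One entry is deliberately inconsistent to show the check at work.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    activity: str
    mao: timedelta  # maximum allowable outage
    rto: timedelta  # recovery time objective

    def is_consistent(self):
        """The RTO should fit within the MAO, leaving time for
        user-level tasks needed to get back to work."""
        return self.rto <= self.mao


# Hypothetical values for illustration only; real targets come from the BIA.
# The emergency room entry is deliberately inconsistent (RTO longer than MAO).
TARGETS = [
    RecoveryTargets("patient billing", timedelta(days=7), timedelta(days=5)),
    RecoveryTargets("ICU patient monitoring", timedelta(minutes=5), timedelta(minutes=2)),
    RecoveryTargets("emergency room systems", timedelta(minutes=5), timedelta(minutes=10)),
]


def recovery_priority(targets):
    """Order activities so that those with the shortest MAO come first."""
    return sorted(targets, key=lambda target: target.mao)


def inconsistent_activities(targets):
    """Name the activities whose RTO cannot be met within their MAO."""
    return [t.activity for t in targets if not t.is_consistent()]


if __name__ == "__main__":
    for target in recovery_priority(TARGETS):
        print(f"{target.activity}: MAO {target.mao}, RTO {target.rto}")
    print("Needs replanning:", inconsistent_activities(TARGETS))
```

Sorting by MAO gives a first-cut triage order, while the consistency check flags targets that planners need to revisit before a disruption forces the issue.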

It's important to recognize that in all but the most trivial of situations, organizations are complex webs of business logic, core activities, supporting and ancillary processes, assets, and relationships. The COVID-19 pandemic experience taught many organizations that their previous assumptions and reasoning about risk estimation, maximum allowable recovery targets, and the effectiveness of backup, restoration, and recovery processes need to be significantly refreshed, reconsidered, and replanned. This is normal with plans and assumptions; it is why it has long been taught that no plan survives contact with reality.

Simple concepts are useful for initial planning; scaling them to large, complex situations often becomes impractical. Attempts to use single loss expectancy (SLE), for example, often fail to cope with the sheer numbers of assets and critical services involved in most information systems environments. This is especially true when considering the nature of complex cyberattacks:

  • Data exfiltration often takes place during the several months of attack activity prior to the disruption event itself. SLE and many risk assessment techniques have great difficulty dealing with the wide range of reputational impacts, business losses, and indirect, consequential damages that result when stolen data is replicated and sold on the Dark Web.
  • Ransom attacks have not demonstrated any real correlation of the ransom demanded to the asset values being attacked; rather, they seem to reflect a perception of the victims' willingness to pay and their ability to raise the cash.
  • Ransomware and related attacks tend to target the vast majority of IT-hosted assets within the attackers' reach, while most risk assessment methods find it difficult to deal with an “all or nothing” loss on a single event.

Options for Recovery

A wide range of choices enable organizations to employ as rich and robust a continuity plan as they may need. Some may be firmly in place, well supported with resources, procedures, and designated team members to put them into action when necessary. Others may be concepts under study. This range of options may include:

  • Archive and backup functions for operational databases, data warehouses, software development environments, production software systems, and knowledge management systems.
  • Nondigital (print or other) archives or backups of process knowledge, licenses, contracts, and other critical items.
  • Alternate processing sites that provide workspaces, connectivity, and other services to support team members as they restore continuity capabilities, and as they apply them to core business activities. This can involve a mix of off-premises, on-premises, cloud-hosted, or work-from-home capabilities.
  • Alternate business plans and processes that reflect (and maybe take advantage of) the changed circumstances of the disruption and its aftermath.
  • Alternate organizational relationships can be both internal (such as realignment of people and reporting relationships) and external (bringing in new partners or strategic provider relationships).
  • Changes to third-party services may be needed (such as specialists to restore critical capabilities), or modifications may be needed with existing service providers.
  • Shared (recovery) responsibility arrangements can bring in continuity and recovery specialists, take on some (or all) of the immediate and ongoing business activities (acting as a surrogate provider, so to speak), or meet other needs.

In larger enterprises, the entire BC/DR planning team should take on the responsibility of considering the right mix of capabilities needed for different types of disruptions, and then ensure that detailed planning and implementation go as far as they need to go. The SSCP's role in these will be to assist in understanding the security needs of any and all of these options.

Those security needs span the full CIANA+PS spectrum; they apply to ensuring the physical security of the organization's workforce, and its use of information assets, IT, OT, and communications capabilities, during incident response, recovery, and throughout all business continuity activities. Recovery of capabilities and bringing the organization back from disaster cannot be done if its people cannot trust the data they're using or the communications and collaboration arrangements they rely on to get work started up again.

Let's take a closer look at some of the security-specific implementation considerations that choosing these options requires. Then, in the next section, we'll look at how these may be put to use. Along the way, we'll contrast some aspects of their implementation in SOHO and smaller SMB architectures versus the larger enterprise systems.

Backups, Archives, and Image Copies

The typical SOHO or smaller SMB environments demonstrate how the dramatic changes in storage technologies, costs, and capabilities have changed the way we think (or should think) about providing a safety net for our systems. This is shown in the different perceptions of these three different terms and the many different ideas for their use. Loosely, the term backup tends to encompass:

  • Any copy, in whole or in part, of a system (which is hosted on an IT, OT, or other infrastructure of some kind),
  • which can then be loaded onto a similar host infrastructure at some future time,
  • so as to restore the condition of that system to what it was at the time the backup was made.

Backups can be used for troubleshooting, forensic analysis and investigation, removing the unwanted effects of faulty updates (to data, procedures, software, hardware, or any combination thereof), and of course as part of recovering from information security incidents and IS disasters.

Archives or archival copies are backups made for longer-term recordkeeping and are often stored in different locations than those used for operational data and backups. This represents the traditional intent that archival copies are generally kept to satisfy legal and other risk management concerns regarding records retention and audit, but (because of their age) tend not to be useful for restoring the organization's information systems after an incident. They are often called historical archives to reflect this view. However, as cyberattacks have become more sophisticated, more organizations are finding that their archive copies, sometimes from one or two years prior to an attack, are the only reliable basis from which to rebuild their operational activities.

Backups, archives, and images (specifically, image copies of systems or portions thereof) are made of data, code, and other information on a source system or storage volume by copying to a destination device, system, or storage volume. These copies capture the current state of data, control information, code, and other related details on the source, which pertain to the task, activity, process, subsystem, or system of interest.

Thus, a backup of a SOHO client's accounting and finance system would not usually include files or data that are not part of that system; a backup of the entire accounting and financial department's systems and records, however, probably would include email, other correspondence, reports, spreadsheets, and anything else the department uses in the course of its activities.

There's also an aspect of time or purpose that separates these three ideas, when put into practice:

  • Image copies tend to be generated at the device level: copying every bit on a storage device, or every bit in RAM on a particular processor or server, generates a copy that preserves the internal structure of that data. Image copies do not alter the data or interpret the data in any way. Disk image copies include all of the storage extents marked as deleted (and in some cases may even include extents marked as “bad”). This allows the entire image to be reloaded onto another device or storage volume, either for use or for forensic analysis and investigation.
  • Backup copies of a disk or storage volume will read the directory structure on that device and then copy the selected files, file by file, to the destination storage. Backups tend to not copy files marked as deleted as a result.
  • Database backups or backups of elements within a database or data warehouse are generally made by the database application itself, under direction of the user or database analyst. These may be of selected records, views, or the entire database.

For any backup of any kind—a backup dataset, an image copy, or data in the archive—the full spectrum of CIANA+PS security needs must be met. Backups that won't load, or that cannot be accessed at the moment they are needed, aren't useful. The trustworthiness of any backup is directly tied to its integrity. Backups and archives must be managed if they are to be useful, prudent, and kept safe; the typical “casual SOHO” user's collection of removable storage, cloud storage, and papers usually cannot deliver a reliable, rapid recovery capability when it is needed most. For any and every size organization, backup, archive, and recovery planning needs to reach down to the individual user level, even to include any of the shadow IT information in backup, restore, and archive planning.
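
Since a backup's trustworthiness rests on its integrity, one common safeguard is to record a cryptographic digest of each file when the backup is made and check it again before restoring. The following Python sketch illustrates that idea using SHA-256 from the standard library; the function names and the file used in the demonstration are hypothetical. It assumes the manifest is stored and protected separately from the backup itself, since a manifest an attacker can rewrite proves nothing.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backup files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir):
    """Map each file's path (relative to backup_dir) to its SHA-256 digest."""
    root = Path(backup_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(backup_dir, manifest):
    """Return the names of files that are missing or whose contents changed."""
    root = Path(backup_dir)
    problems = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(name)
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "ledger.db"  # hypothetical backup file
        sample.write_bytes(b"original backup contents")
        manifest = build_manifest(tmp)
        print("Before tampering:", verify_manifest(tmp, manifest))
        sample.write_bytes(b"tampered contents")
        print("After tampering:", verify_manifest(tmp, manifest))
```

A check like this detects corruption or tampering after the backup was made; it cannot tell you whether the data was already compromised when the backup was taken.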

Cryptographic Assets and Recovery

Being able to keep backup and archival copies and images safe and secure almost always means they are protected in encrypted form. Legal and contractual compliance requirements often dictate that data at rest be encrypted, and this would of course include any and all backup or archival copies of that data, along with any image copies made of systems or storage devices that host or store that data.

There may be many layers of such encryption involved: individual records or files may be encrypted, the whole of an archive or backup set of data may be encrypted, and then the storage system itself may use a combination of dispersal, redundancy, and encryption to protect against inadvertent disclosure. None of this helps, and no data can be retrieved, if you've lost the keys needed to unlock it. As a special subset of vital assets worthy of thoughtful backup, archiving, and protection planning, your cryptographic assets (the symmetric keys used to encrypt backups and archives, the digital certificates and private keys used by the organization and its people, and related information) must be kept and managed. Unlocking encrypted archival or backup data will probably require that the original certificates, identities, and keys that were in use when it was encrypted be restored (or initialized) on the target system before the attempt is made to recover data from the backup. This may seem obvious, yet it is often overlooked or handled incorrectly. For organizations that have chosen to establish their own private certificate authority infrastructures, it's even more important. Recoveries from malware or other sabotage that require new endpoints or servers also mean a loss of the certificate stores and keychains that resided on the now-unusable systems, and these will have to be re-created or reloaded from trustworthy backups.
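
A simple readiness check follows from this: for every backup set, confirm that the key needed to unlock it is still held somewhere recoverable. The Python sketch below assumes a hypothetical backup catalog that records a key identifier for each backup set, plus a hypothetical list of key identifiers confirmed to be in escrow. It handles identifiers only, never key material, and is an illustration rather than a key management system.

```python
def unrecoverable_backups(backup_catalog, escrowed_key_ids):
    """Return the backup sets whose encryption key is absent from escrow."""
    return sorted(
        name
        for name, key_id in backup_catalog.items()
        if key_id not in escrowed_key_ids
    )


# Hypothetical catalog: backup set name -> identifier of the key that encrypted it.
BACKUP_CATALOG = {
    "finance-2023-q4-archive": "key-2023-b",
    "hr-weekly-full": "key-2024-a",
    "erp-nightly-incremental": "key-2024-a",
}

# Hypothetical set of key identifiers confirmed to be held in offline escrow.
ESCROWED_KEY_IDS = {"key-2024-a"}

if __name__ == "__main__":
    for name in unrecoverable_backups(BACKUP_CATALOG, ESCROWED_KEY_IDS):
        print(f"No escrowed key for backup set: {name}")
```

Here the older archive is flagged, which mirrors a common failure: keys are rotated and retired on schedule while the archives they protect are still being retained.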

Using a hardware security module (HSM) for secure cryptographic asset management is often required for financial, medical, and some critical infrastructure information. That's a great start; it's important to ensure that the data on the HSM, and the keys, tokens, and so forth necessary to access and use that HSM, are also backed up securely. And it should go without saying that your HSM strategy should include appropriate backup and recovery options in case it, too, is damaged or unavailable.

“Golden Images” and Validation

Software development and IT organizations often identify particular system images as golden images to indicate their utility, purity, and value.

Golden images are used for the complete initialization of servers, endpoints, and other systems (be these virtual or real) by a bitwise copy onto their boot device or storage subsystem. This loads a known, tested, and configuration-controlled instance of the complete operating systems, utilities, and applications software needed for that class of device, or for the type(s) of users who will be its primary users. Depending upon software (and data) licensing being used, this initialization may then require additional steps to register this new instance of that licensed content with the license management system. Nonetheless, images used in this fashion save time and expense, and they reduce exposure to the risks of poorly controlled or managed systems and their use.

Ideally, golden images are created on mastering systems that have been thoroughly inspected by anti-malware and other threat detection services or tools. The software and data to be built into the image must also pass these inspections and filters, and be from trustworthy, controlled software and data supply chains. In this way, the builders can attest to the purity or cleanliness (in anti-malware and threat surface terms) of that image, at the date and time at which that image is created.

Both of these aspects of golden images more than demonstrate their worth when endpoints, servers, or whole infrastructures are being restored after a security compromise, cyberattack, or major disruption. They provide a reliable instant in time to which the bare metal of the infrastructure can be reset. They do so in far less time, and with far less effort, than would be needed to reinstall hypervisors, OSs, applications, identities, system data, and then application data on these devices, one by one.

Golden images are only as trustworthy as the processes, supply chains, systems, and people who create, store, and use them are. Compromises (or exploited vulnerabilities) anywhere along that chain can weaken the high reliability and integrity that the golden image process claims to deliver to users.
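The builder's attestation can be captured as a small signed record: the image's digest, the build date and time, and an authentication tag that lets the restore team detect tampering with either the image or the record. The sketch below uses an HMAC from Python's standard library as a stand-in; a real deployment would more likely use asymmetric signatures and keys held in an HSM. The function names, the sample image bytes, and the key are all hypothetical.

```python
import hashlib
import hmac
import json


def _tag(fields: dict, signing_key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the record fields."""
    payload = json.dumps(fields, sort_keys=True).encode()
    return hmac.new(signing_key, payload, hashlib.sha256).hexdigest()


def attest_image(image_bytes: bytes, built_at: str, signing_key: bytes) -> dict:
    """Create the attestation record at image build time."""
    fields = {"sha256": hashlib.sha256(image_bytes).hexdigest(),
              "built_at": built_at}
    return {**fields, "tag": _tag(fields, signing_key)}


def image_is_trusted(image_bytes: bytes, record: dict, signing_key: bytes) -> bool:
    """Check both the record's tag and the image's digest before use."""
    fields = {"sha256": record["sha256"], "built_at": record["built_at"]}
    tag_ok = hmac.compare_digest(_tag(fields, signing_key), record.get("tag", ""))
    actual = hashlib.sha256(image_bytes).hexdigest()
    hash_ok = hmac.compare_digest(actual, record["sha256"])
    return tag_ok and hash_ok


# Hypothetical demonstration values.
key = b"demo-only-signing-key"
image = b"\x7fGOLDEN-IMAGE-CONTENTS"
record = attest_image(image, "2024-05-01T02:00:00Z", key)

good = image_is_trusted(image, record, key)
tampered_image = image_is_trusted(image + b"!", record, key)
forged_record = image_is_trusted(image, {**record, "built_at": "2024-06-01T00:00:00Z"}, key)
```

Note what this does and does not show: a valid tag says the image is the one the builder produced at that time, not that the build environment itself was clean.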

Scan Before Loading: Blocking Historical Zero-Day Attacks

SOHO and smaller SMB users have no doubt had the experience of attempting to access older data from a removable drive or a backup dataset from their cloud storage provider, only to later on encounter a variety of malware-related behaviors on their previously malware-free system. This can of course happen when the original files had not been subject to a thorough anti-malware inspection; more often, however, it happens when the malware that infects that backup set was there all along, but went undetected by the anti-malware or other intrusion detection systems in use at the time the file was written. The malware may have been a true zero-day attack, involving vulnerabilities not known at that time to the security community; or it may have been known and reported, but not included in the threat detection models, patterns, and signatures in use at that time on the user's systems.

This is why it is vitally important to thoroughly scan any and all backups before they are used to reinitialize or reset any systems. You must use the latest, up-to-date versions of all threat detection, intrusion detection, anti-malware, and blocked/allowed systems and tools, along with up-to-date and complete copies of their signatures, models, patterns, definitions, parameters, and other control information, as you do this. Ideally, your security team or the recovery specialists use a physically isolated, clean workstation or system to perform these inspections with, adding a further air gap of protection between that system (as it scans the backup dataset or image) and the rest of your infrastructure.
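The rule in this section can be written down as a simple gate in the recovery procedure: a backup is cleared for loading only if it has been rescanned, with current definitions, and came back clean. The sketch below models only that decision; it does not call any real anti-malware product, and the `ScanRecord` fields and the one-day freshness limit are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ScanRecord:
    scanned_at: datetime         # when the scan of the backup finished
    definitions_dated: datetime  # release time of the signatures used
    threats_found: int


def cleared_for_restore(backup_created: datetime,
                        scan: Optional[ScanRecord],
                        now: datetime,
                        max_definition_age: timedelta = timedelta(days=1)) -> bool:
    """True only if a clean scan used definitions that are still current."""
    if scan is None:
        return False
    fresh_definitions = now - scan.definitions_dated <= max_definition_age
    scanned_with_them = scan.scanned_at >= scan.definitions_dated
    scanned_after_backup = scan.scanned_at >= backup_created
    return (fresh_definitions and scanned_with_them
            and scanned_after_backup and scan.threats_found == 0)


now = datetime(2024, 6, 1, 12, 0)
backup_created = datetime(2021, 3, 15)

# Clean when written, but judged only by 2021-era signatures.
old_scan = ScanRecord(datetime(2021, 3, 15, 1, 0), datetime(2021, 3, 14), 0)
# Rescanned this morning with this morning's signatures.
fresh_scan = ScanRecord(datetime(2024, 6, 1, 11, 0), datetime(2024, 6, 1, 6, 0), 0)

old_ok = cleared_for_restore(backup_created, old_scan, now)
fresh_ok = cleared_for_restore(backup_created, fresh_scan, now)
unscanned_ok = cleared_for_restore(backup_created, None, now)
```

The first case is the historical zero-day scenario: the backup passed every check available when it was written, yet that result says nothing about threats identified since.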

Restart from a Clean Baseline

When all else fails, the organization may have to restart its core business activities completely from scratch. A devastating fire or natural disaster, or an end-to-end encryption of all data and systems, may mean that the only business know-how left to the organization is in the minds of its people and any paper copies of procedures, invoices, or business records it may be able to find and use. This might even mean that people who've long since retired, or otherwise left the organization (under good circumstances, one hopes), are asked to come back to work to help re-create the business logic and re-engineer the business processes that put it into action.

This is a painful process, often seen when ransomware attack targets refuse to pay off their attackers and are left literally with nothing digital that works. This can become far more expensive in the long run than paying the ransom might have been, but at least the victim knows that that particular attacker is no longer lurking in its systems.

Data on how often this is happening is hard to come by; private businesses and even public organizations or government services are very hesitant to admit to the marketplace, their regulators, and the general public just how badly they've been held for ransom. Some may even argue that since they don't know with certainty that any privacy-related protected data has been exfiltrated, they have no legal requirement for such disclosure. This is a difficult ethical and legal situation to find oneself in, either as an officer of the organization or as a member of its security team. It can be avoided by having appropriate backup and restore capabilities in place, tested, and in use. That's neither cheap nor easy, but it's often far, far more affordable than the alternative of starting over completely from scratch. Or not starting over at all.

Cloud-Based “Do-Over” Buttons for Continuity, Security, and Resilience

From individual users to the largest enterprises, the reliance on cloud-hosted systems and services has become a de facto part of nearly every business and personal activity. And with an extra bit of planning, forethought, and follow-through, cloud services can of course greatly enhance the survivability of an organization and its business processes. Let's take a 50,000-foot view of this (as our aviator friends call it) and see just how these three attributes of secure, safe, and survivable computing show up in a typical organization by means of a common feature in almost every video game: the do-over button. This can show up in a number of everyday activities seen in virtually every organization large and small:

  • Transaction do-over: Almost everything businesses and organizations need to achieve can be modeled as a series of transactions. Transactions are atomic by definition; that is, either you complete a transaction successfully, in its entirety, or you don't. (You don't partially make a deposit into your bank account, do you?) Undeleting a file is probably the most common IT example of this; this is straightforward when done on your local storage devices but requires multiple versions of files (or other approaches) to deal with shared, cloud-hosted, and synchronized storage supporting multiple users on multiple devices.
  • Session do-over: As a writer, I might spend an hour or more editing a document, only to realize that I've made some horrible mistakes; I really want to just throw away this hour's work and start over from where I was first thing this morning. Document file versioning (or even frequent “save as” with a new name) is the simplest approach to this, since auto-save and cloud-synchronized backups often capture each change as an update to the file being edited as they occur. In fact, document versioning—saving a completely new instance of the file under a new name—is just about the only way to provide this kind of fallback at the user's work task level.
  • Complex service or activity do-over: Installing a new version of an OS or a major applications suite is complex, can take a lot of time, and may require a number of system reboots. Most systems and applications installation kits provide some kind of fallback capability, allowing the user to retry by reconfiguring the system to the way it was before the (aborted) installation or update was started.
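The transaction-level do-over is the easiest of these to show in code, because most database engines provide it directly. The sketch below uses Python's built-in `sqlite3` module and an in-memory database; the `accounts` table and the `transfer` helper are hypothetical. The credit is deliberately applied before the debit so that a failed debit demonstrates the credit being undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("checking", 100), ("savings", 50)])
conn.commit()


def transfer(db: sqlite3.Connection, source: str, target: str, amount: int) -> bool:
    """Move funds atomically: both updates are kept, or neither is."""
    try:
        with db:  # commits if the block succeeds, rolls back if it raises
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                       (amount, target))
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                       (amount, source))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


def balances(db: sqlite3.Connection) -> dict:
    return dict(db.execute("SELECT name, balance FROM accounts ORDER BY name"))


first = transfer(conn, "checking", "savings", 30)
after_first = balances(conn)
second = transfer(conn, "checking", "savings", 500)  # would overdraw checking
after_second = balances(conn)
```

After the rejected transfer, the savings balance shows no trace of the credit that was briefly applied. Session-level and installation-level do-overs need the same property, but they usually have to be built from versioned files or system snapshots rather than obtained from a database engine.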

Achieving these (and other) levels of do-over, or the undoing, unwinding, or backing out of a series of changes or the outcomes produced by performing a series of business operations, requires support from systems, applications platforms, and workflow or user procedures defined by the organization at the task or job level. Canceling a whole travel itinerary may seem simple for the buyer who has changed their mind, but for the business, it may also mean opportunities to re-sell those seats to waitlisted customers, possibly involving special pricing or terms. Similar situations are found in other industries, such as with on-demand manufacturing, build-to-order systems, and even medical care in some cases.

Situations requiring a redo can arise from errors in the design or use of complex systems, as well as from hazard events or cyberattacks. As the systems' users, managers, and owners detect such events, they're confronted with a three-way Hobson's choice of the computer era: abort what's already been done, retry the tasks by falling back to a known good point in the cumulative transaction history and redoing everything since, or ignore the compounding of errors and somehow move on. None of these options is free from consequences; all entail the potential loss of ongoing activity as well as the sunk costs of labor and effort expended but now made worthless. In the larger case of a total systems outage lasting days or weeks, the activation of disaster recovery and business continuity plans determines what to do about three categories of work (as reflected in the physical nature of the business and in its information systems).

What all of these do-overs have in common is that first, users and systems planners need to identify some baseline configurations of systems, software, and data that are worthy of extra efforts to keep available. These baselines need to be time and date stamped to be effective, of course! Whether we need to fall back to this morning's version of a file or completely re-create the system image onto new hardware after a disaster, users will need to know what backup set from what moment in time to reload. Once it's loaded, users can step through offline records of work steps taken and either redo them or deliberately choose to ignore them. Business logic should dictate this choice in advance or at least define the criteria to use in making this choice.
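The two steps described here, choosing which time-stamped baseline to reload and then deciding which recorded work steps to redo, can be sketched in a few lines. The baseline names, journal entries, and helper functions below are hypothetical; the sketch also assumes that baselines are kept sorted by timestamp and that every work step was journaled with the time it was performed.

```python
from bisect import bisect_right
from datetime import datetime


def pick_baseline(baselines: list, restore_point: datetime):
    """Return the latest (timestamp, name) baseline taken at or before
    restore_point, or None if every baseline is newer than that."""
    stamps = [stamp for stamp, _ in baselines]
    index = bisect_right(stamps, restore_point)
    return baselines[index - 1] if index else None


def steps_to_replay(journal: list, baseline_time: datetime,
                    restore_point: datetime) -> list:
    """Work steps recorded after the baseline, up to the restore point."""
    return [step for stamp, step in journal
            if baseline_time < stamp <= restore_point]


baselines = [
    (datetime(2024, 5, 1, 2, 0), "full-0501"),
    (datetime(2024, 5, 2, 2, 0), "full-0502"),
    (datetime(2024, 5, 3, 2, 0), "full-0503"),
]
journal = [
    (datetime(2024, 5, 2, 1, 0), "invoice 41 posted"),
    (datetime(2024, 5, 2, 9, 15), "invoice 42 posted"),
    (datetime(2024, 5, 2, 14, 0), "invoice 43 posted"),
    (datetime(2024, 5, 2, 16, 45), "invoice 44 posted"),
]

# Suppose the last moment known to be good is mid-afternoon on May 2.
restore_point = datetime(2024, 5, 2, 15, 30)
chosen = pick_baseline(baselines, restore_point)
replay = steps_to_replay(journal, chosen[0], restore_point)
too_early = pick_baseline(baselines, datetime(2024, 4, 30))
```

Whether each step in `replay` is redone or deliberately skipped is the business-logic decision the paragraph above describes; the code can only present the candidates.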

Restoring a Virtual Organization

A growing number of organizations of all types and sizes have taken the plunge, so to speak, and gone completely virtual: they have little or no actual physical computing infrastructure beyond the endpoints their people use and the Internet access capabilities needed to get to their cloud-hosted business systems. Employee-owned endpoints, whether company provisioned and managed or not, seem ideally suited to isolate the information systems of the organization from disruption to a physical business or organizational presence.

This requires that the organization ensure that the virtual definitions of its business logic, its service architectures, and the networks and systems infrastructures those require are fully supported by secure, high-integrity backup and restore processes. Using software to define a network or services architecture can be incredibly powerful, more responsive, and much easier than doing it as an on-premises system would be. That set of definitions, including the scripts that create the virtual machine templates or clone copies of them to meet processing demand, must be securely stored and backed up, and those backups must be retained in secure storage.

Portability of software-defined or virtual systems templates should also be considered. Larger enterprises (and smaller ones with low risk appetites) often use two or more cloud hosting services to reduce their exposure to vendor-specific issues, such as contract cancellation with little or no notice or other service disruptions.

Different cloud systems providers, their deployment models, and their services models provide different selections of features and capabilities that support a cloud-based do-over capability. Critical to making the right choice is to know your organization's real needs in each of these three areas, and the BIA should give you a solid foundation from which to start. This is especially important when you consider the sheer number and diverse types of cloud services the organization may be using. A typical organizational website alone might use several hundred such services to gather analytics data, another large number as part of interfaces and data sharing relationships with external partners or providers. Major cloud-hosted applications platforms tie the organization to hundreds more. (It's a small wonder, then, that the average enterprise is using 1,427 such cloud services.) Some key questions to consider include these:

  • How many connections to the cloud-hosted business platforms, from how many of our end users, must meet what degree of reliability?
  • Does our current physical communications architecture, including connections via our ISPs, provide us with that degree of reliability?
  • If our fallback options include greater reliance on end-user mobile devices, what happens to our connectivity and business continuity when the local area mobile networks are overloaded or crash (as often happens during severe storms, earthquakes, and accidental or deliberate large-scale disruptions)?
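A little arithmetic helps put numbers on these questions. Under the simplifying assumption that components fail independently, a chain of components that must all work has the product of their availabilities, while redundant paths fail only when every path fails. The availability figures below are invented for illustration, not measurements of any real provider.

```python
from math import prod

HOURS_PER_YEAR = 8760


def series(*availabilities: float) -> float:
    """All components must work: multiply their availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one path suffices: fails only if every path fails.
    Assumes the paths fail independently of one another."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures: wired ISP link, mobile fallback, cloud platform.
isp, mobile, cloud = 0.995, 0.99, 0.999

isp_only = series(isp, cloud)
with_fallback = series(parallel(isp, mobile), cloud)
isp_downtime = downtime_hours_per_year(isp)
```

The independence assumption is exactly what the third question challenges: a severe storm or earthquake can take down the wired link and overload the mobile network at the same time, in which case the fallback delivers far less than this calculation suggests.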

We also must consider where our cloud-hosted data and applications platform systems actually physically reside. Can the same natural or man-made hazard that disrupts our on-site business operations and people also disrupt our cloud host? If there's a possibility of this, we need to explore how the cloud host itself can provide backup, distributed, or alternate site storage, as well as processing and access control support. Again, our business needs for this should drive the CIANA+PS components of our discussions and negotiations with alternative cloud services providers. (One comforting thought: the major cloud host providers, such as Amazon, Google, and Microsoft, have already worked out solutions to these problems for very large, multinational customer organizations using their clouds; this drove them to build in capabilities that smaller organizations, be they local, regional, or international, can benefit from.)

People Power for BC/DR

It's time to shift our mental gears from the technical and physical systems and controls needed for business continuity and disaster recovery and focus on the administrative, people-facing and people-powered elements that actually carry out those plans.

These next three sections operationalize these aspects of getting ready to do BC/DR planning and implementation, put those plans into action when disruptions occur, and learn from what those disruptions can teach you and your organization as a result. More important, these next three sections focus on how the security professional needs to protect and support the people that carry those activities from ideas to post-event analysis and feedback. In doing so, they give you the arms and ammunition you need to help your organization cast off the tired old assumption that people are the weakest link in the security chain; instead, it's time to recognize that no amount of technical and physical security controls will work if the organization is not first and foremost reliant on its people.

In other words, it's time to look at implementing and using effective CIANA+PS at Layer 8 and above. It is predominantly through exploitable vulnerabilities in the Layer 8s of your organization—its business processes, people elements, cultural aspects, and the applications software, communications, and collaborations systems that they all use—that advanced persistent threat actors conduct their low and slow, persistent, and stealthy intrusions and attacks on your systems and your organization. Every major cyberattack of note shows the use of Layer 8–related tactics and techniques, often spread in time over many months. Other chapters provide you with ideas and techniques for hardening your systems, including at Layer 8, to prevent and detect such attacks; the attacks probably won't stop just because a major disruption has occurred. The original attacker—or others—may think you are at your weakest and your systems and organization are most vulnerable. Carrying out the work of disaster recovery and business continuity operations is stressful enough, without leaving the door open to additional incidents and disruptions.

“What? Layer 8? I thought you said that the OSI reference model was a 7-layer model!” Well, that's true, but almost since it was first being drafted, there were any number of authors (including Michael Gregg, in his 2006 classic book Hack the Stack) who referred to the people-facing administrative, policy, training, and procedural stuff as Layer 8. (Pundits have also pointed out additional layers, such as Money, Political, and Dogmatic, but for simplicity of analysis and as a nod to the IETF's treatment of “anything above the transport layer is an application,” SSCPs can lump those all into the “people layer.”) Layer 8 by that name probably won't appear on the SSCP certification exam, but vulnerabilities, exploitations, and countermeasures involving how people configure, control, manage, use, misuse, mismanage, and misconfigure their IT systems no doubt will. Figure 11.3 illustrates this concept, and much like Figure 11.2, it too shows many process-focused aspects of running a business or organization that intermesh with each other. Note that just as every layer within the OSI protocol stack defines and enables interactions with the outside world, so too does every protocol or business architectural element in Figure 11.3. The surrounding layers might be immediate customers, suppliers, and clients; next, the overall marketplace, maybe the society or dominant culture in the nation or region in which the organization does business. This layer-by-layer view of interaction can be a powerful way to look at both the power and value of information at, within, or across a layer, as well as a tool that SSCPs can use to think about threats and vulnerabilities within those higher layers of the uber-stack.

Collaborative workspaces are an excellent case in point of this. The design and manufacturing of the Boeing 767 aircraft family involved hundreds of design, manufacturing, and supply businesses, all working together with a dozen or more major airlines and air cargo operators, collaborating digitally to bring the ideas through design to reality and then into day-to-day sustained air transport operations. At Layer 7 of the OSI stack, there were multiple applications programs and platforms used to provide the IT infrastructure needed for this project. The information security rules that all players in the B767 design space had to abide by might see implementation using many physical and logical control technologies across Layers 1 through 7, yet with all of the administrative controls being implemented out in “people space” and the interorganizational contractual, business process, and cultural spaces.

FIGURE 11.3 Beyond the seventh layer

Sadly, a number of IT security professionals continue to constrain their gaze to Layer 7 and below. The results? Missed opportunities to better serve the information security needs of their organizations. One irony of this is that almost by definition, all administrative controls are instantiated, implemented, and used (and abused or ignored) beyond Layer 7.

Let's take a closer look at those next layers.

Threat Vectors: It Is a Dangerous World Out There

If we were to redraw Figure 11.3, we might be able to see that the people element of an organization makes up a great deal of its outermost threat surface. Even the digital or physical connections our businesses make with others are, in one sense, surrounded by a layer of people-facing, people-powered processes that create them, operate them, maintain them, and sometimes abuse or misuse them. You can see this reflected in the dark humor of the security services of many nations: before the computer age, they'd joke that if your guards, secretaries, or janitors owned better cars, houses, or boats than you did, you might want to look into who else is paying them and why. By the 1980s, we'd added communications and cryptologic technical and administrative people to that list of “the usual suspects.” A decade later, our sysadmins and database administrators joined this pool of people to really, really watch more closely. And like all apocryphal stories, these still missed seeing the real evolution of the threat actor's approaches to social engineering.

The goal of any social engineering process is to gain access to insider information—information that is normally not made public or disclosed to outsiders, for whatever reason. With such insider information, an outsider can potentially take actions that help them gain their own objectives at greatly reduced costs to themselves, while quite likely damaging the organization, its employees, its customers, its stakeholders, or its community. In Chapter 3 we looked at how one classical and useful approach to keeping insider information inside involves creating an information security classification process; the more damaging that disclosure, corruption, or loss of this information can cause to the company, the greater the need to protect it. This is a good start, but it's only a start. Social engineering attacks have proven quite effective at sweeping up many different pieces of unclassified information, even that which is publicly available, and analyzing it to deduce the possible existence of an exploitable vulnerability.

Social engineering works because people in general want to be well regarded by other people; we want to be helpful, courteous, and friendly, because we want other people to behave in those ways toward us. (We're wired that way inside.) But we also are wired to protect our group, be that our home and family, our clan, or any other social grouping we belong to. So, at the same time we're open and trusting, we are hesitant, wary, and maybe a bit untrusting or skeptical. Social engineering attacks try to establish one bit of common ground with a target, one element on which further conversation and engagement can take place; over time, the target begins to trust the attacker. The honest sales professional, the doctor, and the government inspector use such techniques to get the people they're working with to let down their guard and be more open and more sharing with information. Parents do this with their children and teachers with their students. It's human to do so. So, naturally, we as humans are very susceptible to being manipulated by the smooth-talking stranger with hostile intent.

Consider how phishing attacks have evolved in just the last 10 or 15 years:

  • Phishing attacks tend to use email spam to “shotgun blast” attractive lures into the inboxes of perhaps thousands of email users at a time; the emails either would carry malware payloads themselves or would tempt recipients to follow a URL, which would then expose their systems to hostile reconnaissance, malware, or other attacks. The other major use of phishing attacks is identity theft or compromise; by offering to transfer an inheritance or a bank's excess profit to the addressee, the attacker tempts the target to reveal personally identifying information (PII), which the attacker can then sell or use as another step in an advanced persistent attack's kill chain. The attacker can also use this PII to defraud the addressee, banks, merchants, or others by masquerading as the addressee to access bank accounts and credit information, for example.
  • Spear phishing attacks focus on individual email recipients, or very select, targeted groups of individuals, and in true social engineering style, they'd try to suggest that some degree of affinity, identification, or relationship already existed in order to wear down the target's natural hesitation to trusting an otherwise unknown person or organization. Spear phishing attacks are often aimed at lower-level personnel in large organizations—people who by themselves can't or don't do great things or wield great authority and power inside the company but who may know or have access to some little bit of information or power the attacker can make use of. The most typical spear phishing attack would be an email sent to a worker in the finance department, claiming to be from the company's chief executive. “I'm traveling in (someplace far away), and to make this deal happen, I need you to wire some large amount of money to this name, address, bank name, account, etc.,” such phishing attacks would say. Amazingly, an embarrassingly large number of small, medium, and large companies have fallen (and continue to fall) for these attacks and lost their money in the process.
  • Whaling attacks, by contrast, aim at key individuals in an organization. The chief financial officer (CFO) of a company might get an email claiming to be from their chief executive officer (CEO), which says much the same thing: “If we're going to make this special deal happen, I need you to send this payment now!” CFOs rarely write checks or make payments themselves—so they'd forward these whaling attack emails on to their financial payments clerks, who'd just do what the CFO told them to do. (One of the author's friends is the CEO of a small technology company, and he related the story of how such an attack was attempted against his company recently, and the low-level payments clerk in that kill chain was the only one who said, “Wait a minute, this email doesn't look right . . . ,” which got the CEO involved in the nick of time.)
  • Catphishing involves the creation of an entirely fictitious persona; this “person” then strikes up what seems to be a legitimate personal or professional relationship with people within its operator's target set. Catphishing originated within the online dating communities, but it has quickly become one of the favorite operational intelligence-gathering tactics used by APTs during the early stages of their attack strategies.

This list could go on and on; we've already had more than enough examples of advanced persistent threat operations that create phony companies or organizations, staffed with nonexistent people, as part of their reconnaissance and attack strategies.

Notice that by shifting from phishing to more sophisticated spear phishing, whaling, or even catphishing attacks, attackers have to do far more social engineering, in more subtle ways, to gather the intelligence data about their prospective target, its people, and its internal processes. Of course, the potential payoff to the attacker often justifies the greater up-front reconnaissance efforts.

This all should suggest that if we can provide for more trustworthy interpersonal interaction and communication, we could go a long way toward establishing and maintaining a greater security posture at these additional layers of our organization's information architecture. Much of this will depend on your organization, its decision-making culture and managerial style, its risk tolerance, and its mission or strategic sense of purpose. Going back to Chapter 10's ideas, we're looking for ways to find precursors and indicators that some kind of reconnaissance probe or attack is in the works. For example, separation of duties can be used to identify “need to know” boundaries; queries by people not directly involved in those duties, whether insiders or not, should be considered as possible precursor signals. This can aid in protecting key assets, securing critical business processes, and even protecting information about the movement or availability of key personnel. Penetration testing or exercises that focus on social engineering attack vectors might also help discover previously unknown vulnerabilities or identify high-payoff ways in which improved (or different) staff education and training can help “phishing-proof” your organization.
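The separation-of-duties idea can be made operational with very little machinery: record who is assigned to each sensitive duty, and flag any query about that duty from someone outside the assigned group. The sketch below is an illustration of that filtering step only; the duty groups, the names, and the query log are hypothetical, and a flagged query is a prompt for a human to look more closely, not evidence of wrongdoing.

```python
# Hypothetical mapping of sensitive duties to the people assigned to them.
DUTY_GROUPS = {
    "wire_transfers": {"alice", "bob"},
    "executive_travel": {"carol"},
}


def precursor_signals(queries: list, duty_groups: dict) -> list:
    """Return (person, topic) pairs where the asker is outside the
    need-to-know boundary for a tracked topic. Untracked topics pass."""
    return [
        (person, topic)
        for person, topic in queries
        if topic in duty_groups and person not in duty_groups[topic]
    ]


# Hypothetical log of who asked about what.
query_log = [
    ("alice", "wire_transfers"),       # inside her duties
    ("dave", "wire_transfers"),        # outside: worth a second look
    ("carol", "cafeteria_menu"),       # topic is not tracked
    ("vendor-17", "executive_travel"), # an outsider asking about travel
]

signals = precursor_signals(query_log, DUTY_GROUPS)
```

The same comparison works whether the asker is an insider or not, which matches the point made above: what matters is whether the query crosses a need-to-know boundary.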

“Blue Team's” C3I

As with other security controls and processes, operational business continuity and disaster recovery start with administrative policies, procedures, and controls. These command, control, communications, and intelligence (C3I) systems establish the lines of communication, decision making, responsibility, and authority, which empower individuals (by name or by job function) to direct the activation of different BC/DR activities as required to sustain critical, core activities while bringing other business logic back into operational condition.

Many organizations use a collaborative, iterative process to transform initial policy guidance into procedures and then use tabletop read-throughs and exercises to identify important details and discover shortcomings. Larger-scale exercises are also used, which may even involve deploying BC/DR teams to alternate processing sites or work locations as part of the operational test and evaluation of the plans and procedures. Periodic test and validation that archival or backup systems and datasets can still be used, and that they are free of any latent malware or other hostile agents, can be combined with these exercises of operational recovery and continuity processes.

Within the bounds of need-to-know constraints, use the organization's security awareness, education, and training processes to help build confidence while providing an informed basis for that confidence. Offer opportunities for members of the organization to participate in exercises, or receive additional training on BC/DR process steps, perhaps as part of career broadening or other professional development opportunities. As the SSCP on the BC/DR team, you may be admirably positioned to identify essential opportunities to help spread the good word, as well as help identify the need for additional business-savvy operational insight when needed to work on difficult recovery or continuity problems.

The technical verification that BC/DR procedures are reasonably sound, complete, and workable is important; perhaps more important is the confidence instilled in the workforce that in the event of catastrophe, the organization knows what to do and how to do it.

Learning from Experience

There are two contrasting models for how organizations learn from experience. One is continuous, a part of ongoing, day-to-day business operations; the other is a formalized process that is invoked after a major event or activity has concluded. Many organizations rightfully use both, or a blend of approaches that treats these not as an either-or but as endpoints on a spectrum of choices.

The first is seen in many high-reliability organizations. They empower their individual work units or teams to conduct a critical debrief and discovery as their work teams go through their operational day. This may be as simple as a quick “pulse check” at the end of a short-duration, routine task, or a more detailed reflection on what worked well and what could be improved after a lengthy and complex process has been completed. To make these review and reflection sessions work, the team members need to believe they “own,” to a reasonable degree, both the outcomes of their work and the processes they use to get it accomplished. They are granted a degree of freedom to make changes to what they do and how they do it, and they are encouraged to come forward to management with issues and opportunities that may exceed that scope. These work teams are viewed by management, and by themselves, as the people who know how to get the job done.

The second approach is well suited when the event or activity being reviewed is of a serious or critical nature. Hospitals, for example, have formal processes for morbidity and mortality (M&M) reviews, held when patients die. Accidents in rail, road, or aviation operations are investigated in formal, structured ways, often to meet requirements in law and regulation. These requirements dictate processes to formally identify the lessons learned from a specific accident, assign them to one or more groups to implement, and require follow-through reporting regarding that implementation.

These two strategies can be used to review and learn from every step of the BC/DR process. Initial planning and concept development, and its tabletop exercises, provide the first opportunity to reflect upon the products and the processes, and then note and keep track of issues, questions, and ideas. Implementation testing and operational training for BC/DR processes further provide learning opportunities.

Since many real-world BC/DR experiences need months, if not years, of sustained activity to conclude, each day provides a learning opportunity. Don't let it go to waste. You may not be the one in charge of the overall recovery effort; you may not even be directly involved, but instead are providing security support to business units that are beneficiaries of the restored capabilities as they become available. Regardless, you have your own chance to take notes, keep a log, and learn from the experience. Reflect on those notes, and share your findings with your managers, leaders, and co-workers.

Other organizations, whether in the same industry, markets, or even countries as the one(s) you're working with, are also going through experiences that you and your organization can learn from. Learning from cyberattacks on other organizations can be done analytically, by reflecting upon reports of their experience; or it can be done experientially, by using a real-world disaster as the scenario for an operational training, readiness evaluation, or assessment exercise of your own organization's BC/DR processes.

Security Assessment: For BC/DR and Compliance

As the number and severity of cyberattacks on corporate and government-owned and operated infrastructures and systems continue to grow, so too does the pressure from regulators and legislatures for stronger, faster, and more effective disclosure and reporting in response. Organizations in the financial, education, healthcare, and transportation industries already must provide auditable proof of their compliance with a wide variety of regulations about data protection, product and service safety, and financial accountability. As of summer 2021, efforts are underway to require private and public operators of critical infrastructures to report any cybersecurity incident (not just one involving disclosure of protected or private data) to government officials and in some cases to the public at large.

These compliance requirements not only influence (if not dictate) how your organization develops and conducts BC/DR activities, but also how it demonstrates its readiness to achieve the required levels of business continuity, systems integrity, and data protection, all while maintaining and assuring the safety of workers, consumers, and the general public.

Most enterprises and some SMBs already have in place a significant effort for periodic or ongoing systems assessment for security compliance. Expanding these efforts to include the specifics of BC/DR programs and procedures should be relatively straightforward. Smaller SMBs and many SOHO environments may have so far been off the horizon of most audit requirements; even so, those with an awareness of risk management and mitigation need to be moving toward using security assessments as a way to strengthen their security posture.

No matter what the purpose, security assessments use a common set of processes, which you as the on-scene SSCP may need to support, participate in, or perhaps even lead.

  • Planning for assessment, including determining the purpose, setting the scope, identifying the assessment activities, and setting the timeline or schedule for performing them. Planning also identifies the resources required for an assessment.
  • Conducting the assessment activities, which can include a mix of:
    • Audit: A structured review of the existing systems, processes, and their use against a specified standard
    • Inspection: A review and analysis of existing systems, processes, and their operational use
    • Testing: The use of test scenarios and data to verify that the system behaves as required
    • Ethical penetration testing: Using adversarial (attacker) techniques, under control of the organization (and with its permission) in attempts to find and exploit vulnerabilities in the systems, processes, and procedures in use by the organization
    • Interview: Conversations (not interrogations) with members of staff and the operational users of the system, to gather further insights regarding systems behavior and use
  • Analysis and review, which produces a set of findings and recommendations
  • Reporting the findings to the requesting authorities in the organization, or externally if required

The most successful security assessment programs are viewed by the organization's workforce and other stakeholders as helpful, welcome additions to the overall business model and processes. Assessments are seen as adding value to the reputation and the outcomes of the organization; they empower the organization to continually mature and strengthen its asset ownership and protection, as they identify opportunities to improve security procedures. Assessment processes that involve external auditors or third-party assessment experts (such as ethical penetration test teams) can also inject a healthy dose of outside knowledge and insight, drawn from the painful but perhaps not well-publicized experiences of other organizations.

As with other security processes, assessments can be scheduled, planned, and deliberate activities, or they can be part of an ongoing and continuous assessment process. Some organizations find it quite straightforward to integrate continuous monitoring for incident detection and response with continuous security assessment for security and compliance. Bringing their BC/DR plans, processes, readiness activities, and lessons learned processes into the overall security assessment effort is a logical and effective step to take.

Converged Communications: Keeping Them Secure During BC/DR Actions

Two more aspects of the Blue Team's C3I need further consideration by the on-scene SSCP. From a systems and technologies perspective, we cannot afford to overlook the plain old telephone systems and services, nor the converged communications, computing, and internetworking that the modern digitized workplace relies upon. At the same time, it's important to realize that all of this focus on post-incident (or post-attack) efforts to get systems, assets, and processes back to work is also about getting people back to work; and people work best when in an environment and context of mutually supportive trust. It's hard to strike the balance between being threat-focused, ready to deal with the intruders within and without, and demonstrating that confidence and trust in the organization, its tools, and its processes is deserved. Hard, but necessary, and part of the security professional's job. Let's take a last look at these two aspects of continuity of operations and disaster recovery.

POTS and VoIP Security

One last frontier we need to look beyond is the use of other communications technologies and systems, both for communications inside our organization and with the myriad of outside organizations and people we deal with. Plain old telephone service (POTS) has traditionally been provided to businesses and organizations by using on-premises switchboard systems to connect to a phone company's central office systems (at a point of presence, of course, within the organization's physical space). Endpoint devices such as desktop or wall-mounted phones, intercoms, or other devices provided both the voice connectivity and the routing and control of individual calls. But with the rare exception of encrypted telephone systems, the vast majority of these systems sent and received speech unencrypted over the public switched telephone network (PSTN).

In recent years, Voice over Internet Protocol (VoIP) has become a major communications alternative for many organizations and individuals. VoIP platforms, such as Skype, Viber, FaceTime, and Google Voice, have revolutionized the way we think about talking with others. We want to hear and see them; we want to be able to instantly add others to the call. We want to call them, not an endpoint device that happens to be where they were yesterday or might return to next week. Business and individual VoIP users have many legitimate reasons (and some illegitimate ones) to record such calls or incorporate other multimedia information into them.

Each of these major, and very appealing, features of VoIP systems brings with it the increased risks that well-intentioned users may disclose sensitive, proprietary, or other information to other parties on the VoIP session who may not have a valid need to know such information. Your organization must ensure that users understand your information security classification guidelines and that you have procedures in place that users can use to check what information can be shared in a VoIP session (especially with outsiders) and what information must not be shared. Without such a classification guide, and the awareness, education, and training to make use of it, your organization is putting a great deal at risk.
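A procedure for checking what may be shared in a VoIP session can be as simple as a lookup against the classification guide. The following is a minimal sketch in Python, not a definitive implementation. The labels (`public`, `internal`, `proprietary`, `restricted`), the audience names, and the function `may_share` are all hypothetical, invented for illustration; your organization's own classification guide would supply the real categories and rules.

```python
from typing import List

# Hypothetical classification guide: each label maps to the audiences that
# may receive information carrying that label during a VoIP session.
SHARING_GUIDE = {
    "public": {"employee", "partner_under_nda", "outsider"},
    "internal": {"employee", "partner_under_nda"},
    "proprietary": {"employee"},
    "restricted": set(),  # in this sketch, never discussed on a VoIP call
}


def may_share(label: str, participants: List[str]) -> bool:
    """Return True only if every participant on the call may receive `label`.

    Unknown labels, unknown audiences, and an empty participant list are all
    treated as "do not share" (the check fails closed).
    """
    allowed = SHARING_GUIDE.get(label)
    if allowed is None or not participants:
        return False
    return all(p in allowed for p in participants)
```

The design choice worth noting is that one participant without a need to know is enough to block sharing for the whole session, which mirrors the risk described above: everyone on the call hears what is said.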

POTS systems tend to use a separate physical plant from the network systems in most locations. The same modem (at the same point of presence) may deliver POTS and Internet connectivity to your location if provided over the same “last mile” wiring or fiber distribution system used by the same communications company. Once we look past that modem, to date, the technologies involved in telephone call routing, control, and support of advanced calling and billing features have tended to be separate systems. POTS is call based (circuit switched), while the Internet is packet switched.

We haven't looked at POTS technologies and security issues very much, since they tend to be beyond what SSCPs deal with. Nevertheless, the same need for awareness, education, and training of your staff in how to handle sensitive, proprietary, or other restricted information is paramount to any attempt to keep your information safe, secure, and reliable.

Clearly, VoIP touches the users at Layer 7 of the OSI model; all of the technical risk mitigation controls, such as encryption, access control, identity management, and authentication, can be applied to meet the organization's needs for information security. We've addressed those for VoIP already to the degree that it's just another app that runs on our networks. However, the trivial ease with which a trusted team member of one organization can be VoIP calling from less-than-trustworthy surroundings does suggest that keeping VoIP safe and secure requires other specialized end-user knowledge, skills, and attitudes. This is another opportunity for you as an SSCP to identify ways to help your organization's VoIP users communicate in more information security–conscious ways.

People Power for Secure Communications

There's a lot of great advice out there in the marketplace and on the Internet as to why organizations need to teach their people how to help protect their own jobs by protecting critical information about the company. As an SSCP, you can help the organization select or create the right education, training, and evaluation processes and tools for this. A survival tip: use the separation of duties principle to identify groups or teams of people whose job responsibilities suggest the need for specific, focused information protection skills at the people-to-people level.

That's an important thought; this is not about multifactor authentication or physical control of the movement of people throughout the business's office spaces or work areas. This is also not trying to convert your open, honest, trusting, and helpful team members into suspicious, surly, standoffish “moat dragons” either! All you need to do is get each of them to add one key concept to their mental map of the workplace: trust, but verify. Our network engineers need to build our systems in as much of a zero-trust architectural way as the business needs and can afford, but the most flexible, responsive, surprise-tolerant, and abnormality-detecting link in our security chain needs to stay trusting if it's going to deliver the agility that resilient organizations require. They just need to have routine, simple, safe, reliable, and efficient ways to verify that what somebody seemingly is asking them to do, share, or divulge is a legitimate request from a trustworthy person or organization.
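One routine way to "trust, but verify" is an out-of-band callback: confirm the request by contacting the claimed requester through details the organization already holds, never through details supplied in the request itself. The sketch below illustrates that idea in Python under stated assumptions. The directory `KNOWN_CONTACTS`, the `Request` fields, and the functions `verification_number` and `handle` are hypothetical names chosen for this example, and the addresses and phone numbers are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical internal directory: contact details the organization already
# holds, independent of anything an incoming request claims.
KNOWN_CONTACTS = {
    "cfo@example.com": "+1-555-0100",
}


@dataclass
class Request:
    claimed_sender: str   # who the request says it is from
    callback_number: str  # contact number supplied inside the request
    action: str           # what the requester wants done


def verification_number(request: Request) -> Optional[str]:
    """Return the number to call to confirm the request, or None.

    The number always comes from the directory, never from the request,
    so an attacker cannot steer the verification call to themselves.
    """
    return KNOWN_CONTACTS.get(request.claimed_sender)


def handle(request: Request, confirmed_by_callback: bool) -> str:
    """Decide what the employee should do next with this request."""
    number = verification_number(request)
    if number is None:
        return "escalate: sender not in directory"
    if not confirmed_by_callback:
        return f"hold: call {number} to confirm before acting"
    return f"proceed: {request.action}"
```

Note that the employee's default stays cooperative: a confirmed request proceeds without further friction, and only unverifiable requests are held or escalated.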

Without needing to dive too deeply into organizational psychology and culture, as SSCPs we ought to be able to help our organizations set such processes in place and keep them simple, current, and useful. This won't stop every social engineering attack—but then again, no risk control will stop every threat that's targeted against it either. And as organizations find greater value and power in actually sharing more information about themselves with far larger sets of outsiders—even publishing it—the collection of information “crown jewels” that need to be protected may, over time, get smaller. That smaller set of valuable nuggets of information may be easier to protect from inadvertent disclosure but may also become much more of an attractive target.

Summary

Business continuity is about staying in business, despite what risks may materialize. It's about achieving the organization's strategic, tactical, and operational goals and objectives despite the occurrence of accidents, deliberate attacks, or even natural disasters. To survive to operate, organizations must plan to make such survival possible, as well as plan for how to bring disrupted business processes back to something close to pre-incident normal. As we've seen in this chapter, SSCPs bring many different sets of knowledge, skills, and abilities to the table as they help their organizations prepare to survive, prepare to recover, and then carry out those plans successfully when incidents happen.

Much of this business continuity planning, including incident response activities, happens at the administrative level—in other words, it happens in nontechnical ways, as if at Layer 8 or beyond in our 7-layer OSI reference model. The organization's people are (once again) seen to be critically important to making these various layers of planning become reality under the stress of an anomaly becoming an incident and then an incident becoming a disaster. Awareness education and procedural training of our organization's workforce, focused on tasks, work units, or processes, and the critical assets or systems they need to perform their roles, can play a vital part not only in emergency response preparedness but also in day-to-day activities that enhance information security.

This is all part of how organizations become resilient, able to bend in the face of major disruptions without breaking under the strain. (That same resilience may also herald an unlooked-for opportunity for innovation and positive change, as you may recall from Chapter 3.) The ubiquitous nature of cloud-based or cloud-hosted systems, platforms, and services makes many options available to support contingency operations plans, including backup and restore of systems and data. When combined with the people power the organization already depends on for success, the chances of surviving to operate can be better than ever. And that's what continuity of business operations planning is all about.

Exam Essentials

  • Understand the relationship between incident response, business continuity, and disaster recovery planning. A disaster is an incident that causes major damage to property, disrupts business activities, and quite possibly injures or even kills people. A disaster can also cause information critical to a business to be lost, corrupted, or exposed to the wrong people. A disaster may be one very extensive incident or a whole series of smaller events that, taken together, constitute an existence-threatening stress to the organization. The extensiveness of this damage can be such that the organization cannot recover quickly, if at all, or that such recovery will take significant reinvestment into systems, facilities, relationships with other organizations, and people. Disaster recovery plans are ways of preparing to cope with such significant levels of disruption. Business continuity, by contrast, is the general term for plans that address how to continue to operate in as normal a fashion as possible despite the occurrence of one or more disruptions. Such plans can address alternative processing capabilities and locations, partnering arrangements, and financial arrangements necessary to keep the payroll flowing while operational income is disrupted. Business continuity can be interrupted by one incident or a series of them. Incident response narrows the focus down to a single incident, and provides detailed and systematic instruction as to how to detect, characterize, and respond to an incident to contain or minimize damage; such response plans then outline how to restore systems and processes to let organizations resume normal operations.

  • Describe how business continuity and disaster recovery planning differ from incident response planning. These three sets of planning activities share a common core of detecting events that could disrupt critical business processes, inflict damage to vital business assets (including information systems), or lead to people being injured or killed. As risk management plans, these all look to identify appropriate responses, identify required resources and preparation tasks, and lay out manageable strategies to attain acceptable levels of preparedness. They differ on the scale of disruption considered and the scope of activities. Disaster recovery plans (DRPs) look at significant events that could potentially put the business out of business; as such, they focus on workforce health and safety, morale, and continuing key financial functions such as payroll and alternate and contingency operations at reduced levels or capacities. Business continuity plans (BCPs) look more at business processes, by criticality, and determine what the details of those alternate operations need to be. BCPs address more of the details of backup and restore capabilities for systems, information, and business processes, which can include alternate processing arrangements, cloud solutions, or hot, warm, and cold backup operating locations. Incident response plans focus on getting ready to continually detect a potentially disruptive incident, such as an attack by an advanced persistent threat, and how to characterize it, contain it, respond to it, and recover from it. Part of that process includes decision points (by senior leadership and management) as to whether to activate larger BCP recovery options or to declare a disaster is in progress and to activate the DRP.

  • Describe the legal and ethical obligations organizations must address when developing disaster response, business continuity, and incident response plans. The first set of such obligations comes under due diligence and due care responsibilities to shareholders, stakeholders, employees, and the larger society. The organization must protect assets placed in its care for its business use. It must also take reasonable and prudent steps to prevent damage to its own assets or systems from spreading to other systems and causing damages to them in the process. Legally and ethically, organizations must keep stakeholders, investors, employees, and society informed when such information security incidents occur; failure to meet such notification burdens can result in fines, criminal prosecution, loss of contracts, or damage to the organization's reputation for reliability and trustworthiness. Such incidents may also raise questions of guilt, culpability, responsibility, and liability, and they may lead to digital forensic investigations. Such investigations usually need information that meets stringent rules of evidence, including a chain of custody that precludes someone from tampering with the evidence.

  • Describe the possible role of cloud technologies in business continuity planning and disaster recovery. Using cloud-based systems to host data storage, business application platforms, or even complete systems can provide a number of valuable business continuity capabilities. First, it diversifies location by allowing data, apps, and systems to be physically residing on hardware systems not located directly in the business's premises. This reduces the potential that the same incident (such as a storm or even a terrorist attack) can disrupt, disable, or destroy both the business and its cloud services provider. Second, it provides for layers of secure, off-site backup, archive, and restore capabilities for data, apps, and systems, which can range from restoring a single transaction up to restoring entire sets of business logic, processes, capabilities, and the data they depend on. Third, hosting such systems with a third-party cloud services provider may make it much easier to transition to alternate or contingency business operations plans, especially if knowledge workers have to work from home, from temporary quarters, or even from another city or state.

  • Explain the role of awareness, education, and training for employees and associates in achieving business operations continuity. All employees of an organization, or people associated with an organization, should have a basic awareness of its business continuity plans and strategy; this gives them confidence that this important aspect of their own personal security and continued employment has not been forgotten. Separation of duties, as a design-for-security concept, can play a role in developing focused, timely education and training based on teams or groups involved with specific, related subsets of the business logic. Education can build on that awareness to help selected teams of employees know more about how they are a valuable part of ensuring or achieving continuity of business operations for their specific duties and responsibilities; it gets employees engaged in making continuity planning and readiness more achievable. Training focuses on skills development and practice, which builds confidence for dealing with any emergency.

  • Describe the different types of phishing attacks. Phishing attacks, like all social engineering attacks, attempt to gain the trust and confidence of the targeted person or group of people so that they will divulge information that provides something of value to the attacker. This can be information that makes it easier for the attacker to gain access to IT systems, money, or property. Phishing attacks originated as broadcast-style emails, sent to thousands of email addresses, and either carried a malware payload to the reader's system or offered links to tempt the user to browse to sites from which the attack could continue. Spear phishing attacks focused on selected individuals within organizations, often by claiming to be email from a senior company official, and would attempt to lure the recipient into taking action to initiate a transfer of funds to the attacker's account. These tended to be aimed at (addressed to) clerical and administrative personnel. Whaling attacks target high-worth or highly placed individuals, such as a chief financial officer (CFO), and use much the same story line to attempt to get the CFO to task a clerk to initiate a funds transfer. Cat phishing attacks involve the creation of a fictitious persona, who attempts to establish a personal, professional relationship with a targeted individual or small group of individuals. The attacker may be posing as a consultant, possible client, journalist, or investor; in short, anyone business managers or leaders might reasonably be willing to take at face value. Once that trust and rapport are established, the manipulation begins.

  • Explain how to defend against phishing attacks. Some automated tools can screen email from external addresses for potentially fraudulent senders and scan for other possible indications that a message might be a phishing attack rather than a legitimate email. The most powerful defense is achieved by increasing every employee's awareness of the threat and providing focused education and training to improve skills in spotting possibly suspicious emails that might be phishing attacks. It's also advisable to apply separation of duties processes that establish multiple, alternative ways to validate the legitimacy of any such request to expose critical and valuable assets to risk.

  • Explain the apparent conflict between designing zero-trust networks but encouraging employees to “trust, but verify.” Zero-trust network design is sometimes described as “never trust, always verify.” For example, it asks us to segment networks and systems into smaller and smaller zones of trust and enforce verification of every access attempt and every attempt to cross from one zone or segment to another, because this seems to be required to deal with advanced persistent threats using low-and-slow attack methodologies. On the other hand, people are not terribly programmable, and this is both a weakness and a strength. Our businesses and organizations need our people to be helpful, engaging, and trusting; this is how we break down the internal barriers to communication and teamwork while strengthening our company's relationship with customers, suppliers, or others. We must educate and train our employees to first verify that the person asking for the conversation, help, or information is a trustworthy person with a legitimate business reason for their request, and then engage, be helpful, and establish rapport and trust. This way, we maintain the strength and flexibility of the human component of our organizations, while supporting them with processes, procedures, and training to keep the organization safe, secure, and resilient.

Review Questions

  1. Which of the following types of actions or responses would you not expect to see in an information security incident response plan? (Choose all that apply.)
    1. Relocation of business operations to alternate sites
    2. Temporary staffing
    3. Using off-site systems and data archives
    4. Engaging with senior organizational leadership
  2. Your boss believes that your company must follow NIST guidelines for disaster recovery planning and wants you to develop the company's plans based on those guidelines. Which statement might you use to respond to your boss?
    1. As a government contractor, we actually have to follow ISO and ITIL, not NIST.
    2. Although we are not a government contractor, NIST frameworks and guidelines are mandatory for all U.S. businesses, and so this is correct.
    3. NIST publications are mandatory only for government agencies or companies on government contracts, and since we are neither of those, we don't have to follow them. But they have some great ideas we should see about putting to use, tailored to our risk management plans.
    4. NIST publications are specifically for government agencies and their contractors, and most of what they say is just not applicable to the private sector.
  3. Your boss has asked you to start planning for disaster recovery. Where would you start to understand what your organization needs to do to be prepared? (Choose all that apply.)
    1. Business impact analysis
    2. Business continuity plan
    3. Critical asset protection plan
    4. Physical security and safety plan
  4. Which plan would you expect to be driven by assessments such as SLE, ARO, or ALE?
    1. Business continuity plan
    2. Contingency operations plan
    3. Information security incident response plan
    4. Risk management plan
  5. Which statement best explains the relationship between incident response or disaster recovery, and configuration management of your IT architecture baseline? (Choose all that apply.)
    1. There is no relationship; managing the IT baseline is useful during normal operations, but it has no role during incident response or disaster recovery.
    2. As you're restoring operations, you may need to redo changes or updates done since the time the backup copies were made; your configuration management system should tell you this.
    3. Without a documented and managed baseline, you may not know sufficient detail to build, buy, or lease replacement systems, software, and platforms needed for the business.
    4. There is no relationship, because the contingency operations procedures should provide for this.
  6. Which statement about recovery times and outages is most correct?
    1. MAO should exceed RTO.
    2. RTO should exceed MAO.
    3. RTO should be less than or equal to MAO.
    4. RTO and MAO are always equal.
  7. Which value reflects a quantitative assessment of the maximum allowable loss of data due to a risk event?
    1. RTO
    2. RPO
    3. MAO
    4. ARO
  8. Which of the following statements about information security risks is most correct regarding the use of collaborative workspace tools and platforms?
    1. Because these tools encourage open, trusting sharing of information and collaboration on ideas, they cannot be used to securely work with proprietary or sensitive data.
    2. These tools require strong identity management and access control, as part of the infrastructure beneath them, to protect sensitive or proprietary information.
    3. Granting access to such collaboration environments should first be determined by legitimate business need to know and be based upon trustworthiness.
    4. First, the organizations collaborating with them should agree on how sensitive data used by or created by the team members must be restricted, protected, or kept safe and secure. Then, the people using the tool need to be fully aware of those restrictions. Without this, the technical risk controls, such as access control systems, can do very little to keep information safe and secure.
  9. Which statement about phishing attacks is most correct?
    1. Phishing attacks are rarely successful, and so they pose very low risk to organizations.
    2. Spear phishing attacks are easy to detect with scanners or filters.
    3. Attackers learn nothing of value from you if you simply reply to an email you suspect is part of a spear phishing or whaling attack and say “Please remove me from your list.”
    4. Phishing attacks of all kinds are still in use, because they can be effective social engineering tools when trying to do reconnaissance or gain illicit entry into an organization or its systems.
  10. In general, what differentiates phishing from whaling attacks?
    1. Phishing attacks tend to be used to gain access to systems via malware payloads or by getting recipients to disclose information, whereas whaling attacks try to get responsible managers to authorize payments to the attacker's accounts.
    2. Phishing attacks are focused on businesses; whaling attacks are focused on high-worth individuals.
    3. Whaling attacks tend to offer something that ought to sound “too good to be true,” whereas phishing attacks masquerade as routine business activities such as package delivery confirmations.
    4. There's really no difference.
  11. Which statement best describes how the separation of duties relates to education and training of end users, managers, and leaders in an organization?
    1. Separation of duties would dictate that general education and awareness training be done by different people than those who provide detailed skills-based training for the proper handling of sensitive information.
    2. Separation of duties should identify groups or teams that have little need for information security awareness, training, and education so that effort can be better focused on ones with greater needs.
    3. Separation of duties should segment the organization into teams focused on their job responsibilities, with clear interfaces to other teams. Effective awareness training and education can help each team, and each team member, see how successfully fulfilling their duties depends on keeping information safe, secure, and reliable.
    4. Separation of duties would dictate that workers outside of one team's span of control or duties have no business need to know what that team works with; education and training would reinforce this.
  12. What should be your highest priority as you consider improving the information security of your organization's telephone and voice communications systems?
    1. Having in-depth, current technical knowledge on the systems and technologies being used
    2. Understanding the contractual or terms of service conditions, with each provider, as they pertain to information security
    3. Ensuring that users, managers, and leaders understand the risks of sharing sensitive information with the wrong parties and that effective administrative controls support everyone in protecting information accordingly
    4. Ensuring that all sensitive information, of any kind, is covered by nondisclosure agreements (NDAs)
  13. Social engineering attacks still present a threat to organizations and individuals for all of the following reasons except:
    1. Most targeted individuals don't see the harm in responding to or in answering simple questions posed by the attacker.
    2. Most people believe they are too smart to fall for such obvious ploys, but they do anyway.
    3. Most targeted individuals and organizations have effective tools and procedures to filter out phishing and related scams, so they are now better protected from such attacks.
    4. Most people want to be trusting and helpful.
  14. You're the lone SSCP in the IT group of a small startup business, which has perhaps 25 or so full-time employees performing various duties. Much of the work the company does depends on dynamic collaboration with many outside agencies, companies, and academic organizations, as well as with potential customers. The managing director wants to talk with you about ways to help protect the rapidly evolving intellectual property, market development ideas, and other information that she believes give the company its competitive advantage. She's especially worried that with the high rate of open conversation in the collaborations, this advantage is at risk. Which of the following would you recommend that the company invest in and make use of first?
    1. More rigorous access control systems, using multifactor authentication
    2. More secure, compartmented collaboration software suites, tools, and procedures
    3. Better, more focused education and open dialogue with company staff about the risks of too much open collaboration
    4. Better work on our information risk management efforts, to include an information security classification process that our teammates can use effectively
  15. You've recently determined that some systems glitches might be caused by software or hardware that a few employees have installed and are using with their company-provided endpoints; in some cases, employee-owned devices are being used instead of company-provided ones. What are some of the steps you should take right away to address this? (Choose all that apply.)
    1. Check to see if your company's acceptable use policy addresses this.
    2. Review your IT team's approach to configuration management and control.
    3. Conduct remote configuration inspections and audits on the devices in question.
    4. Get your manager to escalate this issue before things get worse.
  16. You've just started a new job as an information security analyst at a medium-sized company, one with about 500 employees across its seven locations. In a conversation with your team chief, you learn that the company's approach to risk management and information security includes an annual review and update of its risk register. Which of the following might be worth asking your team chief about? (Choose all that apply.)
    1. What do we do when an incident response makes us aware of previously unknown vulnerabilities?
    2. How does that relate to our ongoing monitoring of our IT infrastructure and key applications platforms and systems?
    3. Does that result in any tangible cost savings for us?
    4. Why do that every year, as opposed to doing it on some other periodic basis?
  17. The company you work for does medical insurance billing, payments processing, and reconciliation, using both Web-based transaction systems and batch file processing of hundreds of transactions in one file. As the SSCP on the IT team, you've been asked to consider changes to the company's backup and restore strategies to help reduce costs. Which quantitative risk assessment parameter might this affect most?
    1. Recovery time objective.
    2. Recovery point objective.
    3. None; this operational change would not impact information security risks.
    4. Maximum allowable outage.
  18. Which statement about business continuity planning and information security is most correct?
    1. Plans are more important than the planning process itself.
    2. Planning is more important than the plans it produces.
    3. Plans represent significant investments and decisions and thus should be updated only when significant changes to objectives or circumstances dictate.
    4. Planning should continuously bring plans and procedures in tune with ongoing operational reality.
  19. One of your co-workers stated that he thought business continuity planning was a heartless, bottom line–driven exercise that cared only about the money and not about anything else. You disagree. Which of the following points would you not raise in discussing this with your colleague? (Choose all that apply.)
    1. Insurance coverage should provide for meeting the needs of workers or others who are disrupted by the incident and our responses to it.
    2. Due care places on all of us the burden to protect the safety and security of the business, its people, and stakeholders in it, as well as the society we're part of.
    3. As professionals, we're expected to take steps to ensure that the systems we're responsible for do not harm people or the property of others.
    4. The workers and managers are part of what makes the company productive and profitable in normal times, and even more so during the recovery from a significant disruption.
  20. How can ideas from the identity management lifecycle be applied to helping an organization's workforce, at all levels, defend against sophisticated social engineering attack attempts? (Choose all that apply.)
    1. Most end users may have significant experience with the routine operation of the business systems and applications that they use; this can be applied, much like identity proofing, to determine whether a suspected social engineering attempt is taking place.
    2. Most end users and their first-level supervisors have the best, most current insight as to the normal business rhythm, flow, inputs, and outcomes. This experience should be part of authenticating an unusual access request (via email, phone, in person, or by any means).
    3. Users think that they know a lot about “business normal,” but they tend to know only the narrow scope of their jobs and responsibilities; this does not equip them to contribute to detecting social engineering attacks.
    4. Contact requests by email, by phone, in person, or by other means are akin to access attempts, and they can and should be accounted for.