Anomalies happen. Tasks stop working right. Users can't connect reliably, or their connections don't stay up as they should. Servers get sluggish, as if they are handling an abnormally high demand for services. Hardware or software systems just stop working, either with “blue screens of death” or by using normal restart procedures but at unexpected times. On the factory floor, safety systems start malfunctioning; equipment and materials get damaged, people get hurt. Autonomous warehouse bots start dropping what they're carrying, or run into the shelf racks (or each other). Vehicles won't start—or stop. Networks go down, and take the VoIP systems with them. Now, your organization's computer emergency response team springs into action to characterize the incident, contain it, and get your systems back to operating normally.
What? Your organization doesn't have such a team? Let's jump right in, do some focused preparation, and improve your operational information security posture so that you can detect, identify, contain, eradicate, and restore after the next anomaly.
It's often been said that the attackers have to get lucky only once, whereas the defenders have to be lucky every moment of every day. When it comes to advanced persistent threats (APTs), which can mount some of the most damaging attacks on our information systems, another, more operationally useful rule applies. APTs must of necessity use a robust kill chain to discover, reconnoiter, characterize, infiltrate, gain control, and further identify resources to attack within the system; make their “target kill”; and copy, exfiltrate, or destroy the data and systems of their choice, cover their tracks, and then leave. Things get worse: most businesses, nongovernmental organizations (NGOs), and government departments and agencies are probably the object of interest of dozens of different, unrelated attackers, each following its own kill chain logic to achieve its own set of goals (which may or may not overlap with those of other attackers). Taken together, there may be thousands if not hundreds of thousands of APTs out there in the wild, each seeking its own dominance, power, and gain. The millions of information systems owned and operated by businesses and organizations worldwide are their hunting grounds. As a senior official in the British Intelligence services remarked in 2021, cybercrime attacks have become big business because they work; they're part of a highly successful business model that embraces the entire Dark Web ecosystem and everyone who's a part of it. (Some forecasters predict that by 2030, if cybercrime were a nation it would have the fifth largest economy on the planet.)
The good news, however, is that as you've seen in previous chapters, SSCPs have some field-proven information risk management and mitigation strategies that they can help their companies or organizations adopt. These frameworks, and the specific risk mitigation controls, are tailored to the information security needs of your specific organization. With them, you can first deter, prevent, and avoid attacks. Then you can detect the ones that get past that first set of barriers, and characterize them in terms of real-time risks to your systems. You then take steps to contain the damage they're capable of causing, and help the organization recover from the attack and get back up on its feet.
You probably will not do battle with an APT directly; you and your team won't have the luxury (if we can call it that!) of trying to design to defeat a particular APT and thwart its attempts to seek its objectives at your expense. Instead, you'll wage your defensive campaign one skirmish at a time. You'll deflect or defeat one scouting party as you strengthen one perimeter; you'll detect and block a probe from gaining entry into your systems. You'll find where an illicit user ID has made itself part of your system, and you'll contain it, quarantine it, and ultimately block its attempts to expand its presence inside your operations. As you continually work with your systems' designers and maintainers, you'll help them find ways to tighten down a barrier here or mitigate a vulnerability there. Step by step, you strengthen your information security posture.
By now, you and your organization should be prepared to respond when those alarms start ringing. Right?
Why should SSCPs put so much emphasis on APTs and their use of the kill chain? In virtually every major data breach in the past decade, the attack pattern was low and slow: sequences of small-scale efforts designed to not cause alarm, each of which gathered information or enabled the attacker to take control of a target system. More low and slow attacks launched from that first target against other target systems. More reconnaissance. Finally, with all command, control, and hacking capabilities in place, the attack began in earnest to exfiltrate sensitive, private, or otherwise valuable data out of the target's systems.
Note that if any of those low and slow attack steps had been thwarted, or if any of those early reconnaissance efforts, or attempts to install command and control tools, had been detected and stopped, then the attacker might have given up and moved on to another lucrative target.
Preparation and planning are the keys to survival. In previous chapters, you've learned how to translate risk mitigation into specific physical, technical, and administrative controls that you'd recommend to management to implement as part of the organization's information systems security posture. You've also learned how to build in the detection capabilities that should raise the alarms when things aren't looking right. More importantly, you've grasped the need to aggregate alarm data with systems status, state, and health information to generate indications and warnings of a possible information security incident in the making, as well as the urgent and compelling need to promptly escalate such potential bad news to senior management and leadership.
In Chapter 1, “The Business Case for Decision Assurance and Information Security,” we looked briefly at the value chain, which models how organizations create value in the products or services they provide to their customers. The value chain brings together the sequence of major activities, the infrastructures that support them, and the key resources that they need to transform each input into an output. The value chain focuses our attention on both the outputs and the outcomes that result from each activity. Critical to thinking about the value chain is that each major step provides the organization a chance to improve the end-to-end experience by reducing costs (by reducing waste, scrap, and rework) and improving the quality of each output and outcome along the way. We also saw that every step along the value chain is an opportunity for something to go wrong. A key input could be delayed or fail to meet the required specifications for quality or quantity. Skilled labor might not be available when we need it; critical information might be missing, incomplete, or inaccurate.
The name kill chain comes from military operational planning (which, after all, is the business of killing the opponent's forces and breaking their systems). Kill chains are outcomes-based planning concepts and are geared to achieving national strategic, operational, or tactical outcomes as part of larger battle plans. These kill chains tend to be planned from the desired outcome back toward the starting set of inputs: if you want to destroy the other side's naval fleet while at anchor at its home port, you have to figure out what kind of weapons you have or can get that can destroy such ships. Then you work out how to get those weapons to where they can damage the ships (by air drop, surface naval weapons fire, submarine, small boats, cargo trucks, or other stealthy means). And so on. You then look at each way the other side can deter, defeat, or prevent you from attacking. By this point, you probably realize that you need to know more about their naval base, its defenses, its normal patterns of activity, its supply chains, and its communications systems. With all of that information, you start to winnow down the pile of options into a few reasonably sensible ways to defeat their navy while it's at home port, or you realize that's beyond your capabilities and you look for some other target that might be easier to attack that can help achieve the same outcome you want to achieve by defeating their navy.
With that as a starting point, we can see that an information systems kill chain is the total set of actions, plans, tasks, and resources used by an advanced persistent threat to
How do APTs apply this kill chain in practice? In general terms, APT actors do the following:
The more complex, pernicious APTs will use multiple target systems as proxies in their kill chains, using one target's systems to become a platform from which they can run reconnaissance and exploitation against other targets.
Let's suppose for a moment that your company and its information systems have caught the attention of an APT actor. How might their attentions show up as observable activities from your side of the interface? Most probably, your systems will experience a variety of anomalies, of many different types, which may seem completely unrelated. At some point, one of those anomalies catches your interest, or you think you see a pattern beginning to emerge from a sequence of events.
Back in Chapter 2, “Information Security Fundamentals,” we defined an event of interest as something that happens that might indicate an impact to your information systems' security. We looked at how an event of interest may or may not be a warning of a computer security incident in the making, or even the first stages of such an incident.
But what is a computer security incident? Several definitions by NIST, ITIL, and the IETF* suggest that computer security incidents are events involving a target information system in ways that
Consider the unplanned shutdown of an email server within your systems. You'd need to do a quick investigation to rule out natural causes (such as a thunderstorm-induced power surge) and accidental causes (the maintenance technician who stumbled and pulled the power cord loose on his way to the floor). Yes, your vulnerability assessment might have discovered these and made recommendations as to how to reduce their potential for disruption. But if neither weather nor a hardware-level accident caused the shutdown, you still have a dilemma: was it a software design problem that caused the crash, or a vulnerability that was exploited by a person or persons unknown?
Or consider the challenges of differentiating phishing attacks from innocent requests for information. An individual caller to your main business phone number, seeking contact information for your IT team, might be making an honest and innocent inquiry (perhaps they're an SSCP looking for a job!). However, if a series of such seemingly innocent inquiries over many days has attempted to map out your entire organization's structure, complete with individual names, phone numbers, and email addresses, someone is scouting you!
All of this means that your organization needs to clearly spell out a triage process by which the IT and information security teams can recognize an event, quickly characterize it, and decide the right process to apply to it. Figure 10.1 illustrates such a process.
Note that our role as SSCPs requires us to view these incidents from the overall information risk management and mitigation perspective as well as from the information systems security perspective. It's quite likely that the computer security perspective is the more challenging one, demanding a greater degree of rapid-fire analysis and decision making, so we'll focus on it from here on out.
Before we take a deep dive into incident detection and response, it's probably useful to take a look at a number of real attacks that have taken place and see them broken down across time. You've heard that, on average, an attacker is inside their target's system for nearly six months before their presence is detected. It's natural to ask, “just what are those attackers actually doing during all of that time,” as a way of guiding your own thinking about planning and preparing to detect and respond to attackers.
Frameworks and theoretical models like the cyber kill chain are useful for part of this; they help organize our meta-thinking (our thinking about how we think) as we use them to build our own conceptual models of events such as complex cyberattacks. This prepares us to get down into the details; and as we saw in our previous investigations of vulnerability assessment and management, some of the best tools to help in that process are the national databases and query systems that gather, collate, and present the details of vulnerabilities that have been discovered and the exploits that have been used against them. MITRE's CVE data, for example, is a fundamental tool for every information security specialist. As of July 30, 2021, this database had 157,777 records of publicly disclosed vulnerability data regarding almost every form of IT and OT hardware, systems software, and applications in the world. Correlating your IT and OT architecture's system baseline with the applicable CVE items informs your vulnerability assessment process.
Organized another way, that same vulnerability and exploit reporting data can inform your incident response preparedness efforts. Let's see how.
More than a decade of research efforts across the information security community formed the foundation of this framework, which was intended to produce a “living lab” capability that supports and encourages security practitioners and researchers to find new ways to derive meaning from the experiences of real-world attacks. In the words of its creators, it “is just as much about the mindset and process of using it as much as it is about the knowledge base itself.” Two views of this knowledge base illustrate this:
ATT&CK provides three defined views of this knowledge base, where the exploits in question (and thus their tactical use and objective or purpose) are organized in ways that best reflect attacks on three classes of targets: enterprise systems, mobile systems, and industrial control systems (ICSs). Figure 10.2 shows a small portion of the enterprise framework, illustrating how individual exploits are grouped, from the bottom up, by purpose and then by stage in an overall attack plan. It shows how botnets, for example, can serve multiple tactical needs during the resource development phase of an attack.
ATT&CK is live data; as each new CVE report comes in, it is characterized (either by those who report it or by ATT&CK-affiliated researchers) and tagged so that it can be placed into its rightful places (plural) in these attack frameworks. And when attackers demonstrate new tactics and techniques and organize the ebb and flow of their tactical steps in new ways, ATT&CK is adjusted by its researcher-operators to reflect those new, painfully learned lessons.
MITRE's ATT&CK team has rounded up a number of resources to help you become familiar with the tool, its data, and its use. Whether you want to do this with videos, blogs, ebooks, or other materials, your journey can begin at https://attack.mitre.org/resources/getting-started/.
Set this book down, take a break, and go spend some time digging into some of what's out there on ATT&CK; then, armed with more of an attacker's-eye view, come back, and let's start organizing, preparing, training, and equipping your own incident responders.
All organizations, regardless of size or mission, should have a framework or process they use to manage their information security incident response efforts. It is a vital part of your organization's business logic. Due care and due diligence both require it. The sad truth, however, is that many organizations don't get around to thinking through their incident response process needs until after the first really scary information security incident has taken place. As they sweep up the digital broken glass from the break-in, assess their losses due to stolen or compromised data, and start figuring out how to get back into operation, they say “Never again!” They promise themselves that they'll write down the lessons they've just painfully learned and be better prepared.
(ISC)2 and others define the incident response framework as a formal plan or process for managing the organization's response to a suspected information security incident. It consists of a series of steps that start with detection and run through response, mitigation, reporting, recovery, and remediation, ending with a lessons learned and onward preparation phase. Figure 10.3 illustrates this process. Please note that this is a conceptual flow of the steps involved; reality tells us that incidents unfold in strange and complex ways, and your incident response team needs to be prepared to cycle around these steps in different ways based on what they learn and what results they get from the actions they take.
NIST, in its special publication 800-61r2, adds an initial preparation phase to this flow and further focuses attention on the detection process by emphasizing the role of prompt analysis to support incident identification and characterization. NIST also refines the mitigation effort by breaking it down into separate containment and eradication steps, and expands the lessons learned phase into information sharing and coordination activities. These are shown alongside the simplified response flow in Figure 10.4.
Other publications and authorities such as ITIL publish their own incident response frameworks, each slightly different in specifics. ISO/IEC 27035:2016 is another good source of information technology security techniques and approaches to incident management. As we saw with risk management frameworks in Chapters 3, “Integrated Information Risk Management,” and 4, “Operationalizing Risk Mitigation,” the key is to find what works for your organization. These same major tasks ought to show up in your company's incident response management processes, policies, and procedures. They may be called by different names, but the same set of functions should be readily apparent as you read through these documents. If you're missing a step—if a critical task in either of these flows seems to be overlooked—then it's time to investigate.
Unless you're in a very small organization where, as the SSCP, you wear all of the hats of network and systems administration, security, and incident response, your organization will need to formally designate a team of people who have the “watch-standing” duty of real-time incident response. This team might be called a computer emergency response team (CERT). Such teams are also known as computer incident response teams or cyber incident response teams (both using the CIRT acronym), or as computer security incident response teams (CSIRTs). For ease of reference, let's call ours a CSIRT for the remainder of this chapter. (Note that CERTs tend to have a broader charter, responding whether systems are put out of action by acts of nature, accidents, or hostile attackers. CERTs also tend to be more involved with broader disaster recovery efforts than a team focused primarily on security-related incidents.)
Your organization's risk appetite and its specific CIANA needs should determine whether this CSIRT provides around-the-clock, on-site support or provides rapid-response, on-call support after business hours. These needs will also help determine whether the incident response team is a separate and distinct group of people or is part of preexisting groups in your IT, systems, or networks departments. In Chapter 5, “Communications and Network Security,” for example, we looked at segregating the day-to-day network operations jobs of the network operations center (NOC) from the time-critical security and incident response tasks of a security operations center (SOC).
Whether your organization calls them a CSIRT or an SOC, or they're just a subset of the IT department's staff, there are a number of key functions that this incident response team should perform. We'll look at them in more detail in subsequent sections, but by way of introduction, they are as follows:
Serve as a single point of contact for incident response. Having a single point of contact between the incident and the organization makes incident command, control, and communication much more effective. This should include the following:
Take control of the incident and the scene. Taking control of the incident, as an event that's taking place in real time, is vital. Without somebody taking immediate control of the incident, and where it's taking place, you risk bad decisions placing people, property, information, or the business at greater risk of harm or loss than they already are. Taking control of the incident scene protects information about the incident, where it happened, and how it happened. This preserves physical and digital evidence that may be critical to determining how the incident began, how it progressed, and what happened as it spread. This information is vital to both problem analysis and recovery efforts and legal investigations of fault, liability, or unlawful activity.
Investigate, analyze, and assess the incident. This is where your skills as a troubleshooter and investigator, and your knack for making informed guesses, start to pay off. Gather data; ask questions; dig for information.
Escalate, report, and engage with leadership. Once they've determined that a security-related incident might in fact be happening, the team needs to promptly escalate this to senior leadership and management. This may involve a judgment call on the response team chief's part, as preplanned incident checklists and procedures cannot anticipate everything that might go wrong. Experience dictates that it's best to err on the side of caution, and report or escalate to higher management and leadership.
Keep a running incident response log. The incident response team should keep accurate logs of what happened, what decisions got made (and by whom), and what actions were taken. Logging should also build a time-ordered catalog of event artifacts: files, other outputs, or physical changes to systems, for example. This time history of the event, as it unfolds, is also vital to understanding the event and to mitigating or taking remedial action to prevent its recurrence. Logs and the catalogs of artifacts that go with them are an important part of establishing the chain of custody of evidence (digital or other) in support of any subsequent forensics investigation. (A minimal sketch of such a log appears after this list.)
Coordinate with external parties. External parties can include systems vendors and maintainers, service bureaus or cloud-hosting service providers, outside organizations that have shared access to information systems (such as extranets or federated access privileges), and others whose own information and information systems may be put at risk by this incident as it unfolds. By acting as the organization's focal point for coordination with external parties, the team can keep those partners properly informed, reduce risk to their systems and information, and make better use of technical, security, and other support those parties may be able to provide.
Contain the incident. Prevent it from infecting, disrupting, or gaining access to any other elements of your systems or networks, and prevent it from using your systems as launchpads to attack other external systems.
Eradicate the incident. Remove, quarantine, or otherwise eliminate all elements of the attack from your systems.
Recover from the incident. Restore systems to their pre-attack state by resetting and reloading network systems, routers, servers, and so forth as required. Finally, inform management that the systems should be back up and ready for operational use by end users.
Document what you've learned. Capture everything possible regarding systems deficiencies, vulnerabilities, or procedural errors that contributed to the incident taking place for subsequent mitigation or remediation. Review your incident response procedures for what worked and what didn't, and update accordingly.
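By way of illustration, here is a minimal sketch of the kind of running incident log and artifact catalog described above, written in Python. The incident identifier, field names, and file format are hypothetical, and a real CSIRT would adapt this to its own evidence-handling and chain-of-custody procedures.

```python
# Minimal sketch of an append-only incident response log (hypothetical format).
# Each entry is timestamped in UTC; cataloged artifacts are hashed with SHA-256
# so that later reviewers can verify the evidence has not changed.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

LOG_FILE = Path("incident-2024-001-log.jsonl")  # hypothetical incident identifier

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file (artifact) for the catalog."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def log_entry(actor: str, action: str, artifact: Optional[Path] = None) -> None:
    """Append one time-ordered entry; hash and catalog the artifact if one is attached."""
    entry = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
    }
    if artifact is not None:
        entry["artifact"] = str(artifact)
        entry["artifact_sha256"] = sha256_of(artifact)
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# Example use during a response (names are invented):
# log_entry("j.smith", "Isolated host FIN-PC-07 from VLAN 12")
# log_entry("j.smith", "Captured memory image of FIN-PC-07", artifact=Path("fin-pc-07.mem"))
```

Keeping each entry timestamped in UTC, and hashing every artifact as it is cataloged, makes it far easier to demonstrate later that the evidence has not been altered.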
No matter how your organization breaks up the incident response management process into a series of steps, or how they are assigned to different individuals or teams within the organization, the incident response team must keep three basic priorities firmly in mind.
The first one is easy: the safety of people comes first. Nothing you are going to try to accomplish is more important than protecting people from injury or death. It does not matter whether those people are your coworkers on the incident response team, or other staff members at the site of the incident, or even people who might have been responsible for causing the incident, your first priority is preventing harm from coming to any of them—yourself included! Your organization should have standing policies and procedures that dictate how calls for assistance to local fire, police, or emergency medical services should be made; these should be part of your incident response procedures.
The next two priority choices, when taken together, are actually one of the most difficult decisions facing an organization, especially when it's in the midst of a computer security incident: should it prioritize getting back into normal business operations, or supporting a digital forensics investigation that may establish responsibility, guilt, or liability for the incident and the resultant loss and damages? This is not a decision that the on-scene response team leader makes! Simply put, the longer it takes to secure the scene and gather and protect evidence (such as memory dumps, systems images, disk images, log files, etc.), the longer it takes to restore systems to their normal business configurations and get users back to doing productive work. This is not a binary, either-or decision; it is something that the incident response team and senior leaders need to keep a constant watch over throughout all phases of incident response.
Increasingly, we see that government regulators, civic watchdog groups, shareholders, and the courts are becoming impatient with senior management teams that fail in their due diligence. This impatience is translating into legal and market action that can and will bring self-inflicted damage (negligence, in other words) home to roost where it belongs. The reasonable fear of that outcome should lead organizations to task all members of the IT organization, including their information security specialists, with developing greater proficiency at protecting and preserving the digital evidence related to an incident while getting the systems and business processes promptly restored to normal operations.
The details of how to preserve an incident scene for a possible digital forensics investigation, and how such investigations are conducted, are beyond the scope of the SSCP exam and this book. They are, however, great avenues for you to journey along as you continue to grow in your chosen profession as a white hat!
You may have noticed that this step isn't shown in either of the flows in Figure 10.3 or Figure 10.4. That's not an oversight—this preparation should have begun as soon as you started your information risk management planning process. There is nothing to gain by waiting—and potentially everything to lose. NIST SP 800-61 Rev. 2 provides an excellent “shopping list” of key preparation and planning tasks to start with, and of the information that should be readily available to your response team. But where do you start?
Let's break this preparation task down into more manageable steps, using the Plan-Do-Check-Act (PDCA) model we used in earlier chapters, as part of risk management and mitigation. It may seem redundant to plan for a plan, but it's not—you have to start somewhere, after all. Note that the boundaries between planning, doing, checking, and acting are not hard and fast; you'll no doubt find that some steps can and should be taken almost immediately, while others need a more deliberative approach. Every step of the way, keep senior management and leadership engaged and involved. This is their emergency response capability you're planning and building, after all.
This first set of tasks focuses on gathering what the organization already knows about its information systems and IT infrastructures, its business processes and its people, which become the foundation on which you can build the procedures, resources, and training that your incident responders will need. As you build those procedures and training plans, you'll also need to build out the support relationships you'll need when that first incident (or the next incident) happens.
Build, maintain, and use a knowledge base of critical systems support information. You'll need this information to identify and properly scope the CSIRT's monitoring and detection job, as well as to identify the internal systems support teams, critical users, and recovery and restoration processes that already exist. The CSIRT should have these information products available as a living library of reference and guidance materials. These include but are not limited to
Whether you put this information into a separate knowledge base for your incident responders, or it is part of your overall software, systems, and IT knowledge base, is perhaps a question of scale and of survivability. During an incident itself, you need this knowledge base reliably available to your responders, without having to worry if it's been tainted by this incident or a prior but undetected one.
Use that list to identify the set of business process, systems architecture, and technology-focused critical knowledge that each CSIRT team member must be proficient in, and add this to your team training and requalification planning set.
Assemble critical data collection, collation, and analysis tools. Characterizing an event in real time, and quickly determining its nature and the urgency of the response it demands, requires that your incident response team be able to analyze and assess what all of the information from your systems is trying to tell them. You do not help the team by leaving them to hunt for the tools they need while they're trying to deal with an ongoing incident. Instead, identify a broad set of systems and event information analysis tools, and bring them together in what we might call a responder's workbench. This workbench can provide your response team with a set of known, clean systems to use as they capture data, analyze it, and draw conclusions about the event in question. Some of the current generation of security information and event management systems may provide good starting points for growing your own workbench. Other tools may need to be developed in house, tailored to the nature of critical business processes or information flows, for example.
Establish minimum standards for event logging. Virtually all of your devices, be they servers, endpoints, or connectivity systems, have the capability to capture event information at the hardware, systems software, and applications levels. These logs can help you quickly narrow down your hunt for the broken or infected system, or for the unauthorized subject(s) and the objects they've accessed. You'll also need to establish a comprehensive and uniform policy about log file retention if you're hoping to correlate logs from different systems and devices with each other in any meaningful way. Higher-priority, mission-critical systems should have higher levels of logging, capturing more events and at greater time granularity, to better empower your response capability for those systems. (A small configuration sketch follows this list.)
Identify forensics requirements, capabilities, and relationships. Although many information security incidents may come and go without generating legal repercussions, you need to take steps now to prepare for those incidents that will. You'll need to put in place the minimum required capabilities to establish and maintain a chain of custody for evidence. This may surface the need for additional training for CSIRT team members and managers. Use this as the opportunity to understand the support relationships your team will need when (not if) such an incident occurs, and start thinking through how you'd select the certified forensics examiners you'd need when it does.
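To make the tiered-logging idea from the event-logging item above concrete, here is one possible sketch using Python's standard logging module. The tier names, system names, and retention figures are invented for illustration and would come from your own asset inventory and retention policy.

```python
# Sketch: a tiered event-logging standard expressed with Python's standard logging module.
# Tier names, system names, and retention figures are invented; a real policy would come
# from your asset inventory, mission priorities, and records-retention requirements.
import logging

LOGGING_POLICY = {
    # tier:              (minimum log level, retention in days)
    "mission_critical":   (logging.DEBUG,   365),  # capture everything, keep a year
    "business_important": (logging.INFO,    180),
    "general_support":    (logging.WARNING,  90),
}

SYSTEM_TIERS = {
    "payroll-db-01": "mission_critical",
    "hr-portal-02":  "business_important",
    "print-srv-05":  "general_support",
}

def configure_logger(system_name: str) -> logging.Logger:
    """Build a logger whose verbosity reflects the system's priority tier."""
    tier = SYSTEM_TIERS.get(system_name, "general_support")
    level, _retention_days = LOGGING_POLICY[tier]
    logger = logging.getLogger(system_name)
    logger.setLevel(level)
    handler = logging.FileHandler(f"{system_name}.log")
    # A uniform timestamp format (ideally UTC) is what makes later correlation possible.
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

log = configure_logger("payroll-db-01")
log.debug("Connection opened from 10.20.30.40")  # recorded only on the highest-priority tier
```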
By the end of this preparation planning phase, you should have some concrete ideas about what you'll need for the CSIRT:
This is where the doing of our PDCA gets going in earnest. Some of the actions you'll take are strictly internal and technical; some relate to improvements in administrative controls:
Synchronize all system clocks. Many service handshakes can tolerate clock misalignment of five minutes or more across the elements participating in the service, but such drift can play havoc with attempts to correlate event logs.
Frequently profile your systems. System profiles help you understand the “normal” types, patterns, and amounts of traffic and load on the systems, as well as capture key security and performance settings. Whether you use automated change-detection tools or manual inspection, comparing a current profile to a previous one may surface an indicator of an event of interest in progress or shed light on your search to find it, fix it, and remove it. (A minimal comparison sketch appears after this list.)
Establish channels for outside parties to report information security incidents to you. Whether these are other organizations you do routine business with or complete strangers, you make it much easier on your shared community of information security professionals when you set up an email form or phone number for anyone to report such problems to you. And it should go without saying that somebody in your response team needs to be paying attention to that email inbox, or the phone messages, or the forms-generated trouble tickets that flow from such a “contact us” page!
Establish external incident response support relationships. Many of the organizations you work with routinely—your cloud-hosting providers, other third-party services, your systems and software vendors and maintainers, even and especially your ISP—can be valuable teammates when you're in the midst of an incident response. Gather them up into a community of practice before the lightning strikes. Get to know each other, and understand the normal limits of what you can call upon each other for in the way of support. Clearly identify what you have to warn them about as you're working through a real-time incident response yourself.
Develop and document CSIRT response procedures. These will, of course, be living documents; as your team learns with each incident they respond to, they'll need to update these procedures as they discover what they were well-prepared and equipped to deal with effectively, and what caught them by surprise. Checklist-oriented procedures can be very powerful, especially if they're suitable for deployment to CSIRT team members' smartphones or phablets. Don't forget the value of a paper backup copy, along with emergency lighting and flashlights with fresh batteries, for when the lights go out!
Initiate CSIRT personnel training and certification as required. Take the minimum proficiency sets of knowledge, skills, and abilities (often called KSAs in human resources management terms), review the personnel assigned to the CSIRT and your recall rosters, and identify the gaps. Then organize, plan, schedule, and accomplish the training, whether informal on-the-job instruction or formal coursework, that each person needs. Keep CSIRT proficiency qualification files for each team member, note the completion of training activities, and be able to inform management regarding this aspect of your readiness for incident response. (Your organization's HR team may be able to help you with these tasks, and with organizing the training recordkeeping.)
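As promised in the profiling item above, here is a minimal sketch of the comparison idea: capture a simple profile of file hashes and diff it against a saved baseline. The watched directories and baseline file name are hypothetical, and real profiling also covers traffic patterns, running processes, accounts, and configuration settings.

```python
# Sketch: capture a simple file-integrity profile and compare it to a saved baseline.
# The watched directories and baseline file name are hypothetical; dedicated
# file-integrity monitors track far more, but the comparison logic is the same idea.
import hashlib
import json
from pathlib import Path

WATCHED_DIRS = [Path("/etc"), Path("/usr/local/bin")]  # hypothetical watch list
BASELINE = Path("profile-baseline.json")

def build_profile() -> dict:
    """Hash every readable file under the watched directories."""
    profile = {}
    for root in WATCHED_DIRS:
        for path in root.rglob("*"):
            if path.is_file():
                try:
                    profile[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
                except OSError:
                    continue  # skip files we cannot read in this sketch
    return profile

def compare_to_baseline(current: dict) -> None:
    """Report files added, removed, or changed since the last approved profile."""
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    added   = current.keys() - baseline.keys()
    removed = baseline.keys() - current.keys()
    changed = {p for p in current.keys() & baseline.keys() if current[p] != baseline[p]}
    for label, items in (("ADDED", added), ("REMOVED", removed), ("CHANGED", changed)):
        for item in sorted(items):
            print(f"{label}: {item}")  # each difference is a potential event of interest

current = build_profile()
compare_to_baseline(current)
BASELINE.write_text(json.dumps(current, indent=2))  # promote to new baseline only if approved
```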
Maybe your preparation achieves a “ready to respond” state incrementally; maybe you're just not ready for an incident at all, until you've achieved a certain minimum set of verified, in-place knowledge, tools, people, and procedures. Your organization's mission, goals, objectives, and risk posture will shape whether you can get incrementally ready or have to achieve an identifiable readiness posture. Regardless, there are several things you and the CSIRT should do to determine whether they are ready or not:
Understand your “business normal” as seen by your IT systems. Establish a routine pattern or rhythm for your incident response team members to steep themselves in the day-to-day normal of the business and how people in the business use the IT infrastructure to create value in that normal way. Stay current with internal and external events that you'd reasonably expect would change that normal—the weather-related shutdown of a branch office, or a temporary addition of new federation partners into your extranets. The more each team member knows about how “normal” is reflected in fine-grained system activity, the greater the chance that those team members will sniff out trouble before it starts to cause problems. They'll also be better informed and thus more capable of restoring systems to a useful normal state as a result.
While you're at it, don't forget to translate that business normal into fine-tuning of your automated and semiautomated security tools, such as your security information and event management systems (SIEMs), intrusion detection systems (IDS), intrusion prevention systems (IPS), or other tools that drive your alerting and monitoring channels. Business normal may also be reflected in the control and filter settings for access control and identity management systems, as well as for firewall settings and their access control lists. This is especially important if your organization's business activities have seasonal variations.
Routinely demonstrate and test backup and restore capabilities. You do not want to be in the middle of an incident response only to find out that you've been taking backup images or files all wrong and that none of them can be reloaded or work right when they are loaded.
Exercise your alert/recall, notification, escalation, and reporting processes. At the cost of a few extra phone calls and a bit of time from key leaders and managers, you gain confidence in two critical aspects of your incident response management process. For starters, you demonstrate that the phone tree or the recall and alert processes work; this builds confidence that they'll work when you really need them to. A second, add-on bonus is that you get to “table-top” or exercise the protocols you'd want to use had this been an actual information systems security incident.
Document your incident response procedures, and use these documents as part of training and readiness. Do not trust human memory or the memory of a well-intended and otherwise effective committee or team! Take the time to write up each major procedure in your incident response management process. Make it an active, living part of the knowledge base your responders will need. Exercise these procedures. Train with them, both as initial training for IT and incident response team members, line, and senior managers, and your general user base as applicable.
Taken all at once, that looks like a lot of preparation! Yet much of what's needed by your incident response team, if they're going to be well prepared, comes right from the architectural assessments, your vulnerability assessments, and your risk mitigation implementation activities. Other key information comes from your overall approach to managing and maintaining configuration control over your information systems and your IT infrastructure. And you should already be carrying out good “IT hygiene” and safety and security measures, such as clock synchronization, event logging, testing, and so forth. The new effort is in creating the team, defining its tasks, writing them up in procedural form, and then using those procedures as an active part of your ongoing training, readiness, and operational evaluation of your overall information security posture.
On a typical day, a typical medium-sized organization might see millions of IP packets knocking on its point of presence, most of them in response to legitimate traffic generated inside the organization, solicited by its Web presence, or generated by its external partners, customers, prospective customers, and vendors. Internally, the traffic volume on the company's internetworks and the event loads on servers that support end users at their endpoints could be of comparable volume. Detecting that something is not quite right, and that that something might be part of an attack, is as much art as it is science. Three different factors combine to make this art-and-science difficult and challenging:
So how does our response team sort through all of that noise and find the few important, urgent, and compelling signals to pay attention to?
First, let's define some important terms related to incident detection. Earlier we talked about events of interest—that is, some kind of occurrence or activity that takes place that just might be worth paying closer attention to. Without getting too philosophical about it, events make something in our systems change state. The user, with hand on mouse, does not cause an event to take place until they do something with the mouse, and it signals the system it's attached to. That movement, click, or thumbwheel roll causes a series of changes in the system. Those changes are events. Whether they are interesting ones, or not, from a security perspective, is the question!
A precursor is a sign, signal, or observable characteristic of the occurrence of an event that in and of itself is not an attack but that might indicate that an attack could happen in the future. Let's look at a few common examples to illustrate this concept:
Genuine precursors—ones that give you actionable intelligence—are quite rare. They are often akin to the “travel security advisory codes” used by many national governments: they rarely provide enough insight to tell you that something specific is about to take place. The best you can do when you see such potential precursors is to pay closer attention to your indicators and warnings systems, perhaps by opening up the filters a bit more. You might also consider altering your security posture in ways that might increase protection for critical systems, perhaps at the cost of reduced throughput due to additional access control processing.
An indicator is a sign, signal, or observable characteristic of the occurrence of an event indicating that an information security incident may have occurred or may be occurring right now. Again, a few very common examples will illustrate:
One type of indicator worth special attention is called an indicator of compromise (IOC), which is an observable artifact that with high confidence signals that an information system has been compromised or is in the process of being compromised. Such artifacts might include recognizable malware signatures, attempts to access IP addresses or URLs known or suspected to be of hostile or compromising intent, or domain names associated with known or suspected botnet control servers. The information security community is working to standardize the format and structure of IOC information to aid in rapid dissemination and automated use by security systems.
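As a simple illustration of how IOC data can be put to work, the following sketch scans log entries for known-bad IP addresses, domain names, and file hashes. The indicator values and log format are made up for illustration; in practice, IOC feeds arrive in standardized formats such as STIX and are consumed automatically by your SIEM, IDS, or IPS.

```python
# Sketch: match log entries against a small set of indicators of compromise (IOCs).
# The indicator values and the log format are made up for illustration; real IOC feeds
# (for example, STIX bundles) carry far richer context and confidence information.
import re

IOC_SET = {
    "ip":     {"203.0.113.66"},            # documentation-range address as a stand-in
    "domain": {"botnet-c2.example.net"},   # hypothetical command-and-control domain
    "sha256": {"0123456789abcdef" * 4},    # placeholder, not a real sample hash
}

LOG_LINES = [
    "2024-05-01T10:02:13Z host=web-01 dst=203.0.113.66 action=connect",
    "2024-05-01T10:04:55Z host=web-01 dns_query=botnet-c2.example.net",
    "2024-05-01T10:05:02Z host=web-01 dst=198.51.100.7 action=connect",
]

def find_ioc_hits(lines):
    """Yield (line, ioc_type, value) for every indicator observed in the logs."""
    for line in lines:
        for ioc_type, values in IOC_SET.items():
            for value in values:
                if re.search(re.escape(value), line, re.IGNORECASE):
                    yield line, ioc_type, value

for line, ioc_type, value in find_ioc_hits(LOG_LINES):
    print(f"IOC hit ({ioc_type}: {value}) -> {line}")
```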
In one respect, the fact that detection is a war of numbers is both a blessing and a curse; in many cases, even the first few low and slow steps in an attack may create dozens or hundreds of indicators, each of which may, if you're lucky, contain information that correlates them all into a suspicious pattern. Of course, you're probably dealing with millions of events to correlate, assess, screen, filter, and dig through to find those few needles in that field of haystacks.
Initial incident detection is the iterative process by which human members of the incident response team assemble, collate, and analyze any number of indicators (and precursors, if available and applicable), usually with a SIEM tool or data aggregator of some sort, and then come to the conclusion that there is most likely an information security event in progress or one that has recently occurred. This is a human-centric, analytical, thoughtful process; it requires team members to make educated guesses (that is, generate hypotheses), test those hypotheses against the indicators and other systems event information, and then reasonably conclude that the alarm ought to be sounded.
That alarm might be best phrased to say that a “probable information security incident” has been detected, along with reporting when it is believed to have first started to occur and whether it is still ongoing.
Ongoing analysis will gather more data, from more systems; run tests, possibly including internal profiling of systems suspected to have been affected or accessed by the attack (if attack it was); and continue to refine its characterization or classification of the incident. At some point, the response team should consult predefined priority lists that help them allocate people and systems resources to continuing this analysis.
Note the dilemma here: paying too much attention, too soon, to too many alarms may distract attention, divert resources, and even build in a “Chicken Little” kind of reaction within management and leadership circles. When a security incident actually does occur, everyone may be just too desensitized to care about it. And of course, if you've got your thresholds set too high, you ignore the alarms that your investments in intrusion detection and security systems are trying to bring to your attention. Many of the headline-grabbing data breach incidents in the past 10 years, such as the attack that struck Target stores in 2013, suffered from having this balance between the costs of dealing with too many false rejections (or Type 1 errors) and the risk of missing a few more dangerous false acceptances (or Type 2 errors) set wrong.
This may seem obvious, but one of the most powerful analytical tools is often overlooked. Timeline analysis reconstructs the sequence of events in order to focus analysis, raise questions, generate insight, and aid in organizing information discovered during the response to the incident. Responders should start building their own reconstructed event timeline or sequence of events, starting from well before the last known good system state, through any precursor or indicator events, and up to and including each new event that occurs. The timeline is different from the response team's log—the log chronicles actions and decisions taken by the response team, directions they've received from management, and key coordination the team has had with external parties.
Timeline correlation and analysis depend upon having a common time reference for all of the data sources being used. Most architectures and infrastructures can do this by making use of a stable, reliable network time service provider and ensuring that all systems, servers, endpoints, and communications equipment synchronize with the same time service. Note that in October 2020, the IETF published RFC 8915, Network Time Security for the Network Time Protocol, which significantly improves the security of time services to all devices on the Internet (and on isolated LAN segments with their own network time servers).
Some IDS, IPS, or SIEM product systems may contain timeline analysis tools that your teams can use. Digital forensic workbenches usually have excellent timeline analysis capabilities. Even a simple spreadsheet file can be used to record the sequence of events as it reveals itself to the responders, and as they deduce or infer other events that might have happened.
This last is a powerful component of timeline analysis. Timeline analysis should focus you on asking, “How did event A cause event B?” Just asking the question may lead you to infer some other event that event A actually caused, with this heretofore undiscovered event being the actual or proximate cause of event B. Making these educated guesses, and making note of them in your timeline analysis, is a critical part of trying to figure out what happened.
And without figuring out what happened, your search for all of the elements that might have caused the incident to occur in the first place will be limited to lucky guesswork.
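Even the simple spreadsheet approach mentioned earlier can be scripted. The sketch below (the sources, timestamp formats, offsets, and events are hypothetical) normalizes timestamps from several log sources to UTC and merges them into one ordered timeline, which is exactly the raw material that the “How did event A cause event B?” questioning needs.

```python
# Sketch: merge events from several sources into a single, UTC-ordered timeline.
# Source names, timestamp formats, offsets, and events are hypothetical; the key step
# is normalizing everything to one common time reference before sorting and analysis.
from datetime import datetime, timedelta, timezone

RAW_EVENTS = [
    # (source,       local timestamp,       UTC offset in hours, description)
    ("firewall",     "2024-05-01 10:02:13",  0, "Outbound connection to 203.0.113.66 allowed"),
    ("mail-server",  "2024-05-01 05:58:40", -4, "User k.lee opened attachment invoice.xlsm"),
    ("endpoint-07",  "2024-05-01 10:01:55",  0, "New scheduled task created by unknown process"),
]

def to_utc(local: str, offset_hours: int) -> datetime:
    """Convert a naive local timestamp plus its UTC offset into an aware UTC time."""
    naive = datetime.strptime(local, "%Y-%m-%d %H:%M:%S")
    return (naive - timedelta(hours=offset_hours)).replace(tzinfo=timezone.utc)

timeline = sorted(
    (to_utc(ts, offset), source, description)
    for source, ts, offset, description in RAW_EVENTS
)

for when, source, description in timeline:
    print(f"{when.isoformat()}  [{source:<11}] {description}")
```

In this invented example, the merged ordering immediately suggests a hypothesis to test: the attachment was opened only minutes before the suspicious scheduled task and the outbound connection appeared.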
Now that the incident response team has determined that an incident probably already occurred or is ongoing, the team must notify managers and leaders in the organization. Each organization should specify how this notification is to be done and who the team contacts to deliver the bad news. In some organizations, this may direct that some types of incidents need immediate notification to all users on the affected systems; other circumstances may dictate that only key departmental or functional managers be advised. In any event, these notification procedures should specify how and when to inform senior leadership and management. (It's a sign of inadequate planning and preparation if the incident responders have to ask, “Who should we call?” in the heat of battle.)
Notification also includes getting local authorities, such as fire or rescue services, or law enforcement agencies, involved in the real-time response to the incident. This should always be coordinated with senior leadership and management, even if the team phones them immediately after following the company's process for calling the fire department.
Senior leadership and management may also have notification and reporting responsibilities of their own, which may include very short time frames in which notification must be given to regulatory authorities, or even the public. The incident response team should not have to do this kind of reporting, but it does owe its own leadership and management the information they will need to meet these obligations.
As incident containment, eradication, and recovery continue, the CSIRT will have continuing notification responsibilities. Management may ask for their assistance or direct them to reach out directly via webpage updates, updated voice prompt menus on the IT Help Desk contact line, emails, or phone calls to various internal and external stakeholders. Separate voice contact lines may also need to be used to help coordinate activities and keep everyone informed.
There are several ways to prioritize the team's efforts in responding to an incident. These consider the potential for impact to the organization and its business objectives; whether the confidentiality, integrity, or availability of information resources will be impacted; and how feasible recovery from the incident will be if it continues. Let's take a closer look at these:
Taken together, these factors help the incident response team advise senior leadership and management on how to deal with the incident. It's worth stressing, again, that senior leadership and management need to make this prioritization decision; the SSCPs on the incident response team must advise their leaders by means of the best, most complete, and most current assessment of the incident and its impacts that they can develop. That advice also should address options for containment and eradication of the incident and its effects on the organization.
These two goals are the next major task areas that the CSIRT needs to take on and accomplish. As you can imagine, the nature of the specific incident or attack in question all but defines the containment and eradication tactics, techniques, and procedures you'll need to bring to bear to keep the mess from spreading and to clean up the mess itself.
More formally, containment is the process of identifying the affected or infected systems elements, whether hardware, software, communications systems, or data, and isolating them from the rest of your systems to prevent the disruption-causing agent and the disruption it is causing from affecting the rest of your systems or other systems external to your own. Pay careful attention to the need to not only isolate the causal agent, be that malware or an unauthorized user ID with superuser privileges, but also keep the damage from spreading to other systems. As an example, consider a denial of service (DoS) attack that's started on your systems at one local branch office and its subnets and is using malware payloads to spread itself throughout your systems. You may be able to filter any outbound traffic from that system to keep the malware itself from spreading, but until you've thoroughly cleansed all hosts within that local set of subnets, each of them could be suborned into launching DoS attacks on other hosts inside your system or out on the Internet.
Some typical containment tactics might include:
A familiar term should come to mind as you read this list: quarantine. In general, that's what containment is all about. Suspect elements of your system are quarantined off from the rest of the system, which certainly can prevent damage from spreading. It also can isolate a suspected causal agent, allowing you a somewhat safer environment in which to examine it, perhaps even identify it, and track down all of its pieces and parts. As a result, containment and eradication often blur into each other as interrelated tasks rather than remain as distinctly different phases of activity.
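As one small, hedged illustration of the containment and quarantine idea, the sketch below walks outbound flow records from the affected branch-office subnet and flags every internal host that has contacted a suspect destination, so that each can be quarantined rather than assuming that filtering at the border alone has stopped the spread. The subnet, addresses, and flow format are made up; a real response would pull this data from your flow collectors or firewall logs.

```python
# Sketch: flag hosts in the affected subnet that need quarantine, based on outbound
# flow records. The subnet, addresses, and flow format are invented; a real response
# would pull this data from flow collectors, firewall logs, or switch telemetry.
import ipaddress

AFFECTED_SUBNET = ipaddress.ip_network("10.12.0.0/24")         # hypothetical branch subnet
SUSPECT_DESTINATIONS = {ipaddress.ip_address("203.0.113.66")}  # suspected DoS/C2 target

FLOWS = [
    # (source address, destination address, bytes sent)
    ("10.12.0.15", "203.0.113.66", 48_000_000),
    ("10.12.0.22", "198.51.100.7",      2_400),
    ("10.12.0.31", "203.0.113.66", 51_200_000),
]

def hosts_to_quarantine(flows):
    """Return internal hosts in the affected subnet that talked to suspect destinations."""
    flagged = set()
    for src, dst, _size in flows:
        src_ip, dst_ip = ipaddress.ip_address(src), ipaddress.ip_address(dst)
        if src_ip in AFFECTED_SUBNET and dst_ip in SUSPECT_DESTINATIONS:
            flagged.add(str(src_ip))
    return sorted(flagged)

for host in hosts_to_quarantine(FLOWS):
    print(f"Quarantine candidate: {host}")  # e.g., move to an isolation VLAN, disable the port
```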
This gives us another term worthy of a definition: a causal agent is a software process, data object, hardware element, human-performed procedure, or any combination of those that perform the actions on the targeted systems that constitute the incident, attack, or disruption. Malware payloads, their control and parameter files, and their carriers are examples of causal agents. Bogus user IDs, hardware sniffer devices, or systems on your network that have already been suborned by an attacker are examples of causal agents. As you might suspect, the more sophisticated APT kill chains may use multiple methods to get into your systems and in doing so leave multiple bits of stuff behind to help them achieve their objectives each time they come on in.
Eradication is the process of finding and removing every instance of the causal agent, and its associated files, executables, and other components, from all elements of your system. For example, a malware infection would require you to thoroughly scrub every CPU's memory, as well as all file storage systems (local and in the clouds), to ensure you'd found and removed all copies of the malware and any associated files, data, or code fragments. You'd also have to do this for all backup media for all of those systems in order to ensure you'd looked everywhere, removed the malware and its components, and clobbered or zeroized the space they were occupying in whatever storage media you found them on. Depending on the nature of the causal agent, the incident, and the storage technologies involved, you may need to do a full low-level reformat of the media and completely initialize its directory structures to ensure that eradication has been successfully completed.
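One narrow, hedged example of verifying eradication: sweep a set of directories for files whose SHA-256 hashes match known components of the causal agent. The hash value and search roots below are placeholders; finding zero matches is necessary, but never sufficient, evidence that eradication is complete.

```python
# Sketch: sweep storage for residual copies of known-bad files after eradication work.
# The hash value and search roots are placeholders; a real sweep must also cover backup
# media, cloud storage, and memory images, typically with vendor or forensic tooling.
import hashlib
from pathlib import Path

KNOWN_BAD_SHA256 = {"0123456789abcdef" * 4}       # placeholder, not a real malware hash
SEARCH_ROOTS = [Path("/home"), Path("/var/tmp")]  # hypothetical search scope

def residual_matches():
    """Yield paths of files whose hashes match known components of the causal agent."""
    for root in SEARCH_ROOTS:
        for path in root.rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.sha256()
            try:
                with path.open("rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        digest.update(chunk)
            except OSError:
                continue  # unreadable files are skipped in this sketch
            if digest.hexdigest() in KNOWN_BAD_SHA256:
                yield path

hits = list(residual_matches())
for path in hits:
    print(f"Residual causal-agent file still present: {path}")
if not hits:
    print("No known-bad hashes found in the searched scope (necessary, not sufficient).")
```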
Eradication should result in a formal declaration that the system, a segment or subsystem, or a particular host, server, or communications device has been inspected and verified to be free from any remnants of the causal agent. This declaration is the signal that recovery of that element or subsystem can begin.
It's beyond the scope of the SSCP exam to get into the many different techniques your incident response team may need to use as part of containment and eradication—quite frankly, there are just far too many potential causal agents out there in the wild, and more are being created daily. It's important to have a working sense of how detection and identification provided you the starting point for your containment, and then your eradication, of the threat.
During all stages of an incident, responders need to be gathering information about the status, state, and health of all systems, particularly those affected by the attack. They need to be correlating event log files from many different elements of their IT infrastructure, while at the same time constructing their own timeline of the event. Incident response teams are expected to figure out what happened, take steps to keep the damage from spreading, remove the cause(s) of the incident, and restore systems to normal use as quickly as they can.
There's a real danger that the incident response team can spread itself too thin if the same group of people is containing and eradicating the threat while at the same time trying to gather evidence, preserve it, and examine it for possible clues. Management and leadership need to be aware of this conflict. They are the ones who can allocate more resources, whether during preparation and planning, during incident response, or both, to provide a digital forensics capability.
As in all things, a balance needs to be struck, and response team leaders need to be sensitive to these different needs as they develop and maintain their team's battle rhythm in working through the incident.
From the first moment that the responders believe an incident has occurred or is ongoing, the team needs to focus its attention on the various monitoring tools that are already in place watching over the organization's IT infrastructure. The incident itself may be starting to disrupt the normal state of the infrastructure and systems; containment and eradication responses will no doubt further disrupt operations. All of that aside, a new monitoring priority and question now needs to occupy center stage for the response team: are their chosen containment, eradication, and (later on) restoration efforts working properly?
The team should be actively predicting the most likely outcomes of each step they are about to take before they take it. This look-ahead should also suggest additional alarm conditions or signs of trouble that might indicate that the chosen step is not working correctly or is in fact adding to the impact the incident is causing. Training and experience with each tool and tactic are vital, as they give the team the depth of specialist knowledge to draw on as they assess the situation, choose among possible actions, and then perform the chosen action as part of their overall response.
The incident response team is, first and foremost, supposed to be managing its responses to the incident. Without well-informed predictions of the results of a selected action, the team is not managing the incident; it isn't even experimenting, which is how we test such predictions as part of confirming our logic and reasoning. Without informed guesswork and thoughtful consideration of alternatives, the team is being out-thought by its adversaries; the attackers are still managing and directing the incident, and the defenders are trapped into reacting as the attackers call the shots.
Recovery is the process by which the organization's IT infrastructure, applications, data, and workflows are reestablished and declared operational. In an ideal world, recovery starts when the eradication phase is complete, and the hardware, networks, and other systems elements are declared safe to restore to their required normal state. The ideal recovery process brings all elements of the system back to the moment in time just before the incident started to inflict damage or disruption to your systems. When recovery is complete, end users should be able to log back in and start working again, just as if they'd last logged off at the end of a normal set of work-related tasks.
It's important to stress that every step of a recovery process must be validated as correctly performed and complete. This may need nothing more than using some simple tools to check status, state, and health information, or using preselected test suites of software and procedures to determine whether the system or element in question is behaving as it should be. It's also worth noting that the more complex a system is, the more it may need to have a specific order in which subsystems, elements, and servers are reinitialized as part of an overall recovery and restart process.
With that in mind, let's look at this step by step, in general terms:
Eradication complete. Ideally, this is a formal declaration by the CSIRT that the systems elements in question have been verified to be free of any instances of the causal agent (malware, illicit user IDs, corrupted or falsified data, etc.).
Restore from bare metal to working OS. Servers, hosts, endpoints, and many network devices should be reset to a known good set of initial software, firmware, and control parameters. In many cases, the IT department has made standard image sets that they use to do a full initial load of new hardware of the same type. This should include setting up systems or device administrator identities, passwords, or other access control parameters. At the end of this task, the device meets your organization's security and operational policy requirements and can now have applications, data, and end users restored to it.
Ensure all OS updates and patches are installed correctly. Apply any that have been released for the versions of software installed by your distribution kits or pristine system image copies.
Restore applications as well as links to applications platforms and servers on your network. Many endpoint devices in your systems will need locally installed applications, such as email clients, productivity tools, or even multifactor access control tools, as part of normal operations. These will need to be reinstalled from pristine distribution kits if they were not in the standard image used to reload the OS. This set of steps also includes reloading the connections to servers, services, and applications platforms on your organization's networks (including extranets). This step should also verify that all updates and patches to applications have been installed correctly.
Restore access to resources shared via federated access controls and to resources beyond your security perimeter, out on the Internet. This step may require coordination with those external resource operators, particularly if your containment activities had to temporarily disable such access.
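The ordered steps above lend themselves to a scripted checklist in which each step must be validated before the next begins. Here's a minimal sketch of that idea; the step names and validation stubs are illustrative, and real checks would call your own test suites and status tools.

```python
# Minimal sketch: perform recovery steps in a fixed order, validating each
# before moving on. The validation functions are placeholders.
RECOVERY_STEPS = [
    ("Confirm eradication has been formally declared", lambda: True),
    ("Reimage OS from known good baseline",             lambda: True),
    ("Apply OS updates and patches",                    lambda: True),
    ("Reinstall applications and restore server links", lambda: True),
    ("Re-enable federated and external access",         lambda: True),
]

def run_recovery(steps):
    for name, validate in steps:
        print(f"Performing: {name}")
        if not validate():
            # Later steps depend on this one, so stop and investigate.
            raise RuntimeError(f"Validation failed at step: {name}")
        print(f"  Validated: {name}")
    print("Recovery sequence complete; ready for data restoration.")

if __name__ == "__main__":
    run_recovery(RECOVERY_STEPS)
```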
At this point, the systems and infrastructure are ready for normal operations. Aren't they?
Remember that the IT systems and the information architecture exist because the organization's business logic needs to gather, create, make use of, and produce information to support decisions and action. Restoring the data plane of the total IT architecture is the next step that must be taken before declaring the system ready for business again.
In most cases, incident recovery will include restoring databases and storage systems content to the last known good configuration. This requires, of course, that the organization has a routine process in place for making backups of all of its operational data. Those backups might be
Restoring all databases and file systems to their “ready for business as usual” state may take the combined efforts of the incident response team, database administrators, application support programmers, and others in the IT department. Key end users may also need to be part of this process, particularly as they are probably best suited to verifying that the systems and the data are all back to normal.
For example, a small wholesale distributor might use a backup strategy that makes a full copy of its databases once per week and a differential backup at the end of every business day. Individual transactions (customer orders, payments to vendors, inventory changes, and so on) would be captured in the transaction logs kept for specific applications or by end users. In the event that the firm's database has been corrupted by an attacker (or by a serious systems malfunction), it would need to restore the last complete backup copy and then apply the most recent differential backup, since each differential captures all changes made since that full backup was taken. Finally, the firm would have to replay the transactions recorded after that differential was made, either using built-in application functions that recover transactions from saved log files or by hand.
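As a minimal sketch of that restore sequence in Python, the function below reloads the last full backup, applies only the most recent differential, and then replays transactions recorded after that differential was taken. The restore and replay functions are placeholders for whatever tooling your database and applications actually provide.

```python
# Minimal sketch of the distributor's restore sequence: full backup, then the
# newest differential, then replay of later transactions. All hooks are stubs.
from datetime import datetime

def restore_last_known_good(full_backup, differentials, transaction_log):
    restore_full(full_backup)

    # A differential holds everything changed since the last full backup,
    # so only the most recent one needs to be applied.
    latest_diff = max(differentials, key=lambda d: d["taken_at"])
    apply_differential(latest_diff)

    # Replay only transactions recorded after that differential was taken.
    for txn in transaction_log:
        if txn["timestamp"] > latest_diff["taken_at"]:
            replay_transaction(txn)

def restore_full(backup):
    print("Restoring full backup taken", backup["taken_at"])

def apply_differential(diff):
    print("Applying differential taken", diff["taken_at"])

def replay_transaction(txn):
    print("Replaying transaction", txn["id"])

if __name__ == "__main__":
    restore_last_known_good(
        full_backup={"taken_at": datetime(2023, 4, 2)},
        differentials=[{"taken_at": datetime(2023, 4, 5, 18, 0)},
                       {"taken_at": datetime(2023, 4, 6, 18, 0)}],
        transaction_log=[{"id": 101, "timestamp": datetime(2023, 4, 6, 14, 30)},
                         {"id": 102, "timestamp": datetime(2023, 4, 7, 9, 15)}],
    )
```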
Now, that distributor is ready to start working on new transactions, reflecting new business. Its CSIRT's response to the incident is over, and it moves on to the post-incident activities we'll look at in just a moment.
One of the last tasks that the incident response team has is to ensure that end users, functional managers, and senior leaders and managers in the organization know that the recovery operations are now complete. This notice serves several important purposes:
At this point, the incident response team's real-time sense of urgency can relax; they've met the challenges of this latest information security incident to confront their organization. Now it's time to take a deep breath, relax, and capture their lessons learned.
Before you as team chief send your responder crews home for some rest, you need to get them to look at their notes and the team log, and make some quick memory-jogging notes about anything that happened that's not immediately obvious in those logs. Then (perhaps the next morning), the team should walk through a formal debrief process, using their logs and their event timeline as a framework. This debrief needs to capture, as completely as possible, the immediate memory of the experiences the team has just shared.
The process of appreciative inquiry can be a great help in such a team debrief. Appreciative inquiry starts from the assumption that what happened was good and useful, even if it didn't quite fit what was needed; this can lead the team to a blame-free examination of why or how the chosen procedures didn't suit the situation as best as they could have. Appreciative inquiry sets the stage for learning from experience by valuing that experience and, in doing so, reassuring those on the team that they played valued roles in the incident recovery process.
Good questions can and should be used to drive this debriefing process:
Notice those first three questions: they are looking for the good news, the insights regarding where your preparation, training, analysis, and other efforts paid off. Celebrate these. The incident responders as a team, and the organization as a whole, need the positive emotional reinforcement that they got this one right, even if the response still left many other painful problems unresolved. The organization as a whole also needs to hear that the investment in resources, management attention, and everybody's time and effort has paid off, even if it didn't keep everything from being impacted by the attack.
This debriefing process may take several iterations as the team discovers that they need to learn more from the data collected from the systems during the incident and their response actions. They may also need to consult with others, such as system developers, key end users, or other partners, to more fully appreciate just what did happen and how well the team and the organization responded to it.
The debriefing process will no doubt surface a number of actions, suggestions, and areas for further exploration and analysis. All of these need to be captured in a manageable form, which the team leader, IT director, chief information security officer, or others in leadership and management can use to manage and direct the learning process that's been started by the debrief. In general, you'll see several broad types or categories of action items flowing out from the start of this “lessons learned” process:
The question is often asked: did we really learn lessons from such an experience, or did we just write them down and file them away for later? That set of action item categories bears a striking resemblance to how software, systems, or product developers manage successive builds or versions of their own products. They plan what should be in each of the next several releases or versions; they task members of their teams to develop, test, and validate those incremental changes; and then the team integrates them into the next release.
Make the observations you and your team wrote down more than just observations: prioritize them, plan and schedule their resolution, and assign resources and people to update systems, controls, procedures, and training as required, so that the learning from those lessons is reflected in your new and improved ways of doing incident response.
Almost every task that security professionals perform for their organizations benefits from being captured, codified, and stabilized with a workflow. Workflows can be as simple as a checklist or as complex and powerful as a set of scripts and tools that collaborate with the organization's own in-house security and operations knowledge management systems. Security systems and services vendors can provide security orchestration, automation, and response (SOAR) platforms. These platforms or services do the following:
SOAR represents an evolutionary step in the growth and maturation of security capabilities. As a member of a security team in an organization, you'll find that SOAR capabilities provide a potentially valuable time and task management system, with which you can plan, schedule, and track routine tasks without losing sight of them when urgent and compelling demands for your time and talents arise. They can give you the opportunity to reflect on how you work and learn to work more effectively. As a team member, they can greatly improve collaboration and communication within the team and with others in the organization. From the organization's perspective, SOAR systems can help leadership understand whether the organization is getting better at managing and mitigating information risks or is actually getting worse at it.
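As a minimal sketch of what such a codified workflow can look like, here's a toy playbook expressed as data that a script (or a SOAR platform) could track and partially automate. The step names, owners, and automation flags are illustrative placeholders, not any vendor's actual playbook format.

```python
# Minimal sketch: an incident-response playbook as data, the kind of checklist
# a SOAR platform or a plain script can track. Contents are illustrative only.
PHISHING_PLAYBOOK = [
    {"step": "Capture the reported message and its headers", "owner": "analyst", "automated": False},
    {"step": "Check embedded URLs and attachments against threat intel", "owner": "platform", "automated": True},
    {"step": "Search the mail store for other recipients", "owner": "platform", "automated": True},
    {"step": "Quarantine matching messages", "owner": "platform", "automated": True},
    {"step": "Notify affected users and reset exposed credentials", "owner": "analyst", "automated": False},
    {"step": "Record timeline entries and close the ticket", "owner": "analyst", "automated": False},
]

def run_playbook(playbook):
    for number, task in enumerate(playbook, start=1):
        mode = "auto" if task["automated"] else "manual"
        print(f"{number}. ({mode}, {task['owner']}) {task['step']}")

if __name__ == "__main__":
    run_playbook(PHISHING_PLAYBOOK)
```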
The incident responders may be done at this point, but other investigations may still be ongoing. Criminal or civil proceedings may mean that digital discovery motions have been served on the organization, or it's anticipated that they'll be served very soon. Ongoing internal investigations may be examining suspicious or careless behavior on the part of one or more employees, which could lead to disciplinary actions or even dismissal for cause. Most employers will not take such actions unless they are reasonably certain that they've got the evidence to back up such accusations, should the employee seek redress via a labor relations tribunal or the courts. In addition, the nature of the incident may bring with it still more regulatory or legal burdens that require the organization to thoroughly document exactly what happened; what information was compromised, disclosed, or corrupted; and whether any business decisions and actions were taken unadvisedly based on such loss or impact to decision support data.
In almost any jurisdiction, there are many different and sometimes conflicting rules, regulations, laws, and expectations regarding how long information pertaining to such an incident must be retained. There are even laws and regulations that set maximum retention periods, and companies and individuals can cause themselves more legal troubles if they don't dispose of information when required to do so. When any aspect of an incident becomes a matter for the courts to consider, these retention timelines can change yet again.
As an SSCP, your role in the midst of all of this may be as simple as ensuring that somebody in the organization produces a records and information retention schedule and that this schedule states how long data collected during an information security incident and response activity must be retained.
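Even that simple role benefits from treating the retention schedule as data rather than prose. Here's a minimal sketch; the record types and retention periods are placeholders only, since the real numbers must come from your legal and compliance advisors.

```python
# Minimal sketch: a records-retention schedule as data, so that incident
# artifacts can be checked against it. Periods shown are placeholders.
from datetime import date, timedelta

RETENTION_SCHEDULE = {
    "incident_timeline": timedelta(days=3 * 365),
    "system_event_logs": timedelta(days=365),
    "forensic_images":   timedelta(days=7 * 365),
}

def disposition_date(record_type, created_on):
    """Earliest date on which the record may (or must) be disposed of."""
    return created_on + RETENTION_SCHEDULE[record_type]

if __name__ == "__main__":
    print(disposition_date("system_event_logs", date(2023, 4, 1)))
```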
You'll also need to be aware that storage and retention of evidence requires more stringent controls than the storage and retention of other forms of business records, including data gathered or produced during an incident response. Any of that information that has been deemed evidence in a legal proceeding of any kind will probably require a separate storage and accountability process. Most digital evidence is a copy of the original: the contents of a system's RAM while it was executing malware have to be read out and written onto some kind of system image media, and that image copy is what must be kept free from harm and under positive accountability. The chain of custody is the sequence of each step taken to originally gather the evidence, record or copy it, put it into storage, and then control and keep account of persons or processes who accessed that evidence; it further has to account for anything that was done to the evidence. Gaps in this chain of custody suggest that someone had the opportunity to tamper with the evidence, at which point the evidence is worthless.
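A minimal sketch of that accountability idea: hash the evidence image once when it's collected, then log every access and confirm the hash still matches the baseline. Field names and storage details are illustrative; real custody records also need signatures, witness entries, and physically or logically secured storage.

```python
# Minimal sketch: a chain-of-custody log for one evidence image, verifying the
# image's SHA-256 hash at every recorded access. Details are illustrative.
import hashlib
from datetime import datetime, timezone

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

class CustodyLog:
    def __init__(self, evidence_path):
        self.evidence_path = evidence_path
        self.baseline_hash = sha256_of(evidence_path)  # taken at collection
        self.entries = []

    def record_access(self, person, action):
        """Append one custody entry, confirming the evidence is unchanged."""
        entry = {
            "when": datetime.now(timezone.utc).isoformat(),
            "who": person,
            "action": action,
            "hash_matches_baseline": sha256_of(self.evidence_path) == self.baseline_hash,
        }
        self.entries.append(entry)
        return entry
```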
You probably won't encounter questions on the SSCP exam as to the details of records retention, evidence protection and its chain of custody, and the many different laws, regulations, and standards that apply to all of this. You may very well encounter these topics on the job, and the more you know about the nature of these requirements, the better you'll be able to serve your organization's overall information security needs.
No matter where in the world you practice your security profession, you know how much you depend upon the experiences of those who've gone through these types of incidents before. Pay off that debt, organizationally and individually. Make sure your organization feeds information and incident reports back into your national or regional vulnerability database and clearing house organizations, as well as back to your systems vendors and service providers. Individually, too, reach out to the greater information security community, and share lessons you and your organization are learning as you work to improve and support its security posture. (Do be sure to get your organization's permission to speak, write, or share in public before you start doing so!)
It's good practice to be an established, respected, and trusted member of your local area information security communities of practice, as well as of larger communities. Once you're into the post-event phase, it's a good time to share information about the incident, your responses to it, and the residual damage or actions, if any, that you're facing. (Such sharing must of course be tempered by your organization's information security classification guidelines!) Those communities—much like your fellow (ISC)2 members—are there to help each other learn from experiences such as you and your team have just been through. Share the wealth, as well as the pain, of that learning with them.
From preparation through response and to post-response wrap-up, organizations need to invest in, create, and maintain their capabilities to respond to information systems security incidents. It's a vital part of getting back into business and may be the difference between being in business after the incident or allowing the incident to put you out of business completely. Prompt detection, identification, and characterization of an incident are the first major steps; these inform the incident response team, who (after notifying their senior leadership and management) begin the tasks of containment and eradication of the damage-causing agent, malware, or illicit identities. Once those are thoroughly eradicated, the team begins the process of restoring systems and data, finally notifying managers and users that all's well, and the systems are back online and ready for business as usual.
But it's not just our systems and our end users and business needs anymore, is it? Increasingly, our businesses and organizations become parts of larger digital communities via extranets, federated access, and other collaboration arrangements. Thus, our response to information security incidents takes on both greater urgency and a greater burden of coordination and cooperation. We may have done everything right in our own systems, and yet still, our systems were struck down, perhaps by a zero day exploit, and corrupted; that's our loss. If we then fail to promptly notify our federated partners, or organizations who share their information resources with us via their own extranets, we can be liable for damages they suffer as well.
Being part of an incident response team is perhaps the closest we in the IT world can come to being part of a hospital emergency room's urgent care team. The alarms start going off; systems start behaving abnormally or crash completely. Normal work starts to slow down or halt completely, either directly because of the incident or because of the containment, eradication, and recovery efforts your team is taking. Senior leaders and managers need to know, now, what's going on, and what your best prognosis is as to the possible damage, the extent of the downtime, and what else it might take to get things back to normal. It's demanding and challenging, and it can be quite stressful; it also demands broad and deep specialist knowledge and experience from the SSCPs and others who work on that team.
Explain how the incident response team and process support digital forensics investigations. Digital forensics investigations are conducted to gather and assess digital evidence in support of answering legal, regulatory, or contractual questions of guilt, fault, liability, or innocence. Most such questions require that evidence gathered and used to answer such questions be subject to chain of custody standards, which dictate how access to the evidence is controlled and accounted for. In an information security incident, much of the same digital information that the incident response team needs to analyze and understand so that they can appropriately identify, contain, and eradicate whatever caused the incident may also end up being needed as evidence by forensics examiners. The procedures used by the incident responders should try to respect the needs of potential follow-on forensics investigations, wherever possible, so that problem-solving information still meets chain of custody and other evidentiary standards for use in courts of law. This balance can be difficult to maintain in the immediacy of responding to an incident. Proper preparation can reduce the chance that key information will become unusable as evidence.
Understand the relationship between incident response, business continuity, and disaster recovery planning. A disaster is an incident that causes major damage to property and business information, and that quite possibly injures or kills people. A disaster may be one very extensive incident or a whole series of smaller events, which, taken together, constitute an existence-threatening stress to the organization. The extensiveness of this damage can be such that the organization cannot recover quickly, if at all, or that such recovery will take significant reinvestment into systems, facilities, relationships with other organizations, and people. Disaster recovery plans are ways of preparing to cope with such significant levels of disruption. Business continuity, by contrast, is the general term for plans that address how to continue to operate in as normal a fashion as possible despite the occurrence of one or more disruptions. Such plans can address alternative processing capabilities and locations, partnering arrangements, and financial arrangements necessary to keep the payroll flowing while operational income is disrupted. Business continuity can be interrupted by one incident or a series of them. Incident response narrows the focus down to a single incident and provides detailed and systematic instruction as to how to detect, characterize, and respond to an incident to contain or minimize damage; such response plans then outline how to restore systems and processes so that business operations can resume as normal.
Describe some of the key elements of incident response preparation. Preparation usually starts with those possible incidents identified by the risk management process, and documented in the business impact analysis (BIA) as being of highest priority or concern to senior leadership and management. These are used to identify a key set of information resources, tools, systems, skills, and talent needed to respond effectively. The incident response team's structure, roles, and responsibilities should be defined, and the team established, whether as an on-call resource, an ongoing security operations watch team, or some other structure best suited to the organization's business logic and security needs. The team should then ensure that system profiles and other information be routinely gathered and updated so that the team understands the normal behavior of the IT systems and infrastructure when servicing routine business loads and demands. Testing and validation of backup and restore capabilities, and team exercises, should also be part of becoming and staying well prepared for information security incidents when (not if) they occur.
Explain the challenges of precursors and indicators in incident detection. An incident is a series of one or more events, the cumulative effect of which is a potential or real violation of the information security needs of the organization. As an event occurs, it makes something change—it changes the contents of a storage system or location, triggers another event or blocks a preplanned trigger, etc. These outcomes of an event may be either precursors or indicators. Precursors are signals that a security event may happen some indeterminate time later but that such an event is not happening right now. Indicators signal that a security event is taking place now. The problem is one of sheer volume; even a small SOHO system might see hundreds of thousands of events each working day, some of which might be legitimate precursors or indicators. Intrusion detection systems, firewalls, and access control systems generate many more signals, but by themselves, these systems cannot usually determine whether the event in question was legitimate and authorized or might be part of a security incident. Filters and logical controls can limit these false positive alarms, but if set too high, alarms that should demand additional investigation are never reported to security analysts. This sense of false negative (the absence of alarms) may not reflect reality. Conversely, set the filters too low, and your analysts can spend far too much time on fruitless investigation of the false positives.
Explain why containment and eradication often overlap as activities. As part of incident response, containment needs to keep the damage-causing agent, activity, process, or data from spreading to other elements of the system and causing further damage. Containment should also prevent this agent (malware, for example) from leaving your systems and getting back out onto the Internet where it could be part of an attack on another organization's systems. Many containment techniques, such as antimalware quarantine operations, logically or physically move the suspected malware to separate storage areas that are not accessible by normal user processes. This simultaneously removes them from the infected system and prevents their spread to other systems.
Describe the legal and ethical obligations organizations must address when responding to information security incidents. The first set of such obligations comes under due diligence and due care responsibilities to shareholders, stakeholders, employees, and the larger society. The organization must protect assets placed in its care for its business use. It must also take reasonable and prudent steps to prevent damage to its own assets or systems from spreading to other systems and causing damages to them in the process. Legally and ethically, organizations must keep stakeholders, investors, employees, and society informed when such information security incidents occur; failure to meet such notification burdens can result in fines, criminal prosecution, loss of contracts, or damage to the organization's reputation for reliability and trustworthiness. Such incidents may also raise questions of guilt, culpability, responsibility, and liability, and these may lead to digital forensic investigations. Such investigations usually need information that meets stringent rules of evidence, including a chain of custody that precludes someone from tampering with the evidence.
Describe the key steps in the recovery phase of responding to an information security incident. Once the incident response team is confident that the damage-causing agents have been eradicated from the systems, servers, hosts, and communications and network elements, those systems need to be restored to their normal hardware, software, data, and connectivity states needed for routine business operations. This can involve complete reloads or rebuilds of their operating systems, reinstallation of applications, and restoring of access control and identity management information so that each device's normal subjects (users or processes) can function. The team then can ensure that databases, file systems, and other storage elements have their content fully restored. Data recovery may also need to include re-execution of transactions lost between the time of the last data system backup (complete, incremental, differential, or special) and the impact of the incident itself. At that point, end users can be notified that the system is back up and available for normal use.
Describe the key steps in the post-incident phase of incident response. After the systems have been restored to normal operations, the incident response team in effect stands down from “emergency response” mode, but it's not through with this incident yet. As soon as possible after the incident is over, the team should debrief thoroughly to capture observations and insights that team members made during the incident or as a result of their response to it. An appreciative inquiry process is recommended, as this will encourage more open dialogue and avoid perceptions of finger-pointing. This should generate a list of actions to take that update procedures and risk mitigation controls, and may lead to additional or changed training and education for the team, users, or managers and leaders. Other actions may take considerable investment in resources or time in order to realize improvements in the incident prevention, detection, response, and recovery processes.
Explain the benefits of doing exercises, drills, and testing of incident response plans and procedures. Exercises, drills, and testing of incident response plans and procedures can help the organization in several ways. First, they can verify the technical completeness and correctness of the plans and procedures before attempting to use them in response to an actual incident. Second, they give all those involved in incident response the chance to strengthen their skills and knowledge via practice and evaluation; this supports in-classroom or self-paced training. Third, they can enhance team morale by focusing on creating unity of effort. By instilling a sense of confident competence, the practice effect of such exercises, drills, and testing can prepare the team and the organization to better cope with the stress of real incidents.
Describe the role of monitoring systems during incident response. Monitoring of IT infrastructures is performed by a combination of automated data-generating tools (such as event loggers), data gathering and correlation systems (such as security information and event management, or SIEM, systems, or dashboards of any kind), and the attentive engagement of IT operations and incident response team members to what these systems are attempting to alert them to. Each step of the incident response cycle depends heavily on monitoring, by systems and by people, to notice out-of-tolerance conditions, abnormalities, or anomalies; to understand what more detailed data about such events is suggesting; and to validate that their efforts at containment, eradication, and recovery have been successfully completed. Continued monitoring well after the incident response is over will contribute to the assurance that the incident is safely in the past.
Explain the use of the kill chain concept in information security incident response. Attacks on information systems by advanced persistent threat (APT) actors almost invariably involve sequences of steps to support the many phases of such attacks, such as reconnaissance, entry, establishing command and control, and achieving the outcomes desired by the threat actor. This chain of events, called a kill chain, can be quite complex and take months, or even a year or more, to run through to completion. Many of its steps are low and slow, small-scale intrusions or attacks that are designed to not attract too much attention or set off too many alarms. Systems defenders who can detect and defend against any step in the kill chain may deter or delay other steps in the chain to the point where the attacker gives up and chooses a less well-defended target instead. Thus, the white hat defenders don't need to be successful against major attacks every day but against the low and slow small steps that may be part of such attacks.
Describe the use of logs in responding to information security incidents. Almost every element of modern IT infrastructures, systems, and applications can generate event log files that can record the time-tagged occurrence of events that incident responders may need to learn about. Changes in access control settings, changes in the status or content of an information resource, the loading and execution of tasks or process threads, or the creation of a user ID and elevation of its privileges are but a few of thousands of such log file events responders need to know about. Correlating log files from different system elements can help produce or enrich an incident timeline, which is built by the incident response team as an analysis tool and as a description of what happened step by step as the incident unfolded. To be correlated, log files must share a common time reference; it's therefore important to synchronize all system clocks, preferably to a network time standard. Different logs quite frequently record different kinds of events, to different levels of granularity and accuracy; thus, the team can find it challenging to find all of the observable events, across multiple logs, that are actually signaling a specific event in the incident itself. Correlation is nonetheless an important way to identify cause-and-effect relationships between events that take place during the incident.
Explain why and how the incident response team communicates and engages with organizational management and leadership. The incident response team acts as a single point of contact or focus regarding the response to an ongoing information security incident. It is important that the team not be overwhelmed by calls from every end user, or need to communicate with each of them individually. The team may also need senior organizational leadership and management's authority to call in additional personnel or emergency responders, or to activate other contingency plans. Management and leadership also have the burden to notify regulators, partners, legal authorities, customers, and the public. During the preparation phase, decisions should be made and procedures developed that dictate how the team reaches out to which specific leaders and managers in the organization, to share what kind of information. These procedures should also provide ways for the team to ask leadership for key decisions, as well as seek guidance from them in dealing with the incident if they need to prioritize some efforts over others. This communication can be face to face or by phone, email, or any means available, as specified in procedures.