Remember, when disaster strikes, the time to prepare has passed.
—Steven Cyros
The lights began to flicker in the Landmark Office Building in downtown Lansing, Michigan, about 4:15 p.m. on Thursday, August 14, 2003. The leadership team was wrapping up the biweekly business meeting. When the lights went out, the 15 men and women in the room sat in stunned silence.
Downtown streets quickly filled with people scurrying around, wondering what was happening and how to get home. Getting out of the parking garage became a 30-minute challenge in accident avoidance. Several commuters volunteered to direct traffic on busy street corners.
Cell phones either didn't work or were constantly busy. The real shock came with a radio announcement that the entire northeastern United States, including New York City, was experiencing a blackout. No one knew the cause. Was this another 9/11? Could this be a terrorist attack? What was going on?
As Emergency Management Coordinator for the Michigan Department of Information Technology (DIT), I (Dan Lohrmann) reported to the State Emergency Operations Center (SEOC), which was on generator power. As I walked into the underground facility, I thanked God we had run three emergency exercises in the past five months to prepare for moments like these. On arrival, I learned the full scope of the outage in Michigan and surrounding states. My job was to coordinate actions with other departments and ensure that the DIT provided computer and communication assistance needed during the emergency.
I immediately contacted our DIT emergency coordination center, which was activated at our backup data center location – also running on generator power. Many of our technical staff and emergency contacts were on vacation, but after working through wrong phone numbers and unanswered calls, representatives from every section of the DIT were connected into our phone bridge. This line was buzzing with activity for the next five days.
The SEOC quickly filled with emergency management representatives from all parts of state and local government, the Red Cross, and the energy companies. Governor Jennifer Granholm and her executive staff also were there. Slowly the activity in the room started to build as phones rang, and meetings and informal discussions formed.
An executive update was given by each organization every few hours. The governor walked around the room to hear each report and ask follow-up questions. I was impressed by her focus and hands-on approach to the crisis. During one of the briefings, President Bush called the governor to promise federal support.
Over the next several days, our Public Service Commission (PSC) representative gave regular reports about the power outage's expected length in different areas. Maps on the walls showed which areas were still without power and which were still in a state of emergency. A pattern developed in which power was restored quicker than estimated, but in some cases the power was unstable and failed shortly after it was restored, hampering our computer restoration efforts.
The biggest issue was water. Many organizations, including the National Guard and the Red Cross, helped get water to southwest Michigan. Private companies donated water and others volunteered to truck it in from one part of the state to another. Reports were given by the Department of Community Health on hospital coverage and other health-related issues. The Department of Agriculture was active in resolving food spoilage issues and restaurant food safety.
On Thursday night, the DIT team faced numerous challenges and questions. Reports came from all over the state about whether services were up and running. Some computer servers went down when their uninterruptible power supply (UPS) failed. The Executive Office wanted to update the state web portal with regular messages from the governor and the Public Service Commission, but connectivity was down.
We worked through much of Thursday night to get systems back up and running.
Over the weekend, the DIT was involved with workarounds to get unemployment extension letters out, update benefit card credits (formerly food stamp allocations), and assist in coordination for many other business processes. At one point, the Department of Community Health couldn't get an urgent email to the Centers for Disease Control in Colorado. They thought their emergency center's generator would enable them to keep all computer services going in emergencies like these, only to find that their email server was in a different building without power. Situations like these continued to arise through the following Monday.
Power was expected back in Lansing around 4 a.m. Friday. Should state employees report? Since cooling for state buildings in Lansing was provided by the utilities via chilled water, not air conditioning, would the computer rooms have enough cooling to bring up servers in time? The decision was made to have Lansing employees report, even if computer networks were unavailable. Through the dedicated efforts of employees, most computer services were available by 9 a.m. Friday morning in Lansing.
In most of Detroit, power was unavailable until Saturday morning. Work continued through Monday morning as the DIT went through the same processes in Detroit that were followed in Lansing the previous Friday.
Intelligence reports surfaced after the incident claiming that a computer virus or foreign nation-state hack may have triggered the Northeast blackout of 2003. It was (much later) determined that “overgrown trees” that came into contact with strained high-voltage lines near facilities in Ohio owned by FirstEnergy Corp. were the real cause.1
However, it is true that there was a cyber component to the outage: “a bug in a GE energy management system that resulted in an alarm system failure at FirstEnergy's control room, which kept the company from responding to the outage before it could spread to other utilities.”
Nevertheless, regardless of the cause, the steps to respond when emergencies occur remain the same. An “All Hazards” approach means that whether natural causes (like tornadoes, hurricanes, and floods) or manmade causes (like a cyberattack or arson) lead to a crisis, emergency responders need to work as one team to coordinate actions and recovery.
No doubt, in a cyber emergency the forensic specialists on incident response teams will be busy identifying and (when ready) remediating cyberattacks, but when the lights and power go out, most of the rescue responses remain the same – no matter the initial cause.
So what lessons did we learn? On the positive side, our previous exercises helped us. Having a common incident tracking system at the SEOC and DIT emergency coordination center was priceless. It allowed everyone at the department command center to see all the actions, alerts, logs, and issues available at the SEOC. We could share common event logs and track actions taken across the enterprise.
On the negative side, we learned that behaviors change when real emergencies occur. Most staff went home to check on their families before coming back to work. What if power had been off in all of Michigan? Would everyone have reported as quickly? We were amazed at how an extended loss of power affected so many other areas. What if the outage had been a week or longer? As a result, we updated several parts of our emergency plan activation procedure.
Looking back at the blackout of 2003, I realize again how vulnerable we are to emergency situations. From hurricanes to power outages to terrorist acts, we can prepare for emergency situations, but we can't control events.
For Michigan's DIT, the power outage enabled us to gain a more positive “can-do” reputation with our customers. Our relatively new department had now lived through an emergency with the agencies we serve, which helped build trust. Not only did the blackout strengthen the new relationship with our client agencies, it also showed the tremendous accomplishments we can achieve through teamwork in our own agency.
Despite a hectic workload and an intense atmosphere, I was able to step back a few times to watch how things were running in the SEOC. I was amazed by the calm dedication and lack of panic. I was proud of the response everyone provided, and especially that our prior planning paid off. I'm proud of the excellence and teamwork we showed as a department.
The importance of people, processes, and technology in cyberdefense efforts cannot be overemphasized.
How do these elements come together in daily practice? What does it look like when the phone rings and the person on the other end says, “We have a significant cyber incident that needs to be escalated”? The following is a helpful story from an outstanding cyber leader: Texas CISO Nancy Rainosek.
Early in the morning on August 16, 2019, a ransomware attack on Texas entities began that spread across the entire state and impacted 23 local governments. These were smaller entities with little in-house IT knowledge or support. The Texas Department of Information Resources (DIR), the state agency charged with leading statewide cybersecurity response, was notified at 8:36 a.m. that eight local government entities had suffered a ransomware event. Over the next two hours, 11 more reports came in, and at 10:30 a.m., one of the municipalities reported that its Supervisory Control and Data Acquisition (SCADA) system had been impacted. This system controlled the monitoring and distribution of the entire local community's water supply. Given the number of entities impacted and the very real public health and safety threat, DIR notified the governor's office to discuss issuing a disaster declaration.
Shortly after 11:00 a.m., Governor Greg Abbott issued Texas's first statewide disaster declaration for a cyber event.2 With the declaration, the Cybersecurity Annex to the Texas Emergency Management Plan was put into action. The declaration also activated the Texas Division of Emergency Management's (TDEM) State Operations Center (SOC). By noon that same day, the SOC was fully active on a 24/7 operation with state and federal incident responders. Leveraging the logistical expertise of TDEM, Texas held the first coordination call with all potentially impacted entities at 2:30 p.m.
Over the next two days, Texas incident responders identified, prioritized, and visited all impacted entities across Texas. And by the end of Friday, August 23 – one week after the incident began – all impacted entities had been remediated to the point that state support was no longer required.
Nancy Rainosek emphasizes that Texas's successful response to this unprecedented cyber event resulted in impacted entities being restored quickly with no ransom paid. The state's response cost one-tenth of the ransom demanded by the criminals responsible for this attack. The extensive preparation and cooperation between the responders led to the entities being back online and in the rebuilding phase within one week of the attack.
The following preventive measures also contributed to Texas's preparation for such an event:
The other crucial key to Texas's success in this event was the collaboration and cooperation of state and federal partners. Per the State of Texas Cybersecurity Annex, DIR led the incident response effort. State responders included:
Federal responders included:
The FBI teams worked well with the Texas responders and were quickly integrated with the other responders on this joint effort. They provided clear and timely information and were excellent partners on the forensic side of this mission.
Many private companies offered assistance during the event, but DIR did not have the resources to adequately vet and train these volunteers, given the urgency of the matter. The Texas Legislature proposed a bill to strengthen the state's cybersecurity response through a series of measures, including requiring DIR to establish a volunteer incident response team, with appropriate background checks and incident procedures. This would help ensure that Texas is prepared for the next major incident affecting multiple entities in the state.
“I can tell a lot about a person's cybersecurity background through a simple discussion,” says North Dakota CISO Kevin Ford. In his experience, many cybersecurity conversations involve high-minded ethical considerations around the importance of data. Most people fall into one of three groups.
In the first (largest) group, conversations go something like, “Who are these hackers, and why would they target us? Who do they think they are? They should be arrested!”
According to Kevin, these are generally conversations with new cybersecurity staff, noncybersecurity IT staff, or perhaps laypeople like reporters and legislators. The presupposition of this conversation is that any cybersecurity incident is intolerable.
Kevin continues, “The truth of the matter is that in any sufficiently large period, or in any sufficiently large network, cybersecurity events will happen. The quality and effectiveness of your cybersecurity program only extends the time between events.”
The second group's conversation is more nuanced, centering on the concept of cyber risk. The cyber risk concept, broadly construed, treats cybersecurity as a series of decisions and actions based on evaluating both the likelihood and the impact of cybersecurity events. The concept is powerful because it accepts the inevitability of cybersecurity events and prioritizes preventing and responding to them. The cyber risk conversation generally focuses on the fact that only a finite number of controls and activities are possible with limited resources, and thus certain risk treatments should be prioritized over others.
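To make that prioritization logic concrete, the short sketch below ranks a handful of hypothetical risk treatments by the expected annual loss they avoid per dollar of treatment cost. Every name and figure is invented for illustration, and the model simplifies away factors a real risk assessment would weigh.

```python
# Illustrative only: hypothetical risks, likelihoods, impacts, and costs.
# Assumes (simplistically) that funding a treatment fully avoids the expected loss.
risks = [
    # (name, annual likelihood, impact if it occurs ($), treatment cost ($))
    ("Ransomware on file servers",    0.30, 2_000_000, 150_000),
    ("Phishing-led account takeover", 0.60,   250_000,  40_000),
    ("Unpatched public web server",   0.20,   900_000,  25_000),
]

budget = 100_000

# Rank treatments by expected annual loss avoided per dollar spent.
ranked = sorted(risks, key=lambda r: (r[1] * r[2]) / r[3], reverse=True)

spent = 0
for name, likelihood, impact, cost in ranked:
    if spent + cost <= budget:
        spent += cost
        print(f"Fund:  {name} (expected loss avoided ${likelihood * impact:,.0f})")
    else:
        print(f"Defer: {name}")
```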
Kevin says he most frequently has these cyber risk conversations with the people in this group: executives, finance staff, and other businesspeople who have spending authority or business opportunities tied to information systems.
The final group contains what Kevin defines as “grizzled veterans.” This conversation is generally with experienced cybersecurity pros, and typically happens behind closed doors in CISO support groups. This group agrees that protecting data is important, and they assume that the correct risk-based decisions are made. “With these ground rules out of the way, we move on to discuss security operations,” he said.
Kevin Ford's mindset and approach to cybersecurity were shaped by his impressive background. He helped develop the NIST Cybersecurity Framework. He served as the CISO and Director of Assessments for the private sector firm CyberGRX. Earlier in his career, as a Deloitte employee, he worked on cyber policy development for the U.S. House of Representatives and for the Department of Health and Human Services – Indian Health Service. Kevin shared his perspectives on security operations, the North Dakota Cybersecurity Operations Center (CYOC), information sharing, playbooks, and multi-state operations.
In the business world, the operations team designs and manages the methods of producing a product. When operating as designed, the work associated with producing an organization's product runs smoothly and efficiently. However, due to entropy, evolving and recurring factors sneak into workflows and disrupt production, forcing organizations to redirect work to fix the issues. This unplanned work is deadly to organizations and inhibits their ability to deliver their product. Cyber events are a manifestation of entropy, the chaos that causes organizations to be less efficient and engage in unplanned work.
Kevin's goal as a CISO, from an organizational perspective, is to maximize the value of his team by reducing unplanned work for an organization. In this way security operations build value for the organization and are seen as a force for enablement. A team built on business enablement is an asset to the organization rather than just a necessary expense.
The goal of reducing unplanned work may sound crass in comparison to more high-minded ethical ideals, but the upshot is a more comprehensive and effective cyber risk posture for the organization. A public servant must be a good steward, not just of the people's data but of their resources. As such, the integrity and availability of state services for which citizens pay is every bit as much an ethical consideration as is the confidentiality of the state's data. This applies doubly for critical infrastructure like utilities and healthcare.
North Dakota operates one of the largest public networks in the world. Even on the best days there are cybersecurity events; that is the nature of asymmetric warfare across a large attack surface. As such, the security program operates on a macro basis and resembles a public health organization more than a law enforcement agency. Success for the team looks more like a dip in the frequency and impact of cyber events than it does hackers in handcuffs. To accomplish this, North Dakota's governance, risk, and compliance team and its secure infrastructure team make risk-based decisions to protect the network from cybersecurity events using the resources available. It is then the responsibility of the Cybersecurity Operations Center to reduce the unplanned work caused by the cybersecurity events that aren't prevented.
The North Dakota Cybersecurity Operations Center (CYOC) leverages specifically designed capabilities, skills, and tools to reduce the impact of unplanned work on state operations.4 When a cyber incident does occur, the CYOC's aim is to mitigate its severity and insulate the state from as much unplanned work as possible. Time is the key factor in reducing the severity of incidents. The key performance indicators the CYOC measures are the time it takes to respond to an incident and the time it takes to recover from it. The team considers an incident that is immediately responded to and recovered from a close second to preventing the incident in the first place. Team members note that as response and recovery times approach zero, incident prevention and incident response and recovery become nearly indistinguishable.
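A minimal sketch of how those two indicators might be computed from incident timestamps follows; the record fields and times are hypothetical, not the CYOC's actual schema or data.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"detected":  datetime(2021, 3, 1, 9, 0),
     "responded": datetime(2021, 3, 1, 9, 4),
     "recovered": datetime(2021, 3, 1, 10, 30)},
    {"detected":  datetime(2021, 3, 2, 14, 10),
     "responded": datetime(2021, 3, 2, 14, 11),
     "recovered": datetime(2021, 3, 2, 14, 45)},
]

def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60

# The two KPIs named in the text: time to respond and time to recover.
mean_time_to_respond = mean(minutes(i["responded"] - i["detected"]) for i in incidents)
mean_time_to_recover = mean(minutes(i["recovered"] - i["detected"]) for i in incidents)

print(f"Mean time to respond: {mean_time_to_respond:.1f} minutes")
print(f"Mean time to recover: {mean_time_to_recover:.1f} minutes")
```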
The CYOC relies on Security Orchestration, Automation, and Response (SOAR) technologies to handle incident response at massive scale and to significantly drive down the time to respond to and recover from security incidents. SOAR tools shift the focus away from human-driven incident response and toward integrations that respond to security anomalies automatically.
As adoption of SOAR technologies expands across the North Dakota network, staff are increasingly shifting away from incident response to proactive content creation and development of security tools. Security analysts now focus on creating playbooks for SOAR tools that synthesize understanding of threat kill chains for identified risks, threat intelligence from shared feeds, and lessons learned from previous incidents.
As the CYOC increasingly integrates SOAR within its environment, human constraints on security operations are quickly dissolving and being replaced by constraints around information gathering and playbook development.
The use of automation fundamentally shifted the operating model of the North Dakota CYOC. Before SOAR tools, even when the team could see a cybersecurity issue coming, staff were generally powerless to stop it due to the expansiveness of the attack surface.
As SOAR tools were rolled out across the network, staff gained greater control of the environment, and the reward for correctly identifying risk is greater than before. As such, cyber risk information such as threat intelligence, kill chain analysis, and environmental and vulnerability data is key to successful incident response and recovery. If an analyst can identify a potential risk, the lifecycle of that risk, and the areas of the network it may impact, the analyst can develop playbooks to automatically detect and disrupt the risk if it is actualized within the environment.
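As a hedged illustration of that idea, the sketch below matches events against threat-intelligence indicators and triggers a containment action. The indicator values, event format, and block_ip() helper are invented for the example rather than taken from any particular SOAR product, which would supply its own event pipeline and response integrations.

```python
# Hypothetical playbook sketch: detect traffic to known-bad addresses and disrupt it.
MALICIOUS_IPS = {"198.51.100.23", "203.0.113.77"}  # stand-in for shared threat-feed indicators

def block_ip(ip):
    # Placeholder for a firewall or EDR integration call in a real SOAR platform.
    print(f"[playbook] blocking {ip} at the perimeter")

def run_playbook(events):
    """Check each event against the indicator set and respond automatically."""
    for event in events:
        if event.get("dest_ip") in MALICIOUS_IPS:
            block_ip(event["dest_ip"])
            print(f"[playbook] opened ticket for host {event.get('src_host')}")

# Example run with made-up events (the second destination is benign).
run_playbook([
    {"src_host": "workstation-042", "dest_ip": "198.51.100.23"},
    {"src_host": "workstation-113", "dest_ip": "192.0.2.10"},
])
```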
Within the United States, there are multiple resources that provide valuable cybersecurity information, including Information Sharing and Analysis Centers (ISACs); the Department of Homeland Security's Cybersecurity and Infrastructure Security Agency (CISA), which includes US-CERT; and other formalized information-sharing and response organizations such as state-owned Fusion Centers and Security Operations Centers (SOCs). Internationally, there are also many similar setups (see Free Cyber Incident Resources beginning on Page 193). While these entities are valuable for providing context around events occurring in the larger national ecosystem, much of the information they share is restricted, heavily edited, and qualitative in nature. Because the threat briefs are tailored to the widest possible audience, they often require significant human interaction to make them actionable in automated systems.
Fortunately, there are systems that can be used for more automated threat information sharing. Such systems use standardized formats to share identified threats across a user base. Standardizing the communication format improves the capability to ingest records into SOAR tools with a low degree of human interaction. Because the human cost of running automated systems is far lower than that of traditional methods, the capacity for threat information ingestion, event correlation, and incident disruption and response is far greater. With the human constraints around threat intelligence significantly reduced, the ability to protect the environment becomes a function of the amount of structured threat data to which analysts have access.
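STIX is one widely adopted open standard for this kind of structured sharing (the chapter does not name the specific format North Dakota uses); assuming it for illustration, the sketch below builds a machine-readable indicator with the open-source python-stix2 library.

```python
# Hedged sketch: STIX is one common standard for structured threat sharing,
# not necessarily the format the CYOC uses. Requires `pip install stix2`.
import stix2

indicator = stix2.Indicator(
    name="Known C2 address from a shared feed",
    description="Illustrative indicator only; the IP is a documentation address.",
    pattern="[ipv4-addr:value = '198.51.100.23']",
    pattern_type="stix",
    valid_from="2021-01-01T00:00:00Z",
)

# The serialized object can be published to a sharing server or ingested
# directly by a SOAR tool with little human handling.
print(indicator.serialize(pretty=True))
```

Because every partner emits and consumes the same structured fields, indicators like this one can flow between organizations and into automated playbooks without an analyst rewriting them by hand.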
In addition to automated data sharing, SOAR technology fundamentally shifted the way the North Dakota CYOC operates. Incident response is among the most intensive activities within cybersecurity operations and carries with it a tremendous human cost. SOAR tools reduced this cost by executing predefined actions when certain criteria are met. The predetermined actions and the criteria for action are defined in a security playbook that functions similarly to application code. Because the playbook, like code, can be distilled into an artifact, it can be shared and improved on a collaborative basis much like code within a repository.
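The sketch below shows what such a playbook artifact might look like once distilled to criteria and actions; the schema and field names are invented for illustration, since each SOAR product defines its own playbook format.

```python
# Illustrative sketch of a playbook distilled into a shareable artifact.
import json

playbook = {
    "name": "suspicious-login-lockout",
    "criteria": {"event_type": "failed_login", "threshold": 10},  # when to act
    "actions": ["disable_account", "notify_analyst"],             # predefined responses
}

# Serialize the playbook so it can be committed to a shared repository,
# reviewed like code, and pulled by partner operations centers.
with open("suspicious-login-lockout.json", "w") as f:
    json.dump(playbook, f, indent=2)
```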
As the CYOC increasingly relies on SOAR tools for incident detection and response, the capability of its security efforts relies greatly on the throughput of its playbook development. As such, more so than many other operations centers, the CYOC relies on agile methodology and the DevSecOps concepts of continuous integration and continuous delivery (CI/CD) to outpace and out-focus the adversary. In many regards, the CYOC resembles a development team.
The CYOC's reliance on large amounts of cybersecurity threat intelligence and modular playbook development causes it to function in many regards more like a big data or development shop than a traditional cybersecurity operations center. To increase both access to threat intelligence and exposure to playbook development and sharing, the North Dakota CYOC partners with other states across the United States and hosts collaborative threat information sharing systems as well as playbook repositories in which playbooks for incidents are shared and refined.
The Multi-State SOC operates on multiple levels. At the most basic level, partners have agreed to share threat intelligence data in a specified format using the same system. This ensures that information about incidents occurring in one state is immediately made available to all partners. Playbooks that apply to multiple types of incidents are also shared at this level in the CYOC's Git repositories.
At higher levels, the Multi-State SOC incorporates greater operational partnerships than are available in other similar efforts. For instance, when experiencing a major incident, it is common for a state government to receive help from other states' cyber programs through preestablished agreements. The most prevalent of these agreements is the Emergency Management Assistance Compact (EMAC), which provides, under certain circumstances, a vehicle for multi-state cyber emergency response. However, to activate EMAC, the governor of the state has to declare a state of emergency. The state may then receive offers of assistance from other states, which it must reimburse for their effort. This EMAC process can be extremely expensive. In addition, because there is no guarantee that skill sets are aligned or technologies are complementary, it can also be unwieldy and inefficient.
The Multi-State SOC operates on the principle that it is far less expensive to address acute cybersecurity issues before they become emergencies. Its operating assumption is that frequent multi-partner engagements, built on preestablished operational capabilities, lines of communication, and a firm understanding of every participant's competencies and capabilities before any large collaborative effort begins, are far superior to infrequent engagements with unfamiliar contacts. As such, the Multi-State SOC enlists partners in every step of the program, from development to incident preparedness. When partners have shown that they are receiving the full benefit of the threat information and playbook sharing, they are welcomed into the higher tiers of the operation. In this manner, the automated threat information sharing and playbook distribution function as a filter that reduces the noise within local environments so that human security analysts can focus on the more acute security incidents.