CHAPTER 27
Disaster Recovery and Business Continuity

by Bernard Chapple

Disaster recovery and business continuity may initially appear to be the same thing, but they can be different in the world of security. Disaster recovery concerns the recovery of the technical components of your business, such as computers, software, the network, data, and so on. Business continuity includes disaster recovery, business resumption, and functionality, along with the recovery of the people in your workplace. Business continuity is vital to keeping your business running and to providing some semblance of “business as usual.” Disaster-recovery or business-continuity professionals must ensure the recovery and continuity of all that is affected by an outage or security event. In this chapter, we will analyze the best practices and methodologies for disaster recovery and business continuity.

Disaster Recovery

When you put together a disaster-recovery plan, you have to make sure that you know everything about your company’s information technology infrastructure, applications, and network—you must know the enterprise you are recovering. To become intimately familiar with the enterprise, you must know what business function you are recovering. The general consensus in “old school” disaster recovery is that if you recover the systems, everything else will fall into place, including the people. I submit that you have to be conscious of all aspects of the company, including its business and its technologies.

For example, a particular business unit may claim not to need a certain application or function until day three, but the technology process may dictate that the application should be available on day one, due to technological interdependencies. What is a disaster-recovery professional to do? In this case, it would be incumbent upon the professional to help the business unit understand why it needs to pay for a day-one recovery as opposed to a day-three recovery. The business unit’s budget will typically include a sizeable expense for the information technology (IT) department, and this may cause the business unit to think that any disaster-recovery or business-continuity efforts will be cost prohibitive. In working with the IT gurus, you can sometimes figure out a way to bypass a particular electronic feed or file dependency that may be needed to continue the recovery of your system.
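The day-one versus day-three conflict can be made visible with a small dependency check. The following is a minimal sketch, not a prescribed method: the application names, the requested recovery days, and the `depends_on` mapping are all hypothetical, and it assumes every application named as a prerequisite also appears in the requested list.

```python
from typing import Dict, List


def effective_recovery_days(
    requested: Dict[str, int],
    depends_on: Dict[str, List[str]],
) -> Dict[str, int]:
    """Pull each prerequisite's recovery day forward so that it is never
    later than the recovery day of any application that depends on it."""
    effective = dict(requested)
    changed = True
    while changed:  # repeat until no recovery day moves any earlier
        changed = False
        for app, prerequisites in depends_on.items():
            for prereq in prerequisites:
                if effective[prereq] > effective[app]:
                    effective[prereq] = effective[app]
                    changed = True
    return effective


# Hypothetical example: the business unit asks for the credit feed on day
# three, but a day-one application cannot run without it.
requested = {"loan_origination": 1, "credit_feed": 3, "general_ledger": 3}
depends_on = {"loan_origination": ["credit_feed"]}

for app, day in effective_recovery_days(requested, depends_on).items():
    print(f"{app}: requested day {requested[app]}, needed by day {day}")
```

A report like this gives the business unit a concrete reason why a feed it asked for on day three has to be paid for as a day-one recovery.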

All of this will work well if you know who and what you are recovering. The responsible business-continuity or disaster-recovery professional should work with the IT group and the business unit to achieve one purpose—to operate a fine, productive, and lucrative organization. You can come to know who and what you are recovering by gathering experts together, such as the programmer, business analyst, system architect, or any other subject matter expert that is necessary. These experts will prove to be invaluable when it comes to creating your disaster-recovery plan. They are the people who know what it takes to technically run the application in question and can explain why a certain disaster-recovery process will cost a certain amount. This information is important for the manager of the business unit, so that she or he can make informed decisions.

Business Continuity

The business-continuity professional is concerned with the company’s most important asset. We might like to think that asset is the employees who make the company really work, but in reality it is the business functions that the employees perform. The business-continuity professional needs to work with each business unit as closely as possible. This means they need to meet with the people who make the decisions, the people who carry out the decisions in the management team, and finally the “worker bees” who actually do the work.

I like to call the “worker bees” the power users. These are the users who know an application intimately. They know the nuances and idiosyncrasies of the business function—they are looking at the trees as opposed to the forest. This is important when it comes to preparing the business unit’s business-continuity plan. The power users should participate in your disaster-recovery rehearsals and business-continuity tabletop exercises.

The business unit management team is vital because its members see the business unit from 20,000 feet—they will help in determining the importance of the application, as they are acquainted with the mission of the business unit. The business unit also needs to keep in mind the need for a disaster-recovery plan as it introduces new or upgraded program applications. The disaster-recovery and/or business-continuity professional should be kept informed about such changes.

For example, a member of management in a business unit might talk to a vendor about a product that could make a current business function quicker, smarter, and better. Being the diligent manager, he or she would bring the vendor in to meet with upper management, and the decision would be made to buy the product, all without informing the IT department or the disaster-recovery or business-continuity professional.

Suppose the product has a Java base, and your corporation’s application infrastructure utilizes native COBOL. The vendor promises that the product will work with COBOL, after it performs a translation algorithm. Of course, some performance issues may surface around this product that is supposedly going to work quicker, smarter, and better. The IT department will have to retrofit this third-party application into the company’s application and hardware infrastructure. If nobody brings the business-continuity professional into the discussion, the application won’t be recovered should you have an outage or other event you must recover from.

As you can see, the business-continuity professional needs to have a relationship with every principal within the business unit, so that when a new product is brought into the organization, the knowledge and ability to recover that product are taken into consideration.

The Four Components of Business Continuity

There are four main components of business continuity, and each is an essential part of the whole business-continuity initiative. They are plan initiation, the business impact analysis or assessment, development of the recovery strategies, and finally the rehearsal or exercise of the disaster-recovery and business-continuity plans. Each business unit should have its own plan, and the company as a whole needs a global plan encompassing all the business units. There should be two plans that work in tandem: a business-continuity plan (recovery of the people and business functions) and a disaster-recovery plan (technological and application recovery).

Initiating a Plan

Plan initiation puts everyone on the same page at the beginning of the creation of the plan. A disaster or event is defined from the perspective of the specific business unit or company. What counts as a disaster for one business unit or company may not be a disaster for another.

A disaster is defined generally by the Disaster Recovery Institute International (www.drii.org) as a “sudden, unplanned calamitous event causing great damage or loss” or “any event that creates an inability on an organization’s part to provide critical business functions for some predetermined period of time.” With this general definition in mind, the disaster-recovery planner or business-continuity professional would sit down with all the principals in the organization and map out what a disaster would be for that business unit. This is the initial stage of creating a business impact analysis (BIA).

A BIA is important for several reasons. It provides a company or business unit with a dollar value impact for an unexpected event. This indicates how long a company can have its business interrupted before it will go out of business completely.

Here are three examples of possible events that could impact your business and compel you to implement your disaster-recovery or business-continuity plan, along with some possible responses:

Hurricane: Since a hurricane can be predicted a reasonable amount of time before it strikes, you have time to inform employees to prepare their homes and other personal effects. You also have the time to alert your technology group so that they can initiate their preparation strategy procedures.

Blackout: You can ensure that your enterprise is attached to a backup generator or an uninterruptible power supply (UPS). You can conduct awareness programs, and perhaps give away small flashlights that employees can keep in their desks.

Tuberculosis outbreak: You could provide an offsite facility where your employees can relocate during the outbreak and investigation.

Analyzing the Business Impact

With a BIA, you must first establish what the critical business function is. This can only be determined by the critical members of the business unit. You might want to outline it as shown in Figure 27-1.

image

FIGURE 27-1 Sample of a business-impact analysis

The preceding information needs to be populated in a spreadsheet with different columns for Day 1, Day 3, Day 5, and so on.

The BIA should be completed and reviewed by the business unit, including upper management, since the financing of the business continuity and disaster recovery project will ultimately come from the business unit’s coffers.
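The spreadsheet arithmetic behind a BIA can be sketched in a few lines. This is only an illustration under stated assumptions: the business functions and dollar figures are invented, losses are assumed to grow linearly with each day of outage, and `tolerable_loss` stands in for whatever total loss the business unit decides it could absorb.

```python
from typing import Dict, Iterable

DAYS = (1, 3, 5)  # the spreadsheet columns: Day 1, Day 3, Day 5


def impact_table(
    daily_loss_by_function: Dict[str, int],
    days: Iterable[int] = DAYS,
) -> Dict[str, Dict[int, int]]:
    """Return the cumulative dollar impact per business function for each day,
    assuming (hypothetically) that the loss is the same every day."""
    days = tuple(days)
    return {
        name: {day: loss * day for day in days}
        for name, loss in daily_loss_by_function.items()
    }


def max_tolerable_days(
    daily_loss_by_function: Dict[str, int],
    tolerable_loss: int,
) -> int:
    """Return the number of whole days of outage before the combined loss
    exceeds the amount the business says it can absorb."""
    total_daily_loss = sum(daily_loss_by_function.values())
    if total_daily_loss <= 0:
        raise ValueError("total daily loss must be positive")
    return tolerable_loss // total_daily_loss


# Hypothetical figures for illustration only.
daily_loss = {"loan processing": 40_000, "customer service": 10_000}

for function, row in impact_table(daily_loss).items():
    columns = ", ".join(f"Day {day}: ${amount:,}" for day, amount in row.items())
    print(f"{function}: {columns}")

print("Days of outage the unit can absorb:", max_tolerable_days(daily_loss, 400_000))
```

Real impact rarely grows in a straight line, so the business unit should replace the linear assumption with its own estimates for each column.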

Developing Recovery Strategies

The next step is to develop your recovery strategy. The business unit will be paying for the recovery, so they need to know what their options are for different types of recoveries. You can provide anything from a no-frills recovery to an instantaneous recovery. It all depends on the business functions that have to be recovered and on how long the business unit can go without the function. The question is essentially how much insurance the business unit wants to buy. If it is your business, you are the only one who can make that decision. Someone who does not have as large a stake in the growth of the business cannot look at the business from the same perspective.

For example, you could have your IT group take a regular media backup—that would be the least expensive option. You could also have them connect your local computer to another computer in another location, and as a transaction is made on the local computer, it could also be made on the remote computer. This option, called a hotsite, is obviously more expensive, but it means that if the local computer fails for some reason, the hotsite can immediately take over with no loss of service.
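The "how much insurance" question can be framed as picking the least expensive option that still recovers the function within the time the business unit can tolerate. The sketch below assumes hypothetical option names, recovery times, and annual costs; none of these figures come from a real vendor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    hours_to_recover: float  # estimated time until the function is back
    annual_cost: float       # estimated yearly cost in dollars


def cheapest_option(
    options: List[RecoveryOption],
    max_hours_down: float,
) -> Optional[RecoveryOption]:
    """Return the lowest-cost option that recovers within the tolerated
    outage, or None when no option is fast enough."""
    fast_enough = [o for o in options if o.hours_to_recover <= max_hours_down]
    return min(fast_enough, key=lambda o: o.annual_cost, default=None)


# Hypothetical options for illustration only.
options = [
    RecoveryOption("regular media backup", hours_to_recover=72, annual_cost=5_000),
    RecoveryOption("hotsite with mirroring", hours_to_recover=0, annual_cost=250_000),
]

for tolerated_hours in (96, 4):
    choice = cheapest_option(options, tolerated_hours)
    print(f"Can be down {tolerated_hours}h -> {choice.name if choice else 'no option fits'}")
```

The decision still belongs to the business unit; the calculation only shows what each tolerance level costs.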

Earlier I mentioned the experts you will need in developing a disaster-recovery plan, and this is where you need to utilize their expertise. Your software architect will be able to tell you the workflow of the applications and should also be aware of any ancillary or legacy systems that are necessary in the workflow process. Your network person will be able to advise you of any network implications outside of your network or even within your network, such as interactions with the DMZ. The network recovery will assist in providing you with redundancy, and immediate recovery if the mirroring paradigm is used in a hotsite scenario, but it will have a significant cost. However, the majority of businesses that experience a catastrophic event and do not mitigate, prepare, and rehearse will not survive.

In most cases, a company will not experience the “hurricane” or the “earthquake” type of disaster. However, there are more subtle and insidious events. For instance, has your company experienced a system failure that caused you to be down for hours, or even days? Several years ago, we were installing a UPS over the weekend at a company I worked at. Although the system was quiesced (taken down gracefully), the electricians and facility personnel failed to unplug the system from the power source, and two wires from the UPS inadvertently touched and singed the motherboard of the system. The system was down for four days. We did not have a backup system, parts had to be ordered, and we did the installation in the middle of the week. There are other seemingly innocuous events that can cause major problems:

• Viruses, such as SARS

• Bio-terrorism threats, such as anthrax

• Employee threats relating to poor security, poor passwords, inadequate training, or workplace violence

In a business-recovery situation, there must be written procedures that any employee in your business unit can access and follow. Information needs to be readily available about the business function that has to be performed. There also needs to be a list of people to contact. This list should name current employees, and it should include members of the Human Resources, Facilities, Risk Management, and Legal departments. You should develop a relationship with the fire and rescue department, police department, the local emergency operations center, and your industry peers.

Rehearsing Disaster Recovery and Business Continuity

The fourth component, and the most crucial, is the rehearsal, exercise, or testing of the plan. This is “where the rubber meets the road.” It is good to have the other three components, but the plan is no good if you’re not sure it will work.

It is vital to test your plan. If the plan has not been tested and it fails during a disaster, all the work you put into developing it is for naught. If the plan fails during a test, though, you can improve on it and test again. Let’s look at a sample mainframe recovery.

Table 27-1 shows a sample checklist of what has to be done to recover a specific mainframe platform for a specific environment. Note that this is just a sample and not what has to be done for every system. Real-life examples can be useful when creating similar lists for other specific environments.

image

TABLE 27-1 Sample Steps for Recovering a Mainframe at a Hotsite

image

TABLE 27-2 Sample Steps and Commands to Recover a Typical Network

Figure 27-2 gives an example of commands that need to be launched to bring up a mainframe at a particular hotsite. For this recovery scenario, the mainframe is attached to EMC’s Symmetrix Remote Data Facility (SRDF). The SRDF DASD is located at the hotsite vendor’s location. The steps in Figure 27-2 will bring the system up as far as putting the DASD online.

image

image

image

image

FIGURE 27-2 Sample steps and commands to bring a mainframe online at a hotsite

Figure 27-3 is another example of commands that need to be launched to bring up an i-Series AS/400 at a predetermined hotsite. To set this recovery scenario up, the i-Series is attached to EMC’s Symmetrix Remote Data Facility (SRDF) and an IBM mainframe. The SRDF DASD is located at the hotsite vendor’s location. This example will show how to bring the system up.

image

FIGURE 27-3 Sample steps and commands to bring an i-Series AS/400 online

Table 27-2 lists commands that can be launched to bring up a network with SNA components to recover an i-Series, mainframe, server farm, and remote connectivity to a vendor.

These examples may appear to be useless trivia to some, but their purpose is to give you an idea of what it takes to restore a typical system or network. Of course, there is a whole lot more to the restoration, but this is just an example.
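A rehearsal is easier to review afterward if each checklist step is recorded as passed, failed, or not run. The following is a minimal sketch of that bookkeeping. The step descriptions are hypothetical placeholders rather than the actual steps for any platform, and each check is a stand-in function that a real rehearsal would replace with the operator's confirmation.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_rehearsal(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run checklist steps in order and record an outcome for each one.
    After the first failure, the remaining steps are marked "not run",
    on the assumption that later steps depend on earlier ones."""
    results: List[Tuple[str, str]] = []
    blocked = False
    for description, check in steps:
        if blocked:
            results.append((description, "not run"))
            continue
        try:
            succeeded = check()
        except Exception:
            succeeded = False  # treat an error in a step as a failed step
        if succeeded:
            results.append((description, "passed"))
        else:
            results.append((description, "failed"))
            blocked = True
    return results


# Hypothetical steps; the lambdas stand in for real operator checks.
sample_steps: List[Step] = [
    ("Restore operating system from backup media", lambda: True),
    ("Bring storage online", lambda: False),
    ("Start application and verify sign-on", lambda: True),
]

for description, outcome in run_rehearsal(sample_steps):
    print(f"{outcome:8} {description}")
```

The recorded outcomes show exactly where a rehearsal stalled, which is the information you need to improve the plan before testing again.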

Third-Party Vendor Issues

Most organizations make use of various third-party vendors, such as enterprise resource planning (ERP) vendors and application service providers (ASPs), in their recovery efforts. In such cases, the information about the third-party vendor is just as critical to your business or technology recovery. When you need to make use of such resources, it is beneficial, if not crucial, to make inquiries into the third party’s operations prior to the implementation of its product or services.

In the real world, the disaster-recovery and/or business-continuity professional has to retrofit the vendor’s information into the business unit’s continuity plan. It is good to get your operation up and running, but what if a critical path includes one of your third-party vendors? For example, your company may rely on credit bureau reports—if processing loans is the bread and butter of your business, you need to know that if your company experiences an outage, you will still receive these reports in order to conduct business.

The vendor’s ability to recover from a failure will also affect how robust your recovery is. Although your recovery may be technically sound, you have to make sure you can conduct business. The same standards you apply to your own company should be set for the third-party vendors you do business with. They should be available to you to conduct business.

It is incumbent upon the disaster-recovery or business-continuity coordinator to make the appropriate inquiries. I’ve developed a questionnaire for this purpose that can be used as a guide, shown in Figure 27-4. Receiving satisfactory answers to questions like these will provide you with some confidence that the vendor will be there when you need its services.

image

image

image

image

image

image

image

FIGURE 27-4 Third-party vendor questionnaire
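Once a vendor returns a questionnaire, the unanswered or unsatisfactory items are the ones to follow up on. The sketch below shows one simple way to track them. The questions are hypothetical examples written for this illustration and are not taken from the questionnaire in Figure 27-4.

```python
from typing import Dict, List, Optional


def open_items(answers: Dict[str, Optional[bool]]) -> List[str]:
    """Return the questions that still need follow-up: those answered
    unsatisfactorily (False) or not answered at all (None)."""
    return [question for question, answer in answers.items() if answer is not True]


# Hypothetical questions and answers for illustration only.
vendor_answers: Dict[str, Optional[bool]] = {
    "Does the vendor have a documented disaster-recovery plan?": True,
    "Has the plan been rehearsed within the last 12 months?": False,
    "Can the vendor deliver its service to our recovery location?": None,
}

for question in open_items(vendor_answers):
    print("Follow up:", question)
```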

Awareness and Training Programs

Another important element of disaster-recovery and business-continuity planning is implementing an awareness program. The business-continuity or disaster-recovery professional can meet with each business unit to hold what are known as tabletop exercises. These exercises are important, because they actually get the members of the business unit to sit down and think about a particular event and how to first prevent or mitigate it, and then how to recover from it. The event can be anything from a category 3 hurricane to workplace violence. Any work stoppage can potentially impede the progress of a company’s recovery or resumption of services, and it is up to the management team to design or develop a plan of action or a business-continuity plan. The business-continuity or disaster-recovery professional must facilitate this process and make the business unit aware that there are events (such as an anthrax scare) that can bring the business to a grinding halt.

Holding a Hazard Fair

One of the programs I implement is what I call a Hazard Fair. While it’s important for disaster-recovery and business-continuity practitioners to prevent disruptions to your business functions, it’s also important to inform your company’s second most important asset, its people. We would be remiss if we built mitigation programs for business functions and technologies but offered no information for the employees, and the Hazard Fair serves that purpose.

If you work for a firm that supports effective disaster-recovery program activities, you should already have a budget for your event. I usually budget $700 for each site. If you do not have a budget, you should cajole and effectively convince upper management of the importance of this activity for the employees. The employees will benefit by learning who they can contact in the event of a disaster or outage in their local community. They can learn about such agencies as the Fire and Rescue department, or the FBI in the event of a homeland security incident. They can find out what stores are in the neighborhood that would supply disaster recovery materials. It is all about awareness.

Next, schedule a meeting with the management team and help them understand and appreciate the win–win situation created once employees know that management is putting on the Hazard Fair for them. Assuming you used your negotiating skills to secure a budget, the next step is to set a date for the fair. You do not want to interfere with daily activities. And, supposing that the location is susceptible to hurricanes, you wouldn’t want to have an event right in the middle of hurricane season. Once you have the date selected, you’ll need to reserve an area for the fair, typically a cafeteria or large break room.

Next you need to determine an overall theme for the event, something like “How to Be a Survivor” or “Surviving the Worst Case Scenario.” A theme will help your staff understand what’s happening, and they can relate to it.

Develop a logo and advertise the event. Prepare and send e-mails, pass out flyers, and display posters in the halls, the cafeteria, and washrooms. If you have a company intranet, be sure to post a notice of the event on the home page. In your messages, include the date and time of the fair, selected activities, vendors who will be exhibiting, prizes to be given out, and any other relevant facts. Make sure you describe how people can benefit from attending the fair, such as by learning how they and their families can be prepared for a disaster.

Vendors are very important to the fair, as they offer great ideas and information. Among the businesses and organizations I have invited are the FBI, local police, fire and rescue departments, the American Red Cross, representatives of the city’s Emergency Operations Center, the Humane Society, the NOAA (National Oceanic and Atmospheric Administration), home improvement stores, supermarkets, shutter companies, and local weather forecasters and television stations. Given the nature of your event, which you should make known to the media, you can probably have these people attend for the cost of a meal. Remember that $700 I mentioned earlier? Have the celebrities eat in your cafeteria so they can mingle with your employees and their families. Your vendors will also appreciate the opportunity to get involved with the community. This way, everyone completes his or her community service for the month.

To get vendors to attend the fair you can call them first, describing the event and the opportunity and its benefit to them. Next, you can follow up with a request on company letterhead, with an invitation stating how your company is committed to assisting employees during a disaster and how the vendors can help in this effort. You can also mention that you’ll be feeding them!

For prizes you can take a portion of the budgeted funds to purchase various “disaster items,” such as flashlights, bottled water, weather radios, matches, and even toilet paper. The idea is to stimulate thought about what is needed during a disaster. Obtain these and other items from local department stores. Create a game that encourages employees to visit each of the vendors, such as a special card that has to be stamped by each vendor. Each completed card is then entered in drawings for the prizes. Of course, serving free food will also cheer up the proceedings.

Schedule the fair to last a few hours; a good time to hold your event is during lunchtime. With good planning and the support of company management and local vendors, you should be able to conduct your own successful Hazard Fair. It will help your employees appreciate the value of being prepared for disasters.

Summary

Here in summary are the principal points, roles, and responsibilities of a good disaster-recovery plan.

• Develop and maintain disaster-recovery plans for all your company’s enterprise technologies.

• Assist IT departments with assessments of their disaster-recovery plans and their ability to mitigate a business disruption.

• Maintain IT departmental plans, and update them with the addition of new technologies.

• Recommend technical recovery strategies and options, and assist with the implementation of recovery solutions.

• Assess the business-continuity implications of proposed technology and organizational changes, and coordinate the implementation of any required changes.

• Schedule and oversee disaster-recovery rehearsals for all enterprise systems.

• Schedule annual disaster-recovery rehearsals.

• Document the results of the rehearsals and identify any recommended enhancements to the plans and procedures.

• Work with critical third-party vendors to ensure their inclusion and adherence to your company’s disaster-recovery strategy and policies.

• Ensure disaster awareness.

• Plan and conduct awareness programs for company associates in the area of business and personal disaster preparedness.

• Plan and conduct Hazard Fairs for business sites.

• Develop and conduct Lunch-and-Learn sessions for company employees. Either your team or local agencies and vendors can be invited to give talks about how they can be of assistance to the employees in the event of a disaster.

• Activate the plan.

• Provide expertise and 24/7 on-call support to management and business functional areas, as requested, when a business disruption occurs.

• During plan activation, act as liaison between IT away teams, both local and remote, and the vital-operations team.

• Use hurricane software to monitor hurricanes and other weather-related anomalies.

• Ensure community involvement.

• Participate in local community disaster mitigation and planning initiatives.

• Participate as a member in the Association of Contingency Planners (www.acp-international.com) and other community groups.

• Perform disaster-recovery speaking engagements as requested by municipalities or focus groups.

To the disaster-recovery and business-continuity purists, this business can have the appearance of being esoteric. The disaster-recovery and business-continuity process is cyclical and must be maintained. Your plans must be updated and rehearsed regularly. Disaster recovery is vital to everyone: you, your family, and the workplace. Although it may not seem important on a daily basis, being properly prepared can mean the difference between having a place to work and going out of business. What is your choice?