Protecting and recovering applications and application data
Deciding how and where to store vital data for recovery purposes
It’s all about the data.
That pretty much sums it up. But don’t close the book yet. I have a lot of details to cover about how to protect that data.
I don’t blame you if you’re confused because I say that it’s all about the data. Security products have, in the past, been largely network-centric: Firewalls and intrusion detection systems are network-based tools, leading many IT professionals (myself included) to believe that their networks needed the protection. But your network is only your private highway that leads to what really matters — your data.
This chapter focuses on protecting your organization’s data so you don’t find yourself in a jam when a disaster strikes — instead, you’re prepared to resume processing at a later time in your same location or soon in a different location.
If you know that your most valuable IT assets are your databases, protecting them is just a matter of incorporating some backup or replication scheme to make them more readily available if a disaster strikes. Right?
Not so fast.
Most organizations store much of their data centrally in databases, but they also store some of it elsewhere. And probably only a few people are familiar with the details of the data that exists in the backwater places in your network.
Locating all your data requires some sleuthing. You need a process-centric view that uncovers all of the ins and outs of your data (where the data comes from and where it goes) so you don’t miss any details.
Before we explore the various options for protecting data against loss (and making it available soon after a disaster), you need to understand some basic principles about DR-style data protection:
Speed increases costs. In other words, the faster you want your data available after a disaster, the more it’ll cost you.
Distance increases costs. The further away you store the data that you’ll need to recover quickly, the more it’ll cost you. Primarily, this cost relates to private, high-speed WAN connections between two or more of your facilities.
Size increases costs. The more data that you want to make available soon after a disaster, the more it’ll cost you. This cost relates to the amount of storage you must purchase to achieve the storage redundancy you require.
Complexity increases costs. A data protection and recovery plan that contains several different solutions for parts of the organization costs more than a simple plan. For instance, a plan that has mirroring for part of the data, replication for another part of the data, electronic vaulting for yet another part of the data, and tape backup recovery (all discussed later in this chapter) for the rest of the data costs quite a lot in terms of all of these platforms that you must build and operate over time. A simpler one- or two-tier plan is more cost effective than a complex plan with many more components.
Here’s an example: During its Business Impact Analysis (BIA), an organization identifies several critical databases that it must protect and recover in the event of a disaster. While developing the Recovery Time Objectives (RTOs) for various business applications, the DR project team identifies two classes of recovery needs: a more rapid need to recover a subset of data and a less rapid need to recover the remaining data.
The project team debates these possible strategies:
Two-tier recovery: They can protect and recover the most time-critical data by using a more expensive data replication mechanism and recover the rest of the data, which is less time-sensitive, from backup tapes. Members of the project team who are in favor of this strategy argue that this approach is less expensive than using costly data replication for all the organization’s data.
One-tier recovery: They can use the pricey data replication to protect and recover all the organization’s data. Although this approach has higher upfront costs because it protects all the data with replication, the strategy is simpler and easier to implement than two separate schemes. The team members in favor of this approach argue that its simplicity outweighs the additional costs. They’ll probably find this approach easier and more reliable during a recovery.
The project team needs to do more analysis to determine whether the two-tier or one-tier recovery strategy is appropriate for the organization. The team needs to weigh not only the capital costs, but also the costs of operating the environment day to day and during disaster and recovery scenarios.
Keep the data protection principles in mind when you’re developing your data protection and recovery plans. You need to weigh the costs and benefits of developing either a single data recovery solution for your entire enterprise or a multi-tier data recovery solution.
The process of DR planning begins with the Business Impact Analysis (BIA), which includes a risk assessment and the determination of Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). These steps help you identify which business processes are the most important in an organization and how quickly you need to recover them. Chapter 3 discusses BIA development in detail.
Before you can develop a viable data recovery plan, you need to identify all the data associated with each critical business process listed in the BIA. Analysis often shows that you find some of the data where you expect to find it — in application databases and other centrally managed locations. However, you probably also find much data in other places, including end-user laptops, workstations, and inside the heads of several individuals. This data-location effort is discussed in Chapter 4.
The Business Impact Analysis and data sleuthing efforts, when completed, provide two important facts:
Which data is important
Where the important data is located now
Knowing these facts, you can now construct a strategy for protecting the data and making the data recoverable in the event of a disaster.
Because, in a disaster, the facility or the equipment that houses the data is often destroyed, the business needs to recover the data (either in the same or a different location) on alternative equipment so business operations can resume.
The next important step in developing your DR plan (after completing the BIA) is to determine where and how to store the data so that you can recover it in the time required to support your RTOs and RPOs.
This chapter deals with the where and how of this data storage. The following sections discuss several solutions for protecting data:
Backups
Resilient storage
Replication and mirroring
Electronic vaulting
Tape backup has, for over five decades, been a runaway favorite for protecting data against loss in the event of a disaster. People who need to back up data have favored tape for its long-term storage stability and low unit cost. The storage capacity of magnetic tape, or magtape, has steadily increased over time, enabling tape backup to keep up with similar advances in hard-disk capacity. Figure 8-1 shows the flow of data to and from backup media when you back up and later restore that data.
Backup fulfills several purposes:
Long-term archiving: Regulations often require the long-term storage of certain business records. Organizations back up such data to tape and store those tapes for long periods of time so they can meet those retention needs.
Data recovery: Human error often occurs: a program bug can inadvertently corrupt data, or someone can accidentally delete files. Equipment malfunctions, such as hard drive failures, are also common. In these and similar scenarios, you can recover data from recently created backup tapes.
Disaster recovery: If a disastrous event strikes a data center in which your organization stores important business information, you can easily recover that data from backup tapes onto replacement or alternate systems.
Organizations that have no disaster recovery plan still often use tape backup for data recovery purposes, either because they know that it makes good business sense or they want to avoid repeating an incident in which an equipment failure or human error resulted in a costly data loss. With some changes in the way that the organization performs these backups, it can adapt its tape backup operation to support disaster recovery needs.
You need to consider several important issues if you want to make tape backup a part of your disaster recovery plan:
Data retention policy: Before you can put a tape backup plan into place, the organization must first decide how long it wants to retain electronic data, including data in databases and in other locations. Factors that drive a retention policy include statutory requirements, contractual requirements, service level agreements, and existing operations procedures.
Backup media retention: The organization needs to decide how long it plans to retain backup tapes. This timeframe should fulfill data RPOs in both directions, meaning
• Most recent data recovery: This requirement supports the RPO: recovering the most recent data in order to avoid re-keying or re-processing the newest data.
• Least recent data recovery: This requirement supports a different objective — recovering data from a particular point in time for whatever business purpose requires that data. For example, to see what a customer record looked like exactly six weeks and three days ago, you might need to retain daily backups for seven weeks or more.
Developing a tape backup media retention and rotation plan can be complex. Such a plan needs to accommodate both archival and disaster recovery functions, and it must anticipate data growth, longevity of backup media, and other business requirements.
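To make the retention-and-rotation idea concrete, here’s a minimal Python sketch of a grandfather-father-son (GFS) labeling scheme, a common rotation pattern. The cycle boundaries chosen here (month-end and Friday) are illustrative assumptions, not recommendations; your own plan must reflect your RPOs and retention policy.

```python
from datetime import date, timedelta

def gfs_label(day: date) -> str:
    """Classify a daily backup under a grandfather-father-son scheme:
    month-end tapes are 'grandfather' (kept longest), end-of-week tapes
    are 'father', and all other days are 'son' (reused soonest)."""
    next_day = day + timedelta(days=1)
    if next_day.month != day.month:   # last day of the month
        return "grandfather"
    if day.weekday() == 4:            # Friday: end of the business week
        return "father"
    return "son"

# Example: label two weeks of daily backups.
start = date(2024, 5, 27)
schedule = {(start + timedelta(days=i)).isoformat():
            gfs_label(start + timedelta(days=i)) for i in range(14)}
```

Each label then maps to a different retention period and reuse cycle, which is where the real policy decisions (and the media budget) come in.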
Backup media recordkeeping: Because you use backup media for both archival and recovery purposes, you need to establish a robust and reliable recordkeeping system for backup media. The recordkeeping system should support several functions:
• Quickly determine which backup tape contains any given data file.
• Quickly determine which backup tapes you need to completely recover a given server.
• Quickly determine the physical location of any backup tape.
• Easily determine how many times any given backup tape has been used.
Backup media protection: Because they contain valuable business information, you need to protect backup tapes from unauthorized access and use.
Backup media location: To effectively guard against any disaster scenario, you must locate backup media away from the data center in case that data center is damaged or destroyed. Backup tapes wouldn’t be much good if they were damaged or destroyed alongside the equipment that stored the original data. I discuss where to store your backup media in more detail in the section “Deciding where to keep your recovery data,” later in this chapter.
Backup media privacy: Government regulations, as well as requirements through pervasive standards, such as PCI (the Payment Card Industry data security standard intended to protect credit card data), often require that data in backup media be encrypted, thus preventing exposure of backup data to any unauthorized parties. You can implement the encryption of backup media in a variety of ways:
• Main storage encryption: If data that you want to back up is already encrypted in its native location, the data may be automatically encrypted when you copy it to backup media.
• Backup program encryption: The software that performs the backup operation may be able to encrypt the data as it passes from the source system to the tape backup equipment.
• Backup equipment encryption: The equipment that you use to write data onto backup tapes may have the ability to encrypt the data before it writes that data to tape.
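The recordkeeping functions listed earlier (file-to-tape lookup, server recovery sets, tape location, and usage counts) can be sketched as a small catalog. This Python example uses an in-memory SQLite database; the table layout, tape IDs, and location names are invented for illustration, and a real system would also track backup sets, expiry dates, and encryption key references.

```python
import sqlite3

# Minimal backup-media catalog supporting the four lookups described above.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tapes (tape_id TEXT PRIMARY KEY, location TEXT,
                        use_count INTEGER DEFAULT 0);
    CREATE TABLE contents (tape_id TEXT, server TEXT, path TEXT,
                           backed_up_at TEXT);
""")
db.execute("INSERT INTO tapes VALUES ('T-0042', 'offsite-vault-A', 17)")
db.execute("INSERT INTO contents VALUES "
           "('T-0042', 'db01', '/data/orders.db', '2024-06-01')")

# 1. Which tape contains a given data file?
tape, = db.execute("SELECT tape_id FROM contents WHERE path = ?",
                   ("/data/orders.db",)).fetchone()
# 2. Which tapes are needed to completely recover a given server?
tapes = [t for (t,) in db.execute(
    "SELECT DISTINCT tape_id FROM contents WHERE server = ?", ("db01",))]
# 3. Where is the tape physically? / 4. How many times has it been used?
location, uses = db.execute(
    "SELECT location, use_count FROM tapes WHERE tape_id = ?",
    (tape,)).fetchone()
```

Commercial backup software provides this catalog for you; the point of the sketch is that all four lookups must be fast and reliable when a recovery clock is ticking.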
Table 8-1 shows several pros and cons for tape and disk backup.
Pros | Cons |
---|---|
Inexpensive media | Slow/sequential access |
Well established | Media is somewhat fragile (applies mostly to tape) |
Media has a long shelf life | |
Media is easily transportable, lends itself easily to off-site storage | |
Hardware failures in storage systems account for enough unscheduled downtime that these failures warrant attention and DR planning consideration. If you make the storage systems themselves resilient, you reduce the likelihood that a storage system failure will cripple a critical business function.
Here are some of the ways in which you can make a storage system more resilient:
RAID: Redundant Array of Independent Drives (or Disks), also known as Redundant Array of Inexpensive Drives (or Disks). You can choose from many RAID levels and configurations. With the exception of RAID 0 (which stripes data for performance but provides no redundancy), RAID-based storage systems can continue functioning even if one of the hard drives fails. RAID systems also permit you to replace a failed drive without needing to power down the RAID system (called hot swapping).
Redundant power supplies: Many storage systems have multiple power supplies, which permit the continuous operation of the storage system, even if one of the power supplies fails. These storage systems also permit the hot replacement (or hot swap) of power supplies, assuring continuous availability.
Redundant server connections: Many storage systems have multiple controllers and physical connections to servers, assuring continuous operation, even if one of the connections fails.
Virtualization: You can use large central storage systems to create virtual disk volumes that meet whatever storage needs application servers have. Storage virtualization lets you more efficiently use storage resources by allowing the organization to carve up the storage system in whatever ways meet its business needs.
Network Attached Storage (NAS) and Storage Area Network (SAN) technologies often include all of the features in the preceding list. I discuss NAS and SAN in Chapter 7.
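Resiliency in RAID comes at a capacity price, and that price varies by level. The following simplified sketch computes usable capacity for common RAID levels; it deliberately ignores hot spares, controller overhead, and mixed drive sizes, so treat it as an illustration of the trade-off rather than a sizing tool.

```python
def raid_usable_capacity(level: int, drives: int, drive_tb: float) -> float:
    """Usable capacity in TB for common RAID levels (simplified)."""
    if level == 0:                   # striping only: no redundancy
        return drives * drive_tb
    if level == 1:                   # mirroring: one drive's worth, n copies
        return drive_tb
    if level == 5 and drives >= 3:   # single parity: lose one drive's capacity
        return (drives - 1) * drive_tb
    if level == 6 and drives >= 4:   # double parity: lose two drives' capacity
        return (drives - 2) * drive_tb
    if level == 10 and drives >= 4 and drives % 2 == 0:  # mirrored stripes
        return (drives // 2) * drive_tb
    raise ValueError("unsupported level/drive-count combination")

# Eight 4 TB drives:
assert raid_usable_capacity(5, 8, 4.0) == 28.0   # survives 1 drive failure
assert raid_usable_capacity(6, 8, 4.0) == 24.0   # survives 2 drive failures
```

RAID 6 costs one more drive of capacity than RAID 5 but survives a second failure during the (often lengthy) rebuild window, which is why many DR planners prefer it for large arrays.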
The terms replication and mirroring generally refer to the ability to write newly introduced data to more than one storage system at the same time. Replication and mirroring differ somewhat in the details.
The two chief advantages of replication and mirroring over tape (or disk) backup are
Data is copied to another storage system that a server can immediately use, in near-real time.
You can access data without the need for a magtape-style restore.
You can develop a variety of architectures to support replication and mirroring, depending on the nature of the business objectives. The two primary architectures involve replication or mirroring to
A secondary storage system that a server accesses (as shown in Figure 8-2)
A storage system at a remote location (refer to Figure 8-3)
The difference between replication and mirroring is as follows:
Mirroring: A real-time or near real-time copying process, disk block by disk block. Changes made to a primary storage system are written onto one or more secondary storage systems. Here are some additional facts about mirroring:
• Mirroring is usually handled directly by the storage subsystem.
• Servers often don’t know that any mirroring is taking place (at least, when the storage hardware controls mirroring).
Replication: A transaction-level process in which changes to databases that occur on primary storage systems are carried out on secondary storage systems. Additional facts about replication include
• Replication can occur in near-real time, but it can also be delayed by several minutes or hours, depending on the configuration of the replication software, the speed of the communications link between storage systems, and the RTO and RPO figures.
• Database management systems or storage systems can manage replication; hence, servers may or may not be aware of the details of replication.
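The journal-and-replay idea behind transaction-level replication can be sketched in a few lines of Python. This toy example is an illustration only (the class and method names are invented), and real replication software handles ordering, conflicts, and failure cases that this skips. The queue depth stands in for replication lag, which must stay within your RPO.

```python
import time
from collections import deque

class TransactionReplicator:
    """Toy transaction-level replicator: writes applied to the primary
    are journaled, then replayed against a secondary copy later."""
    def __init__(self):
        self.primary, self.secondary = {}, {}
        self.log = deque()

    def write(self, key, value):
        self.primary[key] = value
        self.log.append((time.time(), key, value))  # journal the change

    def replicate(self):
        """Replay all journaled changes onto the secondary, oldest first."""
        while self.log:
            _, key, value = self.log.popleft()
            self.secondary[key] = value

    def lag(self):
        """Age in seconds of the oldest unreplicated change (0.0 if none)."""
        return time.time() - self.log[0][0] if self.log else 0.0

r = TransactionReplicator()
r.write("acct:1001", {"balance": 250})
r.replicate()
```

Monitoring something like `lag()` against the RPO is exactly the kind of check a DR team should build into its replication setup.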
You need to know these other salient facts about replication and mirroring so you can make sure your clustered servers operate properly:
Replication and mirroring are a part of a larger application resiliency architecture that also may include a clustering or failover capability.
You can’t do replication and mirroring without deep consideration for other layers in the stack, including application, server, database, and network. You must closely coordinate replication and mirroring with server cluster configuration, which in turn requires you to carefully configure the application and network devices.
A popular option available for backing up data is known as electronic vaulting, sometimes known as remote backup or e-vaulting. Electronic vaulting is the process of sending data electronically to an off-site location through a network connection.
In the strict sense of the term, electronic vaulting can also mean sending data to another one of your business locations. I prefer to use the term replication to mean sending your own data to yourself at a different location.
You can use electronic vaulting for a lot of things, depending on whom you ask. Some of the possibilities include
Using backup software on your systems to send backup data to computers run by a backup service provider.
Replicating database transactions that you send off-site.
Copying data to a remotely located standby system that can assume primary processing duties during a disaster.
Most people probably consider electronic vaulting a faster version of third-party off-site media storage. Off-site backup media storage provides one or more copies of data that you can use to recover systems in the event of a disaster. Electronic vaulting achieves this same objective, and it’s generally much faster because you can have recovery data electronically transmitted to a recovery facility.
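One reason electronic vaulting can be fast and affordable is that most services send only changed data across the WAN link. Here’s a hedged sketch of that idea using content digests; the file names and the digest store are illustrative assumptions, and real e-vaulting products work at the block or byte level rather than whole files.

```python
import hashlib

def changed_since_last_vault(files: dict, last_digests: dict) -> list:
    """Return names of files whose contents changed since the last vaulting
    run, so only deltas cross the (expensive) WAN link.
    `files` maps name -> bytes; `last_digests` maps name -> SHA-256 hex."""
    to_send = []
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if last_digests.get(name) != digest:   # changed or brand new
            to_send.append(name)
    return to_send

previous = {"orders.db": hashlib.sha256(b"day1").hexdigest()}
today = {"orders.db": b"day2", "hr.db": b"new"}
# Both files go to the vault: one changed, one is new.
```

After each run, the service records the new digests so the next run sends only what changed since then.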
Whether your DR plan calls for backup tapes, replication, electronic vaulting, or remote mirroring, you need to think about where your data is going and how you’ll get it back when you need it. In most cases, you need to decide where to store your recovery data.
Case in point: If you decide to recover your application data by using backup tapes stored off-site, at what location do you want to store those backup tapes?
Some of the factors that you need to consider when evaluating off-site locations include
Proximity to business locations: Keep your recovery data close to your primary (or alternate) processing site for rapid recovery times. However, if the site is too close to your primary site, it could also be involved in a regional disaster, such as an earthquake or hurricane.
Physical security: Make the measures that protect media from harm and unauthorized access at least as good as the physical security in your main processing site — perhaps even far better than your own security.
Proximity to transportation: If your DR plan calls for backup media to be flown to another city, be sure that your media storage location is close enough to major transportation centers (airports, freeways, and so on) so your media can be quickly sent to a recovery center.
Security while in transit: You take unnecessary risks if you have your backup media transported on a scooter or in some hitchhiker’s backpack. Seriously, though, transport backup media in vehicles with security measures that ensure your media stays secure while en route to or from the storage facility. A third-party service may include secure media transit, or you may need to find a separate secure media transportation service provider.
You may have additional criteria concerning off-site storage. After you know some of the characteristics you want for the location storing your backup media, you can begin evaluating the characteristics of possible locations.
You can find a few types of data storage facilities available:
Commercial media storage centers: Organizations that store backup media in highly secured facilities designed for that purpose.
Alternate business locations: These locations could (and should) coincide with your alternate processing sites so those processing sites can more quickly resume support of critical business functions.
Third-party service providers: Electronic vaulting and managed backup services have their own facilities for storing customer data. Due diligence is especially important — check them out for yourself.
Reciprocal processing site: If your organization has entered into a reciprocal processing site usage agreement with another organization, you may want to have your recovery data sent to that location. Just make sure you can get to your data when you need it.
Although the business information that you store on your systems is important, don’t lose sight of the data that you transmit over network connections to or from other organizations. Not only does that data itself require protection, but you probably also need to include the means for transporting it in your DR plan, especially when transporting data into and out of your organization is a part of critical business functions.
Your hardware and software inventories, as well as the development of your infrastructure and data flow diagrams, have probably captured the facts about how you transmit data into and out of your organization. You need to include the transmission facilities that are essential to critical business processes in your DR plan. If you don’t include those facilities, one or more critical business functions may be unable to operate in a disaster.
Here are some tips about how to identify and protect data transmission capabilities:
Inventory all data transfer points. Identify all the external systems that transfer data to or from your organization. You can look in two places:
• At any scheduling tool used to initiate data transfer connections with other organizations
• At user IDs and passwords that other organizations use to initiate connections with your system
This task is especially tedious if your organization doesn’t have an existing inventory of all of the data transfer partners.
Inventory encryption keys and means. If your data transfer system(s) uses encryption, you need to capture all the details about encryption with every external entity. Decide whether to use the same encryption keys at your alternate processing center or develop alternative encryption keys for your alternate site.
Identify associated firewall rules. If your firewall has rules about permitting inbound or outbound data transfer sessions, you need to translate these rules for use at your alternate site. Translating, in this context, means mapping IP addresses between the main and recovery sites (although it may not be this simple if the DR site’s architecture is different from the main site’s architecture). Similarly, the entities with which you exchange data need to make adjustments in their firewall rule sets so their firewalls permit your file transfer sessions to and from your alternate sites.
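A hedged sketch of the rule-translation idea: map each main-site address to its DR-site equivalent and rewrite a rule’s endpoints. The rule format and every address here are invented for illustration; a real translation must also handle subnets, NAT, and any architectural differences between the two sites.

```python
# Illustrative main-site -> DR-site address map (all addresses invented).
SITE_MAP = {
    "10.1.0.5": "10.9.0.5",   # main db server   -> DR db server
    "10.1.0.9": "10.9.0.9",   # main sftp gateway -> DR sftp gateway
}

def translate_rule(rule: dict) -> dict:
    """Rewrite src/dst addresses for the DR site; other fields carry over.
    Addresses not in the map (e.g., external partners) are left unchanged."""
    out = dict(rule)
    for field in ("src", "dst"):
        out[field] = SITE_MAP.get(rule[field], rule[field])
    return out

main_rule = {"action": "allow", "proto": "tcp", "port": 22,
             "src": "203.0.113.10", "dst": "10.1.0.9"}
dr_rule = translate_rule(main_rule)
```

Note that the partner’s source address stays the same: that side of the change belongs to the external entity, which is exactly why you must coordinate firewall changes with your data transfer partners ahead of time.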
In the preceding sections, I discuss various ways to protect data during normal operations — backups, replication, off-site storage, data transmission, and so on.
While in disaster recovery mode — during a disaster when critical business applications are operating in alternate locations — you also need all of these protections:
Back up DR servers.
Protect backup media, usually through off-site storage.
Protect transmitted data.
Store critical data on resilient storage systems.
Applications enable employees, customers, and suppliers to manage the production and delivery of whatever goods or services your organization provides. Without applications, your business data just sits there, unable to move or be seen. Applications bring data to life and give it meaning.
From a purist’s point of view, applications (meaning the software itself) are just another form of business data. Although this assertion is mostly true, you need to consider many issues when recovering applications that you don’t have to worry about for data. You must consider these application-specific issues and factors when building a DR plan:
Version
Patches and fixes
Configuration
Users and roles
Interfaces
Customizations
Pairing with OS and database versions
Client systems
Network considerations
Change management
Configuration management
In terms of disaster recovery, applications are a good deal more complex than data. If you think that recovering applications is a simple restore-and-go process, I hate to be the bearer of bad news. You have to know and address many details — otherwise, restoring an application becomes a gamble.
Each application has a version. Many organizations lag behind the most recent version of an application, particularly in larger and more complex environments, such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), MRP II (Manufacturing Resource Planning), and financials.
Determine the steps you need to take to recover the current version of each critical application.
Organizations that run in-house developed applications need to keep a highly detailed history of problems, solutions, changes, patches, and fixes made to each critical application. Your organization may make these changes for a wide variety of reasons, from feature changes to stability to cosmetic adjustments.
You need to make the level of recordkeeping for application changes so good that a recovery team can use that recordkeeping to properly recover the application in a disaster scenario, rolling forward all the changes so the application can operate in the same way in the recovered environment.
For more about patches and fixes, see the section “Applications and change management,” later in this chapter.
Applications are becoming more configuration-centric, thereby increasing flexibility and reducing the need for customizations. The overburdened IT industry has seen this development as a boon because IT departments don’t have the in-house resources to maintain every application at the application code layer. Some applications can have so many configuration settings that just keeping track of all those settings can become someone’s full-time job.
And many configuration changes have an effect on data and external systems. Configuration changes can alter the layout of data feeds and even database schema. They can change the way that users log in and how the application uses resources on servers and other equipment.
You need to do more than just blindly clone all configuration changes from a production system onto a recovery system. That might be a good start, but the recovery system may require different configuration settings so the application can operate as intended in recovery mode. You need to put a lot of thought into the meaning of some configuration settings to ensure that you can recover and operate the application in a recovery setting, in which business conditions and the technical environment are probably different than in day-to-day business.
I discuss application configuration more in the section “Applications and change management,” later in this chapter.
Users log in to applications in order to perform their specific duties, according to their roles in the organization. But logging in to an application isn’t as easy as it looks. Consider some of the ways in which authentication can take place:
Local password database: The application itself authenticates the user. One or more tables contain user IDs and encrypted or hashed passwords. The application has its own login functions that it uses to identify and authenticate the user.
External directory service: The application performs its own authentication, but it doesn’t store user ID information. Instead, the application makes a call to an external directory service, such as Microsoft Active Directory, LDAP (Lightweight Directory Access Protocol), or Liberty. An application that relies on an external service needs to know how to find the external service and communicate with it in real time to authenticate users.
Single Sign On (SSO): The application participates in an enterprise-wide service whereby users authenticate once, and their credentials are passed between applications and a central session management service. SSO can build on any of several standards, including SAML, Kerberos, or OpenID.
Users may also use two-factor authentication in conjunction with any of the authentication methods in the preceding list. Two-factor authentication can add unnecessary (or necessary) complication to a recovery environment because of the extra hardware and software that you need.
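As a hedged illustration of the local password database option, here’s a minimal sketch of salted password hashing and constant-time verification using Python’s standard library. The iteration count and function names are assumptions for illustration, not a vetted implementation; in a DR context, the point is that this entire table (salts and digests) must be recovered intact for anyone to log in.

```python
import hashlib, hmac, os

ITERATIONS = 200_000  # assumption; tune to your hardware and security policy

def hash_password(password, salt=None):
    """Derive a salted PBKDF2-SHA256 digest for storage in a password table."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, stored):
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                    salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)

salt, stored = hash_password("correct horse battery")
```

By contrast, the directory-service and SSO options move this recovery burden to an external system, which then becomes its own critical dependency in the DR plan.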
The DR planning team needs to make some strategic decisions on how users will authenticate to applications in a recovery environment. You may use the same method currently in the normal business environment, or you may go with a different approach. If you use a different method, all users who plan to use the application in a recovery situation need to know how they can log in to that application. You need to understand security requirements and possibly contractual obligations to understand how you should implement authentication in a DR system.
In all but the most simple and mundane applications, users are permitted to perform only certain functions. In today’s vernacular, you say that a user is assigned to one or more roles. A role is a name (such as Revenue Clerk 1) that has specific application functions associated with it. The application also uses access control to determine what database information a given role is permitted to view, modify, or remove.
Some application environments have highly granular roles — these roles require a lot more attention and foresight on the part of DR planners if you want the organization to continue operating its applications in a recovery situation. In some environments, such as financial applications, each user’s role has a paper trail for requests and approvals. You must maintain even this kind of functionality in a disaster scenario to maintain compliance with laws, regulations, and standards.
The assignment of roles may be closely tied to the way in which users are authenticated. The DR planning team needs to carefully sift through the details in order to determine how you can recover and operate such an application in a recovery situation, realizing that in many disaster situations, different people may need to use and operate the application than those who do in daily business when the sun is shining.
External users can include suppliers, partners, or members of the general public.
Users of any given online application are largely unaware of the actual location of the systems they’re using. In today’s online world, a natural or man-made disaster can occur in some region of the world, and users don’t necessarily connect such a disaster to the availability of the online application they use. They just want to start their application client program or Web browser and access the application as always, and they usually do so without a thought about the application infrastructure that’s located somewhere, doing their bidding.
If external users are your bread and butter, you need to make the manner in which they authenticate to your applications consistent — so much so that they have little or no idea whether they’re logging in to and using a recovery system or the original.
No application is an island. Applications often rely on back-end data transfers that occur in real time or in batches. These data transfers may take place between applications in the business, and they often involve exchanging information with external entities.
I discuss this topic in Chapter 4 as it relates to identifying all data inputs and outputs for systems and the organization, as a whole. This section talks about these interfaces from the application’s point of view.
The application itself may have configurations, scheduled jobs, staging tables, or other moving parts associated with data that’s sent to or from the application. For each type of interface (and perhaps for each external entity, as well), you need to document and understand all the details so these interfaces (at least, the critical ones) can function in a recovery environment.
Some of the interfaces’ configuration or operations details may contain specific information, such as IP addresses, user IDs, or passwords. You probably need to change some of these settings now so these interfaces can operate in a recovery environment. You must find and document these settings now if you want the recovery team to successfully get these interfaces running in a disaster.
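One way to make those settings easier to find and change is to pull environment-specific interface details (hosts, addresses, credentials) out of the application and into a per-environment configuration. The following Python sketch illustrates the idea; the host names, addresses, and key names are hypothetical, not taken from any real application.

```python
# Sketch: keep environment-specific interface settings out of application
# code so the recovery team can switch environments in one place.
# All host names, IPs, and keys below are hypothetical examples.

INTERFACE_SETTINGS = {
    "primary": {
        "sftp_host": "feeds.example.com",     # hypothetical partner host
        "db_host": "10.1.2.3",
        "user": "feed_user",
    },
    "recovery": {
        "sftp_host": "feeds-dr.example.com",  # recovery-site equivalent
        "db_host": "10.9.2.3",
        "user": "feed_user",
    },
}

def interface_config(environment: str) -> dict:
    """Return the interface settings for the given environment."""
    try:
        return INTERFACE_SETTINGS[environment]
    except KeyError:
        raise ValueError(f"Unknown environment: {environment!r}")
```

Documenting the recovery-side values now, rather than improvising them during a disaster, is the point: the recovery team should only have to select the environment, not hunt down every embedded address.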
Many organizations aren’t quite satisfied with the off-the-shelf behaviors and functionality of applications. Often, those organizations write custom code and wedge it into the application to make that application work the way they want it to.
Customizations may have specific rewards, but they also carry certain risks, and those risks extend to disaster recovery planning. Creating customizations and integrating them into an application takes effort, and you usually need to re-implement the customizations when you upgrade the application in order to keep those customizations working, even in a disaster scenario.
Generally speaking, you should assume that an application won’t work properly unless you have all of its customizations in place. So, you must apply all customizations to the recovery environment. Depending on how system developers develop and implement customizations for a given application, you may need to recover both the application itself and the development environment so that you can recompile or reinstall customizations into production systems.
Managing customizations can add considerable complexity to a recovery plan.
On an ongoing basis, an organization may be able to make some key changes so it can more readily determine whether specific customizations are critical in a recovery situation, as well as install those customizations more easily.
Applications are at the top of the stack, but they don’t exist in a vacuum. Instead, applications depend on a great many details further down in the stack, almost to the bottom. Figure 8-4 contains a depiction of the application stack.
A properly operating application depends on many details in the other layers in the stack:
Operating system version, patches, and configuration: The version of the application that you want to build in the recovery environment may require a specific version of the operating system (OS). Also, you may need (or need to avoid) certain OS patches. And the application may depend on several OS configuration settings in order to work properly.
Database management system version, patches, and configuration: An application in a recovery environment may require specific versions of the database management system (DBMS), including specific patches and configurations.
Database names and locations: The application may look in specific locations for data or connect with specifically named databases.
Web server version, patches, and configuration: Many Web-based applications use a separate Web server that’s configured to run one or more specific applications. You often need the version, patch levels, and configuration of such a Web server for the application to work properly.
Network: An application may have one or more external systems specifically named, either by DNS host name or by IP address. Other possible network-based dependencies include configurations of Web servers, authentication servers, and sources for external feeds.
A careful analysis of dependencies in the stack may uncover additional dependencies. Be sure to capture these details so you can make your recovery plans accurate and successful.
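Some of these stack details can be captured automatically so they land in your recovery documentation rather than in someone's head. This sketch records a few basic facts about the host it runs on; the fields shown are examples, and a real inventory would also capture DBMS and web server versions, patch levels, and configuration files.

```python
# Sketch: snapshot a few stack details for recovery documentation.
# A real plan would extend this to DBMS, web server, and patch data.
import platform
import sys

def stack_snapshot() -> dict:
    """Record basic OS and runtime details of the current host."""
    return {
        "os": platform.system(),            # e.g. "Linux", "Windows"
        "os_release": platform.release(),   # kernel or OS release string
        "hostname": platform.node(),
        "python_version": sys.version.split()[0],
    }

if __name__ == "__main__":
    for key, value in stack_snapshot().items():
        print(f"{key}: {value}")
```

Running such a script on a schedule and storing its output off-site gives the recovery team a dated record of what each server actually looked like, which is far more reliable than documentation updated by hand.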
Applications require client systems in order to work properly. Clients allow users to communicate with the application so the users can do whatever it is they do with the application.
These clients may be
Client software: Many applications use client software that’s installed on end-user workstations. In order for this software to work properly, you may need specific versions of these clients that have particular configuration settings. In turn, the client software may impose specific dependencies on the workstation OS, patch level, and configuration settings. The topics in the preceding section may apply to client systems, as well as many aspects of server architecture and configuration.
Web browsers: Web-based applications require only Web browsers — or do they? Often, these applications also require a Java run-time program, ActiveX controls, browser helper objects (BHOs), and other components. And Web browsers have versions, patches, and configurations of their own that often must be just right.
Terminal emulators: Many mainframe applications are accessed from PCs that run terminal emulation software, which is just another form of application client. As with client software of other types, you have to get versions, patches, configuration, and perhaps other factors right so that the software works properly.
Applications communicate. They communicate over networks to other applications, client systems, and network-based services for a variety of reasons, including authentication. Applications depend on networks just to function, in most cases.
Some of the ways in which applications depend on networks include
Domain name service (DNS): When applications need to establish communications with other systems (or when they want to know the name of the host that wants to communicate with them), they need to make queries to DNS servers. Most of the time, the underlying OS (on both servers and clients) contains information about server IP addresses and domain names. A common problem occurs when an application wants to communicate with a given host but doesn’t specify the fully qualified domain name (FQDN), such as fileserver.acct.company.com, and instead specifies just fileserver. In a recovery environment, the network’s domain name may be different (for example, recovery.company.com), and the desired host name may not exist in the recovery domain. You need to use a consistent approach to server naming so that applications can function correctly in a recovery environment and you can more easily maintain them in the primary environment.
Hard-coded IP addresses: Some applications have hard-coded IP addresses for network resources. When an application needs to communicate with another system, the application usually references the system by name, but developers sometimes use IP addresses. You need to identify, document, and remedy these hard-coded IP addresses if you want the application to work properly in a recovery environment.
Hard-coded network resource names: Instead of using a configuration file to store resource names, applications might have hard-coded network resource names for file servers, database servers, printers, and other network-based resources.
Authentication service: Applications often don’t authenticate users directly, but instead rely on an authentication service, such as LDAP, Kerberos, or Active Directory.
Network configurations: The network layers may contain configurations that facilitate proper operation of (or access to) an application. Examples include VPN connections to external entities, NAT settings, custom routes, router ACLs, and holes in firewalls.
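You can flush out some of these problems (hard-coded IP addresses and unqualified host names) before a disaster with a simple scan of configuration files. The sketch below is a rough illustration under simplifying assumptions: it uses naive regular expressions and a hypothetical key-equals-value config style, and real configurations need more careful parsing.

```python
# Sketch: flag hard-coded IPs and unqualified host names in config text.
# The patterns and the "key = value" format are simplifying assumptions.
import re

IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
# A "short" host name is a bare label with no dots, e.g. "fileserver"
# instead of the FQDN "fileserver.acct.company.com".
HOST_KEY_PATTERN = re.compile(r"^\s*(\w*host\w*)\s*=\s*([\w.-]+)",
                              re.IGNORECASE)

def scan_config(text: str) -> dict:
    """Return suspected hard-coded IPs and short host names."""
    findings = {"hard_coded_ips": [], "short_hostnames": []}
    for line in text.splitlines():
        findings["hard_coded_ips"].extend(IP_PATTERN.findall(line))
        match = HOST_KEY_PATTERN.match(line)
        if match and "." not in match.group(2):
            findings["short_hostnames"].append(match.group(2))
    return findings
```

For example, scanning a file containing `db_host = fileserver` and `api = 10.1.2.3` would flag both lines, while `ldap_host = ldap.company.com` would pass because it's fully qualified. Even an imperfect scan like this gives the DR team a starting list of settings to document and remediate.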
Most organizations use a change management process to control the changes that they make on production systems. The main benefit of formal change management to a DR program is that the organization has a complete record of all changes it makes to the production environment. This record can help the recovery team with the task of building the recovery environment: the team can more conveniently see the recent changes that were made, as well as why they were made.
Here are some additional considerations about change management and DR planning:
Change management doesn’t always record all the changes made to a production system. Routine, low-risk, and low-impact changes may be exempt from the change management process.
Systems or infrastructure that support specific applications may lie outside the scope of change management.
Secret changes that circumvent the change management process may not be recorded. The team that builds the recovery system has to work without knowledge of those changes, which could lead to unexpected results.
Many organizations use configuration management to track all changes they make to their environments. You can use configuration management systems to track all changes in all layers of the stack (possibly including hardware changes), enabling an organization to know precisely what’s happening on the systems that a configuration management system manages.
You can use some configuration management systems to rebuild application servers in a bare-metal-restore recovery (in which you recover a server in a single step: operating system, database, application, and data). Such a capability can be powerful for building recovery servers in an alternate processing center because it can simplify the server recovery procedure to just a few steps. In fact, with advance knowledge about a recovery processing center, you may be able to configure the configuration management system so that it creates application servers that function properly at the alternate processing site. Your mileage may vary.
In the section “Deciding where to keep your recovery data,” earlier in this chapter, I discuss off-site media storage in the context of backups. Backups are only a portion of the entire set of data that you may want to protect from loss by storing off-site. An organization should include all of the following types of data in its off-site storage strategy:
Backup media: Protects the organization against loss of vital online information if a catastrophic event occurs in the primary processing center.
Backup records: Information about which tapes (or disks) you used to back up which databases on which days. Vital for recovering applications from backup media.
Release media: The CDs and tapes that contain the software you purchase to run on your servers. If you purchase the software by downloading it, copy that software to CD or other suitable media, and then store a copy off-site.
Infrastructure diagrams and schematics: All the drawings and records that show how the current environment is assembled, including data flow, network addresses, and so on.
Software release and operations documentation: All documentation that comes with software, including installation, operations, programming, release notes, and so on. If you have soft copies only (maybe you ordered the software online and downloaded it), get copies stored off-site. If you have hardcopies only, either scan or photocopy them and store those copies off-site.
Software licenses and activation codes: Information that you need to activate software — for example, license codes.
Encryption keys: Keys you use to encrypt and decrypt files, stored data, and communications sessions.
Passwords: User IDs and passwords that gain access to key accounts at every layer in the stack so administrators can make an application fully functional.
Change management and configuration management records: The information that contains changes made to applications and supporting infrastructure. Vital when a recovery team attempts to recover and restart a critical application.
Inventory information: Lists of hardware, software, licenses, and whatever you use to track all the components in all the layers that support vital applications.
Disaster recovery plans: All the procedures, emergency contact lists, and other information you need to recover vital applications during a disaster.
Catalog of information stored off-site: A master list of everything that you store off-site: names, descriptions, creation dates, and so forth.
The preceding list gives you a lot of information to store at an off-site facility. Remember, store all the information you need to recover vital applications in a real disaster at this off-site facility. Assume the disaster will completely destroy the current information processing facilities and you’ll need to rebuild everything from only the knowledge and information stored at the off-site data center.
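The catalog of off-site information can itself be a simple structured record per stored item. This sketch shows one hypothetical shape for such entries; the field names and categories are illustrative, not a standard.

```python
# Sketch: a minimal off-site storage catalog entry and a helper to
# group entries by category. Field names are hypothetical examples.
from dataclasses import dataclass
from datetime import date

@dataclass
class OffsiteAsset:
    """One entry in the off-site storage catalog."""
    name: str
    description: str
    category: str     # e.g. "backup media", "release media", "DR plans"
    created: date
    location_id: str  # bin/slot identifier at the storage provider

def catalog_by_category(assets: list) -> dict:
    """Group catalog entries so a recovery team can pull a category fast."""
    grouped: dict = {}
    for asset in assets:
        grouped.setdefault(asset.category, []).append(asset)
    return grouped
```

Whatever form the catalog takes (a database, a spreadsheet, or a paper ledger), store a copy of the catalog itself off-site; a catalog locked inside the destroyed primary facility defeats its purpose.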
Reading the preceding list of data that you should store off-site, you may realize that the security of such a facility is of paramount importance. The last thing you want is a breach of security at the off-site storage facility that results in the disclosure of a great deal of vital information about your organization!
Consider only an off-site storage provider whose primary line of business includes this activity. An employee’s home or a bank safe deposit box isn’t an acceptable solution for off-site storage! Here are some requirements to include in your shopping list for off-site storage providers:
Secure siting: An unmarked building in a low-traffic area away from natural and man-made threats.
Multiple layers of physical security: At least two or three layers of access control (main entrance, inner storage area, clients’ individual storage areas) with badges or biometric authentication, video surveillance, man traps, and guards. Least-privilege access to all locations. Dual-custody access to client assets. Audit trails on all accesses. Employees with clean background checks.
Secure delivery vehicles: Double-locked to protect client assets. Drivers who have clean background checks. Effective and secure pickup and delivery procedures.
Secure procedures: Multiple approvals needed to retrieve assets from the secure facility. Thorough recordkeeping for transfer and storage of all assets.
24/7/365 availability: Ability to retrieve assets any hour of the day, any day of the year.
Location, location, location! Close enough to the primary business location so you can retrieve assets quickly, but not so close that the off-site storage facility gets caught in the same regional disaster as your primary processing site. All your off-site storage efforts would do you little good if an earthquake or hurricane damaged both facilities.
If your organization has several business locations — possibly including an alternate processing site — you can consider one of your own business locations as your off-site storage facility. Your alternate facility may not have all the features in the preceding list; you need to identify which features are available and perform a risk analysis to determine whether you need any additional security features.