Chapter 7

Planning System and Network Recovery

In This Chapter

bullet Preparing servers for a disaster

bullet Keeping the lines open in your network infrastructure

bullet Upgrading to standard application interfaces

bullet Understanding server-clustering architecture

Although disaster recovery planning is all about recovering critical business functions, data, and the associated applications, those applications can’t exist or operate without support from the systems they reside on and the networks that enable communication with everything else in your application ecosystem.

DR planning, in large part, is about prevention — not preventing disastrous events, but preventing the crippling aftermath of a disaster. In the context of systems and networks, this prevention involves building consistent, resilient servers and networks that are flexible and can accommodate the processing needs for the applications they support.

In this chapter, you can figure out what you need in order to build and maintain the server and network infrastructure that your DR plan requires. Every business is different in so many ways; so, rather than just give you the answers (which would make this book far too long), I give you everything you need to think about so that you can establish the best possible recovery plan for your critical applications and processes.

Managing and Recovering Server Computing

If data and applications are the soul of an IT-supported business process, systems are the body in which the soul resides. The system, like the body, needs to be a suitable vessel that allows the application to run correctly and provides access to that application’s data.

Systems’ resilience is the key to recoverability in the face of disaster. When I say systems’ resilience, I indeed mean to use the plural because the entire community of systems that supports an application before, during, and after a disaster needs this resilience. You may need to make sure the organization has a collection of systems, in different locations, that are ready to assume operational duties when a disaster occurs.

I’m not saying that you must have a second set of servers ready to go in another location. Many organizations can’t afford that kind of redundancy. Rather, your organization should have systems ready only when you need them, in the event that a disaster damages your primary servers or makes them unavailable for use. This need-specific setup could mean

bullet Hot servers already sharing the current workload

bullet Hot standby servers ready to take over at short notice

bullet Warm standby servers ready to take over with some preparation

bullet Cold standby servers ready for installation of applications and data

bullet Servers that you order when the disaster strikes and then load with your applications and data when they arrive

Which option an organization chooses depends on the time sensitivity and business value of the applications that servers support, as well as whether the organization can invest in a given recovery capability. Whichever option you choose, you have to identify and manage numerous technical issues. And, generally speaking, the faster you want to recover your processing ability, the more complicated and costly your solution needs to be. I discuss these issues in the following sections.

Determining system readiness

A top-down DR plan defines the most critical business processes, and it therefore identifies the applications and databases associated with those processes as critical. These critical applications and databases, in turn, pinpoint critical servers and supporting infrastructure. When you determine Recovery Time Objectives (RTOs — that is, how quickly replacement servers must be up and running), you can figure out how much time you have to get new servers ready to run those critical applications. The RTOs drive the arguments for hot sites versus warm or cold sites. (I discuss determining RTO and other key values fully in Chapter 5, and I talk about the distinction between and issues about hot, cold, and warm sites in Chapter 6.)

RTOs and hot/warm/cold sites are all about timing: An RTO determines how quickly you need to recover your systems in a disaster. If you measure your RTOs in minutes, you have hot servers in a remote location that are probably already doing production work in the form of load balancing. If you measure your RTOs in hours or days, you still have a hot site, but one with a failover capability, instead. If you measure your RTOs in days or weeks, your plan probably calls for a warm or cold site. It’s just about the speed of recovery.
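
To make that timing-to-strategy mapping concrete, here’s a rough sketch in Python. The cut-off values are only examples of the minutes/hours/days/weeks boundaries described above; your Business Impact Analysis and your budget set the real ones.

```python
# Rough illustration of how an RTO maps to a recovery-site strategy, using
# the minutes/hours/days/weeks boundaries described above. The cut-off
# values are examples only; your Business Impact Analysis sets the real ones.
def recommended_recovery_strategy(rto_hours: float) -> str:
    if rto_hours < 1:
        return "hot servers, active/active (already load balancing)"
    if rto_hours <= 72:
        return "hot site with failover capability"
    if rto_hours <= 14 * 24:
        return "warm site"
    return "cold site"

for rto in (0.25, 8, 96, 24 * 30):
    print(f"RTO of {rto:g} hours -> {recommended_recovery_strategy(rto)}")
```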

Speed to recovery isn’t the only consideration, though. Keep speed to readiness in mind as I dive into the issues that really matter for system recovery and readiness.

Server architecture and configuration

In Chapter 4, I discuss the need for inventory information at every level. When you inventory your systems, software, network devices, and other supporting assets, you can begin to identify the components that support critical business functions. After you identify these assets, you need to get the magnifying glass out and look more closely at critical application servers. At this level of scrutiny, you have to capture every tiny detail and determine the following information for every server:

bullet Hardware configuration: Find out everything about the hardware on a server, including

• Make, model, serial number

• Firmware (BIOS/CMOS) versions

• Number and type of CPUs

• Amount and type of memory

• Number, type, and hardware configuration of network adaptors

• Number, type, and hardware configuration of storage interfaces (for example, SCSI adaptors)

• Exactly how the hardware is assembled (order of adaptor cards, memory sticks, and so on)

• Attached peripheral devices (type, model, version, and so on)

bullet Operating system: Figure out everything about the operating system (OS) that’s running on the server, including but not limited to

• Version, release date, and patch level for the OS you’re using

• Patches installed (and the versions of those patches and even the order of installation, if you can find out)

• Components installed and their versions

• Boot configuration

• Recovery settings

You’re probably thinking, “This is a lot of detail!” But you need to know these details to ensure that application software functions properly and predictably.

bullet Resource configuration: Virtual memory, disk utilization settings, memory utilization, and how the OS makes resources available to applications and system processes. In the UNIX world, these are kernel parameters; in Windows, these are mostly configured in the Registry and in some administrative user interface functions. Regardless of the OS, system administrators usually manage these settings.

bullet Network and network services configuration: All the usual settings, including subnet mask, gateway, DNS server, directory server, and time server, as well as tuning settings, such as number of open connections and buffer allocation.

bullet Security configuration: A lot of these settings deal with event logging, system auditing, system-level access control configuration, patch download and installation, and user account settings.

bullet System-level components: Additional components installed at the system level, including

• Firewall

• Intrusion detection and prevention

• Anti-virus and other anti-malware

• System management agents

Be sure to get the versions and configurations for all of these components!

bullet Access management: The whole gamut of system-level access that includes

• User IDs, user ID and password configuration

• Configurations related to any centralized user management resources, such as LDAP or Active Directory

• Shared resources, meaning directories and other resources that users can access via the network

Inventorying all the information in the preceding list for a server could take you a good long time. Multiply the effort by the number of servers you have. I envision a very wide and deep spreadsheet in your immediate future.
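
If you want a head start on that spreadsheet, a simple script can capture at least the basics automatically. Here’s an illustrative Python sketch that uses only the standard library; it collects just a handful of the facts in the preceding list, and the rest (firmware versions, adaptor order, patch history) still has to come from your administrators or your configuration management tooling.

```python
# Minimal server-inventory sketch (illustrative only): gathers a few of the
# facts listed above using Python's standard library, then appends them to a
# CSV file. Real inventories cover far more detail and are usually collected
# by a configuration management tool rather than a one-off script.
import csv
import os
import platform
import socket
from datetime import datetime, timezone

def collect_basic_facts():
    facts = {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.getfqdn(),
        "os_name": platform.system(),        # e.g., "Linux" or "Windows"
        "os_release": platform.release(),    # kernel or build number
        "os_version": platform.version(),
        "architecture": platform.machine(),  # e.g., "x86_64"
        "cpu_count": os.cpu_count(),
    }
    # Physical memory is exposed differently per OS; this works on most
    # Unix-like systems and is simply skipped elsewhere.
    try:
        facts["memory_bytes"] = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        facts["memory_bytes"] = None
    return facts

def append_to_inventory(facts, path="server_inventory.csv"):
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(facts))
        if new_file:
            writer.writeheader()
        writer.writerow(facts)

if __name__ == "__main__":
    append_to_inventory(collect_basic_facts())
```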

Centralized system configuration management platforms are highly valuable, and they’re also quite expensive because of their complexity and the value they provide. The top tools in the configuration management space make identifying and managing the entire configuration for each server a far easier task. If you don’t have any such tools, you have to manage systems the hard way (in other words, manually).

Why this level of detail is important

Critical applications and data reside and run on your servers. Your servers’ intricate configuration permits your applications to run in the more-or-less predictable state that administrators and users are familiar with.

To ensure that a recovery server can support the correct and proper functioning of a critical application, configure the recovery server to match the original server as closely as possible. If you do otherwise, you introduce potential instabilities or changes in functionality that you never want to introduce, especially during a real-life disaster recovery operation. In such a situation, you already have enough chaos and disruption to deal with; adding application instability because of differences in server configuration could be the difference between your business’s survival and demise.

Understanding the case for consistency

If you haven’t delved into the depths of server configuration detail before, you’re probably gaining an appreciation for the complexity of modern operating systems and the people who manage them. If you want consistency across multiple servers in an environment, I applaud you for your wisdom. Simplicity and consistency are far easier to manage than complexity and inconsistency. Well, you can’t get rid of complexity in operating systems (because of the vast number of configuration settings), but you can make the configuration of systems more consistent, server to server. Configuration management tools can help you achieve this consistency.

Configuration management can also give you other significant advantages:

bullet Reduction in administrative errors: System administrators can more easily manage identically configured systems, so they make fewer mistakes.

bullet Reduction in unscheduled downtime: When system administrators are making fewer mistakes on servers, those servers run more reliably.

Developing the ability to build new servers

You have to be able to make identical copies (well, as near to identical as is practical) of critical application servers and use those copies to run your applications when disaster strikes.

So you need to figure out how to build new servers that are like existing ones. This book isn’t about Windows (or UNIX, or whatever) system administration, so I won’t go too deep into that subject. It should suffice to say that you need to make new servers as similar as possible to existing ones so that applications can run on those new servers with as little fuss as possible. And this smooth transition makes building recovery systems after a disaster easier.

Building nearly identical servers for recovery purposes is one thing, but keeping those servers consistent is quite another. Server consistency requires two separate but related disciplines:

bullet Change management: The business process concerned with the proper development, analysis, and approval of changes made in a production environment, at all layers. The goal of change management is to expose potential risks and other issues that could jeopardize proposed changes before they occur. Proper change management gives you higher system availability and fewer unscheduled outages.

bullet Configuration management: The process of recording all changes made to all components (at all layers) in an environment. The central repository is known as the configuration management database (CMDB), which stores every detail about the systems under its management.

Typically, change management and configuration management relate to each other in this way: You use configuration management to document the changes that the change management process has analyzed and approved.
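
Here’s an illustrative Python sketch of that relationship: an approved change record is what authorizes an update to a configuration item (CI), and the CMDB keeps the history. The class and field names are hypothetical, not the schema of any real CMDB product.

```python
# Illustrative sketch of the change-management / configuration-management
# relationship: configuration management records only changes that change
# management has already analyzed and approved. Field names are made up.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class ChangeRecord:
    change_id: str
    description: str
    approved: bool = False
    approved_on: Optional[date] = None

@dataclass
class ConfigurationItem:
    ci_name: str                                   # e.g., "payroll-db-01"
    attributes: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def apply_change(self, change: ChangeRecord, updates: dict) -> None:
        # Refuse to record a configuration change without an approved change.
        if not change.approved:
            raise ValueError(f"{change.change_id} has not been approved")
        self.attributes.update(updates)
        self.history.append((change.change_id, updates))

# Example: record an approved OS patch on a critical application server.
change = ChangeRecord("CHG-1042", "Apply OS security patch", approved=True)
server = ConfigurationItem("payroll-db-01", {"os_patch_level": "SP1"})
server.apply_change(change, {"os_patch_level": "SP2"})
print(server.attributes)   # {'os_patch_level': 'SP2'}
```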

How you keep your recovery server configurations consistent depends a lot on your speed to recovery: If your DR plan calls for rapid failover to hot servers, you need to make the means for updating those hot standby servers as close to automated as possible. If you install patches or make other changes to the primary server, make those same changes to the recovery servers as soon as possible. Letting servers get too far out of sync could invite trouble during a recovery operation.

If, on the other hand, your recovery servers are stored in a closet someplace, you need to queue up all of your changes and apply them periodically so that those servers aren’t too far out of sync if a disaster strikes.

Your DR plan may specify that you purchase servers when a disaster strikes and then build them after they arrive. In a situation such as this, you might invest in a set of tools that you can use to duplicate a server configuration onto another server.

The bottom line is that you must keep your recovery servers’ configuration consistent with your primary servers. How you maintain this consistency depends on how quickly you need to begin using your recovery servers in a disaster.
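
One practical way to verify that consistency is a periodic drift check that compares what your inventory or CMDB says about the primary server against its recovery counterpart. The following Python sketch is illustrative only; the settings shown are placeholders for whatever details you actually track.

```python
# Illustrative drift check: compare the recorded configuration of a primary
# server against its recovery counterpart and report any settings that have
# fallen out of sync. The dictionaries stand in for your inventory or CMDB.
def find_drift(primary: dict, recovery: dict) -> dict:
    """Return {setting: (primary_value, recovery_value)} for mismatches."""
    drift = {}
    for key in sorted(set(primary) | set(recovery)):
        p_val = primary.get(key, "<missing>")
        r_val = recovery.get(key, "<missing>")
        if p_val != r_val:
            drift[key] = (p_val, r_val)
    return drift

primary_cfg = {"os_patch_level": "SP2", "antivirus_version": "8.1",
               "ntp_server": "time.example.com"}
recovery_cfg = {"os_patch_level": "SP1", "antivirus_version": "8.1",
                "ntp_server": "time.example.com"}

for setting, (p, r) in find_drift(primary_cfg, recovery_cfg).items():
    print(f"OUT OF SYNC: {setting}: primary={p} recovery={r}")
```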

Distributed server computing considerations

Many environments use a complex application architecture that includes components that reside on many servers, and not all of those servers are necessarily located in the same location. Distributed architectures (as these types of architectures are called) increase the complexity of an environment in steady-state and introduce additional issues that you must address in your DR planning.

This complexity is further exacerbated in cases in which other organizations own or operate one or more of the components in the application environment. With Internet-connected enterprises and application integration that’s fueled by business interoperability and made possible by newer technologies, such as Service-Oriented Architecture (SOA), disaster recovery planning assumes a much higher level of complexity. Organizations need to make additional plans for recovering these increasingly complex environments if one of those organizations is hit with a disastrous event.

Architecture issues

You may encounter these issues related to application architecture during your DR analysis and planning effort:

bullet Interfaces: If you have custom interfaces between the components of your distributed environment, it’ll take more effort (on the part of system developers or integrators) to improve resilience in the overall environment. Engage your application architecture personnel and urge them to develop strategic plans that include moving to standard interfaces, such as Service-Oriented Architecture (SOA).

bullet Latency: In a highly distributed environment, systems that communicate over great distances and/or over slow wide-area network (WAN) connections can experience latency (delays in the transmission of data from system to system). The behavior of the application may change in unexpected ways in a disaster scenario if the latency between components increases (or decreases) by a significant amount. Some parts of a distributed environment may not tolerate added latency at all, and parts designed around significant latency may behave differently if that latency suddenly decreases. (A quick latency-probe sketch appears after this list.)

bullet Network considerations: A distributed application environment that encompasses WAN connectivity needs to take network design into account. Distributed applications that were designed for, and implemented in, fast local-area networks may suffer performance degradation in a wide-area network. The cumulative effect of several lengthy hops across a WAN can slow response time and even cause network timeouts.
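
As a starting point on the latency question, you can at least measure the network paths involved. The following Python sketch times a TCP connection to a remote component; connect time is only a crude stand-in for application-level latency, and the host names and port are placeholders, but it lets you compare the LAN path you use today with the WAN path you’d use after a failover.

```python
# Rough latency probe (illustrative): time a TCP connection to a remote
# application component. Host names and the port number are placeholders.
import socket
import time

def tcp_connect_time(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the TCP connect time in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    for label, host in [("primary site", "db.primary.example.com"),
                        ("recovery site", "db.recovery.example.com")]:
        try:
            print(f"{label}: {tcp_connect_time(host, 1433):.1f} ms")
        except OSError as exc:
            print(f"{label}: unreachable ({exc})")
```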

Operational issues

Keep these common operational issues for distributed applications in mind when doing your system-level DR planning:

bullet Failure points: Distributed systems have more failure points (literally, the number of hardware and software components required to support the environment) than centralized systems, and a disaster is more likely to strike at least one of those failure points when they’re geographically dispersed rather than housed in a single, central location.

bullet Distributed recovery: A disaster that occurs in one location of a distributed environment may prompt a recovery operation that involves personnel in many locations.

bullet Third-party components: If a third-party service provider that hosts a vital element of your application experiences a disaster, that service provider’s priority list may differ from yours.

bullet Priority: In a distributed environment, a disaster that disables one component may bring additional recovery complications. The failure of a component at a remote location may be your highest priority but a lower priority for the organization that runs that location. For instance, a database server in a remote location may be only a medium priority for the organization in that location, even though it’s critical to your organization. If that database server fails, the organization that runs it may not be in any particular hurry to fix it, which can lead to longer downtime or a more difficult or time-consuming recovery for you.

Application architecture considerations

Because the bulk of DR planning is in preparation, an organization should consider a number of application architecture issues while developing its DR plan. You can more easily recover a resilient application that relies on standard components than you can an environment that relies more heavily on custom components and non-standard features. Figure 7-1 shows a typical application architecture.

Figure 7-1: A typical application architecture.

Introduce these issues into the long-term application architecture effort:

bullet Centralized authentication: Application architects and integrators should seriously consider a shift towards network-based authentication services instead of relying on authentication within the application. Sure, an application may still need to perform an authentication, but authentication data should be centralized, which improves the integrity of access controls across the organization in several ways:

Standard logon credentials: Users have fewer (as few as one) user IDs and passwords that they need to remember.

Streamlined authentication management: You need fewer personnel to manage the issuance and termination of access rights.

Fewer forgotten passwords: Because workers have fewer passwords that they need to remember (as few as one), they won’t forget their passwords as often, so you can spend less time performing password resets.

Examples of centralized authentication and identity services include LDAP, Microsoft Active Directory, Oblix, and IBM Tivoli Identity Manager. (A brief LDAP authentication sketch appears at the end of this section.)

bullet Standardized interfaces: Application architects should put SOA (Service-Oriented Architecture), Web Services, ETL (Extract, Transform, Load), and XML (Extensible Markup Language) on their roadmaps as the means for highly agile integration between applications, both within the enterprise and with external applications.

The approaches in the preceding list permit application designers and DR planners to think about the service components that support applications, separate from the applications themselves. Developing standard and centralized services permits an organization to streamline its application environment by plugging application components into a service-oriented framework that already provides basic services, such as authentication and data management.
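
To show what the centralized-authentication idea looks like from an application’s point of view, here’s a minimal Python sketch that authenticates a user with a simple bind against an LDAP directory, using the third-party ldap3 package as one example. The server name, base DN, and attribute layout are placeholders; production code would also handle TLS, referrals, and account-lockout policies.

```python
# Minimal sketch of application authentication against a centralized LDAP
# directory (ldap3 package used as one example). Names are placeholders.
from ldap3 import Server, Connection, ALL

def authenticate(username: str, password: str) -> bool:
    """Return True if the directory accepts a simple bind for this user."""
    server = Server("ldap.example.com", get_info=ALL)
    user_dn = f"uid={username},ou=people,dc=example,dc=com"
    conn = Connection(server, user=user_dn, password=password)
    try:
        return conn.bind()   # True on successful authentication
    finally:
        conn.unbind()
```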

Server consolidation: The double-edged sword

Server consolidation has been the talk of IT departments for several years and represents a still-popular cost-cutting move. The concept is simple: Instead of dedicating applications to individual servers, which can result in underutilized servers, you install multiple applications onto servers to more efficiently utilize server hardware, thereby reducing costs.

I’m all for saving money, electricity, natural resources, and so on. Consolidating servers is a smart move to undertake, as long as you remember that server consolidation is something to undertake during peacetime (normal operations), not in a disaster scenario.

Consider an environment that’s made up of dozens of underutilized servers dedicated to applications. The DR planning team wants to consider a DR strategy that consolidates these applications onto fewer servers to provide a lower-cost recovery capability.

Well, this consolidation might work, but the DR planning team should test it very thoroughly and carefully. Combining applications that previously had servers all to themselves may lead to unexpected interactions that could be difficult to troubleshoot and untangle.

If you want to undertake server consolidation, do it first in your production environment, where your architects, designers, developers, and operations staff can become familiar with it, and then apply that consolidated architecture to your DR architecture. You can far more easily implement something in a DR environment if you’re already familiar with it.

Security laws and regulations apply during disasters

Organizations need to protect every type of sensitive data, including

bullet Financial records and transactions

bullet Medical records

bullet Identity information, such as names, dates of birth, social insurance numbers, and passport numbers

bullet Intellectual property, including an organization’s secrets

In the onset or aftermath of a disaster, you can’t suspend security laws and regulations. On the contrary, disaster recovery plans must take into account all the security requirements that apply during peacetime.

Many regulations exist to protect information about citizens and corporate entities from compromise and theft. Organizations that are in the midst of a disaster recovery operation aren’t exempt from these regulations, so you need to consider all applicable security regulations and requirements when developing an organization’s disaster recovery plan. Also consider that a disaster declaration may result from a malicious act, possibly a deliberate attempt to get an organization to let its guard down and make information easier to steal or sabotage.

Managing and Recovering Network Infrastructure

Networks and network services are the plumbing that permits applications to communicate with each other and with the people who use them. Although networks are generally a lot less complicated than the applications they support, they’re a lot more complicated than the router-firewall-hub architectures that businesses used in times past. Figure 7-2 shows a typical enterprise network architecture.

Networks are a lot more than just devices that move network traffic about. Networks perform often-invisible functions that enable communications within and among enterprises.

Ninety percent of good DR planning is knowing what makes your environment run today, especially when you’re dealing with networks and those invisible services.

To a great extent, the features in a network and the issues with network recovery are one and the same. After you identify your assets and features, you can incorporate that information into your DR planning effort.

Figure 7-2: A typical enterprise network architecture.

Tip

Consider your voice network capabilities (meaning your office telephones) as part of your network, whether you use analog, digital, or IP-based phones in your environment. Voice communications are as vital as data communications in most organizations, and perhaps even more so.

Consider these network features and issues:

bullet External network dependencies: An organization’s connection to the Internet depends on some configuration settings that external service providers maintain, including

Data circuits and trunks: Getting any sort of Internet data circuit (a network circuit that connects your internal data network to the Internet) or PBX trunk (a network connection between your internal phone system and the public telephone network) installed takes several weeks, even with expedited orders from local telephone companies. Circuits and trunks are also expensive, so organizations need to use some creative thinking to figure out how to provide this connectivity for DR purposes.

Domain name service (DNS): DNS is the glue that associates domain names (such as www.company.com or mailserver.company.com) with the IP addresses that systems use. Changing the IP address for a well-known service, such as a Web site, can take hours or days to propagate before Internet users can reach the site at its new IP address.

Publicly routable network numbers: The network connection established between an ISP (Internet Service Provider) and a business includes some fixed (non-changeable) IP addresses that are associated with that particular network connection. You generally can’t associate those IP addresses with another physical location.

Office telephone service: Aside from trunks, you have a lot of other considerations if you want to build or recover your voice network in a DR setting. Partner with your voice service provider to get a better idea of what issues you need to consider in order to recover your voice network.

Remember

DNS, network addresses, and especially data and voice circuits are long lead time items — functions that you must build well in advance of a disaster, especially if an alternate processing site depends on them. Even for a cold site, you need to pre-order these items and put them in place, unless you don’t mind waiting six weeks for connectivity at your DR site.

Network Time Protocol (NTP): You use NTP to provide accurate clock synchronization for servers and workstations. Business applications usually just work when you start up new servers in a new environment, but include NTP on your system build checklist anyway so you can make sure that this important function continues working. (A quick DNS and NTP readiness check appears in the sketch after this list.)

bullet Firewalls: They stop the bad stuff, such as worms and scanning attacks, from getting in — that’s the easy part. Firewalls also contain a list of rules that permit specific communications between servers inside the enterprise and servers or networks outside the environment. Often, network administrators introduce these rules to permit specific applications to communicate with systems in other organizations. If you need to move an application to an alternate processing site, you need to know all the original firewalls’ rules and correctly apply them to the firewall(s) at the alternate site.

bullet Network security devices: In addition to firewalls, organizations frequently use a variety of other means for protecting systems and networks:

• Intrusion detection and prevention

• Spam filters

• Web proxies and filters

• Load balancers

• Hardware encryption

• VLANs (virtual networks)

• DMZ (demilitarized zone) network segments

bullet Network equipment: Like servers, some network equipment — especially bigger routers and other non-commodity items — can take a while to obtain. You either need to have this equipment on hand in your alternate processing facility or have “first off the line” privileges with your suppliers.

bullet Management processes: Specifically, change management and configuration management. I describe these two disciplines in the section “Developing the ability to build new servers,” earlier in this chapter. Change and configuration management are vital for network equipment and configuration, not just for servers.

bullet Network architecture, routing, and addressing: The internal details of a network facilitate communication within the network and with external networks. Setting up network addressing (the IP addresses that are assigned to systems and devices) and routing (the means through which systems on different networks can communicate with each other) involves a lot of detail that applications depend on.

bullet Network management: Medium-sized and larger organizations often use a network management application to monitor and manage all the network devices (and, often, servers) in the environment. Some organizations’ networks are so tightly coupled to their network management platforms that the networks can’t run without the management systems.
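
Because DNS and NTP are exactly the kind of invisible services that get forgotten until failover day, a small readiness check can pay for itself. The following Python sketch is illustrative only: it confirms that a critical name already resolves to the recovery site’s address and that an NTP server answers a basic query. The host names and the expected IP address are placeholders.

```python
# Illustrative pre-failover check for two "invisible" network services:
# DNS (does the name resolve to the recovery site's address?) and NTP
# (does a time server answer?). Host names and the IP are placeholders.
import socket

def dns_points_at(name: str, expected_ip: str) -> bool:
    """Return True if the name currently resolves to the expected address."""
    addresses = {info[4][0] for info in socket.getaddrinfo(name, None)}
    return expected_ip in addresses

def ntp_responds(server: str, timeout: float = 3.0) -> bool:
    """Send a minimal SNTP client request (RFC 4330) and wait for a reply."""
    packet = b"\x1b" + 47 * b"\0"          # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, 123))
            data, _ = sock.recvfrom(48)
            return len(data) >= 48
        except OSError:
            return False

if __name__ == "__main__":
    print("DNS ready:", dns_points_at("www.example.com", "192.0.2.10"))
    print("NTP ready:", ntp_responds("time.example.com"))
```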

Remember

The completion of the Business Impact Analysis (discussed in Chapter 3), together with the establishment of Recovery Time Objectives (RTOs), largely determines the speed at which you need to recover critical business applications. A rapid speed to recovery can cost quite a lot. The process of establishing RTOs and then determining their costs can get repetitive: A first pass could well determine that recovering an application within an arbitrary period of time would cost more than the value the business derives from the process that the application supports. DR planning is, for sure, a time of soul searching and difficult decisions.

Implementing Standard Interfaces

You can often more easily extend and change an application architecture that’s based on open standards than one that’s built with custom interfaces. Open standards are the programming and communications standards that applications and systems are built on. Here are some examples of open standards:

bullet TCP/IP: The network protocol of the Internet. Dozens of open standards fall within TCP/IP, including SMTP (Simple Mail Transfer Protocol) that makes e-mail work, DNS (domain name service) that’s used to translate names into IP addresses, NTP (Network Time Protocol) that synchronizes system clocks, HTTP (HyperText Transfer Protocol) that Web browsers use to request data from Web servers, and SNMP (Simple Network Management Protocol) that’s used to manage network devices and systems.

bullet World Wide Web: Encompasses protocols and standards that support the Web.

bullet GSM: The cellular telephone communication standard used in most of the world.

To make applications and systems more resilient and recoverable, organizations may have to add or change components that can facilitate recovery operations. Here are some examples:

bullet Upgrading interfaces to SOA: Service-Oriented Architecture (SOA), sometimes known as Web Services, is a newer method that you use to integrate applications without having to build custom interfaces over low-level protocols, such as CORBA (Common Object Request Broker Architecture), DCOM (Distributed Component Object Model), or .Net.

bullet Implementing centralized identity management: Centralizing authentication within the environment permits applications to do what they do best — provide access to business information. Technologies and products that provide centralized identity management include LDAP (the open standard Lightweight Directory Access Protocol), IBM Tivoli Identity Manager, Microsoft Identity Integration Server, Novell Identity Manager, Oracle Identity Management, and Sun Java System Identity Manager.
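
To give you a feel for what a standard, loosely coupled interface looks like in practice, here’s a small illustrative Python sketch that calls a web service over HTTP and parses its XML reply with the standard library. The URL and the XML element names are made up for the example; a full SOA stack (WSDL, SOAP envelopes, service registries) involves more machinery than this.

```python
# Illustrative only: consume a hypothetical XML-over-HTTP web service with
# Python's standard library. The URL and element names are placeholders.
import urllib.request
import xml.etree.ElementTree as ET

def get_order_status(order_id: str) -> str:
    url = f"https://services.example.com/orders/{order_id}/status"
    with urllib.request.urlopen(url, timeout=10) as response:
        tree = ET.parse(response)
    # Expected reply shape: <order><id>123</id><status>shipped</status></order>
    status = tree.findtext("status")
    if status is None:
        raise ValueError("response did not contain a <status> element")
    return status
```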

Implementing Server Clustering

Applications that are critical to important business processes often require higher availability, and you need to be able to quickly recover them in another location. Organizations that need the ability to recover these applications within minutes of a failure frequently use server clustering. Clustering is an expensive proposition, but it’s the method of choice for applications that require rapid recovery.

A server cluster is a tightly-coupled collection of two or more servers that are configured to host one or more applications. In his book In Search of Clusters (Prentice Hall), Gregory Pfister describes a cluster as “a parallel or distributed system that consists of a collection of interconnected whole computers that are utilized as a single, unified computing resource.” In other words, a cluster is a collection of computers that appear as a single computer to end users.

The servers in a cluster coordinate with each other to ensure that at least one of the servers is running the applications. They coordinate by communicating with each other through a fast network, using clustering software that manages a complex set of tasks. Figure 7-3 shows a server cluster architecture.

Figure 7-3: A typical server cluster.

Server clusters improve the availability of the applications they serve by reducing the effects of

bullet Hardware failure: Failures in components such as CPU, RAM, bus, or boot drive result in an immediate, unplanned outage for critical applications running on the system.

bullet Software failure: A malfunction in software can result in a server lockup or crash. You may be thinking, “If a software failure occurs on one server in the cluster, wouldn’t it also occur in other servers in the cluster?” Yes, possibly. This scenario is certainly worth thinking about while application and system engineers try to figure out an ideal architecture for a clustered application.

bullet Network failure: You couple a properly designed system cluster with a network architecture that has similar redundancies. If you experience a failure in the network, other network components and paths should still be available, making at least some of the servers in the cluster still reachable from client systems.

bullet Maintenance: With two or more systems in a cluster, you can much more easily perform regular maintenance activities on individual servers. You can take individual servers in a cluster offline, shut them down, and perform maintenance on them — all without affecting application availability.

bullet Performance issues: You can configure a cluster to permit more than one server in the cluster to serve applications. I discuss cluster configuration more in the following section, which talks about cluster modes.

Understanding cluster modes

Server clusters generally operate in one of two basic modes, which are based on how the servers in the cluster are configured to operate day by day:

bullet Active/active: In this configuration, all the servers in the cluster run applications. You use this mode for a load-sharing scenario in which an application runs on several servers at the same time.

bullet Active/passive: This configuration consists of servers that are hosting applications (the active ones) and other servers that are in a standby mode.

Whether an organization chooses to run its clusters in active/active or active/passive mode depends on performance, availability, and disaster recovery needs. You need specially designed applications if you want several servers hosting those applications at the same time.
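
Here’s a toy Python illustration of the active/passive idea: the passive node watches for the active node’s heartbeat and takes over the application only when the heartbeat stops. Real clustering software also handles quorum, fencing, shared storage, and virtual IP takeover, none of which appears in this sketch.

```python
# Toy active/passive illustration: the passive node fails over only after
# the active node's heartbeat has been silent for too long. Illustrative
# only -- real cluster software handles quorum, fencing, and storage too.
import time

HEARTBEAT_TIMEOUT = 15.0    # seconds of silence before failing over

class PassiveNode:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.active = False

    def heartbeat_received(self):
        """Called whenever a heartbeat arrives from the active node."""
        self.last_heartbeat = time.monotonic()

    def check(self):
        """Called periodically; starts the application if the peer is silent."""
        silent_for = time.monotonic() - self.last_heartbeat
        if not self.active and silent_for > HEARTBEAT_TIMEOUT:
            self.active = True
            self.start_application()

    def start_application(self):
        print("Peer heartbeat lost -- taking over the application")
```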

Geographically distributed clusters

In the preceding sections of this chapter, I discuss server clusters that you set up in a data center, in which all of the servers in the cluster are located within a few feet of each other. When put into operation, the cluster provides not only much higher availability for the application, but also greater capacity if you configure the cluster as active/active.

This type of clustering does little for disaster recovery purposes, however. The entire cluster can be incapacitated if any of several events occur:

bullet Fire, flood, earthquake, tornado, or hurricane: Any of these events can directly destroy servers or render them incapable of operation.

bullet Widespread communications or other utility failure: A prolonged or severe event can result in communications or power utilities being unavailable for so long that short-term contingencies can’t compensate.

bullet Sabotage, vandalism, or terrorism: Somebody with determination can render all of the systems in a processing center inoperative for hours, days, or longer. Wire cutters, a sledge hammer, a bomb, or a degausser can quickly destroy several servers’ ability to host critical applications.

Any of these events can easily render all of the servers in a cluster inoperative or unreachable. Right? Not necessarily. You can make your server clusters geographically diverse. In other words, instead of placing servers side by side in a processing center, you can place them hundreds or thousands of miles apart in separate processing centers. These clusters are called geo-clusters or GD (geographically diverse) clusters.

The chief advantage of a geo-cluster is that you always have an application available, even when a widespread regional disaster completely destroys a processing center. Servers in the cluster that are located in surviving processing centers keep running, with little or no interruption. Figure 7-4 shows a typical geo-cluster configuration.

Figure 7-4: A typical geographically diverse server cluster.

Cluster and storage architecture

The architectures of clustered servers, storage systems, and the applications themselves are tightly coupled. Applications hosted on a server cluster must be designed to work properly with the physical architecture of the storage systems and with the way cluster failovers work. The integrity of application data is vital to the health of the application, especially when the servers and cluster architecture can change in real time.

You can use several technologies to protect application data, including resilient storage, mirroring, and replication. A server cluster needs one or more of these technologies. In fact, you can develop many such possible cluster-storage architectures. The following list breaks down cluster and storage architecture into the simplest terms. Choose a technology from each of the following groups to put together a complete clustered architecture:

bullet Cluster architecture: Pick one:

• Active/active

• Active/passive

You also need to figure out how many servers you want in your cluster and whether you want them located in the same processing center or hundreds of miles apart.

bullet Storage architecture: Pick one or more:

SAN (Storage Area Network): Select whether you want storage attached via SCSI, Fibre Channel Arbitrated Loop, or a switched Fibre Channel network.

NAS (Network Attached Storage): You connect storage systems to the network and access them with NFS (Network File System) or SMB (Server Message Block) network protocols.

bullet Data replication: Pick one or more:

Mirroring: The database, server OS, or storage system layers can perform the mirroring, in which changes to data on a storage system are copied to a remote storage system in real time.

Replication: The copying of transactions from one system to another.

bullet Network architecture: You have many choices available, including

Load balancers: For arbitrating access among several active servers

Round robin DNS: For load balancing across an active/active geo-cluster (see the short sketch at the end of this section)

Dynamic routing changes: For failover at the network level so that application clients are directed to active servers, wherever those servers are

The technologies in the preceding list are just the high-level details. Within every option, you have to make many more choices. And many intimate details of the application, in terms of how transactions work, matter when making choices, large and small. You can find entire books filled with information on application, storage, and cluster architecture. I have only enough room in this book to take you on a quick tour.
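
As a small illustration of the round robin DNS option mentioned in the preceding list, the following Python sketch resolves a single name to every address it currently returns; in a geo-cluster, those addresses would belong to active servers in different processing centers. The host name is a placeholder.

```python
# Illustrative round-robin DNS check: one name, several addresses, each of
# which would be an active cluster member. The host name is a placeholder.
import socket

def resolve_all(name: str) -> list:
    """Return every IPv4 address the name currently resolves to."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    for address in resolve_all("app.example.com"):
        print("cluster member:", address)
```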