Chapter 4
Defining a Highly Available Messaging Solution
As you saw in Chapter 1, “Business, Functional, and Technical Requirements,” when eliciting requirements, desired availability is often one of the first topics to be raised. It is important to note that one of your functions as a consultant or implementer is to make the businesses aware that raising the availability of any system has a direct cost implication.
With quality cloud-based solutions, messaging systems are no longer shackled to the limits of on-premises capabilities. Keep cloud-based solutions in mind as a possible fit for some of your highly available messaging solution requirements.
If you have deployed Exchange Online in conjunction with on-premises solutions, and you are running in Hybrid mode, then you need to consider how the Exchange Online service-level agreement dovetails with your on-premises service-level agreement. In traditional on-premises systems, the more available a system becomes, the more expensive it is to operate. The price of increasing availability, however, is often offset by the cost of an outage. For a large company or well-known brand, an email outage may become a visible public failure with an associated loss of confidence.
The definition of availability, and how that availability is measured, is one of the key factors in the choice of technology you will use to implement your design. One of the critical items to establish is how the availability of the system will be measured and what the context for that measurement will be. We will use the following formula to calculate availability levels:
percentage availability = (total elapsed time – sum of downtime)/total elapsed time
Using the example of 99.9 percent availability, which is often noted as “three nines” availability, this translates to the system suffering downtime of 8.76 hours on an annual basis. That may not sound like a lot; if the measure is changed to three nines per day, this translates to 1.44 minutes of downtime, which may or may not be enough time for a reboot to occur.
If you calculate the permutations from one nine to five nines, you will understand why you don't often hear of systems that are available beyond five nines, as shown in Table 4.1. This depends, however, entirely on how the availability is measured.
Table 4.1 Availability by the nines
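The values in Table 4.1 follow directly from the availability formula above. As a minimal sketch, assuming availability is measured over a 365-day year, the following Python snippet computes the permitted annual downtime for one through five nines:

```python
# Permitted annual downtime for "N nines" of availability,
# assuming a 365-day year (8,760 hours).
HOURS_PER_YEAR = 365 * 24

for nines in range(1, 6):
    availability = 1 - 10 ** -nines          # e.g., 3 nines -> 0.999
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.5%} availability allows "
          f"{downtime_hours:8.2f} hours of downtime per year")
```

Three nines works out to the 8.76 hours per year quoted above; five nines leaves barely five minutes per year.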
One of the many things you'll do during requirement elicitation is to help the business define and understand the services that Exchange delivers in terms of Exchange availability. At a minimum, Exchange availability is a superset of its dependency services. These services may be classified as follows:
Client access
Email transport
Email storage
For the sake of clarity, we will not define auxiliary services such as message hygiene, third-party application integration, or the many other examples that come to mind. Assuming you are satisfied to continue with the three basic services listed, your criteria may also specify that availability be measured per service. The availability of a system consisting of several independent critical components is the product of the availabilities of the individual components, calculated by multiplying those availabilities together in the following manner:
A(n) = A1 × A2 × … × An
In the case of three components, our equation becomes
A(3) = A1 × A2 × A3
Let us examine a hypothetical example of three nines availability across three hypothetical components, as shown in Table 4.2. Using Excel, you would list the three availability percentages and use the PRODUCT function to multiply them together, or you could simply multiply the first number by the second and then by the third. Because each factor is expressed as a percentage rather than a fraction, the product is inflated by a factor of 100 × 100, so you need to move the decimal point four places to the left, or divide by 10,000, to express the result as a percentage availability again. The result will look similar to Table 4.2.
Table 4.2 Component availability
Component | Percent Availability |
Client access | 99.9 |
Email transport | 99.9 |
Email storage | 99.9 |
Total availability | 99.7 |
Notice that the total availability is lower than the availability of any of the component pieces. In order to build a system with a particular stated resilience, the components' availability requires examination. This may include network, power, chassis, and storage availability to name just a few, because these are all dependent features in larger systems.
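To make the arithmetic behind Table 4.2 concrete, here is a minimal sketch that multiplies component availabilities together; the component names and figures simply mirror the hypothetical table above:

```python
# Total availability of a chain of dependent components is the product
# of the individual availabilities, expressed here as fractions.
components = {
    "client access": 0.999,
    "email transport": 0.999,
    "email storage": 0.999,
}

total = 1.0
for name, availability in components.items():
    total *= availability

print(f"Total availability: {total:.4%}")   # roughly 99.70%
```

Working in fractions rather than percentages avoids the divide-by-10,000 correction described earlier.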
Figure 4.1 illustrates the interdependency of component systems. Note that each one of the boxes of components can carry its own availability measurement when calculating total system availability.
Figure 4.1 clearly illustrates that it is extremely difficult to establish even theoretical total system availability when so many systems are interrelated. When agreeing on the resulting availability of the desired system, clarify the definition of availability, downtime, and scheduled downtime as it pertains to the business.
Scheduled system downtime classically is not included in how availability is measured—in our three nines example, 1.44 minutes per day does not allow for much action to be taken. However, if the measure is adjusted to per year, then 8.76 hours becomes much more plausible for system maintenance.
Defining the cost of downtime may appear to be an emotional measure, and it may seem somewhat intangible. In this section, we will look at turning some of the intangibles back into definable measures.
Often the actual cost of downtime may be difficult to measure without an understanding of the business in question and how the services provided by Exchange fit into the processes deemed mission critical. We will examine a number of examples of businesses in different vertical markets:
Each of these businesses has a critical path enabled by email and an average cost or value associated with it. For example, if an average day's worth of transactions is $500,000, and this amount represents 1,000 transactions, the average transaction value is easily calculated as $500. The financial transaction may have multiple email interactions associated with it, so there is little point in trying to calculate the average cost of an email. If the bulk of the transactions are occurring over a 10-hour period, and assuming that customers' transactions occur evenly throughout the day, then each hour's transactions may be worth $50,000.
Continuing with our example, assuming a two-hour outage during which email-based transactions cannot occur, the business may face a loss of at least $100,000. Very often, however, this represents only the tip of the iceberg in terms of financial losses, because the damage to the business's reputation may multiply this number in terms of transactions that will not occur in the future. That is, customers may elect to move their business to another company that they deem is more reliable. This is particularly prevalent in the retail, online, and premium brand segments.
If email is part of the critical transaction path, then the hourly cost of the workers who are attempting to facilitate the transaction, as well as of the in-house or third-party personnel who are endeavoring to rectify the failure, is added to the cost of the downtime. With this in mind, we are able to measure in part some of the more tangible losses attributable to downtime. We say "in part" because reputational and confidence losses may cascade for months or even years to come. In the following equations, we include the percentage impact, since rarely is the outage 100 percent, with the exception of an actual disaster, of course.
lost revenue = (gross revenue/business hours) × percentage impact × hours of outage
cost of the outage = personnel costs + lost revenue + new equipment
From these equations, we are able to compute a value. It is worth restating, however, that this value may represent only a portion of the total attributable loss due to confidence and reputational loss.
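As a worked example of the two equations above, the following sketch plugs in the hypothetical retail figures from the earlier scenario; the personnel cost is an illustrative placeholder rather than a figure from the text:

```python
# Cost-of-outage sketch using the lost revenue and outage cost equations.
gross_revenue_per_day = 500_000   # dollars of email-enabled transactions per day
business_hours = 10               # hours over which those transactions occur
percentage_impact = 1.0           # 1.0 means transactions stop completely
hours_of_outage = 2

personnel_costs = 5_000           # illustrative staff and remediation cost
new_equipment = 0                 # nothing needed replacing in this example

lost_revenue = (gross_revenue_per_day / business_hours) * percentage_impact * hours_of_outage
cost_of_outage = personnel_costs + lost_revenue + new_equipment

print(f"Lost revenue:   ${lost_revenue:,.0f}")     # $100,000
print(f"Cost of outage: ${cost_of_outage:,.0f}")   # $105,000
```

The $100,000 of lost revenue matches the two-hour outage example, and any reputational loss would come on top of this figure.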
A bank may argue that email is a non-mission-critical system, since the equipment and software required to process transactions during the day are represented via mainframes and traditional banking systems. Nonetheless, confidence loss comes into play here as well. Since email is ubiquitous, no bank is immune from the confidence loss, which may occur if its customers or partners are unable to email for two or three days at a time. This time frame is quite realistic when faced with even a small outage without redundancy in place.
Failure of systems or components is inevitable. However, you will be better able to plan for failure once you understand that failure is not a random event that should take you by surprise. Rather, it is a scenario you can plan for carefully. We will use drives as our topic for the discussion of failure, because there will be more drives in your deployment than there are servers, racks, or cooling units. Nevertheless, the logic presented here applies equally to all of these components.
Hardware components have a published annual failure rate (AFR), which is simply the rate at which the component is expected to fail. One hundred SATA drives assembled in a system will suffer an approximate failure rate of 5 percent. Does that mean that for every 100 drives, we need to keep 5 spares available on a shelf? Not quite. It does, however, lead us down the road of planning for failure, as we consider factors such as database distribution within a database availability group (DAG) or how many servers to deploy in order to satisfy a services dependency.
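To put the "five spares" question into numbers, here is a rough planning sketch; it assumes failures are independent and models the annual failure count as a binomial distribution, which is a simplification:

```python
import math

# Expected annual drive failures for a population of drives with a given
# annual failure rate (AFR), assuming independent failures.
drive_count = 100
afr = 0.05                                   # 5 percent AFR

expected_failures = drive_count * afr        # about 5 per year

# Probability of needing more than `spares` replacements in a year,
# modelled as a binomial distribution.
spares = 5
p_exceed = 1 - sum(
    math.comb(drive_count, k) * afr**k * (1 - afr)**(drive_count - k)
    for k in range(spares + 1)
)

print(f"Expected failures per year: {expected_failures:.1f}")
print(f"Chance of needing more than {spares} spares: {p_exceed:.0%}")
```

Under these assumptions there is close to a 40 percent chance that five spares will not be enough in a given year, which is why failure planning goes beyond stocking the "expected" number of spares.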
Earlier, we defined total availability as the product of component availability. However, you will recall that the three components with an availability of 99.9 percent had a total availability of only 99.7 percent. A similar mathematical model is available when we add multiple components to a given system.
The logic for calculating the probability of failure when using identical redundant components is to raise the probability of failure of a single component to the power of the number of components, rather than taking a product of differing values. The probability of failure of a system consisting of several identical redundant components is the product of the probabilities of failure of each component. Because all of the probabilities are identical, the product becomes a power:
P(n) = P × P × … × P = Pⁿ
The difference between the availability and probability of failure is that in the first case, each component must be available in order for the entire system to be available, whereas in the second case, each component must fail in order for the entire system to fail. We will use probability of failure of identical SATA drives to demonstrate this principle.
Remembering that SATA drives have an expected failure rate of 5 percent, a system containing a single drive has a probability of failure equal to that of the drive itself: P(1) = P = 5%.
Things become more interesting as soon as we add a second identical component to the same system. For a system of two identical drives, we will use the notation P(2), for three identical drives P(3), and so on. Assuming up to four components in a system, you will note that the probability of failure drops considerably.
P(1) = P = 5%
P(2) = P² = 0.25%
P(3) = P³ = 0.0125%
P(4) = P⁴ = 0.000625%
As the probability of failure becomes smaller, the availability value of the system increases. Availability (A) is calculated as A(n) = 1 – P(n). Using our one-drive example, this becomes A(1) = 1 – 5% = 95%. As we add multiple drives or identical components to a system, system availability increases radically:
A(1) = 1 – 5% = 95%
A(2) = 1 – 0.25% = 99.75%
A(3) = 1 – 0.0125% = 99.9875%
A(4) = 1 – 0.000625% = 99.9994%
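The same arithmetic can be expressed as a minimal Python sketch; the 5 percent figure is the SATA drive example used above:

```python
# Probability of total failure, and resulting availability, for a set of
# identical redundant components that each fail with probability p.
p = 0.05   # 5 percent per-component probability of failure

for n in range(1, 5):
    p_failure = p ** n                 # all n components must fail together
    availability = 1 - p_failure
    print(f"{n} copies: P(failure) = {p_failure:.6%}, "
          f"availability = {availability:.4%}")
```

The output reproduces the P(1) through P(4) and A(1) through A(4) values listed above.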
While we have used this logic to demonstrate failure rates with drives, we can also apply the identical logic to calculate the probability and possible availability of individual systems or system components such as switches, servers, datacenters, and so forth.
When thinking about Exchange 2013, however, the logic is even simpler. Two CAS servers per location are better than one, and four database copies on SATA drives have a probability of failure of 0.000625 percent and, correspondingly, an availability of 99.9994 percent.
Taking it as a given that failures are an event for which you must plan, you can look for and mitigate failure domains. Failure domains are service interdependencies or shared components of a system that are able to reduce the overall availability of the system, or they may introduce a significant impact to the overall system should a failure occur.
An obvious example of a failure domain in a datacenter is a single power source. Should that power source fail, then the entire datacenter fails. Similarly, we can think of power to a rack, shared-blade chassis, non-redundant switches, and cooling systems as other obvious examples. A not-so-obvious example might be a storage area network (SAN). SANs tend to be designed with redundancy in mind. However, as we observed when we calculated the probability of failure of a single component, it makes sense to use multiple database copies. The benefit of those multiple database copies is nonetheless negated if they are stored on the same SAN, because the probability of failure is now identical to that of the SAN itself, as opposed to a fraction of what it could be, even when choosing multiple, identical, cheap drives.
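To put numbers on the shared-storage point, the following sketch compares four copies on isolated drives with four copies whose fate is tied to a single shared array; the 1 percent SAN failure probability is purely an assumed figure for illustration:

```python
# Illustrative comparison: four database copies on isolated drives versus
# four copies stored on one shared SAN (a single failure domain).
p_drive = 0.05    # per-drive probability of failure (SATA example)
p_san = 0.01      # assumed probability of failure for the whole SAN

isolated = p_drive ** 4      # all four independent drives must fail
shared = p_san               # one SAN failure takes out every copy

print(f"Isolated drives: P(failure) = {isolated:.6%}")   # 0.000625%
print(f"Shared SAN:      P(failure) = {shared:.2%}")     # 1.00%
```

However reliable the SAN is assumed to be, placing every copy on it caps the availability of the database at the availability of the SAN itself.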
Virtualization introduces similar failure domains. If you elect to virtualize, then know that each host represents a failure domain. It makes sense to distribute your Exchange components over different hosts, as opposed to centralizing them onto a single host.
When planning for failure, isolation and separation are great concepts to apply in your datacenters. Isolation of components—for example, using multiple cheap drives as opposed to shared storage—increases availability considerably. Separation significantly reduces the number of shared failure domains. When Exchange servers are distributed across multiple racks, servers, or even virtualization hosts, the probability of overall failure drops dramatically as the number of service interdependencies, or possible failure domains, decreases.
Well-publicized failures of public cloud-based services have taught us that as systems scale, complexity and interdependencies increase and failure becomes inevitable. Operational efficiency is a significant factor in maintaining availability. How you respond to a failure can significantly increase or decrease outage times.
Service-level agreement, recovery point objective, recovery time objective, high availability, and disaster recovery are common terms used when discussing availability. In this section, we will discuss each of these terms with regard to a messaging system, which is part of a larger IT ecosystem.
A service-level agreement (SLA) is an agreement between a business and an IT vendor, which defines the services that the IT vendor will deliver to the business, as well as the uptime, or availability, of each service. The SLA must include how availability will be measured. This could be a complex exercise, depending on the constituent pieces of each service. As an example, for which of the following is the Exchange storage tier considered available?
How SLAs are measured and reported is an important matter when documenting requirements. It is one of the items that you must clarify, turning assumptions into documented facts, using the skills you learned in Chapter 1.
Recovery point objective (RPO) and recovery time objective (RTO) factor strongly in SLA definitions. Figure 4.2 represents an example of a very simple system or non-highly-available server. It is worth starting with a simple example in order to baseline your understanding of these concepts. Just be aware that the example shown in Figure 4.2 is not representative of all the possible highly available configurations that are achievable using Exchange 2013.
High availability (HA) and disaster recovery (DR) are sometimes incorrectly used interchangeably when, in fact, they are quite different from each other. The one thing that they do have in common is that because of some level of duplication of hardware, software, networks, storage, or other components, the overall cost of the system goes up, commensurate with the level of redundancy required.
A high availability (HA) system is defined as a system that includes redundant components that increase the availability or fault tolerance of the overall system in a near-transparent manner, confined within a defined geography. It is also important to note that HA is a technology-driven function; that is, the technology used in making the system highly available is often the same technology that initiates a failover to another available system or component of that system. In other words, the technology is responsible for initiating the failover as well as deciding when to fail a system over.
Disaster recovery (DR) is defined as the restoration of an IT-based service. It includes the use of a separate site or geography, and it addresses the failure of an entire system or datacenter containing that system. It includes the use of people and processes to make DR possible. Lastly, DR is hardly ever seamless or swift.
Consider this example: Datacenter A houses multiple copies of data within a highly resilient cluster representing the implementation of an email service. Datacenter B has a single server with a near-identical specification as one of the servers in Datacenter A, except that it has a tape drive attached. If Datacenter A is lost, a restore of the last-known good backup will occur in Datacenter B—however long it takes. From that point forward, the single server in Datacenter B represents the restoration of the service that used to live in Datacenter A. There may be a significant gap in the data restored in Datacenter B, depending on when the outage occurred, as well as the point in time of the last known good backup. If no good backup can be found, then Datacenter B will offer a “dial tone” service for email, which means that customers may send and receive email. However, no historical mail, contacts, or calendar information will be present in their mailboxes. There is a sharp contrast between what was implemented in Datacenter A versus what was implemented in Datacenter B.
This kind of dramatic contrast between locations can be expected as companies figure out how to balance the cost of a DR facility with the stated RPO and RTO. The lower the RPO and RTO, the higher the cost of the overall solution.
As stated at the beginning of the chapter, raising the availability of any system has a direct cost implication. Exchange is in a league of its own in terms of interdependency with other systems, as illustrated in Figure 4.1. Following are a number of factors to consider that will dramatically influence both cost and complexity when planning for high availability:
This is not meant as an exhaustive list of all possible factors. Your own analysis of your environment may yield a number of other factors that may be relevant.
Once you identify the factors influencing availability, you can evaluate each in an attempt to mitigate them. For example, let's say your analysis has shown that cooling and the centralization of all IT into a single datacenter represent a single point of failure. The business will remedy the lack of cooling redundancy, but it will not build or rent another datacenter. You may want to capture this and other potential factors as demonstrated in Table 4.3.
Table 4.3 Availability factors and mitigation
Factor | Detail | Mitigation |
Datacenter | Only one datacenter exists currently, which makes it impossible to achieve the stated disaster recovery goals. | The customer has been advised of the risk and has chosen to not invest in another datacenter. The stated disaster recovery goals have been adjusted accordingly to reflect a longer time to recover. |
Cooling | Two cooling units service the datacenter with insufficient individual capacity to assume the full cooling requirement in the event of failure. | The customer has elected to upgrade the cooling units. |
Virtualization | The customer has deployed a virtualization solution on a shared-storage SAN, which automatically moves guests to the least-busy node. The customer wishes to deploy a DAG. | After reviewing the Exchange support statement and possible storage models, the customer has decided to deploy on physical hardware. |
Network/mail flow | The customer has deployed a point solution for branding and hygiene in the DMZ. No redundancy exists. | The customer has elected to replace a point solution with a cloud-based equivalent. This equivalent presents an SLA with 99.99 percent availability. |
When calculating availability (remember that total availability is a product of all the availability factors), the biggest influence on total availability is the component in the entire chain that is most likely to fail. As you saw in Table 4.2, when calculating total availability, a single, non-redundant machine responsible for one factor, such as networking or mail flow, can drag the total availability down quite significantly.
In order to tie the concepts in this chapter together, let us consider the concepts introduced here as well as the high-availability features built into Exchange that were presented in Chapter 3, “Exchange Architectural Concepts.”
When setting a high-availability goal, consider your recovery priorities; that is, do you want to recover quickly? Do you want to recover to the exact point of failure? Or do you want to do both? Also consider which services are critical and require a level of redundancy or higher availability. Finally, you also need to decide which of the features introduced in Chapter 3 make sense in terms of your high-availability requirements versus those that are just nice to have.
The following sections examine major Exchange concepts or features in the light of high availability.
Transport is the superset of features and functions of the Hub Transport role, including the sending and receiving of messages via SMTP as well as the redundancy features included in Exchange 2013, such as shadow redundancy and shadow transport.
In Chapter 3, you learned that Safety Net is an improved version of the transport dumpster in Exchange 2010. Shadow redundancy was similarly introduced in Chapter 3. Due to the automation capabilities in Safety Net and shadow redundancy, very little else needs to be considered—except for the location of the transport queue. If you are planning for an outage of any sort that may cause email to queue or to be deferred, as in the case of Safety Net, make sure that you allocate enough storage on the volume containing the transport queues. Safety Net and shadow redundancy both help transport achieve a lower RPO. Transport itself, however, may cause an outage if the queue location has not been taken into account.
During an outage, email queues may grow considerably. If Exchange is installed on the system volume, or the queues are not located on a dedicated volume, then an outage may be exacerbated considerably by a disk-full condition on an Exchange server.
Your availability concerns for transport will include:
Namespace planning is often a misunderstood topic. The implications for high availability, however, are substantial. The cheaper side of misunderstanding this topic is an incorrect set of names on a Subject Alternative Name (SAN) certificate. The more expensive side is a total lack of system availability after a failover event has occurred.
As presented in Chapter 3, the minimum number of name spaces we need for Exchange 2013 has fallen to two. For example, using our Exchange-D3.com name space in a single datacenter scenario, we need the following:
Mail.Exchange-D3.com
Autodiscover.Exchange-D3.com
A graphical representation of the minimum number of name spaces required including a single internet protocol name space is shown in Figure 4.3. The details of each protocol may be found in Table 4.4.
Figure 4.3 Single name space
Table 4.4 Name spaces and protocols—single name space
Name | Protocol |
Autodiscover.Exchange-D3.com | Autodiscover |
Mail.Exchange-D3.com | SMTP |
Mail.Exchange-D3.com | Outlook Anywhere |
Mail.Exchange-D3.com | EWS |
Mail.Exchange-D3.com | EAS |
Mail.Exchange-D3.com | OWA |
Mail.Exchange-D3.com | ECP |
Mail.Exchange-D3.com | POP/IMAP |
Similarly, if we use a global load balancer or even Round Robin DNS, we are able to utilize a single name space. Table 4.4 is still valid in this scenario, because two virtual IPs representing each datacenter are stored in the global load balancer and presented to the external client simultaneously. The client will then attempt to access each datacenter's IP address as shown in Figure 4.4.
Figure 4.4 Single name space with global load balancer
We could quite easily extrapolate this example out to three, four, or more datacenters within a single name space. However, this assumes that connectivity from any point around the globe is roughly equal and that connectivity between datacenters is high speed in order to guarantee a good user experience.
For both of these examples, our SAN certificate entries remain simple:
Mail.Exchange-D3.com
Autodiscover.Exchange-D3.com
Assuming, however, that you would like to present a different name space for Exchange housed in different datacenters, as shown in Figure 4.5, the entries on your SAN certificate would read as follows:
Figure 4.5 Single name space across two datacenters
Remember from Chapter 3 that endpoint URL names are not important, since the autodiscover service is responsible for servicing endpoints to any requesting client. This implies that the URL endpoints may be named anything at all, as long as there is a valid path for the client to bind to the autodiscover endpoint.
You may choose to forego SAN certificates altogether and choose to implement a wildcard certificate; however, you are still required to plan your name spaces and define these in DNS.
Availability concerns for name space planning include:
Exchange Online presents its own set of SLAs, and it is of interest to us in terms of its interactions with on-premises Exchange. Assuming that your organization is running in Hybrid mode, there will be three on-premises points of interaction with Exchange Online. Specifically, these interaction points are as follows:
Exchange Client Access Servers providing Hybrid mode integration
Directory synchronization
Active Directory Federation Services (ADFS)
None of these is highly available by default, because each is deployed on a single server. The only possible exception to keeping a single server may be directory synchronization, since it is built as a no-touch software appliance by default, unless it is deployed using the full-featured Forefront Identity Manager with a highly available SQL instance.
Exchange Client Access Servers providing Exchange Hybrid mode integration may be a subset of the total number of Client Access Servers contained in your organization. If more than one of these exists, they will be load-balanced via some sort of load-balancing mechanism. Client Access Servers facilitating Exchange Hybrid mode are responsible for the interaction between Exchange Online and on-premises Exchange, and they directly facilitate the features that make the on-premises system and Office 365 appear as a single organization. With this in mind, you will do well to ensure that sufficient redundancy exists to guarantee availability during a server outage, as well as during periods of high server load.
Active Directory Federation Services (ADFS) enables external authentication against an on-premises Active Directory by validating credentials against Active Directory and returning a token that is consumed by Office 365, thereby allowing one set of Active Directory credentials to be used against both on-premises services and Office 365. ADFS may have a DMZ-based component (ADFS Proxy servers) alongside the LAN-based ADFS servers. ADFS Proxy servers are a version of ADFS specifically designed to be deployed in the DMZ, a network location separated from the production network by additional layers of firewalls. Since all that these Proxy servers do is intercept credentials securely and pass them on to LAN-based ADFS instances, they may not be required if an equivalent service is available via Microsoft TMG/UAG or similar.
These types of servers are great virtualization targets because of their light load. Depending on load, you will require a minimum of two ADFS servers and two ADFS Proxy servers.
Your availability concerns for Exchange Online/Hybrid mode include the following:
Database availability group (DAG) planning requires you to balance a number of factors. Most of these are interdependent and require significant thought and planning.
The theoretical maximum database size should not be based purely on the maximum database size supported by Exchange 2013. Large databases require longer backup/restore and reseed times, especially when over the 1 TB mark. Databases of 1 TB and upward are impractical to back up, and they should be considered only if enough database copies exist to remove the need for a traditional backup, specifically three or more copies. You need to strike a balance between fewer nodes with larger databases versus more nodes with smaller databases.
The number of database copies required in order to meet availability targets is a relatively simple determination. Early on, we discussed the number of disks or databases required in order to calculate a specific availability. If we have been given a stated availability target of 99.99 percent, then we will not be able to achieve such a target with a single database copy. Four copies within a datacenter is the minimum number required for a 99.99 percent availability target. Taking into account the number of databases is just one of the factors in our availability calculation.
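A minimal sketch of the copy-count reasoning, assuming the same 5 percent per-copy failure probability used earlier in the chapter:

```python
# Smallest number of independent database copies needed to meet an
# availability target, given a per-copy probability of failure.
p_copy_failure = 0.05          # 5 percent, as in the SATA drive example
target_availability = 0.9999   # the stated 99.99 percent target

copies = 1
while 1 - p_copy_failure ** copies < target_availability:
    copies += 1

print(f"Minimum copies for {target_availability:.2%}: {copies}")   # 4
```

With less reliable disks the count rises, and with a more modest target it falls, which is why the availability target should be agreed upon before the DAG design is finalized.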
In multi-datacenter scenarios, datacenter activation is a manual step, as opposed to the automatic failover provided by high availability. Therefore, a switchover requires more time and incurs more downtime than an automatic failover. While Exchange 2013 is able to automate a switchover event, we would argue that the business, via the administrator initiating the event, should wield that level of control, so that the state of Exchange is always known and understood.
When the second datacenter uses RAID to protect volumes on a single server, as opposed to individual servers with isolated storage, this slightly increases the availability of each individual volume and therefore slightly increases overall availability. In the case of three or more database copies, however, the additional gain will hardly justify the additional costs of doubling the disk spindles (depending on the RAID model) and the additional RAID controllers. Applying the principle of failure domains, it may be cheaper to deploy extra servers with isolated storage, as opposed to deploying the extra disks and RAID controller per RAID volume required to achieve higher availability.
The number of DAG nodes is driven not only by the number of copies required but also by how many nodes are required in a database availability group in order to maintain quorum. Quorum is the number of votes required to establish whether the cluster has enough votes to stay up or to make a voting decision, such as mounting databases. Quorum is calculated as (the number of nodes / 2) + 1, rounding the division down. A three-node cluster can therefore suffer a single failure and still maintain quorum. Odd-numbered node sets easily maintain this mathematical relationship; however, even-numbered node sets require the addition of a file share witness.
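The quorum arithmetic can be sketched as follows; the file share witness is modelled simply as one extra vote that is added only to even-numbered node sets:

```python
# Votes required for quorum, and how many votes a DAG can lose, with and
# without a file share witness (FSW) added to even-numbered node sets.
def quorum_required(voters):
    return voters // 2 + 1        # integer division: nodes/2 + 1, rounded down

for nodes in range(2, 7):
    fsw = 1 if nodes % 2 == 0 else 0
    voters = nodes + fsw
    needed = quorum_required(voters)
    print(f"{nodes} nodes (+{fsw} FSW): quorum = {needed} votes, "
          f"can lose {voters - needed} vote(s)")
```

A three-node DAG can lose one vote, while a two-node DAG needs the file share witness to survive the loss of a single node.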
The file share witness is an empty file share on a nominated server that acts as an extra vote to establish cluster quorum. The datacenter in which the file share witness is located may be considered the primary datacenter. In Exchange 2013, the file share witness may be located in a third datacenter, separate from the primary and secondary locations, thereby eliminating the risk of split brain, a condition that occurs when two datacenters become active for the same database copy. Changes are then written to different instances of the same database, which requires considerable effort to undo should the WAN link between the primary and secondary datacenters break. This placement, while not recommended, is supported, and Exchange 2013 is the first version of Exchange to support separating the file share witness into a third datacenter.
The distribution of databases on database availability group nodes has a direct impact on performance and availability. In order to demonstrate this concept, consider a four-node DAG with four database copies and with all databases active on Server 1, as shown in Figure 4.6.
Figure 4.6 Uneven database distribution
Server 1 will serve all of the required client interactions, while Servers 2, 3, and 4 remain idle, with the exception of log replay activity. Assuming Server 1 fails, all active copies fail with the server and, depending on the health of the remaining copies, may all activate on Server 2. This is a highly inefficient distribution structure.
Figure 4.7 shows how databases are distributed in a manner such that client and server load is balanced and failure domains are minimized (assuming the storage is not shared). Note that this symmetry is precalculated on a current version of the Exchange calculator.
Figure 4.7 Balanced database distribution
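The symmetry in Figure 4.7 can be approximated with a simple round-robin placement, rotating the activation preference for each database. This sketch is illustrative only; it is not the algorithm used by the Exchange calculator:

```python
# Illustrative round-robin placement of database copies across DAG nodes,
# rotating the activation-preference order for each database.
servers = ["Server1", "Server2", "Server3", "Server4"]
databases = ["DB1", "DB2", "DB3", "DB4"]
copies_per_db = 4

for i, db in enumerate(databases):
    # Activation preference 1 lands on a different server for each database.
    order = [servers[(i + offset) % len(servers)] for offset in range(copies_per_db)]
    print(f"{db}: " + " > ".join(order))
```

With this layout, each server hosts one active database during normal operation, and the failure of any single server displaces only that server's active copy rather than forcing every database to activate on the same node.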
In Chapter 3, we discussed how quorum is established and how Datacenter Activation Coordination (DAC) mode affects DAG uptime. Remember that if you have DAC mode enabled on your DAG and a WAN failure occurs, both datacenters will dismount databases in order to prevent split brain. By design, DAC mode may be the cause of an outage if it is not implemented correctly. If properly implemented, however, it will act as an extra layer of protection against split brain.
If WAN links are unreliable, and your DAG appears similar to Figure 4.8, consider planning your DAGs without DAC mode, as per Figure 4.9.
Figure 4.8 Single DAG with DAC mode
Figure 4.9 Multiple DAGs without DAC mode
A DAG may be split into two or more DAGs, with each datacenter maintaining quorum for its own DAG if a WAN failure occurs, similar to what appears in Figure 4.9.
The bandwidth required for intersite replication may be considerable. Let's consider an example of a four-node DAG, with the first node containing all database copies, as shown in Figure 4.10.
Figure 4.10 DB1 before seeding
As the first database copy is added, we generate one unit of replication traffic, as shown in Figure 4.11.
Figure 4.11 DB1 with one copy
As copies 3 and 4 are added, we have a multiple of this traffic, that is, replication traffic × 3, as shown in Figure 4.12.
Figure 4.12 DB1 with three copies
Exchange 2013 database replication is quite similar to that of Exchange 2010 in that it uses a single source as its replication master. Note that we are able to seed from any active or passive node in the DAG. This means that Server 3 can seed from Server 2. Likewise, Server 4 can seed from Server 3, all while Server 1 contains the active copy. If Server 3 and Server 4 are in different datacenters, then the replication traffic for all databases may outweigh the cost of placing two additional servers in the DR site.
Database sizing is critical, because reseed times are directly related to database sizes, especially if reseeds occur via WAN connections.
Assuming that you have established the AFR for your disks at 5 percent, you now have a potential number of reseed events that require planning. In Chapter 6, we will also discuss the automatic reseed capability of Exchange 2013. This capability is based on the forethought and planning required to ensure that the additional disks or LUNs have been allocated to each server so that Exchange may execute the automatic reseed.
Reseed times vary greatly across different networks. Therefore, it is vital to benchmark the time required to reseed a database under different conditions, because the reseed time factors directly into RPO and RTO times. Note that disks that are allocated as spares for automatic reseed targets suffer the same rate of failure as live production disks.
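A back-of-the-envelope sketch for that benchmark appears below; the database size and throughput figures are placeholders, and the measured throughput on your own network should be substituted:

```python
# Rough reseed-time estimate: database size divided by effective throughput.
database_size_gb = 1024            # a 1 TB database
effective_throughput_mbps = 200    # measured effective throughput in megabits/s

size_megabits = database_size_gb * 1024 * 8      # GB -> MB -> megabits
reseed_hours = size_megabits / effective_throughput_mbps / 3600

print(f"Estimated reseed time: {reseed_hours:.1f} hours")   # about 11.7 hours
```

Even this optimistic figure, which ignores protocol overhead and competing traffic, shows why large databases reseeding across a WAN can threaten RTO targets.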
Your availability concerns for database availability group planning include the following:
There is no single way to achieve high availability. However, after working through the first few chapters in this book and armed with a set of requirements, you'll be able to build an availability model based on a solid set of requirements and the methodology needed to defend your design choices.