Designing High Availability
This chapter covers the following topics:
Making Controller Connectivity More Resilient: This section describes a method of bundling multiple physical ports into a single logical link so that the wireless network can survive controller port failures.
Designing High Availability for APs: This section explains how APs can detect a controller failure, how they can join other controllers in an orderly fashion, and how they can automatically fall back to their original controllers after service has been restored.
Designing High Availability for Controllers: This section covers various approaches that can be used to offer redundant controllers and minimize the impact of a controller failure.
This chapter covers the following ENWLSD exam topics:
4.0 WLAN High Availability
4.1 Design high availability for controllers
4.1.a Network availability through LAG
4.1.b Stateful Switchover (SSO)
4.2 Design high availability for APs
4.2.a AP prioritization
4.2.b Fallback (assigning primary, secondary, and tertiary)
Cisco lightweight wireless access points normally need to be paired with a wireless LAN controller to provide a functional wireless network. If the controller fails for some reason, wireless service could be interrupted. This chapter discusses several features and mechanisms you can leverage to make wireless controllers more resilient and redundant, thus improving network availability for the end user.
The “Do I Know This Already?” quiz allows you to assess whether you should read this entire chapter thoroughly or jump to the “Exam Preparation Tasks” section. If you are in doubt about your answers to these questions or your own assessment of your knowledge of the topics, read the entire chapter. Table 9-1 lists the major headings in this chapter and their corresponding “Do I Know This Already?” quiz questions. You can find the answers in Appendix D, “Answers to the ‘Do I Know This Already?’ Quizzes and Review Questions.”
Table 9-1 “Do I Know This Already?” Section-to-Question Mapping
Foundation Topics Section | Questions |
---|---|
Making Controller Connectivity More Resilient | 1–3 |
Designing High Availability for APs | 4–7 |
Designing High Availability for Controllers | 8–10 |
1. Which of the following are likely reasons you would configure a LAG on a WLC?
To bind two controllers together into a logical HA pair
To make client roaming more efficient
To load-balance traffic across multiple links
To add redundancy to the WLC’s distribution ports
2. When you configure multiple distribution ports on a WLC to form a single logical link, you are forming which one of the following?
A MAP
A LAG
An SSO
A CAPWAP
3. Suppose you have configured a LAG on a controller. Which one of the following lists the negotiation method that must be used between the switch and the controller to successfully bring up the LAG?
PAgP
LACP
GLBP
None; the LAG cannot be negotiated.
4. Which one of the following makes a controller failure most disruptive to connected clients?
The controller must take time to find a replacement for itself.
The clients must take time to find a new controller to join.
The APs must take time to find a new controller to join.
The clients must wait for the Spanning Tree Protocol to unblock the links from the APs to the new controller.
5. You can configure the priority value on an AP to accomplish which one of the following?
To set the controller it will try to join first
To define which APs will be preferred when joining a controller
To set the SSID that will be advertised first
To identify the least loaded controller to join
6. Which one of the following is the default AP priority value?
Low
Medium
High
Critical
7. By default, which one of the following methods and intervals does an AP use to detect a failed controller?
ICMP, 60 seconds
ICMP, 30 seconds
CAPWAP keepalive, 60 seconds
CAPWAP keepalive, 30 seconds
CAPWAP discovery, 30 seconds
8. Suppose that an AP is joined to the WLC that is configured as the primary controller. At a later time, that controller fails and the AP joins its secondary controller. Once the primary controller is restored to service, which feature would allow the AP to rejoin it again?
CAPWAP Rejoin
AP Failover
AP Priority
9. Suppose a wireless design consists of two controllers and a number of APs. The APs are distributed equally across the two controllers. Each AP is configured with one controller as primary and the other controller as secondary. Based on this information, which one of the following redundancy models is being used?
No redundancy
SSO redundancy
10. Which one of the following controller redundancy designs is the least disruptive to APs and wireless clients when a controller fails?
N+1 redundancy
N+N redundancy
N+N+1 redundancy
SSO redundancy
Foundation Topics
A wireless network design is generally successful if the network is accessible in the places where users are located and the performance is satisfactory for the number of users gathered there. In other words, the network should be available, convenient, and efficient. At first glance, network availability might mean that a user can detect and join a live network. However, keeping a wireless network alive involves much more than providing RF coverage. Figure 9-1 illustrates the basic building blocks of a wireless network (labeled A through F), along with icons that denote possible failure points. Notice that each building block, including the links between them, can potentially fail. To make the network highly available, you should consider ways to improve each component’s resiliency.
Figure 9-1 Basic Wireless Network Building Blocks and Potential Failures
Table 9-2 lists the potential failure points illustrated in Figure 9-1. Redundancy for some components, such as switches (D), can be provided by following LAN switching network design best practices. Failures of the RF signal (A) and an AP (B) can be addressed manually, with a wireless design that places APs such that they completely overlap each other’s cell coverage, or automatically, through the use of Radio Resource Management (RRM), which is covered in more detail in Chapter 6, “Designing Radio Management.” Failures at the WLC level (E and F) can be mitigated by using design strategies presented in the sections that follow.
Table 9-2 Wireless Network Failure Points
Label | Component | Failure Mitigation |
---|---|---|
A | RF signal | Augment the coverage hole of missing RF with signals from neighboring APs. |
B | AP | Augment RF coverage hole with signals from neighboring APs; could co-locate APs on different channels for full fault tolerance. |
C | AP uplink | None; APs usually support only one wired Ethernet connection. |
D | Switch | Leverage switch stacking or pairing for redundancy, multiple links between switch layers. |
E | Controller uplink | Leverage multiple links between WLC and switches. |
F | Wireless LAN controller | Design and configure WLC high-availability features, AP Fallback, and anchor controller redundancy. |
Tip
Because the exam objectives covered in this chapter all begin with “design,” you are not expected to know how to configure the high-availability features on a WLC. Even so, each section ends with a tip that explains where you can find the relevant settings in a controller’s GUI.
Making Controller Connectivity More Resilient
Wireless LAN controllers have several distribution system ports that make physical connections to an external wired or switched network. These ports carry most of the data coming to and going from the controller. For example, the CAPWAP tunnels (control and data) that extend to each of a controller’s APs pass across the distribution system ports. Client data also passes from wireless LANs to wired VLANs over the ports. In addition, any management traffic using a web browser, Secure Shell (SSH), Simple Network Management Protocol (SNMP), Trivial File Transfer Protocol (TFTP), and so on, normally reaches the controller in-band through the ports.
Tip
You might be thinking that “distribution system ports” is an odd name for what appear to be regular data ports. Recall that the wired network that connects APs together is called the distribution system (DS). With the split MAC architecture, the point where APs touch the DS is moved upstream to the WLC instead, through the distribution system ports.
Because the distribution system ports must carry data that is associated with many different VLANs, VLAN tags and numbers become very important. For that reason, the distribution system ports always operate in 802.1Q trunking mode. When you connect the ports to a switch, you should also configure the switch ports for unconditional 802.1Q trunk mode.
The distribution system ports can operate independently, each one transporting multiple VLANs to a unique group of internal controller interfaces. However, if the link to one port fails for some reason, the controller would lose connectivity for the VLANs being carried over the port. For resiliency, you can configure distribution system ports in redundant pairs. One port is primarily used; if it fails, a backup port is used instead.
To get the most use out of each distribution system port, you can configure all of them to operate as a single logical group, much like an EtherChannel or port-channel on a switch. Controller distribution system ports can be configured as a link aggregation group (LAG) such that they are bundled together to act as one larger link. In Figure 9-2, the four distribution system ports are configured as a single logical LAG.
Figure 9-2 Cisco WLC Distribution System Ports Configured as a Single LAG
With a LAG configuration, traffic can be load-balanced across the individual ports that make up the LAG. The switch will compute a hash based on parameters in a packet’s IP header to decide which port to use to reach the WLC. For example, suppose an AP sends a CAPWAP packet to the WLC. The switch can use the source and destination IP addresses from the packet, as well as other methods, to select an egress port. As long as the switch is configured to use IP addresses as a load-balancing method, and as long as the IP addresses vary, the switch will be able to distribute traffic across the links in the LAG.
The WLC uses a different method for its outbound traffic across the LAG: packets are sent over the same port they arrived on. When the WLC receives a CAPWAP packet from an AP, it decapsulates the contents and forwards the packet onto the corresponding VLAN. That VLAN is reached through the switch, so the controller sends the packet out the same port where the incoming CAPWAP packet was received. As long as the switch is evenly distributing the packets it sends to the controller, the controller will follow suit with traffic it sends back across the LAG links.
The LAG also offers resiliency; if one individual link fails for some reason, traffic will be automatically redirected to the remaining working links instead. Even if multiple links fail, traffic will continue to be forwarded in and out of the WLC as long as at least one working link remains.
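The hashing and failover behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XOR-and-modulo hash, the function name `pick_lag_port`, and the IP addresses are hypothetical, and real switches use their own platform-specific hash algorithms.

```python
import ipaddress

def pick_lag_port(src_ip: str, dst_ip: str, working_ports: list[int]) -> int:
    """Pick an egress port by hashing the source and destination IP addresses.

    Hypothetical hash: XOR the two addresses, then take the result modulo
    the number of links that are still working.
    """
    if not working_ports:
        raise ValueError("no working links remain in the LAG")
    key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return working_ports[key % len(working_ports)]

wlc = "192.168.10.5"  # example addresses, not taken from the text
aps = ["10.1.1.10", "10.1.1.11", "10.1.1.12", "10.1.1.13"]

# All four links up: varying AP addresses spread across the ports.
all_up = [pick_lag_port(ap, wlc, [1, 2, 3, 4]) for ap in aps]

# Ports 2 and 3 have failed: every flow lands on a surviving link.
degraded = [pick_lag_port(ap, wlc, [1, 4]) for ap in aps]
```

Because the port is chosen only from the list of working links, traffic keeps flowing as long as at least one link remains, which mirrors the resiliency property of the LAG.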
The LAG depicted in Figure 9-2 does increase the controller’s availability by keeping it connected to the switch. However, the switch can become a single point of failure if it goes offline or has a faulty line card. A better design distributes the individual links of the LAG across multiple line cards in a single physical switch, across a stack of switches, or across a pair of switches configured as a single logical switch. The design shown in Figure 9-3 can survive a switch failure and individual link failures.
Figure 9-3 Improving WLC Availability by Distributing Links of the LAG
Tip
Be aware that even though the LAG acts as a traditional EtherChannel, Cisco WLCs do not support any link aggregation negotiation protocol, like LACP or PAgP, at all. Therefore, you must configure the switch ports as an unconditional or always-on EtherChannel. You can configure and verify the LAG mode by going to Controller > General > LAG Mode on next reboot.
Designing High Availability for APs
Cisco lightweight wireless access points need to be paired with a wireless LAN controller to function. Each AP must discover and bind itself with a controller before wireless clients can be supported. An AP can discover and build a list of live candidate controllers through prior knowledge of WLCs, through DHCP and DNS information, or by broadcasting on the local subnet to solicit controllers. Once an AP has discovered, selected, and joined a controller, it must stay joined to that controller to remain functional.
Now consider that a single controller might support as many as 1,000 or even 6,000 APs—enough to cover a very large building or an entire enterprise. If something ever causes the controller to fail, a large number of APs would fail along with it. In the worst case, where a single controller carries the enterprise, the entire wireless network would become unavailable. That might be catastrophic.
Fortunately, a Cisco AP can discover multiple controllers, not just the one that it chooses to join. Figure 9-4 shows this scenario, where the AP has joined WLC-A. If the joined controller becomes unavailable, the AP can simply select the next least-loaded controller and request to join it, as Figure 9-5 depicts. That sounds simple, but it is not very deterministic.
Figure 9-4 An AP Joins One of Several Discovered Controllers
Figure 9-5 An AP Joins a Different Controller After WLC-A Fails
For example, if a controller full of 1,000 APs fails, all 1,000 APs must detect the failure, discover other candidate controllers, and then select the least loaded one to join. During that time, wireless clients can be left stranded with no connectivity. You might envision the controller failure as a commercial airline flight that has just been canceled; everyone who purchased a ticket suddenly joins a mad rush to find another flight out.
The most deterministic approach is to leverage the primary, secondary, and tertiary controller fields that every AP stores in nonvolatile memory. Even after a reboot or power failure, the AP will remember the controllers it has “primed” in its configuration. If any of these fields are configured with a controller name or address, the AP knows which three controllers to try in sequence before resorting to a more generic search. Be aware that a controller name is not the same as its DNS entry; rather, it is the name string configured on the individual controller.
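The selection order can be sketched as follows. This is a simplified model, not the AP's actual join algorithm: the function name `choose_controller` is hypothetical, and "least loaded" is approximated here by the lowest joined-AP count.

```python
from typing import Optional

def choose_controller(primed: list[Optional[str]],
                      discovered: dict[str, int]) -> Optional[str]:
    """Return the controller an AP should try to join.

    primed:     [primary, secondary, tertiary] controller names (None if unset)
    discovered: live controllers mapped to their current AP count (assumed metric)
    """
    # Try the primed controllers in sequence first.
    for name in primed:
        if name and name in discovered:
            return name
    # Fall back to the more generic search: the least-loaded live controller.
    if discovered:
        return min(discovered, key=discovered.get)
    return None

live = {"WLC-B": 300, "WLC-C": 10}
deterministic = choose_controller(["WLC-A", "WLC-B", None], live)
generic = choose_controller([None, None, None], live)
```

With the fields primed, the AP lands on WLC-B even though WLC-C carries fewer APs; with no fields set, the outcome depends on whichever controller happens to be least loaded at that moment.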
Tip
When an AP boots and builds a list of potential controllers, it can use CAPWAP to build a tunnel to more than one controller. The AP will join only one controller, which it uses as the primary unit. By building a tunnel with a second controller ahead of time, before the primary controller fails, the AP will not have to spend time building a tunnel to the backup controller before joining it.
Tip
You can find the primary, secondary, and tertiary controller fields on an AireOS WLC by going to Wireless > All APs, selecting an AP’s name, and then selecting the High Availability tab.
As a wireless network grows, you might have several controllers implemented just to support the number of APs that are required. Each WLC platform is rated to support a maximum number of APs and must be licensed for some number of concurrently joined APs. It is not enough just to have multiple controllers in a network, even if they can all handle the total number of APs in use. A good network design should also take failures and high availability (HA) into consideration. What if the controllers are all in use and full of APs? If one of the controllers fails, there would not be enough room to spare for a large group of additional, displaced APs to join in their time of need. In the commercial flight analogy, there might be other flights departing the airport soon after the cancellation. If those flights are already mostly full of passengers, many people will be left waiting at the gate.
Figure 9-6 illustrates an example network that does not offer enough capacity to fully survive a controller failure. In the “before” diagram, a group of 400 APs has joined controller WLC-A, and a group of 300 APs has joined WLC-B. Suppose each controller has a maximum capacity of 500 APs. As long as both controllers stay up and functional, the wireless network should work fine. In the “after” diagram, WLC-A has failed. All 400 APs that were previously joined to WLC-A will discover that WLC-B is alive, so they will all try to join it. WLC-B already has 300 APs, so it has room for only 200 more. That means the first 200 APs to request to join WLC-B will be able to, but 200 more will be left out in the cold with no controller to join at all. Once controller WLC-B has the maximum number of APs joined to it, it will reject any additional APs.
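The arithmetic behind this example can be checked with a short sketch. The function name `rehome` is hypothetical; the numbers are the ones used in the scenario above.

```python
def rehome(displaced: int, survivor_load: int, survivor_capacity: int) -> tuple[int, int]:
    """Return (accepted, stranded) AP counts when displaced APs seek a new controller."""
    free = max(survivor_capacity - survivor_load, 0)
    accepted = min(displaced, free)
    return accepted, displaced - accepted

# WLC-A fails with 400 APs; WLC-B holds 300 of a possible 500.
accepted, stranded = rehome(displaced=400, survivor_load=300, survivor_capacity=500)
```

Here `accepted` and `stranded` both come out to 200, matching the APs that join WLC-B and the APs left without a controller.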
To provide some flexibility in supporting APs on an oversubscribed controller, where more APs are trying to join than a license allows, you can configure the APs with a priority value. All APs begin with a default priority of low. You can change the value to low, medium, high, or critical. A controller will try to accommodate as many higher-priority APs as possible. Once a controller is full of APs, it will reject an AP with the lowest priority to make room for a new one that has a higher priority.
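A minimal sketch of that behavior follows, assuming a controller that tracks each joined AP's priority. The function `request_join`, the AP names, and the tie-breaking rule (the first lowest-priority AP found is the one rejected) are assumptions for illustration only.

```python
from typing import Optional

# Priority names from the text, ranked from lowest to highest.
RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def request_join(joined: dict[str, str], capacity: int,
                 ap: str, priority: str = "low") -> Optional[str]:
    """Try to join `ap`; return the name of the AP that is rejected, if any."""
    if len(joined) < capacity:
        joined[ap] = priority
        return None
    # Controller is full: find the lowest-priority AP currently joined.
    lowest = min(joined, key=lambda name: RANK[joined[name]])
    if RANK[joined[lowest]] < RANK[priority]:
        del joined[lowest]        # make room for the higher-priority AP
        joined[ap] = priority
        return lowest
    return ap                     # the newcomer is the one turned away

wlc_b = {"ap-lobby": "low", "ap-er": "critical"}
bumped = request_join(wlc_b, capacity=2, ap="ap-icu", priority="high")
turned_away = request_join(wlc_b, capacity=2, ap="ap-cafe", priority="low")
```

In this run the high-priority AP displaces the low-priority one, while a later low-priority AP is refused because nothing joined has a lower priority.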
Figure 9-6 The Result of Undersized Controllers During a Failure
Tip
You can find the AP priority setting by going to Wireless > All APs and selecting an AP’s name. Select the High Availability tab and look for the AP Failover Priority drop-down menu.
When HA is required, make sure you design your wireless network to support it properly. Fortunately, Cisco APs and controllers are built with HA in mind, so you have several strategies at your disposal. First, it is important to understand how APs detect a controller failure and what action they take to recover from it.
Once an AP joins a controller, it sends keepalive (also called heartbeat) messages to the controller over the wired network at regular intervals. By default, keepalives are sent every 30 seconds. The controller is expected to answer each keepalive as evidence that it is still alive and working. If a keepalive is not answered, an AP will escalate the test by sending four more keepalives at 3-second intervals. If the controller answers, all is well; if it does not answer, the AP presumes that the controller has failed. The AP then moves quickly to find a successor to join.
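The escalation logic can be expressed as a small sketch. It models only the decision, not the timers, and the function name `controller_failed` and its list-of-booleans input are hypothetical.

```python
FAST_RETRIES = 4  # extra keepalives sent after a missed one, per the text

def controller_failed(answers: list[bool]) -> bool:
    """Decide whether the AP should presume its controller has failed.

    answers[0] is the regular keepalive; the next FAST_RETRIES entries are
    the escalated keepalives. Missing entries are treated as unanswered.
    """
    if answers and answers[0]:
        return False              # regular keepalive answered: all is well
    escalated = answers[1:1 + FAST_RETRIES]
    return not any(escalated)     # failed only if no escalated reply arrived

healthy = controller_failed([True])
recovered = controller_failed([False, False, True])
dead = controller_failed([False, False, False, False, False])
```

A single reply during the escalated phase is enough to keep the AP joined; only a complete run of unanswered keepalives triggers the search for a successor.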
Using the default values, an AP can detect a controller failure in as little as 35 seconds. You can adjust the regular keepalive timer between 1 and 30 seconds and the escalated or “fast” heartbeat timer between 1 and 10 seconds. By using the minimum values, a failure can be detected after only 6 seconds.
Tip
You can find the keepalive and fast heartbeat timer settings by going to Wireless > Access Points > Global Configuration and looking under the High Availability section of parameters.
Normally, an AP will stay joined to a controller until that controller fails. If the AP has been configured with primary and secondary controller information, it will join the primary controller first. If the primary fails, the AP will try to join the secondary. Without AP Fallback, the AP will then stay with the secondary until it fails, even if the primary controller is put back into service. The AP Fallback feature, a global controller configuration parameter, changes that behavior. If AP Fallback is enabled (the default), an AP can try to rejoin its primary controller at any time, whether its current controller has failed or not.
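The fallback decision reduces to a simple condition, sketched below. The function name and parameters are hypothetical and stand in for state that the AP and controller track internally.

```python
def should_rejoin_primary(current: str, primary: str,
                          primary_alive: bool, fallback_enabled: bool) -> bool:
    """True if the AP should leave its current controller for its primary."""
    return fallback_enabled and primary_alive and current != primary

# AP is on its secondary (WLC-B) and the primary (WLC-A) is back in service.
with_fallback = should_rejoin_primary("WLC-B", "WLC-A", True, True)
without_fallback = should_rejoin_primary("WLC-B", "WLC-A", True, False)
```

Note that the health of the current controller does not appear in the condition: with AP Fallback enabled, the AP moves back even though its secondary is still working.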
Tip
You can find this feature setting under Controller > General > AP Fallback.
Designing High Availability for Controllers
Building a wireless network with one controller and some APs is straightforward, but it does not address what would happen if the controller fails for some reason. Adding another controller or two could provide some redundancy, as long as the APs know how to move from one controller to another when the time comes.
Redundancy is best configured in the most deterministic way possible, so that APs know exactly what action to take if a controller fails and can take it as efficiently as possible. In other words, the APs should be able to recover from a failure event with a minimal disruption to the wireless users. The following sections explain how you can configure APs with primary, secondary, and tertiary controller fields to implement various forms of redundancy. The sections present a progression from basic to robust. As you read through the sections, keep in mind that redundant controllers should be configured consistently so that APs can move from one controller to another and operate exactly as before.
Tip
The following sections discuss WLC high availability for APs operating within an enterprise. For APs operating in FlexConnect mode, refer to Chapter 10, “Implementing FlexConnect,” to learn more about how high availability works with FlexConnect.
The simplest way to introduce HA into a Cisco wireless network is to provide an extra backup controller. This is commonly called N+1 or N:1 redundancy, where N represents some number of active controllers and 1 denotes the one backup controller.
By having one backup controller, N+1 redundancy can withstand a failure of only one active controller. As long as the backup controller is sized appropriately, it can accept all of a failed controller’s APs. However, once an active controller fails and all its APs rehome to the backup controller, there will be no space to accept any other APs if a second controller fails.
Figure 9-7 illustrates N+1 redundancy with a two-controller network for simplicity. The network could have any number of active controllers but only one backup controller. WLC-A is the active controller and normally carries 100 percent of the network’s APs. WLC-Z is the backup controller, which normally carries no APs at all. The backup controller sits idle until an active controller fails.
To configure N+1 redundancy, you configure the primary controller field on all APs with the name or IP address of an active controller (WLC-A, for example). The secondary controller field is set to the name or address of the backup controller (WLC-Z).
Figure 9-7 A Design Using N+1 Controller Redundancy
N+1 design is simple, but it has a couple of shortcomings. First, the backup controller must sit idle and empty of APs until another controller fails. That might not sound like a problem, except that the backup unit must be purchased with the same AP license capacity as the active controller it supports. That means the active and backup controllers must be purchased at the same price. Having a full-price device sit empty and idle might seem like a poor use of funds.
Second, the backup controller must be configured identically to every active controller it has to support. The idea is to make a controller failure as seamless as possible, so that the APs do not encounter any noticeable configuration differences when they move from one controller to another.
The N+N redundancy strategy tries to make better use of the available controllers. N+N gets its name from grouping controllers in pairs. If you have one active controller, you would pair it with one other controller; two controllers would be paired with two others, and so on. You might also see the same strategy called N:N or 1+1.
By grouping controllers in pairs, you can divide the active role across two separate devices. This makes better use of the AP capacity on each controller. Also, the APs’ and clients’ loads will be distributed across separate hardware while still supporting redundancy during a failure. N+N redundancy can support failures of more than one controller, but only if the active controllers are configured in pairs.
Figure 9-8 illustrates the N+N scenario consisting of two controllers: WLC-A and WLC-B. The APs are divided into two groups—one that joins WLC-A as primary controller and another that joins WLC-B as primary. Notice that the primary and secondary controllers are reversed between the two groups of APs. To support the full set of APs during a failure, each controller must not be loaded with more than 50 percent of its AP capacity.
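The 50 percent guideline can be verified with a one-line check. The sketch assumes both controllers in the pair have the same AP capacity, and the function name `nn_pair_survives` is hypothetical.

```python
def nn_pair_survives(load_a: int, load_b: int, capacity: int) -> bool:
    """True if either controller in an N+N pair can absorb its partner's APs."""
    return load_a + load_b <= capacity

# Two controllers rated for 500 APs each (example figures).
balanced = nn_pair_survives(250, 250, capacity=500)
overloaded = nn_pair_survives(400, 300, capacity=500)
```

When each controller stays at or below half of its capacity, the combined load always fits on the survivor; the 400 and 300 AP split does not.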
Figure 9-8 A Design Using N+N Controller Redundancy
Rather than having an extra controller sitting idle waiting for another controller to fail, N+N puts all of the controllers to use. However, because each controller must hold capacity in reserve for its partner’s APs, N+N also requires more controller and license capacity than the APs need during normal operation. N+N is therefore a reliable but comparatively costly solution.
What if a scenario calls for more resiliency than the N+N plan can provide? You can simply add one more controller to the mix, as a backup unit. As you might expect, this is commonly called N+N+1 redundancy and combines the advantages of the N+N and N+1 strategies.
Two or more active controllers are configured to share the AP and client load, while reserving some AP capacity for use during a failure. One additional backup controller is set aside as an additional safety net. Figure 9-9 shows a simple example using three controllers—two active (WLC-A and WLC-B) and one backup (WLC-Z). Like N+N redundancy, the two groups of APs are configured with primary and secondary controllers that are the reverse of each other. Each group of APs is also configured with a tertiary controller that points to the backup unit.
Figure 9-9 A Design Using N+N+1 Controller Redundancy
If one active controller fails, APs that were joined to it will move to the secondary controller. As long as the two active controllers are not loaded with over 50 percent of their AP capacity, either one may accept the full number of APs. N+N+1 goes one step further; if the other active controller happens to fail, the backup controller is available to carry the load. This means that the active controllers can be loaded to more than 50 percent each because the backup controller will be available to share the load when an active controller fails.
The N+1, N+N, and N+N+1 strategies all address redundancy and fault tolerance, but each still relies on the basic controller discovery and join processes. In other words, APs require a certain amount of time to detect a failed controller and to seek out a new one to join. That process can be disruptive to wireless client devices that are using the network at the time.
You can use stateful switchover (SSO) redundancy on WLCs to maximize redundancy and minimize disruption. SSO groups controllers into HA pairs, where one controller takes on the active role while the other is in a hot standby mode. Only the active unit must be purchased with the appropriate license to support the AP count. The standby unit is purchased with an HA license, allowing it to be paired with an active unit of any license size. The standby unit never needs an actual license count because it inherits the licenses when it takes on the active role from another licensed controller.
Figure 9-10 depicts SSO redundancy. The APs can be configured with only a primary controller name that references the name or IP address of the active unit. The active and standby units keep their configurations synchronized so that the standby unit is always ready to take over if needed. In case the active controller fails, the standby controller becomes active and assumes the previous active unit’s IP address. Therefore, the APs need to know about only the active unit. Because each active controller has its own hot standby controller, there really is no need to configure a secondary or tertiary controller on the APs unless you need an additional layer of redundancy. To achieve that extra redundancy, you could set up controllers in HA pairs and then configure the active controller of each pair as the primary, secondary, and tertiary controllers.
Figure 9-10 A Design Using SSO Redundancy
Each AP learns of the active unit in the HA pair during a CAPWAP discovery phase and then builds a CAPWAP tunnel to the active controller. The active unit keeps CAPWAP tunnels, AP states, client states, configurations, and image files all in sync with the hot standby unit. If the active unit fails, the hot standby unit quickly takes over the active role. The APs do not have to discover another controller to join; the controllers simply swap roles so the APs can stay joined to the active controller in the HA pair.
The APs do not even have to rebuild their CAPWAP tunnels after a failure. The tunnels are synchronized between active and standby, so they are always maintained. The SSO switch-over occurs at the controllers—not at the APs.
The active controller also synchronizes the state of each associated client that is in the RUN state with the hot standby controller. If the active fails, the standby will already have the current state information for each client, making the failover process transparent to the end users. HA synchronization takes place over a special redundancy port that connects the active and hot standby units in an HA pair.
The hot standby controller monitors the active unit through keepalives that are sent every 100 ms. If a keepalive is not answered, the standby unit begins to send ICMP echo requests to the active unit to determine what sort of failure has occurred. For example, the active unit could have crashed, lost power, or had its network connectivity severed.
Once the hot standby unit has declared the active unit as failed, it assumes the active role. The failover may take up to 500 ms, in the case of a crash or power failure, or up to 4 seconds if a network failure has occurred.
SSO is designed to keep the failover process transparent from the AP’s perspective, as well as the client’s. In fact, the APs know only of the active unit; they are not even aware that the hot standby unit exists. The two controllers share a “mobility” MAC address that initially comes from the first active unit’s MAC address. From then on, that address is maintained by whichever unit has the active role at any given time. The controllers also share a common virtual management IP address. Keeping both MAC and IP addresses virtual and consistent allows the APs to stay in contact with the active controller—regardless of which controller currently has that role.
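The role swap can be modeled with a small data structure. This is a conceptual sketch only: the class `HaPair`, its field names, and the addresses are hypothetical and do not represent controller software.

```python
from dataclasses import dataclass

@dataclass
class HaPair:
    active: str          # unit currently holding the active role
    standby: str         # hot standby unit
    mgmt_ip: str         # shared virtual management IP address
    mobility_mac: str    # shared "mobility" MAC address

    def fail_over(self) -> None:
        """Swap roles; the shared addresses stay with the active role."""
        self.active, self.standby = self.standby, self.active

pair = HaPair(active="WLC-A1", standby="WLC-A2",
              mgmt_ip="192.168.10.5", mobility_mac="00:11:22:33:44:55")
ap_target = pair.mgmt_ip   # the only address an AP needs to know
pair.fail_over()
```

After `fail_over()`, `pair.active` is the former standby unit, yet `pair.mgmt_ip` still equals `ap_target`, which is why the APs never need to discover a different controller.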
This chapter described the main considerations needed to design a wireless network that maximizes availability for the end users. More precisely, you have learned the following:
How wireless controllers can support multiple links to scale performance and tolerate faults
How multiple wireless controllers can handle the AP load and offer greater availability if one fails
How controller redundancy is classified according to the number of active and standby controllers
How stateful switchover (SSO) can be leveraged to maximize controller redundancy and minimize disruption
Exam Preparation Tasks
As mentioned in the section “How to Use This Book” in the Introduction, you have a few choices for exam preparation: the exercises here, Chapter 18, “Final Preparation,” and the exam simulation questions in the Pearson Test Prep Software Online.
Review the most important topics in this chapter, noted with the Key Topic icon in the outer margin of the page. Table 9-3 lists these key topics and the page numbers on which each is found.
Table 9-3 Key Topics for Chapter 9
Key Topic Element | Description | Page Number |
---|---|---|
 | Multiple WLC ports configured as a single logical LAG | 193 |
Paragraph | Assigning AP priorities | 195 |
Paragraph | Using AP Fallback | 197 |
 | SSO redundancy | 200 |
Define the following key terms from this chapter and check your answers in the glossary: