The goal of lean cloud capacity management – to sustainably achieve the shortest lead time, best quality and value, and highest customer delight at the lowest cost – requires that some reserve capacity be held to mitigate the service quality impact of failures and other situations. At the highest level, reserve capacity is used to mitigate the risk of a service being driven into a capacity emergency which causes parties in the service delivery chain to accrue waste from inadequate capacity (Section 3.3.5). Severe capacity emergencies can have dire business consequences including customer churn, liquidated damages liabilities, loss of market share, and loss of market value. Thus, one must deploy sufficient spare or reserve capacity to mitigate the risk of inadequate capacity without squandering resources on excessive capacity.
This chapter considers lean reserves via the following sections:
Figure 8.1 (a copy of Figure 3.2) illustrates reserve capacity alongside working capacity and excess capacity. Working capacity is the mean (average) level of demand across a particular capacity decision and planning window. For example, if application capacity decision and planning is evaluated every 5 minutes, then working capacity is considered for each of those 5-minute intervals. Random variance covers the maximum and minimum levels of demand in the particular capacity decision and planning window, reflecting the most extreme moments of demand in the window, such as demand intensity in the busiest seconds. Reserve capacity is an increment of capacity above the forecast working capacity that is held online to mitigate risks and assure that user demand can be served with acceptable latency and overall quality. Working plus reserve capacity should be significantly greater than peak demand (i.e., mean demand plus peak random variance in the time window) to mitigate the risk of failures, extreme surges in demand, and so on. Reserve capacity is typically expressed as a percentage of capacity above forecast mean demand.
Figure 8.2 visualizes capacity for a hypothetical application across several capacity decision and planning cycles. The points on the lower dotted line illustrate the forecast working capacity and the points on the upper dotted line show the total forecast demand plus reserve target capacity. The policy for this hypothetical example is to maintain reserve capacity of nominally 50% above forecast mean demand. The solid lines represent the actual mean and peak demand in each window. Note that while the actual mean demand is close to the forecast working capacity, the working plus reserve capacity significantly exceeds the peak actual demand because sufficient reserve capacity is held to mitigate the risk of events like unforecast surges in demand, failures, and so on; since no such event occurred in that window, the reserve was not fully consumed. Reserve capacity is somewhat like life insurance: you hold it to mitigate the financial consequences of a death, and the fact that the insured ultimately did not die does not make it any less prudent to have hedged that risk during the term of insurance.
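To make this arithmetic concrete, the following minimal sketch (in Python, with hypothetical function and variable names) computes the target online capacity for a single decision and planning window from forecast mean demand and the nominal 50% reserve policy of this example.

```python
def target_online_capacity(forecast_mean_demand: float,
                           reserve_fraction: float = 0.50) -> float:
    """Target capacity for one decision/planning window: forecast
    working capacity plus a fixed reserve increment."""
    return forecast_mean_demand * (1.0 + reserve_fraction)

# Hypothetical example: forecast mean demand of 1,000 requests/second
# with the nominal 50% reserve policy described in the text.
print(target_online_capacity(1000.0))   # -> 1500.0 requests/second
```

In practice the forecast and reserve policy would be drawn from the capacity planning system for each window rather than being hard-coded constants.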
Reserve capacity is used to mitigate an ensemble of unforeseen circumstances including:
As explained in Section 1.5: Demand Variability, random variations in service demand across the shortest time scales overlay onto cyclical patterns of demand. As explained in Chapter 7: Lean Demand Management, techniques like buffers, queues, and resource scheduling enable modest and momentary random bursts of demand to be served with acceptable service quality, albeit perhaps with slightly higher service latency. Reserve capacity enables random peaks to be promptly served rather than creating a persistent backlog of work that increases service latency and diminishes user quality of experience for some or all users.
Highly available applications maintain sufficient redundant online capacity to recover user service with minimal impact following a failure event. No single point of failure means that an application has been appropriately designed, and that sufficient redundant capacity is held online, so that any single failure event can be mitigated with minimal user service impact. Practically, this means that enough spare application component capacity is held online that the entire offered load can be served with acceptable service quality immediately following the largest single failure event that can impact the target application.
Application capacity failure group size is driven by two footprints:
Note that recovering from a component or resource failure can produce a larger transient capacity impact, in that more capacity than merely replacing the failed component may be required to acceptably recover service. For example, if a component directly serving X users fails, then recovering service for those X users may require not only sufficient service capacity to replace the failed component, but also sufficient spare capacity from supporting elements such as user identification, authentication and authorization components, data stores of user information and so on, to fully recover all impacted users within the maximum acceptable time. While the failure recovery workload surge for ordinary failures may be sufficiently small that it can be served via normal spare capacity, catastrophic failure or disaster recovery scenarios often put such a large correlated recovery-related workload on systems that sufficient spare capacity must be carefully engineered to assure that recovery time objectives (RTO) can be met.
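As a rough illustration of this sizing logic (the function name, risk groupings, and the 20% recovery surge factor below are assumptions for illustration, not values from the text), reserve for failure mitigation can be estimated from the largest single failure group plus an allowance for the transient recovery workload.

```python
def failure_reserve(failure_group_capacities: list[float],
                    recovery_surge_factor: float = 1.2) -> float:
    """Reserve capacity needed so the largest single failure event can
    be absorbed, inflated for the transient recovery workload (e.g.,
    re-authentication, state reloads) that follows a failure."""
    largest_failure_group = max(failure_group_capacities)
    return largest_failure_group * recovery_surge_factor

# Hypothetical example: three failure groups serving 200, 300 and 250
# units of capacity, with an assumed 20% recovery surge allowance.
print(failure_reserve([200.0, 300.0, 250.0]))   # -> 360.0
```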
Occasional infrastructure element failures are inevitable and those failed elements may be out of service awaiting repair for hours or longer. Infrastructure service providers hold some reserve capacity so that application service provider requests for virtual resources – including requests to restore (a.k.a., repair) application component capacity lost due to infrastructure element failures – can be rapidly and reliably fulfilled.
As discussed in Chapter 7: Lean Demand Management, infrastructure service providers may occasionally curtail resource throughput or activate voluntary or mandatory demand management actions. Application service providers can use reserve capacity to mitigate service impact of these infrastructure demand management actions.
Forecasting future demand is inherently difficult and error prone. Beyond predicting whether demand will broadly increase, remain constant, or decline, unforeseen – and hence hard to predict – events can occur which impact demand. Natural disasters like earthquakes, events of regional significance like terrorist attacks, and commercial, entertainment, or other events can lead to unforecast surges in demand. Reserve capacity can minimize the user service impact of unforeseen surges in demand. Both application service providers and infrastructure service providers must account for the risk that their forecasts of demand are occasionally outstripped by actual demand.
Reserve capacity is held to cover increases in demand that occur before additional capacity can be brought into service. Figure 8.3 illustrates the timeline of capacity decision, planning, and fulfillment. As discussed in Section 3.8: Cadence, capacity decision and planning cycles repeat on a regular cadence; let us assume a 5-minute cadence for a hypothetical application. At the start of the decision and planning cycle, some system retrieves current usage, performance, alarm, demand forecast and other information, and applies business logic to that information to decide if a capacity change is necessary. If yes, then the capacity planning function determines exactly what capacity change action to order (e.g., which specific application instance needs to execute exactly what capacity change action) and dispatches that order – or perhaps multiple orders for capacity changes for complex applications or solutions – to the appropriate system for fulfillment. Appropriate fulfillment systems then execute the required capacity configuration change action.
Capacity lead time is thus the sum of:
Fulfillment of application capacity degrowth typically involves:
Capacity fulfillment actions take time to complete, and some variability in completion time is likely. For instance, variability in the time it takes the infrastructure service provider to allocate and deliver requested virtual resources cascades as variability in capacity fulfillment time.
Occasionally a failure will occur in allocation, configuration, startup, or some other aspect of capacity fulfillment. Detecting and backing out the failure takes time, yet does not fulfill the requested capacity change action. Thus, another fulfillment action must be attempted, which inevitably delays the time until the new capacity arrangement is online and available to serve user demand. As a rough estimate, one can assume that detecting and mitigating a failed capacity fulfillment action will take one lead time period, and a second lead time period will be consumed for decision, planning, and fulfillment to mitigate failure of the first fulfillment action. Thus, capacity decision and planning should manage capacity growth actions at least two normal lead time intervals into the future, so sufficient time is available to detect, mitigate, and resolve occasional failures in the fulfillment process.
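A minimal sketch of this planning rule follows, under a simple linear demand growth assumption and with hypothetical names: capacity decisions target forecast demand at least two lead time intervals ahead so a failed fulfillment attempt can be detected and retried.

```python
def planning_horizon_demand(current_demand: float,
                            demand_growth_per_minute: float,
                            lead_time_minutes: float,
                            lead_time_multiple: int = 2) -> float:
    """Forecast demand at the end of the planning horizon, where the
    horizon spans at least two normal capacity lead times so a failed
    fulfillment action can be detected, backed out, and retried."""
    horizon = lead_time_multiple * lead_time_minutes
    return current_demand + demand_growth_per_minute * horizon

# Hypothetical example: demand growing 10 units/minute with a
# 5-minute capacity lead time, planned two lead times ahead.
print(planning_horizon_demand(1000.0, 10.0, 5.0))   # -> 1100.0
```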
Force majeure or other extreme events can render some or all of the applications in a data center unavailable or unusable. Strikes, riots, and other events of regional significance can also render a physical data center inaccessible or unreachable for a time. Business continuity concerns drive organizations to carefully plan and prepare to survive catastrophic (e.g., force majeure) failure events. Note that force majeure and disaster events will likely render normal reserve capacity inoperable because they simultaneously impact all co-located reserve capacity. Thus, somewhat different emergency reserve capacity arrangements are typically used to mitigate catastrophic events; Section 8.4.8: Emergency Reserves considers this topic.
Reserve capacity is often explicitly monetized via two qualities of service tiers:
While a best effort service may make little or no provision to mitigate any of the risks enumerated in Section 8.2: Uses of Reserve Capacity, guaranteed quality of service offerings would be engineered to mitigate the risks covered by the service quality guarantee.
Even without an explicit quality of service guarantee, service providers should set a service level objective to engineer their service for. Reserve capacity beyond the needs of the service provider's service level objectives is excess application capacity (Section 3.3.2), excess online infrastructure capacity (Section 3.3.3), and/or excess physical infrastructure capacity (Section 3.3.4).
Section 5.10: Demand and Reserves explained that the power industry considers operating reserves in two orthogonal dimensions:
Emergency reserves (Section 8.4.8)
Demand oriented reserves include:
Just as power generation equipment has mechanisms that automatically control power output over some range of normal operational variations, modern electronic components and ICT equipment implement advanced power management mechanisms that automatically control throughput over a small operational range via techniques like adjusting clock frequencies and operating voltages. While these automatic mechanisms are not typically considered capacity reserve techniques, deactivating advanced power management mechanisms that were engaged may make somewhat more infrastructure capacity available to serve demand.
When load shared reserve capacity mechanisms are used, surges in demand naturally utilize an application's reserve capacity.
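A simple way to see this: in an N+1 load-shared pool, each instance runs below a utilization ceiling so that the surviving instances (or a demand surge) can absorb the load of one lost instance. The sketch below, with hypothetical names, computes that per-instance ceiling.

```python
def load_shared_utilization_ceiling(pool_size: int,
                                    tolerated_failures: int = 1) -> float:
    """Maximum per-instance utilization in a load-shared pool so the
    remaining instances can carry the full load if 'tolerated_failures'
    instances are lost (or an equivalent demand surge arrives)."""
    if pool_size <= tolerated_failures:
        raise ValueError("pool must be larger than tolerated failures")
    return (pool_size - tolerated_failures) / pool_size

# Hypothetical example: 5 load-shared instances tolerating 1 failure
# should each run at no more than 80% utilization.
print(load_shared_utilization_ceiling(5))   # -> 0.8
```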
When workload is unevenly distributed across pools of fungible application components or application instances, it can be useful to redistribute new, or perhaps even existing, demand from more heavily utilized components or instances to less heavily utilized ones. Workload can also potentially be shifted away from a stressed or poorly performing component to another application instance that has spare capacity.
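A minimal sketch of the rebalancing idea, using hypothetical data structures: new or redistributed demand is steered to the least-utilized healthy instance, and a stressed instance can be excluded from selection.

```python
def pick_target_instance(utilization_by_instance: dict[str, float],
                         stressed_instances: frozenset[str] = frozenset()) -> str:
    """Choose the least-utilized healthy instance to receive new (or
    redistributed) workload; stressed or poorly performing instances
    are excluded from consideration."""
    candidates = {name: util for name, util in utilization_by_instance.items()
                  if name not in stressed_instances}
    if not candidates:
        raise RuntimeError("no healthy instance available")
    return min(candidates, key=candidates.get)

# Hypothetical example: instance 'b' is stressed, so new demand goes
# to the least-loaded remaining instance.
print(pick_target_instance({"a": 0.72, "b": 0.35, "c": 0.58},
                           stressed_instances=frozenset({"b"})))   # -> 'c'
```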
Growing online capacity of an existing application instance is the normal way to increase application capacity.
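A sketch of one simple growth trigger, under stated assumptions (the 70% utilization target and helper names below are hypothetical): when forecast demand would push the existing instance above its utilization target, order additional capacity for that instance.

```python
def capacity_growth_needed(forecast_demand: float,
                           online_capacity: float,
                           utilization_target: float = 0.7) -> float:
    """Return the additional online capacity (0 if none) needed so the
    existing application instance stays at or below its utilization
    target for the forecast demand."""
    required_capacity = forecast_demand / utilization_target
    return max(0.0, required_capacity - online_capacity)

# Hypothetical example: 900 units of forecast demand against 1,000
# units online with a 70% utilization target.
print(capacity_growth_needed(900.0, 1000.0))   # -> ~285.7 additional units
```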
When service demand outstrips online capacity, one can limit or throttle the service throughput delivered to some or all consumers. Readers will be familiar with service curtailment in the context of broadband internet service: download speed (i.e., service throughput) slows due to congestion during heavy usage periods.
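The sketch below illustrates one simple curtailment policy (a hypothetical proportional throttle, not a prescribed mechanism): when aggregate demand exceeds online capacity, every consumer's throughput is scaled back by the same factor.

```python
def curtailed_throughput(requested_throughput: float,
                         total_demand: float,
                         online_capacity: float) -> float:
    """Proportionally throttle a consumer's throughput when aggregate
    demand exceeds online capacity; otherwise serve the full request."""
    if total_demand <= online_capacity:
        return requested_throughput
    curtailment_factor = online_capacity / total_demand
    return requested_throughput * curtailment_factor

# Hypothetical example: a consumer requesting 50 units of throughput
# when aggregate demand is 1,200 units against 1,000 units of capacity.
print(curtailed_throughput(50.0, 1200.0, 1000.0))   # -> ~41.7
```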
In capacity emergencies, application or infrastructure service providers can unilaterally pause, suspend, cancel, or terminate active workloads. Activating mandatory demand shaping actions often impacts service users and forces them to take the trouble of executing mitigating actions, which is generally counter to the lean principle of respect.
Voluntary demand shaping measures give the service consumer the opportunity to gracefully shift their pattern of demand at the earliest convenient moment, thereby sparing them the trouble of having to mitigate service impact of an ill-timed or ill-planned mandatory demand management action.
A single force majeure or disaster event can render an entire data center, including all virtual resources and applications hosted on physical equipment in the impacted data center, unreachable or otherwise unavailable for service. Natural disasters like earthquakes can simultaneously impact some or all data centers in the vicinity of the event. The fundamental business problem is recovering impacted user service and application data to one or more alternate application instances running in data centers that were not directly impacted by the disaster event.
Critical and important applications are explicitly engineered for disaster recovery so that after a disaster or force majeure event renders a cloud data center unavailable or otherwise inaccessible for service, user service can be restored in a geographically distant data center which would not have been affected by the same force majeure event. Recovery performance is quantitatively characterized by recovery time objective (RTO) and recovery point objective (RPO), which are visualized in Figure 8.5. Note that disaster RTOs are generally far more generous than the maximum acceptable service impact duration for non-catastrophic incidents.
Very short RTO and RPO values often require disaster recovery capacity to be online and synchronized with current data prior to the disaster event, which means that sufficient disaster recovery capacity must be online 24/7 to recover application capacity for the largest expected disaster event because the timing of disaster events is generally unpredictable. Generous RTO values may enable some, most, or all disaster recovery capacity to be allocated on the fly following a disaster event.
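As an illustration of how these objectives constrain the recovery architecture (the function name, recovery phases, and timings below are assumptions for illustration), this sketch checks whether an assumed recovery plan meets the RTO and whether the data replication interval meets the RPO.

```python
def meets_recovery_objectives(detect_minutes: float,
                              allocate_minutes: float,
                              restore_minutes: float,
                              replication_interval_minutes: float,
                              rto_minutes: float,
                              rpo_minutes: float) -> bool:
    """Check an assumed recovery plan: total recovery time must fit
    within the RTO, and the worst-case data loss (one replication
    interval) must fit within the RPO."""
    recovery_time = detect_minutes + allocate_minutes + restore_minutes
    return recovery_time <= rto_minutes and replication_interval_minutes <= rpo_minutes

# Hypothetical example: 10 min detection, 60 min resource allocation,
# 45 min restore, 15 min replication interval, 4-hour RTO, 30 min RPO.
print(meets_recovery_objectives(10, 60, 45, 15, 240, 30))   # -> True
```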
There are a range of application recovery strategies into geographically distant cloud data centers; for simplicity, let us consider two basic disaster recovery architectures:
Roles and responsibilities alignment will dictate what roles the infrastructure service providers of the impacted data center and the recovery data center, and/or a disaster-recovery-as-a-service provider, play in supporting the application service provider.
A unique challenge around disaster recovery is that a data center represents a large footprint of failure that simultaneously impacts all applications hosted in that data center. Thus, application service providers for all impacted applications will likely activate disaster recovery plans at the same time and may end up competing for virtual resource allocations and throughput to serve the surging workload associated with disaster recovery. In aggregate, this likely surge in resource allocation requests may be far beyond what the infrastructure service provider's management and orchestration systems normally process. A surge in network traffic to retrieve recovery data and prepare recovery capacity may cause congestion of networking infrastructure and facilities which can slow disaster recovery time. Thus, resource allocation and provisioning may take significantly longer and be somewhat less reliable in disaster emergencies, so RTO planning should allow extra time for congestion-related effects.
Disaster recovery as a service and other cloud offerings may enable application service providers to configure application capacity somewhat leaner than they might have for traditional disaster recovery deployment models. In addition, automated service, application, and resource lifecycle management mechanisms can shorten execution time and improve execution quality for disaster recovery actions.
Both application service provider and infrastructure service provider organizations must make sensible financial decisions to remain sustainable businesses. One financial decision weighs the risk tolerance for having insufficient capacity instantaneously available to serve customer demand against the cost of maintaining actual or optioned capacity to serve increasingly unlikely extreme demand surges. For example, at some point it may be cheaper for an application or infrastructure service provider to pay penalties for rare capacity emergency incidents than to carry sufficient capacity to fully mitigate service impact of all possible capacity emergency incidents. Accepting greater risk of tardy or limited service availability during capacity emergencies will lower reserve capacity requirements (and hence costs), but customer satisfaction and goodwill may be impacted when a capacity emergency incident occurs.
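A back-of-the-envelope comparison of this tradeoff, with entirely hypothetical figures: weigh the annual carrying cost of additional reserve capacity against the expected annual cost of the capacity emergency penalties it would avoid.

```python
def reserve_is_worthwhile(annual_reserve_cost: float,
                          emergency_probability_per_year: float,
                          penalty_per_emergency: float) -> bool:
    """Carry extra reserve only if its annual cost is less than the
    expected annual penalty cost of the emergencies it avoids."""
    expected_penalty_cost = emergency_probability_per_year * penalty_per_emergency
    return annual_reserve_cost < expected_penalty_cost

# Hypothetical example: $200k/year of extra reserve versus a 5% annual
# chance of a $1M penalty; the expected penalty is $50k, so the extra
# reserve is not justified on this narrow financial basis alone.
print(reserve_is_worthwhile(200_000, 0.05, 1_000_000))   # -> False
```

Of course, this narrow comparison omits softer factors such as customer satisfaction and goodwill noted above.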
The ideal level of reserve capacity is considered separately as:
Affinity rules typically request that virtual resources hosting an application instance's components are co-located, ideally in the same rack, chassis, or even server blade, to maximize performance. Reserve capacity is normally co-located with the working capacity so that users served by that reserve capacity are likely to experience the same quality of service as users served by working capacity.
Just as safety stock inventory levels are set based on a probabilistic service level (i.e., the probability that sufficient inventory will be on hand to serve demand) rather than engineering for absolute certainty, ideal reserve is also determined based on a probabilistic service level target that sufficient capacity will be available to serve offered workload with acceptable quality of service. The higher the service level objective, the greater the quantity of reserve capacity that must be held online to cover the remote probability of ever more extreme events.
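Following the safety stock analogy, the sketch below sizes reserve as a multiple of demand variability chosen from the service level target. It assumes, purely for illustration, that demand over the decision window is approximately normally distributed; the multipliers are standard normal quantiles.

```python
from statistics import NormalDist

def probabilistic_reserve(demand_std_dev: float,
                          service_level: float = 0.999) -> float:
    """Reserve sized, safety-stock style, so that capacity covers demand
    variability with the requested probability (assuming approximately
    normally distributed demand over the decision window)."""
    z = NormalDist().inv_cdf(service_level)
    return z * demand_std_dev

# Hypothetical example: demand standard deviation of 80 units; a 99.9%
# service level needs roughly 3.09 * 80 = 247 units of reserve, while a
# 99% service level needs only about 2.33 * 80 = 186 units.
print(round(probabilistic_reserve(80.0)))        # -> 247
print(round(probabilistic_reserve(80.0, 0.99)))  # -> 186
```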
Normal, co-located reserve capacity optimally mitigates:
Ideal single-event reserve capacity is the maximum of the capacities necessary to mitigate each of the individual risks above. Longer fulfillment times, less frequent capacity decision and planning cycles, and less reliable capacity fulfillment actions bias one to engineer sufficient reserve capacity to mitigate two failure events occurring before additional capacity can be brought online. As a starting point, one might consider an ideal reserve of perhaps twice the maximum capacity required to mitigate any one of the single risks above. Continuous improvement activities driven by lean cloud computing principles can drive ideal lean reserve targets down over time.
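As a sketch of this sizing logic (the risk names and capacities below are hypothetical), the ideal single-event reserve is the maximum across the individual risks, and a multiplier of two can be applied when fulfillment is slow or unreliable enough that two events must be covered.

```python
def ideal_reserve(risk_mitigation_capacities: dict[str, float],
                  concurrent_events_covered: int = 1) -> float:
    """Ideal reserve: the largest single-risk mitigation capacity,
    multiplied by the number of events that must be covered before
    additional capacity can reliably be brought online."""
    return concurrent_events_covered * max(risk_mitigation_capacities.values())

# Hypothetical example: random demand peaks need 120 units, the largest
# failure group needs 300, and lead-time demand growth needs 150.
risks = {"random_peaks": 120.0, "largest_failure_group": 300.0,
         "lead_time_growth": 150.0}
print(ideal_reserve(risks))       # -> 300.0 (single-event reserve)
print(ideal_reserve(risks, 2))    # -> 600.0 (two events, per the text)
```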
Catastrophic failures and force majeure events (Section 8.2.7) simultaneously impact most or all application components hosted in a physical data center, thereby rendering most or all working and normal (co-located) reserve capacity at the impacted site inoperable. Thus, these events must be mitigated via reserves located in a data center not impacted by the force majeure or catastrophic event. Emergency reserve capacity is typically engineered to be sufficiently distant from the original site that no single force majeure or catastrophic event would impact both sites; ideally emergency reserve capacity is hundreds of miles away from the original site. The emergency reserve capacity might be owned and operated by the same infrastructure service provider or by a different infrastructure service provider. The application service provider typically arranges for emergency reserve capacity in advance to shorten the service recovery time following a disaster event. Some application service providers will opt for disaster-recovery-as-a-service offerings rather than managing and maintaining emergency reserve capacity themselves. Mutual aid arrangements in which other organizations serve the impacted organization's user traffic might also be considered. Ideal emergency (geographically distributed) reserve is the minimum capacity necessary to recover user service within the RTO following a disaster or force majeure event.
Emergency reserves are characterized by two key parameters: