Failure in distributed systems is inevitable. Instead of trying to prevent failure entirely, we want to design systems that are capable of self-repair. To accomplish this, it is essential to have a good strategy for clients to follow when initiating retries. A service may become temporarily unavailable or experience a problem that requires manual response from an on-call engineer. In either scenario, clients should be able to queue and then retry requests to be given the best chance of success.
Retrying endlessly in the event of an error is not an effective tactic. Imagine a service starts to experience a higher-than-normal failure rate, perhaps even failing 100% of requests. If clients all continuously enqueue retries without ever giving up, you'll end up with a thundering-herd problem—clients continuously retrying requests without limit. As the timeline of the failure progresses, more clients will experience failures, resulting in more retries. You'll end up with a traffic pattern, illustrated by the following diagram, which is a similar graph to the one you'll see during a denial-of-service attack. The end result will be the same—cascading failures due to overwhelmed services and a shedding of legitimate traffic. Your application will become unusable and the failing service will be harder to isolate and repair:
The solution to prevent thundering herds is to add a backoff algorithm that exponentially increases the wait period between retries and gives up after a certain number of failures. This approach is referred to as capped exponential backoff. Adding an exponentially-increasing sleep function between retries accomplishes half of what we're after—clients will slow down their retry attempts, distributing load over time. Unfortunately, client retries will still be clustered, resulting in periods of time where your service is being hammered by many concurrent requests. The second half of our strategy addresses this problem by adding a randomized value or jitter to our sleep function to distribute the retries over time. To summarize, our retry strategy has the following three requirements:
- Retries must be spaced out using an exponential backoff
- Retries must be randomized by adding jitter
- Retries must terminate after a specific amount of time
Most HTTP libraries will have support for a retry strategy that meets these requirements. In this recipe, we'll look at the HTTP client library for Java written by Google.