Downstream Resiliency: The Timeout, Retry, and Circuit-Breaker Patterns
As systems become more interconnected and interdependent, downstream resiliency has become a key consideration in service architecture, guiding how an application responds when the services it depends on fail.
This article will dive into downstream resiliency and how timeouts, retries, and circuit breakers work together.
What is Downstream Resiliency?
Downstream resiliency refers to the capability of a service, say Service A, to keep functioning properly when its downstream dependencies fail, for instance Service B, which may itself rely on another service (Service C). In distributed architectures such as microservices, services rarely operate in isolation: Service A might call Service B, which in turn calls Service C to complete its requests. When these services experience latency or an outage, it is important for the originating service, Service A in this case, to have measures in place to handle such situations gracefully.
Timeouts
The first line of defense for downstream resiliency is the timeout. A timeout is the period after which a service stops waiting for a response from a downstream call. Imagine Service A calling Service B and waiting indefinitely for a response: Service A holds onto resources while it waits, which can lead to resource exhaustion. With a timeout set, Service A can free up those resources and carry on with its work even if Service B takes too long to respond.
However, determining the right timeout duration is critical. Too short a timeout leads to unnecessary failures and the retries they may trigger, while too long a timeout wastes valuable resources. Observability tools can help you understand typical response times and set an appropriate threshold, usually slightly higher than the average response time observed.
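As a rough illustration, here is a minimal Python sketch of a downstream call guarded by a timeout, using the requests library against a hypothetical Service B endpoint; the URL and the timeout values are placeholders you would tune from your own observability data.

```python
import requests

# Hypothetical Service B endpoint, for illustration only.
SERVICE_B_URL = "https://service-b.example.com/orders"

def fetch_orders():
    try:
        # 1s to establish the connection, 3s to read the response.
        # These values are illustrative; derive them from observed response times.
        response = requests.get(SERVICE_B_URL, timeout=(1, 3))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Service B took too long: free up the caller instead of waiting forever.
        return None
```

When the timeout fires, Service A can fail fast, return a cached or default value, or hand the failure over to a retry strategy instead of blocking indefinitely.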
Retries
Retries come into play when an initial request to a downstream service fails or times out. The assumption in such scenarios is that the failure might be transient, meaning the service is perfectly capable of handling requests but is temporarily overwhelmed. In that case, by retrying the request, Service A will likely receive a successful response, as long as the retries themselves don't overwhelm Service B.
When it comes to retries, one of the most important decisions is which backoff strategy to adopt. A backoff strategy simply defines how long to wait between retries. There are a few common strategies to choose from:
- Constant Interval: Retry after a fixed duration. For example, retry every second.
- Linear Interval: The wait time increases linearly after each retry. For example, the first retry happens after 1 second, the second after 2 seconds, and so on.
- Exponential Interval: The wait time increases exponentially, which helps reduce the load on Service B. The base of the exponent is usually 2, so the first retry happens after 2 seconds (2^1, where the exponent is the retry attempt), the second after 4 seconds (2^2), the third after 8 seconds (2^3), and so on.
- Exponential Backoff with Jitter: The same as the exponential interval, but with a random amount of jitter added to each wait, so that clients don't all retry at the same time and overwhelm Service B (the "thundering herd" problem). For example, the first retry happens after 2.2 seconds (2^1 plus 0.2 seconds of random jitter), the second after 4.7 seconds (2^2 plus 0.7 seconds of jitter), and so on. A code sketch of this strategy follows the list.
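To make the last strategy concrete, here is a minimal Python sketch of retries with exponential backoff and jitter, reusing the hypothetical Service B endpoint from the timeout example; the attempt count, base delay, and jitter range are illustrative values, not recommendations.

```python
import random
import time

import requests

SERVICE_B_URL = "https://service-b.example.com/orders"  # hypothetical endpoint

def fetch_orders_with_retries(max_attempts=4, base=2.0, max_jitter=1.0):
    """Call Service B, retrying with exponential backoff plus random jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(SERVICE_B_URL, timeout=(1, 3))
            response.raise_for_status()
            return response.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Exponential interval (2s, 4s, 8s, ...) plus jitter so that many
            # clients don't all retry at exactly the same moment.
            delay = base ** attempt + random.uniform(0, max_jitter)
            time.sleep(delay)
```

Note how each attempt is still bounded by the timeout from the previous example, so a slow Service B never holds up a retry loop indefinitely.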
Circuit-Breaker
While retries are suitable for transient issues between services, circuit breakers handle scenarios where a service is likely down, i.e. non-transient failures. A circuit breaker tracks the success or failure of recent requests to the downstream service and changes its own state accordingly. It has three states: Closed, Open, and Half-Open.
Here's a high-level overview of the states and their respective transitions:
- In the Closed state, requests flow normally.
- If the failure threshold is reached, the circuit breaker transitions into the Open state, blocking all requests to that service for a defined timeout period.
- After this timeout, the next request moves the breaker into the Half-Open state to test whether the downstream service has recovered. If the request succeeds, the breaker transitions back into the Closed state; if not, it returns to the Open state.
Here's a step-by-step walk-through of these state changes.
(1) Closed circuit. Everything flows normally between the services.
(2) Open circuit. After a specific number of requests fail (you configure this threshold), the downstream dependency is considered faulty, so the circuit is opened.
(3) Half-open circuit. After the circuit has been open for a configured period, the next request is allowed through to test the waters, and the circuit moves to the Half-Open state. The system won't stay in this state for long: what happens next depends on whether that request succeeds. If it does, the circuit goes back to the Closed state; otherwise, it returns to the Open state.
These state changes will continue for as long as the system is alive.
This process prevents Service A from constantly trying to reach a non-responsive Service B, thus saving resources and generally contributing to system stability.
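To tie the three states together, here is a minimal, illustrative Python sketch of a circuit breaker; the class shape, failure threshold, and open-timeout values are assumptions for demonstration, not a production-ready implementation.

```python
import time

class CircuitBreaker:
    """A minimal sketch of the Closed / Open / Half-Open states described above."""

    def __init__(self, failure_threshold=5, open_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.open_timeout = open_timeout            # seconds to stay open
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.open_timeout:
                # Let the next request through to test the waters.
                self.state = "half-open"
            else:
                raise RuntimeError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A successful call closes the circuit again.
        self.failure_count = 0
        self.state = "closed"

    def _on_failure(self):
        self.failure_count += 1
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```

Service A would wrap its downstream calls, for example `breaker.call(fetch_orders)`, so that once the breaker opens, further calls fail fast instead of piling up behind a non-responsive Service B.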
Conclusion
In summary, downstream resiliency is a must-have for the stability of modern applications that depend on several services. Adding timeouts, using retry strategies, and applying circuit breakers together build a resilient system that handles failures in downstream dependencies gracefully. Each pattern has its own subtleties and requires careful tuning and monitoring to ensure it works as expected.
Resources
If you are more of a visual learner, I’ve created a video to showcase the concepts explained here.