Downstream Resiliency
Ensuring stability in distributed networks
Downstream resiliency ensures that a component can continue to function correctly even if the components it relies on experience issues.
Timeouts
Before we start, let's answer a simple question: why time out at all? Isn't a successful response, even a slow one, better than a timeout error? Not always; it depends.
When a network call is made, it's best practice to configure a timeout. Without one, there is a chance the call will never return, and network calls that never return lead to resource leaks.
Modern HTTP clients, such as those in the Java and .NET ecosystems, generally do a better job here and usually come with default timeouts. For example, .NET Core's HttpClient has a default timeout of 100 seconds. Some clients, however, like Go's, have no default timeout for network requests. In such cases, it is best practice to configure a timeout explicitly.
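To make this concrete, here is a minimal Go sketch (the URL and the 5s/2s values are illustrative, not recommendations): a client-wide timeout caps every call made through the client, and a per-request context can tighten it further.

package main

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

func main() {
    // Client-wide upper bound: no call through this client can outlive 5s.
    client := &http.Client{Timeout: 5 * time.Second}

    // A tighter per-request deadline via context (2s, chosen arbitrarily).
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
    if err != nil {
        panic(err)
    }

    resp, err := client.Do(req)
    if err != nil {
        fmt.Println("request failed or timed out:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}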
How do we configure timeouts without breaching the SLA?
Option 1: Share Your Time Budget
Divide your SLA budget between the downstream calls, e.g., for a 1s SLA, 500ms for Order Service and 500ms for Payment Service. This keeps the total within the SLA but may cause false positive timeouts: a call is cut off at its slice of the budget even when the overall budget still has room.
Option 2: Use a TimeLimiter
Wrap the calls in a time limiter that enforces a shared overall timeout (e.g., 1s) while giving each individual call more headroom (e.g., up to 700ms). A fast first call then leaves more of the budget for a slower second one, handling varying response times efficiently, as sketched below.
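A rough Go sketch of this shared-budget idea, assuming hypothetical callOrderService and callPaymentService helpers: one parent deadline acts as the overall time limiter, and each call also gets its own 700ms cap that can never exceed what remains of the parent budget.

package main

import (
    "context"
    "fmt"
    "time"
)

// Hypothetical placeholders for the real downstream calls.
func callOrderService(ctx context.Context) error   { return nil }
func callPaymentService(ctx context.Context) error { return nil }

func handleRequest(ctx context.Context) error {
    // Overall budget: the whole operation must finish within 1s.
    ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
    defer cancel()

    // Each call may use up to 700ms, but never more than what remains of
    // the overall budget (a child deadline cannot exceed its parent's).
    orderCtx, cancelOrder := context.WithTimeout(ctx, 700*time.Millisecond)
    defer cancelOrder()
    if err := callOrderService(orderCtx); err != nil {
        return err
    }

    paymentCtx, cancelPayment := context.WithTimeout(ctx, 700*time.Millisecond)
    defer cancelPayment()
    return callPaymentService(paymentCtx)
}

func main() {
    fmt.Println(handleRequest(context.Background()))
}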
How do we determine a good timeout duration?
One way is to base it on the desired false timeout rate. For example, if 0.1% of downstream requests are allowed to time out, set the timeout around the 99.9th percentile of the downstream's response time.
Good monitoring tracks the entire lifecycle of a network call: DNS resolution, connection setup, time to first byte, and total duration. Measure integration points carefully; it makes production issues far easier to debug and gives you the percentile data needed to choose timeouts.
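One way to get this visibility in Go is the standard library's net/http/httptrace hooks; a minimal sketch that timestamps a few phases of a single request (in practice these values would feed a metrics system rather than stdout):

package main

import (
    "fmt"
    "net/http"
    "net/http/httptrace"
    "time"
)

func main() {
    start := time.Now()
    trace := &httptrace.ClientTrace{
        DNSDone: func(info httptrace.DNSDoneInfo) {
            fmt.Println("dns done after", time.Since(start))
        },
        ConnectDone: func(network, addr string, err error) {
            fmt.Println("connected to", addr, "after", time.Since(start))
        },
        GotFirstResponseByte: func() {
            fmt.Println("first byte after", time.Since(start))
        },
    }

    req, _ := http.NewRequest(http.MethodGet, "https://example.com", nil) // error ignored for brevity
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    resp, err := (&http.Client{Timeout: 5 * time.Second}).Do(req)
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    resp.Body.Close()
    fmt.Println("total:", time.Since(start))
}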
Retry Strategies
When a network request fails or times out, the client has two options: fail fast or retry the request. If the failure is temporary, retrying with backoff can resolve the issue. However, if the downstream service is overwhelmed, immediate retries can worsen the problem. To prevent this, retries should be delayed with progressively increasing intervals until either a maximum retry limit is reached or sufficient time has passed.
This approach incorporates techniques such as Exponential Backoff, Cap, Random Jitter, and Retry Queue, ensuring the system remains resilient while avoiding additional strain on the downstream service.
Exponential Backoff
Exponential backoff is a technique where the retry delay increases exponentially after each failure.
backoff = backOffMin * (backOffFactor ^ attempt)
For an initial backoff of 2 seconds and a backoff factor of 2, counting the first retry as attempt 0:
1st retry: 2 × 2^0 = 2 seconds
2nd retry: 2 × 2^1 = 4 seconds
3rd retry: 2 × 2^2 = 8 seconds
After each failed attempt, the wait before the next retry grows exponentially. On its own, though, exponential backoff has two weaknesses: the delay can grow without bound, and clients that failed at the same moment will also retry at the same moments, producing load spikes on the downstream service. The next two refinements, a cap and random jitter, address these problems.
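A tiny Go sketch of the formula above, with a zero-based attempt to match the example:

package main

import (
    "fmt"
    "math"
    "time"
)

// expBackoff returns the delay before the given retry attempt (0-based):
// backOffMin * backOffFactor^attempt.
func expBackoff(backOffMin time.Duration, backOffFactor float64, attempt int) time.Duration {
    return time.Duration(float64(backOffMin) * math.Pow(backOffFactor, float64(attempt)))
}

func main() {
    for attempt := 0; attempt < 3; attempt++ {
        fmt.Printf("retry %d: wait %v\n", attempt+1, expBackoff(2*time.Second, 2, attempt))
    }
}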
Capped Exponential Backoff
Capped exponential backoff builds upon exponential backoff by introducing a maximum limit (cap) for the retry delay. This prevents the delay from growing indefinitely while ensuring retries happen within a reasonable timeframe.
backoff = min(cap, backOffMin * (backOffFactor ^ attempt))
The cap bounds the maximum delay. For an initial backoff of 2 seconds, a backoff factor of 2, and a cap of 8 seconds:
1st retry: 2 × 2^0 = 2 seconds
2nd retry: 2 × 2^1 = 4 seconds
3rd retry: min(2 × 2^2, 8) = 8 seconds (reaches the cap)
4th retry: min(2 × 2^3, 8) = 8 seconds (capped)
Capping the delay ensures retries don't extend indefinitely, striking a balance between efficiency and resilience.
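Extending the previous sketch with a cap (values again illustrative):

package main

import (
    "fmt"
    "math"
    "time"
)

// cappedBackoff returns min(maxDelay, backOffMin * backOffFactor^attempt).
func cappedBackoff(backOffMin, maxDelay time.Duration, backOffFactor float64, attempt int) time.Duration {
    d := time.Duration(float64(backOffMin) * math.Pow(backOffFactor, float64(attempt)))
    if d > maxDelay {
        return maxDelay
    }
    return d
}

func main() {
    for attempt := 0; attempt < 4; attempt++ {
        fmt.Printf("retry %d: wait %v\n", attempt+1, cappedBackoff(2*time.Second, 8*time.Second, 2, attempt))
    }
}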
Random Jitter with Capped Exponential Backoff
This method enhances capped exponential backoff by adding randomness to the delay, preventing synchronized retries and reducing the risk of traffic spikes. Random jitter spreads out retry attempts over time, improving system stability.
delay = random(0, min(cap, backOffMin * (backOffFactor ^ attempt)))
For an initial backoff of 2 seconds, a backoff factor of 2, and a cap of 8 seconds:
1st retry: a random value between 0 and 2 × 2^0 = 2 seconds
2nd retry: a random value between 0 and 2 × 2^1 = 4 seconds
3rd retry: a random value between 0 and min(2 × 2^2, 8) = 8 seconds (at the cap)
The addition of randomness avoids "retry storms," where multiple clients retry at the same time, and spreads out load more evenly to protect the downstream service.
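Putting the pieces together, here is a rough Go sketch of a retry loop with capped exponential backoff and full jitter; the failing operation, the attempt limit, and the durations are illustrative:

package main

import (
    "errors"
    "fmt"
    "math"
    "math/rand"
    "time"
)

const (
    backOffMin    = 2 * time.Second
    backOffFactor = 2.0
    backOffCap    = 8 * time.Second
    maxAttempts   = 4
)

// retry runs op, sleeping a random ("full jitter") slice of the capped
// exponential backoff between failed attempts.
func retry(op func() error) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = op(); err == nil {
            return nil
        }
        if attempt == maxAttempts-1 {
            break // no point sleeping after the final attempt
        }
        backoff := time.Duration(float64(backOffMin) * math.Pow(backOffFactor, float64(attempt)))
        if backoff > backOffCap {
            backoff = backOffCap
        }
        delay := time.Duration(rand.Int63n(int64(backoff))) // random in [0, backoff)
        fmt.Printf("attempt %d failed (%v), retrying in %v\n", attempt+1, err, delay)
        time.Sleep(delay)
    }
    return err
}

func main() {
    // Illustrative operation that always fails.
    err := retry(func() error { return errors.New("downstream unavailable") })
    fmt.Println("gave up:", err)
}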
Retry Amplification
Suppose a user request goes through a chain: the client calls Your Awesome Service, which calls Order Service, which in turn calls Payment Service. If the request from Order Service to Payment Service fails, should Order Service retry? Retrying could delay Your Awesome Service's response, risking its own timeout. If Your Awesome Service then retries as well, the client might time out too, amplifying retries across the chain: with three attempts at every level, a single user request can fan out into up to 3 × 3 × 3 = 27 calls to Payment Service, overloading the deepest services. For long chains, it is often better to retry at a single level and fail fast everywhere else.
Fallback Plan
Fallback plans act as a backup when retries fail. Imagine a courier who can't deliver your package after trying once. Instead of repeatedly attempting the same thing, they switch to a plan B: leaving the package at the door, at a nearby kiosk, or at the post office. Similarly, in systems this means using an alternative option, such as cached data or another provider, when the primary service isn't working. The system then notifies users or logs the change, just like the courier leaving a note or sending a text. This way, resources aren't wasted on endless retries, and the system remains resilient by relying on a practical backup solution.
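As a small illustration, assuming a hypothetical fetchPrice call and an in-memory cache of last known good values as the fallback:

package main

import (
    "errors"
    "fmt"
)

var priceCache = map[string]float64{"book-42": 19.99} // last known good values

// fetchPrice is a hypothetical primary call that is currently failing.
func fetchPrice(sku string) (float64, error) {
    return 0, errors.New("pricing service unavailable")
}

// priceWithFallback tries the primary source first and falls back to the
// cached value, logging the degradation instead of retrying endlessly.
func priceWithFallback(sku string) (float64, error) {
    if p, err := fetchPrice(sku); err == nil {
        return p, nil
    }
    if p, ok := priceCache[sku]; ok {
        fmt.Println("pricing service down, serving cached price for", sku)
        return p, nil
    }
    return 0, errors.New("no price available for " + sku)
}

func main() {
    p, err := priceWithFallback("book-42")
    fmt.Println(p, err)
}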
Circuit Breakers
When a downstream service fails persistently, retries slow down the caller and can spread slowness system-wide. A circuit breaker detects such failures, blocks requests to avoid slowdowns, and fails fast instead. It has three states: closed (passes calls, tracks failures), open (blocks calls), and half-open (tests recovery).
If failures exceed a threshold, it opens; after a delay, it tests in half-open mode. Success closes it; failure reopens it. This protects the system, enabling graceful degradation for non-critical dependencies. Timing and thresholds depend on context and past data.
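A simplified, single-goroutine Go sketch of the three states (the failure threshold and open interval are illustrative; a production breaker would also need locking and richer failure accounting):

package main

import (
    "errors"
    "fmt"
    "time"
)

type state int

const (
    closed state = iota
    open
    halfOpen
)

// breaker is a minimal circuit breaker sketch.
type breaker struct {
    st           state
    failures     int
    failureLimit int           // consecutive failures before opening
    openFor      time.Duration // how long to stay open before probing
    openedAt     time.Time
}

var errOpen = errors.New("circuit open: failing fast")

func (b *breaker) call(op func() error) error {
    if b.st == open {
        if time.Since(b.openedAt) < b.openFor {
            return errOpen // fail fast without touching the downstream
        }
        b.st = halfOpen // allow a single probe request through
    }

    err := op()
    if err != nil {
        b.failures++
        if b.st == halfOpen || b.failures >= b.failureLimit {
            b.st = open // probe failed or threshold reached: (re)open
            b.openedAt = time.Now()
        }
        return err
    }

    // Success: reset the counter and close the circuit.
    b.failures = 0
    b.st = closed
    return nil
}

func main() {
    b := &breaker{failureLimit: 3, openFor: 10 * time.Second}
    for i := 0; i < 5; i++ {
        err := b.call(func() error { return errors.New("downstream error") })
        fmt.Println("call", i+1, "->", err)
    }
}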
Conclusion
Downstream resiliency is a critical aspect of Resiliency Engineering, ensuring components can adapt and recover gracefully from failures in dependent systems. By implementing effective strategies, systems can remain robust and reliable, even in the face of unforeseen disruptions.
Inspirations and References
All you need to know about timeouts, Zalando Engineering Blog
Understanding Distributed Systems by Roberto Vitillo, as presented in Gergely Orosz's Understanding the Ins and Outs of Distributed Systems