Resilience Patterns: Circuit Breakers, Bulkheads, Retries
You deploy a new version of your payment service. It passes all tests. Traffic looks normal. Then a downstream fraud-detection API starts responding in 8 seconds instead of 200 milliseconds. Your payment service’s thread pool fills up waiting. The checkout service, which calls payments, starts queuing. The cart service, which calls checkout, starts timing out. Within 90 seconds your entire order pipeline is down. The root cause? A single slow dependency that never actually returned an error.
That is how distributed systems fail. Not with a clean crash and a stack trace. With slow, invisible degradation that cascades across service boundaries until something completely unrelated breaks loudly enough for someone to notice. Traditional try-catch error handling is useless here because nothing threw an exception. Every service was “working.” They were just working slowly enough to poison everything upstream. This is the failure mode that catches every team eventually.
The patterns that prevent this (circuit breakers, bulkheads, retry budgets, timeout hierarchies) have been understood for over a decade. Hystrix popularized them. But Hystrix is dead (Netflix moved it to maintenance in 2018), and many teams are still writing raw HTTP calls with a 30-second default timeout and hoping for the best. Hope is not a resilience strategy.
Why Distributed Failure Is Different
A monolith fails atomically. The process crashes, the health check fails, the load balancer stops sending traffic. Recovery is restart. The blast radius is one process. Simple.
A microservice architecture fails partially. And partial failure is far harder to handle. Service A is healthy. Service B is healthy. But Service A calls Service B, which calls Service C, which calls Service D, which is slow. That chain degrades in ways none of the individual health checks catch. The failure propagates backward through the call chain, consuming resources at each hop, until services that have no direct relationship to the root cause start failing.
The insidious part: every service in that chain was technically running. Health checks returned 200. Logs showed no errors. Just slow responses. Alerting that only watches error rates missed it entirely. The dashboard showed green until requests started timing out at the edge load balancer, at which point the cascade was already 90 seconds deep and the damage was done.
The circuit breaker is the first pattern you need to understand.
Circuit Breaker Pattern
The circuit breaker is the single most important resilience pattern in distributed systems. It does for service calls what a physical circuit breaker does for electrical circuits: detects a fault condition and stops traffic before the fault causes further damage.
Three States
A circuit breaker has three states. Closed is normal operation - requests flow through and the breaker monitors failure rate. Open means the failure threshold was crossed and all requests are immediately rejected without calling the downstream service. Half-open is the recovery probe state - a small number of probe requests are allowed through to test whether the dependency has recovered.
Configuration That Works
Resilience4j (the production successor to Hystrix) uses a sliding window for failure rate calculation. Here are the parameters that actually work in production:
- slidingWindowSize: 10 (number of calls in the window). Too small and transient errors trip the breaker unnecessarily. Too large and the breaker is slow to react.
- failureRateThreshold: 50% (percentage). Below this, errors are normal variance. Above it, the dependency is genuinely unhealthy.
- waitDurationInOpenState: 30 seconds. Long enough for most transient issues to resolve. Short enough to recover promptly.
- permittedNumberOfCallsInHalfOpenState: 3. One probe can be a fluke. Three probes give reasonable confidence.
- slowCallDurationThreshold: 2 seconds. Any call exceeding this is counted as a failure for breaker calculations, catching the “not-erroring-but-too-slow” scenario.
That last parameter is what most teams miss entirely. A service responding in 5 seconds is just as dangerous to your thread pool as one returning a 500 error. More dangerous, actually, because it consumes the thread for 5 seconds instead of failing immediately. Resilience4j lets you treat slow calls as failures, which is critical for preventing the latency-cascade scenario from the opening of this article.
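The three states and the parameters above can be sketched as a small state machine. This is not Resilience4j itself (in a Java service you would configure the real library via its `CircuitBreakerConfig` builder); it is an illustrative Python sketch that borrows the same parameter names, including the slow-call-as-failure rule:

```python
import time
from collections import deque


class CircuitBreaker:
    """Illustrative sketch of a Resilience4j-style breaker, not the library."""

    def __init__(self, sliding_window_size=10, failure_rate_threshold=50.0,
                 wait_duration_open=30.0, permitted_half_open_calls=3,
                 slow_call_duration=2.0):
        self.window = deque(maxlen=sliding_window_size)  # True = failed call
        self.failure_rate_threshold = failure_rate_threshold
        self.wait_duration_open = wait_duration_open
        self.permitted_half_open_calls = permitted_half_open_calls
        self.slow_call_duration = slow_call_duration
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.half_open_results = []

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.wait_duration_open:
                self.state = "HALF_OPEN"       # allow recovery probes
                self.half_open_results = []
            else:
                raise RuntimeError("circuit open: call rejected")
        start = time.monotonic()
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            raise
        # Slow calls count as failures: the parameter most teams miss.
        self._record(failed=(time.monotonic() - start) > self.slow_call_duration)
        return result

    def _record(self, failed):
        if self.state == "HALF_OPEN":
            self.half_open_results.append(failed)
            if failed:
                self._open()                   # probe failed: back to OPEN
            elif len(self.half_open_results) >= self.permitted_half_open_calls:
                self.state = "CLOSED"          # enough successful probes
                self.window.clear()
            return
        self.window.append(failed)
        if len(self.window) == self.window.maxlen:
            rate = 100.0 * sum(self.window) / len(self.window)
            if rate >= self.failure_rate_threshold:
                self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = time.monotonic()
```

Once the window fills with failures, the breaker opens and every subsequent call fails fast instead of tying up a thread waiting on the unhealthy dependency.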
Bulkhead Isolation
Circuit breakers protect against a single failing dependency. Bulkheads protect against resource exhaustion across dependencies. The name comes from ship construction: watertight compartments that prevent a hull breach from flooding the entire vessel. The analogy is precise.
Without bulkheads, all outbound calls share a common thread pool. A slow database query consumes the same threads as a fast cache lookup. When the database gets slow, it consumes all available threads, and now even your cache calls (which would complete in 2ms) cannot execute because there are no threads left. Everything stops because one dependency got slow. This is the most common form of cascade failure we see in production.
Thread Pool vs Semaphore
Thread pool bulkheads give each dependency its own isolated pool. The payment gateway gets 50 threads. The user service gets 30. The cache gets 20. If the payment gateway exhausts its 50 threads, the user service and cache are completely unaffected.
Semaphore bulkheads limit concurrency without dedicated threads. Instead of allocating a thread pool, you cap the number of concurrent calls to a dependency. Less memory (no 1MB-per-thread overhead) but less isolation - a slow call still occupies a thread from the shared pool while the semaphore slot is held.
The decision heuristic: use thread pool bulkheads for any dependency with unpredictable latency (databases, third-party APIs, legacy services). Use semaphore bulkheads for dependencies with consistently fast responses (caches, in-memory lookups, DNS).
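A semaphore bulkhead reduces to a counting semaphore around the call. The sketch below is illustrative (class and parameter names are ours, not a library's); the thread-pool variant would instead give each dependency its own `concurrent.futures.ThreadPoolExecutor`:

```python
import threading


class SemaphoreBulkhead:
    """Sketch of a semaphore bulkhead: caps concurrent calls to one
    dependency without allocating it a dedicated thread pool."""

    def __init__(self, max_concurrent_calls):
        self._sem = threading.Semaphore(max_concurrent_calls)

    def call(self, fn):
        # Non-blocking acquire: reject immediately instead of queuing,
        # so a slow dependency cannot build an unbounded backlog.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: call rejected")
        try:
            return fn()
        finally:
            self._sem.release()
```

Rejecting rather than queuing is the important design choice: a queue in front of a slow dependency just moves the resource exhaustion somewhere less visible.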
Retry Strategies That Don’t Cause Thundering Herds
Retries are the most dangerous resilience pattern. Read that again. The pattern meant to help recovery is the one most likely to make things worse. Three retries means a failing service receives 4x its normal traffic at the exact moment it is least able to handle it. Multiply that across 50 calling services and you understand why naive retry policies turn partial failures into total outages.
Exponential Backoff with Jitter
The base formula is straightforward: wait min(cap, base * 2^attempt) milliseconds between retries. A common configuration is base=100ms, cap=10s. First retry at 100ms, second at 200ms, third at 400ms.
But exponential backoff alone is insufficient. If 200 clients all start retrying after the same failure event, they all compute the same backoff intervals and retry simultaneously. Adding jitter randomizes each client’s backoff: wait = random(0, min(cap, base * 2^attempt)). AWS calls this “full jitter” and their benchmarking shows it produces 3x fewer total calls than equal jitter during recovery.
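The full-jitter formula is a one-liner; this sketch uses the base and cap values quoted above (seconds rather than milliseconds, a unit choice of ours):

```python
import random


def full_jitter_backoff(attempt, base=0.1, cap=10.0):
    """AWS-style "full jitter": sleep a uniform random amount between
    zero and the exponential backoff ceiling. Units are seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Because each client draws independently from the interval, a fleet of 200 clients that failed at the same instant spreads its retries across the window instead of retrying in lockstep.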
Retry Budgets
The fundamental fix is a retry budget: cap total retries at a percentage of normal traffic. If your service normally handles 1000 requests per second, a 10% retry budget means the system allows at most 100 retries per second across all clients, regardless of how many individual requests fail. This is not optional. Google SRE’s published guidance recommends a 10-20% budget. Envoy proxy implements this natively with retry_budget.budget_percent.
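A retry budget can be sketched as a simple tally: allow a retry only while total retries remain under the configured percentage of observed requests. Envoy's real implementation tracks this over a sliding window per cluster; the process-wide counters and names below are simplifications of ours:

```python
class RetryBudget:
    """Sketch of a client-side retry budget, loosely modeled on
    Envoy's retry_budget. Counters here are plain process-wide
    tallies, not Envoy's per-cluster sliding window."""

    def __init__(self, budget_percent=10.0, min_retries=1):
        self.budget_percent = budget_percent
        self.min_retries = min_retries  # floor so low-traffic services can retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        allowed = max(self.min_retries,
                      self.requests * self.budget_percent / 100.0)
        return self.retries < allowed

    def record_retry(self):
        self.retries += 1
```

The key property: when a dependency hard-fails and every request wants to retry, the budget caps the amplification at 1.1x normal traffic instead of 4x.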
Timeout Hierarchies
A service call chain of A -> B -> C -> D needs coordinated timeouts. Get this wrong and you waste resources at every level. If A’s timeout is 10 seconds, B’s should be less than 10 seconds, C’s less than B’s, and D’s less than C’s. Otherwise, A gives up before the downstream call completes, but B, C, and D keep processing the abandoned request, wasting resources on work nobody is waiting for.
The practical rule: each service’s timeout should be 80% of its caller’s timeout, minus the service’s own processing time. If A sets a 5-second timeout to B, and B needs 200ms for its own logic, B should set a 3.8-second timeout to C (5s * 0.8 - 0.2s). This creates headroom for each service to process the response and propagate errors cleanly.
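The rule is a two-term formula; a tiny helper (names are ours) makes the arithmetic explicit:

```python
def child_timeout(parent_timeout, own_processing_time, factor=0.8):
    """The 80%-of-caller heuristic: leave headroom for this service to
    process the response and propagate errors before its caller gives up.
    All values in seconds."""
    return parent_timeout * factor - own_processing_time
```

Applied down the chain, each hop's budget shrinks, which is exactly the point: the innermost call must finish with time to spare at every level above it.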
Deadline propagation is the mature version of this pattern. Instead of each service computing its own timeout, the originating request carries a deadline timestamp. Every service in the chain checks the remaining time before calling downstream. If only 500ms remain and the downstream p99 is 800ms, skip the call and return a fallback. Do not waste time on a call that will almost certainly exceed the deadline. gRPC implements this natively via grpc-timeout headers. For HTTP services, propagate a custom X-Request-Deadline header.
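Deadline propagation can be sketched as a small value object carried with the request. The `X-Request-Deadline` header name follows the text; encoding the deadline as Unix epoch seconds is an assumption of this sketch, not a standard:

```python
import time


class Deadline:
    """Sketch of deadline propagation: the edge sets one absolute
    deadline; every hop checks remaining time instead of computing
    its own timeout."""

    def __init__(self, deadline_epoch):
        self.deadline_epoch = deadline_epoch

    @classmethod
    def from_now(cls, timeout_s):
        return cls(time.time() + timeout_s)

    @classmethod
    def from_header(cls, value):
        return cls(float(value))

    def remaining(self):
        return self.deadline_epoch - time.time()

    def to_header(self):
        # Epoch-seconds encoding is an assumption of this sketch.
        return {"X-Request-Deadline": f"{self.deadline_epoch:.3f}"}

    def should_call(self, downstream_p99):
        # Skip the call when the downstream's p99 would almost
        # certainly blow the remaining budget; return a fallback instead.
        return self.remaining() > downstream_p99
```

With 500ms remaining and a downstream p99 of 800ms, `should_call` returns False and the service falls back immediately rather than burning the tail of the budget on a doomed call.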
Fallback Patterns
When everything fails (the circuit breaker is open, retries exhausted, timeout elapsed) what does the service return? This is the question most teams answer poorly or not at all. A well-designed fallback is the difference between a degraded experience and a broken one.
The fallback hierarchy, from best to worst:
- Cached response: return the last known good response. Works for read-heavy endpoints like product catalogs, user profiles, configuration. A stale product price from 5 minutes ago is better than an error page.
- Degraded response: return partial data. If the recommendation engine is down, show the product page without recommendations. If the personalization service is unavailable, show the default experience.
- Static default: return a hardcoded safe response. If the feature flag service is unreachable, default to the conservative flag values.
- Graceful error: return a meaningful error that the UI can handle. Not a 500 with a stack trace. A structured response that says “this feature is temporarily unavailable” with a Retry-After header.
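The hierarchy above composes naturally as an ordered chain: try the primary call, then each fallback tier in order, and return a structured graceful-error response only when all tiers fail. A minimal sketch (function names are ours):

```python
def with_fallbacks(primary, fallbacks, graceful_error):
    """Try the primary call, then each fallback tier in order
    (e.g. cached -> degraded -> static default). If every tier
    fails, return the structured graceful-error response."""
    try:
        return primary()
    except Exception:
        pass
    for fallback in fallbacks:
        try:
            return fallback()
        except Exception:
            continue  # this tier also failed; try the next one
    return graceful_error
```

In practice each tier would log which level served the response, so dashboards show when traffic is riding on stale caches rather than live data.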
Load Shedding at the Application Layer
Circuit breakers and bulkheads protect against downstream failures. Load shedding protects the service itself from being overwhelmed by upstream traffic. When inbound request volume exceeds what the service can handle at acceptable latency, shed the excess load deliberately rather than degrading quality for everyone. A 503 for 10% of requests is better than 5-second latency for 100% of requests.
The simplest implementation is concurrency-based: track the number of in-flight requests and reject new ones with 503 when the count exceeds a threshold. A more sophisticated approach uses latency-based shedding. Monitor p99 latency, and when it exceeds your SLO target, start rejecting a percentage of low-priority requests. The CoDel (Controlled Delay) algorithm, originally developed for network queue management, applies the same idea adaptively.
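The concurrency-based version fits in a few lines. This sketch returns an HTTP-style status tuple for illustration; names and the threshold are ours:

```python
import threading


class LoadShedder:
    """Concurrency-based load shedding sketch: reject with 503 once
    in-flight requests exceed max_in_flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def handle(self, fn):
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return 503, "shed"          # fail fast: protect latency
            self.in_flight += 1
        try:
            return 200, fn()
        finally:
            with self.lock:
                self.in_flight -= 1
```

The threshold should sit just above the concurrency at which latency starts to climb; set it from load-test data, not guesswork.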
Effective infrastructure architecture includes load shedding at multiple layers: the reverse proxy, the application, and the database connection pool. Each layer protects the next.
Health Check Design
Kubernetes gives you three probe types, and conflating them is a common source of cascading restarts. Get these wrong and Kubernetes will amplify your outage instead of containing it.
Liveness probes answer one question: is this process fundamentally broken? Return unhealthy only when the process is deadlocked, stuck in an infinite loop, or otherwise unrecoverable without a restart. A liveness probe that checks downstream dependencies will cause restart cascades when a shared dependency goes down. Ten services restart simultaneously, overloading the scheduler and amplifying the outage. This pattern takes down entire clusters.
Readiness probes answer a different question: can this instance handle traffic right now? This is where dependency checks belong. If the database connection pool is exhausted, the service is alive but not ready. Kubernetes removes it from the Service endpoint list, traffic shifts to healthy instances, and the instance recovers without restarting.
Startup probes answer: has this process finished initializing? Critical for services with slow startup (JVM warmup, cache loading, ML model loading). Without a startup probe, the liveness probe kills the container before it finishes booting.
Chaos Engineering as Validation
Every pattern in this article is a hypothesis until tested under real failure conditions. A circuit breaker configured to trip at 50% failure rate over 10 seconds needs to be validated with actual injected failures, not unit tests against mock responses. Mocks do not catch configuration drift. Mocks do not catch the timeout that someone changed from 5s to 500ms last Tuesday. Teams practicing chaos engineering inject real faults (pod kills, network latency, dependency failures) and verify that circuit breakers actually trip, bulkheads actually isolate, and fallbacks actually return useful responses.
The pattern composition (circuit breaker wrapping retries wrapping bulkhead wrapping fallback) is only as good as its most recent validation. Configuration drift, dependency changes, and traffic pattern shifts all invalidate previous test results. Continuous chaos, not annual gamedays, is what keeps resilience patterns honest. Building distributed systems that genuinely tolerate partial failure requires treating these patterns as living infrastructure. Set-and-forget resilience is not resilience at all. It is an untested assumption waiting for the next production incident to prove it wrong.