Resilience Patterns for Distributed Failures
You deploy a new version of your payment service. It passes all tests. Traffic looks normal. Then a downstream fraud-detection API starts responding in 8 seconds instead of 200 milliseconds. Your payment service’s thread pool fills up waiting. The checkout service starts queuing. The cart service starts timing out. Within 90 seconds your entire order pipeline is down. Not because something crashed. Because something got slow.
Nothing threw an error. Every service was technically “working.” Just slowly enough to poison everything upstream. The distributed systems equivalent of “this is fine” while the room fills with smoke.
- A slow dependency is worse than a dead one. Dead services fail fast. Slow ones hold connections, fill worker pools, and cascade upstream until the whole pipeline collapses.
- Circuit breakers without timeout tuning are ornamental. A 30-second default timeout means 30 seconds of requests piling up before the breaker even thinks about tripping.
- Bulkheads isolate the blast radius. Your payment service and search service should never share a worker pool. One failing shouldn’t drag the other down.
- Retry budgets prevent retry storms. Stack four layers of retries at 3 attempts each and one failure becomes 3^4 = 81 requests at the deepest service. Budget retries across the chain, not per service.
- Timeouts must form a hierarchy. Downstream timeout must always be shorter than upstream timeout. Break this rule and you create ghost requests burning resources after the caller has already given up.
Why Distributed Failure Is Different
A monolith crashes or it runs. A microservice architecture fails partially, in ways that traditional error handling can’t catch. Service A calls B, which calls C, which calls D. D gets slow. The latency spreads back through the chain, eating up resources at every step, until completely unrelated services start failing. No health check catches it because every service is technically healthy.
Your building’s electrical system works the same way. One bad appliance drawing too much current doesn’t just trip its own outlet. Without proper circuit breakers, it overloads the wiring, heats up the panel, and eventually kills power to the whole building. The appliance didn’t “break.” It just drew more than the system could handle.
One slow dependency, no isolation, every service goes down.
Circuit Breakers: The First Line of Defense
Your house has a fuse box. Too much current on a circuit and the fuse blows, cutting power before the wiring melts. A circuit breaker does the same thing for service calls. Three states. Closed is normal: current flows, requests go through, and the breaker quietly watches for trouble. Open means something tripped it. Too many failures, too fast. All requests get rejected right away with a fallback response. Power cut. Wiring protected. Half-open is the breaker letting a single test request through to check if the problem is fixed. If it works, the breaker closes again. If not, it stays open and resets the timer.
Resilience4j production defaults that actually work:
- slidingWindowSize: 10 calls
- failureRateThreshold: 50%
- waitDurationInOpenState: 30 seconds
- permittedNumberOfCallsInHalfOpenState: 3
- slowCallDurationThreshold: 2 seconds
That last parameter is the one nobody sets until they get burned. A service responding in 5 seconds is more dangerous than a flat-out error because it holds the worker hostage while looking “healthy.” A zombie employee. Technically at their desk. Producing nothing. Treat slow calls as failures. Set the threshold based on the dependency’s P99 latency, not the library’s default.
Don’t: Use a circuit breaker with the default 30-second timeout. By the time the breaker trips, 30 seconds of requests have piled up and the cascade is already happening.
Do: Set slowCallDurationThreshold to 2x the dependency’s P99 latency. A service with a 200ms P99 should trigger slow-call detection at 400ms, not at the library default of 60 seconds.
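Wired into Resilience4j, those numbers look roughly like the sketch below. The breaker name and the 200ms P99 are assumptions carried over from the example above, not prescriptions.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

final class FraudApiBreaker {
    static Supplier<String> guard(Supplier<String> fraudApiCall) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .slidingWindowSize(10)                             // evaluate the last 10 calls
                .failureRateThreshold(50)                          // trip at 50% failures
                .waitDurationInOpenState(Duration.ofSeconds(30))   // stay open for 30 seconds
                .permittedNumberOfCallsInHalfOpenState(3)          // 3 probe calls when half-open
                .slowCallDurationThreshold(Duration.ofMillis(400)) // 2x a 200ms P99, not the 60s default
                .slowCallRateThreshold(50)                         // treat slow calls like failures
                .build();
        CircuitBreaker breaker = CircuitBreaker.of("fraud-api", config);
        // While the breaker is open, the decorated supplier throws
        // CallNotPermittedException immediately instead of holding a worker hostage.
        return CircuitBreaker.decorateSupplier(breaker, fraudApiCall);
    }
}
```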
Bulkhead Isolation
A well-wired building doesn’t put the kitchen, bathroom, and garage on the same circuit. The microwave trips a breaker, the bathroom lights stay on. Bulkhead isolation is that same principle for your services.
Without bulkheads, all outbound calls share a single worker pool. One slow database query eats every available worker, and even 2ms cache lookups are stuck in line behind it. Every dependency sharing one pool.
Thread pool bulkheads give each dependency its own separate circuit. If the payment dependency burns through its 50 workers, the user service and cache are completely unaffected. Different circuit, different breaker, different pool. Semaphore bulkheads are cheaper. They limit how many requests can run at once for each dependency but still share the underlying wiring. Less memory overhead (no dedicated workers to maintain), but a slow call still ties up a shared worker. Weaker isolation.
| Dimension | Thread Pool Bulkhead | Semaphore Bulkhead |
|---|---|---|
| Isolation mechanism | Dedicated thread pool per dependency. Each dependency gets its own threads | Shared thread pool. Concurrency limit per dependency via semaphore count |
| Resource exhaustion | Contained. Slow dependency exhausts only its own pool. Other dependencies unaffected | Partial. Semaphore limits concurrency but threads are shared. Slow dependency still holds threads |
| Overhead | Higher. Each pool has its own threads, context switches, memory | Lower. Single thread pool, just a counter per dependency |
| Timeout behavior | Thread pool rejection when full. Fast failure, no queuing | Semaphore blocks until permit available (or timeout). Can queue |
| Best for | Critical dependencies where full isolation justifies the overhead | Most dependencies. Lower overhead, adequate isolation |
| Example | Payment gateway gets its own 20-thread pool. Database gets its own 50-thread pool | Payment gateway: max 20 concurrent. Database: max 50 concurrent. All share one pool |
| When to use thread pool bulkheads | When to use semaphore bulkheads |
|---|---|
| Dependencies with unpredictable latency (databases, third-party APIs) | Dependencies that respond predictably fast (caches, in-memory lookups) |
| Critical paths where isolation is non-negotiable | Non-critical paths where memory matters more than isolation |
| Legacy services with no SLA guarantees | Internal services with well-known, predictable response times |
| Any dependency that has caused a cascading outage before | DNS lookups, config fetches, health checks |
If you’re unsure which one a dependency needs, default to thread pool. The memory cost is worth the isolation guarantee. You wouldn’t wire your server room on the same circuit as the break room microwave.
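A minimal Resilience4j sketch of both flavors, reusing the sizes from the tables above; the instance names and queue capacity are illustrative.

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import java.time.Duration;

final class Bulkheads {
    // Thread pool bulkhead: the payment gateway gets its own threads, so a
    // slow payment call can never starve any other dependency's workers.
    static final ThreadPoolBulkhead PAYMENTS = ThreadPoolBulkhead.of("payments",
            ThreadPoolBulkheadConfig.custom()
                    .coreThreadPoolSize(10)
                    .maxThreadPoolSize(20) // the dedicated 20-thread pool from the table
                    .queueCapacity(1)      // near-zero queue: saturation fails fast
                    .build());

    // Semaphore bulkhead: caps concurrent cache lookups without dedicating
    // threads. Adequate for dependencies that respond predictably fast.
    static final Bulkhead CACHE = Bulkhead.of("cache",
            BulkheadConfig.custom()
                    .maxConcurrentCalls(50)
                    .maxWaitDuration(Duration.ZERO) // reject rather than queue for a permit
                    .build());
}
```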
Retry Strategies and Thundering Herds
Three retries per client sounds harmless until you do the math. Three retries plus the original request means 4x the traffic hitting a service at the exact moment it’s least able to handle it. Fifty callers, each retrying 3x: 50 requests become 200. Helping a friend move, except 200 friends show up at the same time.
Power outage in your neighborhood. The grid comes back on. Every house fires up the air conditioning, the fridge, and the TV at the same time. The grid, which just barely recovered, buckles under the spike and the power goes right back out. A thundering herd.
Exponential backoff with full jitter is the fix: wait = random(0, min(cap, base * 2^attempt)). Without jitter, 200 clients retry at the same intervals after the same failure. Perfectly synchronized load spikes. Full jitter randomizes the retry window and spreads the load across time. Staggering when each house turns the AC back on. Google’s SRE book recommends budgeting retries at 10-20% of normal traffic.
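The jitter formula itself, as a small sketch (the base and cap values are illustrative, not prescriptions):

```java
import java.util.concurrent.ThreadLocalRandom;

final class FullJitter {
    private static final long BASE_MS = 100;   // first retry waits up to ~100ms
    private static final long CAP_MS = 10_000; // never wait longer than 10s

    // wait = random(0, min(cap, base * 2^attempt))
    static long backoffMillis(int attempt) {
        long ceiling = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```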
Without a retry budget, your own retry logic becomes the outage. Congratulations, you DDoS’d yourself. The recovering dependency gets hammered by synchronized retries, slows down further, triggers more retries, and the cycle feeds itself.
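One way to enforce a budget is a token bucket shared across the whole client: regular requests earn fractional tokens, each retry spends a whole one. A minimal sketch, similar in spirit to the retry throttling gRPC and Envoy ship; the class and constants here are hypothetical, not a library API.

```java
final class RetryBudget {
    private static final double TOKENS_PER_REQUEST = 0.1; // retries capped at ~10% of traffic
    private static final double MAX_TOKENS = 100.0;
    private double tokens = MAX_TOKENS;

    // Every regular request earns back a fraction of a retry token.
    synchronized void onRequest() {
        tokens = Math.min(MAX_TOKENS, tokens + TOKENS_PER_REQUEST);
    }

    // A retry is allowed only if the budget can pay for it. When the
    // dependency is broadly failing, tokens drain and retries stop
    // instead of hammering the recovering service.
    synchronized boolean tryAcquireRetry() {
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```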
Timeout Hierarchies and Deadline Propagation
- P99 latency measured and documented for every downstream dependency
- Timeout values set to 2-3x the dependency’s P99 (not library defaults)
- Each service’s timeout is shorter than its caller’s timeout minus its own processing time
- Deadline propagation headers (gRPC deadlines or X-Request-Deadline) in use
- Monitoring alerts on timeout rate increases per dependency
Your main breaker is rated for 200 amps. The sub-panel for the kitchen is 50 amps. Individual outlets are 15-20 amps. Each level is smaller than the one above it. An outlet trips, the kitchen sub-panel stays on. The kitchen sub-panel trips, the main breaker stays on. Timeouts in a service chain work the same way.
Each service’s timeout should be roughly 80% of its caller’s timeout, minus the service’s own processing time. If Service A sets a 5-second timeout calling B, and B needs 200ms for its own work, then B should set roughly a 3.8-second timeout calling C (80% of 5 seconds, minus 200ms). Without this hierarchy, A gives up but B, C, and D keep working on a request nobody is waiting for. Ghost requests. Burning resources to compose a reply to an email that was already deleted.
Deadline propagation is the grown-up version of this pattern. The original request carries a deadline timestamp, and every service checks how much time is left before calling the next service downstream. If 500ms remain and the downstream P99 is 800ms, skip the call and return a fallback right away. gRPC does this natively. For HTTP services, pass X-Request-Deadline as a custom header.
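A sketch of that check, assuming the header carries an epoch-millisecond timestamp set at the edge (the header name is a convention, not a standard, and the helpers are hypothetical):

```java
import java.util.function.Supplier;

final class DeadlineGuard {
    static <T> T callWithBudget(long deadlineEpochMs,   // parsed from X-Request-Deadline
                                long downstreamP99Ms,   // measured for this dependency
                                Supplier<T> downstreamCall,
                                Supplier<T> fallback) {
        long remaining = deadlineEpochMs - System.currentTimeMillis();
        if (remaining < downstreamP99Ms) {
            // Not enough time left for the call to plausibly finish:
            // skip it and return a fallback instead of creating a ghost request.
            return fallback.get();
        }
        // A real client would also set this hop's timeout to ~80% of
        // `remaining` and forward the same absolute deadline downstream.
        return downstreamCall.get();
    }
}
```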
Fallback Patterns and Load Shedding
When the breaker trips, you need flashlights. Fallback hierarchy, from best experience to worst:
| Priority | Strategy | When to use | Example |
|---|---|---|---|
| 1 | Cached response | Read-heavy endpoints where slightly stale data is fine | Last-known product price from 5 min ago beats an error page |
| 2 | Degraded response | Partial data is better than no data | Product page without recommendations |
| 3 | Static default | Safe defaults exist for the feature | Conservative feature flag values when the flag service is unreachable |
| 4 | Graceful error | No fallback data available | Clear “temporarily unavailable” message with Retry-After header |
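The shape of that ladder in code, as a sketch; every name below is a hypothetical stand-in, and only the ordering comes from the table.

```java
import java.util.Optional;
import java.util.function.Function;

final class PriceFallback {
    static String price(String productId,
                        Function<String, String> liveLookup,            // primary path
                        Function<String, Optional<String>> cacheLookup, // priority 1: cached
                        String staticDefault) {                         // priority 3: default
        try {
            return liveLookup.apply(productId);
        } catch (RuntimeException e) {
            return cacheLookup.apply(productId)   // last-known price if we have one
                    .orElse(staticDefault);       // safe default if we don't
        }
    }
}
```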
Load shedding works alongside fallbacks at the application layer. A 503 for 10% of requests is vastly better than 5-second latency for all of them. During a brownout, you turn off non-essential circuits so the critical ones stay powered. Concurrency-based shedding rejects new requests when the number of in-flight requests passes a threshold. Latency-based shedding starts rejecting low-priority requests when P99 exceeds the SLO target. Good infrastructure design puts shedding at multiple layers: reverse proxy, application middleware, and database connection pool.
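Concurrency-based shedding can be as small as a semaphore in request middleware. A sketch; the cap of 200 in-flight requests is illustrative and should be tuned to what the service sustains while still meeting its latency SLO.

```java
import java.util.concurrent.Semaphore;

final class ConcurrencyShedder {
    private final Semaphore inFlight = new Semaphore(200); // max concurrent requests

    boolean tryEnter() { return inFlight.tryAcquire(); } // false = shed with a 503

    void exit() { inFlight.release(); }
}

// Wiring in a request filter, roughly:
//   if (!shedder.tryEnter()) { respond 503 + Retry-After; return; }
//   try { handle(request); } finally { shedder.exit(); }
```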
Health Checks That Don’t Cause Cascading Restarts
Liveness probes answer one question: is this process seriously broken? Return unhealthy only for deadlocks, corrupted state, or conditions the process can’t recover from on its own. A liveness probe that checks downstream dependencies will cause restart cascades when those dependencies get slow. This single mistake has taken down entire clusters. The database is slow, so the liveness probe fails, so Kubernetes restarts the pod, which puts more load on the already-slow database, which makes more liveness probes fail. The fire alarm triggers the sprinklers, which short out the electrical panel, which triggers more fire alarms.
Readiness probes answer a different question: can this instance handle traffic right now? Dependency checks belong here. If the database connection pool is full, pull the pod out of the service endpoints. Don’t restart it.
Startup probes protect slow-starting containers. Without one, the liveness probe kills the container before it finishes starting up. Java applications with heavy dependency injection are especially vulnerable.
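A probe layout sketch in Kubernetes terms, assuming /livez and /readyz handlers that follow the rules above; paths, ports, and thresholds are illustrative.

```yaml
livenessProbe:          # process-only check: no dependency calls here
  httpGet: { path: /livez, port: 8080 }
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:         # dependency checks belong here: failing = no traffic, not a restart
  httpGet: { path: /readyz, port: 8080 }
  periodSeconds: 5
  failureThreshold: 2
startupProbe:           # allows up to 5 minutes (30 x 10s) before liveness takes over
  httpGet: { path: /livez, port: 8080 }
  periodSeconds: 10
  failureThreshold: 30
```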
Validating Patterns with Chaos Engineering
Every pattern in this article is untested until proven under real failure conditions. Mocks don’t catch configuration drift. They don’t catch the timeout someone quietly changed from 5 seconds to 500 milliseconds, or the circuit breaker that was turned off “temporarily” three months ago. (Nothing is more permanent than a temporary fix.) Chaos engineering injects real faults and checks whether your resilience patterns actually kick in when they need to.
You don’t test a fire alarm by looking at the wiring diagram. You pull it.
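Even a toy latency injector is enough to start pulling it: wrap an outbound call in staging, delay a slice of requests, and confirm the slow-call threshold, breaker, and fallbacks actually engage. A hypothetical sketch, not a real chaos tool:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

final class LatencyInjector {
    // Delays `probability` of calls by `delayMs` before letting them proceed,
    // simulating the 8-second fraud API from the opening scenario.
    static <T> T maybeDelay(double probability, long delayMs, Supplier<T> call) {
        if (ThreadLocalRandom.current().nextDouble() < probability) {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return call.get();
    }
}
```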
| Pattern | Protects against | Key config | Common mistake |
|---|---|---|---|
| Circuit breaker | Cascade from slow/dead dependency | Trip at 50% failures over a 10-call window, half-open after 30s | Timeout too long (30s default = 30s of queued requests) |
| Bulkhead | One dependency eating all resources | Separate worker/connection pools per dependency | Sharing pools across dependencies (defeats the purpose) |
| Retry with backoff | Brief failures (network blips, 503s) | Exponential + jitter, max 3 retries | No jitter = thundering herd after partial outage |
| Retry budget | Retry amplification across service chain | Max 20% additional load from retries | No budget = four stacked retry layers amplify one failure into 81 requests |
| Timeout hierarchy | Ghost requests burning resources | Downstream timeout always shorter than upstream | Inner timeout longer than outer (creates orphans) |
| Fallback | Total dependency failure | Cached data, degraded mode, default response | Fallback that calls another failing dependency |
| Load shedding | Self-protection under extreme load | Reject low-priority when at capacity | Shedding without priority classification (drops important traffic) |
What the Industry Gets Wrong About Resilience Patterns
“Add circuit breakers and you’re resilient.” A circuit breaker with a 30-second default timeout means requests pile up for 30 seconds before the breaker trips. The cascade is already in full swing. Circuit breakers without timeout tuning are ornamental. A breaker set to trip on the library defaults is like a fuse rated for 10,000 amps. Technically present. Functionally useless. Set slowCallDurationThreshold based on the dependency’s P99 latency.
“Retry on failure is always safe.” Retry without a budget amplifies failures. Stack four layers of retries at 3 attempts each and one failed request becomes 3^4 = 81 downstream requests. The retry storm hits the struggling dependency harder, making it slower, triggering more retries. Every house flipping the AC on at the same moment after the outage. Retry budgets cap the total additional load across the whole chain.
Building distributed systems that handle partial failure well takes regular practice with these patterns, not a one-time setup. Set-and-forget resilience is a smoke detector with dead batteries. Comforting until the smoke.
Same deploy. Same fraud API at 8 seconds. But this time the timeout fires at 600ms, the breaker trips, the bulkhead keeps payment’s worker pool from starving the rest of the system, and checkout falls back to cached risk scores. The order pipeline stays up. Cart still works. Users still check out. The fraud API team gets a Slack alert instead of a page about total system failure. Same building, same overloaded appliance, but this time the circuit breaker trips and only one room goes dark.