Microservice Communication Patterns: REST, gRPC, Events
It starts with 3 services talking over REST. Everyone agrees it’s the fastest way to get moving. Fast forward to service 15, and somebody’s inventory lookup is taking 800ms under load. Because the order service calls inventory synchronously, and the checkout page calls order synchronously, your users are staring at a 2.4-second spinner for what used to be a 200ms page load. The SRE team pulls a late night tracing a cascade failure that started with a 97th-percentile database query in a service three hops away from the one that’s actually timing out. If you’ve been in this room at 11 PM, you already know where this article is going.
By the time you’ve got 20 services wired together with synchronous HTTP, changing even one service to async means renegotiating the contract with every caller. The interface shape is baked into deployment pipelines, retry logic, and error handling across a dozen codebases. Congratulations. You’ve built a distributed monolith with network hops instead of function calls.
Communication pattern is not a detail to sort out after the services are running. It determines your failure blast radius, your consistency model, and the coupling between teams who should not need to coordinate on every release. Getting the defaults right early is what separates architectures that scale from ones that calcify. Making these microservice architecture decisions before they’re expensive to reverse is the whole game.
The Synchronous Coupling Problem
REST between services means the calling service blocks until it gets a response. In a chain, latency adds and availability multiplies. This is just math. Five services each at 99.9% availability produce a chain with 99.5% end-to-end availability. If each of those five services has a P99 latency of 200ms, the chain's tail latency can reach 1,000ms before your own code even runs. These are not implementation bugs. They are the arithmetic consequences of synchronous coupling.
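That arithmetic is worth seeing concretely. A minimal sketch, using the five-service numbers from above (the function names are illustrative, not from any library):

```python
# Quick arithmetic for a synchronous call chain: availabilities multiply,
# latencies add. Numbers match the five-service example in the text.

def chain_availability(per_service: float, hops: int) -> float:
    """End-to-end availability of a serial chain of services."""
    return per_service ** hops

def chain_latency_worst_case_ms(per_hop_p99_ms: float, hops: int) -> float:
    """Tail latency when every hop hits its P99 on the same request."""
    return per_hop_p99_ms * hops

print(round(chain_availability(0.999, 5), 4))      # 0.995 -> 99.5% end to end
print(chain_latency_worst_case_ms(200, 5))         # 1000.0 ms before your code runs
```

Note that each added hop makes both numbers worse multiplicatively or additively; there is no tuning that undoes the chain length itself.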
The more dangerous property is failure coupling. When the inventory service is overloaded and responding in 10 seconds, the order service waits 10 seconds. The API gateway waits 10 seconds. Users see 10-second page loads. Without circuit breakers, one slow dependency grinds every transaction path that touches it to a halt.
Here’s the part most teams learn too late: the failure mode is not one slow service. It’s one slow service that causes thread pool exhaustion in its callers, which causes those callers to become slow, which exhausts thread pools in their callers. A cascade. By the time your monitoring alerts fire, three services are down and the root cause is a database index that dropped on a service nobody was watching. Site reliability engineering practices formalize the circuit breaker and retry budget parameters that prevent this cascade from propagating across service boundaries.
When to Choose Async
Asynchronous messaging through Kafka, RabbitMQ, or SQS decouples services temporally. The producer publishes and moves on. If the notification service is down when an order is placed, the order still completes. The message sits in the queue. When the notification service recovers, it processes the backlog. The order service has no awareness that the notification service even exists. That’s the power of temporal decoupling.
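The shape of temporal decoupling is easy to show in miniature. Here an in-process queue stands in for Kafka, RabbitMQ, or SQS, and the event names and order IDs are illustrative; the point is that the producer never waits on the consumer:

```python
import queue
import threading

# In-process queue standing in for a real broker (Kafka/RabbitMQ/SQS).
order_events: "queue.Queue[dict]" = queue.Queue()

def place_order(order_id: str) -> None:
    # The producer publishes and moves on; it never waits on a consumer.
    order_events.put({"type": "order.created", "order_id": order_id})

def notification_consumer() -> None:
    # Started late to simulate a consumer that was down at publish time;
    # once "recovered", it simply drains the backlog.
    while True:
        event = order_events.get()
        # ... send the notification ...
        order_events.task_done()

place_order("A-100")
place_order("A-101")    # both orders complete even though no consumer is running
threading.Thread(target=notification_consumer, daemon=True).start()
order_events.join()     # backlog drained after the consumer "recovers"
```

The producer code has no import of, reference to, or timeout against the consumer. That is what lets the notification service be down without blocking checkout.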
The trade-off is eventual consistency. An order placed at 10:00:00 may not show in the analytics dashboard until 10:00:05. For most cross-domain events, that’s perfectly fine. For use cases where the caller needs to know the outcome before proceeding (confirming inventory availability before accepting payment, for instance), synchronous communication is the right call.
Here is the heuristic that holds up after dozens of these decisions: use async for things other domains should react to but don’t need to confirm. Use sync for queries and commands where the caller needs a definitive answer before proceeding. Make this choice explicitly per use case. Do not default to one pattern for everything. Cloud-native platform engineering practices codify these defaults so individual teams aren’t reinventing the decision on every new service.
One thing that catches teams off-guard with async, and it will catch you if you’re not deliberate: you need to design for message ordering and idempotency from day one. Kafka guarantees ordering within a partition, but not across partitions. If your order-created and order-cancelled events land in different partitions, a consumer processes them out of order. Partition keys solve this for entity-scoped events, but you need to think about it upfront. Bolting it on after you’ve already got 50 event types in production is a multi-sprint effort that nobody wants to fund.
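Both day-one concerns fit in a few lines. This sketch shows entity-scoped partition keys (so order-created and order-cancelled for the same order land in the same partition) and an idempotency check for redelivered events; the partition count, event shapes, and in-memory ID store are illustrative stand-ins:

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative; matches whatever your topic is configured with

def partition_for(entity_key: str) -> int:
    # Keying by order_id keeps every event for one order in one partition,
    # preserving their relative order (Kafka orders within a partition only).
    digest = hashlib.md5(entity_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

processed_event_ids: set = set()  # in production: a durable store, not memory

def handle(event: dict) -> bool:
    """Apply an event at most once per event_id; return False on a duplicate."""
    if event["event_id"] in processed_event_ids:
        return False                     # redelivery: safe no-op
    processed_event_ids.add(event["event_id"])
    # ... apply the state change ...
    return True

# Same order -> same partition -> ordering preserved for that entity.
assert partition_for("order-42") == partition_for("order-42")
assert handle({"event_id": "e1", "type": "order.created", "order_id": "order-42"})
assert not handle({"event_id": "e1", "type": "order.created", "order_id": "order-42"})
```

The hard part isn't the code; it's making the partition key and the event ID mandatory fields of every event schema before the first consumer ships.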
gRPC for Internal High-Frequency Calls
For internal service-to-service calls where latency and throughput compound, gRPC is worth the tooling investment. Protocol Buffers produce payloads 30-60% smaller than equivalent JSON. HTTP/2 multiplexing enables multiple concurrent RPCs over a single TCP connection. Strongly typed proto contracts generate client and server stubs in Go, Java, Python, TypeScript, or whatever your teams run. No more runtime type mismatches that JSON-over-REST silently allows.
The setup cost is real, though. Don’t pretend otherwise. Proto files need to be compiled and distributed. Generated code needs to be versioned alongside the proto definitions. Service teams need to understand proto schema evolution rules: field numbering, required vs. optional semantics, and the discipline of never reusing field numbers after deprecation. A team that renames field 3 instead of deprecating it and adding field 8 will produce a wire-compatible but semantically broken contract that passes all tests. This exact mistake happens more than once.
For external APIs where developer ergonomics matter, browser compatibility is needed, or you want engineers to be able to curl your endpoints, REST is still the right default. For internal calls above roughly 1,000 RPS, the gRPC investment pays back in weeks. Solid distributed systems engineering covers the proto management and API evolution patterns that keep gRPC sustainable at scale.
Circuit Breakers and Retry Budgets
Microservice architectures without circuit breakers are not resilient architectures. They’re architectures that haven’t failed badly enough yet. Wire them in before the first production traffic, not after the first outage.
The pattern itself is straightforward. Track the error rate of calls to each downstream dependency over a rolling window. When the error rate exceeds your threshold (50% over a 10-second window is a reasonable starting point), open the circuit. Subsequent calls fail immediately with a local error rather than making network calls to the failing dependency. The caller returns a degraded response: cached data, a graceful fallback, or an honest error. After a configured cooldown (30 seconds is typical), allow a small number of probe requests through. If they succeed, close the circuit. If they fail, extend the open state.
The subtlety is tuning, and this is where teams spend real time. Set the threshold too sensitively and circuits open during normal traffic spikes, every morning when traffic ramps up after the overnight lull. Set it too loose and the circuit opens only after hundreds of requests have already timed out, which means hundreds of users already had a bad experience.
Retry budgets are the companion control, and they matter just as much. In a 4-service chain where each layer makes 3 attempts, a single failing leaf service can receive 3^4 = 81 requests from one originating request. That amplification turns a struggling service into a dead service. The standard defense: cap retries at 3 attempts per layer, use exponential backoff with jitter starting at 100ms (the jitter prevents synchronized retry storms from all callers hitting the failing service simultaneously), and let the circuit breaker handle sustained failures rather than relying on retries to eventually succeed.
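The retry discipline described above fits in one small wrapper. This is a sketch with the parameters from the text (3 attempts, 100ms base, full jitter); the function name is illustrative:

```python
import random
import time

MAX_ATTEMPTS = 3     # per layer, counting the original attempt
BASE_DELAY_S = 0.1   # 100ms starting backoff

def call_with_retries(fn):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fn()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise                        # budget exhausted: surface the error
            # Full jitter: sleep a random duration up to the exponential cap,
            # so many callers don't retry the struggling service in lockstep.
            cap = BASE_DELAY_S * (2 ** attempt)
            time.sleep(random.uniform(0, cap))

# With 3 attempts per layer, a 4-layer chain amplifies one originating
# request into at most 3**4 = 81 hits on a failing leaf service.
assert 3 ** 4 == 81
```

The cap matters more than the backoff curve: without it, each layer multiplies the one below, and the amplification factor is exponential in chain depth.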
Sagas for Distributed Transactions
The moment you split a monolith into services, you lose database transactions that span multiple entities. An order that debits a wallet, reserves inventory, and creates a shipment used to be one transaction with ACID guarantees. Now it’s three service calls, each with its own database, and “rollback” does not mean what it used to.
The saga pattern is the standard answer. Each step in a multi-service operation has a corresponding compensating action. If step 3 fails, steps 2 and 1 execute their compensating actions in reverse. Choreography-based sagas use events: the inventory service publishes “inventory.reserved” and the payment service reacts. Orchestration-based sagas use a coordinator service that directs each step explicitly.
In practice, orchestration wins for anything beyond 3-4 steps. Don’t fight this. Choreographed sagas across 6 services become impossible to reason about when you need to answer “what happens if step 4 fails after step 3 succeeded?” The event chain is distributed across 6 codebases and you need to read all of them to understand the rollback sequence. Nobody wants to do that at 3 AM.
The two genuinely hard failure modes are the ones nobody thinks about until they’re in production: the compensating action itself fails (the payment refund API is down when you need to compensate), and partial success where compensation is impossible (you shipped the package before the payment bounced). Both require runbooks, not just code. Teams that don’t design compensating actions before implementing the forward path always discover this gap in production. Always.
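The orchestration shape reduces to a list of (action, compensation) pairs executed by a coordinator. A minimal sketch using the wallet/inventory/shipment example from above; every handler here is an illustrative stub, and the happy-path compensation loop deliberately ignores the two hard failure modes just described:

```python
# Each forward step is paired with its compensating action, designed together.
def debit_wallet(ctx):    ctx["wallet_debited"] = True
def credit_wallet(ctx):   ctx["wallet_debited"] = False      # compensation
def reserve_stock(ctx):   ctx["stock_reserved"] = True
def release_stock(ctx):   ctx["stock_reserved"] = False      # compensation
def create_shipment(ctx): raise RuntimeError("carrier API down")
def cancel_shipment(ctx): pass                               # compensation

ORDER_SAGA = [
    (debit_wallet, credit_wallet),
    (reserve_stock, release_stock),
    (create_shipment, cancel_shipment),
]

def run_saga(steps, ctx):
    completed = []
    for action, compensate in steps:
        try:
            action(ctx)
            completed.append(compensate)
        except Exception:
            # Compensate in reverse order. Note what this loop does NOT handle:
            # a compensation that itself fails, or a step that can no longer be
            # undone. Those need retries, dead-letter queues, and runbooks.
            for comp in reversed(completed):
                comp(ctx)
            return False
    return True

ctx = {}
assert run_saga(ORDER_SAGA, ctx) is False   # step 3 failed
assert ctx["wallet_debited"] is False       # step 1 rolled back
assert ctx["stock_reserved"] is False       # step 2 rolled back
```

One advantage of this shape over choreography is visible in the code itself: the entire rollback sequence is readable in one place, in one codebase, instead of being distributed across six event handlers.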
Communication pattern choices made in the first few months of a microservice architecture become load-bearing walls that are expensive to tear out once multiple teams are building against them. Default to async for cross-domain events. Use sync only when the caller genuinely needs a response to proceed. Wire in circuit breakers before the first production traffic. Design compensating actions before implementing forward paths. Get these four things right early, and the architecture scales. Get them wrong, and you’ll spend a year paying down the debt.