
API Gateway Patterns: BFF, Rate Limiting, and Routing

Metasphere Engineering

API gateways accumulate responsibilities the way inboxes accumulate unread messages. A small transformation added “just this once” for a specific client integration. Business logic for a special case that would be “easier to handle at the edge.” Data aggregation because the mobile client needed it and a proper BFF was not ready yet. You know exactly how this story ends.

Six months later, the gateway contains business rules that only two engineers understand (and one of them just left), performance-critical aggregation logic that cannot be independently scaled, and deployment coupling where a gateway change is required to update any API response shape. What started as a routing layer has become a bottleneck. You have rebuilt the monolith at the edge.

The inverse failure is equally common: a gateway configured as a thin proxy that forwards all traffic to a single backend, providing none of the cross-cutting value a proper gateway delivers, while adding a network hop and a deployment dependency. Both extremes are wrong.

The Gateway’s Actual Job

A well-designed API gateway has a narrow, well-defined responsibility set. Here is what actually belongs there, and nothing else. For a deeper look at how security fits into this, see the application security controls guide.

TLS termination at the gateway decouples certificate management from individual services. Services receive plain HTTP or mTLS internally; the gateway handles HTTPS externally.

Authentication validates that a request carries a valid identity token before it reaches any backend. Rejecting unauthenticated requests at the gateway means no backend service ever handles unauthenticated traffic. They receive pre-validated requests only.

Rate limiting protects backends from traffic spikes and API abuse. A gateway processing 50,000 requests per second can enforce global limits that individual services cannot. Rate limiting at the gateway is the only effective approach because the gateway sees all traffic. Individual services see only their slice and cannot enforce cross-service limits.

Request routing directs requests to appropriate backends based on path, method, and header matching. Backends do not need to know about each other.

Request ID injection assigns a unique trace ID to every request at the edge, enabling end-to-end distributed tracing across all downstream service calls.

That is the complete list. If you are tempted to add more, resist.
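Request ID injection is small enough to sketch in a few lines. A minimal version, assuming requests are represented as a dict of headers and using an illustrative `X-Request-Id` header name (neither detail is prescribed here):

```python
import uuid

TRACE_HEADER = "X-Request-Id"  # illustrative header name

def inject_request_id(headers: dict) -> dict:
    """Assign a trace ID at the edge unless the client already sent one."""
    if not headers.get(TRACE_HEADER):
        headers[TRACE_HEADER] = uuid.uuid4().hex
    return headers

# Every downstream call forwards the same header, so one identifier
# connects the client request to every service hop behind the gateway.
upstream_headers = inject_request_id({"Host": "api.example.com"})
```

The only design decision of note is whether to trust a client-supplied ID; the sketch keeps it, which is common for internal clients but often disallowed at a public edge.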

The BFF Pattern

So the gateway handles cross-cutting concerns. But where does aggregation live? Not in the gateway. That is the mistake that creates edge monoliths.

When different client types need significantly different data from the same backend services, a shared gateway layer becomes an awkward compromise. The mobile app needs compact responses to minimize data transfer and battery. The web app needs richer data for complex UI rendering. The partner API needs a stable, versioned interface that does not change when the mobile app design changes.

Routing all three through the same gateway-level transformation logic means every client gets a response shaped for none of them. The backend-for-frontend pattern creates a dedicated aggregation layer for each client type. Each BFF calls the microservices it needs, aggregates and transforms data for its client’s requirements, and presents a client-optimized API. The BFF is code the team owns and deploys independently.

This is the correct layer for aggregation because the BFF understands the client’s actual requirements and can evolve with them. A microservice architecture where the mobile team updates the mobile BFF without coordinating with the web team or the gateway team is organizationally sustainable. A shared gateway layer where every client-specific change requires gateway deployment is not. Full stop.
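As a sketch of the shape this takes, here is a hypothetical mobile BFF endpoint. The service clients are in-process stand-ins for what would be HTTP calls to real microservices; every name and field below is illustrative:

```python
# Hypothetical stand-ins for HTTP calls to backend microservices.
def fetch_user(user_id: str) -> dict:
    return {"id": user_id, "name": "Ada", "email": "ada@example.com",
            "avatar_url": "https://cdn.example.com/a.png"}

def fetch_recent_orders(user_id: str) -> list:
    return [{"id": "o-17", "total_cents": 4200, "status": "shipped",
             "line_items": [{"sku": "B-9", "qty": 1}]}]

def mobile_home_screen(user_id: str) -> dict:
    """Mobile BFF endpoint: aggregate two services, then return only
    the compact fields the mobile home screen actually renders."""
    user = fetch_user(user_id)
    orders = fetch_recent_orders(user_id)
    return {
        "name": user["name"],
        "order_count": len(orders),
        "latest_status": orders[0]["status"] if orders else None,
    }
```

The point is the ownership boundary: the aggregation and field selection live in code the mobile team deploys, not in gateway configuration shared by every client.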

Rate Limiting Design

Rate limiting at the API gateway protects both security and reliability. The design choices matter more than most teams realize:

Rate limit by IP address for unauthenticated endpoints (typically 100 requests per minute) to prevent enumeration and brute force. Rate limit by API key or user ID for authenticated endpoints (1,000-10,000 requests per minute depending on tier) to enforce fair use and prevent runaway clients from impacting others. Return 429 with a Retry-After header on violations, not a generic 503.
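Those rules can be sketched with a per-key fixed-window limiter. The class below is illustrative, not a production implementation: real gateways typically back this with a shared store such as Redis so the limit holds across gateway replicas, and often prefer sliding windows or token buckets to avoid window-boundary bursts.

```python
import time

class FixedWindowLimiter:
    """Per-key fixed-window rate limiter (in-memory sketch)."""

    def __init__(self, limit: int, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # key -> (window_start, count)

    def check(self, key, now=None):
        """Return (status, headers) for one request from `key`."""
        now = time.time() if now is None else now
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # new window
        count += 1
        self.counts[key] = (start, count)
        if count > self.limit:
            retry_after = int(self.window - (now - start)) + 1
            # 429 with Retry-After, not a generic 503: the client is
            # throttled, the backend is fine.
            return 429, {"Retry-After": str(retry_after)}
        return 200, {}

limiter = FixedWindowLimiter(limit=100)  # e.g. per-IP, unauthenticated tier
```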

The operational metric that matters: what percentage of API traffic is being rate limited? Below 1% means limits are calibrated correctly. Above 5% means limits are too aggressive or legitimate traffic patterns have changed. Alert on sudden increases in rate limit hit rate. They almost always indicate a client bug or an attempted abuse pattern. Connecting gateway rate limit data to your broader observability stack lets you correlate traffic anomalies with downstream service health in a single investigation flow rather than context-switching between tools.

Gateway Observability

The API gateway is the single best triage point in a distributed system because it sees 100% of inbound traffic. Nothing else does. No backend service has this visibility. A service mesh sees inter-service communication but misses the client-facing edge. Load balancer metrics show connection-level data but lack application context. The gateway sits at the intersection of client intent and system behavior, making it the fastest path from “something is wrong” to “here is what is wrong.”

Five metrics form the foundation of gateway observability.

Request rate by endpoint establishes the traffic baseline. Alert when any endpoint deviates more than 2x from its normal volume, either up or down. A sudden drop in traffic to a critical endpoint often indicates a client-side failure or DNS issue that backend monitoring would never catch.

Error rate by HTTP status code separates client errors (4xx) from server errors (5xx). A spike in 401 responses points to an authentication system issue; a spike in 502 responses points to backend unavailability. Aggregating all errors into a single metric destroys this diagnostic precision. Do not do it.

Latency percentiles at P50, P95, and P99 reveal the shape of performance degradation. A P50 that stays flat while P99 doubles indicates a long-tail problem affecting a subset of requests, almost always one slow upstream dependency.

Upstream health check results track which backend services are available from the gateway's perspective.

Rate limit hit rate measures what percentage of traffic is being throttled. A sudden increase from 1% to 8% without a corresponding traffic increase means a client bug is sending duplicate requests.
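The percentile math behind the latency signal is worth pinning down. A nearest-rank sketch over made-up latency samples:

```python
def percentile(sorted_ms: list, p: float) -> float:
    """Nearest-rank percentile over pre-sorted latency samples (ms)."""
    k = max(0, int(round(p / 100 * len(sorted_ms))) - 1)
    return sorted_ms[k]

latencies = sorted([12, 14, 15, 15, 16, 18, 22, 30, 95, 480])
p50 = percentile(latencies, 50)   # median request
p99 = percentile(latencies, 99)   # the tail
# A flat P50 alongside a climbing P99 is the long-tail signature:
# most requests are fine, one slow dependency drags the tail.
```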

Distributed tracing starts at the gateway. The gateway injects a trace ID into every request, and that ID follows the request through every downstream service call, message queue, and database query. Without gateway-injected trace IDs, correlating a client-reported error with the specific backend failure that caused it means manual timestamp matching across multiple logging systems. That is hours of work per incident. With trace IDs, a single identifier connects the client request to the exact service, function, and database query that failed. Teams adopting observability monitoring practices should treat the gateway as the origin point for all distributed traces.

Access logging at the gateway deserves its own discipline. Every request should produce a structured JSON log entry containing the request path, HTTP method, response status code, response latency in milliseconds, client IP, authenticated client ID, and the trace ID. Structured logs enable programmatic analysis. When an API consumer reports intermittent failures, filter gateway logs by their client ID and see every request they made, the response codes they received, and the latency they experienced. This resolves most “your API is broken” support tickets in minutes rather than hours. Teams that maintain comprehensive gateway access logs alongside their API integration engineering practices consistently report faster issue resolution and more productive conversations with API consumers.
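A minimal sketch of such a log line; the field names below are illustrative rather than a required schema:

```python
import json
import time
import uuid

def access_log_entry(path, method, status, latency_ms,
                     client_ip, client_id, trace_id) -> str:
    """Emit one structured JSON line per request."""
    return json.dumps({
        "ts": time.time(),
        "path": path,
        "method": method,
        "status": status,
        "latency_ms": latency_ms,
        "client_ip": client_ip,
        "client_id": client_id,
        "trace_id": trace_id,
    })

line = access_log_entry("/v2/users", "GET", 200, 42,
                        "203.0.113.7", "acme-mobile", uuid.uuid4().hex)
```

Because every field is a named key, the "filter by client ID" workflow described above is a one-line query in any log aggregator instead of a regex over free-form text.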

The operational payoff of gateway observability compounds over time. Teams with well-instrumented gateway dashboards reduce mean time to diagnosis by 60% compared to teams relying on backend service logs alone. The gateway dashboard becomes the first screen engineers open during an incident, providing immediate answers to the three critical triage questions: is traffic arriving, are backends responding, and where is the bottleneck?

Version Routing and API Sunset

Running multiple API versions simultaneously is a gateway routing concern, not a service concern. Configure the gateway to route /v1/users to the v1 service deployment and /v2/users to v2. This lets you retire old versions on a deliberate schedule while maintaining backward compatibility during the transition.

The gateway should actively communicate deprecation to clients: inject Sunset and Deprecation response headers on v1 endpoints once v2 is live. Clients that monitor their response headers know the timeline. Clients that do not will still be on v1 when you force the cutover, which is a useful signal about who needs a direct nudge before the old endpoint shuts down.
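A sketch of how version routing and deprecation headers can live in one configuration table; the paths, upstream hosts, sunset date, and header values are all hypothetical:

```python
# Illustrative routing table and deprecation metadata.
ROUTES = {
    "/v1/users": "http://users-v1.internal",
    "/v2/users": "http://users-v2.internal",
}
DEPRECATED_PREFIXES = {
    "/v1/": {
        "Deprecation": "true",
        "Sunset": "Sat, 01 Nov 2025 00:00:00 GMT",  # hypothetical date
    },
}

def route(path: str):
    """Resolve the upstream and any deprecation headers for a request path."""
    upstream = next(
        (u for prefix, u in ROUTES.items() if path.startswith(prefix)), None)
    headers = {}
    for prefix, extra in DEPRECATED_PREFIXES.items():
        if path.startswith(prefix):
            headers.update(extra)  # injected into the response
    return upstream, headers
```

Keeping both in one table means the sunset date ships through the same review and deploy pipeline as the routing change itself.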

Old API versions are a security liability. Every active version is attack surface you must monitor, patch, and maintain. Building explicit sunset enforcement into gateway routing creates the operational pressure that actually gets old versions decommissioned rather than running indefinitely because removing them would break an integration nobody documented. It is common to find v1 endpoints running for years after they were “deprecated.” Do not let this happen to you.

The gateway is not the place for creativity. It is the place for discipline. Keep its responsibilities narrow, push aggregation into BFFs, enforce rate limits and version lifecycle at the edge, and resist every request to add “just one more transformation.” The teams that maintain this boundary build API platforms that scale. The teams that do not end up with gateway deployments that take longer than service deployments and wonder how they got there. The answer is always the same: they said yes to “just one more.”

Security is the one area where the gateway should be comprehensive, not minimal.

Security at the Edge

The API gateway is the natural enforcement point for security controls that apply to all inbound traffic. Implement these controls once at the gateway and you eliminate the risk of inconsistent enforcement across dozens of backend services. A single misconfigured service behind a properly secured gateway is still protected. A single misconfigured service with no gateway-level security is directly exposed. The math is simple.

Web Application Firewall integration at the gateway layer provides protection against the OWASP Top 10 vulnerability categories before malicious requests ever reach application code. SQL injection attempts, cross-site scripting payloads, and path traversal attacks are blocked at the edge. A WAF processing rules at the gateway adds 1-3ms of latency per request. That cost prevents attack traffic from consuming backend compute resources, which matters significantly during volumetric attacks where thousands of malicious requests per second would otherwise saturate service capacity. For teams designing defense-in-depth strategies, the application security controls guide covers how gateway-level WAF fits alongside service-level and infrastructure-level protections.

Mutual TLS between the gateway and backend services establishes zero-trust internal communication. External clients authenticate to the gateway over standard TLS. The gateway then establishes mTLS connections to each backend, where both sides present and verify certificates. This means a compromised internal service cannot impersonate another service, and network-level attackers who breach the perimeter cannot intercept or modify inter-service traffic. Certificate rotation is the operational challenge. Automate it from day one using short-lived certificates (24-72 hour expiry) issued by an internal certificate authority. Do not tell yourself you will automate it later. Manual certificate management at scale is a guaranteed outage waiting to happen.

JWT validation at the gateway follows a specific pattern that balances security with performance. The gateway validates the token signature against the identity provider’s public key, checks the expiry timestamp, and verifies the audience claim matches the expected API. These checks reject expired, tampered, or misdirected tokens before any backend processing occurs. The gateway then extracts the validated claims (user ID, roles, permissions) and forwards them to backend services as trusted headers. Backend services use these headers for authorization decisions without re-validating the token. This separation is critical: the gateway handles authentication (is this token valid?), and the backend service handles authorization (can this user perform this action on this resource?). Do not push fine-grained authorization into the gateway. That pulls domain logic to the edge, which is exactly the anti-pattern this entire architecture exists to prevent.
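The validation steps can be sketched end to end. The sketch below uses HS256 so it stays self-contained; a production gateway typically verifies RS256 signatures against the identity provider's published public key, and uses a maintained JWT library rather than hand-rolled crypto:

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def validate_jwt(token: str, secret: bytes, audience: str) -> dict:
    """Gateway-side checks in order: signature, expiry, audience.
    Returns the validated claims, which the gateway forwards to
    backends as trusted headers."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise PermissionError("bad signature")   # tampered token
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise PermissionError("expired")         # stale token
    if claims.get("aud") != audience:
        raise PermissionError("wrong audience")  # misdirected token
    return claims
```

Note what the sketch does not do: it makes no decision about roles or resources. Those claims pass through untouched for the backend's authorization logic.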

IP allowlisting and geo-blocking address compliance requirements that many regulated industries mandate. Financial services APIs may be required to reject traffic from specific jurisdictions. Healthcare platforms may need to restrict API access to known partner IP ranges. The gateway enforces these rules using IP reputation databases and GeoIP lookup, adding under 1ms of latency per request. Maintain allowlists and blocklists as configuration that deploys through the same CI/CD pipeline as other gateway rules. Manual firewall changes made through a console are undocumented, unreviewable, and unreproducible.

Bot detection and API fingerprinting protect against automated abuse that rate limiting alone cannot stop. Sophisticated bots rotate IP addresses and API keys to stay under per-client rate limits while still overwhelming the system in aggregate. Gateway-level fingerprinting analyzes request patterns: header ordering, TLS fingerprints, request timing distributions, and behavioral sequences that distinguish human-driven API clients from automated scrapers. A cloud-native architecture that processes 10 million API requests per day will see 15-30% of that traffic from bots unless active detection is in place. That is not a typo. Up to a third of your traffic is probably not human. Blocking identified bot traffic at the gateway reclaims backend capacity and protects data assets from unauthorized harvesting.

Design an API Gateway Architecture That Scales

An API gateway stuffed with business logic becomes a bottleneck and a liability. Metasphere designs gateway architectures with clear responsibility boundaries and BFF patterns that stay maintainable as your API surface grows.


Frequently Asked Questions

What should every API gateway be responsible for?


TLS termination, authentication token validation, rate limiting, request routing, request ID injection, and observability (latency metrics, request logging). These cross-cutting concerns add 2-5ms of gateway overhead but eliminate duplicated implementation across every backend service. A gateway handling 10,000 requests per second validates tokens once at the edge instead of requiring each of 20+ backend services to implement their own validation.

What should never go inside an API gateway?


Business logic, data aggregation, and complex transformation should never live in the gateway. Authorization decisions (whether user X can access resource Y) require domain context only the service has. Aggregating data from 3-5 services couples the gateway to service internals and creates a deployment bottleneck. Teams that put aggregation in gateways report 40% longer deployment cycles because every API shape change requires a gateway release.

What is the backend-for-frontend pattern and when should you use it?


BFF creates a dedicated aggregation layer per client type: mobile BFF, web BFF, partner API BFF. Each aggregates 3-8 backend service calls and transforms data for its client’s needs. BFF is justified when mobile responses need to be 60-80% smaller than web responses from the same data, or when client teams deploy on different cadences. A well-designed BFF adds 10-30ms latency but reduces client-side processing and over-fetching significantly.

How should API versioning be handled at the gateway?


The gateway routes /v1/users to the v1 service instance and /v2/users to v2, decoupling public API versions from internal service versions. Best practice is a 90-day deprecation window with Sunset and Deprecation headers injected automatically. During the dual-version period, monitor v1 traffic weekly. When v1 drops below 5% of total traffic, begin active migration outreach to remaining consumers before the cutoff date.

What gateway observability metrics matter most?


The five critical metrics are: request rate by endpoint (alert on 2x normal baseline), error rate by status code (4xx vs 5xx, alert above 1%), latency P50/P95/P99 (alert when P99 exceeds 500ms), upstream service availability (health check failures), and rate limit hit rate (alert above 5% of total traffic). The gateway sees 100% of traffic, making it the fastest incident triage point. Teams with gateway dashboards reduce mean time to diagnosis by 60%.