API Gateway Architecture Done Right
API gateways collect responsibilities the way junk drawers collect batteries. A small transformation added “just this once” for a specific client integration. Business logic for a special case that would be “easier to handle at the edge.” Data aggregation because the mobile client needed it and a proper BFF wasn’t ready yet. You know exactly how this story ends.
Six months later, the gateway has business rules only two engineers understand. One of them just left. There’s aggregation logic that can’t scale on its own. Deployment coupling so tight that any API response change needs a gateway release. It’s the front desk receptionist who started out sorting mail, then took on fixing the printer and everyone’s accounting, and is now the bottleneck for the entire building.
The opposite failure is just as common. Gateway configured as a thin proxy that forwards everything to a single backend. No cross-cutting value. Just a network hop and a deployment dependency for nothing. A front desk that waves everyone through without checking badges.
- Six responsibilities belong in the gateway: TLS termination, token validation, rate limiting, request routing, request ID injection, and access logging. Everything else is a liability.
- BFF (backend-for-frontend) is the correct aggregation layer. Putting aggregation in the gateway is the #1 gateway anti-pattern, and the one with the longest recovery time.
- Rate limit hit rate above 5% means limits are miscalibrated. Below 1% means they’re working. Alert on sudden jumps, not the absolute number.
- Gateway observability slashes mean-time-to-diagnosis. The gateway sees 100% of inbound traffic. No other component has that view.
- Old API versions are attack surface. Build sunset enforcement into routing with 90-day deprecation windows and automatic Sunset headers.
The Gateway’s Actual Job
A narrow responsibility set, strictly kept. The front desk checks badges, directs visitors, and logs who enters. That’s it. Application security at the edge depends on this boundary staying clean.
| Belongs in the gateway | Does NOT belong in the gateway |
|---|---|
| TLS termination | Business logic |
| Token validation (authentication) | Authorization (needs domain context) |
| Rate limiting (global + per-client) | Data aggregation from multiple services |
| Request routing by path/header | Response transformation for specific clients |
| Request ID / trace ID injection | Database queries or caching logic |
| Access logging and metrics | Retry logic for specific service failures |
Complete list. If someone on your team is tempted to add more, the answer is no. Every extra responsibility is a step toward the edge monolith. The moment the receptionist starts doing accounting, the lobby is unattended.
The distinction between authentication (does this token represent a valid user?) and authorization (can this user access this resource?) is where most teams blur the line first. Authentication belongs at the gateway. Authorization needs domain context that only the backend service has. The front desk checks your badge is real. Only the department head decides whether you’re allowed in the meeting.
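To make the split concrete, here is a minimal sketch in Go of where each check lives, assuming a plain net/http gateway and backend; verifyToken and loadDocument are hypothetical stand-ins, not any particular library. The gateway confirms the badge is real and forwards the identity; the service decides access, because ownership context lives in its domain.

```go
// Sketch only: the gateway asks "is this token valid?", the backend asks
// "may this user touch this resource?". verifyToken and loadDocument are
// hypothetical stand-ins for real token verification and data access.
package sketch

import "net/http"

type Claims struct{ Subject string }
type Document struct{ OwnerID string }

func verifyToken(authHeader string) (Claims, error) { return Claims{}, nil } // signature + expiry only
func loadDocument(id string) (Document, error)      { return Document{}, nil }

// Gateway side: authentication, nothing more.
func authn(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		claims, err := verifyToken(r.Header.Get("Authorization"))
		if err != nil {
			http.Error(w, "invalid token", http.StatusUnauthorized)
			return
		}
		r.Header.Set("X-User-ID", claims.Subject) // forward identity; decide nothing
		next.ServeHTTP(w, r)
	})
}

// Backend side: authorization, because ownership lives here.
// Registered against a pattern like "GET /documents/{id}" (Go 1.22+ routing).
func getDocument(w http.ResponseWriter, r *http.Request) {
	doc, err := loadDocument(r.PathValue("id"))
	if err != nil {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	if doc.OwnerID != r.Header.Get("X-User-ID") {
		http.Error(w, "forbidden", http.StatusForbidden)
		return
	}
	// ... serialize and return the document
}
```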
The BFF Pattern: Aggregation Done Right
Different clients need different data shapes. Mobile needs compact responses with minimal payloads. Web needs rich data for complex UI rendering. Partners need stable contracts that don’t break when internal services change. One front desk, three different tours.
A dedicated BFF per client type aggregates and transforms independently. The mobile team updates their BFF without coordinating with web or gateway teams. If gateway changes need cross-team coordination, the gateway has swallowed too much. Microservice architectures were supposed to fix that.
| When BFF makes sense | When it’s overkill |
|---|---|
| Mobile and web need very different response shapes | All clients consume the same API shape |
| Client teams deploy on independent schedules | One team owns all clients and the backend |
| Partner integrations need stable, versioned contracts | Internal traffic only, no external consumers |
| Aggregation spans 3+ backend services per request | Each client calls one backend service directly |
The BFF adds 10-30ms of latency. Worth it when it kills over-fetching and cuts client-side processing. Not worth it when a single backend already returns exactly what the client needs. Don’t build a personal assistant for someone who only asks for directions.
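For a sense of what a mobile BFF endpoint looks like in practice, here is a sketch in Go with plain net/http; the internal service URLs and field names are placeholders, and auth forwarding, timeouts, and concurrency are elided. One mobile request fans out to several backends and comes back as one compact payload.

```go
// Sketch of a mobile BFF handler: fan out to backends, return only the
// compact shape the mobile client needs. Service URLs are placeholders.
package bff

import (
	"encoding/json"
	"net/http"
)

func fetchJSON(url string, out any) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return json.NewDecoder(resp.Body).Decode(out)
}

func mobileHome(w http.ResponseWriter, r *http.Request) {
	var user struct {
		Name string `json:"name"`
	}
	var unread struct {
		Count int `json:"count"`
	}
	var orders struct {
		Items []struct {
			ID     string `json:"id"`
			Status string `json:"status"`
		} `json:"items"`
	}

	// Three internal calls the mobile client never has to make itself.
	if err := fetchJSON("http://users.internal/v1/me", &user); err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}
	_ = fetchJSON("http://notifications.internal/v1/unread", &unread) // error handling elided after the first call
	_ = fetchJSON("http://orders.internal/v1/recent?limit=3", &orders)

	// One small response, shaped for the mobile home screen.
	json.NewEncoder(w).Encode(map[string]any{
		"userName":     user.Name,
		"unreadCount":  unread.Count,
		"recentOrders": orders.Items,
	})
}
```

The web BFF would hit the same backends but return richer fields, and the two deploy on their own schedules without touching the gateway.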
Rate Limiting Design
Rate limit by IP for unauthenticated traffic (100 req/min), by API key for authenticated traffic (1,000-10,000 depending on tier). Return 429 with Retry-After, not 503. The status code difference matters because well-behaved clients retry on 429 with backoff but treat 503 as a service failure. The difference between “please wait” and “something’s broken.”
```yaml
# Gateway rate limiting config
rate_limiting:
  unauthenticated:
    limit: 100
    window: 60s
    key: client_ip
    response: 429  # with Retry-After header
  authenticated:
    tiers:
      free:      { limit: 1000,  window: 60s, key: api_key }
      pro:       { limit: 5000,  window: 60s, key: api_key }
      unlimited: { limit: 10000, window: 60s, key: api_key }
```
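The enforcement side can be small. Here is a sketch in Go using golang.org/x/time/rate for a per-key token bucket; it assumes a single gateway instance, so a real deployment would back the counters with a shared store such as Redis to keep limits consistent across replicas.

```go
// Minimal per-key rate limiting sketch. In-memory only: fine for one
// instance, not for a fleet.
package gateway

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

func limiterFor(key string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	if l, ok := limiters[key]; ok {
		return l
	}
	l := rate.NewLimiter(rate.Limit(100.0/60.0), 100) // ~100 req/min, matching the unauthenticated tier above
	limiters[key] = l
	return l
}

func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("X-API-Key")
		if key == "" {
			key = r.RemoteAddr // unauthenticated traffic falls back to client IP
		}
		if !limiterFor(key).Allow() {
			w.Header().Set("Retry-After", "60") // "please wait", not "something's broken"
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```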
Below 1% of requests rate limited: limits are set correctly. Above 5%: they’re too tight, or traffic patterns have shifted. Alert on sudden jumps. Connect to your observability stack so rate limit events line up with latency spikes. A surge in 429s that doesn’t match a traffic spike means your limits are wrong, not your users.
| Request Type | Rate Limit Tier | Limit | What Happens on Breach |
|---|---|---|---|
| Unauthenticated | IP-based | 100 requests/minute per IP | 429 Too Many Requests. Retry-After header with backoff |
| Free tier authenticated | API key or token-based | 1,000 requests/minute | 429 with usage dashboard link. Upgrade CTA |
| Paid tier authenticated | Token-based with plan lookup | 5,000-10,000 requests/minute (plan-dependent) | 429 with current usage. Soft notification at 80% threshold |
| Internal service | Service identity | No hard limit. Circuit breaker at anomaly detection | Alert on-call. No 429 (internal services shouldn’t be rate-limited, they should be debugged) |
Don’t: Apply the same rate limits to internal service-to-service traffic and external client traffic. Internal traffic has completely different patterns, security posture, and latency needs. Running both through the same gateway config means either throttling internal calls for no reason or under-protecting external ones. Putting the delivery entrance and the customer entrance through the same revolving door.
Do: Separate gateway configs for internal and external traffic. Internal gateways handle service mesh routing and mTLS validation. External gateways handle rate limiting, WAF, and client authentication.
Gateway Observability
The gateway sees 100% of inbound traffic. The security camera at the front door. Five metrics matter: request rate by endpoint (alert on 2x jump from baseline), error rate by status code (separate 4xx client errors from 5xx server errors), latency at P50/P95/P99, upstream health from backend health check failures, and rate limit hit rate as a percentage of total traffic.
Inject trace IDs at the gateway. One identifier connects the client request through every backend hop to the exact failure point. Structured access logs (path, method, status, latency, client ID, trace ID) resolve most “your API is broken” tickets in minutes instead of hours. Teams with gateway dashboards resolve incidents faster because the gateway is the one point with complete traffic visibility. API integration engineering starts with this visibility layer.
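A sketch of that injection point in Go, standard library only; the X-Request-ID header name is a common convention rather than any gateway’s default, and capturing the response status would need a small ResponseWriter wrapper that is omitted here for brevity.

```go
// Sketch: inject a request/trace ID at the edge and emit a structured
// access log for every request. Status capture omitted (needs a wrapper).
package gateway

import (
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"time"
)

func newRequestID() string {
	b := make([]byte, 8)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

func accessLog(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			id = newRequestID() // injected at the edge so every backend hop shares one ID
		}
		r.Header.Set("X-Request-ID", id)
		w.Header().Set("X-Request-ID", id) // echoed back so clients can quote it in tickets

		start := time.Now()
		next.ServeHTTP(w, r)

		slog.Info("access",
			"method", r.Method,
			"path", r.URL.Path,
			"client", r.RemoteAddr,
			"trace_id", id,
			"duration_ms", time.Since(start).Milliseconds(),
		)
	})
}
```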
Building a gateway observability dashboard
The minimum viable dashboard has four panels: request rate by top-10 endpoints (line chart, 5-minute windows), error rate by status code family (stacked area, 4xx and 5xx separated), P50/P95/P99 latency (line chart, alert when P99 crosses 500ms), and rate limit events by tier. Add a fifth panel showing upstream service response times so you can tell gateway latency from backend latency during incidents. Wire alerts to the on-call rotation for 5xx rate above 1% and P99 above 500ms. Most incident triage starts and ends at this dashboard.
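Behind those panels sit two or three metrics emitted at the gateway. Here is a sketch using prometheus/client_golang; the metric and label names are illustrative assumptions, not a standard.

```go
// Sketch of gateway-side instrumentation feeding the dashboard panels.
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "gateway_requests_total", Help: "Requests by endpoint and status family."},
		[]string{"endpoint", "status_family"}, // feeds the request-rate and error-rate panels
	)
	latency = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "gateway_request_duration_seconds",
			Help:    "Request latency by endpoint.",
			Buckets: prometheus.DefBuckets, // feeds the P50/P95/P99 panel
		},
		[]string{"endpoint"},
	)
)

type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

func instrument(endpoint string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		family := strconv.Itoa(rec.status/100) + "xx" // 4xx and 5xx separated, as on the dashboard
		requests.WithLabelValues(endpoint, family).Inc()
		latency.WithLabelValues(endpoint).Observe(time.Since(start).Seconds())
	})
}

func main() {
	prometheus.MustRegister(requests, latency)
	http.Handle("/metrics", promhttp.Handler()) // scraped by the dashboard's data source
	// ... wrap each route with instrument(...) and serve
}
```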
Version Routing and API Sunset
Route /v1/users to the v1 service instance, /v2/users to v2. The gateway separates public API versions from internal service versions, which means v1 and v2 can be completely different implementations behind the same hostname. Inject Sunset and Deprecation headers automatically on deprecated versions so clients get advance notice in every response.
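A sketch of that routing plus automatic sunset signaling in Go, using the standard library’s reverse proxy; the upstream hostnames and dates are placeholders. The Sunset header follows RFC 8594, and the exact Deprecation value format depends on which draft or RFC your clients follow.

```go
// Sketch: version routing with automatic deprecation headers on the old path.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func proxyTo(raw string) http.Handler {
	target, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(target)
}

func main() {
	mux := http.NewServeMux()

	// v2 is current: route straight through.
	mux.Handle("/v2/users/", proxyTo("http://users-v2.internal"))

	// v1 is deprecated: same routing, plus sunset headers on every response.
	v1 := proxyTo("http://users-v1.internal")
	mux.Handle("/v1/users/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Deprecation", "true")
		w.Header().Set("Sunset", "Sat, 01 Nov 2025 00:00:00 GMT") // end of the 90-day window
		w.Header().Set("Link", `</v2/users>; rel="successor-version"`)
		v1.ServeHTTP(w, r)
	}))

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```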
Old versions are attack surface. That 2019 endpoint “nobody uses” still gets 47 requests per day from an undocumented integration. Ghosts in the building. Leaving it running means maintaining security patches for code that should have been torn down years ago.
Security at the Edge
Four controls, layered. WAF blocks OWASP top 10 attacks at the edge with 1-3ms per request overhead. The bouncer who knows what trouble looks like. mTLS between gateway and backends gives you zero-trust internal communication. Automate certificate rotation from day one with short-lived certificates, because manual rotation is the rotation that doesn’t happen. (It never happens.)
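A sketch of the gateway’s client side of that mTLS hop, standard library only; the file paths are placeholders, and in practice the certificates would be short-lived and reloaded automatically rather than read once at startup.

```go
// Sketch: the gateway presents its own cert to backends and only trusts
// backends signed by the internal CA. Paths are placeholders.
package gateway

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
	"os"
)

func mtlsTransport() (*http.Transport, error) {
	// Gateway's own cert + key, presented to backend services.
	cert, err := tls.LoadX509KeyPair("/etc/gateway/tls/client.crt", "/etc/gateway/tls/client.key")
	if err != nil {
		return nil, err
	}
	// Internal CA that signs backend certificates.
	caPEM, err := os.ReadFile("/etc/gateway/tls/internal-ca.pem")
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	return &http.Transport{
		TLSClientConfig: &tls.Config{
			Certificates: []tls.Certificate{cert},
			RootCAs:      pool,
			MinVersion:   tls.VersionTLS13,
		},
	}, nil
}
```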
JWT validation at the gateway handles authentication. Authorization stays at the backend where domain logic lives. Don’t push authorization into the gateway. It needs context (ownership, roles, resource state) that only the service has. The gateway confirms you work here. Only the service decides whether you belong in the room.
Bot detection analyzes header ordering, TLS fingerprints, and timing patterns. A cloud-native architecture processing heavy request volumes will find a surprising share of traffic is bots, and that share grows with API popularity. Your most loyal “users” might not be human.
What the Industry Gets Wrong About API Gateways
“The gateway should handle aggregation.” The single most common gateway anti-pattern, and the one with the longest recovery time. Once aggregation logic lives in the gateway, extracting it into BFFs means rewriting every affected endpoint while keeping backward compatibility with existing clients. Removing load-bearing walls from a building that’s already occupied. Start with BFFs. There is no shortcut that doesn’t become technical debt.
“One gateway config fits all traffic.” Internal service-to-service traffic and external client traffic have completely different security, rate limiting, and versioning needs. Separate configs for internal and external traffic is the correct architecture. The employee entrance and the customer entrance exist for a reason.
"GraphQL federation replaces the gateway." Federation handles query composition across subgraphs. It does not handle rate limiting, authentication, request tracing, or version lifecycle. The federation layer sits behind the gateway, not instead of it. The conference room doesn’t replace the front desk.
That “just this once” transformation from six months ago? Rip it out. Clear contract: route, authenticate, rate-limit. Everything else belongs somewhere else. The front desk checks badges and points visitors to the right floor. Nothing more.