Observability: From Dashboard Green to Actually Working
A quiet evening. Checkout starts failing for 3% of requests. Your on-call engineer opens Grafana. Error rate graph: red. She scans fourteen dashboards. None answer the three questions that actually matter: which users are affected, which request path is failing, and whether the root cause lives in checkout, the payment gateway, or the inventory service downstream.
Temperature is elevated. Blood pressure is normal. The vital signs monitor says something is wrong. It doesn’t say what.
She SSHes into a production instance. Starts tailing logs. Free-text. No structure. `grep "error" /var/log/app.log` returns 4,000 lines. She tries `grep "payment"` and gets 12,000. Half are success messages. She starts reading line by line, copying trace-looking strings between browser tabs, manually playing human database while the PagerDuty timer ticks.
Diagnosing a patient by reading their entire medical history out loud, page by page, looking for the relevant sentence. You’ve been here. Everyone has been here. It’s miserable every single time.
Two hours and three engineers later, the root cause is a timeout on one payment gateway endpoint. But only for requests with a specific feature flag variant. The fix? Four minutes. The investigation? Two hours. The telemetry that would have shown the problem instantly already existed; nobody connected the blood test to the X-ray to the patient history. Something is deeply wrong with that ratio.
- The fix is always faster than the investigation. When finding the problem takes 10x longer than fixing it, you have a visibility problem, not a reliability problem.
- Three pillars present, none connected = three tools, three tabs, zero correlation. Alert to trace to log in seconds, not hours of tab-switching.
- High-cardinality tags on traces are the debug dimension. Customer ID, feature flag variant, deployment version, database shard. “Checkout is slow” becomes “checkout is slow for new-checkout-v2 on shard 3 in APAC.”
- SLO-based alerting cuts alert fatigue and pages on user impact, not infrastructure metrics. CPU at 80% goes on a dashboard. Checkout failing for 3% of users goes to PagerDuty. Page on what users feel, not what machines report.
- Service templates with observability baked in make correct instrumentation the default, not the disciplined exception.
More dashboards won’t fix this. More monitoring tools won’t fix this. The gap between collecting telemetry and having observability is architectural.
Monitoring Tells You What. Observability Tells You Why.
Monitoring watches for failures you imagined in advance. You write a threshold, the system checks the threshold, and an alert fires when reality matches the scenario you predicted. Observability answers the question you couldn’t have predicted: “Show me all checkout requests over 2 seconds, grouped by feature flag and gateway endpoint.” Ten-second answer from a query interface? Observability. Requires SSH and grep? Monitoring with a nicer UI.
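The "ten-second answer" is just an arbitrary filter-and-group over structured events. A minimal in-memory sketch, assuming illustrative field names (`duration_ms`, `feature_flag`, `gateway_endpoint`) rather than any specific vendor's schema:

```python
from collections import Counter

def slow_checkouts(events, threshold_ms=2000):
    """The ad-hoc question: slow checkout requests, grouped by flag and endpoint."""
    counts = Counter()
    for e in events:
        if e["service"] == "checkout" and e["duration_ms"] > threshold_ms:
            counts[(e["feature_flag"], e["gateway_endpoint"])] += 1
    return counts

events = [
    {"service": "checkout", "duration_ms": 3100,
     "feature_flag": "new-checkout-v2", "gateway_endpoint": "/v2/charge"},
    {"service": "checkout", "duration_ms": 180,
     "feature_flag": "control", "gateway_endpoint": "/v1/charge"},
]
print(slow_checkouts(events))  # only the 3100 ms request passes the filter
```

A real backend runs the same shape of query over billions of events; the point is that no pre-built dashboard is required to ask it.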
The distinction matters because novel incidents (the ones that actually hurt) are by definition the ones you didn’t predict. A dashboard built for yesterday’s outage doesn’t help with tomorrow’s. Observability gives engineers the ability to ask arbitrary questions of their telemetry during an incident, without needing a pre-built dashboard for each scenario.
| Dimension | Monitoring | Observability |
|---|---|---|
| Question it answers | Known questions: “Is CPU above 80%?” “Is the service up?” | Unknown questions: “Why is this request slow for users in EU?” |
| Data model | Pre-defined metrics and thresholds on dashboards | High-cardinality telemetry: traces, structured logs, metrics correlated |
| Alert style | Threshold breach: CPU > 80% for 5 minutes | Anomaly detection + SLO burn rate |
| Debug workflow | Check the dashboard someone built for this scenario | Slice and dice by any dimension. No dashboard needed in advance |
| Fails when | Novel failure mode not covered by existing dashboards | Telemetry pipeline can’t handle the cardinality (cost explosion) |
| Investment | Lower. Prometheus + Grafana, predefined dashboards | Higher. Distributed tracing, structured logging, correlation IDs |
| You need both | Monitoring for known failure modes | Observability for the ones you haven’t imagined yet |
The Three Pillars Are Not the Point
Every conference talk mentions the three pillars: metrics, logs, traces. Most organizations dutifully deploy all three. And most still suffer long investigation times because the pillars are disconnected. Three tools, three browser tabs, zero correlation. Clicking from alert to relevant trace to causal log entry in seconds requires one hard architectural commitment: consistent context propagation across every service boundary.
Every log line carries trace_id. Every span carries tenant_id, feature_flag, and deployment_version. Every alert links to the traces that triggered it. Site reliability practice treats context propagation as first-class infrastructure because it never gets retrofitted successfully. Bolting trace IDs onto an existing logging system after the fact means rewriting log statements across dozens of services. Building it into the service template from day one means it just works.
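The mechanics are small: middleware reads (or mints) a trace ID per request, stashes it in request-scoped context, and the logging helper attaches it to every event. A hedged sketch using Python's `contextvars`; the header name `x-trace-id` and field names are assumptions, not a standard:

```python
import contextvars
import json
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default=None)

def middleware(handler):
    """Attach an incoming (or freshly minted) trace ID to request context."""
    def wrapped(request):
        incoming = request.get("headers", {}).get("x-trace-id")
        trace_id_var.set(incoming or str(uuid.uuid4()))
        return handler(request)
    return wrapped

def log(event, **fields):
    # Every log line carries trace_id without the caller passing it around.
    fields["trace_id"] = trace_id_var.get()
    fields["event"] = event
    print(json.dumps(fields))

@middleware
def checkout(request):
    log("checkout.start", order_id=request["order_id"])
    return "ok"

checkout({"headers": {"x-trace-id": "abc-123"}, "order_id": "ord-4521"})
```

In practice this lives in the service template (OpenTelemetry's propagators do the header parsing), so individual handlers never touch trace IDs at all.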
What Production Instrumentation Looks Like
Structured Logging
At 50,000 events per second, regex against free-text is both slow and unreliable. Structured events turn every log line into a queryable record with explicit fields.
```json
{
  "user_id": "john@example.com",
  "order_id": "ord-4521",
  "service": "checkout",
  "trace_id": "abc-123",
  "span_id": "span-789",
  "duration_ms": 342,
  "result": "success",
  "feature_flag": "new-checkout-v2",
  "region": "ap-southeast-1"
}
```
“Show me all failed checkout requests for this user” becomes a two-second query instead of an hour of grep archaeology. Start with the five highest-traffic services and expand from there. Performance and capacity engineering becomes far easier when event shapes are predictable and queryable.
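Emitting events in that shape takes little more than a JSON formatter on whatever logger the service already uses. A minimal sketch on Python's stdlib `logging`; the field names mirror the example event above and are assumptions, not a schema standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        event = {
            "level": record.levelname,
            "service": "checkout",
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become queryable columns.
        for key in ("trace_id", "order_id", "duration_ms", "result"):
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={
    "trace_id": "abc-123", "order_id": "ord-4521",
    "duration_ms": 342, "result": "success",
})
```

Production systems typically reach for a structured-logging library instead of hand-rolling this, but the contract is the same: one event, one JSON object, explicit fields.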
High-Cardinality Tags on Traces
Most teams tag spans with HTTP method, status code, and endpoint. Fine for totals. Nearly useless for debugging the specific failure that just woke someone up.
The tags that actually help mid-incident: customer ID, feature flag variant, deployment version, database shard, geographic region. These are high-cardinality dimensions: attributes with thousands or millions of unique values. Which patient, which ward, which medication, which dose. They can’t go on Prometheus metrics (millions of time series will crush storage and query performance). But they belong on every trace span, where the storage model handles cardinality gracefully.
“Checkout is slow” is a symptom. “Checkout is slow for new-checkout-v2 users hitting shard 3 in APAC” is actionable. “The patient has a fever” vs. “the patient in ward 3 who started the new medication yesterday has a fever.” Same engineer, same tools, same incident. The second statement comes from tags the first engineer never added.
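Attaching those dimensions is one `set_attribute` call per tag at the point where the request is handled. A hedged sketch with a stand-in `Span` class (real tracing SDKs such as OpenTelemetry expose the same `set_attribute` shape); the attribute names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Stand-in for a tracing SDK span; only the tagging surface matters here."""
    name: str
    attributes: dict = field(default_factory=dict)

    def set_attribute(self, key, value):
        self.attributes[key] = value

def handle_checkout(span, request):
    # Too high-cardinality for metrics labels, cheap on a trace span.
    span.set_attribute("customer.id", request["customer_id"])
    span.set_attribute("feature_flag.variant", request["flag_variant"])
    span.set_attribute("deployment.version", request["version"])
    span.set_attribute("db.shard", request["shard"])
    span.set_attribute("geo.region", request["region"])

span = Span("checkout")
handle_checkout(span, {
    "customer_id": "cust-8841", "flag_variant": "new-checkout-v2",
    "version": "2025-06-03.1", "shard": "shard-3", "region": "ap-southeast-1",
})
```

With those five tags present, "slow for new-checkout-v2 on shard 3 in APAC" is a group-by over span attributes, not a two-hour archaeology dig.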
SLO-Based Alerting
Alert on what users experience, not on what servers report. CPU at 80% might mean nothing. Checkout failing for 3% of users means revenue is disappearing. Page on what users feel, not what machines report. Error budget burn rate translates reliability problems into urgency: 14x burn means the monthly error budget is gone in two days (page now), 1.2x burn means it tracks slowly toward exhaustion (file a ticket). Burn-rate thresholds like these are core to mature observability and monitoring practice.
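The burn-rate arithmetic behind those thresholds is simple division. A sketch assuming an illustrative 99.9% SLO over a 30-day window:

```python
def burn_rate(observed_error_rate, slo_target):
    """How fast the error budget is being consumed, relative to plan."""
    budget = 1.0 - slo_target  # e.g. 0.1% allowed errors for a 99.9% SLO
    return observed_error_rate / budget

def days_until_budget_exhausted(rate, window_days=30):
    """At this burn rate, when does the monthly budget run out?"""
    return window_days / rate

# 1.4% of checkouts failing against a 99.9% SLO:
rate = burn_rate(0.014, 0.999)
print(rate)                               # ~14x faster than planned
print(days_until_budget_exhausted(rate))  # ~2.1 days: page now
```

At 1.2x burn the same math gives roughly 25 days to exhaustion, which is ticket territory, not pager territory.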
| Invest in full observability | Monitoring alone is enough |
|---|---|
| 10+ microservices with cross-service dependencies | Single monolith with well-understood failure modes |
| Novel incidents with long investigation times | Predictable failures with known runbook responses |
| Multiple teams shipping to shared infrastructure | One team, one service, one deployment pipeline |
| SLO-driven reliability commitments | Best-effort availability without formal targets |
| Production debugging requires tracing across services | Errors are self-contained within one process |
Making Good Instrumentation the Default
The hardest part of observability is not choosing the right backend. It’s making correct instrumentation the path of least resistance for every engineer shipping code.
Service templates solve this by baking structured logging, OpenTelemetry propagation, and pre-built dashboards into the starter kit for every new service. A service created from the template is observable on day one, before its authors write a single business-logic line. The hospital admission form that creates the patient file, assigns the room, and sets up the monitoring automatically. Middleware propagates trace IDs, request IDs, and tenant IDs automatically so individual developers never need to think about context propagation. CI linters verify that new endpoints create spans, preventing gaps in trace coverage from slipping through code review.
DevOps practice embeds these standards into the platform itself so observability is not a checklist item engineers remember to add but a default they’d have to deliberately opt out of.
What the Industry Gets Wrong About Observability
“More dashboards improve observability.” Fourteen dashboards showing CPU, memory, and request count across different services is monitoring, not observability. Fourteen vital signs monitors. None of them tell you why the patient is sick. If answering “why is checkout failing for mobile users in APAC?” requires SSHing into a server and grepping logs, the dashboards are decoration.
“Alert on everything, filter later.” Alerting on CPU, memory, disk, and network for every service generates alert fatigue within weeks. The hospital alarm that goes off every time a machine reading changes. On-call engineers mute channels. Real incidents get lost in the noise. SLO-based alerting (page on user impact, not infrastructure metrics) cuts alert volume sharply while raising signal quality. Alert when the patient’s condition changes. Not when the heart rate monitor fluctuates by one beat.
That checkout failure from the opening? Trace ID links the alert to the payment span. The blood test connected to the X-ray connected to the patient history. The structured log shows the flag variant and the upstream timeout. The engineer never opens a terminal. Four-minute fix. Zero-hour investigation. Same hospital. Same patient. One connected chart instead of three separate filing cabinets.