Observability: From Dashboard Green to Actually Working
A quiet evening. Checkout starts failing for 3% of requests. Your on-call engineer opens Grafana. Error rate graph: red. She scans fourteen dashboards. None answer the three questions that actually matter: which users are affected, which request path is failing, and whether the root cause lives in checkout, the payment gateway, or the inventory service downstream.
Temperature is elevated. Blood pressure is normal. The vital signs monitor says something is wrong. It doesn’t say what.
She SSHes into a production instance. Starts tailing logs. Free-text. No structure. `grep "error" /var/log/app.log` returns 4,000 lines. She tries `grep "payment"` and gets 12,000. Half are success messages. She starts reading line by line, copying trace-looking strings between browser tabs, manually playing human database while the PagerDuty timer ticks.
Diagnosing a patient by reading their entire medical history out loud, page by page, looking for the relevant sentence. You’ve been here. Everyone has been here. It’s miserable every single time.
Two hours and three engineers later, the root cause is a timeout on one payment gateway endpoint. But only for requests with a specific feature flag variant. The fix? Four minutes. The investigation? Two hours. The telemetry that would have shown the problem instantly already existed; nobody connected the blood test to the X-ray to the patient history. Something is deeply wrong with that ratio.
- The fix is always faster than the investigation. When finding the problem takes 10x longer than fixing it, you have a visibility problem, not a reliability problem.
- Three pillars present, none connected = three tools, three tabs, zero correlation. Alert to trace to log in seconds, not hours of tab-switching.
- High-cardinality tags on traces are the debug dimension. Customer ID, feature flag variant, deployment version, database shard. “Checkout is slow” becomes “checkout is slow for new-checkout-v2 on shard 3 in APAC.”
- SLO-based alerting cuts alert fatigue and pages on user impact, not infrastructure metrics. CPU at 80% goes on a dashboard. Checkout failing for 3% of users goes to PagerDuty. Page on what users feel, not what machines report.
- Service templates with observability baked in make correct instrumentation the default, not the disciplined exception.
More dashboards won’t fix this. More monitoring tools won’t fix this. The gap between collecting telemetry and having observability is architectural.
Monitoring Tells You What. Observability Tells You Why.
Monitoring watches for failures you imagined in advance. You write a threshold, the system checks the threshold, and an alert fires when reality matches the scenario you predicted. Observability answers the question you couldn’t have predicted: “Show me all checkout requests over 2 seconds, grouped by feature flag and gateway endpoint.” Ten-second answer from a query interface? Observability. Requires SSH and grep? Monitoring with a nicer UI.
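The "ten-second answer" is just an arbitrary filter-and-group over structured events. A minimal in-memory sketch, assuming illustrative field names (`duration_ms`, `feature_flag`, `gateway_endpoint`) rather than any specific vendor's schema:

```python
from collections import Counter

def slow_checkouts(events, threshold_ms=2000):
    """The ad-hoc question: slow checkout requests, grouped by flag and endpoint."""
    counts = Counter()
    for e in events:
        if e["service"] == "checkout" and e["duration_ms"] > threshold_ms:
            counts[(e["feature_flag"], e["gateway_endpoint"])] += 1
    return counts

events = [
    {"service": "checkout", "duration_ms": 3100,
     "feature_flag": "new-checkout-v2", "gateway_endpoint": "/v2/charge"},
    {"service": "checkout", "duration_ms": 180,
     "feature_flag": "control", "gateway_endpoint": "/v1/charge"},
]
print(slow_checkouts(events))  # only the 3100 ms request passes the filter
```

A real backend runs the same shape of query over billions of events; the point is that no pre-built dashboard is required to ask it.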
The distinction matters because novel incidents (the ones that actually hurt) are by definition the ones you didn’t predict. A dashboard built for yesterday’s outage doesn’t help with tomorrow’s. Observability gives engineers the ability to ask arbitrary questions of their telemetry during an incident, without needing a pre-built dashboard for each scenario.
| Dimension | Monitoring | Observability |
|---|---|---|
| Question it answers | Known questions: “Is CPU above 80%?” “Is the service up?” | Unknown questions: “Why is this request slow for users in EU?” |
| Data model | Pre-defined metrics and thresholds on dashboards | High-cardinality telemetry: traces, structured logs, metrics correlated |
| Alert style | Threshold breach: CPU > 80% for 5 minutes | Anomaly detection + SLO burn rate |
| Debug workflow | Check the dashboard someone built for this scenario | Slice and dice by any dimension. No dashboard needed in advance |
| Fails when | Novel failure mode not covered by existing dashboards | Telemetry pipeline can’t handle the cardinality (cost explosion) |
| Investment | Lower. Prometheus + Grafana, predefined dashboards | Higher. Distributed tracing, structured logging, correlation IDs |
| You need both | Monitoring for known failure modes | Observability for the ones you haven’t imagined yet |
The Three Pillars Are Not the Point
Every conference talk mentions the three pillars: metrics, logs, traces. Most organizations dutifully deploy all three. And most still suffer long investigation times because the pillars are disconnected. Three tools, three browser tabs, zero correlation. Clicking from alert to relevant trace to causal log entry in seconds requires one hard architectural commitment: consistent context propagation across every service boundary.
Every log line carries trace_id. Every span carries tenant_id, feature_flag, and deployment_version. Every alert links to the traces that triggered it. Site reliability practice treats context propagation as first-class infrastructure because it never gets retrofitted successfully. Bolting trace IDs onto an existing logging system after the fact means rewriting log statements across dozens of services. Building it into the service template from day one means it just works.
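The mechanics are small: middleware reads (or mints) a trace ID per request, stashes it in request-scoped context, and the logging helper attaches it to every event. A hedged sketch using Python's `contextvars`; the header name `x-trace-id` and field names are assumptions, not a standard:

```python
import contextvars
import json
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default=None)

def middleware(handler):
    """Attach an incoming (or freshly minted) trace ID to request context."""
    def wrapped(request):
        incoming = request.get("headers", {}).get("x-trace-id")
        trace_id_var.set(incoming or str(uuid.uuid4()))
        return handler(request)
    return wrapped

def log(event, **fields):
    # Every log line carries trace_id without the caller passing it around.
    fields["trace_id"] = trace_id_var.get()
    fields["event"] = event
    print(json.dumps(fields))

@middleware
def checkout(request):
    log("checkout.start", order_id=request["order_id"])
    return "ok"

checkout({"headers": {"x-trace-id": "abc-123"}, "order_id": "ord-4521"})
```

In practice this lives in the service template (OpenTelemetry's propagators do the header parsing), so individual handlers never touch trace IDs at all.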
What Production Instrumentation Looks Like
Structured Logging
At 50,000 events per second, regex against free-text is both slow and unreliable. Structured events turn every log line into a queryable record with explicit fields.
```json
{
  "user_id": "john@example.com",
  "order_id": "ord-4521",
  "service": "checkout",
  "trace_id": "abc-123",
  "span_id": "span-789",
  "duration_ms": 342,
  "result": "success",
  "feature_flag": "new-checkout-v2",
  "region": "ap-southeast-1"
}
```
“Show me all failed checkout requests for this user” becomes a two-second query instead of an hour of grep archaeology. Start with the five highest-traffic services and expand from there. Performance and capacity engineering becomes far easier when event shapes are predictable and queryable.
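Emitting events in that shape takes little more than a JSON formatter on whatever logger the service already uses. A minimal sketch on Python's stdlib `logging`; the field names mirror the example event above and are assumptions, not a schema standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        event = {
            "level": record.levelname,
            "service": "checkout",
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become queryable columns.
        for key in ("trace_id", "order_id", "duration_ms", "result"):
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={
    "trace_id": "abc-123", "order_id": "ord-4521",
    "duration_ms": 342, "result": "success",
})
```

Production systems typically reach for a structured-logging library instead of hand-rolling this, but the contract is the same: one event, one JSON object, explicit fields.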
High-Cardinality Tags on Traces
Most teams tag spans with HTTP method, status code, and endpoint. Fine for totals. Nearly useless for debugging the specific failure that just woke someone up.
The tags that actually help mid-incident: customer ID, feature flag variant, deployment version, database shard, geographic region. These are high-cardinality dimensions: attributes with thousands or millions of unique values. Which patient, which ward, which medication, which dose. They can’t go on Prometheus metrics (millions of time series will crush storage and query performance). But they belong on every trace span, where the storage model handles cardinality gracefully.
“Checkout is slow” is a symptom. “Checkout is slow for new-checkout-v2 users hitting shard 3 in APAC” is actionable. “The patient has a fever” vs. “the patient in ward 3 who started the new medication yesterday has a fever.” Same engineer, same tools, same incident. The second statement comes from tags the first engineer never added.
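Attaching those dimensions is one `set_attribute` call per tag at the point where the request is handled. A hedged sketch with a stand-in `Span` class (real tracing SDKs such as OpenTelemetry expose the same `set_attribute` shape); the attribute names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Stand-in for a tracing SDK span; only the tagging surface matters here."""
    name: str
    attributes: dict = field(default_factory=dict)

    def set_attribute(self, key, value):
        self.attributes[key] = value

def handle_checkout(span, request):
    # Too high-cardinality for metrics labels, cheap on a trace span.
    span.set_attribute("customer.id", request["customer_id"])
    span.set_attribute("feature_flag.variant", request["flag_variant"])
    span.set_attribute("deployment.version", request["version"])
    span.set_attribute("db.shard", request["shard"])
    span.set_attribute("geo.region", request["region"])

span = Span("checkout")
handle_checkout(span, {
    "customer_id": "cust-8841", "flag_variant": "new-checkout-v2",
    "version": "2025-06-03.1", "shard": "shard-3", "region": "ap-southeast-1",
})
```

With those five tags present, "slow for new-checkout-v2 on shard 3 in APAC" is a group-by over span attributes, not a two-hour archaeology dig.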
SLO-Based Alerting
Alert on what users experience, not on what servers report. CPU at 80% might mean nothing. Checkout failing for 3% of users means revenue is disappearing. Page on what users feel, not what machines report. Error budget burn rate translates reliability problems into urgency: 14x burn means the monthly error budget is gone in two days (page now), 1.2x burn means it tracks slowly toward exhaustion (file a ticket). Burn-rate thresholds like these are core to mature observability and monitoring practice.
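The burn-rate arithmetic behind those thresholds is simple division. A sketch assuming an illustrative 99.9% SLO over a 30-day window:

```python
def burn_rate(observed_error_rate, slo_target):
    """How fast the error budget is being consumed, relative to plan."""
    budget = 1.0 - slo_target  # e.g. 0.1% allowed errors for a 99.9% SLO
    return observed_error_rate / budget

def days_until_budget_exhausted(rate, window_days=30):
    """At this burn rate, when does the monthly budget run out?"""
    return window_days / rate

# 1.4% of checkouts failing against a 99.9% SLO:
rate = burn_rate(0.014, 0.999)
print(rate)                               # ~14x faster than planned
print(days_until_budget_exhausted(rate))  # ~2.1 days: page now
```

At 1.2x burn the same math gives roughly 25 days to exhaustion, which is ticket territory, not pager territory.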
| Invest in full observability | Monitoring alone is enough |
|---|---|
| 10+ microservices with cross-service dependencies | Single monolith with well-understood failure modes |
| Novel incidents with long investigation times | Predictable failures with known runbook responses |
| Multiple teams shipping to shared infrastructure | One team, one service, one deployment pipeline |
| SLO-driven reliability commitments | Best-effort availability without formal targets |
| Production debugging requires tracing across services | Errors are self-contained within one process |
Making Good Instrumentation the Default
The hardest part of observability is not choosing the right backend. It’s making correct instrumentation the path of least resistance for every engineer shipping code.
Service templates solve this by baking structured logging, OpenTelemetry propagation, and pre-built dashboards into the starter kit for every new service. A service created from the template is observable on day one, before its authors write a single business-logic line. The hospital admission form that creates the patient file, assigns the room, and sets up the monitoring automatically. Middleware propagates trace IDs, request IDs, and tenant IDs automatically so individual developers never need to think about context propagation. CI linters verify that new endpoints create spans, preventing gaps in trace coverage from slipping through code review.
DevOps practice embeds these standards into the platform itself so observability is not a checklist item engineers remember to add but a default they’d have to deliberately opt out of.
What the Industry Gets Wrong About Observability
“More dashboards improve observability.” Fourteen dashboards showing CPU, memory, and request count across different services is monitoring, not observability. Fourteen vital signs monitors. None of them tell you why the patient is sick. If answering “why is checkout failing for mobile users in APAC?” requires SSHing into a server and grepping logs, the dashboards are decoration.
“Alert on everything, filter later.” Alerting on CPU, memory, disk, and network for every service generates alert fatigue within weeks. The hospital alarm that goes off every time a machine reading changes. On-call engineers mute channels. Real incidents get lost in the noise. SLO-based alerting (page on user impact, not infrastructure metrics) cuts alert volume sharply while raising signal quality. Alert when the patient’s condition changes. Not when the heart rate monitor fluctuates by one beat.
That checkout failure from the opening? Trace ID links the alert to the payment span. The blood test connected to the X-ray connected to the patient history. The structured log shows the flag variant and the upstream timeout. The engineer never opens a terminal. Four-minute fix. Zero-hour investigation. Same hospital. Same patient. One connected chart instead of three separate filing cabinets.