
Observability: From Dashboard Green to Actually Working

Metasphere Engineering · 12 min read

A quiet evening. Checkout starts failing for 3% of requests. Your on-call engineer opens Grafana. Error rate graph: red. She scans fourteen dashboards. None answer the three questions that actually matter: which users are affected, which request path is failing, and whether the root cause lives in checkout, the payment gateway, or the inventory service downstream.

Temperature is elevated. Blood pressure is normal. The vital signs monitor says something is wrong. It doesn’t say what.

She SSHes into a production instance. Starts tailing logs. Free-text. No structure. grep "error" /var/log/app.log returns 4,000 lines. She tries grep "payment" and gets 12,000. Half are success messages. She starts reading line by line, copying trace-looking strings between browser tabs, manually playing human database while the PagerDuty timer ticks.

Diagnosing a patient by reading their entire medical history out loud, page by page, looking for the relevant sentence. You’ve been here. Everyone has been here. It’s miserable every single time.

Two hours and three engineers later, the root cause is a timeout on one payment gateway endpoint. But only for requests with a specific feature flag variant. The fix? Four minutes. The investigation? Two hours. The data that would have exposed the problem instantly already existed. Nobody connected the blood test to the X-ray to the patient history. Something is deeply wrong with that ratio.

Key takeaways
  • The fix is always faster than the investigation. When finding the problem takes 10x longer than fixing it, you have a visibility problem, not a reliability problem.
  • Three pillars present, none connected = three tools, three tabs, zero correlation. Alert to trace to log in seconds, not hours of tab-switching.
  • High-cardinality tags on traces are the debug dimension. Customer ID, feature flag variant, deployment version, database shard. “Checkout is slow” becomes “checkout is slow for new-checkout-v2 on shard 3 in APAC.”
  • SLO-based alerting cuts alert fatigue and pages on user impact, not infrastructure metrics. CPU at 80% goes on a dashboard. Checkout failing for 3% of users goes to PagerDuty. Page on what users feel, not what machines report.
  • Service templates with observability baked in make correct instrumentation the default, not the disciplined exception.

More dashboards won’t fix this. More monitoring tools won’t fix this. The gap between collecting telemetry and having observability is architectural.

[Figure: Distributed trace propagation across an API gateway, auth service, and data layer, similar to a Jaeger trace view. Span bars show relative duration and parent-child timing: api-gateway /checkout (247ms) spans auth-service /validate (112ms) and data-layer SELECT orders (168ms), all under trace-id abc123.]

Monitoring Tells You What. Observability Tells You Why.

Monitoring watches for failures you imagined in advance. You write a threshold, the system checks the threshold, and an alert fires when reality matches the scenario you predicted. Observability answers the question you couldn’t have predicted: “Show me all checkout requests over 2 seconds, grouped by feature flag and gateway endpoint.” Ten-second answer from a query interface? Observability. Requires SSH and grep? Monitoring with a nicer UI.

The distinction matters because novel incidents (the ones that actually hurt) are by definition the ones you didn’t predict. A dashboard built for yesterday’s outage doesn’t help with tomorrow’s. Observability gives engineers the ability to ask arbitrary questions of their telemetry during an incident, without needing a pre-built dashboard for each scenario.

How the two approaches differ, dimension by dimension:
  • Question it answers. Monitoring: known questions (“Is CPU above 80%?” “Is the service up?”). Observability: unknown questions (“Why is this request slow for users in the EU?”).
  • Data model. Monitoring: pre-defined metrics and thresholds on dashboards. Observability: high-cardinality telemetry with traces, structured logs, and metrics correlated.
  • Alert style. Monitoring: threshold breach (CPU > 80% for 5 minutes). Observability: anomaly detection plus SLO burn rate.
  • Debug workflow. Monitoring: check the dashboard someone built for this scenario. Observability: slice and dice by any dimension, no dashboard needed in advance.
  • Fails when. Monitoring: the failure mode is novel and not covered by existing dashboards. Observability: the telemetry pipeline can’t handle the cardinality (cost explosion).
  • Investment. Monitoring: lower (Prometheus + Grafana, predefined dashboards). Observability: higher (distributed tracing, structured logging, correlation IDs).
  • You need both: monitoring for known failure modes, observability for the ones you haven’t imagined yet.

The Three Pillars Are Not the Point

Every conference talk mentions the three pillars: metrics, logs, traces. Most organizations dutifully deploy all three. And most still suffer long investigation times because the pillars are disconnected. Three tools, three browser tabs, zero correlation. Clicking from alert to relevant trace to causal log entry in seconds requires one hard architectural commitment: consistent context propagation across every service boundary.

Every log line carries trace_id. Every span carries tenant_id, feature_flag, and deployment_version. Every alert links to the traces that triggered it. Site reliability practice treats context propagation as first-class infrastructure because it never gets retrofitted successfully. Bolting trace IDs onto an existing logging system after the fact means rewriting log statements across dozens of services. Building it into the service template from day one means it just works.
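As a concrete illustration, here is a minimal sketch of the logging half of that commitment, assuming the OpenTelemetry Python SDK (the logger name is arbitrary): a filter that stamps every log record with the active trace and span IDs so every line is correlatable.

import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the active trace_id and span_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            record.trace_id = trace.format_trace_id(ctx.trace_id)
            record.span_id = trace.format_span_id(ctx.span_id)
        else:
            record.trace_id = record.span_id = None
        return True  # annotate only; never drop records

logger = logging.getLogger("checkout")
logger.addFilter(TraceContextFilter())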

[Figure: Correlated observability, from alert to root cause. An SLO alert (P99 > 500ms, error budget burning) links to a distributed trace (trace_id abc-123, slow span: DB query at 450ms), which links to structured logs carrying the same trace_id and the full query plan, exposing the root cause: a missing index on the user_id column. The trace ID connects alert, trace, and log into one story.]
The Correlation Gap: the time it takes to manually connect a metric alert to the relevant trace to the causal log entry. With uncorrelated telemetry (separate tools, no shared trace IDs), this gap stretches into the tens of minutes per incident. With correlated telemetry (click from alert to trace to log), it collapses to almost nothing. The gap is the difference between a quick resolution and a multi-hour investigation.

What Production Instrumentation Looks Like

Structured Logging

At 50,000 events per second, regex against free-text is both slow and unreliable. Structured logging turns every log line into a queryable record with explicit fields.

{
  "user_id": "john@example.com",
  "order_id": "ord-4521",
  "service": "checkout",
  "trace_id": "abc-123",
  "span_id": "span-789",
  "duration_ms": 342,
  "result": "success",
  "feature_flag": "new-checkout-v2",
  "region": "ap-southeast-1"
}

“Show me all failed checkout requests for this user” becomes a two-second query instead of an hour of grep archaeology. Start with the five highest-traffic services and expand from there. Performance and capacity engineering becomes far easier when event shapes are predictable and queryable.
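For illustration, one way to emit records in that shape using only the Python standard library; the field names mirror the example above and are otherwise arbitrary.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {"service": "checkout", "message": record.getMessage()}
        # Merge structured fields passed through logging's `extra` mechanism.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("checkout completed", extra={"fields": {
    "order_id": "ord-4521", "duration_ms": 342,
    "result": "success", "region": "ap-southeast-1",
}})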

High-Cardinality Tags on Traces

Most teams tag spans with HTTP method, status code, and endpoint. Fine for totals. Nearly useless for debugging the specific failure that just woke someone up.

The tags that actually help mid-incident: customer ID, feature flag variant, deployment version, database shard, geographic region. These are high-cardinality dimensions: attributes with thousands or millions of unique values. Which patient, which ward, which medication, which dose. They can’t go on Prometheus metrics (millions of time series will crush storage and query performance). But they belong on every trace span, where the storage model handles cardinality gracefully.
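A sketch of what that tagging looks like with the OpenTelemetry Python API; the attribute keys and the process_checkout handler are illustrative, not a standard.

from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def process_checkout(order):
    with tracer.start_as_current_span("process_checkout") as span:
        # High-cardinality debug dimensions: cheap on spans,
        # ruinous as Prometheus metric labels.
        span.set_attribute("customer.id", order.customer_id)
        span.set_attribute("feature_flag.variant", order.flag_variant)
        span.set_attribute("deployment.version", "2024-06-01.3")
        span.set_attribute("db.shard", order.shard)
        span.set_attribute("geo.region", order.region)
        # ... business logic runs inside the span ...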

“Checkout is slow” is a symptom. “Checkout is slow for new-checkout-v2 users hitting shard 3 in APAC” is actionable. “The patient has a fever” vs. “the patient in ward 3 who started the new medication yesterday has a fever.” Same engineer, same tools, same incident. The second statement comes from tags the first engineer never added.

SLO-Based Alerting

Alert on what users experience, not on what servers report. CPU at 80% might mean nothing. Checkout failing for 3% of users means revenue is disappearing. Page on what users feel, not what machines report. Error budget burn rate translates reliability problems into urgency: 14x burn means the monthly error budget is gone in two days (page now), 1.2x burn means it tracks slowly toward exhaustion (file a ticket). This discipline is core to mature observability and monitoring practice.
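The arithmetic behind those numbers is simple enough to sketch, assuming a 99.9% availability SLO over a 30-day window.

SLO = 0.999
ERROR_BUDGET = 1 - SLO    # 0.1% of requests may fail per window
WINDOW_DAYS = 30

def burn_rate(failed: int, total: int) -> float:
    """How many times faster than sustainable the budget is burning."""
    return (failed / total) / ERROR_BUDGET

def days_to_exhaustion(rate: float) -> float:
    """Days until the whole window's budget is spent at this rate."""
    return WINDOW_DAYS / rate

print(days_to_exhaustion(14.0))   # ~2.1 days of budget left: page now
print(days_to_exhaustion(1.2))    # ~25 days: file a ticket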

Invest in full observability when:
  • You run 10+ microservices with cross-service dependencies
  • Novel incidents come with long investigation times
  • Multiple teams ship to shared infrastructure
  • You hold SLO-driven reliability commitments
  • Production debugging requires tracing across services

Monitoring alone is enough when:
  • You run a single monolith with well-understood failure modes
  • Failures are predictable, with known runbook responses
  • One team owns one service and one deployment pipeline
  • Availability is best-effort, without formal targets
  • Errors are self-contained within one process

Making Good Instrumentation the Default

The hardest part of observability is not choosing the right backend. It’s making correct instrumentation the path of least resistance for every engineer shipping code.

Service templates solve this by baking structured logging, OpenTelemetry propagation, and pre-built dashboards into the starter kit for every new service. A service created from the template is observable on day one, before its authors write a single business-logic line. The hospital admission form that creates the patient file, assigns the room, and sets up the monitoring automatically. Middleware propagates trace IDs, request IDs, and tenant IDs automatically so individual developers never need to think about context propagation. CI linters verify that new endpoints create spans, preventing gaps in trace coverage from slipping through code review.
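A sketch of what the template’s propagation middleware might do, using the OpenTelemetry propagation API; handle_request and call_downstream are hypothetical stand-ins for the framework’s actual hooks.

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("service-template")

def handle_request(headers: dict, body: bytes) -> bytes:
    # Join the caller's trace rather than starting an unrelated one.
    parent_ctx = extract(headers)
    with tracer.start_as_current_span("handle_request", context=parent_ctx):
        return call_downstream(body)

def call_downstream(body: bytes) -> bytes:
    outgoing: dict = {}
    inject(outgoing)  # writes the W3C traceparent header for the next hop
    # ... issue the HTTP call with `outgoing` merged into its headers ...
    return body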

DevOps practice embeds these standards into the platform itself so observability is not a checklist item engineers remember to add but a default they’d have to deliberately opt out of.

What the Industry Gets Wrong About Observability

“More dashboards improve observability.” Fourteen dashboards showing CPU, memory, and request count across different services is monitoring, not observability. Fourteen vital signs monitors. None of them tell you why the patient is sick. If answering “why is checkout failing for mobile users in APAC?” requires SSHing into a server and grepping logs, the dashboards are decoration.

“Alert on everything, filter later.” Alerting on CPU, memory, disk, and network for every service generates alert fatigue within weeks. The hospital alarm that goes off every time a machine reading changes. On-call engineers mute channels. Real incidents get lost in the noise. SLO-based alerting (page on user impact, not infrastructure metrics) cuts alert volume sharply while raising signal quality. Alert when the patient’s condition changes. Not when the heart rate monitor fluctuates by one beat.

Our take: instrument context propagation before choosing your observability backend. A trace ID that flows through every service, every log line, and every metric label is the single most valuable observability investment. The patient ID on every test result, every scan, every chart entry. Without it, every tool is an island. With it, any tool becomes useful. The backend matters far less than the context flowing through it.

That checkout failure from the opening? Trace ID links the alert to the payment span. The blood test connected to the X-ray connected to the patient history. The structured log shows the flag variant and the upstream timeout. The engineer never opens a terminal. Four-minute fix. Zero-hour investigation. Same hospital. Same patient. One connected chart instead of three separate filing cabinets.

Stop Guessing Why Production Is Breaking

CPU graphs don’t tell you why checkout is failing for mobile users in Southeast Asia. Real observability connects metrics, traces, and structured logs into one root-cause investigation flow. Click from alert to trace to log in seconds, not endless minutes of tab-switching.


Frequently Asked Questions

What is the difference between monitoring and observability?


Monitoring asks whether the system is working using predefined dashboards and static thresholds. Observability lets you ask why a specific request is failing using open-ended queries against correlated telemetry. Teams with mature observability consistently resolve novel incidents in a fraction of the time it takes teams relying only on dashboards for the same class of incident.

Why do structured logs matter more than free-text logs?


Structured logs assign values to explicit keys, turning every event into a queryable dataset. Querying all errors where customer_id equals a specific value and feature_flag equals new-checkout-v2 takes under 2 seconds against structured logs. Against free-text logs at volume, the same query takes hours of regex parsing and still misses variant formats.

What is a high-cardinality metric and why do traditional systems struggle?


Cardinality is the number of unique label values a metric can take. User ID has millions of values. Prometheus stores one time series per unique label combination, so tagging metrics with user_id creates millions of series, crushing storage and query performance. High-cardinality attributes belong on traces and structured logs, not on Prometheus metrics.
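A two-line sketch of the anti-pattern, using the prometheus_client Python library (the metric name is illustrative):

from prometheus_client import Counter

# Anti-pattern: one new time series is created per unique user_id value.
checkouts = Counter("checkout_requests_total", "Checkout attempts", ["user_id"])
checkouts.labels(user_id="john@example.com").inc()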

Is distributed tracing valuable in a monolithic architecture?


Yes. In a monolith, tracing reveals which internal operation eats request time: the database query taking 80% of duration, the external API adding 200ms of latency, or the code path that only runs on slow requests. Without tracing, you know a request was slow. With tracing, you know exactly which span consumed the time and can optimize.

What should trigger an alert versus being a dashboard metric?


Alert on user-facing behavior: error rate above SLO threshold, P99 latency exceeding response time SLO, availability below target. Don’t alert on infrastructure like CPU above 80% unless it connects to user impact. Infrastructure alerting produces far more alerts with the same incident rate, destroying on-call effectiveness through fatigue.