
SLO Engineering: Error Budgets That Drive Decisions

Metasphere Engineering · 17 min read

Your service is “99.9% available.” It says so on the status page. It also says so in the SLA your sales team signed. The problem is that nobody can tell you what that number actually measures, what happens when it drops below the target, or whether the alerts that woke your on-call engineer at an unreasonable hour had anything to do with it.

The number exists. It lives in a dashboard. It occasionally surfaces in an executive slide. And it changes absolutely nothing about how anyone makes decisions, because it connects to no mechanism that forces decisions to be made.

Key takeaways
  • SLOs are engineering tools, not contractual obligations. An SLO without an error budget policy is just a number on a dashboard. The policy is what changes behavior.
  • Averages lie about user experience. A service with a 200ms average latency and a 4-second p99 tells two completely different reliability stories depending on which metric you report.
  • Error budgets make the velocity-versus-reliability tradeoff explicit. When the budget is healthy, ship fast. When it burns, fix reliability. No debate. No politics.
  • Burn-rate alerting replaces threshold alerting. Instead of paging when error rate crosses 1%, page when the error budget is being consumed fast enough to exhaust within hours.
  • Product must co-own the target, or the target is fiction. A reliability number set by engineers alone is a wish. One negotiated with product is a commitment.

SLAs, SLOs, SLIs: Three Letters, Three Different Things

The confusion starts with the terminology. Teams use SLA, SLO, and SLI interchangeably, and the resulting conversations go in circles because everyone is talking about a different concept using the same words.

An SLI (Service Level Indicator) is a measurement. It answers the question: “How is the user experiencing this service right now?” The ratio of successful HTTP responses to total responses. The proportion of requests served under 300ms. The fraction of search queries returning results within a freshness window. SLIs are always expressed as a ratio between 0% and 100%.

An SLO (Service Level Objective) is a target set against an SLI. “99.9% of requests will succeed over a rolling 30-day window.” The SLO defines what “good enough” looks like. Not perfect. Good enough.

An SLA (Service Level Agreement) is a contract. It carries financial consequences. If the vendor breaches the SLA, the customer gets credits, refunds, or the right to terminate. SLAs are negotiated by legal and sales, not engineering.

The critical relationship: SLOs must always be stricter than SLAs. If your SLA promises 99.5% and your SLO targets 99.9%, you have a buffer. The internal alarm trips long before the contractual breach. Set them equal and you lose that buffer entirely. By the time anyone notices a problem, the SLA violation has already happened and the credits are owed.

What the Industry Gets Wrong About SLOs

“Our SLA is 99.9%, so our SLO is 99.9%.” An SLO equal to the SLA means there is zero buffer. Every dip in reliability immediately risks contractual penalties. SLOs should be meaningfully stricter than SLAs so that internal mechanisms kick in before external consequences do. If your SLA is 99.9%, your SLO should be at least 99.95%.

“We need five nines.” 99.999% availability means 26 seconds of total downtime per month. A single deployment that takes 30 seconds to roll back has already blown the budget. Most services do not need this, and the engineering cost of achieving it is exponential. Every additional nine roughly doubles the infrastructure and operational complexity. The right target for most user-facing services is 99.9% to 99.95%.
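As a sanity check, the downtime arithmetic behind each nines target is a one-liner (assuming a 30-day month; the helper name is ours, for illustration):

```python
def monthly_downtime_seconds(availability_target, days=30):
    """Allowed downtime per month implied by an availability target."""
    return (1.0 - availability_target) * days * 24 * 3600

print(round(monthly_downtime_seconds(0.99999)))  # five nines: ~26 seconds
print(round(monthly_downtime_seconds(0.999)))    # three nines: ~2592 seconds (43.2 minutes)
```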

“We measure availability by checking if the server is up.” A synthetic health check that returns 200 OK tells you the process is running. It tells you nothing about whether users can actually complete the actions they came to perform. SLIs must measure user-facing behavior, not infrastructure liveness.

Choosing the Right SLIs

The SLI is the foundation. Get it wrong and everything built on top of it - the SLO, the error budget, the alerting - reflects a reality that does not match what users experience.

Most teams reach for infrastructure metrics: CPU utilization, memory pressure, disk I/O. These are useful for capacity planning. They are terrible SLIs because they do not correlate reliably with user experience. A service can run at 90% CPU and serve every request happily. It can also run at 20% CPU while returning errors because a downstream dependency is down.

Good SLIs measure the boundary between your system and the user.

| SLI Type | What It Measures | Good For | Example |
| --- | --- | --- | --- |
| Availability | Proportion of successful responses | Request-driven services | 99.9% of HTTP requests return non-5xx |
| Latency | Response time at a percentile | User-facing APIs, search | 99% of requests under 300ms |
| Correctness | Proportion of correct results | Data pipelines, ML inference | 99.99% of pricing calculations match source of truth |
| Freshness | Data age relative to source | Dashboards, search indexes | 99.5% of queries see data less than 5 minutes old |

The SLI Proxy Trap

Many teams measure SLIs at the wrong layer. Measuring latency at the load balancer captures network and routing time but misses slow database queries, serialization overhead, and rendering delays that the user actually experiences. The closer the SLI measurement point is to the user, the more accurately it reflects their experience. Instrument at the edge, not the origin.
Anti-pattern

Don’t: Use average latency as an SLI. A service with 50ms average and 4-second p99 looks healthy on the average but is miserable for the unluckiest 1% of users.

Do: Use percentile-based latency SLIs. p99 latency under 300ms captures the experience of all but the most extreme outliers. Track p50 for typical experience and p99 for worst-case.
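To see the gap concretely, here is a synthetic sketch (made-up numbers, simple nearest-rank percentile):

```python
# 97 fast requests and 3 slow ones: the average hides the tail.
samples_ms = sorted([50] * 97 + [4000] * 3)

avg = sum(samples_ms) / len(samples_ms)          # 168.5 — looks tolerable
rank = max(int(0.99 * len(samples_ms)) - 1, 0)   # nearest-rank p99 index
p99 = samples_ms[rank]                           # 4000 — the tail tells the real story

print(avg, p99)
```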

Error Budgets: The Decision Framework That Changes Behavior

An SLO without a policy is a number on a dashboard. The error budget is what turns it into a decision-making tool.

The math is simple. A 99.9% SLO over a 30-day rolling window means 0.1% allowed failure. In a system serving one million requests per day, that budget is 1,000 failed requests per day, or roughly 30,000 over the window. In terms of downtime, 0.1% of 30 days is 43.2 minutes.

That 43.2 minutes (or 30,000 failed requests) is the error budget. It belongs jointly to product and engineering. Product wants to spend it on velocity - ship features, run experiments, accept some risk. Engineering wants to conserve it for operational safety. The budget forces the conversation out of the abstract and into the concrete: “We have 38 minutes of budget remaining this month. Do we deploy this risky migration, or wait until the window resets?”
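The budget arithmetic above can be sketched in a few lines (the helper name is ours, not a standard API):

```python
def error_budget(slo_target, window_days=30, requests_per_day=None):
    """Error budget implied by an SLO over a rolling window."""
    allowed = 1.0 - slo_target                      # e.g. 0.001 for a 99.9% SLO
    minutes = allowed * window_days * 24 * 60       # downtime budget
    requests = allowed * requests_per_day * window_days if requests_per_day else None
    return minutes, requests

minutes, failed = error_budget(0.999, 30, requests_per_day=1_000_000)
print(round(minutes, 1), round(failed))  # ~43.2 minutes, ~30000 failed requests
```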

Our take: Error budgets only work when they have teeth. A policy that says “when the budget runs out, we should probably focus on reliability” changes nothing. A policy that says “when the budget drops below 25%, all feature work stops and the team focuses exclusively on reliability until the budget recovers to 50%” changes everything. The difference is enforcement, and enforcement requires product leadership to co-sign the policy before the first incident, not during one.

| Budget Health | Engineering Behavior | Product Behavior |
| --- | --- | --- |
| Healthy (> 50% remaining) | Ship freely, experiment with new architectures | Push features, run A/B tests, tolerate some churn |
| Caution (25-50%) | Require rollback plans, avoid high-risk changes | Prioritize lower-risk features, defer large migrations |
| Critical (< 25%) | Reliability-only sprint, no feature deployments | Defer all feature requests, support reliability work |
| Exhausted (0%) | Full freeze until budget recovers past 50% threshold | Accept timeline delay, communicate externally if needed |

Burn-Rate Alerting: Stop Paging on the Wrong Things

Traditional threshold alerting pages your on-call when error rate crosses a fixed boundary. Error rate above 1%? Page. Latency above 500ms? Page. The problem is that a 1.5% error rate for three seconds is not the same problem as a 1.5% error rate for three hours. The threshold alert fires identically for both.

Burn-rate alerting asks a different question: “At the current rate of errors, how fast is the error budget being consumed?” A burn rate of 1x means you will exactly exhaust the budget at the end of the 30-day window. Nothing to worry about. A burn rate of 14.4x means the budget will be gone in roughly two days. That warrants attention. A burn rate of 720x means the budget burns out in one hour. Page immediately.

The Google SRE Workbook formalizes this into multi-window burn-rate alerts. A fast window (short lookback, say 5 minutes) catches rapid budget consumption. A slow window (longer lookback, say 60 minutes) filters out transient spikes that resolve on their own. Both conditions must be true to fire.

# Multi-window burn-rate alert configuration
# Pages only when BOTH windows confirm the burn
alerts:
  - name: "high-burn-rate-page"
    # 14.4x burn rate = budget exhausted in ~2 days
    condition:
      fast_window: 5m    # Recent error rate over 5 minutes
      fast_threshold: 14.4
      slow_window: 60m   # Sustained error rate over 1 hour
      slow_threshold: 14.4
    severity: page        # Wake someone up

  - name: "medium-burn-rate-ticket"
    # 3x burn rate = budget exhausted in ~10 days
    condition:
      fast_window: 30m
      fast_threshold: 3
      slow_window: 6h
      slow_threshold: 3
    severity: ticket       # File a ticket, don't page
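The dual-window condition the config expresses can be evaluated in a few lines; the function and parameter names below are illustrative, not a real alerting API:

```python
def should_page(fast_burn, slow_burn, threshold=14.4):
    """Fire only when BOTH the fast and slow windows confirm the burn rate."""
    return fast_burn >= threshold and slow_burn >= threshold

# A short spike trips the fast window but not the slow one: no page.
print(should_page(fast_burn=20.0, slow_burn=2.0))    # False
# A sustained burn trips both windows: page.
print(should_page(fast_burn=20.0, slow_burn=15.0))   # True
```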

The result: fewer false-positive pages, faster detection of genuine incidents, and on-call rotations that don’t burn people out chasing phantom alerts. Teams with effective observability and monitoring practices find that burn-rate alerting typically eliminates the majority of their noisy threshold alerts while catching real problems earlier.

How to calculate burn rate from SLO parameters

The burn rate formula is: burn_rate = (error_rate_observed / error_rate_allowed).

For a 99.9% SLO, the allowed error rate is 0.1% (0.001). If the observed error rate over the fast window is 1.44% (0.0144), the burn rate is 0.0144 / 0.001 = 14.4x.

To determine the time until budget exhaustion: time_remaining = slo_window / burn_rate. At 14.4x burn rate with a 30-day window: 30 / 14.4 = ~2.08 days. This gives the on-call engineer a concrete timeline for how urgent the response needs to be.

Common burn-rate thresholds:

  • 14.4x (budget gone in ~2 days): page immediately
  • 6x (budget gone in ~5 days): page during business hours
  • 3x (budget gone in ~10 days): ticket for next sprint
  • 1x (budget tracks exactly to exhaustion): informational, no action
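The formula and exhaustion math above, as a sketch (helper names are ours, for illustration):

```python
def burn_rate(observed_error_rate, slo_target):
    """How fast the budget is being consumed relative to plan."""
    return observed_error_rate / (1.0 - slo_target)

def days_to_exhaustion(rate, window_days=30):
    """At this burn rate, how long until the budget is gone?"""
    return window_days / rate

rate = burn_rate(0.0144, 0.999)            # observed 1.44% against a 99.9% SLO
print(round(rate, 1))                      # 14.4
print(round(days_to_exhaustion(rate), 2))  # ~2.08 days
```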

Multi-Signal SLOs

A single SLI rarely captures the full picture. A service can return 100% successful responses, all of them taking 10 seconds. Availability SLO: met. User experience: terrible.

Meaningful reliability requires combining multiple SLIs into a composite view. Not a single number (weighted averages hide problems the same way averaging hides latency outliers) but a set of SLOs where all must be met simultaneously.

A request that succeeds (availability: good) but takes 4 seconds (latency: bad) counts against the latency error budget. A request that is fast and successful but returns stale data (correctness: bad) counts against the correctness budget. Each SLI has its own budget, and the most constrained budget drives the team’s priorities.

Most services need two to four SLOs. More than that and the signal becomes noise. The typical starting set:

  1. Availability: proportion of requests that don’t return server errors
  2. Latency: response time at p99 (or p95 for less latency-sensitive services)
  3. Correctness (if applicable): data accuracy, computation correctness, consistency with source of truth

Freshness is the fourth dimension, relevant for search indexes, dashboards, and data pipelines where staleness is a distinct failure mode from incorrectness.
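The “most constrained budget drives priorities” rule is trivial to express in code; the budget figures here are made up for illustration:

```python
# Remaining error budget per SLI, as a fraction of the window's allowance.
budgets = {"availability": 0.62, "latency": 0.18, "correctness": 0.91}

# The tightest budget wins: it sets the team's current priority.
most_constrained = min(budgets, key=budgets.get)
print(most_constrained)  # latency
```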

The Organizational Half

SLOs are as much an organizational practice as a technical one. The most precisely instrumented SLI with the most elegant burn-rate alert is worthless if nobody changes their behavior when the error budget runs low.

Prerequisites
  1. Product and engineering leadership have jointly agreed on SLO targets for the top three user-facing services
  2. Error budget policy document exists with explicit actions for each budget threshold
  3. SLO dashboards are visible to both product and engineering teams
  4. On-call rotation has access to error budget status in their incident response tooling
  5. At least one retrospective has been conducted after an error budget breach

The hardest organizational shift: product managers must accept that error budgets are not engineering’s problem. When the budget is healthy, engineering ships fast and product benefits. When the budget burns, product pays by deferring features. This is the deal. It only works if product co-owns the target from the beginning.

Organizations with mature SRE practices embed SLO reviews into sprint planning. Error budget status goes on the agenda before the feature backlog. Budget below threshold? Reliability sprint. No negotiation.

When SLOs Don’t Apply

Not every system needs a formal SLO. The overhead of defining SLIs, setting targets, building dashboards, configuring burn-rate alerts, and writing error budget policies is real. For systems where the investment does not pay back, simpler approaches work fine.

| SLOs add value | Simpler monitoring is fine |
| --- | --- |
| User-facing services with external customers | Internal batch jobs with retry logic |
| Services in the critical path of revenue transactions | Dev/staging environments |
| Platform services consumed by many internal teams | One-off migration scripts |
| Services with contractual SLA obligations | Services with a single team as the only consumer |

The test: if a reliability problem in this service would change someone’s behavior (halt a deployment, trigger an incident, escalate to leadership), an SLO formalizes that decision trigger. If a failure just means “wait and retry,” a basic health check suffices.

Our take: Start with SLOs on exactly three services. The login flow, the primary API, and the payment path (or whatever your revenue-critical equivalent is). Get the full stack working: SLIs instrumented, SLOs set, error budgets calculated, burn-rate alerts firing, and a policy document that product has signed. Then expand. Trying to SLO everything at once produces dozens of meaningless targets that nobody trusts and everybody ignores.

Getting Started: The 30-Day Playbook

SLO adoption does not require new tooling. Most observability stacks already collect the data needed for SLIs. The gap is usually in the framing, not the instrumentation.

Week 1: Pick three services and define SLIs. Instrument availability and latency at the edge. For each SLI, record a baseline over seven days. Do not set targets yet. Measure first.

Week 2: Set SLO targets based on baseline data. If your availability SLI shows 99.95% over the baseline week, a 99.9% SLO gives you breathing room. If your p99 latency is 280ms, a 300ms target is too tight. Set it at 400ms and tighten later as you improve the system.

Week 3: Build error budget dashboards and configure burn-rate alerts. Show remaining budget as a percentage and as absolute time/requests. Configure a 14.4x burn-rate page alert and a 3x ticket alert. Disable or mute the threshold alerts they replace.

Week 4: Write the error budget policy and get product sign-off. Define what happens at each budget threshold. Present it as a joint commitment, not an engineering diktat. Run a tabletop exercise: “The budget just hit 20%. What happens next?”

The SLO will be wrong on the first attempt. The target will be too tight or too loose. The SLI will miss an important failure mode. The policy thresholds will trigger too early or too late. This is expected. SLOs are calibrated through experience, not designed in a vacuum.

Same status page. Same “99.9% available” claim. But now it measures what users actually experience, feeds an error budget the team tracks weekly, and drives alerts that fire on genuine problems while staying quiet during transient blips. When the budget runs low, the conversation shifts from “should we prioritize reliability?” to “the policy says we do.” That number on the dashboard stopped being decorative. It started making decisions for you.

Your Alerts Are Lying to You

Threshold-based alerts fire on infrastructure symptoms, not user pain. SLO-based alerting with burn-rate windows catches real incidents faster, with fewer false positives, and gives teams the error budget framework to balance velocity against reliability.


Frequently Asked Questions

What is the difference between an SLA, an SLO, and an SLI?

An SLI (Service Level Indicator) is a measurement of user-facing behavior, like the proportion of requests faster than 300ms. An SLO (Service Level Objective) is an internal reliability target set against that SLI, such as 99.9% of requests below 300ms over 30 days. An SLA (Service Level Agreement) is a contractual promise with financial consequences if broken. SLOs should always be stricter than SLAs to provide a buffer before contractual penalties trigger.

How do error budgets work in SRE?

An error budget is the inverse of your SLO target expressed as allowed failure. A 99.9% SLO over 30 days gives you a budget of 43.2 minutes of total downtime or equivalent error volume. When the budget is healthy, teams ship features freely. When it burns past a threshold, reliability work takes priority. The budget creates a shared, measurable framework for the velocity-versus-reliability tradeoff.

Why are averages bad for SLI measurement?

Averages hide the worst user experiences behind the majority of good ones. A service with 50ms average latency might have a p99 of 2 seconds, meaning one in every hundred users waits 40x longer than the average suggests. SLIs should use percentile-based measurements, typically p50, p95, and p99, to capture the distribution of user experience rather than a single misleading number.

What is burn rate alerting for SLOs?

Burn rate alerting measures how fast your error budget is being consumed relative to the SLO window. A burn rate of 1x means you will exactly exhaust the budget by the end of the window. A burn rate of 14.4x means the budget will be gone in roughly two days. Multi-window burn rate alerts combine a fast window for detection speed with a slow window to filter transient spikes, dramatically reducing false positive pages.

How many SLOs should a service have?

Most services need two to four SLOs covering distinct dimensions of user experience: availability (successful responses), latency (response time at a meaningful percentile), and optionally correctness or freshness. More than five SLOs per service creates confusion about which matters most. Fewer than two usually means a critical dimension of user experience is unmeasured and unprotected.