SLO Engineering: Error Budgets That Drive Decisions
Your service is “99.9% available.” It says so on the status page. It also says so in the SLA your sales team signed. The problem is that nobody can tell you what that number actually measures, what happens when it drops below the target, or whether the alerts that woke your on-call engineer at an unreasonable hour had anything to do with it.
The number exists. It lives in a dashboard. It occasionally surfaces in an executive slide. And it changes absolutely nothing about how anyone makes decisions, because it connects to no mechanism that forces decisions to be made.
- SLOs are engineering tools, not contractual obligations. An SLO without an error budget policy is just a number on a dashboard. The policy is what changes behavior.
- Averages lie about user experience. A service with 200ms average latency and a 4-second p99 tells two completely different reliability stories depending on which metric you watch.
- Error budgets make the velocity-versus-reliability tradeoff explicit. When the budget is healthy, ship fast. When it burns, fix reliability. No debate. No politics.
- Burn-rate alerting replaces threshold alerting. Instead of paging when error rate crosses 1%, page when the error budget is being consumed fast enough to exhaust within hours.
- Product must co-own the target, or the target is fiction. A reliability number set by engineers alone is a wish. One negotiated with product is a commitment.
SLAs, SLOs, SLIs: Three Letters, Three Different Things
The confusion starts with the terminology. Teams use SLA, SLO, and SLI interchangeably, and the resulting conversations go in circles because everyone is talking about a different concept using the same words.
An SLI (Service Level Indicator) is a measurement. It answers the question: “How is the user experiencing this service right now?” The ratio of successful HTTP responses to total responses. The proportion of requests served under 300ms. The fraction of search queries returning results within a freshness window. SLIs are always expressed as a ratio between 0% and 100%.
An SLO (Service Level Objective) is a target set against an SLI. “99.9% of requests will succeed over a rolling 30-day window.” The SLO defines what “good enough” looks like. Not perfect. Good enough.
An SLA (Service Level Agreement) is a contract. It carries financial consequences. If the vendor breaches the SLA, the customer gets credits, refunds, or the right to terminate. SLAs are negotiated by legal and sales, not engineering.
The critical relationship: SLOs must always be stricter than SLAs. If your SLA promises 99.5% and your SLO targets 99.9%, you have a buffer. The internal alarm trips long before the contractual breach. Set them equal and you lose that buffer entirely. By the time anyone notices a problem, the SLA violation has already happened and the credits are owed.
What the Industry Gets Wrong About SLOs
“Our SLA is 99.9%, so our SLO is 99.9%.” An SLO equal to the SLA means there is zero buffer. Every dip in reliability immediately risks contractual penalties. SLOs should be meaningfully stricter than SLAs so that internal mechanisms kick in before external consequences do. If your SLA is 99.9%, your SLO should be at least 99.95%.
“We need five nines.” 99.999% availability means about 26 seconds of total downtime per month. A single deployment that takes 30 seconds to roll back has already blown the budget. Most services do not need this, and the engineering cost of achieving it is steep: each additional nine demands a disproportionate jump in redundancy, failover machinery, and operational rigor. The right target for most user-facing services is 99.9% to 99.95%.
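The downtime arithmetic behind the nines is easy to sanity-check. A minimal sketch, assuming a fixed 30-day window (calendar months vary slightly):

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over the window."""
    return (1 - target) * window_days * 24 * 60

# Three nines allows 43.2 minutes; five nines allows about 26 seconds.
for target in (0.999, 0.9995, 0.9999, 0.99999):
    print(f"{target:.3%}: {allowed_downtime_minutes(target):.2f} min")
```

Running this makes the five-nines problem concrete: 0.43 minutes per month leaves no room for even a single slow rollback.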
“We measure availability by checking if the server is up.” A synthetic health check that returns 200 OK tells you the process is running. It tells you nothing about whether users can actually complete the actions they came to perform. SLIs must measure user-facing behavior, not infrastructure liveness.
Choosing the Right SLIs
The SLI is the foundation. Get it wrong and everything built on top of it - the SLO, the error budget, the alerting - reflects a reality that does not match what users experience.
Most teams reach for infrastructure metrics: CPU utilization, memory pressure, disk I/O. These are useful for capacity planning. They are terrible SLIs because they do not correlate reliably with user experience. A service can run at 90% CPU and serve every request happily. It can also run at 20% CPU while returning errors because a downstream dependency is down.
Good SLIs measure the boundary between your system and the user.
| SLI Type | What It Measures | Good For | Example |
|---|---|---|---|
| Availability | Proportion of successful responses | Request-driven services | 99.9% of HTTP requests return non-5xx |
| Latency | Response time at a percentile | User-facing APIs, search | 99% of requests under 300ms |
| Correctness | Proportion of correct results | Data pipelines, ML inference | 99.99% of pricing calculations match source of truth |
| Freshness | Data age relative to source | Dashboards, search indexes | 99.5% of queries see data less than 5 minutes old |
Don’t: Use average latency as an SLI. A service with 50ms average and 4-second p99 looks healthy on the average but is miserable for the unluckiest 1% of users.
Do: Use percentile-based latency SLIs. p99 latency under 300ms captures the experience of all but the most extreme outliers. Track p50 for typical experience and p99 for worst-case.
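A quick sketch makes the average-versus-percentile gap visible. The data here is synthetic (990 fast requests, 10 pathological ones) and the nearest-rank percentile function is a simplified illustration, not a production statistics routine:

```python
import random

# Illustrative traffic: 99% of requests are fast, 1% hit a slow tail.
random.seed(7)
latencies_ms = [random.uniform(20, 80) for _ in range(990)] + \
               [random.uniform(3000, 5000) for _ in range(10)]

def percentile(samples, pct):
    """Simplified percentile: value at the pct% position of the sorted list."""
    ordered = sorted(samples)
    idx = min(int(len(ordered) * pct / 100), len(ordered) - 1)
    return ordered[idx]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"average: {avg:.0f} ms")                          # looks healthy
print(f"p50:     {percentile(latencies_ms, 50):.0f} ms")  # typical user
print(f"p99:     {percentile(latencies_ms, 99):.0f} ms")  # tail user
```

The average lands under 100ms while the p99 sits in the multi-second range: exactly the "miserable 1%" the prose describes, invisible to the mean.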
Error Budgets: The Decision Framework That Changes Behavior
An SLO without a policy is a number on a dashboard. The error budget is what turns it into a decision-making tool.
The math is simple. A 99.9% SLO over a 30-day rolling window means 0.1% allowed failure. In a system serving one million requests per day, that budget is 1,000 failed requests per day, or roughly 30,000 over the window. In terms of downtime, 0.1% of 30 days is 43.2 minutes.
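The budget arithmetic from that paragraph, as executable code (same numbers: 99.9% SLO, 30-day window, one million requests per day):

```python
slo = 0.999
window_days = 30
daily_requests = 1_000_000

budget_fraction = 1 - slo                                   # 0.1% allowed failure
failed_per_day = daily_requests * budget_fraction           # ~1,000 requests/day
failed_per_window = failed_per_day * window_days            # ~30,000 per window
downtime_minutes = budget_fraction * window_days * 24 * 60  # 43.2 minutes

print(round(failed_per_day), round(failed_per_window), round(downtime_minutes, 1))
```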
That 43.2 minutes (or 30,000 failed requests) is the error budget. It belongs jointly to product and engineering. Product wants to spend it on velocity - ship features, run experiments, accept some risk. Engineering wants to conserve it for operational safety. The budget forces the conversation out of the abstract and into the concrete: “We have 38 minutes of budget remaining this month. Do we deploy this risky migration, or wait until the window resets?”
| Budget Health | Engineering Behavior | Product Behavior |
|---|---|---|
| Healthy (> 50% remaining) | Ship freely, experiment with new architectures | Push features, run A/B tests, tolerate some churn |
| Caution (25-50%) | Require rollback plans, avoid high-risk changes | Prioritize lower-risk features, defer large migrations |
| Critical (< 25%) | Reliability-only sprint, no feature deployments | Defer all feature requests, support reliability work |
| Exhausted (0%) | Full freeze until budget recovers past 50% threshold | Accept timeline delay, communicate externally if needed |
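The policy table above can be encoded so that dashboards and deploy tooling report the current tier automatically. A sketch using this article's thresholds; the tier names and return strings are illustrative:

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0-1.0) to the policy tier from the table."""
    if remaining_fraction <= 0:
        return "exhausted: full freeze until budget recovers past 50%"
    if remaining_fraction < 0.25:
        return "critical: reliability-only sprint, no feature deployments"
    if remaining_fraction <= 0.50:
        return "caution: rollback plans required, avoid high-risk changes"
    return "healthy: ship freely, experiment"

print(budget_policy(0.62))  # healthy tier
print(budget_policy(0.20))  # critical tier
```

Encoding the policy in code has a side benefit: the thresholds become reviewable artifacts, not tribal knowledge.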
Burn-Rate Alerting: Stop Paging on the Wrong Things
Traditional threshold alerting pages your on-call when error rate crosses a fixed boundary. Error rate above 1%? Page. Latency above 500ms? Page. The problem is that a 1.5% error rate for three seconds is not the same problem as a 1.5% error rate for three hours. The threshold alert fires identically for both.
Burn-rate alerting asks a different question: “At the current rate of errors, how fast is the error budget being consumed?” A burn rate of 1x means you will exactly exhaust the budget at the end of the 30-day window. Nothing to worry about. A burn rate of 14.4x means the budget will be gone in about two days. That warrants attention. A burn rate of 720x means the budget burns out in one hour. Page immediately.
The Google SRE Workbook formalizes this into multi-window burn-rate alerts. A fast window (short lookback, say 5 minutes) catches rapid budget consumption. A slow window (longer lookback, say 60 minutes) filters out transient spikes that resolve on their own. Both conditions must be true to fire.
```yaml
# Multi-window burn-rate alert configuration
# Pages only when BOTH windows confirm the burn
alerts:
  - name: "high-burn-rate-page"
    # 14.4x burn rate = budget exhausted in ~2 days
    condition:
      fast_window: 5m      # Recent error rate over 5 minutes
      fast_threshold: 14.4
      slow_window: 60m     # Sustained error rate over 1 hour
      slow_threshold: 14.4
    severity: page         # Wake someone up
  - name: "medium-burn-rate-ticket"
    # 3x burn rate = budget exhausted in ~10 days
    condition:
      fast_window: 30m
      fast_threshold: 3
      slow_window: 6h
      slow_threshold: 3
    severity: ticket       # File a ticket, don't page
```
The result: fewer false-positive pages, faster detection of genuine incidents, and on-call rotations that don’t burn people out chasing phantom alerts. Teams with effective observability and monitoring practices find that burn-rate alerting typically eliminates the majority of their noisy threshold alerts while catching real problems earlier.
How to calculate burn rate from SLO parameters
The burn rate formula is: burn_rate = (error_rate_observed / error_rate_allowed).
For a 99.9% SLO, the allowed error rate is 0.1% (0.001). If the observed error rate over the fast window is 1.44% (0.0144), the burn rate is 0.0144 / 0.001 = 14.4x.
To determine the time until budget exhaustion: time_remaining = slo_window / burn_rate. At 14.4x burn rate with a 30-day window: 30 / 14.4 = ~2.08 days. This gives the on-call engineer a concrete timeline for how urgent the response needs to be.
Common burn-rate thresholds:
- 14.4x (budget gone in ~2 days): page immediately
- 6x (budget gone in ~5 days): page during business hours
- 3x (budget gone in ~10 days): ticket for next sprint
- 1x (budget tracks exactly to exhaustion): informational, no action
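The two formulas above translate directly into code. A minimal sketch using the article's own example (99.9% SLO, 1.44% observed error rate, 30-day window):

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Ratio of observed error rate to the error rate the SLO allows."""
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate

def days_until_exhaustion(rate: float, window_days: int = 30) -> float:
    """At this burn rate, how long until the budget is fully consumed?"""
    return window_days / rate

rate = burn_rate(0.0144, 0.999)   # 14.4x
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in {days_until_exhaustion(rate):.2f} days")  # ~2.08 days
```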
Multi-Signal SLOs
A single SLI rarely captures the full picture. A service can return 100% successful responses, all of them taking 10 seconds. Availability SLO: met. User experience: terrible.
Meaningful reliability requires combining multiple SLIs into a composite view. Not a single number (weighted averages hide problems the same way averaging hides latency outliers) but a set of SLOs where all must be met simultaneously.
A request that succeeds (availability: good) but takes 4 seconds (latency: bad) counts against the latency error budget. A request that is fast and successful but returns stale data (correctness: bad) counts against the correctness budget. Each SLI has its own budget, and the most constrained budget drives the team’s priorities.
Most services need two to four SLOs. More than that and the signal becomes noise. The typical starting set:
- Availability: proportion of requests that don’t return server errors
- Latency: response time at p99 (or p95 for less latency-sensitive services)
- Correctness (if applicable): data accuracy, computation correctness, consistency with source of truth
Freshness is the fourth dimension, relevant for search indexes, dashboards, and data pipelines where staleness is a distinct failure mode from incorrectness.
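One way to see "each SLI has its own budget" in code: classify every request against each dimension independently, and charge only the budgets of the dimensions it fails. The thresholds and field names here are illustrative, borrowed from this article's examples (300ms p99 target, 5-minute freshness window):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    latency_ms: float    # end-to-end response time
    data_age_s: float    # age of the data served, in seconds

def sli_outcomes(req: Request) -> dict:
    """Per-dimension pass/fail; a False charges that dimension's budget only."""
    return {
        "availability": req.status < 500,
        "latency": req.latency_ms <= 300,
        "freshness": req.data_age_s <= 300,
    }

# Fast and successful, but serving 15-minute-old data:
# only the freshness budget is charged.
print(sli_outcomes(Request(status=200, latency_ms=120, data_age_s=900)))
```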
The Organizational Half
SLOs are as much an organizational practice as a technical one. The most precisely instrumented SLI with the most elegant burn-rate alert is worthless if nobody changes their behavior when the error budget runs low. A minimum checklist for making the practice organizational rather than ornamental:
- Product and engineering leadership have jointly agreed on SLO targets for the top three user-facing services
- Error budget policy document exists with explicit actions for each budget threshold
- SLO dashboards are visible to both product and engineering teams
- On-call rotation has access to error budget status in their incident response tooling
- At least one retrospective has been conducted after an error budget breach
The hardest organizational shift: product managers must accept that error budgets are not engineering’s problem alone. When the budget is healthy, engineering ships fast and product benefits. When the budget burns, product pays by deferring features. This is the deal. It only works if product co-owns the target from the beginning.
Organizations with mature SRE practices embed SLO reviews into sprint planning. Error budget status goes on the agenda before the feature backlog. Budget below threshold? Reliability sprint. No negotiation.
When SLOs Don’t Apply
Not every system needs a formal SLO. The overhead of defining SLIs, setting targets, building dashboards, configuring burn-rate alerts, and writing error budget policies is real. For systems where the investment does not pay back, simpler approaches work fine.
| SLOs add value | Simpler monitoring is fine |
|---|---|
| User-facing services with external customers | Internal batch jobs with retry logic |
| Services in the critical path of revenue transactions | Dev/staging environments |
| Platform services consumed by many internal teams | One-off migration scripts |
| Services with contractual SLA obligations | Services with a single team as the only consumer |
The test: if a reliability problem in this service would change someone’s behavior (halt a deployment, trigger an incident, escalate to leadership), an SLO formalizes that decision trigger. If a failure just means “wait and retry,” a basic health check suffices.
Getting Started: The 30-Day Playbook
SLO adoption does not require new tooling. Most observability stacks already collect the data needed for SLIs. The gap is usually in the framing, not the instrumentation.
Week 1: Pick three services and define SLIs. Instrument availability and latency at the edge. For each SLI, record a baseline over seven days. Do not set targets yet. Measure first.
Week 2: Set SLO targets based on baseline data. If your availability SLI shows 99.95% over the baseline week, a 99.9% SLO gives you breathing room. If your p99 latency is 280ms, a 300ms target is too tight. Set it at 400ms and tighten later as you improve the system.
Week 3: Build error budget dashboards and configure burn-rate alerts. Show remaining budget as a percentage and as absolute time/requests. Configure a 14.4x burn-rate page alert and a 3x ticket alert. Disable or mute the threshold alerts they replace.
Week 4: Write the error budget policy and get product sign-off. Define what happens at each budget threshold. Present it as a joint commitment, not an engineering diktat. Run a tabletop exercise: “The budget just hit 20%. What happens next?”
The SLO will be wrong on the first attempt. The target will be too tight or too loose. The SLI will miss an important failure mode. The policy thresholds will trigger too early or too late. This is expected. SLOs are calibrated through experience, not designed in a vacuum.
Same status page. Same “99.9% available” claim. But now it measures what users actually experience, feeds an error budget the team tracks weekly, and drives alerts that fire on genuine problems while staying quiet during transient blips. When the budget runs low, the conversation shifts from “should we prioritize reliability?” to “the policy says we do.” That number on the dashboard stopped being decorative. It started making decisions for you.