Developer Experience Metrics: DORA, Toil, and Pipeline Friction
Your engineering org decides to improve its DORA metrics. Deployment frequency goes from twice a week to 15 per day after hooking up automatic deployment from main. Lead time for changes drops from 5 days to 6 hours after removing the manual QA gate. The metrics look excellent in the quarterly board deck. The engineers are more frustrated than ever.
They’re now dealing with 15 production deployments per day when they used to review each one. The removed QA gate had been catching real regressions that now reach users. The on-call team is triaging 3x more incidents per week. You optimized the dashboard. You made the job worse.
DORA metrics were a meaningful contribution to engineering measurement because they moved the conversation from gut feel to numbers. But DORA measures outputs of the development process, not the experience of the people doing the developing. Optimize for DORA without measuring what engineers actually feel and you get exactly this: impressive dashboards, miserable teams. You can have elite DORA numbers and a 40% attrition rate simultaneously.
Good DevEx measurement captures the friction engineers actually encounter, not just the outputs their process produces.
The DORA Foundation
DORA metrics are a starting point, not a finish line. Get the baseline right before you try to improve anything. The 2023 State of DevOps report defined four performance tiers: elite, high, medium, and low. Elite performers deploy on demand with lead times under one day, change failure rates around 5%, and recovery times under one hour. Figure out where your team actually sits in these tiers before you attempt any improvement. Making these measurements part of routine engineering operations, rather than a one-off audit, is what turns them into a baseline you can trust.
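A minimal sketch of that baselining step, as a tier-classification function. The exact thresholds are assumptions simplified from the published performance bands; adjust them to whichever report year your org benchmarks against.

```python
# Rough DORA tier classification. Thresholds are illustrative
# approximations of the report's bands, not authoritative values.

def dora_tier(deploys_per_week: float, lead_time_hours: float,
              change_failure_rate: float, recovery_hours: float) -> str:
    """Return an approximate performance tier for one team."""
    if (deploys_per_week >= 7                # roughly "on demand"
            and lead_time_hours <= 24
            and change_failure_rate <= 0.05
            and recovery_hours <= 1):
        return "elite"
    if (deploys_per_week >= 1
            and lead_time_hours <= 24 * 7
            and change_failure_rate <= 0.10
            and recovery_hours <= 24):
        return "high"
    if lead_time_hours <= 24 * 30:
        return "medium"
    return "low"

# The team from the intro: 15 deploys/day, 6-hour lead time.
print(dora_tier(15 * 5, 6, 0.04, 0.5))  # prints "elite"
```

Note what the function cannot see: nothing in its inputs captures review load, incident volume, or engineer sentiment, which is exactly the blind spot this article is about.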
Measurement precision matters more than most teams realize. Deployment frequency measured from CI/CD system events is reliable; self-reported deployment frequency is not. Lead time measured from first commit on a PR branch to production deployment captures the full cycle, including code-review wait time and merge-queue delays. Many teams measure lead time starting from merge, which conveniently hides the 2-3 days their code sits in review. That's not measurement. That's flattery. Change failure rate requires a consistent definition of "failure," typically an incident requiring a rollback or hotfix deployment within 24 hours. Without that definition, the metric drifts based on whoever's filling out the post-incident form.
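The merge-vs-first-commit difference is easy to see in code. A small sketch with hypothetical timestamps (in practice these come from your SCM and CD system events):

```python
# Lead time measured two ways. The timestamps are invented to show
# how measuring from merge hides review wait time.
from datetime import datetime

def lead_time_hours(start: datetime, deployed: datetime) -> float:
    return (deployed - start).total_seconds() / 3600

first_commit = datetime(2024, 3, 4, 9, 0)   # work starts
merged       = datetime(2024, 3, 7, 11, 0)  # ~3 days sitting in review
deployed     = datetime(2024, 3, 7, 15, 0)  # ships

print(lead_time_hours(merged, deployed))        # 4.0  -- flattering
print(lead_time_hours(first_commit, deployed))  # 78.0 -- honest
```

Same PR, same deploy: a 4-hour "lead time" from merge versus 78 hours from first commit. Only the second number reflects what the engineer experienced.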
Here’s the interpretation trap: comparing DORA metrics across teams without context. Daily deployments mean something fundamentally different for a platform team shipping Terraform configuration changes than for an application team with complex integration requirements and regulated change windows. Ranking teams by DORA metrics without that context creates perverse incentives: the platform team “outperforms” the application team on a metric that measures completely different things for each. Stop doing this.
DORA tells you how fast your process moves. It says nothing about the friction your engineers carry every day. That requires a different lens entirely.
Cognitive Load and Toil
Cognitive load is the mental overhead engineers carry that has nothing to do with building the product. Learning a different deployment workflow for each service. Remembering which of 12 runbooks applies to which type of alert. Knowing that the staging environment needs a manual VPN connection but production doesn’t, and that the dev database requires a different credentials rotation than staging. Each one is small. Together they eat mental bandwidth that would otherwise go to the actual work. And they’re completely invisible to DORA metrics. This is the friction that makes engineers quit.
Measuring cognitive load requires asking engineers directly. The most practical approach: survey engineers with specific workflow scenarios and have them rate difficulty on a 1-5 scale. “How difficult is it to deploy a new version of your service?” “How difficult is it to set up a local development environment?” Better yet, shadow a new hire through their first deployment. The friction they encounter is the friction everyone encounters. Experienced engineers just stopped noticing it.
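One way to turn those 1-5 ratings into a rank-ordered friction list. The scenarios, scores, and the 3.5 cut-off are all made up for illustration:

```python
# Aggregate per-scenario difficulty ratings (1 = easy, 5 = painful)
# and flag the high-friction workflows. All data here is hypothetical.
from statistics import mean

responses = {
    "deploy a new version of your service": [2, 3, 2, 2, 3],
    "set up a local development environment": [4, 5, 4, 3, 5],
    "connect to staging over the manual VPN": [4, 4, 5, 4, 3],
}

THRESHOLD = 3.5  # arbitrary cut-off for "needs attention"
for scenario, ratings in sorted(responses.items(),
                                key=lambda kv: mean(kv[1]), reverse=True):
    avg = mean(ratings)
    flag = "  <- high friction" if avg >= THRESHOLD else ""
    print(f"{avg:.1f}  {scenario}{flag}")
```

The output is a prioritized worklist: environment setup and the VPN dance float to the top, deployment itself turns out to be fine. That ordering, not any single score, is the actionable part.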
Toil measurement requires engineers to categorize their actual time for a few weeks. Track the interrupts: how many times did you stop coding to respond to a routine alert, provision an environment, rotate credentials, or restart a flaky service? The aggregate gives you a toil ratio and identifies the highest-volume categories for automation. Teams consistently find that 20% of toil categories account for 80% of total toil hours. Automate those first. You’ll eliminate most of the pain with a fraction of the effort.
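A sketch of the toil-ratio computation over a tracked interrupt log. The log entries and the 160-hour tracking window are invented; the 80% cut mirrors the 80/20 pattern described above:

```python
# Compute a toil ratio from an interrupt log and find the few
# categories that cover ~80% of toil hours. Data is hypothetical.
from collections import Counter

interrupt_log = (                      # (category, hours) per interrupt
    [("routine alert", 0.5)] * 40
    + [("env provisioning", 1.5)] * 20
    + [("credential rotation", 1.0)] * 5
    + [("flaky service restart", 0.25)] * 10
)

hours = Counter()
for category, h in interrupt_log:
    hours[category] += h

total_toil = sum(hours.values())        # 57.5 hours
tracked_hours = 160                     # total hours logged this window
print(f"toil ratio: {total_toil / tracked_hours:.0%}")  # prints "toil ratio: 36%"

# Walk categories biggest-first until 80% of toil hours are covered.
running = 0.0
for category, h in hours.most_common():
    running += h
    print(f"automate first: {category} ({h:.1f}h)")
    if running >= 0.8 * total_toil:
        break
```

With this data, just two categories (environment provisioning and routine alerts) cross the 80% line, which is the shortlist you automate first.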
Google’s SRE guidance caps toil at 50% of an engineer’s time, and healthy teams should be trending it well under 30%. There are teams where toil hits 65% and the engineers themselves didn’t realize it because the work felt “normal.” That number is worth repeating: 65% of their time spent on work a script could do, and they’d stopped questioning it. The measurement exercise alone is eye-opening. Developer productivity platforms reduce the toil surface area by centralizing tooling and automating the highest-frequency manual tasks first.
Pipeline Metrics That Surface Real Friction
Cognitive load and toil tell you where the hidden friction lives. Pipeline metrics tell you where the visible friction is, and P95 CI time is the metric that actually reflects developer experience. Not averages. Averages lie.
Engineers don’t make workflow decisions based on the average build. They make decisions based on the slow builds. A pipeline with P50 of 8 minutes and P95 of 45 minutes produces completely different engineering behavior than one with P50 of 10 minutes and P95 of 12 minutes. When the slow case means 45 minutes of waiting, engineers context-switch to another task, lose focus, batch commits to avoid running CI twice, or just skip the local CI step and push directly. Every one of those behaviors introduces quality risk.
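The gap between the mean and the tail is easy to demonstrate. A sketch with synthetic build durations, shaped like the pipeline described above:

```python
# Why the mean build time misleads: 90 fast builds and 10 slow ones.
# Durations are synthetic, chosen to mirror the example in the text.
import statistics

durations_min = [8] * 90 + [45] * 10

mean_min = statistics.mean(durations_min)
q = statistics.quantiles(durations_min, n=100)  # 99 percentile cut points
p50, p95 = q[49], q[94]

print(f"mean {mean_min:.1f}  P50 {p50:.0f}  P95 {p95:.0f}")
# prints "mean 11.7  P50 8  P95 45"
```

An 11.7-minute average sounds fine on a dashboard, but engineers plan around the 45-minute P95, because that is the build that decides whether they context-switch.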
Finding the root cause of high P95 times requires looking at the tail cases specifically. Pull the full CI logs for your slowest 5% of builds and look for patterns. The cause is almost never “the whole pipeline is slow.” It’s usually one specific thing that makes 5% of builds take 4x as long: a flaky test suite that triggers a retry, a cold dependency cache that forces a full download, or a large test suite running serially when it could be parallelized. Fix that one thing and P95 drops dramatically while P50 barely changes. This is one of the highest-leverage improvements you can make.
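That tail analysis can be sketched in a few lines. The build records below are hypothetical; in practice you would pull them from your CI system's API:

```python
# Take the slowest ~5% of builds and count which stage dominated each.
# Build data is invented for illustration.
from collections import Counter

builds = (
    [{"total_min": 9,  "slowest_stage": "unit tests"}] * 95
    + [{"total_min": 44, "slowest_stage": "flaky-test retry"}] * 3
    + [{"total_min": 47, "slowest_stage": "dependency download"}] * 2
)

builds = sorted(builds, key=lambda b: b["total_min"], reverse=True)
tail = builds[: max(1, len(builds) // 20)]   # slowest ~5%

causes = Counter(b["slowest_stage"] for b in tail)
print(causes.most_common())
# prints [('flaky-test retry', 3), ('dependency download', 2)]
```

The counter output names the one or two specific culprits: here, a flaky-test retry loop and a cold dependency cache. Fix those and P95 collapses while P50 barely moves.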
Environment provisioning time directly measures platform engineering effectiveness. When provisioning takes hours, engineers work around the system. They keep environments running indefinitely, share environments between colleagues, or skip proper environment testing entirely. You’ll never catch them doing it on a dashboard. Measure provisioning time and set quarterly reduction targets. That creates accountability and visible improvement.
Onboarding time to first meaningful PR is the metric that catches everything else. It’s a composite of documentation quality, tooling accessibility, and environment setup friction. DevOps teams that track this consistently discover something important: improvements to onboarding benefit everyone, not just new hires. The same friction that slows a new hire on day one slows a returning engineer after leave, a contractor joining a project, and anyone recovering from a laptop replacement. If the path from “cloned the repo” to “first green build” takes 20 minutes rather than 2 days, every engineer benefits every time they touch a fresh environment. Strong CI/CD practices are often where the largest onboarding improvements come from.
Surveys: The Missing Signal
All the metrics above capture process outcomes. None of them capture whether your engineers feel productive, frustrated, or actively looking for other jobs. For that, you need to ask them.
A 5-question monthly pulse survey fills this gap more effectively than a 50-question annual survey that nobody remembers by the time results are analyzed.
Keep surveys focused. One theme per month: tooling satisfaction, deployment confidence, documentation quality, on-call burden. Use a consistent 1-5 scale so you can track trends. Target a response rate above 60%, which is achievable when surveys take under 2 minutes and engineers see concrete action taken on results. Surveys without visible follow-through drop below a 30% response rate within two cycles. At that point, running the survey is worse than not running it: the low response rate tells engineers that leadership asks for feedback but doesn’t act on it. That’s a trust problem that’s hard to undo.
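Response rate is itself a metric worth tracking per cycle. A small sketch with invented numbers, using the 60% and 30% thresholds from above:

```python
# Track pulse-survey response rate across cycles and flag decay.
# Cycle data is hypothetical; thresholds match the guidance above.
cycles = [                 # (month, responses, engineers surveyed)
    ("Jan", 52, 80),
    ("Feb", 49, 80),
    ("Mar", 22, 80),       # follow-through stalled
]

for month, responses, surveyed in cycles:
    rate = responses / surveyed
    if rate >= 0.60:
        status = "healthy"
    elif rate >= 0.30:
        status = "warning: show visible follow-through"
    else:
        status = "trust problem: fix before surveying again"
    print(f"{month}: {rate:.0%} -- {status}")
```

A falling response rate is an early signal that the survey program, not the engineering org, is the thing that needs fixing next.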
The most actionable survey question we’ve encountered: “What is the single most frustrating thing about your development workflow right now?” Open-ended, specific, and it generates a rank-ordered list of improvements that engineers will actually notice when fixed. Measure the outputs with DORA. Measure the friction with toil ratios and P95. Measure the sentiment with surveys. That’s the full picture. Anything less and you’re flying blind on the dimension that matters most: whether the people building your product want to keep building it.