Developer Experience Metrics: Beyond DORA Numbers

Metasphere Engineering · 12 min read

Your engineering org decides to improve its DORA metrics. Deployment frequency goes from twice a week to 15 per day after hooking up automatic deployment from main. Lead time for changes drops from 5 days to 6 hours after removing the manual QA gate. The metrics look excellent in the quarterly board deck. The engineers are more frustrated than ever.

They’re now dealing with 15 production deployments per day when they used to review each one. The removed QA gate had been catching real regressions that now reach users. The on-call team is drowning in incidents. You optimized the dashboard. You made the job worse.

DORA’s own research acknowledges this gap. Their metrics were a genuine contribution because they moved the conversation from gut feel to numbers. But DORA measures process outputs, not the experience of the people producing them. Optimize for DORA without measuring what engineers actually feel and you get exactly this: impressive dashboards, miserable teams. You can absolutely have elite DORA numbers and painful attrition at the same time. Teams do. More often than anyone admits.

Key takeaways
  • Elite DORA numbers and painful attrition can coexist. Fast process, miserable engineers. Both at the same time.
  • Toil ratio (manual/repetitive work as % of total) is the missing metric. Teams above 30% toil ratio are burning out regardless of what DORA says.
  • Developer surveys on a monthly cadence catch sentiment shifts before they become attrition. Ask about friction, not satisfaction.
  • CI wait time drives task-switching. Slow builds force context switches, and every context switch costs far more focus time than the build itself. Measure P95, not averages.
  • Environment provisioning time is the #1 complaint in most engineering satisfaction surveys. Self-serve, under 15 minutes, or engineers route around it.

The DORA Foundation

| DORA Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deploy frequency | On demand | Daily to weekly | Weekly to monthly | Monthly+ |
| Lead time for changes | <1 hour | 1 day to 1 week | 1 week to 1 month | 1-6 months |
| Change failure rate | <5% | 5-10% | 10-15% | 15-45% |
| Time to restore | <1 hour | <1 day | 1 day to 1 week | 1 week+ |

DORA metrics are a starting point, not a finish line. The 2023 State of DevOps report defined the tiers above. Figure out where your team actually sits before you attempt any improvement. Not where you think you sit. Where you actually sit. Integrating these measurements into DevOps operations is where the surprises live.
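
As a minimal sketch of that first step, tier lookup for one metric is just a threshold function over a measured value. The function and inputs below are hypothetical; the boundaries follow the table above, with the unpublished gaps between bands (such as 1 hour to 1 day) folded into the adjacent tier:

```python
def lead_time_tier(hours: float) -> str:
    """Classify measured lead time (first commit to production) into a DORA tier."""
    if hours < 1:
        return "Elite"
    if hours <= 24 * 7:    # up to 1 week
        return "High"
    if hours <= 24 * 30:   # up to 1 month
        return "Medium"
    return "Low"

print(lead_time_tier(6))        # "High": 6 hours, measured honestly from first commit
print(lead_time_tier(24 * 20))  # "Medium": 20 days
```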

Measurement precision matters more than teams realize. Deployment frequency from CI/CD events is reliable. Self-reported frequency is fiction. Lead time from first commit to production captures the full cycle - code review wait, merge queue delays included. Many teams measure from merge, conveniently hiding the 2-3 days code sits in review. Measurement or flattery? Change failure rate needs a consistent definition of “failure.” Typically: incidents requiring rollback or hotfix within 24 hours. Without consistency, the metric drifts based on whoever fills out the post-incident form and how generous they’re feeling.
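
A sketch of the honest-versus-flattering lead-time calculation, assuming you can join per-change records from your VCS and CD system (the record shape and timestamps here are hypothetical):

```python
from datetime import datetime

# Hypothetical per-change records joined from the VCS and the CD system.
changes = [
    {
        "first_commit": datetime(2024, 3, 1, 9, 0),
        "merged": datetime(2024, 3, 4, 11, 0),
        "deployed": datetime(2024, 3, 4, 15, 0),
    },
]

for c in changes:
    full_cycle = c["deployed"] - c["first_commit"]  # honest lead time
    from_merge = c["deployed"] - c["merged"]        # the flattering version
    # The difference is the review and merge-queue wait that
    # merge-based measurement quietly hides.
    print(f"full={full_cycle}  from_merge={from_merge}  hidden={full_cycle - from_merge}")
```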

The interpretation trap: comparing DORA across teams without context. A platform team deploying Terraform daily has completely different deployment frequency semantics than an app team with regulated change windows. Ranking teams by raw DORA creates perverse incentives.

Anti-pattern

Don’t: Remove QA gates, skip code review, or batch micro-commits to improve DORA numbers. Optimizing the metric while degrading the experience is Goodhart’s Law in action.

Do: Track DORA alongside friction metrics (toil ratio, P95 CI time, environment provisioning). Improvement that shows up in DORA but not in engineer surveys is not improvement.

DORA tells you how fast your process moves. It says nothing about the weight your engineers carry while moving it.

[Figure: DORA Metrics: Where Does Your Team Land? Four tiers (Elite, High, Medium, Low) across deploy frequency, lead time, change failure rate, and recovery time. Measure all four; optimizing one at the expense of others is a trap.]

Cognitive Load and Toil

Cognitive load is the mental overhead engineers carry that has nothing to do with building the product. A different deployment workflow for each service. Twelve runbooks and nobody remembers which one applies to which alert type. The staging environment needs a manual VPN connection but production does not. The dev database requires a different credentials rotation than staging. Each one is small. Together they devour mental bandwidth that should go to the actual work. And they are completely invisible to DORA metrics. This kind of friction makes engineers quit. Not the hard problems. The stupid ones.

Measuring cognitive load requires asking engineers directly. The most practical approach: survey engineers with specific workflow scenarios and have them rate difficulty on a 1-5 scale. “How difficult is it to deploy a new version of your service?” “How difficult is it to set up a local development environment?” Better yet, shadow a new hire through their first deployment. Watch them struggle. The friction they hit is the friction everyone has. Experienced engineers just stopped noticing it years ago. They have workarounds for the workarounds.
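
A minimal sketch of the aggregation, assuming responses are collected on the 1-5 difficulty scale described above (scenario names and scores are made up):

```python
from statistics import mean

# Hypothetical 1-5 difficulty ratings (5 = very difficult) per workflow scenario.
responses = {
    "Deploy a new version of your service": [2, 4, 5, 3, 4],
    "Set up a local development environment": [5, 5, 4, 5, 3],
    "Find the runbook that applies to an alert": [3, 4, 4, 2, 5],
}

# Rank scenarios by average difficulty; the top of the list is the friction backlog.
for scenario, scores in sorted(responses.items(), key=lambda kv: mean(kv[1]), reverse=True):
    print(f"{mean(scores):.1f}  {scenario}")
```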

[Figure: Developer Experience: Three Signal Categories. Flow metrics (DORA): deploy frequency, lead time, change failure rate, MTTR; how fast can we ship safely? Cognitive load: context switches per day, tools touched per task; how much friction per task? Satisfaction (surveys): developer NPS, toil perception score; how do developers feel? DORA measures the system, cognitive load measures the friction, surveys measure the humans.]

Toil measurement requires engineers to categorize their actual time for a few weeks. Track the interrupts: how many times did you stop coding to respond to a routine alert, provision an environment, rotate credentials, restart a flaky service? The aggregate gives you a toil ratio and exposes the highest-volume categories ripe for automation. Teams consistently find 20% of toil categories account for 80% of total hours. Automate those first. Most of the pain, a fraction of the effort.
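
Here is a minimal sketch of the bookkeeping, assuming a few weeks of interrupt tracking produce (category, minutes) entries; the log and the denominator below are hypothetical:

```python
from collections import Counter

# Hypothetical interrupt log from 2-3 weeks of tracking: (category, minutes).
log = [
    ("provision environment", 90), ("routine alert", 30), ("rotate credentials", 45),
    ("restart flaky service", 20), ("routine alert", 25), ("provision environment", 120),
    ("manual deploy", 60), ("routine alert", 35), ("provision environment", 80),
]
total_engineering_minutes = 4800  # e.g. two tracked weeks at 40 hours for one engineer

toil = Counter()
for category, minutes in log:
    toil[category] += minutes

toil_minutes = sum(toil.values())
print(f"toil ratio: {toil_minutes / total_engineering_minutes:.0%}")

# Pareto view: a few categories usually dominate total hours; automate those first.
running = 0
for category, minutes in toil.most_common():
    running += minutes
    print(f"{category:<24} {minutes:>4} min  cumulative {running / toil_minutes:.0%}")
```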

Google’s SRE guidance targets keeping toil below 50% of engineering time, with a goal of trending it under 30% for healthy teams. Teams where toil hits 65% often don’t realize it because the work feels “normal”: two-thirds of their time goes to work a script could do, and they’ve stopped questioning it. The measurement exercise alone is eye-opening. Developer productivity platforms reduce the toil surface area by centralizing tooling and automating the highest-frequency manual tasks first.

| Metric | Effort to Instrument | Signal Quality | What It Catches |
|---|---|---|---|
| DORA (4 metrics) | Low (CI/CD events) | High for process flow | Deployment bottlenecks, change failure patterns |
| Toil ratio | Medium (2-3 week tracking) | Very high for burnout | Invisible manual work eating engineering capacity |
| Env provisioning time | Low (timestamp diff) | High for platform quality | The #1 complaint in most engineering satisfaction surveys |
| P95 CI time | Low (pipeline logs) | High for tooling friction | The slow builds that drive context switching |
| Monthly pulse survey | Low (5 questions) | High for sentiment | Frustration shifts before they become attrition |
| Time to first PR | Medium (onboarding tracking) | High for friction | Documentation, tooling, and environment gaps |

Pipeline Metrics That Surface Real Friction

Cognitive load and toil tell you where the hidden friction lives. Pipeline metrics tell you where the visible friction is. And P95 CI time is the metric that actually reflects developer experience. Not averages. Averages are liars.

Engineers don’t make workflow decisions based on the average build. They make decisions based on the slow builds. A pipeline with P50 of 8 minutes and P95 of 45 minutes produces completely different engineering behavior than one with P50 of 10 minutes and P95 of 12 minutes. When the slow case means 45 minutes of waiting, engineers context-switch to another task, lose focus, batch commits to avoid running CI twice, or just skip the local CI step and push directly. Every one of those behaviors introduces quality risk.
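
A sketch of the two numbers side by side, using Python's statistics module on hypothetical pipeline-log durations:

```python
from statistics import median, quantiles

# Hypothetical CI build durations in minutes, pulled from pipeline logs.
durations = [7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 14, 15, 18, 22, 25, 31, 38, 44, 47]

p50 = median(durations)
p95 = quantiles(durations, n=20)[18]  # quantiles yields 19 cut points; index 18 is P95

print(f"P50 = {p50:.1f} min  (what the dashboard shows)")
print(f"P95 = {p95:.1f} min  (what engineers plan their day around)")
```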

Finding the root cause of high P95 times requires looking at the tail cases specifically. Pull the full CI logs for your slowest 5% of builds and look for patterns. The cause is almost never “the whole pipeline is slow.” It’s usually one specific thing that makes 5% of builds take 4x as long: a flaky test suite that triggers a retry, a cold dependency cache that forces a full download, or a large test suite running serially when it could be parallelized. Fix that one thing and P95 drops hard while P50 barely changes. One of the highest-impact improvements available.
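
A sketch of that tail triage, assuming CI logs give per-stage timings (the build records here are hypothetical):

```python
from collections import Counter

# Hypothetical per-build stage timings in minutes, pulled from CI logs.
fast = {"total": 9, "stages": {"docker": 3, "tests": 4, "install": 2}}
retry = {"total": 46, "stages": {"docker": 4, "tests": 38, "install": 4}}  # flaky-test retry
cold = {"total": 44, "stages": {"docker": 30, "tests": 8, "install": 6}}   # cold dependency cache
builds = [fast] * 38 + [retry, cold]

builds.sort(key=lambda b: b["total"], reverse=True)
tail = builds[: max(1, len(builds) // 20)]  # the slowest 5%

# Name the stage that dominated each tail build; one culprit usually jumps out.
culprits = Counter(max(b["stages"], key=b["stages"].get) for b in tail)
for stage, count in culprits.most_common():
    print(f"{stage}: dominant stage in {count} of {len(tail)} tail builds")
```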

Environment provisioning time directly measures platform engineering effectiveness. When provisioning takes hours, engineers route around it: environments running indefinitely, shared between colleagues, proper testing skipped. You’ll never catch them doing it on a dashboard. Measure provisioning time and set quarterly reduction targets.
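
The instrumentation itself is just a timestamp diff. A sketch, using the 15-minute self-serve threshold from the key takeaways (the event records and field names are hypothetical):

```python
from datetime import datetime

# Hypothetical provisioning events: request submitted vs environment ready.
events = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 12, 30)),
    (datetime(2024, 3, 2, 10, 0), datetime(2024, 3, 2, 10, 12)),
]

TARGET_MINUTES = 15  # self-serve threshold before engineers route around the platform
for requested, ready in events:
    minutes = (ready - requested).total_seconds() / 60
    status = "ok" if minutes <= TARGET_MINUTES else "over target"
    print(f"{minutes:.0f} min  {status}")
```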

Onboarding time to first meaningful PR catches everything else. Documentation quality, tooling accessibility, environment friction - all compressed into one number. Here’s what people miss: improvements to onboarding benefit everyone, not just new hires. The same friction that slows a new hire slows a returning engineer after leave, a contractor, anyone with a fresh laptop. If “cloned the repo” to “first green build” takes 20 minutes instead of 2 days, every engineer benefits. Strong CI/CD practices are often where the largest gains come from.
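
A sketch of the metric, assuming you can pair start dates with first merged non-trivial PRs (the records are hypothetical; deciding what counts as “meaningful” stays a human call):

```python
from datetime import date

# Hypothetical onboarding records: start date vs first merged, non-trivial PR.
new_hires = [
    {"name": "hire_a", "started": date(2024, 1, 8), "first_pr": date(2024, 1, 11)},
    {"name": "hire_b", "started": date(2024, 2, 5), "first_pr": date(2024, 2, 26)},
]

for h in new_hires:
    days = (h["first_pr"] - h["started"]).days
    print(f"{h['name']}: {days} days from start to first meaningful PR")
```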

[Figure: CI P95: Three Root Causes for Slow Pipelines. A 22-minute P95 (target: 10) breaks down into Docker build (40%: no layer caching, full rebuild every time), E2E tests (35%: 300-test Selenium suite, 80% flaky-free), and npm install (25%: no dependency cache, cold install every run). Fixing Docker layer caching plus a dependency cache drops P95 from 22 minutes to 8, with no test changes.]

Surveys: The Missing Signal

Pipeline metrics and toil ratios capture what’s measurable. Whether your engineers feel productive, frustrated, or actively updating their LinkedIn requires a different instrument entirely. You have to ask.

Prerequisites
  1. Survey takes under 2 minutes to complete (5 questions maximum)
  2. Results are shared with the team within 1 week of closing
  3. At least one concrete action is taken on results before the next survey
  4. Survey cadence is monthly, not quarterly or annual
  5. Each survey focuses on a single theme (tooling, deployment, on-call, docs)

A 5-question monthly pulse beats a 50-question annual survey that nobody remembers by the time someone analyzes the results three months later. Keep surveys focused. One theme per month: tooling satisfaction, deployment confidence, documentation quality, on-call burden. Use consistent 1-5 scoring scales so you can track trends.
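
A minimal sketch of the trend bookkeeping, assuming the consistent 1-5 scale and one-theme-per-month cadence described above (months, themes, and scores are made up):

```python
from statistics import mean

# Hypothetical pulse results: one theme per month, consistent 1-5 scale.
pulses = {
    "2024-01 tooling": [3, 4, 2, 3, 3, 4],
    "2024-02 deployment": [2, 2, 3, 2, 1, 2],
    "2024-04 tooling": [4, 4, 3, 4, 5, 4],  # same theme as January: a comparable trend point
}

for month_theme, scores in pulses.items():
    # Track the mean for the trend and the response count as the trust canary.
    print(f"{month_theme}: mean {mean(scores):.1f}, {len(scores)} responses")
```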

Response rates are the canary. When they drop, it means engineers have concluded that feedback leads nowhere. Surveys without visible follow-through see participation collapse within two cycles, and recovering that trust is harder than building it. At that point, running the survey is worse than not running it: every ignored cycle confirms to engineers that leadership asks for feedback and discards it.

The most actionable survey question in practice: “What is the single most frustrating thing about your development workflow right now?” Open-ended, specific, and it generates a rank-ordered list of improvements that engineers will actually notice when fixed.
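
A sketch of turning those free-text answers into the rank-ordered list, assuming someone first tags each response with a friction category (the tags below are hypothetical; the tagging takes human judgment, the ranking does not):

```python
from collections import Counter

# Hypothetical free-text answers, each hand-tagged with a friction category.
tagged = ["ci wait", "env setup", "ci wait", "flaky tests", "env setup", "ci wait"]

for category, votes in Counter(tagged).most_common():
    print(f"{votes} x {category}")
```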

The DORA-Experience Gap: the disconnect between process metrics that look excellent on a dashboard and the daily friction engineers actually experience. Closing this gap requires measuring what engineers feel (surveys, toil ratios, environment wait times) alongside what the process produces (deploy frequency, MTTR).

What the Industry Gets Wrong About Developer Experience

“DORA metrics measure developer experience.” DORA measures process throughput and stability. A team deploying 15 times daily with a 2% failure rate has elite DORA numbers and potentially terrible developer experience if each deploy requires manual babysitting, the CI pipeline takes 45 minutes, and the on-call rotation is brutal.

“Improve metrics, improve experience.” Goodhart’s Law hits DevEx hard. Optimize deploy frequency by removing QA gates and you get more deploys and more production incidents. Optimize lead time by skipping code review and you get faster merges and worse code quality. Every metric can be gamed in ways that make the experience worse.

Our take: measure toil ratio quarterly. The percentage of engineering time spent on manual, repetitive, automatable work that produces no lasting value. Above 30%, engineers are burning out regardless of what DORA says. Below 15%, the team has enough capacity for innovation. This single metric explains more about team health than all four DORA metrics combined.

That engineering org from the opening, the one with 15 deploys per day and elite DORA numbers? Toil tracking exposed the cost of the removed QA gate. Surveys surfaced the on-call burden nobody wanted to quantify. Incident volume dropped because the team finally saw what was causing the incidents. Same org. Same DORA numbers. Completely different understanding of what those numbers were hiding.

Measure the Friction Your Engineers Actually Feel

DORA metrics tell you deployment frequency. They don’t tell you why engineers are frustrated or why onboarding takes three weeks. DevEx measurement that captures real friction, tracks toil ratios, and surveys developer satisfaction turns metrics into actual improvements.

Frequently Asked Questions

What are the four DORA metrics and what do they actually measure?

The four DORA metrics are Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Time to Restore Service. Elite performers deploy on demand with lead times under one hour, change failure rates below 5%, and restoration under one hour. They measure process outputs, not the experience of the people producing them. Supplementing DORA with friction-focused metrics like toil ratio and P95 CI time captures what engineers actually feel.

What is engineering toil and how do you measure it?

Toil is manual, repetitive, automatable work that produces no lasting value: deploying by hand, provisioning environments, responding to routine alerts. Google’s SRE guidance targets keeping toil below 50% of engineering time, trending toward 30% for healthy teams. Measure toil by having engineers categorize their actual time for 2-3 weeks using interrupt tracking. Teams consistently find 20% of toil categories account for 80% of total toil hours.

Why is P95 CI time more informative than average CI time?

Average CI time is dominated by fast, successful runs. P95 captures what slow builds actually feel like. A pipeline with P50 of 8 minutes and P95 of 45 minutes is experienced very differently from one with P50 of 10 minutes and P95 of 12 minutes, even if averages are similar. Engineers make workflow decisions based on the slow cases: whether to context switch, whether to batch commits. P95 drives those decisions.

What is the right way to run developer satisfaction surveys?

A 5-question monthly pulse survey produces higher response rates and more timely signal than a 50-question annual survey. Keep surveys short and show engineers concrete action taken on results. Focus each survey on a specific theme rather than covering everything. Surveys without visible follow-through see response rates collapse within two cycles, and recovering that trust is harder than building it.

What is onboarding time to first meaningful PR and why does it matter?

Time to first meaningful PR measures how long it takes a new team member to go from ‘cloned the repo’ to ‘merged a real change.’ Strong platform teams get new hires merging within days. Teams without structured onboarding lose weeks to environment setup alone. This metric reflects friction everyone experiences, not just new hires. It is one of the most direct measures of platform engineering investment ROI.