User Research for Product Engineering Teams
You check your product analytics dashboard and everything looks healthy. DAU is stable. Page views are up. The feature your team shipped last sprint has “adoption” because people are visiting the page. You report the numbers in the sprint retro. Everyone feels good.
Then you watch a session replay. Users are visiting the page, clicking the same button three times, staring at the screen, and leaving to accomplish the task in a spreadsheet instead. Your “adoption” metric was measuring frustration, not success.
This is the gap that catches most product engineering teams. The quantitative data says one thing. The qualitative reality is completely different. And closing that gap is not a design team responsibility. It is an engineering infrastructure problem.
Beyond Page Views: Metrics That Actually Signal User Intent
Standard web analytics track what happened (page loaded, link clicked, form submitted) without capturing whether the user accomplished their goal. The metrics that actually reveal intent live one layer deeper.
Rage clicks are 3+ clicks on the same element within 1-2 seconds. The user clicked something that looked interactive but was not, or clicked a button that gave no feedback. PostHog and FullStory detect these natively. A typical product surfaces 15-25 rage click hotspots per quarter that standard click tracking completely misses. These are not edge cases. These are your users screaming at the screen.
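A rage-click detector is essentially a sliding window over per-element click timestamps. The sketch below illustrates the idea; the event-tuple shape and the exact thresholds are assumptions for illustration, not any vendor's API:

```python
from collections import defaultdict

def detect_rage_clicks(clicks, threshold=3, window_s=2.0):
    """Flag elements that received `threshold`+ clicks within `window_s` seconds.

    `clicks` is an iterable of (timestamp_seconds, element_selector) tuples,
    assumed sorted by timestamp.
    """
    by_element = defaultdict(list)
    for ts, element in clicks:
        by_element[element].append(ts)

    flagged = set()
    for element, times in by_element.items():
        # Slide a window over this element's click timestamps.
        start = 0
        for end in range(len(times)):
            while times[end] - times[start] > window_s:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(element)
                break
    return flagged

# Three clicks on #save within 0.9s trips the detector; one click does not.
detect_rage_clicks([(0.0, "#save"), (0.4, "#save"), (0.9, "#save"), (5.0, "#cancel")])
```

Aggregating the flagged elements over a day of sessions gives you the hotspot list the vendors surface in their dashboards.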
Dead clicks are clicks on non-interactive elements. Users clicking on static text, images, or whitespace are trying to navigate or interact with something that does not respond. High dead click rates on a specific element mean one of two things: it looks clickable when it is not, or there is a feature gap where users expect functionality that does not exist.
Scroll depth answers whether anyone reads past the fold. If 70% of users never scroll past 25% of your configuration page, the settings below that point might as well not exist. Combine scroll depth with task completion to distinguish between pages that are too long and pages where users find what they need quickly.
Task completion rate is the metric that ties everything together. Define the specific task a page or flow is designed to support (e.g., “create a new project,” “configure an alert rule,” “export a report”). Measure the percentage of users who start the task and complete it. This single metric is more revealing than any combination of page views, session duration, and bounce rate.
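Computing task completion from an event stream is a two-set intersection: users who fired the start event versus users who fired the completion event. A minimal sketch (the event names and tuple shape are hypothetical):

```python
def task_completion_rate(events, start_event, complete_event):
    """Fraction of users who completed the task among those who started it.

    `events` is an iterable of (user_id, event_name) tuples.
    """
    started, completed = set(), set()
    for user_id, name in events:
        if name == start_event:
            started.add(user_id)
        elif name == complete_event:
            completed.add(user_id)
    if not started:
        return 0.0
    # Only count completions from users who actually started the task.
    return len(started & completed) / len(started)
```

The same two custom events feed a funnel chart in any analytics tool; the point is that both events must be explicitly instrumented, because page views alone cannot distinguish starting from finishing.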
Building A/B Testing Infrastructure That Doesn’t Lie
Most A/B testing setups produce unreliable results because they violate statistical assumptions the team does not realize they are making. Three problems recur constantly, and most teams are making at least one of these mistakes right now.
Underpowered tests. For a 5% minimum detectable effect (MDE) with standard 80% power and 95% confidence, you need approximately 3,200 users per variant. For a 2% MDE, that jumps to roughly 20,000 per variant. Most teams run tests with a few hundred users and declare results after three days. That is not testing. That is astrology with better tooling.
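The arithmetic behind these sample sizes is the textbook two-proportion power calculation. A rough sketch using the normal approximation (exact numbers depend on the baseline conversion rate you assume, so treat the output as an estimate):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion z-test.

    `baseline` is the control conversion rate; `mde` is the absolute
    minimum detectable effect (0.02 means a 2-point lift).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p1, p2 = baseline, baseline + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / mde ** 2
    return math.ceil(n)
```

Run this before the test starts, commit to the number, and do not call the result early. Halving the MDE roughly quadruples the required sample, which is why tight effect sizes are so expensive to detect.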
Peeking. Checking results daily and stopping the test when the p-value dips below 0.05 inflates false positive rates from the intended 5% to 20-30%. Sequential testing frameworks (like Optimizely’s Stats Engine or custom implementations using always-valid confidence intervals) account for continuous monitoring. If your testing framework does not support sequential analysis, commit to the pre-calculated sample size and do not peek. Seriously. Do not peek.
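The inflation from peeking is easy to demonstrate with a simulation: run A/A tests where no real difference exists, peek once per day, and count how often the test "wins" anyway. A sketch with illustrative parameters (the z-test and the traffic numbers are assumptions, not a production framework):

```python
import random
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

def peeking_false_positive_rate(n_sims=1000, days=10, users_per_day=100,
                                p=0.10, seed=7):
    """Simulate A/A tests (no real effect), stopping at the first p < 0.05."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(days):
            for _ in range(users_per_day):
                n_a += 1
                conv_a += rng.random() < p
                n_b += 1
                conv_b += rng.random() < p
            # The daily peek: declare a winner the moment p dips below 0.05.
            if p_value(conv_a, n_a, conv_b, n_b) < 0.05:
                false_positives += 1
                break
    return false_positives / n_sims
```

With ten daily peeks, the false positive rate lands well above the nominal 5%, even though both variants are identical. That gap is the entire argument for sequential methods or pre-committed sample sizes.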
Novelty effects. Any UI change produces an initial lift from curiosity. Users explore the new treatment because it is different, not because it is better. This effect typically inflates engagement metrics by 10-30% in the first week and decays over 2-3 weeks. Run tests for minimum two full business cycles (typically 2-4 weeks) to let the novelty wash out.
Guardrail metrics prevent you from optimizing one metric at the expense of everything else. If your test improves click-through rate but degrades load time by 200ms, you need to catch that before rollout. Define 3-5 guardrail metrics (error rate, latency P95, revenue per session, support ticket rate) and automatically flag any test where a guardrail degrades beyond a threshold. Amplitude Experiment and Eppo both support guardrail configuration natively.
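A guardrail check reduces to comparing relative change against a per-metric threshold. The sketch below assumes a simple dict-based config (the metric names and threshold shape are illustrative, not Amplitude's or Eppo's schema):

```python
def check_guardrails(control, treatment, thresholds):
    """Return the guardrail metrics that degraded beyond their threshold.

    `control` / `treatment` map metric name -> observed value.
    `thresholds` maps metric name -> (direction, max_relative_change),
    where direction is "increase_bad" or "decrease_bad".
    """
    violations = []
    for metric, (direction, max_change) in thresholds.items():
        base, treat = control[metric], treatment[metric]
        if base == 0:
            continue  # relative change is undefined on a zero baseline
        change = (treat - base) / base
        if direction == "increase_bad" and change > max_change:
            violations.append(metric)
        elif direction == "decrease_bad" and -change > max_change:
            violations.append(metric)
    return violations
```

Wire this into the rollout pipeline so a violated guardrail blocks the ship decision automatically rather than relying on someone noticing a dashboard.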
Session Replay Architecture at Scale
Session replay records DOM mutations and user interactions, then reconstructs the session as a video-like playback in the browser. It is the closest thing to watching over a user’s shoulder without actually standing behind them. The architectural challenge is volume.
A medium-traffic application generating 100,000 sessions per day at 5-10 minutes average session length produces 50-100 GB of raw replay data daily. FullStory, LogRocket, and PostHog each handle this differently, but the core architecture is the same: a lightweight client-side recorder captures DOM snapshots and incremental mutations, compresses them, and ships them to a backend that indexes the data for search and playback.
The key engineering decisions are sampling rate and privacy masking. Recording 100% of sessions is expensive and unnecessary. A 10-20% sample rate covers most debugging and research needs. For specific user segments (enterprise accounts, users on critical flows, users who triggered errors), crank the sample to 100%.
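A common way to implement this is deterministic hash-based sampling: hash the user id into [0, 1) and record if it falls under the sample rate, with segment overrides forcing 100%. In production this decision lives in the client-side recorder (JavaScript); the Python below is only a sketch of the logic, and the segment names are hypothetical:

```python
import hashlib

def should_record(user_id, base_rate=0.1,
                  always_record_segments=(), user_segments=()):
    """Decide whether to record this user's sessions.

    Hashing the user id gives a stable decision: the same user is always
    in or out of the sample, so a user's history is never half-recorded.
    """
    # Segment overrides (e.g. enterprise accounts) always record.
    if any(seg in always_record_segments for seg in user_segments):
        return True
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return bucket < base_rate
```

Hashing rather than rolling a random number per session matters: it keeps sampling consistent per user, which is what lets you reconstruct a full journey for anyone who is in the sample.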
Privacy masking must happen client-side, before data leaves the browser. Mask all form inputs by default. Explicitly allowlist fields that are safe to record (search queries, dropdown selections). Never record password fields, payment information, or health data. GDPR and CCPA compliance depends on this being correct in the recording layer, not retroactively scrubbed server-side. Get this wrong and you have a compliance incident, not a product analytics bug.
For research workflows, integrate session replay with your analytics pipeline. When a user hits a rage click or abandons a critical flow, automatically bookmark that session with metadata (page, element, timestamp). Researchers and engineers can then filter replays by behavior pattern rather than watching sessions randomly.
The Quantitative + Qualitative Loop
Quantitative data tells you what is happening. Qualitative research tells you why. Neither alone is sufficient, and most engineering teams only do the quantitative half. That is like diagnosing a production outage by staring at metrics without ever reading the logs.
The loop works like this: analytics surfaces a pattern (e.g., 40% drop-off on step 3 of onboarding). Session replays show what users are doing at that step (e.g., scrolling past the CTA, trying to skip the step). Usability interviews reveal why (e.g., the step asks for information users do not have yet, and there is no way to save and return). The fix addresses the root cause, and analytics confirms the improvement.
Without the qualitative leg, teams build fixes based on guesses. “Users are dropping off at step 3, so let’s make the button bigger.” That is a solution to a problem nobody verified. The actual issue (users not having the required information at that moment) requires a completely different solution: allowing partial completion and returning later. Making the button bigger would have wasted a sprint and changed nothing.
Usability Testing Without a Dedicated Researcher
Jakob Nielsen’s finding still holds: five users testing a specific task flow uncover roughly 80% of usability issues. You do not need a UX research lab. You do not need a full-time researcher. You need a structured protocol and the discipline to observe without defending your design choices.
A usability test that engineering teams can run:
- Define 3-5 task scenarios that map to real user goals (not “click the settings button” but “change your notification preferences so you only get emails about critical alerts”).
- Recruit 5 participants from your target user base. Existing users who signed up in the last 30 days work well because they’re past initial confusion but haven’t developed workarounds.
- One moderator, one observer. The moderator reads the task and asks the participant to think aloud. The observer takes notes. The person who built the feature should observe, not moderate. Moderating your own feature creates unconscious steering.
- Record the session with screen + audio (Loom, Lookback, or Zoom screen share). Review later for patterns across participants.
- Debrief within 24 hours while observations are fresh. List the top 5 issues by severity (task failure vs. confusion vs. cosmetic).
The critical rule: do not help. When a user struggles, every fiber of the moderator’s being wants to point them in the right direction. Resist. The struggle is the data. If three out of five users cannot find the export button, that is the finding. Your discomfort watching them struggle is the price of learning the truth about your product.
Jobs-to-be-Done for Technical Products
Feature requests from technical users are notoriously specific and solution-oriented. “Add a CSV export for the audit log.” “Support regex in the search bar.” “Let me pin dashboards.” These are solutions, not problems. Building them without understanding the underlying job leads to features that technically satisfy the request but miss the actual need entirely.
The JTBD interview framework peels back the request to the real job. The engineer requesting CSV export has a job: “Prepare compliance evidence for the quarterly audit.” That job might be better served by a scheduled automated report, a direct integration with the compliance tool, or a pre-built audit evidence package. The CSV export is the lowest-leverage solution to the highest-value problem. But if you never ask why, you will build the CSV export and call it done.
JTBD interviews follow a specific structure. Start with the last time the user performed the task. Walk through the timeline chronologically. Ask what they did, what tools they used, what was frustrating, and what workarounds they created. The workarounds are gold. Every workaround is a feature request your users never filed, expressed as behavior instead of words.
Product Analytics Stack
The tooling landscape for product analytics has consolidated around a few architectural patterns. The right stack depends on your data maturity and team size, but the wrong stack creates data silos that take months to untangle.
Event collection: Segment (or its open-source equivalent, RudderStack) acts as the event router. Instrument once, send to multiple destinations. This decouples your instrumentation code from your analytics vendor. When you switch from Mixpanel to Amplitude, you change a destination config, not 200 event calls across your codebase.
Product analytics: Amplitude and PostHog are the two strong options for most teams. Amplitude has deeper behavioral analysis features (funnel analysis, cohort comparison, predictive analytics). PostHog is open-source, self-hostable, and bundles session replay, feature flags, and A/B testing in one platform. For teams under 50 engineers, PostHog’s all-in-one approach reduces integration overhead significantly.
Custom events beyond the defaults: Standard analytics tracks page views and clicks automatically. The events that matter for product decisions are custom: project_created, alert_configured, export_completed, onboarding_step_3_abandoned. Define a tracking plan document that lists every custom event, its properties, and its trigger condition. Treat the tracking plan like a schema. Review changes in PRs. Breaking changes to event names or properties silently break dashboards and reports downstream, and nobody notices until someone asks why the conversion funnel chart is empty.
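Treating the tracking plan like a schema can be made literal: encode it as data and validate every event against it before it ships. A minimal sketch (the plan structure and property types here are assumptions; the event names come from the examples above):

```python
# Hypothetical tracking plan: event name -> required properties and types.
TRACKING_PLAN = {
    "project_created": {"project_id": str, "template": str},
    "alert_configured": {"alert_id": str, "threshold": float},
}

def validate_event(name, properties):
    """Return a list of problems; an empty list means the event is valid."""
    if name not in TRACKING_PLAN:
        return [f"unknown event: {name}"]
    schema = TRACKING_PLAN[name]
    problems = []
    for prop, expected_type in schema.items():
        if prop not in properties:
            problems.append(f"missing property: {prop}")
        elif not isinstance(properties[prop], expected_type):
            problems.append(f"wrong type for {prop}")
    for prop in properties:
        if prop not in schema:
            problems.append(f"undeclared property: {prop}")
    return problems
```

Run this in CI or at the event-router layer and a renamed property becomes a failing check instead of a silently empty funnel chart three weeks later.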
Connecting this analytics infrastructure to user experience design practices closes the loop between measurement and iteration. Insight without action is just reporting. The analytics stack should feed directly into the prioritization process, providing evidence for what to build next.
Continuous Discovery for Engineering-Led Teams
Teresa Torres’ continuous discovery framework adapts well to engineering teams that do not have a product manager embedded full-time. The core habit is deceptively simple: talk to at least one user every week. Not a formal research study. Not a survey. A 20-minute conversation with someone who used the product recently.
Structure the conversation around three questions: What were you trying to accomplish? What did you try? What happened? These map directly to JTBD and surface the friction points that quantitative data alone can’t explain.
The output feeds an opportunity solution tree. Opportunities are user problems or unmet needs. Solutions are potential features or changes. Map solutions to opportunities, and prioritize opportunities by reach (how many users have this problem) and severity (how much it blocks their goal). This prevents the common failure mode of building the loudest request instead of the most impactful fix.
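The prioritization step above can be reduced to a simple score. A sketch, assuming reach and severity are estimated on a 1-5 scale (the scales and scoring rule are illustrative, not part of Torres' framework):

```python
def prioritize(opportunities):
    """Rank opportunities by reach x severity, highest impact first.

    `opportunities` is a list of (name, reach, severity) tuples,
    with reach and severity each scored 1-5.
    """
    return sorted(opportunities, key=lambda o: o[1] * o[2], reverse=True)

# A narrow but blocking problem can outrank a widespread cosmetic one.
prioritize([("slow export", 5, 2), ("cannot resume onboarding", 3, 5)])
```

The score is crude on purpose: its job is to force the reach and severity conversation, not to replace it.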
Combining behavioral data from your product analytics stack with qualitative patterns from discovery interviews creates a feedback loop that compounds. Each week's conversations are informed by the previous week's data. Each week's data is interpreted through the lens of user conversations. Teams running this loop consistently report that they ship fewer features, but those features see 2-3x higher adoption rates than features shipped from roadmap commitments alone.
The insight from accessibility-focused UX engineering is especially relevant here. Users with accessibility needs are often the most articulate about friction because every unnecessary step or unclear interaction is amplified. Including these users in discovery conversations surfaces issues that improve the experience for everyone, not just the accessibility-specific audience.
The pattern across all of these practices is the same: replace assumptions with evidence, and build the infrastructure to make evidence collection continuous rather than episodic. Product teams that instrument properly, test rigorously, and talk to users regularly do not ship fewer features. They ship fewer features that nobody uses. That is the difference between a team that builds and a team that builds the right things. And it makes the design system components and the engineering time behind them dramatically more valuable.