Frontend Error Tracking: Session Replay and RUM
You ship a new release and check your backend dashboards. Error rates are flat and latency percentiles look completely normal, so you close your laptop assuming the deployment was a success.
An hour later, customer support lights up. Users complain the page won’t load, buttons don’t work, and everything turns blank after logging in. You pull up the server logs, but every single request returns a 200 OK. The backend served the HTML, the JavaScript, and the API responses correctly. The failure actually happened after the response left your infrastructure and entered the user’s browser, leaving your monitoring stack completely blind.
Server-side telemetry tells you everything is fine while the client-side experience silently fails.
- Backend returned 200 OK, but the bug lives in the browser. Server-side observability has a visibility ceiling: it stops at the HTTP response boundary.
- Source maps in production are non-negotiable. Minified stack traces (a.js:1:4523) are practically useless without a way to map them back to the original code. Upload maps to your error tracking service and keep them out of public access.
- Sample errors intelligently, not exhaustively. High-traffic apps generate millions of error events. Record 100% of error sessions and sample roughly 5-10% of everything else.
- Session replay turns vague frontend complaints into a reproducible bug. Record DOM mutations, not video. Redact PII automatically at capture time.
- Connect frontend errors to backend traces. One trace ID directly links a user’s blank page to the exact backend API call that failed.
Your Prometheus metrics, Grafana dashboards, and distributed traces saw nothing unusual. The W3C Performance Timeline defines the specific browser APIs that make frontend instrumentation possible. Without them, you’re trying to debug an incident without any client-side context.
The Visibility Ceiling
A 200 response tells you nothing about React hydration crashes, third-party scripts blocking the main thread for 800ms, or ad blockers aggressively removing DOM elements your click handlers depend upon. The browser is a hostile environment filled with extensions you didn’t install, hardware you didn’t pick, and networks you don’t control. You deploy to a predictable server, but your code runs in thousands of completely unpredictable client environments.
RUM vs Synthetic: You Need Both
Real User Monitoring collects performance data from real user sessions across every device, network, and location. RUM might show that your P75 LCP sits at 2.1s while the P95 reaches 6.4s, heavily clustered in Southeast Asia on Android Chrome. That is exact data from the field; synthetic monitoring running from a single datacenter would never surface that geographic pattern.
Synthetic monitoring runs scripted browser tests from controlled infrastructure to catch regressions between releases. Lighthouse CI in your pipeline flags a 500ms LCP increase before it reaches users. It acts as an automated safety check against performance degradation.
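A minimal Lighthouse CI configuration sketch for that pipeline gate; the URLs and budgets here are illustrative assumptions, and the audit IDs follow Lighthouse's naming:

// lighthouserc.js - a sketch of a CI performance gate using Lighthouse CI
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/', 'http://localhost:3000/checkout'],
      numberOfRuns: 3, // median of several runs smooths out run-to-run noise
    },
    assert: {
      assertions: {
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'total-blocking-time': ['warn', { maxNumericValue: 300 }],
      },
    },
  },
};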
| | Synthetic Monitoring | Real User Monitoring (RUM) |
|---|---|---|
| Data source | Scripted browser tests from controlled infra | Actual user sessions, every page load |
| When it runs | CI/CD pipeline, scheduled cron | Continuous in production |
| Catches | Regressions between releases | Real-world performance issues |
| Misses | Device/network/geography variation | Pre-deploy regressions |
| Best for | Regression detection, baseline comparison | Understanding actual user experience |
| Use alone? | Optimizes for a lab that doesn’t match reality | Catches problems only after users suffer |
Session Replay Architecture
The standard approach uses mutation observers (rrweb). Instead of recording video, you take an initial DOM snapshot and record every subsequent mutation, scroll, and click as incremental events. This runs roughly 50-200KB compressed per minute versus 500KB-2MB for actual video, and the result is searchable by DOM state: when a user reports a broken UI element, you can restore the exact DOM from the moment they clicked. No guesswork, and no debating whether the feature works on a developer’s machine.
Sampling strategy matters. Record 100% of sessions where an error occurs and 100% of sessions where the user contacts support; sample roughly 5-10% of everything else. Storage scales linearly with session count. Get sampling right and session replay is a high-value debugging tool. Get it wrong and you pay to store mountains of recordings nobody will ever watch.
Privacy is not optional. Mask all <input> elements by default. For GDPR compliance, masking has to happen on the client at recording time, not during playback: if unmasked personal data reaches your ingestion servers, the processing has already happened as far as GDPR is concerned. Text masking keeps the layout intact while replacing sensitive content with placeholders. For most debugging, the sequence of clicks matters far more than the text the user typed.
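A minimal sketch of this capture setup, assuming the rrweb library; the session-sampling decision, the batching logic, and the /api/replay endpoint are illustrative additions rather than rrweb features:

// Session replay capture with client-side masking and illustrative sampling
import { record } from 'rrweb';

const sampled = Math.random() < 0.1; // keep roughly 10% of ordinary sessions
let forceRecord = false;             // flip to true from your error handling code
let events = [];

record({
  emit(event) {
    events.push(event);
    // Ship batches only for sampled or errored sessions; a real setup would also
    // cap this buffer and flush on page unload.
    if ((sampled || forceRecord) && events.length >= 50) {
      navigator.sendBeacon('/api/replay', JSON.stringify(events));
      events = [];
    }
  },
  maskAllInputs: true,    // redact input values at capture time, before ingestion
  blockClass: 'rr-block', // elements with this class are never serialized
});

// Call this when an error is captured to keep 100% of error sessions
export function markSessionErrored() {
  forceRecord = true;
}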
Source Map Management
- Build pipeline generates source maps alongside production bundles
- Source maps upload to your error tracking service as a required CI step
- Source maps are cleanly stripped from the deployment artifact before reaching the CDN
- Each upload is tagged with the git commit SHA for exact code version correlation
- Source map retention policy covers at least 90 days
Generate source maps during your build, upload them to your error tracking service, and strip them from the deployment artifact. Tagging each upload with the git commit SHA ensures every error links to the exact source code version that produced it. Never serve source maps publicly; they are effectively a readable copy of your application source.
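A build-side sketch of the generate-but-don’t-publish half of that flow, assuming webpack; hidden-source-map emits .map files without referencing them from the bundle, and the upload and strip steps appear as CI pseudocode because the exact commands depend on your error tracking service:

// webpack.prod.js - emit source maps without exposing them publicly
module.exports = {
  mode: 'production',
  // 'hidden-source-map' writes .map files but omits the sourceMappingURL
  // comment, so the shipped bundle never points browsers at the maps.
  devtool: 'hidden-source-map',
  output: {
    filename: '[name].[contenthash].js',
    path: `${__dirname}/dist`,
  },
};

// In CI, after the build (pseudocode; substitute your tracker's CLI):
//   RELEASE=$(git rev-parse HEAD)
//   upload dist/**/*.map to the error tracker, tagged with $RELEASE
//   delete dist/**/*.map before publishing dist/ to the CDN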
Teams typically break this setup in the same three ways. Maps fall out of sync with deployed code when an urgent hotfix skips the upload step. Retention policies expire before anyone investigates an older issue. Tags mismatch because the release identifier format changed between deploys. The counter to all three is to make the source map upload a required CI check rather than an optional post-deploy step: if the upload fails, the deploy fails with it.
Error Grouping and Noise Filtering
Default grouping logic buckets errors by exception type and top stack frame. This breaks when the same root cause generates different stack traces in the wild: a null reference that surfaces in three separate React components, all caused by one missing API field, shows up as three separate incidents when it is really a single bug.
Custom fingerprinting resolves this by grouping events by message pattern, failing network endpoint, or custom error tags, collapsing 300 noisy events into a single actionable issue with an accurate occurrence count.
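A sketch of what a hand-rolled fingerprint rule might look like; the patterns, group names, and fingerprintFor helper are illustrative, and hosted tools such as Sentry expose their own hooks (beforeSend, fingerprint rules) for the same idea:

// Illustrative fingerprint rules: collapse variants of one root cause into one group
const FINGERPRINT_RULES = [
  { pattern: /Cannot read propert(y|ies) of (null|undefined)/i, group: 'missing-api-field' },
  { pattern: /Failed to fetch.*\/api\/checkout/i, group: 'checkout-api-unreachable' },
];

function fingerprintFor(error) {
  const text = `${error?.message ?? ''}\n${error?.stack ?? ''}`;
  const rule = FINGERPRINT_RULES.find((r) => r.pattern.test(text));
  // No match: fall back to the tracker's default grouping (type + top frame)
  return rule ? [rule.group] : undefined;
}

The global error handler below could then pass fingerprint: fingerprintFor(event.error) alongside the rest of its capture payload.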
Don’t: Alert on every unhandled JavaScript exception. Browser extensions inject scripts that throw errors your codebase didn’t cause, ad blockers remove DOM elements your click handlers expect, and bots execute JavaScript in ways no real browser would. Alert on all of that noise and your on-call engineer will mute the channel within a week.
Do: Filter out extension errors by checking stack traces for chrome-extension:// and similar URLs, drop bot traffic using User-Agent analysis, and maintain a known-noise fingerprint list. Once the noise is filtered, alert on rate-based thresholds. An alert that error rates jumped 5x above baseline on the checkout page is actionable; a lone TypeError is baseline noise.
// Global error handler with noise filtering.
// errorTracker, getSessionId, and getActiveTraceId are app- or SDK-provided helpers.
// Example noise patterns; grow this list from your own triage history.
const KNOWN_NOISE_PATTERNS = [
  /ResizeObserver loop limit exceeded/i,
  /^Script error\.?$/i, // opaque errors from cross-origin scripts
];

window.addEventListener('error', (event) => {
  const stack = event.error?.stack || '';

  // Filter browser extension noise
  if (/chrome-extension:|moz-extension:|safari-extension:/.test(stack)) return;

  // Filter known third-party script errors
  if (KNOWN_NOISE_PATTERNS.some((p) => p.test(event.message))) return;

  // Filter bot traffic
  if (/bot|crawl|spider/i.test(navigator.userAgent)) return;

  // Send meaningful errors with context
  errorTracker.capture({
    error: event.error,
    url: window.location.href,
    sessionId: getSessionId(),
    traceId: getActiveTraceId(), // Links to backend distributed trace
  });
});
Core Web Vitals via Field Data
Core Web Vitals measured through RUM provide field data that Lighthouse can only approximate in a lab. Aggregate totals are close to useless here; the value comes from breaking the data down by page path, device hardware class, and geographic or network segment.
LCP (threshold: 2.5s): LCP reflects server response time plus the load time of your largest visible element. Serving the primary LCP image with fetchpriority="high" helps enormously; that single HTML attribute often moves the needle more than a week of broader frontend refactoring.
INP (threshold: 200ms): INP captures roughly the worst interaction latency observed over the session (technically a high percentile of all interactions). Long tasks blocking the main thread are the usual culprit; break them up with scheduler.yield(). A mature frontend UX engineering team treats INP as a primary metric right alongside LCP.
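A sketch of that chunking pattern; scheduler.yield() is currently a Chromium API, so the fallback path and the processRows/renderRow names are assumptions for illustration:

// Break one long task into chunks so pending input can be handled in between
async function processRows(rows, renderRow) {
  for (const row of rows) {
    renderRow(row); // per-item work that would otherwise run as one long task
    if (typeof scheduler !== 'undefined' && typeof scheduler.yield === 'function') {
      await scheduler.yield(); // hand the main thread back to the browser
    } else {
      await new Promise((resolve) => setTimeout(resolve, 0)); // crude fallback
    }
  }
}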
CLS (threshold: 0.1): Layout shifts usually happen when asynchronously loaded images arrive without declared dimensions, dynamically injected content pushes existing text around, or web fonts swap in. To prevent them, use aspect-ratio or explicit width and height on every media element. Reserving space for dynamic content before it resolves keeps the page from rearranging itself, so users click what they intended to click.
| Metric | Threshold | Primary Cause | Quick Fix |
|---|---|---|---|
| LCP | 2.5s | Large unoptimized images, render-blocking resources | fetchpriority="high" on LCP element |
| INP | 200ms | Long tasks blocking main thread | scheduler.yield(), break work into chunks |
| CLS | 0.1 | Images without dimensions, font swap | aspect-ratio on media, size-adjusted font fallbacks |
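A field-collection sketch using the web-vitals library (onLCP, onINP, and onCLS are its standard exports); the /vitals endpoint and the segmentation fields are assumptions chosen to match the breakdowns described above:

// Report Core Web Vitals from real sessions, tagged for segmentation
import { onLCP, onINP, onCLS } from 'web-vitals';

function reportVital(metric) {
  const body = JSON.stringify({
    name: metric.name,    // 'LCP' | 'INP' | 'CLS'
    value: metric.value,
    id: metric.id,        // unique per metric instance, useful for deduplication
    path: location.pathname,
    deviceMemory: navigator.deviceMemory,            // rough hardware class (Chromium only)
    connection: navigator.connection?.effectiveType, // e.g. '4g', where supported
  });
  // sendBeacon survives page unload, which matters because CLS and INP finalize late
  navigator.sendBeacon('/vitals', body);
}

onLCP(reportVital);
onINP(reportVital);
onCLS(reportVital);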
Correlating Frontend and Backend
Without that correlation, the frontend team says users are reporting errors while the backend team points out that all their service metrics look healthy. Both groups are technically correct based on their own dashboards, but neither can find the root cause alone.
The standard solution is to inject a trace ID header into every outgoing fetch request. Tools like Sentry and Datadog RUM do this automatically when their client SDKs detect a matching backend APM setup. Jumping from the frontend error to the backend trace shows exactly which query timed out, which service returned unexpected data, or which middleware rejected the request. You get the full picture without switching contexts.
Implementing trace ID propagation manually
If you’re not using an SDK that handles network correlation automatically, inject the trace ID yourself with a fetch wrapper (a sketch follows the list below):
- Generate a trace ID per page load (UUID v4 works fine)
- Add it as a custom header (x-trace-id) to every outgoing API request
- Include the trace ID in every error event sent to your tracking service
- On the backend, extract the header and propagate it through your span context
- Index both frontend errors and backend trace spans by the same trace ID
- Build a dashboard view that connects a frontend error to its backend trace with a single query
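A sketch of that interceptor, assuming one trace ID per page load and the x-trace-id header from the list above. Production SDKs typically propagate the standardized W3C traceparent header instead, and this simple wrapper assumes callers pass extra headers via the init argument rather than a prebuilt Request object:

// One trace ID per page load, propagated on every outgoing request
const PAGE_TRACE_ID = crypto.randomUUID(); // UUID v4

export function getActiveTraceId() {
  return PAGE_TRACE_ID;
}

const originalFetch = window.fetch.bind(window);
window.fetch = (input, init = {}) => {
  const headers = new Headers(init.headers || {});
  headers.set('x-trace-id', PAGE_TRACE_ID);
  return originalFetch(input, { ...init, headers });
};

The global error handler shown earlier already attaches getActiveTraceId() to each captured error, so the error event and the backend spans end up indexed by the same identifier.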
The implementation mechanics are straightforward. The real challenge is the discipline to keep the header consistent across every API boundary and backend microservice.
What the Industry Gets Wrong About Frontend Observability
“APM adequately covers the frontend.” Application Performance Monitoring focuses on server-side code execution. It has essentially no visibility into JavaScript execution, DOM rendering, browser extension interference, or the gap between when the server sent a response and when the user actually saw the result. Dedicated frontend observability needs different tools, different sampling mechanics, and different alert thresholds.
“Alert broadly on every JavaScript error.” Most unfiltered JavaScript exceptions come from browser extensions, ad blockers, and bot traffic. Paging on that external noise produces alert fatigue, and on-call engineers tune the channel out within days. Filter the baseline noise first, then alert on rate-based thresholds for errors that originate in your own code.
Two capabilities do most of the work: production source maps that turn a.js:1:4523 into an actionable location like CheckoutForm.tsx:47 in handleSubmit, and session replay that shows exactly what happened right before the interface crashed. Everything else you assemble (RUM dashboards, Core Web Vitals tracking, noise filters) builds on those two foundations. A comprehensive DevOps practice shouldn’t stop at the server’s HTTP boundary; extending observability into the browser closes the loop. With source-mapped traces, session replay, and frontend-backend correlation in place, a silent hydration crash surfaces as a high-priority alert before the first customer support ticket arrives. The frontend is the final edge of your distributed system, and bringing it into your observability stack keeps client-side failures from staying invisible.