
Frontend Error Tracking: Session Replay and RUM

Metasphere Engineering · 14 min read

You ship a new release and check your backend dashboards. Error rates are flat and latency percentiles look completely normal, so you close your laptop assuming the deployment was a success.

An hour later, customer support lights up. Users complain the page won’t load, buttons don’t work, and everything turns blank after logging in. You pull up the server logs, but every single request returns a 200 OK. The backend served the HTML, the JavaScript, and the API responses correctly. The failure actually happened after the response left your infrastructure and entered the user’s browser, leaving your monitoring stack completely blind.

Server-side telemetry tells you everything is fine while the client-side experience silently fails.

Key takeaways
  • Backend returned 200 OK. The bug lives in the browser. Server-side observability has a visibility ceiling that stops at the HTTP response boundary.
  • Source maps in production are non-negotiable. Minified stack traces (a.js:1:4523) are practically useless without a way to map them back to the original code. Upload maps to your error tracking service and keep them out of public access.
  • Sample errors intelligently, not exhaustively. High-traffic apps generate millions of error events. Record 100% of error sessions, but sample roughly 5-10% of everything else.
  • Session replay turns vague frontend complaints into a reproducible bug. Record DOM mutations, not video. Redact PII automatically at capture time.
  • Connect frontend errors to backend traces. One trace ID directly links a user’s blank page to the exact backend API call that failed.

Your Prometheus metrics, Grafana dashboards, and distributed traces saw nothing unusual because none of them run where the failure happened. The browser APIs that make frontend instrumentation possible are defined by the W3C Performance Timeline; without instrumentation built on them, you are debugging an incident with no client-side context.

The Visibility Ceiling

The 200 OK Blind Spot: the visibility gap between a successful server response and the actual user experience in the browser. The server returns the HTML, but if the JavaScript crashes during hydration, backend dashboards still show green while the user sees a blank screen. The backend has no visibility into that frontend state.

A 200 response tells you nothing about React hydration crashes, third-party scripts blocking the main thread for 800ms, or ad blockers aggressively removing DOM elements your click handlers depend upon. The browser is a hostile environment filled with extensions you didn’t install, hardware you didn’t pick, and networks you don’t control. You deploy to a predictable server, but your code runs in thousands of completely unpredictable client environments.

RUM vs Synthetic: You Need Both

Real User Monitoring collects performance data from real user sessions across every device, network, and location. RUM might show that your P75 LCP sits at 2.1s while the P95 reaches 6.4s, heavily clustered in Southeast Asia on Android Chrome. This is exact data from the field; synthetic monitoring running from a single fixed datacenter would never surface that geographic pattern.

Synthetic monitoring runs scripted browser tests from controlled infrastructure to catch regressions between releases. Lighthouse CI in your pipeline flags a 500ms LCP increase before it reaches users, acting as an automated guardrail against performance regressions.

|              | Synthetic Monitoring                           | Real User Monitoring (RUM)                 |
| ------------ | ---------------------------------------------- | ------------------------------------------ |
| Data source  | Scripted browser tests from controlled infra   | Actual user sessions, every page load      |
| When it runs | CI/CD pipeline, scheduled cron                 | Continuous in production                   |
| Catches      | Regressions between releases                   | Real-world performance issues              |
| Misses       | Device/network/geography variation             | Pre-deploy regressions                     |
| Best for     | Regression detection, baseline comparison      | Understanding actual user experience       |
| Used alone?  | Optimizes for a lab that doesn't match reality | Catches problems only after users suffer   |
[Diagram: RUM + Synthetic — two lenses, one complete picture. Synthetic monitoring runs in CI/CD before deploy in a consistent environment with no user variance, catching regressions before users see them (pre-deploy: did we break anything?). RUM runs on real user devices in production, capturing device, network, and geography variance and showing the actual experience via P75 Core Web Vitals (post-deploy: how do real users experience it?). Synthetic without RUM is a lab; RUM without synthetic is reactive. You need both.]

Session Replay Architecture

The standard approach uses mutation observers (rrweb). Instead of recording video, you take an initial DOM snapshot and record every subsequent mutation, scroll, and click as an incremental event. This takes roughly 50-200KB compressed per minute versus 500KB-2MB for actual video. Because the recording is searchable by DOM state, when a user reports a broken UI element you can restore the exact DOM from the moment they clicked. There is no guesswork, and no debating whether the feature works on a developer's machine.
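As a minimal sketch of that loop, here is what recording with rrweb looks like; the /replay-events endpoint and flush interval are illustrative, not part of rrweb:

import { record } from 'rrweb';

const events = [];

// record() emits one initial full snapshot, then incremental events
// for every mutation, scroll, and input after it
record({
  emit(event) {
    events.push(event);
  },
  checkoutEveryNms: 5 * 60 * 1000, // take a fresh full snapshot every 5 minutes
});

// Periodically flush buffered events to the replay service
setInterval(() => {
  if (events.length === 0) return;
  navigator.sendBeacon('/replay-events', JSON.stringify(events.splice(0)));
}, 10000);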

[Diagram: Session replay recording flow. A user session triggers an initial DOM snapshot (30-100KB compressed), then a mutation observer captures clicks, DOM changes, and scrolls (roughly 50-200KB recorded per minute). Events pass through a privacy layer (input masking, PII text replacement, network header/body scrubbing, GDPR-mode masking at record time) before reaching the replay service (Sentry Replay, LogRocket, FullStory, Datadog) for session playback, trace-id error correlation with backend spans, and per-cohort RUM metrics, at a 5-10% session sample rate.]

Sampling strategy matters. It makes sense to record 100% of sessions where an error occurs and 100% of sessions where a user contacts support. Sample roughly 5-10% of everything else. Storage scales linearly with your session count. Get your sampling right and session replay becomes a high-value debugging tool. Get it wrong and you end up accumulating massive data storage bills for collecting redundant traces.
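With a hosted tool, that sampling policy usually comes down to two numbers in the SDK config. A sketch using Sentry's replay integration (option names per the v8 browser SDK; older versions differ):

import * as Sentry from '@sentry/browser';

Sentry.init({
  dsn: 'https://examplePublicKey@o0.ingest.sentry.io/0', // placeholder DSN
  integrations: [Sentry.replayIntegration()],
  replaysSessionSampleRate: 0.1,  // record ~10% of ordinary sessions
  replaysOnErrorSampleRate: 1.0,  // record 100% of sessions where an error occurs
});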

Privacy is not optional. Mask all <input> elements by default. For GDPR compliance, masking has to happen on the client at recording time, not at playback: if unmasked personal data reaches your ingestion servers, processing has already occurred under the GDPR. Text masking keeps the layout intact while replacing sensitive content with placeholders. For most debugging, the sequence of clicks matters far more than the text the user typed.
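In rrweb terms, masking is a set of options on the same record() call shown earlier; the selectors below are illustrative:

import { record } from 'rrweb';

record({
  emit(event) { /* ship to the replay service as before */ },
  maskAllInputs: true,                       // every input value masked by default
  maskTextSelector: '[data-private]',        // mask text content of opted-in elements
  blockSelector: 'iframe, .payment-widget',  // drop these subtrees from the recording entirely
});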

Source Map Management

Prerequisites
  1. Build pipeline generates source maps alongside production bundles
  2. Source maps upload to your error tracking service as a required CI step
  3. Source maps are cleanly stripped from the deployment artifact before reaching the CDN
  4. Each upload is tagged with the git commit SHA for exact code version correlation
  5. Source map retention policy covers at least 90 days

Generate source maps during your build phase, upload them to your error tracking service, and strip them from the deployment artifact. Tagging the upload with the git commit SHA ensures that every error links to the specific source code version that generated it. Never serve source maps publicly; they are readable copies of your original application source.
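A sketch of that pipeline for a webpack build, using @sentry/webpack-plugin (v2 option names; other bundlers and vendors have equivalent plugins, and the org/project values are placeholders):

// webpack.config.js
const { sentryWebpackPlugin } = require('@sentry/webpack-plugin');

module.exports = {
  devtool: 'hidden-source-map', // emit .map files without a sourceMappingURL comment
  plugins: [
    sentryWebpackPlugin({
      org: 'your-org',                                // placeholder values
      project: 'frontend',
      authToken: process.env.SENTRY_AUTH_TOKEN,
      release: { name: process.env.GIT_COMMIT_SHA },  // tag the upload with the commit SHA
      sourcemaps: {
        filesToDeleteAfterUpload: ['dist/**/*.map'],  // strip maps from the deploy artifact
      },
    }),
  ],
};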

Teams typically break this setup in the same three ways. Maps fall out of sync with deployed code when an urgent hotfix skips the upload step. Retention policies expire maps before anyone investigates an old issue. Tags mismatch because the string format for release identifiers shifted between deploys. To counter all three, make the source map upload a required CI check rather than an optional post-deploy step. If the upload fails, the deploy should fail with it.

Error Grouping and Noise Filtering

Default grouping logic buckets errors by exception type and top stack frame. This breaks when the same root cause generates different stack traces in the wild: a null reference thrown in three separate React components, all caused by one missing API field, shows up as three separate incidents when it is actually one bug.

Custom fingerprinting resolves this by grouping events by message pattern, failing network endpoint, or custom error tags. This collapses 300 noisy events into a single actionable issue with an accurate occurrence count.
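A sketch of what that looks like with Sentry's beforeSend hook; the regex and fingerprint key are illustrative:

Sentry.init({
  // ...dsn and other options...
  beforeSend(event, hint) {
    const message = hint.originalException?.message ?? '';
    // Collapse every null dereference caused by the missing API field into
    // one issue, regardless of which component's stack trace threw it
    if (/Cannot read propert(y|ies).*of (null|undefined)/.test(message)) {
      event.fingerprint = ['missing-api-field-null-deref'];
    }
    return event;
  },
});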

[Diagram: Error grouping, from noise to signal. 1,000 raw errors per hour are normalized, deduplicated by stack-trace fingerprint, and counted into 12 unique issues ranked by frequency; priority scoring surfaces the 3 user-impacting bugs. Fix those and ignore the rest: 1,000 errors is noise, 3 prioritized issues is a sprint backlog.]
Anti-pattern

Don’t: Alert on every unhandled JavaScript exception. Browser extensions inject scripts that throw errors your codebase didn’t cause, ad blockers remove DOM elements your click handlers expect, and bots execute JavaScript out of order. If you configure hard alerts for all of this noise, your on-call engineer will mute the channel within a week.

Do: Filter out extension errors by checking stack traces for extension:// URLs, drop bot traffic using User-Agent analysis, and maintain a known-noise fingerprint list. Once the noise is filtered, alert on rate-based thresholds, as in the handler below. An alert saying error rates increased 5x above baseline on the checkout page is actionable; a raw TypeError is baseline noise you should drop.

// Global error handler with noise filtering

// Known benign third-party noise (extend as new sources appear)
const KNOWN_NOISE_PATTERNS = [
  /ResizeObserver loop limit exceeded/,  // harmless browser quirk
  /^Script error\.?$/,                   // opaque cross-origin errors
];

window.addEventListener('error', (event) => {
  const stack = event.error?.stack || '';

  // Filter browser extension noise
  if (/chrome-extension:|moz-extension:|safari-extension:/.test(stack)) return;

  // Filter known third-party script errors
  if (KNOWN_NOISE_PATTERNS.some(p => p.test(event.message))) return;

  // Filter bot traffic
  if (/bot|crawl|spider/i.test(navigator.userAgent)) return;

  // Send meaningful errors with context (errorTracker, getSessionId, and
  // getActiveTraceId are app-provided helpers)
  errorTracker.capture({
    error: event.error,
    url: window.location.href,
    sessionId: getSessionId(),
    traceId: getActiveTraceId(),  // Links to backend distributed trace
  });
});

Core Web Vitals via Field Data

Core Web Vitals measured through RUM provide field data that Lighthouse can only approximate in a lab. Site-wide aggregates are close to useless for these metrics; the value comes from breaking the data down by page path, device hardware class, and geographic network region, as in the sketch below.
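A minimal field-collection sketch using Google's web-vitals library (v3+ API); the /rum endpoint and segmentation fields are illustrative:

import { onLCP, onINP, onCLS } from 'web-vitals';

function sendToAnalytics(metric) {
  navigator.sendBeacon('/rum', JSON.stringify({
    name: metric.name,      // 'LCP' | 'INP' | 'CLS'
    value: metric.value,
    rating: metric.rating,  // 'good' | 'needs-improvement' | 'poor'
    page: window.location.pathname,                   // segment by page path
    connection: navigator.connection?.effectiveType,  // network class (Chrome-only API)
  }));
}

onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);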

LCP (threshold: 2.5s): LCP reflects server response time plus the load time of your largest visible element. Serving the primary LCP image with fetchpriority="high" helps immensely; this single HTML attribute often moves the needle more than a week of frontend refactoring.

INP (threshold: 200ms): INP captures the worst interaction latency observed over the page's lifetime. Heavy rendering tasks blocking the main thread are the usual culprit. Break them up with scheduler.yield(), as sketched below. Mature frontend teams treat INP as a primary metric alongside LCP.
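A sketch of chunking a long task. scheduler.yield() is not yet available in every browser, so this falls back to a macrotask; the renderRows name is illustrative:

async function renderRows(rows, renderRow) {
  for (const row of rows) {
    renderRow(row); // one chunk of main-thread work
    if (globalThis.scheduler?.yield) {
      await scheduler.yield();                     // let pending input events run
    } else {
      await new Promise((r) => setTimeout(r, 0));  // fallback: yield via macrotask
    }
  }
}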

CLS (threshold: 0.1): Layout shifts usually trigger when asynchronously loaded images arrive without declared dimensions, dynamically injected content pushes existing text around, or custom web fonts swap unexpectedly. To prevent this, use aspect-ratio or explicit width and height attributes on every media element. Reserving space for dynamic content before it resolves keeps the page from rearranging itself, ensuring users click exactly what they intended to click.

| Metric | Threshold | Primary Cause                                       | Quick Fix                                  |
| ------ | --------- | --------------------------------------------------- | ------------------------------------------ |
| LCP    | 2.5s      | Large unoptimized images, render-blocking resources | fetchpriority="high" on LCP element        |
| INP    | 200ms     | Long tasks blocking main thread                     | scheduler.yield(), break work into chunks  |
| CLS    | 0.1       | Images without dimensions, font swap                | aspect-ratio on media, font-display: swap  |

Correlating Frontend and Backend

Without trace correlation, the frontend team says users are reporting errors while the backend team points out that all their service metrics look healthy. Both are correct according to their own dashboards, but neither can find the root cause alone.

[Diagram: Frontend-to-backend trace correlation. A browser error (user clicks a button, JS error thrown) carries a trace ID injected as a traceparent header on every fetch/XHR; the same trace_id propagates through all backend services, so one search links the click in the browser to the API call, DB query, and error. Without trace correlation, frontend and backend errors are two separate mysteries.]

The standard solution is to inject a trace ID header into every outgoing fetch request. Observability tools like Sentry and Datadog RUM do this automatically when their client SDKs detect a matching backend APM configuration. Moving from the frontend error directly to the backend trace shows exactly which query timed out, which service returned unexpected data, or which middleware rejected the request. It gives you the full picture without forcing you to switch contexts.

Implementing trace ID propagation manually

If you’re not using an SDK that handles network correlation automatically, inject the trace ID yourself with a fetch interceptor (see the sketch after this list):

  1. Generate a unique trace ID per page load (UUID v4 works fine)
  2. Add it as a custom header (x-trace-id) to every outgoing API request
  3. Include the trace ID in every error event sent to your tracking service
  4. On the backend, extract the header and propagate it through your span context
  5. Index both frontend errors and backend trace spans by the same trace ID
  6. Build a dashboard view that connects the frontend error to the backend trace with a single query
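A sketch of steps 1-3 on the frontend; the x-trace-id header matches the list above, and getActiveTraceId is the hypothetical helper used by the error handler earlier:

const pageTraceId = crypto.randomUUID(); // step 1: one trace ID per page load

const originalFetch = window.fetch;
window.fetch = (input, init = {}) => {
  const headers = new Headers(init.headers || {});
  headers.set('x-trace-id', pageTraceId); // step 2: attach to every outgoing request
  return originalFetch(input, { ...init, headers });
};

// Step 3: the helper the global error handler calls when capturing errors
function getActiveTraceId() {
  return pageTraceId;
}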

The mechanics are straightforward. The real challenge is the discipline required to keep the propagation consistent across every API boundary and backend microservice.

What the Industry Gets Wrong About Frontend Observability

“APM adequately covers the frontend.” Application Performance Monitoring focuses on server-side code execution. It has almost no visibility into JavaScript execution, DOM rendering, browser extension interference, or the gap between when the server sent a response and when the user actually saw the result. Frontend observability requires different tools, different sampling mechanics, and different alert thresholds.

“Alert broadly on every JavaScript error.” Most unfiltered JavaScript exceptions surface from browser extensions, ad blockers, and automated bot traffic. Paging on all of that external noise produces alert fatigue, and on-call engineers tune the channel out within days. Filter the baseline noise first, then alert only on rate-based thresholds applied to errors from your own codebase.

Our take: Source maps and session replay are the two highest-ROI frontend observability investments a team can make. Source maps turn an opaque string like a.js:1:4523 into actionable context like CheckoutForm.tsx:47: handleSubmit. Session replay provides a visual reproduction of exactly what happened before the interface crashed. Everything else (RUM dashboards, Core Web Vitals tracking, noise filters) builds on those two foundations. A DevOps practice shouldn’t stop at the server’s HTTP boundary; extending observability into the browser closes the loop.

With properly source-mapped traces, session replay, and frontend-backend correlation in place, a silent hydration crash surfaces as a high-priority alert before the first customer support ticket arrives. The frontend is the final edge of your distributed system; bringing it into your observability stack keeps client-side rendering failures from remaining invisible.

Stop Losing Users to Errors You Cannot See

Backend logs show a 200 OK, but your user sees a blank screen. The gap between server-side observability and actual browser experience is where conversion dies. Frontend observability pipelines connect session replay, RUM metrics, and backend traces into a single debugging workflow.

Fix Your Frontend Observability

Frequently Asked Questions

What is the difference between Real User Monitoring and synthetic monitoring?


RUM collects performance data from real user sessions across actual devices, networks, and locations. Synthetic monitoring runs scripted tests from controlled infrastructure. RUM shows the full range of user experience, including the slow devices and bad connections that make up the long tail. Synthetic catches regressions before users do. You need both: synthetic in CI to prevent regressions, RUM in production to see what users actually experience.

How much performance overhead does session replay add to a page?


Modern replay tools using mutation observer recording add 1-3% CPU overhead and 50-200KB of compressed data per minute. The first DOM snapshot is usually 30-100KB compressed. Overhead goes up with DOM complexity. Pages with over 5,000 DOM nodes or lots of DOM changes see more overhead. Sample replay at roughly 5-10% of sessions to keep bandwidth costs down while still having enough data for debugging.

How do you correlate a frontend error with the backend request that caused it?


Inject a trace ID into every outgoing fetch or XMLHttpRequest using a request interceptor. Sentry and Datadog RUM do this automatically when their SDKs find the matching backend APM agent. The frontend error carries the trace ID, linking straight to the backend distributed trace. Without this link, frontend and backend teams investigate the exact same incident separately without knowing it.

What percentage of frontend errors are caused by browser extensions and bot traffic?


In production, most unfiltered frontend JavaScript errors come from browser extensions, ad blockers, or bot traffic rather than your actual application code. In many codebases, this noise completely outnumbers real errors. Filter the noise by checking stack traces for extension:// URLs, keeping a known-noise fingerprint list, and separating bot traffic using User-Agent analysis.

What source map configuration is needed for production error deobfuscation?


Upload source maps to your error tracking service during the build step, then strip them from the production deployment. Never serve source maps publicly in production because they expose your original source code. Include the build commit SHA as the release identifier so error events link to the exact code version that generated them. Keep at least 90 days of source maps for auditing and debugging.