Frontend Error Tracking: Session Replay and RUM
You ship a new release and check your backend dashboards. Error rates are flat and latency percentiles look completely normal, so you close your laptop assuming the deployment was a success.
An hour later, customer support lights up. Users complain the page won’t load, buttons don’t work, and everything turns blank after logging in. You pull up the server logs, but every single request returns a 200 OK. The backend served the HTML, the JavaScript, and the API responses correctly. The failure actually happened after the response left your infrastructure and entered the user’s browser, leaving your monitoring stack completely blind.
Server-side telemetry tells you everything is fine while the client-side experience silently fails.
- Backend returned 200 OK, but the bug lives in the browser. Server-side observability has a visibility ceiling: it stops at the HTTP response boundary.
- Source maps in production are non-negotiable. Minified stack traces (a.js:1:4523) are practically useless without a way to map them back to the original code. Upload maps to your error tracking service and keep them out of public access.
- Sample errors intelligently, not exhaustively. High-traffic apps generate millions of error events. Record 100% of error sessions and sample roughly 5-10% of everything else.
- Session replay turns vague frontend complaints into a reproducible bug. Record DOM mutations, not video. Redact PII automatically at capture time.
- Connect frontend errors to backend traces. One trace ID directly links a user’s blank page to the exact backend API call that failed.
Your Prometheus metrics, Grafana dashboards, and distributed traces saw nothing unusual. The W3C Performance Timeline defines the specific browser APIs that make frontend instrumentation possible. Without them, you’re trying to debug an incident without any client-side context.
The Visibility Ceiling
A 200 response tells you nothing about React hydration crashes, third-party scripts blocking the main thread for 800ms, or ad blockers aggressively removing DOM elements your click handlers depend upon. The browser is a hostile environment filled with extensions you didn’t install, hardware you didn’t pick, and networks you don’t control. You deploy to a predictable server, but your code runs in thousands of completely unpredictable client environments.
RUM vs Synthetic: You Need Both
Real User Monitoring collects performance data from real user sessions across every device, network, and location. RUM might show that your P75 LCP sits at 2.1s while the P95 reaches 6.4s, heavily clustered in Southeast Asia on Android Chrome. That is exact data from the field; synthetic monitoring running from a single datacenter would never surface that geographic pattern.
Synthetic monitoring runs scripted browser tests from controlled infrastructure to catch regressions between releases. Lighthouse CI in your pipeline flags a 500ms LCP increase before it reaches users. It acts as an automated safety check against performance degradation.
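A minimal Lighthouse CI configuration sketch for that pipeline gate; the URLs and budgets here are illustrative assumptions, and the audit IDs follow Lighthouse's naming:

// lighthouserc.js - a sketch of a CI performance gate using Lighthouse CI
module.exports = {
  ci: {
    collect: {
      url: ['http://localhost:3000/', 'http://localhost:3000/checkout'],
      numberOfRuns: 3, // median of several runs smooths out run-to-run noise
    },
    assert: {
      assertions: {
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'total-blocking-time': ['warn', { maxNumericValue: 300 }],
      },
    },
  },
};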
| | Synthetic Monitoring | Real User Monitoring (RUM) |
|---|---|---|
| Data source | Scripted browser tests from controlled infra | Actual user sessions, every page load |
| When it runs | CI/CD pipeline, scheduled cron | Continuous in production |
| Catches | Regressions between releases | Real-world performance issues |
| Misses | Device/network/geography variation | Pre-deploy regressions |
| Best for | Regression detection, baseline comparison | Understanding actual user experience |
| Use alone? | Optimizes for a lab that doesn’t match reality | Catches problems only after users suffer |
Session Replay Architecture
The standard approach uses mutation observers (rrweb). Instead of recording video, you take an initial DOM snapshot and record every subsequent mutation, scroll, and click as incremental events. This runs roughly 50-200KB compressed per minute versus 500KB-2MB for actual video, and the result is searchable by DOM state: when a user reports a broken UI element, you can restore the exact DOM from the moment they clicked. No guesswork, and no debating whether the feature works on a developer’s machine.
Sampling strategy matters. Record 100% of sessions where an error occurs and 100% of sessions where the user contacts support; sample roughly 5-10% of everything else. Storage scales linearly with session count. Get sampling right and session replay is a high-value debugging tool. Get it wrong and you pay to store mountains of recordings nobody will ever watch.
Privacy is not optional. Mask all <input> elements by default. For GDPR compliance, masking has to happen on the client at recording time, not during playback: if unmasked personal data reaches your ingestion servers, the processing has already happened as far as GDPR is concerned. Text masking keeps the layout intact while replacing sensitive content with placeholders. For most debugging, the sequence of clicks matters far more than the text the user typed.
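A minimal sketch of this capture setup, assuming the rrweb library; the session-sampling decision, the batching logic, and the /api/replay endpoint are illustrative additions rather than rrweb features:

// Session replay capture with client-side masking and illustrative sampling
import { record } from 'rrweb';

const sampled = Math.random() < 0.1; // keep roughly 10% of ordinary sessions
let forceRecord = false;             // flip to true from your error handling code
let events = [];

record({
  emit(event) {
    events.push(event);
    // Ship batches only for sampled or errored sessions; a real setup would also
    // cap this buffer and flush on page unload.
    if ((sampled || forceRecord) && events.length >= 50) {
      navigator.sendBeacon('/api/replay', JSON.stringify(events));
      events = [];
    }
  },
  maskAllInputs: true,    // redact input values at capture time, before ingestion
  blockClass: 'rr-block', // elements with this class are never serialized
});

// Call this when an error is captured to keep 100% of error sessions
export function markSessionErrored() {
  forceRecord = true;
}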
Source Map Management
- Build pipeline generates source maps alongside production bundles
- Source maps upload to your error tracking service as a required CI step
- Source maps are cleanly stripped from the deployment artifact before reaching the CDN
- Each upload is tagged with the git commit SHA for exact code version correlation
- Source map retention policy covers at least 90 days
Generate source maps during your build, upload them to your error tracking service, and strip them from the deployment artifact. Tagging each upload with the git commit SHA ensures every error links to the exact source code version that produced it. Never serve source maps publicly; they are effectively a readable copy of your application source.
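A build-side sketch of the generate-but-don’t-publish half of that flow, assuming webpack; hidden-source-map emits .map files without referencing them from the bundle, and the upload and strip steps appear as CI pseudocode because the exact commands depend on your error tracking service:

// webpack.prod.js - emit source maps without exposing them publicly
module.exports = {
  mode: 'production',
  // 'hidden-source-map' writes .map files but omits the sourceMappingURL
  // comment, so the shipped bundle never points browsers at the maps.
  devtool: 'hidden-source-map',
  output: {
    filename: '[name].[contenthash].js',
    path: `${__dirname}/dist`,
  },
};

// In CI, after the build (pseudocode; substitute your tracker's CLI):
//   RELEASE=$(git rev-parse HEAD)
//   upload dist/**/*.map to the error tracker, tagged with $RELEASE
//   delete dist/**/*.map before publishing dist/ to the CDN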
Teams typically break this setup in the same three ways. Maps fall out of sync with deployed code when an urgent hotfix skips the upload step. Retention policies expire before anyone investigates an older issue. Tags mismatch because the release identifier format changed between deploys. The counter to all three is to make the source map upload a required CI check rather than an optional post-deploy step: if the upload fails, the deploy fails with it.
Error Grouping and Noise Filtering
Default grouping logic buckets errors by exception type and top stack frame. This breaks when the same root cause generates different stack traces in the wild: a null reference that surfaces in three separate React components, all caused by one missing API field, shows up as three separate incidents when it is really a single bug.
Custom fingerprinting resolves this by grouping events by message pattern, failing network endpoint, or custom error tags, collapsing 300 noisy events into a single actionable issue with an accurate occurrence count.
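A sketch of what a hand-rolled fingerprint rule might look like; the patterns, group names, and fingerprintFor helper are illustrative, and hosted tools such as Sentry expose their own hooks (beforeSend, fingerprint rules) for the same idea:

// Illustrative fingerprint rules: collapse variants of one root cause into one group
const FINGERPRINT_RULES = [
  { pattern: /Cannot read propert(y|ies) of (null|undefined)/i, group: 'missing-api-field' },
  { pattern: /Failed to fetch.*\/api\/checkout/i, group: 'checkout-api-unreachable' },
];

function fingerprintFor(error) {
  const text = `${error?.message ?? ''}\n${error?.stack ?? ''}`;
  const rule = FINGERPRINT_RULES.find((r) => r.pattern.test(text));
  // No match: fall back to the tracker's default grouping (type + top frame)
  return rule ? [rule.group] : undefined;
}

The global error handler below could then pass fingerprint: fingerprintFor(event.error) alongside the rest of its capture payload.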
Don’t: Alert on every unhandled JavaScript exception. Browser extensions inject scripts that throw errors your codebase didn’t cause, ad blockers remove DOM elements your click handlers expect, and bots execute JavaScript in ways no real browser would. Alert on all of that noise and your on-call engineer will mute the channel within a week.
Do: Filter out extension errors by checking stack traces for chrome-extension:// and similar URLs, drop bot traffic using User-Agent analysis, and maintain a known-noise fingerprint list. Once the noise is filtered, alert on rate-based thresholds. An alert that error rates jumped 5x above baseline on the checkout page is actionable; a lone TypeError is baseline noise.
// Global error handler with noise filtering.
// errorTracker, getSessionId, and getActiveTraceId are app- or SDK-provided helpers.
// Example noise patterns; grow this list from your own triage history.
const KNOWN_NOISE_PATTERNS = [
  /ResizeObserver loop limit exceeded/i,
  /^Script error\.?$/i, // opaque errors from cross-origin scripts
];

window.addEventListener('error', (event) => {
  const stack = event.error?.stack || '';

  // Filter browser extension noise
  if (/chrome-extension:|moz-extension:|safari-extension:/.test(stack)) return;

  // Filter known third-party script errors
  if (KNOWN_NOISE_PATTERNS.some((p) => p.test(event.message))) return;

  // Filter bot traffic
  if (/bot|crawl|spider/i.test(navigator.userAgent)) return;

  // Send meaningful errors with context
  errorTracker.capture({
    error: event.error,
    url: window.location.href,
    sessionId: getSessionId(),
    traceId: getActiveTraceId(), // Links to backend distributed trace
  });
});
Core Web Vitals via Field Data
Core Web Vitals measured through RUM provide field data that Lighthouse can only approximate in a lab. Aggregate totals are close to useless here; the value comes from breaking the data down by page path, device hardware class, and geographic or network segment.
LCP (threshold: 2.5s): LCP reflects server response time plus the load time of your largest visible element. Serving the primary LCP image with fetchpriority="high" helps enormously; that single HTML attribute often moves the needle more than a week of broader frontend refactoring.
INP (threshold: 200ms): INP captures roughly the worst interaction latency observed over the session (technically a high percentile of all interactions). Long tasks blocking the main thread are the usual culprit; break them up with scheduler.yield(). A mature frontend UX engineering team treats INP as a primary metric right alongside LCP.
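A sketch of that chunking pattern; scheduler.yield() is currently a Chromium API, so the fallback path and the processRows/renderRow names are assumptions for illustration:

// Break one long task into chunks so pending input can be handled in between
async function processRows(rows, renderRow) {
  for (const row of rows) {
    renderRow(row); // per-item work that would otherwise run as one long task
    if (typeof scheduler !== 'undefined' && typeof scheduler.yield === 'function') {
      await scheduler.yield(); // hand the main thread back to the browser
    } else {
      await new Promise((resolve) => setTimeout(resolve, 0)); // crude fallback
    }
  }
}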
CLS (threshold: 0.1): Layout shifts usually happen when asynchronously loaded images arrive without declared dimensions, dynamically injected content pushes existing text around, or web fonts swap in. To prevent them, use aspect-ratio or explicit width and height on every media element. Reserving space for dynamic content before it resolves keeps the page from rearranging itself, so users click what they intended to click.
| Metric | Threshold | Primary Cause | Quick Fix |
|---|---|---|---|
| LCP | 2.5s | Large unoptimized images, render-blocking resources | fetchpriority="high" on LCP element |
| INP | 200ms | Long tasks blocking main thread | scheduler.yield(), break work into chunks |
| CLS | 0.1 | Images without dimensions, font swap | aspect-ratio on media, size-adjusted font fallbacks |
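A field-collection sketch using the web-vitals library (onLCP, onINP, and onCLS are its standard exports); the /vitals endpoint and the segmentation fields are assumptions chosen to match the breakdowns described above:

// Report Core Web Vitals from real sessions, tagged for segmentation
import { onLCP, onINP, onCLS } from 'web-vitals';

function reportVital(metric) {
  const body = JSON.stringify({
    name: metric.name,    // 'LCP' | 'INP' | 'CLS'
    value: metric.value,
    id: metric.id,        // unique per metric instance, useful for deduplication
    path: location.pathname,
    deviceMemory: navigator.deviceMemory,            // rough hardware class (Chromium only)
    connection: navigator.connection?.effectiveType, // e.g. '4g', where supported
  });
  // sendBeacon survives page unload, which matters because CLS and INP finalize late
  navigator.sendBeacon('/vitals', body);
}

onLCP(reportVital);
onINP(reportVital);
onCLS(reportVital);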
Correlating Frontend and Backend
Without that correlation, the frontend team says users are reporting errors while the backend team points out that all their service metrics look healthy. Both groups are technically correct based on their own dashboards, but neither can find the root cause alone.
The standard solution is to inject a trace ID header into every outgoing fetch request. Tools like Sentry and Datadog RUM do this automatically when their client SDKs detect a matching backend APM setup. Jumping from the frontend error to the backend trace shows exactly which query timed out, which service returned unexpected data, or which middleware rejected the request. You get the full picture without switching contexts.
Implementing trace ID propagation manually
If you’re not using an SDK that handles network correlation automatically, inject the trace ID yourself with a fetch wrapper (a sketch follows the list below):
- Generate a trace ID per page load (UUID v4 works fine)
- Add it as a custom header (x-trace-id) to every outgoing API request
- Include the trace ID in every error event sent to your tracking service
- On the backend, extract the header and propagate it through your span context
- Index both frontend errors and backend trace spans by the same trace ID
- Build a dashboard view that connects a frontend error to its backend trace with a single query
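A sketch of that interceptor, assuming one trace ID per page load and the x-trace-id header from the list above. Production SDKs typically propagate the standardized W3C traceparent header instead, and this simple wrapper assumes callers pass extra headers via the init argument rather than a prebuilt Request object:

// One trace ID per page load, propagated on every outgoing request
const PAGE_TRACE_ID = crypto.randomUUID(); // UUID v4

export function getActiveTraceId() {
  return PAGE_TRACE_ID;
}

const originalFetch = window.fetch.bind(window);
window.fetch = (input, init = {}) => {
  const headers = new Headers(init.headers || {});
  headers.set('x-trace-id', PAGE_TRACE_ID);
  return originalFetch(input, { ...init, headers });
};

The global error handler shown earlier already attaches getActiveTraceId() to each captured error, so the error event and the backend spans end up indexed by the same identifier.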
The implementation mechanics are straightforward. The real challenge is the discipline to keep the header consistent across every API boundary and backend microservice.
What the Industry Gets Wrong About Frontend Observability
“APM adequately covers the frontend.” Application Performance Monitoring focuses on server-side code execution. It has essentially no visibility into JavaScript execution, DOM rendering, browser extension interference, or the gap between when the server sent a response and when the user actually saw the result. Dedicated frontend observability needs different tools, different sampling mechanics, and different alert thresholds.
“Alert broadly on every JavaScript error.” Most unfiltered JavaScript exceptions come from browser extensions, ad blockers, and bot traffic. Paging on that external noise produces alert fatigue, and on-call engineers tune the channel out within days. Filter the baseline noise first, then alert on rate-based thresholds for errors that originate in your own code.
Two capabilities do most of the work: production source maps that turn a.js:1:4523 into an actionable location like CheckoutForm.tsx:47 in handleSubmit, and session replay that shows exactly what happened right before the interface crashed. Everything else you assemble (RUM dashboards, Core Web Vitals tracking, noise filters) builds on those two foundations. A comprehensive DevOps practice shouldn’t stop at the server’s HTTP boundary; extending observability into the browser closes the loop. With source-mapped traces, session replay, and frontend-backend correlation in place, a silent hydration crash surfaces as a high-priority alert before the first customer support ticket arrives. The frontend is the final edge of your distributed system, and bringing it into your observability stack keeps client-side failures from staying invisible.