Progressive Web Apps: Offline-First That Works
You ship a progressive web app. The demo is flawless. Airplane mode, kill the connection, toggle it back. The app loads instantly, the data persists, the sync resolves. Your team celebrates.
Then it hits production. A user on a spotty train connection submits a form three times because the service worker retried silently. The cache served a stale login page after you deployed an auth change. IndexedDB quietly hit its quota limit on a 32GB iPhone crammed with photos and podcast apps. Three different bugs, three different root causes, all invisible during the demo.
The gap between a PWA demo and a production PWA is not the service worker registration. It is everything that happens after. Real networks, real devices, and real user behavior will find every crack in your caching assumptions. And they will find them fast.
The Service Worker Lifecycle Nobody Explains Well
A service worker has three distinct phases: installation, activation, and fetch interception. The lifecycle is event-driven, not request-driven. The transitions between phases are where most bugs live.
When a user first visits your site, the browser downloads and installs the service worker in the background. The install event fires, and this is where you precache your critical assets. But here is the part that trips everyone up: the new service worker does not activate immediately. If there is an existing service worker controlling the page, the new one enters a waiting state. It will not activate until every tab controlled by the old service worker is closed. This is by design. It prevents half your tabs running against one cache version and half against another.
The problem? Most users never close their tabs. They leave your app open in a background tab for days. Your new deployment sits in the waiting state while the old cache keeps serving stale assets. Calling skipWaiting() in the install handler forces immediate activation, but now you have a different problem. The page was loaded with old assets, and the new service worker is intercepting fetches with new routing logic. If your HTML references a CSS file that the new service worker’s precache manifest renamed, the page breaks mid-session.
The safe pattern: skipWaiting() combined with clients.claim() in the activate handler, plus a version check in your app shell that detects when the controlling service worker changed and prompts the user to refresh. Workbox’s workbox-window module provides this exact mechanism via its waiting and controlling events. Never silently swap cache versions under active sessions. Your users will not forgive a mid-session page break.
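That prompt-to-refresh flow can be sketched with workbox-window in a few lines. This is a sketch, not a drop-in: `promptUserToRefresh` is a placeholder for whatever toast or banner your app uses, and the generated sw.js must handle the SKIP_WAITING message (Workbox’s build tools do this for you).

```javascript
import { Workbox } from 'workbox-window';

const wb = new Workbox('/sw.js');

// Fires when a new service worker has installed but is stuck in the
// waiting state behind the one currently controlling the page.
wb.addEventListener('waiting', () => {
  // promptUserToRefresh is a hypothetical UI helper (toast, banner, etc.)
  // that resolves true if the user accepts the update.
  promptUserToRefresh().then((accepted) => {
    if (accepted) {
      // Reload only once the new worker takes control, so the page and
      // the cache version swap together instead of mid-session.
      wb.addEventListener('controlling', () => window.location.reload());
      wb.messageSkipWaiting(); // asks the waiting worker to activate
    }
  });
});

wb.register();
```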
Caching Strategies and When Each One Breaks
The five caching strategies are not interchangeable. Each one has a specific failure mode that will bite you in production. Applying the wrong strategy to the wrong resource type is the root cause of roughly 60-70% of production PWA bugs. This is the mistake that catches every team eventually.
Cache-first serves from cache and only hits the network on a cache miss. Use it for versioned static assets: CSS, JavaScript, and images with content hashes in their filenames. The filename changes when the content changes, so cached versions are always correct. Apply cache-first to API responses and you will serve stale data indefinitely. Apply it to HTML documents and you will serve yesterday’s page layout after a deployment. Both are bad. Both happen constantly.
Network-first hits the network and falls back to cache on failure. Use it for HTML documents and API responses where freshness matters more than speed. The failure mode is timeout behavior. Without an explicit timeout, a slow but not dead connection hangs the request for 30+ seconds while the user stares at a blank screen. Meanwhile, the cache has a perfectly good response sitting right there. Set network timeouts to 3-5 seconds for document requests. Always.
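The timeout logic is worth seeing in isolation. Below is a minimal, framework-free sketch of network-first with a deadline; `fetchFn` and `cacheFn` are injected stand-ins for the real fetch and Cache API calls, so the control flow stays visible. Workbox’s NetworkFirst strategy gives you the same behavior via its networkTimeoutSeconds option.

```javascript
// Network-first with a timeout: race the network against a timer and
// fall back to cache on failure or deadline.
async function networkFirst(fetchFn, cacheFn, timeoutMs = 4000) {
  const deadline = new Promise((_, reject) =>
    setTimeout(() => reject(new Error('network timeout')), timeoutMs)
  );
  try {
    // Whichever settles first wins: a response, an error, or the timer.
    return await Promise.race([fetchFn(), deadline]);
  } catch (err) {
    const cached = await cacheFn();
    if (cached !== undefined) return cached;
    throw err; // nothing cached either: surface the original failure
  }
}
```

The key design point is that the cache is consulted only after the network loses the race, so fresh responses always win when the connection is healthy.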
Stale-while-revalidate serves from cache immediately and updates the cache in the background. This is the strategy that sounds perfect and causes the most subtle bugs. The user sees instant content, but it is the previous version. If you deployed a breaking API change, the user’s UI is rendering against stale data until the background revalidation completes and they navigate again. For non-critical resources like avatars and metadata, this works well. For anything where staleness causes functional breakage, it is a trap.
Network-only skips the cache entirely. Use it for authentication endpoints, payment processing, and anything where serving a cached response would be functionally wrong or a security risk. Never cache OAuth token exchanges, session validation responses, or mutation acknowledgments. This is non-negotiable.
Cache-only serves exclusively from the precache. Use it for the app shell HTML when you want guaranteed instant loading and handle content separately via runtime caching. The failure mode is obvious: if the asset is not in the precache, the request fails with no fallback.
Getting the strategy right for each resource type is the difference between a PWA that feels magical and one that serves corrupted state to users at 2 AM.
Workbox Configuration That Actually Ships
Workbox is the de facto standard for service worker caching, and for good reason. Hand-rolling cache management leads to the same bugs Workbox already solved. But here is the thing most teams miss: Workbox’s defaults are not production defaults. They are demo defaults.
The precache manifest is where most teams over-provision. Workbox’s build tool scans your output directory and precaches everything by default. For a typical single-page application, that means every route’s JavaScript chunk, every image, every font file. The initial service worker install downloads all of it, even routes the user will never visit. On a mid-range phone over 3G, that precache takes 30+ seconds and burns through the user’s data plan.
Scope the precache to genuinely critical assets: the app shell HTML, the main JavaScript bundle, core CSS, and your primary font. Everything else goes into runtime caching with appropriate strategies. This is the difference between a 200KB precache that installs in under a second and a 4MB precache that times out on slow connections. Teams routinely ship 8MB precaches without realizing it.
Cache versioning is the piece most tutorials skip entirely. Workbox handles precache versioning automatically through content hashes in the manifest. But runtime caches need explicit expiration. Without ExpirationPlugin, your runtime cache grows indefinitely. A user who visits your app daily for six months accumulates thousands of cached API responses and images. Set maxEntries and maxAgeSeconds on every runtime cache route. No exceptions.
Here is the Workbox configuration that actually ships reliably: precache the shell, runtime-cache API responses with NetworkFirst and a 4-second timeout capped at 50 entries with 24-hour expiry, runtime-cache images with CacheFirst capped at 100 entries with 30-day expiry, and handle navigation requests with NetworkFirst falling back to the cached app shell. That covers 90% of production PWA needs. Start here, then customize.
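As a sketch, that setup looks roughly like the config below for workbox-build’s generateSW. The glob and URL patterns are illustrative assumptions you would adapt to your own build output; note that falling back to the cached shell for never-visited routes needs a small custom handler beyond what this fragment shows.

```javascript
// workbox-config.js: a sketch for workbox-build's generateSW.
// Paths and patterns are illustrative, not prescriptive.
module.exports = {
  globDirectory: 'dist/',
  // Precache only the shell: HTML, main bundle, core CSS, primary font.
  globPatterns: ['index.html', 'main.*.js', 'main.*.css', 'fonts/primary.woff2'],
  swDest: 'dist/sw.js',
  skipWaiting: false, // pair activation with an explicit user prompt instead
  clientsClaim: true,
  runtimeCaching: [
    {
      // Navigations: try the network for fresh HTML, fall back to cache.
      urlPattern: ({ request }) => request.mode === 'navigate',
      handler: 'NetworkFirst',
      options: { cacheName: 'pages', networkTimeoutSeconds: 4 },
    },
    {
      urlPattern: /\/api\//,
      handler: 'NetworkFirst',
      options: {
        cacheName: 'api',
        networkTimeoutSeconds: 4,
        expiration: { maxEntries: 50, maxAgeSeconds: 24 * 60 * 60 },
      },
    },
    {
      urlPattern: /\.(?:png|jpg|jpeg|webp|svg)$/,
      handler: 'CacheFirst',
      options: {
        cacheName: 'images',
        expiration: { maxEntries: 100, maxAgeSeconds: 30 * 24 * 60 * 60 },
      },
    },
  ],
};
```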
IndexedDB Patterns for Offline Data
Now for the data layer. LocalStorage is synchronous, blocks the main thread, and caps at 5-10MB. Do not use it for anything beyond simple key-value preferences. IndexedDB is the only browser storage API suitable for structured offline data at scale.
The API is notoriously hostile. Raw IndexedDB code involves opening database connections, creating object stores in upgrade handlers, managing transactions with success and error callbacks, and dealing with cursor iteration for queries. Do not write this by hand. Libraries like Dexie.js and idb wrap it in a Promise-based interface that feels like a normal database client. Use one. Life is too short for the raw IndexedDB API.
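With the idb wrapper, the whole open/upgrade/transaction dance collapses to a few Promise-based calls. The database, store, and field names below are made up for illustration, and the top-level await assumes a module context.

```javascript
import { openDB } from 'idb';

// Open (and lazily create) the database. The upgrade callback runs only
// when the version number increases, which is where schema changes live.
const db = await openDB('notes-app', 1, {
  upgrade(db) {
    const store = db.createObjectStore('notes', { keyPath: 'id' });
    store.createIndex('byUpdated', 'updatedAt'); // useful for sync queries
  },
});

// Reads and writes are plain awaits; no transaction callbacks in sight.
await db.put('notes', { id: 'n1', body: 'draft', updatedAt: Date.now() });
const note = await db.get('notes', 'n1');
const ordered = await db.getAllFromIndex('notes', 'byUpdated');
```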
The architectural pattern that actually works for offline-first applications is a local-first data layer that treats IndexedDB as the primary data store and syncs to the server as a background operation. Reads always hit IndexedDB first. Writes always go to IndexedDB first. The sync engine pushes changes to the server when connectivity is available and pulls remote changes on a schedule or via push notification.
This pattern decouples the UI from network state entirely. The application never shows a loading spinner waiting for a network response. It shows what is in IndexedDB, and the data gets fresher in the background. Users do not care about your network state. They care about seeing their data instantly.
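The shape of that data layer fits in a few dozen lines. In this sketch the Map stands in for an IndexedDB object store and `pushFn` for a hypothetical server call; the point is the flow: local write, outbox, background drain.

```javascript
// Minimal local-first data layer: reads and writes hit the local store
// immediately, and an outbox queue holds pending mutations until a sync
// pass pushes them to the server.
class LocalFirstStore {
  constructor() {
    this.records = new Map(); // stand-in for an IndexedDB object store
    this.outbox = [];         // pending mutations awaiting sync
  }

  write(record) {
    this.records.set(record.id, record); // UI reads see this instantly
    this.outbox.push({ op: 'put', record });
  }

  read(id) {
    return this.records.get(id); // never waits on the network
  }

  async sync(pushFn) {
    // Drain the outbox in order; a failed push throws and leaves the
    // entry queued, so the next sync pass retries from where it stopped.
    while (this.outbox.length > 0) {
      const mutation = this.outbox[0];
      await pushFn(mutation);
      this.outbox.shift();
    }
  }
}
```

Because the UI only ever touches `read` and `write`, network state is invisible to it; `sync` can run on a timer, on a connectivity event, or from a Background Sync handler.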
The critical design decision is the sync strategy. And this is where things get hard.
The Sync Problem Nobody Warns You About
Background Sync API handles the “retry when online” part elegantly. You register a sync event, and the browser fires it when connectivity returns. That part is easy. The hard problem is what happens when two devices, or two tabs, or a device and a direct API call, modify the same record while offline.
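The Background Sync wiring itself is short. The page registers a tagged sync, and the service worker drains its queue when the browser decides connectivity is back. `drainOutbox` is a placeholder for your own sync logic.

```javascript
// In page code: queue the work locally first, then request a sync.
async function requestSync() {
  const registration = await navigator.serviceWorker.ready;
  await registration.sync.register('outbox-sync');
}

// In the service worker: the browser fires this when connectivity
// returns, and retries with backoff if the waitUntil promise rejects.
self.addEventListener('sync', (event) => {
  if (event.tag === 'outbox-sync') {
    event.waitUntil(drainOutbox()); // drainOutbox is a hypothetical helper
  }
});
```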
Last-write-wins is the default strategy and the source of most data loss bugs in offline-first apps. User A edits a record offline on their phone. User B edits the same record on their laptop. Both come online. Whichever sync completes last silently overwrites the other’s changes. For low-contention data like user preferences, this is acceptable. For collaborative editing, document authoring, or any multi-user workflow, it destroys trust in the application. Teams lose users over this within the first month.
The alternatives are field-level merging, operational transforms, and CRDTs. Field-level merging tracks which fields changed in each mutation and merges non-conflicting field changes automatically, flagging conflicts only when both sides modified the same field. Operational transforms maintain an ordered log of operations and transform concurrent operations against each other. CRDTs (Conflict-free Replicated Data Types) use mathematical properties to guarantee convergence without coordination.
For most production applications, field-level merging with manual conflict resolution for true conflicts is the right choice. CRDTs are powerful but add significant complexity. They are overkill unless you are building a collaborative editor. Track modification timestamps per field, auto-merge when fields do not overlap, and surface a conflict resolution UI when they do.
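A minimal sketch of that merge, assuming each side is diffed against a common base version (per-field timestamps play the same role when no base snapshot is kept). The record shape is illustrative, not a standard format.

```javascript
// Field-level three-way merge: fields changed on only one side merge
// automatically; fields changed on both sides with different values
// become conflicts for a manual-resolution UI.
function mergeRecords(base, local, remote) {
  const merged = { ...base.fields };
  const conflicts = [];
  const keys = new Set([
    ...Object.keys(local.fields),
    ...Object.keys(remote.fields),
  ]);
  for (const key of keys) {
    const localChanged = local.fields[key] !== base.fields[key];
    const remoteChanged = remote.fields[key] !== base.fields[key];
    if (localChanged && remoteChanged && local.fields[key] !== remote.fields[key]) {
      // Both sides touched the same field: a true conflict. The merged
      // record keeps the base value until the user resolves it.
      conflicts.push({ key, local: local.fields[key], remote: remote.fields[key] });
    } else if (localChanged) {
      merged[key] = local.fields[key];
    } else if (remoteChanged) {
      merged[key] = remote.fields[key];
    }
  }
  return { merged, conflicts };
}
```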
A serverless backend is a natural fit for the sync pipeline: queue processing, retry logic, and webhook delivery are fundamentally event-driven workloads, and running them on managed infrastructure means there are no servers to operate for what is inherently bursty traffic.
Cache Versioning Across Deployments
This is where “works in development” dies in production. You deploy a new version. The service worker’s precache manifest has new content hashes. The new service worker installs, the old caches get cleaned up in the activate handler. But the user’s current page was loaded from the old cache. The HTML references old asset filenames. The new service worker’s fetch handler does not have those old filenames in its precache. The request falls through to the network, which returns a 404 because your build pipeline already cleaned up the old files. Congratulations, you just broke every active session.
The fix is a cache-busting strategy that maintains old assets for at least one deployment cycle. Keep the previous version’s assets on your CDN or in your runtime cache for 24-48 hours after deployment. Workbox’s precache cleanup only removes outdated entries from the precache itself, not from runtime caches, so runtime-cached assets naturally age out via expiration plugins.
For the HTML document itself, always use NetworkFirst. The HTML is the entry point that references all other assets. If the HTML is stale, every asset reference in it is potentially wrong. NetworkFirst ensures the user gets the latest HTML when online and falls back to the cached version only when the network is unavailable.
The deployment sequence matters and the order is non-negotiable: push new assets to CDN first, then deploy the new service worker, then clean up old assets after the TTL expires. Reverse this order and you create a window where the service worker references assets that do not exist yet. This is the kind of bug that only shows up when real users are mid-session during a deploy.
App Shell vs. Streaming SSR
Two fundamentally different approaches here, and picking the wrong one costs you months.
The app shell model caches a minimal HTML skeleton with a loading state and fills it with content via JavaScript after the shell renders. The advantage: consistent, near-instant initial paint on repeat visits. The shell is in the precache, so it loads from disk in milliseconds. The disadvantage: content requires JavaScript execution. That means a blank content area until the JavaScript bundle downloads, parses, and fetches data. On a mid-range phone, that gap is 1-3 seconds even with the shell cached. Users notice.
Streaming SSR with navigation preload takes a different approach. When a navigation request comes in, navigation preload starts the network fetch in parallel with service worker startup; the worker serves the cached page header immediately and streams in the body from the preload response as it arrives. The browser starts rendering the cached header while the server is still producing the page body, with no JavaScript required for initial content display.
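Inside the service worker, enabling preload and stitching the two streams looks roughly like this. The cached-header-plus-streamed-body pattern is sketched with the Streams API, and `/shell-header.html` is a hypothetical precached fragment.

```javascript
self.addEventListener('activate', (event) => {
  // Ask the browser to start navigation fetches in parallel with SW boot.
  event.waitUntil(self.registration.navigationPreload.enable());
});

self.addEventListener('fetch', (event) => {
  if (event.request.mode !== 'navigate') return;
  event.respondWith((async () => {
    const header = await caches.match('/shell-header.html'); // hypothetical fragment
    const preload = await event.preloadResponse; // the parallel network fetch
    if (!header || !preload) return preload || fetch(event.request);

    // Stream the cached header first, then pipe the server body behind it.
    const { readable, writable } = new TransformStream();
    (async () => {
      await header.body.pipeTo(writable, { preventClose: true });
      await preload.body.pipeTo(writable);
    })();
    return new Response(readable, {
      headers: { 'Content-Type': 'text/html; charset=utf-8' },
    });
  })());
});
```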
The trade-off is cache storage. App shell caches one HTML file for all routes. Streaming SSR caches individual pages, which means significantly more storage per route but also means each route loads with its actual content from cache when offline. Not a loading skeleton. Real content.
For content-heavy sites, publishing platforms, and documentation sites, streaming SSR with page-level caching is the better architecture. Do not fight this. For highly interactive applications where the UI is JavaScript-driven regardless, app shell makes more sense because the JavaScript is required for functionality, not just rendering. The standard rendering-strategy performance trade-offs apply directly here. An SSR-cached PWA regularly achieves sub-second LCP on repeat visits, which is competitive with native app performance.
Solid UI and UX work bridges the gap between these architectural decisions and the actual user experience. Offline states, loading transitions, and sync indicators need to feel intentional. If they feel broken, users will assume the app is broken.
Testing Offline Behavior in CI
Testing PWA behavior manually means toggling airplane mode on your development machine. That catches the happy path. It catches nothing else. Not the service worker lifecycle bugs, not the cache version conflicts, not the IndexedDB quota errors that happen on real devices over real networks.
Playwright supports service worker interception and network condition emulation natively. Write tests that install the service worker, verify precache population, simulate offline navigation, confirm cached responses are served, then simulate a deployment with a new service worker version and verify the update flow. This catches the cache versioning bugs before they reach production. Do this in CI, not manually.
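A minimal Playwright sketch of the offline path follows; the URL and selector are placeholders, and the update-flow test takes the same shape with a second service worker version deployed between steps.

```javascript
import { test, expect } from '@playwright/test';

test('app shell is served from cache when offline', async ({ page, context }) => {
  // First visit: let the service worker install and populate the precache.
  await page.goto('https://localhost:3000/'); // placeholder URL
  await page.waitForFunction(() => navigator.serviceWorker?.controller !== null);

  // Cut the network and navigate again: the shell must come from cache.
  await context.setOffline(true);
  await page.reload();
  await expect(page.locator('#app')).toBeVisible(); // placeholder selector
});
```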
For IndexedDB testing, seed the database with realistic data volumes. A test that writes 10 records will never hit quota limits. A test that writes 50,000 records with realistic payload sizes will reveal the performance cliffs and quota behavior that production users experience. And test on WebKit explicitly. Safari’s IndexedDB implementation has different transaction isolation behavior and more aggressive storage eviction than Chromium. If you only test on Chrome, you are testing the easy browser.
Network condition testing goes beyond binary online/offline. Use Playwright’s route.abort() to simulate partial failures where some requests succeed and others fail. This is the real-world scenario. Your API server is reachable but returning 502 errors on one endpoint. The CDN is serving stale assets. The WebSocket connection dropped but HTTP requests still work. These partial failure states expose the gaps in error handling that binary offline testing will never find.
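A partial-failure test is one `page.route` away. Here a single hypothetical endpoint fails while everything else stays healthy, which exercises per-request error handling rather than the binary offline branch.

```javascript
import { test, expect } from '@playwright/test';

test('feed shows an error state when its endpoint fails', async ({ page }) => {
  // Simulate a partial outage: one endpoint fails, the rest of the
  // network works. '/api/feed' is a hypothetical endpoint.
  await page.route('**/api/feed**', (route) => route.abort('failed'));

  await page.goto('https://localhost:3000/'); // placeholder URL
  // The app should surface a scoped error, not a full offline page.
  await expect(page.locator('.feed-error')).toBeVisible(); // placeholder selector
});
```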
For teams running PWAs in production, automated offline testing in CI is not optional. It is the only way to catch service worker regressions before they reach users who depend on offline functionality. Skip it and you will learn about your bugs from user complaints.
The Gap Between Demo and Production
The PWA demo works because it runs on a fast device, a stable connection, a single tab, with no prior cache state, against a server that never deploys mid-session. Production is none of those things. Not even close.
Production PWAs need cache versioning that survives deployments without breaking active sessions. They need sync conflict resolution that does not silently overwrite data. They need IndexedDB schemas that migrate gracefully when the data model changes. They need service worker update flows that prompt users instead of breaking their current page.
The technology is mature. Service workers, Cache API, IndexedDB, and Background Sync have broad browser support and well-documented APIs. The gap is not in the platform capabilities. It is in the engineering discipline to handle every state transition, every failure mode, and every edge case that the conference demo conveniently skips. The platform gives you the tools. Whether your PWA survives contact with real users depends entirely on whether you respected the complexity those tools demand. And the frontend is only half the story: the offline-first client still depends on a resilient, edge-distributed backend for reliable sync and asset delivery.