Progressive Web Apps: Offline-First That Works

Metasphere Engineering · 14 min read

You ship a progressive web app. The demo is flawless. Airplane mode, kill the connection, toggle it back. The app loads instantly, data persists, sync resolves. Your team celebrates.

The camping trip where everything went perfectly. Clear skies. Road open. Supplies fresh.

Then production happens. A user on a spotty train connection submits a form three times because the service worker retried quietly and the UI showed nothing. The cache served a stale login page after your auth deploy, locking users out for an afternoon. IndexedDB hit its quota limit on a 32GB iPhone crammed with photos. The W3C Service Workers specification defines the caching primitives. The failure modes? Those are entirely yours. The web.dev PWA guides cover the happy path. Production teaches you the unhappy ones.

The road washed out. The pantry had expired food. The journal had two conflicting entries. Nobody planned for the storm.

Key takeaways
  • Service workers retry quietly by default. On a spotty connection, the user submits a form three times and the UI shows nothing. Idempotency keys and UI feedback are mandatory, not nice-to-haves.
  • Stale caches serve old auth pages after deploys, locking users out. Cache versioning with skipWaiting() and clients.claim() must be deliberate, not automatic.
  • IndexedDB quotas vary wildly by device. A 32GB iPhone with photos and podcasts may have less than 500MB available. Handle quota errors gracefully or lose offline data quietly.
  • Conflict resolution for offline writes is the hardest problem. Two devices edit the same record offline. Last-write-wins destroys data. Field-level merging or CRDTs are the real answer.
  • Background sync only works while the service worker is alive. If the browser kills it (common on mobile), queued writes vanish unless persisted to IndexedDB first.
Prerequisites
  1. HTTPS configured for all origins (service workers require secure contexts)
  2. Build pipeline generates content-hashed filenames for all static assets
  3. At least one library wrapping IndexedDB (Dexie.js or idb) in the data layer
  4. Playwright or equivalent set up for service worker interception in CI
  5. Cache storage budget defined per origin (target under 50MB for reliable cross-browser support)
[Diagram: Service Worker Lifecycle, Registration to Fetch — 1. register with navigator.serviceWorker.register('/sw.js') over HTTPS, scoped to the origin; 2. install precaches critical assets (app shell HTML, hashed JS and CSS) inside waitUntil(); 3. activate cleans old caches and calls clients.claim(); 4. fetch interception routes requests cache-first with a network fallback. Each phase gates the next.]

The Service Worker Lifecycle Nobody Explains Well

Most bugs live in the transitions between install, activate, and fetch. The supplies manager’s shift change. A new service worker doesn’t activate right away. If an old one still controls the page, the new one waits until every tab running the old worker closes. By design, this stops split cache versions from serving mixed assets to the same user.

Most users never close their tabs. Your deployment sits in the waiting state for days. The new supplies manager waiting in the lobby. The old one still running the pantry.

Calling skipWaiting() forces the new worker to take over right away. But the page loaded with old assets, and the new worker is fetching with new routing rules. If your HTML points to a CSS file that the new precache renamed, the page breaks mid-session. No error. No warning. Just a broken layout the user has to refresh away. The new manager reorganized the pantry while people were eating. Plates break.

Safe pattern: prompt the user when a new worker is waiting, trigger skipWaiting() only when they accept, pair it with clients.claim() in the activate handler, and reload when the controlling worker changes. Workbox’s workbox-window provides exactly this flow via its waiting and controlling events. Never quietly swap cache versions under active sessions.
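A minimal sketch of that flow with workbox-window; the file names and the prompt UI are assumptions, not part of any particular app:

```js
// main.js — register the worker, prompt on 'waiting', reload on 'controlling'
import { Workbox } from 'workbox-window';

const wb = new Workbox('/sw.js');

wb.addEventListener('waiting', () => {
  // Replace with your own prompt UI; confirm() is only a placeholder.
  if (confirm('A new version is available. Refresh now?')) {
    wb.messageSkipWaiting(); // sends { type: 'SKIP_WAITING' } to the waiting worker
  }
});

wb.addEventListener('controlling', () => {
  window.location.reload(); // the new worker now owns the page: reload onto fresh assets
});

wb.register();
```

```js
// sw.js — take over only when told to, then claim open clients
self.addEventListener('message', (event) => {
  if (event.data?.type === 'SKIP_WAITING') self.skipWaiting();
});

self.addEventListener('activate', (event) => {
  event.waitUntil(self.clients.claim());
});
```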

The lifecycle determines when your worker activates. What it does with requests after activation is the next minefield entirely.

Caching Strategies and When Each One Breaks

The wrong strategy on the wrong resource type causes more production PWA bugs than anything else. Packing the wrong supplies for the wrong trip.

| Strategy | Best for | Failure mode | Timeout |
| --- | --- | --- | --- |
| Cache-first | Versioned static assets (hashed filenames) | Serves stale data forever on API responses | None needed |
| Network-first | HTML documents, fresh API data | Hangs 30+ seconds on slow connections without timeout | 3-5s for documents |
| Stale-while-revalidate | Avatars, metadata, non-critical lists | UI renders against stale data after breaking API change | N/A |
| Network-only | Auth, payments, mutations | No offline fallback at all | Connection timeout |
| Cache-only | App shell (guaranteed instant load) | Missing asset = hard failure | N/A |

Cache-first serves from cache, hits network only on miss. Perfect for versioned static assets with content hashes in filenames. Apply it to API responses and you serve stale data forever. Eating canned food that expired last year because nobody checked the label. Apply it to HTML and you serve yesterday’s layout after a deployment.

Network-first hits the network, falls back to cache on failure. The right choice for HTML and APIs where freshness matters. Go to town for fresh supplies. If the road is out, eat from the pantry. Without an explicit timeout, a slow connection hangs 30+ seconds while the cache has a perfectly good response sitting unused. Standing at the road waiting for the delivery truck in a snowstorm. Set network timeouts to 3-5 seconds for documents.

Stale-while-revalidate serves from cache right away and updates in the background. Sounds perfect. Causes the most subtle bugs. Deploy a breaking API change and the UI renders against stale data until the user navigates again. Serving yesterday’s menu while the kitchen preps today’s specials. Fine for avatars and metadata. A trap for anything where staleness causes functional breakage.

[Diagram: Cache Strategy Decision Matrix — versioned assets (the hash invalidates automatically) → cache-first; HTML documents → network-first with a 3-5s timeout; stale-tolerant APIs such as avatars and metadata → stale-while-revalidate; fresh APIs and auth endpoints → network-only. The wrong strategy on the wrong resource is the #1 PWA production bug.]
Anti-pattern

Don’t: Apply a single caching strategy globally across all routes. registerRoute(/.*/, new CacheFirst()) serves stale auth tokens and old HTML forever. Canning everything and calling it a pantry.

Do: Match strategy to resource volatility. Hashed assets get cache-first. HTML gets network-first with a 3-5 second timeout. Auth endpoints get network-only. Every route needs its own strategy. Every supply has its own shelf life.
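A sketch of that per-route split with Workbox; the route patterns and cache names are assumptions about a typical app:

```js
// sw.js — one strategy per resource type, never one strategy for everything
import { registerRoute } from 'workbox-routing';
import { CacheFirst, NetworkFirst, NetworkOnly } from 'workbox-strategies';

// Hashed JS/CSS/fonts: the content hash is the invalidation, cache freely.
registerRoute(
  ({ request }) => ['script', 'style', 'font'].includes(request.destination),
  new CacheFirst({ cacheName: 'static-assets' })
);

// HTML navigations: freshness first, cache fallback, never hang on slow links.
registerRoute(
  ({ request }) => request.mode === 'navigate',
  new NetworkFirst({ cacheName: 'pages', networkTimeoutSeconds: 4 })
);

// Auth and payments: no cache, ever.
registerRoute(
  ({ url }) => url.pathname.startsWith('/api/auth/'),
  new NetworkOnly()
);
```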

Workbox Configuration That Ships

Workbox’s defaults are demo defaults. The build tool scans your output folder and precaches everything. Every JS chunk, image, font. On a mid-range phone over 3G, that download takes 30+ seconds before the worker even starts working. Packing the entire supermarket into the cabin. Limit precaching to the essentials: shell HTML, main JS bundle, core CSS, primary font. Everything else goes into runtime caching with explicit expiration. Pack essentials. Forage the rest as needed.

Runtime caches need expiration rules. Without ExpirationPlugin, your runtime cache grows forever. A pantry with no expiry dates. Set maxEntries and maxAgeSeconds on every runtime cache route. No exceptions. A cache that only grows is a quota error waiting to happen.
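A sketch of a bounded runtime cache; the avatar route and the limits are illustrative:

```js
import { registerRoute } from 'workbox-routing';
import { StaleWhileRevalidate } from 'workbox-strategies';
import { ExpirationPlugin } from 'workbox-expiration';

registerRoute(
  ({ url }) => url.pathname.startsWith('/avatars/'),
  new StaleWhileRevalidate({
    cacheName: 'avatars',
    plugins: [
      new ExpirationPlugin({
        maxEntries: 100,                 // oldest entries evicted past this count
        maxAgeSeconds: 7 * 24 * 60 * 60, // one week shelf life
        purgeOnQuotaError: true,         // give this cache up first if the origin hits quota
      }),
    ],
  })
);
```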

IndexedDB and the Sync Problem

IndexedDB is the only browser storage API that works for structured offline data at any real scale. The journal. The raw API is notoriously ugly. Libraries like Dexie.js and idb wrap it in a Promise-based interface that humans can actually use.

For offline-first apps, treat IndexedDB as the main data store. Reads and writes always hit local first. Write everything in the journal. A sync engine pushes changes when the connection comes back and pulls remote changes on a schedule. Send the journal entries to headquarters when the road reopens.
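A minimal local-first write path with Dexie.js; the table names and record shape are assumptions:

```js
import Dexie from 'dexie';

// Every write lands in IndexedDB plus an outbox row the sync engine drains later.
const db = new Dexie('app-db');
db.version(1).stores({
  notes: 'id, updatedAt',
  outbox: '++seq, entityId',
});

export async function saveNote(note) {
  const record = { ...note, updatedAt: Date.now() };
  await db.transaction('rw', db.notes, db.outbox, async () => {
    await db.notes.put(record);                                    // reads stay local and instant
    await db.outbox.add({ entityId: record.id, payload: record }); // queued for the sync engine
  });
  return record;
}
```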

[Diagram: Local-first data flow — user actions read and write IndexedDB first, writes are queued locally, background sync pushes the queue and resolves conflicts when the connection returns, and the server ends up consistent across devices. Reads are always fast, writes never block, sync happens when connectivity allows.]

Storing data locally is the easy part. Packing the journal is easy. What happens when two copies diverge is where offline-first gets genuinely hard. Two people writing different entries on the same journal page while the phone lines are cut.

Last-write-wins is the default conflict strategy and the default source of data loss. User A edits offline on their phone. User B edits on their laptop. Whichever syncs last quietly overwrites the other. For user preferences, fine. For anything collaborative, it destroys trust. (The entry that vanished. Nobody knows why.)

| Strategy | Complexity | Best for | Limitation |
| --- | --- | --- | --- |
| Last-write-wins | Low | Preferences, settings | Quietly drops concurrent edits |
| Field-level merge | Medium | Forms, profiles, records | True conflicts still need manual resolution |
| Operational transforms | High | Collaborative text editing | Requires ordered operation logs |
| CRDTs | High | Real-time collaboration at scale | Complex to build, limited data structure support |

For most apps, field-level merging with manual conflict resolution is the right call. CRDTs are overkill unless you’re building a collaborative editor. A serverless architecture simplifies the sync backend since this is event-driven work at its core.
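A sketch of field-level merging against a common base version; the three-way record shape and the conflict handling are assumptions about your data model:

```js
// Merge two divergent copies field by field; only fields both sides changed
// (to different values) become real conflicts that need a human decision.
function mergeRecords(base, local, remote) {
  const merged = { ...base };
  const conflicts = [];
  for (const field of new Set([...Object.keys(local), ...Object.keys(remote)])) {
    const localChanged = local[field] !== base[field];
    const remoteChanged = remote[field] !== base[field];
    if (localChanged && remoteChanged && local[field] !== remote[field]) {
      conflicts.push(field);         // surface it, don't guess
      merged[field] = remote[field]; // provisional value until resolved
    } else {
      merged[field] = localChanged ? local[field] : remote[field];
    }
  }
  return { merged, conflicts };
}
```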

Cache Versioning Across Deployments

You deploy a new version. The new service worker installs, old caches get cleaned. But the user’s current page references old asset filenames that no longer exist in the precache or on the CDN. Every active session breaks quietly. New supplies arrive. Old labels on the shelves. Nothing matches.

The stale cache lockout: a deployed auth change that the cache doesn’t know about. Users load the cached login page, try to authenticate against an endpoint that no longer accepts the old token format, and get locked out. Eating from last month’s pantry. The food is fine. The expiry date on the packaging changed. The fix is cache versioning with clear refresh prompts. Most teams learn this the hard way from their first post-deploy support ticket flood.

Keep previous version assets available for 24-48 hours after deployment. Always use NetworkFirst for HTML documents. If the HTML is stale, every asset reference in it is potentially wrong.

Deploy order matters and isn’t optional: push new assets to CDN first, then deploy the new service worker, then clean up old assets after TTL expires. Stock the new supplies. Then rotate the pantry. Then throw out the old stock. Reverse this ordering and you create a window where the worker references assets that don’t exist yet.
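A sketch of the service worker side of that ordering: caches carry the version in their name, and activate deletes everything that isn’t current. The version tag is illustrative:

```js
// sw.js
const VERSION = 'v42';
const CURRENT_CACHES = [`precache-${VERSION}`, `runtime-${VERSION}`];

self.addEventListener('activate', (event) => {
  event.waitUntil(
    caches.keys().then((names) =>
      Promise.all(
        names
          .filter((name) => !CURRENT_CACHES.includes(name)) // anything from older deploys
          .map((name) => caches.delete(name))
      )
    )
  );
});
```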

App Shell vs. Streaming SSR

| Criteria | App shell | Streaming SSR |
| --- | --- | --- |
| Repeat visit speed | Near-instant (shell cached) | Fast (full page cached per route) |
| JS dependency | Content requires JS execution | Content visible without JS |
| Cache storage | Low (one shell for all routes) | Higher (full page per route) |
| Offline content | Shell only, content needs sync | Full page content available offline |
| Best for | Interactive, JS-driven apps | Content-heavy, article-based sites |

App shell caches a minimal HTML skeleton and fills content with JavaScript. Near-instant paint on repeat visits, but content needs JS to run first. The cabin with walls and a roof. Furniture arrives later. On a mid-range phone, the browser still has to parse, compile, and run JavaScript before any content shows up. The cached shell gives you a fast frame. The content inside still waits.

Streaming SSR with navigation preload serves the cached header while streaming fresh page content from the server. The cabin fully furnished on arrival. No JavaScript required for initial content display. More cache storage per route, but each route loads with real content offline.
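A sketch of the fetch handler for that setup; the offline fallback path is an assumption:

```js
// sw.js — let the browser start the navigation request in parallel with SW boot,
// serve the streamed response, and fall back to cache when the network fails.
self.addEventListener('activate', (event) => {
  event.waitUntil(self.registration.navigationPreload?.enable());
});

self.addEventListener('fetch', (event) => {
  if (event.request.mode !== 'navigate') return;
  event.respondWith((async () => {
    try {
      const preloaded = await event.preloadResponse; // fresh HTML, already streaming
      if (preloaded) return preloaded;
      return await fetch(event.request);
    } catch {
      const cached = await caches.match(event.request);
      return cached || caches.match('/offline.html'); // assumed offline page
    }
  })());
});
```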

Content-heavy sites: streaming SSR. Interactive JS-driven applications: app shell. Good UI/UX engineering makes offline states feel intentional, not broken.

Testing Offline Behavior in CI

Playwright supports service worker interception and network emulation out of the box. Write tests that install the worker, check precache, go offline and navigate, then simulate a deploy with a new version and verify the update flow. Run these in CI, not by hand.
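A sketch of the offline smoke test; the app URL and selector are assumptions about the app under test:

```js
import { test, expect } from '@playwright/test';

test('app shell loads offline after first visit', async ({ page, context }) => {
  await page.goto('https://app.example.com/');
  // Wait until a service worker controls the page, i.e. the precache is populated.
  await page.evaluate(async () => { await navigator.serviceWorker.ready; });

  await context.setOffline(true);
  await page.reload();

  await expect(page.locator('#app')).toBeVisible(); // shell served from cache
});
```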

For IndexedDB, seed with real-world volumes. A test that writes 10 records never hits quota limits. 50,000 records with realistic payloads show you the cliffs. Test on WebKit specifically. Safari’s IndexedDB handles transactions differently and clears data more aggressively than Chromium. The cabin that runs differently in winter.

Network testing goes beyond just online/offline. Use Playwright’s route.abort() to fake partial failures: API returning 502, CDN serving stale assets, WebSocket dropped but HTTP still working. These messy states expose gaps that clean on/off testing misses completely. The road that’s passable for cars but not trucks. Progressive web app offline testing in CI catches service worker regressions before they reach users.
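Continuing the same test file, one partial-failure case as a sketch (the route pattern, page URL, and UI copy are assumptions):

```js
test('queued writes survive an API outage', async ({ page }) => {
  // Static assets still load; only the API is down.
  await page.route('**/api/**', (route) => route.fulfill({ status: 502, body: 'Bad Gateway' }));
  await page.goto('https://app.example.com/notes');
  await page.getByRole('button', { name: 'Save' }).click();
  // The UI should admit the write is queued, not pretend it succeeded.
  await expect(page.getByText(/saved offline|will sync/i)).toBeVisible();
});
```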

[Diagram: PWA testing in CI — service worker lifecycle test (install, activate, update, cache populated correctly), offline simulation (does the app still work?), per-resource cache strategy checks, and a Lighthouse PWA audit that blocks the deploy below a score of 90. Test offline in CI or discover it is broken from a user complaint on a train.]

What the Industry Gets Wrong About Progressive Web Apps

“Service workers make apps work offline automatically.” Service workers cache resources. They don’t handle conflict resolution, quota limits, or the UX of telling users their data hasn’t synced yet. Stocking the pantry is the easy part. Managing inventory, expiry dates, and two people reaching for the last can is the actual work.

“Background sync handles offline writes.” On mobile, browsers kill idle service workers aggressively. If the write was only in memory and not persisted to IndexedDB, it vanishes. Supplies that weren’t packed before the storm hit. Background sync is a delivery mechanism, not a persistence layer. Treat it as an unreliable transport and persist everything locally first.
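A sketch of that ordering: the write already lives in the IndexedDB outbox, the sync event only drains it, and each request carries an idempotency key so quiet retries can’t duplicate submissions. The outbox helpers and header name are assumptions:

```js
// sw.js
self.addEventListener('sync', (event) => {
  if (event.tag === 'flush-outbox') event.waitUntil(flushOutbox());
});

async function flushOutbox() {
  const pending = await readOutbox(); // assumed helper: reads queued writes from IndexedDB
  for (const write of pending) {
    const res = await fetch('/api/sync', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Idempotency-Key': write.id, // server dedupes the retries the user never sees
      },
      body: JSON.stringify(write.payload),
    });
    if (res.ok) await removeFromOutbox(write.id); // assumed helper: delete only after the server confirms
    else break;                                   // leave the rest for the next sync event
  }
}
```

The page registers the tag with registration.sync.register('flush-outbox') after each local write; if the browser never fires the sync event, the outbox is still sitting in IndexedDB on the next launch.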

Our take: Chrome DevTools’ offline toggle simulates a clean disconnect. The road cleanly washed out. Real users deal with spotty connections flipping between online and offline several times a minute. The road that’s passable one minute and flooded the next. Cache strategies that pass DevTools testing break under this flicker because the cache and network race each other unpredictably. Test with real cellular throttling profiles, not the binary on/off switch.

That flawless demo? No cache versioning. No sync conflict resolution. No quota handling. No service worker update prompts. The camping trip with perfect weather. Every one of those is load-bearing in production. The demo worked because it ran once, on fast WiFi, with fresh caches. Your users won’t be that generous. A cloud-native architecture gives the backend resilience your offline-first frontend needs for reliable sync and asset delivery. The road back to town. Make sure it’s paved.

Ship Offline-First That Survives Real Networks

The demo always works on conference WiFi. Production means flaky connections, stale caches breaking login flows, and sync conflicts at scale. Cache versioning, background sync, and conflict resolution handle the edge cases demos never reveal.

Frequently Asked Questions

What percentage of PWA issues come from caching strategy mistakes?

Most production PWA bugs come from wrong caching choices. Cache-first on API responses serves stale data. Network-first without timeouts makes slow connections feel frozen. Stale-while-revalidate without cache versioning serves broken assets after deploys. The caching strategy has to match how often the resource changes, and most teams slap one strategy on everything.

How much does a service worker add to initial page load time?

A well-built service worker adds barely any overhead to first registration on modern devices. The real concern isn’t registration but what it does on later visits. Cache-first for static assets can make repeat visits much faster since assets load from disk instead of the network. The cost comes from bloated precache lists that download megabytes of assets the user may never need.

What is the maximum reliable size for IndexedDB storage in production?

Browsers give IndexedDB storage based on available disk space, usually up to 50% of free space per origin. In practice, reliable cross-browser storage tops out at 100-200MB before browsers start clearing data. Safari is the tight spot: roughly 1GB on desktop but it wipes data after 7 days of inactivity on iOS. Design your offline data around 50MB of actively managed structured data and you’ll dodge quota issues on almost every device.
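A small sketch of that active management: check headroom before large writes and ask the browser for persistence (the thresholds you act on are up to you):

```js
async function storageHeadroom() {
  if (!navigator.storage?.estimate) return null; // older browsers: no visibility
  const { usage, quota } = await navigator.storage.estimate();
  return { usedMB: usage / 1e6, quotaMB: quota / 1e6, ratio: usage / quota };
}

// Ask the browser not to evict this origin under storage pressure (best effort).
navigator.storage?.persist?.();
```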

How long does it take to implement production-grade offline sync?

Background Sync API registration takes a few hours. Production-grade conflict resolution takes 4-8 weeks. The API is the easy part. The hard work is figuring out merge rules for concurrent edits, building retry queues with backoff, handling partial sync failures where some records land and others don’t, and testing against every combination of connection state. Teams that skip conflict resolution end up with corrupted data within the first month of production.

Does a PWA need an app shell architecture or can SSR pages work offline?

Both work, but with different trade-offs. App shell caches a minimal HTML skeleton and fills content with JavaScript. Repeat visits are nearly instant, but nothing shows without JS running. Streaming SSR with navigation preload caches full server-rendered pages. Content shows without JavaScript, but you use more cache storage per route. App shell fits interactive apps. SSR caching fits content-heavy sites.