Progressive Web Apps: Offline-First That Works

Metasphere Engineering · 14 min read

You ship a progressive web app. The demo is flawless. Airplane mode, kill the connection, toggle it back. The app loads instantly, data persists, sync resolves. Your team celebrates.

The camping trip where everything went perfectly. Clear skies. Road open. Supplies fresh.

Then production happens. A user on a spotty train connection submits a form three times because the service worker retried quietly and the UI showed nothing. The cache served a stale login page after your auth deploy, locking users out for an afternoon. IndexedDB hit its quota limit on a 32GB iPhone crammed with photos. The W3C Service Workers specification defines the caching primitives. The failure modes? Those are entirely yours. The web.dev PWA guides cover the happy path. Production teaches you the unhappy ones.

The road washed out. The pantry had expired food. The journal had two conflicting entries. Nobody planned for the storm.

Key takeaways
  • Service workers retry quietly by default. On a spotty connection, the user submits a form three times and the UI shows nothing. Idempotency keys and UI feedback are mandatory, not nice-to-haves.
  • Stale caches serve old auth pages after deploys, locking users out. Cache versioning with skipWaiting() and clients.claim() must be deliberate, not automatic.
  • IndexedDB quotas vary wildly by device. A 32GB iPhone with photos and podcasts may have less than 500MB available. Handle quota errors gracefully or lose offline data quietly.
  • Conflict resolution for offline writes is the hardest problem. Two devices edit the same record offline. Last-write-wins destroys data. Field-level merging or CRDTs are the real answer.
  • Background sync only works while the service worker is alive. If the browser kills it (common on mobile), queued writes vanish unless persisted to IndexedDB first.
Prerequisites
  1. HTTPS configured for all origins (service workers require secure contexts)
  2. Build pipeline generates content-hashed filenames for all static assets
  3. At least one library wrapping IndexedDB (Dexie.js or idb) in the data layer
  4. Playwright or equivalent set up for service worker interception in CI
  5. Cache storage budget defined per origin (target under 50MB for reliable cross-browser support)
[Diagram: Service Worker Lifecycle, Registration to Fetch — 1. register with navigator.serviceWorker.register('/sw.js') over HTTPS, scoped to the origin; 2. install precaches critical assets (app shell HTML, hashed JS and CSS) inside waitUntil(); 3. activate cleans old caches and calls clients.claim(); 4. fetch interception routes requests cache-first with a network fallback. Each phase gates the next.]

The Service Worker Lifecycle Nobody Explains Well

Most bugs live in the transitions between install, activate, and fetch. The supplies manager’s shift change. A new service worker doesn’t activate right away. If an old one still controls the page, the new one waits until every tab running the old worker closes. By design, this stops split cache versions from serving mixed assets to the same user.

Most users never close their tabs. Your deployment sits in the waiting state for days. The new supplies manager waiting in the lobby. The old one still running the pantry.

Calling skipWaiting() forces the new worker to take over right away. But the page loaded with old assets, and the new worker is fetching with new routing rules. If your HTML points to a CSS file that the new precache renamed, the page breaks mid-session. No error. No warning. Just a broken layout the user has to refresh away. The new manager reorganized the pantry while people were eating. Plates break.

Safe pattern: prompt the user when a new worker is waiting, trigger skipWaiting() only when they accept, pair it with clients.claim() in the activate handler, and reload when the controlling worker changes. Workbox’s workbox-window provides exactly this flow via its waiting and controlling events. Never quietly swap cache versions under active sessions.
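A minimal sketch of that flow with workbox-window; the file names and the prompt UI are assumptions, not part of any particular app:

```js
// main.js — register the worker, prompt on 'waiting', reload on 'controlling'
import { Workbox } from 'workbox-window';

const wb = new Workbox('/sw.js');

wb.addEventListener('waiting', () => {
  // Replace with your own prompt UI; confirm() is only a placeholder.
  if (confirm('A new version is available. Refresh now?')) {
    wb.messageSkipWaiting(); // sends { type: 'SKIP_WAITING' } to the waiting worker
  }
});

wb.addEventListener('controlling', () => {
  window.location.reload(); // the new worker now owns the page: reload onto fresh assets
});

wb.register();
```

```js
// sw.js — take over only when told to, then claim open clients
self.addEventListener('message', (event) => {
  if (event.data?.type === 'SKIP_WAITING') self.skipWaiting();
});

self.addEventListener('activate', (event) => {
  event.waitUntil(self.clients.claim());
});
```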

The lifecycle determines when your worker activates. What it does with requests after activation is the next minefield entirely.

Caching Strategies and When Each One Breaks

The wrong strategy on the wrong resource type causes more production PWA bugs than anything else. Packing the wrong supplies for the wrong trip.

| Strategy | Best for | Failure mode | Timeout |
| --- | --- | --- | --- |
| Cache-first | Versioned static assets (hashed filenames) | Serves stale data forever on API responses | None needed |
| Network-first | HTML documents, fresh API data | Hangs 30+ seconds on slow connections without timeout | 3-5s for documents |
| Stale-while-revalidate | Avatars, metadata, non-critical lists | UI renders against stale data after breaking API change | N/A |
| Network-only | Auth, payments, mutations | No offline fallback at all | Connection timeout |
| Cache-only | App shell (guaranteed instant load) | Missing asset = hard failure | N/A |

Cache-first serves from cache, hits network only on miss. Perfect for versioned static assets with content hashes in filenames. Apply it to API responses and you serve stale data forever. Eating canned food that expired last year because nobody checked the label. Apply it to HTML and you serve yesterday’s layout after a deployment.

Network-first hits the network, falls back to cache on failure. The right choice for HTML and APIs where freshness matters. Go to town for fresh supplies. If the road is out, eat from the pantry. Without an explicit timeout, a slow connection hangs 30+ seconds while the cache has a perfectly good response sitting unused. Standing at the road waiting for the delivery truck in a snowstorm. Set network timeouts to 3-5 seconds for documents.

Stale-while-revalidate serves from cache right away and updates in the background. Sounds perfect. Causes the most subtle bugs. Deploy a breaking API change and the UI renders against stale data until the user navigates again. Serving yesterday’s menu while the kitchen preps today’s specials. Fine for avatars and metadata. A trap for anything where staleness causes functional breakage.

[Diagram: Cache Strategy Decision Matrix — versioned assets (the hash invalidates automatically) → cache-first; HTML documents → network-first with a 3-5s timeout; stale-tolerant APIs such as avatars and metadata → stale-while-revalidate; fresh APIs and auth endpoints → network-only. The wrong strategy on the wrong resource is the #1 PWA production bug.]
Anti-pattern

Don’t: Apply a single caching strategy globally across all routes. registerRoute(/.*/, new CacheFirst()) serves stale auth tokens and old HTML forever. Canning everything and calling it a pantry.

Do: Match strategy to resource volatility. Hashed assets get cache-first. HTML gets network-first with a 3-5 second timeout. Auth endpoints get network-only. Every route needs its own strategy. Every supply has its own shelf life.
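A sketch of that per-route split with Workbox; the route patterns and cache names are assumptions about a typical app:

```js
// sw.js — one strategy per resource type, never one strategy for everything
import { registerRoute } from 'workbox-routing';
import { CacheFirst, NetworkFirst, NetworkOnly } from 'workbox-strategies';

// Hashed JS/CSS/fonts: the content hash is the invalidation, cache freely.
registerRoute(
  ({ request }) => ['script', 'style', 'font'].includes(request.destination),
  new CacheFirst({ cacheName: 'static-assets' })
);

// HTML navigations: freshness first, cache fallback, never hang on slow links.
registerRoute(
  ({ request }) => request.mode === 'navigate',
  new NetworkFirst({ cacheName: 'pages', networkTimeoutSeconds: 4 })
);

// Auth and payments: no cache, ever.
registerRoute(
  ({ url }) => url.pathname.startsWith('/api/auth/'),
  new NetworkOnly()
);
```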

Workbox Configuration That Ships

Workbox’s defaults are demo defaults. The build tool scans your output folder and precaches everything. Every JS chunk, image, font. On a mid-range phone over 3G, that download takes 30+ seconds before the worker even starts working. Packing the entire supermarket into the cabin. Limit precaching to the essentials: shell HTML, main JS bundle, core CSS, primary font. Everything else goes into runtime caching with explicit expiration. Pack essentials. Forage the rest as needed.

Runtime caches need expiration rules. Without ExpirationPlugin, your runtime cache grows forever. A pantry with no expiry dates. Set maxEntries and maxAgeSeconds on every runtime cache route. No exceptions. A cache that only grows is a quota error waiting to happen.
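A sketch of a bounded runtime cache; the avatar route and the limits are illustrative:

```js
import { registerRoute } from 'workbox-routing';
import { StaleWhileRevalidate } from 'workbox-strategies';
import { ExpirationPlugin } from 'workbox-expiration';

registerRoute(
  ({ url }) => url.pathname.startsWith('/avatars/'),
  new StaleWhileRevalidate({
    cacheName: 'avatars',
    plugins: [
      new ExpirationPlugin({
        maxEntries: 100,                 // oldest entries evicted past this count
        maxAgeSeconds: 7 * 24 * 60 * 60, // one week shelf life
        purgeOnQuotaError: true,         // give this cache up first if the origin hits quota
      }),
    ],
  })
);
```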

IndexedDB and the Sync Problem

IndexedDB is the only browser storage API that works for structured offline data at any real scale. The journal. The raw API is notoriously ugly. Libraries like Dexie.js and idb wrap it in a Promise-based interface that humans can actually use.

For offline-first apps, treat IndexedDB as the main data store. Reads and writes always hit local first. Write everything in the journal. A sync engine pushes changes when the connection comes back and pulls remote changes on a schedule. Send the journal entries to headquarters when the road reopens.
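A minimal local-first write path with Dexie.js; the table names and record shape are assumptions:

```js
import Dexie from 'dexie';

// Every write lands in IndexedDB plus an outbox row the sync engine drains later.
const db = new Dexie('app-db');
db.version(1).stores({
  notes: 'id, updatedAt',
  outbox: '++seq, entityId',
});

export async function saveNote(note) {
  const record = { ...note, updatedAt: Date.now() };
  await db.transaction('rw', db.notes, db.outbox, async () => {
    await db.notes.put(record);                                    // reads stay local and instant
    await db.outbox.add({ entityId: record.id, payload: record }); // queued for the sync engine
  });
  return record;
}
```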

[Diagram: Local-first data flow — user actions read and write IndexedDB first, writes are queued locally, background sync pushes the queue and resolves conflicts when the connection returns, and the server ends up consistent across devices. Reads are always fast, writes never block, sync happens when connectivity allows.]

Storing data locally is the easy part. Packing the journal is easy. What happens when two copies diverge is where offline-first gets genuinely hard. Two people writing different entries on the same journal page while the phone lines are cut.

Last-write-wins is the default conflict strategy and the default source of data loss. User A edits offline on their phone. User B edits on their laptop. Whichever syncs last quietly overwrites the other. For user preferences, fine. For anything collaborative, it destroys trust. (The entry that vanished. Nobody knows why.)

| Strategy | Complexity | Best for | Limitation |
| --- | --- | --- | --- |
| Last-write-wins | Low | Preferences, settings | Quietly drops concurrent edits |
| Field-level merge | Medium | Forms, profiles, records | True conflicts still need manual resolution |
| Operational transforms | High | Collaborative text editing | Requires ordered operation logs |
| CRDTs | High | Real-time collaboration at scale | Complex to build, limited data structure support |

For most apps, field-level merging with manual conflict resolution is the right call. CRDTs are overkill unless you’re building a collaborative editor. A serverless architecture simplifies the sync backend since this is event-driven work at its core.
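A sketch of field-level merging against a common base version; the three-way record shape and the conflict handling are assumptions about your data model:

```js
// Merge two divergent copies field by field; only fields both sides changed
// (to different values) become real conflicts that need a human decision.
function mergeRecords(base, local, remote) {
  const merged = { ...base };
  const conflicts = [];
  for (const field of new Set([...Object.keys(local), ...Object.keys(remote)])) {
    const localChanged = local[field] !== base[field];
    const remoteChanged = remote[field] !== base[field];
    if (localChanged && remoteChanged && local[field] !== remote[field]) {
      conflicts.push(field);         // surface it, don't guess
      merged[field] = remote[field]; // provisional value until resolved
    } else {
      merged[field] = localChanged ? local[field] : remote[field];
    }
  }
  return { merged, conflicts };
}
```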

Cache Versioning Across Deployments

You deploy a new version. The new service worker installs, old caches get cleaned. But the user’s current page references old asset filenames that no longer exist in the precache or on the CDN. Every active session breaks quietly. New supplies arrive. Old labels on the shelves. Nothing matches.

The stale cache lockout: a deployed auth change that the cache doesn’t know about. Users load the cached login page, try to authenticate against an endpoint that no longer accepts the old token format, and get locked out. Eating from last month’s pantry. The food is fine. The expiry date on the packaging changed. The fix is cache versioning with clear refresh prompts. Most teams learn this the hard way from their first post-deploy support ticket flood.

Keep previous version assets available for 24-48 hours after deployment. Always use NetworkFirst for HTML documents. If the HTML is stale, every asset reference in it is potentially wrong.

Deploy order matters and isn’t optional: push new assets to CDN first, then deploy the new service worker, then clean up old assets after TTL expires. Stock the new supplies. Then rotate the pantry. Then throw out the old stock. Reverse this ordering and you create a window where the worker references assets that don’t exist yet.
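A sketch of the service worker side of that ordering: caches carry the version in their name, and activate deletes everything that isn’t current. The version tag is illustrative:

```js
// sw.js
const VERSION = 'v42';
const CURRENT_CACHES = [`precache-${VERSION}`, `runtime-${VERSION}`];

self.addEventListener('activate', (event) => {
  event.waitUntil(
    caches.keys().then((names) =>
      Promise.all(
        names
          .filter((name) => !CURRENT_CACHES.includes(name)) // anything from older deploys
          .map((name) => caches.delete(name))
      )
    )
  );
});
```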

App Shell vs. Streaming SSR

| Criteria | App shell | Streaming SSR |
| --- | --- | --- |
| Repeat visit speed | Near-instant (shell cached) | Fast (full page cached per route) |
| JS dependency | Content requires JS execution | Content visible without JS |
| Cache storage | Low (one shell for all routes) | Higher (full page per route) |
| Offline content | Shell only, content needs sync | Full page content available offline |
| Best for | Interactive, JS-driven apps | Content-heavy, article-based sites |

App shell caches a minimal HTML skeleton and fills content with JavaScript. Near-instant paint on repeat visits, but content needs JS to run first. The cabin with walls and a roof. Furniture arrives later. On a mid-range phone, the browser still has to parse, compile, and run JavaScript before any content shows up. The cached shell gives you a fast frame. The content inside still waits.

Streaming SSR with navigation preload serves the cached header while streaming fresh page content from the server. The cabin fully furnished on arrival. No JavaScript required for initial content display. More cache storage per route, but each route loads with real content offline.
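A sketch of the fetch handler for that setup; the offline fallback path is an assumption:

```js
// sw.js — let the browser start the navigation request in parallel with SW boot,
// serve the streamed response, and fall back to cache when the network fails.
self.addEventListener('activate', (event) => {
  event.waitUntil(self.registration.navigationPreload?.enable());
});

self.addEventListener('fetch', (event) => {
  if (event.request.mode !== 'navigate') return;
  event.respondWith((async () => {
    try {
      const preloaded = await event.preloadResponse; // fresh HTML, already streaming
      if (preloaded) return preloaded;
      return await fetch(event.request);
    } catch {
      const cached = await caches.match(event.request);
      return cached || caches.match('/offline.html'); // assumed offline page
    }
  })());
});
```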

Content-heavy sites: streaming SSR. Interactive JS-driven applications: app shell. Good UI/UX engineering makes offline states feel intentional, not broken.

Testing Offline Behavior in CI

Playwright supports service worker interception and network emulation out of the box. Write tests that install the worker, check precache, go offline and navigate, then simulate a deploy with a new version and verify the update flow. Run these in CI, not by hand.
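A sketch of the offline smoke test; the app URL and selector are assumptions about the app under test:

```js
import { test, expect } from '@playwright/test';

test('app shell loads offline after first visit', async ({ page, context }) => {
  await page.goto('https://app.example.com/');
  // Wait until a service worker controls the page, i.e. the precache is populated.
  await page.evaluate(async () => { await navigator.serviceWorker.ready; });

  await context.setOffline(true);
  await page.reload();

  await expect(page.locator('#app')).toBeVisible(); // shell served from cache
});
```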

For IndexedDB, seed with real-world volumes. A test that writes 10 records never hits quota limits. 50,000 records with realistic payloads show you the cliffs. Test on WebKit specifically. Safari’s IndexedDB handles transactions differently and clears data more aggressively than Chromium. The cabin that runs differently in winter.

Network testing goes beyond just online/offline. Use Playwright’s route.abort() to fake partial failures: API returning 502, CDN serving stale assets, WebSocket dropped but HTTP still working. These messy states expose gaps that clean on/off testing misses completely. The road that’s passable for cars but not trucks. Progressive web app offline testing in CI catches service worker regressions before they reach users.
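Continuing the same test file, one partial-failure case as a sketch (the route pattern, page URL, and UI copy are assumptions):

```js
test('queued writes survive an API outage', async ({ page }) => {
  // Static assets still load; only the API is down.
  await page.route('**/api/**', (route) => route.fulfill({ status: 502, body: 'Bad Gateway' }));
  await page.goto('https://app.example.com/notes');
  await page.getByRole('button', { name: 'Save' }).click();
  // The UI should admit the write is queued, not pretend it succeeded.
  await expect(page.getByText(/saved offline|will sync/i)).toBeVisible();
});
```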

[Diagram: PWA testing in CI — service worker lifecycle test (install, activate, update, cache populated correctly), offline simulation (does the app still work?), per-resource cache strategy checks, and a Lighthouse PWA audit that blocks the deploy below a score of 90. Test offline in CI or discover it is broken from a user complaint on a train.]

What the Industry Gets Wrong About Progressive Web Apps

“Service workers make apps work offline automatically.” Service workers cache resources. They don’t handle conflict resolution, quota limits, or the UX of telling users their data hasn’t synced yet. Stocking the pantry is the easy part. Managing inventory, expiry dates, and two people reaching for the last can is the actual work.

“Background sync handles offline writes.” On mobile, browsers kill idle service workers aggressively. If the write was only in memory and not persisted to IndexedDB, it vanishes. Supplies that weren’t packed before the storm hit. Background sync is a delivery mechanism, not a persistence layer. Treat it as an unreliable transport and persist everything locally first.
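A sketch of that ordering: the write already lives in the IndexedDB outbox, the sync event only drains it, and each request carries an idempotency key so quiet retries can’t duplicate submissions. The outbox helpers and header name are assumptions:

```js
// sw.js
self.addEventListener('sync', (event) => {
  if (event.tag === 'flush-outbox') event.waitUntil(flushOutbox());
});

async function flushOutbox() {
  const pending = await readOutbox(); // assumed helper: reads queued writes from IndexedDB
  for (const write of pending) {
    const res = await fetch('/api/sync', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Idempotency-Key': write.id, // server dedupes the retries the user never sees
      },
      body: JSON.stringify(write.payload),
    });
    if (res.ok) await removeFromOutbox(write.id); // assumed helper: delete only after the server confirms
    else break;                                   // leave the rest for the next sync event
  }
}
```

The page registers the tag with registration.sync.register('flush-outbox') after each local write; if the browser never fires the sync event, the outbox is still sitting in IndexedDB on the next launch.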

Our take: Chrome DevTools’ offline toggle simulates a clean disconnect. The road cleanly washed out. Real users deal with spotty connections flipping between online and offline several times a minute. The road that’s passable one minute and flooded the next. Cache strategies that pass DevTools testing break under this flicker because the cache and network race each other unpredictably. Test with real cellular throttling profiles, not the binary on/off switch.

That flawless demo? No cache versioning. No sync conflict resolution. No quota handling. No service worker update prompts. The camping trip with perfect weather. Every one of those is load-bearing in production. The demo worked because it ran once, on fast WiFi, with fresh caches. Your users won’t be that generous. A cloud-native architecture gives the backend resilience your offline-first frontend needs for reliable sync and asset delivery. The road back to town. Make sure it’s paved.

Ship Offline-First That Survives Real Networks

The demo always works on conference WiFi. Production means flaky connections, stale caches breaking login flows, and sync conflicts at scale. Cache versioning, background sync, and conflict resolution handle the edge cases demos never reveal.

Frequently Asked Questions

What percentage of PWA issues come from caching strategy mistakes?

Most production PWA bugs come from wrong caching choices. Cache-first on API responses serves stale data. Network-first without timeouts makes slow connections feel frozen. Stale-while-revalidate without cache versioning serves broken assets after deploys. The caching strategy has to match how often the resource changes, and most teams slap one strategy on everything.

How much does a service worker add to initial page load time?

A well-built service worker adds barely any overhead to first registration on modern devices. The real concern isn’t registration but what it does on later visits. Cache-first for static assets can make repeat visits much faster since assets load from disk instead of the network. The cost comes from bloated precache lists that download megabytes of assets the user may never need.

What is the maximum reliable size for IndexedDB storage in production?

Browsers give IndexedDB storage based on available disk space, usually up to 50% of free space per origin. In practice, reliable cross-browser storage tops out at 100-200MB before browsers start clearing data. Safari is the tight spot: roughly 1GB on desktop but it wipes data after 7 days of inactivity on iOS. Design your offline data around 50MB of actively managed structured data and you’ll dodge quota issues on almost every device.
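A small sketch of that active management: check headroom before large writes and ask the browser for persistence (the thresholds you act on are up to you):

```js
async function storageHeadroom() {
  if (!navigator.storage?.estimate) return null; // older browsers: no visibility
  const { usage, quota } = await navigator.storage.estimate();
  return { usedMB: usage / 1e6, quotaMB: quota / 1e6, ratio: usage / quota };
}

// Ask the browser not to evict this origin under storage pressure (best effort).
navigator.storage?.persist?.();
```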

How long does it take to implement production-grade offline sync?

Background Sync API registration takes a few hours. Production-grade conflict resolution takes 4-8 weeks. The API is the easy part. The hard work is figuring out merge rules for concurrent edits, building retry queues with backoff, handling partial sync failures where some records land and others don’t, and testing against every combination of connection state. Teams that skip conflict resolution end up with corrupted data within the first month of production.

Does a PWA need an app shell architecture or can SSR pages work offline?

Both work, but with different trade-offs. App shell caches a minimal HTML skeleton and fills content with JavaScript. Repeat visits are nearly instant, but nothing shows without JS running. Streaming SSR with navigation preload caches full server-rendered pages. Content shows without JavaScript, but you use more cache storage per route. App shell fits interactive apps. SSR caching fits content-heavy sites.