Design Systems: From Figma File to Production Infrastructure
You know this story because you’ve lived it. Your organization spent over a year building a design system. Designers poured months into the Figma library. The launch blog post was great. And within six months, only two product teams were actually using it. Everyone else forked their own button components. The official version didn’t support the variants they needed. A patch update broke their layouts with no changelog. The contribution process had a six-week review queue that nobody was assigned to run. Three strikes and the teams walked.
- Design systems fail because of engineering, not design. The Figma library is never the problem. Missing versioning, broken releases, and bottlenecked contribution processes are.
- Semantic versioning is non-negotiable. A patch update that breaks layouts without a changelog destroys adoption faster than bad design.
- Visual regression tests catch what unit tests miss. A 2px padding change in a shared component cascades across 40 consuming applications. Screenshot diffing catches it.
- Contribution processes must be fast or teams will fork. Six-week review queues guarantee that product teams build their own components. Target 5-day turnaround.
- Treat the component library as production infrastructure. CI pipelines, automated releases, deprecation timelines, breaking change policies. Same rigor as any shared service.
Brad Frost’s Atomic Design defines the composition model. But most design systems don’t die because the composition model was wrong. They die because nobody treated the component library as production infrastructure.
Design Tokens as the Foundation
If you take one thing from this article, make it this: design tokens must be the source of truth for all visual decisions. Not #1A2980 but --color-brand-primary. Not 16px but --spacing-md. Token values can change freely. Semantic names should not.
The distinction between semantic and literal token names is load-bearing. Semantic names (--color-interactive-primary) let you swap entire themes by remapping references. Literal names (--color-blue-500) bake a specific value into every consumer, forcing code changes when the brand evolves. One approach scales. The other creates cross-platform drift that compounds with every new surface.
| Token Tier | Examples | Purpose | Who Maintains |
|---|---|---|---|
| Primitive | blue-500: #1A2980, gray-900: #1A1A2E, white: #FFFFFF | Raw values. The palette. Never referenced directly in components | Design team defines the palette |
| Semantic | color-interactive-primary, color-surface-default, color-text-primary | Intent-based names. Components use these | Design + engineering agree on naming |
| Theme resolution | Light: interactive = blue-500, surface = white, text = gray-900 | Maps semantic tokens to primitives per theme | Automated via Style Dictionary |
| Theme resolution | Dark: interactive = blue-500, surface = gray-900, text = white | Same semantic names, different primitive values | Same automation, different input file |
Components reference semantic tokens only. Theme switches swap the resolution layer. No component code changes.
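The three-tier model in the table above can be sketched in a few lines. This is an illustrative sketch, not a prescribed taxonomy: the token names and the `resolve` helper are assumptions for the example, and real pipelines generate this mapping rather than hand-writing it.

```javascript
// Tier 1: primitives. Raw palette values, never referenced by components.
const primitives = {
  'blue-500': '#1A2980',
  'gray-900': '#1A1A2E',
  'white': '#FFFFFF',
};

// Tier 3: theme resolution. Same semantic names, different primitives.
const themes = {
  light: {
    'color-interactive-primary': 'blue-500',
    'color-surface-default': 'white',
    'color-text-primary': 'gray-900',
  },
  dark: {
    'color-interactive-primary': 'blue-500',
    'color-surface-default': 'gray-900',
    'color-text-primary': 'white',
  },
};

// Tier 2: components only ever ask for a semantic name.
function resolve(theme, semanticName) {
  const primitive = themes[theme][semanticName];
  if (!primitive) throw new Error(`Unknown token: ${semanticName}`);
  return primitives[primitive];
}
```

Switching from light to dark swaps the resolution map; every component that asked for `color-surface-default` picks up the new value with zero code changes.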
Style Dictionary generates CSS, Swift, and Android outputs from one source. The design tokens guide covers the full pipeline.
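A minimal Style Dictionary config shows the shape of that pipeline: one token source, multiple platform outputs. The transform groups and formats below are standard Style Dictionary names, but the paths and file names are assumptions for this sketch; adapt them to your token layout.

```javascript
// config.js — illustrative Style Dictionary config: one JSON token
// source compiled to CSS custom properties, Swift, and Android XML.
module.exports = {
  source: ['tokens/**/*.json'],
  platforms: {
    css: {
      transformGroup: 'css',
      buildPath: 'build/css/',
      files: [{ destination: 'variables.css', format: 'css/variables' }],
    },
    ios: {
      transformGroup: 'ios-swift',
      buildPath: 'build/ios/',
      files: [{ destination: 'StyleDictionary.swift', format: 'ios-swift/class.swift' }],
    },
    android: {
      transformGroup: 'android',
      buildPath: 'build/android/',
      files: [{ destination: 'colors.xml', format: 'android/resources' }],
    },
  },
};
```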
Component Library Versioning
An unversioned library where teams pull from main is a ticking time bomb. One engineer changes button padding. Forty teams inherit the change. Three of those teams have pixel-perfect layouts that break. None of them opted in.
| SemVer | What It Means for Components | Example | Consumer Impact |
|---|---|---|---|
| Patch (1.0.1) | Bug fix, no visual change | Fix button focus ring on Safari | Safe to auto-update |
| Minor (1.1.0) | New prop or variant, nothing removed | Add size="xl" to Button | No migration needed |
| Major (2.0.0) | Prop renamed, removed, or default changed | variant="primary" to intent="primary" | Migration guide required |
| Misclassified minor | Default spacing changed “because it still renders” | Padding change breaks 10 team layouts | Trust in the system destroyed |
“Non-breaking” is harder for components than APIs. A spacing change doesn’t break compilation but breaks layout. Changesets tooling forces explicit categorization per change, and that categorization is what keeps the contract honest. Without it, every minor feels like Russian roulette.
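The mechanism is small: each PR commits a changeset file stating the bump level in the author's own words, and the release tooling aggregates them into version numbers and a changelog. The package name below is illustrative, not a real package.

```
---
"@acme/design-system": minor
---

Add size="xl" to Button. Additive only; no existing props or defaults change.
```

A padding change that "still renders" now has to be argued for in writing as a patch, which is exactly the friction that catches misclassification before consumers do.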
For major versions, run both majors in parallel (v3 alongside v4) during a 3-6 month migration window. Codemods automate the bulk of mechanical changes. The remainder, the handful of edge cases that defy mechanical transformation, is where the real design conversations happen.
Visual Regression Testing
A spacing change in Button silently breaks 10 applications. Without visual regression, nobody hears about it for three weeks, until some team’s QA finds a shifted layout and opens a bug that nobody connects to the upstream change.
```javascript
// Playwright visual regression test for Button component.
import { test, expect } from '@playwright/test';

test('Button renders correctly across all variants', async ({ page }) => {
  await page.goto('/storybook/iframe.html?id=button--primary');
  await expect(page).toHaveScreenshot('button-primary.png', {
    maxDiffPixelRatio: 0.01, // fail if more than 1% of pixels differ
  });

  await page.goto('/storybook/iframe.html?id=button--secondary');
  await expect(page).toHaveScreenshot('button-secondary.png');

  await page.goto('/storybook/iframe.html?id=button--disabled');
  await expect(page).toHaveScreenshot('button-disabled.png');
});
```
Chromatic and Percy capture every Storybook state per PR. But coverage is only as good as your stories. No story for the error state means no visual baseline for the error state. Track story completeness alongside code coverage. Mature UI/UX engineering treats edge case stories (long text, empty content, RTL) as the surface where regressions actually hide. The normal states look fine. The weird states break silently.
Don’t: Treat visual regression tests as optional CI checks that authors can skip when they’re “confident” a change is safe. Confident authors produce the most dangerous regressions.
Do: Make visual regression a blocking gate on every PR that touches component code. Approved diffs update the baseline. Unapproved diffs block the merge. No exceptions, including for the design system team itself.
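The gate itself reduces to one comparison. A stripped-down sketch of the diff-ratio check, assuming two equally sized RGBA buffers; real diffing tools such as pixelmatch add anti-aliasing detection and perceptual color thresholds on top of this:

```javascript
// Fraction of pixels that differ between two same-sized RGBA buffers.
// A pixel counts as different if any of its channels differ.
function diffPixelRatio(baseline, candidate, channels = 4) {
  if (baseline.length !== candidate.length) {
    throw new Error('screenshot dimensions differ');
  }
  const pixels = baseline.length / channels;
  let diff = 0;
  for (let p = 0; p < pixels; p++) {
    for (let c = 0; c < channels; c++) {
      if (baseline[p * channels + c] !== candidate[p * channels + c]) {
        diff++;
        break; // count each differing pixel once
      }
    }
  }
  return diff / pixels;
}

// The blocking gate: pass only when the diff stays within tolerance.
function passesVisualGate(baseline, candidate, maxDiffPixelRatio = 0.01) {
  return diffPixelRatio(baseline, candidate) <= maxDiffPixelRatio;
}
```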
Accessibility as a CI Gate
Axe-core in Storybook catches 30-40% of WCAG issues automatically: missing alt text, insufficient contrast, improper ARIA roles. The rest (keyboard navigation patterns, screen reader announcement quality, focus trap behavior) requires documented manual acceptance criteria per component type. For accessibility engineering, these criteria are a non-negotiable definition of done.
Automated checks create a floor. Manual review raises the ceiling. Neither works alone. A component that passes axe-core but traps keyboard focus in a modal with no escape route is technically “accessible” by the linter and completely unusable by a screen reader user.
| Gate | What It Catches | Coverage | When It Runs |
|---|---|---|---|
| Automated (axe-core in CI) | Missing alt text, insufficient contrast, missing form labels, ARIA violations | 30-40% of WCAG issues. All the machine-checkable ones | Every PR. Blocks merge on failure |
| Storybook a11y addon | Component-level violations in isolation. Missing roles, keyboard traps | Additional 10-15%. Catches component context that page-level scanning misses | During development. Visual indicator in Storybook |
| Manual review checklist | Keyboard navigation flow, screen reader announcement order, focus management, cognitive load | Remaining 45-60%. The hard stuff that machines can’t evaluate | Pre-release. Required for new components and major changes |
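The automated tier of the table turns into a few lines of CI logic. This sketch assumes the standard axe-core result shape (a `violations` array whose entries carry an `impact` of minor, moderate, serious, or critical); the impact threshold is a policy choice, not an axe-core default.

```javascript
// CI gate over axe-core results: block the merge when any violation at
// or above the configured impact level is present.
const IMPACT_RANK = { minor: 0, moderate: 1, serious: 2, critical: 3 };

function blockingViolations(axeResults, minImpact = 'serious') {
  const threshold = IMPACT_RANK[minImpact];
  return axeResults.violations.filter(
    (v) => IMPACT_RANK[v.impact] >= threshold
  );
}

function a11yGatePasses(axeResults, minImpact = 'serious') {
  return blockingViolations(axeResults, minImpact).length === 0;
}
```

Starting the threshold at serious and ratcheting it down as the backlog clears is a common way to adopt the gate without blocking every PR on day one.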
Governance Without Bottleneck
The core team owns the platform, the quality standards, and 30-50 foundational components. Product teams contribute domain components through CI gates. No ticket queue. No review committee. If the PR passes the gates, it merges.
This only works if the gates are comprehensive and fast. Accessibility, visual regression, documentation coverage, and unit test thresholds all enforced automatically. If the automation catches regressions reliably, the human bottleneck evaporates. If the automation is thin, you’re back to manual review queues that stretch to six weeks.
ADRs (Architecture Decision Records) for significant choices prevent newcomers who weren’t in the room from relitigating them every quarter. CI/CD gates enforce the standards that committees can only debate.
| When to contribute to the system | When to keep it local |
|---|---|
| Used in 3+ products today | Used by one team only |
| Encodes a brand decision that must stay consistent | Experimental, likely to change significantly |
| Synchronization cost across codebases exceeds maintenance cost | Simple composition of existing components |
| Multiple teams are already building variants independently | Domain-specific with no cross-team relevance |
What the Industry Gets Wrong About Design Systems
“A Figma library is a design system.” A Figma library is a design artifact. A design system is production infrastructure. Versioned components. Automated visual regression tests. CI-gated releases. Contribution workflows. Teams that confuse the two launch a Figma library, declare victory, and wonder why adoption stalls at two teams.
“Enforce adoption top-down.” Mandating design system usage without making the system genuinely faster to use than building from scratch produces compliance without adoption. Teams import the component and override its styles. The design system “adoption” metric looks good. The actual consistency is worse than before because overrides are undocumented, untested, and invisible in any audit.
Two teams using the system six months after launch. The Figma library was never the problem. No semver. No visual regression. No contribution path. Design systems that survive are production infrastructure with the engineering to match. Build that engineering or watch adoption flatline.