
GitOps Beyond Kubernetes: Terraform, DBs, and Policy

Metasphere Engineering · 10 min read

2 AM. Your on-call engineer gets paged for a connectivity issue between two services. She adds an inbound rule to a security group directly in the AWS console. The incident resolves. The Slack thread goes quiet. Everyone goes back to sleep. Nobody files a follow-up ticket.

Three weeks later, someone on the infrastructure team runs terraform apply as part of a routine change. Terraform doesn’t know about the console fix. It sees drift from the declared state and silently reverts the security group to its pre-incident configuration. The connectivity between those two services breaks again. Another page, another late-night scramble, and this time the on-call engineer has no idea what changed because the Terraform apply log shows a “normal” change. She’s debugging a ghost.

This exact sequence plays out at multiple companies every year. It is the defining failure mode of “GitOps for some things.” You get the worst of both worlds: the friction of declared state management and the drift of manual operations. The reconciler can’t reconcile what it doesn’t manage.

ArgoCD and Flux taught most of the industry what GitOps feels like operationally for Kubernetes. Git as the source of truth for desired state, automated deployment on commit, continuous reconciliation to detect and correct drift. The workflow is so natural for Kubernetes manifests that teams forget the underlying model isn’t Kubernetes-specific. It’s much bigger than that.

The reconciliation loop (compare desired state in Git to actual state in the live system, apply changes to close the gap) works for everything that has declarative state: Terraform infrastructure, database schemas, OPA security policies, network device configuration. The question isn’t whether GitOps applies beyond Kubernetes. It’s how quickly you extend it before the next late-night console change creates a drift bomb that detonates at the worst possible moment.
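That loop is simple enough to sketch. A minimal, illustrative Python version of one reconciliation pass, where desired state comes from Git and actual state from the live system (the function name and dict shapes are assumptions for the sketch, not any tool's actual API):

```python
def reconcile(desired: dict, actual: dict) -> dict:
    # One pass of the GitOps loop: diff desired state (from Git) against
    # actual state (from the live system) and return the changes needed
    # to close the gap.
    changes = {}
    for key, want in desired.items():
        if actual.get(key) != want:
            changes[key] = want   # create or update to match Git
    for key in actual:
        if key not in desired:
            changes[key] = None   # prune a resource Git doesn't declare
    return changes
```

ArgoCD, Flux, Atlantis, and Flyway all specialize this same diff-and-apply shape; what varies is how "actual state" is read and how changes are applied.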

[Animation: the drift time bomb. (1) In sync: Terraform code matches live infrastructure. (2) Incident: an engineer adds a security-group rule via the AWS console. (3) Time passes; nobody updates the code, and live state drifts from code state. (4) A routine terraform apply sees the drift and reverts the security group to the code state. (5) The manual fix is silently undone, services break again, and the on-call has no idea what changed.]

Infrastructure GitOps with Terraform

Let’s be direct: the standard Terraform workflow, where a developer runs terraform plan locally, reviews the output, then runs terraform apply, is not GitOps. The desired state lives in Git, but there’s no automated reconciler ensuring live infrastructure matches it. It’s manual sync with good intentions. Good intentions don’t prevent drift.

Atlantis and Spacelift implement the real GitOps loop for Terraform. Pull requests trigger terraform plan automatically and post the plan output as a PR comment. Merge triggers terraform apply. Scheduled runs detect drift by comparing live infrastructure against Terraform state.

Drift Detection: The Part Everyone Skips

Drift detection is where teams under-invest, and it's where the real value of infrastructure GitOps lives.

In a typical 200-resource AWS environment, teams accumulate 5-15 manual console changes per month during incidents. That’s not negligence. That’s operational reality. When production is down, the right call is often to fix it now and formalize later. The problem is that “formalize later” has about a 30% follow-through rate. The other 70% silently accumulate as drift.

Without drift detection, this divergence builds until the next terraform apply either reverts a manual change at the worst possible time or hits a state conflict that blocks the entire pipeline. We've seen teams go three months without running terraform apply because the state had drifted so far that nobody wanted to touch the conflict resolution. Their infrastructure-as-code was effectively abandoned while the infrastructure kept evolving through the console. That's not IaC. That's IaC theater.

Running drift detection on a 30-minute schedule catches changes within the same shift in which they were introduced. The practical implementation with Atlantis: run terraform plan in a read-only mode against each workspace on a cron schedule. If the plan is non-empty, open a PR with the drift details and alert the responsible team. They either formalize the console change by updating the Terraform code or revert the drift by applying the existing state.
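A sketch of what that scheduled check might look like, using terraform plan's documented -detailed-exitcode flag. This is illustrative plumbing, not Atlantis internals; the PR-opening and alerting hand-off is left to the caller:

```python
import subprocess

def plan_has_drift(exit_code: int) -> bool:
    # terraform plan -detailed-exitcode returns: 0 = no changes,
    # 1 = error, 2 = non-empty plan (live state has drifted from code).
    if exit_code == 0:
        return False
    if exit_code == 2:
        return True
    raise RuntimeError(f"terraform plan failed (exit {exit_code})")

def check_workspace(workspace_dir: str) -> bool:
    # Read-only check: -lock=false avoids blocking real applies,
    # -input=false keeps it non-interactive for cron.
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workspace_dir,
        capture_output=True,
        text=True,
    )
    return plan_has_drift(result.returncode)
```

On a True result, Atlantis-style automation would open a PR carrying the plan output so the owning team can formalize or revert the drift.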

The cultural impact is where it gets interesting. When drift gets caught within hours instead of weeks, the cost of formalization is low (update one resource block), and teams start committing console changes to Git proactively because it’s less work than dealing with the drift alert. Within about 6 weeks, most teams we’ve worked with see console-originated drift drop by 80%.

Database Schema as Code

Flyway and Liquibase have managed database schema migrations as versioned scripts for years. Bringing them into the GitOps model means treating the migration directory as another domain with Git as the source of truth.

The key shift: migrations run automatically in the deployment pipeline, not manually by DBAs. Every deployment that includes schema changes applies those changes as part of the process. No more coordinating “deploy the app” and “run the migration” as separate steps. In organizations with 10+ microservices, manual schema coordination accounts for roughly 25% of deployment-related incidents. Automating it through the pipeline kills that failure mode entirely.

But here’s the fundamental difference from Kubernetes and Terraform GitOps, and it catches every team that doesn’t think about it upfront: database migrations are forward-only. You can’t revert a migration by rolling back a Git commit. If you merged a migration that adds a column, reverting the commit does not drop the column from the live database. The reconciliation is strictly linear: apply migrations in order up to the current version. You never sync backward.
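The forward-only model fits in a few lines. This is an illustrative simplification of what Flyway and Liquibase track via their schema history tables, not their actual API:

```python
def apply_pending(applied: set, migrations: list, execute) -> list:
    # Forward-only reconciliation: apply pending migrations strictly in
    # version order, skipping those already recorded. There is no
    # "sync backward" branch; reverting a commit never undoes a migration.
    ran = []
    for version, sql in sorted(migrations):
        if version not in applied:
            execute(sql)           # run against the live database
            applied.add(version)   # record in the schema history
            ran.append(version)
    return ran
```

Note what's absent: any code path that drops or rewinds an already-applied version. Undoing a change means writing a new forward migration.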

This means your safety net is not rollback. It is validation before apply. Here’s the approach that works:

  1. Every migration runs against a test database in CI before it can merge
  2. Validation takes under 30 seconds for schema changes, catching 95% of errors
  3. For data migrations affecting more than 10,000 rows, run against a production-size dataset in staging
  4. All schema changes use the expand/contract pattern (add first, backfill, drop later) so they’re backward-compatible with the previous application version
  5. The pipeline logs every migration execution with timing, so you can track when a migration starts taking longer than expected
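Steps 1 and 5 above can be combined into a small CI helper. A sketch using SQLite as the throwaway test database (a real pipeline would point at a disposable Postgres or MySQL instance; the function name and 30-second budget are taken from the list above, not from any tool):

```python
import sqlite3
import time

def validate_migration(sql: str, budget_s: float = 30.0) -> float:
    # Run the migration against a throwaway in-memory database in CI,
    # timing it so the pipeline can flag migrations that run long.
    conn = sqlite3.connect(":memory:")
    start = time.perf_counter()
    try:
        conn.executescript(sql)  # raises on any syntax or schema error
    finally:
        conn.close()
    elapsed = time.perf_counter() - start
    if elapsed > budget_s:
        raise RuntimeError(f"migration took {elapsed:.1f}s, budget is {budget_s}s")
    return elapsed
```

Wired into the merge gate, this catches the broken-migration class of errors before anything touches production.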

Policy-as-Code: The Security Layer

OPA Gatekeeper and Kyverno bring GitOps to security policy. Instead of security rules living in someone’s head or a wiki page that nobody reads, they’re expressed as code, stored in Git, reviewed in PRs, and enforced automatically at admission time. This is where GitOps starts paying dividends for your security team, not just your platform team.

The practical impact: when a developer tries to deploy a container running as root, or a pod without resource limits, or a service without a network policy, the admission controller rejects it in under 5ms with a specific error message explaining why and how to fix it. No context-switching to ask a security team for review. No waiting for approval. The guardrails are built into the platform.
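The checks themselves are simple. A Python sketch of the logic such policies encode; in practice Gatekeeper expresses this in Rego and Kyverno in YAML, and the pod dict here only mirrors the shape of a Kubernetes pod spec:

```python
def admit(pod_spec: dict) -> tuple[bool, str]:
    # Sketch of two common admission rules: no root containers,
    # and resource limits required on every container.
    for c in pod_spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        if not c.get("securityContext", {}).get("runAsNonRoot"):
            return False, f"container {name}: set securityContext.runAsNonRoot=true"
        if not c.get("resources", {}).get("limits"):
            return False, f"container {name}: declare resources.limits"
    return True, "admitted"
```

The rejection message is the point: it tells the developer exactly which field to fix, so no security-team round trip is needed.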

Organizations we’ve worked with typically start with 20-30 rules and grow to 200-500 as the policy library matures. The rules catching the most real issues: requiring resource limits on all pods (prevents noisy-neighbor problems), enforcing non-root containers (prevents privilege escalation), requiring labels for cost allocation (prevents untracked spend), and blocking images from untrusted registries (prevents supply chain attacks).

The GitOps model for policy is particularly powerful because policy changes go through the same PR review process as code changes. A change to a security policy is visible, reviewable, and auditable. When an auditor asks “when was this policy enacted and who approved it?” the answer is a Git commit with a timestamp, an author, and a reviewed PR. Try getting that from a wiki page.

The Incident Response Shift

GitOps changes how you respond to incidents, and that shift takes about 2-3 weeks to internalize. It feels wrong at first.

When production needs a quick firewall rule change, the instinct is to make the change in the console. In a GitOps model, the correct action is to commit the change to Git and let the pipeline apply it. This feels slower. It is about 3-5 minutes slower during the incident. But those 3-5 minutes buy you something invaluable.

The Git commit creates an audit trail. The pipeline validates the change. The reconciler won’t revert it on the next drift detection run. And every engineer on the team can see what changed by looking at the Git log. No more “someone changed something in the console but nobody documented it” conversations. That conversation has never ended well for anyone.

For DevOps teams with high deployment frequency (10+ deploys/day), committing first becomes muscle memory within 2-3 weeks. For teams accustomed to direct console access, it requires a deliberate cultural adjustment. The forcing function is simple: console changes made outside Git get detected and reverted by the reconciler within 15-30 minutes. After a few reverted console changes, committing-first starts feeling like the path of least resistance rather than an extra step.

The Adoption Playbook

The teams that get the most value from full-platform GitOps extend it gradually, not all at once. Kubernetes first, then Terraform with Atlantis, then database migrations, then policy-as-code. Each extension pays for itself in reduced investigation time and eliminated “what actually changed?” conversations.

Here’s the timeline that works consistently:

Months 1-2: Kubernetes GitOps. Install ArgoCD or Flux. Move application manifests to Git. Set a 3-minute sync interval. This is the easiest win because the tooling is mature and the workflow is well-documented. By week 4, your team should be deploying exclusively through Git commits.

Months 3-4: Infrastructure GitOps. Set up Atlantis or Spacelift for Terraform. Enable PR-triggered plans and merge-triggered applies. Add drift detection on a 30-minute schedule. This is where you start catching the console changes that create the security-group-revert scenario from the opening of this article.

Months 5-6: Database GitOps. Integrate Flyway or Liquibase into the deployment pipeline. Add CI validation against test databases. Adopt the expand/contract pattern for all schema changes. This kills the DBA-as-bottleneck antipattern and makes schema changes as reviewable as code changes.

Months 7-8: Policy GitOps. Deploy OPA Gatekeeper. Start with 20-30 rules covering resource limits, non-root containers, and image registry restrictions. Extend as the platform engineering practice matures.

By month 8, every layer of your platform has declarative state in Git, automated reconciliation, and drift detection. The audit trail is complete. The “what changed?” question always has an answer. And the late-night console change that silently drifts for three weeks until it detonates during a routine apply? That story is over.

Extend GitOps Across Your Entire Platform

GitOps applied only to Kubernetes is a partial solution. Metasphere extends GitOps principles across your full operational surface for complete auditability and drift prevention.


Frequently Asked Questions

What is the core principle that makes GitOps valuable?


GitOps makes desired state declarative and version-controlled, with an automated reconciler comparing desired to actual state every 3-5 minutes. Every change is a Git commit with author, timestamp, and rationale. Teams adopting GitOps reduce configuration drift incidents by 60-80% and cut mean time to recovery to under 15 minutes via git revert.

How does GitOps apply to Terraform infrastructure?


Terraform code in Git with plan/apply automated by CI/CD is GitOps for infrastructure. The critical addition is drift detection: running terraform plan on a 15-30 minute schedule and alerting on non-empty plans. Atlantis and Spacelift implement the full loop. Teams using Atlantis catch 90% or more of unauthorized console changes within one detection cycle.

Can database schema changes be managed with GitOps?


Yes, with a key difference. Flyway and Liquibase store migrations in version control with CI validation against a test database before production apply. But database migrations are forward-only and often irreversible. You can’t revert to a previous commit and expect the schema to roll back. Teams run migration validation in under 30 seconds against test databases, catching 95% of errors before production.

What is policy-as-code in the GitOps model?


Policy-as-code means security and compliance rules expressed as OPA Rego or Kyverno YAML, stored in Git and deployed through PR review. OPA Gatekeeper evaluates admission requests in under 5ms, rejecting non-compliant resources in real time. Organizations typically enforce 200-500 rules, catching 30-40% of misconfigurations at deploy time that would otherwise reach production.

What is the difference between push and pull model GitOps?


Push model means CI/CD pushes changes on commit. Pull model means an agent like ArgoCD or Flux polls Git every 3 minutes and pulls changes. Pull models are preferred because the agent authenticates outbound, eliminating inbound CI credentials to production. This reduces credential exposure by roughly 50% and provides continuous reconciliation that handles network partitions gracefully.