Infrastructure as Code: Reproducible, Auditable, Recoverable

Metasphere Engineering · 12 min read

Your production RDS instance is sitting on a publicly accessible subnet. No encryption at rest. The engineer who provisioned it through the AWS console left eight months ago. The security group allowing 0.0.0.0/0 on port 5432 has been open since day one. The only reason it hasn’t been exploited is luck. Luck is not a security strategy.

A building with no blueprints. The contractor who built it is gone. Nobody knows which walls are load-bearing. The CIS Benchmarks codify exactly these configuration standards. But without blueprints, nobody can check the building against the code.

Key takeaways
  • Console-provisioned infrastructure has no audit trail, no review process, and no reproducibility. The engineer who built it left. The documentation is wrong. The security group is wide open. A building with no blueprints and no inspector.
  • IaC catches misconfigurations in PR review, before resources exist. The RDS instance on a public subnet gets blocked by policy, not discovered by a pentester.
  • Drift detection must run continuously. terraform plan once a week catches some drift. Continuous reconciliation catches it all. The inspector who checks the building against the blueprints every 30 minutes.
  • State file management is the hardest operational problem in Terraform. Remote state with locking is baseline. State file corruption is a category of outage nobody plans for.
  • Modular architecture eliminates most of the duplication across environments. Dev, staging, and prod differ in variable values, not resource definitions. Same blueprints. Different paint colors.

Hundreds of manually provisioned resources, documentation nobody trusts. Infrastructure in code instead of in memory is the only path out.

[Figure: Infrastructure Drift Divergence and Recovery. Three environments start identical, then drift apart over three months through manual changes: staging's security group widened to 0.0.0.0/0 on 5432 during debugging, prod resized from t3.large to r5.2xlarge during an incident, dev picking up manually added redis and memcached dependencies. terraform plan detects 14 differences; terraform apply snaps all three back in sync: t3.large, port 5432 restricted, three dependencies.]

What Console Management Actually Costs

Clicking through the AWS console feels productive. That feeling is a trap. Building without blueprints feels fast. Until something breaks and nobody knows what’s behind the wall.

| | Console Management | Infrastructure as Code |
| --- | --- | --- |
| Reproducibility | Rebuild from memory + screenshots | terraform apply from any commit |
| Audit trail | "Who changed the security group?" | Git blame shows who, when, why, and who approved |
| Drift detection | Unknown until something breaks | terraform plan shows every deviation |
| DR recovery | Days of archaeology | Hours (apply code to new region) |
| Compliance evidence | Screenshots collected before audit | Automated, continuous, in the pipeline |
| Knowledge sharing | Lives in someone's head | Lives in version-controlled code |

Three problems compound quietly over time.

Reproducibility is gone. If production went down right now, could you rebuild from scratch? Honest answer for most manually provisioned environments: not quickly, and not completely. DR exercises routinely take days when they should take hours. Engineers reconstruct from memory, scattered screenshots, and the Slack DMs of an engineer who left months ago. Disaster recovery by archaeology. Rebuilding a building from photos.

Drift is invisible. Someone widens a security group to debug connectivity. Forgets to revert. Someone bumps an instance size during a spike. Never scales back. The contractor who moved a wall during the renovation and didn’t update the blueprints. Run terraform plan against a 400-resource account that’s been console-managed for two years. The plan will show dozens of drifted resources, many security-relevant. Nobody remembers making the changes.

Scale shatters consistency. One environment is manageable by hand. Four that should be identical (dev, staging, QA, production) never are. The “works in staging, breaks in production” mystery? It traces back to environment parity gaps that manual management guarantees. Four buildings from the same blueprints, except the blueprints were verbal instructions. Every building is slightly different. None of them match the specification nobody wrote down.

What Codified Infrastructure Delivers

Terraform and Pulumi treat infrastructure definitions as source code. Not as a metaphor. As an engineering capability. The blueprints are the building.

Auditable change history. Every change goes through version control. Production incident? git log --oneline --since="2 days ago" infra/ shows exactly what shifted. Investigation drops from hours of guesswork to minutes of reading diffs. But the bigger prize is git revert. Bad security group rule hits production? On-call reverts the commit, the pipeline runs terraform apply, and the environment rolls back to last known good. Tear down the wrong wall? The blueprints restore it. Try that with the AWS console.

Consistent environments. Define once, deploy everywhere. Staging matches production because both come from identical Terraform modules with different tfvars. The only differences are ones you put there on purpose: instance sizes, domain names, scaling thresholds.
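A minimal sketch of that split (module path, variable names, and values are illustrative, not from a real codebase):

```hcl
# environments/main.tf (identical for dev, staging, and prod)
module "service" {
  source        = "../modules/service"   # shared module, defined once
  instance_type = var.instance_type
  domain        = var.domain
}

# dev.tfvars:   instance_type = "t3.micro"   domain = "dev.example.com"
# prod.tfvars:  instance_type = "t3.large"   domain = "app.example.com"
```

Each environment is applied with its own file, e.g. terraform apply -var-file=prod.tfvars. The resource definitions never fork.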

Peer review for infrastructure. Changes go through pull requests. A second pair of eyes catches the overly permissive security group, the missing encryption, or the oversized instance. The building inspector reviewing the plans before construction starts. The cheapest infrastructure mistake is the one that never deploys.

Automated compliance at PR time. Policy engines like OPA check every proposed change against your rules before it touches any environment:

# OPA policy: block public S3 buckets and unencrypted RDS
deny[msg] {
  input.resource_type == "aws_s3_bucket"
  input.planned_values.acl == "public-read"
  msg := "Public S3 buckets are not allowed"
}

deny[msg] {
  input.resource_type == "aws_db_instance"
  not input.planned_values.storage_encrypted
  msg := "RDS instances must have encryption at rest"
}

Violations blocked in the PR, not discovered in a pen test six months later. The building code enforced at the blueprint stage, not discovered during the safety inspection after tenants move in. Cloud migration modernization built on engineering discipline, not hope.

[Figure: IaC Workflow, Commit to Applied Infrastructure. Developer commits a .tf change; PR review examines the diff; terraform plan shows what will change and is posted as a PR comment; a policy check (OPA/Sentinel gate) runs; terraform apply executes on merge so infrastructure matches code. Every infra change: reviewed, planned, policy-checked, applied. No more console cowboys.]

Adopting IaC Without Stopping Feature Development

“We can’t pause feature delivery to codify everything.” That framing is wrong: you don’t have to pause anything. You don’t stop using the building to draw the blueprints.

Import, Don’t Rebuild

Terraform and Pulumi import existing live resources into state. You don’t tear down production. Run terraform import for each resource, generate the HCL, verify terraform plan shows zero pending changes. Codified baseline, running system untouched. Survey the existing building. Draw the blueprints. Don’t move anything.
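On recent Terraform versions this can be declarative. A sketch, assuming Terraform 1.5+ and an existing RDS instance (the resource name and identifier are illustrative):

```hcl
# Declare the import; nothing is created or destroyed.
import {
  to = aws_db_instance.main
  id = "prod-postgres"   # the live instance's identifier (illustrative)
}

# Generate matching HCL for the imported resource:
#   terraform plan -generate-config-out=generated.tf
# Then verify the baseline is faithful:
#   terraform plan   # goal: "No changes. Your infrastructure matches the configuration."
```

Older workflows run the terraform import CLI command per resource instead; either way the end state is the same: code that matches reality exactly.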

Bulk-import tools handle scale. A 300+ resource account imports in under a week. Tedious, mechanical work. But release engineering discipline turns a terrifying migration into a boring one. Boring is exactly what you want.

Prerequisites
  1. Remote state backend configured with locking (S3 + DynamoDB or equivalent)
  2. CI pipeline able to run terraform plan on every PR
  3. At least one engineer trained on Terraform import workflows
  4. Policy engine (OPA, Checkov, or tfsec) in CI
  5. Alert channel configured for drift detection notifications
[Figure: IaC Adoption Phases, never stop shipping to do this. Phase 1, Import: terraform import existing infrastructure so the state file matches reality, zero changes to infra (weeks 1-2, read-only). Phase 2, New Only: all new resources go through IaC, legacy untouched, feature work continues (ongoing; the new normal). Phase 3, Migrate Legacy: backfill old resources one service at a time, driven by change frequency (gradual; months, not weeks). Import first, mandate for new, migrate on natural churn.]

Enforce the Rule for New Changes

Day one rule: every new change goes through code. Every modification to an existing resource gets codified at the time of change. Coverage grows naturally. No dedicated migration project needed. Every new wall gets drawn on the blueprints. Existing walls get drawn when someone touches them.

The psychological shift matters more than the tooling. Once engineers see terraform plan showing exactly what will change before it changes, they stop wanting the console. The architect’s preview of the construction before a single brick is laid. Nobody forces them. They just stop going back.

Automate the Pipeline Early

Set up a continuous deployment pipeline from the start. Post terraform plan output as a PR comment so reviewers see exactly what will change. Run terraform apply on merge. If changes still require manual CLI commands, adoption will stall. Engineers revert to the console under pressure. Every time. Remove that escape hatch early.
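One way to wire that up, sketched as a GitHub Actions workflow (the file name, paths, and apply gate are illustrative; surfacing the plan as a PR comment typically takes one extra step or a marketplace action):

```yaml
# .github/workflows/terraform.yml (sketch)
name: terraform
on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: [main]
    paths: ["infra/**"]

jobs:
  plan-and-apply:
    runs-on: ubuntu-latest
    defaults:
      run: { working-directory: infra }
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -no-color -out=tfplan   # surface this output on the PR
      - if: github.ref == 'refs/heads/main'         # apply only after merge
        run: terraform apply -auto-approve tfplan
```

The same shape works in GitLab CI or Atlantis; what matters is that plan runs on every PR and apply runs only from the pipeline.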

Anti-pattern

Don’t: Run terraform apply from a developer laptop with local state and no locking. Two engineers applying at the same time corrupt the state file, and corrupted state is one of the hardest Terraform problems to recover from. Two contractors modifying the same blueprint at the same time. The blueprints tear.

Do: Store state remotely with locking. Apply only through CI on merge. No human should run apply against production. The pipeline is the only path.
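The baseline backend configuration is small. A sketch (bucket, table, and region names are illustrative):

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"   # versioned, encrypted bucket
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"        # table with a LockID string key
    encrypt        = true
  }
}
```

With the lock table in place, a second concurrent apply waits or fails fast instead of corrupting state.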

Drift Detection: The Other Half of IaC

Most adoption guides stop at code coverage. Drift detection is the half they skip. Someone will use the console during an emergency. A support engineer will tweak something through the CLI. An automated process from another team will modify a resource you own. Not edge cases. Certainties. The contractor who moves a wall without updating the blueprints. It will happen.

Run terraform plan on a schedule (every 30-60 minutes), read-only, against live infrastructure. If the plan shows changes nobody committed, that’s drift. Alert right away. The inspector comparing the building to the blueprints every 30 minutes. Catching drift within an hour is a minor correction. Discovering it during an incident is a crisis.
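A minimal sketch of that check in shell. The notify hook and the demo invocation are placeholders; terraform plan -detailed-exitcode really does exit 2 when the plan is non-empty:

```shell
# Drift check wrapper (sketch). Pass the plan command as arguments;
# terraform plan -detailed-exitcode exits 0 = in sync, 2 = changes pending.
check_drift() {
  "$@" > /dev/null 2>&1
  code=$?
  case "$code" in
    0) echo "in sync" ;;
    2) echo "drift detected" ;;   # alert your on-call channel here
    *) echo "plan error" ;;
  esac
  return "$code"
}

# A cron/CI job would call:
#   check_drift terraform plan -detailed-exitcode -lock=false -input=false
check_drift true   # demo stand-in: a command exiting 0, i.e. a clean plan
```

Run it read-only (-lock=false) so the scheduled check never blocks a real apply.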

[Figure: IaC Drift Detection Pipeline. A scheduled terraform plan compares the IaC definition against live infrastructure, resource by resource. Each difference is classified: an intentional change is accepted by updating the IaC to match; unauthorized drift is reverted to the IaC state via an auto-created revert PR plus an alert. Without drift detection, IaC becomes a lie that everyone trusts.]

The Console Escape Hatch

Teams use IaC for provisioning but revert to console clicks during incidents, urgency, or debugging. The contractor who bypasses the architect during the emergency. Each console change creates drift that the next terraform apply either reverts (causing a new incident) or ignores (if the state was manually updated). Both outcomes are worse than the console change that caused them. The escape hatch is the source of the very problems IaC was supposed to prevent.

When IaC Creates More Problems Than It Solves

| Scenario | IaC Is Right | IaC Is Overkill |
| --- | --- | --- |
| Ephemeral dev environments | Yes, if >2 developers share them | Yes, if a single-developer sandbox with <5 resources |
| Long-lived production | Always | Never |
| One-off investigations | No | Yes; tear it down manually after |
| Shared staging | Yes | Never |
| Prototype/spike | No; codify if it survives | Yes, until it survives |

Not every resource justifies codification. A temporary debugging instance that lives for two hours doesn’t need a Terraform module. You don’t draw blueprints for a tent. The discipline is knowing which resources earn the overhead and which don’t. Production, staging, shared infrastructure: always. Ephemeral spikes: codify if they survive the week.

What the Industry Gets Wrong About Infrastructure as Code

“Running Terraform means doing IaC.” Running terraform apply from a laptop with no state locking, no PR review, and no policy checks is not IaC. It’s console management with a different syntax. IaC means version-controlled definitions, peer-reviewed changes, automated policy enforcement, and continuous drift detection. Typing the blueprints instead of drawing them doesn’t make them blueprints. The tool is not the practice.

“Import existing resources and you’re caught up.” terraform import captures current state. It doesn’t capture intent. An imported security group has the rules it has today, including the one someone widened during a debug session and forgot to revert. Surveying the building captures what exists. It doesn’t tell you whether the existing building is safe. Import is a starting point for audit, not a declaration of correctness.

Our take: Enforce IaC for all new resources right away. Don’t wait until existing infrastructure is imported. Every new resource goes through Terraform from day one. Import existing resources gradually, starting with the most security-sensitive. Waiting until “everything is imported” before enforcing IaC means IaC never gets enforced. Draw blueprints for every new wall. Survey existing walls as you touch them. Waiting for the complete survey before requiring blueprints means the survey never finishes and the blueprints never happen.

That RDS instance on the public subnet with 0.0.0.0/0 on port 5432? With IaC, it would have been blocked in PR review before it ever existed. The OPA policy catches it, the reviewer flags it, the CI pipeline rejects it. The building code inspector looks at the blueprints and says “that wall can’t go there.” Three layers of protection, all automated, all before a single resource gets provisioned. The setup cost pays for itself through prevented incidents alone. Delay doesn’t save time. It just moves the bill to the outage, when the price is highest and the options are fewest.

Your Infrastructure Exists Only in Someone's Head

Console-provisioned infrastructure has no audit trail, no review process, and no rollback path. The alternative is reproducible, peer-reviewed cloud environments where every change goes through a PR, every policy is enforced in CI, and drift gets caught in hours instead of being discovered during an audit.

Codify Your Infrastructure

Frequently Asked Questions

What is Infrastructure as Code and why does it matter?


IaC manages cloud infrastructure through version-controlled definition files instead of web consoles. Terraform, Pulumi, and CloudFormation are the most common tools. Organizations using IaC investigate incidents faster because every change shows up in git log, and they see fewer configuration-related outages. It matters because it makes infrastructure reproducible and auditable rather than tribal knowledge locked in one engineer’s head.

Why is configuration drift dangerous in cloud infrastructure?


Configuration drift creates a gap between documented infrastructure and what actually runs in production. Within months, a striking share of manually managed resources will have drifted from their intended state. During incidents, engineers waste precious time debugging based on false assumptions about the environment. IaC with automated drift detection running every 30-60 minutes catches differences before they cause outages.

Can we adopt IaC with a large existing cloud footprint?


Yes. Terraform import and Pulumi import bring existing resources into state management without rebuilding anything. A dedicated engineer can import dozens of resources per day. Import your live state as the baseline, verify the plan shows zero pending changes, then require all future changes through code. Most teams reach high coverage within a few weeks without disrupting feature development.

Should infrastructure code live in the same repo as application code?


Generally, no. Separate repos for foundational infrastructure like networking and clusters versus application-specific infrastructure like queues and databases keeps blast radius contained and access controls clean. Application teams own their service-level infrastructure without touching shared platform resources. This also allows independent review cycles and deployment cadences.

How does Infrastructure as Code improve security posture?


IaC enables automated policy scanning on every proposed change before deployment. Policy scanners check for 1,000+ misconfiguration rules and OPA enforces custom policies. In practice, policy-as-code routinely catches security misconfigurations that would have reached production via console provisioning. Open database ports, unencrypted storage, and overly permissive IAM roles get blocked in the PR.