Cloud Security Posture: Closing the Remediation Gap
Six months after deploying a CSPM tool, the security team’s dashboard shows 847 findings. Forty-two are Critical. The backlog grows faster than remediation can close it. Every standup starts with the same tired ritual: “We need to prioritize the findings.” By the end of the week, the Critical count is higher than it was at the start. You’ve seen this movie. You might be living in it right now.
That is what an unwired CSPM program amounts to: a very expensive smoke detector that nobody wired to the sprinklers.
- CSPM tools find problems. They don’t fix them. The gap between finding and remediation is where security posture actually degrades.
- Remediate in IaC, not in the console. Console fixes get overwritten on the next `terraform apply`. The finding reappears. The cycle repeats.
- Ownership by resource tag is the only model that scales. Map findings to teams via `team` and `service` tags; untagged resources go to a default owner with escalation (a sketch follows this list).
- Prevent-detect-respond is the priority order. Pre-commit IaC scanning catches the majority of misconfigurations before they exist. CSPM catches runtime drift.
- 847 findings and growing means the remediation pipeline is broken, not that you need a better scanner.
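Tag-based routing is simple enough to sketch. A minimal version, assuming each finding arrives with its resource's tags attached; the default owner name and the escalation flag are illustrative, not prescriptive:

```python
# Minimal sketch: route a CSPM finding to an owning team via resource tags.
# Tag names follow the `team` / `service` standard described above; the
# DEFAULT_OWNER value is a hypothetical catch-all team.

DEFAULT_OWNER = "cloud-platform"  # owner of last resort for untagged resources

def route_finding(finding: dict) -> dict:
    """Map a finding to an owner using the resource's tags."""
    tags = finding.get("resource_tags", {})
    owner = tags.get("team")
    if owner is None:
        # Untagged resource: assign the default owner and flag for
        # escalation so the tagging gap itself gets fixed.
        return {"owner": DEFAULT_OWNER, "escalate": True,
                "reason": "missing 'team' tag"}
    return {"owner": owner, "escalate": False,
            "service": tags.get("service", "unknown")}

# Example:
# route_finding({"resource_tags": {"team": "payments", "service": "checkout"}})
# -> {"owner": "payments", "escalate": False, "service": "checkout"}
```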
The CSPM tool does exactly what it promises: scan and report. The gap is everything after the finding appears. Who owns remediation? Where does the fix go? How do you prevent recurrence? Engineering problem, not procurement. Another tool just makes the dashboard busier.
Connecting Findings to Infrastructure as Code
The model below assumes a few prerequisites:

- All production infrastructure managed by IaC (Terraform, CloudFormation, or Pulumi)
- Resource tagging standard enforced with `team` and `service` tags on every resource
- CI/CD pipeline exists for IaC changes with automated plan and apply
- Security team has merge permissions on IaC repositories
- Break-glass process documented for emergency console changes with 48-hour reconciliation SLA
The single most important design decision in a CSPM program is where remediation happens. If engineers fix findings by clicking through the AWS Console, you’ve bought yourself a loop. IaC and live infrastructure diverge. The next `terraform apply` reverts the fix. The finding comes back. The same security group misconfiguration gets “fixed” repeatedly over months because the Terraform module keeps regenerating it. Dozens of wasted engineering hours per recurring finding per year. Sisyphus with an AWS account.
The remediation target is always the infrastructure-as-code that created the resource. No exceptions. CSPM flags an overly permissive security group? Fix the Terraform module, run through CI/CD, deploy. You get a fixed misconfiguration and updated IaC that prevents recurrence on the next `terraform apply`.
For well-understood misconfigurations aligned with CIS Benchmarks (public S3 buckets, security groups with 0.0.0.0/0 on management ports, unencrypted EBS volumes), the CSPM finding can trigger an automated PR:
```hcl
# Auto-generated PR: Fix CIS 2.1.1 - S3 bucket public access
# Finding: bucket "marketing-assets" has public ACL
# Severity: Critical | Resource: aws_s3_bucket.marketing
resource "aws_s3_bucket_public_access_block" "marketing" {
  bucket = aws_s3_bucket.marketing.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
The PR includes the specific line change, the finding context, and the compliance reference. An engineer reviews and merges instead of filing a ticket that enters a queue behind 40 other tickets that no one is looking at.
Not hypothetical. Teams running this pattern close Critical findings in under 24 hours because the remediation arrives as a ready-to-merge PR, not a description of a problem that someone needs to investigate, eventually, when they get to it. The difference between “here is a PR that fixes it” and “here is a ticket describing the problem” is the difference between a program that works and a backlog that grows until someone turns off the alerts.
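One way to wire this up, sketched with git and the GitHub CLI (`gh`). The repo layout, branch naming, file path, and the appended snippet are assumptions; a production pipeline would patch the owning module rather than append a new file:

```python
# Sketch: turn a known-pattern CSPM finding into a ready-to-merge PR.
# Assumes the IaC repo is checked out locally and `gh` is authenticated.
import subprocess
from pathlib import Path

# Hypothetical remediation template for the CIS 2.1.1 pattern.
SNIPPET = '''\
# Auto-generated remediation: CIS 2.1.1 - S3 bucket public access
resource "aws_s3_bucket_public_access_block" "{name}" {{
  bucket                  = aws_s3_bucket.{name}.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}}
'''

def open_remediation_pr(bucket_name: str, finding_id: str, repo_dir: str) -> None:
    branch = f"cspm/{finding_id}"
    tf_file = Path(repo_dir) / "s3_public_access.tf"  # illustrative path

    subprocess.run(["git", "-C", repo_dir, "checkout", "-b", branch], check=True)
    # Append the fix; a real pipeline would edit the module that owns the bucket.
    with tf_file.open("a") as f:
        f.write(SNIPPET.format(name=bucket_name))
    subprocess.run(["git", "-C", repo_dir, "add", str(tf_file)], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m",
                    f"fix({finding_id}): block public access on {bucket_name}"],
                   check=True)
    subprocess.run(["git", "-C", repo_dir, "push", "-u", "origin", branch], check=True)
    # Open the PR with the finding context in the body.
    subprocess.run(["gh", "pr", "create",
                    "--title", f"[CSPM {finding_id}] Block public access: {bucket_name}",
                    "--body", "Auto-generated remediation. Review and merge."],
                   cwd=repo_dir, check=True)
```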
Building a Prioritization Model That Reflects Actual Risk
Not all findings are equal. Treating them equally is the fastest way to produce a backlog where a public bucket serving marketing images competes for attention with a production database accepting connections from the entire internet. Not prioritization. Just a to-do list sorted by when the scanner happened to notice.
A useful prioritization model layers three factors. Where they intersect tells you how fast to move.
Inherent severity: Has this class been exploited in breaches? Public S3 buckets, open management ports (SSH/RDP to 0.0.0.0/0), and overly permissive IAM roles have extensive breach histories. Missing cost tags don’t. Stop treating them the same.
Resource exposure: Internet-facing, VPN-restricted, or private VPC? The same misconfiguration on a load balancer versus a private subnet database has completely different risk profiles. Context beats the scanner’s severity label.
Data sensitivity: Customer PII, payment cards, health records, or internal tooling? A public bucket serving a static website and one containing customer records are both “Critical: public S3” in your CSPM. One is embarrassing. The other ends careers. Your CSPM treats them the same. Your board won’t.
| Severity + Exposure + Data | Priority | SLA | Action |
|---|---|---|---|
| All three high | P0 | Fix in 4 hours | Drop everything. Automated PR if pattern is known. |
| Two high, one medium | P1 | Fix in 48 hours | Next sprint, engineer assigned immediately |
| One high, rest medium | P2 | Fix in 2 weeks | Backlog with scheduled remediation |
| All medium or lower | P3 | Accept or backlog | Document exception with business justification |
All three factors at “High”? Drop everything. Fix it now. All three at “Low”? Document it as an accepted exception and move on. Cloud security tooling that integrates with your resource tagging and data classification systems can automate much of this scoring. But consistent tagging is non-negotiable, and it is its own engineering discipline that most teams badly underestimate.
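The table reduces to a few lines of code. A sketch, assuming the three factor ratings arrive from the scanner, resource tags, and your data classification system respectively:

```python
# Sketch of the table above as code: count how many of the three factors
# are "high" and map that count to a priority and SLA.
PRIORITIES = {
    3: ("P0", "4 hours"),
    2: ("P1", "48 hours"),
    1: ("P2", "2 weeks"),
    0: ("P3", "accept or backlog"),
}

def prioritize(severity: str, exposure: str, sensitivity: str) -> tuple[str, str]:
    high_count = sum(factor == "high" for factor in (severity, exposure, sensitivity))
    return PRIORITIES[high_count]

# A public database holding PII: all three high.
assert prioritize("high", "high", "high") == ("P0", "4 hours")
# A public bucket of marketing images: exposure high, severity and data medium.
assert prioritize("medium", "high", "medium") == ("P2", "2 weeks")
```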
Drift Detection as Enforcement
Fixing misconfigurations is only half the problem. Keeping them fixed is the other half, and most teams forget about it entirely. Without drift detection, a manual console change during an incident silently undoes a carefully engineered security control. It goes unnoticed until the next CSPM scan. Or worse, an auditor finds it six months later.
The question is not whether to detect drift. The question is what happens when you find it. Alerting on drift without enforcement means engineers see the alert, mean to fix it, and get busy with other things. The alert rots in a Slack channel. Blocking deployment pipelines until drift is resolved creates a forcing function that ensures fixes actually stick.
Blocking pipelines on security-critical drift feels aggressive. Good. It should feel aggressive. It is the right default for security groups, IAM policies, network ACLs, and encryption settings. If it causes daily friction, that friction is telling you something: your IaC workflow is not handling emergency changes properly. The fix is not loosening enforcement. Wrong instinct. The fix is a documented break-glass process that allows manual changes during incidents but creates a Terraform reconciliation ticket that must close within 48 hours. Pairing drift detection with a mature DevOps delivery pipeline ensures security fixes move through the same auditable process as every other infrastructure change.
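A sketch of the blocking check, using Terraform's real JSON plan output. The set of security-critical resource types is an illustrative starting point, not a complete list:

```python
# Fail the pipeline when `terraform plan` shows drift on security-critical
# resource types; all other drift passes through to normal alerting.
import json
import subprocess
import sys

SECURITY_CRITICAL = {  # illustrative, extend for your estate
    "aws_security_group", "aws_security_group_rule",
    "aws_iam_policy", "aws_iam_role_policy",
    "aws_network_acl", "aws_ebs_volume",  # encryption settings live here
}

def check_security_drift(workdir: str) -> None:
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
    plan = subprocess.run(
        ["terraform", "plan", "-out=drift.tfplan", "-detailed-exitcode"],
        cwd=workdir)
    if plan.returncode == 1:
        sys.exit("terraform plan failed")
    if plan.returncode == 0:
        return  # no drift at all
    show = subprocess.run(
        ["terraform", "show", "-json", "drift.tfplan"],
        cwd=workdir, capture_output=True, text=True, check=True)
    changes = json.loads(show.stdout).get("resource_changes", [])
    drifted = [c["address"] for c in changes
               if c["type"] in SECURITY_CRITICAL
               and c["change"]["actions"] != ["no-op"]]
    if drifted:
        print(f"Security-critical drift, blocking pipeline: {drifted}")
        sys.exit(1)

if __name__ == "__main__":
    check_security_drift(".")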
CSPM Across Multi-Cloud Environments
Multi-cloud makes all of this harder. Every provider has its own security model, its own API surface, its own blind spots. AWS Security Hub, Azure Defender for Cloud, and GCP Security Command Center each provide deep coverage for their platform and zero visibility into the others. Multi-cloud CSPM platforms provide unified dashboards but typically have shallower coverage for provider-specific services. There’s no single tool that does everything well. Stop looking for one. The security tool market is happy to sell you the search.
The pragmatic approach: use a multi-cloud CSPM for unified risk scoring, cross-cloud visibility, and compliance reporting. Layer native security tools underneath for provider-specific deep coverage. The unified CSPM gives your security team one dashboard. The native tools give your cloud-native platform teams the depth they actually need to do their jobs.
Provider-specific tool layering details
AWS: Security Hub aggregates findings from GuardDuty, Inspector, and Config. AWS Config rules catch provider-specific misconfigurations (S3 Object Lock settings, KMS key rotation) that multi-cloud platforms miss entirely.
Azure: Defender for Cloud covers workload protection alongside posture. Azure Policy enforces guardrails at the subscription level. The integration with Azure DevOps makes remediation PRs native.
GCP: Security Command Center provides posture findings. Organization Policies enforce constraints at the org level. GCP’s asset inventory API gives the most complete resource-level visibility of any provider.
The common pattern: multi-cloud CSPM for unified scoring, native tools for depth, findings aggregated into a centralized store that feeds the automated remediation pipeline.
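A sketch of that normalization step. The common schema below is a minimal illustration; the AWS mapper uses real ASFF fields, but Defender and Security Command Center payloads need their own mappers with their own field paths:

```python
# Normalize provider-native findings into one schema before they feed
# the remediation pipeline.
from dataclasses import dataclass

@dataclass
class Finding:
    provider: str
    resource_id: str
    rule_id: str
    severity: str

def from_security_hub(asff: dict) -> Finding:
    """Map an AWS Security Hub (ASFF) finding to the common schema."""
    return Finding(
        provider="aws",
        resource_id=asff["Resources"][0]["Id"],
        rule_id=asff["GeneratorId"],
        severity=asff["Severity"]["Label"].lower(),
    )
# Defender for Cloud and Security Command Center mappers follow the same
# shape; their payload structures differ enough to warrant separate code.
```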
The Compliance Score Trap
The compliance score trap catches every team eventually, precisely because it feels like progress.
Compliance scores against CIS Benchmarks and NIST CSF are dangerous as operational metrics. Teams game them by marking findings as accepted exceptions: nearly fully compliant on paper, attack surface unchanged.
Track outcome metrics instead. Mean time to remediation for critical findings: under 48 hours. Recurrence rate: under 5%. Exception rate: under 20%, or your exception process is a close button.
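All three metrics are computable from any findings store. A sketch, assuming each record carries `opened_at` and `resolved_at` timestamps, a `resolution` field, and a stable `fingerprint` per misconfiguration; the schema is an assumption about how your CSPM exports findings:

```python
# The three operating metrics, computed from exported finding records.
from statistics import mean

def mttr_hours(findings: list[dict]) -> float:
    """Mean time to remediation for resolved critical findings."""
    durations = [(f["resolved_at"] - f["opened_at"]).total_seconds() / 3600
                 for f in findings
                 if f["severity"] == "critical" and f.get("resolved_at")]
    return mean(durations) if durations else 0.0

def recurrence_rate(findings: list[dict]) -> float:
    """Share of findings whose fingerprint has been seen before."""
    seen, recurred = set(), 0
    for f in sorted(findings, key=lambda f: f["opened_at"]):
        if f["fingerprint"] in seen:
            recurred += 1
        seen.add(f["fingerprint"])
    return recurred / len(findings) if findings else 0.0

def exception_rate(findings: list[dict]) -> float:
    """Share of 'resolved' findings closed as exceptions, not fixes."""
    resolved = [f for f in findings if f.get("resolved_at")]
    excepted = [f for f in resolved if f["resolution"] == "exception"]
    return len(excepted) / len(resolved) if resolved else 0.0
```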
What the Industry Gets Wrong About CSPM
“A higher compliance score means better security.” Compliance scores measure whether findings are acknowledged or suppressed, not whether risk is eliminated. Teams learn to mark findings as accepted exceptions to close them. In many organizations, a third or more of “resolved” findings are exceptions rather than actual fixes. Score goes up. Attack surface stays the same. A green dashboard and an open door. The score is the diet plan. The weight hasn’t changed.
“More findings mean you need a better scanner.” A growing backlog is a remediation problem, not a detection problem. Your scanner works fine. Your pipeline from finding to fix doesn’t. Adding another scanner is buying a second smoke detector when the first one is already screaming and nobody is grabbing the extinguisher.
“Console fixes are legitimate remediation.” Every console click that doesn’t update the underlying Terraform is a fix with an expiration date. The next `terraform apply` reverts it. Teams burn months “fixing” the same security group because nobody touches the module that keeps recreating the problem. Remediation that doesn’t change the source code isn’t remediation. It’s busywork with a security veneer.
Those 847 findings from the opening? The dashboard shrinks when findings route to IaC PRs, drift detection blocks regressions, and MTTR replaces compliance scores as the operating metric. The architecture improves. The dashboard follows. Never the other way around.