
Security Incident Response: Automate the First 15 Minutes

Metasphere Engineering

A high-confidence alert fires: an IAM access key associated with a production service just made sts:AssumeRole calls against 14 different roles in rapid succession, from an IP address in a country where you have no employees. Your SIEM scored it at 92/100 confidence. The clock starts.

The patient just arrived at the ER. Vitals are alarming. The NIST SP 800-61 Incident Handling Guide defines the framework for what happens next, and the next five minutes determine whether this is a contained incident or a breach notification.

One team’s on-call analyst gets paged, opens the SOAR platform, and the automated playbook has already done the heavy lifting. Revoked the compromised key. Quarantined the originating instance by swapping its security group. Snapshotted the instance volume for forensics. Exported the last 72 hours of CloudTrail logs to immutable storage. The ER that starts treatment on arrival. The analyst reviews the automated actions, confirms the incident, and starts root cause investigation. Total time from alert to containment: 4 minutes.

Another team’s analyst gets paged, opens Slack, asks “does anyone have the runbook for credential compromise?”, waits for a response, finds a Google Doc from 2023 that references an account structure that no longer exists, manually logs into the AWS console, searches for the right IAM user, and starts figuring out the revocation steps. The ER where the doctor has to find the chart, call for tests, wait for results. Total time from alert to containment: 47 minutes. In those 43 extra minutes, the attacker has already attempted lateral movement to three additional accounts. The patient deteriorated while the team searched for the treatment protocol.

Key takeaways
  • Automated playbooks cut containment from tens of minutes to single digits. Same alert. Same tools. The difference is executable automation versus a Google Doc someone has to find first.
  • Incident response must be software, not documentation. A PDF on SharePoint has never stopped a breach. Executable playbooks that revoke, quarantine, and snapshot automatically have.
  • Forensic evidence preservation starts before containment. CloudTrail, VPC flow logs, and DNS logs exported to write-once storage within minutes of alert. Evidence that can be tampered with is not evidence.
  • Post-incident reviews must produce code, not action items. “Improve our response to credential compromise” goes on a backlog. A new automated playbook ships immediately.
  • Tabletop exercises quarterly, live-fire drills semi-annually. The training drill. Response muscles atrophy fast. Practice the playbooks before the breach, not during it.

Architecture separated those outcomes. Not budget. Not headcount. The MITRE ATT&CK framework maps the adversary techniques that playbooks need to counter.

[Figure: Incident Response, Automated vs Manual. A high-confidence alert (credential compromise, unknown IP, 92/100) triggers both paths. Automated (SOAR): revoke compromised IAM key (30s), quarantine instance by swapping its security group (45s), snapshot volume for forensics (1m), export CloudTrail to immutable S3 (1.5m); contained in 4 minutes. Manual: engineer paged and wakes up (5 min), asks for the runbook in Slack (8 min), finds an outdated 2023 Google Doc (12 min), logs into AWS and finds the IAM user (10 min), works out revocation steps (12 min); 47 minutes total. The automated side finishes while the manual side is still on step 3. The 43 extra minutes allowed lateral movement to 3 additional accounts.]

Detection: Where Most Programs Fail First

Dwell time is how long an attacker operates undetected inside your environment. With a dedicated SOC and mature detection, dwell time drops to days. Without mature detection, organizations often learn about breaches months later, sometimes from external parties like law enforcement or a journalist.

Out-of-box SIEM rules flood teams with hundreds or thousands of alerts per day. The overwhelming majority are false positives. Alert fatigue sets in within weeks. Analysts start ignoring alerts, or worse, disable rules that generate too much noise. More rules means more noise. Smarter rules move the needle.

[Figure: Detection Enrichment, Raw Event to Actionable Alert. A raw event (IP 203.0.113.5, API call, no context) is enriched with identity context (user jane.doe, engineer, platform team), asset classification (prod-db, critical, contains PII), and behavioral baseline (normal hours 9-5, current time 2:30 AM, abnormal), producing an actionable alert: who (jane.doe), what (prod-db at 2 AM), risk HIGH (PII plus off-hours), action: page and contain. Raw events are noise; enriched events are intelligence.]

Building the Context Enrichment Layer

The raw event stream from your cloud security tools provides telemetry. Turning that telemetry into actionable alerts requires enrichment your vendor can’t do for you, because it depends on your specific environment.

Three enrichment dimensions cut false positive rates drastically:

Identity context. Is this a service account or a human? What permissions does the role grant? A service account making AssumeRole calls at predictable intervals is normal baseline behavior. The same calls from a human account at an unusual hour, against roles it’s never assumed before, is a high-confidence signal.

Asset context. Production or sandbox? What data classification tier does this system hold? A suspicious outbound transfer from a development sandbox is worth investigating. The same transfer from a system classified as holding customer PII is an immediate escalation. Context determines whether the same event is noise or signal.

Behavioral baseline. What does 30 days of normal activity look like for this identity, this asset, this network segment? Anomalies measured against a real baseline are far more reliable than static thresholds.
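As a sketch, the three enrichment dimensions above can be combined into a simple triage score. The event fields, weights, and thresholds below are illustrative assumptions, not a vendor API:

```python
from dataclasses import dataclass

@dataclass
class EnrichedEvent:
    identity_type: str      # "service" or "human" (hypothetical field names)
    asset_tier: str         # "sandbox", "production", or "pii"
    within_baseline: bool   # inside the 30-day behavioral baseline?

def risk_level(event: EnrichedEvent) -> str:
    """Combine the three enrichment dimensions into a triage level."""
    score = 0
    if event.identity_type == "human":
        score += 1          # interactive human API activity is rarer
    if event.asset_tier == "production":
        score += 1
    elif event.asset_tier == "pii":
        score += 2          # PII-holding systems escalate immediately
    if not event.within_baseline:
        score += 2          # off-baseline behavior is the strongest signal
    if score >= 4:
        return "page"       # page on-call and trigger containment
    if score >= 2:
        return "investigate"
    return "log"
```

With these weights, the article’s example (a human account, a PII system, off-hours activity) scores 5 and pages immediately, while routine service-account activity in a sandbox only gets logged.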

| Maturity Level | Rule Source | False Positive Rate | Detection Quality |
| --- | --- | --- | --- |
| Level 1: Out-of-box | Vendor default rules, SIEM templates | High (70-90% noise) | Catches obvious attacks. Misses targeted ones. Alert fatigue guaranteed |
| Level 2: Tuned | Vendor rules + environment-specific thresholds | Medium (30-50% noise) | Reduced noise. Still misses behavioral anomalies |
| Level 3: Context-enriched | Custom rules with identity, asset, and behavioral context | Low (10-20% noise) | Alerts include who, what, why. Analyst can act without investigation |
| Level 4: ML-augmented | Behavioral baselines + anomaly detection + context | Lowest (<10% noise) | Catches novel attack patterns. Requires clean data from Level 3 |
Prerequisites
  1. CloudTrail enabled for all regions with centralized log delivery
  2. VPC flow logs active on production subnets with 90+ day retention
  3. DNS query logging enabled for all VPCs
  4. Asset inventory with data classification tiers maintained and current
  5. Identity provider audit logs streaming to SIEM
  6. Baseline behavioral data from at least 30 days of normal operations
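The first prerequisite can be checked mechanically. A minimal sketch, assuming trail descriptions in the same shape as CloudTrail’s DescribeTrails response (`Name`, `IsMultiRegionTrail`, `LogFileValidationEnabled`, and `S3BucketName` are real response fields; the gap messages are our own):

```python
def audit_trails(trails: list[dict]) -> list[str]:
    """Return prerequisite gaps for a list of CloudTrail trail
    descriptions shaped like the DescribeTrails API response."""
    findings = []
    if not any(t.get("IsMultiRegionTrail") for t in trails):
        findings.append("no multi-region trail: regional activity is invisible")
    for t in trails:
        name = t.get("Name", "?")
        if not t.get("LogFileValidationEnabled"):
            findings.append(f"{name}: log file validation disabled, "
                            "tampering would be undetectable")
        if not t.get("S3BucketName"):
            findings.append(f"{name}: no centralized S3 log delivery")
    return findings
```

Running a check like this on a schedule, rather than discovering a missing trail mid-incident, is the point of the prerequisites list.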

Automated Containment Playbooks

SOAR platforms run playbooks on alert trigger. Seconds versus the better part of an hour manually. The standard credential compromise playbook runs six steps before a human ever opens a laptop:

  1. Revoke the suspected credential (disable IAM access key or delete temporary session)
  2. Quarantine the affected instance (swap its security group to an isolation group with no egress)
  3. Snapshot the instance volume (preserve forensic evidence before any state changes)
  4. Export CloudTrail logs for the past 72 hours to immutable S3 with Object Lock
  5. Open an incident ticket with pre-populated context (affected resource, alert details, automated actions taken)
  6. Page the on-call security engineer with the ticket link

By the time the analyst opens the laptop, containment is already done.
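A minimal sketch of how a SOAR playbook might encode those six steps as an ordered, auditable plan. The quarantine security group ID, evidence bucket name, and ticketing/paging integrations are hypothetical; the AWS operation names in `api` correspond to real boto3 calls an executor would invoke:

```python
import datetime

def credential_compromise_plan(access_key_id: str, instance_id: str,
                               volume_id: str, incident_id: str) -> list[dict]:
    """Build the ordered containment plan for the six-step playbook.
    An executor maps each step onto the matching API call."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [
        {"step": 1, "api": "iam.update_access_key",          # real boto3 op
         "args": {"AccessKeyId": access_key_id, "Status": "Inactive"}},
        {"step": 2, "api": "ec2.modify_instance_attribute",  # swap to isolation SG
         "args": {"InstanceId": instance_id,
                  "Groups": ["sg-quarantine-no-egress"]}},   # hypothetical SG id
        {"step": 3, "api": "ec2.create_snapshot",
         "args": {"VolumeId": volume_id,
                  "Description": f"forensics {incident_id} {now}"}},
        {"step": 4, "api": "cloudtrail.lookup_events + s3.put_object",
         "args": {"lookback_hours": 72,
                  "bucket": "ir-evidence-worm"}},            # Object Lock bucket
        {"step": 5, "api": "ticketing.create",               # whatever tracker you run
         "args": {"incident_id": incident_id}},
        {"step": 6, "api": "pagerduty.trigger",              # or any paging service
         "args": {"incident_id": incident_id}},
    ]
```

Building the plan as data before executing it gives you a dry-run mode for game days and a machine-readable record of automated actions for the incident ticket.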

[Figure: Security Incident Response, Five Phases. Detect (SIEM alert fires, context enriched), Triage (severity classification, scope assessment), Contain (isolate affected systems; preserve evidence first, then contain), Eradicate (remove root cause, patch vulnerability), Recover and Review (restore services, post-incident review). Preserve evidence before containing: containment destroys forensic artifacts.]

Treat playbooks like production software: version control, staging tests, game days. Automated remediation patterns apply directly. A playbook that’s never been tested in a realistic scenario is a playbook that will fail when the real incident hits. (A runbook nobody ran before the real incident.)

| Incident Type | Manual Containment | Automated Containment | Key Automation Steps |
| --- | --- | --- | --- |
| Credential compromise | 30-60 minutes | Under 5 minutes | Key revocation, session termination, instance quarantine |
| Malware detection | 45-90 minutes | Under 10 minutes | Network isolation, memory capture, endpoint scan trigger |
| Data exfiltration | 60-120 minutes | Under 15 minutes | Egress block, source isolation, transfer log preservation |
| Unauthorized access | 20-45 minutes | Under 5 minutes | Session kill, MFA enforcement, access log export |

Forensic Evidence Preservation

Cloud infrastructure actively destroys forensic evidence. Auto-scaling terminates instances. Logs rotate. Ephemeral containers disappear. The evidence you need to understand what happened is being deleted by the infrastructure doing its normal job.

Set these up before the incident: CloudTrail with Object Lock, VPC flow logs with 90+ day retention, GuardDuty findings, DNS query logging to a centralized store. When an incident is confirmed, the evidence preservation sequence must happen before any containment action that alters state. Preserve first. Contain second. Always.

| Action | When | Why | How |
| --- | --- | --- | --- |
| Snapshot affected volumes | Before containment; first action | Volume state changes during containment. Pre-containment snapshot preserves attacker artifacts | AWS: create EBS snapshot. Tag with incident ID and timestamp |
| Export logs to immutable storage | Before containment | Attacker may have access to delete CloudTrail, VPC flow logs, or application logs | Copy to S3 with Object Lock (WORM). Cross-account for extra protection |
| Capture running process state | Before terminating instances | Memory-resident malware disappears on shutdown. Process list, network connections, open files | SSM RunCommand to capture /proc state, netstat, lsof output |
| Preserve network flow data | Before changing security groups | Network connections reveal lateral movement paths and data exfiltration endpoints | VPC Flow Logs + DNS query logs. Ensure retention covers the investigation window |
| Document the timeline | Throughout | Chain of custody: who did what, when, and why. Required for legal proceedings | Incident channel with timestamped entries. No retroactive edits |

Contain AFTER preserving. Evidence destroyed during containment is evidence lost forever.

Anti-pattern

Don’t: Immediately terminate compromised instances to “stop the bleeding.” Termination destroys memory state, running processes, and ephemeral storage that forensic investigation needs. You kill the evidence along with the threat.

Do: Snapshot the volume, then quarantine by swapping the security group to a no-egress isolation group. The attacker loses network access, but the instance stays running and its state is preserved for investigation. Preserve first. Quarantine second.

Automate this entire preservation sequence in SOAR. Humans skip steps under stress, especially memory capture, which needs tooling most teams haven’t pre-installed.
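One way to enforce the preserve-first invariant in a SOAR engine is a pre-flight check over the playbook’s step list before execution. The step names here are hypothetical labels, not a product API, and the check validates ordering only (not completeness):

```python
# Evidence-preservation steps and state-altering containment steps,
# mirroring the preservation table above (labels are illustrative).
PRESERVE = {"snapshot_volumes", "export_logs", "capture_memory",
            "capture_flows"}
CONTAIN = {"swap_security_group", "terminate_instance"}

def preserves_before_containing(steps: list[str]) -> bool:
    """True if every preservation step present in the playbook runs
    before the first state-altering containment step."""
    first_contain = next(
        (i for i, s in enumerate(steps) if s in CONTAIN), len(steps))
    return all(i < first_contain
               for i, s in enumerate(steps) if s in PRESERVE)
```

A playbook that fails this check should be rejected at load time, not discovered mid-incident after the quarantine has already destroyed the volatile evidence.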

Timeline Reconstruction and Root Cause

Pull CloudTrail, VPC flow logs, application logs, DNS queries, and endpoint detection into one timeline, and you can see the full story: how they got in, how long they were there, what they touched, what they installed, what they may have taken, and when you finally noticed. Incident management tools that support multi-source timelines make this much faster.

NTP drift across log sources makes correlation unreliable unless all sources feed into a centralized store with synchronized clocks and normalized timestamps. Have this infrastructure in place before the incident. Building it during an active breach is a recipe for missed evidence and incorrect timelines.

Timeline reconstruction source priority
  1. CloudTrail - API-level activity for all AWS services. The authoritative record of who did what.
  2. VPC Flow Logs - Network-level evidence of connections, volumes, and destinations.
  3. DNS Query Logs - Reveals command-and-control domains and data exfiltration channels.
  4. Application Logs - Shows what the application saw and did. Often reveals the initial exploitation vector.
  5. Endpoint Detection (EDR) - Process-level activity on affected hosts. Captures persistence mechanisms.
  6. GuardDuty Findings - Consolidated threat intelligence with severity scoring.

Cross-correlate by timestamp. Build the timeline in reverse from detection to initial compromise. The dwell time between initial compromise and detection is the most important number in any post-incident report because it determines the scope of what the attacker had access to.
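Once timestamps are normalized, the correlation step reduces to a sort plus a dwell-time calculation. A minimal sketch, assuming events are already converted to UTC and tagged with a hypothetical `phase` field; the earliest correlated event stands in for initial compromise:

```python
from datetime import datetime, timedelta

def build_timeline(events: list[dict]) -> list[dict]:
    """Merge multi-source events into one chronological timeline.
    Assumes each event carries a normalized UTC 'ts' timestamp."""
    return sorted(events, key=lambda e: e["ts"])

def dwell_time(timeline: list[dict]) -> timedelta:
    """Detection timestamp minus the earliest correlated event,
    which approximates initial compromise."""
    compromise = timeline[0]["ts"]
    detection = next(e["ts"] for e in timeline
                     if e.get("phase") == "detection")
    return detection - compromise
```

Because the earliest event across all sources anchors the calculation, adding a new source (say, DNS logs that show beaconing hours before the first API call) can only lengthen the reported dwell time, never shorten it.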

Post-Incident Work That Prevents Recurrence

“Improve security awareness training” as a post-mortem finding means the team doesn’t know what to fix. Useful findings are specific and produce artifacts: a detection rule that catches this specific compromise pattern earlier, a playbook step that timed out and needs a faster implementation, a credential that should have been federated instead of stored as a long-lived key.

[Figure: Post-Incident, the Work That Prevents Recurrence. Build Timeline (blameless chronology within 24 hours; who knew what, when; memory fades fast). Systemic Findings (not just proximate cause; what allowed it to happen; missing alerts, gaps; root cause is a system). Assign Actions (owner and deadline per item, tracked in the issue tracker, not wiki action items; actions without owners die). Verify Prevention (game day the fix; reproduce the original scenario and verify it is blocked; proof, not hope). A post-mortem without tracked actions is a group therapy session.]

Findings feed into the DevOps backlog as engineering work, not documentation updates. The test for whether a post-mortem produced value: did it change what happens automatically next time? Did the team ship a new detection rule, or did someone add “improve detection” to a backlog nobody checks?

The 43-Minute Gap

The time difference between automated and manual containment for a standard credential compromise. Automated playbook: 4 minutes (revoke key, quarantine instance, snapshot volume, export logs). Manual process: 47 minutes (Slack thread, stale runbook, console login, manual revocation). That is 43 minutes of additional attacker dwell time, and every minute in that gap is an opportunity for lateral movement, data access, and persistence installation. The gap is architecture, not headcount or budget.

What the Industry Gets Wrong About Incident Response

“A response plan in a shared drive is preparation.” A PDF nobody has read since it was written, referencing an account structure that changed twice, is the illusion of preparation. Response plans must be executable code, tested quarterly, and updated after every architecture change. If the plan references an IAM structure from 2023 and you restructured accounts in 2024, the plan is worse than nothing because it creates false confidence.

“Post-mortems produce improvement.” Post-mortems that produce action items produce documentation. Post-mortems that produce code (new playbooks, updated detection rules, automated containment steps) produce improvement. Action items decay on backlogs. Code ships and runs.

“You need a 24/7 SOC to have incident response.” Automated playbooks handle the first five minutes of containment without a human. The human reviews and confirms what automation already contained. For most organizations, the combination of automated containment plus on-call rotation provides faster response than a staffed SOC with manual runbooks.

Our take

Automate containment for the top 3 incident types your organization handles. Credential compromise, malware detection, and data exfiltration cover the majority of security incidents in most environments. Automated playbooks for these three scenarios deliver more improvement than comprehensive planning for every conceivable incident type. Get the common cases to under 5 minutes, then expand coverage.

That IAM key making AssumeRole calls across 14 roles? Contained in 4 minutes. Key revoked, instance quarantined, logs preserved in immutable storage, analyst paged with full context. Each incident makes the next faster to catch and harder for the attacker to exploit. The loop compounds. Teams treating post-mortems as engineering inputs, not documentation exercises, never get breached the same way twice.

43 Minutes of Attacker Dwell Time You Could Have Prevented

A response plan in a PDF does not stop a breach. Detection pipelines, automated containment playbooks, and forensic preservation infrastructure are what make incident response work when the clock is ticking and every minute matters.


Frequently Asked Questions

What is the difference between SIEM and SOAR?


A SIEM collects, normalizes, and correlates security events to generate alerts. A SOAR platform automates the response actions those alerts trigger: isolating hosts, revoking credentials, capturing forensic evidence, and paging the on-call team. Organizations running both typically compress Mean Time to Contain sharply, cutting containment from hours to minutes for common incident types.

How do you reduce SIEM false positives without losing detection coverage?


Build context into detection rules rather than lowering sensitivity. A single failed login is noise. 200 failed logins from one IP against multiple accounts in 5 minutes is a signal. Context enrichment with asset classification, identity type, and behavioral baselines transforms raw events into actionable alerts. Teams that build enrichment routinely cut false positives by 70-80% while keeping or improving detection coverage.

How do you preserve forensic evidence in cloud before containment destroys it?


Act before containment. Immediately snapshot affected EBS volumes, export CloudTrail logs for the past 72 hours to S3 with Object Lock, capture VPC flow logs for affected subnets, and take memory dumps from running instances if tooling supports it. All of this must happen before security group changes or instance termination that would destroy volatile evidence.

What should a post-incident timeline reconstruction include?


A complete timeline establishes: initial compromise method and timestamp, dwell time (undetected access duration), systems and data accessed, persistence mechanisms installed, data exfiltrated, and detection and containment timestamps. Sources include CloudTrail, VPC flow logs, application logs, endpoint detection, and DNS query logs.

What metrics should we track for incident response capability?


Mean Time to Detect is the most critical metric. Set aggressive targets by severity tier: Severity 1 incidents should be detected within hours, not days, and contained rapidly. Mean Time to Contain should be measured against the same severity tiers. Track both and review quarterly. Teams that measure and review these metrics consistently compress both each quarter.