Security Incident Response Plan: Building Your IR Playbook

Build a security incident response plan using the NIST framework. Covers team roles, detection, containment, recovery, and tabletop exercise design.

March 9, 2026 · 8 min read · ShipSafer Team

A security incident at 2am is the worst time to figure out who is on call, who has authority to take systems offline, and what you are legally required to disclose. Organizations that handle incidents well do so because they prepared before the incident, not during it. An incident response plan is not a document you file and forget — it is an operational playbook that your team drills regularly.

This guide walks through building a plan based on the NIST SP 800-61 framework, which defines four phases: Preparation, Detection and Analysis, Containment/Eradication/Recovery, and Post-Incident Activity.

Phase 1: Preparation

Preparation is the most important phase — and the one most organizations underfund until after their first serious incident.

Define What Counts as an Incident

Not every security event is an incident requiring formal response. Define your classification scheme:

  • Event: Any observable occurrence in a system or network (a failed login attempt, a port scan)
  • Alert: An event flagged by automated systems as potentially security-relevant
  • Incident: A confirmed security event that adversely affects the confidentiality, integrity, or availability of information or systems

Incidents should be categorized by severity:

| Severity | Definition | Response Time |
| --- | --- | --- |
| Critical (P1) | Active breach, ransomware, confirmed data exfiltration | Immediate, 24/7 |
| High (P2) | Suspected breach, confirmed compromise of a single system | Within 2 hours |
| Medium (P3) | Malware detected and contained, failed targeted attack | Within 8 hours |
| Low (P4) | Policy violation, minor security event with no data at risk | Next business day |
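As a sketch, the classification above maps onto a small lookup that on-call tooling can consult. The `Severity` and `RESPONSE_SLA` names are illustrative, not part of any standard, and "next business day" is approximated as 24 hours:

```python
from datetime import timedelta
from enum import Enum

class Severity(Enum):
    """Severity levels from the classification table."""
    CRITICAL = "P1"  # Active breach, ransomware, confirmed exfiltration
    HIGH = "P2"      # Suspected breach, single-system compromise
    MEDIUM = "P3"    # Contained malware, failed targeted attack
    LOW = "P4"       # Policy violation, no data at risk

# Maximum time allowed before the IR team must engage.
RESPONSE_SLA = {
    Severity.CRITICAL: timedelta(0),       # Immediate, 24/7
    Severity.HIGH: timedelta(hours=2),
    Severity.MEDIUM: timedelta(hours=8),
    Severity.LOW: timedelta(hours=24),     # Next business day (approximated)
}

def sla_breached(severity: Severity, elapsed: timedelta) -> bool:
    """Return True if the response window for this severity has passed."""
    return elapsed > RESPONSE_SLA[severity]
```

A paging integration can call `sla_breached` on a timer to escalate unacknowledged incidents.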

Assemble the Incident Response Team

Assign explicit roles before incidents happen. Do not assume people know what to do during a crisis.

Incident Commander (IC): Coordinates the overall response, makes decisions about escalation, communicates with leadership. During a major incident, this person's only job is coordination — they do not perform technical analysis.

Technical Lead: Owns the forensic analysis, identifies attack vectors, and implements containment measures. Typically a senior engineer or security engineer.

Communications Lead: Manages internal communications (to leadership, all-staff), external communications (to customers, regulators, press), and legal coordination. This should be someone with authority to speak on behalf of the organization.

Scribe: Documents everything during the incident — timeline of events, decisions made, evidence collected, communications sent. This record is critical for post-incident review and, if necessary, regulatory reporting.

Legal Counsel: On call for incidents involving data breaches, as notification obligations have hard deadlines and legal complexity.

Pre-Establish Resources

Before incidents happen, establish:

  • An out-of-band communication channel (incident bridge number or Slack workspace separate from your main tools, in case your primary tools are compromised)
  • Contact list for all team members, their backups, and outside resources (forensic investigators, legal counsel)
  • Authority matrix specifying who can authorize taking production systems offline
  • Retainer with a digital forensics and incident response (DFIR) firm for major incidents
  • Pre-drafted notification templates for customers and regulators

Build Your Detection Infrastructure

You cannot respond to incidents you do not detect. At minimum:

  • Centralized logging aggregating authentication events, administrative actions, and application errors
  • Alerting for high-priority events: multiple failed authentication attempts, impossible travel, new administrative account creation, large data exports
  • Endpoint detection and response (EDR) on company-managed devices
  • Cloud security posture monitoring for unauthorized configuration changes
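The failed-authentication alert from the list above can be sketched as a sliding-window threshold check. The event tuple shape and the 5-failures-in-10-minutes threshold are assumptions to adapt to your own log schema:

```python
from collections import defaultdict
from datetime import timedelta

FAILED_AUTH = "auth.failure"  # hypothetical event type name

def failed_auth_alerts(events, threshold=5, window=timedelta(minutes=10)):
    """Flag users with >= `threshold` failed logins inside a sliding window.

    `events` is an iterable of (timestamp, user, event_type) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(list)  # user -> timestamps of recent failures
    alerts = set()
    for ts, user, event_type in events:
        if event_type != FAILED_AUTH:
            continue
        recent[user].append(ts)
        # Drop failures that have aged out of the window.
        recent[user] = [t for t in recent[user] if ts - t <= window]
        if len(recent[user]) >= threshold:
            alerts.add(user)
    return alerts
```

The same pattern generalizes to other list items, such as bursts of large data exports.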

Phase 2: Detection and Analysis

When an alert fires or someone reports a potential incident, the response starts with validation and scoping.

Initial Triage Checklist

Within the first 30 minutes of a potential incident:

  1. Determine if the event is real or a false positive: Gather initial facts. What system? What user? What time? Cross-reference with other data sources.
  2. Assign a severity level: Based on initial information, classify the incident. You can reclassify as you learn more.
  3. Activate the IR team: Based on severity, wake up the appropriate people.
  4. Declare the incident formally: Open an incident record with a unique identifier. Start the documentation clock.
  5. Preserve evidence: Do not take remediation actions that destroy evidence before you have captured what you need. Take memory snapshots before rebooting compromised systems. Export logs before they roll over.
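Steps 2 through 4 of the checklist can be captured in a minimal incident record. The `IncidentRecord` structure and ID format here are illustrative, not a prescribed schema:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """Formal incident declaration: unique ID plus a running log."""
    severity: str   # e.g. "P1".."P4"; reclassifiable as facts emerge
    summary: str
    incident_id: str = field(
        default_factory=lambda: f"IR-{uuid.uuid4().hex[:8]}")
    declared_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    timeline: list = field(default_factory=list)  # (timestamp, note) pairs

    def log(self, note: str) -> None:
        """Append a timestamped entry -- starts the documentation clock."""
        self.timeline.append((datetime.now(timezone.utc), note))

    def reclassify(self, severity: str, reason: str) -> None:
        """Step 2 allows reclassification as you learn more."""
        self.log(f"Severity changed {self.severity} -> {severity}: {reason}")
        self.severity = severity
```

Opening the record early gives the scribe one place to append every observation and decision.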

Scoping the Incident

Understanding the blast radius is the most critical early analysis task. Ask:

  • What systems are confirmed compromised?
  • What data was potentially accessed or exfiltrated?
  • How long has the attacker had access? (Determine the initial access timestamp)
  • Are there indicators of lateral movement to other systems?
  • What is the business impact of the affected systems?

Build a timeline from your logs. The attacker's first action, the initial foothold, lateral movement, and the triggering event that caused detection — lay these out chronologically. Gaps in the timeline indicate logging coverage gaps you need to address after the incident.
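One way to surface those timeline gaps programmatically is to sort events and flag silent stretches. The six-hour threshold and the event tuple shape are assumptions to tune for your environment:

```python
from datetime import timedelta

def timeline_gaps(events, max_gap=timedelta(hours=6)):
    """Order (timestamp, description) events chronologically and report
    any silent stretch longer than `max_gap` -- a likely logging blind
    spot to close after the incident."""
    ordered = sorted(events)
    gaps = []
    for (t1, _), (t2, _) in zip(ordered, ordered[1:]):
        if t2 - t1 > max_gap:
            gaps.append((t1, t2))
    return ordered, gaps
```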

Phase 3: Containment, Eradication, and Recovery

Containment

Containment stops the bleeding without necessarily fixing the root cause. Containment actions depend on the incident type:

Compromised user account: Disable the account, revoke active sessions, revoke API tokens and OAuth grants. Force re-authentication for any accounts that may have been accessed from the same session.

Compromised server or container: Isolate the system from the network (revoke security group rules, move to an isolated VPC, or take it offline). Do not reboot — preserve volatile memory if forensics are needed.

Active ransomware: Isolate affected systems immediately to prevent lateral spread. Disconnect from network shares and backup systems (ransomware often targets backups).

Data exfiltration in progress: Terminate the exfiltration channel (revoke API key, block IP, revoke session), but capture network logs first to understand what was taken.

Document every containment action with timestamp and person responsible.
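A minimal evidence log for those containment actions might look like the following; the CSV format and field order are illustrative, and anything append-only with timestamps works equally well:

```python
import csv
from datetime import datetime, timezone

def record_containment(log_path, action, operator, system):
    """Append one containment action to a CSV evidence log.

    Each row captures when (UTC), who, what system, and what was done --
    the record the post-incident review and regulators will ask for.
    """
    entry = [datetime.now(timezone.utc).isoformat(), operator, system, action]
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow(entry)
    return entry
```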

Eradication

After containment, identify and remove the root cause:

  • Remove malware and attacker tools
  • Close the initial access vector (patch the vulnerability, revoke the compromised credential, remove the malicious OAuth application)
  • Identify and close any backdoors or persistence mechanisms the attacker installed
  • Review for additional compromised accounts using the same attack vector

Do not proceed to recovery until you are confident the attacker no longer has access.
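A rough sketch of the persistence sweep, assuming you have established the initial access timestamp and can enumerate candidate artifacts (cron entries, new SSH keys, OAuth grants, scheduled tasks) with their creation times:

```python
def flag_persistence(artifacts, initial_access):
    """Return artifacts created at or after initial access, oldest first --
    candidate backdoors or persistence mechanisms to review before
    declaring eradication complete.

    `artifacts` is an iterable of (created_at, description) tuples.
    """
    return sorted((c, d) for c, d in artifacts if c >= initial_access)
```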

Recovery

Recovery returns systems to normal operation:

  • Rebuild affected systems from known-good images or infrastructure-as-code definitions rather than cleaning infected systems in place
  • Restore data from backups taken before the compromise (verify backup integrity first)
  • Deploy additional monitoring to detect any recurrence
  • Gradually restore access with enhanced monitoring

For major incidents, consider a phased recovery rather than restoring everything at once. Monitor closely in the 24–48 hours following restoration.
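Verifying backup integrity before restoring can be as simple as comparing checksums, assuming you record a SHA-256 digest alongside each backup at the time it is taken:

```python
import hashlib

def verify_backup(path, expected_sha256, chunk_size=1 << 20):
    """Compare a backup file's SHA-256 digest against the checksum
    recorded when the backup was taken. Restore only if this is True."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```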

Phase 4: Post-Incident Activity

Post-Incident Review

Within 5–7 business days of resolving the incident, conduct a formal post-incident review. The review should produce:

  • A complete incident timeline from initial access to resolution
  • Root cause analysis (the technical root cause and the process/organizational factors that enabled it)
  • Impact assessment (systems affected, data at risk, customer impact, regulatory implications)
  • Lessons learned
  • Action items with owners and due dates

Structure the review as a blameless post-mortem. The goal is systemic improvement, not assigning fault to individuals. Ask "what allowed this to happen" rather than "who caused this."

Updating the Plan

An incident response plan that is not updated after real incidents degrades quickly. After every significant incident or tabletop exercise, update:

  • Contact lists and on-call rotations
  • Playbooks that were found to be inaccurate or incomplete
  • Severity classification if the incident revealed gaps
  • Detection capabilities if the attack vector was not caught by existing monitoring

Tabletop Exercises

A tabletop exercise is a facilitated discussion where the IR team walks through a simulated incident scenario without performing live technical actions. It is the most cost-effective way to test your plan.

Running a Tabletop

  1. Select a scenario: Choose a realistic incident type relevant to your threat model. Common scenarios: ransomware from a phishing email, stolen developer credentials used to access production, a vendor breach exposing your data.

  2. Set the scene: The facilitator presents the initial indicator. "At 3am on a Saturday, your monitoring system alerts that a large number of records are being exported from your customer database using an API key associated with your analytics integration."

  3. Walk through each phase: As a team, discuss what actions you would take. Who do you call? What do you preserve? What questions do you ask? When do you notify customers? When do you notify regulators?

  4. Identify gaps: Document every instance where the team was unsure what to do, the plan did not cover the scenario, or the right contact information was unavailable.

  5. Produce action items: Assign specific improvements with owners and deadlines.

Run tabletops quarterly for your core IR team and annually with extended stakeholders including legal, communications, and executive leadership. The first tabletop is almost always uncomfortable — that discomfort is exactly the point.
