Incident response

Sev levels

| Sev | Definition | Response |
| --- | --- | --- |
| Sev1 | Production down or PHI breach | Page primary immediately; war room within 15 min |
| Sev2 | Significant degradation or known data corruption | Page primary; investigate within 30 min |
| Sev3 | Limited user impact or non-critical service degraded | Slack alert; investigate within 4 hours |
| Sev4 | Cosmetic / non-urgent | Ticket; next business day |
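
If the severity table needs to drive alert routing or tooling, it can be encoded as data. The following is a minimal sketch, not an existing module: the names `Sev`, `ResponsePolicy`, `POLICIES`, and `requires_page` are hypothetical, and only the channels and time limits come from the table above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Sev(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class ResponsePolicy:
    notify: str                          # "page", "slack", or "ticket"
    respond_within: Optional[timedelta]  # None means "next business day"


# Values mirror the Sev table. For Sev1 the 15 minutes is the war room
# deadline; the page itself goes out immediately.
POLICIES = {
    Sev.SEV1: ResponsePolicy("page", timedelta(minutes=15)),
    Sev.SEV2: ResponsePolicy("page", timedelta(minutes=30)),
    Sev.SEV3: ResponsePolicy("slack", timedelta(hours=4)),
    Sev.SEV4: ResponsePolicy("ticket", None),
}


def requires_page(sev: Sev) -> bool:
    """Return True when the severity pages the primary on-call."""
    return POLICIES[sev].notify == "page"
```

Keeping the policy in one table means a change to a response time is a one-line edit rather than a hunt through alerting rules.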

First 15 minutes (Sev1)

  1. Acknowledge the page in PagerDuty.
  2. Open the war room Slack channel: #incident-<ticket-id>.
  3. Update status page — set to “investigating.”
  4. Identify scope: which service(s), which brand(s), how many users.
  5. Decide: rollback or roll forward?
    • Recent deploy + obvious correlation → rollback (faster, lower risk).
    • Old code + new pattern of failure → roll forward with a fix.
  6. Communicate every 15 min in the war room even if there’s no progress.
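
The rollback-or-roll-forward rule in step 5 can be sketched as a small helper. This is illustrative only: `war_room_channel` and `suggest_mitigation` are hypothetical names, and the one-hour correlation window is an assumption, since the runbook does not define what counts as a "recent" deploy. The on-call engineer still makes the final call.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def war_room_channel(ticket_id: str) -> str:
    """Build the war room channel name used in step 2."""
    return f"#incident-{ticket_id}"


def suggest_mitigation(
    last_deploy_at: Optional[datetime],
    failures_started_at: datetime,
    window: timedelta = timedelta(hours=1),  # assumed meaning of "recent"
) -> str:
    """Suggest "rollback" when failures began shortly after a deploy.

    Anything else (no deploy, or failures on old code) suggests
    "roll-forward" with a fix.
    """
    if last_deploy_at is not None:
        gap = failures_started_at - last_deploy_at
        if timedelta(0) <= gap <= window:
            return "rollback"
    return "roll-forward"
```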

Investigation playbook

Start with:

  • Dashboards: error rate, latency, DLQ depth, vendor circuit breaker state.
  • Logs: search by request_id, user_id, or audit row.
  • Recent deploys: `gh run list --workflow deploy-prod.yml --limit 5`.
  • Vendor status: Stripe, AWS, BigCommerce, etc.
  • Database: `pg_stat_activity`, recent migrations.
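
To check whether a deploy lines up with the start of an incident, the `gh run list` output can be read as JSON and filtered by time. The sketch below assumes the `gh` CLI is installed and authenticated; the function names and the two-hour lookback are hypothetical choices, not part of this runbook.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def fetch_recent_deploys(limit: int = 5) -> list:
    """Return recent deploy-prod.yml runs as dicts (requires the gh CLI)."""
    result = subprocess.run(
        ["gh", "run", "list", "--workflow", "deploy-prod.yml",
         "--limit", str(limit),
         "--json", "createdAt,headSha,conclusion,url"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def deploys_before(runs: list, incident_start: datetime,
                   lookback: timedelta = timedelta(hours=2)) -> list:
    """Keep runs created within `lookback` before the incident, newest first."""
    suspects = []
    for run in runs:
        # gh emits UTC timestamps such as "2024-05-01T11:45:00Z".
        created = datetime.fromisoformat(run["createdAt"].replace("Z", "+00:00"))
        if incident_start - lookback <= created <= incident_start:
            suspects.append((created, run))
    suspects.sort(key=lambda pair: pair[0], reverse=True)
    return [run for _, run in suspects]
```

During an incident this would be called as `deploys_before(fetch_recent_deploys(), incident_start)`, with `incident_start` as a timezone-aware UTC datetime.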

Common patterns:

| Symptom | Likely cause |
| --- | --- |
| Error rate spike right after deploy | Bad release; roll back |
| Slow queries appearing in a chunk | Lock contention or missing index |
| 5xx on one specific path | Vendor circuit open or DB connection exhausted |
| All services 500ing | Identity / token introspection / Redis down |
| Spike in 401s | Identity outage or expired secrets |

Escalation

If the primary on-call can’t resolve the incident within 30 min, escalate to the secondary. If both are still stuck after 60 min, page the CTO and the relevant service owner.

PagerDuty schedule: https://theloopway.pagerduty.com/schedules/<platform-primary>.
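
The escalation thresholds above can be expressed as a function of elapsed time, for example to drive a reminder bot. This is a sketch under assumptions: the role labels are hypothetical strings rather than PagerDuty identifiers, and treating the 30 and 60 minute marks as inclusive is a choice the runbook does not specify.

```python
from datetime import timedelta


def escalation_targets(elapsed: timedelta) -> list:
    """Return who should be engaged after `elapsed` time without resolution."""
    if elapsed >= timedelta(minutes=60):
        return ["primary", "secondary", "cto", "service-owner"]
    if elapsed >= timedelta(minutes=30):
        return ["primary", "secondary"]
    return ["primary"]
```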

Communications template

When there is real user impact:

**Incident summary**
What: <one sentence on user impact>
When: <start time, UTC>
Status: investigating | identified | mitigating | resolved
Service(s): <names>
Brand(s): <names or "all">

**Latest update** (timestamp)
<what we know, what we're doing>

**Next update by** <timestamp>

Post this to the #incidents Slack channel, the status page, and the war room channel.
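
Filling in the template by hand under pressure invites mistakes such as local-time timestamps or an invalid status. The helper below is a minimal sketch of how the template could be rendered; `render_update` and its parameters are hypothetical, and defaulting an empty brand list to "all" is an assumption.

```python
from datetime import datetime, timezone

VALID_STATUSES = ("investigating", "identified", "mitigating", "resolved")


def _utc(ts: datetime) -> str:
    """Format a timezone-aware datetime in UTC, as the template requires."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def render_update(what, started_at, status, services, brands,
                  update, updated_at, next_update_by) -> str:
    """Render the communications template as a single string."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return "\n".join([
        "**Incident summary**",
        f"What: {what}",
        f"When: {_utc(started_at)}",
        f"Status: {status}",
        f"Service(s): {', '.join(services)}",
        f"Brand(s): {', '.join(brands) if brands else 'all'}",
        "",
        f"**Latest update** ({_utc(updated_at)})",
        update,
        "",
        f"**Next update by** {_utc(next_update_by)}",
    ])
```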

Post-mortem

Required for Sev1 and Sev2 incidents. Complete it within 5 business days of resolution.

Use the docs/incidents/<YYYY-MM-DD>-<short-name>.md template:

  • Timeline (UTC, minute-by-minute)
  • Root cause
  • What worked / what didn’t (process, not people)
  • Action items with owners + due dates
  • Customer-facing communication summary
  • Blameless framing
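
The 5-business-day deadline and the file naming pattern are easy to get wrong by hand, so here is a small sketch of both. The function names are hypothetical. Two assumptions are baked in: business days are Monday to Friday with no holiday calendar, and the date in the file name is the incident date (the runbook does not say which date to use).

```python
import re
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Mon-Fri only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def postmortem_path(incident_date: date, short_name: str) -> str:
    """Build docs/incidents/<YYYY-MM-DD>-<short-name>.md from a free-text name."""
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    return f"docs/incidents/{incident_date.isoformat()}-{slug}.md"
```

For an incident resolved on Wednesday 2024-05-01, `add_business_days(date(2024, 5, 1), 5)` gives Wednesday 2024-05-08.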

Incidents that require a post-mortem are reviewed in the next ops sync.