Incident response

Sev levels

Sev	Definition	Response
Sev1	Production down or PHI breach	Page primary immediately; war room within 15 min
Sev2	Significant degradation or known data corruption	Page primary; investigate within 30 min
Sev3	Limited user impact or non-critical service degraded	Slack alert; investigate within 4 hours
Sev4	Cosmetic / non-urgent	Ticket; next business day
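
The severity table can also be kept as data so that tooling and people read the same policy. The following is a minimal Python sketch, not an existing configuration: the names `SevPolicy`, `SEV_POLICIES`, and `pages_on_call` are hypothetical, and the timings are copied from the table above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SevPolicy:
    definition: str
    notify: str                          # "page", "slack", or "ticket"
    respond_within: Optional[timedelta]  # None where the table gives no clock time


# Values mirror the severity table; the structure itself is an assumption.
SEV_POLICIES = {
    # For Sev1 the 15 minutes is the war-room deadline; the page itself is immediate.
    "Sev1": SevPolicy("Production down or PHI breach", "page", timedelta(minutes=15)),
    "Sev2": SevPolicy("Significant degradation or known data corruption", "page", timedelta(minutes=30)),
    "Sev3": SevPolicy("Limited user impact or non-critical service degraded", "slack", timedelta(hours=4)),
    "Sev4": SevPolicy("Cosmetic / non-urgent", "ticket", None),  # next business day
}


def pages_on_call(sev: str) -> bool:
    """Return True when the severity level pages the primary on-call."""
    return SEV_POLICIES[sev].notify == "page"
```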

First 15 minutes (Sev1)

Acknowledge the page in PagerDuty.
Open the war room Slack channel: #incident-<ticket-id>.
Update status page — set to “investigating.”
Identify scope: which service(s), which brand(s), how many users.
Decide: rollback or roll forward?
- Recent deploy + obvious correlation → rollback (faster, lower risk).
- Old code + new pattern of failure → roll forward with a fix.
Communicate every 15 min in the war room even if there’s no progress.
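
The rollback / roll-forward rule above can be sketched as a small function. This is only an illustration: `choose_mitigation` is a hypothetical name, and the one-hour `recent_window` is an assumed reading of "recent deploy", not a value this runbook defines. The on-call's judgment about whether the correlation is really "obvious" still applies.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def choose_mitigation(
    last_deploy_at: Optional[datetime],
    incident_start: datetime,
    recent_window: timedelta = timedelta(hours=1),  # assumed meaning of "recent"
) -> str:
    """Return "rollback" or "roll forward" following the two rules above."""
    if last_deploy_at is None:
        return "roll forward"
    gap = incident_start - last_deploy_at
    # A deploy shortly before the failure started is treated as the correlation.
    if timedelta(0) <= gap <= recent_window:
        return "rollback"
    # Old code with a new failure pattern: fix forward.
    return "roll forward"


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(choose_mitigation(start - timedelta(minutes=10), start))
```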

Investigation playbook

Start with:

Dashboards — error rate, latency, DLQ depth, vendor circuit breaker state.
Logs — search by request_id, user_id, or audit row.
Recent deploys — gh run list --workflow deploy-prod.yml --limit 5.
Vendor status — Stripe, AWS, BigCommerce, etc.
Database — pg_stat_activity, recent migrations.
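
To line up recent deploys with the incident start, the `gh run list` output can be requested as JSON and filtered. The sketch below assumes the command is run as `gh run list --workflow deploy-prod.yml --limit 5 --json createdAt,conclusion,headSha,displayTitle` and its output passed in as a string. The function name `deploys_before` and the one-hour window are hypothetical choices.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import List


def deploys_before(runs_json: str, incident_start: datetime,
                   window: timedelta = timedelta(hours=1)) -> List[dict]:
    """Return the workflow runs created within `window` before the incident."""
    suspects = []
    for run in json.loads(runs_json):
        # gh emits ISO 8601 timestamps with a trailing "Z"; older Python
        # versions need an explicit offset for fromisoformat.
        created = datetime.fromisoformat(run["createdAt"].replace("Z", "+00:00"))
        if timedelta(0) <= incident_start - created <= window:
            suspects.append(run)
    return suspects
```

Any run this returns is a candidate for the "recent deploy + obvious correlation" rollback rule.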

Common patterns:

Symptom	Likely cause
Error rate spike right after deploy	Bad release — rollback
Slow queries appearing in a cluster	Lock contention or missing index
5xx on one specific path	Vendor circuit open or DB connections exhausted
All services 500ing	Identity / token introspection / Redis down
Spike in 401s	Identity outage or expired secrets
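
For the lock-contention row, `pg_stat_activity` combined with `pg_blocking_pids()` (PostgreSQL 9.6 and later) shows which sessions are waiting on which. This is a minimal sketch that assumes a DB-API cursor from whatever PostgreSQL driver is in use, and that the driver returns the `int[]` column as a Python list. The names `BLOCKED_QUERIES_SQL` and `blocked_sessions` are hypothetical.

```python
BLOCKED_QUERIES_SQL = """
SELECT pid,
       state,
       wait_event_type,
       now() - query_start AS runtime,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80) AS query_head
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST
"""


def blocked_sessions(cursor):
    """Return (pid, blocked_by, query_head) for sessions waiting on another backend."""
    cursor.execute(BLOCKED_QUERIES_SQL)
    # Keep only rows whose blocked_by array is non-empty.
    return [(row[0], row[4], row[5]) for row in cursor.fetchall() if row[4]]
```

An empty result points away from lock contention and toward the missing-index explanation.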

Escalation

If the primary on-call can’t resolve the incident within 30 min, escalate to the secondary. If both are still stuck after 60 min, page the CTO and the relevant service owner.

PagerDuty schedule: https://theloopway.pagerduty.com/schedules/<platform-primary>.
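
The escalation thresholds can be expressed as a lookup on elapsed time. The sketch below assumes both the 30 and 60 minute marks are measured from the start of the incident, which the rule above does not state explicitly. `escalation_targets` is a hypothetical helper, not a PagerDuty feature.

```python
from datetime import timedelta
from typing import List


def escalation_targets(elapsed: timedelta) -> List[str]:
    """Return everyone who should be engaged after `elapsed` without resolution."""
    targets = ["primary on-call"]
    if elapsed >= timedelta(minutes=30):
        targets.append("secondary on-call")
    if elapsed >= timedelta(minutes=60):
        targets += ["CTO", "service owner"]
    return targets


print(escalation_targets(timedelta(minutes=45)))
```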

Communications template

When user impact is confirmed:

**Incident summary**
What: <one sentence on user impact>
When: <start time, UTC>
Status: investigating | identified | mitigating | resolved
Service(s): <names>
Brand(s): <names or "all">

**Latest update** (timestamp)
<what we know, what we're doing>

**Next update by** <timestamp>

Post this to: #incidents Slack, status page, war room channel.
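
Filling in the template by hand under pressure invites mistakes, so it can help to render it from a few fields. This is a sketch under assumptions: `render_update` is a hypothetical helper, the timestamp format is one possible choice, and the default 15-minute cadence is borrowed from the Sev1 war-room rule.

```python
from datetime import datetime, timedelta, timezone

STATUSES = ("investigating", "identified", "mitigating", "resolved")
STAMP = "%Y-%m-%d %H:%M UTC"


def render_update(what, started_at, status, services, brands, update, now,
                  cadence=timedelta(minutes=15)):
    """Fill in the communications template; datetimes are expected in UTC."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    return "\n".join([
        "**Incident summary**",
        f"What: {what}",
        f"When: {started_at.strftime(STAMP)}",
        f"Status: {status}",
        f"Service(s): {', '.join(services)}",
        # An empty brand list is rendered as "all".
        f"Brand(s): {', '.join(brands) if brands else 'all'}",
        "",
        f"**Latest update** ({now.strftime(STAMP)})",
        update,
        "",
        f"**Next update by** {(now + cadence).strftime(STAMP)}",
    ])
```

Rejecting unknown status values keeps the status page and Slack posts limited to the four states listed in the template.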

Post-mortem

Required for Sev1 and Sev2, due within 5 business days of resolution.

Use the docs/incidents/<YYYY-MM-DD>-<short-name>.md template:

Timeline (UTC, minute-by-minute)
Root cause
What worked / what didn’t (process, not people)
Action items with owners + due dates
Customer-facing communication summary
Blameless framing

Incidents that require a post-mortem are reviewed in the next ops sync.
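
The 5-business-day deadline and the file name can be computed rather than counted by hand. The sketch below makes two assumptions: business days are Monday to Friday with no holiday calendar, and the date in the file name is the incident date. Both helper names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def postmortem_path(incident_date: date, short_name: str) -> str:
    """Build the post-mortem file path from the naming pattern above."""
    return f"docs/incidents/{incident_date.isoformat()}-{short_name}.md"


resolved = date(2024, 1, 5)  # a Friday
print(add_business_days(resolved, 5), postmortem_path(resolved, "checkout-500s"))
```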
