Incident response
Sev levels
| Sev | Definition | Response |
|---|---|---|
| Sev1 | Production down or PHI breach | Page primary immediately; war room within 15 min |
| Sev2 | Significant degradation or known data corruption | Page primary; investigate within 30 min |
| Sev3 | Limited user impact or non-critical service degraded | Slack alert; investigate within 4 hours |
| Sev4 | Cosmetic / non-urgent | Ticket; next business day |
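The table above can be encoded as data so that alert routing and tooling agree on it. The following is a minimal sketch, not an existing module: the names `Sev`, `Response`, `RESPONSES`, and `classify`, and the boolean signal names, are hypothetical and only mirror the definitions in the table.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Sev(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Response:
    channel: str                   # "page", "slack", or "ticket"
    deadline: Optional[timedelta]  # None means "next business day"


# Response column of the table, one entry per severity.
RESPONSES = {
    Sev.SEV1: Response("page", timedelta(minutes=15)),  # war room within 15 min
    Sev.SEV2: Response("page", timedelta(minutes=30)),  # investigate within 30 min
    Sev.SEV3: Response("slack", timedelta(hours=4)),    # investigate within 4 hours
    Sev.SEV4: Response("ticket", None),                 # next business day
}


def classify(*, production_down=False, phi_breach=False,
             significant_degradation=False, data_corruption=False,
             limited_user_impact=False, noncritical_degraded=False) -> Sev:
    """Return the most severe level whose definition matches the signals."""
    if production_down or phi_breach:
        return Sev.SEV1
    if significant_degradation or data_corruption:
        return Sev.SEV2
    if limited_user_impact or noncritical_degraded:
        return Sev.SEV3
    return Sev.SEV4
```

When several signals are true at once, the sketch picks the most severe level, which matches the usual practice of declaring high and downgrading later.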
First 15 minutes (Sev1)
- Acknowledge the page in PagerDuty.
- Open the war room Slack channel: `#incident-<ticket-id>`.
- Update the status page: set it to “investigating.”
- Identify scope: which service(s), which brand(s), how many users.
- Decide: rollback or roll forward?
  - Recent deploy + obvious correlation → rollback (faster, lower risk).
  - Old code + new pattern of failure → roll forward with a fix.
- Communicate every 15 min in the war room even if there’s no progress.
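The rollback decision and the update cadence from the checklist above can be written down as a small sketch. The function names and the `"needs-judgment"` outcome are hypothetical: the checklist does not say what to do after a recent deploy with no obvious correlation, so the sketch returns an explicit "decide by hand" value instead of guessing.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=15)  # war room update cadence from the checklist


def choose_mitigation(recent_deploy: bool, correlates_with_deploy: bool) -> str:
    """Apply the two rules from the checklist; anything else needs a human call."""
    if recent_deploy and correlates_with_deploy:
        return "rollback"        # faster, lower risk
    if not recent_deploy:
        return "roll-forward"    # old code, new failure pattern: ship a fix
    return "needs-judgment"      # case not covered by the checklist (assumption)


def next_update_due(last_update: datetime) -> datetime:
    """Time by which the next war room update is owed, even without progress."""
    return last_update + UPDATE_INTERVAL
```

For example, `choose_mitigation(recent_deploy=True, correlates_with_deploy=True)` returns `"rollback"`.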
Investigation playbook
Start with:
- Dashboards: error rate, latency, DLQ depth, vendor circuit breaker state.
- Logs: search by `request_id`, `user_id`, or audit row.
- Recent deploys: `gh run list --workflow deploy-prod.yml --limit 5`.
- Vendor status: Stripe, AWS, BigCommerce, etc.
- Database: `pg_stat_activity`, recent migrations.
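To check whether a recent deploy lines up with the start of an incident, the output of `gh run list` can be requested as JSON and filtered by time. This is a sketch under assumptions: it relies on the `--json` flag of `gh run list` with the `databaseId`, `conclusion`, `createdAt`, and `headSha` fields, the one-hour window is an illustrative default rather than a team policy, and `deploys_before` is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone

# Command whose stdout is fed to deploys_before(); not executed here.
GH_COMMAND = [
    "gh", "run", "list",
    "--workflow", "deploy-prod.yml",
    "--limit", "5",
    "--json", "databaseId,conclusion,createdAt,headSha",
]


def deploys_before(runs_json: str, incident_start: datetime,
                   window: timedelta = timedelta(hours=1)) -> list:
    """Return runs created within `window` before `incident_start`."""
    hits = []
    for run in json.loads(runs_json):
        # gh emits UTC timestamps such as "2024-05-01T11:40:00Z".
        created = datetime.fromisoformat(run["createdAt"].replace("Z", "+00:00"))
        if timedelta(0) <= incident_start - created <= window:
            hits.append(run)
    return hits
```

In practice the JSON would come from something like `subprocess.run(GH_COMMAND, capture_output=True, text=True, check=True).stdout`. A non-empty result is a hint to look at rollback first, not proof that the deploy caused the incident.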
Common patterns:
| Symptom | Likely cause |
|---|---|
| Error rate spike right after deploy | Bad release — rollback |
| Slow queries appearing in a cluster | Lock contention or missing index |
| 5xx on one specific path | Vendor circuit open or DB connections exhausted |
| All services 500ing | Identity / token introspection / Redis down |
| Spike in 401s | Identity outage or expired secrets |
Escalation
If the primary on-call can’t resolve the incident within 30 min, escalate to the secondary. If both are still stuck after 60 min, page the CTO and the relevant service owner.
PagerDuty schedule: https://theloopway.pagerduty.com/schedules/<platform-primary>.
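The escalation thresholds can be expressed as a function of how long the incident has been unresolved. This is an illustrative sketch only; `escalation_targets` and the role labels are hypothetical names, and it does not call PagerDuty.

```python
PRIMARY_LIMIT_MIN = 30   # primary alone for the first 30 minutes
COMBINED_LIMIT_MIN = 60  # primary + secondary until 60 minutes


def escalation_targets(minutes_unresolved: int) -> list[str]:
    """Return who should be engaged after the given number of minutes."""
    if minutes_unresolved < 0:
        raise ValueError("minutes_unresolved must be non-negative")
    targets = ["primary"]
    if minutes_unresolved >= PRIMARY_LIMIT_MIN:
        targets.append("secondary")
    if minutes_unresolved >= COMBINED_LIMIT_MIN:
        targets += ["cto", "service-owner"]
    return targets
```

For example, `escalation_targets(45)` returns `["primary", "secondary"]`.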
Communications template
When there is real user impact:
**Incident summary**
What: <one sentence on user impact>
When: <start time, UTC>
Status: investigating | identified | mitigating | resolved
Service(s): <names>
Brand(s): <names or "all">
**Latest update** (timestamp)
<what we know, what we're doing>
**Next update by** <timestamp>
Post this to the #incidents Slack channel, the status page, and the war room channel.
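Filling in the template by hand under pressure is error-prone, so a small renderer can help keep updates consistent. The sketch below is an assumption about how this might look, not existing tooling: `IncidentUpdate` and `render` are hypothetical names, the timestamp format is an illustrative choice, and an empty brand list is rendered as "all".

```python
from dataclasses import dataclass
from datetime import datetime, timezone

STATUSES = ("investigating", "identified", "mitigating", "resolved")


@dataclass
class IncidentUpdate:
    what: str                 # one sentence on user impact
    started_at: datetime      # timezone-aware
    status: str               # one of STATUSES
    services: list[str]
    brands: list[str]         # empty list is rendered as "all"
    latest: str               # what we know, what we're doing
    updated_at: datetime      # timezone-aware
    next_update_by: datetime  # timezone-aware


def _utc(ts: datetime) -> str:
    """Format a timezone-aware datetime in UTC, as the template requires."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def render(u: IncidentUpdate) -> str:
    if u.status not in STATUSES:
        raise ValueError(f"unknown status: {u.status!r}")
    return "\n".join([
        "**Incident summary**",
        f"What: {u.what}",
        f"When: {_utc(u.started_at)}",
        f"Status: {u.status}",
        f"Service(s): {', '.join(u.services)}",
        f"Brand(s): {', '.join(u.brands) if u.brands else 'all'}",
        f"**Latest update** ({_utc(u.updated_at)})",
        u.latest,
        f"**Next update by** {_utc(u.next_update_by)}",
    ])
```

Rejecting naive datetimes is deliberate: the template asks for UTC, and a naive timestamp would silently be interpreted as local time.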
Post-mortem
Within 5 business days of resolution. Required for Sev1 and Sev2.
Use the `docs/incidents/<YYYY-MM-DD>-<short-name>.md` template:
- Timeline (UTC, minute-by-minute)
- Root cause
- What worked / what didn’t (process, not people)
- Action items with owners + due dates
- Customer-facing communication summary
- Blameless framing
Incidents that require a post-mortem are reviewed in the next ops sync.
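The deadline and file naming rules above can be computed rather than worked out by hand. This is a rough sketch with labeled assumptions: business days are taken as Monday to Friday with no holiday calendar, the date in the filename is whatever date the caller passes in (the source does not say whether that is the incident date or the resolution date), and all three function names are hypothetical.

```python
import re
from datetime import date, timedelta


def postmortem_required(sev: int) -> bool:
    """Post-mortems are required for Sev1 and Sev2."""
    return sev in (1, 2)


def add_business_days(start: date, days: int) -> date:
    """Advance `days` weekdays past `start` (holidays ignored: assumption)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def postmortem_path(incident_date: date, short_name: str) -> str:
    """Build docs/incidents/<YYYY-MM-DD>-<short-name>.md with a slugged name."""
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    return f"docs/incidents/{incident_date.isoformat()}-{slug}.md"
```

For an incident resolved on Friday 2024-05-03, `add_business_days(date(2024, 5, 3), 5)` gives 2024-05-10 as the post-mortem due date.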
Related
- Rollback
- Production readiness
- Service catalog — per-service runbooks
- Reliability and deployment