Rollback
The fast path when a deploy goes bad.
Decision tree
Service rollback (no schema)
If the deploy didn’t include a migration, OR the migration is backwards-compatible:
# Find the previous successful deploy SHA
gh run list --workflow deploy-prod.yml --status success --limit 3
# Trigger a deploy of that SHA
gh workflow run deploy-prod.yml --ref <previous-sha>
# Watch the deploy
gh run watch
Verify after rollback:
- Healthz returns 200 on all tasks
- Smoke test passes
- Error rate returns to baseline within 5 minutes
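The first check in the list above can be scripted. The following is a minimal sketch, assuming each task exposes its health endpoint at `<task-url>/healthz`; the function names and the task URLs passed to them are hypothetical and not part of this platform's tooling.

```shell
# Succeeds only if at least one status code is given and every one is 200.
all_ok() {
  [ "$#" -gt 0 ] || return 1
  local code
  for code in "$@"; do
    [ "$code" = "200" ] || return 1
  done
  return 0
}

# Fetches the HTTP status of each task's health endpoint and checks them all.
# Arguments are base URLs of the tasks (hypothetical; substitute your own).
check_healthz() {
  [ "$#" -gt 0 ] || return 1
  local codes=() url
  for url in "$@"; do
    # curl prints 000 when the connection fails, which all_ok rejects.
    codes+=("$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url/healthz" || true)")
  done
  all_ok "${codes[@]}"
}
```

A call such as `check_healthz https://task-1.example.internal https://task-2.example.internal && echo healthy` would report success only when every task answers 200. The smoke test and error-rate checks still need to be confirmed separately.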
Migration rollback (the dangerous case)
Reverse migrations are a LAST RESORT. Drizzle migrations are forward-only. If you absolutely must reverse:
- Stop the bleeding first: disable the affected route, page-gate via a feature flag, or roll back the service code if the new code path is at fault.
- Write a corrective migration (forward-only) rather than running migrate:down. Reverse migrations have caused more incidents than they’ve fixed.
- If you MUST reverse the schema:
Get explicit approval from the service owner first.
pnpm --filter @services/<svc> exec drizzle-kit drop
# then manually verify the schema state
Secret rotation
If a secret was leaked or compromised:
- Generate new secret value
- Update SST Secret:
pnpm sst secret set <NAME> <value> --stage prod
- Redeploy the affected services
- Verify the old secret is rejected
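The first two steps above can be combined in a small helper. This is a sketch under stated assumptions: the 64-character hex format is an assumption (the affected service may require a different length or encoding), and `generate_secret` and `rotate_sst_secret` are hypothetical names. Only the `pnpm sst secret set` invocation comes from this runbook.

```shell
# Prints a 64-character hex string built from 32 random bytes.
generate_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Generates a fresh value and stores it under the given SST secret name.
# The value is never echoed, so it stays out of terminal scrollback.
rotate_sst_secret() {
  local name="$1" value
  value="$(generate_secret)"
  pnpm sst secret set "$name" "$value" --stage prod
}
```

Redeploying the affected services and confirming that the old secret is rejected remain manual steps after the helper runs.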
For OAuth client secrets specifically, use the rotation endpoint:
POST /v1/admin/oauth/clients/:id/rotate-secret
The old secret remains valid for 24 hours so existing tokens keep working.
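Calling the rotation endpoint from a shell might look like the sketch below. The endpoint path is taken from this runbook; the base URL, the `ADMIN_TOKEN` variable, and the use of bearer authentication are assumptions, so check how the admin API actually authenticates before relying on it.

```shell
# Builds the rotation URL for a client ID, tolerating a trailing slash on the base URL.
rotate_secret_url() {
  local base_url="${1%/}" client_id="$2"
  printf '%s/v1/admin/oauth/clients/%s/rotate-secret\n' "$base_url" "$client_id"
}

# Sends the rotation request. ADMIN_TOKEN and the bearer scheme are assumed,
# not confirmed by this runbook.
rotate_oauth_secret() {
  local base_url="$1" client_id="$2"
  curl --fail --silent --show-error -X POST \
    -H "Authorization: Bearer ${ADMIN_TOKEN:?set ADMIN_TOKEN first}" \
    "$(rotate_secret_url "$base_url" "$client_id")"
}
```

Because the old secret keeps working for 24 hours, a rejection check on the old value is only meaningful after that window has passed.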
Database rollback
Aurora point-in-time recovery exists but is catastrophic in scope: it restores the entire DB to an earlier point in time, so every write made after that point is lost once the restored cluster is promoted. Use only for:
- Verified data corruption from a bad migration
- Active malicious actor mid-attack
Never use it for “I committed bad code.” Roll forward with a corrective fix.
To restore:
aws rds restore-db-cluster-to-point-in-time \
--source-db-cluster-identifier loop-platform-prod \
--db-cluster-identifier loop-platform-prod-restore \
--restore-to-time <ISO-8601>
The restore takes ~30 minutes. Then promote the restored cluster to primary. This is a major incident; CTO approval is required.
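Before and after running the restore, a few helper checks can reduce the chance of a failed attempt. This is a sketch: the function names are hypothetical, the cluster identifiers are the ones used above, and the `db-cluster-available` waiter is assumed to be present in your AWS CLI version (it was added in more recent releases).

```shell
# Succeeds if the argument looks like a UTC ISO-8601 timestamp,
# e.g. 2024-05-01T12:30:00Z. This checks the shape only, not calendar validity.
is_iso8601_utc() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}

# Prints the earliest and latest times the source cluster can be restored to.
restorable_window() {
  aws rds describe-db-clusters \
    --db-cluster-identifier loop-platform-prod \
    --query 'DBClusters[0].[EarliestRestorableTime,LatestRestorableTime]' \
    --output text
}

# Blocks until the restored cluster reports the "available" status.
wait_for_restore() {
  aws rds wait db-cluster-available \
    --db-cluster-identifier loop-platform-prod-restore
}
```

The chosen restore time has to fall inside the window printed by `restorable_window`. Note that the CLI restore creates the cluster only; DB instances have to be added to it before it can serve traffic.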
Cloudflare / DNS rollback
Cloudflare changes are managed as Terraform-equivalent infrastructure code in the SST config. Reverting requires:
- Revert the relevant commit on main
- Re-deploy the platform stack
- DNS propagation: ~1 minute typically
What gets cached
After a rollback, these caches need invalidation:
- Token introspection cache (Redis): TTL ≤ 60s, self-clears
- BrandsCatalog cache: invalidated by the next platform.brand.cache_invalidated.v1 event
- SDK package types: consumers need to npm install the rolled-back version