Rollback

The fast path when a deploy goes bad.

Decision tree

Service rollback (no schema)

If the deploy didn’t include a migration, OR the migration is backwards-compatible:

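# Find the previous successful deploy SHA (printed as headSha for each run)
gh run list --workflow deploy-prod.yml --status success --limit 3 --json headSha,displayTitle,createdAt

# Trigger a deploy of that SHA.
# Note: gh's --ref takes a branch or tag name, not a bare SHA, so point a
# tag or branch at the previous SHA first if one does not already exist.
gh workflow run deploy-prod.yml --ref <branch-or-tag-at-previous-sha>

# Watch the deploy
gh run watch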
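
When the same SHA shows up more than once in the run list (for example after a re-run), picking the previous deploy by eye is easy to get wrong. The following is a minimal sketch of a helper that reads SHAs newest-first and prints the first one that differs from the bad deploy. The name pick_previous_sha is hypothetical, and the wiring assumes the SHAs come from gh's --json headSha output.

```shell
# Hypothetical helper, not part of the repo.
# Reads commit SHAs from stdin (newest first) and prints the first one
# that differs from the SHA given as $1. Returns 1 if none is found.
pick_previous_sha() {
  bad_sha="$1"
  while IFS= read -r sha; do
    if [ -n "$sha" ] && [ "$sha" != "$bad_sha" ]; then
      printf '%s\n' "$sha"
      return 0
    fi
  done
  return 1
}

# Example wiring (assumes <bad-sha> is the SHA of the failing deploy):
#   gh run list --workflow deploy-prod.yml --status success --limit 10 \
#     --json headSha --jq '.[].headSha' | pick_previous_sha <bad-sha>
```

A non-zero exit means the list held nothing but the bad SHA, in which case raising --limit is the likely next step.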

Verify after rollback:

Healthz returns 200 on all tasks
Smoke test passes
Error rate returns to baseline within 5 minutes
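
The healthz check above can be scripted roughly as follows. This is a sketch only: the function names are hypothetical, and it assumes each task exposes a health endpoint that curl can reach. The defaults of 30 attempts at 10-second intervals are chosen to match the 5-minute window; the smoke test and error-rate checks are not covered here.

```shell
# Hypothetical helpers, not part of the repo.

# Return 0 if the URL in $1 answers HTTP 200, non-zero otherwise.
healthz_ok() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || code=000
  [ "$code" = "200" ]
}

# Poll the URL in $1 until it is healthy or the attempts run out.
# $2 = number of attempts (default 30), $3 = seconds between attempts (default 10).
wait_for_healthz() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if healthz_ok "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (the URL is a placeholder):
#   wait_for_healthz "https://<task-host>/healthz" || echo "still unhealthy"
```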

Migration rollback (the dangerous case)

Reverse migrations are a last resort. Drizzle migrations are forward-only. Before you consider reversing the schema:

Stop the bleeding first: disable the affected route, gate the page behind a feature flag, or roll back the service code if the new code path is at fault.
Write a corrective forward-only migration rather than running migrate:down. Reverse migrations have caused more incidents than they’ve fixed.

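If you MUST reverse the schema, get explicit approval from the service owner first, then:

pnpm --filter @services/<svc> exec drizzle-kit drop
# drizzle-kit drop only removes the generated migration file and its journal
# entry; it does not undo changes already applied to the database.
# Revert those by hand, then manually verify the schema state.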

Secret rotation

If a secret was leaked or compromised:

Generate new secret value
Update SST Secret: pnpm sst secret set <NAME> <value> --stage prod
Redeploy the affected services
Verify the old secret is rejected
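
For the first step, one way to produce a fresh value is to read random bytes from the kernel and hex-encode them. This is a sketch: the helper name generate_secret is hypothetical, and whether a 64-character hex string fits a given secret's expected format is an assumption to check per secret.

```shell
# Hypothetical helper, not part of the repo.
# Prints a random 64-character hex string (256 bits) from /dev/urandom.
generate_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  printf '\n'
}

# Example wiring, using the command from step 2 above:
#   pnpm sst secret set <NAME> "$(generate_secret)" --stage prod
```

Passing the value through command substitution keeps it out of the terminal scrollback, though it can still be visible in the process list while the command runs.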

For OAuth client secrets specifically, use the rotation endpoint:

POST /v1/admin/oauth/clients/:id/rotate-secret

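The old secret remains valid for 24 hours so existing tokens keep working. For a leaked secret, this means the compromised value also stays usable during that window, and the "old secret is rejected" check only passes once it has elapsed.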

Database rollback

Aurora point-in-time recovery exists but is catastrophic in scope: it restores the entire DB to a single point in time on a new cluster, and every write made after that point is lost once you cut over. Use only for:

Verified data corruption from a bad migration
Active malicious actor mid-attack

Never use for “I committed bad code.” Roll forward with a corrective fix.

To restore:

aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier loop-platform-prod \
  --db-cluster-identifier loop-platform-prod-restore \
  --restore-to-time <ISO-8601>

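The restore takes ~30 minutes. The command restores the cluster only, so DB instances must be added to it (aws rds create-db-instance) before it can serve traffic. Then promote the restored cluster to primary. This is a major incident; CTO approval is required.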
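
Because a mistyped --restore-to-time is costly at this stage, a small guard around the restore command may help. The sketch below uses hypothetical function names and assumes the team standardizes on the Z-suffixed UTC form (for example 2024-05-01T12:30:00Z). It checks the shape of the timestamp only, not whether the date is real or inside the retention window.

```shell
# Hypothetical guard, not part of the repo.
# Succeeds only for timestamps shaped like 2024-05-01T12:30:00Z.
is_utc_timestamp() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}

# Runs the restore command from this section, refusing malformed input.
restore_prod_to() {
  if ! is_utc_timestamp "$1"; then
    echo "refusing to restore: '$1' is not shaped like 2024-05-01T12:30:00Z" >&2
    return 2
  fi
  aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier loop-platform-prod \
    --db-cluster-identifier loop-platform-prod-restore \
    --restore-to-time "$1"
}

# Example:
#   restore_prod_to 2024-05-01T12:30:00Z
```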

Cloudflare / DNS rollback

Cloudflare changes are managed as code in the SST config (the Terraform equivalent). To revert:

Revert the relevant commit on main
Re-deploy the platform stack
Wait for DNS propagation (typically ~1 minute)

What gets cached

After a rollback, these caches can still hold data from the bad deploy. Each entry notes how it clears:

Token introspection cache (Redis) — TTL ≤ 60s, self-clears
BrandsCatalog cache — invalidated by next platform.brand.cache_invalidated.v1 event
SDK package types — consumers need to npm install the rolled-back version
