Rollback

The fast path when a deploy goes bad.

Decision tree

Service rollback (no schema)

If the deploy didn’t include a migration, or the migration is backwards-compatible, redeploy the previous good SHA:

# Find the previous successful deploy SHA
gh run list --workflow deploy-prod.yml --status success --limit 3
 
# Trigger a deploy of that SHA (--ref expects a branch or tag name; if a raw SHA is rejected, tag the commit and pass the tag)
gh workflow run deploy-prod.yml --ref <previous-sha>
 
# Watch the deploy
gh run watch
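
By default `gh run list` prints a table without commit SHAs, so finding the SHA takes an extra step. A minimal sketch that prints SHA, creation time, and title for the most recent successful runs; the function name `recent_good_deploys` is illustrative, not existing tooling:

```shell
# Sketch: list the last N (default 3) successful prod deploys with their commit SHAs.
recent_good_deploys() {
  gh run list --workflow deploy-prod.yml --status success --limit "${1:-3}" \
    --json headSha,createdAt,displayTitle \
    --jq '.[] | "\(.headSha)  \(.createdAt)  \(.displayTitle)"'
}
```

If the bad deploy itself finished successfully, it appears first in this list, so pick the SHA by reading the titles and times rather than by position.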

Verify after rollback:

  • Healthz returns 200 on all tasks
  • Smoke test passes
  • Error rate returns to baseline within 5 minutes
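
The health check can be scripted. A minimal sketch, assuming each task exposes a `/healthz` URL reachable from where you run it; the URLs you pass in and the function names `healthz_ok` and `all_healthy` are hypothetical:

```shell
# Succeeds only if the URL answers with HTTP 200.
healthz_ok() {
  local url="$1" code
  # -s: silent, -o /dev/null: discard body, -w: print only the status code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
  [ "$code" = "200" ]
}

# Checks every URL given; reports each failure and returns non-zero if any task is unhealthy.
all_healthy() {
  local url failed=0
  for url in "$@"; do
    if ! healthz_ok "$url"; then
      echo "UNHEALTHY: $url" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Call it with one URL per task, for example `all_healthy "$TASK_1_URL/healthz" "$TASK_2_URL/healthz"`, and treat a non-zero exit as a failed rollback.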

Migration rollback (the dangerous case)

Reverse migrations are a last resort. Drizzle migrations are forward-only, so work through these steps in order and stop as soon as the incident is contained:

  1. Stop the bleeding first: disable the affected route, gate the affected page behind a feature flag, or roll back the service code if the new code path is at fault.
  2. Write a corrective forward-only migration rather than running migrate:down. Reverse migrations have caused more incidents than they’ve fixed.
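  3. If you MUST reverse the schema, get explicit approval from the service owner first, then:
    pnpm --filter @services/<svc> exec drizzle-kit drop
    # drizzle-kit drop only deletes the generated migration file from the migrations folder; it does not change the database
    # revert the database schema by hand, then manually verify the schema state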
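
For step 2, one way to start a corrective migration is to scaffold an empty migration file and write the fixing SQL by hand. A minimal sketch, assuming the service's drizzle-kit version supports `generate --custom`; the function name and the example arguments are hypothetical:

```shell
# Sketch: create an empty, forward-only migration for hand-written corrective SQL.
# $1 = service name (the <svc> placeholder), $2 = migration name chosen by the operator.
new_corrective_migration() {
  local svc="$1" name="$2"
  pnpm --filter "@services/${svc}" exec drizzle-kit generate --custom --name="${name}"
}
```

For example, `new_corrective_migration billing restore_dropped_column` (both names invented here) would leave an empty SQL file to fill in and ship through the normal deploy.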

Secret rotation

If a secret was leaked or compromised:

  1. Generate a new secret value
  2. Update SST Secret: pnpm sst secret set <NAME> <value> --stage prod
  3. Redeploy the affected services
  4. Verify the old secret is rejected
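
Steps 1 and 2 can be combined for secrets the platform generates itself (provider-issued keys still have to come from the provider). A minimal sketch; `new_secret_value` and `rotate_sst_secret` are illustrative names, and 32 random bytes is an assumed strength, not a stated policy:

```shell
# Generate 32 random bytes, printed as 64 lowercase hex characters.
new_secret_value() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Generate a fresh value and store it under the given SST secret name for the prod stage.
rotate_sst_secret() {
  local name="$1" value
  value="$(new_secret_value)"
  pnpm sst secret set "$name" "$value" --stage prod
}
```

The value never touches the terminal or shell history this way, although it is still briefly visible in the process list while the command runs.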

For OAuth client secrets specifically, use the rotation endpoint:

POST /v1/admin/oauth/clients/:id/rotate-secret

The old secret remains valid for 24 hours so existing tokens keep working.
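
Calling the rotation endpoint from a shell could look like the sketch below. `API_BASE`, `ADMIN_TOKEN`, and the bearer-token header are assumptions; this runbook does not say how the admin API is addressed or authenticated:

```shell
# Sketch: POST to the rotation endpoint for one OAuth client.
# API_BASE and ADMIN_TOKEN are assumed environment variables, not names from this runbook.
rotate_oauth_client_secret() {
  local client_id="$1"
  curl -sS --fail -X POST \
    -H "Authorization: Bearer ${ADMIN_TOKEN}" \
    "${API_BASE}/v1/admin/oauth/clients/${client_id}/rotate-secret"
}
```

`--fail` makes curl exit non-zero on an HTTP error, so a rejected rotation is not mistaken for success.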

Database rollback

Aurora point-in-time recovery exists but is catastrophic in scope: it restores the entire cluster, not individual tables, to an earlier point in time, so every write after that point is lost once you cut over. Use only for:

  • Verified data corruption from a bad migration
  • Active malicious actor mid-attack

Never use it for “I committed bad code.” Roll forward with a corrective fix.

To restore:

aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier loop-platform-prod \
  --db-cluster-identifier loop-platform-prod-restore \
  --restore-to-time <ISO-8601>

The restore takes ~30 minutes and creates a new cluster alongside the existing one. Promoting the restored cluster to primary is a major incident and requires CTO approval.
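
For Aurora, the restore command creates only the cluster; it has no DB instances until you add one. A minimal sketch that validates the timestamp, restores, and adds a first instance. The instance identifier, `--engine aurora-postgresql`, and `db.r6g.large` are assumptions and must be matched to the real source cluster:

```shell
# Accepts only UTC timestamps of the form YYYY-MM-DDTHH:MM:SSZ.
is_utc_timestamp() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z) return 0 ;;
    *) return 1 ;;
  esac
}

# Restore to a new cluster, add one instance, and block until it is available.
# ASSUMED values: instance identifier, engine, and instance class.
restore_prod_cluster() {
  local when="$1"
  is_utc_timestamp "$when" || { echo "bad timestamp: $when" >&2; return 1; }
  aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier loop-platform-prod \
    --db-cluster-identifier loop-platform-prod-restore \
    --restore-to-time "$when" &&
  aws rds create-db-instance \
    --db-instance-identifier loop-platform-prod-restore-1 \
    --db-cluster-identifier loop-platform-prod-restore \
    --engine aurora-postgresql \
    --db-instance-class db.r6g.large &&
  aws rds wait db-instance-available \
    --db-instance-identifier loop-platform-prod-restore-1
}
```

Rejecting a malformed timestamp up front avoids spending part of the incident on a restore call that fails on input.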

Cloudflare / DNS rollback

Cloudflare changes are managed as code in the SST config (the equivalent of Terraform). Reverting requires:

  1. Revert the relevant commit on main
  2. Re-deploy the platform stack
  3. Wait for DNS propagation (typically ~1 minute)
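
Propagation can be confirmed rather than waited out. A minimal sketch that polls with `dig` until a hostname resolves to the expected value; the function name, the 5-second interval, and the default of 12 attempts are assumptions, and the expected value is whatever the reverted record should return from outside:

```shell
# Sketch: poll DNS until "host" resolves to "expected", or give up after "tries" attempts.
wait_for_dns() {
  local host="$1" expected="$2" tries="${3:-12}" i=0
  while [ "$i" -lt "$tries" ]; do
    # -x: match the whole line, -F: treat the expected value as a literal string
    if dig +short "$host" | grep -qxF "$expected"; then
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  return 1
}
```

With the defaults this waits for roughly a minute, which matches the typical propagation time above.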

What gets cached

After a rollback, these caches can still hold pre-rollback data:

  • Token introspection cache (Redis): TTL ≤ 60s, clears itself
  • BrandsCatalog cache: invalidated by the next platform.brand.cache_invalidated.v1 event
  • SDK package types: consumers need to npm install the rolled-back version