Production readiness

Each service goes through this list before being promoted to the prod stage. The catalog page surfaces a “prod-ready: ✓/✗” badge derived from these checks.

Required artifacts

service.yaml complete: name, owner, on-call, status=stable, brand_scope, contracts
README.md exists at service root with run-locally instructions
RUNBOOK.md exists with: alarms, dashboards, remediation, escalation, recent incidents
openapi.yaml has every route with description AND example
Service detail page renders cleanly in docs site

Code conventions

All 23+ convention checks pass, including scope enforcement, audit, brand scoping, no raw fetch, and no PHI logging
No TODO / FIXME comments in critical paths (acceptable in stubs and tests)
All routes have integration tests
Conformance tests pass (audit, brand scoping, scope enforcement)

Auth

Every route enforces a scope via requireScope(...)
M2M client registered in identity if other services call this service
Public paths explicitly allow-listed in publicPaths (healthz, well-known only)
BAA gate enforced for any PHI-returning scope
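
The scope and allow-list items above can be sketched as a small guard. This is a minimal illustration only: the request shape, the result type, the path matching rules, and the scope name `orders:read` are assumptions, and the platform's real `requireScope` helper may have a different signature.

```typescript
// Hypothetical request/result shapes, for illustration only.
type AuthRequest = { path: string; scopes: string[] };
type AuthResult = { allowed: boolean; status: 200 | 403 };

// Only health and well-known endpoints bypass scope checks.
// An entry ending in "/" matches as a prefix; any other entry must match exactly.
const publicPaths: string[] = ["/healthz", "/.well-known/"];

function isPublicPath(path: string): boolean {
  return publicPaths.some((p) =>
    p.endsWith("/") ? path.startsWith(p) : path === p
  );
}

// Returns a guard that allows a request only when the path is explicitly
// public or the caller's token carries the required scope.
function requireScope(scope: string): (req: AuthRequest) => AuthResult {
  return (req) => {
    if (isPublicPath(req.path)) return { allowed: true, status: 200 };
    return req.scopes.includes(scope)
      ? { allowed: true, status: 200 }
      : { allowed: false, status: 403 };
  };
}
```

With this shape, `requireScope("orders:read")` rejects a caller whose token has no scopes, while `/healthz` stays reachable without a token. Anything not listed in `publicPaths` is denied by default.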

Data

Every table has brand_id NOT NULL
Migrations follow expand-only or expand–contract pattern
No long-running locks taken during migrations
Migration runbook documents rollback procedure
Postgres role grants verified: append-only audit, no UPDATE/DELETE on ledger
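
The brand_id rule lends itself to an automated check. The sketch below assumes the rows come from a query against Postgres's standard `information_schema.columns` view; the function name and the way rows are fetched are hypothetical and not part of the platform's tooling.

```typescript
// Row shape mirrors three columns of information_schema.columns.
type ColumnRow = {
  table_name: string;
  column_name: string;
  is_nullable: "YES" | "NO";
};

// Returns, in sorted order, every table that either has no brand_id column
// or declares brand_id as nullable.
function tablesViolatingBrandId(rows: ColumnRow[]): string[] {
  const allTables = new Set<string>();
  const compliant = new Set<string>();
  for (const row of rows) {
    allTables.add(row.table_name);
    if (row.column_name === "brand_id" && row.is_nullable === "NO") {
      compliant.add(row.table_name);
    }
  }
  return Array.from(allTables)
    .filter((table) => !compliant.has(table))
    .sort();
}
```

An empty result means every table in the inspected schema satisfies the rule.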

Observability

OTel instrumented: logs, traces, metrics
service.name set in OTel resource attributes
Sentry DSN wired per stage
Structured logs include: request_id, route, status, duration_ms, client_id, brand_id (PHI-safe)
CloudWatch alarms: error rate > 1%, latency p99 > 1s, DLQ depth > 0
PagerDuty routing: critical → page, warn → Slack
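
One way to keep structured logs PHI-safe is to build each line from an allow-list of the required fields, so anything else is dropped instead of logged. The sketch below is an assumption about how that could look, not the platform's logger; `toLogLine` is a hypothetical name.

```typescript
// The fields the checklist requires on every request log line.
type RequestLog = {
  request_id: string;
  route: string;
  status: number;
  duration_ms: number;
  client_id: string;
  brand_id: string;
};

const ALLOWED_FIELDS: (keyof RequestLog)[] = [
  "request_id",
  "route",
  "status",
  "duration_ms",
  "client_id",
  "brand_id",
];

// Copies only allow-listed fields into the output, so extra properties
// (which might carry PHI) never reach the log line.
function toLogLine(fields: RequestLog & Record<string, unknown>): string {
  const safe: Record<string, unknown> = {};
  for (const key of ALLOWED_FIELDS) {
    safe[key] = fields[key];
  }
  return JSON.stringify(safe);
}
```

Logging the route template (for example `/patients/:id`) rather than the raw URL is a further assumption that keeps identifiers out of the `route` field.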

Idempotency

Mutation endpoints accept Idempotency-Key
Event handlers dedupe by event ID + handler scope
Cron-callable endpoints are safe to retry
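
The first two items can be illustrated with an in-memory sketch. The class names are hypothetical, and the in-memory `Map` and `Set` stand in for a durable store; a production version would need an atomic insert (for example a unique constraint) to stay correct across processes.

```typescript
type StoredResponse = { status: number; body: string };

// In-memory stand-in for a durable idempotency store.
class IdempotencyStore {
  private responses = new Map<string, StoredResponse>();

  // Runs the mutation the first time a key is seen and replays the stored
  // response on every later call with the same Idempotency-Key.
  execute(key: string, mutate: () => StoredResponse): StoredResponse {
    const cached = this.responses.get(key);
    if (cached !== undefined) return cached;
    const result = mutate();
    this.responses.set(key, result);
    return result;
  }
}

// Dedupe on event ID plus handler scope, so two different handlers can each
// process the same event once.
class EventDeduper {
  private seen = new Set<string>();

  // Returns true only the first time a (handlerScope, eventId) pair is seen.
  shouldProcess(eventId: string, handlerScope: string): boolean {
    // JSON-encode the pair so separators inside either value cannot collide.
    const key = JSON.stringify([handlerScope, eventId]);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

A retried request with the same key receives the stored response without running the mutation again, which is also what makes a cron-triggered endpoint safe to call twice.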

Rate limits & resilience

Rate limits configured per route class
Circuit breakers wrap every outbound vendor call
Retries with exponential backoff + jitter
Graceful degradation when downstream service is unavailable
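
A minimal sketch of the retry and circuit-breaker items follows. It uses the "full jitter" variant of exponential backoff and a simple consecutive-failure breaker; the defaults (100 ms base, 5 s cap, 5 failures, 30 s cooldown) and all names are illustrative assumptions, not the platform's actual settings or helpers.

```typescript
// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Opens after `threshold` consecutive failures and blocks calls until
// `cooldownMs` has elapsed. After the cooldown, calls are allowed again;
// the next failure re-opens the circuit, a success closes it.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 5,
    private cooldownMs = 30000,
    private now: () => number = Date.now
  ) {}

  canCall(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}

// Wraps an outbound call: checks the breaker, retries failures with backoff,
// and rethrows the last error once attempts are exhausted or the circuit opens.
async function callWithRetry<T>(
  fn: () => Promise<T>,
  breaker: CircuitBreaker,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown = new Error("circuit open");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (!breaker.canCall()) break;
    try {
      const result = await fn();
      breaker.recordSuccess();
      return result;
    } catch (err) {
      breaker.recordFailure();
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Catching the error thrown by `callWithRetry` and returning a cached or reduced response is one way to get the graceful degradation the last item asks for.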

Deployment

infra.ts complete with hostname, scaling per stage, health checks
Service registered in sst.config.ts
SST Secrets set for dev / staging / prod
Cloudflare DNS configured for *.platform.loop.health per stage
Deploys to dev succeed and healthz returns 200
Smoke test against staging passes
Soaked in staging for ≥ 24 hours with no alarms

SDK

pnpm sdk:gen produces clean output
@platform/sdk-<service> package builds + publishes to GitHub Packages
SDK reference page generated and committed
At least one consumer (canary, internal app) imports the SDK and calls the service successfully

On-call

Primary on-call assigned in service.yaml
Secondary on-call assigned
PagerDuty schedule exists
Escalation path documented in RUNBOOK.md
On-call has read access to dashboards
On-call has been notified before promotion

Documentation

Service page has description, status, owner, dashboards link
Concept pages exist for any new domain primitive this service introduces
Connect docs updated if this service exposes new partner-facing scopes
Event reference pages generated for every published event
Cross-links from related services and concepts back to this service

Pre-promotion gate

PR approved by reviewer
No CI failures across all 23+ convention checks, typecheck, tests, build, drift checks
No merge during a freeze window
Changeset has user-facing summary
Rollback procedure tested OR explicitly documented in RUNBOOK.md
Manual prod deploy triggered + monitored for 30 minutes
