Production readiness
Each service goes through this list before being promoted to the prod stage. The catalog page surfaces a “prod-ready: ✓/✗” badge derived from these checks.
Required artifacts
- service.yaml complete: name, owner, on-call, status=stable, brand_scope, contracts
- README.md exists at service root with run-locally instructions
- RUNBOOK.md exists with: alarms, dashboards, remediation, escalation, recent incidents
- openapi.yaml has every route with description AND example
- Service detail page renders cleanly in docs site
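As a rough illustration of how the service.yaml item could be checked automatically, the sketch below validates an already-parsed manifest. The key names (in particular on_call) and the function name manifestProblems are assumptions made for this example; the real schema and badge logic are not described on this page.

```typescript
// Hypothetical shape: a service.yaml file already parsed into a plain object.
type Manifest = Record<string, unknown>;

// Assumed key names, taken from the checklist wording (on-call -> on_call).
const REQUIRED_FIELDS = ["name", "owner", "on_call", "status", "brand_scope", "contracts"];

// Returns a list of human-readable problems; an empty list means the manifest passes.
function manifestProblems(manifest: Manifest): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    const value = manifest[field];
    if (value === undefined || value === null || value === "") {
      problems.push("missing " + field);
    }
  }
  const status = manifest["status"];
  // The checklist requires status=stable before promotion.
  if (typeof status === "string" && status !== "" && status !== "stable") {
    problems.push("status must be stable");
  }
  return problems;
}
```

An empty result would map to the ✓ badge state for this one item; any entry in the list would map to ✗.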
Code conventions
- All 23+ convention checks pass: scope enforcement, audit, brand scoping, no raw fetch, no PHI logging
- No TODO/FIXME comments in critical paths (acceptable in stubs and tests)
- All routes have integration tests
- Conformance tests pass (audit, brand scoping, scope enforcement)
Auth
- Every route enforces a scope via requireScope(...)
- M2M client registered in identity if other services call this
- Public paths explicitly allow-listed in publicPaths (healthz, well-known only)
- BAA gate enforced for any PHI-returning scope
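The scope and allow-list items above might look roughly like the following. This is a minimal sketch, not the platform's actual requireScope: the request shape, the handler signature, the exact public path strings, and the scope name in the usage note are all assumptions.

```typescript
// Hypothetical request and response shapes; the real framework types are not given here.
interface ScopedRequest {
  path: string;
  scopes: string[]; // scopes granted to the caller's token
}
interface SimpleResponse {
  status: number;
  body: string;
}
type RouteHandler = (req: ScopedRequest) => SimpleResponse;

// Assumed allow-list: only health and well-known endpoints skip scope checks.
const publicPaths: string[] = ["/healthz", "/.well-known/"];

// Entries ending in "/" match as prefixes; all others must match exactly.
function isPublicPath(path: string): boolean {
  return publicPaths.some((p) => (p.endsWith("/") ? path.startsWith(p) : path === p));
}

// Wraps a handler so it only runs when the caller holds the required scope.
function requireScope(scope: string, handler: RouteHandler): RouteHandler {
  return (req) => {
    if (!req.scopes.includes(scope)) {
      return { status: 403, body: "missing scope: " + scope };
    }
    return handler(req);
  };
}
```

With this shape, a route such as requireScope("orders:read", handler) rejects a token without that scope with a 403 before the handler runs, and a router can refuse to register any unwrapped route whose path fails isPublicPath.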
Data
- Every table has brand_id NOT NULL
- Migrations follow expand-only or expand–contract pattern
- No long-running locks
- Migration runbook documents rollback procedure
- Postgres role grants verified: append-only audit, no UPDATE/DELETE on ledger
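One way to verify the brand_id item is to scan column metadata, for example rows selected from Postgres's information_schema.columns view. The sketch below assumes those rows have already been fetched; the function name tablesMissingBrandId is hypothetical.

```typescript
// Subset of the columns exposed by information_schema.columns in Postgres.
interface ColumnInfo {
  table_name: string;
  column_name: string;
  is_nullable: "YES" | "NO";
}

// Returns, in sorted order, every table that lacks a brand_id column declared NOT NULL.
function tablesMissingBrandId(columns: ColumnInfo[]): string[] {
  const allTables = new Set<string>();
  const compliant = new Set<string>();
  for (const col of columns) {
    allTables.add(col.table_name);
    if (col.column_name === "brand_id" && col.is_nullable === "NO") {
      compliant.add(col.table_name);
    }
  }
  return Array.from(allTables).filter((t) => !compliant.has(t)).sort();
}
```

A table with a nullable brand_id is reported the same way as a table with no brand_id at all, since both fail the checklist item.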
Observability
- OTel instrumented: logs, traces, metrics
- service.name set in OTel resource attributes
- Sentry DSN wired per stage
- Structured logs include: request_id, route, status, duration_ms, client_id, brand_id (PHI-safe)
- CloudWatch alarms: error rate > 1%, latency p99 > 1s, DLQ depth > 0
- PagerDuty routing: critical → page, warn → Slack
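The log-field and alarm-threshold items can be sketched as two small functions. The allow-list approach to keeping PHI out of logs is one possible technique, not necessarily the one the platform uses, and the names toLogLine, WindowStats, and firingAlarms are assumptions for illustration.

```typescript
// The only fields the checklist allows in a structured request log.
const LOG_FIELDS = ["request_id", "route", "status", "duration_ms", "client_id", "brand_id"];

// Builds a JSON log line from allow-listed fields only, so stray PHI keys are dropped.
function toLogLine(fields: Record<string, unknown>): string {
  const safe: Record<string, unknown> = {};
  for (const key of LOG_FIELDS) {
    if (key in fields) safe[key] = fields[key];
  }
  return JSON.stringify(safe);
}

// Hypothetical summary of one evaluation window.
interface WindowStats {
  requests: number;
  errors: number;
  p99LatencyMs: number;
  dlqDepth: number;
}

// Applies the three thresholds from the list: error rate > 1%, p99 > 1s, DLQ depth > 0.
function firingAlarms(stats: WindowStats): string[] {
  const alarms: string[] = [];
  if (stats.requests > 0 && stats.errors / stats.requests > 0.01) alarms.push("error_rate");
  if (stats.p99LatencyMs > 1000) alarms.push("latency_p99");
  if (stats.dlqDepth > 0) alarms.push("dlq_depth");
  return alarms;
}
```

In a real deployment the thresholds would live in the CloudWatch alarm definitions; firingAlarms only makes the comparison semantics explicit (strictly greater than, not greater than or equal).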
Idempotency
- Mutation endpoints accept Idempotency-Key
- Event handlers dedupe by event ID + handler scope
- Cron-callable endpoints are safe to retry
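A minimal in-memory sketch of the first two items follows. Both class names are hypothetical, and a production service would back them with a durable store (and an expiry policy) rather than a Map or Set held in process memory.

```typescript
// Replays the stored result when the same Idempotency-Key is seen again.
class IdempotencyStore<T> {
  private readonly results = new Map<string, T>();

  run(idempotencyKey: string, mutate: () => T): T {
    if (this.results.has(idempotencyKey)) {
      return this.results.get(idempotencyKey) as T;
    }
    const result = mutate(); // executes at most once per key
    this.results.set(idempotencyKey, result);
    return result;
  }
}

// Dedupes on event ID + handler scope, so two different handlers can
// each process the same event exactly once.
class EventDeduper {
  private readonly seen = new Set<string>();

  shouldHandle(eventId: string, handlerScope: string): boolean {
    const key = JSON.stringify([handlerScope, eventId]);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Keying the dedupe set on the pair rather than the event ID alone matters: with an ID-only key, the first handler to see an event would suppress every other handler subscribed to it.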
Rate limits & resilience
- Rate limits configured per route class
- Circuit breakers wrap every outbound vendor call
- Retries with exponential backoff + jitter
- Graceful degradation when downstream service is unavailable
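The retry and circuit-breaker items can be illustrated as below. This is a sketch under stated assumptions: it uses the "full jitter" variant of exponential backoff, a simple consecutive-failure threshold for the breaker, and a synchronous call for brevity. The names, default values, and thresholds are hypothetical, not the platform's settings.

```typescript
// Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 10000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Opens after `threshold` consecutive failures and short-circuits to the
// fallback until `cooldownMs` has elapsed, then lets one trial call through.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  call<T>(vendorCall: () => T, fallback: () => T): T {
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      return fallback(); // breaker open: degrade without touching the vendor
    }
    try {
      const result = vendorCall();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (_err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      return fallback();
    }
  }
}
```

The fallback argument is where graceful degradation lives: it might return cached data or a reduced response instead of propagating the vendor failure to the caller.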
Deployment
- infra.ts complete with hostname, scaling per stage, health checks
- Service registered in sst.config.ts
- SST Secrets set for dev / staging / prod
- Cloudflare DNS configured for *.platform.loop.health per stage
- Deploys to dev succeed, healthz returns 200
- Smoke test against staging passes
- Soaked in staging for ≥ 24 hours with no alarms
SDK
- pnpm sdk:gen produces clean output
- @platform/sdk-<service> package builds + publishes to GitHub Packages
- SDK reference page generated and committed
- At least one consumer (canary, internal app) imports + calls successfully
On-call
- Primary on-call assigned in service.yaml
- Secondary on-call assigned
- PagerDuty schedule exists
- Escalation path documented in RUNBOOK.md
- On-call has read access to dashboards
- On-call has been notified before promotion
Documentation
- Service page has description, status, owner, dashboards link
- Concept pages exist for any new domain primitive this service introduces
- Connect docs updated if this service exposes new partner-facing scopes
- Event reference pages generated for every published event
- Cross-links from related services and concepts back to this service
Pre-promotion gate
- PR approved by reviewer
- No CI failures across all 23+ convention checks, typecheck, tests, build, drift checks
- No merge during a freeze window
- Changeset has user-facing summary
- Rollback procedure tested OR explicitly documented in RUNBOOK.md
- Manual prod deploy triggered + monitored for 30 minutes