Production readiness

Each service goes through this list before being promoted to the prod stage. The catalog page surfaces a “prod-ready: ✓/✗” badge derived from these checks.
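
As a rough illustration of how the badge could be derived from the checks, the sketch below shows ✓ only when every check passes. The `CheckResult` shape and both function names are assumptions made for this example, not the catalog's actual implementation.

```typescript
// Hypothetical result shape for one readiness check.
type CheckResult = { name: string; passed: boolean };

// Returns the badge text. An empty checklist counts as not ready,
// so a missing report never renders as a pass.
function prodReadyBadge(checks: CheckResult[]): string {
  const ready = checks.length > 0 && checks.every((c) => c.passed);
  return `prod-ready: ${ready ? "✓" : "✗"}`;
}

// Lists the names of failing checks, e.g. for a tooltip on the badge.
function failingChecks(checks: CheckResult[]): string[] {
  return checks.filter((c) => !c.passed).map((c) => c.name);
}
```

Treating an empty list as ✗ is a deliberate choice in this sketch: a service with no recorded checks should not look ready.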

Required artifacts

  • service.yaml complete: name, owner, on-call, status=stable, brand_scope, contracts
  • README.md exists at service root with run-locally instructions
  • RUNBOOK.md exists with: alarms, dashboards, remediation, escalation, recent incidents
  • openapi.yaml has every route with description AND example
  • Service detail page renders cleanly in docs site

Code conventions

  • All 23+ convention checks pass, including scope enforcement, audit, brand scoping, no raw fetch, and no PHI logging
  • No TODO / FIXME comments in critical paths (acceptable in stubs and tests)
  • All routes have integration tests
  • Conformance tests pass (audit, brand scoping, scope enforcement)

Auth

  • Every route enforces a scope via requireScope(...)
  • M2M client registered in identity if other services call this service
  • Public paths explicitly allow-listed in publicPaths (healthz, well-known only)
  • BAA gate enforced for any PHI-returning scope
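
The first and third items above can be pictured with a minimal, framework-free sketch. The names `requireScope` and `publicPaths` come from this checklist, but their signatures here, the `Req`/`Res` types, and the `orders:read` scope in the usage note are assumptions for illustration; the platform's real helpers may look different.

```typescript
// Hypothetical request/response shapes, not the platform's real types.
type Req = { path: string; scopes: string[] };
type Res = { status: number; body: string };
type Handler = (req: Req) => Res;

// Only health and well-known endpoints bypass auth.
// An entry ending in "/" matches as a prefix; anything else must match exactly.
const publicPaths: readonly string[] = ["/healthz", "/.well-known/"];

function isPublic(path: string): boolean {
  return publicPaths.some((p) =>
    p.endsWith("/") ? path.startsWith(p) : path === p,
  );
}

// Wraps a handler so it runs only when the caller holds the given scope.
function requireScope(scope: string, handler: Handler): Handler {
  return (req) => {
    if (!req.scopes.includes(scope)) {
      return { status: 403, body: `missing scope: ${scope}` };
    }
    return handler(req);
  };
}
```

With this shape, `requireScope("orders:read", handler)` returns 403 for a caller without `orders:read`, and `isPublic("/orders")` is false, so the route cannot slip through the allow-list.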

Data

  • Every table has brand_id NOT NULL
  • Migrations follow expand-only or expand–contract pattern
  • Migrations take no long-running locks
  • Migration runbook documents rollback procedure
  • Postgres role grants verified: append-only audit, no UPDATE/DELETE on ledger

Observability

  • OTel instrumented: logs, traces, metrics
  • service.name set in OTel resource attributes
  • Sentry DSN wired per stage
  • Structured logs include: request_id, route, status, duration_ms, client_id, brand_id (PHI-safe)
  • CloudWatch alarms: error rate > 1%, latency p99 > 1s, DLQ depth > 0
  • PagerDuty routing: critical → page, warn → Slack
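
One way to keep structured logs PHI-safe is to build each line from an allow-list of the fields named above and drop everything else. The sketch below assumes that approach; `toLogLine` and the `patient_name` field in the usage note are hypothetical and not part of the platform's logging library.

```typescript
// The only fields permitted in a structured log line (taken from the checklist).
const ALLOWED_FIELDS = [
  "request_id",
  "route",
  "status",
  "duration_ms",
  "client_id",
  "brand_id",
] as const;

type AllowedField = (typeof ALLOWED_FIELDS)[number];
type LogRecord = Partial<Record<AllowedField, string | number>>;

// Copies allow-listed string/number fields and silently drops anything else,
// so a stray PHI field passed by a caller never reaches the log output.
function toLogLine(fields: Record<string, unknown>): string {
  const record: LogRecord = {};
  for (const key of ALLOWED_FIELDS) {
    const value = fields[key];
    if (typeof value === "string" || typeof value === "number") {
      record[key] = value;
    }
  }
  return JSON.stringify(record);
}
```

For example, passing `{ request_id: "r1", status: 200, patient_name: "..." }` yields `{"request_id":"r1","status":200}`; the unlisted field is discarded rather than redacted.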

Idempotency

  • Mutation endpoints accept Idempotency-Key
  • Event handlers dedupe by event ID + handler scope
  • Cron-callable endpoints are safe to retry
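
The first two items can be sketched as follows. This is an in-memory illustration only: a real service would need a durable store with an atomic insert, and the class and method names are assumptions for this example rather than platform APIs.

```typescript
type Result = { status: number; body: string };

// Replays the stored result when the same Idempotency-Key is seen again.
class IdempotencyStore {
  private readonly results = new Map<string, Result>();

  // Runs `operation` at most once per key; retries get the first result back.
  execute(key: string, operation: () => Result): Result {
    const cached = this.results.get(key);
    if (cached !== undefined) return cached;
    const result = operation();
    this.results.set(key, result);
    return result;
  }
}

// Dedupes on event ID plus handler scope, so two different handlers
// can each process the same event exactly once.
class EventDeduper {
  private readonly seen = new Set<string>();

  shouldHandle(eventId: string, handlerScope: string): boolean {
    // JSON-encode the pair to avoid collisions from separator characters.
    const key = JSON.stringify([handlerScope, eventId]);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Keying the deduper on both values matters: keying on event ID alone would let the first handler to see an event block every other handler from processing it.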

Rate limits & resilience

  • Rate limits configured per route class
  • Circuit breakers wrap every outbound vendor call
  • Retries with exponential backoff + jitter
  • Graceful degradation when downstream service is unavailable
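
A minimal sketch of the retry and circuit-breaker items, assuming "full jitter" (a uniform random delay up to an exponentially growing ceiling) and a simple consecutive-failure breaker. All names, defaults, and thresholds here are illustrative assumptions, not the platform's resilience library.

```typescript
// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Retries `fn` up to maxAttempts times, sleeping between failed attempts.
async function retry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// Opens after `threshold` consecutive failures; allows a trial call
// once `cooldownMs` has elapsed (half-open), and closes again on success.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 5, cooldownMs = 30000, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  canRequest(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

A caller would check `canRequest()` before each vendor call, fall back to a degraded response when it returns false, and report the outcome with `recordSuccess()` or `recordFailure()`. The `random` and `now` parameters exist so the behavior can be tested deterministically.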

Deployment

  • infra.ts complete with hostname, scaling per stage, health checks
  • Service registered in sst.config.ts
  • SST Secrets set for dev / staging / prod
  • Cloudflare DNS configured for *.platform.loop.health per stage
  • Deploys to dev succeed, healthz returns 200
  • Smoke test against staging passes
  • Soaked in staging for ≥ 24 hours with no alarms

SDK

  • pnpm sdk:gen produces clean output
  • @platform/sdk-<service> package builds + publishes to GitHub Packages
  • SDK reference page generated and committed
  • At least one consumer (canary, internal app) imports + calls successfully

On-call

  • Primary on-call assigned in service.yaml
  • Secondary on-call assigned
  • PagerDuty schedule exists
  • Escalation path documented in RUNBOOK.md
  • On-call has read access to dashboards
  • On-call has been notified before promotion

Documentation

  • Service page has description, status, owner, dashboards link
  • Concept pages exist for any new domain primitive this service introduces
  • Connect docs updated if this service exposes new partner-facing scopes
  • Event reference pages generated for every published event
  • Cross-links from related services and concepts back to this service

Pre-promotion gate

  • PR approved by reviewer
  • CI passes in full: all 23+ convention checks, typecheck, tests, build, and drift checks
  • No merge during a freeze window
  • Changeset has user-facing summary
  • Rollback procedure tested OR explicitly documented in RUNBOOK.md
  • Manual prod deploy triggered + monitored for 30 minutes