Operating the intelligence loop

Operational reference for the intelligence service. See the concept for the model and the architecture for the build. Internal / M2M only; dark by default (stage S0).

Cadence — how the loop turns

The Loop Orchestrator (services/workflow → Inngest intelligence.loop.v1) calls POST /v1/loop/tick on a schedule. Each tick runs Monitor → Analyze → Propose:

Monitor — sweep pending decisions past their maturation deadline → expired_unjoined (closes the join window).
Analyze — compute + persist meta-metrics.
Propose — when a meta-metric trips its threshold, raise a loop_proposal + publish intelligence.proposal.raised.v1 (governance consumes it).

Manual tick (admin M2M): POST /v1/loop/tick. Over an empty ledger it no-ops.
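
As a rough illustration of the three phases above, here is a minimal in-memory sketch. It is not the service's implementation: the `Decision` fields, the `TickResult` shape, the `join_completeness` formula (joined / closed), and treating every threshold as a floor are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Decision:
    id: str
    matures_at: datetime        # hypothetical field: the maturation deadline
    status: str = "pending"     # "pending" | "joined" | "expired_unjoined"


@dataclass
class TickResult:
    expired: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)
    proposals: list[str] = field(default_factory=list)


def tick(decisions: list[Decision], floors: dict[str, float], now: datetime) -> TickResult:
    result = TickResult()
    if not decisions:           # empty ledger: the tick is a no-op
        return result

    # Monitor: close the join window for pending decisions past their deadline.
    for d in decisions:
        if d.status == "pending" and d.matures_at < now:
            d.status = "expired_unjoined"
            result.expired.append(d.id)

    # Analyze: compute meta-metrics (only one here; the formula is an assumption).
    closed = [d for d in decisions if d.status in ("joined", "expired_unjoined")]
    if closed:
        joined = sum(d.status == "joined" for d in closed)
        result.metrics["join_completeness"] = joined / len(closed)

    # Propose: record a proposal for each metric that falls below its floor.
    for name, value in result.metrics.items():
        if name in floors and value < floors[name]:
            result.proposals.append(name)
    return result
```

In the real service the Analyze step persists the metrics and the Propose step publishes intelligence.proposal.raised.v1; the sketch only returns them so the control flow is easy to follow.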

Gate dashboards — the turn-on decision is data

GET /v1/loop/meta-metrics is the source; the authoritative metric/query/threshold definitions live in services/intelligence/docs/gate-dashboards.md. Watch:

Metric	Healthy	Why it matters
join_completeness	≥ 0.8	below → outcomes are being dropped → lift biased optimistically (the worst outcomes vanish)
loop_liveness	`stalled = false`	a stalled loop looks identical to a healthy converged one — alarm on it
realized_vs_predicted / calibration / lift_decay	per policy version	the model’s honesty + decay; the S2→S5 gate signals

SELECT metric, value, details, computed_at
FROM intelligence.meta_metrics
WHERE metric IN ('join_completeness','loop_liveness')
ORDER BY computed_at DESC LIMIT 2;
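
A small sketch of how the rows from a query like the one above might be checked against the two healthy conditions in the table. The row shape (dicts with `metric`, `value`, `details`) and the assumption that `details` carries a boolean `stalled` key are illustrative; the authoritative definitions live in gate-dashboards.md.

```python
JOIN_COMPLETENESS_FLOOR = 0.8


def gate_problems(rows: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means both watched metrics look healthy."""
    latest: dict[str, dict] = {}
    for row in rows:  # rows arrive newest first, so keep the first row seen per metric
        latest.setdefault(row["metric"], row)

    problems = []
    jc = latest.get("join_completeness")
    if jc is None:
        problems.append("join_completeness: no data")
    elif jc["value"] < JOIN_COMPLETENESS_FLOOR:
        problems.append(
            f"join_completeness {jc['value']:.2f} is below {JOIN_COMPLETENESS_FLOOR}"
        )

    ll = latest.get("loop_liveness")
    if ll is None:
        problems.append("loop_liveness: no data")
    elif ll["details"].get("stalled", True):  # assumed shape: details = {"stalled": bool}
        problems.append("loop_liveness: stalled")
    return problems
```

Missing data is reported as a problem rather than treated as healthy, which matches the point in the table that a stalled loop can look identical to a converged one.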

Alarms

Alarm (log metric)	Trigger	Action
`loop liveness alarm`	orchestrator dead / join-rate collapse / no proposals in N weeks	check the workflow + intelligence logs; run `/v1/loop/tick` manually
join-completeness below floor	`join_completeness < 0.8` (with closed outcomes)	check `intelligence-on-*` EventBridge subscriptions
`intelligence_guardrail_blocked`	a served rec was blocked by guardrails	review the violation types; expected for unsafe content, investigate spikes

Governance — the human gate

The orchestrator only proposes. A proposal changes what’s served only when a human applies it.

GET  /v1/governance/proposals?status=raised        # the review queue
POST /v1/governance/proposals/{id}/approve         # { reviewer_id, notes? }
POST /v1/governance/proposals/{id}/reject          # { reviewer_id, notes? }
POST /v1/governance/proposals/{id}/apply           # { version } → promote that registry version

Lifecycle: raised → approved → applied, or raised → rejected. Apply promotes a registry version to champion; it is the only path by which a proposal reaches members, and it is always explicit and audited. The model never auto-publishes.
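
The lifecycle can be pictured as a small state machine. The sketch below reads it as raised → approved → applied with rejected as a terminal state, which is an assumption; the `Proposal` class, the `transition` helper, and the audit tuple shape are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field

# Allowed next states per current state (assumed reading of the lifecycle).
TRANSITIONS: dict[str, set[str]] = {
    "raised": {"approved", "rejected"},
    "approved": {"applied"},
    "rejected": set(),
    "applied": set(),
}


@dataclass
class Proposal:
    id: str
    status: str = "raised"
    # Audit trail of (from_status, to_status, actor) tuples.
    audit: list[tuple[str, str, str]] = field(default_factory=list)


def transition(proposal: Proposal, to: str, actor: str) -> Proposal:
    """Move a proposal to a new status, refusing anything the lifecycle does not allow."""
    if to not in TRANSITIONS.get(proposal.status, set()):
        raise ValueError(f"illegal transition {proposal.status} -> {to}")
    proposal.audit.append((proposal.status, to, actor))
    proposal.status = to
    return proposal
```

Because "applied" is reachable only from "approved", a proposal cannot reach members without passing a human review step first.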

Registry promote / rollback

POST /v1/registry/{id}/promote          # make a version champion (retires the prior)
POST /v1/registry/rollback              # { policy_kind, policy_key, brand_id, to_version }

Rollback re-promotes a prior version. It preserves provenance — historical decision_records.policy_version / registry_id are never rewritten, so the Outcome Join keeps attributing correctly.
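
To make the provenance point concrete, here is a toy registry, assuming versions are plain integers and statuses are the strings "candidate", "champion", and "retired" (all hypothetical). Rollback is modelled as a promote of an earlier version, and decision records are frozen so their stamped version cannot change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    id: str
    policy_version: int  # stamped when the decision is made, never rewritten


class Registry:
    def __init__(self) -> None:
        self.versions: dict[int, str] = {}  # version -> status

    def register(self, version: int) -> None:
        self.versions[version] = "candidate"

    def promote(self, version: int) -> None:
        if version not in self.versions:
            raise KeyError(version)
        for v in list(self.versions):
            if self.versions[v] == "champion":
                self.versions[v] = "retired"  # retire the prior champion
        self.versions[version] = "champion"

    def rollback(self, to_version: int) -> None:
        # Rollback is a re-promotion; existing decision records are untouched.
        self.promote(to_version)

    def champion(self) -> int:
        return next(v for v, s in self.versions.items() if s == "champion")
```

A decision made while version 2 was champion still carries `policy_version == 2` after a rollback to version 1, so any later outcome is attributed to the policy that actually served it.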

Turning a stage on (the ★ pathway)

feature.intelligence.stage (default S0) selects the ranker. Advance only when the gate is met (see the gate dashboards); roll back instantly by flipping the flag back. Every stage S0→S5 uses the same /rank contract — only the ranker changes.
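
A minimal sketch of flag-driven ranker selection behind one contract. The two rankers are placeholders (the real S0 and S1 rankers are not described here), the flag store is a plain dict, and falling back to S0 for an unknown stage is an assumed design choice.

```python
from typing import Callable

# The shared contract: candidates in, ranked candidates out.
Ranker = Callable[[list[str]], list[str]]


def s0_ranker(candidates: list[str]) -> list[str]:
    """Placeholder for the default stage: return candidates in their given order."""
    return list(candidates)


def s1_ranker(candidates: list[str]) -> list[str]:
    """Placeholder for a later stage: a stand-in that sorts alphabetically."""
    return sorted(candidates)


RANKERS: dict[str, Ranker] = {"S0": s0_ranker, "S1": s1_ranker}


def rank(candidates: list[str], flags: dict[str, str]) -> list[str]:
    stage = flags.get("feature.intelligence.stage", "S0")  # default S0
    ranker = RANKERS.get(stage, s0_ranker)                 # unknown stage: fall back to S0
    return ranker(candidates)
```

Since callers only ever see `rank`, flipping the flag back is enough to roll back: no caller changes and no contract migration are involved.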

PHI + erasure

The ledger + feature store hold PHI links once the clinical track is live. Cross-service reads + the platform_readonly role use the *_safe views. On user.erasure_requested.v1 the service deletes the entity’s decisions (outcomes cascade) + feature vectors, audited. Model-artifact erasure is handled when a model is trained.
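
The erasure flow can be sketched over an in-memory store. The store layout (lists of dicts keyed by `entity_id` and `decision_id`) and the audit entry shape are assumptions for illustration; in a real database the outcome cascade would more likely come from a foreign-key constraint than from application code.

```python
def erase_entity(store: dict, entity_id: str) -> dict:
    """Delete an entity's decisions, their outcomes, and its feature vectors; audit the counts."""
    decision_ids = {d["id"] for d in store["decisions"] if d["entity_id"] == entity_id}
    counts = {
        "decisions": len(decision_ids),
        "outcomes": sum(o["decision_id"] in decision_ids for o in store["outcomes"]),
        "feature_vectors": sum(f["entity_id"] == entity_id for f in store["feature_vectors"]),
    }

    store["decisions"] = [d for d in store["decisions"] if d["id"] not in decision_ids]
    # Outcomes cascade: anything joined to a deleted decision goes with it.
    store["outcomes"] = [o for o in store["outcomes"] if o["decision_id"] not in decision_ids]
    store["feature_vectors"] = [
        f for f in store["feature_vectors"] if f["entity_id"] != entity_id
    ]

    # Audit what was removed (counts only, no row contents).
    store["audit"].append(
        {"event": "user.erasure_requested.v1", "entity_id": entity_id, **counts}
    )
    return counts
```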

Health

GET /healthz   # process up
GET /readyz    # DB + bus reachable

Deeper runbooks: services/intelligence/RUNBOOK.md + docs/runbooks/.
