Sequential learning → Helix (the in-silico twin)

This is the forward roadmap for the intelligence loop: the levels above what is built today, and the path to Helix, the in-silico twin. It exists so the next levels are decision-ready, not vague vision. The point to internalize is that everything below runs on the same rails. The /rank contract, the Decision+Outcome ledger, the registry, propensity logging, and off-policy evaluation don’t change. Each level is a trainer/policy swap behind the same contract. We do not re-architect to climb. See the intelligence layer for the loop this builds on and Phenome for the evidence layer it feeds.

Helix in one paragraph. Helix is Loop’s name for the in-silico virtual-control-arm engine described in this document’s Level 2 — a structural model of member-phenotype dynamics that can be rolled forward to simulate a control trajectory for an experiment. It is design-stage: there is no model today (ratified by ADR-0095 as a research program, not a build). It is gated on two hard preconditions that do not exist yet — a validated rung-3 (aggregated-n-of-1) engine, and enough longitudinal trajectory data to fit transition dynamics — plus a hard TRIPOD+AI validation gate (external and prospective; calibration, discrimination, decision-curve, subgroup/fairness) before any output feeds an inferential role. Its scope is fixed: virtual control arms in a governed trial, aligned to FDA’s January-2025 in-silico draft guidance — it is explicitly NOT a per-patient recommender. An unvalidated synthetic control is worse than none: it manufactures false confidence. (The name “Helix” carries a known trademark collision with Helix genomics; resolve before any external use.)

Where the track ends today (the uplift gate — built)

Counterfactual training — the model learns from the loop’s own logged decisions, IPS-weighted (TRAIN_MODE=counterfactual).
Uplift S-learner — predicts the incremental effect of acting (CATE = score(treat=1) − score(treat=0)), not baseline propensity (TRAIN_MODE=uplift).
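
As a minimal sketch of the S-learner idea above (not the loop's actual trainer), the example below fits one model on features plus a treatment indicator and scores each member twice, once with treat=1 and once with treat=0. The linear base model, the synthetic data, and the names `design`, `fit_s_learner`, and `uplift` are all assumptions made for illustration.

```python
import numpy as np

def design(x, treat):
    """Feature matrix for a single model: intercept, x, treat, and x*treat."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(treat, dtype=float)
    return np.column_stack([np.ones_like(x), x, t, x * t])

def fit_s_learner(x, treat, y):
    """Fit ONE model on (features, treatment) by ordinary least squares."""
    coef, *_ = np.linalg.lstsq(design(x, treat), np.asarray(y, dtype=float), rcond=None)
    return coef

def uplift(coef, x):
    """CATE = score(treat=1) - score(treat=0), evaluated per member."""
    x = np.asarray(x, dtype=float)
    treated = design(x, np.ones_like(x)) @ coef
    control = design(x, np.zeros_like(x)) @ coef
    return treated - control

# Synthetic two-arm data (assumption): the true effect of acting is 0.5 + 1.0 * x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=2000)
treat = rng.integers(0, 2, size=2000)
y = 0.2 + 0.3 * x + treat * (0.5 + 1.0 * x) + rng.normal(0, 0.05, size=2000)

coef = fit_s_learner(x, treat, y)
cate = uplift(coef, np.array([0.0, 1.0]))  # estimated uplift at x=0 and x=1
```

The baseline term (0.2 + 0.3 * x) cancels in the subtraction, which is why the output ranks members by the incremental effect of acting rather than by baseline propensity.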

That’s the leap from prediction to causal decision. The levels below are gated on it in three ways: technically (each is the generalization of the one under it), by data (the lower level generates the trajectories the next one needs), and by safety (off-policy evaluation validates a policy before it serves).

Level 1 — sequential decision-making / dynamic treatment regimes

Why: uplift optimizes a single decision’s incremental effect. The real objective is the member’s whole trajectory: lifetime value (commerce) or a sustained biomarker arc (clinical). A one-step policy is myopic: it will recommend the thing that converts now but churns the member in month 3, or chase one biomarker while another drifts. A dynamic treatment regime is a sequence of decision rules, one per step, each adapting to the member’s evolving state.

The formulation (the decision to make — this is the gate):

State — the member’s point-in-time representation: phenotype features (biomarkers, genetics, wearables), relationship/commerce features, and history (prior recommendations + responses). Already substantially in the feature store; the open call is how much history to encode and at what granularity.
Action — the recommendation / intervention chosen at each step (the same candidates /rank sees).
Reward — the long-horizon signal: cumulative LTV / retention, or a sustained biomarker-toward-optimal trajectory. The hard part — defining the reward is defining the product objective.
Horizon / discount — how far ahead to optimize, and how to trade near vs long term.
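
To make the horizon/discount trade-off concrete, here is a small sketch of a trajectory step and its discounted return. The `Step` fields, the reward numbers, and the discount values are hypothetical and chosen only to illustrate the myopia problem. The real state encoding and reward definition are the open decisions described above.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class Step:
    state: tuple    # hypothetical encoded member features at decision time
    action: str     # the candidate chosen at this step
    reward: float   # hypothetical per-step reward signal

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * reward_t, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Illustrative numbers only: one plan converts now and then churns,
# the other earns less per step but keeps the member.
convert_then_churn = [1.0, 0.0, 0.0]
steady = [0.4, 0.4, 0.4]

myopic = (discounted_return(convert_then_churn, 0.0), discounted_return(steady, 0.0))
patient = (discounted_return(convert_then_churn, 0.9), discounted_return(steady, 0.9))
```

With gamma = 0 the first plan scores higher (1.0 vs 0.4). With gamma = 0.9 the steady plan scores higher (1.084 vs 1.0), so the choice of discount is itself a product decision.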

Data substrate (new): a trajectory projection of the ledger — each member’s decision → outcome sequence ordered in time. (Deliberately not built yet: its exact shape depends on the state representation above, so building it now would hard-code an unmade decision.)
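
For orientation only, the grouping-and-ordering part of such a projection might look like the sketch below. The row fields (`member_id`, `decided_at`, `action`, `outcome`) are hypothetical, not the ledger's schema, and the sketch deliberately leaves out the state encoding, which is the unmade decision.

```python
from collections import defaultdict

def project_trajectories(ledger_rows):
    """Group decision rows by member and order each member's rows in time."""
    by_member = defaultdict(list)
    for row in ledger_rows:
        by_member[row["member_id"]].append(row)
    return {
        member: sorted(rows, key=lambda r: r["decided_at"])
        for member, rows in by_member.items()
    }

# Hypothetical rows, intentionally out of order.
rows = [
    {"member_id": "m1", "decided_at": 2, "action": "b", "outcome": 0.0},
    {"member_id": "m2", "decided_at": 1, "action": "a", "outcome": 1.0},
    {"member_id": "m1", "decided_at": 1, "action": "a", "outcome": 1.0},
]
trajectories = project_trajectories(rows)
```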

Safety: sequential off-policy evaluation (per-step IPS / doubly-robust over trajectories) extends the policy_value estimator we already have — so a multi-step policy is validated offline before it ever serves. Non-negotiable for clinical. (The member-facing narration of clinical recommendations is a separate concern — see the AI & ML layer.)
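
A minimal sketch of per-step (per-decision) IPS over trajectories follows. It is not the existing policy_value estimator. The tuple layout, the `target_prob` callable, and the toy data are assumptions, and the doubly-robust variant is not shown.

```python
def per_decision_ips(trajectories, target_prob, gamma=1.0):
    """Estimate a target policy's value from logged trajectories.

    Each trajectory is a list of (state, action, logged_propensity, reward).
    The reward at step t is weighted by the product of the ratios
    target_prob / logged_propensity for steps 0..t.
    """
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for t, (state, action, propensity, reward) in enumerate(traj):
            weight *= target_prob(state, action) / propensity
            total += (gamma ** t) * weight * reward
    return total / len(trajectories)

# Toy log (assumption): the logging policy picked "a" or "b" uniformly
# (propensity 0.5), and action "a" always earns reward 1.
logged = [
    [("s", "a", 0.5, 1.0), ("s", "a", 0.5, 1.0)],
    [("s", "a", 0.5, 1.0), ("s", "b", 0.5, 0.0)],
    [("s", "b", 0.5, 0.0), ("s", "a", 0.5, 1.0)],
    [("s", "b", 0.5, 0.0), ("s", "b", 0.5, 0.0)],
]

def always_a(state, action):
    """Target policy to evaluate: always choose 'a'."""
    return 1.0 if action == "a" else 0.0

value = per_decision_ips(logged, always_a)
```

On this toy log the estimate is 2.0, which matches what always choosing "a" would earn over two steps. The estimate is only as good as the logged propensities, which is why propensity logging is part of the unchanged rails.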

Method options (an ADR-level choice): batch/offline RL (conservative Q-learning, fitted-Q) over logged trajectories, not online RL, which is unsafe to explore live, especially clinically. The clinical track maps specifically to the optimal dynamic treatment regime literature (adaptive-trial math).
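
To show what learning from logged trajectories only means in practice, here is a tabular fitted-Q sketch. The states, actions, and rewards are invented for illustration. A real learner would use a function approximator, and the pessimism penalty that makes conservative Q-learning conservative is not implemented here.

```python
from collections import defaultdict

def fitted_q(transitions, actions, gamma=0.9, iterations=50):
    """Tabular fitted-Q iteration over logged (s, a, r, s_next, done) tuples.

    Each pass regresses Q(s, a) onto r + gamma * max_b Q(s_next, b). In the
    tabular case that regression is the mean target per (s, a) pair.
    No environment interaction happens: only the logged data is used.
    """
    q = defaultdict(float)
    for _ in range(iterations):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for s, a, r, s_next, done in transitions:
            bootstrap = 0.0 if done else max(q[(s_next, b)] for b in actions)
            sums[(s, a)] += r + gamma * bootstrap
            counts[(s, a)] += 1
        q = defaultdict(float, {k: sums[k] / counts[k] for k in sums})
    return q

def greedy(q, state, actions):
    """Pick the action with the highest learned Q-value in this state."""
    return max(actions, key=lambda a: q[(state, a)])

# Hypothetical log: "push" converts now and ends the relationship,
# "nurture" pays less at first but leads to a better later step.
ACTIONS = ["push", "nurture"]
log = [
    ("new", "push", 1.0, "churned", True),
    ("new", "nurture", 0.4, "engaged", False),
    ("engaged", "nurture", 1.0, "retained", True),
    ("engaged", "push", 0.2, "churned", True),
]
q = fitted_q(log, ACTIONS)
best_first_action = greedy(q, "new", ACTIONS)
```

Here Q("new", "nurture") converges to 0.4 + 0.9 * 1.0 = 1.3, above the 1.0 for "push", so the learned policy avoids the myopic choice. State-action pairs that never appear in the log default to 0, which hints at why offline methods need explicit handling of unsupported actions.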

Level 2 — structural model + active experimental design (Helix, the twin)

Why: the preceding levels learn from what happened. Helix lets you simulate what would happen: a structural/mechanistic model of the member’s phenotype dynamics you can plan against (model-based control), plus active experimental design, where the system chooses the most informative next lab/intervention, not just the best one. That compounds data efficiency, which is decisive for the data-starved clinical track.

What it adds over Level 1: a transition model (how interventions move the phenotype over time) you can roll forward to evaluate plans before acting, and an information-gain objective that turns each member interaction into optimal data collection. The sanctioned use of that roll-forward is one thing only: generating a virtual control arm that offsets some of the real participants a parallel control group would require — and only after validation earns a non-zero borrowing factor. The honest default when assumptions are shaky is offset = 0 (run the full real control).
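
The offset = 0 default can be read as a simple guard. The sketch below is an assumption-heavy illustration, not the quantification in the digital-twin spec: the function name, the linear scaling by borrowing factor, and the rounding down are all hypothetical.

```python
import math

def virtual_control_offset(n_control_required, borrowing_factor, validated):
    """Number of real control participants a virtual arm may offset.

    Hypothetical rule for illustration: without a passed validation gate,
    or with a non-positive borrowing factor, the offset is 0 and the full
    real control group runs.
    """
    if not 0.0 <= borrowing_factor <= 1.0:
        raise ValueError("borrowing_factor must be in [0, 1]")
    if not validated or borrowing_factor == 0.0:
        return 0
    # Round down so the virtual arm never offsets more than it has earned.
    return math.floor(borrowing_factor * n_control_required)

unvalidated = virtual_control_offset(200, 0.25, validated=False)
validated_quarter = virtual_control_offset(200, 0.25, validated=True)
real_controls_needed = 200 - validated_quarter
```

The point of the guard is ordering: validation status is checked before any borrowing arithmetic, so an unvalidated twin cannot reduce the real control arm regardless of how confident its outputs look.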

Gated on: the sequential causal models from Level 1, plus enough longitudinal data to fit transition dynamics — the far end of the trajectory the uplift gate starts generating. And, before any inferential use, the TRIPOD+AI validation gate above. No validated twin → no virtual control arm. Full stop.

Out of scope: Helix is never a per-patient recommender. Member-level sequential decision-making is the separately-governed Level-1 track — a different problem with different governance. The twin’s only sanctioned role is trial-design augmentation under governance.

The honest unlock sequence

Ship the epic to main → the loop deploys and starts logging decisions.
Run the data path (materialize features → train observational → advance the stage) → decisions + outcomes accrue with propensities.
Flip TRAIN_MODE=counterfactual → the model learns from its own choices (gate live).
Serve both arms (rules + model) → two-arm data → uplift becomes validatable.
Decide the Level-1 formulation (state / action / reward / horizon — the ADR above) → build the trajectory view + the offline-RL learner on the same rails.
Fit transition dynamics on accrued trajectories → Helix + active design, behind the validation gate.

The binding constraint is never the algorithm — it’s data and one product decision (the reward). Uplift is the gate because it’s both the technical prerequisite and the engine that generates the trajectories every level above it needs. The full Helix design — modeling approaches, real-N offset quantification, the validation regime, and the S0→S5 activation pathway — lives in the evidence-engine digital-twin spec (ADR-0095). For where Helix sits in the wider effort, see Phenome — the evidence layer.
