Routing & failover

What this is: how a single charge picks an MID, fails over to the next, and never double-charges — the safety core of the orchestration layer.

Who it’s for: anyone wiring the orchestrator to a live charge, implementing a TokenChargeProvider, or reasoning about “could this customer be charged twice?”

What to read next: The two seams · Entity model · Operational guardrails.

The routing layer is three small primitives, each separately tested: RoutingPolicyEngine (which MIDs), chargeAcrossProviders (the cascade), and PaymentOrchestrator (composes them). Keeping them small is what makes the safety invariants provable.

The end-to-end flow

Step 1 — eligibility (the laundering guard)

RoutingPolicyEngine.selectMids returns the ordered, pre-filtered list of MIDs for a charge. It enforces three checks in order: the entity exists, it can collect, and it is underwritten for the product. It then filters out disabled MIDs, kill-switched MIDs, and MIDs on kill-switched providers, and sorts active before warm_standby.

From services/payments/src/orchestration/routing-policy.ts:

selectMids(request: RoutingRequest): RoutingDecision {
  const entity = this.registry.get(request.entityId);
  if (!entity) return { ok: false, reason: "entity_not_found" };
  if (!this.registry.canCollect(entity)) return { ok: false, reason: "entity_cannot_collect" };
  // THE laundering guard: this entity must be underwritten for the product.
  if (!this.registry.allowsProduct(entity, request.product)) {
    return { ok: false, reason: "product_not_eligible" };
  }
 
  const eligible = entity.mids
    .filter((mid) => mid.status !== "disabled")
    .filter((mid) => !this.killSwitch.disabledMids?.has(mid.id))
    .filter((mid) => !this.killSwitch.disabledProviders?.has(mid.provider))
    .sort((a, b) => STATUS_ORDER[a.status] - STATUS_ORDER[b.status]);
 
  if (eligible.length === 0) return { ok: false, reason: "no_active_mid" };
  // Every candidate belongs to `entity` — failover can never leave it.
  return { ok: true, candidates: eligible.map((mid) => ({ entityId: entity.id, mid })) };
}

Two consequences are structural, not conventional:

Failover can never leave the entity. Every candidate belongs to the requested entity. There is no code path that widens to another entity’s MID. (See Entity model — the laundering guardrail.)
An ineligible request yields zero candidates, so the charge is rejected with zero attempts. It is never routed cross-category.
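
The excerpt sorts by a STATUS_ORDER lookup that it does not show. The following is a minimal sketch of that ordering step, assuming the three statuses named on this page and a "lower rank sorts first" convention. The Mid shape, the numeric ranks, and the orderEligible helper are assumptions for illustration, not the shipped definitions.

```typescript
// Hypothetical shapes for illustration; the real types live in the registry.
type MidStatus = "active" | "warm_standby" | "disabled";
type Mid = { id: string; provider: string; status: MidStatus };

// Assumed ranking: lower sorts first, so active MIDs precede warm standbys.
const STATUS_ORDER: Record<MidStatus, number> = {
  active: 0,
  warm_standby: 1,
  disabled: 2,
};

function orderEligible(mids: readonly Mid[]): Mid[] {
  return mids
    .filter((mid) => mid.status !== "disabled")
    // filter returns a fresh array, so the in-place sort never mutates the caller's list.
    .sort((a, b) => STATUS_ORDER[a.status] - STATUS_ORDER[b.status]);
}

const ordered = orderEligible([
  { id: "mid_standby", provider: "psp_b", status: "warm_standby" },
  { id: "mid_off", provider: "psp_c", status: "disabled" },
  { id: "mid_primary", provider: "psp_a", status: "active" },
]);
console.log(ordered.map((m) => m.id)); // ["mid_primary", "mid_standby"]
```

Because the cascade walks candidates in array order, this ranking is what makes the active MID the first attempt and the warm standby the failover target.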

The kill-switch

Deplatforming is treated as routine. A KillSwitch excludes disabled MIDs / providers from candidates without a redeploy — flip config to route around an acquirer that froze you.

export type KillSwitch = {
  disabledMids?: ReadonlySet<string>;
  disabledProviders?: ReadonlySet<string>;
};
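
As a rough illustration of "flip config, no redeploy", the sketch below builds a KillSwitch from a plain config payload and applies the same two checks the routing engine uses. The KillSwitchConfig shape and the killSwitchFromConfig and applyKillSwitch helpers are hypothetical; only the KillSwitch type comes from the source.

```typescript
type KillSwitch = {
  disabledMids?: ReadonlySet<string>;
  disabledProviders?: ReadonlySet<string>;
};

// Hypothetical config shape, e.g. parsed from a JSON flag payload.
type KillSwitchConfig = { disabledMids?: string[]; disabledProviders?: string[] };

function killSwitchFromConfig(config: KillSwitchConfig): KillSwitch {
  return {
    disabledMids: new Set(config.disabledMids ?? []),
    disabledProviders: new Set(config.disabledProviders ?? []),
  };
}

type MidRef = { id: string; provider: string };

// Same two checks the routing engine applies: by MID id, then by provider.
function applyKillSwitch(mids: readonly MidRef[], ks: KillSwitch): MidRef[] {
  return mids
    .filter((mid) => !ks.disabledMids?.has(mid.id))
    .filter((mid) => !ks.disabledProviders?.has(mid.provider));
}

const mids: MidRef[] = [
  { id: "mid_primary", provider: "psp_a" },
  { id: "mid_standby", provider: "psp_b" },
];

// The acquirer behind psp_a froze the account: change config, not code.
const survivors = applyKillSwitch(mids, killSwitchFromConfig({ disabledProviders: ["psp_a"] }));
console.log(survivors.map((m) => m.id)); // ["mid_standby"]
```

Both fields are optional, so an empty KillSwitch excludes nothing and the default behaviour is unchanged.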

Step 2 — the cross-PSP cascade

chargeAcrossProviders charges one portable vault token across the ordered candidates, failing over only on a definitive decline, that is, a positive confirmation that nothing was captured. It is deliberately minimal, and it carries the NO-DOUBLE-CHARGE invariant.

From services/payments/src/orchestration/charge-cascade.ts:

for (const step of params.steps) {
  const idempotencyKey = buildAttemptKey(params.logicalAttemptId, step.provider.name, step.midId);
  const result = await step.provider.chargeToken({ token, amountCents, currency, midId, idempotencyKey, metadata });
 
  // A transport-level failure (no disposition) is itself indeterminate — we
  // cannot prove the money did not move.
  const attempt = result.ok ? result.data : { /* …, disposition: "indeterminate" */ };
  attempts.push(attempt);
 
  if (attempt.disposition === "captured") return { status: "captured", capturedBy: attempt, attempts };
  if (attempt.disposition === "indeterminate") {
    // HALT — do not fail over. Money may have moved; reconcile instead.
    return { status: "indeterminate", haltedAt: attempt, attempts };
  }
  // declined → safe to try the next eligible candidate
}
return { status: "all_declined", attempts };
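
The three terminal states returned above form a natural discriminated union. The sketch below shows one way a caller might branch on them. The field names follow the excerpt, but the exact type definitions, the nextAction helper, and its action labels are assumptions for illustration.

```typescript
type Disposition = "captured" | "declined" | "indeterminate";
type Attempt = { provider: string; midId: string; disposition: Disposition };

type CascadeResult =
  | { status: "captured"; capturedBy: Attempt; attempts: Attempt[] }
  | { status: "indeterminate"; haltedAt: Attempt; attempts: Attempt[] }
  | { status: "all_declined"; attempts: Attempt[] };

// Hypothetical caller-side policy: what to do with each terminal state.
function nextAction(result: CascadeResult): "fulfil_order" | "reconcile" | "ask_for_new_card" {
  switch (result.status) {
    case "captured":
      return "fulfil_order";
    case "indeterminate":
      // Money may have moved at result.haltedAt; never retry elsewhere.
      return "reconcile";
    case "all_declined":
      return "ask_for_new_card";
    default: {
      // Compile-time exhaustiveness check: a new status fails to type-check here.
      const unreachable: never = result;
      return unreachable;
    }
  }
}

const timedOut: Attempt = { provider: "psp_a", midId: "mid_primary", disposition: "indeterminate" };
console.log(nextAction({ status: "indeterminate", haltedAt: timedOut, attempts: [timedOut] })); // "reconcile"
```

The exhaustiveness check matters here: if a fourth terminal state were ever added, every caller would have to decide explicitly whether it is safe to treat as a decline.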

The NO-DOUBLE-CHARGE invariant

This is the rule the whole cascade exists to enforce. It has three parts.

1. Fail over ONLY on a definitive decline

A captured disposition stops the cascade with success. A declined disposition (definitively not captured) is safe to fail over. An indeterminate disposition (a timeout, a transport error, anything we cannot prove did not move money) HALTS the cascade. Trying the next provider after a timeout would risk charging the same token twice, because the timed-out gateway may have captured. The halt is surfaced for reconciliation, never silently retried elsewhere.

🚫

A timeout is not a decline. The single most dangerous mistake a provider adapter can make is mapping an ambiguous/timeout response to declined. It must be indeterminate. Even a transport-level failure with no disposition is treated as indeterminate by the cascade.
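
One way to make this rule hard to get wrong in an adapter is to treat indeterminate as the default and require positive evidence for anything else. The sketch below assumes a hypothetical normalized GatewayOutcome shape. It is not any real PSP's response format, and toDisposition is an illustrative helper, not part of TokenChargeProvider.

```typescript
type Disposition = "captured" | "declined" | "indeterminate";

// Hypothetical normalized view of a gateway call; not any real PSP's API.
type GatewayOutcome =
  | { kind: "response"; approved: boolean; definitive: boolean }
  | { kind: "timeout" }
  | { kind: "network_error" };

function toDisposition(outcome: GatewayOutcome): Disposition {
  // No answer at all: the gateway may still have captured.
  if (outcome.kind !== "response") return "indeterminate";
  // An answer the gateway itself flags as ambiguous gets the same treatment.
  if (!outcome.definitive) return "indeterminate";
  // Only a definitive answer may become captured or declined.
  return outcome.approved ? "captured" : "declined";
}

console.log(toDisposition({ kind: "timeout" })); // "indeterminate"
console.log(toDisposition({ kind: "response", approved: false, definitive: true })); // "declined"
```

The design choice is that declined is only reachable through an explicit, definitive refusal, so a new or unexpected failure mode falls into the halting branch rather than the failover branch.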

2. Deterministic per-attempt idempotency keys

Each attempt’s key is derived from the caller-supplied logical attempt id + provider + MID. Re-running the same cascade reuses the same per-attempt keys, so each gateway deduplicates its own attempt and a retried cascade cannot double-charge.

/** Deterministic per-attempt idempotency key. Stable across cascade retries. */
export function buildAttemptKey(logicalAttemptId: string, providerName: string, midId: string): string {
  return `${logicalAttemptId}:${providerName}:${midId}`;
}

This also fixes the shipped bug LOO-2203, where the legacy accounting-client.ts built its idempotency key with randomUUID() — so a retry double-booked the ledger by bypassing accounting’s withIdempotencyKey guard. The logical attempt id is supplied by the caller (e.g. derived from the order id) so the accounting key and the per-PSP keys are stable across retries: no double charge, no double ledger post.
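
The difference between a deterministic key and a fresh one per call can be sketched with an in-memory stand-in for a gateway that deduplicates on idempotency key. FakeGateway and freshKey are hypothetical; real gateways differ in how they scope and expire idempotency keys, so each PSP's behaviour should be checked separately.

```typescript
function buildAttemptKey(logicalAttemptId: string, providerName: string, midId: string): string {
  return `${logicalAttemptId}:${providerName}:${midId}`;
}

// Hypothetical in-memory gateway that replays the original result for a repeated key.
class FakeGateway {
  private readonly seen = new Map<string, string>();
  captures = 0;

  charge(idempotencyKey: string): string {
    const prior = this.seen.get(idempotencyKey);
    if (prior !== undefined) return prior; // replay: no second capture
    this.captures += 1;
    const chargeId = `ch_${this.captures}`;
    this.seen.set(idempotencyKey, chargeId);
    return chargeId;
  }
}

// Deterministic key: a retried cascade presents the same key.
const stableGateway = new FakeGateway();
stableGateway.charge(buildAttemptKey("order_2", "nmi", "mid_primary"));
stableGateway.charge(buildAttemptKey("order_2", "nmi", "mid_primary"));
console.log(stableGateway.captures); // 1

// Fresh key per call, as a randomUUID()-style key would behave.
let counter = 0;
const freshKey = (): string => `attempt-${++counter}`;
const unstableGateway = new FakeGateway();
unstableGateway.charge(freshKey());
unstableGateway.charge(freshKey());
console.log(unstableGateway.captures); // 2
```

The second half mirrors the failure mode described for LOO-2203: with a new key on every call, the retry is indistinguishable from a new charge.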

3. Eligibility is enforced upstream

The cascade never widens the candidate set. It charges exactly the eligibility-filtered, kill-switch-applied steps the routing engine produced. Eligibility is the routing engine’s job; safety-on-failover is the cascade’s job; they do not overlap.

Step 3 — the orchestrator composes them

PaymentOrchestrator.charge runs selectMids, resolves each candidate’s provider by name, and hands the steps to the cascade. An ineligible decision returns rejected with zero attempts; a decision with no resolvable provider returns rejected too.

async charge(params: OrchestratedChargeParams): Promise<OrchestratedChargeResult> {
  const decision = this.routing.selectMids(params.request);
  if (!decision.ok) return { status: "rejected", reason: decision.reason }; // never widen
  const steps = decision.candidates
    .map((c) => ({ provider: this.resolveProvider(c.mid.provider), midId: c.mid.id }))
    .filter((s) => s.provider);
  if (steps.length === 0) return { status: "rejected", reason: "no_resolvable_provider" };
  // … chargeAcrossProviders(steps, token, amount, logicalAttemptId) → captured / all_declined / indeterminate
}
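
A note on the provider-resolution step: a truthiness callback such as `.filter((s) => s.provider)` drops unresolved entries at runtime, but TypeScript does not narrow the element type from it. The sketch below shows one option, an explicit type predicate, assuming resolveProvider returns undefined for an unknown name. The Provider, Candidate, and Step shapes, the registry Map, and the isResolved and toSteps helpers are all hypothetical.

```typescript
// Hypothetical minimal shapes for illustration.
type Provider = { name: string };
type Step = { provider: Provider; midId: string };
type Candidate = { mid: { id: string; provider: string } };

const providers = new Map<string, Provider>([["nmi", { name: "nmi" }]]);

function resolveProvider(name: string): Provider | undefined {
  return providers.get(name);
}

// Explicit type predicate so the filtered array is typed as Step[].
function isResolved(step: { provider: Provider | undefined; midId: string }): step is Step {
  return step.provider !== undefined;
}

function toSteps(candidates: readonly Candidate[]): Step[] {
  return candidates
    .map((c) => ({ provider: resolveProvider(c.mid.provider), midId: c.mid.id }))
    .filter(isResolved);
}

const steps = toSteps([
  { mid: { id: "mid_primary", provider: "nmi" } },
  { mid: { id: "mid_orphan", provider: "unknown_psp" } },
]);
console.log(steps.map((s) => s.midId)); // ["mid_primary"]
```

Filtering preserves candidate order, so the cascade still sees active MIDs before warm standbys after unresolved providers are dropped.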

The proof tests

These invariants are not aspirations — they are proven by tests in the foundation PR (LOO-2208 / LOO-2226), running fully offline against StubVault + StubProvider (no Basis Theory key, no PSP key, no network).

Test	What it proves
`fails over A→B on decline and captures EXACTLY once`	Failover on decline; exactly one capture across the cascade; the same portable token presented to both PSPs
`HALTS on a timeout and never fails over — no double charge`	A timed-out primary whose capture side-effect fired → standby is never charged
`returns all_declined when every eligible PSP declines`	No capture, clean terminal state
`uses deterministic per-attempt idempotency keys`	Re-running the same cascade yields identical per-attempt keys
`REJECTS a product the entity is not underwritten for`	The laundering guard: zero attempts, `product_not_eligible`
`every candidate belongs to the requested entity`	Failover never leaves the entity, across all seeded entities
`routes around a killed primary MID straight to the standby`	The kill-switch routes around a frozen acquirer with no redeploy

A representative case — the timeout HALT — from tests/unit/orchestration/routing.test.ts:

it("HALTS on a primary timeout — standby is NEVER charged (no double charge)", async () => {
  // primary times out (and its capture side-effect fires); standby would succeed
  const result = await orchestrator.charge(chargeArgs("loop_bio_labs", "order_2"));
  expect(result.status).toBe("indeterminate");
  // The standby MID was NEVER attempted — the token cannot be double-charged.
  expect(spy.mock.calls.map((c) => c[0].midId)).toEqual(["mid_loopbio_nmi_primary"]);
  expect(captureSideEffects).toBe(1);
});
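
For readers without the repository to hand, the behaviour this test asserts can be reproduced with a scripted stub and a stripped-down cascade. This is a sketch only: scriptedProvider and runCascade are hypothetical stand-ins for StubProvider and chargeAcrossProviders, with token, amount, and metadata plumbing omitted.

```typescript
type Disposition = "captured" | "declined" | "indeterminate";
type Attempt = { midId: string; idempotencyKey: string; disposition: Disposition };

type StubLike = {
  name: string;
  calls: string[];
  chargeToken(midId: string, idempotencyKey: string): Promise<Attempt>;
};

// Hypothetical stub: always answers with one scripted disposition and records each call.
function scriptedProvider(name: string, disposition: Disposition): StubLike {
  const calls: string[] = [];
  return {
    name,
    calls,
    async chargeToken(midId: string, idempotencyKey: string): Promise<Attempt> {
      calls.push(midId);
      return { midId, idempotencyKey, disposition };
    },
  };
}

type Step = { provider: StubLike; midId: string };

// Simplified cascade: stop on captured, HALT on indeterminate, continue on declined.
async function runCascade(logicalAttemptId: string, steps: readonly Step[]) {
  const attempts: Attempt[] = [];
  for (const step of steps) {
    const key = `${logicalAttemptId}:${step.provider.name}:${step.midId}`;
    const attempt = await step.provider.chargeToken(step.midId, key);
    attempts.push(attempt);
    if (attempt.disposition === "captured") return { status: "captured" as const, attempts };
    if (attempt.disposition === "indeterminate") return { status: "indeterminate" as const, attempts };
  }
  return { status: "all_declined" as const, attempts };
}

async function demo(): Promise<void> {
  const primary = scriptedProvider("psp_a", "indeterminate"); // simulated timeout
  const standby = scriptedProvider("psp_b", "captured"); // would succeed if tried
  const outcome = await runCascade("order_2", [
    { provider: primary, midId: "mid_primary" },
    { provider: standby, midId: "mid_standby" },
  ]);
  console.log(outcome.status, standby.calls.length); // indeterminate 0
}

void demo();
```

Swapping the primary's scripted disposition to "declined" exercises the failover path instead: the standby is then called once and the cascade ends in captured.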

⚠️

What is real vs latent here. The primitives and proofs are built and passing. They are not wired to the live checkout path — the orchestrator and entity registry aren’t instantiated on a real charge yet, and the PSP charge legs are stubbed pending sandbox credentials and an NmiProvider (LOO-2192). The BT vault leg is real-tested against a Basis Theory TEST tenant via a gated spike; the cross-PSP charge is proven only against stubs so far. See Status & roadmap.

Source

services/payments/src/orchestration/routing-policy.ts · charge-cascade.ts · payment-orchestrator.ts
services/payments/tests/unit/orchestration/routing.test.ts · charge-cascade.spike.test.ts
ADR-0093 (LOO-2190 routing, LOO-2208 safety, LOO-2203 idempotency)
