Routing & failover
What this is: how a single charge picks an MID, fails over to the next, and never double-charges — the safety core of the orchestration layer.
Who it’s for: anyone wiring the orchestrator to a live charge, implementing a TokenChargeProvider, or reasoning about “could this customer be charged twice?”
What to read next: The two seams · Entity model · Operational guardrails.
The routing layer is three small primitives, each separately tested:
RoutingPolicyEngine (which MIDs), chargeAcrossProviders (the cascade), and
PaymentOrchestrator (composes them). Keeping them small is what makes the
safety invariants provable.
The end-to-end flow
Step 1 — eligibility (the laundering guard)
RoutingPolicyEngine.selectMids returns the ordered, pre-filtered list of MIDs for a charge. It enforces, in order: the entity exists, it can collect, and it is underwritten for the product. Then it filters out disabled / kill-switched MIDs and sorts active before warm_standby.
From services/payments/src/orchestration/routing-policy.ts:
selectMids(request: RoutingRequest): RoutingDecision {
  const entity = this.registry.get(request.entityId);
  if (!entity) return { ok: false, reason: "entity_not_found" };
  if (!this.registry.canCollect(entity)) return { ok: false, reason: "entity_cannot_collect" };
  // THE laundering guard: this entity must be underwritten for the product.
  if (!this.registry.allowsProduct(entity, request.product)) {
    return { ok: false, reason: "product_not_eligible" };
  }
  const eligible = entity.mids
    .filter((mid) => mid.status !== "disabled")
    .filter((mid) => !this.killSwitch.disabledMids?.has(mid.id))
    .filter((mid) => !this.killSwitch.disabledProviders?.has(mid.provider))
    .sort((a, b) => STATUS_ORDER[a.status] - STATUS_ORDER[b.status]);
  if (eligible.length === 0) return { ok: false, reason: "no_active_mid" };
  // Every candidate belongs to `entity` — failover can never leave it.
  return { ok: true, candidates: eligible.map((mid) => ({ entityId: entity.id, mid })) };
}
Two consequences are structural, not conventional:
- Failover can never leave the entity. Every candidate belongs to the requested entity. There is no code path that widens to another entity’s MID. (See Entity model — the laundering guardrail.)
- An ineligible request yields zero candidates, so the orchestrator rejects the charge with zero attempts. It is never routed cross-category.
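The two consequences above can be illustrated with a minimal, self-contained sketch. It is not the real engine: `SketchEntity`, `selectMidsSketch`, the in-memory `Map` registry, and the entity and product names are all hypothetical stand-ins for the registry types this section does not show.

```typescript
// Hypothetical, simplified shapes (assumptions, not the real registry types).
type Mid = { id: string; status: "active" | "warm_standby" | "disabled" };
type SketchEntity = { id: string; canCollect: boolean; products: ReadonlySet<string>; mids: Mid[] };
type DecisionSketch =
  | { ok: true; candidates: { entityId: string; mid: Mid }[] }
  | { ok: false; reason: string };

function selectMidsSketch(
  registry: ReadonlyMap<string, SketchEntity>,
  entityId: string,
  product: string,
): DecisionSketch {
  const entity = registry.get(entityId);
  if (!entity) return { ok: false, reason: "entity_not_found" };
  if (!entity.canCollect) return { ok: false, reason: "entity_cannot_collect" };
  // The laundering guard: no underwriting for the product means no candidates at all.
  if (!entity.products.has(product)) return { ok: false, reason: "product_not_eligible" };
  const eligible = entity.mids.filter((mid) => mid.status !== "disabled");
  if (eligible.length === 0) return { ok: false, reason: "no_active_mid" };
  // Candidates are built only from this entity's own MIDs.
  return { ok: true, candidates: eligible.map((mid) => ({ entityId: entity.id, mid })) };
}

const registry = new Map<string, SketchEntity>([
  ["entity_a", {
    id: "entity_a",
    canCollect: true,
    products: new Set(["product_x"]),
    mids: [{ id: "mid_a1", status: "active" }, { id: "mid_a2", status: "warm_standby" }],
  }],
  ["entity_b", {
    id: "entity_b",
    canCollect: true,
    products: new Set(["product_y"]),
    mids: [{ id: "mid_b1", status: "active" }],
  }],
]);

// entity_a is underwritten for product_x: both of its MIDs come back, tagged with entity_a.
const eligibleDecision = selectMidsSketch(registry, "entity_a", "product_x");
// entity_a is not underwritten for product_y: rejected, even though entity_b could take it.
const ineligibleDecision = selectMidsSketch(registry, "entity_a", "product_y");
```

In this sketch the only way to reach `entity_b`'s MID is to request `entity_b` explicitly, which mirrors the structural claim: the function has no branch that looks at another entity.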
The kill-switch
Deplatforming is treated as routine. A KillSwitch excludes disabled MIDs / providers from candidates without a redeploy — flip config to route around an acquirer that froze you.
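As a rough illustration of "flip config", the sketch below builds a kill-switch from a plain config object and applies it to a candidate list. The `KillSwitchConfig` shape, `killSwitchFromConfig`, `applyKillSwitch`, the numeric `STATUS_ORDER` values, and the MID and provider names are assumptions made for the example; only the `KillSwitch` shape and the filter order follow this section.

```typescript
type MidStatus = "active" | "warm_standby" | "disabled";
type Mid = { id: string; provider: string; status: MidStatus };
type KillSwitch = {
  disabledMids?: ReadonlySet<string>;
  disabledProviders?: ReadonlySet<string>;
};

// Hypothetical config shape, e.g. loaded from a JSON document or a flag store.
type KillSwitchConfig = { disabledMids?: string[]; disabledProviders?: string[] };

function killSwitchFromConfig(config: KillSwitchConfig): KillSwitch {
  return {
    disabledMids: new Set(config.disabledMids ?? []),
    disabledProviders: new Set(config.disabledProviders ?? []),
  };
}

// Assumed values: the section only states that active sorts before warm_standby.
const STATUS_ORDER: Record<MidStatus, number> = { active: 0, warm_standby: 1, disabled: 2 };

function applyKillSwitch(mids: readonly Mid[], killSwitch: KillSwitch): Mid[] {
  return mids
    .filter((mid) => mid.status !== "disabled")
    .filter((mid) => !killSwitch.disabledMids?.has(mid.id))
    .filter((mid) => !killSwitch.disabledProviders?.has(mid.provider))
    .sort((a, b) => STATUS_ORDER[a.status] - STATUS_ORDER[b.status]);
}

const entityMids: Mid[] = [
  { id: "mid_standby", provider: "psp_b", status: "warm_standby" },
  { id: "mid_primary", provider: "psp_a", status: "active" },
];

// Normal operation: primary first, standby second.
const beforeFreeze = applyKillSwitch(entityMids, killSwitchFromConfig({})).map((m) => m.id);
// psp_a froze the account: only the config changes, and the standby becomes the sole candidate.
const afterFreeze = applyKillSwitch(
  entityMids,
  killSwitchFromConfig({ disabledProviders: ["psp_a"] }),
).map((m) => m.id);
```

Because the switch is plain data, changing it needs no code change; how that data reaches the running service is outside what this section describes.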
export type KillSwitch = {
  disabledMids?: ReadonlySet<string>;
  disabledProviders?: ReadonlySet<string>;
};
Step 2 — the cross-PSP cascade
chargeAcrossProviders charges one portable vault token across the ordered candidates, failing over only on a positive not-captured confirmation. It is the smallest possible primitive and carries the NO-DOUBLE-CHARGE invariant.
From services/payments/src/orchestration/charge-cascade.ts:
for (const step of params.steps) {
  const idempotencyKey = buildAttemptKey(params.logicalAttemptId, step.provider.name, step.midId);
  const result = await step.provider.chargeToken({ token, amountCents, currency, midId, idempotencyKey, metadata });
  // A transport-level failure (no disposition) is itself indeterminate — we
  // cannot prove the money did not move.
  const attempt = result.ok ? result.data : { /* …, disposition: "indeterminate" */ };
  attempts.push(attempt);
  if (attempt.disposition === "captured") return { status: "captured", capturedBy: attempt, attempts };
  if (attempt.disposition === "indeterminate") {
    // HALT — do not fail over. Money may have moved; reconcile instead.
    return { status: "indeterminate", haltedAt: attempt, attempts };
  }
  // declined → safe to try the next eligible candidate
}
return { status: "all_declined", attempts };
The NO-DOUBLE-CHARGE invariant
This is the rule the whole cascade exists to enforce. It has three parts.
1. Fail over ONLY on a definitive decline
A captured stops the cascade with success. A declined (definitively not captured) is safe to fail over. An indeterminate — a timeout, a transport error, anything we cannot prove did not move money — HALTS the cascade. Trying the next provider after a timeout would risk charging the same token twice, because the timed-out gateway may have captured. The halt is surfaced for reconciliation, never silently retried elsewhere.
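The three dispositions can be exercised end to end with a small runnable sketch. This is a pared-down stand-in, not the shipped `chargeAcrossProviders`: `SketchProvider`, `cascadeSketch`, `stubProvider`, and `demoCascade` are hypothetical names, the provider takes far fewer fields than the real one, and a thrown error is used here to model a transport failure.

```typescript
type Disposition = "captured" | "declined" | "indeterminate";
type Attempt = { provider: string; midId: string; idempotencyKey: string; disposition: Disposition };

// Hypothetical, pared-down provider shape (assumption for this sketch).
type SketchProvider = {
  name: string;
  chargeToken(args: { midId: string; idempotencyKey: string }): Promise<Disposition>;
};
type Step = { provider: SketchProvider; midId: string };

type CascadeResult =
  | { status: "captured"; capturedBy: Attempt; attempts: Attempt[] }
  | { status: "indeterminate"; haltedAt: Attempt; attempts: Attempt[] }
  | { status: "all_declined"; attempts: Attempt[] };

async function cascadeSketch(logicalAttemptId: string, steps: Step[]): Promise<CascadeResult> {
  const attempts: Attempt[] = [];
  for (const step of steps) {
    const idempotencyKey = `${logicalAttemptId}:${step.provider.name}:${step.midId}`;
    let disposition: Disposition;
    try {
      disposition = await step.provider.chargeToken({ midId: step.midId, idempotencyKey });
    } catch {
      // A thrown transport error proves nothing about whether money moved.
      disposition = "indeterminate";
    }
    const attempt: Attempt = { provider: step.provider.name, midId: step.midId, idempotencyKey, disposition };
    attempts.push(attempt);
    if (disposition === "captured") return { status: "captured", capturedBy: attempt, attempts };
    // HALT: never try the next provider when the outcome is unknown.
    if (disposition === "indeterminate") return { status: "indeterminate", haltedAt: attempt, attempts };
    // declined: safe to continue to the next candidate
  }
  return { status: "all_declined", attempts };
}

// Stub that always answers with a fixed disposition and records which MIDs were charged.
function stubProvider(name: string, disposition: Disposition, seen: string[]): SketchProvider {
  return {
    name,
    async chargeToken({ midId }) {
      seen.push(midId);
      return disposition;
    },
  };
}

async function demoCascade() {
  const seenOnDecline: string[] = [];
  const failedOver = await cascadeSketch("order_1", [
    { provider: stubProvider("psp_a", "declined", seenOnDecline), midId: "mid_primary" },
    { provider: stubProvider("psp_b", "captured", seenOnDecline), midId: "mid_standby" },
  ]);

  const seenOnTimeout: string[] = [];
  const halted = await cascadeSketch("order_2", [
    { provider: stubProvider("psp_a", "indeterminate", seenOnTimeout), midId: "mid_primary" },
    { provider: stubProvider("psp_b", "captured", seenOnTimeout), midId: "mid_standby" },
  ]);
  return { failedOver, halted, seenOnDecline, seenOnTimeout };
}
```

In the second run the standby would have captured, yet it is never called: the halt trades a possibly lost sale for the guarantee that the token is not charged twice.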
A timeout is not a decline. The single most dangerous mistake a provider
adapter can make is mapping an ambiguous/timeout response to declined. It
must be indeterminate. Even a transport-level failure with no disposition
is treated as indeterminate by the cascade.
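One way a provider adapter might honor this rule is to fail closed: only an explicit gateway answer can yield `captured` or `declined`, and everything else maps to `indeterminate`. The `GatewayOutcome` shape and `toDisposition` below are hypothetical; real PSP responses differ per provider and are not described in this section.

```typescript
type Disposition = "captured" | "declined" | "indeterminate";

// Hypothetical normalized view of what came back from a gateway call (assumption).
type GatewayOutcome =
  | { kind: "response"; approved: boolean }
  | { kind: "timeout" }
  | { kind: "transport_error" }
  | { kind: "unrecognized"; raw: string };

function toDisposition(outcome: GatewayOutcome): Disposition {
  // Only a definitive answer from the gateway may become captured or declined.
  if (outcome.kind === "response") return outcome.approved ? "captured" : "declined";
  // Timeouts, transport errors, and anything unparseable: we cannot prove money did not move.
  return "indeterminate";
}

const approved = toDisposition({ kind: "response", approved: true });
const refused = toDisposition({ kind: "response", approved: false });
const timedOut = toDisposition({ kind: "timeout" });
const garbled = toDisposition({ kind: "unrecognized", raw: "<html>502</html>" });
```

Putting the fallback on the final line means a new, unanticipated outcome kind defaults to `indeterminate` rather than to a decline that would permit failover.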
2. Deterministic per-attempt idempotency keys
Each attempt’s key is derived from the caller-supplied logical attempt id + provider + MID. Re-running the same cascade reuses the same per-attempt keys, so each gateway dedups its own attempt — a retried cascade cannot double-charge.
/** Deterministic per-attempt idempotency key. Stable across cascade retries. */
export function buildAttemptKey(logicalAttemptId: string, providerName: string, midId: string): string {
  return `${logicalAttemptId}:${providerName}:${midId}`;
}
This also fixes the shipped bug LOO-2203, where the legacy accounting-client.ts built its idempotency key with randomUUID(), so a retry bypassed accounting’s withIdempotencyKey guard and double-booked the ledger. The logical attempt id is supplied by the caller (e.g. derived from the order id), so the accounting key and the per-PSP keys are stable across retries: no double charge, no double ledger post.
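The difference between a stable key and a fresh key per call can be seen against a toy gateway. `DedupingGateway` is a hypothetical stand-in for any PSP or ledger that dedups on the idempotency key it receives, and the counter-based `freshKey` merely stands in for `randomUUID()`; neither is part of the codebase described here.

```typescript
function buildAttemptKey(logicalAttemptId: string, providerName: string, midId: string): string {
  return `${logicalAttemptId}:${providerName}:${midId}`;
}

// Hypothetical gateway that dedups on the idempotency key (assumption for this sketch).
class DedupingGateway {
  private readonly chargesByKey = new Map<string, string>();
  captures = 0;

  charge(idempotencyKey: string): string {
    const prior = this.chargesByKey.get(idempotencyKey);
    if (prior !== undefined) return prior; // replay: same charge id, no new capture
    this.captures += 1;
    const chargeId = `charge_${this.captures}`;
    this.chargesByKey.set(idempotencyKey, chargeId);
    return chargeId;
  }
}

// Stable key: the retried cascade presents the same key, so the gateway dedups.
const stableGateway = new DedupingGateway();
const firstCharge = stableGateway.charge(buildAttemptKey("order_42", "psp_a", "mid_primary"));
const retriedCharge = stableGateway.charge(buildAttemptKey("order_42", "psp_a", "mid_primary"));

// Fresh key per call: the retry looks like a brand-new charge and captures again.
let keyCounter = 0;
const freshKey = (): string => `random_${++keyCounter}`;
const unstableGateway = new DedupingGateway();
unstableGateway.charge(freshKey());
unstableGateway.charge(freshKey());
```

With the stable key the retry returns the original charge and `captures` stays at 1; with a fresh key per call the same retry produces a second capture, which is the shape of the double-booking described above.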
3. Eligibility is enforced upstream
The cascade never widens the candidate set. It charges exactly the eligibility-filtered, kill-switch-applied steps the routing engine produced. Eligibility is the routing engine’s job; safety-on-failover is the cascade’s job; they do not overlap.
Step 3 — the orchestrator composes them
PaymentOrchestrator.charge runs selectMids, resolves each candidate’s provider by name, and hands the steps to the cascade. An ineligible decision returns rejected with zero attempts; a decision with no resolvable provider returns rejected too.
async charge(params: OrchestratedChargeParams): Promise<OrchestratedChargeResult> {
  const decision = this.routing.selectMids(params.request);
  if (!decision.ok) return { status: "rejected", reason: decision.reason }; // never widen
  const steps = decision.candidates
    .map((c) => ({ provider: this.resolveProvider(c.mid.provider), midId: c.mid.id }))
    .filter((s) => s.provider);
  if (steps.length === 0) return { status: "rejected", reason: "no_resolvable_provider" };
  // … chargeAcrossProviders(steps, token, amount, logicalAttemptId) → captured / all_declined / indeterminate
}
The proof tests
These invariants are not aspirations — they are proven by tests in the foundation PR (LOO-2208 / LOO-2226), running fully offline against StubVault + StubProvider (no Basis Theory key, no PSP key, no network).
| Test | What it proves |
|---|---|
| fails over A→B on decline and captures EXACTLY once | Failover on decline; exactly one capture across the cascade; the same portable token presented to both PSPs |
| HALTS on a timeout and never fails over — no double charge | A timed-out primary whose capture side-effect fired → standby is never charged |
| returns all_declined when every eligible PSP declines | No capture, clean terminal state |
| uses deterministic per-attempt idempotency keys | Re-running the same cascade yields identical per-attempt keys |
| REJECTS a product the entity is not underwritten for | The laundering guard: zero attempts, product_not_eligible |
| every candidate belongs to the requested entity | Failover never leaves the entity, across all seeded entities |
| routes around a killed primary MID straight to the standby | The kill-switch routes around a frozen acquirer with no redeploy |
A representative case — the timeout HALT — from tests/unit/orchestration/routing.test.ts:
it("HALTS on a primary timeout — standby is NEVER charged (no double charge)", async () => {
  // primary times out (and its capture side-effect fires); standby would succeed
  const result = await orchestrator.charge(chargeArgs("loop_bio_labs", "order_2"));
  expect(result.status).toBe("indeterminate");
  // The standby MID was NEVER attempted — the token cannot be double-charged.
  expect(spy.mock.calls.map((c) => c[0].midId)).toEqual(["mid_loopbio_nmi_primary"]);
  expect(captureSideEffects).toBe(1);
});
What is real vs latent here. The primitives and proofs are built and
passing. They are not wired to the live checkout path — the orchestrator
and entity registry aren’t instantiated on a real charge yet, and the PSP
charge legs are stubbed pending sandbox credentials and an NmiProvider
(LOO-2192). The BT vault leg is real-tested against a Basis Theory TEST
tenant via a gated spike; the cross-PSP charge is proven only against stubs so
far. See Status & roadmap.
See also
- The two seams — the TokenChargeProvider disposition the cascade keys on
- Entity model — where the eligible candidates come from
- Operational guardrails — failure-injection test discipline
- Idempotency and retries — the platform-wide discipline
Source
- services/payments/src/orchestration/routing-policy.ts · charge-cascade.ts · payment-orchestrator.ts
- services/payments/tests/unit/orchestration/routing.test.ts · charge-cascade.spike.test.ts
- ADR-0093 (LOO-2190 routing, LOO-2208 safety, LOO-2203 idempotency)