
Durable workflows

What this is: how the platform orchestrates multi-step business processes that span minutes, hours, or days — onboarding sequences, Certificate of Analysis (COA) validation, fulfillment retries, win-back ladders, General Data Protection Regulation (GDPR) erasure fan-out. The workflow engine owns persistence, timers, retries, replay, and the operator UI; service code stays stateless.

Who it’s for: anyone writing a multi-step flow that needs to survive a deploy, retry intelligently, sleep for hours, or fan out across services. For a single API call, use a regular service route and an idempotency key instead.

What to read next: Events, Idempotency and retries, Reliability and deployment.

Source ADR: 0056 — Durable workflow engine.

The choice: Inngest, not Temporal, not in-process

LOO-1925 evaluated three options for durable orchestration:

  1. Port @loop/workflow-engine from the legacy repo. A rules interpreter, not a durable engine — no persistence, no timers, no retries, no operator visibility. Porting requires rebuilding every hard part.
  2. Temporal Cloud. Highest technical ceiling. Highest adoption cost: SDK shape, worker model, separate operational discipline.
  3. Inngest (managed). HTTP-first step functions. The free tier covers early volume, and retry, replay, and an operator UI are built in.

The platform adopted Inngest (Architecture Decision Record 0056). It is the smallest step up from “no engine at all” that still provides durable guarantees, and it composes cleanly with the HTTP-only service model — workflows are HTTP handlers that the Inngest runner invokes.

How a workflow composes

┌──────────────────────────────────────────────────────────┐
│  services/<owner>  — the service that owns the flow      │
│                                                          │
│  src/workflows/onboarding.workflow.ts                    │
│    inngest.createFunction(                               │
│      { id: "onboarding" },                               │
│      { event: "identity.user.created.v1" },              │
│      async ({ event, step }) => {                        │
│        await step.run("provision-brand", () => ...)      │
│        await step.sleep("wait-day-2", "2d")              │
│        await step.run("send-day-2-email", () => ...)     │
│      },                                                  │
│    )                                                     │
└──────────────────────┬───────────────────────────────────┘
                       │ HTTP step calls
┌──────────────────────▼───────────────────────────────────┐
│  Inngest (managed) — owns state, timers, retries         │
└──────────────────────────────────────────────────────────┘

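Each step.run call is replay-safe: Inngest persists the result of a completed step and, when the function is invoked again after a retry or a timer, returns the stored result instead of running the step a second time. Each step.sleep call survives deploys because the timer lives in Inngest rather than in the service process. When the timer fires, Inngest calls the same HTTP handler back and execution resumes at the correct step.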
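The replay behaviour can be illustrated with a toy model. This is a minimal sketch, not the Inngest SDK: StepStore, makeStep, and the onboarding function below are hypothetical names, the store is an in-memory map standing in for the engine's persistence, and the steps are synchronous for brevity where real steps are async.

```typescript
// Toy model of step memoization. NOT the Inngest SDK: StepStore and makeStep
// are hypothetical, and the Map stands in for the engine's durable state.
type StepStore = Map<string, unknown>;

function makeStep(store: StepStore) {
  return {
    // Run `fn` at most once per step id. A step that throws is not stored,
    // so the next invocation runs it again.
    run<T>(id: string, fn: () => T): T {
      if (store.has(id)) return store.get(id) as T;
      const result = fn();
      store.set(id, result);
      return result;
    },
  };
}

// Counts how often each side effect actually executes.
const calls = { provision: 0, email: 0 };
let emailShouldFail = true;

function onboarding(store: StepStore): string {
  const step = makeStep(store);
  const brandId = step.run("provision-brand", () => {
    calls.provision += 1;
    return "brand-123";
  });
  return step.run("send-day-2-email", () => {
    calls.email += 1;
    if (emailShouldFail) throw new Error("mail provider unavailable");
    return `sent:${brandId}`;
  });
}

// First invocation: provisioning succeeds, the email step fails.
const store: StepStore = new Map();
let firstAttemptFailed = false;
try {
  onboarding(store);
} catch {
  firstAttemptFailed = true;
}

// Second invocation (the retry): the provisioning result comes from the
// store, and only the email step runs again.
emailShouldFail = false;
const outcome = onboarding(store);
```

After both invocations, the provisioning side effect has run once and the email step twice, which is the property that makes a side effect inside a step safe to retry.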

What belongs in a workflow vs. a route

Use a workflow                           │ Use a regular service route
─────────────────────────────────────────┼────────────────────────────────────
Sleeps, waits, or human approval gates   │ Synchronous request → response
Multi-step fan-out across services       │ Single bounded-context operation
Needs to survive a deploy / restart      │ Completes in one request
Hour-plus retry windows                  │ Sub-second / sub-minute retry
Operator visibility on per-step failures │ Per-request logs + audit are enough

If the code reaches for setTimeout, hand-rolls a state column, or stashes “next step” rows in a database for a cron to pick up later, the right primitive is a workflow.

Event handlers versus workflows

Both subscribe to EventBridge or NATS. They differ in the unit of work:

  • Event handler — one event in, side effects out, idempotent. Lives in the service. On failure, the bus retries the entire handler.
  • Workflow — one event triggers a sequence of steps with shared state. Lives in Inngest. On failure, Inngest retries only the failed step.

A handler that grows multiple steps with persisted intermediate state is a workflow in disguise; extract it.
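Because the bus retries the entire handler, an event handler has to tolerate seeing the same event more than once. The sketch below shows one common way to get that property, deduplicating on the event id. The BusEvent shape, the handleUserCreated function, and the in-memory Set are assumptions made for illustration; a real service would keep processed ids in a durable store.

```typescript
// Hypothetical event shape; the platform's real envelope may differ.
interface BusEvent {
  id: string;
  name: string;
  data: { userId: string };
}

// In-memory stand-ins for a durable dedupe table and an email side effect.
const processed = new Set<string>();
const welcomeEmailsSent: string[] = [];

function handleUserCreated(event: BusEvent): "processed" | "duplicate" {
  // A redelivered event is acknowledged without repeating the side effect.
  if (processed.has(event.id)) return "duplicate";

  welcomeEmailsSent.push(event.data.userId);

  // Mark the event as processed only after the side effect succeeded, so a
  // failed attempt is retried as a whole by the bus.
  processed.add(event.id);
  return "processed";
}

const created: BusEvent = {
  id: "evt-1",
  name: "identity.user.created.v1",
  data: { userId: "user-42" },
};

const firstDelivery = handleUserCreated(created);
const redelivery = handleUserCreated(created);
```

Once a handler needs more bookkeeping than this single "seen it" check, for example remembering which of several side effects already happened, it is tracking per-step state and is a candidate for extraction into a workflow.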

Common mistakes

  • Hand-rolled retry loop inside a workflow body. Let Inngest retry the failed step, and tune the attempt count through the function's retries setting instead of looping in code.
  • Reading mutable database state inside a step.run and assuming determinism. Capture the value once and pass it forward; replays may otherwise hit a different row.
  • Placing protected health information (PHI) in event payloads to share state between steps. Use step inputs and outputs for opaque ids, then pull PHI inside the step that needs it and audit the read.
  • Single 30-step monolith. Smaller, composable functions chained via events are easier to reason about than one giant workflow.
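The PHI guidance above can be sketched as two plain functions standing in for two steps. Everything here is hypothetical: readContact, the audit log shape, the in-memory phiStore, and the step functions are illustrative stand-ins, not platform APIs. The point is that only opaque ids and non-PHI results appear in what the engine would persist between steps.

```typescript
// All names are hypothetical stand-ins for platform services.
interface AuditEntry {
  actor: string;
  recordId: string;
  purpose: string;
}
interface PatientContact {
  email: string;
}

const auditLog: AuditEntry[] = [];
const outbox: string[] = [];
const phiStore = new Map<string, PatientContact>([
  ["pat-1", { email: "pat@example.test" }],
]);

// Every PHI read goes through one function that records an audit entry.
function readContact(recordId: string, actor: string, purpose: string): PatientContact {
  const contact = phiStore.get(recordId);
  if (contact === undefined) throw new Error(`unknown record ${recordId}`);
  auditLog.push({ actor, recordId, purpose });
  return contact;
}

// Step 1: the output is what would be persisted, so it carries only the id.
function resolveRecipient(event: { data: { patientId: string } }): { patientId: string } {
  return { patientId: event.data.patientId };
}

// Step 2: pull PHI inside the step that needs it and return a non-PHI result.
function sendReminder(input: { patientId: string }): { delivered: boolean } {
  const contact = readContact(input.patientId, "reminder-workflow", "send-reminder");
  outbox.push(contact.email);
  return { delivered: true };
}

const stepOneOutput = resolveRecipient({ data: { patientId: "pat-1" } });
const stepTwoOutput = sendReminder(stepOneOutput);

// What would be stored as step state: ids and flags, no contact details.
const persisted = JSON.stringify([stepOneOutput, stepTwoOutput]);
```

The same shape also covers the "capture once and pass forward" advice: whatever a later step depends on is returned from an earlier step as a small value, not re-read from mutable state.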

See also