Idempotency and retries
What this is: how we make every retry safe. Callers can re-send the same request and get the same outcome; consumers can re-process the same event without double-counting.
Who it’s for: anyone writing a POST/PUT/DELETE endpoint, anyone writing an event handler, anyone debugging “why did this customer get charged twice?”
What to read next: Events, Rate limits and circuit breakers, services/accounting.
The rule
If retrying any request produces a different outcome than the first attempt, that’s a bug.
This applies to:
- HTTP POST/PUT/DELETE endpoints (caller might retry on timeout)
- Event handlers (NATS + EventBridge may deliver twice)
- Outbound vendor calls (Stripe, BigCommerce, Postmark — each can drop a connection)
- Cron jobs (executor Lambda may be retried on Lambda-level failures)
The pattern: idempotency table
Each service has an idempotency table:
CREATE TABLE <service>.idempotency_keys (
  key TEXT NOT NULL,
  scope TEXT NOT NULL, -- 'route:POST_/v1/payments' or 'event:order.placed.v1'
  request_hash TEXT NOT NULL, -- SHA256 of canonical payload
  response JSONB NOT NULL,
  expires_at TIMESTAMPTZ NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (scope, key) -- lookups are by (scope, key), so uniqueness is per scope
);
On a state-changing request:
- Caller provides an Idempotency-Key header (any 1–128 char string).
- Service hashes the request body.
- Lookup (scope, key):
  - Hit, same hash → return stored response (idempotent replay).
  - Hit, different hash → 409 idempotency_key_payload_mismatch (caller misused the key).
  - Miss → execute, write the row, return.
The lookup + write is in the same transaction as the side effect. Keys expire after 24 hours by default; longer for high-stakes operations (payments: 7 days).
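The three lookup outcomes can be sketched with an in-memory table. This is a minimal illustration, not the service code: IdempotencyTable, canonicalize, and hashPayload are hypothetical names, the Map stands in for the Postgres table, and a single synchronous call stands in for the transaction. It also assumes payloads are plain JSON data and that an expired row counts as a miss.

```typescript
import { createHash } from "node:crypto";

// Hypothetical names throughout; the Map stands in for <service>.idempotency_keys.
type Row = { requestHash: string; response: unknown; expiresAt: number };

type Outcome =
  | { kind: "executed"; response: unknown }
  | { kind: "replayed"; response: unknown }
  | { kind: "mismatch"; status: 409; code: "idempotency_key_payload_mismatch" };

// Serialize with sorted object keys so equal payloads produce equal hashes.
// Assumes the payload is plain JSON data.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value);
}

function hashPayload(payload: unknown): string {
  return createHash("sha256").update(canonicalize(payload)).digest("hex");
}

class IdempotencyTable {
  private rows = new Map<string, Row>();

  // In a real service the lookup, side effect, and write share one transaction;
  // here one synchronous call plays that role.
  run(
    scope: string,
    key: string,
    payload: unknown,
    ttlMs: number,
    now: number,
    execute: () => unknown,
  ): Outcome {
    const id = JSON.stringify([scope, key]);
    const requestHash = hashPayload(payload);
    const hit = this.rows.get(id);
    if (hit !== undefined && hit.expiresAt > now) {
      return hit.requestHash === requestHash
        ? { kind: "replayed", response: hit.response }
        : { kind: "mismatch", status: 409, code: "idempotency_key_payload_mismatch" };
    }
    // Miss (or expired row): execute, write the row, return.
    const response = execute();
    this.rows.set(id, { requestHash, response, expiresAt: now + ttlMs });
    return { kind: "executed", response };
  }
}
```

Because the hash is taken over a canonical form, a retry that sends the same fields in a different order still counts as the same request.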
HTTP routes
Every POST/PUT/DELETE endpoint declares it requires an Idempotency-Key:
import { idempotent } from "@platform/hono";
app.openapi(createPaymentRoute, idempotent({ scope: "POST_/v1/payments" }, async (c) => {
  const body = c.req.valid("json");
  // ... business logic
  return c.json({ payment_id: paymentId }, 201);
}));
Without the header, the route returns 400 missing_idempotency_key for mutations. GETs don’t require it (idempotent by definition).
A cleanup job runs nightly to purge expired idempotency rows.
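The header rule can be written as a small pure function, which is also convenient to assert against in tests. This is a hedged sketch, not the @platform/hono implementation: checkIdempotencyKey and HeaderCheck are hypothetical names, header names are assumed to be lowercased already, and the invalid_idempotency_key code for over-long keys is an assumption (the text only states the 1–128 character limit).

```typescript
// Hypothetical sketch; not the actual idempotent() middleware.
type HeaderCheck =
  | { ok: true; key: string | null }
  | { ok: false; status: 400; code: string };

const MUTATING_METHODS = new Set(["POST", "PUT", "DELETE"]);

function checkIdempotencyKey(
  method: string,
  headers: Record<string, string | undefined>,
): HeaderCheck {
  if (!MUTATING_METHODS.has(method.toUpperCase())) {
    return { ok: true, key: null }; // GETs do not need a key
  }
  const headerValue = headers["idempotency-key"];
  if (headerValue === undefined || headerValue.length === 0) {
    return { ok: false, status: 400, code: "missing_idempotency_key" };
  }
  if (headerValue.length > 128) {
    // Assumed error code; the source does not name one for this case.
    return { ok: false, status: 400, code: "invalid_idempotency_key" };
  }
  return { ok: true, key: headerValue };
}
```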
Event handlers
Event handlers dedupe by event ID + handler-specific key:
export const handler = createEventHandler({
  event: EVENT_NAMES.ORDER_PLACED_V1,
  async handle(payload, ctx) {
    // Handler-specific key: one commission per order, however often the event arrives.
    const key = `commission:${payload.order_id}`;
    const existing = await ctx.idempotency.find(key, "event:order.placed.v1");
    if (existing) return; // already processed: redelivery is a no-op
    await ctx.db.transaction(async (tx) => {
      // The side effect and the idempotency row commit or roll back together.
      const commission = await ctx.affiliates.postCommission(payload, tx);
      await tx.insert(idempotency).values({
        key,
        scope: "event:order.placed.v1",
        request_hash: hash(payload),
        response: { commission_id: commission.id },
        expires_at: addDays(30),
      });
    });
  },
});
The handler-specific key matters. If two different handlers consume the same event for different side effects, each has its own key (commission:<order_id> vs analytics:<order_id>).
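To see why the key prefix matters, here is a small in-memory sketch of two handlers consuming the same event under the same scope. The Set stands in for the idempotency table, and runOnce, commissionHandler, and analyticsHandler are hypothetical names used only for illustration.

```typescript
type OrderPlaced = { order_id: string };

const EVENT_SCOPE = "event:order.placed.v1";
const processedKeys = new Set<string>(); // stands in for the idempotency table
const effects: string[] = []; // records side effects so they can be inspected

function runOnce(key: string, effect: () => void): void {
  const id = `${EVENT_SCOPE}|${key}`;
  if (processedKeys.has(id)) return; // duplicate delivery: skip
  effect();
  processedKeys.add(id);
}

function commissionHandler(e: OrderPlaced): void {
  runOnce(`commission:${e.order_id}`, () => {
    effects.push(`commission for ${e.order_id}`);
  });
}

function analyticsHandler(e: OrderPlaced): void {
  runOnce(`analytics:${e.order_id}`, () => {
    effects.push(`analytics for ${e.order_id}`);
  });
}

// The same event delivered twice to both handlers.
const orderEvent: OrderPlaced = { order_id: "o_42" };
for (let delivery = 0; delivery < 2; delivery++) {
  commissionHandler(orderEvent);
  analyticsHandler(orderEvent);
}
```

Each side effect runs once. If both handlers used the bare order ID as the key, whichever ran second would see a hit and silently skip its own work.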
Outbound vendor calls
Vendor calls go through @platform/core’s circuit breaker + retry helper, which respects the vendor’s idempotency protocol:
- Stripe: pass the Idempotency-Key header; Stripe handles dedup.
- Postmark: message ID is the dedup key; we generate it client-side.
- BigCommerce: POST without a native idempotency primitive; we wrap calls with a local idempotency key that prevents double-submit on retry.
Retry policy (also in @platform/core):
- Increasing backoff: 500ms, 2s, 5s, 30s, then dead-letter.
- Network errors retry; 4xx (except 429) does not retry.
- 429 honors Retry-After.
- 5xx retries up to the cap.
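The policy above reduces to one decision per failed attempt. The sketch below is an illustration under stated assumptions, not the @platform/core API: decide, Failure, and Decision are hypothetical names, Retry-After is assumed to arrive already parsed as seconds, a 429 is assumed to count against the same retry cap, and Retry-After is assumed to override the scheduled delay.

```typescript
// Hypothetical sketch of the retry policy; not the @platform/core helper.
const BACKOFF_MS = [500, 2000, 5000, 30000];

type Failure =
  | { kind: "network" }
  | { kind: "http"; status: number; retryAfterSeconds?: number };

type Decision =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter" }
  | { action: "fail" };

// `retriesSoFar` counts retries already made (0 before the first retry).
function decide(failure: Failure, retriesSoFar: number): Decision {
  const retryable =
    failure.kind === "network" || failure.status === 429 || failure.status >= 500;
  if (!retryable) return { action: "fail" }; // 4xx other than 429: do not retry

  const scheduled = BACKOFF_MS[retriesSoFar];
  if (scheduled === undefined) return { action: "dead-letter" }; // schedule exhausted

  if (
    failure.kind === "http" &&
    failure.status === 429 &&
    failure.retryAfterSeconds !== undefined
  ) {
    // Assumption: the vendor's Retry-After wins over our own schedule.
    return { action: "retry", delayMs: failure.retryAfterSeconds * 1000 };
  }
  return { action: "retry", delayMs: scheduled };
}
```

Keeping the decision pure (no sleeping, no I/O) makes the schedule easy to unit test separately from the code that performs the call.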
Cron jobs
Each cron job declares its idempotency scope in services/jobs/src/jobs/registry.ts:
{
  id: "release-commission-locks",
  schedule: "cron(0 * * * ? *)",
  targetUrlEnv: "MEMBERSHIP_URL",
  path: "/v1/admin/commission-locks/release-expired",
  scopes: ["membership:admin"],
}
The executor Lambda includes a run-id in its admin call, and the target service uses run-id as the idempotency key. If the Lambda retries (Lambda-level failure), the second attempt with the same run-id is a no-op.
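On the receiving side, the run-id behaves like any other idempotency key. The following in-memory model is a sketch only: releaseExpired, finishedRuns, and the lock names are hypothetical, and how the run-id is generated and transmitted is not specified here.

```typescript
// Hypothetical in-memory model of the target endpoint; names are illustrative.
type RunResult = { released: number };

const finishedRuns = new Map<string, RunResult>(); // run-id -> stored result
let expiredLocks: string[] = ["lock-a", "lock-b"];

function releaseExpired(runId: string): RunResult & { replay: boolean } {
  const prior = finishedRuns.get(runId);
  if (prior !== undefined) {
    return { ...prior, replay: true }; // retried run: return stored result, do nothing
  }
  const result: RunResult = { released: expiredLocks.length };
  expiredLocks = []; // the side effect: release every expired lock
  finishedRuns.set(runId, result);
  return { ...result, replay: false };
}
```

A retried invocation with the same run-id gets the stored result back, while the next scheduled run arrives with a new run-id and does fresh work.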
What this guards against
- Caller retries on timeout — gets the same response, no double-write.
- NATS + EventBridge double-delivery — handler dedupes by event ID + scope.
- Outbound vendor flakiness — retry honors vendor idempotency.
- Lambda execution retries — cron jobs are safe.
- Connection drops mid-write — the transaction either commits cleanly or rolls back; the outbox + idempotency rows are in the same transaction so we can’t get “half done.”
What it doesn’t guard against
- A different caller submitting the same logical operation: two different requests with two different idempotency keys are NOT deduped. Business-logic-level uniqueness constraints (e.g. a unique constraint on external_order_id) cover that.
- Replays beyond the expiry window: after 24h (or 7d for payments), the key expires and a replay would execute. That’s why expiry windows are set per scope.
- Cross-service deduplication: if service A retries a call to service B, B dedupes locally; if A also publishes an event downstream of that call, downstream consumers dedupe via their own keys. No global dedup table.
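The per-scope expiry window can be sketched as a lookup with a default. The names below (TTL_BY_SCOPE, expiresAt, isProtected) are hypothetical; only the 24-hour default and the 7-day payments window come from this page, and the scope string is taken from the table comment above.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const DEFAULT_TTL_MS = 24 * HOUR_MS;

// Per-scope overrides. Only the payments value is stated in the text.
const TTL_BY_SCOPE: Record<string, number> = {
  "route:POST_/v1/payments": 7 * 24 * HOUR_MS,
};

function expiresAt(scope: string, createdAt: Date): Date {
  const ttl = TTL_BY_SCOPE[scope] ?? DEFAULT_TTL_MS;
  return new Date(createdAt.getTime() + ttl);
}

// True while a replay with the same key would still be deduplicated.
function isProtected(scope: string, createdAt: Date, now: Date): boolean {
  return now.getTime() < expiresAt(scope, createdAt).getTime();
}
```

Once isProtected returns false, a replay with the same key is treated as a new request, so the window should be at least as long as the longest retry horizon of any caller in that scope.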
Common mistakes
- Forgetting the Idempotency-Key on mutations. The idempotent() middleware fails the request, but make sure tests assert this.
- Reusing a key for a different operation. The hash check catches this with a 409, but it’s a sign the caller’s key-generation logic is wrong.
- Running the side effect outside the transaction. Idempotency lookup + side effect + idempotency write MUST be in one transaction.
- Not setting Retry-After on 429. Callers need it to back off correctly.
Related
- Events — at-least-once delivery makes idempotency mandatory for handlers
- Rate limits and circuit breakers
- System overview
- services/accounting — heaviest idempotency surface
Source ADRs
ADR-0030 (idempotency keys), ADR-0040 (outbox + at-least-once events), ADR-0044 (vendor retry policy).