Rate limits and circuit breakers

What this is: how the platform protects itself from overload and protects its callers from cascading failure.

Who it’s for: anyone writing a route that calls a vendor, anyone seeing 429s in production, anyone wondering how to handle “vendor X is down.”

What to read next: Idempotency and retries, System overview, services/comms.

Two distinct mechanisms

These are often confused. They protect different things.

Mechanism	Protects	Lives at
Rate limits	The platform from too many caller requests	Per-route middleware on inbound traffic
Circuit breakers	Callers from a vendor that’s failing	Wrapper around outbound vendor calls

Rate limits

Sliding-window counters in Redis, per (client_id, scope-of-action).
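
To make the counting concrete, here is a minimal sketch of a sliding-window counter. It assumes the common two-bucket weighted approximation and uses an in-memory Map in place of Redis. The class name, method signature, and weighting scheme are illustrative assumptions, not the contents of rate-limit.ts.

```typescript
// Hypothetical sketch: sliding-window counter keyed by (client_id, scope).
// An in-memory Map stands in for Redis; the real middleware may differ.
type Bucket = { windowStart: number; current: number; previous: number };

class SlidingWindowLimiter {
  private buckets = new Map<string, Bucket>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is admitted, false if it should receive a 429.
  allow(clientId: string, scope: string, now: number = Date.now()): boolean {
    const key = `${clientId}:${scope}`;
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    let b = this.buckets.get(key);

    if (!b) {
      b = { windowStart, current: 0, previous: 0 };
      this.buckets.set(key, b);
    } else if (b.windowStart !== windowStart) {
      // Roll forward. The old count only matters if it belongs to the
      // immediately preceding fixed window.
      const adjacent = windowStart - b.windowStart === this.windowMs;
      b.previous = adjacent ? b.current : 0;
      b.current = 0;
      b.windowStart = windowStart;
    }

    // Weight the previous window by the fraction of it that still overlaps
    // the sliding window ending at `now`.
    const elapsed = (now - windowStart) / this.windowMs;
    const estimated = b.previous * (1 - elapsed) + b.current;
    if (estimated >= this.limit) return false;

    b.current += 1;
    return true;
  }
}
```

With a limit of 2 per 60 seconds, two requests in the first window are admitted and the third is rejected. A request at the start of the next window is still rejected because the previous window's count carries full weight. Halfway through that window, the weight has dropped to 0.5 and a request is admitted again.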

Default caps (defined in @platform/hono/src/middleware/rate-limit.ts):

Endpoint pattern	Limit
OAuth `/authorize`	30 / min per IP
OAuth `/token`	60 / min per client_id
OAuth `/introspect`	600 / min per client_id
OAuth `/revoke`	60 / min per client_id
Standard read endpoints	1000 / min per client_id
Standard write endpoints	200 / min per client_id
Admin endpoints	100 / min per client_id
Comms send routes	60 / min per user (transactional)
Comms send routes	1 / day per user (marketing)

Routes can override defaults by declaring their own rate-limit config. Service-specific limits live in each service’s infra.ts and are documented in the service detail page.

Response shape

HTTP/1.1 429 Too Many Requests
Retry-After: 12
Content-Type: application/json

{
  "error": "rate_limited",
  "error_description": "Request limit exceeded for client_id",
  "retry_after_ms": 12000
}
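
The relationship between the Retry-After header (whole seconds) and retry_after_ms in the body can be sketched as follows. The function name and the choice to round the header up are assumptions for illustration. Rounding up ensures the header never tells a caller to come back earlier than the millisecond value does.

```typescript
// Hypothetical helper: builds the 429 payload shown above from a wait time in ms.
type RateLimitedResponse = {
  status: 429;
  headers: Record<string, string>;
  body: { error: "rate_limited"; error_description: string; retry_after_ms: number };
};

function buildRateLimited(retryAfterMs: number, subject: string = "client_id"): RateLimitedResponse {
  // Clamp negatives to zero and keep the body value an integer.
  const ms = Math.max(0, Math.ceil(retryAfterMs));
  return {
    status: 429,
    headers: {
      // Retry-After carries whole seconds, rounded up (assumption).
      "Retry-After": String(Math.ceil(ms / 1000)),
      "Content-Type": "application/json",
    },
    body: {
      error: "rate_limited",
      error_description: `Request limit exceeded for ${subject}`,
      retry_after_ms: ms,
    },
  };
}
```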

Callers MUST honor Retry-After. Repeated 429s without backing off trigger a temporary client_id lockout (15 min, then 1h, then escalating).

How callers should handle 429

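// Resolves after the given number of milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function call(fn: () => Promise<Response>, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fn();
    if (res.status !== 429) return res;
    // Retry-After is in seconds; fall back to 1s if it is missing or not numeric.
    const parsed = Number(res.headers.get("Retry-After"));
    const retryAfter = Number.isFinite(parsed) && parsed > 0 ? parsed : 1;
    // Add up to 1s of jitter so callers do not retry in lockstep.
    await sleep((retryAfter + Math.random()) * 1000);
  }
  throw new Error("rate-limited after max attempts");
}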

Jitter (Math.random()) prevents thundering-herd when many callers retry at the same boundary.

Quotas vs rate limits

Rate limits cover short windows (per-minute for most routes, per-day for marketing sends). Quotas are per-billing-period and live in services/entitlements. A client can be inside rate limits but over quota (or vice versa).

A read of a biomarker counts against rate limits but not quota. A send of a marketing email counts against both: the rate limit (1/day/user) and the quota (e.g., 10,000/month/client).

Circuit breakers

Wrapped around every outbound vendor call. From @platform/core/src/circuit-breaker.ts:

import { createCircuitBreaker } from "@platform/core";
 
const stripeBreaker = createCircuitBreaker({
  name: "stripe.payments",
  threshold: 5,       // 5 failures
  windowMs: 60_000,   // in 60 seconds
  openMs: 30_000,     // → open circuit for 30 seconds
});
 
const result = await stripeBreaker.exec(() => stripe.paymentIntents.create(params));

States:

Closed — normal. All calls go through.
Open — circuit tripped. Calls fail fast with CircuitBreakerError. After openMs, transitions to half-open.
Half-open — one test call goes through. Success → closed. Failure → open again.
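
The three states above can be sketched as a small state machine. This is an illustrative sketch, not the @platform/core implementation. The class name SketchBreaker, the injectable clock, and the onSuccess/onFailure methods are assumptions. The option names mirror the createCircuitBreaker example.

```typescript
// Hypothetical sketch of the closed / open / half-open state machine.
type State = "closed" | "open" | "half-open";

// Stand-in for the error the real breaker throws.
class CircuitBreakerError extends Error {}

class SketchBreaker {
  private state: State = "closed";
  private failures: number[] = []; // timestamps of failures inside the window
  private openedAt = 0;
  private probeInFlight = false;
  private threshold: number;
  private windowMs: number;
  private openMs: number;
  private now: () => number;

  constructor(opts: { threshold: number; windowMs: number; openMs: number; now?: () => number }) {
    this.threshold = opts.threshold;
    this.windowMs = opts.windowMs;
    this.openMs = opts.openMs;
    this.now = opts.now ?? (() => Date.now());
  }

  getState(): State {
    // An open circuit moves to half-open once openMs has elapsed.
    if (this.state === "open" && this.now() - this.openedAt >= this.openMs) {
      this.state = "half-open";
      this.probeInFlight = false;
    }
    return this.state;
  }

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    const state = this.getState();
    if (state === "open") throw new CircuitBreakerError("circuit open");
    if (state === "half-open") {
      // Only one test call is allowed through while half-open.
      if (this.probeInFlight) throw new CircuitBreakerError("probe in flight");
      this.probeInFlight = true;
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess(): void {
    // A successful probe closes the circuit and clears the failure history.
    if (this.state === "half-open") {
      this.state = "closed";
      this.failures = [];
    }
  }

  onFailure(): void {
    const t = this.now();
    if (this.state === "open") return; // already tripped; ignore late failures
    if (this.state === "half-open") {
      this.trip(t); // a failed probe reopens the circuit
      return;
    }
    // Closed: count failures inside the rolling window.
    this.failures = this.failures.filter((f) => t - f < this.windowMs);
    this.failures.push(t);
    if (this.failures.length >= this.threshold) this.trip(t);
  }

  private trip(t: number): void {
    this.state = "open";
    this.openedAt = t;
    this.failures = [];
  }
}
```

Injecting the clock through the now option makes the transitions testable without real waits. Failures that fall outside windowMs are discarded before counting, so sparse failures never trip the circuit.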

This protects callers from:

Cascading failure when a vendor degrades (we don’t pile on retries that just compound the problem)
Long tail latencies from a vendor’s slow responses (we fail fast)
Resource exhaustion (thread pools, connection pools) waiting on a sick vendor

What happens when the circuit is open

The wrapping service decides:

For non-critical calls (analytics, marketing) — drop silently, log it, move on.
For user-facing critical calls (payments) — return a graceful error to the user (“Stripe is temporarily unavailable, please try again in a minute”); do NOT log this as 5xx.
For internal workflows — write the work to a retry queue, alert ops, return success to the caller.

The choice is per-call-site. The default in @platform/core is “fail with CircuitBreakerError”; the wrapping service catches and decides.
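
The three policies above could look like this at a call site. This is a sketch under assumptions: the Policy and Outcome types, the withPolicy helper, and the hooks are hypothetical, and CircuitBreakerError is redeclared locally so the example is self-contained (the real class comes from @platform/core).

```typescript
// Stand-in for the error thrown by the breaker.
class CircuitBreakerError extends Error {}

type Policy = "drop" | "user-error" | "queue";

type Outcome<T> =
  | { kind: "ok"; value: T }
  | { kind: "dropped" }
  | { kind: "unavailable"; message: string }
  | { kind: "queued" };

type Hooks = {
  log: (msg: string) => void;
  enqueue: () => void; // writes the work to a retry queue
  alert: (msg: string) => void; // notifies ops
};

async function withPolicy<T>(
  policy: Policy,
  call: () => Promise<T>,
  hooks: Hooks,
): Promise<Outcome<T>> {
  try {
    return { kind: "ok", value: await call() };
  } catch (err) {
    // Only an open circuit is handled here; other vendor errors propagate unchanged.
    if (!(err instanceof CircuitBreakerError)) throw err;
    switch (policy) {
      case "drop":
        // Non-critical: log and move on.
        hooks.log("circuit open, dropping non-critical call");
        return { kind: "dropped" };
      case "user-error":
        // User-facing: graceful message, not reported as a 5xx.
        return { kind: "unavailable", message: "Temporarily unavailable, please try again in a minute" };
      case "queue":
        // Internal workflow: queue for retry, alert ops, report success upstream.
        hooks.enqueue();
        hooks.alert("circuit open, work queued for retry");
        return { kind: "queued" };
    }
  }
}
```

Keeping the policy outside the breaker means the same vendor client can be wrapped differently by different call sites.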

Circuits we currently have

Vendor	Where
Stripe	`services/payments/src/services/stripe.client.ts`
BigCommerce	`services/commerce/src/services/bigcommerce.client.ts`
Recharge	`services/commerce/src/services/recharge.client.ts`
Twilio	`services/comms/src/lib/adapters/twilio.adapter.ts`
Postmark	`services/comms/src/lib/adapters/postmark.adapter.ts`
Resend	`services/comms/src/lib/adapters/resend.adapter.ts`
WorkOS	`services/identity/src/services/workos.service.ts`
Clerk	`services/identity/src/services/clerk-bridge.service.ts`
OpenAI / Anthropic	`services/ai/src/lib/providers/`

Each emits OTel metrics (open count, half-open transitions, breaker decision) so we can alarm on a sustained-open circuit.

What this guards against

A spammy client can’t take the platform down with a tight loop.
A vendor outage doesn’t propagate into our own latencies — calls fail fast, freeing up our resources.
A retry storm is dampened by jitter + circuit breakers.
A noisy neighbor can’t starve other tenants (rate limits are per client_id).

What it doesn’t guard against

A coordinated DDoS — that’s the ALB’s job. Rate limits are application-level.
Slow business-logic bugs (a query that gets slower as data grows) — that’s the job of SLO monitoring + alarms.
A vendor returning slow but successful responses — circuit breakers fire on failure, not latency alone. We have a latency-based wrapper but it’s used selectively.
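
One way a latency-based wrapper can work is to turn a slow call into a failure with a timeout, so the breaker counts it like any other error. The sketch below is an assumption about the general technique, not the platform's actual wrapper. The names withTimeout and TimeoutError are hypothetical.

```typescript
// Hypothetical error type for calls that exceed their latency budget.
class TimeoutError extends Error {}

// Rejects with TimeoutError if `call` has not settled within `ms` milliseconds.
// The underlying request is not cancelled; only the caller stops waiting.
function withTimeout<T>(call: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(`timed out after ${ms}ms`)), ms);
    call().then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

Passing `() => withTimeout(vendorCall, 2000)` to a breaker's exec would make responses slower than two seconds count toward the failure threshold; vendorCall and the 2000 ms budget are placeholders.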

Common mistakes

Forgetting jitter on retry — synchronized retries amplify the problem.
Treating 429 as a 5xx — it’s a 4xx, “you did too much,” not “we broke.” Retry but back off.
Wrapping the wrong call — circuit breakers go around vendor calls, not around internal DB calls. DB issues have their own pattern.
Hardcoding rate limit values — they live in @platform/hono/src/middleware/rate-limit.ts config. Don’t sprinkle magic numbers in handlers.
Letting the circuit decide UX — the breaker just signals “vendor unhealthy”; the service decides what the user sees.

What to read next

Idempotency and retries — retries are safe BECAUSE of idempotency
System overview
services/comms — heaviest rate-limit surface
services/payments — heaviest circuit-breaker surface

Source ADRs

ADR-0030 (rate limiting), ADR-0044 (circuit breakers), ADR-0047 (observability for breakers).
