Rate limits and circuit breakers

What this is: how the platform protects itself from overload and protects its callers from cascading failure.

Who it’s for: anyone writing a route that calls a vendor, anyone seeing 429s in production, anyone wondering how to handle “vendor X is down.”

What to read next: Idempotency and retries, System overview, services/comms.

Two distinct mechanisms

These are often confused. They protect different things.

| Mechanism | Protects | Lives at |
| --- | --- | --- |
| Rate limits | The platform from too many caller requests | Per-route middleware on inbound traffic |
| Circuit breakers | Callers from a vendor that’s failing | Wrapper around outbound vendor calls |

Rate limits

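Rate limits are sliding-window counters in Redis, keyed by (client_id, scope-of-action) by default. A few routes key on IP or user instead, as the table of default caps shows.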
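To make the counter concrete, here is a minimal in-memory sketch of a sliding-window log limiter keyed by (client_id, scope). It is an illustration under stated assumptions, not the code in rate-limit.ts: the real middleware keeps its counters in Redis, and `SlidingWindowLimiter`, `check`, and the key format are hypothetical names.

```typescript
// Hypothetical in-memory stand-in for the Redis-backed limiter (illustration only).
type Decision = { allowed: boolean; retryAfterMs: number };

class SlidingWindowLimiter {
  // One list of request timestamps (ms) per "clientId:scope" key.
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;
  private now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock, so the behavior can be tested deterministically
  }

  check(clientId: string, scope: string): Decision {
    const key = `${clientId}:${scope}`;
    const t = this.now();
    // Keep only the timestamps that are still inside the window.
    const recent = (this.hits.get(key) ?? []).filter((ts) => ts > t - this.windowMs);

    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      // The oldest hit leaves the window at oldest + windowMs.
      const oldest = recent[0] ?? t;
      return { allowed: false, retryAfterMs: oldest + this.windowMs - t };
    }

    recent.push(t);
    this.hits.set(key, recent);
    return { allowed: true, retryAfterMs: 0 };
  }
}
```

A log of timestamps is the simplest variant to reason about. A production limiter would more likely use bucketed counters to bound memory per key.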

Default caps (defined in @platform/hono/src/middleware/rate-limit.ts):

| Endpoint pattern | Limit |
| --- | --- |
| OAuth /authorize | 30 / min per IP |
| OAuth /token | 60 / min per client_id |
| OAuth /introspect | 600 / min per client_id |
| OAuth /revoke | 60 / min per client_id |
| Standard read endpoints | 1000 / min per client_id |
| Standard write endpoints | 200 / min per client_id |
| Admin endpoints | 100 / min per client_id |
| Comms send routes | 60 / min per user (transactional) |
| Comms send routes | 1 / day per user (marketing) |

Routes can override defaults by declaring their own rate-limit config. Service-specific limits live in each service’s infra.ts and are documented in the service detail page.
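As a rough illustration of how an override might layer on top of a default, the sketch below merges a route's own settings over a shared table. The `RateLimit` shape, the category names, and `resolveLimit` are all assumptions made for this example. The real config shape is whatever rate-limit.ts and each service's infra.ts define.

```typescript
// Hypothetical config shape (assumption, not the platform's actual type).
type RateLimit = { limit: number; windowMs: number; per: "ip" | "client_id" | "user" };

const MINUTE_MS = 60_000;

// A few of the documented defaults, keyed by made-up category names.
const DEFAULTS: Record<string, RateLimit> = {
  "oauth.authorize": { limit: 30, windowMs: MINUTE_MS, per: "ip" },
  "standard.read": { limit: 1000, windowMs: MINUTE_MS, per: "client_id" },
  "standard.write": { limit: 200, windowMs: MINUTE_MS, per: "client_id" },
  "admin": { limit: 100, windowMs: MINUTE_MS, per: "client_id" },
};

// A route's own settings win field by field; anything it omits falls back to the default.
function resolveLimit(category: string, override: Partial<RateLimit> = {}): RateLimit {
  const base = DEFAULTS[category];
  if (!base) throw new Error(`no default rate limit for "${category}"`);
  return { ...base, ...override };
}

// Example: a write route that wants a tighter cap than the standard 200 / min.
const bulkImport = resolveLimit("standard.write", { limit: 20 });
```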

Response shape

HTTP/1.1 429 Too Many Requests
Retry-After: 12
Content-Type: application/json

{
  "error": "rate_limited",
  "error_description": "Request limit exceeded for client_id",
  "retry_after_ms": 12000
}

Callers MUST honor Retry-After. Repeated 429s without backing off trigger a temporary client_id lockout (15 min, then 1 h, then escalating).
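The response above could be assembled by a small helper like the one below. This is a sketch, not the middleware's actual code: `rateLimited` and the `RateLimitedResponse` type are hypothetical, and rounding Retry-After up to whole seconds is an assumption chosen so that a caller who waits exactly that long never retries early.

```typescript
// Hypothetical helper that builds the documented 429 payload as a plain object.
type RateLimitedResponse = {
  status: 429;
  headers: Record<string, string>;
  body: { error: "rate_limited"; error_description: string; retry_after_ms: number };
};

function rateLimited(retryAfterMs: number): RateLimitedResponse {
  const ms = Math.max(0, Math.ceil(retryAfterMs));
  return {
    status: 429,
    headers: {
      // Retry-After carries whole seconds, so round up (assumption).
      "Retry-After": String(Math.ceil(ms / 1000)),
      "Content-Type": "application/json",
    },
    body: {
      error: "rate_limited",
      error_description: "Request limit exceeded for client_id",
      retry_after_ms: ms,
    },
  };
}
```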

How callers should handle 429

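const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries fn on 429, waiting for Retry-After plus up to 1 s of jitter between attempts.
async function call(fn: () => Promise<Response>, maxAttempts = 5): Promise<Response> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fn();
    if (res.status !== 429) return res;
    if (attempt === maxAttempts) break; // no point sleeping before giving up
    // Retry-After may be missing or an HTTP date; fall back to 1 s if it is not a number.
    const parsed = Number(res.headers.get("Retry-After") ?? NaN);
    const retryAfter = Number.isFinite(parsed) && parsed >= 0 ? parsed : 1;
    await sleep((retryAfter + Math.random()) * 1000);
  }
  throw new Error("rate-limited after max attempts");
}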

Jitter (Math.random()) prevents thundering-herd when many callers retry at the same boundary.

Quotas vs rate limits

Rate limits cover short windows (per minute for most routes, per day for marketing sends). Quotas are per billing period and live in services/entitlements. A client can be within its rate limits but over quota, or vice versa.

A read of a biomarker counts against rate limits but not quota. A send of a marketing email counts against both: the rate limit (1/day/user) and the quota (e.g., 10,000/month/client).
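The two checks are independent, which a small decision function can make explicit. This is only a sketch: `Usage`, `Outcome`, and `admit` are hypothetical names, and the real quota check lives in services/entitlements with whatever interface that service exposes.

```typescript
// Hypothetical inputs: the result of the rate-limit check, plus remaining quota.
// quotaRemaining is null for actions that are not metered by quota (e.g. a biomarker read).
type Usage = { withinRateLimit: boolean; quotaRemaining: number | null };
type Outcome = "ok" | "rate_limited" | "quota_exceeded";

function admit(usage: Usage): Outcome {
  // The rate limit applies to every request.
  if (!usage.withinRateLimit) return "rate_limited";
  // Quota applies only to metered actions.
  if (usage.quotaRemaining !== null && usage.quotaRemaining <= 0) return "quota_exceeded";
  return "ok";
}

// A biomarker read: rate-limited, never quota-limited.
const read = admit({ withinRateLimit: true, quotaRemaining: null });
// A marketing email from a client that has used up its monthly quota.
const marketing = admit({ withinRateLimit: true, quotaRemaining: 0 });
```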

Circuit breakers

Wrapped around every outbound vendor call. From @platform/core/src/circuit-breaker.ts:

import { createCircuitBreaker } from "@platform/core";
 
const stripeBreaker = createCircuitBreaker({
  name: "stripe.payments",
  threshold: 5,       // 5 failures
  windowMs: 60_000,   // in 60 seconds
  openMs: 30_000,     // → open circuit for 30 seconds
});
 
const result = await stripeBreaker.exec(() => stripe.paymentIntents.create(params));

States:

  • Closed — normal. All calls go through.
  • Open — circuit tripped. Calls fail fast with CircuitBreakerError. After openMs, transitions to half-open.
  • Half-open — one test call goes through. Success → closed. Failure → open again.
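The three states above can be sketched as a small state machine. This is not the implementation in @platform/core: `BreakerState` and its methods are hypothetical, and the option names simply mirror `threshold`, `windowMs`, and `openMs` from the example. The clock is injectable so the transitions can be exercised without waiting.

```typescript
type State = "closed" | "open" | "half-open";
type Options = { threshold: number; windowMs: number; openMs: number; now?: () => number };

// Hypothetical sketch of the closed / open / half-open transitions.
class BreakerState {
  private state: State = "closed";
  private failures: number[] = []; // timestamps (ms) of recent failures
  private openedAt = 0;
  private opts: Options;
  private now: () => number;

  constructor(opts: Options) {
    this.opts = opts;
    this.now = opts.now ?? Date.now;
  }

  // Should the next call be attempted?
  allow(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.opts.openMs) {
      this.state = "half-open"; // let exactly one test call through
      return true;
    }
    return this.state === "closed";
  }

  onSuccess(): void {
    if (this.state === "half-open") {
      this.state = "closed";
      this.failures = [];
    }
  }

  onFailure(): void {
    const t = this.now();
    if (this.state === "half-open") return this.trip(t); // test call failed: reopen
    this.failures = this.failures.filter((ts) => ts > t - this.opts.windowMs);
    this.failures.push(t);
    if (this.failures.length >= this.opts.threshold) this.trip(t);
  }

  current(): State {
    return this.state;
  }

  private trip(t: number): void {
    this.state = "open";
    this.openedAt = t;
    this.failures = [];
  }
}
```

An `exec`-style wrapper would call `allow()` before the vendor call, fail fast when it returns false, and report the outcome through `onSuccess()` or `onFailure()`.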

This protects callers from:

  • Cascading failure when a vendor degrades (we don’t pile on retries that just compound the problem)
  • Long tail latencies from a vendor’s slow responses (we fail fast)
  • Resource exhaustion (thread pools, connection pools) waiting on a sick vendor

What happens when the circuit is open

The wrapping service decides:

  • For non-critical calls (analytics, marketing) — drop silently, log it, move on.
  • For user-facing critical calls (payments) — return a graceful error to the user (“Stripe is temporarily unavailable, please try again in a minute”); do NOT log this as 5xx.
  • For internal workflows — write the work to a retry queue, alert ops, return success to the caller.

The choice is per-call-site. The default in @platform/core is “fail with CircuitBreakerError”; the wrapping service catches and decides.
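Two call sites handling the same error differently might look like the sketch below. The `CircuitBreakerError` class here is a local stand-in so the example is self-contained (it assumes the real one is an `Error` subclass), and `createPayment`, `trackEvent`, and the result shape are hypothetical.

```typescript
// Local stand-in for the error thrown by @platform/core (assumption: an Error subclass).
class CircuitBreakerError extends Error {}

type PaymentResult = { ok: true; id: string } | { ok: false; userMessage: string };

// Critical, user-facing call: turn an open circuit into a graceful message.
async function createPayment(exec: () => Promise<{ id: string }>): Promise<PaymentResult> {
  try {
    const intent = await exec();
    return { ok: true, id: intent.id };
  } catch (err) {
    if (err instanceof CircuitBreakerError) {
      return {
        ok: false,
        userMessage: "Stripe is temporarily unavailable, please try again in a minute",
      };
    }
    throw err; // anything else is a real failure and should surface
  }
}

// Non-critical call: log that the event was dropped and move on.
async function trackEvent(exec: () => Promise<void>, log: (msg: string) => void): Promise<void> {
  try {
    await exec();
  } catch (err) {
    if (err instanceof CircuitBreakerError) {
      log("analytics circuit open, event dropped");
      return;
    }
    throw err;
  }
}
```

In both cases `exec` would be something like `() => stripeBreaker.exec(...)`. Only the catch branch differs between call sites.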

Circuits we currently have

| Vendor | Where |
| --- | --- |
| Stripe | services/payments/src/services/stripe.client.ts |
| BigCommerce | services/commerce/src/services/bigcommerce.client.ts |
| Recharge | services/commerce/src/services/recharge.client.ts |
| Twilio | services/comms/src/lib/adapters/twilio.adapter.ts |
| Postmark | services/comms/src/lib/adapters/postmark.adapter.ts |
| Resend | services/comms/src/lib/adapters/resend.adapter.ts |
| WorkOS | services/identity/src/services/workos.service.ts |
| Clerk | services/identity/src/services/clerk-bridge.service.ts |
| OpenAI / Anthropic | services/ai/src/lib/providers/ |

Each emits OTel metrics (open count, half-open transitions, breaker decision) so we can alarm on a sustained-open circuit.

What this guards against

  • A spammy client can’t take the platform down with a tight loop.
  • A vendor outage doesn’t propagate into our own latencies — calls fail fast, freeing up our resources.
  • A retry storm is dampened by jitter + circuit breakers.
  • A noisy neighbor can’t starve other tenants (rate limits are per client_id).

What it doesn’t guard against

  • A coordinated DDoS — that’s the ALB’s job. Rate limits are application-level.
  • Slow business-logic bugs (a query that gets slower as data grows) — that’s the job of SLO monitoring + alarms.
  • A vendor returning slow but successful responses — circuit breakers fire on failure, not latency alone. We have a latency-based wrapper but it’s used selectively.
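For the slow-but-successful case, one common approach is to turn a blown latency budget into a failure, so the breaker counts it like any other error. The sketch below shows that idea only. It is not the platform's latency-based wrapper, whose API this page does not describe, and `withTimeout` and `TimeoutError` are hypothetical names.

```typescript
// Hypothetical error type for calls that exceed their latency budget.
class TimeoutError extends Error {}

// Rejects with TimeoutError if fn has not settled within budgetMs.
// Note: the underlying call is not cancelled; only its result is ignored.
function withTimeout<T>(fn: () => Promise<T>, budgetMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new TimeoutError(`call exceeded ${budgetMs} ms`)),
      budgetMs,
    );
    fn().then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

Wrapped inside a breaker call, for example `stripeBreaker.exec(() => withTimeout(() => stripe.paymentIntents.create(params), 2_000))`, a vendor that answers too slowly would then count toward the failure threshold. The 2-second budget is an arbitrary illustration.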

Common mistakes

  • Forgetting jitter on retry — synchronized retries (every caller retrying at the same instant) amplify the problem.
  • Treating 429 as a 5xx — it’s a 4xx, “you did too much,” not “we broke.” Retry but back off.
  • Wrapping the wrong call — circuit breakers go around vendor calls, not around internal DB calls. DB issues have their own pattern.
  • Hardcoding rate limit values — they live in @platform/hono/src/middleware/rate-limit.ts config. Don’t sprinkle magic numbers in handlers.
  • Letting the circuit decide UX — the breaker just signals “vendor unhealthy”; the service decides what the user sees.

Source ADRs

ADR-0030 (rate limiting), ADR-0044 (circuit breakers), ADR-0047 (observability for breakers).