Rate limits and circuit breakers
What this is: how the platform protects itself from overload and protects its callers from cascading failure.
Who it’s for: anyone writing a route that calls a vendor, anyone seeing 429s in production, anyone wondering how to handle “vendor X is down.”
What to read next: Idempotency and retries, System overview, services/comms.
Two distinct mechanisms
These are often confused. They protect different things.
| Mechanism | Protects | Lives at |
|---|---|---|
| Rate limits | The platform from too many caller requests | Per-route middleware on inbound traffic |
| Circuit breakers | Callers from a vendor that’s failing | Wrapper around outbound vendor calls |
Rate limits
Sliding-window counters in Redis, per (client_id, scope-of-action).
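To make the counting idea concrete, here is a minimal in-memory sketch of a sliding-window counter keyed by (client_id, scope). It is not the middleware's implementation: the real counters live in Redis, and the `SlidingWindowLimiter` class, its `allow` method, and the timestamp-log approach are assumptions chosen for readability.

```typescript
// Hypothetical in-memory sliding-window limiter (the platform keeps its counters in Redis).
class SlidingWindowLimiter {
  // Maps "client_id:scope" to the timestamps (ms) of accepted requests.
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed at time `now`, false if it should get a 429.
  allow(clientId: string, scope: string, now: number): boolean {
    const key = `${clientId}:${scope}`;
    // Keep only the timestamps still inside the window that ends at `now`.
    const recent = (this.hits.get(key) ?? []).filter((t) => now - t < this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed;
  }
}
```

Because the window slides with each request, a caller that bursts at the end of one minute cannot immediately burst again at the start of the next, which a fixed per-minute bucket would allow.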
Default caps (defined in @platform/hono/src/middleware/rate-limit.ts):
| Endpoint pattern | Limit |
|---|---|
| OAuth /authorize | 30 / min per IP |
| OAuth /token | 60 / min per client_id |
| OAuth /introspect | 600 / min per client_id |
| OAuth /revoke | 60 / min per client_id |
| Standard read endpoints | 1000 / min per client_id |
| Standard write endpoints | 200 / min per client_id |
| Admin endpoints | 100 / min per client_id |
| Comms send routes | 60 / min per user (transactional) |
| Comms send routes | 1 / day per user (marketing) |
Routes can override defaults by declaring their own rate-limit config. Service-specific limits live in each service’s infra.ts and are documented in the service detail page.
Response shape
HTTP/1.1 429 Too Many Requests
Retry-After: 12
Content-Type: application/json
{
  "error": "rate_limited",
  "error_description": "Request limit exceeded for client_id",
  "retry_after_ms": 12000
}
Callers MUST honor Retry-After. Repeated 429s without backing off trigger a temporary client_id lockout (15 min, then 1h, then escalating).
How callers should handle 429
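// Resolve after `ms` milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
async function call(fn: () => Promise<Response>, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fn();
    if (res.status !== 429) return res;
    // Retry-After is in seconds; fall back to 1s if the header is missing or not numeric.
    const parsed = Number(res.headers.get("Retry-After"));
    const retryAfter = Number.isFinite(parsed) && parsed > 0 ? parsed : 1;
    // Add up to 1s of random jitter so callers don't retry in lockstep.
    await sleep((retryAfter + Math.random()) * 1000);
  }
  throw new Error("rate-limited after max attempts");
}
Jitter (Math.random()) prevents a thundering herd when many callers retry at the same boundary.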
Quotas vs rate limits
Rate limits cover short windows (per minute for most routes, per day for marketing sends). Quotas are per-billing-period and live in services/entitlements. A client can be inside rate limits but over quota (or vice versa).
A read of a biomarker counts against rate limits but not quota. A send of a marketing email counts against both: the rate limit (1/day/user) and the quota (e.g., 10,000/month/client).
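The marketing-email case can be sketched as two independent checks. This is an illustration only: `MarketingUsage`, `checkMarketingSend`, and the decision labels are hypothetical names, and the defaults simply reuse the example figures above (1/day/user and 10,000/month/client).

```typescript
// Hypothetical usage snapshot; the real counters live in Redis and services/entitlements.
interface MarketingUsage {
  sentToUserToday: number;       // counts against the rate limit (1 / day per user)
  sentByClientThisMonth: number; // counts against the quota (e.g. 10,000 / month per client)
}

type Decision = "ok" | "rate_limited" | "over_quota";

function checkMarketingSend(
  usage: MarketingUsage,
  dailyLimit: number = 1,
  monthlyQuota: number = 10000,
): Decision {
  // The two checks are independent: either one can fail while the other passes.
  if (usage.sentToUserToday >= dailyLimit) return "rate_limited";
  if (usage.sentByClientThisMonth >= monthlyQuota) return "over_quota";
  return "ok";
}
```

A biomarker read would only ever go through the first kind of check, since reads do not count against quota.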
Circuit breakers
Wrapped around every outbound vendor call. From @platform/core/src/circuit-breaker.ts:
import { createCircuitBreaker } from "@platform/core";
const stripeBreaker = createCircuitBreaker({
  name: "stripe.payments",
  threshold: 5,     // 5 failures
  windowMs: 60_000, // in 60 seconds
  openMs: 30_000,   // → open circuit for 30 seconds
});
const result = await stripeBreaker.exec(() => stripe.paymentIntents.create(params));
States:
- Closed: normal. All calls go through.
- Open: circuit tripped. Calls fail fast with CircuitBreakerError. After openMs, transitions to half-open.
- Half-open: one test call goes through. Success → closed. Failure → open again.
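The three states can be sketched as a small state machine. This is a simplified illustration, not the code in @platform/core: the `SketchBreaker` class and its `canCall` / `onSuccess` / `onFailure` methods are hypothetical, and time is passed in explicitly so the transitions are easy to follow.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical sketch of the closed / open / half-open transitions.
class SketchBreaker {
  private state: BreakerState = "closed";
  private failures: number[] = []; // timestamps (ms) of recent failures
  private openedAt = 0;
  private threshold: number;
  private windowMs: number;
  private openMs: number;

  constructor(threshold: number, windowMs: number, openMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.openMs = openMs;
  }

  // Ask before calling the vendor. False means "fail fast".
  canCall(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.openMs) {
      this.state = "half-open"; // let exactly one test call through
      return true;
    }
    return this.state === "closed";
  }

  onSuccess(): void {
    // A successful test call closes the circuit again.
    if (this.state === "half-open") {
      this.state = "closed";
      this.failures = [];
    }
  }

  onFailure(now: number): void {
    if (this.state === "half-open") {
      this.trip(now); // failed test call: back to open
      return;
    }
    // Count only failures inside the rolling window.
    this.failures = this.failures.filter((t) => now - t < this.windowMs);
    this.failures.push(now);
    if (this.failures.length >= this.threshold) this.trip(now);
  }

  current(): BreakerState {
    return this.state;
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
    this.failures = [];
  }
}
```

With the Stripe configuration above, this would correspond to `new SketchBreaker(5, 60_000, 30_000)`.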
This protects callers from:
- Cascading failure when a vendor degrades (we don’t pile on retries that just compound the problem)
- Long tail latencies from a vendor’s slow responses (we fail fast)
- Resource exhaustion (thread pools, connection pools) waiting on a sick vendor
What happens when the circuit is open
The wrapping service decides:
- For non-critical calls (analytics, marketing) — drop silently, log it, move on.
- For user-facing critical calls (payments) — return a graceful error to the user (“Stripe is temporarily unavailable, please try again in a minute”); do NOT log this as 5xx.
- For internal workflows — write the work to a retry queue, alert ops, return success to the caller.
The choice is per-call-site. The default in @platform/core is “fail with CircuitBreakerError”; the wrapping service catches and decides.
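A call site's decision could be expressed roughly as follows. Everything here is a sketch under assumptions: the local `CircuitBreakerError` class is a stand-in for the error the platform throws, and `CallKind`, `Outcome`, and `decideOnOpenCircuit` are hypothetical names that only mirror the three cases listed above.

```typescript
// Local stand-in for the error thrown by the breaker when the circuit is open.
class CircuitBreakerError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CircuitBreakerError";
  }
}

type CallKind = "non-critical" | "user-facing" | "internal-workflow";

interface Outcome {
  reply: "continue" | "graceful-error" | "success"; // what our own caller sees
  userMessage?: string;
  enqueueRetry: boolean; // push the work onto a retry queue
  alertOps: boolean;
}

function decideOnOpenCircuit(kind: CallKind, err: unknown): Outcome {
  // Anything other than an open circuit is a genuine failure: rethrow it.
  const isOpenCircuit = err instanceof Error && err.name === "CircuitBreakerError";
  if (!isOpenCircuit) throw err;

  switch (kind) {
    case "non-critical":
      // Drop silently (the caller still logs it) and move on.
      return { reply: "continue", enqueueRetry: false, alertOps: false };
    case "user-facing":
      // Graceful error for the user; not reported as a 5xx.
      return {
        reply: "graceful-error",
        userMessage: "Stripe is temporarily unavailable, please try again in a minute",
        enqueueRetry: false,
        alertOps: false,
      };
    case "internal-workflow":
      // Queue the work, alert ops, and tell the caller it succeeded.
      return { reply: "success", enqueueRetry: true, alertOps: true };
  }
}
```

Keeping this decision in the wrapping service, rather than in the breaker, is what lets the same open circuit produce different behavior for analytics and for payments.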
Circuits we currently have
| Vendor | Where |
|---|---|
| Stripe | services/payments/src/services/stripe.client.ts |
| BigCommerce | services/commerce/src/services/bigcommerce.client.ts |
| Recharge | services/commerce/src/services/recharge.client.ts |
| Twilio | services/comms/src/lib/adapters/twilio.adapter.ts |
| Postmark | services/comms/src/lib/adapters/postmark.adapter.ts |
| Resend | services/comms/src/lib/adapters/resend.adapter.ts |
| WorkOS | services/identity/src/services/workos.service.ts |
| Clerk | services/identity/src/services/clerk-bridge.service.ts |
| OpenAI / Anthropic | services/ai/src/lib/providers/ |
Each emits OTel metrics (open count, half-open transitions, breaker decision) so we can alarm on a sustained-open circuit.
What this guards against
- A spammy client can’t take the platform down with a tight loop.
- A vendor outage doesn’t propagate into our own latencies — calls fail fast, freeing up our resources.
- A retry storm is dampened by jitter + circuit breakers.
- A noisy neighbor can’t starve other tenants (rate limits are per client_id).
What it doesn’t guard against
- A coordinated DDoS — that’s the ALB’s job. Rate limits are application-level.
- Slow business-logic bugs (a query that gets slower as data grows) — that’s the job of SLO monitoring + alarms.
- A vendor returning slow but successful responses — circuit breakers fire on failure, not latency alone. We have a latency-based wrapper but it’s used selectively.
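This page does not describe how the latency-based wrapper works, so the following is only one plausible shape for such a policy, stated as an assumption: treat a successful call that exceeds a latency budget as a failure for the breaker's purposes. `CallResult`, `isBreakerFailure`, and the budget parameter are hypothetical names.

```typescript
// Hypothetical record of a finished vendor call.
interface CallResult {
  ok: boolean;        // did the vendor return a successful response?
  durationMs: number; // how long the call took
}

// Assumed policy: errors always count; slow successes count once they exceed the budget.
function isBreakerFailure(result: CallResult, latencyBudgetMs: number): boolean {
  return !result.ok || result.durationMs > latencyBudgetMs;
}
```

Under a policy like this, a vendor that keeps answering 200 but takes several seconds per call would eventually trip the circuit, which the plain failure-count breaker would not do.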
Common mistakes
- Forgetting jitter on retry: retries that fire in lockstep amplify the problem.
- Treating 429 as a 5xx: it’s a 4xx, “you did too much,” not “we broke.” Retry but back off.
- Wrapping the wrong call: circuit breakers go around vendor calls, not around internal DB calls. DB issues have their own pattern.
- Hardcoding rate limit values: they live in the @platform/hono/src/middleware/rate-limit.ts config. Don’t sprinkle magic numbers in handlers.
- Letting the circuit decide UX: the breaker just signals “vendor unhealthy”; the service decides what the user sees.
Related
- Idempotency and retries — retries are safe BECAUSE idempotency
- System overview
- services/comms — heaviest rate-limit surface
- services/payments — heaviest circuit-breaker surface
Source ADRs
ADR-0030 (rate limiting), ADR-0044 (circuit breakers), ADR-0047 (observability for breakers).