Why two layers

Per-app limits cap traffic for a whole application — every device, every connection, every region — keyed by your APP_ID. They protect the shared relay from a single tenant.
Per-connection limits cap traffic on a single open stream: one Agent SDK process, one device WebSocket, or one gRPC bi-di stream. They protect against runaway loops and bursty clients without affecting the rest of your fleet.

A request that exceeds the limit at either layer is rejected; a request that passes both is then metered against your monthly decision quota.

Current limits

Defaults are tuned for typical HITL workloads — they apply to every app on the public relay.

Scope	Trigger	Messages	Bytes
Per app (authenticated)	Any request after a session is established	200 msg/s	10 MB/s
Per app (unauthenticated)	`Authenticate` and `RegisterDevice` only	5 msg/s	8 KB/s
Per connection	Each gRPC stream or WebSocket session	20 msg/s	1 MB/s

Both message and byte counts use a token-bucket implementation (governor) with a 1-second refill window — short bursts are allowed; sustained traffic above the rate is rejected.
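The burst-then-reject behaviour can be illustrated with a minimal token-bucket sketch. This is not the relay's implementation: the class names, the assumption that bucket capacity equals one second of traffic, and the reading of 1 MB as 1,000,000 bytes are all illustrative. The defaults mirror the per-connection row of the table.

```python
class TokenBucket:
    """Bucket that refills continuously at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # largest burst the bucket can absorb
        self.tokens = capacity    # start full
        self.last = 0.0           # time of the previous refill, in seconds

    def allow(self, cost: float, now: float) -> bool:
        # Refill for the elapsed time, never beyond capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class ConnectionLimiter:
    """Hypothetical per-connection guard: message count first, then bytes."""

    def __init__(self, msgs_per_s: float = 20, bytes_per_s: float = 1_000_000) -> None:
        # Assumption: capacity equals one second's worth of traffic.
        self.messages = TokenBucket(msgs_per_s, msgs_per_s)
        self.bytes = TokenBucket(bytes_per_s, bytes_per_s)

    def admit(self, frame_bytes: int, now: float) -> bool:
        # A frame must fit in both buckets; the message bucket is checked first.
        return self.messages.allow(1, now) and self.bytes.allow(frame_bytes, now)
```

Passing `now` explicitly keeps the sketch deterministic; a real client would pass `time.monotonic()`.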

Per-app limits are shared. They count against your APP_ID across all connections, agent backends, and devices. Whatever your agent backend sends through RouteDecision comes out of the same 200 msg/s budget that your devices use; if the backend alone reaches 200 msg/s, no headroom is left for them.

What you see when limits are hit

The relay returns a structured error on the same channel as the original request. The exact surface depends on which API you are calling:

Surface	How the limit is reported
Agent API (gRPC)	gRPC status `RESOURCE_EXHAUSTED` on the unary call. SDKs raise the language's standard rate-limit error type.
Client API (gRPC stream)	A `DeviceResponse` with `status = ERROR` and `error.code = RATE_LIMITED` on the same stream. `error.retryable` is `true`.
Client API (WebSocket / JSON)	A JSON frame: `{"status":"ERROR","error":{"code":"RATE_LIMITED","message":"Rate limit exceeded","retryable":true}}`.

Error envelope

{
  "requestId": "req-1731700000000",
  "status": "ERROR",
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded",
    "retryable": true
  }
}
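For callers that talk to the WebSocket / JSON surface directly, a small helper along these lines can turn the envelope into a retry decision. The function name and the three return values are illustrative assumptions, not part of any SDK; only the `status`, `error.code`, and `error.retryable` fields come from the envelope shown here.

```python
import json


def classify_frame(raw: str) -> str:
    """Return "ok", "retry", or "fail" for one JSON frame from the relay."""
    frame = json.loads(raw)
    if frame.get("status") != "ERROR":
        return "ok"
    error = frame.get("error") or {}
    # Only retry when the relay explicitly marks the error as retryable.
    return "retry" if error.get("retryable") is True else "fail"
```

A frame classified as "retry" should still wait for a backoff delay before the request is sent again.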

Order of checks

The relay applies limits in a fixed order. Use this when you debug a 429-style failure:

1. Connection rate limits: message count, then byte count of the inbound frame.
2. App rate limits: the unauthenticated bucket if the message is Authenticate or RegisterDevice, otherwise the authenticated bucket.
3. App existence: an unknown APP_ID short-circuits with APP_NOT_FOUND.
4. Decision quota: only consumed on a successful RouteDecision; APP_QUOTA_EXCEEDED if the monthly pool is empty.

The error you see therefore tells you which guard tripped — RATE_LIMITED means you exceeded one of the per-second buckets, not the monthly quota.
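As a debugging aid, the check order can be modelled as a chain of early returns. This is a sketch of the sequence described in this section, not relay code; the function and parameter names are hypothetical.

```python
from typing import Optional


def first_tripped_guard(
    conn_within_rate: bool,
    app_within_rate: bool,
    app_exists: bool,
    quota_remaining: int,
    is_route_decision: bool = True,
) -> Optional[str]:
    """Return the error code of the first guard that rejects, or None."""
    if not conn_within_rate:
        return "RATE_LIMITED"        # per-connection bucket
    if not app_within_rate:
        return "RATE_LIMITED"        # per-app bucket
    if not app_exists:
        return "APP_NOT_FOUND"
    # The monthly quota only applies to RouteDecision calls.
    if is_route_decision and quota_remaining <= 0:
        return "APP_QUOTA_EXCEEDED"
    return None
```

One consequence of this order: a client that floods the relay with an unknown APP_ID would see RATE_LIMITED before it ever sees APP_NOT_FOUND.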

Handling rate limits

SDKs already implement most of this; if you call the API directly, follow the same patterns:

Back off exponentially. Retry on RATE_LIMITED / RESOURCE_EXHAUSTED with jitter. A starting delay of 250 ms doubling up to a 5 s cap works well at the default per-app rate.
Cap concurrency in your agent. If your agent fans out to many users in parallel, bound it to ~150 concurrent RouteDecision calls per app to leave headroom for retries and devices.
Reuse one Client SDK per device. The Client SDK keeps a single bi-di stream; opening parallel streams against the same identity does not help, because every stream still draws on the same shared per-app budget.
Batch where it makes sense. Group device fan-out for one decision into one logical attempt; the relay returns one DeliveryOutcome per device but it is still a single user-visible decision against your quota.
Watch retryable. The flag is true for RATE_LIMITED — safe to retry after a delay. Errors with retryable: false (e.g. UNAUTHORIZED) won't succeed on retry.
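If you call the API directly, the backoff advice above might look like the following sketch. `RateLimited` is a hypothetical stand-in for whatever your transport raises on RATE_LIMITED / RESOURCE_EXHAUSTED, and the choice of full jitter (sleeping a random fraction of each delay) and six attempts are assumptions; only the 250 ms start, the doubling, and the 5 s cap come from the guidance above.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the transport's rate-limit error."""


def backoff_delays(base: float = 0.25, cap: float = 5.0, attempts: int = 6) -> List[float]:
    """Upper bounds for each wait: base, doubled each time, clamped at cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def call_with_backoff(
    call: Callable[[], T],
    attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run `call`, retrying on RateLimited with jittered exponential backoff."""
    delays = backoff_delays(attempts=attempts)
    for attempt, delay in enumerate(delays):
        try:
            return call()
        except RateLimited:
            if attempt == len(delays) - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random fraction of the current delay.
            sleep(delay * rng())
    raise RuntimeError("attempts must be at least 1")
```

The `sleep` and `rng` parameters exist so the retry loop can be tested without real waiting; production callers can leave the defaults.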

Planning headroom

For a quick capacity check:

One synchronous decision exchange (route + outcome) is roughly 4 messages on the relay: RouteDecision, a DecisionEvent to the device, a DecisionOutcome from the device, and a final SDK poll. At 200 msg/s per app that is about 50 user decisions per second sustained (200 / 4).
Heavy RegisterDevice bursts on launch days are limited by the unauthenticated bucket (5 msg/s). If you onboard thousands of devices in minutes, consider staggering the rollout so registrations are not rejected with RATE_LIMITED and forced to retry.
The per-connection ceiling matters for chatty bots — a Telegram bot delivering decisions to many users from one process should keep request rate per connection well under 20 msg/s.
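The arithmetic behind these estimates is simple enough to script. The sketch below takes the 4-messages-per-decision estimate and the published default rates at face value; the function names and the assumption of one unauthenticated message per device are illustrative.

```python
def decisions_per_second(app_msg_rate: float = 200, msgs_per_decision: float = 4) -> float:
    """Sustained decisions per second that fit in the per-app message budget."""
    return app_msg_rate / msgs_per_decision


def registration_minutes(
    devices: int,
    msgs_per_device: float = 1,  # assumption: one RegisterDevice call per device
    unauth_rate: float = 5,      # unauthenticated bucket, msg/s
) -> float:
    """Minimum wall-clock minutes to register `devices` at the sustained rate."""
    return devices * msgs_per_device / unauth_rate / 60


print(decisions_per_second())      # 50.0
print(registration_minutes(3000))  # 10.0
```

If each device also sends an Authenticate message through the same bucket, set `msgs_per_device=2`, which doubles the estimate.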

Related

Agent & Client API

Best Practices

Plans & Quota