Deceptively simple on the surface, rich in tradeoffs underneath. The rate limiter question appears at every level from mid to staff, and interviewers use it specifically because the algorithm choice, the distributed coordination problem, and the failure behaviour all reveal different depths of systems understanding.
A rate limiter question is not really about limiting rates. Interviewers use it to probe a cluster of specific skills that are hard to test with larger system designs.
Algorithm fluency. There are four main rate limiting algorithms — fixed window counter, sliding window log, token bucket, and leaky bucket — and each has distinct tradeoffs in memory usage, burst behaviour, and implementation complexity. Candidates who know only "token bucket" without being able to contrast it against sliding window log are signalling shallow preparation. The interviewer wants to hear you name the tradeoffs explicitly and make a justified choice for the specific use case.
Distributed state management. A single-server rate limiter is trivial — a hash map in memory. The interesting problem is: how do you rate limit a user who is sending requests to 100 different API servers simultaneously? Their counter is distributed across 100 machines. Keeping it consistent without adding latency to every request is the core engineering problem, and it has no perfect solution — only tradeoffs between exactness, latency, and complexity.
Atomic operations. The classic race condition in naive rate limiting: read the counter (value: 99), check that it is under the limit (100), increment — two threads interleave, both read 99, both pass the check, and both get approved. The counter lands at 101, and one over-limit request has been served. The interviewer is checking that you know atomic operations (Redis INCR, Lua scripts, compare-and-swap) and understand why they matter.
Graceful rejection behaviour. A rate limiter that simply drops requests creates poor user experience and is opaque to clients. A well-designed rate limiter returns standard HTTP 429 responses with Retry-After headers, rate limit headers on every response, and a sensible error body. The interviewer is checking product and API design awareness, not just systems thinking.
Rate limiters are infrastructure-layer components — they sit in front of your application and must be fast enough to not become a bottleneck. Let's estimate for a mid-scale API platform.
Traffic volume. Assume 10,000 API requests per second total across all users. The rate limiter must check and update a counter on every single one of these requests. If the check adds 1ms of latency, the system is adding 1ms to every user's API call — unacceptable for a component that should be invisible. Target: rate limit check must complete in under 1ms, which means sub-millisecond reads and writes to the counter store.
Counter storage per user. Each user needs one counter per rate limit window. For a token bucket, that's: user ID (8 bytes) + token count (4 bytes) + last refill timestamp (8 bytes) = 20 bytes of payload per user. For 1 million users: 20 MB of counter data (Redis per-key overhead inflates this severalfold, but the total is still well under a gigabyte). This fits comfortably in Redis memory — a single instance can hold hundreds of millions of these entries. Storage is not the constraint; latency is.
Redis throughput. A single Redis instance handles approximately 100,000 simple commands per second (higher with pipelining). At 10,000 RPS on the API, with one Redis INCR per request, Redis is at 10% capacity on a single instance. Add replication for availability, and you still have comfortable headroom. For very high traffic (100,000+ RPS), a Redis cluster with sharding by user ID distributes the load linearly.
Key expiration overhead. Each counter key has a TTL equal to the rate limit window (e.g., 60 seconds for "100 requests per minute"). Redis's lazy and active expiration handles this efficiently. At 1M active users with 60-second TTLs, approximately 16,667 keys expire per second — well within Redis's expiration capacity.
Algorithm choice: token bucket for API rate limiting. For most API rate limiting use cases, token bucket is the right choice. It allows bursting — a user who hasn't made requests in the last 10 seconds has accumulated tokens and can make 10 requests instantly. This matches how humans (and reasonable API clients) actually behave. Fixed window counter is simpler but allows a 2× burst at window boundaries (exhausting the limit at the end of one window and immediately starting the next). Sliding window log prevents this but stores individual request timestamps (memory scales with requests per user, not users). For an API that wants to allow burst while setting an average rate limit, token bucket is the standard answer.
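To make the mechanics concrete, here is a minimal single-process sketch of a token bucket in Python. The class and parameter names are illustrative, and a production limiter would keep this state in Redis rather than in process memory, as discussed next.

```python
import time

class TokenBucket:
    """Single-process token bucket: `capacity` is the burst size,
    `refill_rate` the sustained average in requests per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate        # tokens added per second
        self.tokens = capacity                # start full, so bursts are allowed
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazy refill: accrue tokens for the elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A client that has been idle for 10 seconds with a refill rate of 1 token per second has accumulated 10 tokens and can make 10 requests instantly: exactly the burst behaviour described above.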
Storage: centralised Redis. The rate limiter state must be centralised — not stored in the application servers' local memory — because requests from the same user can arrive at any application server. Redis is the standard choice: sub-millisecond latency, atomic INCR, built-in TTL, and horizontal scaling via cluster mode. The Redis key is typically: rate_limit:{user_id}:{window}. The value is the token count. INCR with EXPIRE is atomic enough for most use cases; a Lua script provides full atomicity (check-then-set as a single transaction).
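As a sketch of that pattern with redis-py (the host, limit, and window values here are assumptions for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis

def allow_request(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    key = f"rate_limit:{user_id}:{window_s}"
    count = r.incr(key)          # atomic increment, returns the new value
    if count == 1:
        r.expire(key, window_s)  # first request in the window starts the TTL
    return count <= limit
```

Note the small gap: if the process dies between the INCR and the EXPIRE, the key never expires. The Lua approach below closes that gap along with the check-then-increment race.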
Atomic token consumption with Lua. The naive approach — GET counter, check limit, INCR — has a race condition between the GET and the INCR. Redis Lua scripts execute atomically (Redis is single-threaded; the script runs without interruption). A Lua script that checks the current count, compares it to the limit, and either increments and returns "allowed" or returns "denied" without incrementing eliminates the race entirely. This adds ~50 microseconds of Lua execution overhead — negligible compared to network round-trip time.
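A sketch of such a script, registered via redis-py (key scheme and parameters as above; this is one way to write it, not the only one):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Executes as a single atomic unit: no other command interleaves.
RATE_LIMIT_LUA = """
local count = tonumber(redis.call('GET', KEYS[1]) or '0')
if count >= tonumber(ARGV[1]) then
  return 0                                 -- denied, without incrementing
end
count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[2])   -- first hit starts the window TTL
end
return 1                                   -- allowed
"""
check_and_increment = r.register_script(RATE_LIMIT_LUA)

def allow(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    key = f"rate_limit:{user_id}:{window_s}"
    return check_and_increment(keys=[key], args=[limit, window_s]) == 1
```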
Where to place the rate limiter. Options in order of preference: (1) API gateway — a single enforcement point before requests reach any application server. No per-service changes required. (2) Middleware library — rate limiting logic embedded in each service. Allows per-service customisation but requires consistent deployment across services. (3) Sidecar proxy — rate limiting in a service mesh sidecar (e.g., Envoy). Infrastructure-managed, application-agnostic. For a new system design: API gateway is the simplest and most maintainable answer.
Response headers. Every API response should include: X-RateLimit-Limit (your limit), X-RateLimit-Remaining (tokens left in window), X-RateLimit-Reset (when the window resets, as a Unix timestamp). On 429 responses, add Retry-After (seconds until the client can retry). These headers allow client libraries to implement intelligent backoff without polling the API. Include them on every response, not just 429s — clients use them to rate-limit themselves proactively.
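A framework-agnostic sketch of building these headers (the helper name is illustrative):

```python
import time

def rate_limit_headers(limit: int, remaining: int, reset_at: int) -> dict:
    """Attach to every response, not just 429s, so clients can back off proactively."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),    # Unix timestamp of window reset
    }
    if remaining <= 0:                         # this request will get a 429
        headers["Retry-After"] = str(max(1, reset_at - int(time.time())))
    return headers
```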
Multiple limit tiers. A production rate limiter typically enforces limits at multiple levels simultaneously: per-IP (protects against unauthenticated abuse), per-user (enforces service tier limits), per-endpoint (resource-specific protection — a heavy search endpoint may be limited more strictly than a lightweight status endpoint), and global (total traffic cap for capacity protection). Each level checks against its own Redis counter. If any level denies the request, a 429 is returned immediately.
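One way to express the tiered checks; the tier keys and limits below are illustrative, and `allow` reuses the atomic INCR pattern from earlier, generalised to take the full key:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def allow(key: str, limit: int, window_s: int) -> bool:
    # Same atomic-INCR pattern as above, keyed per tier rather than per user.
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)
    return count <= limit

def check_all_tiers(user_id: str, ip: str, endpoint: str) -> bool:
    tiers = [
        (f"rate_limit:ip:{ip}",                  1_000, 60),  # unauthenticated abuse
        (f"rate_limit:user:{user_id}",             100, 60),  # service-tier limit
        (f"rate_limit:ep:{endpoint}:{user_id}",     10, 60),  # heavy endpoint, stricter
        ("rate_limit:global",                  600_000, 60),  # total capacity cap
    ]
    # First denial wins and yields a 429.
    return all(allow(key, limit, window) for key, limit, window in tiers)
```

Because `all` short-circuits, a denial at one tier leaves tokens already consumed at earlier tiers; a production limiter has to decide whether that is acceptable or whether denied requests should be refunded.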
Load the API Rate Limiter blueprint in SysSimulator. The blueprint models a token bucket rate limiter in front of an API service, with Redis as the counter store and a configurable limit per user.
Set traffic to 10,000 RPS total with a limit of 100 requests/second per user and 100 simulated users (each sending 100 RPS — right at their limit). Observe: error rate should be near zero (users are at but not over limit), Redis utilisation should be moderate, API service should be handling traffic normally.
Now simulate a traffic spike — push one user to 500 RPS (5× their limit). Watch: that user's requests start hitting 429s at the rate limiter. The API service is protected — it never sees the excess 400 RPS. Record the exact error rate for the throttled user vs total system error rate — these numbers demonstrate that the rate limiter is protecting the downstream service precisely.
Then inject a Redis failure and observe the fail-open vs fail-closed behaviour. With fail-open: all traffic passes through (no rate limiting protection). With fail-closed: all traffic is rejected. The simulation shows you the concrete impact of each policy choice.
Open API Rate Limiter blueprint →
"I'm running the rate limiter at 10,000 RPS total with 100 users each at their 100 RPS limit. Everything is healthy — error rate is near zero, Redis is at 12% utilisation. Now I'll simulate one user sending 500 RPS — 5× their token bucket limit."
"[inject] The rate limiter starts rejecting 80% of that user's requests with 429s. The 400 excess RPS never reaches the API service — you can see the API service RPS is still at 10,000, not 10,400. The rate limiter is doing exactly what it's designed for: protecting the downstream service from a single misbehaving client. The other 99 users are unaffected."
"Now I'll inject a Redis failure. [inject] The rate limiter can no longer check counters. With a fail-open policy: all traffic passes through, error rate drops to zero, but we've lost all rate limiting protection. With a fail-closed policy: all 10,000 RPS are rejected with 503s — maximum protection, but we've taken down the API for all users including legitimate ones. I'd configure fail-open with an immediate alert and a circuit breaker timeout. The risk of a few minutes of unthrottled traffic is lower than the risk of a full outage."
"How do you handle a user sending requests to multiple servers simultaneously?" Without centralised state, each server has an independent counter. A user can send N requests to each of M servers, consuming N×M requests while appearing to each server as under-limit. The only correct answer is a centralised counter store (Redis) that all servers read from — trading a network round-trip for correctness.
"What's the race condition in a naive implementation?" GET → check → INCR: two requests arriving simultaneously both read the same counter value (99), both check against the limit (100), both increment. Counter lands at 101 — one request over limit was approved. Fix: Redis INCR is atomic, returning the new value in a single command. Check the return value: if it's over the limit, the caller was the one that exceeded it and should be rejected.
"How do you rate limit at the data centre level?" Exactly distributed rate limiting requires synchronised counters across DCs, adding cross-DC latency to every request. The practical answer for most systems: approximate limiting. Each DC maintains a local counter, synced to a central store every ~1 second. Users can exceed the limit by (number of DCs × sync interval × per-DC rate) in the worst case. For abuse prevention purposes, approximate limiting is almost always sufficient.
"How do you distinguish between token bucket and leaky bucket?" Token bucket: tokens accumulate up to a maximum, consumed per request. Allows bursting up to the bucket capacity. Leaky bucket: requests enter a queue and are processed at a fixed output rate. Excess requests either queue (introducing latency) or are dropped (introducing errors). Leaky bucket produces a perfectly smooth output rate; token bucket produces bursty output. For API rate limiting, token bucket is almost always preferred — clients want burst capability, and smooth output rate is rarely a requirement.
What is the difference between token bucket and sliding window rate limiting?
Token bucket allows bursting (accumulated tokens can be spent quickly). Sliding window tracks exact request counts in a rolling time window — no burst allowed. Token bucket uses constant memory per user; sliding window log uses memory proportional to request count. Token bucket is the standard choice for API rate limiting; sliding window for strict per-second enforcement.
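For contrast with the token bucket sketch earlier, here is a sliding window log using a Redis sorted set (the key scheme is illustrative). It stores one entry per request, which is the memory tradeoff mentioned above:

```python
import time
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

def allow_sliding_log(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    key = f"rate_limit:log:{user_id}"
    now = time.time()
    pipe = r.pipeline()                              # MULTI/EXEC transaction
    pipe.zremrangebyscore(key, 0, now - window_s)    # evict timestamps outside window
    pipe.zadd(key, {f"{now}:{uuid.uuid4()}": now})   # record this request
    pipe.zcard(key)                                  # count requests in the window
    pipe.expire(key, window_s)
    _, _, count, _ = pipe.execute()
    return count <= limit
```

This simplification also records denied requests; stricter variants remove the entry again when the count exceeds the limit.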
How does a distributed rate limiter work with Redis?
Per-user counters are stored in Redis. Each request runs an atomic Redis INCR and checks the returned value against the limit. A TTL on the key resets the counter after the window expires. Redis's single-threaded command execution makes INCR atomic, so there is no separate read-check-write race.
What happens if the rate limiter's Redis goes down?
Choose: fail-open (allow all traffic, lose rate limiting protection) or fail-closed (reject all traffic, protect downstream services). Most systems fail-open with an alert. The right answer depends on whether the rate limiter is protecting against abuse (fail-open preferred) or protecting an overloaded service (fail-closed preferred).
How do you handle rate limiting across multiple data centres?
Exact synchronisation requires cross-DC round trips on every request — too expensive. Approximate limiting: local counters per DC, synced periodically. Users can briefly exceed limits by a factor proportional to DC count and sync interval. For abuse prevention, this is almost always sufficient.
What rate limit headers should an API return?
X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset on every response. Retry-After on 429 responses. These allow clients to implement proactive backoff before hitting the limit.
Run this in SysSimulator → Browse all blueprints
Next in the series: Design a payment system →