Design a message queue (Kafka)

Designing a message queue like Kafka is one of the questions where interviewers explicitly want to see whether you understand log-based architecture versus traditional queues. The surface answer — "use a queue" — misses the real probes: why sequential disk writes beat in-memory queues at scale, how partitions enable parallel consumption without sacrificing per-partition ordering, and how the replication protocol ensures durability. This guide covers the architecture that underpins Kafka and the failure narration that demonstrates you've operated it.

What the interview is really asking

Log-based vs. traditional queue semantics. Traditional message queues (RabbitMQ, SQS) delete a message once it's acknowledged. Log-based systems (Kafka) retain messages for a configurable period regardless of consumption — multiple consumers can read the same message independently. The interviewer is checking whether you understand this distinction and can articulate when each model is appropriate (event streaming vs. work queue).

Partition design and throughput scaling. A single Kafka partition is a sequential log. Throughput scales by adding partitions, each on a different broker, each consumed by a different consumer in a group. The interviewer is checking whether you understand the relationship between partition count, consumer group size, and throughput — and the trade-off: more partitions means higher overhead (more file descriptors, more replication connections, longer leader election time).

Delivery guarantees. At-most-once, at-least-once, and exactly-once are meaningfully different. The interviewer is checking that you can map each to a concrete implementation mechanism (no retry, idempotent producer, transactional producer) and that you know the latency and throughput cost of each.
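A sketch of the producer settings behind each guarantee, using the standard Java client (the broker address and transactional id are illustrative placeholders; exact defaults vary by client version):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DeliveryConfigs {
    // At-most-once: fire-and-forget. A failed send is simply lost.
    static Properties atMostOnce() {
        Properties p = base();
        p.put(ProducerConfig.ACKS_CONFIG, "0");
        p.put(ProducerConfig.RETRIES_CONFIG, "0");
        return p;
    }

    // At-least-once: wait for all in-sync replicas, retry on failure.
    // Retries can duplicate messages unless idempotence is enabled.
    static Properties atLeastOnce() {
        Properties p = base();
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        return p;
    }

    // Exactly-once: idempotence dedups retried batches per partition;
    // a transactional.id extends this across partitions and offset commits.
    static Properties exactlyOnce() {
        Properties p = atLeastOnce();
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-etl-1"); // placeholder id
        return p;
    }

    static Properties base() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }
}
```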

Consumer group coordination. When a consumer joins or leaves a group, partitions are rebalanced. During rebalancing, no consumer in the group is consuming. The interviewer is checking whether you know about rebalance impact and mitigations (incremental cooperative rebalancing, static group membership).
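Both mitigations are client-side configuration. A sketch with the standard Java client (broker address, group name, and instance id are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public class RebalanceConfig {
    public static Properties consumerProps(String podName) {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "checkout-consumers");     // placeholder
        // Incremental cooperative rebalancing: only the partitions being
        // reassigned pause, instead of a stop-the-world revocation.
        p.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
              CooperativeStickyAssignor.class.getName());
        // Static membership: an instance that restarts and rejoins with the
        // same group.instance.id within the session timeout keeps its
        // partitions, so a rolling deploy does not trigger a rebalance.
        p.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, podName);
        return p;
    }
}
```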

Back-of-envelope estimation

Write throughput target. A large-scale system: 1 million messages/second ingest. Average message size: 1 KB. Write bandwidth: 1M × 1KB = 1 GB/second. Kafka brokers on SSD hardware can sustain ~1 GB/second sequential write throughput per broker. With 5 brokers, write bandwidth ceiling: 5 GB/second — well above target.

Replication write amplification. Replication factor 3: every byte written is replicated to 2 additional brokers. Effective disk write rate per cluster: 1 GB/sec × 3 = 3 GB/second aggregate across all brokers. Disk per broker with 7-day retention (replication included): 3 GB/sec × 86,400 sec/day × 7 days / 5 brokers ≈ 363 TB/broker, or ~1.8 PB cluster-wide. Use HDDs for log storage (sequential access, cheap capacity); SSDs for ZooKeeper/KRaft metadata.
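The same arithmetic written out, using only the worked example's assumptions:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        long messagesPerSec = 1_000_000L;
        long messageBytes = 1_000L;              // 1 KB average (decimal)
        int replicationFactor = 3;
        int brokers = 5;
        long retentionSecs = 7L * 24 * 3600;     // 7-day retention

        long ingestBytesPerSec = messagesPerSec * messageBytes;               // 1 GB/s
        long clusterWriteBytesPerSec = ingestBytesPerSec * replicationFactor; // 3 GB/s
        long diskPerBroker = clusterWriteBytesPerSec * retentionSecs / brokers;

        System.out.printf("ingest:      %.1f GB/s%n", ingestBytesPerSec / 1e9);
        System.out.printf("cluster:     %.1f GB/s (with replication)%n",
                          clusterWriteBytesPerSec / 1e9);
        System.out.printf("disk/broker: %.0f TB%n", diskPerBroker / 1e12); // ~363 TB
    }
}
```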

Partition count. Each partition is consumed by at most one consumer in a group. If the downstream consumer application has 100 instances, you need at least 100 partitions per topic to fully utilize the consumer fleet. Over-provisioning partitions has diminishing returns — cap at 2–3× the expected max consumer count. For this example: 200 partitions per topic.

Consumer lag budget. At 1M messages/second ingest and a consumer fleet processing only 800K messages/second, lag accumulates at 200K messages/second. If the slowdown lasts a 5-minute deployment window: 200K × 300 = 60M messages of lag. Scaling the fleet to 1.2M messages/second (20% above ingest) yields a net drain rate of 1.2M − 1M = 200K messages/second, so the backlog clears in 60M / 200K = 300 seconds. Retention must be set long enough to accommodate the worst-case processing delay (7 days is standard; never less than 3× the worst-case deployment window).
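The lag arithmetic, written out under the same assumptions:

```java
public class LagDrainEstimate {
    public static void main(String[] args) {
        long ingestPerSec = 1_000_000L;   // steady-state ingest
        long degradedPerSec = 800_000L;   // slow consumer fleet
        long slowdownSecs = 300L;         // 5-minute deployment window

        long lagGrowthPerSec = ingestPerSec - degradedPerSec;     // 200K/s
        long accumulatedLag = lagGrowthPerSec * slowdownSecs;     // 60M messages

        long recoveryPerSec = 1_200_000L; // scaled-up fleet, 20% above ingest
        long drainRatePerSec = recoveryPerSec - ingestPerSec;     // net 200K/s
        long drainSecs = accumulatedLag / drainRatePerSec;        // 300 s

        System.out.printf("lag after slowdown: %dM messages%n",
                          accumulatedLag / 1_000_000);
        System.out.printf("time to drain:      %d seconds%n", drainSecs);
    }
}
```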

Architecture decisions and why

Log-based storage. Each partition is an ordered, immutable sequence of records appended to a commit log file. Producers append to the end; consumers read from a specific offset. This design has two critical properties: (1) sequential disk writes — the OS serves these at near-memory speed through the page cache and readahead; (2) messages are not deleted after consumption — the broker retains all messages until the retention policy (time or size) expires. Any consumer can replay from any offset. This is fundamentally different from a work queue that deletes messages on acknowledgment.
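A toy sketch of one partition's log to make the offset semantics concrete. This is illustrative only: real Kafka persists segment files on disk and expires them by retention policy, none of which this models.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition: an append-only list indexed by offset.
public class PartitionLog {
    private final List<byte[]> records = new ArrayList<>();

    // Producers only ever append; the returned offset is the record's
    // permanent position in the partition.
    public synchronized long append(byte[] record) {
        records.add(record);
        return records.size() - 1;
    }

    // Consumers read from any offset they choose; reading deletes nothing,
    // so independent consumers can replay the same records freely.
    public synchronized List<byte[]> read(long fromOffset, int maxRecords) {
        int start = (int) Math.min(fromOffset, records.size());
        int end = (int) Math.min(start + (long) maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }

    public synchronized long endOffset() { // next offset to be written
        return records.size();
    }
}
```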

Partition leader and ISR replication. Every partition has one leader broker that handles all reads and writes. With replication factor N, N−1 follower brokers replicate from the leader. The in-sync replica set (ISR) tracks which followers are caught up (within replica.lag.time.max.ms). A producer with acks=all must wait for all ISR replicas to acknowledge the write before returning success. If a leader fails, the controller elects a surviving ISR member as the new leader — every ISR member already holds all committed messages, so nothing acknowledged is lost. If an ISR member falls behind, it is removed from the ISR (it can rejoin once caught up). min.insync.replicas=2 is the minimum safety floor: if fewer than 2 replicas are in sync, writes fail rather than risking data loss.
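A sketch of wiring these settings up with the Java AdminClient. The topic name and sizes follow the worked example; the broker address is a placeholder.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 200 partitions, 3 replicas each, per the estimation above.
            NewTopic events = new NewTopic("events", 200, (short) 3)
                .configs(Map.of(
                    // Writes fail if fewer than 2 replicas are in sync;
                    // combined with acks=all, this tolerates one broker loss
                    // without acknowledging data that could disappear.
                    "min.insync.replicas", "2",
                    "retention.ms", String.valueOf(7L * 24 * 3600 * 1000))); // 7 days
            admin.createTopics(List.of(events)).all().get();
        }
    }
}
```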

Consumer offset management. Each consumer group tracks its read position (offset) per partition. Offsets are committed back to Kafka itself (the __consumer_offsets internal topic), eliminating the need for an external coordination service. The consumer controls when to commit — after processing a batch, not before. This gives the consumer at-least-once semantics: if processing fails before commit, the message is reprocessed. For exactly-once, use transactional APIs to atomically commit the offset and the output message in one transaction.
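A minimal at-least-once consumer loop with the standard Java client (topic, group, and broker names are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-consumers");      // placeholder
        p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit manually
        p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringDeserializer");
        p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record); // must be idempotent: a crash before
                }                    // commitSync() replays this batch
                // Commit only after the whole batch is processed;
                // this ordering is what makes the semantics at-least-once.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n", record.topic(),
                          record.partition(), record.offset(), record.value());
    }
}
```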

KRaft: ZooKeeper replacement. Kafka historically used ZooKeeper for controller election and metadata management. Kafka 3.x replaces this dependency with KRaft (Kafka Raft): the controller role is now filled by a quorum of Kafka nodes running a Raft consensus protocol over an internal metadata log. Benefits: simpler operations (no ZooKeeper cluster to manage), faster controller failover (~5 seconds vs. ~30 seconds), and the ability to scale to millions of partitions without ZooKeeper bottlenecks.

Idempotent and transactional producers. Idempotent producer: the broker assigns each producer an ID, and every batch carries a monotonically increasing per-partition sequence number. The broker drops retried batches whose sequence number it has already written — producer retries cannot cause duplicate messages. Transactional producer: wraps writes across multiple partitions and the consumer offset commit into one atomic transaction. Consumers configured with isolation.level=read_committed only see messages from committed transactions. This achieves exactly-once end-to-end.
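A sketch of the consume-transform-produce loop these APIs enable. Topic, group, and broker names are placeholders, the transform is a stand-in, and error handling is reduced to the essentials.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOncePipeline {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "enrich-pipeline");        // placeholder
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted txns
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "enrich-pipeline-0"); // stable per instance
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringSerializer");
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            producer.initTransactions(); // fences zombie instances with the same id
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                if (batch.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : batch) {
                        producer.send(new ProducerRecord<>("events-enriched",
                                r.key(), r.value().toUpperCase())); // placeholder transform
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Output records and input offsets commit atomically:
                    // either both become visible or neither does.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();
                    // A real app would also rewind the consumer to its last
                    // committed offsets here so the aborted batch is retried.
                }
            }
        }
    }
}
```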

Run it in the simulator

Model the Kafka topology in SysSimulator: producers feeding a message queue, which fans out to multiple consumer service instances. Set up 3 consumer service nodes to represent a consumer group with 3 active consumers.

Set ingest to 100,000 messages/second. Observe even distribution across consumer nodes (each processes ~33K messages/second). Inject a consumer failure — one consumer goes down. Watch: its partitions are rebalanced to the remaining two consumers (each now processing ~50K messages/second). Observe the brief rebalance pause — this is the period where no messages are consumed during partition reassignment.

Then inject a broker failure. Watch: partition leaders on the failed broker trigger new leader elections; replication resumes from the new leaders. Observe that committed messages are not lost — consumers resume from their last committed offset on the new leader.

Open SysSimulator and model this →

Failure narration — word for word

"I'll inject a broker failure — one of our five brokers goes down. It holds the leader for 40 of our 200 partitions."

"[inject] The controller detects the broker is gone — heartbeat timeout after 10 seconds (session.timeout.ms). It identifies the 40 partitions whose leader was on the failed broker. For each, it picks the highest-offset ISR member as the new leader and updates the cluster metadata. Producer clients are notified of the new leaders via metadata refresh. Total leader election time: 10–15 seconds."

"During those 10–15 seconds, producers targeting partitions on the failed broker get NotLeaderOrFollowerException. They retry automatically — the Kafka producer client has built-in retry with backoff. No messages are lost because acks=all required all ISR replicas to acknowledge before the broker failure — the replicas have those messages."

"Consumer groups reading from the affected partitions pause during leader election (no leader to consume from). After election, they resume from their committed offset on the new leader. Consumer lag builds for ~15 seconds, then drains normally once the new leaders are active. Total observable impact: ~15 seconds of reduced throughput (remaining 160 partitions continue normally), then full recovery."

The question behind the question

"How do you handle a slow consumer that can't keep up with ingest rate?" Consumer lag is the queue depth for a consumer group. Monitoring: track consumer_lag per partition in your metrics pipeline. Mitigation: add more consumer instances (up to partition count). If partition count is the bottleneck, increase partition count for the topic — note this requires redistributing existing partitions, which causes a brief leadership election round. For lag spikes due to slow processing, consider splitting the heavy processing into a separate topic/consumer pipeline with different throughput characteristics.

"When would you choose RabbitMQ over Kafka?" RabbitMQ is better for: complex routing (per-message header-based routing, dead letter queues, TTL-per-message, priority queues), transient tasks where you don't need replay, small message volumes where Kafka's overhead isn't justified. Kafka is better for: high-throughput event streaming, multi-consumer fan-out, event replay (audit logs, data pipelines), consumer lag tolerance. If the use case is "many independent services need to consume the same events at their own pace," Kafka wins. If it's "route this specific job to the right worker," RabbitMQ wins.

"How do you handle message ordering across partitions?" Kafka guarantees ordering within a partition, not across partitions. For use cases requiring global ordering (all events for a specific user in order), partition by the key that must be ordered (user_id). All events for user_id 12345 go to the same partition and are consumed in order. For global total ordering, use a single partition — but this caps throughput at one consumer per consumer group.

Frequently asked questions

How does Kafka achieve high throughput?
Sequential disk writes (10–100× faster than random I/O), zero-copy sendfile() syscall (no user-space copy), batching (amortizes per-message overhead), and partitioning (parallel writes/reads). Millions of messages/second on commodity hardware.
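A sketch of the producer-side batching knobs (values are illustrative starting points, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ThroughputTuning {
    public static Properties props() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        // Batching: wait up to 10 ms to fill batches of up to 64 KB,
        // amortizing per-message request overhead.
        p.put(ProducerConfig.LINGER_MS_CONFIG, "10");
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(64 * 1024));
        // Compression applies per batch, multiplying the effect of batching.
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return p;
    }
}
```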

What is the difference between at-least-once and exactly-once delivery in Kafka?
At-least-once: producer retries can cause duplicates; consumers must be idempotent. Exactly-once: idempotent producer (sequence number dedup) + transactional APIs + isolation.level=read_committed. Higher latency cost — use only when duplicates are unacceptable.

Why does Kafka use consumer groups?
Parallel consumption with per-partition ordering. Each partition → one consumer in the group. Scale by adding consumers (up to partition count). Multiple groups consume the same topic independently at their own pace.

How does Kafka handle broker failures?
Leader election from in-sync replica set (ISR). With replication factor 3 and min.insync.replicas=2, tolerates one broker failure with no data loss. Election takes ~10–15 seconds; producers retry automatically. KRaft mode improves election to ~5 seconds.

Run this in SysSimulator →   Browse all blueprints

Next: Design a proximity service →