Design a message queue (Kafka)

Designing a message queue like Kafka is one of the questions where interviewers explicitly want to see whether you understand log-based architecture versus traditional queues. The surface answer — "use a queue" — misses the real probes: why sequential disk writes beat in-memory queues at scale, how partitions enable parallel consumption without sacrificing per-partition ordering, and how the replication protocol ensures durability. This guide covers the architecture that underpins Kafka and the failure narration that demonstrates you've operated it.

What the interview is really asking

Log-based vs. traditional queue semantics. Traditional message queues (RabbitMQ, SQS) delete a message once it's acknowledged. Log-based systems (Kafka) retain messages for a configurable period regardless of consumption — multiple consumers can read the same message independently. The interviewer is checking whether you understand this distinction and can articulate when each model is appropriate (event streaming vs. work queue).

Partition design and throughput scaling. A single Kafka partition is a sequential log. Throughput scales by adding partitions, each on a different broker, each consumed by a different consumer in a group. The interviewer is checking whether you understand the relationship between partition count, consumer group size, and throughput — and the trade-off: more partitions means higher overhead (more file descriptors, more replication connections, longer leader election time).

Delivery guarantees. At-most-once, at-least-once, and exactly-once are meaningfully different. The interviewer is checking that you can map each to a concrete implementation mechanism (no retry, idempotent producer, transactional producer) and that you know the latency and throughput cost of each.
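A sketch of the producer settings behind each guarantee, using the standard Java client (the broker address and transactional id are illustrative placeholders; exact defaults vary by client version):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class DeliveryConfigs {
    // At-most-once: fire-and-forget. A failed send is simply lost.
    static Properties atMostOnce() {
        Properties p = base();
        p.put(ProducerConfig.ACKS_CONFIG, "0");
        p.put(ProducerConfig.RETRIES_CONFIG, "0");
        return p;
    }

    // At-least-once: wait for all in-sync replicas, retry on failure.
    // Retries can duplicate messages unless idempotence is enabled.
    static Properties atLeastOnce() {
        Properties p = base();
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        return p;
    }

    // Exactly-once: idempotence dedups retried batches per partition;
    // a transactional.id extends this across partitions and offset commits.
    static Properties exactlyOnce() {
        Properties p = atLeastOnce();
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-etl-1"); // placeholder id
        return p;
    }

    static Properties base() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }
}
```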

Consumer group coordination. When a consumer joins or leaves a group, partitions are rebalanced. During rebalancing, no consumer in the group is consuming. The interviewer is checking whether you know about rebalance impact and mitigations (incremental cooperative rebalancing, static group membership).
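Both mitigations are client-side configuration. A sketch with the standard Java client (broker address, group name, and instance id are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public class RebalanceConfig {
    public static Properties consumerProps(String podName) {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "checkout-consumers");     // placeholder
        // Incremental cooperative rebalancing: only the partitions being
        // reassigned pause, instead of a stop-the-world revocation.
        p.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
              CooperativeStickyAssignor.class.getName());
        // Static membership: an instance that restarts and rejoins with the
        // same group.instance.id within the session timeout keeps its
        // partitions, so a rolling deploy does not trigger a rebalance.
        p.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, podName);
        return p;
    }
}
```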

Back-of-envelope estimation

Write throughput target. A large-scale system: 1 million messages/second ingest. Average message size: 1 KB. Write bandwidth: 1M × 1KB = 1 GB/second. Kafka brokers on SSD hardware can sustain ~1 GB/second sequential write throughput per broker. With 5 brokers, write bandwidth ceiling: 5 GB/second — well above target.

Replication write amplification. Replication factor 3: every byte written is replicated to 2 additional brokers. Effective disk write rate per cluster: 1 GB/sec × 3 = 3 GB/second aggregate across all brokers. Disk per broker with 7-day retention (replication included): 3 GB/sec × 86,400 sec/day × 7 days / 5 brokers ≈ 363 TB/broker, or ~1.8 PB cluster-wide. Use HDDs for log storage (sequential access, cheap capacity); SSDs for ZooKeeper/KRaft metadata.
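The same arithmetic written out, using only the worked example's assumptions:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        long messagesPerSec = 1_000_000L;
        long messageBytes = 1_000L;              // 1 KB average (decimal)
        int replicationFactor = 3;
        int brokers = 5;
        long retentionSecs = 7L * 24 * 3600;     // 7-day retention

        long ingestBytesPerSec = messagesPerSec * messageBytes;               // 1 GB/s
        long clusterWriteBytesPerSec = ingestBytesPerSec * replicationFactor; // 3 GB/s
        long diskPerBroker = clusterWriteBytesPerSec * retentionSecs / brokers;

        System.out.printf("ingest:      %.1f GB/s%n", ingestBytesPerSec / 1e9);
        System.out.printf("cluster:     %.1f GB/s (with replication)%n",
                          clusterWriteBytesPerSec / 1e9);
        System.out.printf("disk/broker: %.0f TB%n", diskPerBroker / 1e12); // ~363 TB
    }
}
```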

Partition count. Each partition is consumed by at most one consumer in a group. If the downstream consumer application has 100 instances, you need at least 100 partitions per topic to fully utilize the consumer fleet. Over-provisioning partitions has diminishing returns — cap at 2–3× the expected max consumer count. For this example: 200 partitions per topic.

Consumer lag budget. At 1M messages/second ingest and a consumer fleet processing only 800K messages/second, lag accumulates at 200K messages/second. If the slowdown lasts a 5-minute deployment window: 200K × 300 = 60M messages of lag. Scaling the fleet to 1.2M messages/second (20% above ingest) yields a net drain rate of 1.2M − 1M = 200K messages/second, so the backlog clears in 60M / 200K = 300 seconds. Retention must be set long enough to accommodate the worst-case processing delay (7 days is standard; never less than 3× the worst-case deployment window).
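The lag arithmetic, written out under the same assumptions:

```java
public class LagDrainEstimate {
    public static void main(String[] args) {
        long ingestPerSec = 1_000_000L;   // steady-state ingest
        long degradedPerSec = 800_000L;   // slow consumer fleet
        long slowdownSecs = 300L;         // 5-minute deployment window

        long lagGrowthPerSec = ingestPerSec - degradedPerSec;     // 200K/s
        long accumulatedLag = lagGrowthPerSec * slowdownSecs;     // 60M messages

        long recoveryPerSec = 1_200_000L; // scaled-up fleet, 20% above ingest
        long drainRatePerSec = recoveryPerSec - ingestPerSec;     // net 200K/s
        long drainSecs = accumulatedLag / drainRatePerSec;        // 300 s

        System.out.printf("lag after slowdown: %dM messages%n",
                          accumulatedLag / 1_000_000);
        System.out.printf("time to drain:      %d seconds%n", drainSecs);
    }
}
```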

Architecture decisions and why

Log-based storage. Each partition is an ordered, immutable sequence of records appended to a commit log file. Producers append to the end; consumers read from a specific offset. This design has two critical properties: (1) sequential disk writes — the OS serves these at near-memory speed through the page cache and readahead; (2) messages are not deleted after consumption — the broker retains all messages until the retention policy (time or size) expires. Any consumer can replay from any offset. This is fundamentally different from a work queue that deletes messages on acknowledgment.
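A toy sketch of one partition's log to make the offset semantics concrete. This is illustrative only: real Kafka persists segment files on disk and expires them by retention policy, none of which this models.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition: an append-only list indexed by offset.
public class PartitionLog {
    private final List<byte[]> records = new ArrayList<>();

    // Producers only ever append; the returned offset is the record's
    // permanent position in the partition.
    public synchronized long append(byte[] record) {
        records.add(record);
        return records.size() - 1;
    }

    // Consumers read from any offset they choose; reading deletes nothing,
    // so independent consumers can replay the same records freely.
    public synchronized List<byte[]> read(long fromOffset, int maxRecords) {
        int start = (int) Math.min(fromOffset, records.size());
        int end = (int) Math.min(start + (long) maxRecords, records.size());
        return new ArrayList<>(records.subList(start, end));
    }

    public synchronized long endOffset() { // next offset to be written
        return records.size();
    }
}
```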

Partition leader and ISR replication. Every partition has one leader broker that handles all reads and writes. With replication factor N, N−1 follower brokers replicate from the leader. The in-sync replica set (ISR) tracks which followers are caught up (within replica.lag.time.max.ms). A producer with acks=all must wait for all ISR replicas to acknowledge the write before returning success. If a leader fails, the controller elects a surviving ISR member as the new leader — every ISR member already holds all committed messages, so nothing acknowledged is lost. If an ISR member falls behind, it is removed from the ISR (it can rejoin once caught up). min.insync.replicas=2 is the minimum safety floor: if fewer than 2 replicas are in sync, writes fail rather than risking data loss.
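A sketch of wiring these settings up with the Java AdminClient. The topic name and sizes follow the worked example; the broker address is a placeholder.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 200 partitions, 3 replicas each, per the estimation above.
            NewTopic events = new NewTopic("events", 200, (short) 3)
                .configs(Map.of(
                    // Writes fail if fewer than 2 replicas are in sync;
                    // combined with acks=all, this tolerates one broker loss
                    // without acknowledging data that could disappear.
                    "min.insync.replicas", "2",
                    "retention.ms", String.valueOf(7L * 24 * 3600 * 1000))); // 7 days
            admin.createTopics(List.of(events)).all().get();
        }
    }
}
```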

Consumer offset management. Each consumer group tracks its read position (offset) per partition. Offsets are committed back to Kafka itself (the __consumer_offsets internal topic), eliminating the need for an external coordination service. The consumer controls when to commit — after processing a batch, not before. This gives the consumer at-least-once semantics: if processing fails before commit, the message is reprocessed. For exactly-once, use transactional APIs to atomically commit the offset and the output message in one transaction.
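A minimal at-least-once consumer loop with the standard Java client (topic, group, and broker names are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        p.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-consumers");      // placeholder
        p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit manually
        p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringDeserializer");
        p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record); // must be idempotent: a crash before
                }                    // commitSync() replays this batch
                // Commit only after the whole batch is processed;
                // this ordering is what makes the semantics at-least-once.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n", record.topic(),
                          record.partition(), record.offset(), record.value());
    }
}
```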

KRaft: ZooKeeper replacement. Kafka historically used ZooKeeper for controller election and metadata management. Kafka 3.x replaces this dependency with KRaft (Kafka Raft): the controller role is now filled by a quorum of Kafka nodes running a Raft consensus protocol over an internal metadata log. Benefits: simpler operations (no ZooKeeper cluster to manage), faster controller failover (~5 seconds vs. ~30 seconds), and the ability to scale to millions of partitions without ZooKeeper bottlenecks.

Idempotent and transactional producers. Idempotent producer: the broker assigns each producer an ID, and every batch carries a monotonically increasing per-partition sequence number. The broker drops retried batches whose sequence number it has already written — producer retries cannot cause duplicate messages. Transactional producer: wraps writes across multiple partitions and the consumer offset commit into one atomic transaction. Consumers configured with isolation.level=read_committed only see messages from committed transactions. This achieves exactly-once end-to-end.
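A sketch of the consume-transform-produce loop these APIs enable. Topic, group, and broker names are placeholders, the transform is a stand-in, and error handling is reduced to the essentials.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOncePipeline {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "enrich-pipeline");        // placeholder
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted txns
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "enrich-pipeline-0"); // stable per instance
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringSerializer");
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
               "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            producer.initTransactions(); // fences zombie instances with the same id
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                if (batch.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : batch) {
                        producer.send(new ProducerRecord<>("events-enriched",
                                r.key(), r.value().toUpperCase())); // placeholder transform
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Output records and input offsets commit atomically:
                    // either both become visible or neither does.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();
                    // A real app would also rewind the consumer to its last
                    // committed offsets here so the aborted batch is retried.
                }
            }
        }
    }
}
```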

Run it in the simulator

Model the Kafka topology in SysSimulator: producers feeding a message queue, which fans out to multiple consumer service instances. Set up 3 consumer service nodes to represent a consumer group with 3 active consumers.

Set ingest to 100,000 messages/second. Observe even distribution across consumer nodes (each processes ~33K messages/second). Inject a consumer failure — one consumer goes down. Watch: its partitions are rebalanced to the remaining two consumers (each now processing ~50K messages/second). Observe the brief rebalance pause — this is the period where no messages are consumed during partition reassignment.

Then inject a broker failure. Watch: partition leaders on the failed broker trigger new leader elections; replication resumes from the new leaders. Observe that committed messages are not lost — consumers resume from their last committed offset on the new leader.

Open SysSimulator and model this →

Failure narration — word for word

"I'll inject a broker failure — one of our five brokers goes down. It holds the leader for 40 of our 200 partitions."

"[inject] The controller detects the broker is gone — heartbeat timeout after 10 seconds (session.timeout.ms). It identifies the 40 partitions whose leader was on the failed broker. For each, it picks the highest-offset ISR member as the new leader and updates the cluster metadata. Producer clients are notified of the new leaders via metadata refresh. Total leader election time: 10–15 seconds."

"During those 10–15 seconds, producers targeting partitions on the failed broker get NotLeaderOrFollowerException. They retry automatically — the Kafka producer client has built-in retry with backoff. No messages are lost because acks=all required all ISR replicas to acknowledge before the broker failure — the replicas have those messages."

"Consumer groups reading from the affected partitions pause during leader election (no leader to consume from). After election, they resume from their committed offset on the new leader. Consumer lag builds for ~15 seconds, then drains normally once the new leaders are active. Total observable impact: ~15 seconds of reduced throughput (remaining 160 partitions continue normally), then full recovery."

The question behind the question

"How do you handle a slow consumer that can't keep up with ingest rate?" Consumer lag is the queue depth for a consumer group. Monitoring: track consumer_lag per partition in your metrics pipeline. Mitigation: add more consumer instances (up to partition count). If partition count is the bottleneck, increase partition count for the topic — note this requires redistributing existing partitions, which causes a brief leadership election round. For lag spikes due to slow processing, consider splitting the heavy processing into a separate topic/consumer pipeline with different throughput characteristics.

"When would you choose RabbitMQ over Kafka?" RabbitMQ is better for: complex routing (per-message header-based routing, dead letter queues, TTL-per-message, priority queues), transient tasks where you don't need replay, small message volumes where Kafka's overhead isn't justified. Kafka is better for: high-throughput event streaming, multi-consumer fan-out, event replay (audit logs, data pipelines), consumer lag tolerance. If the use case is "many independent services need to consume the same events at their own pace," Kafka wins. If it's "route this specific job to the right worker," RabbitMQ wins.

"How do you handle message ordering across partitions?" Kafka guarantees ordering within a partition, not across partitions. For use cases requiring global ordering (all events for a specific user in order), partition by the key that must be ordered (user_id). All events for user_id 12345 go to the same partition and are consumed in order. For global total ordering, use a single partition — but this caps throughput at one consumer per consumer group.

Frequently asked questions

How does Kafka achieve high throughput?
Sequential disk writes (10–100× faster than random I/O), zero-copy sendfile() syscall (no user-space copy), batching (amortizes per-message overhead), and partitioning (parallel writes/reads). Millions of messages/second on commodity hardware.
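A sketch of the producer-side batching knobs (values are illustrative starting points, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ThroughputTuning {
    public static Properties props() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        // Batching: wait up to 10 ms to fill batches of up to 64 KB,
        // amortizing per-message request overhead.
        p.put(ProducerConfig.LINGER_MS_CONFIG, "10");
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(64 * 1024));
        // Compression applies per batch, multiplying the effect of batching.
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return p;
    }
}
```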

What is the difference between at-least-once and exactly-once delivery in Kafka?
At-least-once: producer retries can cause duplicates; consumers must be idempotent. Exactly-once: idempotent producer (sequence number dedup) + transactional APIs + isolation.level=read_committed. Higher latency cost — use only when duplicates are unacceptable.

Why does Kafka use consumer groups?
Parallel consumption with per-partition ordering. Each partition → one consumer in the group. Scale by adding consumers (up to partition count). Multiple groups consume the same topic independently at their own pace.

How does Kafka handle broker failures?
Leader election from in-sync replica set (ISR). With replication factor 3 and min.insync.replicas=2, tolerates one broker failure with no data loss. Election takes ~10–15 seconds; producers retry automatically. KRaft mode improves election to ~5 seconds.

Run this in SysSimulator →   Browse all blueprints

Next: Design a proximity service →