One of the highest-frequency system design interview questions at FAANG and top-tier companies. WhatsApp tests connection management, delivery guarantees, and fault tolerance — three skills interviewers probe explicitly. This guide gives you the estimation, architecture reasoning, and failure narration to answer it completely.
The WhatsApp question is deceptively simple on the surface: "design a chat system." But interviewers use it to probe a specific set of skills that a Twitter or YouTube question doesn't cover as directly.
Stateful connection management. HTTP is stateless — each request is independent. Chat is inherently stateful — the server must maintain a persistent, bidirectional connection to each active user. The interviewer is evaluating whether you understand what changes at scale when you move from stateless REST to stateful WebSocket connections. File descriptors, memory per connection, routing across a server fleet — these are the traps that catch underprepared candidates.
Delivery semantics. "The message was delivered" is not a binary. There are at least three distinct states in WhatsApp: sent (server received it), delivered (recipient device received it), read (recipient opened it). The interviewer is checking that you can design ACK chains, distinguish between at-least-once and exactly-once delivery, and explain why exactly-once is expensive enough that most messaging systems settle for at-least-once with idempotent message IDs.
Offline state handling. What happens when the recipient's phone is off? This seems trivial, but the answer reveals whether a candidate understands message queuing, push notification integration, and the difference between push and pull delivery models. The follow-up — "what if the user is offline for two weeks?" — probes message retention policy and storage cost awareness.
Presence at scale. "Is this contact online right now?" is a simple product feature that creates non-obvious infrastructure costs. Broadcasting presence updates to all of a user's contacts every time they connect or disconnect generates quadratic amplification: a user with 500 contacts who connects generates 500 presence update events. Multiply across 50 million concurrent connections and the naive design collapses immediately. The interviewer is checking whether you see this problem.
WhatsApp's scale: roughly 2 billion registered users (a figure WhatsApp has stated publicly) and hundreds of millions of daily actives. Use 500M DAU for your interview estimates — it's realistic and produces clean numbers.
Concurrent connections. At any moment, roughly 10% of DAU are actively connected. That's 50 million concurrent WebSocket connections. Each connection requires a file descriptor (per-process limits often default to as little as 1,024 on Linux, but are configurable to millions with tuning) and approximately 50–100 KB of memory overhead for the connection state and send/receive buffers. 50M connections × 65 KB = ~3.25 TB of RAM just for connection state across the fleet. Your connection server tier needs to be sized for memory, not just CPU.
Message volume. WhatsApp processes approximately 100 billion messages per day — this is a publicly stated figure. 100B / 86,400 seconds ≈ 1.16 million messages per second. This includes text, media pointers, delivery ACKs, and read receipts. Raw text messages are tiny (under 1 KB each). Media references are also small — the actual media is stored in blob storage and referenced by ID. The messaging pipeline handles message routing and ACK delivery; media upload/download is a separate pipeline.
Storage. Retaining messages server-side for delivery to offline users requires temporary storage. WhatsApp's model: messages are deleted from the server once delivered and ACK'd. For the window between send and delivery: 1.16M messages/sec × 1 KB average × average 30-second delivery window ≈ ~35 GB of in-flight message storage at any moment. For undelivered messages to truly offline users, storage grows with offline duration. At 1.16M messages/sec with a 1% offline rate, messages accumulate at ~11,600 messages/sec for offline users — call it 1 TB/day across the fleet for offline message queues.
Media storage. Approximately 30% of messages include media. At 1.16M messages/sec, that's roughly 350,000 media items/sec, averaging perhaps 200 KB each. That's 70 GB/sec of media storage writes — clearly not all retained permanently. WhatsApp's approach: media is end-to-end encrypted and stored on the recipient's device. The server stores it only transiently during delivery, then deletes it.
The critical number to commit to memory: 50 million concurrent WebSocket connections and 1.16 million messages per second. These are the numbers that drive every architectural decision.
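If you want to sanity-check these figures in front of the interviewer, the whole estimation fits in a short script. The constants below are the assumptions stated above (500M DAU, 10% concurrency, ~65 KB per connection, 100B messages/day), not measured values; decimal units keep the arithmetic clean.

```python
# Back-of-envelope numbers for the design. All constants are the stated
# assumptions above, not measured values.
DAU = 500_000_000                    # daily active users
CONCURRENCY_RATIO = 0.10             # fraction of DAU connected at any moment
BYTES_PER_CONNECTION = 65_000        # midpoint of the 50-100 KB per-connection estimate
MESSAGES_PER_DAY = 100_000_000_000   # publicly stated figure
AVG_MESSAGE_BYTES = 1_000            # text message or media pointer
DELIVERY_WINDOW_SEC = 30             # average send-to-deliver window
CONNECTIONS_PER_SERVER = 50_000      # sizing target per connection server

concurrent = int(DAU * CONCURRENCY_RATIO)                                      # ~50M
msgs_per_sec = MESSAGES_PER_DAY / 86_400                                       # ~1.16M
fleet_ram_tb = concurrent * BYTES_PER_CONNECTION / 1e12                        # ~3.25 TB
in_flight_gb = msgs_per_sec * AVG_MESSAGE_BYTES * DELIVERY_WINDOW_SEC / 1e9    # ~35 GB
servers = concurrent // CONNECTIONS_PER_SERVER                                 # ~1,000
ram_per_server_gb = CONNECTIONS_PER_SERVER * BYTES_PER_CONNECTION / 1e9        # ~3.25 GB

print(f"{concurrent:,} concurrent connections, {msgs_per_sec:,.0f} messages/sec")
print(f"{fleet_ram_tb:.2f} TB fleet RAM for connection state "
      f"({ram_per_server_gb:.2f} GB per server across {servers:,} servers)")
print(f"{in_flight_gb:.0f} GB of in-flight message storage")
```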
Walk through decisions in order of what makes WhatsApp different from a standard REST API system. The difference is the stateful connection layer.
WebSocket connection servers. Long-polling (periodic HTTP requests to check for new messages) can work at small scale but fails at WhatsApp scale: 50M connections polling every second generates 50M HTTP requests/sec of overhead before a single message is sent. WebSockets establish a persistent bidirectional TCP connection, reducing per-message overhead to framing bytes. The connection server maintains an in-memory map of user ID → connection. When a message arrives for user X, the routing layer looks up which connection server holds user X's connection and routes the message there.
Connection routing. With a fleet of connection servers, the routing challenge is: how do you know which server holds a given user's connection? Three approaches: (1) consistent hashing — map user ID to a server deterministically; reconnections go to the same server. Problem: server failures require re-hashing. (2) A central routing registry — each server registers its connections in Redis. The message router looks up user → server in Redis. (3) User → server mapping via service discovery. WhatsApp effectively uses approach (2): a Redis-backed connection registry allows any server to route a message to the correct connection server without coordinating directly.
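A minimal sketch of approach (2), using redis-py; the key names, TTL, and function signatures are illustrative, not WhatsApp's actual schema.

```python
# Redis-backed connection registry: user_id -> connection server id.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
REGISTRY_TTL_SEC = 60   # entry disappears unless the connection server keeps refreshing it

def register_connection(user_id: str, server_id: str) -> None:
    """Called by a connection server when a user's WebSocket is established."""
    r.set(f"conn:{user_id}", server_id, ex=REGISTRY_TTL_SEC)

def heartbeat(user_id: str) -> None:
    """Refresh the entry so it doesn't expire while the socket is still alive."""
    r.expire(f"conn:{user_id}", REGISTRY_TTL_SEC)

def unregister_connection(user_id: str) -> None:
    """Clean disconnect; crashed servers are handled by TTL expiry instead."""
    r.delete(f"conn:{user_id}")

def route(user_id: str) -> str | None:
    """Message router: which connection server holds this user's socket right now?
    None means no live connection, so the message takes the offline path."""
    return r.get(f"conn:{user_id}")
```

The TTL doubles as failure detection: if a connection server crashes without cleaning up, its entries simply expire and the router falls back to the offline path for those users.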
Message delivery and ACK chain. Exactly-once delivery is expensive — it requires distributed transactions and idempotency machinery. At-least-once delivery is achievable with a simple ACK pattern. The flow: (1) Sender client sends message with a unique message ID. (2) Connection server writes message to Cassandra message queue for recipient and returns ACK to sender (one check mark). (3) Connection server attempts to deliver to recipient's WebSocket. (4) Recipient client processes message and sends ACK. (5) Connection server marks message as delivered in Cassandra, sends delivery confirmation to sender (two check marks). (6) Sender's client stores this for the read receipt flow. If step (3) fails (recipient offline), the message stays in Cassandra and is delivered on reconnect.
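The same flow, sketched as server-side handlers. The store and socket layer below are in-memory stand-ins for Cassandra and the WebSocket fleet; everything here is illustrative.

```python
# Server-side sketch of the at-least-once ACK chain above.
from dataclasses import dataclass, field

@dataclass
class MessageStore:                      # stand-in for the Cassandra message queue
    pending: dict = field(default_factory=dict)    # message_id -> (recipient, payload)
    delivered: set = field(default_factory=set)

    def enqueue(self, msg_id, recipient, payload):
        self.pending[msg_id] = (recipient, payload)

    def mark_delivered(self, msg_id):
        self.pending.pop(msg_id, None)
        self.delivered.add(msg_id)

@dataclass
class SocketLayer:                       # stand-in for the WebSocket connections
    online: set = field(default_factory=set)
    frames: list = field(default_factory=list)

    def is_connected(self, user): return user in self.online
    def send(self, user, frame): self.frames.append((user, frame))

def handle_send(store, sockets, msg_id, sender, recipient, payload):
    # Step 2: durably enqueue BEFORE attempting delivery, then ACK the sender.
    store.enqueue(msg_id, recipient, payload)
    sockets.send(sender, {"type": "server_ack", "msg_id": msg_id})      # one check mark
    # Step 3: best-effort delivery if the recipient has a live socket.
    if sockets.is_connected(recipient):
        sockets.send(recipient, {"type": "message", "msg_id": msg_id, "body": payload})
    # Otherwise the message stays queued; the offline path takes over.

def handle_recipient_ack(store, sockets, msg_id, sender):
    # Steps 4-5: recipient ACKed, so mark delivered and notify the sender.
    store.mark_delivered(msg_id)
    sockets.send(sender, {"type": "delivered", "msg_id": msg_id})       # two check marks
```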
Cassandra for message storage. Message storage maps perfectly to Cassandra's data model. Partition key: conversation_id (or user_id for 1-to-1 chats). Clustering column: message_timestamp (descending — newest first). Columns: message_id, content, media_ref, sender_id, delivery_status. Cassandra's write-optimised design (append-only LSM tree) handles 1.16M writes/sec efficiently. Reads are also fast for recent messages because they hit the memtable or the most recent SSTable files on disk.
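One plausible table layout matching that description, written as CQL held in Python strings; column names and types are illustrative, not WhatsApp's actual schema.

```python
# Message table: partitioned by conversation, clustered newest-first.
CREATE_MESSAGES = """
CREATE TABLE IF NOT EXISTS messages (
    conversation_id  uuid,
    message_ts       timestamp,
    message_id       uuid,
    sender_id        uuid,
    content          blob,      -- end-to-end encrypted payload
    media_ref        text,      -- pointer into blob storage, not the media itself
    delivery_status  tinyint,   -- 0 = sent, 1 = delivered, 2 = read
    PRIMARY KEY ((conversation_id), message_ts, message_id)
) WITH CLUSTERING ORDER BY (message_ts DESC, message_id DESC);
"""

# "Latest messages in this chat" is a single-partition slice in clustering
# order, which is exactly the access pattern Cassandra is built for.
RECENT_MESSAGES = """
SELECT message_id, sender_id, content, media_ref, delivery_status
FROM messages WHERE conversation_id = ? LIMIT 50;
"""
```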
Presence service. Presence state (online/offline, last seen) is stored in Redis with a short TTL. Each connection server publishes presence events when users connect or disconnect. The key optimization: presence updates are not broadcast to all contacts. Instead, when user A opens a chat with user B, A's client subscribes to B's presence. When A closes the chat, the subscription expires. This limits presence fan-out to active conversations rather than the entire contact list — reducing a potential quadratic amplification to a near-linear workload.
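A sketch of that subscription model, again with hypothetical Redis key names and TTLs.

```python
# Presence: a short-TTL flag in Redis plus per-chat subscriptions,
# instead of broadcasting to the whole contact list.
import redis

r = redis.Redis(decode_responses=True)
PRESENCE_TTL_SEC = 30    # refreshed by connection-server heartbeats

def set_online(user_id: str) -> None:
    r.set(f"presence:{user_id}", "online", ex=PRESENCE_TTL_SEC)

def set_offline(user_id: str, last_seen_iso: str) -> None:
    r.delete(f"presence:{user_id}")
    r.set(f"last_seen:{user_id}", last_seen_iso)     # no TTL; shown as "last seen"

def open_chat(viewer_id: str, contact_id: str, ttl_sec: int = 300) -> None:
    # Viewer opens the chat with contact: subscribe to that contact's presence.
    # The subscription expires when the chat goes idle, so fan-out stays limited
    # to conversations that are actually open.
    r.sadd(f"presence_subs:{contact_id}", viewer_id)
    r.expire(f"presence_subs:{contact_id}", ttl_sec)

def presence_audience(user_id: str) -> set[str]:
    # Only users currently viewing this contact's chat receive the update.
    return r.smembers(f"presence_subs:{user_id}")
```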
Push notifications for offline delivery. When a message arrives for an offline user, the connection server checks the connection registry (no active WebSocket for this user), stores the message in Cassandra, and triggers a push notification via APNs or FCM. The notification wakes the device, the client reconnects via WebSocket, and the server flushes the pending message queue in order. The push notification contains no message content — it's a wake signal. The actual content comes over the re-established WebSocket, ensuring end-to-end encryption is preserved.
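Putting the offline path together, under the same illustrative naming; `send_push` stands in for an APNs/FCM client, not a real API.

```python
# Offline path: a missing registry entry means no live WebSocket, so persist
# the message and send a content-free wake signal.
registry: dict[str, str] = {}                 # user_id -> connection server id
offline_queue: dict[str, list[dict]] = {}     # user_id -> pending messages, in order

def send_push(user_id: str) -> None:
    # Placeholder for an APNs/FCM call: a wake signal carrying no message content.
    print(f"push wake -> {user_id}")

def deliver(user_id: str, message: dict) -> str:
    server = registry.get(user_id)
    if server is None:                        # offline: queue + push
        offline_queue.setdefault(user_id, []).append(message)
        send_push(user_id)
        return "queued"
    return f"routed to {server}"              # online: hand off to that connection server

def flush_on_reconnect(user_id: str, server_id: str) -> list[dict]:
    # Device reconnected: register the new socket, then drain the queue in order.
    registry[user_id] = server_id
    return offline_queue.pop(user_id, [])
```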
Load the Chat System blueprint in SysSimulator to run the architecture under real simulated load. The blueprint includes connection servers, a message router, Cassandra for message storage, Redis for presence and routing, and an offline queue processor.
Set traffic to 1,000,000 RPS (approximating message throughput) and observe: message queue depth, connection server utilisation, and Cassandra write throughput. At healthy load, message queue depth should be near zero — messages are being delivered as fast as they arrive. P99 latency should be under 100ms for end-to-end delivery.
Then inject a connection server failure via the Chaos panel. This simulates one of your connection servers crashing — losing all the WebSocket connections it held. Watch: affected users' messages queue up in Cassandra, push notifications fire to wake devices, clients reconnect to other servers, message queue depth spikes as reconnecting clients flush their backlogs. Record the recovery time and the peak queue depth — these are your interview numbers.
Next, inject a database partition and observe the delivery guarantee behaviour. Messages queue at the connection server level, Cassandra writes fail, and depending on your circuit breaker configuration, the system either degrades (queuing in memory with risk of loss on crash) or rejects new messages with a 503. This is the CAP theorem tradeoff applied concretely — consistency vs availability under partition.
The question interviewers always ask: "What happens if a connection server crashes?" Practise saying this out loud.
"I'll simulate a connection server crash — this server is holding WebSocket connections for roughly 50,000 users based on our fleet size of 1,000 servers for 50M concurrent connections."
"[inject crash] Immediately, 50,000 users lose their persistent connection. From their perspective, the app shows 'connecting...' within a few seconds when the TCP keepalive timeout fires. Messages sent to these users in the next 30 seconds go to the connection server, which looks up their connection in Redis, finds no active connection on that server, and routes to the message queue in Cassandra instead. Push notifications fire via APNs/FCM to wake the affected devices."
"The blast radius: 50,000 users experience a 5–30 second connectivity interruption. In-flight messages — sent in the milliseconds before the crash was detected — may be lost if they were in the server's send buffer and hadn't been written to Cassandra yet. This is the gap in at-least-once delivery: if the server crashes before writing to the queue, the message is lost. The mitigation is a write-ahead log — the connection server writes to Cassandra before attempting WebSocket delivery, so a crash after the write but before delivery is recoverable."
"Recovery: clients reconnect to any available server (load balancer assigns them). The new server registers their connection in Redis. Cassandra flushes the pending message queue. Total recovery time depends on client reconnect logic — with exponential backoff starting at 1 second, most clients reconnect within 5 seconds. The queue depth spike you saw in the metrics is all 50,000 users flushing their backlogs simultaneously — that's a thundering herd on the message queue. Staggered reconnect with jitter prevents this from overloading Cassandra."
That answer — covering the blast radius, the at-least-once gap, the write-ahead log mitigation, and the reconnect thundering herd — earns a strong hire signal.
"How do you scale to 50 million concurrent connections?" This is checking whether you understand the file descriptor and memory constraints of stateful connections. Answer: connection servers are sized for connection count (memory), not request throughput. 50K connections per server at 65 KB each = 3.25 GB RAM per server for connection state. 1,000 servers handle 50M connections. The interviewer wants to see you do this calculation.
"Why WebSockets instead of long-polling?" Long-polling at 50M connections × 1 poll/second = 50M HTTP requests/second of overhead before any messages are sent. That overhead alone requires a fleet the size of the actual message delivery system. WebSockets amortise the connection cost across the session lifetime. The follow-up — "what are the downsides of WebSockets?" — wants: complexity of stateful routing, harder to load balance (sticky sessions or routing registry required), harder to scale behind standard HTTP infrastructure.
"What's the difference between at-least-once and exactly-once delivery?" At-least-once: the server retries until it receives an ACK. A message may be delivered multiple times if the ACK is lost (duplicate messages). Exactly-once: requires the recipient to deduplicate based on message ID. The full exactly-once guarantee requires distributed transactions and is expensive. WhatsApp uses at-least-once with client-side deduplication via message ID — effectively achieving exactly-once semantics at the application layer without the distributed transaction cost.
"How do you handle the group message fan-out problem?" A WhatsApp group with 512 members: one message becomes 512 delivery operations. At 1.16M messages/sec with 10% being group messages, that's 60 million delivery operations/sec from group messages alone. The optimisation: for large groups, store one copy of the message and send lightweight "you have a new message in group X" notifications. Each recipient fetches the message by ID. Write amplification drops from 512× to 1×.
"How do you handle end-to-end encryption at scale?" The server never has access to message content — it routes encrypted blobs. Each message is encrypted with the recipient's public key (the Signal Protocol uses a ratchet that derives per-message keys). The server stores and forwards opaque bytes. This simplifies the server design (no content inspection needed) and is a security architecture decision, not a performance one.
What is the hardest part of designing WhatsApp?
Managing persistent WebSocket connections at scale — 50 million open connections require memory-aware server sizing and a routing layer to direct messages to the correct server holding each user's connection. Most candidates design the stateless message path correctly but miss the stateful connection routing challenge.
How does WhatsApp guarantee message delivery?
At-least-once delivery via an ACK chain: server receives message and stores it (one check mark), recipient device receives and ACKs it (two check marks), recipient opens it (blue check marks). The server retries delivery until each ACK is received. Client-side deduplication by message ID prevents duplicates from appearing.
How does WhatsApp handle offline users?
Messages are stored in Cassandra when the recipient has no active WebSocket connection. A push notification (APNs/FCM) wakes the device, prompting reconnection. On reconnect, the server flushes the queued messages in order. Messages are retained until delivered or the retention window expires.
What database does WhatsApp use?
Cassandra for message storage — write-optimised wide-column storage partitioned by conversation ID, clustered by timestamp. Redis for presence state and the connection routing registry. A relational store for user metadata and contact lists. Blob storage (S3-compatible) for media files referenced by message.
How does WhatsApp's presence service work?
Online status is stored in Redis with a short TTL. Connection servers publish presence events on connect/disconnect. Presence updates are delivered only to users who have that contact's chat open — not broadcast to all contacts — limiting fan-out to active conversations.
How does WhatsApp handle group messages at scale?
For groups up to 512 members, fan-out on write: one message becomes 512 delivery operations. For very large groups or broadcast lists, a shared message store approach reduces write amplification — one copy stored, lightweight pointers sent to each member, who fetches the message by ID.
Run this in SysSimulator → Browse all blueprints
Next in the series: Design YouTube's architecture →