Load balancing algorithms

Every system design interview has a load balancer — but most candidates draw one box and move on. The algorithms behind load balancing determine cache hit rates, connection overhead, failure recovery speed, and whether your stateful services survive rolling deployments. This guide covers the algorithms, the L4/L7 distinction, and the situations where each choice matters.

The core algorithms

Round robin. Distribute requests in rotation: request 1 → server A, request 2 → server B, request 3 → server C, request 4 → server A. Simple, stateless, equal distribution assuming requests are of similar cost. Variant: weighted round robin assigns more requests to higher-capacity servers (server A has 4× the CPU → gets 4× the requests). Round robin is the correct default when: requests are roughly equal in cost, servers are homogeneous in capacity, and statelessness allows any server to handle any request.
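
A minimal sketch of both variants in Python (server names and weights here are illustrative, not from any real deployment):

```python
import itertools

servers = ["a", "b", "c"]
rr = itertools.cycle(servers)  # plain round robin: rotate forever

def pick_round_robin() -> str:
    return next(rr)

# Weighted round robin: a server with weight 4 appears 4x in the cycle.
# Production implementations (e.g., nginx) use "smooth" weighting to
# interleave picks instead of bursting, but the distribution is the same.
weights = {"a": 4, "b": 1, "c": 1}
wrr = itertools.cycle([s for s, w in weights.items() for _ in range(w)])

def pick_weighted() -> str:
    return next(wrr)
```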

Least connections. Route each new request to the server with the fewest active connections. Better than round robin when request duration varies significantly — a server handling a 30-second long-poll request shouldn't receive as many new requests as a server handling 10ms API calls. Requires the load balancer to track active connections per backend (state in the load balancer). Use least connections for: WebSocket connections, long-poll APIs, streaming endpoints, any workload where connection duration is variable.
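
A minimal least-connections sketch, assuming the load balancer maintains an in-flight connection count per backend (the counts here are illustrative):

```python
active = {"a": 12, "b": 3, "c": 7}  # in-flight connections per backend

def pick_least_connections() -> str:
    # New request goes to the backend with the fewest active connections.
    return min(active, key=active.get)

def on_connection_open(server: str):
    active[server] += 1

def on_connection_close(server: str):
    active[server] -= 1
```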

Least response time. Route to the server with the lowest combination of active connections and current response time. The most adaptive algorithm — it naturally directs traffic away from degraded servers and toward fast servers. Higher overhead (requires tracking response time metrics per server). Useful in heterogeneous environments where servers have different processing speeds at runtime.
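
One common way to score "least response time", sketched with an exponentially weighted moving average (EWMA) of latency. The exact formula varies by implementation, so this scoring is an assumption, not a standard:

```python
stats = {"a": {"conns": 2, "ewma_ms": 40.0},
         "b": {"conns": 5, "ewma_ms": 15.0}}

ALPHA = 0.2  # EWMA smoothing factor (illustrative)

def record_response(server: str, latency_ms: float):
    # Blend the newest latency sample into the running average.
    s = stats[server]
    s["ewma_ms"] = ALPHA * latency_ms + (1 - ALPHA) * s["ewma_ms"]

def pick_least_response_time() -> str:
    # Lower score = fewer in-flight requests and faster recent responses.
    return min(stats, key=lambda s: (stats[s]["conns"] + 1) * stats[s]["ewma_ms"])
```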

IP hash / source IP hashing. Compute hash(source_ip) % num_servers to select the server. All requests from the same client IP go to the same server — session affinity without cookies. Limitation: CGNATs and corporate proxies put many users behind a single IP, causing hotspots. Load distribution is as uniform as IP distribution, which is often not uniform. Prefer cookie-based sticky sessions over IP hash for this reason.
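
The hash itself is a one-liner; a Python sketch (md5 is used only as a stable, well-mixed hash, not for security):

```python
import hashlib

servers = ["a", "b", "c"]

def pick_by_ip(source_ip: str) -> str:
    # The same IP always hashes to the same server, until the server
    # count changes (at which point nearly every mapping moves).
    h = int(hashlib.md5(source_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]
```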

Consistent hashing. A ring-based algorithm where both servers and request keys (user_id, session_token) are hashed to positions on a ring. A request routes to the nearest server clockwise on the ring. Adding or removing a server only remaps ~1/N of requests (vs. near-total remapping with modulo). Consistent hashing is used for: CDN edge routing (requests for the same URL go to the same cache node), in-memory cache clusters (warm cache affinity), and service meshes where request affinity improves performance.
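
A minimal ring sketch. Virtual nodes (replicas) smooth the load across the ring; 100 replicas per server is an illustrative choice:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, replicas=100):
        # Each server occupies many positions ("virtual nodes") on the ring.
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(replicas))
        self.points = [p for p, _ in self.ring]

    def pick(self, key: str) -> str:
        # First server position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.points, _hash(key)) % len(self.points)
        return self.ring[idx][1]

ring = HashRing(["a", "b", "c"])
ring.pick("user:42")  # stable mapping until a server is added or removed
```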

L4 vs L7 load balancing

L4 (transport layer). Routes based on IP address and TCP/UDP port. Does not inspect packet payload. Extremely fast: L4 load balancers handle millions of connections per second with minimal CPU overhead (often hardware-accelerated, or kernel-bypass via DPDK). Handles any protocol carried over TCP or UDP, not just HTTP. Limitations: no URL-based routing, no header inspection, and no SSL termination (TLS passes through end-to-end to the backend). Use cases: high-throughput non-HTTP traffic, database connection proxying, gaming servers, DNS. AWS Network Load Balancer (NLB) is L4.

L7 (application layer). Inspects HTTP/HTTPS request content. Routes based on URL path (/api/upload → upload servers, /api/query → query servers), headers (X-User-Tier: premium → premium backend), cookies (sticky sessions), or request body content. Performs SSL termination (decrypts TLS at the load balancer, sends plaintext to backends — simplifies certificate management and enables header inspection). Supports connection pooling between load balancer and backend (fewer TCP handshakes). AWS Application Load Balancer (ALB), nginx, and HAProxy are L7. Use for all web application traffic.
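
The routing rules above reduce to a lookup on path and headers; an illustrative sketch (the pool names and tier header are hypothetical, echoing the examples in the paragraph):

```python
POOLS = {
    "upload":  ["up-1", "up-2"],
    "query":   ["q-1", "q-2"],
    "premium": ["prem-1"],
    "default": ["web-1", "web-2"],
}

def route(path: str, headers: dict) -> list:
    # L7 decisions: header-based rule first, then URL path, then fallback.
    if headers.get("X-User-Tier") == "premium":
        return POOLS["premium"]
    if path.startswith("/api/upload"):
        return POOLS["upload"]
    if path.startswith("/api/query"):
        return POOLS["query"]
    return POOLS["default"]
```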

Health checks

A load balancer without proper health checks will route traffic to dead backends. Active health checks: the load balancer probes each backend at a configured interval (e.g., every 5 seconds) with an HTTP GET /health. A backend is marked unhealthy after N consecutive probe failures (typically 2–3) and removed from the rotation; it is marked healthy again after N consecutive successes. The health endpoint must test actual application health (database connectivity, cache reachability, sufficient memory headroom), not just "is the process running?"
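
A sketch of the active-probe state machine just described, with thresholds in the typical 2–3 range (the exact values and timeout are illustrative):

```python
import urllib.request

UNHEALTHY_AFTER = 3  # consecutive probe failures before ejection
HEALTHY_AFTER = 3    # consecutive successes before reinstatement

state = {}  # server -> {"healthy": bool, "fails": int, "oks": int}

def probe(server: str) -> bool:
    try:
        with urllib.request.urlopen(f"http://{server}/health", timeout=2) as r:
            return r.status == 200
    except OSError:
        return False

def check(server: str):
    s = state.setdefault(server, {"healthy": True, "fails": 0, "oks": 0})
    if probe(server):
        s["fails"], s["oks"] = 0, s["oks"] + 1
        if not s["healthy"] and s["oks"] >= HEALTHY_AFTER:
            s["healthy"] = True   # back into rotation
    else:
        s["oks"], s["fails"] = 0, s["fails"] + 1
        if s["healthy"] and s["fails"] >= UNHEALTHY_AFTER:
            s["healthy"] = False  # removed from rotation
```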

Passive health checks: the load balancer monitors responses from real traffic. If a backend returns 5xx responses or times out at a rate above a threshold, it is marked unhealthy without waiting for the next active probe. This gives faster failure detection for problems that affect real requests but are invisible to synthetic probes. Production load balancers use both: active health checks for proactive detection, passive health checks for reactive detection during real traffic failures.
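
Passive detection amounts to a sliding window over real responses; a sketch with an illustrative window size and threshold:

```python
from collections import deque

WINDOW = 100          # last N real responses per backend
MAX_ERROR_RATE = 0.5  # eject above 50% failures (illustrative)

recent = {}  # server -> deque of bools (True = 5xx or timeout)

def record(server: str, failed: bool) -> bool:
    """Record one real response; return True if the backend should be ejected."""
    d = recent.setdefault(server, deque(maxlen=WINDOW))
    d.append(failed)
    return len(d) == WINDOW and sum(d) / WINDOW > MAX_ERROR_RATE
```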

Sticky sessions: when to use and when to avoid

Sticky sessions route all requests from a given user to the same backend. Common implementation: the load balancer sets a cookie (SERVERID=backend-3) on the first response; subsequent requests with that cookie route to backend-3. Necessary when: the application stores session state in local server memory and that state cannot be accessed by other servers. This is a legacy architecture — stateless applications with session state in Redis do not need sticky sessions.
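
The cookie mechanics in a few lines, assuming the SERVERID cookie from the description (backend names and the fallback picker are illustrative):

```python
backends = {"backend-1", "backend-2", "backend-3"}

def pick_backend(cookies: dict) -> str:
    sticky = cookies.get("SERVERID")
    if sticky in backends:       # honor the pin if that backend is still up
        return sticky
    return sorted(backends)[0]   # otherwise fall back to any algorithm above

def pin(response_headers: list, server: str):
    # First response pins the client to the chosen backend.
    response_headers.append(("Set-Cookie", f"SERVERID={server}"))
```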

Problems with sticky sessions: uneven load distribution when some users generate disproportionate traffic; loss of session state when a backend dies (if backend-3 goes down, every user pinned to it loses their session, because stickiness routes to the state but does nothing to protect it); and complicated rolling deployments, since draining a server means waiting for its sticky sessions to expire. The correct architectural decision: eliminate server-side session state and use a shared session store. Sticky sessions should be a temporary workaround you are actively working to remove, not a permanent architecture choice.

Load balancer placement and redundancy

A load balancer is a single point of failure if not made redundant. Production setups use: active-passive (one load balancer is active, one is standby — failover via virtual IP / floating IP); active-active (both load balancers handle traffic, DNS round-robin between them, failure of one reduces capacity by 50% but doesn't drop availability). Cloud managed load balancers (AWS ALB/NLB, Google Cloud Load Balancing) are regionally redundant by default — the cloud provider handles the infrastructure HA.

In interviews: always mention that the load balancer is itself a component that needs redundancy. "I'd use AWS ALB which is managed and regionally HA" is a complete answer — you don't need to design the HA mechanism from scratch for managed services.

Frequently asked questions

What is the difference between L4 and L7 load balancing?
L4: routes by IP/port, no payload inspection, millions of connections/second, any protocol. L7: routes by HTTP content (URL, headers, cookies), SSL termination, URL-based routing, connection pooling. Use L7 for web applications, L4 for non-HTTP or raw throughput.

What are sticky sessions and when should you use them?
Routes all requests from a user to the same backend. Necessary for in-memory session state, but creates uneven load and complicates failover. Long-term solution: move session state to Redis and eliminate the need for sticky sessions.

How does consistent hashing help with load balancing?
Routes requests to the same server via hash-ring positioning. Adding or removing a server remaps only ~1/N of requests. Enables cache affinity: the same server warms its cache for the same set of users. Used in CDNs and cache clusters.

How does a load balancer detect that a backend server is unhealthy?
Active: periodic probe requests to /health, mark unhealthy after N failures, re-mark healthy after N successes. Passive: monitor error rate and latency in real traffic. Use both for fastest detection. Health endpoint must verify real application functionality, not just process liveness.

