Sharding comes up in every interview involving data at scale — but it's often introduced too early. The correct framing is: sharding is a last resort after you've exhausted vertical scaling, read replicas, caching, and partitioning. When you do shard, the choice of shard key is the most consequential decision in your entire data architecture. Get it wrong, and you've created a distributed database that's harder to operate than the monolith it replaced.
Before recommending sharding in an interview, demonstrate that you know the alternatives. A single PostgreSQL instance on modern hardware handles roughly 10,000–50,000 writes/second and 100,000+ reads/second. Read replicas scale reads horizontally. A caching layer (Redis) can absorb 90%+ of read traffic. Vertical scaling (a larger instance) can delay the sharding decision by years.
Shard when: write throughput exceeds what a single primary can handle (>50K writes/second sustained), total data volume makes a single instance impractical (>10TB with active write patterns), or geographical data isolation is required (regulatory requirement to store certain users' data in specific regions — a form of manual sharding).
In an interview, the signal that sharding is appropriate is a combination of: very high write QPS, very large data volume, and a clear access pattern that maps to a natural shard key. If those conditions aren't clearly present, state "I'd use read replicas and caching before considering sharding" — this shows maturity.
Hash sharding. Compute hash(shard_key) % num_shards to determine the shard. User with ID 12345 goes to shard hash(12345) % 10 = 5. Distribution is uniform for high-cardinality keys — no hotspots. The primary disadvantage: range queries on the shard key require a scatter-gather across all shards. SELECT * FROM users WHERE user_id BETWEEN 10000 AND 20000 must query all 10 shards and merge results. For most applications, range queries on the primary key are rare (you query by primary key exactly, or by secondary indexes) — hash sharding is safe.
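A minimal routing sketch, assuming a SHA-256 hash and 10 shards (both illustrative); it also shows why a range predicate on the shard key degenerates into a fan-out:

```python
import hashlib

NUM_SHARDS = 10

def shard_for(key: int) -> int:
    """Hash the shard key, then take the modulo to pick a shard."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Point lookup by shard key: exactly one shard is queried.
print(shard_for(12345))

# Range query on the shard key: hashing destroys ordering, so rows with
# user_id BETWEEN 10000 AND 20000 are spread across the cluster and the
# router must fan out to every shard and merge the results.
shards_hit = {shard_for(uid) for uid in range(10_000, 20_001)}
print(sorted(shards_hit))  # expect all 10 shards
```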
Range sharding. Assign key ranges to shards: shard 1 holds user_id 1–10M, shard 2 holds 10M–20M. Range scans within a shard are efficient. The primary disadvantage: monotonically increasing keys create write hotspots. If new users always get the highest user_id, all writes land on the last shard. Mitigation: use non-monotonic keys (UUID v4) or salt the key so new writes spread across ranges. Range sharding is used by HBase and Google Bigtable, where sorted range scans are the primary access pattern.
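A sketch of range routing with hypothetical 10M-wide boundaries; bisect finds the owning shard, and a contiguous key range maps to a contiguous (usually small) set of shards:

```python
import bisect

# Exclusive upper bounds of each shard's key range: shard 0 holds [0, 10M),
# shard 1 holds [10M, 20M), and so on (boundaries are illustrative).
BOUNDARIES = [10_000_000, 20_000_000, 30_000_000, 40_000_000]

def shard_for(user_id: int) -> int:
    """Binary-search the sorted boundaries to find the owning shard."""
    return bisect.bisect_right(BOUNDARIES, user_id)

def shards_for_range(lo: int, hi: int) -> list[int]:
    """A key range touches only the shards whose ranges it overlaps."""
    return list(range(shard_for(lo), shard_for(hi) + 1))

print(shard_for(12_345))                        # shard 0
print(shards_for_range(9_500_000, 11_000_000))  # [0, 1]: two shards, not all

# The hotspot: if user_id is auto-incremented, every new user has the highest
# key, so all inserts route to the last shard while earlier shards sit idle.
```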
High cardinality. The shard key must have many distinct values so rows distribute across shards. A boolean, status enum, or day-level date is a poor shard key: only a handful of distinct values, so rows pile onto a few shards and create massive hotspots. user_id (millions of users), order_id, or device_id are good candidates.
Avoid monotonically increasing keys. Auto-increment IDs and timestamps are classic hotspot creators under range sharding: every new row carries the highest key, so all writes pile onto the last shard. UUID v4 distributes uniformly. Alternatively, keep the auto-increment ID but hash it before routing: shard = hash(id) % N spreads monotonic IDs uniformly.
Align with access locality. For a social application, sharding by user_id means all of a user's posts, follows, and activity land on the same shard. A query "get all posts for user 12345" hits one shard. A query "get all comments by user 12345" also hits one shard. This co-location eliminates cross-shard joins for the most common access patterns. The alternative — sharding posts by post_id and users by user_id — means every user's feed query is a cross-shard scatter-gather.
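A small illustration of the co-location argument, routing posts by their owner's user_id rather than by post_id (IDs and shard count are made up):

```python
import hashlib

NUM_SHARDS = 16

def route(key: int) -> int:
    """Shared hash router used by every table."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

user_id = 12345
post_ids = [900001, 900002, 900003]  # three posts by this user

# Co-located: posts are routed by the owning user_id, so "get all posts for
# user 12345" touches exactly one shard.
co_located = {route(user_id) for _ in post_ids}

# Not co-located: posts routed by post_id scatter, so the same query becomes
# a cross-shard scatter-gather.
scattered = {route(pid) for pid in post_ids}

print(co_located)  # one shard
print(scattered)   # typically several shards
```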
Immutability. Once a row is assigned to a shard, the shard key should not change. If user_id changes (hypothetically), the row must be deleted from the old shard and reinserted on the new shard — a non-atomic distributed operation. Design entities so the natural shard key is stable for the lifetime of the row.
The hardest operational problem with sharding is adding shards. With modulo hash sharding (hash % N), increasing N from 4 to 5 means the shard assignment of most rows changes — approximately 80% of rows must migrate to different shards. This requires a background migration that reads every row and reinserts it on the new shard, with application code serving dual-mode (reading from both old and new locations) during the migration window. Extremely disruptive.
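The ~80% figure is easy to verify with a quick simulation over uniformly hashed keys (the key range and hash function here are arbitrary):

```python
import hashlib

def shard(key: int, n: int) -> int:
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

# A key keeps its shard only when hash % 4 == hash % 5, i.e. hash % 20 < 4,
# which happens for roughly 20% of keys. The other ~80% must migrate.
keys = range(100_000)
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of keys change shards")  # ~80%
```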
Consistent hashing solves this. Each shard owns a range of a 0–2^32 hash ring. Adding a new shard means it takes over a contiguous range from an existing shard — only ~1/N of data migrates (the portion that falls in the new shard's range). This is the property that makes Cassandra, DynamoDB, and Redis Cluster operationally manageable at large shard counts. See the consistent hashing guide for the full mechanics.
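A minimal hash-ring sketch with virtual nodes, only to make the ~1/N property concrete; production implementations (and the consistent hashing guide) add replication and better balancing:

```python
import bisect
import hashlib

def position(value: str) -> int:
    """Point on a 0 .. 2^32 - 1 ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, shards, vnodes=100):
        # Each shard contributes many virtual points so load evens out.
        self.points = sorted((position(f"{s}#{i}"), s)
                             for s in shards for i in range(vnodes))
        self.positions = [p for p, _ in self.points]

    def owner(self, key: str) -> str:
        """The first virtual point clockwise from the key's hash owns it."""
        i = bisect.bisect_right(self.positions, position(key)) % len(self.points)
        return self.points[i][1]

before = Ring([f"shard-{i}" for i in range(4)])
after = Ring([f"shard-{i}" for i in range(5)])

keys = [f"user:{i}" for i in range(50_000)]
moved = sum(1 for k in keys if before.owner(k) != after.owner(k))
print(f"{moved / len(keys):.0%} of keys move")  # ~20%, versus ~80% with modulo
```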
The practical alternative for systems already using modulo sharding: logical shard routing. Use 1024 logical shards mapped to a smaller number of physical shards. Adding a physical shard means reassigning some logical shards — only the rows in those logical shards need to migrate. The application routes by logical shard; a configuration table maps logical shards to physical servers. Adding capacity is incremental and controlled.
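A sketch of that routing layer, assuming 1024 logical shards and a mapping table held in application config (server names are hypothetical):

```python
import hashlib

NUM_LOGICAL = 1024

def logical_shard(key: int) -> int:
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL

# Logical shard -> physical server, normally stored in a config table so it
# can change without redeploying. Start with 4 physical servers.
mapping = {ls: f"db-{ls % 4}" for ls in range(NUM_LOGICAL)}

def physical_server(key: int) -> str:
    return mapping[logical_shard(key)]

print(physical_server(12345))

# Adding capacity: reassign roughly a fifth of the logical shards to a new
# server. Only rows in those ~205 logical shards migrate; the hash function
# and NUM_LOGICAL never change, only the mapping table does.
for ls in range(0, NUM_LOGICAL, 5):
    mapping[ls] = "db-4"
```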
Sharding optimizes single-shard operations. Cross-shard operations are expensive:
Scatter-gather queries. A query not filtered by the shard key must go to all shards and merge results. "Find all users who signed up in the last 24 hours" — if sharded by user_id, this query hits all shards. At 100 shards, you make 100 parallel queries and merge 100 result sets. This is slower than a single-instance query and creates load on all shards simultaneously.
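A sketch of the fan-out, with query_shard standing in for whatever per-shard database client the system actually uses:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 100

def query_shard(shard_id: int) -> list[dict]:
    """Stand-in for running the SQL against one shard's connection pool,
    e.g. SELECT * FROM users WHERE created_at > now() - interval '24 hours'."""
    return []  # placeholder result rows

def recent_signups() -> list[dict]:
    # The filter (created_at) is not the shard key, so every shard is queried.
    with ThreadPoolExecutor(max_workers=32) as pool:
        per_shard = pool.map(query_shard, range(NUM_SHARDS))
    # Merge; latency is governed by the slowest shard, and all 100 shards
    # take load at once.
    return [row for rows in per_shard for row in rows]

print(len(recent_signups()))
```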
Cross-shard joins. A join between two tables that are sharded differently (or not sharded) requires one of: (a) moving all data from one table to the shard of the other at query time (expensive), (b) denormalization (embed the data you'd join in the same record, accepting duplication), or (c) a separate analytics database (data warehouse, Redshift, BigQuery) for complex multi-table queries. In interview designs: state explicitly that you'd denormalize for co-located access and use a separate analytics pipeline for aggregations.
Distributed transactions. A transaction that spans two shards requires a distributed commit protocol (two-phase commit or saga pattern). Two-phase commit is slow and locks rows across shards during the prepare phase. Sagas are more complex but avoid distributed locks. For sharded systems, design your data model to avoid cross-shard transactions wherever possible — co-locate entities that are transactionally coupled on the same shard.
What is database sharding?
Splitting a table across multiple database instances (shards), each holding a subset of rows determined by a shard key. Used when write throughput or data volume exceeds single-instance limits. A last resort after vertical scaling, read replicas, and caching.
What is the difference between hash sharding and range sharding?
Hash: uniform distribution, no range scans on shard key. Range: efficient range scans, risk of write hotspots on monotonic keys. Hash sharding is the safer default for most OLTP workloads.
How do you choose a shard key?
High cardinality, non-monotonic, aligned with access locality, immutable per row. Sharding by user_id co-locates all user data on one shard, eliminating cross-shard joins for user-centric queries.
How do you handle resharding when adding new shards?
Consistent hashing: new shard takes only 1/N of existing data, not a full reshuffle. Or: logical shard routing with 1024 logical shards mapped to physical — adding a server reassigns some logical shards, migrating only their data.