Design Uber's ride-sharing architecture

The Uber question is a senior-level staple that tests a domain most candidates haven't encountered: real-time geospatial systems. Unlike Twitter or YouTube, where the data is text and media, Uber's hardest problem is matching moving objects in two-dimensional space with sub-second latency. The techniques — geohashing, S2 cells, handling high-frequency location writes — rarely appear in other common interview questions.

What the interview is really asking

Uber is not really about rides. It is a real-time supply/demand matching system operating in two-dimensional geographic space. The interviewer is testing whether you understand the specific challenges that geographic data introduces — challenges that standard key-value and relational data models handle poorly.

Geospatial indexing. "Find me all drivers within 5 kilometres of this GPS coordinate" is not a query that standard database indexes are built for. A B-tree index on latitude and a B-tree index on longitude together don't efficiently answer a 2D radius query. You need a spatial index — geohash, R-tree, quadtree, or S2 geometry cells. The interviewer is checking whether you know this problem exists and have a specific solution.

Write amplification from location updates. Roughly a million active drivers, each sending a GPS update every 4 seconds: 1 million drivers / 4 seconds = 250,000 writes per second — sustained for hours, not a momentary spike. This is an extraordinary write rate for what is essentially a key-value update (driver_id → current_coordinates). The interviewer is checking whether you reason about the write volume before designing the storage layer.

Matching algorithm tradeoffs. The simplest matching algorithm — assign the nearest available driver — creates local optimization that fails at scale. If all riders in a city request simultaneously and the system naively assigns the nearest driver to each, some areas end up with no drivers while nearby areas have driver clusters. The interviewer is checking whether you think beyond "nearest driver" to batch matching, ETA-based ranking, and area rebalancing.

Trip state machine. A trip is not a single binary state. It moves through: requested → driver assigned → driver en route → driver arrived → trip in progress → trip completed → payment processing. Each transition must be atomic and persistent. The interviewer is checking that you model this explicitly rather than treating "a trip" as a single database row with a status field.

Back-of-envelope estimation

Scale. Uber operates in 70+ countries with approximately 5 million trips per day globally. At peak concurrency (commuter hours), roughly 1 million drivers are active simultaneously. Use these numbers as your baseline.

Location update write rate. Each active driver sends a GPS update every 4 seconds. 1 million drivers / 4 seconds = 250,000 location updates per second. This is the dominant write workload. At 50 bytes per update (driver_id + lat/lng + timestamp): 250,000 × 50 bytes = 12.5 MB/sec of location data. Small in bytes, large in IOPS. A relational database handling 250,000 writes/sec requires aggressive sharding. A single Redis instance sustains roughly 100,000 writes/sec (more with pipelining), so a small Redis cluster absorbs this load comfortably.

Trip request rate. 5 million trips/day / 86,400 seconds = ~58 trip requests/second at average. At peak: 5× = ~290 requests/second. Modest — the matching service is not throughput-constrained, it is latency-constrained. A match must complete in under 2 seconds or the rider perceives lag.

Matching query radius. A typical driver search radius is 5km in urban areas. At geohash precision level 6 (~1.2km cells), a 5km search covers roughly 25 cells (the target cell plus two rings of neighbours). Each cell lookup retrieves the list of available drivers in that cell. A busy city might have 10,000 active drivers spread over 100km² — an average density of ~100 drivers per km², or ~120 drivers per ~1.2km² cell. A 25-cell scan therefore returns ~3,000 driver candidates to rank and match. Fast enough for a single Redis lookup pass.

Storage for trip history. Each completed trip record: ~1KB (coordinates, timestamps, fare, driver/rider IDs). At 5 million trips/day: 5 GB/day. Over 10 years: ~18 TB. Fits on a sharded PostgreSQL cluster with date-based partitioning, with older data archived to cold storage.
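
The arithmetic above condenses into a few lines. A quick sketch in Python — every input is one of the baseline assumptions stated in this section, not a measured figure:

```python
# Back-of-envelope numbers from this section; inputs are the stated assumptions.
ACTIVE_DRIVERS = 1_000_000          # peak concurrent drivers
UPDATE_INTERVAL_S = 4               # GPS update cadence per driver
LOCATION_UPDATE_BYTES = 50          # driver_id + lat/lng + timestamp
TRIPS_PER_DAY = 5_000_000
TRIP_RECORD_BYTES = 1_000           # ~1KB per completed trip

location_writes_per_s = ACTIVE_DRIVERS / UPDATE_INTERVAL_S                   # 250,000
location_bandwidth_mb_s = location_writes_per_s * LOCATION_UPDATE_BYTES / 1e6  # 12.5 MB/s

trip_requests_per_s = TRIPS_PER_DAY / 86_400                                  # ~58 average
peak_trip_requests_per_s = trip_requests_per_s * 5                            # ~290 at peak

trip_storage_gb_per_day = TRIPS_PER_DAY * TRIP_RECORD_BYTES / 1e9             # ~5 GB/day
trip_storage_tb_10y = trip_storage_gb_per_day * 365 * 10 / 1_000              # ~18 TB

print(location_writes_per_s, round(location_bandwidth_mb_s, 1))
print(round(trip_requests_per_s), round(peak_trip_requests_per_s))
print(trip_storage_gb_per_day, round(trip_storage_tb_10y, 1))
```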

Architecture decisions and why

Location service: Redis with geohashing. The location service maintains the current position of every active driver in Redis. Redis's native GEOADD and GEORADIUS commands provide built-in geospatial index operations — GEOADD runs in O(log M) per update, and GEORADIUS runs in O(N + log M), where N is the number of results and M is the number of drivers in the set. At 250,000 writes/sec, Redis handles the load from a cluster of 3–5 nodes with geographic sharding (one cluster per metro area). Driver location is updated in place — no append, just overwrite — keeping storage constant regardless of update frequency.
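
A minimal sketch of the update and query path, assuming redis-py (≥ 4.0); the per-metro key name, connection details, and the 200-candidate cap are illustrative assumptions, not Uber's actual schema:

```python
import redis

# Illustrative per-metro location index; the key naming is an assumption.
r = redis.Redis(host="localhost", port=6379)
METRO_KEY = "drivers:sf"

def update_driver_location(driver_id: str, lon: float, lat: float) -> None:
    # GEOADD overwrites the member's position in place, so storage stays
    # constant no matter how often the driver reports.
    r.geoadd(METRO_KEY, (lon, lat, driver_id))

def nearby_drivers(lon: float, lat: float, radius_km: float = 5.0):
    # GEORADIUS returns candidates sorted by straight-line distance;
    # the matching service re-ranks them by routed ETA afterwards.
    return r.georadius(
        METRO_KEY, lon, lat, radius_km, unit="km",
        withdist=True, sort="ASC", count=200,
    )

update_driver_location("driver-42", -122.4194, 37.7749)
print(nearby_drivers(-122.4313, 37.7730))
```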

Geohashing for spatial partitioning. A geohash encodes a GPS coordinate as a string like "u4pruydqqvj", where longer strings mean higher precision. Nearby coordinates share a common prefix. To find drivers near a rider at precision level 6 (each cell roughly 1.2km across): encode the rider's coordinates to a level-6 geohash, retrieve drivers in that cell, then retrieve drivers in the 8 adjacent cells. This 3×3 block covers a roughly 3.6km × 3.6km area and returns all candidate drivers. The matching service then ranks candidates by true geodesic distance, ETA, and driver rating.
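
To make the prefix property concrete, here is a minimal reference-style encoder (a sketch for intuition — production systems use a geohash library or S2 rather than hand-rolled code):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 6) -> str:
    """Interleave longitude/latitude bisection bits and emit base-32 characters."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# Two points a few hundred metres apart share a long common prefix,
# which is what turns 2D proximity into a 1D prefix lookup.
print(geohash(37.7749, -122.4194))   # rider
print(geohash(37.7731, -122.4180))   # nearby driver
```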

Trip state machine with PostgreSQL. Each trip is a row in PostgreSQL with a status enum (REQUESTED, MATCHED, EN_ROUTE, ARRIVED, IN_PROGRESS, COMPLETED, CANCELLED). State transitions are atomic updates with optimistic locking (a version column that increments on each transition). Concurrent state transition attempts (rider cancels at the same moment as driver accepts) resolve via the version constraint — only one succeeds. The audit log records every state transition with timestamp and actor. This is the correctness-critical part of Uber's system — not the location service.
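
A sketch of one transition under optimistic locking. The trips and trip_audit_log tables, their columns, and the actor parameter are illustrative assumptions; conn is an open psycopg2 connection:

```python
# Legal transitions for the trip state machine described above.
ALLOWED = {
    "REQUESTED":   {"MATCHED", "CANCELLED"},
    "MATCHED":     {"EN_ROUTE", "CANCELLED"},
    "EN_ROUTE":    {"ARRIVED", "CANCELLED"},
    "ARRIVED":     {"IN_PROGRESS", "CANCELLED"},
    "IN_PROGRESS": {"COMPLETED"},
}

def transition(conn, trip_id, from_status, to_status, expected_version, actor):
    if to_status not in ALLOWED.get(from_status, set()):
        raise ValueError(f"illegal transition {from_status} -> {to_status}")
    with conn, conn.cursor() as cur:
        # The WHERE clause is the optimistic lock: the update only applies if
        # nobody else advanced the status/version since this caller read it.
        cur.execute(
            """
            UPDATE trips
               SET status = %s, version = version + 1
             WHERE id = %s AND status = %s AND version = %s
            """,
            (to_status, trip_id, from_status, expected_version),
        )
        if cur.rowcount != 1:
            return False  # lost the race, e.g. rider cancelled as driver accepted
        # Audit log: every transition recorded with its actor.
        cur.execute(
            "INSERT INTO trip_audit_log (trip_id, from_status, to_status, actor) "
            "VALUES (%s, %s, %s, %s)",
            (trip_id, from_status, to_status, actor),
        )
    return True
```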

Matching service: batch + rank. When a trip request arrives, the matching service queries Redis for all available drivers within 5km, ranks them by: (1) ETA to pickup — computed by the routing engine, not Euclidean distance. (2) Driver rating. (3) Whether the driver's heading aligns with the pickup direction (reduces U-turns). The top candidate is offered the trip. If the driver does not accept within 10 seconds, the next candidate is offered. This cascade continues until a match is made or no drivers are available.
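
The cascade itself is a loop with a per-driver timeout. A hedged sketch — offer_trip_to and accepted stand in for whatever driver-notification mechanism the design uses; only the 10-second window comes from the text above:

```python
import time

OFFER_TIMEOUT_S = 10  # per-driver acceptance window from the design above

def dispatch(trip_id, ranked_candidates, offer_trip_to, accepted):
    """Offer the trip to candidates in rank order until one accepts.

    offer_trip_to(driver_id, trip_id) pushes the offer to the driver app;
    accepted(driver_id, trip_id) reports whether the driver tapped accept.
    Both are assumed callbacks, not real APIs.
    """
    for driver_id in ranked_candidates:
        offer_trip_to(driver_id, trip_id)
        deadline = time.monotonic() + OFFER_TIMEOUT_S
        while time.monotonic() < deadline:
            if accepted(driver_id, trip_id):
                return driver_id        # match made
            time.sleep(0.25)            # polling stand-in; production would be push-driven
    return None                          # no drivers available
```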

Dispatch and routing engine. Computing accurate ETA for 3,000 driver candidates per trip request is expensive if done via a full routing engine query per candidate. Uber uses a two-pass approach: first pass uses Euclidean distance to narrow to the top 20 candidates (fast, imprecise). Second pass uses a routing engine (road network graph) to compute accurate ETA for the 20 finalists. This reduces routing engine load by 150× while maintaining accuracy for the final match decision.
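
A sketch of the two-pass narrowing: a cheap planar distance approximation for the first pass, and a routing-engine call (here a placeholder callback) only for the shortlist:

```python
from math import cos, radians

def approx_distance_sq(lat1, lon1, lat2, lon2):
    """Cheap Euclidean-style approximation; fine for ranking within a few km."""
    dx = (lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    dy = lat2 - lat1
    return dx * dx + dy * dy

def pick_driver(pickup_lat, pickup_lon, candidates, routed_eta_minutes, shortlist_size=20):
    """candidates: [(driver_id, lat, lon)]. routed_eta_minutes(driver_id) is a
    stand-in for the routing-engine API, called only for the shortlist."""
    # Pass 1: cheap distance approximation narrows ~3,000 candidates to 20.
    shortlist = sorted(
        candidates,
        key=lambda d: approx_distance_sq(pickup_lat, pickup_lon, d[1], d[2]),
    )[:shortlist_size]
    if not shortlist:
        return None
    # Pass 2: accurate road-network ETA for the finalists only (the expensive call).
    return min(shortlist, key=lambda d: routed_eta_minutes(d[0]))
```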

Surge pricing as a stream aggregation. Demand (trip requests per cell per minute) and supply (available drivers per cell) are aggregated in real time by a stream processor reading from Kafka. When demand/supply ratio in a cell exceeds a threshold (e.g., 1.5×), the surge multiplier for that cell is updated in a pricing cache (Redis). Riders who request a trip read the current multiplier at request time. Surge is computed per-cell at a 1-minute granularity — frequent enough to be responsive, infrequent enough to avoid thrashing.
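
A minimal sketch of the per-cell multiplier update, assuming the stream processor has already produced one-minute demand and supply counts per cell. The scaling formula, cap, TTL, and Redis key layout are illustrative choices; only the 1.5× trigger comes from the text:

```python
def surge_multiplier(demand: int, supply: int, threshold: float = 1.5, cap: float = 3.0) -> float:
    """Map a demand/supply imbalance to a price multiplier for one cell."""
    if supply == 0:
        return cap
    ratio = demand / supply
    if ratio <= threshold:
        return 1.0
    # Scale with the imbalance, capped so pricing stays sane. Illustrative formula.
    return min(cap, round(ratio / threshold, 1))

def publish_surge(redis_client, cell: str, demand: int, supply: int) -> None:
    multiplier = surge_multiplier(demand, supply)
    # Riders read this key at request time; the short TTL means a stalled
    # pipeline decays back to 1.0 instead of serving stale surge forever.
    redis_client.set(f"surge:{cell}", multiplier, ex=90)
```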

Run it in the simulator

Load the Ride-Sharing Platform blueprint in SysSimulator. The blueprint models the location service, matching service, trip state machine, and a simplified routing engine. Set active driver count to 10,000 and trip request rate to 50 RPS — a small metro area.

Observe: location update write rate, matching latency (time from trip request to driver assignment), and Redis utilisation. At healthy load, matching should complete in under 500ms and Redis should be under 40% utilised.

Inject a location service spike — simulate a sudden 10× increase in driver location updates (as might happen during a major event when all drivers activate simultaneously). Watch Redis CPU climb. If it hits 100%, location writes queue up and become stale — matches are made on outdated driver positions, so drivers get assigned trips that are farther away than the match assumed.

Then inject a matching service failure. Watch trip requests queue in Kafka. When the service recovers, observe the backlog flush — but note the staleness problem: trip requests that queued more than 30 seconds ago should be discarded, not processed against the current driver positions.

Open Ride-Sharing blueprint →

Failure narration — word for word

"I'll simulate a Redis location service outage — this takes down our driver position index, so the matching service can no longer find nearby drivers."

"[inject] Trip requests are coming in at 50 RPS. With the location service down, the matching service has no driver candidates to query — matching attempts fail immediately. Trip requests queue in Kafka. From the rider's perspective, the app shows 'Looking for drivers...' indefinitely."

"The blast radius: all new trip requests fail to match. In-progress trips are unaffected — they are already matched and the trip state is in PostgreSQL, which is running normally. The location service outage only prevents new matches."

"Recovery: Redis restarts and repopulates. Driver apps immediately send location updates on reconnect — Redis repopulates within 4–8 seconds (one to two update cycles from all active drivers). Kafka's queued trip requests can now be processed. However, requests queued more than 30 seconds ago are discarded — riders have already cancelled or timed out. Only requests from the last 30 seconds are matched. Total recovery time: under 30 seconds with a hot Redis restart, under 2 minutes with a cold restart from a snapshot."

The question behind the question

"Why geohashing instead of just querying latitude/longitude range in a database?" A range query on latitude AND longitude requires a compound index. B-tree indexes on two columns can satisfy one range constraint efficiently but degrade on two simultaneous ranges. A geospatial index (geohash, R-tree) is purpose-built for 2D proximity queries. The specific answer: geohashing converts a 2D coordinate to a 1D string where string prefix = geographic proximity, enabling a key-value lookup (O(1)) instead of a range scan (O(log N)).

"How do you handle the driver who is 4.9km away vs 5.1km away?" The matching radius is a soft boundary, not a hard cutoff. The geohash precision cell boundary doesn't align perfectly with a 5km radius circle — some drivers just outside the cell boundary are closer than drivers inside it. Solution: always query the target cell plus all 8 neighbours, then filter and rank by true geodesic distance in memory. The cell lookup is the coarse filter; the distance calculation is the fine filter.

"What happens during a city-wide event (New Year's Eve, stadium concert)?" Demand spikes 10–20× in a small geographic area simultaneously. The location service in that metro area's Redis shard saturates. Mitigation: pre-scaled Redis clusters before known events, automatic horizontal sharding when a cell's update rate exceeds a threshold, and surge pricing that proactively reduces demand by increasing price before the system saturates.

"How do you handle driver location spoofing?" Drivers can fake GPS coordinates to appear in high-surge areas without physically being there. Detection: compare consecutive location updates for physically impossible movement (speed > 200km/h between updates). Cross-reference with cell tower triangulation data as a secondary signal. Flag outliers for manual review. Suspend accounts with repeated violations. This is a fraud/abuse question that rewards candidates who think beyond the happy path.

Frequently asked questions

What is the hardest part of designing Uber?
The real-time geospatial matching problem — handling 250,000 driver location writes per second while simultaneously answering low-latency proximity queries for trip matching. These conflicting access patterns (high-write location index + low-latency reads) require a purpose-built geospatial index (Redis GEORADIUS) rather than a standard relational database.

How does Uber index driver locations?
Geohashing maps GPS coordinates to a hierarchical string where common prefixes mean geographic proximity. Redis stores driver locations in a geospatial sorted set (GEOADD), allowing O(N+log M) radius queries (GEORADIUS). Driver locations overwrite in-place on each update — storage is constant regardless of update frequency.

How does Uber handle location update write amplification?
1 million active drivers × 1 update per 4 seconds = 250,000 writes/sec. Redis handles this on a per-metro cluster. Location updates are in-memory overwrites (not appends) — negligible storage growth. Asynchronous persistence to a time-series store captures location history for analytics without blocking the real-time update path.

How does surge pricing work architecturally?
A stream processor aggregates demand (trip requests/minute) and supply (available drivers) per geohash cell. When demand/supply exceeds a threshold, the surge multiplier for that cell is written to Redis. The rider's app reads the multiplier at request time. Computed continuously, not in batch.

What happens if the matching service goes down?
Trip requests queue in Kafka. On recovery, requests newer than ~30 seconds are processed against current driver positions; older requests are discarded (stale). In-progress trips are unaffected — their state lives in PostgreSQL, independent of the matching service.
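
A sketch of the backlog flush with the 30-second staleness cutoff described above; the in-memory list stands in for the Kafka consumer loop, and match() is whatever the normal matching path is:

```python
import time

MAX_REQUEST_AGE_S = 30  # beyond this, riders have already cancelled or timed out

def flush_backlog(queued_requests, match):
    """queued_requests: [(trip_request, enqueue_unix_ts)] read back after recovery."""
    now = time.time()
    for request, enqueued_at in queued_requests:
        if now - enqueued_at > MAX_REQUEST_AGE_S:
            continue  # stale: discard instead of matching against current driver positions
        match(request)
```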

Run this in SysSimulator →   Browse all blueprints

Next: Design Instagram →