Design typeahead search

Typeahead (autocomplete) is a common mid-to-senior interview question that tests latency reasoning, data structure selection, and the interplay between offline data processing and real-time serving. Its 100ms latency requirement is one of the strictest performance constraints in any standard design question.

What the interview is really asking

Latency as a first-class constraint. Typeahead suggestions must appear within 100ms of a keystroke to feel responsive. This is not a soft target — users can perceive lag above 100ms and will stop using the feature if it stutters. The interviewer is checking whether you design around this constraint from the start: pre-computing suggestions, aggressive caching, and keeping the query-time critical path to a cache lookup plus serialisation.

Data structure selection. A trie is the textbook answer for prefix matching, but tries have serious practical limitations at scale. The interviewer wants to hear you weigh a trie against an inverted index or Elasticsearch, not just name the trie. Candidates who say "trie" without discussing its limitations at scale demonstrate rote preparation, not depth.

Offline vs online split. The system has two distinct parts: an offline data pipeline that processes historical search queries to compute suggestion frequency and popularity, and an online serving layer that responds to keystrokes in real time. The interviewer is checking that you separate these concerns — the offline pipeline handles complexity (ranking model, frequency aggregation, trie construction or index building), while the online layer is kept simple (cache lookup).

Personalisation at query time. Google Search shows different suggestions to different users based on their search history and location. This requires blending a global suggestion set with a per-user personalisation signal at query time. The interviewer is checking whether you account for this without over-engineering it — personalisation can be a late-stage ranking signal applied to the top-K global candidates, not a separate suggestion index per user.

Back-of-envelope estimation

Query volume. Google Search processes ~8.5 billion searches per day. Typeahead fires on every keystroke, and the average search query is 4–5 words / 20–30 characters. Assuming users type 3 characters before picking a suggestion: 8.5B searches × 3 keystrokes = 25.5 billion typeahead requests per day. 25.5B / 86,400 seconds ≈ 295,000 typeahead RPS on average. At peak (office hours): ~500,000 RPS. This is an extreme read workload requiring massive caching.

Unique prefixes. With a 26-letter alphabet and a maximum prefix length of 6: 26⁶ ≈ 309 million possible 6-character prefixes (prefixes of length 1–5 add only a few percent more). In practice, far fewer occur — English-language queries cluster around common prefixes. The hot prefixes (the top 10,000 by traffic volume) handle the vast majority of typeahead requests. These fit easily in Redis memory.

Cache sizing. Each prefix maps to a list of top-10 suggestions. At 100 bytes per suggestion × 10 suggestions = 1 KB per prefix. Top 10 million prefixes × 1 KB = 10 GB, which fits in a Redis cluster of 3 nodes with replication.

Suggestion index size. The suggestion corpus (all queries ever searched, with frequency scores) at Google scale: hundreds of billions of queries. After deduplication and aggregation: ~10–50 billion unique query strings. This does not fit in memory — it lives in a distributed search index (Elasticsearch cluster) and only the hot prefixes are cached in Redis.

Architecture decisions and why

Serving layer: Redis prefix cache. The hot path for every typeahead request is: receive prefix → Redis GET suggestions:{prefix} → return the serialised suggestion list. Redis delivers this in under 1ms. Combined with network latency, the total query time is 20–50ms — well under the 100ms budget. Cache miss path: Redis miss → query Elasticsearch → cache the result in Redis with a 1-hour TTL → return suggestions. A cache miss adds 50–80ms, still within budget for cold prefixes.
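A minimal sketch of this cache-aside hot path, assuming redis-py and the official Elasticsearch Python client. The endpoints, the suggestions index, and the query_text / frequency_score field names are illustrative assumptions, not part of the original design:

```python
import json

import redis
from elasticsearch import Elasticsearch

r = redis.Redis(host="localhost", port=6379)  # illustrative endpoints
es = Elasticsearch("http://localhost:9200")

CACHE_TTL_SECONDS = 3600  # 1-hour TTL for cold prefixes pulled into the cache

def get_suggestions(prefix: str) -> list[str]:
    key = f"suggestions:{prefix}"
    cached = r.get(key)                 # hot path: one Redis lookup, <1ms
    if cached is not None:
        return json.loads(cached)

    # Cache miss: fall back to the suggestion index (edge n-gram field),
    # ranked by the offline frequency score.
    resp = es.search(
        index="suggestions",            # hypothetical index name
        query={"match": {"query_text": prefix}},
        sort=[{"frequency_score": "desc"}],
        size=10,
    )
    suggestions = [hit["_source"]["query_text"] for hit in resp["hits"]["hits"]]
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(suggestions))  # warm the cache
    return suggestions
```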

Why Elasticsearch over a trie. A trie storing every query prefix globally would require hundreds of GB of memory and complex distributed sharding logic. Elasticsearch's inverted index with edge n-gram tokenisation (which indexes every prefix of each word at index time: "s", "se", "sea", "sear", "searc", "search") turns prefix matching into a single constant-time term lookup, scales horizontally by adding Elasticsearch nodes, and supports compound ranking queries (frequency + personalisation + recency) natively. Edge n-grams are an indexing-time transformation — at query time, a prefix lookup is a standard term query.
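A sketch of that indexing-time transformation, reusing the client and names from the sketch above. The analyzer name and gram sizes are assumptions; edge_ngram token filters and separate index/search analyzers are standard Elasticsearch features:

```python
# Hypothetical index definition: an edge_ngram analyzer expands each word
# into its prefixes at index time, while the search analyzer leaves the
# typed prefix untouched so it matches as a plain term.
es.indices.create(
    index="suggestions",
    settings={
        "analysis": {
            "filter": {
                "edge_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                "prefix_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "edge_ngram_filter"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "query_text": {
                "type": "text",
                "analyzer": "prefix_analyzer",   # index-time n-gram expansion
                "search_analyzer": "standard",   # query matched verbatim
            },
            "frequency_score": {"type": "float"},
        }
    },
)
```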

Offline frequency pipeline. A batch job (daily or hourly) aggregates search query logs: count occurrences of each query string per day, compute a smoothed frequency score (exponentially weighted moving average to down-weight aging queries), write frequency scores back to Elasticsearch and Redis. Trending queries require a streaming pipeline: a Kafka consumer aggregates query counts in 1-hour sliding windows, detects frequency spikes, and injects trending labels into the cache. This keeps the online serving layer stateless — it reads from caches, never from raw logs.
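A sketch of the daily EWMA step; the decay constant and the minimal normalise helper are illustrative assumptions:

```python
from collections import Counter

DECAY = 0.8  # illustrative: yesterday's score keeps 80% of its weight

def normalise(query: str) -> str:
    return " ".join(query.lower().split())  # lowercase, collapse whitespace

def update_frequency_scores(todays_queries: list[str],
                            previous_scores: dict[str, float]) -> dict[str, float]:
    """One daily batch step: count today's occurrences of each query string,
    then blend them into an exponentially weighted moving average so that
    aging queries decay toward zero."""
    todays_counts = Counter(normalise(q) for q in todays_queries)
    scores = {}
    for query in set(previous_scores) | set(todays_counts):
        prev = previous_scores.get(query, 0.0)
        today = todays_counts.get(query, 0)
        scores[query] = DECAY * prev + (1 - DECAY) * today
    return scores  # the pipeline writes these back to Elasticsearch and Redis
```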

Personalisation as a re-ranking pass. After fetching the top-20 global suggestions for a prefix, the serving layer applies a personalisation boost: queries the user's recent search history (Redis, per-user sorted set of recent queries), boosts any candidate that prefix-matches the user's recent queries. This is a simple in-memory pass over 20 candidates — adds <5ms. No per-user index is needed: personalisation is an overlay on the global suggestion set, not a separate corpus.
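A sketch of that re-ranking pass. The additive boost weight is an assumption, and the recent queries would come from the per-user sorted set (e.g. r.zrevrange(f"recent:{user_id}", 0, 19) with redis-py):

```python
def rerank(candidates: list[tuple[float, str]],
           recent_user_queries: list[str],
           boost: float = 0.3) -> list[tuple[float, str]]:
    """Re-sort (global_score, suggestion_text) pairs, boosting any candidate
    that prefix-matches one of the user's recent queries. A pure in-memory
    pass over ~20 candidates; `boost` is an illustrative weight."""
    def score(candidate: tuple[float, str]) -> float:
        base, text = candidate
        personal = any(
            recent.startswith(text) or text.startswith(recent)
            for recent in recent_user_queries
        )
        return base + (boost if personal else 0.0)

    return sorted(candidates, key=score, reverse=True)
```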

Client-side debouncing. The client should not fire a typeahead request on every single character input. Standard practice: debounce requests by 100–150ms after the last keystroke. A user typing "python tutorial" at normal speed would generate 15 requests without debouncing; with a 150ms debounce, they generate 3–4. This reduces server load by ~4–5× with no perceptible impact on suggestion quality.
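In practice this is a few lines of client-side JavaScript; here is the same timer-reset pattern sketched in Python asyncio, with fetch_suggestions standing in as a hypothetical request coroutine:

```python
import asyncio

DEBOUNCE_SECONDS = 0.15
_pending: asyncio.Task | None = None

async def _debounced_fetch(prefix: str) -> None:
    await asyncio.sleep(DEBOUNCE_SECONDS)  # wait out the debounce window
    await fetch_suggestions(prefix)        # hypothetical request coroutine

def on_keystroke(prefix: str) -> None:
    """Called from the event loop on every keystroke; only the last call
    in a 150ms burst survives to issue a request."""
    global _pending
    if _pending is not None and not _pending.done():
        _pending.cancel()                  # reset the debounce timer
    _pending = asyncio.create_task(_debounced_fetch(prefix))
```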

Query cancellation. When a user types fast, earlier requests may be overtaken by later ones. The client should cancel in-flight requests when a new keystroke arrives. On the server side, implement request timeouts: if a typeahead request takes longer than 80ms, return an empty response instead. An empty suggestion box is better than a stale or slow one.
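A sketch of the server-side timeout (lookup_suggestions is a hypothetical async lookup); on a browser client, the analogous tool for cancelling in-flight requests would be AbortController:

```python
import asyncio

REQUEST_TIMEOUT_SECONDS = 0.08  # 80ms server-side budget

async def handle_typeahead(prefix: str) -> list[str]:
    """Return suggestions, or an empty list if the lookup overruns the
    budget; an empty suggestion box beats a slow one."""
    try:
        return await asyncio.wait_for(
            lookup_suggestions(prefix),    # hypothetical async lookup
            timeout=REQUEST_TIMEOUT_SECONDS,
        )
    except asyncio.TimeoutError:
        return []
```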

Run it in the simulator

Model the typeahead serving layer in SysSimulator with: a load balancer → typeahead API servers → Redis (prefix cache) → Elasticsearch (cold lookup path). Set traffic to 100,000 RPS (typeahead queries).

Configure Redis hit rate to 95% (hot prefixes cached) and observe: API server load should be minimal — most requests are pure Redis lookups. Elasticsearch sees only the 5% cache miss traffic (~5,000 RPS), well within its capacity.

Inject a Redis failure. All 100,000 RPS fall through to Elasticsearch. At 100K RPS, a typical Elasticsearch cluster will begin showing queue depth buildup and latency spikes past the 100ms budget within 30 seconds. Typeahead responses become too slow to be useful. This is why the Redis cache is not optional for the hot path — Elasticsearch alone cannot serve production typeahead traffic within latency budget.

Open SysSimulator →

Failure narration — word for word

"I'll inject a Redis cache failure for the suggestion prefix cache. All 100,000 RPS of typeahead queries will fall through to Elasticsearch."

"[inject] Elasticsearch starts receiving 100K RPS. At this rate, query queue depth climbs — Elasticsearch is optimised for complex search queries, not 100K simple prefix lookups per second. P99 query latency goes from 2ms (Redis) to 180ms (Elasticsearch under load). We've blown our 100ms latency budget. Users see the suggestion box stop populating in real time."

"The blast radius: typeahead suggestions lag or disappear. Core search functionality is unaffected — users can still submit a full query without suggestions. The UX impact is real but not catastrophic."

"My mitigation: a circuit breaker on the Redis client that, when Redis is unavailable, immediately returns an empty suggestion list rather than falling through to Elasticsearch. Empty suggestions under high load is better than slow suggestions that cause queue buildup. As Redis recovers, the circuit breaker half-opens and re-enables caching. The suggestion cache re-warms within minutes as traffic re-populates the hot prefixes."

The question behind the question

"How do you handle misspellings?" Fuzzy matching via Elasticsearch's fuzziness parameter (Levenshtein distance tolerance). At a distance of 1, "pthon" matches "python". Fuzzy matching is more expensive than exact prefix matching — apply it only on cache miss for the exact prefix and as a secondary pass when the exact prefix returns fewer than 5 suggestions. Don't apply fuzzy matching on the hot path.

"How do you build the initial suggestion corpus?" Aggregate search logs from production. Process the log stream daily: normalise queries (lowercase, remove punctuation), count frequency per query, compute IDF-like scores that down-weight excessively common queries ("the", "a"), and build the Elasticsearch index from the result. The index is rebuilt or incrementally updated daily. Bootstrapping for a new product: use Wikipedia titles, dictionary words, and product catalog names as an initial corpus.

"How do you serve different suggestions in different countries?" Segment the suggestion corpus by locale. Query logs in Japan produce different high-frequency queries than in Brazil. Maintain separate Elasticsearch indexes per locale, or use a single index with locale as a filter field. The Redis cache key includes locale: suggestions:{locale}:{prefix}. The cache is partitioned by locale — hot prefixes in each locale are cached independently.

Frequently asked questions

Should typeahead use a trie or Elasticsearch?
Elasticsearch (or a purpose-built search index) for global-scale typeahead. Tries are elegant but impractical at scale — hundreds of GB of memory, complex distributed updates. Elasticsearch with edge n-gram indexing achieves the same prefix matching, scales horizontally, and supports ranking signals. Tries are appropriate for small, static suggestion sets.

How do you rank typeahead suggestions?
Primary: query frequency (historical search logs aggregated by the offline pipeline). Secondary: personalisation boost (the user's recent queries), recency boost (trending queries), geographic relevance. These are combined into a weighted score. Ranking is applied offline; serving is a cache lookup.

What is the latency requirement?
100ms from keystroke to suggestions displayed. This requires pre-computed suggestions in Redis (sub-1ms lookup). Cold path via Elasticsearch adds 50–80ms on a miss, still within budget. Never do database queries, ML inference, or joins on the hot path.

How do you handle trending queries?
A stream processor aggregates query counts in sliding 1-hour windows. Queries with a 3× frequency spike vs baseline are flagged as trending. The trending signal is injected into the cache asynchronously every few minutes, and trending suggestions are labelled in the UI.
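A sketch of the spike check as each 1-hour window closes; the baseline window count and the 3× factor are the knobs described above:

```python
from collections import defaultdict, deque

class TrendDetector:
    """Keep a rolling baseline of per-query hourly counts; flag a query as
    trending when the window that just closed is 3x its baseline."""

    def __init__(self, baseline_windows: int = 24, spike_factor: float = 3.0):
        self.history = defaultdict(lambda: deque(maxlen=baseline_windows))
        self.spike_factor = spike_factor

    def on_window_close(self, query: str, count: int) -> bool:
        past = self.history[query]
        baseline = sum(past) / len(past) if past else 0.0
        past.append(count)
        return baseline > 0 and count >= self.spike_factor * baseline
```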


Next: Design a notification system →