Web crawler questions are deceptively broad. Interviewers use them to probe URL frontier design, politeness policies, deduplication at petabyte scale, DNS caching, and distributed work scheduling. The candidate who says "fetch the URL, parse the links, put them in a queue" is describing a homework project. This guide covers the full architecture — the parts that differentiate a crawler that handles the open web from one that handles a single domain.
URL frontier management. The frontier is the queue of URLs to crawl. At web scale, it holds billions of URLs and must support priority ordering (crawl important pages first) and politeness (don't hammer a single domain). The interviewer is checking whether you understand the two-queue model: front queues (priority-ordered) and back queues (per-domain for politeness), and how a scheduler mediates between them.
Politeness and robots.txt compliance. A crawler that sends 100 simultaneous requests to a single domain will get banned within minutes. The interviewer is checking that you know about robots.txt, Crawl-delay directives, per-domain rate limiting, and how these are enforced in a distributed system where multiple crawler workers might independently crawl the same domain.
Deduplication. The web contains enormous amounts of duplicate and near-duplicate content. Without deduplication, you waste crawl budget on pages you've already seen. The interviewer is checking that you know URL-level deduplication (Bloom filter or hash store) and content-level near-duplicate detection (SimHash or MinHash).
Fault tolerance and recrawl scheduling. Individual pages need to be recrawled — content changes over time. The interviewer is checking that you have a recrawl strategy (adaptive frequency based on observed change rate) and that the system recovers gracefully from crawler worker failures without losing frontier state.
Scale target. Build a crawler that indexes 1 billion pages in 30 days. 1B / (30 × 86,400) = ~385 pages/second. Accounting for DNS lookup latency, page processing, and retries, round up to 500 pages/second sustained throughput.
Storage for crawled content. Average HTML page: 100 KB raw, compressed to ~30 KB. 1 billion pages × 30 KB = 30 TB for raw page content. With metadata (URL, crawl timestamp, HTTP headers, status code): another ~2 KB/page = 2 TB. Total: ~32 TB for a single crawl pass. Store pages keyed by SHA-256 of the URL in an object store (S3-compatible), not a relational database — this is write-once, read-rarely data.
URL frontier size. Discovered URLs per crawled page: 50 outbound links on average. 1B pages × 50 = 50B discovered URLs. After deduplication, roughly 10B unique URLs sit in the frontier at any given time. Each URL record: ~100 bytes (URL string + priority score + domain + crawl metadata). 10B × 100 bytes = 1 TB. This must live in a distributed data store, not in memory.
DNS resolution. At 500 pages/second, every page from a not-yet-seen domain requires a DNS lookup. If 10% of pages come from new domains, that is 50 DNS lookups/second, each taking 20–100 ms. Cache DNS responses by domain with the response TTL. A local DNS cache at each crawler worker eliminates >95% of those lookups, reducing the effective lookup rate to ~2.5/second at steady state.
Crawler workers needed. Each worker makes one HTTP request at a time (to respect per-domain rate limits). Average page fetch + parse time: 200ms (includes network RTT, HTML parsing). Each worker processes 5 pages/second. To achieve 500 pages/second: 500 / 5 = 100 concurrent crawler workers. With retries and variance, provision 150 workers.
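As a sanity check, the estimates above reduce to a few lines of arithmetic. A quick sketch, where every figure (page size, metadata size, fetch latency) is an assumption stated in the text rather than a measurement:

```python
# Back-of-envelope numbers from the estimates above.
PAGES = 1_000_000_000          # crawl target
DAYS = 30

pages_per_sec = PAGES / (DAYS * 86_400)          # ~385, rounded up to 500
content_tb = PAGES * 30_000 / 1e12               # 30 KB compressed per page -> 30 TB
metadata_tb = PAGES * 2_000 / 1e12               # 2 KB metadata per page -> 2 TB
frontier_tb = 10_000_000_000 * 100 / 1e12        # 10B URLs x 100 bytes -> 1 TB
workers = 500 / (1 / 0.2)                        # 200 ms per fetch -> 5 pages/s/worker -> 100

print(f"{pages_per_sec:.0f} pages/s, {content_tb + metadata_tb:.0f} TB content+metadata, "
      f"{frontier_tb:.0f} TB frontier, {workers:.0f} workers")
```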
Two-queue URL frontier. The frontier has two layers. Front queues (F1…Fn, one per priority level) hold URLs sorted by priority — high-PageRank, recently changed, and seeded URLs are in higher-priority queues. A prioritizer reads URLs from front queues and routes them to back queues. Back queues (B1…Bm, one per domain or domain hash) hold URLs grouped by domain. The scheduler selects from back queues such that each domain is fetched at most once per crawl-delay interval. This two-level design decouples "what to crawl next" (priority) from "when to crawl it" (politeness).
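A minimal sketch of this two-level frontier, assuming an in-process implementation with one back queue per exact domain and a fixed default crawl delay (a production frontier would persist the queues and shard them across machines):

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class Frontier:
    def __init__(self, num_priorities=3, default_delay=1.0):
        self.front = [deque() for _ in range(num_priorities)]  # 0 = highest priority
        self.back = defaultdict(deque)     # domain -> queued URLs
        self.ready = []                    # min-heap of (next_allowed_time, domain)
        self.on_heap = set()               # domains currently scheduled
        self.delay = default_delay         # seconds between requests to one domain

    def add(self, url, priority=1):
        self.front[priority].append(url)

    def _route_one(self):
        # Prioritizer: move one URL from the highest non-empty front queue
        # into its domain's back queue, scheduling the domain if needed.
        for q in self.front:
            if q:
                url = q.popleft()
                domain = urlsplit(url).netloc
                self.back[domain].append(url)
                if domain not in self.on_heap:
                    heapq.heappush(self.ready, (time.monotonic(), domain))
                    self.on_heap.add(domain)
                return True
        return False

    def next_url(self):
        # Scheduler: pick the domain whose crawl-delay expires soonest.
        self._route_one()
        if not self.ready:
            return None
        next_time, domain = heapq.heappop(self.ready)
        time.sleep(max(0.0, next_time - time.monotonic()))   # enforce politeness
        url = self.back[domain].popleft()
        if self.back[domain]:
            heapq.heappush(self.ready, (time.monotonic() + self.delay, domain))
        else:
            self.on_heap.discard(domain)
        return url
```

The essential property is visible even in this toy version: priority decides which URL enters a back queue, while the per-domain heap decides when it is actually fetched.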
Bloom filter for URL deduplication. A distributed Bloom filter (hosted in Redis) tracks all seen URLs. Before a URL is added to the frontier, the crawler checks the Bloom filter. False positive rate: ~1% with 10 bits per element and 7 hash functions. 10B URLs × 10 bits = 12.5 GB for the Bloom filter — it fits in a Redis cluster. False positives mean some URLs are skipped erroneously, which is acceptable (a small re-discovery error). False negatives are impossible: a URL that has been added to the filter is always reported as seen.
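The sizing follows from the standard Bloom filter formulas. A short sketch of the arithmetic, using the figures above:

```python
import math

# p = (1 - e^(-k*n/m))^k, with optimal k = (m/n) * ln 2.
n = 10_000_000_000          # unique URLs tracked
bits_per_element = 10
m = n * bits_per_element    # total filter bits
k = 7                       # hash functions; (m/n) * ln 2 ~= 6.9

fpr = (1 - math.exp(-k * n / m)) ** k
print(f"filter size: {m / 8 / 1e9:.1f} GB, false positive rate: {fpr:.2%}")
# -> 12.5 GB, ~0.82%
```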
SimHash for near-duplicate content detection. After fetching a page, compute a 64-bit SimHash fingerprint. SimHash hashes each feature (word or shingle) to 64 bits, then for each bit position adds the feature's weight where the hash bit is 1 and subtracts it where the bit is 0; the fingerprint has a 1 wherever the running sum is positive. Two pages whose fingerprints are within Hamming distance 3 are considered near-duplicates. Store all fingerprints in a hash store; a query asks whether any stored fingerprint lies within Hamming distance 3 of the new one. Implementation: split each 64-bit fingerprint into four 16-bit chunks and index it in four tables, one per chunk. If two fingerprints differ in at most 3 bits, at least one chunk matches exactly (pigeonhole), so a near-duplicate query probes all four tables by exact chunk match and verifies candidates by full Hamming distance. SimHash is used by Google's crawl deduplication at web scale.
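A minimal sketch of the fingerprinting and the chunk-split trick, using plain term frequency as the feature weight and blake2b as the token hash (illustrative choices, not prescribed by the original SimHash scheme):

```python
import hashlib
from collections import Counter

def simhash64(text: str) -> int:
    """64-bit SimHash: hash each token, add its weight where the hash bit
    is 1 and subtract where it is 0, then keep the sign of each column."""
    vector = [0] * 64
    for token, weight in Counter(text.lower().split()).items():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            vector[bit] += weight if (h >> bit) & 1 else -weight
    return sum(1 << bit for bit in range(64) if vector[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def chunks16(fp: int):
    """Split a fingerprint into four 16-bit chunks. If two fingerprints are
    within Hamming distance 3, at least one chunk matches exactly, so each
    chunk can serve as an exact-match index key."""
    return [(fp >> (16 * i)) & 0xFFFF for i in range(4)]

a = simhash64("the quick brown fox jumps over the lazy dog")
b = simhash64("the quick brown fox jumped over the lazy dog")
print(hamming(a, b), chunks16(a))
```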
DNS caching at crawler worker level. Each crawler worker maintains a local in-process DNS cache (domain → IP, TTL from DNS response). Cache hit: 0ms lookup. Cache miss: external DNS resolver query (~50ms). With a crawler making 500 requests/second against a corpus of tens of millions of unique domains, the cache hit rate for steady-state crawling (revisiting known domains) is >95%. DNS failures (NXDOMAIN, SERVFAIL) are also cached temporarily to avoid hammering DNS resolvers for dead domains.
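A per-worker cache along these lines, sketched with the blocking stdlib resolver and illustrative TTLs (a real worker would take the TTL from the DNS response and resolve asynchronously):

```python
import socket
import time

class DnsCache:
    """Per-worker DNS cache: positive entries honor a TTL, and failures are
    negatively cached for a short window so dead domains don't hammer the
    resolver."""

    def __init__(self, positive_ttl=300.0, negative_ttl=60.0):
        self.cache = {}                 # domain -> (expires_at, ip_or_None)
        self.positive_ttl = positive_ttl
        self.negative_ttl = negative_ttl

    def resolve(self, domain: str):
        now = time.monotonic()
        entry = self.cache.get(domain)
        if entry and entry[0] > now:
            return entry[1]             # cache hit (possibly a cached failure)
        try:
            ip = socket.gethostbyname(domain)
            self.cache[domain] = (now + self.positive_ttl, ip)
            return ip
        except socket.gaierror:         # NXDOMAIN, SERVFAIL, etc.
            self.cache[domain] = (now + self.negative_ttl, None)
            return None
```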
Adaptive recrawl scheduling. After initial indexing, pages must be recrawled to capture updates. Recrawl frequency is adaptive: pages that changed on the last N visits get a shorter crawl interval (e.g., 1 day for news sites), while pages that did not change get a longer one (30 days for static docs). The observed change rate is stored per URL and fed back into the priority calculation, so high-change pages get higher priority scores in the front queues.
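A minimal sketch of the interval adjustment, with illustrative bounds and multipliers rather than tuned values:

```python
MIN_INTERVAL_H = 24          # never recrawl more often than daily
MAX_INTERVAL_H = 24 * 30     # never wait longer than 30 days

def next_interval(current_hours: float, changed: bool) -> float:
    if changed:
        return max(MIN_INTERVAL_H, current_hours * 0.5)    # crawl sooner
    return min(MAX_INTERVAL_H, current_hours * 1.5)         # back off

interval = 7 * 24.0
for changed in [True, True, False, False, False]:
    interval = next_interval(interval, changed)
    print(f"changed={changed} -> recrawl in {interval / 24:.1f} days")
```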
robots.txt compliance. Each worker fetches and caches robots.txt for each domain (cached with a 24-hour TTL). Before fetching any URL, the worker checks the cached robots.txt for disallow rules and Crawl-delay. Disallowed URLs are dropped. Crawl-delay is enforced per back queue — the scheduler waits at least Crawl-delay seconds between consecutive requests to that domain. If no Crawl-delay is specified, default to 1 request/second per domain for courtesy.
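A sketch of this policy using the stdlib robots.txt parser; the 24-hour cache TTL and the 1 request/second default mirror the text, while the user agent string is a placeholder:

```python
import time
import urllib.robotparser

ROBOTS_TTL = 24 * 3600
USER_AGENT = "example-crawler"
_robots_cache = {}            # domain -> (fetched_at, RobotFileParser)

def get_robots(domain: str) -> urllib.robotparser.RobotFileParser:
    entry = _robots_cache.get(domain)
    if entry and time.time() - entry[0] < ROBOTS_TTL:
        return entry[1]
    rp = urllib.robotparser.RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()                                   # fetch and parse robots.txt
    _robots_cache[domain] = (time.time(), rp)
    return rp

def allowed_and_delay(domain: str, url: str):
    """Return (is_allowed, crawl_delay_seconds) for a URL on this domain."""
    rp = get_robots(domain)
    delay = rp.crawl_delay(USER_AGENT) or 1.0   # default: 1 request/second
    return rp.can_fetch(USER_AGENT, url), delay
```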
The distributed crawl topology maps directly to components in SysSimulator: a URL frontier (priority queue), scheduler, crawler workers, DNS resolver, content storage, and a deduplication service.
Set up a load generator producing 500 URL-fetch tasks/second. Observe throughput across the crawler worker pool. Inject a DNS resolver failure — watch how the local caches at each worker absorb the disruption for previously seen domains while new-domain fetches back off.
Then inject a slow domain (one back queue with 5× normal response time). Observe that only that domain's back queue backs up; other domains are unaffected. This is the isolation guarantee of per-domain queues.
Open SysSimulator and model this →
"I'll inject a crawler worker failure — 30 of our 100 workers go down simultaneously. This could be a spot instance reclaim in a cloud environment."
"[inject] The 30 failed workers had URLs checked out from the URL frontier — those URLs were in-flight. Without acknowledgment, the frontier detects timeout (URL not acknowledged within 30 seconds). The frontier marks those URLs as unacknowledged and returns them to the back queue with their original priority. The remaining 70 workers continue crawling. Throughput drops from 500 to ~350 pages/second."
"Recovery: the auto-scaler provisions 30 replacement workers within 60–90 seconds. The replacement workers pull URLs from the frontier normally. The timed-out URLs are requeued and will be refetched — they might be fetched twice (once by the replacement workers and once if the original workers briefly recovered), but the URL-level deduplication and content-level SimHash ensures we don't index duplicate content. Throughput returns to 500/second within ~2 minutes of the initial failure."
"The important property here: the frontier is the source of truth, not the workers. Worker failures are always recoverable via the checkpoint-and-requeue mechanism."
"How do you handle infinite crawl traps?" A crawl trap is a site that generates infinite unique URLs (e.g., infinite calendar pages, session-parameterized URLs). Detection: limit crawl depth per domain (max 10 hops from the seed URL); detect URL patterns with dynamic parameters (e.g., `?date=2024-01-01`) and canonicalize (strip or normalize query parameters based on a URL normalization policy); set a per-domain page limit (e.g., max 10,000 pages per domain per crawl cycle). URLs that match a "trap" pattern are filtered before being added to the frontier.
"How do you handle login-required content?" Most web crawlers are limited to publicly accessible content. For authenticated crawling (e.g., a company's internal knowledge base crawler): the crawler holds session credentials (API keys, OAuth tokens, or browser session cookies). Cookies are scoped per domain and rotated periodically. This is a specialized use case — the public web crawler does not authenticate.
"How do you prioritize between freshness and coverage?" It is a resource allocation problem. Total crawl budget is fixed (e.g., 500 pages/second). Budget is split: X% for recrawling known high-value pages (freshness), (100-X)% for discovering new pages (coverage). X is tuned based on objectives — a news indexer weights freshness heavily, a web archive weights coverage. Each front queue level can have a budget cap to prevent any one priority class from monopolizing the crawlers.
How does a web crawler handle duplicate content?
Two levels: URL-level deduplication via Bloom filter (prevent re-fetching), content-level near-duplicate detection via SimHash (prevent indexing similar pages). SimHash fingerprints pages with a 64-bit hash — pages with Hamming distance ≤3 are near-duplicates. This catches mirrors, scraped copies, and paginated near-identical variants.
What is crawl politeness and why does it matter?
Respecting per-domain rate limits to avoid overloading servers. Enforced by per-domain back queues with Crawl-delay scheduling. An impolite crawler gets banned. Good crawlers follow robots.txt and limit to 1 request/domain/second as a default.
How do you prioritize which URLs to crawl?
Multi-factor priority score: PageRank of source page, inbound link count, freshness decay (time since last observed change). Stored in priority-ordered front queues. High-priority URLs are scheduled first; low-priority URLs are crawled in background passes.
How does a crawler handle JavaScript-rendered pages?
Two-tier fetching: standard HTTP fetcher (fast, cheap) for static HTML; headless Chromium tier for JS-heavy pages. Routing heuristic: low body-text ratio in HTML response → route to headless tier. Headless tier scales separately at much lower throughput due to cost.