The YouTube question is a senior and staff-level staple. It tests two distinct architectural skills that rarely appear together: a write-heavy, CPU-intensive upload pipeline and a massively read-heavy, CDN-dominated playback path. Interviewers use it to separate candidates who understand distributed systems from those who understand only web APIs.
The YouTube design question is not asking you to build YouTube. It is asking whether you can reason about two fundamentally different system profiles operating simultaneously within the same product.
The write path (upload) is CPU-bound and latency-tolerant. 500 hours of video uploaded per minute. Each upload triggers a multi-stage processing pipeline: validation, virus scanning, transcoding into five or more resolutions and multiple codecs, thumbnail extraction, audio track separation, caption generation, and metadata indexing. Transcoding a single 1-hour 4K video can take hours on a single machine. The upload pipeline is designed for throughput, not latency — users expect their video to be "processing" for some time after upload. The interviewer is evaluating whether you understand that the bottleneck here is computational capacity, not I/O or network bandwidth.
The read path (playback) is I/O-bound and latency-critical. 1 billion hours of video watched per day. Playback must begin within 1–2 seconds of hitting play. Buffering destroys user experience and is a key product metric. The read path is almost entirely served from cache (CDN) — the origin storage systems are comparatively lightly loaded. The interviewer is evaluating whether you understand CDN architecture, adaptive bitrate streaming, and how to design for a 99%+ cache hit rate on popular content.
The metadata layer connects them. Video title, creator, duration, view count, like count, search index — this is structured data that needs fast random access by video ID, creator ID, and search terms. It's a completely different storage and query pattern from the blob storage used for video files. The interviewer is checking whether you separate these concerns clearly or conflate them into a single "database" that would handle neither access pattern well.
Upload volume. 500 hours of video per minute is YouTube's publicly stated figure. Convert carefully: 500 hours × 3,600 seconds = 1.8 million seconds of video arriving per minute, i.e. 30,000 video-seconds per wall-clock second. At ~1 MB/sec for 1080p, that is ~30 GB/sec of sustained raw ingress — roughly 1.8 TB/minute, or ~2.6 PB/day. After transcoding into multiple resolutions and formats, each video is stored several times over, though the transcoded renditions are far more compressed than the raw upload; net new storage lands in the range of several petabytes per day.
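The unit conversion is where this estimate usually goes wrong — 500 hours of video per minute is 30,000 video-seconds per wall-clock second. A quick sketch of the arithmetic (the ~1 MB/sec 1080p bitrate is an assumption, not an official figure):

```python
# Back-of-envelope for the upload path. Inputs are assumptions from the text.
HOURS_UPLOADED_PER_MIN = 500
RAW_MB_PER_VIDEO_SEC = 1.0  # ~8 Mbps raw 1080p upload

# 500 hours of content per minute -> video-seconds arriving per wall-second
video_secs_per_wall_sec = HOURS_UPLOADED_PER_MIN * 3600 / 60   # 30,000

ingress_gb_per_sec = video_secs_per_wall_sec * RAW_MB_PER_VIDEO_SEC / 1000
raw_pb_per_day = ingress_gb_per_sec * 86_400 / 1e6

print(f"sustained ingress: {ingress_gb_per_sec:.0f} GB/sec")
print(f"raw upload volume: {raw_pb_per_day:.1f} PB/day")
```

Running it gives ~30 GB/sec and ~2.6 PB/day of raw ingress, before any storage amplification from transcoding.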
Transcoding compute demand. Each uploaded video needs to be transcoded into at minimum five variants: 360p, 480p, 720p, 1080p, and 4K (where applicable), plus multiple codecs (H.264 for compatibility, VP9 and AV1 for efficiency). A single 1-hour video at 4K requires approximately 10–30 CPU-hours to transcode. At 500 hours uploaded per minute, 30,000 hours of raw video arrive every hour. Taking a blended average of ~10 CPU-hours per hour of source video (most uploads are well below 4K), that is roughly 300,000 CPU-hours per hour of sustained transcoding demand — on the order of hundreds of thousands of worker cores running continuously, which is why Google ultimately built dedicated transcoding hardware (the Argos VCU) rather than scaling general-purpose CPUs indefinitely.
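The same arithmetic as code (the 10 CPU-hours-per-video-hour blended average is an assumption for the estimate, not a measured figure):

```python
# Sustained transcoding demand. 500 hours/min of uploads means 30,000 hours
# of source video arriving per hour, each needing CPU-intensive encoding.
HOURS_UPLOADED_PER_MIN = 500
CPU_HOURS_PER_VIDEO_HOUR = 10   # assumed blended average across resolutions

video_hours_per_hour = HOURS_UPLOADED_PER_MIN * 60              # 30,000
cpu_hours_per_hour = video_hours_per_hour * CPU_HOURS_PER_VIDEO_HOUR

# CPU-hours demanded per hour == cores that must run continuously
print(f"{cpu_hours_per_hour:,} CPU-hours per hour of transcoding demand")
print(f"≈ {cpu_hours_per_hour:,} cores running around the clock")
```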
Storage. YouTube's total video corpus is widely estimated at exabyte scale. At petabytes of new storage per day, a decade of uploads compounds into multiple exabytes. Storage is therefore tiered: recently uploaded and frequently accessed videos on faster, more expensive storage; older, rarely watched content on cold archival storage (lower cost, higher retrieval latency).
Playback volume. 1 billion hours watched per day is 3.6 trillion video-seconds; divided across the 86,400 seconds in a day, that is ~42 million average concurrent viewers. At an average bitrate of 4 Mbps (1080p): 42M × 4 Mbps ≈ 167 terabits per second of aggregate playback bandwidth. The real blended bitrate is lower (much of the audience watches on mobile at lower resolutions), but the order of magnitude stands: YouTube's CDN must serve well over 100 Tbps globally. For context, this is a significant fraction of total global internet traffic.
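The playback-side envelope, assuming the 4 Mbps average bitrate from the paragraph above:

```python
# Playback-side estimate. Each concurrent viewer consumes one video-second
# per wall-second, so concurrency = video-seconds/day / seconds/day.
HOURS_WATCHED_PER_DAY = 1e9
AVG_BITRATE_MBPS = 4            # assumed blended average (1080p-ish)

concurrent_viewers = HOURS_WATCHED_PER_DAY * 3600 / 86_400   # ~41.7M
egress_tbps = concurrent_viewers * AVG_BITRATE_MBPS / 1e6

print(f"average concurrent viewers: {concurrent_viewers / 1e6:.1f}M")
print(f"aggregate egress: {egress_tbps:.0f} Tbps")
```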
CDN cache hit rate. YouTube's traffic is extremely concentrated — the top 1% of videos by popularity generate roughly 90% of watch time. A CDN cache sized for the top 1% of videos can serve 90% of requests from cache, reducing origin load by 10×. This is the leverage point of the entire architecture.
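The leverage is nonlinear: origin load scales with the cache *miss* rate, so small hit-rate improvements near the top compound dramatically. A one-liner makes this concrete:

```python
def origin_offload(hit_rate: float) -> float:
    """Factor by which a cache with the given hit rate reduces origin traffic.
    Origin sees only the misses, so the reduction is 1 / (1 - hit_rate)."""
    return 1.0 / (1.0 - hit_rate)

for hr in (0.90, 0.94, 0.99):
    print(f"hit rate {hr:.0%}: origin sees 1/{origin_offload(hr):.0f} of requests")
```

Going from a 90% to a 99% hit rate cuts origin load another 10×, which is why pre-warming edge caches with the most popular content is worth significant engineering effort.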
Upload pipeline: async and queue-driven. The upload flow is entirely asynchronous. The client streams the raw video to an ingestion service, which stores the raw bytes in object storage and returns an upload ID immediately. The client sees "Upload complete." A separate pipeline then begins processing: a job is published to a queue (Kafka or a dedicated task queue), transcoding workers pull jobs and run FFmpeg or equivalent, completed transcoded segments are written back to object storage, and the metadata database is updated when each resolution becomes available. The upload response does not wait for transcoding — it waits only for raw byte receipt.
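A minimal sketch of this ingestion handler, with in-memory stand-ins for object storage and the job queue (all names are illustrative, not a real API):

```python
import queue
import uuid

object_store = {}                    # stand-in for S3/GCS-style blob storage
transcode_jobs = queue.Queue()       # stand-in for Kafka / a task queue

def handle_upload(raw_bytes, title):
    """Ingestion endpoint: persist raw bytes, enqueue processing, return at once.
    The response never waits for transcoding -- only for durable raw-byte receipt."""
    upload_id = str(uuid.uuid4())
    object_store[f"raw/{upload_id}"] = raw_bytes          # durable write first
    transcode_jobs.put({"upload_id": upload_id, "title": title})
    return upload_id   # client sees "Upload complete" here

vid = handle_upload(b"\x00" * 1024, "my video")
print("upload acknowledged:", vid)
print("transcode jobs queued:", transcode_jobs.qsize())
```

The ordering matters: the raw bytes are made durable before the job is enqueued, so a crash between the two steps loses only a queue entry (recoverable by a sweep), never the user's upload.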
Why separate blob storage from metadata. Video files are large (megabytes to gigabytes), accessed sequentially, and read by byte-range. Object storage (S3-compatible) is optimised for exactly this: cheap at scale, handles large files, delivers sequential reads efficiently. Video metadata is small, accessed by arbitrary query patterns (by ID, creator, tag, upload date), and must be indexed. A relational database with appropriate indexes handles this. Mixing the two — storing video blobs in a relational database or trying to query metadata from object storage — creates systems that handle neither workload well.
Transcoding at scale: worker pools with preemption. Transcoding workers are stateless — they pull a job from the queue, perform CPU-intensive encoding, write output to object storage, and return. If a worker crashes mid-job, the job returns to the queue and another worker picks it up. This is the classic competing consumer pattern. Worker count scales horizontally with queue depth. Priority queues allow newly uploaded videos (which users are waiting for) to jump ahead of background re-encoding tasks (re-encoding old videos to newer codecs).
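The competing-consumer pattern with crash recovery can be sketched in a few lines (the simulated crash and file naming are illustrative):

```python
import queue

jobs = queue.Queue()
for res in ("360p", "720p", "1080p"):
    jobs.put({"video": "abc123", "resolution": res})

def transcode(job, fail=False):
    """Stand-in for the CPU-intensive FFmpeg step."""
    if fail:
        raise RuntimeError("worker crashed mid-encode")
    return f"{job['video']}/{job['resolution']}.mp4"

completed = []
attempt = 0
while not jobs.empty():
    job = jobs.get()
    attempt += 1
    try:
        # Simulate one crash: the very first attempt dies mid-job.
        completed.append(transcode(job, fail=(attempt == 1)))
    except RuntimeError:
        jobs.put(job)   # job returns to the queue; another worker picks it up

print(completed)
```

Because workers are stateless and jobs are idempotent (output is written to a deterministic object-store path), a retried job can safely overwrite a partial result from the crashed attempt.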
Progressive availability. Users expect their video to be watchable as quickly as possible after upload. The lowest resolution (360p) is transcoded first — typically completing in minutes for a 10-minute video. The metadata is updated to mark the video as available at 360p, and it becomes watchable. Higher resolutions complete in the background and are added to the manifest as they finish. This is a product design decision that informs the architecture: process lowest quality first, publish as soon as any quality is available.
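The "lowest quality first" ordering is just a priority queue keyed on rendition cost. A sketch (resolution priorities are illustrative):

```python
import heapq

# Lower priority number = cheaper encode = processed first.
RESOLUTIONS = {"360p": 1, "480p": 2, "720p": 3, "1080p": 4, "2160p": 5}

def plan_renditions(upload_id):
    """Order transcode jobs so the cheapest rendition finishes first."""
    heap = [(prio, res) for res, prio in RESOLUTIONS.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, res = heapq.heappop(heap)
        order.append(res)
    return order

available = []
for i, res in enumerate(plan_renditions("abc123")):
    available.append(res)
    if i == 0:
        # Metadata flips to "available" as soon as the first rendition lands.
        print(f"video published: watchable at {res}")
print("manifest now lists:", available)
```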
CDN and adaptive bitrate streaming. YouTube's playback works via DASH (Dynamic Adaptive Streaming over HTTP). Each video is divided into 5–10 second segments at each quality level. The manifest (in DASH, an XML document called the MPD) lists the available representations and their segment URLs. The player fetches the manifest, then fetches segments sequentially. The CDN caches segments by URL — popular video segments are cached at thousands of edge nodes globally. Adaptive bitrate: the player monitors download speed and buffer depth, requesting higher quality when bandwidth allows and dropping quality when it doesn't. Because quality switches happen at segment boundaries, they are effectively invisible to the user.
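A simplified sketch of the client-side ABR decision — pick the highest rendition whose bitrate fits measured throughput with headroom, and drop to the floor when the buffer runs low (the ladder bitrates and thresholds are assumptions, not YouTube's actual values):

```python
# Bitrate ladder in Mbps, ascending quality order.
LADDER = {"360p": 1.0, "480p": 2.0, "720p": 4.0, "1080p": 8.0}

def pick_quality(throughput_mbps, buffer_secs):
    """Choose the rendition for the next segment fetch."""
    if buffer_secs < 5:                 # rebuffer imminent: be conservative
        return "360p"
    budget = throughput_mbps * 0.8      # 20% safety headroom on measurement
    fitting = [q for q, br in LADDER.items() if br <= budget]
    return fitting[-1] if fitting else "360p"

print(pick_quality(12.0, 20.0))  # healthy network and buffer -> 1080p
print(pick_quality(4.0, 20.0))   # constrained bandwidth -> 480p
print(pick_quality(12.0, 2.0))   # buffer nearly empty -> 360p
```

Real players (and the DASH reference player) use smoothed throughput estimates and hybrid buffer/rate heuristics, but the shape of the decision is the same.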
View count as an eventually consistent counter. Incrementing a database row on every view would make the view count row a write hot spot — viral videos receive thousands of views per second. The solution: write view events to Kafka, aggregate in stream processing (counting in 1-second windows), periodically flush aggregated counts back to the database. View count may lag by a few seconds on fast-growing content. This is an intentional tradeoff — eventual consistency is acceptable for a public counter, and it protects the database from hotspot writes.
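The aggregate-then-flush pattern in miniature, with a `Counter` standing in for both the stream processor's window state and the database rows:

```python
from collections import Counter

db_counts = Counter()   # stand-in for the metadata DB's view-count column
window = Counter()      # in-memory window inside the stream processor

def record_view(video_id):
    """Hot path: one cheap in-memory increment per view event."""
    window[video_id] += 1

def flush_window():
    """Periodic flush: one DB write per hot video per window, however many views."""
    for video_id, n in window.items():
        db_counts[video_id] += n
    writes = len(window)
    window.clear()
    return writes

for _ in range(5000):
    record_view("viral")
for _ in range(3):
    record_view("niche")

writes = flush_window()
print(db_counts["viral"], db_counts["niche"], "db writes:", writes)
```

5,003 view events collapse into 2 database writes; the viral video's counter absorbs thousands of views per second without ever becoming a write hot spot.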
Load the Video Streaming blueprint in SysSimulator. This blueprint models the full pipeline: an ingestion service, a transcoding worker pool connected via queue, object storage, a CDN layer, and the metadata service.
Set traffic to represent playback load — start at 50,000 RPS (sustained playback requests). Observe: CDN hit rate (should be 90%+), origin request rate (should be 10% of CDN requests), and object storage read latency. At healthy load, virtually all playback traffic is absorbed by the CDN.
Then simulate a CDN edge node failure via the Chaos panel. Watch: traffic that was being served locally now falls back to the next-nearest node or origin. CDN hit rate drops, origin request rate spikes, p99 latency for affected users increases from ~50ms to 200–500ms as requests travel further. Record the impact radius — what percentage of users are affected and by how much latency.
Next, simulate a transcoding queue backup by reducing worker capacity. Watch the upload-to-availability latency climb as the queue depth grows. This is the signal that tells an on-call engineer to add transcoding capacity.
Open Video Streaming blueprint →
"I'm going to inject a CDN edge failure — this simulates a CDN PoP going offline, affecting users in that geographic region."
"[inject] CDN hit rate drops from 94% to 71% — about 23% of requests are now missing their nearest cache and falling back to the origin or a more distant PoP. For those users, p99 latency jumps from 48ms to 340ms. That's still playable — DASH adaptive bitrate will notice the increased buffering time and drop these users from 1080p to 720p or 480p automatically. Visible quality drop, not an outage."
"The blast radius: viewers in the affected region experience quality degradation. No requests are dropped — they fall back to origin or a distant PoP. The origin infrastructure sees a 3× spike in requests. If the origin was already near capacity, this is when you'd see real errors. In our design, the origin has a 5× headroom over normal load precisely for CDN failover scenarios."
"Recovery: CDN PoPs are automatically failed over by the CDN provider's health-checking. If it's a partial failure, traffic re-routes around the degraded node in under a minute. If it's a full PoP outage, recovery depends on the provider's infrastructure repair timeline — which is why we don't build our own CDN and instead use a provider with a global PoP network."
"What happens when the transcoding pipeline falls behind?" Queue depth grows. Newly uploaded videos have longer wait times before becoming available. The system degrades gracefully — existing videos continue playing normally, only new uploads are delayed. The mitigation: auto-scaling transcoding workers based on queue depth, and prioritising new uploads over background re-encoding tasks. The interviewer is checking that you designed for graceful degradation, not just the happy path.
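The auto-scaling rule described above reduces to "workers needed = backlog / desired drain rate." A sketch with illustrative numbers (drain rates and bounds are assumptions):

```python
def target_workers(queue_depth, drain_rate_per_worker=2.0,
                   target_drain_minutes=30, min_workers=10, max_workers=5000):
    """Workers needed to drain the current backlog within the target window.
    drain_rate_per_worker: jobs one worker completes per minute (assumed)."""
    needed = queue_depth / (drain_rate_per_worker * target_drain_minutes)
    return max(min_workers, min(max_workers, int(needed) + 1))

print(target_workers(0))        # idle pipeline: scale to the floor of 10
print(target_workers(12_000))   # backlog: scale out to drain it in 30 min
```

Pairing this with a priority queue (new uploads ahead of background re-encodes) means that even mid-backlog, user-facing availability latency stays bounded while deferred work absorbs the delay.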
"How does YouTube handle duplicate uploads?" Content fingerprinting (Content ID) identifies duplicate or copyrighted content. This is a separate system from the transcoding pipeline — it runs as a background task after initial processing. Deduplication at the storage level uses perceptual hashing; exact duplicate detection at the byte level uses cryptographic hashing. The interviewer is checking breadth of design thinking.
"How do you design for live streaming vs on-demand?" Live streaming is fundamentally different: segments are generated in real time, the manifest is continuously updated, and CDN cache TTLs must be very short (or bypassed) for live segments. On-demand videos have static manifests and long-lived cache entries. Many designs use separate pipelines for live vs VOD precisely because their cache and latency requirements are incompatible.
"Why not serve video directly from the database?" Video files are gigabytes. A database row is kilobytes. Databases are optimised for structured queries with indexes, transactions, and small random reads — none of which apply to serving a 2 GB video file as a sequential byte stream. Object storage is purpose-built for this: petabyte-scale, byte-range reads, optimised for throughput over latency. This question is checking whether you understand the principle of matching storage to access pattern.
What is the hardest part of designing YouTube?
The transcoding pipeline at scale. 500 hours uploaded per minute, each requiring CPU-intensive conversion into multiple formats and resolutions, with user expectation of near-immediate availability. The challenge is throughput, parallelism, and progressive availability — getting at least one resolution online as fast as possible while higher qualities are still being processed.
How does YouTube separate video storage from metadata?
Video blobs go to object storage (optimised for large sequential reads). Metadata (title, view count, creator, tags) goes to a relational database. These have incompatible access patterns and should never share the same storage system. This separation is one of the first things interviewers expect you to articulate.
How does YouTube's CDN work?
Videos are divided into 5–10 second segments cached at edge PoPs globally. Popular segments (top 1% of content by watch time) are pre-cached at all major PoPs, achieving 90%+ hit rates. The DASH player requests segments sequentially; cache hits are served from the edge in under 50ms.
How does YouTube handle adaptive bitrate streaming?
Each video is transcoded into multiple quality levels. The DASH manifest lists all variants. The player monitors bandwidth and buffer health, selecting the highest quality that can be downloaded faster than it's played. Quality changes happen at segment boundaries — seamless to the viewer.
How does YouTube scale view counts?
View events stream to Kafka, are aggregated in-memory by a stream processor, and periodically flushed back to the database. The counter lags real-time by seconds on viral content. This protects the database from hotspot writes while keeping counts acceptably current.
What happens when YouTube's transcoding pipeline falls behind?
Queue depth grows, new uploads take longer to become available. Existing playback is unaffected. Auto-scaling adds transcoding workers based on queue depth. Priority queues ensure new uploads are processed before background re-encoding work.
Run this in SysSimulator → Browse all blueprints
Next in the series: Design a rate limiter →