This is the complete reference for every chaos engineering scenario available in SysSimulator. For each scenario you'll find what it simulates, which metrics it impacts, what a real production example looks like, and the mitigations that address the root cause.
The 28 scenarios comprise 24 across six failure categories (network, infrastructure, traffic, data layer, application, and dependency) plus four in a seventh category, MCP / AI agent scenarios, which unlocks when you load an agent architecture blueprint.
For the principles behind when and how to run these scenarios, see chaos engineering principles and practices. To run any scenario against your own architecture, open SysSimulator — no signup, no install.
Not all chaos scenarios are equal. Run them in this order to get the most signal per experiment:
| Priority | Scenario | Why first |
|---|---|---|
| 1st | Cache stampede | Exposes the most common catastrophic failure mode for web systems. Cache absorbs 90–98% of read load; its failure creates an immediate 10–50x DB load spike. |
| 2nd | Node failure | Identifies single points of failure. Run on every stateful component individually. |
| 3rd | Network partition | Exposes CAP theorem tradeoffs in practice. Where does your system prefer consistency vs availability? |
| 4th | Thundering herd | Exposes reconnection storm vulnerabilities — critical before any maintenance operation. |
| 5th | Connection pool exhaustion | Exposes the hidden constraint between your application tier and database — often invisible until it saturates. |
| 6th+ | All others | Scenario-specific, run after the five above have passed. |
Network failures are the most common production failure category. Four scenarios cover the main failure modes: added latency, dropped connectivity, intermittent packet loss, and constrained throughput.
What it simulates: Artificial latency added to all requests through a target component — a slow upstream dependency, geographic distance, network congestion, or an overloaded downstream service.
Real-world example: A payment processor starts responding at 800ms instead of 80ms. Your checkout API is synchronous on the payment call, so every in-flight checkout holds a thread for the full 800ms. At 10,000 RPS with a 5% checkout rate, 500 checkouts arrive per second and about 400 threads are held simultaneously (500 requests/s × 0.8 s), up from about 40 at normal latency. Thread pool saturation follows within seconds.
Interview application: "Your third-party payment processor starts responding slowly — walk me through the blast radius on your checkout service."
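The number of threads held at once can be estimated with Little's law (average in-flight requests = arrival rate × time in system). The sketch below is a back-of-the-envelope calculation only; the function name `threads_held` is hypothetical and not part of SysSimulator.

```python
def threads_held(total_rps: float, fraction: float, latency_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x time in system."""
    arrival_rate = total_rps * fraction  # requests/s that hit the slow call
    return arrival_rate * latency_s      # each one holds a thread for latency_s


normal = threads_held(10_000, 0.05, 0.080)    # payment call at 80 ms
degraded = threads_held(10_000, 0.05, 0.800)  # payment call at 800 ms
print(f"threads held: {normal:.0f} normal, {degraded:.0f} degraded")
```

Comparing the degraded figure with the configured thread pool size gives a quick read on whether the latency scenario will saturate the pool.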
What it simulates: Complete loss of connectivity between two components — a broken switch, firewall rule change, VPC misconfiguration, or cross-availability-zone outage.
Real-world example: A firewall rule change drops all traffic between the application tier and the database. Without circuit breakers, application tier connections to the database hang until their timeout, typically 30 seconds. At 5,000 RPS with 10% DB-touching requests, 15,000 requests queue up and time out in every 30-second window. Thread pool exhaustion follows.
Interview application: "Your primary region loses connectivity to your database. You have read replicas in a secondary region. Walk me through the failover."
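A circuit breaker is the usual way to stop requests from hanging on a partitioned dependency. The following is a minimal, single-threaded sketch under stated assumptions: the class name, the thresholds, and the injectable `clock` are all illustrative, and a production breaker would also need thread safety and metrics.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: reject immediately instead of waiting on a timeout.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, let this one call through as a trial.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With the breaker open, callers get an error in microseconds rather than holding a thread for the full connection timeout.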
What it simulates: A configurable percentage of requests are dropped — not all of them, unlike a full partition. Some requests complete, some do not. Intermittent and non-deterministic.
Real-world example: A flaky wireless link between a mobile client and your API drops 15% of requests. The client retries immediately. If the payment endpoint is not idempotent, up to 15% of payment attempts may result in a double charge: when the drop hits the response rather than the request, the first charge has already succeeded and the retry charges again.
Interview application: "A mobile client on a flaky connection is retrying your payment endpoint. Are you charging once or multiple times?"
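Idempotency keys are a common way to make retries safe. The sketch below assumes the client generates one key per payment intent and resends it on every retry; `PaymentService` and its in-memory dictionary are hypothetical stand-ins, and a real service would need an atomic check-and-insert (for example a unique constraint in the database).

```python
class PaymentService:
    """Toy payment endpoint that deduplicates retries by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first attempt
        self.charges = 0    # how many times money actually moved

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key replays the stored result instead of charging.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        result = {"charge_id": self.charges, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


svc = PaymentService()
first = svc.charge("order-123", 4_999)
retry = svc.charge("order-123", 4_999)  # response was lost, client retries
print(svc.charges, first == retry)
```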
What it simulates: The network throughput of a component is constrained. Requests are processed but at reduced rate, causing queue buildup upstream.
Real-world example: A data pipeline egress is throttled by a cloud provider during peak usage. Upstream jobs accumulate in a queue. The queue grows faster than it drains. Eventually, queue-full backpressure reaches the ingestion layer.
Infrastructure failures target compute, storage, and memory at the host or container level. These are the most straightforward failure modes to identify but often reveal the most critical SPOFs.
What it simulates: A component goes entirely offline. Simulates an EC2 instance crash, container OOM kill, Kubernetes pod eviction, or bare-metal hardware failure.
Real-world example: One of three application servers in a load balancer pool goes down. The failed server's third of the traffic is redistributed across the two survivors, so each goes from 33% to 50% of total traffic, a 50% increase in load. If that exceeds per-server capacity, both survivors degrade and the failure cascades.
Interview application: "Your diagram has a single database instance. What's your RTO if that instance fails?"
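The load shift after a node failure is simple arithmetic, sketched below assuming traffic is spread evenly across healthy nodes. The function name and the 60% starting utilization are illustrative assumptions.

```python
def utilization_after_failure(nodes: int, failed: int, utilization: float) -> float:
    """Per-node utilization once the failed nodes' traffic is spread over survivors."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return utilization * nodes / survivors


# Three servers each at 60% utilization; one fails.
after = utilization_after_failure(nodes=3, failed=1, utilization=0.60)
print(f"each survivor now runs at {after:.0%} utilization")
```

A result above 1.0 means the survivors cannot absorb the load, which is the condition under which the failure cascades.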
What it simulates: A component runs out of disk space. Database writes fail. App servers with local logging or write caching fail writes silently or noisily depending on error handling.
Real-world example: A database data volume fills because a runaway query produced an unexpectedly large result set that was written to a temp table. All subsequent writes fail. Reads continue from the existing data. The application appears partially functional but is silently losing data.
What it simulates: A component's CPU is fully saturated. Simulates heavy table scans, large sort operations, expensive cryptographic work, or inefficient regex evaluation.
Real-world example: A full-table scan triggered by a missing index causes database CPU to spike to 100%. All queries slow proportionally. Queue depth builds. Application servers start timing out on database calls.
What it simulates: A component approaches out-of-memory conditions. In GC-managed runtimes (JVM, Go, Node.js), this increases GC frequency and pause duration.
Real-world example: A JVM-based service accumulates request contexts in a map that is never cleaned up after request completion. Memory grows slowly. GC pauses lengthen. p99 spikes erratically. Eventually the container hits its memory limit and is OOM-killed, triggering a node failure cascade.
Traffic failure modes test how systems behave under abnormal load patterns — not just volume, but timing, size, and coordination.
What it simulates: Incoming request rate multiplies suddenly — 5x, 10x, or configurable. Simulates viral moments, flash sales, bot attacks, or synchronized scheduled jobs.
Real-world example: A product is featured on a major media site. Traffic spikes 8x in under a minute. The bottleneck is the session token validation service, which is single-instance. All authenticated requests fail for 4 minutes until a second instance is started.
What it simulates: Average request and response payload size increases significantly. Simulates large file uploads, unexpectedly verbose JSON, logging payload explosions, or large query result sets.
Real-world example: A logging library update starts including full stack traces in every log entry, increasing log payload size 20x. The log aggregation pipeline, sized for normal log volume, starts dropping logs and eventually backs up to the point where the application itself slows waiting for log writes to complete.
What it simulates: Clients read responses slowly — simulating mobile connections on poor signal, geographic distance, or clients doing expensive processing between reads.
Real-world example: Mobile clients on 3G connections take 10 seconds to receive a 500KB response. Server connections are held open for the full transfer. At 1,000 new slow requests per second, about 10,000 connections are open at any moment (1,000 requests/s × 10 s), rapidly exhausting connection pools even when the server CPU is idle.
What it simulates: Many clients make the same request simultaneously after a period of silence — a reconnection storm after a server restart, a popular cached item expiring and causing concurrent misses, or scheduled jobs firing at the same second.
Real-world example: Slack, May 2022. A routine maintenance operation caused all Slack clients to disconnect simultaneously. On completion, millions of clients reconnected at the same instant, overwhelming connection handling infrastructure. The fix: exponential backoff with jitter on client reconnect, spreading the load over several minutes instead of milliseconds.
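Exponential backoff with jitter can be sketched in a few lines. This is a generic "full jitter" variant, not Slack's actual client code; the function name and the default base and cap values are assumptions for illustration.

```python
import random


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0, rng=random) -> float:
    """Full jitter: pick a delay uniformly in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Each client draws its own delay, so reconnects spread out instead of aligning.
for attempt in range(5):
    print(f"attempt {attempt}: wait {backoff_delay(attempt):.2f}s")
```

The randomization matters as much as the exponential growth: without it, clients that disconnected together retry together on every round.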
Data layer failures are the highest-stakes failure category because they can affect data durability, not just availability. These scenarios test your database failover, replication behavior, cache resilience, and connection management.
What it simulates: The primary database goes entirely offline. Writes stop immediately. Read replicas may continue serving reads depending on your read/write split configuration.
Real-world example: Amazon DynamoDB, February 2021. A configuration change to DynamoDB's metadata service caused a cache overload that propagated to impair the entire service. The root cause was a change that resulted in increased cache misses, which cascaded to database load — effectively a cache stampede at service infrastructure level.
Interview application: "Your diagram has a primary database with read replicas. Walk me through RTO when the primary fails."
What it simulates: Growing delay between when writes are committed on the primary and when they become visible on read replicas.
Real-world example: A social platform routes all reads to replicas to reduce primary load. A user posts a tweet, which is written to the primary. They immediately refresh their profile, which reads from a replica that is 8 seconds behind. The tweet doesn't appear. The user thinks the post failed and submits it again — creating a duplicate.
Interview application: "You're using read replicas for 90% of reads. A user writes a record and immediately reads it back. How do you handle this?"
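One way to provide read-your-writes behavior is to pin a user's reads to the primary for a short window after that user writes. The sketch below assumes a window longer than the worst expected replication lag; `ReadRouter`, the 10-second default, and the injectable `clock` are hypothetical.

```python
import time


class ReadRouter:
    """Route a user's reads to the primary shortly after that user writes."""

    def __init__(self, pin_window_s: float = 10.0, clock=time.monotonic):
        self.pin_window_s = pin_window_s
        self.clock = clock
        self._last_write = {}  # user_id -> time of that user's last write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        # Replicas may not have this user's write yet, so read from the primary.
        if last is not None and self.clock() - last < self.pin_window_s:
            return "primary"
        return "replica"
```

Only the writing user is pinned, so the primary absorbs a small fraction of reads while everyone else stays on replicas.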
What it simulates: Cache hit rate drops to near zero — simulating a cache restart, mass invalidation event, or eviction storm when the cache is undersized for the working set.
Real-world example: A Redis cluster is restarted for a configuration change. At restart, 10,000 RPS of requests that were previously served from the 98% cache hit rate suddenly hit the database. The database was handling 200 QPS (the 2% miss rate). It now receives 10,000 QPS — a 50x spike. Connection pool of 100 connections exhausts in under 2 seconds. The database becomes unresponsive. The cache, once it restarts, cannot warm up fast enough because the database is too loaded to serve the initial fill requests.
Interview application: "Your cache goes down at peak traffic. Walk me through the blast radius."
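Request coalescing (often called single-flight) is one common mitigation for a cold cache: concurrent misses for the same key share a single database load instead of each issuing their own. The class below is a minimal in-process sketch; `SingleFlightCache` and `loader` are hypothetical names, there is no expiry, and per-key locks are never cleaned up.

```python
import threading


class SingleFlightCache:
    """On a miss, only one caller per key runs the loader; the rest wait for it."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def get(self, key):
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the value while we waited.
            if key not in self._values:
                self._values[key] = self._loader(key)
        return self._values[key]
```

With this in place, a burst of concurrent misses on one hot key costs the database one query rather than one per caller.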
What it simulates: Database connection pool fills until no new connections can be established. New requests queue, timeout, and fail — even when database CPU and memory are entirely normal.
Real-world example: A slow query introduced by a deployment takes 800ms instead of the normal 40ms, so each connection is held for 800ms per query instead of 40ms. The same 100-connection pool that served 2,500 QPS (100 connections ÷ 0.04 s) now serves only 125 QPS (100 connections ÷ 0.8 s). At 1,000 QPS of DB-touching requests, 87.5% of requests cannot get a connection and time out.
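The pool arithmetic generalizes to any pool size and query latency. The helper names below are hypothetical, and the model assumes every connection is busy back to back with no idle time.

```python
def pool_capacity_qps(pool_size: int, query_latency_s: float) -> float:
    """Maximum queries per second a pool can sustain at a given query latency."""
    return pool_size / query_latency_s


def rejected_fraction(offered_qps: float, capacity_qps: float) -> float:
    """Share of offered requests that cannot get a connection."""
    return max(0.0, 1.0 - capacity_qps / offered_qps)


healthy = pool_capacity_qps(100, 0.040)  # 40 ms queries
slow = pool_capacity_qps(100, 0.800)     # 800 ms queries
print(f"capacity: {healthy:.0f} QPS healthy, {slow:.0f} QPS slow")
print(f"rejected at 1,000 QPS offered: {rejected_fraction(1_000, slow):.1%}")
```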
Application-level failures test runtime behavior: memory management, thread scheduling, lock contention, and failure propagation logic.
What it simulates: Continuous, slow memory growth in a component that is never released — objects accumulated in maps without eviction, unclosed streams, event listener leaks.
Signal to watch: A rising sawtooth on the memory graph. Usage drops after each GC cycle but never returns to baseline, so both the peaks and the post-GC floor climb over time and each collection reclaims less. OOM is the endpoint if unaddressed.
What it simulates: Worker threads are all occupied and cannot accept new requests. Simulates blocking I/O that holds threads — synchronous database calls, synchronous HTTP calls to slow dependencies, or CPU-intensive work on the request thread.
Real-world example: A service uses a fixed thread pool of 200 threads. A downstream dependency starts responding at 2 seconds instead of 50ms. Each thread is blocked for 2 seconds per request. The pool that served 4,000 RPS now serves 100 RPS. At 2,000 RPS of incoming traffic, the queue fills in seconds.
What it simulates: Circular lock dependencies where two or more operations each hold a lock the other needs, causing both to wait indefinitely.
What it simulates: A component failure overloads a downstream dependency, which fails and overloads its dependency, propagating failure through the dependency graph.
Key insight: Aggressive retries without backoff are the primary cascade accelerator. A service receiving 1,000 RPS that retries 3 times on each 5xx sends up to 4,000 RPS (the original attempt plus three retries) to an already-failing downstream. The downstream, now receiving up to 4x load, fails harder, causing the upstream to retry more. The result is a positive feedback loop that collapses both services together.
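Retry amplification can be estimated by counting the original attempt plus each retry that is actually issued. The sketch assumes every attempt fails independently with the same probability and that retries fire immediately; the function names are hypothetical.

```python
def expected_attempts(max_retries: int, failure_rate: float) -> float:
    """Average attempts per request: attempt k happens only if the previous k all failed."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


def downstream_rps(incoming_rps: float, max_retries: int, failure_rate: float) -> float:
    """Load the downstream sees once retries are included."""
    return incoming_rps * expected_attempts(max_retries, failure_rate)


for failure_rate in (0.0, 0.5, 1.0):
    load = downstream_rps(1_000, max_retries=3, failure_rate=failure_rate)
    print(f"failure rate {failure_rate:.0%}: downstream sees {load:.0f} RPS")
```

The amplification grows as the downstream gets sicker, which is the feedback loop in numeric form.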
Dependency failures test how your system behaves when third-party services — payment providers, email services, identity providers, external APIs — degrade or fail.
What it simulates: An external dependency stops responding entirely. Requests hang until they reach their timeout — which may be 30 seconds or more if not explicitly configured.
Interview application: "Your payment provider is down. What happens to your checkout flow?"
What it simulates: A dependency responds at 10x normal latency instead of timing out completely. It appears nominally healthy in status checks but is significantly slower.
Why this scenario is harder than a full timeout: A full timeout produces obvious 5xx errors. Slow responses produce high latency and connection saturation without triggering error-rate alerts. Many teams don't notice until thread pools are near exhaustion.
What it simulates: A dependency returns 5xx errors quickly — fast failures rather than hangs.
What it simulates: A third-party API enforces quota and returns 429 Too Many Requests intermittently as usage exceeds the rate limit.
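A client that receives 429 responses should slow down rather than retry immediately. The sketch below honors a numeric `Retry-After` header when present and otherwise falls back to capped exponential backoff. It assumes headers arrive as a plain dict keyed exactly `Retry-After`, and it does not parse the HTTP-date form of the header; the function name and defaults are illustrative.

```python
from typing import Optional


def retry_delay_s(status: int, headers: dict, attempt: int,
                  base_s: float = 1.0, cap_s: float = 60.0) -> Optional[float]:
    """Seconds to wait before retrying a rate-limited call, or None if no retry applies."""
    if status != 429:
        return None
    retry_after = headers.get("Retry-After", "").strip()
    if retry_after.isdigit():
        # The server told us how long to wait; respect it, up to our own cap.
        return min(cap_s, float(retry_after))
    # No usable hint from the server: capped exponential backoff.
    return min(cap_s, base_s * 2 ** attempt)


print(retry_delay_s(429, {"Retry-After": "7"}, attempt=0))
print(retry_delay_s(429, {}, attempt=3))
```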
These four scenarios unlock automatically when you load a blueprint that includes Agent Runtimes, MCP Tool Servers, Vector Stores, or Tool Registries. They do not appear on classic microservice blueprints.
What it simulates: An MCP tool server stops responding mid-task. The agent runtime is waiting for a tool call result that never arrives.
What it tests: Whether the agent retries intelligently, falls back to an alternative tool, or fails gracefully with a useful error message. Agents without explicit tool-timeout handling may hang indefinitely or consume the full context window on retry loops.
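Explicit tool-timeout handling might look like the following sketch, which bounds each tool call with `asyncio.wait_for` and optionally tries a fallback. It does not use any real MCP SDK: `tool` and `fallback` are assumed to be plain async callables, and the result dictionary shape is made up for illustration.

```python
import asyncio


async def call_tool_with_timeout(tool, args, timeout_s, fallback=None):
    """Bound a tool call; on timeout use the fallback or return a structured error."""
    try:
        result = await asyncio.wait_for(tool(args), timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        if fallback is not None:
            # The fallback gets the same deadline; if it also times out, the error propagates.
            result = await asyncio.wait_for(fallback(args), timeout_s)
            return {"ok": True, "result": result, "used_fallback": True}
        return {"ok": False, "error": f"tool timed out after {timeout_s}s"}


async def hung_tool(args):
    await asyncio.sleep(10)  # stands in for a tool server that stopped responding


print(asyncio.run(call_tool_with_timeout(hung_tool, {}, timeout_s=0.05)))
```

Returning a structured error gives the agent something to reason about, instead of an indefinite wait or a blind retry loop.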
What it simulates: The agent approaches its context window limit mid-task, with significant work remaining.
What it tests: Whether the agent summarizes progress and continues, truncates silently, or fails usefully. Agents without context budget awareness produce truncated, incorrect outputs without signaling the truncation.
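Context budget awareness can be as simple as a check before each step. The sketch below is one possible policy, not a description of how any particular agent runtime works; the function name, the action labels, and the token figures are all assumptions.

```python
def next_action(used_tokens: int, window_tokens: int,
                reserve_tokens: int, step_estimate: int) -> str:
    """Decide what to do before the next step, keeping a reserve for the final answer."""
    remaining = window_tokens - used_tokens
    if remaining - step_estimate >= reserve_tokens:
        return "continue"           # the step fits and the reserve stays intact
    if remaining >= reserve_tokens:
        return "summarize"          # compact earlier context before continuing
    return "abort_with_report"      # too little room left: stop and say so explicitly


print(next_action(used_tokens=118_000, window_tokens=128_000,
                  reserve_tokens=8_000, step_estimate=4_000))
```

The important property is the last branch: the agent signals that it ran out of room rather than emitting a silently truncated result.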
What it simulates: The vector store serving retrieval-augmented generation contains stale embeddings — documents that have been updated in the source system but not re-indexed.
What it tests: Whether the agent's retrieval produces confidently wrong answers grounded in outdated data, and whether the system has any mechanism to detect and signal staleness.
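A basic staleness check compares when each retrieved document was indexed against when its source last changed. The sketch assumes each hit carries `doc_id` and `indexed_at` metadata and that the source system can report last-modified times; those field names and the function name are hypothetical.

```python
from datetime import datetime


def stale_hits(hits: list, source_updated_at: dict) -> list:
    """Return ids of retrieved documents whose source changed after they were indexed."""
    stale = []
    for hit in hits:
        updated = source_updated_at.get(hit["doc_id"])
        # Unknown documents are treated as fresh here; a stricter policy could flag them.
        if updated is not None and updated > hit["indexed_at"]:
            stale.append(hit["doc_id"])
    return stale


hits = [
    {"doc_id": "pricing", "indexed_at": datetime(2024, 1, 1)},
    {"doc_id": "faq", "indexed_at": datetime(2024, 1, 1)},
]
updated = {"pricing": datetime(2024, 2, 1), "faq": datetime(2023, 12, 1)}
print(stale_hits(hits, updated))
```

Stale ids can then be dropped from the prompt, re-indexed, or surfaced to the user as a freshness warning.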
What it simulates: The tool registry that the agent queries at runtime for available tools is unavailable. The agent cannot discover which tools it can use.
What it tests: Whether the agent can operate with a cached tool manifest, fail gracefully, or degrade to a subset of hardcoded tools. Tool discovery failure at runtime is the agent equivalent of a service discovery outage in microservice architectures.
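The fallback order described above (live registry, then cached manifest, then hardcoded tools) can be sketched as follows. `fetch_registry` is a hypothetical callable that raises when the registry is unreachable; no real registry API is implied.

```python
def discover_tools(fetch_registry, cached_manifest, hardcoded_tools):
    """Resolve the tool list, degrading from live registry to cache to a fixed subset."""
    try:
        return {"tools": fetch_registry(), "source": "registry"}
    except Exception:
        if cached_manifest:
            # Possibly outdated, but better than no tools at all.
            return {"tools": cached_manifest, "source": "cache"}
        return {"tools": hardcoded_tools, "source": "hardcoded"}


def registry_down():
    raise ConnectionError("tool registry unavailable")


print(discover_tools(registry_down, ["search", "calculator"], ["search"]))
```

Reporting the `source` alongside the tools lets the agent or operator know it is running in a degraded mode.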
Every scenario above is available in SysSimulator — free, in your browser, with no signup or infrastructure required. Load any of the 57 architecture blueprints, start traffic, and inject any scenario from the chaos panel.
For how to design and interpret experiments, see chaos engineering principles and practices. For interview preparation using these scenarios, see the interview prep guide.