Chaos engineering scenarios: all 28 failure modes

This is the complete reference for every chaos engineering scenario available in SysSimulator. For each scenario you'll find what it simulates, which metrics it impacts, what a real production example looks like, and the mitigations that address the root cause.

The 28 scenarios fall into seven categories. Twenty-four span six failure categories: network, infrastructure, traffic, data layer, application, and dependency. The remaining four form a seventh category of MCP / AI agent scenarios, which unlocks when you load an agent architecture blueprint.

For the principles behind when and how to run these scenarios, see chaos engineering principles and practices. To run any scenario against your own architecture, open SysSimulator — no signup, no install.


Recommended starting order

Not all chaos scenarios are equally informative. Run them in this order to get the most signal from each experiment:

Priority | Scenario | Why first
1st | Cache stampede | Exposes the most common catastrophic failure mode for web systems. Cache absorbs 90–98% of read load; its failure creates an immediate 10–50x DB load spike.
2nd | Node failure | Identifies single points of failure. Run on every stateful component individually.
3rd | Network partition | Exposes CAP theorem tradeoffs in practice. Where does your system prefer consistency vs availability?
4th | Thundering herd | Exposes reconnection storm vulnerabilities, which is critical before any maintenance operation.
5th | Connection pool exhaustion | Exposes the hidden constraint between your application tier and database, often invisible until it saturates.
6th+ | All others | Scenario-specific; run after the five above have passed.

Network chaos scenarios

Network failures are the most common production failure category. Four scenarios cover the main failure modes: added latency, dropped connectivity, intermittent packet loss, and constrained throughput.

Scenario 1 of 28 Latency injection
Network Impact: Medium–High

What it simulates: Artificial latency added to all requests through a target component — a slow upstream dependency, geographic distance, network congestion, or an overloaded downstream service.

Metrics impacted
p99 latency (multiplies through synchronous call chains), throughput (queue buildup), error rate (timeout-induced errors at high injection values)
Primary mitigation
Decouple synchronous call chains into async patterns. Apply read timeouts on all dependency calls. Identify which dependencies are on the critical latency path.

Real-world example: A payment processor starts responding at 800ms instead of 80ms. Your checkout API is synchronous on the payment call, so every in-flight checkout holds a thread for the full 800ms. At 10,000 RPS with a 5% checkout rate, 500 checkouts arrive per second and roughly 400 threads are held at any moment (500/s × 0.8s), up from about 40 at 80ms. Thread pool saturation follows within seconds.

Interview application: "Your third-party payment processor starts responding slowly — walk me through the blast radius on your checkout service."

Scenario 2 of 28 Network partition
Network Impact: High

What it simulates: Complete loss of connectivity between two components — a broken switch, firewall rule change, VPC misconfiguration, or cross-availability-zone outage.

Metrics impacted
Error rate (requests that cannot complete), p99 latency (requests hang until timeout), connection pool utilization (held connections waiting for timeout)
Primary mitigation
Circuit breakers at every service boundary. Explicit fallback behavior at each partition point. Design for "what does this service do when it cannot reach its dependency?"

Real-world example: A firewall rule change drops all traffic between the application tier and the database. Without circuit breakers, application tier connections to the database hang until their timeout, typically 30 seconds. At 5,000 RPS with 10% DB-touching requests, 15,000 requests queue up and time out in each 30-second window. Thread pool exhaustion follows.

Interview application: "Your primary region loses connectivity to your database. You have read replicas in a secondary region. Walk me through the failover."
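The circuit breaker mitigation can be sketched in a few lines. This is a minimal, single-threaded illustration, not a production implementation: the class name, the thresholds, and the injected `clock` are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures; allow one trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per dependency
        self.reset_after_s = reset_after_s          # assumed cool-down before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: fall through and let this call probe the dependency.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, callers get the fallback immediately instead of holding a connection until the timeout, which is what keeps the thread pool from filling during a partition.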

Scenario 3 of 28 Packet loss
Network Impact: Medium

What it simulates: A configurable percentage of requests are dropped — not all of them, unlike a full partition. Some requests complete, some do not. Intermittent and non-deterministic.

Metrics impacted
Error rate (rises proportionally with loss percentage), p99 latency (retry overhead), request success rate
Primary mitigation
Idempotent operations that are safe to retry. At-least-once delivery with deduplication on the receiver side. Retry budgets to prevent retry storms.

Real-world example: A flaky wireless link between a mobile client and your API drops 15% of requests. The client retries immediately. If the payment endpoint is not idempotent, any attempt that was processed but whose response never reached the client is charged again when the retry succeeds, so up to 15% of payment attempts may result in double charges.

Interview application: "A mobile client on a flaky connection is retrying your payment endpoint. Are you charging once or multiple times?"
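One way to make the payment endpoint safe to retry is an idempotency key chosen by the client and deduplicated on the receiver. The sketch below is illustrative only: `PaymentService` and its in-memory dictionaries are hypothetical stand-ins, and a real service would need a durable store with an atomic check-and-set.

```python
class PaymentService:
    """Charges at most once per idempotency key, however often the client retries."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first attempt
        self.charges = []   # stand-in for the real ledger

    def charge(self, idempotency_key, account, amount_cents):
        if idempotency_key in self._results:
            # Replay of an earlier attempt: return the original outcome, charge nothing.
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"status": "charged", "charge_id": len(self.charges)}
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per purchase and sends the same key on every retry, so a retry after a lost response returns the stored result rather than creating a second charge.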

Scenario 4 of 28 Bandwidth throttle
Network Impact: Medium

What it simulates: The network throughput of a component is constrained. Requests are processed but at reduced rate, causing queue buildup upstream.

Metrics impacted
Queue depth (builds upstream of the throttled component), p99 latency, throughput (capped at throttle limit)
Primary mitigation
Backpressure signals from throttled components to callers. Client-side rate limiting to match the downstream capacity ceiling. Load shedding at the throttled component boundary.

Real-world example: A data pipeline egress is throttled by a cloud provider during peak usage. Upstream jobs accumulate in a queue. The queue grows faster than it drains. Eventually, queue-full backpressure reaches the ingestion layer.


Infrastructure chaos scenarios

Infrastructure failures target compute, storage, and memory at the host or container level. These are the most straightforward failure modes to identify but often reveal the most critical SPOFs.

Scenario 5 of 28 Node failure
Infrastructure Impact: High

What it simulates: A component goes entirely offline. Simulates an EC2 instance crash, container OOM kill, Kubernetes pod eviction, or bare-metal hardware failure.

Metrics impacted
Error rate (traffic that was on the failed node), p99 latency (load concentrates on surviving nodes), throughput (if surviving nodes reach capacity)
Primary mitigation
Redundancy with N+1 capacity (surviving nodes can absorb the failed node's traffic). Health checks with automatic traffic rerouting. Stateless services that restart cleanly.

Real-world example: One of three application servers in a load balancer pool goes down. The failed server's 33% of traffic is redistributed across the two survivors, so each goes from 33% to 50% of total traffic, a 50% increase in load. If that exceeds what a single server can handle, both survivors degrade and the failure cascades.

Interview application: "Your diagram has a single database instance. What's your RTO if that instance fails?"
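The N+1 capacity check is simple arithmetic. The helper names below are hypothetical, and the sketch assumes the load balancer spreads traffic evenly across healthy nodes.

```python
def share_per_survivor(total_nodes, failed_nodes=1):
    """Fraction of total traffic each surviving node carries after the failure."""
    survivors = total_nodes - failed_nodes
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return 1.0 / survivors


def survives_failure(total_nodes, peak_utilization, failed_nodes=1):
    """True if survivors stay at or below 100% utilization after the failure.

    peak_utilization is each node's utilization (0..1) before the failure.
    """
    after = peak_utilization * total_nodes / (total_nodes - failed_nodes)
    return after <= 1.0
```

For a three-node pool, `share_per_survivor(3)` gives 0.5, matching the example above. A pool running at 60% per node absorbs the loss (90% afterwards), while one running at 70% does not (105%).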

Scenario 6 of 28 Disk full
Infrastructure Impact: High (writes)

What it simulates: A component runs out of disk space. Database writes fail. App servers with local logging or write caching fail writes silently or noisily depending on error handling.

Metrics impacted
Write error rate (spikes to 100%), read availability (may continue), data durability (in-flight writes lost)
Primary mitigation
Disk utilization alerts at 70% and 85%. Log rotation and retention policies. Write failure handling that surfaces errors explicitly rather than silently dropping writes.

Real-world example: A database data volume fills because a runaway query produced an unexpectedly large result set that was written to a temp table. All subsequent writes fail. Reads continue from the existing data. The application appears partially functional but is silently losing data.
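A disk utilization check with the 70% and 85% thresholds from the mitigation could look like the sketch below. The function names and level labels are assumptions; `shutil.disk_usage` is the standard library call that reports total, used, and free bytes for a path.

```python
import shutil


def disk_alert_level(used_fraction, warn_at=0.70, critical_at=0.85):
    """Map a used-space fraction (0..1) to an alert level."""
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def check_path(path="."):
    """Alert level for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_alert_level(usage.used / usage.total)
```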

Scenario 7 of 28 CPU spike
Infrastructure Impact: Medium–High

What it simulates: A component's CPU is fully saturated. Simulates heavy table scans, large sort operations, expensive cryptographic work, or inefficient regex evaluation.

Metrics impacted
p99 latency (rises sharply as requests queue), throughput (capped by processing capacity), queue depth
Primary mitigation
Scale out (add replicas for CPU-bound services). Query optimization for database CPU spikes. Identify the specific workload causing CPU saturation — the fix is workload-specific.

Real-world example: A full-table scan triggered by a missing index causes database CPU to spike to 100%. All queries slow proportionally. Queue depth builds. Application servers start timing out on database calls.

Scenario 8 of 28 Memory pressure
Infrastructure Impact: Medium

What it simulates: A component approaches out-of-memory conditions. In GC-managed runtimes (JVM, Go, Node.js), this increases GC frequency and pause duration.

Metrics impacted
p99 latency (erratic spikes from GC pauses), error rate (eventual OOM crashes), GC pause duration
Primary mitigation
Memory utilization alerts. JVM heap sizing. Memory leak detection via sawtooth memory graphs. Restart policies for containers approaching OOM limits.

Real-world example: A JVM-based service accumulates request contexts in a map that is never cleaned up after request completion. Memory grows slowly. GC pauses lengthen. p99 spikes erratically. Eventually the container hits its memory limit and is OOM-killed, triggering a node failure cascade.


Traffic chaos scenarios

Traffic failure modes test how systems behave under abnormal load patterns — not just volume, but timing, size, and coordination.

Scenario 9 of 28 Request spike
Traffic Impact: High

What it simulates: Incoming request rate multiplies suddenly — 5x, 10x, or configurable. Simulates viral moments, flash sales, bot attacks, or synchronized scheduled jobs.

Metrics impacted
First bottlenecked component saturates; error rate climbs as queue depth exceeds capacity
Primary mitigation
Rate limiting at the ingress. Autoscaling with pre-warming. Load shedding with graceful degradation (drop non-critical requests, serve critical ones). Capacity headroom above peak-of-peak, not average.

Real-world example: A product is featured on a major media site. Traffic spikes 8x in under a minute. The bottleneck is the session token validation service, which is single-instance. All authenticated requests fail for 4 minutes until a second instance is started.
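Rate limiting at the ingress is often implemented as a token bucket. The following is a minimal sketch under that assumption; the class name, the parameters, and the injected `clock` are illustrative and not tied to any particular gateway.

```python
import time


class TokenBucket:
    """Allow a steady rate with a bounded burst; reject everything beyond that."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller sheds the request, for example with a 429 response
```

During an 8x spike the bucket keeps admitted traffic at the configured rate, so the components behind it see a load they were sized for while the excess is rejected cheaply at the edge.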

Scenario 10 of 28 Payload bloat
Traffic Impact: Medium

What it simulates: Average request and response payload size increases significantly. Simulates large file uploads, unexpectedly verbose JSON, logging payload explosions, or large query result sets.

Metrics impacted
Network-bound component throughput (decreases), database write throughput (decreases with larger rows), gateway rejection rate (if payload limits enforced)
Primary mitigation
Payload size limits at ingress (400/413 responses for oversized requests). Pagination for large query results. Field filtering to return only requested fields. Compression for large payloads.

Real-world example: A logging library update starts including full stack traces in every log entry, increasing log payload size 20x. The log aggregation pipeline, sized for normal log volume, starts dropping logs and eventually backs up to the point where the application itself slows waiting for log writes to complete.

Scenario 11 of 28 Slow clients
Traffic Impact: Medium

What it simulates: Clients read responses slowly — simulating mobile connections on poor signal, geographic distance, or clients doing expensive processing between reads.

Metrics impacted
Connection pool utilization (rises even with idle CPU — connections are held open for transfer duration), p99 latency
Primary mitigation
Write timeouts on responses (terminate connections that aren't reading). Async response streaming. Separate connection pools for fast and slow clients. Payload size reduction.

Real-world example: Mobile clients on 3G connections take 10 seconds to receive a 500KB response. Server connections are held open for the full transfer. At 1,000 such requests per second, about 10,000 connections are open at any moment (1,000/s × 10s), rapidly exhausting connection pools even when the server CPU is idle.

Scenario 12 of 28 Thundering herd
Traffic Impact: High

What it simulates: Many clients make the same request simultaneously after a period of silence — a reconnection storm after a server restart, a popular cached item expiring and causing concurrent misses, or scheduled jobs firing at the same second.

Metrics impacted
Instantaneous load spike (even at normal average RPS), single-resource saturation, p99 latency spike
Primary mitigation
Jittered retry timers (randomise reconnect delays across a wide distribution). Request coalescing (serve one pending request, then propagate the result to all waiters). Probabilistic early cache expiration to prevent synchronized expiry.

Real-world example: Slack, May 2022. A routine maintenance operation caused all Slack clients to disconnect simultaneously. On completion, millions of clients reconnected at the same instant, overwhelming connection handling infrastructure. The fix: exponential backoff with jitter on client reconnect, spreading the load over several minutes instead of milliseconds.
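Exponential backoff with jitter can be sketched as below. This uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling); the base delay, the 300-second cap, and the function name are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base_s=1.0, cap_s=300.0, rng=random.random):
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt))."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling
```

Because each client draws its own random delay, reconnects that would otherwise land in the same instant are spread across the whole window, and the window widens with each failed attempt.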


Data layer chaos scenarios

Data layer failures are the highest-stakes failure category because they can affect data durability, not just availability. These scenarios test your database failover, replication behavior, cache resilience, and connection management.

Scenario 13 of 28 Database crash
Data layer Impact: Critical

What it simulates: The primary database goes entirely offline. Writes stop immediately. Read replicas may continue serving reads depending on your read/write split configuration.

Metrics impacted
Write error rate (100%), read availability (depends on replica configuration), failover RTO (how long until the replica is promoted)
Primary mitigation
Automated primary promotion with sub-minute RTO (RDS Multi-AZ, Patroni, etc.). Read/write split so reads continue from replicas during primary outage. Write queuing for non-critical writes during failover window.

Real-world example: Amazon DynamoDB, February 2021. A configuration change to DynamoDB's metadata service caused a cache overload that propagated to impair the entire service. The root cause was a change that resulted in increased cache misses, which cascaded to database load — effectively a cache stampede at service infrastructure level.

Interview application: "Your diagram has a primary database with read replicas. Walk me through RTO when the primary fails."

Scenario 14 of 28 Replication lag
Data layer Impact: Medium

What it simulates: Growing delay between when writes are committed on the primary and when they become visible on read replicas.

Metrics impacted
Read freshness (replica reads return stale data), write-then-read consistency failures, replication lag metric
Primary mitigation
Route freshness-critical reads to the primary, not replicas. Replication lag alerts at 1 second, emergency response at 30 seconds. Avoid read replicas for flows that require reading what was just written.

Real-world example: A social platform routes all reads to replicas to reduce primary load. A user posts a tweet, which is written to the primary. They immediately refresh their profile, which reads from a replica that is 8 seconds behind. The tweet doesn't appear. The user thinks the post failed and submits it again — creating a duplicate.

Interview application: "You're using read replicas for 90% of reads. A user writes a record and immediately reads it back. How do you handle this?"
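One common answer is to pin a user's reads to the primary for a short window after that user writes. The sketch below assumes a hypothetical `ReadRouter` and a pin window chosen to exceed the replication lag you expect; neither comes from a specific database driver.

```python
import time


class ReadRouter:
    """Send a user's reads to the primary for a short window after their own write."""

    def __init__(self, pin_seconds=10.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed to exceed typical replication lag
        self.clock = clock
        self._last_write = {}  # user id -> time of that user's most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"
```

Only the writing user is routed to the primary, so replicas still absorb the bulk of read traffic.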

Scenario 15 of 28 Cache stampede
Data layer Impact: Critical

What it simulates: Cache hit rate drops to near zero — simulating a cache restart, mass invalidation event, or eviction storm when the cache is undersized for the working set.

Metrics impacted
Database QPS (spikes by 1/(1-hit_rate) factor), connection pool utilization, p99 latency, error rate (if DB saturates)
Primary mitigation
Probabilistic early expiration (renew entries before they expire, probabilistically). Request coalescing (collapse concurrent misses for the same key into one DB query). Write-through caching (keep cache populated during restarts). Secondary cache layer with longer TTL.

Real-world example: A Redis cluster is restarted for a configuration change. At restart, 10,000 RPS of requests that were previously served from the 98% cache hit rate suddenly hit the database. The database was handling 200 QPS (the 2% miss rate). It now receives 10,000 QPS — a 50x spike. Connection pool of 100 connections exhausts in under 2 seconds. The database becomes unresponsive. The cache, once it restarts, cannot warm up fast enough because the database is too loaded to serve the initial fill requests.

Interview application: "Your cache goes down at peak traffic. Walk me through the blast radius."
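Request coalescing can be sketched as a small cache in which the first caller for a missing key loads it and concurrent callers wait for that result. `CoalescingCache` and its `loader` callback are hypothetical names, and the sketch omits TTLs and eviction to stay short.

```python
import threading


class CoalescingCache:
    """Collapse concurrent misses for the same key into a single loader call."""

    def __init__(self, loader):
        self._loader = loader   # function that fetches a value from the database
        self._lock = threading.Lock()
        self._values = {}
        self._inflight = {}     # key -> Event set when the in-progress load finishes

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._inflight[key] = event
        if leader:
            try:
                value = self._loader(key)  # the only database query for this key
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()                # wake every waiter
        event.wait()
        with self._lock:
            if key in self._values:
                return self._values[key]
        return self.get(key)  # the leader's load failed; try again
```

With this in place, twenty simultaneous misses for the same key produce one database query instead of twenty.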

Scenario 16 of 28 Connection pool exhaustion
Data layer Impact: High

What it simulates: Database connection pool fills until no new connections can be established. New requests queue, time out, and fail, even when database CPU and memory are entirely normal.

Metrics impacted
Connection pool utilization (100%), request error rate (timeout errors), p99 latency (queue wait time), database CPU (may appear normal — misleadingly)
Primary mitigation
Pool utilization monitoring (alert at 80%). PgBouncer/connection multiplexing to serve more app-side connections from fewer database connections. Connection timeout tuning — requests should fail fast when the pool is full, not wait indefinitely. Identify slow queries that hold connections past their useful lifetime.

Real-world example: A slow query introduced by a deployment takes 800ms instead of the normal 40ms. Each connection holds a query for 800ms instead of 40ms. The same 100-connection pool that served 2,500 QPS (100 connections ÷ 40ms) now serves only 125 QPS (100 connections ÷ 800ms). At 1,000 QPS of DB-touching requests, 87.5% of requests cannot get a connection and time out.
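The pool arithmetic in this example is worth being able to reproduce for your own numbers. The helper names below are hypothetical; the model assumes every connection is busy for exactly the query duration and ignores queueing effects.

```python
def pool_max_qps(pool_size, query_ms):
    """Upper bound on queries per second a pool can serve at a given query time."""
    return pool_size * 1000.0 / query_ms


def starved_fraction(offered_qps, pool_size, query_ms):
    """Fraction of offered requests that cannot obtain a connection."""
    capacity = pool_max_qps(pool_size, query_ms)
    return max(0.0, 1.0 - capacity / offered_qps)
```

`pool_max_qps(100, 40)` gives 2,500 and `pool_max_qps(100, 800)` gives 125, so at 1,000 QPS offered, `starved_fraction(1000, 100, 800)` is 0.875.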


Application chaos scenarios

Application-level failures test runtime behavior: memory management, thread scheduling, lock contention, and failure propagation logic.

Scenario 17 of 28 Memory leak
Application Impact: Medium (slow onset)

What it simulates: Continuous, slow memory growth in a component that is never released — objects accumulated in maps without eviction, unclosed streams, event listener leaks.

Metrics impacted
Memory utilization (grows monotonically), GC pause duration (increases as heap fills), p99 latency (erratic due to GC), eventual OOM crash
Primary mitigation
Memory utilization monitoring with trend alerts. Heap profiling to identify leak sources. Bounded data structures (caches with eviction policies, not unbounded maps). Scheduled restarts as a stopgap while root cause is fixed.

Signal to watch: A sawtooth pattern on the memory graph whose floor keeps rising. Usage drops after each GC cycle but never returns to the previous baseline, so both the peaks and the troughs climb over time. OOM is the endpoint if unaddressed.
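A bounded data structure is the structural fix for the unbounded-map leak. The sketch below is one possible shape, a least-recently-used cache built on `collections.OrderedDict`; the class name and the entry limit are assumptions.

```python
from collections import OrderedDict


class BoundedCache:
    """A map that evicts its least recently used entry once it reaches a size cap."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry

    def __len__(self):
        return len(self._data)
```

Memory use is capped by `max_entries` regardless of how many distinct keys pass through, which turns a slow leak into a bounded cost.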

Scenario 18 of 28 Thread pool exhaustion
Application Impact: High

What it simulates: Worker threads are all occupied and cannot accept new requests. Simulates blocking I/O that holds threads — synchronous database calls, synchronous HTTP calls to slow dependencies, or CPU-intensive work on the request thread.

Metrics impacted
Request queue depth (grows rapidly), error rate (queue-full rejections or timeouts), CPU (may appear low — threads are blocked, not computing)
Primary mitigation
Async I/O to free threads during wait time. Separate thread pools for different I/O categories (DB, external HTTP, internal). Queue depth alerts. Thread pool sizing based on measured concurrency, not rule of thumb.

Real-world example: A service uses a fixed thread pool of 200 threads. A downstream dependency starts responding at 2 seconds instead of 50ms. Each thread is blocked for 2 seconds per request. The pool that served 4,000 RPS now serves 100 RPS. At 2,000 RPS of incoming traffic, the queue fills in seconds.

Scenario 19 of 28 Deadlock
Application Impact: High

What it simulates: Circular lock dependencies where two or more operations each hold a lock the other needs, causing both to wait indefinitely.

Metrics impacted
Affected request latency (goes to infinity — they never complete), thread pool utilization (fills with stuck threads), eventual thread pool exhaustion
Primary mitigation
Consistent lock acquisition order (always acquire lock A before lock B, across all code paths). Lock timeouts (fail after N seconds rather than waiting forever). Short transaction windows. Deadlock detection in database engines (most databases detect and kill deadlocked transactions automatically).
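Consistent lock ordering can be illustrated with a transfer between two accounts. The `Account` class and `transfer` function are hypothetical; the only idea being shown is that both locks are always taken in ascending id order.

```python
import threading


class Account:
    def __init__(self, account_id, balance):
        self.id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move `amount` from src to dst without risking a lock-order deadlock."""
    if src is dst:
        return  # nothing to move, and locking the same account twice would hang
    # Lock the lower id first on every code path, so two opposite transfers
    # can never each hold one lock while waiting for the other.
    first, second = (src, dst) if src.id < dst.id else (dst, src)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount
```

If each transfer instead locked `src` and then `dst`, a transfer from A to B running alongside one from B to A could leave each thread holding one lock and waiting forever for the other.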

Scenario 20 of 28 Cascading failure
Application Impact: Critical

What it simulates: A component failure overloads a downstream dependency, which fails and overloads its dependency, propagating failure through the dependency graph.

Metrics impacted
Error rate (climbs through the dependency graph), p99 latency (climbs as retries amplify load), bottleneck count (grows as cascade propagates)
Primary mitigation
Circuit breakers at every service boundary (fail fast on unhealthy downstream, stop retries from amplifying cascade). Exponential backoff with jitter on retries. Bulkhead pattern (separate thread pools per downstream dependency to contain blast radius).

Key insight: Aggressive retries without backoff are the primary cascade accelerator. A service receiving 1,000 RPS that retries 3 times on each 5xx sends up to 4,000 RPS (the original attempt plus three retries) to an already-failing downstream. The downstream, now receiving up to 4x load, fails harder, which triggers still more retries upstream: a positive feedback loop that can collapse both services.


Dependency chaos scenarios

Dependency failures test how your system behaves when third-party services — payment providers, email services, identity providers, external APIs — degrade or fail.

Scenario 21 of 28 Third-party timeout
Dependency Impact: High

What it simulates: An external dependency stops responding entirely. Requests hang until they reach their timeout — which may be 30 seconds or more if not explicitly configured.

Metrics impacted
Thread pool utilization (held for timeout duration), error rate (timeout errors after delay), p99 latency (equals timeout value for affected requests)
Primary mitigation
Short, explicit read timeouts on every external call (500ms–2s, not the default 30s). Circuit breakers that open after N consecutive timeouts. Async dependency calls so threads aren't held. Graceful degradation when the dependency is unavailable.

Interview application: "Your payment provider is down. What happens to your checkout flow?"
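An explicit timeout with a fallback might look like the sketch below. It uses a raw TCP socket from the standard library to keep the example self-contained; `fetch_with_timeout`, its parameters, and the fallback value are assumptions, and a real integration would set the equivalent timeout option on its HTTP client.

```python
import socket


def fetch_with_timeout(host, port, request, timeout_s=1.0, fallback=None):
    """Send a request and wait at most timeout_s for a reply; degrade on failure."""
    try:
        # The timeout applies to the connect and to every later send and recv.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(request)
            return conn.recv(4096)  # raises a timeout error if nothing arrives in time
    except OSError:  # covers timeouts, refused connections, and resets
        return fallback
```

With a one-second budget, a dependency that has stopped responding costs the caller one second and a degraded response, rather than a thread held for the library's default timeout.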

Scenario 22 of 28 Degraded response
Dependency Impact: Medium

What it simulates: A dependency responds at 10x normal latency instead of timing out completely. It appears nominally healthy in status checks but is significantly slower.

Metrics impacted
p99 latency (rises proportionally), thread/connection utilization (holds longer per request), effective throughput (decreases)
Primary mitigation
Read timeouts calibrated to internal SLAs, not default library values. p99 latency alerts on dependency calls (not just error rate). Circuit breakers triggered by latency, not just errors.

Why this scenario is harder than a full timeout: A full timeout produces obvious 5xx errors. Slow responses produce high latency and connection saturation without triggering error-rate alerts. Many teams don't notice until thread pools are near exhaustion.
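The gap between error-rate alerting and latency alerting can be shown with a small check. The function names, the 1% error threshold, and the sample data are assumptions; the percentile uses the nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def dependency_alerts(latencies_ms, error_count, p99_limit_ms, max_error_rate=0.01):
    """Return the alerts a window of dependency calls would trigger."""
    alerts = []
    if error_count / len(latencies_ms) > max_error_rate:
        alerts.append("error-rate")
    if percentile(latencies_ms, 99) > p99_limit_ms:
        alerts.append("p99-latency")
    return alerts
```

A window where 5% of calls take 500ms instead of 50ms, with zero errors, triggers only the `p99-latency` alert. A team that watches error rate alone would see nothing.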

Scenario 23 of 28 Error response
Dependency Impact: Medium

What it simulates: A dependency returns 5xx errors quickly — fast failures rather than hangs.

Metrics impacted
Error rate (rises fast), retries (naive retry logic amplifies load on failing dependency)
Primary mitigation
Exponential backoff with jitter on retries. Circuit breakers to stop retrying after threshold errors. Retry budgets to prevent retry storms from amplifying load on the failing dependency.
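A retry budget caps retries as a share of normal traffic. The sketch below is a simplified, lifetime-counter version with a hypothetical `RetryBudget` class and an assumed 10% allowance; a production version would usually track a sliding time window.

```python
class RetryBudget:
    """Permit retries only while they stay under a fixed percentage of requests."""

    def __init__(self, max_retry_percent=10):
        self.max_retry_percent = max_retry_percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Integer comparison: would one more retry exceed the allowed percentage?
        if (self.retries + 1) * 100 > self.max_retry_percent * self.requests:
            return False
        self.retries += 1
        return True
```

Under a full outage this holds the extra load on the failing dependency to about 10% of normal traffic, instead of the multiple that unbounded retries would generate.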

Scenario 24 of 28 Rate limit hit
Dependency Impact: Medium

What it simulates: A third-party API enforces quota and returns 429 Too Many Requests intermittently as usage exceeds the rate limit.

Metrics impacted
Dependent operation success rate, quota remaining (rate-unaware retries burn remaining quota faster)
Primary mitigation
Client-side quota budgeting and tracking. Respect Retry-After headers. Adaptive throttling that backs off when approaching quota limits. Graceful degradation for quota-limited operations.
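Respecting `Retry-After` means handling both forms the header can take: a number of seconds or an HTTP date. The helper below is a sketch with a hypothetical name; it lets parsing errors from a malformed header propagate to the caller.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Seconds to wait, from a Retry-After value in delta-seconds or HTTP-date form."""
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat a zone-less date as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Sleeping for this duration before the next attempt avoids spending remaining quota on requests that the provider has already said it will reject.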

MCP / AI agent scenarios

These four scenarios unlock automatically when you load a blueprint that includes Agent Runtimes, MCP Tool Servers, Vector Stores, or Tool Registries. They do not appear on classic microservice blueprints.

Scenario 25 of 28 Tool server timeout
MCP / Agent Impact: High

What it simulates: An MCP tool server stops responding mid-task. The agent runtime is waiting for a tool call result that never arrives.

What it tests: Whether the agent retries intelligently, falls back to an alternative tool, or fails gracefully with a useful error message. Agents without explicit tool-timeout handling may hang indefinitely or consume the full context window on retry loops.
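Explicit tool-timeout handling in an agent runtime could be sketched as follows. This is not an MCP client API: `call_tool`, the `(name, coroutine function)` pairs, and the five-second default are all assumptions made for illustration.

```python
import asyncio


async def call_tool(tools, args, timeout_s=5.0):
    """Try each (name, async tool) in order, bounding every attempt by timeout_s."""
    errors = []
    for name, tool in tools:
        try:
            return await asyncio.wait_for(tool(args), timeout=timeout_s)
        except asyncio.TimeoutError:
            errors.append(f"{name}: no response within {timeout_s}s")
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    # Every option failed: surface one useful error instead of hanging or looping.
    raise RuntimeError("all tools failed: " + "; ".join(errors))
```

Each tool gets exactly one bounded attempt, so a hung tool server costs a fixed amount of time and the agent either falls back to an alternative or fails with a message that names what was tried.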

Scenario 26 of 28 Token budget pressure
MCP / Agent Impact: Medium

What it simulates: The agent approaches its context window limit mid-task, with significant work remaining.

What it tests: Whether the agent summarizes progress and continues, truncates silently, or fails usefully. Agents without context budget awareness produce truncated, incorrect outputs without signaling the truncation.
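Context budget awareness can be as simple as checking usage before each step and compacting older history when it crosses a threshold. In the sketch below, `count_tokens` and `summarize` are hypothetical callbacks supplied by the runtime, and the 80% threshold is an assumption.

```python
def compact_history(messages, count_tokens, summarize, limit, keep_recent=2, threshold=0.8):
    """Return (messages, compacted): summarize older messages when near the limit."""
    used = sum(count_tokens(m) for m in messages)
    if used <= threshold * limit or len(messages) <= keep_recent:
        return messages, False
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Replace everything except the most recent messages with a single summary.
    return [summarize(older)] + recent, True
```

The returned flag lets the caller record that compaction happened, so downstream consumers are told the history was condensed rather than silently truncated.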

Scenario 27 of 28 Vector index staleness
MCP / Agent Impact: Medium

What it simulates: The vector store serving retrieval-augmented generation contains stale embeddings — documents that have been updated in the source system but not re-indexed.

What it tests: Whether the agent's retrieval produces confidently wrong answers grounded in outdated data, and whether the system has any mechanism to detect and signal staleness.
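One possible staleness check compares each document's last update in the source system with the time it was last embedded. The function name and the two timestamp maps are assumptions; how you obtain those timestamps depends on your source system and vector store.

```python
from datetime import datetime, timedelta, timezone


def stale_documents(source_updated_at, indexed_at, tolerance=timedelta(0)):
    """Ids of documents updated after their last indexing, or never indexed."""
    stale = []
    for doc_id, updated in source_updated_at.items():
        indexed = indexed_at.get(doc_id)
        if indexed is None or updated - indexed > tolerance:
            stale.append(doc_id)
    return sorted(stale)
```

The resulting list can drive re-indexing, or be attached to retrieval results so that answers grounded in a stale document carry a warning.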

Scenario 28 of 28 Tool registry unavailability
MCP / Agent Impact: High

What it simulates: The tool registry that the agent queries at runtime for available tools is unavailable. The agent cannot discover which tools it can use.

What it tests: Whether the agent can operate with a cached tool manifest, fail gracefully, or degrade to a subset of hardcoded tools. Tool discovery failure at runtime is the agent equivalent of a service discovery outage in microservice architectures.
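The three fallback levels named above (live registry, cached manifest, hardcoded subset) can be sketched as one function. `discover_tools`, the `fetch_registry` callback, the dictionary cache, and the default tool name are hypothetical.

```python
def discover_tools(fetch_registry, cache, hardcoded=("search",)):
    """Return (tools, source), degrading from registry to cache to a hardcoded set."""
    try:
        tools = list(fetch_registry())
        cache["tools"] = tools  # refresh the cached manifest on every success
        return tools, "registry"
    except Exception:
        if cache.get("tools"):
            return cache["tools"], "cache"
        return list(hardcoded), "hardcoded"
```

Returning the source alongside the tool list lets the agent log, or tell the user, that it is running on a possibly outdated manifest.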


Run any scenario now

Every scenario above is available in SysSimulator — free, in your browser, with no signup or infrastructure required. Load any of the 57 architecture blueprints, start traffic, and inject any scenario from the chaos panel.

For how to design and interpret experiments, see chaos engineering principles and practices. For interview preparation using these scenarios, see the interview prep guide.
