What is a chaos engineering scenario?

A chaos engineering scenario is a specific, reproducible failure condition injected into a system to test its resilience. Each scenario simulates a real production failure event — cache stampede, node failure, network partition, connection pool exhaustion — and observes how the system responds. A well-designed chaos scenario has a defined blast radius (which components are affected), a measurable signal (which metrics deviate from baseline), and a clear pass/fail criterion (does the system recover to steady state within the acceptable window?).

What is a cache stampede scenario?

The cache stampede chaos scenario forces cache hit rate to near zero, simulating a cache restart, mass invalidation, or eviction storm. At 10,000 RPS with 98% normal cache hit rate, the database handles 200 QPS. With zero cache hits, the database suddenly receives 10,000 QPS — a 50x load spike. At high enough RPS, connection pool exhaustion follows within seconds. Mitigations: probabilistic early expiration (renew cache entries before expiry), request coalescing (collapse concurrent misses for the same key into one database query), write-through caching (keep cache populated during spikes).
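
Probabilistic early expiration, the first mitigation listed above, can be sketched in a few lines. This is a minimal illustration of one commonly cited formulation, not SysSimulator's implementation; the function name `should_refresh_early` and the parameters `recompute_cost_s` and `beta` are assumptions made for the example.

```python
import math
import random


def should_refresh_early(now: float, expiry: float, recompute_cost_s: float,
                         beta: float = 1.0, rng=random) -> bool:
    """Decide whether this request should recompute the entry before it expires.

    Hypothetical helper: the closer `now` is to `expiry`, and the more
    expensive the recompute, the more likely a single request refreshes early.
    """
    u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is always defined
    # -log(u) is exponentially distributed: usually small, occasionally large.
    return now - recompute_cost_s * beta * math.log(u) >= expiry
```

Because each request rolls its own random number, refreshes are spread over the seconds before expiry instead of all landing at the expiry instant. Raising `beta` above 1.0 makes early refreshes more likely.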

What is a network partition scenario in chaos engineering?

A network partition chaos scenario drops all requests between two specific components, simulating a broken switch, firewall rule change, VPC misconfiguration, or cross-AZ outage. One side of the partition continues operating; the other queues requests that never complete and eventually time out. Without circuit breakers, the calling component's connection pool is exhausted. This scenario directly tests CAP theorem tradeoffs: does your system prefer consistency (rejecting requests it cannot fulfill correctly) or availability (serving potentially stale data)?

What is a thundering herd chaos scenario?

The thundering herd chaos scenario simulates many clients making the same request simultaneously after a period of silence — a reconnection storm after a server restart, a scheduled job triggering simultaneously across many instances, or a popular cached item expiring and causing all pending requests to miss at the same moment. The coordinated, homogeneous load overwhelms a single resource even when average RPS is within normal range. Mitigations include jittered retry timers (randomise reconnect delays to spread load), request coalescing, and probabilistic cache expiration.

What is connection pool exhaustion in chaos engineering?

Connection pool exhaustion occurs when all available database connections are in use and new requests cannot acquire one, causing them to queue and eventually time out. The insidious aspect is that database CPU may be entirely normal: the constrained resource is the connection pool, not processing capacity. The chaos scenario gradually fills the pool until no new connections can be established. Detection requires pool utilization monitoring (80% utilization is a warning signal, 95% is near-incident). Mitigation: right-size pools based on measured peak utilization, add pool saturation alerts, and use connection multiplexing with tools like PgBouncer.

What is a cascading failure in chaos engineering?

A cascading failure occurs when one component failure overwhelms a downstream dependency, which fails and overwhelms its own dependency, propagating failure through the system. Aggressive retries without circuit breakers accelerate the cascade: a failing call that is retried 3 times is attempted up to 4 times in total, multiplying load on an already-failing downstream by up to 4x. The cascading failure chaos scenario triggers this propagation pattern and tests whether circuit breakers are positioned correctly to contain it. Mitigation: circuit breakers at every service boundary, with fail-fast thresholds tuned to stop cascade propagation before it reaches user-facing services.

Which chaos scenario should I run first?

Run the cache stampede scenario first. For most web systems, the cache layer absorbs 90–98% of read traffic, making cache failure the highest-impact single-component failure. At typical cache hit rates, a stampede creates a 10–50x load spike on the origin database in under a second. If your system passes the cache stampede at normal RPS, run node failure on each stateful component to find SPOFs. Run network partition between your most critical service pairs third. Fix everything those three scenarios find before exploring more exotic scenarios.

How do chaos engineering scenarios map to system design interview questions?

Each major chaos scenario maps directly to a common system design interview failure question. Cache stampede maps to: 'What happens when your cache goes down?' Node failure maps to: 'What's your single point of failure?' Network partition maps to: 'How does your system handle a network split between regions?' Thundering herd maps to: 'What happens when all your clients reconnect at once?' Connection pool exhaustion maps to: 'How do slow database queries affect other services?' Running these scenarios in SysSimulator gives you specific numbers to use in interviews: real p99 values, real pool saturation timelines, real blast radius measurements.

Chaos engineering scenarios: all 28 failure modes

This is the complete reference for every chaos engineering scenario available in SysSimulator. For each scenario you'll find what it simulates, which metrics it impacts, what a real production example looks like, and the mitigations that address the root cause.

The 28 scenarios span seven failure categories. Six of them (network, infrastructure, traffic, data layer, application, and dependency) cover scenarios 1 to 24 and are always available. The seventh, MCP / AI agent scenarios (25 to 28), unlocks when you load an agent architecture blueprint.

For the principles behind when and how to run these scenarios, see chaos engineering principles and practices. To run any scenario against your own architecture, open SysSimulator — no signup, no install.

Recommended starting order

Not all chaos scenarios are equal. Run them in this order for the highest-signal-per-experiment ratio:

Priority	Scenario	Why first
1st	Cache stampede	Exposes the most common catastrophic failure mode for web systems. Cache absorbs 90–98% of read load; its failure creates an immediate 10–50x DB load spike.
2nd	Node failure	Identifies single points of failure. Run on every stateful component individually.
3rd	Network partition	Exposes CAP theorem tradeoffs in practice. Where does your system prefer consistency vs availability?
4th	Thundering herd	Exposes reconnection storm vulnerabilities — critical before any maintenance operation.
5th	Connection pool exhaustion	Exposes the hidden constraint between your application tier and database — often invisible until it saturates.
6th+	All others	Scenario-specific, run after the five above have passed.

Network chaos scenarios

Network failures are the most common production failure category. Four scenarios cover the main failure modes: added latency, dropped connectivity, intermittent packet loss, and constrained throughput.

Scenario 1 of 28 Latency injection

Network Impact: Medium–High

What it simulates: Artificial latency added to all requests through a target component — a slow upstream dependency, geographic distance, network congestion, or an overloaded downstream service.

Metrics impacted

p99 latency (multiplies through synchronous call chains), throughput (queue buildup), error rate (timeout-induced errors at high injection values)

Primary mitigation

Decouple synchronous call chains into async patterns. Apply read timeouts on all dependency calls. Identify which dependencies are on the critical latency path.

Real-world example: A payment processor starts responding at 800ms instead of 80ms. Your checkout API is synchronous on the payment call, so each in-flight checkout holds a thread for the full 800ms. At 10,000 RPS with a 5% checkout rate, 500 checkouts arrive per second, which means about 400 threads are held at any moment (500 per second × 0.8 seconds), up from about 40 at normal latency. Thread pool saturation follows within seconds.
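
The thread arithmetic in this kind of example follows Little's law: average requests in flight equals arrival rate multiplied by time in system. A minimal sketch, using a hypothetical helper named `threads_held` and the traffic figures from the example:

```python
def threads_held(total_rps: float, fraction: float, latency_s: float) -> float:
    """Average threads held at once: arrival rate x time each request holds a thread."""
    return total_rps * fraction * latency_s


baseline = threads_held(10_000, 0.05, 0.080)  # payment call at 80 ms
degraded = threads_held(10_000, 0.05, 0.800)  # payment call at 800 ms
print(f"baseline: {baseline:.0f} threads, degraded: {degraded:.0f} threads")
```

With 500 checkout requests per second, a tenfold latency increase means ten times as many threads held at any moment (about 40 rising to about 400), which is why a thread pool sized for normal latency saturates so quickly.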

Interview application: "Your third-party payment processor starts responding slowly — walk me through the blast radius on your checkout service."

Scenario 2 of 28 Network partition

Network Impact: High

What it simulates: Complete loss of connectivity between two components — a broken switch, firewall rule change, VPC misconfiguration, or cross-availability-zone outage.

Metrics impacted

Error rate (requests that cannot complete), p99 latency (requests hang until timeout), connection pool utilization (held connections waiting for timeout)

Primary mitigation

Circuit breakers at every service boundary. Explicit fallback behavior at each partition point. Design for "what does this service do when it cannot reach its dependency?"
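
The circuit-breaker mitigation can be sketched as a small state machine. This is an illustrative minimal version, not a production library: the class name, the `failure_threshold` and `reset_after_s` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows one probe call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_after_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, fail fast to the fallback until the cooldown has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Cooldown over: half-open, let this one call through as a probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The `fallback` argument is where the explicit "what does this service do when it cannot reach its dependency?" decision lives: return cached data, return a degraded response, or raise a fast error.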

Real-world example: A firewall rule change drops all traffic between the application tier and the database. Without circuit breakers, application tier connections to the database hang until their timeout, typically 30 seconds. At 5,000 RPS with 10% DB-touching requests, 500 requests per second pile up, so 15,000 requests queue and time out in each 30-second window. Thread pool exhaustion follows.

Interview application: "Your primary region loses connectivity to your database. You have read replicas in a secondary region. Walk me through the failover."

Scenario 3 of 28 Packet loss

Network Impact: Medium

What it simulates: A configurable percentage of requests are dropped — not all of them, unlike a full partition. Some requests complete, some do not. Intermittent and non-deterministic.

Metrics impacted

Error rate (rises proportionally with loss percentage), p99 latency (retry overhead), request success rate

Primary mitigation

Idempotent operations that are safe to retry. At-least-once delivery with deduplication on the receiver side. Retry budgets to prevent retry storms.
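
Receiver-side deduplication is often implemented with a client-supplied idempotency key. The sketch below is a toy in-memory version under stated assumptions: `PaymentService`, `charge`, and the response shape are hypothetical, and a real service would need durable storage and an atomic check-and-set instead of a plain dictionary.

```python
class PaymentService:
    """Toy payment handler that deduplicates on a client-supplied idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a key already processed returns the stored response
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        response = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response
```

The client generates the key once per logical payment and reuses it on every retry, so a retry after a lost response returns the original result.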

Real-world example: A flaky wireless link between a mobile client and your API drops 15% of messages. The client retries immediately. If the payment endpoint is not idempotent, every attempt whose request reached the server but whose response was lost results in a double charge when the retry succeeds.

Interview application: "A mobile client on a flaky connection is retrying your payment endpoint. Are you charging once or multiple times?"

Scenario 4 of 28 Bandwidth throttle

Network Impact: Medium

What it simulates: The network throughput of a component is constrained. Requests are processed but at reduced rate, causing queue buildup upstream.

Metrics impacted

Queue depth (builds upstream of the throttled component), p99 latency, throughput (capped at throttle limit)

Primary mitigation

Backpressure signals from throttled components to callers. Client-side rate limiting to match the downstream capacity ceiling. Load shedding at the throttled component boundary.
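
One simple form of backpressure is a bounded queue that refuses new work when full, so the caller finds out immediately instead of the queue growing without limit. A minimal sketch using Python's standard `queue` module; the `submit` helper and the queue size are assumptions for illustration.

```python
import queue


def submit(work_queue: queue.Queue, job) -> bool:
    """Try to enqueue a job; return False (shed the job) if the queue is full."""
    try:
        work_queue.put_nowait(job)
        return True
    except queue.Full:
        return False


q = queue.Queue(maxsize=3)
accepted = [submit(q, n) for n in range(5)]
print(accepted)  # [True, True, True, False, False]
```

A `False` return is the backpressure signal: the caller can slow down, retry later, or drop non-critical work.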

Real-world example: A data pipeline egress is throttled by a cloud provider during peak usage. Upstream jobs accumulate in a queue. The queue grows faster than it drains. Eventually, queue-full backpressure reaches the ingestion layer.

Infrastructure chaos scenarios

Infrastructure failures target compute, storage, and memory at the host or container level. These are the most straightforward failure modes to identify but often reveal the most critical SPOFs.

Scenario 5 of 28 Node failure

Infrastructure Impact: High

What it simulates: A component goes entirely offline. Simulates an EC2 instance crash, container OOM kill, Kubernetes pod eviction, or bare-metal hardware failure.

Metrics impacted

Error rate (traffic that was on the failed node), p99 latency (load concentrates on surviving nodes), throughput (if surviving nodes reach capacity)

Primary mitigation

Redundancy with N+1 capacity (surviving nodes can absorb the failed node's traffic). Health checks with automatic traffic rerouting. Stateless services that restart cleanly.

Real-world example: One of three application servers in a load balancer pool goes down. Each server was carrying about 33% of the traffic; with only two survivors, each now carries 50%, a 50% increase in per-server load. If that exceeds a server's capacity, both survivors degrade and the failure cascades.
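
Whether the surviving nodes can absorb a failed node's traffic is a short calculation. A minimal sketch, assuming traffic is spread evenly and using a hypothetical helper named `load_after_failures`:

```python
def load_after_failures(nodes: int, failed: int, utilization: float) -> float:
    """Per-node utilization after `failed` nodes drop out and traffic is respread evenly."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return utilization * nodes / survivors


for before in (0.60, 0.70):
    after = load_after_failures(nodes=3, failed=1, utilization=before)
    verdict = "ok" if after <= 1.0 else "overloaded"
    print(f"{before:.0%} per node before, {after:.0%} after: {verdict}")
```

For a three-node pool to survive one failure, each node needs to run at no more than two thirds of its capacity in normal operation.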

Interview application: "Your diagram has a single database instance. What's your RTO if that instance fails?"

Scenario 6 of 28 Disk full

Infrastructure Impact: High (writes)

What it simulates: A component runs out of disk space. Database writes fail. App servers with local logging or write caching fail writes silently or noisily depending on error handling.

Metrics impacted

Write error rate (spikes to 100%), read availability (may continue), data durability (in-flight writes lost)

Primary mitigation

Disk utilization alerts at 70% and 85%. Log rotation and retention policies. Write failure handling that surfaces errors explicitly rather than silently dropping writes.

Real-world example: A database data volume fills because a runaway query produced an unexpectedly large result set that was written to a temp table. All subsequent writes fail. Reads continue from the existing data. The application appears partially functional but is silently losing data.

Scenario 7 of 28 CPU spike

Infrastructure Impact: Medium–High

What it simulates: A component's CPU is fully saturated. Simulates heavy table scans, large sort operations, expensive cryptographic work, or inefficient regex evaluation.

Metrics impacted

p99 latency (rises sharply as requests queue), throughput (capped by processing capacity), queue depth

Primary mitigation

Scale out (add replicas for CPU-bound services). Query optimization for database CPU spikes. Identify the specific workload causing CPU saturation — the fix is workload-specific.

Real-world example: A full-table scan triggered by a missing index causes database CPU to spike to 100%. All queries slow proportionally. Queue depth builds. Application servers start timing out on database calls.

Scenario 8 of 28 Memory pressure

Infrastructure Impact: Medium

What it simulates: A component approaches out-of-memory conditions. In GC-managed runtimes (JVM, Go, Node.js), this increases GC frequency and pause duration.

Metrics impacted

p99 latency (erratic spikes from GC pauses), error rate (eventual OOM crashes), GC pause duration

Primary mitigation

Memory utilization alerts. JVM heap sizing. Memory leak detection via sawtooth memory graphs. Restart policies for containers approaching OOM limits.

Real-world example: A JVM-based service accumulates request contexts in a map that is never cleaned up after request completion. Memory grows slowly. GC pauses lengthen. p99 spikes erratically. Eventually the container hits its memory limit and is OOM-killed, triggering a node failure cascade.

Traffic chaos scenarios

Traffic failure modes test how systems behave under abnormal load patterns — not just volume, but timing, size, and coordination.

Scenario 9 of 28 Request spike

Traffic Impact: High

What it simulates: Incoming request rate multiplies suddenly — 5x, 10x, or configurable. Simulates viral moments, flash sales, bot attacks, or synchronized scheduled jobs.

Metrics impacted

First bottlenecked component saturates; error rate climbs as queue depth exceeds capacity

Primary mitigation

Rate limiting at the ingress. Autoscaling with pre-warming. Load shedding with graceful degradation (drop non-critical requests, serve critical ones). Capacity headroom above peak-of-peak, not average.
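
Ingress rate limiting is commonly built on a token bucket. The sketch below is a minimal single-threaded version under stated assumptions: the class name, the `rate_per_s` and `burst` parameters, and the injectable `clock` are illustrative, and a real deployment would need locking or a shared store.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Requests for which `allow()` returns `False` are the ones to reject or shed, ideally starting with non-critical traffic.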

Real-world example: A product is featured on a major media site. Traffic spikes 8x in under a minute. The bottleneck is the session token validation service, which is single-instance. All authenticated requests fail for 4 minutes until a second instance is started.

Scenario 10 of 28 Payload bloat

Traffic Impact: Medium

What it simulates: Average request and response payload size increases significantly. Simulates large file uploads, unexpectedly verbose JSON, logging payload explosions, or large query result sets.

Metrics impacted

Network-bound component throughput (decreases), database write throughput (decreases with larger rows), gateway rejection rate (if payload limits enforced)

Primary mitigation

Payload size limits at ingress (400/413 responses for oversized requests). Pagination for large query results. Field filtering to return only requested fields. Compression for large payloads.

Real-world example: A logging library update starts including full stack traces in every log entry, increasing log payload size 20x. The log aggregation pipeline, sized for normal log volume, starts dropping logs and eventually backs up to the point where the application itself slows waiting for log writes to complete.

Scenario 11 of 28 Slow clients

Traffic Impact: Medium

What it simulates: Clients read responses slowly — simulating mobile connections on poor signal, geographic distance, or clients doing expensive processing between reads.

Metrics impacted

Connection pool utilization (rises even with idle CPU — connections are held open for transfer duration), p99 latency

Primary mitigation

Write timeouts on responses (terminate connections that aren't reading). Async response streaming. Separate connection pools for fast and slow clients. Payload size reduction.

Real-world example: Mobile clients on 3G connections take 10 seconds to receive a 500KB response. Server connections are held open for the full transfer. At 1,000 slow requests per second, each holding a connection for 10 seconds, about 10,000 connections are open at any moment, rapidly exhausting connection pools even when the server CPU is idle.

Scenario 12 of 28 Thundering herd

Traffic Impact: High

What it simulates: Many clients make the same request simultaneously after a period of silence — a reconnection storm after a server restart, a popular cached item expiring and causing concurrent misses, or scheduled jobs firing at the same second.

Metrics impacted

Instantaneous load spike (even at normal average RPS), single-resource saturation, p99 latency spike

Primary mitigation

Jittered retry timers (randomise reconnect delays across a wide distribution). Request coalescing (serve one pending request, then propagate the result to all waiters). Probabilistic early cache expiration to prevent synchronized expiry.
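
A jittered retry timer can be as small as one function. This sketch uses the "full jitter" variant of exponential backoff; the function name and the default `base_s` and `cap_s` values are assumptions for illustration.

```python
import random


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 60.0, rng=random) -> float:
    """Full-jitter backoff: a uniform random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Each client draws its own delay, so reconnects spread out instead of aligning.
for attempt in range(4):
    print(f"attempt {attempt}: wait {backoff_delay(attempt):.2f}s")
```

The randomness matters more than the exponent: without it, every client that disconnected at the same moment would retry at the same moment too.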

Real-world example: Slack, May 2022. A routine maintenance operation caused all Slack clients to disconnect simultaneously. On completion, millions of clients reconnected at the same instant, overwhelming connection handling infrastructure. The fix: exponential backoff with jitter on client reconnect, spreading the load over several minutes instead of milliseconds.

Data layer chaos scenarios

Data layer failures are the highest-stakes failure category because they can affect data durability, not just availability. These scenarios test your database failover, replication behavior, cache resilience, and connection management.

Scenario 13 of 28 Database crash

Data layer Impact: Critical

What it simulates: The primary database goes entirely offline. Writes stop immediately. Read replicas may continue serving reads depending on your read/write split configuration.

Metrics impacted

Write error rate (100%), read availability (depends on replica configuration), failover RTO (how long until the replica is promoted)

Primary mitigation

Automated primary promotion with sub-minute RTO (RDS Multi-AZ, Patroni, etc.). Read/write split so reads continue from replicas during primary outage. Write queuing for non-critical writes during failover window.

Real-world example: Amazon DynamoDB, February 2021. A configuration change to DynamoDB's metadata service caused a cache overload that propagated to impair the entire service. The root cause was a change that resulted in increased cache misses, which cascaded to database load — effectively a cache stampede at service infrastructure level.

Interview application: "Your diagram has a primary database with read replicas. Walk me through RTO when the primary fails."

Scenario 14 of 28 Replication lag

Data layer Impact: Medium

What it simulates: Growing delay between when writes are committed on the primary and when they become visible on read replicas.

Metrics impacted

Read freshness (replica reads return stale data), write-then-read consistency failures, replication lag metric

Primary mitigation

Route freshness-critical reads to the primary, not replicas. Replication lag alerts at 1 second, emergency response at 30 seconds. Avoid read replicas for flows that require reading what was just written.
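
One way to route freshness-critical reads is to pin a user to the primary for a short window after they write. A minimal sketch under stated assumptions: `ReadRouter`, its method names, and the `pin_window_s` parameter are hypothetical, and the window has to be longer than the replication lag you actually observe.

```python
import time


class ReadRouter:
    """Send a user's reads to the primary for a short window after their last write."""

    def __init__(self, pin_window_s: float = 10.0, clock=time.monotonic):
        self.pin_window_s = pin_window_s
        self.clock = clock
        self._last_write = {}  # user id -> time of most recent write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_window_s:
            return "primary"
        return "replica"
```

Only the user who just wrote is pinned, so most reads still go to replicas and the primary sees only a small increase in read load.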

Real-world example: A social platform routes all reads to replicas to reduce primary load. A user posts a tweet, which is written to the primary. They immediately refresh their profile, which reads from a replica that is 8 seconds behind. The tweet doesn't appear. The user thinks the post failed and submits it again — creating a duplicate.

Interview application: "You're using read replicas for 90% of reads. A user writes a record and immediately reads it back. How do you handle this?"

Scenario 15 of 28 Cache stampede

Data layer Impact: Critical

What it simulates: Cache hit rate drops to near zero — simulating a cache restart, mass invalidation event, or eviction storm when the cache is undersized for the working set.

Metrics impacted

Database QPS (spikes by 1/(1-hit_rate) factor), connection pool utilization, p99 latency, error rate (if DB saturates)
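
The 1/(1-hit_rate) factor is easy to tabulate. A minimal sketch with a hypothetical helper name, assuming every cache miss turns into exactly one database query:

```python
def db_load_multiplier(hit_rate: float) -> float:
    """Factor by which database QPS grows if the hit rate falls from `hit_rate` to zero."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


for hit_rate in (0.90, 0.95, 0.98):
    print(f"{hit_rate:.0%} hit rate -> {db_load_multiplier(hit_rate):.0f}x database load")
```

The better the cache performs in normal operation, the larger the spike when it disappears: 90% gives 10x, 98% gives 50x.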

Primary mitigation

Probabilistic early expiration (renew entries before they expire, probabilistically). Request coalescing (collapse concurrent misses for the same key into one DB query). Write-through caching (keep cache populated during restarts). Secondary cache layer with longer TTL.
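
Request coalescing can be sketched with a lock and one event per in-flight key. This is an illustrative minimal version, not a library API: the `SingleFlight` class and its `load` method are names invented for the example, and `loader` stands in for whatever function queries the database.

```python
import threading


class SingleFlight:
    """Collapse concurrent loads of the same key into one call to `loader`."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result holder)

    def load(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            # Only the first caller for a key actually runs the loader.
            try:
                holder["value"] = loader(key)
            except Exception as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            # Everyone else waits for the leader and shares its result.
            done.wait()
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

With this in front of the database, a thousand concurrent misses for the same key produce one query, not a thousand.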

Real-world example: A Redis cluster is restarted for a configuration change. Before the restart, a 98% hit rate meant that only 200 of the 10,000 requests per second (the 2% of misses) reached the database. After the restart, all 10,000 do: a 50x spike. A pool of 100 connections is exhausted in under 2 seconds and the database becomes unresponsive. Once the cache is back up, it cannot warm fast enough because the database is too loaded to serve the initial fill requests.

Interview application: "Your cache goes down at peak traffic. Walk me through the blast radius."

Scenario 16 of 28 Connection pool exhaustion

Data layer Impact: High

What it simulates: Database connection pool fills until no new connections can be established. New requests queue, timeout, and fail — even when database CPU and memory are entirely normal.

Metrics impacted

Connection pool utilization (100%), request error rate (timeout errors), p99 latency (queue wait time), database CPU (may appear normal — misleadingly)

Primary mitigation

Pool utilization monitoring (alert at 80%). PgBouncer/connection multiplexing to serve more app-side connections from fewer database connections. Connection timeout tuning — requests should fail fast when the pool is full, not wait indefinitely. Identify slow queries that hold connections past their useful lifetime.

Real-world example: A slow query introduced by a deployment takes 800ms instead of the normal 40ms, so each query holds its connection 20 times longer. The same 100-connection pool that served 2,500 QPS (100 connections ÷ 40ms) now serves only 125 QPS (100 connections ÷ 800ms). At 1,000 QPS of DB-touching requests, 87.5% of requests cannot get a connection and time out.
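
The pool arithmetic in this example generalizes to any pool size and query time. A minimal sketch with hypothetical helper names, assuming every query holds exactly one connection for its full duration:

```python
def pool_capacity_qps(connections: int, query_time_s: float) -> float:
    """Maximum queries per second a pool can sustain at a given query time."""
    return connections / query_time_s


def starved_fraction(offered_qps: float, capacity_qps: float) -> float:
    """Share of requests that cannot get a connection at steady state."""
    return max(0.0, 1.0 - capacity_qps / offered_qps)


normal = pool_capacity_qps(100, 0.040)
slow = pool_capacity_qps(100, 0.800)
print(f"capacity: {normal:.0f} QPS normally, {slow:.0f} QPS with the slow query")
print(f"starved at 1,000 QPS: {starved_fraction(1_000, slow):.1%}")
```

Running the same calculation with your own pool size and p99 query time gives a rough ceiling to compare against peak traffic.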

Application chaos scenarios

Application-level failures test runtime behavior: memory management, thread scheduling, lock contention, and failure propagation logic.

Scenario 17 of 28 Memory leak

Application Impact: Medium (slow onset)

What it simulates: Continuous, slow memory growth in a component that is never released — objects accumulated in maps without eviction, unclosed streams, event listener leaks.

Metrics impacted

Memory utilization (grows monotonically), GC pause duration (increases as heap fills), p99 latency (erratic due to GC), eventual OOM crash

Primary mitigation

Memory utilization monitoring with trend alerts. Heap profiling to identify leak sources. Bounded data structures (caches with eviction policies, not unbounded maps). Scheduled restarts as a stopgap while root cause is fixed.

Signal to watch: A sawtooth pattern on the memory graph whose floor keeps rising. Usage drops after each GC cycle but never returns to baseline, so over time the peaks get higher and the drops after each collection get smaller. OOM is the endpoint if unaddressed.

Scenario 18 of 28 Thread pool exhaustion

Application Impact: High

What it simulates: Worker threads are all occupied and cannot accept new requests. Simulates blocking I/O that holds threads — synchronous database calls, synchronous HTTP calls to slow dependencies, or CPU-intensive work on the request thread.

Metrics impacted

Request queue depth (grows rapidly), error rate (queue-full rejections or timeouts), CPU (may appear low — threads are blocked, not computing)

Primary mitigation

Async I/O to free threads during wait time. Separate thread pools for different I/O categories (DB, external HTTP, internal). Queue depth alerts. Thread pool sizing based on measured concurrency, not rule of thumb.

Real-world example: A service uses a fixed thread pool of 200 threads. A downstream dependency starts responding at 2 seconds instead of 50ms. Each thread is blocked for 2 seconds per request. The pool that served 4,000 RPS now serves 100 RPS. At 2,000 RPS of incoming traffic, the queue fills in seconds.

Scenario 19 of 28 Deadlock

Application Impact: High

What it simulates: Circular lock dependencies where two or more operations each hold a lock the other needs, causing both to wait indefinitely.

Metrics impacted

Affected request latency (goes to infinity — they never complete), thread pool utilization (fills with stuck threads), eventual thread pool exhaustion

Primary mitigation

Consistent lock acquisition order (always acquire lock A before lock B, across all code paths). Lock timeouts (fail after N seconds rather than waiting forever). Short transaction windows. Deadlock detection in database engines (most databases detect and kill deadlocked transactions automatically).
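
Consistent lock ordering is easiest to see with two resources that can be locked in either direction. A minimal sketch, assuming a hypothetical `Account` type whose `account_id` defines the global lock order:

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    if src is dst:
        return  # nothing to move, and locking the same lock twice would hang
    # Always lock the account with the smaller id first, whatever the direction,
    # so two opposite transfers cannot each hold the lock the other needs.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount
```

If `transfer` instead locked `src` then `dst`, a transfer from A to B running alongside a transfer from B to A could leave each thread holding one lock and waiting forever for the other.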

Scenario 20 of 28 Cascading failure

Application Impact: Critical

What it simulates: A component failure overloads a downstream dependency, which fails and overloads its dependency, propagating failure through the dependency graph.

Metrics impacted

Error rate (climbs through the dependency graph), p99 latency (climbs as retries amplify load), bottleneck count (grows as cascade propagates)

Primary mitigation

Circuit breakers at every service boundary (fail fast on unhealthy downstream, stop retries from amplifying cascade). Exponential backoff with jitter on retries. Bulkhead pattern (separate thread pools per downstream dependency to contain blast radius).

Key insight: Aggressive retries without backoff are the primary cascade accelerator. A service receiving 1,000 RPS that retries 3 times on each 5xx sends up to 4,000 RPS (the original attempt plus three retries) to an already-failing downstream. The downstream, now receiving up to 4x load, fails harder, causing the upstream to retry more. The result is a positive feedback loop that can collapse both services.
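
Retry amplification can be estimated by counting the original attempt plus every retry. A minimal sketch, assuming each attempt fails independently with the same probability and using a hypothetical helper named `downstream_rps`:

```python
def downstream_rps(incoming_rps: float, max_retries: int, failure_rate: float) -> float:
    """Expected attempts per second when every failed attempt is retried, up to max_retries."""
    # Attempt k (0-based) only happens if the previous k attempts all failed.
    attempts_per_request = sum(failure_rate ** k for k in range(max_retries + 1))
    return incoming_rps * attempts_per_request


for failure_rate in (0.0, 0.5, 1.0):
    rps = downstream_rps(1_000, max_retries=3, failure_rate=failure_rate)
    print(f"failure rate {failure_rate:.0%}: {rps:.0f} RPS reaches the downstream")
```

Amplification grows with the failure rate, so the downstream receives the most traffic exactly when it is least able to handle it. Backoff, retry budgets, and circuit breakers all work by capping `attempts_per_request`.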

Dependency chaos scenarios

Dependency failures test how your system behaves when third-party services — payment providers, email services, identity providers, external APIs — degrade or fail.

Scenario 21 of 28 Third-party timeout

Dependency Impact: High

What it simulates: An external dependency stops responding entirely. Requests hang until they reach their timeout — which may be 30 seconds or more if not explicitly configured.

Metrics impacted

Thread pool utilization (held for timeout duration), error rate (timeout errors after delay), p99 latency (equals timeout value for affected requests)

Primary mitigation

Short, explicit read timeouts on every external call (500ms–2s, rather than a library default that may be 30 seconds or longer). Circuit breakers that open after N consecutive timeouts. Async dependency calls so threads aren't held. Graceful degradation when the dependency is unavailable.

Interview application: "Your payment provider is down. What happens to your checkout flow?"

Scenario 22 of 28 Degraded response

Dependency Impact: Medium

What it simulates: A dependency responds at 10x normal latency instead of timing out completely. It appears nominally healthy in status checks but is significantly slower.

Metrics impacted

p99 latency (rises proportionally), thread/connection utilization (holds longer per request), effective throughput (decreases)

Primary mitigation

Read timeouts calibrated to internal SLAs, not default library values. p99 latency alerts on dependency calls (not just error rate). Circuit breakers triggered by latency, not just errors.

Why this scenario is harder than a full timeout: A full timeout produces obvious 5xx errors. Slow responses produce high latency and connection saturation without triggering error-rate alerts. Many teams don't notice until thread pools are near exhaustion.

Scenario 23 of 28 Error response

Dependency Impact: Medium

What it simulates: A dependency returns 5xx errors quickly — fast failures rather than hangs.

Metrics impacted

Error rate (rises fast), retries (naive retry logic amplifies load on failing dependency)

Primary mitigation

Exponential backoff with jitter on retries. Circuit breakers to stop retrying after threshold errors. Retry budgets to prevent retry storms from amplifying load on the failing dependency.

Scenario 24 of 28 Rate limit hit

Dependency Impact: Medium

What it simulates: A third-party API enforces quota and returns 429 Too Many Requests intermittently as usage exceeds the rate limit.

Metrics impacted

Dependent operation success rate, quota remaining (rate-unaware retries burn remaining quota faster)

Primary mitigation

Client-side quota budgeting and tracking. Respect Retry-After headers. Adaptive throttling that backs off when approaching quota limits. Graceful degradation for quota-limited operations.
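
Respecting Retry-After starts with parsing it, since the header can carry either a number of seconds or an HTTP date. A minimal sketch using only the standard library; the function name is an assumption, and a malformed header raises an exception that the caller would need to handle.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value: str, now: datetime) -> float:
    """Seconds to wait according to a Retry-After header (delta-seconds or HTTP-date)."""
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    retry_at = parsedate_to_datetime(value)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (retry_at - now).total_seconds())


now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120", now))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now))
```

Waiting for at least this long before the next call avoids spending remaining quota on requests that are certain to be rejected.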

MCP / AI agent scenarios

These four scenarios unlock automatically when you load a blueprint that includes Agent Runtimes, MCP Tool Servers, Vector Stores, or Tool Registries. They do not appear on classic microservice blueprints.

Scenario 25 of 28 Tool server timeout

MCP / Agent Impact: High

What it simulates: An MCP tool server stops responding mid-task. The agent runtime is waiting for a tool call result that never arrives.

What it tests: Whether the agent retries intelligently, falls back to an alternative tool, or fails gracefully with a useful error message. Agents without explicit tool-timeout handling may hang indefinitely or consume the full context window on retry loops.

Scenario 26 of 28 Token budget pressure

MCP / Agent Impact: Medium

What it simulates: The agent approaches its context window limit mid-task, with significant work remaining.

What it tests: Whether the agent summarizes progress and continues, truncates silently, or fails usefully. Agents without context budget awareness produce truncated, incorrect outputs without signaling the truncation.

Scenario 27 of 28 Vector index staleness

MCP / Agent Impact: Medium

What it simulates: The vector store serving retrieval-augmented generation contains stale embeddings — documents that have been updated in the source system but not re-indexed.

What it tests: Whether the agent's retrieval produces confidently wrong answers grounded in outdated data, and whether the system has any mechanism to detect and signal staleness.

Scenario 28 of 28 Policy deny burst

MCP / Agent Impact: High

What it simulates: The policy enforcement layer starts denying a large share of the agent's tool calls — a tightened permission set, a misconfigured rule, or a security lockdown. Schema validation overhead climbs and the number of allowed tool calls per turn drops.

What it tests: Whether the agent distinguishes "denied by policy" from "tool failed" — the correct responses are different. A well-built agent re-plans around the tools it is still allowed to use and surfaces permission errors to the user; a fragile one burns its context window retrying calls that will never be permitted.

Run any scenario now

Every scenario above is available in SysSimulator — free, in your browser, with no signup or infrastructure required. Load any of the 56 architecture blueprints, start traffic, and inject any scenario from the chaos panel.

For how to design and interpret experiments, see chaos engineering principles and practices. For interview preparation using these scenarios, see the interview prep guide.
