Chaos engineering: principles, scenarios, and best practices

Chaos engineering is the discipline of deliberately injecting failures into systems to discover weaknesses before they cause production incidents. This page covers the full picture: the 5 formal principles from principlesofchaos.org, the 28 failure scenarios you can run in SysSimulator, the best practices used by Netflix, Amazon, and Google, and how to apply all of this in system design interviews.

SysSimulator brings chaos engineering to your browser with zero infrastructure required. Build an architecture, run traffic through it, and inject any of the 28 scenarios to watch exactly how your system breaks — and what it takes to recover.


What is chaos engineering?

Chaos engineering is the practice of deliberately injecting failures into systems to discover weaknesses before they cause production incidents. Netflix pioneered it with Chaos Monkey, a tool its engineering team built around 2010 that randomly killed EC2 instances in production to force engineers to build resilient systems.

The core insight: if your system cannot survive a randomly killed service in a controlled experiment, it definitely cannot survive one at 3am during peak load. Better to discover fragility now, in a controlled way, than later in a customer-facing incident.

The discipline was formally defined in the Principles of Chaos Engineering document: "Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

SysSimulator brings chaos engineering to the learning and design context. You do not need a Kubernetes cluster. You do not need production infrastructure. You do not need to risk real user traffic. Build an architecture, simulate load through it, and inject failures to watch exactly how it breaks.


The 5 principles of chaos engineering

These five principles were formalized at Netflix and documented at principlesofchaos.org. They define what separates rigorous chaos engineering from random failure injection.

Principle 1
Build a hypothesis around steady-state behaviour

A chaos experiment starts with a measurable definition of normal. Before injecting any failure, establish a baseline such as: at 10,000 RPS under normal load, this system runs at p99 = 45ms with a 0.01% error rate. The experiment then tests whether that steady state holds under a specific perturbation. Without a quantified steady state, you cannot know whether chaos changed anything.

SysSimulator establishes your steady state before any scenario runs — the live metrics bar shows RPS, p99, error share, and bottleneck count under normal load. You see your baseline, then watch it deviate when you inject a scenario.

Violated when: Teams inject failures without recording baselines, then cannot tell whether the system "recovered" because they have no definition of recovery.
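A steady-state check can be as small as a comparison against recorded baseline numbers. The sketch below is illustrative only: the SteadyState type, the within_steady_state helper, and the 10% tolerance are assumptions made for this example, not part of SysSimulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Metrics recorded under normal load (hypothetical structure)."""
    rps: float
    p99_ms: float
    error_rate: float  # fraction, e.g. 0.0001 for 0.01%


def within_steady_state(baseline, observed, tolerance=0.10):
    """Return True if observed metrics stay within `tolerance` of baseline.

    Throughput may not drop, and latency and error rate may not rise,
    by more than the given fraction. The 10% default is an assumption.
    """
    return (
        observed.rps >= baseline.rps * (1 - tolerance)
        and observed.p99_ms <= baseline.p99_ms * (1 + tolerance)
        and observed.error_rate <= baseline.error_rate * (1 + tolerance)
    )


baseline = SteadyState(rps=10_000, p99_ms=45.0, error_rate=0.0001)
during_chaos = SteadyState(rps=9_950, p99_ms=180.0, error_rate=0.02)
after_recovery = SteadyState(rps=10_000, p99_ms=46.0, error_rate=0.0001)

print(within_steady_state(baseline, during_chaos))    # False
print(within_steady_state(baseline, after_recovery))  # True
```

The same function doubles as a recovery criterion: the experiment is over when it returns True again.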

Principle 2
Vary real-world events

Real systems fail in patterns that trace back to specific events: traffic spikes, node reboots, hardware failures, dependency degradation. Chaos experiments should map to these real event categories rather than arbitrary failure injection. The 28 scenarios in SysSimulator each correspond to a real production event: cache stampede, network partition, third-party timeout, replication lag.

Violated when: Teams inject random chaos with no mapping to known production failure modes. They test everything and learn nothing specific.

Principle 3
Run experiments in production (or as close to it as possible)

This is the uncomfortable principle. A system behaves differently in production than in staging because the data, traffic patterns, and configuration are different. Staging environments that don't mirror production traffic miss the production failure modes that matter most.

SysSimulator sits at the learning layer — before staging. It teaches you the patterns so that when you graduate to production experiments, you know what to look for and what a healthy recovery looks like.

Violated when: Teams only run chaos in staging, then encounter failure modes in production that staging never exercised.

Principle 4
Automate experiments to run continuously

A chaos experiment that is run once, followed by a fix, and never repeated cannot verify that the fix holds across subsequent deployments. Chaos engineering matures when it becomes part of CI/CD: every deployment runs the canonical failure scenarios, and the build fails if steady-state recovery degrades.

Violated when: Chaos engineering is a one-time audit or a quarterly exercise rather than a continuous feedback loop.

Principle 5
Minimise blast radius

Run the smallest experiment that tests your hypothesis. Start with a single replica in one region. Limit the experiment duration. Have a rollback ready. The goal is learning, not causing incidents. The blast radius principle means chaos experiments should be surgical, not reckless — particularly in production.

Violated when: Teams run wide-blast experiments without controls and trigger real user-facing incidents they cannot quickly recover from.


How to use chaos scenarios in SysSimulator

Build or load an architecture, start the simulation at your target RPS, then select a chaos scenario and aim it at a specific component. The effect applies immediately — you will see it in the live status bar (RPS, p99, error share, bottleneck count) and in the full-page System metrics analysis view (latency history chart, bottleneck list, glossary).

Chaos scenarios compose. You can inject a network partition on your cache and a traffic spike simultaneously and watch a compounding failure. This is closer to real production incidents, which are almost never single-cause.

See all 28 scenarios with detailed analysis in the complete chaos scenarios reference.


Network chaos — 4 scenarios

Latency injection

Adds an artificial latency distribution to all requests passing through a target component. You set the P50 and P99 injection values independently.

What you observe: Requests slow down through the affected component. In synchronous call chains, latency accumulates: if a request calls the affected component five times in sequence, a 100ms injection adds roughly 500ms to end-to-end latency.

What it teaches: Latency-sensitive dependency identification. Long synchronous chains are hidden p99 killers. This scenario reveals which dependencies should be decoupled into async calls.

Interview application: "Your third-party payment processor starts responding slowly. Your checkout API is synchronous on the payment call. Walk me through impact on cart service p99."
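The difference between sequential and concurrent calls is easy to see with a few lines of arithmetic. This is a simplified sketch that treats each call as having a fixed latency (real percentiles do not add this cleanly); the function names and the 20ms baseline are assumptions for illustration.

```python
def sequential_latency_ms(call_latencies_ms):
    """Calls made one after another: latencies add."""
    return sum(call_latencies_ms)


def parallel_latency_ms(call_latencies_ms):
    """Calls made concurrently: the slowest call dominates."""
    return max(call_latencies_ms)


baseline_calls = [20, 20, 20, 20, 20]                 # five downstream calls, ms each
injected_calls = [ms + 100 for ms in baseline_calls]  # +100 ms injected on each call

added_sequential = sequential_latency_ms(injected_calls) - sequential_latency_ms(baseline_calls)
added_parallel = parallel_latency_ms(injected_calls) - parallel_latency_ms(baseline_calls)

print(added_sequential)  # 500
print(added_parallel)    # 100
```

Under these assumptions, fanning the five calls out concurrently caps the added latency at a single injection, which is one reason long synchronous chains are worth breaking up.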

Network partition

Drops all requests between two specific components. Simulates a broken switch, firewall rule change, VPC misconfiguration, or cross-availability-zone outage.

What you observe: One side of the partition continues operating. The other side queues requests that never complete. Timeouts trigger. Without circuit breakers, the calling component's connection pool exhausts.

What it teaches: Partition tolerance and the P in CAP theorem. Engineers learn to ask what each service does when it cannot reach a dependency, then design explicit fallback behavior at every boundary. → Read: CAP theorem explained

Packet loss

Drops a configurable percentage of requests — not all of them, unlike partition. Some requests complete, some do not. Intermittent and unpredictable.

What you observe: Error rate rises with loss percentage, but not uniformly. Retry logic may recover some operations at higher latency. The non-determinism is the point.

What it teaches: Idempotency requirements. With packet loss, you may not know whether a request completed before the response dropped. Engineers learn to make operations safe to retry.

Interview application: "A mobile client on a flaky connection is retrying your payment endpoint. Are you charging once or multiple times?"
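One common way to make a payment endpoint safe to retry is a client-supplied idempotency key. The sketch below is a toy, in-memory version: the PaymentService class and its fields are hypothetical, and a real implementation would need an atomic check-and-store in durable storage.

```python
class PaymentService:
    """Toy in-memory payment handler keyed by a client-supplied idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = []    # every charge actually executed

    def charge(self, idempotency_key, account, amount_cents):
        # A retry with a key we have already processed returns the stored
        # response instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"status": "charged", "account": account, "amount_cents": amount_cents}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("order-1234", "acct-1", 4999)
retry = service.charge("order-1234", "acct-1", 4999)  # response was lost, client retries

print(len(service.charges))  # 1
print(first == retry)        # True
```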

Bandwidth throttle

Limits the throughput capacity of a component's network interface. Requests are processed but at reduced rate, causing queue buildup.

What you observe: Downstream components see slower responses. Queue depth builds on the throttled component. If traffic arrives faster than processing capacity, queue growth ends in timeouts.

What it teaches: Backpressure design. Without client-side rate limiting, circuit breakers, or load shedding, a slow component drags down all dependents.
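Load shedding can be sketched with a bounded queue that rejects work when full rather than letting the backlog grow until requests time out. The SheddingQueue class and the depth of 3 are assumptions chosen to keep the example small.

```python
import queue


class SheddingQueue:
    """Bounded work queue that rejects new work when full instead of growing."""

    def __init__(self, max_depth):
        self._queue = queue.Queue(maxsize=max_depth)
        self.accepted = 0
        self.shed = 0

    def offer(self, request):
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            self.shed += 1  # fail fast; the caller can back off or degrade
            return False
        self.accepted += 1
        return True

    def take(self):
        return self._queue.get_nowait()


q = SheddingQueue(max_depth=3)
results = [q.offer(i) for i in range(5)]  # 5 arrivals, nothing drained yet

print(results)             # [True, True, True, False, False]
print(q.accepted, q.shed)  # 3 2
```

Rejecting early gives callers a fast signal they can act on, instead of a slow timeout.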


Infrastructure chaos — 4 scenarios

Node failure

Takes a component entirely offline. Simulates an EC2 instance crash, container OOM kill, Kubernetes pod eviction, or hardware failure.

What you observe: If the component is behind a load balancer, health checks eventually route around it. Traffic concentrates on surviving nodes and can overload them. Without redundancy, traffic fails.

What it teaches: Single point of failure identification. Running node failure on each component reveals SPOFs and capacity headroom required to survive node loss.
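The capacity headroom needed to survive a node loss follows from simple arithmetic, assuming load redistributes evenly across survivors. The function name and the example fleet sizes are illustrative.

```python
def utilization_after_loss(nodes, utilization, failed=1):
    """Per-node utilization once `failed` nodes drop out and load spreads evenly."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes: this component is a single point of failure")
    return utilization * nodes / survivors


print(round(utilization_after_loss(nodes=4, utilization=0.60), 2))  # 0.8
print(round(utilization_after_loss(nodes=3, utilization=0.70), 2))  # 1.05
```

In the second case the survivors would need 105% of their capacity, so a three-node fleet at 70% utilization cannot absorb one node failure.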

Disk full

Simulates a component running out of disk space. Database writes fail. App servers with local logging or write caching can fail writes silently.

What you observe: Write operations error on the affected component. Reads may continue. If write failures are not handled, errors propagate to user-facing layers.

What it teaches: Capacity planning and write failure handling. This forces explicit design for "what happens when we cannot write."

CPU spike

Maxes out a component's processing capacity. Simulates heavy full scans, large sorts, or expensive cryptographic work.

What you observe: Response time climbs sharply. Queue depth rises because requests arrive faster than they can be processed.

What it teaches: CPU-bound vs I/O-bound bottlenecks. Correct diagnosis drives the right fix: scale out CPU-bound services, optimize queries and caches for I/O-bound bottlenecks.

Memory pressure

Simulates a component approaching out-of-memory conditions. In GC runtimes, this increases GC frequency and pause duration.

What you observe: Latency rises erratically due to GC pauses. Eventually the component crashes and can trigger node-failure cascades.

What it teaches: Memory leak pattern recognition via sawtooth memory curves and rising peaks over time.


Traffic chaos — 4 scenarios

Request spike

Suddenly multiplies incoming request rate — 5x, 10x, or configurable. Simulates viral moments, flash sales, bot attacks, or synchronized jobs.

What you observe: The first saturated critical-path component becomes the bottleneck. If capacity stays flat while load grows, error share climbs.

What it teaches: Capacity headroom above peak-of-peak, not average traffic.

Payload bloat

Increases average request and response size. Simulates large file uploads, unexpectedly verbose JSON, or logging payload explosions.

What you observe: Network-bound components degrade first. Gateways may reject oversized payloads. Database write throughput drops as row size grows.

What it teaches: Payload size is a separate scaling dimension from request count.

Slow clients

Makes clients slow to receive responses, simulating mobile links, geographic distance, or slow reads.

What you observe: Connections stay open longer. Pool utilization rises even with idle CPU. Eventually connection pools exhaust.

What it teaches: Connection pools are consumed by both processing and transfer time, so payload and timeout design matter.

Thundering herd

Simulates many clients making the same request at once after silence — reconnect storms, scheduled bursts, or cache-expiry fanout.

What you observe: Homogeneous, coordinated load concentrates on one resource. Even at normal RPS averages, the instantaneous peak overwhelms capacity.

What it teaches: Jittered retries, request coalescing, and probabilistic early expiration. → Read: Distributed caching patterns
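Request coalescing means that when many callers ask for the same key at once, only one of them loads it and the rest wait for that result. The following is a minimal threading-based sketch; the SingleFlight class, the key name, and the simulated origin are all assumptions for illustration.

```python
import threading
import time


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared call record

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = call
        if leader:
            try:
                call["value"] = fn()
            except Exception as exc:
                call["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call["done"].set()  # wake every waiting follower
        else:
            call["done"].wait()
        if call["error"] is not None:
            raise call["error"]
        return call["value"]


loads = []


def load_from_origin():
    loads.append(1)
    time.sleep(0.5)  # simulate a slow origin lookup
    return "value"


flight = SingleFlight()
results = []
threads = [
    threading.Thread(target=lambda: results.append(flight.do("hot-key", load_from_origin)))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(loads), len(results))
```

When all 20 threads arrive while the first load is still in progress, the origin is hit once and every caller receives the same value.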


Data layer chaos — 4 scenarios

Database crash

Takes the primary database offline entirely. Writes stop; read replicas may continue serving reads.

What you observe: Write operations fail immediately; read/write split behavior becomes visible.

What it teaches: Write availability design and realistic failover RTO planning.

Replication lag

Introduces growing delay between primary writes and replica visibility.

What you observe: Replica reads return stale data; write-then-read flows fail.

What it teaches: Practical eventual consistency design and primary-read exceptions for freshness-critical flows. → Read: CAP theorem and consistency tradeoffs

Cache stampede

Forces cache hit rate near zero. Simulates cache restarts, mass invalidation, or eviction storms.

What you observe: Origin load explodes. At 10,000 RPS with a 98% hit rate, the database handles about 200 QPS; when the hit rate drops to near zero, it suddenly absorbs close to 10,000 QPS, a 50x jump. Pool exhaustion can happen in under a second at high RPS.

What it teaches: Cache resilience patterns: probabilistic early expiration, request coalescing, and write-through caching. → Read: Distributed caching patterns
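The origin load in a stampede is just the miss rate times the request rate. A short worked calculation, with origin_qps as a hypothetical helper name:

```python
def origin_qps(rps, hit_rate):
    """Requests per second that miss the cache and reach the origin."""
    return rps * (1.0 - hit_rate)


before = origin_qps(10_000, 0.98)  # healthy cache
after = origin_qps(10_000, 0.0)    # hit rate collapses to zero

print(round(before))          # 200
print(round(after))           # 10000
print(round(after - before))  # 9800
print(round(after / before))  # 50
```

The database goes from 200 QPS to 10,000 QPS, which is 9,800 QPS of extra load and a 50x multiplier on a tier sized for the cached case.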

Connection pool exhaustion

Gradually fills database pools until new connections cannot be established.

What you observe: Some requests succeed while others queue and then timeout, even if database CPU is low. The database appears healthy while requests fail.

What it teaches: Pool sizing and pool-utilization monitoring as leading signals — pool utilization at 80% is a warning; 95% is a near-incident.
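Pool utilization can be estimated with Little's law: average connections in use equals arrival rate times the time each request holds a connection. The sketch below applies the 80% and 95% thresholds from the text; the function names, the 1,000 RPS load, and the pool size of 20 are assumptions.

```python
def pool_utilization(rps, hold_time_s, pool_size):
    """Little's law: average connections in use = arrival rate x hold time."""
    return rps * hold_time_s / pool_size


def classify(utilization):
    if utilization >= 0.95:
        return "near-incident"
    if utilization >= 0.80:
        return "warning"
    return "ok"


for hold_ms in (5, 17, 20):
    u = pool_utilization(rps=1_000, hold_time_s=hold_ms / 1000, pool_size=20)
    print(hold_ms, round(u, 2), classify(u))
# 5 0.25 ok
# 17 0.85 warning
# 20 1.0 near-incident
```

Note that nothing in this calculation involves database CPU: a modest rise in hold time is enough to saturate the pool while the database itself looks idle.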


Application chaos — 4 scenarios

Memory leak

Simulates slow, continuous memory growth that is never released.

What you observe: Early impact is subtle; over time GC pressure rises, then memory pressure and crashes appear. The signature on a memory graph is a sawtooth whose peaks keep rising and whose troughs never return to the original floor.

What it teaches: Long-running reliability testing beyond short peak-load snapshots.

Thread pool exhaustion

Fills worker pools so new requests cannot be processed.

What you observe: Queue depth climbs rapidly and timeouts follow even if CPU and memory are fine. The component is healthy by most metrics but unavailable to new work.

What it teaches: Blocking I/O identification and async alternatives.

Deadlock

Simulates circular lock dependencies where operations wait indefinitely.

What you observe: Requests freeze, hold locks, and block others, expanding the stalled set over time.

What it teaches: Consistent lock ordering, lock timeouts, and short transaction windows.
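Consistent lock ordering removes the circular wait that deadlock requires. In this sketch, two threads transfer money in opposite directions, yet both always acquire the lock of the lower-numbered account first. The Account class and transfer function are hypothetical, and the code assumes the two accounts are distinct.

```python
import threading


class Account:
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    # Always lock the account with the smaller id first, regardless of
    # transfer direction, so opposite transfers cannot wait on each other.
    # Assumes src and dst are different accounts.
    first, second = sorted((src, dst), key=lambda acct: acct.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


a, b = Account(1, 1_000), Account(2, 1_000)


def run(src, dst):
    for _ in range(10_000):
        transfer(src, dst, 1)


t1 = threading.Thread(target=run, args=(a, b))
t2 = threading.Thread(target=run, args=(b, a))
t1.start()
t2.start()
t1.join()
t2.join()

print(a.balance, b.balance)  # 1000 1000
```

If each thread instead locked its own source account first, the two could each hold one lock and wait forever for the other.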

Cascading failure

Triggers sequential failures across dependency chains as one failure overloads the next.

What you observe: Small initial issues propagate through the graph; aggressive retries without circuit breakers accelerate the collapse and can take down services that were not initially affected.

What it teaches: Circuit breakers at critical boundaries to fail fast and prevent nonlinear blast-radius growth.
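A circuit breaker can be sketched as a small state machine: closed while calls succeed, open after repeated failures, and half-open after a cool-down when a single trial call is allowed. This is a minimal single-threaded illustration; the class names, thresholds, and injectable clock are assumptions, not a production design.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("dependency down")


for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass

try:
    breaker.call(failing)
except CircuitOpenError:
    print("rejected without calling the dependency")

now[0] = 11.0
print(breaker.call(lambda: "ok"))  # ok
```

While open, the breaker sheds calls immediately, so the failing dependency stops receiving retries and callers stop holding threads on it.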


Dependency chaos — 4 scenarios

Third-party timeout

A dependency (payments, email, SMS, identity) stops responding, so requests hang until they hit their timeout.

What you observe: Dependent operations hold threads for the full timeout duration; at scale, thread exhaustion follows within seconds.

What it teaches: Timeouts, async boundaries, graceful degradation, and circuit breakers for external dependencies.
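Graceful degradation often takes the shape "wait a bounded time, then fall back". Below is a small sketch using Python's concurrent.futures; the call_with_fallback helper, the simulated provider, and the timeout values are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_fallback(executor, fn, timeout_s, fallback):
    """Wait at most `timeout_s` for fn(); return `fallback` if it is not done."""
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller is released, but the worker thread stays busy until fn
        # returns, so the dependency call itself still needs its own timeout.
        return fallback


def hung_email_provider():
    time.sleep(0.5)  # stand-in for a dependency that is not responding
    return "sent"


with ThreadPoolExecutor(max_workers=2) as executor:
    result = call_with_fallback(
        executor, hung_email_provider, timeout_s=0.05, fallback="queued for later"
    )

print(result)  # queued for later
```

The comment in the except branch is the important caveat: bounding the caller's wait does not free the worker, which is why a timeout on the outbound call and a circuit breaker are still needed.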

Degraded response

Third-party dependency responds at 10x normal latency rather than timing out completely.

What you observe: Latency and pool utilization climb even though the dependency appears nominally healthy. This is harder to detect than a full timeout.

What it teaches: Read-timeout vs connect-timeout distinction and internal SLA ceilings for third-party calls.

Error response

Third-party dependency returns 5xx errors quickly instead of hanging.

What you observe: Fast failures with high error rates; naive retries without backoff can amplify load on an already-failing dependency.

What it teaches: Exponential backoff with jitter plus circuit breakers.
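Exponential backoff with jitter spreads retries out so that clients do not hammer a failing dependency in lockstep. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the function name and default values are assumptions.

```python
import random


def backoff_delays(attempts, base_s=0.1, cap_s=10.0, rng=random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**n)]."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


delays = backoff_delays(attempts=6, rng=random.Random(42))  # seeded for repeatability
for attempt, delay in enumerate(delays):
    print(f"retry {attempt}: sleep up to {0.1 * 2 ** attempt:.1f}s, chose {delay:.3f}s")
```

The cap keeps the worst-case wait bounded, and the randomness is what prevents many clients from retrying at the same instant.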

Rate limit hit

A third-party API enforces quota and returns 429s intermittently.

What you observe: Some operations fail; rate-unaware retries burn remaining quota faster, worsening the situation.

What it teaches: Client-side quota budgeting, adaptive throttling, and graceful degradation under quota exhaustion.
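Client-side quota budgeting is commonly implemented with a token bucket: each outbound call spends a token, and tokens refill at the rate the provider allows. A minimal sketch, in which the TokenBucket class, the rate of 2 per second, and the burst capacity of 5 are assumptions:

```python
import time


class TokenBucket:
    """Client-side limiter that keeps outbound calls under a known quota."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        # Refill for the time elapsed since the last call, up to capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate_per_s)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False  # over budget: skip, queue, or degrade instead of calling


now = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(rate_per_s=2, capacity=5, clock=lambda: now[0])

burst = [bucket.try_acquire() for _ in range(7)]
print(burst.count(True), burst.count(False))  # 5 2

now[0] = 1.0                 # one second later, two tokens have refilled
print(bucket.try_acquire())  # True
```

Checking the budget before calling means a retry never spends quota the client does not have.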


MCP / AI agent scenarios — unlocked by blueprint

When you load a blueprint built for MCP agents — architectures involving Agent Runtimes, MCP Tool Servers, Vector Stores, or Tool Registries — a seventh set of chaos scenarios unlocks automatically. These scenarios do not appear on classic microservice diagrams, keeping the catalog honest.

Tool server timeout — an MCP tool server stops responding mid-task. Does the agent retry, fall back, or fail usefully?

Token budget pressure — the agent approaches context-window limits mid-task. Does it summarize and continue or truncate?

Policy denial — a tool call is blocked by policy enforcement. How does failure propagate to user-facing behavior?

Vector index staleness — retrieval is grounded in stale embeddings and old data.

Tool registry unavailability — tool discovery fails at runtime. Can the agent degrade gracefully?

These scenarios matter as MCP agent architectures move toward production, where the reasoning loop itself is part of failure propagation. See the MCP agent architecture guide for the full topology.


Chaos engineering best practices

The teams at Netflix, Amazon, and Google that run chaos engineering programs follow these six practices. Each one maps to a way chaos programs fail when skipped.

1. Define steady state before any experiment

Every chaos experiment needs a before. Pick three metrics: p99 latency, error rate, and throughput. Record them under normal load before running any scenario. These numbers become your recovery criteria — the experiment ends when all three return to baseline.

Without a baseline, you cannot distinguish a system that recovered from one that was never healthy to begin with.

2. Start with the canonical failure modes

Cache stampede, node failure, and network partition are the right first experiments for any distributed system. They expose the three most common real production failure patterns. Every system should pass these three before running more exotic scenarios. If your system fails a cache stampede at normal RPS, that is the priority — not testing cascading database deadlocks.

3. Inject failures at the component level, not the system level

Kill one service at a time, not the whole cluster. This gives signal about individual dependencies. Node failure on your database tells you about your database failover. Node failure on your cache tells you about cache resilience. Running both simultaneously gives you a compounding failure — useful once you understand the individual failure modes, counterproductive as a starting point.

4. Observe the full dependency graph, not just the failed component

When you inject a cache stampede, the most important metric is not the cache's behavior — it's what happens to the database behind it, and then what happens to the services that depend on that database. Trace failure propagation through the full dependency graph. SysSimulator's bottleneck view shows which components are degraded and in what order they degraded.
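Tracing propagation amounts to walking the dependency graph in reverse, from the failed component to everything that calls it, directly or indirectly. The sketch below uses a made-up topology and a hypothetical affected_by helper; it gives an upper bound on who may be affected, not a prediction of who will actually degrade.

```python
from collections import deque


def affected_by(failed, depends_on):
    """Return services that depend on `failed`, in breadth-first order."""
    # Invert the graph: for each component, who calls it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    order, seen, frontier = [], {failed}, deque([failed])
    while frontier:
        current = frontier.popleft()
        for caller in dependents.get(current, []):
            if caller not in seen:
                seen.add(caller)
                order.append(caller)
                frontier.append(caller)
    return order


# Hypothetical topology: service -> components it calls.
depends_on = {
    "web": ["checkout", "catalog"],
    "checkout": ["orders-db", "cache"],
    "catalog": ["cache"],
    "cache": ["orders-db"],
}

print(affected_by("orders-db", depends_on))  # ['checkout', 'cache', 'web', 'catalog']
print(affected_by("catalog", depends_on))    # ['web']
```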

5. Run GameDays

A GameDay is a scheduled, team-wide chaos experiment where engineers, SREs, and sometimes product managers watch together and narrate what they see in real time. Netflix, Amazon, and Google run GameDays regularly. The format forces teams to reason about failure together before it happens in an on-call escalation.

SysSimulator's shareable diagrams let teams run a browser-based GameDay — load a blueprint, start traffic, share the link, and run scenarios together. Use this to build shared failure vocabulary before escalating to production GameDays.

6. Fix what you find — or the experiment was theater

Chaos engineering without action items is failure theater. Every experiment should produce a result: either the system behaved as expected (the experiment passes) or it didn't (the experiment produces a ticket, design change, or SLO revision). Accumulating experiments without fixing findings is the most common way chaos engineering programs lose credibility and funding.


Real-world chaos engineering examples

Netflix Chaos Monkey (2010) — where it started

Chaos Monkey was built by Netflix engineers around 2010, during Netflix's migration from its own data centers to AWS. The tool randomly terminated EC2 instances in production during business hours, forcing engineers to build services that automatically recovered from instance loss. Before Chaos Monkey, a random EC2 failure might cause a customer-visible outage. After a year of Chaos Monkey, instance failure became unremarkable.

The Simian Army that followed expanded the concept: Chaos Gorilla simulated the loss of an entire availability zone, Latency Monkey injected artificial latency, and Conformity Monkey shut down non-compliant instances. Part of the suite was open-sourced at github.com/Netflix/SimianArmy; that repository is no longer maintained, and a newer Chaos Monkey is available as a standalone project.

Amazon GameDays — team-wide failure rehearsal

Amazon runs GameDays as structured exercises where engineers simulate failures and observe system behavior together as a team. The exercises reveal failure modes that individual service tests miss — particularly failures at the boundaries between services, where two services each behave correctly in isolation but produce incorrect behavior when one degrades the other. Amazon's GameDay program contributed directly to the resilience principles behind AWS's multi-AZ design.

Google DiRT — disaster recovery at scale

Google's Disaster Recovery Testing (DiRT) program runs annual exercises where entire systems are deliberately taken offline and teams must recover from scratch. DiRT exercises have revealed recovery procedures that were documented but didn't work, backup systems that hadn't been exercised in years, and dependencies between systems that weren't in any architecture diagram. The program is described in the Google SRE Book.

The Cloudflare WAF outage (2019) — a chaos experiment that wasn't run

On July 2, 2019, Cloudflare deployed a WAF rule containing a catastrophically backtracking regex that consumed 100% CPU across their entire network for 27 minutes. The root cause was not caught in testing because the regex performed correctly on all test inputs — it catastrophically backtracked only on a specific class of HTTP request that appeared in production traffic. A chaos engineering approach that fuzz-tested WAF rule performance with production-like payloads would have caught this before deployment.

The Slack reconnection storm (2022) — thundering herd in production

On May 12, 2022, a routine maintenance operation caused all Slack clients to disconnect simultaneously. When the maintenance completed, millions of clients reconnected at the same moment, overwhelming Slack's connection handling infrastructure. The thundering herd scenario — massive coordinated reconnection after a service restart — is one of SysSimulator's 28 chaos scenarios. Running it on the messaging system blueprint before a maintenance window would have revealed the connection handling bottleneck.


Chaos engineering for system design interviews

In 2025 and 2026, FAANG-level system design interviews increasingly include "what happens when X fails?" as a mandatory component. An interviewer who asks you to design Twitter's notification system will follow up with: "Your push notification dependency is down. Walk me through impact on user experience, metrics, and your recovery path."

The engineers who answer this well don't just say "we'd use a circuit breaker." They say: "At 50,000 RPS with 2% of requests touching the push notification path, that path carries 1,000 RPS. If a hung push dependency takes 30 seconds to time out, about 30,000 requests are holding threads at any moment. Thread pool exhaustion follows in well under a minute. We'd need a client-side fallback to in-app notification with async push retry, and the call timeout feeding the circuit breaker should be 500ms, not the default 30 seconds."

That answer comes from running the scenario and watching real numbers, not from reading about circuit breakers in a blog post.
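The numbers in an answer like this come from two short calculations: how many requests are in flight during the timeout window, and how long a worker pool lasts at that arrival rate. The helper names and the pool size of 2,000 threads are assumptions for illustration.

```python
def threads_held(total_rps, path_fraction, timeout_s):
    """Requests in flight on the affected path while each waits for the timeout."""
    return total_rps * path_fraction * timeout_s


def seconds_to_exhaust(pool_size, total_rps, path_fraction):
    """Time until a worker pool fills if every affected request hangs."""
    return pool_size / (total_rps * path_fraction)


print(round(threads_held(50_000, 0.02, 30)))    # 30000
print(round(threads_held(50_000, 0.02, 0.5)))   # 500
print(seconds_to_exhaust(2_000, 50_000, 0.02))  # 2.0
```

At 50,000 RPS with 2% on the push path, a 30-second timeout leaves 30,000 requests waiting, while a 500ms timeout leaves 500. With an assumed pool of 2,000 threads, a fully hung dependency fills the pool in about two seconds.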

SysSimulator includes chaos scenarios mapped to the most common interview system topics.

Run the scenario. Watch the metrics. Narrate what you see with specific numbers. See the full interview prep guide and the chaos engineering principles deep-dive for the complete preparation framework.


SysSimulator vs production chaos tools

SysSimulator is an educational simulator, not a production chaos engineering platform.

 | SysSimulator | Gremlin | LitmusChaos
What it breaks | A simulation model | Your actual running services | Your actual running services
Risk | Zero | Real user impact if uncontrolled | Real user impact if uncontrolled
Setup | Browser, zero config | Kubernetes agent | Kubernetes cluster
Cost | Free | Paid | Free (self-hosted)
Best for | Learning, interview prep, architecture review, GameDay planning | Production resilience validation | Cloud-native resilience testing

Use SysSimulator to build resilience mental models, run browser-based GameDays, and prepare for system design interviews. Use Gremlin or LitmusChaos to validate actual production behavior. Simulate first, validate in production second.

For the complete scenario reference see all 28 chaos engineering scenarios. For the principles deep-dive see chaos engineering principles and practices.