Chaos engineering principles and practices: a complete guide

Most teams discover their system's failure modes in production — during an incident, at 3am, with users affected. Chaos engineering is the discipline that reverses this: you discover the failure modes first, deliberately, in a controlled experiment, when you have time to fix them.

This guide covers the five formal principles of chaos engineering, the six core practices used by Netflix, Amazon, and Google, real production incidents that chaos experiments would have caught, and how to apply all of this in system design interviews. Where relevant, SysSimulator examples show how to practice each concept in your browser with zero infrastructure.


What is chaos engineering?

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. The formal definition comes from the Principles of Chaos Engineering document, originally authored by engineers at Netflix.

It is not the same as load testing, which finds performance limits by increasing volume. It is not the same as fault injection, which tests that specific known failures produce specific expected responses. Chaos engineering is exploratory — it tests how a system behaves under realistic disturbances that may interact in ways no one predicted.

The origin: in 2010, Netflix was migrating from its own data centers to AWS. Netflix engineers built Chaos Monkey, a service that randomly terminated EC2 instances in production during business hours. The goal was to force Netflix engineers to build services that recovered automatically. If a random instance kill caused an outage, the outage happened during the day when engineers were awake and could fix it, not at 3am with nobody watching. After a year of Chaos Monkey, instance failure became unremarkable at Netflix.


The 5 principles of chaos engineering

These principles distinguish chaos engineering from ad-hoc failure injection. Each one represents a way that chaos programs fail when the principle is skipped.

Principle 1
Build a hypothesis around steady-state behaviour

A chaos experiment is a scientific experiment. It needs a hypothesis and a measurable control state. Before injecting any failure, establish what normal looks like with specific numbers:

p99 latency: 45ms
error rate: 0.01%
throughput: 12,400 RPS

The hypothesis is then: "When we inject a cache stampede, steady state will be maintained." The experiment either confirms the hypothesis (the system recovers, metrics return to baseline) or refutes it (metrics deviate, revealing a real weakness).
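As a minimal sketch of turning this hypothesis into a check, the baseline numbers above can be compared against what is observed after the experiment. The `SteadyState` class, the `deviations` helper, the sample observations, and the 10% tolerance are illustrative assumptions, not part of any tool named in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    p99_latency_ms: float
    error_rate_pct: float
    throughput_rps: float


def deviations(baseline, observed, tolerance=0.10):
    """Return {metric: relative deviation} for metrics outside the tolerance band."""
    out = {}
    for name in ("p99_latency_ms", "error_rate_pct", "throughput_rps"):
        base = getattr(baseline, name)
        seen = getattr(observed, name)
        relative = abs(seen - base) / base
        if relative > tolerance:
            out[name] = relative
    return out


# Baseline taken from the numbers above; the two observations are made up.
BASELINE = SteadyState(p99_latency_ms=45.0, error_rate_pct=0.01, throughput_rps=12_400.0)
recovered = SteadyState(p99_latency_ms=47.0, error_rate_pct=0.0105, throughput_rps=12_150.0)
degraded = SteadyState(p99_latency_ms=310.0, error_rate_pct=2.4, throughput_rps=9_800.0)

print(deviations(BASELINE, recovered))  # empty dict: hypothesis confirmed
print(deviations(BASELINE, degraded))   # non-empty: hypothesis refuted
```

An empty result means every metric returned to the band around baseline; anything else names the metric that refuted the hypothesis.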

Violated when: Teams inject failures without recording baselines and cannot tell whether the system recovered because they have no definition of recovery. The experiment produces no actionable signal.

In SysSimulator: The live metrics bar shows your baseline RPS, p99, error share, and bottleneck count before any chaos injection. Your steady state is visible before you touch a single scenario.

Principle 2
Vary real-world events

Chaos experiments should simulate the events that actually happen in production. The 28 scenarios in SysSimulator each correspond to a documented real production failure category:

  • Cache stampede — mass cache invalidation, cache restart, eviction storm
  • Network partition — broken switch, firewall rule change, AZ isolation
  • Third-party timeout — payment provider slow, SMS gateway unresponsive
  • Thundering herd — reconnection storm after server restart, coordinated scheduled jobs
  • Connection pool exhaustion — slow queries hold connections past their useful lifetime

Violated when: Teams inject random chaos with no mapping to known production failure modes. They test everything and learn nothing specific.

Principle 3
Run experiments in production (or as close to it as possible)

Staging environments miss real failure modes because they don't carry production traffic patterns, data distributions, or configuration. A staging cache hit rate may be 80%; production is 98%. Those 18 percentage points represent a completely different failure mode during a cache stampede: when the cache is lost, the staging database absorbs 5x its normal load (100% of traffic instead of 20%), while the production database absorbs 50x (100% instead of 2%).
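The arithmetic behind that comparison can be sketched in a few lines. It assumes the database normally serves only cache misses and that the cache is lost entirely; the function name `db_load_multiplier` is hypothetical.

```python
def db_load_multiplier(hit_rate: float) -> float:
    """Factor by which database load grows when a cache with this hit rate is lost.

    Normally the database sees the miss fraction (1 - hit_rate) of traffic.
    With the cache gone it sees all of it, so the multiplier is 1 / (1 - hit_rate).
    """
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


for label, hit_rate in (("staging", 0.80), ("production", 0.98)):
    print(label, round(db_load_multiplier(hit_rate), 1))
```

The multiplier grows sharply as the hit rate approaches 100%, which is why a system with a very effective cache is also the one most exposed when that cache disappears.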

The maturity ladder: SysSimulator (learning layer) → staging with production traffic replay → canary deployment with chaos → production with blast radius controls. Start at the learning layer, graduate to production as confidence grows.

Violated when: Teams only run chaos in staging, then encounter failure modes in production that staging never exercised.

Principle 4
Automate experiments to run continuously

Running a chaos experiment once and fixing what it finds does not verify that the fix holds through subsequent deployments. A service that passed a connection pool exhaustion experiment today may fail it after next week's dependency update changes default timeout behavior.

Mature chaos programs integrate experiments into CI/CD. Every deployment runs the canonical failure scenarios. The build fails if steady-state recovery time degrades or error rate increases under the same chaos scenario. Tools like LitmusChaos and AWS Fault Injection Service (formerly AWS Fault Injection Simulator) are designed for this integration pattern.
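One way such a build gate could look is sketched below. It is tool-agnostic and does not use the LitmusChaos or AWS APIs; the `ChaosResult` record, the `regressions` and `gate` functions, the 10% slack, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChaosResult:
    scenario: str
    recovery_seconds: float
    peak_error_rate_pct: float


def regressions(previous: ChaosResult, current: ChaosResult, slack: float = 0.10) -> List[str]:
    """List the ways the current build is worse than the previous one, beyond the slack."""
    problems = []
    if current.recovery_seconds > previous.recovery_seconds * (1 + slack):
        problems.append(
            f"{current.scenario}: recovery {previous.recovery_seconds}s -> {current.recovery_seconds}s"
        )
    if current.peak_error_rate_pct > previous.peak_error_rate_pct * (1 + slack):
        problems.append(
            f"{current.scenario}: peak errors {previous.peak_error_rate_pct}% -> {current.peak_error_rate_pct}%"
        )
    return problems


def gate(previous: ChaosResult, current: ChaosResult) -> int:
    """Return a process exit code: 0 lets the build pass, 1 fails it."""
    problems = regressions(previous, current)
    for problem in problems:
        print("CHAOS REGRESSION:", problem)
    return 1 if problems else 0


# Made-up results for the same scenario on two builds.
last_release = ChaosResult("connection-pool-exhaustion", recovery_seconds=20.0, peak_error_rate_pct=1.0)
this_build = ChaosResult("connection-pool-exhaustion", recovery_seconds=45.0, peak_error_rate_pct=1.05)
print("exit code:", gate(last_release, this_build))
```

In a real pipeline the exit code would be passed to `sys.exit` so the CI system marks the build as failed; the slack keeps ordinary run-to-run noise from blocking deployments.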

Violated when: Chaos engineering is a quarterly audit or a one-off exercise rather than a continuous feedback loop. Regressions introduced by deployments go undetected.

Principle 5
Minimise blast radius

Run the smallest experiment that tests your hypothesis. Start with a single replica, in a single region, for the shortest duration that produces a clear signal. Have rollback ready before the experiment starts. The goal is learning, not causing incidents.

Blast radius control in practice: kill one instance of a service before killing all instances. Inject 50ms latency before injecting 5-second timeouts. Run for 60 seconds before running for 10 minutes. Each step up in blast radius should be a deliberate decision, not a default.
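The escalation described above can be written down as an explicit ladder that stops at the first stage where steady state breaks. This is a sketch only: the `Stage` fields, the `LADDER` values beyond those named in the text, and the `fake_run` stand-in for a real experiment runner are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Stage:
    instances: int
    added_latency_ms: int
    duration_s: int


# Smallest blast radius first; each later stage widens exactly one dimension.
LADDER = [
    Stage(instances=1, added_latency_ms=50, duration_s=60),
    Stage(instances=1, added_latency_ms=5000, duration_s=60),
    Stage(instances=1, added_latency_ms=5000, duration_s=600),
    Stage(instances=3, added_latency_ms=5000, duration_s=600),
]


def escalate(stages: Iterable[Stage], steady_state_held: Callable[[Stage], bool]) -> Optional[Stage]:
    """Run stages in order and return the first one that broke steady state.

    Later, wider stages are never run once a stage fails. Returns None if
    every stage held.
    """
    for stage in stages:
        if not steady_state_held(stage):
            return stage
    return None


def fake_run(stage: Stage) -> bool:
    """Stand-in for a real experiment: pretend the system only survives short runs."""
    return stage.duration_s < 600


print(escalate(LADDER, fake_run))
```

Because the loop returns on the first failure, the three-instance stage is never attempted in this example, which is the point of the ladder.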

Violated when: Teams run wide-blast experiments without controls, trigger real user-facing incidents, and create the exact production failures chaos engineering was supposed to prevent.


6 core chaos engineering practices

Principles describe what chaos engineering should be. Practices describe how to actually run experiments effectively.

1. Define steady state first — always

Before any experiment, run your system at target RPS for five minutes with no chaos injected. Record p99 latency, error rate, throughput, and the count of bottlenecked components. These four numbers are your recovery criteria.

The experiment is over when all four numbers return to within 10% of their baseline values. If they don't return within a defined window — say, 60 seconds after stopping chaos injection — that recovery time is itself a finding worth investigating.

Without a baseline, "the system recovered" is an opinion, not a measurement.
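To make "recovered" a measurement, recovery time can be computed from the metric samples themselves. The sketch below assumes one metric sampled as `(seconds, value)` pairs; the function name `recovery_seconds` and the sample data are hypothetical.

```python
def recovery_seconds(samples, baseline, stop_time, tolerance=0.10):
    """Seconds after chaos stops until the metric is back within tolerance and stays there.

    samples: list of (t_seconds, value), sorted by time.
    Returns None if the metric is still outside the band at the last sample.
    """
    recovered_at = None
    # Walk backwards from the end: the recovery point is the start of the
    # final unbroken run of in-band samples.
    for t, value in reversed(samples):
        if t < stop_time:
            break
        if abs(value - baseline) <= tolerance * baseline:
            recovered_at = t
        else:
            break
    return None if recovered_at is None else recovered_at - stop_time


# Made-up p99 latency samples (ms); chaos injection stops at t=20.
p99 = [(0, 45), (10, 300), (20, 280), (30, 120), (40, 60), (50, 48), (60, 46)]
print(recovery_seconds(p99, baseline=45, stop_time=20))
```

The same function would be applied to each of the four baseline numbers, and the slowest metric determines when the experiment is over. A `None` result within the chosen window is the finding worth investigating.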

2. Start with the canonical failure modes

Three scenarios cover the vast majority of real production failure patterns and should be the first experiments on any system:

If your system fails any of these at normal RPS, that failure is your top priority — not exploring more exotic scenarios. Run the cache stampede first. It is the single highest-signal experiment for most web systems because cache is the primary scaling mechanism and its failure mode is non-obvious.

3. Inject failures at the component level, not the system level

Kill one service at a time. This isolates the signal: node failure on your database tells you about your database failover specifically. Node failure on your cache tells you about cache resilience specifically. Running both simultaneously tells you about compounding failures — useful, but only after you understand the individual failure modes.

The component-level discipline also enforces blast radius control. A team that always starts with single-component failures builds the habit of controlled experimentation. A team that runs multi-component failures from the start tends to accidentally cause real incidents.

4. Observe the full dependency graph, not just the failed component

When you inject a cache stampede, the cache's behavior is almost irrelevant — you know it's failing, you injected the failure. What matters is the propagation: the database behind it, the services that depend on the database, the user-facing APIs that depend on those services.

A cache stampede that saturates the database, which exhausts the connection pool, which causes the auth service to start rejecting requests, which causes user sessions to expire — that's the full blast radius, and it's visible in SysSimulator's bottleneck chain.

Trace the dependency graph downstream from the injected failure. The furthest downstream effect is usually the most important finding.
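Tracing downstream is a breadth-first walk over the propagation graph. The sketch below encodes the chain from the example above; the `PROPAGATES_TO` mapping, the component names, and the `downstream_impact` function are illustrative assumptions.

```python
from collections import deque

# Edges point in the direction failure propagates.
PROPAGATES_TO = {
    "cache": ["database"],
    "database": ["connection-pool"],
    "connection-pool": ["auth-service"],
    "auth-service": ["user-sessions"],
}


def downstream_impact(propagates_to, failed):
    """Return {component: hops from the injected failure} for everything downstream."""
    hops = {failed: 0}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for nxt in propagates_to.get(node, ()):
            if nxt not in hops:  # visit each component once, even in cyclic graphs
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    del hops[failed]  # the injected component itself is not a finding
    return hops


impact = downstream_impact(PROPAGATES_TO, "cache")
furthest = max(impact, key=impact.get)
print(impact)
print("furthest downstream effect:", furthest)
```

The component with the highest hop count is the one to check first on the dashboards, since it is the effect least likely to be attributed to the original failure.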

5. Run GameDays — team-wide failure rehearsals

A GameDay is a scheduled exercise where the whole team watches a chaos experiment run together. Engineers, SREs, and product managers are in the room. Someone narrates what they see on the metrics as the failure propagates. Someone else tracks what questions the metrics raise that can't be answered from current observability. A third person records the action items.

Netflix runs GameDays. Amazon runs GameDays. Google's DiRT (Disaster Recovery Testing) program is a formalized annual GameDay at org scale. The value isn't just the chaos experiment itself; it's building shared failure vocabulary before incidents happen.

The practical starting point: share a SysSimulator diagram link with your team. Load a blueprint. Start traffic. Pass the chaos scenario control to different team members and have each person narrate what they see. This builds the pattern of narrating failure with specific metrics — the same skill that matters in on-call escalations and system design interviews.

6. Fix what you find — or the experiment was theater

Every chaos experiment should produce one of two outcomes: the hypothesis is confirmed (the system behaves as expected under this failure) or a finding is documented and becomes an action item. There is no third outcome.

The most common failure mode of chaos engineering programs: experiments reveal weaknesses, those weaknesses are acknowledged as "known," and nothing changes. Three months later the same experiment reveals the same weakness. The team has spent time running experiments and learned nothing new, because the findings never drove design changes.

Track findings from every experiment. Assign them. Close them. Re-run the experiment after the fix. Only then move to the next scenario.


Real-world chaos engineering examples

Netflix — Chaos Monkey (2010)

Where it started

Chaos Monkey was built during Netflix's migration from their own data centers to AWS. The problem: Netflix was becoming dependent on AWS in a way that assumed AWS instances were reliable. They weren't — EC2 instances failed regularly. Chaos Monkey made that failure routine, during business hours, until Netflix's engineering culture expected and handled it.

The Simian Army that followed: Chaos Gorilla (killed entire availability zones), Latency Monkey (injected artificial latency into service calls), Conformity Monkey (shut down instances not meeting configuration standards), Security Monkey (found misconfigured security groups). Parts of the suite were open-sourced at github.com/Netflix/SimianArmy; that repository is no longer actively maintained.

What it proved: Chaos engineering, applied continuously, changes engineering culture. Teams that know their services will randomly be killed build services that recover automatically. Teams that don't run chaos assume their services will stay up — and are surprised when they don't.

Amazon — GameDays

Team-wide failure rehearsal

Amazon's GameDay program runs structured exercises where engineers simulate specific failure scenarios and observe the full system response together. The exercises revealed failure modes that individual service tests consistently missed: failures at service boundaries, where two services each behaved correctly in isolation but produced incorrect behavior when one degraded the other.

The critical insight from Amazon's GameDays: the most dangerous failure modes are not service failures — they're interaction failures. Service A fails in a way that causes Service B to degrade, which causes Service C to retry aggressively, which amplifies load on Service A. No individual service test catches this because it requires the full interaction graph.
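The amplification in that loop compounds quickly. As a rough worst-case sketch, assume every layer in the call chain retries a fixed number of times and every attempt fails; the function name and the retry counts are hypothetical and not taken from Amazon's exercises.

```python
def worst_case_amplification(retries_per_layer: int, layers: int) -> int:
    """Attempts reaching the bottom service per original request.

    Each layer makes 1 original attempt plus `retries_per_layer` retries,
    and each of those attempts triggers the full retry behavior of the
    layer below it, so the factors multiply.
    """
    return (1 + retries_per_layer) ** layers


for layers in (1, 2, 3):
    print(layers, "layers ->", worst_case_amplification(3, layers), "attempts")
```

With 3 retries at each of 3 layers, one user request can become 64 attempts against the failing service, which no single-service test would reveal.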

Google — DiRT (Disaster Recovery Testing)

Annual disaster recovery at org scale

Google's Disaster Recovery Testing program runs annual exercises where entire systems are deliberately taken offline. Teams must recover from scratch, following documented recovery procedures, with the clock running. DiRT exercises have revealed recovery procedures that were documented but didn't work when executed under pressure, backup systems that hadn't been tested in years, and cross-system dependencies not captured in any architecture diagram.

The lesson from DiRT: documentation of recovery procedures is not the same as tested recovery procedures. Recovery time objectives exist on paper; DiRT reveals the actual RTO.

Cloudflare — WAF CPU Exhaustion (July 2, 2019)

The chaos experiment that wasn't run

On July 2, 2019, Cloudflare deployed a WAF rule update that contained a regex with catastrophic backtracking behavior. The regex performed correctly on all test inputs but exhausted CPU on a specific class of HTTP request that appeared in production traffic. The result: 100% CPU utilization across Cloudflare's entire network for 27 minutes, with HTTP traffic dropping by up to 82% globally.

The root cause was not a missing unit test — the unit tests passed. It was missing fuzz testing with production-like inputs. A chaos engineering approach that subjected new WAF rules to production-representative request patterns before deployment would have triggered the backtracking behavior in a controlled environment.

The chaos experiment that would have caught it: Run new WAF rules against a production traffic replay with CPU utilization monitoring. Reject the deployment if CPU utilization exceeds baseline by more than 20%.
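A toy version of that gate is sketched below using Python's `re` module. The two patterns are textbook examples, not Cloudflare's actual rule; the replay list stands in for captured production requests; and reading "exceeds baseline by more than 20%" as a relative increase is an assumption.

```python
import re
import time


def cpu_seconds(pattern, requests):
    """CPU time spent evaluating one rule against every replayed request."""
    compiled = re.compile(pattern)
    start = time.process_time()
    for body in requests:
        compiled.search(body)
    return time.process_time() - start


def should_reject(baseline_cpu, candidate_cpu, max_increase=0.20):
    """Reject when the candidate rule costs more than baseline plus the allowed increase."""
    return candidate_cpu > baseline_cpu * (1 + max_increase)


# Hypothetical rules: the nested quantifier in RISKY_RULE backtracks
# exponentially on a run of "a" followed by a non-matching character.
SAFE_RULE = r"a+$"
RISKY_RULE = r"(a+)+$"

# Stand-in for a production traffic replay; a few requests hit the bad case.
REPLAY = ["GET /index.html"] * 100 + ["a" * 18 + "!"] * 3

baseline = cpu_seconds(SAFE_RULE, REPLAY)
candidate = cpu_seconds(RISKY_RULE, REPLAY)
print("reject deployment:", should_reject(baseline, candidate))
```

Both rules give the same match results on this replay, so a correctness test passes either way; only the CPU measurement separates them.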

Slack — Reconnection Storm (May 12, 2022)

Thundering herd at scale

On May 12, 2022, a routine maintenance operation caused all Slack clients to disconnect simultaneously. When the maintenance completed, millions of clients reconnected at the same moment, producing a massive coordinated thundering herd on Slack's connection handling infrastructure. The infrastructure could not absorb the instantaneous connection spike and degraded significantly.

The failure mode — coordinated reconnection after service restart — is one of SysSimulator's 28 chaos scenarios. The fix requires jitter in client reconnect logic: instead of all clients retrying immediately after detecting disconnection, each client waits a random interval drawn from a distribution wide enough to spread load over several minutes.
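The effect of jitter can be estimated before touching any client code. The sketch below assumes each client picks a uniformly random delay inside a fixed window; the client count, the 180-second window, and the function names are illustrative and are not Slack's actual values.

```python
import random


def jittered_delay(window_s: float, rng: random.Random) -> float:
    """Reconnect delay drawn uniformly from [0, window_s)."""
    return rng.random() * window_s


def peak_per_second(n_clients: int, window_s: int, rng: random.Random) -> int:
    """Largest number of reconnects landing in any single second of the window."""
    buckets = [0] * window_s
    for _ in range(n_clients):
        second = min(int(jittered_delay(window_s, rng)), window_s - 1)
        buckets[second] += 1
    return max(buckets)


rng = random.Random(7)  # seeded so the sketch is repeatable
clients = 100_000
print("no jitter, peak/s:   ", peak_per_second(clients, 1, rng))
print("180 s jitter, peak/s:", peak_per_second(clients, 180, rng))
```

A one-second window models every client retrying immediately, so the whole population lands in the same second. Widening the window divides the peak roughly by the window length, which gives a first estimate of how much jitter the connection layer needs.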

The chaos experiment that would have caught it: Thundering herd scenario targeting the connection handling layer, run before any maintenance operation that causes coordinated disconnection. The experiment would have shown connection pool saturation and quantified the jitter window needed to prevent it.


Chaos engineering for system design interviews

In 2025 and 2026, FAANG-level system design interviews increasingly treat failure mode analysis as a required component, not a bonus section. After you draw the architecture, the interviewer will ask about failure: "Your cache goes down. Walk me through what happens." Or: "Your payment provider starts timing out. How does that propagate?"

The engineers who answer these questions well don't just name mitigations. They narrate failure propagation with specific numbers.

What a weak answer looks like

"If the cache goes down, we'd have a cache stampede. We'd handle that with a circuit breaker and write-through caching."

This answer demonstrates that the candidate has read about cache stampedes. It doesn't demonstrate that they understand the failure propagation mechanics or the specific parameters that determine whether a mitigation works.

What a strong answer looks like

"At 50,000 RPS with a 98% cache hit rate, the database is handling about 1,000 QPS — the 2% of requests that miss cache. If the cache goes down entirely, the database suddenly receives 50,000 QPS — a 50x load spike. A typical database connection pool of 100 connections, each handling 500 QPS at peak, can process about 50,000 QPS. We're right at the edge of pool saturation. With any query slowdown — which is guaranteed at 50x normal load — pool exhaustion follows within seconds. The database starts rejecting connections. Services that depend on the database start failing. We have a cascading failure from a cache outage within 30 seconds of the cache going down."

"The mitigations, in order of implementation priority: request coalescing so that 1,000 concurrent cache misses for the same key produce one database query, not 1,000. Probabilistic early expiration to prevent synchronized expiry. A secondary read-through cache with longer TTL as fallback. Circuit breakers on database connections with a short fail-fast threshold to prevent pool saturation from cascading."

The difference is not knowledge of mitigations: both answers name mitigations. The difference is that the second answer demonstrates understanding of the actual failure mechanics: the specific load multiplier, the pool saturation threshold, the timeline of cascading failure. That understanding comes from running the scenario against a simulation, watching p99 spike and error rate climb, and developing intuition for the numbers.
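The numbers in the strong answer can be reproduced with two small formulas. The sketch assumes each connection serves one query at a time, so 500 QPS per connection corresponds to a 2 ms query; the function names and the 4 ms slowdown are hypothetical.

```python
def db_qps(total_rps: float, hit_rate: float) -> float:
    """Queries per second that miss the cache and reach the database."""
    return total_rps * (1.0 - hit_rate)


def pool_capacity_qps(connections: int, query_latency_ms: float) -> float:
    """Throughput of a pool where each connection runs one query at a time."""
    return connections * (1000.0 / query_latency_ms)


normal_load = db_qps(50_000, 0.98)       # cache healthy
outage_load = db_qps(50_000, 0.0)        # cache gone, every request misses
healthy_pool = pool_capacity_qps(100, 2.0)
slowed_pool = pool_capacity_qps(100, 4.0)  # queries twice as slow under load

print("load multiplier:", round(outage_load / normal_load))
print("healthy pool capacity:", round(healthy_pool), "QPS")
print("slowed pool capacity: ", round(slowed_pool), "QPS")
print("pool saturated:", outage_load > slowed_pool)
```

At 2 ms per query the pool exactly matches the outage load; once latency doubles, capacity halves to 25,000 QPS against 50,000 QPS of demand, which is the saturation the answer describes.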

How SysSimulator builds this skill

Load the Twitter blueprint in SysSimulator. Set traffic to 50,000 RPS. Let it stabilize. Note the database QPS and connection pool utilization. Then inject a cache stampede. Watch what happens to those numbers over the next 30 seconds. Watch the bottleneck list update. Watch p99 climb. Watch error rate follow.

Now you have numbers. Now you can narrate the failure because you've seen it happen, not just read about it. That narration, grounded in specific metrics, is what sets strong system design candidates apart.

See the complete chaos scenarios reference for all 28 scenarios and the specific metrics each one reveals. See the interview prep guide for the full system design interview preparation framework.


Start your first chaos experiment

The single highest-signal first experiment for most web systems is the cache stampede scenario. It exposes whether your origin can absorb cache-miss load, which connection pools saturate first, and how long recovery takes without intervention.

Run it in SysSimulator in under two minutes, with no infrastructure required:

  1. Open SysSimulator — no signup, no install
  2. Load any blueprint with a cache layer (Twitter, Uber, payment system, or notification pipeline all work)
  3. Start traffic at 10,000–50,000 RPS and let it stabilize for 30 seconds
  4. Note your baseline p99 latency and database QPS
  5. Open the chaos panel and inject a cache stampede on the cache component
  6. Watch the database QPS spike, p99 climb, and error rate follow
  7. Note how long recovery takes after stopping the scenario

That's your first chaos experiment. The number you observed — how long recovery takes — is the first question to answer: is that acceptable for your SLO? If not, that's your action item.