Chaos engineering is the practice of deliberately breaking your system to find weaknesses before your users do. What started as a Netflix experiment in 2010 is now standard practice at many large technology companies. This guide explains what it is, why it works, how to design experiments correctly, and how to practise safely in SysSimulator before touching production.
Every distributed system will fail. Servers crash, network cables are cut, hard drives fill up, dependencies time out, certificates expire, and cloud providers experience outages. The question is not whether your system will experience failures — it will — but whether those failures will cascade into user-visible outages or be absorbed gracefully.
Traditional approaches to resilience are reactive: you wait for an outage, diagnose what broke, fix the immediate problem, and write a postmortem with action items. This approach has a critical flaw: you only discover failure modes after they have already affected users. The blast radius of the failure is determined by how fast you can respond, not how well you designed the system.
Chaos engineering is a proactive approach. Instead of waiting for failures to happen randomly, you deliberately cause them in a controlled way. You define a hypothesis ("our system will maintain p99 latency below 200ms if one of three database replicas fails"), run the experiment (take down one replica), and observe whether the hypothesis holds. If it does, you have gained confidence in your system's resilience. If it does not, you have discovered a weakness in a controlled environment — not during a production incident at 3 AM.
The Netflix origin story is instructive. In 2010, Netflix moved its streaming infrastructure to AWS. AWS, unlike a traditional data centre, could terminate virtual machine instances without warning. Netflix engineers realised that if any individual instance could be terminated at any time, the only resilient architecture was one designed to handle instance loss as a normal operating condition, not an exceptional one. Chaos Monkey — a tool that randomly terminates EC2 instances during business hours — was their solution. By forcing instance loss to happen constantly during office hours, engineers were forced to design services that survived it. The result: Netflix significantly reduced the number of outages caused by instance failure.
Start with a steady state hypothesis. Before running any experiment, define what "normal" looks like in measurable terms. Not "the system works" — that is not measurable. Steady state should be expressed as specific metrics: "error rate below 0.1%, p99 latency below 150ms, successful checkout rate above 99.7%." You will measure the system against this baseline before the experiment (to confirm you are starting from normal), during the experiment (to observe the impact), and after the experiment (to confirm recovery). Without a defined steady state, chaos engineering is just random destruction with no way to interpret the results.
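To make the steady state concrete, it helps to write the thresholds down as data and check them mechanically before, during, and after the experiment. The following is a minimal sketch, assuming you have some way to query current metric values; the metric names and the `read_metric` placeholder are illustrative, not tied to any particular monitoring stack.

```python
# Minimal sketch: the steady state as named thresholds, checked mechanically.
# Metric names and the read_metric() placeholder are illustrative assumptions.

STEADY_STATE = {
    "error_rate":            {"max": 0.001},  # below 0.1%
    "p99_latency_ms":        {"max": 150},    # below 150 ms
    "checkout_success_rate": {"min": 0.997},  # above 99.7%
}

def read_metric(name: str) -> float:
    """Placeholder: query your metrics backend for the current value."""
    raise NotImplementedError

def steady_state_holds() -> bool:
    """True only if every metric is inside its threshold."""
    for name, bounds in STEADY_STATE.items():
        value = read_metric(name)
        if "max" in bounds and value > bounds["max"]:
            return False
        if "min" in bounds and value < bounds["min"]:
            return False
    return True
```

You run the same check three times: before injection to confirm the baseline, during injection to measure impact, and after rollback to confirm recovery.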
Vary real-world events. The failures you inject should reflect the failures that actually occur in your infrastructure. Instance termination, network packet loss, disk full, CPU spike, dependency timeout, DNS failure, memory leak — these are real failure modes that your cloud provider's infrastructure or your own code can experience. Injecting unrealistic failures (deleting random database rows, for example) produces uninterpretable results. The failure catalogue in SysSimulator's chaos panel is designed around real-world failure modes observed in production systems.
Run experiments in production — but minimise blast radius first. This is the most counterintuitive principle. The value of chaos engineering is discovering how your actual production system behaves under failure — not how a staging environment behaves. Staging environments are typically smaller, differently configured, and differently loaded than production. A failure that is absorbed in staging may cascade in production due to higher load or different configuration. The constraint: start with the smallest possible blast radius. Begin with a single instance in one availability zone. Use feature flags or traffic percentages to limit the scope. Expand scope only after validating safety at smaller scale.
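A common way to cap blast radius is deterministic bucketing: only a fixed percentage of requests or users is routed into the experiment, keyed on a stable identifier so the same user always lands in the same bucket. The sketch below is one illustration of that idea, assuming every request carries a stable user or session ID.

```python
import hashlib

def in_experiment(user_id: str, percentage: int) -> bool:
    """Deterministically place `percentage` percent of users in the
    experiment bucket, keyed on a stable identifier (an assumption:
    every request carries such an ID)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket in 0..99
    return bucket < percentage

# Example: expose only 10% of traffic to the injected failure.
if in_experiment("user-42", percentage=10):
    ...  # route this request to the canary fleet running the experiment
```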
Automate experiments to run continuously. A chaos experiment run once is a point-in-time check. A chaos experiment run on every deployment is a continuous resilience test. As your system evolves — new code deployed, dependencies upgraded, configuration changed — previously passing experiments may start failing. Automated chaos experiments in your CI/CD pipeline catch regressions before they reach production. This is the mature form of the practice: chaos as a continuous test, not a one-time audit.
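One possible shape for this automation is a gate script that the pipeline runs after a deployment: confirm steady state, inject the failure, check steady state again, roll back, and fail the build if the hypothesis was violated. The sketch below is illustrative only; the three placeholder functions stand in for your own metrics query and chaos tooling.

```python
import sys

def steady_state_holds() -> bool:
    """Placeholder for the metric check against your monitoring backend."""
    raise NotImplementedError

def inject_failure() -> None:
    """Placeholder: trigger the failure via your chaos tooling."""

def roll_back() -> None:
    """Placeholder: reverse the injected failure."""

def main() -> int:
    if not steady_state_holds():
        print("aborting: system was not at steady state before injection")
        return 1
    inject_failure()
    try:
        held = steady_state_holds()   # did the hypothesis survive the failure?
    finally:
        roll_back()                   # always reverse the injection
    if not held:
        print("resilience regression: steady state violated under failure")
        return 1
    print("experiment passed: steady state held under failure")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```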
Every chaos experiment has the same structure: hypothesis, method, scope, metrics, abort condition, and rollback plan. Skip any of these and your experiment is incomplete. A sketch that captures all six in a single record appears after the six elements below.
Hypothesis. State what you expect to happen when the failure is injected. "Our e-commerce checkout flow will maintain a success rate above 98% if one of two Redis cache nodes fails, because we have cache miss handling that falls through to the database, and the database has sufficient capacity to absorb the additional load." A good hypothesis is falsifiable — it names a specific metric and a specific threshold that will confirm or deny it.
Method. The specific failure you will inject: "Terminate one of the two Redis cache nodes by removing it from the load balancer and stopping the process." Be precise. "Kill Redis" is ambiguous — does the process crash (cache miss), the node lose network connectivity (timeout), or the data get corrupted? Each produces different failure modes. Pick one and describe it exactly.
Scope. Define the blast radius before you start. "This experiment will affect 10% of production traffic, routed to the canary fleet. The other 90% of traffic uses the unaffected infrastructure." If you cannot define a limited scope, do not run the experiment in production yet. Run it in a staging environment until you understand the failure mode.
Metrics. List exactly what you will monitor: dashboard URLs, specific metric names, alert thresholds. During the experiment, you are watching these numbers — not anything else. Cognitive load during a live experiment is high. Pre-define your monitoring before you start.
Abort condition. Define in advance the condition that will cause you to immediately stop the experiment and roll back. "If error rate exceeds 5% or p99 exceeds 1,000ms, stop the experiment immediately." This prevents a controlled experiment from becoming an uncontrolled outage.
Rollback plan. Know exactly how to reverse the failure before you inject it. "Restart the Redis process and add it back to the load balancer. Verify cache hit rate returns to baseline within 60 seconds." If you cannot describe a clear rollback, you are not ready to run the experiment.
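Writing all six elements down as a single structured record before touching any infrastructure keeps the live run focused on watching metrics rather than recalling the plan. The sketch below encodes the Redis example from this section; the dataclass is an illustrative convention, not a specific tool's API.

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    hypothesis: str        # falsifiable statement with metric and threshold
    method: str            # the exact failure to inject
    scope: str             # blast radius
    metrics: list[str]     # what you watch during the run
    abort_condition: str   # when to stop immediately
    rollback_plan: str     # how to reverse the failure

redis_node_loss = ChaosExperiment(
    hypothesis=("Checkout success rate stays above 98% if one of two "
                "Redis cache nodes fails"),
    method="Remove one Redis node from the load balancer and stop the process",
    scope="10% of production traffic, routed to the canary fleet",
    metrics=["checkout_success_rate", "error_rate", "p99_latency_ms",
             "db_connection_pool_usage"],
    abort_condition="error rate above 5% or p99 above 1,000 ms",
    rollback_plan=("Restart the Redis process, re-add it to the load balancer, "
                   "verify cache hit rate returns to baseline within 60 seconds"),
)
```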
Before running chaos experiments against production or even staging infrastructure, SysSimulator lets you run them against a simulated architecture with real metrics — zero blast radius, immediate results, complete observability. This is the right first step for engineers new to chaos engineering and for testing new failure hypotheses against unfamiliar architectures.
Load any blueprint in SysSimulator — E-Commerce, Social Feed, or any of the 57 available architectures. Set your baseline traffic. Open the Chaos panel. You will find 28 categorised failure scenarios: cache failures, network partitions, database crashes, dependency timeouts, CPU spikes, memory pressure, and more.
Run the experiment in the simulator first: inject the failure, watch the metrics respond in real time, observe the cascade, note the blast radius, record the recovery time. This gives you the specific numbers — p99 spike, error rate peak, recovery duration — that you will use when designing and pitching the production experiment to your team.
When you move to production experiments, the simulator run gives you a baseline expectation. If production behaves dramatically differently from the simulation, that discrepancy itself is a finding — it may indicate that your system is more tightly coupled, less resilient, or differently configured than your architecture diagram suggests.
Explore all 28 chaos scenarios →
Instance termination. The Netflix classic. Reveals: which services have no redundancy (a single instance whose termination causes a full outage), which services have redundancy but no health check-based load balancer routing (requests still go to the dead instance), and which services have proper health checks and failover (the ideal case). Start here for any new service you own.
Cache failure / cache stampede. Reveals: whether your system has graceful cache miss handling or collapses when the cache is cold. A cache stampede — all cache misses hitting the database simultaneously — is one of the most common failure cascades in web systems. Run this experiment against any system where caching is a significant part of the read path. SysSimulator's cache stampede scenario shows you the exact p99 spike and database connection pool exhaustion timeline.
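If the experiment exposes a stampede, one common mitigation is request coalescing: on a cache miss, exactly one caller recomputes the value while concurrent callers wait for that result instead of all hitting the database at once. The single-process sketch below illustrates the pattern; a real deployment would need a distributed lock or a single-flight library, and the function names here are assumptions.

```python
import threading

_cache: dict[str, object] = {}
_key_locks: dict[str, threading.Lock] = {}
_key_locks_guard = threading.Lock()

def get_with_coalescing(key: str, load_from_db):
    """On a cache miss, let only one thread per key query the database;
    concurrent callers wait and then reuse the freshly cached value."""
    if key in _cache:
        return _cache[key]
    with _key_locks_guard:
        lock = _key_locks.setdefault(key, threading.Lock())
    with lock:
        if key in _cache:              # another thread may have filled it
            return _cache[key]
        value = load_from_db(key)      # exactly one database hit per key
        _cache[key] = value
        return value
```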
Network partition. Reveals: your CAP theorem decision in action. Does the system choose consistency (return errors to maintain data integrity) or availability (continue serving potentially stale data)? Run this experiment against any replicated data store. The results will tell you whether your intended consistency/availability tradeoff matches your actual behaviour. See the CAP theorem guide for what to expect.
Dependency latency injection. Reveals: whether slow dependencies cascade into slow responses across your entire call graph. Inject 2,000ms of latency into a single downstream service and watch how long it takes before your p99 response time spikes. If a slow dependency causes your service to queue requests and exhaust its thread pool, you have discovered a cascading latency failure. The fix: timeouts on all external calls, and circuit breakers to fail fast when a dependency is slow.
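That fix can be made concrete with a per-call timeout plus a small circuit breaker that stops calling a dependency after repeated failures and only retries after a cool-down. The sketch below assumes an HTTP dependency called via the `requests` library; the thresholds, URL, and function names are illustrative.

```python
import time
import requests

class CircuitBreaker:
    """Open after `max_failures` consecutive failures; fail fast until
    `reset_after` seconds have passed, then allow one retry."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None       # cool-down elapsed, allow a retry
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()

def fetch_recommendations(user_id: str):
    # Hypothetical downstream service; the 2-second timeout bounds how long
    # a slow dependency can hold one of our worker threads.
    return breaker.call(
        requests.get,
        f"https://recs.internal.example/users/{user_id}",
        timeout=2.0,
    )
```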
CPU saturation. Reveals: whether your autoscaler responds fast enough to prevent request queuing during a CPU spike. Inject a CPU-intensive workload and watch: does the autoscaler add instances before request queue depth becomes problematic? What is the latency penalty during the scale-out window?
"Isn't chaos engineering just breaking things on purpose? Why would you do that in production?" The alternative is breaking things accidentally in production. Every system that has never had chaos experiments run against it has unknown failure modes. Those modes will be discovered — either in a controlled chaos experiment with a limited blast radius and a rollback plan, or during a real incident at peak traffic with full user impact. The question is not whether to discover failure modes, but whether to discover them on your terms or the incident's terms.
"How do you get organisational buy-in for chaos engineering?" Start small and in non-production environments. Run a chaos experiment in staging, find a real weakness, fix it, and document the finding. The concrete result — "we discovered that our payment service has a single point of failure in the Redis dependency; without a circuit breaker, a Redis failure causes the entire payment flow to fail; we fixed this before it caused a production incident" — is more persuasive than theoretical arguments for the practice.
"What is the difference between chaos engineering and fault injection testing?" Fault injection testing is typically a developer-level technique: inject a fault in unit or integration tests to verify that error handling code works. Chaos engineering operates at the system level in production-like environments. Fault injection tests that a function handles a null input correctly; chaos engineering tests that the payment service degrades gracefully when the database is unreachable. They are complementary, not alternatives.
What is chaos engineering?
The practice of deliberately injecting failures into a system to discover weaknesses before they cause unexpected production outages. Define a hypothesis, inject a failure with limited blast radius, observe whether the system maintains its steady state, and use the findings to improve resilience.
What is the steady state hypothesis?
A measurable definition of normal: specific metrics and thresholds that describe a healthy system. You confirm steady state before, monitor during, and verify recovery after every experiment. Without it, you cannot distinguish a resilience finding from normal variation.
What is blast radius?
The scope of impact when an experiment reveals a weakness or goes wrong. Start with the smallest possible blast radius — a single instance, a small traffic percentage — and expand only after validating safety. Keeping blast radius small is what separates chaos engineering from reckless destruction.
How is it different from load testing?
Load testing validates performance under high traffic. Chaos engineering validates resilience under component failure. A system can pass all load tests and still collapse when a single cache node fails. Both are necessary; neither replaces the other.
What is Chaos Monkey?
Netflix's tool that randomly terminates EC2 instances during business hours, forcing engineers to design services that survive instance loss as a normal operating condition. The origin of the chaos engineering discipline and the model for every chaos tool that followed.
Explore all 28 chaos scenarios →
Browse all blueprints
Next in the series: Distributed caching patterns →