Wired for Scale: Sid Rao's Musings

Chaos Engineering: The Evolution from Netflix's Chaos Monkey to AI-Powered Resilience

Denny's Led to Chaos in My Stomach, and I Decided to Write an Article on Chaos Engineering...

Sid Rao
Jun 09, 2025

Denny’s pancakes are fantastic. The tender crumb, combined with the buttery vanilla, just melts in your mouth. But at the tender age of 43, don’t try to eat too many of them. Otherwise, you may end up on your ass writing an article on Chaos Engineering.

When breaking things on purpose actually makes sense – and when it's just expensive theater.

Let me tell you something about chaos engineering. It's like democracy – the worst form of reliability testing except for all the others that have been tried. Except, and here's where it gets interesting, sometimes it's actually worse than the alternatives.

We've got companies spending millions to randomly terminate instances when they could achieve better results with a $50 canary deployment. But I'm getting ahead of myself.

In August 2008, Netflix experienced what we in the business call "a situation." Their database corruption brought DVD shipping to its knees for three days. Now, most companies would have written a post-mortem, implemented some backups, and considered the matter closed.

Netflix? They decided the solution was to break things constantly. It's the kind of logic that makes sense after you've been in tech long enough – or after you've had enough bourbon, though I don't recommend mixing the two.

Fast forward to today, and chaos engineering has evolved from Netflix's digital game of whack-a-mole to a sophisticated discipline that somehow convinced venture capitalists to pour billions into companies whose primary product is breaking other companies' products.

The practice generates 245% ROI for some organizations while others hemorrhage cash faster than a startup with a ping-pong table budget. The difference?

Whether you actually need it.

From firefighters to venture-funded failure factories

The intellectual genealogy of chaos engineering traces back to Jesse Robbins at Amazon, who held the magnificently apocalyptic title "Master of Disaster." Robbins, drawing from his experience as a volunteer firefighter, understood something fundamental about human nature: we don't prepare for disasters, we react to them. Between 2008 and 2010, he created "Game Day" – because nothing says "productive Thursday" like intentionally crashing production systems.

Side Note: We used to hand out ambulance-chaser phone tool icons to folks who loved outages. There was something very thrilling about being engaged in a significant outage.

It became even more exhilarating when it occurred on an important day (e.g., Cyber Monday or Black Friday). I’m sure it was absolutely thrilling for Andy and Charlie.

Don’t confuse Chaos Engineering with running Game Days. Game Days are a fundamental building block of Chaos Engineering, one step towards implementing a full Chaos Engineering strategy, but they stop short of the continuous, always-on experimentation that defines the discipline.

Game Days are essential parts of understanding how your service operates when critical components fail, such as availability zones, autoscaling groups, or even entire regions.

Run Game Days even if you never graduate to a full Chaos Engineering practice. I may make fun of Game Days, but they are a necessary part of launch planning.

"Game Day was inspired by his experience & training as a firefighter combined with lessons from other industries and research on complex systems," the documentation tells us, with the kind of matter-of-fact tone usually reserved for explaining why someone thought New Coke was a good idea.

The approach spread faster than a rumor in a startup's Slack channel, with Google developing "DiRT" (Disaster Recovery Testing) – because if you're going to break things, you might as well give it an acronym.

But we took it another step…

But here's where the story takes a turn worthy of a Silicon Valley screenplay. Netflix's 2008 outage didn't just inspire better backup procedures. No, they decided the problem was that things didn't break often enough. Greg Orzell and his team created Chaos Monkey in 2010, a tool with the singular purpose of randomly terminating production instances during business hours.

It's the equivalent of hiring someone to randomly unplug servers while you're trying to stream The Great British Bake Off.

"The best way to avoid failure is to fail constantly," became Netflix's motto, which sounds profound until you realize it's the same logic my nephew uses to justify his report card.

On July 19, 2011, Netflix announced its Simian Army publicly – because nothing instills customer confidence like telling them you've weaponized primates against your own infrastructure.

The discipline achieved what we call "legitimacy" in 2015 when Casey Rosenthal joined Netflix to lead the Chaos Engineering Team. Finding that many engineers understood chaos engineering as simply "breaking stuff in production," which, to be fair, isn't entirely wrong, Rosenthal worked to define a more scientific approach.

This culminated in the 2017 publication of the Principles of Chaos Engineering, establishing it as "the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

It's like calling arson "thermal stress testing for buildings."

There is another reality every engineering manager will encounter. Some amount of chaos inevitably occurs due to the human beings running the service you are responsible for.

I still remember a funny period from right before a launch I was involved with. Somebody put me in charge of the administration console for this service. As a dutiful engineering manager, I wanted to conduct a basic load test of this service. So I steadily increased my canary rate from zero to a steady state of 1 TPS.

In absolute wonder, I watched as the entire service melted away. “SHUT YOUR CANARY DOWN!” my fellow engineering managers yelled. But I impudently stood my ground. We needed to fix this. 1 TPS was nothing. Then another engineering manager tried to announce that we needed “a process for running load tests.” Um - no.

I learned that a certain amount of chaos, from any source, is beneficial for a service. When that systems engineer inevitably makes a change without change control, count your lucky stars and learn from it. Also, tell the systems engineer to follow the process.

The inconvenient truth about who shouldn't use chaos engineering

Now, let me tell you something the chaos engineering vendors won't mention in their slick pitch decks: for most organizations, implementing chaos engineering is like buying a Ferrari to commute through Manhattan – impressive, expensive, and completely missing the point.

Young startups – and by young, I mean any company where the CEO still knows everyone's name – have no business implementing chaos engineering. You know what you need? Basic monitoring. Error handling that doesn't involve print statements. A deployment process more sophisticated than "Jim's laptop."

When serving a hundred users, you don't need to simulate regional failures; you need to ensure your application doesn't crash when someone enters an emoji in the username field.

Small to medium businesses with straightforward architectures are similarly poorly served. If your entire infrastructure fits on a whiteboard – and I mean a regular whiteboard, not one of those conference room monstrosities – chaos engineering is operational theater. You're spending $50,000 to discover what a junior developer could tell you for free: your single database is a single point of failure. Shocking revelation, that.

Companies with poor observability attempting chaos engineering are like pilots flying blind who decide what they really need is more turbulence. Without comprehensive monitoring, running chaos experiments is just expensive randomness. You'll break things, certainly, but you won't learn anything except that broken things are, indeed, broken.

Legacy monoliths present another category of poor ROI. These systems, held together by equal parts code and prayer, don't need chaos engineering – they have chaos built in. Every deployment is an experiment in failure. Every configuration change is a roll of the dice. Adding intentional chaos to inherent chaos is like adding gasoline to a fire because you're curious about thermodynamics.

The financial mathematics are unforgiving. A comprehensive chaos engineering program requires dedicated personnel, enterprise tooling, extensive monitoring infrastructure, and cultural transformation. We're talking minimum investments of $200,000 annually for small implementations, scaling to millions for enterprise deployments.

For a startup burning through runway, that's not resilience engineering – that's fiscal irresponsibility.

The alternatives that actually work (and cost less than a Series A)

Before you mortgage your technical debt to fund a chaos engineering program, consider these alternatives that provide better ROI for most organizations:

Progressive delivery and canary deployments offer 80% of the resilience benefits at 10% of the cost. A well-implemented canary system – gradually rolling out changes while monitoring key metrics – catches most production issues before they become incidents. It's like chaos engineering, except instead of breaking things randomly, you break them in a controlled, measurable way. Revolutionary concept, I know.
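None of this requires a platform purchase. Here's a minimal sketch of the control loop, assuming a traffic router and a metrics source you already have; set_canary_weight and get_error_rate below are hypothetical stubs, not any particular vendor's SDK.

# Minimal canary rollout guard (sketch). The routing and metrics calls are
# hypothetical stand-ins for whatever load balancer and monitoring you run.
import time

ERROR_RATE_THRESHOLD = 0.01          # abort if the canary error rate exceeds 1%
WEIGHT_STEPS = [5, 10, 25, 50, 100]  # percent of traffic shifted to the canary
BAKE_TIME_SECONDS = 300              # watch each step before promoting further

def set_canary_weight(percent: int) -> None:
    """Stub: point `percent` of traffic at the canary via your router/LB API."""
    print(f"routing {percent}% of traffic to the canary")

def get_error_rate() -> float:
    """Stub: return the canary's observed error rate from your metrics API."""
    return 0.0

def rollback() -> None:
    print("rolling back: canary weight set to 0%")
    set_canary_weight(0)

def progressive_rollout() -> bool:
    for weight in WEIGHT_STEPS:
        set_canary_weight(weight)
        time.sleep(BAKE_TIME_SECONDS)                  # let metrics accumulate
        if get_error_rate() > ERROR_RATE_THRESHOLD:
            rollback()                                 # blast radius capped at `weight`%
            return False
    return True                                        # canary promoted to 100%

if __name__ == "__main__":
    print("promoted" if progressive_rollout() else "rolled back")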

Side Note: It distresses me how little time goes into building a solid canary. I watch as engineers build all kinds of unit and integration tests, but then lack a simple canary that can tell them when their service is up or down.
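For the record, "up or down" is not a big ask. Here is a sketch of the simplest possible version; the health URL and the paging hook are placeholders for your own endpoint and alerting, not a real product's API.

# Minimal synthetic canary (sketch): probe a health endpoint on a schedule and
# alarm after consecutive failures. HEALTH_URL and page_oncall() are placeholders.
import time
import urllib.request

HEALTH_URL = "https://example.com/api/health"  # placeholder endpoint
INTERVAL_SECONDS = 60
FAILURE_THRESHOLD = 3                          # consecutive failures before alarming

def probe() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def page_oncall(message: str) -> None:
    print(f"ALARM: {message}")                 # wire this to your real alerting

def run_canary() -> None:
    failures = 0
    while True:
        failures = 0 if probe() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            page_oncall(f"{HEALTH_URL} failed {failures} consecutive probes")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    run_canary()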

Take the time to ensure the canary is included in the project plan, please. Make it part of your launch checklist, even for simple features. It doesn’t even have to be a UX canary; it can simply exercise the underpinning APIs to start with. Speaking of…

Comprehensive testing remains embarrassingly effective. Unit tests, integration tests, load tests, and synthetic monitoring – the greatest hits of software quality. Boring? Absolutely. Effective? Empirically so. Most outages aren't caused by complex distributed systems failures; they're caused by someone forgetting to handle null values. You don't need chaos engineering to catch that – you need a code review process that involves actual review.

Side Note: This is where many teams get lost. They assume that testing must start from the UX layer. Ideally, yes. However, you can start by exercising your underlying APIs. You can then graduate to tools like Playwright for more complex UX testing scenarios.
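As a sketch of what "start at the API layer" looks like in practice, here are two pytest-style checks against a hypothetical /users endpoint; the base URL, the endpoint, and the use of requests are stand-ins for whatever your stack actually exposes.

# API-layer tests (sketch). BASE_URL and /users are hypothetical; swap in your own API.
import requests

BASE_URL = "https://staging.example.com/api"

def test_create_user_accepts_unicode_username():
    # The emoji-in-the-username case from earlier, exercised below the UX layer.
    resp = requests.post(f"{BASE_URL}/users", json={"username": "pancake_fan_🥞"}, timeout=5)
    assert resp.status_code in (200, 201)

def test_create_user_rejects_missing_username():
    resp = requests.post(f"{BASE_URL}/users", json={}, timeout=5)
    assert resp.status_code == 400             # a clean rejection, not a 500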

Feature flags and circuit breakers provide runtime resilience without the drama. The ability to disable problematic features instantly, to gracefully degrade functionality, to prevent cascade failures – these patterns solve real problems for a fraction of the cost. They're like chaos engineering's responsible older sibling who went to business school. See my article on load shedding.
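If you've never written one, a circuit breaker is a few dozen lines, not a procurement cycle. The sketch below uses illustrative thresholds and a hypothetical flaky dependency; production versions add per-endpoint state, metrics, and smarter half-open probing.

# Bare-bones circuit breaker (sketch): after enough consecutive failures the call
# is short-circuited to a fallback until a cooldown elapses, then retried.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None                  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()              # open: degrade gracefully, don't hammer
            self.opened_at = None              # cooldown over: let one attempt through
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Usage sketch: wrap a hypothetical flaky dependency, serve an empty list when open.
recommendations_breaker = CircuitBreaker()
# recommendations_breaker.call(fetch_recommendations, fallback=lambda: [])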

Enhanced observability and monitoring give you the insights that chaos engineering promises without the operational overhead. Distributed tracing, structured logging, meaningful dashboards – these investments pay dividends regardless of whether you ever run a chaos experiment. It's the difference between breaking things to learn and actually learning from the things that break naturally.
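The cheapest version of this is structured, correlated logging: one machine-parseable event per line, tied together by a request ID. A minimal sketch, with field names that are conventions rather than any specific product's schema:

# Structured logging sketch: JSON events with a request_id you can follow
# across services. Field names are illustrative conventions.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orders")

def log_event(event: str, request_id: str, **fields) -> None:
    logger.info(json.dumps({
        "ts": time.time(),
        "event": event,
        "request_id": request_id,   # the thread to pull when something breaks naturally
        **fields,
    }))

request_id = str(uuid.uuid4())
log_event("order.received", request_id, user_id=42)
log_event("order.charged", request_id, amount_cents=1299, latency_ms=87)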

Blue-green deployments eliminate entire categories of deployment failures. Two identical production environments, seamless switching, instant rollback capability – it's elegant in its simplicity. No chaos required, just good architectural decisions.
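The whole pattern fits in a sketch. The Router class below stands in for whatever DNS alias or load balancer API you would actually flip; the environment URLs are placeholders.

# Blue-green in miniature (sketch): two identical environments, one pointer.
# Flipping the pointer is the deploy; flipping it back is the rollback.
ENVIRONMENTS = {
    "blue": "https://blue.internal.example.com",
    "green": "https://green.internal.example.com",
}

class Router:
    """Stand-in for your DNS alias / load balancer API."""
    def __init__(self, live: str = "blue"):
        self.live = live

    def point_production_at(self, env: str) -> None:
        print(f"production now serves from {ENVIRONMENTS[env]}")
        self.live = env

def deploy(router: Router, smoke_test) -> None:
    idle = "green" if router.live == "blue" else "blue"
    # Release to the idle environment, verify it, then flip the pointer.
    if smoke_test(ENVIRONMENTS[idle]):
        previous = router.live
        router.point_production_at(idle)
        print(f"instant rollback available: point_production_at('{previous}')")
    else:
        print(f"smoke test failed on {idle}; production untouched")

deploy(Router(live="blue"), smoke_test=lambda url: True)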

But there is nothing wrong with starting with simple game days. In fact, this is what most AWS teams do. It is telling that, while I was there, I didn’t see much beyond game days in most services (run with internal tooling that happened to be called Gremlin, not the external Gremlin service), especially in young services that were just starting out and trying to win customers.

If you aren’t doing the basics, don’t bother with chaos engineering. It will just make you feel good without delivering the impact you’re looking for.

For organizations below enterprise scale, these approaches provide superior return on investment. They're implementable incrementally, deliver immediate value, and don't require convincing your CEO that deliberately breaking production is a growth strategy.

AWS chaos engineering: Because complexity needed more complexity

While AWS offers its own Fault Injection Simulator – and kudos to their marketing team for making "we'll break your stuff" sound professional – the majority of organizations implementing chaos engineering on AWS infrastructure rely on third-party tools. It's like buying a Mercedes and then replacing all the parts with aftermarket components, but here we are.

[Image: South Park meme – "Now watch, we're going to give it, a lot of money." (r/southpark)]

Gremlin, founded by former Netflix and Amazon engineers who apparently didn't get enough chaos at their day jobs, leads the commercial market. They've built an empire on the simple premise that companies will pay handsomely for software that breaks their software. Organizations use Gremlin to simulate everything from availability zone failures to testing auto-scaling behavior, because nothing says "Thursday afternoon" like wondering if your application can survive us-east-1 going dark.

Chaos Toolkit represents the open-source approach, offering a Python-based framework that lets you define disasters in JSON. There's something poetic about describing catastrophe in such a structured format (the AZ and filter values below are illustrative placeholders):

{
  "version": "1.0.0",
  "title": "Does the application survive losing a random instance?",
  "description": "Terminate one EC2 instance and verify the app stays healthy.",
  "steady-state-hypothesis": {
    "title": "Application responds normally",
    "probes": [{
      "type": "probe",
      "name": "application-health",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "chaosaws.ec2.probes",
        "func": "instance_state",
        "arguments": {
          "state": "running",
          "filters": [{"Name": "tag:service", "Values": ["my-app"]}]
        }
      }
    }]
  },
  "method": [{
    "type": "action",
    "name": "terminate-random-instance",
    "provider": {
      "type": "python",
      "module": "chaosaws.ec2.actions",
      "func": "terminate_instance",
      "arguments": {
        "az": "us-east-1a"
      }
    }
  }]
}

It's like writing poetry, if poetry could cause production outages and generate incident reports.

For Kubernetes workloads on EKS, LitmusChaos and Chaos Mesh offer cloud-native approaches, as regular chaos apparently wasn't complicated enough. These tools excel at pod-level fault injection, which is a fancy way of saying "we'll randomly delete your containers and see what happens."

The real complexity comes from AWS's architecture itself. Multi-region deployments, availability zones, elastic load balancers, auto-scaling groups – it's a distributed systems engineer's paradise and a chaos engineer's playground. Organizations implementing chaos engineering on AWS must test cross-region latency, data consistency during regional failures, and DNS failover behavior. It's like playing three-dimensional chess while someone randomly removes pieces from the board.

When artificial intelligence met chaos (A love story nobody asked for)

The integration of AI with chaos engineering represents either the pinnacle of human ingenuity or the beginning of Skynet, depending on your perspective. We've taught machines to break other machines more efficiently than humans ever could. Progress.

Current AI applications in chaos engineering span four major areas, each more ambitious than the last:

Automated failure prediction uses machine learning models to identify potential failure points before they manifest. It's like having a fortune teller for your infrastructure, except instead of crystal balls, we use LSTM models and anomaly detection algorithms. These systems analyze patterns from chaos experiments to predict where your system will fail next, which is either incredibly useful or deeply unsettling, depending on how much coffee you've had.
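Strip away the LSTM branding and the core idea is "learn what normal looks like, flag what isn't." Here's a toy version using a rolling z-score instead of a neural network; the window size and threshold are illustrative, not tuned recommendations.

# Toy anomaly detector (sketch): flag samples that deviate more than
# z_threshold standard deviations from a trailing window's mean.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window: int = 60, z_threshold: float = 3.0):
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) >= 10:                      # wait for a minimal baseline
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                yield i, value
        history.append(value)

# e.g. p99 latency in milliseconds, with a spike the detector should catch
latencies = [120 + (i % 7) for i in range(200)]
latencies[150] = 900
print(list(detect_anomalies(latencies)))            # -> [(150, 900)]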

Intelligent experiment design has reached the point where AI can automatically generate chaos experiments based on system characteristics. Platforms like Harness Chaos Engineering use GenAI to translate conversational prompts into executable disasters. "Hey AI, what happens if we lose database connectivity during peak traffic?" becomes a sophisticated experiment complete with blast radius calculations and rollback procedures. We've automated the process of breaking things, which is either peak efficiency or peak absurdity.

ML-enhanced result analysis transforms the massive data generated by chaos experiments into actionable insights. Because it turns out, when you break things at scale, you create more logs than a lumber mill. AI systems analyze these logs, metrics, and alerts to distinguish between normal and anomalous behavior, utilizing frameworks like SHAP (Shapley Additive Explanations) to provide explainable AI. Yes, we need AI to explain what our AI is doing. Welcome to the future.

AI-driven workflow automation enables autonomous chaos engineering, where systems automatically design, execute, and learn from experiments without human intervention. It's like giving your infrastructure a self-destruct button and trusting it to use it wisely. What could possibly go wrong?

Looking ahead, experts predict that fully autonomous chaos engineering will be achieved by 2026-2028. Self-healing systems will automatically reorganize based on insights from chaos experiments, while neuromorphic computing architectures will be designed specifically for chaos engineering applications. We're building systems that break themselves to make themselves stronger. It's either evolution or a costly form of digital masochism.

The brutal mathematics of chaos engineering ROI

Let's talk numbers, because nothing brings sobriety to a chaos engineering discussion faster than actual financial data.

The headline figure – 245% ROI over three years – comes with more asterisks than a pharmaceutical advertisement. This return applies to large enterprises with complex distributed systems, mature engineering practices, and budgets that would make a small country envious. For everyone else, the mathematics tells a different story.

For enterprises, the numbers can be compelling:

  • Average downtime cost: $9,000 per minute

  • MTTR reduction: 60-90% for mature implementations

  • Availability improvements: From 99.9% to 99.99%+

  • Engineering time saved: 30-40% reduction in incident response
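To make those figures concrete, here's the back-of-envelope math. The annual downtime minutes and program cost are my own illustrative assumptions, not numbers from any case study.

# Back-of-envelope ROI. Two inputs come from the list above; two are assumptions.
downtime_cost_per_minute = 9_000     # from the figures above
mttr_reduction = 0.60                # low end of the 60-90% range above
annual_downtime_minutes = 300        # ASSUMPTION: roughly five hours of outages a year
annual_program_cost = 1_200_000      # ASSUMPTION: team + tooling + monitoring

annual_savings = downtime_cost_per_minute * annual_downtime_minutes * mttr_reduction
roi = (annual_savings - annual_program_cost) / annual_program_cost
print(f"savings: ${annual_savings:,.0f}, ROI: {roi:.0%}")
# savings: $1,620,000, ROI: 35% -- positive, but only because every input is
# enterprise-sized; shrink the downtime bill and the ROI goes negative fast.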

But here's what those case studies don't mention: the prerequisite investments. Before you see any ROI, you need:

  • Comprehensive monitoring and observability: $100,000-$500,000

  • Dedicated chaos engineering team: $400,000-$1,000,000 annually

  • Enterprise tooling: $50,000-$200,000 per year

  • Cultural transformation: Priceless (and painful)

For smaller organizations, the mathematics becomes punitive:

  • Initial investment: $200,000 minimum

  • Time to positive ROI: 2-3 years (if ever)

  • Opportunity cost: Could fund 2-3 additional engineers

  • Risk: Causing outages that exceed any potential savings

The chaos engineering tools market, valued at $1.9 billion in 2023 and projected to reach $2.9 billion by 2028, indicates that someone is making money.

Hint: It's not usually the companies buying the tools.

The dark side of chaos: What vendors won't tell you

Every technology has its shadow side, and chaos engineering's is darker than a datacenter during a power outage. Let me illuminate some inconvenient truths:

False confidence ranks as the most insidious risk. Teams run a few chaos experiments, nothing breaks, and suddenly they're invincible. It's like wearing a bulletproof vest and thinking you're immortal. Chaos engineering can't test for all failure modes – black swans don't schedule appointments.

Cultural resistance remains real and justified. Telling engineers to break production systems contradicts every instinct they've developed. It's like asking a surgeon to randomly cut arteries to test the hospital's emergency response. The cognitive dissonance alone can destroy team morale.

Cascade failures from poorly designed experiments have taken down entire companies. One misconfigured chaos experiment can trigger failure modes you didn't know existed. It's the technological equivalent of pulling a thread and watching the whole sweater unravel.

Security vulnerabilities increase with the use of chaos engineering tools. You're literally installing software designed to break things, with privileged access to your infrastructure. It's like giving burglars a key to test your home security. What could possibly go wrong? It is essential to build a threat model around these tools.

Operational overhead often exceeds the benefits. Running chaos experiments requires dedicated time, resources, and attention. For many teams, it becomes another checkbox in an already overwhelming list of "best practices" that provide marginal value.

Gaming the metrics becomes commonplace. Teams optimize for chaos experiment success rather than actual reliability. They build systems that pass chaos tests but fail in novel ways. It's teaching to the test, infrastructure edition.

The most damaging aspect? Chaos engineering as theater – organizations implementing it because it sounds impressive in engineering blogs and conference talks, not because it provides value. It's the technical equivalent of a participation trophy, expensive and ultimately meaningless.
