Chaos Engineering Mistakes
Five chaos engineering mistakes to avoid
Chaos engineering doesn't create resiliency; it reveals where resiliency is missing by injecting real-world failures ("chaos variables") such as hardware or network faults. It literally breaks things on purpose so that teams can build more resilient systems.
Popularized by Netflix, chaos engineering has been gaining attention and acceptance in the software engineering community. As more teams adopt it, a few commonly seen mistakes are worth discussing.
Mistake 1: Not Defining a Steady State First
Not defining a steady state first means we don't have a clear baseline of how our system should behave under normal conditions. Without this, chaos engineering becomes a mess: we can't tell if the chaos broke something or if it was already off. It's like trying to fix a car without knowing how it runs when it's healthy. We need that "normal" snapshot to compare against, so we can spot what's failing and why.
"We must know what 'normal' looks like before breaking things, or we can't measure what's broken."
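As a rough illustration of what a steady-state baseline could look like in practice, here is a minimal Python sketch. The `SteadyState` class, the two metrics, and the thresholds are assumptions made up for this example, not a standard; a real baseline would come from your own monitoring data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SteadyState:
    """Baseline captured under normal conditions (hypothetical metrics)."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.01 = 1%
    max_mean_latency_ms: float  # ceiling for the mean response time


def within_steady_state(baseline, error_flags, latencies_ms):
    """Return True if the observed samples stay inside the baseline."""
    error_rate = sum(error_flags) / len(error_flags)
    return (error_rate <= baseline.max_error_rate
            and mean(latencies_ms) <= baseline.max_mean_latency_ms)


# Assumed thresholds, for illustration only.
baseline = SteadyState(max_error_rate=0.01, max_mean_latency_ms=200.0)

# Check before injecting any fault: if this fails, the system was already off.
healthy = within_steady_state(baseline, [False] * 100, [120.0, 150.0, 130.0])

# Run the same check during the experiment to see whether the chaos changed anything.
degraded = within_steady_state(
    baseline, [True] * 5 + [False] * 95, [120.0, 900.0, 130.0]
)
```

Running the same check before and during the experiment is what lets us separate "the fault broke it" from "it was already broken".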
Mistake 2: Skipping Rollback Plans
Skipping rollback plans means we don't have a way to reverse the chaos we introduce. In chaos engineering, we're intentionally breaking things, like killing a server or spiking latency, to see how the system responds. But if we can't undo it fast, we might tank the whole operation.
"Failing to plan for rollback can turn a controlled experiment into uncontrolled chaos."
For example, say we cut a database connection to test resilience. Without a rollback, like a script to reconnect or a backup ready, we could leave users stuck, data lost, or the app dead. It's not just about starting the chaos; it's about controlling it. A good rollback keeps the experiment safe and lets us learn without burning everything down.
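One way to make sure a rollback always runs is to tie it to the fault injection itself. The sketch below uses Python's `contextlib.contextmanager` for that. The `fault` helper, `cut_db`, `restore_db`, and the `system` dictionary are hypothetical stand-ins for a real environment.

```python
from contextlib import contextmanager


@contextmanager
def fault(inject, rollback):
    """Inject a fault, then always roll it back, even if the experiment raises."""
    inject()
    try:
        yield
    finally:
        rollback()


# Hypothetical stand-in for a real system: a flag for the database connection.
system = {"db_connected": True}


def cut_db():
    system["db_connected"] = False


def restore_db():
    system["db_connected"] = True


observed = []
try:
    with fault(cut_db, restore_db):
        observed.append(system["db_connected"])   # state during the experiment
        raise RuntimeError("experiment aborted")  # simulate something going wrong
except RuntimeError:
    pass

# At this point the connection flag is restored, although the experiment failed midway.
```

Pairing every injection with its undo step in one place makes it hard to start an experiment that has no way back.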
Mistake 3: Testing too much at once
Testing too much at once means injecting several failures together, such as cutting servers, spiking traffic, and dropping connections all at the same time. Let's say we test by killing two services and faking network lag simultaneously. If the app crashes, was it a single failure or the combination? We're left guessing instead of learning. Chaos engineering works best when we break one thing at a time: clean, clear results show us exactly what's weak.
"Overloading with multiple failures (chaos variables) obscures root causes."
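To illustrate the "one thing at a time" idea, the following Python sketch runs each fault in isolation and records a separate result for each. The fault names, the `run_isolated` helper, and the toy `system_ok` check are all invented for this example.

```python
def run_isolated(faults, system_ok):
    """Run each fault alone and record whether the system stayed healthy."""
    results = {}
    for name, (inject, rollback) in faults.items():
        inject()
        try:
            results[name] = system_ok()
        finally:
            rollback()  # clear this fault before the next one starts
    return results


# Toy model of a system: the set of faults that are currently active.
active = set()


def make_fault(name):
    """Build an (inject, rollback) pair for a named, hypothetical fault."""
    return (lambda: active.add(name), lambda: active.discard(name))


def system_ok():
    # Assumption for the demo: this app tolerates a lost service but not network lag.
    return "network_lag" not in active


faults = {n: make_fault(n) for n in ("kill_service_a", "network_lag")}
results = run_isolated(faults, system_ok)
```

Because only one fault is active at any moment, each entry in `results` points at exactly one cause.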
Mistake 4: Forgetting to monitor results
Forgetting to monitor results means we're running chaos experiments blind. We break stuff, such as killing a server or spiking load, but if we don't watch what happens, we've got no clue what it proves.
"Neglecting to monitor defeats the purpose: you need visibility into the chaos."
For instance, say we cut a database link to test resilience. Without tracking response times or error rates, we won't know if it failed quietly or crashed hard. Monitoring is how we catch the weak spots and fix them. No data, no insight, just wasted chaos.
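As a minimal sketch of that kind of tracking, the Python snippet below wraps each request, times it, and counts failures. The `Recorder` class, the `query` function, and the simulated outage are assumptions for illustration; in practice these numbers would usually come from an existing monitoring stack.

```python
import time


class Recorder:
    """Collect latency and error counts for requests made during an experiment."""

    def __init__(self):
        self.latencies_ms = []
        self.errors = 0
        self.calls = 0

    def observe(self, request):
        """Call request(), record how long it took, and count it if it raised."""
        start = time.perf_counter()
        self.calls += 1
        try:
            return request()
        except Exception:
            self.errors += 1  # swallow the error so the experiment keeps measuring
            return None
        finally:
            self.latencies_ms.append((time.perf_counter() - start) * 1000)

    @property
    def error_rate(self):
        return self.errors / self.calls if self.calls else 0.0


# Hypothetical service state and request function.
state = {"db_up": True}


def query():
    if not state["db_up"]:
        raise ConnectionError("database link is down")
    return "row"


recorder = Recorder()
for i in range(10):
    state["db_up"] = i < 6  # simulate cutting the link for the last 4 requests
    recorder.observe(query)
```

With the error rate and latencies recorded, a quiet failure shows up as data instead of going unnoticed.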
Mistake 5: Ignoring team buy-in
Ignoring team buy-in means starting chaos experiments without the team agreeing to them. If the people who run and fix the system aren't on board, it's a waste. They won't get ready, won't learn, and won't fix anything. It's like throwing a party no one wants to come to: nothing happens.
"No team support means chaos engineering doesn't work. It's about people too."
For example, if you crash a service and the team doesn't know why, they'll just be annoyed instead of making the system tougher. Getting everyone to say "yes" first makes sure they're in on it and ready to improve.
Conclusion
To succeed, start with a clear baseline, plan rollbacks, test one thing at a time, watch the results, and get the team on board. Miss any of these, and the chaos just creates more problems.
Disclaimer: The statements and opinions expressed in this article are based on personal experiences, FAQs, and references. They do not necessarily reflect the positions of Miahlouge.
Inspirations and References
Chaos engineering frequently asked questions
John Allspaw discussion on Resilience Engineering to industry
Thoughtworks article on avoiding mistakes in chaos engineering
Reddit discussion on "Is Chaos Engineering still a Thing?"