What is Chaos Engineering?
Break things on purpose
Enterprise software systems have become more sophisticated, relying heavily on distributed components like cloud services and microservices. Chaos engineering plays a vital role in creating resilient distributed systems.
This publication explores chaos engineering: what it is, why it's crucial, and the principles behind it, spotlighting organizations that benefit from it. It also covers popular tools and the pros and cons of using them.
Chaos Engineering
Chaos engineering is the practice of intentionally injecting faults into a system to test its resilience. The goal is to identify potential failure points and correct them before they cause an actual outage or other disruption. By proactively testing how a system responds under stress, we can identify and fix failures before they end up in the news.
"Chaos doesn't create resiliency, it reveals where resiliency is missing. It literally 'breaks things on purpose' to build more resilient systems."
Why break things on purpose?
Think of a vaccine, such as a flu shot or a COVID-19 vaccine, where we inject ourselves with a small amount of a potentially harmful foreign body in order to build resistance and prevent illness. Chaos engineering is a tool we use to build such immunity in our technical systems: we inject harm such as latency, CPU failure, or network black holes in order to find and mitigate potential weaknesses.
The chaos engineering report shows that testing systems against the kinds of failures behind the Australian Securities Exchange (ASX) 2005 outage, the New York Stock Exchange (NYSE) 2015 outage, and the Moscow Exchange (MOEX) 2007 power outage boosts availability, shortens detection and resolution times, reduces bugs, and prevents outages. Teams doing this often hit >99.9% reliability.
What's the role of Chaos Engineering?
Distributed systems are inherently more complex than monolithic systems, so it's hard to predict all the ways they might fail. New programmers often assume the eight fallacies of distributed computing:
The network is reliable - Nope, it can fail anytime
Latency is zero - Data doesn't move instantly, lag happens
Bandwidth is infinite - Networks have limits, and they get clogged
The network is secure - Hackers and glitches say otherwise
Topology doesn't change - Servers and connections shift constantly
There is one administrator - Multiple people or systems manage different parts
Transport cost is zero - Moving data takes time and money
The network is homogeneous - Nodes and networks differ everywhere
Many of these fallacies drive the design of chaos engineering experiments such as "packet-loss attacks" and "latency attacks".
For example, network outages can cause a range of failures for applications that severely impact customers. Applications may stall while they wait endlessly for a packet. Applications may permanently consume memory or other Linux system resources. And even after a network outage has passed, applications may fail to retry stalled operations, or may retry too aggressively. Applications may even require a manual restart. Each of these cases needs to be tested and prepared for.
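The retry failure modes described here can be made concrete with a small sketch. It is only an illustration under stated assumptions: `call_with_retries` and `FlakyService` are hypothetical names, and capped exponential backoff with jitter is one reasonable policy for avoiding both "never retries" and "retries too aggressively", not the only one.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying connection failures with capped exponential backoff.

    The wait before each retry is a random fraction of the backoff (jitter), so
    many clients recovering from the same outage do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(backoff * rng())


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then recovers."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated network outage")
        return "ok"


if __name__ == "__main__":
    service = FlakyService(failures=2)
    # Replace sleep with a no-op so the demo does not actually wait.
    print(call_with_retries(service, sleep=lambda seconds: None), service.calls)
```

A packet-loss or latency experiment would then check that a client built this way recovers on its own once the outage ends, without a manual restart.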
Benefits of Chaos Engineering
Customer: The increased availability and durability of service means fewer outages disrupt their day-to-day lives.
Business: Chaos engineering helps prevent large revenue losses and maintenance costs, boosts engineer morale and engagement, enhances on-call training, and strengthens company-wide incident management.
Technical: Chaos experiments reduce incidents and on-call stress, deepen knowledge of system failures, enhance design, and speed up incident detection.
Organizations Benefit from Chaos Engineering
Top organizations such as Amazon, Netflix, Microsoft, National Australia Bank, the New York Stock Exchange, and the Moscow Exchange all use chaos engineering to better understand the behavior and flaws of their internal systems. It can also help organizations increase the velocity of their continuous integration and continuous delivery (CI/CD) pipelines. Incorporating chaos engineering into CI/CD enables organizations to automate continuous experiments while controlling their potential impact.
Source: The Guardian - Stock trading closed on NYSE after glitch caused major outage
Types of Chaos Engineering Experiments
DevOps teams have several options for running chaos engineering experiments to test various system processes. In the simplest terms, the work is writing code that helps a company make more money, but there are details to consider, such as the company size at which a dedicated team makes sense. The simplest experiment randomly turns services off to see if an application fails; other common types follow.
Latency Injection
DevOps teams intentionally create scenarios that emulate a slow or failing network connection. This includes the introduction of network delays or slower response times.
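As a rough illustration of latency injection at the application level, the sketch below wraps a call in a decorator that sometimes delays it. The names `inject_latency` and `fetch_price` are hypothetical, and real experiments usually inject delay at the network layer instead; this only shows the idea.

```python
import functools
import random
import time


def inject_latency(min_ms, max_ms, probability=1.0, sleep=time.sleep, rng=random):
    """Decorator that delays a call by a random amount to emulate a slow network."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Delay only a fraction of calls so the experiment stays controlled.
            if rng.random() < probability:
                delay_ms = rng.uniform(min_ms, max_ms)
                sleep(delay_ms / 1000.0)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(min_ms=100, max_ms=300, probability=0.5)
def fetch_price(symbol):
    """Hypothetical downstream call; stands in for a real network request."""
    return {"symbol": symbol, "price": 101.5}
```

Roughly half of the calls to `fetch_price` are then slowed by 100 to 300 ms, which is enough to see whether callers time out, queue up, or degrade gracefully.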
Fault Injection
This involves purposefully introducing errors into the system to determine how it affects other dependent systems and whether it interrupts services. Examples of fault injections include inducing disk failures, terminating processes, shutting down a host or introducing power or temperature increases. Fault injections can help organizations identify any single points of failure, which can cause the entire system to fail if something happens to them.
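To show how terminating components can expose single points of failure, here is a small simulation. It makes several assumptions: the topology is a made-up example, each service lists groups of interchangeable replicas, the dependency graph has no cycles, and `is_available` and `single_points_of_failure` are hypothetical helper names.

```python
def is_available(service, topology, down):
    """A service is up if it is not down and every dependency group has a live member."""
    if service in down:
        return False
    return all(
        any(is_available(replica, topology, down) for replica in group)
        for group in topology.get(service, [])
    )


def single_points_of_failure(entry, topology):
    """Terminate each component in turn and report which losses take down `entry`."""
    components = set(topology)
    for groups in topology.values():
        for group in groups:
            components.update(group)
    components.discard(entry)
    return sorted(c for c in components if not is_available(entry, topology, {c}))


# Hypothetical topology: each service maps to groups of interchangeable replicas.
TOPOLOGY = {
    "checkout": [["api-1", "api-2"], ["payments"]],
    "api-1": [["db-primary", "db-replica"]],
    "api-2": [["db-primary", "db-replica"]],
    "payments": [["db-primary"]],
}

if __name__ == "__main__":
    print(single_points_of_failure("checkout", TOPOLOGY))
```

In this example the API tier survives the loss of either replica, while `payments` and `db-primary` each take `checkout` down when terminated on their own.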
Load Generation
This involves intentionally stressing the system by sending traffic well beyond normal operating levels. This helps site reliability engineers (SREs) find bottlenecks in the system, which in turn allows them to build more scalable systems.
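A minimal load generator can be sketched with a thread pool. Everything here is an assumption for illustration: `run_load` and `EveryNthFails` are hypothetical names, the target is an in-process callable rather than a real service, and `total_requests` is expected to be positive.

```python
import itertools
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(target, total_requests, concurrency):
    """Call target() total_requests times from a thread pool and summarize the results."""
    def one_request(_):
        start = time.perf_counter()
        try:
            target()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for ok, _ in results if not ok)
    rank = (95 * len(latencies) + 99) // 100  # nearest-rank 95th percentile
    return {
        "requests": len(results),
        "errors": errors,
        "error_rate": errors / len(results),
        "p95_seconds": latencies[rank - 1],
    }


class EveryNthFails:
    """Hypothetical handler that rejects every n-th call, standing in for overload."""

    def __init__(self, n):
        self._n = n
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def __call__(self):
        with self._lock:
            call_number = next(self._counter)
        if call_number % self._n == 0:
            raise RuntimeError("simulated overload")


if __name__ == "__main__":
    print(run_load(EveryNthFails(10), total_requests=50, concurrency=8))
```

Raising `concurrency` and `total_requests` step by step, and watching where the error rate or 95th-percentile latency starts to climb, is one simple way to locate a bottleneck.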
Canary Testing
This involves releasing a new product or feature to a small group of users. That way, any glitches or bugs will only affect a percentage of visitors, leaving the rest of the audience to access the existing website experience.
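One common way to pick the "small group of users" is deterministic hashing, sketched below. The function name `in_canary` and the feature name in the usage example are hypothetical, and hashing the user ID is an assumption about how users are assigned; other schemes, such as routing by region, work too.

```python
import hashlib


def in_canary(user_id, feature, percent):
    """Deterministically place a stable `percent` of users in the canary group."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    exposed = [u for u in users if in_canary(u, "new-checkout", 5)]
    print(f"{len(exposed)} of {len(users)} users see the new feature")
```

Because the bucket depends only on the feature and user ID, a user stays in the same group across requests, and raising `percent` only adds users to the canary without removing any.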
Which experiments do you perform first?
People argue that we should perform experiments in the following order:
Known Knowns - Things you are aware of and understand
Known Unknowns - Things you are aware of but don't fully understand
Unknown Knowns - Things you understand but are not aware of
Unknown Unknowns - Things you are neither aware of nor fully understand
Principles of Chaos Engineering
Chaos engineering defines general principles to follow when designing and conducting experiments.
Define the system's steady-state
How does the system behave when it is steady? These definitions set the baseline for the experiments. The definition of steady state includes measurable outcomes defined using key performance indicators. Some examples of such KPIs are:
"The system latency is below 300ms."
"The error rate is below 3%."
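The two example KPIs above can be turned into an automated check, roughly as follows. This is a sketch under assumptions: the text does not say which latency statistic is meant, so the 95th percentile is used here, and `p95` and `check_steady_state` are hypothetical names.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def check_steady_state(latencies_ms, error_count, request_count,
                       max_p95_ms=300, max_error_rate=0.03):
    """Compare observed metrics with the steady-state thresholds."""
    observed_p95 = p95(latencies_ms)
    error_rate = error_count / request_count
    violations = []
    if observed_p95 >= max_p95_ms:
        violations.append(f"p95 latency {observed_p95}ms is not below {max_p95_ms}ms")
    if error_rate >= max_error_rate:
        violations.append(f"error rate {error_rate:.1%} is not below {max_error_rate:.1%}")
    return {"steady": not violations, "violations": violations}


if __name__ == "__main__":
    sample = [120] * 95 + [450] * 5  # made-up latency samples in milliseconds
    print(check_steady_state(sample, error_count=2, request_count=100))
```

Running the same check before, during, and after an experiment gives a simple yes-or-no signal for whether the system left its steady state.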
Create the hypothesis
A chaos experiment needs a hypothesis on how the system will behave if a chaotic situation arises in a production environment. It should be based on the established baselines and knowledge of the behavior and weaknesses of the system. When creating a hypothesis, ask โwhat ifโ questions or create statements on how the system should behave.
Examples include:
"If we increase the load by 1x, the system can handle it without issue."
"An increase in request latency will not impact the user experience."
"If the primary database is down, the system will automatically fail over to the secondary database with minimum downtime."
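The database failover hypothesis can be written as a tiny, self-checking experiment. The classes below are in-memory stand-ins, not a real database client, and all names (`Database`, `FailoverClient`, `run_failover_experiment`) are hypothetical. The sketch checks only that queries keep succeeding, not how long the failover takes.

```python
class Database:
    """In-memory stand-in for a database node that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def query(self, sql):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}: result of {sql}"


class FailoverClient:
    """Tries the primary first and falls back to the secondary on connection errors."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def query(self, sql):
        try:
            return self.primary.query(sql)
        except ConnectionError:
            return self.secondary.query(sql)


def run_failover_experiment(client, primary, probes=5):
    """Hypothesis: with the primary down, every probe query still succeeds."""
    primary.up = False  # inject the fault
    try:
        failures = 0
        for _ in range(probes):
            try:
                client.query("SELECT 1")
            except ConnectionError:
                failures += 1
        return {"hypothesis_holds": failures == 0, "failed_probes": failures}
    finally:
        primary.up = True  # always roll the fault back
```

If the secondary is also unavailable, the result reports the failed probes and the hypothesis is rejected, which is exactly the kind of finding the experiment exists to surface.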
Experiment by changing real-world conditions
Consider real-world scenarios or events that can push the system away from its steady state.
For example:
Events resulting in hardware and software failures.
High-network latency and error rates.
Network traffic spikes.
Varying these conditions helps identify vulnerabilities and ensures that the system can handle different scenarios.
Run automated experiments in production environments
Earlier environments such as development, staging, and pre-production do not fully simulate actual production systems. That's why chaos engineering experiments run in actual production systems under controlled conditions. A 2021 Gremlin survey found that only 34% of companies actually ran chaos experiments on their production servers.
Minimizing the Blast Radius
Since chaos engineering experiments are conducted within real production environments, it is crucial to minimize any potential performance degradation or disruptions that customers may experience during their execution. The blast radius should be defined using metrics, for example:
The number of affected users
Impacted locations
Workload quantities
Therefore, it is advisable to schedule these experiments during non-peak times and ensure the availability of backup systems for restorations.
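One way to keep the blast radius small in code is to cap how many hosts an experiment may touch and to abort as soon as a health metric crosses a limit. The sketch below assumes hypothetical callbacks (`inject`, `rollback`, `error_rate`) that a real setup would connect to its own tooling and monitoring; `pick_targets` and `run_with_guard` are illustrative names.

```python
import random


def pick_targets(hosts, max_percent, rng=random):
    """Choose at most max_percent of the hosts (and at least one) for the experiment."""
    if not hosts:
        return []
    limit = max(1, len(hosts) * max_percent // 100)
    return rng.sample(hosts, limit)


def run_with_guard(targets, inject, rollback, error_rate, abort_above=0.03):
    """Inject the fault host by host; roll everything back once errors exceed the limit."""
    affected = []
    for host in targets:
        inject(host)
        affected.append(host)
        if error_rate() > abort_above:
            for injected_host in reversed(affected):
                rollback(injected_host)
            return {"aborted": True, "affected": affected}
    return {"aborted": False, "affected": affected}


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(20)]
    targets = pick_targets(fleet, max_percent=10)
    outcome = run_with_guard(
        targets,
        inject=lambda host: print("injecting fault on", host),
        rollback=lambda host: print("rolling back", host),
        error_rate=lambda: 0.0,  # placeholder; a real check would query monitoring
    )
    print(outcome)
```

The number of affected users or locations could be capped in the same way, by filtering the candidate list before calling `pick_targets`.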
Chaos Engineering Tools
Chaos engineering tools are a relatively new alternative to traditional testing methods used to establish confidence in systems. Software platforms will inevitably fail, so it's critical to pinpoint weaknesses and fix them before they negatively impact business operations. By testing assumptions through chaos experiments, these tools can provide a roadmap for uncovering infrastructure failures or unresponsive systems.
Chaos Monkey
Pros
Configurable technology allows for easy monitoring and scheduling of attacks
Open-source software has no licensing costs
Extensive development history
Cons
Can only perform one type of experiment
Attacks are randomized and users have limited control of the blast radius
Requires writing custom code
Key Features
Detects system bottlenecks to help limit disruption to production environments
Tests the resiliency and availability of applications at the infrastructure level
Tests can be scheduled during certain timeframes
Allows for easy monitoring
Cost
As open-source software, Chaos Monkey is free to use without a commercial license.
"Netflix's Chaos Monkey, an open-source tool, tests system resilience by randomly killing virtual machines, pushing teams to prepare for chaos."
Should I use Chaos Monkey?
Chaos Monkey is a popular chaos engineering tool. While it may have revolutionized the open-source community, it is far less practical today. Chaos Monkey is useful to an extent, but users must take into account its limitations and arduous deployment.
Gremlin
Pros
Easy-to-use UI offers a variety of attacks and tests and helps onboard teams
API support for building custom integrations
Evaluates reliability based on a variety of different factors
Cons
Software is not customizable
Challenging to integrate experiment JSON files into the software delivery pipeline
Minimal reporting capabilities
Key Features
Injects failures in a precise and controlled manner
Custom scenarios that combine multiple levels of system attacks
Testing process for memory leaks, latency injections, disk fill-ups, and more
GameDay feature
Reliability score based on predefined tests
Cost
Gremlin's pricing has fluctuated over the years, ranging from per-agent pricing to attacks per target, to support the frequency of testing required by a team.
"Gremlin empowers you to proactively root out failure before it causes downtime."
Should I use Gremlin?
As the world's first managed enterprise chaos engineering technology, Gremlin provides users with the ability to launch dozens of attack vectors, stop and roll back attacks, and improve system reliability. Designed with the mission of creating a sustainable and reliable internet, Gremlin pinpoints software weaknesses to minimize revenue loss and negative systemic impacts.
Chaos Mesh
Pros
Easy-to-use functionality and automation.
The user interface supports many different configurations.
Experiments can be paused and resumed
Cons
Experiments run indefinitely as there is no ability to schedule attacks
Node-level attacks cannot be run
Cannot control user access within the dashboard; as a result, there are increased security risks
Key Features
Chaos Mesh uses a Kubernetes-based interface with full automation and graphical capabilities, and is used in testing high-visibility distributed systems such as Apache APISIX and RabbitMQ
Chaos Mesh technology is able to test various scenarios using event-driven fault simulations
Chaos Mesh provides the ability to design experiments on the platform using different variables and status checks
Cost
As an open-source chaos tool, Chaos Mesh is free to use without a commercial license.
"Chaos Mesh, an open-source tool with a handy Chaos Dashboard, lets teams test system weaknesses in Kubernetes by simulating faults like network delays and resource spikes across development, testing, and production."
Should I use Chaos Mesh?
Chaos Mesh offers an open-source technology that can be used in Kubernetes to design and manage automated experiments. However, be wary of certain limitations of the technology. Predicting failures can be a cumbersome task due to the complexities in cloud operations. Unreliable functions and outages can result in a downgraded reputation and a loss of consumer trust.
Conclusion
In conclusion, chaos engineering continues to play a pivotal role in ensuring the resilience and reliability of modern software systems, especially distributed systems, offering organizations valuable insights and tools to proactively address potential failures and disruptions.
#ChaosEngineering breaks systems on purpose to make them unbreakable in practice.
"Because engineers are allowed a full night's sleep."
Inspirations and References
What is Chaos Engineering: History, Principles and Best Practices - Lambda Test
Chaos Engineering: the history, principles, and practice - Gremlin
Principles of Chaos Engineering
Benefits, Best Practices, and Challenges - Splunk
Chaos Engineering Companies, People, Tools, and Practices
National Australia Bank deploys Chaos Monkey to kill servers 24/7
ASX Trade Outage Report
Curious for more? Check out Chaos Engineering FAQs for extra goodies!