Chaos Engineering in Shift-right testing

Tien Nguyen Anh

1. What is Chaos Engineering?

Chaos engineering is the technique which will inject the fault into the system to test its resilience and ability to recover. It’s designed to test the weakness or vulnerability of our application during unexpected conditions like network issues, hardware problems or high traffic load.

Chaos engineering will be implemented based on the following principles

Build a Hypothesis: We need to define the weak points which can impact the stability of our system. Create a hypothesis that the system will continue to operate even in the event of certain failures.
Introduce Controlled Chaos: We’ll simulate the failure scenarios like shutting down servers, cutting off or reducing network connections, or database performance issues in production or a production-like environment. These failures should be carried out randomly.
Minimize the Blast Radius: When we start Chaos Engineering, we should define the small scope. We can start by targeting a specific microservice, feature, or environment before expanding to a larger system so that any unforeseen issues don’t cause serious damage.
Monitor the System: While running Chaos Engineering, we should monitor how the system behaves to the failures. Metrics like logs, and user feedback should be collected to understand the impact of the failure on performance, reliability, and user experience.
Automate Experiments: The activities in Chaos Engineering should be automated and integrated with CI/CD tools to ensure that resilience is continuously validated.
Learn from Failures: Based on the results of Chaos Engineer activities, we can create plans to improve our product as well as prepare for handling the failure better in the future.

2. Benefits of Chaos Engineering

Improved System Resilience: By determining the weakness of the system proactively, we can make it more robust and fault tolerant.
High Availability: According to the chaos experiments, we can fix issues and improve the system before they occur on production environments. Thanks to this, our product availability can be high.
Faster Recovery: Team will be aware of how their system behaves under stress, and it will help them to develop better strategies for scalability and recovery.
Increased Confidence: Chaos testing ensures that your system can handle failures gracefully, giving teams and stakeholders confidence in its reliability.
Better Incident Response: As a result of chaos engineering, our team will have a stronger knowledge of different failure kinds, and we can create more efficient strategies for the actual incidents.

3. Step of implementing Chaos Engineering

3.1 Identify Critical Systems and Services

Based on the requirements of our product, we can identify the important components. They include the infrastructure or applications which must always be available like:

Core application service: authentication and authorization service, microservices, API Gateways and Load Balancer
Databases
Networking
Message Queue service
Third Party Dependences
Monitoring, Log and Alert
CI/CD pipeline system

3.2 Develop a Hypothesis

Create hypothesis for specific conditions. Here are some popular hypotheses we can consider

If a database instance goes down, the application should continue running without downtime because of load balancing.
If one or more nodes fail, the system should automatically re-route traffic or redistribute workloads without service interruption
If a dependent service experiences latency, the application should degrade gracefully and maintain core functionality
If a sudden increase in traffic occurs, the system should scale up resources automatically to handle the load
If a critical component fails, the monitoring and alerting system should trigger alerts within a specific timeframe.
If a network partition occurs, the system should continue operating with minimal disruption and recover gracefully when the network is restored
If critical services are restarted, they should return to a healthy state and resume their intended functions without manual intervention.
If there is a data loss or corruption event, the system should be able to recover quickly using backups with minimal data loss

3.3 Set Up Monitoring and Observability

Nowadays, there are many tool stacks supporting us to monitor our application as well as its infrastructure. It can help us track metrics like system health, error rates, response times, and traffic patterns during the tests. We can choose various kinds of tool like ELK stack, Data Dog, Prometheus, Grafana, Checkly or New Relic. Please make sure that the alert mechanism is also enable so that we can get the health status as soon as possible.

3.4 Design Chaos Experiments

Based on the hypothesis, we can select chaos scenarios that simulate real-world failures. Common experiments include:

Killing random microservices or processes.
Simulating network failures (high latency, dropped packets).
Introducing memory or CPU exhaustion on servers.
Disrupting dependencies like databases or third-party APIs.
Create huge data in the database
Simulate data corruption or deletion, and test whether data can be restored from recent backups.

3.5 Execute Experiments in Controlled Environments:

Start small by testing in staging environments that mimic production. Of course, we can do it manually, create automation script or use tools like Chaos Monkey, Gremlin, or LitmusChaos to execute controlled failures.

3.6 Analyze Results

After each chaos experiment, review the system’s behavior.

Did it fail gracefully?
Did failover mechanisms work?
Were users impacted?

Based on the problem, we can identify bottlenecks and weaknesses.

3.7 Iterate and Improve

After analyzing, we can create a plan for improving our application to overcome the occurred problem. After improving, we should retest to check if the issues still happen.

3.8 Automate Chaos Experiments

We can use tools or develop the automation script for carrying out the Chaos Experiments. Then, we can integrate them with the CI/CD tool so that we can test system periodically. Especially, if there is new development, the pipeline for Chaos testing can be triggered immediately.

3.9 Expand to Production:

Once we have confidence in our chaos engineering practices in a staging environment, we can consider expanding chaos experiments to production. Of course, we still start with low-risk experiments that won’t disrupt users and minimize the blast radius by focusing on individual microservices or small user groups.

4. Popular tool for chaos engineering

Tool	Description	Features	Best For	Platform
Gremlin	Controlled fault injection platform	Easy-to-use GUI, pre-defined failure scenarios (e.g., latency, DNS failure), supports hybrid environments	Robust Chaos Engineering solution	Linux, Kubernetes, Windows, Cloud
Chaos Monkey	Tool by Netflix to randomly terminate instances	Focused on killing random instances, part of Netflix’s Simian Army	Microservices architecture on AWS	AWS, Spinnaker, compatible cloud
LitmusChaos	Kubernetes-native Chaos Engineering tool	Extensive Kubernetes experiments, CI/CD integration, ChaosHub library	Open-source for Kubernetes environments	Kubernetes, Cloud-native
Chaos Mesh	Open-source, advanced chaos tool for Kubernetes	Web dashboard, advanced Kubernetes tests (e.g., latency, IO chaos, time-based tests)	Kubernetes users	Kubernetes
AWS Fault Injection Simulator (FIS)	Fully-managed chaos service for AWS	Integration with AWS (e.g., EC2, RDS), templates for network latency and instance termination, CloudWatch monitoring	AWS-centric organizations	AWS
Azure Chaos Studio	Managed Chaos Engineering service for Azure	Faults targeted at VMs and PaaS, Azure Monitor integration	Azure-focused teams	Azure
Toxiproxy	Proxy tool for network fault simulation	Injects latency, bandwidth constraints, and network failures, configurable API	Network-related fault testing	Any platform supporting Docker
Steadybit	Platform for uncovering reliability gaps in complex systems	Scenario-based experiments (e.g., network disruptions, CPU spikes), supports cloud and container environments, visual interface for orchestration	Distributed systems	Kubernetes, Cloud, On-premises
Pumba	Open-source chaos tool for Docker	CLI-based, Docker container failures, network delays, integrates with CI/CD	Docker container testing	Docker, compatible with Kubernetes
PowerfulSeal	Open-source tool for Kubernetes	Automates chaos scripts, modes for interaction, network failure and pod-killing experiments	Basic Kubernetes chaos experiments	Kubernetes

Conclusion

Chaos engineering allows us to simulate adverse conditions so that we can test how our system behaves to them and improve the system’s overall resilience. In the next articles, we’ll deep delves into some specific tools to see how we can leverage them to implement Chaos engineering.

Tien Nguyen Anh

I'm an Automation Test Manager with more than 10 years in software testing and development. Currently, I'm responsible for managing automation testing team, building their skills and supporting them to overcome issues. I also research the new automation testing technologies to share with team or conduct the training in NashTech.

Solutions

Industry

Our thinking

Chaos Engineering in Shift-right testing

Tien Nguyen Anh

Table of Contents

1. What is Chaos Engineering?

2. Benefits of Chaos Engineering

3. Step of implementing Chaos Engineering

3.1 Identify Critical Systems and Services

3.2 Develop a Hypothesis

3.3 Set Up Monitoring and Observability

3.4 Design Chaos Experiments

3.5 Execute Experiments in Controlled Environments:

3.6 Analyze Results

3.7 Iterate and Improve

3.8 Automate Chaos Experiments

3.9 Expand to Production:

4. Popular tool for chaos engineering

Conclusion

Tien Nguyen Anh

Leave a Comment Cancel Reply

Suggested Article

NashTech

Solutions

Useful links

Connect with us

Our achievements