NashTech Blog

Chaos Engineering with Azure Chaos Studio – Part 2

Table of Contents

Demonstrating Fault Injection on Real Application

Rebooting Azure Cache for Redis and Stopping App Service under Load

In [Part 1] of this series, I introduced Chaos Engineering and Azure Chaos Studio, and we walked through a simple experiment to understand how our system behaves when a virtual machine is shut down.

In this Part 2, we move closer to real-world scenarios. Instead of breaking infrastructure in isolation, we run two end-to-end chaos experiments against a live web application that’s receiving load:

  • Demo 1 – Reboot Azure Cache for Redis while the application is serving traffic.
  • Demo 2 – Stop the application service itself and observe the impact.

Both demos use the same environment, but they answer different questions:

  • What happens if a critical dependency (Redis) is rebooted?
  • What happens if the application itself stops?

The goal is not just to “break things for fun”, but to build confidence that our system can handle failures in a controlled, observable way.


Experiment setup

The playground for both demos is a simple but realistic Azure environment:

  • Azure App Service hosting a Python Flask web application
  • Azure Cache for Redis used as a caching layer
  • Azure Load Testing to continuously generate traffic
  • Azure Load Testing Dashboard to visualize load results and application insights
  • Azure Chaos Studio to inject controlled faults
  • A “Chaos Engineer” (you and me) orchestrating experiments and analysing the outcome

In the first demo, our chaos experiment targets Azure Cache for Redis.

Demo 1: Rebooting Azure Cache for Redis

Flow:

  • Azure Load Testing sends steady traffic to the Application and collects telemetry.
  • The Application reads and writes cache data to Azure Cache for Redis on each request.
  • An Azure Chaos Experiment is configured to reboot the Redis cache instance.
  • The Chaos Engineer triggers the experiment and monitors behaviour.
  • The Azure Load Testing Dashboard shows load test results and application insights, including any impact on availability.

Demo 2: Stopping the App Service

In the second demo, the chaos experiment targets the application itself.

Flow:

  • Azure Load Testing generates load and sends requests to the Application.
  • The Chaos Engineer triggers an Azure Chaos Experiment that stops the App Service.
  • The Application becomes unavailable for a period.
  • The Azure Load Testing Dashboard visualizes how the success rate and error rate respond to the outage and recovery.

With the architecture in place, let’s go through each demo step by step.


Demo 1 – Rebooting Azure Cache for Redis

Objective

The key objective for Demo 1 is:

Reboot Azure Cache for Redis and verify that the web application remains available while the cache is temporarily unavailable.

More specifically, we want to see:

  • Azure Chaos Studio successfully injects a reboot fault into the Redis instance.
  • Application availability and user experience do not degrade significantly during the reboot window.

If Redis goes away for a short period and the application continues to handle requests gracefully, we can say our system has reasonable resilience to cache failures.

Preparing the application

The Python Flask web app is deployed to Azure App Service. To make this chaos experiment meaningful, the app is explicitly wired to depend on Redis:

  • On each incoming request, the app connects to Azure Cache for Redis.
  • It performs a simple operation (for example, incrementing a counter or storing a timestamp).
  • Only after the Redis operation succeeds (or is handled) does the app return a response.

This means Redis is not just a background component — it is part of the request path. If Redis fails and we do not handle it properly, we will see:

  • Application errors
  • Timeouts
  • Drops in availability

Exactly what we want to observe in a controlled way.
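The request path described above can be sketched in a few lines. This is a minimal stand-in, not the actual demo app: the `InMemoryCache` class is a hypothetical substitute for Azure Cache for Redis (the real app would use a Redis client such as redis-py against the cache endpoint), but it shows the key design decision that the cache call sits directly on the request path.

```python
class InMemoryCache:
    """Hypothetical stand-in for Azure Cache for Redis.
    A real deployment would use redis-py against the cache endpoint."""
    def __init__(self):
        self._store = {}
        self.available = True  # toggled to simulate a reboot

    def incr(self, key):
        if not self.available:
            raise ConnectionError("cache unreachable")
        self._store[key] = self._store.get(key, 0) + 1
        return self._store[key]

cache = InMemoryCache()

def handle_request():
    """Request handler: the cache operation happens before the response,
    so a cache failure becomes a failed request unless handled."""
    count = cache.incr("hits")  # raises if the cache is down
    return {"status": 200, "hits": count}
```

Flipping `cache.available` to `False` makes every request raise, which is exactly the failure mode the Redis reboot will exercise.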

Generating load with Azure Load Testing

Next, we configure an Azure Load Testing test:

  • Target URL: the main endpoint of the web application.
  • Load pattern: a modest but steady number of virtual users.
  • Duration: long enough to cover the entire chaos experiment, including a buffer before and after.

When the load test runs, Azure Load Testing:

  • Continuously sends traffic to the application.
  • Collects metrics such as response time, throughput, and error rate.
  • Feeds results into the Azure Load Testing Dashboard for live monitoring.

This gives us a realistic “user traffic” baseline before we start injecting failures.
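Azure Load Testing does this at scale with managed infrastructure; conceptually, though, the baseline boils down to a loop that sends steady traffic and records latency and error rate. Here is a small stdlib-only sketch of that idea, where `send_request` is a placeholder for an HTTP call to the app's endpoint:

```python
import statistics
import time

def run_load(send_request, n_requests=50):
    """Send a steady stream of requests and collect simple metrics,
    mimicking what the Azure Load Testing dashboard reports."""
    latencies, errors = [], 0
    for _ in range(n_requests):
        start = time.perf_counter()
        try:
            send_request()
        except Exception:
            errors += 1
        latencies.append(time.perf_counter() - start)
    return {
        "requests": n_requests,
        "error_rate": errors / n_requests,
        "p50_latency_s": statistics.median(latencies),
    }

# Hypothetical always-healthy endpoint for the baseline run:
baseline = run_load(lambda: None)
```

Running this before injecting any fault gives the healthy baseline we compare against once chaos starts.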

Configuring the Chaos Experiment – Redis Reboot

In Azure Chaos Studio, we then create a Chaos Experiment with the following configuration:

  • Target resource: the Azure Cache for Redis instance used by the application.
  • Fault: “Reboot cache instance” – instructs Azure to reboot the Redis cache.
  • Step timing: a predefined duration indicating when the reboot happens and how long we observe after that.

We also ensure that the necessary permissions are granted so Chaos Studio can perform actions on the Redis resource. Without this, the experiment would appear to run but no actual fault would be injected.

Running Demo 1

The sequence of the demo looks like this:

  1. Start Azure Load Testing and confirm that the system is healthy:
    • Low error rate
    • Consistent response times
    • Application reachable
  2. Trigger the Azure Chaos Experiment to reboot the Redis cache.
  3. During the reboot window, closely watch:
    • The Azure Load Testing Dashboard for spikes in errors or latency.
    • App Service metrics such as availability, requests, and HTTP status codes.

The Redis instance goes through a reboot cycle while requests continue to hit the application.

Observations

There are two main outcomes we’re interested in:

  1. Behaviour of Redis
    • The cache instance is successfully rebooted by Chaos Studio.
    • During this period, connections from the application may fail or time out.
  2. Behaviour of the Application
    • Does the app handle Redis failures gracefully with retries or fallbacks?
    • Or does every failed cache call translate into a failed user request?

If availability remains high and users hardly notice the Redis reboot, we’ve validated that the application is resilient to cache interruptions.

If availability or success rate drops significantly, we’ve discovered a resilience gap to address (e.g., adding retry policies, cache fallbacks, or better error handling).
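One way to close that gap is to wrap cache calls in a retry-with-fallback helper, so a rebooting cache degrades the experience (a cache miss) instead of failing the request. This is a sketch under assumptions, not the demo app's actual code; the flaky cache below simulates a reboot that resolves after two failed attempts:

```python
import time

def get_with_fallback(fetch, retries=2, delay_s=0.0, default=None):
    """Call the cache `fetch` with retries; fall back to a default
    so a cache outage does not become a failed user request."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt < retries:
                time.sleep(delay_s)
    return default

# Simulated flaky cache: fails twice (mid-reboot), then recovers.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("cache rebooting")
    return "cached-value"

value = get_with_fallback(flaky_fetch, retries=2)  # succeeds on the third attempt
```

In production you would also add backoff and a cap on total wait time; libraries such as redis-py expose retry options for exactly this scenario.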


Demo 2 – Stopping the App Service

If Demo 1 focuses on dependency resilience, Demo 2 focuses on observability and incident visibility.

Objective

The objective for Demo 2 is:

Stop the web application (App Service) while load is running, and verify that our monitoring clearly reflects the outage.

Here we expect user impact; the app is being deliberately stopped. The goal is not to avoid the impact, but to ensure we:

  • Detect it quickly.
  • See a clear signal in our dashboards and alerts.

Adjusting the Chaos Experiment – Stop App Service

For this demo, we keep the same environment and the same load test. The only change is the target of the chaos experiment:

  • Target resource: the Azure App Service that hosts the web application.
  • Fault: an action that stops or shuts down the App Service for a period.

The Chaos Engineer triggers this experiment while Azure Load Testing is actively sending requests.

Running Demo 2

  1. Launch Azure Load Testing and confirm that the application is handling requests normally.
  2. Trigger the Chaos Experiment that stops the App Service.
  3. Monitor the Azure Load Testing Dashboard and application metrics:
    • Success rate and error rate.
    • HTTP status codes (e.g., 503 and other 5xx responses) and request timeouts.
    • Availability curves and any configured alerts.

During the stop window, the application becomes unavailable to users. Once the experiment finishes and the service is started again, we continue monitoring to confirm:

  • How quickly the application recovers.
  • Whether any residual issues appear after restart.
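Both of these questions come down to reading the probe data correctly: given a time-ordered series of health checks, compute the success rate and the observed outage window. Here is a small illustrative helper (the sample data is hypothetical, standing in for what the load test actually records):

```python
def summarize_availability(samples):
    """Given time-ordered (timestamp_s, ok) probe results, compute the
    overall success rate and the first/last failing timestamps."""
    total = len(samples)
    ok_count = sum(1 for _, ok in samples if ok)
    failures = [t for t, ok in samples if not ok]
    outage = (min(failures), max(failures)) if failures else None
    return {
        "success_rate": ok_count / total,
        "outage_window_s": outage,
    }

# Hypothetical probe results: healthy, outage between 60 s and 90 s, recovered.
samples = [(0, True), (30, True), (60, False), (90, False), (120, True)]
report = summarize_availability(samples)
```

If the computed outage window is much longer than the configured stop duration, that gap is your recovery time, and it is worth investigating before a real incident forces the question.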

Observations

Unlike Demo 1, here we expect availability to drop sharply:

  • Requests start failing once the App Service is stopped.
  • The Load Testing Dashboard shows a spike in errors and a drop in success rate.
  • Availability charts in Azure Monitor clearly highlight the outage period.

This is valuable because it confirms that:

  • Our monitoring and dashboards accurately reflect reality.
  • We would notice a full application outage in production.
  • If alerts are configured, we can check whether they fire quickly and with the right severity.

If we don’t see a clear signal in our tools, that’s a strong indication that our observability needs improvement, even before we worry about making the system more resilient.


Key lessons from these experiments

Looking at these two experiments side by side gives a more complete picture of system reliability.

| Aspect | Demo 1 – Reboot Redis | Demo 2 – Stop App Service |
| --- | --- | --- |
| Target | Azure Cache for Redis | Azure App Service |
| Expected user impact | Minimal / hopefully none | Clear outage during experiment |
| Main focus | Resilience to dependency failure | Observability of core service outage |
| Key signals | Availability staying high, few errors | Availability drop, spike in errors |

Together, they answer two fundamental questions:

  1. Can our application survive a dependency failure?
  2. When something really bad happens, do we see it immediately?

Running these experiments regularly helps teams:

  • Build confidence that their systems behave as designed under stress.
  • Identify weaknesses early, when it’s still cheap to fix them.
  • Encourage a culture of proactive reliability, instead of waiting for incidents to happen in production.

Closing thoughts – wrapping up my Chaos Engineering journey

This post marks the end of my small Chaos Engineering research series, but definitely not the end of learning.

Across these experiments, a few themes kept repeating themselves:

  • Chaos Engineering is not about random destruction. It is a disciplined way to turn scary “what if?” questions into concrete “we’ve tested this” answers.
  • Tools like Azure Chaos Studio, Azure Load Testing, and Azure Monitor work best together: one injects failure, one simulates users, and one tells you the truth.
  • The most valuable outcome is not the fancy dashboard; it’s the conversations that follow:
    • Why did this fail?
    • What do we want the system to do next time?
    • How can we make this kind of incident boring and routine?

If you’ve followed along from Part 1 to this final demo, my hope is that you now feel confident enough to start with a small, low-risk experiment of your own.

References

  • What is Azure Chaos Studio? | Microsoft Learn
  • Create and run a chaos experiment by using Azure Chaos Studio | Microsoft Learn
  • Images from the Internet and images created by Gemini.


Vy Au

Many years of experience in software testing (web application testing, application testing, mobile testing, API testing).
