In the digital banking era, AI chatbots have evolved from optional tools to critical front-line assistants, handling thousands of user queries every day. However, building a chatbot that understands natural language isn’t enough. It must also be fast, stable, and scalable under real-world load.
This blog shares a performance testing journey for a banking AI chatbot expected to handle about 150,000 daily questions from 15,000 active users. We’ll walk through objectives, test design, results, bottlenecks, and actionable takeaways.
Test Objectives
The purpose of this performance testing was to evaluate the system’s ability to handle expected and peak workloads, identify potential bottlenecks, and ensure long-term stability. The testing was specifically designed to answer the following key questions:
- Can the system handle peak load of 30,000 requests/hour (~8 requests/second)?
- Does the system degrade over time under continuous traffic?
- How do key performance indicators like response time, throughput, and error rate evolve?
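As a quick sanity check, the traffic figures above are consistent with each other. A short Python sketch (all numbers taken directly from this post):

```python
# Sanity-check the load figures: 150,000 daily questions, 15,000 users,
# and a peak of 30,000 requests/hour.
daily_questions = 150_000
active_users = 15_000
peak_per_hour = 30_000

avg_per_user = daily_questions / active_users   # questions per user per day
avg_rps = daily_questions / (24 * 3600)         # average requests/second
peak_rps = peak_per_hour / 3600                 # peak requests/second

print(f"{avg_per_user:.0f} questions/user/day")  # 10
print(f"average {avg_rps:.2f} req/s")            # 1.74
print(f"peak    {peak_rps:.2f} req/s")           # 8.33
```

So the peak target works out to roughly 8.3 requests/second, about 5x the average rate.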
Metrics Measured:
To evaluate the system’s behavior under load, we monitored several key performance metrics. These metrics help us understand how well the system performs, identify potential bottlenecks, and ensure it meets user expectations.
- Throughput: The number of requests the system can process per second. This shows how much traffic the system can handle at any given time.
- Response Time: The total time taken to complete a request. We measured average, median, and percentile values (90th, 95th, and 99th) to understand both typical and worst-case performance.
- Latency: The time from when a request is sent to when the server starts to respond (also known as time to first byte). It reflects system responsiveness.
- Error Rate: The percentage of failed requests out of total requests. A high error rate may indicate that the system is overwhelmed or misconfigured.
- CPU & Memory Usage: Server-level monitoring of resource usage. High CPU or memory consumption may signal performance issues, especially under high load.
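For readers reproducing this analysis outside JMeter, the listed metrics can be computed from raw per-request timings roughly as follows. This sketch uses the nearest-rank percentile definition; JMeter's own HTML report applies its own percentile calculation, so small differences are expected:

```python
import statistics

def summarize(samples_ms, errors, total):
    """Compute the report metrics from raw per-request timings in milliseconds."""
    s = sorted(samples_ms)
    def pct(p):
        # Nearest-rank percentile: one of several common definitions.
        return s[min(len(s) - 1, int(len(s) * p / 100))]
    return {
        "avg_ms": statistics.mean(s),
        "median_ms": statistics.median(s),
        "p90_ms": pct(90), "p95_ms": pct(95), "p99_ms": pct(99),
        "error_rate_pct": 100.0 * errors / total,
    }

# Illustrative timings only -- not data from the actual test run.
report = summarize([120, 150, 180, 200, 950], errors=1, total=5)
print(report)
```

Note how a single slow outlier (950 ms) leaves the median untouched but dominates the 90th/95th/99th percentiles, which is exactly why the test tracked percentiles alongside averages.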
Test Configuration & Setup
To ensure meaningful and reliable results, the performance test was conducted in a controlled environment that closely mirrors production. Below are the key elements of the test setup:
- Environment: The test was executed in the User Acceptance Testing (UAT) environment, which is configured to match the production environment in terms of system architecture and deployment settings.
- Infrastructure: The application was hosted across 3 servers, each with:
- 4 CPU cores
- 16 GB RAM
This setup helped simulate real-world resource allocation and observe load distribution across multiple instances.
- Authentication: The test used pre-generated ADFS tokens stored in a CSV file, allowing it to simulate authenticated user sessions without logging in at runtime.
- Test Data: A pool of randomized test questions was prepared in a CSV file to simulate a variety of input scenarios and avoid response-caching effects.
- Tool: The performance tests were conducted using Apache JMeter, an open-source tool widely used for load testing and measuring performance across different scenarios.
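A test-data file like the one described can be generated with a short script. This is a hypothetical sketch: the question pool, file name, and random-suffix scheme are illustrative, not the actual test data. The output is the kind of file JMeter's CSV Data Set Config element consumes:

```python
import csv
import random

# Hypothetical question pool; the real test used a larger banking-domain set.
QUESTIONS = [
    "What is my account balance?",
    "How do I reset my card PIN?",
    "What are the current deposit rates?",
    "How do I report a lost card?",
]

def build_test_data(path, rows=1000, seed=42):
    """Write a randomized question file for JMeter's CSV Data Set Config."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question"])
        for _ in range(rows):
            # Append a random suffix so no two prompts are byte-identical,
            # which defeats simple response caching on the backend.
            writer.writerow([f"{rng.choice(QUESTIONS)} (req-{rng.randint(1, 10**6)})"])

build_test_data("questions.csv", rows=100)
```

In JMeter, the CSV Data Set Config then feeds one row per sampler iteration via a `${question}` variable.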
Test Results & Analysis
Scenario 1 – 4 requests/sec | 5 minutes
Highlights:
- Started with ~25s average response time, increased steadily to >400s
- Throughput dropped as response time increased
- Latency grew from ~25,000ms to ~175,000ms
- Test stopped due to excessive delays
Conclusion: System shows early signs of overload even at moderate sustained traffic.
Scenario 2 – 8 requests/sec | 5 minutes
Highlights:
- Response time started high (~100s) and climbed continuously
- Error rate spiked to 79.88%
- Target throughput (TPS) couldn’t be sustained due to backend strain
Conclusion: The system cannot handle high concurrency and collapses under pressure.
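Both scenarios use an open-model arrival pattern: requests keep arriving at a fixed rate regardless of how slowly the backend responds, which is exactly what lets queues, and therefore response times, grow without bound once the system falls behind. A minimal asyncio sketch of that pattern (the real test used JMeter; `fake_request` is a stand-in for the chatbot endpoint, not the actual call):

```python
import asyncio
import time

async def fake_request(latency_s=0.05):
    """Stand-in for the real chatbot HTTP call (hypothetical endpoint)."""
    await asyncio.sleep(latency_s)
    return 200

async def constant_rate_load(rps, duration_s):
    """Fire requests at a fixed arrival rate (open model): new requests are
    launched on schedule even if earlier ones haven't finished."""
    interval = 1.0 / rps
    tasks = []
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        tasks.append(asyncio.create_task(fake_request()))
        await asyncio.sleep(interval)
    return await asyncio.gather(*tasks)

results = asyncio.run(constant_rate_load(rps=8, duration_s=1.0))
print(len(results), "requests sent")  # roughly rps * duration_s
```

Because arrivals never slow down, any backend that cannot keep up accumulates in-flight requests, which matches the steadily growing latency observed above.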
Key Observations
| Area | Observation |
|---|---|
| Response Time | Increases over time, reaching critical levels (up to 1,000s) |
| Throughput | Starts stable, drops as system becomes overloaded |
| Error Rate | Low under short duration, spikes with sustained or high traffic |
| Latency | Grows steadily, indicates possible queue build-up or backend bottlenecks |
Recommendations for Optimization
- Auto-scaling: Dynamically allocate resources based on real-time load
- Real-time Monitoring: CPU, memory, latency, error spikes
- Proactive Alerts: Set up alerts for key performance thresholds
- Smart Caching: Cache common user questions to avoid AI API calls
- Batch Load Optimization: Avoid traffic bursts by smoothing load across time
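To illustrate the smart-caching idea: normalize near-duplicate questions onto one cache key so only the first occurrence pays for an AI backend call. The normalization rule and backend stub below are hypothetical, a minimal sketch rather than a production design (a real deployment would also need expiry and per-user isolation for personalized answers):

```python
import re

def normalize(question: str) -> str:
    """Collapse case, punctuation, and stray whitespace so near-duplicate
    questions map to the same cache key."""
    return re.sub(r"[^a-z0-9 ]", "", question.lower()).strip()

CACHE = {}

def answer(question: str, call_ai_backend) -> str:
    key = normalize(question)
    if key not in CACHE:               # only pay for the AI call on a miss
        CACHE[key] = call_ai_backend(question)
    return CACHE[key]

# Stub backend that counts how many times it is actually invoked.
calls = 0
def backend(q):
    global calls
    calls += 1
    return f"answer to: {q}"

print(answer("What is my balance?", backend))
print(answer("what is my BALANCE", backend))  # cache hit, no second AI call
print(calls)  # 1
```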
Compute Unit (CU) Benchmarking
To determine optimal scalability, we defined:
- 1 Compute Unit (CU) = 1 vCPU + 2GB RAM
- Each CU can handle up to 10 concurrent users (CCUs)
Benchmark Scenarios:
| Case | CUs | CCUs | Result |
|---|---|---|---|
| Case 1 | 40 | 400 | ✅ Stable, 0% error, faster response |
| Case 2 | 40 | 600 | ❌ Higher errors, slower response, lower throughput |
Based on these results, we allocated 40 compute units (CUs) to support 400 concurrent users (CCUs). Performance tests confirmed that this configuration handled the load reliably and delivered a smooth user experience, and ongoing monitoring of system metrics indicated headroom to add CUs later as demand grows.
We asked the question: Can the system handle 1,000 concurrent users (CCUs) with 100 compute units (CUs)?
To find out, we ran a performance test with this setup. The results showed that performance did not scale linearly: the CU-to-CCU ratio observed at lower loads could not simply be extrapolated to larger scales.
Because of this, we need to adjust our approach. Instead of jumping directly to 1,000 CCUs, we should gradually increase the load in steps. At each step, we run performance tests to see how many users each CU can handle reliably. This helps us allocate the right number of CUs and ensures the system stays stable, even at higher loads.
After testing, we found that when the system scales up to 1,000 CCUs, the previous assumption of 1 CU per 10 CCUs no longer holds. Under high load, 1 CU can reliably support only 8 CCUs, due to increased pressure on CPU, memory, backend services, and other components.
To ensure stability, we also recommend adding a 20% performance buffer to account for unexpected spikes, system overhead, or minor inefficiencies.
Finally, we allocate 150 CUs to safely support 1,000 concurrent users, based on the test result of 1 CU per 8 CCUs with a 20% buffer.
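The sizing above reduces to a small calculation, sketched here with the buffer handled in integer math to avoid floating-point rounding surprises:

```python
import math

def required_cus(target_ccu, ccu_per_cu, buffer_pct=20):
    """CUs needed for a target concurrency, plus a safety buffer (percent)."""
    base = math.ceil(target_ccu / ccu_per_cu)
    return math.ceil(base * (100 + buffer_pct) / 100)

naive = 1000 // 10                 # old 1:10 ratio would suggest 100 CUs,
                                   # which the test at scale disproved
print(required_cus(1000, 8))       # 150: 1,000 / 8 = 125 CUs, +20% buffer
```

The same function reproduces the measured baseline: 1,000 CCUs at 8 CCUs per CU is 125 CUs, and the 20% buffer brings the allocation to 150.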
Final Thoughts
Performance testing an AI chatbot is not just about checking whether “it works”. It’s about ensuring it works well, at scale, under pressure. From short-lived spikes to sustained high traffic, your chatbot must be resilient.
If you’re deploying AI chatbots in your enterprise, remember: performance testing should not be an afterthought. It should be an integral part of your DevOps pipeline.