In the digital banking era, AI chatbots have evolved from optional tools to critical front-line assistants, handling thousands of user queries every day. However, building a chatbot that understands natural language isn’t enough. It must also be fast, stable, and scalable under real-world load.
This blog shares a performance testing journey for a banking AI chatbot expected to handle about 150,000 daily questions from 15,000 active users. We’ll walk through objectives, test design, results, bottlenecks, and actionable takeaways.
Test Objectives
The purpose of this performance testing was to evaluate the system’s ability to handle expected and peak workloads, identify potential bottlenecks, and ensure long-term stability. The testing was specifically designed to answer the following key questions:
- Can the system handle peak load of 30,000 requests/hour (~8 requests/second)?
- Does the system degrade over time under continuous traffic?
- How do key performance indicators like response time, throughput, and error rate evolve?
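As a quick sanity check, the traffic figures above are consistent with each other. A short Python sketch (all numbers taken directly from this post):

```python
# Sanity-check the load figures: 150,000 daily questions, 15,000 users,
# and a peak of 30,000 requests/hour.
daily_questions = 150_000
active_users = 15_000
peak_per_hour = 30_000

avg_per_user = daily_questions / active_users   # questions per user per day
avg_rps = daily_questions / (24 * 3600)         # average requests/second
peak_rps = peak_per_hour / 3600                 # peak requests/second

print(f"{avg_per_user:.0f} questions/user/day")  # 10
print(f"average {avg_rps:.2f} req/s")            # 1.74
print(f"peak    {peak_rps:.2f} req/s")           # 8.33
```

So the peak target works out to roughly 8.3 requests/second, about 5x the average rate.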
Metrics Measured:
To evaluate the system’s behavior under load, we monitored several key performance metrics. These metrics help us understand how well the system performs, identify potential bottlenecks, and ensure it meets user expectations.
- Throughput: The number of requests the system can process per second. This shows how much traffic the system can handle at any given time.
- Response Time: The total time taken to complete a request. We measured average, median, and percentile values (90th, 95th, and 99th) to understand both typical and worst-case performance.
- Latency: The time from when a request is sent to when the server starts to respond (also known as time to first byte). It reflects system responsiveness.
- Error Rate: The percentage of failed requests out of total requests. A high error rate may indicate that the system is overwhelmed or misconfigured.
- CPU & Memory Usage: Server-level monitoring of resource usage. High CPU or memory consumption may signal performance issues, especially under high load.
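For readers reproducing this analysis outside JMeter, the listed metrics can be computed from raw per-request timings roughly as follows. This sketch uses the nearest-rank percentile definition; JMeter's own HTML report applies its own percentile calculation, so small differences are expected:

```python
import statistics

def summarize(samples_ms, errors, total):
    """Compute the report metrics from raw per-request timings in milliseconds."""
    s = sorted(samples_ms)
    def pct(p):
        # Nearest-rank percentile: one of several common definitions.
        return s[min(len(s) - 1, int(len(s) * p / 100))]
    return {
        "avg_ms": statistics.mean(s),
        "median_ms": statistics.median(s),
        "p90_ms": pct(90), "p95_ms": pct(95), "p99_ms": pct(99),
        "error_rate_pct": 100.0 * errors / total,
    }

# Illustrative timings only -- not data from the actual test run.
report = summarize([120, 150, 180, 200, 950], errors=1, total=5)
print(report)
```

Note how a single slow outlier (950 ms) leaves the median untouched but dominates the 90th/95th/99th percentiles, which is exactly why the test tracked percentiles alongside averages.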
Test Configuration & Setup
To ensure meaningful and reliable results, the performance test was conducted in a controlled environment that closely mirrors production. Below are the key elements of the test setup:
- Environment: The test was executed in the User Acceptance Testing (UAT) environment, which is configured to match the production environment in terms of system architecture and deployment settings.
- Infrastructure: The application was hosted across 3 servers, each with:
- 4 CPU cores
- 16 GB RAM
This setup helped simulate real-world resource allocation and observe load distribution across multiple instances.
- Authentication: The test used pre-generated ADFS tokens stored in a CSV file, allowing it to simulate authenticated user sessions without logging in at runtime.
- Test Data: A pool of randomized test questions was prepared in a CSV file to simulate a variety of input scenarios and avoid response-caching effects.
- Tool: The performance tests were conducted using Apache JMeter, an open-source tool widely used for load testing and measuring performance across different scenarios.
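A test-data file like the one described can be generated with a short script. This is a hypothetical sketch: the question pool, file name, and random-suffix scheme are illustrative, not the actual test data. The output is the kind of file JMeter's CSV Data Set Config element consumes:

```python
import csv
import random

# Hypothetical question pool; the real test used a larger banking-domain set.
QUESTIONS = [
    "What is my account balance?",
    "How do I reset my card PIN?",
    "What are the current deposit rates?",
    "How do I report a lost card?",
]

def build_test_data(path, rows=1000, seed=42):
    """Write a randomized question file for JMeter's CSV Data Set Config."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question"])
        for _ in range(rows):
            # Append a random suffix so no two prompts are byte-identical,
            # which defeats simple response caching on the backend.
            writer.writerow([f"{rng.choice(QUESTIONS)} (req-{rng.randint(1, 10**6)})"])

build_test_data("questions.csv", rows=100)
```

In JMeter, the CSV Data Set Config then feeds one row per sampler iteration via a `${question}` variable.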
Test Results & Analysis
Scenario 1 – 4 requests/sec | 5 minutes
Highlights:
- Started with ~25s average response time, increased steadily to >400s
- Throughput dropped as response time increased
- Latency grew from ~25,000ms to ~175,000ms
- Test stopped due to excessive delays
Conclusion: System shows early signs of overload even at moderate sustained traffic.
Scenario 2 – 8 requests/sec | 5 minutes
Highlights:
- Response time started high (~100s) and climbed continuously
- Error rate spiked to 79.88%
- Target throughput (TPS) couldn’t be sustained due to backend strain
Conclusion: The system cannot handle high concurrency and collapses under pressure.
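Both scenarios use an open-model arrival pattern: requests keep arriving at a fixed rate regardless of how slowly the backend responds, which is exactly what lets queues, and therefore response times, grow without bound once the system falls behind. A minimal asyncio sketch of that pattern (the real test used JMeter; `fake_request` is a stand-in for the chatbot endpoint, not the actual call):

```python
import asyncio
import time

async def fake_request(latency_s=0.05):
    """Stand-in for the real chatbot HTTP call (hypothetical endpoint)."""
    await asyncio.sleep(latency_s)
    return 200

async def constant_rate_load(rps, duration_s):
    """Fire requests at a fixed arrival rate (open model): new requests are
    launched on schedule even if earlier ones haven't finished."""
    interval = 1.0 / rps
    tasks = []
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        tasks.append(asyncio.create_task(fake_request()))
        await asyncio.sleep(interval)
    return await asyncio.gather(*tasks)

results = asyncio.run(constant_rate_load(rps=8, duration_s=1.0))
print(len(results), "requests sent")  # roughly rps * duration_s
```

Because arrivals never slow down, any backend that cannot keep up accumulates in-flight requests, which matches the steadily growing latency observed above.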
Key Observations
| Area | Observation |
|---|---|
| Response Time | Increases over time, reaching critical levels (up to 1,000s) |
| Throughput | Starts stable, drops as system becomes overloaded |
| Error Rate | Low under short duration, spikes with sustained or high traffic |
| Latency | Grows steadily, indicates possible queue build-up or backend bottlenecks |
Recommendations for Optimization
- Auto-scaling: Dynamically allocate resources based on real-time load
- Real-time Monitoring: CPU, memory, latency, error spikes
- Proactive Alerts: Set up alerts for key performance thresholds
- Smart Caching: Cache common user questions to avoid AI API calls
- Batch Load Optimization: Avoid traffic bursts by smoothing load across time
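To illustrate the smart-caching idea: normalize near-duplicate questions onto one cache key so only the first occurrence pays for an AI backend call. The normalization rule and backend stub below are hypothetical, a minimal sketch rather than a production design (a real deployment would also need expiry and per-user isolation for personalized answers):

```python
import re

def normalize(question: str) -> str:
    """Collapse case, punctuation, and stray whitespace so near-duplicate
    questions map to the same cache key."""
    return re.sub(r"[^a-z0-9 ]", "", question.lower()).strip()

CACHE = {}

def answer(question: str, call_ai_backend) -> str:
    key = normalize(question)
    if key not in CACHE:               # only pay for the AI call on a miss
        CACHE[key] = call_ai_backend(question)
    return CACHE[key]

# Stub backend that counts how many times it is actually invoked.
calls = 0
def backend(q):
    global calls
    calls += 1
    return f"answer to: {q}"

print(answer("What is my balance?", backend))
print(answer("what is my BALANCE", backend))  # cache hit, no second AI call
print(calls)  # 1
```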
Compute Unit (CU) Benchmarking
To determine optimal scalability, we defined:
- 1 Compute Unit (CU) = 1 vCPU + 2GB RAM
- Each CU can handle up to 10 concurrent users (CCUs)
Benchmark Scenarios:
| Case | CUs | CCUs | Result |
|---|---|---|---|
| Case 1 | 40 | 400 | ✅ Stable, 0% error, faster response |
| Case 2 | 40 | 600 | ❌ Higher errors, slower response, lower throughput |
Based on these results, we allocated 40 compute units (CUs) to support 400 concurrent users (CCUs). Performance tests confirmed that this configuration handled the load reliably and delivered a smooth user experience, and ongoing monitoring of system metrics indicated headroom to add CUs later as demand grows.
We asked the question: Can the system handle 1,000 concurrent users (CCUs) with 100 compute units (CUs)?
To find out, we ran a performance test with this setup. The results showed that performance did not scale linearly: the CU-to-CCU ratio observed at lower loads could not simply be extrapolated to larger scales.
Because of this, we need to adjust our approach. Instead of jumping directly to 1,000 CCUs, we should gradually increase the load in steps. At each step, we run performance tests to see how many users each CU can handle reliably. This helps us allocate the right number of CUs and ensures the system stays stable, even at higher loads.
After testing, we found that when the system scales up to 1,000 CCUs, the previous assumption of 1 CU per 10 CCUs no longer holds. Under high load, 1 CU can reliably support only 8 CCUs, due to increased pressure on CPU, memory, backend services, and other components.
To ensure stability, we also recommend adding a 20% performance buffer to account for unexpected spikes, system overhead, or minor inefficiencies.
Finally, we allocate 150 CUs to safely support 1,000 concurrent users, based on the test result of 1 CU per 8 CCUs with a 20% buffer.
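The sizing above reduces to a small calculation, sketched here with the buffer handled in integer math to avoid floating-point rounding surprises:

```python
import math

def required_cus(target_ccu, ccu_per_cu, buffer_pct=20):
    """CUs needed for a target concurrency, plus a safety buffer (percent)."""
    base = math.ceil(target_ccu / ccu_per_cu)
    return math.ceil(base * (100 + buffer_pct) / 100)

naive = 1000 // 10                 # old 1:10 ratio would suggest 100 CUs,
                                   # which the test at scale disproved
print(required_cus(1000, 8))       # 150: 1,000 / 8 = 125 CUs, +20% buffer
```

The same function reproduces the measured baseline: 1,000 CCUs at 8 CCUs per CU is 125 CUs, and the 20% buffer brings the allocation to 150.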
Final Thoughts
Performance testing an AI chatbot is not just about checking whether “it works”. It’s about ensuring it works well, at scale, under pressure. From short-lived spikes to sustained high traffic, your chatbot must be resilient.
If you’re deploying AI chatbots in your enterprise, remember: performance testing should not be an afterthought. It should be an integral part of your DevOps pipeline.