Artificial intelligence (AI) is becoming a partner in our daily work: research, development, testing, and more. But as AI adoption grows, a concern follows: how do we evaluate a response when the AI's output is unpredictable? While traditional systems return fixed results, AI systems do not; their responses differ from run to run. This unpredictability makes validation difficult.
Validating responses from an AI system with unpredictable output requires a shift from traditional pass/fail checks to statistical and contextual validation. Key strategies include defining flexible criteria, leveraging human expertise, and employing continuous monitoring.
Why is AI output unpredictable?
Large Language Models (LLMs) generate responses based on probability, not deterministic logic. They do not recall facts or genuinely understand and analyze; they predict the most likely next token.
This means that AI responses vary depending on:
- The exact wording of the prompt
- Format and context size
- Temperature and randomness settings
- Model updates over time
- Missing or ambiguous input documents
Instead of producing one correct response, an AI can generate multiple responses that may all be valid, or all be incorrect.
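To make the temperature setting mentioned above concrete, here is a minimal sketch of temperature-scaled softmax sampling, using toy made-up logits for three candidate next tokens (the numbers are illustrative, not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) into token probabilities.
    A higher temperature flattens the distribution, making unlikely
    tokens more probable and the output less predictable."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate next tokens (assumed for illustration)
logits = [2.0, 1.0, 0.1]
low_t = softmax_with_temperature(logits, 0.2)   # near-deterministic
high_t = softmax_with_temperature(logits, 2.0)  # much flatter
```

At low temperature the top token dominates and repeated runs look stable; at high temperature the tail tokens gain probability, which is one direct source of the run-to-run variation described above.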
The challenge: Traditional validation doesn’t work
Classical software testing validates output against exact expected values: every time we perform the same action, we receive the same output. That assumption breaks for AI systems, which may produce many different acceptable outputs for the same input.
Therefore, validation must shift from binary correctness to quality judgment. For example, we have to answer questions like these when judging an LLM response:
- Is the answer accurate?
- Is the answer complete?
- Is the answer relevant to the question?
- Is the answer safe and appropriate?
Strategies to validate AI responses
There are many approaches, and organizations can combine several of them: rule-based validation, scenario-based testing, cross-checking, confidence and quality scoring, explainability checks, human-in-the-loop review, and more.
Whichever approaches you pick, the key is to:
- Define acceptance bands:
Instead of expecting a single correct answer, establish a predefined “acceptable range of quality” (for example, a band on a 5-star scale). This allows for varied responses that are still valid.
- Use scoring and metrics:
For example, precision, recall, semantic relevance, or coherence scores.
- Involve domain experts:
Experts can interpret results and give nuanced feedback that automated metrics might miss.
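The acceptance-band idea can be sketched in a few lines. This example uses simple surface similarity (`difflib.SequenceMatcher`) as a cheap stand-in for a real quality metric, and the band boundaries are assumed values, not recommendations; production pipelines would typically use embeddings or a learned scorer instead:

```python
from difflib import SequenceMatcher

# Hypothetical "acceptable range of quality" for the similarity score
ACCEPTANCE_BAND = (0.6, 1.0)

def best_similarity(candidate, references):
    """Score the candidate against each acceptable reference answer
    and keep the best match (0.0 .. 1.0)."""
    return max(
        SequenceMatcher(None, candidate.lower(), ref.lower()).ratio()
        for ref in references
    )

def within_band(score, band=ACCEPTANCE_BAND):
    """Accept any response whose score falls inside the band."""
    low, high = band
    return low <= score <= high

references = ["Paris is the capital of France.",
              "The capital of France is Paris."]
score = best_similarity("Paris is France's capital city.", references)
```

Because the check is a range rather than an exact match, differently worded but equally correct answers can all pass.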
Put another way, testing and evaluation should:
- Evaluate the AI’s responses under various conditions, including normal, edge, ambiguous, and adversarial cases, to understand its limitations and robustness.
- Run the test multiple times with the same input to assess the consistency of the AI’s responses.
- Ask the AI to provide its reasoning, assumptions, and sources (for example, reference documents) to make auditing more transparent and the output more explainable.
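Repeated-run consistency checking can be sketched as follows. Here `fake_model` is a placeholder for the real (non-deterministic) LLM call, and the stability metric is simply how often the most common answer recurs:

```python
import random
from collections import Counter

def fake_model(prompt, rng):
    """Stand-in for a non-deterministic LLM call (assumption: the real
    system would be an API call with temperature > 0)."""
    return rng.choice(["42", "42", "42", "forty-two"])

def consistency_rate(prompt, runs=20, seed=0):
    """Run the same prompt many times and report how often the most
    common answer appears -- a simple stability metric."""
    rng = random.Random(seed)
    answers = [fake_model(prompt, rng) for _ in range(runs)]
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common, count / runs

answer, rate = consistency_rate("What is 6 * 7?")
```

A low consistency rate does not automatically mean failure (several phrasings may be valid), but a sudden drop after a model update is a useful regression signal.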
A sample evaluation rubric might look like:
- Accuracy: is the information provided factual and correct?
- Relevance: how well does the response match the user’s question and context?
- Clarity and coherence: is the response easy to understand, logical and well-structured?
- Tone and style: does the response match the expected communication style?
- Safety: does the response avoid harmful, inappropriate or biased content?
These checks can be performed with an automated, a manual, or a hybrid approach.
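One way to operationalize such a rubric is a weighted score per response. The weights below are hypothetical; in practice they would be agreed with domain experts:

```python
# Hypothetical weights -- a real rubric would be calibrated with
# domain experts for the specific product and risk level.
RUBRIC_WEIGHTS = {
    "accuracy": 0.35,
    "relevance": 0.25,
    "clarity": 0.20,
    "tone": 0.10,
    "safety": 0.10,
}

def rubric_score(ratings, weights=RUBRIC_WEIGHTS):
    """Combine per-criterion ratings (1-5 scale) into one weighted
    score. Raises if a criterion is missing so gaps are caught early."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(weights[c] * ratings[c] for c in weights)

score = rubric_score({"accuracy": 5, "relevance": 4, "clarity": 4,
                      "tone": 5, "safety": 5})  # -> 4.55
```

The single score can then be compared against an acceptance band, while the per-criterion ratings remain available for diagnosing why a response fell short.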
In automated testing (quantitative metrics), we can use:
- BLEU, ROUGE, and similar metrics
- Basic safety checks that verify the system filters obviously harmful content
- …
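To show what a metric like ROUGE measures, here is a minimal sketch of ROUGE-1 recall (unigram recall against a reference). Real toolkits add stemming, ROUGE-2, ROUGE-L, and F-measure variants, so treat this as illustrative only:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: the fraction of reference unigrams that also
    appear in the candidate, with counts clipped to the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

r = rouge1_recall("the cat sat on the mat",
                  "the cat is on the mat")  # 5 of 6 reference words match
```

Scores like this are cheap to run over thousands of outputs, but they reward word overlap, not meaning, which is why they are paired with human or LLM-based review.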
In manual testing (qualitative and subjective metrics), we can:
- Assess whether the response generally stays on topic and follows the prompt instructions.
- Ensure that the system correctly handles long inputs, special characters, or rapid-fire questions.
- Validate basic readability, grammar, spelling, and formatting.
- Check response speed and stability.
- Have domain experts verify domain-specific facts, advice, and terminology using their professional expertise.
We can also use another AI to check the response of the AI under test: provide a “judge” LLM with the input prompt, the generated response, and the evaluation rubric, and ask the judge to score and justify each criterion.
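The LLM-as-judge setup largely comes down to prompt assembly. Below is a sketch of building such a judge prompt; the criteria names and output format are assumptions for illustration, and the actual call to the judge model (whatever API your provider exposes) is left out:

```python
# Hypothetical criteria; any rubric agreed by the team would work.
CRITERIA = ["accuracy", "relevance", "clarity", "tone", "safety"]

def build_judge_prompt(user_prompt, model_response, criteria=CRITERIA):
    """Assemble the instructions for the 'judge' LLM: the original
    prompt, the response under test, and the rubric to score against."""
    criteria_block = "\n".join(
        f"- {c}: rate 1-5 and justify" for c in criteria
    )
    return (
        "You are an impartial evaluator.\n\n"
        f"Original prompt:\n{user_prompt}\n\n"
        f"Response to evaluate:\n{model_response}\n\n"
        "Score the response on each criterion:\n"
        f"{criteria_block}\n\n"
        "Return one line per criterion: "
        "<criterion>: <score> - <justification>."
    )

prompt = build_judge_prompt("Summarise our refund policy.",
                            "Refunds are available within 30 days.")
```

Asking for a per-criterion justification (not just a score) makes the judge's output auditable, which matters because judge LLMs have biases of their own and should be spot-checked by humans.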
Conclusion
Validating AI systems with unpredictable outputs requires a multi-faceted approach combining statistical analysis, human oversight, and automated evaluation.
Key strategies involve defining flexible quality criteria and establishing “acceptance bands” rather than rigid pass/fail standards. Different roles handle different aspects of validation:
- Domain experts and manual testers provide crucial oversight for subjective quality, tone, empathy, and high-stakes safety and compliance in the real world. Software testers focus on system integration, user experience, performance consistency, regression testing, and ensuring basic guardrails and formatting rules are met, using the rubric for mechanical checks rather than deep domain accuracy, which remains the domain experts’ focus.
- Automated testing handles scalable checks using quantitative metrics (such as ROUGE and BLEU) and automated tools.