1. Introduction
Large Language Models (LLMs) are playing an increasingly important role in modern AI applications. However, to ensure the quality, accuracy, and safety of these models, a systematic evaluation process is essential. DeepEval and DeepTeam are two tools that provide effective support for this purpose.
2. Why Evaluate LLMs?
Evaluating LLMs helps to:
- Ensure accuracy: Verify whether the model provides trustworthy and accurate responses
- Detect errors: Identify and address incorrect, illogical, or off-topic outputs
- Improve user experience: Optimize model responses for smoothness and relevance
- Minimize risk: Reduce bias, inappropriate content, or undesirable behavior
3. DeepEval – How to Use It
3.1. Installation
Install Python: download and install it from python.org if it is not already on your machine.
Create a new folder A and run the commands below in a terminal:
cd A                       # move into the folder that will hold the DeepEval tests
python3 -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install -U deepeval    # add DeepTeam as well if you plan to run red teaming: pip install -U deepteam
Create a .env file in the folder and add your OpenAI API key:
# .env file content
OPENAI_API_KEY=...   # replace ... with your OpenAI key from https://platform.openai.com/settings/organization/api-keys
If you want to run with Gemini instead, use the command below:
deepeval set-gemini --model-name="models/gemini-2.0-flash" --google-api-key="..."   # replace ... with your Google API key
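DeepEval reads the OpenAI key from the environment at runtime, so a quick sanity check before a test run can save a confusing failure later. Below is a minimal sketch; the helper name require_api_key is our own, not part of DeepEval, and it assumes the variable is exported in your shell or loaded from .env with a tool such as python-dotenv:

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Fail fast with a clear message if the key is missing."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return key
```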
3.2. Creating a Basic Test Case
DeepEval uses LLMTestCase to define the input, the expected output, and the actual output. Here is a simple example:
# Filename: test_example.py
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import CoherenceMetric

def my_llm_greeting(query: str) -> str:
    if "hello" in query.lower():
        return "Hello there! How can I help you today?"
    return "I'm sorry, I didn't understand that."

test_case_1 = LLMTestCase(
    input="Say hello.",
    actual_output=my_llm_greeting("Say hello."),
    expected_output="Hello there! How can I help you today?",
)

evaluate(test_cases=[test_case_1], metrics=[CoherenceMetric(threshold=0.7)])
Then run the test with the command below:
deepeval test run test_example.py   # test_example.py is the file to run
4. Common Evaluation Metrics
DeepEval provides a range of built-in metrics designed for different evaluation purposes.
4.1. Answer Relevancy
Evaluates how relevant the actual_output of your LLM application is to the provided input.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(
    threshold=1,   # a threshold of 1 means only a perfect score passes
    model="gpt-4.1",
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the output from your LLM app
    actual_output="We offer a 30-day full refund at no extra cost."
)

evaluate(test_cases=[test_case], metrics=[metric])
4.2. Faithfulness
Evaluates whether the actual_output factually aligns with the contents of your retrieval_context.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."
# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

evaluate(test_cases=[test_case], metrics=[metric])
4.3. G-Eval
G-Eval is a framework that uses LLM-as-a-judge with chain-of-thought (CoT) reasoning to evaluate LLM outputs against any custom criteria. Here is an example of a custom metric:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    # NOTE: you can only provide either criteria or evaluation_steps, not both
    # criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
4.4. Other metrics
- BiasMetric: Detects bias (e.g., gender, political, or cultural)
- ToxicityMetric: Flags harmful or inappropriate content
- ContextualRelevancyMetric: Evaluates how relevant the retrieved documents are
- AccuracyMetric: Compares actual output with expected output
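To make the last comparison concrete, here is a toy sketch of what an exact-match style accuracy check does. This is purely illustrative and not DeepEval's implementation; the library's metrics use an LLM judge and return graded scores rather than a binary match:

```python
def exact_match_score(actual: str, expected: str) -> float:
    """Illustrative only: 1.0 if outputs match after whitespace/case
    normalization, else 0.0. Real metrics grade semantic similarity."""
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(actual) == normalize(expected) else 0.0
```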
5. Synthetic Data and Goldens
When there isn’t enough real-world data, synthetic data can be used to test your model.
- Goldens are reference test cases used for evaluation
- You can generate Goldens from documents, existing contexts, or from scratch
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    include_expected_output=True
)
print(synthesizer.synthetic_goldens)
How Does it Work?
- Input Generation: Generate synthetic golden inputs, with or without provided contexts.
- Filtration: Filter out any initial synthetic goldens that don't meet the specified generation standards.
- Evolution: Evolve the filtered synthetic goldens to increase complexity and make them more realistic.
- Styling: Style the output formats of the inputs and expected_outputs of the evolved synthetic goldens.
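The four stages above can be sketched as a plain pipeline. The function names and string transformations below are our own toy stand-ins for what the Synthesizer actually does with an LLM at each step:

```python
def generate(contexts):
    # Input Generation: draft one input per context
    return [f"Question about: {c}" for c in contexts]

def filtrate(goldens, min_len=20):
    # Filtration: drop drafts that miss the quality bar (here, just length)
    return [g for g in goldens if len(g) >= min_len]

def evolve(goldens):
    # Evolution: rewrite inputs to be more complex and realistic
    return [g + " Walk through your reasoning step by step." for g in goldens]

def style(goldens):
    # Styling: shape the final input / expected_output format
    return [{"input": g, "expected_output": None} for g in goldens]

def synthesize(contexts):
    return style(evolve(filtrate(generate(contexts))))
```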
6. Automating Evaluation with Pytest
DeepEval works well with Pytest, allowing you to automate your evaluation process. Install it with the command below:
pip install pytest
Here is an example of integrating with Pytest. Note the use of assert_test, which fails the Pytest test when a metric's score falls below its threshold:
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import CoherenceMetric

def my_llm_app(query: str) -> str:
    return "Hello there!" if "hello" in query.lower() else "I'm sorry."

@pytest.mark.parametrize("test_case", [
    LLMTestCase(
        input="Say hello",
        actual_output=my_llm_app("Say hello"),
        expected_output="Hello there!"
    )
])
def test_llm_response(test_case):
    assert_test(test_case, [CoherenceMetric(threshold=0.7)])
7. Red Teaming with DeepTeam
In addition to functional evaluation, it’s important to assess how your model handles harmful inputs, prompt injection, or jailbreak scenarios. DeepTeam is designed for this purpose.
7.1. Common Types of Attacks
- Single-turn:
- Prompt injection
- Obfuscation (e.g., Leetspeak)
- Multi-turn:
- Linear: Directly bypassing restrictions using explicit prompts
- Crescendo: Gradually escalating prompts to trick the model
- Tree-style: Using logical traps that lead to unsafe behavior
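To illustrate the multi-turn idea, here is a toy crescendo-style loop run against a dummy callback. The escalation prompts and the "cannot"-based refusal check are simplified assumptions for illustration, not DeepTeam's actual attack logic:

```python
def crescendo_probe(model, prompts):
    """Send increasingly pointed prompts in order; return the index of the
    first reply that is not a refusal, or -1 if the model refuses every turn."""
    for i, prompt in enumerate(prompts):
        reply = model(prompt)
        if "cannot" not in reply.lower():
            return i  # model complied at this escalation step
    return -1
```

A real red-teaming run would also record the full transcript and score each reply against the targeted vulnerability rather than using a single keyword check.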
7.2. Example Red Teaming Implementation
pip install -U deepteam
import asyncio

from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.single_turn import PromptInjection

async def my_llm_application_callback(input: str) -> str:
    return "I cannot provide information on harmful activities." if "bomb" in input else f"Sorry: '{input}'"

async def run_red_team_example():
    risk_assessment = await red_team(
        model_callback=my_llm_application_callback,
        vulnerabilities=[Bias()],
        attacks=[PromptInjection()]
    )
    print(risk_assessment.overall)

if __name__ == "__main__":
    asyncio.run(run_red_team_example())
8. Conclusion
Combining DeepEval and DeepTeam enables you to ensure your LLMs behave as expected, mitigate potential risks, and maintain ethical and secure AI applications. This is an essential step in building robust and trustworthy modern AI systems.