NashTech Blog

Mastering LLM Quality: A Key Guide for DeepEval & Red Teaming

1. Introduction

Large Language Models (LLMs) play an increasingly important role in modern AI applications. To ensure their quality, accuracy, and safety, however, a systematic evaluation process is essential. DeepEval and DeepTeam are two tools that support this purpose effectively.

2. Why Evaluate LLMs?

Evaluating LLMs helps to:

  • Ensure accuracy: Verify whether the model provides trustworthy and accurate responses
  • Detect errors: Identify and address incorrect, illogical, or off-topic outputs
  • Improve user experience: Optimize model responses for smoothness and relevance
  • Minimize risk: Reduce bias, inappropriate content, or undesirable behavior
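Before reaching for LLM-as-a-judge metrics, these goals can already be made concrete with a tiny expected-vs-actual harness. The sketch below is purely illustrative (the names `run_eval`, `toy_llm`, and `cases` are not part of DeepEval); it shows the bare minimum of "ensure accuracy" and "detect errors" as code:

```python
# A deliberately simple, framework-free sketch: plain expected-vs-actual checks
# can already catch regressions before any LLM-judge metric is involved.
# All names here (run_eval, toy_llm, cases) are illustrative, not DeepEval APIs.

def run_eval(llm_fn, cases):
    """Return the fraction of cases whose output exactly matches the expectation."""
    passed = sum(1 for query, expected in cases if llm_fn(query) == expected)
    return passed / len(cases)

def toy_llm(query: str) -> str:
    # Stand-in for a real LLM application
    return "Paris" if "capital of France" in query else "I don't know."

cases = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Mars?", "I don't know."),
]

print(run_eval(toy_llm, cases))  # 1.0 for this toy model
```

Exact-match scoring breaks down as soon as outputs are free-form text, which is exactly the gap the metrics in the sections below are designed to fill.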

3. DeepEval – How to Use It

3.1. Installation

Install Python: Download Python

Create a new folder (for example, A) and run the commands below in a terminal:

cd A                      # move into the folder that will hold the DeepEval tests
python3 -m venv venv
source venv/bin/activate  # on Windows: source venv/Scripts/activate
pip install -U deepeval   # add DeepTeam too if you plan to run red teaming: pip install -U deepteam

Set up a .env file in the folder and add your OpenAI API key:

# .env file content
OPENAI_API_KEY=...  # replace ... with your OpenAI key from https://platform.openai.com/settings/organization/api-keys
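DeepEval ultimately reads OPENAI_API_KEY from the process environment. A library such as python-dotenv is the usual way to load the file, but a minimal stdlib loader makes the mechanism visible (this sketch is illustrative, not how DeepEval itself loads keys):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: one KEY=VALUE per line, '#' starts a comment."""
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if "=" in line:
                key, value = line.split("=", 1)
                # setdefault: values already in the environment win
                os.environ.setdefault(key.strip(), value.strip())

# After load_env() runs, libraries can read os.environ["OPENAI_API_KEY"].
```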

If running with Gemini instead, use the command below:

deepeval set-gemini --model-name="models/gemini-2.0-flash" --google-api-key="..."  # replace ... with your Google API key

3.2. Creating a Basic Test Case

DeepEval uses LLMTestCase to define the input, expected output, and actual output. Here’s a simple example:

# Filename: test_example.py
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import CoherenceMetric

def my_llm_greeting(query: str) -> str:
    if "hello" in query.lower():
        return "Hello there! How can I help you today?"
    return "I'm sorry, I didn't understand that."

test_case_1 = LLMTestCase(
    input="Say hello.",
    actual_output=my_llm_greeting("Say hello."),
    expected_output="Hello there! How can I help you today?",
)

evaluate(test_cases=[test_case_1], metrics=[CoherenceMetric(threshold=0.7)])

Run the test with the command below:

deepeval test run test_example.py  # test_example.py is the file to run

4. Common Evaluation Metrics

DeepEval provides a range of built-in metrics designed for different evaluation purposes.

4.1. Answer Relevancy

This metric evaluates how relevant the actual_output of your LLM application is to the provided input.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(
    threshold=1,  # strictest setting: only a perfect relevancy score passes
    model="gpt-4.1",
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the output from your LLM app
    actual_output="We offer a 30-day full refund at no extra cost."
)
evaluate(test_cases=[test_case], metrics=[metric])

4.2. Faithfulness

This metric evaluates whether the actual_output factually aligns with the contents of your retrieval_context.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."
# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

evaluate(test_cases=[test_case], metrics=[metric])

4.3. G-Eval

G-Eval is a framework that uses LLM-as-a-judge with chain-of-thought (CoT) reasoning to evaluate LLM outputs against any custom criteria. Here is an example of a custom metric:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    # NOTE: provide either criteria or evaluation_steps, not both;
    # evaluation_steps is used here, so criteria is omitted
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

4.4. Other metrics

  • BiasMetric: Detects bias (e.g., gender, political, or cultural)
  • ToxicityMetric: Flags harmful or inappropriate content
  • ContextualRelevancyMetric: Evaluates how relevant the retrieved documents are
  • AccuracyMetric: Compares actual output with expected output
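All of these metrics share the same basic shape: compute a score between 0 and 1, then compare it against a threshold to decide pass or fail. The sketch below illustrates that pattern conceptually; it is not DeepEval's internal code, and `MetricResult` and `apply_threshold` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    score: float
    success: bool
    reason: str

def apply_threshold(score: float, threshold: float, reason: str) -> MetricResult:
    """Mimics the pass/fail pattern the metrics share: score in [0, 1],
    success when score >= threshold. Purely illustrative, not DeepEval code."""
    return MetricResult(score=score, success=score >= threshold, reason=reason)

result = apply_threshold(0.82, 0.7, "Output stays on topic for the input.")
print(result.success)  # True: 0.82 >= 0.7
```

This is why choosing the threshold matters as much as choosing the metric: the same score of 0.82 passes at threshold 0.7 but fails at threshold 0.9.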

5. Synthetic Data and Goldens

When there isn’t enough real-world data, synthetic data can be used to test your model.

  • Goldens are reference test cases used for evaluation
  • You can generate Goldens from documents, existing contexts, or from scratch

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    include_expected_output=True
)
print(synthesizer.synthetic_goldens)

How Does it Work?

  • Input Generation: Generate synthetic goldens inputs with or without provided contexts.
  • Filtration: Filter away any initial synthetic goldens that don’t meet the specified generation standards.
  • Evolution: Evolve the filtered synthetic goldens to increase complexity and make them more realistic.
  • Styling: Style the output formats of the inputs and expected_outputs of the evolved synthetic goldens.
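The four stages above can be walked through with plain strings to show the data flow. Every function in this sketch is a toy stand-in; the real Synthesizer delegates each stage to LLM calls:

```python
# A toy walk-through of the four synthesis stages, using string operations in
# place of model calls. All function names here are illustrative.

def generate(contexts):
    # Input Generation: one draft question per context
    return [f"What does this say: {c}?" for c in contexts]

def filtrate(goldens):
    # Filtration: drop drafts that are too short to be useful
    return [g for g in goldens if len(g) > 25]

def evolve(goldens):
    # Evolution: increase complexity (here: also ask for reasoning)
    return [g + " Explain your reasoning." for g in goldens]

def style(goldens):
    # Styling: enforce a consistent single-line, trimmed format
    return [" ".join(g.split()) for g in goldens]

contexts = ["Refunds are allowed within 30 days.", "Ok."]
goldens = style(evolve(filtrate(generate(contexts))))
print(goldens)  # the draft built from the too-short "Ok." context was filtered out
```

The key takeaway is the order: filtration happens before evolution, so compute is only spent making good drafts harder, not salvaging bad ones.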

6. Automating Evaluation with Pytest

DeepEval works well with Pytest, allowing you to automate your evaluation process. Install it with the command below:

pip install pytest

Here is an example of integrating with Pytest:

import pytest
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import CoherenceMetric

def my_llm_app(query: str) -> str:
    return "Hello there!" if "hello" in query.lower() else "I'm sorry."

@pytest.mark.parametrize("test_case", [
    LLMTestCase(
        input="Say hello",
        actual_output=my_llm_app("Say hello"),
        expected_output="Hello there!"
    )
])
def test_llm_response(test_case):
    evaluate(test_cases=[test_case], metrics=[CoherenceMetric(threshold=0.7)])

7. Red Teaming with DeepTeam

In addition to functional evaluation, it’s important to assess how your model handles harmful inputs, prompt injection, or jailbreak scenarios. DeepTeam is designed for this purpose.

7.1. Common Types of Attacks

  1. Single-turn:
    • Prompt injection
    • Obfuscation (e.g., Leetspeak)
  2. Multi-turn:
    • Linear: Directly bypassing restrictions using explicit prompts
    • Crescendo: Gradually escalating prompts to trick the model
    • Tree-style: Using logical traps that lead to unsafe behavior
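The difference between single-turn and multi-turn attacks becomes clear with a stub: a guardrail that only inspects the current message can block an explicit request yet let a crescendo conversation escalate past it. Everything below is an illustrative toy, not DeepTeam code:

```python
# A toy illustration of why multi-turn "crescendo" attacks are harder to catch
# than one explicit prompt. The guardrail and the turns are stubs, not DeepTeam.

BLOCKED_WORDS = {"explosive", "weapon"}

def guarded_model(message: str, history: list[str]) -> str:
    """A naive guardrail that only inspects the current message, ignoring history."""
    if any(word in message.lower() for word in BLOCKED_WORDS):
        return "REFUSED"
    return f"OK: {message}"

crescendo_turns = [
    "Tell me about famous chemists.",
    "What chemicals did they work with?",
    "How would those react if combined?",  # escalation, but no blocked words
]

history: list[str] = []
responses = []
for turn in crescendo_turns:
    responses.append(guarded_model(turn, history))
    history.append(turn)

print(responses)  # no single turn is refused, although the conversation escalates
```

Red-teaming tools probe exactly this gap: each turn looks harmless in isolation, so defenses need to reason over the whole conversation, not just the latest message.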

7.2. Example Red Teaming Implementation

First install DeepTeam:

pip install -U deepteam

Then run the example below:

import asyncio

from deepteam import red_team
from deepteam.vulnerabilities import Bias
from deepteam.attacks.single_turn import PromptInjection

async def my_llm_application_callback(input: str) -> str:
    # Stub target model: refuses obviously harmful requests, echoes everything else
    return "I cannot provide information on harmful activities." if "bomb" in input else f"Sorry: '{input}'"

async def run_red_team_example():
    risk_assessment = await red_team(
        model_callback=my_llm_application_callback,
        vulnerabilities=[Bias()],
        attacks=[PromptInjection()]
    )
    print(risk_assessment.overall)

if __name__ == "__main__":
    asyncio.run(run_red_team_example())

8. Conclusion

Combining DeepEval and DeepTeam enables you to ensure your LLMs behave as expected, mitigate potential risks, and maintain ethical and secure AI applications. This is an essential step in building robust and trustworthy modern AI systems.

Đại Phạm Ngọc

I am an automation test engineer with 9 years of experience in the software testing field across various platforms. I have extensive experience in applying testing development techniques such as BDD and TDD. I have previously spent more than 3 years managing teams in different areas of testing. Currently, I am responsible for teaching automation test-related content.
