Strengthen AI Security: Red Teaming LLMs with Promptfoo

Quân Đỗ

1. Introduction

In the previous blog post, we explored how Promptfoo excels in functional testing of LLMs. Additionally, Promptfoo can perform non-functional testing, particularly in identifying vulnerabilities and AI security through Red teaming.

Red teaming, a cybersecurity practice, involves simulating attacks to uncover weaknesses. As an open-source tool, Promptfoo provides a robust framework for red teaming LLMs, aiding developers in testing, evaluating, and securing their AI applications.

Screenshot of an LLM Risk Assessment dashboard displaying security and legal risk levels, along with critical, high, medium, and low issue counts. This is a report from LLM Red-teaming — A typical Promptfoo’s report after LLM red-teaming

2. Understanding Red Teaming for LLMs

Red teaming in AI involves creating adversarial inputs to test the weaknesses of an LLM. This process helps in finding potential vulnerabilities like prompt injection, unauthorized actions, or exposure of sensitive data. By systematically testing the model’s responses to these inputs, developers can measure risks and implement strategies to fix them.

LLMs, due to their complexity and wide attack surface, face various security challenges. Red teaming provides a way to measure risk, allowing developers to make informed decisions about acceptable risk levels before deployment. This proactive approach is essential for maintaining the integrity and reliability of AI systems in production environments.

3. Common Threats

3.1. Privacy violations

Gen AI apps depend on large data sources, and adversaries who could gain access to those data sources would pose massive threats to the companies behind the apps.

Even if user privacy isn’t directly violated, companies with AI apps likely don’t want outsiders to know the training data they use. An attacker can make an LLM share phone numbers it shouldn’t to sharing individual email addresses.

A leak of personally identifiable information (PII) is bad in itself, but once adversaries have that PII, they could use the stolen identities to gain unauthorized access to internal companies’ resources—to steal the resources, blackmail the company, or insert malware.

A table displaying examples of AI sharing phone numbers and email addresses of its training data. This is the result of an LLM red-teaming activity. — An LLM sharing the phone number and email addresses in its training data.

3.2. Prompt injections

Prompt injections resemble SQL injections but present differently. They are a type of attack that chains untrusted user input with trusted prompts built by a trusted developer.

In the example below, with one prompt injection, the attacker hijacked an LLM, convinced the user to disclose their names, and got the user to click on a link that redirected them to a malware website, for example.

An illustration showing an indirect web injection attack strategy, featuring a code snippet and a simulated chat interaction. The left side displays a code block describing an unrestricted AI assistant's reaction when prompted, while the right side shows the assistant asking a user for their name. — An example of indirect injection in AI where the assistant attempts to extract user information, illustrating a cybersecurity vulnerability.

3.3. Jailbreaking

Jailbreaking refers to attacks that intentionally subvert the foundational safety filters and guardrails built into the LLMs supporting AI apps. These attacks aim to make the model depart from its core constraints and behavioral limitations.

Jailbreaking can be surprisingly simple—sometimes as easy as copying and pasting a carefully crafted prompt to make a Gen AI app do things it’s fundamentally not supposed to do.

Chat interface showing a conversation with the Chevrolet of Watsonville, welcoming a customer and offering assistance.

Chat conversation between a user and Chevrolet of Watsonville Chat Team discussing a car deal. The user convinced a Chevrolet dealer's ChatGPT-powered customer service app to sell him a 2024 Chevy Tahoe for $1 with a simple prompt that gave the bot a new objective. — With a simple prompt that gave the bot a new object, the user was able to override its core constraints.

3.4. Generation of Unwanted Content

Separate from jailbreaking, AI apps can sometimes generate unwanted or unsavory content simply due to the broad knowledge base of the foundation model, which may not be limited to the specific use case of the app. This incorrect information can damage the reputation of the company or in worse cases can actually hurt users.

Screenshot of a Google search result page displaying advice on why cheese might slide off pizza, discussing possible reasons such as too much sauce or cheese, and suggesting solutions including the addition of non-toxic glue for tackiness. — Google’s AI Overview giving (potential) harmful advice to the user.

4. Promptfoo’s Red-teaming Features

Automatically scans 50+ vulnerability types:
- Security & data privacy: jailbreaks, injections, RAG poisoning, etc.
- Compliance & ethics: harmful & biased content, content filter validation, OWASP/NIST/EU compliance, etc.
- Custom policies: enforce organizational guidelines.
Generates dynamic attack probes tailored to your application using specialized uncensored models.
Implements state-of-the-art adversarial ML research from Microsoft, Meta, and others.
Integrates with CI/CD.
Tests via HTTP API, browser, or direct model access.

5. Setting Up Promptfoo for Red Teaming

Installation and Initialization:
- Install Promptfoo using npm: npx promptfoo@latest init
- Initialize your project and configure settings as needed.
Generating Adversarial Inputs
- Create a diverse set of malicious intents targeting potential vulnerabilities.
- Use techniques like prompt injection and jailbreaking to craft these inputs.
Executing Tests
Analyzing Results
- Evaluate the model’s responses using deterministic and model-graded metrics.
- Identify weaknesses or undesirable behaviors and document them for further analysis.

Promptfoo includes plugins, which are adversarial generators that produce malicious inputs that are sent you your application.

Promptfoo's report includes a breakdown of specific test types. — Promptfoo’s report includes a breakdown of specific test types

Clicking into a specific test case to view logs will display the raw inputs and outputs.

6. Best Practices for Red Teaming LLMs

Continuous Monitoring:
- Regularly update and run red teaming tests to keep up with evolving threats.
- Integrate Promptfoo into your CI/CD pipeline for automated security checks.
Collaborative Approach:
- Share findings with your team and collaborate on mitigation strategies.
- Use Promptfoo’s reporting features to generate detailed vulnerability reports.
Staying Informed:
- Keep up with emerging standards and best practices in AI security.
- Follow guidelines from organizations like OWASP, NIST, and the EU AI Act.

7. Conclusion

Red teaming is an essential practice for securing LLM applications, and Promptfoo provides a powerful toolset to facilitate this process. By systematically probing your AI systems for vulnerabilities, you can ensure they are robust, reliable, and ready for deployment. Embrace red teaming with Promptfoo to safeguard your AI innovations and contribute to a safer digital future.

Quân Đỗ

Result-oriented QA Automation Engineer keen on building test frameworks that can achieve thorough test coverage with efficient performance. Currently handy with writing test scripts and developing test frameworks using C# .Net, Java and Python.

Solutions

Industry

Our thinking

Strengthen AI Security: Red Teaming LLMs with Promptfoo

Quân Đỗ

Table of Contents

1. Introduction

2. Understanding Red Teaming for LLMs

3. Common Threats

3.1. Privacy violations

3.2. Prompt injections

3.3. Jailbreaking

3.4. Generation of Unwanted Content

4. Promptfoo’s Red-teaming Features

5. Setting Up Promptfoo for Red Teaming

6. Best Practices for Red Teaming LLMs

7. Conclusion

Quân Đỗ

Leave a Comment Cancel Reply

Suggested Article

NashTech

Solutions

Useful links

Connect with us

Our achievements