Virtue AI Blog Post: An Introduction to VirtueGuard-Text-Lite: Fastest and Most Effective Text Moderation Solution

We are excited to launch our advanced guardrail model, VirtueGuard-Text-Lite. It surpasses existing state-of-the-art models in safety protection performance while running much faster. (Please contact our team if you are interested in VirtueGuard-Text-Pro.)

In the rapidly evolving landscape of AI, ensuring that models comply with safety and security standards is crucial. VirtueGuard-Text-Lite is designed to monitor and regulate AI inputs and outputs so that they remain aligned with established safety and security protocols. Leveraging dynamic risk assessment and contextual awareness, the model not only blocks harmful or inappropriate input and output content but also adapts to emerging threats in real time. As shown in the figure below, VirtueGuard-Text-Lite achieves an AUPRC improvement of over 10% on standard benchmarks such as the OpenAI Mod and ToxicChat datasets, while being more than 30 times faster than models like LlamaGuard. This proactive approach to AI safety is a significant step forward in maintaining trust and reliability in AI systems, protecting users while unlocking the full potential of AI technologies.

[Figure: results on the OpenAI Mod dataset and the ToxicChat dataset]

Overall Performance


VirtueGuard-Text-Lite performs strongly across a range of safety benchmarks. As highlighted in the detailed comparison table, it achieves the best AUPRC on two public benchmarks: the OpenAI Moderation dataset (0.948 AUPRC) and the ToxicChat dataset (0.912 AUPRC). It outperforms other leading models, such as Llama Guard 3 8B and ShieldGemma 9B, by substantial margins. VirtueGuard-Text-Lite also stands out for its low false positive rate, with an industry-leading overkill rate of only 0.007 FPR. This combination of high accuracy in detecting risky content and few false positives makes VirtueGuard-Text-Lite an ideal choice for real-world applications.
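For readers less familiar with the metric, here is a minimal sketch of how AUPRC can be computed from ground-truth labels and model scores. It uses the common average-precision estimator and invented toy data. It is not the evaluation code behind the numbers above, and the function name `average_precision` is our own choice for illustration.

```python
def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve (average precision).

    labels: 1 for unsafe (positive), 0 for safe.
    scores: higher means the model considers the text more likely unsafe.
    Assumes the scores are distinct (ties are not handled specially).
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this rank; each positive adds a recall step of 1/positives.
            total += true_positives / rank
    return total / positives


# Toy data, invented for illustration only.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.95, 0.80, 0.60, 0.30, 0.10]
auprc = average_precision(toy_labels, toy_scores)
print(round(auprc, 3))  # 0.833
```

A score of 1.0 means every unsafe example is ranked above every safe one, so values such as 0.948 indicate that very few safe examples outrank unsafe ones.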


Risk Categories & Jailbreak

[Figure: VirtueGuard-Text-Lite risk categories]

VirtueGuard-Text-Lite covers 12 risk categories: 11 categories from the MLCommons taxonomy and an additional “Jailbreak Prompts” category. This extra category is specifically designed to detect and prevent jailbreak attacks on AI models, adding crucial protection against emerging threats to Large Language Model systems.

Although VirtueGuard-Text-Lite is not designed as a specialized jailbreak detection model, it excels at this task. It achieves near-perfect performance, with a 0.99 AUPRC score on the jackhhao/jailbreak-classification dataset. Notably, this surpasses leading specialized jailbreak detection models, including Deepset, ProtectAI, LlamaPromptGuard, and the jackhhao/jailbreak-classifier.

Another significant advantage of VirtueGuard-Text-Lite over specialized jailbreak detection models is its low false positive rate. Specialized models, trained primarily on jailbreak or similar tasks, often lack exposure to the diverse range of prompts encountered in real-world applications. As a result, models like Deepset and LlamaPromptGuard tend to misclassify benign prompts as threats, producing many false alarms. In contrast, VirtueGuard-Text-Lite achieves a remarkably low false positive rate of 0.022. This precision provides robust security without compromising user experience, making it an ideal solution for real-world applications where both safety protection and usability are critical.
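To make the false positive rate concrete, the short sketch below computes it as the share of benign prompts that a detector flags as unsafe. The data is synthetic and sized only so that the result matches the 0.022 figure quoted above, and the helper name `false_positive_rate` is hypothetical.

```python
def false_positive_rate(labels, predictions):
    """FPR = benign prompts flagged as unsafe / all benign prompts.

    labels: 1 for a real jailbreak, 0 for a benign prompt.
    predictions: 1 if the detector flagged the prompt, 0 otherwise.
    """
    benign_flags = [pred for label, pred in zip(labels, predictions) if label == 0]
    if not benign_flags:
        raise ValueError("need at least one benign example")
    return sum(benign_flags) / len(benign_flags)


# Synthetic example: 1,000 benign prompts, 22 of them wrongly flagged,
# followed by 50 real jailbreaks that are all caught.
labels = [0] * 1000 + [1] * 50
predictions = [1] * 22 + [0] * 978 + [1] * 50
fpr = false_positive_rate(labels, predictions)
print(fpr)  # 0.022
```

In other words, an FPR of 0.022 corresponds to roughly 22 false alarms per 1,000 benign prompts.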


LlamaGuard Compatibility

VirtueGuard-Text-Lite is compatible with the open-source Llama Guard model in both input and output formats, which makes it simple for developers to upgrade their safety tools. By replacing only the API calling function, developers can tap into VirtueGuard-Text-Lite's superior performance with no additional integration effort. This plug-and-play compatibility enables a cost-effective, near-zero-effort transition to a more effective text moderation solution.
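Because the output follows the Llama Guard convention, existing parsing code should carry over unchanged. As a rough sketch, assuming the verdict arrives as plain text that is either `safe` or `unsafe` followed by category codes such as `S2, S9`, a parser might look like the following. The `Verdict` class and `parse_verdict` function are hypothetical helpers, not part of any official SDK.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verdict:
    is_safe: bool
    categories: List[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse a Llama Guard-style verdict string (assumed format, see above)."""
    text = raw.strip()
    if text.lower() == "safe":
        return Verdict(is_safe=True)
    if text.lower().startswith("unsafe"):
        # Treat everything after "unsafe" as the list of violated categories.
        codes = re.findall(r"S\d+", text[len("unsafe"):])
        return Verdict(is_safe=False, categories=codes)
    raise ValueError(f"unrecognized moderation output: {raw!r}")


print(parse_verdict("safe"))
print(parse_verdict("unsafe\nS2, S9"))
```

Raising on unrecognized output, rather than defaulting to safe, keeps a malformed response from silently letting content through.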


Free API Access: We release a free API key, good for 10,000 queries per day, on our X (Twitter) account. Follow @VirtueAI_co for your chance to get free access!

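import os
import requests

# Read the API key from the environment rather than hard-coding it.
API_KEY = os.environ.get('VIRTUEAI_API_KEY')
if not API_KEY:
    raise RuntimeError("Set the VIRTUEAI_API_KEY environment variable first.")

API_URL = "http://api.virtueai.io/textguardlite"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# The text to moderate goes in the "message" field.
data = {
    "message": "###Your prompt for moderation###"
}

# Fail fast on network stalls and HTTP errors before reading the body.
response = requests.post(API_URL, json=data, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())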

Safe Output Format

safe

Unsafe Output Format

unsafe
S2, S9

import axios from 'axios';

// Read the API key from the environment rather than hard-coding it.
const API_KEY = process.env.VIRTUEAI_API_KEY;
const API_URL = "http://api.virtueai.io/textguardlite";

const headers = {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${API_KEY}`
};

// The text to moderate goes in the "message" field.
const data = {
    message: "###Your prompt for moderation###"
};

axios.post(API_URL, data, { headers })
    .then(response => console.log(response.data))
    .catch(error => console.error('Error:', error));

Safe Output Format

safe

Unsafe Output Format

unsafe
S2, S9

curl -X POST "http://api.virtueai.io/textguardlite" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"message": "###Your prompt for moderation###"}'

Safe Output Format

safe

Unsafe Output Format

unsafe
S2, S9

Created on September 6, 2024

Updated on April 13, 2026
