An Introduction to VirtueGuard-Text-Lite: Fastest and Most Effective Text Moderation Solution

Created: September 6, 2024

Last Updated: March 25, 2026

Authors

We are excited to launch our advanced guardrail model, VirtueGuard-Text-Lite. The model surpasses existing state-of-the-art models in safety protection performance while running at much higher speeds. (Please talk to our team about VirtueGuard-Text-Pro if you are interested.)

In the rapidly evolving AI landscape, ensuring that models comply with safety and security standards is crucial. VirtueGuard-Text-Lite provides a robust framework that actively monitors and regulates AI outputs so that they stay aligned with established safety and security protocols. Using dynamic risk assessment and contextual awareness, the model not only blocks harmful or inappropriate input and output content but also adapts to emerging threats in real time. As shown in the figure below, VirtueGuard-Text-Lite improves AUPRC by over 10% on standard benchmarks such as the OpenAI Mod and ToxicChat datasets, while running more than 30 times faster than models like LlamaGuard. This proactive approach to AI safety is a significant step toward maintaining trust and reliability in AI systems, protecting users while unlocking the full potential of AI technologies.

[Figure: Performance vs. Inference Speed on ToxicChat Dataset. Available views: ToxicChat Dataset, OpenAI Mod Dataset]

Overall Performance

VirtueGuard-Text-Lite leads across a range of safety benchmarks. As highlighted in the detailed comparison table, it achieves the best AUPRC on two public benchmarks: the OpenAI Moderation dataset (0.948 AUPRC) and the ToxicChat dataset (0.912 AUPRC). It outperforms other leading models, such as Llama Guard 3 8B and ShieldGemma 9B, by substantial margins. VirtueGuard-Text-Lite also stands out for keeping false positives low, with an industry-leading overkill rate (false positive rate, FPR) of only 0.007. This combination of high accuracy in detecting risky content and a low false positive rate makes VirtueGuard-Text-Lite an ideal choice for real-world applications.
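For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how they are commonly computed. It is not the evaluation code behind the reported numbers: the function names, the toy labels and scores, and the 0.5 threshold are illustrative assumptions, and average precision is used here as one common estimate of AUPRC.

```python
def average_precision(labels, scores):
    """Average precision, a common step-wise estimate of AUPRC.

    labels: 1 = unsafe (positive), 0 = safe (negative).
    scores: model confidence that the text is unsafe (higher = more unsafe).
    Assumes there are no tied scores.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


def false_positive_rate(labels, scores, threshold):
    """Fraction of safe examples that are flagged as unsafe at `threshold`."""
    negatives = sum(1 for y in labels if y == 0)
    if negatives == 0:
        raise ValueError("need at least one negative label")
    false_positives = sum(
        1 for y, s in zip(labels, scores) if y == 0 and s >= threshold
    )
    return false_positives / negatives


# Toy data (hypothetical, not taken from any benchmark).
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]

print(average_precision(toy_labels, toy_scores))
print(false_positive_rate(toy_labels, toy_scores, threshold=0.5))
```

A higher AUPRC means unsafe content is ranked above safe content more consistently, while a lower FPR means fewer benign prompts are blocked by mistake.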

Risk Categories & Jailbreak

VirtueGuard-Text-Lite Risk Categories

VirtueGuard-Text-Lite covers a comprehensive range of 12 risk categories, including 11 categories from the MLCommons taxonomy and an additional “Jailbreak Prompts” category. This extra category is specifically designed to detect and prevent jailbreak attacks on AI models, adding crucial protection against emerging threats to Large Language Model systems.

Although VirtueGuard-Text-Lite is not designed as a specialized jailbreak detection model, it still excels at this task. It achieves near-perfect performance, with a 0.99 AUPRC score on the jackhhao/jailbreak-classification dataset. Notably, this surpasses leading specialized jailbreak detection models, including Deepset, ProtectAI, LlamaPromptGuard, and the jackhhao/jailbreak-classifier.

Another significant advantage of VirtueGuard-Text-Lite over specialized jailbreak detection models is its low false positive rate. Specialized models, trained primarily on jailbreak or similar tasks, often lack exposure to the diverse range of prompts encountered in real-world applications. As a result, models like Deepset and LlamaPromptGuard tend to misclassify benign prompts as threats, which leads to a high rate of false alarms. In contrast, VirtueGuard-Text-Lite achieves a false positive rate of just 0.022. This precision provides robust security without compromising user experience, making it well suited to real-world applications where both safety protection and usability are critical.

LlamaGuard Compatibility

VirtueGuard-Text-Lite is compatible with the open-source Llama Guard model in both input and output formats, which makes it simple for developers to upgrade their safety tools. By replacing only the API calling function, developers can use VirtueGuard-Text-Lite's superior performance with no additional integration effort. This plug-and-play compatibility makes for a cost-effective, near-zero-effort transition to a more effective text moderation solution.

Free API Access: We release a free API key with 10,000 queries daily on our X (Twitter) account. Follow @VirtueAI_co for your chance to get free access!

Python

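import os
import requests

# Read the API key from the environment rather than hard-coding it.
API_KEY = os.environ.get("VIRTUEAI_API_KEY")
if not API_KEY:
    raise RuntimeError("Set the VIRTUEAI_API_KEY environment variable first.")

API_URL = "http://api.virtueai.io/textguardlite"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

# The text to be moderated goes in the "message" field.
data = {
    "message": "###Your prompt for moderation###"
}

# Send the request; fail loudly on HTTP errors or a hung connection.
response = requests.post(API_URL, json=data, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())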

Safe Output Format

safe

Unsafe Output Format

unsafe
S2, S9
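The two output formats above can be turned into a structured result with a small parser. The sketch below is illustrative only: `Verdict` and `parse_verdict` are hypothetical names, and it assumes the verdict arrives as plain text consisting of `safe`, or `unsafe` followed by category codes of the form `S<number>`. How that text is wrapped inside the JSON response is not specified here, so extracting it from the response is left to the caller.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    is_safe: bool
    categories: List[str]  # e.g. ["S2", "S9"]; empty when safe


def parse_verdict(raw: str) -> Verdict:
    """Parse a Llama Guard-style verdict string (assumed format).

    Accepts "safe", or "unsafe" followed by category codes that may sit
    on the same line or on the next one.
    """
    text = raw.strip()
    lowered = text.lower()
    if lowered == "safe":
        return Verdict(is_safe=True, categories=[])
    if lowered.startswith("unsafe"):
        # Everything after the keyword is scanned for codes such as S2 or S9.
        remainder = text[len("unsafe"):]
        return Verdict(is_safe=False, categories=re.findall(r"S\d+", remainder))
    raise ValueError(f"unrecognized verdict: {raw!r}")


print(parse_verdict("safe"))
print(parse_verdict("unsafe\nS2, S9"))
```

Raising on unrecognized text, rather than defaulting to safe, keeps a malformed response from silently passing moderation.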