Created on May 5, 2026
DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Abstract:
As large language models (LLMs) become increasingly capable, robust and scalable security evaluation is crucial. While current red-teaming approaches have made strides in assessing LLM vulnerabilities, they often rely heavily on human input and fail to provide comprehensive coverage of potential risks. This paper introduces AutoRedTeamer, a unified framework for fully automated, end-to-end red teaming of LLMs. AutoRedTeamer is an LLM-based agent architecture comprising five specialized modules and a novel memory-based attack selection mechanism, enabling deliberate exploration of new attack vectors. AutoRedTeamer accepts both seed prompts and risk categories as input, demonstrating flexibility across red-teaming scenarios. We demonstrate AutoRedTeamer's superior performance in identifying potential vulnerabilities compared to existing manual and optimization-based approaches, achieving a 20% higher attack success rate on HarmBench against Llama-3.1-70B while reducing computational costs by 46%. Notably, AutoRedTeamer can break jailbreaking defenses and generate test cases with diversity comparable to human-curated benchmarks. AutoRedTeamer establishes the state of the art for automating the entire red-teaming pipeline, a critical step towards comprehensive and scalable security evaluations of AI systems.
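The abstract does not specify how the memory-based attack selection mechanism works. One plausible reading is a bandit-style loop that tracks each attack vector's past success rate and balances exploiting effective vectors against exploring untested ones. The sketch below is purely illustrative under that assumption; the class, method, and vector names are hypothetical and not taken from the paper:

```python
import math
from collections import defaultdict

class AttackMemory:
    """Illustrative memory of past attack outcomes (names are hypothetical)."""

    def __init__(self, attack_vectors):
        self.attack_vectors = list(attack_vectors)
        self.attempts = defaultdict(int)   # vector -> number of trials
        self.successes = defaultdict(int)  # vector -> number of successes
        self.total = 0                     # total trials across all vectors

    def record(self, vector, succeeded):
        """Store the outcome of one red-teaming attempt."""
        self.attempts[vector] += 1
        self.successes[vector] += int(succeeded)
        self.total += 1

    def select(self):
        """UCB1-style choice: success rate plus an exploration bonus."""
        def score(vector):
            n = self.attempts[vector]
            if n == 0:
                return float("inf")  # always try untested vectors first
            rate = self.successes[vector] / n
            return rate + math.sqrt(2 * math.log(self.total) / n)
        return max(self.attack_vectors, key=score)

# Example usage with made-up attack-vector labels:
memory = AttackMemory(["roleplay", "obfuscation"])
memory.record("roleplay", succeeded=True)
memory.record("obfuscation", succeeded=False)
next_vector = memory.select()  # favors the vector with the better record
```

The exploration bonus shrinks as a vector accumulates trials, so selection gradually concentrates on vectors that keep succeeding while still occasionally revisiting others; the actual mechanism in AutoRedTeamer may differ.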
Authors
Zhaorun Chen, Xun Liu, Haibo Tong, Chengquan Guo, Yuzhou Nie, Jiawei Zhang, Mintong Kang, Chejian Xu, Qichang Liu, Xiaogeng Liu, Tianneng Shi, Chaowei Xiao, Sanmi Koyejo, Percy Liang, Wenbo Guo, Dawn Song, Bo Li.
