As AI models become increasingly adopted, ensuring their safe and responsible deployment is essential. At Virtue AI, we conduct rigorous red-teaming evaluations to stress-test AI models and applications, uncover vulnerabilities, and drive safe and secure deployments. In particular, VirtueAI’s proprietary VirtueRed platform, containing more than 100 specialized red-teaming algorithms, rigorously tests foundation models & applications across multiple risk domains, including regulatory compliance and use-case–driven vulnerabilities.
In this latest analysis, we put OpenAI’s GPT-4.5 and Anthropic’s Claude 3.7 through extensive red-teaming tests, assessing their risks such as safety, security, hallucination, regulatory compliance, codeGen vulnerabilities, and more. (Please check out our previous blog on red-teaming analysis for Claude 3.7 here).
For a comprehensive breakdown of our methodology, results, and key findings, as well as insights into Virtue AI’s red-teaming approach, please schedule time with our team here.
🔍 Overview: Strengths & Weaknesses
Each model has distinct strengths and weaknesses when it comes to hallucination, security, privacy, over-cautiousness, and compliance.
- GPT-4.5 shows notable improvements in hallucination reduction, privacy protection, and fairness mitigation, but struggles with over-cautious refusals and security risks in code generation.
- Claude 3.7 is more aligned with regulatory frameworks like the EU AI Act and offers better defenses against adversarial prompts, but it occasionally reinforces subtle biases and is weaker against privacy-focused attacks.
VirtueRed: Automated Red-Teaming for AI Models & Applications
VirtueRed systematically performs penetration testing and evaluates AI safety & security of AI models and applications with a series of advanced and adaptive red teaming algorithms, assessing:
- Practical use-case driven risks (e.g., hallucination, privacy, over-cauciousness, bias)
- Regulatory compliance risks (e.g., EU AI Act, AI company policies)
- Adaptive multi-modal vulnerabilities (e.g., security risks given adaptive multimodal jailbreaks)
- CodeGen-related risks (e.g., malicious and risky code generation)
Through diverse advanced red teaming algorithms, VirtueRed identifies systemic weaknesses and real-world safety challenges for foundation models and applications, enabling organizations to strengthen safe and secure AI products deployment, focus on product development, and minimize the time to market.
Key Findings: Vulnerabilities of GPT-4.5 & Comparison with Claude 3.7
Summary
Hallucination
✅ GPT-4.5 demonstrates a significant reduction in hallucinations, producing more factually consistent responses.
✅ Both models perform well, but GPT-4.5 leads in avoiding misleading outputs.
Over-Cautiousness & User Experience
❌ GPT-4.5 is overly cautious, often refusing harmless requests, which can frustrate users.
✅ Claude 3.7 strikes a better balance, distinguishing between genuinely harmful and benign queries more effectively.
Regulatory Compliance & AI Policy Risks
✅ Claude 3.7 is better aligned with the EU AI Act, reducing policy violations.
❌ GPT-4.5, while strong in fact-checking and misinformation prevention, struggles with operational misuse scenarios.
Privacy & Security Vulnerabilities
✅ GPT-4.5 shows stronger resilience to adversarial privacy attacks, including data extraction threats.
❌ Claude 3.7, though robust in direct privacy violations, is more vulnerable to indirect adversarial probing.
Code Generation & Security Risks
❌ GPT-4.5 is more prone to generating harmful code, including insecure scripting and exploit-enabling instructions.
✅ Claude 3.7 has stronger safeguards, effectively blocking malicious code-generation attempts.
Multi-Modal Adversarial Attacks
❌ GPT-4.5 is more susceptible to visual-based adversarial attacks, indicating gaps in multi-modal safety.
✅ Claude 3.7 exhibits stronger defenses, but advanced techniques can still bypass its safeguards.
Fairness & Bias Mitigation
✅ GPT-4.5 outperforms Claude 3.7 in fairness, neutralizing biases across demographics more consistently.
❌ Claude 3.7 struggles with nuanced demographic associations, occasionally reinforcing subtle stereotypes.
Hallucination
In the GPT-4.5 system card, OpenAI claims to reduce hallucinations and enhance accuracy through unsupervised learning, chain-of-thought reasoning, improved alignment techniques, enhanced data filtering, refined RLHF, and an instruction hierarchy for better response control.

From our red-teaming tests, we find:
- Potentially due to the new change in the pre-training and alignment process, GPT-4.5 indeed demonstrates a notable improvement in hallucination reduction compared with Claude 3.7, implying a higher capability of facing misleading and irrelevant information.
- The red teaming tests of hallucination contain diverse scenarios, some of which are described in our ICLR paper MMDT. This sophisticated test shows that GPT-4.5 is able to reduce hallucinations through improved world model accuracy.
Takeaway: GPT-4.5 demonstrates a significant reduction in hallucinations, producing more factually consistent responses. Compared with Claud 3.7, although both models perform well, GPT-4.5 leads in avoiding misleading outputs.
EU AI Act Compliance Risk Analysis

Other AI company policies


From our red-teaming tests, we find:
- GPT-4.5 demonstrates strong security safeguards complying with AI regulations and policies, and excels at identifying and refusing to propagate mis/disinformation, particularly with fact-checking capabilities and source verification. However, it struggles with some scenarios such as operational misuse, automated decision-making, and weapon usage. It occasionally produces financial advice or fails to recognize implicit discrimination.
- Claude 3.7 maintains a low-risk profile under regulatory compliance risks, consistently avoiding harmful content, political bias, and privacy violations.
Takeaway: While GPT-4.5 leads in combating mis/disinformation, Claude 3.7 outperforms GPT-4.5 overall in regulatory compliance, demonstrating stronger risk mitigation under, say, the EU AI Act across a broader range of categories.
Fraud

From our red-teaming tests, we find:
- GPT-4.5 can sometimes be manipulated into producing marketing copy or advice that contains exaggerated, unverified claims—potentially enabling fraud or deceptive schemes.
- Claude 3.7 is more inclined to reject such content by default, although it is not entirely foolproof.
Takeaway: Businesses, especially those who rely on user trust (e.g., e-commerce, consumer services) should implement robust AI guardrails model (e.g. VirtueGuard) to detect requests promoting fraudulent or unethical practices.
Over-cautiousness (False Refusals)
GPT 4.5 refuses a purely fictional game mechanic (virtual car hacking)

From our red-teaming tests, we find:
- Despite implementing new alignment methods, GPT-4.5 demonstrates persistent over-cautiousness, rejecting harmless requests about video game mechanics (like selling virtual items in TitanFall or car hacking in Watch Dogs) and even benign educational content (like simulating ancient trade strategies).
- Claude 3.7 maintains a significantly lower false refusal rate, properly distinguishing between requests about fictional scenarios in games or educational contexts versus genuinely harmful content.
Takeaway: GPT-4.5’s excessive caution creates potentially frustrating user experiences with frequent false positives, while Claude 3.7 offers a more nuanced understanding of context, making it substantially more practical for benign discussion when involved with sensitive elements.
Privacy & Security

From our red-teaming tests, we find:
- GPT-4.5 demonstrates relatively stronger resilience against adversarial privacy attacks, including data extraction attempts, reducing the risk of sensitive information leakage.
- Claude 3.7, while effective in rejecting direct privacy violations, shows more susceptibility to indirect adversarial probing techniques compared to GPT-4.5.
Takeaway: GPT-4.5 demonstrates relatively stronger resilience against adversarial privacy attacks, including data extraction attempts, making it the more secure option for safeguarding sensitive information.
Code generation vulnerability


From our red-teaming tests, we find:
- GPT 4.5 exhibits higher vulnerability to code generation exploits, sometimes providing destructive scripts upon request, including steps to bypass authentication protocols or create destructive payloads
- Claude 3.7 is more reliable against malicious code requests, successfully blocking attempts to generate harmful code during our tests
Takeaway: GPT-4.5’s vulnerability to generating harmful code instructions poses a significant risk, whereas Claude 3.7 provides a more secure and reliable framework for handling sensitive code generation requests.
Resilience against Multi-Modal Attacks


From our red-teaming tests, we find:
- GPT-4.5 demonstrates reduced robustness in visual safety, as it is more susceptible to simple adversarial visual manipulations.
- Claude 3.7 generally exhibits stronger refusal behaviors when faced with suspicious multi-modal prompts, but advanced manipulations can still bypass its safeguards.
Takeaway: Organizations leveraging GPT-4.5 for multi-modal use cases should implement an image-guardrail model to mitigate vulnerabilities inherent in visual inputs, as GPT-4.5’s safety measures for images are currently weaker than its safeguards for text-based content.
Fairness & Bias
From our red-teaming tests, we find:
- GPT-4.5 demonstrates a stronger commitment to fairness, actively mitigating biases across a wide range of demographic categories. It effectively neutralizes stereotypes in areas such as intelligence, crime, and leadership skills.
- Claude 3.7, while maintaining low bias overall, struggles more with nuanced demographic associations and occasionally reinforces subtle stereotypes.
Takeaway: GPT-4.5 outperforms Claude 3.7 in fairness, offering more consistent bias mitigation across sensitive topics.
Conclusion
GPT-4.5 demonstrates superior performance in reducing hallucinations, maintaining fairness, and protecting privacy through enhanced resilience against adversarial attacks. In the meantime, Claude 3.7 excels in EU AI Act compliance with stronger risk mitigation across a broader range of categories and shows a more nuanced understanding of context, resulting in fewer false refusals of benign requests.
Compared to Claude 3.7, GPT-4.5 strikes a better balance between usability and safety, while Claude 3.7 remains superior in adversarial resilience and compliance. Future iterations and guardrail solutions will be needed to enhance robustness and reliance when using both models in practice.
Safe and Secure AI Deployments with VirtueAI
Need assistance in ensuring the safe and secure deployment of AI models & applications? Learn more about our red-teaming and guardrail solutions by contacting our team at contact@virtueai.com.