In Cybersecurity, Claude Leaves Other LLMs in the Dust
Summary
Giskard’s PHARE benchmark evaluated major LLMs (OpenAI, Anthropic, xAI, Meta, Google, etc.) on safety metrics including resistance to jailbreaks, prompt injection, hallucinations, bias and harmful-output refusal. The results show modest industry-wide progress in safety, but Anthropic’s Claude models (4.1 and 4.5) clearly outperformed peers across almost every metric. Many models remain vulnerable to well-known jailbreaks and prompt-injection techniques, while most do better at refusing to generate explicitly harmful content. The data suggest that prioritising safety throughout the development pipeline — as Anthropic appears to do — yields materially safer models.
Key Points
- Giskard’s PHARE benchmark tested many brand-name LLMs on safety and misuse resistance using known exploits and techniques.
- Claude 4.1 and 4.5 consistently scored highest for resisting jailbreaks and avoiding harmful, biased or hallucinatory outputs.
- Most models still fall to disclosed jailbreaks and prompt-injection attacks; GPT-family models generally pass tests ~66–75% of the time, while Gemini (except 3.0 Pro) and others often score around 40%.
- There is no clear positive correlation between model size and robustness; larger models can have a bigger attack surface because they parse complex malicious prompts better.
- Across the board, LLMs improved notably at refusing explicit harmful content, but progress on preventing subtle misuse and hallucinations is limited.
- Anthropic’s advantage may stem from embedding alignment and safety earlier in development (alignment engineers), rather than treating it as a final refinement step.
- Implication for defenders: choose models with demonstrable safety engineering, run rigorous red-teaming and PHI/PII tests, and do not assume any model is immune to jailbreaks.
Content summary
The PHARE report highlights that while LLM capabilities have advanced, most vendors are not making meaningful, industry-wide strides on safety. Researchers used known jailbreak and injection techniques and found many models still vulnerable. Claude stands out, skewing the industry average upward: remove Anthropic and the trendline for safety would be much weaker.
Key findings include weak resistance to long-established jailbreaks, inconsistent correlation between model size and safety, and better-than-before refusal to generate clearly harmful content. Experts quoted in the article argue that process differences — embedding safety and alignment throughout training versus when refining a final model — likely explain Claude’s superior performance.
Context and relevance
This matters if you choose or deploy LLMs for security-sensitive tasks. The benchmark shows that vendor decisions about where and when to prioritise safety in the development lifecycle have a real impact on risk. For CISOs, threat teams and product owners, the article signals that vendor selection, independent testing, and continuous red-teaming are essential. It also underlines a trend: safety-focused engineering can produce materially safer models even without mysterious data or vastly greater resources.
Why should I read this?
Short version: if you’re picking an LLM for anything that touches security, privacy or trust, this piece tells you which models actually behave like grown-ups — and which ones still fall for old tricks. Saves you poking the models yourself and possibly getting burned.
Author style
Punchy. The reporting pulls a clear, business-critical conclusion out of the PHARE data: Anthropic’s approach to embedding alignment earlier in development is paying off. If model safety affects your risk profile, it’s worth digging into the full benchmarks linked in the piece.
Source
Source: https://www.darkreading.com/cybersecurity-analytics/cybersecurity-claude-llms
