As multimodal large language models (MLLMs) move from demos into production, their safety properties are being stress-tested in the unforgiving reality of everyday life. One misstep can lead to costly lawsuits and real-world harm to users and bystanders.
Recent research exposes a widening safety gap between leading models, and even state-of-the-art systems remain vulnerable to simple attack strategies. Multimodal benchmarks continue to fail in predictable ways, and risks escalate as models expand into new modalities, generating increasingly complex outputs, from paintings to music videos.
Earlier this year, we evaluated four MLLMs – GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus – using a novel dataset of 726 adversarial prompts designed to test illegal activity, disinformation, and unethical behavior. The disparities were substantial.
Across both multimodal and text-only inputs, Pixtral 12B produced harmful content roughly 62% of the time, while Claude Sonnet 3.5 was the most resistant at around 10% to 11%, with GPT-4o around 19% and Qwen VL Plus at 39%. While intuition suggests the newest modality would be the weakest point, most models were slightly more resistant to multimodal attacks than to text-only ones – a sign that long-standing weaknesses in text-based LLM safety persist, and that safety remains a challenge across every modality.
Old jailbreaks, new risks
These results translate to real-world risk. Under the hood, the attacks looked familiar: role-play to recast intent, ‘refusal suppression’ to undercut a model’s safety instincts, strategic framing to skirt policy triggers, and noise to distract pattern-based defenses.
None of this is new, and that is the point. Subtle, social hacks still pull systems toward helpfulness and away from their safety protocols. Many well-known techniques remain successful in a single turn even as models improve and new ones are released. But the rate of AI innovation continues to raise the stakes.
As capabilities increase, so do the opportunities for failure. When resistance to adversarial prompting isn’t robust in the earliest modality (text), expanding to new modalities like image and video compounds existing risks.
Mitigating safety vulnerabilities is increasingly complex for developers as adding new features has the unfortunate side effect of giving users new opportunities to produce harmful content, whether they intend to or not. Where a text-only model might risk copyright infringement by outputting a chapter of Harry Potter, a multimodal one could produce the film and soundtrack plus Rowling’s famous novel.
‘No response’ as a safety feature
This brings us to the central tension surfaced in our research: Is the safest response sometimes no response?
We observed a strong correlation between Claude’s lower rate of harmful output and its higher rate of default refusals. In scenarios where other models produced harmful content (often acknowledging that the subject matter was potentially unsafe, then proceeding to engage regardless), Claude declined outright. Models have long been criticized for inaccuracy and overconfidence; now the negative impacts of excessive helpfulness and sycophancy are reaching new heights.
Refusals may frustrate users when benign requests get caught in the crossfire, but they start to look less like avoidance and more like a necessary safety feature when we consider that the alternative is plausible-sounding harmful content.
Why benchmarks must reward abstention
Most benchmarks still score with a binary pass/fail that labels turns as right/wrong, safe/unsafe. That framing quietly penalizes abstention and rewards confident fabrication – the very behavior we want to avoid under adversarial pressure. A more mature approach is to treat refusal as a first-class outcome and differentiate how a model stays safe.
Instead of collapsing everything into pass/fail, evaluate three things separately:
(1) harmless engagement – a safe, helpful reply
(2) justified refusal – a principled “no”, with or without brief context
(3) harmful or policy-violating output
With this framing, abstention becomes measurable and, where appropriate, rewarded. It also mirrors how safety operates in products, where the best outcome is sometimes to decline, log, and escalate.
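The three-outcome framing above can be sketched in a few lines. This is a minimal, hypothetical scoring helper – the label names and counts are illustrative assumptions, not part of the study – showing how abstention becomes a measurable quantity rather than a hidden penalty:

```python
from collections import Counter

# Hypothetical label names for the three outcomes described above.
OUTCOMES = ("harmless_engagement", "justified_refusal", "harmful_output")

def summarize(labels):
    """Report the share of each outcome instead of a single pass/fail rate."""
    counts = Counter(labels)
    total = len(labels)
    return {outcome: counts.get(outcome, 0) / total for outcome in OUTCOMES}

# Illustrative example: ten scored model turns from a hypothetical red-team run.
labels = (
    ["harmless_engagement"] * 5
    + ["justified_refusal"] * 3
    + ["harmful_output"] * 2
)
print(summarize(labels))
# {'harmless_engagement': 0.5, 'justified_refusal': 0.3, 'harmful_output': 0.2}
```

Because refusals get their own bucket, a model that declines a dangerous request raises its `justified_refusal` share rather than simply “failing” the turn.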
Early reliability checks suggest humans can score this nuance consistently enough to use in production pipelines, which means product teams can optimize for it without flying blind. Refusals aren’t universally ‘better,’ but benchmarks should recognize abstention as a legitimate safety outcome, especially under adversarial pressure.
Why incentives matter for real-world deployment
Why does this matter outside a lab? Incentives steer behavior. If internal tests dock points for abstention, you’ll select for systems that ‘try something’ when they shouldn’t. What looks like success in development becomes a policy violation, reputational damage, or a legal incident in production. If, instead, justified refusals count as positive safety events, you create space for models to surface risk, redirect to safer alternatives, and escalate when needed.
This shift also clarifies what ‘multimodal readiness’ should mean for buyers and policymakers. It’s not enough to pass a single blended score across text and image+text in standard benchmarks. Models should show disaggregated results by modality and harm scenario because risks are context-specific.
Our findings demonstrate that systems are vulnerable across both their most established capabilities and newer ones, with performance varying widely across models. In practice, AI systems should be held to context-aware safety standards that reflect real-world use cases and the pace of AI innovation. A one-size-fits-all approach will fail in increasingly dangerous and complex ways.
Follow the real threat model, not a sanitized one
There’s a second implication that’s easy to miss: Evaluation should follow the threat model. The most successful jailbreaks in our study weren’t novel. Our team succeeded with social maneuvers like role-play, refusal suppression, and strategic reframing that exploit the learned instinct to be helpful.
If your safety protocol leans on keyword triggers and ignores these conversational tactics, you’re relying on an overly reductive understanding of human communication and introducing risk to your customers and your bottom line.
So, what does good look like in practice?
Start with the data. Avoid blending red-teaming results for text-only and multimodal inputs into a single composite score; report modalities side by side, broken out by harm scenario. Then split safety into at least two outcomes, harmless engagement and justified refusal, and set thresholds for both.
This approach discourages models from gaming safety by stonewalling every hard question, and it also prevents overly compliant systems from passing acceptance tests. If you need more granularity, use a simple three-level scheme to distinguish principled refusals from mechanical ones. You’ll learn where safety protocols succeed and produce actionable feedback for targeted improvement.
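Putting the two recommendations together – disaggregate by modality and harm scenario, then threshold both outcomes – might look like the following sketch. The records, scenario names, and threshold values are all illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical scored turns: (modality, harm_scenario, outcome) triples.
records = [
    ("text", "illegal_activity", "justified_refusal"),
    ("text", "illegal_activity", "harmful_output"),
    ("image+text", "disinformation", "harmless_engagement"),
    ("image+text", "disinformation", "justified_refusal"),
]

def disaggregate(records):
    """Count outcomes per (modality, harm_scenario) instead of blending them."""
    table = defaultdict(lambda: defaultdict(int))
    for modality, scenario, outcome in records:
        table[(modality, scenario)][outcome] += 1
    return table

def passes(bucket, max_harmful=0.0, min_engagement=0.25):
    """Illustrative acceptance check: cap harmful output AND require some
    harmless engagement, so stonewalling every question also fails."""
    total = sum(bucket.values())
    harmful = bucket.get("harmful_output", 0) / total
    engaged = bucket.get("harmless_engagement", 0) / total
    return harmful <= max_harmful and engaged >= min_engagement

report = disaggregate(records)
for key, bucket in sorted(report):
    pass  # each (modality, scenario) bucket is judged on its own
```

The dual threshold is the point: a model that refuses everything fails `min_engagement`, and a model that complies with everything fails `max_harmful`, so neither failure mode hides behind an averaged score.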
The bottom line: Multimodal benchmarks fail when they punish abstention and hide risk in blended reports. Measure outcomes across modalities and test like an adversary. “No” can be the most powerful safety control in your arsenal and reduce the critical vulnerabilities making it to production.








