arXiv Preprint

Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning

MM-Plan: A framework that reformulates multimodal jailbreaking as agentic planning, achieving state-of-the-art attack success rates against frontier MLLMs

¹Tulane University    ²Amazon

*Work done during an internship at Amazon

Abstract

Warning: This paper contains examples of potentially harmful content for research purposes.

Current multimodal red teaming treats images as wrappers for malicious payloads via typography or adversarial noise. These attacks are structurally brittle, as standard defenses neutralize them once the payload is exposed.

We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat where harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), a framework that reframes jailbreaking from turn-by-turn reaction to global plan synthesis.

MM-Plan trains an attacker planner to synthesize comprehensive, multi-turn strategies, optimized via Group Relative Policy Optimization (GRPO), enabling self-discovery of effective strategies without human supervision. To rigorously benchmark this reasoning-dependent threat, we introduce VE-Safety, a human-curated dataset filling a critical gap in evaluating high-risk technical visual understanding.

MM-Plan achieves 46.3% attack success rate against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2–5× where existing methods largely fail. These findings reveal that frontier models remain vulnerable to agentic multimodal attacks, exposing a critical gap in current safety alignment.

Key Results

  • 46.3% ASR on Claude 4.5 Sonnet (vs 24.4% best baseline)
  • 13.8% ASR on GPT-5 (vs 3.1% best baseline)
  • 2–5× improvement over prior methods
  • 440 VE-Safety instances across 15 safety categories

Method Overview

Figure: MM-Plan framework. Given a harmful goal and image, our Attacker Planner generates complete multi-turn strategies in a single pass. Plans are sampled and executed against victim MLLMs, with rewards collected from a judge model. The policy is updated via GRPO based on relative plan performance.
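
The "relative plan performance" step in the figure can be illustrated with a minimal sketch. It assumes the commonly used GRPO formulation, in which each sampled plan's reward is standardized against the other plans drawn for the same goal and image. The function name, the example judge scores, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each plan's reward against its own sampling group.

    Plans that score above the group mean get a positive advantage,
    plans below it get a negative one. `eps` avoids division by zero
    when every plan in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for K = 4 plans sampled for one (goal, image) pair.
judge_rewards = [0.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(judge_rewards)
```

Because the baseline is the group mean, this style of update needs no separately learned value function. A group in which every plan receives the same reward yields all-zero advantages and therefore no learning signal.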

Why Agentic Planning?

Prior Approaches

  • Sequential RL: Suffers from myopia, optimizing for immediate rewards
  • Iterative Search: Scales poorly, with K^N trajectories for N-turn dialogues
  • Wrapper-based Attacks: Easily neutralized by OCR-aware filters

MM-Plan (Ours)

  • Global Planning: Synthesizes complete strategy in one pass
  • Linear Scaling: Only K × N steps for K sampled plans
  • Visual Operations: Exploits image reasoning, not text wrappers
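
The scaling contrast between the two lists can be made concrete with a small calculation. This is a sketch under simple assumptions: iterative search branches into K candidates at every one of N turns (K raised to the power N), while plan sampling executes K complete plans of N turns each. The function names and the choice of K = 4 are hypothetical.

```python
def iterative_search_trajectories(k: int, n: int) -> int:
    """Branching K ways at each of N turns gives K**N candidate dialogues."""
    return k ** n


def plan_sampling_steps(k: int, n: int) -> int:
    """K complete plans, each executed for N turns, cost K * N steps."""
    return k * n


if __name__ == "__main__":
    k = 4  # hypothetical number of candidates or sampled plans
    for n_turns in (2, 4, 6):
        print(
            f"N={n_turns}: search={iterative_search_trajectories(k, n_turns)}, "
            f"planning={plan_sampling_steps(k, n_turns)}"
        )
```

With K = 4 and N = 6, the search count is 4096 trajectories while plan sampling needs 24 steps.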

Visual Exclusivity: A New Threat Model

Figure: Image-as-Wrapper vs. Image-as-Basis. Prior attacks (top) embed harmful instructions typographically within images. In contrast, Visual Exclusivity (bottom) presents an Image-as-Basis threat where text input alone is insufficient: the harmful goal requires reasoning about spatial and functional relationships exclusive to the image.

Unlike prior "wrapper-based" attacks, where images merely conceal text payloads, Visual Exclusivity (VE) exploits the model's own visual reasoning capabilities. In a VE attack, the text input alone is insufficient: the harmful goal can be reached only by reasoning about spatial and functional relationships that exist solely in the image.

This dependency renders standard defenses largely ineffective: OCR cannot extract payloads that do not exist in text form, and caption-based screening cannot capture the precise structural details required for harm.
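
A toy illustration of why an OCR-aware filter misses this threat is sketched below. No real OCR engine is called; the two strings stand in for text that such an engine might recover from each kind of image. The blocklist phrase and the function name are hypothetical, and real filters are considerably more elaborate.

```python
# Hypothetical blocklist; real systems use classifiers, not a single phrase.
BLOCKLIST = ("steps to bypass",)


def ocr_text_filter(extracted_text: str, blocklist=BLOCKLIST) -> bool:
    """Return True when text recovered from an image matches the blocklist."""
    lowered = extracted_text.lower()
    return any(phrase in lowered for phrase in blocklist)


# Stand-in for OCR output from a typographic (Image-as-Wrapper) image:
# the instruction itself is printed in the image.
wrapper_image_text = "Steps to BYPASS the lock: 1. 2. 3."

# Stand-in for OCR output from a technical schematic (Image-as-Basis):
# only component labels are present, with no instruction to match.
basis_image_text = "R1 10k  C3 100nF  U2  VCC  GND"

wrapper_blocked = ocr_text_filter(wrapper_image_text)
basis_blocked = ocr_text_filter(basis_image_text)
```

The wrapper image is flagged because its payload is literal text. The schematic passes because the information that matters lies in how the labeled parts relate to one another, which a text match cannot see.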

Main Results on VE-Safety

Attack Success Rate (ASR %) across 8 frontier MLLMs. MM-Plan significantly outperforms all baselines, especially on heavily defended proprietary models.

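Open-Weight models: Llama-3.2-11B, InternVL3-8B, Qwen3-VL-8B. Proprietary models: GPT-4o, GPT-5, Sonnet 3.7, Sonnet 4.5, Gemini 2.5 Pro.

| Method | Llama-3.2-11B | InternVL3-8B | Qwen3-VL-8B | GPT-4o | GPT-5 | Sonnet 3.7 | Sonnet 4.5 | Gemini 2.5 Pro |
|---|---|---|---|---|---|---|---|---|
| Direct Request | 13.4 | 27.2 | 11.9 | 5.0 | 0.6 | 4.7 | 8.4 | 9.7 |
| Direct Plan | 18.1 | 34.7 | 22.5 | 9.4 | 0.9 | 8.1 | 9.7 | 11.9 |
| FigStep | 23.8 | 44.4 | 33.1 | 6.6 | 0.6 | 13.4 | 24.4 | 11.3 |
| SI-Attack | 25.6 | 31.9 | 29.1 | 8.1 | 1.9 | 12.8 | 15.6 | 12.5 |
| SSA | 25.3 | 39.1 | 29.4 | 6.3 | 1.6 | 9.7 | 15.9 | 12.2 |
| Crescendo | 21.9 | 45.0 | 33.8 | 14.4 | 3.1 | 15.0 | 18.1 | 15.9 |
| MM-Plan (Ours) | 64.4* | 65.0* | 54.4* | 36.9* | 13.8* | 27.2* | 46.3* | 43.8* |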

* Statistically significant improvement (p ≤ 0.05) over second-best method.
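
The improvement factors behind the headline numbers can be recomputed directly from the table. The sketch below uses only the ASR values reported above; the function and variable names are illustrative.

```python
MODELS = [
    "Llama-3.2-11B", "InternVL3-8B", "Qwen3-VL-8B", "GPT-4o",
    "GPT-5", "Sonnet 3.7", "Sonnet 4.5", "Gemini 2.5 Pro",
]

# ASR (%) per baseline, in the same column order as MODELS.
BASELINES = {
    "Direct Request": [13.4, 27.2, 11.9, 5.0, 0.6, 4.7, 8.4, 9.7],
    "Direct Plan":    [18.1, 34.7, 22.5, 9.4, 0.9, 8.1, 9.7, 11.9],
    "FigStep":        [23.8, 44.4, 33.1, 6.6, 0.6, 13.4, 24.4, 11.3],
    "SI-Attack":      [25.6, 31.9, 29.1, 8.1, 1.9, 12.8, 15.6, 12.5],
    "SSA":            [25.3, 39.1, 29.4, 6.3, 1.6, 9.7, 15.9, 12.2],
    "Crescendo":      [21.9, 45.0, 33.8, 14.4, 3.1, 15.0, 18.1, 15.9],
}

MM_PLAN = [64.4, 65.0, 54.4, 36.9, 13.8, 27.2, 46.3, 43.8]


def improvement_over_best_baseline():
    """Map each model to (best baseline name, its ASR, MM-Plan / baseline ratio)."""
    result = {}
    for i, model in enumerate(MODELS):
        best_name, best_asr = max(
            ((name, row[i]) for name, row in BASELINES.items()),
            key=lambda pair: pair[1],
        )
        result[model] = (best_name, best_asr, MM_PLAN[i] / best_asr)
    return result


if __name__ == "__main__":
    for model, (name, asr, ratio) in improvement_over_best_baseline().items():
        print(f"{model}: best baseline {name} ({asr}%), MM-Plan is {ratio:.2f}x")
```

On the GPT-5 column this gives roughly 4.5× over Crescendo (13.8 vs 3.1), and on the Sonnet 4.5 column roughly 1.9× over FigStep (46.3 vs 24.4).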

VE-Safety Benchmark

We introduce VE-Safety, the first benchmark specifically targeting the Image-as-Basis threat model with real-world technical imagery.

  • 440 human-curated instances
  • 15 safety categories
  • 100% real-world images

Dataset Characteristics

| Benchmark | Image Type | Visual Role |
|---|---|---|
| FigStep | Typographic | Image-as-Wrapper |
| HADES | Typo. / Adv. Noise | Image-as-Wrapper |
| MM-SafetyBench | Typo. / SD | Image-as-Wrapper |
| HarmBench (MM) | SD / Real | Image-as-Basis |
| VE-Safety (Ours) | Real | Image-as-Basis |

Contributions

Citation

If you find our work useful, please cite our paper:

@article{zhang2025mmplan,
  title={Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning},
  author={Zhang, Yunbei and Ge, Yingqiang and Xu, Weijie and Xu, Yuhui and Hamm, Jihun and Reddy, Chandan K.},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}