MM-Plan: A framework that reformulates multimodal jailbreaking as agentic planning, achieving state-of-the-art attack success rates against frontier MLLMs
1Tulane University 2Amazon
*Work done during an internship at Amazon
Current multimodal red teaming treats images as wrappers for malicious payloads via typography or adversarial noise. These attacks are structurally brittle, as standard defenses neutralize them once the payload is exposed.
We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat where harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), a framework that reframes jailbreaking from turn-by-turn reaction to global plan synthesis.
MM-Plan trains an attacker planner to synthesize comprehensive, multi-turn strategies, optimized via Group Relative Policy Optimization (GRPO), enabling self-discovery of effective strategies without human supervision. To rigorously benchmark this reasoning-dependent threat, we introduce VE-Safety, a human-curated dataset filling a critical gap in evaluating high-risk technical visual understanding.
MM-Plan achieves 46.3% attack success rate against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2–5× where existing methods largely fail. These findings reveal that frontier models remain vulnerable to agentic multimodal attacks, exposing a critical gap in current safety alignment.
Figure: MM-Plan framework. Given a harmful goal and image, our Attacker Planner generates complete multi-turn strategies in a single pass. Plans are sampled and executed against victim MLLMs, with rewards collected from a judge model. The policy is updated via GRPO based on relative plan performance.
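The GRPO update described above scores each sampled plan relative to its sampling group rather than against a learned value baseline. A minimal sketch of that group-relative advantage computation, assuming binary judge rewards per plan (function name and reward values are illustrative, not from the paper):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: center each sampled plan's judge reward
    on the group mean and scale by the group std (plus a small epsilon
    for numerical stability). Plans that beat their siblings get positive
    advantage; no separate critic/value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. judge rewards for 4 plans sampled for one (goal, image) pair:
# two plans jailbroke the victim model, two did not.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Each advantage then weights the policy-gradient term for the tokens of its plan, so the planner is pushed toward strategies that outperform the other plans drawn for the same goal.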
Image-as-Wrapper vs. Image-as-Basis. Prior attacks (top) embed harmful instructions typographically within images. In contrast, Visual Exclusivity (bottom) presents an Image-as-Basis threat where text input alone is insufficient—the harmful goal requires reasoning about spatial and functional relationships exclusive to the image.
Unlike prior "wrapper-based" attacks where images merely conceal text payloads, Visual Exclusivity (VE) exploits the model's own visual reasoning capabilities. In VE attacks:
the harmful payload never exists in textual form; it emerges only when the model reasons over spatial and functional relationships that are exclusive to the image.
This dependency renders standard defenses largely ineffective: OCR cannot extract payloads that don't exist in text form, and caption-based screening cannot capture precise structural details required for harm.
Attack Success Rate (ASR %) across 8 frontier MLLMs. MM-Plan significantly outperforms all baselines, especially on heavily defended proprietary models.
| Method | Llama-3.2-11B | InternVL3-8B | Qwen3-VL-8B | GPT-4o | GPT-5 | Sonnet 3.7 | Sonnet 4.5 | Gemini 2.5 Pro |
|---|---|---|---|---|---|---|---|---|
| Direct Request | 13.4 | 27.2 | 11.9 | 5.0 | 0.6 | 4.7 | 8.4 | 9.7 |
| Direct Plan | 18.1 | 34.7 | 22.5 | 9.4 | 0.9 | 8.1 | 9.7 | 11.9 |
| FigStep | 23.8 | 44.4 | 33.1 | 6.6 | 0.6 | 13.4 | 24.4 | 11.3 |
| SI-Attack | 25.6 | 31.9 | 29.1 | 8.1 | 1.9 | 12.8 | 15.6 | 12.5 |
| SSA | 25.3 | 39.1 | 29.4 | 6.3 | 1.6 | 9.7 | 15.9 | 12.2 |
| Crescendo | 21.9 | 45.0 | 33.8 | 14.4 | 3.1 | 15.0 | 18.1 | 15.9 |
| MM-Plan (Ours) | 64.4* | 65.0* | 54.4* | 36.9* | 13.8* | 27.2* | 46.3* | 43.8* |
* Statistically significant improvement (p ≤ 0.05) over second-best method.
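The ASR values in the table are percentages of goals for which the judge model deemed the attack successful. A minimal sketch of that metric, assuming one binary judge verdict per goal (the exact judging protocol is the paper's; the counts below are illustrative only):

```python
def attack_success_rate(verdicts):
    """ASR as a percentage: the fraction of attack goals whose executed
    plan was judged a successful jailbreak. `verdicts` is one boolean
    per goal."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(verdicts) / len(verdicts)

# Illustrative: 148 successes out of 320 goals -> 46.25% ASR.
asr = attack_success_rate([True] * 148 + [False] * 172)
```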
We introduce VE-Safety, the first benchmark specifically targeting the Image-as-Basis threat model with real-world technical imagery.
| Benchmark | Human-Curated | Image Type | Visual Role | Multi-Turn |
|---|---|---|---|---|
| FigStep | ✗ | Typographic | Image-as-Wrapper | ✗ |
| HADES | ✗ | Typo. / Adv. Noise | Image-as-Wrapper | ✗ |
| MM-SafetyBench | ✗ | Typo. / SD | Image-as-Wrapper | ✗ |
| HarmBench (MM) | ✗ | SD / Real | Image-as-Basis | ✗ |
| VE-Safety (Ours) | ✓ | Real | Image-as-Basis | ✓ |
We formalize a new multimodal vulnerability where harmful goals require visual reasoning about image content, providing criteria that distinguish VE from wrapper-based attacks.
We construct the first benchmark targeting Image-as-Basis threats, comprising 440 human-curated instances across 15 safety categories with verified non-textual irreducibility.
We propose a multimodal agentic planning framework that achieves 2–5× higher attack success rates than search-based and turn-by-turn baselines across frontier MLLMs.
If you find our work useful, please cite our paper:
```bibtex
@article{zhang2025mmplan,
  title={Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning},
  author={Zhang, Yunbei and Ge, Yingqiang and Xu, Weijie and Xu, Yuhui and Hamm, Jihun and Reddy, Chandan K.},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```