In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the "pretrain-then-finetune" paradigm. Despite rapid progress, however, their conceptual boundaries remain blurred: VP and VPT are frequently used interchangeably in current research, with no systematic distinction between the two techniques or their respective scopes of application.
In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). Within this framework, we distinguish methods based on their injection granularity: VP operates at the pixel level, while VPT injects prompts at the token level. We further categorize these methods by their generation mechanism into fixed, learnable, and generated prompts. Beyond the core methodologies, we examine PA's integrations across diverse domains—including medical imaging, 3D point clouds, and vision-language tasks—as well as its role in test-time adaptation and trustworthy AI.
PA offers two complementary mechanisms for steering frozen vision backbones with minimal parameter updates: modifying the input pixels before the model sees them (VP) or inserting learnable tokens inside the model (VPT).
Methods are categorized by where prompts are injected and how they are obtained.
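The pixel-level vs. token-level distinction can be sketched with toy tensors. This is a minimal illustration, not any specific method from the survey; the shapes, the border-pad scheme for VP, and the prompt count for VPT are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Visual Prompting (VP): pixel-level injection ---
# A learnable perturbation is added to the image before the frozen
# backbone sees it; here the prompt lives on a border of width `pad`.
image = rng.standard_normal((3, 224, 224))        # C x H x W input
pixel_prompt = np.zeros((3, 224, 224))            # learnable, zero-initialized
pad = 16
mask = np.zeros((3, 224, 224), dtype=bool)
mask[:, :pad, :] = mask[:, -pad:, :] = True
mask[:, :, :pad] = mask[:, :, -pad:] = True
prompted_image = image + pixel_prompt * mask      # same shape as the input

# --- Visual Prompt Tuning (VPT): token-level injection ---
# Learnable tokens are prepended to the patch-token sequence of a ViT.
num_patches, dim, num_prompts = 196, 768, 10      # ViT-B/16-like sizes
patch_tokens = rng.standard_normal((num_patches, dim))
prompt_tokens = rng.standard_normal((num_prompts, dim))  # learnable
vpt_input = np.concatenate([prompt_tokens, patch_tokens], axis=0)

print(prompted_image.shape)  # (3, 224, 224): input shape is unchanged
print(vpt_input.shape)       # (206, 768): sequence grows by num_prompts
```

In both cases only the prompt parameters would be trained; the backbone stays frozen, which is what makes PA parameter-efficient.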
@article{xiao2025prompt,
  title   = {Prompt-based Adaptation in Large-scale Vision Models: A Survey},
  author  = {Xiao, Xi and Zhang, Yunbei and Zhao, Lin and Liu, Yiyang and Liao, Xiaoying and Mai, Zheda and Li, Xingjian and Wang, Xiao and Xu, Hao and Hamm, Jihun and Lin, Xue and Xu, Min and Wang, Qifan and Wang, Tianyang and Han, Cheng},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  url     = {https://openreview.net/forum?id=UwtXDttgsE}
}