Transactions on Machine Learning Research — 2026

Prompt-based Adaptation in Large-scale Vision Models: A Survey

Xi Xiao*· Yunbei Zhang*· Lin Zhao*· Yiyang Liu*· Xiaoying Liao· Zheda Mai· Xingjian Li· Xiao Wang· Hao Xu· Jihun Hamm· Xue Lin· Min Xu· Qifan Wang· Tianyang Wang†· Cheng Han†
* Equal contribution    † Corresponding authors
University of Alabama at Birmingham · Tulane University · Northeastern University · University of Missouri-Kansas City · Johns Hopkins University · Ohio State University · Carnegie Mellon University · Oak Ridge National Laboratory · Harvard University · MBZUAI · Meta AI
A unified map of how small prompts—pixels added to images or tokens injected into transformers—can adapt frozen vision models to new tasks without touching their weights.
At a glance: 300+ papers reviewed · 5 prompt types · 6+ application domains · the first unified PA survey

In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the "pretrain-then-finetune" paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications.

In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). Within this framework, we distinguish methods based on their injection granularity: VP operates at the pixel level, while VPT injects prompts at the token level. We further categorize these methods by their generation mechanism into fixed, learnable, and generated prompts. Beyond the core methodologies, we examine PA's integrations across diverse domains—including medical imaging, 3D point clouds, and vision-language tasks—as well as its role in test-time adaptation and trustworthy AI.

How Prompt-based Adaptation Works

PA introduces two complementary mechanisms for steering frozen vision backbones with minimal parameter updates: modifying pixels before the model sees them (VP), or inserting learnable tokens inside the model (VPT).

Figure 1. Transfer learning vs. prompt-based adaptation. (a) Conventional protocols grouped by tuning scope. (b) VPT freezes the backbone and optimizes additional prompt tokens together with the head. (c) VP modifies the input space by adding pixel-level prompts while keeping the backbone frozen.
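The two mechanisms can be sketched in a few lines. This is a minimal numpy illustration of the injection points only, with toy shapes standing in for a real ViT backbone; the border-shaped pixel prompt and the prompt length of 10 are illustrative assumptions, not values from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a frozen pipeline (assumption: toy shapes, not a real ViT).
image = rng.normal(size=(3, 224, 224))   # C x H x W input image
tokens = rng.normal(size=(196, 768))     # patch tokens after embedding

# --- VP: a pixel-level prompt is added to the input image ---
pixel_prompt = np.zeros_like(image)
pixel_prompt[:, :16, :] = 0.1            # e.g., a learnable border region
prompted_image = image + pixel_prompt    # backbone weights stay untouched

# --- VPT: learnable tokens are prepended to the patch sequence ---
num_prompts = 10
prompt_tokens = rng.normal(scale=0.02, size=(num_prompts, 768))
prompted_tokens = np.concatenate([prompt_tokens, tokens], axis=0)
```

In both cases only the prompt parameters (and typically a task head) receive gradients; the image outside the prompted region and all backbone weights are unchanged, while the token sequence simply grows by the prompt length.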

A Unified View of PA

Methods are categorized by where prompts are injected and how they are obtained.

Visual Prompting (VP)

Pixel-level prompts, applied before tokenization:
VP-Fixed: static points, boxes, and masks (e.g., SAM)
VP-Learnable: optimized pixel overlays and frequency cues
VP-Generated: instance-adaptive prompts produced by generator networks

Visual Prompt Tuning (VPT)

Token-level prompts, injected inside the network:
VPT-Learnable: gradient-trained tokens, inserted shallow (first layer only) or deep (every layer)
VPT-Generated: adaptive tokens produced per instance by a network
Figure 2. VPT variants. Left: Shallow prompts at the first layer only. Middle: Deep prompts injected at every transformer layer. Right: Generated prompts produced per-instance by a lightweight network.
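The shallow/deep distinction is purely about where fresh prompts enter the network. A minimal numpy sketch of the two injection patterns, with an identity function standing in for a frozen transformer block and small toy sizes (all values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, P, D = 3, 8, 4, 16   # layers, patch tokens, prompts, embedding dim

def block(x):
    # Stand-in for a frozen transformer block; its internals are
    # irrelevant to the injection pattern being illustrated.
    return x

tokens = rng.normal(size=(N, D))

# Shallow: one prompt set, prepended once, carried through all layers.
shallow_prompts = rng.normal(scale=0.02, size=(P, D))
x = np.concatenate([shallow_prompts, tokens], axis=0)
for _ in range(L):
    x = block(x)

# Deep: a fresh learnable prompt set overwrites the prompt slots
# before every layer, discarding the previous layer's prompt outputs.
deep_prompts = rng.normal(scale=0.02, size=(L, P, D))
y = np.concatenate([deep_prompts[0], tokens], axis=0)
y = block(y)
for layer_idx in range(1, L):
    y[:P] = deep_prompts[layer_idx]
    y = block(y)
```

Deep prompting multiplies the prompt parameter count by the number of layers but gives every layer its own steering signal, which is why it tends to appear when shallow prompts underfit.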
Figure 3. VP variants. Left: Fixed prompts (predefined boxes, points). Middle: Learned pixel-space overlays. Right: Generator-produced instance-adaptive image prompts.
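The generated variant conditions the prompt on the input itself. A minimal sketch, assuming a single linear map over per-channel statistics as a stand-in for the lightweight generator network (the generator design and the statistics used are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator weights: the only trained parameters.
W = rng.normal(scale=0.01, size=(3, 3))

def generate_prompt(image):
    stats = image.mean(axis=(1, 2))     # (3,) per-channel summary of this instance
    gain = W @ stats                    # instance-conditioned channel offsets
    return gain[:, None, None] * np.ones_like(image)

image = rng.normal(size=(3, 224, 224))
prompted = image + generate_prompt(image)
```

Unlike a fixed or learned overlay, which is shared across the whole dataset, each input here receives its own prompt, which is the defining property of the generated category.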

Where PA Is Used

PA has been applied across four broad areas: foundational CV tasks; domain-specific settings such as medical imaging, 3D point clouds, and vision-language tasks; constrained learning paradigms such as test-time adaptation; and trustworthy AI.

Key Challenges & Future Directions

Safety Alignment
Prompt interventions can be exploited by malicious actors to generate harmful content. Aligning PA with human values requires robustness evaluations, continuous monitoring, and systematic bias audits throughout development and deployment.
Training Overhead & Stability
Although per-iteration cost is low, total training time often exceeds that of full fine-tuning because of extensive hyperparameter search over prompt length, learning rate, and initialization. Seed sensitivity compounds the issue, demanding multiple runs for reliable results.
Inference Latency
Supplementary prompt components add memory and compute at inference time, since prompt tokens lengthen the transformer's input sequence. Pruning, knowledge distillation, and quantization are promising directions for mitigating this overhead without sacrificing adaptation quality.
Real-World Evaluation
Current PA methods are predominantly benchmarked on VTAB-1k, FGVC, and ImageNet. Evaluation on diverse, complex, distribution-shifting datasets is needed to validate practical applicability in real-world deployment scenarios.

BibTeX

@article{xiao2025prompt,
  title   = {Prompt-based Adaptation in Large-scale Vision
             Models: A Survey},
  author  = {Xiao, Xi and Zhang, Yunbei and Zhao, Lin
             and Liu, Yiyang and Liao, Xiaoying and Mai, Zheda
             and Li, Xingjian and Wang, Xiao and Xu, Hao
             and Hamm, Jihun and Lin, Xue and Xu, Min
             and Wang, Qifan and Wang, Tianyang and Han, Cheng},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  url     = {https://openreview.net/forum?id=UwtXDttgsE}
}