Transactions on Machine Learning Research — 2026

Prompt-based Adaptation in Large-scale Vision Models: A Survey

Xi Xiao*· Yunbei Zhang*· Lin Zhao*· Yiyang Liu*· Xiaoying Liao· Zheda Mai· Xingjian Li· Xiao Wang· Hao Xu· Jihun Hamm· Xue Lin· Min Xu· Qifan Wang· Tianyang Wang†· Cheng Han†
* Equal contribution    † Corresponding authors
University of Alabama at Birmingham · Tulane University · Northeastern University · University of Missouri-Kansas City · Johns Hopkins University · Ohio State University · Carnegie Mellon University · Oak Ridge National Laboratory · Harvard University · MBZUAI · Meta AI
A unified map of how small prompts—pixels added to images or tokens injected into transformers—can adapt frozen vision models to new tasks without touching their weights.
At a glance: 300+ papers reviewed · 5 prompt types · 6+ application domains · the first unified PA survey

In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the "pretrain-then-finetune" paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications.

In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). Within this framework, we distinguish methods based on their injection granularity: VP operates at the pixel level, while VPT injects prompts at the token level. We further categorize these methods by their generation mechanism into fixed, learnable, and generated prompts. Beyond the core methodologies, we examine PA's integrations across diverse domains—including medical imaging, 3D point clouds, and vision-language tasks—as well as its role in test-time adaptation and trustworthy AI.

How Prompt-based Adaptation Works

PA introduces two complementary mechanisms for steering frozen vision backbones with minimal parameter updates: modifying pixels before the model sees them (VP), or inserting learnable tokens inside the model (VPT).

Figure 1. Transfer learning vs. prompt-based adaptation. (a) Conventional protocols grouped by tuning scope. (b) VPT freezes the backbone and optimizes additional prompt tokens together with the head. (c) VP modifies the input space by adding pixel-level prompts while keeping the backbone frozen.
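The two mechanisms can be sketched in a few lines. This is a minimal numpy illustration of the injection points only, with toy shapes standing in for a real ViT backbone; the border-shaped pixel prompt and the prompt length of 10 are illustrative assumptions, not values from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a frozen pipeline (assumption: toy shapes, not a real ViT).
image = rng.normal(size=(3, 224, 224))   # C x H x W input image
tokens = rng.normal(size=(196, 768))     # patch tokens after embedding

# --- VP: a pixel-level prompt is added to the input image ---
pixel_prompt = np.zeros_like(image)
pixel_prompt[:, :16, :] = 0.1            # e.g., a learnable border region
prompted_image = image + pixel_prompt    # backbone weights stay untouched

# --- VPT: learnable tokens are prepended to the patch sequence ---
num_prompts = 10
prompt_tokens = rng.normal(scale=0.02, size=(num_prompts, 768))
prompted_tokens = np.concatenate([prompt_tokens, tokens], axis=0)
```

In both cases only the prompt parameters (and typically a task head) receive gradients; the image outside the prompted region and all backbone weights are unchanged, while the token sequence simply grows by the prompt length.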

A Unified View of PA

Methods are categorized by where prompts are injected and how they are obtained.

Visual Prompting (VP)

Pixel-level prompts, applied before tokenization:
VP-Fixed: static points, boxes, and masks (e.g., SAM)
VP-Learnable: optimized pixel overlays and frequency cues
VP-Generated: instance-adaptive prompts produced by generator networks

Visual Prompt Tuning (VPT)

Token-level prompts, injected inside the network:
VPT-Learnable: gradient-trained tokens, inserted shallow (first layer only) or deep (every layer)
VPT-Generated: adaptive tokens produced per instance by a network
Figure 2. VPT variants. Left: Shallow prompts at the first layer only. Middle: Deep prompts injected at every transformer layer. Right: Generated prompts produced per-instance by a lightweight network.
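The shallow/deep distinction is purely about where fresh prompts enter the network. A minimal numpy sketch of the two injection patterns, with an identity function standing in for a frozen transformer block and small toy sizes (all values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, P, D = 3, 8, 4, 16   # layers, patch tokens, prompts, embedding dim

def block(x):
    # Stand-in for a frozen transformer block; its internals are
    # irrelevant to the injection pattern being illustrated.
    return x

tokens = rng.normal(size=(N, D))

# Shallow: one prompt set, prepended once, carried through all layers.
shallow_prompts = rng.normal(scale=0.02, size=(P, D))
x = np.concatenate([shallow_prompts, tokens], axis=0)
for _ in range(L):
    x = block(x)

# Deep: a fresh learnable prompt set overwrites the prompt slots
# before every layer, discarding the previous layer's prompt outputs.
deep_prompts = rng.normal(scale=0.02, size=(L, P, D))
y = np.concatenate([deep_prompts[0], tokens], axis=0)
y = block(y)
for layer_idx in range(1, L):
    y[:P] = deep_prompts[layer_idx]
    y = block(y)
```

Deep prompting multiplies the prompt parameter count by the number of layers but gives every layer its own steering signal, which is why it tends to appear when shallow prompts underfit.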
Figure 3. VP variants. Left: Fixed prompts (predefined boxes, points). Middle: Learned pixel-space overlays. Right: Generator-produced instance-adaptive image prompts.
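The generated variant conditions the prompt on the input itself. A minimal sketch, assuming a single linear map over per-channel statistics as a stand-in for the lightweight generator network (the generator design and the statistics used are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator weights: the only trained parameters.
W = rng.normal(scale=0.01, size=(3, 3))

def generate_prompt(image):
    stats = image.mean(axis=(1, 2))     # (3,) per-channel summary of this instance
    gain = W @ stats                    # instance-conditioned channel offsets
    return gain[:, None, None] * np.ones_like(image)

image = rng.normal(size=(3, 224, 224))
prompted = image + generate_prompt(image)
```

Unlike a fixed or learned overlay, which is shared across the whole dataset, each input here receives its own prompt, which is the defining property of the generated category.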

Where PA Is Used

PA has been applied across four broad areas: foundational CV tasks; domain-specific settings such as medical imaging, 3D point clouds, and vision-language tasks; constrained learning paradigms such as test-time adaptation; and trustworthy AI.

Key Challenges & Future Directions

Safety Alignment
Prompt interventions can be exploited by malicious actors to generate harmful content. Aligning PA with human values requires robustness evaluations, continuous monitoring, and systematic bias audits throughout development and deployment.
Training Overhead & Stability
Although per-iteration cost is low, total training time often exceeds that of full fine-tuning because of extensive hyperparameter search over prompt length, learning rate, and initialization. Seed sensitivity compounds the issue, demanding multiple runs for reliable results.
Inference Latency
Supplementary prompt components add memory and compute at inference time, since prompt tokens lengthen the transformer's input sequence. Pruning, knowledge distillation, and quantization are promising directions for mitigating this overhead without sacrificing adaptation quality.
Real-World Evaluation
Current PA methods are predominantly benchmarked on VTAB-1k, FGVC, and ImageNet. Evaluation on diverse, complex, distribution-shifting datasets is needed to validate practical applicability in real-world deployment scenarios.

BibTeX

@article{xiao2025prompt,
  title   = {Prompt-based Adaptation in Large-scale Vision
             Models: A Survey},
  author  = {Xiao, Xi and Zhang, Yunbei and Zhao, Lin
             and Liu, Yiyang and Liao, Xiaoying and Mai, Zheda
             and Li, Xingjian and Wang, Xiao and Xu, Hao
             and Hamm, Jihun and Lin, Xue and Xu, Min
             and Wang, Qifan and Wang, Tianyang and Han, Cheng},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  url     = {https://openreview.net/forum?id=UwtXDttgsE}
}