BETA: Adapting in the Dark. Test-Time Adaptation for Black-Box Models

Abstract

Test-Time Adaptation (TTA) for black-box models accessible only via APIs remains a largely unexplored challenge. Existing approaches such as post-hoc output refinement offer limited adaptive capacity, while Zeroth-Order Optimization (ZOO) enables input-space adaptation but faces high query costs and optimization challenges in the unsupervised TTA setting. We introduce BETA (Black-box Efficient Test-time Adaptation), a framework that addresses these limitations by employing a lightweight, local white-box steering model to create a tractable gradient pathway. Through a prediction harmonization technique combined with consistency regularization and prompt learning-oriented filtering, BETA enables stable adaptation with no additional API calls and negligible latency beyond standard inference. On ImageNet-C, BETA achieves a +7.1% accuracy gain on ViT-B/16 and +3.4% on CLIP, surpassing strong white-box and gray-box methods including TENT and TPT. On a commercial API, BETA achieves comparable performance to ZOO at 250× lower cost.

Key Results at a Glance

+7.1%

ImageNet-C accuracy gain
(ViT-B/16, black-box)

+3.4%

ImageNet-C accuracy gain
(CLIP)

250×

Cheaper than ZOO
(Clarifai commercial API)

1

API call per test sample
(vs. 16+ for ZOO)

The Strict Black-Box TTA Setting

Most state-of-the-art vision models are now deployed as opaque APIs. In this setting the client can only submit a raw image and receive a probability vector in return, with no access to parameters, gradients, or intermediate features. Prior TTA work relies on exactly those missing signals.

Figure 1. The strict black-box TTA setting. The client only observes probabilities from the server-side API. Unlike white-box TTA, no gradients, parameters, or intermediate features are available. This is a practical but challenging regime for real-world API-based deployment.
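To make the access model concrete, the sketch below wraps a remote predictor so that the client can only submit an image and read back a probability vector, while counting queries. The class name `BlackBoxAPI`, the method `predict_proba`, and the toy `toy_remote` function are hypothetical illustrations, not part of any real API client.

```python
from typing import Callable, List, Sequence


class BlackBoxAPI:
    """Hypothetical client-side wrapper: exposes probabilities only and counts queries."""

    def __init__(self, remote_predict: Callable[[Sequence[float]], List[float]]):
        # `remote_predict` stands in for the server-side model; the client never inspects it.
        self._remote_predict = remote_predict
        self.num_calls = 0

    def predict_proba(self, image: Sequence[float]) -> List[float]:
        """Submit one image, receive one probability vector. No gradients or features."""
        self.num_calls += 1
        return list(self._remote_predict(image))


def toy_remote(image: Sequence[float]) -> List[float]:
    # Toy stand-in for a real API: two-class output driven by mean pixel intensity.
    mean = sum(image) / len(image)
    return [mean, 1.0 - mean]


api = BlackBoxAPI(toy_remote)
probs = api.predict_proba([0.2, 0.4, 0.6])
print(probs, api.num_calls)
```

Tracking `num_calls` on the client side is one simple way to compare query budgets across methods, for example one call per sample versus the 16+ calls attributed to ZOO above.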

Table 1. Comparison of TTA methods across key capabilities. We evaluate each method's requirements for accessing model parameters, internal tokens, intermediate features, and gradients, alongside its visual encoder architectural flexibility, support for different model types (Vision models (VMs) / Vision-Language models (VLMs)), query efficiency (one API call per test sample), and inference latency. BETA is the only strict black-box method that keeps a single API call per sample and real-time latency.

Access	Method	w/o Params.	w/o Tokens	w/o Feats.	w/o Grad.	Arch-Agnostic	VMs	VLMs	1 API/Sample	Low Latency
White	TENT	✗	✗	✗	✗	✓	✓	✓	✓	✓
White	TPT	✗	✗	✗	✗	✓	✗	✓	✓	✓
Gray	T3A	✗	✓	✗	✓	✓	✓	✓	✗	✓
Gray	FOA	✓	✗	✗	✓	ViT-only	✓	✓	✗	✗
Gray	B²TPT	✓	✗	✓	✓	ViT-only	✗	✓	✗	✗
Gray	BCA	✓	✓	✗	✗	✓	✓	✓	✓	✓
Black	LAME	✓	✓	✓	✓	✓	✓	✓	✓	✓
Black	Augmentation	✓	✓	✓	✓	✓	✓	✓	✗	✗
Black	Purification	✓	✓	✓	✓	✓	✓	✓	✗	✗
Black	ZOO	✓	✓	✓	✓	✓	✓	✓	✗	✗
Black	BETA (Ours)	✓	✓	✓	✓	✓	✓	✓	✓	✓

Method Overview

Figure 2. Comparison of black-box TTA strategies. (a) Output refinement (LAME) only post-processes predictions. (b) ZOO prompting needs many API calls per sample. (c) BETA uses a local steering model and prediction harmonization to create a tractable gradient pathway with one API call per sample.

BETA operates with two models: a frozen black-box target $f_B$ (e.g., a commercial API) and a lightweight local steering model $f_S$ with full gradient access. We learn an additive visual prompt $\delta$ and optimize it through the steering model's local gradients.

Why naive gradient transfer fails

Gradients from a local surrogate do not transfer directly to a different black-box architecture. The per-example gradient cosine similarity between a ViT-B/16 target and a local ViT-S/16 or ResNet-18 is near zero.

Figure 3. Gradient similarity is not transferable. Directly importing gradients from a small local model into the input update for a black-box target is ineffective. BETA instead harmonizes predictions to create a tractable optimization target.
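The similarity measure referred to above can be sketched as a plain cosine similarity between two flattened input-gradient vectors. The gradient values below are made up for illustration and are not measurements from the paper; in practice the target gradient would only be available in a diagnostic white-box setting.

```python
import math
from typing import Sequence


def cosine_similarity(g_a: Sequence[float], g_b: Sequence[float], eps: float = 1e-12) -> float:
    """Cosine similarity between two flattened input-gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    return dot / (norm_a * norm_b + eps)


# Made-up gradients for illustration only.
grad_target = [1.0, 0.0, -1.0, 0.0]   # hypothetical gradient from the target model
grad_local = [0.0, 1.0, 0.0, -1.0]    # hypothetical gradient from the local model
print(cosine_similarity(grad_target, grad_local))   # orthogonal directions
print(cosine_similarity(grad_target, grad_target))  # identical directions
```

A value near zero means the local gradient carries essentially no information about the descent direction of the target, which is the failure mode described here.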

Prediction harmonization

Rather than transferring gradients, BETA fuses the output probabilities of the two models on the prompted input $x' = x + \delta$ into a single harmonized distribution $p_\alpha(x') = \alpha\,p_S(x') + (1-\alpha)\,p_B(x')$ and minimizes its entropy. This exposes a tractable asymmetric gradient pathway: the prompt $\delta$ is updated only through $f_S$'s gradients, while the black-box predictions $p_B$ are treated as constants and enter as a data-dependent mixing target.
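The asymmetric pathway can be illustrated with a minimal sketch, under several assumptions that are not from the paper: the steering model is a toy linear-softmax classifier with weights `W`, the additive prompt is applied as `x + delta`, and the black-box output `p_b` is a fixed vector standing in for a single API response. The gradient of the harmonized entropy is computed analytically and flows through the steering model only.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p: Sequence[float], eps: float = 1e-12) -> float:
    return -sum(v * math.log(v + eps) for v in p)


def steering_probs(W, x, delta) -> List[float]:
    """Toy linear steering model f_S (an assumption): logits = W (x + delta)."""
    xp = [xi + di for xi, di in zip(x, delta)]
    return softmax([sum(w * v for w, v in zip(row, xp)) for row in W])


def harmonized_entropy(W, x, delta, p_b, alpha) -> float:
    p_s = steering_probs(W, x, delta)
    p_m = [alpha * s + (1 - alpha) * b for s, b in zip(p_s, p_b)]
    return entropy(p_m)


def grad_delta(W, x, delta, p_b, alpha, eps: float = 1e-12) -> List[float]:
    """Gradient of H(p_alpha) w.r.t. delta through f_S only; p_b is a constant."""
    p_s = steering_probs(W, x, delta)
    p_m = [alpha * s + (1 - alpha) * b for s, b in zip(p_s, p_b)]
    # dH/dp_S[k] = -alpha * (log p_m[k] + 1)
    g_p = [-alpha * (math.log(m + eps) + 1.0) for m in p_m]
    # Softmax backward pass: dH/dz[k] = p_S[k] * (g_p[k] - <p_S, g_p>)
    inner = sum(s * g for s, g in zip(p_s, g_p))
    g_z = [s * (g - inner) for s, g in zip(p_s, g_p)]
    # Linear layer: dH/ddelta[j] = sum_k W[k][j] * dH/dz[k]
    return [sum(W[k][j] * g_z[k] for k in range(len(W))) for j in range(len(x))]


# Illustrative values only.
W = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.1], [0.2, 0.1, -0.7]]
x = [0.5, 0.1, 0.9]
delta = [0.0, 0.0, 0.0]
p_b = [0.5, 0.3, 0.2]  # stands in for one API response
alpha, lr = 0.5, 0.2

h_before = harmonized_entropy(W, x, delta, p_b, alpha)
for _ in range(50):
    step = grad_delta(W, x, delta, p_b, alpha)
    delta = [d - lr * g for d, g in zip(delta, step)]
h_after = harmonized_entropy(W, x, delta, p_b, alpha)
print(h_before, h_after)
```

In an autograd framework the same effect would typically be obtained by detaching the black-box probabilities from the computation graph; the hand-written backward pass here only keeps the sketch dependency-free.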

Why harmonization helps (analytical view). Let $p_m = \alpha p_S + (1-\alpha) p_B$ denote the harmonized distribution. Because entropy is concave, $H(p_m)$ upper-bounds the weighted combination of the two per-model entropies, and the gap is exactly a generalized Jensen-Shannon alignment term between $p_S$ and $p_B$: $H(p_m) \;=\; \alpha H(p_S) + (1-\alpha) H(p_B) \;+\; \mathrm{JS}_\alpha(p_S \,\|\, p_B) \;\geq\; \alpha H(p_S) + (1-\alpha) H(p_B)$, where $\mathrm{JS}_\alpha(p_S \,\|\, p_B) = \alpha\,\mathrm{KL}(p_S \,\|\, p_m) + (1-\alpha)\,\mathrm{KL}(p_B \,\|\, p_m) \geq 0$. Minimizing $H(p_m)$ therefore (i) sharpens the local view, (ii) softly nudges it toward the remote view, and (iii) does not require $f_B$'s gradients. This is exactly what a black-box TTA objective needs: a signal that is simultaneously confident and consistent with the opaque API, while remaining differentiable through $f_S$ alone.
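A quick numerical check can make the relationship between $H(p_m)$, the per-model entropies, and the alignment term tangible. The sketch assumes the generalized Jensen-Shannon term is defined as $\alpha\,\mathrm{KL}(p_S \| p_m) + (1-\alpha)\,\mathrm{KL}(p_B \| p_m)$; under that definition the mixture entropy equals the weighted per-model entropies plus the alignment term. The probability vectors are arbitrary illustrative values.

```python
import math
from typing import List, Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    return -sum(v * math.log(v) for v in p if v > 0)


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def decomposition(p_s: Sequence[float], p_b: Sequence[float], alpha: float) -> Tuple[float, float, float]:
    """Return (H(p_m), weighted per-model entropy, generalized JS term)."""
    p_m: List[float] = [alpha * s + (1 - alpha) * b for s, b in zip(p_s, p_b)]
    weighted = alpha * entropy(p_s) + (1 - alpha) * entropy(p_b)
    js = alpha * kl(p_s, p_m) + (1 - alpha) * kl(p_b, p_m)
    return entropy(p_m), weighted, js


# Arbitrary example distributions (not from the paper).
h_m, weighted, js = decomposition([0.7, 0.2, 0.1], [0.1, 0.3, 0.6], alpha=0.5)
print(h_m, weighted + js)  # the two numbers agree up to floating-point error
```

Because the alignment term is non-negative, driving $H(p_m)$ down pushes on both the per-model entropies and the disagreement between the two views at once.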

Stability: consistency and prompt-oriented filtering

Learning prompts from random initialization is brittle. BETA adds (i) a KL consistency term between clean and prompted predictions in the local view, and (ii) a prompt-learning-oriented reliable-and-diverse filter so the prompt is updated only from samples that give a stable learning signal.

Figure 4. Without stabilization, prompt learning collapses on hard corruptions (Contrast). BETA's consistency and filtering keep adaptation stable across 5 random seeds.
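The two stabilizers can be sketched as follows. The text above does not spell out the exact criteria, so everything concrete here is an assumption: the consistency term is taken as $\mathrm{KL}(p_S(x) \| p_S(x'))$, "reliable" is approximated by an entropy threshold, and "diverse" by a cosine-similarity test against a running mean of previously kept predictions. The thresholds and function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def entropy(p: Sequence[float], eps: float = 1e-12) -> float:
    return -sum(v * math.log(v + eps) for v in p)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Assumed consistency term: KL between clean and prompted local predictions."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b)) + 1e-12)


def select_samples(batch_probs: Sequence[Sequence[float]], max_entropy: float,
                   max_cosine: float, momentum: float = 0.9) -> List[int]:
    """Keep indices of samples that are confident (reliable) and not redundant (diverse)."""
    kept: List[int] = []
    running_mean: Optional[List[float]] = None
    for i, p in enumerate(batch_probs):
        if entropy(p) >= max_entropy:
            continue  # unreliable: prediction too uncertain
        if running_mean is not None and cosine(p, running_mean) > max_cosine:
            continue  # redundant: too similar to what was already used
        kept.append(i)
        if running_mean is None:
            running_mean = list(p)
        else:
            running_mean = [momentum * m + (1 - momentum) * v for m, v in zip(running_mean, p)]
    return kept


# Illustrative batch of 3-class predictions and illustrative thresholds.
batch = [[0.90, 0.05, 0.05], [0.34, 0.33, 0.33], [0.92, 0.04, 0.04], [0.05, 0.90, 0.05]]
kept = select_samples(batch, max_entropy=0.4 * math.log(3), max_cosine=0.95)
consistency = kl_divergence([0.90, 0.05, 0.05], [0.50, 0.25, 0.25])
print(kept, consistency)
```

In this toy batch the near-uniform prediction is dropped as unreliable and the near-duplicate confident prediction is dropped as redundant, so the prompt update would only use the remaining two samples.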

Main Results: ImageNet-C, ViT-B/16

Table 2. Classification accuracy (%) on ImageNet-C (severity 5) with ViT-B/16 as the frozen black-box target. BETA surpasses all black-box and gray-box baselines as well as the white-box methods TENT and CoTTA, and trails only SAR and ETA, despite only seeing API probabilities.

Access	Method	Avg.	Gain
Black	Source (no adapt)	55.5	n/a
White	TENT	59.6	+4.1
White	SAR	63.6	+8.1
White	CoTTA	61.6	+6.1
White	ETA	65.8	+10.3
Gray	T3A	56.9	+1.4
Gray	FOA	44.9	−10.6
Black	LAME	54.1	−1.4
Black	ZOO-CMA	54.5	−1.0
Black	ZOO-RGF	56.0	+0.5
Black	BETA (Ours)	62.6	+7.1

Real-world commercial API (Clarifai)

Figure 5. On the Clarifai commercial API, BETA delivers a +5.2% accuracy gain for under $0.4. ZOO-based prompting requires over $100 to reach a comparable point, a 250× cost gap.

Generalization beyond ImageNet-C

Table 3. Fine-grained EuroSAT with CLIP ViT-B/16 as the target. BETA is the only strict black-box method and delivers the largest gain (+11.3%) with a single API query per sample, while gray-box prompting and ZERO-style variants need 64 to 448 queries.

Access	Method	Acc. (%)	Gain	#API
Black	Source	42.0	n/a	1
Gray	B²TPT (w/ tokens)	46.8	+4.8	120
Gray	ZERO (w/ logits)	39.6	−2.4	64
Gray	ZERO_ensemble (w/ logits)	43.8	+1.8	448
Black	BETA (Ours)	53.3	+11.3	1

Table 4. Dermatology classification on Derm7pt with both a general-purpose CLIP ViT-B/16 and a domain-specialized BiomedCLIP as the black-box target. BETA gives consistent gains across very different backbones.

Target (Black-box)	Source	LAME	TT-Aug	BETA (Ours)
CLIP ViT-B/16	55.9	56.0	57.1	58.6
BiomedCLIP	60.9	60.4	61.3	62.1

Beyond Vision: a Test-Time Advisor Strategy

BETA's core mechanism, a small local model shaping the behavior of a larger frozen remote model at inference time, is, we believe, a preview of how adaptation will work in the agent era. Anthropic describe a closely related pattern for LLM agents:

“The advisor strategy pairs a strong, expensive model as an advisor with a faster, cheaper model as an executor. The advisor sees the task and context up front and writes a plan… The executor then carries out that plan step-by-step, consulting the advisor when it hits a hard decision.”
Anthropic, The Advisor Strategy (2026)

BETA is the vision, test-time version of the same design, with a twist: the expensive side is the executor, and the small local side is the advisor. The advisor has gradients but limited capacity; the executor has capacity but is opaque. The advisor does not change the executor's weights. It only shapes its input so that the executor's own computation lands on the correct answer.

LLM advisor strategy

Strong advisor writes the plan
Cheap executor runs it
Advisor consulted on hard decisions
Uses more tokens, not more training

BETA (vision, test-time)

Local steering model provides gradients
Black-box API produces predictions
Reliable-and-diverse filter consults only on trustworthy samples
Uses a single API call, not retraining

We expect the local-advisor-shapes-remote-executor recipe to become a common pattern as AI stacks move to agents composed of many opaque sub-components, from vision APIs to retrieval systems, reward models, and tool APIs. BETA's prediction-harmonization and consistency-regularization objective is, in this light, a principled replacement for “try many prompts and keep the best,” and we see direct extensions to multi-modal agent pipelines and to LLM tool-use where tool outputs play the role of the opaque API.

BibTeX

@inproceedings{zhang2026adapting,
  title={Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models},
  author={Yunbei Zhang and Shuaicheng Niu and Chengyi Cai and Feng Liu and Jihun Hamm},
  booktitle={Third Workshop on Test-Time Updates (Main Track)},
  year={2026},
  url={https://openreview.net/forum?id=v56b8I1tua}
}