🏆  ICLR 2026 TTU Workshop · Oral

Adapting in the Dark

Efficient and Stable Test-Time Adaptation for Black-Box Models
Yunbei Zhang1, Shuaicheng Niu2, Chengyi Cai3, Feng Liu3, Jihun Hamm1

Abstract

Test-Time Adaptation (TTA) for black-box models accessible only via APIs remains a largely unexplored challenge. Existing approaches such as post-hoc output refinement offer limited adaptive capacity, while Zeroth-Order Optimization (ZOO) enables input-space adaptation but faces high query costs and optimization challenges in the unsupervised TTA setting. We introduce BETA (Black-box Efficient Test-time Adaptation), a framework that addresses these limitations by employing a lightweight, local white-box steering model to create a tractable gradient pathway. Through a prediction harmonization technique combined with consistency regularization and prompt learning-oriented filtering, BETA enables stable adaptation with no additional API calls and negligible latency beyond standard inference. On ImageNet-C, BETA achieves a +7.1% accuracy gain on ViT-B/16 and +3.4% on CLIP, surpassing strong white-box and gray-box methods including TENT and TPT. On a commercial API, BETA achieves comparable performance to ZOO at 250× lower cost.

Key Results at a Glance

+7.1%
ImageNet-C accuracy gain
(ViT-B/16, black-box)
+3.4%
ImageNet-C accuracy gain
(CLIP)
250×
Cheaper than ZOO
(Clarifai commercial API)
1
API call per test sample
(vs. 16+ for ZOO)

The Strict Black-Box TTA Setting

Most state-of-the-art vision models are now deployed as opaque APIs. In this setting the client can only submit a raw image and receive a probability vector in return, with no access to parameters, gradients, or intermediate features. Prior TTA work relies on exactly those missing signals.

Figure 1. The strict black-box TTA setting. The client only observes probabilities from the server-side API. Unlike white-box TTA, no gradients, parameters, or intermediate features are available. This is a practical but challenging regime for real-world API-based deployment.

Table 1. Comparison of TTA methods across key capabilities. We evaluate each method's requirements for accessing model parameters, internal tokens, intermediate features, and gradients, alongside its visual encoder architectural flexibility, support for different model types (Vision models (VMs) / Vision-Language models (VLMs)), query efficiency (one API call per test sample), and inference latency. BETA is the only strict black-box method that keeps a single API call per sample and real-time latency.

Access  Method         Notes
White   TENT
White   TPT
Gray    T3A
Gray    FOA            ViT-only visual encoder
Gray    B2TPT          ViT-only visual encoder
Gray    BCA
Black   LAME
Black   Augmentation
Black   Purification
Black   ZOO
Black   BETA (Ours)

Method Overview

Figure 2. Comparison of black-box TTA strategies. (a) Output refinement (LAME) only post-processes predictions. (b) ZOO prompting needs many API calls per sample. (c) BETA uses a local steering model and prediction harmonization to create a tractable gradient pathway with one API call per sample.

BETA operates with two models: a frozen black-box target $f_B$ (e.g., a commercial API) and a lightweight local steering model $f_S$ with full gradient access. We learn an additive visual prompt $\delta$ and optimize it through the steering model's local gradients.
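The two-model setup can be sketched with toy stand-ins. The linear-softmax models, dimensions, and variable names below are illustrative only, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
D, C = 16, 5                          # toy input dimension and class count

W_B = rng.standard_normal((D, C))     # frozen black-box target (weights hidden in practice)
W_S = rng.standard_normal((D, C))     # lightweight local steering model (full access)

def f_B(x):                           # one API call: returns probabilities only
    return softmax(x @ W_B)

def f_S(x):                           # local model: probabilities AND gradients
    return softmax(x @ W_S)

x = rng.standard_normal(D)            # test sample
delta = np.zeros(D)                   # additive visual prompt, learned at test time
x_prompted = x + delta                # the same prompted input goes to both models

p_B = f_B(x_prompted)                 # remote view (treated as non-differentiable data)
p_S = f_S(x_prompted)                 # local view (differentiable w.r.t. delta)
```

In practice $f_B$ is a remote endpoint returning only a probability vector, so everything downstream must treat $p_B$ as constant data rather than a differentiable node.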

Why naive gradient transfer fails

Gradients from a local surrogate do not transfer directly to a different black-box architecture. The per-example gradient cosine similarity between a ViT-B/16 target and a local ViT-S/16 or ResNet-18 is near zero.
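This failure mode is easy to reproduce in miniature. Below, two independent random linear-softmax models play the roles of target and surrogate, and we compare their analytic entropy-loss gradients at the same input; this is a toy diagnostic under stated assumptions, not the paper's ViT experiment:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def input_entropy_grad(x, W):
    """Analytic gradient of the prediction entropy H(softmax(x @ W)) w.r.t. x."""
    p = softmax(x @ W)
    H = -np.sum(p * np.log(p))
    g_z = -p * (np.log(p) + H)        # dH/dz for logits z = x @ W
    return W @ g_z                    # chain rule: dH/dx

rng = np.random.default_rng(0)
D, C = 512, 100
W_target = rng.standard_normal((D, C)) / np.sqrt(D)  # stands in for the black box
W_local = rng.standard_normal((D, C)) / np.sqrt(D)   # stands in for the surrogate

x = rng.standard_normal(D)
g_t = input_entropy_grad(x, W_target)
g_l = input_entropy_grad(x, W_local)

cos = g_t @ g_l / (np.linalg.norm(g_t) * np.linalg.norm(g_l))
print(f"gradient cosine similarity: {cos:.3f}")      # near zero for unrelated models
```

Because the two weight matrices are unrelated, the two input-space gradients are essentially independent directions in a high-dimensional space, mirroring the near-zero cosine similarity observed between real target/surrogate pairs.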

Figure 3. Gradient similarity is not transferable. Directly importing gradients from a small local model into the input update for a black-box target is ineffective. BETA instead harmonizes predictions to create a tractable optimization target.

Prediction harmonization

Rather than transferring gradients, BETA fuses output probabilities from the two models into a single harmonized distribution $p_\alpha(x') = \alpha\,p_S(x') + (1-\alpha)\,p_B(x')$ and minimizes its entropy. This exposes a tractable asymmetric gradient pathway: the prompt $\delta$ is updated only through $f_S$'s gradients, while the black-box predictions $p_B$ enter as a data-dependent mixing target.
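A minimal sketch of this asymmetric pathway, continuing the toy linear-softmax setup (the analytic gradient and hyperparameters are illustrative; in practice $f_S$ would be a small network optimized with autograd). The black-box probabilities are queried once and then held fixed, so the inner descent on $\delta$ costs no extra API calls:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

def harmonized_step(x, delta, W_S, p_B, alpha, lr):
    """One descent step on H(alpha*p_S + (1-alpha)*p_B) w.r.t. delta.
    p_B is a constant: gradients flow only through the local model W_S."""
    p_S = softmax((x + delta) @ W_S)
    p_m = alpha * p_S + (1 - alpha) * p_B
    g_pm = -(np.log(p_m) + 1.0)                       # dH/dp_m
    g_z = alpha * (p_S * g_pm - p_S * (p_S @ g_pm))   # back through the local softmax only
    return delta - lr * (W_S @ g_z), entropy(p_m)

rng = np.random.default_rng(1)
D, C = 16, 5
W_S = rng.standard_normal((D, C)) / np.sqrt(D)
x = rng.standard_normal(D)

p_B = softmax(rng.standard_normal(C))   # a single API call, then held fixed
alpha, delta = 0.5, np.zeros(D)

_, H0 = harmonized_step(x, delta, W_S, p_B, alpha, lr=0.0)  # entropy before adapting
for _ in range(40):
    delta, H = harmonized_step(x, delta, W_S, p_B, alpha, lr=0.5)
print(f"harmonized entropy: {H0:.3f} -> {H:.3f}")
```

Holding $p_B$ fixed after a single query is what keeps BETA at one API call per sample; re-querying the API inside the inner loop would recreate ZOO's query-cost profile.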

Why harmonization helps (analytical view). Let $p_m = \alpha p_S + (1-\alpha) p_B$ denote the harmonized distribution. The mixture entropy decomposes exactly as $H(p_m) \;=\; \alpha H(p_S) + (1-\alpha) H(p_B) \;+\; \mathrm{JS}_\alpha(p_S \,\|\, p_B)$, where $\mathrm{JS}_\alpha$ is the generalized Jensen-Shannon divergence; concavity of entropy guarantees the JS term is nonnegative. Minimizing $H(p_m)$ therefore (i) sharpens the local view, (ii) softly aligns it with the remote view by shrinking the JS term, and (iii) does not require $f_B$'s gradients. This is exactly what a black-box TTA objective needs: a signal that is simultaneously confident and consistent with the opaque API, while remaining differentiable through $f_S$ alone.
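This decomposition can be checked numerically. Writing the generalized Jensen-Shannon term in its KL form, $\mathrm{JS}_\alpha(p_S\|p_B) = \alpha\,\mathrm{KL}(p_S\|p_m) + (1-\alpha)\,\mathrm{KL}(p_B\|p_m)$, the mixture entropy satisfies the exact identity $H(p_m) = \alpha H(p_S) + (1-\alpha)H(p_B) + \mathrm{JS}_\alpha$ with a nonnegative JS term. A toy check on random distributions, not tied to any particular model:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(7)
p = rng.dirichlet(np.ones(10))          # stands in for the local view p_S
q = rng.dirichlet(np.ones(10))          # stands in for the remote view p_B
alpha = 0.3

m = alpha * p + (1 - alpha) * q         # harmonized distribution p_m
js = alpha * kl(p, m) + (1 - alpha) * kl(q, m)   # generalized JS, KL form

lhs = entropy(m)
rhs = alpha * entropy(p) + (1 - alpha) * entropy(q) + js
print(f"H(p_m)={lhs:.4f}  weighted entropies + JS={rhs:.4f}  JS={js:.4f}")
```

Since the JS term is nonnegative, pushing $H(p_m)$ down can only happen by sharpening the two per-model entropies and/or bringing the two views into agreement.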

Stability: consistency and prompt-oriented filtering

Learning prompts from random initialization is brittle. BETA adds (i) a KL consistency term between clean and prompted predictions in the local view, and (ii) a prompt-learning-oriented reliable-and-diverse filter so the prompt is updated only from samples that give a stable learning signal.
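A hedged sketch of both stabilizers; the KL weight, entropy threshold, and per-class cap below are illustrative placeholders, not the paper's hyperparameters:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * np.log(p / q))

def consistency_term(p_clean, p_prompted, lam=1.0):
    """KL consistency between the local model's clean and prompted views;
    penalizes prompts that drag predictions far from the clean ones."""
    return lam * kl(p_clean, p_prompted)

def passes_filter(p_prompted, label_counts, ent_thresh=1.0, max_per_class=8):
    """Illustrative reliable-and-diverse filter: update the prompt only on
    confident samples (low entropy) whose pseudo-label is not yet
    over-represented in the running class counts."""
    if entropy(p_prompted) > ent_thresh:             # unreliable: skip
        return False
    label = int(np.argmax(p_prompted))
    if label_counts.get(label, 0) >= max_per_class:  # not diverse: skip
        return False
    label_counts[label] = label_counts.get(label, 0) + 1
    return True

counts = {}
confident = np.array([0.9, 0.05, 0.05])
uncertain = np.array([0.4, 0.3, 0.3])
print(passes_filter(confident, counts), passes_filter(uncertain, counts))  # → True False
```

The filter gates which samples contribute gradient updates to $\delta$, while the consistency term keeps each accepted update anchored to the clean prediction, the two ingredients the ablation in Figure 4 shows are needed to avoid collapse.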

Figure 4. Without stabilization, prompt learning collapses on hard corruptions (Contrast). BETA's consistency and filtering keep adaptation stable across 5 random seeds.

Main Results: ImageNet-C, ViT-B/16

Table 2. Classification accuracy (%) on ImageNet-C (severity 5) with ViT-B/16 as the frozen black-box target. BETA surpasses all black-box baselines and several strong white-box methods, despite only seeing API probabilities.

Access Method Avg. Gain
Black Source (no adapt) 55.5 n/a
White TENT 59.6 +4.1
White SAR 63.6 +8.1
White CoTTA 61.6 +6.1
White ETA 65.8 +10.3
Gray T3A 56.9 +1.4
Gray FOA 44.9 −10.6
Black LAME 54.1 −1.4
Black ZOO-CMA 54.5 −1.0
Black ZOO-RGF 56.0 +0.5
Black BETA (Ours) 62.6 +7.1

Real-world commercial API (Clarifai)

Figure 5. On the Clarifai commercial API, BETA delivers a +5.2% accuracy gain for under $0.40. ZOO-based prompting requires over $100 to reach comparable accuracy, a 250× cost gap.

Generalization beyond ImageNet-C

Table 3. Fine-grained EuroSAT with CLIP ViT-B/16 as the target. BETA is the only strict black-box method and delivers the largest gain (+11.3%) with a single API query per sample, while gray-box prompting and ZERO-style variants need 64 to 448 queries.

Access Method Acc. (%) Gain #API
Black Source 42.0 n/a 1
Gray B²TPT (w/ tokens) 46.8 +4.8 120
Gray ZERO (w/ logits) 39.6 −2.4 64
Gray ZERO_ensemble (w/ logits) 43.8 +1.8 448
Black BETA (Ours) 53.3 +11.3 1

Table 4. Dermatology classification on Derm7pt with both a general-purpose CLIP ViT-B/16 and a domain-specialized BiomedCLIP as the black-box target. BETA gives consistent gains across very different backbones.

Target (Black-box) Source LAME TT-Aug BETA (Ours)
CLIP ViT-B/16 55.9 56.0 57.1 58.6
BiomedCLIP 60.9 60.4 61.3 62.1

Beyond Vision: a Test-Time Advisor Strategy

BETA's core mechanism, a small local model shaping the behavior of a larger frozen remote model at inference time, is, we believe, a preview of how adaptation will work in the agent era. Anthropic describes a closely related pattern for LLM agents:

“The advisor strategy pairs a strong, expensive model as an advisor with a faster, cheaper model as an executor. The advisor sees the task and context up front and writes a plan… The executor then carries out that plan step-by-step, consulting the advisor when it hits a hard decision.” Anthropic, The Advisor Strategy (2026)

BETA is the vision, test-time version of the same design, with a twist: the expensive side is the executor, and the small local side is the advisor. The advisor has gradients but limited capacity; the executor has capacity but is opaque. The advisor does not change the executor's weights. It only shapes its input so that the executor's own computation lands on the correct answer.

LLM advisor strategy

  • Strong advisor writes the plan
  • Cheap executor runs it
  • Advisor consulted on hard decisions
  • Uses more tokens, not more training

BETA (vision, test-time)

  • Local steering model provides gradients
  • Black-box API produces predictions
  • Reliable-and-diverse filter consults only on trustworthy samples
  • Uses a single API call, not retraining

We expect the local-advisor-shapes-remote-executor recipe to become a common pattern as AI stacks move to agents composed of many opaque sub-components, from vision APIs to retrieval systems, reward models, and tool APIs. BETA's prediction-harmonization and consistency-regularization objective is, in this light, a principled replacement for “try many prompts and keep the best,” and we see direct extensions to multi-modal agent pipelines and to LLM tool-use where tool outputs play the role of the opaque API.

BibTeX

@inproceedings{zhang2026adapting,
  title     = {Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models},
  author    = {Yunbei Zhang and Shuaicheng Niu and Chengyi Cai and Feng Liu and Jihun Hamm},
  booktitle = {Third Workshop on Test-Time Updates (Main Track)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=v56b8I1tua}
}