SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

TL;DR

We propose SoC, a Huber-based regularizer for test-time prompt tuning that improves the calibration of vision-language models by enforcing smooth prototype separation while preserving semantic proximity between related classes. Unlike O-TPT, which enforces full orthogonality and thereby pushes semantically similar classes apart and inflates model confidence, SoC caps repulsion for highly similar pairs. Across 11 diverse image classification benchmarks with ViT-L/14, SoC achieves an average ECE of 5.4 (vs. 7.7 for O-TPT) while also improving average accuracy from 71.4% to 72.3%.

Methods

Problem Setting

A pretrained VLM pairs a vision encoder \(f_\theta(\cdot)\) with a text encoder \(f_\phi(\cdot)\). For an image \(\mathbf{x}\), the visual embedding is \(\mathbf{v} = f_\theta(\mathbf{x}) \in \mathbb{R}^d\), and the class prototype for class \(k\) is \(\mathbf{t}_k = f_\phi(\text{"a photo of a [CLASS]"}) \in \mathbb{R}^d\). With both embeddings normalized to unit length and \(\alpha\) denoting the logit scale, logits and softmax probabilities follow CLIP's cosine-similarity scoring:

\[z_k = \alpha\,\mathbf{v}^\top \mathbf{t}_k, \qquad p_k(\mathbf{v}) = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}\]
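The scoring rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `clip_probs`, the value `alpha=100.0`, and the random embeddings are all assumptions made for the example.

```python
import numpy as np

def clip_probs(v, T, alpha=100.0):
    """Softmax over scaled cosine similarities between one image and K prototypes.

    v: (d,) image embedding; T: (K, d) class prototypes.
    alpha is a hypothetical logit scale chosen for illustration.
    """
    v = v / np.linalg.norm(v)                          # unit-norm image embedding
    T = T / np.linalg.norm(T, axis=1, keepdims=True)   # unit-norm prototypes
    z = alpha * (T @ v)                                # logits z_k = alpha * v^T t_k
    z = z - z.max()                                    # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy data: the image embedding is a noisy copy of prototype 2.
rng = np.random.default_rng(0)
T = rng.normal(size=(5, 16))
v = T[2] + 0.1 * rng.normal(size=16)
p = clip_probs(v, T)
```

With a large logit scale, small differences in cosine similarity translate into sharply peaked probabilities, which is one reason prototype geometry matters for calibration.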

We denote \(\mathbf{S} = \mathbf{E}\mathbf{E}^\top\) the pairwise cosine similarity matrix, where \(\mathbf{E} \in \mathbb{R}^{K \times d}\) stacks all \(K\) unit-norm class prototypes. Test-time prompt tuning (TPT) adapts the text prompts at inference using only unlabeled test images, minimizing the entropy of predictions averaged over \(N\) augmented views:

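\[\mathcal{L}_{\text{TPT}} = -\sum_{k=1}^K \tilde{p}_k(\mathbf{v})\log \tilde{p}_k(\mathbf{v}), \qquad \tilde{p}_k(\mathbf{v}) = \frac{1}{\rho N}\sum_{n=1}^N \mathbb{1}\!\left(\mathcal{H}\!\left(p(\mathcal{A}_n(\mathbf{v}))\right)\le\tau\right)p_k\!\left(\mathcal{A}_n(\mathbf{v})\right)\]

Here \(\mathcal{A}_n\) is the \(n\)-th augmentation, \(\mathcal{H}\) is the entropy of a view's predicted distribution, and \(\tau\) is the entropy threshold that retains the fraction \(\rho\) of most confident views.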
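A minimal sketch of this objective follows, assuming the indicator keeps the lowest-entropy fraction \(\rho\) of views (confidence selection as in TPT). The function name `tpt_loss`, the default `rho`, and the toy probabilities are hypothetical.

```python
import numpy as np

def tpt_loss(P, rho=0.1):
    """Entropy of the prediction averaged over the most confident views.

    P: (N, K) array, one probability vector per augmented view.
    rho: fraction of views to keep (assumed value, for illustration).
    """
    eps = 1e-12
    H = -(P * np.log(P + eps)).sum(axis=1)      # entropy of each view
    n_keep = max(1, int(round(rho * len(P))))   # rho * N views survive the selection
    keep = np.argsort(H)[:n_keep]               # lowest entropy = most confident
    p_bar = P[keep].mean(axis=0)                # averaged prediction
    return float(-(p_bar * np.log(p_bar + eps)).sum())

# Four views over three classes: two confident, two nearly uniform.
P = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [1 / 3, 1 / 3, 1 / 3],
    [0.30, 0.30, 0.40],
])
loss_selected = tpt_loss(P, rho=0.5)   # keeps the two confident views
loss_all = tpt_loss(P, rho=1.0)        # keeps every view
```

Selecting by rank rather than by an explicit threshold is equivalent to setting \(\tau\) to the entropy of the last retained view.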

While entropy minimization improves discriminative accuracy, it induces overconfident predictions, which raises calibration concerns in safety-critical deployments.

Calibration-Oriented Baselines

C-TPT adds a dispersion regularizer that spreads text prototype embeddings away from their centroid \(\bar{\mathbf{t}} = \tfrac{1}{K}\sum_k \mathbf{t}_k\), encouraging inter-class separability as a proxy for calibration:

\[\mathcal{L}_{\text{C-TPT}} = \mathcal{L}_{\text{TPT}} - \lambda \cdot \frac{1}{K}\sum_{k=1}^K \|\bar{\mathbf{t}} - \mathbf{t}_k\|_2\]
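The dispersion term can be sketched as below. The names `dispersion` and `c_tpt_loss` and the default `lam` are assumptions for illustration; the entropy loss is passed in as a plain number rather than computed.

```python
import numpy as np

def dispersion(T):
    """Mean L2 distance of the prototypes (rows of T) to their centroid."""
    centroid = T.mean(axis=0)
    return float(np.linalg.norm(T - centroid, axis=1).mean())

def c_tpt_loss(l_tpt, T, lam=1.0):
    """Entropy loss minus lam times dispersion; lam is a hypothetical weight."""
    return l_tpt - lam * dispersion(T)

# Three orthonormal prototypes: each lies sqrt(6)/3 away from the centroid.
T = np.eye(3)
d = dispersion(T)
loss = c_tpt_loss(1.0, T, lam=1.0)
```

Because the dispersion enters with a negative sign, minimizing the total loss spreads the prototypes apart.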

O-TPT directly operates on the prototype geometry, enforcing full pairwise orthogonality across all class prototype pairs:

\[\mathcal{L}_{\text{O-TPT}} = \mathcal{L}_{\text{TPT}} + \lambda\|\mathbf{S} - \mathbf{I}_K\|_F^2\]
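A minimal sketch of the orthogonality penalty, with the hypothetical name `ortho_penalty`, shows its two extremes on toy inputs:

```python
import numpy as np

def ortho_penalty(E):
    """Squared Frobenius norm of S - I, where S holds pairwise cosine similarities."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-norm rows
    S = E @ E.T                                        # (K, K) cosine similarities
    return float(np.sum((S - np.eye(len(E))) ** 2))

orthogonal = ortho_penalty(np.eye(4))                        # mutually orthogonal prototypes
identical = ortho_penalty(np.array([[1.0, 2.0], [1.0, 2.0]]))  # two copies of one prototype
```

Each off-diagonal pair appears twice in \(\mathbf{S}\), so a pair with similarity \(s\) contributes \(2s^2\) to the penalty.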

Although orthogonality improves separation, the gradient of its quadratic penalty keeps growing with similarity, aggressively pushing semantically similar classes (e.g., annual crop land vs. permanent crop land) apart even when their proximity reflects meaningful conceptual overlap. We prove analytically that full orthogonality reduces the worst-case similarity \(\mu = \max_{i \neq j} s_{ij}\) more sharply per gradient step than any Huber-style regularizer, systematically inflating model confidence and degrading calibration for highly related categories.

Our Proposed SoC Regularizer

We replace the quadratic repulsion of O-TPT with a Huber-based regularizer that applies a quadratic penalty for small pairwise similarities but transitions to a linear (capped) regime for large similarities, preserving semantic proximity between related classes:

\[\mathcal{L}_{\text{Huber}}(s,\,\delta) = \begin{cases} \dfrac{s^2}{2} & s \le \delta \\[6pt] \delta\!\left(s - \dfrac{\delta}{2}\right) & s > \delta \end{cases}\]

The full SoC objective integrates this loss over all \(\tfrac{K(K-1)}{2}\) class prototype pairs:

\[\mathcal{L}_{\text{SoC}} = \mathcal{L}_{\text{TPT}} + \lambda \cdot \frac{2}{K(K-1)}\sum_{i < j} \mathcal{L}_{\text{Huber}}(s_{ij},\,\delta)\]
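The regularization term of the objective above can be sketched as follows. The function names and the value `delta=0.5` are hypothetical and not taken from the paper; the sketch only computes the averaged Huber term, not the full loss.

```python
import numpy as np

def huber(s, delta):
    """Piecewise loss from the definition above: quadratic up to delta, linear beyond."""
    s = np.asarray(s, dtype=float)
    return np.where(s <= delta, 0.5 * s**2, delta * (s - 0.5 * delta))

def soc_regularizer(E, delta):
    """Average Huber loss over the K(K-1)/2 distinct prototype pairs."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-norm rows
    S = E @ E.T                                        # pairwise cosine similarities
    i, j = np.triu_indices(len(E), k=1)                # indices with i < j
    return float(huber(S[i, j], delta).mean())

delta = 0.5                            # hypothetical threshold
low = float(huber(0.2, delta))         # quadratic regime
high = float(huber(0.94, delta))       # linear regime
reg = soc_regularizer(np.eye(3), delta)
```

Taking the mean over the upper-triangular entries reproduces the \(\tfrac{2}{K(K-1)}\) normalization, since there are exactly \(\tfrac{K(K-1)}{2}\) such entries.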

When \(s_{ij} \le \delta\), SoC applies the same quadratic repulsion as O-TPT. When \(s_{ij} > \delta\), the gradient with respect to \(s_{ij}\) is capped at \(\delta\), preventing the over-separation of semantically close classes. This directly yields a higher confidence lower bound compared to O-TPT whenever \(\mu > \delta\), translating into better-calibrated predictions.
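As a worked check of the two regimes, differentiating the Huber term with respect to the similarity gives the expression below. The value \(\delta = 0.5\) used in the example is a hypothetical choice for illustration only, not a setting from the paper.

\[\frac{\partial \mathcal{L}_{\text{Huber}}}{\partial s} = \begin{cases} s & s \le \delta \\[6pt] \delta & s > \delta \end{cases} \;=\; \min(s,\,\delta)\]

For a pair with \(s = 0.94\) and \(\delta = 0.5\), the Huber gradient is \(0.5\), whereas the uncapped quadratic \(s^2/2\) would give \(0.94\). For a pair with \(s = 0.2\), both give \(0.2\).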

Figure (two panels). O-TPT, cos(dog, puppy) = 0.94: the quadratic penalty drives dog and puppy nearly orthogonal. SoC (Ours), cos(dog, puppy) = 0.94: the Huber loss caps repulsion, preserving semantic proximity.

Results

We evaluate on 11 diverse image classification datasets. Accuracy (↑) and ECE (↓) are reported for four baselines and SoC. Values in parentheses give the change vs. O-TPT.

Method	ImgNet	DTD	Flowers	Food101	SUN397	Aircraft	Pets	Caltech	UCF101	EuroSAT	Cars	Avg
Accuracy ↑
Zero-Shot	73.5	52.4	76.2	88.6	67.7	29.9	93.1	95.1	73.8	55.0	76.8	71.1
TPT	75.6	55.3	76.3	89.0	70.2	31.8	93.6	95.5	74.5	51.9	77.8	72.0
C-TPT	75.0	55.1	76.5	88.9	70.1	30.9	94.1	95.5	75.2	54.0	77.5	72.1
O-TPT	73.2	54.6	76.4	88.6	68.9	30.0	93.8	95.3	74.5	53.6	76.7	71.4
SoC (Ours)	74.5 (+1.3)	54.4 (−0.2)	77.0 (+0.6)	88.9 (+0.3)	69.5 (+0.6)	30.9 (+0.9)	93.9 (+0.1)	95.6 (+0.3)	74.9 (+0.4)	58.3 (+4.7)	77.0 (+0.3)	72.3 (+0.9)
ECE ↓
Zero-Shot	2.9	10.3	4.3	1.8	3.3	10.3	2.8	3.7	5.6	6.9	3.9	5.1
TPT	14.8	25.0	15.0	6.2	17.2	28.8	3.7	2.4	17.7	25.8	7.1	14.9
C-TPT	10.5	18.3	11.1	3.3	13.5	21.4	1.0	0.9	11.1	17.3	1.4	10.0
O-TPT	5.5	13.8	7.0	2.8	7.6	16.8	1.4	2.0	8.5	17.7	2.2	7.7
SoC (Ours)	7.2 (+1.7)	10.9 (−2.9)	5.3 (−1.7)	2.1 (−0.7)	7.2 (−0.4)	12.7 (−4.1)	0.5 (−0.9)	1.3 (−0.7)	6.5 (−2.0)	3.2 (−14.5)	2.2 (±0.0)	5.4 (−2.3)

Table: Accuracy and ECE across 11 datasets with ViT-L/14.
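For readers unfamiliar with the metric, expected calibration error is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The sketch below follows that common recipe; the bin count of 15, the equal-width binning, and the percent scaling are assumptions, since the evaluation protocol is not specified here.

```python
import numpy as np

def ece(conf, correct, n_bins=15):
    """Expected calibration error in percent, with equal-width confidence bins.

    conf: predicted confidence of the top class per sample, in [0, 1].
    correct: 1 if the top prediction was right, else 0.
    n_bins is an assumed setting, not taken from the paper.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index in 0..n_bins-1; each bin covers (lower, upper].
    bins = np.digitize(conf, edges[1:-1], right=True)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            total += mask.mean() * gap          # weight by the fraction of samples in the bin
    return 100.0 * total

calibrated = ece([0.8] * 5, [1, 1, 1, 1, 0])       # 80% confident, 80% correct
overconfident = ece([1.0] * 4, [1, 0, 1, 0])       # 100% confident, 50% correct
```

A perfectly calibrated model scores 0, and a model that is always fully confident but right half the time scores 50.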

BibTeX

@inproceedings{fillioux2026soc,
    author    = {Fillioux, Leo and Chakraborty, Omprakash and Ben Ayed, Ismail and Cournède, Paul-Henry and Christodoulidis, Stergios and Vakalopoulou, Maria and Dolz, Jose},
    title     = {{SoC}: Semantic Orthogonal Calibration for Test-Time Prompt Tuning},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2026},
}
