Unifying Vision-Language Latents for Zero-Label Image Caption Enhancement¶
Conference: NeurIPS 2025 arXiv: 2510.12931 Code: Unavailable Area: Multimodal VLM Keywords: Zero-label learning, image captioning, vision-language alignment, joint embedding, self-supervised
TL;DR¶
This paper proposes the ViZer framework, which improves the image captioning capability of VLMs through a unified vision-language latent space alignment training paradigm—requiring no text annotations whatsoever. Using only raw image data, the model learns to generate more grounded and descriptive captions.
Background & Motivation¶
Background: Vision-language models (VLMs) have achieved impressive performance after large-scale image-text pretraining, yet their reliance on annotated data limits scalability and leaves vast quantities of unlabeled image data unexploited. This annotation scarcity not only constrains the training scope but also results in a persistent misalignment between visual encoders and language models—manifesting as hallucinations, factually incorrect captions, and inconsistent multimodal reasoning even in state-of-the-art systems.
Limitations of Prior Work: Advances in representation learning (e.g., JEPA, DINO) have demonstrated that more robust and generalizable features can be learned through predictive latent-space modeling without relying on pixel reconstruction or dense supervision. However, these methods are primarily oriented toward visual representation learning and do not directly address the generation of grounded captions or the synchronization of cross-modal semantics.
Key Challenge: Current alignment strategies—such as CLIP's contrastive learning and Q-Former's learnable queries—are typically fixed during pretraining and do not continuously adjust alignment when integrated with downstream LLMs. This leaves a representational gap: visual and language models are individually powerful, but their latent spaces are never directly co-adapted.
Key Insight: ViZer asks: Can actively aligning visual and language representations during training improve VLM performance without any annotated data? This is not a simple self-supervised feature learning approach, but rather a direct optimization of cross-modal alignment in service of generative image captioning.
Method¶
Overall Architecture¶
ViZer introduces a lightweight alignment mapper between the frozen visual encoder and the hidden feature space of the VLM. This mapper is trained via a contrastive loss to align visual and textual features. The VLM itself is fine-tuned with LoRA, receiving gradient signals from the alignment loss provided by the ViZer mapper—without any text labels.
Key Designs¶
- ViZer Alignment Mapper:
    - Defines the mapping function \(M_\tau(\cdot) = h_\tau(f_\psi(\cdot))\), where \(h_\tau\) is an MLP (projecting text-side hidden states into the visual feature space) and \(f_\psi\) denotes the transformer layers of the VLM (excluding the LM head).
    - Visual features \(F_I = V_\theta(I)\) are extracted directly from the frozen visual encoder.
    - Text features are processed through the mapper: \(\hat{F}_T = M_\tau(E_\phi(x_{:t}))\).
    - The mapper is parameter-efficient and scales flexibly with dataset or model size.
    - Design Motivation: Inspired by the joint-embedding principle, ViZer learns a bidirectional mapping between visual and language embeddings. Unlike static projection layers, it continuously optimizes alignment throughout training.
- Two Mapper Variants (ViZer\(_{\text{GT}}\) and ViZer\(_{\text{G}}\)):
    - ViZer\(_{\text{GT}}\): The mapper is trained on ground-truth image-text pairs, using the ground-truth image captions as the text input.
    - ViZer\(_{\text{G}}\): A fully zero-label variant in which the mapper is trained on captions generated by the VLM itself: \(\hat{F}_T = M_\tau(f_\psi(V_\theta(I) \circ E_\phi(P))_{t+1:})\), where \(P\) is the captioning prompt.
    - Only ViZer\(_{\text{G}}\) constitutes a truly unsupervised, zero-label scheme.
    - Design Motivation: ViZer\(_{\text{G}}\) establishes alignment via the model's own generated captions, forming a self-improving closed loop.
- Zero-Label VLM Training (a minimal sketch of the objective follows this list):
    - The VLM is fine-tuned with LoRA (rank=32, alpha=64) exclusively on unlabeled OpenImagesV7 data.
    - The loss is a cosine-similarity alignment loss: \(\mathcal{L}_{\text{zero}} = 1 - \frac{F_I \cdot \hat{F}_T}{\|F_I\| \times \|\hat{F}_T\|}\)
    - LoRA is adopted to avoid disrupting pretrained capabilities; the original zero-shot behavior can be recovered by disabling the LoRA adapters.
    - Design Motivation: The aligned latent space provides a gradient signal that substitutes for missing text labels, enabling the VLM to self-improve on unannotated data.
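To make the training signal concrete, here is a minimal PyTorch sketch of the mapper and the zero-label cosine alignment objective. This is not the authors' code: the names (`ViZerMapper`, `zero_label_alignment_loss`), the 2-layer/width-256 MLP shape, and the use of pooled features are assumptions based on the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ViZerMapper(nn.Module):
    """Hypothetical 2-layer MLP projecting VLM hidden states into the visual feature space."""

    def __init__(self, text_dim: int, visual_dim: int, width: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, width),
            nn.GELU(),
            nn.Linear(width, visual_dim),
        )

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_dim) pooled hidden states from the VLM transformer f_psi
        return self.net(text_hidden)


def zero_label_alignment_loss(visual_feats: torch.Tensor, mapped_text: torch.Tensor) -> torch.Tensor:
    """L_zero = 1 - cos(F_I, F_T_hat), averaged over the batch."""
    return (1.0 - F.cosine_similarity(visual_feats, mapped_text, dim=-1)).mean()


if __name__ == "__main__":
    # Toy dimensions; in the paper F_I comes from the frozen visual encoder V_theta and the
    # text-side features from the VLM layers f_psi (both replaced here by random tensors).
    mapper = ViZerMapper(text_dim=1024, visual_dim=768, width=256)
    visual_feats = torch.randn(4, 768)   # stand-in for F_I = V_theta(I), frozen
    text_hidden = torch.randn(4, 1024)   # stand-in for pooled caption hidden states
    loss = zero_label_alignment_loss(visual_feats, mapper(text_hidden))
    print(f"alignment loss: {loss.item():.4f}")
```

In the zero-label setting this loss is the only gradient source for the LoRA-adapted VLM; no token-level cross-entropy against reference captions is used.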
Loss & Training¶
- Both the mapper and the VLM are trained for 1 epoch each.
- AdamW optimizer with weight decay of 0.01.
- Mapper depth is fixed at 2 MLP layers; width is variable (optimal at 256).
- The mapper is trained on a mixture of COCO and CC3M data; the VLM is trained strictly on unlabeled OpenImagesV7.
- All training is conducted on a single RTX 4090 (24 GB); a configuration sketch follows below.
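As a rough illustration of the fine-tuning setup, the LoRA and optimizer configuration might look like the sketch below, using the Hugging Face `peft` and `torch` APIs. Only rank=32, alpha=64, weight decay 0.01, and single-epoch training are given above; the learning rate, target modules, and checkpoint name are assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# "your-vlm-checkpoint" is a placeholder; the paper evaluates SmolVLM and Qwen2-VL.
model = AutoModelForVision2Seq.from_pretrained("your-vlm-checkpoint")

lora_config = LoraConfig(
    r=32,                                 # rank reported in the paper
    lora_alpha=64,                        # alpha reported in the paper
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # assumption: not stated in this summary
    weight_decay=0.01,   # as reported
)

# Disabling the adapters recovers the original zero-shot behavior:
# model.disable_adapter_layers()
```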
Key Experimental Results¶
Main Results¶
| Method | Model | COCO BLEU-1 | COCO CIDEr | COCO CLIPScore | CC3M CLIPScore |
|---|---|---|---|---|---|
| Base | SmolVLM | 0.3784 | 0.276 | 0.2529 | 0.2617 |
| RL | SmolVLM | 0.3623 | 0.255 | 0.2506 | 0.2604 |
| ViZer\(_{\text{GT}}\) | SmolVLM | 0.5564 | 0.505 | 0.2569 | 0.2647 |
| ViZer\(_{\text{G}}\) | SmolVLM | 0.4081 | 0.337 | 0.2571 | 0.2636 |
| Base | Qwen2-VL | 0.5249 | 0.521 | 0.2693 | 0.2766 |
| ViZer\(_{\text{G}}\) | Qwen2-VL | 0.5373 | 0.470 | 0.2744 | 0.2774 |
Ablation Study (Mapper Hyperparameters, SmolVLM-Base)¶
| ViZer Variant | Data Size | Width | COCO BLEU-1 | COCO CIDEr | Notes |
|---|---|---|---|---|---|
| ViZer\(_{\text{GT}}\) | 10k | 256 | 0.4064 | 0.342 | Competitive with limited data |
| ViZer\(_{\text{GT}}\) | 40k | 256 | 0.4169 | 0.355 | Optimal configuration |
| ViZer\(_{\text{GT}}\) | 100k | 256 | 0.4161 | 0.354 | No additional gain from more data |
| ViZer\(_{\text{G}}\) | 10k | 256 | 0.4112 | 0.313 | ViZer\(_{\text{G}}\) favors less data |
| ViZer\(_{\text{G}}\) | 40k | 256 | 0.3662 | 0.254 | Performance degrades with more data |
Key Findings¶
- CLIPScore improves consistently across all ViZer variants, confirming a genuine gain in image-caption semantic consistency (a reference-free scoring sketch follows this list).
- Traditional metrics (BLEU, CIDEr) show limited or even negative gains—because ViZer-generated captions contain correct details absent from reference captions, which are penalized as "errors."
- Qualitative evaluation reveals substantial improvements: "ITAP of an airplane" is enriched to "ITAP of an airplane flying over power lines"; "\<PERSON> in 2008" is corrected to "Woman surfing in the ocean."
- Less data is often better: The optimal data volume for ViZer\(_{\text{GT}}\) is approximately 40k; ViZer\(_{\text{G}}\) requires only ~10k. Excessive data leads to mapper overfitting.
- The RL baseline (using a reward model) yields negligible improvement, as the reward signal tends toward conservative updates that preserve pretrained representations.
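Since CLIPScore is reference-free, a rough check of image-caption consistency can be sketched with an off-the-shelf CLIP model, following the standard CLIPScore formulation (Hessel et al., 2021: \(w \cdot \max(\cos(\text{image}, \text{caption}), 0)\) with \(w = 2.5\)). The checkpoint choice and whether the paper applies the 2.5 rescaling are assumptions, not taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clipscore(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Reference-free image-caption consistency score based on CLIP embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt).item()
    return w * max(cos, 0.0)


# Usage: clipscore(Image.open("photo.jpg"), "a woman surfing in the ocean")
```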
Highlights & Insights¶
- Paradigm Shift: This work demonstrates that VLMs can self-improve their captioning capability using only unlabeled image data, pioneering a zero-label augmentation training paradigm for the vision-language domain.
- Evaluation Metric Critique: The paper offers a penetrating analysis of the limitations of reference-dependent metrics such as CIDEr and BLEU—they penalize correct details that fall outside the reference set, which is fundamentally unfair to self-supervised methods.
- Architectural Generality: ViZer can be integrated in plug-and-play fashion into any VLM architecture that employs a visual encoder, and training requires only a single 24 GB GPU.
Limitations & Future Work¶
- When baseline caption quality is extremely poor (e.g., guessing the year from a fire station image), ViZer's improvement is limited.
- Validation is currently restricted to image captioning; extension to VQA is non-trivial, as VQA focuses on local regions rather than global semantics.
- The absence of suitable automated evaluation metrics remains an open challenge—metrics that do not rely on reference text yet possess image-grounded understanding capabilities are needed.
- Performance on out-of-distribution images (medical, satellite, etc.) remains unexplored.
Related Work & Insights¶
- vs. CLIP/ALIGN: CLIP establishes a static alignment via contrastive learning; ViZer performs continuous, dynamic alignment during training and is oriented toward generative tasks.
- vs. BLIP-2 Q-Former: Q-Former learns bridging queries between modalities but still depends on annotated caption data; ViZer eliminates the need for text labels entirely.
- vs. I-JEPA/DINO: These methods learn visual representations but do not directly serve cross-modal generative tasks; ViZer extends the joint embedding principle to vision-language generation.
Rating¶
- Novelty: ⭐⭐⭐⭐ Zero-label captioning training is a novel direction, though conceptually it is a natural extension of joint embedding and contrastive learning to VLMs.
- Experimental Thoroughness: ⭐⭐⭐ Quantitative results are limited by metric unsuitability; evaluation relies primarily on qualitative comparisons, with insufficient validation on larger models and broader tasks.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated and the discussion of evaluation metric limitations is insightful, though certain design decisions could be explained more thoroughly.
- Value: ⭐⭐⭐⭐ Provides a practical pathway to leverage large volumes of unlabeled image data for VLM improvement, with direct applicability in annotation-scarce scenarios.