
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Conference: CVPR 2026 | arXiv: 2602.19615 | Code: N/A | Area: Multimodal VLM | Keywords: rare object recognition, visual token enhancement, multimodal class embeddings, plug-and-play, VLM robustness

TL;DR

This paper proposes an efficient plug-and-play module that learns multimodal class embeddings to enhance VLM recognition and reasoning on rare objects. On the visual side, a cross-attention adapter refines the visual tokens; on the textual side, detected object names are injected into the prompt. Without fine-tuning the VLM, the method lifts LLaVA-1.5-7B from 46.5 to 72.8 and Qwen2.5-VL-7B from 67.9 to 75.4 on the CODA-LM GPT score.

Background & Motivation

Background: VLMs perform well on general visual understanding, but exhibit notable degradation on reasoning tasks involving rare or uncommon objects.

Limitations of Prior Work:

  • Attention weights in intermediate decoding layers of VLMs are significantly lower for rare object regions than for common objects.
  • Approaches that introduce stronger visual encoders or perform full-model fine-tuning incur high computational costs and are not optimized at the object level.
  • Retrieval-augmented learning (RAL) requires large-scale external data and VLM fine-tuning, and risks catastrophic forgetting.

Key Challenge: Rare objects appear with extremely low frequency in pretraining data, leading to insufficient visual-language alignment; existing remedies are not designed at the object level and require expensive full-model fine-tuning.

Goal: Efficiently improve VLM perception and reasoning for rare objects without fine-tuning the VLM.

Key Insight: Attention visualizations reveal that VLMs pay insufficient attention to rare objects in intermediate decoding layers. This motivates remedies from two directions — enhancing visual tokens (making rare objects more salient) and enriching text prompts (guiding attention to target regions).

Core Idea: Learn multimodal class embeddings that fuse features from visual foundation models and synonym-augmented text descriptions. These embeddings serve both as anchors for visual token refinement and as object detectors for generating text prompts.

Method

Overall Architecture

Three stages: (a) learning multimodal class embeddings (visual + text alignment) → (b) visual token enhancement (cross-attention adapter) → (c) text prompt injection (class embeddings as detector → generating object prompts).

Key Designs

  1. Multimodal Class Embedding Learning:

    • Adaptive Semantic Augmentation: An LLM generates synonyms and descriptive text for each rare category. Categories with fewer samples receive more text variants (re-sampling) to mitigate class imbalance.
    • Dual-Branch Feature Extraction: A VFM (DINOv3) extracts object visual features \(z_v\); CLIP extracts text features \(z_t\); both are projected into the LLM embedding space.
    • Cross-Modal Alignment: \(\mathcal{L}_{align}\) applies contrastive learning to pull together visual and textual features of the same class.
    • Class Embedding Optimization: \(\mathcal{L}_{class}\) is a classification loss with EMA updates, making the class embeddings serve as unified anchors for both modalities (see the first sketch after this list).
    • Initialization: Class embeddings are initialized from the mean visual features of same-class samples, yielding greater stability than random initialization.
  2. Visual Token Enhancement (Cross-Attention Adapter):

    • Input: frozen VLM visual tokens \(V\) and class embeddings \(W\).
    • Cross-attention: \(V\) as query, \(W\) as key-value → refined output \(\hat{V} = V + \mathcal{C}_{att}(V, W)\).
    • Refined tokens are injected only at the first decoding layer of the VLM.
    • Loss = reconstruction loss \(\mathcal{L}_{rec}\) (keeping \(\hat{V}\) close to the distribution of \(V\)) + autoregressive loss \(\mathcal{L}_{autoreg}\).
    • Design Motivation: Class embeddings carry discriminative knowledge of rare objects, which cross-attention injects into the visual tokens (see the second sketch after this list).
  3. Text Prompt Injection at Inference:

    • Class embeddings \(W\) are used as a detector: cosine similarity is computed between VFM visual tokens and each class embedding.
    • Top-\(k\) categories are selected as candidate objects.
    • Candidate object names are injected into the text prompt, e.g., "In this image, there might be objects such as: [bollard, debris, ...]".
    • Design Motivation: Explicit text prompts guide the LLM's attention to the relevant objects (see the third sketch after this list).
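
As a concrete illustration of item 1, here is a minimal PyTorch sketch of the alignment loss and the EMA class-embedding update. The function names, temperature, and momentum value are assumptions for illustration, not the authors' implementation.

```python
# Minimal Stage-1 sketch (assumed names/hyperparameters, not the authors' code).
import torch
import torch.nn.functional as F

def align_loss(z_v, z_t, temperature=0.07):
    """Symmetric contrastive loss pulling paired visual (DINOv3) and text (CLIP)
    features of the same class together. z_v, z_t: (B, d) projected features."""
    z_v = F.normalize(z_v, dim=-1)
    z_t = F.normalize(z_t, dim=-1)
    logits = z_v @ z_t.t() / temperature                  # (B, B) similarities
    targets = torch.arange(z_v.size(0), device=z_v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def ema_update(class_emb, feats, labels, momentum=0.99):
    """EMA update of the class embeddings W: (C, d), from batch features."""
    for c in labels.unique():
        batch_mean = feats[labels == c].mean(dim=0)
        class_emb[c] = momentum * class_emb[c] + (1 - momentum) * batch_mean

# Per the paper, W is initialized from the mean visual feature of each class,
# e.g. class_emb = torch.stack([feats[labels == c].mean(0) for c in range(C)])
```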
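
For item 2, the residual refinement \(\hat{V} = V + \mathcal{C}_{att}(V, W)\) maps naturally onto a standard multi-head cross-attention block. The head count and normalization placement below are guesses; only the residual form is given in the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Refines frozen VLM visual tokens with class embeddings:
    V_hat = V + C_att(Q=V, K=W, V=W), applied at the first decoding layer."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, class_emb):
        # visual_tokens: (B, N, d); class_emb: (C, d), shared across the batch
        w = class_emb.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        refined, _ = self.attn(self.norm(visual_tokens), w, w)
        return visual_tokens + refined       # residual keeps V's distribution
```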
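
For item 3, a sketch of the inference-time detector and prompt construction. Max-pooling the per-token similarities into one score per class is an assumption; the summary does not specify how token scores are aggregated.

```python
import torch
import torch.nn.functional as F

def build_detection_prompt(vfm_tokens, class_emb, class_names, question, k=3):
    """Class embeddings as a detector: cosine similarity between VFM visual
    tokens (N, d) and class embeddings (C, d); top-k classes go into the prompt."""
    sims = F.normalize(vfm_tokens, dim=-1) @ F.normalize(class_emb, dim=-1).t()
    scores = sims.max(dim=0).values          # best-matching token per class
    top = scores.topk(k).indices.tolist()
    candidates = ", ".join(class_names[i] for i in top)
    return (f"In this image, there might be objects such as: [{candidates}]. "
            + question)
```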

Loss & Training

  • Stage 1: \(\mathcal{L}_{align} + \mathcal{L}_{class}\) (training class embeddings and projection layers, 20 epochs).
  • Stage 2: \(\mathcal{L}_{adapter} = \mathcal{L}_{rec} + \mathcal{L}_{autoreg}\) (training the adapter, 10 epochs; see the sketch below).
  • The VLM remains frozen throughout. All training can be completed on a single RTX 4090.
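
A minimal sketch of the Stage-2 objective, assuming an MSE form for \(\mathcal{L}_{rec}\) and equal weighting of the two terms (both unstated in the summary):

```python
import torch.nn.functional as F

def adapter_loss(v_hat, v, lm_logits, labels):
    """L_adapter = L_rec + L_autoreg. L_rec keeps refined tokens close to the
    original ones; L_autoreg is the usual next-token prediction loss."""
    l_rec = F.mse_loss(v_hat, v)             # assumed MSE form of L_rec
    l_autoreg = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100)                   # mask prompt/pad positions
    return l_rec + l_autoreg
```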

Key Experimental Results

Main Results (CODA-LM GPT Score)

| Model | Barrier↑ | Cone↑ | Vehicle↑ | All↑ |
| --- | --- | --- | --- | --- |
| LLaVA-1.5-7B | 39.3 | 54.5 | 48.9 | 46.5 |
| LLaVA-1.5-7B + Ours | 68.3 | 84.9 | 73.0 | 72.8 |
| Qwen2.5-VL-7B | 70.9 | 84.9 | 66.5 | 67.9 |
| Qwen2.5-VL-7B + Ours | 79.8 | 91.7 | 71.0 | 75.4 |
| InternVL3-8B | 59.7 | 73.3 | 66.9 | 65.4 |
| InternVL3-8B + Ours | 76.4 | 85.8 | 73.8 | 74.2 |

Ablation Study

| Configuration | All↑ | Note |
| --- | --- | --- |
| LLaVA-1.5-7B baseline | 46.5 | No enhancement |
| + Text prompt only | 56.2 | Prompt is helpful but insufficient |
| + Visual enhancement only | 65.8 | Visual enhancement contributes more |
| + Visual enhancement + Text prompt | 72.8 | Best with both components |

Key Findings

  • LLaVA-1.5-7B improves by 26.3 points (46.5→72.8), a remarkably large gain.
  • Generalizes across models: effective for LLaVA, Qwen2.5-VL, and InternVL3.
  • Visual enhancement contributes more than text prompt injection, but the two are complementary.
  • Requires only a single RTX 4090 and minimal training data (CODA-LM, tens of thousands of QA pairs).
  • The largest gain is on the Barrier category (39.3→68.3), a prototypical rare object class.

Highlights & Insights

  • Dual utility of multimodal class embeddings: The same set of class embeddings serves both as visual refinement anchors (keys and values in cross-attention) and as object detectors (similarity matching), achieving two goals simultaneously.
  • Efficient frozen-VLM paradigm: Only a lightweight cross-attention adapter and class embeddings are trained, achieving substantial improvements without modifying any VLM parameters — highly valuable for deploying existing large models.
  • Attention visualization analysis: Directly demonstrates insufficient attention to rare objects in intermediate VLM layers, providing clear motivation for the proposed method.

Limitations & Future Work

  • Requires a predefined set of rare categories; cannot handle entirely unseen categories at test time.
  • The number of class embeddings is bounded by the number of rare categories \(C\); very large-scale category sets would require architectural adjustments.
  • Top-\(k\) detection may introduce false positives, generating incorrect text prompts that mislead reasoning.
  • Performance gains on GeoBench-VLM (satellite imagery) are weaker than on CODA-LM, indicating remaining challenges under extremely scarce data.

Comparison with Related Methods

  • vs. VLM internal feature supervision (e.g., LLaVA-Grounding): these methods align all visual tokens with VFM features without targeting rare objects; the proposed method uses class embeddings for object-level refinement, which is more precise and efficient.
  • vs. Retrieval-Augmented Learning (RAL): RAL retrieves from large-scale external data and fine-tunes the VLM, incurring high computational cost and a risk of forgetting; the proposed method needs neither large-scale external data nor VLM fine-tuning.

Rating

  • Novelty: ⭐⭐⭐⭐ The dual-purpose design of multimodal class embeddings is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model validation with attention visualization analysis.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear and figures are intuitive.
  • Value: ⭐⭐⭐⭐ A practical solution for rare object understanding.