
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Conference: CVPR 2026 | arXiv: 2602.19615 | Code: N/A | Area: Multimodal VLM | Keywords: rare object recognition, visual token enhancement, multimodal class embeddings, plug-and-play, VLM robustness

TL;DR

This paper proposes an efficient plug-and-play module that learns multimodal class embeddings to enhance VLM recognition and reasoning on rare objects. On the visual side, a cross-attention adapter refines the visual tokens; on the textual side, detected object names are injected into the prompt. Without fine-tuning the VLM, the method lifts LLaVA-1.5-7B from 46.5 to 72.8 and Qwen2.5-VL-7B from 67.9 to 75.4 on the CODA-LM GPT score.

Background & Motivation

Background: VLMs perform well on general visual understanding, but exhibit notable degradation on reasoning tasks involving rare or uncommon objects.

Limitations of Prior Work:

  • Attention weights in intermediate decoding layers of VLMs are significantly lower for rare object regions than for common objects.
  • Approaches that introduce stronger visual encoders or perform full-model fine-tuning incur high computational costs and are not optimized at the object level.
  • Retrieval-augmented learning (RAL) requires large-scale external data and VLM fine-tuning, and risks catastrophic forgetting.

Key Challenge: Rare objects appear with extremely low frequency in pretraining data, leading to insufficient visual-language alignment; existing remedies are not designed at the object level and require expensive full-model fine-tuning.

Goal: Efficiently improve VLM perception and reasoning for rare objects without fine-tuning the VLM.

Key Insight: Attention visualizations reveal that VLMs pay insufficient attention to rare objects in intermediate decoding layers. This motivates remedies from two directions — enhancing visual tokens (making rare objects more salient) and enriching text prompts (guiding attention to target regions).

Core Idea: Learn multimodal class embeddings that fuse features from visual foundation models and synonym-augmented text descriptions. These embeddings serve both as anchors for visual token refinement and as object detectors for generating text prompts.

Method

Overall Architecture

Three stages: (a) learning multimodal class embeddings (visual + text alignment) → (b) visual token enhancement (cross-attention adapter) → (c) text prompt injection (class embeddings as detector → generating object prompts).

Key Designs

  1. Multimodal Class Embedding Learning:

    • Adaptive Semantic Augmentation: An LLM generates synonyms and descriptive text for each rare category. Categories with fewer samples receive more text variants (re-sampling) to mitigate class imbalance.
    • Dual-Branch Feature Extraction: A VFM (DINOv3) extracts object visual features \(z_v\); CLIP extracts text features \(z_t\); both are projected into the LLM embedding space.
    • Cross-Modal Alignment: \(\mathcal{L}_{align}\) applies contrastive learning to pull together visual and textual features of the same class.
    • Class Embedding Optimization: \(\mathcal{L}_{class}\) is a classification loss with EMA updates, making the class embeddings serve as unified anchors for both modalities (see the first sketch after this list).
    • Initialization: Class embeddings are initialized from the mean visual features of same-class samples, yielding greater stability than random initialization.
  2. Visual Token Enhancement (Cross-Attention Adapter):

    • Input: frozen VLM visual tokens \(V\) and class embeddings \(W\).
    • Cross-attention: \(V\) as query, \(W\) as key-value → refined output \(\hat{V} = V + \mathcal{C}_{att}(V, W)\).
    • Refined tokens are injected only at the first decoding layer of the VLM.
    • Loss = reconstruction loss \(\mathcal{L}_{rec}\) (keeping \(\hat{V}\) close to the distribution of \(V\)) + autoregressive loss \(\mathcal{L}_{autoreg}\).
    • Design Motivation: Class embeddings carry discriminative knowledge of rare objects, which cross-attention injects into the visual tokens (see the second sketch after this list).
  3. Text Prompt Injection at Inference:

    • Class embeddings \(W\) are used as a detector: cosine similarity is computed between VFM visual tokens and each class embedding.
    • Top-\(k\) categories are selected as candidate objects.
    • Candidate object names are injected into the text prompt, e.g., "In this image, there might be objects such as: [bollard, debris, ...]".
    • Design Motivation: Explicit text prompts guide the LLM's attention to the relevant objects (see the third sketch after this list).
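
As a concrete illustration of item 1, here is a minimal PyTorch sketch of the alignment loss and the EMA class-embedding update. The function names, temperature, and momentum value are assumptions for illustration, not the authors' implementation.

```python
# Minimal Stage-1 sketch (assumed names/hyperparameters, not the authors' code).
import torch
import torch.nn.functional as F

def align_loss(z_v, z_t, temperature=0.07):
    """Symmetric contrastive loss pulling paired visual (DINOv3) and text (CLIP)
    features of the same class together. z_v, z_t: (B, d) projected features."""
    z_v = F.normalize(z_v, dim=-1)
    z_t = F.normalize(z_t, dim=-1)
    logits = z_v @ z_t.t() / temperature                  # (B, B) similarities
    targets = torch.arange(z_v.size(0), device=z_v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def ema_update(class_emb, feats, labels, momentum=0.99):
    """EMA update of the class embeddings W: (C, d), from batch features."""
    for c in labels.unique():
        batch_mean = feats[labels == c].mean(dim=0)
        class_emb[c] = momentum * class_emb[c] + (1 - momentum) * batch_mean

# Per the paper, W is initialized from the mean visual feature of each class,
# e.g. class_emb = torch.stack([feats[labels == c].mean(0) for c in range(C)])
```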
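
For item 2, the residual refinement \(\hat{V} = V + \mathcal{C}_{att}(V, W)\) maps naturally onto a standard multi-head cross-attention block. The head count and normalization placement below are guesses; only the residual form is given in the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Refines frozen VLM visual tokens with class embeddings:
    V_hat = V + C_att(Q=V, K=W, V=W), applied at the first decoding layer."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, class_emb):
        # visual_tokens: (B, N, d); class_emb: (C, d), shared across the batch
        w = class_emb.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        refined, _ = self.attn(self.norm(visual_tokens), w, w)
        return visual_tokens + refined       # residual keeps V's distribution
```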
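
For item 3, a sketch of the inference-time detector and prompt construction. Max-pooling the per-token similarities into one score per class is an assumption; the summary does not specify how token scores are aggregated.

```python
import torch
import torch.nn.functional as F

def build_detection_prompt(vfm_tokens, class_emb, class_names, question, k=3):
    """Class embeddings as a detector: cosine similarity between VFM visual
    tokens (N, d) and class embeddings (C, d); top-k classes go into the prompt."""
    sims = F.normalize(vfm_tokens, dim=-1) @ F.normalize(class_emb, dim=-1).t()
    scores = sims.max(dim=0).values          # best-matching token per class
    top = scores.topk(k).indices.tolist()
    candidates = ", ".join(class_names[i] for i in top)
    return (f"In this image, there might be objects such as: [{candidates}]. "
            + question)
```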

Loss & Training

  • Stage 1: \(\mathcal{L}_{align} + \mathcal{L}_{class}\) (training class embeddings and projection layers, 20 epochs).
  • Stage 2: \(\mathcal{L}_{adapter} = \mathcal{L}_{rec} + \mathcal{L}_{autoreg}\) (training the adapter, 10 epochs; see the sketch below).
  • The VLM remains frozen throughout. All training can be completed on a single RTX 4090.
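
A minimal sketch of the Stage-2 objective, assuming an MSE form for \(\mathcal{L}_{rec}\) and equal weighting of the two terms (both unstated in the summary):

```python
import torch.nn.functional as F

def adapter_loss(v_hat, v, lm_logits, labels):
    """L_adapter = L_rec + L_autoreg. L_rec keeps refined tokens close to the
    original ones; L_autoreg is the usual next-token prediction loss."""
    l_rec = F.mse_loss(v_hat, v)             # assumed MSE form of L_rec
    l_autoreg = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100)                   # mask prompt/pad positions
    return l_rec + l_autoreg
```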

Key Experimental Results

Main Results (CODA-LM GPT Score)

| Model | Barrier↑ | Cone↑ | Vehicle↑ | All↑ |
| --- | --- | --- | --- | --- |
| LLaVA-1.5-7B | 39.3 | 54.5 | 48.9 | 46.5 |
| LLaVA-1.5-7B + Ours | 68.3 | 84.9 | 73.0 | 72.8 |
| Qwen2.5-VL-7B | 70.9 | 84.9 | 66.5 | 67.9 |
| Qwen2.5-VL-7B + Ours | 79.8 | 91.7 | 71.0 | 75.4 |
| InternVL3-8B | 59.7 | 73.3 | 66.9 | 65.4 |
| InternVL3-8B + Ours | 76.4 | 85.8 | 73.8 | 74.2 |

Ablation Study

| Configuration | All↑ | Note |
| --- | --- | --- |
| LLaVA-1.5-7B baseline | 46.5 | No enhancement |
| + Text prompt only | 56.2 | Prompt is helpful but insufficient |
| + Visual enhancement only | 65.8 | Visual enhancement contributes more |
| + Visual enhancement + Text prompt | 72.8 | Best with both components |

Key Findings

  • LLaVA-1.5-7B improves by 26.3 points (46.5→72.8), a remarkably large gain.
  • Generalizes across models: effective for LLaVA, Qwen2.5-VL, and InternVL3.
  • Visual enhancement contributes more than text prompt injection, but the two are complementary.
  • Requires only a single RTX 4090 and minimal training data (CODA-LM, tens of thousands of QA pairs).
  • The largest gain is on the Barrier category (39.3→68.3), a prototypical rare object class.

Highlights & Insights

  • Dual utility of multimodal class embeddings: The same set of class embeddings serves both as visual refinement anchors (keys and values in cross-attention) and as object detectors (similarity matching), achieving two goals simultaneously.
  • Efficient frozen-VLM paradigm: Only a lightweight cross-attention adapter and class embeddings are trained, achieving substantial improvements without modifying any VLM parameters — highly valuable for deploying existing large models.
  • Attention visualization analysis: Directly demonstrates insufficient attention to rare objects in intermediate VLM layers, providing clear motivation for the proposed method.

Limitations & Future Work

  • Requires a predefined set of rare categories; cannot handle entirely unseen categories at test time.
  • The number of class embeddings is bounded by the number of rare categories \(C\); very large-scale category sets would require architectural adjustments.
  • Top-\(k\) detection may introduce false positives, generating incorrect text prompts that mislead reasoning.
  • Performance gains on GeoBench-VLM (satellite imagery) are weaker than on CODA-LM, indicating remaining challenges under extremely scarce data.

Comparison with Related Methods

  • vs. VLM internal feature supervision (e.g., LLaVA-Grounding): these methods align all visual tokens with VFM features without targeting rare objects; the proposed method uses class embeddings for object-level refinement, which is more precise and efficient.
  • vs. Retrieval-Augmented Learning (RAL): RAL retrieves from large-scale external data and fine-tunes the VLM, incurring high computational cost and a risk of forgetting; the proposed method needs neither large-scale external data nor VLM fine-tuning.

Rating

  • Novelty: ⭐⭐⭐⭐ The dual-purpose design of multimodal class embeddings is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model validation with attention visualization analysis.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear and figures are intuitive.
  • Value: ⭐⭐⭐⭐ A practical solution for rare object understanding.