
Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

Conference: CVPR 2026
arXiv: 2603.20808
Code: None
Area: Multimodal VLM
Keywords: visual representation degradation, multimodal large language models, predictive regularization, self-supervised learning, visual fidelity

TL;DR

This paper systematically diagnoses visual representation degradation in MLLMs across two levels—global functionality and patch-level semantic structure—revealing that such degradation is an intrinsic "visual sacrifice" induced by the pure text-generation objective. It proposes Predictive Regularization (PRe), which mitigates degradation by training intermediate-layer features to predict the initial visual features, achieving consistent improvements across multiple vision-language benchmarks.

Background & Motivation

Background: The dominant MLLM architecture consists of a visual encoder, a projection layer, and an LLM, trained entirely with a language modeling objective (next-token prediction). Visual representations are progressively transformed within the LLM to serve the final text generation task.
Limitations of Prior Work: Prior work has focused primarily on the functional role of visual features in cross-modal tasks (e.g., how they facilitate question answering), while overlooking a critical question: what cost does purely language-driven training impose on the intrinsic quality of visual representations?
Key Challenge: MLLM training lacks direct visual supervision signals. Under a single text-generation objective, the model sacrifices visual fidelity to optimize language capability. Linear classification performance on intermediate-layer visual representations degrades significantly relative to the input layer, and patch-level semantic boundaries become blurred—this constitutes "visual representation degradation."
Goal: (1) Systematically quantify and explain the phenomenon and mechanism of visual degradation in MLLMs; (2) Design a lightweight method to mitigate degradation without interfering with language capability.
Key Insight: The method draws on predictive coding theory, which holds that efficient neural systems continuously predict their own low-level signals to maintain a coherent world model. The authors recontextualize this principle as a training regularizer.
Core Idea: A lightweight prediction head maps the LLM's degraded intermediate-layer visual features back to the initial input visual features. Training on this "visual self-prediction" objective anchors the visual fidelity of the intermediate representations.

Method

Overall Architecture

A bypass branch is added to the standard MLLM training pipeline: visual token hidden states are extracted from an intermediate LLM layer, passed through a 2-layer MLP prediction head, and used to predict the stop-gradient initial visual token features $\mathbf{H}_v^0$ at the LLM input. A cosine similarity loss serves as the regularization term and is jointly optimized with the standard language modeling loss. No additional data or architectural modifications are required.

Key Designs

Multi-level Diagnosis of Visual Degradation:
- Function: Reveals the degradation phenomenon and quantifies its extent.
- Mechanism: Global average-pooled visual representations are extracted from each layer of the MLLM, and linear classifiers are trained on them for image classification (linear probing). Classification accuracy drops significantly at intermediate layers relative to the input layer (global functional degradation). At the patch level, COCO-Stuff segmentation masks are used to compute intra-object cohesion and inter-object coupling; coupling rises faster than cohesion, so the semantic contrast ratio declines (patch structural degradation). Visualizations show that the similarity map of a single patch "bleeds" into unrelated objects at intermediate layers.
- Design Motivation: Precise diagnosis must precede problem-solving. Evidence of degradation is established through a complete chain spanning macro (global classification) to micro (patch semantic boundaries) scales.
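
The patch-level diagnosis can be illustrated with a small sketch. This summary does not give the exact formulas for cohesion, coupling, or the contrast ratio, so the definitions below (mean pairwise cosine similarity within and across mask labels, and their quotient) are assumptions. The function names are hypothetical as well.

```python
import math
from itertools import combinations


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def patch_structure_metrics(patch_features, patch_labels):
    """Return (cohesion, coupling, contrast_ratio) for one layer.

    Assumed definitions (not taken from the paper):
      cohesion       = mean cosine similarity of patch pairs sharing a mask label
      coupling       = mean cosine similarity of patch pairs with different labels
      contrast_ratio = cohesion / coupling
    """
    same, diff = [], []
    for i, j in combinations(range(len(patch_features)), 2):
        sim = cosine(patch_features[i], patch_features[j])
        (same if patch_labels[i] == patch_labels[j] else diff).append(sim)
    cohesion = sum(same) / len(same)
    coupling = sum(diff) / len(diff)
    return cohesion, coupling, cohesion / coupling


# Toy example: four patches belonging to two objects (labels 0 and 1).
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
labels = [0, 0, 1, 1]
cohesion, coupling, ratio = patch_structure_metrics(features, labels)
```

In this toy input the two objects are nearly orthogonal, so cohesion is close to 1 and coupling is small. Under these assumed definitions, the degradation described above would appear as coupling growing toward cohesion and the ratio shrinking toward 1.
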
Degradation Attribution: The Visual Sacrifice Hypothesis:
- Function: Explains the root cause of degradation.
- Mechanism: Statistical properties of intermediate-layer representations are analyzed—PCA effective dimensionality peaks and feature correlation is minimized at intermediate layers, indicating that these layers perform "unfolding and disentangling" to construct representation spaces suited for language generation. The dynamics of VQA performance and linear probing accuracy are tracked throughout pre-training, revealing a clear negative correlation: language capability improves while visual fidelity continuously declines.
- Design Motivation: Demonstrates that degradation is not random noise but a systematic byproduct of single-objective text training, providing a theoretical foundation for solution design.
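
One way to make "PCA effective dimensionality" concrete is the participation ratio of the covariance eigenvalues. The paper's exact definition is not stated in this summary, so treat the choice of metric and the function name below as assumptions. The sketch uses the identity that the sum of eigenvalues equals the trace of the covariance matrix and the sum of squared eigenvalues equals its squared Frobenius norm, so no eigendecomposition is needed.

```python
def effective_dimensionality(features):
    """Participation ratio: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    For a symmetric covariance matrix C this equals trace(C)^2 / ||C||_F^2.
    This is one common definition, assumed here for illustration.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[k] for row in features) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in features]
    cov = [
        [sum(r[a] * r[b] for r in centered) / n for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[k][k] for k in range(d))
    frob_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace ** 2 / frob_sq


# Variance spread evenly over both axes versus all points on a single line.
spread = effective_dimensionality([[1, 0], [-1, 0], [0, 1], [0, -1]])
collapsed = effective_dimensionality([[1, 1], [-1, -1], [2, 2], [-2, -2]])
```

Features that use both axes equally give a value of 2, while features confined to one direction give 1. A peak of this quantity at intermediate layers would match the "unfolding and disentangling" reading given above.
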
Predictive Regularization (PRe):
- Function: Regularizes visual degradation during training.
- Mechanism: Visual hidden states $\mathbf{H}_v^l$ are extracted from an intermediate LLM layer (e.g., layer 16 of Vicuna) and passed through a 2-layer MLP prediction head $f_{\text{pred}}$. A negative cosine similarity loss is computed against the stop-gradient initial visual features $\mathbf{H}_v^0$: $$\mathcal{L}_{\text{PRe}} = -\frac{1}{N_p}\sum_{i=1}^{N_p} \mathcal{D}\big(f_{\text{pred}}(\mathbf{h}_{v,i}^l), \text{stopgrad}(\mathbf{h}_{v,i}^0)\big)$$ Here $\mathcal{D}$ denotes cosine similarity, $N_p$ is the number of visual patch tokens, and $\mathbf{h}_{v,i}^l$ is the $i$-th patch feature in $\mathbf{H}_v^l$. The final loss is $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda \mathcal{L}_{\text{PRe}}$ with $\lambda=0.5$.
- Design Motivation: Using an internal anchor (pre-LLM features) rather than features from an external model avoids representation space mismatch; patch-level operation provides richer supervision signals than global aggregation; stop-gradient prevents the anchor from being corrupted by backpropagation.
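
The PRe objective can be sketched as a forward-only computation in plain Python. This is a minimal illustration, not the authors' implementation. The prediction-head outputs are passed in directly, the function names are hypothetical, and the stop-gradient is only noted in a comment because plain Python has no autograd.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def pre_loss(predicted_patches, anchor_patches):
    """Negative mean cosine similarity over visual patches.

    predicted_patches: stand-in for the prediction-head outputs f_pred(h_i^l).
    anchor_patches: stand-in for the initial visual features h_i^0. In an
    autograd framework these would be detached (stop-gradient) so that the
    loss cannot move the anchor itself.
    """
    sims = [cosine(p, a) for p, a in zip(predicted_patches, anchor_patches)]
    return -sum(sims) / len(sims)


def total_loss(lm_loss, pre, lam=0.5):
    """L_total = L_LM + lambda * L_PRe, with lambda = 0.5 as reported."""
    return lm_loss + lam * pre


anchors = [[1.0, 0.0], [0.0, 2.0]]
aligned = pre_loss(anchors, anchors)                       # predictions match the anchors
opposite = pre_loss([[-1.0, 0.0], [0.0, -2.0]], anchors)   # predictions point the other way
combined = total_loss(2.0, aligned)
```

The loss ranges from -1 (every predicted patch aligned with its anchor) to +1 (every patch anti-aligned). Because cosine similarity ignores vector norms, only the direction of each predicted patch feature is constrained.
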

Loss & Training

Standard LLaVA two-stage training (pre-training 558K + instruction tuning 665K)
Total loss = language modeling loss + 0.5 × PRe regularization loss
Applied to intermediate layers (layer 16 of Vicuna, layer 14 of Qwen); not applied to the final layer, where the model has already "silenced" visual tokens by collapsing them into high-frequency meaningless tokens
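
The settings listed above could be collected in a small configuration object, as in the sketch below. The class and field names are hypothetical, and the sample counts are the rounded figures quoted here (558K and 665K), not exact dataset sizes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PReTrainingConfig:
    """Hypothetical container for the training settings described above."""

    pretrain_samples: int = 558_000      # stage 1, rounded figure from the text
    instruction_samples: int = 665_000   # stage 2, rounded figure from the text
    pre_weight: float = 0.5              # lambda in L_total = L_LM + lambda * L_PRe
    pre_layer_by_llm: dict = field(
        default_factory=lambda: {"vicuna": 16, "qwen": 14}
    )

    def pre_layer(self, llm_name):
        """Return the intermediate layer index where PRe is applied."""
        return self.pre_layer_by_llm[llm_name.lower()]


config = PReTrainingConfig()
vicuna_layer = config.pre_layer("Vicuna")
qwen_layer = config.pre_layer("Qwen")
```
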

Key Experimental Results

Main Results

Config (Encoder + LLM)	PRe	GQA	MMMU	AI2D	MMStar	TextVQA	OCRbench	RWQA	MMVP
CLIP* + Vicuna-7B	✗	62.0	35.7	55.4	30.3	45.5	318	54.8	20.0
CLIP* + Vicuna-7B	✓	62.7	36.1	57.1	34.6	46.6	329	55.4	22.0
SigLIP2 + Qwen2.5-7B	✗	63.5	45.8	68.9	48.0	59.2	413	60.3	46.0
SigLIP2 + Qwen2.5-7B	✓	64.4	46.2	69.5	47.8	59.7	428	61.9	46.7
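
To read the table at a glance, the per-benchmark change from adding PRe can be computed directly from the rows above. The numbers are copied from the table; the variable and function names are only illustrative.

```python
BENCHMARKS = ["GQA", "MMMU", "AI2D", "MMStar", "TextVQA", "OCRbench", "RWQA", "MMVP"]

# Scores copied from the main results table (baseline row, then PRe row).
RESULTS = {
    "CLIP* + Vicuna-7B": {
        "baseline": [62.0, 35.7, 55.4, 30.3, 45.5, 318, 54.8, 20.0],
        "pre": [62.7, 36.1, 57.1, 34.6, 46.6, 329, 55.4, 22.0],
    },
    "SigLIP2 + Qwen2.5-7B": {
        "baseline": [63.5, 45.8, 68.9, 48.0, 59.2, 413, 60.3, 46.0],
        "pre": [64.4, 46.2, 69.5, 47.8, 59.7, 428, 61.9, 46.7],
    },
}


def deltas(config_name):
    """Per-benchmark difference (PRe minus baseline), rounded to one decimal."""
    rows = RESULTS[config_name]
    return {
        name: round(p - b, 1)
        for name, b, p in zip(BENCHMARKS, rows["baseline"], rows["pre"])
    }


vicuna = deltas("CLIP* + Vicuna-7B")
qwen = deltas("SigLIP2 + Qwen2.5-7B")
regressions = [name for name, change in qwen.items() if change < 0]
```

In these two configurations PRe improves every benchmark except MMStar with SigLIP2 + Qwen2.5-7B, which drops by 0.2. The largest single gain is MMStar with CLIP* + Vicuna-7B, at +4.3.
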

Ablation Study

Config	GQA	MMMU	TextVQA	RWQA	MMVP
Baseline (CLIP* + Vicuna)	62.0	35.7	45.5	54.8	20.0
PRe @ mid-layer	62.7	36.1	46.6	55.4	22.0
PRe @ last-layer	62.4	35.6	45.7	54.5	25.3
Anchor: Pre-LLM (default)	62.7	36.1	46.6	55.4	22.0
Anchor: Pre-Proj	62.7	35.1	46.4	54.4	32.7
Anchor: DINOv2	62.8	35.9	46.5	54.6	28.7

Key Findings

Mid-layer vs. Last-layer: PRe performs best overall when applied at intermediate layers. Last-layer PRe scores higher on MMVP (25.3 vs. 22.0) but falls below the baseline on MMMU (35.6 vs. 35.7) and RWQA (54.5 vs. 54.8). By the final layer, the model has already collapsed visual tokens into high-frequency meaningless tokens (e.g., _in, ., <0x0A>), so forcing visual structure preservation at this stage is detrimental.
Anchor Selection: Pre-LLM internal features as anchors yield the best overall performance, avoiding both the dimensionality alignment difficulties introduced by patch merging and the representation space mismatch of an external encoder such as DINOv2. Pre-Proj is particularly strong on MMVP (32.7, +12.7 over the baseline) but trails the default on MMMU, TextVQA, and RWQA and has practical limitations.
Patch-level vs. Global-level: Patch-level regularization consistently outperforms global regularization by preserving finer-grained spatial structural information.
Cross-architecture Generality: PRe is effective across six configurations spanning CLIP/SigLIP encoders × Vicuna/Qwen LLMs × frozen/trainable encoders.

Highlights & Insights

Complete Research Paradigm of Diagnosis → Attribution → Solution: The logical chain from phenomenon discovery to causal analysis to solution design is highly coherent. This "understand first, then solve" approach is more persuasive than directly stacking modules.
The Concept of "Visual Degradation" Itself: The paper uncovers a systematic, previously overlooked issue in MLLM training—representation degradation is the cost of language optimization. This insight can inspire future work on better multi-objective training strategies.
Lightweight and Universal: PRe requires only a 2-layer MLP and a cosine loss, with zero additional data and zero architectural modifications, making it a plug-and-play addition to various MLLMs. This philosophy of "minimal intervention" is worth emulating.

Limitations & Future Work

Validation is currently limited to 7B-scale LLMs; degradation patterns and PRe effectiveness at larger scales (e.g., 70B) remain unknown.
Regularization is applied at a single intermediate layer; multi-layer cascaded or progressive regularization may yield better results.
The PRe anchor is the static initial input feature, which is derived from frozen CLIP/SigLIP and may not constitute an optimal visual representation. A dynamically updated "ideal visual anchor" is one possible alternative.
The quantitative metric for visual degradation (linear probing accuracy) is indirect; more direct metrics reflecting visual fidelity may be needed.

Comparison with Related Work

vs. JEPA/SimSiam: PRe takes the predictive coding principle from self-supervised learning, where it originally served as a pre-training objective, and recontextualizes it as a training regularizer, a clever cross-domain borrowing.
vs. FastV/Token Pruning Methods: Such methods reduce visual tokens to accelerate inference but may exacerbate degradation; PRe is complementary—preserving visual quality before pruning.
vs. Multimodal Hallucination Mitigation Methods: Visual degradation may be an underlying cause of hallucinations; PRe addresses the issue at the representation level, complementing output-level calibration approaches.

Rating

Novelty: ⭐⭐⭐⭐ Diagnosing the visual degradation phenomenon is itself valuable; the PRe method is simple yet precise.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six architectural configurations, nine benchmarks, and detailed ablations—highly comprehensive.
Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, with analysis, method, and experiments building progressively upon each other.
Value: ⭐⭐⭐⭐ The revealed degradation phenomenon has broad implications for the MLLM community; the method is simple and practical.