BRAVE: Broadening the Visual Encoding of Vision-Language Models

Conference: ECCV 2024
arXiv: 2404.07204
Code: Project Page
Area: Multimodal VLMs
Keywords: Multi-encoder fusion, Q-Former, Visual encoding, VQA, Hallucination

TL;DR

This paper systematically analyzes the impact of different visual encoders (CLIP, DINOv2, EVA-CLIP, etc.) on VLM performance, finding that no single encoder is optimal across all tasks. Based on this, the BRAVE method is proposed, which utilizes a lightweight MEQ-Former to fuse features from multiple frozen encoders into a compact representation. Consequently, it achieves SOTA results on captioning and VQA tasks with only 116M trainable parameters while significantly reducing visual hallucinations.

Background & Motivation

Background: VLMs typically consist of a visual encoder (e.g., CLIP) + a bridging module (e.g., Q-Former/MLP) + a language model (e.g., LLaMA). Recent research has invested heavily in larger LMs and more training data, yielding significant performance improvements.

Limitations of Prior Work: VLMs exhibit severe limitations on the visual side:
- CLIP Blind Spots: Tong et al. found that CLIP is "blind" to certain visual differences, failing to distinguish between image pairs with clear visual deviations.
- Visual Hallucinations: VLMs hallucinate details that are not present in the images.
- Biases of a Single Encoder: Different encoders possess distinct inductive biases due to their training objectives, datasets, and model sizes, meaning any single encoder inevitably has blind spots.

Key Challenge: VLMs require a comprehensive understanding of various visual attributes (color, spatial relationships, texture, semantics, etc.), but a single encoder bound by specific training objectives and datasets cannot perform optimally across all dimensions.

Goal: To efficiently fuse multiple encoders with different visual biases to create a more comprehensive visual representation.

Key Insight: First, conduct a systematic encoder benchmark (8 encoders × 5 tasks) to empirically demonstrate that "there is no silver-bullet encoder," and then propose a fusion framework.

Core Idea: Use a unified, lightweight Multi-Encoder Querying Transformer (MEQ-Former) to resample and fuse features from an arbitrary number of frozen encoders into a fixed-length compact representation, serving as a soft visual prompt for a frozen LM.

Method

Overall Architecture

Multiple frozen visual encoders (5) → Extract individual image features → Linearly project to a unified dimension → Sequence-level concatenation → MEQ-Former resamples and fuses via cross-attention → Fixed-length output → FC projection to LM input space → Input as soft visual prompt + text prompt to frozen LM → Generate output.
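
The data flow above can be traced as simple shape bookkeeping. The following is a minimal sketch, not the authors' implementation: `EncoderSpec`, `trace_shapes`, the per-encoder token counts and widths, and `lm_dim=2048` are all hypothetical placeholders; only the 1408-dimensional unified width and the 160 queries come from the summary. The query count is treated as a free parameter here.

```python
from dataclasses import dataclass


@dataclass
class EncoderSpec:
    name: str
    num_tokens: int  # hypothetical token count emitted per image
    width: int       # hypothetical feature width before projection


def trace_shapes(encoders, unified_dim=1408, num_queries=160, lm_dim=2048):
    """Follow tensor shapes through the pipeline; lm_dim is a placeholder."""
    # Per-encoder linear projection: (num_tokens, width) -> (num_tokens, unified_dim).
    projected = [(e.num_tokens, unified_dim) for e in encoders]
    # Sequence-level concatenation stacks all tokens along the length axis.
    kv_shape = (sum(n for n, _ in projected), unified_dim)
    # Cross-attention resampling returns one output per learnable query,
    # and the final FC layer maps each output to the LM input width.
    prompt_shape = (num_queries, lm_dim)
    return kv_shape, prompt_shape


# Hypothetical encoders with different token counts and widths.
three = [
    EncoderSpec("enc_a", 257, 1024),
    EncoderSpec("enc_b", 256, 1408),
    EncoderSpec("enc_c", 577, 1536),
]
kv3, prompt3 = trace_shapes(three)
kv2, prompt2 = trace_shapes(three[:2])
print("keys/values:", kv3, "soft prompt:", prompt3)
print("keys/values:", kv2, "soft prompt:", prompt2)
```

In this sketch the key/value sequence grows with every added encoder, while the soft prompt handed to the LM depends only on the number of queries.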

Key Designs

Systematic Multi-Encoder Analysis (Section 2):
- Function: Evaluate the impact of 8 visual encoders on VLM performance under a unified framework.
- Encoder Selection: CLIP-L/14, OpenCLIP-G/14, EVA-CLIP-g, SIGLIP-G/14, SILC-G/16, ViT-e, ViT-G, DINOv2-L/14.
- Key Findings:
  - Different encoders exhibit significant performance discrepancies across tasks (COCO standard deviation of 4.91, VQAv2 standard deviation of 1.74).
  - No single encoder consistently outperforms others.
  - Encoders with drastically different biases can achieve similar performance (e.g., EVA-CLIP vs. ViT-e).
  - MMVP is extremely challenging for all encoders (most fall below the random guess baseline of 25%).
- Design Motivation: Empirically demonstrates the necessity of multi-encoder fusion in a data-driven manner.
MEQ-Former (Multi-Encoder Querying Transformer):
- Function: Unify and fuse features from \(K\) encoders into a fixed-length compact representation.
- Mechanism:
  - Features from each encoder are projected to a unified dimension (1408 dimensions) via linear layers.
  - Sequence-level concatenation serves as the keys/values for cross-attention.
  - 160 learnable queries (\(32 \times 5\) encoders) combined with text prompt tokens serve as queries for cross-attention.
  - 12 Transformer layers perform alternating cross-attention and self-attention.
  - The final 160 query outputs are mapped to the LM input space via an FC layer.
- Feature Compression Effect: Compresses features from \(1223 \times 1408\) to \(160 \times 768\) (14x compression).
- Design Motivation:
  - Cross-attention naturally resolves the issue of differing output dimensions across different encoders.
  - Fixed-length outputs keep the computation cost at the LM side constant, regardless of the number of encoders.
  - Forgoing encoder-specific identity embeddings allows the MEQ-Former to autonomously learn how to utilize disparate features.
  - Compared to standard Q-Former ensembling (\(5 \times 110\text{M} = 550\text{M}\)), MEQ-Former requires only 116M parameters.
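
The resampling step described above can be illustrated with a deliberately simplified sketch: a single attention head, one layer, no learned Q/K/V projections, no self-attention, and no text tokens, all of which are simplifications relative to the 12-layer MEQ-Former. The helper names (`cross_attend`, `encoder_attention_share`) and the toy sizes are assumptions for illustration only. The last lines check the compression figure quoted in the list.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention; keys and values share one matrix."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, kv)) for kv in keys_values]
        weights = softmax(scores)
        # Each output is a weighted average of all encoder tokens.
        out = [sum(w * kv[d] for w, kv in zip(weights, keys_values))
               for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


def encoder_attention_share(weights, lengths):
    """Average attention mass the queries place on each encoder's token span."""
    shares, start = [], 0
    for n in lengths:
        shares.append(sum(sum(row[start:start + n]) for row in weights) / len(weights))
        start += n
    return shares


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8                      # toy stand-in for the unified 1408-d width
lengths = (5, 7, 4)          # toy token counts for three encoders
features = [random_matrix(n, dim, rng) for n in lengths]
kv = [tok for feats in features for tok in feats]   # sequence-level concatenation
queries = random_matrix(6, dim, rng)                # toy stand-in for 160 queries

fused, attn = cross_attend(queries, kv)
shares = encoder_attention_share(attn, lengths)
print(len(fused), "outputs of width", len(fused[0]))
print("attention share per encoder:", [round(s, 3) for s in shares])

# Compression figure from the text: 1223 x 1408 inputs vs. 160 x 768 outputs.
compression = (1223 * 1408) / (160 * 768)
print(f"compression ratio: {compression:.2f}x")
```

The output length equals the number of queries regardless of how many tokens the encoders contribute, and the per-encoder shares always sum to one, which is one simple way to inspect how much each encoder is being used.
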
Encoder Dropout Training Strategy:
- Function: Randomly mask the features of each encoder with a 20% probability during pre-training.
- Mechanism: Serves as a regularization method to prevent the MEQ-Former from over-relying on a single encoder.
- Design Motivation: Avoid local optima; without dropout, the MEQ-Former might exploit a shortcut by focusing solely on the easiest-to-fit encoder.
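
A minimal sketch of encoder dropout follows. The summary only says that features are "masked" with 20% probability; zeroing the dropped encoder's tokens, the function name `apply_encoder_dropout`, and the independent per-encoder draw are assumptions made here for illustration, not details confirmed by the source.

```python
import random


def apply_encoder_dropout(encoder_features, drop_prob=0.2, rng=random):
    """Independently mask each encoder's features with probability drop_prob.

    Masking is implemented as zeroing (an assumption); shapes are preserved so
    the downstream concatenation is unchanged.
    """
    masked, dropped = [], []
    for feats in encoder_features:
        if rng.random() < drop_prob:
            masked.append([[0.0] * len(tok) for tok in feats])
            dropped.append(True)
        else:
            masked.append(feats)
            dropped.append(False)
    return masked, dropped


# Five toy encoders, each with 3 tokens of width 4.
toy = [[[1.0] * 4 for _ in range(3)] for _ in range(5)]

rng = random.Random(0)
trials = 10_000
drop_count = 0
for _ in range(trials):
    _, flags = apply_encoder_dropout(toy, drop_prob=0.2, rng=rng)
    drop_count += sum(flags)
empirical_rate = drop_count / (trials * len(toy))
print(f"empirical drop rate: {empirical_rate:.3f}")
```

This would be applied only during training; at inference all encoders are passed through unmasked.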

Loss & Training

Pre-training: Train the MEQ-Former with a captioning objective on the WebLI dataset (100M image-text pairs) while keeping both visual encoders and the LM frozen.
VQA Fine-tuning: Fine-tune the MEQ-Former and LM on a mixture of VQAv2 + OKVQA + VQ2A datasets (17M samples).
High-Resolution Fine-tuning: Further fine-tune at \(336 \times 336\) resolution.
Trainable parameters during pre-training total only 116M (approximately 1% of the total VLM parameters); the VQA fine-tuning stage additionally updates the LM.

Key Experimental Results

Main Results (Captioning)

Method	Trainable Params	COCO (CIDEr) ↑	NoCaps out-domain ↑	NoCaps overall ↑
PaLI-17B	16.9B	149.1	-	127.0
GiT2	5.1B	145.0	130.6	126.9
BLIP-2	1.1B	144.5	124.8	121.6
InstructBLIP	188M	-	-	121.9
BRAVE	116M	148.0	133.3	127.6

Main Results (VQA)

Method	Trainable Params	VQAv2 ↑	OKVQA ↑	GQA ↑	VizWiz ↑	MMVP ↑	POPE ↑
PaLI-17B	16.9B	84.3	64.5	-	-	-	-
LLaVA-1.5	13B	80.0	-	63.3	53.6	24.7	85.9
InstructBLIP	188M	-	55.5	-	33.4	16.7	78.9
SPHINX-2k	13B	80.7	62.6	63.1	44.9	-	87.2
BRAVE	3B	82.5	66.0	66.3	54.2	42.0	87.6

Ablation Study

Configuration	COCO ↑	VQAv2 ↑	OKVQA ↑	Description
A0: Full BRAVE	147.0	81.8	65.7	Baseline
A1: No LM Fine-tuning	-	78.6	57.5	LM fine-tuning is crucial for VQA
A1: LoRA r=128	-	81.0	62.9	LoRA offsets 70% of the performance gap
A2: No Synthetic VQA Data	-	81.1	64.0	Synthetic data contributes significantly
A3: No Encoder Dropout	145.3	81.3	66.0	Captioning is more heavily affected
A4: No Text Input to MEQ	145.9	81.4	64.9	Text prompts assist in task alignment
A5: No High-Resolution FT	145.2	79.6	65.0	High resolution is important for VQA
A8: FlanT5-L (Smaller LM)	142.5	79.9	65.5	Larger LMs offer clear linguistic advantages
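
The "70%" figure attached to the LoRA row can be checked against the table. The sketch below shows one plausible reading, assuming the figure is the fraction of the gap between the frozen-LM row and the full-BRAVE row that LoRA closes, averaged over VQAv2 and OKVQA; the helper name `gap_recovered` and the averaging are assumptions, not something the summary states.

```python
def gap_recovered(full, frozen, partial):
    """Fraction of the (full - frozen) gap closed by the partial configuration."""
    return (partial - frozen) / (full - frozen)


# Rows A0 (full), A1 (no LM fine-tuning), A1 (LoRA r=128) from the table above.
vqav2 = gap_recovered(full=81.8, frozen=78.6, partial=81.0)
okvqa = gap_recovered(full=65.7, frozen=57.5, partial=62.9)
mean_recovered = (vqav2 + okvqa) / 2

print(f"VQAv2: {vqav2:.1%}, OKVQA: {okvqa:.1%}, mean: {mean_recovered:.1%}")
```

Under this reading LoRA closes about three quarters of the gap on VQAv2 and about two thirds on OKVQA, which averages to roughly 70%.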

MEQ-Former vs. Q-Former Ensembling

Bridging Method	Params	COCO ↑	VQAv2 ↑	OKVQA ↑	GQA ↑
Q-Former Ensemble	605M	140.9	78.5	64.3	50.6
MEQ-Former	116M	145.2	79.6	65.0	51.5
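
The comparison can be summarized with a few lines of arithmetic on the table values. This is only a convenience sketch; the dictionary layout and variable names are illustrative.

```python
ensemble = {"params_m": 605, "COCO": 140.9, "VQAv2": 78.5, "OKVQA": 64.3, "GQA": 50.6}
meq_former = {"params_m": 116, "COCO": 145.2, "VQAv2": 79.6, "OKVQA": 65.0, "GQA": 51.5}

# How many times more bridge parameters the ensemble uses.
param_ratio = ensemble["params_m"] / meq_former["params_m"]

# Score difference (MEQ-Former minus ensemble) per benchmark, rounded to one decimal.
deltas = {
    task: round(meq_former[task] - ensemble[task], 1)
    for task in ("COCO", "VQAv2", "OKVQA", "GQA")
}

print(f"parameter ratio: {param_ratio:.1f}x")
print("gains:", deltas)
```

The ratio comes out slightly above 5x, and the gain is positive on all four benchmarks, with the largest margin on COCO captioning.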

Key Findings

The performance boost of BRAVE on MMVP is striking: 42.0% vs. 27.3% for the best single encoder (+14.7 percentage points), far exceeding the 25% random guess baseline.
Strong performance on NoCaps out-domain (133.3) indicates that multi-encoder fusion significantly strengthens out-of-distribution (OOD) generalization.
MEQ-Former outperforms the Q-Former ensemble with roughly 5x fewer parameters (116M vs. 605M), suggesting that unified resampling is more effective than ensembling one Q-Former per encoder.
Removing any 2 encoders results in graceful performance degradation (good robustness); however, degradation accelerates when more than 2 are excluded.
MEQ-Former adaptively allocates attention weights to different encoders based on downstream tasks.
Visual-side scaling (multi-encoder) and language-side scaling (larger LM) provide complementary benefits to VLM performance.

Highlights & Insights

Systematic Evidence of "No Silver-Bullet Encoder": The comprehensive benchmark across 8 encoders and 5 tasks quantifies this intuition within a unified framework for the first time.
New Paradigm for Visual-Side Scaling: Combining multiple encoders with diverse biases represents a novel scaling dimension compared to simply scaling up a single encoder or the LM.
Extreme Parameter Efficiency: SOTA results are achieved with only 116M trainable parameters (1% of total parameters), roughly 150x fewer trainable parameters than PaLI-17B (16.9B).
Encoder Dropout as Regularization: A simple yet effective training trick that can be transferred to other multi-source feature fusion scenarios.
No Encoder-Source Labels on Features: MEQ-Former does not need to know which encoder the features originate from, simplifying the architecture and enhancing flexibility.

Limitations & Future Work

Inference requires forward passes through all encoders, causing computational costs to scale linearly with the number of encoders.
Adaptive encoder selection mechanisms remain unexplored; dynamically deciding which encoders to activate based on the input could alleviate inference overhead.
The ensemble of encoders could be further expanded, for instance, by adding 3D prior encoders or scene understanding encoders.
The study covers only the image-text setting and could be extended to other modalities such as audio and video.
Performance remains bounded by the inherent biases and hallucination issues of LLMs.

vs. BLIP-2: BLIP-2 bridges a single encoder with a single Q-Former. BRAVE generalizes this to multiple encoders with the MEQ-Former, achieving higher performance with fewer parameters (116M vs. 188M).
vs. LLaVA-1.5: LLaVA utilizes an MLP projection to connect CLIP and the LM. Despite having 13B parameters, it scores only 24.7% on MMVP, far short of BRAVE's 42.0%, because CLIP's blind spots cannot be rectified by an MLP.
vs. LLaVA-MoF / SPHINX: These concurrent works also explore multi-encoder architectures, but they merely concatenate features before feeding them into the LM, limiting scalability. BRAVE's unified resampling mechanism can seamlessly scale to any number of encoders.

Rating

Novelty: ⭐⭐⭐⭐ The idea of multi-encoder fusion itself is not entirely new, but the unified resampling design of the MEQ-Former and the systematic encoder analysis are significant contributions.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, encompassing an 8-encoder benchmark, 8 downstream tasks, exhaustive ablation studies, analysis of encoder contributions, and comparisons against ensemble methods.
Writing Quality: ⭐⭐⭐⭐⭐ Highly logical flow from analysis to methodology to experiments, with strong motivation and professional figures/tables.
Value: ⭐⭐⭐⭐⭐ Demonstrates the critical importance of visual-side scaling. The MEQ-Former design is elegant, highly efficient, and reusable, offering direct impact to the VLM community.