
Breaking the Encoder Barrier for Seamless Video-Language Understanding

Conference: ICCV 2025 arXiv: 2503.18422 Code: None Area: Video Understanding / Video Large Language Models Keywords: encoder-free, Video-LLM, token merging, video guidance, hybrid resolution

TL;DR

This paper proposes ELVA, the first encoder-free Video Large Language Model (Video-LLM). Through hierarchical token merging, video guidance supervision, and hybrid resolution inference, it matches the performance of encoder-based architectures while using only 7M publicly available image- and video-text samples, reducing FLOPs by 95% and inference latency by 92%.

Background & Motivation

Existing Video-LLMs almost universally adopt an "encoder + decoder" framework (e.g., CLIP encoder + LLM), which suffers from three fundamental limitations:

Accumulated computational overhead: Videos require per-frame feature extraction through the visual encoder, with costs scaling linearly with frame count; large encoders (e.g., InternViT-6B) further exacerbate this issue.

Spatiotemporal resolution constraints: Encoders impose resolution biases on fixed-size visual representations, preventing dynamic adaptation to content.

Multimodal interaction bottleneck: Reliance on pre-extracted features limits low-level interaction between video pixels and text tokens, as well as inter-frame dependency modeling.

Encoder-free approaches have been explored in the image domain (Fuyu, EVE), but the high dimensionality and temporal dependencies of video data introduce additional challenges. ELVA aims to demonstrate that encoder-free Video-LLMs can achieve competitive performance.

Method

Overall Architecture

ELVA is built on the Qwen2 LLM backbone and directly processes raw video pixels. Key techniques include: a Native Video Tokenizer that preserves original resolution and aspect ratio, a lightweight video patch embedding layer for spatiotemporal pre-modeling, hierarchical token merging for progressive compression of redundant information, and video guidance supervision for learning spatiotemporal representations.

Key Designs

  1. Native Video Tokenization:

    • Video frames are directly partitioned into patches at their original resolution without preprocessing.
    • Special tokens are introduced: <FRAME> marks the start of each frame, and <LINE> marks the end of each patch row (in raster scan order).
    • Advantage: supports video input at arbitrary resolution and frame length.
  2. Video Patch Embedding Layer:

    • A lightweight spatiotemporal pre-modeling module with only 9M parameters.
    • Learnable <LINE> tokens are appended to each row of patches, and learnable <FRAME> tokens are appended to each frame.
    • Long-range spatiotemporal relationships are established via cross-attention: <FRAME> tokens query intra-frame embeddings, and <LINE> tokens query intra-row embeddings.
    • Compared to naive patch embedding, this yields an average improvement of 2.53% on long-video tasks.
  3. Hierarchical Token Merging:

    • Redundant tokens along the temporal dimension are progressively merged across different LLM layers.
    • An index matrix \(\mathbf{M} \in \{0,1\}^{T \times (H \cdot W / P^2)}\) is maintained, and the cosine similarity between tokens at the same spatial position in adjacent frames is computed: \(s_{ij} = \langle f^l_{ij}, f^l_{(i+1)j} \rangle\).
    • Tokens whose similarity exceeds threshold \(\tau=0.6\) are merged by averaging.
    • In shallow layers, tokens are merged immediately upon exceeding the threshold; in deeper layers, merging continues until a target compression ratio (50%) is reached.
    • Compared to direct pooling, this approach preserves critical spatiotemporal information, with far less performance degradation on long-video tasks (a minimal sketch of one merging step follows this list).
  4. Video Guidance Supervision:

    • A pretrained SigLIP video model serves as the teacher.
    • Tube-wise alignment loss: The visual features \(\mathbf{f}_{\text{vis}}\) from the LLM's final layer are temporally mean-pooled and aligned with teacher features \(\mathbf{f}_{\text{target}}\) via MSE: \(\mathcal{L}_{\text{MSE}} = \text{MSE}(\frac{\mathbf{f}_{\text{vis}}}{\|\mathbf{f}_{\text{vis}}\|_2}, \frac{\mathbf{f}_{\text{target}}}{\|\mathbf{f}_{\text{target}}\|_2})\)
    • Frame-wise contrastive loss: <FRAME> tokens are retained and frame-level mean-pooled features are used to compute InfoNCE contrastive loss \(\mathcal{L}_{\text{Con}}\) across GPUs.
    • Total training loss: \(\mathcal{L} = \mathcal{L}_{\text{Gen}} + \mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{Con}}\)
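
To make the merging rule in design 3 concrete, here is a minimal PyTorch sketch of one merging step between adjacent frames. The (T, L, d) tensor layout, the function name hierarchical_merge, and the choice to keep the merged token in the earlier frame are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def hierarchical_merge(tokens: torch.Tensor, alive: torch.Tensor, tau: float = 0.6):
    """One merging step between adjacent frames inside an LLM layer (sketch).

    tokens: (T, L, d) visual hidden states; T frames, L = H*W / P**2 spatial
            positions per frame (raster order), d hidden size.
    alive:  (T, L) boolean index matrix M; True means the token has not yet
            been merged away.
    tau:    cosine-similarity threshold (0.6 in the paper).
    """
    T, L, d = tokens.shape
    for t in range(T - 1):
        cur, nxt = tokens[t], tokens[t + 1]                 # (L, d) each
        sim = F.cosine_similarity(cur, nxt, dim=-1)         # (L,) per-position similarity
        merge = (sim > tau) & alive[t] & alive[t + 1]       # merge only tokens still alive
        tokens[t][merge] = 0.5 * (cur[merge] + nxt[merge])  # merge by averaging (assumed to stay in frame t)
        alive[t + 1][merge] = False                         # drop the duplicate from frame t+1
    return tokens, alive
```

A full implementation would then pack only the surviving tokens back into the LLM sequence; in deeper layers, the per-position similarities would instead be ranked and merged greedily until the 50% compression target is reached.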

Loss & Training

Three-stage progressive training:

  • Stage 1 — Spatial Pre-training: Images are treated as single-frame videos; training uses ELVA-Image (4M samples) to learn basic visual information.
  • Stage 2 — Spatiotemporal Pre-training: ELVA-Video (3M samples) is added; all three loss functions are applied to learn spatiotemporal representations (a sketch of the combined objective follows this list).
  • Stage 3 — Supervised Fine-Tuning (SFT): Only the text generation loss is used, with 665K image + 178K video SFT data.
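
To show how the three loss terms combine in Stage 2, the following is a minimal single-process sketch of \(\mathcal{L} = \mathcal{L}_{\text{Gen}} + \mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{Con}}\). The tensor shapes, the temperature, the symmetric InfoNCE formulation, and the pairing of student and teacher frame features are assumptions for illustration; the paper additionally gathers the contrastive features across GPUs.

```python
import torch
import torch.nn.functional as F

def elva_stage2_loss(logits, labels, f_vis, f_target, student_frames, teacher_frames, temp=0.07):
    """Sketch of the combined pre-training objective (assumed shapes noted below).

    logits, labels:  (B, S, V) and (B, S) next-token prediction inputs.
    f_vis, f_target: (B, T, d) student (last LLM layer) and teacher
                     (video-pretrained SigLIP) visual features.
    student_frames, teacher_frames: (M, d) frame-level mean-pooled features
                     used for the contrastive term.
    """
    # L_Gen: standard autoregressive text loss
    l_gen = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    # L_MSE: temporal mean-pooling, L2 normalisation, then MSE against the teacher
    v = F.normalize(f_vis.mean(dim=1), dim=-1)
    u = F.normalize(f_target.mean(dim=1), dim=-1)
    l_mse = F.mse_loss(v, u)

    # L_Con: frame-wise InfoNCE matching each student frame to its teacher frame
    s = F.normalize(student_frames, dim=-1)
    z = F.normalize(teacher_frames, dim=-1)
    sim = s @ z.t() / temp
    targets = torch.arange(s.size(0), device=s.device)
    l_con = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

    return l_gen + l_mse + l_con
```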

High-quality dense captions re-annotated by Qwen2-VL are extensively used in the training data, yielding substantial improvements over original annotations.

Key Experimental Results

Main Results

| Model | Type | LLM | MSVD | ActivityNet | VideoMME | MLVU | CinePile |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Video-LLaVA | encoder | 7B | 70.7 | 45.3 | 39.9 | 47.3 | 22.5 |
| VideoLLaMA2 | encoder | 7B | 70.9 | 50.2 | 46.6 | 48.5 | 44.6 |
| Fuyu | encoder-free | 8B | 56.8 | 28.8 | 28.7 | 31.1 | 26.0 |
| EVE | encoder-free | 7B | 61.4 | 41.8 | 29.3 | 36.8 | 26.4 |
| ELVA | encoder-free | 7B | 65.2 | 48.7 | 47.1 | 51.8 | 46.1 |

Inference Efficiency Comparison (32 frames)

| Model | MEM (GB) | FLOPs (T) | TTFT (s) |
| --- | --- | --- | --- |
| Encoder-based | 20.7 | 260 | 2.59 |
| ELVA (no merging) | 20.0 (-3%) | 75 (-71%) | 0.51 (-80%) |
| ELVA + Merge | 16.4 (-21%) | 25 (-90%) | 0.26 (-90%) |
| ELVA + Merge + HR | 15.5 (-25%) | 14 (-95%) | 0.22 (-92%) |

The gains are even more pronounced at 128 frames: FLOPs are reduced by 96%, and TTFT is only 0.56s (vs. 15.18s for encoder-based models).

Ablation Study

| Pre-training Objective | GQA | SEED_I | MSVD | VideoMME |
| --- | --- | --- | --- | --- |
| \(\mathcal{L}_{\text{Gen}}\) only | 42.2 | 40.0 | 45.8 | 37.9 |
| + \(\mathcal{L}_{\text{MSE}}\) | 43.6 | 42.6 | 47.1 | 38.1 |
| + \(\mathcal{L}_{\text{Con}}\) | 42.4 | 41.0 | 47.4 | 38.5 |
| + Both | 44.4 | 44.8 | 48.0 | 38.5 |

| Data Quality | GQA | MSVD | VideoMME |
| --- | --- | --- | --- |
| Original captions | 42.1 | 46.0 | 34.2 |
| Recaptioned Image+Video | 46.1 | 49.4 | 38.5 |

Key Findings

  • For the first time, an encoder-free Video-LLM is demonstrated to achieve performance comparable to encoder-based models.
  • A video-pretrained teacher model (video-pretrained SigLIP) provides better guidance than an image-only pretrained encoder (approximately 1-point gain per task).
  • High-quality re-annotated captions are critical: they yield 3–4% improvements over original captions across tasks.
  • Short-video QA primarily benefits from spatial modeling (Stage 1 gains are rapid), while long-video tasks require spatiotemporal modeling (Stage 2 gains are more pronounced).
  • Hybrid resolution inference: maintaining the number of high-resolution frames while adding low-resolution frames substantially improves long-video performance (VideoMME +5.9%) with negligible token overhead.
  • Hierarchical merging at a 50% compression ratio incurs almost no accuracy loss while significantly reducing inference cost.

Highlights & Insights

  • First validation of encoder-free Video-LLMs: This work challenges the assumption that visual encoders are necessary, demonstrating that the LLM itself can directly learn video representations from pixels.
  • Substantial efficiency gains: A 95% reduction in FLOPs and 92% reduction in latency make real-time video understanding practically feasible.
  • Elegant hybrid resolution strategy: The flexibility of the encoder-free architecture is fully exploited by mixing high- and low-resolution frames within the same video.
  • Data quality over data scale: High-quality re-annotated captions yield greater improvements than increasing data volume.

Limitations & Future Work

  • Training on only 7M samples represents a large data scale gap compared to encoder-based models trained on billions of samples.
  • Performance on short-video benchmarks still slightly lags behind the strongest encoder-based models.
  • The threshold and compression ratio for hierarchical merging require manual tuning; adaptive strategies warrant further exploration.
  • The video guidance teacher is still an external visual encoder, so the approach is not entirely encoder-free during training.
  • Compared to Fuyu (linear projection only) and EVE (image-only pre-training), ELVA's spatiotemporal pre-modeling and video guidance loss are the key differentiating factors.
  • Hierarchical token merging shares conceptual similarities with methods such as ToMe, but operates across layers inside the LLM, making it better suited for autoregressive generation.
  • The hybrid resolution inference strategy is generalizable to other multimodal models that require processing long sequences.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First effective encoder-free Video-LLM with multiple key technical innovations.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 8 video benchmarks, detailed ablations, and efficiency comparisons.
  • Writing Quality: ⭐⭐⭐⭐ — Thorough problem analysis with precise identification of three fundamental limitations.
  • Value: ⭐⭐⭐⭐⭐ — A 95% FLOPs reduction establishes a new efficiency paradigm for video understanding.