Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval¶

Conference: CVPR 2025
arXiv: 2503.19009
Code: None
Area: Video Retrieval
Keywords: Text-to-Video Retrieval, Late Interaction, Multi-vector Representation, ColBERT, Sigmoid Loss

TL;DR¶

This paper proposes Video-ColBERT, which adapts ColBERT's late interaction, originally designed for text retrieval, to text-to-video retrieval (T2VR). It performs MeanMaxSim interaction at both the frame and video levels and uses a dual Sigmoid loss to train independent yet compatible multi-granularity representations. Video-ColBERT outperforms existing dual-encoder methods on multiple T2VR benchmarks.

Background & Motivation¶

Background: Most efficient T2VR solutions employ a dual-encoder architecture. Mainstream methods are adapted from CLIP, such as CLIP4Clip, X-CLIP, and DRL.

Limitations of Prior Work: (1) Single-vector representations struggle to encode complex video content and textual concepts; (2) Existing token-level interaction methods have over-complicated mechanisms or suffer from limited final representations (e.g., only performing interactions on features after temporal modeling, thereby ignoring spatial information of individual frames); (3) InfoNCE loss is sensitive to noisy data.

Key Challenge: Fine-grained interaction is effective but slow, while single-vector methods are fast but lack expressiveness. ColBERT-style late interaction offers a middle ground, but existing T2VR methods have not fully utilized it.

Goal: (1) Achieve bi-level token-level interaction across spatial and spatiotemporal dimensions; (2) Design appropriate training objectives to ensure both representation branches are sufficiently strong; (3) Maintain the efficiency of dual encoders.

Key Insight: ColBERT's MaxSim lets each query token independently pick its best-matching document token. Video-ColBERT extends this to video by applying MaxSim separately to independent frame features (spatial) and to temporally modeled features (spatiotemporal), then summing the two scores.

Core Idea: Bi-level MeanMaxSim interaction (\(MMS_F + MMS_V = MMS_{FV}\)), trained with a dual Sigmoid loss to independently optimize both representation branches.

Method¶

Overall Architecture¶

Dual encoder: The text is encoded as a sequence of token representations. On the video side, the image encoder extracts the frame [CLS] tokens, and a temporal transformer generates contextualized video features. Similarity is computed via frame-level and video-level bi-level MeanMaxSim.

Key Designs¶

Bi-level MeanMaxSim Interaction:
- Function: Query-video matching at both spatial and spatiotemporal granularities.
- Mechanism: \(MMS_F = \frac{1}{M}\sum_{j=1}^{M} \max_i (\mathbf{q}_j \cdot \mathbf{f}_i)\), where \(\mathbf{q}_j\) are the \(M\) query token embeddings and \(\mathbf{f}_i\) are the frame features. \(MMS_V\) applies the same operation to the temporally contextualized video features. Unlike ColBERT, which sums the per-token maxima, a mean is used so that queries of different lengths yield comparable scores. The final similarity is \(MMS_{FV} = MMS_F + MMS_V\). Interaction is unidirectional, from query to video.
- Design Motivation: Since \(MMS_F\) already covers spatial information, the temporal transformer can focus on encoding temporal dynamics, giving the two branches a division of labor.
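
The bi-level score described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dot`, `mean_max_sim`, `mms_fv`) and the toy 2-D vectors are hypothetical, and a real system would use batched tensor operations on encoder outputs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_max_sim(query_tokens: List[Vector], visual_tokens: List[Vector]) -> float:
    """MeanMaxSim: each query token takes its best-matching visual token,
    and the maxima are averaged over the M query tokens."""
    maxima = [max(dot(q, v) for v in visual_tokens) for q in query_tokens]
    return sum(maxima) / len(maxima)


def mms_fv(query_tokens: List[Vector],
           frame_feats: List[Vector],
           video_feats: List[Vector]) -> float:
    """Bi-level score: frame-level MMS plus video-level MMS."""
    return (mean_max_sim(query_tokens, frame_feats)
            + mean_max_sim(query_tokens, video_feats))


# Toy 2-D example; all values are made up for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.5, 0.5]]  # stand-in for per-frame [CLS] features
video = [[0.0, 1.0], [0.5, 0.5]]   # stand-in for temporal-transformer outputs

score_f = mean_max_sim(query, frames)   # (1.0 + 0.5) / 2
score_v = mean_max_sim(query, video)    # (0.5 + 1.0) / 2
score_fv = mms_fv(query, frames, video)
```

In this toy case each branch matches a different query token best, so the two scores complement each other when summed.
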
Query and Visual Expansion Tokens:
- Function: Enhance query and video representations.
- Mechanism: On the query side, padding tokens are appended and take part in MMS, so they learn an implicit query expansion. On the video side, learnable visual expansion tokens are added to the input of the temporal transformer.
- Design Motivation: T2VR queries are short and abstract; query expansion infers concepts that are not explicitly expressed.
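
A rough sketch of where the expansion tokens enter the pipeline may help. Everything here is an assumption made for illustration: the helper names, the `num_pad` and `pad_id` parameters, and the token ids are hypothetical, and the sketch only shows the sequence construction, not the encoders themselves.

```python
from typing import List, Sequence


def append_query_pads(token_ids: Sequence[int], num_pad: int, pad_id: int) -> List[int]:
    """Append pad tokens to a tokenized query. The pad positions are NOT
    masked out later, so their encoded outputs can take part in MMS."""
    return list(token_ids) + [pad_id] * num_pad


def append_visual_expansion(frame_feats: Sequence[Sequence[float]],
                            expansion_tokens: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate learnable expansion vectors to the frame sequence
    that is fed to the temporal transformer."""
    return [list(f) for f in frame_feats] + [list(e) for e in expansion_tokens]


# Hypothetical token ids and feature values.
query_ids = append_query_pads([101, 7592, 102], num_pad=2, pad_id=0)
visual_in = append_visual_expansion([[0.1, 0.2], [0.3, 0.4]], [[0.0, 0.0]])
```

The point of the sketch is only that both kinds of expansion tokens are extra sequence positions: they pass through the same encoders as the real tokens and are scored like any other token.
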
Dual Sigmoid Loss:
- Function: Independently train frame-level and video-level representations.
- Mechanism: A Sigmoid loss is computed separately on the \(MMS_F\) and \(MMS_V\) similarity matrices, and the two are combined as \(\mathcal{L}_D = \lambda_F \mathcal{L}_F + \lambda_V \mathcal{L}_V\). Each Sigmoid loss treats every query-video pair in the batch as an independent binary classification problem.
- Design Motivation: Computing the loss directly on the combined score causes gradient imbalance between the two branches. Computing the losses separately lets each branch learn discriminative features on its own; the two scores are then summed at inference.

Loss & Training¶

Dual Sigmoid loss: Each loss is formulated as \(\mathcal{L} = -\frac{1}{|B|}\sum_{i=1}^{|B|}\sum_{j=1}^{|B|} \log\frac{1}{1+e^{z_{ij}(-t \cdot MMS(q_i, v_j) + b)}}\), where \(MMS\) is \(MMS_F\) or \(MMS_V\) depending on the branch, \(z_{ij}\) is the label (1 for a matched query-video pair, -1 otherwise), \(t\) is a learnable scaling factor, and \(b\) is a learnable bias. Unidirectional MaxSim reflects the query-video asymmetry.
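
The loss can be sketched directly from the formula as written above, using the identity \(-\log\frac{1}{1+e^{x}} = \log(1+e^{x})\). This is a minimal plain-Python sketch, not the authors' code: the function names, the toy values of `t` and `b`, and the choice to give each branch its own `t` and `b` are assumptions for illustration.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_loss(sim: List[List[float]], t: float, b: float) -> float:
    """Pairwise sigmoid loss over a square batch similarity matrix.

    sim[i][j] is the MMS score of query i against video j. Diagonal entries
    are matched pairs (z = +1); all other entries are negatives (z = -1).
    Each term is softplus(z * (-t * sim + b)), following the formula above.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            total += softplus(z * (-t * sim[i][j] + b))
    return total / n


def dual_sigmoid_loss(sim_f, sim_v, t_f, b_f, t_v, b_v,
                      lambda_f: float = 1.0, lambda_v: float = 1.0) -> float:
    """L_D = lambda_F * L_F + lambda_V * L_V.
    Separate (t, b) per branch is an assumption of this sketch."""
    return (lambda_f * sigmoid_loss(sim_f, t_f, b_f)
            + lambda_v * sigmoid_loss(sim_v, t_v, b_v))


# Toy batch of two query-video pairs; t and b are arbitrary illustration values.
aligned = [[1.0, 0.0], [0.0, 1.0]]      # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # negatives score highest
loss_good = sigmoid_loss(aligned, t=10.0, b=5.0)
loss_bad = sigmoid_loss(misaligned, t=10.0, b=5.0)
loss_dual = dual_sigmoid_loss(aligned, misaligned, 10.0, 5.0, 10.0, 5.0)
```

Because every matrix entry contributes its own binary term, the loss needs no normalization across the batch, which is what distinguishes it from a softmax-based objective such as InfoNCE.
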

Key Experimental Results¶

Main Results¶

Method	MSR-VTT R@1	MSVD R@1	VATEX R@1
X-CLIP (B/16)	49.3	50.4	—
DRL (B/16)	50.2	50.0	65.7
V-ColBERT (CLIP-B/16)	51.0	50.2	66.8
V-ColBERT (SigLIP-B/16)	51.5	55.2	68.0
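
For readers unfamiliar with the metric in the table, R@K (Recall at K) is the percentage of queries whose ground-truth video is ranked within the top K. The sketch below assumes a one-to-one setup in which query i matches video i; that pairing, the function name, and the toy scores are simplifications for illustration.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """Recall@K in percent. sim[i][j] is the score of query i against video j;
    the ground truth for query i is assumed to be video i."""
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of videos scored strictly higher than the match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity matrix for three queries and three videos (made-up scores).
toy_sim = [
    [0.9, 0.1, 0.2],  # match ranked 1st
    [0.3, 0.2, 0.8],  # match ranked 3rd
    [0.1, 0.7, 0.6],  # match ranked 2nd
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
r3 = recall_at_k(toy_sim, 3)
```
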

Ablation Study¶

Configuration	Description
MMS_F Only	Pure frame-level spatial matching, ~48 R@1
MMS_V Only	Spatiotemporal features only, ~49 R@1
MMS_FV Bi-level	~51 R@1, two levels are complementary
InfoNCE Loss	Inferior to Sigmoid
SMS (Sum)	Unfair for variable-length queries
Without Query Expansion	Loses implicit concept expansion
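
The SMS row can be made concrete with a small worked example. For a single query, dividing by the number of tokens does not change how videos are ranked, so the difference between sum and mean shows up when scores from queries of different lengths are compared in one similarity matrix, as happens in a batch-level loss. The per-token maxima below are made up, and the function names are hypothetical.

```python
from typing import List


def sum_max_sim(per_token_maxima: List[float]) -> float:
    """ColBERT-style aggregation: sum of per-token maxima."""
    return sum(per_token_maxima)


def mean_max_sim_from_maxima(per_token_maxima: List[float]) -> float:
    """MeanMaxSim aggregation: mean of per-token maxima."""
    return sum(per_token_maxima) / len(per_token_maxima)


# Two queries whose tokens all match equally well (0.5 each),
# but with different numbers of tokens.
short_query_maxima = [0.5, 0.5]
long_query_maxima = [0.5] * 6

sms_short = sum_max_sim(short_query_maxima)               # grows with length
sms_long = sum_max_sim(long_query_maxima)
mms_short = mean_max_sim_from_maxima(short_query_maxima)  # length-invariant
mms_long = mean_max_sim_from_maxima(long_query_maxima)
```

With the sum, the longer query scores three times higher despite identical per-token match quality; with the mean, both queries score the same.
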

Key Findings¶

Bi-level interaction improves R@1 by 2-3 points compared to single-level, indicating spatial (frame-level) and spatiotemporal information are complementary.
Dual Sigmoid outperforms InfoNCE and single Sigmoid.
Unidirectional MaxSim outperforms bidirectional MaxSim, since irrelevant frames in the video should not drag down the score.
SigLIP serves as a better backbone than CLIP.
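
The argument for unidirectional interaction can be illustrated with a toy case. The symmetric variant below, which averages the query-to-video and video-to-query directions, is only one possible form of bidirectional MaxSim and is an assumption of this sketch, as are the function names and vectors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def directed_mms(src: List[Vector], dst: List[Vector]) -> float:
    """Mean over src tokens of the max similarity to any dst token."""
    return sum(max(dot(s, d) for d in dst) for s in src) / len(src)


def symmetric_mms(query: List[Vector], visual: List[Vector]) -> float:
    """Hypothetical bidirectional variant: average of both directions."""
    return 0.5 * (directed_mms(query, visual) + directed_mms(visual, query))


query = [[1.0, 0.0]]
frames = [[1.0, 0.0]]                       # one relevant frame
frames_extra = [[1.0, 0.0], [0.0, 1.0]]     # plus one irrelevant frame

uni_before = directed_mms(query, frames)
uni_after = directed_mms(query, frames_extra)    # unchanged by the extra frame
sym_before = symmetric_mms(query, frames)
sym_after = symmetric_mms(query, frames_extra)   # pulled down by the extra frame
```

The query-to-video score ignores the irrelevant frame because each query token only looks for its best match, while the video-to-query direction penalizes every frame that matches nothing in the query.
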

Highlights & Insights¶

Adapting ColBERT to video shows that multi-vector late interaction is effective in the video modality.
Functional division in hierarchical interaction: \(MMS_F\) covers spatial features, allowing the temporal transformer to focus on temporal dynamics.
Dual Sigmoid loss is simple yet clever: independent training keeps each path discriminative, while summation at inference keeps the two paths compatible.

Limitations & Future Work¶

Storage overhead is larger than for single-vector methods, since two sets of features (frame-level and video-level) must be stored for each video.
Efficient approximations, such as storing only TopK frame features, were not explored.
Comparison with large-scale models like InternVideo2 was not conducted.
The interpretability of query expansion tokens remains unexplored.
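
The storage point can be made concrete with back-of-the-envelope arithmetic. All numbers below (12 sampled frames, 512-dimensional features, 2 bytes per value, K = 4) are hypothetical and not taken from the paper; the Top-K variant is one reading of the approximation mentioned above, keeping K frame features plus all video-level features.

```python
def index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store num_vectors feature vectors of size dim."""
    return num_vectors * dim * bytes_per_value


NUM_FRAMES = 12   # hypothetical number of sampled frames per video
DIM = 512         # hypothetical feature dimension
TOP_K = 4         # hypothetical number of frame features kept

single_vector = index_bytes(1, DIM)                   # one vector per video
bi_level = index_bytes(2 * NUM_FRAMES, DIM)           # frame-level + video-level
top_k_frames = index_bytes(TOP_K + NUM_FRAMES, DIM)   # K frame + all video-level

overhead_ratio = bi_level / single_vector
```

Under these assumptions the bi-level index is 24 times larger than a single-vector index per video, and visual expansion tokens would add to the video-level count.
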

Comparison with Related Work¶

vs. DRL: DRL uses weighted token-wise interaction but only on features after temporal modeling. Video-ColBERT's bi-level interaction is more comprehensive.
vs. X-CLIP: X-CLIP's multi-granularity interaction is more complex, whereas Video-ColBERT is simpler yet more effective.
vs. CLIP4Clip: Late interaction improves R@1 by 5-8 points compared to single-vector methods.

Rating¶

Novelty: ⭐⭐⭐⭐ A clever combination of adapting ColBERT to video, bi-level interaction, and a dual Sigmoid loss.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 5 datasets with multiple backbones and extensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and concise methodology.
Value: ⭐⭐⭐⭐ Provides a powerful and concise late interaction solution for video retrieval.