Can Text-to-Video Generation Help Video-Language Alignment?¶

Conference: CVPR 2025
arXiv: 2503.18507
Code: https://lucazanella.github.io/synvita/
Area: Video Generation
Keywords: Video-Language Alignment, Synthetic Video, Text-to-Video Generation, Alignment Weighting, Semantic Consistency

TL;DR¶

Proposes the SynViTA framework to explore whether synthetic videos generated by text-to-video (T2V) models can improve video-language alignment (VLA). It addresses semantic inconsistency and appearance bias in synthetic videos through alignment quality-based sample weighting and semantic consistency regularization, and achieves an improvement of over 4 percentage points on temporally challenging tasks.

Background & Motivation¶

Background: Video-language alignment (VLA) trains VLMs to determine whether a video matches its textual description, which is a foundational task in video understanding. Existing methods rely on human annotations or automatically generated negative samples for training.

Limitations of Prior Work: (1) High-quality video-text paired data is scarce. (2) Text-to-video (T2V) generation models can produce a massive volume of synthetic videos, but these videos may be semantically inconsistent with the input text (e.g., action mismatches). (3) The visual appearance of synthetic videos differs significantly from real videos, causing models to potentially exploit shortcuts by focusing on appearance instead of semantics.

Key Challenge: T2V models provide an abundant source of videos, but their quality is uncontrollable—some synthetic videos faithfully reflect the text semantics while others mismatch completely. The key challenge lies in how to utilize the high-quality synthetic videos while filtering out the low-quality ones.

Goal: Systematically investigate the value of synthetic videos in VLA training and design methodologies to address the quality issues of synthetic videos.

Key Insight: Use a VQAScore ensemble to score how well each synthetic video aligns with its target text versus its reference text, and use the difference between the two scores as a dynamic sample weight. In addition, use the Longest Common Subsequence (LCS) to extract the semantics shared by the positive and negative descriptions for triplet regularization, forcing the model to focus on semantic discrepancies rather than appearance differences.

Core Idea: Dynamically weight the training contribution of synthetic videos using VQAScore-based alignment quality, combined with semantic consistency regularization to suppress appearance bias, thereby enabling synthetic videos to effectively enhance VLA.

Method¶

Overall Architecture¶

Real video-text pairs train the baseline VLA loss \(\rightarrow\) Synthetic videos are generated via T2V models \(\rightarrow\) VQAScore evaluates the alignment discrepancy of synthetic video-target text versus synthetic video-reference text \(\rightarrow\) The discrepancy serves as a weight incorporated into the synthetic VLA loss \(\rightarrow\) LCS extracts shared descriptions for triplet semantic regularization.

Key Designs¶

Alignment-Based Sample Weighting:
- Function: Automatically filter out semantically inconsistent synthetic videos.
- Mechanism: \(\omega_i = \max(0, \bar{f}(V^s_i, t^s_i) - \bar{f}(V^s_i, t^r_i))\), where \(\bar{f}\) is the VQAScore ensemble. If the synthetic video \(V^s_i\) aligns better with the target text \(t^s_i\) than with the reference text \(t^r_i\), the video faithfully reflects the target semantics and the score margin becomes its weight. Otherwise the weight is clipped to zero and the sample does not contribute to the synthetic loss.
- Design Motivation: Ablation studies show that with a fixed weight of 1.0, the performance on SSv2-Temporal is only 12.54, but increases to 17.32 after weighting. This gain of about 4.8 percentage points demonstrates that quality filtering is critical.
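
The weighting rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the ensemble score \(\bar{f}\) is a plain average over member scores, and the function name `alignment_weight` and the numeric scores are made up for the example.

```python
from statistics import mean


def alignment_weight(target_scores, reference_scores):
    """Clipped margin between ensemble-averaged alignment scores.

    target_scores:    per-member scores for (synthetic video, target text)
    reference_scores: per-member scores for (synthetic video, reference text)
    Averaging the ensemble members is an assumption of this sketch.
    """
    f_target = mean(target_scores)
    f_reference = mean(reference_scores)
    # max(0, ...) zeroes out videos that match the reference text better.
    return max(0.0, f_target - f_reference)


# Illustrative, made-up scores from a three-member ensemble.
faithful = alignment_weight([0.9, 0.8, 0.7], [0.3, 0.2, 0.4])
unfaithful = alignment_weight([0.2, 0.3, 0.1], [0.6, 0.7, 0.5])
print(faithful, unfaithful)
```

A video that matches its target text receives a positive weight equal to the margin, while a video that looks more like the reference text is dropped from the synthetic loss.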

Semantic Consistency Regularization (SCR):
- Function: Force the model to focus on semantic discrepancies rather than visual appearance differences.
- Mechanism: Use the LCS algorithm to extract the shared parts of positive and negative descriptions as the anchor text \(t'\). The triplet loss enforces: alignment of synthetic video with positive description > alignment with shared description > alignment with negative description.
- Design Motivation: Without SCR, the model tends to distinguish positive and negative samples based on "whether the video is synthetic or real." SCR forces the model to focus on the discriminative parts of the text semantics.
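
A minimal sketch of the two ingredients described above: extracting the shared description with a word-level LCS, and penalizing violations of the ordering positive > shared > negative. Word-level tokenization by whitespace, the hinge form of the penalty, the `margin` parameter, and all function names are assumptions made for illustration. The source only states that LCS is used and that a triplet loss enforces the ordering.

```python
def lcs_tokens(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover one LCS.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def shared_description(positive, negative):
    """Anchor text t': the words common to both descriptions, in order."""
    return " ".join(lcs_tokens(positive.split(), negative.split()))


def ordering_penalty(s_pos, s_shared, s_neg, margin=0.0):
    """Zero when s_pos > s_shared > s_neg by at least `margin` (hypothetical hinge form)."""
    return max(0.0, s_shared - s_pos + margin) + max(0.0, s_neg - s_shared + margin)


anchor = shared_description("a man picks up the cup", "a man puts down the cup")
print(anchor)
```

Because the anchor text drops exactly the words that differ between the two descriptions, a model can only satisfy the ordering by attending to those differing words, not to whether the video looks synthetic.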

Loss & Training¶

\(\mathcal{L} = \mathcal{L}_{real} + \mathcal{L}_{syn}^\phi + \lambda_{scr} \cdot \mathcal{L}_{scr}^\phi\). The approach is model-agnostic, demonstrating effectiveness on both mPLUG-Owl 7B and Video-LLaVA. CogVideoX serves as the best single T2V generator.
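
As a rough sketch of how the three terms combine, the snippet below computes the total objective from per-sample loss values. The batch-mean normalization, the function name `total_loss`, and the numeric values (including `lambda_scr=0.5`) are assumptions for illustration. The source only specifies the sum of the three terms and that the alignment weights enter the synthetic loss.

```python
def total_loss(real_losses, syn_losses, syn_weights, scr_loss, lambda_scr):
    """L = L_real + L_syn (alignment-weighted) + lambda_scr * L_scr."""
    # Plain mean over real video-text pairs.
    l_real = sum(real_losses) / len(real_losses)
    # Each synthetic sample is scaled by its alignment weight;
    # dividing by the batch size is an assumption of this sketch.
    l_syn = sum(w * l for w, l in zip(syn_weights, syn_losses)) / len(syn_losses)
    return l_real + l_syn + lambda_scr * scr_loss


# Made-up values: the second synthetic sample has weight 0 and is ignored.
loss = total_loss(
    real_losses=[0.4, 0.6],
    syn_losses=[1.0, 2.0],
    syn_weights=[0.5, 0.0],
    scr_loss=0.2,
    lambda_scr=0.5,
)
print(loss)
```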

Key Experimental Results¶

Main Results¶

| Model | Human-Hard | SSv2-Temporal | SSv2-Events | ATP-Hard |
|---|---|---|---|---|
| VideoCon (mPLUG) | 74.76 | 13.00 | 10.37 | 35.46 |
| SynViTA (mPLUG) | 74.54 | 17.32 | 12.54 | 37.31 |
| VideoCon (Video-LLaVA) | 75.74 | 19.77 | 10.01 | 38.76 |
| SynViTA (Video-LLaVA) | 76.86 | 20.10 | 11.21 | 39.88 |
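
The per-benchmark differences between SynViTA and VideoCon follow directly from the table above. The short script below recomputes them. The numbers are copied from the table, and the variable names are chosen only for this example.

```python
benchmarks = ["Human-Hard", "SSv2-Temporal", "SSv2-Events", "ATP-Hard"]

# Scores copied from the main results table.
results = {
    "mPLUG": {
        "VideoCon": [74.76, 13.00, 10.37, 35.46],
        "SynViTA": [74.54, 17.32, 12.54, 37.31],
    },
    "Video-LLaVA": {
        "VideoCon": [75.74, 19.77, 10.01, 38.76],
        "SynViTA": [76.86, 20.10, 11.21, 39.88],
    },
}

# Gain of SynViTA over VideoCon in percentage points, per backbone and benchmark.
gains = {
    backbone: {
        name: round(syn - base, 2)
        for name, base, syn in zip(benchmarks, scores["VideoCon"], scores["SynViTA"])
    }
    for backbone, scores in results.items()
}

for backbone, row in gains.items():
    print(backbone, row)
```

With mPLUG the largest gain is on SSv2-Temporal (4.32 points) and Human-Hard drops slightly (0.22 points). With Video-LLaVA all four benchmarks improve, by smaller margins.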

Ablation Study¶

| Weighting Strategy | SSv2-Temporal | SSv2-Events |
|---|---|---|
| Fixed 1.0 | 12.54 | 8.48 |
| Alignment difference weighting | 17.32 | 12.54 |

Key Findings¶

Synthetic videos aid temporal tasks: SSv2-Temporal improves by 4.3 percentage points with mPLUG-Owl (13.00 \(\rightarrow\) 17.32), though minor performance degradation might occur on in-distribution entailment tasks.
Quality filtering is critical: Alignment-based weighting outperforms a fixed weight of 1.0 by about 4.8 percentage points on SSv2-Temporal (12.54 \(\rightarrow\) 17.32) and about 4.1 points on SSv2-Events (8.48 \(\rightarrow\) 12.54).
Flip and Hallucination types of misalignment are the most challenging: T2V models yield the poorest generation quality on these semantic alterations.

Highlights & Insights¶

First systematic study on the value of T2V synthetic videos for VLA, concluding they are "conditionally useful"—benefiting temporal understanding but requiring strict quality control.
VQAScore weighting provides a general quality control solution for synthetic data, which can be extended to other scenarios using synthetic data.

Limitations & Future Work¶

Synthetic videos can cause minor degradation in in-distribution tasks, implying that the domain gap between synthetic and real data still exists.
Relies heavily on the accuracy of VQAScore for quality assessment.
Only three T2V models were tested; stronger models (e.g., Sora) might alter the conclusions.

vs VideoCon: VideoCon relies solely on real data. SynViTA achieves significant improvements on temporal tasks through synthetic video enhancement.
vs Image synthesis-based enhancement (e.g., StableRep): Video synthesis faces more challenges than image synthesis (such as temporal consistency and action matching). The quality control strategies in SynViTA are designed specifically to tackle these challenges.

Rating¶

Novelty: ⭐⭐⭐⭐ First systematic study of utilizing T2V synthetic videos for VLA training.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple T2V models, multiple VLM baselines, and extensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Comprehensive analytical perspective.
Value: ⭐⭐⭐⭐ Offers important insights into training video models using synthetic data.