Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability

Conference: CVPR 2026 · arXiv: 2510.08138 · Code: N/A · Area: Video Understanding · Keywords: Temporal Logic Consistency, Video-Language Models, Attention Interpretability, Cross-Modal Attention, Video Temporal Grounding

TL;DR

This paper investigates, from an interpretability perspective, the root cause of temporal logic inconsistency in Video-LLMs—namely, that cross-modal attention heads fail to effectively discriminate video tokens at different timestamps—and proposes TCAS (Temporally Conditioned Attention Sharpening), which significantly improves temporal logic consistency and general temporal grounding performance by optimizing attention distributions.

Background & Motivation

  1. Background: Video-LLMs achieve strong performance on tasks such as video question answering and captioning, and numerous works have introduced additional temporal modules to enhance temporal understanding (e.g., TimeChat, VTG-LLM).
  2. Limitations of Prior Work: Jung et al. (2024) demonstrate that all Video-LLMs fail to provide logically consistent answers to rephrased questions—models can correctly localize events, yet produce contradictory answers when the same question is posed differently. This reveals a fundamental lack of genuine temporal understanding.
  3. Key Challenge: Although a large number of modular approaches have been proposed to enhance temporal understanding, the underlying reasons for logical inconsistency in temporal comprehension have remained unexplored. Prior work identified the phenomenon without diagnosing the mechanism.
  4. Goal: (a) Identify internal factors that affect the consistency of temporal understanding; (b) Design improved methods based on the diagnostic findings.
  5. Key Insight: The authors approach the problem through the interpretability of attention mechanisms, focusing on cross-modal attention heads—the small subset of attention heads responsible for mapping event text tokens to video tokens within the corresponding temporal segments.
  6. Core Idea: The discriminability of cross-modal attention heads with respect to video tokens at different timestamps is the key factor governing temporal logic consistency; enhancing this discriminability via contrastive learning loss substantially improves consistency.

Method

Overall Architecture

The work comprises two phases: (1) an analysis phase—via head detection, attention visualization, statistical analysis, and causal intervention, revealing the causal relationship between cross-modal attention discriminability and temporal consistency; and (2) a method phase—proposing the TCAS loss, which uses contrastive learning to optimize attention distributions and enhance the model's temporal discrimination capability.

Key Designs

  1. Cross-Modal Attention Head Detection and Analysis:

    • Function: Locate the key attention heads within the model responsible for visual–textual alignment.
    • Mechanism: Define a cross-modal score \(S_{cross}^{h,v}\) as the average attention weight from all event text tokens to video tokens. Ranking heads by this score identifies a small number of cross-modal attention heads concentrated in intermediate layers (see the first sketch after this list). Visualization reveals that, on samples with high consistency, these heads focus event text tokens precisely on video tokens within the corresponding temporal segment, whereas on low-consistency samples the attention is scattered or misaligned.
    • Design Motivation: Rather than directly designing new modules, the authors first seek to understand the model's internal mechanism and identify the root cause of the problem.
  2. Attention Discriminability Score:

    • Function: Quantify the ability of attention heads to discriminate the temporal segment of an event.
    • Mechanism: For attention head \(h\) and sample \(v\), define discriminability \(S_{disc}^{h,v}\) as the proportion of attention weight that event text tokens assign to video tokens within the ground-truth temporal interval, relative to the total attention (see the second sketch after this list). The sample-level metric is the average discriminability over the top-\(t\) cross-modal heads.
    • Design Motivation: A quantifiable metric is needed to link attention behavior to consistency performance. Experiments show a Pearson correlation coefficient of 0.4778 (\(p\)-value \(\ll 0.05\)), confirming a significant positive correlation.
  3. Causal Intervention Validation:

    • Function: Confirm the causal relationship between attention discriminability and consistency.
    • Mechanism: At inference time, the cross-modal attention heads are subjected to targeted intervention—the original attention is linearly interpolated with a ground-truth attention map (uniform distribution within the ground-truth temporal interval) using coefficient \(\alpha\): \(A_{q,V} = (1-\alpha)A_{q,V}^{orig} + \alpha A_{q,V}^{gt}\) (see the third sketch after this list). Moderate intervention (\(\alpha = 0.2\)–\(0.4\)) improves consistency, whereas excessive intervention degrades performance.
    • Design Motivation: Statistical correlation alone is insufficient to establish causality; intervention experiments are required to validate the causal direction "increased discriminability → improved consistency."
  4. TCAS Loss (Temporally Conditioned Attention Sharpening):

    • Function: Enhance temporal discrimination capability by optimizing attention distributions during training.
    • Mechanism: No ground-truth temporal annotations are required. For each cross-modal attention head, text tokens with a clear temporal preference (maximum attention exceeding a threshold \(thr\)) are selected. Attention scores are aggregated by timestamp and split into positive samples (timestamps with above-mean attention) and negative samples (timestamps with below-mean attention). A contrastive loss \(\mathcal{L}_q^h = \max(m + \max(N_q^h) - \min(P_q^h), 0)\) widens the gap between positive and negative attention scores (see the fourth sketch after this list). The total TCAS loss is combined with the standard next-token prediction loss via a weighting coefficient.
    • Design Motivation: The key innovation lies in requiring no temporal annotations—the coarse temporal preference already present in the model's own attention distribution is used as a self-supervised signal, which is then sharpened via contrastive learning. This ensures generalizability across tasks.
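
The following minimal sketches illustrate the four designs above; they are not the authors' code. Tensor layouts, index sets, and variable names are assumptions. First, the cross-modal score \(S_{cross}^{h,v}\), assuming per-head attention maps of shape [num_heads, seq_len, seq_len] (e.g., as exposed by `output_attentions=True` in Hugging Face models):

```python
import torch

def cross_modal_scores(attn: torch.Tensor, text_idx, video_idx) -> torch.Tensor:
    """Per-head cross-modal score S_cross^{h,v}: the average attention
    weight that event text tokens (queries) assign to video tokens (keys).

    attn:      [num_heads, seq_len, seq_len] attention weights of one layer.
    text_idx:  indices of the event text tokens.
    video_idx: indices of the video tokens.
    """
    # Slice rows (text queries), then columns (video keys), then average
    # over both token sets to get one score per head.
    sub = attn[:, text_idx][:, :, video_idx]   # [H, |text|, |video|]
    return sub.mean(dim=(1, 2))                # [H]

# Ranking heads by this score across layers surfaces the small set of
# cross-modal heads (the paper's default keeps the top t = 32):
# top_heads = scores.topk(32).indices
```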
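
Second, the discriminability score \(S_{disc}^{h,v}\), continuing the sketch above with a boolean mask marking video tokens inside the ground-truth interval (the summary does not specify whether "total attention" is taken over all keys or only video keys; the sketch assumes video keys):

```python
def discriminability_score(attn, text_idx, video_idx, gt_mask):
    """Per-head S_disc^{h,v}: fraction of the text-to-video attention mass
    that falls on video tokens inside the ground-truth temporal interval.

    gt_mask: boolean tensor over video tokens, True inside the interval.
    """
    sub = attn[:, text_idx][:, :, video_idx]        # [H, |text|, |video|]
    total = sub.sum(dim=(1, 2))                     # all text->video mass
    inside = sub[:, :, gt_mask].sum(dim=(1, 2))     # mass inside the interval
    return inside / total.clamp_min(1e-8)           # [H]

# Sample-level metric: average over the top-t cross-modal heads, e.g.
# discriminability_score(attn, text_idx, video_idx, gt_mask)[top_heads].mean()
```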
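
Third, the causal intervention, interpolating one head's text-to-video attention toward a map that is uniform inside the ground-truth interval (re-normalization details are not given in this summary):

```python
def intervene(attn_tv, gt_mask, alpha=0.2):
    """A_{q,V} = (1 - alpha) * A^orig + alpha * A^gt for one head.

    attn_tv: [|text|, |video|] original text-to-video attention.
    gt_mask: boolean tensor over video tokens (ground-truth interval).
    alpha:   interpolation strength; 0.2-0.4 helps, larger values hurt.
    """
    gt = gt_mask.float() / gt_mask.sum()        # uniform inside the interval
    return (1 - alpha) * attn_tv + alpha * gt   # gt broadcasts over queries
```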
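
Fourth, the TCAS loss for a single head. The sketch simplifies timestamp aggregation to one video token per timestamp and applies the above/below-mean split per text token, as described in the mechanism:

```python
def tcas_loss_head(attn_tv, thr=0.1, m=0.2):
    """TCAS contrastive loss for one cross-modal head; needs no temporal
    annotations.

    attn_tv: [|text|, |video|] attention from text tokens to
             (timestamp-aggregated) video tokens.
    """
    losses = []
    for q in range(attn_tv.shape[0]):
        a = attn_tv[q]
        if a.max() <= thr:          # keep only tokens with a clear
            continue                # temporal preference
        pos = a[a >= a.mean()]      # P_q^h: timestamps above the mean
        neg = a[a < a.mean()]       # N_q^h: timestamps below the mean
        if neg.numel() == 0:
            continue
        # L_q^h = max(m + max(N_q^h) - min(P_q^h), 0): the weakest positive
        # must exceed the strongest negative by at least margin m.
        losses.append(torch.clamp(m + neg.max() - pos.min(), min=0.0))
    return torch.stack(losses).mean() if losses else attn_tv.new_zeros(())
```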

Loss & Training

Total training loss = standard SFT loss + \(w_{ae}\) × TCAS loss. Key hyperparameters: number of top heads \(t = 32\), margin \(m = 0.2\), threshold \(thr = 0.1\), loss weight \(w_{ae} = 0.5\). Training takes approximately three days on a single A100 80 GB GPU with the Adam optimizer (lr = \(10^{-5}\), batch size = 4).
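
An illustrative training step under these hyperparameters: a sketch only, where `model`, `dataloader`, and the hypothetical `cross_modal_head_attns` helper (which extracts the top-\(t\) heads' text-to-video attention) are assumed, and `tcas_loss_head` is the sketch from the previous section:

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # batch size 4

for batch in dataloader:
    out = model(**batch, output_attentions=True)
    sft_loss = out.loss                        # standard next-token prediction
    # Hypothetical helper: text-to-video attention of the top t = 32 heads.
    head_attns = cross_modal_head_attns(out.attentions, batch)
    tcas = torch.stack([tcas_loss_head(a) for a in head_attns]).mean()
    loss = sft_loss + 0.5 * tcas               # w_ae = 0.5
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```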

Key Experimental Results

Main Results (Charades-CON Consistency Evaluation)

| Method | Data | Fine-tuning | Grounding | R-Ground | S-Ground | H-Verify | C-Verify |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TimeChat | VTune | SFT | 76.2 | 69.2 (90.8%) | 36.2 (47.5%) | 44.8 (58.8%) | 42.4 (55.7%) |
| TimeChat | VTune | TCAS | 83.3 | 75.0 (90.1%) | 39.5 (47.4%) | 52.9 (63.5%) | 50.8 (61.0%) |
| Qwen2.5-VL | VTune | SFT | 28.3 | 17.5 (62.0%) | 6.0 (21.1%) | 15.1 (53.3%) | 14.8 (52.1%) |
| Qwen2.5-VL | VTune | TCAS | 34.0 | 23.0 (67.5%) | 8.1 (23.7%) | 19.6 (57.6%) | 18.5 (54.3%) |

Ablation Study (Hyperparameter Sensitivity)

| Hyperparameter | Value | Note | Grounding | R-Grounding | S-Grounding |
| --- | --- | --- | --- | --- | --- |
| \(t\) (# heads) | 16 | too small | 80.91 | 72.14 | 36.77 |
| \(t\) (# heads) | 32 | optimal (default) | 83.31 | 75.02 | 39.52 |
| \(t\) (# heads) | 48 | too large | 77.37 | 69.66 | 39.04 |
| \(thr\) | 0.05 | low threshold | 81.90 | 74.95 | 41.45 |
| \(thr\) | 0.1 | balanced (default) | 83.31 | 75.02 | 39.52 |

Key Findings

  • TCAS improves consistency and grounding performance together: On Charades-STA, TimeChat's R@1 (IoU = 0.5) increases from 58.4% to 60.2% on the general grounding task, indicating that inconsistency is an underlying factor limiting temporal understanding.
  • Robustness across video lengths: On videos longer than 40 seconds, TCAS yields gains of +17.7 in Grounding and +14.8 in R-Grounding, demonstrating an even greater advantage on long videos.
  • Scope parameters are more sensitive than intensity parameters: The number of heads \(t\) and threshold \(thr\) have the greatest impact on performance, while margin \(m\) and weight \(w_{ae}\) are relatively robust. Too many heads or too low a threshold introduces noise.
  • Attention discriminability visualization confirms the mechanism: After TCAS training, the distribution of attention discriminability scores shifts markedly rightward, confirming that consistency improvements genuinely originate from enhanced attention discrimination capability.

Highlights & Insights

  • A complete closed loop from interpretability to method: The paper first diagnoses (detection + visualization + statistics + causal intervention), then treats (TCAS), and finally verifies that the treatment repairs the issue identified during diagnosis. This paradigm of analysis-driven method design is highly instructive.
  • Attention sharpening without temporal annotations: TCAS leverages the coarse temporal preferences already present in the model's attention as a self-supervised signal, requiring no ground-truth temporal labels. This makes it applicable across diverse video–language tasks—an approach more elegant than straightforward attention supervision.
  • The insight that inconsistency constrains understanding: TCAS not only improves consistency but also unexpectedly improves general grounding performance, suggesting that logical consistency is not an isolated problem but a fundamental reflection of the model's temporal understanding capability.

Limitations & Future Work

  • The authors acknowledge that focusing on logical inconsistency may not capture all aspects of temporal understanding.
  • Improvements on ActivityNet-CON are relatively modest, as event descriptions in that dataset tend to be longer and noisier.
  • Internal mechanism analysis is limited to TimeChat and Qwen2.5-VL; generalizability to additional architectures requires further validation.
  • TCAS must be applied during training and cannot be directly used as a training-free inference-time enhancement (although the causal intervention experiments hint at the possibility of inference-time intervention).

Comparison with Related Work

  • vs. TimeChat/VTG-LLM: These methods enhance temporal understanding by adding temporal modules but do not analyze why models struggle with it; the present paper identifies the root cause from an interpretability perspective and proposes a more lightweight solution.
  • vs. Jung et al. (consistency benchmark): That work introduced a benchmark and the VTune dataset for evaluating consistency but did not investigate the underlying causes of inconsistency; the present paper builds upon it to analyze the mechanism in depth and propose improvements.
  • vs. LLM interpretability works: Nikankin et al. study modality-specific circuits in image–text models; Li et al. identify LLM decoders as a bottleneck for visual reasoning. The present paper further attributes the bottleneck to insufficient discriminability in cross-modal attention heads.

Rating

  • Novelty: ⭐⭐⭐⭐ First work to analyze temporal consistency in Video-LLMs from an interpretability perspective, forming a complete diagnosis-to-treatment loop.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model, multi-dataset evaluation; causal intervention validation; hyperparameter analysis; robustness on long videos.
  • Writing Quality: ⭐⭐⭐⭐⭐ Rigorous logical chain, progressing layer by layer from phenomenon to analysis to method to verification.
  • Value: ⭐⭐⭐⭐ Reveals the root cause of temporal understanding inconsistency; the proposed method is concise, effective, and generalizable.