WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
Conference: CVPR 2026
arXiv: 2602.22142
Code: None (Coming soon)
Area: Multimodal / Streaming Video Understanding
Keywords: Video-LLM, streaming VQA, temporal order, memory cache, uncertainty-gated retrieval
TL;DR
This work diagnoses the "Time-Agnosticism" problem in current Video-LLMs and proposes the WeaveTime framework. During training, a Streaming Temporal Perception Enhancement (SOPE) auxiliary task gives the model temporal awareness. At inference, an uncertainty-gated Past-Current Dynamic Focus Cache (PCDF-Cache) performs adaptive memory retrieval only when needed. With LLaVA-OV-7B, this yields +7.10 on OVO-Bench and +3.74 on Streaming-Bench Real-Time over StreamBridge.
Background & Motivation
Background: Modern visual understanding systems are increasingly deployed in streaming scenarios where frame sequences arrive in real-time (e.g., autonomous driving, human-computer interaction, real-time monitoring). Video-LLM-based methods (e.g., LLaVA-Video, Qwen2-VL) perform excellently in offline settings but face fundamental challenges in streaming contexts.
Limitations of Prior Work:
1. Current Video-LLMs suffer from Time-Agnosticism: they treat videos as unordered bags of evidence rather than causally ordered sequences. Experiments show that shuffling frame order has almost no impact on model accuracy and even improves performance on certain temporal tasks (whereas human performance drops sharply).
2. Existing streaming methods (e.g., StreamBridge, VideoLLM-Online) either require large-scale specialized streaming datasets and high-cost training or rely on customized memory mechanisms with sub-optimal results.
3. Compressed memory (selecting/merging/dropping visual features) leads to information loss; retrieval-based memory retains information but suffers from unnecessary long-range reloading and loss of temporal focus.
Key Challenge: Video-LLMs lack genuine temporal reasoning capabilities, and existing streaming augmentation methods fail to balance "temporal awareness" with "memory efficiency."
Goal: To address two coupled issues caused by Time-Agnosticism: Temporal Order Ambiguity and Past-Current Focus Blindness.
Key Insight: Teach temporal order first, then utilize it during inference—"first teach order, then use order."
Core Idea: Empower the model to perceive frame order through a lightweight temporal reconstruction auxiliary task, followed by demand-driven backtracking using uncertainty-driven coarse-to-fine retrieval.
Method
Overall Architecture
WeaveTime targets the "Time-Agnosticism" of Video-LLMs—the tendency of models to treat videos as unordered bags of evidence where shuffling frames barely affects accuracy. It is a plug-and-play, model-agnostic streaming QA framework following the principle of "first teach order, then use order." During training, Streaming Temporal Perception Enhancement (SOPE) teaches the model that "frames have a sequence" via a temporal reconstruction task. During inference, the Past-Current Dynamic Focus Cache (PCDF-Cache) utilizes uncertainty gating and coarse-to-fine retrieval, allowing the model to trace back through history on demand rather than reloading excessively.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Video Frame Stream + Question"] --> B["SOPE: Streaming Temporal Perception Enhancement (Training)<br/>Timestamp tokens + Shuffled frame reconstruction"]
B --> C["PCDF-Cache Inference<br/>Short-term window response + Entropy H_t calculation"]
C --> D{"H_t < δ ?"}
D -->| "Yes, skip retrieval" | F["Output Answer"]
D -->| "No" | E["Coarse-to-fine Retrieval<br/>Frame-level screening + Late-interaction matching"]
E --> F
Key Designs
1. Time-Agnosticism Diagnosis: Proving models neglect temporal order
The diagnosis rests on a set of control experiments: when video frames are shuffled, model accuracy stays nearly unchanged and even improves on temporal awareness and action recognition tasks. In contrast, human accuracy on shuffled videos without timestamps collapses (from 1.0 to 0.0–0.2) and recovers only when timestamps are provided. Heatmaps further reveal temporal position biases: models focus on the start and end of short videos and on the beginning of long videos. Together, these diagnostics indicate that Video-LLMs rely on spatio-temporal shortcuts and position biases rather than causal reasoning, which motivates explicit temporal enhancement.
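The control experiment above can be sketched as a shuffle-sensitivity check. This is a minimal illustration, not the authors' evaluation code: `Model`, `accuracy`, `shuffle_gap`, and the two toy models are hypothetical names standing in for a real Video-LLM and benchmark.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: any callable mapping (frames, question) -> answer string.
Model = Callable[[Sequence[int], str], str]
Sample = Tuple[List[int], str, str]  # (frames, question, ground-truth answer)


def accuracy(model: Model, samples: Sequence[Sample],
             shuffle: bool = False, seed: int = 0) -> float:
    """Fraction of samples answered correctly, optionally with shuffled frames."""
    rng = random.Random(seed)
    correct = 0
    for frames, question, answer in samples:
        frames = list(frames)  # copy so the caller's data is untouched
        if shuffle:
            rng.shuffle(frames)
        correct += int(model(frames, question) == answer)
    return correct / len(samples)


def shuffle_gap(model: Model, samples: Sequence[Sample], seed: int = 0) -> float:
    """Ordered accuracy minus shuffled accuracy; near zero suggests time-agnosticism."""
    return accuracy(model, samples) - accuracy(model, samples, shuffle=True, seed=seed)


# Toy stand-ins (assumptions): one model ignores order, one depends on it.
def bag_model(frames: Sequence[int], question: str) -> str:
    return str(max(frames))  # uses only the set of frames


def first_frame_model(frames: Sequence[int], question: str) -> str:
    return str(frames[0])  # depends on which frame comes first


bag_samples = [([3, 1, 2], "largest frame id?", "3"), ([5, 9, 7], "largest frame id?", "9")]
order_samples = [([3, 1, 2], "first frame id?", "3"), ([5, 9, 7], "first frame id?", "5")]
```

An order-agnostic model shows a gap of exactly zero under this check, which is the behavior the summary attributes to current Video-LLMs; an order-dependent model can only lose accuracy when frames are shuffled.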
2. Streaming Temporal Perception Enhancement (SOPE): Forcing the model to order frames
To teach temporal reasoning, a Temporal Reconstruction (TR) auxiliary task is designed. Given a sequence of patched video tokens \(\mathbf{X} = [\tilde{\mathbf{v}}_{1,1}, \ldots, \tilde{\mathbf{v}}_{1,N_f}, \tilde{\mathbf{v}}_{2,1}, \ldots]\), a timestamp token \(\mathbf{ts}_i\) is inserted before each frame, and the frame content is then shuffled. The following instruction is prepended to the prompt: "These segments are shuffled, please list the true time range for each segment." This casts temporal prediction as a next-token prediction task, leveraging the LLM's text-reordering capabilities without adding extra modules or losses. It upgrades memory from an "unordered cache" to an "ordered state chain," enabling retrieval to locate event times rather than just content. Training is lightweight: only 30k samples from LLaVA-Video-178K, 1 epoch of LoRA training, on 8 GPUs.
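One way to picture a TR training sample is the sketch below. It is an assumption-laden illustration: the summary does not specify the exact token layout, so the slot labels, the `<f0>`-style frame placeholders, the target string format, and the function name `build_tr_sample` are all hypothetical. Only the instruction text is taken from the description above.

```python
import random
from typing import List, Tuple

# Instruction text quoted in the summary.
TR_PROMPT = "These segments are shuffled, please list the true time range for each segment."

# (start_sec, end_sec, frame_placeholder); placeholders stand in for visual tokens.
Segment = Tuple[float, float, str]


def build_tr_sample(segments: List[Segment], seed: int = 0) -> Tuple[str, str]:
    """Shuffle segment content and build a (prompt, target) text pair.

    The prompt shows segments in shuffled order under neutral slot labels;
    the target lists each slot's true time range, so the task is plain
    next-token prediction over text.
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)

    prompt_lines = [TR_PROMPT]
    target_lines = []
    for slot, idx in enumerate(order, start=1):
        start, end, frame = segments[idx]
        prompt_lines.append(f"segment {slot}: {frame}")
        target_lines.append(f"segment {slot}: {start:.1f}s-{end:.1f}s")
    return "\n".join(prompt_lines), "\n".join(target_lines)


segments = [(0.0, 2.0, "<f0>"), (2.0, 4.0, "<f1>"), (4.0, 6.0, "<f2>")]
prompt, target = build_tr_sample(segments, seed=1)
```

Because the supervision is ordinary text, a sample like this can be merged with the original QA turn and trained with the standard language modeling loss, which matches the "no extra modules or losses" claim.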
3. Past-Current Dynamic Focus Cache (PCDF-Cache): Look Now, Recall if Needed
With temporal awareness established, PCDF-Cache addresses "when and how much to backtrack" via a "Look Now, Recall if Needed" strategy. When query \(q\) arrives at time \(t\), the model first predicts an answer \(a_t^{(0)}\) using only the short-term window \(\mathcal{M}_{t-1}[-C:]\) and calculates its prediction entropy \(H_t = \text{Entropy}(a_t^{(0)})\). If \(H_t < \delta\), the answer is adopted directly to save computation. If \(H_t \geq \delta\), coarse-to-fine retrieval is triggered. Retrieval has two stages: first, frame-level cosine similarity \(\text{Sim}(f_i^v, f^q)\) identifies a coarse set \(\mathcal{M}_{\text{coarse}}\); second, fine matching is performed within that set via late-interaction max-sim: \(\text{maxSim}(\{f_{i,k}^v\}, \{f_j^q\}) = \sum_{j=1}^{N_q}\max_{1\leq k\leq N_i}\langle f_j^q, f_{i,k}^v \rangle\). The top-\(K\) frames (up to 64) are retrieved. Restricting token-level matching to the coarse set approaches token-level accuracy at close to frame-level cost and avoids the OOM failure of fine-only retrieval.
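The gate and the two retrieval stages can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the entropy is assumed to be the mean Shannon entropy (in nats) over the answer tokens' predictive distributions, features are assumed non-zero, and all function and argument names are hypothetical.

```python
import numpy as np


def mean_token_entropy(token_probs: np.ndarray) -> float:
    """Mean Shannon entropy in nats over answer tokens; token_probs has shape (T, V)."""
    p = np.clip(token_probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())


def needs_retrieval(token_probs: np.ndarray, delta: float = 0.6) -> bool:
    """Gate: keep the short-term answer if H_t < delta, otherwise recall history."""
    return mean_token_entropy(token_probs) >= delta


def coarse_to_fine(frame_feats, token_feats, q_feat, q_tokens, n_coarse, k):
    """Return indices of the top-k frames, in temporal order.

    frame_feats: (N, D) pooled frame features
    token_feats: list of N arrays, each (N_i, D) patch-token features
    q_feat:      (D,) pooled query feature
    q_tokens:    (N_q, D) query token features
    """
    # Coarse stage: frame-level cosine similarity against the pooled query.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = q_feat / np.linalg.norm(q_feat)
    coarse = np.argsort(-(f @ q))[:n_coarse]

    # Fine stage: maxSim = sum over query tokens of the best-matching frame token.
    scores = np.array([(q_tokens @ token_feats[i].T).max(axis=1).sum() for i in coarse])
    keep = coarse[np.argsort(-scores)[:k]]
    return np.sort(keep)  # restore temporal order before feeding the model


# Toy example: 4 frames with 2-D features (values are illustrative only).
frame_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 1.0]])
token_feats = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]),
               np.array([[2.0, 0.0], [0.0, 1.0]]), np.array([[0.0, 1.0]])]
q_feat = np.array([1.0, 0.0])
q_tokens = np.array([[1.0, 0.0]])

selected = coarse_to_fine(frame_feats, token_feats, q_feat, q_tokens, n_coarse=2, k=1)
confident = np.array([[1.0, 0.0, 0.0, 0.0]])     # peaked distribution
uncertain = np.array([[0.25, 0.25, 0.25, 0.25]])  # uniform distribution
```

In the toy data the coarse stage keeps frames 0 and 2, and the fine stage prefers frame 2 because one of its patch tokens matches the query token more strongly. Token-level dot products are computed only for the coarse candidates, which is the source of the memory saving over fine-only matching.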
Loss & Training
- Standard next-token prediction language modeling loss is used, with the TR auxiliary task and original QA merged into a single-turn conversation.
- 30k samples are randomly drawn from offline video instruction-tuning (IT) data for LoRA fine-tuning (lr=\(1\times10^{-5}\)) for 1 epoch.
- The entropy threshold during inference is set to \(\delta=0.6\) (optimal value from ablation); implemented based on ReKV with a maximum of 64 recalled frames.
Key Experimental Results
Main Results
Streaming Multi-Turn Evaluation based on LLaVA-OV-7B:
| Method | OVO-Bench Overall | Streaming-Bench Real-Time |
|---|---|---|
| LLaVA-OV-7B + StreamBridge | 61.72 | 68.39 |
| LLaVA-OV-7B + ReKV | 61.72 | 66.15 |
| LLaVA-OV-7B + WeaveTime | 68.82 (+7.10) | 72.13 (+3.74) |
Evaluation based on Qwen2-VL-7B:
| Method | OVO-Bench Overall | Streaming-Bench Real-Time |
|---|---|---|
| Qwen2-VL-7B + StreamBridge | 63.35 | 72.01 |
| Qwen2-VL-7B + ReKV | 59.72 | 70.07 |
| Qwen2-VL-7B + WeaveTime | 66.28 | 75.39 |
Improvements on temporally sensitive sub-tasks are especially large: ACP +7.56%, EU +9.04%, ACR +11.09%.
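The parenthesized gains in the LLaVA-OV-7B table can be reproduced by subtracting the StreamBridge row, and the same arithmetic gives the unstated margins for Qwen2-VL-7B. The snippet below is only a worked check of the numbers in the two tables above; the `results` dictionary layout and the `gains` helper are illustrative names.

```python
# (OVO-Bench Overall, Streaming-Bench Real-Time), copied from the tables above.
results = {
    "LLaVA-OV-7B": {
        "StreamBridge": (61.72, 68.39),
        "ReKV": (61.72, 66.15),
        "WeaveTime": (68.82, 72.13),
    },
    "Qwen2-VL-7B": {
        "StreamBridge": (63.35, 72.01),
        "ReKV": (59.72, 70.07),
        "WeaveTime": (66.28, 75.39),
    },
}


def gains(backbone: str, baseline: str) -> tuple:
    """WeaveTime minus the chosen baseline, rounded to two decimals."""
    ours = results[backbone]["WeaveTime"]
    base = results[backbone][baseline]
    return tuple(round(o - b, 2) for o, b in zip(ours, base))


for backbone in results:
    for baseline in ("StreamBridge", "ReKV"):
        print(backbone, "vs", baseline, gains(backbone, baseline))
```

Against ReKV the Streaming-Bench margin is larger than against StreamBridge for both backbones, so the reported +3.74 is the more conservative of the two comparisons.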
Ablation Study
| SOPE w/ TP | SOPE w/ TR | PCDF-Cache | OVO-Bench | Gain | Streaming-Bench | Gain |
|---|---|---|---|---|---|---|
| | | | 53.56 | - | 66.15 | - |
| ✔ | | | 49.88 | -3.68 | 65.91 | -0.54 |
| ✔ | ✔ | | 55.70 | +5.82 | 68.49 | +2.58 |
| ✔ | ✔ | ✔ | 57.57 | +1.87 | 72.13 | +3.64 |
Retrieval strategy comparison (LLaVA-OV-7B):
| Method | QAEGO4D Recall↑ | QAEGO4D Acc↑ | MLVU Acc↑ | EventHALL Acc↑ |
|---|---|---|---|---|
| LLaVA-OV | 14.0 | 52.8 | 64.7 | 60.1 |
| + ReKV | 23.9 | 54.3 | 68.5 | 60.6 |
| + C2F (Ours) | 25.2 | 55.2 | 68.9 | 61.4 |
| + Fine-only | OOM | — | — | — |
Key Findings
- Fine-tuning on small-scale offline data with only timestamp prompts (no TR) leads to a decline in streaming performance (-3.68%), indicating distribution mismatch.
- Adding Temporal Reconstruction (TR) significantly improves performance (+5.82%) under the same data budget, proving the effectiveness of SOPE.
- The optimal entropy threshold \(\delta\) is 0.6: too low causes interference from frequent recalls, while too high results in insufficient temporal evidence.
- With only 30k offline samples and 8 GPUs, WeaveTime matches the performance of StreamForest, which uses 121k streaming samples and 32 GPUs: roughly a 4x reduction in both data and hardware.
- Fine-only token-level retrieval causes OOM, validating the necessity of the coarse-to-fine (C2F) strategy.
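The threshold trade-off in the findings above follows directly from the gating rule (retrieve when \(H_t \geq \delta\)). The sketch below shows how the fraction of queries that trigger retrieval changes with \(\delta\); the entropy values are made-up illustrations, not measurements from the paper, and `retrieval_rate` is a hypothetical helper.

```python
from typing import Sequence


def retrieval_rate(entropies: Sequence[float], delta: float) -> float:
    """Fraction of queries whose entropy meets or exceeds the threshold."""
    return sum(h >= delta for h in entropies) / len(entropies)


# Illustrative per-query entropies (assumed values).
entropies = [0.1, 0.5, 0.7, 1.2]

for delta in (0.0, 0.6, 2.0):
    print(f"delta={delta}: retrieval on {retrieval_rate(entropies, delta):.0%} of queries")
```

A threshold near zero recalls history on almost every query, which corresponds to the interference case, while a very large threshold almost never recalls and leaves the model with only the short-term window.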
Highlights & Insights
- Compelling Diagnostic Experiments: Showing that frame shuffling affects humans but not models clearly reveals the fundamental flaw of Video-LLMs.
- "Teach First, Use Later" Philosophy: The two-stage design is elegant—injecting temporal awareness during training and utilizing it to guide retrieval during inference.
- Practical Uncertainty Gating: Using current frames for low-uncertainty responses and backtracking only when uncertainty is high avoids redundant computation.
- Data Efficiency: Significant gains are achieved without specialized streaming data, using only 30k random samples from general offline datasets.
Limitations & Future Work
- Validated only on 7B-scale models; performance on larger scales (e.g., 72B) remains untested.
- The entropy threshold \(\delta\) is a global hyperparameter and does not adapt to specific task types.
- Temporal reconstruction assumes clear temporal cues; effectiveness may be limited in static scenes or slow-changing videos.
- The coarse-to-fine retrieval in PCDF-Cache still requires two-stage computation, which might be a bottleneck for ultra-low latency scenarios.
- Does not discuss the impact of historical QA context in multi-turn dialogues on current retrieval decisions.
Related Work & Insights
- StreamBridge: Enhances Video-LLMs through a streaming training pipeline but requires substantial streaming data and resources.
- ReKV: A retrieval-based KV cache method that retains all visual memory but lacks temporal awareness.
- StreamForest: Manages streaming memory using clustering and forest structures, requiring 121k specialized samples and 32 GPUs.
- Insight: Temporal awareness may be a foundational capability for all video tasks, not just streaming; uncertainty-driven adaptive computation is a versatile design pattern.
Rating
| Dimension | Rating |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |