
📹 Video Understanding

🧠 NeurIPS 2025 · 60 paper notes

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

This paper proves that increasing Transformer depth from a constant to \(\Theta(\log n)\) unlocks the ability to recognize regular languages and solve graph connectivity — two problems provably beyond the reach of fixed-depth Transformers — and that depth scaling is strictly more efficient than width scaling (which requires super-polynomial growth) or Chain-of-Thought (CoT) steps (which requires super-logarithmic growth).
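
In symbols, the comparison above says that for inputs of length \(n\) it suffices to scale depth as

\[
d(n) = \Theta(\log n),
\]

whereas achieving the same with a fixed depth would require width \(w(n)\) growing super-polynomially in \(n\), or a chain-of-thought budget of \(t(n) = \omega(\log n)\) steps (this is a paraphrase of the summary above, not the paper's formal theorem statements).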

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

AdaVideoRAG routes each query to one of three retrieval pathways (no retrieval / naive retrieval / graph retrieval) via a lightweight intent classifier, and pairs this with an omni-knowledge indexing module (caption + ASR + OCR + visual + knowledge graph) to balance efficiency and accuracy in long video understanding, yielding a 39.8% improvement for Qwen2.5-VL-7B on MLVU.

Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning

ALMI proposes an upper-lower body adversarial training framework: the lower-body policy learns robust locomotion under upper-body motion perturbations, while the upper-body policy learns precise motion imitation under lower-body locomotion perturbations. Through iterative adversarial training converging to a Nash equilibrium, the framework enables stable whole-body coordinated control on the Unitree H1-2 real robot.

Agentic Persona Control and Task State Tracking for Realistic User Simulation

A three-agent collaborative framework for realistic user simulation is proposed, comprising a User Agent (coordination), a State Tracking Agent (structured task state), and a Message Attributes Generation Agent (behavior attribute control conditioned on persona and state). On a restaurant ordering scenario, the framework achieves a 102.6% improvement in composite realism score (CRRS), +19.9% in persona adherence, and +284.5% in behavioral variability. A core finding is that behavior control without state awareness yields a behavioral variability score (BVS) of 0, i.e., completely rigid behavior.

Cloud4D: Estimating Cloud Properties at a High Spatial and Temporal Resolution

Cloud4D is the first learning framework to reconstruct four-dimensional (3D space + time) cloud liquid water content distributions from ground-level multi-view cameras, using a homography-guided 2D-to-3D Transformer. The method achieves less than 10% error relative to radar at 25 m spatial and 5 s temporal resolution, improving spatiotemporal resolution by an order of magnitude over satellite observations.

ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

This paper introduces ConViS, a concept-based video similarity estimation task, along with its accompanying benchmark ConViS-Bench (610 video pairs, 16 domains, 5 concepts). It systematically evaluates 10+ mainstream models on concept-conditioned video comparison, revealing significant deficiencies in current models' understanding of temporal structure and spatial context.

DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products

This paper proposes DeltaProduct, which extends DeltaNet's single-step gradient descent to \(n_h\)-step gradient descent per token, yielding a state transition matrix expressed as a product of \(n_h\) generalized Householder transformations. This achieves a tunable trade-off between expressivity and efficiency, significantly improving state-tracking capability and length extrapolation performance.
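
As a rough illustration of the transition structure described above, the per-token transition can be formed as a product of \(n_h\) generalized Householder matrices. The sketch below is a minimal, unoptimized illustration; variable names and the explicit materialization of the full matrix are my assumptions, not the paper's implementation (which would avoid building \(d \times d\) matrices directly).

```python
import torch

def deltaproduct_transition(keys, betas):
    """Illustrative state-transition matrix built from n_h generalized Householder steps.

    keys:  (n_h, d) unit-norm key vectors, one per micro-step within a token
    betas: (n_h,)   step sizes, typically constrained so each factor stays stable
    """
    d = keys.shape[-1]
    A = torch.eye(d)
    for k, beta in zip(keys, betas):
        A = (torch.eye(d) - beta * torch.outer(k, k)) @ A  # one (I - beta k k^T) factor
    return A

# n_h = 1 recovers a single rank-1 (DeltaNet-style) update; larger n_h trades
# extra compute for a more expressive, non-diagonal transition.
```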

Dense SAE Latents Are Features, Not Bugs

This paper systematically investigates frequently activating "dense latents" in sparse autoencoders (SAEs), demonstrating that they are not training artifacts but rather reflections of intrinsically dense subspaces in language model residual streams. The authors propose a six-category taxonomy of dense latents encompassing position tracking, context binding, null space, alphabetic, part-of-speech, and PCA latents.

Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

This paper proposes DANCE, a framework that achieves structured and motion-aware explainable video action recognition by disentangling action explanations into three concept types: motion dynamics, objects, and scenes.

DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering

This paper proposes Dual-Stage Adaptive Sharpening (DSAS), a training-free plug-and-play attention optimization framework. It employs Contextual Gate Weighting (CGW) to enhance attention from key passages toward the question and target positions, and Reciprocal Attention Suppression (RAS) to suppress information exchange between key and irrelevant passages, achieving an average F1 improvement of 4.2% on multi-document QA benchmarks.

DualGround: Structured Phrase and Sentence-Level Temporal Grounding

This paper identifies that existing video temporal grounding (VTG) models over-rely on the global sentence semantics encoded in the [EOS] token while neglecting word-level signals. It proposes DualGround, a dual-branch architecture that explicitly decouples global and local semantics via a sentence-level path (adaptive cross-attention) and a phrase-level path (recurrent phrase generation + Slot Attention), achieving state-of-the-art performance on QVHighlights and Charades-STA.

egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks

This paper introduces egoEMOTION — the first dataset combining egocentric vision (Meta Project Aria glasses) with physiological signals for emotion and personality recognition. It encompasses 43 participants, 50+ hours of recordings, and 16 tasks, and demonstrates that egocentric vision signals (particularly eye-tracking features) outperform conventional physiological signals for emotion prediction in real-world scenarios.

EgoGazeVQA: Egocentric Gaze-Guided Video Question Answering Benchmark

This paper introduces EgoGazeVQA, the first egocentric video question answering benchmark that incorporates user eye-gaze data. Through gaze-guided prompting strategies (textual, visual, and salience map), the benchmark demonstrates substantial improvements in MLLMs' ability to understand user intent. The Gaze Salience Map strategy raises MiniCPM-o's accuracy from 35.9% to 53.7%.

Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding

DualGround identifies a critical issue in existing VTG models — over-reliance on the global semantics of the [EOS] token while neglecting word-level signals — and proposes a sentence-level + phrase-level dual-path architecture. Through an Adaptive Cross-Attention (ACA) module and a Recurrent Phrase Generator (RPG), the model captures global and local semantics respectively, achieving state-of-the-art performance on QVHighlights and Charades-STA.

Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

This paper proposes STAVEQ2, which inserts parameter-efficient Stacked Temporal Attention (STA) modules into the Vision Encoder to address fundamental architectural deficiencies in existing Video-LLMs for fine-grained temporal understanding (e.g., distinguishing "pulling from left to right" vs. "pulling from right to left"), achieving up to 5.5% improvement on VITATECS/MVBench/Video-MME.

FastVID: Dynamic Density Pruning for Fast Video Large Language Models

This paper proposes FastVID, which systematically eliminates video token redundancy along both temporal and visual dimensions via Dynamic Temporal Segmentation (DySeg) and Density Spatiotemporal Pruning (STPrune). On LLaVA-OneVision-7B, FastVID retains 98% accuracy after pruning 90.3% of video tokens, achieving a 7.1× speedup in the LLM prefill stage.
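
The temporal-segmentation idea can be pictured as cutting the frame sequence where adjacent frames are least similar; the sketch below is a generic illustration of that flavor of segmentation, not FastVID's actual DySeg criterion or its density-based pruning.

```python
import torch
import torch.nn.functional as F

def segment_by_similarity(frame_feats, max_segments):
    """Split T frames into segments by cutting at the least-similar adjacent pairs.

    frame_feats: (T, d) per-frame features (e.g., pooled vision-encoder tokens)
    Returns a list of (start, end) index ranges covering all T frames.
    """
    f = F.normalize(frame_feats, dim=-1)
    sims = (f[:-1] * f[1:]).sum(-1)                      # cosine similarity of adjacent frames
    cuts = torch.topk(-sims, k=max_segments - 1).indices.sort().values + 1
    bounds = [0, *cuts.tolist(), frame_feats.shape[0]]
    return list(zip(bounds[:-1], bounds[1:]))

# Tokens inside a segment are then natural candidates for merging or pruning,
# since they tend to be visually redundant.
```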

Fixed-Point RNNs: Interpolating from Diagonal to Dense

This paper proposes the Fixed-Point RNN framework, which parameterizes dense linear RNNs as fixed points of diagonal linear RNNs. By varying the number of iterations, the model dynamically interpolates between diagonal (efficient) and dense (expressive) regimes, achieving state-of-the-art results simultaneously on state-tracking (\(A_5\)/\(S_5\)) and copying tasks for the first time.
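
A schematic sketch of the fixed-point idea, under my own simplification (the paper's exact parameterization likely differs): iterating a diagonal linear step wrapped around a cheap mixing map converges to a state that is a dense linear function of the input, so the iteration count interpolates between the diagonal and dense regimes.

```python
import torch

def fixed_point_recurrence(u, lam, M, n_iters=8):
    """Iterate h <- lam * (M @ h + u), an illustrative diagonal-step fixed-point solve.

    u:   (d,)   input to the current recurrence step
    lam: (d,)   diagonal gains (assumed small enough for the iteration to converge)
    M:   (d, d) mixing matrix
    The fixed point h* = (I - diag(lam) @ M)^{-1} diag(lam) u is a dense linear
    map of u; truncating at few iterations stays close to the purely diagonal case.
    """
    h = torch.zeros_like(u)
    for _ in range(n_iters):
        h = lam * (M @ h + u)
    return h
```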

Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

A cross-attention multimodal architecture is proposed that integrates V-JEPA 2 visual context features with CoMotion 3D skeletal pose data, outperforming unimodal baselines on standard and high-occlusion action recognition benchmarks.
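
A minimal sketch of this kind of cross-attention fusion is shown below; which stream supplies queries versus keys/values, the embedding dimension, and the class count are my assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PoseVideoFusion(nn.Module):
    """Illustrative cross-attention fusion of 3D-pose tokens with video context tokens."""

    def __init__(self, dim=512, heads=8, num_classes=400):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, pose_tokens, video_tokens):
        # pose_tokens: (B, P, dim) from a pose encoder; video_tokens: (B, V, dim) from the vision model
        fused, _ = self.attn(query=pose_tokens, key=video_tokens, value=video_tokens)
        return self.head(fused.mean(dim=1))  # pooled action logits
```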

InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

This paper proposes InfiniPot-V, the first training-free and query-agnostic streaming video understanding framework. It achieves online KV cache compression via two complementary metrics — Temporal-axis Redundancy (TaR) and Value-Norm (VaN) — enabling streaming video understanding of arbitrary length under a fixed memory budget.

InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras

This paper introduces InFlux, the first real-world video benchmark with per-frame ground-truth dynamic camera intrinsics (386 videos, 143K+ annotated frames). Accurate annotations are achieved via a lookup table (LUT) mapping lens metadata to intrinsic parameters. The benchmark reveals that existing intrinsic prediction methods perform poorly under dynamic intrinsic settings.

INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning

This work presents the complete Inst-IT framework: a GPT-4o-assisted automatic annotation pipeline for generating instance-level fine-grained data, an Inst-IT Bench evaluation benchmark, a 335K QA-pair instruction tuning dataset, and a continual fine-tuning paradigm that effectively enhances instance-level understanding in LMMs while also improving general image and video comprehension.

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

This paper proposes the PBHC framework, which enables a humanoid robot (Unitree G1) to learn highly dynamic whole-body skills such as kung fu and dance through a physics-aware motion processing pipeline and a bi-level optimization scheme for adaptive tracking factors. The approach achieves substantially lower tracking errors than existing methods and is successfully deployed on real hardware.

Lattice Boltzmann Model for Learning Real-World Pixel Dynamicity

Inspired by the Lattice Boltzmann Method from fluid dynamics, this work proposes the Lattice Boltzmann Model (LBM) for online, real-time pixel tracking. It models video pixels as fluid lattices and solves their motion states via collision-streaming updates, achieving SOTA online tracking performance with only 18M parameters while supporting real-time inference on edge devices.
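
For reference, the classical collision-streaming update from the Lattice Boltzmann Method that motivates this model is

\[
f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\; t + \Delta t) \;=\; f_i(\mathbf{x}, t) \;-\; \frac{1}{\tau}\Bigl[f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t)\Bigr],
\]

where \(f_i\) are distribution functions along discrete lattice directions \(\mathbf{e}_i\) and \(\tau\) is the relaxation time; the relaxation toward \(f_i^{\mathrm{eq}}\) is the collision step and the spatial shift along \(\mathbf{e}_i\) is the streaming step. How the paper's learned variant parameterizes these terms for pixel motion is not detailed here.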

Less is More: Local Intrinsic Dimensions of Contextual Language Models

This paper proposes using the Local Intrinsic Dimension (LID) of contextual token embeddings as an unsupervised signal for monitoring LLM training dynamics — a decrease in LID indicates improved generalization, while an increase signals overfitting. The utility of this geometric signal is validated on tasks including dialogue state tracking, grokking, and sentiment recognition.
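
A common estimator for such a local intrinsic dimension is the Levina-Bickel MLE over the k nearest neighbors; the sketch below shows that standard estimator (whether the paper uses this exact estimator or this choice of k is not specified here and may differ).

```python
import numpy as np

def lid_mle(query, embeddings, k=20):
    """Levina-Bickel MLE of the local intrinsic dimension around `query`.

    query:      (d,)   one contextual token embedding
    embeddings: (n, d) reference embeddings used to find nearest neighbors
    """
    dists = np.sort(np.linalg.norm(embeddings - query, axis=1))
    dists = dists[dists > 0][:k]                  # drop exact duplicates of the query
    return -1.0 / np.mean(np.log(dists[:-1] / dists[-1]))
```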

LiveStar: Live Streaming Assistant for Real-World Online Video Understanding

This paper proposes LiveStar, an always-on live streaming video understanding assistant that achieves adaptive response timing via a Streaming Causal Attention Masks (SCAM) training strategy and a Streaming Verification Decoding (SVeD) inference framework, improving semantic correctness by 19.5% and reducing temporal deviation by 18.1% on the OmniStar benchmark.

MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments

This paper proposes the MEMTRACK benchmark to evaluate LLM agents' long-term memory and state tracking capabilities in multi-platform dynamic environments (Slack/Linear/Git), revealing that even the strongest model, GPT-5, achieves only 60% accuracy.

MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

This work introduces MimeQA, the first nonverbal social reasoning benchmark built on mime performance videos. It comprises 101 videos and 806 QA pairs organized across three hierarchical question levels (grounding the imagined → scene-level understanding → global reasoning), and reveals a severe gap between current VideoLLMs and humans on nonverbal social understanding (20–30% vs. 86%).

MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence

This paper introduces MUVR, a benchmark for multi-modal untrimmed video retrieval targeting real-world long-video platforms. It proposes a video-centric multi-modal query format (video + text + tag + mask) and a six-level visual correspondence matching criterion, comprising 53K videos and 1,050 queries, and systematically evaluates the limitations of retrieval models and MLLMs.

Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions

This paper proposes Neural Stochastic Flows (NSF), which directly learns the transition distribution \(p(x_t \mid x_s)\) of an SDE via conditional normalising flows. The architecture is constrained to satisfy stochastic flow properties (identity, Markov, Chapman-Kolmogorov), enabling single-step sampling without numerical solvers and achieving up to two orders of magnitude speedup at distant time points.
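
Restating the flow properties mentioned above as constraints on the learned transition density (notation is mine):

\[
p(x' \mid x;\, s \to s) = \delta(x' - x),
\qquad
p(x_t \mid x_s) = \int p(x_t \mid x_u)\, p(x_u \mid x_s)\, \mathrm{d}x_u \quad (s \le u \le t),
\]

together with the Markov requirement that the transition from \(x_s\) to \(x_t\) not depend on states before time \(s\). Enforcing these by construction is what allows a distant \(x_t\) to be sampled in a single step rather than by composing many small solver steps.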

NeuroPath: Neurobiology-Inspired Path Tracking and Reflection for Semantically Coherent Retrieval

Inspired by the hippocampal place cell navigation and memory consolidation mechanisms in neurobiology, this paper proposes NeuroPath—a RAG framework based on semantic path tracking—that achieves average improvements of 16.3% in recall@2 and 13.5% in recall@5 on multi-hop QA tasks through LLM-driven goal-directed path construction and a post-retrieval completion strategy.

Open-World Drone Active Tracking with Goal-Centered Rewards

This paper introduces DAT, the first open-world drone active tracking benchmark comprising 24 city-scale scenes with high-fidelity dynamics simulation, along with GC-VAT, a reinforcement learning tracking method based on goal-centered rewards and curriculum learning, achieving approximately 72% tracking success rate in simulation.

PASS: Path-Selective State Space Model for Event-Based Recognition

PASS proposes the Path-selective Event Aggregation and Scan (PEAS) module and the Multi-faceted Selection Guiding (MSG) loss, leveraging the linear complexity and frequency generalization capability of SSMs to recognize event streams across a broad range of lengths (\(10^6\) to \(10^9\) events), while limiting performance degradation to only 8.62% under varying inference frequencies (compared to 20.69% for the baseline).

PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?

By introducing four motion-centric probing techniques and the MoCentric-Bench benchmark, this paper demonstrates that current video multimodal LLMs fail to genuinely exploit motion information in pixel-level visual grounding tasks and can be deceived by static keyframes.

PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

This paper introduces the Online Audio-Visual Event Parsing (On-AVEP) paradigm along with the PreFM framework, which leverages pseudo-future sequences to enhance current contextual understanding. Combined with modality-agnostic knowledge distillation and focal temporal prioritization, PreFM surpasses offline SOTA methods by 9.3 points in event-level average F1 while using only 2.7% of their parameter count.

QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

This paper introduces the NeuComBack benchmark for evaluating neural compilation on IR-to-assembly translation tasks, and proposes a self-evolving prompt optimization method that iteratively improves compilation prompts by learning from LLM self-debugging trajectories. The approach raises correctness from 44% to 64%, with 87.5% of correctly generated programs outperforming clang-O3.

Revisiting Bi-Linear State Transitions in Recurrent Neural Networks

This paper systematically revisits bilinear state transitions in RNNs—i.e., multiplicative interactions between the hidden state and the input—and theoretically proves that bilinear RNNs can simulate arbitrary finite-state machines. By removing additive terms, these models form a natural expressivity hierarchy ranging from diagonal to full-rank structures, revealing that popular linear RNNs such as Mamba occupy the lowest tier of this hierarchy.
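
Concretely, a purely bilinear state transition makes the transition matrix a linear function of the input and drops the additive input term; a minimal sketch (tensor shapes are assumptions):

```python
import torch

def bilinear_step(h, x, W):
    """One bilinear recurrence step: h_next = (sum_j x[j] * W[j]) @ h.

    h: (n,)       previous hidden state
    x: (k,)       input features at the current step
    W: (k, n, n)  bank of transition matrices
    """
    A = torch.einsum('k,kij->ij', x, W)  # input-conditioned state-transition matrix
    return A @ h

# Restricting every W[j] to be diagonal recovers the input-gated diagonal
# recurrences used by popular linear RNNs (e.g., Mamba), the lowest tier of the
# expressivity hierarchy described above.
```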

SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

This paper proposes the SAMA framework, which jointly models fine-grained spatio-temporal understanding and grounding in multi-turn referential video dialogue for the first time, through the construction of a unified dataset (SAMA-239K), model (spatio-temporal context aggregator + SAM), and benchmark (SAMA-Bench).

Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

This paper systematically analyzes background bias in action recognition across three model paradigms — classification models, contrastive pre-trained models (CLIP/SigLIP2), and video large language models (VLLMs) — and proposes two mitigation strategies: a dual-branch architecture that fuses segmented human inputs to reduce SBErr by 3.78% for classification models, and automated prompt tuning to reduce SBErr by 9.85% for VLLMs.

Seeing the Arrow of Time in Large Multimodal Models

This paper reveals that current large multimodal models (LMMs) are surprisingly insensitive to the temporal directionality of video (i.e., the Arrow of Time)—producing nearly identical answers for forward and reversed playback. The authors propose ArrowRL, a GRPO-based training strategy that introduces a reverse video reward to elicit temporal direction awareness, and construct AoTBench for evaluation. The approach achieves significant gains across multiple VQA benchmarks, including a 65.9% relative improvement on Vinoground.

SmartWilds: Multimodal Wildlife Monitoring Dataset

This work introduces SmartWilds, the first synchronously collected multimodal wildlife monitoring dataset, integrating three complementary modalities — drone imagery, camera traps, and bioacoustics — comprising 101 GB of data. Cross-modal alignment is achieved via GPS coordinates and timestamps. The dataset establishes a reproducible standard protocol for conservation monitoring, filling the gap in comprehensive multi-sensor fusion benchmarks for ecosystem-scale ecological research.

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

This paper proposes the STAR framework, which constructs a video analysis toolbox comprising 22 tools and enables an LLM to alternately invoke temporal and spatial tools to progressively localize a 3D Region of Interest (3D RoI) within videos, achieving improvements of 8.2% on VideoMME and 4.6% on LongVideoBench.

Steering When Necessary: Flexible Steering Large Language Models with Backtracking

This paper proposes FASB (Flexible Activation Steering with Backtracking), a framework that dynamically determines the necessity and intensity of intervention by tracking the internal states of an LLM during generation, and introduces a backtracking mechanism to correct already-deviated tokens. FASB achieves a True*Info score of 80.56% on TruthfulQA and an average accuracy of 78.8% across six multiple-choice tasks, significantly outperforming all baselines.

Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

This paper proposes PD-SSM, a structured sparse parameterization for the state transition matrix of state-space models (SSMs). The core idea is to factorize the transition matrix as a product of a column-wise one-hot matrix P and a complex diagonal matrix D (i.e., \(A = PD\)), achieving expressiveness equivalent to unstructured (dense) SSMs while retaining computational efficiency comparable to diagonal SSMs at \(\Theta(LN)\). A single layer suffices to simulate any \(N\)-state finite-state automaton (FSA). The paper provides theoretical guarantees on BIBO stability and optimal state dimensionality, with strong empirical results on FSA simulation, multivariate time-series classification, long-sequence benchmarks, and natural-language state-tracking tasks.
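
A minimal sketch of the \(A = PD\) construction (the array layout is an assumption):

```python
import numpy as np

def pd_transition(cols, diag):
    """Build A = P @ D with P column-wise one-hot and D complex diagonal.

    cols: (N,) ints; column j of P has its single 1 in row cols[j]
    diag: (N,) complex entries of D
    """
    N = len(diag)
    P = np.zeros((N, N))
    P[cols, np.arange(N)] = 1.0
    return P @ np.diag(diag)

# Acting on a one-hot state vector e_j, A sends state j to state cols[j] scaled by
# diag[j] -- exactly the kind of input-dependent state jump an FSA transition needs,
# which is the intuition behind the one-layer FSA simulation result described above.
```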

TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video

This paper introduces the TAPVid-360 task and dataset, requiring models to track the 3D direction of query points (including those outside the field of view) in narrow field-of-view video. By leveraging 360° video to generate training data and fine-tuning CoTracker3 for directional prediction, the proposed approach substantially outperforms existing methods on out-of-field-of-view tracking.

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

This paper proposes TempSamp-R1, a mixed-policy reinforcement fine-tuning framework that integrates high-quality off-policy (ground truth) guidance into GRPO's on-policy sampling and introduces nonlinear soft advantage estimation to stabilize training, achieving state-of-the-art performance on video temporal grounding (Charades-STA R1@0.7: 52.9%, ActivityNet R1@0.5: 56.0%).

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

TempSamp-R1 is a reinforcement fine-tuning framework that addresses the inefficiency of on-policy sampling in GRPO for video temporal grounding—caused by the vast search space—by introducing ground-truth annotations as off-policy supervision signals, non-linear soft advantage estimation, and a hybrid CoT training paradigm, achieving new state-of-the-art results on Charades-STA, ActivityNet, and QVHighlights.

The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation

Through a systematic analysis of 52 reasoning benchmarks across three major model families—OpenAI, Anthropic, and Google—this paper identifies an "ouroboros" cycle: old benchmarks are rapidly saturated → new benchmarks are created to restore discriminability → new benchmarks are rapidly saturated in turn. This cycle calls into question whether improvements in benchmark scores genuinely reflect generalized reasoning ability or merely overfit to specific evaluation sets.

TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning

This paper proposes TiRex, a pretrained time series forecasting model based on xLSTM. By introducing a Contiguous Patch Masking (CPM) strategy and data augmentation techniques, TiRex with only 35M parameters comprehensively outperforms larger models such as Chronos Bolt (200M) and TimesFM (500M) on the GiftEval and Chronos-ZS benchmarks, achieving state-of-the-art performance in both short- and long-horizon zero-shot forecasting.

Token Bottleneck: One Token to Remember Dynamics

This paper proposes Token Bottleneck (ToBo), a self-supervised visual representation learning pipeline that compresses a reference scene into a single bottleneck token and uses this token together with a minimal number of target scene patches to reconstruct the subsequent scene, thereby training visual backbone networks to simultaneously encode scene information conservatively and capture temporal dynamics.

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

This paper proposes a video toolkit comprising 22 tools and the STAR (Spatiotemporal Reasoning) framework, which progressively localizes a 3D Region of Interest (RoI) via an alternating temporal–spatial tool scheduling strategy. The approach improves GPT-4o by 8.2% on VideoMME while substantially reducing the number of processed frames and computational overhead.

Tracking and Understanding Object Transformations

This paper introduces the Track Any State task and the TubeletGraph zero-shot framework, which tracks objects undergoing drastic appearance changes in video (e.g., an apple being cut, a butterfly emerging from a chrysalis) while simultaneously detecting and describing these transformations.

TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels

This paper presents TrackingWorld, a pipeline for dense 3D tracking of almost all pixels from monocular video. It lifts sparse 2D trajectories to dense ones via a tracking upsampler, iteratively tracks newly appearing objects across all frames, and employs an optimization-based framework to lift 2D trajectories into world-coordinate 3D space with explicit decoupling of camera motion and object motion.

Two Causally Related Needles in a Video Haystack

This paper proposes Causal2Needles, a benchmark of 4,100 QA pairs that binds the understanding of two causally related events via a "bridging entity," forcing VLMs to jointly retrieve and reason over two needles scattered across a long video. It reveals severe deficiencies in state-of-the-art models on this causal two-needle task (GPT-4o achieves only 13.4% accuracy on questions requiring both needles).

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

This work constructs VideoMarathon, the first large-scale hour-level video instruction-following dataset (9,700 hours, 3.3M QA pairs, 22 task types), and proposes Hour-LLaVA, a model that leverages a memory repository, forgetting mechanism, and MemAug module to enable efficient training and inference on hour-scale videos at 1 FPS, achieving state-of-the-art results among open-source models of comparable scale across four long video benchmarks.

VGEnt: Graph-Based Retrieval-Reasoning-Augmented Generation for Long Video Understanding

This paper proposes VGEnt, a graph-based retrieval-reasoning-augmented generation framework that constructs a video knowledge graph to preserve cross-segment semantic relationships, and introduces structured reasoning steps to filter noise and aggregate information. VGEnt consistently improves open-source LVLMs by 3.0%–5.4% across multiple long video understanding benchmarks and outperforms existing video RAG methods by 8.6%.

Video Finetuning Improves Reasoning Between Frames

This paper proposes a visual chain-of-thought (vCoT) approach to systematically compare image LLMs and video-finetuned LLMs on inter-frame reasoning. It finds that video finetuning enables models to implicitly learn inter-frame transition reasoning, and that this capability transfers to relational reasoning tasks on static images.

VideoLucy: Deep Memory Backtracking for Long Video Understanding

This paper proposes VideoLucy, a framework that simulates the human coarse-to-fine recall process via a hierarchical memory structure and an agent-based iterative backtracking mechanism. VideoLucy substantially outperforms existing methods on multiple long video understanding benchmarks, surpassing even commercial models such as GPT-4o.

Web-Scale Collection of Video Data for 4D Animal Reconstruction

This paper proposes a fully automated large-scale video data collection pipeline that mines and processes 30K animal videos (2M frames) from YouTube, establishes Animal-in-Motion, the first 4D quadruped reconstruction benchmark (230 sequences / 11K frames), and introduces 4D-Fauna, a baseline that achieves model-free 4D reconstruction via sequence-level optimization.

When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

This paper proposes the QV-M2 dataset (the first fully human-annotated multi-moment retrieval benchmark) and the FlashMMR framework (incorporating a Post-Verification Module), extending video moment retrieval from single-moment to multi-moment scenarios and establishing a standardized evaluation protocol for multi-moment retrieval.

When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

This paper systematically identifies the "visual thinking drift" phenomenon in which CoT reasoning frequently degrades performance in video understanding, and proposes the Visual Evidence Reward (VER) reinforcement learning framework that corrects this problem by explicitly rewarding reasoning chains grounded in visual evidence.