
📹 Video Understanding

📷 CVPR2025 · 69 paper notes

📌 Same area in other venues: 📷 CVPR2026 (187) · 🔬 ICLR2026 (48) · 🧪 ICML2026 (17) · 🤖 AAAI2026 (27) · 🧠 NeurIPS2025 (39) · 📹 ICCV2025 (56)

🔥 Top topics: LLM ×8 · Object Tracking ×6 · Multimodal/VLM ×5 · Compression ×3 · Question Answering ×3

Anomize: Better Open Vocabulary Video Anomaly Detection

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning: This paper proposes BehaviorVLM, a unified finetuning-free vision-language framework that simultaneously addresses both animal pose estimation and physical behavior understanding via a multi-stage structured reasoning pipeline. It achieves reliable keypoint tracking using only 3 human-annotated seed frames, and enables interpretable multi-animal behavioral segmentation through deep embedded clustering, VLM-based segment description, and LLM semantic merging.
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding: Proposes R-MSD (Reliable Multi-Sample Distillation), which addresses the issue of unreliable single-sample teacher supervision in black-box distillation of video LVLMs by sampling multiple teacher responses for each input and incorporating task-adaptive quality matching. The 4B student model consistently improves performance on benchmarks such as VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).

BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering: This paper proposes BIMBA, a spatiotemporal token selector based on Mamba selective scan. It compresses long video sequences of over 100K tokens by 16 times down to 6,400 key tokens, achieving state-of-the-art (SOTA) performance across 7 long-video VQA benchmarks.
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-Grained View-Invariant Video Representations: Learning fine-grained view-invariant representations between egocentric and exocentric perspectives via masked modeling, enabling self-supervised learning from the association of the two views without paired annotations.
Context-Enhanced Memory-Refined Transformer for Online Action Detection: This paper identifies a training-inference inconsistency in existing online action detection (OAD) methods: short-term memory frames receive unbalanced context exposure, and pseudo-futures leak non-causal information, which together bias learning toward intermediate frames. It proposes CMeRT, which addresses the issue with a near-past context-enhanced encoder and a near-future-based memory refinement decoder, achieving state-of-the-art performance on THUMOS'14, CrossTask, and EK100.
Cross-modal Causal Relation Alignment for Video Question Grounding: Eliminates spurious cross-modal associations in Video Question Grounding (VideoQG) via causal intervention. It introduces three modules—Gaussian smoothing grounding, cross-modal alignment, and explicit causal intervention—simultaneously improving grounding (+2.2 Acc@GQA) and question answering (+0.9 Acc@VQA) performance on NextGQA.
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos: Proposes DeCafNet, which uses a delegate-and-conquer dual-encoder strategy: a lightweight sidekick encoder extracts dense features and generates saliency maps, while an expert encoder processes only the top-c% key clips. Combined with DeCaf-Grounder, which unifies features across different temporal resolutions, it outperforms all prior methods on long video temporal grounding tasks with a 47% reduction in TFLOPs.
DivPrune: Diversity-Based Visual Token Pruning for Large Multimodal Models: Reformulates the visual token pruning problem as the Max-Min Diversity Problem (MMDP). By solving it precisely to maximize the minimum pair-wise distance within the retained token set, a training-free and calibration-free plug-and-play pruning scheme is achieved. It yields SOTA performance on 16 multimodal benchmarks, significantly outperforming all baselines particularly under extreme pruning rates of \(\ge 80\%\).
DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework: This paper proposes DPFlow, a dual-pyramid recurrent encoder combining an image pyramid and a feature pyramid with a fully-convolutional Cross-Gated Unit (CGU). Trained only on standard resolutions, DPFlow adaptively generalizes to 8K resolution inputs, achieving state-of-the-art (SOTA) performance on Sintel, KITTI, and Spring benchmarks. Furthermore, the paper introduces Kubric-NK, a multi-resolution optical flow evaluation dataset, supporting quantitative high-resolution evaluation for the first time.
DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection: Proposes the Dynamic Prototype Updating (DPU) framework, which establishes a robust representation space via Cohesive-Separate Contrastive Training, dynamically updates class centers through Dynamic Prototype Approximation, and uses Pro-ratio Discrepancy Intensification to scale how strongly the multimodal prediction discrepancy is intensified according to the sample-to-prototype distance. Serving as a plug-and-play module, it improves performance across 5 datasets and 9 base OOD methods, with up to an 80% performance gain in Far-OOD detection.
DrVideo: Document Retrieval Based Long Video Understanding: This paper proposes DrVideo, which reformulates long video understanding as a long document understanding task: it first converts video frames into text documents, locates keyframes and augments information via document retrieval, then uses a Planning-Interaction dual-agent loop to iteratively retrieve missing information, and finally answers questions in a CoT manner. It significantly outperforms existing LLM-based SOTAs on EgoSchema (3 mins), MovieChat-1K (10 mins), and Video-MME long video division (average of 44 mins).
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding: This paper proposes DynFocus, an LLM-based dynamic cooperative video encoding network. It dynamically selects Q&A-related keyframes via the DPE module, and encodes keyframes with fine-grained tokens (analogous to visual Cones) and redundant frames with very few tokens for coarse-grained encoding (analogous to visual Rods) via the CCE module, balancing spatial details and temporal dynamics under a limited token budget.
EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation: EDCFlow is proposed to exploit the complementarity between temporally dense feature difference maps and low-resolution cost volumes across adjacent event frames, achieving high-quality and lightweight event-based optical flow estimation at 1/4 resolution.
Efficient Transfer Learning for Video-language Foundation Models: The authors propose a Multimodal Spatiotemporal Adapter (MSTA), which achieves efficient transfer of video-language foundation models to downstream tasks with only 2-7% of trainable parameters, through a vision-language shared projection layer and spatiotemporal description-guided consistency constraints.
EgoLife: Towards Egocentric Life Assistant: Releases the EgoLife dataset (300 hours of egocentric multimodal videos of 6 participants co-living for a week) and the EgoLifeQA benchmark, and proposes the EgoButler system (EgoGPT + EgoRAG) to explore construction pathways for ultra-long context egocentric life assistants.
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering: This paper proposes the EgoTextVQA benchmark, which contains 1.5K egocentric videos and 7K scene-text-related QA pairs, revealing that existing MLLMs exhibit severe deficiencies in real-time scene text-aware QA assistance from an egocentric perspective (the best model, Gemini 1.5 Pro, achieves only ~33% accuracy).
ETAP: Event-based Tracking of Any Point: This paper proposes ETAP, the first tracking any point (TAP) approach designed purely for event-based cameras. It resolves the motion-dependency challenges inherent in event data via a novel feature-alignment contrastive loss. Trained on the newly constructed synthetic dataset EventKubric, ETAP significantly outperforms baseline methods, achieving a 136% improvement in the AJ metric across multiple benchmarks.
ExpertAF: Expert Actionable Feedback from Video: This paper proposes ExpertAF, the first method to generate actionable coaching feedback from video. By integrating a multimodal model with video, 3D human pose, and language, it not only generates textual feedback describing mistakes and suggesting improvements, but also retrieves/generates correct expert demonstrations. Leveraging the Ego-Exo4D dataset and LLMs to construct weakly-supervised training data, it significantly outperforms strong baselines across soccer, basketball, and climbing.
FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking: Proposes FC-Track, a lightweight post-association correction framework. By using IoA (Intersection over Area)-based appearance feature filtering and similarity comparisons within overlapping tracklet pairs, it online corrects detection-tracklet mismatch errors caused by target overlap. This reduces the ratio of long-term identity switches from 36.86% to 29.55% while maintaining SOTA performance on MOT17/MOT20.
FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video: FRAME proposes an egocentric motion capture method based on a floor-aligned coordinate system. By establishing a lightweight VR data acquisition system, it collects a large-scale real-world dataset. A geometry-aware multimodal fusion architecture is designed to effectively combine device 6D poses with camera images, achieving state-of-the-art whole-body pose prediction at 300 FPS.
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding: This work proposes FSAnno/FSBench, the first fine-grained, multimodal, and multi-level benchmark dataset for figure skating. It covers a comprehensive chain of tasks from prior knowledge testing and individual action recognition/evaluation/commentary to overall performance evaluation/commentary, revealing significant deficiencies in existing MLLMs regarding artistic sports understanding.
GG-SSMs: Graph-Generating State Space Models: Proposes Graph-Generating State Space Models (GG-SSMs), which dynamically construct a Minimum Spanning Tree (MST) based on feature similarity to replace the fixed 1D scanning paths in traditional SSMs. This enables efficient modeling of complex non-local dependencies in high-dimensional data, achieving SOTA performance on 11 datasets.
H-MoRe: Learning Human-centric Motion Representation for Action Analysis: This paper proposes H-MoRe (Human-centric Motion Representation), a joint self-supervised learning framework with skeleton constraints and boundary constraints. It learns precise, human-centric motion representations (world-local flows) from real-world scenes, significantly outperforming traditional optical flow methods across gait recognition (CL@R1 +16.01%), action recognition (Acc@1 +8.92%), and video generation (FVD -67.07%).
Heterogeneous Skeleton-Based Action Representation Learning: This work is the first to investigate the heterogeneity of human skeleton data (varying joint numbers and coordinate dimensions). It proposes three core components: a 3D pose estimation module to unify dimensions, skeleton-specific prompts to unify topologies, and semantic motion encoding to introduce semantic information. Combined with a self-supervised unified representation learning framework, this approach achieves significant improvements on NTU-60/120 and PKU-MMD II.
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding: Proposes HierarQ, a task-aware hierarchical Q-Former framework that achieves autoregressive frame-by-frame video processing through a two-stream language-guided feature modulator (entity stream + scene stream) and short/long-term memory banks. It bypasses the LLM context length limit without frame sampling, achieving SOTA or near-SOTA performance on 10 video understanding benchmarks.
Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity: This paper proposes Holmes-VAU, constructing HIVAU-70k, a video anomaly understanding benchmark with over 70k multi-granularity annotations. It also designs an Anomaly-focused Temporal Sampler (ATS) that enables multimodal VLMs to focus on anomaly-dense regions, significantly outperforming existing methods on long-term video anomaly detection and reasoning tasks.
HuMoCon: Concept Discovery for Human Motion Understanding: HuMoCon is a motion-video understanding framework designed for human action analysis. Its core innovation lies in the encoder pre-training stage, where it discovers semantic motion concepts (codebook) via explicit video-motion feature alignment and a velocity reconstruction-based high-frequency information preservation mechanism. This significantly enhances the human motion understanding and reasoning capabilities of downstream LLMs.
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation: HyperGLM combines entity scene graphs (capturing spatial relationships) and a procedural graph (modeling causal temporal transitions) into a unified HyperGraph, and injects it into a multimodal LLM to perform video scene graph generation, anticipation, and reasoning. It also releases the VSGR dataset, which contains 1.9 million frames and supports five tasks.
Learning Audio-Guided Video Representation with Gated Attention for Video-Text Retrieval: Proposes the AVIGATE framework, which selectively fuses audio and visual information through a gated attention mechanism (filtering out uninformative audio) and uses an adaptive margin contrastive loss to handle ambiguous positive-negative relationships between videos and texts, achieving state-of-the-art (SOTA) performance on multiple video-text retrieval benchmarks.
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant: The LION-FS online video assistant framework is proposed, inspired by the "fast and slow thinking" cognitive theory. It utilizes a Fast Path (router-based token aggregation and dropping) to achieve efficient real-time response decision-making, and a Slow Path (multi-granularity keyframe augmentation) to inject fine-grained spatial and interaction features during response generation. It comprehensively outperforms existing methods on the Ego4D and Ego-Exo4D benchmarks.
LLAVIDAL: A Large Language Vision Model for Daily Activities of Living: To address the understanding of activities of daily living (ADL), this work constructs ADL-X, a multi-view multimodal instruction-tuning dataset, and proposes the LLAVIDAL model, which integrates video, 3D skeleton, and HOI cues and achieves SOTA performance through the MMPro progressive training strategy.
Localizing Events in Videos with Multimodal Queries: This work proposes the ICQ benchmark and the ICQ-Highlight dataset, representing the first systematic study of replacing text-only queries with multimodal queries (image + text) for video event localization. It also designs three query adaptation methods and the SUIT proxy fine-tuning strategy.
M-LLM Based Video Frame Selection for Efficient Video Understanding: A lightweight M-LLM frame selector is proposed. Trained via spatial and temporal pseudo-labels, it adaptively selects the most question-relevant frames for downstream video LLMs, improving performance across multiple video QA benchmarks without requiring fine-tuning of the downstream models.
MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking: MambaVLT is the first Mamba-based vision-language tracker. Leveraging the temporal evolution of state spaces, it maintains long-term target memory and adaptively updates multimodal reference features, achieving state-of-the-art (SOTA) performance on multiple vision-language tracking benchmarks.
MLVU: Benchmarking Multi-task Long Video Understanding: This work proposes the MLVU benchmark, which systematically evaluates the capability of multimodal large language models in long video understanding through 9 diverse evaluation tasks, various video types, and flexible duration settings, revealing significant limitations of existing models in processing long videos.
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding: This paper proposes MMVU, an expert-annotated benchmark containing 3,000 video understanding questions across 27 disciplines, to evaluate the expert-level knowledge reasoning capabilities of multimodal foundation models in professional video domains, revealing that even the strongest models still significantly lag behind human experts.
MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking: This work proposes the first large-scale multispectral UAV single object tracking dataset, MUST (250 sequences, 43K frames, 8 spectral bands), and designs a unified framework named UNTrack to fuse spectral, spatial, and temporal features, achieving efficient and robust tracking via an asymmetric Transformer and a spectral prompt encoder.
Number it: Temporal Grounding Videos like Flipping Manga: This paper introduces NumPro, which overlays frame indices directly onto the bottom-right corner of video frames. This binds the tasks of "observing an event" and "reporting the corresponding frame index" into a single OCR task for Vid-LLMs, significantly improving the mIoU and mAP of Video Temporal Grounding (VTG) under zero-shot settings or lightweight LoRA fine-tuning.
Object-Shot Enhanced Grounding Network for Egocentric Video: OSGNet targets two major shortcomings of egocentric Natural Language Queries (NLQ) methods: visual features that lack fine-grained object information, and neglect of the attention switching implied by head-mounted camera motion. It proposes a two-branch architecture consisting of an object branch (Co-DETR + CLIP text encoder) and a shot branch (shot segmentation based on head turns + shot-level contrastive learning), achieving state-of-the-art (SOTA) performance on Ego4D-NLQ, Goal-Step, and TACoS.
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks: Omni-RGPT proposes the Token Mark mechanism to directly mark target regions in the visual feature space, unifying regional-level understanding for both images and videos. Together with the RegVID-300k dataset containing 300,000 region-level video instructions, it achieves SOTA performance on tasks such as commonsense reasoning.
OmniTrack: Omnidirectional Multi-Object Tracking: This paper proposes OmniTrack, the first multi-object tracking framework for 360° panoramic images, which unifies the Tracking-by-Detection (TBD) and end-to-end (E2E) tracking paradigms. It mitigates panoramic distortion through the CircularStatE module, introduces temporal priors via FlexiTrack instances, and provides trajectory feedback via Tracklet Management, alongside constructing the QuadTrack dataset for quadruped robot panoramic MOT.
On the Consistency of Video Large Language Models in Temporal Comprehension: This work systematically investigates the prediction consistency of Video Large Language Models (Video-LLMs) in temporal comprehension. It reveals that current models exhibit extremely poor consistency (near-random levels) under probes such as query rephrasing, temporal shifting, and self-verification. To address this, the Event Temporal Verification Tuning (VTune) method is proposed, which explicitly incorporates consistency to significantly improve both grounding and consistency performance.
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?: OVO-Bench is the first online video benchmark that emphasizes the importance of timestamps in video understanding. It categorizes online video understanding into three modes: "Backward Tracing", "Real-Time Perception", and "Forward Active Responding", evaluating the online understanding capabilities of Video-LLMs through 12 tasks, 644 videos, and over 2,800 fine-grained annotations.
PAVE: Patching and Adapting Video Large Language Models: PAVE proposes a framework to adapt pre-trained Video LLMs via lightweight "patches," integrating side-channel signals such as audio, 3D cues, and multi-view videos into the base model with only about 0.1% additional parameters and computation, outperforming specialized models on tasks like audio-visual QA and 3D QA.
Progress-Aware Video Frame Captioning: This paper introduces the new task of "progress-aware video frame captioning" and develops the ProgressCaptioner model. Through a two-stage training paradigm (frame pair \(\rightarrow\) frame sequence) and an automated pseudo-label filtering mechanism, it generates fine-grained descriptions that precisely capture the frame-by-frame evolution of actions, significantly outperforming GPT-4o and Gemini-1.5-Pro on the self-constructed FrameCapEval benchmark.
Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs: This paper proposes Q-Bench-Video, the first benchmark to systematically evaluate the video quality understanding capabilities of Large Multimodal Models (LMMs). It covers natural, AIGC, and CG videos, four quality dimensions, and multiple question types.
QA-TIGER: Question-Aware Gaussian Experts for Audio-Visual Question Answering: Proposes QA-TIGER, a framework that models the temporal structure of video through continuous adaptive weighting with a Mixture of Gaussian Experts (MoE), and injects question information early in the encoding process to achieve progressive semantic refinement, reaching SOTA on multiple AVQA benchmarks.
T*: Re-thinking Temporal Search for Long-Form Video Understanding: Proposes T*, a lightweight temporal search framework that reformulates expensive temporal search as a spatial search problem. It iteratively localizes keyframes in both temporal and spatial dimensions via an adaptive zooming mechanism. Combined with LV-Haystack, the first large-scale benchmark for long-form video keyframe search, it significantly improves the performance of existing VLMs in long-form video understanding.
ReWind: Understanding Long Videos with Instructed Learnable Memory: This paper proposes ReWind, a vision-language model architecture based on a learnable memory module. Through a novel read-perceive-write loop mechanism and instruction-guided dynamic frame selection, it significantly outperforms previous methods on long video VQA and temporal localization tasks while using fewer tokens and frames.
SEAL: SEmantic Attention Learning for Long Video Representation: This paper proposes SEAL, a unified long video representation method that decomposes video into three semantic tokens (scene, object, and action). It uses a query-aware subset selection optimization to balance relevance and diversity, achieving a score of 45.9% on LVBench and outperforming Qwen2-VL-72B (41.3%).
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding: Seq2Time proposes a data-driven training paradigm that converts large-scale image sequences and short video clips into training data simulating the temporal structure of long videos, and introduces unified relative position tokens. Without relying on abundant timestamp annotations, this approach significantly enhances the temporal understanding capability of video LLMs (achieving a 27.6% F1 improvement on YouCook2 and a 14.7% R@1 improvement on Charades-STA).
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding: Proposes SeriesBench, the first video benchmark for narrative-driven drama series understanding, covering 105 series, 28 tasks, and 5 major dimensions, and introduces the PC-DCoT (Plot-Character Double-Chain-of-Thought) framework, which boosts MLLM performance by over 10%.
Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking: A significant level of layer redundancy (feature saturation) is identified in the deep layers of lightweight ViT trackers. SGLATrack, a similarity-guided layer-adaptive approach, is proposed to dynamically disable redundant layers and retain only a single optimal layer, achieving real-time UAV tracking at 225 FPS on GPU.
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding: Proposes STOP, an integrated spatial-temporal dynamic prompting method for video understanding. It adaptively highlights discriminative regions via intra-frame spatial prompts and dynamically inserts prompt tokens between frames with high temporal variations via inter-frame temporal prompts, guiding the frozen CLIP model to focus on key spatial-temporal locations.
TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition: This paper proposes TAMT, a decoupled "pre-train, fine-tune" paradigm for cross-domain few-shot action recognition (CDFSAR). By efficiently recalibrating intermediate features of frozen models with a Temporal-Aware Adapter (TAA) and generating strong representations with Global Temporal Moment Tuning (GTMT) to capture long- and short-term temporal covariance, TAMT outperforms existing methods by 13% to 31% across multiple cross-domain scenarios while requiring 5 times lower training costs.
Temporal Alignment-Free Video Matching for Few-Shot Action Recognition: This paper proposes TEAM (TEmporal Alignment-free Matching), which aggregates video features with cross-attention using a fixed number of learnable pattern tokens. By doing so, it eliminates the dependence on predefined temporal units and brute-force alignment, achieving more flexible and efficient video matching for the FSAR task and reaching SOTA performance on multiple benchmarks.
Temporally Consistent Object-Centric Learning by Contrasting Slots: Slot Contrast proposes a novel object-level temporal contrastive loss that contrasts slot representations across videos within a batch. This significantly improves the temporal consistency of video object-centric models, outperforming even weakly supervised methods using motion masks on object discovery tasks across synthetic and real-world datasets, while effectively supporting downstream unsupervised object dynamics prediction.
Towards Universal Soccer Video Understanding: This paper constructs SoccerReplay-1988, the largest multimodal soccer dataset to date (1,988 full matches), and proposes a soccer-specific vision encoder, MatchVision. Using a spatiotemporal attention mechanism, it handles multiple tasks in a unified way, including event classification, commentary generation, and foul recognition, achieving SOTA performance on several benchmarks.
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks: This paper proposes the UTD method, which leverages VLMs and LLMs to generate textual descriptions of video frames, uses them to systematically analyze object, temporal, and commonsense biases in video benchmarks, and constructs debiased test splits to make video understanding evaluation more robust.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models: Video-Panda proposes the first encoder-free video-language model, which directly processes video inputs via a Spatio-Temporal Alignment Block (STAB) with only 45M parameters. It achieves performance comparable to methods using 300M-1.4B parameter encoders on open-ended video QA tasks, while delivering a 3-4x speedup in inference.
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously: The Video Streaming Thinking (VST) paradigm is proposed to alternate between "watching" and "thinking" during video playback. The model generates intermediate reasoning chains while receiving video frames, amortizing CoT computation into the pre-query phase. This achieves a state-of-the-art (SOTA) score of 79.5% on StreamingBench while maintaining real-time responsiveness (0.56s QA latency).
Video Summarization with Large Language Models: LLMVS proposes an LLM-based video summarization framework. It first employs a multimodal LLM to convert video frames into textual descriptions, and then uses an LLM to evaluate the local importance scores of each frame via sliding-window in-context learning. Finally, it aggregates the global context through a global self-attention mechanism to generate the final predictions, achieving SOTA performance on SumMe and TVSum.
VideoGEM: Training-Free Action Grounding in Videos: VideoGEM proposes the first training-free spatial action grounding method based on pre-trained image/video-language models. By utilizing layer weighting and prompt decomposition strategies, it outperforms existing training-based methods on four action grounding datasets.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM: VideoRefer Suite comprises a dataset (700K object-level video instruction data), a model (a spatial-temporal object encoder enabling pixel-level region understanding), and a benchmark (multi-dimensional evaluation), which together empower Video LLMs to perceive, reason about, and retrieve any object in a video at any timestamp.
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video SpatioTemporal Augmentation: Proposes the VISTA framework, which synthesizes long-duration and high-resolution video instruction data by spatiotemporally combining existing video-caption data (covering 7 augmentation methods). Training on the resulting VISTA-400K dataset yields an average improvement of 3.3% on long video understanding benchmarks. The paper also introduces HRVideoBench, the first high-resolution video understanding benchmark, on which it reports a 6.5% improvement.
ViTED: Video Temporal Evidence Distillation: ViTED proposes a framework that automatically generates temporally grounded chains of evidence, unifying evidence collection, temporal grounding, and question-answering reasoning in a single video-language model and enhancing complex video QA capabilities through evidence distillation.
VoCo-LLaMA: Towards Vision Compression with Large Language Models: This paper proposes VoCo-LLaMA, the first method that leverages the LLM's own capacity to compress vision tokens. By inserting VoCo tokens between vision and text tokens and modifying the attention mask to achieve attention distillation, it achieves a 576x compression rate with a single token while preserving 83.7% of the performance.
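
As an illustration of the diversity-based token selection summarized in the DivPrune entry above, the following is a minimal sketch of greedy farthest-point selection, a common heuristic for the Max-Min Diversity Problem. It is not the paper's implementation: the function name `max_min_diverse_subset`, the use of Euclidean distance, and seeding with the first token are all assumptions made for this example.

```python
import math


def max_min_diverse_subset(tokens, k):
    """Greedily pick k token indices so the retained set stays spread out.

    tokens: list of equal-length numeric tuples (hypothetical token features).
    Returns the indices in the order they were selected.
    """
    if k <= 0 or not tokens:
        return []
    k = min(k, len(tokens))

    selected = [0]  # assumption: seed with the first token
    # min_dist[i] = distance from token i to its nearest selected token
    min_dist = [math.dist(t, tokens[0]) for t in tokens]
    min_dist[0] = -1.0  # mark selected tokens so they are never re-picked

    while len(selected) < k:
        # Take the token farthest from everything selected so far.
        nxt = max(range(len(tokens)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist[nxt] = -1.0
        # Refresh nearest-selected distances for the remaining tokens.
        for i, t in enumerate(tokens):
            if min_dist[i] >= 0.0:
                min_dist[i] = min(min_dist[i], math.dist(t, tokens[nxt]))
    return selected


if __name__ == "__main__":
    feats = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(max_min_diverse_subset(feats, 3))
```

In this toy input, the token at `(0.1, 0.0)` is nearly a duplicate of the seed, so it is the last one the greedy rule would keep. That is the intuition behind pruning for diversity rather than for per-token importance.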