📹 Video Understanding

🔬 ICLR2026 · 24 paper notes

AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference: This paper proposes AdAEM, an adaptive and self-extensible evaluation framework for LLM values. By leveraging information-theoretic optimization, AdAEM automatically generates test questions that maximally reveal value differences across LLMs, addressing the "insufficient informativeness" limitation of existing static benchmarks that fail to distinguish models' value orientations.
A.I.R.: Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering: This paper proposes A.I.R., a training-free frame selection framework driven by adaptive, iterative reasoning. It addresses two fundamental challenges in VideoQA: inaccurate similarity estimation by lightweight models (CLIP) and the prohibitive computational cost of VLM-based analysis. The framework uses a two-stage strategy: GMM-based adaptive initial sampling followed by iterative VLM-guided refinement. In the worst case, A.I.R. analyzes only 72 frames (vs. 128 for baselines), while consistently improving performance across multiple long-video benchmarks.
AnveshanaAI: A Multimodal Platform for Adaptive AI/ML Education through Automated Question Generation and Interactive Assessment: This paper presents AnveshanaAI, an adaptive AI/ML education platform grounded in Bloom's cognitive taxonomy. The system employs fine-tuned GPT-2 for automated question generation, semantic similarity-based deduplication, explainable AI (XAI) techniques for transparency, and gamification mechanisms (points/badges/leaderboards) to deliver a personalized learning assessment system spanning seven domains from data science to multimodal AI. Experiments demonstrate a significant reduction in perplexity after fine-tuning and a notable improvement in learner engagement.
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss: This paper proposes Expert-Router Coupling (ERC) Loss, a lightweight auxiliary loss function that treats router parameter rows as proxy tokens for cluster centroids and constrains expert activation norms with respect to them, achieving tight coupling between router decisions and expert capabilities. The method requires only \(n^2\) activation computations and yields significant performance gains in MoE-LLMs.
Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading: This paper introduces a novel task of decoding open-ended information seeking goals from eye movement trajectories during reading. Built upon the OneStop eye-tracking dataset (360 participants, 486 questions, 162 passages), the authors develop both discriminative and generative multimodal models. RoBERTEye-Fixations achieves 49.3% accuracy on three-way goal selection (random baseline: 33%) and 70.9% on different-critical-span conditions; DalEye-Llama/GPT also significantly outperforms eye-movement-free baselines on goal reconstruction.
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought: This paper theoretically analyzes the training dynamics of a two-layer Transformer trained with continuous Chain-of-Thought (Coconut) on the directed graph reachability problem, revealing how a "superposition" mechanism naturally emerges: the index-matching logit first grows and then remains bounded, thereby achieving a balance between exploration and exploitation.
FlashVID: Efficient Video Large Language Models via Training-free Tree-Based Spatiotemporal Token Merging: This paper proposes FlashVID, a training-free inference acceleration framework for video large language models (VLLMs) that jointly models spatial and temporal redundancy via Tree-based Spatiotemporal Token Merging (TSTM). Retaining only 10% of visual tokens, FlashVID preserves 99.1% of LLaVA-OneVision's performance and enables a 10× increase in input frames for Qwen2.5-VL.
FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding: This paper proposes FLoC, a visual token compression framework based on the facility location function. Through submodular optimization, FLoC efficiently selects a token subset that is both representative and diverse under a given budget, enabling training-free, model-agnostic, and query-agnostic token compression for long video understanding.
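To make the facility location idea concrete, here is a minimal sketch of greedy submodular selection, not FLoC's actual implementation. It assumes tokens are plain feature vectors, uses cosine similarity clamped at zero as the similarity measure, and the function names `cosine` and `greedy_facility_location` are hypothetical. The objective is f(S) = Σ_i max_{j∈S} sim(i, j): each token is "served" by its most similar selected token, so the chosen subset must cover every cluster of tokens.

```python
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, clamped at zero so the objective stays monotone."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return max(0.0, dot / norm) if norm else 0.0


def greedy_facility_location(
    tokens: Sequence[Sequence[float]], budget: int
) -> List[int]:
    """Greedily pick up to `budget` token indices maximizing f(S)."""
    n = len(tokens)
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    # coverage[i] = similarity of token i to its closest selected token so far.
    coverage = [0.0] * n
    selected: List[int] = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: total improvement in coverage.
            gain = sum(max(sim[i][j] - coverage[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        coverage = [max(coverage[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two tight clusters of toy "tokens"; a budget of 2 should take one from each.
tokens = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
picked = greedy_facility_location(tokens, budget=2)
print(picked)
```

Because the facility location function is monotone and submodular, this greedy rule carries the classical (1 − 1/e) approximation guarantee under a cardinality budget. The sketch costs O(n²) similarity evaluations per step, so it illustrates the selection rule rather than the efficiency the paper targets.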
From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning: This paper identifies a vicious cycle between the encoder (producing sharp but noisy attention maps) and the decoder (producing spatially consistent but blurry reconstruction masks) in slot-based object-centric learning. It proposes a synergistic contrastive learning objective paired with a slot regularization warm-up strategy to convert this vicious cycle into a virtuous one, achieving substantial improvements in object discovery performance on MOVi and YouTube-VIS.
Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding: This paper introduces a new task called Category Splitting, which exploits latent compositional structure embedded in video classifier weights to decompose coarse-grained action categories into fine-grained subcategories under zero-shot conditions, without retraining or additional data.
Log Probability Tracking of LLM APIs: This paper proposes Logprob Tracking (LT), a method that detects subtle changes in LLM APIs (e.g., single-step fine-tuning) using only the log probabilities of a single-token input and single-token output. LT achieves sensitivity 2–3 orders of magnitude higher than existing methods at 1000× lower cost.
LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals: This paper proposes the LUMINA framework for detecting hallucinations in RAG systems via "context-knowledge signals": maximum mean discrepancy (MMD) measures external context utilization, while the evolution of token predictions across layers measures internal knowledge utilization, enabling hyperparameter-free generalization.
Mamba-3: Improved Sequence Modeling using State Space Principles: This paper proposes three core improvements from a state space model (SSM) perspective: exponential-trapezoidal discretization, complex-valued state spaces, and a multi-input multi-output (MIMO) formulation. These advances significantly improve model quality and state-tracking capability without increasing decoding latency, pushing the performance–efficiency Pareto frontier forward.
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs: This work presents the first systematic reverse-engineering of temporal reasoning in VideoLLMs using mechanistic interpretability tools (Attention Knockout + Logit Lens), uncovering a three-stage information flow blueprint—"early-to-mid-layer cross-frame interaction → mid-layer video-language integration → mid-to-late-layer answer generation"—and demonstrating that retaining only 42% of attention edges preserves VideoQA performance with negligible degradation.
NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks: This paper proposes NerVE, a lightweight eigenspectrum analysis framework built on four complementary metrics (Spectral Entropy, Participation Ratio, Eigenvalue Early Enrichment, and JS Divergence). It systematically reveals how FFN nonlinearities in LLMs re-inject variance and reshape the eigenspectrum, and how architectural and optimizer choices imprint distinct spectral signatures.
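Two of the metrics named above have widely used textbook forms, sketched here for illustration. This assumes the eigenvalues of an activation covariance matrix are already available as a list of non-negative numbers. NerVE's exact definitions (normalization, log base) may differ, and the function names are hypothetical.

```python
import math
from typing import Sequence


def spectral_entropy(eigenvalues: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the eigenvalues normalized to sum to 1."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    probs = [lam / total for lam in eigenvalues if lam > 0]
    return -sum(p * math.log(p) for p in probs)


def participation_ratio(eigenvalues: Sequence[float]) -> float:
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    squares = sum(lam * lam for lam in eigenvalues)
    if squares <= 0:
        raise ValueError("eigenvalues must not all be zero")
    return total * total / squares


flat = [1.0, 1.0, 1.0, 1.0]    # variance spread evenly over 4 directions
peaked = [1.0, 0.0, 0.0, 0.0]  # all variance in a single direction
print(spectral_entropy(flat), participation_ratio(flat))
print(spectral_entropy(peaked), participation_ratio(peaked))
```

Both quantities grow as variance spreads over more directions: a flat spectrum over d directions gives entropy log d and participation ratio d, while a spectrum concentrated in one direction gives entropy 0 and participation ratio 1.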
Online Time Series Prediction Using Feature Adjustment: This paper proposes ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space), which shifts the adaptation objective in online time series forecasting from model parameter updates to feature space correction. A lightweight adapter fuses current features with historical gradients to address delayed feedback in multi-step forecasting. ADAPT-Z consistently outperforms existing online learning methods across 13 datasets.
Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences: This paper presents Paper Copilot, a persistent digital archive and analysis platform for peer reviews spanning dozens of AI/ML venues. It adopts a tri-source hybrid data collection strategy (OpenReview API, web scraping, and community contributions) to archive real-time score snapshots capturing pre- and post-rebuttal dynamics. The platform reveals a structural anomaly in ICLR 2025: a counterintuitive decline in decision entropy, signaling a shift from probabilistic tiering to near-deterministic score-driven decision-making. LLM-driven author–affiliation metadata extraction further supports talent trajectory tracking.
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning: This paper proposes CAPO (Curvature-Aware Policy Optimization), which models second-order optimization geometry at the LM head's final layer to predict and filter token updates that would cause policy collapse. Under aggressive hyperparameters (5× learning rate, 1/12 batch size), CAPO maintains training stability and achieves a 30× sample efficiency improvement over standard GRPO on MATH.
Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs: This paper proposes TRACE-RPS, a unified defense framework against attribute inference attacks in LLMs. TRACE leverages attention mechanisms and reasoning chains to precisely locate privacy-leaking text elements for fine-grained anonymization, while RPS employs lightweight suffix optimization to induce the model to refuse inference, reducing attribute inference accuracy from ~50% to below 5%.
The Expressive Limits of Diagonal SSMs for State-Tracking: This paper provides a complete characterization of the expressive power of input-dependent complex diagonal (DCD) SSMs on group state-tracking tasks: a single layer cannot track any non-abelian group, and \(k\) layers can track group \(G\) if and only if \(G\) admits a subnormal series of length \(k\) with abelian factors — precisely defining the strict expressiveness gains conferred by depth. Experiments further reveal a significant gap between expressive capacity and learnability.
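As a worked instance of the depth criterion, taking the statement above at face value and assuming "length" counts the number of factors in the series: the symmetric group \(S_3\) is non-abelian, so a single layer cannot track it, but it admits the subnormal series

\[
\{e\} \triangleleft A_3 \triangleleft S_3, \qquad A_3/\{e\} \cong \mathbb{Z}_3, \qquad S_3/A_3 \cong \mathbb{Z}_2,
\]

whose two factors are both abelian, so \(k = 2\) layers suffice. By contrast, \(A_5\) is simple and non-abelian, so its only subnormal series is \(\{e\} \triangleleft A_5\), whose single factor \(A_5\) is non-abelian; under the criterion, no number of layers can track it.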
FuncBenchGen: A Contamination-Free Controllable Evaluation Framework for Reliable Benchmarking: This paper proposes FuncBenchGen, a framework that models multi-step function calling as a DAG traversal problem, enabling contamination-free and finely controllable evaluation of LLM tool-use capabilities. The framework further reveals critical failure modes of reasoning models under long call chains and connected irrelevant functions.
Video-KTR: Enhancing Video Reasoning via Key Token Attribution: This paper proposes Video-KTR, a modality-aware policy shaping framework that identifies three categories of key tokens—visual-aware, temporal-aware, and entropy-aware—via counterfactual analysis, and applies selective reinforcement learning updates exclusively to these tokens, achieving state-of-the-art performance across multiple video reasoning benchmarks (Video-Holmes 42.7%, surpassing GPT-4o).
VideoNSA: Native Sparse Attention Scales Video Understanding: This paper proposes VideoNSA, which introduces Native Sparse Attention (NSA) into video-language models. Through a mixed sparse attention mechanism combining compression, selection, and sliding window branches with dynamic gating, VideoNSA achieves 128K-token video understanding using only 3.6% of the attention budget, surpassing token compression and training-free sparse attention baselines on long video understanding, temporal reasoning, and spatial understanding tasks.
Robustness and Radioactivity of Watermarks in Federated Learning May Be at Odds: This work presents the first study on LLM watermark-based data provenance in federated learning (FL). It demonstrates that watermarks are radioactive (detectable) in FL, yet a malicious server can suppress watermark signals by employing strong robust aggregation algorithms to filter watermarked updates, revealing a fundamental trilemma among radioactivity, robustness, and model utility.