📷 CVPR2026 Paper Notes

1931 CVPR2026 paper notes covering Multimodal VLM (287), 3D Vision (252), Image Generation (239), Medical Imaging (154), Autonomous Driving (105), Segmentation (103), Video Understanding (92), Human Understanding (61), and 42 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.

🧩 Multimodal VLM

A3: Towards Advertising Aesthetic Assessment: This paper proposes the A3 framework, comprising a theory-driven three-stage advertising aesthetic assessment paradigm A3-Law (Perceptual Attention → Formal Interest → Desire Impact), a 120K-annotation dataset A3-Dataset, an SFT+GRPO aligned model A3-Align, and the evaluation benchmark A3-Bench. A3-Align surpasses existing MLLMs on automated advertising aesthetic assessment.
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks: This paper proposes a training-free, annotation-free debiasing method for VLMs that operates in cross-modal embedding spaces. Via orthogonal decomposition, it achieves a Pareto-optimal fairness–utility trade-off with a closed-form solution and provides theoretical upper bounds on utility loss.
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks: This paper proposes a closed-form debiasing method for VLMs that performs orthogonal decomposition of attribute subspaces in the cross-modal embedding space and solves via Chebyshev scalarization, achieving Pareto-optimal fairness with bounded utility loss. The approach is training-free and annotation-free, and uniformly covers three downstream tasks: zero-shot classification, text-image retrieval, and text-image generation.
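As a rough illustration of the orthogonal-decomposition idea only (not the paper's closed-form solution, which also involves Chebyshev scalarization and utility bounds), the sketch below removes the component of an embedding that lies along a single attribute direction. The toy vectors and the helper name `remove_attribute_component` are hypothetical and chosen purely for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_attribute_component(embedding, attribute_direction):
    """Project `embedding` onto the orthogonal complement of one direction.

    Hypothetical helper: a single attribute direction stands in for the
    attribute subspace mentioned in the note.
    """
    norm = math.sqrt(dot(attribute_direction, attribute_direction))
    if norm == 0.0:
        raise ValueError("attribute_direction must be non-zero")
    unit = [x / norm for x in attribute_direction]
    coeff = dot(embedding, unit)  # signed length of the attribute component
    return [e - coeff * u for e, u in zip(embedding, unit)]


# Toy values (assumed): the second axis plays the role of the attribute.
embedding = [0.6, 0.8, 0.0]
attribute_direction = [0.0, 2.0, 0.0]
debiased = remove_attribute_component(embedding, attribute_direction)
print(debiased)
```

After the projection, the debiased vector has no component along the attribute direction, while the components orthogonal to it are left unchanged.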
Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models: This paper proposes TANL (Test-time Activated Negative Labels), which dynamically evaluates the "activation degree" of negative labels on OOD samples at test time to identify the most effective negative labels. Combined with an activation-aware scoring function, TANL reduces FPR95 from 17.5% to 9.8% on the ImageNet benchmark, while remaining entirely training-free and test-time efficient.
AVR: Adaptive VLM Routing for Computer Use Agents: This paper proposes AVR, an adaptive routing framework for Computer Use Agents that combines a lightweight multimodal embedding model for action difficulty assessment, small-model logprob confidence probing, and warm agent memory injection, enabling a three-tier routing strategy (simple → small model; difficult → large model; high-risk → large model + guardrail). AVR reduces inference cost by 78% with only a 2 pp accuracy loss.
Adaptive Vision-Language Model Routing for Computer Use Agents: This paper proposes the Adaptive VLM Routing (AVR) framework, which inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. Through three mechanisms — multimodal difficulty classification, logprob confidence probing, and historical memory injection — AVR dynamically selects the most cost-efficient model for each action, reducing inference cost by up to 78% with an accuracy drop of no more than 2 percentage points.
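The three-tier routing described in the two AVR notes above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the `Action` fields, the threshold values, and the tier names are all hypothetical, and the memory-injection step is not modelled.

```python
from dataclasses import dataclass

# Tier labels (hypothetical names for the three routes in the note).
SMALL = "small"
LARGE = "large"
LARGE_GUARDED = "large+guardrail"


@dataclass
class Action:
    description: str
    difficulty: float     # assumed score in [0, 1] from a lightweight classifier
    small_logprob: float  # assumed mean token logprob from probing the small model
    high_risk: bool       # assumed flag for risky actions


def route(action, difficulty_threshold=0.5, logprob_threshold=-0.3):
    """Pick a model tier for one action; thresholds are illustrative only."""
    if action.high_risk:
        return LARGE_GUARDED
    if action.difficulty >= difficulty_threshold:
        return LARGE
    # An easy-looking action still escalates if the small model is unsure.
    if action.small_logprob < logprob_threshold:
        return LARGE
    return SMALL


click = Action("click the Save button", 0.1, -0.05, False)
form = Action("fill in a multi-page form", 0.8, -0.10, False)
unsure = Action("open an unlabeled menu", 0.2, -1.20, False)
delete = Action("delete the user account", 0.3, -0.05, True)

for a in (click, form, unsure, delete):
    print(a.description, "->", route(a))
```

Checking the risk flag first means a high-risk action always reaches the guarded large model, regardless of how easy it looks.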
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition: This paper proposes AdaptVision, which enables VLMs to autonomously determine the minimum number of visual tokens required per sample through a coarse-to-fine active visual mechanism and reinforcement learning training, combined with Decoupled Turn Policy Optimization (DTPO) to achieve an optimal trade-off between efficiency and accuracy.
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models: AGFT proposes an alignment-guided fine-tuning framework that enhances zero-shot adversarial robustness of VLMs while preserving the pre-trained cross-modal semantic structure, through text-guided adversarial training and distribution consistency calibration. The method achieves an average robust accuracy of 46.57% across 15 zero-shot benchmarks, surpassing the previous state of the art by 3.1 percentage points.
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow: This paper identifies that excessive attention from text tokens to irrelevant visual tokens is the root cause of the "see but misperceive" phenomenon in VLMs. It proposes Adaptive Information Flow (AIF), a training-free method that modulates information flow at inference time by modifying the causal mask based on token dynamic entropy, blocking irrelevant visual-to-text connections and improving perceptual performance across multiple VLMs.
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors: AnomalyVFM proposes a general framework that transforms arbitrary Vision Foundation Models (VFMs) into strong zero-shot anomaly detectors via a three-stage synthetic data generation pipeline and parameter-efficient LoRA adaptation, achieving 94.1% image-level AUROC across 9 industrial datasets with RADIO as the backbone, surpassing the previous SOTA by 3.3 percentage points.
ApET: Approximation-Error Guided Token Compression for Efficient VLMs: From an information-theoretic angle, this paper proposes a visual token importance measure based on linear-approximation reconstruction error. The method requires no attention weights, is naturally compatible with FlashAttention, and on LLaVA-1.5 retains 95.2% of the original performance after compressing away 88.9% of visual tokens.
ApET: Approximation-Error Guided Token Compression for Efficient VLMs: Grounded in information theory, ApET reconstructs each visual token via linear approximation and measures its informativeness by reconstruction error (larger error = more information = should be retained). The proposed framework is entirely independent of attention weights and fully compatible with FlashAttention. It retains 95.2% of the original accuracy at 88.9% compression on LLaVA-1.5-7B and reaches 100.4% of baseline performance on video tasks.
Asking like Socrates: Socrates helps VLMs understand remote sensing images: This paper identifies the "pseudo-reasoning" phenomenon in remote sensing VLMs—where explicit reasoning chains actually degrade performance—attributing it to the "Glance Effect" (insufficient single-pass perception). It proposes RS-EoT (Evidence-of-Thought), an iterative evidence search paradigm. A SocraticAgent self-play mechanism synthesizes reasoning trajectories for SFT cold-start, followed by two-stage progressive RL (grounding → VQA) to enhance and generalize reasoning. RS-EoT-7B achieves state-of-the-art performance across multiple remote sensing VQA and grounding benchmarks.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models: This paper introduces AV-SpeakerBench, a speaker-centric audiovisual reasoning benchmark comprising 3,212 multiple-choice questions, revealing Gemini 2.5 Pro's superiority in audiovisual fusion while exposing significant deficiencies of open-source models in speaker reasoning.
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention: This work revisits visual processing in VLA models from a POMDP perspective and proposes the AVA-VLA framework, which dynamically modulates the importance of visual tokens in the current frame based on historical context via a recurrent state and an active visual attention module, achieving state-of-the-art performance on benchmarks including LIBERO and CALVIN.
BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates: BALM proposes a model-agnostic plug-and-play framework to address multimodal learning under Imbalanced Missing Rates (IMR). It introduces a Feature Calibration Module (FCM) to align representations across different missing patterns, and a Gradient Rebalancing Module (GRM) to balance the optimization dynamics of each modality from both distributional and spatial perspectives. The framework consistently improves the robustness of various backbone networks across multiple multimodal sentiment recognition benchmarks.
Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality: This paper introduces ContrAR, the first benchmark for contradictory virtual content attacks in AR environments, comprising 312 real videos recorded on Meta Quest 3, validated by 10 annotators with an average Likert score of 4.66/5. It systematically evaluates 11 VLMs (including GPT-5/Gemini-2.5/Grok-4) on semantic contradiction detection, finding that GPT-5 achieves the highest accuracy (88.14%) but incurs a 19s latency, while GPT-4o offers the best accuracy–latency trade-off (84.62% / 7.26s). An OCR-only text baseline reaches only 56%, demonstrating that visual reasoning is indispensable.
Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition: This paper reformulates VLM zero-shot image recognition as a Bayesian framework, constructs a concept proposal distribution via an LLM-driven multi-stage concept synthesis pipeline, and employs an adaptive soft-trim likelihood to suppress the influence of outlier concepts, achieving state-of-the-art performance across 11 classification benchmarks.
Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models: By constructing the Isle-Brick-V2 benchmark using psychologically inspired controlled LEGO scenes, this work systematically exposes significant deficiencies in current VLMs' Visual Perspective Taking (VPT) capabilities—even when scene understanding is near-perfect, spatial reasoning and perspective-taking performance degrade substantially, accompanied by persistent directional biases.
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models: This paper introduces FAQ (Forensic Answer-Questioning), the first large-scale multiple-choice QA benchmark focused on temporal inconsistencies in deepfake videos (33K QA pairs, ~4,500 videos). Through a three-level progressive task hierarchy (facial perception → temporal localization → forensic reasoning), FAQ systematically enhances VLM forensic capabilities, yielding significant gains both on in-domain benchmarks and cross-dataset detection after fine-tuning (Qwen2.5-VL average accuracy improves from 21.6% to 52.4%).
Beyond the Mean: Modelling Annotation Distributions in Continuous Affect Prediction: This paper proposes a Beta distribution-based framework for modelling affective annotation consensus. The model predicts only the mean and standard deviation of the annotation distribution, from which higher-order descriptors—including skewness, kurtosis, and quantiles—are derived in closed form via moment matching. Experiments on SEWA and RECOLA demonstrate that Beta distributions effectively capture the full distributional characteristics of annotator disagreement.
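The moment-matching step in the note above can be sketched with the standard Beta distribution formulas. This is a minimal sketch, assuming annotations have been rescaled to the interval (0, 1); the function names are hypothetical. Quantiles are omitted because they require a numerical inverse of the Beta CDF, which the Python standard library does not provide.

```python
import math


def beta_from_mean_std(mean, std):
    """Recover Beta(alpha, beta) parameters from a mean and standard deviation."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    var = std * std
    max_var = mean * (1.0 - mean)  # variance must stay below this bound
    if not 0.0 < var < max_var:
        raise ValueError("std is not achievable by a Beta distribution")
    nu = max_var / var - 1.0       # nu equals alpha + beta
    return mean * nu, (1.0 - mean) * nu


def beta_skewness(a, b):
    """Skewness of Beta(a, b)."""
    return 2.0 * (b - a) * math.sqrt(a + b + 1.0) / ((a + b + 2.0) * math.sqrt(a * b))


def beta_excess_kurtosis(a, b):
    """Excess kurtosis of Beta(a, b)."""
    num = 6.0 * ((a - b) ** 2 * (a + b + 1.0) - a * b * (a + b + 2.0))
    den = a * b * (a + b + 2.0) * (a + b + 3.0)
    return num / den


# Toy prediction (assumed values): mean 0.5 and variance 0.05 give Beta(2, 2).
alpha, beta = beta_from_mean_std(0.5, math.sqrt(0.05))
print(alpha, beta, beta_skewness(alpha, beta), beta_excess_kurtosis(alpha, beta))
```

For this symmetric toy case the skewness is zero and the excess kurtosis is -6/7, up to floating-point error.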
BiCLIP: Domain Canonicalization via Structured Geometric Transformation: This paper proposes BiCLIP, a minimalist few-shot adaptation method for CLIP that applies a bilinear transformation matrix with an upper-triangular structural constraint to geometrically align image features with text embeddings, achieving state-of-the-art performance across 11 standard benchmarks with an exceptionally low parameter count.
BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment: BriMA addresses the non-stationary modality imbalance problem in multi-modal continual action quality assessment (AQA) via memory-guided bridging imputation and modality-aware replay optimization, achieving an average improvement of 6–8% in correlation coefficient and a 12–15% reduction in error across three benchmarks.
BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection: This paper proposes BUSSARD, the first learning-based scene-specific anomalous relationship detection method. It encodes scene graph triplets via pretrained language model embeddings, applies an autoencoder for dimensionality reduction, and employs normalizing flows for likelihood estimation. BUSSARD achieves approximately 10% AUROC improvement on the SARD dataset and demonstrates robustness to synonym variation.
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions: This paper constructs a synthetic counting benchmark dataset, systematically evaluates the counting capabilities of open-source VLMs under varying image and prompt conditions, and investigates mechanisms for improving counting behavior through visual attention reweighting at the decoder level.
CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment: This paper proposes CAPT, a confusion-aware prompt tuning framework that explicitly models systematic misalignment patterns in VLMs via a Semantic Confusion Miner (SEM) and a Sample Confusion Miner (SAM). A Multi-Granularity Discrepancy Expert (MGDE) further integrates confusion information across different granularities. CAPT achieves a state-of-the-art HM of 83.90% across 11 benchmarks.
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding: This paper introduces ChartNet—a 1.5-million-scale, high-quality multimodal chart dataset. A code-guided synthesis pipeline generates aligned quintuples comprising image–code–data table–text–reasoning QA. Fine-tuning on ChartNet significantly improves VLM performance on chart understanding and reasoning tasks, enabling small models to surpass GPT-4o.
Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking: This paper proposes the first circuit tracing framework for VLMs, training per-layer transcoders on Gemma-3-4B and constructing attribution graphs to reveal the hierarchical integration mechanisms underlying multimodal reasoning, visual arithmetic circuits, and the internal causes of six-finger hallucinations. The causal controllability of the discovered circuits is validated through feature steering and circuit patching.
CLIP-Free, Label-Free, Unsupervised Concept Bottleneck Models: This paper proposes TextUnlock, a method that aligns the output distribution of an arbitrary frozen visual classifier to a vision-language correspondence space, enabling the construction of a fully unsupervised Concept Bottleneck Model (U-F²-CBM) that requires no CLIP, no labels, and no trained linear probes. U-F²-CBM surpasses supervised CLIP-based CBMs across 40+ models.
Concept-wise Attention for Fine-grained Concept Bottleneck Models: CoAt-CBM achieves adaptive fine-grained image–concept alignment via learnable concept-wise visual queries and Concept Contrastive Optimization (CCO), surpassing both existing concept bottleneck models and black-box models while maintaining high interpretability.
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning: This paper proposes CodeDance, which uses executable code as a unified medium for visual reasoning. Atomic capabilities are instilled via SFT, and a difficulty-adaptive tool-calling reward (BAT) is applied during RL to enable dynamic tool orchestration and self-verification reasoning. The resulting 7B model surpasses GPT-4o on tasks such as counting, visual search, and chart QA.
CodePercept: Code-Grounded Visual STEM Perception for MLLMs: Through systematic scaling analysis, this work identifies perception—rather than reasoning—as the true bottleneck for MLLMs in STEM domains. It proposes the CodePercept paradigm, which uses executable Python code as an anchoring medium, constructs the million-scale ICC-1M dataset and the STEM2Code-Eval benchmark, and achieves significant improvements in STEM visual perception and downstream reasoning after two-stage SFT+RL training.
CodePercept: Code-Grounded Visual STEM Perception for MLLMs: Through systematic scaling analysis, this paper reveals that perception rather than reasoning is the true bottleneck of MLLMs on STEM visual tasks. It proposes a paradigm that uses executable code as a medium to enhance perceptual capability, constructs ICC-1M — a 1M-scale Image-Caption-Code triplet dataset — and introduces two training tasks: code-grounded caption generation and STEM image-to-code translation.
CoMP: Collaborative Multi-Mode Pruning for Vision-Language Models: CoMP proposes a collaborative multi-mode pruning framework that eliminates inconsistencies between parameter and token pruning metrics via a Collaborative Importance Metric (CIM), and adaptively selects the optimal pruning mode at each stage through a Multi-mode Pruning Strategy (MPS), achieving significant improvements over single-mode and naive joint pruning approaches at high pruning ratios.
Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling: This paper proposes CFC (Conditional Factuality Control), a post-hoc conformal framework that learns feature-conditional acceptance threshold functions via augmented quantile regression, providing conditional coverage guarantees (rather than merely marginal guarantees) for LLM sampled outputs. The authors further derive a PAC-style finite-sample certificate CFC-PAC, and validate the approach on synthetic data, reasoning/QA benchmarks, and VLM settings.
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation: This paper proposes SeGP-CL, which constructs adversarial anchors via dual-objective projected gradient descent to probe fragile regions at old–new semantic boundaries. Combined with Anchor-guided Cross-modal Geometry Distillation (ACGD) and Text Semantic Geometry Regularization (TSGR), SeGP-CL effectively preserves the cross-modal semantic-geometric structure of VLMs under exemplar-free conditions, substantially alleviating catastrophic forgetting.
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation: This paper proposes SeGP-CL, which constructs anchor samples at the semantic boundaries between old and new classes via adversarial PGD, and couples them with Anchor-guided Cross-modal Geometry Distillation (ACGD) and Text Semantic Geometry Regularization (TSGR) to preserve cross-modal semantic-geometric structures during VLM continual learning without requiring replay of old data, achieving state-of-the-art performance on five benchmarks.
CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models: This paper identifies the "visual preference conflict" problem in visual encoder fine-tuning within MLLMs, and proposes the CoVFT framework. By introducing Context Vector Extraction (CVE) and Context-aware Mixture of Experts (CoMoE), CoVFT achieves context-aware visual fine-tuning, attaining state-of-the-art performance across 12 multimodal benchmarks with significantly improved stability over existing methods.
CoVR-R: Reason-Aware Composed Video Retrieval: CoVR-R proposes a reasoning-first zero-shot composed video retrieval framework that leverages a large multimodal model (Qwen3-VL) to explicitly reason about the "after-effects" (state transitions, temporal phases, shot changes, etc.) implied by edit instructions. The paper further introduces the CoVR-R benchmark, comprising structured reasoning traces and hard negatives, to evaluate reasoning capability. The method substantially outperforms existing approaches in retrieval accuracy.
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning: This paper proposes a graph-based automatic data generation pipeline that constructs the CRIT dataset and benchmark for training and evaluating VLMs on cross-modal multi-hop reasoning over interleaved image-text content. Models fine-tuned on CRIT achieve significant improvements on multiple benchmarks including SPIQA.
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception: This paper proposes CropVLM — a lightweight 256M-parameter cropping network trained via GRPO reinforcement learning (without manual bounding box annotations) that dynamically selects the most informative image regions for VLMs to focus on, enabling plug-and-play integration with both open-source and commercial VLMs to improve fine-grained visual understanding.
CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods: This paper proposes CrossHOI-Bench, the first unified multiple-choice HOI benchmark for evaluating both VLMs and HOI-specific models. Through carefully curated positive and negative examples that eliminate erroneous penalties from incomplete annotations, the benchmark reveals that large VLMs under zero-shot settings surpass state-of-the-art HOI methods by +5.18% in Instance-F1, while still exhibiting systematic weaknesses in multi-action recognition and cross-person action attribution.
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens: This paper proposes CubiD, the first model to perform discrete diffusion generation over high-dimensional representation tokens (768-dim). By conducting fine-grained mask prediction over an \(h \times w \times d\) cubic tensor, CubiD achieves high-quality image generation while preserving visual understanding capability.
Customized Visual Storytelling with Unified Multimodal LLMs: This paper proposes the VstoryGen framework and its core component CustFilmer, which leverages a unified multimodal large language model (UMLLM) to enable customized multimodal story generation with joint conditioning on text descriptions, character/scene reference images, and shot types. Two new benchmarks, MSB and M2SB, are also introduced.
DC-Merge: Improving Model Merging with Directional Consistency: DC-Merge identifies that the key to effective model merging lies in maintaining directional consistency in singular space between the merged multi-task vector and the original single-task vectors. By combining singular value smoothing with shared orthogonal subspace projection, DC-Merge achieves state-of-the-art merging performance on both Vision and Vision-Language tasks.
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles: This paper proposes DeAR, which uses a Concept Entropy metric to decompose the deep-layer attention heads of ViT into three functional roles—attribute heads, generalization heads, and mixed heads—and designs a role-based attention masking mechanism to precisely control information flow, achieving the best balance between task adaptation and zero-shot generalization across 15 datasets.
Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation: This paper proposes DASP, which diagnoses biased modalities via a redundancy score and applies an asymmetric adaptation strategy to decouple stability and plasticity, addressing negative transfer and catastrophic forgetting in multi-modal test-time adaptation.
Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification: This paper presents the first systematic evaluation of demographic fairness in face verification across 9 open-source MLLMs, measuring gender and ethnicity bias on the IJB-C and RFW benchmarks using 4 FMR-based fairness metrics, and finds that bias patterns in MLLMs differ substantially from those in traditional face recognition systems.
Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models: This paper identifies an overlooked "Narrow Policy" bottleneck in driving VLA models—over-exploitation during the IL phase causes exploration collapse, which in turn constrains the RL phase. The proposed Curious-VLA framework achieves SOTA on Navsim (PDMS 90.3, Best-of-N 94.8) via feasible trajectory expansion and diversity-aware RL.
Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection: The paper proposes CARE, a framework that first applies causal mediation analysis to precisely localize neurons and layers causally associated with unsafe behavior in VLMs (diagnosis), then constructs a dual-modal safety subspace via generalized eigendecomposition and projects activations onto it at inference time (repair), reducing attack success rates to below 10% with negligible loss of general capability.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs: This paper proposes DACO, a framework that constructs a multimodal concept dictionary of 15,000 concepts from WordNet and CC-3M, and combines it with sparse autoencoders (SAE) to achieve fine-grained concept control over frozen MLLM activation spaces, significantly improving safety across multiple benchmarks while preserving general capability.
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement: This paper proposes HRNet, which learns clean shared representations via cross-scale disentanglement and adaptive projection (CDAP), and jointly predicts rigid and non-rigid transformations in a unified coarse-to-fine pipeline without iteration, achieving state-of-the-art performance on four multimodal datasets.
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks: This paper presents the first systematic study of model inversion (MI) attacks against VLMs. It proposes SMI-AW, a sequence-level inversion method based on adaptive token attention weighting, which dynamically weights token gradients according to their visual relevance to reconstruct private training images from VLMs. The method achieves a human-evaluated attack accuracy of 61.21%.
Do Vision Language Models Need to Process Image Tokens?: This paper systematically demonstrates that image token representations in VLMs stabilize in shallow layers and become functionally interchangeable across deeper layers, while text token representations undergo continuous dynamic reconstruction — the necessity of deep image processing is highly dependent on the output task type.
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding: DocSeeker achieves structured reasoning and evidence grounding in long document understanding via an ALR (Analyze–Locate–Reason) visual reasoning paradigm combined with two-stage training (SFT + EviGRPO). The model is trained exclusively on short documents yet generalizes robustly to documents of extreme length.
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small VLMs: This paper systematically investigates the effect of LLM downscaling on multimodal capabilities, finding that vision-dependent tasks, rather than LLM-intrinsic tasks, suffer the most, and that perception degradation is as severe as reasoning degradation. The proposed Extract+Think method (visual extraction tuning + step-by-step reasoning) uses a 0.6B perception module and a 1.7B reasoning module to outperform PrismCaptioner and LLaVA-OneVision-0.5B, which are up to 12× larger.
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing: DSCA decomposes the VLM representation space into a set of orthogonal semantic subspaces and performs gated residual interventions within each subspace for knowledge editing, achieving >95% editing success rate with near-zero forgetting after 1,000 sequential edits.
DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions: This paper introduces the DSERT-RoLL driving dataset, the first to integrate six sensor modalities — stereo event cameras, RGB, thermal imaging, 4D radar, and dual LiDAR — covering diverse weather and lighting conditions, along with a unified multi-modal 3D detection fusion framework.
DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference: This paper proposes DUET-VLM, a dual-stage visual token compression framework. Stage 1 operates within the visual encoder: dominant tokens are selected via V2V self-attention, and remaining tokens are merged into contextual tokens through attention-guided local cluster aggregation. Stage 2 operates within the LLM, progressively pruning visual tokens via T2V cross-attention across multiple layers. On LLaVA-1.5-7B, DUET-VLM achieves 67% token compression while retaining 99%+ accuracy, and 89% compression while retaining 97%+ accuracy, with a 31% reduction in training time.
DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference: DUET-VLM proposes a dual-stage visual token compression framework: the first stage (V2V) merges redundant tokens into compact, information-preserving representations via local cluster aggregation on the vision encoder side; the second stage (T2V) progressively discards low-information tokens through text-guided hierarchical adaptive pruning on the language backbone side. On LLaVA-1.5-7B, 67% compression retains 99% accuracy and 89% compression retains 97% accuracy.
Dynamic Token Reweighting for Robust Vision-Language Models: This paper proposes DTR (Dynamic Token Reweighting), the first inference-time defense against multimodal jailbreak attacks that operates by optimizing the KV cache of VLMs. DTR introduces the concept of "Reversal Safety-Relevant Shift" (RSS) to identify visual tokens responsible for safety degradation, dynamically adjusts their weights to restore the model's safety alignment, and preserves benign task performance.
DTR: Dynamic Token Reweighting for Robust Vision-Language Models: DTR is the first method to defend against multimodal jailbreak attacks via KV cache optimization. It identifies adversarial visual tokens using a Reversal Safety-Relevant Shift (RSS) and suppresses their influence through dynamic reweighting. With only 4 optimization steps and without relying on image-to-text conversion, DTR substantially reduces attack success rates (HADES S+T+A: 56.9%→15.9%) while preserving VLM performance and inference efficiency.
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs: This paper proposes DynamicGTR, a framework that dynamically routes each query at inference time to the optimal graph topology representation (GTR, 8 variants spanning visual and textual modalities), substantially improving VLM performance on zero-shot graph algorithm QA, with transferability to real-world tasks such as link prediction and node classification.
EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval: EagleNet constructs a text-frame relational graph and employs a relational graph attention network to learn fine-grained text-frame and frame-frame relationships, generating enhanced text embeddings enriched with video contextual information. An energy-based matching mechanism is further introduced to capture the distribution of ground-truth text-video pairs. The method achieves state-of-the-art performance on four benchmark datasets.
EBMC: Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis: This paper proposes EBMC, a two-stage framework that first improves the representation quality of weak modalities via semantic disentanglement and cross-modal enhancement, then achieves balanced multimodal sentiment analysis through energy-guided modality coordination and instance-aware trust distillation, maintaining strong robustness under missing-modality scenarios.
Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs: This paper proposes the AGDI framework for black-box copyright tracking in MLLMs via adversarially optimized trigger images. A dual injection mechanism simultaneously injects copyright information at the response level (CE loss driving an auxiliary model to produce a target answer) and the semantic level (minimizing cosine distance between the trigger image and target text in CLIP space). An adversarial training scheme simulates fine-tuning resistance. AGDI consistently outperforms PLA and RNA baselines on Qwen2-VL and LLaVA-1.5.
Efficient Document Parsing via Parallel Token Prediction: This paper proposes PTP (Parallel Token Prediction), a model-agnostic plug-and-play acceleration method that enables parallel multi-token prediction by inserting learnable register tokens into training sequences, achieving 1.6×–2.2× throughput gains on OmniDocBench without accuracy loss.
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs: This paper proposes EgoMind, a CoT framework that requires no geometric priors. Through two core components—Role-Play Caption (RPC) and Progressive Spatial Analysis (PSA)—it achieves competitive multi-frame spatial reasoning using only 5K SFT and 20K RL samples.
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models: This paper proposes EMO-R3, which guides MLLMs to perform step-by-step emotional reasoning via Structured Emotional Thinking (SET), and introduces a Reflective Emotional Reward (RER) that prompts the model to re-evaluate the visual-textual consistency and emotional coherence of its reasoning, substantially improving both interpretability and accuracy in multimodal affective understanding.
EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis: This paper introduces EmoVerse — the first large-scale interpretable visual emotion dataset (219K+ images) covering both CES (Mikels 8-class discrete emotions) and DES (1024-dimensional continuous emotion space). It proposes a B-A-S (Background-Attribute-Subject) triplet knowledge graph annotation scheme and an Annotation & Verification Pipeline (Gemini/GPT-4o + EmoViT + CoT Critic Agent), and fine-tunes Qwen2.5-VL-3B to perform 1024-dimensional DES projection and emotion attribution explanation.
EmoVerse: A MLLMs-Driven Emotion Representation Dataset for Interpretable Visual Emotion Analysis: This paper proposes EmoVerse, a 219K-scale visual emotion dataset that achieves word-level and subject-level emotion attribution via knowledge graph-inspired Background-Attribute-Subject triplets. It provides dual emotion annotations in both discrete CES and continuous 1024-dimensional DES spaces, accompanied by a multi-stage annotation validation pipeline and an interpretable emotion model based on Qwen2.5-VL.
Empowering Semantic-Sensitive Underwater Image Enhancement with VLM: This paper proposes a plug-and-play strategy (-SS) that leverages VLMs to generate semantic guidance maps. Through a dual-guidance mechanism comprising cross-attention injection and a semantic alignment loss, the approach directs underwater image enhancement models to focus on semantically critical regions during restoration, yielding significant improvements in perceptual quality as well as downstream detection and segmentation performance.
Empowering Semantic-Sensitive Underwater Image Enhancement with VLM: This paper proposes a VLM-driven semantic-sensitive learning strategy that leverages LLaVA to generate object descriptions, BLIP to construct spatial semantic guidance maps, and a dual-guidance mechanism (cross-attention injection + semantic alignment loss) to steer the UIE decoder during reconstruction. The approach yields consistent improvements in both perceptual quality and downstream detection/segmentation performance.
ENC-Bench: A Benchmark for Evaluating MLLMs in Electronic Navigational Chart Understanding: This paper introduces ENC-Bench, the first professional-grade benchmark for Electronic Navigational Chart (ENC) understanding, comprising 20,490 samples organized under a three-level hierarchical evaluation framework (Perception → Spatial Reasoning → Maritime Decision-Making). Systematic evaluation of 10 MLLMs reveals that the best-performing model achieves only 47.88% accuracy, exposing a critical capability gap of general-purpose models in safety-critical specialized domains.
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards: This paper proposes EvoLMM, a fully unsupervised self-evolving framework that instantiates two roles from a single backbone LMM: a Proposer (generating visual questions) and a Solver (producing multiple answers). By replacing discrete majority voting with continuous self-consistency rewards, the model improves multimodal mathematical reasoning using only raw images (ChartQA +2.7%, MathVista +2.1%).
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards: This paper proposes EvoLMM, a fully unsupervised self-evolving framework that derives a Proposer (generating image-grounded questions) and a Solver (answering those questions) from a single LMM. A continuous self-consistency reward — replacing discrete majority voting — forms a closed-loop training signal. Using only raw images (no annotations, no external reward models), EvoLMM achieves consistent gains of approximately 2–3% across eight multimodal mathematical reasoning benchmarks.
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory: This paper proposes MM-SafetyBench++ and the EchoSafe framework, which accumulates safety insights by maintaining a self-reflective memory bank at inference time, enabling MLLMs to distinguish visually similar scenarios with different safety intents based on context—improving contextual safety without any training.
EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models: EvoPrompt addresses catastrophic forgetting and modality bias in VLM prompt learning via a trajectory-aware prompt evolution strategy — comprising unified embedding projection, direction–magnitude decoupled training, and feature geometric regularization — achieving state-of-the-art performance across few-shot, cross-dataset, and domain generalization benchmarks while preserving zero-shot capability.
Evolving Prompt Adaptation for Vision-Language Models: This paper proposes EvoPrompt, a framework that treats prompt training as a progressive evolution from general semantic anchors to task-specific features. It introduces a Modal-shared Prompt Projector (MPP) for unified cross-layer and cross-modal prompt generation, an evolution trajectory-aware strategy (direction–magnitude decoupling with historical direction freezing) to prevent forgetting, and Feature Geometry Regularization (FGR) to prevent representation collapse. EvoPrompt achieves an average HM of 80.73% on base-to-novel generalization across 11 datasets, surpassing all existing prompt learning methods.
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration: This paper proposes the LMEE benchmark and the MemoryExplorer framework, which jointly evaluate the process and outcome of embodied exploration by unifying multi-object navigation with memory-based question answering. By fine-tuning an MLLM via reinforcement learning to actively invoke memory retrieval tools, the method achieves an SR of 23.53% on LMEE-Bench (surpassing 3D-Mem's 16.91%) and an SR of 46.40% on GOAT-Bench.
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Models: This paper proposes FairLLaVA, a parameter-efficient fairness-aware fine-tuning method that eliminates demographic shortcuts in multimodal large language models by minimizing the mutual information between hidden states and demographic attributes, significantly narrowing inter-group performance gaps in chest X-ray report generation and skin lesion question answering.
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment: This paper proposes FALCON, a learning-based mini-batch construction strategy that employs a negative mining scheduler to adaptively balance the trade-off between hard negatives and false negatives, substantially improving cross-modal alignment quality in vision-language pretraining (VLP).
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients: This paper proposes Quantization-aware Integrated Gradients (QIG), advancing sensitivity analysis for LVLM quantization from the modality level to the token level. By leveraging axiomatic attribution principles, QIG precisely quantifies each token's contribution to quantization error, achieving significant accuracy improvements under W4A8 and W3A16 settings with negligible additional computational overhead.
FINER: MLLMs Hallucinate under Fine-grained Negative Queries: This paper shows that MLLM hallucination rates rise sharply under fine-grained negative queries (queries involving multiple objects/attributes/relations with only one subtle error). It proposes the FINER benchmark and FINER-Tuning (based on DPO), which achieves up to a 24.2% improvement on InternVL3.5-14B.
FlashCache: Frequency-Domain-Guided Outlier-KV-Aware Multimodal KV Cache Compression: This paper proposes FlashCache, the first method to analyze the importance distribution of multimodal KV Cache from a frequency-domain perspective. It discovers that KV pairs deviating from low-frequency principal components—termed "outlier KVs"—encode features critical for inference. By identifying outlier KVs via DCT low-pass filtering and prioritizing their retention alongside dynamic per-layer budget allocation, FlashCache achieves 1.69× decoding speedup under 80% KV memory compression with negligible task performance degradation, while being natively compatible with FlashAttention.
FlowComposer: Composable Flows for Compositional Zero-Shot Learning: FlowComposer is the first work to introduce Flow Matching into Compositional Zero-Shot Learning (CZSL). It learns two primitive flows—an attribute flow and an object flow—to transport visual features into their corresponding text embedding spaces, and employs a learnable Composer to explicitly combine velocity fields into a compositional flow. A leakage-guided augmentation strategy further converts imperfect feature disentanglement into auxiliary supervision signals. As a plug-and-play module, FlowComposer consistently improves CZSL performance across three benchmarks.
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching VLA Models: FlowHijack is the first systematic backdoor attack framework targeting the vector field dynamics of flow-matching VLA models. It achieves high attack success rates and behavioral stealthiness via a τ-conditional injection strategy and a dynamic imitation regularizer.
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy: This paper proposes FluoCLIP, a two-stage vision-language framework that first performs stain-grounding to enable CLIP to learn the semantics of fluorescence stains, then conducts stain-guided ranking for stain-aware focus quality assessment. The paper also introduces FluoMix, the first multi-stain tissue-level fluorescence microscopy dataset for FQA.
PinPoint: Focus, Don't Prune — Identifying Instruction-Relevant Regions for Information-Rich Image Understanding: This paper proposes PinPoint, a two-stage framework that first localizes instruction-relevant image regions via Instruction-Region Alignment, then re-encodes the selected regions at fine granularity, achieving higher VQA accuracy with fewer visual tokens.
From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing: This paper proposes TAR-FAS, the first framework to reformulate Face Anti-Spoofing (FAS) as a Chain-of-Thought with Visual Tools (CoT-VT) paradigm. It enables MLLMs to adaptively invoke external visual tools (LBP/FFT/HOG, etc.) during inference, upgrading from "intuitive judgment" to "fine-grained investigation", and achieves SOTA on the 1-to-11 cross-domain protocol.
From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Im: This paper argues that existing image tampering detection benchmarks rely on coarse mask annotations that are severely misaligned with actual edit signals. It proposes PIXAR, a pixel-level, semantically-aware tampering detection benchmark containing 420K+ image pairs, along with a new training framework and evaluation metrics, substantially outperforming existing methods in precise localization and semantic understanding.
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings: This paper proposes LAPS (Latent Action-based Primitive Segmentation), a pipeline that defines a "Latent Action Energy" metric in the latent action space to discover and segment semantic action primitives from unannotated industrial video streams without supervision, providing structured data for VLA model pre-training.
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval: This paper proposes G-MIXER, a training-free zero-shot composed image retrieval method that achieves state-of-the-art performance via geodesic mixup-based implicit semantic expansion (expanding the retrieval scope along multiple interpolation ratios on the hypersphere) and explicit semantic re-ranking (filtering noisy candidates using MLLM-generated attributes).
GACD: Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection: The GACD framework estimates the contribution of each token (visual, textual, or output) to the current prediction via a first-order Taylor gradient, and uses these estimates to simultaneously mitigate text-visual bias (by amplifying visual token influence) and co-occurrence bias (by suppressing visual tokens anchored to previously generated objects). It achieves an 8% improvement in overall AMBER score and an 8% gain in POPE F1, without requiring training or auxiliary models.
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning: This paper proposes GAR-SSL, a training-free sound source localization (SSL) framework that reframes SSL as a three-stage metacognitive reasoning process—Generate, Analyze, and Refine—leveraging the intrinsic reasoning capabilities of MLLMs via prompt engineering alone. The method achieves performance comparable to or surpassing supervised approaches on both single-source and multi-source localization benchmarks.
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning: This paper proposes GraphVLM, a benchmark that systematically evaluates VLMs in three roles for multimodal graph learning (MMGL): VLM-as-Encoder (enhancing GNN features), VLM-as-Aligner (bridging modalities for LLM-based reasoning), and VLM-as-Predictor (serving directly as the graph learning backbone). Experiments across six datasets demonstrate that VLM-as-Predictor consistently achieves the best performance, revealing the substantial potential of VLMs as a new foundation for MMGL.
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning: This work proposes the GraphVLM benchmark, which systematically evaluates VLMs across three roles in multimodal graph learning (Encoder / Aligner / Predictor). The VLM-as-Predictor paradigm consistently achieves the best performance, revealing the substantial potential of VLMs as backbones for multimodal graph reasoning.
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding: This paper proposes GroundVTS, a query-guided fine-grained visual token sampling architecture for video large language models, which adaptively preserves spatiotemporally relevant information at the token level. It achieves an 18.4-point mIoU improvement on Charades-STA and a 20.6-point mAP improvement on QVHighlights.
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training: This paper proposes GTR-Turbo, a framework that merges historical checkpoints generated during RL training to serve as a free teacher model. Without relying on expensive external API models, GTR-Turbo achieves performance comparable to or better than GTR in multi-turn visual agent training, while reducing training time by 50% and computational cost by 60%.
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training: This paper proposes GTR-Turbo, which generates a "free teacher model" by merging historical checkpoints produced during RL training via TIES, and uses this teacher to guide subsequent training (via SFT or KL distillation). GTR-Turbo matches or surpasses GTR—which relies on external teachers such as GPT-4o—across multiple visual agent benchmarks, while reducing training time by 50% and computational cost by 60%.
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks: This paper proposes GUIDE, a benchmark comprising 67.5 hours of screen recordings and think-aloud annotations from 120 novice users across 10 software applications. It defines three hierarchical tasks—behavioral state detection, intent prediction, and assistance prediction—and finds that current state-of-the-art multimodal models show limited capability in understanding user behavior and judging assistance needs (behavioral detection accuracy of only 44.6%), while providing structured user context substantially improves performance (up to +50.2pp on assistance prediction).
HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding: This paper proposes the HAMMER framework, which extracts contact-aware intention embeddings from an MLLM, enhances point cloud features via hierarchical cross-modal fusion, and injects 3D spatial information into the intention embeddings through a multi-granular geometry lifting module. The framework achieves interaction-image-based 3D affordance grounding and comprehensively outperforms existing methods on the PIAD benchmark.
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models: This paper introduces HandVQA — a large-scale diagnostic benchmark containing over 1.6 million multiple-choice questions, automatically generated from 3D hand joint annotations. The benchmark covers joint angles, distances, and relative positions, and systematically exposes severe deficiencies of current VLMs in fine-grained hand spatial reasoning. The paper further demonstrates that models fine-tuned on HandVQA can zero-shot transfer to downstream tasks such as gesture recognition (+10.33%) and hand-object interaction recognition (+2.63%).
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models: This paper proposes HAWK, a head importance-aware visual token pruning method that computes per-head contribution weights to visual understanding offline and dynamically evaluates each visual token's importance via text-guided attention scores. On Qwen2.5-VL, HAWK retains 96.0% of original performance after pruning 80.2% of visual tokens while reducing inference latency by 26%.
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models: This paper proposes HiF-VLA, a framework that uses Motion Vectors (MV) as compact temporal primitives to unify three temporal reasoning capabilities—Hindsight, Insight, and Foresight—enabling bidirectional temporal extension of VLA models. HiF-VLA substantially outperforms baselines on long-horizon manipulation tasks with minimal computational overhead.
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks: By precisely decomposing the attention formula to reveal the mathematical essence of the ICL effect (a dynamic mixture of standard attention output and demonstration value matrices), this paper proposes HiFICL—which directly parameterizes the source of ICL via learnable low-rank virtual key-value pairs rather than approximating its effect—achieving comprehensive improvements over existing ICL approximation methods on multimodal benchmarks with only 2.2M parameters.
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks: HiFICL reframes the ICL approximation problem through rigorous attention formula derivation — shifting from "fitting a shift vector" to "directly parameterizing the source of ICL" — by injecting learnable low-rank virtual key-value pairs into attention heads. Trained end-to-end, this yields a dynamic, context-aware parameter-efficient fine-tuning method that surpasses existing ICL approximation methods and LoRA on multiple multimodal benchmarks with significantly fewer parameters.
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models: HiSpatial decomposes 3D spatial intelligence into four cognitive levels (geometric perception → object attributes → inter-object relations → abstract reasoning), constructs an automated data pipeline processing ~5M images, 45M objects, and 2B QA pairs, and designs an RGB-D VLM that takes metric-scale point cloud maps as auxiliary input. With only 3B parameters, it surpasses GPT-5 and Gemini-2.5-Pro on multiple spatial reasoning benchmarks.
HIVE: Query, Hypothesize, Verify — An LLM Framework for Multimodal Reasoning-Intensive Retrieval: HIVE is a plug-and-play multimodal retrieval framework that improves nDCG@10 from 27.6 (best multimodal model) to 41.7 (+14.1 absolute points) on reasoning-intensive multimodal retrieval through four stages — initial retrieval → LLM-driven compensatory query synthesis (explicitly expressing visual reasoning gaps) → secondary retrieval → LLM verification reranking — without any additional training.
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models: This paper proposes HOG-Layout, a hierarchical framework for 3D indoor scene generation, optimization, and editing based on VLMs and LLMs. It outperforms LayoutVLM on SceneEval while running 4.5× faster, through RAG-enhanced semantic consistency and force-directed hierarchical optimization for physical plausibility.
HoneyBee: Data Recipes for Vision-Language Reasoners: This work systematically investigates the principles underlying the construction of vision-language reasoning datasets—covering context source strategies, data interventions (image caption auxiliary signals and text-only reasoning), and multi-dimensional data scaling—and uses these insights to build HoneyBee, a 2.5M-sample CoT reasoning dataset. A 3B VLM trained on HoneyBee surpasses the prior SOTA by 7.8% on MathVerse, while a proposed test-time scaling strategy reduces decoding cost by 73%.
HoneyBee: Data Recipes for Vision-Language Reasoners: This paper systematically investigates the design space of VL reasoning training data—covering data source selection, intervention strategy filtering, and three-dimensional scaling across images, questions, and CoTs. Based on the resulting insights, the authors construct the HoneyBee dataset with 2.5M samples. A 3B VLM trained on HoneyBee surpasses the previous SOTA on MathVerse by 7.8pp, and a shared caption decoding strategy for test-time scaling reduces token consumption by 73%.
HouseMind: Tokenization Allows MLLMs to Understand, Generate and Edit Architectural Floor Plans: This paper proposes HouseMind, a framework that discretizes architectural floor plans into structured sequences of contour tokens and room instance tokens via a hierarchical VQ-VAE. Combined with three-stage multimodal alignment and instruction fine-tuning, with Qwen3-0.6B as the backbone, HouseMind achieves unified modeling of floor plan understanding, generation, and editing, substantially outperforming existing methods in geometric validity and controllability.
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models: This paper proposes HulluEdit, a single-pass, reference-free subspace editing framework that decomposes hidden states into three orthogonal subspaces—a visual evidence subspace, a conflicting prior subspace, and a residual uncertainty subspace—to selectively suppress hallucination patterns without interfering with visual grounding, achieving state-of-the-art hallucination mitigation on the POPE and CHAIR benchmarks.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks: This paper introduces HumanVBench, a human-centric video understanding benchmark comprising 16 fine-grained tasks, accompanied by two automated pipelines (video annotation and distractor-aware QA synthesis). Evaluation of 30 mainstream video MLLMs reveals critical deficiencies in current models regarding nuanced emotion perception and speech-visual alignment.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks: This paper presents HumanVBench, a video benchmark comprising 16 fine-grained tasks, systematically evaluating the human-centric video understanding capabilities of MLLMs via two automated pipelines (video annotation and distractor generation). The benchmark reveals significant deficiencies in current models regarding emotion perception and speech-visual alignment.
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding: This paper proposes IAG, the first multi-target backdoor attack method against VLM-based visual grounding. By employing a text-conditioned U-Net to dynamically generate input-aware triggers, IAG embeds the semantic information of any attacker-specified target object into the visual input, achieving the highest attack success rate in 11 out of 12 evaluated settings.
Interpretable Debiasing of Vision-Language Models for Social Fairness: This paper proposes DeBiasLens, which trains a Sparse Autoencoder (SAE) on VLM encoders to localize "social neurons" encoding social attributes, then selectively deactivates these neurons at inference time to mitigate bias. The method reduces Max Skew by 9–16% on CLIP and reduces gender bias rates by 40–50% on InternVL2, while preserving general performance.
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment: IsoCLIP provides a theoretical analysis of the CLIP projection head structure, revealing that the cosine similarity computation implicitly contains an inter-modal operator \(\Psi = W_i^\top W_t\) responsible for cross-modal alignment, and an intra-modal operator \(\Psi_i = W_i^\top W_i\) responsible solely for normalization without promoting intra-modal alignment. By applying singular value decomposition to \(\Psi\), the method identifies an approximately isotropic alignment subspace and, by removing anisotropic directions, significantly improves intra-modal retrieval and classification performance without any training.
It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models: This paper reveals that state-of-the-art VLMs still fail to reliably read analog clocks in real-world scenes (zero-shot accuracy below 10%), and proposes TickTockVQA, a real-world dataset of 12K images, along with a Swap-DPO fine-tuning framework that improves Llama-3.2-11B's time-reading accuracy from 1.43% to 46.22%.
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild: This paper proposes the JALA framework, which constructs a unified latent action space via joint alignment between predictive embeddings and latent actions inferred by an inverse dynamics model, enabling VLAs to learn simultaneously from labeled data and unlabeled in-the-wild human videos. Combined with the 7.5M-sample UniHand-Mix dataset, JALA significantly improves the generalization of robot manipulation policies.
KEC: Hierarchical Textual Knowledge for Enhanced Image Clustering: KEC uses LLMs to construct hierarchical, concept-attribute structured textual knowledge that guides image clustering. Without any training, it outperforms zero-shot CLIP on 14 out of 20 datasets, demonstrating that discriminative attributes are more effective than simple class names.
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing: This paper proposes KVSmooth, a training-free plug-and-play inference-time method that applies adaptive exponential moving average (EMA) smoothing to KV-Cache guided by attention row entropy, effectively suppressing semantic drift and hallucination generation caused by sink tokens during decoding in multimodal large language models (MLLMs). On LLaVA-1.5, CHAIR_S is reduced from 41.8 to 18.2 (a 56% reduction), while F1 improves from 77.5 to 79.2.
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing: KVSmooth is a training-free, plug-and-play method that applies attention row entropy-guided adaptive EMA smoothing to the KV-Cache, reducing LLaVA-1.5's CHAIR_S from 41.8 to 18.2 (a 56% reduction) while simultaneously improving F1 from 77.5 to 79.2, achieving gains in both precision and recall.
Label-Free Cross-Task LoRA Merging with Null-Space Compression: Motivated by the observation that the null-space ratio of the down-projection matrix \(\mathbf{A}\) decreases during LoRA fine-tuning and is strongly correlated with task performance, this paper proposes NSC Merging — a label-free, task-agnostic LoRA merging method that achieves state-of-the-art results across 20 heterogeneous vision tasks, 6 NLI tasks, and VLM benchmarks.
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection: This paper proposes the PROGRESS framework, which dynamically selects the most informative training samples by tracking a VLM's learning progress across automatically discovered multimodal concept clusters. Using only 16–20% of annotated data, PROGRESS achieves 99–100% of full-data performance with shorter total training time.
LFPC: Learning to Focus and Precise Cropping for MLLMs: LFPC is a two-stage pure reinforcement learning framework that addresses the spurious tool-calling behavior ("answer-before-crop") observed in existing agent-based MLLMs. It introduces an information gap mechanism, which deliberately downsamples the global image to force the model to rely on high-resolution cropped regions, and a grounding loss to improve cropping precision, achieving state-of-the-art performance on high-resolution VQA benchmarks.
Linking Perception, Confidence and Accuracy in MLLMs: This paper reveals a severe confidence miscalibration problem in MLLMs—accuracy drops sharply when visual inputs are degraded while confidence remains unchanged—and proposes CDRL (Confidence-Driven Reinforcement Learning with clean-noisy image pairs) for perception-sensitive training. The calibrated confidence is then leveraged for adaptive test-time scaling via CA-TTS, achieving an average improvement of 8.8% across four benchmarks.
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models: Addressing three core challenges in multimodal multi-turn VLM dialogues—concealed malicious intent, cumulative contextual risk, and cross-modal joint risk—this work constructs the MMDS dataset (4,484 annotated dialogues) and the MCTS-based MMRT red-teaming framework, and proposes the LLaVAShield auditing model, achieving F1 scores of 95.71%/92.24% on the user/assistant sides respectively, substantially outperforming baselines such as GPT-5-mini.
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models: This paper proposes LLaVAShield — the first content moderation model designed for multimodal multi-turn dialogues — along with the MMDS dataset (4,484 dialogues covering 8 major categories and 60 subcategories of risk) and MMRT, an automated MCTS-based red-teaming framework. LLaVAShield substantially outperforms baselines such as GPT-5-mini on safety auditing of both user and assistant turns.
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models: Inspired by foveal encoding and cortical magnification in the human visual system, this paper proposes LLMind, a training-free adaptive sampling framework that leverages Möbius transformations for non-uniform pixel allocation. A closed-loop semantic feedback mechanism optimizes sampling parameters at test time, achieving substantial improvements over uniform sampling under tight pixel budgets of only 1%–5%.
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation: This paper proposes LTS-FS (Locate-Then-Sparsify for Feature Steering), a framework that employs causal intervention-based attribution to identify hallucination-relevant layers and applies layer-wise sparse control over feature steering intensity according to attribution scores, effectively mitigating hallucinations in LVLMs while preserving generalization capability.
MA-Bench: Towards Fine-grained Micro-Action Understanding: This paper proposes MA-Bench, a micro-action understanding benchmark comprising 1,000 videos and 12,000 structured QA pairs. It introduces a three-tier "Perception–Comprehension–Reasoning" evaluation architecture to systematically assess fine-grained micro-action understanding across 23 MLLMs, and constructs a 20.5K training corpus, MA-Bench-Train, to support model fine-tuning and improvement.
MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures: MarkushGrapher-2 proposes an end-to-end multimodal chemical structure recognition model that jointly encodes image, text, and layout information via a dedicated chemical OCR module. Combined with a two-stage training strategy (first adapting to OCSR features, then integrating multimodal encoding), the model substantially outperforms existing methods on Markush structure recognition (M2S accuracy 56% vs. 38%), while remaining competitive on standard molecular structure recognition.
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models: This paper identifies a "smoothing misalignment" problem that arises when channel-wise smooth quantization methods (e.g., SmoothQuant) are directly applied to MLLMs—the large discrepancy in activation magnitudes across modalities causes non-dominant modalities to be over-smoothed. MASQuant is proposed to address this via modality-aware smoothing factors and SVD whitening-based cross-modal low-rank compensation.
Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning: This paper constructs D-Negation, the first visual grounding dataset with paired positive/negative semantic descriptions (14K images, 140K annotations), and proposes Grouped Opposition-Based Learning (GOBL), an efficient fine-tuning mechanism with two opposition-based loss functions—PNC and TSO. By tuning fewer than 10% of model parameters, GOBL improves Grounding DINO and APE by up to 5.7 mAP on negation-semantic benchmarks while simultaneously boosting performance on affirmative semantics.
Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning: This paper proposes the D-Negation dataset and a Grouped Opposition-Based Learning (GOBL) fine-tuning mechanism. By leveraging semantically opposed description pairs and two dedicated loss functions, GOBL fine-tunes fewer than 10% of model parameters while substantially improving negation semantic understanding in visual grounding models (up to +5.7 mAP).
Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence: Medic-AD upgrades a general-purpose medical VLM into a clinically intelligent model through a three-stage progressive training framework—anomaly detection (<Ano> token), longitudinal difference reasoning (<Diff> token), and visual explanation (heatmaps)—achieving state-of-the-art performance on multiple medical tasks with capabilities spanning lesion detection, symptom tracking, and visual interpretability.
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs: This paper identifies that existing VLM-based OOD detection methods select negative texts using intra-modal distances (text-to-text or image-to-image), which are inconsistent with the cross-modal distances optimized by CLIP. The proposed InterNeg framework systematically leverages cross-modal distances from both textual and visual perspectives, achieving a 3.47% FPR95 reduction on ImageNet.
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents: MindPower proposes a Robot-Centric Theory-of-Mind reasoning framework that organizes perception → belief → desire → intention → decision → action into a three-level six-layer reasoning hierarchy (MindPower Reasoning Hierarchy), and employs Mind-Reward (GRPO-based reinforcement learning) to optimize reasoning consistency, surpassing GPT-4o by 12.77% on decision-making and 12.49% on action generation.
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents: MindPower proposes a robot-centric Theory-of-Mind (ToM) reasoning framework that organizes perception → belief → desire → intention → decision → action into a six-layer reasoning hierarchy, and optimizes reasoning consistency via Mind-Reward (based on GRPO), surpassing GPT-4o by 12.77% on decision-making and 12.49% on action generation.
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection: This paper proposes GACD (Gradient-based Influence-Aware Constrained Decoding), which employs first-order Taylor gradient estimation to quantify each token's influence on the output. GACD simultaneously mitigates multimodal hallucinations caused by text-visual bias and co-occurrence bias at inference time, requiring neither auxiliary models nor fine-tuning.
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with MLLMs: MMR-AD constructs the largest multimodal reasoning-oriented industrial anomaly detection dataset to date (127K images, 188 product categories, 395 anomaly types) and proposes Anomaly-R1, a GRPO reinforcement learning-based baseline model that significantly outperforms general-purpose MLLMs.
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping: This paper proposes MoDES, the first training-free expert skipping framework for MoE multimodal large language models. By leveraging Globally Modulated Local Gating (GMLG) and Dual-Modal Thresholding (DMT), MoDES adaptively skips redundant experts, retaining over 97% of original performance while skipping 88% of experts, and achieving 2.16× prefill speedup.
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping: MoDES is the first expert skipping framework for MoE multimodal large language models. It incorporates layer-level importance into routing probabilities via Global-Modulated Local Gating (GMLG), applies modality-specific skipping strategies for text and visual tokens via a Dual-Modal Threshold (DMT), and efficiently optimizes thresholds via frontier search. On Qwen3-VL-MoE-30B, MoDES retains 97.33% accuracy with 88% expert skipping, achieving a 2.16× prefill speedup.
MODIX: Training-Free Multimodal Information-Driven Positional Index Scaling for VLMs: This paper proposes MODIX, a training-free framework that dynamically adjusts the positional encoding step sizes of visual and textual tokens in VLMs via information-theoretic analysis (covariance entropy + cross-modal alignment), allocating finer positional granularity to information-dense modalities to enhance multimodal reasoning.
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models: This paper models expert selection in MoE as a sequential decision problem and optimizes the routing strategy via GRPO-based reinforcement learning. By introducing modality-aware router guidance, the proposed method consistently outperforms deterministic top-K routing and its variants on image and video understanding tasks in VLMs.
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes: This paper proposes the Panorama-Language Modeling (PLM) paradigm and the PanoVQA large-scale panoramic VQA dataset (653K QA pairs). A plug-and-play panoramic sparse attention (PSA) module is designed to enable existing VLMs to process equirectangular projection (ERP) panoramic images without retraining, achieving superior global reasoning over multi-view stitching approaches in adverse scenarios such as occlusion and accidents.
Mixture of States (MoS): Routing Token-Level Dynamics for Multimodal Generation: This paper proposes Mixture of States (MoS), a novel fusion paradigm for multimodal diffusion models. A lightweight, learnable token-level router dynamically routes hidden states from arbitrary layers of an understanding tower (frozen LLM/VLM) to arbitrary layers of a generation tower (DiT). With only 3–5B parameters, MoS matches or surpasses the 20B Qwen-Image on both image generation and editing benchmarks.
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models: MoT probe experiments reveal asymmetric pruning sensitivity between the text and visual pathways in LVLMs — the text pathway is highly sensitive and must be calibrated with text tokens, while the visual pathway is highly redundant and can tolerate 60% sparsity. Based on these findings, ATV-Pruning constructs a calibration pool using all text tokens plus a small, layer-adaptively selected subset of visual tokens.
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding: This paper proposes MSJoE, a framework that jointly evolves an MLLM and a lightweight keyframe sampler via reinforcement learning. The MLLM generates visual queries to guide frame retrieval, a 1D U-Net sampler learns selection weights from a CLIP similarity matrix, and both components are optimized end-to-end, achieving +8% accuracy improvement on long-form video QA.
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following: This paper introduces Multi-Crit, the first benchmark for evaluating the pluralistic criteria-following capability of multimodal judge models. It features criterion-level human annotations and preference-conflicting samples, along with three new metrics—PAcc, TOS, and CMR—to comprehensively evaluate 25 LMMs, revealing that even the strongest closed-source model achieves only 32.78% multi-criteria consistency on open-ended generation tasks.
Multi-Modal Image Fusion via Intervention-Stable Feature Learning: This paper proposes a causal inference-inspired multi-modal image fusion framework that employs three structured intervention strategies (complementary masking, random masking, and modality dropout) to probe genuine inter-modal dependencies, and designs a Causal Feature Integrator (CFI) to learn intervention-stable features. The method achieves PSNR of 66.02 and AG of 4.129 on MSRS, and mAP of 0.821 on object detection.
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery: This paper proposes SSR²-GCD, a framework that learns structured representations with uniformly compressed intra-modal distributions via a Semi-Supervised Rate Reduction (SSR²) loss, and introduces a Retrieval-based Text Aggregation (RTA) strategy to enhance cross-modal knowledge transfer. The method surpasses existing multi-modal GCD approaches on 8 benchmarks.
Multimodal OCR: Parse Anything from Documents: This paper proposes the Multimodal OCR (MOCR) paradigm, which unifies the parsing of text and graphics (charts, diagrams, UI components, etc.) in documents into structured textual representations (plain text + SVG code). The trained 3B-parameter dots.mocr model ranks second only to Gemini 3 Pro on OCR Arena, achieves a state-of-the-art score of 83.9 on olmOCR Bench, and surpasses Gemini 3 Pro on the image-to-SVG benchmark.
MUPO: All Roads Lead to Rome - Incentivizing Divergent Thinking in Vision-Language Models: MUPO identifies a reasoning diversity collapse in GRPO training — models prematurely converge to a small number of reasoning strategies while discarding most alternatives. By partitioning responses into groups for localized advantage estimation and introducing a diversity reward, MUPO incentivizes VLMs to maintain divergent thinking, achieving 2–7% improvements across multiple reasoning benchmarks.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy: Nano-EmoX proposes a cognition-inspired three-level emotional task hierarchy (Perception → Understanding → Interaction) and is the first multimodal language model to unify six core affective tasks within a compact 2.2B parameter framework, employing a P2E progressive training paradigm that cultivates capabilities from basic perception to high-level empathy.
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning: This paper proposes the Narrative Weaver framework, which combines narrative planning via MLLMs with fine-grained generation via diffusion models. Through learnable queries and a dynamic Memory Bank, the framework achieves long-range visually consistent generation under multi-modal conditioning. The authors also introduce EAVSD, the first e-commerce advertising video storyboard dataset, comprising 330K+ images.
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models: C2LIP proposes a contrastive learning fine-tuning approach that requires no hard negatives: by decomposing text into noun-phrase concepts and introducing cross-modal attention pooling, it achieves state-of-the-art performance on the SugarCrepe/SugarCrepe++ compositionality benchmarks while maintaining or improving zero-shot and retrieval performance.
No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection: This paper proposes LAVIDA, an end-to-end zero-shot video anomaly detection framework that transforms semantic segmentation datasets into pseudo-anomaly training data via an Anomaly Exposure Sampler. Combined with MLLM-based deep anomaly semantic feature extraction and reverse-attention token compression for spatiotemporal sparsity, LAVIDA achieves frame-level and pixel-level SOTA without any real VAD data.
Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment: This paper proposes NA-MVP, a framework that employs bi-directional (clean + noise-aware) multi-view prompt design combined with Unbalanced Optimal Transport (UOT) for fine-grained patch-to-prompt alignment, and applies classical OT for selective label correction on identified noisy samples, consistently outperforming state-of-the-art methods in noisy few-shot learning scenarios.
Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment: This paper proposes the NA-MVP framework, which employs a bi-directional (clean + noise-aware) multi-view prompt design coupled with Unbalanced Optimal Transport (UOT) for fine-grained patch-to-prompt alignment, and applies classical OT for selective label correction on identified noisy samples, consistently surpassing state-of-the-art methods in noisy few-shot learning scenarios.
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models: This paper proposes OddGridBench to evaluate the fine-grained visual discrepancy sensitivity of MLLMs (i.e., identifying the element in a grid that differs from others in color, size, rotation, or position). All evaluated MLLMs fall far below human performance. To address this gap, the authors propose OddGrid-GRPO, which combines curriculum learning with a distance-aware reward to significantly improve visual discrimination ability.
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens: OmniLottie proposes a Lottie Tokenizer that converts Lottie JSON files into structured command-parameter sequences, enabling pretrained VLMs to generate high-quality vector animations from multimodal instructions. The work also introduces the large-scale MMLottie-2M dataset to support training.
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models: This paper identifies the "token's dilemma" in dynamic MoE continual learning — ambiguous and old tokens in new-task data contribute minimally to new knowledge acquisition yet cause routing drift and catastrophic forgetting. The proposed LLaVA-DyMoE mitigates routing drift via Token Assignment Guidance and Routing Score Regularization, achieving over 7% MFN improvement and 12% forgetting reduction on the CoIN benchmark.
Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models: This paper reveals a novel mechanism underlying VLM hallucinations — overthinking: the model generates an excessive number of competing object hypotheses in intermediate decoding layers, and confounders propagate across layers to corrupt the final prediction. The paper proposes the Overthinking Score to quantify inter-layer hypothesis diversity × uncertainty, achieving F1 of 78.9% on MSCOCO and 71.58% on the OOD benchmark AMBER.
PaddleOCR-VL: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing: PaddleOCR-VL introduces a coarse-to-fine document parsing framework that first employs a lightweight VRFM module to detect valid regions and predict reading order, then applies a compact 0.9B VLM for fine-grained recognition, achieving state-of-the-art document parsing performance with minimal visual tokens and parameters.
PaddleOCR-VL: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing: PaddleOCR-VL proposes a coarse-to-fine document parsing architecture: the coarse stage employs a lightweight Valid Region Focusing Module (VRFM) to localize effective visual regions and predict reading order, while the fine stage applies a compact 0.9B vision-language model to perform detailed recognition on cropped regions, achieving state-of-the-art document parsing performance with minimal visual tokens and parameters.
PaddleOCR-VL: Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing: This paper proposes PaddleOCR-VL, a coarse-to-fine document parsing framework. The coarse stage employs a lightweight VRFM module to identify effective visual regions, while the fine stage applies a compact 0.9B VLM to process only those regions. With minimal visual tokens and parameters, the framework achieves state-of-the-art performance on OmniDocBench v1.5, substantially reducing latency and resource consumption.
Parallel In-context Learning for Large Vision Language Models: This paper proposes Parallel-ICL, which partitions the long demonstration context in multimodal in-context learning (MM-ICL) into chunks for parallel processing, and integrates predictions at the logit level via weighted Product-of-Experts (PoE). The method achieves performance on par with or superior to full-context MM-ICL while significantly reducing inference latency.
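As a rough illustration of logit-level fusion with a weighted Product-of-Experts, the sketch below combines per-chunk next-token logits by summing weighted log-probabilities and renormalizing. The function names, the toy logits, and the chunk weights are all hypothetical; how Parallel-ICL actually derives its weights is not stated in this note.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def weighted_poe(chunk_logits, weights):
    """Fuse per-chunk logits: log p(y) is proportional to sum_k w_k * log p_k(y)."""
    if len(chunk_logits) != len(weights):
        raise ValueError("need exactly one weight per chunk")
    combined = [0.0] * len(chunk_logits[0])
    for logits, w in zip(chunk_logits, weights):
        for i, lp in enumerate(log_softmax(logits)):
            combined[i] += w * lp
    # Renormalize so the fused scores form a proper distribution.
    return log_softmax(combined)

# Toy example: two demonstration chunks, vocabulary of three tokens.
chunk_logits = [[2.0, 0.5, 0.1], [1.5, 1.4, 0.2]]
weights = [0.7, 0.3]  # assumed weights, for illustration only
probs = [math.exp(lp) for lp in weighted_poe(chunk_logits, weights)]
```

Because each chunk is scored independently, the per-chunk forward passes can run in parallel and only the small logit vectors need to be combined.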
PersonaVLM: Long-Term Personalized Multimodal LLMs: This paper proposes PersonaVLM, a multimodal agent framework for long-term personalization. Through proactive memory management (four-type memory database), multi-step reasoning-based retrieval, and a momentum-based personality evolution mechanism, it transforms a general-purpose MLLM into a personalized assistant capable of adapting to shifting user preferences, surpassing GPT-4o by 5.2% under a 128K context.
Phantasia: Context-Adaptive Backdoors in Vision Language Models: Phantasia introduces the first context-adaptive backdoor attack against VLMs. Rather than generating fixed malicious text, a poisoned model receiving a triggered image silently answers an attacker-specified target question instead of the user's original query. The generated response is semantically consistent with the input image and linguistically fluent, thereby evading defenses such as STRIP-P and ONION-R. The paper also provides the first empirical demonstration that the stealthiness of existing VLM backdoor attacks has been substantially overestimated.
PhysInOne: Visual Physics Learning and Reasoning in One Suite: PhysInOne is a large-scale synthetic dataset comprising 153,810 dynamic 3D scenes and 2 million annotated videos, covering 71 fundamental physical phenomena across mechanics, optics, fluid dynamics, and magnetism, establishing a new benchmark for physically-aware world models.
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision: This paper proposes the DeepfakeJudge framework, which scales human-annotated reasoning supervision into large-scale structured scoring data via a bootstrapped generator-evaluator pipeline. The framework trains 3B/7B vision-language models as automatic judges for deepfake detection reasoning quality, achieving high human alignment in both pointwise and pairwise evaluation settings.
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models: PointAlign applies feature-level alignment regularization to point cloud tokens at intermediate LLM layers of 3D VLMs, aligning them with the Q-Former outputs. By training only a lightweight alignment projector and LoRA adapters, the method prevents geometric information from degrading during language modeling, achieving a 7.50 pp improvement on open-vocabulary classification.
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees: This paper proposes Proof-of-Perception (PoP), which models multimodal reasoning as an executable directed acyclic graph (DAG) where each perception/logic node outputs set-valued predictions with conformal certificates providing step-wise reliability guarantees. A lightweight controller adaptively allocates computation within a budget based on these certificates. PoP outperforms CoT, ReAct, and PoT baselines on document, chart, and multi-image QA benchmarks.
Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models: This paper systematically diagnoses visual representation degradation in MLLMs across two levels—global functionality and patch-level semantic structure—revealing that such degradation is an intrinsic "visual sacrifice" induced by the pure text-generation objective. It proposes Predictive Regularization (PRe), which mitigates degradation by training intermediate-layer features to predict the initial visual features, achieving consistent improvements across multiple vision-language benchmarks.
Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation: This paper proposes AReS, which replaces the continuous API calls of conventional zeroth-order optimization (ZOO) with a single-round API query to prime a local encoder. AReS achieves a +27.8% improvement on GPT-4o (where ZOO methods are nearly ineffective), while reducing API calls by over 99.99% and enabling zero-cost inference.
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving: This paper proposes Prune2Drive, the first plug-and-play token pruning framework for multi-view autonomous driving VLMs. It uses T-FPS (Token-wise Farthest Point Sampling) to preserve semantic and spatial diversity, and view-adaptive pruning rate optimization to automatically allocate token budgets per camera. On DriveLM, the framework achieves 6.40× prefill acceleration while retaining only 10% of tokens, at the cost of a 3% performance drop.
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving: Prune2Drive is the first plug-and-play token pruning framework designed for multi-view autonomous driving VLMs. It combines T-FPS (Token-wise Farthest Point Sampling) to preserve semantic and spatial diversity with view-adaptive pruning rate optimization to automatically allocate token budgets across camera views. Retaining only 10% of tokens on DriveLM, it achieves 6.40× prefill speedup with only a 3% performance drop.
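The farthest-point-sampling idea behind T-FPS can be sketched generically as below. This is plain greedy FPS over token feature vectors with Euclidean distance; the seed choice, the distance metric, and the toy 2-D "tokens" are assumptions made for illustration and are not taken from the Prune2Drive paper.

```python
import math

def farthest_point_sampling(tokens, k):
    """Greedily keep k token indices that are mutually far apart."""
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    selected = [0]  # assumption: seed with the first token
    # Distance from every token to its nearest already-selected token.
    min_dist = [math.dist(t, tokens[0]) for t in tokens]
    while len(selected) < k:
        nxt = max(range(len(tokens)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], math.dist(t, tokens[nxt]))
    return selected

# Toy example: two near-duplicate clusters plus one isolated token.
tokens = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.1), (0.0, 4.0)]
kept = farthest_point_sampling(tokens, 3)
```

Each step picks the token farthest from everything already kept, so near-duplicate tokens are dropped first and the retained subset covers the feature space.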
Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher: This paper proposes PTA (Purify-then-Align), a framework that first purifies a noisy multimodal teacher via a meta-learning-driven modality weighting mechanism, then aligns each unimodal student through diffusion-model-driven knowledge distillation, enabling unimodal encoders to maintain strong robustness under modality-missing scenarios. PTA achieves state-of-the-art performance on MM-Fi and XRF55.
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization: This paper proposes Quant Experts (QE), a token-aware adaptive quantization error compensation framework based on Mixture of Experts (MoE). By partitioning important channels into token-independent and token-dependent groups, QE employs shared experts and routed experts to perform global and local quantization error reconstruction respectively, achieving significant accuracy recovery on VLMs ranging from 2B to 72B parameters.
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization: This paper proposes Quant Experts (QE), a token-aware adaptive quantization error reconstruction framework based on Mixture-of-Experts. It partitions important channels into token-independent (high-frequency, globally consistent) and token-dependent (low-frequency, locally dynamic) groups, compensating global and local quantization errors via low-rank adapters in shared and routed experts, respectively. QE consistently improves VLM performance across diverse quantization settings ranging from W4A6 to W3A16.
Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning: This paper proposes the Reason-SVG framework, which introduces a "Drawing-with-Thought" (DwT) paradigm that enables LLMs to perform explicit multi-stage design reasoning prior to SVG generation. Combined with SFT and GRPO reinforcement learning with a hybrid reward function, Reason-SVG consistently outperforms existing methods in semantic alignment, structural validity, and visual quality.
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps: This paper introduces the ReasonMap benchmark, which constructs 1,008 QA pairs from high-resolution transit maps of 30 cities and proposes a two-level evaluation framework (correctness + quality) to systematically assess fine-grained visual reasoning capabilities of 16 MLLMs. A key finding is that among open-source models, base models outperform reasoning models, while the opposite holds for closed-source models.
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps: This paper introduces the ReasonMap benchmark, comprising 1,008 QA pairs constructed from high-resolution transit maps of 30 cities, to systematically evaluate the fine-grained visual understanding and spatial reasoning capabilities of 16 MLLMs. The work reveals the counter-intuitive phenomenon that base variants of open-source models consistently outperform their reasoning counterparts, and establishes a GRPO-based reinforcement fine-tuning baseline.
ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval: This paper reveals a Capability Degradation phenomenon that occurs when adapting generative MLLMs into discriminative retrievers, and proposes the ReCALL framework — a three-stage pipeline that diagnoses retriever blind spots, leverages the base MLLM's CoT reasoning to generate corrective triplets, and applies grouped contrastive refinement to recover degraded fine-grained compositional reasoning ability. ReCALL achieves R@1 of 55.52% on CIRR and R@10 of 57.04% on FashionIQ.
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress: This paper proposes R²VLM, a recurrent reasoning framework that processes local video segments sequentially, maintains a dynamically updated CoT record tracking task decomposition and completion status, and leverages a multi-dimensional RL reward scheme to achieve state-of-the-art performance in long-horizon embodied task progress estimation. The framework additionally supports downstream applications including policy learning, reward modeling, and proactive assistance.
Recursive Think-Answer Process for LLMs and VLMs: R-TAP proposes a recursive think-answer process that employs a confidence generator to assess the certainty of model responses and guide iterative reasoning refinement. Combined with dual reinforcement signals—a recursive confidence growth reward and a final answer confidence reward—R-TAP consistently outperforms single-pass inference methods on both LLMs and VLMs, while substantially reducing "Oops!"-style self-reflection expressions during reasoning.
ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation: This paper proposes ReHARK — a training-free one-shot CLIP adaptation framework that constructs a hybrid prior by fusing CLIP text knowledge, GPT-3 semantic descriptions, and visual prototypes, and performs global proximal regularization in RKHS via multi-scale RBF kernels, achieving a new one-shot SOTA of 65.83% average accuracy across 11 benchmarks.
ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation: ReHARK is a four-stage refinement pipeline that constructs hybrid semantic-visual priors, augments the support set, applies adaptive distribution rectification, and integrates multi-scale RBF kernels, achieving 65.83% one-shot adaptation accuracy across 11 benchmarks and substantially outperforming Tip-Adapter and ProKeR.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training: This paper proposes the World-Env framework, which leverages a physically consistent world model as a virtual environment in place of real-world interaction to perform RL post-training on VLA models. With only 5 demonstrations per task, the framework achieves significant improvements in manipulation success rates.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training: This paper proposes World-Env, a framework that employs a physically consistent world model as a virtual simulator in place of real-world interaction. Combined with a VLM-guided instant reflector that provides continuous rewards and dynamic termination signals, the framework enables safe and efficient RL post-training of VLA models using only 5 demonstration trajectories per task, improving average success rate from 74.85% to 79.6%.
Relational Visual Similarity: This paper formally defines the problem of relational visual similarity — the intrinsic relational or functional correspondence between two images, as opposed to surface-level attribute similarity — constructs a 114K anonymous-description dataset, trains the relsim model, and reveals fundamental deficiencies in existing similarity metrics (CLIP, DINO, etc.) for capturing relational similarity.
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding: This paper proposes ReMoRa, which operates directly on compressed video representations (I-frames + motion vectors). A Refined Motion Representation (RMR) module refines coarse block-level motion vectors into fine-grained motion representations approximating optical flow, while a Hierarchical Motion State Space (HMSS) module performs linear-time long-range temporal modeling. ReMoRa surpasses baselines on LongVideoBench, NExT-QA, MLVU, and other benchmarks.
Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance: This paper proposes Residual Decoding (ResDec), a training-free plug-and-play decoding strategy that identifies the semantic anchoring phase by analyzing U-shaped JSD patterns in historical token logit distributions, aggregates logits from this phase as residual guidance to steer current decoding, and effectively suppresses language-prior hallucinations in LVLMs at near-zero additional inference overhead.
Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in VDU: Layer-wise linear probing analysis reveals a significant gap between internal representations and generated responses in LVLMs for visual document understanding (VDU). Intermediate layers encode more linearly accessible task-relevant information than final layers, and fine-tuning intermediate layers simultaneously improves accuracy and narrows the gap.
Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token: This paper proposes SELF1E, the first MLLM segmentation method that requires neither a dedicated mask decoder nor more than a single [SEG] token. By introducing Residual Features Refilling (RFR) and Residual Features Amplifier (RFA) to recover resolution lost during pixel-shuffle compression, SELF1E achieves performance competitive with decoder-based methods across multiple segmentation tasks.
Rethinking VLMs for Image Forgery Detection and Localization: This work reveals that VLMs inherently favor semantic plausibility over authenticity (CLIP cosine similarity for forged images reaches 96–99%), and proposes IFDL-VLM, which decouples detection/localization from language explanation into two stages: Stage-1 employs ViT+SAM for detection and localization, and Stage-2 feeds the resulting mask as auxiliary input to a VLM to enhance interpretability. The method achieves state-of-the-art performance across 9 benchmarks.
Rethinking VLMs for Image Forgery Detection and Localization: This paper proposes IFDL-VLM, a framework that identifies an inherent semantic plausibility bias in VLMs — their tendency to favor semantic coherence over authenticity — which impedes forgery detection performance. The framework decouples detection/localization from language explanation into a two-stage optimization pipeline, and leverages localization masks as auxiliary inputs to VLMs to enhance interpretability, achieving comprehensive SOTA results across 9 benchmarks.
Revisiting Model Stitching in the Foundation Model Era: This paper systematically investigates the feasibility of stitching heterogeneous Vision Foundation Models (VFMs), finds that conventional methods fail in this setting, and proposes a two-stage training strategy — Final Feature Matching + Task Loss Training — that enables reliable stitching across heterogeneous VFMs. The resulting stitched models can even surpass both constituent VFMs individually. Building on this, the paper introduces the VFM Stitch Tree (VST) architecture, which provides a controllable accuracy–efficiency trade-off for multi-VFM systems.
Revisiting Model Stitching in the Foundation Model Era: This paper proposes a two-stage stitching training method (Final Feature Matching + Task Loss Training) for heterogeneous Vision Foundation Models (VFMs) and demonstrates that heterogeneous VFMs can be reliably stitched and fused to combine complementary knowledge. It also designs a VFM Stitch Tree (VST) architecture that provides controllable accuracy–efficiency trade-offs in multi-VFM systems.
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach: This paper proposes FlashCache — the first training-free multimodal KV Cache compression framework that requires no attention scores. By identifying Outlier KVs via frequency-domain low-pass filtering and dynamically allocating per-layer budgets, FlashCache achieves 80% memory reduction and 1.69× decoding speedup while preserving model performance.
SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning: This paper proposes SALMUBench—the first association-level machine unlearning benchmark for CLIP-style models—comprising a 60K synthetic person–sensitive-attribute paired dataset, from-scratch-trained Compromised/Clean model pairs, and a structured holdout evaluation protocol. It is the first work to systematically reveal three failure modes of existing unlearning methods: catastrophic destruction, over-generalized unlearning, and ineffective unlearning.
Scaling Spatial Intelligence with Multimodal Foundation Models: SenseNova-SI systematically constructs an 8M-scale diverse spatial dataset (SenseNova-SI-8M) to cultivate spatial intelligence in multimodal foundation models including Qwen3-VL, InternVL3, and Bagel, achieving unprecedented performance on multiple spatial benchmarks such as VSI-Bench and MMSI while preserving general multimodal understanding capabilities.
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework: This paper proposes the Self-Critical Inference (SCI) framework, which simultaneously addresses language bias and language sensitivity in LVLMs via multi-round textual and visual counterfactual logit aggregation. A dynamic robustness benchmark, DRBench, is introduced to evaluate robustness in a model-specific manner. Increasing the number of counterfactual inference rounds yields consistent robustness gains, opening a new direction for test-time scaling.
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism: This paper proposes FlexMem — a training-free visual memory mechanism that constructs a visual memory bank via iterative dual-pathway KV cache compression, and introduces both encoding-based and fast index-based memory retrieval strategies, enabling MLLMs to process 1000+ frame long videos on a single 3090 GPU while substantially outperforming existing efficient video understanding methods.
Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models: This paper proposes Scene-VLM — the first VLM fine-tuning-based framework for video scene segmentation — which leverages structured multimodal shot representations (visual frames + dialogue + metadata), causal sequential prediction, a context-focus window mechanism, and token logits-based confidence extraction, achieving substantial gains of +6 AP and +13.7 F1 on MovieNet, while demonstrating natural language explanation capability.
SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts: This work introduces SciPostGen, a large-scale dataset of 18,097 paper–poster pairs. Analysis reveals a moderate correlation between paper structure and the number of poster layout elements. A retrieval-augmented poster layout generation framework is proposed, which leverages contrastive learning to retrieve layout templates matching the input paper and guides an LLM to generate the final poster layout.
SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker: This paper proposes SEATrack, a multimodal tracker that achieves dynamic cross-modal attention map alignment via AMG-LoRA and efficient global relation modeling via HMoE, attaining a state-of-the-art performance–efficiency trade-off on RGB-T/D/E tracking with minimal trainable parameters.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models: This paper introduces AV-SpeakerBench, a benchmark comprising 3,212 speaker-centric audiovisual reasoning multiple-choice questions, which systematically evaluates multimodal large language models on fine-grained audiovisual fusion capabilities—specifically, who is speaking, what was said, and when—revealing a gap of over 20% between the strongest current models and human performance.
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles: This paper proposes State-aware Reasoning (StaR), which teaches multimodal agents a three-step reasoning chain — "perceive current state → analyze target state → decide whether to act" — improving GUI toggle control accuracy by over 30% without degrading general agent task performance.
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness: This paper proposes an efficient plug-and-play module that learns multimodal class embeddings to enhance VLM recognition and reasoning on rare objects. On the visual side, a cross-attention adapter refines visual tokens; on the textual side, object detection prompts are injected. Without fine-tuning the VLM, the method achieves a significant gain from 72.8 to 75.4 on CODA-LM.
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions: This paper introduces the tactile localization task—identifying regions in an image that share the same material properties as a given tactile input—and addresses it via local visual-tactile alignment and a material-diversity pairing strategy for learning dense cross-modal features. Two new tactile-material segmentation datasets are also constructed.
Self-Consistency for LLM-Based Motion Trajectory Generation and Verification: This paper extends the self-consistency paradigm of LLMs from natural language reasoning to the visual domain. It defines shape families for motion trajectories via a Lie transformation group hierarchy, and clusters multiple LLM-sampled trajectories under transformation-invariant distance metrics to achieve unsupervised trajectory generation improvement (+4–6%) and verification (precision +11.8%), without any training.
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning: This paper proposes the Similarity-as-Evidence (SaE) framework, which reinterprets VLM text-image similarities as Dirichlet evidence. A Similarity Evidence Head (SEH) is introduced to calibrate overconfident softmax outputs, and a dual-factor acquisition strategy based on vacuity and dissonance enables interpretable, label-efficient medical active learning, achieving a SOTA macro-average accuracy of 82.57% across 10 datasets under a 20% annotation budget.
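The evidential quantities named in the SaE summary can be illustrated with a minimal sketch using the standard subjective-logic definitions (Dirichlet parameters alpha = evidence + 1, vacuity = K / sum(alpha), and belief-balance dissonance). The mapping from similarity to evidence (`evidence_from_similarities`, a softplus with a `scale` parameter) is a hypothetical stand-in; the paper's Similarity Evidence Head is learned and is not reproduced here.

```python
import math


def evidence_from_similarities(sims, scale=10.0):
    """Hypothetical mapping (an assumption, not the paper's SEH):
    softplus of the scaled similarity gives non-negative evidence."""
    return [math.log1p(math.exp(scale * s)) for s in sims]


def dirichlet_uncertainty(evidence):
    """Return (beliefs, vacuity) for a Dirichlet with alpha_k = e_k + 1."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                      # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    vacuity = num_classes / strength           # high when total evidence is low
    return beliefs, vacuity


def dissonance(beliefs):
    """Dissonance: high when several classes hold similar, non-zero belief."""
    def balance(b_j, b_k):
        return 1.0 - abs(b_j - b_k) / (b_j + b_k) if (b_j + b_k) > 0 else 0.0

    total = 0.0
    for k, b_k in enumerate(beliefs):
        others = [b_j for j, b_j in enumerate(beliefs) if j != k]
        denom = sum(others)
        if denom > 0:
            total += b_k * sum(b_j * balance(b_j, b_k) for b_j in others) / denom
    return total
```

In this sketch, a sample with no evidence for any class has vacuity 1 (a candidate for labeling because the model knows little), while a sample with strong, evenly split evidence has low vacuity but high dissonance (a candidate because the evidence conflicts).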
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models: SIMPACT is a test-time simulation-augmented action planning framework that automatically constructs a physics simulation environment from a single RGB-D image, enabling VLMs to propose actions, observe simulation outcomes, and iteratively refine their reasoning. It achieves SOTA performance on both rigid and deformable object manipulation tasks without any additional training.
SoPE: Spherical Coordinate-Based Positional Embedding for 3D LVLMs: This paper identifies spatial perception bias in RoPE when applied to 3D LVLMs (1D indexing disrupts 3D locality and ignores directionality), and proposes SoPE, a spherical coordinate-based positional embedding using a four-dimensional index \((t, r, \theta, \phi)\) with multi-dimensional frequency allocation and multi-scale mixing. SoPE achieves state-of-the-art performance on 3D layout estimation and object detection benchmarks built upon SpatialLM.
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs: This paper proposes the SPARROW framework, which injects temporal referential consistency via Target-Specific Tracked Features (TSF) and stabilizes pixel-level localization through dual-prompt (BOX+SEG) initialization. As a plug-and-play module, SPARROW consistently improves performance across three video MLLM baselines on six benchmarks.
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs: This paper proposes the SPARROW framework, which injects temporal consistency supervision via Target-Specific Features (TSF), stabilizes first-frame initialization through dual-prompt ([BOX]+[SEG]) coarse-to-fine decoding, and integrates into existing video MLLMs in a plug-and-play manner, achieving consistent improvements across 6 benchmarks on 3 tasks.
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models: This paper proposes SpatiaLQA, a benchmark comprising 9,605 QA pairs across 241 real-world indoor scenes, systematically evaluates 41 VLMs on spatial logical reasoning, and introduces a recursive scene graph-assisted reasoning method to enhance VLMs' spatial logical reasoning capabilities.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence: This paper introduces SpatialScore, currently the most comprehensive multimodal spatial intelligence benchmark (5K samples / 30 tasks), and proposes two complementary approaches to enhance spatial understanding in MLLMs: a data-driven fine-tuning scheme via SpatialCorpus (331K QA pairs) and a training-free SpatialAgent system equipped with 12 specialized tools.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning: This paper proposes SpatialStack, a framework that injects multi-level geometric features from a multi-view geometry encoder (VGGT) into different layers of an LLM decoder (rather than fusing only the final layer), achieving open-source SOTA on multiple 3D spatial reasoning benchmarks through hierarchical alignment where shallow layers handle fine-grained spatial perception and deep layers support high-level semantic reasoning.
SSR2-GCD: Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery: This paper proposes SSR2-GCD, a framework that replaces conventional contrastive losses with a Semi-Supervised Rate Reduction (SSR2) loss to learn uniformly compressed, structured representations. The work further reveals that inter-modal alignment is not only unnecessary but harmful in multi-modal GCD, achieving +3.1% and +6.3% over the prior state of the art on Stanford Cars and Flowers102, respectively.
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles (StaR): This paper reveals the severe failure of existing multimodal GUI agents on toggle control tasks (GPT-5 achieves only 37% O-AMR), and proposes State-aware Reasoning (StaR), a three-step reasoning chain (perceive current state → analyze target state → decide whether to act) that improves execution accuracy by 30%+, without degrading general agent capabilities.
StructXLIP: Enhancing Vision-Language Models with Multimodal Structural Cues: StructXLIP adopts edge maps as proxy representations of visual structure and introduces three structure-centric losses during CLIP fine-tuning — edge-structure text alignment, local region-text chunk matching, and edge-color image connection. By maximizing the mutual information of multimodal structural representations, the model is guided toward more robust and semantically stable optima, surpassing existing competitors on cross-modal retrieval tasks.
Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models: This paper proposes TARA, a framework that injects taxonomic hierarchy knowledge into large multimodal models (LMMs) by aligning their intermediate representations with taxonomy-aware features from a biological foundation model (BFM), substantially improving hierarchical visual recognition performance on both known and novel categories.
Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention: This paper proposes Vision-Guided Attention (VGA), a training-free method that constructs precise visual grounding from the semantic features of visual tokens to guide model attention toward relevant visual regions, effectively mitigating hallucinations in MLLMs while remaining compatible with FlashAttention.
Test-Time Attention Purification for Backdoored Large Vision Language Models: This work identifies that the essence of backdoor behavior in LVLMs is cross-modal attention stealing (trigger visual tokens hijack the attention weights of text tokens), and proposes CleanSight — the first training-free test-time backdoor defense framework — which eliminates backdoor effects by detecting and pruning high-attention trigger tokens.
Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction: This paper proposes TOMCap — a text-only training approach for image captioning that combines retrieval augmentation, modality gap correction, and LoRA fine-tuning. The model trains exclusively on text yet processes images at inference time, surpassing existing training-free and text-only methods.
The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts: This work identifies a critical and overlooked threat: existing multimodal manipulation detection methods fail to account for MLLMs' ability to generate semantically coherent deceptive narratives. The authors construct MDSM, a semantically aligned manipulation dataset of 441k samples, and propose AMD, a framework based on Artifact Tokens and manipulation-oriented reasoning. With only 0.27B parameters, AMD achieves state-of-the-art cross-domain generalization of 88.18 ACC / 60.25 mAP / 61.02 mIoU.
The Coherence Trap: MLLM-Crafted Narratives Exploit Manipulated Visual Contexts: This paper identifies two fundamental flaws in existing multimodal disinformation detection—underestimating semantically coherent fake narratives generated by MLLMs and over-reliance on simple misalignment artifacts—and constructs the 441k-sample MDSM dataset (image manipulation + MLLM-generated semantically aligned text). The proposed AMD framework (Artifact Pre-perception + Manipulation-Oriented Reasoning) achieves 88.18 ACC / 60.25 mAP / 61.02 mIoU on cross-domain detection.
The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition: This paper reveals that open-source LLMs lack hierarchical taxonomic knowledge about the visual world (often failing to recognize even basic biological classification systems), making the LLM the bottleneck for hierarchical visual recognition in Vision LLMs.
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment: This paper proposes Contrastive Fusion (ConFu), a framework that extends CLIP-style bimodal contrastive learning to tri-modal higher-order alignment, jointly learning paired and fused representations within a unified objective to support both 1→1 and 2→1 retrieval.
Think360: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth: This paper presents Think360, a multimodal benchmark focused on reasoning width—i.e., a model's capability for multi-path search, multi-constraint pruning, backtracking, and trial-and-error exploration. The benchmark comprises 1,200+ high-quality samples and introduces a fine-grained Tree-of-Thought evaluation protocol, revealing significant deficiencies in current MLLMs along the width dimension of reasoning.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models: This paper presents the first quantitative analysis of CoT reasoning in diffusion multimodal LLMs (dMLLMs), identifying two critical issues — "early answer generation" and "weak visual grounding" — and proposes two training-free methods, PSP (Position-Step Penalty) and VRG (Visual Reasoning Guidance), achieving up to 7.5% accuracy improvement at over 3× speedup.
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World: This paper proposes Dyn-Bench, a large-scale benchmark for dynamic understanding of the physical 4D world (1k videos, 7k VQA pairs, 3k dynamic grounding pairs) that systematically evaluates the spatio-temporal reasoning capabilities of general, spatial-aware, and region-level MLLMs. The study finds that existing models fail to keep their reasoning and grounding consistent with each other, and introduces two structured integration methods, Mask-Guided Fusion and ST-TCM, that significantly improve dynamic perception.
TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval: This paper proposes TIGeR, a multimodal Transformer framework that jointly learns a unified geo-temporal embedding space over images, locations, and timestamps, enabling three tasks—geolocalization, capture time prediction, and geo-temporally aware image retrieval—within a single model. A high-quality benchmark dataset of 4.5M images is also introduced.
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs: This paper systematically investigates the key factors for building video temporal grounding (VTG) capabilities in MLLMs from two dimensions — data quality and algorithm design. It releases the high-quality benchmark TimeLens-Bench and training set TimeLens-100K, and constructs the TimeLens model series via interleaved textual timestamp encoding combined with a thinking-free RLVR training paradigm, achieving state-of-the-art performance among open-source models and surpassing GPT-5 and Gemini-2.5-Flash.
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment: TIPSv2 builds on the finding that distillation substantially improves patch-text alignment and translates this insight into a new pretraining objective, iBOT++ (in which visible tokens also participate in the loss computation). Combined with head-only EMA and multi-granularity text augmentation, TIPSv2 achieves state-of-the-art performance across 9 tasks and 20 datasets.
Token Warping Helps MLLMs Look from Nearby Viewpoints: This paper proposes performing spatial warping on ViT image tokens within MLLMs, rather than conventional pixel-level warping, to simulate viewpoint changes. The authors find that backward token warping maintains semantic consistency while remaining robust to depth estimation noise. The proposed method substantially outperforms pixel-level warping, specialized spatial-reasoning MLLMs, and generative warping approaches on the newly constructed ViewBench benchmark.
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans (HouseMind): This paper presents HouseMind, which discretizes architectural floor plans into room-level spatial tokens via a hierarchical VQ-VAE, enabling floor plan understanding, generation, and editing within a unified MLLM framework. The approach comprehensively outperforms diffusion model and general-purpose VLM baselines in geometric validity and controllability.
Topo-R1: Detecting Topological Anomalies via Vision-Language Models: This paper proposes Topo-R1, the first framework to equip VLMs with topology-aware perception. Through an automated data construction pipeline combined with SFT and GRPO reinforcement learning (incorporating a topology-aware composite reward), Topo-R1 enables annotation-free topological anomaly detection and classification in tubular structures.
Towards Calibrating Prompt Tuning of Vision-Language Models: To address the "dual miscalibration" problem in prompt-tuned CLIP (underconfidence on base classes and overconfidence on novel classes), this paper proposes two complementary regularizers — mean-variance margin regularization and text moment-matching loss — as plug-and-play modules that consistently reduce ECE across 7 prompt tuning methods and 11 datasets.
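For readers unfamiliar with the metric reported above, Expected Calibration Error (ECE) can be sketched as follows. This is the common equal-width binning definition, not code from the paper; the bin count and the half-open binning convention are assumptions of this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over
    equal-width confidence bins (bins are [lo, hi) here, with 1.0 placed
    in the last bin; this convention is an assumption of the sketch)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

Under this definition, overconfidence (mean confidence above accuracy) and underconfidence (mean confidence below accuracy) both increase ECE, which is why a single metric can capture both sides of the "dual miscalibration" described in the note.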
Towards Multimodal Domain Generalization with Few Labels: This paper defines and investigates the novel problem of Semi-Supervised Multimodal Domain Generalization (SSMDG), and proposes a unified framework integrating consensus-driven pseudo-labeling, disagreement-aware regularization, and cross-modal prototype alignment to achieve cross-domain generalization of multimodal models under limited annotation.
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training: This paper proposes DocHumming, a data-training co-design framework that constructs the large-scale synthetic dataset DocMix-3M via Realistic Scene Synthesis, and introduces a Document-Aware Training Recipe (DATR) combining progressive learning and structure token weighting. On a 1B-parameter MLLM, DocHumming achieves an OmniDocBench Overall score of 93.75, surpassing Qwen3-VL-235B (89.15), with only a 6.72-point degradation under real-world capture conditions (vs. 18–20 points for pipeline-based methods).
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation: This paper proposes CES (Coordinator-Executor-State Tracker), a multi-agent framework coupled with a staged execution-feedback reinforcement learning algorithm. By decoupling high-level task planning from low-level execution, and through dedicated training of the Coordinator and State Tracker, CES significantly improves GUI agent planning and state management capabilities on long-horizon tasks.
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration: TreeTeaming is an automated red-teaming framework based on a hierarchical strategy tree, in which an LLM-driven Orchestrator dynamically explores and evolves attack strategies. The framework achieves state-of-the-art attack success rates (ASR) across 12 mainstream VLMs (87.60% on GPT-4o) and discovers diverse novel attack strategies that go beyond all known strategy sets.
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration: This paper proposes TreeTeaming, an autonomous red-teaming framework that transforms strategy exploration from static testing into a dynamic evolutionary process. An LLM orchestrator autonomously constructs and expands a hierarchical strategy tree, while a multimodal executor carries out concrete attacks. TreeTeaming achieves state-of-the-art attack success rates on 11 out of 12 evaluated VLMs, reaching 87.60% on GPT-4o.
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration: TreeTeaming is an autonomous red-teaming framework that dynamically constructs and expands a strategy tree via an LLM-driven Orchestrator, autonomously discovering diverse VLM attack strategies from a single seed example. It achieves state-of-the-art attack success rates across 12 mainstream VLMs (87.60% on GPT-4o), and the discovered strategies are more diverse than the union of all publicly available known strategies.
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition: This paper proposes TRivia, a self-supervised fine-tuning framework that leverages QA-driven GRPO reinforcement learning to enable VLMs to learn table recognition directly from unannotated table images. The resulting TRivia-3B surpasses proprietary models such as Gemini 2.5 Pro and GPT-5 on multiple benchmarks.
Unbiased Dynamic Multimodal Fusion: UDML is an unbiased dynamic multimodal learning framework comprising two core components: a noise-aware uncertainty estimator (which injects controllable noise and predicts its intensity to achieve accurate modality quality assessment under both low-noise and high-noise conditions) and a modality dependency calculator (which quantifies the model's inherent dependency bias toward each modality via Dropout and incorporates it into the weighting mechanism). The framework addresses the dual suppression problem in existing methods and consistently improves performance across multiple multimodal benchmarks.
Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models: This paper proposes Beta-KD, an uncertainty-aware knowledge distillation framework grounded in a Bayesian perspective. By modeling teacher supervision as a Gibbs prior and deriving a closed-form solution via Laplace approximation, Beta-KD automatically balances data and teacher signals, consistently improving distillation performance on multimodal VQA benchmarks.
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models: This paper proposes UNCHA, a framework that models the semantic representativeness of image parts with respect to the whole scene via hyperbolic uncertainty in hyperbolic VLMs. By incorporating uncertainty-guided contrastive loss and entailment loss, UNCHA enhances compositional scene understanding and outperforms existing hyperbolic VLMs across multiple downstream tasks.
Understanding Task Transfer in Vision-Language Models: This paper presents the first systematic study of how fine-tuning a VLM on one visual perception task affects its zero-shot performance on other perception tasks. It proposes the Perfection Gap Factor (PGF), a normalized metric for quantifying cross-task transfer, and reveals structural regularities in task transfer (positive/negative transfer cliques, task personas, scale dependence) across three scales of Qwen-2.5-VL. The paper further demonstrates that PGF can guide data selection to improve fine-tuning efficiency.
UNICBench: UNIfied Counting Benchmark for MLLM: This paper introduces UNICBench, the first unified cross-modal (image/text/audio) multi-level counting benchmark, comprising 5,508 + 5,888 + 2,905 = 14,301 QA pairs organized along a three-level capability taxonomy (Pattern/Semantic/Reasoning) × three-level difficulty taxonomy (Easy/Medium/Hard). The benchmark systematically evaluates 45 state-of-the-art MLLMs, revealing that basic counting tasks are near saturation while reasoning-level and hard-difficulty tasks exhibit substantial performance gaps.
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary: UniGame proposes the first self-adversarial post-training framework for unified multimodal models (UMMs). By attaching a lightweight perturber at the shared visual token interface, the generation branch actively constructs semantically consistent adversarial samples to challenge the understanding branch, forming a minimax self-play game that substantially improves consistency (+4.6%), understanding (+3.6%), generation, and robustness.
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression: This paper proposes UniMMAD, the first unified framework for multi-modal (RGB/Depth/IR, etc.) and multi-class anomaly detection. It follows a General-to-Specific paradigm: a general multi-modal encoder compresses features, and a Cross Mixture-of-Experts (C-MoE) decompresses them into domain-specific features. The method achieves state-of-the-art results on 5 datasets spanning industrial, medical, and synthetic scenarios at 59 FPS inference speed.
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression: This paper proposes UniMMAD, the first unified framework that handles multi-modal and multi-class anomaly detection simultaneously with a single parameter set. The core contribution is an MoE-based feature decompression mechanism that adaptively decomposes general multi-modal encoded features into domain-specific unimodal reconstructions, achieving state-of-the-art performance across 9 datasets spanning 3 domains, 12 modalities, and 66 categories.
V2Drop: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models: This work is the first to approach vision token compression from the perspective of inter-layer token variation. It identifies "lazy" vision tokens with small inter-layer variation as having negligible impact on model output, and proposes V2Drop, a progressive dropping scheme that eliminates low-variation tokens. V2Drop retains 94.0% of image understanding performance while reducing generation latency by 31.5%, and retains 98.6% of video understanding performance while reducing latency by 74.2%, with full compatibility with FlashAttention.
Variation-Aware Vision Token Dropping for Faster Large Vision-Language Models: This paper proposes V2Drop, the first method to approach token importance from the perspective of token variation. By progressively dropping "lazy" vision tokens with minimal variation inside the LLM, V2Drop achieves training-free, position-bias-free, and efficient-operator-compatible LVLM inference acceleration, retaining 94.0% and 98.6% of original performance on image and video understanding tasks while reducing LLM generation latency by 31.5% and 74.2%, respectively.
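The idea shared by the two V2Drop notes above (rank vision tokens by how much their hidden state changes between layers, then drop the "lazy" ones) can be sketched in a few lines. The L2 norm as the variation measure, the `keep_ratio` parameter, and the function names are assumptions of this sketch; the paper's exact metric and dropping schedule are not given in the notes.

```python
import math


def token_variation(prev_layer, curr_layer):
    """Per-token L2 distance between hidden states of two consecutive layers.
    Each argument is a list of token vectors (lists of floats)."""
    return [
        math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_vec, curr_vec)))
        for prev_vec, curr_vec in zip(prev_layer, curr_layer)
    ]


def keep_high_variation(prev_layer, curr_layer, keep_ratio):
    """Return indices of tokens to keep, in their original order.
    Tokens with the smallest inter-layer variation are dropped."""
    scores = token_variation(prev_layer, curr_layer)
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

Applying `keep_high_variation` at several layers with a ratio below 1 gives a progressive schedule. Because the score depends only on hidden states and not on attention weights, a scheme of this kind does not need access to the attention matrix, which is consistent with the FlashAttention compatibility the notes mention.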
VecGlypher: Unified Vector Glyph Generation with Language Models: This paper proposes VecGlypher, the first unified language model for text- and image-guided vector glyph generation. Through a two-stage training pipeline (large-scale SVG syntax learning followed by expert-annotated alignment), it autoregressively generates editable SVG paths directly, without intermediate rasterization or vectorization post-processing.
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping: This paper defines the novel task of Aesthetic Guidance (AG) and constructs the AesGuide benchmark (10,748 photos annotated with aesthetic scores, analyses, and guidance), then proposes Venus, a two-stage framework that first empowers MLLMs with aesthetic guidance capability via progressive aesthetic QA, and subsequently activates aesthetic cropping capability through CoT reasoning, achieving state-of-the-art performance on both tasks.
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving: This paper proposes VGGDrive, a framework that injects cross-view geometric perception into VLMs via a frozen 3D visual foundation model (VGGT). A plug-and-play CVGE module is designed to hierarchically and adaptively fuse 3D features into the 2D visual embeddings at each VLM layer, achieving significant performance gains across five autonomous driving benchmarks.
Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models: This paper proposes VisionToM, a lightweight vision-based intervention framework that probes and intervenes on attention heads sensitive to visual input and ToM reasoning within MLLMs. Without fine-tuning the backbone, VisionToM substantially enhances Theory of Mind reasoning in multimodal large language models, achieving significant performance gains on the EgoToM benchmark.
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion: This paper proposes VideoFusion, the first large-scale infrared-visible video fusion framework, which jointly models cross-modal complementarity and temporal dynamics via cross-modal differential reinforcement, complete-modality guided fusion, and bidirectional temporal collaborative attention, generating spatiotemporally consistent high-quality fusion videos. The authors also construct the M3SVD dataset comprising 220 videos and 153,797 frames.
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion: This paper introduces the M3SVD large-scale infrared-visible video dataset (220 videos / 150K frames) and proposes the VideoFusion framework, which achieves spatio-temporal collaborative multi-modal video fusion via a Cross-modal Differential Reinforcement Module (CmDRM), Complete Modal Guided Fusion (CMGF), Bidirectional Co-Attention Module (BiCAM), and a variational consistency loss. The method surpasses existing image fusion and video fusion approaches in both fusion quality and temporal consistency.
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting: ViKey overlays frame-index visual prompts (VPs) onto video frames and incorporates a lightweight Keyword-Frame Mapping (KFM) module to significantly improve temporal reasoning in VideoLLMs without any training, achieving near-dense-frame performance with as few as 20% of frames.
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking: ViRC proposes a Reason Chunking mechanism that structures multimodal mathematical CoT into sequential Critical Reasoning Units (CRUs), simulating the process by which human experts repeatedly consult visual information and incrementally verify intermediate propositions. Through the CRUX dataset and a progressive training strategy (Instructional SFT → Practice SFT → Strategic RL), ViRC-7B achieves an average improvement of 18.8% across mathematical benchmarks.
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning: This paper proposes MedCBR, a framework that integrates clinical diagnostic guidelines (e.g., BI-RADS) into the training and inference pipeline of concept bottleneck models. It uses LVLMs to generate guideline-consistent reports for enhanced concept supervision, and combines multi-task CLIP training with a large reasoning model for structured clinical explanation generation. MedCBR achieves AUROCs of 94.2% and 84.0% on ultrasound and mammography cancer detection, respectively.
VISion On Request: Enhanced VLLM Efficiency with Sparse, Dynamically Selected, Vision-Language Interactions: VISOR proposes a new efficiency paradigm distinct from visual token compression — by sparsifying vision-language interaction layers within the LLM (a small number of cross-attention layers plus dynamically selected self-attention layers), it achieves 8.6–18× FLOPs savings while retaining all high-resolution visual tokens, substantially outperforming token compression methods on challenging tasks that require fine-grained understanding.
VL-RouterBench: A Benchmark for Vision-Language Model Routing: This paper introduces VL-RouterBench, the first systematic routing benchmark for vision-language models, encompassing 14 datasets, 17 candidate models, and 519,180 sample-model pairs. It evaluates 10 routing methods and reveals a significant gap between the current best router and the ideal Oracle.
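The gap between a router and the ideal Oracle mentioned above can be made concrete with a small sketch. It assumes a binary correctness matrix with one row per sample and one column per candidate model, and it measures accuracy only; the toy data and the function names are hypothetical, and the benchmark's actual metrics may also account for other factors such as cost.

```python
def router_accuracy(correct, choices):
    """Accuracy when sample i is routed to model choices[i]."""
    return sum(row[c] for row, c in zip(correct, choices)) / len(correct)

def oracle_accuracy(correct):
    """Upper bound: a sample counts as solved if any candidate model solves it."""
    return sum(1 for row in correct if any(row)) / len(correct)

# Hypothetical toy data: 4 samples x 3 candidate models (1 = correct answer).
correct = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
]
choices = [0, 0, 2, 2]  # hypothetical router decisions, one model index per sample

gap = oracle_accuracy(correct) - router_accuracy(correct, choices)
print(gap)  # 0.25: the oracle solves 3 of 4 samples, this router solves 2 of 4
```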
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery: This paper proposes a VLM-guided dual-memory self-reflective Critique Agent that generates group-level preference signals for diffusion-based human mesh recovery (HMR), followed by Group Preference Alignment fine-tuning of the diffusion model. The approach substantially improves in-the-wild HMR accuracy without requiring any 3D annotations.
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models: This paper proposes VLM-Loc, a framework that converts 3D point cloud maps into BEV images and scene graphs for structured spatial reasoning with VLMs, and introduces a Partial Node Assignment (PNA) mechanism for fine-grained text-to-point-cloud localization. On the newly constructed CityLoc benchmark, VLM-Loc achieves a 14.20% improvement in Recall@5m over the previous state of the art.
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm: This paper proposes VLM-Pruner, a training-free centrifugal token pruning method that balances redundancy elimination and local detail preservation through a Buffering for Spatial Sparsity (BSS) criterion. At an 88.9% pruning rate, it consistently outperforms existing methods across 5 VLMs while achieving end-to-end inference acceleration.
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks: This paper presents the first systematic study of model inversion (MI) attacks against VLMs, proposing a suite of inversion strategies tailored to token generation (TMI/TMI-C/SMI) and an adaptive attention-weighted method SMI-AW that dynamically weights token gradient contributions based on visual attention intensity. Evaluated across 4 VLMs and 3 datasets, SMI-AW achieves up to 61.21% human-evaluated attack accuracy, revealing severe training data privacy leakage risks in VLMs.
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments: This paper introduces VS-Bench, a multimodal benchmark comprising ten visual game environments, which systematically evaluates VLMs' strategic abilities in multi-agent settings across three dimensions—perception, strategic reasoning, and decision-making. Results reveal that even the strongest current models exhibit significant gaps from optimal performance in reasoning and decision-making.
Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training: Wan-Weaver proposes a decoupled architecture consisting of a planner (VLM) and a visualizer (DiT). By training the planner on large-scale textual-proxy data instead of real interleaved data, it achieves an Overall score of 8.67 on OpenING—approaching Nano Banana's 8.85—while maintaining strong comprehension capability (MMMU 74.9) and state-of-the-art interleaved text-image generation.
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs: This paper diagnoses the Time-Agnosticism problem in current Video-LLMs and proposes the WeaveTime framework. During training, a temporal reconstruction auxiliary task (SOPE) endows the model with temporal awareness; during inference, an uncertainty-gated coarse-to-fine memory cache (PCDF-Cache) enables efficient adaptive memory retrieval, achieving significant gains on streaming video QA.
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs: This paper diagnoses the Time-Agnosticism problem in current Video-LLMs and proposes WeaveTime, a framework that endows models with temporal awareness via a Shuffled-Order Prediction Enhancement (SOPE) auxiliary task during training, and achieves efficient adaptive memory retrieval at inference via an uncertainty-gated coarse-to-fine memory cache (PCDF-Cache), yielding significant gains on streaming video QA benchmarks.
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models: This paper proposes EmbedLens, a probing tool for systematically analyzing the internal structure of visual tokens in MLLMs. It reports three findings: visual tokens fall into three categories (sink, dead, and alive), with approximately 40% being uninformative; alive tokens already encode rich semantics before entering the LLM (a "pre-linguistic" property); and intra-LLM visual computation is redundant for most tasks, so direct mid-layer injection suffices.
When to Think and When to Look: Uncertainty-Guided Lookback: This paper presents the first systematic analysis of the effect of test-time thinking on visual reasoning in LVLMs. It reveals that "looking more is better than thinking more"—long reasoning chains frequently neglect the image, producing "long-wrong" trajectories. Based on this finding, the authors propose an uncertainty-guided lookback decoding strategy that injects visual re-inspection prompts when reasoning chains drift, achieving 2–6 point improvements on MMMU and five other benchmarks without modifying the model.
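A minimal sketch of the gating idea described above, assuming that uncertainty is measured as the entropy of the next-token distribution and that a fixed threshold decides when to inject a re-inspection prompt. The prompt wording, the threshold, and the toy token stream are hypothetical; in an actual decoder the model would continue generating after the injected prompt, whereas here a fixed stream stands in for it.

```python
import math

LOOKBACK_PROMPT = "Let me look at the image again."  # hypothetical wording

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_with_lookback(steps, threshold):
    """steps: list of (token, next-token probability distribution) pairs."""
    out = []
    for token, probs in steps:
        # High entropy is treated as a sign that the reasoning chain is drifting.
        if entropy(probs) > threshold:
            out.append(LOOKBACK_PROMPT)
        out.append(token)
    return out

# Hypothetical toy stream: only the second step is uncertain.
steps = [
    ("The", [0.97, 0.02, 0.01]),
    ("answer", [0.4, 0.3, 0.3]),
    ("is", [0.9, 0.1]),
]
result = decode_with_lookback(steps, threshold=1.0)
print(result)
```

Since the gate acts only on output probabilities, it leaves the model weights untouched, which matches the "without modifying the model" property stated in the note.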
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs: This paper identifies a phenomenon in which existing token pruning methods underperform random pruning in deep layers of VLLMs, proposes a method for quantifying visual token information based on changes in output probability, and reveals the "Information Horizon"—a critical layer at which visual token information uniformly dissipates to zero. The position of this horizon is dynamically influenced by task visual complexity and model capability. The paper further demonstrates that simply integrating random pruning can effectively improve existing methods.
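The "Information Horizon" described above can be sketched as follows. This assumes that visual token information at a layer is the absolute change in the probability of the answer when all visual tokens are removed at that layer, and that the horizon is the first layer from which this change stays below a small tolerance. The function names, the tolerance `eps`, and the probabilities are hypothetical.

```python
def visual_information(p_full, p_removed_at_layer):
    """Change in answer probability when visual tokens are removed at each layer."""
    return [abs(p_full - p) for p in p_removed_at_layer]

def information_horizon(info, eps=0.01):
    """First layer from which the information stays below eps, or None."""
    horizon = None
    # Walk backwards from the last layer while the information is negligible.
    for layer in range(len(info) - 1, -1, -1):
        if info[layer] < eps:
            horizon = layer
        else:
            break
    return horizon

# Hypothetical probabilities for a 7-layer toy model.
p_full = 0.90
p_removed = [0.10, 0.25, 0.60, 0.88, 0.895, 0.90, 0.90]

info = visual_information(p_full, p_removed)
print(information_horizon(info))  # 4
```

Beyond the horizon every visual token carries roughly the same (near-zero) information, which is consistent with the note's observation that importance-based pruning has nothing left to rank there and random pruning does no worse.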
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation: This paper proposes Eagle, a lightweight black-box attribution framework that performs spatial attribution for autoregressive token generation in MLLMs via a unified objective combining insight score (sufficiency) and necessity score (indispensability), and quantifies whether each generated token relies on language priors or perceptual evidence. Eagle comprehensively outperforms existing methods in faithfulness, localization, and hallucination diagnosis while substantially reducing GPU memory requirements.
Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models: This paper proposes CORE (COncept-aware REfuser), a framework for continual unlearning in large vision-language models (LVLMs). It decomposes vision-language deletion targets into fine-grained visual attribute concepts and textual intent concepts, employs a concept modulator to identify concept combinations requiring refusal, and generates concept-aligned refusal responses via a mixture of refusal experts (refusers). CORE achieves state-of-the-art unlearning-retention trade-offs of 90.67% CRR and 88.02% AR across 16 sequential unlearning tasks.
Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs: This paper is the first to formally define the Widget-to-Code task, constructing the first image-only widget dataset and a multi-dimensional evaluation framework. It proposes a modular baseline built upon a Perceptual Agent and the WidgetFactory infrastructure, achieving high-fidelity widget reconstruction through component decomposition, icon retrieval, reusable visualization templates, and adaptive rendering.
Zina: Multimodal Fine-grained Hallucination Detection and Editing: Zina formalizes the task of multimodal fine-grained hallucination detection and editing, proposes a two-stage system (detector MLLM + reviewer MLLM) that delegates token copying to a deterministic function to reduce model burden, constructs the VisionHall dataset (6.9K human-annotated + 20K graph-based synthetic samples), and surpasses GPT-4o by 15.8 points in detection F1.

🧊 3D Vision

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image: This paper proposes a novel paradigm termed in-place completion, which extends pretrained object-level generative priors to the scene level, directly completing fragmented geometry at its original spatial location without explicit pose alignment. The authors also construct ARSG-110K, a 110K-scale scene-level dataset, and substantially outperform baselines such as MIDI and Gen3DSR.
3D-IDE: 3D Implicit Depth Emergent: This paper proposes the Implicit Geometry Emergence Principle (IGEP), which employs a lightweight geometric validator and a global 3D teacher for privileged supervision during training, enabling the visual encoder to acquire 3D perception from RGB video input alone. The approach incurs zero latency overhead at inference time and surpasses comparable methods on multiple 3D scene understanding benchmarks.
3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction: This paper proposes Self-Constrained Priors (SCP), which construct a TSDF distance field by fusing depth maps rendered from the current 3D Gaussians. This field serves as a prior to impose geometry-aware constraints on Gaussians (outlier removal, opacity constraint, and surface attraction), enabling high-fidelity surface reconstruction that achieves state-of-the-art performance on NeRF-Synthetic and DTU benchmarks.
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds: This paper proposes LAM3C, a framework that, for the first time, demonstrates that video-generated point clouds (VGPCs) reconstructed from unlabeled online videos (e.g., property walkthroughs) can replace real 3D scans for 3D self-supervised pre-training. By introducing a Laplacian smoothing loss and a noise consistency loss to stabilize representation learning on noisy point clouds, and paired with the authors' RoomTours dataset (49K scenes), LAM3C matches or surpasses methods that rely on real 3D scans on indoor semantic and instance segmentation benchmarks.
3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience: This paper proposes 3DrawAgent, a training-free framework that enables a frozen LLM to acquire 3D spatial reasoning through contrastive knowledge extraction (CKE), generating language-driven 3D Bézier sketches in an autoregressive manner without any parameter updates, achieving performance competitive with trained methods.
4C4D: 4 Camera 4D Gaussian Splatting: This paper proposes the 4C4D framework, which employs a Neural Decaying Function to adaptively control Gaussian opacity decay, addressing the geometry–appearance learning imbalance in sparse-view (only 4 cameras) 4D Gaussian Splatting, and achieves state-of-the-art performance across multiple benchmarks.
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video: This work decomposes equine 4D reconstruction into two sub-tasks — motion estimation (AniMoFormer: spatiotemporal Transformer + post-optimization) and appearance reconstruction (EquineGS: feed-forward 3DGS) — bridged by the VAREN parametric model. Trained exclusively on synthetic data (VarenPoser + VarenTex), the method achieves state-of-the-art performance on real-world benchmarks APT-36K and AiM, and generalizes zero-shot to zebras and donkeys.
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video: This paper proposes the 4DEquine framework, which disentangles 4D equine reconstruction from monocular video into two subproblems — dynamic motion estimation (AniMoFormer) and static appearance reconstruction (EquineGS) — achieving SOTA on real-world data while training exclusively on synthetic data.
A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering: This work constructs A2Z, a large-scale multimodal CAD dataset comprising 1M+ complex models with 10M+ annotations (high-resolution 3D scans, hand-drawn 3D sketches, text descriptions, and BRep topology labels), providing an unprecedented data foundation for Scan-to-BRep reverse engineering and multimodal BRep learning. Foundation models trained on A2Z substantially outperform existing methods on edge and junction detection.
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection: This paper proposes SeDiR, a framework for semantically disentangled unified 3D anomaly detection, comprising three modules: Coarse-to-Fine Global Tokenization (CFGT), Category-Conditioned Contrastive Learning (C3L), and Geometry-Guided Decoder (GGD). SeDiR addresses the Inter-Category Entanglement (ICE) problem and outperforms the state of the art by 2.8% and 9.1% AUROC on Real3D-AD and Anomaly-ShapeNet, respectively.
GAP: Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation: GAP leverages a pretrained 3D geometric foundation model (π³) to extract 3D features, fuses them with 2D semantic features and proprioception, and jointly predicts future action sequences and future 3D point maps via conditional diffusion, achieving state-of-the-art performance on RoboTwin 2.0 and real-world bimanual manipulation benchmarks.
Action-guided Generation of 3D Functionality Segmentation Data: This paper presents SynthFun3D, the first method for automatically generating 3D functionality segmentation training data from action descriptions. By leveraging metadata-driven 3D object retrieval and scene layout generation, it produces precise part-level interaction masks without manual annotation. Training on combined synthetic and real data yields +2.2 mAP / +6.3 mAR / +5.7 mIoU improvements on the SceneFun3D benchmark.
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion: ActionMesh minimally extends a pretrained 3D diffusion model with a temporal axis (temporal 3D diffusion), then employs a temporal 3D autoencoder to convert independent shape sequences into topology-consistent animated meshes. The method generates production-quality animated 3D meshes from diverse inputs (video, text, or 3D mesh) in just 2 minutes, achieving state-of-the-art performance in both geometric accuracy and temporal consistency.
Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation: To address the slow multi-step denoising of diffusion policies and the mode-averaging collision problem of one-step Flow Matching, this paper proposes Ada3Drift, a training-time drifting field that attracts predictions toward the nearest expert demonstration while repelling other modes, combined with multi-scale field aggregation and a sigmoid-scheduled loss transition. It preserves multimodal action distributions under 1 NFE inference and reaches SOTA on Adroit, Meta-World, RoboTwin, and real robots.
Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation: Ada3Drift proposes shifting the iterative refinement of diffusion policies from inference time to training time. By introducing a training-time drifting field—attracting predicted actions toward expert modes while repelling other generated samples—it achieves high-fidelity one-step (1 NFE) 3D visuomotor policies, reaching state-of-the-art performance on Adroit, Meta-World, RoboTwin, and real-robot tasks, with a 10× speedup at inference.
Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning: BayesMM proposes a training-free dynamic Bayesian distribution learning framework that models textual and geometric modalities as Gaussian distributions and automatically balances modality weights via Bayesian model averaging, achieving robust test-time adaptation across multiple point cloud benchmarks with an average improvement exceeding 4%.
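A generic sketch of Bayesian model averaging over two modalities, in the spirit of the note above but not BayesMM's actual formulation. It assumes that each modality defines one Gaussian per class over a shared one-dimensional feature, with uniform priors over classes and modalities; all names and numbers are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def bma_posterior(x, modalities):
    """modalities: {name: [(mean, std) for each class]}; uniform priors assumed."""
    n_classes = len(next(iter(modalities.values())))
    likelihoods = {
        name: [gaussian_pdf(x, mean, std) for mean, std in params]
        for name, params in modalities.items()
    }
    # Evidence of each modality: likelihood of x marginalized over classes.
    evidence = {name: sum(lik) / n_classes for name, lik in likelihoods.items()}
    total = sum(evidence.values())
    weights = {name: ev / total for name, ev in evidence.items()}
    # Average the per-modality class posteriors using the modality weights.
    posterior = [
        sum(weights[name] * likelihoods[name][c] / sum(likelihoods[name])
            for name in modalities)
        for c in range(n_classes)
    ]
    return weights, posterior

# Hypothetical class distributions for two classes in each modality.
modalities = {
    "text": [(0.0, 1.0), (4.0, 1.0)],
    "geometry": [(1.0, 1.0), (3.0, 1.0)],
}
weights, posterior = bma_posterior(0.5, modalities)
print(weights, posterior)
```

The modality weights are not tuned by hand: a modality whose Gaussians explain the test feature better receives a larger weight automatically, which is the balancing behavior the note attributes to Bayesian model averaging.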
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction: This paper proposes AeroDGS, a physics-guided 4D Gaussian Splatting framework for monocular UAV video. It introduces a Monocular Geometry Lifting module to reconstruct reliable static and dynamic geometry, and incorporates differentiable physical priors — ground support, upright stability, and trajectory smoothness — to resolve ambiguous image cues into physically consistent motion estimates, outperforming existing methods on both synthetic and real UAV scenes.
AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis: AffordGrasp proposes a diffusion-based cross-modal framework that generates physically plausible and semantically consistent hand grasp poses from text instructions and object point clouds, via affordance-guided latent diffusion and a Distribution Adjustment Module (DAM), significantly outperforming existing methods on four benchmarks.
AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers: AffordMatcher proposes a method for localizing affordance regions in 3D scenes from visual signifiers (RGB images depicting human interactions). Through the large-scale AffordBridge dataset and a Match-to-Match attention mechanism based on dissimilarity matrices, it achieves 53.4 mAP on zero-shot affordance segmentation, surpassing the second-best method by 7.8 points.
Affostruction: 3D Affordance Grounding with Generative Reconstruction: This paper proposes Affostruction, which completes object geometry (including unobserved regions) via sparse voxel fusion-based generative reconstruction, models the multimodal distribution of affordances using Flow Matching, and performs affordance region localization on complete 3D shapes — achieving a 54.8% improvement in reconstruction IoU and a 40.4% improvement in affordance aIoU.
AnchorSplat: Feed-Forward 3D Gaussian Splatting with 3D Geometric Priors: AnchorSplat proposes an anchor-aligned feed-forward 3DGS framework that leverages 3D geometric priors (sparse point clouds) as anchors to predict Gaussians directly in 3D space. Using approximately 20× fewer Gaussians and half the reconstruction time, it achieves state-of-the-art performance on ScanNet++ v2 (PSNR 21.48) with superior depth estimation accuracy.
AnthroTAP: Learning Point Tracking with Real-World Motion: AnthroTAP proposes an automated pipeline that generates large-scale pseudo-labeled point tracking data from real-world human motion videos via SMPL fitting and optical flow filtering. Using only 1.4K videos and 4 GPUs for one day of training, it achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing BootsTAPIR which uses 15M videos.
AnyPcc: Compressing Any Point Cloud with a Single Universal Model: AnyPcc proposes a Universal Context Model (UCM) that integrates dual-granularity spatial and channel priors, combined with an Instance-Adaptive Fine-Tuning (IAFT) strategy, to achieve state-of-the-art point cloud geometry compression across 15 diverse datasets using a single model, yielding approximately 12% bitrate reduction over G-PCC v23.
APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition: APC proposes a lightweight input-level purification module that neutralizes adversarial attacks by generating point-wise counter-perturbations, trained under dual geometric and semantic consistency constraints to achieve strong robustness across diverse attacks and models.
ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions: ArtHOI presents the first complete pipeline for reconstructing 4D interactions between hands and articulated objects (e.g., scissors, glasses, laptops) from monocular RGB video. Through Adaptive Sampling Refinement (ASR) for metric scale and pose estimation, and an MLLM-guided hand-object alignment strategy, the method outperforms the baseline RSRD—which requires pre-scanned object geometry—across multiple datasets.
ArtLLM: Generating Articulated Assets via 3D LLM: ArtLLM formulates articulated object generation as a language generation problem. A 3D multimodal LLM autoregressively predicts part layouts and kinematic joint parameters (discretized as tokens) from point cloud input, followed by XPart-based high-fidelity part geometry synthesis. The method significantly outperforms existing approaches on PartNet-Mobility (mIoU 0.69) with inference in only 19 seconds.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models: This paper proposes AVA-Bench, the first systematic evaluation benchmark that decouples the capabilities of vision foundation models (VFMs) into 14 atomic visual abilities (AVAs). By aligning training-test distributions and isolating individual abilities during evaluation, AVA-Bench precisely identifies the strengths and weaknesses of VFMs, and reveals that a 0.5B small model can maintain VFM ranking consistency comparable to a 7B model.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models: This paper proposes AVA-Bench, which decomposes the evaluation of vision foundation models (VFMs) into 14 "atomic visual abilities" (AVAs). Through train/test distribution alignment and single-ability isolation testing, AVA-Bench precisely identifies the strengths and weaknesses of VFMs. A key finding is that a 0.5B LLM preserves the same VFM ranking as a 7B LLM, reducing evaluation cost by \(8\times\).
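Ranking consistency between a small and a large LLM, as discussed in the two AVA-Bench notes above, can be quantified with a rank correlation. The sketch below uses Kendall's tau without tie correction (tau-a); the choice of metric and the scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores of four VFMs, evaluated with a small and a large LLM.
small_llm = [0.61, 0.55, 0.70, 0.48]
large_llm = [0.72, 0.69, 0.80, 0.60]

tau = kendall_tau(small_llm, large_llm)
print(tau)  # 1.0: both LLMs rank the four VFMs identically
```

A tau of 1.0 means the cheaper evaluator can stand in for the larger one when only the ordering of VFMs matters; values below 1.0 indicate that some pairs swap places.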
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization: AvatarPointillist proposes an autoregressive (AR) generative framework for constructing 4D Gaussian avatars: a decoder-only Transformer generates 3DGS point clouds (with binding information) token by token, followed by a Gaussian Decoder that predicts rendering attributes for each point. This approach breaks free from fixed template topology, enables adaptive point density adjustment, and comprehensively outperforms baselines such as LAM and GAGAvatar on NeRSemble.
Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection: BTP is the first work to apply pretrained point-language models (PLMs, e.g., ULIP) to zero-shot 3D anomaly detection. It proposes a Multi-Granularity Feature Embedding Module (MGFEM) that fuses patch-level semantics, geometric descriptors, and global CLS tokens, coupled with a joint representation learning strategy. BTP achieves 84.5% point-level AUROC on Real3D-AD, substantially outperforming the VLM-rendering-based method PointAD (73.5%).
BRepGaussian: CAD Reconstruction from Multi-View Images with Gaussian Splatting: BRepGaussian is the first method to reconstruct complete B-rep CAD models directly from multi-view images. It employs a two-stage 2D Gaussian splatting framework to learn edge and patch features, followed by parametric fitting to produce watertight boundary representations, without requiring point cloud supervision.
BulletGen: Improving 4D Reconstruction with Bullet-Time Generation: BulletGen generates novel views at selected "bullet-time" frozen frames using a static video diffusion model. The generated views are precisely localized and used to supervise 4D Gaussian scene optimization, achieving state-of-the-art performance in extreme novel view synthesis and 2D/3D tracking from monocular video input alone.
Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?: This paper proposes TABLeT, which leverages a pretrained 2D natural image autoencoder (DCAE) to compress 3D fMRI volumes into as few as 27 continuous tokens per frame. Paired with a standard Transformer encoder, this enables unprecedented long-range temporal modeling (256 frames), surpassing SOTA voxel-based methods on multiple tasks across UKB, HCP, and ADHD-200, while significantly improving computational efficiency.
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction: CARI4D is the first category-agnostic method for reconstructing metric-scale 4D human-object interactions from monocular RGB video. It covers object shape reconstruction, pose tracking, hand contact reasoning, and physics-constrained optimization, and generalizes zero-shot to unseen categories.
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation: This paper proposes Catalyst4D, a framework that propagates high-quality 3D static editing results into 4D dynamic Gaussian scenes through two modules — Anchor-based Motion Guidance (AMG) and Color Uncertainty-guided Appearance Refinement (CUAR) — achieving spatiotemporally consistent, high-fidelity dynamic scene editing.
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation: This paper proposes Catalyst4D, a framework that propagates mature 3D static editing results into 4D dynamic Gaussian scenes via Anchor-based Motion Guidance (AMG, which establishes region-level correspondences using optimal transport) and Color Uncertainty-guided Appearance Refinement (CUAR, which automatically identifies and corrects occlusion artifacts). The method consistently outperforms existing approaches in CLIP semantic similarity.
CGHair: Compact Gaussian Hair Reconstruction with Card Clustering: CGHair achieves over 200× compression of appearance parameters and 4× acceleration in strand reconstruction while maintaining comparable visual quality, via hair-card-guided hierarchical clustering and a shared Gaussian appearance codebook.
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion: This paper presents the first scene change detection (SCD) method that simultaneously achieves online inference, pose-agnosticism, label-free operation, and multi-view consistency. By replacing hard-threshold heuristics with a self-supervised fusion (SSF) loss that integrates pixel-level and feature-level change cues into a 3DGS change representation, the proposed approach surpasses all existing offline methods in detection accuracy while operating in real time at over 10 FPS.
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation: CLIPoint3D is the first CLIP-based few-shot unsupervised 3D point cloud domain adaptation framework. Through knowledge-driven prompt tuning, parameter-efficient fine-tuning, entropy-guided view selection, and an uncertainty-aware alignment loss, it achieves consistent accuracy improvements of 3–16% on PointDA-10 and GraspNetPC-10 with only ~11M trainable parameters.
CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration: CMHANet deeply integrates 2D image texture-semantic features with 3D point cloud geometric features via a cross-modal hybrid attention mechanism combined with a contrastive learning objective, achieving state-of-the-art point cloud registration performance on 3DMatch/3DLoMatch.
CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration: CMHANet proposes a three-stage hybrid attention mechanism (geometric self-attention → image aggregation attention → source-target cross-attention) to fuse 2D image texture semantics with 3D point cloud geometric information, complemented by a cross-modal contrastive loss. The method achieves state-of-the-art registration recall on 3DMatch/3DLoMatch (92.4%/75.5%) and a zero-shot RMSE of only \(0.76\times10^{-2}\) on TUM RGB-D.
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass: CHROMM is a unified framework that jointly estimates camera parameters, scene point clouds, and human body meshes (SMPL-X) from multi-person multi-view video in a single forward pass, without external modules or preprocessed data. It achieves competitive performance on global human motion estimation and multi-view pose estimation tasks while being more than 8× faster than optimization-based methods.
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass: CHROMM is a unified framework that integrates the geometric prior of Pi3X and the human prior of Multi-HMR into a single feed-forward network, enabling joint reconstruction of cameras, scene point clouds, and SMPL-X human meshes from multi-person multi-view video in a single pass—without external modules, preprocessing, or iterative optimization. It achieves a multi-view WA-MPJPE of 53.1 mm on RICH and runs more than 8× faster than HAMSt3R.
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation: Context-Nav elevates the contextual information embedded in long-form textual descriptions from a posterior verification signal to a proactive exploration prior. By constructing a context-driven value map to guide frontier selection and performing viewpoint-aware 3D spatial relation verification at candidate target locations, Context-Nav achieves state-of-the-art performance on InstanceNav and CoIN-Bench without any task-specific training.
Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment: This paper proposes GSA (Gaussian Splatting Alignment), the first method for category-level cross-instance registration of 3DGS models. It combines geometry-aware feature-guided coarse alignment (extending ICP to solve similarity transformations) with multi-view feature consistency fine alignment, substantially outperforming existing methods in both same-instance and cross-instance scenarios.
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image: CrowdGaussian proposes a unified framework for reconstructing multi-person 3D Gaussian splatting representations from a single image. It recovers complete geometry of occluded regions via a self-supervised-adapted Large Occluded Human Reconstruction Model (LORM), and enhances texture detail quality through a single-step diffusion refiner (CrowdRefiner) trained with Self-Calibrated Learning (SCL).
CUBE: Representing 3D Faces with Learnable B-Spline Volumes: This paper proposes CUBE (Control-based Unified B-spline Encoding), a hybrid geometric representation combining B-spline volumes with learnable high-dimensional control features. Through a two-stage decoding pipeline (B-spline basis interpolation followed by a lightweight MLP residual), CUBE enables editable, high-fidelity 3D face reconstruction and scan registration.
CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization: CustomTex is a framework that achieves high-fidelity, instance-controllable texture generation for 3D indoor scenes through instance-level multi-reference image conditioning and a dual distillation training strategy (semantic-level VSD distillation + pixel-level super-resolution distillation), surpassing existing methods in semantic consistency, texture sharpness, and reduction of baked-in shading.
Dark3R: Learning Structure from Motion in the Dark: Dark3R is a teacher-student distillation framework that transfers the 3D priors of MASt3R to extremely low-light (SNR < −4 dB) raw images, enabling Structure from Motion (SfM) and novel view synthesis in dark environments where traditional methods fail entirely.
DeepShapeMatchingKit: Accelerated Functional Map Solver and Shape Matching Pipelines Revisited: This paper proposes a vectorized reformulation of the functional map solver achieving a 33× speedup, identifies and documents two undocumented implementation variants of DiffusionNet, introduces balanced accuracy as a supplementary metric for partial matching evaluation, and releases a unified open-source codebase.
Deformation-based In-Context Learning for Point Cloud Understanding: This paper proposes DeformPIC, which reframes point cloud In-Context Learning from a "masked reconstruction" paradigm to a "deformation transfer" paradigm. A Deformation Extraction Network (DEN) extracts task-specific semantics, and a Deformation Transfer Network (DTN) applies the extracted deformation to the query point cloud, achieving CD reductions of 1.6/1.8/4.7 on reconstruction/denoising/registration respectively.
DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization: This paper natively integrates the Kannala-Brandt fisheye projection model into the 3DGS pipeline and proposes a cross-view joint optimization strategy based on feature overlap, eliminating the information loss caused by pre-undistortion and achieving state-of-the-art performance on multiple public benchmarks.
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis: This paper proposes DMAligner, which reformulates image alignment from the traditional optical flow warping paradigm into an "alignment-oriented view synthesis" task. By leveraging a conditional diffusion model to directly generate complete aligned images, and combining a purpose-built DSIA synthetic dataset with a Dynamics-aware Mask Producing (DMP) module, DMAligner effectively eliminates the ghosting and occlusion artifacts inherent to warp-based methods, achieving state-of-the-art performance across multiple benchmarks.
DROID-W: DROID-SLAM in the Wild: This paper proposes DROID-W, which introduces uncertainty estimation into differentiable Bundle Adjustment (Uncertainty-aware BA), combined with a DINOv2-feature-driven dynamic uncertainty update mechanism and monocular depth regularization, enabling robust real-time camera pose estimation and scene reconstruction for DROID-SLAM in highly dynamic in-the-wild scenarios at approximately 10 FPS.
DropAnSH-GS: Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting: To address overfitting in 3DGS under sparse-view settings, this paper proposes DropAnSH-GS, which replaces independent random Dropout with Anchor-based Dropout—dropping entire clusters of spatially correlated Gaussians around selected anchors to disrupt local redundancy compensation—while introducing Spherical Harmonics (SH) Dropout to suppress high-order SH overfitting and enable lossless post-training compression.
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction: DuoMo decomposes world-space human motion reconstruction into two independent diffusion models: a camera-space model that extracts generalizable motion estimates from video in camera coordinates, and a world-space model that refines the noisy lifted proposals into globally consistent world-space motion. By directly generating mesh vertex motion rather than SMPL parameters, DuoMo reduces W-MPJPE by 16% on EMDB and 30% on RICH.
Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields: This paper proposes PI-DEF, a physics-informed coordinate neural network framework that jointly reconstructs the 4D (temporal + 3D spatial) emissivity field and 3D velocity field of gas near a black hole. Under sparse EHT measurements, PI-DEF significantly outperforms BH-NeRF, which enforces hard Keplerian dynamical constraints.
E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training: E-RayZer is the first truly self-supervised feed-forward 3D Gaussian reconstruction model. It replaces RayZer's implicit latent scene representation with explicit 3D Gaussians, and incorporates a visual-overlap-based curriculum learning strategy. Without any 3D annotations, it learns geometrically grounded 3D-aware representations, drastically outperforming RayZer on pose estimation (RPA@5° rises from ≈0 to 90.8%). On downstream 3D tasks under frozen-backbone probing, it significantly surpasses mainstream pre-trained models such as DINOv3 and CroCo v2, and even rivals the supervised VGGT.
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction: This paper proposes E2EGS, a fully pose-free 3D reconstruction framework driven entirely by event streams. It extracts noise-robust edge maps from event streams via patch-based temporal consistency analysis, leverages edge information to guide Gaussian initialization and weighted loss optimization, and achieves high-quality trajectory estimation and 3D reconstruction without any depth model or RGB input.
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow: This paper proposes a feed-forward 3D asset editing framework built upon the TRELLIS 3D generation backbone. It achieves globally consistent geometric deformation in a sparse voxel latent space via Voxel FlowEdit, and recovers high-frequency details through normal-guided multi-view texture refinement.
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics: This paper proposes E3Flow, the first equivariant flow matching policy framework based on spherical harmonic representations. It introduces a Feature Enhancement Module (FEM) to dynamically fuse point cloud and image modalities, and combines rectified flow for efficient equivariant action generation. E3Flow achieves an average success rate 3.12% higher than the strongest baseline SDP across 8 MimicGen tasks while delivering a 7× speedup in inference.
Ego-1K: A Large-Scale Multiview Video Dataset for Egocentric Vision: This paper presents Ego-1K, a large-scale temporally synchronized egocentric multiview video dataset comprising 956 short clips (12+4 cameras, 60Hz), addressing the data gap in egocentric dynamic 3D reconstruction, and demonstrates that stereo depth guidance can substantially improve 4D novel view synthesis quality.
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding: This paper proposes EmbodiedSplat, the first online feed-forward semantic 3DGS framework. It achieves memory-efficient per-Gaussian semantic representation via a sparse coefficient field and a CLIP global codebook, and integrates 3D geometry-aware features to enable full-scene open-vocabulary 3D understanding at 5–6 FPS over 300+ streaming frames.
EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy: This paper reformulates the anisotropic slice reconstruction problem in volume electron microscopy (vEM) as a dynamic 3D scene rendering task based on deformable 2D Gaussian splatting, achieving high-fidelity continuous slice synthesis under sparse data conditions via a Teacher-Student pseudo-label mechanism.
EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization: This paper proposes EmoTaG, an emotion-aware 3D talking head synthesis framework built upon FLAME-Gaussian structural priors and a Gated Residual Motion Network (GRMN). It achieves few-shot personalization from as little as 5 seconds of video while jointly addressing emotion expressiveness, lip-audio synchronization, and geometric stability.
Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator: This paper proposes Hand4Whole++, a modular framework that injects features from a pretrained hand estimator into a frozen whole-body pose estimator via a lightweight CHAM module, enabling accurate wrist orientation prediction and transferring fine-grained finger joints and hand shape from a hand model via differentiable rigid alignment.
EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors: This paper proposes EventHub, a training data factory for event-based stereo matching that requires no annotation from active sensors such as LiDAR. It generates proxy event-depth pairs via novel view synthesis and transfers knowledge from RGB stereo models through cross-modal distillation. The resulting event stereo models surpass LiDAR-supervised counterparts in cross-domain generalization, reducing error by up to 50% on M3ED and MVSEC.
Extend3D: Town-Scale 3D Generation: This paper proposes Extend3D, a training-free 3D scene generation pipeline that extends the voxel latent space of a pretrained object-level 3D generative model (TRELLIS) and introduces overlapping patch joint denoising, under-noising SDEdit initialization, and 3D-aware optimization to generate town-scale 3D scenes from a single image, surpassing existing methods in both human preference evaluations and quantitative metrics.
ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting: This paper proposes the extrinsic paradigm, which fully decouples semantics from 3DGS geometry. By combining multi-granularity overlapping object grouping with VLM-generated text hypotheses, it constructs a lightweight semantic index layer that enables training-free, low-storage, and ambiguity-aware open-vocabulary 3D scene understanding.
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning: This paper proposes FaceCam, a system that addresses camera control in monocular portrait videos by using facial landmarks as a scale-aware camera representation, thereby avoiding the scale ambiguity inherent in conventional extrinsic camera representations. Two data augmentation strategies—synthetic camera motion and multi-clip stitching—are further designed to support continuous camera trajectory inference.
FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting: FACT-GS reframes texture parameterization as a sampling density allocation problem, employing a learnable deformation field to achieve frequency-adaptive non-uniform texture sampling, substantially improving high-frequency detail recovery under a fixed parameter budget.
Fall Risk and Gait Analysis using World-Spaced 3D Human Mesh Recovery: This paper proposes a gait analysis pipeline based on GVHMR (world-grounded 3D human mesh recovery) that extracts spatiotemporal gait parameters from monocular video of older adults performing the Timed Up and Go (TUG) test, validating the correlation between video-derived metrics and wearable sensor measurements as well as their association with fall risk.
Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration: This paper proposes Fast3Dcache, a training-free geometry-aware caching framework for 3D diffusion models. It dynamically allocates cache budgets via Predictive Cache Scheduling Constraint (PCSC) based on voxel stabilization patterns, and selects stable tokens for reuse via Spatiotemporal Stability Criterion (SSC) using velocity and acceleration signals. The method achieves up to 27.12% throughput improvement and 54.83% FLOPs reduction with only ~2% degradation in geometric quality.
Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction: This paper proposes Fast SceneScript, which introduces multi-token prediction (MTP) into structured language models for 3D scene understanding to accelerate inference. Combined with self-speculative decoding (SSD) and confidence-guided decoding (CGD) to filter unreliable tokens, as well as a parameter-efficient head-sharing mechanism, the method achieves 5.09× and 5.14× speedups on layout estimation and object detection respectively without accuracy loss.
FastGS: Training 3D Gaussian Splatting in 100 Seconds: FastGS is a multi-view consistency-based acceleration framework for 3DGS that precisely controls the Gaussian count via View-Consistent Densification (VCD) and View-Consistent Pruning (VCP). It trains a scene in approximately 100 seconds on datasets such as Mip-NeRF 360, a speedup of more than 15× over vanilla 3DGS with comparable rendering quality.
FF3R: Feedforward Feature 3D Reconstruction from Unconstrained Views: FF3R is the first fully annotation-free feedforward framework capable of jointly performing geometric reconstruction and open-vocabulary semantic understanding from unconstrained multi-view image sequences, achieving a 180× speedup over optimization-based methods when processing 64+ images.
FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction: This paper proposes FluidGaussian, which guides active view selection in 3D reconstruction using uncertainty metrics propagated through fluid simulation, yielding reconstructions that are not only visually faithful but also physically plausible under interactive simulation.
ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph: This paper proposes ForgeDreamer, a framework that addresses domain semantic adaptation in industrial settings via multi-expert LoRA teacher-student distillation, and achieves high-order geometric consistency constraints through cross-view hypergraph geometric enhancement, outperforming existing methods on industrial text-to-3D generation tasks.
Foundry: Distilling 3D Foundation Models for the Edge: This paper proposes the Foundation Model Distillation (FMD) paradigm and the Foundry framework. Through a compress-and-reconstruct objective, the student model learns a set of learnable SuperTokens to compress the basis vectors of the teacher's latent space. The resulting single distilled model retains generality across classification, segmentation, and few-shot tasks, while reducing FLOPs from 478G to as low as 137G.
FreeArtGS: Articulated Gaussian Splatting Under Free-Moving Scenario: FreeArtGS addresses articulated object reconstruction from monocular RGB-D video under a free-moving scenario, where both object pose and joint state change arbitrarily and simultaneously. The proposed three-stage pipeline — motion-driven part segmentation, robust joint estimation, and end-to-end 3DGS optimization — substantially outperforms all baselines on the newly introduced FreeArt-21 benchmark and existing datasets.
FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation: FreeScale scales limited real-world data into large-scale training sets by sampling high-quality free-view images from existing scene reconstructions guided by certainty estimation, achieving a 2.7 dB PSNR improvement on feed-forward novel view synthesis models.
FE2E: From Editor to Dense Geometry Estimator: This paper systematically analyzes the fine-tuning behavior of image editing models versus generative models for dense geometry estimation. It finds that editing models possess inherent structural prior advantages, and proposes the FE2E framework — the first to adapt a DiT-based image editing model as a joint depth and normal estimator — achieving substantial zero-shot improvements over existing SOTA (35% AbsRel reduction on ETH3D).
From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images: A two-stage pipeline for reconstructing city-scale 3D models from sparse satellite images: Z-Monotonic SDF for geometry to ensure structural integrity of buildings, followed by a fine-tuned FLUX diffusion model for "deterministic inpainting" that synthesizes photorealistic textures from degraded maps, enabling view extrapolation of nearly 90° from orbit to ground level.
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection: This work shifts keypoint detection from an "image-pair matching" paradigm to "sequence-level trackability optimization." The proposed reinforcement learning framework, TraqPoint, directly optimizes long-term keypoint tracking quality over image sequences, achieving state-of-the-art performance on pose estimation, visual localization, visual odometry, and 3D reconstruction tasks.
FunREC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos: This paper presents FunREC, a training-free optimization-based method that reconstructs functional articulated 3D digital twin scenes directly from egocentric RGB-D interaction videos. It automatically discovers articulated parts, estimates kinematic parameters, tracks 3D motion, and reconstructs both static and dynamic geometry. FunREC substantially outperforms prior methods across all benchmarks (part segmentation mIoU improves by 50+, and joint angle error is reduced by 5–10×) and supports simulation export and robotic interaction.
GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator: This paper proposes GaussFusion, a geometry-informed video-to-video generative model that conditions a video generator on a rendered Gaussian Primitives Buffer (GP-Buffer) — encoding depth, normals, opacity, and covariance — to effectively remove floaters, flickering, and blurring artifacts in 3DGS reconstructions. The framework is compatible with both optimization-based and feed-forward reconstruction paradigms, and its distilled variant achieves real-time inference at 16 FPS.
GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance: This paper proposes GaussianGrow, which replaces the conventional paradigm of jointly predicting geometry and appearance from scratch by "growing" 3D Gaussians from readily available 3D point clouds. It employs a geometry-aware multi-view diffusion model to generate consistent appearance supervision, and addresses view-fusion artifacts and invisible-region problems through an overlap-region detection mechanism coupled with an iterative inpainting strategy, achieving substantial improvements over state-of-the-art methods on both synthetic and real-scan point clouds.
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis: This paper proposes Data-to-Data Flow Matching (D2D-FM) to directly learn deterministic transformations between view pairs, and regularizes flow paths via probability density geodesics so that trajectories propagate along high-density data manifolds, achieving improved view consistency and geometric fidelity in novel view synthesis.
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis: This paper proposes a Probability Density Geodesic Flow Matching (PDG-FM) framework that replaces the noise-to-data diffusion process with a deterministic data-to-data flow matching scheme, and optimizes interpolation paths to traverse high-density regions of the data manifold via probability-density-based geodesics, achieving geometrically consistent novel view synthesis.
GGPT: Geometry-Grounded Point Transformer: This paper proposes the GGPT framework, which first obtains geometrically consistent sparse point clouds via an improved lightweight SfM pipeline (dense matching + sparse BA + DLT triangulation), then employs Point Transformer V3 to jointly process sparse geometric guidance and feed-forward dense predictions directly in 3D space for residual refinement. Trained exclusively on ScanNet++, GGPT significantly improves multiple feed-forward 3D reconstruction models across architectures and datasets without any fine-tuning.
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport: GLINT decomposes Gaussian representations into three components — interface, transmission, and reflection — and couples them with a hybrid rasterization+ray-tracing rendering pipeline, achieving state-of-the-art geometry and appearance reconstruction for scene-scale transparent surfaces such as glass walls and display cases.
Global-Aware Edge Prioritization for Pose Graph Initialization: This paper proposes a GNN-based global edge prioritization method that upgrades pose graph initialization from independent pairwise image retrieval to globally structure-aware edge ranking combined with multi-minimum-spanning-tree construction, achieving significant improvements in SfM reconstruction accuracy under extremely sparse settings.
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves: This paper proposes the Glove2Hand framework, which translates egocentric videos of instrumented sensing gloves into photorealistic bare-hand videos while preserving tactile and IMU signals. It also introduces HandSense, the first multi-modal hand-object interaction dataset, and demonstrates significant improvements on downstream bare-hand contact estimation and occluded hand tracking.
GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes: GP-4DGS integrates variational Gaussian Processes (GP) into 4D Gaussian Splatting, enabling probabilistic motion modeling via spatiotemporal composite kernels and variational inference, while endowing 4DGS with three new capabilities: uncertainty quantification, motion extrapolation, and adaptive motion priors.
GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning: This paper proposes GS-CLIP, a two-stage framework that injects global shape context and local defect information from 3D point clouds into text prompts via a Geometry Defect Distillation Module (GDDM), and employs a dual-stream LoRA architecture to synergistically fuse rendered images and depth maps, achieving state-of-the-art zero-shot 3D anomaly detection on four large-scale benchmarks.
Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs: Hg-I2P introduces a Heterogeneous Graph to jointly model relationships between 2D image regions and 3D point cloud regions. Through multi-path adjacency mining for learning cross-modal edges, heterogeneous-edge-guided feature adaptation, and graph-based projection consistency pruning, it achieves state-of-the-art generalization and accuracy across six indoor and outdoor cross-domain benchmarks.
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting: SplatHLoc proposes a hierarchical visual relocalization framework based on Feature Gaussian Splatting (FGS). By combining adaptive viewpoint retrieval that synthesizes virtual views closer to the query perspective with a hybrid feature matching strategy (rendered features for coarse matching + semi-dense matcher for fine matching), the method achieves new state-of-the-art accuracy on both indoor and outdoor benchmarks.
Human Interaction-Aware 3D Reconstruction from a Single Image: This paper proposes HUG3D, a framework that achieves high-fidelity textured 3D reconstruction of interacting multiple persons from a single image via perspective-to-orthographic view transformation, a group-instance multi-view diffusion model, and physics-aware geometry reconstruction, outperforming existing methods across CD/P2S/NC and other metrics.
Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry: This paper proposes a hybrid method combining the union-find exact cluster-size retrieval of eTFCE with the GRF analytical inference of pTFCE, achieving for the first time both exact cluster-size queries and analytical \(p\)-value computation without permutation testing, while running \(4.6\times\)–\(75\times\) faster than R pTFCE.
Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry: This work combines the union-find data structure of eTFCE (for exact cluster-size queries) with the GRF analytical inference of pTFCE, achieving for the first time within a single framework both exact cluster-size extraction and analytical \(p\)-values without permutation testing. Whole-brain VBM analysis is 4.6–75× faster than R pTFCE and three orders of magnitude faster than permutation-based TFCE.
HyperMVP: Hyperbolic Multiview Pretraining for Robotic Manipulation: This paper proposes HyperMVP, the first framework for 3D multiview self-supervised pretraining in hyperbolic space. It learns hyperbolic multiview representations via a GeoLink encoder and transfers them to robotic manipulation tasks, achieving a 2.1× performance improvement on the most challenging All Perturbations setting of COLOSSEUM.
HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars: This paper proposes HyperGaussians, which extends 3DGS to high-dimensional multivariate Gaussians. Expression-dependent attribute variations are modeled via conditional distributions, and an inverse covariance trick enables efficient conditioning. Integrated as a plug-and-play module into FlashAvatar and GaussianHeadAvatar, the method significantly improves high-frequency detail quality.
ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects: This paper presents ICTPolarReal, the first large-scale real-world polarized reflection and material dataset, capturing 218 everyday objects using an 8-camera, 346-light Light Stage system under cross- and parallel-polarization configurations. The dataset comprises over 1.2 million high-resolution images with ground-truth diffuse–specular reflection separation, and demonstrably improves inverse rendering, forward relighting, and sparse-view 3D reconstruction.
Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting: This paper proposes a pipeline based on a 3D Object Codebook that associates 2D segmentation masks into consistent 3D object instances within 3DGS using semantic and spatial constraints, enabling object-level detection on large-scale indoor 360° drone imagery. It achieves a 65% improvement in F1 score and 11% improvement in mAP over the state-of-the-art method GAGA.
InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction: This paper proposes InstantHDR, the first feed-forward HDR novel view synthesis method. It achieves multi-exposure fusion via geometry-guided appearance modeling, and employs a meta-network to learn scene-adaptive tone mappers. The method reconstructs HDR 3D scenes from uncalibrated multi-exposure LDR images in a single forward pass, running ~700× faster than optimization-based methods in pure feed-forward mode and ~20× faster when post-optimization is added.
InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction: This paper proposes InstantHDR, the first feed-forward HDR novel view synthesis method. It introduces a geometry-guided appearance modeling module to resolve appearance inconsistencies in multi-exposure fusion, and employs a MetaNet to predict scene-specific tone mapping parameters for generalization. The method reconstructs HDR 3D Gaussian scenes in seconds from uncalibrated multi-exposure LDR images, achieving +2.90 dB PSNR over GaussianHDR under sparse 4-view settings while running approximately 700× faster.
Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation: Iris proposes a deterministic diffusion framework that injects real-world priors into a diffusion model via a two-stage Prior-to-Geometry Decoupled (PGD) schedule: Stage 1 extracts low-frequency layout priors from a teacher model using Spectral Gated Distillation (SGD) at high timesteps, while Stage 2 refines high-frequency geometric details using synthetic data at low timesteps. A Spectral Gated Consistency (SGC) loss is further introduced to align high-frequency information across stages. The method achieves state-of-the-art zero-shot depth estimation performance under limited data and computational budget.
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas: This paper proposes JOPP-3D, the first framework for joint open-vocabulary semantic segmentation on 3D point clouds and panoramic images. It maps panoramas onto icosahedron faces via tangential decomposition, extracts semantically aligned 3D instance embeddings using SAM and CLIP, and achieves 80.9% mIoU on S3DIS under weak supervision, surpassing all closed-vocabulary methods.
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas: This paper proposes JOPP-3D — the first open-vocabulary semantic segmentation framework that jointly processes 3D point clouds and panoramic images. It decomposes panoramas into 20 perspective views via icosahedral tangential projection to accommodate SAM/CLIP, extracts mask-isolated instance-level CLIP embeddings for 3D semantic segmentation, and back-projects results to the panoramic domain via depth correspondence. Without any training, the method achieves 80.9% mIoU on S3DIS, surpassing all supervised approaches.
ECKConv: Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant Point Cloud Analysis: This paper proposes ECKConv, which defines convolutional kernels on the double coset space \(\text{SO(2)}\backslash\text{SE(3)}/\text{SO(2)}\) within the intertwiner framework and explicitly parameterizes kernel functions via coordinate networks. This is the first approach to simultaneously achieve continuous SE(3) equivariance and large-scale scalability, validated across four tasks, including classification, registration, and segmentation.
Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos: This paper proposes to explicitly model the continuous positional and rotational deformation trajectories of dynamic Gaussians via adaptive SE(3) B-spline motion bases, combined with a soft segment reconstruction strategy and multi-view diffusion model priors, achieving high-quality novel view synthesis of dynamic scenes from monocular video. The method surpasses existing approaches on both the iPhone and NVIDIA datasets.
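To make the idea of a continuous spline trajectory concrete, here is a simplified sketch that evaluates a uniform cubic B-spline over 3D control points. It covers translation only and uses fixed, uniformly spaced knots. The adaptive SE(3) formulation in the paper, including rotation, is not reproduced, and all names are hypothetical.

```python
def cubic_bspline_weights(u):
    """Uniform cubic B-spline basis weights for a local parameter u in [0, 1]."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )


def eval_trajectory(control_points, t):
    """Evaluate the spline position at time t in [0, n - 3] for n >= 4 control points."""
    n = len(control_points)
    if n < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    i = min(max(int(t), 0), n - 4)  # index of the active segment
    u = t - i                        # local parameter within the segment
    weights = cubic_bspline_weights(u)
    segment = control_points[i:i + 4]
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, segment)) for axis in range(3)
    )
```

Because the four weights always sum to one, each position is a convex combination of nearby control points, so the trajectory stays smooth and can be queried at any timestamp rather than only at observed frames.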
Learning Multi-View Spatial Reasoning from Cross-View Relations: XVR (Cross-View Relations) constructs a large-scale multi-view visual question answering dataset of 100K samples. By explicitly training VLMs on three categories of tasks—correspondence, verification, and viewpoint localization—XVR significantly improves cross-view spatial reasoning, yielding notable gains on both multi-view benchmarks and robotic manipulation tasks.
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation: This paper proposes a Physics-Guided Score Distillation framework that leverages physics simulation (MPM) as a motion prior to guide Video-SDS optimization, enabling the generation of dynamic weather effects (snow, rain, fog, sandstorm) with physically plausible motion and photorealistic appearance in static 3DGS scenes.
Lifting Unlabeled Internet-level Data for 3D Scene Understanding: This paper presents SceneVerse++, an automated data engine that generates 3D scene understanding training data from 6,687 unlabeled internet videos. It demonstrates the feasibility of leveraging internet-scale data to advance 3D scene understanding across three tasks: 3D object detection (F1@.25 +20.6), spatial VQA (+14.9%), and vision-language navigation (+14% SR).
LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds: LightSplat proposes a training-free framework that is both fast and memory-efficient. By assigning each 3D Gaussian a compact 2-byte semantic index instead of high-dimensional CLIP features, combined with a lightweight index-to-feature lookup and single-pass 3D clustering, it achieves open-vocabulary 3D scene understanding that is 50–400× faster and requires 64× less memory than existing state-of-the-art methods.
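The index-plus-lookup idea can be sketched as follows. This is an illustration under stated assumptions, not the LightSplat implementation: the class name, the codebook layout, and the dot-product scoring are hypothetical. The only detail taken from the note is that each Gaussian stores a 2-byte index instead of a full feature vector.

```python
from array import array


class IndexedSemantics:
    """Per-Gaussian 2-byte indices into a small shared table of feature vectors."""

    def __init__(self, codebook, indices):
        if len(codebook) > 65536:
            raise ValueError("a 2-byte index can address at most 65536 entries")
        self.codebook = codebook              # list of feature vectors
        self.indices = array("H", indices)    # unsigned 16-bit index per Gaussian

    def feature(self, gaussian_id):
        """Look up the shared feature vector assigned to one Gaussian."""
        return self.codebook[self.indices[gaussian_id]]

    def query(self, text_feature):
        """Score every Gaussian against a query feature.

        The dot product is computed once per codebook entry and then
        broadcast to the Gaussians through their indices.
        """
        entry_scores = [
            sum(a * b for a, b in zip(entry, text_feature))
            for entry in self.codebook
        ]
        return [entry_scores[i] for i in self.indices]
```

With this layout, the per-Gaussian storage is independent of the feature dimension, and a query costs one similarity computation per table entry rather than one per Gaussian.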
Lite Any Stereo: Efficient Zero-Shot Stereo Matching: This paper proposes Lite Any Stereo, which achieves first-place rankings on four real-world benchmarks using less than 1% of the computation (33G MACs) of state-of-the-art accurate methods. This is accomplished via a hybrid 2D-3D cost aggregation module and a three-stage million-scale training strategy (supervised → self-distillation → real-data knowledge distillation), demonstrating for the first time that ultra-lightweight models can exhibit strong zero-shot generalization.
LitePT: Lighter Yet Stronger Point Transformer: LitePT conducts a systematic analysis of the roles played by convolution and attention at different U-Net stages, and proposes a hierarchical hybrid architecture that employs sparse convolution in shallow stages and attention in deep stages. Combined with the parameter-free PointROPE positional encoding, LitePT achieves 3.6× fewer parameters, 2× faster inference, and 2× lower memory consumption compared to Point Transformer V3, while matching or surpassing its performance across multiple point cloud benchmarks.
Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception: Long-SCOPE proposes a fully sparse long-range cooperative 3D perception framework that achieves state-of-the-art performance in 100–150 m long-range scenarios through geometry-guided query generation and a context-aware association module, while maintaining efficient computation and communication costs.
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry: LongStream is a gauge-decoupled streaming visual geometry model that achieves stable metric-scale scene reconstruction at 18 FPS over thousand-frame sequences, via keyframe-relative pose prediction, orthogonal scale learning, and cache-consistent training.
LoST: Level of Semantics Tokenization for 3D Shapes: This paper proposes Level-of-Semantics Tokenization (LoST), which orders 3D shape tokens by semantic saliency so that short prefixes can already decode into complete and semantically coherent shapes. Combined with the RIDA semantic alignment loss and GPT-style autoregressive generation, LoST uses only 128 tokens yet achieves significant improvements over existing 3D AR methods that require tens of thousands of tokens.
LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates: This paper proposes the LTGS framework, which constructs reusable object-level Gaussian templates to efficiently update 3DGS scene reconstructions from spatiotemporally sparse observations, enabling temporal modeling of long-term environmental evolution.
LumiMotion: Improving Gaussian Relighting with Scene Dynamics: LumiMotion is the first Gaussian-based inverse rendering method that leverages scene dynamics (motion regions) as supervision signals to improve material-lighting decomposition. Through static-dynamic separation and motion-revealed appearance changes, it achieves a 23% improvement in albedo LPIPS and a 15% improvement in relighting LPIPS.
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation: This paper constructs M3DLayout, a large-scale multi-source 3D indoor layout dataset comprising 21,367 layouts and over 433k object instances. It integrates three complementary sources—real-world scans, professionally designed scenes, and procedurally generated environments—paired with structured textual descriptions, providing a high-quality training foundation for text-driven 3D scene generation.
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping: This paper proposes MAGICIAN, a framework that leverages a pretrained occupancy network to generate "Imagined Gaussians" for efficiently estimating surface coverage gain. Combined with beam search, MAGICIAN enables long-term trajectory planning for active mapping, achieving state-of-the-art performance in both indoor and outdoor scenes with coverage improvements exceeding 10%.
Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding: This paper proposes SADG, the first framework to introduce Mamba into in-context learning for multi-task point cloud domain generalization. Through three modules — structure-aware serialization (Centroid Distance Spectrum + Geodesic Curvature Spectrum), Hierarchical Domain-Aware Modeling, and Spectral Graph Alignment — SADG comprehensively surpasses the state of the art on reconstruction, denoising, and registration tasks.
MARCO: Navigating the Unseen Space of Semantic Correspondence: This paper proposes MARCO, a semantic correspondence model built on a single DINOv2 backbone. It progressively improves spatial precision via a coarse-to-fine Gaussian RBF loss, and expands sparse keypoint supervision into dense pseudo-correspondence labels through a self-distillation framework. MARCO achieves state-of-the-art performance on standard benchmarks as well as on unseen keypoints and categories, while being 3× smaller and 10× faster than dual-encoder approaches.
Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding: This paper identifies two fundamental conflicts between the causal mask in LLM decoders and 3D scene understanding (order bias and instruction isolation), and proposes the 3D-SLIM masking strategy (Geometry-adaptive Mask + Instruction-aware Mask) to replace the causal mask. It achieves significant improvements across multiple 3D scene-language tasks without any architectural modifications or additional parameters.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding: This paper proposes BrainCoDec, a framework that performs fMRI-based visual decoding generalizable to new subjects without any fine-tuning. It employs a two-stage hierarchical in-context learning approach: first estimating encoder parameters for each voxel, then aggregating across voxels via functional inversion. Top-1 retrieval accuracy improves from 3.9% (MindEye2) to 22.7%.
MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer: This paper proposes MimiCAT, a cascade Transformer framework that learns flexible many-to-many soft correspondences via semantic keypoint labels. Combined with the million-scale multi-category motion dataset PokeAnimDB, it achieves, for the first time, high-quality cross-category 3D pose transfer (e.g., humanoid to quadruped/bird).
Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamics: This paper proposes an EEG-conditioned diffusion Transformer framework for fMRI reconstruction that models brain activity as a spatiotemporal sequence of neural frames rather than independent snapshots. The method achieves spatiotemporally consistent fMRI reconstruction at cortical vertex-level resolution, supports intermediate frame interpolation via null-space sampling, and validates the preservation of functional information on downstream visual decoding tasks.
MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer: This paper presents MoRe, a feed-forward motion-aware 4D reconstruction Transformer that decouples dynamic motion from static structure during training via an attention enforcement strategy, and achieves efficient streaming inference through grouped causal attention, attaining state-of-the-art performance in camera pose estimation and depth prediction on dynamic scenes.
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification: To address the challenges of memory explosion, temporal flickering, and occlusion handling in 4D Gaussian Splatting for long-video dynamic scene modeling, this paper proposes MoRel, a framework based on Anchor Relay-based Bidirectional Blending (ARBB). Through progressive construction of keyframe anchors and learnable temporal opacity control, MoRel achieves flicker-free, memory-bounded long-range 4D motion reconstruction.
Motion-Aware Animatable Gaussian Avatars Deblurring: This paper proposes the first method for directly reconstructing sharp, animatable 3D Gaussian human avatars from blurry video, leveraging a 3D-aware physical blur formation model and an SMPL-based human motion model to jointly optimize the avatar representation and motion parameters.
MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins: This paper presents MotionAnymesh, a zero-shot automated framework that converts static 3D meshes into collision-free, simulation-ready articulated digital twins via motion-aware segmentation (SP4D priors + VLM reasoning) and geometry-physics joint optimization for joint estimation, achieving 87% physical executability on PartNet-Mobility and Objaverse.
MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins: This paper proposes MotionAnymesh, a zero-shot framework that uses SP4D kinematic priors to guide VLMs in eliminating kinematic hallucinations, and employs physics-constrained trajectory optimization to guarantee collision-free articulation. The framework automatically converts static 3D meshes into simulation-ready URDF digital twins directly deployable in physics engines such as SAPIEN, achieving a physical executability rate of 87%—far exceeding existing methods.
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting: This paper proposes MotionScale, a scalable 4D Gaussian Splatting framework that reconstructs the appearance, geometry, and motion of large-scale dynamic scenes from monocular video with high fidelity. Through a clustering-based adaptive motion field and a progressive optimization strategy, MotionScale achieves a PSNR of 17.98 on DyCheck and reduces 3D tracking EPE to 0.070, substantially outperforming existing methods.
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second: This paper proposes MoVieS, a feed-forward 4D dynamic scene reconstruction framework that unifies appearance, geometry, and motion modeling via Dynamic Splatter Pixels, enabling 4D reconstruction from monocular video in approximately one second while supporting novel view synthesis, 3D point tracking, scene flow estimation, and moving object segmentation.
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation: This paper proposes a Multi-modal 3D Scene Graph (M3DSG) that replaces conventional text-based relation edges with dynamically assigned image edges, and builds a zero-shot navigation system MSGNav comprising four modules: Key Subgraph Selection, Adaptive Vocabulary Update, Closed-Loop Reasoning, and Visibility Viewpoint Decision. MSGNav achieves 52.0% SR on GOAT-Bench and 74.1% SR on HM3D-ObjNav, both state-of-the-art.
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation: This paper proposes a Multi-modal 3D Scene Graph (M3DSG) that replaces conventional text-based relation edges with dynamically assigned image edges to preserve visual information. The zero-shot navigation system MSGNav is built on M3DSG and adds a Visibility-based Viewpoint Decision (VVD) module to address the "last-mile" navigation problem. The method achieves state-of-the-art performance on GOAT-Bench and HM3D-ObjNav.
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction: This paper proposes MV-RoMa, the first multi-view dense matching model that simultaneously estimates dense correspondences from a single source image to multiple target images via a Track-Guided multi-view encoder and a pixel-aligned multi-view refiner, producing geometrically consistent tracks for SfM and achieving state-of-the-art performance on HPatches, ETH3D, IMC, and related benchmarks.
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation: This paper introduces the MV-3DRES task (language-guided 3D segmentation directly from sparse multiview RGB images) and the MVGGT framework (a dual-branch design combining a frozen geometry branch with a trainable multimodal branch). A PVSO optimization strategy is proposed to address the foreground gradient dilution (FGD) problem, achieving 39.9 mIoU on the newly constructed MVRefer benchmark, substantially outperforming baselines.
NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration: This paper proposes NanoSD, a family of Pareto-optimal lightweight diffusion foundation models (130M–315M parameters, as fast as 12 ms inference) built upon SD 1.5 through hardware-aware U-Net decomposition, block-wise feature distillation, and multi-objective Bayesian optimization. NanoSD serves as a drop-in backbone that achieves state-of-the-art performance across multiple tasks including super-resolution, face restoration, deblurring, and monocular depth estimation.
NeAR: Coupled Neural Asset–Renderer Stack: NeAR proposes jointly designing neural asset creation and neural rendering as a coupled stack. By introducing illumination-homogenized structured 3D latents (LH-SLAT) to remove baked lighting from input images, and employing an illumination-aware neural decoder for real-time synthesis of relightable 3D Gaussian fields, NeAR surpasses existing methods across four tasks: forward rendering, reconstruction, relighting, and novel-view relighting.
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences: Neu-PiG proposes a fast optimization framework based on preconditioned multi-resolution latent grids, encoding the position and normal directions of a keyframe reference mesh into a unified latent space. A lightweight MLP decodes these features into per-frame 6-DoF deformations, achieving high-fidelity dynamic surface reconstruction more than 60× faster than existing training-free methods, without requiring category-specific priors or explicit correspondences.
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy: This paper proposes NFH-SEM, a neural field-based hybrid framework that embeds the physical model of electron scattering in SEM into a neural field optimization pipeline, enabling high-fidelity 3D surface reconstruction of microstructures from multi-view, multi-detector SEM images. The framework achieves self-calibration and shadow-robust reconstruction at nanometer-scale accuracy (478 nm stacked features, 782 nm pollen textures, 1.559 μm fracture steps).
Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction: Neural Gabor Splatting embeds a lightweight MLP (SIREN architecture) into each Gaussian primitive, enabling a single primitive to represent complex spatially-varying color patterns. Combined with a frequency-aware densification strategy, this approach significantly improves high-frequency surface reconstruction quality under the same data budget.
NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation: This paper proposes the NG-GS framework, which leverages the continuous modeling capability of NeRF to address the boundary discretization problem in 3DGS segmentation. It constructs a continuous feature field via RBF interpolation, combined with multi-resolution hash encoding and joint NeRF-GS optimization, to achieve high-quality object segmentation.
NI-Tex: Non-isometric Image-based Garment Texture Generation: NI-Tex is a framework that achieves, for the first time, high-quality feed-forward generation of PBR textures for 3D garments from a single image under non-isometric conditions. It combines a newly constructed 3D Garment Videos dataset, image-editing-based cross-topology augmentation, and an uncertainty-guided iterative baking algorithm.
NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather: NimbusGS proposes a unified 3D scene reconstruction framework that decomposes weather degradation into a continuous scattering field (fog/haze) and a per-view particulate residual layer (rain/snow), coupled with a geometry-guided gradient scaling mechanism, achieving state-of-the-art reconstruction under individual and hybrid weather conditions within a single framework.
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency: This paper proposes the first cross-sensor view synthesis framework that requires neither calibration nor depth. Through a match-densify-consolidate pipeline, sparse cross-modal keypoints are expanded into dense X-modality images (thermal/NIR/SAR) aligned with the RGB viewpoint. Synthesis quality is further improved via confidence-aware densification fusion (CADF) and self-matching filtering.
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs: Node-RF tightly couples Neural ODE with NeRF, driving the temporal evolution of implicit scene representations via continuous-time differential equations. This enables long-range extrapolation far beyond the training time horizon and cross-trajectory generalization, achieving significant improvements over baselines such as D-NeRF and 4D-GS on datasets including Bouncing Balls, Pendulum, and Oscillating Ball.
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs: Node-RF tightly couples Neural ODE with NeRF, modeling scene dynamic evolution via differential equations in latent space, enabling long-range extrapolation beyond training time horizons, cross-sequence generalization, and dynamical system behavior analysis.
NTK-Guided Implicit Neural Teaching: This paper proposes NINT, which leverages row vectors of the Neural Tangent Kernel (NTK) to measure each coordinate's influence on the global function update, enabling dynamic selection of coordinates with both high fitting error and high global influence for training. This approach reduces INR training time by nearly half without sacrificing reconstruction quality.
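A toy sketch of NTK-guided coordinate selection follows. It assumes a model that is linear in its parameters, so the Jacobian is known in closed form, and it scores each coordinate by the product of its absolute error and the norm of its NTK row. That scoring rule is an illustrative assumption and may differ from the rule used in the paper; the function names are hypothetical.

```python
import math


def ntk_matrix(jacobian):
    """Empirical NTK: K[i][j] is the dot product of Jacobian rows i and j."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in jacobian]
        for row_i in jacobian
    ]


def select_coordinates(jacobian, residuals, k):
    """Pick the k coordinates with the largest |error| * ||NTK row|| score."""
    kernel = ntk_matrix(jacobian)
    scores = [
        abs(r) * math.sqrt(sum(v * v for v in row))
        for r, row in zip(residuals, kernel)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy model f(x) = w0 + w1 * x + w2 * x**2, so d f / d w = [1, x, x**2].
xs = [0.0, 0.5, 1.0, 2.0]
jacobian = [[1.0, x, x * x] for x in xs]
residuals = [0.9, 0.1, 0.5, 0.4]  # hypothetical per-coordinate fitting errors
chosen = select_coordinates(jacobian, residuals, k=2)  # -> [3, 2]
```

In this example the coordinate with the largest error (index 0) is not selected, because its NTK row is small: updating on it would change the function at other coordinates very little.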
Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting: This paper proposes a feed-forward 3DGS decoder based on keypoint detection, liberating Gaussian primitives from the pixel grid by placing them adaptively at sub-pixel precision. Combined with an adaptive density mechanism and confidence-based pruning, the method surpasses state-of-the-art feed-forward approaches in novel view synthesis using only 1/7 of the primitives required by pixel-aligned methods.
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting: This paper proposes OnlinePG, the first online open-vocabulary panoptic mapping system built upon 3DGS. It adopts a local-to-global paradigm: within a sliding window, a multi-cue clustering graph (geometric overlap + semantic similarity + view consensus) constructs locally consistent 3D instances, which are then incrementally merged into a global map via bidirectional bipartite matching. OnlinePG achieves state-of-the-art semantic and panoptic segmentation among online methods, surpassing OnlineAnySeg by +17.2 mIoU on ScanNet (48.48), while running at 10–18 FPS.
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness: This paper proposes OpenVO, an open-world monocular visual odometry framework that achieves robust metric-scale ego-motion estimation under uncalibrated cameras and varying frame rates, via a time-aware flow encoder and a geometry-aware context encoder. OpenVO achieves over 20% ATE improvement across datasets and reduces error by 46%–92% under variable frame-rate settings.
PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery: PAD-Hand is a physics-aware conditional diffusion framework that models Euler–Lagrange (EL) dynamics residuals as virtual observations integrated into the diffusion process, while estimating per-joint, per-frame dynamic variance via last-layer Laplace approximation. The method achieves physically plausible and uncertainty-aware hand motion recovery, reducing acceleration error by 50.1% on DexYCB.
Pano360: Perspective to Panoramic Vision with Geometric Consistency: Pano360 proposes a Transformer-based panoramic stitching framework that extends the conventional 2D pairwise alignment paradigm to 3D space, directly leveraging camera poses to guide global multi-image alignment. Combined with a multi-feature joint optimization strategy for seam detection, the method achieves a 97.8% success rate on challenging scenarios including weak texture, large parallax, and repetitive patterns, substantially outperforming existing approaches.
Pano360: Perspective to Panoramic Vision with Geometric Consistency: Pano360 extends panoramic image stitching from conventional 2D pairwise matching to the 3D photogrammetric space, leveraging a Transformer architecture to achieve globally geometrically consistent multi-view alignment. It attains a 97.8% success rate under challenging scenarios including weak texture, large parallax, and repetitive patterns.
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image: This paper proposes Pano3DComposer, a modular feed-forward compositional 3D scene generation framework that takes a single panoramic image as input. A plug-and-play Object-World Transformation Predictor (based on Alignment-VGGT) maps generated 3D objects from local coordinates to world coordinates, producing high-fidelity 3D scenes in approximately 20 seconds on an RTX 4090.
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery: This paper proposes PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and globally consistent 3D point clouds from one or more unordered panoramic images in a single feed-forward pass. The paper also contributes PanoCity — a large-scale dataset comprising over 120,000 outdoor panoramic images.
Parallelised Differentiable Straightest Geodesics for 3D Meshes: This paper proposes a parallel GPU implementation of straightest geodesics along with two differentiable schemes — an extrinsic proxy function method and a geodesic finite differences method — enabling efficient parallel and differentiable exponential map computation on triangular meshes. Three downstream applications are built upon this framework: a geodesic convolutional layer, a flow matching method on meshes, and a second-order optimizer.
Particulate: Feed-Forward 3D Object Articulation: Particulate proposes a feed-forward model that infers complete articulation structures (part segmentation, kinematic tree, and motion constraints) from a static 3D mesh within seconds. Built upon the Part Articulation Transformer and trained end-to-end on public datasets, it significantly outperforms existing per-object optimization methods and can be combined with 3D generative models to enable single-image-to-articulated-3D-object generation.
PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences: PCSTracker is the first end-to-end framework for long-term scene flow estimation on point cloud sequences. Through iterative joint geometry-motion optimization, spatiotemporal trajectory updates, and an overlapping sliding window strategy, it reduces EPE_3D by 57.9% on the synthetic dataset PointOdyssey3D while running in real time at 32.5 FPS.
PE3R: Perception-Efficient 3D Reconstruction: PE3R proposes a tuning-free, feed-forward 3D semantic reconstruction framework that directly generates semantic 3D point clouds from pose-free 2D images via three modules — pixel embedding disambiguation, semantic point cloud reconstruction, and global view perception — achieving a 9× speedup while establishing new state-of-the-art performance on open-vocabulary segmentation and depth estimation.
PhyGaP: Physically-Grounded Gaussians with Polarization Cues: This paper proposes PhyGaP, which integrates polarization cues into 2DGS optimization via a polarization deferred rendering pipeline (PolarDR), and introduces a self-occlusion-aware GridMap environment representation, enabling accurate reflection decomposition and realistic relighting of glossy objects.
PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis: PhysGaia constructs a physics-aware benchmark dataset comprising 17 scenes that cover multi-body interactions across four material categories—liquid, gas, cloth, and rheological matter—providing ground truth 3D particle trajectories and physical parameters (e.g., viscosity). The paper further introduces two new metrics, Trajectory Distance (TD) and AUOP, to quantify the physical realism of 4DGS methods, revealing severe deficiencies in physical reasoning among existing DyNVS approaches.
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis: PhysGM is the first framework for feed-forward prediction of 3DGS and physical attributes (material category, Young's modulus, Poisson's ratio) from a single image. A two-stage training pipeline (supervised pretraining + DPO preference fine-tuning) entirely bypasses SDS and differentiable physics engines. Combined with the 50K+ PhysAssets dataset, the method generates high-fidelity 4D physical simulations within one minute, surpassing per-scene optimization methods in both CLIP similarity and human preference rate.
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis: PhysGM proposes the first feed-forward framework that simultaneously predicts 3D Gaussian representations and physical properties (stiffness, mass, etc.) from a single image in one inference pass. Combined with MPM simulation, it generates high-fidelity, physically plausible 4D animations within one minute, requiring no per-scene optimization.
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation: This paper proposes PhysGS, which integrates Bayesian inference into a 3D Gaussian Splatting pipeline. By leveraging vision-language model priors and multi-view confidence-weighted updates, PhysGS enables per-point probabilistic estimation and uncertainty quantification of physical properties (friction, hardness, density, stiffness), achieving a 22.8% improvement over NeRF2Physics in APE for mass estimation and a 61.2% reduction in Shore hardness error.
PhysHead: Simulation-Ready Gaussian Head Avatars: This paper proposes PhysHead—the first method to integrate physics-driven hair dynamics with animatable 3DGS head avatars. It models expressive faces via FLAME mesh + 3DGS, represents hair appearance via strands + 3DGS, drives hair animation through a physics engine, and enables layered optimization of hair and face through VLM-generated bald images.
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis: This paper proposes PhysHDR-GS, a physically inspired HDR novel view synthesis framework that decomposes Gaussian colors into intrinsic reflectance and adjustable ambient illumination. An Image-Exposure (IE) branch and a Gaussian-Illumination (GI) branch complementarily capture HDR details. A cross-branch HDR consistency loss provides explicit HDR supervision without ground-truth HDR data, and illumination-guided gradient scaling addresses gradient starvation caused by exposure bias. The method outperforms HDR-GS by 2.04 dB across multiple benchmarks while maintaining real-time rendering at 76 FPS.
PIP-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching: This paper reveals the spatial sparsity and temporal redundancy of disparity updates in iterative stereo matching, and proposes: (1) Progressive Iteration Pruning (PIP) to compress 32 iterations down to 1; (2) a collaborative learning paradigm for monocular depth prior transfer without an independent monocular encoder; and (3) a hardware-aware FlashGRU operator (7.28× speedup). Together, these enable high-accuracy iterative stereo matching to achieve real-time inference on Jetson Orin NX for the first time (75 ms/frame at 320×640).
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction: This paper proposes PixARMesh, the first autoregressive framework for single-view scene reconstruction that operates natively in mesh space (rather than SDF space). By enhancing a point cloud encoder with pixel-aligned image features and global scene context, the method jointly predicts object poses and meshes within a unified token sequence. PixARMesh achieves scene-level state-of-the-art on 3D-FRONT while producing compact, editable, artist-ready meshes.
PointINS: Instance-Aware Self-Supervised Learning for Point Clouds: PointINS proposes the first point cloud self-supervised learning framework that explicitly learns semantic consistency and geometric reasoning. By introducing a label-free offset branch with Offset Distribution Regularization (ODR) and Spatial Clustering Regularization (SCR), it achieves an average improvement of +3.5% mAP on indoor instance segmentation and +4.1% PQ on outdoor panoptic segmentation.
PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding: PointTPA is a framework that generates input-customized network parameters at inference time via two lightweight modules—Serialization-based Neighborhood Grouping (SNG) and Dynamic Parameter Projector (DPP)—achieving 78.4% mIoU on ScanNet with fewer than 2% additional parameters, surpassing existing parameter-efficient fine-tuning (PEFT) methods.
PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation: PoseMaster proposes a 3D native framework that unifies pose stylization and 3D generation in an end-to-end pipeline. It directly uses 3D skeletons as pose control signals (rather than 2D skeleton images), designs a skeleton densification strategy and a Point Transformer encoder to extract fine-grained spatial topology features, and trains on large-scale Image-Skeleton-Mesh triplet data, achieving state-of-the-art performance on both pose canonicalization and arbitrary pose stylization.
PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis: This paper proposes PR-IQA, a cross-reference image quality assessment method that first computes geometrically consistent local quality maps in multi-view overlapping regions, then propagates quality information to non-overlapping regions via a reference-conditioned cross-attention network, producing dense quality maps approaching full-reference accuracy. Integrated into a 3DGS pipeline with a dual-filtering strategy, it significantly improves sparse-view 3D reconstruction quality.
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars: This paper proposes ProgressiveAvatars, a progressive avatar representation that constructs hierarchical 3DGS via adaptive implicit subdivision on a template mesh, enabling progressive transmission and rendering under varying bandwidth and compute constraints. With only 5% of the data (2.6 MB), a usable avatar is immediately renderable, and incremental loading smoothly improves quality to a level comparable with state-of-the-art methods.
PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts: This paper proposes the Prompt Recurrent Unit (PRU), which replaces the GRU in iterative refinement with the DPT decoder from a monocular depth foundation model. Structure Prompts and Motion Prompts inject monocular structural and stereo motion cues via residual addition, enabling zero-shot state-of-the-art stereo matching without corrupting the monocular prior (nearly 50% error reduction on Middlebury 2021).
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives: This paper proposes an adaptive reconstruction-aware pruning scheduler (RPS) and 3D Difference-of-Gaussian (DoG) primitives, which together prune 90% of Gaussians while preserving rendering quality.
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment: This paper proposes QD-PCQA, a quality-aware domain adaptation framework that transfers image-domain quality assessment priors to the point cloud domain via two core strategies: Rank-weighted Conditional Alignment (RCA) and Quality-guided Feature Augmentation (QFA).
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition: This paper presents QuadSync, the first global synchronization algorithm for quadrifocal tensors. By constructing a block quadrifocal tensor and proving that it admits a Tucker decomposition with multilinear rank \((4,4,4,4)\), the method recovers camera poses from four-view measurements via an ADMM-IRLS optimization framework, achieving superior synchronization accuracy over two-view and three-view methods in dense-view settings.
R4Det: 4D Radar-Camera Fusion 3D Detection: R4Det proposes three plug-and-play modules, Panoramic Depth Fusion (PDF), Deformable Gated Temporal Fusion (DGTF), and Instance-Guided Dynamic Refinement (IGDR), to address the key challenges in 4D radar-camera fusion: inaccurate depth estimation, ego-pose-dependent temporal fusion, and poor small-object detection. State-of-the-art results are achieved on TJ4DRadSet and VoD.
Random Wins All: Rethinking Grouping Strategies for Vision Tokens: This paper proposes a minimalist random grouping strategy to replace various elaborately designed token grouping methods in Vision Transformers. The approach achieves near-universal improvements over all baselines across image classification, object detection, semantic segmentation, point cloud segmentation, and VLMs, and provides a four-dimensional explanation for its success: positional information, per-head feature diversity, global receptive field, and fixed grouping patterns.
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing: This paper proposes RAP, a rendering-free feedforward method for Gaussian primitive importance scoring. It extracts 15-dimensional features from intrinsic attributes and local neighborhood statistics, employs a lightweight MLP to predict importance scores, and generalizes to unseen scenes after a single training run.
RayNova: Scale-Temporal Autoregressive World Modeling in Ray Space: This paper proposes RayNova, a geometry-agnostic multi-view world model based on dual-causal (scale + temporal) autoregressive modeling. By leveraging relative Plücker ray positional encodings, RayNova achieves unified 4D spatiotemporal reasoning and attains state-of-the-art multi-view video generation performance on nuScenes.
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface: This paper proposes the Real2Edit2Real framework, a three-stage pipeline of "3D reconstruction → point cloud editing to generate new trajectories → depth-guided video generation for synthesizing demonstrations." Starting from only 1–5 real demonstrations, the framework generates large quantities of diverse manipulation demonstrations, enabling policy performance that matches or exceeds training on 50 real demonstrations—achieving a 10–50× improvement in data efficiency.
Regularizing INR with Diffusion Prior for Self-Supervised 3D Reconstruction of Neutron Computed Tomography Data: This paper proposes DINR (Diffusive INR), which replaces the conventional inversion solver within the DD3IP diffusion framework with an INR, injecting diffusion denoising estimates into the INR optimization via a proximal loss. DINR surpasses existing SOTA methods for neutron CT reconstruction under extremely sparse-view conditions (as few as 4–5 views).
Regularizing INR with Diffusion Prior for Self-Supervised 3D Reconstruction of Neutron Computed Tomography Data: This paper proposes Diffusive INR (DINR), a framework that replaces the conventional DIS in the DD3IP diffusion reconstruction pipeline with an INR, and injects the diffusion model's denoising estimate as a regularization prior into the INR optimization via a proximal loss function. Under extremely sparse neutron CT conditions with only 4–5 views, DINR surpasses MBIR (qGGMRF), DD3IP, and vanilla INR in reconstruction quality.
ReLaGS: Relational Language Gaussian Splatting: This paper proposes ReLaGS, the first training-free framework that unifies multi-level language Gaussian fields with open-vocabulary 3D scene graphs. It improves scene representation via Maximum Weight Pruning and Robust Outlier-aware Feature Aggregation, and achieves efficient structured 3D scene understanding through GNN-based relation prediction.
Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations: Reliev3R introduces the first weakly supervised paradigm for training feed-forward 3D reconstruction models (FFRMs) from scratch without multi-view geometric annotations (i.e., no SfM/MVS-derived point clouds or camera poses). By substituting monocular relative depth and sparse image correspondences as supervisory signals, it achieves performance on par with or superior to certain fully supervised FFRMs.
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery: This paper proposes RepTRFD, which reparameterizes Tensor Ring factors into the form of "learnable latent tensor × fixed basis" to address the spectral bias problem inherent in INR-parameterized TR factors, achieving state-of-the-art performance across image inpainting, denoising, super-resolution, and point cloud recovery tasks.
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty: This paper proposes UGS-Loc, a framework that jointly models pose prior uncertainty and geometric uncertainty via Monte Carlo pose sampling and Fisher information-guided PnP optimization, achieving significantly improved robustness in camera pose refinement within 3DGS scenes without requiring retraining.
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting: RetimeGS addresses ghosting artifacts and temporal aliasing in 4DGS inter-frame interpolation through regularized temporal opacity, Catmull-Rom spline trajectories, bidirectional optical flow supervision, and triple rendering, enabling artifact-free continuous-time 4D reconstruction at arbitrary timestamps.
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting: This paper proposes RetimeGS, which addresses temporal aliasing (ghosting) in 4DGS frame interpolation through regularized temporal opacity (dual-Sigmoid short-tailed distribution) and Catmull-Rom spline trajectories for modeling continuous Gaussian primitive motion, combined with bidirectional optical flow supervision, triple rendering, and dynamic stretching strategies. RetimeGS achieves 30.08 dB PSNR on the Stage-Capture dataset, surpassing the previous SOTA by 1.29 dB.
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction: This paper proposes ReWeaver, a framework that jointly reconstructs 3D garment geometry and 2D sewing patterns from as few as four multi-view RGB images. A dual-path Transformer predicts 3D patches/curves and their topological connectivity, after which an intra-group attention module unfolds the 3D structure into 2D panel edges. ReWeaver is the first method to produce topology-accurate garment assets that are directly usable in physical simulation.
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation: Rewis3d is the first work to introduce feed-forward 3D scene reconstruction as an auxiliary supervision signal for weakly-supervised semantic segmentation. Through a dual student-teacher architecture, it achieves bidirectional cross-modal consistency (CMC) learning between 2D images and reconstructed 3D point clouds. Combined with dual-confidence filtering and view-aware sampling, the method improves mIoU by 2–7% across multiple datasets under sparse annotations (points, scribbles, coarse labels), while requiring only 2D input at inference time.
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation: This paper proposes Rewis3d, the first framework to integrate feedforward 3D scene reconstruction as an auxiliary supervision signal for weakly-supervised semantic segmentation. Through a dual student-teacher architecture and dual confidence-weighted cross-modal consistency loss, Rewis3d improves mIoU by 2–7% under sparse annotation, while using only 2D images at inference time.
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations: RnG proposes Reconstruction-Guided Causal Attention, which reinterprets the Transformer's KV-Cache as an implicit 3D representation, enabling a single feed-forward Transformer to jointly perform reconstruction and generation—recovering complete 3D geometry and appearance from sparse, pose-free images—at over 100× the speed of diffusion-based methods.
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations: This paper proposes RnG, a unified feed-forward Transformer that leverages reconstruction-guided causal attention to treat KV-Cache as an implicit 3D representation, simultaneously achieving 3D reconstruction and novel-view RGBD generation from sparse unposed images, with inference speeds over 100× faster than diffusion-based methods.
S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds: This paper presents S2AM3D, a point cloud part segmentation framework that integrates 2D pretrained priors with 3D contrastive supervision. A point-consistent encoder produces globally coherent per-point features, while a scale-aware prompt decoder enables continuously controllable segmentation granularity. The method substantially outperforms existing approaches across multiple benchmarks.
Sampling-Aware 3D Spatial Analysis in Multiplexed Imaging: This paper systematically investigates how sampling geometry (2D single sections vs. 3D serial sections) affects the accuracy of recovering spatial statistics in multiplexed imaging, and proposes a geometry-aware sparse 3D reconstruction module that enables reliable depth-informed spatial analysis under limited imaging budgets.
SASNet: Spatially-Adaptive Sinusoidal Networks for INRs: This paper proposes SASNet, which combines frozen frequency embedding layers with spatially-adaptive masks learned by a lightweight hash-grid MLP to address SIREN's sensitivity to frequency initialization and its high-frequency leakage problem, achieving faster convergence and higher reconstruction quality on image fitting, volumetric data fitting, and SDF reconstruction tasks.
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models: This paper proposes QuatRoPE, a quaternion rotation-based 3D positional encoding method that preserves all \(O(n^2)\) pairwise spatial relations using only \(O(n)\) input tokens. Combined with the IGRE mechanism to reduce interference with language RoPE, it achieves substantial improvements across multiple 3D vision-language benchmarks.
Scaling View Synthesis Transformers (SVSM): This work establishes, for the first time, scaling laws for geometry-free novel view synthesis (NVS) Transformers. It proposes the effective batch size hypothesis (\(B_\text{eff} = B \cdot V_T\)) to explain why encoder-decoder architectures have been underestimated, designs a unidirectional encoder-decoder architecture called SVSM, and achieves a new state of the art on RealEstate10K (30.01 PSNR) with less than half the training FLOPs. Its Pareto frontier shifts 3× to the left relative to the decoder-only LVSM.
Scene Grounding In the Wild: This paper proposes a semantic feature-based inverse optimization framework that aligns in-the-wild local 3D reconstructions (SfM) to a complete pseudo-synthetic reference model (e.g., Google Earth Studio). By leveraging DINOv2 features and robust optimization, the method addresses large domain gaps and achieves globally consistent fusion of non-overlapping local reconstructions.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations: This paper presents SceneScribe-1M — a large-scale multimodal video dataset comprising one million in-the-wild videos spanning over 4,000 hours, with comprehensive annotations including structured text descriptions, accurate camera parameters, temporally consistent depth maps, and 3D point trajectories. The dataset serves as a unified resource for 3D geometric perception and video generation tasks.
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation: This paper proposes SCOPE, a plug-and-play framework that leverages a class-agnostic segmentation model to mine pseudo-instance prototypes from background regions of base training scenes. By retrieving and fusing these prototypes into sparse few-shot novel-class prototypes via attention, SCOPE improves novel-class IoU by 6.98% on ScanNet without retraining the backbone.
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation: This paper proposes SeeThrough3D, which conditions the FLUX model on an Occlusion-aware Scene Control Representation (OSCR) rendered from semi-transparent 3D bounding boxes, enabling precise 3D layout control and occlusion-consistent text-to-image generation.
SEPatch3D: Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors: This paper proposes SEPatch3D, which achieves 57% inference acceleration with comparable detection accuracy in ViT-based sparse multi-view 3D detection, via spatiotemporal-aware dynamic patch size selection and an entropy-based informative patch enhancement mechanism.
SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM: This paper proposes SGAD-SLAM, which adopts a pixel-aligned simplified Gaussian representation and allows Gaussians to adjust their depth offset along the ray to improve rendering quality and scalability. A geometry-similarity-based GICP tracking strategy is introduced to accelerate camera pose estimation. The method comprehensively outperforms state-of-the-art approaches on Replica, TUM, ScanNet, and ScanNet++.
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation: SGI proposes a seed-based structured 2D Gaussian representation framework that organizes unstructured Gaussian primitives into seed-driven neural Gaussians, coupled with context-guided entropy coding and a multi-scale fitting strategy, achieving up to 7.5× compression and 6.5× optimization speedup in high-resolution image representation while maintaining or improving reconstruction fidelity.
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation: SGI organizes unstructured 2D Gaussian primitives via seed points and decodes their attributes with lightweight MLPs. Combined with context-model-driven entropy coding and a multi-scale fitting strategy, SGI achieves up to 7.5× compression and 6.5× speedup in high-resolution image representation while maintaining or improving fidelity.
SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering: SGS-Intrinsic proposes a two-stage indoor inverse rendering framework. Stage I constructs a geometrically consistent dense Gaussian field guided by semantic and geometric priors. Stage II performs material–illumination decomposition via a hybrid lighting model and material priors, with a dedicated de-shadowing module to prevent shadow baking into albedo.
Sky2Ground: A Benchmark for Site Modeling under Varying Altitude: This paper introduces the Sky2Ground dataset (51 scenes, 80k images, covering satellite/aerial/ground views with both synthetic and real imagery) and the SkyNet model (dual-stream encoder + masked satellite attention + progressive view sampling), presenting the first systematic study of joint camera localization across ground, aerial, and satellite viewpoints. SkyNet achieves a 9.6% improvement in RRA@5 and an 18.1% improvement in RTA@5.
SonoWorld: From One Image to a 3D Audio-Visual Scene: SonoWorld is a training-free framework that generates an explorable 3D audio-visual scene from a single image. The pipeline expands the input image into a 360° panorama and reconstructs it as a 3D Gaussian scene, places sound-source anchors via VLM-driven semantic grounding, and renders spatial audio through Ambisonics encoding, achieving geometric and semantic alignment between the visual and auditory modalities.
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs: This paper proposes SoPE, a spherical coordinate-based positional embedding that remaps point cloud tokens from one-dimensional sequence indices to a spherical coordinate space \((t,r,\theta,\phi)\), combined with multi-dimensional frequency allocation and multi-scale frequency mixing strategies, significantly enhancing the spatial perception capabilities of 3D large vision-language models.
SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection: This paper proposes SPAN, a plug-and-play geometric co-constraint framework that enforces global geometric consistency across decoupled predictions via two differentiable losses — Spatial Point Alignment (3D corner MGIoU alignment) and 3D-2D Projection Alignment (projected bounding rectangle GIoU alignment) — coupled with a Hierarchical Task Learning strategy to stabilize training. On KITTI, SPAN improves MonoDGP's Car Moderate AP3D by 0.92%, achieving a new state of the art with zero additional inference overhead.
Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting: This paper proposes the first frequency-domain defense framework against resource-targeting attacks on 3DGS. By combining a 3D frequency filter that selectively prunes anomalous high-frequency Gaussians with 2D spectral regularization that constrains anisotropic noise in rendered images, the method suppresses Gaussian over-proliferation by up to 5.92×, reduces peak GPU memory by up to 3.66×, and accelerates rendering by up to 4.34× under attack, while maintaining reconstruction quality.
Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting: This paper proposes the first frequency-domain defense framework against resource-targeting attacks on 3DGS — a 3D frequency filter that selectively prunes high-frequency anomalous Gaussians, combined with a 2D angular anisotropy regularization that penalizes directionally concentrated high-frequency noise. The method suppresses attack-induced Gaussian over-growth by up to 5.92×, reduces peak memory by 3.66×, improves rendering speed by 4.34×, and even raises PSNR by +1.93 dB.
Speed3R: Sparse Feed-forward 3D Reconstruction Models: Speed3R introduces a trainable dual-branch Global Sparse Attention (GSA) mechanism for feed-forward 3D reconstruction models. A compression branch provides coarse-grained scene summaries while a selection branch focuses fine-grained attention on critical tokens, achieving 12.4× inference speedup on 1000-view sequences with only marginal accuracy degradation.
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists: By periodically resetting Gaussian scales (Scale Reset) and imposing an entropy constraint on alpha blending weights, this paper shortens the per-pixel Gaussian list length to achieve 5–12× training acceleration in 3DGS while maintaining comparable rendering quality.
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting: SR3R reformulates 3D super-resolution (3DSR) as a feed-forward mapping from sparse low-resolution views to high-resolution 3DGS, achieving high-fidelity HR 3DGS reconstruction via Gaussian offset learning and feature refinement, without per-scene optimization, while enabling strong zero-shot generalization.
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction: This paper proposes STAC, a framework that exploits spatio-temporal sparsity in the KV cache of causal Transformers. Through three modules—working temporal token caching, long-term spatial token caching, and chunk-based multi-frame optimization—STAC reduces memory consumption by approximately 10× and improves inference speed by 4× for streaming 3D reconstruction, without any additional training and with negligible degradation in reconstruction quality.
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction: STAvatar leverages a UV-adaptive soft binding framework and a temporal adaptive density control strategy to reconstruct high-fidelity, drivable 3D head avatars from monocular video. It significantly outperforms existing methods in occluded regions (oral interior, eyelids) and on fine-grained details.
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas: This paper proposes Stepper, a framework that generates immersive 3D scenes driven by text input by progressively synthesizing multi-view panoramas and feeding them into a feed-forward 3D reconstruction pipeline, achieving an average PSNR improvement of 3.3 dB over existing methods.
STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding: STS-Mixer is the first to introduce the Graph Fourier Transform (GFT) into 4D point cloud video understanding. By decomposing point clouds in the frequency domain to capture geometric structures at different scales (low frequency = global shape, high frequency = local details) and mixing spectral features with spatio-temporal information, STS-Mixer achieves state-of-the-art performance on action recognition and semantic segmentation.
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation: SwiftTailor is a lightweight two-stage framework that combines PatternMaker for sewing pattern prediction with GarmentSewer for converting patterns into a Garment Geometry Image (GGI) in a unified UV space. Via inverse mapping and dynamic stitching, the framework directly assembles 3D garment meshes, achieving SOTA quality while running orders of magnitude faster than existing methods.
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking: TagSplat is a topology-aware Gaussian splatting framework that explicitly encodes spatial connectivity among Gaussian primitives, enabling the generation of topologically consistent mesh sequences in dynamic scene reconstruction while supporting accurate 3D keypoint tracking.
Learning 3D Reconstruction with Priors in Test Time: This paper proposes Test-time Constrained Optimization (TCO), a framework that improves 3D reconstruction accuracy by treating available priors (camera poses, intrinsics, depth) as output constraints optimized at inference time, without retraining or modifying the architecture of pretrained multiview Transformers.
Text–Image Conditioned 3D Generation: This paper identifies that image and text conditions provide complementary information for 3D generation—images supply precise appearance but are limited by viewpoint, while text provides global semantics but lacks visual detail—and proposes TIGON, a minimalist dual-branch DiT baseline that achieves native text-image jointly conditioned 3D generation via zero-initialized cross-modal bridges (early fusion) and step-wise prediction averaging (late fusion).
TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification: This paper proposes TopoMesh, which unifies both ground-truth and predicted meshes under the Dual Marching Cubes (DMC) topology framework, enabling explicit vertex- and face-level correspondence for the first time. This allows direct mesh-level supervision over topology, vertex positions, and face normals. The proposed method improves F1-Sharp by 5.9–7.1% over the current state of the art, with particularly notable advantages in sharp feature preservation.
Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos: This paper introduces the World Scene Graph Generation (WSGG) task, which constructs spatio-temporally persistent, world-coordinate-anchored scene graphs from monocular videos, covering all objects including occluded and out-of-frame ones. The paper also presents the ActionGenome4D dataset and three complementary methods (PWG/MWAE/4DST).
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast: TR2M is a framework that leverages images and textual descriptions to predict pixel-wise scale/shift maps, converting generalizable but scale-ambiguous relative depth into metric depth. With only 19M trainable parameters and 102K training images, it achieves zero-shot cross-domain metric depth estimation.
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction: tttLRM is the first work to introduce Test-Time Training (TTT) into large-scale 3D reconstruction models. It leverages LaCT layers to achieve long-context and autoregressive 3D Gaussian reconstruction at linear complexity. Multi-view observations are compressed into TTT fast weights to form an implicit 3D representation, which is then decoded into explicit formats such as 3DGS, achieving state-of-the-art performance on both object-level and scene-level benchmarks.
Unblur-SLAM: Dense Neural SLAM for Blurry Inputs: Rather than naively inserting a deblurring network into the SLAM front-end, Unblur-SLAM is designed around a central decision: which blurry frames can be deblurred prior to tracking, and which must be modeled directly in 3D space. This insight drives a complete pipeline comprising blur detection, physically constrained deblurring, 3D Gaussian blur refinement, and a severe-blur fallback, enabling the system to handle both motion blur and defocus blur while substantially improving tracking and reconstruction quality.
UniSplat: Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images: UniSplat learns unified geometry-appearance-semantic 3D representations from unposed multi-view images via three components — dual masking, coarse-to-fine Gaussian splatting, and pose-conditioned recalibration — laying a perceptual foundation for spatial intelligence.
Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture: This paper proposes a face reconstruction pipeline based on improved Gaussian Splatting. It tightly couples Gaussians with triangle meshes via soft constraints and semantic segmentation supervision, reconstructing high-fidelity triangular mesh geometry from only 11 uncalibrated images. A PCA prior combined with a relightable Gaussian model is used to disentangle illumination and recover de-lit albedo textures, with outputs fully compatible with standard graphics pipelines (MetaHuman).
UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes: UTrice proposes replacing Gaussian ellipsoids with triangles as unified primitives for differentiable ray tracing, enabling direct triangle traversal within an OptiX BVH without any proxy geometry. The method significantly outperforms 3DGRT in rendering quality while maintaining real-time performance, and is natively compatible with triangles optimized by the rasterization-based Triangle Splatting, thereby achieving primitive unification across rasterization and ray tracing pipelines.
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM: This paper presents VarSplat, the first 3DGS-SLAM system that learns a per-splat appearance variance \(\sigma^2\) and renders a per-pixel uncertainty map \(V\) via the law of total variance. The uncertainty is uniformly applied to tracking, submap registration, and loop detection, achieving robust and state-of-the-art performance across four datasets.
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control: This paper presents VerseCrafter, a video world model based on a unified 4D geometric control representation (static background point cloud + per-object 3D Gaussian trajectories). A lightweight GeoAdapter injects 4D control signals into a frozen Wan2.1-14B video diffusion model, enabling precise and disentangled control over camera and multi-object motion. The authors also construct VerseControl4D, a real-world dataset containing 35K training samples.
VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale: This paper proposes VGG-T3, which compresses the variable-length KV representations in VGGT's global attention layers into fixed-size MLP weights via test-time training (TTT), reducing the computational complexity of offline feed-forward 3D reconstruction from \(O(n^2)\) to \(O(n)\), enabling large-scale scene reconstruction at the thousand-image level (1k images in only 58 seconds).
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection: This paper proposes VGGT-Det, the first multi-view indoor 3D object detection framework under a sensor-geometry-free (SG-Free) setting. By mining semantic priors (via attention-guided query generation, AG) and geometric priors (via query-driven feature aggregation, QD) from the internal representations of the VGGT encoder, VGGT-Det surpasses prior state-of-the-art methods by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively.
VGGT-SLAM++: Visual SLAM with DEM-Based Covisibility and Local Bundle Adjustment: VGGT-SLAM++ augments the VGGT feed-forward Transformer odometry with Digital Elevation Maps (DEMs) as a compact, geometry-preserving representation. It leverages DINOv2 embeddings for efficient loop closure detection and covisibility graph construction, and applies high-frequency Sim(3) local bundle adjustment to correct short-term drift, achieving a 45% reduction in ATE on TUM RGB-D (0.079m → 0.036m).
VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection: This paper proposes VirPro—an adaptive multimodal pre-training paradigm that provides scene-aware semantic supervision signals for weakly-supervised monocular 3D detection via visually guided probabilistic prompts (Adaptive Prompt Bank + Multi-Gaussian Prompt Modeling). VirPro can be seamlessly integrated into existing WS-M3D frameworks, achieving up to 4.8% AP improvement on KITTI.
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI: This paper proposes Wanderland, a real-to-sim framework that uses a handheld multi-sensor scanner (LiDAR+IMU+RGB) to capture open-world indoor and outdoor scenes. It employs LIV-SLAM to obtain metric-accurate geometry and camera poses, combines 3DGS for photorealistic rendering with geometrically grounded collision simulation, and constructs a large-scale dataset of 530 scenes / 420K frames / 3.8M m². The work systematically demonstrates that purely vision-based reconstruction falls significantly short of LiDAR-enhanced approaches in metric accuracy, mesh quality, and reliability for navigation policy training and evaluation.
What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?: This paper systematically ablates the design space of synthetic stereo matching training data—covering floating objects, backgrounds, materials, baselines, and more—and finds that "realistic indoor scenes + dense floating objects + wide baseline" is the optimal combination. The resulting WMGStereo-150k dataset, trained on a single source, outperforms the mixture of four classical datasets.
WMGStereo: What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?: This paper systematically investigates the design space of synthetic stereo datasets by individually varying six key parameters (floating object density, background objects, object types, materials, camera baseline, lighting augmentation) within the Infinigen procedural generator, and quantifies their impact on zero-shot stereo matching. The study finds that the combination of realistic indoor scenes + floating objects is most effective, leading to the construction of the WMGStereo-150k dataset. Training on this single dataset surpasses the combination of SceneFlow + CREStereo + TartanAir + IRS (28% reduction on Middlebury, 25% on Booster), with performance competitive with FoundationStereo.
Where, What, Why: Toward Explainable 3D-GS Watermarking: A representation-native 3D-GS watermarking framework that answers three key questions: Trio-Experts for carrier selection (where), Channel-wise Group Mask for gradient control (what), and decoupled fine-tuning for auditable attribution (why). Surpasses SOTA on both rendering quality (PSNR +0.83 dB) and bit accuracy (+1.24%).
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion: This paper proposes Yo'City, a multi-agent framework that achieves user-personalized, text-driven unbounded 3D city generation through a "City–District–Grid" hierarchical planning strategy, a produce–refine–evaluate isometric image synthesis loop, and a scene graph-guided expansion mechanism. The approach comprehensively outperforms existing methods such as SynCity in semantic consistency and visual quality.
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image: DynaAvatar presents the first zero-shot framework for reconstructing animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Through a static-to-dynamic knowledge transfer strategy and an optical flow-guided DynaFlow loss, the method achieves realistic garment dynamics under limited dynamic training data, surpassing all existing approaches across the board.

🎨 Image Generation¶

2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching: This paper proposes 2ndMatch, a fine-tuning framework for pruned diffusion models that aligns the second-order Jacobian matrix \(J^\top J\) between the pruned and original models—inspired by finite-time Lyapunov exponents (FTLE)—to match their sensitivity to input perturbations over time, thereby significantly closing the generation quality gap.
Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective: This paper proposes D2C (Diffusion Dataset Condensation)—the first dataset condensation framework for diffusion models—which achieves 100–233× training speedup while maintaining high-quality image generation by using only 0.8–8% of ImageNet data through a two-stage "Select + Attach" pipeline.
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation: This paper proposes the ADAPT framework, which employs three training-free modules — Attention-driven adaptive Prompt Scheduling (APS), Pooling Embedding Manipulation (PEM), and Latent Space Manipulation (LSM) — to deterministically and semantically control the generation transition from common to rare concepts, significantly outperforming the R2F baseline on RareBench.
HINGE: Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images: HINGE is a framework that, for the first time, repurposes a pre-trained expression-space single-cell foundation model (sc-FM, CellFM) as a histology-image-conditioned spatial gene expression generator. It achieves state-of-the-art performance on three ST datasets while maintaining superior gene co-expression consistency, through three core mechanisms: identity-initialized SoftAdaLN for lightweight visual context injection, an expression-space masked diffusion process that aligns with the pre-training objective, and a warm-start curriculum to stabilize training.
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation: This paper proposes Adaptive Auxiliary Prompt Blending (AAPB), which derives a closed-form adaptive blending coefficient via the Tweedie formula to dynamically balance the contributions of an auxiliary anchor prompt and a target prompt at each denoising step. Without any training, AAPB significantly improves semantic accuracy and structural fidelity for both rare concept generation and zero-shot image editing.
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration: This paper proposes Spectrum, a global spectral-domain feature forecasting method based on Chebyshev polynomials. By treating the intermediate features of the diffusion denoiser as functions of time and fitting coefficients via ridge regression, it achieves long-range feature forecasting where errors do not grow with step size. It reaches 4.79× acceleration on FLUX.1 and 4.67× on Wan2.1-14B with near-lossless quality.
Agentic Retoucher for Text-To-Image Generation: The problem of correcting local distortions (deformed fingers, facial abnormalities, text errors, etc.) in T2I diffusion model outputs is modeled as a Perception-Reasoning-Action multi-agent cyclic system named Agentic Retoucher. It utilizes a Perception Agent to locate defects via context-aware distortion saliency maps, a Reasoning Agent to diagnose distortion types through structured reasoning, and an Action Agent to execute repairs via tool selection. Combined with the GenBlemish-27K dataset, it achieves end-to-end iterative automatic correction.
Agentic Retoucher for Text-To-Image Generation: Agentic Retoucher reframes the local defect restoration of T2I generated images into a multi-agent closed-loop decision process of Perception \(\to\) Reasoning \(\to\) Action. Through context-aware saliency detection, human-preference-aligned diagnostic reasoning, and adaptive tool selection, it achieves autonomous restoration, improving plausibility by 2.89 points on GenBlemish-27K, with 83.2% of results rated better than the original by humans.
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations: AHS overcomes the limitations of self-supervised training by using a head reenactment model (GAGAvatar) to generate synthetic augmented data. Combined with a dual-encoder attention mechanism and an adaptive masking strategy, it achieves SOTA results in head swapping tasks for full-body images.
AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution: AlignVAR addresses two consistency failures of visual autoregressive (VAR) models in image super-resolution (ISR): spatially incoherent reconstructions caused by locally biased attention, and cross-scale error accumulation induced by residual supervision. The proposed framework introduces Spatial Consistency Autoregression (SCA) and Hierarchical Consistency Constraint (HCC) to jointly resolve both issues, achieving reconstruction quality superior to diffusion-based methods while delivering over 10× faster inference.
All-in-One Slider for Attribute Manipulation in Diffusion Models: The proposed All-in-One Slider framework trains a lightweight Attribute Sparse Autoencoder on the intermediate layer embeddings of a text encoder. It decomposes attributes into disentangled directions within a high-dimensional sparse activation space, enabling continuous, fine-grained, and composable control of multiple facial attributes with a single module. It also demonstrates zero-shot continuous manipulation capabilities for unseen attributes (e.g., ethnicity, celebrities).
All-in-One Slider for Attribute Manipulation in Diffusion Models: The All-in-One Slider framework is proposed, which trains an Attribute Sparse Autoencoder on the text embedding space to decouple various facial attributes into sparse semantic directions. This enables a single lightweight module to achieve fine-grained continuous control over 52+ attributes, supporting multi-attribute combinations and zero-shot manipulation of unseen attributes.
Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling: The Ani3DHuman framework is proposed, combining kinematics-driven mesh animation with video diffusion priors. Through Self-guided Stochastic Sampling, it restores low-quality rigid body renderings into high-fidelity videos, achieving realistic modeling of non-rigid clothing dynamics.
APPLE: Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping: APPLE proposes a teacher-student framework based on diffusion models. It trains a teacher model using conditional deblurring (instead of traditional conditional inpainting) to generate attribute-aligned pseudo-labels, which are then used to train a student model. This achieves SOTA performance in attribute preservation (FID 2.18, Pose Error 1.85) while maintaining identity transfer capabilities.
Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation: Ar2Can decomposes multi-human image generation into two stages — spatial planning (Architect) and identity-preserving rendering (Artist) — and trains the Artist model via GRPO reinforcement learning with a spatially-anchored face reward based on Hungarian matching. The method achieves an identity preservation score of 68.2 and a counting accuracy of 90.2 on MultiHuman-Testbench, substantially outperforming all baselines.
AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys: The authors propose AS-Bridge, a bidirectional generative framework based on the Brownian Bridge diffusion process. It models the probabilistic conditional distribution between ground-based LSST and space-based Euclid astronomical surveys, enabling cross-survey image translation and rare event detection (gravitational lensing), while improving likelihood estimation of the standard Brownian Bridge via an \(\epsilon\)-prediction training objective.
AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys: AS-Bridge is proposed to model the conditional probability distribution between ground-based LSST and space-based Euclid survey observations using a bidirectional Brownian Bridge diffusion process, enabling cross-survey probabilistic image translation and unsupervised strong gravitational lens detection by leveraging reconstruction inconsistency.
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models: This paper employs linear probing to demonstrate that implicit decisions in diffusion models—such as defaulting to male when gender is unspecified—are primarily governed by self-attention layers rather than cross-attention layers. Building on this finding, the paper proposes ICM, a method that intervenes on a small number of critical self-attention layers to achieve state-of-the-art debiasing while minimizing image quality degradation.
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution: This paper reframes AI-generated image attribution from a classification paradigm to an instance retrieval paradigm, proposing the LIDA framework. It extracts generator-specific fingerprints from RGB low-bit planes as input, and achieves open-set attribution via unsupervised pre-training on real images followed by few-shot adaptation. LIDA achieves average Rank-1 accuracies of 40.4%/77.5% on GenImage and WildFake under the 1-shot setting, substantially outperforming existing methods.
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution: This paper proposes LIDA, which reformulates AI-generated image attribution from a classification problem into a retrieval problem. By leveraging low-bit-plane fingerprints to capture generator-specific artifacts, combined with unsupervised pre-training and few-shot adaptation, LIDA achieves state-of-the-art Deepfake detection and image attribution under zero-shot and few-shot settings.
AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models: Proposes AutoDebias—the first unified framework to simultaneously detect and mitigate malicious backdoor biases in T2I models. It leverages VLM open-set detection to discover trigger-bias associations and construct look-up tables, then eliminates backdoor associations through CLIP-guided distribution alignment training. It reduces the attack success rate from 90% to nearly 0 across 17 backdoor scenarios while maintaining image quality.
Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro: Banana100 systematically investigates quality degradation in multi-turn editing by having Nano Banana Pro iteratively replicate images 100 times, constructing a dataset of 28,000 degraded images. The study reveals a startling finding: 21 mainstream No-Reference Image Quality Assessment (NR-IQA) metrics fail to reliably detect iterative degradation—most metrics even assign higher scores to noisy images than to clean ones.
BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling: This paper proposes BeautyGRPO, a reinforcement learning-based face retouching framework that constructs a fine-grained preference dataset FRPref-10K to train a dedicated reward model, and introduces a Dynamic Path Guidance (DPG) mechanism to balance stochastic exploration and high fidelity, achieving natural retouching results aligned with human aesthetic preferences.
Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training: This paper identifies a "Motion-Vision Quality Dilemma" where motion quality (MQ) and visual quality (VQ) are negatively correlated in video data. Through gradient analysis, it reveals that imbalanced data can produce equivalent learning signals at appropriate timesteps. The proposed TQD framework enables training on imbalanced data to surpass training on "golden data."
BiGain: Unified Token Compression for Joint Generation and Classification: BiGain proposes a frequency-aware token compression framework. Through Laplacian-gated token merging (preserving high-frequency details) and interpolate-extrapolate KV downsampling (preserving query precision), it is the first to simultaneously optimize generation quality and classification accuracy in diffusion model inference acceleration.
BiGain: Unified Token Compression for Joint Generation and Classification: BiGain proposes a frequency-aware token compression framework comprising two training-free operators: Laplacian-Gated Token Merging and Interpolation-Extrapolation KV Downsampling. It is the first to maintain generation quality while significantly improving discriminative classification performance in diffusion model acceleration.
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation: BiMotion is proposed to compress variable-length motion sequences into a fixed number of control points using continuously differentiable B-spline curves. Combined with a specialized VAE and a flow-matching diffusion model, it achieves fast, highly expressive, and semantically complete text-guided dynamic 3D character generation, outperforming existing methods in both quality and efficiency.
BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment: This paper proposes the BioVITA framework, comprising a million-scale tri-modal (image–text–audio) biological dataset, a two-stage alignment model, and a six-direction cross-modal species-level retrieval benchmark, achieving for the first time unified visual-textual-acoustic representation learning in the biological domain.
BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation: This paper proposes BlackMirror, a two-stage framework that achieves generalizable black-box backdoor detection against T2I models through fine-grained instruction-response semantic deviation detection (MirrorMatch) and cross-prompt stability verification (MirrorVerify). The framework achieves an average F1 of 89.46%, substantially outperforming the existing black-box method UFID.
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing: Proposes CARE-Edit, a condition-aware expert routing framework that implements dynamic computation allocation on a DiT backbone via heterogeneous experts (Text/Mask/Reference/Base) coupled with a lightweight latent-attention router. This effectively addresses issues like color bleeding and identity drift caused by conflicting multi-conditional signals (text, mask, reference image) in unified image editors.
CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion: Proposes CaReFlow, the first work to utilize rectified flow for multimodal distribution mapping to bridge the modality gap: it enables source modality data points to observe the global distribution of the target modality through one-to-many mapping, applies different alignment intensities to modality pairs with varying correlation via adaptive relaxed alignment, and ensures no information loss after mapping through cyclic rectified flow. It achieves SOTA on multiple multimodal affective computing benchmarks even with simple concatenation fusion.
Causal Motion Diffusion Models for Autoregressive Motion Generation: This paper proposes CMDM, a framework that unifies diffusion denoising and autoregressive generation within a motion-language-aligned causal latent space. By employing frame-wise independent noise and a causal uncertainty-based sampling schedule, CMDM achieves high-quality, low-latency text-to-motion generation and long-sequence streaming synthesis.
Guiding Diffusion Models with Semantically Degraded Conditions (CDG): Condition-Degradation Guidance (CDG) replaces the null prompt \(\emptyset\) in CFG with a semantically degraded condition \(\boldsymbol{c}_{\text{deg}}\), transforming the guidance from a "good vs. empty" comparison to a refined "good vs. almost good" contrast. This significantly improves the compositional generation precision of diffusion models without requiring any training.
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance: This paper reinterprets Classifier-Free Guidance (CFG) as a feedback control process within flow matching diffusion models, proposes a unified framework termed CFG-Ctrl, and introduces SMC-CFG — a nonlinear feedback guidance mechanism grounded in sliding mode control (SMC) — which substantially improves semantic consistency and generation robustness at large guidance scales.
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing: Proposes ChangeBridge, which achieves conditional spatiotemporal image generation from pre-event to post-event in remote sensing scenes via a drift-asynchronous diffusion bridge. It supports multimodal controls including coordinate-text, semantic masks, and instance layouts, serving as a data generation engine for change detection tasks.
ChordEdit: One-Step Low-Energy Transport for Image Editing: Based on dynamic optimal transport theory, a low-energy Chord control field is derived to smooth unstable naive editing fields, achieving the first training-free, inversion-free, and high-fidelity real-time image editing for distilled one-step T2I models.
Cinematic Audio Source Separation Using Visual Cues: This paper proposes the first audio-visual cinematic audio source separation (AV-CASS) framework, which leverages visual cues from dual video streams (face and scene) to perform generative three-way audio separation (dialogue/effects/music) via conditional flow matching, training solely on synthetic data while generalizing to real films.
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers: Internal circuit mechanisms for spatial relation generation in Diffusion Transformers (DiT) are revealed through mechanistic interpretability: Randomized Token Embedding (RTE) models utilize a two-stage modular circuit (Relation Heads + Object Generation Heads), while T5-encoded models fuse relation information into object tokens for single-token decoding. The two mechanisms differ significantly in robustness.
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers: Through mechanistic interpretability methods, this work reveals two distinct circuit mechanisms for spatial relation generation in Diffusion Transformers: Randomized Text Encoders (RTE) use a two-stage modular circuit with "relation heads + object heads," while T5 encoders integrate relation information into object tokens for single-token decoding, making the latter more fragile under out-of-distribution perturbations.
CoD: A Diffusion Foundation Model for Image Compression: This paper proposes CoD, the first diffusion foundation model designed for image compression. Trained from scratch for joint compression-generation optimization, CoD replaces Stable Diffusion in downstream diffusion codecs and achieves state-of-the-art performance at ultra-low bitrates (0.0039 bpp), with a training cost of only 0.3% of that required by SD.
coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation: This paper proposes coDrawAgents, an interactive multi-agent dialogue framework (Interpreter-Planner-Checker-Painter). It significantly enhances the faithfulness of compositional text-to-image generation in complex scenarios through divide-and-conquer incremental layout planning, visual context-driven spatial reasoning, and an explicit error correction mechanism.
coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation: This paper proposes coDrawAgents, an interactive multi-agent dialogue framework in which four specialized agents — Interpreter, Planner, Checker, and Painter — collaborate in a closed loop. A divide-and-conquer strategy incrementally plans layouts group by group according to semantic priority, grounding reasoning in canvas visual context with explicit error correction. The framework achieves an Overall Score of 0.94 on GenEval, substantially outperforming GPT Image 1 (0.84), and reaches 85.17 SOTA on DPG-Bench.
CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment: This paper proposes CognitionCapturerPro, which integrates EEG signals with four modalities (image, text, depth, and edge) via Uncertainty-Weighted Masking (UM), a multi-modal fusion encoder, and Shared-Trunk Multi-Head Alignment (STH-Align). On THINGS-EEG, the method achieves a Top-1 retrieval accuracy of 61.2% and Top-5 of 90.8%, improving over the predecessor CognitionCapturer by 25.9% and 10.6%, respectively.
CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment: CognitionCapturerPro addresses Fidelity Loss via uncertainty-weighted masking and resolves Representational Shift by integrating image, text, depth, and edge information through a multi-modal fusion encoder. Combined with a lightweight shared backbone alignment replacing diffusion priors, it improves Top-1/Top-5 retrieval accuracy on the THINGS-EEG dataset by 25.9% and 10.6%, respectively.
CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation: Introduces CoLoGen, a unified image generation framework based on "Concept-Localization Duality." Through progressive staged training and the Progressive Representation Weaving (PRW) dynamic expert routing architecture, it simultaneously matches or exceeds specialized models across three major tasks: instruction editing, controllable generation, and personalized generation.
ConsistCompose: Unified Multimodal Layout Control for Image Composition: The paper proposes ConsistCompose, which achieves layout-controllable multi-instance image generation within a unified multimodal framework by embedding layout coordinates directly into language prompts (the LELG paradigm). It constructs the ConsistCompose3M dataset with 3.4 million samples providing layout and identity supervision. Coupled with a Coordinate-aware CFG mechanism, it achieves a 7.2% improvement in layout IoU and a 13.7% improvement in AP on COCO-Position while maintaining general understanding capabilities.
ConsistCompose: Unified Multimodal Layout Control for Image Composition: Proposes the LELG (Language-Embedded Layout Guidance) paradigm, which encodes bounding box coordinates directly into text tokens within the language stream. This achieves layout-controllable multi-instance image generation in a unified multimodal Transformer without requiring any specialized layout encoders or branches.
COT-FM: Cluster-wise Optimal Transport Flow Matching: This paper proposes COT-FM, a plug-and-play Flow Matching enhancement framework that clusters target samples, inverts a pretrained model to recover cluster-wise source distributions, and approximates optimal transport within each cluster. This significantly straightens transport trajectories, simultaneously accelerating sampling and improving generation quality without modifying the model architecture.
CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think: CRAFT proposes an ultra-lightweight alignment method for diffusion models: it automatically constructs high-quality training sets through a Compositional Reward Filtering (CRF) strategy and then performs an enhanced version of SFT. Theoretically, CRAFT optimizes the lower bound of Group Relative Policy Optimization (GRPO). It outperforms SOTA methods requiring thousands of preference pairs using only 100 samples, with training speeds 11-220 times faster.
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video: This paper proposes C-MET (Cross-Modal Emotion Transfer), which models the mapping of emotion semantic vectors between speech and facial expression spaces, achieving for the first time speech-driven talking face video generation for extended emotions (e.g., sarcasm, charisma), surpassing the state of the art in emotion accuracy by 14%.
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration: This paper proposes CTCal (Cross-Timestep Self-Calibration), which leverages reliable text-image alignments (cross-attention maps) formed at small timesteps (low noise) to calibrate representation learning at large timesteps (high noise), providing explicit cross-timestep self-supervision for text-to-image generation. CTCal comprehensively outperforms existing methods on T2I-CompBench++ and GenEval.
Cycle-Consistent Tuning for Layered Image Decomposition: A cycle-consistent fine-tuning framework based on diffusion models is proposed to achieve image layer separation (e.g., logo-object decomposition) by jointly training decomposition and synthesis models. A progressive self-improving data augmentation strategy is introduced to achieve robust decomposition in scenarios with non-linear layer interactions.
D2C: Accelerating Diffusion Model Training under Minimal Budgets via Condensation: This work introduces dataset condensation to diffusion model training for the first time, proposing the D2C two-stage framework (Select+Attach). Using only 0.8% of ImageNet data, it achieves an FID of 4.3 in 40K steps, training 100× faster than REPA and 233× faster than vanilla SiT.
DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment: This paper proposes Detail-Aligned VAE (DA-VAE), which introduces structured latent representations (base + detail channels) with an alignment loss to achieve a 4× compression ratio increase over pretrained VAEs without retraining diffusion models from scratch, requiring only 5 H100-days to adapt SD3.5 for 1024×1024 image generation.
Elucidating the SNR-t Bias of Diffusion Probabilistic Models: This paper reveals the pervasive SNR-t bias in diffusion models (the mismatch between the Signal-to-Noise Ratio of samples in the reverse process and their timesteps) and proposes Differential Correction in Wavelet domain (DCW). DCW is a training-free, plug-and-play method that enhances generation quality across various diffusion models.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation: DeCo proposes a frequency-decoupled pixel diffusion framework that delegates high-frequency detail synthesis to a lightweight pixel decoder while allowing the DiT to focus on low-frequency semantic modeling. Combined with a frequency-aware flow matching loss, it achieves FID 1.62 (256×256) and 2.22 (512×512) on ImageNet, substantially narrowing the gap between pixel diffusion and latent diffusion models.
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache: Diffusion model sampling acceleration is formulated as a global path planning problem. By constructing a Path-Aware Cost Tensor (PACT) to quantify the path dependency of skipping errors and using dynamic programming to select the optimal sequence of key steps, DPCache achieves a 4.87× speedup on FLUX, surpassing the full-step baseline by +0.028 ImageReward.
Depth Adaptive Efficient Visual Autoregressive Modeling: Reveals the fundamental limitations of the frequency-driven hard pruning paradigm in VAR models and proposes DepthVAR, a training-free inference acceleration framework. By adaptively allocating the Transformer computation depth for each token (rather than binary keep/prune), it achieves \(2.3\times\)-\(3.1\times\) speedup with minimal quality loss.
Diffusion Mental Averages: Proposes Diffusion Mental Averages (DMA), which extracts "mental average" prototype images of concepts from pretrained diffusion models by aligning multiple denoising trajectories in semantic space, achieving consistent and realistic concept-average visualization for the first time.
Diffusion Probe: Generated Image Result Prediction Using CNN Probes: This work discovers that the cross-attention distribution in early denoising steps of diffusion models is highly correlated with final image quality. It proposes Diffusion Probe — a lightweight CNN that predicts generation quality from early attention maps — enabling pre-filtering of low-quality generation trajectories after only 10% of denoising steps, thereby accelerating prompt optimization, seed selection, and GRPO training.
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization: This paper proposes DiFlowDubber, an automated video dubbing framework based on Discrete Flow Matching (DFM). Through a two-stage training pipeline (zero-shot TTS pre-training → video dubbing adaptation), large-scale TTS knowledge is transferred to video-driven dubbing. The framework features a FaPro module that captures the mapping from facial expression to prosody and a Synchronizer module for precise lip-sync.
DiP: Taming Diffusion Models in Pixel Space: The paper proposes DiP, an efficient pixel-space diffusion framework. By utilizing a DiT backbone to model global structures on large patches combined with a lightweight Patch Detailer Head to recover local details, it achieves computational efficiency comparable to LDMs without requiring a VAE, reaching a 1.79 FID on ImageNet 256×256.
Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation: This paper proposes the DisCo framework, which resolves the similarity-controllability paradox in subject-driven image generation by first decoupling textual and visual information (replacing entity words with pronouns to eliminate textual interference on the subject) and then re-coupling them via GRPO with a dedicated reward model.
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression: This paper proposes DiT-IC, which adapts a pre-trained T2I Diffusion Transformer into a one-step image compression reconstruction model via three alignment mechanisms (Variance-Guided Reconstruction Flow, Self-Distillation Alignment, and Latent Conditional Guidance). By performing diffusion in a deep latent space with \(32\times\) downsampling, it achieves SOTA perceptual quality while decoding \(30\times\) faster than existing diffusion-based codecs.
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression: This work adapts a pretrained text-to-image DiT (SANA) into an efficient single-step image compression decoder. Three alignment mechanisms are proposed: variance-guided reconstruction flow (pixel-level adaptive denoising intensity), self-distillation alignment (encoder latents as distillation targets), and latent-conditioned guidance (replacing the text encoder). Operating entirely in a deep latent space with 32× downsampling, the method achieves state-of-the-art perceptual quality (BD-rate DISTS −87.88%), decodes 30× faster than prior diffusion-based methods, and can reconstruct 2K images on a 16 GB laptop GPU.
Diversity over Uniformity: Rethinking Representation in Generated Image Detection: This paper proposes an Anti-Feature-Collapse Learning (AFCL) framework that filters task-irrelevant features via an information bottleneck and suppresses excessive overlap among heterogeneous forgery cues, thereby preserving diversity and complementarity in discriminative representations. The method achieves significant improvements in cross-model generated image detection.
DMin: Scalable Training Data Influence Estimation for Diffusion Models: This paper proposes DMin, a scalable training data influence estimation framework for diffusion models. Using an efficient gradient compression pipeline, it reduces storage requirements from hundreds of terabytes to the megabyte or kilobyte range, enabling influence estimation for billion-parameter diffusion models for the first time and supporting sub-second top-k retrieval.
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache: This paper formalizes diffusion model sampling acceleration as a global path planning problem. By constructing a Path-Aware Cost Tensor (PACT) and applying dynamic programming to select the optimal sequence of key timesteps, the method achieves training-free 4.87× acceleration while surpassing the full-step baseline in generation quality.
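The key-step selection described above can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not DPCache's implementation: it uses a plain 2D matrix `cost[i][j]` (the error incurred by jumping from key step `i` straight to key step `j`) in place of the paper's Path-Aware Cost Tensor, and the names `select_key_steps` and `num_keys` are hypothetical.

```python
import math


def select_key_steps(cost, num_keys):
    """Choose `num_keys` key steps out of len(cost) denoising steps.

    The first and last steps are always kept. The returned path minimizes
    the summed cost of jumping between consecutive key steps.
    `cost[i][j]` is only read for i < j.
    """
    n = len(cost)
    if not 2 <= num_keys <= n:
        raise ValueError("num_keys must be between 2 and the number of steps")

    # best[k][j]: minimal cost of a path that starts at step 0, ends at
    # step j, and contains exactly k + 1 key steps.
    best = [[math.inf] * n for _ in range(num_keys)]
    prev = [[-1] * n for _ in range(num_keys)]
    best[0][0] = 0.0

    for k in range(1, num_keys):
        for j in range(k, n):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i

    # Backtrack from the final step to recover the chosen key steps.
    path = [n - 1]
    j = n - 1
    for k in range(num_keys - 1, 0, -1):
        j = prev[k][j]
        path.append(j)
    path.reverse()
    return path, best[num_keys - 1][n - 1]
```

With a toy cost that grows quadratically in the number of skipped steps, the program prefers evenly spaced key steps. A cost that also depends on the earlier part of the path, as the term "path-aware" suggests, would need an extra state dimension in `best`.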
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning: DreamVideo-Omni introduces a two-stage progressive training paradigm (omni-motion identity supervised fine-tuning followed by latent identity reward feedback learning) that, for the first time, unifies multi-subject customization with full-granularity motion control (global bounding boxes + local trajectories + camera motion) within a single DiT architecture.
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning: This paper proposes DreamVideo-Omni, a unified DiT framework for multi-subject identity customization and omni-motion control (global bbox + local trajectory + camera motion). It resolves multi-subject ambiguity via condition-aware 3D RoPE and Group/Role Embeddings, and introduces Latent Identity Reward Feedback Learning (LIReFL) to provide dense identity rewards at arbitrary denoising timesteps, enabling efficient identity reinforcement by bypassing the VAE decoder.
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution: The paper proposes DUO-VSR, a three-stage distillation framework that compresses multi-step video super-resolution models into a one-step generator through progressive guided distillation initialization, dual-stream distillation (joint optimization of DMD and RFS-GAN), and preference-guided refinement. It achieves approximately 50× acceleration while surpassing the visual quality of previous one-step VSR methods.
DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data: DynaVid proposes utilizing synthetic optical flow rendered via computer graphics (rather than synthetic videos) to train video diffusion models. Through a two-stage framework consisting of a motion generator and a motion-guided video generator, it achieves realistic video synthesis of highly dynamic motions and fine-grained camera control.
EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation: EdgeDiT proposes a hardware-aware optimization framework for Diffusion Transformers that trains lightweight proxy blocks via hierarchical knowledge distillation and searches for Pareto-optimal architectures through multi-objective Bayesian optimization, achieving 20–30% parameter reduction, 36–46% FLOPs reduction, and 1.65× on-device speedup while maintaining or surpassing the generation quality of the original DiT-XL/2.
Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking: This paper provides a unified theoretical and experimental analysis of how non-adversarial diffusion editing inadvertently destroys robust invisible watermarks, deriving bounds for watermark SNR and mutual information decay, and validating systemic failures of watermark recovery across scenarios such as instruction-based editing, drag-based editing, and training-free synthesis.
Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking: This paper systematically analyzes, from both theoretical (SNR attenuation, mutual information lower bounds, denoising contraction) and empirical perspectives, how non-adversarial diffusion editing (instruction-based, drag-based, and composition-based) inadvertently destroys robust invisible watermarks, revealing that traditional post-processing robustness does not generalize to generative transformations.
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing: This paper proposes EffectErase, a framework that jointly learns video object insertion as an inverse auxiliary task to object removal, and constructs a large-scale VOR dataset containing 60K video pairs, enabling high-quality erasure of objects along with their associated visual side effects, including occlusion, shadow, reflection, illumination changes, and deformation.
EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation: EgoFlow proposes a generative framework based on Flow Matching that integrates multimodal scene conditions through a Mamba-Transformer-Perceiver hybrid architecture. During inference, it applies differentiable physical constraints (collision avoidance, motion smoothness) via gradient-guided sampling to generate physically plausible 6DoF object motion trajectories from first-person videos, reducing collision rates by up to 79%.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation: This work is the first to extend the MeanFlow framework from class-label conditioning to text-conditioned image generation. It identifies the semantic discriminability and disentanglement of text representations as the key bottlenecks under limited inference steps, and achieves high-quality few-step/one-step T2I generation based on the BLIP3o-NEXT text encoder.
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories: The EMMA benchmark is proposed to systematically evaluate concept erasure methods for T2I models across five dimensions (erasing ability, retaining ability, efficiency, quality, and bias) with 12 metrics. Covering 206 concept categories across 5 domains, it reveals for the first time the shallow erasure nature and bias amplification issues of existing methods under implicit prompts.
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception: This paper proposes the DIAE framework, which transforms vague aesthetic instructions into joint signals of HSV/contour maps and text via Multimodal Aesthetic Perception (MAP). It leverages an "imperfectly paired" dataset, IIAEData, to achieve weakly supervised image aesthetic enhancement.
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception: DIAE proposes a Multimodal Aesthetic Perception (MAP) module to convert vague aesthetic instructions into explicit control signals (HSV + contour maps + text). It constructs an "imperfectly paired" dataset, IIAEData, and uses a dual-branch supervision framework for weakly supervised training, achieving content-consistent aesthetic enhancement with a 17.4% improvement in LAION aesthetic scores.
Enhancing Spatial Understanding in Image Generation via Reward Modeling: The authors construct the SpatialReward-Dataset, an 80K adversarial preference dataset, to train SpatialScore—a reward model specifically for evaluating spatial relationship accuracy (outperforming GPT-5). By integrating a top-k filtering strategy with GRPO online RL, they significantly enhance the spatial generation capabilities of FLUX.1-dev.
Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models: This paper systematically evaluates the trade-off between safety (erasure success rate) and compositional generation capability across 16 text-to-image diffusion model unlearning methods. It reveals that aggressive erasure strategies, while removing undesirable content, severely damage attribute binding, spatial reasoning, and counting abilities, emphasizing that safety interventions should not come at the expense of the model's semantic logic.
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation: This paper proposes EVATok, a four-stage framework that first uses a proxy tokenizer to estimate the optimal token allocation for each video, then trains a lightweight router to predict these allocations in a single forward pass, and finally trains an adaptive tokenizer that flexibly assigns token counts according to content complexity. On UCF-101, EVATok achieves state-of-the-art generation quality with a 24.4% reduction in token count.
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation: This paper proposes EVATok, a four-stage framework that defines optimal token allocation via a proxy reward, trains a lightweight router to predict the optimal token budget for each video segment, and achieves content-adaptive variable-length video tokenization. EVATok attains state-of-the-art generation quality on UCF-101 while saving at least 24.4% of tokens.
Exploring Conditions for Diffusion Models in Robotic Control: This paper investigates how to leverage the conditioning mechanisms of pretrained text-to-image diffusion models to generate task-adaptive visual representations for robotic control. It identifies that text conditioning fails in control environments due to domain gap, and proposes ORCA, a framework that employs learnable task prompts and per-frame visual prompts as conditioning signals. ORCA achieves state-of-the-art performance across 12 tasks on three benchmarks: DMC, MetaWorld, and Adroit.
ExpPortrait: Expressive Portrait Generation via Personalized Representation: This paper proposes a high-fidelity personalized head representation (static identity offset + dynamic expression offset) to address the limited expressiveness of parametric models such as SMPL-X. Combined with an identity-adaptive expression transfer module and a DiT-based generator, the method achieves state-of-the-art performance on both self-driven portrait video animation and cross-identity reenactment tasks.
ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop: This paper presents ExpressEdit, a fully open-source Photoshop plugin that achieves noise-free editing of stylized facial expressions within 3 seconds on a single consumer-grade GPU, leveraging a SPICE-based diffusion model backend combined with a Danbooru expression tag database and a RAG system, significantly outperforming commercial models such as GPT, Grok, and Nano Banana 2.
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration: Face2Scene proposes a two-stage framework: a reference-based face restoration model (Ref-FR) first produces HQ-LQ face pairs from which degradation codes are extracted as an "oracle"; these codes then condition a single-step diffusion model to restore the full scene, including body and background.
FDeID-Toolbox: Face De-Identification Toolbox: This paper presents FDeID-Toolbox, a modular face de-identification toolbox that uniformly integrates 16 de-identification methods (spanning four categories: naive, generative, adversarial, and K-Same), 6 benchmark datasets, and a systematic evaluation protocol covering three dimensions—privacy protection, attribute preservation, and visual quality—addressing the field's persistent problems of fragmented implementations, inconsistent evaluation protocols, and incomparable results.
FDeID-Toolbox: Face De-Identification Toolbox: This paper proposes FDeID-Toolbox, a modular face de-identification research toolbox comprising four standardized components—data loading, unified method implementation, flexible inference pipeline, and systematic evaluation protocol—enabling, for the first time, fair and reproducible comparisons across diverse de-identification methods along three dimensions: privacy protection, utility preservation, and visual quality.
Few-shot Acoustic Synthesis with Multimodal Flow Matching: This paper proposes FLAC, the first flow matching-based few-shot room impulse response (RIR) generation framework, capable of synthesizing spatially consistent acoustic responses in unseen scenes from a single recording. It further introduces AGREE, a joint embedding for geometry–acoustic consistency evaluation.
FG-Portrait: 3D Flow Guided Editable Portrait Animation: FG-Portrait introduces "3D optical flow", computed directly from the FLAME parametric 3D head model without any learning, as a geometry-driven motion correspondence signal. The 3D flow is encoded with depth-guided sampling and used as the motion condition for a diffusion model ControlNet. The method substantially improves motion transfer accuracy (APD reduced by more than 22%) and supports inference-time editing of expression and head pose.
Flash-Unified: Training-Free and Task-Aware Acceleration for Native Unified Models: FlashU conducts the first systematic redundancy analysis of native unified multimodal models, identifying parameter specialization and computational heterogeneity. Based on these findings, it proposes a training-free, task-aware acceleration framework that achieves 1.78×–2.01× speedup on Show-o2 while maintaining SOTA performance, through FFN pruning, dynamic layer skipping, adaptive guidance scaling, and diffusion head caching.
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation: FontCrafter reframes artistic font generation as a visual in-context generation task. By horizontally concatenating reference element images with a blank canvas and feeding the result into a pretrained inpainting model (FLUX.1-Fill), it achieves high-fidelity element-driven font creation, significantly outperforming existing methods in both texture and structural fidelity.
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems: This paper proves that the DDIM deterministic reverse chain is equivalent to a Partitioned Iterated Function System (PIFS), and derives three computable quantities from fractal geometry—the contraction threshold \(L_t^*\), the diagonal expansion function \(f_t(\lambda)\), and the global expansion threshold \(\lambda^{**}\)—providing a unified theoretical explanation for four empirically motivated design choices: cosine schedule offset, resolution logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling schedule.
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems: This paper proves that the DDIM deterministic reverse chain is essentially a Partitioned Iterated Function System (PIFS), and derives from this framework three computable geometric quantities requiring no model evaluation. It provides a unified, first-principles explanation for the two-phase denoising dynamics of diffusion models, the effectiveness of self-attention, and four empirical design choices (cosine schedule offset, resolution-dependent logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling).
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution: FRAMER proposes a frequency-aligned self-distillation training framework that uses final-layer feature maps as teacher supervision for intermediate layers. By applying IntraCL and InterCL contrastive losses to low-frequency (LF) and high-frequency (HF) components respectively, along with Frequency-based Adaptive Weight (FAW) and Frequency-based Adaptive Modulation (FAM), FRAMER significantly improves high-frequency detail recovery in diffusion-based real-world image super-resolution without modifying the network architecture or inference pipeline.
Frequency-Aware Flow Matching for High-Quality Image Generation: FreqFlow explicitly incorporates frequency-domain awareness into the flow matching framework via a dual-branch architecture that processes low-frequency global structures and high-frequency detail information separately, achieving state-of-the-art FID of 1.38 on ImageNet-256.
From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition: This paper identifies an intrinsic connection between image layer decomposition and inpainting/outpainting, and proposes the Outpaint-and-Remove framework, which efficiently adapts a pretrained inpainting DiT model (FLUX.1-Fill-dev) for layer decomposition via lightweight LoRA fine-tuning. A multi-modal context fusion module is introduced to preserve fine details. The method achieves state-of-the-art performance using only 100K synthetic training samples.
Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories: This paper introduces Garments2Look, the first large-scale multimodal outfit-level virtual try-on dataset (80K pairs, 40 categories, 300+ subcategories). Each sample contains 3–12 reference garment images, a model outfit image, and detailed textual annotations. The dataset exposes significant shortcomings of existing methods in multi-layer outfit composition and accessory consistency.
Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication: This paper models the watermark embedding and extraction process in diffusion models as communication over a noisy channel, and proposes the Gaussian Shannon framework. By cascading majority voting and LDPC error-correcting codes, the framework achieves bit-exact watermark recovery (rather than mere threshold-based detection), attaining state-of-the-art bit accuracy and detection rates across three Stable Diffusion versions and seven types of perturbation.
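The majority-voting stage mentioned above can be illustrated with a plain repetition code. This is a minimal sketch of the general idea only: the function names `repeat_encode` and `majority_decode` are hypothetical, the LDPC stage is omitted, and nothing here reflects how the paper maps bits into diffusion noise.

```python
def repeat_encode(bits, copies):
    """Repeat every message bit `copies` times (a repetition code)."""
    return [b for b in bits for _ in range(copies)]


def majority_decode(received, copies):
    """Recover each message bit by majority vote over its `copies` repeats.

    `copies` must be odd so that a vote can never tie.
    """
    if copies % 2 == 0:
        raise ValueError("copies must be odd")
    if len(received) % copies != 0:
        raise ValueError("received length must be a multiple of copies")

    decoded = []
    for start in range(0, len(received), copies):
        chunk = received[start:start + copies]
        # A bit decodes to 1 when more than half of its repeats are 1.
        decoded.append(1 if 2 * sum(chunk) > copies else 0)
    return decoded
```

With `copies = 5`, each message bit survives up to two flipped repeats. A second-stage error-correcting code, such as the LDPC code named in the summary, would then handle the residual errors that majority voting alone leaves behind.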
GIST: Towards Design Compositing: This paper proposes GIST, a training-free identity-preserving image compositing method that achieves style harmonization across multi-source visual elements via cross-attention-guided token injection and Flow Matched latent space initialization, serving as a plug-and-play compositing stage between layout prediction and typography generation.
gQIR: Generative Quanta Image Reconstruction: This work adapts a large-scale text-to-image latent diffusion model to the extreme photon-limited imaging regime of single-photon avalanche diodes (SPADs) via a three-stage framework—Quanta-aligned VAE → adversarially fine-tuned LoRA U-Net → FusionViT spatiotemporal fusion—enabling high-quality RGB image reconstruction from sparse binary photon detections and significantly outperforming all existing methods under extreme conditions of 10K–100K fps.
gQIR: Generative Quanta Image Reconstruction: This paper proposes gQIR, a modular three-stage framework that adapts large-scale text-to-image (T2I) diffusion models to the extreme photon-limited domain of SPAD sensors. It employs a quanta-aligned VAE (with a frozen encoder copy to prevent collapse), an adversarially fine-tuned LoRA U-Net for single-step generation, and a latent-space FusionViT for spatiotemporal fusion, enabling high-quality color image and video reconstruction from extremely sparse binary photon events.
GrOCE: Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models: GrOCE proposes a training-free concept erasure framework based on dynamic semantic graphs, achieving precise, context-aware online removal of target concepts in text-to-image diffusion models through three cooperative components: semantic graph construction, adaptive clustering identification, and selective severing.
Group Editing: Edit Multiple Images in One Go: This paper proposes GroupEditing, which reconstructs a group of related images as pseudo-video frames and combines explicit geometric correspondences from VGGT with the implicit temporal prior of a video diffusion model. Two specially designed positional encodings—Ge-RoPE and Identity-RoPE—are introduced to inject correspondence information, enabling cross-view consistent group image editing that significantly outperforms existing methods in visual quality, editing consistency, and semantic alignment.
Guiding a Diffusion Model by Swapping Its Tokens: This paper proposes Self-Swap Guidance (SSG), a training-free sampling guidance method for diffusion models that constructs perturbations by selectively swapping the most semantically dissimilar token pairs in the intermediate representation space. Compared to SAG/PAG/SEG, SSG stably generates high-fidelity images over a wider range of guidance scales, achieving state-of-the-art FID on both conditional and unconditional generation.
Guiding a Diffusion Transformer with the Internal Dynamics of Itself: This paper proposes Internal Guidance (IG), which adds auxiliary supervision losses to intermediate layers of a Diffusion Transformer to produce weaker generative outputs, then extrapolates the discrepancy between intermediate-layer and final-layer outputs at sampling time to achieve an Autoguidance-like effect — requiring no additional sampling steps or external model training. On ImageNet 256×256, IG pushes LightningDiT-XL/1 to FID 1.34 (without CFG) and 1.19 (+CFG), achieving state-of-the-art results among contemporaneous methods.
Guiding Diffusion Models with Semantically Degraded Conditions: This paper proposes Condition-Degradation Guidance (CDG), which replaces the null prompt \(\emptyset\) in CFG with a semantically degraded condition \(\boldsymbol{c}_{\text{deg}}\), transforming the guidance paradigm from a coarse-grained "good vs. empty" contrast to a fine-grained "good vs. slightly worse" contrast. Through a stratified degradation strategy—first degrading content tokens, then context-aggregating tokens—CDG constructs adaptive negative samples and achieves plug-and-play improvements in compositional generation accuracy on models including SD3, FLUX, and Qwen-Image, with negligible additional overhead.
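The substitution described above can be written out against the standard classifier-free guidance rule. The following is a sketch in noise-prediction form; the symbols \(\epsilon_\theta\) (the denoiser), \(x_t\) (the noisy sample), \(\boldsymbol{c}\) (the original condition), and \(w\) (the guidance scale) are notational assumptions, and the paper's exact parameterization may differ (for example, velocity prediction in flow-based models such as SD3 and FLUX).

Standard CFG extrapolates away from the null prompt:

\[
\hat{\epsilon}_{\text{CFG}} = \epsilon_\theta(x_t, \emptyset) + w \left( \epsilon_\theta(x_t, \boldsymbol{c}) - \epsilon_\theta(x_t, \emptyset) \right)
\]

Replacing \(\emptyset\) with the degraded condition \(\boldsymbol{c}_{\text{deg}}\) gives the CDG-style rule:

\[
\hat{\epsilon}_{\text{CDG}} = \epsilon_\theta(x_t, \boldsymbol{c}_{\text{deg}}) + w \left( \epsilon_\theta(x_t, \boldsymbol{c}) - \epsilon_\theta(x_t, \boldsymbol{c}_{\text{deg}}) \right)
\]

In both cases \(w = 1\) recovers the plain conditional prediction. The only difference is the reference point that the guidance direction is measured from.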
HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation: This paper proposes HaltNav, a hierarchical navigation framework that combines lightweight text-based topological maps (osmAG) for global planning with a VLN model for local execution. A Reactive Visual Halting (RVH) mechanism is introduced to interrupt execution upon encountering unknown obstacles, update the topology, and trigger replanning for detour. The framework achieves significant improvements over baselines in both simulation and real-robot experiments.
HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation: This paper proposes HaltNav, a hierarchical navigation framework that combines lightweight textual topological priors (osmAG) for global planning with a VLN model for local execution. A Reactive Visual Halting (RVH) mechanism monitors egocentric observations to detect unexpected obstacles, dynamically updates the topology, and triggers replanning. The approach substantially improves long-range navigation robustness in both simulation and real-robot settings.
HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models: This paper proposes HAM, a training-free style transfer method that achieves high-quality stylization without sacrificing content identity. HAM applies heterogeneous modulation (GAR + LAT) to self-attention and cross-attention layers of diffusion models, complemented by style-injected noise initialization (SINI), attaining state-of-the-art performance across multiple metrics.
HazeMatching: Dehazing Light Microscopy Images with Guided Conditional Flow Matching: This paper proposes HazeMatching, a guided conditional flow matching (Guided CFM) framework for microscopy image dehazing. By incorporating degraded observations as conditioning signals in the velocity field, the method achieves high data fidelity and high perceptual quality simultaneously without requiring an explicit degradation operator, while also providing well-calibrated uncertainty estimates.
Heterogeneous Decentralized Diffusion Models: This paper proposes a heterogeneous decentralized diffusion framework that allows different experts to train completely independently using distinct diffusion objectives (DDPM ε-prediction and Flow Matching velocity-prediction). At inference time, a deterministic schedule-aware conversion unifies all expert outputs into velocity space for fusion. Compared to homogeneous baselines, the framework simultaneously improves FID and generation diversity while reducing computation by 16×.
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images: This paper proposes HiFi-Inpaint, a framework that leverages high-frequency information to enhance product detail features via Shared Enhancement Attention (SEA), combined with a Detail-Aware Loss (DAL) for pixel-level high-frequency supervision, achieving state-of-the-art detail fidelity in human-product image generation.
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning: This paper proposes an identity-constrained attribute tuning framework for diffusion-based face swapping: the method first constrains the identity solution space, then injects attribute conditions, and finally performs end-to-end refinement with identity and adversarial losses. Combined with a decoupled condition injection design, it achieves state-of-the-art FID (3.61) and identity retrieval accuracy (97.9% Top-1) on FFHQ.
Image Diffusion Preview with Consistency Solver: This paper proposes the Diffusion Preview paradigm and ConsistencySolver—a lightweight high-order ODE solver trained via reinforcement learning—that generates high-quality preview images with few-step sampling while ensuring consistency with full-step outputs. It achieves FID comparable to Multistep DPM-Solver using 47% fewer steps, reducing user interaction time by nearly 50%.
Image Generation as a Visual Planner for Robotic Manipulation: This work adapts a pretrained image generation model (DiT) via LoRA fine-tuning into a visual planner for robotic manipulation, generating temporally coherent action sequences in the form of \(3\times3\) grid images, supporting both text-conditioned and trajectory-conditioned control modes.
Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval: This paper proposes DreamPRVR, which adopts a coarse-to-fine "imagine before concentrate" strategy: a truncated diffusion model generates global semantic register tokens under text supervision, which are then fused into fine-grained video representations to suppress spurious local noise responses, achieving state-of-the-art performance on three PRVR benchmarks.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards: This paper proposes SOLACE, a post-training framework that leverages the denoising self-confidence of text-to-image generation models as an intrinsic reward signal, requiring no external reward models while achieving consistent improvements in compositional generation, text rendering, and text-image alignment. SOLACE is also complementary to external rewards and mitigates reward hacking when combined with them.
InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation: This paper proposes InnoAds-Composer, a single-stage e-commerce poster generation framework built on MM-DiT. It maps three types of conditions — product subject, glyph text, and background style — into a unified token space via unified tokenization, and combines a Text Feature Enhancement Module (TFEM) with an importance-aware condition injection strategy to maintain high-quality generation while significantly reducing inference cost.
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing: This paper proposes InterEdit, the first text-guided multi-human 3D motion editing framework. Through two alignment mechanisms—Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment—InterEdit achieves precise editing of two-person interactive motions within a conditional diffusion model, while preserving source motion consistency and interaction coherence.
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing: This paper is the first to formally define the Text-guided Multi-human Motion Editing (TMME) task. It constructs the InterEdit3D dataset containing 5,161 source–target–instruction triplets and proposes the InterEdit conditional diffusion model. The model captures high-level editing intent via semantic-aware planning token alignment and models periodic interaction dynamics via interaction-aware frequency-domain token alignment, achieving state-of-the-art performance on instruction following (g2t R@1 30.82%) and source preservation (g2s R@1 17.08%), outperforming all four baselines across the board.
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders: This paper identifies that the majority of SAE neurons (~81%) suffer from insufficient interpretability or steerability, and proposes the CB-SAE framework — which prunes low-utility SAE neurons and augments them with a concept bottleneck module — achieving +32.1% interpretability and +14.5% steerability improvements on LVLM and image generation tasks, respectively.
Intra-finger Variability of Diffusion-based Latent Fingerprint Generation: This paper systematically evaluates the intra-finger variability of diffusion-model-based fingerprint synthesis. By constructing a latent fingerprint style library spanning 40 surface types and 15 development techniques, it enhances generation diversity and quantifies both local and global identity inconsistencies introduced during the generation process.
Intrinsic Concept Extraction Based on Compositional Interpretability: HyperExpress introduces a novel task termed Compositional Interpretability-based Intrinsic Concept Extraction (CI-ICE). By leveraging the hierarchical modeling capacity of hyperbolic space and a horospherical projection module, it extracts composable object-level and attribute-level concepts from a single image, enabling invertible decomposition of complex visual concepts.
Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers: This paper proposes the Just-in-Time (JiT) framework, which dynamically selects sparse anchor tokens in the spatial domain to drive generative ODE evolution, and introduces a deterministic micro-flow (DMF) mechanism to ensure seamless activation of newly included tokens. JiT achieves up to 7× acceleration on FLUX.1-dev with negligible quality degradation.
Language-Free Generative Editing from One Visual Example: This paper reveals a critical text-visual alignment failure in text-guided diffusion models on simple visual transformations such as rain, haze, and blur, and proposes the VDC framework — which learns a purely visual conditioning signal from a single visual example pair (before and after transformation) to guide diffusion-based editing, requiring neither text nor training. VDC surpasses text-based and fine-tuning-based methods on tasks including deraining, dehazing, and denoising.
Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection: This paper identifies that real images exhibit stable layer-wise transitions in intermediate feature representations within a frozen CLIP ViT, whereas synthetic images exhibit abrupt attention shifts at intermediate layers. Based on this observation, the paper proposes Latent Transition Discrepancy (LTD) to model this difference, achieving a mean accuracy of 96.90% on UFD, 99.54% on DRCT-2M, and 91.62% on GenImage, surpassing all prior state-of-the-art methods.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories: This paper proposes LeapAlign, which constructs two-step leap trajectories to compress long generation paths into two steps, enabling reward gradients to be directly backpropagated to early generation steps. Combined with trajectory similarity weighting and gradient discounting strategies, LeapAlign achieves efficient post-training alignment of flow matching models.
Learnability-Guided Diffusion for Dataset Distillation: This paper proposes LGD, a learnability-driven incremental dataset distillation framework that constructs the distilled dataset in stages, conditioning each stage on the current model state to generate complementary rather than redundant training samples. By injecting learnability-score gradients into diffusion sampling, LGD reduces the 80–90% inter-sample information redundancy observed in existing methods by 39.1%, achieving 60.1% accuracy at 50 IPC on ImageNet-1K and 87.2% at 100 IPC on ImageNette.
Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition: Flora achieves robust skeleton–semantic cross-modal alignment via neighbor-aware semantic calibration, and constructs a distribution-aware open-form classifier using noise-free flow matching, attaining state-of-the-art performance on zero-shot skeleton action recognition—particularly under low-data training regimes.
Learning Latent Proxies for Controllable Single-Image Relighting: This paper proposes LightCtrl, a diffusion-based single-image relighting framework that achieves precise and continuous control over lighting direction, intensity, and color temperature. It introduces a few-shot latent proxy encoder to provide lightweight material–geometry priors, a lighting-aware mask to guide spatially selective denoising, and DPO post-training to enhance physical consistency. The method outperforms existing approaches on both synthetic and real-world benchmarks.
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal: This paper proposes the VeilGen + DeVeiler framework, which employs a physics-guided Stable Diffusion generative model to learn latent transmission and glare maps for synthesizing realistic compound-degradation training data. A restoration network trained under invertible constraints jointly removes aberrations and veiling glare in simplified optical systems.
Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models: This paper proposes GvU, a self-supervised RL framework (based on GRPO) that leverages the visual understanding branch of a unified multimodal model (UMM) as an intrinsic reward signal. Token-level text-image alignment probabilities are used to iteratively improve T2I generation quality without any external supervision, achieving a 43.3% improvement on GenEval++. Notably, the enhanced generation in turn promotes fine-grained visual understanding.
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration: This paper proposes LESA, a framework that employs KAN (Kolmogorov-Arnold Network) as learnable temporal predictors, combined with a multi-stage multi-expert architecture and a two-phase training strategy. LESA achieves 5× acceleration on FLUX with only 1.0% quality degradation, 6.25× acceleration on Qwen-Image with 20.2% quality improvement over TaylorSeer, and 5× acceleration on HunyuanVideo with a 24.7% PSNR gain.
Leveraging Multispectral Sensors for Color Correction in Mobile Cameras: This paper proposes a unified end-to-end color correction framework that jointly fuses data from a high-resolution RGB sensor and an auxiliary low-resolution multispectral (MS) sensor, integrating illuminant estimation, illuminant compensation, and color space conversion into a single model. The proposed approach reduces color error (\(\Delta E_{00}\)) by up to 50% compared to RGB-only and MS-only baselines.
Low-Resolution Editing is All You Need for High-Resolution Editing: ScaleEdit is the first work to formally define the high-resolution image editing task. It learns a 1×1 convolutional transfer function in the intermediate feature space of a pretrained generative model to inject fine-grained textural details from the source image, and employs a Blended-Tweedie-based patch synchronization strategy to ensure global consistency. Operating entirely via test-time optimization, the method achieves high-quality editing at 2K and even 8K resolution.
LumiCtrl: Learning Illuminant Prompts for Lighting Control in Personalized Text-to-Image Models: This paper identifies a semantic gap in T2I model text encoders that prevents understanding of standard lighting terminology (e.g., tungsten, 6500K), and proposes LumiCtrl, which learns illumination prompts via three components — physics-based lighting augmentation, edge-guided prompt disentanglement, and masked reconstruction loss — enabling precise text-guided lighting control while preserving subject identity.
MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness: This paper proposes the MAGIC framework, which fine-tunes an inpainting diffusion model and incorporates three complementary modules—Gaussian prompt perturbation, mask-guided spatial noise injection, and context-aware mask alignment—to generate high-fidelity, diverse, and spatially plausible industrial anomaly images under few-shot conditions, achieving state-of-the-art performance on downstream tasks using MVTec-AD.
Match-and-Fuse: Consistent Generation from Unstructured Image Sets: Match-and-Fuse is proposed as the first training-free consistent generation method for unstructured image sets. Images are treated as nodes and image pairs as edges to construct a pairwise consistency graph. Multi-view Feature Fusion (MFF) and feature guidance are employed to manipulate internal features during diffusion inference, achieving set-level cross-image consistency with a DINO-MatchSim of 0.80, substantially outperforming all baselines.
Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping: This paper proposes DiT-BlockSkip, a framework that reduces LoRA fine-tuning memory on FLUX by approximately 50% while maintaining comparable personalized generation quality. It achieves this through two components: timestep-aware dynamic patch sampling (low-resolution training with dynamically adjusted crop sizes) and a block skipping strategy that identifies critical blocks via cross-attention analysis and precomputes residual features for skipped blocks.
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models: This paper proposes MICON-Bench, a multi-image context generation benchmark covering 6 tasks (1,043 cases), paired with an MLLM-driven Evaluation-by-Checkpoint automated assessment framework. It further introduces DAR (Dynamic Attention Rebalancing), a training-free mechanism that improves generation consistency and quality in unified multimodal models (UMMs) by dynamically adjusting attention weights at inference time.
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation: This paper proposes Mixture of States (MoS)—a multimodal fusion paradigm based on learnable token-level sparse routing—enabling visual tokens to adaptively select hidden states from arbitrary layers of a text encoder at each denoising step. With only 3–5B parameters, MoS matches or surpasses models at the 20B scale.
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing: MorphAny3D is proposed as the first training-free 3D morphing framework based on Structured Latent (SLAT) representations. It achieves state-of-the-art quality in cross-category 3D morphing through Morphing Cross-Attention (MCA) for structurally coherent source/target fusion, Temporal-Fused Self-Attention (TFSA) for temporal consistency, and a direction correction strategy to eliminate abrupt orientation jumps.
MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification: The paper proposes the MOS framework to address optical-SAR cross-modal ship re-identification. It comprises two core modules: (1) MCRL, which reduces the modality gap during training via SAR image denoising and a category-level modality alignment loss; and (2) CDGF, which generates pseudo-SAR samples from optical images using a Brownian bridge diffusion model at inference time and fuses the resulting features. On the HOSS ReID dataset, MOS achieves a +16.4% R1 improvement in the SAR→Optical direction.
MPDiT: Multi-Patch Global-to-Local Transformer Architecture for Efficient Flow Matching: This paper proposes MPDiT, a multi-scale patch global-to-local diffusion Transformer architecture. Early layers process global context using large patches (4×4) with only 64 tokens, and later layers upsample to small patches (2×2) with 256 tokens for local detail refinement. This reduces GFLOPs by up to 50%, while the XL model achieves FID 2.05 (with CFG) at only 240 training epochs.
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation: This paper introduces MultiBanana, the first large-scale benchmark for systematically evaluating multi-reference image generation, comprising 3,769 evaluation samples with up to 8 reference images across 5 difficulty dimensions (including cross-domain, scale mismatch, rare concepts, and multilingual). The benchmark reveals complementary failure modes: closed-source models tend to overfit reference details, while open-source models tend to ignore reference subjects.
Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models: This paper proposes NLCE, a training-free three-stage concept erasure framework for text-to-image diffusion models. It achieves precise localized erasure of target concepts through spectrally-weighted representation modulation, attention-guided spatial gating, and gated feature scrubbing, while explicitly preserving semantically neighboring concepts. NLCE outperforms existing methods on Oxford Flowers, Stanford Dogs, celebrity identity, and sensitive content erasure benchmarks.
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models: This paper reinterprets SDE-based GRPO as distance optimization / contrastive learning, and proposes Neighbor GRPO — which completely bypasses SDE conversion by constructing neighborhood candidate trajectories through perturbation of ODE initial noise, combined with a softmax distance surrogate policy for policy gradient optimization, while preserving all advantages of deterministic ODE sampling.
OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution: This paper proposes OARS, a framework that systematically addresses human preference alignment in generative real-world image super-resolution for the first time. It introduces COMPASS, an MLLM-based process-aware reward model, and a progressive online reinforcement learning pipeline (cold start → reference-guided RL → non-reference RL), significantly improving perceptual quality while preserving fidelity.
OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution: This paper proposes OARS, a framework that aligns generative real-world image super-resolution models with human visual preferences via COMPASS—an MLLM-based process-aware reward model—and a progressive online reinforcement learning pipeline, achieving adaptive balance between perceptual quality and fidelity.
Object-WIPER: Training-Free Object and Associated Effect Removal in Videos: This paper presents Object-WIPER, the first training-free framework for removing objects and their associated visual effects (shadows, reflections, mirror images, etc.) in videos. It leverages text-visual cross-attention and visual self-attention within DiT to localize associated effect regions, achieves clean removal via foreground re-initialization and attention scaling, and introduces the TokSim metric along with WIPER-Bench, a real-world benchmark.
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers: This paper proposes ELIT (Elastic Latent Interface Transformer), which decouples computation from input resolution by inserting variable-length latent token interfaces and lightweight Read/Write cross-attention layers into DiTs. A single model supports multiple inference budgets, achieving 35.3% and 39.6% improvements in FID and FDD respectively on ImageNet-1K 512px.
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers: This paper proposes ELIT (Elastic Latent Interface Transformer), which inserts variable-length latent interfaces and lightweight Read/Write cross-attention layers into DiT, enabling a single model to dynamically adjust its computational budget at inference time while non-uniformly allocating computation to more difficult image regions, achieving up to 53% FID reduction on ImageNet 512px.
OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery: OpenDPR is a training-free, vision-centric framework that uses diffusion models to generate diverse visual prototypes for target categories offline, then performs open-vocabulary change detection in remote sensing imagery via similarity-based retrieval in visual feature space at inference time, achieving state-of-the-art performance on four benchmark datasets.
OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation: This paper proposes OPRO, a parameter-efficient adaptation method based on orthogonal matrices. By applying learnable panel-specific orthogonal operators to position-aware queries and keys of a frozen backbone, OPRO explicitly modulates cross-panel attention interactions while preserving the pre-trained same-panel synthesis behavior. With only 0.93M additional parameters, it significantly improves the editing quality of multiple state-of-the-art methods on MagicBrush.
Organizing Unstructured Image Collections using Natural Language: This paper introduces a new task, Open Semantic Multi-Clustering (OpenSMC), and proposes the X-Cluster framework, which converts images into text via an MLLM and subsequently employs an LLM to automatically discover clustering criteria and semantic substructures. Without any human-specified priors, the framework organizes large-scale unlabeled image collections into multi-dimensional, multi-granularity, and interpretable semantic clusters.
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design: This paper proposes PhysGen, a unified framework that integrates physical constraints (aerodynamic efficiency) into 3D shape generation. It jointly encodes geometric and physical information into a unified latent space via a Shape-and-Physics VAE (SP-VAE), and employs a Flow Matching model with alternating updates between velocity steps and physics refinement to generate 3D shapes that are both visually plausible and physically efficient (e.g., automobiles with low drag coefficients).
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction: This paper proposes ReMD (Residual-Multigrid Diffusion), which embeds multigrid residual correction into each reverse sampling step of a diffusion model. By leveraging a multi-wavelet basis to construct a cross-scale hierarchy, ReMD achieves physics-consistent and efficient fluid super-resolution without requiring explicit PDEs.
Pixel Motion Diffusion Is What We Need for Robot Control: DAWN proposes a two-stage fully diffusion-based framework — a Motion Director that generates dense pixel motion fields as interpretable intermediate representations, and an Action Expert that converts these fields into executable robot action sequences — achieving SOTA on CALVIN (Avg Len 4.00), MetaWorld (Overall 65.4%), and real-world benchmarks, with substantially smaller model capacity and training data than competing methods.
PixelDiT: Pixel Diffusion Transformers for Image Generation: PixelDiT proposes a fully Transformer-based dual-level pixel-space diffusion model: a patch-level DiT captures global semantics while a pixel-level DiT refines textural details, achieving an FID of 1.61 on ImageNet without any VAE, and enabling direct text-to-image training at 1024-resolution in pixel space.
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion: This paper proposes PixelRush, a training-free high-resolution image generation framework that combines four components — partial inversion, few-step diffusion models, Gaussian filter blending, and noise injection — to compress 4K image generation time from several minutes to approximately 20 seconds (10×–35× speedup), while surpassing existing SOTA methods on FID/IS metrics.
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion: PixelRush is the first method to bring training-free high-resolution image generation into practical deployment. By truncating the reverse diffusion process via partial DDIM inversion to skip redundant low-frequency reconstruction steps, it enables few-step diffusion models to function within a patch-based refinement pipeline. Combined with Gaussian filter blending and noise injection to eliminate artifacts, the method generates 2K images in 4 seconds and 4K images in 20 seconds—10–35× faster than the state of the art while achieving superior FID.
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers: This paper proposes the PPCL framework, which employs linear probing and first-order CKA difference analysis to detect contiguous redundant layer intervals in MMDiT, combined with non-sequential distillation to enable depth pruning (plug-and-play) and width pruning (replacing text streams/FFNs with linear projections). The approach compresses Qwen-Image from 20B to 10B with only a 3.29% performance drop.
Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification: Pose-dIVE leverages the SMPL model to jointly control human body pose and camera viewpoint, using a diffusion model to generate person images with diversified poses and viewpoints. This approach systematically alleviates distributional bias in Re-ID training data, consistently improving the generalization capability of arbitrary Re-ID models across multiple benchmarks.
PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation: This paper introduces PosterIQ, a comprehensive benchmark for poster design evaluation, comprising 7,765 understanding annotations and 822 generation prompts across 24 task categories — including OCR, font perception, layout reasoning, design intent understanding, and compositionally-aware generation — to systematically diagnose the gap between current MLLMs and diffusion models in design cognition.
Precise Object and Effect Removal with Adaptive Target-Aware Attention: This paper proposes ObjectClear, a framework that decouples foreground removal from background reconstruction via Adaptive Target-Aware Attention (ATA), combined with Attention-Guided Fusion (AGF) and Spatially Varying Denoising Strength (SVDS) strategies, enabling precise removal of target objects along with their associated visual effects such as shadows and reflections. The work also introduces OBER, the first large-scale dataset for Object-Effect Removal.
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality: This paper proposes LivingSwap, the first video reference-guided face swapping model. Through a controllable pipeline of keyframe identity injection, source video reference completion, and temporal stitching, it achieves high-fidelity face swapping in long videos. The method stably injects the target identity while preserving expression, lighting, and motion details from the source video, reducing manual editing effort by 40×.
Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models: This work systematically probes affordance capabilities in vision foundation models (VFMs), revealing that DINO encodes part-level geometric structure while Flux encodes verb-conditioned interaction priors. Fusing the two without any training yields zero-shot affordance estimation competitive with weakly supervised approaches.
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On: This paper proposes PROMO, a virtual try-on framework built on a Flow Matching DiT that significantly reduces inference overhead while maintaining high fidelity through latent multimodal condition concatenation, a temporal self-reference caching mechanism, and 3D-RoPE grouped condition injection. The framework supports multi-garment try-on and text-prompt-controlled outfit styling.
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On: PROMO is built on a FLUX.1-dev Flow Matching DiT backbone and achieves high-fidelity, efficient multi-garment virtual try-on without a traditional reference network, by combining latent-space multimodal condition concatenation, temporal self-reference KV caching, 3D-RoPE grouped conditioning, and a fine-tuned VLM style-prompt system. Inference is 2.4× faster than the non-accelerated baseline, and the method surpasses existing VTON and general image-editing approaches on VITON-HD and DressCode.
Prototype-Guided Concept Erasure in Diffusion Models: To address the difficulty of thoroughly erasing broad concepts (e.g., violence, nudity) from diffusion models, this paper proposes a training-free erasure method based on concept prototypes. The method clusters concept-differential directions in the CLIP embedding space to obtain image-space prototypes, optimizes these into a text prototype space via cosine similarity, and at inference time selects the best-matching prototype as a negative guidance signal to suppress target concepts in a classifier-free guidance fashion.
PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow: This paper presents PSDesigner, an automated graphic design system that simulates the creative workflow of human designers. It operates through three collaborative modules — AssetCollector (resource collection), GraphicPlanner (tool-call planning), and ToolExecutor (PSD operation execution) — and is trained on CreativePSD, the first PSD-format design dataset, enabling the system to learn professional design workflows and directly generate editable PSD files.
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards: To address poor subject consistency and insufficient text adherence in multi-subject personalized image generation, this paper proposes a scalable multi-subject data construction pipeline and Pairwise Subject-Consistency Rewards (PSR). Through two-stage training (SFT + RL), the method comprehensively outperforms existing state-of-the-art methods on the self-constructed PSRBench.
PureCC: Pure Learning for Text-to-Image Concept Customization: PureCC introduces a decoupled learning objective that separates "target concept implicit guidance" from "original condition prediction," coupled with a dual-branch training pipeline comprising a frozen representation extractor and a trainable flow model, along with adaptive guidance scaling \(\lambda^{\star}\) derived from projection error. This enables high-fidelity concept customization while minimizing disruption to the original model's behavior and capabilities.
Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge: This paper proposes the QUAD framework, which treats LoRA weights as runtime inputs rather than compiling them into the model graph. Combined with a distillation fine-tuning strategy that shares quantization parameters across LoRAs, QUAD enables a single compiled model to dynamically switch among multiple GenAI tasks on mobile NPUs, achieving 6× memory compression and 4× latency improvement.
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment: This paper proposes RAISE, a framework that models T2I generation as a requirement-driven adaptive evolutionary process. A requirement analyzer decomposes prompts into structured checklists; multi-action mutations (prompt rewriting + noise resampling + instruction-based editing) evolve candidate populations in parallel; tool-augmented visual verification eliminates non-compliant candidates each round. The result is adaptive inference-time scaling that achieves a state-of-the-art score of 0.94 on GenEval while reducing generated samples by 30–40% and VLM calls by 80% compared to reflection fine-tuning baselines.
RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models: This paper proposes RAZOR, a ratio-aware multi-layer/multi-head selective editing framework that enables efficient and precise targeted unlearning in Transformer-based vision models such as CLIP, Stable Diffusion, and VLMs, while preserving overall model performance and quantization robustness.
RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models: RAZOR selects the most critical layers and attention heads via ratio-aware gradient scoring that jointly measures forgetting pressure and retention alignment, and achieves precise, efficient targeted unlearning on CLIP, Stable Diffusion, and VLMs through a three-component constrained loss and an iterative expansion mechanism, with no performance degradation after quantization.
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark: This paper introduces RealUnify, the first benchmark specifically designed to evaluate the bidirectional synergy between understanding and generation capabilities in unified models. Through 1,000 manually annotated instances and a dual evaluation protocol (direct and stepwise), it reveals that current unified models, despite possessing both understanding and generation capabilities, still fail to achieve genuine capability synergy in end-to-end scenarios.
Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning: This paper proposes MVC-ZigAL, a framework that improves single-view fidelity and cross-view consistency in few-step text-to-multiview diffusion models through multiview-aware MDP formulation, zigzag self-refining advantage learning, and Lagrangian dual constrained optimization.
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing: This paper identifies that relational distances between image patch pairs remain invariant under AI editing, and exploits this invariance to build Rel-Zero, a zero-watermarking framework that achieves robust content authentication against diverse generative edits without modifying the original image.
RenderFlow: Single-Step Neural Rendering via Flow Matching: RenderFlow recasts neural rendering as a single-step conditional flow matching problem from albedo to full-illumination images. Using G-buffers as conditions and a pretrained video DiT as the backbone, it achieves deterministic rendering more than 10× faster than diffusion-based methods (~0.19 s/frame). An optional sparse keyframe guidance module further improves physical accuracy, and inverse rendering is supported via a frozen backbone with lightweight adapters.
Resolving the Identity Crisis in Text-to-Image Generation: This paper identifies the "identity crisis" in text-to-image models for multi-person scene generation — manifesting as duplicated faces and identity merging — and proposes the DisCo framework. By combining a compositional reward function with GRPO-based reinforcement learning fine-tuning of a flow-matching model, DisCo achieves 98.6% unique face accuracy, surpassing closed-source models including GPT-Image-1.
Reviving ConvNeXt for Efficient Convolutional Diffusion Models: This paper proposes FCDM (Fully Convolutional Diffusion Model), which adapts the ConvNeXt architecture as a backbone for conditional diffusion models. Using only 50% of DiT-XL's FLOPs, FCDM achieves a competitive FID of 2.03 on ImageNet and can train an XL-scale model on 4× RTX 4090 GPUs, demonstrating that the efficiency of fully convolutional architectures in generative modeling has been severely underestimated.
RewardFlow: Generate Images by Optimizing What You Reward: RewardFlow proposes an inversion-free inference-time framework that fuses multiple differentiable reward signals—including semantic alignment, perceptual fidelity, local grounding, object consistency, and human preference—via multi-reward Langevin dynamics, achieving state-of-the-art editing fidelity and compositional alignment on image editing and compositional generation benchmarks.
Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring: Score2Instruct proposes SIG, a fully automated video quality instruction generation pipeline that requires neither human annotation nor closed-source APIs. By automatically evaluating 14 quality dimensions and aggregating them into comprehensive quality reasoning texts via hierarchical CoT, SIG constructs the S2I dataset (320K+ instruction samples). Combined with a two-stage progressive fine-tuning strategy, multiple video LMMs simultaneously acquire quality scoring and quality reasoning capabilities, achieving an average SRCC improvement of 26–31% across 5 VQA benchmarks.
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models: This paper proposes SeaCache, a training-free dynamic caching strategy based on a Spectral-Evolution-Aware (SEA) filter. By separating signal and noise components in the frequency domain to measure inter-timestep redundancy, SeaCache significantly improves the latency–quality trade-off in diffusion model inference.
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models: This paper proposes SegQuant, a framework that achieves high-fidelity post-training quantization of diffusion models through two novel components: SegLinear, a semantics-aware segmented quantization scheme based on static computational graph analysis, and DualScale, a hardware-native dual-scale polarity-preserving quantization scheme. The approach is cross-architecture generalizable and compatible with deployment pipelines, requiring neither handcrafted rules nor runtime dynamic information.
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models: This paper proposes SegQuant, a deployment-oriented post-training quantization (PTQ) framework for diffusion models. It achieves cross-architecture, high-fidelity W8A8/W4A8 quantization on SD3.5, FLUX, and SDXL via semantics-aware segmented quantization (SegLinear) based on static computational graph analysis and hardware-native dual-scale polarity-preserving quantization (DualScale), while maintaining compatibility with industrial inference engines such as TensorRT.
Self-Corrected Image Generation with Explainable Latent Rewards: This paper proposes xLARD, a framework that performs semantic self-correction in the latent space during text-to-image generation via a lightweight residual corrector. Guided by explainable latent reward signals (counting, color, position), xLARD achieves +4.1% on GenEval and +2.97% on DPGBench, and adapts to multiple backbones in a plug-and-play manner.
SHOE: Semantic HOI Open-Vocabulary Evaluation Metric: This paper proposes SHOE, an evaluation framework that decomposes HOI predictions into verb and object components and computes LLM-driven semantic similarity scores for each independently, replacing the exact-match paradigm of conventional mAP. SHOE achieves 85.73% agreement with human judgments on open-vocabulary HOI detection evaluation, surpassing the average inter-annotator agreement of 78.61%.
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement: ShowTable introduces a novel task termed creative table visualization (generating infographics from structured data tables) and proposes a progressive self-correction pipeline in which an MLLM (for reasoning and reflection) and a diffusion model (for generation and refinement) collaborate iteratively. Through a dedicated fine-tuned rewriting module and an RL-optimized refinement module, the framework consistently and substantially improves visualization quality over all baseline models on the newly constructed TableVisBench benchmark.
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images: This paper proposes SimLBR, which regularizes a detector by blending a small amount of fake image information into real image embeddings within the DINOv3 latent space, compelling the model to learn a compact decision boundary around the real image distribution. This design achieves strong generalization to unseen generators, attaining 94.54% average accuracy on GenImage and outperforming AIDE on the challenging Chameleon benchmark by 25% in accuracy and 70% in recall.
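The blending step in the SimLBR entry above can be pictured as a convex combination of a real and a fake embedding. This is a minimal sketch, not the authors' implementation: the function name `blend_embeddings`, the mixing weight `alpha`, and the toy vectors are assumptions made for illustration, and the actual method operates on DINOv3 features.

```python
from typing import List


def blend_embeddings(real: List[float], fake: List[float], alpha: float = 0.1) -> List[float]:
    """Mix a small fraction `alpha` of a fake embedding into a real one.

    `alpha` is a hypothetical mixing weight; the entry only says that a
    "small amount" of fake information is blended in.
    """
    if len(real) != len(fake):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * r + alpha * f for r, f in zip(real, fake)]


# Toy 3-dimensional embeddings (illustrative values only).
real_z = [1.0, 0.0, 2.0]
fake_z = [0.0, 1.0, 0.0]
blended = blend_embeddings(real_z, fake_z, alpha=0.1)
```

With a small `alpha` the blended point stays close to the real embedding, which is one plausible way to obtain hard examples that sit just outside the real-image distribution.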
SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation: This paper analyzes the bottleneck of severely skewed acceptance-length distributions in Speculative Jacobi Decoding (SJD) for text-to-image generation, and proposes the SJD-PAC framework. By introducing two techniques—Proactive Drafting (PD) and Adaptive Continuation (AC)—SJD-PAC achieves a strictly lossless 3.8× inference speedup, substantially surpassing the ~2× acceleration of vanilla SJD.
SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking: SLICE decomposes image semantics into four factors (subject / environment / action / detail), anchors each factor to a distinct spatial partition of the diffusion model's initial noise, and thereby enables fine-grained, semantic-aware watermarking—capable of not only detecting tampering but also precisely localizing which semantic factor has been altered, entirely without training.
SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking: SLICE is a semantic watermarking framework that decomposes image semantics into four factors—subject, environment, action, and detail—and binds each factor to a distinct spatial partition of the initial Gaussian noise. This enables a three-state verification mechanism that not only detects watermark presence but also localizes semantic tampering. Against the strongest CSI attack, SLICE achieves an attack success rate (ASR) of only 19%, compared to 81% for SEAL.
Smoothing the Score Function for Generalization in Diffusion Models: An Optimization-based Explanation Framework: This paper theoretically demonstrates that memorization in diffusion models stems from the "sharpness" of empirical score function weights (concentration of softmax weights), and proposes two methods — noise unconditioning and temperature smoothing — that improve generalization and reduce memorization by smoothing score function weights while preserving generation quality.
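To make the "sharpness of softmax weights" idea concrete, here is a small sketch of the score of a Gaussian-smoothed empirical distribution, which is a softmax-weighted average over training points. The variance-exploding noise convention, the function names, and the way `temperature` divides the logits are assumptions for illustration; the paper's exact parameterization may differ.

```python
import math
from typing import List, Sequence


def score_weights(x: Sequence[float], data: Sequence[Sequence[float]],
                  sigma: float, temperature: float = 1.0) -> List[float]:
    """Softmax weights over training points for a noisy query `x`.

    Logits are -||x - x_i||^2 / (2 sigma^2). Dividing them by a
    temperature > 1 (an assumed form of smoothing) flattens the weights.
    """
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * sigma ** 2) / temperature
        for xi in data
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_score(x: Sequence[float], data: Sequence[Sequence[float]],
                    sigma: float, temperature: float = 1.0) -> List[float]:
    """Weighted average of (x_i - x) / sigma^2 using the weights above."""
    w = score_weights(x, data, sigma, temperature)
    return [
        sum(wi * (xi[d] - x[d]) for wi, xi in zip(w, data)) / sigma ** 2
        for d in range(len(x))
    ]


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy training set
query = [0.1, 0.0]
sharp = score_weights(query, points, sigma=0.2)                     # nearly one-hot
smooth = score_weights(query, points, sigma=0.2, temperature=10.0)  # flatter
```

At low noise the unsmoothed weights concentrate almost entirely on the nearest training point, so the score pulls samples toward memorized data; the flatter weights spread that pull over several points.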
SOLACE: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards: SOLACE uses a T2I model's intrinsic denoising self-confidence (i.e., the accuracy with which it recovers injected noise) as an internal reward signal to replace external reward models in post-training, achieving consistent improvements in compositional generation, text rendering, and text-image alignment. The signal is also complementary to external rewards and can mitigate reward hacking.
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning: This paper proposes Spatial-SSRL, a self-supervised reinforcement learning paradigm that automatically constructs five pretext tasks (patch reordering, flip recognition, cropped patch inpainting, depth ordering, and relative 3D position prediction) from standard RGB/RGB-D images. By optimizing LVLMs with GRPO, the method achieves average improvements of 3.89%–4.63% across seven spatial benchmarks without any human annotation or external tools.
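Two of the pretext tasks named above, flip recognition and patch reordering, have the property that the label comes for free from the transformation itself. The sketch below shows that idea on a tiny grid of integers standing in for an image; the function names, the three flip labels, and the label format are assumptions, not the paper's data pipeline.

```python
import random
from typing import List, Tuple

Image = List[List[int]]


def make_flip_sample(image: Image, rng: random.Random) -> Tuple[Image, str]:
    """Apply a random flip and return (transformed image, label)."""
    label = rng.choice(["none", "horizontal", "vertical"])
    if label == "horizontal":
        out = [row[::-1] for row in image]
    elif label == "vertical":
        out = [row[:] for row in image[::-1]]
    else:
        out = [row[:] for row in image]
    return out, label


def make_patch_reorder_sample(image: Image, grid: int,
                              rng: random.Random) -> Tuple[List[Image], List[int]]:
    """Cut the image into grid x grid patches, shuffle them, and return
    (shuffled patches, order) where shuffled[k] is original patch order[k]."""
    ph, pw = len(image) // grid, len(image[0]) // grid
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid) for c in range(grid)
    ]
    order = list(range(grid * grid))
    rng.shuffle(order)
    return [patches[i] for i in order], order


toy = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 stand-in image
flipped, flip_label = make_flip_sample(toy, random.Random(1))
shuffled, order = make_patch_reorder_sample(toy, grid=2, rng=random.Random(0))
```

Because the answer (`flip_label` or `order`) is known by construction, a reward for reinforcement learning can be computed by exact comparison without human annotation.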
SPDMark: Selective Parameter Displacement for Robust Video Watermarking: SPDMark proposes a video diffusion model watermarking framework based on Selective Parameter Displacement (SPD). By learning a low-rank basis shift dictionary in the decoder and selecting combinations according to the watermark key, it achieves per-frame watermark embedding with imperceptibility, high robustness, and low computational overhead, while supporting temporal tampering detection and localization.
StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars: This paper proposes a two-stage autoregressive adaptation framework (autoregressive distillation + adversarial refinement) that converts a bidirectional human video diffusion model into a real-time streaming generator. Reference Sink, RAPR positional re-encoding, and a consistency-aware discriminator are introduced to ensure long-video stability, realizing the first full-body real-time digital human that supports both speaking and listening interactions.
TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts: To address the severe task interference problem in unified image generation and editing models, this paper proposes the TAG-MoE framework. By introducing a hierarchical task semantic annotation scheme and a predictive alignment regularization, TAG-MoE injects high-level task intent into local MoE routing decisions, transforming the gating network from a task-agnostic executor into a semantics-aware dispatcher. The method achieves the best overall open-source performance across five benchmarks including ICE-Bench, EmuEdit, GEdit, and DreamBench++.
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning: This paper proposes D2-Align, a framework that learns a directional correction vector in the reward model's embedding space to debias reward signals, addressing preference mode collapse (PMC) in RLHF-aligned diffusion models — a phenomenon where over-optimization of rewards leads to severe degradation in generation diversity. DivGenBench is also introduced as a benchmark for quantitative diversity evaluation.
Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models: This paper identifies that β-VAE tokenizers in latent diffusion models suffer from variance collapse, producing an overly compact latent space that is highly sensitive to diffusion sampling perturbations. The proposed Variance Expansion (VE) Loss achieves adaptive latent variance regulation through an adversarial balance between reconstruction and variance expansion objectives, consistently improving generation quality (FID 1.18) across multiple diffusion architectures.
Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework: This paper proposes AC-DC, a three-stage denoiser (Auto-Correction + Directional Correction + Score Denoising) that addresses the manifold mismatch between ADMM iterations and the score training manifold. It provides the first convergence guarantee for ADMM-PnP combined with score-based denoisers, achieving state-of-the-art performance across multiple inverse problems.
Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework: This paper proposes ADMM-PnP with an AC-DC denoiser, which integrates diffusion priors into the ADMM primal-dual framework via a three-stage correct-then-denoise procedure (Auto-Correction + Directional Correction + score-based denoising). The method addresses the geometric mismatch between ADMM iterates and the diffusion training manifold, establishes convergence guarantees under two sets of conditions, and consistently outperforms baselines such as DAPS, DPS, and DiffPIR across seven inverse problems.
Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control: WorldForge proposes a fully training-free inference-time guidance framework that adapts pretrained video diffusion models into precise camera-trajectory-controllable 3D/4D generation tools via three synergistic components—Intra-Step Recursive Refinement (IRR), Flow-Gated Latent Fusion (FLF), and Dual-Path Self-Corrective Guidance (DSG)—simultaneously surpassing both training-based and inference-based baselines in trajectory accuracy and perceptual quality.
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration: This paper proposes TAP, a framework that uses a first-layer probe to adaptively select the optimal predictor (from a Taylor expansion family) for each token at each step, enabling training-free diffusion model acceleration with a 6.24× speedup on FLUX.1-dev without perceptible quality degradation.
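A Taylor-expansion predictor family of the kind mentioned above can be sketched with backward finite differences over cached features, with a per-token choice of expansion order. This is an illustrative sketch only: features are reduced to one scalar per token, and the selection rule (pick the order whose prediction is closest to a hypothetical probe value) is an assumption about how a probe might be used, not the paper's procedure.

```python
from typing import List, Sequence


def taylor_predict(history: Sequence[float], order: int) -> float:
    """Predict the next value from [..., f_{t-2}, f_{t-1}, f_t].

    Uses the k-th backward difference divided by k! for k = 0..order,
    with a unit step size between cached timesteps.
    """
    if order < 0 or len(history) < order + 1:
        raise ValueError("need at least order + 1 cached values")
    current = list(history[-(order + 1):])
    prediction, factorial = 0.0, 1.0
    for k in range(order + 1):
        if k > 0:
            factorial *= k
        prediction += current[-1] / factorial
        # Next-order backward differences of the remaining window.
        current = [b - a for a, b in zip(current, current[1:])]
    return prediction


def select_order_per_token(histories: Sequence[Sequence[float]],
                           probe_targets: Sequence[float],
                           orders: Sequence[int] = (0, 1, 2)) -> List[int]:
    """For each token, choose the order with the smallest probe error
    (hypothetical selection rule; ties go to the lowest order)."""
    return [
        min(orders, key=lambda k: abs(taylor_predict(hist, k) - target))
        for hist, target in zip(histories, probe_targets)
    ]


token_histories = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0], [0.0, 1.0, 4.0]]
probe_values = [1.0, 4.0, 9.0]  # assumed reference values from a cheap probe
chosen = select_order_per_token(token_histories, probe_values)
```

Static tokens are served by simple reuse (order 0), while tokens whose features change quickly benefit from higher-order extrapolation, which is the intuition behind choosing a predictor per token.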
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model: TAUE proposes a training-free layered image generation framework that "transplants" intermediate denoising latents into the initial noise of a new generation process, combined with cross-layer attention sharing, to achieve consistent three-layer generation of foreground, background, and composite images — matching or surpassing fine-tuning-based methods.
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration: This paper proposes TC-Padé, a feature residual prediction framework based on Padé rational function approximation. Through adaptive coefficient modulation and a stage-aware strategy, TC-Padé achieves trajectory-consistent acceleration in low-step (20–30 steps) diffusion sampling scenarios (2.88× on FLUX.1-dev, 1.72× on Wan2.1), significantly outperforming existing Taylor-expansion-based methods.
Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling: This paper proposes Composer, a plug-and-play meta-generator framework that dynamically generates low-rank parameter updates from each input condition at inference time and injects them into pretrained model weights, achieving instance-specific adaptive generation with negligible computational overhead (+0.2% time, +3.6% memory). The framework consistently improves performance across class-conditional generation, text-to-image synthesis, post-training quantization, and test-time scaling scenarios.
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering: This paper proposes TextPecker—a plug-and-play structural anomaly-aware RL strategy that constructs a character-level structural anomaly annotation dataset to train a structure-aware recognizer, replacing the noisy reward signals of conventional OCR. By jointly optimizing semantic alignment and structural fidelity, TextPecker achieves significant improvements in visual text rendering quality across multiple text-to-image models (FLUX, SD3.5, Qwen-Image).
The Universal Normal Embedding: This paper proposes the Universal Normal Embedding (UNE) hypothesis: the latent spaces of generative models (diffusion models) and visual encoders (CLIP, DINO) share an approximately Gaussian underlying geometric structure, and both can be viewed as noisy linear projections of this shared space. The hypothesis is validated through the NoiseZoo dataset and extensive experiments, and the paper demonstrates the feasibility of direct linear semantic editing in the DDIM inversion noise space.
TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models: This paper proposes TINA (Text-free INversion Attack), which bypasses all text-based concept erasure defenses by optimizing DDIM inversion under null-text conditioning to recover a precise initial noise vector. The work demonstrates that existing erasure methods merely sever the text-to-image mapping without truly deleting the visual knowledge encoded in model parameters.
Tiny Inference-Time Scaling with Latent Verifiers: This paper proposes VHS (Verifier on Hidden States), a verifier that operates directly on the intermediate hidden states of a DiT generator, bypassing the decode–re-encode overhead. In the inference-time scaling setting for single-step image generation, VHS reduces joint generation-verification latency by 63.3% and FLOPs by 51%, while achieving a 2.7% performance gain on GenEval under the same time budget.
TokenLight: Precise Lighting Control in Images using Attribute Tokens: TokenLight formulates image relighting as an end-to-end image generation task conditioned on attribute tokens (intensity, color, ambient light, diffuse level, and 3D light source position), enabling precise, continuous, and interpretable lighting control within a diffusion Transformer framework.
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity: To address the problem of T2I models generating images that appear "too vivid to be real," this work proposes the Color Fidelity Dataset (CFD, 1.3M images), the Color Fidelity Metric (CFM, based on Qwen2-VL + softrank loss), and Color Fidelity Refinement (CFR, a training-free spatiotemporal adaptive guidance modulation scheme), forming an integrated evaluation-and-improvement framework.
Towards Robust Content Watermarking Against Removal and Forgery Attacks: This paper proposes ISTS, an instance-specific two-sided detection watermarking method that dynamically selects watermark injection timestep and location based on image semantics to resist both removal and forgery attacks. A two-sided detection mechanism is further designed to counter reverse latent representation attacks. ISTS achieves state-of-the-art robustness under both average and worst-case scenarios across three removal attacks and three forgery attacks.
TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking: This paper proposes TRACE, a document watermarking framework based on character structure encoding. It leverages a diffusion model (DragDiffusion) to precisely displace skeleton keypoints of characters for information embedding. Through three core components—Adaptive Diffusion Initialization (ADI), Guided Diffusion Encoding (GDE), and Masked Region Replacement (MRR)—TRACE simultaneously achieves cross-media robustness, multi-language/multi-font generalizability, and high visual imperceptibility.
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods: This paper proposes STALL, a training-free zero-shot generated video detector that jointly models per-frame spatial likelihoods and inter-frame temporal likelihoods in a whitened embedding space. It requires only real video calibration and achieves robust detection across diverse generative models.
TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection: This paper proposes TriDF — the first benchmark that comprehensively evaluates interpretable DeepFake detection across three dimensions: Perception, Detection, and Hallucination. It comprises 55K high-quality samples covering 16 DeepFake types and 3 modalities, and reveals a triadic coupling relationship in which accurate perception is a prerequisite for reliable detection, yet hallucination can severely undermine decision-making.
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation: This paper proposes Uni-DAD, the first method to unify diffusion model distillation and domain adaptation into a single-stage pipeline. Through a dual-domain DMD loss and a multi-head GAN loss, Uni-DAD achieves high-quality and diverse generation in few-shot target domains using only 1–4 sampling steps.
Unified Vector Floorplan Generation via Markup Representation: This paper proposes the Floorplan Markup Language (FML), which encodes floorplan elements such as rooms and doors into structured token sequences. A LLaMA-style Transformer model (FMLM) trained on this representation unifies unconditional, boundary-conditioned, graph-conditioned, and completion tasks within a single framework, achieving over 80% lower FID than HouseDiffusion.
V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration: This paper reformulates image restoration as a progressive video generation process. By leveraging the rich visual priors of a pretrained video model (Wan2.2-TI2V-5B), the proposed method achieves versatile all-in-one restoration across multiple degradation types using only 1,000 multi-task training samples (less than 2% of the training data used by existing methods), surpassing specialized architectures trained on million-scale datasets.
VeCoR — Velocity Contrastive Regularization for Flow Matching: This paper proposes VeCoR (Velocity Contrastive Regularization), which introduces a "negative velocity" contrastive signal into standard Flow Matching training. By simultaneously guiding the model on "where to go" and "where not to go," VeCoR achieves more stable trajectory evolution and higher perceptual fidelity—yielding relative FID reductions of 22% and 35% for SiT-XL/2 and REPA-SiT-XL/2, respectively, on ImageNet-1K.
Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization: BPO proposes a reference-free white-box T2I model verification method that employs a three-stage pipeline (adversarial anchor identification → binary search boundary exploration → target optimization) to locate model-specific semantic boundary regions. The generated verification prompts achieve an average accuracy of 96% and F1 of 0.93 across 5 T2I models, while being 2× faster than the TVN baseline.
ViHOI: Human-Object Interaction Synthesis with Visual Priors: This paper proposes ViHOI, a plug-and-play framework that leverages a VLM to extract decoupled visual and textual priors from 2D reference images, compresses them into compact condition tokens via Q-Former, and injects them into a diffusion model to enhance HOI motion generation quality. At inference time, a text-to-image model synthesizes reference images to enable strong generalization to unseen objects.
Vinedresser3D: Agentic Text-guided 3D Editing: This paper presents Vinedresser3D, a 3D editing agent centered on a multimodal large language model (MLLM) that requires no user-provided 3D masks. The system automatically interprets editing intent, localizes editing regions, generates multimodal guidance, and performs inversion-based inpainting in the latent space of a native 3D generative model (Trellis), enabling high-quality text-guided 3D asset editing.
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization: ViStoryBench constructs a comprehensive benchmark comprising 80 multi-style stories, 344 characters, and 1,317 shots, and proposes 12 automated evaluation metrics covering character consistency, style similarity, prompt alignment, and copy-paste detection. The benchmark systematically evaluates over 25 open-source and commercial story visualization methods, addressing the lack of unified evaluation standards in this field.
VOSR: A Vision-Only Generative Model for Image Super-Resolution: This paper proposes VOSR and is the first work to demonstrate that a purely vision-trained generative super-resolution model can match or even surpass methods built on pretrained T2I models. By leveraging visual semantic conditioning and a restoration-oriented guidance strategy, VOSR achieves high-quality SR at approximately 1/10 the training cost of T2I-based approaches.
WaDi: Weight Direction-aware Distillation for One-step Image Synthesis: By decomposing weight changes during distillation into norm and direction components, this work finds that directional change is the primary driver of distillation (with a magnitude 22× larger than norm change). It proposes LoRaD (Low-Rank Weight Direction Rotation) adapters, integrated into the VSD framework to form WaDi, achieving state-of-the-art one-step FID on COCO with only ~10% trainable parameters.
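The norm/direction decomposition of a weight change can be illustrated on a flattened weight vector. The sketch below measures the relative change in norm and the rotation angle between the weights before and after an update; these two metrics and the function name are assumptions chosen for illustration, and the sketch does not reproduce the paper's measurement or its 22× figure.

```python
import math
from typing import Sequence, Tuple


def norm_and_direction_change(w_before: Sequence[float],
                              w_after: Sequence[float]) -> Tuple[float, float]:
    """Return (relative norm change, rotation angle in radians)."""
    norm_b = math.sqrt(sum(v * v for v in w_before))
    norm_a = math.sqrt(sum(v * v for v in w_after))
    if norm_b == 0.0 or norm_a == 0.0:
        raise ValueError("weights must be non-zero")
    cosine = sum(a * b for a, b in zip(w_before, w_after)) / (norm_b * norm_a)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return abs(norm_a - norm_b) / norm_b, math.acos(cosine)


# Toy example: the update barely changes the norm but rotates the vector.
rel_norm, angle = norm_and_direction_change([1.0, 0.0], [0.0, 1.1])
```

In this toy case the norm grows by 10% while the direction rotates by a quarter turn, the kind of situation in which an adapter that rotates weight directions would capture most of the change.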
When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization: This paper exposes the "identity collapse" bottleneck in multi-subject personalization: three SOTA models (MOSAIC, XVerse, PSR) already reach ~50% SCR at 2 subjects, surging to ~97% at 10 subjects. The paper proposes the DINOv2-based Subject Collapse Rate (SCR) metric to replace the inadequate CLIP-I, and constructs a systematic benchmark covering 2–10 subjects × 3 scene types.
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance: This paper proposes Conflict-aware Adaptive Safety Guidance (CASG), a training-free plug-and-play framework that resolves safety degradation caused by directional conflicts in multi-category aggregation. CASG dynamically identifies the harmful category most aligned with the current generation state and applies safety guidance exclusively along that direction.
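The selection step described above, picking the single harmful category most aligned with the current state and guiding only along it, can be sketched with cosine similarity. Everything concrete here is an assumption: the state is a plain vector, the category names and directions are placeholders, and subtracting a scaled unit direction stands in for whatever guidance update the method actually applies.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("vectors must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def guide_along_most_aligned(state: Sequence[float],
                             directions: Dict[str, Sequence[float]],
                             scale: float) -> Tuple[str, List[float]]:
    """Pick the category direction most aligned with `state` and push the
    state away from it (hypothetical update rule)."""
    name = max(directions, key=lambda k: cosine(state, directions[k]))
    d = directions[name]
    norm = math.sqrt(sum(v * v for v in d))
    return name, [s - scale * v / norm for s, v in zip(state, d)]


# Placeholder directions that partly oppose each other.
category_directions = {"category_a": [1.0, 0.0], "category_b": [-1.0, 0.2]}
current_state = [0.9, 0.1]
picked, guided = guide_along_most_aligned(current_state, category_directions, scale=0.5)
```

Summing both placeholder directions would largely cancel along the first axis, which illustrates the directional conflict that single-direction guidance avoids.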
When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm: This paper presents a systematic comparative analysis of safety risks between MLLMs (Multimodal Large Language Models) and diffusion models. It finds that MLLMs are more prone to generating unsafe images because of their superior semantic understanding (they can interpret abstract and non-English prompts), and that MLLM-generated images are harder for existing fake image detectors to detect. Even when detectors are fine-tuned specifically for MLLMs, detection can be circumvented by enriching prompt details.
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval: This paper proposes WISER, a training-free zero-shot composed image retrieval (ZS-CIR) framework that unifies T2I and I2I dual-path retrieval through an iterative "retrieve–verify–refine" loop. A VLM verifier explicitly models intent-awareness and uncertainty-awareness to enable adaptive fusion and structured self-reflective refinement. WISER achieves a relative improvement of 45% on CIRCO mAP@5 and 57% on CIRR Recall@1, surpassing many supervised methods.
YOEO: You Only Erase Once - Erasing Anything without Bringing Unexpected Content: YOEO proposes a single-pass erasure framework that distills a multi-step diffusion model into a few-step model for efficient inference. It introduces a Sundries Suppression Loss (which detects newly generated spurious objects via entity segmentation) and an Entity Feature Coherence Loss (which ensures semantic consistency between the erased region and its surroundings), addressing the hallucination problem of diffusion models in object erasure.

🏥 Medical Imaging

A protocol for evaluating robustness to H&E staining variation in computational pathology models: A three-step evaluation protocol (select reference staining conditions → characterize test-set staining properties → simulate staining conditions for inference) is proposed to systematically quantify the robustness of 306 MSI classification models to H&E staining variation. The study finds a weak negative correlation between robustness and classification performance (\(r = -0.28\)), indicating that high performance does not imply high robustness.
A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement: This paper proposes a semi-supervised framework for breast ultrasound (BUS) image segmentation. It employs GPT-5-generated appearance descriptions combined with Grounding DINO and SAM for training-free pseudo-label generation (APPG), and refines labels via a dual-teacher framework (static + dynamic) using Uncertainty-Entropy Weighted Fusion (UEWF) and Adaptive Uncertainty-guided Reverse Contrastive Learning (AURCL). The method approaches fully supervised performance using only 2.5% labeled data.
A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement: Simple appearance descriptions (e.g., "dark oval") are used to drive Grounding DINO + SAM for training-free pseudo-label generation in breast ultrasound segmentation. A dual-teacher uncertainty-entropy weighted fusion mechanism and adaptive reverse contrastive learning further refine pseudo-label quality. With only 2.5% labeled data, the proposed method matches or surpasses the fully supervised upper bound.
Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning: Inspired by the foundation model paradigm, this work proposes a data-efficient training strategy for diffusion probabilistic models (DPMs) in accelerated MRI reconstruction. A DPM is first pre-trained on large-scale multi-contrast brain MRI data (~4,000 subjects), then fine-tuned with as few as 20 target-domain subjects. The resulting model achieves reconstruction quality comparable to large-dataset training in clinical stroke MRI, with a clinical blind reader study confirming non-inferiority to standard-of-care at 2× acceleration.
Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning: Drawing inspiration from the "pre-train then fine-tune" paradigm of foundation models, this work pre-trains a diffusion probabilistic model (DPM) at scale on ~4,000 fastMRI subjects spanning multiple contrasts, then fine-tunes on as few as 20 target-domain subjects using a low learning rate. The resulting model generalizes across contrasts and acquisition protocols for accelerated MRI reconstruction. In a clinical stroke validation, 2× accelerated images are rated non-inferior to fully-sampled images by blinded neuroradiologists.
Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning: This paper proposes HistoSelect, a framework that emulates the coarse-to-fine reasoning process of pathologists through a three-stage filtering mechanism — tissue segmentation → Group Sampler → Patch Selector — grounded in Information Bottleneck (IB) theory. By compressing task-irrelevant visual tokens, the method achieves state-of-the-art performance across three datasets while reducing computational cost by approximately 70%.
Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning: This paper proposes the UAAI framework, the first to introduce Active Inference into micro-gesture recognition. By combining EFE-guided temporal frame selection, spatial attention, and UMIX uncertainty-aware augmentation, UAAI achieves 63.47% on the SMG dataset (RGB modality), substantially outperforming conventional RGB-based methods.
Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions: This paper proposes SFDA-DeP, a source-free domain adaptation (SFDA) framework inspired by machine unlearning that models adaptation as an iterative process of identifying and correcting prediction bias. The method selectively reduces confidence on uncertain samples from the dominant class, retains reliable predictions, and jointly trains a pixel-level classifier to recover localization discriminability. It consistently outperforms SFDA baselines in both classification and localization across cross-organ and cross-center histopathology benchmarks.
Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions: This paper proposes SFDA-DeP, a method inspired by machine unlearning that reformulates SFDA as an iterative process of "identifying and correcting prediction bias." It applies a forgetting operation to high-entropy uncertain samples from the dominant class to force the model to abandon biased predictions, maintains self-training on reliable samples, and anchors localization capacity via a pixel-level classifier. The method consistently outperforms existing SFDA approaches on cross-organ and cross-center histopathology benchmarks.
Adaptive Confidence Regularization for Multimodal Failure Detection: This paper proposes the ACR framework, the first systematic treatment of multimodal misclassification detection, built on two complementary modules: an Adaptive Confidence Loss (ACL) that penalizes "confidence degradation" where multimodal fusion confidence falls below that of individual unimodal branches, and Multimodal Feature Swapping (MFS) that synthesizes failure-aware outlier samples in the feature space. ACR consistently outperforms existing methods across four datasets.
Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding: Under extreme annotation scarcity—only 206 labeled cases (144 for training)—this work combines patch-based MIM pretraining of a 3D U-Net, a VDETR detector with 3D vertex RPE, and Mean Teacher semi-supervised consistency regularization over 2,000 unlabeled volumes. The approach improves 3D abdominal trauma detection mAP@0.50 from 26.36% to 56.57% on the validation set (+115%), while a frozen encoder with a lightweight classification head achieves 94.07% accuracy on 7-class injury classification.
Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding: This paper proposes a two-stage label-efficient framework: patch-based MIM self-supervised pretraining of a 3D U-Net encoder on 1,206 unlabeled CT volumes, followed by VDETR with 3D vertex relative position encoding for 3D lesion detection, augmented by Mean Teacher semi-supervised consistency regularization over 2,000 additional unlabeled volumes. Using only 144 annotated samples, the framework achieves 56.57% val mAP@0.50, a 115% improvement over fully supervised training.
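The "+115%" figure in the two entries above is a relative improvement, not a difference in percentage points. The short check below recomputes it from the quoted mAP@0.50 values (26.36% and 56.57%); the function name is only illustrative.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


baseline_map = 26.36   # mAP@0.50 before, as quoted in the entry
improved_map = 56.57   # mAP@0.50 after, as quoted in the entry

absolute_gain = improved_map - baseline_map                        # percentage points
relative_gain = relative_improvement(baseline_map, improved_map)   # about 114.6, rounds to 115
```

The absolute gain is about 30.2 percentage points, while the relative gain rounds to the reported 115%.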
From Adaptation to Generalization: Adaptive Visual Prompting for Medical Image Segmentation: This paper proposes APEX (Adaptive Prompt EXtraction), which adaptively retrieves input-specific visual prompts from a learnable prompt memory bank—rather than assigning a fixed prompt per domain—and incorporates low-frequency contrastive learning (LFC) to enhance inter-domain discriminability, achieving significant improvements in medical image segmentation on both seen and unseen domains.
Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study: By evaluating 11 models on three heterogeneous medical datasets under a unified training protocol, this study demonstrates that general-purpose vision models (GP-VMs) systematically outperform most specialized medical segmentation architectures (SMAs) under standardized conditions, challenging the prevailing assumption that medical image segmentation inherently requires domain-specific architectures.
Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study: Under a unified training and evaluation protocol, this study compares 11 models — 5 specialized medical segmentation architectures (SMAs) and 6 general-purpose vision models (GP-VMs) — across 3 heterogeneous medical datasets. GP-VMs systematically outperform most SMAs on all datasets (average mDSC: VW-MiT 91.0% vs. best SMA SU-Mamba 90.5%), and Grad-CAM analysis demonstrates that GP-VMs capture clinically relevant structures.
Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts: Across two independent large-scale lung cancer screening cohorts, deep learning-based automatic segmentation is employed to quantify longitudinal PPFE changes, providing the first validation of the independent prognostic value of PPFE progression in a screening population.
Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts: Across two large-scale lung cancer screening cohorts (NLST n=7,980; SUMMIT n=8,561), this study employs deep learning to automatically segment PPFE volumes and defines "progressive PPFE" based on annualized volume change. Cox proportional hazards models demonstrate that PPFE progression is an independent predictor of all-cause mortality (NLST HR=1.25; SUMMIT HR=3.14), and is significantly associated with respiratory hospitalization rates, antibiotic/corticosteroid usage, and other clinical endpoints.
Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI: This work systematically compares 15 CNN variants (LeNet/ResNet/VGG/Inception) on five-class classification of ovarian cancer histopathology images. InceptionV3-A (ReLU) is selected as the final model, achieving roughly 94% across the reported metrics, with comparative explainability analysis conducted using three XAI methods: LIME, SHAP, and Integrated Gradients.
Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI: This paper systematically compares 15 variants across four major CNN families — LeNet, ResNet, VGG, and Inception — for ovarian cancer histopathology image classification. InceptionV3-ReLU is selected as the final model (average metrics ~94%), and three XAI methods — LIME, SHAP, and Integrated Gradients — are applied to provide interpretability for the classification results.
Benchmarking Endoscopic Surgical Image Restoration and Beyond: This work constructs SurgClean, the first multi-source real-world endoscopic surgical image restoration dataset (3,113 images covering three degradation types: smoke, fog, and liquid splash), and systematically benchmarks 22 representative image restoration methods (12 general-purpose + 10 task-specific) on it. The results reveal a significant gap between existing methods and clinical requirements, and further analyze the fundamental differences between surgical-scene and natural-scene degradations.
Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance: This paper presents the first systematic study of aggregation strategies for converting pixel-level uncertainty maps to image-level scores in segmentation tasks. It proposes the Spatial Mass Ratio (SMR)—incorporating spatial structural information via Moran's I, Edge Density, and Shannon Entropy—alongside a GMM meta-aggregator. Experiments across 10 datasets on OoD and failure detection tasks demonstrate that spatially-aware aggregation significantly outperforms global averaging.
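Moran's I, one of the spatial statistics named in the entry above, measures whether high values in a map cluster together. The sketch below computes it for a 2D uncertainty map with binary 4-neighbour (rook) weights. The neighbourhood choice and the function name `morans_i` are assumptions for illustration, and how the paper combines the statistic into SMR is not shown here.

```python
def morans_i(grid):
    """Moran's I of a 2D grid (list of equal-length rows) using binary
    rook adjacency: each cell is a neighbour of the cells above, below,
    left, and right of it."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    dev = [[v - mean for v in row] for row in grid]
    denom = sum(d * d for row in dev for d in row)
    if denom == 0:
        # Constant map: the statistic is undefined; return 0.0 by convention here.
        return 0.0
    num = 0.0
    weight_sum = 0
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    num += dev[i][j] * dev[ni][nj]
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)
```

Under these weights, a checkerboard map gives a strongly negative value and a map split into two uniform halves gives a positive one. A global mean of the same maps would not separate the two cases.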
Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control: UniPath proposes a semantics-driven pathology image generation framework that achieves diagnostic-level controllable generation through multi-stream control (raw text + diagnostic semantic tokens distilled from a frozen pathology MLLM + prototype bank morphology control), attaining a Patho-FID of 80.9 and outperforming the second-best method by 51%.
BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation: This paper proposes BiCLIP, a framework that employs Bidirectional Multimodal Fusion (BMF) to refine text representations using visual information, and Image Augmentation Consistency (IAC) to enforce perturbation-invariant intermediate features. BiCLIP surpasses state-of-the-art methods on COVID-19 CT segmentation while remaining robust with as little as 1% labeled data.
BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation: This paper proposes BiCLIP, a framework that introduces a Bidirectional Multimodal Fusion (BMF) module enabling text and visual features to mutually refine each other in a closed loop, and an Image Augmentation Consistency (IAC) module that enforces consistency of intermediate features under weak/strong perturbations. BiCLIP achieves robust medical image segmentation under extremely label-scarce (1% annotations only) and image-degraded (low-dose CT noise/motion blur) clinical conditions.
Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection: This paper proposes AnoPLe, a lightweight multimodal bidirectional prompt learning framework that requires neither manually crafted anomaly descriptions nor external auxiliary modules. Through bidirectional text–visual prompt interaction and scale-aware prefixes, AnoPLe performs few-shot multi-class anomaly detection, delivering competitive results on MVTec-AD, VisA, and Real-IAD while maintaining efficient inference (~28 FPS).
Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD: This work constructs a large-scale CBCT-report paired dataset of 7,408 cases covering 55 oral diseases, and develops CBCTRepD, a bilingual oral-maxillofacial CBCT report generation system. Through a collaborative paradigm of AI-generated drafts followed by radiologist editing, the system is shown via multi-level clinical evaluation to elevate junior radiologists to an intermediate level, intermediate radiologists to near-senior level, and reduce omissions for senior radiologists.
Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD: This paper proposes CBCTRepD, a bilingual report generation system for oral and maxillofacial CBCT, trained on a high-quality paired dataset of 7,408 cases. A multi-level evaluation framework is introduced to validate its tiered empowerment effect on novice, intermediate, and senior radiologists within a radiologist–AI collaborative workflow.
CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis: This paper proposes CARE, a slide-level pathology foundation model that employs an Adaptive Region Generator (ARG) to partition WSIs into morphologically coherent irregular regions (analogous to word-level tokens in NLP), combined with two-stage pretraining via cross-modal alignment with RNA/protein expression profiles. Using approximately 1/10 the data of mainstream models, CARE achieves state-of-the-art average performance across 33 downstream tasks.
Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images: This paper proposes CPNN, which constructs cell-type prototypes from publicly available single-cell RNA-seq data and models slide/patch-level gene expression as a weighted combination of these prototypes, achieving state-of-the-art performance on gene expression estimation while providing interpretability.
CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection: This paper revisits CLIP domain adaptation from a data-centric perspective and proposes CHIPS, which computes a utility score for each image-text pair by combining three factors: curvature-aware Newton alignment (fidelity), JL sketching-compressed curvature estimation (scalability), and learnability–domain-relevance weighting (retention). Using only 30% of data, CHIPS matches full-dataset CPT; using 10%, it surpasses 50%-data CPT. The method achieves state-of-the-art data selection performance across 17 medical and 31 general benchmarks.
CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection: This paper proposes CHIPS, a curvature-aware hybrid influence-based data selection method that computes Newton-style alignment scores in the CLIP endpoint subspace and combines them with learnability and domain-relevance weights. Using only 30% of the data, CHIPS matches full-dataset continual pre-training (CPT) performance and achieves state-of-the-art results across 17 medical benchmarks.
CLoE: Expert Consistency Learning for Missing Modality Segmentation: This work reformulates the robustness problem under missing modalities as decision-level expert consistency control. It proposes a dual-branch consistency learning scheme (global MEC + regional REC) coupled with a lightweight gating network that converts consistency scores into modality reliability weights, achieving an average WT Dice of 88.09% across 15 missing-modality combinations on BraTS 2020, surpassing all prior state-of-the-art methods.
CLoE: Expert Consistency Learning for Missing Modality Segmentation: This paper proposes CLoE (Consistency Learning of Experts), which reformulates missing-modality robustness as a decision-level expert consistency control problem. It reduces expert drift via two complementary consistency branches—Modality Expert Consistency (MEC) and Region Expert Consistency (REC)—and achieves reliability-weighted fusion through a consistency-score-driven gating network.
CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration: CRFT is a unified coarse-to-fine cross-modal image registration framework that learns modality-agnostic feature flow representations within a Transformer architecture. It employs 1/8-resolution global correspondence at the coarse stage and multi-scale local refinement at 1/2–1/4 resolution at the fine stage, coupled with iterative discrepancy-guided attention and Spatial Geometric Transform (SGT) to recursively refine flow fields and capture subtle spatial inconsistencies. CRFT outperforms SOTA methods including RAFT, GMFlow, and LoFTR across diverse cross-modal datasets covering optical, infrared, SAR, and multispectral imagery.
Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference: This paper proposes SpaHGC, a multimodal heterogeneous graph framework that constructs three types of subgraphs—intra-target-slice (TS), cross-slice (CS), and intra-reference-slice (RS)—and integrates masked graph contrastive learning with a cross-node dual attention mechanism to predict spatial gene expression from H&E histopathology images, achieving PCC improvements of 7.3%–27.1% across seven datasets.
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold: This paper proposes cryoSENSE, the first computational framework for compressed cryo-EM imaging, demonstrating that protein cryo-EM images can be faithfully reconstructed from undersampled measurements under both sparse priors (DCT/Wavelet/TV) and generative priors (diffusion models), achieving up to 2.5× throughput gain while preserving 3D reconstruction resolution.
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation: This paper proposes CURE — an error-aware curriculum learning framework for multi-task training that dynamically adjusts sampling distributions to emphasize hard samples, improving visual grounding accuracy by +0.37 IoU and reducing hallucination rate by 18.6% without introducing additional data.
Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation: This paper proposes Deco-Mamba, a decoder-centric Transformer-CNN-Mamba hybrid architecture that enhances the decoding process via Co-Attention Gates, Vision State Space Modules (VSSMs), and deformable convolutions, while introducing a distribution-aware deep supervision strategy based on windowed KL divergence. The method achieves state-of-the-art performance across 7 medical image segmentation benchmarks.
Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation: This paper proposes Deco-Mamba, a decoder-centric segmentation network that employs a Co-Attention Gate (CAG) for bidirectional encoder–decoder feature fusion, a Visual State Space Module (VSSM) for long-range dependency modeling, and deformable convolutions for detail recovery. A windowed distribution-aware KL-divergence deep supervision scheme is further introduced. The method achieves state-of-the-art performance on 7 medical segmentation benchmarks at moderate computational cost.
Decoupling Vision and Language: Codebook Anchored Visual Adaptation: This paper proposes CRAFT, which decouples the visual encoder from the language model via a discrete codebook, so that domain adaptation requires fine-tuning only the visual encoder. The adapted encoder can be reused across different LLM architectures, achieving an average improvement of 13.51% across 10 domain benchmarks.
Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning: This paper compares three learning paradigms — local learning (LL), federated learning (FL), and centralized learning (CL) — for binary classification of third molar–mandibular canal overlap on panoramic radiographs. Centralized learning achieves the best performance (AUC 0.831), federated learning serves as a competitive privacy-preserving alternative (AUC 0.757), and both substantially outperform local learning (mean AUC 0.672).
Deep Learning–Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging: This paper proposes ScleraGluNet, a multi-view deep learning framework that combines five-directional scleral vessel imaging with multi-branch CNN feature extraction, MRFO-based feature refinement, and Transformer-based cross-view fusion, achieving 93.8% accuracy on three-class metabolic state classification and an MAE of 6.42 mg/dL for continuous fasting plasma glucose (FPG) estimation.
Deep Learning–Based Estimation of Blood Glucose Levels from Multidirectional Scleral Blood Vessel Imaging: This paper proposes ScleraGluNet, which captures scleral blood vessel photographs from five gaze directions, extracts direction-specific vascular features via parallel CNNs, refines them through MRFO feature selection, and fuses them across views using a Transformer. The model simultaneously performs three-class metabolic state classification (93.8% accuracy) and continuous fasting plasma glucose (FPG) estimation (MAE = 6.42 mg/dL, r = 0.983).
Deep Learning-based Assessment of the Relation Between the Third Molar and Mandibular Canal on Panoramic Radiographs using Local, Centralized, and Federated Learning: This work systematically compares three training paradigms—local learning (LL), federated learning (FL), and centralized learning (CL)—on cropped panoramic dental radiographs partitioned by 8 independent annotators, targeting a binary classification task of third molar–mandibular canal overlap. The study establishes a consistent performance ranking of CL > FL > LL (AUC: 0.831, 0.757, and 0.672, respectively), demonstrating that FL substantially outperforms site-independent training while preserving data privacy.
Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography: This work constructs PETWB-Seg11K, the largest whole-body PET segmentation dataset to date (11,041 3D PET scans and 59,831 segmentation masks), and proposes SegAnyPET, a foundation model enabling prompt-driven universal volumetric segmentation of organs and lesions in PET imaging. The model demonstrates strong performance in zero-shot cross-center and cross-tracer settings.
Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography: This work constructs PETWB-Seg11K, the largest whole-body PET segmentation dataset to date (11,041 3D PET scans + 59,831 masks), and proposes SegAnyPET — the first 3D promptable segmentation foundation model tailored for functional PET imaging — achieving strong zero-shot generalization across multi-center, multi-tracer, and multi-disease scenarios.
Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification: This paper proposes an NNMF+CNN+diffusion defense framework for brain tumor MRI classification. MRI images are first decomposed into compact, interpretable low-rank features via NNMF; the most discriminative components are selected using AUC, Cohen's d, and p-value statistical criteria; a lightweight CNN then performs classification. At inference time, a feature-space purification module combining forward diffusion noise injection and a learned denoiser is introduced. Under AutoAttack (\(L_\infty\), \(\epsilon=0.10\)), robust accuracy improves from 0.47% to 59.53%.
Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification: A four-stage pipeline is proposed consisting of NNMF feature extraction → statistical feature selection → lightweight CNN classification → feature-space diffusion purification. The method maintains 85.1% clean accuracy while substantially improving robust accuracy under AutoAttack (\(L_\infty\), \(\epsilon=0.10\)) from a baseline of 0.47% to 59.5%.
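The "forward diffusion noise injection" step in the two entries above can be written with the standard DDPM forward process, applied here to a feature vector rather than an image. The symbols \(z_0\), \(\bar{\alpha}_t\), \(\xi\), and \(D_\theta\) are assumptions for illustration, and the noise schedule and timestep the paper actually uses are not stated in these notes:

\[
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \xi, \qquad \xi \sim \mathcal{N}(0, I), \qquad \hat{z}_0 = D_\theta(z_t, t)
\]

Here \(z_0\) is the (possibly adversarially perturbed) feature, \(z_t\) is its noised version, and \(\hat{z}_0\) is the purified feature returned by the learned denoiser and passed to the classifier. The noise is written \(\xi\) to keep it distinct from the attack budget \(\epsilon\).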
EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes", "Hands" and "Minds": This paper proposes EchoAgent, an agent system that simulates the "eyes–hands–minds" collaborative workflow of echocardiography clinicians. Through three stages—an Expertise-Driven Cognition engine (mind), a Hierarchical Collaboration Toolkit (eyes + hands), and an Orchestrated Reasoning Hub—the system achieves end-to-end reliable echocardiography interpretation, attaining state-of-the-art performance on multiple benchmarks.
Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models: This paper proposes EDA, a framework that extends the EDM design space from isotropic Gaussian noise to arbitrary noise patterns. By driving SDEs with multivariate Gaussian distributions and multiple independent Wiener processes, EDA enables flexible noise diffusion while provably introducing no additional sampling overhead. With only 5 sampling steps, EDA achieves performance on par with or superior to 100-step Refusion and task-specific methods across three tasks: MRI bias field correction, CT metal artifact removal, and natural image shadow removal.
EI: Early Intervention for Multimodal Imaging based Disease Recognition: EI proposes injecting cross-modal semantic guidance (the [INT] token) before unimodal embedding (UIE), emulating the clinical workflow in which a clinician first examines one modality to form a preliminary judgment and then uses that judgment to guide interpretation of another modality. Simultaneously, EI introduces MoR (multi-rank LoRA with a relaxed bypass router) for parameter-efficient VFM adaptation to the medical domain. With fewer than 9M trainable parameters, EI surpasses all full fine-tuning and prompt-learning baselines on three datasets covering retinal, dermatological, and knee-joint imaging.
Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models (EDA): This paper proposes the EDA framework, which extends the EDM design space from Gaussian noise to arbitrary noise patterns by parameterizing a covariance matrix via a multivariate Gaussian distribution. EDA enables flexible noise diffusion and achieves performance at or above 100-step EDM methods and task-specific approaches using only 5 sampling steps across three tasks: MRI bias field correction, CT metal artifact reduction, and natural image shadow removal.
EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease: This paper proposes EMAD, an end-to-end multimodal vision-language framework for AD diagnosis that generates structured reports. Through hierarchical Sentence–Evidence–Anatomy (SEA) Grounding, each diagnostic statement is explicitly linked to clinical evidence and 3D brain anatomy. Executable rule-driven GRPO reinforcement fine-tuning is applied to ensure clinical consistency.
EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis: This paper proposes EquivAnIA, which employs a family of oriented filters (cake wavelets and ridge filters) to estimate the angular distribution of an image via weighted averaging in the frequency domain, replacing conventional angular binning. The resulting anisotropic analysis is numerically robust to rotation, with a dominant-orientation estimation error of only 0.03° on synthetic images and a CT registration error of only 0.02°.
EquivAnIA: A Spectral Method for Rotation-Equivariant Anisotropic Image Analysis: This paper proposes EquivAnIA, a spectral method that computes angular energy distributions via Cake wavelets and Ridge filters in the Fourier domain, achieving strictly numerically rotation-equivariant anisotropic image analysis. The method substantially outperforms conventional angular PSD binning approaches on both synthetic and real images.
Event-Level Detection of Surgical Instrument Handovers in Videos: This paper proposes a spatiotemporal visual framework for detecting instrument handovers in real surgical videos. It combines ViT-based spatial feature extraction with unidirectional LSTM temporal modeling, and employs multi-task learning to jointly predict handover events and their directions, achieving an event-level detection F1 of 0.84 on kidney transplant surgical videos.
Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning: This paper proposes PAMS (Priority-Aware Mistake Severity), a framework that significantly reduces the risk of severe misdiagnosis in multiclass MIL-based WSI diagnosis through an asymmetric severity-aware cross-entropy loss (MSCE), semantic feature remix (SFR), and an asymmetric Mikel's Wheel evaluation metric.
Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes: This work presents the first robustness evaluation of ZACH-ViT, a compact permutation-invariant ViT architecture, in low-data medical imaging settings. Across 7 MedMNIST datasets, ZACH-ViT ranks first under both clean and common-corruption conditions (mean rank 1.57), first under FGSM (2.00), and second under PGD (2.29).
Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning: An attention-based MIL model is built upon a ConvNeXt-Base backbone, employing a gradient reversal layer (GRL) to adversarially eliminate gender information from scan representations. Combined with focal loss (\(\gamma=2\)) + label smoothing (\(\varepsilon=0.1\)), subgroup oversampling, and 5-fold ensemble, the proposed method achieves a mean competition score of 0.685±0.030 on a four-class lung disease diagnosis task over 889 chest CT scans. The female macro-F1 (0.691) slightly exceeds the male macro-F1 (0.679), validating that GRL effectively closes the fairness gap.
Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning: A fairness-aware framework based on attention MIL and gradient reversal layers (GRL) is proposed for multi-class lung disease diagnosis from chest CT volumes, eliminating gender bias while preserving diagnostic accuracy.
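For readers unfamiliar with the focal loss mentioned in the entries above, the sketch below shows the standard per-sample form with the focusing parameter \(\gamma=2\) quoted there. How the paper combines it with label smoothing and the gradient reversal layer is not specified in these notes, so neither is shown. The function name `focal_loss` is hypothetical.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for one sample, given the predicted probability
    of the true class. With gamma=0 it reduces to cross-entropy."""
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

The factor \((1 - p)^\gamma\) shrinks the loss on well-classified samples, so confidently correct scans contribute little and training concentrates on harder cases.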
Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation: This paper proposes FedMEPD, a framework that employs modality-specific encoders to address intermodal heterogeneity, a filter-level dynamic partial personalization decoder to balance knowledge sharing and personalization, and a multi-anchor cross-attention calibration module to compensate for missing modality information. FedMEPD comprehensively outperforms existing multimodal federated learning methods on BraTS 2018/2020.
Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation: This paper proposes FedMEPD, a framework that addresses two major challenges in federated multimodal brain tumor segmentation — inter-modality heterogeneity and client personalization — through modality-specific encoders (fully federated), a partially personalized multimodal fusion decoder, and a multi-anchor cross-attention calibration module. FedMEPD surpasses existing federated methods on BraTS 2018/2020.
FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning: FedVG scores each client using layer-wise gradient norms computed on a global validation set and assigns higher aggregation weights to clients whose gradients are flatter (i.e., smaller in norm), thereby substantially improving the generalization performance of federated learning under high data heterogeneity.
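A minimal sketch of the weighting idea in the FedVG entry, assuming each client is scored by the inverse of its summed layer-wise gradient norms. That inverse mapping, the summation over layers, and the names `aggregation_weights` and `aggregate` are assumptions for illustration; the paper's scoring function may differ.

```python
def aggregation_weights(client_layer_norms, eps=1e-12):
    """client_layer_norms: one list of layer-wise gradient norms per client,
    measured on a shared validation set. Smaller total norm -> larger weight.
    The inverse-norm score is an illustrative assumption."""
    scores = [1.0 / (sum(norms) + eps) for norms in client_layer_norms]
    total = sum(scores)
    return [s / total for s in scores]


def aggregate(client_params, weights):
    """Weighted average of flat parameter lists, one list per client."""
    n_params = len(client_params[0])
    return [
        sum(w * params[k] for w, params in zip(weights, client_params))
        for k in range(n_params)
    ]
```

For example, two clients with layer norms `[1, 1]` and `[2, 2]` receive weights of about 2/3 and 1/3, so the client with the smaller validation gradients dominates the average.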
Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis: FPRL is proposed as a clinically-inspired hierarchical self-supervised framework that mitigates motion bias by first "focusing" on intra-frame lesion-centric static semantics and then "perceiving" inter-frame contextual evolution, achieving state-of-the-art performance across 11 endoscopic datasets.
Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning: This paper is the first to systematically define the task of video-based epileptic seizure forecasting (predicting whether a seizure will occur within the next 5 seconds using 3–10-second pre-ictal clips), and proposes a two-stage cross-species transfer learning framework — self-supervised pre-training of VideoMAE on a mixed dataset of rodent and human videos, followed by few-shot fine-tuning on a very limited set of human epilepsy videos. Under 2/3/4-shot settings, the framework achieves an average balanced accuracy (bacc) of 72.30% and ROC-AUC of 75.58%, outperforming all video understanding baselines.
Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning: This work introduces the first purely vision-based epileptic seizure forecasting task, leveraging large-scale rodent epilepsy videos for cross-species self-supervised pre-training via the VideoMAE framework, achieving >70% forecasting accuracy within a 3–10 second prediction window.
Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay: This paper proposes FORGE, the first continual learning framework specifically designed for cross-site fMRI-based brain disorder diagnosis. FORGE generates realistic functional connectivity matrices via a structure-aware VAE for privacy-preserving generative replay, and combines dual-level knowledge distillation with a hierarchical contextual bandit sampling strategy to effectively mitigate catastrophic forgetting.
GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction: This paper proposes GaussianPile, which extends 3D Gaussian Splatting from surface appearance modeling to slice-based volumetric reconstruction by introducing a focus-aware physical imaging model (Focus Gaussian). On ultrasound and light-sheet microscopy data, the method achieves high-quality volumetric compression and reconstruction that is 11× faster than NeRF-based methods and reduces storage by 16× compared to voxel grids.
GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis: This paper proposes the GIIM framework, which constructs a Multi-Heterogeneous Graph (MHG) to simultaneously model intra-view and inter-view dependencies among lesions in multi-view medical images, and achieves robust diagnosis on incomplete data through four missing-view representation strategies.
GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis: This paper proposes the GIIM framework, which constructs a Multi-Heterogeneous Graph (MHG) with four types of edge relations to simultaneously model the dynamic changes of individual lesions across imaging phases and the spatial associations among different lesions. Four missing-view imputation strategies are designed. GIIM achieves significant improvements over existing methods on three modalities: liver CT, breast mammography, and breast MRI.
GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification: This paper presents GLEAM, the first publicly available trimodal glaucoma dataset (SLO fundus photography + peripapillary OCT + visual field deviation maps, 1,200 cases, four-stage annotation), along with HAMM, a CNN-based hierarchical attention masked modeling framework. HAMM achieves cross-modal fusion via clinically inspired multi-head modality gating and relational graph attention, attaining a four-class classification accuracy of 81.08%.
GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification: This paper introduces GLEAM (Glaucoma Lesion Evaluation and Analysis with Multimodal imaging), the first publicly available three-modality glaucoma dataset comprising SLO fundus images, circumpapillary OCT, and visual field pattern deviation maps, along with HAMM (Hierarchical Attentive Masked Modeling), a framework that concentrates cross-modal representation learning at the encoder side via a hierarchical attentive encoder and a lightweight decoder, enabling accurate four-stage glaucoma classification.
Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization: This paper proposes GenEval, which quantifies causal coverage gaps via a Domain Conformal Bound (DCB), distills human expert knowledge, and integrates it with a medical VLM (MedGemma-4B) through LoRA fine-tuning for single source domain generalization (SDG), achieving substantial gains over baselines on DR grading and seizure onset zone (SOZ) detection.
Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization: This paper proposes the Domain Conformal Bound (DCB) theoretical framework to quantify causal factor discrepancies across domains and derives an optimizable consistency metric SDCD. Expert knowledge is refined accordingly and injected into MedGemma-4B via LoRA, achieving substantial improvements over single source domain generalization SOTA on 8 DR and 2 SOZ datasets.
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset: This paper introduces instruction-guided lesion segmentation (ILS) for chest X-rays, constructs MIMIC-ILS, the first large-scale automatically generated instruction-answer dataset (1.1M samples, 192K images, 91K masks), and trains the ROSALIA model, which achieves a gIoU of 71.2% and a null-target accuracy of 91.8%, substantially outperforming existing general-purpose and medical segmentation models.
Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment: This paper identifies and addresses the degradation of local feature alignment in CLIP under cross-domain few-shot learning (CDFSL), and proposes CC-CDFSL, a cycle-consistency-based framework. Through bidirectional T-I-T and I-T-I cyclic paths and a semantic anchor mechanism, CC-CDFSL improves patch-level vision-language alignment while enhancing model interpretability.
InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models: This paper proposes InvAD, which shifts diffusion-based anomaly detection from a "denoising-reconstruction in RGB space" paradigm to a "noising-inversion in latent space" paradigm. By applying DDIM inversion to directly infer the terminal latent variable and measuring deviation under the prior distribution, anomalies are detected without reconstruction. Only 3 inversion steps suffice to achieve state-of-the-art performance, with approximately 2× inference speedup.
InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models: This paper proposes a "detection via noising" paradigm to replace the conventional "detection via denoising" approach. By mapping images to the latent noise space via DDIM inversion, the method measures deviation from the prior distribution as an anomaly score using only 3 inference steps—without any reconstruction—achieving state-of-the-art accuracy at 88 FPS (more than 2× faster than OmiAD).
Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision: This paper proposes MASS (MAsk-guided Self-Supervised learning), which leverages category-agnostic masks automatically generated by SAM2 as pseudo-annotations and adopts in-context segmentation as a pretext task for self-supervised pretraining. Without any manual annotation, MASS learns semantically rich and highly generalizable 3D medical image representations, achieving strong performance on both few-shot segmentation and frozen-encoder classification.
LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings: This paper presents LEMON, a large-scale endoscopic dataset comprising 4,194 surgical videos (938 hours), and proposes LemonFM, a self-supervised foundation model based on augmented knowledge distillation. LemonFM achieves state-of-the-art performance across four downstream surgical tasks: phase recognition, tool detection, action recognition, and semantic segmentation.
LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings: This paper introduces LEMON, the largest open surgical video dataset to date (4,194 videos, 938 hours, 35 procedure types), and proposes LemonFM, a foundation model based on augmented knowledge distillation, which comprehensively outperforms existing methods across four downstream tasks: surgical phase recognition, tool detection, action recognition, and semantic segmentation.
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol: This paper introduces LUMINA, a multi-vendor full-field digital mammography (FFDM) dataset comprising 468 patients and 1,824 images, accompanied by a foreground-pixel histogram matching protocol for energy harmonization. The benchmark systematically evaluates CNN and Transformer models across three clinical tasks: diagnosis, BI-RADS classification, and breast density prediction.
Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies: This paper proposes a low-cost marker-based photogrammetry approach for high-quality 3D reconstruction of aggregate particles. Through a systematic comparative analysis of 2D and 3D morphological indices, it reveals significant deviations introduced by 2D projection analysis relative to true 3D morphology.
Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies: This paper proposes a low-cost marker-based photogrammetric pipeline for high-quality 3D reconstruction of aggregate particles. Through a systematic comparative analysis of 2D and 3D morphological indices, it reveals the significant limitations of relying solely on 2D images for aggregate morphology assessment.
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation: Built upon frozen CLIP encoders, MedCLIPSeg introduces a probabilistic cross-modal attention adapter (PVL) that enables bidirectional vision-language interaction and explicit prediction uncertainty modeling, complemented by a soft patch-level contrastive loss. The method achieves strong data efficiency, domain generalization, and interpretability across 16 medical segmentation datasets.
MedGEN-Bench: Contextually Entangled Benchmark for Open-Ended Multimodal Medical Generation: This paper introduces MedGEN-Bench, the first comprehensive benchmark for open-ended multimodal medical generation, comprising 6,422 expert-verified image-text pairs spanning 6 imaging modalities and 16 clinical tasks, accompanied by a three-tier evaluation framework. The benchmark reveals that compositional pipelines outperform unified models in cross-modal consistency.
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding: MedGRPO introduces two key innovations to address training collapse in multi-dataset reinforcement learning for medical video understanding: cross-dataset reward normalization (mapping median performance across datasets of varying difficulty to a uniform reward value via a logistic function) and a medical LLM judge (comparative scoring across five clinical dimensions). Built on Qwen2.5-VL-7B and trained on MedVidBench (532K video instruction pairs), the method surpasses GPT-4.1 and Gemini-2.5-Flash.
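As a rough illustration of the cross-dataset reward normalization idea in the MedGRPO entry, the sketch below centres a logistic function on each dataset's median score, so that median performance on an easy dataset and on a hard dataset both map to the same reward of 0.5. The function name `normalize_reward`, the `temperature` parameter, and the example scores are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from statistics import median


def normalize_reward(score, dataset_scores, temperature=0.1):
    """Map a raw score to (0, 1) with a logistic centred on the dataset median.

    Hypothetical form: a score equal to the dataset median always yields 0.5,
    regardless of how easy or hard the dataset is.
    """
    center = median(dataset_scores)
    return 1.0 / (1.0 + math.exp(-(score - center) / temperature))


# Illustrative raw scores from an easy and a hard dataset (made-up numbers).
easy = [0.80, 0.85, 0.90, 0.95, 0.99]
hard = [0.05, 0.10, 0.15, 0.20, 0.30]

easy_reward = normalize_reward(0.90, easy)  # median score on the easy dataset
hard_reward = normalize_reward(0.15, hard)  # median score on the hard dataset
```

Under this assumed form, a rollout is rewarded for beating its own dataset's typical performance rather than for landing on an easy dataset, which is one plausible way to keep high-scoring datasets from dominating the policy gradient.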
MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration: This paper proposes MedKCO, a knowledge-driven cognitive orchestration strategy for medical vision-language pretraining. It introduces a hierarchical curriculum (label-level ordering by diagnostic sensitivity + description-level ordering by sample representativeness) and a self-paced asymmetric contrastive loss, enabling the model to progressively learn from simple to complex concepts. MedKCO substantially outperforms baselines on zero-shot and downstream tasks across three medical imaging modalities.
MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification: This paper proposes MIL-PF, a framework that leverages frozen foundation vision encoders (DINOv2/MedSigLIP) to precompute features, followed by a lightweight MIL head of approximately 40K parameters for mammography classification. The method achieves state-of-the-art performance on the large-scale EMBED dataset while substantially reducing training cost.
MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification: MIL-PF combines frozen general-purpose foundation encoders (DINOv2 ViT-Giant / MedSigLIP) with a lightweight MIL aggregation head of only ~40K parameters, using a dual-stream aggregation strategy (global mean pooling + local Perceiver cross-attention). It achieves state-of-the-art performance on large-scale mammography classification benchmarks such as EMBED (AUC 0.916, Spec@Sens=0.9 of 0.762), while training in 5–7 minutes with 35–458× fewer trainable parameters than baselines.
Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning: This paper reveals that enhancing visual discriminability during VLM fine-tuning for cross-domain few-shot learning paradoxically degrades cross-modal alignment — a phenomenon termed the "discriminability trap." Two plug-and-play modules, SVL and RA, are proposed to suppress visual learning shortcuts and guide cross-modal alignment, achieving state-of-the-art performance on 4 CDFSL benchmarks and 11 FSL datasets.
Mitigating Object Hallucination in LVLMs via Attention Imbalance Rectification: This paper introduces the concept of Attention Imbalance to explain object hallucination in LVLMs, and proposes a lightweight decoding-time intervention method, AIR, which rectifies attention imbalance via cross-modal attention reallocation and variance-constrained projection regularization. AIR reduces hallucination rates by up to 35.1% and improves general capability by up to 15.9% across four LVLMs.
MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection: MoECLIP introduces Mixture-of-Experts into zero-shot anomaly detection (ZSAD), achieving patch-level dynamic expert routing and specialization via Frozen Orthogonal Feature Separation (FOFS) and an Equiangular Tight Frame (ETF) loss, attaining state-of-the-art performance across 14 industrial and medical benchmarks.
Momentum Memory for Knowledge Distillation in Computational Pathology: This paper proposes MoMKD, which replaces conventional batch-local feature alignment with a momentum-updated class-conditional memory bank to enable cross-modal knowledge distillation from genomics to pathology whole-slide images (WSIs), achieving genomics-level predictive capability at inference using only H&E slides.
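A momentum-updated class-conditional memory bank, as mentioned in the MoMKD entry, can be sketched as one exponential-moving-average prototype per class. This is a minimal sketch under assumptions: `update_memory`, the `momentum` value, and the toy feature vectors are illustrative and not taken from the paper.

```python
def update_memory(memory, features, labels, momentum=0.9):
    """EMA update of one prototype per class from the current batch.

    memory   : dict mapping class label -> prototype vector (list of floats)
    features : list of feature vectors in the batch
    labels   : class label of each feature vector
    """
    by_class = {}
    for vec, label in zip(features, labels):
        by_class.setdefault(label, []).append(vec)

    for label, vecs in by_class.items():
        dim = len(vecs[0])
        batch_mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        if label not in memory:
            # First time this class is seen: initialise with the batch mean.
            memory[label] = batch_mean
        else:
            memory[label] = [
                momentum * m + (1 - momentum) * b
                for m, b in zip(memory[label], batch_mean)
            ]
    return memory


memory = {}
update_memory(memory, [[1.0, 0.0], [3.0, 0.0]], [0, 0])  # prototype starts at the batch mean
update_memory(memory, [[4.0, 2.0]], [0], momentum=0.5)   # blended halfway toward the new batch
```

Because the prototype accumulates across batches, an alignment loss computed against it is not limited to the samples that happen to share a batch, which is the contrast with batch-local alignment drawn in the entry.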
MozzaVID: Mozzarella Volumetric Image Dataset: This paper introduces MozzaVID, a volumetric image classification dataset of mozzarella cheese microstructure acquired with synchrotron X-ray CT. It contains between 591 and 37,824 samples of size \(192^3\), with classification targets spanning 25 cheese types and 149 individual cheese specimens. The dataset helps close the large gap in scale and task design between 3D volumetric and 2D datasets, and experiments show that 3D models significantly outperform their 2D counterparts.
MRI Contrast Enhancement Kinetics World Model: This paper presents the first MRI Contrast Enhancement Kinetics World Model (MRI CEKWorld), which leverages spatiotemporal consistency learning (STCL) on sparsely sampled data to generate continuous, high-fidelity contrast-enhanced sequences from non-contrast MRI, addressing the dual challenges of content distortion and temporal discontinuity.
Multimodal Classification of Radiation-Induced Contrast Enhancements and Tumor Recurrence Using Deep Learning: This paper proposes RICE-NET, a multimodal 3D ResNet-18 model that integrates longitudinal MRI data with radiotherapy dose distribution maps to automatically distinguish radiation-induced contrast enhancements (RICE) from tumor recurrence following glioblastoma surgery, achieving F1=0.92 on an independent test set.
Multimodal Classification of Radiation-Induced Contrast Enhancements and Tumor Recurrence Using Deep Learning: This paper proposes RICE-NET, a multimodal 3D ResNet-18 that fuses longitudinal T1-weighted MRI with radiotherapy dose distribution maps. Evaluated on a cohort of 92 glioblastoma patients, the model achieves F1=0.916 for classifying radiation-induced contrast enhancements (RICE) versus tumor recurrence. Ablation studies reveal that the radiotherapy dose map is the single most informative modality (F1=0.78).
Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation: This paper proposes ERBA (Enzyme-Reaction Bridging Adapter), which reformulates enzyme kinetic parameter prediction as a staged conditioning problem aligned with catalytic mechanisms — first injecting substrate information via MRCA to capture molecular recognition, then fusing active-site 3D geometry via G-MoE to model conformational adaptation, and applying ESDA for distribution alignment to preserve PLM priors — achieving state-of-the-art performance across three kinetic metrics.
Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation: This paper proposes ERBA (Enzyme-Reaction Bridging Adapter), which reformulates enzyme kinetic parameter prediction as a staged multimodal conditional generation problem — first injecting substrate information via MRCA to capture substrate recognition specificity, then integrating active-site 3D geometry via G-MoE to capture conformational adaptation, with ESDA distribution alignment to preserve PLM semantic priors.
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning: This paper proposes MMPFN, the first method to extend the pretrained tabular foundation model TabPFN to multimodal settings (tabular + image/text). By introducing a Multi-head Gated MLP (MGM) and a Cross-Attention Pooler (CAP), MMPFN addresses two failure modes — over-compression of non-tabular embeddings and token-count imbalance — and achieves state-of-the-art performance on both medical and general-purpose datasets.
Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation: This paper proposes MSG-LDM, which introduces a multiscale structure-style disentanglement mechanism into a latent diffusion model. Through high-frequency injection, multimodal structural feature fusion, and structure-aware losses, MSG-LDM achieves multimodal MRI synthesis that preserves anatomical structures and fine-grained details under missing-modality scenarios.
Multiscale Structure-Guided Latent Diffusion for Multimodal MRI Translation: This paper proposes MSG-LDM, a latent diffusion model-based framework for multimodal MRI translation. By explicitly disentangling style and structural information in the latent space and incorporating High-Frequency Injection Blocks (HFIB), Multi-Modal Structural Feature Fusion (MMSF), and Multi-Scale Structure Enhancement (MSSE) modules, the framework extracts modality-agnostic structural priors to guide diffusion denoising. MSG-LDM outperforms existing methods on the BraTS2020 and WMH datasets.
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification: This paper proposes the MUSE framework, which significantly improves generalization in few-shot whole slide image (WSI) classification through MoE-driven sample-wise fine-grained semantic enhancement (SFSE) and LLM knowledge base-based stochastic multi-view model optimization (SMMO).
MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality: This paper proposes MUST, a framework that explicitly decomposes multimodal representations into modality-specific and cross-modal shared components via algebraic constraints, and employs a conditional latent diffusion model to generate modality-specific information under missing-modality scenarios. MUST achieves state-of-the-art performance with a C-index of 0.742 across five TCGA cancer datasets, with degradation of only ~0.4%–3.5% under missing-modality conditions.
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy: This paper proposes MuViT, a multi-resolution Vision Transformer that employs world-coordinate RoPE positional encoding to jointly process crops of the same scene at different physical resolutions within a single encoder, achieving substantial improvements over single-resolution baselines on microscopy image segmentation tasks.
NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization: NeurINO initializes a 3D neuron segmentation model by inflating DINOv3-pretrained 2D convolutional kernels into 3D operators, and introduces a Topology-Aware Skeleton Loss (TASL) to explicitly supervise skeleton-level structural fidelity. The method achieves average improvements of 2.9% in ESA, 2.8% in DSA, and 3.8% in PDS across four neuroimaging datasets.
Novel Architecture of RPA in Oral Cancer Lesion Detection: This paper compares low-code RPA platforms (UiPath, Automation Anywhere) against a Python-based design pattern approach (Singleton + Batch Processing) for oral cancer detection automation. The proposed OC-RPAv2 reduces per-image inference time from 2.5 seconds to 0.06 seconds, achieving a 60–100× speedup.
Novel Architecture of RPA in Oral Cancer Lesion Detection: This work integrates software design patterns (Singleton + Batch Processing) into an EfficientNetV2B1-based Python pipeline for oral cancer lesion detection, achieving a 60–100× inference speedup over conventional RPA platforms (UiPath/Automation Anywhere), at 0.06 s per image vs. 2.58 s, while maintaining diagnostic accuracy.
OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging: This paper proposes OmniFM, a modality-robust and task-agnostic federated learning framework that integrates three complementary components—Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, and Prefix–Suffix Spectral Prompting—to support five medical imaging tasks (classification, segmentation, super-resolution, VQA, and multimodal fusion) within a unified FL pipeline, achieving substantial improvements over existing baselines under cross-modal heterogeneous settings.
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation: OraPO (Oracle-educated GRPO) injects lightweight DPO supervision when GRPO exploration fails, converting zero-reward rollouts into preference pairs. Combined with a FactScore reward, the method achieves SOTA radiology report generation on CheXpert Plus and MIMIC-CXR (F1=0.341/0.357) using only 1K training samples and a 3B model—reducing training data by 2–3 orders of magnitude compared to prior best methods.
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation: OraPO is proposed as an adaptive hybrid RL framework combining GRPO and DPO for data-efficient radiology report generation. It dynamically switches between GRPO and DPO via Zero-Reward Rate detection, and employs a FactScore-based clinical fact-level reward. Using only 1K samples (vs. 227K for baselines), OraPO achieves state-of-the-art clinical F1 scores of 0.341/0.357 on CheXpert Plus and MIMIC-CXR.
Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification: HIPSS introduces two key innovations for few-shot WSI classification: (1) parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) as a replacement for CoOp, substantially reducing the number of trainable parameters; and (2) a soft hierarchical textual guidance strategy that exploits the pretrained knowledge of VLMs and the inherent hierarchical structure of WSIs without hard patch filtering. The method achieves up to 13.8% improvement across three cancer datasets.
PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation: PGR-Net is an explicit ROI-aware brain tumor MRI segmentation network that concentrates computation on lesion regions through three components: a data-driven set of spatial prior templates \(\{(r_i, c_i)\}\) built from the training set, a hierarchical Top-K ROI selection mechanism, and a Window Gaussian-Spatial decay guidance module (WinGS-ROI). With only 8.64M parameters, the method achieves state-of-the-art performance on BraTS-2019/2023 and MSD Task01.
Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting: This paper proposes ProtoSR, which leverages LLMs to mine a template-aligned visual prototype knowledge base from large-scale free-text radiology reports, and injects it into a structured report generation model via prototype-conditioned residuals (late fusion). ProtoSR achieves state-of-the-art performance on the Rad-ReStruct benchmark, with particularly significant gains on fine-grained attribute questions.
Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting: This paper proposes ProtoSR, which employs an LLM-driven pipeline to mine template-aligned visual prototype knowledge bases from 227,000 free-text MIMIC-CXR reports, and introduces a prototype-conditioned late-fusion module that injects retrieved prototype evidence as logit residuals into a hierarchical structured reporting model. ProtoSR achieves state-of-the-art performance on the Rad-ReStruct benchmark, improving L3 fine-grained attribute F1 from 4.3 to 7.4 (+72.1% relative gain).
RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation: This work introduces RDFace, a standardized benchmark comprising 456 pediatric facial images spanning 103 rare genetic diseases, and systematically evaluates phenotype-aware synthetic data augmentation (DreamBooth/FastGAN) for rare disease diagnosis under extremely low-sample regimes. DreamBooth-based augmentation achieves up to 13.7% improvement in diagnostic accuracy in the most data-scarce settings.
Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning: This paper identifies "Lost Layers" in CLIP's text encoder — intermediate layers whose removal paradoxically improves performance under Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL). The authors demonstrate that these layers are not redundant but rather underutilized due to visual domain shift, and propose the VtT model to reclaim this information at both the layer and encoder levels, achieving state-of-the-art performance.
Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration: This work selectively replaces the classical skull stripping (BET2) and tissue segmentation (FAST) modules in the SIENA longitudinal brain atrophy pipeline with deep learning alternatives (SynthStrip/SynthSeg). Evaluated on two large-scale longitudinal cohorts—ADNI (N=1006) and PPMI (N=310)—the proposed modifications substantially improve the correlation between PBVC and clinical disease progression (correlation coefficients increase by over 100%), while reducing scan-order error by up to 99.1%.
Reinforcing the Weakest Links: Modernizing SIENA with Targeted Deep Learning Integration: By replacing the classical skull stripping (BET2) and tissue segmentation (FAST) modules in the SIENA brain atrophy pipeline with deep learning alternatives (SynthStrip, SynthSeg), this work significantly improves the clinical sensitivity and robustness of PBVC estimation while preserving the interpretability of the overall pipeline.
RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference: This paper proposes RelativeFlow, a flow matching-based framework that decomposes the absolute noise-to-clean mapping into relative noisier-to-noisy mappings. By incorporating a consistent transport constraint and a simulation-based velocity field, RelativeFlow learns a unified denoising flow from heterogeneous noisy references, overcoming the reference bias limitation.
Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning: This paper proposes the Residual SODAP framework, which jointly addresses prompt-side representation adaptation and classifier-side knowledge preservation through: α-entmax sparse prompt selection with residual aggregation, data-free statistical distillation with pseudo-feature replay, prompt usage pattern drift detection (PUDD), and uncertainty-weighted multi-loss balancing. The framework achieves state-of-the-art performance on medical domain-incremental learning benchmarks.
Residual SODAP: Residual Self-Organizing Domain-Adaptive Prompting with Structural Knowledge Preservation for Continual Learning: This paper proposes the Residual SODAP framework, which jointly addresses representation adaptation (via α-entmax sparse prompt selection with residual aggregation) and classifier preservation (via statistical pseudo-feature replay and knowledge distillation) for domain-incremental learning without task IDs or data buffers, achieving state-of-the-art performance on three benchmarks: DR, Skin Cancer, and CORe50.
Robust Fair Disease Diagnosis in CT Images: This paper proposes a dual-objective training framework combining Logit-Adjusted Cross-Entropy (for class imbalance) and CVaR aggregation (for demographic fairness), achieving a gender-averaged macro F1 of 0.8403 with a fairness gap of only 0.0239 on CT disease diagnosis.
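The two ingredients named in this entry are standard techniques and can be sketched in isolation. Logit-adjusted cross-entropy adds `tau * log(prior)` to each class logit before the softmax, and CVaR averages only the worst fraction of per-group losses. How the paper combines them, as well as the names `logit_adjusted_ce`, `cvar`, `tau`, and `alpha`, are assumptions made for illustration.

```python
import math


def logit_adjusted_ce(logits, target, class_priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(class prior)."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, class_priors)]
    peak = max(adjusted)  # subtract the max for a numerically stable log-sum-exp
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
    return log_norm - adjusted[target]


def cvar(group_losses, alpha=0.5):
    """Mean of the worst ceil(alpha * n) group losses (Conditional Value at Risk)."""
    k = max(1, math.ceil(alpha * len(group_losses)))
    worst = sorted(group_losses, reverse=True)[:k]
    return sum(worst) / k


# With equal logits, the rare class (prior 0.1) incurs a larger loss than the
# frequent class (prior 0.9), so the model must favour it by a wider margin.
rare_loss = logit_adjusted_ce([0.0, 0.0], target=1, class_priors=[0.9, 0.1])
common_loss = logit_adjusted_ce([0.0, 0.0], target=0, class_priors=[0.9, 0.1])

# Aggregating per-group losses (e.g. one per demographic group) with CVaR
# focuses the objective on the worst-off group.
worst_group = cvar([1.0, 3.0], alpha=0.5)
```

With `alpha=1.0` the CVaR reduces to the plain mean over groups, so `alpha` controls how strongly the objective concentrates on the worst-performing groups.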
Robust Multi-Source Covid-19 Detection in CT Images: This paper proposes a multi-task learning framework that jointly trains a COVID-19 diagnosis head and a source hospital identification head (supervised by a logit-adjusted loss) on a shared EfficientNet-B7 backbone, encouraging the feature extractor to learn institution-invariant representations. The method achieves an F1 of 0.9098 on a multi-source CT dataset.
SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation: This paper proposes SD-FSMIS, a framework that adapts pretrained Stable Diffusion for few-shot medical image segmentation (FSMIS). Through a Support-Query Interaction module and a Visual-to-Text Conditioning Transformer, the framework achieves efficient adaptation, with particularly strong performance in cross-domain scenarios.
Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation: This paper proposes SCDL (Semantic Class Distribution Learning), a plug-and-play module that learns structured class-conditional feature distributions and aligns them bidirectionally with learnable class proxies via Class Distribution Bidirectional Alignment (CDBA). Combined with Semantic Anchor Constraints (SAC), which leverage annotated data to guide proxies toward correct semantics, SCDL mitigates both supervision bias and feature representation bias in semi-supervised medical image segmentation (SSMIS), achieving notable improvements on tail-class organs.
SCDL: Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation: This paper proposes SCDL, a plug-and-play semantic class distribution learning framework that addresses supervision bias and representation imbalance in semi-supervised medical image segmentation (SSMIS) via two components: Class Distribution Bidirectional Alignment (CDBA), which learns structured class-conditional feature distributions through proxy distributions, and Semantic Anchor Constraint (SAC), which guides proxy distributions toward true class semantics. SCDL achieves state-of-the-art performance on minority class segmentation.
SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation: This paper proposes SemiTooth, a framework that addresses annotation scarcity and cross-source domain discrepancy in multi-source CBCT tooth segmentation via a multi-teacher multi-student architecture and Strict Weighted Confidence (SWC) constraints. It also introduces MS3Toothset, the first multi-source semi-supervised tooth segmentation dataset.
SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation: This paper proposes SemiTooth, a multi-teacher multi-student semi-supervised framework coupled with a Stricter Weighted-Confidence (SWC) constraint, which effectively leverages multi-source unlabeled data for multi-source CBCT tooth segmentation and achieves cross-source generalization.
Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors: This paper proposes InvTag, a framework that, for the first time, integrates a physics-based MR forward model with a pretrained diffusion generative prior to jointly solve three sub-tasks in 3D Tagged MRI—anatomical recovery, Cine synthesis, and motion estimation—without requiring any additional training data.
Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis: This paper proposes STEPH, which efficiently transfers generalizable prognostic knowledge from multiple cancer-type models to a target cancer type via Task Vector Mixup (TVM) and hypernetwork-driven sparse aggregation, achieving an average C-Index improvement of 5.14% across 13 TCGA datasets without requiring large-scale joint training or multi-model inference.
STEPH: Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in WSI Prognosis: STEPH is a model-merging framework based on Task Vector Mixup (TVM) and hypernetwork-driven sparse aggregation, which efficiently transfers knowledge from multiple cancer-type-specific prognosis models into a target cancer model. It achieves a mean C-Index of 0.6949 across 13 TCGA datasets (+5.14% vs. cancer-type-specific learning, +2.01% vs. ROUPKT) while requiring only a single-model forward pass at inference, making it far more efficient than multi-model representation transfer approaches.
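The task-vector arithmetic underlying this kind of model merging can be sketched in a few lines. A task vector is the difference between fine-tuned and pretrained weights; the sketch interpolates a target and a source task vector and adds the result back to the pretrained weights. The helper names, the fixed mixing coefficient `lam`, and the toy weights are assumptions; STEPH's hypernetwork-driven sparse aggregation, which would choose the mixing, is not shown.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference between fine-tuned and pretrained weights."""
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])] for k in pretrained}


def mixup(tv_a, tv_b, lam):
    """Convex combination of two task vectors with coefficient lam in [0, 1]."""
    return {k: [lam * a + (1 - lam) * b for a, b in zip(tv_a[k], tv_b[k])] for k in tv_a}


def apply_task_vector(pretrained, tv):
    """Add a (mixed) task vector back onto the pretrained weights."""
    return {k: [p + t for p, t in zip(pretrained[k], tv[k])] for k in pretrained}


# Toy weights: a shared pretrained model and two cancer-type-specific fine-tunes.
base = {"w": [1.0, 1.0]}
target = {"w": [2.0, 1.0]}
source = {"w": [1.0, 3.0]}

mixed = mixup(task_vector(target, base), task_vector(source, base), lam=0.5)
merged = apply_task_vector(base, mixed)
```

The merged result is a single set of weights, which is consistent with the entry's point that inference needs only one forward pass rather than one per source model.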
SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation: This paper proposes the SPEGC framework, which combines semantic-prompt-enhanced feature representations with a differentiable graph clustering solver to refine raw similarity matrices into higher-order structural representations. These representations guide the adaptation of medical image segmentation models to continuously shifting target domains, effectively mitigating error accumulation and catastrophic forgetting.
SVC 2026: The Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge: This paper reports on the SVC 2026 challenge, which comprises two tracks: cross-domain multimodal deception detection and domain-generalized remote physiological measurement. It provides a unified evaluation framework and baseline models; 22 teams submitted final results.
Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos: This work introduces SurgBlood, the first laparoscopic surgical video dataset with annotations for both bleeding regions and bleeding points, and proposes BlooDet, a SAM2-based dual-branch bidirectional guidance online detector that achieves joint bleeding region segmentation and bleeding point localization through synergistic optimization of Mask and Point branches.
T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation: This paper proposes a lightweight Temporal Gated Adapter (T-Gated Adapter) that injects adjacent-slice context into the 2D vision-language model CLIPSeg. Trained on only 30 annotated CT volumes, the method achieves an average Dice of 0.704 (+0.206), with consistent improvements on cross-domain zero-shot evaluation and CT-to-MRI cross-modal evaluation.
Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model: This paper proposes Tell2Adapt, a unified framework that leverages the generalized knowledge of a vision foundation model (BiomedParse) to generate high-quality pseudo labels via Context-Aware Prompt Regularization (CAPR), followed by Visual Plausibility Refinement (VPR) to eliminate anatomically implausible predictions, enabling unified source-free unsupervised domain adaptation for medical image segmentation across 10 domain transfer directions and 22 anatomical targets.
The Invisible Gorilla Effect in Out-of-distribution Detection: This paper reveals a previously unreported bias in OOD detection — the "Invisible Gorilla Effect": detection performance is substantially higher when OOD artifacts are visually similar to the model's region of interest (ROI), and degrades significantly when they are dissimilar, with feature-based OOD methods being most severely affected.
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data: This paper proposes the Difficulty-Influence Quadrant (DIQ) data selection strategy, which jointly considers sample difficulty and gradient influence to enable VLM language backbones to match full-data SFT performance using only 1% of curated data, and to surpass full-data training with just 10%.
Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging for Osteoporosis Classification: This paper proposes a fully automatic multi-region HR-pQCT segmentation framework based on SegFormer, combined with radiomic features and machine learning for binary osteoporosis classification. The key finding is that soft tissue (tendon/fat) features demonstrate greater diagnostic value than traditional bone features.
Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding: The core contribution of this paper is not merely an "ultrasound version of CLIP," but rather a redefinition of the image-text alignment objective around ultrasound-specific anatomical hierarchies and diagnostic attributes. The authors first construct the Ultrasonographic Diagnostic Taxonomy (UDT) and the large-scale US-365K dataset, then explicitly inject clinical relationships from text into contrastive learning via semantic soft labels and an attribute heterogeneous graph, yielding visual-language representations that are more genuinely "ultrasound-aware."
Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos: This paper proposes SMART, a Teacher-Student semi-supervised framework built upon SAM3's concept-prompt segmentation, integrating progressive confidence regularization and a dual-stream temporal consistency strategy to achieve state-of-the-art vessel segmentation in X-ray coronary angiography videos with minimal annotation.
UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC: UNIStainNet is proposed as the first method to inject dense spatial tokens from the frozen pathology foundation model UNI directly into a generator as SPADE modulation signals. Combined with misalignment-aware losses and learnable stain embeddings, a single unified model simultaneously generates four IHC stains (HER2/Ki67/ER/PR), achieving state-of-the-art distributional metrics on the MIST and BCI benchmarks.
Unleashing Video Language Models for Fine-grained HRCT Report Generation: This paper proposes AbSteering, a two-stage framework that adapts general-purpose VideoLMs to HRCT report generation via abnormality-centric Chain-of-Thought reasoning and DPO-based hard-negative contrastive learning, substantially outperforming specialized CT foundation models on clinical efficacy metrics.
Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis: This paper proposes the first federated learning framework for children's autism behavior recognition. Through a two-tier privacy strategy—3D skeleton abstraction (identity removal) combined with federated optimization (data never leaves the site)—the proposed approach achieves 87.80% accuracy on the MMASD dataset using the APFL personalized federated method, surpassing local training by 5.2% while satisfying HIPAA/GDPR compliance requirements.
Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework: This paper restructures per-instrument prompt parameters from isolated, independent prompts into a tree-structured hierarchy that progressively decomposes shared knowledge across layers. This design enables new instruments to inherit prior knowledge for rapid learning, while allowing new knowledge to gently revise existing representations, thereby simultaneously improving performance on new, regular, and old classes in surgical instrument class-incremental segmentation.
Unsupervised Domain Adaptation with Target-Only Margin Disparity Discrepancy: This paper addresses unsupervised domain adaptation (UDA) for CT→CBCT liver segmentation. It identifies a contradictory term in the classical MDD objective—where the feature extractor is optimized to maximize the discrepancy between \(f\) and \(f'\) on the source domain—and proposes Target-Only MDD, which removes this contradiction and minimizes prediction discrepancy on both domains. The method achieves state-of-the-art UDA performance in both 2D and 3D experiments.
Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code: This paper proposes CodeBrain, which reformulates the any-to-any brain MRI modality imputation problem as a region-level full-stack quantised code prediction task. Through a two-stage pipeline (scalar quantisation reconstruction + grading-loss code prediction), it achieves unified missing modality synthesis and outperforms five state-of-the-art methods.
CodeBrain: Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code: CodeBrain reformulates any-to-any brain MRI modality imputation as a region-level full-stack quantised code prediction problem. Stage I encodes complete MRI sets into compact code maps and modality-agnostic common features via Finite Scalar Quantisation (FSQ); Stage II predicts code maps from incomplete modalities using a grading loss to preserve the smoothness of the quantisation space. CodeBrain surpasses five SOTA methods on IXI and BraTS 2023, and the synthesised modalities achieve brain tumour segmentation performance approaching that of real data.
VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer: This paper revisits the necessity of the text branch in zero-shot anomaly detection (ZSAD) and proposes VisualAD, a purely vision-based framework. Two learnable tokens (anomaly/normal) are inserted into a frozen ViT, enhanced by Spatial-Aware Cross-Attention (SCA) and a Self-Alignment Function (SAF). Without a text encoder, VisualAD achieves state-of-the-art performance across 13 industrial and medical benchmarks.
Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation: This paper proposes a weakly supervised teacher-student framework that leverages sparse pathological annotations and an EMA-stabilized teacher network to generate progressively refined pseudo-masks. Through confidence filtering, adaptive fusion, and curriculum-guided refinement, the framework achieves efficient segmentation of glandular structures in colorectal cancer pathology images.
X-WIN: Building Chest Radiograph World Model via Predictive Sensing: X-WIN is a chest radiograph world model that, for the first time, incorporates 3D CT spatial knowledge into CXR representation learning. By learning to predict 2D projections of CT volumes at varying rotation angles, the model internalizes 3D anatomical structure. Combined with affinity-guided contrastive alignment and structure-preserving domain adaptation, X-WIN achieves state-of-the-art linear probing performance across 6 CXR benchmarks.
XSeg: A Large-scale X-ray Contraband Segmentation Benchmark for Real-World Security Screening: This paper introduces XSeg, the largest X-ray contraband segmentation dataset to date (98,644 images, 295,932 instance masks, 30 fine-grained categories), and proposes APSAM, a domain-specialized model that leverages the physical dual-energy properties of X-ray imaging via an Energy-Aware Encoder (EAE) and an Adaptive Point Generator (APG) to intelligently expand user click prompts. APSAM achieves 72.83% mIoU, surpassing SAM fine-tuning by 4.96%.
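
Several notes in this section report overlap metrics such as Dice (T-Gated Adapter) and mIoU (XSeg). As a reminder of what those numbers measure, here is a minimal sketch for a single pair of binary masks. The function name `dice_and_iou`, the flat 0/1 list representation, and the empty-mask convention are illustrative assumptions, not taken from any of the papers, whose evaluation protocols (per-class averaging, per-volume aggregation) may differ.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as flat 0/1 sequences.

    Hypothetical helper for illustration only; not from any listed paper.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Count pixels that are foreground in both masks, and in each mask alone.
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter

    if union == 0:
        # Both masks are empty: treated here as perfect agreement
        # (an assumed convention; implementations differ on this case).
        return 1.0, 1.0

    dice = 2 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


# Toy example: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single mask pair the two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so Dice is never lower than IoU for the same prediction. This is worth keeping in mind when comparing a Dice score in one note with an mIoU score in another.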

🚗 Autonomous Driving

A Prediction-as-Perception Framework for 3D Object Detection: Inspired by the brain's predictive perception mechanism, this paper proposes the PAP framework, which injects trajectory prediction outputs from previous frames as queries into the current frame's perception module, achieving a 10% improvement in tracking accuracy and a 15% speedup in inference on UniAD.
A Prediction-as-Perception Framework for 3D Object Detection: Inspired by the human cognitive pattern of "anticipating target locations before focusing attention," this work converts trajectory predictions from the previous frame into detection queries for the current frame, forming an iterative prediction-perception closed loop. Applied to UniAD, the framework achieves simultaneous improvements of +10% in tracking accuracy and +15% in inference speed.
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception: This paper proposes AdaRadar, an online adaptive radar data compression framework based on DCT spectral pruning and zeroth-order surrogate gradients. It achieves over 100× compression with only ~1 percentage point degradation in detection/segmentation performance, effectively alleviating the bandwidth bottleneck between radar sensors and computing units.
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving: This paper proposes ADMesh (a library of 15K+ high-quality 3D models) and CarlaOcc (a panoptic occupancy dataset with 100K frames at 0.05m resolution), providing for the first time instance-level annotations and physically consistent ground truth for 3D panoptic occupancy prediction in autonomous driving, along with occupancy quality evaluation metrics and a systematic benchmark.
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images: This paper proposes BEV-SLD, a self-supervised scene landmark detection (SLD)-based method for LiDAR global localization. By decoupling detection from correspondence prediction, the approach achieves high-accuracy \((x, y, \text{azimuth})\) pose estimation across diverse environments using only 20 MB of storage.
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds: This paper proposes BuildAnyPoint, which employs a loosely-coupled cascaded diffusion Transformer (Loca-DiT) to achieve unified reconstruction from diverse point cloud distributions (airborne LiDAR, SfM, sparse noisy point clouds) to structured 3D building meshes — first recovering the underlying point cloud distribution via hierarchical latent diffusion, then generating compact polygonal meshes via an autoregressive Transformer.
C2T: LLM-Aligned Common-Sense Reward Learning for Traffic-Vehicle Coordination: This paper proposes the C2T framework, which converts traffic states into structured captions, leverages LLMs for offline preference judgments, and distills these judgments into an intrinsic reward function. This approach replaces hand-crafted rewards for traffic signal control (TSC) and achieves improvements in efficiency, safety, and energy consumption across multiple real-world urban networks on the CityFlow benchmark.
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention: CausalVAD parameterizes Pearl's backdoor adjustment as a plug-and-play module (SCIS) that performs multi-level causal intervention across the perception–prediction–planning pipeline of the VAD architecture, eliminating spurious correlations to achieve safer, more robust end-to-end autonomous driving.
CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection: To address the modality imbalance problem in dual-branch multi-modal 3D detectors under domain shift, this paper proposes the CCF framework, which systematically improves camera query utilization and cross-domain robustness through three components: Query Decoupled Loss, LiDAR-Guided Depth Prior, and Complementary Cross-Modal Masking.
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data: This paper proposes ClimaDrive, a data generation framework, and ClimaOoD, a benchmark dataset. By combining semantically guided multi-weather scene generation with perspective-aware anomaly object placement, the framework constructs a 10K+ training set covering 6 weather conditions × 93 anomaly categories. Training on this dataset yields an average AP improvement of 3.25% across four state-of-the-art methods.
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection: CoIn3D is a framework that explicitly models spatial prior discrepancies arising from camera intrinsics, extrinsics, and array layouts via two modules, Spatial-aware Feature Modulation (SFM) and Camera-aware Data Augmentation (CDA), enabling multi-camera 3D detection models to generalize from source configurations to unseen target configurations. The framework is plug-and-play compatible with three mainstream paradigms: BEVDepth, BEVFormer, and PETR.
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving: ColaVLA proposes a unified vision-language-action (VLA) framework that transfers VLM reasoning from textual chain-of-thought to latent space. Through a Cognitive Latent Reasoner and a Hierarchical Parallel Planner, the framework completes scene understanding and trajectory decoding with only two VLM forward passes, achieving state-of-the-art performance on both nuScenes open-loop and closed-loop benchmarks.
CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion: CoLC proposes a communication-efficient early fusion framework for collaborative perception. It reduces transmission volume via Foreground-Aware Point Sampling (FAPS), reconstructs dense pillar representations on the ego side through VQ-based LiDAR completion (CEEF), and ensures semantic and geometric consistency via Dense-Guided Dual Alignment (DGDA). The framework achieves detection performance on par with or superior to full early fusion while significantly reducing communication bandwidth.
Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation: This paper proposes CompoSIA, a compositional driving video simulator that injects three control factors — scene structure, object identity, and ego-vehicle action — through independent pathways into a Flow Matching DiT. It supports both individual and compositional editing, enabling systematic adversarial scenario synthesis. CompoSIA achieves a 17% FVD improvement on identity editing, 30%/47% reduction in rotation/translation error for action control, and an average 173% increase in collision rate for downstream planners.
CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation: This paper proposes CompoSIA, a framework that achieves composable adversarial driving scene generation via disentangled control over three factors — Structure, Identity, and Action — built upon a video diffusion model. The approach reduces FVD for identity editing by 17% and increases the collision rate of downstream planners by 173%, effectively exposing hidden failure modes in autonomous driving systems.
CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation: This paper proposes CycleBEV, a training-time regularization framework that introduces an Inverse View Transformation (IVT) network to map BEV segmentation maps back to perspective-view (PV) segmentation maps. The framework enhances existing BEV semantic segmentation models via a cycle consistency loss, a height-aware geometric regularization objective, and a cross-view latent space alignment objective, with zero additional inference overhead.
Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction: Den-TP is a data-centric framework that addresses the long-tail density imbalance in trajectory prediction datasets through density-aware data curation and evaluation protocols. Using only 50% of the training data, it maintains overall performance while significantly improving robustness in high-density scenarios.
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving: This paper proposes DLWM, a two-stage Gaussian-centric self-supervised pre-training paradigm. Stage 1 learns 3D Gaussian representations by reconstructing depth and semantic maps. Stage 2 trains dual latent world models — a Gaussian-flow-guided temporal prediction model (for occupancy perception/prediction) and an ego-planning-guided temporal prediction model (for motion planning) — achieving significant performance gains across all three core tasks.
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving: This paper proposes DMW (Drive My Way), a personalized VLA driving framework that learns long-term driving habits via user embeddings and adapts to short-term preferences through natural language instructions. Personalized driving behavior is generated using GRPO-based reinforcement fine-tuning and style-aware rewards.
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance: This paper introduces the first 360° panoramic driver attention dataset (~1M frames / 19 drivers) and proposes DriverGaze360-Net, which jointly learns attention maps and attended objects via an auxiliary semantic segmentation head, achieving state-of-the-art attention prediction performance on panoramic driving images.
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving: Dr.Occ proposes a unified 3D occupancy prediction framework with depth guidance and region guidance. It employs D2-VFormer to leverage high-quality depth priors from MoGe-2 for accurate 2D→3D geometric mapping, and R/R2-EFormer to adaptively assign region-specific experts inspired by MoE/MoR for handling spatial semantic anisotropy, achieving a +7.43% mIoU improvement over the BEVDet4D baseline.
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving: This paper proposes Dr.Occ, a unified camera-only 3D occupancy prediction framework. It introduces a Depth-guided Dual-projection View Former (D2-VFormer) that leverages high-quality depth priors from MoGe-2 for accurate geometric alignment, and a Region-guided Expert Transformer (R-EFormer / R2-EFormer) that adaptively assigns spatial region experts to address semantic imbalance. Dr.Occ improves the BEVDet4D baseline by 7.43% mIoU on Occ3D-nuScenes.
Efficient Equivariant Transformer for Self-Driving Agent Modeling: This paper proposes DriveGATr, an equivariant Transformer architecture based on 2D Projective Geometric Algebra (PGA) that achieves SE(2)-equivariance without explicit pairwise relative position encoding (RPE), attaining state-of-the-art performance on traffic simulation tasks while substantially reducing computational cost.
EMDUL: Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets: This paper proposes EMDUL, a pipeline that expands mmWave HPE datasets in scale and diversity by (1) annotating unlabeled mmWave data via pseudo-labels with a novel unsupervised temporal consistency loss (UTCL), and (2) converting LiDAR datasets to mmWave point clouds through a closed-form converter with flow-based point filtering (FPF). The approach reduces in-domain error by 15.1% and cross-domain error by 18.9%.
F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling: This paper proposes F3DGS, the first method to apply a federated learning framework to 3DGS, enabling decentralized multi-agent 3D reconstruction through frozen geometry and visibility-aware aggregation without sharing raw data.
Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them: This paper systematically defines and quantifies two failure modes of deep learning-based online mapping models — localization overfitting and map geometry overfitting — proposes a Fréchet distance-based performance metric and a minimum spanning tree (MST)-based training set sparsification strategy, and validates on nuScenes and Argoverse 2 that geometrically diverse and balanced training sets improve model generalization.
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts: FedBPrompt introduces learnable visual prompts partitioned into body part alignment prompts (with constrained local attention to handle viewpoint misalignment) and holistic full-body prompts (to suppress background interference), coupled with a prompt-only federated fine-tuning strategy that transmits only prompt parameters (~0.46M vs. ~86M for the full model), achieving consistent improvements on FedDG-ReID benchmarks.
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts: This paper proposes FedBPrompt, a framework that introduces a Body Distribution Aware Visual Prompts Mechanism (BAPM) dividing prompts into Body Part Alignment Prompts and Holistic Full Body Prompts, paired with a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and trains only lightweight prompt parameters (reducing communication to ~1%), achieving average mAP gains of 3.3% and Rank-1 gains of 4.9% on FedDG-ReID benchmarks.
FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision: FlashCap is proposed as the first motion capture system combining flashing LEDs with event cameras, where each LED is assigned a unique flashing frequency for identity recognition. The system enables the construction of FlashMotion, the first human motion dataset with 1000Hz annotation precision (7.15 million frames), and introduces the ResPose baseline, reducing motion timing error from ~50ms to ~5ms and lowering pose estimation MPJPE by approximately 40%.
FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration: FoSS proposes a frequency-domain–time-domain dual-branch framework that organizes Fourier spectra via progressive spiral reordering (HelixSort) before feeding them into a selective state space model (SSM), and combines a temporal dynamic SSM with cross-attention fusion to achieve state-of-the-art trajectory prediction on Argoverse 1/2 while reducing parameter count by over 40% and inference latency by 22%.
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction: GPOcc proposes leveraging generalizable visual geometry priors (e.g., VGGT, DepthAnything) for monocular 3D occupancy prediction. Surface points predicted by these priors are extended inward along camera rays to generate volumetric samples, which serve as centers of sparse Gaussian primitives for probabilistic occupancy inference. A training-free incremental update strategy handles streaming input. On Occ-ScanNet, GPOcc surpasses the previous SOTA by +9.99 mIoU (monocular) and +11.79 mIoU (streaming), while running 2.65× faster under the same depth prior.
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal: Ghost-FWL introduces the first large-scale mobile full-waveform LiDAR dataset (24K frames, 7.5 billion peak-level annotations) and proposes FWL-MAE, a self-supervised pretraining framework for ghost detection and removal, reducing SLAM trajectory error by over 66% and cutting 3D detection false positive rates by 50×.
HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation: To address the severe scarcity of adverse-weather samples in lane detection datasets (CULane/TuSimple), this paper proposes HG-Lane — a two-stage diffusion-based generation framework requiring no re-annotation. Stage-I employs Control Information Fusion and Structure-aware Reverse Diffusion to preserve lane geometry, while Stage-II applies Appearance-aware Refinement to adjust illumination style. The framework generates 30K images across snow/rain/fog/night/dusk conditions. CLRNet achieves an overall mF1 improvement of +20.87%, with +38.8% in snow scenarios.
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles: HorizonForge proposes a unified framework that reconstructs driving scenes as editable Gaussian Splats combined with Mesh representations, enabling fine-grained 3D manipulation via trajectory control and language-driven vehicle insertion. A video diffusion model then renders spatiotemporally consistent, high-quality driving videos. The method achieves a user preference rate of 91.02%, decisively outperforming all baselines.
IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration: This paper proposes the IGASA framework, which employs a three-stage pipeline consisting of a Hierarchical Pyramid Architecture (HPA), Hierarchical Cross-Layer Attention (HCLA), and Iterative Geometry-Aware Refinement (IGAR) to bridge the semantic gap across multi-scale features and dynamically suppress outliers, achieving state-of-the-art performance on four benchmarks: 3DMatch, 3DLoMatch, KITTI, and nuScenes.
IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration: IGASA is a point cloud registration framework that combines a Hierarchical Pyramid Architecture (HPA), Hierarchical Cross-Layer Attention (HCLA) with skip-attention fusion, and Iterative Geometry-Aware Refinement (IGAR) with dynamic consistency weighting. It achieves 94.6% Registration Recall (RR) on 3DMatch (SOTA) and 100% RR on KITTI, with a total inference time of only 2.763 s.
InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset: This paper presents InCaRPose, an in-cabin relative camera pose estimation model built upon a frozen ViT backbone and a Transformer decoder. Trained exclusively on synthetic data, it generalizes to real in-cabin environments, achieving metric-scale translation prediction and real-time inference (>45 FPS). The authors also release an accompanying real-world, high-distortion in-cabin test dataset, In-Cabin-Pose.
KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System: KnowVal is an end-to-end autonomous driving system that addresses two fundamental deficiencies, knowledge reasoning and value alignment, through three core components: (1) Retrieval-guided Open-world Perception, which integrates standard 3D detection, VL-SAMv2-based long-tail object recognition, and VLM-based scene understanding; (2) Perception-guided Knowledge Retrieval, which queries a driving knowledge graph covering traffic regulations, defensive driving, and ethical norms; and (3) a World Model for future state prediction combined with a human-preference-trained Value Model for trajectory evaluation. The system achieves the lowest collision rate on nuScenes and state-of-the-art performance on Bench2Drive and NAVSIM.
Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens: This work extends the LeJEPA self-supervised framework to a multi-modal setting by introducing learnable fusion tokens as a Perceiver-style latent bottleneck within a shared Transformer, enabling efficient fusion of RGB with companion modalities (LiDAR depth / thermal infrared). A pruning strategy reduces attention overhead by approximately 9×. On Waymo, CenterNet 3D detection mAP XY reaches 23.6 (+4.3 over RGB-only LeJEPA) and Depth MAE improves from 4.704 to 2.860.
LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization: LEADER achieves 24.1% and 73.9% relative reductions in position error on LiDAR relocalization benchmarks via a robust projection-based geometric encoder (yaw-invariant) and a truncated relative reliability loss (suppressing unreliable points).
Learnability-Driven Submodular Optimization for Active Roadside 3D Detection: This paper proposes LH3D, an active learning framework that employs a three-stage hierarchical submodular optimization strategy—depth confidence → semantic balancing → geometric diversity—to suppress the selection of inherently ambiguous samples in roadside monocular 3D detection. With only 20% of the annotation budget, LH3D significantly outperforms conventional uncertainty- and diversity-based AL methods.
Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization: This paper uses panoramic depth and reflectance images derived from 3D LiDAR point clouds as CNN inputs, constructs a large-scale outdoor scene categorization dataset (MPO), and proposes two architectural improvements—Horizontal Circular Convolution (HCC) and Row-Wise Max Pooling (RWMP)—to achieve high-accuracy classification (up to 97.87%) across six outdoor scene categories, substantially outperforming traditional handcrafted feature methods.
Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization: This paper proposes a method for outdoor scene categorization using LiDAR panoramic depth maps and reflectance maps as CNN inputs. The authors construct the large-scale MPO outdoor 3D dataset (6 scene categories, 34,200 frames), and address the ring topology of panoramic images via Horizontal Circular Convolution (HCC) and Row-Wise Max Pooling (RWMP). The proposed multimodal fusion approach achieves 97.47% classification accuracy.
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception: This paper proposes MVIG, an adversarial attack framework that unifies the vulnerability modeling of diverse defense-equipped collaborative perception systems into a Mutual View Information Graph (MVIG). By combining temporal graph learning with entropy-aware vulnerability search, MVIG enables adaptive fabrication attacks that reduce the defense success rate by up to 62%.
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos: This paper proposes LFG (Learning to drive is a Free Gift), a fully label-free, teacher-guided pretraining framework for autonomous driving. LFG learns a unified pseudo-4D representation of geometry, semantics, and motion from large-scale unposed YouTube driving videos. On the NAVSIM benchmark, using only a monocular front-facing camera, LFG surpasses multi-camera + LiDAR BEV methods (PDMS 85.2), and demonstrates strong data efficiency (81.4 PDMS with only 10% labels).
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration: This paper proposes LiREC-Net, the first unified framework for simultaneously performing target-free extrinsic calibration between LiDAR–RGB and LiDAR–Event camera pairs. Through a shared LiDAR representation that fuses 3D point features with projected depth features, and pairwise cost volumes for cross-modal alignment, LiREC-Net achieves calibration accuracies of 1.80 cm/0.11° on KITTI, and 2.51 cm/0.14° (LiDAR–RGB) and 1.18 cm/0.07° (LiDAR–Event) on DSEC.
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection: This work identifies that feature misalignment in LiDAR-Camera fusion is concentrated at foreground-background depth discontinuity boundaries, and proposes three synergistic modules — PGDC (2D Prior-Guided Depth Calibration), DAGF (Discontinuity-Aware Geometric Fusion), and SGDM (Structural Guidance Depth Modulator) — to proactively correct misalignment prior to fusion, achieving state-of-the-art mAP of 71.5% and NDS of 73.6% on the nuScenes validation set.
LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction: LR-SGS proposes a structure-aware Salient Gaussian representation guided by LiDAR reflectance. By calibrating LiDAR intensity into an illumination-invariant reflectance channel appended to each Gaussian, initializing structured Salient Gaussians from geometric and reflectance feature points, and enforcing RGB–reflectance cross-modal gradient consistency, the method surpasses OmniRe by 1.18 dB PSNR on complex-lighting scenes of the Waymo dataset while using fewer Gaussians and shorter training time.
LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction: This paper proposes LR-SGS, which calibrates LiDAR intensity into an illumination-invariant reflectance channel attached to 3D Gaussians, and introduces a structure-aware Salient Gaussian representation (initialized from LiDAR geometry and reflectance feature points) with improved densification control and a salient transform strategy. LR-SGS achieves higher-fidelity reconstruction than OmniRe on complex Waymo autonomous driving scenes while using fewer Gaussians and requiring less training time.
M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs: M²-Occ addresses real-world scenarios where camera failures cause missing views by proposing MMR (reconstructing missing view representations in feature space using adjacent camera FoV overlaps) and FMM (refining ambiguous voxel features via a learnable semantic prototype memory bank). On the SurroundOcc baseline, it achieves +4.93% IoU when the rear camera is missing, maintains 18.36% IoU under five missing cameras (versus a baseline collapse to 13.35%), and does not compromise performance under complete-view inputs.
M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs: To address incomplete inputs caused by camera failures in autonomous driving, M²-Occ introduces a Multi-view Masked Reconstruction (MMR) module that exploits the overlapping fields of view between adjacent cameras to recover missing view features, and a Feature Memory Module (FMM) that refines voxel representations using class-level semantic prototypes. The framework achieves a 4.93% IoU gain when the rear camera is missing, without degrading full-view performance.
MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction: This paper proposes MapGCLR, a geospatial contrastive learning strategy that enforces consistent BEV feature representations for overlapping regions across different traversals. Operating within a semi-supervised framework, the method achieves 13%–42% relative performance gains on online vectorized HD map construction using only 5%–20% of labeled data.
MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction: MapGCLR proposes a semi-supervised training scheme based on geospatial contrastive learning: it exploits the geospatial overlap between BEV feature grids produced from multiple traversals of the same location, constructing an InfoNCE contrastive loss to enforce geographic consistency in the BEV feature space. On Argoverse 2, using only 5% labeled data, it achieves 18.9 mAP (vs. 13.3 for the fully supervised baseline), a relative improvement of 42%—roughly equivalent to doubling the amount of labeled data.
MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving: MeanFuser is an end-to-end autonomous driving framework that replaces discrete trajectory vocabulary with Gaussian mixture noise for continuous multi-modal trajectory modeling, leverages the MeanFlow Identity for error-free one-step sampling, and introduces an Adaptive Reconstruction Module (ARM) that implicitly decides between selecting an existing proposal and reconstructing a new trajectory. On NAVSIM, using only RGB input with a ResNet-34 backbone, it achieves 89.0 PDMS at 59 FPS.
MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating: This paper proposes MetaDAT, a framework that obtains an initialization amenable to online adaptation via meta pre-training, and further employs dynamic learning rate optimization (DLO) and hard-sample-driven updates (HSD) at test time to achieve trajectory prediction adaptation under cross-dataset distribution shifts. MetaDAT consistently outperforms existing test-time training (TTT) methods across diverse cross-domain configurations on nuScenes, Lyft, and Waymo.
MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating: This paper proposes the MetaDAT framework, which obtains a model initialization amenable to online adaptation via meta pre-training, and achieves data-adaptive model adjustment at test time through dynamic learning rate optimization (DLO) and hard-sample-driven updates (HSD). MetaDAT surpasses all existing TTT methods under cross-dataset distribution shift settings across nuScenes, Lyft, and Waymo.
Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks: This paper proposes the dCAP framework, which achieves real-time 6-DoF relative pose estimation between the tractor and trailer in articulated autonomous trucks via Transformer-based cross-view and temporal attention mechanisms. The framework is integrated into BEVFormer to improve 3D object detection performance under articulated motion, achieving a translation error of 0.452 m and a rotation error of 0.042 rad.
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving: This paper proposes MindDriver, a progressive multimodal reasoning framework that emulates the human "perception→imagination→action" cognitive process. The model first performs textual semantic understanding, then imagines future scene images (bridging semantic and physical spaces), and finally predicts trajectories. Combined with feedback-guided automatic data annotation and progressive reinforcement fine-tuning, MindDriver achieves state-of-the-art performance on both the nuScenes open-loop and Bench2Drive closed-loop benchmarks.
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes (LegoOcc): This paper proposes LegoOcc, which leverages Language-Embedded Gaussians (LE-Gaussians) as a unified geometric-semantic intermediate representation. Combined with a Poisson-process-based Gaussian-to-Occupancy (G2O) operator and a progressive temperature decay strategy, LegoOcc achieves monocular open-vocabulary occupancy prediction for indoor scenes using only binary occupancy labels (without semantic annotations), attaining 59.50 IoU / 21.05 mIoU on Occ-ScanNet.
Neural Distribution Prior for LiDAR Out-of-Distribution Detection: NDP introduces a learnable neural distribution prior module to model the distributional structure of network predictions. Combined with Perlin-noise-based pseudo-OOD sample generation and a soft anomaly exposure strategy, NDP achieves 61.31% AP on the STU benchmark, surpassing the previous best result by more than 10×.
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning: NoRD demonstrates that autonomous driving VLAs require neither large-scale reasoning annotations nor massive datasets. By identifying the root cause of GRPO failure on weak SFT policies as difficulty bias — wherein learning signals from high-variance rollout groups are suppressed — it replaces standard GRPO with Dr. GRPO for RL post-training. Using less than 60% of the data, no reasoning annotations, and 3× fewer tokens, NoRD achieves competitive performance against reasoning-based VLAs on NAVSIM (85.6 PDMS) and WaymoE2E (7.709 RFS).
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction: O3N is the first work to introduce the omnidirectional open-vocabulary occupancy prediction task and proposes a purely vision-based end-to-end framework. Polar-spiral Mamba (PsM) models panoramic geometric continuity via spiral scanning in polar coordinate space; Occupancy Cost Aggregation (OCA) constructs a voxel-text matching cost volume to avoid overfitting from direct feature alignment; Natural Modality Alignment (NMA) aligns pixel-voxel-text tri-modal embeddings through a gradient-free random walk. The method achieves 16.54 mIoU / 21.16 Novel mIoU on QuadOcc (SOTA), substantially outperforming the OVO baseline.
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction: O3N is the first purely vision-based, end-to-end omnidirectional open-vocabulary occupancy prediction framework. Through three core modules, Polar Spiral Mamba (PsM), Occupancy Cost Aggregation (OCA), and Natural Modality Alignment (NMA), it performs open-vocabulary 3D occupancy prediction from 360° panoramic images and surpasses closed-set supervised methods.
OccAny: Generalized Unconstrained Urban 3D Occupancy: OccAny proposes the first generalized unconstrained urban 3D occupancy prediction framework, capable of predicting metric-scale occupancy voxels from monocular, sequential, or surround-view images in calibration-free, out-of-domain scenes. Through two key designs—Segmentation Forcing and Novel View Rendering—it surpasses all visual geometry baselines on KITTI and nuScenes.
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective: OccuFly introduces the first real-world camera-based Semantic Scene Completion (SSC) benchmark from the aerial perspective, comprising 20,000+ samples across 21 semantic categories, spanning multi-season and multi-altitude urban, industrial, and rural scenes. It further reveals fundamental limitations of current visual foundation models in aerial settings.
On the Feasibility and Opportunity of Autoregressive 3D Object Detection: This paper proposes AutoReg3D, the first framework that formulates LiDAR 3D object detection as autoregressive sequence generation. By adopting a near-to-far ordering and parameter-specific vocabularies to discretize bounding boxes into token sequences, AutoReg3D achieves competitive performance against mainstream methods without anchors or NMS, while unlocking new capabilities such as RL fine-tuning and cascading refinement.
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera: This paper proposes OneOcc, a vision-only panoramic semantic occupancy prediction framework for legged/humanoid robots. Through dual-projection fusion, dual-grid voxelization, gait displacement compensation, and a hierarchical mixture-of-experts decoder, OneOcc achieves 360° semantic scene completion using only a single panoramic camera, surpassing LiDAR baselines on both real quadruped and simulated humanoid datasets.
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation: This paper proposes OVDG-SS, a new problem setting that unifies unseen-domain and unseen-category challenges in semantic segmentation, and introduces S2-Corr, a state space model-based module that repairs text-image correlation degradation caused by domain shift, enabling efficient and robust cross-domain open-vocabulary segmentation in autonomous driving scenarios.
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots: This paper introduces PanoMMOcc, the first panoramic multimodal (RGB + thermal + polarization + LiDAR) semantic occupancy dataset for quadruped robots, and proposes VoxelHound, a framework achieving robust 3D occupancy prediction via Vertical Jitter Compensation (VJC) and Multimodal Information Prompt Fusion (MIPF) modules, attaining 23.34% mIoU (+4.16%).
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots: This paper introduces PanoMMOcc, the first panoramic multimodal semantic occupancy prediction dataset for quadruped robots, along with the VoxelHound framework. By incorporating a Vertical Jitter Compensation (VJC) module and a Multimodal Information Prompt Fusion (MIPF) module, VoxelHound achieves 23.34% mIoU under a four-modality setup (panoramic RGB + thermal + polarization + LiDAR), surpassing existing methods by +4.16%.
Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule: This paper proposes the Perception Characteristics Distance (PCD), a novel metric that quantifies the maximum reliable detection range of a perception system by statistically modeling how the mean and variance of detection confidence evolve with distance. Given a detection quality threshold \(y^{thres}\) and a probability threshold \(p^{thres}\), PCD identifies the furthest distance at which reliability requirements are satisfied, addressing the inability of conventional static metrics such as AP and IoU to capture distance-dependent behavior and stochastic variation.
Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species: This paper introduces TPC-268, the first large-scale plant counting dataset integrating plant taxonomy, comprising 10,000 images, 678,050 point annotations, and 268 countable categories (covering 242 species), with complete Linnaean taxonomic hierarchy annotations, and provides comprehensive benchmarking under the class-agnostic counting (CAC) paradigm.
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors: This paper proposes Points-to-3D, which encodes visible-region point clouds into TRELLIS's sparse structure latent (SS latent) and completes unobserved regions via a mask-aware inpainting network. Combined with a two-stage sampling strategy of structure completion followed by boundary refinement, the method achieves geometry-controllable, high-fidelity 3D asset/scene generation, attaining an F-Score of 0.964 on Toys4K (0.998 for visible regions).
ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction: This paper proposes ProOOD, a framework that for the first time unifies long-tail recognition and out-of-distribution (OOD) detection in 3D occupancy prediction from a voxel prototype-guided perspective. Through prototype-guided semantic inpainting (PGSI), tail-class enhancement (PGTM), and the training-free EchoOOD scoring mechanism, it achieves +3.57% mIoU (tail classes +24.80%) on SemanticKITTI and +19.34 AuPRCr on VAA-KITTI for OOD detection.
PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency: This paper proposes PTC-Depth, a monocular depth estimation framework that combines optical flow triangulation with wheel odometry. It tracks the metric scale of a depth foundation model via recursive Bayesian updates, achieving temporally consistent metric depth prediction with strong generalization across KITTI, TartanAir, and thermal infrared datasets.
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection: This paper proposes R4Det, which systematically addresses three core challenges in 4D radar-camera fusion—inaccurate depth estimation, pose-free temporal fusion, and small object detection—through three plug-and-play BEV modules: Panoramic Depth Fusion (PDF), Deformable Gated Temporal Fusion (DGTF), and Instance-Guided Dynamic Refinement (IGDR). R4Det achieves 47.29% 3D mAP (+5.47%) on TJ4DRadSet and 66.69% mAP on VoD.
Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals: This paper proposes Rascene, an Integrated Sensing and Communication (ISAC) framework for high-fidelity 3D scene imaging using mmWave OFDM communication signals (5G/Wi-Fi). It achieves geometrically consistent recovery from sparse, multipath-corrupted RF observations via confidence-weighted multi-frame fusion.
Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction: This paper proposes the Progressive Retrospective Framework (PRF), which employs cascaded retrospective units to progressively align features from incomplete observations to those of complete observations, substantially improving variable-length trajectory prediction performance in a plug-and-play manner compatible with existing methods.
ReMoT: Reinforcement Learning with Motion Contrast Triplets: This paper proposes ReMoT, a unified training paradigm that automatically constructs a 16.5K motion contrast triplet dataset (ReMoT-16K) via a rule-driven multi-expert collaborative pipeline, and combines GRPO reinforcement learning with a composite reward (logical consistency + length regularization) to systematically address the fundamental deficiencies of VLMs in spatiotemporal consistency reasoning, achieving a 25.1% performance improvement.
RESBev: Making BEV Perception More Robust: This paper proposes RESBev, a plug-and-play robustness enhancement framework for BEV perception. It employs a latent-space world model to predict clean BEV semantic priors from historical frames, and an anomaly reconstructor that fuses these priors with corrupted current observations via cross-attention. On nuScenes, RESBev achieves an average improvement of 15–20 IoU points across four LSS-based models under 10 types of perturbations (including natural corruptions and adversarial attacks), and generalizes to corruption types unseen during training.
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes: This paper formally defines the temporally sparse 4D indoor semantic instance segmentation (4DSIS) task and proposes ReScene4D, which extends a 3D instance segmentation architecture to the 4D domain via three temporal information sharing strategies—spatio-temporal contrastive loss, spatio-temporal mask pooling, and spatio-temporal decoder serialization. The method achieves state-of-the-art performance on the 3RScan dataset and introduces a new t-mAP metric that jointly evaluates segmentation quality and temporal identity consistency.
SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors: This paper proposes SABER, the first non-invasive, spatially consistent universal adversarial object generation framework targeting BEV 3D detectors. By placing optimized 3D meshes in the scene, SABER disrupts multi-view multi-frame detection and reveals BEV models' over-reliance on learned environmental context priors.
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems: This paper proposes MOSAIC, a data selection framework for end-to-end autonomous driving models that clusters training data into domains, fits per-domain scaling laws over evaluation metrics, and greedily selects the samples with the highest marginal gain. Models trained on the selected subset match or surpass baseline performance with 80% less data.
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving: SearchAD introduces the first large-scale rare image retrieval dataset for autonomous driving, comprising 420K+ frames, 510K+ annotated bounding boxes, and 90 rare categories. It supports both text-to-image and image-to-image retrieval, and through comprehensive evaluation reveals the deficiencies of current multimodal retrieval models in retrieving rare objects.
SG-NLF: Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis: SG-NLF proposes a pose-free LiDAR NeRF framework that addresses the geometric hole problem arising from LiDAR sparsity via a spectral-geometric hybrid representation, achieves global pose optimization through a confidence-aware pose graph, and enforces cross-frame consistency via adversarial learning. On nuScenes, it outperforms the previous state of the art by 35.8% in reconstruction quality and 68.8% in pose accuracy.
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting: This paper proposes SHARP, a motion prediction framework based on short-window streaming inference. It explicitly maintains and updates agent latent representations across time steps via an instance-aware context streamer module, and employs a dual-objective training strategy to achieve state-of-the-art streaming performance on the Argoverse 2 multi-agent benchmark while maintaining minimal latency.
SimScale: Learning to Drive via Real-World Simulation at Scale: This paper proposes the SimScale framework, which generates large-scale, high-fidelity simulation data by applying trajectory perturbations to existing driving logs, simulating reactive environment responses, and synthesizing sensor observations via neural rendering. Combined with pseudo-expert trajectory supervision and a sim-real co-training strategy, SimScale achieves substantial gains on NAVSIM v2 (navhard +8.6 EPDMS), with performance scaling smoothly with the volume of simulation data.
Single Pixel Image Classification using an Ultrafast Digital Light Projector: An ultrafast microLED-on-CMOS digital light projector (330 kfps global shutter) is employed for single-pixel imaging. Twelve-by-twelve Hadamard patterns are projected onto MNIST digits, and a single-pixel photodetector acquires a time series of aggregated light intensities. Image reconstruction is entirely bypassed; an ELM or DNN directly classifies the time series. The system achieves greater than 90% multi-class accuracy and greater than 99% AUC for binary classification (anomaly detection) at 1.2 kfps.
Single Pixel Image Classification using an Ultrafast Digital Light Projector: This paper employs a microLED-on-CMOS digital light projector to realize ultrafast single-pixel imaging (SPI), and combines low-complexity machine learning models (ELM and DNN) to achieve >90% classification accuracy on MNIST handwritten digits at a frame rate of 1.2 kHz, entirely bypassing image reconstruction.
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model: This paper proposes SparseWorld-TC, a pure attention-based sparse occupancy world model that bypasses VAE discretization and BEV intermediate representations, directly predicting trajectory-conditioned multi-frame future occupancy end-to-end from raw image features, achieving substantial improvements over existing methods on nuScenes.
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion: This paper proposes VoxSAMNet, a monocular semantic scene completion (SSC) framework that explicitly models voxel sparsity and semantic imbalance. It employs a Dummy Shortcut to bypass empty voxels, and Foreground Dropout combined with a Text-Guided Image Filter (TGIF) to mitigate long-tail overfitting. VoxSAMNet achieves a state-of-the-art 18.19% mIoU on SemanticKITTI, surpassing existing monocular and stereo methods.
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis: This paper proposes SG-NLF, a framework that achieves pose-free LiDAR novel view synthesis via a hybrid spectral-geometric representation, combined with a confidence-aware pose graph and adversarial learning strategy. It significantly outperforms state-of-the-art methods on KITTI-360 and nuScenes (Chamfer Distance reduced by 35.8%, ATE reduced by 68.8%).
TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR: This paper proposes TerraSeg, the first self-supervised, domain-agnostic LiDAR ground segmentation model. By constructing the large-scale unified OmniLiDAR dataset (12 public benchmarks, 15 sensor types, ~22 million scans) and a novel PseudoLabeler self-supervised pseudo-label generation module, TerraSeg achieves state-of-the-art performance on nuScenes, SemanticKITTI, and Waymo without any human annotation.
TT-Occ: Test-Time 3D Occupancy Prediction: This paper proposes TT-Occ, a training-free test-time 3D occupancy prediction framework that integrates vision foundation models (VFMs) at inference time to incrementally construct, refine, and voxelize temporally-aware 3D Gaussians. TT-Occ surpasses all self-supervised methods requiring extensive training on both Occ3D-nuScenes and nuCraft benchmarks.
TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding: This paper proposes TopoMaskV3, which upgrades the mask-based road topology understanding paradigm from a 2D auxiliary module to a standalone 3D centerline predictor by introducing dense offset fields and dense height maps as additional prediction heads. The work also introduces, for the first time in road topology evaluation, a geographically non-overlapping split and a long-range benchmark, exposing performance inflation caused by geographic overlap in existing benchmarks. TopoMaskV3 achieves state-of-the-art 28.5 OLS on the geographically non-overlapping benchmark.
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation: This paper proposes Adaptive Weight Constraint (AWC) regularization, combining Shapley-value-based modality contribution assessment and Fisher Information Matrix (FIM) weighted parameter penalties, to address modality imbalance in multi-modal (RGB/LiDAR/mmWave/WiFi) 3D human pose estimation. Balanced optimization is achieved without introducing any additional learnable parameters.
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation: To address the modality imbalance problem in multi-modal 3D human pose estimation (3D HPE), this paper proposes a Shapley-value-based modality contribution evaluation algorithm and an Adaptive Weight Constraint (AWC) regularization method based on the Fisher information matrix. The approach achieves balanced optimization across modalities without introducing additional parameters, and comprehensively outperforms existing balancing methods on the MM-Fi dataset.
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation: This paper proposes a modality contribution assessment algorithm based on Shapley values and Pearson correlation coefficients, along with a Fisher Information Matrix (FIM)-guided Adaptive Weight Constraint (AWC) regularization method. The approach addresses modality imbalance in end-to-end fusion of four modalities (RGB/LiDAR/mmWave/WiFi), achieving a 2.71 mm reduction in MPJPE on the MM-Fi dataset without introducing additional learnable parameters.
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model: This paper proposes TTSG, a training-free modular framework that generates realistic traffic scenes directly from free-form natural language descriptions. It employs LLM-driven prompt analysis, road retrieval, agent planning, and a plan-aware road ranking algorithm, requiring no predefined routes or spawn points, and achieves a minimum average collision rate of 3.5% on SafeBench.
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model: This paper proposes TTSG, a modular framework that leverages LLMs to convert free-form text descriptions into executable traffic scenarios. Through prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm, TTSG generates diverse scenes and achieves a minimum average collision rate of 3.5% on SafeBench.
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences: This paper proposes U4D, the first uncertainty-aware 4D LiDAR world modeling framework. It adopts a "hard-first, easy-second" two-stage diffusion generation strategy that first reconstructs high-uncertainty regions and then conditionally completes the entire scene. A MoST module is designed to adaptively fuse spatio-temporal features for temporal consistency.
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation: This paper proposes VIRD, which constructs view-invariant representations via dual-axis transformation (polar transformation + context-enhanced positional attention) to achieve state-of-the-art cross-view pose estimation without orientation priors, reducing position and orientation errors on KITTI by 50.7% and 76.5%, respectively.
Learning Vision-Language-Action World Models for Autonomous Driving: VLA-World unifies the predictive imagination of world models with the reflective reasoning of VLA models in a single framework. By generating future frames and reasoning over them, the method improves trajectory planning, achieving state-of-the-art collision rates and FID scores.
WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation: This paper proposes WalkGPT, the first pixel-grounded large vision-language model for pedestrian accessibility navigation, which unifies conversational reasoning, segmentation masks, and depth estimation within a single architecture and is accompanied by the 41k-scale PAVE dataset.
x2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space: This paper proposes x2-Fusion, which constructs a unified Event Edge Space anchored on spatiotemporal edge signals from event cameras. Image, LiDAR, and event features are aligned into this homogeneous edge space, followed by reliability-aware adaptive fusion and cross-dimension contrastive learning to jointly estimate 2D optical flow and 3D scene flow, achieving state-of-the-art performance on both synthetic and real-world datasets.
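Several notes above rely on contrastive objectives; the MapGCLR note, for instance, mentions an InfoNCE loss over BEV features of the same location seen in different traversals. The following is a minimal NumPy sketch of a generic InfoNCE loss, not the paper's implementation. The function name `info_nce`, the `temperature` value, and the assumption that row i of both inputs describes the same geographic cell are all hypothetical choices made for illustration.

```python
import numpy as np


def info_nce(anchor, positive, temperature=0.07):
    """Generic InfoNCE loss (illustrative sketch, not MapGCLR's code).

    anchor, positive: (N, D) arrays. Row i of each is assumed to be a
    matching pair; every other row in `positive` acts as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)

    # (N, N) similarity matrix scaled by the temperature.
    logits = a @ p.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching pair sits on the diagonal; average its negative log-probability.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))
    noisy = feats + 0.01 * rng.normal(size=feats.shape)
    print("aligned pairs :", info_nce(feats, noisy))
    print("shuffled pairs:", info_nce(feats, noisy[::-1]))
```

In this toy setup the loss is small when row i of both inputs really matches and grows when the pairing is shuffled, which is the behaviour a consistency objective of this kind depends on.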

✂️ Segmentation

3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion: This paper proposes 3M-TI, a calibration-free multi-camera cross-modal diffusion framework that performs implicit alignment and fusion of uncalibrated RGB–thermal infrared image pairs via a Cross-modal Self-attention Module (CSM) in the VAE latent space. Combined with a misalignment augmentation strategy, the method achieves state-of-the-art performance on mobile thermal imaging super-resolution and significantly improves downstream object detection and semantic segmentation.
MEDISEG: A Medication Image Instance Segmentation Dataset for Preventing Adverse Drug Events: This work introduces MEDISEG, a medication image instance segmentation dataset (8,262 images, 32 pill classes, with real-world occlusion/overlap scenarios). YOLOv8/v9 achieve 99.5% mAP@0.5 on the 3-class subset and 80.1% on the 32-class subset. FsDet few-shot experiments demonstrate that MEDISEG pretraining significantly outperforms CURE in occluded scenarios (1-shot: 0.406 vs. 0.131).
MEDISEG: A Dataset of Medication Images with Instance Segmentation Masks for Preventing Adverse Drug Events: This paper introduces MEDISEG — a dataset of 8,262 real-world multi-pill scene images covering 32 pill types (including overlapping, occluded, and varying-illumination scenarios within dosette boxes), with instance segmentation annotations. YOLOv8/v9 achieve mAP@50 of 99.5% on the 3-Pills subset and 80.1% on the 32-Pills subset. Few-shot experiments demonstrate that MEDISEG as a base training set significantly outperforms the CURE dataset.
A Mixed Diet Makes DINO An Omnivorous Vision Encoder: This paper proposes an Omnivorous Vision Encoder that performs cross-modal alignment distillation training (RGB/Depth/Segmentation) on top of a frozen DINOv2 via lightweight adapters, enabling a single encoder to produce consistent embeddings across different visual modalities while preserving the original discriminative semantics.
AFRO: Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning: This paper proposes AFRO, a self-supervised 3D visual pretraining framework that infers latent actions via an Inverse Dynamics Model (IDM), predicts future features via a Diffusion Transformer Forward Dynamics Model (FDM), and enforces temporal symmetry through an inverse consistency constraint. Pretrained on the large-scale RH20T dataset, AFRO achieves an average success rate of 76.0% across 14 MetaWorld tasks (vs. DynaMo-3D 64.9%, PointMAE 63.9%) and attains state-of-the-art performance on 4 real-world tasks.
Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation: This paper proposes a lightweight dual-loss training framework for temporal action segmentation (TAS) that requires only one additional boundary output channel and two auxiliary losses—a boundary regression loss and a CDF segment shape regularization loss. The framework consistently improves F1 and Edit scores across three architectures (MS-TCN, C2F-TCN, and FACT), demonstrating that precise segmentation can be achieved through simple loss design rather than heavier architectural modifications.
Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation: This paper proposes DEO (Distillation for Earth Observation), a dual-teacher contrastive distillation framework that employs a multispectral self-distillation teacher to learn spectral representations and a frozen optical VFM teacher (DINOv3) to inject high-level semantic priors. The resulting single student network excels at both optical and multispectral remote sensing tasks, achieving state-of-the-art performance across semantic segmentation, change detection, and classification.
CA-LoRA: Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation: This paper proposes Concept-Aware LoRA (CA-LoRA), which automatically identifies weight layers in a T2I model that are sensitive to specific concepts (e.g., viewpoint, style) and applies LoRA fine-tuning exclusively to those layers. This selective adaptation achieves domain alignment while preserving the diverse generation capability of the pretrained model, enabling the synthesis of high-quality urban-scene segmentation datasets.
CLIP Is Shortsighted: Paying Attention Beyond the First Sentence: This paper reveals a systematic bias in CLIP-family models toward the summary sentence and early tokens in long-form text, and proposes DeBias-CLIP, which eliminates this bias via three text augmentation strategies — summary removal, sentence sub-sampling, and token padding — achieving state-of-the-art performance on both long- and short-text retrieval benchmarks without introducing any additional parameters.
DeBias-CLIP: CLIP Is Shortsighted — Paying Attention Beyond the First Sentence: The paper shows that CLIP and Long-CLIP suffer from a serious early-token bias and a first-sentence summary shortcut. DeBias-CLIP uses three simple augmentations — removing the summary sentence, sentence sub-sampling, and prefix-token padding — that introduce no extra parameters and reach SOTA on multiple long-text retrieval benchmarks.
Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper: This paper provides a systematic review of two major technical paradigms for brain glioma MRI segmentation and classification — traditional methods (thresholding, region growing, clustering, etc.) and deep learning methods (CNN-based architectures). Through a methodological taxonomy and performance comparison, the paper concludes that CNN architectures comprehensively outperform traditional techniques, while also noting that semi-automatic methods are preferred by radiologists in clinical settings due to their controllability.
Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging: A systematic review paper that comprehensively compares traditional methods (thresholding, region growing, fuzzy clustering, etc.) and deep learning methods (CNN, U-Net, SegNet, etc.) for brain glioma MRI segmentation and classification, concluding that CNN-based architectures consistently outperform traditional techniques in both accuracy and degree of automation.
Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness: This paper proposes CFT (Concept-Guided Fine-Tuning), which leverages LLM-generated class-level semantic concepts and zero-shot segmentation via GroundedSAM to obtain concept masks. ViTs are then fine-tuned by aligning AttnLRP relevance maps with concept regions. Using only 1,500 training images, CFT achieves substantial robustness improvements across 5 OOD benchmarks.
ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization: This paper proposes ConceptPrism, which introduces image-level residual tokens and cross-image repulsion losses to automatically disentangle shared target concepts from image-specific residual information in personalized T2I diffusion models, achieving state-of-the-art performance on DreamBench across all three metrics: CLIP-T, DINO, and CLIP-I.
CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation: This paper presents CrossEarth-SAR, the first billion-scale SAR vision foundation model, which integrates a physics-guided sparse MoE architecture with SAR physical descriptors. It achieves state-of-the-art performance on 20 out of 22 cross-domain semantic segmentation benchmarks, surpassing prior methods by over 10% mIoU in certain multi-gap scenarios.
CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation: This paper introduces CrossEarth-SAR, the first billion-scale SAR visual foundation model, which replaces the FFN in each Transformer block of a DINOv2 ViT backbone with a physics-guided sparse Mixture-of-Experts (MoE) layer. Routing is conditioned on three SAR physical descriptors—directional entropy, equivalent number of looks, and local roughness. The work also contributes a 200K-scale cross-domain pretraining dataset and a benchmark of 22 evaluation settings covering 8 types of domain shift. CrossEarth-SAR achieves state-of-the-art performance on 20 out of 22 cross-domain semantic segmentation benchmarks.
CTFS: Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels: This paper proposes CTFS, the first semi-supervised semantic segmentation framework specifically designed for forward-looking sonar (FLS) images. It introduces a multi-teacher collaboration mechanism (one general teacher + two sonar-specific teachers simulating acoustic shadow and energy attenuation, respectively), combined with multi-view pseudo-label reliability assessment (intra-teacher stability × inter-teacher consistency). With only 2% labeled data, CTFS achieves 62.32% mIoU, surpassing the state of the art by 5.08 percentage points.
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training: This paper proposes Data Warmup, a curriculum learning strategy that requires no modifications to the model or loss function. It schedules training images from easy to hard using a semantics-aware image complexity metric (foreground dominance × foreground typicality). On ImageNet 256×256, it yields improvements of up to +6.11 IS and −3.41 FID for the SiT family. Notably, the reversed curriculum (hard-to-easy) performs worse than the uniform baseline, demonstrating that ordering itself is the key mechanism.
DeDelayed: Deleting Remote Inference Delay via On-Device Correction: DeDelayed is an edge-cloud collaborative inference framework that combines a lightweight on-device image model with a latency-aware cloud-side temporal prediction video model. By training the network with temporally predictive objectives to compensate for communication delay, DeDelayed achieves gains of 6.4 mIoU over local-only inference and 9.8 mIoU over remote-only inference under 100 ms latency.
Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification: This paper proposes IFA-Net, which detects AI-generated forgeries from the perspective of "modeling what is real" rather than "learning what is fake." A frozen MAE reconstructs the input to produce residuals that expose regions deviating from the natural image manifold. A two-stage closed-loop pipeline—coarse detection → task-adaptive prior injection → residual amplification → refinement—iteratively amplifies manifold deviation, achieving state-of-the-art performance on both diffusion inpainting and traditional image tampering detection.
Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation: This paper proposes an open-vocabulary semantic segmentation method that bypasses the logits optimization process entirely. Based on the assumption that homogeneous regions exhibit consistent distributional discrepancies from their logits to a degenerate distribution, the method directly constructs segmentation maps via either the optimal transport path or the analytical solution of maximum transport velocity. The approach achieves state-of-the-art performance on 8 benchmarks without requiring training or model-specific modulation.
DSS: Discover, Segment, and Select for Zero-shot Camouflaged Object Segmentation: This paper proposes DSS, a three-stage progressive pipeline (Discover→Segment→Select) that achieves zero-shot, training-free camouflaged object segmentation by: discovering foreground regions via self-supervised visual encoders and Leiden clustering (FOD); generating candidate masks using SAM; and selecting the optimal mask through heuristic scoring combined with iterative pairwise MLLM comparison. The method demonstrates particularly strong performance in multi-instance camouflage scenarios.
DPAD: Discriminative Perception via Anchored Description for Reasoning Segmentation: In RL+GRPO training for reasoning segmentation (RS), geometric rewards cannot constrain whether the reasoning chain focuses on the target's unique attributes. To address this, the paper proposes DPAD, in which an MLLM generates a reasoning chain, a geometric localization, and an anchored description. A CLIP-based Discriminative Perception Reward scores the similarity between the description and the ROI/AOI, pushing the description to be more discriminative and thereby indirectly constraining the reasoning chain to focus on the target. On ReasonSeg, cIoU improves by 3.09% while reasoning chain length decreases by 42%.
DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime: This paper proposes DSFlash, a low-latency panoptic scene graph generation model that achieves real-time inference at 56 FPS on an RTX 3090 while maintaining state-of-the-art performance (mR@50=30.9), through a unified backbone, bidirectional relation prediction, and mask-guided dynamic pruning.
DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime: DSFlash combines a unified segmentation/relation backbone, a gated bidirectional relation head, and mask-based dynamic patch pruning to deliver SOTA panoptic scene graph generation on PSG at mR@50=30.9 with only 18 ms latency (56 FPS).
DSS: Discover, Segment, and Select - A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation: DSS is a three-stage zero-shot camouflaged object segmentation framework: (1) Discover candidate regions via DINOv2 feature clustering and part combination (FOD); (2) Segment using SAM; (3) Select the optimal mask via pairwise MLLM comparison (SMS). Requiring no training, DSS achieves comprehensive improvements over prior zero-shot methods on four COD benchmarks, with particularly pronounced advantages in multi-instance scenarios.
Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance: This paper proposes an efficient RGB-D multi-task scene understanding network. An improved fusion encoder exploits channel redundancy to accelerate feature extraction. A Normalization-Focused Channel Layer (NFCL) and a Context Feature Interaction Layer (CFIL) provide cross-dimensional feature guidance. A batch-level multi-task adaptive loss function dynamically adjusts per-task learning weights. The unified framework simultaneously handles five tasks—semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification—on NYUv2, SUN RGB-D, and Cityscapes, achieving advantages in both accuracy and speed.
Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance: This paper proposes an efficient RGB-D multi-task scene understanding network. A partial-channel convolution fusion encoder reduces FLOPs to 1/16 of standard convolution. A Normalized Focus Channel Layer (NFCL) and a Context Feature Interaction Layer (CFIL) enable cross-dimensional feature guidance. A batch-level multi-task adaptive loss dynamically balances five tasks. The method achieves 49.82 mIoU on NYUv2 at 20.33 FPS, which is 24% faster than EMSAFormer.
ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark: ELVIS is the first low-light video instance segmentation (VIS) framework. It comprises a physics-driven synthetic low-light video pipeline (with motion blur modeling), a calibration-free degradation parameter estimation network (VDP-Net), and an enhancement decoder integrated into the VIS architecture for degradation-content decoupling. It achieves gains of +3.7 AP and +2.8 AP on synthetic and real low-light videos, respectively.
EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection: This paper proposes EReCu, a unified unsupervised camouflaged object detection framework consisting of three synergistic modules — Multi-cue Native Perception (MNP), Pseudo-label Evolution Fusion (PEF), and Local Pseudo-label Refinement (LPR) — achieving boundary-accurate and detail-rich camouflaged object segmentation without any manual annotations.
EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection: EReCu is a unified framework built upon a DINO teacher-student architecture that employs Multi-cue Native Perception (MNP) to extract texture and semantic priors from raw images, guiding Pseudo-label Evolution Fusion (PEF) for global pseudo-label evolution, and Local Pseudo-label Refinement (LPR) for boundary detail recovery. It is the first framework to unify the two dominant UCOD paradigms—pseudo-label guidance and feature learning—achieving state-of-the-art performance across four COD benchmarks.
FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning: This paper proposes FCL-COD, a framework that injects camouflaged scene knowledge into SAM via Frequency-aware Low-Rank Adaptation (FoRA), enhances foreground-background feature separation through Gradient-aware Contrastive Learning (GCL), and refines boundary-sensitive features with Multi-Scale Frequency Attention (MSFA). Under a weakly supervised setting using only bounding box annotations, FCL-COD surpasses fully supervised state-of-the-art methods.
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning: This paper proposes STaRC, a framework that leverages supervised frame-level saliency learning to jointly drive retrieval (saliency-guided segmentation and retrieval) and caption generation (saliency prompt injection into the decoder), achieving substantial improvements in temporal alignment and caption quality for dense video captioning (DVC).
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting: This paper presents FoV-Net, the first rotation-invariant framework for CAD B-rep learning that simultaneously captures local surface geometry and global structural context. By introducing a Local Reference Frame UV grid (LRF UV) and a Field-of-View (FoV) ray casting descriptor, FoV-Net achieves robust classification and segmentation under arbitrary \(\mathbf{SO}(3)\) rotations.
From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction: This work decouples two-hand reconstruction into 2D structural alignment (fusing keypoint, segmentation, and depth priors) and 3D spatial interaction alignment (a penetration-free diffusion model), achieving an MPJPE of 5.36 mm on InterHand2.6M and substantially outperforming the state of the art.
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation: This paper proposes Generalizable Knowledge Distillation (GKD), which transfers the cross-domain generalization capability of vision foundation models (VFMs) to lightweight student models through a multi-stage distillation scheme that decouples representation learning from task learning, along with a query-based soft distillation mechanism. GKD achieves an average improvement of +10.6% mIoU under the F2L setting.
GenMask: Adapting DiT for Segmentation via Direct Mask Generation: This paper proposes GenMask, which directly trains a DiT to generate binary segmentation masks, using the same model that generates color images. Having found that the VAE latent representations of binary masks are linearly separable, the authors design an extreme heavy-tailed timestep sampling strategy tailored for segmentation. This allows a single inference step to produce the segmentation result and achieves state-of-the-art performance on referring and reasoning segmentation benchmarks.
GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation: This paper proposes GeoGuide, a hierarchical geometric guidance framework for open-vocabulary 3D semantic segmentation. It leverages geometric priors from pretrained 3D models to correct geometric bias in 2D-to-3D knowledge distillation via three complementary modules: uncertainty-based superpoint distillation, instance-level mask reconstruction, and inter-instance relation consistency. GeoGuide achieves state-of-the-art performance of 64.8 mIoU on ScanNet v2.
GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth: GeomPrompt learns lightweight geometric prompt modules for frozen RGB-D segmentation models, synthesizing task-driven depth proxy signals from RGB (without depth supervision). It achieves gains of +6.1 mIoU under missing depth and up to +3.6 mIoU under degraded depth.
GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings: GeoSURGE introduces hierarchical geographic embeddings and a semantic fusion module, framing global image geo-localization as a matching problem between visual representations and learned geographic representations. The method achieves state-of-the-art performance on 22 out of 25 metrics across 5 benchmarks.
GKD: Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation: This paper proposes the GKD framework, which distills compact student models with cross-domain generalization capability from VFMs via a multi-stage decoupled distillation strategy (generic feature learning → frozen encoder → task head training) combined with a Query-based Soft Distillation (QSD) mechanism. GKD achieves an average mIoU gain of +10.6% under the F2L setting and +1.9% under the F2F setting.
Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions: This paper reformulates class-level curriculum learning in unsupervised domain adaptation as a sequential decision-making problem under the reinforcement learning framework. The proposed HeuSCM framework achieves autonomous curriculum scheduling via high-dimensional semantic state perception and category-fair policy gradients, attaining state-of-the-art performance (72.9 mIoU) on ACDC, Dark Zurich, and Nighttime Driving.
HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding: HippoMM maps three core hippocampal cognitive mechanisms—pattern separation (episodic segmentation), memory consolidation (semantic compression), and pattern completion (hierarchical retrieval)—into a computational architecture for episodic memory formation and cross-modal associative recall in long audiovisual streams. On the authors' proposed benchmark HippoVlog, the system achieves 78.2% accuracy while being 5× faster than retrieval-augmented baselines.
INSID3: Training-Free In-Context Segmentation with DINOv3: INSID3 is a training-free in-context segmentation method that relies exclusively on frozen DINOv3 features. Through a three-stage pipeline consisting of positional debiasing, fine-grained clustering, and seed cluster aggregation, it surpasses methods that depend on SAM or fine-tuning across semantic, part-level, and personalized segmentation tasks using a single self-supervised backbone, achieving an average mIoU improvement of +7.5%.
Kαlos finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks: This paper proposes the KαLOS meta-algorithm, which transforms the complex problem of spatial-categorical annotation agreement into a standard nominal reliability matrix via a "localize-then-classify" principle and data-driven parameter calibration, enabling unified evaluation of inter-annotator agreement (IAA) across diverse vision tasks including object detection, instance segmentation, and pose estimation.
Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction: This paper proposes CCMP, a cross-view object correspondence framework based on conditional binary segmentation. It leverages cycle-consistency constraints as a self-supervised signal and supports test-time training (TTT), achieving state-of-the-art performance of 44.57% mIoU on Ego-Exo4D.
LEMMA: Laplacian Pyramids for Efficient Marine Semantic Segmentation: This paper proposes LEMMA, a lightweight marine semantic segmentation model based on Laplacian pyramids, which replaces deep feature computation with pyramid-decomposed edge information. LEMMA achieves SOTA-level segmentation accuracy (98.97% mIoU on MaSTr1325) with a 71× reduction in parameter count.
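The idea that a Laplacian pyramid isolates edge information can be illustrated with a minimal 1D sketch. This is not LEMMA's implementation: the function names are hypothetical, the input is a plain list rather than an image, and a simple pair-averaging (box) filter stands in for the Gaussian smoothing used in the classic pyramid. The input length is assumed to be divisible by 2 to the power of `levels`.

```python
def downsample(signal):
    # Average adjacent pairs (box filter; an assumption made for brevity).
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def upsample(signal):
    # Repeat every sample twice to return to the finer resolution.
    return [value for value in signal for _ in range(2)]


def build_laplacian_pyramid(signal, levels):
    """Return (bands, coarsest): one residual band per level plus the final low-pass signal."""
    bands = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        predicted = upsample(coarse)
        # The residual is what the coarse level cannot explain, i.e. detail near edges.
        bands.append([c - p for c, p in zip(current, predicted)])
        current = coarse
    return bands, current


def reconstruct(bands, coarsest):
    # Invert the decomposition from coarse to fine by adding each residual back.
    current = coarsest
    for band in reversed(bands):
        current = [p + d for p, d in zip(upsample(current), band)]
    return current


signal = [0, 0, 0, 4, 4, 4, 4, 4]  # a step edge between index 2 and 3
bands, coarsest = build_laplacian_pyramid(signal, levels=2)
```

In this toy input the finest band is nonzero only around the step, while flat regions contribute zeros, and `reconstruct(bands, coarsest)` recovers the original signal exactly.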
Live Interactive Training for Video Segmentation: LIT (Live Interactive Training) is a framework that enables interactive visual systems (e.g., SAM2) to learn online from user corrections during inference. Its lightweight implementation, LIT-LoRA, generalizes user feedback to subsequent frames by updating LoRA modules in real time, reducing user corrections by 18–34% on challenging VOS benchmarks with a training overhead of only ~0.5 seconds per correction.
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment: This paper proposes LoD-Loc v3, which addresses two critical limitations of LoD-based UAV localization — poor cross-scene generalization and pose ambiguity in dense urban areas — by constructing a large-scale synthetic instance segmentation dataset (InsLoD-Loc, 100K images) and upgrading the localization paradigm from semantic to instance silhouette alignment. On the Tokyo-LoDv3 dense scene benchmark, the method achieves a ~2000% improvement in (2m, 2°) accuracy over the previous state of the art.
Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation: This paper proposes GLA-CLIP to address cross-window semantic inconsistency introduced by sliding-window inference in training-free open-vocabulary semantic segmentation. Three mechanisms—global key-value extension, proxy anchor attention, and dynamic normalization—are introduced to integrate global context across windows, achieving state-of-the-art average mIoU of 44.0% across 8 benchmarks.
Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning: This paper identifies a critical yet overlooked problem in visual in-context learning (VICL): existing prompt retrieval methods ignore label information, leading to label inconsistency. The proposed LaPR framework addresses this through joint image-label representation and a mixture-of-experts (MoE) mechanism, achieving label-aware prompt retrieval that consistently outperforms state-of-the-art methods on foreground segmentation, object detection, and image colorization tasks.
Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift: This paper systematically demonstrates that prompt engineering completely fails to bridge the domain gap of vision-language models in satellite remote sensing cloud segmentation, and that fine-tuning with as little as 0.1% of labeled data (~8 images) suffices to surpass all zero-shot prompting strategies.
Making Training-Free Diffusion Segmentors Scale with the Generative Power: This paper identifies the fundamental reasons why existing training-free diffusion segmentation methods fail to scale with the generative power of stronger models — namely, two gaps between cross-attention maps and semantic relevance (an aggregation gap and a score imbalance gap). It proposes two techniques, auto aggregation and per-pixel rescaling, forming the GoCA framework, which for the first time enables stronger diffusion models (SDXL, PixArt-Sigma, Flux) to significantly outperform weaker ones in training-free semantic segmentation.
Masked Representation Modeling for Domain-Adaptive Segmentation: This paper proposes Masked Representation Modeling (MRM), which performs masking and reconstruction in latent space rather than pixel space as a plug-and-play auxiliary task for UDA segmentation, yielding an average gain of +2.3 mIoU across 4 baselines on GTA→Cityscapes.
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator: This paper proposes a learned Matting Quality Evaluator (MQE) that assesses alpha quality at the pixel level without ground-truth supervision. MQE serves dual roles as an online training guide and an offline data filter, enabling the construction of VMReal — a real-world video matting dataset comprising 28K clips / 2.4M frames. Combined with a reference-frame training strategy, the proposed method significantly outperforms all existing approaches.
A Mixed Diet Makes DINO An Omnivorous Vision Encoder: This paper identifies severe cross-modal feature misalignment in pretrained vision encoders such as DINOv2 (across RGB, depth, and segmentation modalities), and proposes the Omnivorous framework, which trains lightweight adapters on the final few layers of a frozen backbone using an alignment loss, an anchoring loss, and modality mixup augmentation. The resulting encoder constructs a unified, modality-agnostic feature space that substantially outperforms baselines on cross-modal retrieval while maintaining or improving downstream task performance.
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention: By analyzing the implicit attention mechanism of Mamba, MixerCSeg decouples channels into global and local branches, which are enhanced by self-attention and a CNN respectively, and combines them with Direction-guided Edge Gated Convolution. It achieves state-of-the-art crack segmentation performance at only 2.05 GFLOPs and 2.54M parameters.
MPM: Mutual Pair Merging for Efficient Vision Transformers: This paper proposes Mutual Pair Merging (MPM), a parameter-free, training-free token merging module for ViTs that reduces sequence length via mutual nearest-neighbor pairing and mean fusion. On ADE20K, MPM achieves a 60% latency reduction on Raspberry Pi 5 for ViT-Tiny and a 20% throughput improvement on H100 with FlashAttention-2, while keeping mIoU degradation within 3%.
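A minimal sketch of mutual nearest-neighbor pairing followed by mean fusion, to make the merging step concrete. It is an illustration under assumptions, not the paper's implementation: cosine similarity as the matching criterion, the function names, and the plain-list token format are all hypothetical, and tokens are assumed to be nonzero vectors.

```python
import math


def cosine(a, b):
    # Cosine similarity between two nonzero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mutual_pair_merge(tokens):
    """Average every pair of tokens that are each other's nearest neighbor."""
    n = len(tokens)
    if n < 2:
        return [list(t) for t in tokens]
    # nearest[i] is the index of the token most similar to token i (excluding itself).
    nearest = [
        max((j for j in range(n) if j != i), key=lambda j: cosine(tokens[i], tokens[j]))
        for i in range(n)
    ]
    merged, used = [], [False] * n
    for i in range(n):
        if used[i]:
            continue
        j = nearest[i]
        if nearest[j] == i and not used[j]:
            # Mutual pair: fuse the two tokens by taking their mean.
            merged.append([(x + y) / 2 for x, y in zip(tokens[i], tokens[j])])
            used[j] = True
        else:
            # No mutual partner: keep the token unchanged.
            merged.append(list(tokens[i]))
        used[i] = True
    return merged


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = mutual_pair_merge(tokens)
```

Here tokens 0 and 1 pick each other, as do tokens 2 and 3, so the four-token sequence shrinks to two. Because the step has no learned weights, it is parameter-free in the same sense as the module described above.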
Masked Representation Modeling for Domain-Adaptive Segmentation: The paper proposes Masked Representation Modeling (MRM), which randomly masks and reconstructs features in the encoder's latent space and supervises the reconstruction with a pixel classification loss. As a plug-in auxiliary task it lifts four UDA baselines by an average of +2.3 / +2.8 mIoU on GTA→Cityscapes / Synthia→Cityscapes, with zero inference-time overhead.
Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models: This paper proposes OccSAM-Bench, a benchmark that systematically evaluates the occlusion robustness of SAM-family models in endoscopic scenes via synthetically generated surgical instrument occlusions. A three-region evaluation protocol is introduced to reveal two distinct behavioral patterns under occlusion: occlusion-aware and occlusion-agnostic.
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation: PCA-Seg proposes a Parallel Cost Aggregation (PCA) paradigm to replace the conventional serial spatial-categorical aggregation architecture. It efficiently integrates semantic and spatial context streams via an Expert-driven Perception Learning (EPL) module, and eliminates redundancy between the two knowledge streams through a Feature Orthogonal Decoupling (FOD) strategy. Each parallel block adds only 0.35M parameters while achieving state-of-the-art performance across 8 open-vocabulary semantic and part segmentation benchmarks.
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation: This paper revisits cost aggregation strategies and proposes PCA-Seg, a parallel architecture that replaces the conventional serial design. It integrates class-semantic and spatial-contextual information via an Expert-driven Perception Learning (EPL) module, and employs a Feature Orthogonalization Decoupling (FOD) strategy to reduce redundancy. PCA-Seg achieves state-of-the-art performance on 8 benchmarks with only 0.35M additional parameters per block.
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation: PCA-Seg revisits the cost aggregation mechanism in open-vocabulary semantic and part segmentation, proposing a parallel cost aggregation paradigm to replace existing serial architectures. It efficiently integrates semantic and contextual streams via an Expert-driven Perception Learning (EPL) module and reduces redundancy between the two knowledge streams through a Feature Orthogonal Decoupling (FOD) strategy. With only 0.35M additional parameters per parallel block, PCA-Seg achieves state-of-the-art performance across 8 benchmarks.
PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation: PEARL proposes a two-step inference framework based on Procrustes alignment and text-aware Laplacian propagation. Without introducing any additional training or auxiliary backbone networks, it corrects the geometric mismatch between keys and queries in the final self-attention layer of CLIP and leverages textual semantics to guide label propagation, achieving new state-of-the-art performance on training-free open-vocabulary semantic segmentation.
Phrase-Instance Alignment for Generalized Referring Segmentation: This paper proposes InstAlign, which reformulates Generalized Referring Expression Segmentation (GRES) as an instance-level reasoning problem. By introducing a Phrase-Object Alignment (POA) loss to establish fine-grained correspondences between linguistic phrases and visual instances, and employing a relevance-weighted aggregation mechanism to handle both multi-target and no-target scenarios in a unified manner, InstAlign achieves +3.22% cIoU and +12.25% N-acc improvements on gRefCOCO.
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation: This paper formally defines the UAV Reasoning Segmentation task, constructs the DRSeg benchmark comprising 10K high-resolution UAV images with chain-of-thought reasoning annotations, and proposes the dual-path pixel-level multimodal large language model PixDLM as a strong baseline.
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection: This paper proposes a pointer-based command sequence representation that explicitly incorporates B-Rep geometric entities (edges/faces) into autoregressive CAD generation, enabling chamfer/fillet operations in command sequence methods for the first time while substantially reducing topology errors caused by quantization errors.
Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains: This paper proposes SAM FTI-FDet, which transfers SAM's general segmentation capability to freight train fault detection via an automatic prompt generation module and an adaptive feature dispatcher. Using a TinyViT lightweight backbone, the method achieves 74.6 AP^box / 74.2 AP^mask, surpassing existing methods in both accuracy and efficiency.
Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains: This paper proposes SAM FTI-FDet, which introduces a Transformer decoder-based Prompt Generator that enables a lightweight TinyViT-SAM to automatically generate task-relevant query prompts, achieving instance-level fault detection of freight train components without manual interaction. The method attains 74.6 AP^box / 74.2 AP^mask on a self-constructed dataset.
PRUE: A Practical Recipe for Field Boundary Segmentation at Scale: This paper systematically evaluates 18 segmentation and geospatial foundation models (GFMs), and proposes PRUE—a field boundary segmentation recipe combining a U-Net backbone, composite loss function, and targeted data augmentation. PRUE achieves 76% IoU and 47% object-F1 on the FTW benchmark, surpassing the baseline by 6% and 9% respectively, while introducing a novel set of metrics for evaluating deployment robustness.
RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images: To address the challenge of large-scale variation in remote sensing images, this paper proposes RDNet, a region proportion-aware dynamic adaptive salient object detection network. RDNet uses a Proportion Guidance mechanism to dynamically select convolution kernel combinations of varying sizes, combined with wavelet frequency-domain interaction and a cross-attention localization module. The method achieves state-of-the-art performance across three ORSI-SOD benchmarks.
RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images: This paper proposes RDNet, which employs a region proportion-aware Proportion Guidance block to estimate the area ratio of salient objects and dynamically selects combinations of 3/4/5 convolutional kernels of varying sizes for detail extraction. Combined with wavelet-domain frequency-matched context enhancement (reducing computation to 1/4) and a cross-attention localization module, RDNet comprehensively outperforms 21 state-of-the-art methods on three optical remote sensing SOD benchmarks: EORSSD, ORSSD, and ORSI-4199.
RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation: This paper proposes the RealVLG framework, comprising the RealVLG-11B large-scale real-world multi-granularity annotated dataset and the RealVLG-R1 unified model fine-tuned via reinforcement learning. It is the first work to unify visual-language grounding (VLG) and robotic grasping under a single paradigm, enabling end-to-end prediction of bounding boxes, segmentation masks, grasp poses, and contact points from natural language instructions, while demonstrating zero-shot generalization capability.
Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics: This paper proposes the QVLM architecture and SQuID dataset, achieving pixel-level quantitative spatial reasoning on satellite imagery through a decoupled design of code generation and segmentation models. The approach overcomes the fundamental limitation of conventional VLMs, which lose spatial indexing due to patch embedding compression.
RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation: This paper proposes RecycleLoRA, which employs Rank-Revealing QR (RRQR) decomposition to systematically "recycle" subspace structures from pretrained Vision Foundation Model weights. By initializing a primary adapter from minor directions and a secondary adapter from major directions, the method substantially improves LoRA representational diversity and parameter utilization efficiency, achieving state-of-the-art performance on both synthetic-to-real and real-to-real domain generalized semantic segmentation benchmarks (average mIoU of 68.95 / 72.10).
REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion: This paper proposes REL, a three-channel depth representation based on cylindrical coordinates (Rectified Depth + EGVIA + LOA), and a Spherical Multi-Modal Fusion module (SMMF) for panoramic semantic segmentation. The approach achieves 63.06% average mIoU on Stanford2D3D (a 2.35% gain over the HHA baseline) and reduces performance variance under 3D perturbations by approximately 70%.
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video: This paper presents RobotSeg, the first foundation model supporting both image and video robot segmentation. Built upon SAM 2, it introduces a Structure-Enhanced Memory Associator (SEMA), a Robot Prompt Generator (RPG), and a label-efficient training strategy requiring only first-frame annotations. In automatic mode, it achieves 85.1 J&F on Whole Robot segmentation, surpassing the fine-tuned SAM 2.1 by 4.9 points, with only 41.3M parameters — far fewer than existing 638M+ solutions.
RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation: This paper proposes RS-SSM, which extracts channel-wise specific-information distribution features via frequency-domain analysis (CwAP) and adaptively inverts the forget gate matrix (FGIR) to recover spatiotemporal details lost during SSM state-space compression. It achieves state-of-the-art performance on four video semantic segmentation benchmarks while maintaining high efficiency.
RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection: This paper proposes RSONet, a two-stage RGB-T salient object detection network. In the region guidance stage, similarity scores between RGB/thermal guidance maps and a joint guidance map are computed to select the more reliable modality. In the saliency generation stage, a selective optimization (SO) module fuses dual-modality features based on the selection result, while Dense Detail Enhancement (DDE) and Mutual Interaction Semantic (MIS) modules extract detail and positional information, respectively, to produce high-quality saliency maps. RSONet achieves state-of-the-art performance on three RGB-T benchmarks.
RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection: RSONet is a two-stage RGB-T salient object detection framework that first generates region guidance maps via three parallel encoder-decoder branches and selects the dominant modality based on similarity, then fuses dual-modality features through a selective optimization module. It achieves MAE of 0.020/0.014/0.021 on VT5000/VT1000/VT821, outperforming 27 state-of-the-art methods.
SAP: Segment Any 4K Panorama: This paper proposes SAP (Segment Any 4K Panorama), which converts panoramic images into perspective pseudo-video sequences sampled along fixed spherical trajectories, addressing the structural mismatch of SAM2's streaming memory mechanism on 360° images. By synthesizing a 183K instance-annotated 4K panoramic dataset for fine-tuning, SAP achieves a zero-shot mIoU improvement of +17.2 on real-world panoramic benchmarks.
SARMAE: Masked Autoencoder for SAR Representation Learning: This paper proposes SARMAE, a framework for noise-robust SAR self-supervised pre-training built upon the million-scale SAR-1M dataset, speckle-aware representation enhancement (SARE), and semantic anchor representation constraint (SARC). SARMAE achieves state-of-the-art performance across multiple downstream tasks including classification, detection, and segmentation.
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation: This paper proposes SCOPE, a plug-and-play background-guided prototype enrichment framework that mines pseudo-instances from background regions of base-training scenes to build a prototype bank. At incremental stages, it enriches few-shot prototypes via retrieval and attention fusion. Without retraining the backbone or adding parameters, it raises novel-class IoU on ScanNet / S3DIS by up to 6.98% while keeping forgetting low.
SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection: SDDF introduces a new task of Open-Vocabulary Camouflaged Object Detection (OVCOD) and constructs the OVCOD-D benchmark. It removes redundant textual noise via a sub-description principal component contrastive fusion strategy, and enhances foreground-background discrimination through a specificity-guided regional weak alignment mechanism and a dynamic focusing module, achieving 56.4 AP under the open-set setting.
Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation: This paper proposes EDA-PSeg, a framework that introduces two core modules — a Graph Matching Adapter (GMA) and an Euler-Margin Attention (EMA) — to achieve, for the first time, open-set unsupervised domain adaptive semantic segmentation from pinhole to 360° panoramic images, simultaneously addressing geometric FoV distortion and unknown category discovery.
SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation: This paper proposes SemiTooth, a framework that addresses distribution discrepancies across multi-source CBCT data in semi-supervised tooth segmentation via a multi-teacher–multi-student architecture and a Stricter Weighted Confidence (SWC) constraint, achieving state-of-the-art performance on the newly constructed MS3Toothset dataset.
SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons: This paper proposes SemLayer, a generative-model-based pipeline that recovers semantically structured, layered representations from flattened vector icons. The approach reframes segmentation as a colorization task via a diffusion model, follows with semantic amodal completion of occluded regions, and applies integer linear programming (ILP) to determine layer ordering, achieving segmentation gains of +5.0 mIoU and +16.7 PQ.
SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data: This paper proposes the SGMA framework, which constructs global semantic prototypes via a Semantic-Guided Fusion (SGF) module for adaptive cross-modal fusion, and dynamically increases the training frequency of fragile modalities through a Modality-Aware Sampling (MAS) module. The framework addresses three core challenges in incomplete multimodal semantic segmentation for remote sensing: modality imbalance, large intra-class variance, and cross-modal heterogeneity.
SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data: This paper proposes SGMA, a Semantic-Guided Modality-Aware segmentation framework that employs Semantic-Guided Fusion (SGF) to reduce intra-class variance and reconcile cross-modal conflicts, and Modality-Aware Sampling (MAS) to balance training frequency for vulnerable modalities. On ISPRS, SGMA improves Average mIoU by 9.20% and Last-1 mIoU for weak modalities by 18.26% over the SOTA method IMLT.
SouPLe: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts: This paper proposes SouPLe (Sound-aware Prompt Learning), which replaces fixed text prompts in CLIP with learnable context tokens generated conditioned on image features, enhancing semantic correspondence between audio embedding tokens and visual features. SouPLe achieves +3.75 cIoU on VGG-SS and +6.32 cIoU in the open-set setting, surpassing all prior methods.
SPAR: Single-Pass Any-Resolution ViT for Open-Vocabulary Segmentation: This paper proposes SPAR, which distills the spatial reasoning capability of a fine-stride sliding window teacher into a single-pass student of identical architecture, transforming a ViT into a resolution-agnostic dense feature extractor. SPAR achieves +10.5 mIoU over the single-pass baseline in open-vocabulary segmentation while running 52× faster than the teacher.
Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation: This paper proposes the SERA framework, which introduces a two-stage lightweight MoE expert refinement mechanism — SERA-Adapter at the backbone level and SERA-Fusion at the fusion level — into a frozen vision-language backbone. Through expression-guided adaptive routing, SERA improves spatial consistency and boundary precision in referring image segmentation while updating fewer than 1% of backbone parameters.
Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation: This paper proposes the TODSynth framework, which achieves joint text-image-mask controlled remote sensing image synthesis via unified tri-modal attention in MM-DiT, and introduces Control-Rectify Flow Matching (CRFM), a novel sampling-stage method that dynamically adjusts the generation trajectory using semantic loss from a downstream segmentation model. The synthesized data improves mIoU by 4.14% on FUSU-4k and 2.08% on LoveDA.
The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation: This paper proposes GOLD, a framework for Continual Test-Time Adaptation (CTTA). The central finding is that the minimal feature update subspace—termed the "golden subspace"—coincides with the row space of the classifier weight matrix and is inherently low-rank. GOLD estimates this subspace online via the Average Gradient Outer Product (AGOP) and performs feature adaptation using a lightweight scaling vector, achieving state-of-the-art performance on classification and segmentation benchmarks with minimal computational overhead.
Towards Context-Aware Image Anonymization with Multi-Agent Reasoning: This paper proposes CAIAMAR, a multi-agent framework that combines high-confidence direct PII processing (pedestrians, license plates) using dedicated models with context-aware reasoning via large vision-language models (LVLMs). Through a PDCA iterative refinement loop, it detects indirect privacy identifiers and applies appearance-decorrelated inpainting via diffusion models. It reduces re-identification risk by 73% on CUHK03-NP while maintaining high image quality (FID 9.1) on Cityscapes.
Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels: This paper proposes Same Class Neighbor Penalization (SCNP), which replaces each pixel's logit with the worst prediction among its same-class neighbors during training, thereby forcing the model to prioritize correcting weakly classified pixels within local neighborhoods. This approach achieves significant improvements in topological accuracy at negligible cost (only 3 lines of code and a few milliseconds per iteration).
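The neighbor-penalization idea in the SCNP entry can be illustrated with a minimal pure-Python sketch for the binary case. This is not the authors' implementation: the function name `scnp_binary`, the square window controlled by `radius`, and the reading of "worst" (lowest foreground logit for foreground pixels, highest foreground logit for background pixels) are assumptions made for illustration only.

```python
def scnp_binary(logits, labels, radius=1):
    """Return penalized foreground logits for a binary segmentation map.

    logits: 2D list of foreground logits.
    labels: 2D list of 0/1 ground-truth labels.
    Assumption: the "worst" prediction is the lowest foreground logit for
    foreground pixels and the highest foreground logit for background pixels.
    """
    h, w = len(logits), len(logits[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            cls = labels[y][x]
            # Logits of window pixels (including the pixel itself) that share its class.
            same = [
                logits[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
                if labels[ny][nx] == cls
            ]
            out[y][x] = min(same) if cls == 1 else max(same)
    return out


logits = [[2.0, 0.5, -1.0],
          [3.0, 1.0, -2.0],
          [2.5, 2.0, -0.5]]
labels = [[1, 1, 0],
          [1, 1, 0],
          [1, 1, 0]]
penalized = scnp_binary(logits, labels)
```

In training, the loss would then be computed on `penalized` rather than on the raw logits, so a confidently classified pixel still receives gradient pressure from a weakly classified same-class neighbor.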
Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera: USF proposes a modular, lens-agnostic spherical vision frontend that projects arbitrarily calibrated camera images onto the unit sphere and performs spatial-domain spherical resampling, convolution, and pooling operations. Using only distance-weighted kernels, the framework inherently guarantees rotation equivariance, and demonstrates zero-shot generalization robustness to random rotations and cross-lens transfer on classification, detection, and segmentation tasks.
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance: This paper proposes UniMatch, a semantics-aware coarse-to-fine 3D shape matching framework. The coarse stage establishes part-level correspondences via category-agnostic 3D segmentation, MLLM-based part naming, and FG-CLIP language embeddings. The fine stage learns dense correspondences within an extended functional map framework using a Group-wise Ranking Contrastive (RnC) Loss, enabling universal matching across categories and non-isometric shapes.
UnrealPose: Leveraging Game Engine Kinematics for Large-Scale Synthetic Human Pose Data: This paper proposes UnrealPose-Gen, a synthetic human pose data generation pipeline built on Unreal Engine 5, which leverages native game engine skeletal kinematics—rather than SMPL—to produce UnrealPose-1M, a million-scale annotated dataset providing 3D joint positions, 2D keypoints, occlusion flags, instance segmentation masks, and camera parameters.
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model: This paper proposes VidEoMT, an encoder-only video segmentation model that unifies segmentation and temporal association within a single ViT encoder via query propagation and query fusion, eliminating all dedicated tracking modules. It achieves 160 FPS on YouTube-VIS 2019 (over 10× faster than CAVIS) with only a 0.3 AP drop.
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model: This paper proposes VidEoMT, an encoder-only video segmentation architecture that unifies segmentation and temporal association within a single ViT encoder via query propagation and query fusion, achieving 5×–10× speedup (160 FPS with ViT-L) while maintaining accuracy comparable to state-of-the-art methods.
VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation: VIRST proposes an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single vision-language model. Through Spatiotemporal Fusion (STF) and a Temporal Dynamic Anchor Updater (TDAU), the method achieves spatiotemporally consistent video segmentation, attaining J&F of 70.8 (+7.5 over SOTA) on ReVOS and 62.9 (+9.2) on MeViS, while achieving an inference speed of 5.1 FPS (1.3× faster than VRS-HQ).
Weakly-Supervised Referring Video Object Segmentation through Text Supervision: This paper proposes WSRVOS, the first weakly supervised referring video object segmentation framework that uses only text expressions as supervision signals. It achieves significant reduction in reliance on pixel-level annotations through MLLM-driven contrastive expression augmentation, bidirectional visual-language feature selection, instance-aware expression classification, and temporal segment ranking constraints.

📹 Video Understanding

A4VL: A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning: This paper proposes A4VL, a training-free multi-agent perception-action alliance framework in which multiple heterogeneous VLM agents perform iterative perception exploration (event-based segmentation + CLIP-guided clue alignment for keyframe localization) and action exploration (independent reasoning → cross-scoring → consensus/pruning). A4VL comprehensively outperforms 18 VLMs and 11 long-video-specialized methods across 5 VideoQA benchmarks, with significantly lower inference latency (74s vs. GPT-4o's 127s on MLVU).
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning: This paper proposes A4VL, a training-free multi-agent perception-action alliance framework that achieves state-of-the-art performance across five VideoQA benchmarks—surpassing 28 baseline methods—while significantly reducing inference latency, through event-driven video segmentation, clue-guided keyframe selection, and a multi-round agent negotiation-and-pruning mechanism.
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding: This paper proposes AdaSpark, which reduces FLOPs for long-video processing by up to 57% while maintaining performance, via 3D spatiotemporal cube partitioning and two synergistic adaptive sparsity mechanisms: cube-level attention selection and token-level FFN selection.
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing: This paper proposes AutoGaze—a lightweight autoregressive module with only 3M parameters—that operates before the ViT to select the minimal set of patches in a multi-scale manner, eliminating spatiotemporal redundancy and achieving 4×–100× token compression and up to 19× ViT speedup, enabling MLLMs to scale to 1K-frame 4K-resolution video.
AutoGaze: Attend Before Attention — Efficient and Scalable Video Understanding via Autoregressive Gazing: This paper proposes AutoGaze, a lightweight 3M-parameter module that autoregressively selects the minimal multi-scale patch set minimizing reconstruction loss prior to ViT processing, removing redundant information from video. It achieves 4×–100× token compression and up to 19× ViT speedup, enabling MLLMs to scale to 1K-frame 4K-resolution video and reach 67.0% on VideoMME.
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding: This paper identifies severe unreliability in single-sample teacher responses under black-box distillation for video LVLMs—manifested as cross-question variance (\(\sigma=0.22\)), intra-sampling variance (\(\sigma=0.07\)–\(0.15\)), and format violation rates (1%–10%)—and proposes R-MSD, a framework that addresses these issues through a multi-sample teacher pool, task-adaptive matching, and two-stage SFT→RL adversarial distillation. The resulting 4B student model comprehensively outperforms the same-scale Qwen3-VL-4B on VideoMME, Video-MMMU, and WorldSense.
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding: This paper proposes the R-MSD framework, which constructs a teacher pool by sampling \(K\) responses per input, applies task-adaptive quality matching (quality-weighted pairing for closed-ended tasks and uniform pairing for open-ended tasks), and employs an online critic-as-discriminator adversarial distillation strategy to address the unreliability of single-sample supervision in black-box distillation of video LVLMs.
Temporally Consistent Long-Term Memory for 3D Single Object Tracking: This paper proposes ChronoTrack, a robust long-term 3D single object tracking framework built upon compact learnable memory tokens and two complementary objectives — a temporal consistency loss and a memory cycle-consistency loss — achieving state-of-the-art performance on multiple benchmarks while running in real time at 42 FPS.
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization: This paper presents CineSRD, a training-free multimodal speaker diarization framework that performs speaker registration via visual anchor clustering and detects speaker turns using an audio language model, addressing open-world challenges in visual media such as long videos, large cast sizes, and audio-visual asynchrony.
CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning: This paper proposes the CLCR framework, which organizes each modality's features into three semantic hierarchy levels (shallow/middle/deep). An intra-level Controlled Exchange Domain (IntraCED) restricts cross-modal interaction to the shared subspace only, while an inter-level Collaborative Aggregation Domain (InterCAD) enables adaptive cross-level fusion, addressing the cross-level semantic asynchrony problem in multimodal learning.
Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining: This paper proposes ClusterSTM, which leverages intra-frame semantic clustering and a cluster-wise spatio-temporal masking strategy to retain semantically complete visual tokens under high masking ratios. A video-text relevance reconstruction objective is further introduced to enable efficient video-language pretraining at minimal computational cost, achieving a new state of the art among efficient models on retrieval, VQA, and captioning tasks.
Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing: This paper proposes a novel "grayscale always-on, color on demand" paradigm. ColorTrigger detects color redundancy online via lightweight quadratic programming on the grayscale stream, achieving 91.6% of the full-color baseline performance using only 8.1% RGB frames, enabling always-on video sensing on resource-constrained devices.
CVA: Context-aware Video-text Alignment for Video Temporal Grounding: This paper proposes CVA (Context-aware Video-text Alignment), a framework comprising three synergistic components—Query-aware Context Diversification (QCD), Context-invariant Boundary Discrimination (CBD) loss, and Context-enhanced Transformer Encoder (CTE)—to address false negatives and background association issues in video temporal grounding, achieving approximately 5-point improvement in R1@0.7 on QVHighlights.
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection: This paper proposes the Phase-wise Decomposition and Alignment (PDA) framework, which leverages the CoT reasoning capability of LLMs to decompose action labels into start–middle–end phase descriptions. Through text-guided foreground filtering and adaptive phase-wise alignment, PDA achieves fine-grained action pattern transfer, attaining an Avg mAP of 46.9 on THUMOS14 OV-TAD, surpassing the previous SOTA Ti-FAD (41.2).
DIvide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding: This paper proposes DIG, a training-free frame selection framework that classifies queries into global and localization types. For global queries, uniform sampling is applied directly; for localization queries, a dedicated pipeline consisting of content-adaptive frame selection (CAFS), LMM-based reward scoring, and video refinement is employed. DIG consistently outperforms existing methods on three long-form video understanding benchmarks.
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering: This paper proposes the EgoPointVQA dataset and the HINT (Hand Intent Tokens) method, which encodes 3D hand keypoints into hand intent tokens interleaved with visual tokens as input to an MLLM, addressing deictic gesture-based question answering in egocentric video. HINT-14B achieves 68.1% accuracy, surpassing InternVL3-14B by 5.4 pp.
Drift-Resilient Temporal Priors for Visual Tracking: This paper proposes DTPTrack—a lightweight plug-and-play temporal modeling module that assigns reliability scores to historical frames via a Temporal Reliability Calibrator (TRC) to filter noisy observations, and synthesizes the calibrated historical information into dynamic prior tokens via a Temporal Guidance Synthesizer (TGS) to suppress tracking drift, achieving state-of-the-art performance across multiple benchmarks.
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry: This paper proposes a dual-agent reinforcement learning framework comprising a Select Agent (which decides whether to activate the visual front-end based on IMU signals) and a Fusion Agent (which adaptively fuses visual-inertial states). Without completely removing VIBA, the framework substantially reduces its invocation frequency and computational overhead, achieving a superior accuracy–efficiency–memory trade-off.
Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition: Inspired by Kahneman's dual-system theory of human decision-making, TCEI proposes a test-time calibration framework for multi-object tracking. The intuitive system leverages transient memory of recently observed objects (confident samples as temporal priors and uncertain samples as reflective cases) for rapid prediction, while the experiential system validates and calibrates intuitive predictions using knowledge accumulated from historical videos. The entire process requires only forward passes without backpropagation, achieving significant robustness improvements under distribution shift across multiple MOT benchmarks.
EgoPointVQA: Gesture-Based Egocentric Video Question Answering: This paper proposes the EgoPointVQA dataset (4,000 synthetic + 400 real egocentric videos) and the HINT method, which encodes 3D hand keypoints into hand intent tokens interleaved with visual tokens as input to an MLLM, enabling the model to interpret pointing gestures and answer deictic questions. HINT-14B achieves 68.1% accuracy, outperforming InternVL3-14B by 6.6 percentage points.
EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions: This paper introduces EgoXtreme, the first large-scale benchmark for 6D object pose estimation in egocentric views under extreme conditions, encompassing three real-world challenges — severe motion blur, dynamic illumination, and smoke occlusion — and reveals critical failures of current state-of-the-art pose estimators under these conditions.
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration: This paper proposes an efficient post-hoc calibration method based on isotonic regression that aligns the output distribution of uncertainty models with the observed distribution, addressing inaccurate uncertainty estimation caused by domain shift in gaze tracking. It also introduces Coverage Probability Error (CPE) as a more reliable uncertainty evaluation metric than EUC.
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration: This paper proposes a data-efficient post-hoc calibration method that uses isotonic regression to align the predictive distribution of uncertainty-aware gaze tracking models with the true observational distribution. It also introduces Coverage Probability Error (CPE) as a replacement for the unreliable Error-Uncertainty Correlation (EUC) metric for evaluating uncertainty quality.
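As a rough illustration of the two ingredients named in this entry, the sketch below implements the Pool Adjacent Violators step that underlies isotonic regression, plus a simple coverage-gap score. It assumes Gaussian predictive distributions and defines the score as the mean absolute gap between nominal and empirical coverage; the paper's exact CPE definition and calibration pipeline may differ, and all function names here are hypothetical.

```python
from statistics import NormalDist


def pava(values):
    """Pool Adjacent Violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def empirical_coverage(errors, sigmas, levels):
    """Fraction of samples whose absolute error falls inside the central
    Gaussian interval for each nominal level (assumed Gaussian predictions)."""
    cover = []
    for p in levels:
        z = NormalDist().inv_cdf(0.5 + p / 2.0)
        inside = sum(abs(e) <= z * s for e, s in zip(errors, sigmas))
        cover.append(inside / len(errors))
    return cover


def coverage_gap(levels, cover):
    """Mean absolute gap between nominal and empirical coverage
    (an assumed, simplified stand-in for CPE)."""
    return sum(abs(p - c) for p, c in zip(levels, cover)) / len(levels)


monotone = pava([0.1, 0.3, 0.2, 0.6])
levels = [0.5, 0.9]
cover = empirical_coverage([0.5, 1.5, 0.2, 3.0], [1.0, 1.0, 1.0, 1.0], levels)
gap = coverage_gap(levels, cover)
```

A well-calibrated model would give a gap near zero; a post-hoc calibrator would use a monotone fit such as `pava` to remap nominal levels onto the coverage actually observed on a held-out calibration set.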
Envisioning the Future, One Step at a Time: This paper formulates open-set future scene dynamics prediction as stepwise reasoning over sparse point trajectories, enabling rapid generation of thousands of diverse future hypotheses from a single image via an autoregressive diffusion model — orders of magnitude faster than dense prediction models.
Event6D: Event-based Novel Object 6D Pose Tracking: EventTrack6D proposes an event-depth fusion framework for 6D pose tracking that bridges the temporal gap between event cameras and depth frame rates by reconstructing intensity and depth images at arbitrary timestamps, achieving robust tracking of unseen objects at 120+ FPS while being trained exclusively on synthetic data.
FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking: FC-Track is a lightweight post-association correction framework that explicitly corrects identity switch errors caused by target overlap in online MOT. It employs IoA (Intersection over Area)-based overlap-aware appearance feature filtering and a local mismatch reassignment strategy, reducing the long-term identity switch ratio to 29.55%.
FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking: This paper proposes FC-Track, a lightweight post-association correction framework that suppresses appearance updates via IoA triggering and reassigns locally mismatched detection–tracklet pairs, reducing the proportion of long-term identity switches from 36.86% to 29.55% while maintaining state-of-the-art performance on MOT17/MOT20.
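The IoA trigger mentioned in the FC-Track entries can be sketched as follows. Unlike IoU, IoA divides the intersection by the area of one box only, so a small box fully covered by a large one scores 1.0. The gating function `should_update_appearance` and its `threshold` value of 0.5 are hypothetical and chosen only for illustration; the paper's actual rule and threshold are not given in this summary.

```python
def ioa(box, other):
    """Intersection area divided by the area of `box`.

    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    """
    iw = max(0.0, min(box[2], other[2]) - max(box[0], other[0]))
    ih = max(0.0, min(box[3], other[3]) - max(box[1], other[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return (iw * ih) / area if area > 0 else 0.0


def should_update_appearance(box, others, threshold=0.5):
    """Allow an appearance update only if no other box covers `box` too much.

    `threshold` is an assumed value, not taken from the paper.
    """
    return all(ioa(box, o) < threshold for o in others)


half_covered = ioa((0, 0, 10, 10), (5, 0, 15, 10))
fully_covered = ioa((0, 0, 2, 2), (0, 0, 10, 10))
```

Skipping the appearance update for heavily covered boxes keeps a tracklet's stored features from being contaminated by the occluding target, which is the failure mode that leads to identity switches.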
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding: This paper proposes FluxMem, a training-free streaming video understanding framework that employs a three-tier hierarchical memory design (short-term / medium-term / long-term) and two adaptive token compression modules — TAS for temporal redundancy removal and SDC for spatial redundancy reduction. FluxMem achieves new state-of-the-art results on StreamingBench and OVO-Bench while discarding 60–70% of visual tokens.
Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding: This paper proposes Frame2Freq—the first family of PEFT adapters that performs temporal modeling in the frequency domain. By transforming frozen VFM frame embeddings into the spectral space via FFT and learning frequency band-level filtering, Frame2Freq surpasses fully fine-tuned models on five fine-grained action recognition benchmarks with fewer than 10% trainable parameters.
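The frequency-domain filtering described for Frame2Freq can be illustrated on a single embedding channel over time. This sketch uses a naive DFT from the standard library and fixed per-band gains; in the paper the band-level filters are learned and applied to frozen VFM frame embeddings, so the function `band_filter`, the equal-width band split, and the fixed `band_gains` are assumptions for illustration.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def band_filter(signal, band_gains):
    """Scale temporal-frequency bands of one channel and transform back.

    band_gains[i] multiplies every frequency bin that falls in band i.
    Bins k and n - k share a gain so the output stays real.
    """
    n = len(signal)
    spec = dft(signal)
    n_bands = len(band_gains)
    half = n // 2
    filtered = []
    for k in range(n):
        f = min(k, n - k)  # fold conjugate-symmetric bins together
        band = min(f * n_bands // (half + 1), n_bands - 1)
        filtered.append(spec[k] * band_gains[band])
    return idft(filtered)


slow = band_filter([1.0] * 8, [1.0, 0.0])        # constant signal, low band kept
fast = band_filter([1.0, -1.0] * 4, [1.0, 0.0])  # alternating signal, high band removed
```

Here the low band passes the constant signal unchanged while the frame-to-frame alternation is suppressed, showing how band gains select which temporal rates survive.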
GoalForce: Teaching Video Models to Accomplish Physics-Conditioned Goals: This paper proposes the GoalForce framework, which trains video generation models on simple synthetic data using multi-channel physical control signals (goal force, direct force, and mass), enabling the model to learn backward causal planning from desired effects. The approach achieves zero-shot generalization to complex real-world scenarios such as tool use and human–object interaction.
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation: SelVA introduces the text-conditioned selective video-to-audio (V2A) generation task. Through a learnable supplementary token [SUP] and a self-supervised video mixing strategy, the model generates only the user-specified target sound from multi-source videos guided by text prompts, surpassing existing methods in audio quality, semantic alignment, and temporal synchronization.
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering: HERBench is a video question answering benchmark specifically designed for multi-evidence integration, comprising 26,806 five-choice questions, each structurally requiring the fusion of \(\ge 3\) temporally dispersed, non-overlapping visual cues. By introducing the Minimum Required Frame Set (MRFS) metric, the benchmark exposes two critical bottlenecks in current Video-LLMs: insufficient frame retrieval and evidence fusion failure.
HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling: HieraMamba proposes a Mamba-based hierarchical architecture for video temporal grounding. Its core contribution is the Anchor-MambaPooling (AMP) module, which employs Mamba's selective scanning to progressively compress video features into multi-scale anchor tokens. Complementary anchor-conditioned and segment-pooled contrastive losses enhance the compactness and discriminability of hierarchical representations, achieving state-of-the-art performance on Ego4D-NLQ, MAD, and TACoS.
How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms: This paper compares three mainstream temporal output paradigms for video temporal grounding (VTG) — text-number generation, temporal token generation, and continuous time decoding — within a unified framework, finding that the continuous distribution paradigm consistently achieves the best efficiency–accuracy Pareto frontier.
LAOF: Robust Latent Action Learning with Optical Flow Constraints: This paper proposes the LAOF framework, which leverages agent optical flow as a pseudo-supervision signal to constrain latent action learning, yielding latent action representations that are more robust to distractors. LAOF substantially outperforms unsupervised baselines on LIBERO and PROCGEN, and matches or surpasses supervised methods that use 1% action labels, while requiring no labels at all.
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning: This paper proposes AssistMimic, which formulates physics-based imitation of human-human assistive interactions as a multi-agent reinforcement learning (MARL) problem. Through motion prior initialization, dynamic reference retargeting, and contact-promoting rewards, it achieves, for the first time, physics-simulation tracking of force-exchanging assistive motions.
LensWalk: Agentic Video Understanding by Planning How You See in Videos: This paper presents LensWalk, an agentic framework that enables an LLM reasoner to actively control the temporal scope and sampling density of video observations. Through a reason-plan-observe loop, LensWalk achieves adaptive video understanding without any fine-tuning, yielding plug-and-play performance gains exceeding 5% on long video benchmarks.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding: This paper proposes LongVideo-R1, a reasoning-capable multimodal agent that organizes videos into a hierarchical tree structure and employs an intelligent navigation strategy to achieve efficient long-video question answering with an average of only 10.5 tool calls, significantly outperforming exhaustive methods on the accuracy–efficiency trade-off.
Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding: A two-stage video moment retrieval framework is proposed: the first stage employs LLM-guided caption matching and generates auxiliary short videos as temporal priors; the second stage uses a multimodal-controlled Mamba network to efficiently fuse generated priors with long sequences, achieving state-of-the-art performance on TVR (R@1/IoU=0.5: 45.20%) while reducing computational overhead.
MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters: This paper proposes MaskAdapt, a two-stage residual learning framework that first trains a mask-invariant robust base policy and then trains a residual policy on top of the frozen base controller to modify target body parts, enabling flexible and precise motion adaptation for physics-based humanoid characters.
MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning: This paper introduces MINERVA-Cultural, a benchmark comprising 2,400 manually annotated video reasoning questions spanning 18 language/region locales, and reveals severe deficiencies in cultural visual perception among state-of-the-art Video-LLMs through evidence graphs and an iterative error isolation strategy (best model Gemini-2.5-Pro: 45.07% vs. human: 95.22%).
Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos: This paper introduces the Mistake Attribution (MATT) task, which attributes action mistakes in egocentric videos along three dimensions: semantic (which component of the instruction was violated), temporal (at which frame the point of no return, PNR, occurs), and spatial (which region in the PNR frame contains the error). A data engine called MisEngine automatically constructs large-scale mistake samples from existing action datasets, and a unified Transformer model, MisFormer, simultaneously addresses all three attribution sub-tasks, surpassing task-specific SOTA methods across multiple benchmarks.
MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark: This paper introduces MovieRecapsQA, a multimodal open-ended video QA benchmark constructed from movie recap videos, comprising approximately 8.2K questions across 60 movies. It proposes a reference-free evaluation metric based on atomic facts and reveals that the critical bottleneck of current MLLMs lies in visual perception rather than reasoning.
Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking: Ninja Codes leverages deep steganography to transform arbitrary images into visually inconspicuous fiducial markers via an end-to-end trained encoder. The resulting markers can be printed with standard printers and detected using RGB cameras, enabling stealthy 6-DoF pose tracking.
Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking: This paper proposes OA-SORT, an occlusion-aware tracking framework that explicitly models target occlusion states to mitigate positional cost ambiguity and Kalman Filter estimation instability. The method achieves state-of-the-art improvements on DanceTrack, SportsMOT, and MOT17, with all components being plug-and-play compatible with multiple tracker architectures.
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments: This paper presents OpenMarcie, the largest-scale multimodal action recognition dataset for industrial environments, integrating 8 sensing modalities, 200+ channels, and 37+ hours of recordings from wearable sensors and visual data. Three benchmarks—HAR classification, open-vocabulary description, and cross-modal alignment—demonstrate the superiority of inertial+vision fusion.
Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation: This paper presents the first systematic analysis of adversarial vulnerabilities in Tracking-by-Query-Propagation (TBP) trackers and proposes the FADE attack framework. FADE employs two complementary strategies: Temporal Query Flooding (TQF), which exhausts fixed query budgets by generating persistent spurious tracks, and Temporal Memory Corruption (TMC), which disrupts hidden state propagation of legitimate tracks. On MOT17/MOT20, FADE causes up to ~30 points of HOTA degradation and a more than 10× increase in identity switches on MOTR/MOTRv2/MeMOTR/Samba/CO-MOT.
Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding: This paper proposes QViC-MF, a framework that achieves state-of-the-art performance on MLVU, LVBench, and VNBench through question-guided multi-frame visual compression (QMSA) and a contextual memory feedback mechanism, using as few as 16 visual tokens per frame.
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation: This paper is the first to introduce textual descriptions into RGBT tracking, proposing RAGTrack, a retrieval-augmented generation (RAG)-based framework. Through a Multimodal Transformer Encoder (MTE), Adaptive Token Fusion (ATF), and a Context-aware Reasoning Module (CRM), it achieves state-of-the-art performance on four RGBT benchmarks.
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling: This paper proposes the Verifier — a meta-model that learns to assess the per-frame reliability of predictions from multiple pre-trained trackers, selecting the best candidate at each frame to construct high-quality pseudo-label trajectories. This enables annotation-free fine-tuning for real-world point tracking and achieves state-of-the-art performance on four real-world benchmarks.
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling: This paper proposes a learnable Verifier meta-model trained on synthetic data to assess the reliability of tracker predictions and transfer this capability to the real world. By evaluating per-frame predictions from six pretrained trackers and selecting the most reliable as pseudo-labels, the proposed Track-On-R model is fine-tuned on only ~5K real videos and achieves comprehensive state-of-the-art performance across four real-world benchmarks.
Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning: This paper proposes SlotCurri, a reconstruction-guided slot-count curriculum learning strategy that begins training with very few slots and progressively expands slot capacity only in regions with high reconstruction error. Combined with structure-aware loss and cyclic inference, SlotCurri effectively addresses the over-fragmentation problem — where a single object is erroneously split across multiple slots — in video object-centric learning, achieving a +6.8 FG-ARI improvement on YouTube-VIS.
FlexHook: Rethinking Two-Stage Referring-by-Tracking in RMOT: This paper proposes FlexHook, a novel two-stage Referring-by-Tracking framework that redefines feature construction via a sampling-based Conditioning Hook (C-Hook) and replaces CLIP cosine similarity matching with a Pairwise Correspondence Decoder (PCD), making a two-stage method comprehensively surpass current state-of-the-art one-stage methods for the first time.
FlexHook: Rethinking Two-Stage Referring-by-Tracking in RMOT: FlexHook revitalizes the two-stage Referring-by-Tracking (RBT) paradigm: it introduces C-Hook to directly sample target features from the backbone (replacing dual encoding) and inject language-conditioned cues, and replaces CLIP cosine similarity with PCD (Pairwise Correspondence Decoder) for active correspondence modeling. This marks the first time a two-stage method comprehensively surpasses one-stage RMOT state-of-the-art, achieving a HOTA of 42.53 (vs. 10.32 for iKUN) on Refer-KITTI-V2, with training completed in only 1.91 hours on two 4090 GPUs.
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning: This paper proposes SAIL, which achieves state-of-the-art performance on both dense video captioning and event localization on ActivityNet and YouCook2 under a weakly-supervised setting (caption annotations only, no temporal boundaries), via cross-modal similarity-guided semantic-aware mask generation and auxiliary supervision from LLM-synthesized captions.
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion: This paper proposes SAVA-X, a framework comprising three complementary modules—adaptive sampling, scene-aware view embedding, and bidirectional cross-attention fusion—to address cross-view temporal error detection in the exocentric-demonstration-to-egocentric-imitation setting, achieving comprehensive improvements over existing baselines on the EgoMe benchmark.
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion: This paper formalizes the Ego→Exo imitation error detection task and proposes the SAVA-X (Align–Fuse–Detect) framework, which jointly addresses three core challenges—temporal misalignment, video redundancy, and cross-view domain gap—through three modules: adaptive sampling, scene-adaptive view embeddings, and bidirectional cross-attention fusion.
Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting: This paper proposes Seen-to-Scene, a unified video outpainting framework that integrates propagation-based and generation-based paradigms. By combining reference-frame-guided latent-space propagation with a video diffusion model, it achieves spatiotemporal consistency and visual fidelity in zero-shot inference that surpasses prior methods requiring input-specific adaptation.
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild: This paper introduces SHOW3D, the first hand-object interaction dataset with accurate 3D annotations captured in truly in-the-wild environments. Through a lightweight wearable multi-camera backpack system and an ego-exo fusion annotation pipeline, the dataset comprises 4.3 million frames of multi-view data, achieving sub-centimeter annotation accuracy for both hands and objects. Cross-dataset experiments validate the generalization advantage of models trained on SHOW3D.
SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition: This paper proposes SkeletonContext, a framework that recovers the missing environmental and object context semantics in skeleton data from pretrained language models via a cross-modal context prompt module, and enhances the discriminability of motion-critical joints through a key part decoupling module. The method achieves state-of-the-art performance on NTU-60/120 and PKU-MMD under both zero-shot (ZSL) and generalized zero-shot (GZSL) settings.
SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding: This paper proposes SlotVTG, a framework that inserts a lightweight Slot Adapter into the early layers of an MLLM decoder to decompose visual tokens into object-level slot representations. A Slot Alignment Loss guided by DINOv2 priors encourages semantically coherent slot formation, substantially improving out-of-domain (OOD) generalization for video temporal grounding (up to +4.3 OOD R1@0.5), while introducing only ~0.25% additional trainable parameters.
SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking: SpikeTrack is proposed as the first RGB visual tracking framework fully compliant with the spike-driven paradigm. Through asymmetric temporal step expansion, unidirectional information flow, and a brain-inspired Memory Retrieval Module (MRM), it achieves SOTA among SNN-based trackers and is on par with ANN-based trackers, while consuming only 1/26 the energy of TransT.
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning: ROS-DVC introduces three complementary components for DETR-based dense video captioning (DVC): role-specific query initialization (separate localization and captioning queries), a cross-task contrastive alignment loss, and an overlap suppression loss. Without pretraining or LLMs, it achieves a CIDEr of 39.18 on YouCook2, surpassing DDVC which relies on GPT-2.
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning: This paper proposes ROS-DVC, which decouples the shared queries in DETR-based DVC frameworks into independent localization queries and caption queries, introduces an Overlap Suppression Loss to penalize temporal overlap between queries, and employs Cross-Task Contrastive Alignment to maintain cross-task semantic consistency. The approach achieves state-of-the-art captioning and localization performance on YouCook2 and ActivityNet Captions.
STORM: End-to-End Referring Multi-Object Tracking in Videos: STORM is the first end-to-end multimodal large language model framework for Referring Multi-Object Tracking (RMOT). It substantially reduces reliance on RMOT-annotated data through a task composition learning strategy and introduces the high-quality STORM-Bench dataset.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos: This work presents StreamGaze, the first gaze-guided streaming video understanding benchmark, comprising 8,521 QA pairs covering three task categories — past, present, and proactive prediction. A gaze trajectory–video alignment pipeline is proposed to generate spatiotemporally grounded QA pairs, revealing a substantial gap in current MLLMs' ability to leverage gaze signals for temporal reasoning.
StreamingTOM: Streaming Token Compression for Efficient Video Understanding: This paper proposes StreamingTOM, a training-free two-stage framework for streaming video understanding. Causal Temporal Reduction (CTR) compresses per-frame tokens from 196 to 50 via causal temporal selection before the LLM, while Online Quantized Memory (OQM) constrains KV-cache growth after the LLM through 4-bit quantization and on-demand retrieval. The framework achieves a 15.7× compression ratio, 1.2× lower peak memory, and 2× faster TTFT.
StreamingTOM: Streaming Token Compression for Efficient Video Understanding: The first training-free framework to simultaneously address both pre-LLM prefill and post-LLM KV-cache efficiency bottlenecks in streaming video VLMs, achieving 15.7× compression with bounded active memory.
StreamReady: Learning What to Answer and When in Long Streaming Videos: This paper introduces a readiness-aware paradigm for streaming video understanding. By incorporating a learnable <RDY> token and proposing the Answer Readiness Score (ARS) metric, the model is trained not only to produce correct answers but also to respond at the appropriate moment when sufficient evidence has appeared. The approach achieves state-of-the-art results on 9 streaming and offline video benchmarks.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration: This paper proposes SVAgent, a storyline-guided cross-modal multi-agent framework for long video question answering. By progressively constructing narrative representations, employing DPP-based evidence selection, cross-modal consistency verification, and iterative refinement, SVAgent achieves performance gains of 5.5%–11.5% over baselines.
TCEI: Dual-level Adaptation for Multi-Object Tracking via Test-Time Calibration: Inspired by the dual-system model of human decision-making, this paper proposes TCEI, a test-time calibration framework for multi-object tracking: an intuition system leverages instantaneous memory for rapid prediction, while an experience system calibrates those predictions using accumulated knowledge. Confident and uncertain samples serve as historical priors and reflective cases, respectively, enabling online adaptation.
Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition: Inspired by Kahneman's dual-process theory, the TCEI framework proposes a test-time adaptation method that combines an intuitive system (rapid inference via transient memory of recently observed objects) with an experiential system (calibration of intuitive predictions using knowledge accumulated from historical videos), achieving significant improvements in multi-object tracking under distribution shift without requiring backpropagation.
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models: This paper proposes the AOT framework, which establishes local-global token anchors and employs Optimal Transport (OT) to aggregate the semantic information of pruned/merged tokens at both intra-frame and inter-frame levels. The method achieves training-free video token compression, retaining 97.6% of original performance while discarding 90% of tokens.
TrajTok: Learning Trajectory Tokens Enhances Video Understanding: This paper proposes TrajTok — an end-to-end differentiable trajectory tokenizer that implicitly clusters video pixels into object trajectory tokens, replacing external segmentation-and-tracking pipelines. It achieves significant improvements across three settings: training from scratch (TrajViT2), feature adaptation (TrajAdapter), and vision-language model connectors (TrajVLM), with particularly large gains on long-video QA over patch pooling.
TrajTok: Learning Trajectory Tokens Enhances Video Understanding: This paper proposes TrajTok—the first end-to-end differentiable trajectory-based video tokenizer—which encodes video into object trajectory tokens via implicit spatiotemporal clustering, requiring no external segmentation or tracking pipeline. TrajTok achieves +4.8% on K400, +4.1% on SSv2, and +8.8% on long-video QA benchmarks, with inference efficiency on par with the most efficient baselines.
U2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation: U2Flow is the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. Through augmentation-consistency-based decoupled uncertainty learning and uncertainty-guided bidirectional flow fusion, it achieves unsupervised state-of-the-art performance on KITTI and Sintel.
UETrack: A Unified and Efficient Framework for Single Object Tracking: This paper proposes UETrack, a unified and efficient single object tracking framework capable of handling five modalities simultaneously: RGB, Depth, Thermal, Event, and Language. UETrack addresses a critical gap in efficient multi-modal tracking — existing efficient trackers are limited to RGB, while multi-modal trackers are too slow for practical deployment due to complex designs. The core contributions include: (1) Token-Pooling-based Mixture-of-Experts (TP-MoE), which replaces conventional gating mechanisms with similarity-based soft assignment to enable efficient expert collaboration and specialization; and (2) Target-aware Adaptive Distillation (TAD), which adaptively determines whether each sample is suitable for distillation, filtering out unreliable teacher signals. Evaluated across 12 benchmarks on 3 hardware platforms, UETrack achieves an optimal speed-accuracy trade-off — UETrack-B attains 69.2% AUC on LaSOT at 163/56/60 FPS on GPU/CPU/AGX respectively, with only 13M parameters.
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models: UFVideo is the first Video LLM to unify global, pixel-level, and temporal-level video understanding within a single model. Through a visual-language guided alignment strategy and the SAM2 mask decoder, it simultaneously supports video question answering, object referring, video segmentation, and temporal grounding, and introduces UFVideo-Bench, a multi-grained cooperative understanding benchmark.
Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability: This paper investigates, from an interpretability perspective, the root cause of temporal logic inconsistency in Video-LLMs—namely, that cross-modal attention heads fail to effectively discriminate video tokens at different timestamps—and proposes TCAS (Temporally Conditioned Attention Sharpening), which significantly improves temporal logic consistency and general temporal grounding performance by optimizing attention distributions.
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention: This paper proposes a unified spatiotemporal token compression method that jointly evaluates token contribution and semantic redundancy via a global retention pool, and introduces a text-aware merging mechanism inside the LLM. At an extreme compression ratio retaining only ~2% of visual tokens, the method preserves 90.1% of baseline performance while reducing FLOPs to ~2.6%.
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking: This paper proposes UTPTrack, the first unified framework that jointly prunes tokens from all three components — search region (SR), dynamic template (DT), and static template (ST) — within one-stream Transformer trackers, achieving 65–67% visual token reduction across both RGB and multimodal/language-guided tracking tasks while maintaining 99.7%–100.5% of baseline performance.
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference: This paper identifies a strong "vertical vector" sparsity pattern in the attention maps of video models and proposes VecAttention, a fine-grained vector-wise sparse attention framework. Through TilingSelect and minS filtering, the method efficiently selects important KV vectors, achieving accuracy on par with full attention at over 78% sparsity while delivering a 2.65× speedup in attention computation.
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding: VideoARM proposes an agentic reasoning paradigm built upon a Hierarchical Multimodal Memory (HM3) structure. Through an adaptive observe–think–act–memorize loop and a coarse-to-fine tool-calling strategy, it surpasses state-of-the-art methods on long-form video understanding benchmarks while reducing token consumption to 1/34 of DVD.
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice: This paper proposes VideoAuto-R1, an on-demand reasoning framework for video understanding. During training, it adopts a "think once, answer twice" (answer→think→answer) paradigm; during inference, it uses the confidence of the first answer to determine whether to invoke CoT reasoning. The approach maintains SOTA accuracy while reducing average response length from 149 to 44 tokens (approximately 3.3× compression).
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning: This paper proposes VideoChat-M1, which replaces conventional fixed tool-calling strategies with Collaborative Policy Planning (CPP) and Multi-Agent Reinforcement Learning (MARL). Multiple policy agents dynamically generate, execute, and communicate tool-invocation plans, achieving state-of-the-art results on 8 video understanding benchmarks—surpassing Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6% on LongVideoBench.
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning: VideoChat-M1 proposes the Collaborative Policy Planning (CPP) paradigm and a Multi-Agent Reinforcement Learning (MARL) training framework, enabling four heterogeneous VLM agents to dynamically generate and update tool-calling policies for video understanding. It surpasses Gemini 2.5 Pro by 3.6% and GPT-4o by 15.6% on LongVideoBench.
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking: VideoSeek proposes a long-horizon video agent that actively seeks critical evidence by following the video's logical flow rather than exhaustively parsing all frames. Through a think-act-observe loop and a multi-granularity toolkit (overview/skim/focus), it achieves a 10.2-point improvement over the base model GPT-5 on LVBench while reducing frame usage by 93%.
VidTAG: Temporally Aligned Video to GPS Geolocalization: This paper proposes VidTAG, a dual-encoder (CLIP+DINOv2) frame-to-GPS retrieval framework that achieves temporally consistent per-frame video geolocalization at global scale, via a TempGeo module for inter-frame temporal alignment and a GeoRefiner encoder-decoder module for GPS prediction refinement.
VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding: This paper proposes VirtueBench, the first long video understanding benchmark for evaluating VLM trustworthiness under uncertainty. By constructing multi-level frame sampling for each video and annotating answerable/unanswerable ground truth at each level, it reveals that existing models tend to guess rather than honestly refuse to answer.
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues: This paper proposes VRR-QA, a benchmark comprising 1K carefully annotated video QA pairs designed to evaluate models' ability to reason about implicit visual relationships in videos—such as off-screen events, cross-frame causality, and spatial relationship inference. The benchmark reveals significant deficiencies in implicit reasoning among current state-of-the-art VideoQA models, including GPT-O3: the best-performing model achieves only 64% accuracy, far below the human baseline of 83%.
VSI: Visual-Subtitle Integration for Keyframe Selection to Enhance Long Video Understanding: VSI proposes a dual-branch collaborative retrieval framework (Video Search + Subtitle Match) that fuses visual and textual information for precise keyframe localization, and it is presented as the first cross-modal keyframe retrieval method. On text-dominant subtasks, it improves search accuracy from 29.48 to 45.00.
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding: This paper proposes WFS-SB, a training-free frame selection framework that applies wavelet transforms to query-frame similarity signals for semantic boundary detection. The video is segmented into semantically coherent segments, over which frame budgets are adaptively allocated and diversity-aware sampling is performed. WFS-SB substantially surpasses state-of-the-art methods on VideoMME, MLVU, and LongVideoBench.
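
As a quick sanity check on the StreamingTOM entry above, the reported 15.7× compression ratio is consistent with multiplying its two stated stages: the 196 → 50 per-frame token reduction and the 4-bit KV-cache quantization. The sketch below assumes a 16-bit baseline cache precision and that the two factors simply multiply; neither assumption is stated in the note, and the function name is hypothetical.

```python
def combined_compression(tokens_before: int, tokens_after: int,
                         bits_before: int, bits_after: int) -> float:
    """Multiply the token-count reduction by the bit-width reduction."""
    token_ratio = tokens_before / tokens_after  # fewer tokens kept per frame
    bit_ratio = bits_before / bits_after        # fewer bits stored per cached value
    return token_ratio * bit_ratio


# 196 -> 50 tokens per frame comes from the note.
# The 16-bit baseline is an assumption; only the 4-bit target is stated.
ratio = combined_compression(196, 50, bits_before=16, bits_after=4)
print(f"{ratio:.2f}x")
```

Under these assumptions the product rounds to the 15.7× quoted in the entry; a different baseline precision would give a different figure.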

🧑 Human Understanding

A Two-Stage Dual-Modality Model for Facial Expression Recognition: A two-stage dual-modality framework for facial expression recognition is proposed: Stage I adapts a DINOv2 encoder on external datasets via padding-aware augmentation and a training-only MoE head; Stage II performs frame-level audio-visual expression classification using multi-scale facial crops, Wav2Vec 2.0 audio features, and a gated fusion module, achieving 0.5368 Macro-F1 in the ABAW 2026 competition.
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark: This paper proposes LIDMark, the first proactive forensics framework that unifies deepfake detection, tampering localization, and source tracing within a single watermarking scheme. By embedding a 152-dimensional Landmark-Identity watermark (136D facial landmarks + 16D source ID) and leveraging intrinsic/extrinsic consistency, LIDMark achieves three-in-one forensics while surpassing existing methods in both PSNR/SSIM and detection accuracy.
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video: This paper proposes AVATAR, a framework that addresses three fundamental limitations of GRPO in multimodal video reasoning—data inefficiency, advantage collapse, and uniform credit assignment—via an off-policy training architecture (hierarchical replay buffer) and a Temporal Advantage Shaping (TAS) strategy. AVATAR significantly outperforms standard GRPO on audio-visual understanding benchmarks (OmniBench +3.7, 5× sample efficiency improvement).
Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation: This paper reveals that subject-independent cross-validation in facial AU detection introduces a random noise floor of ±0.065 F1 merely from varying subject-to-fold assignments, rendering many claimed SOTA improvements statistically indistinguishable. The authors propose the Leave-One-Dataset-Out (LODO) protocol as a more stable and reliable alternative evaluation scheme.
BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy: This paper proposes a heavily regularized multimodal fusion pipeline that achieves robust video-level recognition of Ambivalence/Hesitancy (A/H) behaviors in naturalistic settings. The framework employs a heterogeneous classifier committee across four modalities (visual via SigLip2, audio via HuBERT, text via F2LLM, and statistical features), combined with a PSO-based hard-voting ensemble regularized by a train-validation gap penalty, achieving Macro F1 = 0.7465 on the ABAW10 test set.
CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation: This paper proposes CIGPose, a causal intervention graph-based pose estimation framework that employs a structural causal model (SCM) to identify visual-context confounders, leverages prediction uncertainty to localize confounded keypoints and replaces their embeddings with learned context-free canonical representations, and subsequently models skeletal anatomical constraints via a hierarchical graph neural network. CIGPose achieves a new state of the art of 67.0% AP on COCO-WholeBody.
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation: This paper proposes COG, a framework that models cross-view correspondences as a confidence-aware optimal transport (OT) problem. By predicting per-point confidence scores as transport marginal constraints, COG suppresses contributions from non-overlapping regions and outliers, achieving unsupervised single-reference 6DoF novel object pose estimation on par with supervised methods.
E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation: This paper proposes E-3DPSM, an event-camera-based egocentric 3D human pose state machine that formulates pose estimation as a continuous-time state evolution process. It integrates bidirectional SSM temporal modeling with a learnable Kalman-style fusion module to combine direct and incremental pose predictions, achieving real-time inference at 80 Hz with a 19% reduction in MPJPE and a 2.7× improvement in temporal stability.
Editing Physiological Signals in Videos Using Latent Representations: This paper proposes PhysioLatent, a framework that encodes input facial videos into the latent space of a 3D VAE, fuses the resulting representation with CLIP text embeddings of the target heart rate, captures rPPG temporal coherence via AdaLN-enhanced spatiotemporal fusion layers, and employs a FiLM-modulated decoder with a fine-tuned output layer to achieve precise heart rate modification. The method attains a heart rate modulation MAE of 10 bpm while preserving visual quality at PSNR 38.96 dB / SSIM 0.98.
Efficient Onboard Spacecraft Pose Estimation with Event Cameras and Neuromorphic Hardware: The first end-to-end 6-DoF spacecraft pose estimation system deployed on BrainChip Akida neuromorphic hardware, exploring accuracy–efficiency trade-offs among event camera representations and quantization-aware training for low-power onboard deployment.
EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR: This paper proposes EgoPoseFormer v2 (EPFv2), which achieves state-of-the-art accuracy in egocentric 3D human motion estimation on the EgoBody3M benchmark (MPJPE 4.02 cm, 15–22% improvement over its predecessor) at 0.8 ms GPU latency. The system combines an end-to-end Transformer architecture (single global query token + causal temporal attention + conditioned multi-view cross-attention) with an uncertainty-distillation-based auto-labeling system.
Face Time Traveller: Travel Through Ages Without Losing Identity: This paper proposes FaceTT, a framework that achieves high-fidelity, identity-consistent face age transformation via three core modules—face-attribute-aware prompt refinement, angular inversion, and adaptive attention control (AAC)—surpassing existing methods across multiple benchmarks.
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision: FlexAvatar introduces learnable bias sink tokens to unify training across monocular and multi-view data, resolving the entanglement between driving signals and target viewpoints, and enables the generation of complete, high-quality, animatable 3D head avatars from a single image.
A2P: From 2D Alignment to 3D Plausibility for Occlusion-Robust Two-Hand Reconstruction: The paper decouples two-hand reconstruction into 2D structural alignment and 3D spatial interaction alignment. Stage 1 employs a Fusion Alignment Encoder (FAE) to implicitly distill three 2D priors from Sapiens (keypoints, segmentation, depth), eliminating the need for the foundation model at inference (56 fps). Stage 2 maps penetrating poses to physically plausible configurations via a penetration-aware diffusion model with collision gradient guidance. On InterHand2.6M, MPJPE is reduced to 5.36 mm (surpassing SOTA 4DHands by 2.13 mm) and penetration volume is reduced by 7×.
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation: FSMC-Pose proposes a lightweight top-down framework for cattle mounting pose estimation, comprising the frequency-spatial fusion backbone CattleMountNet (which employs wavelet transform and Gaussian filtering in the SFEBlock for foreground-background separation, and multi-scale dilated convolutions in the RABlock for context aggregation) and the multiscale self-calibration head SC2Head (spatial-channel co-calibration with a self-calibration branch to correct structural displacement). The paper also introduces MOUNT-Cattle, the first dataset for cattle mounting behavior, achieving 89% AP in complex group-housing environments at extremely low computational cost (4.41 GFLOPs, 2.698M parameters).
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation: This paper proposes FSMC-Pose, a lightweight top-down framework that achieves cattle mounting pose estimation in dense and cluttered farm environments via the frequency-spatial fusion backbone CattleMountNet and the multiscale self-calibration head SC2Head, attaining 89% AP with only 2.698M parameters.
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation: FSMC-Pose presents a lightweight cattle mounting pose estimation framework tailored for dense farm environments. By combining the frequency-spatial fusion backbone CattleMountNet with the multiscale self-calibrating prediction head SC2Head, the method achieves 89% AP with only 2.698M parameters and 4.4 GFLOPs.
FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition: This paper proposes FusionAgent, an intelligent agent framework based on a multimodal large language model (MLLM) for dynamic sample-level model selection in whole-body biometric recognition. Each expert model (face recognition / gait recognition / person re-identification) is encapsulated as a callable tool. Through reinforcement fine-tuning (RFT), the agent learns to adaptively select the optimal model combination for each test sample based on its characteristics. Combined with the newly proposed ACT score fusion strategy, FusionAgent significantly outperforms existing state-of-the-art fusion methods.
HandDreamer: Zero-Shot Text to 3D Hand Model Generation: This paper presents HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. It addresses view inconsistency and geometric distortion in SDS-based optimization through MANO initialization, skeleton-guided diffusion, and a corrective hand shape loss.
HandX: Scaling Bimanual Motion and Interaction Generation: This work introduces HandX—a unified bimanual motion generation infrastructure comprising 54.2 hours of motion data and 485K fine-grained text annotations. It proposes a decoupled automatic annotation strategy (kinematic feature extraction + LLM-based description generation) and benchmarks two generation paradigms—diffusion and autoregressive—demonstrating clear data and model scaling trends.
How to Take a Memorable Picture? Empowering Users with Actionable Feedback: This paper defines a novel task of memorability feedback (MemFeed) and proposes MemCoach — a training-free, activation-steering approach for MLLMs. Via a teacher-student strategy, memorability-aware knowledge is injected into the model's activation space, enabling the MLLM to generate natural-language actionable suggestions that improve photo memorability.
HUM4D: A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture: This paper introduces the HUM4D dataset, covering complex single- and multi-person motion scenarios (rapid movements, occlusions, identity swaps), providing synchronized multi-view RGB/RGB-D sequences, accurate Vicon marker-based ground truth, and SMPL/SMPL-X parameters. Benchmark evaluations reveal significant performance degradation of state-of-the-art markerless methods under realistic conditions.
HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation: This paper reformulates single-image 3D human reconstruction as a 360° orbital video generation problem. A video diffusion model (Wan 2.1) is fine-tuned via LoRA using only 500 3D scans to generate 81-frame orbital videos, from which high-quality textured meshes are reconstructed via VGGT and Mesh Carving. The approach requires no pose annotations and surpasses existing methods in multi-view consistency and identity preservation.
IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations: This paper proposes IDperturb, a geometry-driven sampling strategy that applies angular perturbations to identity embeddings on the unit hypersphere. Without modifying the generative model, it significantly enhances intra-class diversity in synthetic face datasets and improves downstream face recognition performance.
LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference: This paper proposes LabanLite, a symbolic motion representation, and the LaMoGen framework, which for the first time enables LLMs to autonomously compose motion sequences through interpretable Laban symbol reasoning, surpassing conventional text-motion joint embedding methods in temporal precision and controllability.
LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics: This paper proposes the LaScA framework, which leverages large language models to generate a deterministic semantic lexicon as affective priors for handcrafted facial and acoustic features. A frozen sentence encoder produces semantic embeddings that are fused with the raw features. LaScA consistently outperforms feature-only baselines in affective dynamics prediction on the Aff-Wild2 and SEWA datasets, and matches or surpasses end-to-end deep models in terms of consistency, efficiency, and interpretability.
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction: This paper proposes LASER, a training-free framework that converts offline feed-forward reconstruction models (e.g., VGGT, π³) into streaming systems via Layer-wise Scale Alignment (LSA), achieving real-time streaming 4D reconstruction of kilometer-scale videos at 14 FPS with 6 GB peak memory on an RTX A6000.
LCA: Large-scale Codec Avatars - The Unreasonable Effectiveness of Large-scale Avatar Pretraining: LCA is the first work to apply the large-scale pretraining/post-training paradigm to 3D avatar modeling: it pretrains on 1 million in-the-wild videos to acquire broad appearance and geometry priors, then post-trains on high-quality multi-view studio data to enhance fine-grained expression fidelity, effectively breaking the inherent trade-off between generalizability and fidelity.
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision: MatchED introduces a lightweight (~21K parameter) plug-and-play module that generates crisp (single-pixel-wide) edge maps by performing one-to-one bipartite matching between predicted and GT edge pixels during training, based on spatial distance and confidence. The module can be appended to any edge detector for end-to-end training, and for the first time matches or surpasses standard post-processing methods without relying on NMS and thinning.
Miburi: Towards Expressive Interactive Gesture Synthesis: Miburi is the first online causal framework for real-time, synchronized generation of whole-body gestures and facial expressions. It directly leverages the internal token stream of the speech-text large model Moshi together with a 2D causal Transformer.
MMGait: Towards Multi-Modal Gait Recognition: MMGait constructs the most comprehensive multi-modal gait recognition benchmark to date (5 sensors, 12 modalities, 725 subjects, 334K sequences), introduces the novel omni-modal gait recognition task, and proposes a unified baseline model, OmniGait.
Mobile-VTON: High-Fidelity On-Device Virtual Try-On: This paper proposes Mobile-VTON, the first diffusion-based virtual try-on system capable of running fully offline on mobile devices. Through a TeacherNet-GarmentNet-TryonNet (TGT) architecture and a Feature-Guided Adversarial (FGA) distillation strategy, the system achieves high-quality try-on results comparable to server-side baselines with only 415M parameters and 2.84GB memory.
Mobile-VTON: High-Fidelity On-Device Virtual Try-On: The first fully offline, on-device diffusion-based virtual try-on framework. Built upon a TeacherNet-GarmentNet-TryonNet (TGT) architecture, it transfers the capabilities of SD3.5 Large to a 415M-parameter lightweight student network via Feature-Guided Adversarial (FGA) distillation. The method matches or surpasses server-side baselines at 1024×768 resolution on VITON-HD and DressCode, with an end-to-end inference time of approximately 80 seconds on a Xiaomi 17 Pro Max.
MoLingo: Motion-Language Alignment for Text-to-Human Motion Generation: MoLingo achieves comprehensive state-of-the-art performance on text-to-human motion generation—across FID, R-Precision, and user studies—by combining a Semantic Alignment Encoder (SAE) with multi-token cross-attention text conditioning, performing masked autoregressive rectified flow in a continuous latent space.
OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition: This paper introduces OMG-Bench, the first large-scale public benchmark for skeleton-based online micro hand gesture recognition (40 classes, 13,948 instances), and proposes the HMATr framework, which unifies detection and classification end-to-end via a hierarchical memory bank and position-aware queries, achieving a 7.6% improvement in detection rate over the previous state of the art.
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery: This paper proposes OnlineHMR, the first online world-grounded human mesh recovery framework that simultaneously satisfies four criteria: system causality, faithfulness, temporal consistency, and efficiency. It achieves streaming camera-space HMR via sliding-window causal learning with KV-cache inference, and performs online global localization through human-centric incremental SLAM combined with EMA trajectory correction.
OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis: This paper proposes OpenFS, a framework that achieves multi-hand fingerspelling recognition with implicit signing-hand detection via dual-level positional encoding, a signing-hand focusing loss, and a monotonic alignment loss. A frame-wise letter-conditioned diffusion generator is further designed to synthesize OOV training data. OpenFS achieves state-of-the-art performance on three benchmarks (ChicagoFSWild / ChicagoFSWildPlus / FSNeo) with inference speed over 100× faster than PoseNet.
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis: This paper proposes ParTY, a framework that employs a Part-Guided Network and Part-aware Text Grounding to significantly improve text–motion semantic alignment at the body-part level while preserving whole-body motion coherence. This resolves the trade-off between holistic methods, which favor global coherence, and part-decomposition methods, which favor part expressiveness.
PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement: Starting from the Navier-Stokes equations, this work derives through rigorous mathematical analysis that rPPG pulse signals obey a second-order damped harmonic oscillator model whose discrete solution is equivalent to a causal convolution operator, thereby providing a first-principles justification for the TCN architecture. The resulting PHASE-Net, with only 0.29M parameters, achieves state-of-the-art performance across multiple datasets.
RAM: Recover Any 3D Human Motion in-the-Wild: RAM proposes a unified multi-person 3D motion recovery framework integrating a motion-aware semantic tracker SegFollow (built on SAM2 with adaptive Kalman filtering), a memory-augmented temporal human mesh recovery module T-HMR, a lightweight motion predictor, and a gated combiner. It achieves state-of-the-art zero-shot tracking stability and 3D accuracy on benchmarks including PoseTrack and 3DPW, while running 2–3× faster than prior methods.
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback: This paper presents VTON-IQA, a reference-free image quality assessment framework for virtual try-on. It introduces VTON-QBench, a large-scale benchmark comprising 62,688 try-on images annotated with 431,800 human judgments, and proposes an Interleaved Cross-Attention (ICA) module to model interactions among garment, person, and try-on images, achieving image-level quality predictions that are closely aligned with human perception.
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback: This work constructs VTON-QBench (62,688 try-on images, 13,838 qualified annotators, 431,800 annotations) and proposes VTON-IQA, a reference-free image quality assessment framework that jointly models garment fidelity and person preservation via an asymmetric Interleaved Cross-Attention (ICA) module, achieving image-level quality prediction highly aligned with human perception.
RefTon: Reference Person Shot Assist Virtual Try-on: This paper proposes RefTon, a person-to-person virtual try-on framework built on Flux-Kontext. By incorporating an additional reference image — a photo of another person wearing the target garment — RefTon provides richer garment detail information. Combined with a two-stage training strategy and a rescaled position index mechanism, the framework achieves end-to-end try-on without auxiliary conditions (e.g., DensePose, segmentation masks), attaining state-of-the-art performance on VITON-HD and DressCode.
RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised HOI Detection: RegFormer proposes a lightweight relational grounding Transformer module that, under weak supervision with only image-level annotations, leverages spatially-grounded HO queries and interactiveness-aware learning to directly transfer image-level reasoning to instance-level HOI detection without additional training, achieving performance close to fully supervised methods.
ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data: This paper proposes ReMoGen, a modular framework for real-time human interaction-to-reaction motion generation. It learns a general motion prior from large-scale single-person motion data (frozen during downstream training), adapts to different interaction domains (human-human/human-scene) via independently trained Meta-Interaction modules, and achieves per-frame low-latency online updates (0.047 s/frame) through Frame-wise Segment Refinement. ReMoGen comprehensively surpasses state-of-the-art methods on the Inter-X and LINGO benchmarks.
rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training: rPPG-VQA proposes the first video quality assessment framework tailored for remote photoplethysmography (rPPG) heart rate measurement, combining a signal-level multi-method consensus SNR with scene-level MLLM-based disturbance recognition, along with a two-stage adaptive sampling strategy to curate in-the-wild training data.
Seeing without Pixels: Perception from Camera Trajectories: This paper is the first to systematically elevate camera pose trajectories (6DoF pose sequences) to an independent modality for video perception. Through a contrastive learning framework, a lightweight Transformer encoder, CamFormer, is trained to map camera trajectories into a joint embedding space aligned with text. Across 10 downstream tasks on 5 datasets, the paper demonstrates that camera trajectories serve as a lightweight and robust signal for video content understanding—even surpassing video models requiring thousands of times more computation on physical activity tasks.
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation: This paper proposes Sketch2Colab, which distills a sketch-driven diffusion prior into a rectified flow student network, and combines energy guidance with continuous-time Markov chain (CTMC) discrete event planning to generate coordinated multi-human–object interaction 3D motions from storyboard sketches, achieving state-of-the-art constraint compliance and perceptual quality on CORE4D and InterHuman.
Stake the Points: Structure-Faithful Instance Unlearning: This paper proposes Structguard, which leverages semantic anchors to preserve the semantic relational structure among retained instances during the forgetting process, thereby preventing structural collapse. The method achieves average improvements of 32.9% / 19.3% / 22.5% across image classification, face recognition, and retrieval tasks.
Talking Together: Synthesizing Co-Located 3D Conversations from Audio: This work presents the first method for generating complete facial animations of two participants sharing the same 3D physical space from a single mixed audio stream. It introduces a dual-stream diffusion architecture (shared U-Net + cross-attention), a two-stage mixed-data training strategy, LLM-driven text-to-spatial-layout control, and an auxiliary eye gaze loss to synthesize natural mutual gaze, head turning, and spatially-aware dyadic 3D conversation animations.
Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach: A four-modality fusion pipeline (scene VideoMAE + face EfficientNetB0 + audio Wav2Vec2.0/Mamba + text EmotionDistilRoBERTa) is proposed. Each modality embedding is projected into a shared 128-dimensional space via a prototype-augmented Transformer fusion module and regularized with a prototype classification auxiliary loss. A 5-model ensemble achieves 71.43% Macro F1 on the final test set of the BAH corpus, substantially outperforming all unimodal baselines.
Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach: This paper presents a multimodal Ambivalence/Hesitancy (A/H) recognition approach for the 10th ABAW Competition, integrating four modalities—scene, facial, audio, and text—via a Transformer-based fusion module and a prototype-augmented classification strategy. The best single model achieves an MF1 of 83.25%, and a five-model ensemble reaches 71.43% on the final test set.
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures: TeHOR leverages text descriptions as semantic guidance and jointly optimizes the geometry and texture of 3D humans and objects via Score Distillation Sampling from pretrained diffusion models. This approach eliminates the reliance on contact information required by conventional methods, enabling accurate and semantically consistent 3D reconstruction of both contact and non-contact interactions.
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction: This paper proposes 4DSurf, a general-purpose dynamic scene surface reconstruction framework based on 2D Gaussian splatting. By introducing Gaussian motion-induced SDF flow regularization to constrain the temporally consistent evolution of surfaces, and adopting an overlapping segment partitioning strategy to handle large deformations, 4DSurf surpasses existing SOTA methods by 49% and 19% in Chamfer distance on the Hi4D and CMU Panoptic datasets, respectively.
TriLite: Efficient WSOL with Universal Visual Features and Tri-Region Disentanglement: TriLite employs a frozen DINOv2 ViT backbone with a lightweight TriHead module containing fewer than 800K trainable parameters. By disentangling patch features into foreground, background, and ambiguous regions, and introducing an adversarial background loss, the method achieves state-of-the-art WSOL performance with minimal parameter overhead.
UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos: This paper presents UniDex, a robot foundation suite comprising a large-scale dataset spanning 8 dexterous hands (50K+ trajectories / 9M frames), a Functionally-Aligned Actuator Space (FAAS), and a 3D VLA policy (UniDex-VLA). UniDex-VLA achieves 81% average task progress on real-world tool-use tasks (vs. 38% for π₀) and demonstrates spatial, object-level, and zero-shot cross-hand generalization.
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking: This paper proposes UniLS, the first end-to-end framework for unified speaking and listening facial expression generation. Through a two-stage training paradigm—first learning intrinsic motion priors without audio, then fine-tuning with dual-track audio—UniLS generates natural speaking and listening facial motions simultaneously from dual-track audio input alone, achieving up to 44.1% improvement on listening metrics.
Unleashing Vision-Language Semantics for Deepfake Video Detection: This paper proposes VLAForge, which employs a ForgePerceiver to independently learn diverse forgery cues and forgery localization maps, and integrates an identity-aware Vision-Language Alignment (VLA) scoring mechanism to unleash the cross-modal semantic potential of VLMs for enhanced deepfake video detection, achieving comprehensive state-of-the-art performance across 9 datasets.
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body: This paper proposes ViBES, a 3D conversational agent that unifies language, speech, and body motion via a Mixture of Modal Experts (MoME) architecture and cross-modal attention mechanisms. ViBES generates temporally aligned facial expressions and whole-body motions while preserving the conversational capabilities of a pretrained speech LLM, surpassing the paradigm that treats behavior as simple "modality translation."
Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification: VLADR leverages fine-grained attribute knowledge from vision-language models (VLMs) to enhance lifelong person re-identification. Through a two-stage training pipeline comprising Multi-grain Text Attribute Disentanglement (MTAD) and Inter-domain Cross-modal Attribute Reinforcement (ICAR), the framework explicitly models human body attributes shared across domains to enable effective knowledge transfer and forgetting mitigation, surpassing the state of the art by 1.9%–2.2% in anti-forgetting performance and 2.1%–2.5% in generalization.
WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering: This paper proposes WildCap, a hybrid inverse rendering framework that reconstructs high-quality 4K facial diffuse albedo maps from casual in-the-wild smartphone videos. The approach combines data-driven relighting (SwitchLight), model-based texel grid lighting optimization, and diffusion prior sampling, substantially closing the quality gap between in-the-wild capture and controlled-illumination methods.
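
As a rough illustration of the angular perturbation idea summarized in the IDperturb entry above, the sketch below rotates an identity embedding by a chosen angle toward a random direction on the unit hypersphere. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `perturb_on_sphere` and `angle_between`, the 512-dimensional embedding, the 30° cap, and the uniform angle sampling are all hypothetical choices made for this example.

```python
import math
import random


def perturb_on_sphere(embedding, angle_rad, rng):
    """Rotate `embedding` by `angle_rad` toward a random orthogonal direction.

    The result has unit norm and makes exactly `angle_rad` with the input.
    Hypothetical helper; requires an embedding with at least 2 dimensions.
    """
    norm = math.sqrt(sum(x * x for x in embedding))
    e = [x / norm for x in embedding]  # project the input onto the unit sphere
    # Draw a random Gaussian direction and remove its component along e.
    g = [rng.gauss(0.0, 1.0) for _ in e]
    dot = sum(gi * ei for gi, ei in zip(g, e))
    t = [gi - dot * ei for gi, ei in zip(g, e)]
    t_norm = math.sqrt(sum(x * x for x in t))
    u = [x / t_norm for x in t]  # unit tangent direction, orthogonal to e
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [c * ei + s * ui for ei, ui in zip(e, u)]


def angle_between(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))


# Usage: create several perturbed variants of one (random stand-in) identity.
rng = random.Random(0)
identity = [rng.gauss(0.0, 1.0) for _ in range(512)]
max_angle = math.radians(30.0)  # assumed cap, not a value from the paper
variants = [
    perturb_on_sphere(identity, rng.uniform(0.0, max_angle), rng)
    for _ in range(4)
]
```

Each variant could then condition an unmodified generator in place of the original embedding; how the perturbation angle is actually chosen is left to the paper.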

🎬 Video Generation

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos: This work introduces the first activity-level video forgery localization task and the large-scale ActivityForensics benchmark (6K+ forged clips). A grounding-assisted automated data construction pipeline is proposed to produce highly realistic activity manipulations, and a baseline method, Temporal Artifact Diffuser (TADiff), is presented to amplify forgery cues via diffusion-based feature regularization.
Anti-I2V: Safeguarding your photos from malicious image-to-video generation: Anti-I2V proposes a defense method against malicious image-to-video generation. By optimizing perturbations in both L*a*b* color space and the frequency domain, and designing Internal Representation Collapse (IRC) and Anchor (IRA) losses to disrupt semantic feature propagation within the denoising network, the method achieves state-of-the-art protection across three architecturally distinct models: CogVideoX, DynamiCrafter, and Open-Sora.
AutoCut: End-to-end Advertisement Video Editing Based on Multimodal Discretization and Controllable Generation: AutoCut proposes an end-to-end advertisement video editing framework that unifies video, audio, and text into a shared discrete token space via Residual Vector Quantization (RQVAE), performs multimodal alignment and supervised fine-tuning on Qwen3-8B, and enables four tasks—clip selection, ordering, script generation, and background music selection—within a single unified model, surpassing GPT-4o baselines across multiple metrics.
Chain of Event-Centric Causal Thought for Physically Plausible Video Generation: This work models physically plausible video generation (PPVG) as a sequence of causally connected events. It decomposes complex physical phenomena into ordered events via physics-formula-grounded event chain reasoning, then synthesizes semantic–visual dual conditions through transition-aware cross-modal prompting to guide a video diffusion model in generating videos that follow causal physical evolution.
Compressed-Domain-Aware Online Video Super-Resolution: CDA-VSR leverages compressed-domain information (motion vectors, residual maps, and frame types) to guide three key stages of online video super-resolution: motion-vector-guided deformable alignment for efficient and accurate registration, residual-map-gated fusion to suppress misalignment artifacts, and frame-type-aware reconstruction to adaptively allocate computation. The method achieves state-of-the-art PSNR on REDS4 at 93 FPS—more than twice the speed of prior SOTA.
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video: CubeComposer decomposes 360° video into a cubemap six-face representation and generates each face autoregressively in a spatio-temporal order, achieving for the first time native 4K (3840×1920) 360° panoramic video generation from perspective video without post-hoc super-resolution.
Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation: This paper proposes Diff4Splat, a feed-forward framework that unifies video diffusion models with deformable 3D Gaussian fields into an end-to-end trainable model, enabling direct generation of dynamic 4D scene representations from a single image in approximately 30 seconds — roughly 60× faster than optimization-based methods.
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching: DisCa is the first framework to unify learnable feature caching with step distillation in a compatible manner, replacing hand-crafted caching strategies with a lightweight neural predictor (<4% of model parameters). Combined with Restricted MeanFlow for stable large-scale video DiT distillation, DisCa achieves an 11.8× near-lossless speedup on HunyuanVideo.
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching: This paper proposes DisCa, the first framework to combine learnable feature caching with step distillation by replacing handcrafted caching strategies with lightweight neural predictors. It further introduces Restricted MeanFlow to stabilize large-scale video model distillation, achieving an 11.8× speedup on HunyuanVideo with negligible quality degradation.
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior: This paper proposes DreamShot, which leverages the spatiotemporal prior of video diffusion models to generate multi-shot storyboards with consistent characters and coherent scenes. A Role-Attention Consistency Loss (RACL) is introduced to address multi-character confusion, and three unified generation modes are supported: text-to-shot, reference-to-shot, and shot-to-shot.
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World: This paper proposes DriveLaW, a driving world model that unifies video generation and motion planning through a shared latent space. The intermediate latent features of the video generator are directly injected into a diffusion-based planner, achieving state-of-the-art performance simultaneously on the nuScenes video prediction benchmark and the NAVSIM planning benchmark.
FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters: FastLightGen proposes a three-stage distillation algorithm that, for the first time, achieves joint distillation of sampling steps and model size. By identifying redundant layers, applying dynamic probabilistic pruning, and performing distribution matching with well-guided teacher guidance, it compresses HunyuanVideo/WanX into a lightweight generator with 4 sampling steps and 30% of parameters pruned, achieving an approximately 35× speedup while surpassing the teacher model in performance.
First Frame Is the Place to Go for Video Content Customization: This paper identifies an intrinsic capability of video generation models to implicitly use the first frame as a "conceptual memory buffer" for storing and reusing multiple visual entities. Building on this observation, the authors propose FFGo—a lightweight LoRA adaptation method requiring only 20–50 training samples—that activates this capability without any architectural modification, enabling multi-reference video content customization. FFGo is rated best in 81.2% of cases in user studies.
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance: FlashMotion proposes a three-stage training framework that distills a multi-step trajectory-controllable video generation model into a few-step counterpart. By fine-tuning the adapter with a hybrid diffusion and adversarial objective, the method simultaneously preserves video quality and trajectory accuracy under few-step inference.
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance: FlashMotion is proposed as the first three-stage training framework for few-step (4-step) trajectory-controllable video generation. By sequentially training a trajectory adapter, distilling a fast generator, and fine-tuning the adapter via a hybrid adversarial and diffusion loss, the method surpasses existing multi-step approaches in both visual quality and trajectory accuracy under 4-step inference, achieving a 47× speedup.
Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction: FreeLOC proposes a training-free, layer-adaptive framework that identifies the differential sensitivity of each layer in video DiTs to two out-of-distribution (OOD) problems—frame-level relative position OOD and context length OOD—and selectively applies multi-granularity positional re-encoding (VRPR) and tiered sparse attention (TSA) to sensitive layers, achieving state-of-the-art long video generation quality without any additional training cost.
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning: This paper proposes Co-Settle, a framework that trains a lightweight linear projection layer on top of a frozen image-pretrained encoder. Using temporal cycle consistency loss and semantic separability constraints, the method achieves consistent improvements across multi-granularity video downstream tasks on 8 image foundation models with only 5 epochs of self-supervised training.
Generative Neural Video Compression via Video Diffusion Prior: This paper proposes GNVC-VD, the first DiT-based generative neural video compression framework. By leveraging a video diffusion transformer as a video-native generative prior, GNVC-VD performs spatiotemporal latent compression and sequence-level generative refinement within a unified codec. At extremely low bitrates (<0.03 bpp), it substantially surpasses both traditional and learned codecs in perceptual quality while significantly reducing the flickering artifacts prevalent in prior generative approaches.
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context: This paper proposes the Geometry-as-Context (GaC) framework, which replaces the non-differentiable operators (3D reconstruction + rendering) in reconstruction-based scene video generation with a unified autoregressive video generation model. By embedding geometric information (depth maps) as interleaved context into the generation sequence, GaC enables end-to-end training and mitigates accumulated errors.
Gloria: Consistent Character Video Generation via Content Anchors: Gloria introduces a compact set of "Content Anchors" to represent a character's multi-view appearance and expression identity. Through two key mechanisms—superset content anchoring (to prevent copy-paste artifacts) and RoPE weak conditioning (to distinguish multiple anchor frames)—the method enables consistent character video generation exceeding 10 minutes in duration.
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning: This paper proposes GenReward, a framework that fine-tunes a pre-trained video diffusion model to generate goal-conditioned videos, and derives two-level goal-driven reward signals—video-level and frame-level—to guide reinforcement learning agents without manually designed reward functions, achieving substantial improvements over baselines on Meta-World robotic manipulation tasks.
Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization: This paper proposes IPRO, which directly optimizes a video diffusion model via reinforcement learning and a differentiable facial identity scorer, significantly improving face identity consistency in image-to-video generation without modifying the model architecture, achieving 20%–45% FaceSim gains on Wan 2.2.
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout: This paper proposes ∞-RoPE, a training-free inference-time framework comprising three components — Block-Relativistic RoPE, KV Flush, and RoPE Cut — that extends an autoregressive video diffusion model trained solely on 5-second clips to support infinite-length generation, fine-grained action control, and cinematic scene transitions.
I'm a Map! Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers: This paper proposes IMAP (Interpretable Motion-Attentive Maps), a training-free framework that extracts spatio-temporal saliency maps for motion concepts from Video DiTs via two modules: GramCol for spatial localization and motion head selection for temporal localization. IMAP surpasses existing methods on motion localization and zero-shot video semantic segmentation benchmarks.
LAMP: Language-Assisted Motion Planning for Controllable Video Generation: LAMP frames motion control as a language-to-program synthesis problem: it designs a cinematography-inspired motion DSL, fine-tunes an LLM to translate natural language descriptions into structured motion programs, and deterministically maps these programs to 3D object and camera trajectories that condition a video diffusion model — achieving, for the first time, simultaneous natural-language control over both object and camera motion.
Let Your Image Move with Your Motion! – Implicit Multi-Object Multi-Motion Transfer: This paper proposes FlexiMMT, the first I2V framework supporting implicit multi-object multi-motion transfer. It introduces a Motion Decoupling Mask Attention (MDMA) mechanism to constrain motion/text tokens to interact only with their corresponding object regions, and a Differential Mask Extraction Mechanism (DMEM) to derive object masks from diffusion attention maps with progressive propagation, enabling precise compositional multi-object motion transfer.
Lighting-grounded Video Generation with Renderer-based Agent Reasoning: LiVER proposes a lighting-driven video generation framework that employs a renderer-based agent to translate textual descriptions into explicit 3D scene proxies (encompassing layout, lighting, and camera trajectories). Physical rendering is then used to produce diffuse/glossy/rough GGX scene proxies, which are injected into a video diffusion model to achieve physically accurate lighting effects and precise scene control.
LightMover: Generative Light Movement with Color and Intensity Controls: LightMover leverages video diffusion priors to model light source editing as a sequence-to-sequence prediction problem. Through a unified control token representation, it achieves precise manipulation of light source position, color, and intensity. An adaptive token pruning mechanism reduces control sequence length by 41%, and the method outperforms existing approaches on both light movement and object movement tasks.
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation: This paper proposes LinVideo, a data-free post-training framework that selectively replaces quadratic attention with linear attention in video diffusion models, achieving 1.43–1.71× speedup. Combined with distillation, the speedup reaches 15.9–20.9× while maintaining generation quality.
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation: LinVideo is the first data-free post-training framework that automatically identifies which layers are most amenable to linear attention substitution via selective transfer, and recovers model performance through an Arbitrary-timestep Distribution Matching (ADM) objective. It achieves 1.43–1.71× lossless speedup on Wan 1.3B/14B, and up to 15.9–20.9× speedup when combined with 4-step distillation.
MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer: MoVieDrive is the first method to simultaneously generate tri-modal (RGB, depth, and semantic) multi-view driving scene videos within a unified DiT framework. Through a decomposed design of modal-shared layers (temporal + multi-view spatiotemporal attention) and modal-specific layers (cross-modal interaction + projection heads), a unified layout encoder, and diverse conditioning, the method achieves FVD 46.8 on nuScenes (22% improvement over CogVideoX+SyntheOcc), depth AbsRel 0.110, and semantic mIoU 37.5, outperforming pipelines based on separate model generation and estimation.
MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer: MoVieDrive proposes a unified multi-modal multi-view video diffusion Transformer that simultaneously generates RGB video, depth maps, and semantic maps within a single model via a two-level modal-shared + modal-specific architecture. Combined with diverse conditioning inputs (text, layout, contextual reference), it achieves FVD 46.8 (SOTA) on nuScenes while producing cross-modally consistent, high-quality driving scene synthesis.
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos: NeoVerse proposes a scalable 4D world model that enables the entire training pipeline to leverage large-scale in-the-wild monocular videos (millions of clips) via feed-forward pose-free 4DGS reconstruction and online monocular degradation simulation, achieving state-of-the-art performance on both 4D reconstruction and novel-trajectory video generation.
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing: This paper proposes NOVA, which formalizes for the first time the "sparse control, dense synthesis" paradigm for video editing: a sparse branch provides semantic guidance from user-edited multi-keyframes, while a dense branch injects motion and texture information from the original video. Combined with a degradation simulation training strategy, NOVA achieves learning without paired data and comprehensively outperforms existing methods in editing fidelity, motion preservation, and temporal consistency.
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors: This paper proposes leveraging the latent features of a 3D foundation generative model (Hunyuan3D) as shape priors, injecting them into a base video diffusion model via a multi-scale 3D adapter, to generate geometrically realistic and view-consistent orbital videos from a single image.
PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation: This paper proposes PAM — the first engine capable of generating realistic hand-object interaction (HOI) videos given only initial/target hand poses and object geometry. Through a three-stage decoupled architecture of pose generation, appearance generation, and motion generation, PAM achieves FVD 29.13 (vs. InterDyn 38.83) and MPJPE 19.37 mm (vs. CosHand 30.05 mm) on DexYCB. The generated synthetic data also effectively augments downstream hand pose estimation tasks.
PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing: PerformRecast presents a GAN-based portrait video editing method built upon a corrected 3DMM keypoint transformation formulation. By applying expression deformation before head rotation — consistent with the FLAME model — the method achieves precise disentanglement of expression and head pose. A Boundary Alignment Module (BAM) is further introduced to address stitching misalignment between facial and non-facial regions. The approach substantially outperforms existing methods under both expression replacement and expression enhancement modes.
Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics: This paper proposes Phantom, a framework that augments a pretrained video diffusion model (Wan2.2-TI2V) with a dedicated physical dynamics branch. Physics-aware embeddings extracted by V-JEPA2 serve as latent physical states, and bidirectional cross-attention is employed to jointly model visual content and physical dynamics evolution. Phantom achieves substantial improvements over baselines on physics consistency benchmarks (VideoPhy PC +50.4%) while preserving visual quality.
Physical Simulator In-the-Loop Video Generation: This paper proposes PSIVG — the first training-free inference-time framework that embeds a physical simulator into the video diffusion generation loop. It reconstructs a 4D scene and object meshes from a template video, generates physically consistent trajectories via an MPM simulator, guides video generation using optical flow, and enforces texture consistency of moving objects through Test-Time Consistency Optimization (TTCO), achieving a user preference rate of 82.3%.
PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation: PoseGen achieves dual condition injection (token-level appearance + channel-level pose) via in-context LoRA finetuning, and proposes a segmented interleaved generation strategy (KV sharing + pose-aware frame interpolation) to generate high-fidelity long human videos using only 33 hours of training data.
Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation: This paper proposes PoCo (Position Embedding as Context Controller), which introduces an additional SideInfo axis in RoPE to encode reference entity information, addressing the "reference confusion" problem in multi-reference multi-shot video generation—where the model fails to correctly associate shots with references when reference images are visually similar. PoCo achieves state-of-the-art cross-shot consistency on the VACE-Wan2.1-14B framework (CrossShot-FaceSim 89.35, CrossShot-DINO 92.66).
SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation: SeeU is a 2D→4D→2D learning framework that reconstructs a 4D world representation from sparse monocular 2D frames, learns continuous and physically consistent 4D dynamics on a low-rank representation (B-spline parameterization + physical constraints), and reprojects the 4D world back to 2D, completing unseen regions with a spatiotemporally context-aware video generator—enabling generation of unseen visual content across time and space.
Semantic Satellite Communications for Synchronized Audiovisual Reconstruction: This paper proposes an adaptive multimodal semantic satellite transmission system that flexibly switches transmission priorities via a dual-stream generative architecture (video-driven audio / audio-driven video), combined with a dynamic knowledge base update mechanism and an LLM agent for adaptive decision-making, achieving high-fidelity synchronized audiovisual reconstruction under stringent bandwidth constraints.
Semantic Satellite Communications for Synchronized Audiovisual Reconstruction: This paper proposes an adaptive multimodal semantic transmission system for satellite communications. A dual-stream generative architecture (video-driven audio / audio-driven video) enables dynamic modality priority switching, combined with a dynamic knowledge base update mechanism and an LLM agent decision module, achieving high-fidelity synchronized audiovisual reconstruction under severe bandwidth constraints.
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation: This paper proposes SLVMEval, a meta-evaluation benchmark that synthesizes controlled degradations to construct "high-quality vs. low-quality" video pairs (up to ~3 hours) from densely captioned video datasets, and tests whether existing T2V evaluation systems can distinguish long-video quality differences. Human annotators achieve 84.7%–96.8% accuracy across 10 dimensions, whereas existing automatic evaluation systems fall behind humans on 9 out of 10 dimensions.
StreamDiT: Real-Time Streaming Text-to-Video Generation: StreamDiT presents a complete streaming video generation pipeline—covering training, modeling, and distillation—that introduces a sliding buffer with progressive denoising under Flow Matching, a mixed partitioning training strategy, a time-varying DiT architecture with windowed attention, and a customized multi-step distillation method. The resulting 4B-parameter model achieves real-time streaming video generation at 512p@16FPS on a single GPU.
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution: SWIFT introduces the novel task of "few-shot training-free generated video attribution," exploiting the temporal mapping property of 3D VAEs — where \(K\) pixel frames correspond to a single latent frame — by performing two reconstructions (normal and corrupted) via fixed-length sliding windows. The ratio of reconstruction losses over overlapping frames serves as the attribution signal. Using only 20 samples, SWIFT achieves over 90% average attribution accuracy, with a 5-model average of 94%.
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls: This paper proposes SwitchCraft, a training-free multi-event video generation framework that achieves clear temporal transitions and scene consistency without modifying model weights, via Event-Aligned Query Steering (EAQS) to align frame-level attention to corresponding event prompts, and Auto-Balance Strength Solver (ABSS) to adaptively balance guidance strength.
SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation: SymphoMotion is a unified motion control framework that simultaneously and precisely controls camera motion and object 3D trajectories in video generation via two mechanisms — Camera Trajectory Control (CTC) and Object Dynamics Control (ODC) — alongside a large-scale real-world jointly annotated dataset, RealCOD-25K, containing 25K samples.
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models: This paper proposes TEAR, the first automated red-teaming framework targeting temporal-dimension vulnerabilities in text-to-video (T2V) models. Through a two-stage optimized temporal-aware test generator and an iterative refinement model, TEAR generates textually benign prompts that exploit temporal dynamics to elicit harmful videos, achieving attack success rates (ASR) exceeding 80% on both open-source and commercial T2V models.
The Devil is in the Details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection: This paper proposes KeyTailor, a framework that employs a keyframe-driven details injection strategy—comprising garment dynamic enhancement and collaborative background optimization—to substantially improve garment fidelity and background consistency in video virtual try-on without modifying the DiT architecture. A 15K high-resolution dataset, ViT-HD, is also released.
Training-free Motion Factorization for Compositional Video Generation: This paper proposes a motion factorization framework that decomposes multi-instance scene motion into three categories — stationary, rigid-body, and non-rigid motion — and addresses prompt semantic ambiguity via Structured Motion Reasoning (SMR) while steering the generation of each motion category during diffusion through Decoupled Motion Guidance (DMG). The framework requires no additional training and achieves substantial improvements in motion diversity and fidelity on VideoCrafter-v2.0 and CogVideoX-2B.
U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation: This paper proposes U-Mind, the first unified real-time full-stack multimodal interaction system supporting high-level reasoning dialogue and instruction following. Within a single interaction loop, the system jointly generates text, speech, and motion, and renders them into photorealistic video. Rehearsal-driven learning and a text-first decoding strategy are introduced to balance reasoning preservation with cross-modal alignment.
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions: UniAVGen proposes a joint audio-video generation framework built on a symmetric dual-branch DiT, achieving precise spatiotemporal synchronization through an asymmetric cross-modal interaction mechanism and a Face-Aware Modulation module. With only 1.3M training samples, it outperforms competitors trained on 30M data across lip-audio synchronization, timbre consistency, and emotional consistency.
Unified Camera Positional Encoding for Controlled Video Generation: This paper proposes Unified Camera Positional Encoding (UCPE), which injects complete camera geometric information (6-DoF pose, intrinsics, and lens distortion) into Transformer attention mechanisms via relative ray encoding and absolute orientation encoding. UCPE enables fine-grained video generation control across heterogeneous camera models while introducing less than 1% additional trainable parameters.
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation: UniTalking is proposed as an end-to-end talking portrait generation framework built upon MM-DiT. Through a joint attention mechanism within a dual-stream symmetric architecture, it explicitly models fine-grained temporal correspondences between audio and video tokens, achieving state-of-the-art lip-audio synchronization accuracy while supporting personalized voice cloning.
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision: Vanast proposes a unified framework that simultaneously performs garment transfer and human image animation within a single stage via a Dual Module architecture (HAM + GTM) and a three-stage synthetic data construction pipeline. On the Internet dataset, it achieves a PSNR of 17.95 dB (+5.5 dB over the best two-stage baseline) and an LPIPS of 0.237.
VideoCoF: Unified Video Editing with Temporal Reasoner: This paper proposes VideoCoF, a Chain-of-Thought-inspired "observe→reason→edit" video editing framework. By prompting a video diffusion model to first predict reasoning tokens (grayscale-highlighted region latents) before generating the target video tokens, VideoCoF achieves precise instruction-region alignment without requiring user-provided masks. Trained on only 50K video pairs, it achieves state-of-the-art performance and supports video extrapolation up to 16× the training length.
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models: NUMINA proposes an identify-then-guide paradigm that, without retraining the video diffusion model, extracts a countable instance layout from DiT attention maps during inference, detects inconsistencies between numeric tokens and the current layout, applies conservative layout modifications, and uses the revised layout to guide regeneration—substantially improving adherence to quantity constraints such as "two apples" or "eight ducks" in text-to-video models.
When to Lock Attention: Training-Free KV Control in Video Diffusion: This paper proposes KV-Lock, a training-free framework that dynamically schedules background KV cache fusion ratios and CFG guidance strength based on diffusion hallucination detection, simultaneously ensuring background consistency and foreground generation quality in video editing.

📦 Model Compression

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation: This paper proposes 4D-RGPT and the Perceptual 4D Distillation (P4D) framework, which enhances 4D perception in MLLMs by distilling knowledge of depth and optical flow from frozen 4D perceptual expert models. It also introduces R4D-Bench, the first region-level 4D video question-answering benchmark.
A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos: This paper proposes the first fully end-to-end framework for Temporal Sentence Grounding in Videos (TSGV). A Sentence-Conditioned Adapter (SCADA) is introduced to inject sentence embeddings into intermediate layers of the video backbone, dynamically modulating visual features. Combined with a video-centric learning strategy to accelerate training, the method surpasses state-of-the-art performance on Charades-STA and ActivityNet.
Adversarial Concept Distillation for One-Step Diffusion Personalization: OPAD is the first work to address personalization for one-step diffusion models (1-SDP). It achieves single-step high-quality concept generation via joint teacher–student training, alignment loss, and adversarial supervision, and further introduces a collaborative learning phase in which samples generated by the student are fed back to benefit both parties.
An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS: This paper presents the first FPGA hardware acceleration architecture for the Intra Pattern Copy (IPC) tool in the JPEG XS standard. Through a four-stage pipelined DV comparison engine and IPC Group-aligned memory organization, the design achieves 38.3 Mpixels/s throughput and 277 mW power consumption on an Artix-7 FPGA.
An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS: To address the computational bottleneck of Displacement Vector (DV) search in the Intra Pattern Copy (IPC) module for JPEG XS screen content coding, this paper proposes the first four-stage pipeline FPGA architecture and designs an IPC Group-aligned memory organization scheme. Implemented on a Xilinx Artix-7, the design achieves a throughput of 38.3 Mpixels/s at 277 mW power consumption, providing a viable solution for practical hardware deployment of IPC.
ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation: ARCHE is a fully convolutional architecture that unifies hierarchical hyperprior, Masked PixelCNN spatial autoregression, channel-conditional modeling, and SE channel excitation, without relying on Transformers or recurrent components. With 95M parameters and a 222 ms decoding time, it achieves a 48% BD-Rate reduction over the Ballé baseline and outperforms VVC Intra by 5.6%.
ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation: This paper proposes ARCHE, an end-to-end image compression framework that, within a purely convolutional architecture free of Transformers and recurrent modules, integrates five complementary components — hierarchical hyperprior, Masked PixelCNN spatial autoregressive context, channel conditioning, SE channel recalibration, and latent residual prediction — achieving a 48% BD-Rate reduction over the Ballé baseline and −5.6% over VVC Intra on Kodak, with only 95M parameters and 222ms decoding time.
Batch Loss Score for Dynamic Data Pruning: This paper proposes Batch Loss Score (BLS), a method that estimates sample importance using only the mean batch loss — which is universally available — rather than per-sample loss, which is difficult to obtain in practice. Grounded in a signal-processing perspective via EMA-based low-pass filtering, BLS offers theoretical guarantees and can be integrated into existing dynamic pruning frameworks with just 3 lines of code.
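The idea of scoring samples from the batch mean loss alone can be sketched as follows. This is only an illustration under stated assumptions: the update rule (an exponential moving average applied to every sample that appeared in the batch), the function name `update_scores`, and the smoothing factor `beta` are hypothetical and may differ from what the paper actually does.

```python
def update_scores(scores, batch_indices, batch_mean_loss, beta=0.9):
    """EMA-update the score of every sample in the batch with the batch's mean loss.

    Hypothetical rule for illustration; no per-sample loss is needed.
    """
    for i in batch_indices:
        prev = scores.get(i, batch_mean_loss)  # first sighting: start at the batch loss
        scores[i] = beta * prev + (1.0 - beta) * batch_mean_loss
    return scores


scores = {}
update_scores(scores, [0, 1], batch_mean_loss=2.0)
update_scores(scores, [1, 2], batch_mean_loss=1.0)
```

Because a sample lands in different random batches over training, the EMA acts as a low-pass filter over the noisy batch-level signal; a dynamic pruning framework could then rank samples by `scores` in place of per-sample losses.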
Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment: This paper proposes AlignPrune—a plug-and-play module based on loss trajectory alignment—that replaces conventional loss-value ranking with a Dynamic Alignment Score (DAS), achieving up to 6.3% accuracy improvement over standard dynamic data pruning methods under noisy label settings.
Bilevel Layer-Positioning LoRA for Real Image Dehazing: This paper proposes BiLaLoRA, which employs bilevel optimization to automatically identify the optimal network layers for LoRA insertion, coupled with H2C Loss — an unsupervised dehazing loss based on CLIP semantic directions — to efficiently adapt synthetic-data-pretrained dehazing models to real-world scenes. The approach reduces training time by 77.7% while matching full fine-tuning performance, and generalizes across models and domains.
Bilevel Layer-Positioning LoRA for Real Image Dehazing: This work leverages CLIP's cross-modal capability to reformulate dehazing as a semantic alignment problem via the H2C loss, and employs bilevel optimization to automatically identify optimal LoRA injection layers (BiLaLoRA), enabling plug-and-play, parameter-efficient synthetic-to-real domain adaptation for image dehazing.
BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers: This paper proposes BinaryAttention, which quantizes Query and Key in Transformer attention to 1-bit binary representations and replaces floating-point dot products with XNOR + popcount bitwise operations, achieving over 2× speedup over FlashAttention2 on A100 GPUs while matching or surpassing full-precision attention across vision classification, detection, segmentation, and diffusion generation tasks.
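A minimal sketch of why XNOR plus popcount can replace a floating-point dot product once Query and Key are reduced to signs: for two vectors in {-1, +1}^d packed as bitmasks, the dot product equals 2 · popcount(XNOR(a, b)) − d. The packing scheme and function names below are illustrative assumptions, and the paper's scaling factors and GPU kernels are not reproduced.

```python
def pack_signs(values):
    """Pack a sequence of reals into an int bitmask: bit i is 1 when values[i] >= 0."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, dim):
    """Dot product of two {-1, +1}^dim vectors given as bitmasks, via XNOR + popcount."""
    mask = (1 << dim) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # 1 wherever the two signs agree
    agree = bin(xnor).count("1")       # popcount
    return 2 * agree - dim             # agreements minus disagreements


def sign_dot(q, k):
    """Reference: dot product after mapping each entry to +1 or -1."""
    def sgn(v):
        return 1 if v >= 0 else -1
    return sum(sgn(a) * sgn(b) for a, b in zip(q, k))


q = [0.3, -1.2, 0.7, 0.0, -0.5, 2.1, -0.1, 0.9]
k = [-0.4, -0.8, 1.5, 0.2, 0.6, 1.1, -2.0, -0.3]
score = binary_dot(pack_signs(q), pack_signs(k), len(q))
```

Here `score` matches `sign_dot(q, k)`, so the attention logit between binarized vectors can be computed with bitwise operations only.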
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge: This paper proposes CPS-Prompt, a framework that combines task-aware critical patch sampling (CPS) and decoupled prompt-classifier training (DPCT) to achieve approximately 1.6× reduction in training-time memory and computation for prompt-based continual learning on edge devices, with only ~2% accuracy degradation.
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation: This paper proposes DAGE, a dual-stream Transformer architecture that decouples global consistency modeling (low-resolution stream) from fine-grained detail preservation (high-resolution stream), fusing them via a lightweight Cross-Attention Adapter. DAGE achieves high-quality depth/point map estimation and camera pose prediction at 2K resolution and over 1000-frame sequences, running 2–28× faster than Pi3 and establishing a new state of the art on video geometry estimation.
Distilling Balanced Knowledge from a Biased Teacher: To address the head-class bias of teacher models in knowledge distillation under long-tailed distributions, this paper decomposes the conventional KL divergence loss into a cross-group component and a within-group component. By rebalancing the cross-group loss to calibrate the teacher's group-level predictions and reweighting the within-group loss to ensure equal contribution across groups, the proposed method consistently outperforms existing approaches on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT — and even surpasses the teacher model itself.
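The split into a cross-group and a within-group component corresponds to the chain rule for KL divergence: when classes are partitioned into groups, KL(p‖q) equals the KL between group-level masses plus the p-weighted sum of KLs between the within-group conditionals. The sketch below checks this identity numerically on toy distributions; the head/tail partition and the probabilities are made up, and the paper's rebalancing and reweighting of the two terms are not shown.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def decompose_kl(p, q, groups):
    """Split KL(p || q) into a cross-group term and a within-group term.

    groups: list of index lists that partition the classes (e.g. head / tail).
    """
    p_g = [sum(p[i] for i in g) for g in groups]  # group-level mass under p
    q_g = [sum(q[i] for i in g) for g in groups]  # group-level mass under q
    cross = kl(p_g, q_g)
    within = 0.0
    for g, pg, qg in zip(groups, p_g, q_g):
        p_cond = [p[i] / pg for i in g]           # p conditioned on the group
        q_cond = [q[i] / qg for i in g]           # q conditioned on the group
        within += pg * kl(p_cond, q_cond)
    return cross, within


teacher = [0.50, 0.30, 0.15, 0.05]  # toy teacher probabilities
student = [0.40, 0.25, 0.20, 0.15]  # toy student probabilities
groups = [[0, 1], [2, 3]]           # hypothetical head / tail split
cross, within = decompose_kl(teacher, student, groups)
```

Since `cross + within` reproduces the full KL, each term can be weighted separately without losing any part of the original distillation signal.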
DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration: DualReg proposes a dual-space registration paradigm that progressively filters feature-space correspondences via lightweight 1-point RANSAC followed by 3-point RANSAC, then constructs geometric proxy point sets from the filtered anchor correspondences for joint dual-space optimization. The method achieves state-of-the-art accuracy on 3DMatch while running 32× faster than MAC.
Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling: This paper proposes Cluster-aware Upcycling, which extracts semantic structure from a dense model via spherical k-means clustering to initialize expert and router parameters in MoE, thereby breaking expert symmetry and promoting early specialization. Combined with an Expert Ensemble Self-Distillation (EESD) loss, the method consistently outperforms existing upcycling approaches on CLIP ViT benchmarks.
FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection: This paper proposes FAAR, a frequency-aware parameter-efficient fine-tuning method for multi-task learning. It introduces Performance-Driven Rank Shrinking (PDRS) to dynamically select the optimal rank for each task and layer, and designs a Task-Spectral Pyramidal Decoder (TS-PD) that leverages FFT frequency information to enhance spatial awareness and cross-task consistency. FAAR achieves superior performance using only 1/9 the parameters of full fine-tuning.
FAIR-Pruner: Leveraging Tolerance of Difference for Flexible Automatic Layer-Wise Neural Network Pruning: FAIR-Pruner is a structured pruning framework that introduces the Tolerance of Differences (ToD) metric to reconcile two complementary perspectives: the Wasserstein Utilization Score (U-Score), which identifies redundant units based on class-conditional separability, and the Taylor-based Reconstruction Score (R-Score), which protects critical units. The framework automatically determines non-uniform per-layer pruning ratios and supports search-free flexible compression ratio adjustment, achieving state-of-the-art results on CIFAR-10, SVHN, and ImageNet.
Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation: RETA decouples two failure modes in residual matching for dataset distillation—the fit-complexity gap and the pull-to-anchor effect—by employing Dynamic Retrieval Connection (DRC) to adaptively select real patch anchors and Persistent Topology Alignment (PTA) to preserve intra-class diversity. The method achieves 64.3% (+3.1% vs. FADRM) on ImageNet-1K with ResNet-18 at IPC=50.
FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention: By replacing the global self-attention in VGGT with descriptor-based cross-attention, FlashVGGT reduces inference time on 1000 images to 9.3% of VGGT while maintaining competitive reconstruction accuracy, and scales to sequences of 3000+ images.
FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation: This paper proposes FOZO, a forward-only zeroth-order prompt optimization paradigm that updates prompts via SPSA gradient estimation, a dynamic perturbation strategy, and shallow–deep feature statistics alignment—without modifying model weights. FOZO achieves 59.52% accuracy on ImageNet-C, surpassing all forward-only methods including FOA (58.13%), and supports INT8 quantized models.
Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning: Free Sinewich proposes a parameter-efficient multi-task learning framework based on frequency switching. By applying task-specific sinusoidal transformations \(M_t = \sin(\omega_t \cdot M_{AWB})\) to a shared low-rank base matrix, the method achieves genuine parameter reuse and task specialization at near-zero cost, attaining state-of-the-art performance on dense prediction benchmarks with the fewest trainable parameters.
From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness: This paper proposes QuADD, a framework that embeds a differentiable quantization module into the dataset distillation loop to jointly optimize synthetic data and quantization parameters, achieving Pareto-optimal compression of "fewer samples + lower precision" under a fixed bit budget.
Generative Video Compression with One-Dimensional Latent Representation: This paper proposes GVC1D, which for the first time replaces the 2D grid latent representation in video compression with a compact 1D token sequence. Combined with a 1D memory module for modeling long-term temporal context, GVC1D achieves over 60% bitrate savings on perceptual quality metrics.
GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design: This paper proposes GeoFusion-CAD, an end-to-end diffusion framework that encodes CAD programs as hierarchical tree structures and introduces a geometry-aware G-Mamba block with linear time complexity to replace quadratic-complexity Transformers, enabling scalable and structure-aware generation of long-sequence parametric CAD programs. The method substantially outperforms Transformer-based approaches on the newly constructed DeepCAD-240 benchmark (up to 240-step commands).
HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers: HiAP formulates ViT pruning as an end-to-end budget-aware learning problem, applying stochastic differentiable gating simultaneously at two granularities—entire heads/blocks (macro) and intra-head value dimensions/FFN neurons (micro)—to automatically discover a compact dense subnetwork satisfying a compute budget within a single training run, eliminating the need for importance ranking, threshold search, and separate fine-tuning.
HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers: This paper proposes HiAP, a hierarchical Gumbel-Sigmoid gating framework that unifies macro-level (entire attention heads / FFN blocks) and micro-level (intra-head dimensions / FFN neurons) pruning decisions. Through a single end-to-end training pass, HiAP automatically discovers efficient ViT subnetworks satisfying a given compute budget, eliminating the need for manual importance ranking or multi-stage pipelines.
HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation: This paper proposes HierAmp, which injects learnable class tokens at each scale of the coarse-to-fine generation process of a Visual AutoRegressive (VAR) model to identify semantically salient regions, and amplifies attention to these regions via positive logit biasing. This enables distilled data to acquire richer and more diverse layouts at coarse scales while focusing on class-relevant details at fine scales, achieving state-of-the-art performance across multiple dataset distillation benchmarks.
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning: This paper proposes IAPL (Image-Adaptive Prompt Learning), which introduces dynamic prompts at the input of a CLIP encoder. These prompts are generated via two complementary pathways: a Conditional Information Learner (extracting forgery-specific and generic cues from texture-rich regions) and test-time token tuning (minimizing entropy through multi-view consistency). The model adaptively adjusts to each test image at inference time, achieving significantly improved detection generalization on unseen generators.
Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery: This paper proposes the LTC framework, which uses MKEE (Minimize Kernel Energy + Maximize Entropy) to generate pseudo-unknown class samples online during training. Combined with a dual max-margin loss and an adaptive threshold mechanism, LTC improves all-class accuracy by 1.5%–13.1% across 7 datasets while entirely eliminating the semantic degradation caused by hash encoding.
LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration: LLaVA-LE is the first vision-language model tailored for lunar exploration. By constructing LUCID, a large-scale real lunar image-text dataset (96K images + 81K QA pairs), and applying two-stage curriculum fine-tuning on LLaVA, the model achieves a 3.3× improvement over the baseline on lunar geological understanding and multimodal reasoning.
MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis: This paper proposes MaMe, a training-free and differentiable token merging method based on fully matrix-based operations, along with its inverse operation MaRe for token restoration, achieving efficient acceleration with minimal performance degradation across image classification, video recognition, and image generation tasks.
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation: This work reformulates the visual autoregressive model (VAR) from a full-context-dependent next-scale prediction paradigm into a Markovian scale prediction process. By introducing a sliding-window history compensation mechanism for non-full-context modeling, the method achieves a 10.5% FID reduction and 83.8% peak memory reduction on ImageNet.
MARVO: Marine-Adaptive Radiance-aware Visual Odometry: MARVO is an underwater visual odometry framework that embeds a Physics-Aware Radiance Adapter (PARA) into the LoFTR feature matcher to compensate for wavelength-dependent attenuation, integrates GTSAM multi-sensor factor graph fusion, and employs reinforcement learning-based pose graph optimization (RL-PGO), achieving robust localization in underwater scenes.
MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction: This paper proposes MEMO, a framework that generates crisp single-pixel edge maps using only cross-entropy loss, achieved through masked edge training and a confidence-ordered progressive inference strategy. MEMO substantially outperforms prior methods on crispness-aware evaluation (CEval ODS on BSDS improves from 0.749 to 0.836).
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation: MDPD proposes an efficient fine-tuning framework based on bidirectional knowledge distillation between a frozen backbone and a lightweight side network. Upon training completion, the side network is discarded, achieving both parameter- and memory-efficient training as well as inference-time acceleration.
On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors: This paper presents the first systematic study of bit-flip robustness in diffusion-based image compression. It demonstrates that reverse channel coding (RCC)-based diffusion compression methods are inherently more resilient to bit-flip errors than traditional and learned codecs, and proposes Robust Turbo-DDCM, which independently encodes each atom index to further enhance robustness. At BER \(10^{-3}\), the proposed method maintains high reconstruction quality with only a marginal increase in BPP.
OPAD: Adversarial Concept Distillation for One-Step Diffusion Personalization: OPAD is the first work to address one-step diffusion model personalization (1-SDP). It achieves reliable single-step personalized generation via joint teacher–student training, alignment losses, and adversarial supervision, and further proposes a collaborative learning stage in which the student's efficiently generated samples are fed back to improve the teacher.
Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression: This paper proposes the OmniParallax Attention Mechanism (OPAM) for Distributed Multi-view Image Compression (DMIC), which explicitly models inter-view correlations and aligned features between arbitrary view pairs via a two-stage parallax attention mechanism. The resulting ParaHydra framework is the first DMIC method to significantly outperform state-of-the-art MIC encoders while substantially reducing computational overhead.
PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching: PlanaReLoc is the first camera relocalization paradigm centered on planar primitives and 3D planar maps. A deep matcher associates planar regions extracted from query images with map planar primitives in a unified embedding space, achieving lightweight 6-DoF camera relocalization without requiring textured maps, pose priors, or per-scene training.
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model: This paper proposes CompACT, which compresses each image into only 8 discrete tokens (~128 bits) by leveraging a frozen pretrained visual encoder to retain planning-critical semantic information, while a generative decoder supplements perceptual details. This achieves approximately 40× speedup in world-model-based planning with no loss in accuracy.
PPCL: Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers: This paper proposes PPCL, a structured pruning framework tailored for large-scale Multi-Modal Diffusion Transformers (MMDiT, 8–20B parameters). It trains linear probes to assess the substitutability of each layer, automatically localizes contiguous redundant layer intervals via first-order differences of CKA, and applies non-sequential alternating distillation for dual-axis pruning along depth and width. On Qwen-Image 20B, PPCL achieves 50% parameter reduction and 1.8× inference speedup with an average performance drop of only 2.61%.
Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy: This paper revisits the LoRA merging problem through two complementary lenses—subspace coverage and directional anisotropy—and proposes the TARA-Merging framework. By retaining LoRA directions to preserve subspace coverage and applying preference-weighted cross-entropy pseudo-loss for direction-level reweighting, TARA consistently outperforms existing merging methods across 8 vision and 6 NLI benchmarks.
PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild: PriVi constructs a large-scale primate video pretraining dataset of 424 hours and performs domain-level pretraining (rather than dataset-level pretraining) on V-JEPA. This work is the first to demonstrate that domain-level pretraining of video models generalizes across datasets, surpassing fully fine-tuned specialized models on four primate behavior recognition benchmarks using a frozen classifier with only 220K parameters.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models: This paper proposes QuantVLA, the first training-free post-training quantization (PTQ) framework for Vision-Language-Action (VLA) models. Through a selective quantization layout and two lightweight calibration mechanisms—Attention Temperature Matching (ATM) and Output Head Balancing (OHB)—QuantVLA achieves approximately 70% memory reduction at W4A8 precision while surpassing the task success rate of the full-precision baseline.
RDVQ: Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression: RDVQ introduces a differentiable relaxation over the codebook distribution, enabling for the first time end-to-end rate-distortion joint optimization for VQ-based image compression. At extremely low bitrates, the method achieves superior or competitive perceptual quality with less than 20% of the parameters of prior approaches.
RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment: This paper proposes RL-ScanIQA, the first end-to-end reinforcement learning-based blind 360° image quality assessment (BIQA) framework. The core idea is to formulate scanpath generation as a sequential decision-making process, using a PPO policy to learn task-driven viewing strategies directly from quality assessment feedback, rather than relying on imitation learning from human gaze data. The framework consists of two jointly optimized modules—a scanpath generator and a quality assessor—augmented with multi-level rewards (step-level exploration, set-level diversity, and task-aligned perception) and distortion-space data augmentation. The method achieves state-of-the-art performance and strong cross-dataset generalization on three benchmarks: CVIQD, OIQA, and JUFE.
SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer: SODA is a training-free acceleration method for Diffusion Transformers that delivers high-fidelity generation at a controllable speedup via offline fine-grained sensitivity modeling, dynamic-programming-based cache schedule optimization, and a unified adaptive pruning strategy.
TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery: This paper proposes TALON, the first test-time adaptive framework for On-the-Fly Category Discovery (OCD). By combining semantics-aware prototype updating, stable encoder adaptation, and margin-aware logit calibration, TALON operates directly in continuous feature space without hash encoding, substantially alleviating category explosion and significantly improving novel category discovery accuracy.
F²HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling: This paper proposes F²HDR, a two-stage HDR video reconstruction framework that adapts general-purpose pre-trained optical flow to alternating-exposure scenes via a Flow Adapter for robust alignment, and employs physical motion modeling to extract continuous motion masks from optical flow to guide artifact removal in the second stage, achieving state-of-the-art performance on real-world HDR video benchmarks.
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning: This paper proposes Image-Adaptive Prompt Learning (IAPL), which dynamically adjusts the prompts of the CLIP encoder for each test image at inference time. Through test-time token tuning and a conditional information learner, IAPL achieves strong generalization to unseen generators, attaining state-of-the-art average accuracies of 95.61% and 96.7% on UniversalFakeDetect and GenImage, respectively.
Towards Source-Aware Object Swapping with Initial Noise Perturbation: This paper proposes SourceSwap, which generates high-quality pseudo-paired data from single images via frequency-separated initial noise perturbation, and employs a source-aware dual U-Net architecture to learn cross-object alignment, enabling zero-shot, high-fidelity object swapping without per-object fine-tuning.
Understanding and Enforcing Weight Disentanglement in Task Arithmetic: This paper proposes Task Feature Specialization (TFS) as a sufficient condition for weight disentanglement, reveals that its geometric consequence is weight vector orthogonality, and introduces OrthoReg — a regularization method that enforces column-wise orthogonality of weight update matrices during fine-tuning to promote task vector disentanglement, substantially improving the performance of various task arithmetic methods.
UniComp: Rethinking Video Compression Through Informational Uniqueness: This paper proposes UniComp, a video token compression framework grounded in informational uniqueness rather than attention scores. Through three modules—Frame Group Fusion, Token Allocation, and Spatial Dynamic Compression—UniComp maximally preserves unique information across temporal, spatial, and global dimensions, surpassing the uncompressed baseline even when retaining only 10% of tokens.
Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation: This paper proposes a fully automated pipeline that leverages self-supervised ViT features for unsupervised object discovery, generating spatially grounded multi-label annotations for all 1.28 million ImageNet-1K training images without human annotation. Models trained with these labels achieve consistent gains on both in-domain and downstream multi-label tasks (ReaL +2.0 top-1, COCO +4.2 mAP).
WPT: World-to-Policy Transfer via Online World Model Distillation: WPT proposes a world-to-policy transfer training paradigm that injects future-predictive knowledge from a world model into a teacher policy via a trainable reward model, then transfers this knowledge to a lightweight student policy through policy distillation and world reward distillation, achieving a closed-loop driving score of 79.23 with a 4.9× inference speedup.
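The cross-group and within-group split described in the Distilling Balanced Knowledge note above rests on the chain rule for KL divergence. The following is a minimal sketch of that identity only, assuming the classes are partitioned into disjoint groups (for example head and tail). The function names, the toy distributions, and the two-group partition are illustrative assumptions rather than details from the paper, and the paper's rebalancing and reweighting steps are not shown.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decompose_kl(p, q, groups):
    """Split KL(p || q) into a cross-group and a within-group term.

    `groups` is a list of lists of class indices that partition the
    classes (a hypothetical grouping chosen for illustration).
    """
    # Probability mass each distribution assigns to every group.
    p_mass = [sum(p[i] for i in g) for g in groups]
    q_mass = [sum(q[i] for i in g) for g in groups]

    # Cross-group term: KL between the group-level distributions.
    cross = kl(p_mass, q_mass)

    # Within-group term: KL between the conditional distributions inside
    # each group, weighted by the mass p assigns to that group.
    within = 0.0
    for g, pm, qm in zip(groups, p_mass, q_mass):
        if pm == 0:
            continue  # a group with no mass under p contributes nothing
        p_cond = [p[i] / pm for i in g]
        q_cond = [q[i] / qm for i in g]
        within += pm * kl(p_cond, q_cond)
    return cross, within


# Toy teacher/student predictions over four classes (made-up numbers).
teacher = [0.5, 0.3, 0.15, 0.05]
student = [0.4, 0.3, 0.2, 0.1]
groups = [[0, 1], [2, 3]]  # assumed head group and tail group

cross, within = decompose_kl(teacher, student, groups)
total = kl(teacher, student)
```

Under these assumptions, `cross + within` matches `kl(teacher, student)` up to floating-point error, which is why the two terms can be weighted separately.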

🤖 Robotics & Embodied AI

Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation: This work leverages the pretrained 3D geometric foundation model π3 as a perception backbone, fuses 3D geometric, 2D semantic, and proprioceptive features, and jointly predicts future action chunks and future 3D Pointmaps via a diffusion model. Using only RGB inputs, the proposed method comprehensively surpasses point-cloud-based approaches on the RoboTwin bimanual benchmark.
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models: This paper proposes Adaptive Action Chunking (AAC), a strategy that leverages action entropy as a signal to dynamically determine the optimal chunk size at inference time, requiring no additional training or architectural modification. AAC consistently improves task success rates of GR00T N1.5 and π0.5 on benchmarks including RoboCasa and LIBERO.
AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots: AtomicVLA proposes a unified planning-execution framework that adaptively switches between Think and Act modes to generate task chains and atomic skill abstractions. Using a Skill-Guided MoE (SG-MoE), it builds a scalable atomic skill expert library, surpassing π₀ by 10% on LIBERO-LONG, exceeding baselines by 21% in real-world continual learning, with forgetting as low as 1.3%.
AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots: This paper proposes AtomicVLA, a unified planning-execution framework built upon π₀ that adaptively switches between Think and Act modes to generate atomic skill abstractions, and employs a Skill-Guided MoE (SG-MoE) to route actions to specialized experts. The approach improves LIBERO-LONG success rate from 85.2% to 95.2% (+10%), achieves +18.3% on real-world Franka long-horizon tasks, and +21% on continual learning benchmarks.
BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration: This paper proposes BiPreManip, a framework for bimanual preparatory manipulation based on visual affordance representations. The system first anticipates the primary hand's target interaction region, then guides the assistive hand to perform preparatory actions (e.g., flipping a bottle so its cap faces the primary hand), achieving substantial improvements over baselines in both simulated and real-world environments.
Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior: This paper proposes a Feasible Action Neighborhood (FAN) regularizer that shapes the output distribution of VLA models into a Gaussian form matching physical action tolerances. The approach consistently improves success rate, generalization, and sample efficiency under both SFT and RFT finetuning paradigms (RFT requires only 1/3 of training steps to reach 90% success rate).
Chain of World: World Model Thinking in Latent Motion (CoWVLA): CoWVLA unifies the strengths of world-model VLAs and latent-action VLAs: a Latent Motion Extractor decomposes video into structural and motion latent variables, enabling the VLA to perform world-model prediction in the latent motion space rather than reconstructing redundant pixels. Combined with Co-Fine-tuning that alternately generates keyframe and action tokens, CoWVLA achieves 95.2% on LIBERO-Long (surpassing π₀ at 85.2%) and an average score of 0.560 on SimplerEnv-WidowX (surpassing π₀ at 0.425).
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning: CoMo addresses the shortcut learning problem in continuous latent motion learning with two mechanisms, Early Temporal Difference (Td) and Temporal Contrastive Learning (Tcl). This enables the extraction of fine-grained continuous pseudo-action labels from internet videos and joint training on video data and robot actions under a unified continuous distribution, substantially improving policy performance.
Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning: This paper proposes NeSyCR, a neurosymbolic counterfactual reasoning framework that abstracts video demonstrations into a symbolic world model, detects cross-domain incompatibilities via counterfactual state simulation, and automatically corrects program steps. NeSyCR achieves a 31.14% improvement in success rate over the strongest baseline, Statler, on cross-domain demo-to-code tasks.
CycleManip: Enabling Cyclic Task Manipulation via Effective Historical Perception and Understanding: CycleManip is the first work to systematically address cyclic robotic manipulation tasks (e.g., shaking a bottle N times). It enhances historical perception via a cost-aware history sampling strategy and improves historical understanding through multi-task learning auxiliary objectives, enabling controllable cycle-count manipulation in an end-to-end imitation learning framework.
DAWN: Pixel Motion Diffusion is What We Need for Robot Control: This paper proposes DAWN, a two-stage fully diffusion-based vision-language-action framework. A Motion Director (latent diffusion model) generates dense pixel motion fields as interpretable intermediate representations, while an Action Expert (diffusion Transformer policy) translates pixel motion into executable robot actions. DAWN achieves state-of-the-art performance on the CALVIN benchmark (average length 4.00) and demonstrates strong generalization on real-world single-arm and dual-arm manipulation tasks.
DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation: This paper proposes DecoVLN, a framework that decouples observation, reasoning, and correction in VLN tasks. By introducing an adaptive memory refinement (AMR) mechanism and a state-action-pair-based correction fine-tuning strategy, DecoVLN achieves state-of-the-art performance on R2R-CE and RxR-CE using only egocentric RGB input.
DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning: This paper proposes the DeepSketcher suite — comprising a 31k high-quality interleaved image-text CoT dataset built via code rendering and a self-contained Embedding Editor model — enabling VLMs to generate "visual thoughts" directly in the visual embedding space for multimodal reasoning without relying on any external tools.
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols: This paper proposes the ViFailback framework, which leverages explicit visual symbols (arrows, crosshairs, etc.) to efficiently annotate real-world robotic manipulation failure data. It constructs a large-scale dataset of 58,128 VQA pairs and fine-tunes ViFailback-8B, which, when combined with a VLA model in real-robot experiments, achieves failure recovery with an average success rate improvement of 22.2%.
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols: This paper proposes ViFailback, a framework that leverages visual symbols (arrows, crosshairs, labels, etc.) to efficiently annotate real-world robotic manipulation failures. The framework constructs a dataset of 58,128 VQA pairs and trains ViFailback-8B, a VLM capable of failure diagnosis and both visual and textual corrective guidance. When integrated with a VLA, it achieves a 22.2% improvement in task success rate.
Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation: To address the limitation of MoE-LoRA methods where all experts share identical structures (uniform rank) and thus cannot adapt to tasks of varying complexity, this paper proposes EPT: a parameter pyramid constructed via a shared meta-knowledge subspace and deconvolution experts with varying kernel sizes, coupled with an Adaptive LoRA Pruner and contrastive learning-based Task Embedding. EPT achieves an average score of 87.0% on GLUE with only 0.41M parameters per task, outperforming all MoE-LoRA variants.
Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation: This paper proposes Expert Pyramid Tuning (EPT), which transplants the multi-scale feature pyramid (FPN) concept from computer vision into the MoE-LoRA paradigm. By combining a shared low-dimensional meta-knowledge subspace, deconvolution expert projections with kernels of varying scales, and contrastive task embeddings, EPT achieves an average score of 87.0% on GLUE with only 0.41M parameters per task—reducing parameter count by approximately 50% compared to existing MoE-LoRA variants.
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning: This paper proposes Fast-ThinkAct, which compresses verbose textual CoT reasoning (~250 tokens) into 6 verbalizable continuous latent tokens, combined with reward-guided preference distillation and visual trajectory alignment, achieving an 89.3% reduction in inference latency (9.3× faster than ThinkAct-7B) while maintaining or surpassing the performance of state-of-the-art reasoning VLAs.
FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation: This paper proposes FineCog-Nav, a zero-shot UAV vision-language navigation framework inspired by human cognition. It decomposes navigation into seven fine-grained cognitive modules—language processing, perception, attention, memory, imagination, reasoning, and decision-making—each driven by moderate-scale foundation models, enabling long-range navigation in complex 3D environments without any training.
FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction: By analyzing the over-reliance on layer-wise features and spectral-domain components in visual jailbreaking attacks, FORCE corrects non-generalizable feature dependencies and guides the attack toward flatter loss landscapes, thereby substantially improving cross-model transferability.
FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction: This work identifies the root cause of poor transferability in visual jailbreak attacks as their residence in high-sharpness loss regions — arising from shallow-layer over-reliance on model-specific representations and excessive influence of high-frequency information. FORCE is proposed to address this via layer-aware regularization that broadens the shallow-layer feasible region, and spectral rescaling that suppresses high-frequency non-semantic components, guiding attacks into flatter loss landscapes and substantially improving cross-model transferability.
ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation: This paper proposes ForceVLA2, the first end-to-end model that unifies force awareness and hybrid force-position control within a VLA framework. Force-based Prompts injected into a VLM expert construct cross-phase force-aware task concepts, while a Cross-Scale MoE adaptively fuses task semantics with real-time interaction forces to achieve closed-loop force-position regulation. The model achieves an average success rate of 66% across 5 contact-rich tasks, surpassing π₀ and π₀.₅ by 48.0% and 35.0%, respectively.
GeCo-SRT: Geometry-aware Continual Adaptation for Robotic Cross-Task Sim-to-Real Transfer: This paper proposes GeCo-SRT, a geometry-aware continual adaptation method that extracts cross-domain and cross-task invariant knowledge from local geometric features, enabling knowledge accumulation across successive sim-to-real transfers to efficiently adapt to new tasks.
GeCo-SRT: Geometry-aware Continual Adaptation for Robotic Cross-Task Sim-to-Real Transfer: GeCo-SRT proposes the first continual cross-task Sim-to-Real transfer paradigm, exploiting the domain-invariance and task-invariance of local geometric features. Through a Geo-MoE module for reusable geometric knowledge extraction and Geo-PER for expert-level forgetting prevention, the method achieves an average success rate of 63.3% across four real-robot tasks (a 52% improvement over baselines) while requiring only 1/6 of the data to match baseline performance.
IGen: Scalable Data Generation for Robot Learning from Open-World Images: IGen starts from a single open-world image and automatically generates large-scale vision-action training data through a pipeline of 3D scene reconstruction → VLM task planning → SE(3) action generation → point cloud synthesis → frame rendering. Policies trained exclusively on the generated data can successfully perform real-world manipulation tasks.
Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics: This paper establishes via the NTK framework that linearized attention fails to converge to the infinite-width kernel limit (requiring width \(m = \Omega(\kappa^6)\)), and proposes the "influence malleability" metric to quantify its dual implications: attention exhibits 6–9× higher data-dependent flexibility than ReLU networks, which simultaneously reduces approximation error and increases adversarial vulnerability.
Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics: This paper demonstrates that linearized attention does not converge to the infinite-width limit under the NTK framework, and proposes the metric of influence malleability to show that the expressive power of attention and its adversarial vulnerability share a common origin—data-dependent kernel structure that deviates from the kernel regime.
Language-Grounded Decoupled Action Representation for Robotic Manipulation (LaDA): LaDA is a framework that uses natural language as a semantic bridge to decouple continuous 7-DoF actions into three interpretable primitives — translation, rotation, and gripper — and employs soft-label contrastive learning to align cross-task action representations in a shared embedding space. With only 0.6B parameters, LaDA achieves a 93.6% success rate on LIBERO, outperforming all baselines with 1.3B–8.5B parameters.
Language-Grounded Decoupled Action Representation for Robotic Manipulation: This paper proposes LaDA, a framework that decouples continuous 7-DoF robotic actions into interpretable, language-described motion primitives (translation, rotation, gripper state), and unifies the visual-language-action representation space via semantically guided soft-label contrastive learning to achieve cross-task generalization.
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation: This paper proposes the TVVE framework, which employs a reinforcement learning-driven Multi-View Exploration Policy (MVEP) to select optimal virtual camera viewpoints and re-render observations online. A task-aware MoE visual encoder (TaskMoE) is designed to mitigate cross-task feature interference. The framework achieves an average success rate of 86.6% across 18 tasks on RLBench.
ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation: ManipArena proposes a standardized real-world robot manipulation evaluation framework comprising 20 reasoning-oriented tasks and 10,812 expert trajectories. Through a green-screen controlled environment, systematic diversity design, and hierarchical OOD evaluation, it provides a fair and reproducible benchmark for VLA models and world models.
MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent: This paper presents the first systematic diagnosis of two root causes underlying the non-mergeability of VLA models—LoRA selfish parameter conflicts and task coupling induced by self-attention in action experts—and proposes MergeVLA. By combining task-mask sparse LoRA activation, self-attention-free action experts, and training-free test-time task routing, MergeVLA merges multiple single-skill VLA specialists into a unified generalist agent, achieving a 90.2% success rate on LIBERO and 90% on the real-robot SO101 platform.
MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent: MergeVLA diagnoses two root causes of VLA model unmergeability—LoRA parameter conflicts and architectural incompatibility induced by self-attention in the action expert—and addresses them via sparsely activated task masks and a self-attention-free action expert architecture. This enables training-free merging of multiple single-task VLA experts, achieving 90.2% success on LIBERO and 90.0% on a real-robot SO101 platform.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation: This paper proposes PALM, a unified VLA framework that employs structured fine-grained affordance prediction across four categories (global, local, spatial, and dynamic) as implicit reasoning anchors, and incorporates continuous sub-task progress estimation to enable seamless task transitions. PALM achieves an average completion length of 4.48 on CALVIN ABCD (surpassing the previous SOTA by 12.5%), a success rate of 91.8% on LIBERO-LONG, and more than twice the baseline performance in real-world long-horizon generalization evaluations.
PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments: PanoAffordanceNet introduces a novel task of holistic affordance grounding in 360° indoor environments. It employs a Distortion-Aware Spectrum Modulator (DASM) to correct ERP geometric distortions, an Omnidirectional Sphere Densification Head (OSDH) to recover continuous affordance regions from sparse activations, and multi-level training objectives. The method achieves substantial gains over existing approaches on 360-AGD, the first panoramic affordance dataset constructed by the authors.
PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments: This paper presents PanoAffordanceNet, the first holistic affordance grounding framework for 360° panoramic indoor environments. It systematically addresses ERP geometric distortion, sparse functional regions, and semantic drift via a Distortion-Aware Spectral Modulator (DASM) and an Omnidirectional Spherical Densification Head (OSDH), and introduces 360-AGD, the first panoramic affordance dataset.
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition: This paper proposes CroBo, a self-supervised framework that learns visual state representations via a global-local reconstruction objective: a global reference image is compressed into a single bottleneck token, which is then used to reconstruct a heavily masked (90%) local crop, compelling the bottleneck token to encode pixel-level "what-is-where" scene composition. CroBo achieves state-of-the-art performance on the Franka Kitchen and DMC robot policy learning benchmarks.
Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection: This paper reformulates Multimodal Misinformation Detection (MMD) as a structured probabilistic reasoning problem over concept graphs. The proposed PCGR framework employs MLLMs to automatically discover and validate human-interpretable concept nodes, constructs a hierarchical probabilistic concept graph, and achieves interpretable misinformation detection, outperforming 13 baselines across three benchmarks.
ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation: ProFocus is a training-free progressive framework that achieves state-of-the-art zero-shot VLN performance on R2R and REVERIE benchmarks through two mechanisms: proactive perception (converting panoramic observations into semantic maps and having an LLM generate targeted visual queries) and focused reasoning (BD-MCTS filtering top-k high-value waypoints from large navigation histories).
PULSE: Privileged Knowledge Transfer from Rich to Deployable Sensors for Embodied Multi-Sensory Learning: This paper proposes PULSE, a framework that performs knowledge distillation from a frozen privileged-sensor (e.g., EDA) teacher to a student model relying solely on cheap, deployable sensors (e.g., ECG, BVP, accelerometer). PULSE introduces shared-private embedding decomposition and a reconstruction-based collapse-prevention mechanism, achieving 0.994 AUROC for stress detection without EDA at inference time—surpassing even models that use all sensors.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models: This paper proposes QuantVLA, the first training-free post-training quantization (PTQ) framework for Vision-Language-Action (VLA) models. Through a selective quantization layout and two lightweight calibration mechanisms—Attention Temperature Matching (ATM) and Output Head Balancing (OHB)—QuantVLA achieves approximately 70% memory reduction under W4A8 precision while surpassing the task success rate of the full-precision baseline.
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation: This paper proposes Robot-Conditioned Normalizing Flow (RC-NF), which models the joint distribution of robot states and object motion trajectories via a conditional normalizing flow, enabling real-time anomaly detection at <100ms latency. RC-NF serves as a plug-and-play monitoring module for VLA models (e.g., π₀), supporting task-level replanning and state-level trajectory rollback (homing).
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation: This paper proposes RC-NF, a real-time anomaly detection model based on conditional normalizing flows that decouples the processing of robot state and object trajectory features. Trained in an unsupervised manner using only normal demonstrations, RC-NF detects OOD anomalies during VLA model execution within 100ms, outperforming state-of-the-art methods (including VLM baselines such as GPT-5 and Gemini 2.5 Pro) by approximately 8% AUC and 10% AP on LIBERO-Anomaly-10.
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics: SaPaVe proposes an end-to-end active manipulation framework that decouples camera actions from manipulation actions via a bottom-up training strategy: it first learns active perception priors from 200K semantic camera-control pairs, then jointly optimizes for active manipulation, surpassing π₀ and GR00T N1 by up to 31.25% in real-world success rate.
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics: SaPaVe is an end-to-end framework that decouples camera motion from manipulation actions via a two-stage bottom-up learning strategy, enabling semantics-driven active perception and viewpoint-invariant manipulation execution. It surpasses GR00T N1 and π₀ by 31.25% and 40%, respectively, on real-world tasks.
STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation: STRNet proposes a unified spatio-temporal representation framework for visual navigation. It employs a graph reasoning module to model intra-frame spatial topology, and combines hybrid temporal shifting with multi-resolution differential convolution to capture temporal dynamics, achieving substantial improvements in goal-conditioned navigation success rates (70% gain over NoMaD).
Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency: This paper introduces the Test-time Ego-Exo Adaptation for Action Anticipation (TE2A3) task and proposes the DCPGN network, which leverages multi-label prototype growing and dual-clue (visual + textual) consistency to online-adapt a source-view trained model to the target view at test time for action anticipation, substantially outperforming existing TTA methods.
Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning: For the task of general scene-adaptive vision-language navigation (GSA-VLN) in open environments, inspired by Kahneman's dual-process cognitive theory, this paper proposes the slow4fast-VLN framework. A fast reasoning module performs real-time navigation via an end-to-end policy network while accumulating historical memory; a slow reasoning module leverages LLM-based reflection to generate structured, generalizable experience entries. These experiences are fed back into the fast reasoning network via attention-based fusion, enabling continuous adaptation to unseen environments and diverse instruction styles. The proposed framework achieves comprehensive improvements over the previous SOTA (GR-DUET) on the GSA-R2R dataset.
Towards Training-Free Scene Text Editing: This paper proposes TextFlow, a training-free scene text editing framework that employs Flow Manifold Steering (FMS) during the early denoising stage to preserve style consistency and Attention Boost (AttnBoost) during the late stage to enhance text rendering accuracy, achieving editing quality comparable to or better than training-based methods without any task-specific training.
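The roughly 70% memory reduction quoted for QuantVLA under W4A8 can be sanity-checked with simple weight-storage arithmetic. The sketch below is not the paper's accounting: it assumes a 16-bit baseline, ignores quantization scales and activation memory, and the names `weight_memory_reduction`, `fraction_for_reduction`, and `quantized_fraction` are hypothetical.

```python
def weight_memory_reduction(quantized_fraction, baseline_bits=16, quantized_bits=4):
    """Fraction of weight memory saved when part of the weights is quantized.

    Assumption: the baseline stores every weight in `baseline_bits`, and
    `quantized_fraction` of the weights is re-stored in `quantized_bits`.
    """
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must lie in [0, 1]")
    remaining = (1.0 - quantized_fraction) + quantized_fraction * quantized_bits / baseline_bits
    return 1.0 - remaining


def fraction_for_reduction(target_reduction, baseline_bits=16, quantized_bits=4):
    """Share of weights that must be quantized to reach `target_reduction`."""
    max_reduction = 1.0 - quantized_bits / baseline_bits
    if not 0.0 <= target_reduction <= max_reduction:
        raise ValueError("target_reduction is not reachable with these bit widths")
    return target_reduction / max_reduction


if __name__ == "__main__":
    # Upper bound: every weight moved from 16 bits to 4 bits.
    print(f"all weights quantized: {weight_memory_reduction(1.0):.2%} saved")
    # Share of weights that would need 4-bit storage for a 70% saving.
    print(f"needed for 70% saving: {fraction_for_reduction(0.70):.2%} of weights")
```

Under these assumptions, quantizing every weight to 4 bits caps the saving at 75%, so a figure near 70% is consistent with most, but not all, weights being quantized, which fits the selective quantization layout the entry mentions.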

🖼️ Image Restoration

Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration: This paper proposes IQPIR, a framework that introduces image quality priors (IQP) derived from pretrained NR-IQA models as conditioning signals. Through three mechanisms—quality-conditioned Transformer, dual Codebook architecture, and quality optimization in discrete representation space—IQPIR guides the restoration process toward maximal perceptual quality, achieving state-of-the-art performance on blind face restoration and related tasks.
Beyond the Ground Truth: Enhanced Supervision for Image Restoration: This paper proposes to enhance the perceptual quality of suboptimal ground-truth images in existing datasets via super-resolution combined with frequency-domain adaptive mixing, and trains a lightweight Output Refinement Network (ORNet) that improves the perceptual quality of restoration outputs without modifying any pretrained restoration model.
BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting: BHCast takes a single blurry EHT black hole image as input, employs a U-Net dynamics surrogate model for super-resolution combined with long-term autoregressive forecasting (stable over 100 steps), extracts physical features (pattern speed, pitch angle, etc.) from the predicted plasma dynamics, and infers black hole spin and inclination via XGBoost. Effectiveness is also demonstrated on real M87* observational images.
Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding: This paper proposes Blink, a framework that dynamically expands and discards visual tokens across different Transformer layers of an MLLM — simulating the human "rapid blinking" scanning process — to adaptively enhance visual perception within a single forward pass, improving LLaVA-1.5 performance across multiple multimodal benchmarks.
BluRef: Unsupervised Image Deblurring with Dense-Matching References: BluRef is the first unsupervised framework to exploit unpaired sharp reference images for deblurring. Dense matching against the references generates pseudo ground truth for training a deblurring network, and the resulting performance is comparable to, or even better than, that of supervised methods.
Bridging the Perception Gap in Image Super-Resolution Evaluation: Through a large-scale user study, this paper reveals a severe misalignment between existing SR evaluation metrics (PSNR, SSIM, LPIPS, etc.) and human perception. After analyzing their inherent deficiencies, the paper proposes a minimalist yet effective Relative Quality Index (RQI) framework that learns relative quality differences between image pairs to enable more reliable SR evaluation, and can also serve as a loss function to guide SR model training.
PNG: Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning: PNG introduces learnable Global/Local Prompt components to automatically extract noise characteristics from real noise (replacing metadata such as ISO and camera model). A Prompt AutoEncoder encodes noise into a latent space, and a Prompt DiT (based on a consistency model) generates latent codes in a single step, enabling realistic sRGB noise synthesis without any metadata. The downstream DnCNN denoiser trained on PNG-synthesized data trails real-data training by only 0.08 dB on SIDD.
Disentangled Textual Priors for Diffusion-based Image Super-Resolution: This paper proposes DTPSR, which disentangles textual priors along two orthogonal dimensions — spatial hierarchy (global/local) and frequency semantics (low-frequency/high-frequency) — and constructs a disentangled cross-attention injection pipeline along with a multi-branch CFG strategy, achieving superior perceptual quality in diffusion-based image super-resolution.
DRFusion: Degradation-Robust Fusion via Degradation-Aware Diffusion Framework: This paper proposes DRFusion, a degradation-aware diffusion framework that achieves multimodal image fusion under arbitrary degradation scenarios within a small number of diffusion steps, via direct regression of the fused image (rather than explicit noise prediction) and a joint observation model correction mechanism.
EVLF: Early Vision-Language Fusion for Generative Dataset Distillation: This paper proposes EVLF, a plug-and-play early vision-language fusion method operating at the encoder-backbone interface, addressing the problem of text dominance and degraded visual fidelity caused by late-stage semantic injection in diffusion-based dataset distillation.
FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution: This paper proposes FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework that simultaneously addresses structural fidelity degradation and insufficient high-frequency detail recovery in one-step diffusion SR through three complementary components: Detail-Aware Weighting (DAW), Latent Residual Refinement Block (LRRB), and Latent Frequency Injection Module (LFIM).
FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution: This paper proposes FinPercep-RM, a fine-grained perceptual reward model, and a Co-evolutionary Curriculum Learning (CCL) strategy to address reward hacking and training instability when applying RLHF to real-world image super-resolution. The model simultaneously outputs a global quality score and a spatial degradation heatmap, enabling localized artifact awareness.
FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution: This paper proposes FinPercep-RM, a fine-grained perceptual reward model that predicts both a global quality score and a perceptual degradation map to spatially localize artifacts. Combined with a co-evolutionary curriculum learning (CCL) strategy that balances training stability and reward robustness, the method effectively mitigates reward hacking in RL-based real-world super-resolution.
GSNR: Graph Smooth Null-Space Representation for Inverse Problems: This paper proposes Graph Smooth Null-Space Representation (GSNR), which employs spectral graph theory to construct a null-space-constrained Laplacian matrix and selects the \(p\) smoothest spectral modes as the null-space projection basis. GSNR provides structured null-space constraints for inverse problem solvers including PnP, DIP, and diffusion models, achieving up to 4.3 dB PSNR gains on deblurring, compressed sensing, demosaicing, and super-resolution.
IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE: IA-CLAHE demonstrates that the histogram redistribution process in CLAHE is differentiable almost everywhere, enabling the first end-to-end learning framework for tile-adaptive clip limit estimation. Without requiring pre-searched ground-truth clip limits, it achieves zero-shot improvements in recognition performance and visual quality under adverse weather conditions.
Flickerformer: A Duet of Periodicity and Directionality for Burst Flicker Removal: This paper identifies two intrinsic physical properties of flicker artifacts—periodicity and directionality—and proposes Flickerformer, comprising three dedicated modules (PFM/AFFN/WDAM) for inter-frame/intra-frame periodicity and directionality modeling respectively. With only 3.92M parameters, the method achieves 31.226 dB PSNR on the BurstDeflicker benchmark, surpassing the second-best method AST by +0.580 dB while using only 19.70% of its parameters.
Learning to Translate Noise for Robust Image Denoising: This paper proposes a noise translation framework that converts unknown real-world noise into Gaussian noise via a lightweight noise translation network (NTN), which is then processed by a pre-trained Gaussian denoising network. The approach achieves an average PSNR gain of over 1.5 dB on OOD real-noise benchmarks, while the translation network contains only 0.29M parameters and is transferable across different denoisers.
MAD-Avatar: Motion-Aware Animatable Gaussian Avatars Deblurring: This paper presents the first method to directly reconstruct sharp, drivable 3D Gaussian human avatars from blurry video. It proposes a 3D-aware physical blur formation model (decomposing blur into sub-frame SMPL motion and canonical 3DGS), models sub-frame motion via B-spline interpolation and a pose deformation network, and resolves motion direction ambiguity with inter-frame regularization. The method substantially outperforms two-stage "2D deblurring + 3DGS" pipelines on both synthetic and real datasets (~2.5 dB PSNR gain).
NEC-Diff: Noise-Robust Event–RAW Complementary Diffusion for Seeing Motion in Extreme Darkness: This paper proposes NEC-Diff, a diffusion-based event–RAW hybrid imaging framework that uses the illumination prior from RAW images to guide event denoising, and leverages the high-dynamic-range edges from denoised events to assist image denoising. Combined with dual-modality SNR-guided reliable information extraction and cross-modal attention diffusion, the method achieves high-quality dynamic scene reconstruction in extreme darkness (0.001–0.8 lux), reaching 24.51 dB PSNR on the REAL dataset.
NTIRE 2026 The 3rd RAIM Challenge: AI Flash Portrait (Track 3): The AI Flash Portrait track of the NTIRE 2026 3rd RAIM Challenge targets the mapping of weak-flash low-light portraits to strong-flash professional-grade portraits. It provides 800 real paired samples (with ground truth produced by professional retouchers) and adopts a dual evaluation system that combines region-aware objective metrics with expert blind assessment. 118 teams registered, with 3,187 valid submissions.
NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: This is the summary report of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Based on the Raindrop Clarity real-world dataset (14,139 training / 407 validation / 593 test images), 168 teams participated and 17 submitted valid solutions. The winning team AIIA-Lab achieved the best score of 35.24 using an MSDT backbone combined with a pseudo-GT refinement pipeline.
PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors: PhaSR introduces a dual-level physically aligned prior framework: at the global level, PAN performs parameter-free Retinex decomposition to suppress color bias; at the local level, GSRA employs differential attention to align DepthAnything depth priors with DINO-v2 semantic embeddings. This enables generalized shadow removal spanning from single-source direct illumination to multi-source ambient lighting scenes, achieving state-of-the-art performance on WSRD+ and Ambient6K with the lowest FLOPs.
POLISH'ing the Sky: Wide-Field and High-Dynamic Range Interferometric Image Reconstruction: Building upon the POLISH framework, this work proposes POLISH+ and POLISH++, which employ a patch-based training-and-stitching strategy and an arcsinh-based nonlinear transform to achieve radio interferometric image reconstruction and super-resolution under wide-field (12,960×12,960 pixels) and high-dynamic-range (\(\sim 10^6\)) conditions. The paper also presents the first demonstration that deep learning methods can super-resolve strong gravitational lens systems.
RAR: Restore, Assess, Repeat - A Unified Framework for Iterative Image Restoration: RAR deeply integrates image quality assessment (IQA) with image restoration (IR) into a unified end-to-end model, iteratively executing an "assess–restore–verify" loop in the latent space. It achieves a +2.71 dB PSNR gain under composite degradation scenarios while running 11.27× faster than AgenticIR.
RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution: This paper proposes a calibration-based RAW-domain degradation modeling framework that accurately calibrates SR blur kernels and sensor noise models for multiple smartphone cameras, enabling the "unprocessing" of public sRGB images into realistic LR RAW data for training. The approach significantly outperforms baselines based on generic degradation pools in both camera-specific and cross-camera blind super-resolution settings.
RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution: This paper demonstrates that principled, device-specific degradation modeling — obtained via physical calibration of real blur and noise parameters — significantly improves real-world smartphone super-resolution performance. By unprocessing publicly available rendered images into the RAW domain of target devices to generate HR-LR training pairs, the resulting SR models substantially outperform baselines trained with large pools of arbitrary degradation combinations on held-out real device data.
Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset: This paper proposes Real-IISR, a unified autoregressive framework that addresses the unique challenges of real-world infrared image super-resolution via a Thermal-Structure Guidance (TSG) module, a Conditional Adaptive Codebook (CAC), and a Thermal Order Consistency loss. It also introduces the FLIR-IISR dataset comprising 1,457 real LR-HR infrared image pairs.
SAT: Selective Aggregation Transformer for Image Super-Resolution: This paper proposes the Selective Aggregation Transformer (SAT), which reduces Key-Value matrix token count by 97% through density-driven token aggregation while preserving full-resolution Queries, enabling efficient global attention modeling. SAT surpasses the state-of-the-art PFT by 0.22 dB while reducing FLOPs by 27%.
SelfHVD: Self-Supervised Handheld Video Deblurring: SelfHVD exploits naturally occurring sharp frames in handheld videos as supervisory signals. Through Self-Enhanced Video Deblurring (SEVD), it constructs high-quality training pairs that surpass the quality ceiling of sharp frames, while Self-Constrained Spatial Consistency Maintenance (SCSCM) prevents spatial displacement drift, enabling handheld video deblurring without paired training data.
Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement: This work builds a three-stage cascaded refinement pipeline upon OmniSR, combining frozen DINOv2 semantic features with monocular depth/normal geometric guidance and a contraction constraint loss that stabilizes multi-stage training. It achieves first place in the NTIRE 2026 Image Shadow Removal Challenge.
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration: ShiftLUT achieves the largest receptive field among LUT-based methods (65×65) via a Learnable Spatial Shift (LSS) module, combined with an asymmetric dual-branch architecture and Error-bounded Adaptive Sampling (EAS). Under a storage budget of 104 KB and an inference latency of 84 ms, ShiftLUT surpasses all existing LUT-based methods.
Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: This paper proposes UALNet, which integrates a data-driven spectral prior (PriorNet) and an adversarial learning term into a deep unfolding framework to perform spectral super-resolution from Sentinel-2 multispectral data (12 bands) to NASA AVIRIS hyperspectral imagery (186 bands), surpassing Transformer-based methods while requiring only 15% of their computation and 1/20 of their parameters.
Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging: This paper proposes SCGN (Statistical Characteristic-Guided denoising Network), which adaptively enhances signal and suppresses noise in both spatial and frequency domains via window standard deviation weighting and frequency band-guided channel attention, respectively. Combined with an HRTEM-specific noise calibration method that generates realistic noisy datasets containing disordered structures, SCGN achieves high-quality denoising of high-resolution transmission electron microscopy images at millisecond-level acquisition speeds.
The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations: Through systematic experimental analysis, this paper demonstrates that pretraining INRs on unstructured noise (uniform/Gaussian distributions) achieves a surprising ~80 dB PSNR in image fitting, far surpassing all data-driven initialization methods. Noise with the natural-image \(1/|f|^\alpha\) spectral structure achieves the best balance between signal fitting and denoising, matching state-of-the-art data-driven initialization performance without requiring any real data.
TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising: This paper proposes TM-BSN, a triangular-masked blind-spot network that designs the blind-spot region to precisely align with the diamond-shaped spatial correlation pattern of real-world sRGB noise, enabling self-supervised image denoising at full resolution without downsampling. Combined with knowledge distillation, TM-BSN achieves state-of-the-art self-supervised denoising performance on the SIDD and DND benchmarks.
Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset: This paper proposes Real-IISR, a visual autoregressive framework guided by thermal-structural cues, which achieves real-world infrared image super-resolution via a conditionally adaptive codebook and a thermal order consistency loss. The first real-world infrared SR dataset, FLIR-IISR, is also introduced.
Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis: This paper constructs UniCAC, the first universal computational aberration correction benchmark for consumer-grade cameras, proposes an Optical Degradation Evaluator (ODE) to quantify aberration difficulty, systematically evaluates 24 image restoration/CAC methods, and reveals three key factors influencing CAC performance.
UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution: UCAN is a lightweight super-resolution network that unifies convolutional and attention mechanisms to efficiently expand the effective receptive field. It addresses the rank collapse issue of linear attention via Hedgehog attention, introduces a large-kernel distillation module and a semi-shared parameter strategy, and achieves 31.63 dB PSNR on Manga109 (×4) with only 48.4G MACs.
UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution: This paper proposes UCAN, a lightweight super-resolution network that unifies convolution and attention. By introducing Hedgehog Attention to overcome the low-rank bottleneck of linear attention, and combining Flash Attention for large-window modeling, a large-kernel distillation module, and cross-layer parameter sharing, UCAN achieves super-resolution performance comparable to much larger models under extremely low computational budgets.
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation: UDAPose achieves a 56.4% AP improvement on the low-light hard set by combining stable diffusion-based low-light image synthesis (with preserved high-frequency low-light characteristics) and a dynamic attention control module (adaptively balancing visual cues and pose priors).
UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization: This paper proposes UniBlendNet, which builds upon IFBlend to unify three complementary modules—global context modeling, multi-scale feature aggregation, and region-adaptive residual refinement—for ambient lighting normalization under complex spatially varying illumination conditions.
UniCAC: Towards Universal Computational Aberration Correction in Photographic Cameras: This work constructs UniCAC, the first large-scale universal benchmark for computational aberration correction (CAC) in photographic lenses covering both spherical and aspherical designs. It proposes an Optical Degradation Evaluator (ODE) to replace the traditional RMS radius metric, and derives three key factors governing CAC performance—prior utilization, network architecture, and training strategy—through a comprehensive evaluation of 24 models.
Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis: This paper presents UniCAC, the first large-scale universal benchmark for Computational Aberration Correction (CAC). It introduces an Optical Degradation Evaluator (ODE) to quantify aberration difficulty and comprehensively evaluates 24 image restoration/CAC algorithms, revealing the impact of three key factors—prior utilization, network architecture, and training strategy—on CAC performance.
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization: This paper proposes UniRain, a unified image deraining framework that employs RAG-driven dataset distillation to select high-quality samples from million-scale public datasets, combined with an asymmetric MoE architecture and a multi-objective reweighted optimization strategy, achieving consistently superior performance across four degradation types: rain streaks and raindrops under both daytime and nighttime conditions.
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization: This paper proposes UniRain, a unified deraining framework that distills high-quality training samples from over 2 million public image pairs via RAG-driven dataset distillation, combines an asymmetric Mixture-of-Experts (MoE) architecture with a multi-objective adaptive reweighting optimization strategy, and for the first time handles all four degradation types — daytime rain streaks, daytime raindrops, nighttime rain streaks, and nighttime raindrops — within a single model.
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization: UniRain is a unified deraining framework that employs RAG-driven dataset distillation to select high-quality samples from public datasets, and introduces a multi-objective reweighted optimization strategy within an asymmetric MoE architecture to balance learning across different rain degradation types, achieving state-of-the-art performance across four scenarios: daytime/nighttime rain streaks and raindrops.
Variational Garrote for Sparse Inverse Problems: Under a unified sparse inverse problem framework, this paper systematically compares \(\ell_1\) regularization (LASSO) with Variational Garrote (VG, a method that approximates \(\ell_0\) sparsity via variational binary gating) across three tasks—signal resampling, denoising, and sparse-view CT reconstruction—demonstrating that VG significantly reduces the minimum generalization error in severely underdetermined settings, with the greatest advantage observed at sampling rates below 20% or with very few projection angles.
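
For readers unfamiliar with the \(\ell_1\) baseline named in the Variational Garrote entry above, the following is a minimal sketch of LASSO solved by proximal gradient descent (ISTA) on an underdetermined linear system. It is not the paper's code and does not implement Variational Garrote. The problem sizes, the regularization weight `lam`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1: shrink every entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_objective(A, y, x, lam):
    """LASSO objective 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    r = A @ x - y
    return 0.5 * float(r @ r) + lam * float(np.abs(x).sum())

def ista(A, y, lam, n_iter=200):
    """Minimise the LASSO objective with fixed-step proximal gradient descent."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the data-term gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                   # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)  # gradient step, then shrinkage
    return x

# Synthetic, severely underdetermined problem (assumed sizes: 20 measurements, 100 unknowns).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 100))
x_true = np.zeros(100)
x_true[[3, 40, 77]] = [2.0, -1.5, 1.0]
y = A @ x_true
x_hat = ista(A, y, lam=0.1)
```

With step size `1 / L`, each ISTA iteration cannot increase the objective, so `lasso_objective(A, y, x_hat, 0.1)` is no larger than its value at the all-zero starting point.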

🎯 Object Detection¶

A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps: This paper proposes a Hybrid Ensemble Decoder (HED) and a progressive fine-tuning strategy for cross-domain few-shot object detection (CD-FSOD). By parallelizing a subset of decoder layers and randomly initializing denoising queries to introduce prediction diversity, the method achieves state-of-the-art performance on three benchmarks — CD-FSOD, ODinW-13, and RF100-VL — without introducing any additional parameters.
ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection: ABRA decouples domain knowledge from category knowledge by constructing class-agnostic domain experts via Objectification, extracting lightweight per-category residuals via SVFT, and aligning weight spaces through Orthogonal Procrustes rotation—enabling detection capability transfer to a target domain even when no data for certain categories exists therein.
ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection: This paper formulates cross-domain category transfer as SVD rotation alignment in weight space: domain-agnostic experts are trained via Objectification, lightweight class residuals are extracted with SVFT, and a closed-form orthogonal Procrustes solution is used to "teleport" source-domain class knowledge to a target domain with no data for that class.
AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos: By exploiting the temporal invariance of background structure in fixed-view videos, the paper constructs an offline Anchor Bank and an online Anchor Map as persistent language–scene memory. Combined with an anchor-guided re-entry prior and a ReID-Gating identity verification mechanism, the system achieves robust re-capture of targets after occlusion or departure, yielding a 10.3% improvement in RCR and a 24.2% reduction in RCL.
Beyond Caption-Based Queries for Video Moment Retrieval: This paper identifies a substantial gap between caption-based queries and real-world search queries in VMR, introduces three search-query benchmarks, and mitigates active decoder-query collapse in DETR via two architectural modifications—self-attention removal and query dropout—achieving gains of up to 21.83% mAPm on multi-moment search queries.
Beyond Prompt Degradation: Prototype-Guided Dual-Pool Prompting for Incremental Object Detection: This paper proposes the PDP framework, which addresses prompt degradation in incremental object detection caused by prompt coupling and prompt drift via decoupled dual-pool prompting (shared pool + private pool) and Prototypical Pseudo-Label Generation (PPG), achieving state-of-the-art performance on COCO and VOC.
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval: This paper proposes Object-Anchored Composed Image Retrieval (OACIR), a new task formulation, along with a large-scale benchmark OACIRR (160K+ quadruplets) and the AdaFocal framework. AdaFocal employs a context-aware attention modulator to adaptively enhance focus on anchored instance regions, substantially outperforming existing methods in instance-level retrieval fidelity.
CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection: This paper proposes the CD-Buffer framework, which drives complementary collaboration between a subtractive buffer (channel suppression) and an additive buffer (lightweight adapter compensation) via a unified domain discrepancy measure, enabling robust test-time object detection adaptation across adverse weather conditions of varying severity.
CompAgent: An Agentic Framework for Visual Compliance Verification: This paper proposes CompAgent, the first agentic framework for visual compliance verification. A Planning Agent dynamically selects visual tools (object detection, face analysis, NSFW detection, etc.) based on compliance policies, while a Compliance Verification Agent integrates image content, tool outputs, and policy context for multimodal reasoning. Without any training, CompAgent surpasses the previous SOTA by 10% on UnsafeBench, achieving 76% F1.
DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection: This paper proposes DA-Mamba, a CNN-SSM hybrid architecture that achieves image-level and instance-level global-local domain-invariant feature alignment with linear complexity via two modules—Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM)—attaining state-of-the-art performance on four domain adaptive detection benchmarks.
Detecting Unknown Objects via Energy-Based Separation for Open World Object Detection: This paper proposes the DEUS framework, which introduces ETF-Subspace Unknown Separation (EUS) to effectively separate known, unknown, and background proposals via energy scores within geometrically orthogonal known/unknown subspaces, and designs an Energy-based Known Distinction (EKD) loss to reduce cross-task interference between old and new classes during incremental learning, achieving substantial improvements in unknown object recall on OWOD benchmarks.
Detecting Unknown Objects via Energy-based Separation for Open World Object Detection: This paper proposes DEUS, a framework that constructs orthogonal known/unknown subspaces via Simplex ETF and employs energy scores to guide feature separation (EUS), while mitigating cross-task interference between old and new categories through an Energy-based Known Distinction loss (EKD), achieving substantially improved unknown recall on OWOD benchmarks.
Does YOLO Really Need to See Every Training Image in Every Epoch?: This paper proposes the Anti-Forgetting Sampling Strategy (AFSS), which dynamically determines which training images participate in each epoch based on per-image learning sufficiency measured by \(\min(\text{Precision}, \text{Recall})\). AFSS achieves over 1.43× training speedup for YOLO-series detectors while maintaining or even improving detection accuracy.
Evaluating Few-Shot Pill Recognition Under Visual Domain Shift: This paper systematically evaluates the generalization of pill recognition under cross-domain few-shot conditions from a deployment perspective. It reveals a decoupling phenomenon in which semantic classification saturates at 1-shot while localization and recall degrade sharply under occlusion and overlap, and demonstrates that the visual realism of training data is far more critical than data volume or shot count.
Evaluating Few-Shot Pill Recognition Under Visual Domain Shift: This paper systematically evaluates few-shot pill recognition under cross-dataset domain shift from a deployment-oriented perspective. It reveals a decoupling phenomenon in which semantic classification saturates at 1-shot while localization and recall degrade sharply under occlusion and overlap, and demonstrates that the visual realism of training data is the dominant factor governing few-shot generalization.
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer: This paper proposes the Evolving World Object Detection (EWOD) paradigm and the EW-DETR framework, which jointly address class-incremental learning, domain shift adaptation, and unknown object detection under a strict no-replay constraint through three synergistic modules: incremental LoRA adapters, a query-norm objectness adapter, and entropy-aware unknown mixing. The proposed approach achieves a 57.24% improvement on the FOGS metric.
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer: This paper proposes the Evolving World Object Detection (EWOD) paradigm and the EW-DETR framework, which jointly address class-incremental learning, domain-shift adaptation, and unknown object detection without storing any historical data, via three modules: incremental LoRA adapters, a query-norm objectness adapter, and entropy-aware unknown mixing. The proposed method achieves a 57.24% improvement in FOGS over prior methods.
Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments: This paper proposes FI3Det, the first few-shot incremental 3D object detection framework. During the base training stage, a VLM-guided unknown object learning module enables early awareness of potential novel categories. During the incremental stage, a gated multimodal prototype imprinting module fuses 2D semantic and 3D geometric features for novel class detection. FI3Det achieves an average improvement of 17.37% in novel class mAP on ScanNet V2 and SUN RGB-D.
Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection: This paper proposes FALCON-SFOD, a framework that leverages class-agnostic binary masks generated by a foundation model (OV-SAM) to regularize the detector's feature space via Spatial Prior-Aware Regularization (SPAR), and introduces an Imbalance-aware Robust Pseudo Label loss (IRPL) to achieve object-focused representations in source-free object detection, attaining state-of-the-art results across multiple benchmarks.
Fourier Angle Alignment for Oriented Object Detection in Remote Sensing: By exploiting Fourier rotational equivariance to estimate the principal orientation of objects in the frequency domain and align features accordingly, this paper proposes two plug-and-play modules—FAAFusion and FAA Head—to address cross-scale directional incoherence in FPN and the classification–regression task conflict in detection heads, respectively, achieving new state-of-the-art results on DOTA-v1.0/v1.5 and HRSC2016.
HeROD: Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection: HeROD proposes a lightweight, model-agnostic framework that injects heuristic-inspired spatial and semantic reasoning priors into three stages of a DETR-style detection pipeline (candidate ranking, prediction fusion, and Hungarian matching), significantly improving data efficiency and convergence for referring object detection (ROD) under annotation-scarce conditions.
Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection: This paper proposes LMP, a dual-branch framework built upon GroundingDINO that introduces a visual prototype branch (comprising positive class prototypes and hard negative prototypes) jointly trained and integrated with the text branch at inference, achieving state-of-the-art performance on cross-domain few-shot object detection.
Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection: This paper proposes InCoM-Net, which extracts intra-instance, inter-instance, and global context features separately for each instance from VLM features, and achieves state-of-the-art HOI detection on HICO-DET and V-COCO (HICO-DET Full mAP 43.96, V-COCO \(\text{AP}_{\text{role}}^{S1}\) 73.6) via progressive context aggregation and fusion with detector features.
Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection: This paper proposes RAPTA (training-time region-aware prompt augmentation) to mitigate memorization in diffusion models, and ADMCD (attention-driven multimodal copy detection) to detect whether generated images copy training data. The two modules are complementary, forming an end-to-end framework for memorization mitigation and detection.
Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection: This paper proposes two complementary modules — Region-Aware Prompt Augmentation at Training time (RAPTA) and Attention-Driven Multimodal Copy Detection (ADMCD) — to address training data memorization in diffusion models. RAPTA generates semantically grounded prompt variants via object detector proposals to mitigate memorization during training, while ADMCD fuses patch-level, CLIP, and texture features through a zero-training detection pipeline to classify copy behavior at inference time. On LAION-10k, the copy rate is reduced from 7.4 to 2.6.
MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label: This work is the first to formally define and address the problem of sparsely annotated monocular 3D object detection. It proposes two modules—Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF)—achieving substantial improvements over existing 2D SAOD methods under the KITTI 30% annotation setting (AP3D Easy: 21.28 vs. 17.14).
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding: This paper proposes MRD, a training-free multi-resolution retrieval-detection fusion framework that mitigates object fragmentation via cross-resolution semantic fusion and suppresses background interference through an open-vocabulary detector, substantially improving MLLM understanding of high-resolution images.
NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection: NoOVD proposes a framework that, during frozen-VLM-based OVD training, employs a parameter-free K-FPN to preserve CLIP knowledge for discovering potential novel-category objects, applies self-distillation to embed novel-category knowledge into the detector, and introduces R-RPN at inference to improve novel-category recall, achieving SOTA on OV-LVIS, OV-COCO, and Objects365.
PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection: PaQ-DETR proposes pattern-based dynamic query generation (content-aware weighted combination of shared basis patterns) combined with quality-aware one-to-many assignment (adaptive positive sample selection based on localization–classification consistency), jointly addressing query representation imbalance and supervision sparsity in DETR. It achieves consistent gains of 1.5%–4.2% mAP across multiple backbones.
Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection: HSA-DINO proposes a multi-scale prompt bank that learns hierarchical semantic prompts from the image feature pyramid to enrich text representations, and employs a semantics-aware router to dynamically determine at inference time whether domain-specific augmentation should be applied. This design achieves a superior balance between domain adaptation and open-vocabulary generalization, attaining the best harmonic mean (H) scores across three vertical-domain datasets.
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training: PET-DINO builds a unified object detector supporting both text and visual prompts on top of Grounding DINO. It introduces an alignment-friendly visual prompt generation module (AFVPG) and two prompt-enriched training strategies (IBP and DMD), achieving competitive zero-shot detection performance with significantly less training data.
PHAC: Promptable Human Amodal Completion: This paper introduces Promptable Human Amodal Completion (PHAC), a novel task driven by point-based user prompts (pose/bounding box). The proposed method injects these conditional signals via dedicated ControlNet modules and adds an inpainting-based refinement module to preserve the appearance of visible regions, achieving high-quality and controllable completion of occluded human images.
Prompt-Free Universal Region Proposal Network: PF-RPN replaces text/image prompts with learnable visual embeddings and introduces three modules—Sparse Image-Aware Adapter (SIA), Cascaded Self-Prompting (CSP), and Centrality-Guided Query Selection (CG-QS)—to achieve state-of-the-art zero-shot region proposals across 19 cross-domain datasets using only 5% of COCO training data.
Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection: This paper is the first to identify an "astigmatism" phenomenon in cross-domain few-shot object detection (CD-FSOD), wherein model attention remains persistently diffuse in the target domain. Inspired by the human foveal visual system, the authors design three complementary modules — Positive Pattern Refinement (PPR), Negative Context Modulation (NCM), and Text Semantic Alignment (TSA) — to reshape attention, achieving state-of-the-art performance with significant margins across six cross-domain benchmarks.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward: This paper proposes Saliency-R1, which uses a logit-decomposition-based efficient saliency map technique and chain-of-thought bottleneck attention rollout to compute alignment between saliency maps and human-annotated bounding boxes as a GRPO reward, training VLMs to focus on task-relevant image regions during reasoning and thereby improving the interpretability and faithfulness of the reasoning process.
SDF-Net: Structure-Aware Disentangled Feature Learning for Optical–SAR Ship Re-Identification: SDF-Net exploits the rigid-body geometric structure of ships as a cross-modal invariant anchor. It enforces structural consistency via gradient energy extracted from intermediate layers, and disentangles modality-shared/specific features at the terminal layer with additive residual fusion, achieving SOTA on HOSS-ReID (All mAP 60.9%, surpassing TransOSS by 3.5%).
Show, Don't Tell: Detecting Novel Objects by Watching Human Videos: This paper proposes the "Show, Don't Tell" paradigm — automatically creating training datasets and training bespoke object detectors by watching human demonstration videos, entirely bypassing language descriptions and prompt engineering. The approach significantly outperforms state-of-the-art open-set/closed-set detectors on novel object recognition in real-world robotic scenarios.
Show, Don't Tell: Detecting Novel Objects by Watching Human Videos: This paper proposes the "Show, Don't Tell" paradigm: a SODC pipeline (HOIST-Former for hand-object detection → SAMURAI for tracking → DBSCAN for spatiotemporal clustering) automatically creates annotated datasets from human demonstration videos to train a lightweight F-RCNN customized detector (MOD). Without any language prompts, MOD achieves instance-level detection of novel objects, surpassing VLM baselines such as GroundingDINO, RexOmni, and YoloWorld in mAP and precision on the Meccano and in-house datasets, and is integrated end-to-end into a real robotic sorting system.
Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images: This paper proposes ESM-YOLO+, a lightweight visible-infrared fusion network for small target detection. It achieves pixel-level cross-modal adaptive fusion via a Mask-Enhanced Attention Fusion (MEAF) module, and introduces a training-time structural representation enhancement to improve spatial discriminability. The method achieves 84.71% mAP on VEDAI while reducing parameter count by 93.6%.
SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection: This paper proposes Spatial-Projection Alignment (SPAN), which improves the localization accuracy of arbitrary monocular 3D detectors through two geometrically synergistic constraints — 3D corner spatial alignment and 3D-to-2D projection alignment — coupled with a hierarchical task learning strategy, serving as a plug-and-play module.
SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras: This paper proposes SpiralDiff, a diffusion framework for RGB-to-RAW conversion that employs a signal-dependent noise weighting strategy to accommodate varying reconstruction difficulty across pixel intensity regions, and introduces a CamLoRA module for lightweight cross-camera adaptation within a single unified model.
The COTe Score: A Decomposable Framework for Evaluating Document Layout Analysis Models: This paper proposes COTe (Coverage, Overlap, Trespass, Excess), a decomposable evaluation framework for Document Layout Analysis (DLA), along with the concept of Structural Semantic Units (SSUs). Compared to conventional IoU/mAP/F1 metrics, COTe more accurately reflects page parsing quality and reveals model-specific failure modes.
Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data: This paper introduces CANVAS — the first large-scale, subcellular-resolution Light-Sheet Fluorescence Microscopy (LSFM) whole-brain benchmark dataset, encompassing 6 cell markers, approximately 93,000 annotated cells, and a public leaderboard. It reveals critical generalization failures of existing detection models across markers and brain regions, and explores the potential of 3D Masked Autoencoders (MAE) for self-supervised representation learning.
Towards Intrinsic-Aware Monocular 3D Object Detection: MonoIA proposes converting numerical camera intrinsics into language-guided semantic representations (via LLM-generated intrinsic descriptions encoded by CLIP), and injects them into the detection network through a hierarchical adaptation module. This enables zero-shot generalization to unseen focal lengths and unified cross-dataset training, achieving new state-of-the-art results on KITTI, Waymo, and nuScenes.
UAVGen: Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection: This paper proposes UAVGen, a layout-to-image data augmentation framework for UAV-based object detection. It addresses low-quality small object generation, inefficient model capacity allocation, and label inconsistency through a visual prototype conditioned diffusion model and a focal region enhancement pipeline.
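
The closed-form orthogonal Procrustes solution mentioned in the ABRA entries above can be sketched generically as follows. This is a minimal illustration of the standard Procrustes problem (find the orthogonal matrix `R` minimising \(\|AR - B\|_F\)), not ABRA's weight-space pipeline. The matrices `A`, `B`, and `Q` are synthetic placeholders introduced only for the example.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal R minimising ||A @ R - B||_F, via the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Synthetic check: B is A transformed by a known orthogonal matrix Q.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal matrix
B = A @ Q
R = orthogonal_procrustes(A, B)
```

Because `B` is an exact orthogonal transform of a full-rank `A`, the recovered `R` matches `Q` up to floating-point error. With noisy or mismatched inputs, `R` is only the best orthogonal fit in the least-squares sense.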

🔄 Self-Supervised Learning¶

A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking: This paper proposes PL-Stitch, a self-supervised framework that leverages the Plackett-Luce probabilistic ranking model to use temporal ordering of video frames as a pretraining signal. The method learns "procedure-aware" video representations and consistently outperforms existing self-supervised approaches on surgical phase recognition and cooking action segmentation.
AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation: This paper proposes AcTTA, a framework that for the first time treats activation functions as learnable components for test-time adaptation (TTA). By introducing a parameterized activation center shift \(c\) and asymmetric gradient scaling \(\lambda_{pos}, \lambda_{neg}\) to replace or augment conventional normalization-layer adaptation, AcTTA consistently outperforms all normalization-based TTA methods on CIFAR-10/100-C and ImageNet-C, while supporting learning rates up to 10× larger.
An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning: This paper proposes MMOT, an online mixture model learning framework driven by optimal transport theory. By maintaining multiple adaptive centroids per class, MMOT more accurately captures the multimodal structure of online data streams. Combined with a dynamic preservation strategy that enhances class discriminability, MMOT effectively alleviates catastrophic forgetting in online class-incremental learning (OCIL).
BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning: This paper proposes the BD-Merging framework, which trains a debiased router via Dirichlet evidential modeling, Adjacency Discrepancy Score (ADS), and discrepancy-aware contrastive learning to adaptively assign model merging weights, significantly improving the robustness and generalization of merged models under test-time distribution shifts and on unseen tasks.
BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning: This paper proposes BoSS, a scalable oracle strategy selection framework. In each active learning round, multiple query strategies are run in parallel on random sub-pools to generate candidate batches; each candidate batch is evaluated rapidly by freezing the backbone and retraining only the final linear head; the batch yielding the greatest performance gain is selected. This framework enables quantification of the gap between existing AL strategies and the theoretical optimum.
BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning: This paper proposes BoSS (Best-of-Strategies Selector), which generates 100 candidate batches by ensembling 10 complementary AL selection strategies, and efficiently evaluates the performance gain of each candidate batch by freezing the pretrained backbone and retraining only the final linear layer. The best-performing batch is selected as an Oracle upper-bound reference. BoSS is the first deep active learning Oracle scalable to ImageNet, and reveals that current state-of-the-art strategies still fall short of the Oracle by roughly 2× in accuracy improvement on large-scale, many-class datasets.
Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors: This paper proposes a zero-hyperparameter multi-corner yield analysis framework based on learned priors (the TabPFN foundation model). By replacing traditional GP/normalizing flow hyperparameter tuning with in-context Bayesian inference, and combining automatic feature selection, cross-corner knowledge transfer, and uncertainty-driven active learning, the framework achieves an MRE as low as 0.11% with no manual tuning, reducing verification cost by over 10×.
Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors: This paper proposes replacing handcrafted priors (GP kernels, IS Gaussian assumptions) with the learned prior of the foundation model TabPFN, enabling zero-hyperparameter multi-PVT-corner yield analysis. On industrial-grade SRAM benchmarks, the method achieves state-of-the-art accuracy (MRE as low as 0.11%) while reducing verification cost by more than 10×.
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models: This paper proposes Chain-of-Models Pre-Training (CoM-PT), which arranges vision foundation models in a size-ordered "model chain" and progressively accelerates training via inverse knowledge transfer (weight initialization + feature distillation) from smaller to larger models, achieving lossless training acceleration whose efficiency improves as the model family grows.
CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale: This work is the first to formalize crater analysis as an instance-level image retrieval problem. It introduces the CraterBench-R benchmark (~25K Mars crater IDs, 50K gallery, 5K queries), and through systematic diagnosis reveals that single-vector pooling imposes an accuracy ceiling while supervised metric learning consistently degrades performance. A training-free instance token aggregation method is proposed—selecting K seed tokens via top-K attention or FPS and performing cosine nearest-neighbor residual assignment—to compress 196 ViT patch tokens into K representative tokens for late interaction matching. At K=64, the method matches full-token accuracy with substantially reduced storage. A practical two-stage pipeline (single-vector coarse retrieval + instance token re-ranking) recovers 89–94% of full-pipeline accuracy.
D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping: This paper proposes D2Dewarp—the first document dewarping method that learns geometric representations from both horizontal and vertical dimensions. A UNet with dual decoders predicts horizontal lines (top/bottom boundaries of documents, tables, and text lines) and vertical lines (left/right boundaries) respectively. An HV Fusion Module cross-fuses features from both directions via mixed attention. The authors also introduce the DocDewarpHV dataset containing 114K images with dual-dimension annotations.
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers: Through systematic analysis, this work identifies inter-block representation diversity as a key factor for effective learning in DiTs, and proposes DiverseDiT: long residual connections to diversify inputs combined with a representation diversity loss to explicitly promote feature differentiation across blocks—accelerating convergence and improving generation quality without any external guidance model.
GeoBridge: A Semantic-Anchored Multi-View Foundation Model for Geo-Localization: GeoBridge proposes a semantic-anchored multi-view foundation model for geo-localization that bridges UAV, street-view, and satellite imagery through textual descriptions as cross-modal semantic anchors, enabling bidirectional cross-view matching and language-to-image localization. The authors also introduce the GeoLoc dataset (50K+ location tuples across 36 countries).
GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration: This paper introduces GeoChemAD, the first open-source multi-region multi-element geochemical anomaly detection benchmark (8 subsets covering three sampling media—sediment/rockchip/soil—and four target elements—Au/Cu/Ni/W), and proposes GeoChemFormer, a two-stage Transformer framework that first learns spatial context and then models inter-element dependencies, achieving a mean AUC of 0.7712 that surpasses all baselines.
GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration: This paper introduces GeoChemAD, an open-source benchmark dataset, and GeoChemFormer, a two-stage framework that performs unsupervised geochemical anomaly detection via spatial context learning and elemental dependency modeling, achieving an average AUC of 0.7712 across 8 subsets.
Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning: This paper proposes leveraging DINOv3 with two self-supervised pretraining tasks — individual optical flow estimation and group-relevant object localization — to learn group activity features (GAF), achieving substantial improvements over existing methods without any group activity annotations.
LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency: This paper proposes LaS-Comp, a zero-shot, category-agnostic 3D shape completion framework. It injects known geometry in the spatial domain via an Explicit Replacement Stage (ERS) and optimizes boundary consistency in the latent space via gradient-based updates in an Implicit Alignment Stage (IAS). The framework bridges the gap between the latent space and spatial domain of pretrained 3D foundation models, achieving state-of-the-art performance across diverse partial observation patterns.
MINE-JEPA: In-Domain Self-Supervised Learning for Mineral Exploration: This paper proposes Mine-JEPA, the first in-domain self-supervised learning (SSL) pipeline for side-scan sonar (SSS) mine classification. Built upon SIGReg regularization loss, sonar-adapted augmentation strategies, and ImageNet initialization, Mine-JEPA pretrained on only 1,170 unlabeled sonar images surpasses DINOv3—a foundation model pretrained on 1.7 billion images.
MOMO: Mars Orbital Model — Foundation Model for Mars Orbital Applications: MOMO is the first foundation model for Mars remote sensing. It pre-trains a separate MAE on imagery from each of three Mars sensors (HiRISE/CTX/THEMIS) and proposes an Equal Validation Loss (EVL) checkpoint selection strategy for merging the resulting models, outperforming ImageNet pre-training and Earth observation foundation models across 9 downstream tasks in Mars-Bench.
OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism: This paper proposes OmniGCD, the first modality-agnostic generalized category discovery method. A GCDformer trained on synthetic data transforms the GCD latent space of arbitrary modalities at test time into representations more amenable to clustering, achieving zero-shot GCD across 16 datasets spanning four modalities.
An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning: This paper proposes an online mixture model framework driven by optimal transport theory (MMOT), which maintains multiple adaptive centroids per class to capture the multimodal distribution of streaming data. Combined with a dynamic preservation strategy to mitigate catastrophic forgetting, the method substantially outperforms existing approaches in the OCIL setting.
Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting: This paper proposes Re-Depth Anything, which refines depth predictions from Depth Anything V2/3 at inference time through self-supervised optimization: the predicted depth map is augmented via re-lighting, and a 2D diffusion model's SDS loss is used to guide the optimization without any labeled data.
Representation Learning for Spatiotemporal Physical Systems: This paper systematically compares four self-supervised/physics-modeling methods on three PDE-based physical systems (active matter, shear flow, and Rayleigh-Bénard convection), finding that latent-space prediction (JEPA) consistently outperforms pixel-level prediction (VideoMAE) on physical parameter estimation tasks — achieving 28%–51% relative MSE reduction — and that JEPA trained with only 10% of fine-tuning data surpasses VideoMAE trained on 100% of the data. Notably, methods specifically designed for physical modeling are not always the optimal choice.
Representation Learning for Spatiotemporal Physical Systems: This paper systematically benchmarks four learning paradigms — JEPA, VideoMAE, an autoregressive foundation model (MPP), and an operator learning method (DISCO) — across three PDE-based physical systems. It finds that latent-space predictive objectives (JEPA) consistently outperform pixel-level prediction methods on the downstream task of physical parameter estimation, achieving 28–51% relative MSE reduction with greater data efficiency.
Robustness of Vision Foundation Models to Common Perturbations: This paper presents the first systematic study on the robustness of vision foundation models to common perturbations (e.g., JPEG compression, brightness adjustment). It proposes three robustness metrics, formalizes five mathematical properties, finds that foundation models are generally non-robust, and introduces a fine-tuning method that improves robustness without sacrificing utility.
Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild: This paper reformulates semantic correspondence as a Fused Gromov-Wasserstein (FGW) optimal transport problem, leveraging geometric structural constraints from 3D foundation models to generate globally consistent pseudo labels, thereby addressing the geometric inconsistency caused by the locality and 2D appearance ambiguity inherent in conventional nearest-neighbor matching.
SpHOR: A Representation Learning Perspective on Open-set Recognition: SpHOR proposes a two-stage decoupled training framework: Stage 1 performs OSR-tailored representation learning via orthogonal label embeddings, spherical constraints (vMF distribution), and Mixup/Label Smoothing; Stage 2 freezes the encoder and trains a linear classifier. The method achieves up to 5.1%/5.2% gains in OSCR/AUROC on the Semantic Shift Benchmark, and introduces two new metrics: Angular Separability and Norm Separability.
SpHOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Neural Networks: This paper proposes SpHOR, a two-stage decoupled training framework for open-set recognition (OSR) that explicitly shapes the feature space via spherical representation learning (vMF distributions), orthogonal label embeddings, and integrated Mixup/Label Smoothing, achieving up to 5.1% OSCR improvement on the Semantic Shift Benchmark.
Suppressing Non-Semantic Noise in Masked Image Modeling Representations: This paper identifies that representations learned by Masked Image Modeling (MIM) retain substantial non-semantic information (e.g., low-level features such as texture and color), and proposes a training-free post-hoc method, SOAP (Semantically Orthogonal Artifact Projection), which leverages PCA to identify and project out non-semantic components, consistently improving zero-shot performance across multiple MIM models.
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction: This paper proposes TALO, a high-degrees-of-freedom alignment framework based on Thin Plate Spline (TPS), which corrects spatially varying geometric inconsistencies of 3D vision foundation models (3DVFMs) in online reconstruction via globally propagated control points and a point-agnostic submap registration design. TALO is compatible with multiple foundation models and camera configurations, and significantly reduces trajectory error on the Waymo and nuScenes datasets.
TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation: This paper proposes TeFlow, the first method to introduce multi-frame supervision into self-supervised feed-forward scene flow estimation. It builds a motion candidate pool via temporal aggregation and derives temporally consistent supervision signals through consensus voting. TeFlow achieves a Three-way EPE of 3.57 cm on Argoverse 2, a 22.3% improvement over SeFlow++ and on par with the optimization-based method Floxels, while maintaining real-time inference (8 s vs. 24 min for Floxels).
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval: This paper proposes TPSNet, which leverages CLIP-learned domain prompts as text priors to provide fine-grained semantic supervision, while introducing phase spectrum features as phase priors to bridge domain distribution gaps and preserve semantic integrity. Significant improvements in unsupervised cross-domain image retrieval (UCDIR) are achieved through the synergistic combination of text-phase dual priors.
TrackMAE: Video Representation Learning via Track, Mask, and Predict: This paper introduces explicit motion signals into the masked video modeling (MVM) framework. Point trajectories extracted via CoTracker3 serve as auxiliary reconstruction targets, complemented by a motion-aware masking strategy. The model jointly learns spatial reconstruction and motion prediction, achieving substantial gains over existing video self-supervised methods on motion-sensitive benchmarks (SSv2, FineGym).
UniGeoCLIP: Unified Geospatial Contrastive Learning: UniGeoCLIP is the first to align five complementary geospatial modalities (aerial imagery, street-view imagery, digital surface models, text, and GPS coordinates) into a unified embedding space via pure contrastive learning, and proposes a multi-scale coordinate encoder to enhance spatial representation capacity.
Vision Transformers Need More Than Registers: This paper argues that dense feature artifacts in ViTs trained under label supervision, text supervision, and self-supervision share a common root cause: rather than a simple high-norm token problem, models learn to exploit background patches as global semantic shortcuts, driven by coarse-grained supervision combined with global attention. The authors accordingly propose LaSt-ViT, which replaces standard CLS aggregation with frequency-domain stability-guided selective aggregation, yielding consistent improvements in localization, segmentation, and open-vocabulary tasks across 12 benchmarks.
Vision Transformers Need More Than Registers: This paper systematically analyzes the artifact phenomenon widely observed in ViTs across fully supervised, text-supervised, and self-supervised paradigms, revealing that the root cause is "lazy aggregation"—ViTs exploit semantically irrelevant background patches as shortcuts to represent global semantics. The authors propose LaSt-ViT (LazyStrike ViT), which anchors the CLS token to foreground regions via frequency-aware selective channel aggregation, consistently eliminating artifacts and improving performance across 12 benchmarks.
VT-Intrinsic: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair: VT-Intrinsic exploits the physical complementarity between visible and thermal infrared images—unreflected light is absorbed as heat—to derive ordinal relationships between visible-thermal intensities that directly correspond to ordinal relationships in reflectance and shading. These ordinal relations serve as dense self-supervised signals to drive neural network optimization, achieving high-quality intrinsic image decomposition without any pre-training data.
Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers: Three substitution control experiments (mean substitution, noise substitution, and cross-image shuffling) demonstrate that zero-ablation overstates the dependence on the precise content of register tokens in DINO-series ViTs — the model requires only "reasonable register-like activations" rather than image-specific values.
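
The substitution controls named in the zero-ablation entry above can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' code: the `(batch, n_registers, dim)` layout, the function name `substitute_registers`, and the choice of Gaussian noise matched to the per-position mean and standard deviation are all hypothetical.

```python
import numpy as np

def substitute_registers(registers, mode, rng):
    """Return a control version of register-token activations.

    registers: array of shape (batch, n_registers, dim) (assumed layout).
    mode: "zero", "mean", "noise", or "shuffle".
    """
    if mode == "zero":
        # Zero-ablation: removes the content AND leaves the usual activation range.
        return np.zeros_like(registers)
    # Dataset-level statistics per register position and channel.
    mean = registers.mean(axis=0, keepdims=True)
    if mode == "mean":
        # Every image receives the same "typical" register activations.
        return np.broadcast_to(mean, registers.shape).copy()
    if mode == "noise":
        # Assumption: Gaussian noise matched to the per-position mean and std.
        std = registers.std(axis=0, keepdims=True)
        return mean + std * rng.standard_normal(registers.shape)
    if mode == "shuffle":
        # Cross-image shuffling: real register values, but from another image.
        perm = rng.permutation(registers.shape[0])
        return registers[perm]
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
regs = rng.normal(loc=2.0, scale=0.5, size=(8, 4, 16))  # toy activations
variants = {m: substitute_registers(regs, m, rng)
            for m in ("zero", "mean", "noise", "shuffle")}
```

In an actual experiment, each variant would replace the register tokens inside the ViT forward pass and downstream accuracy would be compared. If the three non-zero controls cost far less accuracy than zero-ablation, the drop under zero-ablation mostly reflects out-of-range activations rather than lost image-specific content.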

🔬 Interpretability

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP: This paper proposes "information scope" as a novel dimension for SAE feature interpretability. By introducing the Contextual Dependency Score (CDS), it partitions CLIP's SAE features into local features (low CDS) and global features (high CDS), revealing their differentiated functional roles in classification, segmentation, and depth estimation.
CI-ICE: Intrinsic Concept Extraction Based on Compositional Interpretability: This paper introduces the CI-ICE task and the HyperExpress method, which leverages the hierarchical modeling capacity of hyperbolic space (Poincaré ball) to extract composable object-level and attribute-level intrinsic concepts. By applying Horosphere projection to enforce compositionality in the concept embedding space, HyperExpress achieves an ACC₁ of 0.504 on UCEBench, a 55% improvement over ICE (0.325).
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events: This paper proposes CoE, a training-free multimodal summarization framework that constructs a Hierarchical Event Graph (HEG) to guide chain-of-events reasoning. CoE surpasses state-of-the-art video CoT baselines across 8 datasets, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore.
DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification: This paper proposes DINO-QPM, a lightweight interpretability adapter that transforms the complex, high-dimensional features of a frozen DINOv2 backbone into contrastive, class-agnostic interpretable representations. Using quadratic programming for sparse feature selection and class-level feature assignment, the method surpasses DINOv2 linear probing in accuracy and all comparable methods in interpretability on CUB-200-2011 and Stanford Cars.
Draft and Refine with Visual Experts: This paper proposes DnR (Draft and Refine), an agent framework built upon a question-conditioned Visual Utilization metric that quantifies the degree to which LVLMs actually rely on visual evidence. Through iterative rendering feedback from external visual experts (detection, segmentation, OCR, etc.), DnR improves visual grounding and reduces hallucinations.
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing: This paper reframes open-vocabulary 3D indoor scene editing as a goal-regressive planning problem. It introduces EditLang, a PDDL-style symbolic language, and employs an LLM-driven Planner-Validator loop to derive minimal edit sequences by reasoning backward from goal states. Evaluated on 63 editing tasks, the method achieves the best overall balance across instruction fidelity (69.1%), semantic consistency (86.6%), and physical plausibility (91.7%).
ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization: ERMoE proposes reparameterizing MoE expert weights within an orthogonal eigenbasis and replacing conventional routing logits with eigenbasis scores (cosine similarity), achieving stable routing and interpretable expert specialization without auxiliary load-balancing losses.
Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?: This paper proposes the FASS benchmark, which systematically evaluates the stability of post-hoc feature attribution methods through prediction-invariant filtering, a three-axis stability decomposition (spatial / ranking / salient region), and multiple perturbation types (geometric / photometric / compression), exposing fundamental flaws in existing evaluation frameworks.
From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition: This paper proposes SITH (Semantic Inspection of Transformer Heads), a fully data-free and training-free interpretability framework for CLIP. SITH applies SVD directly to the Value-Output weight matrices of attention heads, then leverages a novel COMP algorithm to interpret each singular vector as a sparse combination of semantically coherent concepts. This achieves finer-grained intra-head interpretability than existing methods and enables precise weight editing to improve downstream performance.
Geometry-Guided Camera Motion Understanding in VideoLLMs: This paper reveals that VideoLLMs perform near chance level on fine-grained camera motion primitives (pan/tilt/dolly, etc.), constructs CameraMotionDataset (12K clips × 15 atomic motions) and the CameraMotionVQA benchmark, and proposes a model-agnostic approach that injects geometric camera cues extracted by a frozen 3D foundation model (VGGT) via a lightweight temporal classifier and structured prompting, bridging this capability gap without any fine-tuning of the VideoLLM.
Geometry-Guided Camera Motion Understanding in VideoLLMs: This work systematically reveals camera motion blind spots in VideoLLMs through a benchmarking-diagnosis-injection framework, and significantly improves fine-grained camera motion understanding without fine-tuning by leveraging a frozen 3D foundation model (VGGT) for geometric feature extraction, a lightweight temporal classifier, and structured prompt injection.
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings: This paper proposes generalization performance prediction metrics based on model-internal circuits, including Dependency Depth Bias (DDB) for pre-deployment model selection and Circuit Shift Score (CSS) for post-deployment performance monitoring, improving average correlation over existing proxy metrics by 13.4% and 34.1%, respectively.
Language Models Can Explain Visual Features via Steering: This paper proposes a method for scalable automatic explanation of visual features by causally intervening (steering) on SAE features in VLM visual encoders. By injecting feature vectors into a blank image's forward pass and prompting the language model to describe what it "sees," the approach eliminates the need for an evaluation image set. A hybrid method, Steering-informed Top-k, is further proposed and achieves state-of-the-art performance.
Measuring the (Un)Faithfulness of Concept-Based Explanations: This paper demonstrates that the faithfulness of existing unsupervised concept-based explanation methods (U-CBEMs) is systematically overestimated — due to the use of overly complex surrogate models and flawed deletion-based evaluation. The authors propose SURF (Surrogate Faithfulness), a simple linear surrogate with a dual-space metric framework, validated through a sanity check that "random concepts should be less faithful," and provide the first systematic benchmark revealing that multiple SOTA U-CBEMs are in fact not faithful.
Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared: This paper proposes the first framework that performs cross-modal fusion under missing infrared conditions in the coefficient domain rather than the pixel domain. By learning a shared convolutional dictionary that establishes a unified IR-VIS atomic space, the method performs VIS→IR inference and adaptive fusion entirely in the coefficient domain. A frozen LLM provides weak semantic priors for thermal information completion. The approach achieves performance comparable to dual-modality fusion methods using only visible light images as input.
Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion: This paper proposes ND-CNPFuse, which performs neurodynamical analysis of coupled neural P (CNP) systems to establish constraint relationships between network parameters and input signals, preventing abnormal sustained neuronal firing. The method generates high-quality, interpretable decision maps for multi-focus image fusion (MFIF) without any training.
On the Possible Detectability of Image-in-Image Steganography: This paper exposes a fundamental security flaw in mainstream image-in-image deep steganography schemes: the embedding process is essentially a mixing process that can be readily separated by Independent Component Analysis (ICA). The authors propose an interpretable steganalysis method based on statistical moments of wavelet-domain independent components (achieving 84.6% accuracy with only 8-dimensional features), and demonstrate that the classical SRM+SVM approach achieves detection rates exceeding 99%.
On the Possible Detectability of Image-in-Image Steganography: This paper exposes a fundamental security vulnerability in invertible neural network (INN)-based image-in-image steganography: the embedding process is intrinsically a mixing process identifiable via independent component analysis (ICA). Using only 8-dimensional statistical features with an SVM achieves a detection rate of 84.6%, while the classical SRM+SVM baseline exceeds 99%.
Pixel2Phys: Distilling Governing Laws from Visual Dynamics: Pixel2Phys is a multi-agent collaborative framework built upon MLLMs. Four agents (Plan, Variable, Equation, and Experiment) run an iterative hypothesize-verify-refine loop to automatically discover interpretable governing equations from raw videos, achieving a 45.35% improvement in extrapolation accuracy over baselines.
Reallocating Attention Across Layers to Reduce Multimodal Hallucination: A lightweight, training-free plugin method is proposed to mitigate hallucination in Multimodal Large Reasoning Models (MLRMs) by identifying perceptual and reasoning attention heads and applying Class-Conditioned Rescaling to rebalance cross-layer attention distribution. The method achieves an average improvement of 4.2% across 5 benchmarks with negligible additional inference overhead.
Reallocating Attention Across Layers to Reduce Multimodal Hallucination: This paper decomposes multimodal reasoning model hallucinations into two failure modes — shallow-layer perceptual bias and deep-layer reasoning drift — and selectively amplifies the contributions of identified perception/reasoning functional heads in a plug-and-play, training-free manner, achieving an average accuracy improvement of 4.2% with only ~1% additional computational overhead.
Rethinking Concept Bottleneck Models: From Pitfalls to Solutions: This paper proposes CBM-Suite, a methodological framework that systematically addresses four fundamental pitfalls of Concept Bottleneck Models—the absence of a pre-training concept relevance metric, the linearity problem that allows the concept bottleneck to be bypassed, the accuracy gap relative to black-box models, and the unexplored interaction effects of different visual backbones and VLMs—through entropy-based metrics, nonlinear layers, and distillation losses, significantly improving both accuracy and interpretability of CBMs.
RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation: This paper proposes RiskProp, a collision-anchored self-supervised risk propagation paradigm that learns temporally coherent risk evolution curves using only collision-frame annotations, via a future-frame regularization loss and an adaptive monotonicity constraint loss, achieving state-of-the-art performance on the CAP and Nexar datasets.
SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World: This paper proposes SafeDrive, an end-to-end planning framework that employs a trajectory-conditioned sparse world network (SWNet) to simulate future behaviors of critical entities, followed by a fine-grained reasoning network (FRNet) for per-instance collision assessment and per-timestep drivable-area compliance evaluation. SafeDrive achieves a PDMS of 91.6 with a collision rate of only 0.5% on NAVSIM, and a driving score of 66.8% on Bench2Drive.
SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection: This paper introduces SteelDefectX, the first vision-language dataset for steel surface defect detection (7,778 images, 25 defect categories), featuring coarse-to-fine textual annotations ranging from class-level to sample-level descriptions. A four-task benchmark is established covering pure-vision classification, vision-language classification, zero/few-shot recognition, and zero-shot transfer. Experiments demonstrate that high-quality textual annotations significantly improve model interpretability, generalization, and cross-domain transfer capability.
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling: SubspaceAD demonstrates that fitting a single PCA model on features from a strong visual foundation model (DINOv2-G) is sufficient to outperform all few-shot anomaly detection methods requiring training, memory banks, or prompt tuning, achieving 98.0% image-level AUROC and 97.6% pixel-level AUROC on MVTec-AD under the 1-shot setting.
TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment: This paper proposes the TDATR framework, which achieves end-to-end table recognition under limited annotation data through a "perceive-then-fuse" strategy and a structure-guided cell localization module, attaining state-of-the-art performance across 7 benchmarks without dataset-specific fine-tuning.
Text-guided Fine-Grained Video Anomaly Understanding: This paper proposes the T-VAU framework, which achieves pixel-level spatiotemporal anomaly localization via an Anomaly Heatmap Decoder (AHD), and introduces a Region-Aware Anomaly Encoder (RAE) that injects heatmap evidence into an LVLM for unified reasoning over anomaly detection, localization, and semantic explanation.
Towards Faithful Multimodal Concept Bottleneck Models: This paper proposes f-CBM — the first faithful multimodal Concept Bottleneck Model framework — which mitigates unintended information leakage in concept representations via a differentiable leakage loss, and improves concept detection accuracy using a Kolmogorov-Arnold Network (KAN) prediction head, achieving an optimal Pareto frontier across task accuracy, concept detection, and leakage reduction.
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension: VIRO embeds lightweight operator-level verification mechanisms (CLIP uncertainty verification + spatial logic verification) into a neuro-symbolic REC pipeline, enabling each reasoning step to self-verify and terminate early when no target exists. Under a zero-shot setting, it achieves 61.1% balanced accuracy, substantially outperforming compositional reasoning baselines, while maintaining a program failure rate below 0.3% and efficient inference speed.
Why Does It Look There? Structured Explanations for Image Classification: This paper proposes the I2X framework, which transforms unstructured explainability (saliency maps) into structured explanations by tracking the co-evolution of prototype intensity extracted via GradCAM and model confidence across training checkpoints. The framework reveals the reasoning structure underlying "why the model attends to a specific region" and leverages this understanding to guide fine-tuning for performance improvement.
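
The SubspaceAD entry above describes fitting a single PCA model on foundation-model features of normal samples. A minimal sketch of that idea follows, assuming the anomaly score is the reconstruction residual outside the PCA subspace; the scoring rule, the 99% variance threshold, and the synthetic 32-dimensional features standing in for DINOv2 patch features are all assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def fit_subspace(normal_feats, var_kept=0.99):
    """Fit a PCA subspace to features of normal samples, shape (n, dim)."""
    mean = normal_feats.mean(axis=0)
    centered = normal_feats - mean
    # Rows of vt are principal directions; s**2 is proportional to variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k whose cumulative explained variance reaches var_kept.
    k = min(int(np.searchsorted(ratio, var_kept)) + 1, len(s))
    return mean, vt[:k]

def anomaly_score(feats, mean, basis):
    """Norm of the part of each feature that lies outside the subspace."""
    centered = feats - mean
    recon = centered @ basis.T @ basis
    return np.linalg.norm(centered - recon, axis=1)

rng = np.random.default_rng(0)
true_dirs = rng.standard_normal((3, 32))  # toy "normal" manifold: 3-D subspace

def sample_normal(n):
    return rng.standard_normal((n, 3)) @ true_dirs + 0.01 * rng.standard_normal((n, 32))

mean, basis = fit_subspace(sample_normal(200))
normal_scores = anomaly_score(sample_normal(20), mean, basis)
# Anomalies: normal samples plus a large component off the subspace.
anomalous = anomaly_score(sample_normal(20) + 2.0 * rng.standard_normal((20, 32)), mean, basis)
```

On this toy data the anomalous scores are much larger than the normal ones, so a simple threshold separates them. With real patch features, the same residual computed per patch would give a pixel-level anomaly map, and an aggregate over patches would give the image-level score.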

📊 LLM Evaluation

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation: This paper theoretically proves that fine-tuning weight deltas encode input covariance information, and proposes ACE-Merging, which achieves data-free closed-form model merging through three steps: adaptive covariance estimation, collective structural prior, and spectral refinement. ACE-Merging achieves an average improvement of 4% over prior methods on GPT-2 and 5% on RoBERTa-Base.
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks: This paper proposes AdaBet, a gradient-free layer selection method grounded in algebraic topology, which uses the first Betti number \(b_1\) to quantify the topological complexity of each layer's activation space via a single forward pass—requiring no labels, gradients, or backpropagation. By fine-tuning only 10% of layers on ResNet50/VGG16/MobileNetV2/ViT-B16, AdaBet surpasses full fine-tuning in accuracy while reducing peak memory by approximately 40%.
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening: This paper proposes KAB (Keyframe-Anchored Attention Bias) and ReTRo (Rescaled Temporal RoPE), two training-free inference-time methods built upon the Wan2.1 video diffusion model. These methods address semantic infidelity, frame inconsistency, and temporal rhythm instability in generative inbetweening (GI) with sparse keyframes under large-motion conditions. The paper also introduces TGI-Bench, the first text-conditioned GI evaluation benchmark.
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark: This paper proposes PanScale, the first cross-scale pansharpening dataset, along with the PanScale-Bench evaluation benchmark, and the ScaleFormer framework — which reinterprets resolution variation as sequence length variation, achieving cross-scale generalization via Scale-Aware Patchify bucketed sampling, decoupled spatial-sequence modeling, and RoPE.
CryoHype: Reconstructing a Thousand Cryo-EM Structures with Transformer-Based Hypernetworks: This paper proposes CryoHype, a Transformer-based hypernetwork approach for cryo-EM reconstruction that dynamically modulates the weights of implicit neural representations (INRs) to reduce parameter sharing, achieving for the first time the simultaneous reconstruction of 1,000 distinct protein structures from unlabeled cryo-EM images.
Enhancing Out-of-Distribution Detection with Extended Logit Normalization: This paper identifies two forms of feature collapse induced by LogitNorm during training—dimensional collapse and origin collapse—and proposes a hyperparameter-free Extended Logit Normalization (ELogitNorm) that replaces the distance-to-origin scaling factor with the distance from features to the decision boundary. ELogitNorm significantly improves both post-hoc OOD detection performance and confidence calibration without sacrificing classification accuracy.
Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning: This paper proposes a Factored Flow Prediction module that predicts optical flow from the geometric latent of a source view and the pose latent of a target view, enabling unlabeled videos to serve as supervisory signals for 3D geometry learning. The method achieves state-of-the-art performance across 8 benchmarks covering both static and dynamic scenes.
Free-Grained Hierarchical Visual Recognition: This paper proposes free-grained hierarchical recognition, a setting in which training labels may appear at any level of a taxonomy. Two complementary methods are introduced to compensate for missing supervision — text-guided pseudo-attributes (Text-Attr) and taxonomy-guided semi-supervised learning (Taxon-SSL) — while at inference time the model adaptively selects its prediction depth.
HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT: HeSS proposes a Head Sensitivity Score to quantify the sensitivity of each attention head in VGGT's global attention layers to sparsification, and redistributes the attention budget from insensitive heads to sensitive ones accordingly. This approach significantly outperforms the uniform sparsification method SparseVGGT at high sparsity ratios with virtually no additional runtime overhead.
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces: This paper proposes Hier-COS, a framework that assigns orthogonal basis vectors to each node in a label hierarchy tree to construct a theoretically guaranteed Hierarchy-Aware Vector Space (HAVS). It is the first to unify "hierarchy-aware fine-grained classification" and "hierarchical multi-level classification" within a single framework, while introducing a new evaluation metric HOPS, achieving comprehensive state-of-the-art performance across four datasets.
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces: This paper proposes Hier-COS, a framework that assigns orthogonal basis vectors to each node in a label hierarchy tree and constructs a Hierarchy-Aware Vector Space (HAVS) via subspace composition (ancestor bases + self basis + descendant bases). The approach provides theoretical guarantees that the distance structure of the feature space is consistent with the hierarchy tree, while also introducing the HOPS evaluation metric to address the permutation-invariance deficiency of existing hierarchical evaluation metrics.
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning: This paper identifies a "Domain Gravity" bias in heterogeneous-domain continual learning—whereby data-rich or low-entropy domains exert disproportionate influence in a shared embedding space—and proposes HyCal, a training-free method that calibrates prototypes by fusing cosine similarity and Mahalanobis distance, achieving robust classification in cross-discipline imbalanced few-shot class-incremental learning.
Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery: This paper proposes AL-GCD, a framework that simulates human analogical reasoning by designing an Analogical Text Concept Generator (ATCG)—which analogically generates textual concepts for unlabeled samples by drawing on a visual-textual knowledge base built from labeled categories—thereby casting category discovery as a joint visual-textual reasoning task. AL-GCD achieves an average improvement of 5.0% across six benchmarks, with 7.1% gains on fine-grained datasets.
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models: This paper introduces StEvo-Bench, a benchmark comprising 225 tasks that evaluates whether video world models can correctly continue evolving scene states during unobserved intervals—induced by inserting occlusions or redirecting the camera during video generation. Experiments reveal that state-of-the-art models (including Veo 3 and Sora 2 Pro) achieve success rates below 10%, exposing a fundamental tendency of current video models to couple state evolution tightly with pixel-level observation.
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models: This paper proposes StEvo-Bench, a benchmark comprising 225 tasks across 6 evolution categories, which systematically evaluates whether 9 video world models can decouple state evolution from observation via occlusion or camera-away controls. All models achieve a success rate below 10% under observation interruption, and 5 specialized verifiers are employed to precisely localize failure modes.
Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline: This paper formally separates Video Fluency Assessment (VFA) from conventional Video Quality Assessment (VQA) for the first time, introduces FluVid — the first fluency-oriented benchmark dataset (4,606 videos) — and proposes a baseline model FluNet that leverages Temporal Permuted Self-Attention (T-PSA) for efficient inter-frame interaction, achieving SRCC/PLCC of 0.816/0.821.
PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion: This paper proposes PRISM, a holistic video dataset condensation method that begins from only two temporal anchors (first and last frames), adaptively inserts keyframes by detecting gradient direction conflicts, and achieves state-of-the-art storage efficiency while preserving content–motion coupling integrity — reaching 17.9% accuracy with 20 MB on miniUCF 1VPC, a 5× storage reduction over prior methods (94 MB).
R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII: This paper introduces R2G, the first standardized multi-view circuit graph benchmark suite, providing five stage-aware graph representations with information equivalence across 30 IP cores. A systematic study reveals that graph representation choice has a greater impact on performance than GNN model choice.
ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation: ReflexSplit proposes an explicit layer fusion-separation framework that addresses the transmission-reflection confusion problem in single image reflection separation (SIRS). It employs Cross-scale Gated Fusion (CrGF) for adaptive multi-scale feature aggregation, a differential dual-dimensional attention mechanism \(\mathbf{A}^t - \lambda_\ell \mathbf{A}^r\) within the Layer Fusion-Separation Block (LFSB) for cross-stream interference suppression, and a curriculum training strategy with depth-dependent initialization and epoch-wise warmup to progressively strengthen separation intensity, achieving state-of-the-art performance on both synthetic and real-world benchmarks.
Reframing Long-Tailed Learning via Loss Landscape Geometry: This paper reframes the head-tail seesaw dilemma in long-tailed learning through the lens of loss landscape geometry. It identifies that tail class degradation stems from optimization converging to sharp minima that are far from tail-class optima. A dual-module framework comprising GKP (Grouped Knowledge Preservation) and GSA (Grouped Sharpness Aware) is proposed based on continual learning principles, achieving state-of-the-art results on four benchmarks (CIFAR-LT / ImageNet-LT / iNat2018) without requiring additional data.
SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval: This paper proposes SATTC, a label-free test-time calibration head that operates directly on the similarity matrix over frozen EEG and image encoders. It combines a geometric expert (subject-adaptive whitening + adaptive CSLS) and a structural expert (mutual nearest neighbors + bidirectional top-k ranking + category popularity) via a product-of-experts fusion, significantly improving Top-1 accuracy and reducing the hubness effect in cross-subject EEG-to-image retrieval.
Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score: This paper proposes SemiCP, a framework that incorporates unlabeled data into the conformal prediction calibration pipeline via a Nearest Neighbor Matching (NNM) score. Under extremely limited labeled data, SemiCP reduces the average coverage gap by up to 77% while simultaneously shrinking prediction set sizes.
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras: This paper proposes SparseCam4D, the first method to achieve sparse-camera (2–3 views) 4D reconstruction on standard multi-camera dynamic scene benchmarks. The core innovation is the Spatio-Temporal Distortion Field (STDF), which explicitly models spatio-temporal inconsistencies in generative observations and decouples them from the underlying 4D Gaussian representation, enabling high-fidelity, spatio-temporally consistent rendering of dynamic scenes.
TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation: This paper presents TacSIm, the first large-scale dataset and benchmark that reconstructs full-team trajectories from real Premier League broadcast footage and performs tactical style imitation in a virtual football environment, quantifying imitation fidelity via two metrics: spatial occupancy similarity and motion vector similarity.
Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning: This paper identifies temporal imbalance as a previously overlooked source of bias in class-incremental learning (CIL) and proposes the Temporal-Adjusted Loss (TAL), which dynamically downweights negative supervision for old classes via a temporally decaying memory kernel. TAL integrates in a plug-and-play manner and significantly alleviates catastrophic forgetting.
Unified Primitive Proxies for Structured Shape Completion: This paper proposes UniCo, which learns unified primitive representations over shared shape features via primitive proxies, jointly predicting complete point clouds and assembly-ready quadric primitives (with geometry, semantics, and membership) in a single forward pass. UniCo reduces Chamfer distance by up to 50% and improves normal consistency by up to 7% on synthetic and real-world point cloud benchmarks.
VGA-Bench: A Unified Benchmark for Video Aesthetics and Generation Quality Evaluation: VGA-Bench proposes a unified AIGC video evaluation benchmark comprising a three-tier taxonomy (aesthetic quality, aesthetic labels, and generation quality), 1,016 prompts, 60,000 videos, and three dedicated evaluation models, enabling automated assessment aligned with human judgment.
Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning: This paper proposes the LAS-VAD framework, which has two components. An Anomaly-Connected Components (ACC) mechanism partitions video frames into semantically consistent groups and uses them to generate pseudo-labels, compensating for the absence of frame-level annotations. An Intention-Aware Mechanism (IAM) leverages position-velocity-acceleration features to distinguish normal from anomalous behaviors that look similar but differ in intention. The method achieves 89.96% AP (I3D) on XD-Violence.
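
The SemiCP entry above builds on the conformal prediction calibration pipeline. As background only, the following is a minimal sketch of standard split conformal calibration, the labeled-data baseline that such methods extend; it is not SemiCP's Nearest Neighbor Matching score. The function names, the `1 - p` nonconformity score, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the split-conformal threshold at miscoverage level alpha.

    cal_scores are nonconformity scores computed on a held-out labeled
    calibration set (an assumed setup for this sketch).
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every label is kept.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Keep each label whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration scores (hypothetical values).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
q = conformal_threshold(cal_scores, alpha=0.25)
labels = prediction_set([0.5, 0.45, 0.05], q)
```

With very few labeled calibration points the threshold becomes coarse or degenerates to "keep every label", which is the regime where drawing on unlabeled data, as SemiCP proposes, would plausibly help.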

🛡️ AI Safety

A Unified Perspective on Adversarial Membership Manipulation in Vision Models: This work is the first to reveal an adversarial membership manipulation vulnerability in membership inference attacks (MIA) against vision models: imperceptible perturbations can make non-members appear to be members and thereby deceive auditing. It identifies a gradient norm collapse signature in forged members, and proposes a gradient-geometry-based detection strategy (MFD) and an adversarially robust inference framework (AR-MIA).
All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference: This paper proposes the Pseudo-Random Bayesian Inference (PRBI) framework for collaborative perception scenarios where all vehicles are untrusted. By leveraging inter-frame temporal consistency as a self-referential signal, PRBI employs pseudo-random grouping combined with Bayesian inference to efficiently identify and exclude malicious vehicles at an average cost of only 2.5 validations per frame, recovering detection accuracy to 79.4%–86.9% of the pre-attack baseline.
ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering: This paper proposes ClusterMark, a watermarking method based on visual token clustering for autoregressive image generation models. By assigning semantically similar tokens to the same color set (red/green), ClusterMark substantially improves watermark robustness under image perturbations while preserving image quality and enabling fast verification.
ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering: This paper proposes ClusterMark, a watermarking scheme based on visual token clustering that adapts KGW-style LLM watermarking to autoregressive image generators. By assigning visually similar tokens to the same green/red partition, it significantly improves watermark robustness under image perturbations while preserving image quality.
Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression: This paper proposes FOUL, a two-stage framework that decouples causal and non-causal features during training and performs on-server gradient conflict matching during unlearning, achieving efficient federated unlearning with low communication overhead without accessing client data.
AdvMark: Decoupling Defense Strategies for Robust Image Watermarking: AdvMark proposes a two-stage decoupled defense framework: Stage 1 Encoder Adversarial Training (EAT) pushes watermarked images into non-attackable regions to resist adversarial attacks; Stage 2 performs direct image optimization to defend against distortion and regeneration attacks while preserving adversarial robustness. Evaluated across 9 watermarking methods × 10 attack types, AdvMark improves distortion/regeneration/adversarial accuracy by 29%/33%/46% respectively, while achieving the best image quality.
Domain-Skewed Federated Learning with Feature Decoupling and Calibration: This paper proposes F²DC, a framework that employs a Domain Feature Decoupler (DFD) and a Domain Feature Corrector (DFC) to decompose local client features in federated learning into domain-robust features and domain-related features. Rather than discarding the latter, F²DC calibrates them to recover entangled class-discriminative information, and combines this with a domain-aware aggregation strategy. The method consistently outperforms state-of-the-art approaches across three multi-domain datasets.
FecalFed: Privacy-Preserving Poultry Disease Detection via Federated Learning: This paper proposes FecalFed, a privacy-preserving federated learning framework that first removes 46.89% duplicate contamination from public poultry fecal datasets via a dual-hash deduplication pipeline and releases a clean benchmark of 8,770 images (poultry-fecal-fl). Under highly non-IID conditions (Dirichlet α=0.5), FedAdam + Swin-Small recovers accuracy from a collapsed 64.86% (single-farm) to 90.31%, only 4.79 percentage points below the centralized upper bound of 95.10%. The edge-optimized Swin-Tiny (28M parameters) still achieves 89.74%, providing an efficient and practical solution for on-farm deployment.
FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation: This paper proposes FedAFD, a framework that simultaneously improves model performance for both heterogeneous clients and the server in multimodal federated learning through a three-stage design comprising bi-level adversarial alignment, granularity-aware feature fusion, and similarity-guided ensemble distillation.
FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift: This paper proposes FedDAP, a domain-aware prototype federated learning framework that addresses global model performance degradation caused by client-side domain shift in federated learning. FedDAP constructs domain-specific global prototypes and employs a dual prototype alignment strategy comprising intra-domain alignment and cross-domain contrastive learning.
Federated Active Learning Under Extreme Non-IID and Global Class Imbalance: This paper systematically analyzes the impact of global class imbalance and client heterogeneity on query model selection in federated active learning (FAL), derives three core Observations, and proposes FairFAL—a class-fair FAL framework featuring adaptive query model selection, prototype-guided pseudo-labeling, and two-stage uncertainty-diversity balanced sampling—consistently outperforming all baselines across five benchmark datasets.
Federated Active Learning Under Extreme Non-IID and Global Class Imbalance: This paper systematically investigates the query model selection problem in federated active learning (FAL), identifies class-balanced sampling as the key performance factor, and proposes FairFAL — a framework achieving fair and efficient FAL via adaptive model selection, prototype-guided pseudo-labeling, and uncertainty-diversity balanced sampling.
FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning: This paper proposes FedRE, a framework that achieves a three-way balance among performance, privacy protection, and communication overhead in model-heterogeneous federated learning via "entangled representations"—aggregating all local representations of each client into a single cross-class representation using normalized random weights.
Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting: This paper proposes CrowdGen, the first cross-paradigm adversarial attack framework targeting both density-map and point-regression crowd counting models. A lightweight UNet generator combined with a multi-task loss (logit suppression, density suppression, GradCAM guidance, and frequency-domain constraint) achieves high transferability (TR up to 1.69) across seven SOTA crowd counting models while maintaining visual imperceptibility (~19 dB PSNR), increasing MAE under attack by 7× on average.
LogitDynamics: Reliable ViT Error Detection from Layerwise Logit Trajectories: LogitDynamics attaches lightweight classification heads to each layer of a ViT to extract layerwise logit trajectories and top-K competition dynamics, then trains a linear probe to predict model errors, outperforming existing methods in cross-dataset generalization.
Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning: This paper formally connects Stochastic Depth (SD) to the Bayesian variational inference framework, proposes Monte Carlo Stochastic Depth (MCSD) as an uncertainty estimation method, and conducts the first systematic benchmark on modern detectors including YOLO and RT-DETR, demonstrating competitive performance against MC Dropout in terms of calibration and uncertainty ranking.
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control: O2MAG proposes a training-free few-shot anomaly generation method that synthesizes diverse and realistic anomalies from a single reference anomaly image via a tri-branch diffusion process with self-attention grafting (TriAG). It incorporates Anomaly Guidance Optimization (AGO) to align textual semantics and Dual Attention Enhancement (DAE) to ensure complete mask-region filling. The method significantly outperforms existing approaches on downstream anomaly detection benchmarks using MVTec-AD.
ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning: ProxyFL uses classifier weights as unified proxies to simultaneously mitigate external heterogeneity (cross-client distribution discrepancy) and internal heterogeneity (distribution mismatch between labeled and unlabeled data) in federated semi-supervised learning, achieving substantial improvements over existing FSSL methods across multiple benchmarks.
RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces: This paper proposes RecoverMark, a robust watermarking framework that embeds facial content itself as a watermark into the background region, simultaneously achieving tampering localization, original content recovery, and copyright verification while remaining effective under watermark removal attacks.
SubFLOT: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport: This paper proposes SubFLOT, a framework that leverages Optimal Transport (OT) on the server side to align the parameter distributions of a global model with clients' historical models, enabling personalized pruning without access to raw data. Combined with an adaptive regularization mechanism to suppress pruning-induced parameter drift, SubFLOT substantially outperforms existing federated pruning methods across multiple datasets.
TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking: This paper proposes TIACam, a framework that simulates camera distortions via a learnable auto-augmentor, learns invariant features through text-anchored cross-modal adversarial training, and binds binary messages to features via a zero-watermarking head—achieving camera-robust zero-watermarking without modifying any image pixels. TIACam attains state-of-the-art bit accuracy across three real-world scenarios: screen recapture, print-and-scan, and screenshot.
Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction: This paper proposes SADCA (Semantic-Augmented Dynamic Contrastive Attack), which iteratively disrupts cross-modal semantic consistency between adversarial images and texts via a dynamic contrastive interaction mechanism and a semantic augmentation module. SADCA significantly improves adversarial transferability against vision-language pre-training (VLP) models, surpassing existing SOTA methods in both cross-model and cross-task attack settings.
Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection: This paper proposes the Tutor-Student Reinforcement Learning (TSRL) framework, which formulates the training process of a deepfake detector as a Markov Decision Process. A "tutor" (PPO agent) dynamically assigns loss weights to individual samples based on their visual features and historical learning dynamics (EMA loss, forgetting count). A "state-change" reward signal guides the "student" (detector) to prioritize high-value samples, substantially improving generalization in cross-dataset and cross-method evaluations.
When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models: This paper proposes the UPA-RFAS framework, which learns a single physical adversarial patch to achieve universal, transferable black-box attacks against VLA robot policies through a combination of feature-space displacement, attention hijacking, and semantic misalignment.
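
The ClusterMark entries above describe adapting KGW-style green/red watermarking so that similar visual tokens share a partition. The sketch below illustrates that idea under stated assumptions: the token-to-cluster map, the hash-based colouring seeded by the previous cluster, the key, and `GAMMA` are all hypothetical choices for illustration, and only the z-score test follows the standard KGW detection statistic. It is not the authors' implementation.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of clusters coloured green


def is_green(cluster_id, prev_cluster_id, key="demo-key"):
    """Pseudo-randomly colour a cluster, seeded by the key and the previous cluster."""
    seed = f"{key}:{prev_cluster_id}:{cluster_id}".encode()
    first_byte = hashlib.sha256(seed).digest()[0]
    return first_byte / 256.0 < GAMMA


def detection_z_score(tokens, token_to_cluster, key="demo-key"):
    """KGW-style z-score computed over cluster ids instead of raw token ids."""
    clusters = [token_to_cluster[t] for t in tokens]
    scored = len(clusters) - 1  # the first token has no predecessor
    if scored < 1:
        raise ValueError("need at least two tokens to score")
    green = sum(is_green(c, p, key) for p, c in zip(clusters, clusters[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


# Hypothetical codebook: six visual tokens grouped into three clusters.
token_to_cluster = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
original = [0, 2, 4, 1, 3, 5]
perturbed = [1, 3, 5, 0, 2, 4]  # every token swapped for a cluster-mate
z_original = detection_z_score(original, token_to_cluster)
z_perturbed = detection_z_score(perturbed, token_to_cluster)
```

Because colour depends only on cluster ids, a perturbation that moves a token to another token in the same cluster leaves the green count, and therefore the z-score, unchanged. That is the intuition behind the robustness claim, shown here only on a toy codebook.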

🎮 Reinforcement Learning

AceTone: Bridging Words and Colors for Conditional Image Grading: AceTone is the first unified framework for multimodal-conditioned color grading that supports both text and reference image inputs. It compresses 3D-LUTs into 64 discrete tokens via VQ-VAE, trains a VLM to predict LUT token sequences, and then applies GRPO reinforcement learning to align color similarity and aesthetic preference, achieving a 50% improvement in LPIPS on both style transfer and instruction-based grading tasks.
Anticipatory Planning for Multimodal AI Agents: This paper proposes TraceR1, a two-stage RL framework in which the first stage employs trajectory-level reward optimization to train agents to perform multi-step look-ahead planning, while the second stage applies grounded fine-tuning via tool execution feedback to improve single-step precision. The approach achieves open-source state-of-the-art results across 7 GUI and tool-use benchmarks.
AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization: AnyDoc proposes a general-purpose document generation framework based on a unified HTML/CSS representation. It constructs a 265K-document dataset, DocHTML, via an automated data synthesis pipeline, and fine-tunes a multimodal large language model through SFT and Height-Aware Reinforcement Learning (HARL). The framework surpasses baselines including GPT-4o on three tasks: intent-to-document, document de-rendering, and element-to-document generation.
BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment: This paper proposes BRIDGE, a system that distills noisy multimodal queries into retrieval-optimized pure-text queries via FORGE (an RL-trained query alignment model), paired with LENS, a reasoning-enhanced retriever. BRIDGE achieves 29.7 nDCG@10 on MM-BRIGHT, and as a plug-in further improves Nomic-Vision to 33.3, surpassing the best text-only retriever.
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning: This paper proposes CCCaption, a dual-reward reinforcement learning framework that jointly optimizes completeness (via a visual query set generated by multiple MLLMs) and correctness (via hallucination detection on sub-queries decomposed from the caption) for image captioning. A 2B model trained under this framework surpasses a 32B baseline.
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning: This paper proposes Cross-modal Identity Mapping (CIM), which quantifies information loss in image captioning by analyzing the representational consistency (GRC) of images retrieved via captions and their relevance to the source image (QIR). These metrics serve as RL reward signals to train LVLMs to generate fine-grained and accurate captions without requiring additional annotations.
GeoWorld: Geometric World Models: GeoWorld maps the latent representations of predictive world models from Euclidean space onto a hyperbolic manifold, preserving geometric structure and hierarchical relationships via Hyperbolic JEPA, and proposes Geometric Reinforcement Learning to optimize multi-step planning. The method achieves gains of approximately 3% SR (T=3) and 2% SR (T=4) on CrossTask and COIN.
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion: This paper proposes GraspLDP, which injects grasp pose priors from a pretrained grasp detector and graspness map visual cues into a latent diffusion policy framework. By leveraging VAE-encoded action latent spaces for guidance and a self-supervised reconstruction objective, GraspLDP substantially improves grasping accuracy and generalization.
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment: This paper proposes a lifelong imitation learning framework that stores and replays compact representations in the feature space of frozen encoders via Multimodal Latent Replay (MLR), and introduces an Incremental Feature Adjustment (IFA) mechanism that employs angular distance constraints to maintain inter-task separability. The method achieves AUC improvements of 10–17 points and reduces forgetting by up to 65% on the LIBERO benchmark.
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment: This paper proposes a lifelong imitation learning framework that combines Multimodal Latent Replay (storing and replaying compact multimodal features in the latent space of a frozen encoder) with Incremental Feature Adjustment (an adaptive margin constraint based on angular distance to prevent inter-task representation drift), achieving 10–17 point AUC gains and 65% reduction in forgetting on the LIBERO benchmark.
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning: This paper proposes Multi-Stage Reinforcement Learning (MSRL), which first learns reward reasoning capabilities on large-scale text preference data and then progressively transfers them to multimodal tasks, addressing the bottleneck of scarce annotated data in multimodal reward model training. MSRL improves accuracy on VL-RewardBench from 66.6% to 75.9%.
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning: This paper proposes MSRL (Multi-Stage Reinforcement Learning), which scales generative multimodal reward modeling through a multi-stage RL curriculum: first learning general reward reasoning on large-scale text preference data (400K) via RL, then transferring to the multimodal domain via caption-based RL and cross-modal knowledge distillation, and finally fine-tuning with a small amount of multimodal preference data. Without additional multimodal annotations, MSRL improves performance from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench.
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset: This paper proposes RADAR, a fully autonomous closed-loop robotic data collection framework. Through the synergistic operation of four modules—VLM semantic planning, GNN policy execution, VQA success evaluation, and LIFO causal environment reset—the system requires only 2–5 human demonstrations to continuously generate high-quality manipulation data without human intervention, achieving a 90% success rate on long-horizon simulation tasks.
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset: This paper presents RADAR — a fully autonomous closed-loop robotic manipulation data generation engine comprising four modules: VLM-based semantic planning, GNN policy execution, VQA-based success evaluation, and FSM-orchestrated LIFO causal reverse environment reset. Requiring only 2–5 human demonstrations, the system continuously generates high-fidelity manipulation data, achieving 90% success rate on complex long-horizon tasks in simulation.
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering: This paper proposes ReAG, a reasoning-augmented multimodal RAG framework that combines coarse- and fine-grained retrieval with a Critic filtering model to reduce noise, and trains a generator via GRPO reinforcement learning to perform explicit reasoning, achieving new state-of-the-art performance on knowledge-intensive VQA.
Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision: This paper proposes two modules, ReAL and CGRO, which extract anomaly-relevant tokens from the autoregressive reasoning process of an MLLM and aggregate their visual attention maps to generate pixel-level anomaly maps. A consistency-guided reinforcement learning scheme then aligns reasoning tokens with visual evidence, enabling end-to-end anomaly detection, localization, and interpretable reasoning under image-level supervision only.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning: This paper proposes RLER, a dual-paradigm framework in which the training stage employs GRPO with three novel rewards (Frame-sensitive, Think-transparency, Anti-repetition) to teach the model to generate structured evidence, while the inference stage uses a training-free orchestrator to perform evidence-consistency-based weighted election and self-checking across multiple candidates. RLER comprehensively outperforms open-source and RL-based LMMs on 8 video benchmarks with an average gain of 6.3%, requiring only approximately 3.1 candidates on average.
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation: This paper presents the first systematic empirical study on the properties of wrist-mounted fisheye cameras in imitation learning for robotic manipulation. Centered on three core research questions—spatial localization, scene generalization, and hardware generalization—it reveals both the advantages and limitations of wide field-of-view (FoV) imaging, and proposes Random Scale Augmentation (RSA) to address scale overfitting in cross-camera transfer.
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning: This paper proposes RoboAgent, a capability-driven embodied task planning framework in which a single VLM serves as both the scheduler and the provider of five basic capabilities (exploration guidance, object grounding, scene description, action decoding, experience summarization). Through three-stage training (SFT + DAgger + expert-guided RL), RoboAgent achieves state-of-the-art performance on EB-ALFRED and ALFWorld.
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs: This paper proposes Evidence-Constrained Reweighting Decoding (ECRD), a framework that maintains a dynamic textual evidence pool during LVLM decoding, reweights candidate tokens via distribution negotiation, and automatically invokes a lightweight visual decider to extract micro-evidence under uncertainty—achieving significant reductions in visual hallucination and improvements in reasoning accuracy across multiple LVLMs without any training.
Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement: VFLM proposes a layout generation framework that leverages visual feedback for iterative refinement. By combining a visually grounded reward model based on OCR accuracy with reinforcement learning training, the framework enables multimodal large language models to "see" rendered outputs and repeatedly self-correct, achieving substantial improvements in text layout quality over code-only generation approaches.
Specificity-aware Reinforcement Learning for Fine-grained Open-world Classification: This paper proposes SpeciaRL—a specificity-aware reinforcement learning framework that guides reasoning-capable large multimodal models to simultaneously improve prediction specificity and correctness in open-world fine-grained image classification, via a dynamic reward signal derived from the best prediction among online rollouts.
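
Several entries above (AceTone, ReAG, RLER) train with GRPO. As a reminder of its core step, here is a minimal sketch of the group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this sketch; individual papers and libraries may differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    # eps keeps the division finite when all rollouts earn the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt with hypothetical binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group with identical rewards yields zero advantage for every rollout.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Rollouts that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed. A group with identical rewards contributes no learning signal.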

🦾 LLM Agent

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search: This paper proposes ARGOS, the first benchmark and framework that redefines multi-camera person search as an interactive reasoning problem. An agent conducts multi-turn dialogue with witnesses, invokes spatiotemporal tools, and eliminates candidates under information asymmetry. The benchmark comprises 2,691 tasks across 3 progressive tracks.
CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare: This paper proposes the CareFlow benchmark (1,050 long-horizon medical software workflow tasks, 8–24 steps, covering four systems: DICOM/3D Slicer/EMR/LIS) and the CarePilot framework (based on the Actor-Critic paradigm, integrating tool grounding and a dual memory mechanism), achieving approximately 15% higher task accuracy than GPT-5 on CareFlow.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration: EchoTrail-GUI builds a high-quality action memory repository through critic-model-guided autonomous exploration and dynamically retrieves relevant experiences to inject into prompts at inference time, improving GPT-4o's task success rate on AndroidWorld from 34.5% to 51.7%.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration: EchoTrail-GUI proposes a three-stage closed-loop framework: an exploration agent autonomously interacts with GUI environments to generate trajectories → a critic reward model filters and retains only high-quality trajectories to construct a memory store (EchoTrail-4K) → upon receiving a new task, the most relevant memories are injected via hybrid dense-sparse retrieval to guide inference. This transforms a stateless GUI agent into a memory-augmented system, achieving 51.7% SR (+17.2pp) with GPT-4o on AndroidWorld, and improving Qwen2.5-VL-72B SR from 23.9% to 37.5% on AndroidLab.
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos: This paper proposes Ego2Web, the first benchmark that bridges egocentric video perception with web agent execution, accompanied by a semi-automatic data construction pipeline and the Ego2WebJudge automatic evaluation framework. Experiments reveal that current state-of-the-art agents still exhibit a substantial gap in cross-modal transfer from real-world visual perception to online action, with the best model achieving only 48.2% success rate.
Gen-n-Val: Agentic Image Data Generation and Validation: This paper proposes Gen-n-Val, an agentic synthetic data generation and validation framework that leverages an LLM to optimize Layer Diffusion prompts for generating high-quality single-object transparent images, and employs a VLLM to filter low-quality samples. The framework reduces invalid synthetic data from 50% to 7%, achieving a 7.6% mAP improvement on LVIS rare-category instance segmentation.
GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents: This paper proposes GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, covering 201 mainstream Chinese apps and 4 device types. It adopts a two-tier structure (foundation + application) to perform fine-grained diagnosis across five dimensions—perception, planning, reflection, execution, and evaluation. Experiments on 20 representative models reveal that current models exhibit significant deficiencies in reflection and self-evaluation.
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents: This paper proposes HATS (Hardness-Aware Trajectory Synthesis), a difficulty-aware trajectory synthesis framework that employs a closed-loop mechanism of hardness-driven exploration and alignment-guided refinement. By focusing on the collection and correction of training trajectories for semantically ambiguous actions, HATS substantially improves the generalization capability of GUI Agents in complex real-world scenarios.
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents: This paper proposes HATS — a hardness-aware trajectory synthesis framework that identifies and handles semantically ambiguous GUI actions via two closed-loop modules: hardness-driven exploration and alignment-guided refinement, significantly improving the cross-environment generalization of GUI agents.
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search: This paper proposes the HAVEN framework, which combines audiovisual entity cohesion, a four-level hierarchical video index (global–scene–clip–entity), and an agentic search mechanism, achieving 84.1% overall accuracy on LVBench and 80.1% on its reasoning category.
HAVEN: Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search: HAVEN proposes a unified framework combining audiovisual entity cohesion, hierarchical indexing, and agentic search. By leveraging speaker identity as a cross-modal coherence signal, it constructs a four-level hierarchical database (global → scene → clip → entity), achieving state-of-the-art 84.1% overall accuracy on LVBench.
Nerfify: A Multi-Agent Framework for Turning NeRF Papers into Code: Nerfify is a four-stage pipeline (CFG formalization with in-context learning, compositional citation recovery, GoT-based code synthesis, and visual feedback) that automatically converts NeRF papers into trainable Nerfstudio plugins, achieving 100% executability on a 30-paper benchmark (vs. 5% for general baselines) with visual quality within ±0.5 dB PSNR of expert implementations.
Nerfify: A Multi-Agent Framework for Turning NeRF Papers into Code: Nerfify is a domain-aware multi-agent framework that automatically converts NeRF papers into trainable Nerfstudio plugin code via context-free grammar (CFG) constraints, Graph-of-Thought (GoT) code synthesis, and compositional reference dependency recovery, achieving 100% executability with visual quality within ±0.5 dB PSNR of expert implementations.
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting: This paper proposes REALM, a framework that leverages an MLLM agent to perform reasoning segmentation on views rendered by 3D Gaussian Splatting (3DGS), and introduces a Global-Local Spatial Grounding strategy (GLSpaG) to aggregate multi-view MLLM reasoning results. REALM substantially outperforms existing methods on implicit-instruction 3D segmentation (mIoU 92.88% vs. baseline 44.82% on LERF) and supports downstream 3D editing.
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting: REALM leverages MLLM reasoning capabilities to perform open-world 3D reasoning segmentation on 3DGS via a global-to-local spatial grounding strategy, handling implicit instructions without 3D post-training. It achieves 92.88% mIoU on LERF, surpassing baseline methods by 40+ percentage points, and supports editing tasks including object removal, replacement, and style transfer.
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation: This paper proposes SceneAssistant—a VLM agentic framework driven purely by visual feedback—that designs 14 functionally complete Action APIs enabling Gemini-3.0-Flash to iteratively generate and refine open-vocabulary 3D scenes within a ReAct closed loop, requiring neither predefined spatial relation templates nor external layout solvers. On a human evaluation of 30 scenes, it achieves a Layout score of 7.600 (vs. SceneWeaver 5.800) and a Human Preference rate of 65%.
Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding: VideoHV-Agent reframes long video QA as a hypothesis-verification process: a Thinker rewrites answer options into testable hypotheses, a Judge extracts discriminative clues, a Verifier localizes evidence in the video, and an Answer agent synthesizes evidence into a final answer. The framework achieves state-of-the-art results on EgoSchema, NextQA, and IntentQA while outperforming existing agent methods in inference efficiency.
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding: This work presents the first systematic study of discrete vision-language diffusion models (DVLMs) for GUI grounding, adapting LLaDA-V for single-step action prediction and proposing a hybrid masking schedule (linear + deterministic) to capture geometric hierarchical dependencies among bounding box coordinates. The approach demonstrates the feasibility of diffusion models as a foundation for GUI agents across Web, Desktop, and Mobile interfaces.
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning: This paper proposes WorldMM, a video reasoning agent based on multimodal memory, which constructs three complementary memory types: episodic memory (multi-temporal-scale textual knowledge graphs), semantic memory (continuously updated relational knowledge graphs), and visual memory (frame-level retrieval stores). An adaptive multi-round retrieval agent dynamically selects the most relevant memory source and temporal granularity, achieving an average improvement of 8.4% over the previous state of the art across five long video QA benchmarks.
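
The hybrid dense-sparse retrieval mentioned in the EchoTrail-GUI note above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the scoring functions, the `alpha` weight, the memory entries, and the toy embedding vectors are all assumptions chosen for illustration. A dense score (cosine similarity over embeddings) and a sparse score (token overlap) are blended, and the top-scoring memories are returned.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def token_overlap(query, text):
    """Sparse score: fraction of query tokens that also occur in the text."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    shared = sum(min(q[w], t[w]) for w in q)
    return shared / max(sum(q.values()), 1)


def hybrid_retrieve(query, query_vec, memory, k=1, alpha=0.5):
    """Rank memory entries by alpha * dense + (1 - alpha) * sparse."""
    scored = []
    for entry in memory:
        dense = cosine(query_vec, entry["vec"])
        sparse = token_overlap(query, entry["task"])
        scored.append((alpha * dense + (1 - alpha) * sparse, entry["task"]))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [task for _, task in scored[:k]]


# Hypothetical memory store; the vectors stand in for real embeddings.
memory = [
    {"task": "turn on wifi in settings", "vec": [1.0, 0.0, 0.0]},
    {"task": "send an email with attachment", "vec": [0.0, 1.0, 0.0]},
    {"task": "set an alarm for 7 am", "vec": [0.0, 0.0, 1.0]},
]
top = hybrid_retrieve("turn off wifi", [0.9, 0.1, 0.0], memory, k=1)
print(top)
```

In this toy setup the Wi-Fi trajectory ranks first because it scores highest on both the dense and the sparse component; in practice the two signals can disagree, which is the usual motivation for blending them.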

🛰️ Remote Sensing

ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery: ACPV-Net is the first framework that generates topologically consistent all-class polygonal vector maps from aerial imagery in a single pass, employing a semantically supervised conditional diffusion model for vertex heatmap generation and proposition-driven PSLG reconstruction to ensure zero gaps and zero overlaps.
AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network: AVION proposes a knowledge distillation framework that generates semantically rich text prototypes via LLMs and employs visual-textual dual-side prompt tuning with tri-aspect alignment distillation, addressing semantic poverty and visual rigidity in remote sensing VLM adaptation and comprehensively surpassing SOTA on few-shot classification, base-to-novel generalization, and cross-modal retrieval.
AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network: AVION proposes a knowledge distillation framework that uses LLM-generated semantically rich remote sensing text prototypes as teacher supervision while injecting learnable prompts into both the visual and text encoders of the student, achieving tri-aspect alignment distillation that significantly outperforms existing PEFT methods on few-shot classification and cross-modal retrieval.
Conflated Inverse Modeling for Urban Vegetation Patterns: This paper proposes a framework conflating a forward prediction model with a diffusion-based inverse generative model to produce diverse yet physically plausible urban vegetation spatial configurations (NDVI patterns) under specified temperature change targets, achieving a 3.4× diversity improvement while reducing temperature control error by 37%.
Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark: This paper proposes a cross-modal fuzzy alignment network (CFAN) that leverages fuzzy logic to quantify token-level reliability for fine-grained alignment and introduces ground-view bridging to alleviate the semantic gap between aerial images and text descriptions. It also contributes AERI-PEDES, a large-scale text-aerial person retrieval benchmark.
Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: This is the first work to advance spectral compressed imaging (SCI) from image-level to video-level reconstruction. It introduces DynaSpec, the first high-quality dynamic hyperspectral dataset (30 sequences / 300 frames), and proposes PG-SVRT, which uses spatial-then-temporal attention plus bridge tokens to achieve 41.52 dB PSNR and the best temporal consistency at lower FLOPs (28.18G) than several image-level SOTAs.
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction: GeoFlow is a lightweight flow-matching-inspired framework for fine-grained cross-view geolocalization (FG-CVG). It learns probabilistic displacement fields combined with an iterative refinement sampling (IRS) algorithm to achieve precise 2-DoF localization from ground to satellite images in continuous space, reaching SOTA-competitive accuracy at 29 FPS real-time speed.
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction: GeoFlow reformulates fine-grained cross-view geolocalization (FG-CVG) as probabilistic displacement regression—the model learns displacement fields (distance + direction probability distributions) from arbitrary hypothesis positions to true locations, combined with an iterative refinement sampling (IRS) algorithm that flows multiple random hypotheses from different starting points toward a consensus position, achieving 29 FPS real-time inference with 7.8× fewer parameters and 4× less computation while maintaining competitive localization accuracy.
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing: This work proposes GeoMMBench (1,053 expert-level geoscience multiple-choice questions) and GeoMMAgent (a retrieval-perception-reasoning multi-agent framework), systematically evaluating 36 MLLMs in the remote sensing domain and revealing systematic deficiencies in domain knowledge, perceptual grounding, and reasoning capabilities.
Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users: This paper studies downlink transmission from multiple LEO satellites jointly serving multi-antenna ground users. Two non-coherent transmission modes are proposed—joint transmission and streamwise transmission—with precoders designed under the WMMSE framework and stream-to-satellite association solved via the Hungarian algorithm, achieving near-optimal spectral efficiency while substantially reducing fronthaul overhead.
Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users: Two downlink transmission schemes (joint transmission & streamwise transmission) are proposed for distributed LEO satellite systems serving multi-antenna ground users. Through WMMSE precoding design based on statistical CSI and a stream-satellite association strategy based on the Hungarian algorithm, the proposed framework achieves a flexible trade-off between high spectral efficiency and low fronthaul overhead without requiring inter-satellite phase synchronization.
Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels: This paper presents Lumosaic, an active hyperspectral video system that synchronizes an array of 12 narrowband LEDs with a coded-exposure pixel (CEP) camera at microsecond precision. Within 158 sub-frames per video frame, the system jointly encodes spatial, temporal, and spectral information, achieving motion-robust hyperspectral video reconstruction at 30 fps, VGA resolution, and 31 spectral channels (400–700 nm), with PSNR exceeding passive snapshot systems by more than 10 dB.
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging: This paper presents MetaSpectra+, the first multifunctional metasurface imaging system operating across the full visible spectrum (250 nm bandwidth). Through a dual-layer metasurface design enabling beam splitting and precise dispersion control, the system acquires a hyperspectral data cube together with HDR/polarization images in a single snapshot, achieving 33.31 dB PSNR on benchmark datasets with a total track length (TTL) of only 17 mm.
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging: MetaSpectra+ proposes a metasurface–refractive hybrid optical paradigm that employs a dual-layer metasurface to independently control the dispersion, exposure, and polarization of four channels, enabling snapshot hyperspectral+HDR/polarization multi-functional imaging over a 250 nm bandwidth with a minimum total track length (TTL) of 17 mm. On the KAUST benchmark, it achieves a PSNR of 33.31 dB, comprehensively surpassing existing snapshot hyperspectral systems.
No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors: This paper proposes LightStab, an unsupervised online video stabilization framework built upon the classical three-stage pipeline (motion estimation → motion propagation → motion compensation) augmented with multi-threaded asynchronous buffering. LightStab is the first online method to comprehensively match offline SOTA across 5 benchmark datasets, and introduces UAV-Test, the first multimodal UAV aerial stabilization benchmark covering both visible-light and infrared imagery.
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments: Olbedo introduces the first large-scale real-world aerial albedo–shading decomposition dataset (5,664 UAV images, 4 terrain types, multi-year multi-illumination conditions). A physics-based inverse rendering pipeline generates multi-view-consistent pseudo-ground-truth annotations. Results demonstrate that synthetic pre-training combined with Olbedo LoRA fine-tuning substantially improves outdoor albedo prediction and supports downstream applications including relighting, material editing, and scene change analysis.
Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?: This paper evaluates 24 families of pretrained image matchers on SAR-optical satellite registration under a zero-shot setting, finding that deployment protocol choices (geometric model, tile size, etc.) can affect accuracy by up to 33×, sometimes surpassing the effect of switching the matcher itself.
RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization: This paper introduces CV-RHO, the first OSM-based metric cross-view geo-localization benchmark targeting adverse weather and sensor noise (2.72M+ images), and proposes RHO, a dual-branch Pin-Pan architecture integrating panoramic undistortion (SUM) and position-orientation fusion (POF) mechanisms, achieving up to 20% localization improvement under diverse degradation conditions.
SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification: This paper proposes SDF-Net, a physics-guided structure-aware disentangled feature learning network that enforces cross-modal geometric consistency via intermediate-layer gradient energy (SCL) and decouples shared/modality-specific features at the terminal layer (DFL) with parameter-free additive fusion, achieving 60.9% mAP (+3.5% vs. SOTA TransOSS) on HOSS-ReID.
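
The stream-to-satellite association described in the Joint and Streamwise Distributed MIMO notes above is a linear assignment problem. The Python sketch below solves a tiny instance by exhaustive search over permutations, which stands in for the Hungarian algorithm: both return an optimal one-to-one assignment, but the Hungarian algorithm does so in polynomial time. The `gain` matrix is invented for illustration, and the sketch assumes as many streams as satellites.

```python
from itertools import permutations


def best_assignment(gain):
    """Return (assignment, total) maximizing the summed gain.

    gain[s][k] is a hypothetical utility of serving stream s from satellite k.
    Exhaustive search is used instead of the Hungarian algorithm, so this is
    only practical for small problems.
    """
    n = len(gain)
    best, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(gain[s][perm[s]] for s in range(n))
        if total > best_total:
            best, best_total = perm, total
    return list(best), best_total


# Made-up utilities for 3 streams (rows) and 3 satellites (columns).
gain = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
assignment, total = best_assignment(gain)
print(assignment, total)
```

Here `assignment[s]` is the satellite index chosen for stream `s`. For realistic problem sizes a polynomial-time solver would replace the permutation loop.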

🎵 Audio & Speech

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models: This paper proposes BabyVLM-V2, a framework that constructs three formats of pretraining data (768K image pairs + 181K video pairs + 63K interleaved sequences) from the SAYCam longitudinal egocentric corpus, designs the DevCV Toolbox (10 developmental cognitive tasks) grounded in the NIH Baby Toolbox®, and demonstrates that a compact model trained from scratch surpasses GPT-4o on selected mathematical tasks — representing the first systematic exploration of Artificial Developmental Intelligence (ADI).
Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning: This paper proposes Refine, an ensemble active learning method that employs a two-stage strategy—progressive filtering (iteratively refining the unlabeled pool via multiple strategies) and coverage-based selection (selecting high-value, diverse samples from the refined pool)—to consistently outperform individual AL strategies and existing ensemble methods without requiring prior knowledge of the optimal strategy.
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models: This paper proposes MMHNet, a Multimodal Hierarchical Network based on a hierarchical architecture and non-causal Mamba-2, achieving length generalization by training on short clips (8 seconds) while generating high-quality, well-aligned audio for long videos (5+ minutes). MMHNet substantially outperforms existing methods on the UnAV100 and LongVale benchmarks.
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization: GEM-TFL is proposed to bridge the gap between weak and full supervision for temporal forgery localization via a two-stage classification-regression framework. Three core modules are introduced: EM-based decomposition of binary labels into multi-dimensional latent attributes, training-free temporal consistency refinement (TCR), and graph diffusion proposal refinement (GPR). The method achieves an average mAP improvement of 4–8% on weakly supervised temporal forgery localization benchmarks.
Omni-MMSI: Toward Identity-Attributed Social Interaction Understanding: This paper introduces the Omni-MMSI task—understanding multi-person social interactions from raw audio-visual inputs (rather than pre-processed oracle social cues)—and proposes Omni-MMSI-R, a reference-guided pipeline that achieves accurate social interaction understanding via tool-generated identity-attributed social cues combined with chain-of-thought reasoning.
OmniRet: Efficient and High-Fidelity Omni Modality Retrieval: This paper proposes OmniRet, the first unified retrieval model supporting composed queries across text, vision, and audio modalities. It introduces a Shared Media Resampler to improve computational efficiency and Attention Sliced Wasserstein Pooling (ASWP) to preserve fine-grained information, achieving state-of-the-art performance on 12 out of 13 retrieval tasks.
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text: This paper proposes the Universal Holistic Audio Generation (UniHAGen) task and the OmniSonic framework, which employs a TriAttn-DiT architecture with triple cross-attention and MoE gating to simultaneously generate on-screen environmental sound, off-screen environmental sound, and human speech within a unified audio synthesis pipeline, achieving comprehensive state-of-the-art performance on the newly constructed UniHAGen-Bench.
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval: This paper proposes SAVE, a speech-aware video representation learning method that introduces a dedicated speech branch (Whisper ASR + CLIP text encoder) and a soft-ALBEF visual-audio early alignment strategy, achieving comprehensive state-of-the-art performance across five video-text retrieval benchmarks.
Semantic Audio-Visual Navigation in Continuous Environments: This paper introduces the SAVN-CE task, extending semantic audio-visual navigation to continuous 3D environments, and proposes MAGNet (Memory-Augmented Goal description Network). By fusing historical context and ego-motion cues, MAGNet achieves robust goal inference after target sounds cease, yielding absolute success rate improvements of up to 12.1%.
Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion: For the Ambivalence/Hesitancy (A/H) recognition task of the 10th ABAW competition, this paper proposes a divergence-based multimodal fusion strategy that explicitly models cross-modal conflict by computing pairwise absolute differences among embeddings from three modalities — visual (AU), audio (Wav2Vec 2.0), and text (BERT) — achieving a Macro F1 of 0.6808 on the BAH dataset, substantially surpassing the baseline of 0.2827.
Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach: This work is the first to incorporate behavioral description embeddings extracted by a VLM (Qwen3-VL-4B-Instruct) as an independent third modality, combining them with GRADA facial encodings and WavLM audio features via two fusion strategies—DCMMOE and RAAV—achieving a continuous VA estimation CCC of 0.658 (dev) / 0.62 (test) on Aff-Wild2, demonstrating the value of VLM behavioral semantics for continuous emotion recognition.
Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach: This paper proposes a multimodal approach combining facial visual features, VLM-based behavioral description embeddings, and audio features for continuous valence-arousal (VA) estimation. Two fusion strategies—DCMMOE and RAAV—are explored, achieving competitive results on the Aff-Wild2 dataset.
Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis: This paper proposes the TSD framework, which explicitly decomposes multimodal features into three complementary subspaces—globally shared, pairwise shared, and modality-private—and adaptively integrates these three levels of information via a subspace-aware cross-attention (SACA) fusion module, achieving state-of-the-art performance on CMU-MOSI and CMU-MOSEI.
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark: This paper proposes UniM, the first unified any-to-any interleaved multimodal benchmark (31K samples, 7 modalities, 30 domains), accompanied by a three-dimensional evaluation suite and an agentic baseline UniMA based on traceable evidence reasoning, revealing critical deficiencies of existing MLLMs under the interleaved multimodal paradigm.
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods: Through systematic data-centric experiments, this paper demonstrates that audio pre-training performance is primarily driven by label/supervision quality rather than model design. It proposes the Unified Tag System (UTS), which unifies speech, music, and environmental sound under a high-granularity vocabulary of 800–3k tags. Models trained with UTS surpass AudioSet baselines on out-of-domain tasks such as speaker verification (VoxCeleb2) and music (MusicCaps) using 5× less data.
ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos: This paper presents ViDscribe, a web platform integrating AI-generated audio descriptions (with 6 user-customizable options) and a conversational visual question answering interface. A longitudinal field study with 8 blind and low-vision (BLV) users demonstrates that customized audio descriptions significantly improve effectiveness, enjoyment, and immersion.
ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos: ViDscribe is a web-based platform that leverages a multimodal large language model (Gemini 3 Pro) to provide customizable AI-generated audio descriptions (AD) and interactive visual question answering (VQA) for blind and low-vision (BLV) users. Supporting arbitrary YouTube videos, the system is validated through a one-week longitudinal user study, which demonstrates that customized AD outperforms default AD in terms of effectiveness, enjoyment, and immersion.
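
The divergence-based fusion in the Ambivalence/Hesitancy note above computes pairwise absolute differences among visual, audio, and text embeddings. A minimal Python sketch of that feature construction follows. It assumes the three embeddings have already been projected to a common dimension, and it only concatenates the difference vectors; how the paper combines them with the original embeddings or feeds them to a classifier is not stated in the note, so nothing beyond the difference step is modeled here.

```python
def divergence_features(visual, audio, text):
    """Concatenate element-wise absolute differences for each modality pair."""
    if not (len(visual) == len(audio) == len(text)):
        raise ValueError("embeddings must share one dimension")
    pairs = [(visual, audio), (visual, text), (audio, text)]
    feats = []
    for a, b in pairs:
        feats.extend(abs(x - y) for x, y in zip(a, b))
    return feats


# Toy embeddings (assumed values): identical modalities give zero divergence,
# conflicting modalities give large entries.
agree = divergence_features([0.5, -0.25], [0.5, -0.25], [0.5, -0.25])
conflict = divergence_features([1.0, 0.0], [-1.0, 0.0], [1.0, 0.5])
print(agree)
print(conflict)
```

Large entries in the output mark dimensions where two modalities disagree, which is the cross-modal conflict signal the note refers to.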

LLM Safety

Association and Consolidation: Evolutionary Memory-Enhanced Incremental Multi-View Clustering: This paper proposes EMIMC, a framework inspired by the hippocampus–prefrontal cortex collaborative memory mechanism in the brain. Three coordinated modules — a Rapid Associative Module (orthogonal mapping to ensure plasticity), a Cognitive Forgetting Module (power-law decay to simulate the forgetting curve), and a Knowledge Consolidation Module (temporal tensor low-rank decomposition to distill long-term memory) — jointly address the stability-plasticity dilemma in incremental multi-view clustering.
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations: This paper proposes a patch-level LVLM hallucination detection framework. It finds that hallucinated tokens exhibit two characteristic signatures, dispersed attention patterns and low semantic alignment, and designs two lightweight metrics on that basis: Attention Dispersion Score (ADS) and Cross-modal Grounding Consistency (CGC), achieving 90% detection accuracy.
The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models: This work systematically investigates catastrophic forgetting when fine-tuning VLMs for autonomous driving scenarios, constructs the large-scale 180K-scene benchmark FidelityDrivingBench, and proposes the Drive Expert Adapter (DEA), which enhances driving task performance via prompt-space routing without corrupting base model parameters.
DAMP: Class Unlearning via Depth-Aware Removal of Forget-Specific Directions: This paper proposes DAMP (Depth-Aware Modulation via Projection), a one-shot closed-form weight surgery method for class unlearning that achieves selective forgetting by removing forget-class-specific directions in the editing space of each network stage, with a depth-aware scaling rule enforcing conservative edits in shallow layers and aggressive edits in deep layers.
Designing to Forget: Deep Semi-parametric Models for Unlearning: This paper proposes the "Designing to Forget" paradigm and introduces a family of deep semi-parametric models (SPMs) that achieve unlearning at inference time by simply removing training samples—without modifying model parameters. On ImageNet classification, SPMs reduce the prediction gap relative to the retrain baseline by 11% and achieve over 10× faster unlearning.
Elastic Weight Consolidation Done Right for Continual Learning: This paper systematically analyzes the fundamental flaws in EWC and its variants regarding weight importance estimation from a gradient perspective—specifically, gradient vanishing in EWC and redundant protection in MAS—and proposes an extremely simple Logits Reversal operation to correct the Fisher Information Matrix computation, achieving substantial improvements over vanilla EWC and all its variants on exemplar-free class-incremental learning and multimodal continual instruction tuning tasks.
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in LVLMs: This paper proposes HulluEdit, a single-pass, reference-model-free hallucination mitigation framework that orthogonally decomposes hidden states into a visual evidence subspace, a conflicting prior subspace, and a residual uncertainty subspace, selectively suppressing hallucination patterns without interfering with visual grounding, achieving state-of-the-art performance on POPE and CHAIR.
Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting: This paper proposes KNOW prediction: a framework that induces a structured forgetting process via sequential fine-tuning on progressively shrinking data subsets, collects the resulting weight transition trajectories, and then employs a meta-learned hyper-model (KNOWN) to reverse the forgetting direction, predicting virtually knowledge-enriched weights as if the model had been trained on a larger dataset. The approach consistently outperforms naive fine-tuning and multiple weight prediction baselines across diverse datasets (CIFAR/ImageNet/PACS, etc.) and architectures (ResNet/PVTv2/DeepLabV3+), yielding significant improvements on downstream tasks including image classification, semantic segmentation, image captioning, and domain generalization.
Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models: MPCAttack jointly leverages feature representations from three learning paradigms (cross-modal alignment, multimodal understanding, and visual self-supervision) and generates highly transferable adversarial examples via a multi-paradigm collaborative optimization strategy, achieving state-of-the-art attack performance on both open-source and closed-source MLLMs.
⊘ Source Models Leak What They Shouldn't ↛: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization: This work identifies that Source-Free Domain Adaptation (SFDA) methods inadvertently leak knowledge of source-exclusive classes to the target domain (zero-shot transfer phenomenon), and proposes the SCADA-UL framework, which performs category unlearning simultaneously with domain adaptation through adversarial generation of forget samples and a rescaled labeling strategy, achieving unlearning quality approaching that of retraining from scratch.
Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP: This paper proposes PAR (Perturb and Recover), a simple yet effective backdoor cleansing method for CLIP: by explicitly pushing model embeddings away from the poisoned state (Perturb) while recovering clean performance via the standard CLIP loss (Recover), PAR achieves robust backdoor removal against arbitrary trigger types without relying on strong data augmentation, and remains effective even when using only synthetic data.
PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing: This paper proposes the PinPoint benchmark, comprising 7,635 queries and 329K human-verified relevance judgments. Through four dimensions—explicit negatives, multi-image queries, paraphrase variants, and demographic metadata—it exposes severe deficiencies in existing CIR methods regarding false positive suppression, linguistic robustness, and multi-image reasoning. A training-free MLLM-based reranking method is also proposed as an improved baseline.
Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation: This paper proposes SIEVE (Select–Hypothesize–Verify), a closed-loop framework that interprets neuron functionality by selecting highly activated samples, generating concept hypotheses, and verifying them via text-to-image generation. The probability that generated concepts activate the corresponding neuron is approximately 1.5× that of existing SOTA methods.
SineProject: Machine Unlearning for Stable Vision–Language Alignment: To address the severe ill-conditioning of the projector Jacobian during machine unlearning in MLLMs—which causes systematic vision–language alignment drift—this paper proposes SineProject, which applies sinusoidal modulation (\(\sin(\Delta W)\)) to projector weights to constrain parameter magnitudes to \([-1, 1]\). This reduces the Jacobian condition number by 3–4 orders of magnitude, achieving complete forgetting of target knowledge while reducing the safe answer rejection rate (SARR) on benign queries by 15%.
Unsafe2Safe: Controllable Image Anonymization for Downstream Utility: This paper proposes Unsafe2Safe, a fully automatic privacy-preserving pipeline that realizes controllable image anonymization through a four-stage approach—VLM privacy inspection → dual captioning (private/public) → LLM editing instructions → text-guided diffusion editing. The method achieves substantial improvements on the VLMScore privacy metric while surpassing the original images in downstream accuracy on Caltech-101 classification and OK-VQA.
V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs: This work discovers that Value features in ViT exhibit more disentangled local semantic representations compared to Patch features, and proposes V-Attack, which achieves precise and controllable local semantic attacks on LVLMs via self-enhanced Value features and text-guided semantic manipulation, improving average ASR by 36%.
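
As a rough illustration of the sinusoidal modulation mentioned in the SineProject entry above, the sketch below applies an elementwise sine to an unbounded update \(\Delta W\) so that every effective entry stays in \([-1, 1]\). The function name `sine_modulate` and the sample values are hypothetical and not taken from the paper. How the modulated update is combined with the base projector weights is deliberately left out.

```python
import math

def sine_modulate(delta_w):
    """Apply an elementwise sine to a 2D list of raw update values."""
    return [[math.sin(v) for v in row] for row in delta_w]

# Hypothetical raw updates, some with very large magnitude.
raw_update = [[0.1, -2.5, 40.0], [1000.0, -0.001, 3.14]]
bounded = sine_modulate(raw_update)

# Largest absolute value after modulation; it can never exceed 1.
max_abs = max(abs(v) for row in bounded for v in row)
```

For small raw values the modulation is close to the identity (since \(\sin(v) \approx v\) near zero), while large values are folded back into the bounded range.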

💡 LLM Reasoning¶

Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D: A new paradigm called "Artistic Disparity Synthesis" (Art3D) is proposed, shifting the goal of 2D-to-3D conversion from geometric accuracy to artistic expression. A dual-path architecture decouples global depth style from local artistic effects, learning directorial intent from professional 3D film data.
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought: This work constructs E-comIQ-ZH, the first multi-dimensional quality assessment framework for Chinese e-commerce posters, comprising an 18K expert-annotated dataset with CoT reasoning chains, a dedicated evaluation model E-comIQ-M (trained via SFT+GRPO), and a standardized benchmark E-comIQ-Bench.
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence: EagleVision is a dual-stage framework in which the macro-perception stage employs Semantic-Pose Fusion DPP (SPF-DPP) to jointly optimize semantic relevance and viewpoint diversity in SE(3) space for key-frame selection, while the micro-verification stage enables the model to actively query new viewpoint frames on the BEV plane for iterative spatial CoT reasoning (hypothesize → observe → verify loop). The query strategy is trained purely via RL without human annotation, achieving open-source SOTA on VSI-Bench and SQA3D.
GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization: GRAZE is proposed as a training-free pipeline that leverages Grounding DINO to discover candidate interactions and employs SAM2 mask overlap as a pixel-level contact verifier, achieving 97.4% coverage and 77.5% contact onset frame localization accuracy within ±10 frames on 738 American football training videos.
Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing: The paper introduces FaceCoT, the first CoT-VQA dataset for face anti-spoofing (FAS) with 1.08 million samples covering 14 attack types, and proposes a two-stage progressive learning strategy CEPL, achieving an average AUC improvement of 4.06% and HTER reduction of 5.00% across 11 FAS benchmarks.
Latent Chain-of-Thought World Modeling for End-to-End Autonomous Driving: LCDrive proposes a Latent Chain-of-Thought (Latent CoT) framework that replaces natural language CoT with action proposal tokens and world model prediction tokens for reasoning, achieving lower latency and superior trajectory quality in end-to-end autonomous driving via cold-start + RL post-training.
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought: This work identifies that existing LVLMs effectively ignore intermediate rationale content during CoT reasoning, and proposes RED (Rationale-Enhanced Decoding)—multiplying the image-conditioned and rationale-conditioned next-token distributions at the logit level. This approach is theoretically equivalent to the optimal solution of KL-constrained reward maximization, and significantly improves multimodal reasoning accuracy without any training.
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought: This work identifies that existing LVLMs neglect the generated rationale content during multimodal CoT reasoning (image tokens dominate attention), and proposes Rationale-Enhanced Decoding (RED)—reformulating CoT as a KL-constrained rationale-conditioned log-likelihood reward maximization problem. The closed-form optimal solution multiplies the image-conditioned distribution \(p(y|x,q)\) by the rationale-conditioned distribution \(p(y|r,q)^\lambda\), significantly improving reasoning performance across multiple benchmarks without any training.
Reinforcing Structured Chain-of-Thought for Video Understanding: This paper proposes SDRL (Summary-Driven Reinforcement Learning), a single-stage RL framework that requires no SFT. By introducing a structured CoT (Summarize→Think→Answer) and two self-supervised mechanisms (CVK and DVR), SDRL enhances temporal reasoning in video understanding and achieves state-of-the-art results on 7 VideoQA benchmarks.
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering: This work constructs Step-CoT, the first structured multi-step CoT medical reasoning dataset aligned with clinical diagnostic workflows (10K+ cases / 70K QA pairs), and proposes a teacher-student framework based on graph attention networks for stepwise reasoning supervision, improving both accuracy and interpretability in Med-VQA.
Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models: This paper systematically analyzes the causes of hallucinations in multimodal CoT models, identifies "divergent thinking" (associative reasoning) as the core trigger, and proposes a training-free detection and decoding intervention strategy based on visual entropy. The method reduces CHAIR\(_S\) by over 30% on Object HalBench while maintaining or improving general reasoning capability.
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models: This paper proposes the Hallucination-as-Cue analytical framework, systematically investigating the true mechanisms underlying RL post-training of multimodal reasoning models via three modality-specific corruption strategies (blank image, random image, text removal). The study finds that GRPO training with 100% corrupted visual inputs still yields significant improvements in reasoning performance, challenging the prevailing assumption that RL training effectively leverages visual information.
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models: This paper proposes VisRef, a training-free visual refocusing framework that, during inference in multimodal large reasoning models (MLRMs), adaptively selects a semantically relevant and visually diverse subset of tokens at each reasoning step via Determinantal Point Processes (DPP) and reinjects them into the context. An entropy-based stopping criterion prevents overthinking. Under a fixed compute budget, VisRef improves visual reasoning accuracy by up to 6.4%.
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models: This paper proposes VisRef, a training-free visual refocusing framework that dynamically selects and re-injects semantically relevant and diverse visual tokens—chosen via a Determinantal Point Process (DPP)—into the reasoning context of Multimodal Large Reasoning Models (MLRMs) at each inference step, addressing the progressive decay of visual attention during long-chain reasoning. VisRef achieves improvements of up to 6.4% on benchmarks such as MathVista.
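
The decoding rule described in the RED entries above, multiplying \(p(y|x,q)\) by \(p(y|r,q)^\lambda\), can be sketched for a single decoding step as follows. This is a minimal illustration under stated assumptions: the function name `red_combine`, the three-token vocabulary, and the probability values are hypothetical, and both distributions are assumed to be strictly positive so that the logarithms are defined.

```python
import math

def red_combine(p_image, p_rationale, lam=1.0):
    """Return the normalized product p_image * p_rationale**lam over a shared vocabulary."""
    # Work in log space: log p_image + lam * log p_rationale.
    log_scores = [math.log(pi) + lam * math.log(pr)
                  for pi, pr in zip(p_image, p_rationale)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical next-token distributions over a 3-token vocabulary.
p_image = [0.5, 0.3, 0.2]      # conditioned on image and question
p_rationale = [0.1, 0.7, 0.2]  # conditioned on rationale and question
combined = red_combine(p_image, p_rationale, lam=1.0)
```

In this toy case the image-conditioned distribution alone prefers token 0, but the combined distribution prefers token 1, which the rationale strongly supports. Setting `lam=0` recovers the image-conditioned distribution unchanged.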

⚖️ Alignment & RLHF¶

Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group: This paper proposes a method that bypasses Clebsch-Gordan (CG) coefficient computation and directly constructs explicit steerable kernel bases from group representation matrix elements. Through a three-step strategy of "stabilizer constraint + Schur's lemma + steering," it uniformly covers SO(2), O(2), SO(3), O(3), and the non-compact Lorentz group, substantially simplifying the kernel design pipeline for equivariant CNNs.
Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group: This paper proposes a method that bypasses Clebsch-Gordan coefficients to solve the steerable kernel constraint in equivariant CNNs by solving simple invariance conditions on stabilizer subgroups and then "steering" to arbitrary points, providing explicit kernel bases for symmetry groups ranging from SO(2) to the Lorentz group.
Bias at the End of the Score: Demographic Biases in Reward Models for T2I: This paper conducts a large-scale demographic bias audit of widely used reward models (PickScore, ImageReward, HPS, etc.) in text-to-image generation, revealing that reward-guided optimization disproportionately sexualizes female images and shifts demographics toward white individuals, and that reward scores correlate with real-world population frequency priors.
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering: GlyphPrinter constructs a region-level glyph preference dataset (GlyphCorrector) and proposes Region-Grouped DPO (R-GDPO) to significantly improve glyph accuracy in visual text rendering without relying on explicit reward models, while introducing inference-time Regional Reward Guidance (RRG) for controllable generation.
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models: This paper proposes MapReduce LoRA and RaTE as two complementary methods for advancing the Pareto front in multi-preference optimization: the former uses a "Map (parallel preference expert training) + Reduce (iterative merging)" strategy to progressively advance the Pareto front; the latter learns reward-aware token embeddings for inference-time composable preference control.
Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation: This paper proposes Mesh-Pro, the first asynchronous online reinforcement learning framework for 3D quadrilateral mesh generation. Its core algorithm, ARPO (Advantage-guided Ranking Preference Optimization), combines the Plackett-Luce ranking model with advantage-function weighting to achieve simultaneous improvements in efficiency (3.75× faster than offline DPO) and generalization, attaining state-of-the-art generation quality for both artist-style and dense meshes.
LocalDPO: Direct Localized Detail Preference Optimization for Video Diffusion Models: LocalDPO is proposed to perform localized preference alignment at the detail level by applying random spatiotemporal Bézier-masked local corruption to real high-quality videos to construct negative samples (single inference pass, no external ranking), paired with a region-aware DPO loss. The method consistently outperforms vanilla DPO and SFT on Wan2.1 and CogVideoX in terms of video quality.
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization: This paper proposes MoD-DPO (Modality-Decoupled DPO), which decouples the contribution of each modality in multimodal LLMs via three mechanisms—invariance regularization, sensitivity regularization, and language-prior debiasing—to effectively mitigate cross-modal hallucinations (e.g., answering visual questions using auditory information). A closed-form optimal policy is also derived.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization: This work introduces DPO preference optimization into the post-training stage of diffusion-based motion generation models. A physics simulation controller automatically constructs preference data pairs, enabling generated human motions to satisfy both text/spatial control instructions and physical constraints. The approach successfully transfers zero-shot to a real Unitree G1 robot.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization: PhysMoDPO integrates a pretrained whole-body controller (WBC/DeepMimic) into the post-training pipeline of a diffusion-based motion generator. By automatically constructing preference pairs via physical simulation and fine-tuning with DPO, it ensures that generated motions, after WBC execution, satisfy both physical plausibility and faithfulness to text/spatial conditions, enabling zero-shot transfer to a real Unitree G1 robot.
Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models: This paper proposes NullSteer, an activation steering defense framework based on null-space projection, which effectively resists visual jailbreak attacks without degrading general model capability by constraining steering operations to the null space of benign activations.
\(\varphi\)-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models: This paper proposes \(\varphi\)-DPO, which adopts DPO as a continual learning paradigm (using the previous-step model as the reference policy) and introduces a fairness modulation factor \((1-p)^\gamma\) inspired by focal loss to balance gradient contributions across data groups. The authors theoretically prove that the gradient bias approaches zero as \(\gamma \to \infty\), and achieve state-of-the-art performance on the CoIN and MLLM-CL benchmarks.
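
To make the fairness modulation factor \((1-p)^\gamma\) in the \(\varphi\)-DPO entry above more concrete, the sketch below attaches it to a standard per-pair DPO loss. This is an illustration under explicit assumptions, not the paper's formulation: \(p\) is assumed to be the sigmoid of the scaled DPO margin, by analogy with focal loss, and the function name `focal_dpo_loss` and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def focal_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, gamma=2.0):
    """Per-pair DPO loss scaled by a focal-style factor (1 - p)**gamma."""
    # Standard DPO margin: policy-vs-reference log-ratio of the chosen response
    # minus that of the rejected response, scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumption: p is the implied preference probability of the chosen response.
    p = sigmoid(margin)
    return -((1.0 - p) ** gamma) * math.log(p)

# Hypothetical sequence log-probabilities for one preference pair.
args = dict(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
plain = focal_dpo_loss(gamma=0.0, **args)     # gamma = 0 gives the unmodulated DPO loss
modulated = focal_dpo_loss(gamma=2.0, **args)
```

With `gamma=0` the factor equals 1 and the expression reduces to the usual \(-\log \sigma(\cdot)\) DPO loss. Larger `gamma` shrinks the contribution of pairs the model already ranks correctly, which is the focal-loss intuition the entry refers to.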

📚 Pretraining¶

Defending Unauthorized Model Merging via Dual-Stage Weight Protection: This paper proposes MergeGuard, a proactive dual-stage weight protection framework: Stage 1 disperses task-critical weights via L2 regularization, and Stage 2 injects structured perturbations to disrupt merging compatibility. The protected model incurs less than 1.5% performance loss while causing merged model accuracy to drop by up to 90%.
Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation: This paper proposes the Evidential Transformation Network (ETN), a lightweight post-hoc module that learns sample-dependent affine transformations in logit space to convert pretrained classifiers or LLMs into evidential models, achieving reliable uncertainty estimation with minimal computational overhead.
FlowMotion: Training-Free Flow Guidance for Video Motion Transfer: FlowMotion is a training-free video motion transfer framework that directly leverages the latent prediction output of flow-based T2V models to construct motion guidance signals, avoiding gradient backpropagation through internal model layers while maintaining motion fidelity and significantly reducing inference time and memory overhead.
Linking Modality Isolation in Heterogeneous Collaborative Perception: CodeAlign constructs a discrete code space via codebooks and cross-modal Feature-Code-Feature (FCF) translation, becoming the first framework to solve the "modality isolation" problem in heterogeneous collaborative perception, where different modalities never co-occur in training data. It uses only 8% of HEAL's training parameters and reduces communication by 1024x while achieving SOTA perception performance.
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation: LottieGPT is the first autoregressive vector animation generation framework, designing a Lottie tokenizer to encode hierarchical geometry, transforms, and keyframe motion into compact token sequences. It builds a 660K animation dataset and fine-tunes Qwen-VL to generate editable vector animations directly from text/image inputs.
Model Merging in the Essential Subspace: ESM constructs an "essential subspace" via PCA on activation shifts induced by parameter updates (rather than SVD on parameter matrices), and applies three-level polarized scaling to amplify critical parameters while suppressing noise. On 20-task ViT-B/32 merging, it improves over Iso-CTS by 3.2% absolute accuracy.
MXNorm: Reusing MXFP Block Scales for Efficient Tensor Normalisation: MXNorm fuses RMSNorm with MXFP quantization by reusing the block absmax values already computed during MXFP8 quantization to approximate the RMS value, eliminating the separate normalization reduction operation. It maintains training accuracy on Llama 3 up to 8B parameters while achieving up to 2.4x kernel speedup on GB200.
MXNorm: Reusing MXFP Block Scales for Efficient Tensor Normalisation: GPU matrix multiplication throughput has improved 80x (V100 to GB200) while reduction/elementwise operations improved only 5-9x, making RMSNorm a new bottleneck in low-precision training. MXNorm directly reuses the block scales already computed during MXFP8 quantization to estimate RMS, shrinking the size of the reduction by 32x. Theorem 1 proves that the generalized \(p\)-mean of block absmax converges to a constant multiple of RMS. Llama 3 pretraining (125M/1B/8B) validates that MXNorm(\(p=2\)) matches RMSNorm with minimal accuracy difference, with torch.compile benchmarks showing up to 2.4x isolated kernel speedup and +1.3%/+2.6% Llama 3 8B layer acceleration for MXFP8/NVFP4. MXNorm is a drop-in replacement with zero additional hyperparameters.
Watch and Learn: Learning to Use Computers from Online Videos: Watch & Learn proposes using an inverse dynamics model (IDM) to automatically convert YouTube tutorial videos into executable UI trajectory data (53K+ trajectories without manual annotation), enhancing CUA capabilities with +11.1% improvement for Qwen 2.5VL-7B and +3.8% for UI-TARS-1.5-7B on OSWorld.
Watch and Learn: Learning to Use Computers from Online Videos: This paper proposes Watch & Learn (W&L), a framework that leverages an Inverse Dynamics Model (IDM) to automatically convert human computer-use tutorial videos from the internet into executable UI trajectory data. The system generates 53K+ high-quality trajectories that serve as either ICL demonstrations or SFT training data, significantly improving CUA performance across multiple models and platforms.
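
The idea in the MXNorm entries above, estimating RMS from per-block absmax values instead of reducing over every element, can be illustrated with a small numerical sketch. The assumptions are stated plainly: a block size of 32 (consistent with the 32x figure above), synthetic Gaussian data, and a proportionality constant `c` that is calibrated empirically here rather than taken from the paper's Theorem 1. All function names are hypothetical.

```python
import math
import random

BLOCK = 32  # assumed block size

def block_absmax(x, block=BLOCK):
    """Largest absolute value in each consecutive block of `block` elements."""
    return [max(abs(v) for v in x[i:i + block]) for i in range(0, len(x), block)]

def p_mean(values, p=2.0):
    """Generalized p-mean of non-negative values."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

rng = random.Random(0)

# Calibrate the constant relating the p-mean of block absmax to RMS.
calib = [rng.gauss(0.0, 1.0) for _ in range(BLOCK * 512)]
c = p_mean(block_absmax(calib)) / rms(calib)

# Estimate the RMS of a differently scaled tensor from its block absmax only.
x = [rng.gauss(0.0, 3.0) for _ in range(BLOCK * 512)]
scales = block_absmax(x)
rms_estimate = p_mean(scales) / c
rel_err = abs(rms_estimate - rms(x)) / rms(x)
```

The estimate reduces over `len(x) / 32` block values rather than `len(x)` elements. In a real kernel those block values would already be available from quantization, whereas this sketch recomputes them for clarity.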

📐 Optimization & Theory¶

BlazeFL: Fast and Deterministic Federated Learning Simulation: BlazeFL is a lightweight single-machine federated learning simulation framework built on Python free-threading. By combining shared-memory execution with per-client isolated RNG streams, it achieves up to 3.1× speedup and bit-level reproducibility.
Dynamic Momentum Recalibration in Online Gradient Learning: From a signal processing perspective, this work identifies the inherent bias-variance tradeoff deficiencies of fixed momentum coefficients and proposes the SGDF optimizer, which dynamically balances noise suppression and signal preservation in gradient estimation by computing optimal time-varying gains online under the minimum mean squared error principle, outperforming SGD momentum and Adam variants across multiple vision tasks.
Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning: To address the problem that existing federated prototype learning methods destroy inter-class semantic relations, this paper proposes FedTSP, which leverages pre-trained language models to construct textual prototypes that preserve semantic structure, achieving significant performance gains and faster convergence in heterogeneous federated learning.
Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift: This paper proposes the Fed-ADE framework, which adaptively adjusts the learning rate for each client at each time step using two lightweight distribution shift signals — uncertainty dynamics estimation and representation dynamics estimation — enabling unsupervised post-deployment adaptation in federated settings.
Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning: This paper proposes FedTSP, which leverages pre-trained language models (PLMs) to construct semantically rich prototypes from the text modality, preserving inter-class semantic relationships in heterogeneous federated learning. Learnable prompts are introduced to bridge the modality gap, substantially improving model performance and accelerating convergence.
OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport: This work formulates visual token pruning as a distribution alignment problem under optimal transport (OT), minimizing the 2-Wasserstein distance between the full and pruned token sets. It achieves training-free, \(O(mk^2)\)-complexity pruning via Gaussian surrogates, a log-det submodular objective, and greedy Cholesky selection, attaining state-of-the-art accuracy–efficiency trade-offs across 11 multimodal benchmarks.
SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated Learning: This paper proposes SCOPE, a training-free federated coreset selection framework that leverages a frozen VLM (MobileCLIP-S2) with orthogonal projection embeddings to compute three scalar semantic metrics—representativeness, diversity, and boundary proximity—enabling globally-aware two-stage pruning that reduces communication bandwidth by 128–512× while surpassing full-data training.
SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated Learning: SCOPE employs a training-free vision-language geometric scorer to compress each sample into three scalars (representativeness, diversity, and negative-class boundary proximity) and has the server aggregate only these lightweight statistics to form a global consensus. This consensus guides each client to first remove semantically anomalous samples and then eliminate majority-class redundancies, thereby achieving a favorable balance among accuracy, robustness, and minimal communication overhead under strongly non-IID and long-tail federated scenarios.
The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers: This paper reformulates sign-based adversarial attack optimizers as coordinate-wise gradient descent, reveals that non-decaying step sizes are the root cause of non-convergence and instability, and proposes a Monotonically Decreasing Coordinate Step-size (MDCS) strategy. Theoretical analysis proves that MDCS-MI achieves the optimal \(O(1/\sqrt{T})\) convergence rate, with significant improvements in attack transferability and stability demonstrated on image classification and cross-modal retrieval tasks.
UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation: This paper proposes UniFusion, a unified image fusion framework that leverages the self-supervised semantic priors of DINOv3 to construct a cross-modal shared feature space, preserves source image information via a reconstruction alignment mechanism, and decouples reconstruction and fusion objectives through a bilevel optimization strategy. The framework achieves state-of-the-art performance across multiple tasks, including infrared-visible, multi-exposure, multi-focus, and medical image fusion.
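
The claim in the decaying-steps entry above, that sign-based updates with a non-decaying step size fail to converge, can be seen on a toy problem. The sketch below runs coordinate-wise sign descent on a simple quadratic with a fixed step and with a monotonically decreasing step. The schedule \(\alpha_t = \alpha_0 / \sqrt{t}\), the objective, and all names are assumptions chosen for illustration and are not the paper's MDCS rule.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def sign_descent(target, steps, alpha0, decay):
    """Minimize 0.5 * sum((x_i - target_i)**2) using only the sign of each gradient coordinate."""
    x = [0.0] * len(target)
    for t in range(1, steps + 1):
        # Assumed schedule: alpha0 / sqrt(t) when decaying, constant otherwise.
        alpha = alpha0 / math.sqrt(t) if decay else alpha0
        grad = [xi - ti for xi, ti in zip(x, target)]
        x = [xi - alpha * sign(g) for xi, g in zip(x, grad)]
    # Return the worst-case coordinate error at the final iterate.
    return max(abs(xi - ti) for xi, ti in zip(x, target))

target = [0.33, -0.71]
err_fixed = sign_descent(target, steps=500, alpha0=0.1, decay=False)
err_decay = sign_descent(target, steps=500, alpha0=0.1, decay=True)
```

With the fixed step the iterate keeps hopping back and forth across the minimizer, so its error never drops below a floor set by the step size. With the decaying step the overshoot shrinks along with the step, and the final error is much smaller.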

🕸️ Graph Learning¶

Adaptive Learned Image Compression with Graph Neural Networks: GLIC reformulates the nonlinear transforms in learned image compression (LIC) from fixed convolutions or window-based attention into content-adaptive graph neural network operations. A dual-scale graph determines where to connect, while a complexity-aware mechanism determines how much to connect, enabling more effective modeling of both local and long-range redundancies. GLIC consistently outperforms traditional codecs and recent LIC baselines across three standard benchmarks.
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning: This paper proposes the G2F-RAG paradigm, which renders retrieved structured knowledge into a single "reasoning frame" appended to the end of the video, enabling large models to reason uniformly within the visual space. This approach avoids the attention dilution and cognitive overload caused by text appending, achieving consistent training-free improvements across 8 video benchmarks.
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs: This paper proposes Graph2Eval, a knowledge graph-driven framework for the automatic generation of agent evaluation tasks. By constructing structured knowledge graphs from documents and webpages, performing subgraph sampling, applying LLM-conditioned generation, and employing multi-stage filtering, the framework automatically produces multimodal agent tasks with improved semantic consistency (+20%) and solvability (+17%). The resulting benchmark, Graph2Eval-Bench, comprises 1,319 tasks.
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs: This paper proposes Graph2Eval, a framework that leverages knowledge graphs constructed from heterogeneous data sources as a structured task space. By employing subgraph sampling, task templates, and meta-path strategies, it automatically generates semantically consistent and solvable multimodal agent evaluation tasks, achieving improvements of 20% and 17% in semantic consistency and solvability, respectively.
Hyperbolic Busemann Neural Networks: This paper intrinsically lifts multinomial logistic regression (MLR) and fully connected (FC) layers to hyperbolic space via Busemann functions, proposing two unified components—BMLR and BFC—applicable to both the Poincaré ball and the Lorentz model. The proposed components outperform existing hyperbolic layers across four task categories: image classification, genomic sequence classification, node classification, and link prediction.
M3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation: This paper proposes M3KG-RAG, which constructs a multi-hop multimodal knowledge graph (M3KG) via a lightweight multi-agent pipeline and introduces the GRASP mechanism for entity grounding and selective pruning. By retaining only query-relevant and answer-useful knowledge, the approach substantially improves audio-visual reasoning capabilities of MLLMs.
Mario: Multimodal Graph Reasoning with Large Language Models: Mario is proposed for LLM reasoning on multimodal graphs (MMGs). It achieves topology-aware cross-modal alignment via a Graph-conditioned Vision-Language Model (GVLM), and employs a Modality-Adaptive Prompt Router (MAPR) to select the optimal modality configuration for each node, attaining state-of-the-art performance on node classification and link prediction.
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning: This work embeds a Procedural Knowledge Graph (PKG) into a planning model end-to-end via a differentiable Viterbi layer, enabling the neural network to learn only emission probabilities rather than memorizing complete procedural structures. With only 5–7M parameters—one to three orders of magnitude fewer than diffusion- or LLM-based methods—the approach achieves state-of-the-art success rates on CrossTask/COIN/NIV and establishes a unified evaluation benchmark.
WSGG: Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos: This paper proposes the World Scene Graph Generation (WSGG) task, extending conventional frame-level scene graphs to track all objects—including occluded and invisible ones—within a unified world coordinate system. Accompanied by the ActionGenome4D dataset and three complementary methods (PWG, MWAE, and 4DST), the work enables persistent scene reasoning.

💬 LLM / NLP¶

Bi-CMPStereo: Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo: Bi-CMPStereo is a bidirectional cross-modal prompting framework that alternately designates event and frame as the target domain for stereo canonicalization and cross-domain embedding adaptation, while leveraging cost volumes from both directions to achieve robust event-frame asymmetric stereo matching.
Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting: The QICA framework addresses the lack of quantity awareness and spatial insensitivity in zero-shot object counting by using a quantity-conditioned Synergistic Prompting Strategy (SPS) to jointly adapt vision-language encoders, combined with a Cost Aggregation Decoder (CAD) operating on similarity maps to preserve zero-shot transferability, achieving zero-shot SOTA on FSC-147 (MAE 12.41) with strong cross-domain generalization.
Composing Concepts from Images and Videos via Concept-prompt Binding: Bind & Compose (BiCo) is a one-shot method that binds visual concepts to prompt tokens via hierarchical binders and achieves flexible image-video concept composition through token-level composition, comprehensively outperforming prior work in concept consistency, prompt fidelity, and motion quality.
CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection: CoPS is a framework that dynamically generates prompts through two visual conditioning mechanisms — Explicit State Token Synthesis (ESTS) and Implicit Category Token Sampling (ICTS) — combined with Spatially-Aware Global-local Alignment (SAGA), achieving zero-shot anomaly detection SOTA across 13 industrial and medical datasets.
GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations: The paper proposes the GUIDE framework, which leverages in-context learning capabilities of LLMs to provide guided decision evolution for autonomous spacecraft operations, enabling progressive improvement of mission planning and fault diagnosis decisions through structured contextual information and feedback mechanisms without fine-tuning.
Perception Programs: Unlocking Visual Tool Reasoning in Language Models: Perception Programs (P2) is a training-free, model-agnostic method that converts raw visual tool outputs (depth, optical flow, correspondences, etc.) into compact language-native structured summaries, enabling MLLMs to directly "read" visual modalities rather than infer from dense pixels, achieving an average 19.66% improvement across 6 BLINK tasks.
PhysVid: Physics Aware Local Conditioning for Generative Video Models: PhysVid is a physics-aware local conditioning scheme that segments videos into temporal chunks, annotates each chunk with physics phenomenon descriptions via a VLM, and injects them through chunk-level cross-attention. At inference, "negative physics prompts" (counterfactual guidance) steer generation away from physics violations, improving physics commonsense scores by approximately 33% on VideoPhy.
Sign Language Recognition in the Age of LLMs: This paper presents the first systematic evaluation of modern VLMs on zero-shot isolated sign language recognition (ISLR), revealing that open-source VLMs fall far behind specialized classifiers while large commercial models (GPT-5) demonstrate surprising potential.
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation: SketchDeco is a training-free line-art colorization method that uses a global-local two-stage strategy with region masks and color palettes as precise control signals, leveraging diffusion model inversion and self-attention injection in latent space for region-accurate coloring with harmonious global transitions, completing in 15–20 steps on consumer GPUs.

🔍 Information Retrieval & RAG¶

Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval: This paper introduces MCMR (Multi-Conditional Multimodal Retrieval), a large-scale benchmark that employs a dual-evidence design — where certain attributes are inferable only from images and others only from text — to ensure retrieval tasks cannot be solved unimodally. The benchmark systematically evaluates 5 retrievers and 7 MLLM rerankers, revealing modality asymmetry and fine-grained reasoning gaps.
CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering: CC-VQA is a training-free method for mitigating knowledge conflicts in KB-VQA. Through a two-stage strategy combining visual-centric contextual conflict reasoning and correlation-guided encoding/decoding, it achieves absolute accuracy improvements of 3.3%–6.4% on three benchmarks: E-VQA, InfoSeek, and OK-VQA.
Explaining CLIP Zero-shot Predictions Through Concepts: This paper proposes EZPC, which learns a linear projection matrix \(A\) to jointly map CLIP image and text embeddings into an interpretable concept space. The method provides faithful, human-understandable explanations for CLIP predictions with negligible accuracy loss (H-mean gap of ~1% on CIFAR-100/CUB/ImageNet-100) and an inference overhead of only ~0.1ms.
M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG: This paper proposes M4-RAG, the first large-scale multilingual, multicultural, multimodal RAG evaluation framework, covering 42 languages and 189 countries with 80K+ cultural VQA instances. It systematically reveals two key findings: RAG is effective for smaller models but does not scale positively with model size, and cross-lingual retrieval suffers from severe performance degradation.
MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model: MuCo proposes a multi-turn dialogue-based contrastive learning framework that leverages the conversational capabilities of MLLMs to process multiple associated query-target pairs within a single forward pass, substantially improving training efficiency and achieving state-of-the-art performance on the MMEB and M-BEIR retrieval benchmarks.
NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval: NanoVDR exploits the modality asymmetry between queries and documents, distilling the query encoding capability of a 2B VLM teacher into a 69M text-only encoder via pointwise cosine alignment. On the ViDoRe benchmark, the student retains 95.1% of teacher performance while reducing query latency by 50× with a total training cost of only 13 GPU hours.
NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval: NanoVDR exploits the inherent asymmetry between queries and documents to distill a 2B-parameter VLM document retriever into a 69M text-only query encoder via pointwise cosine alignment. The student model retains 95.1% of teacher performance on the ViDoRe benchmark, reduces query latency by 50×, and requires only 13 GPU hours to train.
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations: This paper proposes RobustVisRAG, a causality-guided dual-path framework that decouples semantic–degradation entanglement in VisRAG by capturing degradation signals via a non-causal path and learning clean semantics via a causal path. Under real-world degradation conditions, the framework achieves improvements of 7.35%, 6.35%, and 12.40% in retrieval, generation, and end-to-end performance, respectively, while preserving performance on clean data.
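
The pointwise cosine alignment mentioned in the NanoVDR notes above can be illustrated with a minimal sketch. It assumes the objective is the mean of `1 - cos(student, teacher)` over paired query embeddings; the exact loss form, the function names, and the plain-list vectors are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_alignment_loss(student_batch, teacher_batch):
    """Mean of (1 - cosine) over paired student/teacher query embeddings.

    Assumed loss form: each student embedding is pulled toward the
    teacher embedding of the same query, with no negatives involved.
    """
    losses = [1.0 - cosine(s, t) for s, t in zip(student_batch, teacher_batch)]
    return sum(losses) / len(losses)

# Hypothetical toy embeddings: the first pair is parallel, the second orthogonal.
student = [[1.0, 2.0], [1.0, 0.0]]
teacher = [[2.0, 4.0], [0.0, 1.0]]
loss = cosine_alignment_loss(student, teacher)
```

Because the loss is pointwise, only the teacher's query embeddings are needed as training targets in this sketch; no documents or negative pairs appear in it.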

📈 Time Series¶

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens: This paper proposes DeltaTok, which compresses inter-frame VFM feature differences into a single delta token. Combined with Best-of-Many training, DeltaWorld efficiently generates diverse future predictions in a single forward pass. The model uses only 1/35 the parameters of Cosmos and 1/2000 the FLOPs, yet achieves superior performance on dense prediction tasks.
Competition-Aware CPC Forecasting with Near-Market Coverage: This paper reformulates CPC forecasting in search advertising as a time series prediction problem under partially observable competition states. Three observable proxies (semantic similarity, CPC trajectory alignment, and geographic intent) are constructed to approximate latent competition and are then injected into forecasters as covariates and graph priors. The proposed framework achieves substantial improvements over purely autoregressive baselines on medium- and long-term forecasting horizons.
Competition-Aware CPC Forecasting with Near-Market Coverage: This paper reframes cost-per-click (CPC) forecasting in paid search advertising as a partial competition observability problem. By constructing three families of competition proxy signals — semantic neighborhood, DTW behavioral neighborhood, and geographic intent — and integrating them with temporal foundation models (Chronos-2/TimeGPT/Moirai) and spatiotemporal GNNs, the proposed framework achieves significant improvements in medium-to-long-term forecasting accuracy over 1,811 keyword time series.
L2GTX: From Local to Global Time Series Explanations: L2GTX is a fully model-agnostic local-to-global explanation method for time series that uses parameterized event primitives (increasing/decreasing trends, local extrema) as explanation units. Through hierarchical-clustering-based merging, greedy budgeted selection, and aggregation of attribute statistics, it produces compact and faithful class-level global explanations across 6 UCR datasets (GF = 0.792 on ECG200 with FCN).
L2GTX: From Local to Global Time Series Explanations: L2GTX proposes a fully model-agnostic local-to-global explanation framework for time series classification. It extracts parameterized temporal event primitives (PEPs)—trends and extrema—from LOMATCE local explanations, merges redundant clusters across instances via hierarchical clustering, selects representative instances through submodular optimization, and aggregates these into concise class-level global explanations. The method maintains stable global faithfulness across six time series classification datasets.
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning: This paper proposes PFGNet, a fully convolutional spatiotemporal prediction framework that dynamically modulates multi-scale large-kernel peripheral responses via pixel-wise frequency-guided gating (PFG) while applying learnable center inhibition, thereby simulating the biological center-surround bandpass filtering mechanism of the visual system. PFGNet achieves state-of-the-art or near state-of-the-art performance on four benchmarks—Moving MNIST, TaxiBJ, KTH, and Human3.6M—with remarkably few parameters and low computational cost.
Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks: This paper proposes Stable Spike, a dual consistency optimization framework that employs the hardware-friendly bitwise AND operation to decouple a stable spike skeleton \(\tilde{S}\) from multi-timestep spike maps, and injects amplitude-aware spike noise to enhance generalization. The method achieves up to 8.33% accuracy improvement on neuromorphic object recognition tasks under ultra-low latency (\(T=2\)).
STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting: This paper proposes STCast, a framework that replaces static boundary cropping with learnable global-regional distributions via Spatial-Aligned Attention (SAA) to adaptively integrate global atmospheric information into regional forecasting, and employs Temporal Mixture-of-Experts (TMoE) with month-conditioned dynamic routing to enhance temporal modeling. STCast achieves state-of-the-art performance across four tasks: global forecasting, high-resolution regional forecasting, typhoon track prediction, and ensemble forecasting.
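
The bitwise AND step in the Stable Spike note above can be illustrated with a small sketch. It assumes the stable skeleton keeps a position only if it fired at every timestep; this reading, the flat binary-list representation, and the function name `stable_skeleton` are assumptions for illustration, and the amplitude-aware noise injection is not covered.

```python
from functools import reduce

def stable_skeleton(spike_maps):
    """Element-wise AND across timesteps.

    spike_maps: list of T flat binary lists (one per timestep).
    A position stays 1 only if it spiked at every timestep
    (assumed interpretation of the stable spike skeleton).
    """
    return [reduce(lambda a, b: a & b, column) for column in zip(*spike_maps)]

# Hypothetical spike maps for T = 2 timesteps over four neurons.
maps = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
]
skeleton = stable_skeleton(maps)  # [1, 0, 0, 1]
```

Positions that fire in only some timesteps drop out of the skeleton, which is what makes the AND a cheap consistency filter in this sketch.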

📡 Signal & Communications¶

AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation: This paper proposes AcTTA, a test-time adaptation framework based on dynamic activation function modulation. By reparameterizing conventional fixed activation functions into a learnable form—incorporating an activation center shift and asymmetric gradient slopes—AcTTA adaptively adjusts activation behavior during inference to address distribution shift, consistently outperforming normalization-layer-based TTA methods on CIFAR10-C, CIFAR100-C, and ImageNet-C.
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding: This paper presents ChartNet, a million-scale chart understanding dataset comprising 1.5 million high-quality multimodal aligned samples. Generated through a code-guided synthesis pipeline, the dataset covers 24 chart types and 6 plotting libraries, with each sample organized as a quintuple (code, image, data table, text description, QA with reasoning). A 2B model fine-tuned on ChartNet surpasses GPT-4o and 72B open-source models.
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space: CLAY proposes a training-free conditional visual similarity computation method that modulates similarity by constructing text-conditioned subspaces within the VLM embedding space. It adapts to varying retrieval conditions without recomputing database features and supports multi-condition retrieval.
Dual-Imbalance Continual Learning for Real-World Food Recognition: This paper proposes DIME, a framework that employs class-count-aware spectral adapter merging and rank-wise threshold modulation to address dual imbalance (intra-step class long-tail distribution and inter-step class-count skew) in continual learning, consistently outperforming baselines by over 3% on four long-tail food recognition benchmarks.
FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection: This paper proposes FAAR, a frequency-aware parameter-efficient fine-tuning method for multi-task learning. It introduces Performance-Driven Rank Shrinking (PDRS) to dynamically select the optimal rank per task and per layer, and designs a Task-Spectral Pyramidal Decoder (TS-PD) that leverages FFT frequency information to enhance spatial awareness and cross-task consistency. FAAR achieves superior performance using only 1/9 the parameters of full fine-tuning.

👥 Social Computing¶

As Language Models Scale, Low-order Linear Depth Dynamics Emerge: This paper treats the layer depth of a Transformer as a discrete-time system, demonstrating that the inter-layer propagation and intervention response of GPT-2 can be approximated near a given context by a 32-dimensional low-order linear state-space surrogate. Notably, as model scale increases, this surrogate becomes more accurate. The framework further enables the derivation of energy-efficient multi-layer intervention strategies that outperform heuristic injection baselines.
As Language Models Scale, Low-order Linear Depth Dynamics Emerge: This work treats the layer-wise forward pass of a Transformer as a discrete-time dynamical system and constructs a 32-dimensional low-order linear layer variant (LLV) surrogate to approximate the depth propagation dynamics of the last-token hidden state. The surrogate achieves a Spearman correlation of 0.995 in predicting per-layer intervention gains on GPT-2-large, and this linear identifiability monotonically increases with model scale (GPT-2 → medium → large). The closed-form optimal solution of the surrogate is further exploited to derive multi-layer activation steering schemes that require 2–5× less energy than heuristic intervention strategies.
Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification: This paper proposes the MaLSF framework, which employs mask-label pairs as semantic anchors and introduces a Bidirectional Cross-modal Verification (BCV) module and a Hierarchical Semantic Aggregation (HSA) module to enable active local semantic conflict detection, achieving state-of-the-art performance on the DGM4 benchmark and fake news detection tasks.
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance: This paper proposes leveraging provenance information—automatically obtained during the synthetic data generation process—as auxiliary supervision signals. By applying input gradient guidance (suppressing input gradients in non-target regions), the method directly encourages models to learn discriminative representations focused on target regions. Effectiveness is validated across multiple tasks and modalities, including weakly supervised localization, spatio-temporal action detection, and image classification.
Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning: This paper proposes E2OAL, a detector-free open-set active learning framework that discovers latent structures among unknown classes via label-guided clustering, jointly models known and unknown categories through a Dirichlet calibration auxiliary head, and introduces a two-stage adaptive querying strategy. E2OAL simultaneously achieves high accuracy, high query purity, and high training efficiency across multiple benchmarks.

🔗 Causal Inference¶

Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression: This paper proposes CIPHER, a training-free test-time hallucination suppression method. It generates semantically altered yet structurally preserved counterfactual images via a diffusion model, applies SVD decomposition to the representation differences between original and counterfactual images in LVLM hidden layers to extract a hallucination subspace, and then projects hidden states onto the orthogonal complement of this subspace during inference. CIPHER is the first method to localize and mitigate LVLM hallucinations by intervening on the visual modality.
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression: This paper proposes CIPHER, a training-free test-time hallucination suppression method. In the offline phase, a diffusion model is used to generate counterfactual images, constructing the OHC-25K dataset, from which visual hallucination subspaces are extracted via SVD. During inference, hidden states are projected onto the orthogonal complement of this subspace, significantly reducing visual hallucinations in LVLMs without modifying model parameters or incurring additional inference overhead.
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations: This paper proposes MaskDiME, a training-free diffusion framework that transforms global classifier guidance into decision-driven local editing via an adaptive dual-mask mechanism, enabling precise and efficient visual counterfactual explanations. MaskDiME achieves inference speeds more than 30× faster than DiME while requiring only one-tenth the GPU memory of ACE/RCSB.
Retrieving Counterfactuals Improves Visual In-Context Learning: This paper proposes CIRCLES, a framework that retrieves counterfactual demonstrations via attribute-guided composed image retrieval, constructing dual-channel in-context demonstrations combining causality and correlation to substantially improve fine-grained visual reasoning in VLMs.
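
The subspace projection described in the CIPHER notes above can be sketched as follows. This is a simplified rank-1 illustration: it uses power iteration to find the leading right singular direction of the difference matrix in place of a full SVD, and all names and toy vectors are hypothetical. The paper's actual subspace rank and layer choice are not reproduced here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_direction(diffs, iters=200):
    """Leading right singular vector of the difference matrix D
    (rows = original-minus-counterfactual hidden states),
    found by power iteration on D^T D."""
    dim = len(diffs[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        coeffs = [dot(d, v) for d in diffs]            # D v
        w = [sum(c * d[i] for c, d in zip(coeffs, diffs))
             for i in range(dim)]                      # D^T (D v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            return v
        v = [x / norm for x in w]
    return v

def project_out(hidden, direction):
    """Project a hidden state onto the orthogonal complement of a unit direction."""
    coeff = dot(hidden, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Hypothetical representation differences, dominated by the first axis.
diffs = [[2.0, 0.0, 0.1], [-1.5, 0.0, 0.0], [3.0, 0.1, 0.0]]
v = top_direction(diffs)
cleaned = project_out([5.0, 1.0, 2.0], v)
```

After projection, the cleaned state has no component along the extracted direction. A multi-direction subspace would repeat the projection for each orthonormal basis vector.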

⚡ LLM Efficiency¶

GeoCodeBench: Benchmarking PhD-Level Coding in 3D Geometric Computer Vision: GeoCodeBench is the first PhD-level code generation benchmark for 3D geometric computer vision, comprising 100 function-completion tasks curated from top-venue 2025 papers and codebases, each with diverse automated unit tests. The strongest model, GPT-5, achieves only a 36.6% pass rate, revealing a significant gap in LLMs' ability to implement research-level 3D code.
CHEEM: Continual Learning by Reuse, New, Adapt and Skip -- A Hierarchical Exploration-Exploitation Approach: Proposes the CHEEM framework that leverages hierarchical exploration-exploitation (HEE) NAS to automatically learn task-aware dynamic ViT backbones—selecting Reuse/New/Adapt/Skip operations at each layer—significantly outperforming prompt-based methods on MTIL and VDD continual learning benchmarks, approaching the full fine-tuning upper bound.
SparVAR: Exploring Sparsity in Visual Autoregressive Modeling for Training-Free Acceleration: Systematically analyzes attention activation patterns in VAR models, revealing three sparsity properties (attention sinks, cross-scale similarity, spatial locality), and proposes SparVAR, a training-free acceleration framework with two plug-and-play modules—Cross-Scale Self-Similar Sparse Attention (CS⁴A) and Cross-Scale Local Sparse Attention (CSLA)—achieving sub-second generation for 8B models at 1024×1024 (1.57× speedup) with virtually no loss in high-frequency details.
StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives: Proposes StoryTailor, a zero-shot visual narrative generation pipeline that uses Gaussian-Centered Attention (GCA) to mitigate subject overlap and background leakage, Action-Boost SVR (AB-SVR) to amplify action semantics, and Selective Forgetting Cache (SFC) to maintain cross-frame background continuity, achieving multi-subject, action-rich visual narrative generation on a single RTX 4090 with 10–15% CLIP-T improvement over baselines.

🧮 Scientific Computing¶

Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis: This paper proposes an exposure-time-dependent modulation transfer function (ET-MTF) that treats exposure time as a continuous variable, and constructs a large-scale synthetic turbulence dataset ET-Turb (5,083 videos, 2 million frames), significantly improving the generalization of turbulence restoration models on real-world data.
EHETM: High-Quality and Efficient Turbulence Mitigation with Events: This paper proposes EHETM, the first method to leverage the microsecond temporal resolution of event cameras to overcome the accuracy–efficiency bottleneck of conventional multi-frame turbulence mitigation (TM). Two key physical phenomena are identified—polarity alternation of turbulence-induced events correlated with image gradients, and spatiotemporally coherent "event tubes" formed by dynamic objects—motivating two complementary modules: a polarity-weighted gradient module and an event tube constraint module. EHETM reduces data overhead by 77.3% and system latency by 89.5%, with particularly substantial gains over the state of the art in dynamic-object scenes.
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training: NESTOR, a nested MoE-based neural operator, is proposed to capture global features across different PDE types via image-level MoE and local spatial correlations within physical fields via token-level Sub-MoE. The model is pre-trained on 12 PDE datasets and effectively transferred to downstream tasks.
PhysSkin: Real-Time and Generalizable Physics-Based Skin Simulation: PhysSkin is a generalizable physics-informed framework that learns continuous skinning weight fields directly from static 3D geometry via a neural skinning field autoencoder, coupled with a physics-informed self-supervised learning strategy (energy minimization + smoothness + orthogonality constraints), enabling real-time physics-based animation that generalizes across shapes and discretizations without any annotated data or simulation trajectories.

✏️ Knowledge Editing¶

Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors: This paper proposes an attribution-guided dynamic model rectification framework that repurposes rank-one model editing from domain adaptation to behavior rectification. By quantifying per-layer editability via Integrated Gradients, the framework automatically localizes suspect layers and repairs three categories of unreliable behaviors—backdoor attacks, spurious correlations, and feature leakage—using as few as a single clean sample.
MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization: This paper identifies and exploits the cross-modal knowledge transfer phenomenon—modifications to knowledge within an LLM text encoder naturally transfer to visual generation—and proposes MoKus, a two-stage framework (visual concept learning + textual knowledge updating) for knowledge-aware concept customization.
MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization: This paper introduces a new task termed "knowledge-aware concept customization," and discovers that knowledge editing applied to LLM text encoders naturally transfers to the visual generation modality (cross-modal knowledge transfer). Building on this finding, the paper proposes MoKus: a two-stage framework that first binds a rare token to a visual concept as an anchor representation via LoRA fine-tuning, then efficiently maps multiple natural-language knowledge statements onto the anchor representation via knowledge editing—requiring only ~7 seconds per knowledge update.

💻 Code Intelligence¶

GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning: GeoTikzBridge constructs the largest 2.5M image–TikZ code dataset and the first auxiliary-line instruction dataset, trains a code generation model capable of accurately reconstructing geometric figures, and serves as a plug-and-play module to enhance the geometric reasoning capabilities of arbitrary MLLMs/LLMs.
MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction: This paper proposes MM-ReCoder, the first multimodal LLM with genuine self-correction capability for chart-to-code generation. Using a two-stage multi-turn GRPO reinforcement learning framework (first optimizing correction ability via shared-first-turn training, then optimizing coding ability via full-trajectory training), MM-ReCoder achieves an 86.5% low-level score on ChartMimic with only 7B parameters, rivaling Qwen3-VL-235B.

🌐 Multilingual & Translation¶

MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation: This paper constructs MMTIT-Bench, a multilingual multi-scenario text-image translation benchmark covering 14 non-English non-Chinese languages, and proposes the CPR-Trans data paradigm (Cognition → Perception → Translation Reasoning). The approach significantly improves end-to-end translation quality on 3B and 7B models, with the 7B model achieving performance competitive with a 235B model.
SEA-Vision: A Multilingual Benchmark for Document and Scene Text Understanding in Southeast Asia: This paper introduces SEA-Vision, a benchmark that unifies evaluation of document parsing (15,234 pages) and text-centric VQA (7,496 QA pairs) across 11 Southeast Asian languages. A re-rendering strategy eliminates visual–textual misalignment in multilingual VQA, revealing severe performance degradation of 3–7× for MLLMs on low-resource SEA languages.

🔎 AIGC Detection¶

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks: This paper introduces a new task of "fine-grained image aesthetic assessment," constructs the FGAesthetics benchmark containing 32,217 images across 10,028 series, and proposes FGAesQ: a model that learns discriminative aesthetic scores from relative rankings via Difference-Preserving Tokenization (DiffToken), Contrastive Text-Guided Alignment (CTAlign), and Ranking-Aware Regression (RankReg). The model achieves 0.779 pairwise accuracy on fine-grained scenes while maintaining a coarse-grained SRCC of 0.770.

🗣️ Dialogue Systems¶

Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition: This paper proposes HIER, which combines hierarchical semantic representation (a three-level hierarchy of tokens → concepts → relations) with a self-evolutionary reasoning mechanism driven by MLLM feedback, consistently outperforming SOTA methods and leading MLLMs by 1–3% on three multimodal intent recognition benchmarks.

⚛️ Physics¶

QKD: Quantum-Gated Task-interaction Knowledge Distillation for Class-Incremental Learning: QKD introduces quantum gating into class-incremental learning (CIL), modeling sample-task correlations in high-dimensional Hilbert space via parameterized quantum circuits to guide cross-task knowledge distillation during training and adaptive adapter fusion during inference, achieving state-of-the-art performance on 5 benchmarks.

📂 Others¶

AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments: This paper proposes AdaSFormer, a serialized Transformer framework for indoor Monocular Semantic Scene Completion (MSSC), achieving state-of-the-art performance on NYUv2 and Occ-ScanNet through three core designs: Adaptive Serialization Attention (with learnable offsets), Center-Relative Position Encoding, and Convolutional Modulation Layer Normalization.
AssistMimic: Physics-Grounded Humanoid Assistance via Multi-Agent RL: The first multi-agent RL framework that performs contact-rich human-human assistive motion imitation in physics simulation, enabling MARL in high-contact settings via motion prior initialization, dynamic reference redirection, and contact facilitation rewards.
BenDFM: A Taxonomy and Synthetic CAD Dataset for Manufacturability Assessment in Sheet Metal Bending: This paper proposes a two-dimensional taxonomy of manufacturability metrics (configuration dependence × feasibility/complexity) and introduces BenDFM, the first synthetic CAD dataset for sheet metal bending (20,000 parts, covering both manufacturable and non-manufacturable designs). Benchmark results show that topology-aware graph representations (UV-Net, AUC 0.896) consistently outperform point cloud methods (PointNext, AUC 0.844) across all four task categories.
BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending: This paper proposes a two-dimensional taxonomy of manufacturability metrics (configuration dependency × feasibility/complexity) and constructs BenDFM, the first synthetic dataset for sheet metal bending (20k parts). Benchmark results show that graph-based representations (UV-Net) outperform point cloud representations (PointNext), and configuration-dependent metrics are harder to predict.
Bounds on Agreement between Subjective and Objective Measurements: Starting from the mathematical properties of MOS, this paper derives theoretical formulas for the upper bound on PCC and the lower bound on MSE between subjective test results and any objective estimator. It further proposes the BinoVotes/BinoMOS voting model and validates both the bounds and the model on 18 subjective test datasets.
Bounds on Agreement between Subjective and Objective Measurements: This paper derives closed-form expressions for the upper bound on PCC and the lower bound on MSE between subjective MOS values and any objective quality estimator, and proposes BinoVotes — a binomial distribution-based voting model — to estimate these bounds when per-vote variance information is unavailable.
U-F²-CBM: CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models: This paper proposes TextUnlock, a method that trains a lightweight MLP to project features from an arbitrary frozen visual classifier into the text embedding space—while preserving the original classifier's output distribution—requiring no CLIP, no annotations, and no linear probe training. Any legacy classifier can thereby be converted into an interpretable concept bottleneck model. Evaluated on 40+ architectures, the approach surpasses even supervised CLIP-based CBMs.
Coded-E2LF: Coded Aperture Light Field Imaging from Events: This paper provides the first demonstration that an event camera alone (without conventional intensity images) can reconstruct a 4D light field at pixel-level accuracy. The proposed Coded-E2LF system triggers events via a coded aperture pattern sequence and accumulates them into event images. By introducing an all-black pattern, a mathematical equivalence between event-based and intensity-based coded aperture imaging is established. Combined with end-to-end deep optics training, the system achieves 8×8 sub-aperture light field reconstruction.
Crowdsourcing of Real-world Image Annotation via Visual Properties: This paper proposes an image annotation methodology constrained by visual properties. It constructs an object category hierarchy through knowledge representation and combines it with an interactive crowdsourcing framework that uses visual genus and visual differentia to guide the annotation process, thereby reducing annotator subjectivity and narrowing the semantic gap.
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis: Through three complementary levels of analysis — macroscopic convergence state, microscopic gradient dynamics, and information-theoretic limits — this paper rigorously proves that even given a perfect noise transition matrix, Forward Correction (FC) inevitably collapses to the same suboptimal level as no correction. The root cause lies in memorization under finite samples and the information loss induced by the noisy channel.
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis: Through controlled experiments, this paper demonstrates that even given a perfect noise transition matrix \(T\), forward correction (FC) still suffers from performance collapse in the late stages of training. The paper systematically diagnoses the root causes of this failure from three complementary perspectives: macroscopic convergence states, microscopic optimization dynamics, and information theory.
DiffBMP: Differentiable Rendering with Bitmap Primitives: This paper proposes DiffBMP — the first general-purpose differentiable rendering engine for bitmap primitives — which enables efficient gradient-based optimization of position, rotation, scale, color, and opacity across thousands of bitmap primitives via a custom CUDA parallel pipeline, filling the gap left by 2D differentiable rendering methods that are restricted to vector graphics.
DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification: This paper proposes Dirichlet Prior Augmentation (DirPA), which mitigates prior shift between artificially balanced training episodes and severely imbalanced real-world label distributions by sampling from a Dirichlet distribution to simulate unknown long-tailed label distribution shifts during few-shot learning training. The method is validated on crop-type classification tasks across multiple EU countries, demonstrating cross-regional effectiveness.
DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification: This paper proposes Dirichlet Prior Augmentation (DirPA), which constructs imbalanced episodes during few-shot learning (FSL) training by sampling class proportion vectors from a Dirichlet distribution, actively simulating real-world long-tail distributions to mitigate prior shift. The method demonstrates consistent robustness improvements and rare-class accuracy gains on crop-type classification tasks across multiple European countries.
Do Vision Models Perceive Illusory Motion in Static Images Like Humans?: This paper systematically evaluates a range of optical flow models on static-image motion illusions such as the Rotating Snakes, finding that only the biologically-inspired Dual-Channel model reproduces the human-perceived rotational motion under simulated saccade conditions.
Dual-Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions: This paper proposes a dual-band long-wave infrared (LWIR) video analysis framework that jointly leverages spectral cues (constant emissivity ratio across dual bands) and temporal cues (smooth object radiance variation vs. abrupt background radiance changes) to achieve, for the first time, pixel-wise separation of reflected and emitted components in dynamic scenes near ambient temperature, along with recovery of per-pixel emissivity and temperature fields.
ELogitNorm: Enhancing OOD Detection with Extended Logit Normalization: This paper diagnoses two feature collapse problems in LogitNorm (dimensional collapse and origin collapse), and proposes ELogitNorm — replacing the feature norm with the average distance to decision boundaries as an adaptive temperature scaling factor. The method requires no hyperparameters, is compatible with all post-hoc OOD detection methods, achieves a 10.48% far-OOD AUROC improvement on CIFAR-10 (with SCALE), reduces FPR95 from 51.45% to 27.74% on ImageNet-1K, and simultaneously improves classification accuracy and ECE calibration.
FEAT: Federated Geometry-Aware Correction for Exemplar Replay under Continual Dynamic Heterogeneity: FEAT is proposed to address the underutilization of replay exemplars in federated continual learning (FCL), mitigating cross-client heterogeneity and task-level data imbalance via geometric structure alignment (angular distillation based on ETF prototypes) and energy-based geometric correction (inference-time debiasing).
GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents: This paper proposes GardenDesigner, a framework that encodes the aesthetic principles of Jiangnan gardens into computable constraints through a chain of agents (terrain distribution → road generation → asset selection → layout optimization). Combined with the expert-annotated GardenVerse dataset, the framework enables non-expert users to automatically construct aesthetically compliant Jiangnan gardens from text input within one minute.
GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion: This paper proposes GazeOnce360, an end-to-end dual-resolution CNN model for 360° multi-person gaze direction estimation using a single upward-facing tabletop fisheye camera. The authors also construct MPSGaze360, the first large-scale synthetic dataset for this setting, achieving substantial improvements over the existing multi-stage method GAM360 in both accuracy and speed.
HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition: This paper proposes HypeVPR, a visual place recognition framework based on hierarchical embedding in hyperbolic space, specifically designed to address cross-field-of-view matching between perspective (query) and equirectangular panoramic (database) images. By constructing multi-level descriptors from local to global within the Poincaré ball, HypeVPR flexibly balances accuracy, efficiency, and storage, with retrieval several times faster than sliding-window baselines at comparable accuracy.
Integration of Deep Generative Anomaly Detection Algorithm in High-Speed Industrial Line: This paper presents a GAN-based dense bottleneck residual autoencoder (DRAE) that improves upon GRD-Net for semi-supervised anomaly detection on a pharmaceutical BFS production line. Using 2.81 million training patches, it meets the 500 ms time constraint (0.17 ms/patch) at a balanced accuracy of 97.62%.
Integration of Deep Generative Anomaly Detection Algorithm in High-Speed Industrial Line: This paper proposes a semi-supervised anomaly detection framework based on GAN and a Dense Residual Autoencoder (DRAE), specifically designed for high-speed online quality inspection in pharmaceutical Blow-Fill-Seal (BFS) production lines. Trained exclusively on non-defective samples, the system achieves 96.4% accuracy with a per-patch inference latency of only 0.17ms, satisfying the strict industrial constraint of a 500ms inspection cycle.
IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness: This paper proposes IrisFP, a model fingerprinting framework that simultaneously enhances fingerprint uniqueness and robustness through three innovations: placing fingerprints at the intersection of multi-class decision boundaries, constructing composite sample fingerprints, and performing statistically-guided fingerprint selection. IrisFP consistently achieves higher AUC than state-of-the-art methods across 5 datasets.
LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment: The inaugural LoViF 2026 challenge on human-oriented semantic image quality assessment introduces the SeIQA benchmark dataset (510/80/160 train/validation/test pairs) to measure whether image degradation alters the semantic information that humans care about, rather than traditional perceptual fidelity. The winning solution, RedpanQA Alliance, achieves a final score of 0.8724 using the Qwen3-VL multimodal large language model with LoRA fine-tuning and a PLCC loss.
Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning: To address the "instance entanglement" problem in instance-dependent partial label learning (ID-PLL)—where instances from visually similar classes share overlapping features and candidate label sets—this paper proposes the CAD framework, which mitigates class confusion through two complementary mechanisms: intra-class alignment via class-specific augmentation and inter-class separation via a weighted penalty loss.
MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection: This paper proposes MyoVision, a smartphone-based transillumination imaging framework, and the NEATBoost-Attention neuroevolution-optimized ensemble model for low-cost, real-time three-class classification of chicken breast myopathies (Wooden Breast and Spaghetti Meat).
NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries: This paper proposes NaiLIA, a multimodal retrieval method for nail design images that achieves fine-grained matching via dense intent descriptions and palette queries. A confidence-based relaxed contrastive (CRC) loss is introduced to handle unlabeled positives. NaiLIA substantially outperforms existing methods on the authors' newly constructed NAIL-STAR benchmark and on Marqo Fashion200K.
Neural Collapse in Test-Time Adaptation: This work extends Neural Collapse (NC) theory from the class level to the sample level, discovering the NC3+ phenomenon (sample feature embeddings align with their corresponding classifier weights). Building on this, it identifies feature-classifier misalignment at the sample level as the root cause of performance degradation under distribution shift, and proposes NCTTA, which employs a hybrid objective combining geometric proximity and prediction confidence to guide feature re-alignment, achieving a 14.52% improvement over Tent on ImageNet-C.
Next-Scale Autoregressive Models for Text-to-Motion Generation: MoScale proposes a next-scale autoregressive motion generation framework that replaces conventional next-token prediction. By performing hierarchical causal generation from coarse to fine, the model captures global semantic structure and introduces cross-scale hierarchical refinement and in-scale temporal refinement, achieving state-of-the-art performance on HumanML3D and KIT-ML (Top-1 0.540, FID 0.046).
Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples: To address the practical challenge that the definition of "normal" shifts with specification changes in industrial anomaly detection, this paper proposes two novel evaluation scenarios (A2N/N2A), a new metric (S-AUROC), and a training augmentation method called RePaste. RePaste increases the training frequency of high-anomaly-score regions by repasting them onto subsequent training images, enabling models to flexibly adapt to changes in the definition of normal samples.
OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion: This work introduces OmniFood8K, a multimodal Chinese food nutrition dataset comprising 8,036 samples, along with a synthetic dataset NutritionSynth-115K containing 115K samples. An end-to-end framework is proposed that predicts nutritional information from a single RGB image via a Scale-Shift depth adapter, frequency-aligned fusion, and a mask-based prediction head.
Order Matters: 3D Shape Generation from Sequential VR Sketches: This paper proposes VRSketch2Shape, a framework that, for the first time, models the temporal stroke order of VR sketches. Through a sequence-aware BERT encoder combined with a diffusion-based 3D generator (SDFusion), the framework generates high-fidelity 3D shapes from ordered VR sketches. The work also contributes a multi-category dataset comprising 20k synthetic and 900 real sketches.
POLISH'ing the Sky: Wide-Field and High-Dynamic Range Interferometric Image Reconstruction: POLISH++ extends the POLISH framework by introducing a patch-wise training-and-stitching strategy and an arcsinh nonlinear transformation, addressing two major practical deployment challenges in radio interferometric imaging: wide-field imaging (images exceeding ten thousand pixels) and high dynamic range (\(10^4\)–\(10^6\)). On T-RECS simulated data, POLISH++ substantially outperforms CLEAN in source detection accuracy, recovers strong gravitational lens systems near the PSF scale through super-resolution, and is projected to increase the number of gravitational lens discoveries in DSA surveys by approximately one order of magnitude.
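To illustrate why an arcsinh transformation helps with high dynamic range, here is a minimal sketch. How POLISH++ parameterizes the transform is not stated in the note above, so the `scale` parameter and the function names `compress` and `expand` are assumptions made for illustration.

```python
import numpy as np

def compress(image, scale=1.0):
    """Map pixel intensities through arcsinh: linear near zero, logarithmic for bright pixels."""
    return np.arcsinh(np.asarray(image, dtype=float) / scale)

def expand(compressed, scale=1.0):
    """Invert the compression with sinh to return to the original intensity units."""
    return scale * np.sinh(compressed)

# Two sources whose brightness differs by a factor of 10^6.
sources = np.array([1.0, 1.0e6])
squashed = compress(sources)
restored = expand(squashed)
```

After compression the two values differ by a factor of under 20 instead of 10^6, so faint sources are no longer negligible next to bright ones, and `expand` recovers the original intensities.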
Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model: This paper proposes HD-LIF (Hybrid-Driven LIF), a family of spiking neuron models that adopts distinct spike computation mechanisms above and below the firing threshold. It theoretically establishes gradient separability and alignment, resolving the forward–backward propagation inconsistency in SNN online training, while simultaneously achieving full-pipeline optimization of learning accuracy, memory complexity, and power consumption—attaining 78.61% accuracy on CIFAR-100 with 10× parameter compression, 11× power reduction, and 30% NOPs savings.
Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model: This paper proposes the Hybrid-Driven LIF (HD-LIF) model family, which achieves gradient separability and alignment by adopting distinct spike computation mechanisms in the sub- and supra-threshold regions. This approach resolves the fundamental forward–backward propagation inconsistency in SNN online training, while simultaneously optimizing training accuracy, memory complexity, and inference power consumption across all stages.
Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods: This work establishes a learning-observation framework based on PIV wind tunnel experimental data, systematically comparing Kriging interpolation with three deep learning models (UNet/ViTAE/CWGAN) for rooftop wind field reconstruction under 5–30 sparse sensors. It demonstrates that under multi-direction training (MDT), deep learning consistently outperforms Kriging (SSIM improvement of 18–34%), and sensor placement robustness is enhanced by up to 27.8% via QR decomposition-based sensor layout optimization.
Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods: Based on wind tunnel PIV experimental data, this paper systematically compares Kriging interpolation and three deep learning methods (UNet, ViTAE, CWGAN) for rooftop wind field reconstruction under sparse sensor conditions, and proposes QR decomposition-based sensor placement optimization to enhance robustness.
Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation: This paper proposes FECO, a framework that achieves robust dense foot contact estimation from a single RGB image via shoe style–content randomization (adversarial training) and ground-aware learning (pixel height maps + ground normals), significantly outperforming existing methods on multiple benchmarks.
SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules: This paper proposes SHREC, an algorithm that recovers projection angles of helical molecule segments directly from cryo-EM 2D projection images via spectral embedding, without requiring prior knowledge of helical symmetry parameters (rise/twist), enabling truly ab-initio helical reconstruction.
SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules: SHREC employs spectral embedding to directly recover projection angles of helical molecules from 2D cryo-EM projection images without prior knowledge of helical symmetry parameters. By proving that projections of helical segments form a one-dimensional closed manifold homeomorphic to the circle \(S^1\), the method achieves near-publication-quality high-resolution reconstructions (3.66 Å–8.23 Å) on three public datasets: TMV, VipA/VipB, and MakA.
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos: This paper proposes SimRecon, a framework that automatically constructs simulation-ready compositional 3D scenes from real videos via a three-stage "perception → generation → simulation" pipeline. The core innovations are Active Viewpoint Optimization (AVO), which identifies the optimal projection viewpoint for single-object generation, and the Scene Graph Synthesizer (SGS), which guides physically plausible hierarchical assembly.
SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design: This paper presents SldprtNet, a large-scale multimodal CAD dataset comprising 242,000+ industrial parts, where each sample contains four fully aligned modalities: .sldprt/.step 3D models, seven-view composite images, parametric modeling scripts, and natural language descriptions. The authors develop a lossless encoder/decoder toolchain supporting 13 CAD commands, and baseline experiments demonstrate the significant advantage of multimodal input over text-only input for CAD generation tasks.
SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design: This paper introduces SldprtNet — a large-scale multimodal dataset comprising 242K+ industrial CAD parts, where each sample includes a .sldprt/.step model, a 7-view composite image, a parametric modeling script (supporting lossless encoding/decoding of 13 command types), and a natural language description generated by Qwen2.5-VL. Baseline experiments demonstrate that multimodal input (image + text) outperforms text-only input for CAD generation.
Stronger Normalization-Free Transformers: By systematically analyzing four key properties required for pointwise functions to replace normalization layers (zero-centeredness, boundedness, center-sensitivity, and monotonicity), this work identifies \(\text{Derf}(x) = \text{erf}(\alpha x + s)\) as the optimal normalization-layer substitute through large-scale search. Derf consistently outperforms LayerNorm and DyT across vision recognition, image generation, speech representation, and DNA sequence modeling, with performance gains primarily attributable to stronger generalization rather than fitting capacity.
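The pointwise function \(\text{Derf}(x) = \text{erf}(\alpha x + s)\) from the note above can be sketched in a few lines. This is only an illustration of the formula: here `alpha` and `s` are plain floats, whereas in a network they would presumably be learned, and any additional scaling the paper may apply is not modeled.

```python
import math
import numpy as np

# Element-wise error function built from the standard library.
_erf = np.vectorize(math.erf)

def derf(x, alpha=1.0, s=0.0):
    """Apply erf(alpha * x + s) element-wise as a stand-in for a normalization layer."""
    return _erf(alpha * np.asarray(x, dtype=float) + s)

# Activations with a large outlier are squashed into [-1, 1].
activations = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
squashed = derf(activations, alpha=0.5)
```

With `s = 0` the output is zero-centered, always bounded by 1 in magnitude, and monotonically increasing in the input, which matches three of the four properties listed in the note.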
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size: TeamHOI proposes a framework using a Transformer-based decentralized policy network and Masked Adversarial Motion Prior (Masked AMP), enabling a single policy to generalize to cooperative carrying tasks with any number of agents, achieving 97%+ success rate for 2–8 humanoid agents cooperatively carrying a table.
UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting: This paper proposes UniSpector, an open-set industrial defect detection framework that addresses visual prompt embedding collapse through spectral-spatial dual-domain feature fusion (SSPE) and angular-margin contrastive prompt encoding (CPE). On the newly constructed Inspect Anything benchmark encompassing 360 defect categories, UniSpector surpasses the best baseline by 19.7% in AP50 detection and 15.8% in segmentation.
V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos: This paper proposes V-Nutri, the first framework to leverage process information from egocentric cooking videos for dish-level nutrition estimation. A VideoMamba-based keyframe selection module identifies ingredient addition moments, which are fused with the final dish image to predict calories and macronutrients.
ViT3: Unlocking Test-Time Training in Vision: This paper systematically explores the design space of Test-Time Training (TTT) for vision tasks, distills six practical design insights, and proposes ViT3—a purely TTT-based vision architecture with linear complexity—that matches or surpasses Mamba and linear attention methods on classification, generation, detection, and segmentation tasks.
What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F₁: This paper systematically studies the \(F_\beta\) score family as a ranking tradeoff between Precision and Recall from a ranking-theoretic perspective. It proves that the rankings induced by \(F_\beta\) form a geodesic (shortest path) between the Precision and Recall rankings, derives a closed-form formula for finding the optimal \(\beta\), and demonstrates that the commonly used \(F_1\) and skew-insensitive \(F_1\) are rarely optimal ranking tradeoffs in practice.
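A small worked example shows how the choice of \(\beta\) changes the ranking that \(F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)\) induces. The two models and their precision and recall values below are made up for illustration, and the sketch does not implement the paper's closed-form formula for the optimal \(\beta\).

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta -> 0 recovers precision, large beta approaches recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom

def rank(models, beta):
    """Return model names sorted from best to worst under F-beta."""
    return sorted(models, key=lambda name: f_beta(*models[name], beta), reverse=True)

# Hypothetical (precision, recall) pairs.
models = {"A": (0.9, 0.5), "B": (0.6, 0.8)}

ranking_f1 = rank(models, beta=1.0)    # B first: F1 is about 0.686 vs 0.643
ranking_f05 = rank(models, beta=0.5)   # A first: F0.5 is about 0.776 vs 0.632
```

The same pair of models is ordered differently under \(F_1\) and \(F_{0.5}\), so fixing \(\beta = 1\) by default is itself a ranking decision.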
What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution: This paper systematically analyzes the deficiencies of existing rendered synthetic data in corpus, font, and layout diversity, and proposes the UnionST synthetic engine together with a Self-Evolution Learning (SEL) framework. Using only synthetic data, the approach substantially outperforms conventional synthetic sets; combined with SEL, only 9% of real labeled data is required to approach fully supervised performance.
Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation: This paper analyzes the energy landscape to reveal the complementarity between adversarial training (AT) and JEM—AT aligns the clean-adversarial energy distribution (→ robustness); JEM aligns the clean-generated energy distribution (→ accuracy + generation). The proposed EB-JDAT models the joint distribution \(p(\mathbf{x}, \tilde{\mathbf{x}}, y)\) and employs min-max energy optimization to align the energy distributions of all three data types. On CIFAR-10, AutoAttack robustness reaches 68.76% (surpassing SOTA AT by +10.78%), while maintaining 90.39% clean accuracy and competitive generation quality with FID=27.42.
Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation: This paper proposes EB-JDAT, a framework that models the joint energy distribution \(p_\theta(\mathbf{x}, \tilde{\mathbf{x}}, y)\) over clean, adversarial, and generated samples, achieving — for the first time in a single model — high classification accuracy, strong adversarial robustness, and competitive generative capability. On CIFAR-10, it attains 66.12% AutoAttack robustness, surpassing state-of-the-art adversarial training methods by over 10 percentage points.
ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training: This paper proposes ZO-SAM, which replaces the backpropagation in SAM's perturbation step with zeroth-order gradient estimation, reducing SAM's computational overhead from two backward passes to one. This makes SAM practical for sparse training for the first time, achieving consistent improvements of 0.38%–2.54% over all mainstream sparse training methods on CIFAR-10/100 and ImageNet-1K.
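The idea of replacing the gradient in SAM's perturbation step with a zeroth-order estimate can be sketched on a toy problem. This is a hedged illustration, not the authors' algorithm: the two-point random-direction estimator, the function names `zo_gradient` and `zo_sam_step`, and the hyperparameters `mu`, `rho`, and `lr` are all assumptions.

```python
import numpy as np

def zo_gradient(loss_fn, w, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate along one random direction (forward evaluations only)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(w.shape)
    slope = (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2.0 * mu)
    return slope * u

def zo_sam_step(loss_fn, grad_fn, w, lr=0.1, rho=0.05, rng=None):
    """One SAM-style update: zeroth-order perturbation, then a single true gradient call."""
    g_hat = zo_gradient(loss_fn, w, rng=rng)
    # Ascent direction of length rho; the small constant guards against division by zero.
    eps = rho * g_hat / (np.linalg.norm(g_hat) + 1e-12)
    # The only call to grad_fn (the stand-in for backpropagation) is at the perturbed point.
    return w - lr * grad_fn(w + eps)

# Toy quadratic loss with a known gradient.
loss = lambda w: 0.5 * float(np.dot(w, w))
grad = lambda w: w
w0 = np.array([3.0, -2.0])
w1 = zo_sam_step(loss, grad, w0, rng=np.random.default_rng(0))
```

Standard SAM would call `grad_fn` twice per step (once at `w`, once at `w + eps`); in this sketch the first call is replaced by two loss evaluations, which is where the saving described in the note comes from.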