🤖 AAAI2026 Paper Notes

1387 AAAI2026 paper notes covering Medical Imaging (105), Multimodal VLM (92), Image Generation (78), 3D Vision (74), Reinforcement Learning (71), Autonomous Driving (58), Model Compression (54), AI Safety (44), and 44 other areas. Each note covers the TL;DR, motivation, method, experiments, highlights, and limitations, giving a 5-minute read of the core ideas.

🏥 Medical Imaging

A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation: This paper proposes a two-stage disease-aware framework that learns 14 Disease-Aware Semantic Tokens (DASTs) corresponding to pathology categories for explicit disease representation. It further employs a Disease-Visual Attention Fusion (DVAF) module and a Dual-Modal Similarity Retrieval (DMSR) mechanism to assist an LLM in generating clinically accurate chest X-ray reports, achieving state-of-the-art performance on three datasets: CheXpert Plus, IU X-Ray, and MIMIC-CXR.
A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment: This paper proposes the GCSD system for group Cognitive Stimulation Therapy (CST) targeting elderly individuals with cognitive impairment. The system is built on a fine-tuned Qwen-2.5-3B backbone and integrates four modules: multi-speaker context control, dynamic participant state modeling (soft prompt), a cognitive stimulation attention loss, and multi-dimensional reward policy optimization. Training is conducted on 500+ hours of real Cantonese CST dialogues and 10,000+ simulated conversations. The system achieves a BLEU-4 of 27.93, surpassing GPT-4o and other large models, with an A/B test win rate of 50% versus GPT-4o's 39%.
Advancing Safe Mechanical Ventilation Using Offline RL With Hybrid Actions and Clinically Aligned Rewards: This paper addresses the problem of optimizing mechanical ventilation (MV) settings in the ICU via offline RL. A hybrid action space approach (HybridIQL/HybridEDAC) is proposed to avoid distributional shift caused by conventional discretization. Clinically aligned reward functions are introduced based on ventilator-free days (VFD) and physiological safety ranges, with multi-objective optimization used to select the optimal reward. The number of optimizable ventilation parameters is scaled from 2–3 to 6, and HybridIQL achieves the best balance between performance and policy coverage.
Ambiguity-aware Truncated Flow Matching for Ambiguous Medical Image Segmentation: This paper proposes the ATFM framework, which decouples prediction accuracy and diversity into distribution-level and sample-level optimization through a data-hierarchical inference paradigm. By integrating two modules — Gaussian Truncation Representation (GTR) and Segmentation Flow Matching (SFM) — ATFM simultaneously improves prediction accuracy, fidelity, and diversity in ambiguous medical image segmentation.
An LLM-Based Simulation Framework for Embodied Conversational Agents in Psychological Counseling: This paper proposes the ECAs framework, which grounds psychological counseling simulation in established theories such as Cognitive Behavioral Therapy (CBT). By leveraging LLMs to expand real counseling cases into embodied cognitive memory spaces, the framework simulates the complete cognitive processes of clients in counseling sessions and generates high-fidelity dialogue data. ECAs significantly outperforms baselines in both expert and automated evaluations.
Apo2Mol: 3D Molecule Generation via Dynamic Pocket-Aware Diffusion Models: This paper proposes Apo2Mol, a diffusion-based all-atom framework that simultaneously generates 3D ligand molecules and corresponding holo (bound-state) pocket conformations from protein apo (unbound) conformations. Trained on 24K experimentally resolved apo-holo structure pairs, it achieves state-of-the-art performance in binding affinity (Vina min −7.86) and drug-likeness.
Bayesian Meta-Analyses Could Be More: A Case Study in Trial of Labor After a Cesarean-section Outcomes and Complications: This paper proposes a hierarchical Bayesian meta-analysis framework that models the unrecorded clinical decision variable (Bishop score) as a truncated latent variable, correcting the biased conclusions arising from omitted confounders in conventional fixed-effect meta-analyses. Applied to the TOLAC (Trial of Labor After Cesarean) setting, the method demonstrates no significant difference between mechanical dilation and Pitocin.
BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives: This paper proposes a hard negative mining method that constructs a multi-hop semantic graph from PubMed citation chains and performs random walks on it. Using only 20k training samples and minimal fine-tuning steps, small models with 33M and 110M parameters surpass retrieval baselines with billions of parameters on BEIR and LoTTE.
Bidirectional Channel-selective Semantic Interaction for Semi-Supervised Medical Segmentation: This paper proposes the BCSI framework, which employs a channel-selection router to dynamically identify critical feature channels and performs bidirectional channel-level interaction between labeled and unlabeled data streams. Combined with semantic-spatial perturbation-based weak-to-strong consistency learning, BCSI achieves substantial improvements in semi-supervised medical image segmentation.
Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark: This paper presents VL-SurgPT, the first large-scale multimodal surgical point tracking dataset combining visual coordinates with textual state descriptions, and proposes TG-SurgPT, a text-guided tracking method that leverages semantic information to significantly improve tracking accuracy and robustness in complex surgical scenes.
CD-DPE: Dual-Prompt Expert Network Based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution: This paper proposes CD-DPE, a network that employs an iterative Convolutional Dictionary Feature Decoupling Module (CD-FDM) to disentangle multi-contrast MRI features into cross-contrast shared and modality-specific components, followed by a Dual-Prompt Feature Fusion Expert Module (DP-FFEM) for adaptive fusion and reconstruction. CD-DPE surpasses existing state-of-the-art methods on multiple public benchmarks.
CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records: This paper proposes CliCARE, a framework that transforms unstructured longitudinal cancer EHRs into temporal knowledge graphs (TKGs), aligns them with clinical practice guideline (CPG) knowledge graphs, and provides evidence-grounded clinical decision support for LLMs. An LLM-as-a-Judge evaluation protocol highly correlated with expert assessments is also introduced.
Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models: This paper proposes the Coarse-to-Fine Classification (CFC) framework, which leverages the zero-shot reasoning capability of LLMs to supply semantically grounded OOD samples and a potential OOD label space for open-set graph node classification, enabling the model not only to detect OOD nodes but also to classify them into specific unknown categories.
CoCoLIT: ControlNet-Conditioned Latent Image Translation for MRI to Amyloid PET Synthesis: This paper proposes CoCoLIT, a ControlNet-conditioned latent diffusion framework for synthesizing amyloid PET images from structural MRI. Through a Weighted Image Space Loss (WISL) and Latent Averaging Stabilization (LAS), CoCoLIT substantially outperforms existing methods.
Constrained Best Arm Identification with Tests for Feasibility: This paper proposes a new framework for best arm identification (BAI) with feasibility constraints, allowing the decision-maker to test arm performance and feasibility constraints separately. An asymptotically optimal algorithm is designed that adaptively eliminates suboptimal arms via whichever criterion—performance or feasibility—is easier to satisfy.
ConSurv: Multimodal Continual Learning for Survival Analysis: This paper proposes ConSurv, the first multimodal continual learning framework for survival analysis. Through two core components — Multi-Stage Mixture-of-Experts (MS-MoE) and Feature-Constrained Replay (FCR) — ConSurv effectively mitigates catastrophic forgetting in settings that integrate whole slide pathology images and genomic data, comprehensively outperforming existing methods on the newly constructed MSAIL benchmark.
Cross-Sample Augmented Test-Time Adaptation for Personalized Intraoperative Hypotension Prediction: This paper proposes the CSA-TTA framework, which enhances personalized intraoperative hypotension prediction at test time by constructing a cross-sample bank, performing coarse-to-fine retrieval, and applying multi-task optimization to retrieve hypotension event signals from other patients' data.
Decoding with Structured Awareness: Integrating Directional, Frequency-Spatial, and Structural Attention for Medical Image Segmentation: This paper proposes a novel decoder framework for medical image segmentation comprising three modules: Adaptive Cross-Fusion Attention (ACFA) for directional awareness, Triple Feature Fusion Attention (TFFA) for spatial-frequency-wavelet fusion, and a Structural-aware Multi-scale Masking Module (SMMM). The framework achieves state-of-the-art performance across multiple benchmark datasets.
DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening: This paper proposes DeepGB-TB, a multimodal TB screening system combining a lightweight 1D-CNN (for cough audio) and gradient-boosted decision trees (for demographic features). A bidirectional cross-attention module (CM-BCA) fuses heterogeneous data by mimicking clinical reasoning, while a tuberculosis risk-balanced loss (TRBL) minimizes missed diagnoses. The system achieves AUROC 0.903 on a 7-country dataset and supports offline real-time inference on mobile devices.
DeNAS-ViT: Data Efficient NAS-Optimized Vision Transformer for Ultrasound Image Segmentation: DeNAS-ViT is proposed as the first method to apply NAS at the token level within ViT for optimizing multi-scale feature extraction in ultrasound image segmentation. A NAS-constrained semi-supervised learning framework is designed incorporating network independence loss, hierarchical contrastive loss, and staged optimization, achieving state-of-the-art performance under limited annotation.
DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities: This paper proposes DiA-gnostic VLVAE, a vision-language mixture-of-experts VAE that learns a three-factor latent space ($Z_v$ visual-specific / $Z_l$ language-specific / $Z_s$ shared), with dual constraints of orthogonality and contrastive alignment for disentanglement. The model generates reliable radiology reports even when clinical context is absent, achieving competitive BLEU@4 on IU X-Ray and MIMIC-CXR.
Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes: This paper proposes GODD (Geometric OOD Diffusion Model), which captures distributional structural priors via an equivariant asymmetric autoencoder to guide the generation process of a diffusion model, enabling models trained on data-rich molecular distributions to generalize to data-scarce distributions, achieving a 12.6% improvement in success rate on OOD structural shift benchmarks.
Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Segmentation: To address the two key challenges in federated medical image segmentation — layerwise style bias accumulation and incomplete contextual representation — this paper proposes FedBCS: a framework that constructs domain-invariant prototypes via Frequency-domain adaptive Style Recalibration (FSR) and designs Context-aware Dual-level Prototype Alignment (CDPA) to fuse multi-level semantics from both encoder and decoder. FedBCS achieves state-of-the-art performance on nuclei segmentation and prostate MRI segmentation tasks.
Dual-Path Knowledge-Augmented Contrastive Alignment Network for Spatially Resolved Transcriptomics: This paper proposes DKAN, a Dual-path Knowledge-Augmented contrastive Alignment Network that integrates semantic information from external gene databases as a cross-modal coordinator. Combined with a unified one-stage contrastive learning paradigm and an adaptive weighting mechanism, DKAN predicts spatially resolved gene expression from H&E-stained whole slide images (WSI), achieving state-of-the-art performance across three public ST datasets.
DualFete: Revisiting Teacher-Student Interactions from a Feedback Perspective for Semi-supervised Medical Image Segmentation: A feedback mechanism is introduced into the teacher-student semi-supervised learning framework, enabling the student to feed back to the teacher information on whether pseudo-label-guided updates are consistent with the direction of supervision from labeled data. This feedback dynamic is further enhanced within a dual-teacher architecture, effectively suppressing error accumulation and confirmation bias in medical image segmentation.
DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis: To address three core challenges in early diagnosis of neurodegenerative diseases (PD/AD)—multi-indicator data fusion, heterogeneous information extraction, and class imbalance—this paper proposes DW-DGAT, a dynamically weighted dual graph attention network. By introducing a universal data fusion strategy, micro-macro dual-level graph feature learning, and a dynamic class weight generation mechanism, DW-DGAT substantially outperforms 14 baseline methods on the PPMI and ADNI3 datasets.
Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows: Three complementary chromosome-level genomic parallelization scheduling schemes are proposed — static scheduling (optimizing processing order), dynamic scheduling (knapsack-based batching with online RAM prediction), and a symbolic regression RAM predictor — achieving significant reductions in out-of-memory errors and execution time in both simulated and real precision medicine pipelines.
EgoEMS: A High-Fidelity Multimodal Egocentric Dataset for Cognitive Assistance in Emergency Medical Services: This paper presents the first high-fidelity multi-person multimodal egocentric EMS dataset, comprising 233 trials with 20 hours of video, annotations covering 9 interventions and 67 critical steps, and three benchmark tasks (step classification / online segmentation / CPR quality estimation) to advance the development of cognitive assistance systems for EMS.
Error Correction in Radiology Reports: A Knowledge Distillation-Based Multi-Stage Framework: This paper proposes a staged inference + dual-knowledge infusion framework that decomposes radiology report error correction into three phases—detection → localization → correction—and integrates Medical Knowledge Graph Distillation (MKGD) with External Knowledge Retrieval (EXKR) to achieve up to 31.56% improvement in error detection accuracy and 37.4% reduction in processing time across 6 LLM architectures.
Experience with Single Domain Generalization in Real World Medical Imaging Deployments: This paper proposes the DL+EKE framework, which integrates domain-invariant expert knowledge with deep learning to address rare class single domain generalization (SDG) in medical imaging. The approach significantly outperforms state-of-the-art SDG methods across three real-world deployment scenarios: diabetic retinopathy (DR) grading, resting-state fMRI seizure onset zone (SOZ) localization, and stress ECG-based coronary artery disease (CAD) detection.
Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering: This paper constructs EMSQA, the first multiple-choice QA dataset for the emergency medical services domain (24.3K questions, 10 clinical topics, 4 certification levels), and proposes the Expert-CoT and ExpertRAG frameworks to inject domain expertise into LLM reasoning and retrieval, achieving up to 4.59% accuracy improvement over standard RAG.
FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention: This paper proposes FaNe, a semantics-enhanced medical vision-language pre-training (VLP) framework that addresses the false-negative problem and the insufficiency of coarse-grained alignment in medical VLP through semantics-aware positive mining, text-conditioned sparse attention pooling, and a hard-negative-aware contrastive loss.
FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI: This work presents the first systematic frequency-domain analysis of brain MRI anomalies, demonstrating that lesions are predominantly concentrated in low-frequency components. Based on this finding, the authors propose the Frequency Decomposition Preprocessing (FDP) framework, which reconstructs low-frequency signals via a learnable prior context bank to suppress lesions while preserving anatomical structures. As a plug-and-play module, FDP consistently improves detection performance across multiple UAD baselines (achieving a 17.63% DICE gain on LDM).
Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification: This paper proposes FedMedCLIP, a federated CLIP framework for medical image classification. By freezing the CLIP encoder and combining a masked Feature Adaptation Module (FAM), a local masked MLP, and class-level KL distillation regularization, the framework achieves robust classification under data heterogeneity with minimal communication and computational overhead (surpassing the second-best method by 8% on ISIC2019 and running 120× faster than FedAvg).
FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing: This paper proposes FIA-Edit, an inversion-free text-guided image editing framework based on frequency-interactive attention. It introduces a Frequency Representation Interaction (FRI) module that performs frequency-domain fusion of source/target features within self-attention, and a Feature Injection (FIJ) module that explicitly incorporates source image features into cross-attention. The framework achieves precise semantic editing while maintaining high background fidelity, and for the first time applies a general image editing method to clinical surgical bleeding image augmentation.
Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty: This paper reveals that the root cause of SFT-induced dishonesty in LLMs is impaired self-expression (rather than degraded self-knowledge), and proposes the HCNR framework accordingly. By identifying honesty-critical neurons via Fisher information and restoring them to their pre-trained states with Hessian-guided compensation, HCNR recovers 33.25% of honesty using only 256 data samples and 20% of parameters, achieving over 2.23× speedup.
From Policy to Logic for Efficient and Interpretable Coverage Assessment: This paper proposes a neuro-symbolic approach that combines a coverage-aware retriever with symbolic rule inference based on PyKnow, assisting human reviewers in efficiently and interpretably assessing whether medical CPT codes are covered by insurance policies. The approach reduces inference cost by 44% while improving F1 by 4.5%.
FunKAN: Functional Kolmogorov-Arnold Network for Medical Image Enhancement and Segmentation: This paper generalizes the Kolmogorov-Arnold representation theorem from finite-dimensional scalar spaces to function spaces (Hilbert spaces), proposing the FunKAN framework. By learning inner functions via Fourier expansion over Hermite basis functions, the framework preserves the spatial structure of image data and outperforms existing KAN variants on MRI enhancement and three medical image segmentation tasks.
G2L: From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Efficient Fine-Tuning: This paper proposes G2L (Giga-to-Large), a distillation framework that transfers knowledge from a 1.9B-parameter giga-scale pathology foundation model (H-optimus-0) to a 300M-parameter large-scale model (Hibou-L) using only 1K whole slide images, achieving performance on par with or superior to the teacher model and larger models across multiple cancer-specific downstream tasks.
GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs: GEM proposes a generative entropy-guided preference modeling approach that achieves efficient LLM alignment in low-resource settings (only 3,000 preference pairs) through cognitive filtering (entropy-based CoT scoring) and the SEGA algorithm (Self-Evaluated Group Advantage policy optimization).
GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis: This paper proposes GIIM, a Multi-Heterogeneous Graph (MHG)-based framework that simultaneously models intra-view dependencies among lesions and inter-view dynamic variations via graph structures. Four missing-view representation strategies are introduced. GIIM achieves consistent and significant improvements over existing multi-view methods across three imaging modalities: liver CT, breast X-ray, and breast MRI.
GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance: This paper proposes GP-MoLFormer-Sim, a training-free test-time molecular generation guidance method that leverages the contextual embeddings of a chemical language model (GP-MoLFormer) to estimate similarity to target molecules, dynamically adjusting logits during autoregressive decoding. Combined with a genetic algorithm (GP-MoLFormer-Sim+GA), the method achieves an average rank of 2nd across 23 tasks on the PMO benchmark and outperforms MOLLEO—which relies on GPT-4—under a strict black-box oracle setting.
Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation: This paper proposes TGC (Topology Graph Consistency), a framework that introduces graph-theoretic topological constraints by aligning the Laplacian spectra, connected component counts, and adjacency statistics between prediction graphs and reference graphs. TGC achieves near-fully-supervised histopathology segmentation performance using only 5–10% of labeled data.
GROVER: Graph-guided Representation of Omics and Vision with Expert Regulation for Cancer Survival Prediction: This paper proposes GROVER, a spatial multi-omics framework that captures nonlinear spatial-feature dependencies via a KAN-GCN encoder, aligns heterogeneous modalities through spot-feature-pair contrastive learning, and dynamically routes and filters low-quality signals via a self-adaptive Mixture of Experts (MoE). GROVER achieves superior clustering performance over existing methods on four real-world spatial omics datasets.
GuideGen: A Text-Guided Framework for Paired Full-Torso Anatomy and CT Volume Generation: GuideGen proposes a controllable framework that requires only text input. It synthesizes full-torso anatomical masks via a categorical diffusion model, and combines an anatomy-aware high-dynamic-range autoencoder with a latent feature generator to produce paired full-torso CT volumes, providing high-quality synthetic training data for downstream segmentation tasks.
Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling: This paper proposes Hierarchical Schedule Optimization (HSO), a bilevel optimization framework that pairs an upper-level global search for optimal initialization strategies with a lower-level local refinement of schedules. HSO is training-free and achieves SOTA sampling quality at an extremely low number of function evaluations (NFE), with a one-time optimization cost of only about 8 seconds.
HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology: This paper proposes HiFusion, a framework comprising two complementary modules — Hierarchical Intra-Spot Modeling (HISM) and Context-Aware Cross-Scale Fusion (CCF) — to accurately predict spatial gene expression from H&E-stained whole-slide images, achieving state-of-the-art performance on two benchmark datasets under both 2D cross-validation and 3D sample-specific evaluation settings.
Human-in-the-Loop Interactive Report Generation for Chronic Disease Adherence: This paper presents a "physician-in-the-loop" interactive interface that restricts AI to the roles of data organization and draft generation. Through a single-page editor, chart–text pairing, and automated urgency stratification, it enables efficient and accountable chronic disease adherence report generation. A pilot study reveals an "accountability paradox": even when AI-generated quality matches the physician manual-authoring baseline, review time cannot be significantly reduced, because clinical responsibility demands complete verification.
Intervention Efficiency and Perturbation Validation Framework: Capacity-Aware and Robust Clinical Model Selection under the Rashomon Effect: To address the challenge of model selection under the Rashomon Effect—where multiple models achieve similar performance on small, class-imbalanced clinical datasets—this paper proposes Intervention Efficiency (IE), a capacity-aware evaluation metric, and the Perturbation Validation Framework (PVF), a robustness validation framework, jointly enabling reliable model selection under resource constraints.
Investigating Data Pruning for Pretraining Biological Foundation Models at Scale: This paper proposes a post-hoc data pruning framework based on influence functions, leveraging Subset-Based Self-Influence estimation and two selection strategies (Top-k Influence and Coverage-Centric Influence). Under an extreme pruning rate exceeding 99%, an RNA-FM pretrained on only 0.2M sequences matches or surpasses the full model trained on 23M sequences across multiple downstream tasks, revealing substantial redundancy in biological sequence datasets.
Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling: This paper proposes the CHMR framework, which addresses missing biological modalities via structure-aware propagation, and introduces Tree-VQ to model hierarchical dependencies among molecules, cells, and genes. Evaluated on 728 tasks across 9 benchmarks, CHMR achieves a 3.6% improvement in classification and 17.2% in regression, enabling robust cell-aware molecular representation learning.
Learning with Preserving for Continual Multitask Learning: This paper proposes the Learning with Preserving (LwP) framework, which maintains the geometric structure of the shared representation space via a Dynamically Weighted Distance Preserving (DWDP) loss. Without requiring a replay buffer, LwP addresses catastrophic forgetting in Continual Multitask Learning (CMTL) and significantly outperforms existing continual learning methods on benchmarks including BDD100k, CelebA, and PhysiQ. It is the only method to surpass the single-task learning baseline.
LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules: This paper proposes LungNoduleAgent, the first collaborative multi-agent system for lung nodule analysis. It simulates the clinical workflow through a three-stage pipeline—Nodule Spotter, Simulated Radiologist, and Doctor Agent System—and substantially outperforms mainstream VLMs (GPT-4o, Claude 3.7 Sonnet) and medical agents (MedAgent-Pro) on CT report generation and malignancy grading tasks.
MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss: This paper presents MAISI-v2, the first framework to introduce Rectified Flow into 3D medical image synthesis. By replacing DDPM with Rectified Flow, it achieves a 33× speedup, and a novel region-specific contrastive loss is designed to improve conditioning fidelity for small regions such as tumors. The utility of synthesized data is validated on downstream tumor segmentation tasks.
MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes: This paper proposes MAMAMemeia, a multi-agent multi-aspect collaborative discussion framework grounded in the Cognitive Analytic Therapy (CAT) competency framework, designed to identify depressive symptoms from social media memes. It additionally introduces the RESTOREx resource (containing both LLM-generated and human-annotated rationales), achieving a 7.55% improvement in macro-F1 over 30+ competing methods.
MAPI-GNN: Multi-Activation Plane Interaction Graph Neural Network for Multimodal Medical Diagnosis: This paper proposes MAPI-GNN, which dynamically constructs multiple activation planes in semantic subspaces via a multi-dimensional feature discriminator, then aggregates intra- and inter-sample relationships through a hierarchical fusion network. The method achieves significant improvements over existing SOTA on two multimodal diagnostic tasks—prostate cancer and coronary heart disease (ACC 0.9432, AUC 0.9838 on PI-CAI).
MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration: This paper proposes the MCTSr-Zero framework, which combines MCTS with domain-principle-based self-evaluation and a meta-prompt adaptive exploration mechanism to generate high-quality multi-turn psychological counseling dialogue data. The resulting PsyLLM, fine-tuned on this data, achieves state-of-the-art performance on the authors' PsyEval benchmark.
Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology: This paper systematically evaluates six small open-source medical LLMs (<10B parameters) in pediatric endocrinology, demonstrating that accuracy alone is insufficient to characterize model reliability: semantically neutral prompt variations lead to significant output shifts (Stuart-Maxwell $p < 10^{-4}$), high consistency does not imply correctness, and even differences in CUDA versions can induce statistically significant output distribution changes.
MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis: MedEyes is a hybrid-policy reinforcement learning framework that introduces a Gaze-guided Reasoning Navigator (GRN) to simulate the "scan-and-drill" visual search pattern of clinical physicians. Combined with a Confidence Value Sampler (CVS) and dual-stream GRPO optimization, the framework enables dynamic visual focus for progressive medical diagnostic reasoning, achieving an average improvement of 8.5 pp across five medical VQA benchmarks.
MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging: This paper proposes MergeDNA, which achieves context-aware dynamic DNA tokenization via differentiable Token Merging, combined with a hierarchical autoencoder and adaptive masked token modeling for pretraining. With 380M parameters, it surpasses the 1.3B-parameter GENERator.
MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains: This paper proposes MIRAGE, a framework that extends conventional linear reasoning chains into a parallel multi-chain reasoning paradigm. It combines adaptive retrieval from structured medical knowledge graphs (via neighborhood expansion and multi-hop traversal) with cross-chain verification to resolve contradictions, consistently outperforming GPT-4o, ToT, and Search-o1 on three medical QA benchmarks.
MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging: MIRNet is a framework that integrates self-supervised masked autoencoder (MAE) pre-training with constraint-aware graph attention network (GAT) reasoning for multi-label tongue diagnosis. The paper also introduces TongueAtlas-4K, a benchmark dataset of 4,000 images with 22 labels. MIRNet achieves a 77.8% improvement in Macro Recall and 33.2% in Macro-F1.
MPA: Multimodal Prototype Augmentation for Few-Shot Learning: This paper proposes MPA, a framework that enhances prototype quality through three components: LLM-based Multi-Variant Semantic Enhancement (LMSE) for enriching semantic information, Hierarchical Multi-View Augmentation (HMA) for diversifying visual features, and an Adaptive Uncertain Class Absorber (AUCA) for modeling inter-class uncertainty. MPA achieves significant improvements over existing methods on 4 single-domain and 6 cross-domain few-shot learning benchmarks, surpassing the second-best method by 12.29% and 24.56% under the 5-way 1-shot setting for single-domain and cross-domain scenarios, respectively.
Multivariate Gaussian Representation Learning for Medical Action Evaluation: This paper proposes GaussMedAct, a framework that models joint motion trajectories as multivariate Gaussian mixture distributions combined with a Cartesian-vector dual-stream encoding scheme. It achieves 92.1% Top-1 accuracy on the newly constructed CPREval-6k dataset while requiring only 10% of the computational cost of ST-GCN.
Neural Bandit Based Optimal LLM Selection for a Pipeline of Tasks: This paper proposes Sequential Bandits, an online learning method based on neural contextual multi-armed bandits, for selecting the optimal LLM for each subtask in a task pipeline (e.g., "summarization → diagnosis"). The method jointly optimizes accuracy and cost, and outperforms existing bandit baselines on two pipeline task benchmarks: medical diagnosis prediction and telecommunications QA.
Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes: This paper proposes Note2Chat, a framework that trains LLMs for structured history taking and diagnosis using widely available medical notes rather than scarce dialogue data. Through note-driven dialogue generation, a three-stage fine-tuning strategy, and a single-turn reasoning paradigm, it substantially outperforms GPT-4o in information gathering (F1 +16.9) and diagnostic accuracy (Top-1 +21.0).
NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening: This paper proposes NutriScreener, a framework combining a CLIP visual encoder, a multi-pose graph attention network (GAT), and a FAISS-based retrieval-augmented classification/regression module. Through cross-pose attention and category-enhanced retrieval, the system achieves robust childhood malnutrition detection and anthropometric prediction, attaining 0.79 recall and 0.82 AUC on cross-continental datasets including AnthroVision, with clinician ratings of 4.3/5 for accuracy and 4.6/5 for efficiency.
CountVid: Open-World Object Counting in Videos: This paper proposes the CountVid model and the VideoCount benchmark, presenting the first systematic study of open-world video object counting: given a text or image description specifying target objects, the system enumerates all unique instances in a video. By combining an image counting model with a promptable video segmentation and tracking model, CountVid addresses challenges such as occlusion and re-appearance, achieving substantial improvements over strong baselines across diverse scenarios including TAO, MOT20, penguin colonies, and X-ray metal crystallization.
Pairing-free Group-level Knowledge Distillation for Robust Gastrointestinal Lesion Classification in White-Light Endoscopy: This paper proposes PaGKD, a pairing-free group-level knowledge distillation framework that eliminates the dependency on paired data in conventional NBI→WLI cross-modal distillation. It introduces group-level prototype distillation (GKD-Pro, which extracts modality-invariant semantic prototypes via a shared lesion query Transformer) and group-level dense distillation (GKD-Den, which achieves dense spatial alignment through activation map-guided semantic relation cross-attention). PaGKD improves AUC by 3.3%/1.1%/2.8%/3.2% across four clinical datasets.
PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer Pathology Image Analysis: This paper proposes PanFoMa, a lightweight hybrid neural network that integrates Transformer-based local modeling with Mamba-based global integration for pan-cancer single-cell transcriptomic representation learning. It also introduces PanFoMaBench, a large-scale benchmark dataset covering 33 cancer subtypes and over 3.5 million cells.
Personality-guided Public-Private Domain Disentangled Hypergraph-Former Network for Multimodal Depression Detection: This paper proposes P3HF, a framework that achieves approximately 10% gains in accuracy and F1 on multi-event multimodal depression detection through three innovations: personality-guided feature gating, a temporally-aware Hypergraph-Former architecture, and event-level public-private domain disentanglement.
Personalization of Large Foundation Models for Health Interventions: This paper systematically analyzes four structural tensions in applying large foundation models (LFMs) to personalized health interventions, argues that LFMs cannot replace N-of-1 trials, and proposes a hybrid framework that combines LFM-based hypothesis generation with causal validation via N-of-1 trials.
PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI: This paper proposes PINGS-X, a framework that transfers the explicit representation paradigm of 3D Gaussian Splatting (3DGS) into the domain of physics-informed super-resolution. Through three key innovations—Normalized Gaussian Splatting (NGS), axes-aligned Gaussians, and a Gaussian merging strategy—PINGS-X achieves training speeds an order of magnitude faster than PINNs while maintaining superior super-resolution accuracy on both synthetic CFD and real 4D Flow MRI datasets.
PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation: PriorRG is a two-stage chest X-ray report generation framework that aligns clinical context with spatiotemporal visual features via prior-guided contrastive pre-training, then progressively integrates clinical context, disease progression, and multi-level visual cues through prior-aware coarse-to-fine decoding, achieving a 3.6% improvement in BLEU-4 and a 3.8% improvement in F1 on MIMIC-CXR.
ProPL: Universal Semi-Supervised Ultrasound Image Segmentation via Prompt-Guided Pseudo-Labeling: ProPL is the first framework to achieve universal semi-supervised ultrasound image segmentation, using a shared visual encoder, prompt-guided dual decoders, and uncertainty-driven pseudo-label calibration. With only 1/16 of the data labeled across 5 organs and 8 tasks, it surpasses fully supervised methods by 5.18% mDice.
ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders: This paper proposes ProtSAE, which incorporates semantic annotations and domain ontology knowledge as guidance signals during sparse autoencoder training to address the semantic entanglement problem of conventional SAEs. The method aligns latent features of protein language models with biological concepts (molecular function, biological process, ion binding sites, etc.) with high precision, while maintaining high reconstruction fidelity and supporting concept-level generation steering.
Provably Minimum-Length Conformal Prediction Sets for Ordinal Classification: This paper proposes min-CPS and its regularized variant min-RCPS, model-agnostic conformal prediction methods for ordinal classification. By solving for the minimum-length prediction interval of each sample via a linear-time sliding window algorithm, the methods reduce average prediction set size by 15% while maintaining coverage guarantees, with theoretical optimality guarantees at the instance level.
PulseMind: A Multi-Modal Medical Model for Real-World Clinical Diagnosis: This paper proposes PulseMind, a multimodal medical diagnostic model comprising three core contributions: MediScope, a large-scale multi-turn diagnostic dialogue dataset; PulseMind Benchmark, a multi-dimensional clinical dialogue evaluation benchmark; and CRPO, a comparison-based reinforcement policy optimization method. The system achieves superior performance in real-world clinical diagnostic dialogue scenarios.
Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering: This paper proposes Q-FSRU, a model that transforms medical image and text features into the frequency domain via FFT for multimodal fusion, and incorporates external medical knowledge through a quantum-inspired retrieval-augmented generation (Quantum RAG) mechanism, achieving 90% accuracy and a ROC-AUC of 0.9541 on the VQA-RAD dataset.
qa-FLoRA: Data-free Query-Adaptive Fusion of LoRAs for LLMs: This paper proposes qa-FLoRA, a query-adaptive LoRA fusion method that requires neither training data nor a training process. It dynamically determines fusion weights by computing per-layer KL divergence between each adapter and the base model, achieving significant improvements over static fusion and training-free baselines across nine multilingual composite tasks.
QGShap: Quantum Acceleration for Faithful GNN Explanations: This paper proposes QGShap, a GNN explainability framework that leverages quantum amplitude amplification to accelerate exact Shapley value computation, achieving a quadratic speedup over classical Monte Carlo methods while maintaining exact (non-approximate) computation.
Radiation-Preserving Selective Imaging for Pediatric Hip Dysplasia: A Cross-Modal Approach: This paper proposes an "ultrasound-first, radiation-preserving" cross-modal selective imaging strategy. By combining a self-supervised pretrained frozen encoder, a measurement-faithful lightweight head network, and a conformal-prediction-calibrated one-sided lower bound, the framework provides principled decisions on when ultrasound alone suffices and when additional X-ray imaging is warranted for diagnosing developmental dysplasia of the hip (DDH).
ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Cognition: This paper implements ReCoN-Ipsundrum, an inspectable agent architecture that extends the ReCoN sensorimotor state machine with Humphrey's ipsundrum recurrent persistence loop and an optional affective proxy layer. Through behavioral tests and causal ablation experiments, the paper demonstrates that recurrence supports post-stimulus persistence and that affect coupling supports preference stability, structured scanning, and sustained caution, while emphasizing that behavioral markers alone are insufficient to attribute consciousness.
Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA: This paper proposes AlignVQA, a multi-agent debate framework for VQA confidence calibration: specialist agents generate candidate answers, followed by structured debate (supporting vs. opposing arguments) by generalist agents to refine confidence scores. A differentiable calibration-aware loss, AlignCal, is also introduced to minimize the upper bound of calibration error (UBCE) during training. The approach reduces ECE from 0.375 to 0.098 on VQA-RAD and ScienceQA.
Rethinking Bias in Generative Data Augmentation for Medical AI: a Frequency Recalibration Approach: This paper identifies high-frequency distribution discrepancies between AI-generated and real medical images as the root cause of unreliable generative data augmentation (GDA), and proposes FreRec (Frequency Recalibration), a coarse-to-fine post-processing module comprising Statistical High-frequency Replacement (SHR) and Reconstructive High-frequency Mapping (RHM) to align frequency distributions, consistently improving downstream medical image classification performance as a plug-and-play component.
Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset: This paper is the first to categorize surgical smoke into two distinct types — Diffusion Smoke and Ambient Smoke — and proposes STANet, the first smoke-type-aware laparoscopic video desmoking network comprising three sub-networks: semantic soft segmentation, coarse-to-fine disentanglement, and dual-branch reconstruction. It also introduces STSVD, the first large-scale synthetic video desmoking dataset with smoke-type annotations.
S2Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening: This paper proposes S2Drug, a two-stage contrastive learning framework. Stage 1 performs large-scale protein sequence–ligand contrastive pre-training on ChEMBL with a bilateral data sampling strategy to reduce noise and redundancy. Stage 2 fine-tunes on PDBbind by fusing sequence and 3D structural information via a residue-level gating module and incorporating a binding site prediction auxiliary task. S2Drug substantially outperforms existing methods on the DUD-E and LIT-PCBA virtual screening benchmarks.
Self-supervised Multiplex Consensus Mamba for General Image Fusion: This paper proposes the SMC-Mamba framework, which achieves general image fusion across infrared-visible, medical, multi-focus, and multi-exposure tasks through Modality-Agnostic Feature Enhancement (MAFE), Multiplex Consensus Cross-modal Mamba (MCCM), and Bi-level Self-supervised Contrastive Learning loss (BSCL), comprehensively surpassing state-of-the-art methods.
SEMC: Structure-Enhanced Mixture-of-Experts Contrastive Learning for Ultrasound Standard Plane Recognition: This paper proposes the SEMC framework, which aligns shallow structural cues with deep semantic representations via a Semantic-Structure Fusion Module (SSFM), and performs hierarchical contrastive learning over multi-level features through a Mixture-of-Experts Contrastive Recognition Module (MCRM), thereby enhancing fine-grained discriminability for ultrasound standard plane recognition. A new liver ultrasound dataset, LP2025, is also introduced.
Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks: This paper introduces the Medical Diagnosis Segmentation (MDS) task along with the M3DS dataset, and proposes the Sim4Seg framework, which leverages Region-aware Vision-Language Similarity Masks (RVLS2M) derived from LVLM hidden states to prompt SAM for segmentation while simultaneously generating diagnostic chain-of-thought reasoning. Combined with a test-time scaling strategy, Sim4Seg comprehensively outperforms baselines on both segmentation and diagnosis.
Small but Mighty: Dynamic Wavelet Expert-Guided Fine-Tuning of Large-Scale Models for Optical Remote Sensing Object Segmentation: WEFT is a lightweight fine-tuning paradigm guided by dynamic wavelet experts. It adapts frozen large-scale visual foundation models to optical remote sensing image (ORSI) segmentation with only 4.52% trainable parameters, surpassing 21 state-of-the-art methods on three ORSI datasets.
SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization: This paper proposes Self-Priority Alignment (SPA), a fully unsupervised framework that enforces a strict "trustworthiness before helpfulness" priority ordering via lexicographic optimization. The model self-generates diverse responses, self-evaluates, and self-improves; dual-criterion denoising constructs preference pairs; and an uncertainty-weighted SimPO loss fine-tunes the model, simultaneously improving safety and helpfulness across multiple benchmarks.
SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection: SpaCRD is a transfer-learning-based multimodal deep fusion framework that integrates histology images and spatial transcriptomics (ST) data through a Variational Reconstruction-guided Bidirectional Cross-Attention (VRBCA) fusion network. It achieves state-of-the-art performance in cancer tissue region (CTR) detection across samples, platforms, and batches on 23 paired datasets.
TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning: This paper proposes TAlignDiff, a unified framework that integrates a geometry-constrained point cloud regression network (PRN) with a diffusion-based transformation matrix denoising module (DTMD) under a joint training paradigm. Through a bidirectional feedback mechanism, the framework achieves superior automatic tooth alignment on small-scale clinical datasets compared to existing methods.
Towards Effective and Efficient Context-aware Nucleus Detection in Histopathology Whole Slide Images: This paper proposes an efficient context-aware nucleus detection method that aggregates off-the-shelf features from historically visited sliding windows—rather than additionally cropping large low-field-of-view patches—to provide tissue context, while employing a cross-annotation strategy to mine surrounding unannotated nucleus samples for enhanced contextual adaptability.
Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs: This work reformulates LLM policy violation detection as an out-of-distribution (OOD) detection problem in activation space. A training-free whitening approach is proposed: a whitening transform is fitted on compliant activations, and the Euclidean norm serves as the compliance score. Deployment requires only policy text and a small number of examples. The method achieves 86.0% F1 on DynaBench, outperforming fine-tuned baselines by 9.1 points and LLM-as-Judge by 16 points.
TrinityDNA: A Bio-Inspired Foundational Model for Efficient Long-Sequence DNA Modeling: TrinityDNA is a bio-inspired DNA foundation model integrating three innovations: a Groove Fusion module for capturing major/minor groove structural features, a Gated Reverse Complement mechanism for handling double-strand complementary symmetry, and Sliding Multi-Window Attention for multi-scale long-range dependency modeling. Combined with an Evolutionary Training Strategy (ETS) progressing from prokaryotes to eukaryotes, TrinityDNA achieves an average MCC of 0.708 across 15 GUE benchmark tasks (surpassing NT with 2.5B parameters), leads on both prokaryotic and eukaryotic zero-shot tasks across 19 benchmarks, and introduces a new CDS annotation benchmark for long-sequence inference evaluation.
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment: This paper proposes ARRA (Autoregressive Representation Alignment), a training framework that distills global visual representations from an external vision foundation model into the hidden states of an autoregressive LLM via a hybrid token <HYBNEXT>, significantly improving text-to-image generation quality without any architectural modification.
Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation: This paper proposes MoCo-INR, which for the first time integrates implicit neural representation (INR) into a motion compensation (MoCo) framework. Through an unsupervised approach, it achieves high-quality dynamic reconstruction of cardiac MRI, significantly outperforming existing unsupervised methods at ultra-high acceleration factors (20× Cartesian / 69× Non-Cartesian).
Unsupervised Multi-Parameter Inverse Solving for Reducing Ring Artifacts in 3D X-Ray CBCT: This paper proposes Riner, which formulates CT ring artifact removal (RAR) as a physics-based multi-parameter inverse problem. By jointly learning artifact-free images and detector physical parameters via implicit neural representation (INR), Riner achieves unsupervised 3D CBCT reconstruction that surpasses supervised state-of-the-art methods.
Vascular Anatomy-aware Self-supervised Pre-training for X-ray Angiogram Analysis: This paper proposes VasoMIM, a domain-specific self-supervised pre-training framework for X-ray angiograms. It introduces an anatomy-guided masking strategy that prioritizes vessel regions, an anatomical consistency loss to preserve vascular topology in reconstructed images, and a newly constructed XA-170K pre-training dataset — the largest of its kind. VasoMIM comprehensively outperforms both general-purpose and medical SSL methods (including DINOv3 pre-trained on 1.69 billion images) across 4 downstream tasks and 6 datasets.
Virtual Multiplex Staining for Histological Images Using a Marker-wise Conditioned Diffusion Model: This paper proposes a virtual multiplex staining framework based on a marker-wise conditioned diffusion model. Through a two-stage training procedure (marker-wise conditional diffusion learning followed by pixel-level fine-tuning), it is the first method to generate multiplex immunofluorescence images of up to 18 distinct markers from a single H&E image, achieving state-of-the-art performance on two public datasets, HEMIT and Orion-CRC.
VitalDiagnosis: AI-Driven Ecosystem for 24/7 Vital Monitoring and Chronic Disease Management: This paper proposes VitalDiagnosis, an LLM-driven chronic disease management ecosystem that integrates continuous wearable data with multi-scale LLM reasoning, establishing a dual-track framework comprising interactive anomaly triage and routine adherence monitoring, thereby enabling a paradigm shift from passive surveillance to active engagement within a collaborative patient–clinician workflow.
Voices, Faces, and Feelings: Multi-modal Emotion-Cognition Captioning for Mental Health Understanding: This paper proposes the Emotion-Cognition cooperative Multi-modal Captioning (ECMC) task and framework. A dual-stream BridgeNet extracts emotion and cognition features from video, audio, and text, and a LLaMA decoder generates natural language descriptions. The system provides interpretable emotion-cognition profiles for mental health assessment, substantially improving both diagnostic accuracy and explainability.
WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images: This paper proposes WDT-MD, a framework that addresses three fundamental challenges in fundus image microaneurysm (MA) detection—identity mapping, high false positives, and poor normal-feature reconstruction quality—through noise-encoded image conditioning, pseudo-normal pattern synthesis, and a wavelet diffusion Transformer architecture.
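
The whitening-based score described in the Training-Free Policy Violation Detection entry above (fit a whitening transform on compliant activations, then use the Euclidean norm in the whitened space as the score) can be illustrated with a minimal sketch. The function names `fit_whitener` and `whitened_norm`, the `eps` regularizer, and the use of PCA whitening are assumptions made for illustration; the note does not specify which layer's activations are used or how a decision threshold is chosen.

```python
import numpy as np

def fit_whitener(compliant, eps=1e-6):
    """Fit a mean and a PCA-whitening matrix on compliant activations.

    compliant: array of shape (n_samples, dim). Hypothetical input; the
    choice of model layer is not specified in the note.
    """
    mean = compliant.mean(axis=0)
    cov = np.cov(compliant - mean, rowvar=False)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector (column) by 1/sqrt(eigenvalue); eps avoids
    # division by zero for near-degenerate directions.
    whiten = eigvecs / np.sqrt(np.maximum(eigvals, 0.0) + eps)
    return mean, whiten

def whitened_norm(x, mean, whiten):
    """Euclidean norm of x after centering and whitening.

    Larger values mean x lies farther from the compliant distribution.
    """
    return np.linalg.norm((x - mean) @ whiten, axis=-1)
```

With this construction the score equals the Mahalanobis distance to the compliant activations (up to the `eps` regularization), so an activation that sits far outside the compliant distribution receives a large score.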

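The linear-time sliding window mentioned in the Provably Minimum-Length Conformal Prediction Sets entry above can be illustrated with a two-pointer search for the shortest contiguous run of ordinal labels. This is only a sketch under a simplifying assumption: the set must cover a given amount of predicted probability mass (`target_mass`, a hypothetical parameter that would come from a calibration step not shown here). The paper's actual nonconformity score and the regularized variant are not reproduced.

```python
def shortest_contiguous_set(probs, target_mass):
    """Find the shortest contiguous label window whose mass >= target_mass.

    probs: non-negative class probabilities in ordinal label order.
    Returns (start, end) as inclusive indices, or None if no window
    reaches target_mass. Runs in O(len(probs)) time.
    """
    if target_mass <= 0:
        raise ValueError("target_mass must be positive")
    best = None
    window_mass = 0.0
    left = 0
    for right, p in enumerate(probs):
        window_mass += p
        # Drop labels from the left while the remaining window still
        # covers the target mass.
        while left <= right and window_mass - probs[left] >= target_mass:
            window_mass -= probs[left]
            left += 1
        if window_mass >= target_mass:
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
    return best
```

Because both pointers only move forward, each label is added and removed at most once, which is what keeps the search linear in the number of classes.
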
🧩 Multimodal VLM¶

AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs: Inspired by the dual-mode human cognitive process of verbal abduction and pictorial imagination, this paper proposes AbductiveMLLM, which enhances visual abductive reasoning in MLLMs via two collaborative components — a Reasoner (causal contrastive learning for hypothesis selection) and an Imaginer (diffusion-model-based pictorial reasoning) — achieving state-of-the-art performance on the VAR and YouCookII benchmarks.
Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment: This paper proposes the CDDS algorithm, which decouples embeddings into semantic and modality components via a dual-path UNet, and employs a distribution sampling method to achieve cross-modal semantic alignment indirectly, avoiding distribution distortion caused by directly adjusting embeddings. CDDS surpasses the state of the art by 6.6%–14.2% on Flickr30K and MS-COCO.
anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding: This work constructs the anyECG dataset (covering three tasks: report generation, waveform localization, and multi-ECG comparison) and proposes the anyECG-chat model. Through a dynamic ECG input mechanism supporting variable-length, few-lead, and multi-ECG inputs, and a three-stage curriculum learning strategy, anyECG-chat comprehensively outperforms existing ECG-MLLMs in OOD generalization for report generation, second-level anomalous waveform localization, and multi-ECG comparative analysis.
"Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents: This paper proposes a VLM-based autonomous task completion evaluation framework that judges whether a Computer Use Agent (CUA) has completed a task using only screenshots and task descriptions. Evaluation feedback is passed back to the agent for self-correction, achieving 73% evaluation accuracy and a 27% relative improvement in task success rate on macOS.
AStar: Boosting Multimodal Reasoning with Automated Structured Thinking: This paper proposes AStar, a training-free multimodal reasoning paradigm that constructs a library of high-level "thought card" reasoning templates from 500 seed samples. At inference time, the most suitable templates are adaptively retrieved to guide structured reasoning in MLLMs. A 7B model achieves 53.9% accuracy on MathVerse (surpassing GPT-4o's 50.2%), requiring only 50 minutes of preprocessing and no model training.
BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models: This paper proposes BiPrompt, a bilateral prompt optimization framework that simultaneously mitigates spurious biases on both the visual side (structured attention erasure) and the textual side (balanced prompt normalization) in VLMs such as CLIP at test time, improving OOD robustness without retraining.
BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning: This paper proposes BOFA, a framework that exclusively fine-tunes the existing cross-modal projection layer (bridge-layer) in CLIP. By constraining parameter updates within a low-rank "safe subspace" orthogonal to old-task features via Orthogonal Low-Rank Fusion, and combining this with cross-modal hybrid prototypes, BOFA achieves state-of-the-art exemplar-free class-incremental learning without introducing any additional parameters or inference overhead.
Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models: This paper systematically investigates the application of zeroth-order (ZO) optimization in PEFT-based vision-language continual learning (VLCL). It finds that naively replacing first-order (FO) optimization with ZO causes training instability, and proposes a progressive ZO-FO hybrid strategy ranging from branch-wise to layer-wise granularity. Building on the theoretical finding that visual modality exhibits larger gradient variance, the paper further proposes MoZO (gradient sign normalization + visual perturbation constraint), achieving state-of-the-art performance across four benchmarks.
Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation (BriMPR): This paper proposes BriMPR, a framework that decomposes multimodal test-time adaptation (MMTTA) into multiple unimodal feature alignment subproblems via a divide-and-conquer strategy. It first calibrates the global feature distribution of each modality through prompt tuning to achieve initial cross-modal semantic alignment, then refines the alignment via cross-modal masked embedding recombination and instance-level contrastive learning.
Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?: This paper presents the first systematic evaluation of LVLMs' ability to recognize and respect copyrighted content in multimodal contexts. It constructs a large-scale benchmark of 50,000 multimodal query–content pairs, finds that 11 out of 12 SOTA LVLMs fail to refuse infringing requests even when explicit copyright notices are present, and proposes CopyGuard—a tool-augmented framework that raises the infringement rejection rate from ~3% to ~62%.
Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models: This paper proposes Concept-RuleNet, a three-agent collaborative neurosymbolic reasoning framework that conditions symbol generation and rule construction on visual concepts extracted from training images. It addresses the symbol hallucination and non-representativeness issues of existing methods (e.g., Symbol-LLM) that rely solely on class labels, achieving an average accuracy improvement of ~5% across 5 OOD benchmarks and reducing hallucinated symbols by up to 50%.
Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection: This paper identifies three types of shortcut learning in multimodal sarcasm detection (character label bias, canned laughter label leakage, and sentiment inconsistency shortcuts), reconstructs a shortcut-free benchmark MUStARD++R, and proposes MCIB, a multimodal fusion framework based on the Conditional Information Bottleneck. MCIB achieves effective fusion by compressing redundancy in the primary modality while preserving complementary information from auxiliary modalities.
CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product: This paper proposes CreBench, a multimodal creativity evaluation benchmark covering three dimensions—creative idea → creative process → creative product—with 12 fine-grained metrics. It additionally constructs CreMIT (2.2K samples, 79.2K human annotations, 4.7M instructions) and fine-tunes CreExpert, which significantly outperforms GPT-4V and Gemini-Pro-Vision on creativity evaluation.
Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models: This paper proposes CoEvo, a training-free and annotation-free test-time framework that dynamically updates positive and negative proxy caches through a bidirectional sample-conditioned text/visual proxy co-evolution mechanism. On ImageNet-1K, CoEvo improves AUROC by 1.33% and reduces FPR95 by a relative 45.98% (from 18.92% to 10.22%) over the strongest negative-label baseline, achieving state-of-the-art zero-shot OOD detection.
Cross-Modal Unlearning via Influential Neuron Path Editing in Multimodal Large Language Models: This paper proposes MIP-Editor, which localizes influential neuron paths encoding forget-target knowledge in MLLMs via cross-layer gradient integration (text branch) and Fisher integration (visual branch), then edits these neurons using path-based Representation Misdirection Unlearning (RMisU), achieving up to 87.75% forget rate and 54.26% improvement in general knowledge retention on MLLMU-Bench.
CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models: This paper introduces CrossVid, the first comprehensive benchmark for systematically evaluating the Cross-Video Reasoning (CVR) capabilities of multimodal large language models (MLLMs). CrossVid encompasses 10 tasks across 4 dimensions, 5,331 videos, and 9,015 QA pairs. Experiments reveal that the current best-performing model, Gemini-2.5-Pro, achieves only 50.4% accuracy, far below the human performance of 89.2%.
DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control: This paper proposes DEIG, a framework for fine-grained multi-instance image generation. It distills high-dimensional embeddings from a frozen LLM encoder into compact instance-aware representations via an Instance Detail Extractor (IDE), and employs instance masked attention in a Detail Fusion Module (DFM) to prevent attribute leakage. DEIG substantially outperforms existing methods on generation tasks with complex multi-attribute descriptions (color + material + texture).
Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models: This paper proposes DiVE, a method that constrains the "difference vectors" between pre-trained and fine-tuned model embeddings to be equal across samples, thereby preserving the geometric structure of the embedding space during CLIP fine-tuning. DiVE achieves comprehensive improvements over existing methods across in-distribution (ID), out-of-distribution (OOD), and zero-shot metrics, with an average gain of more than 8 points on zero-shot tasks.
DisCode: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning: This paper proposes DISCODE, a fine-tuning-free test-time adaptive decoder that introduces a Gaussian prior to minimize the ATT loss, enabling LVLM-generated image captioning scores to more robustly align with human judgments. The paper also constructs the MCEval benchmark covering six visual domains.
EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens: This paper proposes EM-KD, a distillation framework that leverages the Hungarian algorithm to address the vision token count imbalance between teacher and student models. By combining Vision Semantic Distillation (VSD) and Vision-Language Affinity Distillation (VLAD), EM-KD transfers knowledge from a vanilla teacher to an efficient student MLLM, achieving an average score of 50.4 across 11 benchmarks at 144 tokens/patch — surpassing LLaVA-NeXT with 576 tokens (49.4) while delivering nearly 2× inference speedup.
Empowering Semantic-Sensitive Underwater Image Enhancement with VLM: This work leverages a VLM to generate spatially-aware semantic guidance maps, and introduces a dual-guidance mechanism comprising cross-attention injection and a semantic alignment loss to endow underwater image enhancement networks with semantic awareness, yielding enhanced results that benefit both human perception and downstream detection/segmentation tasks.
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding: This paper proposes Exo2Ego, a framework that learns a mapping between the exocentric (third-person) and egocentric (first-person) domains to transfer rich exocentric knowledge encoded in MLLMs to egocentric video understanding. Combined with a newly constructed dataset of 1.1M synchronized ego-exo clip-text pairs (Ego-ExoClip) and 600K instruction-tuning samples (EgoIT), Exo2Ego achieves state-of-the-art open-source performance across 8 egocentric video benchmarks.
Explore How to Inject Beneficial Noise in MLLMs: This paper proposes the Multimodal Noise Generator (MuNG), which dynamically generates "beneficial noise" from image-text pairs via a variational inference framework and injects it into the frozen visual features of an MLLM. The approach suppresses task-irrelevant semantics and enhances cross-modal representation alignment, requiring only ~1% additional parameters while outperforming full fine-tuning and PEFT methods such as LoRA.
Exploring LLMs for Scientific Information Extraction using the SciEx Framework: This paper proposes SciEx, a modular and composable scientific information extraction framework that decouples PDF parsing, multimodal retrieval, schema-guided extraction, and cross-document aggregation into independent components. The framework evaluates the extraction capabilities of GPT-4o and Gemini-2.5-Flash on a dataset of 143 papers spanning medicine and environmental science, revealing systematic deficiencies in current LLMs with respect to cross-modal reasoning, numerical precision, and domain generalization.
Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation: This paper proposes UMEG-Net for few-shot Precise Event Spotting (PES). The method constructs a unified multi-entity graph integrating human skeletal keypoints, sports object keypoints, and environmental landmarks, combined with efficient spatiotemporal graph convolution and a parameter-free multi-scale temporal shift module. A multimodal knowledge distillation scheme transfers graph features to an RGB student network. The approach significantly outperforms existing methods across five sports datasets under extremely limited annotation budgets.
Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration: This paper proposes FiCoCo, a training-free three-stage framework (Filter–Correlate–Compress) for MLLM acceleration. It identifies redundant tokens via integrated vision-aware and semantic-aware redundancy metrics and adaptively recycles information from discarded tokens via inter-token correlation. On LLaVA-NeXT, FiCoCo achieves a 14.7× FLOPs reduction while retaining 93.6% of performance, and consistently outperforms FastV, SparseVLM, and other state-of-the-art methods across five MLLM architectures.
FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation: This paper introduces FinMMDocR, a bilingual multimodal reasoning benchmark targeting real-world financial scenarios. It comprises 1,200 expert-annotated numerical reasoning questions spanning 12 implicit financial scenario types, 9 categories of long documents (averaging 50.8 pages), and reasoning chains averaging 11 steps. The strongest MLLM (o4-mini-high) achieves only 58% accuracy, exposing critical deficiencies of existing models in complex financial reasoning.
Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts: This paper systematically investigates the robustness of multimodal LLMs in verifying scientific claims using tables and charts as evidence. By extending SciTabAlign and ChartMimic into a table–chart aligned evaluation benchmark, the authors find that all 12 evaluated multimodal LLMs consistently perform better on table-based evidence than chart-based evidence, while human annotators perform consistently across both formats — revealing a critical weakness in current models' chart comprehension capabilities.
FT-NCFM: An Influence-Aware Data Distillation Framework for Efficient VLA Models: This paper proposes FT-NCFM, a framework that evaluates sample utility via causal attribution (Fact-Tracing) and guides an adversarial NCFM process to synthesize high-information-density coresets. Using only 5% synthetic data, it achieves 85–90% of full-data training performance while reducing training time by over 80%.
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models: This paper proposes GlobalCom², a plug-and-play, training-free token compression framework tailored for high-resolution VLMs with dynamic cropping architectures. It leverages the global thumbnail as a "commander" to guide differentiated compression across local crop regions, retaining more than 90% of the original performance while compressing visual tokens by 90%.
Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting: This paper proposes Graph-of-Mark (GoM), a training-free pixel-level visual prompting method that explicitly encodes inter-object spatial relationships by overlaying a depth-aware scene graph (comprising nodes and directed edges) directly onto input images. It improves the zero-shot spatial reasoning accuracy of multimodal language models by up to 11 percentage points on VQA and grounding tasks.
Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization: This paper identifies three root causes of hallucination in RL-based MLLM training—visual misinterpretation, limited exploration diversity, and sample conflict—and addresses each with Caption Reward, reward-variance-guided sample selection, and NTK-similarity-based InfoNCE regularization, achieving significant hallucination reduction across multiple benchmarks.
Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning: This paper proposes the SECA framework, which leverages the stable semantic priors of the CLIP text branch to guide semantically-aware historical knowledge transfer in the backbone (SG-AKT module), and refines visual prototypes using inter-class semantic relationships derived from text embeddings to build a hybrid classifier (SE-VPR module), achieving state-of-the-art performance on ImageNet-R/A and CIFAR-100.
Harnessing Vision-Language Models for Time Series Anomaly Detection: This paper proposes a two-stage zero-shot time series anomaly detection framework. ViT4TS employs a lightweight ViT to perform multi-scale cross-patch matching on line-chart renderings of time series to localize candidate anomaly intervals, while VLM4TS leverages GPT-4o with global temporal context to validate and refine the detection results. Across 11 benchmarks, the framework's F1-max exceeds the best baseline by 24.6%, with token consumption only 1/36 that of existing LLM-based methods.
HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection: This paper proposes HeadHunt-VAD, which systematically identifies a sparse set of anomaly-sensitive and stable attention heads within a frozen MLLM, bypassing the information loss inherent in text-based outputs. Using a lightweight classifier, it achieves efficient tuning-free video anomaly detection, establishing state-of-the-art performance among tuning-free methods on UCF-Crime and XD-Violence.
Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning: This paper proposes the HUG paradigm, which leverages fine-grained Gaussian probabilistic embeddings and heterogeneous uncertainty estimation—distinguishing query-side multimodal coordination uncertainty from target-side content quality uncertainty—combined with dynamic weighted fusion and uncertainty-guided contrastive learning, achieving state-of-the-art performance on the Fashion-IQ and CIRR benchmarks.
ImageBindDC: Compressing Multi-modal Data with ImageBind-based Condensation: This paper proposes ImageBindDC, the first framework for multimodal data compression in the unified feature space of ImageBind. It replaces the conventional MMD with Characteristic Function Distance (CFD) and introduces a three-level distribution alignment loss covering uni-modal, cross-modal, and joint-modal objectives. On NYU-v2, the method achieves performance comparable to full-data training (97.30%) using only 5 synthetic samples per class, surpassing the previous SOTA by an absolute margin of 8.2% while reducing compression time by 4.6×.
InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration: This paper proposes InEx, a framework that iteratively verifies and corrects MLLM outputs via internal introspective reasoning (TVER-driven uncertainty-aware visual augmentation) and external cross-modal multi-agent collaboration (textual self-reflection + image editing verification + visual self-reflection), achieving an 8.9% improvement on POPE and consistently outperforming OPERA/VCD/ICD across multiple hallucination and general benchmarks.
Information Theoretic Optimal Surveillance for Epidemic Prevalence in Networks: This paper introduces TestPrev, the first epidemic surveillance framework that employs mutual information as an optimization criterion. It selects an optimal subset of nodes in a network to maximize mutual information with the disease prevalence distribution, thereby providing distribution-level insights into outbreak size that traditional methods cannot offer. The paper proves the NP-hardness of this problem, designs a greedy algorithm GreedyMI, and demonstrates its superiority over baselines on both synthetic and real-world networks.
Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment: This paper proposes DSANet, which enhances the discriminability between normal and anomalous features in weakly supervised video anomaly detection (WS-VAD) at two levels: coarse-grained self-guided normal pattern modeling (SG-NM) and fine-grained disentangled contrastive semantic alignment (DCSA). DSANet achieves state-of-the-art performance with 86.95% AP (+1.14%) on XD-Violence and 13.01% fine-grained mAP (+3.39%) on UCF-Crime.
Leveraging Textual Compositional Reasoning for Robust Change Captioning: This paper proposes CORTEX, a framework that introduces VLM-generated compositional reasoning text as explicit cues, combined with an Image-Text Dual Alignment (ITDA) module, to enhance purely visual change captioning methods in understanding structured semantics such as object relationships and spatial configurations.
LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction: LLM-CAS is the first work to formulate real-time LLM hallucination correction as a hierarchical reinforcement learning (HRL) problem. It trains an RL agent to dynamically select optimal neuron perturbation strategies at inference time — the high-level policy selects a functional network category, while the low-level policy selects perturbation type and magnitude. Combined with adaptive masking and causal tracing for precise neuron localization, LLM-CAS achieves a 10.98% improvement on StoryCloze, outperforming static and dynamic baselines such as ITI, CAA, and SADI.
LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit: This paper presents LLMC+, a comprehensive benchmark and plug-and-play toolkit for vision-language model (VLM) compression, supporting 20+ compression algorithms across 5 representative VLM families. It systematically investigates the independent and joint effects of token-level and model-level compression, revealing three key findings.
MacVQA: Adaptive Memory Allocation and Global Noise Filtering for Continual Visual Question Answering: This paper proposes MacVQA, a framework that enhances the robustness of visual features via Global Noise Filtering (GonF) and optimizes cross-task knowledge retention and update via Adaptive Memory Allocation (AMA) based on prototype retrieval and memory decay. MacVQA achieves 43.38% average accuracy (+3.57%) and a 2.32% forgetting rate across 10 continual learning tasks on VQA v2.
MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment: This paper presents the first exploration of incomplete multimodal action quality assessment (AQA), proposing the MCMoE framework. An Adaptive Gated Modality Generator (AGMG) completes missing modalities, while a Mixture of Experts (MoE) module with soft routing dynamically fuses unimodal and cross-modal joint representations within a unified single-stage training paradigm. MCMoE achieves state-of-the-art performance under both complete and incomplete modality settings across three public AQA benchmarks, with only 4.90M parameters.
Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection: This paper proposes a multi-agent vision-language model (MA-VLMs) guided self-training framework combined with a novel PNU loss function, achieving high-quality offensive content detection under low-resource settings (as few as 50 labeled samples), with performance approaching that of large-scale models.
Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models: This paper proposes MFA, a Multi-Faceted Attack framework that systematically exposes security vulnerabilities in VLMs equipped with multi-layered defenses (including commercial models such as GPT-4o and Gemini) through three complementary dimensions: Attention Transfer Attack (ATA) to bypass alignment, adversarial signatures to evade content moderation, and visual encoder attack to overwrite system prompts. The overall attack success rate reaches 58.5%.
O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model: This paper constructs a large-scale sketch-image-instruction triplet dataset, SketchVCL (600K pretraining + 215K fine-tuning samples), and trains O3SLM — the first open-source large vision-language model capable of fluently understanding hand-drawn sketches across four tasks: detection, counting, retrieval, and VQA — substantially outperforming existing LVLMs on all tasks.
OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive: This paper constructs OIDA-QA, a multimodal document question-answering benchmark based on the UCSF-JHU Opioid Industry Documents Archive (OIDA), comprising 400K training documents and 370K multi-hop QA pairs. A domain-specialized LLM system integrating content recitation and a page finder module is developed to effectively handle multi-turn QA and answer page localization over extremely long documents.
OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding: This paper proposes OmniPT, a unified pedestrian tracking framework built upon large vision-language models (LVLMs). Through a four-stage RL→Mid Training→SFT→RL training strategy, OmniPT simultaneously supports conventional MOT, language-referred tracking (RMOT/CRMOT), and semantic understanding (SMOT), achieving state-of-the-art results on multiple benchmarks—most notably a HOTA of 75.04 on BenSMOT, surpassing the previous SOTA by 3.06.
Panda: Test-Time Adaptation with Negative Data Augmentation: This paper proposes Panda, which generates semantics-destroying but corruption-preserving images via patch shuffling as negative data augmentation (NDA), and uses their features to offset original embeddings to suppress corruption-induced prediction bias. Panda is plug-and-play with less than 10% computational overhead and consistently improves various TTA methods.
PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis: This paper proposes PCDF (Pre-Consultation Dialogue Framework), which simulates realistic doctor-patient dialogue through two VLMs in role-play — DocVLM asks questions and PatientVLM answers — to generate image-dialogue-diagnosis triplets for fine-tuning DocVLM. The framework achieves an average F1 improvement of 11.48 percentage points across four medical imaging benchmarks without relying on real clinical dialogue data.
PET2Rep: Towards Vision-Language Model-Driven Automated Radiology Report Generation for Positron Emission Tomography: This paper presents PET2Rep, the first large-scale benchmark dataset dedicated to positron emission tomography (PET) radiology report generation, comprising 565 whole-body PET/CT image-report pairs. It further introduces PET Clinical Efficacy (CE) evaluation metrics and conducts a systematic assessment of 30 state-of-the-art general-purpose and medical-specialized VLMs, revealing that current SOTA VLMs perform poorly on PET report generation and fail to outperform even simple template baselines.
Phantom Menace: Exploring and Enhancing the Robustness of VLA Models Against Physical Sensor Attacks: This paper presents the first systematic study of the security of Vision-Language-Action (VLA) models under physical sensor attacks. It proposes a "Real-Sim-Real" framework to evaluate six camera attacks and two microphone attacks against four VLA models, reveals critical vulnerabilities across all evaluated models, and introduces an adversarial training defense that improves performance under moderate-strength attacks by up to 60%.
Pharos-ESG: A Framework for Multimodal Parsing, Contextual Narration, and Hierarchical Labeling of ESG Reports: This paper proposes Pharos-ESG, a unified framework for structured parsing of ESG reports via four core modules: layout-flow-based reading order modeling, table-of-contents (ToC) anchor-guided hierarchical reconstruction, context-aware multimodal image-to-text conversion, and multi-level financial label prediction. The framework achieves an F1 of 93.59, ROKT of 0.92, and TBTA of 92.46% in comprehensive evaluation, substantially outperforming baselines such as MinerU, GPT-4o, and Gemini 2.5 Pro. The authors also release Aurora-ESG, the first large-scale public ESG report dataset comprising over 24K reports.
PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data: This paper proposes PlantTraitNet, a multimodal, multi-task, uncertainty-aware deep learning framework that leverages weakly supervised plant photographs from citizen science platforms (iNaturalist, Pl@ntNet) in combination with image features (DINOv2), depth priors (Depth-Anything-V2), and geospatial priors (Climplicit) to simultaneously predict four key plant functional traits (plant height, leaf area, specific leaf area, and leaf nitrogen content). The resulting global trait maps consistently outperform existing global trait products on benchmarks against sPlotOpen vegetation survey data.
Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation: This paper proposes the Plug-and-Play Clarifier, a zero-shot, modular multimodal framework that decomposes egocentric intent disambiguation into three sub-tasks: textual clarification, visual quality assessment, and cross-modal gesture grounding. The framework improves performance of small (4–8B) models by approximately 30% on intent disambiguation benchmarks, approaching or surpassing the performance of much larger models.
ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models: This paper proposes the ReCAD framework, which rewrites CAD scripts as parametric code for SFT, then applies GRPO-based reinforcement learning with a hierarchical primitive curriculum learning strategy, enabling VLMs to generate high-precision, editable parametric CAD models from text or image inputs. The approach substantially outperforms existing methods in both in-distribution and out-of-distribution settings.
Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies: This paper proposes T-DRS (Three-step Decay Resilience Strategies), a training-free inference-time framework that mitigates RoPE-induced long-range attention decay through three cooperative stages: semantics-driven enhancement, distance-aware control, and remote-distance re-reinforcement, achieving consistent performance gains across multiple LVLMs on VQA benchmarks.
Rethinking Visual Token Reduction in LVLMs under Cross-Modal Misalignment: This paper identifies three forms of cross-modal misalignment (causal, semantic, and spatial) in text-guided visual token importance estimation within LVLMs, and proposes VisionDrop—a training-free progressive token pruning framework that relies exclusively on visual self-attention. The framework performs multi-stage compression across both the visual encoder and LLM decoder, retaining over 91% of original performance while keeping only 5.6% of tokens.
Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View: This paper proposes two multimodal data difficulty assessment strategies—PISM (Progressive Image Semantic Masking) and CMAB (Cross-Modality Attention Balance)—and demonstrates that training exclusively with GRPO on difficulty-stratified data consistently outperforms the conventional SFT+GRPO pipeline, establishing that strategic data selection is more consequential than complex training paradigms.
RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models (Oral): This paper proposes RMAdapter, a dual-branch adapter architecture that augments the standard adaptation branch with a reconstruction branch (analogous to an AutoEncoder). By sharing the down-projection layer and applying per-layer local reconstruction losses, RMAdapter achieves an optimal balance between task-specific adaptation and general knowledge retention in few-shot CLIP fine-tuning, outperforming state-of-the-art methods (including prompt-based approaches) across three benchmarks: Base-to-Novel generalization, cross-dataset transfer, and domain generalization.
SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge: This paper proposes SafeR-CLIP, a framework that improves upon Safe-CLIP by introducing proximity-based alignment (redirecting unsafe embeddings to their semantically nearest safe targets rather than fixed pairs) and a relative cross-modal redirection loss (using only unsafe representations as negatives rather than random in-batch negatives). SafeR-CLIP improves zero-shot classification accuracy by 8.0% over Safe-CLIP while maintaining stronger safety guarantees.
SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias: This paper proposes SAGE, a training-free prompt selection method that requires no fine-tuning or external annotations. By computing inter-class separation scores for prompt templates, SAGE mitigates multimodal spurious bias in CLIP models, consistently improving Worst Group Accuracy (WGA) and Harmonic Mean (HM) across four benchmarks and five backbone models.
SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension: This paper proposes SatireDecoder, a training-free framework that enhances deep semantic understanding of satirical images in MLLMs via multi-agent visual cascaded decoupling and uncertainty-guided CoT reasoning. On the YesBut dataset, it achieves improvements of 10%–40% across correctness, completeness, and faithfulness.
SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models: This paper proposes SDEval, the first safety dynamic evaluation framework for MLLMs. By applying text dynamics (6 strategies), image dynamics (2 categories), and cross-modal dynamics (4 strategies), SDEval generates variant samples of controllable complexity from existing safety benchmarks. On MLLMGuard and VLSBench, it reduces the safety rate of InternVL-3-78B by nearly 10%, effectively mitigating data leakage and exposing model safety vulnerabilities.
See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay: This paper systematically evaluates the effect of symbolic spatial representations (object coordinates) on VLM gameplay, finding that symbolic information is beneficial only when detection is accurate; when VLMs self-extract symbols, effectiveness depends on model capability and scene complexity, while visual frames remain indispensable throughout.
Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models: This paper systematically compares traditional OCR+machine translation (OCR-MT) pipelines against vision large language models (vLLMs) on the task of translating handwritten Marathi legal documents into English. The study finds that neither approach meets legal-grade deployment requirements: OCR-MT suffers severely from cascading errors, while vLLMs exhibit critical hallucination issues. Nevertheless, vLLMs demonstrate potential for unified end-to-end processing.
SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models: SpeakerLM is the first multimodal large language model designed specifically for end-to-end Speaker Diarization and Recognition (SDR). Through an audio encoder–projector–LLM architecture and a flexible speaker enrollment mechanism, it significantly outperforms cascaded baseline systems on multiple public benchmarks (absolute cpCER reduction up to 13.82%) and demonstrates strong robustness on out-of-domain test sets.
SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios: SToLa proposes the first Mixture-of-Experts (MoE)-based touch-language framework, which employs a dynamic routing mechanism to manage the modality gap between tactile and linguistic inputs. The work also introduces TactileBench, an open-ended tactile commonsense reasoning dataset covering 8 physical properties and 4 interaction characteristics. With only 7B parameters, SToLa achieves state-of-the-art performance on the PhysiCLeAR benchmark, surpassing the 13B Octopi model.
TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing: TabFlash introduces two core techniques — Progressive Question Conditioning and Token Focusing — to inject question information into the ViT for generating question-aware visual features, prune background tokens via L2 norm, and concentrate critical information into retained tokens through contrastive training. On 7 table understanding benchmarks, TabFlash surpasses GPT-4o and Gemini 2.5 Pro while reducing FLOPs by 27% and GPU memory by 30%.
The Triangle of Similarity: A Multi-Faceted Framework for Comparing Neural Network Representations: This paper proposes the Triangle of Similarity framework, which integrates three complementary perspectives — static representational similarity (CKA/Procrustes), functional similarity (linear mode connectivity/predictive distribution similarity), and sparsity similarity (pruning robustness) — to comprehensively compare neural networks. Key findings include that architectural family is the primary determinant of representational similarity, and that a model's representational structure is more robust to pruning than its task accuracy.
TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks: TinyChemVL is a chemistry-domain VLM with only 4B parameters. It compresses visual tokens to 1/16 of the original count via an adaptive token merging and pruning strategy, introduces reaction-level tasks and the ChemRxn-V benchmark, and achieves state-of-the-art performance on both molecular- and reaction-level visual chemistry tasks while significantly improving inference and training speed.
TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models: TOFA is a federated learning framework that adapts CLIP via hierarchical Bayesian inference of personalized visual prototype distributions, globally aligned LLM-based text augmentation, and adaptive modality fusion — achieving training-free, single-round communication adaptation that outperforms one-shot baselines and even some multi-round training methods across 9 datasets.
Towards Human-AI Accessibility Mapping in India: VLM-Guided Annotations and POI-Centric Analysis in Chandigarh: This paper adapts the Project Sidewalk accessibility annotation platform to Chandigarh, India, through customized interface labels, VLM-driven task guidance (Gemini 2.5 Flash), and a POI-centric analysis framework. Approximately 40 km of sidewalks are audited across three regions of distinct land use, identifying 1,644 locations where accessibility improvements can be made.
Towards Long-window Anchoring in Vision-Language Model Distillation: LAid (Long-window Anchoring distillation) proposes a position-aware knowledge distillation framework that extends the effective context window of small VLMs (3B/7B) to 3.2× their original size—approaching the level of a large teacher model (32B)—through head-level Fourier-enhanced positional knowledge transfer, while preserving performance on standard VL benchmarks.
Towards Scalable Web Accessibility Audit with MLLMs as Copilots: This paper proposes the AAA framework, which operationalizes the WCAG-EM standard through two key innovations—GRASP (Graph-based multimodal page sampling) and MaC (MLLM as Copilot)—enabling scalable end-to-end web accessibility auditing.
CAMU: Context Augmentation for Meme Understanding: This paper proposes the CAMU framework, which achieves 0.807 accuracy and 0.806 F1 on the Hateful Memes dataset through visually grounded context caption generation, a novel caption scoring network, and parameter-efficient n-layer fine-tuning of the CLIP text encoder—matching the 55B-parameter SOTA while being substantially more efficient.
Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference: Tri-Bench is a compact benchmark comprising 400 real-world photographs of triangles. By systematically controlling two factors — camera pose (planar vs. tilted) and object interference — it evaluates the spatial geometric reasoning capabilities of four leading VLMs. The results reveal that models default to 2D image-plane cues rather than genuine 3D geometry, even when explicit reference-frame guardrails are provided in the prompt, with accuracy on minority-class shapes dropping to near 0%.
TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models: TTF-VLA proposes a training-free temporal token fusion method that selectively reuses visual tokens from historical frames via a dual-dimension mechanism combining grayscale pixel difference and attention-based semantic detection, improving inference quality of VLA models on robotic manipulation tasks with an average gain of 4.0 percentage points on LIBERO.
UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment: This paper proposes UniFit, a universal virtual try-on framework driven by a multimodal large language model (MLLM). An MLLM-Guided Semantic Alignment (MGSA) module bridges the semantic gap between textual instructions and reference images. A two-stage progressive training strategy combined with a self-synthesis pipeline overcomes data scarcity in complex scenarios. UniFit is the first single framework to support all 6 VTON tasks.
URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding: URaG identifies a human-like "coarse-to-fine" reasoning pattern in MLLMs processing long documents: shallow layers exhibit uniformly distributed attention while deep layers concentrate on evidence pages. Motivated by this insight, a lightweight cross-modal retrieval module (only 0.05% of total parameters) is inserted at layer 6 to select the Top-5 relevant pages and discard the remainder, achieving SOTA performance while reducing computation by 44–56%.
Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models: This paper presents the first systematic study of verb concept hallucinations in multimodal large language models (MLLMs), constructs a multi-dimensional benchmark, demonstrates that existing hallucination mitigation methods are ineffective against verb hallucinations, and proposes a fine-tuning baseline enriched with verb knowledge that significantly alleviates verb hallucinations.
VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness: VILTA embeds a VLM (Gemini-2.5-Flash) directly into the RL training loop for autonomous driving. Via a Vision-Language-Editing (VLE) paradigm, the VLM edits the future trajectories of surrounding vehicles to generate challenging hazardous scenarios. The resulting driving policy achieves a 13.3% improvement in route completion rate and a 28.5% reduction in collision rate on CARLA challenging scenarios.
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use: VipAct proposes a multi-agent collaboration framework that significantly improves VLM performance on fine-grained visual perception tasks through three-tier collaboration: an Orchestrator Agent (task analysis, planning, and coordination), specialized agents (captioning, comparison, and visual prompt interpretation), and vision expert models (depth estimation, object detection, segmentation, etc.). The framework improves accuracy on Blink from 63.74% (zero-shot GPT-4o) to 73.79%.
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction: This paper proposes VIR-Bench — a benchmark based on 200 Japanese travel vlog videos that evaluates MLLMs' geospatial and temporal understanding via an itinerary reconstruction task (visiting order graph construction). Findings reveal that SOTA models (including GPT-4.1 and Gemini-2.5) still struggle significantly with POI recognition and temporal transition reasoning.
vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs: This paper proposes the vMFCoOp framework, which reconciles the semantic discrepancy between LLMs and CLIP on a unified hyperspherical manifold via inverse estimation of von Mises-Fisher distributions, enabling robust few-shot prompt learning for biomedical VLMs.
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models: VP-Bench introduces the first systematic two-stage benchmark for evaluating MLLMs' understanding of visual prompts (VPs): Stage 1 covers 30K+ images across 8 VP shape types × 355 attribute combinations to assess VP perception ability, while Stage 2 evaluates the practical effectiveness of VPs on 6 downstream tasks. Experiments on 28 MLLMs reveal the critical impact of VP shape selection on model performance.
When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?: This paper identifies a critical phenomenon termed "audio-visual confusion" in MLLMs, wherein models are heavily dominated by visual information and fail to recognize missing audio when audio-visual inputs are asymmetric. The authors propose the AV-ConfuseBench benchmark and the RL-CoMM method — combining a stepwise reasoning reward that incorporates an external audio model as reference with answer-centered confidence optimization — achieving 10–30% accuracy improvements over baselines using only approximately 20% of the training data.
Yes FLoReNce, I Will Do Better Next Time! Agentic Feedback Reasoning for Humorous Meme Detection: This paper proposes FLoReNce, a framework that models humorous meme understanding as a closed-loop control system. Through a feedback loop consisting of a Judge agent, a PID controller, and a non-parametric knowledge base, the system retrieves similar past experiences at inference time to modulate prompts, enabling a frozen VLM to perform adaptive reasoning without fine-tuning, substantially improving both prediction accuracy and explanation quality.
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation: This paper proposes VAR-LIDE, a fully unsupervised visual autoregressive framework that jointly addresses low-light enhancement and deblurring through three modules guided by VLM perceptual priors: adaptive illumination modulation, spatial-frequency RoPE, and recursive phase-domain modulation. Without paired training data, the method achieves perceptual quality comparable to or exceeding supervised approaches.

🎨 Image Generation

AEDR: Training-Free AI-Generated Image Attribution via Autoencoder Double-Reconstruction: This paper proposes a training-free image attribution method based on the ratio of autoencoder double-reconstruction losses. By incorporating image uniformity calibration to eliminate texture complexity bias, the method achieves an average accuracy of 95.1% across 8 mainstream diffusion models, surpassing the strongest baseline by 24.7%, while being approximately 100× faster.
Aggregating Diverse Cue Experts for AI-Generated Image Detection: This paper proposes the Multi-Cue Aggregation Network (MCAN), which unifies three complementary cues — raw image, high-frequency representation, and a newly introduced Chromaticity Inconsistency (CI) — through a Mixture-of-Encoder Adapter (MoEA), enabling robust AI-generated image detection that generalizes across diverse generative models.
Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation: This paper proposes Cool-SD, a theoretically grounded annealed relaxation framework for speculative decoding. By deriving a tight upper bound on the TV distance, it obtains the optimal resampling distribution and proves that a decreasing acceptance probability schedule yields smaller distributional shift than a uniform schedule. Cool-SD achieves a superior speed–quality trade-off over LANTERN++ on LlamaGen and Lumina-mGPT.
AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer: This work formulates zero-shot anomaly generation as a text-guided localized style transfer problem. A lightweight U-Net trained with CLIP-based losses stylizes masked regions of normal images into semantically aligned anomalous images. With only 263M total parameters (0.61M trainable), AnoStyler surpasses diffusion-based baselines on MVTec-AD and VisA while significantly improving downstream anomaly detection performance.
Backdoors in Conditional Diffusion: Threats to Responsible Synthetic Data Pipelines: This paper exposes a backdoor vulnerability in the ControlNet conditional branch: injecting as little as 1–5% poisoned data suffices to implant a backdoor without modifying the diffusion backbone. Upon trigger activation, the model ignores text prompts and generates attacker-specified content. Clean fine-tuning (CFT) is proposed as a practical defense.
Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images: This paper identifies a novel threat of NSFW text embedded in diffusion-model-generated images, proposes NSFW-Intervention — a targeted LoRA fine-tuning method applied to text-rendering layers — and releases the ToxicBench benchmark.
Beyond Semantic Features: Pixel-Level Mapping for Generalized AI-Generated Image Detection: This paper proposes a pixel-level mapping preprocessing method that suppresses low-frequency semantic bias and enhances high-frequency generation artifacts by breaking the monotonic ordering of pixel values, achieving a cross-model generalization accuracy of 98.4% in AI-generated image detection.
Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra: This paper proposes GLMR, a two-stage framework (contrastive pre-retrieval + generative language model re-ranking) that transforms cross-modal retrieval into unimodal retrieval by generating molecular structures aligned with input mass spectra, achieving over 40% improvement in Recall@1 on MassSpecGym.
CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement: CAD-VAE introduces a correlation-aware latent code to capture shared information between target and sensitive attributes, achieves disentanglement by directly minimizing conditional mutual information, and employs a relevance-driven optimization strategy to precisely regulate the shared code, attaining state-of-the-art performance on fair representation learning, counterfactual generation, and fair image editing.
CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images: CausalCLIP disentangles CLIP features into causal and non-causal subspaces via Gumbel-Softmax masking and HSIC constraints, and combines adversarial masking with counterfactual intervention to preserve stable forensic cues, achieving a 6.83% accuracy improvement in cross-generator generalization.
Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition: This paper proposes CD3T, a two-level hierarchical MARL framework that employs a conditional diffusion model to learn action semantic representations $z_a^i$ (conditioned on observations and other agents' actions to predict next observations and rewards), obtains subtask partitions via k-means clustering, and uses a high-level subtask selector combined with a low-level policy operating over a restricted action space. CD3T significantly outperforms all baselines on Super Hard scenarios in SMAC.
Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes: This paper proposes Constrained Particle Seeking (CPS), a gradient-free method for solving diffusion model inverse problems. CPS constructs a locally linear surrogate of the forward process using all candidate particles, then seeks the optimal particle under a hyperspherical constraint within the high-density region of the transition kernel, achieving performance competitive with gradient-based methods.
Continuous Degradation Modeling via Latent Flow Matching for Real-World Super-Resolution: DegFlow learns continuous degradation trajectories from discrete-scale real HR-LR pairs via a residual autoencoder and latent-space Flow Matching. Given only a single HR image at inference, the model synthesizes realistic LR images at arbitrary continuous scales for training super-resolution models, achieving state-of-the-art performance.
Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy: This paper formalizes copyright infringement from the perspective of Differential Privacy (DP), and proposes the D-Plus-Minus (DPM) framework. By fine-tuning diffusion models in two opposing directions—"learning" and "unlearning"—DPM measures conditional sensitivity differences to perform post-hoc detection of copyright infringement in text-to-image models.
CountSteer: Steering Attention for Object Counting in Diffusion Models: This paper proposes CountSteer, a training-free inference-time method that injects adaptive steering vectors into the cross-attention hidden states of diffusion models, improving object counting accuracy by approximately 4% without degrading image quality.
Creating Blank Canvas Against AI-Enabled Image Forgery: This paper proposes a "blank canvas" mechanism that applies adversarial perturbations to make SAM "see nothing" in protected images. When a protected image is tampered with, the tampered regions disrupt the perturbations and become automatically detectable by SAM, enabling proactive tampering localization without requiring any tampered training data.
DICE: Distilling Classifier-Free Guidance into Text Embeddings: This paper proposes DICE, which trains a lightweight sharpener with only 2M parameters to distill the guidance effect of CFG into text embeddings, enabling guidance-free sampling to achieve generation quality on par with CFG while halving inference computation. The method is comprehensively validated across multiple variants of SD1.5, SDXL, and PixArt-α, and was accepted as an AAAI 2026 Oral presentation.
Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation: This paper proposes Diff-V2M, a hierarchical conditional diffusion Transformer framework for video-to-music generation that integrates affective, semantic, and rhythmic features via explicit rhythmic modeling (low-resolution ODF) and a hierarchical cross-attention mechanism, achieving state-of-the-art performance on both in-domain and out-of-domain datasets.
DiffA: Large Language Diffusion Models Can Listen and Understand: This paper proposes DIFFA — the first large audio-language model built upon a diffusion language model — which combines a frozen LLaDA-8B backbone with a lightweight dual-adapter architecture and a two-stage training pipeline. Using only 960 hours of ASR data and 127 hours of synthetic instruction data, DIFFA achieves competitive performance against autoregressive baselines on MMSU, MMAU, and VoiceBench.
Difficulty Controlled Diffusion Model for Synthesizing Effective Training Data: A difficulty encoder (MLP taking class label and difficulty score as input) is incorporated into Stable Diffusion, with LoRA fine-tuning used to decouple the objectives of "domain alignment" and "difficulty control," enabling controllable learning difficulty in synthesized data. Using only 10% additional synthetic data, the proposed method surpasses the best results of Real-Fake while saving 63.4 GPU hours.
Diffusion Reconstruction-Based Data Likelihood Estimation for Core-Set Selection: This paper proposes using the partial reverse denoising reconstruction bias of diffusion models as a theoretically grounded approximation of data likelihood, combined with information bottleneck theory for optimal reconstruction timestep selection, enabling distribution-aware core-set selection that achieves near-full-dataset training performance on ImageNet with only 50% of the data.
DogFit: Domain-guided Fine-tuning for Efficient Transfer Learning of Diffusion Models: This paper proposes DogFit, which internalizes Domain Guidance (DoG) into the fine-tuning loss of diffusion models, enabling the model to learn the guidance direction during training. At inference time, a controllable fidelity–diversity trade-off is achieved without double forward passes, surpassing state-of-the-art guidance methods on 6 target domains with half the sampling TFLOPs.
DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation: This paper identifies four failure scenarios in multi-object generation (similar shapes, similar textures, dissimilar background biases, many objects) and constructs directional separation vectors to modify three types of CLIP text embeddings (semantic token / EOT / pooled). On SDXL, it achieves a 16–25% improvement in success rate and a 3–12% reduction in mixing rate, with inference speed close to baseline (~4× faster than Attend-and-Excite).
EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding: This paper proposes EchoGen, a unified framework for layout-to-image generation (L2I) and image-to-layout grounding (I2L), trained through a progressive three-stage pipeline — parallel pre-training → dual-task joint optimization → cycle reinforcement learning (CycleRL) — which leverages the layout→image→layout cycle consistency as a self-supervised reward, achieving state-of-the-art results on MS-COCO and LayoutSAM.
EfficientFlow: Efficient Equivariant Flow Policy Learning for Embodied AI: This paper proposes EfficientFlow, which incorporates equivariance into the Flow Matching policy learning framework. It theoretically proves that an isotropic prior combined with an equivariant velocity network guarantees an equivariant action distribution, and introduces Flow Acceleration Upper Bound (FABO) regularization to accelerate sampling. On 12 tasks from MimicGen, EfficientFlow achieves 20–56× faster inference than EquiDiff with superior performance.
Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective: This paper proposes RetSimd, which "replays the whole story" by segmenting text and generating a series of supplementary images via a text-to-image model, combined with a graph neural network to fuse multi-image relationships. The approach significantly enhances the contribution of the image modality to misinformation detection, consistently improving the performance of five SOTA methods across three benchmark datasets.
Exposing DeepFakes via Hyperspectral Domain Mapping: This paper proposes HSI-Detect, a two-stage deepfake detection framework that first reconstructs RGB images into 31-channel hyperspectral images to amplify spectral artifacts introduced by generative models, then performs detection in the hyperspectral domain, achieving a mean AUC of 68.92% on FaceForensics++ cross-manipulation generalization benchmarks and surpassing RGB-only baselines.
FGM-HD: Boosting Generation Diversity of Fractal Generative Models through Hausdorff Dimension Induction: This paper is the first to introduce Hausdorff Dimension (HD) into Fractal Generative Models (FGM), proposing a learnable HD estimation module, a Monotonic Momentum-Driven Scheduling strategy (MMDS), and HD-guided rejection sampling. The method achieves a 39% improvement in generation diversity (Recall) on ImageNet while maintaining image quality.
Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment: This paper proposes R-REPA (Reverse Representation Alignment), which creatively exploits the invertibility of Normalizing Flows to align intermediate features with visual foundation models along the generative (reverse) path. It further introduces a training-free classification algorithm, achieving new state-of-the-art results for normalizing flows on ImageNet 64×64 and 256×256 with a 3.3× training speedup.
FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting: This paper proposes FreeInpaint, a plug-and-play, training-free method that optimizes the initial noise to steer attention toward the inpainting region (PriNo), and during denoising decomposes the conditional distribution into three guidance terms — text alignment, visual rationality, and human preference (DeGu) — simultaneously improving prompt alignment and visual rationality in image inpainting.
GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution: This paper proposes GEWDiff, a geometric enhanced wavelet-based diffusion model that efficiently compresses hyperspectral data into a latent space via a wavelet encoder-decoder, introduces edge-aware noise scheduling and mask-conditional control to preserve geometric integrity, and designs a multi-level loss function to facilitate stable convergence, achieving state-of-the-art performance on 4× hyperspectral image super-resolution.
HACK: Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling: This paper identifies that attention heads in VAR models naturally fall into two categories — Contextual Heads (semantic consistency, vertical attention patterns) and Structural Heads (spatial coherence, multi-diagonal patterns) — and proposes the HACK framework, which employs asymmetric budget allocation and pattern-specific compression strategies to achieve lossless generation quality at 70% compression, yielding 1.75× memory reduction and 1.57× speedup on Infinity-8B.
HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models: This paper proposes HierarchicalPrune, which exploits the hierarchical functional differences among blocks in MMDiT-based diffusion models—early blocks establish semantic structure while late blocks refine texture details—and combines three techniques: Hierarchical Position Pruning (HPP), Positional Weight Preservation (PWP), and Sensitivity-Guided Distillation (SGDistill), together with INT4 quantisation. Applied to SD3.5 Large Turbo (8B), the method compresses the model from 15.8 GB to 3.24 GB (79.5% memory reduction) with only a 4.8% degradation in image quality.
How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions: This work is the first to investigate compositional semantic binding bias in text-to-image generation. It proposes the Bias Adherence Score (BA-Score) to quantify how object–attribute binding activates bias, and introduces a training-free Context-Bias Control (CBC) framework that achieves over 10% debiasing improvement in compositional generation via token embedding decoupling and residual injection.
Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval: This paper proposes H2ARN, which embeds text and 3D point cloud data in the Lorentz hyperbolic space. It addresses hierarchical representation collapse via a hierarchical ordering loss (entailment cones), and mitigates redundancy-induced saliency dilution via contribution-aware hyperbolic aggregation. The method achieves state-of-the-art performance on Text-3D retrieval and introduces the T3DR-HIT v2 dataset, which is 2.6× larger than its predecessor.
Improved Masked Image Generation with Knowledge-Augmented Token Representations: This paper proposes KA-MIG, a framework that mines three types of token-level semantic prior knowledge graphs from training data (co-occurrence graph, semantic similarity graph, and position-token incompatibility graph), learns augmented token representations via a graph-aware encoder, and injects them into existing MIG models through a lightweight addition-subtraction fusion mechanism, consistently improving generation quality across multiple backbone networks.
Infinite-Story: A Training-Free Consistent Text-to-Image Generation: Built upon a scale-wise autoregressive model (Infinity), this work introduces three training-free techniques—Identity Prompt Replacement (eliminating contextual bias in the text encoder), Adaptive Style Injection (reference image feature injection), and Synchronized Guidance Adaptation (synchronizing both branches of CFG)—to achieve identity- and style-consistent multi-image generation at 6× the speed of diffusion-based methods (1.72 s/image).
Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers: Laytrol achieves high-quality layout-to-image generation on FLUX by initializing the layout control network via parameter copying from MM-DiT, adopting a dedicated initialization scheme (layout encoder initialized as a pure text encoder with zero-initialized outputs), and constructing the LaySyn dataset using FLUX-generated images to mitigate distribution shift.
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs: This paper presents the first systematic study of long-context capabilities in diffusion large language models (diffusion LLMs), revealing stable perplexity under direct extrapolation and a "local awareness" phenomenon. It further proposes LongLLaDA, a training-free method that successfully extends the context window by 6× (to 24k tokens) via NTK-based RoPE extrapolation.
LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations: This paper proposes LongT2IBench, the first evaluation benchmark targeting long-text-to-image (T2I) alignment, comprising 14K long-text–image pairs with graph-structured human annotations. It further introduces LongT2IExpert, an evaluator built by fine-tuning an MLLM via Hierarchical Alignment Chain-of-Thought (HA-CoT) instruction tuning, which jointly produces alignment scores and structured explanations.
MacPrompt: Maraconic-guided Jailbreak against Text-to-Image Models: This paper proposes MacPrompt, a black-box cross-lingual attack method that translates harmful words into multi-language candidates and performs character-level recombination to construct "macaronic words" as adversarial prompts. The method simultaneously bypasses text safety filters and concept removal defenses, achieving attack success rates of up to 92% on sexual content and 90% on violent content.
MACS: Multi-source Audio-to-Image Generation with Contextual Significance and Semantic Alignment: This paper proposes MACS, the first two-stage framework that explicitly separates multi-source audio prior to image generation. The framework combines weakly supervised sound source separation, CLAP-space semantic alignment (via ranking loss and contrastive loss), and decoupled cross-attention diffusion generation, achieving comprehensive state-of-the-art performance on multi-source, mixed-source, and single-source audio-to-image generation tasks.
Mass Concept Erasure in Diffusion Models with Concept Hierarchy: This paper proposes a grouped erasure strategy based on supertype-subtype concept hierarchy and Supertype-Preserving LoRA (SuPLoRA). By freezing the down-projection matrix (orthogonal to the supertype subspace) and training only the up-projection matrix, the method achieves an optimal balance between erasure effectiveness and generation quality in large-scale, multi-domain concept erasure.
MDiff4STR: Mask Diffusion Model for Scene Text Recognition: This work is the first to introduce Mask Diffusion Models (MDM) into Scene Text Recognition (STR), proposing MDiff4STR. It addresses the training-inference noising gap via six training mask strategies and resolves overconfident predictions through a Token Replacement Noise mechanism. With only 3 denoising steps, MDiff4STR surpasses state-of-the-art autoregressive models in accuracy while achieving 3× inference speedup.
Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models: Through systematic probing analysis of attention maps in diffusion models, this work reveals that self-attention maps are critical for preserving the temporal structure of music. Based on this finding, Melodia is proposed — a training-free music editing method that achieves an optimal balance between attribute modification and structural preservation by selectively manipulating self-attention maps.
Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution: This paper introduces the sparse Mixture-of-Experts (MoE) paradigm into real-world image super-resolution, proposing a Mixture-of-Ranks (MoR) architecture that treats each LoRA rank as an independent expert. Combined with a CLIP-driven degradation estimation module and a degradation-aware load balancing loss, the method achieves one-step high-fidelity super-resolution reconstruction.
MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation: This work introduces the MeanFlow paradigm to the robot learning domain for the first time. By incorporating 3D point cloud inputs and a Dispersive Loss, MP1 generates action trajectories in a single network forward pass (1-NFE), achieving state-of-the-art success rates with an inference latency of only 6.8 ms on robotic manipulation tasks.
Multi-Aspect Cross-modal Quantization for Generative Recommendation: This paper proposes MACRec, which introduces multi-aspect cross-modal interaction at both the semantic ID learning stage and the generative model training stage. Through cross-modal quantization (contrastive learning-enhanced residual quantization) and multi-aspect alignment (implicit + explicit), MACRec significantly improves recommendation performance while reducing ID collision rates.
Multi-Metric Preference Alignment for Generative Speech Restoration: This paper proposes a Multi-Metric Preference Alignment strategy that constructs a preference dataset, GenSR-Pref (80K pairs), requiring unanimous agreement across multiple complementary metrics. DPO is applied to post-training alignment of three generative speech restoration paradigms (AR, MGM, FM), achieving substantial quality improvements while effectively mitigating reward hacking.
ORVIT: Near-Optimal Online Distributionally Robust Reinforcement Learning: This paper studies online distributionally robust reinforcement learning and proposes the RVI-$f$ algorithm based on $f$-divergence uncertainty sets. It achieves near minimax-optimal regret bounds under both $\chi^2$ and KL divergences without relying on any structural assumptions.
PADiff: Predictive and Adaptive Diffusion Policies for Ad Hoc Teamwork: This work is the first to apply diffusion models to the Ad Hoc Teamwork (AHT) problem. The proposed PADiff framework achieves real-time adaptation to dynamic teammates via an Adaptive Feature Modulation Net (AFM-Net), and injects teammate intent prediction into the denoising process through a Predictive Guidance Block (PGB), achieving an average improvement of 35.25% over existing methods in multimodal cooperative scenarios.
PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement: This paper proposes PASE, a framework that leverages robust phonological priors embedded in pretrained WavLM via Denoising Representation Distillation (DRD) to suppress linguistic hallucinations, while employing a dual-stream representation (high-level phonetic + low-level acoustic) to eliminate acoustic hallucinations, simultaneously achieving state-of-the-art performance in both perceptual quality and content fidelity.
Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback: This paper proposes a DiT-based audio-driven human video generation framework built on Wan2.1, featuring a LoRA training strategy for long video generation, partial parameter updates combined with DPO reward feedback to enhance lip synchronization and motion naturalness, and a novel training-free Mask-CFG method that enables multi-character (≥3 persons) audio-driven animation for the first time.
ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration: This paper proposes ProCache, a training-free dynamic feature caching framework that achieves 2.90× speedup on DiT-XL/2 and 1.96× speedup on PixArt-α with negligible image quality degradation, through constraint-aware non-uniform caching pattern search and selective computation, significantly outperforming existing caching methods.
QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution: This paper proposes QuantVSR, the first low-bit (4/6-bit) post-training quantization framework for diffusion-based video super-resolution (VSR). It introduces a Spatiotemporal Complexity-Aware (STCA) mechanism for layer-adaptive rank allocation and a Learnable Bias Alignment (LBA) module to mitigate quantization bias. Under the 4-bit setting, QuantVSR achieves 84.39% parameter compression and 82.56% computation reduction while maintaining performance comparable to the full-precision model.
ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment: This paper proposes ReAlign (Reward-guided sampling Alignment), which employs a step-aware reward model and a reward-guided sampling strategy to dynamically steer sampling trajectories toward distributions with high text-motion alignment during diffusion inference, significantly improving the generation quality of various motion generation methods without fine-tuning any diffusion model. Using MLD as a baseline, R@1 improves by 17.9% and FID improves by 58.8%.
Realism Control One-step Diffusion for Real-World Image Super-Resolution: This paper proposes the RCOD framework, which endows one-step diffusion (OSD) super-resolution methods with the ability to flexibly control the fidelity–realism trade-off at inference time via a latent domain grouping strategy and degradation-aware sampling. A visual prompt injection module is also introduced to replace text prompts, improving restoration accuracy.
Realistic Face Reconstruction from Facial Embeddings via Diffusion Models: This paper proposes the FEM (Face Embedding Mapping) framework, which employs a KAN-based network to map embeddings from arbitrary face recognition (FR) or privacy-preserving face recognition (PPFR) systems into the embedding space of a pretrained identity-preserving (ID-Preserving) diffusion model, enabling high-resolution realistic face reconstruction for evaluating privacy leakage risks in FR systems.
Rectified Noise: A Generative Model Using Positive-incentive Noise: This paper proposes Rectified Noise (ΔRN), which leverages the positive-incentive noise (π-noise) framework to learn a set of beneficial noise signals and inject them into the velocity field of a pretrained Rectified Flow model, achieving a reduction in FID from 10.16 to 9.05 on ImageNet-1k with only 0.39% additional parameters.
RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers: This paper proposes the RelaCtrl framework, which quantifies the sensitivity of each DiT layer to control information via a ControlNet Relevance Score, and uses this analysis to guide the placement and modeling capacity of control blocks. A Two-Dimensional Shuffle Mixer (TDSM) is introduced to replace self-attention and FFN, achieving controllable generation quality superior to PixArt-δ with only 15% of its parameters and computational cost.
RetrySQL: Text-to-SQL Training with Retry Data for Self-Correcting Query Generation: This paper proposes the RetrySQL training paradigm, which injects retry data (erroneous steps + [BACK] token + correct steps) into reasoning chains during continual pre-training of small coding models. This approach enables a 1.5B open-source model to acquire self-correction capabilities, achieving improvements of up to 4 and 3.93 percentage points in overall execution accuracy on the BIRD and SPIDER benchmarks, respectively, with gains of up to 9 percentage points on challenging samples.
Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation: This paper systematically investigates fundamental deficiencies in compositional fidelity of text-to-image (T2I) models, focusing on three basic primitives—negation, counting, and spatial relations. It reveals a "submultiplicative" interference phenomenon in which models perform adequately on individual primitives but suffer dramatic performance degradation under joint composition, attributing this to training data scarcity, the unsuitability of continuous attention architectures for discrete logic, and evaluation metrics biased toward visual plausibility rather than constraint satisfaction.
Self-NPO: Data-Free Diffusion Model Enhancement via Truncated Diffusion Fine-Tuning: This paper proposes Self-NPO, a negative preference optimization method that requires neither external data annotation nor reward models. By leveraging Truncated Diffusion Fine-Tuning (TDFT), the model learns "what is bad" from its own low-quality generated data, and uses CFG to steer generation away from undesirable outputs. Self-NPO achieves comparable performance to Diffusion-NPO at less than 1% of the training cost.
SimDiff: Simpler Yet Better Diffusion Model for Time Series Point Forecasting: This paper proposes SimDiff — the first purely end-to-end diffusion model to achieve state-of-the-art performance on time series point forecasting. A unified Transformer network serves simultaneously as denoiser and predictor. Combined with Normalization Independence for distribution shift handling and a Median-of-Means ensemble strategy that converts probabilistic samples into precise point predictions, SimDiff achieves 1st place on 6 and 2nd place on 3 out of 9 benchmarks.
SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation: This paper proposes SpecDiff, a training-free multi-level feature caching strategy based on self-speculation. By leveraging a small number of speculative steps to introduce future information for token importance selection, SpecDiff overcomes the accuracy–speed bottleneck of methods that rely solely on historical information, achieving 2.80×/2.74×/3.17× speedup on Stable Diffusion 3/3.5 and FLUX with negligible quality loss.
Stabilizing Self-Consuming Diffusion Models with Latent Space Filtering: This paper proposes Latent Space Filtering (LSF), a method that analyzes the degradation of low-dimensional structure in the latent representations of self-consuming diffusion models and uses confidence scores from a probing classifier to filter low-quality synthetic data. Under a fixed training budget, LSF effectively mitigates model collapse without requiring additional real data or an enlarged training set.
Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression: This paper proposes SODEC, a one-step diffusion-based image compression model that injects the prior of a high-fidelity VAE decoder into the diffusion generation process via a Fidelity Guidance Module (FGM). Combined with a rate annealing training strategy, SODEC achieves high-quality compression at extremely low bitrates, with decoding speed more than 20× faster than multi-step diffusion methods, while reaching state-of-the-art rate-distortion-perception trade-offs.
Structure-based RNA Design by Step-wise Optimization of Latent Diffusion Model: This paper proposes SOLD, a framework that integrates a latent diffusion model (LDM) with reinforcement learning (RL) via a step-wise single-step sampling optimization strategy to directly optimize non-differentiable structural metrics in RNA inverse folding — including secondary structure similarity (SS), minimum free energy (MFE), and LDDT — achieving comprehensive improvements over existing methods across multiple metrics.
Studying Classifier(-Free) Guidance From A Classifier-Centric Perspective: Through systematic empirical study, this paper reveals the essential mechanism underlying both classifier guidance and classifier-free guidance — both steer denoising trajectories away from the classifier's decision boundary to achieve conditional generation — and proposes a flow matching-based post-processing method that validates this "classifier-centric" perspective on high-dimensional data.
T-LoRA: Single Image Diffusion Model Customization Without Overfitting: This paper proposes T-LoRA, a timestep-dependent low-rank adaptation framework that addresses overfitting in single-image diffusion model customization. The framework dynamically adjusts the effective LoRA rank across diffusion timesteps (smaller rank at high-noise timesteps, larger rank at low-noise timesteps) and employs orthogonal initialization (Ortho-LoRA) via random matrix SVD to ensure information independence among adaptation components, achieving an optimal balance between concept fidelity and text alignment.
T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model: This paper constructs T2I-RiskyPrompt — a comprehensive benchmark comprising 6,432 valid risky prompts spanning 6 major categories and 14 subcategories, each annotated with hierarchical labels and detailed risk rationales. A reason-driven MLLM-based risk detection method is proposed (achieving 91.8% accuracy with a 3B model), and a systematic evaluation is conducted across 8 T2I models, 9 defense methods, 5 safety filters, and 5 attack strategies.
Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances: This paper proposes VALOR, a validation-aware multimodal expert framework combining a multi-expert routing architecture with Chain-of-Thought reasoning and a semantic alignment validation mechanism, which achieves joint fine-grained classification of complaint Aspect and Severity in multi-turn multimodal customer service dialogues, yielding absolute improvements of 12.94%/6.51% over the strongest baseline Gemma-3.
Targeted Data Protection for Diffusion Model by Matching Training Trajectory: TAFAP achieves, for the first time, effective Targeted Data Protection (TDP) for diffusion models by generating adversarial perturbations via training trajectory matching, redirecting unauthorized fine-tuning outputs toward a user-specified target concept while maintaining high image quality.
TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs: This paper proposes TruthfulRAG, a framework that, for the first time, leverages knowledge graphs (KGs) to resolve conflicts between retrieved knowledge and LLM parametric knowledge at the factual level in RAG systems. The framework improves generation accuracy and trustworthiness through triple extraction, query-aware graph retrieval, and an entropy-based conflict filtering mechanism.
TSGDiff: Rethinking Synthetic Time Series Generation from a Pure Graph Perspective: This paper proposes TSGDiff, the first framework to rethink time series generation from a purely graph-based perspective. Time series are represented as dynamic graphs constructed from Fourier spectral features, diffusion modeling is performed in the graph latent space, and a novel Topo-FID metric is introduced to evaluate the structural fidelity of generated time series.
UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective: This paper proposes UNSEEN, a dataset pruning method that improves coreset selection from a generalization perspective—considering not only how retained samples contribute to training loss, but also how they contribute to test-time generalization. UNSEEN selects coresets that better align the training distribution with unseen test distributions.
VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning: VoiceCloak is a proactive defense framework against diffusion-based voice cloning that simultaneously achieves speaker identity obfuscation and perceptual quality degradation via four-dimensional adversarial perturbations, attaining a Defense Success Rate (DSR) of 71.4% on LibriTTS and substantially outperforming all existing defense methods.
X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning: This paper constructs a high-quality editing dataset of 3.7M samples covering 14 task categories and proposes a lightweight (0.9B parameter) plug-and-play editing module based on Task-Aware MoE-LoRA and Contrastive Learning, achieving performance comparable to 12B fully fine-tuned models.
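
The SimDiff entry above mentions a Median-of-Means ensemble that converts probabilistic samples into a point prediction. The sketch below shows the textbook median-of-means estimator applied per forecast step. It is an illustration under assumptions, not the paper's implementation: the function name `median_of_means`, the random grouping, and the `(n_samples, horizon)` layout are all hypothetical choices made for this example.

```python
import numpy as np


def median_of_means(samples, n_groups, seed=0):
    """Collapse sampled forecasts into one point forecast.

    samples:  array of shape (n_samples, horizon), one row per sampled forecast
              (assumed layout, chosen for this sketch).
    n_groups: number of groups to split the samples into.
    """
    samples = np.asarray(samples, dtype=float)
    n_samples = samples.shape[0]
    if not 1 <= n_groups <= n_samples:
        raise ValueError("n_groups must be between 1 and the number of samples")

    # Shuffle so that group membership does not depend on sampling order.
    rng = np.random.default_rng(seed)
    shuffled = samples[rng.permutation(n_samples)]

    # Average within each group, then take the median across group means.
    groups = np.array_split(shuffled, n_groups, axis=0)
    group_means = np.stack([g.mean(axis=0) for g in groups])
    return np.median(group_means, axis=0)


if __name__ == "__main__":
    # Nine well-behaved samples plus one extreme outlier, horizon of 2 steps.
    forecasts = np.ones((10, 2))
    forecasts[3] = 1000.0
    print("plain mean:     ", forecasts.mean(axis=0))
    print("median of means:", median_of_means(forecasts, n_groups=5))
```

With a single extreme sample, the plain mean is pulled far from the bulk of the forecasts, while the median of group means stays with the majority, which is the usual motivation for this estimator.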

🧊 3D Vision

3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition: This paper introduces the Neural Collapse (NC) mechanism into adversarial robustness for 3D point cloud recognition. By replacing the classifier head with a fixed ETF structure and adopting an adaptive training framework (RBL + FDL) to construct a disentangled feature space, 3D-ANC improves the adversarial accuracy of DGCNN on ModelNet40 from 27.2% to 80.9%, surpassing the best baseline by 34 percentage points.
3D-Free Meets 3D Priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance: This paper proposes a framework that combines 3D-free methods (HawkI-style test-time optimization) with 3D-based priors (weak guidance images from Zero123++) to synthesize camera-controlled views at specified elevation/azimuth angles from a single image, requiring neither additional 3D data nor training. The approach comprehensively outperforms Zero123++, HawkI, and Stable Zero123 on LPIPS, CLIP-Score, and other metrics in complex scenes.
3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation: This work adapts the SAM2 foundation model for 3D teeth segmentation by converting 3D meshes into 2D images via multi-view rendering and designing three lightweight adapters—a Prompt Embedding Generator, a Mask Refiner, and a Mask Classifier—along with a Deformable Global Attention Plugin (DGAP) to address automatic prompting, boundary refinement, and semantic classification. The proposed method achieves a new state-of-the-art T-mIoU of 91.90% on Teeth3DS.
4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation: This paper proposes the 4DSTR framework, which significantly improves the spatial-temporal consistency of 4D Gaussian generation and its adaptability to rapid temporal changes through a Mamba-based temporal correlation rectification module (correcting Gaussian scale and rotation residuals) and a per-frame adaptive densification and pruning strategy.
Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models: This paper proposes Uni-Adapter, a training-free online test-time adaptation (TTA) framework for 3D vision-language foundation models (VLFMs). It addresses distribution shifts via clustering-based dynamic prototype caching and graph-regularized label smoothing, achieving state-of-the-art performance on multiple 3D corruption benchmarks.
AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation: This paper identifies a critical yet overlooked issue in SDS: the source distribution is dynamically evolving rather than static. AnchorDS is proposed to anchor the source distribution by feeding the current rendered image as an image condition into a dual-conditioned diffusion model, thereby resolving semantic over-smoothing and multi-view inconsistency in SDS. The method comprehensively outperforms SDS, VSD, and SDS-Bridge on T3Bench.
AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation: AnchorHOI is proposed to achieve zero-shot text-driven 4D human-object interaction (HOI) generation by introducing two intermediate bridges — anchor NeRF and anchor keypoints — to distill interaction priors and motion priors from image and video diffusion models, respectively. The method outperforms existing approaches on both static 3D and dynamic 4D HOI generation.
Arbitrary-Scale 3D Gaussian Super-Resolution: This paper proposes Arbi-3DGSR, an integrated framework that, for the first time, enables a single 3DGS model to support arbitrary-scale (including non-integer) high-resolution rendering through three core components: scale-aware rendering, generative prior-guided optimization, and progressive super-resolving. At ×5.7 scale, PSNR improves by 6.59 dB over vanilla 3DGS while maintaining real-time rendering at 85 FPS.
ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation: This paper proposes ASSIST-3D, a synthetic data pipeline that generates high-quality annotated data for class-agnostic 3D instance segmentation through three stages: heterogeneous object selection, LLM-guided scene layout generation, and realistic point cloud construction, significantly improving model generalization.
Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?: This paper presents the first systematic study exposing the vulnerability of 3DGS watermarking frameworks, and proposes GSPure — a purification framework that leverages view-aware Gaussian weight accumulation and geometric feature clustering to precisely isolate and remove watermark-related Gaussian primitives, reducing watermark PSNR by up to 16.34 dB while incurring less than 1 dB loss in scene fidelity.
Cheating Stereo Matching in Full-Scale: Physical Adversarial Attack against Binocular Depth Estimation: This paper proposes the first full-surface 3D texture physical adversarial attack against stereo matching models. Through a stereo-aligned rendering module and a region-aware merging attack strategy, adversarial vehicles seamlessly blend into the background in the predicted depth map, causing severe failures in autonomous driving perception systems.
Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation: This paper proposes a Class-Partitioned VQ-VAE (CPVQ-VAE) and a Latent Flow Matching Model (LFMM), which together form the first purely generative point cloud scene generation method that requires no external database retrieval, reducing Chamfer Distance by 70.4% on complex living room scenes.
DANCE: Density-Agnostic and Class-Aware Network for Point Cloud Completion: This paper proposes the DANCE framework, which achieves density-agnostic point cloud completion via ray-based candidate point sampling and an opacity prediction mechanism, while introducing a classification head to provide semantic priors. The method achieves state-of-the-art performance on the PCN and MVP benchmarks.
DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion: This work presents the first integration of Mamba (SSM) into unsupervised domain adaptive point cloud completion (UDA PCC). The proposed DAPointMamba framework achieves high-quality cross-domain point cloud completion through three modules—Cross-Domain Patch-Level Scanning, Spatial SSM Alignment, and Channel SSM Alignment—while maintaining linear complexity and a global receptive field.
Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting: This paper proposes the TD-Attn framework, which addresses multi-view inconsistency (the Janus problem) caused by prior-view bias in T2I diffusion models for 3D generation and editing. The framework comprises two modules—3D-Aware Attention Guidance (3D-AAG) and Hierarchical Attention Modulation (HAM)—and can be integrated as a general-purpose plugin into existing 3DGS pipelines.
DeepRAHT: Learning Predictive RAHT for Point Cloud Attribute Compression: This paper proposes DeepRAHT, the first end-to-end differentiable Region Adaptive Hierarchical Transform (RAHT) framework for lossy point cloud attribute compression. By integrating learnable prediction models with a Laplace distribution-based rate proxy, DeepRAHT achieves compression performance surpassing both the G-PCC standard and existing deep learning methods.
Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection: This paper proposes FTKD (Future Temporal Knowledge Distillation), a framework comprising two strategies—Future-aware Feature Reconstruction (FFR) and Future-guided Logit Distillation (FLD)—to effectively transfer future frame knowledge from an offline teacher model to an online student model, achieving gains of 1.3 mAP / 1.3 NDS on nuScenes without additional inference overhead.
Domain Generalized Stereo Matching with Uncertainty-guided Data Augmentation: This paper proposes UgDA-Stereo, a plug-and-play training-time module that simulates diverse unseen domain styles by applying Gaussian uncertainty perturbations—derived from batch statistics—to the per-channel mean and standard deviation of RGB images. Combined with a feature consistency constraint, the method substantially improves the cross-domain generalization of stereo matching models.
Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos: This paper proposes a coarse-to-fine temporal alignment module that can be plugged into existing 4D Gaussian Splatting frameworks to address reconstruction quality degradation caused by temporal misalignment across multi-view videos. The method achieves consistent improvements in PSNR/SSIM/LPIPS over multiple baselines on the DyNeRF dataset.
Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization: This paper proposes WeSTAR, a framework that synergistically combines semantics-aware hierarchical depth normalization self-training, sparse pairwise ordinal weak supervision, and LoRA weight regularization to enhance the generalization of depth estimation foundation models (Depth Anything V2) on unseen domains and corrupted data in a parameter-efficient manner, achieving state-of-the-art results on multiple OOD benchmarks.
Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms: This paper proposes the Shadow-informed Pose Feature (SiPF) and the RIAttnConv operator. By introducing a global "shadow" reference point generated via Bingham distribution learning, the method enhances the global pose awareness of local rotation-invariant features, resolving the "Wing-tip Feature Collapse" problem where symmetric structures (e.g., left and right wings of an airplane) cannot be distinguished. The approach achieves state-of-the-art performance on ModelNet40 classification and ShapeNetPart segmentation.
EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios: This paper proposes EPSegFZ, a pretraining-free framework for few- and zero-shot 3D point cloud semantic segmentation. It extracts high-frequency features via ProERA, updates prototypes with textual information via LGPE, and establishes accurate query-prototype correspondences via DRPE. EPSegFZ surpasses the state of the art by 5.68% on S3DIS and 3.82% on ScanNet.
FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting: This paper presents FantasyStyle, the first 3DGS style transfer framework built entirely on diffusion model distillation. It introduces a Multi-View Frequency Consistency (MVFC) mechanism that suppresses low-frequency components to reduce cross-view conflicts, and designs Controllable Stylized Distillation (CSD) with negative guidance to eliminate content leakage from style images. The method surpasses existing VGG-based and diffusion-based approaches in both stylization quality and content preservation.
Fast 3D Surrogate Modeling for Data Center Thermal Management: This paper develops a vision-based 3D surrogate modeling framework for data centers. Server workloads, fan speeds, and air-conditioning temperature setpoints are encoded as 3D voxel representations, and architectures including 3D CNN U-Net, 3D Fourier Neural Operator, and 3D Vision Transformer are employed for real-time temperature field prediction. The proposed framework achieves inference speeds up to 20,000× faster than traditional CFD solvers while enabling a 7% reduction in energy consumption.
FoundationSLAM: Unleashing the Potential of Deep Foundation Models in End-to-End Dense Visual SLAM: This work injects geometric priors from depth foundation models into a flow-based SLAM system. Three modules (a hybrid flow network, a bi-consistent BA layer, and reliability-aware refinement) form a closed loop. The resulting system runs in real time at 18 FPS and achieves state-of-the-art trajectory accuracy and dense reconstruction quality across the TUM, EuRoC, 7Scenes, and ETH3D benchmarks.
Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine: This paper proposes FFSE — an autoregressive 3D-aware image editing framework built on a video diffusion model — paired with a hybrid dataset 3DObjectEditor (real + synthetic). FFSE enables multi-round object translation, scaling, and rotation on real images in the manner of a 3D engine, while generating physically plausible background effects such as shadows, reflections, and occlusions, and maintaining cross-round consistency. It substantially outperforms existing methods in both single-round and multi-round editing.
Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting: This paper revisits scalar alpha blending in 3DGS and identifies its neglect of intra-pixel spatial variation as the root cause of multi-scale rendering artifacts (enlargement erosion / downscaling dilation). The proposed Gaussian Blending models alpha and transmittance as spatial distributions within a pixel (2D uniform window), achieving real-time anti-aliasing without retraining. PSNR on multi-scale Blender improves from 31.59 to 35.80.
GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting: This paper proposes GaussianImage++, which achieves high-quality image representation and compression with a limited number of 2D Gaussian primitives via a distortion-driven densification mechanism and content-aware Gaussian filters, combined with an attribute-separated learnable scalar quantizer for efficient compression.
Generalized Geometry Encoding Volume for Real-time Stereo Matching: This paper proposes GGEV, which integrates depth priors from a monocular depth foundation model (Depth Anything V2) into the cost aggregation process in a lightweight manner. Through Depth-aware Dynamic Cost Aggregation (DDCA), GGEV adaptively enhances matching relationships across different disparity hypotheses, achieving strong generalization at real-time inference speed.
Geometry Meets Light: Leveraging Geometric Priors for Universal Photometric Stereo under Limited Multi-Illumination Cues: This paper proposes GeoUniPS, which injects geometric priors learned by a large-scale 3D reconstruction model (VGGT) into a universal photometric stereo pipeline. Through a light–geometry dual-branch encoder, the method recovers plausible surface normals even when multi-illumination cues are unreliable (e.g., shadows, self-occlusions, biased lighting). A new perspective-projection training dataset, PS-Perp, is also introduced to bridge the gap between the orthographic projection assumption and real-world cameras.
Graph Smoothing for Enhanced Local Geometry Learning in Point Cloud Analysis: This paper analyzes the limitations of conventional graph construction methods (ball query), specifically sparse connectivity at boundary points and noisy connectivity at junction regions, and proposes a graph smoothing module (symmetric adjacency optimization + von Neumann kernel) and a local geometry learning module (adaptive shape features + cylindrical coordinate transformation), achieving competitive performance on classification and segmentation tasks.
Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark: This paper presents Griffin, the first aerial-ground cooperative (AGC) 3D perception dataset and benchmark framework, comprising 250+ dynamic scenes (37K+ frames) generated via CARLA-AirSim joint simulation. Griffin features realistic UAV dynamics, variable cruise altitudes (20–60 m), occlusion-aware annotations, and a systematic robustness evaluation protocol.
GT2-GS: Geometry-aware Texture Transfer for Gaussian Splatting: This paper proposes GT2-GS, a framework that achieves high-quality, view-consistent texture transfer for 3DGS via a geometry-aware texture transfer loss (GT2 Loss), an adaptive fine-grained control module (AFCM), and a geometry-preserving branch (GPB), outperforming existing 3D style transfer methods in both texture fidelity and scene content preservation.
Hierarchical Direction Perception via Atomic Dot-Product Operators for Rotation-Invariant Point Clouds Learning: This paper proposes DiPVNet, which leverages the dual properties of the atomic dot-product operator (directional selectivity + rotation invariance) to construct a local L2DP operator and a global DASFT module, achieving hierarchical direction-aware rotation-invariant point cloud learning.
IE-SRGS: An Internal-External Knowledge Fusion Framework for High-Fidelity 3D Gaussian Splatting Super-Resolution: This paper proposes IE-SRGS, a framework that fuses external knowledge (high-frequency texture priors from a pretrained 2D super-resolution model) with internal knowledge (cross-view consistent depth and texture features from a multi-scale 3DGS model), coordinated via a mask-guided fusion strategy, to achieve high-fidelity 3DGS super-resolution reconstruction from low-resolution inputs, attaining state-of-the-art performance on both synthetic and real-world scenes.
Learning Conjugate Direction Fields for Planar Quadrilateral Mesh Generation: This paper proposes a data-driven approach based on DGCNN to efficiently generate conjugate direction fields (CDFs), bypassing the high computational cost of traditional nonlinear optimization. The method supports user stroke-guided controllable CDF generation, achieves a 1–2 order-of-magnitude speedup, and is accompanied by a large-scale dataset of 50,000+ free-form surfaces.
MeshA*: Efficient Path Planning With Motion Primitives: This paper proposes MeshA*, an algorithm that reformulates lattice-based path planning from "searching at the motion primitive level" to "searching at the grid cell level while simultaneously fitting primitive sequences." By defining a novel search space based on extended cells, MeshA* achieves 1.5×–2× runtime speedup over standard LBA* while preserving completeness and optimality guarantees.
MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting: This paper proposes MeshSplat, the first generalizable sparse-view surface reconstruction framework based on 2DGS. It regularizes depth prediction via a Weighted Chamfer Distance loss and aligns 2DGS orientations through an uncertainty-guided normal prediction network, learning geometric priors in a self-supervised manner from novel view synthesis. MeshSplat achieves state-of-the-art performance on both sparse-view mesh reconstruction and cross-dataset generalization.
MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video: MoBGS proposes an end-to-end dynamic deblurring 3D Gaussian Splatting framework that reconstructs sharp spatiotemporal novel views from blurry monocular video via two core modules — Blur-adaptive Latent Camera Estimation (BLCE) and Latent Camera-induced Exposure Estimation (LCEE) — achieving substantial improvements over existing state-of-the-art methods on the Stereo Blur dataset.
MR-CoSMo: Visual-Text Memory Recall and Direct Cross-Modal Alignment Method for Query-Driven 3D Segmentation: This paper proposes MR-CoSMo, a coarse-to-fine query-driven 3D segmentation model that establishes explicit alignment between 3D point clouds and text/2D images via a Direct Cross-Modal Alignment module (DCMA), and incorporates a Visual-Text Memory Module to store high-confidence feature pairs for enhanced cross-scene segmentation consistency. The method achieves state-of-the-art performance across three tasks: 3D instruction segmentation, referring segmentation, and semantic segmentation.
Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection: This paper proposes MMAssist, which leverages image and text features as "bridges" to align 3D features between the source and target domains, while incorporating 2D detection results to enhance pseudo-label quality, achieving significant improvements in LiDAR-based 3D unsupervised domain adaptation object detection.
NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling: NURBGen is the first text-to-CAD generation framework based on NURBS surface representation. By fine-tuning an LLM, it maps natural language descriptions to structured NURBS parameter JSONs. A hybrid representation (untrimmed NURBS + analytic primitives) and a large-scale partABC dataset are introduced, achieving significant improvements over existing methods in geometric fidelity and dimensional accuracy.
OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction: This paper proposes OceanSplat, which achieves high-fidelity underwater 3D Gaussian Splatting scene reconstruction under scattering media through trinocular view consistency constraints, synthetic epipolar depth priors, and depth-aware alpha adjustment, significantly reducing floating artifacts and surpassing existing methods.
Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning: This paper proposes OSU-3DSG, a unified framework that integrates vision-language models for open-world 3D scene graph generation and supports four scene interaction tasks — scene question answering, visual grounding, instance retrieval, and task planning — via retrieval-augmented reasoning, achieving performance comparable to supervised methods under a zero-shot setting.
OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding: This paper proposes the Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) task and the corresponding OpenScan benchmark, extending 3D scene understanding beyond object categories to eight linguistic attribute dimensions, revealing critical deficiencies of existing OV-3D methods in understanding abstract object attributes.
Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation: This paper proposes Opt3DGS, a framework that divides 3DGS training into two phases — exploration and exploitation. The exploration phase employs adaptively weighted SGLD to escape local optima, while the exploitation phase uses a local quasi-Newton Adam optimizer for precise convergence. The method achieves state-of-the-art rendering quality without modifying the Gaussian representation.
Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models: This work identifies pervasive redundant channels in vision foundation models (SAM/SAM2/DINOv2) and proposes an adaptation method that requires no parameter updates: an output-difference-based channel selection algorithm identifies optimal replacement pairs and substitutes redundant channels with effective ones to enhance feature representations for downstream tasks, yielding average mIoU gains of 5–11 points.
Pb4U-GNet: Resolution-Adaptive Garment Simulation via Propagation-before-Update Graph Network: This paper proposes Pb4U-GNet, which decouples message propagation from feature update (Propagation-before-Update) and incorporates resolution-aware propagation depth control and update scaling mechanisms, enabling garment simulation models trained solely on low-resolution meshes to generalize to high-resolution meshes.
PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos: PFAvatar is a two-stage pipeline, comprising pose-aware diffusion model fine-tuning (ControlBooth) and NeRF distillation (BoothAvatar), that reconstructs high-quality 3D personalized avatars from real-world Outfit-of-the-Day (OOTD) photos, completing personalization within 5 minutes and achieving a 48× speedup over prior methods.
Physics-Informed Deformable Gaussian Splatting: Towards Unified Constitutive Laws for Time-Evolving Material Field: Each 3D Gaussian is treated as a Lagrangian material point. A time-evolving material field predicts per-particle velocities and constitutive stress tensors; the Cauchy momentum residual serves as a physics constraint while Lagrangian particle flow matching provides a data-fitting term. The approach achieves physical consistency and cross-scene generalization in monocular dynamic view synthesis, reaching state-of-the-art performance on both a self-constructed physics-driven dataset and the HyperNeRF real-world benchmark.
Point-SRA: Self-Representation Alignment for 3D Representation Learning: Point-SRA is proposed to enhance 3D point cloud representation learning via Dual Self-Representation Alignment (MAE-SRA + MFT-SRA) and MeanFlow probabilistic modeling, exploiting the complementarity of representations under different mask ratios. The method surpasses Point-MAE by 5.59% on ScanObjectNN.
Point Cloud Quantization through Multimodal Prompting for 3D Understanding: This paper proposes PCQ (Point Cloud Quantization), which leverages text embeddings from pretrained vision-language models as semantic prototypes. Through Gumbel-Softmax differentiable quantization, continuous point cloud features are discretized into a text prototype space, and cross-modal feature fusion is applied to achieve significant improvements in 3D understanding.
RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-Wave Point Cloud Sequence: This paper proposes RadarLLM, the first end-to-end framework leveraging large language models for semantic-level human motion understanding from millimeter-wave radar point cloud sequences. The framework comprises a motion-guided radar tokenizer based on Aggregate VQ-VAE and a radar-aware language model, along with a physics-aware simulation pipeline for generating large-scale paired radar-text data.
Redundant Queries in DETR-Based 3D Detection: Unnecessary and Prunable: This paper proposes GPQ (Gradually Pruning Queries), a method that progressively prunes redundant object queries in DETR-based 3D detectors using classification scores. Without introducing any additional learnable parameters, GPQ can be applied as a fine-tuning step directly on pretrained checkpoints, achieving up to 67.86% FLOPs reduction and 65.16% inference time reduction on edge devices.
Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective: This paper proposes a novel Completion-by-Correction paradigm that leverages a pretrained image-to-3D model to generate a topologically complete shape prior, then corrects it in feature space to align with local observations. This replaces the conventional Completion-by-Inpainting approach, achieving a 23.5% reduction in average CD and a 7.1% improvement in F-score on ShapeNet-ViPC.
Rethinking Rainy 3D Scene Reconstruction via Perspective Transforming and Brightness Tuning: This paper proposes OmniRain3D, the first dataset that jointly models perspective heterogeneity and brightness dynamicity in rainy 3D scenes, along with REVR-GSNet, an end-to-end framework integrating recursive brightness enhancement, Gaussian primitive optimization, and GS-guided rain elimination to reconstruct high-fidelity clean 3D scenes from rain-degraded images.
Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation: This paper proposes a box-guided approach that leverages 2D bounding boxes from the open-vocabulary detector YOLO-World to guide the assembly of 3D instance masks from superpoints, without relying on SAM or CLIP. The method achieves high efficiency (<1 min/scene) while substantially improving retrieval performance on rare-category objects.
RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image: This paper proposes RTGaze, a real-time 3D-aware gaze redirection method that achieves high-quality gaze redirection from a single image at 61 ms/frame via a hybrid-frequency feature encoder, a gaze injection module, and 3D facial geometry prior distillation — approximately 800× faster than the previous state-of-the-art 3D method, GazeNeRF.
Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion: This paper proposes Simba, a framework that, for the first time, reformulates point cloud completion as diffusion over a geometric transformation field rather than over point coordinates. A Sym-Diffuser learns the conditional distribution of per-point affine transformations to generate coarse completions, which are then progressively refined to high-fidelity outputs via a cascaded Mamba architecture (MBA-Refiner). Simba achieves state-of-the-art performance on PCN, ShapeNet, and KITTI benchmarks.
SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images: This paper proposes SmartSplat, a feature-aware 2D Gaussian Splatting framework for image compression. By introducing three key strategies—gradient-color-guided variational sampling, repulsion-based uniform sampling, and scale-adaptive color initialization—SmartSplat achieves, for the first time, high-quality reconstruction of 8K/16K ultra-high-resolution (UHR) images at extreme compression ratios (up to 5000×).
Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction: Sparse4DGS is proposed as the first method for sparse-frame dynamic scene reconstruction, achieving high-fidelity 4D scene reconstruction from sparse video frames via two core modules: Texture-Aware Deformation Regularization (TADR) and Texture-Aware Canonical Optimization (TACO).
SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction: SparseSurf is proposed to achieve simultaneous high-accuracy surface reconstruction and high-quality novel view synthesis under sparse-view settings, via Stereo Geometry-Texture Alignment (SGTA) and Pseudo-Feature Enhanced Geometry Consistency (PFEGC).
Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction: This paper proposes Splat-SAP, a feed-forward method that reconstructs scale-aware point maps from wide-baseline stereo camera inputs and renders free-viewpoint video of human-centered scenes via a Gaussian Plane, requiring neither per-scene optimization nor 3D geometric supervision.
Splats in Splats: Robust and Effective 3D Steganography towards Gaussian Splatting: This paper proposes "Splats in Splats," the first steganography framework that embeds 3D hidden content into 3DGS assets without modifying any vanilla 3DGS attributes. It achieves secure, robust, and efficient copyright protection through importance-graded spherical harmonic (SH) coefficient encryption and autoencoder-assisted opacity mapping.
SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion: SplatSSC is proposed as a monocular 3D semantic scene completion framework based on depth-guided initialization and a Decoupled Gaussian Aggregator (DGA). Through compact Gaussian primitive initialization and robust geometry-semantics decoupled aggregation, it achieves state-of-the-art performance on Occ-ScanNet with significantly fewer primitives.
Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space: This paper proposes Split-Layer, which decomposes fully connected layers in MLPs into multiple parallel branches and integrates their outputs via the Hadamard product. Without increasing parameter count or computational cost, this approach exponentially expands the feature space dimensionality from $C$ to $\binom{C/\sqrt{N}+N-1}{N}$, significantly enhancing the representational capacity of implicit neural representations (INRs).
STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification: STMI proposes a three-component multi-modal object re-identification framework that suppresses background noise via SAM segmentation-guided feature modulation (SFM), extracts compact representations through semantic token reallocation (STR), and captures high-order semantic relationships via cross-modal hypergraph interaction (CHI), achieving significant improvements on benchmarks such as RGBNT201.
StreamSTGS: Streaming Spatial and Temporal Gaussian Grids for Real-Time Free-Viewpoint Video: This paper proposes StreamSTGS, a streamable spatial-temporal Gaussian grid representation that encodes canonical 3D Gaussian attributes as 2D images and temporal features as video, enabling real-time free-viewpoint video streaming at only 170 KB per frame. Reconstruction quality is maintained (PSNR 32.30 dB) through Transformer-assisted training and a sliding window mechanism.
Surface-Based Visibility-Guided Uncertainty for Continuous Active 3D Neural Reconstruction: This paper proposes a Surface-Based Visibility field (SBV) that derives surface confidence from SDF sign changes and updates it via a voxel grid, enabling accurate visibility-aware uncertainty estimation during continuous active learning for Next-Best View selection. SBV achieves up to 11.6% improvement in image rendering quality across four benchmarks: DTU, Blender, TanksAndTemples, and BlendedMVS.
TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction: This paper proposes TG-Field, a geometry-aware Gaussian deformation framework for extremely sparse-view CT reconstruction. It employs a multi-resolution hash encoder to model spatial geometric priors, and a spatiotemporal attention module together with a motion flow network to handle dynamic CT, achieving state-of-the-art performance on both static and dynamic CT reconstruction.
TOSC: Task-Oriented Shape Completion for Open-World Dexterous Grasp Generation from Partial Point Clouds: This paper introduces Task-Oriented Shape Completion (TOSC), a novel task that completes only the contact regions relevant to a manipulation task—rather than the entire object—by leveraging pretrained foundation models to generate candidate shapes, a 3D Discriminative Autoencoder (DAE) to select the optimal shape, and a FlowGrasp flow-matching model to synthesize dexterous grasps. The approach achieves improvements of 16.17% in grasp displacement and 55.26% in Chamfer Distance over prior methods.
UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning: This paper proposes UniC-Lift, a unified single-stage 3D instance segmentation framework that learns optimizable vector embeddings on 3DGS primitives via contrastive loss and triplet loss, and directly decodes consistent 3D segmentation labels through a simple Embedding-to-Label procedure — eliminating post-processing clustering steps such as HDBSCAN and reducing training time from 15+ hours to under 40 minutes.
VGGT-DP: Generalizable Robot Control via Vision Foundation Models: This paper proposes VGGT-DP, a biologically inspired visuomotor policy framework that integrates the pretrained 3D-aware foundation model VGGT as a visual encoder with Diffusion Policy. Through three key designs — frame-wise token reuse (FTR), random token pruning, and proprioception-guided visual learning — VGGT-DP substantially outperforms DP and DP3 baselines on high-precision manipulation tasks in MetaWorld.
VPN: Visual Prompt Navigation: This paper proposes Visual Prompt Navigation (VPN), a novel navigation paradigm in which users annotate visual trajectories (keypoints connected by arrows) on 2D top-down maps to guide agent navigation, replacing natural language or image-goal instructions. Two datasets, R2R-VP and R2R-CE-VP, are constructed alongside a VPNet baseline model. Combined with view-level and trajectory-level data augmentation, the approach achieves strong performance in both discrete and continuous environments.
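
As a worked illustration of the Split-Layer entry above, the following minimal sketch evaluates the quoted feature-space dimensionality $\binom{C/\sqrt{N}+N-1}{N}$ for a few settings. The function name `split_layer_dim`, the example values of `C` and `N`, and the restriction that $\sqrt{N}$ be an integer dividing `C` are assumptions made for illustration, not details taken from the paper.

```python
import math


def split_layer_dim(C: int, N: int) -> int:
    """Evaluate binom(C / sqrt(N) + N - 1, N) from the Split-Layer summary.

    Assumption for this sketch: N is a perfect square and sqrt(N) divides C,
    so the per-branch width C / sqrt(N) is an integer.
    """
    root = math.isqrt(N)
    if root * root != N or C % root != 0:
        raise ValueError("sketch requires N to be a perfect square with sqrt(N) dividing C")
    width = C // root  # per-branch width C / sqrt(N)
    return math.comb(width + N - 1, N)


if __name__ == "__main__":
    # Hypothetical layer widths and branch counts, chosen only for illustration.
    for C, N in [(256, 1), (256, 4), (144, 4), (144, 9)]:
        print(f"C={C}, N={N}: {split_layer_dim(C, N)}")
```

With `N = 1` the expression reduces to `C`, which matches the summary's baseline of a single fully connected layer; larger `N` increases the count rapidly.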

🎮 Reinforcement Learning

A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs: This paper proposes a multi-dimensional objective-space framework for evaluating LLM steerability, decomposing steering error into miscalibration and side effects (orthogonality). Experiments on text rewriting reveal that even the strongest LLMs produce severe side effects; prompt engineering proves ineffective, best-of-N sampling is prohibitively costly, and RL fine-tuning yields improvements but does not fully resolve the problem.
A Learning Framework For Cooperative Collision Avoidance of UAV Swarms Leveraging Domain Knowledge: This paper proposes reMARL, a framework that leverages domain knowledge from image processing (active contour model) to design reward functions for multi-agent reinforcement learning, enabling cooperative collision avoidance in UAV swarms. Compared to traditional metaheuristic methods, reMARL reduces reaction time by 98.75% and energy consumption by 85.37%.
A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses: This paper proposes MACO, a multi-agent conversational bandit framework that achieves online evaluation and user preference alignment for LLM responses through a local-agent phase elimination mechanism and an adaptive preference query strategy on a cloud server, attaining a near-optimal regret bound of $\tilde{O}(\sqrt{dMT})$.
Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward: AC3 proposes an actor-critic framework that directly learns continuous action sequences (action chunks), stabilizing long-horizon robotic manipulation under sparse rewards via an asymmetric actor update rule—updating the actor only from successful trajectories—and self-supervised anchor-based intrinsic rewards. The method achieves superior success rates over existing approaches across 25 tasks on BiGym and RLBench.
Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping: This paper proposes a test-time policy shaping method that interpolates and modifies the action probability distribution of pretrained RL agents at inference time using lightweight ethical attribute classifiers, enabling fine-grained behavioral steering across multiple ethical attributes without retraining.
BAMAS: Structuring Budget-Aware Multi-Agent Systems: This paper proposes the BAMAS framework, which employs Integer Linear Programming (ILP) to select the optimal LLM combination under budget constraints, and uses a reinforcement learning policy to choose the best collaboration topology (Linear/Star/Feedback/Planner-Driven). BAMAS achieves accuracy comparable to state-of-the-art multi-agent systems on GSM8K, MBPP, and MATH, while reducing costs by up to 86%.
Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning: This paper proposes Behaviour Policy Optimization (BPO), which optimizes a dedicated behaviour policy for off-policy data collection such that the variance of return estimates is provably lower than under on-policy collection, thereby improving the sample efficiency and stability of REINFORCE and PPO.
Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning: Through dynamical systems analysis, this paper proves that under approximate greedy exploration policies, all zero-loss solutions violating IGM consistency in non-monotonic value factorization Q-learning are unstable saddle points, while IGM-consistent solutions are stable attractors — enabling reliable convergence to optimal solutions without monotonicity constraints.
Beyond the Lower Bound: Bridging Regret Minimization and Best Arm Identification in Lexicographic Bandits: Two elimination-based algorithms, LexElim-Out and LexElim-In, are proposed to simultaneously address regret minimization (RM) and best arm identification (BAI) in lexicographic multi-objective bandits for the first time. LexElim-In breaks the known lower bound of single-objective problems through cross-objective information sharing.
Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback: This paper proposes MetaCUB — a bi-level contextual bandit framework for individualized resource allocation under delayed feedback, dynamic cohorts, cooldown constraints, and fairness requirements. The meta-level optimizes subgroup budget allocation to ensure fairness, while the base-level applies a UCB strategy to select the most promising individuals within each subgroup.
ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing: This paper introduces the ChartEditVista benchmark (7,964 samples, 31 chart types) and the ChartEditor model. By combining a GRPO reinforcement learning framework with a novel rendering reward, ChartEditor surpasses GPT-4o and several 72B-scale models on chart editing tasks using only 3B parameters.
CHDP: Cooperative Hybrid Diffusion Policies for RL in Parametric Environments: This paper models the hybrid action space problem as a fully cooperative two-agent game, employing discrete and continuous diffusion policies respectively to generate actions. Sequential updates and a Q-guided codebook are introduced to resolve policy conflicts and high-dimensional scalability issues, achieving up to a 19.3% improvement in success rate.
Constrained and Robust Policy Synthesis with Satisfiability-Modulo-Probabilistic-Model-Checking: This paper proposes the first framework capable of efficiently computing robust policies under arbitrary structural constraints. By tightly integrating a SAT solver with probabilistic model checking algorithms, the framework enables constrained and robust policy synthesis for finite Markov Decision Processes (MDPs), with feasibility and competitiveness validated across hundreds of benchmarks.
Deep (Predictive) Discounted Counterfactual Regret Minimization: This paper proposes two model-free neural CFR algorithms, VR-DeepDCFR+ and VR-DeepPDCFR+, which integrate advanced tabular CFR variants (DCFR+/PDCFR+) into neural network approximation frameworks for the first time. Through bootstrapped cumulative advantage estimation, discounted clipping mechanisms, and baseline variance reduction, the proposed methods achieve faster convergence in standard imperfect information games.
DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs: This paper proposes DeepProofLog (DPrL), a neurosymbolic system grounded in stochastic logic programs that introduces neural network parameterization at each proof step and establishes a formal mapping between SLD resolution and MDPs. This enables dynamic programming and reinforcement learning techniques to be applied for efficient inference and learning, substantially improving the scalability of neurosymbolic systems.
DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients: This paper proposes DiffOP, a framework that treats optimization-based control policies (e.g., MPC) as differentiable modules, derives analytic policy gradients via implicit differentiation to enable end-to-end reinforcement learning training, and provides the first non-asymptotic convergence guarantee for this setting.
Discounted Cuts: A Stackelberg Approach to Network Disruption: This paper introduces the Discounted Cuts mathematical framework, modeling the classical Most Vital Links problem as a Stackelberg game. It systematically establishes a computational complexity classification for eight variants of discounted cuts and proves that all variants are solvable in polynomial time on bounded-genus graphs.
Distilling Deep Reinforcement Learning into Interpretable Fuzzy Rules: An Explainable AI Framework: This paper proposes a hierarchical Takagi-Sugeno-Kang (TSK) fuzzy classifier system that distills deep RL neural network policies into human-readable IF-THEN fuzzy rules. Three quantitative interpretability metrics are introduced (FRAD, FSC, ASG). On the Lunar Lander continuous control task, the proposed system achieves 81.48% fidelity, surpassing decision trees by 21 percentage points.
Distributionally Robust Online Markov Game with Linear Function Approximation: This paper studies online distributionally robust Markov games with linear function approximation. It is the first to identify the hardness of learning in this setting, and proposes the DR-CCE-LSI algorithm, which achieves minimax-optimal sample complexity with respect to the feature dimension $d$ under a specific feature mapping condition.
Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning: This paper proposes a novel reward specification framework based on Linear Temporal Logic over finite traces modulo theories (LTLfMT), replacing manually coded labeling functions with first-order logic formulas. Combined with CRM and HER to address the inherent sparse reward problem in logic-based specifications, the framework achieves significant improvements on continuous control tasks.
Does Self-Evaluation Enable Wireheading in Language Models?: This paper theoretically proves and empirically validates that when a language model's self-evaluation is coupled with its reward signal, the model systematically inflates its self-assigned grades (wireheading), while decoupling self-grades from rewards mitigates this behavior. Experiments on Llama-3.1-8B and Mistral-7B across three tasks show that grade inflation in ambiguous tasks such as summarization reaches as high as 0.92.
DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift: This paper is the first to reformulate Android malware detection as a one-step Markov Decision Process (MD-MDP) and trains a PPO-based deep reinforcement learning agent, DRMD, that unifies sample classification, rejection, and active learning within a single policy. The approach achieves average AUT improvements of 8.66 (classification only) and 10.90 (with rejection) in multi-year temporal evaluations, significantly outperforming conventional supervised learning classifiers under concept drift.
Efficient Multiagent Planning via Shared Action Suggestions: This paper proposes the MCAS algorithm, which infers other agents' belief states by sharing only "suggested actions" within a decentralized POMDP framework, achieving coordination performance close to centralized methods while substantially reducing communication overhead and computational complexity.
Enhancing Robustness of Offline RL Under Data Corruption via SAM: This paper is the first to apply Sharpness-Aware Minimization (SAM) as a plug-and-play optimizer for offline RL. It hypothesizes that data corruption induces sharp minima in the loss landscape, leading to poor generalization, and demonstrates that SAM improves robustness by seeking flat minima. On the D4RL benchmark, IQL+SAM improves the average score from 34.47 to 44.40.
Explaining Decentralized Multi-Agent Reinforcement Learning Policies: This paper proposes the first explainability method for decentralized multi-agent reinforcement learning (MARL) policies, comprising Hasse diagram-based policy summarization and query-based natural language explanations (When / Why Not / What). The approach is demonstrated across four MARL domains, showing both generality and computational efficiency. A user study confirms that it significantly improves human understanding of policies and question-answering performance.
First-Order Representation Languages for Goal-Conditioned RL: This paper investigates the application of first-order relational languages to goal-conditioned RL and generalized planning. It proposes representing goals as subsets or lifted versions of sets of atoms, and combines this with HER to automatically construct easy-to-hard goal curricula, enabling the learning of generalizable policies on large-scale sparse-reward planning problems.
Formal Verification of Diffusion Auctions: This paper presents the first formal logical framework for diffusion auctions, introducing the $n$-seller diffusion incentive logic $\mathcal{L}^n$ and its strategic extension $\mathcal{SL}^n$. The framework supports model-checking verification of auction properties such as Nash equilibria and the existence of seller strategies, with complexity results of P and PSPACE-complete respectively.
G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation: This paper proposes G-UBS (Group-aware User Behavior Simulation), a paradigm that employs a User Group Manager (UGM) based on a "Summarize–Cluster–Reflect" LLM workflow to generate group profiles, combined with group-aware reinforcement learning in a User Feedback Modeler (UFM), achieving robust user behavior understanding under implicit feedback noise. The paper also introduces IF-VR, the first multimodal implicit feedback benchmark for video recommendation.
Good-for-MDP State Reduction for Stochastic LTL Planning: This paper proposes a novel Good-for-MDP (GFM) automaton state reduction technique that significantly reduces automaton state counts via a GFM→DBA→DCA→GFG minimization→0/1-PA transformation pipeline. Additionally, for formulas of the form $\textsf{GF}\varphi$ where $\varphi$ is a co-safety formula, a direct singly-exponential construction is provided, achieving an exponential reduction in state count compared to the general doubly-exponential construction.
HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning: This paper proposes HCPO, an algorithm that enhances the expressiveness and exploration efficiency of multi-agent joint policies by introducing a conductor mechanism, constructing a Gaussian mixture model-like joint policy framework, and providing monotonic improvement guarantees for two-level policy updates.
In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback: This paper proposes the InTRO framework, which aligns the model's generation policy with its answer-conditioned posterior via KL divergence minimization. By enabling token-level exploration and self-generated feedback within a single forward pass, InTRO improves both accuracy and conciseness of LLM reasoning without relying on any external supervision.
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization: To address the exploration bottleneck in semantic alignment for GUI grounding, this paper proposes the Adaptive Exploration Policy Optimization (AEPO) framework. AEPO enforces broad exploration via a multi-answer generation strategy, dynamically guides learning through an adaptive exploration reward function, and ensures exploration quality via a collinearity penalty mechanism, significantly improving multimodal large language model performance on complex GUI grounding tasks.
Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation: This paper proposes INSIGHT, a two-stage unified framework for egocentric long-term action anticipation (LTA). Stage one enhances action representations via hand-object interaction (HOI) region feature extraction and verb-noun co-occurrence matrices; stage two introduces a GRPO-based reinforcement learning cognitive reasoning module that simulates a structured "perceive → reason → answer" cognitive process for intention inference and action prediction.
Know your Trajectory -- Trustworthy Reinforcement Learning Deployment through Importance-Based Trajectory Analysis: This paper proposes a trajectory-level explanation framework based on state importance metrics. By combining Q-value differences with a goal-affinity measure (radical term), trajectories are ranked by importance. Counterfactual rollouts are then used to verify the robust superiority of the selected optimal trajectory, providing trustworthy explanations for RL policies in the form of "why this path rather than that one?"
Language Model Distillation: A Temporal Difference Imitation Learning Perspective: This paper revisits language model distillation from an imitation learning / inverse reinforcement learning perspective. It exploits the sparsity of teacher output distributions (top-p tokens concentrate over 96% of probability mass) to construct a top-p MDP for temporal difference (TD) learning, proves that the optimal policy in the reduced action space admits a bounded suboptimality guarantee, and demonstrates that the resulting Bellman Distill method — built on the IQL algorithm — outperforms existing distillation methods across multiple model families.
Learning to Generate and Extract: A Multi-Agent Collaboration Framework for Zero-shot Document-level Event Arguments Extraction: This paper proposes a "Propose-Evaluate-Revise" multi-agent collaboration framework (comprising a generator agent and an evaluator agent) to address zero-shot document-level event argument extraction (ZS-DEAE). The generator agent synthesizes training data for unseen event types, while the evaluator agent provides log-likelihood-based quality scores to guide reinforcement learning for iterative optimization, simultaneously improving synthetic data quality and extraction performance.
ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation: This paper proposes ManiLong-Shot, a framework comprising three modules—interaction-aware task decomposition, invariant region prediction, and region matching—that generalizes to 20 unseen long-horizon manipulation tasks after training on only 10 short-horizon tasks, achieving a one-shot imitation success rate of 30.2%, a relative improvement of 22.8% over the prior state of the art.
MARS: A Meta-Adaptive Reinforcement Learning Framework for Risk-Aware Multi-Agent Portfolio Management: This paper proposes the MARS framework, which achieves risk-aware portfolio management under dynamic market conditions through a two-level architecture comprising a Heterogeneous Agent Ensemble (HAE)—where each agent has a distinct risk preference and Safety-Critic—and a Meta-Adaptive Controller (MAC). The framework significantly reduces maximum drawdown and volatility.
MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization: This paper proposes MARS, a five-agent framework for automated prompt optimization (APO): a Planner generates task-specific optimization trajectories; a Teacher-Critic-Student triad conducts Socratic dialogue for iterative prompt refinement (simulating pseudo-gradient descent in text space); and a Target agent executes the prompt and provides feedback. The entire process is modeled as a POMDP. MARS outperforms the previous SOTA (PE2) by 6.04% on general tasks and 6.42% on domain-specific tasks across 17 datasets, requiring only 1-shot training data.
MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy: This paper proposes MathSmith, a framework that generates hard mathematical problems by randomly sampling concept pairs from PlanetMath, applying 9 predefined difficulty strategies, and jointly optimizing structural validity, reasoning complexity, and answer consistency via GRPO-based reinforcement learning. The resulting high-difficulty synthetic problems significantly improve LLM mathematical reasoning on AIME and OlympiadBench.
MMhops-R1: Multimodal Multi-hop Reasoning: This paper proposes the MMhops benchmark (31K samples, 3–4 reasoning hops) and the MMhops-R1 framework, which trains MLLMs via reinforcement learning to autonomously plan reasoning paths and dynamically invoke image/text retrievers for multimodal multi-hop reasoning. A 7B model surpasses 72B baselines and existing mRAG methods.
Object-Centric Latent Action Learning: This paper proposes an object-centric latent action learning framework that leverages self-supervised object decomposition (VideoSAUR) to disentangle task-relevant entities from visual distractions (e.g., dynamic backgrounds), reducing the performance degradation of LAPO on distracted videos by approximately 50%. A linear action probe is used to automatically select control-relevant slots.
Object-Centric World Models for Causality-Aware Reinforcement Learning: This paper proposes STICA, a framework that implements the world model, policy network, and value network through a unified object-centric Transformer architecture. The world model decomposes observations into independent per-object latent states for token-level dynamics prediction, while the policy and value networks estimate token-level causal relationships via a causal attention mechanism to enable causality-aware decision-making. STICA significantly outperforms DreamerV3 and other state-of-the-art methods on the Safety Gym and OCVRL benchmarks.
One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow: This paper reformulates MeanFlow from visual generation into a generative policy for offline RL. It proposes a residual-form direct noise-to-action mapping that achieves expressive one-step sampling and enables stable joint optimization with a Q-function in a single training stage, achieving strong performance across 73 tasks on OGBench and D4RL.
PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning: This paper proposes PA-FAS, a framework that addresses two critical bottlenecks of the SFT+RL paradigm in multimodal FAS (insufficient reasoning path diversity and reasoning shortcuts) via a Reasoning Path Augmentation strategy and an answer shuffling mechanism, yielding the first unified solution for multimodal fusion, domain generalization, and interpretability.
Partial Action Replacement: Tackling Distribution Shift in Offline MARL: This paper proposes the Partial Action Replacement (PAR) principle, theoretically proving that under a factorized behavior policy, distribution shift grows linearly with the number of deviating agents (rather than exponentially in the joint action space). Building on this, the SPaCQL algorithm is developed to dynamically weight different PAR operators via Q-ensemble uncertainty, achieving substantial improvements over all baselines on Random and Medium-Replay datasets.
Perturbing Best Responses in Zero-Sum Games: This paper investigates the introduction of stochastic perturbations into best-response oracles (BROs) for zero-sum games. It proves that Stochastic Fictitious Play (SFP) achieves an expected iteration count of $O(\frac{\log n}{\varepsilon^2})$ with respect to the number of pure strategies $n$, and proposes the Stochastic Double Oracle (SDO) algorithm, which achieves logarithmic convergence under specific game structures.
Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization: This paper presents the first theoretical study of preference-aware customization in multi-objective multi-armed bandits (MO-MAB) with explicit user preferences. It proposes the PAMO-MAB framework and designs PRUCB-UP and PRUCB-HP algorithms for the "unknown preference" and "hidden preference" settings, respectively. Through a two-component architecture combining preference estimation and preference-aware optimization, both algorithms achieve near-optimal regret bounds. The paper also proves that preference-free algorithms inevitably incur $\Omega(T)$ linear regret when the Pareto front contains conflicting arms.
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation: This paper proposes MTMC (Macro Thinking Micro Coding), a hierarchical framework that decouples GPU kernel generation into two stages: a lightweight RL-trained LLM generates high-level optimization actions (Macro Thinking), while a general-purpose LLM incrementally implements each action (Micro Coding). This design separates correctness from performance concerns, achieving near-100% accuracy and a 2.2× speedup over expert-optimized PyTorch Eager kernels on KernelBench.
Realistic Curriculum Reinforcement Learning for Autonomous and Sustainable Marine Vessel Navigation: This paper proposes a Curriculum Reinforcement Learning (CRL) framework for autonomous and sustainable marine vessel navigation. The framework integrates a high-fidelity simulation environment built on real AIS data, a diffusion model-enhanced dynamic maritime traffic simulator, and a machine learning-based fuel consumption prediction module. A multi-objective reward function simultaneously optimizes navigation safety, emission reduction, timeliness, and goal completion.
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination: This paper conducts a systematic data leakage audit revealing severe data contamination of the Qwen2.5 series on standard math benchmarks such as MATH-500. It demonstrates that recent findings claiming "spurious rewards can improve mathematical reasoning" are artifacts of contamination, and constructs a fully uncontaminated benchmark, RandomCalculation, to verify that only correct reward signals yield genuine reasoning improvements.
Reasoning with Exploration: An Entropy Perspective: This paper analyzes the positive correlation between exploratory reasoning behaviors in LLMs (pivotal tokens, self-reflection, rare behaviors) and high-entropy regions from an entropy perspective. It proposes a minimalist entropy-based advantage shaping method—requiring only a single line of code modification—that significantly enhances the Pass@K reasoning capability ceiling of LLMs.
ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India: This paper presents the first application of PPO-based reinforcement learning (RLAIF) to Indian legal judgment prediction and summarization tasks. Although performance does not surpass SFT or commercial models, this position paper reveals fundamental challenges and future directions for RL in legal NLP.
Revealing POMDPs: Qualitative and Quantitative Analysis for Parity Objectives: This paper proves that limit-sure analysis for revealing POMDPs under parity objectives is equivalent to almost-sure analysis (EXPTIME-complete), and that quantitative analysis can also be completed within EXPTIME, thereby resolving two important open problems for this subclass.
Risk-Sensitive Exponential Actor Critic: To address the high variance and numerical instability of policy gradients under the entropic risk measure, this paper derives a complete set of on/off-policy risk-sensitive policy gradient theorems and proposes the rsEAC algorithm, which achieves stable risk-sensitive continuous control via log-domain critic parameterization and gradient normalization-clipping mechanisms.
RLSLM: A Hybrid Reinforcement Learning Framework Aligning Rule-Based Social Locomotion Model with Human Social Norms: This paper proposes RLSLM, a hybrid framework that embeds a psychology-experiment-driven rule-based Social Locomotion Model (SLM) into the reward function of reinforcement learning, enabling agents to efficiently learn navigation policies aligned with human social norms in crowd environments. VR experiments demonstrate that RLSLM achieves significantly higher comfort ratings than existing rule-based baselines.
SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories: This paper proposes SafeMIL, which formulates cost function learning as a Multiple Instance Learning (MIL) problem to learn a safe imitation policy from a limited set of non-preferred trajectories and a large collection of unlabeled trajectories—without requiring step-level reward or cost annotations—achieving constraint satisfaction performance 3.7× better than the strongest baseline.
Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation: This paper proposes PolicyGradEx, which efficiently estimates policy adaptation performance on arbitrary task subsets via first-order gradient approximation and surrogate models, constructs a task affinity matrix, and performs task grouping through convex optimization. PolicyGradEx outperforms state-of-the-art baselines by an average of 16% on multi-objective RL and meta-RL benchmarks, with a speedup of up to 26×.
Speculative Sampling with Reinforcement Learning: This paper proposes Re-SpS, the first framework to formulate the draft tree hyperparameter optimization of Speculative Sampling (SpS) as an MDP and solve it via reinforcement learning. Through two key designs—feature reuse and action caching—Re-SpS achieves up to 1.12× additional speedup over EAGLE-3 without any loss in output fidelity.
Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding: This work identifies that CoT reasoning can be counterproductive in visual grounding, and proposes CuRPO (Curriculum-based Relative Policy Optimization), which leverages CoT length and gIoU reward as data complexity proxies for curriculum-based RL training, achieving up to +12.52 mAP improvement over Visual-RFT on RefCOCO.
STELAR-Vision: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision: This paper proposes STELAR-Vision, a topology-aware training framework for visual language reasoning. Via the TopoAug data generation pipeline, it introduces diverse reasoning topologies—Chain, Tree, and Graph—and combines SFT with RL (SimPO) post-training. The framework achieves +9.7% accuracy on in-distribution data and up to +28.4% on out-of-distribution benchmarks, while reducing output length by 18.1% through Frugal Learning.
TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction: This paper proposes TAdaRAG, a task-adaptive RAG framework that performs on-the-fly knowledge graph construction via intent-driven template routing, supervised fine-tuning, and REINFORCE-based reinforcement learning. It addresses three core limitations of conventional RAG—chunking-induced hallucination, broken reasoning chains, and irrelevant information interference—achieving state-of-the-art performance on 6 public benchmarks and 1 commercial scenario benchmark.
Test-driven Reinforcement Learning in Continuous Control: This paper proposes the Test-driven Reinforcement Learning (TdRL) framework, which replaces a single reward function with multiple test functions — pass-fail tests defining optimality criteria and indicative tests guiding learning — to represent task objectives. A return function is learned via lexicographic-heuristic trajectory comparison, matching or surpassing hand-crafted reward methods on the DeepMind Control Suite while naturally supporting multi-objective optimization.
TextShield-R1: Reinforced Reasoning for Tampered Text Detection: This paper proposes TextShield-R1, the first reinforcement learning-based multimodal large language model (MLLM) method for tampered text detection. The approach integrates forensic continual pre-training (a curriculum from natural images to text images), GRPO reinforcement learning (five carefully designed reward functions to reduce annotation dependency), and OCR rectification (leveraging the MLLM's text recognition capability to improve localization accuracy). Together with the newly introduced TFR benchmark (45K+ images, 16 languages, 10 tampering techniques), this work substantially advances the state of the art in interpretable tampered text detection.
Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making: This paper proposes the LAMP framework, which integrates LLM-driven language reasoning with MARL policy optimization through a Think–Speak–Decide three-stage pipeline. The framework enables economic decision-making agents to understand and leverage natural language information (e.g., news, dialogues), achieving cumulative returns exceeding pure MARL baselines by 63.5% and LLM-only baselines by 34.0% in economic simulation environments.
Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction: This paper proposes the Thinker framework, which achieves structured deep search reasoning through hierarchical thinking (breadth decomposition + depth solving) and dual representation (natural language + logical functions). Combined with knowledge boundary determination to reduce unnecessary retrieval, the model is trained via SFT and significantly outperforms RL-based deep search methods across multiple QA benchmarks.
TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents: This paper introduces TowerMind, a lightweight multimodal environment based on tower defense games for evaluating LLMs' long-term planning and decision-making capabilities. It reveals a significant performance gap between current LLMs and human experts (the best model achieves only 42% of human expert scores) and identifies behavioral deficiencies including insufficient plan verification, lack of multi-goal thinking, and underutilization of the action space.
Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach: This paper proposes Geo-R, a retrieval-free, reasoning-driven image geolocalization framework. By introducing the Chain-of-Region (CoR) hierarchical reasoning paradigm and a reinforcement learning strategy based on Haversine distance coordinate-alignment rewards, Geo-R achieves 18.10% street-level (1 km) accuracy on IM2GPS3K, surpassing all retrieval-free methods and approaching retrieval-based ones.
Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning: This paper identifies the Beginning Lock-in Effect (BLE) in LLM reasoning — the initial reasoning steps significantly determine subsequent trajectories and final outcomes. Based on this finding, the paper proposes PPPO, a method that optimizes only prefix tokens (approximately 26% of all tokens), achieving accuracy improvements of up to 18.02% while reducing output token counts by up to 18.35%.
Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning: This paper proposes the STV framework, which identifies attention head positions sensitive to in-context information via activation deltas, and leverages reinforcement learning to select optimal task vectors from a pre-clustered activation bank for insertion—enabling efficient many-shot multimodal in-context learning without increasing input length.
Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position: This paper presents the first systematic safety analysis of diffusion large language models (dLLMs). It reveals that, unlike in autoregressive LLMs, middle tokens are more critical to safety in dLLMs, and that the model's inherent sequential generation tendency fundamentally constrains attackers from manipulating these positions. Based on this asymmetry, the paper proposes MOSA (Middle-tOken Safety Alignment) as a defense method.
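
The Geo-R entry above mentions rewards based on Haversine distance and a street-level (1 km) accuracy threshold. The following is a minimal sketch of how such a distance could be computed, not the paper's implementation. The function names, the mean Earth radius constant, and the exponential reward shape with its `scale_km` parameter are all illustrative assumptions; only the Haversine formula itself and the 1 km threshold come from standard usage and the summary.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a common approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp guards against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, target, scale_km=100.0):
    """Hypothetical reward: 1.0 at zero error, decaying exponentially with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)


def within_threshold(pred, target, threshold_km=1.0):
    """True if the prediction lies within the given distance of the target."""
    return haversine_km(*pred, *target) <= threshold_km
```

A prediction and a target are passed as `(latitude, longitude)` tuples in degrees, for example `within_threshold((48.8584, 2.2945), (48.8580, 2.2950))`. A smooth reward such as the exponential one here gives a learning signal even when the prediction is far outside the 1 km threshold; the actual reward design used by Geo-R may differ.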

🚗 Autonomous Driving

A Data-Driven Model Predictive Control Framework for Multi-Aircraft TMA Routing Under Travel Time Uncertainty: A closed-loop MPC framework is proposed for conflict-free multi-aircraft routing and scheduling within the 50 NM Terminal Maneuvering Area (TMA) of Changi Airport. The framework integrates XGBoost-based TMA boundary arrival time prediction, MILP optimization (incorporating route selection, speed adjustment, holding control, and separation constraints), and a receding-horizon simulator. Under peak congestion scenarios of 36 aircraft/hour, it achieves a 7× computational speedup while significantly outperforming the Dijkstra baseline in feasibility under Monte Carlo robustness validation.
AI-based Traffic Modeling for Network Security and Privacy: Challenges Ahead: A survey and position paper on AI-based traffic modeling for Network Security & Privacy (NetS&P) tasks. It systematically reviews AI approaches for anomaly detection, attack classification, IoT device identification, and website fingerprinting attacks, and provides an in-depth discussion of four frontier challenges: data quality, practical deployment, explainability, and foundation models.
Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning: This paper presents the first study on backdoor attacks against open-vocabulary object detectors (OVODs), proposing TrAP (Trigger-Aware Prompt tuning), which jointly optimizes learnable prompts in both visual and textual branches alongside a learnable trigger to inject high-success-rate backdoors without modifying any model weights.
Beta Distribution Learning for Reliable Roadway Crash Risk Assessment: A geospatial deep learning framework based on Beta distribution learning is proposed, which leverages multi-scale satellite imagery to predict the full probability distribution of fatal crash risk (rather than point estimates), achieving 17–23% improvement in Recall while naturally expressing uncertainty through distribution shape.
Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation: This paper is the first to systematically address the "target-class hallucination" problem in unpaired day-to-night image translation. By combining a dual-head discriminator (style head + SAM2 pseudo-label segmentation head) for hallucination detection and class-prototype contrastive learning for suppression, the method improves mAP from 15.08 to 17.40 (+15.5%) on BDD100K day-to-night domain adaptation detection, with traffic light AP improving by 31.7%.
CaTFormer: Causal Temporal Transformer with Dynamic Contextual Fusion for Driving Intention Prediction: CaTFormer is proposed to explicitly model causal interactions between driver behavior and environmental context via a causal temporal Transformer, achieving state-of-the-art performance of 98.6% F1 on the Brain4Cars dataset.
CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking: CompTrack is proposed as the first framework to simultaneously address dual redundancy in LiDAR point clouds: SFP filters background noise via information entropy analysis to resolve spatial redundancy, while IB-DTC estimates the effective rank via online SVD and adaptively sets the compression ratio to compress the foreground into low-rank proxy tokens, resolving information redundancy. It achieves state-of-the-art results on nuScenes (61.04% Success) at 90 FPS.
Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification: This work systematically identifies two unique challenges in adversarial defense for person ReID — model bias and composite generalization requirements — and proposes a Debiased Dual-Invariant Defense framework. The data balancing stage employs a diffusion model for resampling to mitigate bias, while the dual adversarial self-meta defense stage achieves dual generalization to unseen IDs and unseen attacks via Farthest Negative Example Softening (FNES)-based metric adversarial training and adversarially-enhanced self-meta learning.
AdaptiveAD: Decoupling Scene Perception and Ego Status for End-to-End Autonomous Driving: This paper identifies the architectural root cause of ego-status over-reliance in end-to-end autonomous driving—namely, the premature fusion of ego status within the BEV encoder—and proposes AdaptiveAD, a dual-branch architecture consisting of a scene-driven branch (with ego status removed) and a self-driven branch that independently generate planning decisions. A scene-aware fusion module then adaptively integrates the two branches. Complemented by path attention, BEV unidirectional distillation, and an autoregressive online mapping auxiliary task, AdaptiveAD achieves state-of-the-art planning performance on nuScenes.
SAML: A Differentiable Semantic Meta-Learning Framework for Long-Tail Motion Prediction: SAML is proposed as the first framework to provide a differentiable semantic definition of "long-tailedness" in motion prediction — quantifying rarity via five intrinsic/interactive attributes, fusing them into a continuous Tail Index through a Bayesian Tail Perceiver, and driving MAML-based meta-learning adaptation. On the nuScenes worst-case top 1% subset, SAML achieves a minADE 17.2% lower than the second-best method.
Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection: This paper proposes MonoDLGD, which provides explicit geometric supervision for monocular 3D detection by adaptively perturbing and reconstructing ground-truth labels according to instance-level detection difficulty, achieving state-of-the-art performance on KITTI.
DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving: This paper proposes DiffRefiner, a coarse-to-fine two-stage framework that first employs a discriminative Proposal Decoder to generate coarse trajectory proposals, then iteratively refines them via a diffusion model, combined with a fine-grained semantic interaction module. The method achieves state-of-the-art performance on both NAVSIM v2 and Bench2Drive benchmarks.
Drive As You Like: Strategy-Level Motion Planning Based on A Multi-Head Diffusion Model: This paper proposes M-Diffusion Planner, a strategy-level motion planning framework built upon a multi-head diffusion model and GRPO post-training, enabling users to switch among driving styles such as aggressive, conservative, and comfortable via natural language, while maintaining state-of-the-art planning performance.
DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving: DriveFlow is a rectified flow adaptation method built upon pretrained T2I Flow models. Through frequency decomposition, it preserves high-frequency foreground content while applying dual-frequency optimization to the background, enabling training-free driving scene image editing for data augmentation and significantly improving the OOD robustness of vision-based 3D detectors.
DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning: This paper proposes DriveSuprim, which addresses three key bottlenecks in selection-based end-to-end planning — difficulty distinguishing similar trajectories, directional bias, and hard-label instability — through a coarse-to-fine trajectory selection paradigm, rotation-based data augmentation, and a self-distillation soft-label framework, achieving state-of-the-art performance on NAVSIM v1/v2 and Bench2Drive.
Dual-branch Spatial-Temporal Self-supervised Representation for Enhanced Road Network Learning: This paper proposes DST (Dual-branch Spatial-Temporal), a road network representation learning framework that jointly models spatial heterogeneity and temporal dynamics via a spatial branch (mix-hop transition matrix + graph–hypergraph contrastive learning) and a temporal branch (Transformer encoder + next-token prediction + weekday/weekend classification). DST achieves state-of-the-art performance on three downstream tasks across three cities.
ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts: ExpertAD introduces a Mixture-of-Experts (MoE) architecture into the perception and prediction modules of end-to-end autonomous driving systems. A Perception Adapter dynamically re-weights BEV features to amplify task-critical semantics, while a Mixture of Sparse Experts employs a router to selectively activate relevant driving task experts and uses sparse attention to reduce computation. The framework reduces inference latency by approximately 25% while maintaining or improving planning performance.
Exploring Surround-View Fisheye Camera 3D Object Detection: This paper systematically investigates 3D object detection with surround-view fisheye cameras. It constructs the Fisheye3DOD benchmark dataset containing both pinhole and fisheye camera data, and proposes two frameworks—FisheyeBEVDet and FisheyePETR—that embed fisheye geometric modeling into mainstream detection paradigms via spherical feature representations, achieving up to 6.2 FDS improvement over rectification-based baselines.
FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning: FastDriveVLA trains a lightweight plug-and-play ReconPruner module (only 0.07B parameters) via MAE-style foreground pixel reconstruction. By employing an adversarial foreground-background reconstruction strategy, the method prioritizes the retention of foreground tokens critical for driving decisions. It achieves state-of-the-art performance across all pruning ratios on the nuScenes open-loop planning benchmark, and can be transferred without retraining to different VLA models that share the same visual encoder.
Fine-Grained Representation for Lane Topology Reasoning: This paper proposes TopoFG, a framework that replaces conventional single-query lane modeling with fine-grained queries (each lane represented by multiple spatially-aware queries), combined with hierarchical prior extraction, region-focused decoding, and boundary-point-based robust topology reasoning, achieving new state-of-the-art results of 48.0% OLS (subset_A) and 45.4% OLS (subset_B) on OpenLane-V2.
FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection: This work presents the first fully INT8-quantized deployment of PETR-series 3D detectors. It introduces three key components: a quantization-friendly LiDAR-ray position encoding (QFPE) to resolve multi-modal feature magnitude mismatch, a dual-lookup table (DULUT) for efficient approximation of nonlinear operators, and quantization after numerical stabilization (QANS) to prevent softmax attention distortion. Across PETR/StreamPETR/PETRv2/MV2D, W8A8 quantization incurs less than 1% mAP loss while reducing latency by 75% (3.9× speedup).
Generalising Traffic Forecasting to Regions without Traffic Observations: This paper proposes GenCast, which achieves generalization of traffic forecasting from sensor-covered regions to unobserved continuous regions via three key innovations: a physics-informed neural network (incorporating the LWR traffic equation as a soft constraint), dynamic external weather signal fusion, and a spatial grouping module. GenCast consistently outperforms existing state-of-the-art methods across five real-world datasets.
Global-Lens Transformers: Adaptive Token Mixing for Dynamic Link Prediction: This paper proposes GLFormer, a lightweight attention-free Transformer framework for dynamic graph link prediction. It replaces self-attention with an adaptive token mixer conditioned on interaction order and temporal intervals, and employs a hierarchical aggregation mechanism to enlarge the temporal receptive field. GLFormer achieves performance on par with or superior to Transformer baselines across 6 benchmarks while substantially reducing computational complexity.
HD2-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving: This paper proposes the HD2-SSC framework, which addresses the 2D→3D input–output dimension gap via a High-dimensional Semantic Decoupling (HSD) module (expanding pixel features along a pseudo-dimension and orthogonally decoupling them), and addresses the annotation–reality density gap via a High-density Occupancy Refinement (HOR) module (a "detection–refinement" paradigm that aligns geometrically and semantically critical voxels). The method achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360.
Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification: This paper proposes HPL, a unified framework that decouples I2I and T2I tasks via a Task-Routed Transformer (dual classification tokens), and employs hierarchical prompt learning (identity-level + instance-level pseudo-text tokens) combined with cross-modal prompt regularization, achieving simultaneous state-of-the-art performance on both image-to-image and text-to-image person re-identification within a single model for the first time.
I-INR: Iterative Implicit Neural Representations: This paper proposes I-INR (Iterative Implicit Neural Representations), a plug-and-play iterative refinement framework that introduces lightweight FeedbackNet and FuseNet modules (adding only 0.5–2% parameters) to perform progressive multi-step signal reconstruction, effectively alleviating the spectral bias of INRs. I-INR consistently outperforms baselines across image fitting, super-resolution, denoising, and 3D occupancy prediction tasks.
Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving: This paper proposes AdvRoad, a two-stage framework (Road-Style Adversary Generation + Scenario-Associated Adaptation) that generates diverse adversarial posters with road-surface texture styles. These posters induce "ghost objects" (false positives) in visual 3D detectors for autonomous driving while remaining inconspicuous to human drivers due to their natural appearance, significantly improving the stealthiness of false-positive (FP) attacks and their resistance to defenses.
LiDAR-GS++: Improving LiDAR Gaussian Reconstruction via Diffusion Priors: This paper proposes LiDAR-GS++, which introduces a controllable LiDAR diffusion generative model as a prior to perform extended reconstruction of a neural 2DGS field. The method addresses the severe degradation in reconstruction quality under extrapolated viewpoints (e.g., lane-change scenarios) encountered in single-pass LiDAR scanning, achieving state-of-the-art performance on both interpolated and extrapolated views across multiple public benchmarks.
LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences: This paper proposes LiDARCrafter, the first 4D generative world model targeting LiDAR, which achieves controllable 4D LiDAR sequence generation and editing through a pipeline of text → scene graph → three-branch diffusion layout → range-image diffusion generation → autoregressive temporal extension, comprehensively surpassing existing methods on nuScenes.
LiNeXt: Revisiting LiDAR Completion with Efficient Non-Diffusion Architectures: This paper proposes LiNeXt, a lightweight non-diffusion network for LiDAR 3D scene completion. Through a Distance-aware Selective Repetition (DSR) strategy, a Noise-to-Coarse (N2C) module, and a Refine module, LiNeXt directly reconstructs complete point clouds. On SemanticKITTI, it achieves 199.8× faster inference than LiDiff, reduces Chamfer Distance by 50.7%, and uses only 6.1% of LiDiff's parameters.
LUCID: Learning-Enabled Uncertainty-Aware Certification of Stochastic Dynamical Systems: This paper proposes LUCID, the first verification engine capable of providing quantified safety guarantees for black-box stochastic dynamical systems. By combining data-driven control barrier certificates, conditional mean embeddings, and finite Fourier kernel expansions, LUCID reformulates a semi-infinite non-convex optimization problem into a tractable linear program.
MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation: This paper proposes MambaSeg, which employs a dual-branch parallel Mamba encoder to process RGB images and event streams separately, with a Dual-Dimension Interaction Module (DDIM) for fine-grained cross-modal fusion along both spatial and temporal dimensions. It achieves state-of-the-art mIoU of 77.56% on DDD17 and 75.10% on DSEC with only 25.44M parameters, offering substantially better efficiency than Transformer-based approaches.
Meta Dynamic Graph for Traffic Flow Prediction: This paper proposes MetaDG, a framework that generates dynamic node representations at each time step and enhances them via spatio-temporal correlation, extending dynamism modeling beyond merely updating the adjacency matrix to simultaneously generating meta-parameters, adjacency matrices, and edge-weight adjustment matrices. This enables unified spatio-temporal heterogeneity modeling (ST-unification) and achieves state-of-the-art performance on four benchmark datasets: PEMS03/04/07/08.
Minimum-Cost Network Flow with Dual Predictions: This paper presents the first learning-augmented algorithm for minimum-cost network flow (MCF) based on dual predictions. By warm-starting the classical ε-relaxation algorithm with machine-learned dual solutions, the proposed method ties its complexity bound to the $\ell_\infty$-norm of the prediction error (achieving both consistency and robustness), and demonstrates average speedups of 12.74× on traffic networks and 1.64× on chip escape routing.
MOBA: A Material-Oriented Backdoor Attack against LiDAR-based 3D Object Detection: This paper proposes MOBA (Material-Oriented Backdoor Attack), the first physically realizable backdoor attack framework grounded in material reflectance modeling. It systematically selects titanium dioxide (TiO₂) as the trigger material and employs an angle-independent approximation of the Oren-Nayar BRDF model for LiDAR intensity simulation, achieving an attack success rate (ASR) of 93.50% on real physical data—more than 41% above existing methods.
Multimodal Data Fusion to Capture Dynamic Interactions between Built Environment and Vulnerable Older Adults: This paper proposes a multimodal data fusion framework that integrates eye-tracking, motion sensors (IMU), physiological monitoring (EDA/HRV), GPS, and video recording to dynamically characterize interactions between vulnerable older adults (with knee osteoarthritis or fall history) and the urban built environment. Through AI-driven data fusion, the framework identifies urban street segments that significantly influence walking behavior and perception at a microscopic scale, providing evidence-based support for age-friendly urban planning.
SPARC: OOD Generalization for Controlling 100 Unseen Vehicles with a Single Policy: This paper proposes SPARC (Single-Phase Adaptation for Robust Control), which unifies the two-phase context encoding and history-based adaptation of RMA into a single-phase training procedure. Using a single policy in the high-fidelity Gran Turismo 7 racing simulator, SPARC achieves state-of-the-art OOD generalization performance across 100+ unseen vehicles.
PriorDrive: Enhancing Online HD Map Construction with Unified Vector Priors: This paper proposes PriorDrive, a framework that encodes multiple types of vectorized prior maps (SD maps, outdated HD maps, historical prediction maps) into a unified representation via a Unified Vector Encoder (UVE) and Hybrid Prior Representation (HPQuery), and integrates them into various online mapping models. It achieves a +14.3 mAP improvement on nuScenes and is compatible with both query-based and non-query-based mapping architectures.
RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis: This paper presents RacketVision—the first large-scale benchmark covering three racket sports (table tennis, tennis, and badminton)—which introduces racket pose annotations for the first time and defines three interconnected tasks: ball tracking, racket pose estimation, and ball trajectory prediction. The work reveals the critical role of cross-attention fusion in multimodal trajectory prediction.
RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving: This paper proposes RadarMP — the first unified architecture that jointly addresses mmWave radar object detection and scene flow estimation. It leverages energy flow consistency across adjacent-frame radar echo signals (tesseracts) for self-supervised training, achieving a detection probability of 69.5% (far exceeding the prior best of 44.1%) while enabling accurate 3D scene motion perception.
RAST: A Retrieval Augmented Spatio-Temporal Framework for Traffic Prediction: This work introduces the RAG paradigm into spatio-temporal forecasting by maintaining a dual-dimensional memory bank to store historical spatio-temporal patterns and retrieving them at inference time for fusion. The resulting general-purpose retrieval-augmented spatio-temporal prediction framework, RAST, achieves state-of-the-art performance on six traffic datasets while requiring only 1/12 the GPU memory of competing methods.
ReflexDiffusion: Reflexion-Enhanced Trajectory Planning for High Lateral Acceleration in Autonomous Driving: This paper proposes ReflexDiffusion, which introduces a physics-aware reflection mechanism during the inference stage of diffusion models. By injecting gradients to enforce curvature-velocity-acceleration coupling constraints ($a_y = \kappa v^2$), the method achieves a 14.1% improvement in driving score on nuPlan high-lateral-acceleration long-tail scenarios. The architecture-agnostic design allows direct deployment on existing diffusion planners.
Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception: This paper proposes HAT (multiple Hypotheses spAtio-Temporal alignment), a plug-and-play spatio-temporal alignment module that generates alignment hypotheses via multiple explicit motion models and adaptively decodes the optimal alignment using motion cues latent in queries. HAT consistently improves multiple 3D temporal detectors and trackers on nuScenes, and reduces collision rates by 32–48% in end-to-end autonomous driving.
RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System: This paper introduces RoadSceneVQA—the first large-scale visual question answering dataset for roadside perception scenarios (34,736 QA pairs)—and proposes the RoadMind model, which significantly improves lightweight MLLM performance on traffic scene reasoning through CogniAnchor Fusion (CAF) and Assisted Decoupled Chain-of-Thought (AD-CoT), enabling a 0.9B-parameter model to surpass 8B-parameter counterparts.
SparseCoop: Cooperative Perception with Kinematic-Grounded Queries: This paper proposes SparseCoop—the first fully sparse cooperative perception framework—which abandons dense BEV representations entirely through kinematic-grounded queries (KGQ), a coarse-to-fine aggregation module, and a cooperative instance denoising strategy. SparseCoop achieves state-of-the-art performance on V2X-Seq and Griffin datasets with minimal communication overhead and maximum computational efficiency (AP 0.530, transmission cost only 3.17×10⁴ BPS).
STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes: This paper constructs STRIDE-QA, the largest spatiotemporal reasoning VQA dataset in autonomous driving (270K frames, 16M QA pairs), defines three categories of spatiotemporal reasoning tasks (object-centric spatial / ego-centric spatial / ego-centric spatiotemporal), and demonstrates that fine-tuning a VLM raises localization success rate from near zero to 55% and temporal localization consistency from 0 to 28.4%.
Task Prototype-Based Knowledge Retrieval for Multi-Task Learning from Partially Annotated Data: This paper proposes a task prototype-based knowledge retrieval framework that employs learnable Task Prototypes to encode task characteristics and quantify inter-task affinities, and a Knowledge Retrieval Transformer to adaptively refine feature representations based on task-affinity scores. The framework addresses multi-task learning from partially annotated data (MTPSL) without relying on predictions from unannotated tasks, achieving state-of-the-art performance on PASCAL-Context and NYUD-v2.
TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training: This paper proposes TawPipe—a topology-aware weight pipeline parallelism framework comprising three components: group-based weight scheduling, device-bound storage, and communication-computation overlap. By exploiting the hierarchical bandwidth characteristics of distributed clusters, TawPipe achieves throughput improvements of 11.8%/23.6%/44.1% over WeiPipe/1F1B/FSDP respectively when training LLaMA models on 24 GPUs, while reducing communication time by 82.1%.
TimeBill: Time-Budgeted Inference for Large Language Models: This paper proposes TimeBill, a framework that adaptively adjusts the KV cache eviction ratio under a given time budget via a fine-grained Response Length Predictor (RLP) and a workload-guided Execution Time Estimator (ETE), maximizing LLM response quality while guaranteeing inference completion rate.
Towards 3D Object-Centric Feature Learning for Semantic Scene Completion: This paper proposes Ocean, a framework that leverages instance masks extracted by MobileSAM to guide 3D object-centric feature learning. Through Semantic Group Attention (SGA3D) and Global Similarity-Guided Attention (GSGA), Ocean achieves instance-level feature aggregation in 3D space, and refines scene representations via an Instance-aware Local Diffusion (ILD) module, attaining state-of-the-art performance on SemanticKITTI and SSCBench-KITTI360.
TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions: This paper presents TSBOW, a large-scale CCTV-based traffic surveillance dataset comprising 198 videos, over 32 hours of real-world traffic data, and 3.2 million frames. It covers all four seasons and a range of weather conditions (clear, haze, rain, and snow, including extreme disaster scenarios) and spans 8 categories of traffic participants, with a focus on occluded vehicle detection under adverse weather conditions.
Understanding Dynamic Scenes in Egocentric 4D Point Clouds: This work introduces EgoDynamic4D — the first egocentric QA benchmark targeting highly dynamic 4D scenes (927K QA pairs, 12 task types) — and proposes an end-to-end spatiotemporal reasoning framework that compresses large-scale 4D scenes into LLM-processable token sequences via instance-aware feature encoding, temporal encoding, camera encoding, and adaptive downsampling.
Unleashing Semantic and Geometric Priors for 3D Scene Completion: This paper proposes FoundationSSC, a framework that unleashes the semantic and geometric priors of Vision Foundation Models through a dual-level decoupling design at both the source level and pathway level. Combined with an Axis-Aware Fusion module for integrating complementary 3D features, the method achieves state-of-the-art performance of 19.32 mIoU / 48.12 IoU on SemanticKITTI.
Unlocking Efficient Vehicle Dynamics Modeling via Analytic World Models: This paper proposes Analytic World Models (AWMs), which exploit the gradients available from differentiable simulators to design three world modeling tasks (relative odometry, optimal planners, and inverse optimal state estimation), enabling efficient end-to-end training of state predictors without trial-and-error search. The approach is validated on the Waymax autonomous driving simulator.
Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction: This paper proposes the first vision-only semantic occupancy prediction framework that uses sparse 3D semantic Gaussian primitives as the communication medium for collaborative perception. Through ROI cropping, rigid transformation of Gaussians, and a neighborhood fusion module to suppress noise and redundancy, the method achieves +8.42 mIoU over the single-agent baseline and +3.28 mIoU over the baseline collaborative method.
Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions: This paper introduces LRGait — the first LiDAR-Camera multimodal gait dataset targeting long-range (10–50m) cross-distance scenarios — and proposes EMGaitNet, an end-to-end framework that achieves 2D-3D cross-modal feature fusion via Semantic Mining (SeMi), Semantic-Guided Alignment (SGA), and Symmetric Cross-Attention Fusion (SCAF) modules, reaching state-of-the-art performance on multiple benchmarks.
When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework: This paper presents EvReID, the first large-scale RGB-Event person re-identification dataset (1,200 identities / 118,988 image pairs), and proposes TriPro-ReID, a three-stage contrastive learning framework guided by pedestrian attributes. The framework leverages positive-negative attribute prompts and cross-modal prompt fusion to integrate RGB and Event modality features, achieving 69.3% mAP.
WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving: WorldRFT is a planning-oriented latent world model framework that integrates VGGT-based spatial encoding, hierarchical planning decomposition with local-aware iterative refinement, and GRPO-based collision-aware reinforcement fine-tuning. It reduces collision rate by 83% on nuScenes (0.30% → 0.05%) and achieves near-LiDAR SOTA performance using camera only on NavSim (87.8 vs. 88.1 PDMS).
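
The coupling constraint $a_y = \kappa v^2$ quoted in the ReflexDiffusion entry above can be illustrated with a few lines of arithmetic. The sketch below is not the paper's method; it only evaluates the constraint and inverts it to get a speed cap for a given curvature. The function names and the lateral-acceleration limit `a_y_limit` are hypothetical, introduced here for illustration.

```python
import math


def lateral_acceleration(curvature: float, speed: float) -> float:
    """Return a_y = kappa * v^2 (curvature in 1/m, speed in m/s, result in m/s^2)."""
    return curvature * speed ** 2


def max_speed_for_curve(curvature: float, a_y_limit: float) -> float:
    """Largest speed that keeps |a_y| within a_y_limit (a hypothetical comfort/safety bound)."""
    if curvature == 0.0:
        return math.inf  # a straight segment imposes no lateral-acceleration cap
    return math.sqrt(a_y_limit / abs(curvature))


# Example: a 50 m radius curve (kappa = 1/50 = 0.02 1/m) taken at 15 m/s.
a_y = lateral_acceleration(0.02, 15.0)       # 0.02 * 225 = 4.5 m/s^2
v_cap = max_speed_for_curve(0.02, 4.0)       # sqrt(4.0 / 0.02) = sqrt(200) m/s
```

A planner that enforces the constraint would, in effect, penalize or correct trajectory points whose speed exceeds `v_cap` for the local curvature; how ReflexDiffusion injects that signal as gradients is described in the paper itself.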

📦 Model Compression¶

AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization: To address the severe inference latency overhead (250%–950%) of dynamic MoE-LoRA adapters, this paper proposes a token-level pre-gating architecture that performs a single global routing decision at the first layer. Combined with a custom SGMM fused CUDA kernel that merges all activated LoRA adapters into the backbone in one shot, the approach cuts decoding latency by a factor of 2.4 while preserving model accuracy.
AgentODRL: A Large Language Model-based Multi-agent System for ODRL Generation: This paper proposes AgentODRL, an LLM-based multi-agent system built on an Orchestrator-Workers architecture that converts natural language data usage rules into high-quality ODRL policies through task decomposition, a syntax validation loop, and a LoRA-driven semantic reflection mechanism.
ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs: This paper proposes ALTER, a framework that combines an asymmetric LoRA architecture with token-level Tsallis entropy guidance to achieve precise unlearning of target knowledge in LLMs. A parameter isolation mechanism is employed to preserve the model's general capabilities, achieving state-of-the-art performance on three benchmarks: TOFU, WMDP, and MUSE.
BD-Net: Has Depth-Wise Convolution Ever Been Applied in Binary Neural Networks?: This paper proposes BD-Net, which for the first time successfully integrates depth-wise convolution (DWConv) into binary neural networks (BNNs) by introducing 1.58-bit convolution and pre-BN residual connections. BD-Net achieves a new state of the art in the BNN domain on ImageNet with an extremely low computational cost of 33M OPs, with accuracy improvements of up to 9.3 percentage points across multiple datasets.
Beyond Sharpness: A Flatness Decomposition Framework for Efficient Continual Learning: This paper proposes FLAD, a framework that decomposes the sharpness-aware perturbation direction into a gradient-aligned component and a stochastic-noise component, retaining only the noise component for regularization. By combining zeroth-order and first-order sharpness, FLAD improves generalization in continual learning with minimal additional computational overhead.
CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis: This paper introduces the concept of "micro-expert" to decompose MoE layer outputs as cross-matrix (up/gate/down_proj) linear combinations, enabling structured pruning (Camera-P) and mixed-precision quantization (Camera-Q) based on energy ranking. On Deepseek-MoE-16B, Qwen2-57B, and Qwen3-30B at 20%–60% sparsity, the method comprehensively outperforms NAEE and D²-MoE; analysis of Qwen2-57B requires less than 5 minutes on a single A100 GPU.
Can You Tell the Difference? Contrastive Explanations for ABox Entailments: This paper proposes a formal framework for Contrastive ABox Explanations (CE) to answer questions of the form "Why is $a$ an instance of $C$ but $b$ is not?", simultaneously accounting for positive entailments and missing entailments within Description Logic knowledge bases, and analyzes the computational complexity under different description logics and optimization criteria.
CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening: This paper proposes CLIPPan, which fine-tunes CLIP in a parameter-efficient manner to understand multispectral/panchromatic/high-resolution multispectral image types and the pansharpening process, then leverages text prompts encoding Wald's protocol as semantic supervision signals to enable full-resolution unsupervised pansharpening without ground truth. CLIPPan operates as a plug-and-play module compatible with arbitrary pansharpening backbone networks.
Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers: This paper proposes Sequential Learning with Drift Compensation (SLDC), which learns latent space transformation operators (linear / weakly nonlinear) to compensate for distribution drifts induced by sequential fine-tuning of pre-trained ViTs in class-incremental learning. Combined with knowledge distillation, the approach achieves performance close to the joint-training upper bound.
Condensed Data Expansion Using Model Inversion for Knowledge Distillation: This paper proposes using condensed datasets as prototypes to guide the model inversion (MI) process. A feature-alignment discriminator enforces distributional consistency between synthesized data and condensed samples, thereby expanding the condensed dataset for knowledge distillation. The method achieves up to 11.4% improvement over standard MI-based distillation on CIFAR/ImageNet.
Correcting False Alarms from Unseen: Adapting Graph Anomaly Detectors at Test Time: This paper proposes TUNE, a plug-and-play test-time adaptation framework that addresses the "normality shift" problem in graph anomaly detection—caused by the emergence of new normal node categories—by transforming node features via a graph aligner. It leverages the degree of aggregation contamination as an unsupervised adaptation signal and significantly enhances the generalization of various pretrained GAD models across 10 real-world datasets.
Credal Ensemble Distillation for Uncertainty Quantification: This paper proposes the Credal Ensemble Distillation (CED) framework, which distills a deep ensemble (DE) teacher into a single-model student called CREDIT. Rather than predicting a single softmax distribution, CREDIT outputs class probability intervals that define a credal set, achieving superior or comparable uncertainty estimation on OOD detection tasks while substantially reducing inference overhead (from 5× to 1×).
CTPD: Cross Tokenizer Preference Distillation: This paper proposes Cross-Tokenizer Preference Distillation (CTPD), the first unified framework supporting preference distillation across heterogeneous tokenizers. Through three key innovations—Aligned Span Projection, cross-tokenizer importance weighting, and Teacher-Anchored Reference—CTPD achieves substantial improvements over existing methods on multiple benchmarks.
Distilling Cross-Modal Knowledge via Feature Disentanglement: This paper proposes Frequency-Decoupled Cross-Modal Knowledge Distillation (FD-CMKD), which decomposes teacher and student features into low-frequency (modality-shared semantics) and high-frequency (modality-specific details) components via Fourier transform, applies strong-consistency MSE and weak-consistency logMSE losses respectively, and introduces scale normalization along with shared classifier alignment to bridge the feature space. FD-CMKD consistently outperforms existing distillation methods across multiple cross-modal scenarios including audio–visual, image–text, and semantic segmentation.
Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs: This paper proposes PUMA, a framework that leverages lightweight adapters and a grouped user selection strategy to efficiently migrate personalized soft prompts from a source LLM to a target LLM with a different architecture. PUMA matches or surpasses from-scratch training on three large-scale datasets while reducing computational cost by up to 98%.
DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation: DOS is a framework that distills semantic softmaps exclusively over observable (unmasked) points, combined with Zipf-Sinkhorn regularization based on a Zipfian prior to handle the long-tail distribution of 3D semantics. It achieves state-of-the-art self-supervised learning performance on six 3D benchmarks, reaching 95% of supervised performance under linear probing.
DP-GenG: Differentially Private Dataset Distillation Guided by DP-Generated Data: This paper proposes DP-GenG, a framework that leverages differentially private generated data (DP-generated data) to guide three stages of dataset distillation — initialization, feature matching, and expert calibration — significantly improving the utility and privacy protection of the distilled dataset under a limited privacy budget.
DynaQuant: Dynamic Mixed-Precision Quantization for Learned Image Compression: To address the deployment inefficiency of learned image compression (LIC) models, this paper proposes DynaQuant, a framework that achieves content-adaptive quantization at the parameter level via learnable scale/zero-point combined with a Distance-Aware Gradient Modulator, and dynamically assigns optimal bit-widths per layer at the architecture level via a lightweight Bit-Width Selector. Across three baselines (Cheng2020, ELIC, Ballé), the framework achieves near-FP32 R-D performance while delivering up to 5.17× speedup and reducing model size to approximately 1/4 of the original.
Earth-Adapter: Bridge Geospatial Domain Gaps with Mixture of Frequency Adaptation: This paper proposes Earth-Adapter, the first parameter-efficient fine-tuning (PEFT) method specifically designed to address artifact problems in remote sensing imagery. Through a frequency-guided Mixture of Adapters (MoA), features are decomposed into high- and low-frequency subspaces, independently optimized, and then dynamically aggregated. The method outperforms the baseline Rein across three settings: remote sensing semantic segmentation (SS), domain adaptation (DA), and domain generalization (DG).
EEG-DLite: Dataset Distillation for Efficient Large EEG Model Training: This paper proposes EEG-DLite, a dataset distillation framework that combines self-supervised encoding, outlier filtering, and diversity sampling to compress a 2,500-hour EEG dataset to just 5% of its original size, achieving performance comparable to or exceeding full-data pretraining while reducing GPU pretraining time from 30 hours to 2 hours.
Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression: This paper proposes CGRS (Certainty-Guided Reflection Suppression), a training-free efficient reasoning method that dynamically suppresses reflection trigger tokens (e.g., "Wait", "But") when the model exhibits high confidence, reducing token consumption of large reasoning language models by 18.5%–41.9% while maintaining reasoning accuracy.
EfficientFSL: Enhancing Few-Shot Classification via Query-Only Tuning in Vision Transformers: This paper proposes EfficientFSL, a query-only parameter-efficient fine-tuning framework for ViT-based few-shot classification. Through three components — the Forward Block (decoupled active/frozen sub-blocks), the Combine Block (adaptive multi-layer feature fusion), and the SQ Attention Block (support-query distribution alignment) — EfficientFSL achieves state-of-the-art performance on 4 in-domain and 6 cross-domain benchmarks using only 1.25M–2.48M trainable parameters.
Explore and Establish Synergistic Effects between Weight Pruning and Coreset Selection: This paper presents the first systematic investigation of the interaction between weight pruning and coreset selection, proposing the SWaST mechanism to alternately perform both operations and establish synergistic effects, while introducing a state preservation mechanism to address the "dual loss" problem, achieving up to 17.83% accuracy improvement under 10%–90% FLOPs reduction.
Failures to Surface Harmful Contents in Video Large Language Models: This paper presents the first systematic security analysis of VideoLLMs, identifying three structural design flaws — sparse temporal sampling, spatial token downsampling, and modality fusion imbalance — that cause clearly visible harmful content in videos to be omitted from model-generated textual summaries (omission rate exceeding 90%). Three zero-query black-box attacks are designed to empirically validate the severity of these vulnerabilities.
First-Order Error Matters: Accurate Compensation for Quantized Large Language Models: This paper identifies a critical yet overlooked issue in LLM post-training quantization: the column-wise compensation process renders first-order gradient terms non-negligible. The proposed FOEM method incorporates first-order terms into the error compensation formula, reducing the perplexity of Llama3-8B under 3-bit quantization by 17.3% with virtually no additional computational overhead.
HCF: Hierarchical Cascade Framework for Distributed Multi-Stage Image Compression: This paper proposes the HCF framework, which performs cross-node transformation directly in the latent space (avoiding pixel-domain recompression) and introduces policy-driven quantization control to achieve up to 12.64% BD-Rate PSNR improvement in distributed multi-stage image compression, while reducing FLOPs by up to 97.8% and GPU memory by up to 96.5%.
Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring: This paper proposes the HPO framework, which achieves reliable AI tutoring evaluation through a three-phase pipeline (Intelligence Distillation → Adversarial Debate → Synthesis and Judgment). Using only an 8B-parameter model, HPO achieves a Macro F1 of 0.845 on the MRBench middle-school mathematics dialogue dataset, surpassing GPT-4o (0.812) by 0.033 (3.3 points), demonstrating that interaction structure, rather than model scale, is the key to reliable AI tutoring.
InfoCom: Kilobyte-Scale Communication-Efficient Collaborative Perception with Information-Aware Feature Compression: This paper proposes InfoCom, a framework that applies an extended information bottleneck (IB) principle to compress the communication payload of collaborative perception from the MB scale to the KB scale—a 440× reduction compared to Where2comm—while maintaining near-lossless perception performance. The framework consists of three core modules: information-aware encoding, sparse mask generation, and multi-scale decoding.
KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache: This paper proposes KVmix, which evaluates the importance of each layer's KV Cache by computing the $L_2$ norm of gradients with respect to Key/Value projection weights, enabling layer-wise mixed-precision quantization (Key avg. 2.19-bit, Value avg. 2.38-bit). Combined with a dynamic Recent Pivotal Context (RPC) selection strategy, KVmix achieves near-lossless inference, 4.9× memory compression, and 5.3× throughput acceleration on models such as Llama and Mistral.
LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence: This paper proposes LexChronos, a dual-agent iterative framework for extracting structured event timelines from Indian Supreme Court judgments. A LoRA fine-tuned extraction agent identifies candidate events, while a pretrained feedback agent scores and refines them through a confidence-driven loop. The system achieves a BERT F1 of 0.8751 on a synthetic dataset, and the structured timelines are preferred by GPT-4 over unstructured baselines in 75% of downstream legal summarization cases.
Lightweight Optimal-Transport Harmonization on Edge Devices: This paper proposes MKL-Harmonizer, which leverages the Monge-Kantorovich Linear (MKL) mapping from classical optimal transport theory to train a compact encoder that predicts 12-dimensional color transformation parameters, enabling real-time image color harmonization on edge devices. The method achieves state-of-the-art performance on the combined perceptual quality–speed metric in AR scenarios.
Parametric Pareto Set Learning for Expensive Multi-Objective Optimization: This paper proposes the PPSL-MOBO framework, which employs a hypernetwork + LoRA architecture to learn a unified mapping from preference vectors and extrinsic parameters to Pareto-optimal solutions. Combined with Gaussian process surrogate models and hypervolume improvement acquisition strategies, the framework efficiently addresses expensive parametric multi-objective optimization problems.
PocketLLM: Ultimate Compression of Large Language Models via Meta Networks: PocketLLM proposes compressing LLM weight vectors in a latent space via meta networks (encoder–codebook–decoder), replacing the original weight matrices with a small decoder, a compact codebook, and index arrays. The method achieves 10× compression on Llama 2-7B with negligible accuracy degradation, breaking the accuracy bottleneck of traditional quantization and pruning approaches under extreme compression ratios.
Post Training Quantization for Efficient Dataset Condensation: This work is the first to apply post-training quantization (PTQ) to dataset distillation, proposing a patch-based quantization framework (PAQ + grouping + refinement) that roughly doubles the test accuracy of distilled datasets in the extreme 2-bit regime (e.g., DM IPC=1 improves from 26.0% to 54.1%). The framework is plug-and-play and can be applied to various distillation methods.
Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval: This paper proposes PSCA, a two-stage framework that establishes class-level semantic connections via orthogonal prototypes, dynamically corrects pseudo-label reliability through geometric-semantic consistency alignment, and learns hash codes on reconstructed features, achieving substantial improvements over existing methods on multiple cross-domain retrieval benchmarks.
Put the Space of LoRA Initialization to the Extreme to Preserve Pre-trained Knowledge: This paper proposes LoRA-Null, which initializes LoRA within the null space of pre-trained input activations (rather than the null space of weights). From an information-theoretic perspective, the effective rank of activations is much lower than that of weights, meaning their null space encodes less pre-trained knowledge, thereby substantially mitigating catastrophic forgetting during fine-tuning.
QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching: QuEPT is an elastic precision quantization framework that enables real-time switching among arbitrary predefined bit-widths on ViT/LLM/MLLM after a single calibration pass, via two core modules—Multi-Bit Token Merging and Multi-Bit Cascaded LoRA—achieving performance on par with or exceeding single-bit-width SOTA PTQ methods.
Reinforced Rate Control for Neural Video Compression via Inter-Frame Rate-Distortion Awareness: This paper proposes the first reinforcement learning rate control framework based on Constrained Markov Decision Processes (CMDP), which jointly captures intra-frame content features and inter-frame rate-distortion coupling dependencies via spatiotemporal state modeling, and directly maps these to per-frame coding parameters. The approach reduces the average bitrate error to 1.20% and achieves BD-Rate savings of up to 13.98% across multiple neural video codecs.
Renormalization Group Guided Tensor Network Structure Search: This paper proposes RGTN, a framework that introduces Renormalization Group (RG) theory from statistical physics into tensor network structure search. Through a multi-scale coarse-graining–expansion–compression pipeline and learnable edge gating, RGTN enables continuous topological evolution, achieving state-of-the-art compression ratios on light field compression, high-order tensor decomposition, and video completion tasks, while running 4–600× faster than existing methods.
Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling: This paper proposes the first uni-level dataset distillation framework for long-tailed distributions. Through three core strategies — expert model debiasing, fair BN statistics calibration, and confidence-guided initialization — the method achieves +15.6% on CIFAR-100-LT and +11.8% on Tiny-ImageNet-LT, comprehensively outperforming DAMED.
SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication: SafeSieve is a progressive, adaptive pruning framework for multi-agent communication. Through a two-stage edge scoring mechanism that combines semantic-heuristic initialization with history-feedback-driven refinement, together with 0-extension clustering, SafeSieve achieves 94.01% average accuracy across 6 benchmarks while reducing token consumption by 12.4%–27.8%, and demonstrates inherent robustness against prompt injection attacks.
Satisficing and Optimal Generalised Planning via Goal Regression (Extended Version): This paper presents the Moose planner, which synthesises generalised planning programs from training problems via goal regression. It decomposes multi-goal problems into single-goal subproblems, solves each optimally, and applies regression followed by lifting to produce a set of first-order condition-action rules. These rules support either satisficing planning (direct rule execution) or optimal planning (encoded as axioms to prune the search space).
Share Your Attention: Transformer Weight Sharing via Matrix-Based Dictionary Learning: Inspired by dictionary learning, this paper proposes the MASA framework, which decomposes the attention projection matrices (Q/K/V/O) across Transformer layers into linear combinations of shared matrix atoms, achieving performance on par with or superior to the original Transformer at a 66.7% attention parameter compression ratio.
Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning: SharpV proposes a two-stage training-free visual token pruning framework. In the Pre-LLM stage, it adaptively adjusts the pruning ratio per frame based on spatiotemporal information; in the Intra-LLM stage, it prunes the KV Cache based on a visual information degradation hypothesis. SharpV is the first method to achieve full compatibility with Flash Attention, retaining approximately 12% of tokens while matching or surpassing dense model performance across multiple video understanding benchmarks.
SIGN: Schema-Induced Games for Naming: SIGN introduces lightweight message schemas (e.g., @say {name: Ck}) into LLM multi-agent naming games, demonstrating that structured priors can improve group convention agreement by up to 5.8×, reduce convergence token cost by an order of magnitude, and provide a simple, controllable "tuning knob" for efficient multi-agent coordination.
SkipCat: Rank-Maximized Low-Rank Compression of Large Language Models via Shared Projection and Block Skipping: SkipCat proposes a rank-maximized low-rank compression framework that introduces two techniques—intra-layer shared projection (Cat) and block skipping (Skip)—to retain more effective rank under the same compression ratio. Without any fine-tuning, it achieves up to 7% accuracy improvement on zero-shot tasks over existing low-rank methods.
SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder: SparseRM leverages sparse autoencoders (SAE) to extract preference-relevant directions from LLM intermediate representations, constructing a lightweight reward model via projection vectors. With fewer than 1% trainable parameters, it surpasses most mainstream reward models and demonstrates stronger generalization in online iterative alignment frameworks.
SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization: SpecQuant proposes a two-stage quantization framework based on adaptive Fourier-domain decomposition: it first smoothly migrates activation outliers into weights, then suppresses high-frequency noise in the weights via channel-wise low-frequency Fourier truncation. On LLaMA-3 8B, W4A4 quantization incurs only 1.5% accuracy degradation while delivering 2× speedup and 3× memory savings.
Steering Pretrained Drafters during Speculative Decoding: This paper proposes SD², which extracts steering vectors from verifier hidden states and injects them into the MLP layers of a pretrained drafter, achieving dynamic drafter–verifier alignment in speculative decoding. Under standard sampling, the number of accepted tokens increases by up to 35% with negligible computational overhead.
StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs Through Knowledge-Reasoning Fusion: This paper proposes the ThinkingF pipeline, which enhances LLMs' formal language domain knowledge via large-scale knowledge distillation and their informal-to-formal reasoning ability via template-guided reasoning trajectory synthesis. These two capabilities are then integrated through a two-stage SFT followed by RLVR. The resulting 7B/32B models achieve state-of-the-art performance on FormalMATH-Lite and ProverBench.
Stratified Knowledge-Density Super-Network for Scalable Vision Transformers: This paper proposes transforming a pretrained ViT into a "Stratified Knowledge-Density Super-Network" (SKD Super-Network) via two steps—WPAC (Weighted PCA Attention Contraction) and PIAD (Progressive Importance-Aware Dropout)—to hierarchically organize knowledge within the pretrained weights, enabling subnetwork extraction of arbitrary size at O(1) cost without additional fine-tuning, achieving performance on par with or surpassing state-of-the-art compression methods.
TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution: This paper proposes TGDD, which reframes static distribution matching as a dynamic alignment process along training trajectories. It captures evolving semantics via Stage-wise Distribution Matching and reduces inter-class overlap via Stage-wise Distribution Constraint, achieving SOTA on 10 datasets with a 5.0% accuracy gain on high-resolution benchmarks.
Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing: This paper proposes AsymVPR, an efficient asymmetric framework for Visual Place Recognition (VPR), which replaces expensive k-NN precomputation with a Geographical Memory Bank and bridges the capacity gap between a lightweight query network and a high-capacity gallery network via Implicit Embedding Augmentation, achieving retrieval performance close to the full-size model using only ~8% of its FLOPs.
Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated: This paper proposes a lightweight post-hoc calibration framework grounded in Bayesian decision theory. By adding a learnable scalar offset α to the output logits of an existing detector, the method significantly improves detection accuracy under distribution shift without any retraining.
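The scalar-offset idea in the last entry can be illustrated with a minimal sketch. It assumes a small labeled calibration set drawn from the shifted distribution and fits α by gradient descent on binary cross-entropy. The note does not describe the paper's actual estimation procedure, so the fitting rule, the function names, and the toy data below are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logit_offset(logits, labels, lr=1.0, steps=300):
    """Fit one scalar alpha so that sigmoid(logits + alpha) matches the labels.

    Assumption: labeled calibration data is available; alpha is found by
    plain gradient descent on binary cross-entropy. The detector is frozen.
    """
    alpha = 0.0
    for _ in range(steps):
        # d(BCE)/d(alpha) = mean(predicted probability - label)
        grad = np.mean(sigmoid(logits + alpha) - labels)
        alpha -= lr * grad
    return alpha

def accuracy(logits, labels, alpha=0.0):
    preds = (logits + alpha) > 0
    return float(np.mean(preds == (labels == 1)))

# Toy detector: well separated logits, pushed down by 3 to mimic a shift.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
logits = 2.0 * (2 * labels - 1) + rng.normal(size=2000) - 3.0

alpha = fit_logit_offset(logits, labels)
acc_before = accuracy(logits, labels)
acc_after = accuracy(logits, labels, alpha)
print(f"alpha={alpha:.2f}  accuracy {acc_before:.3f} -> {acc_after:.3f}")
```

Only one number is learned here, which is why this kind of correction can be applied to an existing detector without touching its weights.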

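For the MKL-Harmonizer entry in the list above, the classical Monge-Kantorovich linear map between two Gaussian color distributions has a closed form: $T = \Sigma_s^{-1/2}\,(\Sigma_s^{1/2}\Sigma_t\Sigma_s^{1/2})^{1/2}\,\Sigma_s^{-1/2}$, applied as $x \mapsto T(x - \mu_s) + \mu_t$. The sketch below computes that map directly from pixel statistics. It does not reproduce the paper's learned encoder, and reading the "12-dimensional" parameters as a 3×3 matrix plus a 3-vector offset is an assumption. Function names and the toy data are hypothetical.

```python
import numpy as np

def sqrtm_psd(m):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    w, v = np.linalg.eigh(m)
    w = np.clip(w, 0.0, None)
    return (v * np.sqrt(w)) @ v.T

def mkl_map(src, tgt, eps=1e-8):
    """Return (T, b) so that x @ T.T + b moves src color statistics onto tgt.

    src, tgt: arrays of shape (num_pixels, 3). eps regularizes the covariances.
    """
    dim = src.shape[1]
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False) + eps * np.eye(dim)
    cov_t = np.cov(tgt, rowvar=False) + eps * np.eye(dim)
    s_half = sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    T = s_inv_half @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half
    b = mu_t - T @ mu_s
    return T, b

# Toy "foreground" and "background" pixel clouds with different statistics.
rng = np.random.default_rng(0)
src = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3)) + np.array([0.2, 0.5, 0.1])
tgt = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3)) + np.array([0.7, 0.3, 0.4])

T, b = mkl_map(src, tgt)
mapped = src @ T.T + b  # 9 matrix entries + 3 offsets = 12 numbers
```

After the mapping, the mean and covariance of `mapped` agree with those of `tgt` up to the small regularization term.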
🛡️ AI Safety

Alternative Fairness and Accuracy Optimization in Criminal Justice: This paper provides a systematic review of three dimensions of algorithmic fairness (group fairness, individual fairness, and procedural fairness), proposes an improved group fairness optimization formulation based on tolerance constraints, and constructs a "Three Pillars of Fairness" deployment framework for public decision-making systems.
An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses: Under the sole assumption of $L$-smoothness (without convexity), this paper derives tighter closed-form RDP privacy bounds for DPSGD and, for the first time, provides a complete convergence/utility analysis in the bounded-domain setting, revealing that a smaller parameter domain diameter simultaneously improves both privacy and utility.
An Information Theoretic Evaluation Metric for Strong Unlearning: This paper exposes a fundamental flaw in existing black-box unlearning evaluation metrics (MIA, JSD, etc.)—modifying only the final classification head is sufficient to satisfy all black-box metrics while intermediate layers fully retain information about the forget set. The paper proposes IDI, a white-box metric that quantifies unlearning effectiveness by estimating, via InfoNCE, the mutual information between each layer's representations and the forget labels. It further proposes COLA, an unlearning method that achieves IDI scores approaching Retrain on CIFAR-10/100 and ImageNet-1K.
Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks: This paper proposes the Angular Gradient Sign Method (AGSM), which decomposes gradients in hyperbolic space into radial (hierarchical depth) and angular (semantic) components, applying perturbations exclusively along the angular direction to generate adversarial examples. AGSM achieves 5–13% greater accuracy degradation than standard FGSM/PGD on image classification and cross-modal retrieval tasks.
Authority Backdoor: A Certifiable Backdoor Mechanism for Authoring DNNs: This paper proposes Authority Backdoor, which embeds hardware fingerprints as backdoor triggers into DNNs so that models function correctly only on authorized devices, and achieves certifiable robustness against adaptive trigger reverse-engineering attacks via randomized smoothing.
Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification: This paper proposes the Manifold-Correcting Causal Flow (MC²F) framework, which employs a Stratified Riemannian Continuous Normalizing Flow (SR-CNF) to learn the manifold density of clean data embeddings for adversarial example detection, and subsequently applies a Geodesic Purification Solver to project detected adversarial embeddings back onto the clean manifold along geodesic paths. MC²F comprehensively surpasses state-of-the-art methods in adversarial robustness across SST-2, AGNews, and YELP benchmarks, while incurring no loss—and even achieving marginal gains—in clean accuracy.
Breaking the Dyadic Barrier: Rethinking Fairness in Link Prediction Beyond Demographic Parity: This paper identifies three fundamental flaws in dyadic fairness and Demographic Parity (ΔDP) for link prediction—insufficient GNN expressiveness, subgroup bias masking, and ranking insensitivity—and proposes a ranking-aware fairness metric based on NDKL and a post-processing algorithm MORAL, achieving state-of-the-art fairness–utility trade-offs across six datasets.
CoRe-Fed: Bridging Collaborative and Representation Fairness via Federated Embedding Distillation: This paper proposes CoRe-Fed, a framework that simultaneously addresses representation fairness and collaborative fairness in federated learning through two synergistic modules—embedding-level contrastive alignment and contribution-aware aggregation—achieving significant improvements in both fairness and generalization of the global model under heterogeneous data distributions.
DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks: This paper proposes DeepTracer, a robust watermarking framework that achieves deep coupling between the watermark task and the main task by combining three components: adaptive source-class selection (K-Means clustering for feature space coverage), a same-class coupling loss (aligning watermark samples with target-class samples in output space), and two-stage key sample filtering. Under 6 model stealing attacks (including hard-label and data-free settings), the watermark success rate averages 77–100%, substantially outperforming existing methods.
Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception: This paper establishes the first all-type (speech/sound/singing/music) audio deepfake detection benchmark and proposes Wavelet Prompt Tuning (WPT), which enhances full-band frequency perception of SSL features via discrete wavelet transform. Without increasing trainable parameters, WPT surpasses full fine-tuning and achieves an average EER of only 3.58% under co-training.
Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference: This paper proposes Directional Orthogonal Counterattack (DOC), a method that expands the search space during counterattack optimization by introducing orthogonal gradient components and momentum updates, and adaptively modulates counterattack intensity via a cosine-similarity-based Directional Sensitivity Score (DSS). DOC significantly improves the test-time adversarial robustness of CLIP across 16 datasets.
Easy to Learn, Yet Hard to Forget: Towards Robust Unlearning Under Bias: This paper proposes the CUPID framework, which partitions the forget set into causal and bias subsets via loss landscape sharpness analysis, identifies and disentangles causal and bias pathways within the model, and achieves precise class-level unlearning on biased models — effectively addressing the shortcut unlearning problem.
EFX and PO Allocation Exists for Two Types of Goods: This paper proves that an allocation satisfying both EFX (envy-freeness up to any good) and Pareto optimality always exists when goods are of only two types and all valuations are positive, and provides a near-linear-time algorithm.
Enhancing DPSGD via Per-Sample Momentum and Low-Pass Filtering: This paper proposes DP-PMLF, which reduces clipping bias via per-sample momentum and suppresses high-frequency DP noise via a low-pass filter, simultaneously addressing both sources of accuracy degradation in DPSGD for the first time.
Fair Model-Based Clustering: This paper proposes FMC, a fair clustering algorithm based on finite mixture models. By imposing fairness constraints on model parameters rather than sample-level assignments, FMC achieves scalable fair clustering whose parameter count is independent of the sample size $N$. It supports mini-batch learning and categorical data, and substantially outperforms existing methods on large-scale datasets.
FairGSE: Fairness-Aware Graph Neural Network without High False Positive Rates: This paper is the first to identify the "FPR shortcut" problem in fairness-aware GNNs — existing methods achieve favorable fairness metrics by massively misclassifying negative samples as positive — and proposes the FairGSE framework, which reweights graph edges by maximizing 2D structural entropy to simultaneously improve fairness and reduce the false positive rate, achieving a 39% reduction in FPR.
Fine-Grained DINO Tuning with Dual Supervision for Face Forgery Detection: This paper proposes DFF-Adapter (DeepFake Fine-Grained Adapter), a lightweight fine-tuning scheme for deepfake detection built upon DINOv2. A three-branch adapter (authenticity detection head, forgery type classification head, and shared head) is injected into each Transformer block. A Forgery-Aware Multi-Head Router enables subspace-level dynamic routing among LoRA experts. The auxiliary forgery type classification task enhances artifact sensitivity for the primary task. With only 3.5M trainable parameters, the method achieves state-of-the-art performance across multiple cross-dataset evaluations.
Generalizing Fair Clustering to Multiple Groups: Algorithms and Applications: This paper generalizes the Closest Fair Clustering problem from two groups to arbitrarily many groups, proves NP-hardness for the equal-proportion case with three or more groups, proposes near-linear-time approximation algorithms (with approximation factors $O(|\chi|^{1.6}\log^{2.81}|\chi|)$ for the equal-proportion case and $O(|\chi|^{3.81})$ for the arbitrary-proportion case), and extends the results to fair correlation clustering and fair consensus clustering.
Hashed Watermark as a Filter: A Unified Defense Against Forging and Overwriting Attacks in Neural Network Watermarking: This paper proposes NeuralMark—a weight-based watermarking method built on a hashed watermark filter. It leverages the SHAKE-256 hash function to derive irreversible binary watermarks from secret key matrices, which serve as private filters for selecting embedding parameters. The avalanche effect blocks gradient-based reverse engineering against forging attacks, while multi-round filtering minimizes parameter overlap to resist overwriting attacks. Effectiveness and robustness are validated across 13 CNN/Transformer architectures on 5 image classification tasks and 1 text generation task.
HealSplit: Towards Self-Healing through Adversarial Distillation in Split Federated Learning: This paper proposes HealSplit, the first unified defense framework for Split Federated Learning (SFL). It identifies poisoned samples via topology-aware scoring (TAS) on a graph built over smashed data, generates semantically consistent substitute representations using a GAN, and trains a consistency-validated student model through adversarial multi-teacher distillation. This end-to-end detect-and-recover pipeline substantially outperforms ten SOTA defense methods across five categories of poisoning attacks.
Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks: To address the query efficiency bottleneck in hard-label black-box adversarial attacks, this paper proposes ARS-OPT, a momentum-based algorithm grounded in Nesterov Accelerated Gradient (NAG), and its enhanced variant PARS-OPT that incorporates surrogate model priors. Theoretical convergence guarantees are established, and both methods outperform 13 state-of-the-art approaches on ImageNet and CIFAR-10.
InfoDecom: Decomposing Information for Defending Against Privacy Leakage in Split Inference: InfoDecom reduces redundant information in smashed data via two-level information decomposition (frequency-domain visual information removal + mutual information suppression), then injects closed-form Gaussian noise to obtain theoretical privacy guarantees, achieving a significantly better utility-privacy trade-off than existing methods under shallow client models.
Learning to Collaborate: An Orchestrated-Decentralized Framework for Peer-to-Peer Collaborative Learning: This paper proposes KNEXA-FL, a framework that models P2P collaboration as a contextual bandit problem via a Central Pairing Manager (CPM) that never accesses model parameters. Using LinUCB to learn optimal pairing strategies, KNEXA-FL achieves approximately 50% higher Pass@1 than random P2P in heterogeneous LLM federated learning, while avoiding the catastrophic collapse observed in centralized distillation.
Matrix-Free Two-to-Infinity and One-to-Two Norms Estimation: This paper proposes TwINEst and TwINEst++, two randomized algorithms based on the Hutchinson diagonal estimator, for efficiently estimating $\|A\|_{2\to\infty}$ and $\|A\|_{1\to 2}$ norms in a matrix-free setting. The algorithms come with provable oracle complexity guarantees and demonstrate significant advantages in Jacobian regularization for adversarial robustness in image classification and defense against adversarial attacks in recommender systems.
Minimizing Inequity in Facility Location Games: This paper studies the problem of minimizing the Maximum Group Effect in facility location games on the real line, proposing two strategyproof mechanisms—BALANCED and MAJOR-PHANTOM—that achieve tight approximation ratios in the single-facility setting. The framework unifies classical objectives (utilitarian social cost, egalitarian maximum cost) with group fairness objectives, and extends the endpoint mechanism to the two-facility setting.
MPD-SGR: Robust Spiking Neural Networks with Membrane Potential Distribution-Driven Surrogate Gradient Regularization: This work theoretically establishes a connection between SNN robustness error and surrogate gradient (SG) magnitude, demonstrating that reducing the overlap between the membrane potential distribution (MPD) and the effective region of the SG function can effectively decrease sensitivity to adversarial perturbations. Based on this insight, the paper proposes the MPD-SGR regularization method, which substantially outperforms existing SNN defense methods under both vanilla training and adversarial training settings.
Plug-and-Play Parameter-Efficient Tuning of Embeddings for Federated Recommendation: This paper proposes a plug-and-play federated recommendation framework that introduces PEFT (Parameter-Efficient Fine-Tuning) concepts into item embeddings. By freezing pre-trained full embeddings and transmitting only lightweight compressed embeddings (LoRA / Hash / RQ-VAE), the framework significantly reduces communication overhead while improving recommendation accuracy.
Privacy Auditing of Multi-Domain Graph Pre-Trained Model under Membership Inference Attack: This paper proposes MGP-MIA, the first framework targeting membership inference attacks (MIA) against multi-domain graph pre-trained models. It amplifies membership signals via machine unlearning, constructs shadow models through incremental learning, and employs a similarity-based inference mechanism to effectively expose privacy leakage risks in multi-domain graph pre-training.
Privacy on the Fly: A Predictive Adversarial Transformation Network for Mobile Sensor Data: This paper proposes PATN (Predictive Adversarial Transformation Network), the first framework to introduce adversarial perturbations into sensor data privacy protection. PATN leverages historical sensor data to generate forward-looking adversarial perturbations, achieving zero-latency real-time privacy protection while preserving the semantic fidelity of sensor data.
ProbLog4Fairness: A Neurosymbolic Approach to Modeling and Mitigating Bias: This paper proposes the ProbLog4Fairness framework, which formalizes bias mechanisms in data as interpretable logic programs using the probabilistic logic programming language ProbLog, and integrates bias assumptions into neural network training via distant supervision in DeepProbLog, enabling flexible and principled bias mitigation.
Reference Recommendation based Membership Inference Attack against Hybrid-based Recommender Systems: This paper proposes a Reference Recommendation-based Membership Inference Attack (MIA), designing a relative membership metric $\rho(u) = d(v_t, v_h) / d(v_t, v_r)$ that exploits the personalization capability of hybrid-based recommender systems to obtain reference recommendations. It is the first method to effectively attack hybrid-based recommender systems, achieving an attack success rate of up to 93.4% with a computational cost of only 10 seconds.
RegionMarker: A Region-Triggered Semantic Watermarking Framework for Embedding-as-a-Service: This paper proposes RegionMarker, a semantic watermarking framework based on region-triggered mechanisms. It defines trigger regions in a low-dimensional space and injects semantic watermarks, constituting the first EaaS copyright protection method capable of simultaneously resisting CSE attacks, paraphrasing attacks, and dimension perturbation attacks.
Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach: This paper proposes the TGAF framework, which leverages diffusion models to encode target labels as 2D semantic tensors for guiding adversarial noise generation, and introduces a random masking strategy to preserve complete semantic information, significantly improving the transferability of targeted adversarial attacks.
Revisiting (Un)Fairness in Recourse by Minimizing Worst-Case Social Burden: This paper systematically analyzes three fundamental limitations of existing fairness metrics in algorithmic recourse—neglecting classifier decision behavior, ignoring ground-truth labels, and the tendency of gap-based metrics to obscure unfairness—and proposes MISOB, a fairness framework grounded in social burden. Through a minimax-weighted training strategy, MISOB reduces social burden across all demographic groups without requiring access to sensitive attributes, simultaneously improving fairness at both the prediction and recourse stages.
Robust Watermarking on Gradient Boosting Decision Trees: This paper proposes the first robust watermarking framework for GBDT models. It embeds watermarks via in-place fine-tuning and introduces four embedding strategies—Wrong Prediction Flip, Outlier Flip, Cluster Center Flip, and Confidence Flip—achieving high embedding success rates, minimal accuracy degradation, and strong robustness against fine-tuning attacks.
SecMoE: Communication-Efficient Secure MoE Inference via Select-Then-Compute: This paper proposes the SecMoE framework, which efficiently enables sparse MoE inference under two-party secure computation via a Select-Then-Compute paradigm, eliminating redundant expert computation and achieving up to 29.8× communication reduction and up to 16.1× end-to-end speedup.
Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness: This paper proposes the Simulation-to-Real (S2R) framework, which introduces a novel two-stage noise approximation strategy of "mathematical modeling → unsupervised domain transfer": a mathematical transform $T$ first maps clean images to a known noise domain $\mathcal{C}$, and an unsupervised image-to-image network $G$ then maps $\mathcal{C}$ to the real screen-camera (SC) noise domain $\mathcal{U}$. Without requiring paired data, S2R accurately approximates real SC noise and achieves state-of-the-art watermarking robustness (BER reduced by 30–60%) and image quality (PSNR 42.27 dB / SSIM 0.962) across multiple devices, angles, and distances.
TopoReformer: Mitigating Adversarial Attacks Using Topological Purification in OCR Models: This paper proposes TopoReformer, a model-agnostic adversarial purification pipeline based on a topological autoencoder. By leveraging persistent homology to enforce topological consistency in the latent space, the method filters adversarial perturbations without adversarial training, effectively protecting OCR systems against classical attacks, adaptive attacks, and OCR-specific watermark attacks.
Towards Effective, Stealthy, and Persistent Backdoor Attacks Targeting Graph Foundation Models: This paper proposes GFM-BA, the first systematic backdoor attack method targeting the pre-training phase of Graph Foundation Models (GFMs). It addresses three core challenges — effectiveness, stealthiness, and persistence — through three modules: label-free trigger association, node-adaptive trigger generation, and persistent backdoor anchoring.
Towards Multiple Missing Values-Resistant Unsupervised Graph Anomaly Detection: This paper proposes M2V-UGAD, the first framework to address unsupervised graph anomaly detection under simultaneous node attribute and graph topology missingness. Through three core mechanisms—dual-pathway independent imputation, hyperspherical latent space fusion, and pseudo-anomaly generation—the framework overcomes cross-view interference and imputation bias, consistently outperforming existing methods across 7 benchmark datasets.
Transferable Backdoor Attacks for Code Models via Sharpness-Aware Adversarial Perturbation: This paper proposes STAB (Sharpness-aware Transferable Adversarial Backdoor), which trains a surrogate model via SAM to converge to flat regions of the loss landscape and employs Gumbel-Softmax optimization to generate context-aware adversarial triggers. STAB is the first approach to simultaneously achieve cross-dataset transferability and stealthiness in backdoor attacks against code models.
Transferable Hypergraph Attack via Injecting Nodes into Pivotal Hyperedges: This paper proposes TH-Attack, a transferable node injection attack framework targeting Hypergraph Neural Networks (HGNNs). By identifying pivotal hyperedges in information aggregation pathways and injecting semantically inverted malicious nodes, TH-Attack effectively attacks diverse HGNN architectures in a black-box setting, reducing accuracy from 80%+ to below 30%.
Truth, Justice, and Secrecy: Cake Cutting Under Privacy Constraints: This paper proposes PP_CC_puv, the first privacy-preserving cake cutting protocol, which transforms Chen et al.'s strategyproof fair division algorithm into a privacy-preserving variant based on secret sharing and secure multi-party computation (MPC). The protocol maintains envy-freeness, Pareto optimality, and strategyproofness while ensuring that participants' preference information is not disclosed.
Yours or Mine? Overwriting Attacks Against Neural Audio Watermarking: This paper presents the first systematic study of overwriting attacks against neural audio watermarking, proposing white-box, gray-box, and black-box attack schemes that achieve near-100% attack success rates against three SOTA methods—AudioSeal, Timbre, and WavMark—exposing critical security vulnerabilities in existing audio watermarking systems.
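For the matrix-free norm estimation entry in the list above, the underlying idea can be sketched as follows. Since $\|A\|_{2\to\infty}$ equals the largest row 2-norm of $A$, it is the square root of the largest diagonal entry of $AA^\top$, and the Hutchinson diagonal estimator approximates that diagonal from matrix-vector products alone. This is a basic illustration of the estimator, not the TwINEst or TwINEst++ algorithms themselves; the function name and parameters are hypothetical.

```python
import numpy as np

def two_to_inf_norm_estimate(matvec, rmatvec, n_rows, n_samples=200, seed=0):
    """Estimate ||A||_{2->inf} = max_i ||A[i, :]||_2 without forming A.

    matvec(x) must return A @ x and rmatvec(y) must return A.T @ y.
    """
    rng = np.random.default_rng(seed)
    diag_est = np.zeros(n_rows)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=n_rows)  # Rademacher probe vector
        diag_est += z * matvec(rmatvec(z))         # z * (A A^T z), elementwise
    diag_est /= n_samples
    # The averaged estimate approximates diag(A A^T), i.e. squared row norms.
    return float(np.sqrt(max(diag_est.max(), 0.0)))

# Compare against the exact value on a small dense matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 10))
exact = float(np.linalg.norm(A, axis=1).max())
approx = two_to_inf_norm_estimate(lambda x: A @ x, lambda y: A.T @ y,
                                  n_rows=A.shape[0], n_samples=2000)
print(f"exact={exact:.3f}  estimate={approx:.3f}")
```

The same routine applied to $A^\top$ (swap the two callbacks and pass the number of columns) estimates $\|A\|_{1\to 2}$, which is the largest column 2-norm.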

🦾 LLM Agent

A2Flow: Automating Agentic Workflow Generation via Self-Adaptive Abstraction Operators: This paper proposes A2Flow, a framework that automatically extracts reusable abstract execution operators from expert data via a three-stage pipeline (case generation → functional clustering → deep extraction), replacing manually predefined operators. Combined with an operator memory mechanism that accumulates intermediate outputs to assist node decision-making, A2Flow outperforms AFLOW and other state-of-the-art methods across 8 benchmarks while reducing resource consumption by 37%.
A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses: This paper proposes MACO (Multi-Agent Conversational Online Learning), which formulates LLM response selection as a multi-agent conversational bandit problem. It employs local agents to eliminate low-quality responses and a cloud-side adaptive keyword-based dialogue to collect user preferences, achieving near-optimal online response evaluation and user preference alignment.
KDR-Agent: A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval: This paper proposes KDR-Agent, a multi-agent framework in which a central planner coordinates three specialized agents—knowledge retrieval, contextual disambiguation, and reflective error correction—combined with natural language type definitions and entity-level positive/negative contrastive demonstrations. Without any fine-tuning, KDR-Agent comprehensively outperforms zero-shot and few-shot baselines across 10 low-resource NER datasets spanning 5 domains (BC5CDR F1=82.47, WNUT-17 F1=80.78 on GPT-4o).
AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments: LLM-driven embodied agents are instantiated to "live" in simulated smart home environments, generating virtual ambient sensor data for pre-training HAR models, which yields significant gains in activity recognition under low-resource settings.
AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search: This paper proposes AgentSwift, a framework that automatically discovers high-performance LLM agent designs through a hierarchical search space (jointly optimizing agentic workflows and functional components), a lightweight value model for predicting agent performance, and an uncertainty-guided MCTS search strategy, achieving an average improvement of 8.34% across 7 benchmarks.
AquaSentinel: Next-Generation AI System Integrating Sensor Networks for Urban Underground Water Pipeline Anomaly Detection via Collaborative MoE-LLM Agent Architecture: This paper proposes AquaSentinel, a physics-informed AI system that achieves network-wide pipeline leak detection using only 20–30% node coverage through sparse sensor deployment, physics-augmented virtual sensors, a MoE spatiotemporal GNN ensemble, a dual-threshold RTCA detection algorithm, causal flow localization, and LLM-based report generation. The system achieves a 100% detection rate across 110 leak scenarios.
ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment: This paper proposes ARCANE, a framework that formulates alignment as a multi-agent collaboration problem. A manager agent learns to generate natural-language rubrics (weighted verifiable criterion sets) through dialogue with stakeholders, which serve as interpretable proxy reward functions for a worker agent. Via two-stage SFT+GSPO training, the framework enables test-time configurable alignment, improving mean return from 0.58 to 0.74 (N=8) on the GDPVal benchmark with the GSPO variant.
AutoGLM: Autonomous Foundation Agents for GUIs: AutoGLM builds a GUI foundation agent for web browsers and Android devices on top of ChatGLM. By introducing an intermediate interface design that decouples planning from grounding, and proposing a self-evolving online curriculum reinforcement learning framework, the system achieves a 55.2% success rate on VAB-WebArena-Lite, substantially surpassing GPT-4o's 18.2%.
Automating Complex Document Workflows via Stepwise and Rollback-Enabled Operations: This paper proposes AutoDW, a framework that automates complex document workflows through stepwise planning (generating one API call at a time) combined with adaptive rollback (parameter-level and API-level). On DWBench—a benchmark of 250 sessions and 1,708 instructions—AutoDW achieves 90% instruction-level and 62% session-level completion rates, surpassing the strongest baseline by 40% and 76%, respectively.
AutoTool: Efficient Tool Selection for Large Language Model Agents: This paper proposes AutoTool, a graph-based tool selection framework that exploits tool usage inertia to construct a Tool Inertia Graph (TIG). By leveraging statistical structure, AutoTool bypasses redundant LLM inference for tool selection and parameter filling, reducing inference overhead by up to 30% while maintaining task completion rates.
BayesAgent: Bayesian Agentic Reasoning Under Uncertainty via Verbalized Probabilistic Graphical Modeling: This paper proposes the vPGM framework, which guides LLM agents via natural language to simulate Bayesian reasoning over probabilistic graphical models (PGMs)—discovering latent variables and inferring posterior distributions—and further applies numerical Bayesian calibration with a Dirichlet prior (BayesVPGM), achieving simultaneous improvements in accuracy and confidence calibration across multiple reasoning tasks.
Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning: This paper proposes a Planner-centric Plan-Execute framework that transforms complex queries into DAG-based execution plans. Through two-stage SFT+GRPO training of a dedicated Planner model, the approach surpasses reactive methods such as ReAct on ComplexTool-Plan and StableToolBench, achieving higher success rates with fewer inference steps.
CausalTrace: A Neurosymbolic Causal Analysis Agent for Smart Manufacturing: This paper proposes CausalTrace — a neurosymbolic causal analysis agent integrated into an industrial CoPilot (SmartPilot) that combines data-driven causal discovery with industrial ontologies and knowledge graphs, enabling real-time root cause analysis, counterfactual reasoning, and interpretable decision support.
Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents: This paper proposes Co-EPG, a framework that decouples a GUI Agent into separate Planning and Grounding models and establishes a positive feedback loop via GRPO co-training and a Confidence-based Dynamic Reward Ensemble Mechanism (C-DREM), enabling both models to co-evolve through self-iteration. Using only benchmark datasets (no external data), Co-EPG achieves state-of-the-art results on Multimodal-Mind2Web (58.4%) and AndroidControl (83.1%).
COACH: Collaborative Agents for Contextual Highlighting -- A Multi-Agent Framework for Sports Video Analysis: This paper proposes COACH — a reconfigurable multi-agent framework built on a shared backbone model — that achieves role specialization via intent-driven strategy orchestration and structured CoT fine-tuning, significantly outperforming generalist models such as Gemini 2.5 Pro on both QA and summarization tasks in badminton video analysis.
Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution: This paper introduces ORS3D, a novel task that incorporates operations research (OR) knowledge into embodied AI task scheduling. Agents are required to exploit the waiting time of parallelizable sub-tasks to execute other tasks, thereby minimizing total completion time, while simultaneously localizing target objects in 3D scenes. The authors construct a 60K-scale dataset ORS3D-60K and propose the GRANT model, which connects to an external dynamic programming solver via a scheduling token mechanism, achieving a 30.53% improvement in time efficiency over baselines.
COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control: This paper proposes COVR, a bidirectional collaborative optimization framework for VLMs and RL agents: high-quality interaction data generated by RL is used to fine-tune the VLM, while the enhanced VLM in turn guides RL policy learning via action priors, achieving SOTA performance on CARLA and DMControl.
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies: This paper proposes D-GARA, a dynamic robustness evaluation framework for Android GUI Agents. By injecting real-world anomalies—such as permission dialogs, low-battery warnings, and app crashes—during live interactions, D-GARA reveals that existing SOTA agents (including UI-TARS-72B and GPT-4o) suffer an average success rate drop of over 17.5%, with a maximum degradation of 33%, under interruption scenarios.
DEPO: Dual-Efficiency Preference Optimization for LLM Agents: This paper proposes the concept of dual-efficiency, decomposing LLM agent efficiency into step-level (reducing tokens per step) and trajectory-level (reducing total number of steps) dimensions. Building on KTO, the authors introduce DEPO, which jointly optimizes efficiency and task performance by incorporating an efficiency bonus into the reward for desirable samples.
EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation: This paper proposes EcoAgent, a closed-loop device-cloud collaborative multi-agent framework for mobile automation. By combining Dual-ReACT two-level reasoning and planning, lightweight on-device verification feedback, and a Pre-Understanding text compression module, EcoAgent achieves success rates comparable to fully cloud-based agents on AndroidWorld while substantially reducing latency (3.9s vs. 15.3s), cloud invocations (−89%), and upstream data volume (a 48.6× reduction).
FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation: This paper is the first to systematically define the task of automated Equity Research Report (ERR) generation. It constructs the FinRpt dataset (6,825 high-quality bilingual reports integrating 7 categories of financial data), proposes an 11-metric evaluation framework, and designs the FinRpt-Gen generation framework with 9 collaborative agents featuring a three-stage enhancement pipeline (rating correction / expert review / language polishing). Human evaluation shows that generated reports approach expert-written quality.
From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness: This paper presents the first systematic case study demonstrating that demographically grounded persona assignment causes up to 26.2% performance degradation in LLM agent task execution across 5 operational domains, establishing that persona-induced bias extends beyond text generation into action decision-making.
History-Aware Reasoning for GUI Agents: This paper proposes the HAR framework, which transforms the reasoning paradigm of GUI agents from "history-unaware" to "history-aware" by constructing reflective learning scenarios, synthesizing error-correction guidelines, and designing a hybrid RL reward function incorporating a Memory-Augmented Reward (MAR). A 3B model trained under this framework surpasses larger models on multiple benchmarks including AITW, Mind2Web, and GUI-Odyssey.
LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models: This paper proposes LieCraft, a multi-player hidden-role game framework (with constraint-satisfaction-guaranteed balance) to evaluate the strategic deception capabilities of 12 LLMs. It finds that all tested frontier LLMs—including GPT-4o—exhibit deception rates exceeding 90% under incentive conditions, demonstrating that safety training has not eliminated the capacity for strategic lying.
LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval: This paper proposes LLandMark, a modular multi-agent framework that achieves landmark-aware multimodal interactive video retrieval through landmark knowledge augmentation, LLM-assisted image retrieval, and OCR refinement modules, achieving a total score of 77.40/88 in the Vietnamese large-scale video retrieval challenge (HCMAIC 2025).
LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs: This paper proposes LLMTM — the first comprehensive benchmark for evaluating LLMs on temporal motif analysis in dynamic graphs. It covers 6 task categories across 9 temporal motif types and evaluates 9 models, finding that LLM performance on temporal motif recognition degrades rapidly with increasing motif complexity. A Structure-Aware Dispatcher is further proposed to intelligently route queries to either standard LLM prompting or tool-augmented agents based on graph structural properties and cognitive load, achieving near-peak accuracy while reducing computational cost.
Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets: This paper proposes LGGFN (Loss-Guided GFlowNets), in which the exploration of an auxiliary GFlowNet is directly driven by the training loss of the primary GFlowNet. The auxiliary agent's reward is defined as $R_{aux}(x) = R(x) + \lambda \cdot L_{main}(x)$, prioritizing regions where the primary model's training loss is highest, i.e., regions it has learned least well. On grid, sequence generation, and Bayesian structure learning tasks, LGGFN discovers 40× more unique modes and reduces exploration error by 99%.
MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models: This paper proposes MedLA, the first multi-agent medical reasoning framework based on syllogistic logic trees. Each agent organizes its reasoning as an explicit logic tree composed of syllogistic nodes (major premise–minor premise–conclusion). Multiple agents align and revise their logic trees at the premise level through graph-guided multi-round discussions. With an 8B model, MedLA outperforms all baselines by 7.4% on MedDDx and achieves an average accuracy of 69.9% on medical QA benchmarks, surpassing 70B RAG-based models.
MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning: This work employs Group Relative Policy Optimization (GRPO) to train LLMs at the reasoning level for ethical framework alignment, achieving out-of-distribution generalization on the Moral-Reason-QA dataset (680 high-ambiguity scenarios) with utilitarian alignment scores improving from 0.207 to 0.964.
Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems: This paper proposes an adaptively coordinated multi-agent LLM framework that achieves a 27% improvement in compliance accuracy and a 74% reduction in revision rate on high-complexity financial document analysis tasks, through parallel competitive evaluation, dynamic task routing, and bidirectional feedback mechanisms.
PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching: This paper proposes PerTouch, a framework that integrates a semantic region-level retouching model based on Stable Diffusion + ControlNet with a VLM-driven Agent (incorporating feedback-driven rethinking and scene-aware memory) to achieve fine-grained, personalized image retouching.
Physics-Informed Autonomous LLM Agents for Explainable Power Electronics Modulation Design: This paper proposes PHIA, a system in which an LLM planner collects design requirements via a chat interface and autonomously coordinates a physics-informed neural network surrogate model (hierarchical PINN) with optimization algorithms to iteratively generate power converter modulation designs, achieving a 63.2% reduction in MAE, a 33× speedup in design time, and usability validated by 20 domain experts.
ProBench: Benchmarking GUI Agents with Accurate Process Information: ProBench is proposed as the first mobile GUI Agent benchmark that evaluates both final state and operational process: 200+ challenging tasks cover 34 mainstream Chinese and English apps. A Process Provider (Structure Description Converter + MLLM Summarizer) automatically captures accurate intermediate process information. Evaluation reveals that even the strongest model, Gemini 2.5 Pro, completes only 40.1% of tasks, exposing three prevalent issues: insufficient grounding, poor awareness of action history, and oversimplified task planning.
Promoting Sustainable Web Agents: Benchmarking and Estimating Energy Consumption Through Empirical and Theoretical Analysis: This paper presents the first systematic quantification of energy consumption and carbon emissions of Web Agents from both empirical benchmarking and theoretical estimation perspectives, finding that higher energy consumption does not equate to better performance, and advocating for the inclusion of energy efficiency metrics in evaluation protocols.
Prune4Web: DOM Tree Pruning Programming for Web Agent: This paper proposes Prune4Web, a programmatic DOM pruning approach that achieves 25–50× candidate element reduction via "LLM-generated scoring function parameters + fixed heuristic template execution." The three-stage pipeline (Planner decomposes subtasks → Programmatic Filter generates scoring functions to prune DOM → Grounder executes actions) enables a 3B model to achieve 52.4% Step SR on Multimodal-Mind2Web, surpassing all baselines of the same parameter scale and even some 9.6B/32B models, while improving low-level grounding accuracy from 46.8% to 88.28%.
Real-Time Trust Verification for Safe Agentic Actions Using TrustBench: This paper proposes TrustBench, a dual-mode framework: (1) Benchmark Mode — combines traditional metrics with LLM-as-a-Judge to evaluate 8 trust dimensions and learns a calibration mapping from agent confidence to actual accuracy; (2) Verification Mode — computes trust scores in real time after an agent formulates an action but before execution, blocking 87% of harmful actions with latency below 200ms, with specialization achieved through domain plugins (medical/financial/QA).
Reflection-Driven Control for Trustworthy Code Agents: This paper proposes a Reflection-Driven Control module that elevates "self-reflection" from a post-hoc patch to a first-class control loop within the agent reasoning process. Through three components—a lightweight self-checker, evidence-driven repair, and a reflective memory repository—the approach significantly improves code security rates on secure code generation tasks.
SoMe: A Realistic Benchmark for LLM-based Social Media Agents: This paper introduces SoMe, the first comprehensive benchmark for social media agents, comprising 8 tasks, over 9 million real-world posts, and 17,869 annotated queries. It evaluates 13 mainstream LLMs on social media agent capabilities and reveals substantial performance gaps on complex social tasks.
Structured Personalization: Modeling Constraints as Matroids for Data-Minimal LLM Agents: This paper formalizes structured constraints in LLM agent personalization—comprising logical dependencies and hierarchical quotas—as laminar matroids, proves that greedy algorithms retain constant-factor approximation guarantees under such constraints, and addresses the data-minimization selection problem with dependency relations and hierarchical limits.
Time, Identity and Consciousness in Language Model Agents: This paper applies the temporal gap concept from Stack Theory to LLM agent evaluation, proposing a conservative evaluation toolkit that distinguishes between "talking like a stable self" and "being organized like a stable self." It reveals identity trade-offs across different scaffold structures via persistence scores and an identity morphospace.
TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents: TongUI proposes a framework that automatically converts multimodal web tutorials (videos and illustrated articles) into GUI operation trajectories, constructing the million-scale GUI-Net-1M dataset for fine-tuning Qwen2.5-VL. The resulting models surpass or approach state-of-the-art methods such as UI-TARS across multiple grounding and navigation benchmarks.
Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance: This paper proposes a task completion framework in which a Task Profiler, a Reasoning Module, and a Generation Module co-evolve to enable verifiable and reliable behavioral guidance for LLM agents in multi-turn interactive environments.
When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents: This paper systematically investigates how long-context padding affects the safety behavior of LLM agents. Models claiming support for 1M–2M token windows exhibit performance collapse exceeding 50% at 100K tokens. Refusal rates fluctuate in unpredictable directions (GPT-4.1-nano rises from 5% to 40%; Grok 4 Fast drops from 80% to 10%), revealing critical safety vulnerabilities in long-context agent systems.
With Great Capabilities Come Great Responsibilities: Introducing the Agentic Risk & Capability Framework for Governing Agentic AI Systems: This paper proposes the Agentic Risk & Capability (ARC) framework, which systematically identifies, assesses, and mitigates safety and security risks in agentic AI systems from a capability perspective, providing organizations with an actionable and structured methodology for governance.
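
The auxiliary reward quoted in the LGGFN entry above, $R_{aux}(x) = R(x) + \lambda \cdot L_{main}(x)$, can be illustrated with a minimal sketch. The function name `auxiliary_reward`, the default `lam`, and the sample numbers are hypothetical and chosen only for illustration; how the paper computes the primary GFlowNet's loss for a given object is not described in this note.

```python
def auxiliary_reward(task_reward: float, main_loss: float, lam: float = 1.0) -> float:
    """R_aux(x) = R(x) + lambda * L_main(x).

    task_reward: the original reward R(x) of a candidate x
    main_loss:   the primary GFlowNet's training loss on x (assumed to be given)
    lam:         weight of the loss bonus (hypothetical default)
    """
    return task_reward + lam * main_loss


# Two hypothetical candidates with the same task reward: the one the primary
# model fits worse (higher loss) receives the larger auxiliary reward, so the
# auxiliary agent is pushed toward it.
well_fit = auxiliary_reward(task_reward=1.0, main_loss=0.1, lam=2.0)
poorly_fit = auxiliary_reward(task_reward=1.0, main_loss=0.5, lam=2.0)
print(well_fit < poorly_fit)
```

With `lam = 0` the auxiliary agent would optimize the same reward as the primary one; larger values weight exploration of poorly fit regions more heavily.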

📊 LLM Evaluation¶

Axis-Aligned Document Dewarping: This paper proposes to exploit the inherent axis-aligned geometric property of planar documents, systematically incorporating axis-alignment constraints across training, inference, and evaluation stages, achieving state-of-the-art document rectification performance and introducing a new evaluation metric, AAD.
BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction: This paper introduces BCWildfire, a multimodal wildfire risk prediction dataset covering 240 million hectares of British Columbia, Canada over a 25-year span, encompassing 38 driving factors. It conducts a systematic benchmark evaluation of time series forecasting models across four paradigms—CNN, Linear, Transformer, and Mamba—revealing the performance ceiling of current models and the key influential factors in wildfire prediction.
Benchmarking LLMs for Political Science: A United Nations Perspective: This paper presents UNBench, the first comprehensive LLM evaluation benchmark for political science grounded in UN Security Council records from 1994 to 2024. It encompasses four interrelated tasks—resolution drafting, voting simulation, adoption prediction, and representative statement generation—to systematically assess LLMs' ability to understand and simulate complex political dynamics.
Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents: Drawing on Cognitive Load Theory (CLT) from psychology, this work decomposes the complexity of tool-use tasks into intrinsic load (structural complexity of the solution path) and extraneous load (ambiguity of problem formulation). It constructs ToolLoad-Bench, a benchmark with parametrically adjustable cognitive load, and employs an exponential decay model $\text{Acc} \approx e^{-(k \cdot CL + b)}$ to precisely characterize the capability boundaries of different agents.
Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment: This paper proposes MA-CLIP, which discovers and exploits the magnitude information of CLIP image features as a complementary perceptual quality cue. Combined with cosine similarity, it achieves training-free adaptive dual-cue fusion for image quality assessment.
Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering: This paper proposes DISC, a method that learns cluster-customized category relationships (rather than a globally uniform distance) for each cluster. Through joint optimization of relationship trees and cluster assignments, DISC achieves an average rank of 1.25 across 12 datasets, substantially outperforming the previous best method (average rank 5.21).
ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions: This paper proposes ConInstruct, a benchmark for evaluating LLMs' ability to detect and resolve conflicting constraints in instructions. Results show that most proprietary models can detect conflicts reasonably well but rarely notify users explicitly, with DeepSeek-R1 and Claude-4.5-Sonnet achieving the best conflict detection performance (F1 of 91.5% and 87.3%, respectively).
DcMatch: Unsupervised Multi-Shape Matching with Dual-Level Consistency: This paper proposes DcMatch, an unsupervised multi-shape matching framework that employs a shape graph attention network to capture the underlying manifold structure of a shape collection for constructing a more expressive shared universe space, while enforcing dual-level cycle consistency constraints in both the spatial and spectral domains, achieving comprehensive state-of-the-art performance across multiple benchmark datasets.
Deep Incomplete Multi-View Clustering via Hierarchical Imputation and Alignment: This paper proposes DIMVC-HIA, a deep incomplete multi-view clustering framework that integrates hierarchical imputation with dual alignment. The method first imputes missing cluster assignments and then imputes missing features in a coarse-to-fine manner, maintaining robust performance under high missing rates (up to 70%).
DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning: This paper proposes DiCaP (Distribution-Calibrated Pseudo-labeling), which estimates the posterior correctness rate of pseudo-labels to calibrate their weights, introduces a dual-threshold mechanism to separate confident and ambiguous regions with differentiated strategies, and surpasses the state of the art by up to 4.27% in semi-supervised multi-label learning.
GazeInterpreter: Parsing Eye Gaze to Generate Eye-Body-Coordinated Narrations: This paper proposes GazeInterpreter, an LLM-based hierarchical framework that converts raw gaze signals into textual narrations via a symbolic gaze parser, integrates them with body motion narrations to produce eye-body-coordinated descriptions, and iteratively refines outputs through a self-correction loop, yielding significant improvements on downstream tasks including text-driven motion generation, action prediction, and behavior summarization.
GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization: To address the poor performance of GDBA on general-domain DCOPs, this paper systematically diagnoses three root causes—an overly aggressive violation condition, unbounded penalty accumulation, and uncoordinated penalty updates—and proposes the DGLS framework. Through an adaptive violation condition, an evaporation mechanism, and a synchronization scheme, DGLS fully unleashes the potential of guided local search, substantially outperforming state-of-the-art methods across multiple standard benchmarks.
GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery: Grounded in Neural Collapse theory, this paper replaces dynamic classifiers with a fixed Equiangular Tight Frame (ETF) classifier and achieves continual generalized category discovery via supervised alignment and confidence-guided unsupervised alignment, reducing forgetting by 16.1% and improving novel category discovery by 3.2% across four benchmarks.
GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval: This paper proposes GranAlign, a training-free granularity-aware alignment framework that addresses the core challenge of semantic granularity mismatch in zero-shot video moment retrieval (ZVMR). By rewriting queries into simplified and detailed variants and matching them against query-agnostic and query-aware video descriptions respectively, GranAlign achieves a 3.23% improvement in mAP@avg on QVHighlights.
Graph Out-of-Distribution Detection via Test-Time Calibration with Dual Dynamic Dictionaries: This paper proposes the BaCa framework, which generates boundary-aware synthetic graph topologies at test time via graphon estimation and mixup strategies, and adaptively calibrates OOD scores using dual priority-queue-based dynamic dictionaries with an attention mechanism. Without fine-tuning the pretrained model or requiring auxiliary OOD data, BaCa outperforms GOODAT on all 10 datasets with an average AUC improvement of 8.37%.
HybriDLA: Hybrid Generation for Document Layout Analysis: HybriDLA is the first approach to unify diffusion-based bounding box refinement and autoregressive query expansion within a single decoding layer, simulating a human coarse-to-fine reading strategy for document layout analysis. It achieves 83.5% mAP on DocLayNet with a vision-only model, approaching multimodal systems.
Improved Runtime Guarantees for the SPEA2 Multi-Objective Optimizer: By rigorously analyzing the more complex selection mechanism of SPEA2, this paper demonstrates that its population dynamics are fundamentally different from those of NSGA-II — the σ-criterion induces a uniform distribution of objective values across the population — yielding runtime upper bounds with a substantially weaker dependence on population size, indicating that SPEA2 is more robust to parameter choices.
LLM-as-a-Judge for Scalable Test Coverage Evaluation: This paper applies the LLM-as-a-Judge paradigm to Gherkin acceptance test coverage evaluation, systematically quantifying accuracy–reliability–cost trade-offs across 20 model configurations × 500 evaluations. It finds that GPT-4o Mini achieves the optimal production balance with a MAAE of 6.07, an ECR@1 of 96.6%, and a cost of $1.01 per 1K evaluations, approximately 1/78th the cost of GPT-5 at high reasoning effort.
Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory: This paper proposes PSN-IRT (Pseudo-Siamese Network for IRT), an enhanced Item Response Theory framework that jointly estimates LLM ability parameters and four-parameter item characteristics (difficulty / discrimination / guessing / feasibility). Applied to 41,871 items across 11 benchmarks, the framework reveals systemic issues including widespread saturation, insufficient difficulty ceilings, and data contamination. Item subsets selected by PSN-IRT achieve a ranking consistency of Kendall $\tau = 1.00$.
Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning: This paper proposes LOREN, a curvature-aware zeroth-order optimization method that captures the anisotropic curvature of the loss landscape via a low-rank block-diagonal preconditioner, combined with REINFORCE Leave-One-Out (RLOO) variance reduction. LOREN achieves higher accuracy and faster convergence in LLM fine-tuning while reducing peak memory by up to 27.3% compared to MeZO-Adam.
MAPS: Multi-Agent Personality Shaping for Collaborative Reasoning: This paper proposes MAPS, a five-agent collaborative reasoning framework that assigns distinct "personalities" to four functional agents based on the Big Five personality theory — Interpreter (Openness), Aligner (Agreeableness), Scholar (Conscientiousness), and Solver (Extraversion) — to achieve heterogeneous collaboration, complemented by a Critic Agent (Neuroticism → Socratic reflection) for iterative refinement. MAPS surpasses the GPT-4o baseline by 15.84% on MathVista/OlympiadBench/EMMA and, for the first time, exceeds human expert performance by 3.58%.
MCTS-SQL: Light-Weight LLMs can Master the Text-to-SQL through Monte Carlo Tree Search: This paper proposes MCTS-SQL, enabling lightweight LLMs (e.g., Qwen-1.5B) to achieve strong Text-to-SQL performance via Monte Carlo Tree Search — a three-component architecture (Selector for schema pruning + Direct Generator for initial SQL generation + MCTS-Refiner for iterative refinement), combined with a prefix caching mechanism that reduces inference time by 53%. Qwen-1.5B achieves 40.69% execution accuracy on BIRD, surpassing ChatGPT-3.5.
MicroEvoEval: A Systematic Evaluation Framework for Image-Based Microstructure Evolution Prediction: This paper introduces MicroEvoEval, the first standardized benchmark for image-level microstructure evolution prediction, encompassing 4 representative physical tasks (planar wave propagation, grain growth, spinodal decomposition, dendritic solidification), 14 models (5 domain-specific + 9 general spatiotemporal architectures), and a multi-dimensional evaluation framework (numerical accuracy + physical fidelity + computational efficiency). The study finds that modern general-purpose architectures (e.g., VMamba) outperform domain-specific models in long-term stability and physical fidelity while achieving an order-of-magnitude improvement in computational efficiency.
MindVote: When AI Meets the Wild West of Social Media Opinion: This paper introduces MindVote — the first LLM opinion prediction benchmark grounded in real social media poll data, comprising 3,918 naturally occurring polls (across 23 topics) collected from Reddit and Weibo, enriched with platform- and topic-level context. Evaluation of 15 LLMs reveals: the best model (o3-medium) achieves a 1-Wasserstein score of only 0.892 versus an upper bound of 0.972; survey-specialized fine-tuned models underperform general-purpose models (the "survey specialization trap"); and models exhibit strong cultural alignment — Western models excel on Reddit while Chinese models excel on Weibo.
NeSTR: A Neuro-Symbolic Abductive Framework for Temporal Reasoning in Large Language Models: This paper proposes NeSTR, a neuro-symbolic prompting strategy that converts natural language temporal facts into structured symbolic predicates, combined with consistency verification and abductive reflection for error correction. Under a zero-shot setting, NeSTR enables LLMs to achieve high-quality temporal reasoning, attaining an average F1 of 89.7 on GPT-4o-mini, compared to 64.9 for vanilla prompting and 85.8 for TISER.
OptScale: Probabilistic Optimality for Inference-time Scaling: This paper proposes OptScale, a probabilistic optimality framework that models the probability distribution of verifier scores to derive a theoretical lower bound on the optimal number of samples, dynamically determining the minimum number of samples required per problem and substantially reducing computational overhead while preserving inference accuracy.
Perspective from a Broader Context: Can Room Style Knowledge Help Visual Floorplan Localization?: This paper proposes leveraging room style knowledge — obtained via unsupervised clustering pretraining in the form of a room discriminator — to resolve ambiguities caused by repetitive structures in visual floorplan localization (FLoc), achieving state-of-the-art performance on two standard benchmarks: Gibson and Structured3D.
RefineVAD: Semantic-Guided Feature Recalibration for Weakly Supervised Video Anomaly Detection: This paper proposes RefineVAD, a framework comprising two modules — Motion-aware Temporal Attention Recalibration (MoTAR) and Category-Oriented REfinement (CORE) — that jointly models temporal motion dynamics and anomaly category semantics, achieving precise localization and interpretable detection of anomalous events in weakly supervised video anomaly detection.
Regular Games – an Automata-Based General Game Playing Language: This paper introduces Regular Games (RG), a general game playing system centered on nondeterministic finite automata (NFA) for encoding game rules. RG employs a multi-level language architecture (low-level RG, high-level HRG, and domain-specific frameworks) that covers all finite turn-based games — including those with imperfect information and stochasticity — while generating forward models that consistently outperform the previously fastest GGP system, RBG, and typically run 10–20× faster than Ludii.
Sampling Control for Imbalanced Calibration in Semi-Supervised Learning: This paper proposes SC-SSL, a framework that introduces an expansion classifier for decoupled sampling control to mitigate feature-level imbalance, and leverages the bias term of a linear layer as an optimized bias vector to directly calibrate logits at inference time, achieving state-of-the-art performance across multiple data distribution settings.
SpikCommander: A High-Performance Spiking Transformer with Multi-View Learning for Efficient Speech Command Recognition: This paper proposes SpikCommander, a fully spike-driven Transformer architecture that jointly enhances temporal and channel feature modeling via Multi-view Spike Temporal-Aware Self-Attention (MSTASA) and Spike Context Refinement MLP (SCR-MLP), surpassing state-of-the-art SNN methods on SHD/SSC/GSC benchmarks with fewer parameters.
Streaming Generated Gaussian Process Experts for Online Learning and Control: Extended Version: This paper proposes SkyGP (Streaming Kernel-induced Progressively Generated Expert GP), which handles streaming data via kernel-distance-driven progressive expert generation and time-aware configurable aggregation, inheriting the learning guarantees of exact GP while maintaining bounded computational complexity. SkyGP comprehensively outperforms state-of-the-art methods on both benchmark regression tasks and real-time control experiments.
Structured Language Generation Model: Loss Calibration and Formatted Decoding for Efficient Text: This paper proposes the SLGM framework, which reformulates structured prediction tasks for generative language models as classification problems via three components: structured input format, format loss, and format-aware decoding. Without introducing additional model parameters, SLGM significantly improves structural prediction performance of sub-1B models across 13 datasets spanning 5 task categories, including NER, RE, and SRL.
Test-time Diverse Reasoning by Riemannian Activation Steering: This paper proposes SPREAD, an unsupervised test-time activation steering framework that maximizes the total volume spanned by hidden activations across multiple reasoning paths by solving a Riemannian optimization problem on a product of spherical manifolds. SPREAD improves reasoning diversity and accuracy in Best-of-N sampling, outperforming temperature sampling baselines on mathematical reasoning benchmarks.
Think How Your Teammates Think: Active Inference Can Benefit Decentralized Execution: This paper proposes AIM (Active Inference Modeling), a framework for decentralized multi-agent reinforcement learning that models teammates' active inference processes — as perception–belief–action triple portraits — based solely on local observations without any communication. A dual filtering mechanism based on accuracy and relevance selectively integrates teammate belief portraits to assist decision-making. AIM achieves state-of-the-art or near-state-of-the-art performance across four benchmarks: SMAC, SMACv2, MPE, and GRF.
Towards a Common Framework for Autoformalization: This paper systematically surveys existing work on autoformalization across mathematics, logical reasoning, planning, and knowledge representation, and proposes a unified cross-disciplinary definitional framework. Autoformalization is defined as the semantically equivalent transformation from informal language to formal reasoning languages, with the goal of facilitating methodology sharing across research communities and accelerating the development of next-generation AI reasoning systems.
Towards a Rigorous Understanding of the Population Dynamics of the NSGA-III: Tight Runtime Bounds: This paper establishes the first tight runtime bound $\Theta(n^2 \ln n / \mu)$ for NSGA-III on the classical bi-objective OneMinMax benchmark, reveals the population dynamics of NSGA-III, and proves that it outperforms NSGA-II under appropriate population sizes.
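For readers unfamiliar with the benchmark in the NSGA-III entry, the sketch below defines OneMinMax in its usual form (maximize the number of zeros and the number of ones in a bit string) and evaluates the $n^2 \ln n / \mu$ expression from the stated bound. The function names are illustrative, and the expression omits the constants hidden by the $\Theta$ notation.

```python
import math
from itertools import product


def one_min_max(x):
    """Bi-objective OneMinMax: (number of zeros, number of ones) of a bit string."""
    ones = sum(x)
    return (len(x) - ones, ones)


def bound_expression(n, mu):
    """The n^2 ln n / mu term from the stated bound, without hidden constants."""
    return n * n * math.log(n) / mu


# Every bit string is Pareto optimal, so the front has n + 1 distinct points.
n = 4
front = {one_min_max(x) for x in product((0, 1), repeat=n)}
print(sorted(front))
print(bound_expression(100, 10), bound_expression(100, 20))
```

Because the two objectives always sum to $n$, no solution dominates another, and the algorithm's task reduces to covering all $n + 1$ trade-off points.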
TRACE: A Generalizable Drift Detector for Streaming Data-Driven Optimization: This paper proposes TRACE, a transferable concept drift detector based on attention-based sequence learning. By tokenizing statistical features and employing a dual-attention encoder, TRACE learns drift patterns that generalize across tasks, enabling deployment on unseen datasets and integration as a plug-and-play module into streaming data-driven optimization algorithms.
Where Norms and References Collide: Evaluating LLMs on Normative Reasoning: This paper proposes SNIC, a diagnostic testbed comprising 9,000 instances across 51 scenarios, designed to evaluate whether LLMs can leverage implicit social norms to resolve ambiguous reference expressions (e.g., "hand me the cup" when multiple cups are present). Results show that LLMs achieve an average accuracy of only 44% given scene descriptions alone; adding Prolog-based formal logic yields negligible improvement (44.2%), whereas explicitly providing a list of norms dramatically raises accuracy to 70.5% (GPT-4.1 reaches 99.6%). This demonstrates that LLMs lack implicit physical norm knowledge yet can effectively exploit explicit norms.

🕸️ Graph Learning

Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees: This paper proposes Adaptive Initial Residual Connections (Adaptive IRC), which allows each node to have a personalized residual strength learned from its initial features. It provides the first theoretical proof of a positive lower bound on the Dirichlet energy of initial residual connections with activation functions (guaranteeing the absence of over-smoothing), and introduces a PageRank-based heuristic variant that achieves comparable or superior performance without learning additional parameters.
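The link between initial residual connections and over-smoothing in the Adaptive IRC entry can be illustrated with a small pure-Python sketch. It assumes mean aggregation with self-loops as the propagation operator and fixed, hypothetical per-node residual strengths `alpha` (the paper learns them from initial features); the Dirichlet energy here is the unnormalized sum of squared feature differences over edges.

```python
def dirichlet_energy(features, edges):
    """Sum over edges of the squared Euclidean distance between endpoint features."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
        for i, j in edges
    )


def propagate(h, h0, neighbors, alpha):
    """One layer: h_i <- (1 - alpha_i) * mean of neighbor features + alpha_i * h0_i."""
    out = []
    for i, nbrs in enumerate(neighbors):
        dim = len(h[i])
        mean = [sum(h[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        out.append([(1 - alpha[i]) * mean[d] + alpha[i] * h0[i][d] for d in range(dim)])
    return out


def run_layers(h0, neighbors, alpha, num_layers):
    """Apply the layer repeatedly, always mixing back the initial features h0."""
    h = h0
    for _ in range(num_layers):
        h = propagate(h, h0, neighbors, alpha)
    return h


# Path graph 0-1-2; each neighbor list includes the node itself (self-loop).
edges = [(0, 1), (1, 2)]
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
h0 = [[0.0], [1.0], [4.0]]

plain = run_layers(h0, neighbors, [0.0, 0.0, 0.0], 50)     # no residual connection
residual = run_layers(h0, neighbors, [0.3, 0.5, 0.7], 50)  # hypothetical per-node strengths
print(dirichlet_energy(plain, edges), dirichlet_energy(residual, edges))
```

Without the residual term the node features collapse to a common value and the energy vanishes; with per-node residual strengths the energy stays bounded away from zero in this toy example.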
Adaptive Riemannian Graph Neural Networks: This paper proposes ARGNN, a framework that learns a continuous, anisotropic diagonal Riemannian metric tensor for each node in a graph, enabling adaptive capture of local geometric properties across different graph regions (hierarchical structures vs. dense communities). ARGNN unifies and outperforms geometric GNN methods based on fixed curvature or discrete mixed-curvature spaces.
Are Graph Transformers Necessary? Efficient Long-Range Message Passing with Fractal Nodes in MPNNs: This paper proposes Fractal Nodes (FN) to enhance long-range message passing in MPNNs. Subgraph-level aggregation nodes are generated via METIS graph partitioning, combined with low-pass and high-pass filters (LPF+HPF) and a learnable frequency parameter $\omega$. MLP-Mixer is adopted for cross-subgraph communication. The approach achieves $O(L(|V|+|E|))$ linear complexity while matching or surpassing Graph Transformer performance, earning an AAAI Oral.
Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation: This paper proposes ARG-Designer, which reformulates multi-agent system topology design as a conditional autoregressive graph generation task. Rather than pruning from template graphs, the model incrementally generates agent nodes and communication edges from scratch. ARG-Designer achieves state-of-the-art performance across 6 benchmarks (average 92.78%), reduces token consumption by approximately 50% compared to G-Designer, and supports role expansion without retraining.
Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing: This paper proposes SerenQA, the first framework to formally define the serendipity discovery task in knowledge graph question answering. It introduces an information-theoretic RNS metric, an expert-annotated drug repurposing benchmark dataset, and a three-stage LLM evaluation pipeline. The work reveals that current LLMs perform reasonably on retrieval tasks but have substantial room for improvement in serendipitous exploration.
Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily: This paper proposes AD-GNN, which theoretically analyzes node-level homophily/heterophily characteristics and adaptively assigns different aggregation depths to individual nodes, enabling unified handling of node classification on both homophilic and heterophilic graphs within a single framework.
BugSweeper: Function-Level Detection of Smart Contract Vulnerabilities Using Graph Neural Networks: This paper proposes BugSweeper, which constructs function-level abstract syntax graphs (FLAG) and designs a two-stage GNN architecture to enable end-to-end smart contract vulnerability detection without expert-defined rules, achieving an F1 of 98.57% on reentrancy attack detection.
Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory: This paper proposes CIF, which leverages hypergraphs to extract intra-class structural commonalities from a small number of training samples, guiding memory bank construction and retrieval for few-shot multimodal industrial anomaly detection, achieving state-of-the-art performance.
EchoLess: Label-Based Pre-Computation for Memory-Efficient Heterogeneous Graph Learning: Echoless-LP eliminates training label leakage (the echo effect) caused by multi-hop message passing in label pre-computation via Partition-Focused Echoless Propagation (PFEP). Combined with an Asymmetric Partition Scheme (APS) and a PostAdjust mechanism to address information loss and distribution shift introduced by partitioning, the method remains memory-efficient, is compatible with arbitrary message-passing operators, and achieves state-of-the-art performance on multiple heterogeneous graph benchmarks.
Enhancing Logical Expressiveness in GNNs via Path-Neighbor Aggregation: PN-GNN proposes aggregating neighbor node embeddings along reasoning paths on top of conditional message passing, enhancing the logical rule expressiveness of GNNs (strictly beyond C-GNN) in a plug-and-play manner, while avoiding the generalization degradation caused by the labeling trick. The method achieves improvements on both synthetic datasets and real-world knowledge graph reasoning tasks.
Feature-Centric Unsupervised Node Representation Learning Without Homophily Assumption: This paper proposes FUEL, a method that adaptively learns the degree of graph convolution usage through a node-feature-centric clustering scheme, achieving high-quality unsupervised node representations on both homophilic and non-homophilic graphs without any homophily assumption.
Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data: This paper presents the first systematic investigation of format bias in LLMs when processing heterogeneous-format data (text / table / infobox / knowledge graph). Through a three-stage experimental framework, it reveals the existence of such bias, its data-level driving factors, and its internal causes at the attention mechanism level, and validates the effectiveness of attention rebalancing as an intervention.
GCL-OT: Graph Contrastive Learning with Optimal Transport for Heterophilic Text-Attributed Graphs: This paper proposes GCL-OT, the first framework to introduce Optimal Transport (OT) into graph contrastive learning for heterophilic text-attributed graphs. Three dedicated modules — RealSoftMax similarity estimation, a filter-prompt mechanism, and OT-guided latent homophily mining — address three multi-granularity heterophily challenges: partial heterophily, complete heterophily, and latent homophily, respectively.
GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning: This paper introduces GSAP-ERE — a fine-grained scholarly entity and relation extraction dataset for the machine learning domain, comprising 10 entity types and 18 relation types, with 63K entities and 35K relations annotated across 100 full-text papers. Experiments show that fine-tuned models (NER: 80.6%, RE: 54.0%) substantially outperform LLM prompting approaches (NER: 44.4%, RE: 10.1%).
GT-SNT: A Linear-Time Transformer for Large-Scale Graphs via Spiking Node Tokenization: This paper proposes GT-SNT, which uses spiking neural networks (SNNs) as a graph node tokenizer. It aggregates multi-step propagated features into compact spike-count embeddings that serve as node tokens and employs Codebook-Guided Self-Attention (CGSA) to capture global context in linear time. GT-SNT achieves competitive performance on 9 node classification benchmarks while delivering up to 130× inference speedup.
Human Cognition Inspired RAG with Knowledge Graph for Complex Problem Solving: This paper proposes CogGRAG, a human cognition-inspired knowledge graph-based RAG framework that substantially improves LLM accuracy and reliability on complex Knowledge Graph Question Answering (KGQA) tasks through three stages: top-down mind map decomposition, hierarchical structured retrieval, and dual-LLM self-verification reasoning.
Hyperbolic Continuous Structural Entropy for Hierarchical Clustering: This paper proposes HypCSE, which relaxes discrete Structural Entropy (SE) into a Continuous Structural Entropy (CSE) defined in hyperbolic space. Combined with graph structure learning and contrastive learning, HypCSE enables end-to-end differentiable hierarchical clustering and consistently outperforms both discrete and continuous hierarchical clustering methods across 7 datasets.
Kernelized Edge Attention: Addressing Semantic Attention Blurring in Temporal Graph Neural Networks: This paper proposes KEAT (Kernelized Edge Attention for Temporal Graphs), which addresses the semantic attention blurring problem caused by the entanglement of node and edge representations in temporal graph neural networks. By modulating edge features with continuous-time kernels (Laplacian, RBF, and learnable MLP), KEAT achieves up to 18% MRR improvement over DyGFormer and 7% over TGN on link prediction tasks.
Logical Characterizations of GNNs with Mean Aggregation: This paper provides a systematic logical characterization of GNNs using mean aggregation: under the non-uniform setting, mean-GNNs are equivalent to Ratio Modal Logic (RML); under the uniform setting (relative to MSO), they are equivalent to Modal Logic (ML); when the combination function is additionally required to be continuous and the classification function is a threshold, the expressiveness drops significantly to Alternation-Free Modal Logic (AFML).
Magnitude-Modulated Equivariant Adapter for Parameter-Efficient Fine-Tuning of Equivariant Graph Neural Networks: This paper proposes MMEA (Magnitude-Modulated Equivariant Adapter), a lightweight parameter-efficient fine-tuning method for spherical-harmonic-based equivariant GNNs. By employing scalar gating to independently modulate feature magnitudes along "order–multiplicity" channels, MMEA achieves state-of-the-art molecular potential energy prediction accuracy—surpassing both ELoRA and full fine-tuning—while strictly preserving equivariance and using fewer trainable parameters.
MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation: MoToRec reformulates multimodal recommendation as a discrete semantic tokenization task. By leveraging a sparsely-regularized Residual Quantization VAE (RQ-VAE), raw multimodal features are transformed into composable discrete semantic codes. Combined with adaptive rarity amplification and a hierarchical multi-source graph encoder, the framework effectively addresses the item cold-start problem.
MUG: Meta-path-aware Universal Heterogeneous Graph Pre-Training: MUG is the first universal heterogeneous graph pre-training method that requires no LLM. It unifies heterogeneous node/relation types via contextual structural encoding, aligns representation spaces across graphs with a dimension-aware encoder, and achieves transferable encoding and aggregation through a shared GNN encoder over meta-path views combined with global scatter regularization. MUG substantially outperforms existing methods on cross-domain and few-shot node classification.
MyGram: Modality-aware Graph Transformer with Global Distribution for Multi-modal Entity Alignment: This paper proposes MyGram, which captures deep structural contextual information within each modality via a Modality-aware Graph Diffusion (MGD) module, and introduces a global distribution alignment loss (Gram Loss) based on the determinant of the Gram matrix to enforce cross-modal semantic consistency in high-dimensional space, achieving more robust multi-modal entity alignment.
NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation: This paper proposes NOTAM-Evolve, a self-evolving framework that achieves dynamic knowledge grounding via knowledge graph-enhanced tabular retrieval (KG-TableRAG), combined with iterative SFT+DPO preference optimization and a multi-view voting inference mechanism. The framework enables a 7B-parameter LLM to autonomously master deep parsing of complex aviation NOTAMs, achieving a 30.4% accuracy improvement over the base LLM.
NTSFormer: A Self-Teaching Graph Transformer for Multimodal Isolated Cold-Start Node Classification: This paper proposes NTSFormer (Neighbor-to-Self Graph Transformer), a unified Graph Transformer framework that implements a self-teaching paradigm via a cold-start attention mask. Within the same model, a "student" prediction is derived solely from the node's own features while a "teacher" prediction leverages neighbor information, enabling end-to-end self-teaching without degrading to an MLP. The framework handles missing modalities in multimodal graphs through MoE input projection and multimodal graph pre-computation.
On Stealing Graph Neural Network Models: This paper demonstrates that under strict query budgets (e.g., only 100 queries), an attacker can efficiently steal a GNN model via a two-stage approach: (1) locally obtaining an encoder (randomly initialized or SSL-trained) without interacting with the victim, and (2) strategically selecting queries via K-means clustering. On the Physics dataset, the proposed method achieves 91% accuracy with only 100 queries, whereas the current state-of-the-art requires approximately 5,000 queries plus additional access to victim embeddings to reach comparable performance.
PathMind: A Retrieve-Prioritize-Reason Framework for Knowledge Graph Reasoning with Large Language Models: This paper proposes PathMind, a framework following the Retrieve-Prioritize-Reason paradigm. It identifies important reasoning paths via a semantics-aware path prioritization function that jointly considers cumulative cost and estimated future cost (inspired by A*), and then enhances faithful and interpretable LLM reasoning through a two-stage training strategy comprising task-specific instruction tuning and path-level preference alignment. PathMind achieves state-of-the-art performance on complex reasoning tasks while consuming significantly fewer tokens.
PCoKG: Personality-aware Commonsense Reasoning with Debate: This work constructs PCoKG, the first large-scale personality-aware commonsense knowledge graph comprising 521,316 quadruples $(e, p, r, t)$ (event–personality–reasoning dimension–tail), generated via LLM role-playing combined with a multi-agent debate mechanism to produce high-quality personality-differentiated inferences. Experiments validate that MBTI personality information enhances commonsense reasoning and personalized dialogue generation.
Posterior Label Smoothing for Node Classification: This paper proposes PosteL (Posterior Label Smoothing), which derives soft labels from neighborhood label distributions via Bayesian posterior inference for node classification. The method naturally adapts to both homophilic and heterophilic graphs, achieving accuracy improvements in 76 out of 80 combinations across 8 backbone architectures and 10 datasets.
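As a rough illustration of the idea behind PosteL, the sketch below computes a soft label for each node as a posterior that is proportional to the class prior times the likelihood of the neighbors' labels, with both estimated from labeled nodes and edges. The function name, the `eps` smoothing, and the estimation details are assumptions for this example; the paper's final soft labels may involve further steps, such as interpolation with the one-hot label.

```python
from collections import Counter


def posterior_soft_labels(labels, neighbors, num_classes, eps=1e-9):
    """labels: node -> class for labeled nodes; neighbors: node -> list of nodes."""
    prior = Counter(labels.values())
    total = sum(prior.values())
    # pair[k][m]: number of (class-k node, class-m neighbor) labeled pairs.
    pair = [[0] * num_classes for _ in range(num_classes)]
    for i, k in labels.items():
        for j in neighbors.get(i, []):
            if j in labels:
                pair[k][labels[j]] += 1
    soft = {}
    for i in neighbors:
        scores = []
        for k in range(num_classes):
            row_total = sum(pair[k])
            score = prior[k] / total
            for j in neighbors[i]:
                if j in labels:
                    # Smoothed estimate of P(neighbor label | node label = k).
                    score *= (pair[k][labels[j]] + eps) / (row_total + eps * num_classes)
            scores.append(score)
        z = sum(scores)
        soft[i] = [s / z for s in scores]
    return soft


# Heterophilic 4-cycle: every node's neighbors carry the opposite class.
labels = {0: 0, 1: 1, 2: 0, 3: 1}
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
soft = posterior_soft_labels(labels, neighbors, num_classes=2)
print(soft[0])
```

In this heterophilic toy graph, node 0 has only class-1 neighbors, yet the posterior still favors class 0 because the edge statistics say class-0 nodes connect to class-1 nodes.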
Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG: This paper proposes shifting GraphRAG from "build-then-reason" to "reason-and-construct." The Relink framework dynamically constructs query-specific evidence graphs by combining a high-precision KG backbone with a high-recall latent relation pool. A query-driven ranker unifies the two sources, assessing relevance on demand, completing missing paths, and filtering distractor facts. Relink achieves average gains of 5.4% EM and 5.2% F1 across 5 multi-hop QA benchmarks.
RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA: This paper proposes RFKG-CoT, which enhances LLM reasoning over knowledge graphs via two components: relation-driven adaptive hop-count selection (dynamically adjusting reasoning steps using KG relation activation masks) and few-shot path guidance (in-context examples in a Question-Paths-Answer format). Evaluated on 4 KGQA benchmarks, the method achieves significant improvements — GPT-4 reaches 91.5% (+6.6pp) on WebQSP, and Llama2-7B gains up to +14.7pp.
S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning: This paper proposes S-DAG, which uses a GNN to identify relevant subjects and their dependencies from a given question, constructing a directed acyclic graph. Subject nodes are matched to the most capable expert LLMs (14 domain-specific models of 7–13B parameters), and collaborative reasoning proceeds in DAG topological order (supporting subjects → dominant subject). The resulting small-model pool surpasses GPT-4o-mini (59.73 vs. 58.52) and approaches the performance of a 72B model.
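The "DAG topological order" step in the S-DAG entry can be sketched with Python's standard `graphlib` module. The subject names, the `experts` mapping, and the callable signature below are hypothetical stand-ins for the expert LLMs; the sketch only shows how supporting subjects are processed before the dominant subject that depends on them.

```python
from graphlib import TopologicalSorter


def reasoning_order(dependencies):
    """dependencies maps each subject to the set of subjects it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def run_dag(question, dependencies, experts):
    """Query each subject's expert after all of its supporting subjects."""
    notes = {}
    for subject in reasoning_order(dependencies):
        context = [notes[d] for d in sorted(dependencies.get(subject, ()))]
        notes[subject] = experts[subject](question, context)
    return notes


# Hypothetical subject graph: "engineering" is the dominant subject.
dependencies = {
    "math": set(),
    "physics": {"math"},
    "chemistry": {"math"},
    "engineering": {"physics", "chemistry"},
}
# Stand-in experts that just record how many upstream notes they received.
experts = {
    name: (lambda q, ctx, name=name: f"{name} used {len(ctx)} notes")
    for name in dependencies
}
order = reasoning_order(dependencies)
notes = run_dag("example question", dependencies, experts)
print(order, notes["engineering"])
```

`TopologicalSorter` raises `CycleError` if the subject graph contains a cycle, which is a convenient check that the constructed graph is a valid DAG.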
Self-Adaptive Graph Mixture of Models: This paper proposes SAGMM (Self-Adaptive Graph Mixture of Models), a graph MoE framework that leverages architectural diversity by employing a Topology-Aware Attention Gating (TAAG) mechanism to adaptively select and combine heterogeneous GNN experts, coupled with an adaptive pruning mechanism. SAGMM consistently outperforms individual GNNs and existing MoE methods across 16 benchmarks spanning node classification, graph classification, regression, and link prediction.
Self-Correction Distillation for Structured Data Question Answering: This paper proposes Self-Correction Distillation (SCD), which transfers structured data question answering capabilities from large-scale LLMs (GPT-4) to small-scale LLMs (8B) via an Error Prompting Mechanism (EPM) and a two-stage distillation strategy, achieving state-of-the-art distillation performance across five benchmarks.
Sentient: Detecting APTs Via Capturing Indirect Dependencies and Behavioral Logic: This paper proposes Sentient, an APT detection method combining Graph Transformer pre-training and bidirectional Mamba2 intent analysis. Trained exclusively on benign data, it captures indirect dependencies, removes contextual noise, and correlates behavioral logic, achieving an average 44% reduction in false positive rate across three standard benchmarks.
Sheaf Graph Neural Networks via PAC-Bayes Spectral Optimization: This paper proposes SGPC (Sheaf GNNs with PAC-Bayes Calibration), which integrates Wasserstein optimal transport for learning sheaf restriction maps, variance-reduced diffusion with an adaptive frequency mixing layer, and PAC-Bayes spectral regularization. SGPC consistently outperforms existing GNN and sheaf methods on both homophilic and heterophilic graph node classification benchmarks while providing theoretical generalization guarantees.
Spiking Heterogeneous Graph Attention Networks: This paper proposes SpikingHAN, the first framework to introduce Spiking Neural Networks (SNNs) into heterogeneous graph learning. It employs a single-layer graph convolution with shared parameters to aggregate meta-path-based neighborhood information, fuses multiple meta-path semantics via semantic-level attention, and encodes the resulting representations into 1-bit binary spike sequences. SpikingHAN achieves competitive node classification performance on three datasets with fewer parameters, faster inference, and lower energy consumption.
UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction: This paper proposes UniHR, a unified framework that converts hyper-relational, temporal, and nested KGs into a triple-based representation via Hierarchical Data Representation (HiDR), and designs a Hierarchical Structure Learning (HiSL) module for two-stage intra-fact and inter-fact message passing. UniHR achieves state-of-the-art or competitive link prediction results across 9 datasets spanning 5 KG types.

💬 LLM / NLP

A Content-Preserving Secure Linguistic Steganography: This paper proposes CLstega, the first content-preserving linguistic steganography paradigm, which embeds secret information into an unmodified cover text by fine-tuning a masked language model (MLM) to controllably transform its prediction distribution. The approach achieves a 100% extraction success rate and near-perfect security, with steganalysis detection accuracy approaching the random-guess baseline of 0.5.
An Invariant Latent Space Perspective on Language Model Inversion: This paper proposes the Invariant Latent Space Hypothesis (ILSH), which reframes the LLM inversion problem as reusing the LLM's own latent space. The Inv²A framework is designed to map outputs to denoised pseudo-representations via a lightweight inverse encoder, which are then decoded by a frozen LLM to recover hidden prompts. Inv²A achieves an average BLEU improvement of 4.77% across 9 datasets and attains comparable performance with only 20% of the training data.
AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research: This paper proposes AutoMalDesc, an automated static analysis framework built on an iterative self-paced learning pipeline: starting from 900 expert-annotated seed samples, it fine-tunes Llama-3.3-70B via LoRA to generate pseudo-labels, applies multi-stage quality filtering to obtain 101K samples, and trains a V2 model. The framework performs automated malware classification and behavior description across five scripting languages, improving Batch script detection accuracy from 52.7% to 82.4%.
Blue Teaming Function-Calling Agents: This paper systematically evaluates the robustness of four open-source function-calling LLMs against three attack types, and assesses the effectiveness of eight defense mechanisms, revealing that current models are insecure by default and that existing defenses remain difficult to deploy in practice.
C3TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation: This paper proposes the C3TG framework, which achieves fine-grained multi-attribute controllable text generation through a two-stage approach: in the generation stage, weighted KL divergence is used to fuse attribute distributions and adjust token probabilities; in the optimization stage, an energy function (combining classifier scores and conflict penalty terms) drives iterative rewriting via a Feedback Agent. C3TG achieves 90.4% attribute accuracy across 17 attribute subcategories while substantially reducing toxicity.
CoEvo: Continual Evolution of Symbolic Solutions Using Large Language Models: This paper proposes CoEvo, a framework that integrates LLMs with evolutionary search methodology to achieve continual open-ended evolution of symbolic solutions through a dynamic knowledge library and multi-representation spaces (natural language / mathematical formulas / code), significantly outperforming existing symbolic regression methods on the AI Feynman benchmark.
Collaborative LLM Numerical Reasoning with Local Data Protection: This paper proposes a large-small model collaboration framework that protects sensitive local data through a two-stage anonymization pipeline — topic shifting followed by numerical substitution — applied to local queries. The remote GPT-4 returns reasoning solutions as executable Python code (plug-and-play tools), and the local model only needs to perform numerical back-substitution to obtain the final answer. The framework achieves 16–44% accuracy improvements on FinQA and MultiHiertt while reducing data leakage by 2–45%.
Control Illusion: The Failure of Instruction Hierarchies in Large Language Models: This paper systematically demonstrates that the system/user prompt separation mechanism in current LLMs fails to establish reliable instruction priority, and finds that social hierarchy priors acquired during pretraining (authority, expertise, consensus) exert stronger control over model behavior than explicit system/user role markers.
Conversational Learning Diagnosis via Reasoning Multi-Turn Interactive Learning: This paper proposes ParLD (Preview-Analyze-Reason framework), which leverages multi-agent collaboration to achieve fine-grained, turn-level diagnosis of students' cognitive states during conversational learning. ParLD outperforms traditional knowledge tracing methods by 10% on performance prediction and substantially improves tutoring outcomes.
Do Large Language Models Think Like the Brain? Sentence-Level Evidences from Layer-Wise Embeddings and fMRI: This paper systematically investigates sentence-level alignment between 14 open-source LLMs and human brain language processing by comparing layer-wise LLM representations with fMRI data recorded while participants listened to a natural narrative. Key findings include: middle layers yield the highest brain alignment, instruction tuning substantially enhances alignment, and hemispheric lateralization patterns consistent with classical neurolinguistic theories are observed.
Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging: This paper proposes MergeBarrier, a plug-and-play defense method that disrupts linear mode connectivity (LMC) between a protected model and its homologous counterparts by applying orthogonal projection transformations to attention layers and activation-function-unfolding reparameterization to FFN layers, thereby actively preventing unauthorized model merging without degrading model performance.
From Classification to Ranking: Enhancing LLM Reasoning for MBTI Personality Detection: This paper reformulates MBTI personality detection from four independent binary classifications into a listwise ranking task over all 16 personality types, training a 7B model via SFT cold-start followed by GRPO reinforcement learning with a dual reward (NDCG + dimension similarity), achieving state-of-the-art results on the Kaggle and PANDORA datasets.
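To show what an NDCG-style reward over a ranking of the 16 MBTI types might look like, here is a minimal sketch. Using the number of matching letters as graded relevance and a linear gain are assumptions for illustration; the paper's actual dual reward (NDCG plus dimension similarity) may be defined differently.

```python
import math
from itertools import product

# All 16 MBTI types, e.g. "ESTJ", "INFP".
TYPES = ["".join(t) for t in product("EI", "SN", "TF", "JP")]


def relevance(candidate, truth):
    """Hypothetical graded relevance: number of matching MBTI letters (0 to 4)."""
    return sum(a == b for a, b in zip(candidate, truth))


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranking, truth):
    """NDCG of a full ranking of types against the ground-truth type."""
    gains = [relevance(t, truth) for t in ranking]
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal)


truth = "INFP"
best = sorted(TYPES, key=lambda t: relevance(t, truth), reverse=True)
worst = list(reversed(best))
print(ndcg(best, truth), ndcg(worst, truth))
```

Compared with four independent binary decisions, a listwise score of this kind gives partial credit to rankings that place near-miss types (three matching letters) close to the top.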
Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs: This paper trains CNNs on LLM attention weights to evaluate how well memorization taxonomies align with actual attention mechanisms. It proposes a new three-class taxonomy (Guess/Recall/Non-Memorized) that improves the minimum F1 from 64.7% to 89.0%, and localizes the two memorization types to different depths: Guess relies on low-layer attention, whereas Recall relies on high-layer attention.
ICL-Router: In-Context Learned Model Representations for LLM Routing: This paper proposes ICL-Router, a two-stage training framework (query reconstruction + ICL model routing) that encodes LLM capability profiles as in-context vectors, enabling scalable dynamic model routing. New models can be incorporated without retraining the router, achieving state-of-the-art performance on both in-distribution and out-of-distribution tasks.
Identifying and Analyzing Performance-Critical Tokens in Large Language Models: Through representation-level and token-level ablation experiments, this paper identifies the "performance-critical tokens" that LLMs directly rely on during ICL as template and stopword tokens (e.g., "Answer:"), rather than the content tokens that humans would attend to (e.g., actual text). It further reveals that LLMs indirectly exploit content by aggregating content information into the representations of these critical tokens.
Improving Sustainability of Adversarial Examples in Class-Incremental Learning: This paper proposes the SAE framework to address the degradation of adversarial examples (AEs) caused by domain drift in class-incremental learning (CIL). Through a semantic correction module (jointly guided by CLIP and the CIL model) and a filtering-and-augmentation module (removing semantically confusing samples), SAE maintains attack effectiveness even after a 9× increase in the number of classes, achieving an average attack success rate improvement of 31.28%.
IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization: This paper proposes IROTE, an in-context self-reflective optimization method grounded in information bottleneck theory. By iteratively generating and refining compact yet evocative textual "self-reflections," IROTE stably elicits target human traits (values, morality, personality) from LLMs across diverse downstream tasks without any fine-tuning, consistently outperforming existing baselines in trait consistency.
Language Models and Logic Programs for Trustworthy Tax Reasoning: This paper reframes tax law reasoning as a semantic parsing task, where LLMs translate statutory text and case facts into Prolog logic programs that are subsequently executed by a symbolic solver. By combining gold-standard statute translations, retrieval-augmented case examples, and self-consistency checks, the system achieves 86/100 accuracy on the SARA dataset while reducing estimated deployment cost to $15.78 per person — less than 6% of the average U.S. tax filing cost.
Learning Spatial Decay for Vision Transformers: This paper proposes the Spatial Decay Transformer (SDT), which for the first time adapts data-dependent spatial decay mechanisms from 1D sequence modeling to 2D vision Transformers. Through a Context-Aware Gating (CAG) module that generates dynamic, content-dependent decay intensities for patch interactions, SDT consistently outperforms strong baselines such as RMT on ImageNet-1K classification and generation tasks.
LoKI: Low-damage Knowledge Implanting of Large Language Models: This paper proposes LoKI, a parameter-efficient fine-tuning method grounded in the mechanistic understanding of knowledge storage in Transformers. It introduces Knowledge Vector Attribution (KVA) to quantify the contribution of each knowledge vector in FFN layers, and applies a layer-balanced strategy to select low-contribution vectors for targeted knowledge implanting. The approach achieves strong task performance while substantially mitigating catastrophic forgetting.
LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation: This paper proposes LoopLLM, a framework that launches energy-latency attacks by inducing LLMs into repetitive generation modes. Through repetition-inducing prompt optimization and token-aligned ensemble optimization, LoopLLM drives generation to over 90% of the maximum output length across 12 open-source and 2 commercial LLMs, with approximately 40% improvement in cross-model transferability.
ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models Using Pareto High-Quality Data: This paper proposes ParetoHqD, which represents human preferences as preference directions in objective space (rather than linear scalarization), and performs two-stage SFT on high-quality data selected near the Pareto front. Using only 42% of the GPU time, it achieves multi-objective LLM alignment performance superior to five baselines.
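To make "data near the Pareto front" concrete, the sketch below extracts the non-dominated points from a set of two-objective scores. It only illustrates the front itself. The paper's preference directions and its selection rule for nearby samples are not reproduced, and the scores and function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated points, assuming every objective is maximized."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (helpfulness, harmlessness) reward scores for five responses.
scores = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(scores)
print(front)
```

Here `(0.5, 0.5)` and `(0.1, 0.1)` are both dominated by `(0.6, 0.6)`, so they are dropped. The remaining three points trade one objective against the other.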
PERSIST: Persistent Instability in LLM's Personality Measurements: The PERSIST framework systematically evaluates personality measurement stability across 29 LLMs (1B–685B) on over 2 million responses, revealing a "reasoning paradox" in which CoT reasoning increases variability while reducing perplexity, as well as a scale-dependent effect whereby conversational history exerts opposite influences on large versus small models—collectively indicating that current LLMs lack the architectural foundation for behavioral consistency.
Position on LLM-Assisted Peer Review: Addressing Reviewer Gap through Mentoring and Feedback: This position paper proposes shifting the role of LLMs in peer review from "automatically generating reviews" to "augmenting human reviewer capabilities" — via an LLM-driven mentoring system (three-phase training + certification) and a feedback system (violation detection + evidence-based feedback + reliability testing) to close the reviewer quality gap.
ProFuser: Progressive Fusion of Large Language Models: This paper proposes ProFuser, which identifies the strengths of each source model across different dimensions via dual-mode advantage assessment (training-mode Min-CE + inference-mode Reward Model voting). It then integrates the complementary capabilities of heterogeneous LLMs into a single target model through a progressive fusion strategy (inference mode first → training mode second, as an easy-to-hard curriculum), achieving an average improvement of 1.65% across 6 benchmarks covering knowledge, reasoning, and safety.
PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixing of Experts: PromptMoE shifts prompt learning from a monolithic paradigm to a compositional one. Through a visually-guided Mixture of Experts (MoE) mechanism, it dynamically assembles instance-adaptive normal/abnormal state prompts from a learnable semantic primitive bank, achieving state-of-the-art zero-shot anomaly detection (ZSAD) performance across 15 industrial and medical datasets.
Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts: From a unified distortion rectification perspective, this paper proposes the UniRect framework, which employs Residual Progressive TPS for geometric deformation correction and Residual Mamba Blocks for degradation compensation. UniRect jointly handles four tasks (portrait correction, wide-angle rectangling, stitching rectangling, and rotation correction) through Sparse MoE-based multi-task learning. It achieves PSNR gains of 3.82 dB on stitching rectangling and 0.87 dB on rotation correction.
Scalable and Accurate Graph Reasoning with LLM-Based Multi-Agents: This paper proposes GraphAgent-Reasoner (GAR), inspired by distributed graph computation theory. It decomposes graph problems into node-centric subtasks assigned to multiple agents, which collaborate through neighbor message passing. GAR extends the graph scale tractable by LLMs from 100 nodes to 1,000 nodes, and significantly outperforms existing state-of-the-art methods on polynomial-time graph reasoning tasks.
Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents: This paper proposes a zero-shot multi-agent pipeline comprising five role-based GPT-4o agents that assess learner reflection texts using a rubric-based scoring scheme and generate bias-aware conversational feedback. Evaluated on 336 reflections, the system achieves MAE=0.467, QWK=0.459 in scoring agreement, and a feedback quality score of Q(g)=3.967.
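For readers unfamiliar with the two agreement metrics reported above, the following sketch computes MAE and quadratic weighted kappa (QWK) from scratch. It assumes rubric scores are integers in `0..n_classes-1`. The example scores are made up and do not come from the paper.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - (weighted observed disagreement) / (weighted chance disagreement)."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(y_true, y_pred):
        observed[a][b] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic penalty
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n  # expected count under independence
    return 1.0 - num / den

# Hypothetical human and model scores on a 0-3 rubric.
human = [0, 1, 2, 3, 2, 1]
model = [0, 2, 2, 3, 1, 1]
print(mae(human, model), quadratic_weighted_kappa(human, model, n_classes=4))
```

QWK penalizes a two-point disagreement four times as heavily as a one-point disagreement, which is why it is commonly paired with MAE for ordinal rubric scoring.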
Smart: A GNN-LLM Hybrid Surrogate Model for Dragonfly System Application Runtime Prediction: This paper proposes Smart (Surrogate Model for Predicting Application RunTime), the first approach to integrate GNN and LLM (Time-LLM) for iterative application runtime prediction in Dragonfly interconnection networks. On a 1,056-node system, Smart achieves a minimum MAPE of 1.78% (LAMMPS) with an inference time of only 0.515 seconds, delivering orders-of-magnitude speedup over full-scale simulation.
Soft Filtering: Guiding Zero-Shot Composed Image Retrieval with Prescriptive and Proscriptive Prompts: This paper proposes SoFT, a training-free plug-and-play reranking module that leverages a multimodal LLM to extract dual textual constraints — "must include" (prescriptive) and "must avoid" (proscriptive) — from a reference image and modification text, and applies soft-filtering reranking over candidate results in zero-shot composed image retrieval. A multi-target triplet dataset construction pipeline is also introduced to improve evaluation.
STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transitive Evaluation Model: This paper proposes STEM, a framework that identifies "Significant Transition Samples" (STS) across models of the same architecture but varying scales to construct a lightweight evaluation subset, enabling efficient relative capability localization of unknown LLMs. STEM achieves 100% localization accuracy with only 100 samples, substantially outperforming random sampling and Bayesian methods.
TEMPLE: Incentivizing Temporal Understanding of Video LLMs via Progressive Pre-SFT Alignment: This paper proposes TEMPLE, which significantly enhances the temporal reasoning capabilities of Video LLMs through an automated video temporal preference data generation pipeline (video filtering → temporal perturbation → contrastive response generation) and a novel Progressive Pre-SFT Alignment strategy (curriculum learning + DPO prior to SFT), using a small amount of self-generated DPO data. Consistent improvements are achieved across multiple benchmarks including VideoMME, MLVU, and Vinoground.
TransMamba: A Sequence-Level Hybrid Transformer-Mamba Language Model: This paper proposes TransMamba, a sequence-level Transformer-Mamba hybrid architecture that dynamically switches between Attention and SSM computation at different token positions via shared QKV/CBx parameters and a Memory Converter, achieving efficiency advantages for both short and long sequences.
Uncertainty Under the Curve: A Sequence-Level Entropy Area Metric for Reasoning LLMs: This paper proposes the Entropy Area Score (EAS)—a method that quantifies uncertainty in reasoning LLMs by integrating token-level predictive entropy via a single forward pass. EAS requires neither external models nor repeated sampling, achieves strong correlation with answer entropy (Pearson $r=0.82$), and when applied to training data selection outperforms Pass Rate filtering by 1.2–2.3% Pass@1, making it an efficient and interpretable uncertainty estimation tool for LLMs.
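The idea of integrating token-level entropy can be sketched in a few lines. The exact EAS definition and any normalization are not given in this note, so the example assumes the simplest reading: sum the Shannon entropy of each next-token distribution over the generated sequence. The function names and toy distributions are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_area(step_distributions):
    """Sum of per-token entropies: a discrete 'area under the entropy curve'."""
    return sum(token_entropy(p) for p in step_distributions)

# Toy distributions over a 4-token vocabulary for a 3-token generation.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
print(entropy_area(confident), entropy_area(uncertain))
```

In practice the distributions would come from the softmax outputs already produced during decoding, which is why no extra sampling or external model is needed.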
Understanding Syllogistic Reasoning in LLMs from Formal and Natural Language Perspectives: This work systematically evaluates 14 LLMs on 160 syllogisms using a dual-dimensional ground truth framework (syntactic validity + NLU believability), revealing that top models reach near-perfect performance on formal logic (99.6%) while performing at chance level on natural language believability (~52%), the inverse of human reasoning patterns. Twelve of the 14 models exhibit significant belief bias, and few-shot prompting degrades formal reasoning performance.
Vision Transformers are Circulant Attention Learners: This paper discovers that self-attention matrices in ViTs inherently learn Block Circulant with Circulant Blocks (BCCB) patterns, and proposes Circulant Attention, which achieves $O(N\log N)$ complexity via 2D FFT, yielding consistent improvements on ImageNet classification, COCO detection, and ADE20K segmentation.
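The $O(N\log N)$ claim rests on a standard fact: multiplying by a circulant matrix is a circular convolution, which the FFT diagonalizes. The sketch below shows the 1D analogue in plain Python and checks it against the dense product. It is not the paper's Circulant Attention, which works with BCCB matrices and a 2D FFT, and all function names are hypothetical.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def circulant_matvec_fft(c, v):
    """Multiply the circulant matrix with first column c by v in O(N log N)."""
    n = len(c)
    prod = [a * b for a, b in zip(fft(c), fft(v))]
    return [z.real / n for z in fft(prod, inverse=True)]

def circulant_matvec_dense(c, v):
    """Reference O(N^2) product, with C[i][j] = c[(i - j) % n]."""
    n = len(c)
    return [sum(c[(i - j) % n] * v[j] for j in range(n)) for i in range(n)]

c = [1.0, 2.0, 3.0, 4.0]   # first column of the circulant matrix
v = [1.0, 0.0, -1.0, 2.0]  # vector to multiply
print(circulant_matvec_fft(c, v))
print(circulant_matvec_dense(c, v))
```

The two printed vectors agree up to floating-point error. Only the first column needs to be stored, so the parameter count also drops from $N^2$ to $N$.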
VSPO: Validating Semantic Pitfalls in Ontology via LLM-Based CQ Generation: This paper proposes the VSPO framework, which constructs a definition–axiom misalignment dataset and fine-tunes LLaMA-3.1-8B-Instruct to generate competency questions (CQs) capable of validating semantic pitfalls in ontologies (e.g., misuse of allValuesFrom). The approach surpasses GPT-4.1 by 26% in precision and 28.2% in recall.

🤖 Robotics & Embodied AI

10 Open Challenges Steering the Future of Vision-Language-Action Models: This paper systematically surveys 10 open challenges facing VLA models — multimodal perception, robust reasoning, high-quality training data, evaluation, cross-robot action generalization, resource efficiency, whole-body coordination, safety assurance, agent frameworks, and human-robot collaboration — and discusses four emerging trends: spatial understanding, world dynamics modeling, post-training, and data synthesis.
A Computable Game-Theoretic Framework for Multi-Agent Theory of Mind: This paper proposes a game-theoretic framework based on Poisson cognitive hierarchy, achieving computable multi-agent Theory of Mind via Gamma-Poisson conjugate Bayesian updates. The framework supports recursive bounded-rationality decision-making and online belief revision while avoiding the undecidability of POMDPs.
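The Gamma-Poisson conjugate update mentioned above has a closed form, which is what keeps belief revision computable. The sketch below shows the textbook update for a Poisson rate with a Gamma(shape, rate) prior. How the paper maps observed behavior to counts is not stated here, so treating each observation as an opponent's reasoning level is an assumption, as are the prior values.

```python
def gamma_poisson_update(alpha, beta, observations):
    """Posterior Gamma(shape, rate) over a Poisson rate after observing counts."""
    return alpha + sum(observations), beta + len(observations)

# Hypothetical prior over the Poisson rate: mean = alpha / beta = 2.0.
alpha, beta = 2.0, 1.0
# Hypothetical reasoning levels inferred from four observed opponents.
observed_levels = [1, 0, 2, 1]

alpha_post, beta_post = gamma_poisson_update(alpha, beta, observed_levels)
posterior_mean = alpha_post / beta_post
print(alpha_post, beta_post, posterior_mean)
```

Each new observation costs only two additions, so the belief over the population's reasoning depth can be revised online without any sampling.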
Adaptive Theory of Mind for LLM-based Multi-Agent Coordination: This paper proposes the Adaptive Theory of Mind agent (A-ToM), which formulates ToM order alignment as an online expert advice problem. By employing Follow-the-Leader (FTL) or Hedge algorithms to estimate a partner's ToM order in real time and dynamically adjust its own reasoning depth, A-ToM achieves robust zero-shot multi-agent coordination across four task categories, including repeated matrix games, grid navigation, and Overcooked.
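A minimal sketch of the Hedge step follows, assuming each candidate ToM order is an "expert" that predicts the partner's next action and incurs a 0/1 loss when it is wrong. The predictions, action sequence, and learning rate `eta` are hypothetical, and the paper's actual loss design is not reproduced.

```python
import math

def hedge_update(weights, losses, eta=1.0):
    """One Hedge step: scale each weight by exp(-eta * loss), then renormalize."""
    scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

orders = [0, 1, 2]                     # candidate partner ToM orders
weights = [1.0 / len(orders)] * len(orders)
predictions = {0: "A", 1: "B", 2: "A"}  # hypothetical action each order predicts
partner_actions = ["B", "B", "A", "B"]  # hypothetical observed partner actions

for action in partner_actions:
    losses = [0.0 if predictions[k] == action else 1.0 for k in orders]
    weights = hedge_update(weights, losses)

estimated_order = orders[weights.index(max(weights))]
print(weights, estimated_order)
```

Follow-the-Leader would instead commit to the order with the lowest cumulative loss. Hedge keeps a distribution over orders, which makes it less brittle when the partner's behavior is noisy.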
Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation: This paper addresses the base placement problem in open-vocabulary mobile manipulation (OVMM) and proposes a zero-shot framework that constructs a cross-modal representation (Affordance RGB + Obstacle Map+) to project semantic affordance cues onto an obstacle map, followed by a coarse-to-fine iterative optimization that balances semantic and geometric constraints. The method achieves an 85% success rate across five manipulation tasks, substantially outperforming both geometric planners and pure VLM-based approaches.
Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning: This paper proposes reinterpreting the Transformer self-attention mechanism as a soft binding/unbinding operator in Vector Symbolic Architectures (VSA) — where Query/Key define a role space, Value encodes fillers, attention weights implement differentiable unbinding, and residual connections implement superposition — thereby providing an algebraic perspective that unifies explanations of LLM capability and fragility in symbolic reasoning. The paper further proposes VSA-inspired architectural improvements such as explicit binding heads and hyperdimensional memory layers.
Causal Inference Under Threshold Manipulation: Bayesian Mixture Modeling and Heterogeneous Treatment Effects: This paper proposes the BMTM/HBMTM Bayesian mixture model framework. In scenarios where consumers strategically manipulate spending to reach reward thresholds, the framework decomposes the observed distribution into bunching and non-bunching sub-distributions to accurately estimate threshold causal effects and heterogeneous treatment effects across subgroups.
Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning: This paper proposes the CCoL framework, which addresses both physical discontinuity in action sequences and semantic-physical misalignment in Behavioral Cloning through NeuralODE-driven Multimodal Continuous Co-learning (MCC) and bidirectional cross-attention-based Cross-modal Semantic-Physical Alignment (CSA). CCoL achieves an average relative improvement of 8.0% across three simulation platforms, with up to 19.2% on the bimanual insertion task.
Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling: This paper proposes GRM, a framework that achieves robust fine-grained image-text alignment through intra-modal saliency/granularity-aware adapters and Gaussian mixture-based region-level uncertainty modeling, attaining state-of-the-art performance on Flickr30K and MS-COCO.
Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment: This paper proposes the PKDA framework, which automatically converts human hand manipulation videos into high-quality manipulation trajectories for multi-fingered dexterous hands via progressive kinematic-dynamic alignment, achieving an average transfer success rate of 73%.
Do LLMs Really Struggle at NL-FOL Translation? Revealing Their Strengths via a Novel Benchmarking Strategy: This paper critically examines existing evaluation methodologies for natural language to first-order logic (FOL) translation — specifically FOLIO and MALLS — exposing fundamental flaws in their datasets and evaluation protocols. The authors propose a novel benchmarking strategy that decomposes the translation task into ontology extraction (OE) and logical translation (LT), augmented with "most similar selection" and "ranking" subtasks. Experiments demonstrate that conversational LLMs (o3-mini, GPT-4o-mini, Qwen3 series) exhibit strong NL-FOL translation capabilities and genuine logical semantic understanding, while embedding-based models perform significantly worse.
EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer: This paper proposes EvoEmpirBench (EEB), comprising two dynamic interactive benchmarks (partially observable maze navigation + Match-2), and the Agent-ExpVer three-agent online learning framework (GeoLink for interaction + InsightForce for experience abstraction + TruthWeaver for knowledge management). Through a cognitive cycle of "experience → verification → truth induction," the framework achieves continuous strategy evolution without parameter updates, improving GPT-4.1 success rate by 5.6% and Qwen-32B by 29%.
From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations: This paper proposes BoxPromptIML, a weakly supervised image manipulation localization (IML) framework based on coarse-grained bounding box annotations. It leverages a frozen SAM teacher model to convert rough bounding boxes into high-quality pseudo-masks, and trains a lightweight student model via a memory-guided gated fusion module (MGFM), achieving performance comparable to or surpassing fully supervised methods with an annotation cost of only 7 seconds per image.
From Woofs to Words: Towards Intelligent Robotic Guide Dogs with Verbal Communication: This paper proposes a dialogue system for robotic guide dogs that leverages LLMs and a task planner to achieve Plan Verbalization and Scene Verbalization, supporting multi-turn natural language dialogue to assist visually impaired users in navigation decision-making. The system's effectiveness is validated through a real-user study and simulation experiments.
Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment: This paper systematically evaluates three text manipulation strategies—verbosity, strategic multi-answer embedding, and correct-answer-first with contradictory suffix—against LLM-based answer-matching judges. The results show that these manipulations do not improve scores and often reduce them. Binary scoring proves more robust than continuous scoring, demonstrating that answer matching is resistant to low-cost text manipulation as an evaluation method.
Sketch-HARP: Hierarchical Autoregressive Sketch Generation for Flexible Stroke-Level Drawing Manipulation: This paper proposes Sketch-HARP, a hierarchical autoregressive sketch generation framework that achieves, for the first time, flexible stroke-level manipulation during the drawing process through a three-stage hierarchical pipeline (predicting stroke embeddings → determining canvas positions → generating drawing action sequences). The method significantly outperforms SketchEdit on tasks including stroke replacement, erasure, and extension.
GRIM: Task-Oriented Grasping with Conditioning on Generative Examples: This paper proposes GRIM (Grasp Re-alignment via Iterative Matching), a training-free task-oriented grasping (TOG) framework that employs a retrieve–align–transfer pipeline combining video generation models with a multi-source memory bank. By leveraging DINO-feature-based semantic 3D alignment, GRIM achieves functional grasp transfer across objects, surpassing GraspMolmo—trained on 379K samples—using only 210 memory instances.
H-GAR: A Hierarchical Interaction Framework via Goal-Driven Observation-Action Refinement for Robotic Manipulation: This paper proposes H-GAR, a hierarchical goal-driven framework that first predicts a goal observation and then synthesizes intermediate observations, while refining coarse-grained actions via a historical action memory bank. This design enables explicit bidirectional interaction between observations and actions, achieving state-of-the-art performance on both simulated and real-robot manipulation tasks.
Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search: This paper formalizes the Human-Centric Open-Future Task Discovery (HOTD) problem—identifying tasks that reduce human burden across multiple possible futures in scenarios where human intentions are concurrent and dynamically evolving. The authors construct the HOTD-Bench benchmark (2K+ real-world videos) and propose CMAST (Collaborative Multi-Agent Search Tree), which substantially outperforms existing LMM baselines via a multi-agent system and a scalable search tree.
Human Cognitive Biases in Explanation-based Interaction: The Case of Within and Between Session Order Effect: This paper systematically evaluates the impact of order effects on Explanatory Interactive Learning (XIL) through two large-scale user studies (713 participants in total). The findings show that order effects have a limited and inconsistent influence on user feedback quality, with a statistically significant but weak effect observed only within sessions (not between sessions). The overall conclusion is that order effects do not constitute a major obstacle to the practical deployment of XIL.
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification: This paper proposes iSeal — the first active fingerprinting method capable of reliably verifying LLM ownership in a black-box setting where the model thief has full control over the inference process. Through a triple mechanism of an external encrypted encoder, RSC error correction, and similarity-based matching, iSeal maintains a 100% Fingerprint Success Rate (FSR) across 12 LLMs and 10+ attack types, while existing methods drop to 0%.
LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward: This paper proposes the LaF-GRPO framework, which employs an LLM to simulate the responses of visually impaired users to navigation instructions as a reward signal. By applying GRPO-based post-training to a VLM, the framework generates more precise and safer navigation instructions for the visually impaired. The authors also construct the NIG4VI benchmark dataset comprising 27k samples.
More Than Irrational: Modeling Belief-Biased Agents: This paper proposes a computational rationality (CR) user model framework that interprets seemingly "irrational" human behavior as optimal decision-making under limited memory (belief bias). A nested particle filter (NPF) infers the user's latent memory bound parameter $\theta$ and biased belief state $\tilde{b}$ online. The posterior mean (PM) error is reduced by 90% within 45 steps, and adaptive AI assistant policies are demonstrated within an assistive POMDP.
Neural Graph Navigation for Intelligent Subgraph Matching: This paper proposes NeuGN (Neural Graph Navigation), the first framework to integrate generative neural navigation into the core enumeration phase of subgraph matching. By combining QSExtractor—which extracts structural signals from query graphs—with GGNavigator—which replaces brute-force enumeration with structure-aware candidate node prioritization—NeuGN reduces First Match Steps by up to 98.2% while guaranteeing completeness.
PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory: This paper proposes PanoNav, a mapless zero-shot object navigation framework that uses only RGB images. It unlocks the spatial reasoning capability of MLLMs through Panoramic Scene Parsing and introduces a Dynamic Bounded Memory Queue to prevent local deadlock.
Realistic Synthetic Household Data Generation at Scale: This paper proposes an LLM-driven bidirectional coupling generation framework that iteratively generates large-scale synthetic datasets — encompassing household environment configurations, human activities, and human-robot interactions (HRI) — through a cycle in which persona profiles drive environment generation and environment semantics in turn guide activity generation, targeting the training of home robots.
Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation: This paper proposes a VLN policy based on Implicit Scene Representation (ISR), which compresses historical trajectories into a fixed-size compact neural grid via Recursive Visual Imagination (RVI) to learn high-level scene priors, and employs Adaptive Linguistic Grounding (ALG) to finely align different semantic components of navigation instructions with different grid cells. The approach achieves state-of-the-art performance on two continuous-environment navigation benchmarks: R2R-CE and ObjectNav.
RENEW: Risk- and Energy-Aware Navigation in Dynamic Waterways: This paper proposes RENEW, a global path planner for autonomous surface vessels (ASVs) operating in environments with dynamic water currents. It introduces a unified risk- and energy-aware strategy via adaptive no-go zone identification, best-effort contingency planning, and a hierarchical architecture based on Constrained Delaunay Triangulation (CDT), achieving zero collisions in emergency maneuver tests.
Robust Out-of-Order Retrieval for Grid-Based Storage at Maximum Capacity: For the problem of uncertain retrieval order in fully loaded 2D grid-based storage systems, this paper proposes the k-bounded perturbation uncertainty model, proves that $\Theta(k)$ columns are both necessary and sufficient for zero relocation, and presents an efficient robust storage solver and greedy retrieval strategy. The approach nearly eliminates relocations when $k \leq 0.5c$ and still reduces relocations by 50%+ when $k$ reaches $c$.
SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation: This paper proposes the SemanticVLA framework, which integrates three modules — a Semantic-guided Dual-encoder Pruner (SD-Pruner), a Semantic-complementary Hierarchical Fuser (SH-Fuser), and a Semantic-conditioned Action Coupler (SA-Coupler) — to substantially reduce visual redundancy while enhancing instruction–vision–action alignment. On the LIBERO benchmark, SemanticVLA achieves a 97.7% success rate, surpassing OpenVLA by 21.1%, while reducing training cost and inference latency by 3.0× and 2.7×, respectively.
Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems: This paper presents the first systematic security analysis of LLM-based multi-agent software development systems (ChatDev/MetaGPT/AgentVerse). It proposes the IMBIA attack framework, which covers two threat scenarios (malicious user + benign agents / benign user + malicious agent) and 12 malicious behaviors across 5 malware families, achieving an attack success rate (ASR) of up to 93% on ChatDev. The accompanying Adv-IMBIA adversarial defense reduces ASR by 40–73%.
SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation: This paper proposes SpatialActor, a framework that explicitly disentangles semantic and geometric representations. It introduces a Semantic-Guided Geometry Module (SGM) that adaptively fuses noisy depth features with a pretrained depth estimation expert prior, and a Spatial Transformer (SPT) that encodes low-level spatial position cues. SpatialActor achieves an 87.4% success rate on 50+ RLBench tasks, surpassing the previous state of the art by 6.0%, and outperforms RVT-2 by 19.4% under heavy-noise conditions.
Theory of Mind for Explainable Human-Robot Interaction: This paper proposes positioning Theory of Mind (ToM) as a form of Explainable AI (XAI), systematically evaluates existing ToM research in HRI using the seven criteria of the VXAI framework, identifies critical deficiencies (most notably the absence of fidelity assessment), and advocates for integrating ToM into XAI frameworks to achieve user-oriented explanations.
To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance: By introducing a controllable contrastive learning module to systematically regulate alignment strength $\lambda$, and employing the Partial Information Decomposition (PID) framework to quantify the redundancy–uniqueness–synergy structure between modalities, this work reveals that the utility of explicit alignment is highly data-dependent: alignment is beneficial when redundancy dominates, harmful when uniqueness dominates, and an optimal $\lambda^*$ exists in mixed scenarios.
TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception: This paper proposes TouchFormer, a robust multimodal fusion framework that achieves reliable material perception under vision-impaired conditions through three complementary modules: Modality-Adaptive Gating (MAG), intra- and inter-modal attention mechanisms, and Cross-Instance Embedding Regularization (CER). The approach is validated in a robotic sorting experiment under simulated fire scenarios.
Towards Reinforcement Learning from Neural Feedback: Mapping fNIRS Signals to Agent Performance: This paper proposes the NEURO-LOOP framework, which leverages fNIRS (functional near-infrared spectroscopy) brain signals as implicit neural feedback to evaluate RL agent performance. The authors release an fNIRS dataset spanning 25 subjects × 3 domains × 6 conditions. Classification F1 reaches 67% (binary) / 46% (multi-class), with cross-subject fine-tuning yielding improvements of 17% and 41% respectively, laying the groundwork for Reinforcement Learning from Neural Feedback (RLNF).
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation: This paper demonstrates that fine-tuning LLMs on benign agentic data causes unintended safety misalignment (attack success rate increases by 32–38%), and proposes PING (Prefix Injection Guard)—an iterative generate-and-evaluate approach that automatically discovers natural language prefixes to guide fine-tuned agents toward refusing harmful requests, achieving an average refusal rate improvement of 66% (Web) and 44% (Code) while preserving task performance (degradation of only 1.8%).
UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories: This paper proposes UrbanNav, which leverages web-scale urban walking videos (1,500+ hours from YouTube, yielding 3 million instruction–trajectory–landmark triplets) to train a language-guided urban navigation policy via an automated annotation pipeline and robust filtering mechanism, achieving an 83.3% navigation success rate in real-world deployment.
When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets: This paper introduces the CAIA benchmark, which leverages cryptocurrency markets as a natural adversarial laboratory to evaluate 17 state-of-the-art LLMs on agent capabilities in high-stakes adversarial environments. Results reveal that frontier models achieve only 67.4% accuracy (GPT-5) compared to a human baseline of 80%, and expose systematic tool selection failures.

🔬 Interpretability

A Closer Look at Knowledge Distillation in Spiking Neural Network Training: To address the overlooked distribution mismatch between teacher ANN continuous features/logits and student SNN discrete sparse spike features/logits in ANN→SNN knowledge distillation, this paper proposes the CKDSNN framework based on Saliency-scaled Activation Map Distillation (SAMD) and Noise-smoothed Logits Distillation (NLD), achieving new state-of-the-art SNN training performance on CIFAR-10/100, ImageNet-1K, and CIFAR10-DVS.
A Coherence-Based Measure of AGI: This paper identifies that existing AGI scores rely on arithmetic averaging, which implicitly encodes a "compensatory" assumption (strengths offsetting weaknesses), and proposes $\text{AGI}_{\text{AUC}}$—a coherence measure based on the continuous spectrum of generalized means. By integrating over the compensability parameter $p \in [-1, 1]$, the metric penalizes uneven capability profiles and exposes bottlenecks concealed by arithmetic averaging.
Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval: This paper proposes DEMR, a framework that introduces Deep Evidential Regression (DER) into video moment retrieval. It mitigates modal imbalance via a Reflective Flipped Fusion (RFF) module and corrects the counter-intuitive uncertainty estimation bias in vanilla DER via a Geom-regularizer, achieving significant improvements on both standard and debiased benchmarks.
Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT: This work applies mechanistic interpretability to reverse-engineer the internal circuits of a Video Vision Transformer (ViViT), revealing a functional division of labor in which attention heads are responsible for "gathering evidence" and MLP modules for "composing concepts." The analysis demonstrates that the model develops semantic knowledge beyond its training objective even on simple classification tasks.
Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models: This paper proposes the Composite Reliability Score (CRS), which unifies calibration, robustness, and uncertainty quantification into a single interpretable metric. A systematic evaluation of 10 open-source LLMs across 5 QA datasets reveals that Mixtral-8x22B achieves the highest overall reliability (CRS=0.81), and that model size does not directly determine reliability.
Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution: This paper proposes the first systematic comparative framework that directly contrasts strategic behavioral differences between humans and personality-prompted LLMs in paired dispute mediation scenarios, finding significant divergence in personality-behavior mapping and challenging the assumption that personality prompting can serve as a proxy for human behavior.
Concepts from Representations: Post-hoc Concept Bottleneck Models via Sparse Decomposition of Visual Representations: This paper proposes PCBM-ReD, a post-hoc concept bottleneck model that automatically extracts concepts from pretrained visual encoders via sparse autoencoders, annotates and filters them using MLLMs, and selects a representative subset through reconstruction-guided search. Image representations are then sparsely decomposed into linear combinations of concept embeddings via CLIP's vision-language alignment. The method achieves state-of-the-art accuracy on 11 classification benchmarks while maintaining interpretability.
CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution: CrossCheck-Bench is a three-level hierarchical benchmark comprising 15k adversarial QA samples. It diagnoses compositional reasoning failures of VLMs in multimodal conflict resolution via 7 atomic capabilities and 15 tasks, revealing systematic performance degradation from perception (L1) to reasoning (L3) and exposing the limitations of conventional prompting strategies.
Data Whitening Improves Sparse Autoencoder Learning: This paper introduces PCA whitening — a standard preprocessing step from classical sparse coding — into modern sparse autoencoder (SAE) training. Through theoretical analysis and simulation, it demonstrates that whitening renders the optimization landscape more convex and isotropic. Experiments on SAEBench show that whitening substantially improves interpretability metrics (Sparse Probing +7.3%, SCR +54%, TPP +372%), albeit with a slight decrease in reconstruction quality.
Distribution-Based Feature Attribution for Explaining the Predictions of Any Classifier: This paper proposes DFAX, the first distribution-based feature attribution method, which quantifies feature importance by comparing the conditional probability density of a target instance under the target class versus non-target classes. It provides the first formal definition of feature attribution, and demonstrates significant improvements over SHAP/LIME and other baselines across 10 datasets while being orders of magnitude faster.
DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment: This paper proposes the DR.Experts framework, which leverages DA-CLIP to obtain distortion-type priors, employs a Distortion Saliency Differential Module (DSDM) to disentangle distortion attention from semantic attention and thereby purify distortion features, and then applies a Dynamic Distortion Weighting Module (DDWM) to adaptively weight each distortion type's features according to its perceptual impact. The method achieves state-of-the-art performance on five BIQA benchmarks.
ElementaryNet: A Non-Strategic Neural Network for Predicting Human Behavior in Normal-Form Games: This paper proposes ElementaryNet, a neural network architecture that is provably incapable of strategic reasoning, designed to model "level-0" (non-strategic) human behavior in games. It achieves prediction accuracy statistically indistinguishable from GameNet (current SOTA) while offering substantially better interpretability.
Enhancing Binary Encoded Crime Linkage Analysis Using Siamese Network: This paper proposes a Siamese Autoencoder-based crime linkage analysis framework that integrates geo-temporal features at the decoder stage and employs a domain expert-driven dimensionality reduction strategy. Evaluated on the real-world ViCLAS database from the UK National Crime Agency (NCA), the method achieves up to 9% AUC improvement, providing an effective machine learning solution for high-dimensional sparse binary-encoded crime data.
Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation: This paper proposes the CEFM framework, which aligns ViT visual features with ABCD-rule-based clinical features (asymmetry, border, color) via cross-modal contrastive learning, and subsequently employs CLIP and DeepSeek to generate structured diagnostic reports. On the ISIC dataset, the framework achieves 92.79% accuracy and 0.961 AUC, with an expert-rated interpretability score of 4.6/5.
Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs: This work leverages sparse autoencoders (SAEs) to discover "translation-initiation features" within LLMs that govern the activation of translation tasks. Causal interventions validate their functional roles—amplifying these features improves translation quality and reduces hallucinations, while suppressing them induces hallucinations. The mechanistic insight is further operationalized into a practical data selection strategy that prioritizes "mechanistically difficult" samples for fine-tuning, substantially improving data efficiency and hallucination suppression.
FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding: This paper proposes the FineVAU benchmark, which decomposes Video Anomaly Understanding (VAU) into three dimensions — Event (What), Entity (Who), and Location (Where) — introduces the FV-Score metric with high alignment to human perception, and constructs the FineW³ dataset via a fully automated LVLM-assisted pipeline. Experiments reveal critical shortcomings of current LVLMs in fine-grained anomalous event perception.
FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer: This paper provides an in-depth analysis of the root cause behind KAT (Kolmogorov-Arnold Transformer) training being 123× slower than ViT. The bottleneck is identified not as FLOPs but as memory stalls caused by gradient accumulation during backpropagation (global memory contention from atomic add operations). The proposed FlashKAT restructures GPU kernels to achieve an 86.5× training speedup and reduces gradient rounding errors by nearly an order of magnitude.
Flexible Concept Bottleneck Model: This paper proposes the Flexible Concept Bottleneck Model (FCBM), which introduces a hypernetwork to dynamically generate concept weights and a sparsemax module with a learnable temperature parameter, enabling dynamic adaptation of the concept pool—including complete replacement. FCBM achieves accuracy comparable to state-of-the-art baselines with a similar number of effective concepts across five public datasets, and requires only a single epoch of fine-tuning to adapt to an entirely new concept set.
FourierPET: Deep Fourier-based Unrolled Network for Low-count PET Reconstruction: This work identifies three categories of degradation in low-count PET that are separable in the frequency domain — Poisson noise and photon deficiency induce high-frequency phase perturbations, while attenuation correction errors suppress low-frequency amplitude — and proposes FourierPET: an ADMM-unrolled, frequency-aware reconstruction framework that achieves comprehensive state-of-the-art performance across three datasets with only 0.44M parameters.
GateRA: Token-Aware Modulation for Parameter-Efficient Fine-Tuning: This paper proposes GateRA, which introduces a lightweight token-aware gating module into PEFT methods (LoRA/DoRA/HiRA). A sigmoid gate dynamically adjusts the adaptation intensity per token—suppressing updates for in-distribution or simple tokens to preserve pre-trained knowledge, while amplifying adaptation for challenging tokens. Combined with entropy regularization to encourage near-binary gating decisions, GateRA consistently outperforms HiRA on commonsense reasoning (+1.1%), dialogue, and mathematical reasoning.
GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction Framework: This paper proposes GenePheno, the first interpretable multi-label prediction framework for end-to-end prediction of gene knockout-induced phenotype abnormalities directly from gene sequences. The framework captures inter-phenotype correlations via contrastive multi-label learning, enforces biological consistency through exclusivity regularization, and provides interpretability via a Gene Ontology (GO) bottleneck layer. GenePheno achieves state-of-the-art gene-centric $F_{\max}$ and phenotype-centric AUC across four datasets.
HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning: This paper introduces HSKBenchmark, the first benchmark for staged modeling and writing assessment of Chinese second language acquisition (SLA) in LLMs. It comprises HSK levels 3–6 textbooks (6.76M tokens), 16K synthetic instruction data, 30 test prompts, a linguistically-grounded evaluation system, and a curriculum tuning framework designed to simulate human acquisition trajectories.
Hypothesis Generation via LLM-Automated Language Bias for ILP: This paper proposes the first end-to-end framework in which a multi-agent LLM system (Actor/Critic) automatically constructs ILP language bias (predicate system, type declarations, and mode constraints) from raw text. A Translator agent converts text into Prolog facts, and the MAXSYNTH solver then induces a globally optimal rule set based on the MDL principle. The framework achieves 88.3% and 81.3% accuracy on the SHOES and ZENDO tasks, respectively, with variance below 5% across four LLMs.
iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference: This paper proposes iMAD, a framework that triggers multi-agent debate (MAD) selectively: a single agent first generates a structured response with self-critique, from which 41 interpretable linguistic/semantic features are extracted; a lightweight MLP classifier trained with the FocusCal loss then decides whether to trigger MAD. Across 6 QA/VQA benchmarks, iMAD reduces token overhead by up to 92% while improving accuracy by up to 13.5%.
Induce, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning: This paper proposes the CIRF framework, which abstracts transferable reasoning patterns from LLM-generated first-order logic via unsupervised schema induction (USI), and performs explainable zero-shot stance reasoning through structural alignment using a schema-enhanced graph kernel model (SEGKM). The method achieves state-of-the-art performance on three benchmarks while requiring only 30% of labeled data.
LLM Circuit Analyses Are Consistent Across Training and Scale: This paper presents the first systematic tracking of internal circuits in decoder-only LLMs across 300 billion tokens of training and model scales ranging from 70M to 2.8B parameters. It finds that while specific attention heads may be replaced over the course of training, the underlying algorithms remain stable and consistent across scales, suggesting that circuit analyses conducted on smaller models generalize to larger models and longer training runs.
Partially Shared Concept Bottleneck Models: This paper proposes PS-CBM, a framework that integrates multimodal concept generation (combining LLM semantics with visual cues from exemplar images), a partially shared concept strategy (merging concepts based on activation patterns), and a Concept-Efficient Accuracy (CEA) evaluation metric. PS-CBM achieves higher classification accuracy and interpretability with fewer concepts across 11 datasets.
Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models: This paper proposes MRMBench, a benchmark that evaluates whether reward models (RMs) effectively capture multi-dimensional preferences via probing tasks across 6 dimensions (harmlessness, helpfulness, correctness, coherence, complexity, and verbosity). Probe performance is shown to strongly correlate with PPO alignment quality (Pearson $r > 0.8$), and an inference-time probing method is proposed that improves AlpacaEval win rate from 57.3% to 62.5%.
Quiet Feature Learning in Algorithmic Tasks: Across 10 algorithmic tasks (18,544 training runs, $10^9$–$10^{16}$ FLOPs), this work demonstrates that loss plateaus in Transformer training do not indicate stalled learning. During these plateaus, models acquire "quiet features": intermediate algorithmic subroutines that do not directly reduce output loss yet are causally necessary for final performance (ablating them reduces accuracy by 41–75%). This challenges the common practice of using loss curves to assess training progress.
SCoPe: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs: This paper reframes copyright infringement mitigation in LLMs as an intrinsic semantic space control problem. It leverages sparse autoencoders (SAEs) to map hidden states into a high-dimensional sparse space, identifies copyright-sensitive subspaces, and clamps their activations to zero during decoding—effectively reducing verbatim reproduction of copyrighted content without external filters or parameter updates, while preserving general model capabilities.
ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees: This paper proposes ShapBPT, which combines data-aware Binary Partition Trees (BPT) as hierarchical coalition structures with Owen-approximated Shapley values to achieve feature attributions aligned with image morphology. ShapBPT converges faster and yields more accurate shape recognition than existing Shapley-based methods, with a 20-participant user study confirming that its explanations are preferred by human evaluators.
SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models: This paper demonstrates that refusal behavior in LLMs is not encoded by a single direction but rather forms a low-dimensional manifold. It employs self-organizing maps (SOM) to extract multiple refusal directions and applies Bayesian optimization to search for the optimal ablation combination, surpassing single-direction baselines and dedicated jailbreak algorithms across multiple models.
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning: This paper proposes SparK — a training-free, token-wise unstructured channel pruning method for KV cache. It selects salient channels via query-aware saliency scoring and recovers the contribution of pruned channels through a recovery mechanism. At an 80% pruning ratio, performance degradation remains below 5%. The method is orthogonal to token eviction approaches and can reduce KV cache storage by an additional 30%+.
ToC: Tree-of-Claims Search with Multi-Agent Language Models: This paper proposes the Tree-of-Claims (ToC) framework, which models patent claim editing as a structured search problem. Through MCTS combined with EditorAgent/ExaminerAgent multi-agent collaboration, ToC jointly optimizes novelty, scope preservation, and semantic consistency, achieving an average improvement of approximately 8% in overall score over zero/few-shot LLM baselines.
Universal Safety Controllers with Learned Prophecies: This paper proposes UCLearn, which learns CTL (Computation Tree Logic) formulas as approximate representations of prophecies from a small number of representative plant models, replacing exact but computationally expensive tree automata to achieve efficient, scalable, and interpretable universal safety controller synthesis.
Unsupervised Feature Selection Through Group Discovery: This paper proposes GroupFS, the first end-to-end differentiable unsupervised feature selection framework that simultaneously discovers latent feature groups and selects the most informative ones, requiring neither predefined groupings nor label supervision.
Using Certifying Constraint Solvers for Generating Step-wise Explanations: This paper proposes leveraging unsatisfiability proofs generated by certifying constraint solvers as a starting point, and applies a series of simplification and transformation techniques to efficiently produce user-facing step-wise explanation sequences, achieving speedups of up to 100× over approaches that construct explanations from scratch.
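
The coherence measure summarized in the "A Coherence-Based Measure of AGI" entry above integrates a generalized (power) mean of capability scores over the compensability parameter $p \in [-1, 1]$. The following is a minimal sketch of that idea, not the paper's implementation: the function names, the trapezoidal integration, the division by the interval length, and the example score profiles are all assumptions made for illustration, and scores are assumed to be strictly positive.

```python
import math


def generalized_mean(scores, p):
    """Power mean of strictly positive scores; p near 0 gives the geometric mean."""
    n = len(scores)
    if abs(p) < 1e-12:
        return math.exp(sum(math.log(s) for s in scores) / n)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)


def agi_auc(scores, p_min=-1.0, p_max=1.0, steps=200):
    """Average of the power mean over p in [p_min, p_max] (trapezoidal rule).

    Dividing by the interval length is an assumption made here so that the
    result stays on the same scale as the input scores.
    """
    h = (p_max - p_min) / steps
    values = [generalized_mean(scores, p_min + i * h) for i in range(steps + 1)]
    area = h * (sum(values) - 0.5 * (values[0] + values[-1]))
    return area / (p_max - p_min)


# Two hypothetical capability profiles with the same arithmetic mean (0.6).
even = [0.6, 0.6, 0.6, 0.6]
uneven = [0.9, 0.9, 0.5, 0.1]

print(agi_auc(even), agi_auc(uneven))
```

Because the power mean is non-decreasing in $p$, the uneven profile scores below its arithmetic mean, while the even profile is unaffected; this is the bottleneck-exposing behavior the entry describes.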

📈 Time Series

A Theoretical Analysis of Detecting Large Model-Generated Time Series: This work presents the first theoretical framework for detecting time series large model (TSLM)-generated content. By establishing the Contraction Hypothesis, it reveals that TSLM-generated sequences exhibit exponentially decaying uncertainty under recursive forecasting. Based on this insight, the proposed UCE detector achieves an in-distribution AUROC of 0.855 across 32 datasets, substantially outperforming 10 text-detection baselines.
A Unified Shape-Aware Foundation Model for Time Series Classification: This paper proposes UniShape — a foundation model for time series classification that adaptively aggregates multi-scale discriminative subsequences (shapelets) via a shape-aware adapter, and learns transferable shapelet representations at both instance and shape levels through prototype-based contrastive pretraining. With only 3.1M parameters, UniShape achieves state-of-the-art performance on 128 UCR datasets (average accuracy 87.08%) while providing strong classification interpretability.
AirDDE: Multifactor Neural Delay Differential Equations for Air Quality Forecasting: This paper proposes AirDDE, the first framework to introduce Neural Delay Differential Equations (NDDE) into air quality forecasting. By incorporating a memory-augmented attention module and a physics-guided delay evolution function, it models delay effects in the continuous-time propagation of pollutants, achieving an average MAE reduction of 8.79% across three datasets.
iTimER: Reconstruction Error-Guided Irregularly Sampled Time Series Representation Learning: This paper proposes iTimER, which leverages the model's own reconstruction error distribution as a learning signal. By estimating the error distribution from observed points and sampling from it to generate pseudo-observations at unobserved timestamps, the method aligns the error distributions of observed and pseudo-observed regions via Wasserstein distance combined with contrastive learning, achieving state-of-the-art performance on classification, interpolation, and forecasting tasks for irregularly sampled time series.
C3RL: Rethinking the Combination of Channel-independence and Channel-mixing from Representation Learning: This paper proposes C3RL, a SimSiam-based contrastive learning framework that treats channel-independence (CI) and channel-mixing (CM) strategies as two transposed views of the same data to construct positive pairs. By jointly optimizing representation learning and forecasting through a Siamese network, C3RL improves the best-performance rate of CI models from 43.6% to 81.4% and CM models from 23.8% to 76.3%.
Coherent Multi-Agent Trajectory Forecasting in Team Sports with CausalTraj: This paper proposes CausalTraj — a temporally causal, likelihood-based multi-agent trajectory forecasting model that autoregressively models spatio-temporal interactions among agents step by step. CausalTraj achieves state-of-the-art results on joint metrics (minJADE/minJFDE) across NBA, basketball, and football datasets while maintaining competitive per-agent accuracy.
CometNet: Contextual Motif-guided Long-term Time Series Forecasting: This paper proposes CometNet, which extracts recurrently occurring "contextual motifs" from the full historical sequence to construct a motif library, and employs a motif-guided MoE architecture to dynamically associate the current window with relevant motifs for prediction. This approach breaks the receptive field bottleneck imposed by limited look-back windows and achieves significant improvements over state-of-the-art methods such as TimeMixer++ and iTransformer on 8 datasets.
Counterfactual Explainable AI (XAI) Method for Deep Learning-Based Multivariate Time Series Classification: This paper proposes CONFETTI, a multi-objective counterfactual explanation method for multivariate time series (MTS) classification. By combining Class Activation Map (CAM)-guided subsequence extraction with NSGA-III multi-objective optimization, CONFETTI achieves an optimal balance among prediction confidence, sparsity, and proximity, outperforming existing methods across 7 UEA benchmark datasets.
DeepBooTS: Dual-Stream Residual Boosting for Drift-Resilient Time-Series Forecasting: This paper proposes DeepBooTS, which leverages bias-variance decomposition theory to demonstrate that weighted ensembling reduces variance and thereby mitigates concept drift. The method introduces a dual-stream residual-decreasing boosting architecture in which each block corrects the residual of the preceding block, achieving an average improvement of 15.8% across multiple datasets.
Detecting the Future: All-at-Once Event Sequence Forecasting with Horizon Matching: This paper proposes DEF (Detection-based Event Forecasting), which draws on the set-matching idea from DETR in object detection and employs the Hungarian algorithm to align predicted and ground-truth event sequences, achieving high-accuracy and high-diversity long-horizon event forecasting with state-of-the-art results on five datasets.
Finding Time Series Anomalies using Granular-ball Vector Data Description: This paper proposes the Granular-ball One-Class Network (GBOC), which adaptively constructs density-guided Granular-ball Vector Data Descriptions (GVDD) in the latent space. By replacing traditional clustering or single-hypersphere assumptions, GBOC enables flexible modeling of normal time series behavior and robust anomaly detection.
FreqCycle: A Multi-Scale Time-Frequency Analysis Method for Time Series Forecasting: This paper proposes the FreqCycle framework, which explicitly learns shared periodic patterns via the FECF module, enhances mid-to-high frequency energy contributions via the SFPL module, and extends to MFreqCycle for handling coupled multi-periodicity. The framework combines state-of-the-art accuracy with computational efficiency across 7 benchmarks.
GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs: This paper presents GAICo (Generative AI Comparator), a deployed, extensible, open-source Python library that provides a unified reference-based evaluation framework for text, structured data (planning sequences, time series), and multimedia (images, audio), supporting multi-model comparison, visualization, and report generation.
Harmonic Dataset Distillation for Time Series Forecasting: This paper proposes HDT (Harmonic Dataset Distillation for Time Series Forecasting), which decomposes time series into sinusoidal bases via FFT and aligns the core periodic structure of synthetic and real data through Harmonic Matching in the frequency domain, achieving strong cross-architecture generalization and favorable scalability for time series dataset distillation.
HN-MVTS: HyperNetwork-based Multivariate Time Series Forecasting: This paper proposes HN-MVTS, which employs a HyperNetwork to generate channel-specific weights for the final prediction layer, striking a balance between channel-independent (CI) and channel-dependent (CD) modeling. As a plug-and-play module, it improves forecasting accuracy of various backbone models including DLinear, PatchTST, and TSMixer without incurring additional inference overhead.
HydroDCM: Hydrological Domain-Conditioned Modulation for Cross-Reservoir Inflow Prediction: This paper proposes HydroDCM, the first framework to introduce Domain Generalization (DG) into hydrological forecasting. It constructs pseudo-domain labels from spatial meta-attributes to guide adversarial learning for invariant feature extraction, then employs a FiLM adapter to modulate features conditioned on the target reservoir's geographical information, enabling cross-domain inflow prediction for unseen reservoirs.
IdealTSF: Can Non-Ideal Data Contribute to Enhancing Time Series Forecasting?: IdealTSF is a three-stage progressive framework that (1) uses negative sample pre-training on synthetic non-ideal data to enhance robustness, (2) trains on repaired positive samples to learn underlying trends, and (3) applies the ECOS optimizer to guide parameters toward flat minima — achieving approximately 10% MSE improvement on time series data containing noise and missing values.
Interpreting Fedspeak with Confidence: A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths: This paper proposes an LLM-based uncertainty-aware framework for interpreting Fedspeak (Federal Reserve language). The framework enhances inputs through domain reasoning along monetary policy transmission paths, and introduces a dynamic uncertainty decoding module to quantify prediction confidence (Perceptual Uncertainty = Environmental Ambiguity × Cognitive Risk), achieving SOTA performance on FOMC monetary policy stance analysis.
LoReTTA: A Low Resource Framework To Poison Continuous Time Dynamic Graphs: This paper proposes LoReTTA, a two-stage adversarial poisoning attack framework that requires no surrogate model. It first sparsifies high-influence edges via 16 temporal importance metrics, then replaces them with adversarial edges using a degree-preserving negative sampling algorithm. Across 4 datasets × 4 TGNN models, LoReTTA achieves an average performance degradation of 29.47%, while evading 4 anomaly detection systems and resisting 4 defense methods.
M2FMoE: Multi-Resolution Multi-View Frequency Mixture-of-Experts for Extreme-Adaptive Time Series Forecasting: This paper proposes M2FMoE, a framework that models both regular and extreme temporal patterns via frequency-domain Mixture-of-Experts from dual Fourier and wavelet perspectives. It incorporates a cross-view shared frequency-band splitter to align semantic correspondence across domains, multi-resolution adaptive fusion to capture multi-scale information, and temporal gated integration to combine short- and long-term features. On five hydrological extreme event datasets, M2FMoE surpasses all state-of-the-art methods — including label-supervised approaches — without requiring any extreme event labels, achieving an average RMSE improvement of 22.3%.
Mask the Redundancy: Evolving Masking Representation Learning for Multivariate Time-Series Clustering: This paper proposes EMTC, a framework that dynamically masks redundant timestamps via Importance-aware Variate-wise Masking (IVM), combined with Multi-Endogenous Views (MEV) generation and cluster-guided contrastive learning, achieving an average F1 improvement of 4.85% across 15 MTS clustering benchmarks.
Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints: This paper proposes GlobalDiff, a framework that, for the first time, performs diffusion-based generation in the global joint rotation space, fundamentally eliminating error accumulation in hierarchical forward kinematics. A three-level joint–bone–motion constraint scheme compensates for the structural priors lost under global representation. GlobalDiff achieves state-of-the-art performance on multi-speaker co-speech motion generation benchmarks, improving FGD by 46% over the previous best method.
Optimal Look-back Horizon for Time Series Forecasting in Federated Learning: This paper proposes a theoretical framework for selecting the optimal look-back horizon in federated time series forecasting. By introducing a Synthetic Data Generator (SDG) and an intrinsic space representation, the forecasting loss is decomposed into an irreducible Bayesian error and an approximation error. The paper proves that the total loss is unimodal with respect to the horizon length, and establishes that the minimum sufficient window is the optimal solution.
Predicting the Future by Retrieving the Past: This paper proposes PFRP (Predicting the Future by Retrieving the Past), which constructs a Global Memory Bank (GMB) to store historical patterns, trains an encoder via Predictive Contrastive Learning (PCL) for efficient retrieval, and dynamically integrates retrieved global predictions with any local forecasting model. PFRP achieves an average improvement of 8.4% in forecasting performance across 7 datasets.
ProbFM: Probabilistic Time Series Foundation Model with Uncertainty Decomposition: This work is the first to introduce Deep Evidential Regression (DER) with a Normal-Inverse-Gamma prior into a time series foundation model architecture, enabling epistemic-aleatoric uncertainty decomposition in a single forward pass. The practical value of uncertainty-aware trading strategies is validated on cryptocurrency forecasting.
ReCast: Reliability-aware Codebook Assisted Lightweight Time Series Forecasting: This paper proposes ReCast, which encodes time series into discrete embeddings via patch-level vector quantization. It introduces a dual-path architecture consisting of a quantization path (modeling regular structures) and a residual path (capturing irregular fluctuations), along with a reliability-aware codebook update strategy based on distributionally robust optimization (DRO). ReCast achieves state-of-the-art accuracy with a lightweight architecture across 8 datasets.
Revitalizing Canonical Pre-Alignment for Irregular Multivariate Time Series Forecasting: This paper is the first to argue that Canonical Pre-Alignment (CPA) should not be abandoned for Irregular Multivariate Time Series (IMTS) forecasting. It proposes KAFNet, which addresses the efficiency bottleneck of CPA via three modules—Pre-Convolution smoothing, Temporal Kernel Aggregation (TKA), and Frequency-domain Linear Attention (FLA)—achieving state-of-the-art accuracy on 4 IMTS benchmarks while reducing parameters by 7.2× and accelerating training/inference by 8.4×.
Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios: This paper proposes SpecFormer, a non-autoregressive draft model architecture that integrates unidirectional and bidirectional attention. By reducing reliance on large prefix trees and minimizing position-dependent parameters, SpecFormer achieves consistent LLM inference acceleration in large-batch scenarios.
SELDON: Supernova Explosions Learned by Deep ODE Networks: This paper proposes SELDON, a continuous-time VAE combining a masked GRU-ODE encoder, an implicit Neural ODE propagator, and an interpretable Gaussian basis function decoder, designed for sparse and irregularly sampled astronomical light curve prediction. Using only 20% of the observed data, SELDON predicts multi-band flux more accurately than baseline methods.
Sonnet: Spectral Operator Neural Network for Multivariable Time Series Forecasting: This paper proposes Sonnet, which maps inputs to the time-frequency domain via learnable wavelet transforms, introduces multivariate coherence-based attention (MVCA) to model inter-variable dependencies, and employs a Koopman operator for stable temporal evolution forecasting. Sonnet achieves state-of-the-art performance on 34 out of 47 forecasting tasks, reducing average MAE by 2.2%.
Task-Aware Retrieval Augmentation for Dynamic Recommendation: This paper proposes TarDGR, a framework that automatically constructs training data via a task-aware evaluation mechanism, trains a Graph Transformer to assess the task relevance of historical subgraphs, and retrieves and integrates task-relevant subgraphs at inference time to enhance temporal generalization in recommendation.
Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing: This paper proposes DTAF, a dual-branch framework that extracts and removes heterogeneous non-stationary patterns via a non-stationary MoE filter in the temporal domain, tracks frequency drift via spectral differencing in the frequency domain, and fuses complementary information from both domains through dual-branch attention for robust non-stationary time series forecasting.
Transparent Networks for Multivariate Time Series: This paper proposes GATSM (Generalized Additive Time Series Model), a transparent neural network for time series that employs weight-sharing feature networks to learn feature representations and masked multi-head attention to capture temporal patterns. GATSM achieves performance comparable to black-box models such as Transformers while maintaining full interpretability.
Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports: This paper proposes URBAN, a multi-view multi-output GNN model that jointly leverages sparse but unbiased government inspection rating data and dense but biased crowdsourced report data to predict the true latent state of urban incidents. Validated on 9.6M+ reports and 1M+ inspections in New York City, the model achieves a 5.3× higher prediction correlation than using report data alone.
XLinear: A Lightweight and Accurate MLP-Based Model for Long-Term Time Series Forecasting with Exogenous Inputs: This paper proposes XLinear, a lightweight time series forecasting model based on MLP with sigmoid gating. Through a global token mechanism, it efficiently integrates endogenous and exogenous variable information, achieving an optimal accuracy–efficiency trade-off across 12 datasets.
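
As a rough illustration of the epistemic-aleatoric decomposition mentioned in the ProbFM note above, the sketch below applies the closed-form moments of a Normal-Inverse-Gamma (NIG) distribution as used in the general Deep Evidential Regression formulation. It is not taken from the ProbFM paper; the function name `nig_uncertainty` and the example values are assumptions made purely for illustration.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Split predictive uncertainty for one Normal-Inverse-Gamma output.

    gamma, nu, alpha, beta are the four NIG parameters a regression head
    would emit for a single target (assumed layout, not ProbFM's API).
    """
    if nu <= 0.0 or beta <= 0.0 or alpha <= 1.0:
        # The moments below are only finite for nu > 0, beta > 0, alpha > 1.
        raise ValueError("need nu > 0, beta > 0 and alpha > 1")
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: expected data noise
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: uncertainty about the mean
    return {"mean": gamma, "aleatoric": aleatoric, "epistemic": epistemic}


# Illustrative values only.
result = nig_uncertainty(gamma=0.5, nu=2.0, alpha=3.0, beta=4.0)
print(result)
```

In a trained model the four parameters would typically be read from the output head in one forward pass, which is why both terms are available without sampling. A larger `nu` shrinks only the epistemic term.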

📹 Video Understanding

APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval: This paper proposes APVR, a training-free dual-granularity visual information retrieval framework. At the frame level, it iteratively retrieves keyframes (up to 1024) via query expansion and spatiotemporal semantic confidence scoring; at the token level, it compresses visual tokens through query-aware attention-driven selection. APVR overcomes memory limitations to process hour-long videos, achieving improvements of up to 9.5%, 4.6%, and 9.7% on LongVideoBench, VideoMME, and MLVU, respectively.
Balancing Multimodal Domain Generalization via Gradient Modulation and Projection: This paper proposes a Gradient Modulation Projection (GMP) strategy that addresses inter-modality optimization imbalance and inter-task gradient conflicts in multimodal domain generalization (MMDG) through two components: Inter-modality Gradient Decoupled Modulation (IGDM) and Conflict-Adaptive Gradient Projection (CAGP), achieving state-of-the-art performance on multiple benchmarks.
BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation: This paper proposes the Bidirectional Adaptive Temporal Correlation (BAT) framework, which converts temporally dense motion cues from event cameras into spatially dense cues, achieving high-accuracy event-based optical flow estimation and ranking first on the DSEC-Flow benchmark.
Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces: This paper proposes the Generative Semantic Workspace (GSW), a neuroscience-inspired generative memory framework that constructs structured episodic memory representations for LLMs, achieving an F1 of 0.85 on EpBench while reducing query-time context tokens by 51%.
Causality Matters: How Temporal Information Emerges in Video Language Models: Through systematic ablation experiments, this work demonstrates that the temporal understanding capability of VideoLMs does not originate from positional encoding (PE), but rather emerges from the sequence sensitivity of causal attention masks. Temporal information is constructed layer by layer along a causal pathway of "inter-frame interaction → last-frame aggregation → query integration," based on which two lossless inference acceleration strategies are proposed.
Coordinated Humanoid Robot Locomotion with Symmetry Equivariant Reinforcement Learning Policy: This paper proposes SE-Policy, which directly embeds strict symmetry equivariance (actor) and symmetry invariance (critic) into the neural network architecture without additional hyperparameters, enabling humanoid robots to produce spatiotemporally coordinated natural locomotion. The velocity tracking error is reduced by 40% compared to DreamWaQ, and the policy is successfully deployed on a physical Unitree G1 robot.
Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers: This paper proposes a "Distillation Dynamics" analytical framework (channel-wise FFT spectral analysis + Shannon entropy + activation magnitude tracking) to reveal that ViTs exhibit a distinctive U-shaped information processing pattern (compression followed by expansion). The work demonstrates that the fundamental cause of feature-based distillation failure in ViTs is a representational paradigm mismatch between the teacher's distributed high-dimensional encoding in later layers and the student's limited channel capacity—rather than a simple capacity gap.
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation: This paper presents EmoVid, the first large-scale multimodal emotion video dataset targeting artistic and non-photorealistic content (22,758 video clips), spanning three content types—animation, film, and emoji stickers—and demonstrates the effectiveness of emotion-conditioned video generation by fine-tuning the Wan2.1 model, achieving significant improvements over baselines on emotion accuracy metrics.
Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction: This paper proposes the CACMI framework, which addresses two fundamental limitations in dense video captioning (insufficient temporal modeling and modality gap) through explicit temporal-semantic modeling. It employs Cross-modal Frame Aggregation (CFA) to extract temporally coherent event semantics, and Context-aware Feature Enhancement (CFE) to bridge the visual-textual modality gap, achieving state-of-the-art performance on ActivityNet Captions and YouCook2.
FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion: This paper proposes FineTec, a framework that achieves robust fine-grained skeleton-based action recognition under temporal corruption via three modules: context-aware sequence completion, bio-prior-guided skeleton spatial decomposition, and physics-driven acceleration modeling.
Group Orthogonal Low-Rank Adaptation for RGB-T Tracking: This paper proposes the GOLA framework, which quantifies LoRA rank importance via SVD, freezes critical ranks to preserve pre-trained priors, clusters redundant ranks into groups, and imposes inter-group orthogonal constraints to enable more efficient RGB-T tracking adaptation.
KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals: This paper proposes KineST, a kinematics-guided state space model that reconstructs whole-body motion from sparse HMD signals via a kinematic tree bidirectional scanning strategy and hybrid spatiotemporal representation learning, surpassing state-of-the-art methods in both accuracy and temporal consistency.
Learning Topology-Driven Multi-Subspace Fusion for Grassmannian Deep Networks: This paper proposes GMSF-Net, a topology-driven multi-subspace fusion network on the Grassmann manifold. By introducing adaptive multi-subspace construction and a Fréchet mean-based subspace interaction mechanism, it successfully transfers the multi-channel interaction paradigm from Euclidean space to non-Euclidean geometry, achieving state-of-the-art performance on 3D action recognition, EEG classification, and graph tasks.
Lifelong Domain Adaptive 3D Human Pose Estimation: This paper introduces the new task of lifelong domain-adaptive 3D human pose estimation (HPE) and proposes a GAN framework incorporating pose-aware, temporal-aware, and domain-aware encodings. A diffusion sampler is employed to generate domain-aware priors to mitigate catastrophic forgetting, achieving significant improvements over existing methods across multiple cross-scene/cross-dataset adaptation tasks.
Listening Between the Frames: Bridging Temporal Gaps in Large Audio-Language Models: This paper proposes TimeAudio, which equips large audio-language models (LALMs) with precise temporal grounding and end-to-end long audio understanding capabilities through three key modules: Temporal Markers, Absolute Time-aware Encoding (ATE), and Segment-level Token Merging (SEM). The paper also introduces the FTAR dataset for instruction fine-tuning on fine-grained temporal reasoning.
LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding: This paper presents LiViBench, the first omnimodal benchmark for interactive livestream video understanding (3,168 videos, 3,175 MCQs, 24 tasks), introduces a multi-agent seed-guided semi-automatic annotation pipeline, and develops LiVi-LLM-7B — a specialized model featuring a Video-to-Comment Retrieval (VCR) module and two-stage instruction tuning — which surpasses 72B open-source models at the 7B scale.
LOOM: Personalized Learning Informed by Daily LLM Conversations Toward Long-Term Mastery via a Dynamic Learner Memory Graph: This paper proposes LOOM, an agentic pipeline system that observes users' daily LLM conversations, infers learning needs, maintains a Dynamic Learner Memory Graph, and automatically generates personalized mini-courses. LOOM unifies continuity (long-term progress tracking) and initiative (immediate responsiveness to emerging interests) in a single framework.
PlugTrack: Multi-Perceptive Motion Analysis for Adaptive Fusion in Multi-Object Tracking: This paper proposes PlugTrack, a framework that achieves, for the first time, adaptive fusion of Kalman filters and data-driven motion predictors via a Context Motion Encoder (CME) and an Adaptive Blending factor Generator (ABG), yielding significant improvements under both linear and nonlinear motion scenarios.
PragWorld: A Benchmark Evaluating LLMs' Local World Model under Minimal Linguistic Alterations and Conversational Dynamics: This paper proposes the PragWorld benchmark, which evaluates the plasticity and robustness of LLMs' implicit world models by applying 7 types of minimal linguistic perturbations to dialogues. A dual-perspective interpretability framework is designed to localize harmful/helpful layers, and a layer regularization fine-tuning strategy is proposed to improve robustness.
Predicting Video Slot Attention Queries from Random Slot-Feature Pairs: This paper proposes RandSF.Q, which significantly improves query prediction quality in video object-centric learning (OCL) by leveraging next-frame features for informative query prediction and learning transition dynamics from randomly sampled slot-feature pairs. The method surpasses state-of-the-art approaches by up to 10 points on object discovery benchmarks.
Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction: This paper systematically quantifies the reliability degradation of LLMs in multi-turn conversations through three deterministically evaluable representative tasks—instruction following, tool selection, and entity extraction—revealing failure modes such as instruction drift, intent confusion, and context overwriting in extended dialogues.
R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios: This paper introduces R-AVST, the first fine-grained spatio-temporal reasoning dataset for complex audio-visual scenarios (5K+ untrimmed videos, 27K objects, 100 audio-visual event categories), defines three core reasoning tasks, and trains the AVST-Zero model via GRPO with a multi-dimensional reward function to directly optimize audio-visual spatio-temporal reasoning.
ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding: This paper proposes a Causal Information Bottleneck (CIB) theoretical framework that formalizes keyframe selection as an information-theoretic problem jointly optimizing predictive sufficiency and causal necessity. Built upon CIB, the ReaSon reinforcement learning framework trains a selection policy using three CIB-aligned rewards (answer reward, cycle-consistency reward, and counterfactual reward), significantly outperforming existing methods under constrained frame budgets.
RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems: This paper introduces RecToM, the first human-annotated benchmark for evaluating Theory of Mind (ToM) reasoning capabilities of LLMs in conversational recommender systems (CRS). It covers two dimensions—cognitive inference (desire/intention/belief) and behavioral prediction (strategy prediction/strategy judgment)—comprising 10 question types and 20,524 QA pairs, and exposes systematic deficiencies of current LLMs in fine-grained intention inference and strategy judgment.
Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective: This paper proposes LIBERO-Mem, a benchmark comprising 10 non-Markovian robotic manipulation tasks, and Embodied-SlotSSM, an object-centric memory VLA framework combining Slot Attention with state space models, to address the failure of visuomotor policies in long-horizon tasks that require object-level historical reasoning under partial observability.
MambaMia: State-Space Hierarchical Compression for Hour-Long Video Understanding in Large Multimodal Models: MambaMia proposes a two-stage hierarchical video token compression framework based on bidirectional Mamba: Gated Patch Aggregation (GPA) for spatial-temporal local compression, and a Temporal Axis Aggregator (TAA) that leverages Mamba's adaptive step size $\Delta_t$ for data-driven keyframe sampling. The method compresses hour-long videos to only 4.7K tokens, achieving 44.6 on LVBench and surpassing Qwen2-VL and mPLUG-Owl3.
StegaVAR: Privacy-Preserving Video Action Recognition via Steganographic Domain Analysis: This paper proposes StegaVAR, the first framework to integrate video steganography with action recognition. Privacy-sensitive videos are embedded into natural cover videos, and classification is performed directly in the steganographic domain. Through STeP (secret video-guided spatiotemporal feature learning) and CroDA (cross-band difference attention), the framework achieves recognition accuracy approaching that of raw video while providing stronger privacy protection than anonymization-based methods.
SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition: This paper proposes the SUGAR paradigm, which leverages GPT-generated motion descriptions and visual descriptions as prior knowledge to supervise skeleton encoders via contrastive learning, producing more discriminative representations. These representations are then fed into an LLM (LLaMA2-7B) with untouched pretrained weights as the classifier, complemented by a newly designed Temporal Query Projection (TQP) module for efficient skeleton-based action classification and zero-shot inference.
Task-Specific Distance Correlation Matching for Few-Shot Action Recognition: This paper proposes TS-FSAR, a framework that employs α-distance correlation to capture nonlinear inter-frame dependencies and combines task-specific matching matrices for query-support matching. An adapted frozen CLIP guides the training of a ladder side network, achieving substantial improvements over prior methods on temporally sensitive datasets such as SSv2-Full.
TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding: This paper formulates keyframe selection and language generation as a joint decision-making process, and optimizes a lightweight temporal agent's sampling policy end-to-end via GRPO-based reinforcement learning. It achieves state-of-the-art results on four long-form video understanding benchmarks (LongVideoBench +5.0%, MLVU +6.0% on LLaVA-Video-7B) and transfers zero-shot to other Video-MLLMs.
Uncovering Zero-Shot Generalization Gaps in Time-Series Foundation Models Using Real-World Videos: This paper proposes a pipeline for extracting time-series data from real-world videos via optical flow, constructs the REAL-V-TSFM dataset (6,130 sequences), and reveals significant zero-shot generalization gaps in current time-series foundation models (TSFMs) such as Chronos and TimesFM when confronted with real physical dynamics.
UVLM: Benchmarking Video Language Model for Underwater World Understanding: This paper constructs the first benchmark for underwater video-language understanding, UVLM, comprising 2,109 video clips, 419 marine species categories, 20 sub-tasks, and approximately 40K video-text pairs. Through a human-AI collaborative annotation pipeline that injects marine domain knowledge, a 7B VidLM fine-tuned on UVLM achieves performance approaching GPT-4o (73.04 vs. 77.95 Overall).
VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation: VTinker is a pipeline that addresses blurry optical flow boundaries via Guided Flow Upsampling (GFU) and eliminates ghosting and discontinuities by replacing conventional per-pixel blending with texture mapping, achieving state-of-the-art performance in high-resolution video frame interpolation.
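
Two notes above (R-AVST and TSPO) train with GRPO. The sketch below shows only the group-relative advantage step, assuming the commonly described form in which each sampled response's reward is normalized by the mean and standard deviation of its own group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions; neither paper's reward design is reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean, unit scale.

    rewards: scalar rewards for several responses to the same prompt.
    eps: small constant (assumed) that avoids division by zero when all
         rewards in the group are identical.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses, two judged correct (reward 1) and two not (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Because the baseline is the group's own mean, no separate value network is needed. A group in which every response earns the same reward yields all-zero advantages and therefore no learning signal.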

🎵 Audio & Speech

A Mind Cannot Be Smeared Across Time: This paper formally proves that whether a machine possesses consciousness depends not only on what is computed, but also on when it is computed. Systems executing strictly sequentially fail to satisfy the temporal co-instantiation condition required for the unity of consciousness; consequently, pure software consciousness on strictly sequential hardware is impossible.
DeepDebater: A Superpersuasive Autonomous Policy Debating System: This paper presents DeepDebater, the first autonomous multi-agent system capable of participating in and winning a complete American-style policy debate (eight speeches plus cross-examination). The system employs a hierarchical agent workflow to construct affirmative (Advantage) and negative (DA+CP+Kritik) arguments, leverages over 3 million evidence cards from OpenDebateEvidence for retrieval-augmented generation, and integrates GPT-4o TTS speech synthesis with EchoMimic digital avatar animation for end-to-end presentation. Expert evaluations show DeepDebater significantly outperforms human-authored cases across all metrics (Quality: 4.32 vs. 3.65), achieving an 85% win rate in simulated rounds.
AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions: By applying binary masks (AHAMask) over attention heads in the Transformer backbone of Large Audio Language Models (LALMs), specific acoustic task functionalities can be reliably triggered without any textual instructions, while revealing the existence of "acoustic functional pathways" within LALMs.
Aligning Generative Music AI with Human Preferences: Methods and Challenges: This survey/position paper systematically reviews three technical approaches to preference alignment in music generation—MusicRL (large-scale RLHF with ~300K preference pairs), DiffRhythm+ (multi-preference DPO for diffusion models), and Text2midi-InferAlign (inference-time tree search achieving +29.4% CLAP)—while providing an in-depth analysis of alignment challenges unique to the music domain (multi-scale temporal coherence, harmonic consistency, cultural subjectivity, and the evaluation paradox), and proposing a future research roadmap.
CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation: This paper introduces CCFQA—the first cross-lingual and cross-modal factuality benchmark covering 8 languages with 14,400 fully parallel speech-text factual QA samples. It supports four task settings (QA/XQA/SQA/XSQA), systematically revealing factual inconsistencies in existing MLLMs under language and modality switching. The paper also proposes LLM-SQA, which bridges via English with only 5-shot examples to achieve cross-lingual spoken QA transfer, attaining an F1 of 51.4 on XSQA—surpassing GPT-4o-mini-Audio (45.7).
Characterizing AI Manipulation Risks in Brazilian YouTube Climate Discourse: Through a psycholinguistic framework, this work analyzes 226,775 Brazilian YouTube climate change videos and 2,756,165 comments, revealing that emotional and moral rhetoric significantly drives user engagement. It further demonstrates that fine-tuned LLMs can automatically generate high-engagement climate denial comments, warning of the potential risks of generative AI in public opinion manipulation.
Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation: This paper proposes the Cross-Space Synergy (CSS) framework, which simultaneously addresses two major challenges in multimodal emotion recognition in conversation (MERC)—insufficient fusion expressiveness and multi-objective gradient conflicts—via Synergistic Polynomial Fusion (SPF) in the representation space and a Pareto Gradient Modulator (PGM) in the gradient space.
DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization: This paper proposes DeformTrace, which introduces a deformable dynamic receptive field mechanism and relay token scheme into state space models, combining Transformer-level global modeling with SSM-level efficient inference to achieve state-of-the-art accuracy and substantial efficiency gains in temporal forgery localization.
Do LLMs Feel? Teaching Emotion Recognition with Prompts, Retrieval, and Curriculum Learning: This paper proposes PRC-Emo, a framework that integrates explicit/implicit emotion prompting, a dedicated retrieval database, and curriculum learning to enhance LLM performance on Emotion Recognition in Conversation (ERC), achieving state-of-the-art results on the IEMOCAP and MELD benchmarks.
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling: This paper proposes DualSpeechLM, a framework that extracts high-level semantic tokens via an understanding-driven speech tokenizer (USTokenizer) as LLM input and uses acoustic tokens as output, jointly optimizing speech understanding and generation within a single end-to-end framework.
End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering: This paper proposes CLSR, an end-to-end contrastive language-speech retriever that converts acoustic representations into text-like representations before aligning them with text, enabling efficient extraction of question-relevant segments from long-form audio to support RAG-based spoken question answering for downstream LALMs.
Factor(U,T): Controlling Untrusted AI by Monitoring their Plans: This paper investigates the security of the Factor(U,T) protocol, in which an untrusted AI performs task decomposition and a trusted AI handles execution. The study finds that monitoring decomposed plans yields an AUROC of only 0.52 (near random), whereas monitoring concrete code implementations achieves 0.96—malicious intent is difficult to detect at the abstract planning level but becomes exposed in concrete implementations. The key conclusion is that "structural prevention (trusted decomposer) is superior to post-hoc monitoring."
Gene Incremental Learning for Single-Cell Transcriptomics: This paper proposes a Gene Incremental Learning (GIL) framework that leverages the permutation-invariant nature of single-cell transcriptomics data to extend the class incremental learning (CIL) paradigm to the token (gene) dimension. Two baseline methods—gene replay and gene distillation—are designed, and a comprehensive benchmark is established with two evaluation protocols: gene-level regression and gene-level classification.
Generalizing Analogical Inference from Boolean to Continuous Domains: This paper revisits the theoretical foundations of analogical inference: it first constructs a counterexample demonstrating the failure of classical generalization bounds in the Boolean domain, then proposes a unified analogical inference framework based on parameterized generalized means, extending discrete classification to continuous regression domains.
GOMPSNR: Reflourish the Signal-to-Noise Ratio Metric for Audio Generation Tasks: This paper reconstructs the SNR metric by introducing omnidirectional phase derivatives to replace instantaneous phase, proposes GOMPSNR as a more reliable audio quality evaluation metric, and derives a family of novel loss functions that significantly improve neural vocoder performance.
Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR: MARS proposes a multimodal retrieval-and-selection approach to identify the most relevant historical context for conversational LLM-ASR—rather than relying on a fixed number of preceding utterances or the entire history—achieving state-of-the-art performance with only 1.5K hours of training data, surpassing TEA-ASLP trained on 179K hours.
HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding: This paper proposes the HPSU benchmark, comprising 20,000+ expert-annotated Chinese and English samples across 16 tasks, to systematically evaluate Speech LLMs' deep perception and reasoning capabilities in real-world spoken language scenarios. The best-performing model (Gemini 2.5 Pro, 62.6%) still falls far short of human performance (87.3%).
Improving Multimodal Sentiment Analysis via Modality Optimization and Dynamic Primary Modality Selection: This paper proposes the MODS framework, which eliminates redundancy in non-linguistic modalities via Graph-based Dynamic Compression (GDC), and introduces a sample-level Dynamic Primary Modality Selector (MSelector) together with Primary-modality-Centric Cross-Attention (PCCA) to enable adaptive dominant modality selection on a per-sample basis for multimodal sentiment analysis (MSA).
Incremental Maintenance of DatalogMTL Materialisations: This paper proposes the DRed$_{\text{MTL}}$ algorithm, extending the classical Delete/Rederive incremental maintenance technique to DatalogMTL (Datalog with Metric Temporal Logic). By designing novel seminaïve evaluation operators and a periodicity detection algorithm over periodic materialisation representations, the approach enables efficient incremental updates, achieving order-of-magnitude speedups over full rematerialisation.
Let the Model Learn to Feel: Mode-Guided Tonality Injection for Symbolic Music Emotion Recognition: The MoGE diagnostic strategy shows that MIDIBERT fails to encode mode–emotion associations. The proposed MoFi injection framework leverages FiLM conditioning to inject major/minor priors into Layer 1 of MIDIBERT (identified as the weakest layer in terms of emotional information). This achieves 75.2% accuracy (+11.8%) on EMOPIA and 59.1% (+11.8%) on VGMIDI, with F1 improvements of 12.3% and 15.5%, respectively.
Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation: A two-stage framework is proposed: Adaptive Layer Attention (ALA) fuses multi-layer representations from the Whisper encoder to enhance noise robustness, while Multi-Objective Knowledge Distillation (MOKD) aligns the semantic and attention distributions of a clean-speech teacher with a noisy-speech student — achieving significant reductions in hallucination rate and WER on multilingual noisy ASR benchmarks.
Modelling the Effects of Hearing Loss on Neural Coding in the Auditory Midbrain with Variational Conditioning: This paper proposes ψ-ICNet, a variationally conditioned deep neural network that encodes the effects of hearing loss via only 6 learnable conditioning parameters ψ. The model directly learns a low-dimensional representation space of hearing loss from real neural recordings, achieving accuracy comparable to animal-specific models in predicting auditory midbrain responses in both normal-hearing and hearing-impaired animals, and can be rapidly fitted to unseen animals via Bayesian optimization.
Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment: This paper proposes the HIA framework, which employs an Interactive Attention Module to enable bidirectional information exchange across phoneme, word, and utterance granularities. Combined with a residual hierarchical structure to mitigate feature forgetting, HIA achieves state-of-the-art results on the speechocean762 dataset across all granularities and aspects.
PaSE: Prototype-aligned Calibration and Shapley-based Equilibrium for Multimodal Sentiment Analysis: PaSE is a framework that explicitly addresses modality competition in multimodal sentiment analysis through a two-stage optimization strategy combining prototype-guided calibration alignment (via Entropic Optimal Transport) and Shapley-value-based gradient modulation.
PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis: This paper is the first to introduce a pre-trained personality model into Multimodal Sentiment Analysis (MSA) for extracting personalized sentiment features. Through personality-sentiment contrastive alignment and a progressive multi-level fusion architecture (pre-fusion → cross-modal interaction → enhanced fusion), the proposed PSA-MF achieves state-of-the-art performance on CMU-MOSI and CMU-MOSEI.
REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation: This paper proposes REINA (Regularized Entropy INformation Adaptation), a loss function grounded in mutual information theory that efficiently converts a non-streaming speech translation model into a streaming simultaneous speech translation model. REINA achieves state-of-the-art streaming translation performance across multiple language directions and introduces a new streaming efficiency metric, NoSE.
Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding: This paper proposes VARSTok, the first fully dynamic variable-frame-rate speech tokenizer, which achieves adaptive token allocation via temporal-aware density peak clustering and implicit duration coding, surpassing fixed-frame-rate baselines while using fewer tokens.
TEXT: Text-Routed Sparse Mixture of Experts for Multimodal Sentiment Analysis with Explanation Enhancement and Temporal Alignment: This paper proposes TEXT, a model that leverages MLLMs to generate natural language explanations for audio and video modalities to enhance modal representations, designs a lightweight temporal alignment module combining the strengths of Mamba and temporal cross-attention, and employs text-routed sparse mixture of experts for cross-modal fusion. TEXT comprehensively outperforms prior SOTA methods and large models such as GPT-4o across four MSA benchmarks.
Thucy: An LLM-based Multi-Agent System for Claim Verification across Relational Databases: This paper presents Thucy, the first multi-agent claim verification system supporting cross-database and cross-table reasoning. Led by a Verifier agent, it coordinates three specialized agents (Data/Schema/SQL Expert) with zero prior knowledge of the data sources, enabling autonomous discovery, reasoning, and SQL evidence generation. Thucy surpasses the previous SOTA by 5.6 percentage points on TabFact (94.3%).
Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning: Authentic-Dubber simulates the director-actor interaction workflow in real-world dubbing by constructing a multimodal reference footage library, employing an emotion-similarity-based retrieval-augmented strategy, and adopting a progressive graph-based speech generation approach. The method significantly improves the emotional expressiveness of automatic movie dubbing, achieving state-of-the-art emotion accuracy and MOS scores on the V2C-Animation dataset.
USE: A Unified Model for Universal Sound Separation and Extraction: The paper proposes USE, a unified framework that employs an EDA network to infer the number of sound sources and generate acoustic cues for sound separation (SS), and a multimodal fusion network to interpret user-provided text/video/label cues for target sound extraction (TSE). Joint training with cross-task alignment enables mutual reinforcement between the two tasks, achieving +1.4 dB SDR on SS and 86% matching accuracy on TSE.
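The Shapley-value idea named in the PaSE entry above can be illustrated with a small sketch that computes exact Shapley values for three modalities. The coalition scores below are made-up numbers, and `shapley_values` and `coalition_score` are hypothetical names. How PaSE turns such contributions into gradient scales is not stated in the summary, so that step is left out.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal gain of each player over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value[with_p] - value[coalition]
            coalition = with_p
    return {p: totals[p] / len(orders) for p in players}


# Hypothetical validation scores for each subset of modalities (illustrative only).
coalition_score = {
    frozenset(): 0.0,
    frozenset({"text"}): 0.70,
    frozenset({"audio"}): 0.40,
    frozenset({"video"}): 0.45,
    frozenset({"text", "audio"}): 0.76,
    frozenset({"text", "video"}): 0.78,
    frozenset({"audio", "video"}): 0.55,
    frozenset({"text", "audio", "video"}): 0.82,
}

modalities = ["text", "audio", "video"]
phi = shapley_values(modalities, coalition_score)
dominant = max(phi, key=phi.get)  # modality with the largest contribution
```

By construction the values sum to the score of the full coalition minus that of the empty one, which makes them a convenient per-modality contribution measure when one modality dominates training.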

✂️ Segmentation

A²LC: Active and Automated Label Correction for Semantic Segmentation: This paper proposes the A²LC framework, which augments conventional active label correction (ALC) — where annotators manually fix errors one by one — with an automated correction stage via a Label Correction Module (LCM). The LCM leverages annotator feedback to automatically rectify similar erroneous masks, while an Adaptively Balanced acquisition function (ABC) is designed to mitigate class imbalance. On Cityscapes, A²LC surpasses the previous SOTA using only 20% of the budget, achieving a 27.23% mIoU improvement under equal budget conditions.
Adaptive Morph-Patch Transformer for Aortic Vessel Segmentation: This paper proposes the Morph-Patch Transformer (MPT), which generates morphology-aware patches via a velocity-field-based adaptive patch partitioning strategy to preserve vascular topological integrity, and introduces Semantic Clustering Attention (SCA) to dynamically aggregate features from semantically similar patches. The method achieves state-of-the-art performance on three aortic segmentation benchmarks: AVT, AortaSeg24, and TBAD.
Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization: This paper proposes Generative Clean-Image Backdoors (GCB), which employs a Conditional InfoGAN (C-InfoGAN) to automatically discover naturally occurring, task-irrelevant features within images as backdoor triggers. GCB achieves high attack success rates (ASR ≥ 90%) at extremely low poison rates (≤ 0.5%) with negligible degradation of clean accuracy (CA drop ≤ 1%), thereby becoming the first method to break the inherent stealth-potency trade-off in clean-image backdoor attacks.
Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-Domain Few-Shot Segmentation: This paper proposes the HSL framework, which addresses the segmentation granularity gap between source and target domains in cross-domain few-shot segmentation (CD-FSS) via three modules — Dual Style Randomization (DSR), Hierarchical Semantic Mining (HSM), and Prototype Confidence-modulated Thresholding (PCMT) — achieving state-of-the-art performance across four target-domain datasets.
Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation: This paper proposes Causal-Tune, a causality-driven VFM fine-tuning strategy that decomposes VFM features into causal (domain-invariant) and non-causal (domain-specific) components via DCT frequency-domain transformation and Gaussian band-pass filtering. Learnable tokens are applied exclusively to the causal components for refinement, effectively suppressing VFM artifacts and improving generalization in domain generalized semantic segmentation.
CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion: This paper proposes CtrlFuse, which achieves interactive controllable infrared-visible image fusion by fine-tuning SAM with mask prompt guidance, simultaneously improving fusion quality and downstream segmentation/detection performance.
Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation: This paper proposes FLEX-Seg, a framework that reframes the inherent boundary misalignment between images and semantic masks in diffusion-model-synthesized data as an opportunity to learn robust representations. Through three modules—Granular Adaptive Prototypes (GAP), Uncertainty Boundary Emphasis (UBE), and Hardness-Aware Sampling (HAS)—FLEX-Seg achieves state-of-the-art performance on domain generalized semantic segmentation.
EAGLE: Episodic Appearance- and Geometry-Aware Memory for Unified 2D-3D Visual Query Localization: This paper proposes the EAGLE framework, inspired by avian memory consolidation mechanisms. A segmentation branch guided by an Appearance-aware Meta-learning Memory (AMM) and a tracking branch driven by a Geometry-aware Localization Memory (GLM) operate collaboratively. Combined with VGGT, the framework achieves efficient unified 2D-3D visual query localization, attaining state-of-the-art performance on the Ego4D-VQ benchmark.
Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter: This paper is the first to introduce DINOv2 into underwater instance segmentation. Through two adaptation modules—AquaStyle Aligner (Fourier frequency-domain style injection) and ObjectPrior Prompter (binary mask prior prompting)—the proposed DiveSeg achieves efficient domain adaptation and substantially outperforms SAM-based methods on the UIIS and USIS10K benchmarks with fewer parameters.
From Attribution to Action: Jointly ALIGNing Predictions and Explanations: This paper proposes the ALIGN framework, which jointly trains a learnable masker and a classifier through alternating optimization to iteratively align model attribution maps with task-relevant region masks, simultaneously improving prediction accuracy and interpretability. ALIGN outperforms six strong baselines on the VLCS and Terra Incognita domain generalization benchmarks.
Generalizable Slum Detection from Satellite Imagery with Mixture-of-Experts: This paper proposes GRAM (Generalized Region-Aware Mixture-of-Experts), a two-stage test-time adaptation framework. In the first stage, a MoE architecture is used to train region-specialized experts on million-scale satellite imagery from 12 cities. In the second stage, reliable pseudo-labels are selected via cross-region prediction consistency for self-training, enabling generalization of slum segmentation to unseen African cities.
Guideline-Consistent Segmentation via Multi-Agent Refinement: A training-free multi-agent framework is proposed that achieves guideline-consistent semantic segmentation through an iterative Worker (segmentation execution) and Supervisor (guideline verification) loop coupled with an RL-based adaptive stopping strategy, surpassing prior SOTA by 8.61 and 5.5 gIoU on Waymo and ReasonSeg, respectively.
InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer: This paper proposes InfoCLIP, which adopts an information-theoretic perspective to design two objectives—information bottleneck compression and mutual information distillation—to remove noise in pretrained pixel-text alignment and preserve semantic alignment knowledge during CLIP fine-tuning. InfoCLIP achieves state-of-the-art results across six open-vocabulary semantic segmentation benchmarks (A-847: 16.6, A-150: 38.5, PC-59: 63.5 mIoU) while introducing only 0.53M additional parameters and negligible computational overhead.
JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion: This paper proposes JoDiffusion, a framework that jointly diffuses images and pixel-level annotation masks in latent space, enabling—for the first time—simultaneous generation of semantically consistent image–annotation pairs conditioned solely on text prompts. JoDiffusion substantially outperforms existing Image2Mask and Mask2Image methods on Pascal VOC, COCO, and ADE20K.
LWGANet: Addressing Spatial and Channel Redundancy in Remote Sensing Visual Tasks with Light-Weight Grouped Attention: To address spatial redundancy (large homogeneous backgrounds) and channel redundancy (inefficient single feature space under extreme scale variation) in remote sensing imagery, LWGANet is proposed as a lightweight backbone that achieves efficient multi-scale feature representation via Top-K sparse global feature interaction (TGFI) and heterogeneous grouped attention (LWGA) modules, achieving SOTA across 4 remote sensing task categories on 12 datasets.
Multigranular Evaluation for Brain Visual Decoding: This paper proposes BASIC, a multigranular evaluation framework that unifies the assessment of brain visual decoding quality along two axes — structural (multi-level segmentation mask matching across four granularities) and semantic (precision/recall/F1 over object/attribute/relation graphs extracted by MLLMs) — covering six modality combinations of fMRI/EEG × Image/Video/3D, thereby addressing the limitations of metric saturation, lack of neuroscientific grounding, and insufficient fine-grained diagnostic capacity in existing evaluation protocols.
Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV: To address background distraction in few-shot action recognition (FSAR) for wide-angle videos — where subjects occupy a small portion of the frame and temporal relationships degrade — this paper proposes Otter, an enhanced RWKV-based framework. It introduces a Compound Segmentation Module (CSM) for subject highlighting and a Temporal Reconstruction Module (TRM) for recovering temporal relationships, achieving state-of-the-art results on SSv2, Kinetics, UCF101, and HMDB51 benchmarks.
RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation: This paper proposes RS2-SAM2, a framework that injects textual information into the SAM2 image encoding process via a Bidirectional Hierarchical Fusion Module (BHFM) and designs a Mask Prompt Generator (MPG) to supply SAM2 with dense prompts, achieving state-of-the-art performance on referring remote sensing image segmentation.
RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images: This paper proposes RSVG-ZeroOV, a training-free framework that integrates cross-attention maps from a VLM and self-attention maps from a diffusion model via an "Overview-Focus-Evolve" three-stage strategy, enabling zero-shot open-vocabulary visual grounding in remote sensing images.
S5: Scalable Semi-Supervised Semantic Segmentation in Remote Sensing: This paper proposes the S5 framework, which for the first time extends semi-supervised semantic segmentation into a pre-training paradigm for remote sensing foundation models (RSFMs). By constructing the million-scale RS4P-1M dataset and introducing a MoE-based multi-dataset fine-tuning strategy, S5 achieves state-of-the-art performance across multiple remote sensing segmentation and detection benchmarks.
SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection: This paper proposes SAM-DAQ, which adapts SAM2 to RGB-D video salient object detection (VSOD) via a Depth-guided Parallel Adapter (DPA) and a Query-driven Temporal Memory (QTM) module, addressing three key challenges: manual prompt dependency, excessive GPU memory consumption, and computational overhead.
SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model: This paper proposes SAQ-SAM, which improves post-training quantization (PTQ) of SAM from a semantic alignment perspective. It introduces Perceptual Consistency Clipping (PCC) to handle extreme outliers in the mask decoder, and Prompt-Aware Reconstruction (PAR) to preserve semantic alignment between image and prompt interactions.
Segment and Matte Anything in a Unified Model (SAMA): This paper proposes SAMA — a lightweight extension of SAM that introduces a Multi-View Local Encoder (MVLE) to capture fine-grained local features, a Local-Adapter to inject local details into the decoding process, and dual task-specific prediction heads. With only a 1.8% parameter increase, SAMA achieves high-quality interactive segmentation and alpha matting simultaneously within a unified model, reaching state-of-the-art performance on DIS-5K and multiple matting benchmarks.
Segment Anything Across Shots: A Method and Benchmark: This paper proposes SAAS, a method for Multi-shot Video Object Segmentation (MVOS), along with the Cut-VOS benchmark. SAAS achieves robust cross-shot segmentation via transition-simulating data augmentation (TMA), shot transition detection and understanding modules (TDM+TCH), and a local memory bank.
SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation: This paper proposes SSR, a dual-level semantic and spatial rectification framework that addresses non-target foreground over-activation caused by CLIP's cross-modal semantic misalignment via Cross-Modal Prototype Alignment (CMPA), and background over-activation during affinity propagation via Superpixel-Guided Correction (SGC). SSR achieves state-of-the-art performance on PASCAL VOC and MS COCO, surpassing both single-stage and multi-stage methods.
Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models: This paper proposes Symmetrical Flow Matching (SymmFlow), which unifies semantic segmentation, classification, and image generation into a single model. By jointly modeling forward and reverse flow transformations through a symmetric learning objective, SymmFlow achieves state-of-the-art performance in semantic image synthesis with only 25 inference steps (CelebAMask-HQ FID 11.9, COCO-Stuff FID 7.0), while obtaining competitive results on segmentation and classification.
Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective: This paper systematically investigates CLIP's internal mechanisms from an explainability perspective, revealing a "distraction" phenomenon in which CLIP allocates substantial attention resources to target-irrelevant tokens in deeper layers. The proposed training-free method RF-CLIP performs attention redistribution to refocus dispersed resources onto target regions, achieving state-of-the-art performance across 8 benchmarks while maintaining inference efficiency.
Text-guided Controllable Diffusion for Realistic Camouflage Images Generation: CT-CIG is proposed as the first text-guided controllable camouflage image generation method. It leverages a VLM-based Camouflage-Revealing Dialogue Mechanism (CRDM) to generate high-quality text prompts, and combines a lightweight control network with a Frequency Interaction Refinement Module (FIRM) built upon the Stable Diffusion framework to produce logically coherent and texturally realistic camouflage images, establishing a new Text-guided CIG paradigm.
Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors: This paper proposes AffordDex, a two-stage framework: the first stage pre-trains human hand motion priors (natural motion trajectories) via imitation learning; the second stage refines the policy through reinforcement learning using a residual module and VLM-guided Negative Affordance Annotation (NAA), achieving dexterous robotic grasping that is both human-like in naturalness and functionally correct (e.g., avoiding the blade and grasping the handle of a knife). The method significantly outperforms state-of-the-art approaches across multiple generalization levels.
Tracking and Segmenting Anything in Any Modality: This paper proposes SATA, a unified tracking and segmentation framework that models cross-modal shared and modality-specific knowledge via a Decoupled Mixture-of-Experts (DeMoE) mechanism and introduces a Task-aware Multi-Object Tracking (TaMOT) pipeline to unify the output format across all tasks, demonstrating superior performance across 18 tracking and segmentation benchmarks.
Vista: Scene-Aware Optimization for Streaming Video Question Answering Under Post-Hoc Queries: Vista is a scene-aware streaming video question answering framework that dynamically segments streaming video into semantically coherent scene units, applies spatiotemporal compression to each scene and offloads it to CPU memory, and selectively recalls the most relevant scenes upon user queries, achieving high-accuracy video QA under strict GPU memory and latency constraints.
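The frequency-domain split described in the Causal-Tune entry above (a DCT followed by a Gaussian band-pass filter) can be sketched in one dimension with the standard library alone. This is a simplified illustration, not the paper's implementation. The 1-D signal, the band centre and width, and the function names are all assumptions, and which band counts as "causal" is not specified in the summary.

```python
import math


def _scale(k, n):
    """Normalisation factor that makes the DCT-II orthonormal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]


def gaussian_band_pass(n, center, width):
    """Soft mask over frequency indices, peaking at `center`."""
    return [math.exp(-((k - center) ** 2) / (2.0 * width ** 2)) for k in range(n)]


def split_by_band(x, center, width):
    """Split x into an in-band part and its complement; the two parts sum back to x."""
    coeffs = dct(x)
    mask = gaussian_band_pass(len(x), center, width)
    in_band = idct([m * c for m, c in zip(mask, coeffs)])
    out_of_band = idct([(1.0 - m) * c for m, c in zip(mask, coeffs)])
    return in_band, out_of_band


# Hypothetical 1-D feature vector, used purely for illustration.
feature = [0.2, 1.0, -0.5, 0.7, 0.1, -0.3, 0.9, 0.4]
in_band, out_of_band = split_by_band(feature, center=3.0, width=1.5)
reconstructed = [a + b for a, b in zip(in_band, out_of_band)]
```

Because the transform is linear and the two masks sum to one, the components add back to the original signal, so refinement applied to one component leaves the other untouched.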

💡 LLM Reasoning

A Reasoning Paradigm for Named Entity Recognition: This paper proposes ReasoningNER, which reframes named entity recognition from "implicit pattern matching" to an "explicit reasoning" paradigm. Through a three-stage pipeline (CoT data construction → CoT fine-tuning → GRPO reinforcement enhancement), the model first reasons and then extracts entities. Under zero-shot settings, ReasoningNER surpasses GPT-4 by 12.3 F1 points, and the 8B model achieves an average F1 of 72.4 on CrossNER.
Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models: This paper systematically analyzes abstention failures in Large Reasoning Models (LRMs) when confronted with unanswerable math problems. It finds that LRMs possess sufficient internal cognitive capacity to recognize unsolvability (linear probe classification accuracy >80%), yet their external behavior remains biased toward forced answering. A two-stage approach combining cognitive monitoring and inference-time intervention is proposed, improving abstention rates from 16–54% to 60–92% without degrading reasoning performance on answerable questions.
ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction: This paper proposes the Latent Reasoning Chain Extraction (ARCHE) task, which requires LLMs to decompose scientific paper argumentation into Reasoning Logic Trees (RLTs) grounded in Peirce's three reasoning paradigms. Through two complementary metrics—Entity Coverage (EC) and Reasoning Edge Accuracy (REA)—the study reveals a fundamental trade-off between content completeness and logical correctness across 10 mainstream LLMs.
BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models: This paper proposes BadThink — the first training-time backdoor attack targeting CoT reasoning efficiency. By iteratively optimizing verbose reasoning templates via an LLM, it constructs poisoned data that causes the victim model, upon trigger activation, to generate reasoning chains inflated by over 17× (on MATH-500), while preserving final answer correctness and maintaining strong stealthiness.
BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards: This paper proposes BLM-Guard, an explainable multimodal moderation framework for short-video commercial advertisements. It first establishes structured reasoning capability via rule-driven ICoT data synthesis and SFT cold-start, then applies Self-Adaptive GRPO reinforcement learning (combining rule correctness rewards and a self-adaptive consistency reward SCA-R) to optimize policy alignment, achieving 91.4% strict accuracy and 0.845 reasoning consistency score on a real-world ad benchmark.
Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models: This paper proposes ASE (Adversarial Scenario Extrapolation), an inference-time CoT defense framework that enables LLMs to autonomously simulate adversarial scenarios and formulate defensive strategies prior to responding. ASE achieves near-zero attack success rates across four categories of safety threats (jailbreak, toxicity, hallucination, and bias), while reducing direct refusal rates to ≤4%, effectively balancing robustness and user experience.
CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation: This paper proposes the CMMCoT framework, which constructs interleaved multimodal multi-step reasoning chains (with visual region token supervision) and a test-time retrieval-based memory augmentation module (RIFREM) to enhance slow-thinking reasoning in multi-image scenarios without increasing model parameters. Built on Qwen2.5-VL-7B, the method achieves an average improvement of 1.4 points on multi-image benchmarks.
Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning: This paper demonstrates that attention head activations in intermediate layers of LLMs implicitly encode truthfulness information about reasoning steps during CoT inference (probing accuracy up to 85%). Based on this finding, confidence predictors are trained to guide beam search in dynamically selecting high-confidence reasoning paths, surpassing Self-Consistency and PRM Guided Search on mathematical, symbolic, and commonsense reasoning tasks.
Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment: This paper systematically investigates decision-making uncertainty across 32 open-source LLMs in moral dilemma scenarios (trolley problem variants), finding that uncertainty is primarily driven by model architecture rather than moral dimension. Introducing attention dropout at inference time significantly increases mutual information and improves human-LLM moral alignment, suggesting that reducing overconfidence in moral scenarios can enhance consistency with human preferences.
Efficient Thought Space Exploration Through Strategic Intervention: This paper proposes the Hint-Practice Reasoning (HPR) framework, in which a large model (hinter) provides short hints at sparse critical tokens while a small model (practitioner) handles the majority of the reasoning. HPR achieves performance comparable to the self-consistency baseline using only 1/5 of the tokens, and improves accuracy by up to 5.1% under the same FLOPs budget.
ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation: This paper constructs ESG-Bench — 270 manually annotated QA pairs from 94 real ESG reports (2020–2024) — and proposes a three-stage hallucination mitigation pipeline: SFT (with grounded answers + "Not Provided" abstention labels) → CoT Prompting (2/4-step prompt templates) → CoT Fine-tuning (with human-annotated reasoning chains). The 4-step CoT fine-tuned Llama-3 achieves 92.52% with-answer (WA) accuracy and 99.37% without-answer (WoA) accuracy (balanced 96%), with generalization gains on HaluEval and BioASQ.
Evaluating, Synthesizing, and Enhancing for Customer Support Conversation: This paper defines five dialogue phases and twelve support strategies based on the COPC industry standard, generates 11,232 strategy-rich synthetic dialogues (RoleCS) via five-agent role-playing, and constructs a 1,855-sample evaluation set (CSConv) by rewriting real conversations. Fine-tuning on these resources substantially improves strategy-aligned response quality and issue resolution rates.
ExtendAttack: Attacking Servers of LRMs via Extending Reasoning: This paper proposes ExtendAttack, a resource exhaustion attack targeting Large Reasoning Models (LRMs): by randomly converting characters in the prompt into multi-base ASCII encodings, the attack forces models to perform extensive character-by-character decoding before answering, increasing o3's response length by more than 2.7× and doubling latency, while keeping answer accuracy largely intact.
Graph of Verification: Structured Verification of LLM Reasoning with Directed Acyclic Graphs: This paper proposes Graph of Verification (GoV), a structured verification framework that models LLM reasoning processes as directed acyclic graphs (DAGs). Through a flexible Node Block architecture, GoV enables multi-granularity verification—ranging from atomic-level steps in formal tasks to paragraph-level verification in natural language narratives—and substantially outperforms both holistic verification and other decomposed verification methods on both structured and loosely structured reasoning benchmarks.
Improving Value-based Process Verifier via Low-Cost Variance Reduction: To address the high-variance issue in value-based process reward model (PRM) training caused by limited Monte Carlo (MC) samples, this paper proposes Compound Monte Carlo Sampling (ComMCS), which constructs an unbiased low-variance estimator by linearly combining MC estimates from the current step and subsequent steps. The method introduces no additional LLM inference overhead and achieves a 2.2-point improvement on MATH-500 under Best-of-32 evaluation.
Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement: This paper proposes a Self-Rewriting framework that enables large reasoning models (LRMs) to rewrite their own reasoning traces for "easy" samples (queries where all responses are correct) during RL training and learn from the rewritten versions. With only ~10% additional training overhead, the approach reduces reasoning length by 46% while maintaining accuracy, improves internal reasoning quality (LLM-as-Judge) by 7.2 points, and effectively mitigates issues such as over-thinking and redundant thinking.
Intention Chain-of-Thought Prompting with Dynamic Routing for Code Generation: This paper proposes RoutingGen — a difficulty-aware code generation framework grounded in the principle of cognitive economy. A Qwen3-8B classifier dynamically routes tasks to either a simple path (few-shot direct generation) or a complex path (Intention CoT = specification constraints + algorithmic intent + complexity analysis), achieving a +45.15% improvement on McEval while reducing average token consumption by 46.37%.
Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search: This work constructs the NbQA dataset (38K task-solution pairs extracted from real Jupyter Notebooks) and proposes the Jupiter framework (modeling data analysis as a state-level search problem with PUCT search guided by a value model), enabling Qwen2.5-14B to achieve 86.38% on InfiAgent-DABench, surpassing GPT-4o (85.99%), and improving Qwen2.5-7B on DSBench from 63.51% to 89.19%.
L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention: Through LAT analysis, this paper reveals that the low-frequency CoT directional representations of LLMs and VLMs share similar distributions. It proposes L2V-CoT: extract CoT directional representations from an LLM → apply low-pass filtering → frequency-domain resampling for dimension alignment → inject into VLM hidden layers. This training-free approach transfers LLM reasoning capabilities to VLMs, achieving an average improvement of 3.7% and a maximum gain of 8.6%.
LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning: This paper proposes an entropy-guided adaptive LLM reasoning framework that combines dynamic in-context retrieval with adaptive chain-of-thought (CoT) reasoning. On the Tic-Tac-Toe benchmark, the framework improves the average game outcome of LLMs from -11.6% to +9.5% while maintaining a low number of LLM queries.
Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension: This paper proposes Relation-R1, the first unified framework for binary and N-ary relation comprehension, combining progressively cognitive CoT-guided SFT with GRPO multi-reward optimization. With only 3B parameters, it surpasses 13B models, achieving 21.20% Mean (+6.87%) on PSG and state-of-the-art performance across all metrics on SWiG (Grnd-all 30.18%, +14.48%).
RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation: This paper proposes RPM-MCTS, which replaces a trained Process Reward Model (PRM) with knowledge base retrieval to guide MCTS search for code generation. Exploiting the homogeneity of correct implementations within the same algorithm family, the method retrieves reference algorithm steps from a knowledge base as evaluation signals, applies similarity-based filtering to prune redundant expansion nodes, and uses sandbox execution to localize errors—achieving approximately 15% token reduction while surpassing prior state-of-the-art.
SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger: Inspired by Error-Related Negativity (ERN) in neuroscience, this paper proposes SAPO, a self-adaptive process optimization method that replaces costly step-wise Monte Carlo rollouts with first error detection and local posterior estimation. SAPO reduces computational cost by 2–3× while enabling joint optimization of the reasoner and verifier, allowing small language models (≤2B) to outperform most self-evolution methods on mathematical and code reasoning tasks.
SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling: Grounded in the dual-process theory from cognitive science, SCALE decomposes mathematical problems into sub-problems and allocates compute resources according to difficulty (System 1 for fast computation vs. System 2 for deep reasoning). On AIME25, it improves Qwen3-32B from 57.50% to 71.25% while reducing token consumption by 33–53% compared to InftyThink.
SERL: Self-Examining Reinforcement Learning on Open-Domain: This paper proposes SERL, a self-improvement framework in which an LLM simultaneously acts as an Actor (generator) and a Judge (evaluator). It derives reward signals from the model's own judgments via the Copeland pairwise comparison method, requiring neither external reward models nor human annotations. SERL improves Qwen3-8B from 52.37% to 59.90% (+7.53%) on AlpacaEval 2.0, approaching the performance of Qwen3-32B.
Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning: By applying single-epoch supervised fine-tuning (SFT) to OPT-350M, this work achieves a 77.55% pass rate on ToolBench, substantially outperforming large-model baselines such as ChatGPT-CoT (26%) and ToolLLaMA-DFS (30.18%), demonstrating that small models with targeted fine-tuning can significantly surpass general-purpose large models on specific tasks.
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision: This paper proposes SPARE, a unified single-pass evaluation framework that simultaneously performs step-to-reference alignment and correctness judgment (with explicit reasoning) in a single structured generation, requiring no additional training data. SPARE achieves 2.3× speedup over MCTS-based methods and attains OOD generalization with only 16% of the training samples.
Stable Voting and the Splitting of Cycles: This paper investigates the conjecture that Simple Stable Voting (SSV), a recursive voting rule already used in hundreds of real-world elections, always refines Split Cycle (SC). Through mathematical proof (≤5 candidates) and SAT solving (6–7 candidates), the paper establishes that the conjecture holds for ≤6 candidates and fails for ≥7 candidates, and it generalizes the counterexample to arbitrarily many candidates via a constructive proof.
Text-to-Scene with Large Reasoning Models: This paper proposes Reason-3D, which leverages the multi-step spatial reasoning capabilities of large reasoning models (LRMs) to achieve zero-shot text-to-3D scene generation via semantic-voting-based object retrieval and a two-stage layout strategy (autoregressive placement + collision-aware refinement). The system achieves an Elo score of 2248 in human evaluation, substantially outperforming Holodeck (1500) and LayoutVLM (1650).
Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities: This paper systematically evaluates the negative impact of deliberative reasoning on foundational capabilities (helpfulness and harmlessness) in Large Reasoning Models (LRMs) such as DeepSeek-R1, QwQ, and OpenThinker. It finds that deliberative reasoning significantly degrades instruction-following and safety, and proposes adaptive reasoning modes—Zero-Thinking, Less-Thinking, and Summary-Thinking—that effectively mitigate these deficiencies.
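
The Copeland pairwise comparison mentioned in the SERL entry above can be sketched in a few lines. This is only an illustration of the generic Copeland score (wins minus losses over all pairwise comparisons), not the paper's implementation; `copeland_scores`, `prefer`, and `toy_judge` are hypothetical names, and the toy judge merely stands in for the model's own judgment.

```python
from itertools import combinations
from typing import Callable, Sequence


def copeland_scores(candidates: Sequence[str],
                    prefer: Callable[[str, str], int]) -> list[int]:
    """Return one Copeland score (pairwise wins minus losses) per candidate.

    prefer(a, b) is a hypothetical judge: > 0 if a beats b,
    < 0 if b beats a, and 0 for a tie.
    """
    scores = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        outcome = prefer(candidates[i], candidates[j])
        if outcome > 0:
            scores[i] += 1
            scores[j] -= 1
        elif outcome < 0:
            scores[i] -= 1
            scores[j] += 1
    return scores


def toy_judge(a: str, b: str) -> int:
    # Assumption for illustration only: the longer response wins.
    return len(a) - len(b)


responses = ["ok", "a fuller answer", "short one"]
print(copeland_scores(responses, toy_judge))  # [-2, 2, 0]
```

In a self-judging setup, scores like these could serve as relative rewards for the sampled responses; how SERL turns them into a training signal is not specified in the summary above.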

🔍 Information Retrieval & RAG¶

"As Eastern Powers, I Will Veto." : An Investigation of Nation-Level Bias of Large Language Models in International Relations: This paper systematically investigates nation-level bias of LLMs in international relations, designing three bias evaluation paradigms (DirectQA, Association Test, Vote Simulation) grounded in real UN Security Council data. It reveals the multi-dimensional nature of such bias—varying across models and evaluation contexts—and proposes a RAG+Reflexion debiasing framework.
Beyond Perplexity: Let the Reader Select Retrieval Summaries via Spectrum Projection Score: This paper proposes Spectrum Projection Score (SPS), a training-free metric that evaluates retrieval summary quality by measuring the alignment between summary token embeddings and the principal subspace of the reader LLM, serving as a replacement for conventional perplexity-based metrics. Combined with the xCompress inference-time controller, SPS achieves substantial improvements over perplexity-based methods across 5 QA datasets (HotpotQA EM +3.6).
Cog-RAG: Cognitive-Inspired Dual-Hypergraph with Theme Alignment Retrieval-Augmented Generation: This paper proposes Cog-RAG, which constructs a dual-hypergraph index comprising a theme hypergraph and an entity hypergraph to simulate the human "top-down" cognitive process via a two-stage retrieval strategy (theme first, then details), achieving global-to-local semantic alignment for generation.
ComLQ: Benchmarking Complex Logical Queries in Information Retrieval: This paper introduces ComLQ, the first IR benchmark targeting complex logical queries spanning 14 query types (conjunction, disjunction, negation, and their combinations). It proposes a subgraph-guided LLM data synthesis pipeline and a negation consistency metric LSNC, revealing that existing retrievers suffer severely in logical reasoning—particularly in negation modeling.
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning: Inspired by the metacognitive regulation mechanism of the prefrontal cortex, this paper proposes the ComoRAG framework, which achieves stateful multi-step reasoning via a dynamic memory workspace and iterative probe queries, significantly outperforming existing RAG methods on long narrative understanding tasks (200K+ tokens).
ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval: This paper proposes ConvMix, a mixed-criteria data augmentation framework that leverages LLMs to perform scalable relevance annotation augmentation from both query and document directions, combined with clustering-based diversity selection and Fisher information-based in-distribution supervision, to systematically improve conversational dense retrieval performance.
Do Retrieval Augmented Language Models Know When They Don't Know?: This paper systematically analyzes the refusal calibration problem in RAG models, finding that RALMs exhibit an over-refusal rate exceeding 55% when all retrieved documents are irrelevant (even when the model's internal knowledge suffices to answer), and proposes a mechanism combining uncertainty estimation with refusal-aware fine-tuning to balance refusal behavior and answer quality.
Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs: This paper systematically investigates how three hallucination mitigation methods (CoVe, DoLa, RAG) affect LLM creativity, finding that they exert diametrically opposite effects on divergent creativity—CoVe enhances it, DoLa suppresses it, and RAG has no significant impact—while convergent creativity remains largely unaffected. These patterns hold consistently across model families and parameter scales.
Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-Based Machine Translation: This work develops a controlled noise injection framework to systematically evaluate retrieval-augmented machine translation (REAL-MT), introduces two new metrics—Fidelity and CAR—and reveals across 10 language pairs × 4 noise types that models blindly adopt retrieved context even when it is contradictory (CAR remains 65–78%). Large reasoning models (LRMs) are found to be even more vulnerable by "rationalizing" erroneous context, and a fundamental trade-off exists between noise robustness and clean-context utilization.
HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment: This paper proposes HiMo-CLIP, which applies in-batch PCA decomposition (HiDe) to text embeddings to extract multi-granularity semantic components, combined with a dual-branch monotonicity-aware contrastive loss (MoLo). Without modifying the encoder, the model learns that "more complete text should yield higher alignment scores" — a property termed semantic monotonicity — and significantly outperforms existing methods on long-text retrieval.
Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning: This paper proposes MERGE, the first multimodal entity-aware RAG framework for news image captioning. Through three core components — an Entity-centric Multimodal Knowledge Base (EMKB), Hypothetical Caption-guided Multimodal Alignment (HCMA), and Retrieval-driven Multimodal Knowledge Integration (RMKI) — MERGE achieves CIDEr +6.84 and F1 +4.14 on GoodNews, and demonstrates strong generalization with CIDEr +20.17 on the unseen Visual News benchmark.
Magnitude Matters: A Superior Class of Similarity Metrics for Holistic Semantic Understanding: This paper proposes two parameter-free, magnitude-aware vector similarity metrics—Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS)—that achieve significantly lower MSE than Cosine Similarity and Dot Product on classification tasks (paraphrase detection, natural language inference) across 4 sentence embedding models and 8 NLP benchmarks, without any additional training overhead.
MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering: MAVIS is the first benchmark for evaluating multimodal source attribution systems, comprising 157K visual QA instances with fact-level citations to multimodal documents per answer, along with automatic evaluation metrics across three dimensions: informativeness, groundedness, and fluency.
Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction: This paper proposes H2Memory, a four-layer hierarchical heterogeneous memory structure (Log Graphs / Background Memory / Topic Outlines / Principles), validated on the PAL-Set dataset (100 users × 8.4 months of interaction), improving BLEU-1 on demand paraphrasing and solution recommendation tasks from 13.59 to 26.67.
Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework: This paper proposes Multimodal DeepResearcher, a four-stage agentic framework for generating text-chart interleaved research reports from scratch. It introduces Formal Description of Visualization (FDV) to enable LLMs to learn and produce diverse charts, and employs an Actor-Critic iterative refinement mechanism (LLM generates D3.js code → browser rendering → multimodal LLM review). The system achieves an 82% overall win rate (Claude 3.7) on the newly constructed MultimodalReportBench and a 100% win rate in human evaluation.
N2N-GQA: Noise-to-Narrative for Graph-Based Table-Text Question Answering Using LLMs: N2N-GQA is proposed as the first zero-shot framework for open-domain hybrid table-text question answering. Its core mechanism transforms noisy retrieved documents into a dynamic evidence graph (documents as nodes, TF-IDF shared-term weights as edges), and employs graph centrality-based pruning to identify "bridging documents" that connect multi-hop reasoning chains. On OTT-QA, it achieves +39.6 EM over Vanilla RAG (8.0 → 48.8), approaching the fine-tuned system CORE (49.0 EM) in a zero-shot setting.
Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval: This paper proposes NIRNL, a framework that enhances sample discriminability via Cross-modal Margin Preserving (CMP) and employs Neighbor-aware Instance Refining (NIR) to partition training data into clean, hard, and noisy subsets, each with a tailored optimization strategy. The framework unifies three paradigms—robust learning, label calibration, and instance selection—achieving state-of-the-art cross-modal retrieval performance under high noise rates.
OAD-Promoter: Enhancing Zero-shot VQA using Large Language Models with Object Attribute Description: This paper proposes OAD-Promoter, a framework comprising three collaborative modules—Object-concentrated Example Generation (OEG), Memory Knowledge Assistance (MKA), and OAD Prompt—to mitigate language bias inherited by LLMs and improve domain transfer under zero-shot settings, achieving state-of-the-art performance on VQAv2 and multiple other benchmarks.
OPERA: A Reinforcement Learning–Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval: This paper proposes OPERA, a hierarchical framework comprising a Goal Planning Module and a Reason-Execute Module, combined with MAPGRPO, a training algorithm specifically designed for multi-agent settings, to substantially improve performance on reasoning-oriented multi-hop retrieval tasks.
Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End?: This paper presents the first systematic study of positional bias in multimodal representation models, finding that text encoders tend to favor the beginning of the input while image encoders exhibit preference for both the beginning and the end. Through extensive controlled experiments, the study reveals that this bias arises from the joint influence of positional encoding schemes, training objectives, context importance, and image-text pair training.
PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation: This paper extends the Prediction-Powered Inference (PPI) framework to sub-instance-level ranking metrics (e.g., Precision@K), achieving unbiased ranking metric estimation using only 30–100 human annotations combined with large-scale LLM judgments. The computational complexity is reduced from $O(2^{|C|})$ to $O(2^K)$, and the approach has been successfully deployed to guide an LLM-based query rewriting system in an Indian e-commerce search setting.
PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning: Inspired by dual-process cognitive theory, PRIME is a multi-agent reasoning framework in which a Quick Thinking Agent (System 1) rapidly generates intuitive answers, a Reflection Agent evaluates their confidence, and—when uncertainty is detected—six specialized System 2 agents (Planning / Search / Reading / Hypothesis / Integration / Decision) are triggered for deep knowledge-retrieval reasoning. The framework enables open-source LLaMA 3 to approach GPT-4o performance on medical and multi-hop QA benchmarks.
RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation: This paper proposes RAGFort, the first systematic dual-path framework for defending against RAG knowledge base extraction attacks. It combines contrastive reindexing (inter-class) to isolate topic boundaries with constrained cascade generation (intra-class) to suppress sensitive content output. RAGFort reduces the knowledge recovery rate to 0.51× that of an unprotected system while preserving answer quality.
REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering: This paper proposes REAP, a dual-module iterative framework that addresses multi-hop question answering through recursive collaboration between a Sub-task Planner (SP), which maintains a global perspective to dynamically guide reasoning trajectories, and a Fact Extractor (FE), which extracts structured facts and latent clues from retrieved content. Using Llama-3.1-8B, REAP substantially outperforms all baselines on 4 benchmarks (HotpotQA F1 68.0 vs. runner-up 63.4).
ReFeed: Retrieval Feedback-Guided Dataset Construction for Style-Aware Query Rewriting: This paper proposes a retrieval feedback-driven dataset construction framework that automatically builds high-quality style-aware query rewriting datasets through a closed-loop pipeline of three steps: identifying retrieval failure cases, LLM-based stylistic rewriting, and re-retrieval verification. The resulting dataset provides a data foundation for training retrieval-aligned rewriting models.
RRRA: Resampling and Reranking through a Retriever Adapter: This paper proposes the RRRA framework, which attaches a lightweight learnable adapter to a Bi-Encoder to model the false-negative probability of each candidate document. The adapter is used simultaneously for negative resampling during training and reranking during inference, consistently outperforming strong baselines such as SimANS and TriSampler on NQ, TQ, and MS MARCO.
SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention: This paper proposes SR-KI, a framework that injects structured knowledge bases into LLM KV caches via a two-stage training procedure (retrieval layer localization + attention supervision loss). On a single A100 40GB GPU, SR-KI supports injection of up to 40K knowledge base entries, achieves a compression ratio of up to 99.75% via top-100 selection, and maintains an average Recall@10 above 88%.
Towards Inference-Time Scaling for Continuous Space Reasoning: This work presents the first systematic investigation of whether inference-time scaling techniques from discrete text reasoning can transfer to continuous latent-space reasoning models (COCONUT). It finds that dropout sampling can generate diverse reasoning paths (Pass@32 reaching 44.43%), but that PRMs and ORMs yield less than 2.3% improvement. The root cause is that continuous thought representations lack the geometric inductive bias needed to distinguish correct from incorrect reasoning.
When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents: By analyzing 10,734 reasoning trajectories, this paper reveals a severe "Right for Wrong Reasons" (RWR) phenomenon in small language models (7–9B): 50–69% of correct answers contain fundamental reasoning flaws. The authors propose the Reasoning Integrity Score (RIS) as a process-level metric, find that RAG effectively improves reasoning quality while metacognitive interventions are harmful, and distill a fast classifier (0.86 F1, 100× speedup) for real-time deployment.
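
The idea behind the PRECISE entry above, combining a few human annotations with many LLM judgments, builds on Prediction-Powered Inference. A minimal sketch of the basic PPI mean estimator follows. It shows only the standard bias-corrected mean, not the paper's extension to ranking metrics such as Precision@K, and `ppi_mean` and its argument names are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def ppi_mean(llm_unlabeled: Sequence[float],
             llm_labeled: Sequence[float],
             human_labeled: Sequence[float]) -> float:
    """Prediction-powered estimate of a mean metric.

    llm_unlabeled: LLM judgments on the large pool without human labels.
    llm_labeled, human_labeled: paired judgments on the small annotated set.
    """
    if len(llm_labeled) != len(human_labeled):
        raise ValueError("labeled LLM and human judgments must be paired")
    # Rectifier: average signed error of the LLM judge on the annotated set.
    rectifier = fmean([h - m for h, m in zip(human_labeled, llm_labeled)])
    # Correct the large-scale LLM average by the measured judge bias.
    return fmean(llm_unlabeled) + rectifier


# Toy numbers (made up for illustration): the LLM judge is too lenient.
estimate = ppi_mean(
    llm_unlabeled=[1, 1, 0, 1],
    llm_labeled=[1, 1],
    human_labeled=[1, 0],
)
print(estimate)  # 0.25
```

The large LLM-judged pool supplies a low-variance average, and the small human-labeled set corrects its bias.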

LLM Safety¶

Anti-adversarial Learning: Desensitizing Prompts for Large Language Models: This paper proposes PromptObfus, which adopts an "anti-adversarial learning" paradigm to replace sensitive tokens in user prompts with semantically distinct yet task-preserving alternatives. The approach eliminates explicit privacy leakage entirely and reduces implicit privacy inference attack success rates by 62.70%, without degrading the task performance of remote LLMs.
Attention Retention for Continual Learning with Vision Transformers: This paper proposes ARCL-ViT, a framework that prevents attention drift in Vision Transformers during continual learning via a two-step strategy of attention mask generation and gradient masking. It achieves state-of-the-art results on ImageNet-R and CIFAR-100, demonstrating that preserving attention patterns is key to mitigating catastrophic forgetting.
AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models: This paper proposes AUVIC, a framework that combines an adversarial perturbation generator with a dynamic anchor preservation mechanism to precisely unlearn target visual concepts (e.g., specific faces) in MLLMs, while avoiding collateral forgetting of semantically similar concepts. The paper also introduces VCUBench, the first evaluation benchmark for visual concept unlearning in group-scene scenarios.
Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion: This paper proposes the KUnBR framework, which employs gradient-guided knowledge density estimation to localize layers enriched with harmful knowledge, and adopts a block re-insertion strategy to bypass the gradient-masking effect of cover layers, achieving deep unlearning of harmful knowledge in LLMs rather than mere surface-level suppression.
Can Editing LLMs Inject Harm?: This paper reframes knowledge editing as a novel LLM security threat termed Editing Attack, systematically investigating the feasibility of injecting misinformation and bias into LLMs via three editing methods—ROME, FT, and ICE—and demonstrating that such attacks are both highly effective and remarkably stealthy.
CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds: This paper proposes CATFormer, a data-replay-free continual learning framework built upon a spiking Vision Transformer, which achieves task-specific neuronal excitability modulation via context-adaptive dynamic firing thresholds. Over sequences of up to 100 tasks, the model not only avoids forgetting but actually improves in accuracy — a phenomenon the authors term "reverse forgetting."
Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability: This position paper argues that current LLM efficiency research is dominated by hyperscale assumptions. It identifies five open research challenges targeting small- and medium-scale deployers, and advocates for redefining efficiency metrics through an Overhead-Aware Efficiency (OAE) framework.
Designing Truthful Mechanisms for Asymptotic Fair Division: This paper proposes the PRD (Proportional Response with Dummy) mechanism, which for the first time simultaneously achieves expected truthfulness, polynomial-time computability, and high-probability envy-freeness in the asymptotic fair division setting, requiring only $m = \Omega(n \log n)$ items. This resolves an open problem posed by Manurangsi & Suksompong.
FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-World LoRA: FedALT maintains, for each client, a trainable Individual LoRA (updated locally) and a frozen Rest-of-World (RoW) LoRA (averaged from other clients), combined with an adaptive MoE mixer that dynamically balances local and global knowledge. This design fundamentally eliminates cross-client interference caused by FedAvg aggregation, achieving significant improvements over SOTA on heterogeneous-task federated LLM fine-tuning.
From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions: This paper presents the first systematic study of persona-induced bias in LLM-based multi-agent interactions. Through controlled experiments on collaborative problem solving and persuasion tasks, three key findings are revealed: (1) different personas exhibit significant divergence in trustworthiness and insistence (dominant groups such as males and White individuals are perceived as less trustworthy); (2) agents display pronounced in-group favoritism; and (3) these biases persist and tend to amplify in multi-turn, multi-agent settings.
Gender Bias in Emotion Recognition by Large Language Models: This paper systematically evaluates gender bias in emotion recognition across multiple LLMs (GPT-4/5, Mistral, LLaMA, etc.), finding that most models exhibit statistically significant gender bias on at least one emotion label. Experiments demonstrate that inference-time prompt strategies (prompt engineering, in-context learning, CoT) fail to effectively debias, whereas training-based fine-tuning can substantially mitigate the bias.
Ghost in the Transformer: Detecting Model Reuse with Invariant Spectral Signatures: This paper proposes GhostSpec, a data-free, white-box method that does not modify model behavior. It extracts spectral fingerprints by applying SVD to invariant matrix products of attention weight matrices, enabling robust verification of LLM lineage under fine-tuning, pruning, merging, expansion, and even adversarial transformations.
GraphTextack: A Realistic Black-Box Node Injection Attack on LLM-Enhanced GNNs: This paper proposes GraphTextack — the first black-box multimodal node injection poisoning attack targeting LLM-enhanced GNNs. It jointly optimizes the graph structural connections and semantic features of injected nodes via an evolutionary optimization framework, requiring neither internal model information nor surrogate models. GraphTextack significantly outperforms 12 baseline methods across 5 datasets and 2 types of LLM-GNN models.
Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models: This paper employs computational complexity theory to demonstrate that the per-step inference complexity of Transformer-based LLMs is $O(N^2 \cdot d)$. Grounded in the Hartmanis–Stearns Time Hierarchy Theorem, it proves that any computational task exceeding this complexity bound—such as $O(n^3)$ matrix multiplication, $O(n^k)$ token enumeration, or TSP verification—necessarily causes hallucination. Furthermore, LLM agents are shown to be incapable of verifying the correctness of such tasks.
LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models: This paper proposes LAMP, a black-box Universal Adversarial Perturbation (UAP) learning method targeting multi-image MLLMs. By incorporating attention constraints and a contagious loss, LAMP enables cross-model and cross-task transferable attacks by perturbing only a small subset of input images.
Learning from the Undesirable: Robust Adaptation of Language Models without Forgetting: This paper proposes Learning-from-the-Undesirable (LfU), a regularization method for SFT that simulates "undesirable behavior" by applying gradient ascent to an auxiliary model, then enforces representation-level consistency between the original and auxiliary models via an MSE loss. This effectively mitigates overfitting, catastrophic forgetting, and adversarial fragility in limited-data fine-tuning.
LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users: Systematic experiments demonstrate that mainstream LLMs (GPT-4, Claude 3 Opus, Llama 3-8B) exhibit significant discriminatory degradation in information accuracy, truthfulness, and refusal rates toward users with lower English proficiency, lower educational attainment, and non-US backgrounds, making the most vulnerable users the least reliably served.
Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms: This paper introduces the CompositeHarm benchmark, which systematically investigates the vulnerability of LLM safety alignment in cross-lingual settings by translating adversarial syntactic attacks (AttaQ) and contextualized harms (MMSafetyBench) into five Indic languages. The study finds that adversarial syntactic attacks achieve dramatically higher attack success rates in Indic languages.
PANDA: Patch and Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning: This paper proposes PANDA, a framework that achieves intra-task class balancing via CLIP-guided semantic patch grafting and alleviates inter-task distribution shift through a learnable distribution smoothening mechanism. PANDA operates as a plug-and-play module to improve pretrained model-based exemplar-free continual learning under long-tailed scenarios.
Perturb Your Data: Paraphrase-Guided Training Data Watermarking: This paper proposes SPECTRA — a paraphrase-sampling-based training data watermarking method. It generates paraphrases via an LLM and uses Min-K%++ scoring to select paraphrases with scores close to the original text as watermarks. Even when watermarked data constitutes as little as 0.001% of the training corpus, the p-value gap between members and non-members consistently exceeds 9 orders of magnitude.
Principles2Plan: LLM-Guided System for Operationalising Ethical Principles into Plans: This paper presents Principles2Plan, an interactive prototype system that enables collaborative human–LLM operationalisation of high-level ethical principles (e.g., beneficence, privacy) into context-sensitive ethical rules, which are then embedded into a PDDL planner to generate ethically compliant action plans.
PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration: This paper proposes PRISM, a framework that dynamically routes user prompts to one of three inference modes—cloud-only, edge-only, or collaborative—via a context-aware soft gating mechanism. In the collaborative mode, an adaptive two-layer local differential privacy (LDP) scheme and semantic sketch collaboration are employed to achieve a three-way balance among privacy, utility, and efficiency.
Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering: This work is the first to explore privacy-protected RAG for Knowledge Graph Question Answering (KGQA). It proposes ARoG (Abstraction Reasoning on Graph), a framework that employs two strategies—relation-centric abstraction and structure-oriented abstraction—to enable effective retrieval and utilization of knowledge graphs for question answering even when entities are anonymized (replaced with semantically meaningless MIDs).
PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization: This paper proposes PSM, a framework that formalizes system prompt protection as a utility-constrained black-box optimization problem. By leveraging LLM-as-Optimizer, PSM automatically searches for an optimal "shield" suffix that reduces prompt extraction attack success rates to near zero without degrading model functionality.
SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth: This paper introduces SproutBench, an evaluation benchmark comprising 1,283 developmentally-grounded adversarial prompts, designed to systematically assess the safety of 47 LLMs in contexts involving children and adolescents (ages 0–6, 7–12, and 13–18). Key findings reveal that safety and risk prevention are strongly correlated ($\rho = 0.86$), while a significant trade-off exists between interactivity and age-appropriateness ($\rho = -0.48$).
StyleBreak: Revealing Alignment Vulnerabilities in Large Audio-Language Models via Style-Aware Audio Jailbreak: This paper proposes StyleBreak, the first audio jailbreak framework based on speech style, which systematically investigates the impact of linguistic, paralinguistic, and extralinguistic attributes on LAM alignment robustness through a two-stage style-aware transformation pipeline and a query-adaptive policy network. StyleBreak improves the attack success rate (ASR) by 7.1%–22.3% across multiple attack paradigms.
The Confidence Trap: Gender Bias and Predictive Certainty in LLMs: This paper proposes Gender-ECE, a metric for systematically evaluating how well six open-source LLMs are calibrated, and how well they align with human bias judgments, on gendered pronoun prediction tasks. The authors find that Gemma-2 exhibits the worst calibration and an extreme disparity between male and female pronoun calibration, whereas GPT-J-6B, trained on less filtered data, achieves the best calibration overall.
Uncovering Bias Paths with LLM-guided Causal Discovery: An Active Learning and Dynamic Scoring Approach: This paper proposes a hybrid causal discovery framework that integrates LLM semantic priors with statistical signals. Through an active learning strategy and a dynamic scoring mechanism, the framework prioritizes querying the most informative variable pairs, effectively recovering fairness-critical causal paths (e.g., sex→education→income) under noise and confounding conditions, substantially outperforming classical CD methods and naïve LLM-based approaches.
WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking: This paper proposes WaterMod, an LLM text watermarking method based on modular arithmetic ($\text{rank} \bmod k$) that partitions the vocabulary into modular residue classes after sorting tokens by probability. Under both zero-bit ($k=2$) and multi-bit ($k>2$) watermarking settings, WaterMod achieves high detection rates and low quality degradation within a unified framework, requiring no external thesaurus or hashing tricks.
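
The partition step described in the WaterMod entry above (sort tokens by probability, then group them by $\text{rank} \bmod k$) can be sketched as follows. This covers only the partitioning; how the residue classes are then biased during generation and checked at detection time is not described in the summary and is left out. `rank_mod_partition` is a hypothetical name and the probabilities are made up.

```python
from typing import Sequence


def rank_mod_partition(probs: Sequence[float], k: int) -> list[list[int]]:
    """Split token ids into k residue classes by probability rank.

    probs[t] is the model's next-token probability for token id t.
    Rank 0 is the most probable token; token t goes to class rank % k.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    # Token ids ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    classes: list[list[int]] = [[] for _ in range(k)]
    for rank, token_id in enumerate(order):
        classes[rank % k].append(token_id)
    return classes


toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]
print(rank_mod_partition(toy_probs, 2))  # [[1, 4, 0], [3, 2]]
```

Because neighbouring ranks land in different classes, every class receives a mix of high- and low-probability tokens, which is presumably what "probability-balanced" in the title refers to.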

🎁 Recommender Systems¶

Align³GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation: This paper proposes Align³GR, a unified three-level alignment framework that systematically bridges the semantic-behavioral gap between LLMs and recommender systems at the token level (dual-side SCID), the behavior modeling level (multi-task SFT), and the preference level (progressive DPO).
AutoPP: Towards Automated Product Poster Generation and Optimization: This paper proposes AutoPP, the first pipeline to unify automated product poster generation with CTR-feedback-driven optimization in a single framework. It employs a unified design module to jointly design background, text, and layout; an element rendering module for efficient and controllable poster generation; and Isolated DPO (IDPO) to achieve element-level click-through rate optimization.
Behavior Tokens Speak Louder: Disentangled Explainable Recommendation with Behavior Vocabulary: This paper proposes BEAT, a framework that discretizes user/item behavior representations into interpretable behavior tokens via vector-quantized autoencoders, and aligns collaborative filtering signals to the semantic space of a frozen LLM through multi-level semantic supervision, enabling zero-shot explainable recommendation.
Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents: This paper proposes the HID framework, which constructs hybrid intents via attribute-aware spectral clustering to distinguish session-relevant from session-irrelevant tail items, and introduces a dual-constraint loss (ICLoss) targeting both long-tail coverage and recommendation accuracy. The framework achieves a "win-win" between long-tail promotion and accuracy, breaking the traditional seesaw dilemma where improving one metric inevitably harms the other.
CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search: This paper proposes CroPS, a data engine that enriches positive sample sets from three complementary perspectives—query reformulation behavior, recommender system interactions, and LLM world knowledge—combined with Hierarchical Label Assignment (HLA) and the H-InfoNCE loss function, to break the filter bubble effect in industrial-scale dense retrieval systems. CroPS has been fully deployed in Kuaishou Search.
Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios: This paper proposes PAS (Police Action Scenarios), an LLM evaluation framework for policing contexts. The framework comprises five stages: scenario definition, reference answer construction, LLM response generation, core metric extraction, and performance interpretation. An evaluation dataset is constructed from 8,000+ official Korean police documents. The study finds that commercial LLMs (GPT-4, Gemini, Claude) perform significantly below reference answers on policing tasks, particularly in factual accuracy and logical correctness.
FreqRec: Exploiting Inter-Session Information with Frequency-enhanced Dual-Path Networks for Sequential Recommendation: This paper proposes FreqRec, a dual-path architecture that applies frequency-domain transformations along the batch axis and the time axis to capture group-level consumption rhythms across sessions and fine-grained individual user interests, respectively. A frequency-domain consistency loss is introduced to explicitly align predicted and ground-truth frequency spectra. FreqRec achieves up to 7.38% improvement in NDCG@10 over the strongest baseline on three Amazon datasets.
From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization: This paper proposes GenCDR, a framework that introduces the generative semantic ID paradigm into LLM-driven cross-domain recommendation for the first time, via two core modules: domain-adaptive semantic tokenization and cross-domain autoregressive recommendation. GenCDR effectively addresses the non-transferability of item IDs and insufficient domain-personalized modeling in conventional approaches.
From Parameter to Representation: A Closed-Form Approach for Controllable Model Merging: This paper proposes ReACT, which shifts controllable model merging from parameter-space optimization to representation-space correction. By deriving a closed-form solution, ReACT enables instant generation of Pareto-optimal models under arbitrary user preferences, achieving 36–208× speedup over existing methods while delivering superior performance.
Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information: This paper proposes the first semi-supervised learning paradigm for matrix completion. Assuming that the sampling distribution $P$ and the ground-truth matrix $G$ share a low-rank subspace, and given a large number $M$ of unlabeled samples and a small number $N$ of labeled samples, it proves that the generalization error decomposes into two independent terms, $\tilde{O}(\sqrt{nd/M}) + \tilde{O}(\sqrt{dr/N})$. Experiments on the Douban and MovieLens datasets show significant improvements over explicit-feedback-only baselines.
HyMoERec: Hybrid Mixture-of-Experts for Sequential Recommendation: This paper proposes HyMoERec, a hybrid mixture-of-experts architecture combining shared and specialized expert branches. By replacing the conventional feed-forward network in sequential recommendation models with an adaptive expert fusion mechanism, the model captures heterogeneous user behavior patterns and diverse item complexities, consistently outperforming state-of-the-art methods on the MovieLens-1M and Beauty datasets.
Inductive Generative Recommendation via Retrieval-based Speculation: This paper identifies a critical limitation of Generative Recommendation (GR) models — their inability to recommend items unseen during training — and proposes SpecGR, a plug-and-play framework in which an inductively capable drafter model proposes candidate items (including new ones) while the GR model serves as a verifier to rank and validate candidates. A guided re-drafting mechanism further improves verification efficiency, achieving state-of-the-art overall performance across three datasets.
Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models: This paper reveals a non-trivial interaction between prompt selection and inference strategies (Best-of-N, Majority Voting), proposes the IAPO framework that jointly optimizes prompt design and inference scaling as a contextual best-arm identification problem, and introduces PSST—a fixed-budget training algorithm—achieving up to 50% improvement over inference-agnostic methods across 6 tasks.
Interpretable Reward Model via Sparse Autoencoder: This paper proposes SARM (Sparse Autoencoder-enhanced Reward Model), which integrates a pretrained sparse autoencoder into a reward model to map hidden-layer activations into an interpretable, sparse, monosemantic feature space. This design enables feature-level reward attribution and dynamic preference steering, while achieving the highest overall score among all models on RewardBench 2.
Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction: This paper proposes LAIN, a framework that injects sequence length as an explicit conditional signal into CTR models to mitigate performance imbalance between long- and short-sequence users. LAIN comprises three lightweight, plug-and-play modules: a Spectral Length Encoder, Length-Conditioned Prompting, and Length-Modulated Attention.
Moral Change or Noise? On Problems of Aligning AI With Temporally Unstable Human Feedback: Through a longitudinal study involving 400+ participants across 3–5 sessions in the domain of kidney transplant allocation, this paper reveals significant temporal instability in human moral preferences (6–20% response change rate) and demonstrates that such instability substantially degrades the predictive performance of AI alignment models, thereby challenging the validity of current alignment approaches that assume static preferences.
MultiTab: A Scalable Foundation for Multitask Learning on Tabular Data: This paper proposes MultiTab-Net — the first multitask Transformer architecture for tabular data — which alleviates task competition via a multitask masked attention mechanism, and substantially outperforms existing MLP-based multitask models and single-task Transformer models across datasets from recommendation, census, and physics domains.
Preference is More Than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback: This paper proposes IPEA-HF, a model-free Dueling Bandit framework based on augmented human feedback. It integrates contextual similarity and dependency relations through Augmented Confidence Bounds to calibrate uncertainty, achieving superior performance across multiple benchmarks including recommendation, multi-objective optimization, and LLM response optimization.
Probabilistic Hash Embeddings for Online Learning of Categorical Features: This paper proposes Probabilistic Hash Embeddings (PHE), which models hash embedding tables as random variables and performs posterior inference via Bayesian online learning. PHE addresses the catastrophic forgetting problem caused by parameter sharing in deterministic hash embeddings under streaming data settings. It significantly outperforms deterministic baselines across classification, sequential modeling, and recommender system tasks, while requiring only 2%–4% of the memory needed by collision-free embedding tables.
Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling: This paper proposes SSRA (Semi-Supervised Relevance-Aware synthetic data pipeline), a two-stage framework that generates domain-adaptive short video data with controllable fine-grained relevance labels (4 levels) to enhance the semantic relevance modeling capability of embedding models. Online A/B testing on Douyin's dual-column feed achieves a 1.45% CTR improvement.
SlideTailor: Personalized Presentation Slide Generation for Scientific Papers: This paper defines a new task of preference-guided paper-to-slide generation and proposes the SlideTailor framework, which distills content preferences from user-provided paper–slide example pairs and aesthetic preferences from .pptx templates. A chain-of-speech mechanism aligns slide content with intended spoken narratives. On the self-constructed PSP benchmark, SlideTailor achieves an overall score of 75.8% and a human-evaluation win rate of 81.63%, significantly outperforming existing methods.
Tokenize Once, Recommend Anywhere: Unified Item Tokenization for Multi-domain LLM-based Recommendation: This paper proposes UniTok, a unified item tokenization framework that employs a customized Mixture-of-Experts architecture (TokenMoE) combined with shared codebooks to achieve efficient discrete item representations across multiple domains, eliminating the need to train a separate tokenizer per domain, while maintaining cross-domain semantic balance through a mutual information calibration mechanism.
Tool4POI: A Tool-Augmented LLM Framework for Next POI Recommendation: This paper is the first to introduce the tool-augmented LLM paradigm to the next POI recommendation task. Through three modules—preference extraction, multi-round candidate retrieval, and reranking—the framework enables LLMs to retrieve recommendations from the full POI pool. It achieves over 40% accuracy in Out-of-History (OOH) scenarios (where existing methods yield 0%), with average Acc@5/10 improvements of 20%/30%.
TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning: This paper presents TraveLLaMA, a multimodal language model system for travel assistance. By constructing the TravelQA dataset with 265K QA pairs and the Travel-CoT structured reasoning framework, the system achieves a 10.8% accuracy improvement on travel-related question answering and obtains a SUS usability score of 82.5 in a 500-participant user study.
Wavelet Enhanced Adaptive Frequency Filter for Sequential Recommendation: This paper proposes WEARec, a model that employs Dynamic Frequency Filtering (DFF) to adaptively generate personalized frequency-domain filters conditioned on user context for capturing global preferences, and Wavelet Feature Enhancement (WFE) to compensate for the inability of global DFT to resolve short-term fluctuations. WEARec outperforms all 9 baselines on four datasets, achieving up to 11.4% improvement on long-sequence scenarios with 39–45% faster training speed.
When Top-ranked Recommendations Fail: Modeling Multi-Granular Negative Feedback for Explainable and Robust Video Recommendation: This paper proposes ENF (Explainable Negative Feedback), a framework comprising three collaborative MLLM Agents (Profile Agent, Video Agent, and Reason Agent) and a progressive S-GRPO reinforcement learning training strategy. ENF is the first approach to achieve explainable prediction and root-cause analysis of implicit negative feedback in video recommendation systems. Deployed on Tencent's news platform, it achieves a 6.2% increase in average watch duration and a 9.4% decrease in quick-skip rate.
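The FreqRec entry in this section describes frequency-domain transforms along two different axes of a batch of sequences. The sketch below shows only what "along the time axis" versus "along the batch axis" means for a plain discrete Fourier transform. The helper names and the scalar toy data are assumptions; FreqRec's learned filters, embeddings, and consistency loss are not reproduced here.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a 1-D list of numbers."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(signal))
        for f in range(n)
    ]


def dft_along_time(batch):
    """Transform each sequence separately: result[b][f]."""
    return [dft(row) for row in batch]


def dft_along_batch(batch):
    """Transform across sequences at each time step: result[f][t]."""
    columns = [list(col) for col in zip(*batch)]
    spectra = [dft(col) for col in columns]      # spectra[t][f]
    return [list(row) for row in zip(*spectra)]  # transpose to [f][t]


# Two toy sequences of length 4 (one scalar feature per step).
batch = [[1, 0, 1, 0],
         [0, 1, 0, 1]]
time_spec = dft_along_time(batch)    # per-user rhythm within a sequence
batch_spec = dft_along_batch(batch)  # patterns shared across sequences
```

The time-axis transform sees each sequence in isolation, while the batch-axis transform mixes information across sequences at the same position, which matches the individual-level versus group-level split the summary describes.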

📐 Optimization & Theory¶

A Distributed Asynchronous Generalized Momentum Algorithm Without Delay Bounds: This paper proposes a totally asynchronous Generalized Momentum (GM) distributed optimization algorithm that guarantees linear convergence without assuming any upper bound on communication or computation delays. On a Fashion-MNIST classification task, the proposed method requires 71% fewer iterations than gradient descent, 41% fewer than Heavy Ball, and 19% fewer than Nesterov accelerated gradient.
A Unified Convergence Analysis for Semi-Decentralized Learning: Sampled-to-Sampled vs. Sampled-to-All Communication: This paper presents a unified convergence analysis framework to systematically compare, for the first time, two server-to-device communication primitives in semi-decentralized federated learning — S2S (returning the aggregated model only to sampled devices) and S2A (broadcasting to all devices). The analysis reveals distinct regimes in which S2S is superior under high inter-component heterogeneity and S2A is superior under low heterogeneity, and provides practical guidelines for system configuration.
BeeRNA: Tertiary Structure-Based RNA Inverse Folding Using Artificial Bee Colony: This paper proposes BeeRNA, which applies the Artificial Bee Colony (ABC) optimization algorithm to the RNA tertiary structure inverse folding problem. Through a two-stage fitness evaluation combining base-pair distance pre-screening and RMSD scoring, BeeRNA outperforms deep learning methods gRNAde and RiboDiffusion on short-to-medium-length RNAs (<100 nt).
Beyond the Mean: Fisher-Orthogonal Projection for Natural Gradient Descent in Large Batch Training: This paper proposes Fisher-Orthogonal Projection (FOP), which supplements variance information by orthogonally projecting sub-batch gradient differences under the Fisher metric, enabling the second-order optimizer KFAC to remain effective in ultra-large batch training and achieving up to a 7.5× speedup.
Bridging Synthetic and Real Routing Problems via LLM-Guided Instance Generation and Progressive Adaptation: This paper proposes EvoReal, a framework that employs LLM-driven evolutionary search to generate synthetic VRP instances structurally aligned with real-world distributions, and then adapts pretrained neural solvers to real benchmarks via a two-stage progressive fine-tuning strategy. EvoReal substantially outperforms existing neural solvers on TSPLib (1.05% gap) and CVRPLib (2.71% gap).
Co-Layout: LLM-driven Co-optimization for Interior Layout: This paper proposes Co-Layout, a framework that leverages LLMs to extract structured constraints from natural language descriptions, then jointly optimizes room layout and furniture placement via a grid-based integer programming (IP) formulation augmented with a coarse-to-fine solving strategy, substantially outperforming existing two-stage approaches.
Convex Clustering Redefined: Robust Learning with the Median of Means Estimator: This paper proposes COMET (Convex Clustering with Median of Means Estimator), which integrates the Median of Means (MoM) estimator into the convex clustering framework. By combining random binning and median aggregation with truncated distance regularization and ADAM optimization, COMET achieves robust clustering under noise and outliers without requiring prior knowledge of the number of clusters $k$. Weak consistency is established theoretically, and experiments on synthetic and multiple real-world datasets demonstrate substantial improvements over six baselines, including k-means, MoM k-means, and convex clustering.
Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment: This work provides the first theoretical analysis of the minimum cost required to steer an LLM policy toward an attacker's target by flipping preference labels during RLHF/DPO alignment. The problem is formalized as a convex optimization problem, upper and lower bounds on the cost are derived, and a post-processing method called PCM (Poisoning Cost Minimization) is proposed to substantially reduce the number of label flips while preserving the poisoning effect.
Data Heterogeneity and Forgotten Labels in Split Federated Learning: This paper systematically investigates catastrophic forgetting (CF) caused by data heterogeneity in Split Federated Learning — with particular focus on intra-round forgetting induced by the server-side processing order — and proposes Hydra, a multi-head method that partitions and trains the final layers of part-2 in groups before aggregation, significantly reducing the performance gap (PG) across labels by up to 75.4%.
ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions: This paper proposes ECPv2, which introduces three innovations—adaptive lower bound, Worst-$m$ memory, and fixed random projection—to reduce the per-run complexity of Lipschitz global optimization from $\Omega(n^2 d)$ to $\Omega(n(m+d)\log n)$, while maintaining an $O(n^{-1/d})$ regret convergence rate that matches the minimax lower bound.
Efficient and Reliable Hitting-Set Computations for the Implicit Hitting Set Approach: To address numerical instability arising from the reliance on commercial IP solvers in the hitting-set component of the implicit hitting set (IHS) framework, this paper proposes alternative approaches based on pseudo-Boolean (PB) reasoning and stochastic local search (SLS), as well as hybrid strategies. The work realizes the first certifiable IHS computation and demonstrates effective trade-offs between efficiency and reliability across 1,786 benchmark instances.
FedP²EFT: Federated Learning to Personalize PEFT for Multilingual LLMs: This paper proposes FedP²EFT, which collaboratively trains a Personalization Strategy Generator (PSG) via federated learning to automatically generate personalized LoRA rank structures for each client, substantially outperforming manually designed PEFT configurations and existing FL personalization methods in multilingual LLM fine-tuning.
FedPM: Federated Learning Using Second-order Optimization with Preconditioned Mixing of Local Parameters: This paper proposes FedPM (Federated Preconditioned Mixing), a novel federated learning method that replaces the conventional simple parameter averaging on the server side with "preconditioned mixing," addressing the local preconditioner drift problem inherent in existing second-order federated optimization methods. FedPM is theoretically shown to achieve superlinear convergence for strongly convex objectives and empirically outperforms existing methods under heterogeneous data settings.
GHOST: Solving the Traveling Salesman Problem on Graphs of Convex Sets: This paper proposes GHOST, a hierarchical optimal search algorithm for solving the Traveling Salesman Problem on Graphs of Convex Sets (GCS). By combining combinatorial path search with convex trajectory optimization, and employing a novel abstract path expansion algorithm to compute admissible lower bounds that guide best-first search, GHOST achieves optimality guarantees while outperforming the monolithic mixed-integer convex programming (MICP) baseline by orders of magnitude in runtime.
Instance Generation for Meta-Black-Box Optimization through Latent Space Reverse Engineering: This paper proposes the LSRE framework, which constructs a two-dimensional latent instance space for BBO problems via an autoencoder, and employs enhanced genetic programming to reverse-engineer diverse synthetic optimization problem instances from this space, forming the Diverse-BBO benchmark that substantially improves the generalization performance of MetaBBO methods.
MOTIF: Multi-strategy Optimization via Turn-based Interactive Framework: This paper proposes MOTIF, a framework that models solver design as a multi-strategy optimization problem. Through a turn-based competitive mechanism involving two LLM agents guided by Monte Carlo Tree Search (MCTS), MOTIF jointly optimizes multiple interdependent algorithmic components within combinatorial optimization solvers, consistently outperforming existing methods across TSP, CVRP, BPP, and other combinatorial optimization domains.
On the Learning Dynamics of Two-Layer Linear Networks with Label Noise SGD: This paper theoretically analyzes the learning dynamics of label noise SGD on two-layer overparameterized linear networks, revealing a two-phase behavior: in Phase I, weight norms progressively diminish, enabling the model to escape the lazy regime and enter the rich regime; in Phase II, weights align with the ground-truth interpolator and converge. The theory is further extended to the SAM optimizer.
Parametrized Multi-Agent Routing via Deep Attention Models: This paper proposes the Deep FLPO framework, which integrates the algebraic structure of the Maximum Entropy Principle (MEP) with a permutation-invariant encoder-decoder neural network (SPN) to address the NP-hard mixed-integer problem of joint facility location and routing optimization. The framework achieves a 100× speedup in policy inference and matches Gurobi's exact solutions while running 1500× faster.
Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization: This paper proposes MPaGE, a framework that integrates LLMs with a Pareto Front Grid mechanism and semantic clustering to automatically generate heuristics for multi-objective combinatorial optimization problems (MOCOPs), jointly optimizing solution quality and runtime efficiency. MPaGE achieves significant improvements in HV and IGD over baselines such as EoH and MEoH on Bi-TSP, Tri-TSP, Bi-CVRP, and Bi-KP benchmarks.
PEOAT: Personalization-Guided Evolutionary Question Assembly for One-Shot Adaptive Testing: This paper is the first to propose the One-shot Adaptive Testing (OAT) task, formulating it as a combinatorial optimization problem, and introduces the PEOAT framework—combining personalized initialization, cognitively enhanced evolutionary search, and diversity-preserving selection—to select an optimal question set for each examinee in a single shot without interactive feedback, substantially outperforming conventional CAT methods.
pFed1BS: Personalized Federated Learning with Bidirectional Communication Compression via One-Bit Random Sketching: This paper proposes pFed1BS, a framework that achieves extreme bidirectional communication compression (>99% reduction) in federated learning via one-bit random sketching, while introducing a sign-based regularizer for client model personalization. The framework simultaneously addresses communication bottlenecks and data heterogeneity in non-IID settings.
SMoFi: Step-wise Momentum Fusion for Split Federated Learning on Heterogeneous Data: This paper proposes SMoFi, a framework that synchronizes the momentum buffers of surrogate models on the server side at every SGD step within Split FL, effectively mitigating gradient divergence caused by non-IID data. SMoFi achieves up to 7.1% accuracy improvement and up to 10.25× convergence speedup.
Tackling Resource-Constrained and Data-Heterogeneity in Federated Learning with Double-Weight Sparse Pack: This paper proposes FedCSPACK, a personalized federated learning method based on cosine-similarity-guided sparse parameter packing and double-weight aggregation. By performing parameter selection and sharing at the pack level, FedCSPACK simultaneously addresses data heterogeneity and client resource constraints, achieving 2–5× faster training, up to 96% reduction in communication overhead, and a 3.34% improvement in model accuracy.
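The COMET notes in this section rely on the Median of Means (MoM) estimator, which combines random binning with median aggregation. A minimal sketch of the plain scalar estimator follows. The function name, the seed parameter, and the toy data are assumptions, and the sketch does not include COMET's convex clustering objective.

```python
import random
import statistics


def median_of_means(values, num_bins, seed=0):
    """Scalar Median of Means: shuffle, split into bins, take the median of bin means."""
    if not 1 <= num_bins <= len(values):
        raise ValueError("num_bins must be between 1 and len(values)")
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)                                      # random binning
    bins = [shuffled[i::num_bins] for i in range(num_bins)]    # near-equal sizes
    bin_means = [statistics.fmean(b) for b in bins]
    return statistics.median(bin_means)                        # median aggregation


# Nineteen clean observations plus one gross outlier.
data = [1.0] * 19 + [1000.0]
plain_mean = statistics.fmean(data)
robust_estimate = median_of_means(data, num_bins=5)
```

The single outlier can contaminate only one of the five bins, so the median of the bin means ignores it, whereas the ordinary mean is pulled far from the typical value.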

⚖️ Alignment & RLHF¶

Align to Structure: Aligning Large Language Models with Structural Information: This paper proposes Structural Alignment, a method that integrates linguistic discourse structure frameworks—surface-level text structure scoring and an RST-based discourse motif classifier—into PPO reinforcement learning training, and introduces a discourse motif-based dense reward mechanism. This enables LLMs to generate more coherent, human-like long-form text, outperforming standard RLHF models on academic essay writing and long document summarization tasks.
AlignTree: Efficient Defense Against LLM Jailbreak Attacks: AlignTree leverages internal LLM activation features — combining linear refusal directions with nonlinear SVM signals — to train a lightweight random forest classifier that efficiently detects jailbreak attacks with negligible computational overhead, achieving state-of-the-art reductions in attack success rate (ASR).
AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment: This paper proposes AMaPO, an algorithm that dynamically modulates gradient magnitudes via instance-level adaptive margins (combining Z-normalization and exponential scaling) to address the core overfitting-underfitting dilemma in offline preference optimization methods such as DPO, thereby substantially improving ranking accuracy and downstream alignment performance.
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models: This paper reveals that ethical biases introduced by LLM safety alignment can be reverse-exploited as jailbreak attack vectors — marginalized-group keywords yield jailbreak success rates up to 20% higher than privileged-group keywords — and proposes BiasDefense, a lightweight prompt-based defense method.
DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF: DeCoRL transforms CoT reasoning from monolithic sequential processing into an "orchestra-style" modular parallel collaboration — nine specialized sub-models (parsing / semantic / entity / fact-checking / style / quality / computation / verification / integration) generate reasoning sub-steps in parallel, coordinated via dual reward attribution (local quality + contribution) and cascaded DRPO optimization, achieving 80.8% on RM-Bench (surpassing all baselines), a 3.8× inference speedup, and a 22.7% improvement in interpretability.
Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment: This work deconstructs the internal representations of LLM safety alignment from the conventional "single refusal direction" into two functionally independent directions — a harm detection direction and a refusal execution direction — and proposes the DBDI framework, which applies adaptive projection elimination and direct steering to intervene on each direction separately, achieving a 97.88% attack success rate (ASR) on Llama-2.
EASE: Practical and Efficient Safety Alignment for Small Language Models: This paper proposes EASE, a safety alignment framework for edge-deployed small language models (SLMs), which addresses the tension between "shallow refusal being insufficiently robust" and "deep reasoning being prohibitively expensive" via a two-stage design. Stage one distills safety reasoning capabilities from a large reasoning model into the SLM; stage two applies selective reasoning activation, enabling reasoning only for adversarial queries in vulnerable semantic regions while responding directly to benign queries. EASE reduces jailbreak attack success rate by 17% compared to shallow alignment, while cutting reasoning overhead by 90% compared to full-reasoning alignment.
Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal States: This paper proposes EAGLE, a method that estimates uncertainty by aggregating logits from multiple intermediate hidden layers of an LLM and computing the expectation of the resulting confidence distribution. EAGLE requires no additional trainable parameters and reduces ECE from 12.6% to 3.2% while improving AUROC from 59.0% to 61.6% across multiple datasets and models.
EPO: Diverse and Realistic Protein Ensemble Generation via Energy Preference Optimization: This paper proposes EPO (Energy Preference Optimization), which combines reverse SDE sampling with listwise energy-ranked preference optimization to align a pretrained protein generator with the target Boltzmann distribution using only energy signals. EPO achieves state-of-the-art performance across 9 metrics on three benchmarks (Tetrapeptides, ATLAS, and Fast-Folding), entirely eliminating the need for expensive molecular dynamics (MD) simulations.
Exploring the Effects of Alignment on Numerical Bias in Large Language Models: This paper systematically demonstrates that the LLM alignment process (instruction tuning + preference tuning) is the root cause of numerical bias in LLM evaluators, and validates that score range adjustment is the most effective mitigation strategy.
GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning: This paper proposes GRAM-R², a generative foundation reward model that elicits reward reasoning capabilities on unlabeled data via self-training. The model simultaneously produces preference labels and reasoning rationales, consistently outperforming both discriminative and generative baselines across multiple downstream tasks including response ranking, task adaptation, and RLHF.
Importance-Aware Data Selection for Efficient LLM Instruction Tuning: This paper proposes MIWV (Model Instruction Weakness Value), a metric that measures the importance of each instruction sample for improving model capability by comparing LLM loss with and without a one-shot ICL demonstration. Fine-tuning on only 1% (520 samples) of the Alpaca dataset selected by MIWV outperforms fine-tuning on the full 52,002 samples.
Margin-aware Preference Optimization for Aligning Diffusion Models without Reference: This paper proposes MaPO (Margin-aware Preference Optimization), a reference-free preference alignment method that aligns T2I diffusion models by directly optimizing the likelihood margin between preferred and dispreferred outputs under the Bradley-Terry model. MaPO outperforms DPO and task-specific methods across 5 domains, including style adaptation, safety generation, and general preference alignment.
MetaGDPO: Alleviating Catastrophic Forgetting with Metacognitive Knowledge through Group Direct Preference Optimization: This paper proposes MetaGDPO, which addresses catastrophic forgetting in reasoning distillation for small models (<8B) from two complementary perspectives: (1) the data side, constructing a 5K dataset (MetaKL) based on metacognitive knowledge annotation; and (2) the training side, introducing GDPO—a DPO variant that replaces GRPO's online sampling with offline response groups generated by a large model.
On the Exponential Convergence for Offline RLHF with Pairwise Comparisons: Under the offline RLHF pairwise comparison setting, this paper proposes the RL-LOW algorithm achieving exponential convergence $\exp(-\Omega(n/H))$ for simple regret, and derives the first instance-dependent lower bound proving this rate is optimal in the exponential sense.
Reducing the Scope of Language Models: This paper systematically evaluates LLM "scoping" methods, which restrict deployed LLMs to respond only to in-domain queries while rejecting all out-of-domain requests. Five approaches (prompting / SFT / DPO / probing / Circuit Breakers) are compared across 3 model families and multiple tasks. The key findings are that SFT performs best under high data diversity, Circuit Breakers (CB) excel under low diversity, and a hierarchical combination (SFT→CB) preserves the strengths of both. Overall, the effectiveness of scoping depends strongly on training data diversity.
Rethinking Direct Preference Optimization in Diffusion Models: Two orthogonal and plug-and-play improvement strategies are proposed to enhance preference optimization in diffusion models: stable reference model updating (relaxing the frozen constraint with a regularization anchor) and timestep-aware training (adaptive weighting to balance reward scales across timesteps). Both strategies can be embedded into various preference optimization algorithms such as DPO and IPO, achieving state-of-the-art performance on human preference evaluation benchmarks.
SafeNlidb: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces: This paper proposes SafeNlidb, a framework that jointly optimizes safety reasoning and SQL generation in LLM-driven Natural Language Interfaces to Databases (NLIDBs) through a safety-aware data synthesis pipeline and an alternating preference optimization strategy, effectively defending against privacy leakage under implicit inference attacks.
W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search: This paper proposes W2S-AlignTree, the first inference-time alignment framework that integrates Monte Carlo Tree Search (MCTS) with the weak-to-strong generalization (W2SG) paradigm. It leverages step-level proxy value functions derived from a weak model to guide the generation of a strong model at inference time, achieving significant improvements over baselines across sentiment control, summarization, and instruction-following tasks — with a 15.9% gain on the Llama3-8B summarization task.
When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF: To address the pervasive "preference flipping" problem in human preference annotation, this paper proposes FA-DPO (Flipping-Aware DPO), which models the annotation process as a two-stage procedure consisting of "true human intent + instance-dependent flipping probability." By correcting the BT model loss and iteratively optimizing a flipping estimation module, FA-DPO substantially improves alignment robustness under various noise conditions, achieving up to a 16.7% gain over DPO when instance-dependent flipping rates are high.
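
As a rough illustration of the flipping-aware idea in the FA-DPO entry above, the sketch below applies a flip probability to the Bradley-Terry likelihood that DPO optimizes. It is a minimal sketch under stated assumptions, not the paper's loss: the function names, the `beta` default, and the fixed scalar `flip_prob` are illustrative (FA-DPO estimates an instance-dependent flipping probability with a learned module).

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_margin(policy_logp_chosen, policy_logp_rejected,
               ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO margin: scaled difference of policy-vs-reference log-ratios.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return beta * (chosen_ratio - rejected_ratio)


def flip_aware_loss(margin, flip_prob):
    # Assumption: the observed label is the true preference flipped with
    # probability flip_prob, so the observed likelihood mixes both outcomes.
    p_true = sigmoid(margin)
    p_observed = (1.0 - flip_prob) * p_true + flip_prob * (1.0 - p_true)
    return -math.log(p_observed)
```

With `flip_prob = 0` this reduces to the usual DPO loss `-log(sigmoid(margin))`; with `flip_prob = 0.5` the label carries no information and the loss is `log 2` for any margin.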

🎯 Object Detection

AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios: This paper introduces AerialMind, the first large-scale Referring Multi-Object Tracking (RMOT) benchmark dataset for UAV scenarios, and proposes HawkEyeTrack (HETrack), a method that achieves language-guided multi-object tracking in aerial UAV scenes via a co-evolutionary fusion encoder and a scale-adaptive contextual refinement module.
An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice: This paper proposes a real-time overall mechanism for rice quality evaluation, integrating three modules: an improved YOLO-v5 (variety detection), an improved ConvNeXt-Tiny (intactness grading), and K-means (chalkiness region quantification). The system achieves 99.14% mAP and 97.89% detection accuracy on a self-constructed dataset of 20,000 images spanning six rice varieties.
Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection: This paper proposes a framework leveraging VFMs (DINOv2 + Grounding DINO) to enhance Source-Free Object Detection (SFOD) via three modules: Patch-weighted Global Feature Alignment (PGFA), Prototype-based Instance Feature Alignment (PIFA), and Dual-source Enhanced Pseudo-label Fusion (DEPF). The method achieves state-of-the-art results on 6 cross-domain detection benchmarks, e.g., 47.1% mAP on Cityscapes→Foggy Cityscapes (+3.5% over DRU) and 67.4% AP on Sim10k→Cityscapes (+8.7% over DRU).
Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning: This paper proposes GroundingAgent, a visual grounding framework that requires no task-specific fine-tuning. By composing pretrained open-vocabulary detectors (YOLO World), an MLLM (Llama-3.2-11B-Vision), and an LLM (DeepSeek-V3) into a structured iterative reasoning pipeline, the method achieves a zero-shot average accuracy of 65.1% on RefCOCO/+/g, substantially outperforming prior zero-shot approaches.
LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers: This paper proposes LampQ, a metric-based layer-wise mixed precision quantization method that measures the quantization sensitivity of each ViT layer via a type-aware Fisher information metric, combines integer linear programming to optimize bit-width allocation, and iteratively refines the allocation. LampQ achieves state-of-the-art performance across image classification, object detection, and zero-shot quantization tasks.
MonoCLUE: Object-Aware Clustering Enhances Monocular 3D Object Detection: This paper proposes MonoCLUE, which leverages local clustering to extract object-level visual patterns (e.g., hood, roof) and generalized scene memory to aggregate consistent appearance features across images, enhancing detection of occluded and truncated objects in monocular 3D detection. MonoCLUE achieves state-of-the-art performance on the KITTI benchmark without relying on additional depth or LiDAR information.
Real-Time 3D Object Detection with Inference-Aligned Learning: This paper proposes SR3D, a framework that bridges the training-inference gap in indoor dense 3D object detection via two training-phase components: Spatial-Priority Optimal Transport Assignment (SPOTA) and Ranking-Aware adaptive Self-distillation (RAS). SR3D achieves state-of-the-art performance among dense detectors on ScanNet V2 and SUN RGB-D at a real-time speed of 42ms.
REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion: This paper extends the 2D bounding box diffusion paradigm of DiffusionDet to 3D radar space, proposing the REXO framework. It enables explicit cross-view radar feature association guided by noisy 3D bounding box projections, and introduces a ground-level constraint to reduce the diffusion parameter space. REXO surpasses the state of the art by +4.22 AP and +11.02 AP on the HIBER and MMVR indoor radar datasets, respectively.
SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation: This paper proposes SAGA, a training-free method that learns prompt-aligned Gaussian distributions to improve semantic alignment in text-to-image generation models. Supporting both text and spatial conditioning, SAGA achieves substantial alignment gains on SD 1.4 and SD 3 (TIAM-3 improves from 8.4% to 50.7%).
SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements: This paper proposes SimROD, an extremely lightweight (only 0.003M parameters) RAW image object detection method that surpasses complex state-of-the-art approaches on multiple RAW detection benchmarks through global Gamma enhancement (4 learnable parameters) and green channel-guided local enhancement.
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection: SM3Det introduces the M2Det task for remote sensing (multi-modal datasets + multi-task object detection), employing a grid-level sparse MoE backbone and a Dynamic Sub-module Optimization (DSO) mechanism to handle SAR/optical/infrared modalities with both horizontal and oriented bounding box detection in a single unified model, substantially outperforming three independently trained modality-specific models combined.
T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection: This paper proposes T-Rex-Omni, the first framework to systematically incorporate negative visual prompts into open-set object detection. Through a training-free NNC module and an NNH loss, it substantially narrows the performance gap between visual-prompt and text-prompt detection methods, with particularly strong results in long-tail scenarios (LVIS-minival APr of 51.2).
Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection: This paper proposes an object-aware temporal modeling framework that achieves cross-frame temporal consistency through selective propagation of high-confidence detection features. Combined with a pretrained vision-language encoder (OWL-ViT) and a few-shot detection head, the method achieves an average improvement of 3.7%–5.3% AP across four video few-shot detection benchmarks.
TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding: This paper proposes TubeRMC, a framework that generates text-conditioned candidate tubes and performs tube-conditioned reconstruction along temporal, spatial, and spatio-temporal dimensions, augmented by spatial-temporal mutual constraints to improve weakly-supervised spatio-temporal video grounding.
VK-Det: Visual Knowledge Guided Prototype Learning for Open-Vocabulary Aerial Object Detection: This paper proposes VK-Det, a framework that leverages only the visual knowledge of VLMs, without any additional supervision signals, for open-vocabulary aerial object detection. It combines Adaptive Selection Knowledge Distillation (ASKD), Prototype-Aware Pseudo-Label generation (PAPL), and Synthetic Matching Inference (SMI), and achieves state-of-the-art performance, even surpassing methods that rely on extra supervision.
When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking: This paper proposes MFT25, a large-scale underwater multiple fish tracking dataset (15 sequences, 408K annotations), and SU-T, a tracking framework combining UKF with FishIoU, achieving state-of-the-art performance of 34.1 HOTA and 44.6 IDF1. Statistical analyses further reveal fundamental differences between fish tracking and terrestrial object tracking.
YOLO-IOD: Towards Real Time Incremental Object Detection: This work is the first to systematically integrate incremental object detection (IOD) into the YOLO real-time framework. It identifies three types of knowledge conflict, proposes a three-module solution (CPR + IKS + CAKD), and introduces the more realistic LoCo COCO benchmark for evaluation.
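
The rice quality entry above uses K-means to quantify chalkiness regions. The sketch below shows one way such a step could look: a two-cluster K-means over grayscale pixel intensities, with the brighter cluster treated as chalky. The function names, the grayscale input, and the "brighter cluster is chalky" rule are assumptions for illustration and are not taken from the paper.

```python
def nearest(value, centers):
    # Index of the center closest to value (first one wins on ties).
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))


def kmeans_1d(values, k=2, iters=50):
    # Plain Lloyd iterations on scalars; centers start evenly spread (k >= 2).
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v, centers)].append(v)
        # Keep the old center if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers


def chalkiness_ratio(pixels):
    # Fraction of pixels assigned to the brighter of two intensity clusters.
    centers = kmeans_1d(pixels, k=2)
    bright = max(range(2), key=lambda j: centers[j])
    n_bright = sum(1 for v in pixels if nearest(v, centers) == bright)
    return n_bright / len(pixels)
```

For example, `chalkiness_ratio([10, 12, 11, 9, 200, 210, 205, 10])` assigns the three bright pixels to one cluster and returns `0.375`.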

🧑 Human Understanding

AHAN: Asymmetric Hierarchical Attention Network for Identical Twin Face Verification: To address the extreme fine-grained recognition challenge of identical twin face verification, this paper proposes AHAN, a multi-stream architecture that performs multi-scale analysis of semantic facial regions via Hierarchical Cross-Attention (HCA), captures left-right facial asymmetry signatures through a Facial Asymmetry Attention Module (FAAM), and incorporates Twin-Aware Pair-Wise Cross-Attention (TA-PWCA) as a training regularizer. On the ND_TWIN dataset, AHAN improves twin verification accuracy from 88.9% to 92.3% (+3.4%).
CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning: This paper presents the first approach to leverage CLIP-extracted fine-grained facial semantic attribute embeddings for Face Template Inversion (FTI). A cross-modal feature interaction network fuses leaked templates with attribute embeddings and projects them into the StyleGAN latent space, synthesizing identity-consistent face images with richer attribute details. The method surpasses state-of-the-art in recognition accuracy, attribute similarity, and cross-model attack transferability.
CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation: This paper proposes CoordAR, which formulates 3D-3D correspondence estimation in single-reference-view 6D pose estimation as an autoregressive generation problem over discrete tokens. Through coordinate map tokenization, modality-decoupled encoding, and an autoregressive Transformer decoder, CoordAR substantially outperforms existing single-view methods on multiple benchmarks and demonstrates strong robustness to challenging scenarios such as symmetry and occlusion.
Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis: This paper proposes Facial-R1, a three-stage alignment training framework (SFT → RL → Data Synthesis) that aligns the reasoning process of VLMs with emotion recognition outcomes by treating AU and emotion labels as verifiable reward signals. The framework achieves state-of-the-art performance on 8 benchmarks and introduces the FEA-20K dataset.
Generating Attribute-Aware Human Motions from Textual Prompt: This paper proposes AttrMoGen, a framework that decouples action semantics from human attributes (age, gender, etc.) via a Structural Causal Model (SCM)-based Causal Information Bottleneck, enabling attribute-aware human motion generation from text prompts. The authors also introduce HumanAttr, the first large-scale text-motion dataset with extensive attribute annotations.
Improving Sparse IMU-based Motion Capture with Motion Label Smoothing: This paper proposes Motion Label Smoothing, adapting classical label smoothing from classification tasks to sparse IMU-based motion capture. By incorporating skeleton-structure-aware Perlin noise as smoothed labels, the method improves accuracy across three state-of-the-art methods on four datasets in a plug-and-play manner without modifying model architectures. GlobalPose achieves a 20.41% reduction in SIP error on TotalCapture.
mmPred: Radar-based Human Motion Prediction in the Dark: This work is the first to introduce millimeter-wave radar into human motion prediction (HMP). It proposes mmPred, a diffusion-based framework that employs dual-domain historical motion representations (time-domain pose refinement, TPR, and frequency-domain dominant motion, FDM) and a Global Skeleton Transformer (GST) to suppress radar-specific noise and temporal inconsistency. mmPred surpasses SOTA methods by 8.6% and 22% on the mmBody and mm-Fi datasets, respectively.
Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification: To address the core problem of unreliable cross-modality associations in unsupervised visible-infrared person re-identification (USVI-ReID), this paper proposes modality-aware Jaccard distance correction and a "split-and-contrast" invariance learning strategy. By eliminating modality bias, the method enables reliable global cross-modality clustering and feature alignment, achieving state-of-the-art performance on SYSU-MM01 and RegDB.
MVGD-Net: A Novel Motion-aware Video Glass Surface Detection Network: Grounded in the physical observation that objects in reflection/transmission layers move at different velocities than those in non-glass regions, this paper proposes MVGD-Net, which leverages optical flow motion cues to guide glass surface detection in videos. The framework comprises four core modules: Cross-scale Multimodal Fusion Module (CMFM), History-Guided Attention Module (HGAM), Temporal Cross-Attention Module (TCAM), and Temporal-Spatial Decoder (TSD). A large-scale dataset, MVGD-D, containing 312 videos and 19,268 frames is also introduced.
New Synthetic Goldmine: Hand Joint Angle-Driven EMG Data Generation Framework for Micro-Gesture Recognition: This paper proposes SeqEMG-GAN, a conditional adversarial generation framework driven by hand joint angle sequences. Through the joint design of an angle encoder, a two-level context encoder (featuring the novel Ang2Gist unit), a deep convolutional generator, and a multi-view discriminator, the framework synthesizes high-fidelity EMG signals from joint kinematic trajectories, enabling zero-shot generation for unseen gestures. Mixing synthetic and real data for training improves classification accuracy from 57.77% to 60.53%.
PressTrack-HMR: Pressure-Based Top-Down Multi-Person Global Human Mesh Recovery: This paper proposes PressTrack-HMR, the first top-down pipeline for multi-person global human mesh recovery using only pressure signals. It introduces a novel UoE similarity metric for pressure footprint tracking (93.6% MOTA) and presents MIP, the first multi-person interaction pressure dataset.
Robust Long-term Test-Time Adaptation for 3D Human Pose Estimation through Motion Discretization: To address error accumulation in online test-time adaptation (TTA) for 3D human pose estimation, this paper proposes a framework combining motion discretization (an anchor motion set obtained via unsupervised clustering), a self-replay mechanism, and a soft reset strategy. The approach enables robust long-term continuous adaptation by leveraging subject-specific body shape and habitual motion patterns, outperforming all existing online TTA methods on Ego-Exo4D and 3DPW.
SOSControl: Enhancing Human Motion Generation through Saliency-Aware Symbolic Orientation and Timing Control: This paper proposes the Salient Orientation Symbolic (SOS) script — a programmable symbolic motion representation framework inspired by Labanotation — that extracts keyframe saliency via temporally-constrained agglomerative clustering, and introduces an SMS data augmentation strategy along with a gradient-optimization-based SOSControl framework for precise control over body-part orientation and motion timing. On HumanML3D, the method achieves an SOS-Acc of 0.988 with an FID of only 3.892.
Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction: This paper proposes ST-MoE, the first framework to combine Mixture of Experts (MoE) with bidirectional spatiotemporal Mamba for multi-person motion prediction. Four heterogeneous spatiotemporal experts flexibly capture complex spatiotemporal dependencies, achieving state-of-the-art accuracy while reducing parameter count by 41.38% and accelerating training by 3.6×.
Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion: This paper proposes a streaming co-speech gesture generation framework based on Rolling Diffusion, which converts arbitrary diffusion models into streaming gesture generators via a structured progressive noise schedule. It further introduces Rolling Diffusion Ladder Acceleration (RDLA) to achieve up to 4× speedup (200 FPS), comprehensively outperforming baselines on the ZEGGS and BEAT benchmarks.
VPHO: Joint Visual-Physical Cue Learning and Aggregation for Hand-Object Pose Estimation: This paper proposes VPHO, a framework for hand-object pose estimation that jointly leverages visual and physical cues. It introduces a force prediction module to learn 3D physical cues and designs a two-stage candidate pose aggregation strategy (visual-guided + physics-guided) to achieve physical plausibility while preserving visual consistency. VPHO attains state-of-the-art pose accuracy and physical plausibility on both the DexYCB and HO3D benchmarks.
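
The Motion Label Smoothing entry above adapts classical label smoothing from classification. For reference, the classical form mixes a one-hot target with a uniform distribution over the classes. The sketch below covers only that classical baseline; the paper's skeleton-structure-aware Perlin noise variant is not reproduced, and the function name and `epsilon` default are illustrative.

```python
def smooth_labels(one_hot, epsilon=0.1):
    # Classical label smoothing: (1 - epsilon) * target + epsilon / K,
    # where K is the number of classes.
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]
```

The smoothed target still sums to 1, so it remains a valid distribution for a cross-entropy loss.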

🔄 Self-Supervised Learning

BCE3S: Binary Cross-Entropy Based Tripartite Synergistic Learning for Long-tailed Recognition: This paper proposes BCE3S, a binary cross-entropy (BCE)-based tripartite synergistic learning framework that integrates BCE-based joint learning, BCE-based contrastive learning, and BCE-based classifier uniformity learning. By decoupling per-class logits via Sigmoid, it suppresses the imbalance effects inherent to long-tailed distributions, achieving state-of-the-art performance on CIFAR10/100-LT, ImageNet-LT, and iNaturalist2018.
Explanation-Preserving Augmentation for Semi-Supervised Graph Representation Learning: This paper proposes EPA-GRL (Explanation-Preserving Augmentation for Graph Representation Learning), which employs a GNN explainer trained with a small number of labels to identify semantic subgraphs (explanation subgraphs). During augmentation, only the non-semantic portions (marginal subgraphs) are perturbed, achieving semantics-preserving graph augmentation. EPA-GRL significantly outperforms semantics-agnostic random augmentation methods across 6 benchmarks.
FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Clients: This paper proposes FedGRPO, which reformulates foundation model optimization as a reward-based evaluation process. Through competence-aware expert selection and federated group-relative policy optimization (transmitting only scalar reward signals), FedGRPO achieves privacy-preserving, communication-efficient federated foundation model optimization, approaching or surpassing centralized GRPO on mathematical reasoning and question-answering tasks.
FineXtrol: Controllable Motion Generation via Fine-Grained Text: This paper proposes FineXtrol, a framework that leverages temporally annotated, fine-grained body-part text descriptions as control signals. By combining a dual-branch ControlNet architecture with hierarchical contrastive learning to enhance the discriminability of the text encoder, FineXtrol achieves efficient, user-friendly, and precise controllable human motion generation, significantly outperforming existing methods on multi-body-part control benchmarks on HumanML3D.
From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models without Finetuning: This paper proposes Transferable Video Attack (TVA), which generates adversarial perturbations solely by exploiting the embedding space of open-source Video Foundation Models (VFMs), without any knowledge of downstream tasks, and effectively attacks downstream models and multimodal LLMs across 24 video tasks.
HiLoMix: Robust High- and Low-Frequency Graph Learning Framework for Mixing Address Association: This paper proposes HiLoMix, a robust graph learning framework for the mixing address association task. It addresses three core challenges—graph sparsity, label scarcity, and label noise—through a Heterogeneous Attribute Mixing Interaction Graph (HAMIG), frequency-aware graph contrastive learning, and confidence-based label weighting supervision, respectively. HiLoMix surpasses the second-best baseline by 5.69%, 7.34%, and 15.61% on F1, AUC, and MRR.
Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision: This paper proposes UrbanLN, a framework that improves urban region representation learning from LLM-generated captions via a long-caption-aware positional encoding interpolation strategy and a dual-level (data and model) noise suppression mechanism.
Let the Void Be Void: Robust Open-Set Semi-Supervised Learning via Selective Non-Alignment: This paper proposes SkipAlign, a framework that introduces a third "skip" operation alongside the conventional pull/push operations in contrastive learning. Low-confidence samples are selectively excluded from alignment and subjected only to mild repulsion, allowing in-distribution (ID) classes to form compact "galaxies" while OOD samples naturally disperse into the "interstellar void." The approach achieves an average AUC improvement of +3.1 on unseen OOD detection, with a maximum gain of +7.1.
MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity (Extension): This paper proposes MovSemCL, a framework that transforms GPS trajectories into movement-semantic features (displacement vectors + heading angles + Node2Vec spatial graph embeddings), achieves hierarchical encoding via patch-level two-stage attention (reducing complexity from $O(L^2)$ to near-linear), and designs Curvature-Guided Augmentation (CGA) to preserve behaviorally critical segments such as turns and intersections. The framework achieves a mean rank approaching the ideal value of 1 on trajectory retrieval tasks while reducing inference latency by 43.4%.
NeuroBridge: Bio-Inspired Self-Supervised EEG-to-Image Decoding via Cognitive Priors and Bidirectional Semantic Alignment: This paper proposes NeuroBridge, a framework that employs Cognitive Prior Augmentation (CPA, asymmetric augmentation to simulate perceptual variability) and a Shared Semantic Projector (SSP, bidirectional alignment into a unified semantic space). On the THINGS-EEG dataset under a 200-class zero-shot EEG-to-image retrieval task, the method achieves 63.2% Top-1 (+12.3%) and 89.9% Top-5 (+10.2%), substantially surpassing the existing state of the art.
Robust Tabular Foundation Models: This paper proposes RTFM — a model-agnostic adversarial training framework that performs min-max optimization over the parameter space of a synthetic data generator, maximizing the "optimality gap" between a tabular foundation model (TFM) and classical tree-based models. Using fewer than 100,000 additional synthetic datasets, RTFM significantly improves TabPFN V2 across multiple tabular benchmarks.
Self-Supervised Inductive Logic Programming: This paper proposes a new self-supervised inductive logic programming (SS-ILP) setting and the Poker system, which starts from a small number of positive labeled examples and unlabeled examples, automatically generates positive and negative examples, and employs a maximally general second-order normal form (SONF) background theory to learn logic programs with recursion and predicate invention in the absence of negative examples.
Spikingformer: A Key Foundation Model for Spiking Neural Networks: This paper proposes Spikingformer, which integrates MS Residual with Self-Attention in a spike-driven manner to address the non-spike computation introduced by SEW Residual in Spikformer, while preserving global modeling capability.
Towards LLM-Empowered Knowledge Tracing via LLM-Student Hierarchical Behavior Alignment in Hyperbolic Space: This paper proposes L-HAKT, a framework that for the first time integrates LLM dual-agent design with hyperbolic geometry for knowledge tracing. A Teacher Agent parses exercise semantics and constructs a hierarchical knowledge graph, while a Student Agent simulates individual learning behaviors to generate synthetic interaction data. Hyperbolic contrastive learning is employed to calibrate the distributional gap between synthetic and real data. L-HAKT achieves an AUC of up to 80.29% across four educational datasets, with an AUC improvement of 13.03% over the GKT baseline on EdNet.
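
The BCE3S entry above relies on Sigmoid decoupling per-class logits. The sketch below illustrates that general property only, not the BCE3S losses themselves: under softmax, raising one class's logit lowers every other class's probability, while each sigmoid probability depends on its own logit alone. All function names are illustrative.

```python
import math


def softmax(logits):
    # Shift by the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(z, y):
    # Stable binary cross-entropy for one logit z and one target y in {0, 1}.
    return max(z, 0.0) - z * y + math.log(1.0 + math.exp(-abs(z)))


def multiclass_bce(logits, target_index):
    # One independent binary problem per class: the target class is the
    # positive, every other class is a negative.
    return sum(bce_with_logits(z, 1.0 if i == target_index else 0.0)
               for i, z in enumerate(logits))
```

Changing the last logit of `[2.0, 0.5, -1.0]` to `3.0` leaves `sigmoid(2.0)` for the first class untouched, whereas the first softmax probability drops.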

🖼️ Image Restoration

Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection: This paper proposes JFD3, an end-to-end dual-branch framework that performs deblurring in the feature domain rather than the image domain, and leverages frequency structure priors to guide the detection network, achieving high-accuracy real-time infrared UAV target detection under motion blur conditions.
Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration: This paper is the first to define and explore the multi-weather nighttime image restoration task. It constructs the AllWeatherNight dataset (8K training + 1K synthetic test + 1K real-world test) and proposes the ClearNight unified framework, which simultaneously removes compound degradations—haze, rain streaks, raindrops, snow, and flare—in a single stage via Retinex dual-prior guidance and weather-aware dynamic specificity–commonality collaboration. With only 2.84M parameters, ClearNight comprehensively surpasses state-of-the-art methods.
ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration: Inspired by human visual perception (HVP), this paper proposes ClearAIR, a coarse-to-fine unified image restoration framework that progressively recovers image quality through four stages — MLLM-based quality assessment → semantic region perception → degradation type identification → internal clue reuse — achieving state-of-the-art performance across multiple degradation tasks.
Hard vs. Noise: Resolving Hard-Noisy Sample Confusion in Recommender Systems via Large Language Models: This paper proposes the LLMHNI framework, which leverages two types of auxiliary signals generated by LLMs—semantic relevance and logical relevance—to resolve the confusion between hard samples and noisy samples in recommender systems, significantly improving denoising recommendation performance.
HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios: This paper proposes HQ-SVC, a framework that leverages a disentangled audio codec (FACodec) to jointly extract content and speaker features, integrates an Enhanced Voice Adaptor (EVA) to fuse acoustic features such as pitch and energy, and employs a progressive synthesis pipeline combining DDSP and a diffusion model. Trained on a single RTX 3090 with fewer than 80 hours of singing data, HQ-SVC achieves zero-shot singing voice conversion quality surpassing large-scale training baselines, and additionally supports speech super-resolution.
ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement: Targeting two overlooked statistical distribution issues in the HVI color space — large distribution discrepancy between chrominance and luminance branches leading to insufficient complementary feature extraction, and weak inter-chrominance correlation causing gradient conflicts — this paper proposes the ICLR framework. It introduces a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL) to address these issues from the perspectives of fusion enhancement and statistical distribution optimization, respectively, achieving state-of-the-art performance on the LOL benchmark series.
Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework: This paper investigates the effective utilization of decoder-based LLMs for Extreme Multi-label Classification (XMC), proposing a dual-decoder learning strategy and the ViXML multimodal framework. By employing structured prompt templates to adapt LLM embeddings and efficiently integrating visual metadata, the method substantially outperforms state-of-the-art approaches on four public benchmarks (up to +8.21% P@1 on the largest dataset), demonstrating that "one image outweighs billions of parameters."
MFmamba: A Multi-function Network for Panchromatic Image Resolution Restoration Based on State-Space Model: This paper proposes MFmamba, a multi-function network built upon a UNet++ backbone that integrates a Mamba Upsampling Block (MUB), Dual Pooling Attention (DPA), and a Multi-scale Hybrid Cross Block (MHCB). Using only panchromatic (PAN) images as input, the unified framework simultaneously supports three tasks: super-resolution, spectral restoration, and joint SR with colorization.
RefiDiff: Progressive Refinement Diffusion for Efficient Missing Data Imputation: This paper proposes RefiDiff, a four-stage framework (pre-processing → warm-up → diffusion → polishing) that, for the first time, progressively unifies the predictive and generative imputation paradigms. Combined with a Mamba-based denoising network, it achieves state-of-the-art performance across 9 datasets while running 4× faster than DIFFPUTER.
SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining: SD-PSFNet is a cascaded CNN-based deraining network driven by a dynamic PSF mechanism. It models the optical effects of raindrops via a multi-scale learnable PSF dictionary, combined with a sequential restoration architecture featuring adaptive gated fusion. The method achieves SOTA performance of 33.12 dB on Rain100H and 42.28 dB on RealRain-1k-L, yielding a cumulative gain of 5.04 dB (13.5%) over the baseline MPRNet.
SpatioTemporal Difference Network for Video Depth Super-Resolution: Motivated by the statistical observation that spatially non-smooth regions and temporally varying regions in video depth super-resolution (VDSR) follow long-tail distributions, this paper proposes STDNet. The method incorporates a spatial difference branch (learning spatial difference representations for intra-frame RGB-D adaptive aggregation) and a temporal difference branch (exploiting temporal difference representations for motion compensation in changing regions). On the TarTanAir dataset at ×16 super-resolution, RMSE is reduced from 112.04 cm to 96.80 cm, outperforming state-of-the-art methods by an average of 27.6%–32.6%.
Temporal Inconsistency Guidance for Super-resolution Video Quality Assessment: This paper proposes TIG-SVQA, a framework that, for the first time, incorporates temporal inconsistency as an explicit guidance signal for super-resolution video quality assessment. The framework introduces an Inconsistency-Highlighted Spatial Module (IHSM) and an Inconsistency-Guided Temporal Module (IGTM), achieving SRCC scores of 0.950, 0.942, and 0.939 on the SFD, MFD, and Combined-VSR datasets, respectively, surpassing all existing IQA/VQA methods.
TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis: This paper proposes TMDC, a two-stage framework in which the first stage learns denoised modality-specific and modality-common representations on complete data, and the second stage leverages denoised representations from available modalities to reconstruct missing ones — marking the first joint treatment of noise and missing modalities in MSA.
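
The TIG-SVQA entry above reports SRCC (Spearman rank correlation coefficient) scores. As a reminder of what that metric measures, the sketch below computes SRCC between predicted and ground-truth quality scores using the closed-form expression that holds when there are no tied values. The function names are illustrative, and the no-ties restriction is an assumption of this sketch.

```python
def ranks(values):
    # Rank 1 goes to the smallest value; assumes all values are distinct.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def srcc(predicted, target):
    # Spearman correlation without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    # where d is the per-item rank difference and n >= 2.
    n = len(predicted)
    rank_pred, rank_true = ranks(predicted), ranks(target)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_pred, rank_true))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

Only the ordering matters: any predictions that rank the videos in the same order as the ground-truth scores give an SRCC of 1.0, regardless of their absolute values.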

🌐 Multilingual & Translation¶

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages: This paper synthesizes multiple empirical studies to reveal critical failures of LLM safety mechanisms in low-resource and code-mixed settings, and proposes a resource-aware blueprint grounded in parameter-efficient safety steering, culturally driven preference data, and community-participatory alignment.
Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation: This paper proposes CANEFT, which uses mutual information (MI) to identify consensus-aligned neurons in LLMs that are consistently important across domains, and fine-tunes only these neurons to achieve efficient adaptation for multi-domain machine translation (MDMT). CANEFT outperforms PEFT baselines such as LoRA across 3 LLMs and 10 translation domains without introducing any additional parameters.
Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models: This paper proposes LAHIS, a method that efficiently identifies language-specific and language-general attention heads in multilingual LLMs using only a single forward-backward pass. It demonstrates that manipulating these heads enables cross-lingual attention transfer, mitigates off-target language generation, and improves multilingual QA performance with only 14–20 trainable parameters.
GloCTM: Cross-Lingual Topic Modeling via a Global Context Space: This paper proposes GloCTM, a dual-path VAE architecture (local language path + global context path) that enforces cross-lingual alignment at four levels—Polyglot Augmentation (cross-lingual neighbor-based input expansion), KL divergence internal alignment, unified decoder structural alignment, and CKA semantic alignment—achieving state-of-the-art topic quality and cross-lingual alignment on three cross-lingual datasets.
How Does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective: This paper proposes a ternary neuron classification scheme (language-specific / language-related / universal) and decomposes multilingual LLM inference into a four-stage framework. It finds that multilingual alignment improves performance by increasing language-related neurons (while reducing language-specific ones), and further demonstrates a "spontaneous multilingual alignment" effect on untrained languages.
MIDB: Multilingual Instruction Data Booster for Enhancing Cultural Equality in Multilingual Instruction Synthesis: This paper proposes MIDB (Multilingual Instruction Data Booster), a unified model trained on 36.8k expert-annotated revision samples, which automatically repairs content errors, machine translation defects, and localization deficiencies in multilingual synthetic instruction data, significantly improving instruction data quality across 16 languages and enhancing downstream LLM multilingual/cultural understanding capabilities.
Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering: This paper applies activation steering to mitigate content effects in LLMs — the tendency to conflate content believability with formal logical validity. The proposed K-CAST (kNN-based Conditional Activation Steering) method achieves up to 15% improvement in formal reasoning accuracy on models unresponsive to standard static steering.
NADIR: Differential Attention Flow for Non-Autoregressive Transliteration in Indic Languages: This paper proposes NADIR, a non-autoregressive (NAR) multilingual transliteration architecture combining a differential Transformer with a Mixture-of-Experts (MoE) module. NADIR achieves over 13× inference speedup on Indic language transliteration tasks while substantially reducing hallucination errors common in NAR models (repetition, substitution, omission, and insertion), narrowing the accuracy gap with autoregressive counterparts.
STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data: This paper proposes STELLAR, a framework for scene text editing (STE) in low-resource languages such as Korean, Arabic, and Japanese. STELLAR introduces a language-adaptive glyph encoder and a two-stage training strategy (synthetic pretraining followed by real-data fine-tuning). A reference-free TAS metric is proposed to evaluate font, color, and background style preservation without requiring ground-truth images. Korean recognition accuracy improves from a baseline maximum of 22.1% to 80.4%.
ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation: ViDia2Std constructs the first manually annotated Vietnamese dialect-to-standard parallel corpus covering all 63 provinces of Vietnam (13,000+ sentence pairs), evaluates multiple seq2seq models on the dialect normalization task, and demonstrates that dialect normalization as a preprocessing step significantly improves downstream task performance in machine translation and sentiment analysis.
X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework: This paper proposes the X-MuTeST framework, which combines LLM semantic reasoning with a two-stage training strategy enhanced by n-gram attention for explainable multilingual hate speech detection. It also introduces the first token-level human-annotated rationale benchmark datasets for Hindi and Telugu.

👥 Social Computing¶

Argumentative Debates for Transparent Bias Detection: This paper proposes ABIDE (Argumentative BIas Detection by DEbate), which constructs Quantitative Bipolar Argumentation Frameworks (QBAFs) via neighborhood-based argument schemes, models the bias detection process as a structured debate, enables transparent bias reasoning from individual neighborhoods to the global level, and formally proves the correspondence between QBAF semantics and the expected behavior of bias detection.
Beyond Detection: Exploring Evidence-based Multi-Agent Debate for Misinformation Intervention and Persuasion: This paper proposes ED2D, a framework that integrates an evidence retrieval module into a multi-agent debate (MAD) system to enhance misinformation detection accuracy. Through controlled human experiments, it provides the first comparative evaluation of AI-generated debate transcripts versus expert human fact-checks in terms of persuasiveness and belief correction, revealing a double-edged-sword effect: the AI debate system achieves expert-level persuasiveness when correct, but may amplify misinformation when wrong.
Bias Association Discovery Framework for Open-Ended LLM Generations: This paper proposes the Bias Association Discovery Framework (BADF), which systematically extracts both known and unknown bias associations between demographic identities and descriptive concepts from LLM open-ended story generation, overcoming the limitation of prior methods that rely on predefined bias concepts.
Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition: This paper proposes Cross-modal Prompting (ComP), which addresses the modality imbalance problem in incomplete multi-modal emotion recognition (IMER) via progressive prompt generation, cross-modal knowledge propagation, and a dynamic scheduler, achieving state-of-the-art performance across 4 datasets and 7 missing rates.
Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System: This paper proposes Fact2Fiction, the first poisoning attack framework targeting agentic fact-checking systems (e.g., DEFAME, InFact). It employs a Planner Agent to simulate claim decomposition and generate sub-questions, reverse-engineers key reasoning points from system justifications to craft targeted malicious evidence, and allocates the poisoning budget according to sub-claim importance. At a poisoning rate of only 1%, Fact2Fiction achieves 8.9%–21.2% higher attack success rate (ASR) than the state-of-the-art PoisonedRAG.
FactGuard: Event-Centric and Commonsense-Guided Fake News Detection: This paper proposes FactGuard, a framework that leverages LLMs to extract event-centric content (with style removed) and generate commonsense rationales. A Rationale Usability Evaluator dynamically assesses the reliability of LLM suggestions. Knowledge distillation yields a lightweight variant, FactGuard-D, that operates without LLM inference, achieving both robustness and efficiency in fake news detection.
From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks: This paper proposes CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. Through a two-phase strategy — first imitation (using only positive-advantage samples) then discrimination (introducing negative signals) — CAPO stably and significantly improves LLM performance on mathematical reasoning and multimodal GUI reasoning tasks.
Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering: This paper proposes the Multi-DProxy framework, which leverages learnable textual proxies for personalized multiple clustering through three key innovations: gated cross-modal fusion, dual-constraint proxy optimization, and dynamic candidate management, achieving state-of-the-art performance on all public benchmarks.
Reasoning About the Unsaid: Misinformation Detection with Omission-Aware Graph Inference: This paper proposes OmiGraph, the first omission-aware misinformation detection framework. By constructing omission-aware graphs, leveraging LLMs to reason about omission intent, and employing omission-guided message passing and aggregation mechanisms, OmiGraph extracts deception patterns from "what is unsaid," achieving average gains of +5.4% F1 and +5.3% ACC on bilingual datasets.
SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation: This paper proposes SceneJailEval, a scenario-adaptive multi-dimensional jailbreak evaluation framework that defines 14 jailbreak scenarios and 10 evaluation dimensions. Through a pipeline of scenario classification → dynamic dimension selection → multi-dimensional detection → weighted harm scoring, it achieves F1 of 0.917 on a self-constructed dataset (surpassing SOTA by 6%) and 0.995 on JBB (surpassing SOTA by 3%), while supporting harm severity quantification beyond binary classification.
T2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search: This paper proposes T2Agent, a misinformation detection agent integrating an extensible toolset with Monte Carlo Tree Search (MCTS). By decomposing detection into sub-tasks targeting distinct forgery sources via a multi-source verification mechanism, T2Agent achieves a new state of the art on MMFakeBench, improving the accuracy of the baseline MMDAgent by 28.7% using GPT-4o as the backbone.
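The T2Agent note above mentions MCTS without stating its selection rule. A common choice is UCT (UCB1 applied to tree nodes), sketched below. This is the textbook formula, not necessarily the variant used in the paper, and the child names in the usage note are hypothetical.

```python
import math


def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT: mean value plus an exploration bonus.

    Unvisited children get +inf so each one is tried at least once.
    """
    if visits == 0:
        return math.inf
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=math.sqrt(2)):
    """children: list of (name, total_value, visits) tuples.

    Returns the name of the child with the highest UCT score.
    """
    best = max(children, key=lambda ch: uct_score(ch[1], ch[2], parent_visits, c))
    return best[0]
```

For example, `select_child([("image_forensics", 4.5, 5), ("web_search", 0.5, 1)], parent_visits=6)` favors the rarely visited branch, while `c=0` reduces the rule to picking the highest mean value.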

🎬 Video Generation¶

3D4D: An Interactive Editable 4D World Model via 3D Video Generation: This paper proposes 3D4D, an interactive 4D visualization framework integrating WebGL and Supersplat rendering. A four-module backend pipeline converts static images and text prompts into editable 4D scenes, while a VLM-guided foveated rendering strategy enables real-time interaction at 60 fps, achieving state-of-the-art performance on both CLIP Consistency and CLIP Score.
DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation: This paper proposes DreamRunner, a framework that achieves fine-grained controllable multi-character multi-event story video generation via LLM-based dual-level planning, retrieval-augmented motion prior learning, and a spatial-temporal region-based 3D attention injection module (SR3AI).
FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion: This paper proposes FilmWeaver, a framework that guides autoregressive diffusion models via a dual-level cache (Shot Cache + Temporal Cache), enabling multi-shot video generation of arbitrary length with cross-shot consistency.
GenVidBench: A 6-Million Benchmark for AI-Generated Video Detection: This paper introduces GenVidBench—the first large-scale AI-generated video detection dataset with 6.78 million videos, featuring cross-source and cross-generator properties, covering 11 state-of-the-art video generators, and providing rich semantic annotations.
Mask2IV: Interaction-Centric Video Generation via Mask Trajectories: This paper proposes Mask2IV, a two-stage decoupled framework that first predicts mask motion trajectories of the interactor and object, then generates video conditioned on these trajectories. The approach enables controllable, interaction-centric video generation without dense mask annotations, supporting both human-object interaction and robot manipulation scenarios.
MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation: This paper proposes MoFu, which addresses two fundamental challenges in multi-subject video generation, scale inconsistency and permutation sensitivity, through two core modules: Scale-Aware Modulation (SMO), an LLM-guided modulation mechanism, and Fourier Fusion, an FFT-based permutation-invariant feature fusion strategy. The work additionally introduces the MoFu-1M training dataset and the MoFu-Bench evaluation benchmark.
MotionCharacter: Fine-Grained Motion Controllable Human Video Generation: This paper proposes the MotionCharacter framework, which decouples motion into two independently controllable dimensions—action type and motion intensity—to achieve fine-grained motion control and identity consistency in high-fidelity human video generation.
OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding: This paper proposes OmniVDiff, a unified controllable video diffusion framework that jointly models multiple visual modalities (RGB, depth, segmentation, Canny) in color space and introduces an Adaptive Modality Control Strategy (AMCS). Within a single diffusion model, OmniVDiff simultaneously supports three task types—text-conditioned generation, X-conditioned generation, and video understanding—achieving state-of-the-art performance on VBench.
Phased One-Step Adversarial Equilibrium for Video Diffusion Models: This paper proposes V-PAE (Video Phased Adversarial Equilibrium), a two-phase distillation framework consisting of stability priming followed by unified adversarial equilibrium, which compresses large-scale video diffusion models (e.g., Wan2.1-I2V-14B) to single-step generation, achieving a 100× speedup and surpassing existing acceleration methods by 5.8% in average quality on VBench-I2V.
Seeing the Unseen: Zooming in the Dark with Event Cameras: This paper proposes RetinexEVSR, the first event-driven low-light video super-resolution (LVSR) framework. Through a Retinex-inspired bidirectional fusion strategy (RBF)—which first uses illumination maps to guide event feature denoising (IEE), then leverages enhanced event features to recover reflectance details (ERE)—the method achieves a 2.95 dB gain on the SDSD benchmark while reducing runtime by 65%.
SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation: This paper proposes SphereDiff, which defines a spherical latent representation (uniformly distributed via Fibonacci Lattice) to replace conventional equirectangular projection (ERP), combined with a dynamic sampling algorithm and distortion-aware weighted averaging. Without any fine-tuning, SphereDiff leverages pretrained diffusion models such as SANA and LTX Video to generate seamless, low-distortion 360° panoramic images and videos.
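The Fibonacci lattice mentioned in the SphereDiff note is a standard way to place points near-uniformly on a sphere. The sketch below shows the usual golden-angle construction. How SphereDiff maps latents onto these points is not described here, so the function name and the offset convention are illustrative assumptions.

```python
import math


def fibonacci_lattice(n: int) -> list[tuple[float, float, float]]:
    """Return n approximately uniformly distributed points on the unit sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # Heights are evenly spaced in (-1, 1), so each point covers an equal-area band.
        z = 1.0 - (2.0 * i + 1.0) / n
        radius = math.sqrt(1.0 - z * z)
        # Successive points rotate by the golden angle to avoid alignment.
        theta = golden_angle * i
        points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points
```

Unlike an equirectangular grid, this layout does not oversample the poles, which is the property the note contrasts with ERP.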

🔗 Causal Inference¶

CaDyT: Causal Structure Learning for Dynamical Systems with Theoretical Score Analysis: This paper proposes CaDyT, which combines Gaussian process-based continuous-time dynamics modeling (via Adams-Bashforth integrators for exact inference) with the Minimum Description Length (MDL) principle for structure search. The method simultaneously addresses irregular sampling and causal structure identification, substantially outperforming all baselines on double-mass spring, diamond graph, and Rössler oscillator benchmarks (AUPRC 0.79 vs. runner-up 0.39).
Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs: This paper proposes Owl, a framework that models visual and textual attention as mediating variables within a structural causal model, introduces the VTACR metric to quantify cross-modal attention imbalance, and designs VTACR-guided adaptive attention modulation combined with a dual-path contrastive decoding strategy, achieving state-of-the-art hallucination mitigation on POPE and CHAIR benchmarks.
From Theory of Mind to Theory of Environment: Counterfactual Simulation of Latent Environmental Dynamics: This paper proposes the concept of "Theory of Environment" (ToE), arguing that humans may infer latent environmental dynamics through computational mechanisms shared with Theory of Mind (ToM), thereby expanding the dimensionality of motor exploration and facilitating behavioral innovation.
Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models: This paper proposes ABCA (Aspect-Based Causal Abstention), a pre-generation abstention framework that employs dual-agent debate to identify "aspect variables" (e.g., discipline, legal context, temporal frame) for activating distinct knowledge branches within LLMs. It applies the AIPW doubly robust estimator to compute causal effects and uses Centroid Angular Deviation (CAD) to detect knowledge conflicts (Type-1) or knowledge insufficiency (Type-2), achieving 91.4% accuracy on TruthfulQA and 96.4% unanswerable question identification rate—far surpassing the baseline of 44%.
I-CAM-UV: Integrating Causal Graphs over Non-Identical Variable Sets Using Causal Additive Models with Unobserved Variables: This paper proposes I-CAM-UV, a method that enumerates consistent DAGs satisfying structural constraints derived from multiple CAM-UV outputs over non-identical variable sets, recovering causal relations lost due to unobserved variables, and introduces an optimal-first search algorithm exploiting cost monotonicity for efficient combinatorial search.
KTCF: Actionable Recourse in Knowledge Tracing via Counterfactual Explanations for Education: This paper proposes KTCF, a counterfactual explanation generation method for Knowledge Tracing (KT) that leverages inter-concept relationships to produce sparse and actionable counterfactuals, subsequently post-processed into sequentially ordered instructional recommendations. KTCF comprehensively outperforms baseline methods across validity, sparsity, and actionability metrics.
Learning Subgroups with Maximum Treatment Effects without Causal Heuristics: Under the SCM framework, the paper proves that the subgroup with maximum treatment effect must exhibit homogeneous pointwise effects (Theorem 1); under the partition model assumption, it proves that optimal subgroup discovery reduces to standard supervised learning (Theorem 2), achievable via CART with the Gini index. On 77 ACIC-2016 semi-synthetic datasets, the proposed method achieves a mean treatment effect of 10.54 (vs. 7.84 for the runner-up), ranking first on 51.9% of datasets.
MUG: Multi-agent Undercover Gaming — Hallucination Removal via Counterfactual Test for Multimodal Reasoning: MUG reframes Multi-Agent Debate (MAD) as a "Who's Undercover" social reasoning game — by introducing information asymmetry through counterfactual image editing (modifying the reference image), one agent is assigned the edited image $I^-$ as the "undercover," while other agents hold the original image $I^+$ and identify the undercover (i.e., the hallucination source) via reasoning and voting. On HallusionBench, Qwen2.5VL-7B improves from 46.4% to 53.8%.
Skill Path: Unveiling Language Skills from Circuit Graphs: This paper proposes the concept of Skill Path and a three-step framework (Decomposition–Pruning–Causal Mediation) to extract linear paths corresponding to specific language skills from circuit graphs, and quantitatively validates two core conjectures: Stratification and Inclusiveness of skills.
Sparse Additive Model Pruning for Order-Based Causal Structure Learning: This paper proposes SARTRE, a framework that employs randomized tree embeddings and group-sparse regression to learn sparse additive models, replacing the hypothesis-testing-based redundant edge pruning in CAM-pruning for order-based causal structure learning. SARTRE achieves significant speedups without sacrificing accuracy.
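The ABCA note above relies on the AIPW doubly robust estimator. As a minimal sketch, assuming a binary treatment and precomputed nuisance estimates, the textbook form of the estimator looks like this. How ABCA defines treatments and outcomes over aspect variables is not specified in the note, so the argument names are illustrative.

```python
def aipw_ate(y, t, e, mu1, mu0):
    """Augmented inverse-probability-weighted estimate of the average treatment effect.

    y   : observed outcomes
    t   : treatment indicators (0 or 1)
    e   : estimated propensity scores P(T=1 | x), strictly between 0 and 1
    mu1 : outcome-model predictions under treatment
    mu0 : outcome-model predictions under control
    """
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e, mu1, mu0):
        # Outcome-model contrast plus inverse-propensity-weighted residual corrections.
        total += (m1 - m0
                  + ti * (yi - m1) / ei
                  - (1 - ti) * (yi - m0) / (1.0 - ei))
    return total / len(y)
```

The estimate stays consistent if either the propensity model or the outcome model is correctly specified, which is what "doubly robust" refers to.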

💻 Code Intelligence¶

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation: This paper proposes DiffBench (an evaluation benchmark comprising 604 diffusion model acceleration tasks across 5 difficulty levels) and DiffAgent (a closed-loop framework integrating Planning, Coding, and Debugging agents with a genetic algorithm-based selector). On Claude Sonnet 4, the framework improves the pass rate for diffusion acceleration code generation from 54.30% to 81.59%, achieving a 68.27% success rate on complex optimization tasks.
EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion: This paper proposes EquaCode, a multi-strategy jailbreak method that decomposes malicious queries into a cross-domain combination of equation solving ($B+C+x=A$) and code completion (completing the solve() method of a Solver class), achieving an average attack success rate of 92.78% on the GPT series and approaching 100% on the latest models (Gemini/DeepSeek/Grok).
Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction: This paper proposes Agent-Event-Coder (AEC), which reformulates zero-shot event extraction as a software engineering workflow. Four specialized agents (Retrieval→Planning→Coding→Verification) collaborate to perform extraction, while event schemas are encoded as executable Python classes to enable compiler-style deterministic validation and dual-loop iterative correction. AEC comprehensively outperforms zero-shot baselines across 5 domains and 6 LLMs.
MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings: This paper proposes ModularStarEncoder (MoSE), a 1B-parameter multi-exit encoder that significantly enhances early-layer representations via a novel self-distillation mechanism in which higher layers guide the training of lower layers. MoSE surpasses all open-source models on code understanding tasks such as CodeSearchNet while supporting flexible compute–accuracy tradeoff deployment.
ReCode: Updating Code API Knowledge with Reinforcement Learning: This paper proposes ReCode, a framework that trains LLMs via rule-based reinforcement learning (rather than SFT) to correctly leverage API update documentation provided in the prompt for code version migration, enabling a 7B model to surpass 32B models on CodeUpdateArena.
SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models: This paper proposes SPAN, a cross-calendar temporal reasoning benchmark of 37,380 instances spanning 6 calendars, 10 reasoning directions, and a 100-year range. Baseline LLMs achieve an average accuracy of only 34.5% (none exceeding 80%), revealing two systematic failure modes: Future-Date Degradation and Calendar Asymmetry Bias. A tool-augmented Time Agent achieves 95.31%, demonstrating that cross-calendar reasoning requires external tools rather than parametric knowledge.
TAPA: Training-Free Adaptation of Programmatic Agents via LLM-Guided Program Synthesis in Dynamic Environments: TAPA positions LLMs as "intelligent modulators" of the symbolic action space rather than direct decision-makers. Through LLM-guided program synthesis, it dynamically adapts the symbolic actions of programmatic agents without retraining, achieving strong performance in cybersecurity DDoS defense (77.7% network uptime) and swarm intelligence formation control.
Towards Better Code Understanding in Decoder-Only Models with Contrastive Learning: This paper proposes CL4D, a contrastive learning framework that adapts pretrained decoder-only code generation models to code understanding tasks (code search, clone detection) via continued pretraining, achieving performance comparable to or better than encoder-only models of equivalent scale without retraining them from scratch.
Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study: This paper systematically investigates the capability bottlenecks of open-source LLMs in data analysis tasks. It decomposes data analysis into three dimensions—data comprehension, code generation, and strategic planning—and identifies strategic planning as the decisive factor, rather than coding or data comprehension. A strategy-guided data synthesis approach is proposed, enabling fine-tuned 7B/14B models to achieve performance competitive with GPT-4o.
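The Agent-Event-Coder note above describes encoding event schemas as executable Python classes so that extractions can be validated deterministically. The sketch below illustrates that idea under stated assumptions: the `AttackEvent` schema, its role names, and the `validate` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AttackEvent:
    """Hypothetical event schema: one trigger plus optional argument roles."""
    trigger: str
    attacker: list[str] = field(default_factory=list)
    target: list[str] = field(default_factory=list)
    place: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Deterministic checks run every time a candidate is instantiated.
        if not isinstance(self.trigger, str) or not self.trigger.strip():
            raise ValueError("trigger must be a non-empty string")
        for role in ("attacker", "target", "place"):
            values = getattr(self, role)
            if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
                raise TypeError(f"{role} must be a list of strings")


def validate(candidate: dict):
    """Return (event, None) on success, or (None, message) to feed back for correction."""
    try:
        return AttackEvent(**candidate), None
    except (TypeError, ValueError) as exc:
        return None, f"{type(exc).__name__}: {exc}"
```

An unknown role such as `{"trigger": "bombed", "victim": ["x"]}` fails at construction time, and the returned message could serve as the error signal for an iterative correction loop.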

⚡ LLM Efficiency¶

Connectivity-Guided Sparsification of 2-FWL GNNs Preserving Full Expressivity: This paper proposes Co-Sparsify, a connectivity-aware sparsification framework that restricts 3-node interactions to biconnected components and 2-node interactions to connected components, eliminating provably redundant computation. It preserves full 2-FWL expressivity while substantially improving efficiency, achieving state-of-the-art results on synthetic substructure counting tasks and benchmarks including ZINC and QM9.
Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models: This paper presents the first systematic study of how parametric knowledge influences generation in long-context language models (LCLMs), finding that such influence grows with context length and that methods designed to improve extrinsic retrieval suppress parametric recall. Based on these findings, the paper proposes the Hybrid Needle-in-a-Haystack (Hybrid NIAH) benchmark to jointly evaluate both capabilities.
How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts: This paper proposes MASS, a framework that adaptively expands the MoE expert pool via gradient-based semantic drift detection, combined with a Top-p confidence routing strategy, to automatically discover the optimal number of experts without hyperparameter search while enhancing semantic differentiation across experts.
InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE: This paper proposes InterMoE, a Dynamic Temporal-Selective MoE architecture for text-driven two-person 3D interaction motion generation that addresses individual identity preservation and semantic fidelity. A Synergistic Router fuses semantic and kinematic features to guide routing, while Dynamic Temporal Selection enables each expert to adaptively select key temporal frames. The method achieves a 9% FID reduction on InterHuman and 22% on InterX.
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction: This paper proposes Judge Q, which introduces trainable soft tokens into the model vocabulary and trains their attention patterns to align with those of actual decoding tokens, enabling them to replace local-window queries for evaluating KV cache importance during the prefill stage. This approach better preserves global information, achieving ~1-point improvement on LongBench and 3+ points on RULER.
MoETTA: Test-Time Adaptation Under Mixed Distribution Shifts with MoE-LayerNorm: This paper proposes MoETTA, a test-time adaptation framework that reparameterizes LayerNorm into multiple structurally decoupled expert branches. A routing mechanism assigns samples from different domains to different experts, enabling multi-directional parameter updates and overcoming the limitations of a single adaptation path under mixed distribution shifts. The paper also introduces two more realistic evaluation benchmarks—potpourri and potpourri+—and achieves state-of-the-art performance across all settings.
Resource Efficient Sleep Staging via Multi-Level Masking and Prompt Learning: This paper proposes MASS (Mask-Aware Sleep Staging), a framework that achieves reliable sleep staging using only 10% of the original EEG signal through a multi-level masking strategy and hierarchical prompt learning mechanism, providing a practical solution for resource-constrained wearable sleep monitoring systems.
Scaling and Transferability of Annealing Strategies in Large Language Model Training: This paper proposes a model-agnostic predictive framework that decomposes training loss into a forward-effect term (learning rate integral $S$), an annealing momentum term (Adam-style momentum integral $M$), and a model-size term $N$. It demonstrates that annealing strategies can be transferred from small models/small batches to large models/large batches, achieving a prediction MAPE below 2%.
The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models: Using mechanistic interpretability tools including Patchscopes, attention knockout, and linear probes, this paper systematically reveals the internal mechanisms of analogical reasoning in LLMs: models can effectively encode relational information in middle-to-upper layers, but applying relational information to new entities is a greater bottleneck than extracting it; successful analogical reasoning correlates with strong structural alignment across stories, while failures reflect weak or misplaced alignment.
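The MoE note on adaptive expert expansion above mentions a Top-p confidence routing strategy. A common reading of top-p routing is to activate the smallest set of experts whose cumulative router probability reaches a threshold `p`. The sketch below shows that generic rule. It is an assumption about the general technique, not the paper's exact routing function.

```python
import math


def top_p_route(logits, p=0.9):
    """Select the smallest set of experts whose softmax mass reaches p.

    Returns a list of (expert_index, renormalized_weight) pairs,
    ordered from most to least probable expert.
    """
    # Numerically stable softmax over router logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    probs = [v / norm for v in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / mass) for i in chosen]
```

A confident router (one dominant logit) activates a single expert, while a flat distribution activates many, so the number of active experts adapts per token instead of being fixed as in top-k routing.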

🛰️ Remote Sensing¶

Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency: This paper proposes a novel paradigm termed Asymmetric Cross-modal Knowledge Distillation (ACKD), realized through the SemBridge framework — comprising two plug-and-play modules, namely self-supervised semantic matching and optimal transport alignment — to enable cross-modal knowledge distillation under weak semantic consistency. This allows multispectral (MS) images collected from different geographic regions to effectively guide RGB-based remote sensing scene classification.
Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments: This paper models conflicting predictions from multiple pre-trained perception models in novel environments as a consistency-based abductive reasoning problem. Error detection rules and domain constraints for each model are encoded as logic programs, and an optimal hypothesis is sought that maximizes prediction coverage while keeping the inconsistency rate below a threshold. The approach achieves an average F1 improvement of 13.6% across 15 aerial test sets.
Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: To address the attenuation of causal treatment effects caused by regression-to-the-mean in ML-based satellite poverty predictions, this paper proposes two post-processing correction methods that require no additional labeled data — Linear Calibration Correction (LCC) and Tweedie local unshrinking — enabling a single prediction map to be reused across multiple downstream causal studies (the "One Map, Many Trials" paradigm). Tweedie correction achieves near-unbiased treatment effect estimation on both simulated and real DHS data.
M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction: This paper proposes M3SR, a Mamba-based multi-scale multi-perceptual architecture that integrates spatial, frequency, and spectral branches in parallel within a U-Net multi-scale structure. With only 2.17M parameters and 100.9G FLOPs, M3SR surpasses existing state-of-the-art methods on four spectral reconstruction benchmarks.
Machine Learning for Sustainable Rice Production: Region-Scale Monitoring of Water-Saving Practices in Punjab, India: This paper proposes a dimensional classification approach that decouples the recognition of water-saving rice practices into two independent binary classification tasks — a seeding dimension (DSR vs. PTR) and an irrigation dimension (AWD vs. CF). Using only Sentinel-1 SAR imagery, the method achieves seeding F1=0.80 and irrigation F1=0.74, and performs large-scale inference over 3 million+ parcels in Punjab, with district-level adoption rates strongly correlated with government statistics (Spearman ρ=0.69).
Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification: This paper proposes the CABIN framework, which employs a closed-loop cognitive perceive–act–correct learning mechanism. By replacing naive confidence with epistemic uncertainty to guide sample selection and pseudo-label management in semi-supervised hyperspectral image classification, CABIN significantly outperforms fully supervised baselines while using only 75% of the labeled data.
TDCNet: Spatio-Temporal Context Learning with Temporal Difference Convolution for Moving IRSTD: This paper proposes TDCNet, which unifies temporal difference and 3D convolution into a single Temporal Difference Convolution (TDC). Through re-parameterization, TDC introduces zero additional inference overhead. Combined with TDC-guided spatio-temporal attention (TDCSTA), TDCNet achieves an F1 of 97.12% (AP50 93.83%) on the newly constructed IRSTD-UAV dataset, which contains 15,106 frames of real infrared UAV imagery.
UniABG: Unified Adversarial View Bridging and Graph Correspondence for Unsupervised Cross-View Geo-Localization: This paper proposes UniABG, a two-stage unsupervised cross-view geo-localization framework that employs View-Aware Adversarial Bridging (VAAB) to eliminate the domain gap between UAV and satellite views, followed by Heterogeneous Graph Filtering Calibration (HGFC) to purify cross-view correspondences. UniABG achieves 93.29% Satellite→Drone AP on University-1652, surpassing most supervised methods.

🧮 Scientific Computing

Just Few States are Enough: Randomized Sparse Feedback for Stability of Dynamical Systems: This paper proposes a randomized sparse feedback control framework in which the controller accesses only a random subset of the state vector at each time step. Feedback gain matrices and Bernoulli sparsification parameters are jointly designed via LMIs to guarantee asymptotic mean-square stability (AMSS) while minimizing the required number of active sensors. Experiments demonstrate that as few as 0.3% of state components suffice to achieve performance comparable to full-state feedback.
Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction: This paper proposes KARMA, a framework that embeds the Linear Spectral Mixing Model (LSMM) as a physics constraint within the ViT-MAE decoder, combined with a Spectral Angle Mapper (SAM) loss, to improve reconstruction fidelity and downstream transfer performance for hyperspectral remote sensing imagery.
Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids: This work introduces the Phys-Liquid dataset (97,200 physics-simulated images with 3D meshes), which models dynamic deformation of liquids inside transparent containers based on the Navier-Stokes equations, and proposes a four-stage reconstruction pipeline (segmentation → multi-view mask generation → 3D reconstruction → scaling) to achieve high-accuracy liquid geometry and volume estimation in both simulated and real-world scenes.
PhysicsCorrect: A Training-Free Approach for Stable Neural PDE Simulations: This paper proposes PhysicsCorrect, a training-free correction framework that models PDE residual correction as a linearized inverse problem and precomputes a cached pseudoinverse. At inference time, it achieves up to 100× error reduction with less than 5% computational overhead, and is applicable to arbitrary pretrained neural operators including FNO, UNet, and ViT.
PIMRL: Physics-Informed Multi-Scale Recurrent Learning for Burst-Sampled Spatiotemporal Dynamics: This paper proposes PIMRL, a framework for learning from burst-sampled (short high-frequency bursts followed by long intervals) sparse spatiotemporal data. It features a dual-module architecture combining macro-scale latent-space reasoning and micro-scale physics correction, integrated via cross-scale message passing, achieving up to 80% error reduction across 5 PDE benchmarks.
SAOT: An Enhanced Locality-Aware Spectral Transformer for Solving PDEs: This paper proposes SAOT (Spectral Attention Operator Transformer), which captures high-frequency local details via linear-complexity Wavelet Attention (WA) and complements it with the global receptive field of Fourier Attention (FA) through a gated fusion mechanism. SAOT achieves state-of-the-art performance on 6 operator learning benchmarks, reducing the relative error on Navier-Stokes by 22.3% compared to Transolver.
Scientific Knowledge-Guided Machine Learning for Vessel Power Prediction: A Comparative Study: A hybrid modeling framework combining a physics baseline with a data-driven residual is proposed. The sea trial power curve (propeller law $P=cV^n$) serves as the baseline, and XGBoost/NN/PINN models learn the residual correction, significantly improving extrapolation stability and physical consistency in sparse data regions.
Towards a Foundation Model for Partial Differential Equations Across Physics Domains: This paper proposes PDE-FM, a modular PDE foundation model combining spatial-spectral dual-modal tokenization, FiLM-based physics modulation, and a Mamba state-space backbone. It achieves an average 46% reduction in VRMSE across 12 heterogeneous physics-domain datasets from The Well benchmark.
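
The vessel power entry above describes a physics baseline plus a learned residual. A minimal sketch of that decomposition follows, assuming synthetic data and a plain one-feature least-squares fit as a stand-in for the XGBoost/NN/PINN residual models. The function names, the wave-height feature, and all numbers are hypothetical and not taken from the paper.

```python
import math

def fit_propeller_law(speeds, powers):
    """Fit P = c * V**n by least squares in log-log space."""
    xs = [math.log(v) for v in speeds]
    ys = [math.log(p) for p in powers]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    n = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = math.exp(mean_y - n * mean_x)
    return c, n

def fit_linear_residual(features, residuals):
    """Fit residual = a * feature + b (stand-in for a learned residual model)."""
    mean_f = sum(features) / len(features)
    mean_r = sum(residuals) / len(residuals)
    a = (sum((f - mean_f) * (r - mean_r) for f, r in zip(features, residuals))
         / sum((f - mean_f) ** 2 for f in features))
    b = mean_r - a * mean_f
    return a, b

def predict_power(c, n, a, b, speed, feature):
    """Physics baseline plus data-driven residual correction."""
    return c * speed ** n + (a * feature + b)

# Hypothetical calm-water sea trial curve: P = 2 * V**3
trial_speeds = [8.0, 10.0, 12.0, 14.0]
trial_powers = [2.0 * v ** 3 for v in trial_speeds]
c, n = fit_propeller_law(trial_speeds, trial_powers)

# Hypothetical in-service samples: (speed, wave height), with extra power
# of 50 units per unit of wave height on top of the baseline
service = [(9.0, 0.5), (11.0, 1.0), (13.0, 2.0), (10.0, 1.5)]
observed = [2.0 * v ** 3 + 50.0 * h for v, h in service]

# The residual model only has to explain what the baseline misses
residuals = [p - c * v ** n for (v, _), p in zip(service, observed)]
a, b = fit_linear_residual([h for _, h in service], residuals)

estimate = predict_power(c, n, a, b, speed=10.0, feature=2.0)
```

Because the baseline carries the speed dependence, the residual model stays small and the prediction falls back to the propeller law wherever the residual has little data, which is one plausible reading of the extrapolation stability the entry mentions.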

📚 Pretraining

ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences: ELSPR models pairwise preferences of LLM evaluators as tournament graphs, identifies non-transitive preferences via strongly connected components (SCCs), proposes a normalized directed graph structural entropy metric, and filters problematic training data through graph reconstruction — resulting in a 13.8% reduction in non-transitivity and a 0.088 decrease in structural entropy, while the discarded data achieves only 34.4% human agreement (vs. 52.6% for retained data).
Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding: This paper proposes a Task-Step-State (TSS) three-level semantic framework that introduces "state" as a visual grounding layer within the conventional task-step hierarchy, and designs a progressive pretraining strategy following a U-shaped path (Task→Step→State→Step→Task) to unfold the TSS hierarchy stage by stage. The approach achieves comprehensive state-of-the-art performance on task recognition, step recognition, and step forecasting tasks on the COIN and CrossTask datasets.
Learning Time in Static Classifiers: This paper proposes the Support-Exemplar-Query (SEQ) learning framework, which injects temporal reasoning capabilities into standard feed-forward classifiers through loss function design rather than architectural modification. By aligning predicted sequences with class-level temporal prototypes via soft DTW, the method achieves consistent improvements on both fine-grained image classification and video anomaly detection.
No-Regret Strategy Solving in Imperfect-Information Games via Pre-Trained Embedding: This paper proposes the Embedding CFR algorithm, which maps information sets in imperfect-information games to a continuous low-dimensional embedding space (rather than discrete clusters), achieving faster exploitability convergence and higher-quality strategy solving under the same space budget.
PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer: PrefixGPT frames prefix adder optimization as a sequence generation problem. A customized GPT model is pretrained to learn design rules, then fine-tuned via RL to generate optimized designs, achieving state-of-the-art area-delay product (ADP) with robustness to initialization.
Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach: This paper proposes SynPrune, the first syntax-aware membership inference attack (MIA) method for code. By identifying 47 Python syntactic conventions and pruning syntactically determined tokens (retaining only tokens that reflect authorial style) when computing MIA scores, SynPrune achieves an average AUROC improvement of 15.4%, enabling more effective detection of code that appeared in the pretraining data of code LLMs.
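
The ELSPR entry above relies on the fact that a tournament graph of pairwise preferences is transitive exactly when every strongly connected component (SCC) is a single node. A minimal sketch of that detection step follows, using Kosaraju's algorithm on a toy preference set. The response labels and helper names are hypothetical, and ELSPR's structural entropy metric and graph-reconstruction filtering are not reproduced here.

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; each edge (winner, loser) means winner is preferred."""
    graph, reverse = defaultdict(list), defaultdict(list)
    for winner, loser in edges:
        graph[winner].append(loser)
        reverse[loser].append(winner)

    # Pass 1: record nodes in order of DFS completion on the original graph
    visited, order = set(), []

    def visit(node):
        visited.add(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visit(nxt)
        order.append(node)

    for node in nodes:
        if node not in visited:
            visit(node)

    # Pass 2: DFS on the reversed graph in decreasing completion order
    assigned, components = set(), []

    def collect(node, component):
        assigned.add(node)
        component.add(node)
        for prev in reverse[node]:
            if prev not in assigned:
                collect(prev, component)

    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components

def non_transitive_groups(nodes, edges):
    """SCCs with more than one node contain at least one preference cycle."""
    return [c for c in strongly_connected_components(nodes, edges) if len(c) > 1]

# Toy evaluator judgments: A > B > C > A is a cycle, and every response beats D
responses = ["A", "B", "C", "D"]
preferences = [("A", "B"), ("B", "C"), ("C", "A"),
               ("A", "D"), ("B", "D"), ("C", "D")]
groups = non_transitive_groups(responses, preferences)
```

In this toy case `groups` holds the single component containing A, B, and C, while D sits in its own singleton component and is consistent with the rest of the ranking.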

🗣️ Dialogue Systems

Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation: This paper proposes the Auto-PRE framework, which selects qualified LLM evaluators through an automatic qualification exam across three dimensions—consistency, pertinence, and self-confidence—achieving state-of-the-art evaluation performance without human annotation while significantly reducing costs.
Chatsparent: An Interactive System for Detecting and Mitigating Cognitive Fatigue in LLMs: This paper presents Chatsparent, an interactive system that monitors three token-level fatigue signals during LLM inference in real time—attention decay, embedding drift, and entropy collapse—aggregates them into a unified fatigue index, and automatically applies lightweight interventions (prompt re-injection, attention reset, entropy-regularized decoding, self-reflection checkpoints) when fatigue thresholds are triggered, transforming passive chat interaction into an active diagnostic experience.
Emergent Persuasion: Will LLMs Persuade Without Being Prompted?: This paper investigates whether LLMs spontaneously exhibit persuasive behavior without being explicitly prompted to do so. It finds that activation steering fails to reliably induce persuasive tendencies, whereas supervised fine-tuning (SFT) on benign persuasion data causes models to exhibit emergent persuasive behavior on harmful topics, revealing latent post-training safety risks.
TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech: This paper proposes TalkSketch, a system that integrates hand-drawn sketches with real-time speech input into a multimodal AI chatbot, enabling designers to simultaneously draw and verbalize ideas during early-stage ideation. The system addresses the problem that text-based prompting in existing GenAI tools disrupts the creative workflow.
Canoe: Teaching LLMs to Maintain Contextual Faithfulness via Synthetic Tasks and RL: This paper proposes the Canoe framework, which synthesizes four types of verifiable short-form QA data from Wikidata triples and applies Dual-GRPO (incorporating accuracy reward, long-form proxy reward, and format reward) to jointly optimize faithfulness in both short- and long-form generation. The approach improves Llama-3-8B by an average of 22.6% across 11 downstream tasks, surpassing GPT-4o.

✏️ Knowledge Editing

Catastrophic Forgetting in Kolmogorov-Arnold Networks: This paper presents the first systematic study of catastrophic forgetting in Kolmogorov-Arnold Networks (KANs). It establishes a theoretical framework linking forgetting to activation support overlap and intrinsic data dimensionality, and proposes KAN-LoRA for continual fine-tuning of language models in knowledge editing settings.
Hybrid-DMKG: A Hybrid Reasoning Framework over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing: This paper proposes the MMQAKE benchmark and the Hybrid-DMKG framework, which constructs a dual-channel hybrid reasoning mechanism — combining relation link prediction with RAG-augmented LVLM inference — over a dynamic multimodal knowledge graph, supplemented by a background reflection decision module. The approach significantly outperforms existing methods on 2–5 hop multimodal knowledge editing QA (H-Acc of 29.90% on LLaVA, surpassing IKE by 13.52 percentage points).
Is the Information Bottleneck Robust Enough? Towards Label-Noise Resistant Information Bottleneck Learning: This paper identifies the inherent vulnerability of the Information Bottleneck (IB) principle under label noise and proposes LaT-IB, which decomposes representations into a clean-label subspace and a noisy-label subspace. Combined with a Minimal-Sufficient-Clean (MSC) criterion and a three-stage training framework, LaT-IB significantly outperforms existing IB methods across diverse noise conditions.
Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior: This paper frames the steering of agent ethical behavior as a model editing task (Behavior Editing), proposes a three-tier BehaviorBench grounded in psychological moral theory, and validates on 9 open-source and 20 closed-source models that model editing can precisely steer agents toward either benevolent or malicious behavior, with a single edit potentially causing global moral alignment drift.
Multiplicative Orthogonal Sequential Editing for Language Models (MOSE): This paper proposes MOSE (Multiplicative Orthogonal Sequential Editing), which injects new knowledge by left-multiplying the parameter matrix with an orthogonal matrix (rather than via additive updates), strictly preserving the Frobenius norm and condition number of the edited matrix. MOSE achieves a 12.08% performance improvement in sequential editing while retaining 95.73% of general capabilities.
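
The MOSE entry above rests on a linear-algebra fact: left-multiplying a weight matrix W by an orthogonal matrix Q leaves W^T W unchanged, so the singular values, and with them the Frobenius norm and the condition number, are preserved. A minimal sketch of that check follows, assuming a single Givens rotation as the orthogonal matrix. How MOSE actually constructs its orthogonal matrix is not described in the entry, so the rotation, the matrix values, and the helper names here are purely illustrative.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def givens_rotation(n, i, j, theta):
    """n x n identity with a planar rotation in coordinates (i, j); orthogonal."""
    q = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    q[i][i], q[j][j] = cos_t, cos_t
    q[i][j], q[j][i] = -sin_t, sin_t
    return q

def frobenius_norm(a):
    return math.sqrt(sum(x * x for row in a for x in row))

def gram(a):
    """W^T W, whose eigenvalues are the squared singular values of W."""
    return matmul(transpose(a), a)

# Hypothetical 3 x 2 weight matrix and a multiplicative "edit" Q @ W
W = [[2.0, -1.0], [0.5, 3.0], [1.5, 0.25]]
Q = givens_rotation(3, 0, 2, 0.7)
W_edited = matmul(Q, W)

# Largest entrywise change in the Gram matrix; zero up to rounding error
gram_gap = max(abs(x - y)
               for row_a, row_b in zip(gram(W), gram(W_edited))
               for x, y in zip(row_a, row_b))
norm_gap = abs(frobenius_norm(W) - frobenius_norm(W_edited))
```

The entries of `W_edited` differ from those of `W`, yet `gram_gap` and `norm_gap` are zero up to floating-point error. An additive update W + ΔW carries no such guarantee, which is the contrast the entry draws.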

🔎 AIGC Detection

ActiShade: Activating Overshadowed Knowledge to Guide Multi-Hop Reasoning in Large Language Models: This paper proposes ActiShade, a framework that detects "overshadowed" key phrases in LLM multi-hop reasoning via Gaussian noise perturbation, retrieves supplementary documents using a customized contrastive learning retriever, and iteratively reformulates queries to mitigate error accumulation caused by knowledge overshadowing. ActiShade significantly outperforms DRAGIN and other state-of-the-art methods on HotpotQA, 2WikiMQA, and MuSiQue.
BAID: A Benchmark for Bias Assessment of AI Detectors: This paper introduces the BAID benchmark (208K sample pairs covering 7 bias dimensions and 41 subgroups) to systematically evaluate the fairness of 4 open-source AI text detectors across demographic and linguistic subgroups, revealing significant recall disparities for dialect, informal English, and minority group texts.
Optimized Algorithms for Text Clustering with LLM-Generated Constraints: This paper proposes the LSCK-HC framework, which leverages LLMs to generate set-form must-link/cannot-link constraints (as opposed to traditional pairwise constraints), coupled with a penalty-based local search clustering algorithm. The approach achieves clustering accuracy comparable to SOTA on five short-text datasets while reducing the number of LLM queries by more than 20×.

📡 Signal & Communications

Task Aware Modulation Using Representation Learning for Upscaling of Terrestrial Carbon Fluxes: This paper proposes TAM-RL, a framework that formulates terrestrial carbon flux upscaling as a zero-shot regression transfer learning problem. By combining a BiLSTM task encoder with FiLM modulation and a knowledge-guided loss derived from the carbon balance equation, the method achieves a 9.6% reduction in GPP RMSE and a 43.8% improvement in NEE R² over FLUXCOM-X-BASE across 150+ flux tower sites.
Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion: This paper proposes UP-Fusion, a unified multi-modality image fusion framework comprising three modules — Semantic-aware Channel Pruning Module (SCPM), Geometric Affine Modulation (GAM), and CLIP Text-guided Channel Perturbation Module (TCPM) — that employs a single set of weights (trained solely on infrared-visible data) to simultaneously handle both IVIF and medical image fusion tasks, achieving state-of-the-art performance on both.
Toward Gaze Target Detection in Young Autistic Children: To address the severe class imbalance in gaze target detection for autistic children—where face-directed gaze accounts for only 6.6% of samples—this paper proposes the Socially Aware Coarse-to-Fine (SACF) framework. A fine-tuned Qwen2.5-VL serves as a social-context-aware gate that routes inputs to either a socially aware or a socially agnostic expert model. Evaluated on the newly introduced AGT dataset, the framework substantially improves face gaze detection performance (Face L2 reduced by 13.9% on Sharingan; F1 improved from 0.753 to 0.761).

⚛️ Physics

Adaptive Fidelity Estimation for Quantum Programs with Graph-Guided Noise Awareness: This paper proposes QuFid, a framework that models quantum circuits as directed acyclic graphs (DAGs), characterizes noise propagation via control-flow-aware random walks, quantifies circuit complexity through spectral features of the propagation operator, and achieves adaptive measurement budget allocation — significantly reducing the number of measurement shots while maintaining fidelity accuracy.
Data Verification is the Future of Quantum Computing Copilots: This position paper argues that data verification must be elevated from a post-hoc filtering step to a foundational architectural principle in quantum computing AI copilots. Three positions are advanced: (1) verified data is a minimum requirement; (2) prior constraints outperform posterior filtering; (3) scientific domains governed by physical laws require verification-aware architectures. Experiments demonstrate that LLMs trained without verified data achieve at most 79% accuracy on circuit optimization tasks.

🌍 Earth Science

MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics: This paper proposes MdaIF, a framework that leverages a vision-language model (VLM) to extract degradation-aware semantic priors for guiding mixture-of-experts (MoE) routing and channel attention modulation, enabling one-stop infrared-visible image fusion across multiple degradation scenarios without requiring degradation-type annotations.

📂 Others

A Fast Heuristic Search Approach for Energy-Optimal Profile Routing for Electric Vehicles: This paper proposes Pr-A*, a label-setting method based on multi-objective A* search for efficiently solving energy-optimal profile routing for electric vehicles (EVs) when the initial state of charge (SoC) is unknown. By using profile dominance pruning, the method avoids the complex profile merge operations required by traditional approaches, achieving performance close to standard A* with known initial SoC on large-scale road networks.
A Graph-Theoretical Perspective on Law Design for Multiagent Systems: This paper studies the law design problem in multiagent systems from a graph-theoretical perspective, reducing the minimization of useful laws and gap-free laws to the vertex cover problem on hypergraphs, proving NP-hardness, and providing approximation algorithms.
A New Strategy for Verifying Reach-Avoid Specifications in Neural Feedback Systems: This paper proposes FaBRe (Forward and Backward Reachability), which, for the first time, develops both over- and under-approximation algorithms for backward reachable sets of ReLU neural network controllers (GSS/ICH/LEB) and integrates them with forward reachability analysis into a unified reach-avoid verification framework, aiming to overcome the scalability bottleneck of purely forward analysis.
A Phase Transition for Opinion Dynamics with Competing Biases: This paper models the competition between two opposing forces — external subversive bias and individual stubbornness — on binary opinion spreading over directed random graphs. It proves that the system exhibits a sharp phase transition: when the bias exceeds a critical threshold $p_c$, the population rapidly reaches a new consensus; below the threshold, the system remains in a long-lived metastable polarized state. The critical point is determined solely by two simple statistics of the degree sequence.
A Switching Framework for Online Interval Scheduling with Predictions: For the irrevocable online interval scheduling problem, this paper proposes the SemiTrust-and-Switch framework and the SmoothMerge randomized algorithm. By switching between or blending a prediction-trusting strategy and a classical greedy algorithm, the approach achieves near-optimal performance when predictions are accurate (consistency) and degrades gracefully when predictions are erroneous (robustness and smoothness). Tightness of the framework on specific instances is also established.
A Topological Rewriting of Tarski's Mereogeometry: This work extends the λ-MM library within the Coq theorem prover to recast Tarski's solid geometry—grounded in Leśniewski's mereology—into a fully formalized system with a complete topological structure. It proves that mereological classes correspond to regular open sets, satisfy Kuratowski's interior axioms, and exhibit the Hausdorff (T2) separation property, thereby providing a unified mereological–geometric–topological theoretical framework for qualitative spatial reasoning.
UniShape: A Unified Shape-Aware Foundation Model for Time Series Classification: This paper proposes UniShape — the first shape-aware foundation model for time series classification (TSC). It captures class-discriminative temporal patterns via a shape-aware adapter that adaptively aggregates multi-scale subsequences (shapes), and jointly learns transferable shapelet representations at both instance and shape levels through a prototype-based pretraining module. Pretrained on 1.89M samples, UniShape achieves an average accuracy of 0.8708 across 128 UCR datasets, surpassing all baselines.
Agent-SAMA: State-Aware Mobile Assistant: This paper proposes Agent-SAMA, which for the first time introduces a finite state machine (FSM) into mobile GUI agents, modeling UI screens as states and user actions as transitions. Four specialized agents collaborate to achieve state-aware task planning, execution verification, and error recovery, improving success rate by up to 12% and recovery rate by 13.8% on cross-app benchmarks.
Align When They Want, Complement When They Need! Human-Centered Ensembles for Adaptive Human-AI Collaboration: This paper reveals a fundamental trade-off between complementarity and alignment in human-AI collaboration—no single model can simultaneously optimize both objectives. It proposes an adaptive AI ensemble framework that dynamically switches between an alignment model and a complementarity model via a Rational Routing Shortcut (RRS) mechanism, achieving up to 9% improvement in team accuracy over standard AI.
AMS-IO-Bench and AMS-IO-Agent: Benchmarking and Structured Reasoning for Analog and Mixed-Signal Integrated Circuit Input/Output Design: This paper proposes AMS-IO-Agent, a domain-specific LLM-based agent that transforms natural language design intent into production-ready analog and mixed-signal IC I/O ring designs via a structured Intent Graph and a domain knowledge base. It also introduces AMS-IO-Bench, the first benchmark for AMS I/O ring automation. The agent-generated I/O ring is validated in a 28nm CMOS tape-out and demonstrated to be directly applicable to real chip fabrication.
An Epistemic Perspective on Agent Awareness: This paper is the first to treat agent awareness as a form of knowledge, distinguishing two awareness modalities — de re (concerning physical objects) and de dicto (concerning concepts/descriptions) — and proposes a sound and complete logical system grounded in 2D semantics to characterize the interaction between these two modalities and the standard "factual knowledge" modality.
Approximation Algorithm for Constrained k-Center Clustering: A Local Search Approach: This paper studies the k-center clustering problem with instance-level cannot-link (CL) and must-link (ML) constraints. It proposes a local search framework based on a dominating matching set (DMS) reduction, and, under the disjoint CL sets condition, is the first to achieve the optimal approximation ratio of 2 via local search—resolving an open problem in the field.
Area-Optimal Control Strategies for Heterogeneous Multi-Agent Pursuit: This paper studies pursuit-evasion games with heterogeneous speeds involving multiple pursuers and a single evader. The evader's safe reachable set is defined as the intersection of Apollonius circles for all pursuer–evader pairs. The capture strategy is modeled as a zero-sum game in which pursuers minimize and the evader maximizes the area of this intersection. Closed-form instantaneous optimal heading control laws are derived, and simulations verify that pursuers can systematically shrink the safe region to guarantee capture.
Automated Reproducibility Has a Problem Statement Problem: This paper proposes a formalized problem definition of reproducibility grounded in the scientific method, representing empirical AI research as a hypothesis–experiment–interpretation graph structure. An LLM is used to automatically extract this structure from 20 papers, and the extracted results are validated through review by the original authors.
Autonomous Concept Drift Threshold Determination: This paper proves that no fixed threshold can be optimal across all scenarios and that dynamic thresholds strictly dominate static ones. It proposes the DTD algorithm, which starts a three-model comparison phase whenever a drift detection signal is triggered and adaptively adjusts the detection threshold based on candidate model performance.
Bandit Learning in Housing Markets: This paper is the first to introduce the multi-armed bandit (MAB) framework into housing markets (one-sided matching markets). It defines regret based on the core solution concept, proposes a decentralized ETC algorithm and a centralized UCB algorithm, proves a decentralized regret upper bound of $\mathcal{O}(N\log T / \Delta_{\min}^2)$ along with a matching lower bound, and establishes order-optimality.
Bayesian Network Structural Consensus via Greedy Min-Cut Analysis: This paper proposes the MCBNC algorithm, which quantifies the structural support of each edge via min-cut analysis and embeds this scoring into the backward phase of Greedy Equivalence Search (GES) to iteratively prune redundant edges from a fused Bayesian network. The method produces sparser and more accurate consensus structures without accessing any data, making it well-suited for federated learning scenarios.
Beyond World Models: Rethinking Understanding in AI Models: Through three case studies drawn from the philosophy of science — a domino computer, a mathematical proof, and Bohr's atomic theory — this paper argues that the world model framework is insufficient to characterize human-level "understanding," demonstrating that tracking states and state transitions alone cannot capture the abstract reasoning, motivational insight, and problem-context awareness that understanding requires.
Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning: This paper proposes Bilevel MCTS, which launches a depth-proportional budgeted best-first search at the leaf node selected by MCTS, reducing the amortized node-selection complexity from $O(\log N)$ to $O(1)$. Complemented by Tree Collapsing to reduce the number of action-selection steps, these components are integrated into the Nεbula planner, which solves 192.2/230.6 problems on IPC2018/2023 benchmarks (5min/30min), outperforming all prior SOTA planners including LAMA, DecStar, NOLAN, and SM-Type-LAMA.
Bipartite Mode Matching for Vision Training Set Search from a Hierarchical Data Server: This paper proposes a hierarchical data server combined with a Bipartite Mode Matching (BMM) framework. It organizes large-scale source data via multi-granularity hierarchical clustering and employs the Hungarian algorithm to perform one-to-one matching between semantic modes of the source and target domains, thereby retrieving a training set that minimizes the distributional gap to the target domain. The approach significantly outperforms existing training set search methods on person re-identification and object detection tasks.
Boosting Adversarial Transferability via Ensemble Non-Attention: This paper proposes NAMEA (Non-Attention Meta Ensemble Attack), which for the first time exploits the non-attention areas of ensemble models to integrate transferable information from both CNNs and ViTs. Combined with meta-learning gradient optimization, it achieves average improvements of 15.0% and 9.6% over the state-of-the-art methods AdaEA and SMER, respectively, on cross-architecture adversarial transferability.
Bridging the Skills Gap: A Course Model for Modern Generative AI Education: This paper proposes a generative AI application course model for undergraduate and graduate computer science students. A mixed-methods survey demonstrates that the course is effective in bridging the generative AI skills gap between industry and academia, with students broadly rating it as valuable and impactful.
Cash Flow Underwriting with Bank Transaction Data: Advancing MSME Financial Inclusion in Malaysia: This paper proposes an end-to-end cash flow underwriting workflow based on bank transaction data and constructs the first Malaysian MSME bank statement dataset (611 loan records). It demonstrates that features derived from bank transactions improve a logistic regression model's AUROC from 0.672 to 0.850 compared to traditional application information alone, significantly enhancing credit assessment capability for MSMEs lacking credit histories.
CASL: Curvature-Augmented Self-supervised Learning for 3D Anomaly Detection: This work identifies point cloud curvature as a powerful cue for anomaly detection and proposes CASL, a curvature-augmented self-supervised learning framework. By guiding coordinate reconstruction with multi-scale curvature prompts, CASL learns generalizable 3D representations without any anomaly-detection-specific mechanisms, achieving a 5.6% O-AUROC improvement over the previous state of the art on Real3D-AD.
CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding: This paper proposes CAT-Net (Cross-Attention Tone Network), which achieves Mandarin four-tone classification using only 20 EEG channels and 5 EMG channels via spatial-temporal feature extraction branches, a cross-attention fusion mechanism, and domain adversarial training. The model achieves 87.83%/88.08% accuracy under voiced/silent speech conditions and 83.27%/85.10% under cross-subject evaluation, outperforming all 8 baseline methods.
CellStream: Dynamical Optimal Transport Informed Embeddings for Reconstructing Cellular Trajectories from Snapshots Data: This paper proposes CellStream, a deep learning framework that jointly learns an autoencoder and unbalanced dynamical optimal transport (OT) to simultaneously obtain low-dimensional embeddings and continuous cellular dynamics from discrete-time single-cell snapshot data, achieving significant improvements over existing methods in temporal consistency and velocity consistency.
Center-Outward q-Dominance: A Sample-Computable Proxy for Strong Stochastic Dominance in Multi-Objective Optimisation: Building on the center-outward distribution function from optimal transport theory, this paper proposes the q-dominance relation as a computable approximation of strong first-order stochastic dominance (strong FSD). It proves that q-dominance over the full quantile range implies strong FSD, derives explicit sample-size thresholds for Type I error control, and validates practical utility in hyperparameter tuning ranking and noisy multi-objective optimisation.
Certified Branch-and-Bound MaxSAT Solving (Extended Version): This paper introduces VeriPB-based certification for Branch-and-Bound MaxSAT solvers, covering two core techniques: look-ahead bounding methods and multi-valued decision diagram (MDD) encodings. Experiments on the MaxCDCL solver demonstrate a median proof logging overhead of only 19%, filling the last remaining gap in certified MaxSAT solving paradigms.
Certified but Fooled! Breaking Certified Defences with Ghost Certificates: This paper proposes GhostCert, a salient-region-based adversarial attack that misleads classifiers while maintaining imperceptible perturbations and forging large-radius robustness certificates (ghost certificates). On ImageNet, GhostCert achieves substantially higher attack success rates and larger spoofed certification radii than Shadow Attack against state-of-the-art certified defences including DensePure.
Clinician-in-the-Loop Smart Home System to Detect Urinary Tract Infection Flare-Ups via Uncertainty-Aware Decision Support: This paper proposes a clinician-in-the-loop smart home system that extracts behavioral markers from ambient sensor data and introduces a novel Conformal Calibrated Interval (CCI) method to quantify predictive uncertainty, enabling reliable detection of urinary tract infection (UTI) flare-ups in older adults and supporting an "abstain when uncertain" decision paradigm.
CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking: This paper proposes CompTrack—the first 3D single object tracking framework that simultaneously addresses both spatial redundancy and information redundancy in LiDAR point clouds. A Spatial Foreground Predictor (SFP) filters background noise via information entropy, while an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module estimates effective rank via online SVD and compresses foreground tokens into compact proxy tokens. CompTrack achieves state-of-the-art performance on nuScenes and Waymo while running in real time at 90 FPS.
Controllable Financial Market Generation with Diffusion Guided Meta Agent: This paper proposes the Diffusion Guided Meta Agent (DigMA), which formalizes controllable financial market generation as a conditional generation task. A conditional diffusion model captures the dynamics of market states (time-varying distribution parameters of mid-price returns and order arrival rates), while a Meta Agent with financial economics priors generates order flow under the guidance of the controller. DigMA outperforms existing methods in both controllability and generation fidelity.
Cost-Free Neutrality for the River Method: For the Parallel Universes Tiebreaking (PUT) problem applied to the River voting method, this paper proves that the winner set can be computed in polynomial time (in contrast to the NP-completeness of Ranked Pairs), and proposes the Fused-Universe (FUN) algorithm, which simulates all possible tiebreaking orders in a single pass and provides a constructive certificate for each winner.
Data Complexity of Querying Description Logic Knowledge Bases under Cost-Based Semantics: This paper systematically investigates the data complexity of query answering over weighted description logic knowledge bases under cost-based semantics. It establishes that optimal-cost semantics is decidable within $\Delta_2^p$, and delivers a surprising positive result: for DL-Lite$_{\text{bool}}^{\mathcal{H}}$ ontologies with a fixed cost bound, both certain answers to instance queries and possible answers to conjunctive queries admit first-order rewritings, achieving the lowest possible data complexity (AC$^0$).
Deadline-Aware, Energy-Efficient Control of Domestic Immersion Hot Water Heaters: This paper proposes a deadline-aware energy-efficient control method for domestic hot water heaters. Using a Gymnasium-based simulation environment, it benchmarks a bang-bang baseline, an MCTS planner, and a PPO policy, demonstrating that PPO achieves up to 69% energy savings under identical physical conditions.
Decomposition and Preprocessing of Ternary Constraint Networks: This paper proposes a complete theoretical framework for formally decomposing arbitrary discrete constraint networks into ternary constraint networks (TCNs), and reduces the variable/constraint blowup introduced by decomposition from a median of 8x/6x to 4.8x/4.3x via seven preprocessing techniques (propagation, algebraic simplification, common subexpression elimination, etc.), providing a regularized data layout for efficient constraint solving on GPU hardware.
DECOR: Deep Embedding Clustering with Orientation Robustness: This paper proposes DECOR, a framework that achieves orientation-robust clustering of wafer map defect patterns via a rotation-invariant equivariant convolutional autoencoder (RCAE), non-parametric clustering (DeepDPM), and an ensemble anomaly detection mechanism.
DeepRWCap: Neural-Guided Random-Walk Capacitance Solver for IC Design: This paper proposes DeepRWCap, a machine-learning-guided random-walk capacitance solver that accelerates multi-dielectric capacitance extraction in IC design via a two-stage neural network architecture for transition kernel prediction, achieving an average error of 1.24% and a 23% speedup across 10 industrial test cases.
Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation: This paper proposes DMDNet, which employs a depth-aware scanning strategy (DAScan) to guide Mamba toward salient structures, incorporates a depth-synergized state space model (DS-SSM) to suppress ambiguous feature propagation, and introduces a memory expert compensation module (MECM) to leverage cross-image historical knowledge, achieving all-day (daytime + nighttime) image reflection separation.
Description Logics with Two Types of Definite Descriptions: Complexity, Expressiveness, and Automated Deduction: This paper extends the description logic $\mathcal{ALC}$ with two types of definite descriptions, local definite descriptions $\{\iota C\}$ and global definite descriptions $\iota C.D$, and proves that satisfiability is ExpTime-complete in all three resulting logics ($\mathcal{ALC}\iota_L$, $\mathcal{ALC}\iota_G$, and their combination $\mathcal{ALC}\iota$). Furthermore, it establishes that global definite descriptions are strictly more expressive than local ones ($\mathcal{ALC}\iota_L < \mathcal{ALC}\iota_G = \mathcal{ALC}\iota$), and provides tableau-based decision procedures along with an experimental evaluation.
Designing Incident Reporting Systems for Harms from General-Purpose AI: Through a literature review and case studies of nine safety-critical industries (nuclear energy, aviation, healthcare, etc.), this paper proposes a seven-dimensional institutional design framework for AI incident reporting systems, providing systematic guidance for policy design of general-purpose AI incident reporting in the United States.
DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes: This paper proposes FlexDeMo — a hybrid sharding training strategy that integrates Fully Sharded Data Parallelism (FSDP) with decoupled momentum optimization. It applies FSDP sharding within nodes and synchronizes only the fast-moving momentum components across nodes, achieving loss convergence comparable to full-synchronization AdamW while substantially accelerating training.
Deviation Dynamics in Cardinal Hedonic Games: This paper establishes meta-theorems for deviation dynamics in cardinal hedonic games, showing that the computational complexity of determining whether deviation dynamics may or must converge can be derived directly from instances in which no stable outcome exists. The paper further proposes methods for finding individually rational and contractually individually stable partitions via deviation dynamics in additively separable hedonic games.
DFDT: Dynamic Fast Decision Tree for IoT Data Stream Mining on Edge Devices: This paper proposes DFDT (Dynamic Fast Decision Tree), a memory-constrained data stream mining algorithm for IoT edge devices. Through three coordinated mechanisms — activity-aware pre-pruning, dynamic grace period, and adaptive tie threshold — DFDT achieves an optimal trade-off among accuracy, memory usage, and runtime.
DiffMM: Efficient Method for Accurate Noisy and Sparse Trajectory Map Matching via One Step Diffusion: This paper proposes DiffMM, the first approach to introduce diffusion models into map matching. By combining a road-segment-aware trajectory encoder with a one-step Shortcut diffusion process, DiffMM achieves simultaneous improvements in accuracy and efficiency on sparse trajectories and complex road networks, with inference speed approximately 17× faster than the second-best method.
DS-ATGO: Dual-Stage Synergistic Learning via Forward Adaptive Threshold and Backward Gradient Optimization for Spiking Neural Networks: To address spike firing imbalance and gradient vanishing caused by membrane potential distribution shifts during SNN training, this paper proposes DS-ATGO — a dual-stage synergistic learning algorithm combining forward adaptive thresholding (AT) and backward threshold-driven gradient optimization (TGO) — achieving state-of-the-art performance on CIFAR-10/100 and ImageNet with low time-step latency.
Enhancing Control Policy Smoothness by Aligning Actions with Predictions from Preceding States: This paper proposes ASAP (Action Smoothing by Aligning Actions with Predictions from Preceding States), a reinforcement learning action smoothing method based on transition-induced similar state definitions. ASAP suppresses high-frequency action oscillations via a spatial constraint (aligning actions with predictions from the preceding state) and a temporal constraint (penalizing second-order action differences). It outperforms existing methods on Gymnasium and Isaac-Lab benchmarks.
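The temporal constraint mentioned in the ASAP entry above (penalizing second-order action differences) can be illustrated with a small sketch. The squared L2 form and the function name `second_order_penalty` are assumptions made for illustration; the summary does not state the exact loss used in the paper.

```python
def second_order_penalty(actions):
    """Sum of squared second-order differences over an action sequence.

    `actions` is a list of action vectors (lists of floats), one per step.
    The squared L2 form is an assumption, not the paper's stated loss.
    """
    total = 0.0
    # Slide a window of three consecutive actions: a[t-1], a[t], a[t+1].
    for prev, cur, nxt in zip(actions, actions[1:], actions[2:]):
        # Discrete second derivative: a[t+1] - 2*a[t] + a[t-1], per dimension.
        total += sum((n - 2 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return total


smooth = [[0.0], [1.0], [2.0], [3.0]]   # linear ramp, no oscillation
jitter = [[0.0], [1.0], [0.0], [1.0]]   # high-frequency oscillation
print(second_order_penalty(smooth))  # 0.0
print(second_order_penalty(jitter))  # 8.0
```

A linear ramp incurs no penalty, while an oscillating sequence is penalized, which is the behavior a smoothness term of this kind is meant to encourage.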
Enhancing Noise Resilience in Face Clustering via Sparse Differential Transformer: This paper proposes a prediction-driven Top-K Jaccard similarity coefficient to improve neighbor purity, combined with a Sparse Differential Transformer (SDT) to eliminate noisy attention, achieving state-of-the-art performance on large-scale face clustering datasets such as MS-Celeb-1M.
Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning: This paper proposes EDD (Expandable and Differentiable Dual Memory), an exemplar-free continual learning method that decomposes data into reusable sub-features via differentiable shared and task-specific memories, combined with memory expansion-pruning and orthogonal regularization mechanisms. EDD surpasses 14 state-of-the-art methods on CIFAR-10/100 and Tiny-ImageNet, achieving final accuracies of 55.13%, 37.24%, and 30.11%, respectively.
Expressive Temporal Specifications for Reward Monitoring: This paper leverages quantitative linear temporal logic (LTLf[F]) to automatically synthesize quantitative reward monitors (QRMs) that generate dense, continuous-valued reward streams for reinforcement learning agents at runtime, fundamentally alleviating the sparse reward problem in long-horizon tasks under Boolean semantics.
Extreme Value Monte Carlo Tree Search for Classical Planning: This paper applies Peaks-Over-Threshold Extreme Value Theory (POT EVT) to provide a statistical foundation for Full Bellman Backup in MCTS for classical planning. It proposes the UCB1-Uniform bandit algorithm, which uses MLE under a uniform distribution (a special case of the Generalized Pareto distribution) to guide action selection, outperforming GBFS by 67.8 instances and Softmin-Type(h) by 33.2 instances under a $10^4$ node budget on Pyperplan.
Faster Certified Symmetry Breaking Using Orders With Auxiliary Variables: By introducing auxiliary variables to encode lexicographic order in place of large-integer encodings, this work fundamentally redesigns the VeriPB proof system, achieving order-of-magnitude speedups in both proof generation and verification for certified SAT symmetry breaking, both theoretically and empirically.
Finding Diverse Solutions Parameterized by Cliquewidth: This paper extends the parameterized framework for finding diverse solutions from treewidth to the strictly more powerful graph parameter cliquewidth, proving that any monotone dynamic programming algorithm parameterized by a cliquewidth decomposition can be converted into an algorithm for the diverse version with minimal overhead. A new family of Venn diversity measures is also proposed.
Forest vs Tree: The (N, K) Trade-off in Reproducible ML Evaluation: This paper investigates the optimal trade-off between the number of samples $N$ and the number of annotators per sample $K$ in machine learning evaluation. Under a fixed budget $N \times K$, by analyzing multi-annotator datasets and simulated distributions, the study finds that $K > 10$ is generally optimal when annotator disagreement is considered, and the required total budget $N \times K$ typically does not exceed 1000.
Forget Less by Learning from Parents Through Hierarchical Relationships: This paper proposes FLLP (Forget Less by Learning from Parents), a framework that mitigates catastrophic forgetting in custom diffusion models (CDMs) by establishing parent-child hierarchical relationships among concepts in hyperbolic space. It leverages the tree-structure modeling capability of the Lorentz manifold to preserve knowledge during new concept learning and enable continual concept integration.
Formal Abductive Latent Explanations for Prototype-Based Networks: This paper addresses the problem of misleading explanations in prototype-based networks (e.g., ProtoPNet) by proposing Abductive Latent Explanations (ALE), which construct formally guaranteed sufficient-condition explanations directly in latent space—without invoking external solvers—and scale to standard and fine-grained classification tasks across multiple datasets.
From Decision Trees to Boolean Logic: A Fast and Unified SHAP Algorithm: This paper proposes Woodelf, an algorithm that converts decision tree ensemble models into pseudo-Boolean functions in Weighted Disjunctive Normal Form (WDNF), enabling linear-time computation of both Background SHAP and Path-Dependent SHAP within a unified framework, achieving 16–31× CPU speedup and 24–333× GPU speedup on large-scale datasets.
From Sequential to Recursive: Enhancing Decision-Focused Learning with Bidirectional Feedback: This paper proposes the first Recursive Decision-Focused Learning (R-DFL) framework, which introduces a bidirectional feedback loop between the prediction module and the optimization module, breaking the unidirectional information flow of conventional sequential DFL. Two gradient propagation methods—explicit unrolling and implicit differentiation—are designed, achieving significant improvements in final decision quality on the newsvendor problem and bipartite matching problem.
Guided Perturbation Sensitivity (GPS): Detecting Adversarial Text via Embedding Stability and Word Importance: This paper proposes the Guided Perturbation Sensitivity (GPS) framework, which detects adversarial text samples by masking important words and measuring changes in embedding stability. GPS achieves 85%+ detection accuracy across 3 datasets, 3 attack types, and 2 models, and generalizes across datasets, attacks, and models without retraining.
CAE: Hierarchical Semantic Alignment for Image Clustering: By combining two complementary semantic sources — noun-level (WordNet) and caption-level (Flickr image captions) — and constructing a semantic space via optimal transport alignment followed by adaptive fusion, this work achieves training-free image clustering with a 4.2% accuracy improvement on ImageNet-1K.
Higher-Order Responsibility: This paper studies higher-order responsibility in sequential decision-making mechanisms and establishes two core theorems: (1) any mechanism with $n$ agents is necessarily $n$-gap-free (i.e., a responsible agent can always be found at some order); (2) determining whether a mechanism is $d$-gap-free is $\Pi_{2d+1}$-complete.
How Hard is it to Explain Preferences Using Few Boolean Attributes?: This paper systematically investigates the computational complexity of explaining preference data using the Boolean Attribute Model (BAM). It proves that the problem is NP-complete for $k \geq 3$ attributes and solvable in linear time for $k \leq 2$; further, it provides a complete parameterized complexity landscape with respect to the number of voters $n$, candidates $m$, and attributes $k$, and analyzes how problem hardness changes when partial information (cares/has) is known.
How Hard Is It to Rig a Tournament When Few Players Can Beat or Be Beaten by the Favorite?: This paper introduces two novel structural parameters — the in-degree $k$ and out-degree $\ell$ of the favorite player in the tournament digraph — for analyzing the Tournament Fixing Problem (TFP). It proves that TFP is FPT under both parameterizations, where the in-degree algorithm involves sophisticated structural analysis and the color coding technique.
How to Marginalize in Causal Structure Learning?: This paper employs tractable Probabilistic Circuits (PCs) as a replacement for traditional dynamic programming to perform marginalization in Bayesian structure learning. Through a novel two-stage training strategy—first learning full parent set scores and then progressively fine-tuning for marginal queries—the method eliminates the artificial restriction on the number of candidate parent sets, achieving improved posterior distribution estimation within the TRUST framework.
How Wide and How Deep? Mitigating Over-Squashing of GNNs via Channel Capacity Constrained Estimation: From an information-theoretic perspective, this paper models spectral GNNs as communication channels and proposes the Channel Capacity Constrained Estimation (C3E) framework, which formalizes the selection of GNN hidden dimensions and depth as a nonlinear programming problem. The framework estimates optimal architectural parameters prior to training, effectively mitigating over-squashing and consistently improving representation learning across 9 datasets.
HyperSHAP: Shapley Values and Interactions for Explaining Hyperparameter Optimization: HyperSHAP proposes a game-theoretic framework based on Shapley values and Shapley interactions to explain hyperparameter optimization (HPO). By defining four categories of explanation games—ablation, sensitivity, tunability, and optimizer bias—it provides more actionable hyperparameter importance analysis than fANOVA.
I2E: Real-Time Image-to-Event Conversion for High-Performance Spiking Neural Networks: I2E proposes an ultra-efficient image-to-event stream conversion framework that simulates microsaccadic eye movements and implements the conversion via highly parallelized convolutions, achieving over 300× speedup compared to prior methods. It enables online data augmentation during SNN training for the first time, achieves a state-of-the-art 60.50% event-based classification accuracy on I2E-ImageNet, and sets a new record of 92.5% on CIFAR10-DVS through a sim-to-real paradigm of synthetic pretraining followed by real-data fine-tuning.
Improved Differentially Private Algorithms for Rank Aggregation: This paper presents improved approximation algorithms for rank aggregation under differential privacy. It introduces the first study of differentially private footrule rank aggregation with a near-optimal algorithm (which also yields a 2-approximation for the Kemeny problem), and improves the additive error of the Kemeny PTAS by combining two-way marginal queries with an unbiasedness technique (reducing the exponent of $m$ from 3 to 65/22).
Intermediate N-Gramming: Deterministic and Fast N-Grams For Large N and Large Datasets: This paper proposes Intergrams, a multi-pass scanning algorithm that recursively uses shorter n-grams as prefixes to filter candidates for longer n-grams, fully exploiting the processor cache hierarchy to achieve cache-friendly memory access patterns. On TB-scale datasets, Intergrams achieves 6–33× speedup over the previously fastest hash-gramming method while recovering nearly all top-k n-grams with high accuracy.
Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis: This paper formalizes AI alignment as an $\langle M,N,\varepsilon,\delta\rangle$-agreement multi-objective optimization problem, proves information-theoretic lower bounds on alignment (encoding "all human values" is fundamentally intractable) from a communication complexity perspective, and derives explicit achievable algorithms and tight upper bounds for unbounded/bounded rational agents, revealing the theoretical basis for the global inevitability of reward hacking in large state spaces.
Judging by the Rules: Compliance-Aligned Framework for Modern Slavery Statement Monitoring: This paper proposes a training framework centered on a Compliance-Aligned Judge (CA-Judge) that trains a 3B-parameter CALLM model using rule-level alignment feedback, enabling the generation of traceable compliance judgments grounded in statutory provisions. The model surpasses GPT-4o and DeepSeek-R1 on sentence-level compliance classification of modern slavery statements.
LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval: This paper proposes LeanRAG, a framework that employs a semantic aggregation algorithm to automatically construct explicit relations among summary nodes in a hierarchical knowledge graph, thereby breaking "semantic islands." Combined with a bottom-up retrieval strategy based on the Lowest Common Ancestor (LCA), LeanRAG efficiently navigates the hierarchical structure, achieving state-of-the-art performance on four QA benchmarks while reducing retrieval redundancy by 46%.
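The Lowest Common Ancestor (LCA) that the LeanRAG entry above relies on can be sketched for a simple hierarchy stored as child-to-parent links. The node names and the `parent` dictionary below are hypothetical, and this is only the generic LCA computation, not LeanRAG's retrieval strategy.

```python
def lowest_common_ancestor(parent, a, b):
    """Return the deepest node that is an ancestor of both a and b.

    `parent` maps each node to its parent; the root maps to None.
    """
    # Collect a and all of its ancestors up to the root.
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)
    # Walk up from b; the first node already seen is the LCA.
    node = b
    while node is not None:
        if node in ancestors:
            return node
        node = parent.get(node)
    return None


# Hypothetical two-level hierarchy: entities under summary nodes under a root.
parent = {
    "root": None,
    "summary_1": "root", "summary_2": "root",
    "entity_a": "summary_1", "entity_b": "summary_1", "entity_c": "summary_2",
}
print(lowest_common_ancestor(parent, "entity_a", "entity_b"))  # summary_1
print(lowest_common_ancestor(parent, "entity_a", "entity_c"))  # root
```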
Learning Compact Latent Space for Representing Neural Signed Distance Functions with High-fidelity Geometry Details: This paper proposes a dual-branch architecture (generalization branch + overfitting branch) to learn a compact latent space over multiple neural SDFs. By combining a shared spatial feature grid with a novel bandwidth-based sampling strategy, the method recovers high-fidelity geometric details while maintaining compact latent codes, achieving state-of-the-art performance on Stanford Models, ShapeNet, and D-FAUST.
Learning Fair Representations with Kolmogorov-Arnold Networks: This paper proposes integrating Kolmogorov-Arnold Networks (KAN) into an adversarial debiasing framework, leveraging KAN's spline-based architecture to provide theoretical guarantees of Lipschitz continuity and smoothness. An adaptive $\lambda$ update mechanism is introduced to dynamically balance fairness and accuracy. The approach achieves significant improvements on fairness metrics on the UCI college admissions dataset.
Learning Network Dismantling Without Handcrafted Inputs: This paper proposes MIND (Message Iteration Network Dismantler), which eliminates the dependence of GNNs on handcrafted features through a novel All-to-One attention mechanism and Message Iteration Profiles. Using only raw adjacency information, MIND achieves state-of-the-art network dismantling performance on real-world networks with up to millions of nodes, while maintaining the lowest computational complexity of $O(|V|+|E|)$.
Life, Machine Learning, and the Search for Habitability: Predicting Biosignature Fluxes for the Habitable Worlds Observatory: To address the observation prioritization needs of NASA's Habitable Worlds Observatory (HWO), this paper proposes two architectures — a Bayesian Convolutional Neural Network (BCNN) and a novel Spectral Query-Adaptive Transformer (SQuAT) — for predicting biosignature species fluxes from planetary reflected spectra. Both achieve high predictive accuracy on an augmented dataset, with complementary strengths in uncertainty quantification and interpretability, respectively.
LILAD: Learning In-context Lyapunov-stable Adaptive Dynamics Models: This paper proposes LILAD, a framework that leverages the in-context learning (ICL) capability of GPT-2 to jointly learn a dynamics model and a Lyapunov function, achieving adaptive identification of non-stationary parametric dynamical systems while guaranteeing global exponential stability. LILAD outperforms baselines such as ICL and MAML on multiple benchmark systems.
Local Guidance for Configuration-Based Multi-Agent Pathfinding: This paper introduces the concept of Local Guidance (LG) to improve solution quality in LaCAM-based multi-agent pathfinding. By constructing local space-time paths for each agent at every configuration generation step, LG mitigates congestion and reduces solution cost by up to 50%, while maintaining completion within a few seconds for 1,000 agents.
Lost in Time? A Meta-Learning Framework for Time-Shift-Tolerant Physiological Signal Transformation: This paper proposes ShiftSyncNet, a bi-level meta-learning optimization framework that trains a SyncNet to learn temporal offsets between training signal pairs and leverages the Fourier shift theorem to automatically correct label alignment, achieving waveform transformation accuracy improvements of 9.4%, 6.0%, and 12.8% across three datasets, respectively.
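The Fourier shift theorem referenced in the ShiftSyncNet entry above states that delaying a signal by $\tau$ samples multiplies its $k$-th DFT coefficient by $e^{-2\pi i k \tau / N}$. The following is a minimal standard-library sketch of that theorem alone; the helper names are hypothetical and nothing here reproduces the paper's SyncNet or training procedure.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fourier_shift(x, shift):
    """Circularly delay x by `shift` samples using a phase ramp."""
    n = len(x)
    spectrum = dft(x)
    shifted = []
    for k in range(n):
        # Use signed frequencies so the phase ramp is symmetric around zero.
        freq = k if k <= n // 2 else k - n
        shifted.append(spectrum[k] * cmath.exp(-2j * cmath.pi * freq * shift / n))
    return [value.real for value in idft(shifted)]


signal = [0, 1, 2, 3, 4, 5, 6, 7]
print([round(v) for v in fourier_shift(signal, 2)])  # [6, 7, 0, 1, 2, 3, 4, 5]
```

Because the shift enters only through a phase term, it is differentiable with respect to `shift`, which is one reason such a formulation is convenient when an offset has to be learned.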
Measuring Model Performance in the Presence of an Intervention: To address the bias in AI model evaluation under interventions, this paper proposes Nuisance Parameter Weighting (NPW), which applies causal reweighting to the treatment arm of RCT data to achieve unbiased AUROC estimation. The method achieves a 5× improvement in sample efficiency and substantially improves statistical power for model selection and hypothesis testing.
MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement: This paper proposes MF-Speech, a framework that employs multi-objective optimization to disentangle speech signals into three high-purity, independent factor representations—content, timbre, and emotion—and subsequently leverages dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN) to achieve fine-grained, compositional control in speech generation, significantly outperforming existing methods on multi-factor compositional speech generation tasks (WER=4.67%, SECS=0.5685).
MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals: This paper proposes MindCross, a cross-subject brain decoding framework that learns subject-independent information via a shared encoder and subject-specific information via $N$ individual encoders. Combined with a fast calibration stage and a Top-K collaborative decoding module, a single unified model achieves performance comparable to per-subject models on fMRI/EEG-to-video benchmarks, with new subject adaptation requiring only minimal data and time (~1s vs. 5–17s for baselines).
Model Change for Description Logic Concepts: This paper studies the problem of modifying description logic concepts in response to new model evidence represented as pointed interpretations. It defines three operations — eviction, reception, and revision — and establishes positive and negative compatibility results for the EL and ALC description logics.
Model Counting for Dependency Quantified Boolean Formulas: This paper presents the first study of the model counting problem for Dependency Quantified Boolean Formulas (DQBF). It proves that #2-DQBF — restricted to only two existential variables — is already #EXP-complete, and implements a practical 2-DQBF model counter, sharp2DQR, based on BDD symbolic reachability. The proposed approach significantly outperforms unfolding-based baselines on instances with large dependency sets.
On the Edge of Core (Non-)Emptiness: An Automated Reasoning Approach to Approval-Based Multi-Winner Voting: This paper proposes an automated reasoning framework based on Mixed Integer Linear Programming (MILP) to investigate the major open problem of whether core stability always exists in approval-based multi-winner voting. The framework establishes new existence results, uncovers previously unknown relationships between core stability and other axioms (e.g., Lindahl pricability), and refutes an existing conjecture.
On the Information Processing of One-Dimensional Wasserstein Distances with Finite Samples: This paper analytically characterizes, via a Poisson process framework, the ability of the one-dimensional Wasserstein distance under finite samples to simultaneously encode pointwise density differences (rate difference) and support differences between probability density functions, and validates its practical utility on neural spike data and amino acid contact frequency data.
On the Variability of Concept Activation Vectors: This paper presents the first theoretical analysis of the variability of Concept Activation Vectors (CAVs) in the TCAV framework. It proves that the variance of CAVs decays at a rate of $O(1/N)$ (where $N$ is the number of random samples), while the variance of TCAV scores remains $O(1)$ due to "boundary points," and can only be reduced to $O(1/s)$ by averaging over $s$ runs.
Online Linear Regression with Paid Stochastic Features: This paper studies a novel setting in online linear regression where features are corrupted by noise and the learner can pay to reduce noise intensity. It establishes that the optimal regret rate is $\widetilde{\mathcal{O}}(\sqrt{T})$ when the noise covariance is known and $\widetilde{\mathcal{O}}(T^{2/3})$ when it is unknown, and proves matching lower bounds showing that both rates are order-optimal in $T$.
Optimal Welfare in Noncooperative Network Formation under Attack: In the noncooperative network formation game model proposed by Goyal et al. (WINE 2016), this paper proves that equilibrium networks created by selfish agents maintain asymptotically optimal social welfare $n^2 - O(n)$ under a broad class of attackers, called super-quadratic disruption (SQD) attackers, which includes the maximum disruption attacker, thereby resolving a long-standing open problem.
OR-R1: Automating Modeling and Solving of Operations Research Optimization Problems: OR-R1 proposes a data-efficient two-stage training framework (SFT + TGRPO) that achieves an average solving accuracy of 67.7% using only 1/10 of the synthetic data required by ORLM, surpassing existing SOTA methods. Additionally, test-time reinforcement learning reduces the performance gap between single-sample generation (Pass@1) and multi-sample generation (Pass@8) from 13% to 7%.
ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations: This paper proposes ParaMETA, a unified paralinguistic speaking style representation learning framework that achieves disentangled representations of speaking styles—including emotion, age, gender, and language—through META space regularization and task-specific subspace projection, while simultaneously supporting downstream multi-task classification and style-controllable speech synthesis.
Parameterized Approximation Algorithms for TSP on Non-Metric Graphs: This paper proposes improved FPT approximation algorithms for the Travelling Salesman Problem (TSP) on non-metric graphs, parameterized by $p$ (the number of vertices involved in triangle inequality violations) and $q$ (the size of a minimum violation set), improving the approximation ratio under parameter $p$ from 2.5 to 1.5 and under parameter $q$ from 11 to 3.
ParaRevSNN: A Parallel Reversible Spiking Neural Network for Efficient Training and Inference: This paper proposes ParaRevSNN, a parallel reversible spiking neural network architecture that decouples sequential computation constraints by redesigning the data dependencies between reversible blocks, achieving inter-block parallelism while preserving reversibility (memory efficiency). Training time is reduced by up to 35.2% and inference time to 18.15%.
PIPHEN: Physical Interaction Prediction with Hamiltonian Energy Networks: This paper proposes PIPHEN, a distributed physical cognition-control framework that employs a Physical Interaction Prediction Network (PIPN) for "semantic distillation" to compress high-dimensional perceptual data to less than 5% of the original data volume, while a Hamiltonian energy conservation-based HEN controller generates coordinated actions, thereby addressing the "shared brain dilemma" in multi-robot systems.
Predict and Resist: Long-Term Accident Anticipation under Sensor Noise: A unified framework is proposed that integrates a diffusion-based dual-level denoising module with a temporally-aware Actor-Critic reinforcement learning model to enable robust long-term traffic accident anticipation under sensor noise, achieving state-of-the-art performance on three benchmark datasets in terms of both average precision (AP) and mean time-to-accident (mTTA).
Private Frequency Estimation via Residue Number Systems: This paper proposes ModularSubsetSelection (MSS), a local differential privacy frequency estimation protocol based on the Residue Number System (RNS). MSS achieves estimation accuracy comparable to SubsetSelection (SS) and PGR while significantly reducing communication overhead (up to 50% less than SS), substantially accelerating server-side decoding (11–448× faster than PGR), and attaining the lowest data reconstruction attack success rate.
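To make the Residue Number System in the entry above concrete, here is a minimal Python sketch of the RNS idea only, not of the MSS protocol itself: an integer is represented by its remainders modulo pairwise coprime moduli and recovered with the Chinese Remainder Theorem. The moduli `[3, 5, 7]` and the function names `to_residues` and `from_residues` are illustrative assumptions, not taken from the paper.

```python
from math import gcd, prod


def to_residues(value, moduli):
    """Represent value by its remainders modulo each modulus."""
    return [value % m for m in moduli]


def from_residues(residues, moduli):
    """Recover the integer in [0, prod(moduli)) via the Chinese Remainder Theorem."""
    # The reconstruction is only unique when the moduli are pairwise coprime.
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError("moduli must be pairwise coprime")
    total = prod(moduli)
    result = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(x, -1, m) computes the modular inverse (Python 3.8+).
        result += r * partial * pow(partial, -1, m)
    return result % total


moduli = [3, 5, 7]  # hypothetical moduli; they cover the values 0..104
residues = to_residues(52, moduli)
recovered = from_residues(residues, moduli)
```

Each residue is much smaller than the original value, which is the general property that makes residue representations attractive when message size matters; how MSS uses this inside a privacy mechanism is not shown here.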
Provably Data-Driven Projection Method for Quadratic Programming: This work extends data-driven projection matrix learning from linear programming (LP) to convex quadratic programming (QP). By proposing an "unrolled active set method" to model the computation of QP optimal values within the Goldberg–Jerrum (GJ) framework, it establishes a pseudo-dimension upper bound and generalization guarantees for projection matrix learning.
Radar-APLANC: Unsupervised Radar-based Heartbeat Sensing via Augmented Pseudo-Label and Noise Contrast: This paper proposes Radar-APLANC, the first unsupervised learning framework for radar-based heartbeat sensing. Through a noise contrastive triplet (NCT) loss and an augmented pseudo-label generator, it achieves two-stage unsupervised training without requiring expensive physiological signal annotations, attaining performance approaching supervised methods.
RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection: This paper proposes a Recursive Convolutional Autoencoder (RcAE) that progressively suppresses anomalies while preserving normal details through multi-step iterative reconstruction with shared parameters. Combined with a Cross-Recursive Detection module (CRD) that exploits multi-step reconstruction dynamics for robust anomaly localization, the method achieves performance comparable to state-of-the-art approaches using only 10% of the parameters required by diffusion models.
Reimagining Anomalies: What if Anomalies Were Normal?: This paper proposes the first counterfactual explanation framework for unsupervised image anomaly detection. By training a generator to modify anomalous samples into multiple disentangled counterfactuals perceived as normal by the detector, the framework answers at the semantic level: "What would an anomaly look like if it were normal?" This provides a depth of interpretability far exceeding traditional heatmap-based approaches.
Rethinking Flow and Diffusion Bridge Models for Speech Enhancement: This paper proposes a unified theoretical framework that subsumes flow matching, score-based diffusion, and Schrödinger bridge models for speech enhancement as processes that construct different Gaussian probability paths between paired data. It further reveals that each sampling step in such generative models is intrinsically equivalent to predictive speech enhancement, and leverages this insight to improve bridge model performance by adopting high-performance backbone networks, refined loss functions, and fine-tuning strategies from the predictive paradigm.
Reward Redistribution via Gaussian Process Likelihood Estimation: This paper proposes GP-LRR, a reward redistribution framework based on Gaussian process likelihood estimation. It explicitly models correlations among state-action pairs via kernel functions, and learns a step-wise reward function by maximizing the marginal likelihood of trajectory returns using a leave-one-out strategy. Theoretical analysis demonstrates that conventional MSE-based methods are a degenerate special case of GP-LRR. Experiments on MuJoCo benchmarks combined with SAC show superior sample efficiency and policy performance.
Scalable Vision-Guided Crop Yield Estimation: This paper proposes a crop yield estimation method based on Prediction-Powered Inference (PPI++), which leverages vision models trained on field photographs to supplement costly ground-truth crop cut measurements. The approach guarantees asymptotic unbiasedness while increasing effective sample size by up to 73%, enabling more accurate and cost-efficient regional yield estimation for agricultural insurance.
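As a rough illustration of the prediction-powered idea behind the entry above, the Python sketch below estimates a population mean from a small labeled set plus model predictions on a larger unlabeled set. It is a generic mean estimator under stated assumptions, not the paper's pipeline: the function name `ppi_mean`, the weight parameter `lam`, and the toy numbers are all hypothetical, and the data-driven choice of the weight that PPI++ adds is omitted.

```python
from statistics import fmean


def ppi_mean(y_labeled, pred_labeled, pred_unlabeled, lam=1.0):
    """Prediction-powered estimate of a population mean.

    lam=0 ignores the model and returns the plain labeled mean;
    lam=1 gives the basic prediction-powered estimator.
    """
    # Difference between the model's average prediction on the unlabeled
    # and labeled sets; its expectation is zero when both sets come from
    # the same population, so the estimate stays unbiased for any fixed lam.
    correction = fmean(pred_unlabeled) - fmean(pred_labeled)
    return fmean(y_labeled) + lam * correction


# Toy numbers made up for illustration: 3 measured fields, 5 photo-only fields.
y_labeled = [1.0, 2.0, 3.0]
pred_labeled = [2.0, 3.0, 4.0]
pred_unlabeled = [2.0, 3.0, 4.0, 5.0, 6.0]
estimate = ppi_mean(y_labeled, pred_labeled, pred_unlabeled)
```

With `lam=1.0` this equals the mean prediction on the unlabeled set plus the mean residual on the labeled set, so a biased model is corrected by the ground-truth measurements.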
Semi-Supervised High Dynamic Range Image Reconstructing via Bi-Level Uncertain Area Masking: This paper proposes a semi-supervised HDR reconstruction framework that evaluates pseudo HDR label quality via an uncertainty estimation branch, masking unreliable regions at both the patch and pixel levels. Using only 6.7% of HDR ground-truth annotations, the method achieves performance comparable to fully supervised state-of-the-art methods.
ShortageSim: Simulating Drug Shortages under Information Asymmetry: This paper proposes ShortageSim, the first LLM-based multi-agent simulation framework for drug shortages. It models strategic decision-making among FDA regulators, manufacturers, and buyers under information asymmetry, achieving an 84% improvement in predicting resolution lag time on historical shortage data, and provides a controlled testbed for evaluating regulatory strategies.
Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment: This paper proposes an Adaptive Teaching Paradigm (ATS) in which a residual-free bottleneck module, ShrinkAdapter, enables the visual "teacher" to actively shrink and restructure its knowledge to match the learning capacity of the EEG "student," achieving 60.2% Top-1 accuracy on zero-shot brain-image retrieval and surpassing the previous SOTA by 9.8 percentage points.
Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras: This paper proposes Spike Imaging Velocimetry (SIV), the first systematic application of spike cameras (20,000 Hz ultra-high temporal resolution) to fluid velocimetry. Three fluid-aware modules are designed: Detail-Preserving Hierarchical Transform (DPHT), Graph Encoder (GE), and Multi-Scale Velocity Refinement (MSVR). A new PSSD dataset is constructed, and SIV comprehensively outperforms existing baselines on steady-state turbulence, high-speed flow, and HDR scenarios.
STEM Faculty Perspectives on Generative AI in Higher Education: Through focus group research with 29 STEM faculty at a large public university in the United States, this study reveals how instructors integrate GenAI into teaching, the observed benefits and challenges for student learning, and the institutional support required. A key finding is that GenAI shifts faculty labor from content creation to expert review and may obscure students' underlying competency gaps.
Structural Approach to Guiding a Present-Biased Agent: This paper systematically investigates the parameterized complexity of the T-path-Editing problem within the principal-agent extension of the Kleinberg-Oren model. It presents FPT algorithms parameterized by treewidth and path-cost diversity, establishes tight hardness results, and comprehensively characterizes the tractability-intractability boundary for guiding a present-biased agent to complete critical tasks.
Structure-Aware Encodings of Argumentation Properties for Clique-width: This paper designs directed decomposition-guided (DDG) reductions from abstract argumentation problems to (Q)SAT that linearly preserve clique-width, establishing tractability upper bounds parameterized by clique-width for all standard argumentation semantics (stable, admissible, complete, preferred, semi-stable, stage) across extension existence, argument acceptance, and counting problems. Under the ETH, it further proves that the overhead of these reductions cannot be significantly improved.
SVD-NO: Learning PDE Solution Operators with SVD Integral Kernels: This paper proposes SVD-NO, a neural operator that explicitly parameterizes the singular value decomposition (SVD) of integral kernels, achieving $O(ndL)$ linear computational complexity while maintaining high expressiveness, and attaining new state-of-the-art performance on 5 PDE benchmarks.
Symbolic Planning and Multi-Agent Path Finding in Extremely Dense Environments with Unassigned Agents: This paper introduces the Block Rearrangement Problem (BRaP) as a formal problem definition and proposes five solving algorithms based on configuration space search, PDDL symbolic planning, and MAPF. Among them, BR-LaCAM achieves a 92% success rate with millisecond-level solving speed on grids up to 80×80 under extreme density conditions.
SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer: This work introduces SynWeather, the first unified multi-region multi-variable weather observation synthesis dataset (covering 4 regions × 4 variables × 6 satellites), and proposes SynWeatherDiff, a general probabilistic generative model based on a Diffusion Transformer. By leveraging text prompts to distinguish region–variable task combinations, SynWeatherDiff outperforms both task-specific models and existing general-purpose models across multiple synthesis tasks.
Tab-PET: Graph-Based Positional Encodings for Tabular Transformers: Tab-PET estimates a graph structure from inter-feature correlations in tabular data, constructs positional encodings (PE) from graph Laplacian eigenvectors, and injects them into tabular Transformers. Both theoretical analysis and experiments demonstrate that PE reduces the effective rank of embeddings, thereby improving generalization. Consistent improvements are observed across 50 datasets for TabTransformer, SAINT, and FT-Transformer, with the Spearman correlation graph yielding the best results.
TaylorPODA: A Taylor Expansion-Based Method to Improve Post-Hoc Attributions for Opaque Models: Under the Taylor expansion framework, three postulates—precision, federation, and zero-discrepancy—are proposed to regulate feature attribution. An adaptation property is further introduced to optimize the allocation weights of interaction effects via an AUP objective, making TaylorPODA the only post-hoc, model-agnostic attribution method that simultaneously satisfies all postulates and properties.
TDSNNs: Competitive Topographic Deep Spiking Neural Networks for Visual Cortex Modeling: This paper proposes Topographic Deep Spiking Neural Networks (TDSNNs), which introduce a Spatiotemporal Constraint (STC) loss to successfully replicate the hierarchical topographic organization of the primate visual cortex from V1 to IT in deep SNNs, achieving zero accuracy degradation on ImageNet (top-1) while substantially outperforming existing topographic ANNs in brain similarity.
The Limitations and Power of NP-Oracle-Based Functional Synthesis Techniques: This paper systematically investigates, from a theoretical perspective, the capabilities and limitations of functional synthesis methods that rely on NP oracles. It proves that naive bit-by-bit learning approaches necessarily fail in multi-output settings, that Resolution-interpolation-based methods produce exponential-size circuits, and that an NP oracle is a necessary condition for efficient synthesis. Positive results are also established, showing that NP oracles suffice to synthesize small Skolem functions in polynomial time under appropriate conditions.
The Publication Choice Problem: This paper proposes the "publication choice problem," a game-theoretic framework that models the bidirectional interaction between researchers' publication strategies and venue influence. It proves the existence and uniqueness of pure-strategy equilibria and analyzes the effects of Spotlight paper labels on the academic ecosystem.
Theoretical and Empirical Analysis of Lehmer Codes to Search Permutation Spaces with Evolutionary Algorithms: This work presents the first rigorous mathematical runtime analysis of Lehmer codes (inversion tables) for searching permutation spaces with evolutionary algorithms. It proves that Lehmer-code-based EAs achieve expected runtimes of $O(n^2 \log n)$ or $O(n^2)$ on most benchmark functions, matching or improving upon classical representations, and validates practical utility on LOP and QAP instances.
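For readers unfamiliar with the representation in the entry above, here is a minimal Python sketch of one common Lehmer code convention: entry `i` counts the later elements that are smaller than `perm[i]`. The function names are illustrative, and the paper may use a different but equivalent convention (for example, inversion tables indexed by value).

```python
def lehmer_encode(perm):
    """Lehmer code of a permutation of 0..n-1.

    code[i] is the number of later entries smaller than perm[i].
    """
    n = len(perm)
    return [sum(perm[j] < perm[i] for j in range(i + 1, n)) for i in range(n)]


def lehmer_decode(code):
    """Rebuild the permutation of 0..n-1 from its Lehmer code."""
    remaining = list(range(len(code)))
    # code[i] is an index into the sorted list of values not yet used.
    return [remaining.pop(c) for c in code]


perm = [2, 0, 3, 1]
code = lehmer_encode(perm)      # [2, 0, 1, 0]
restored = lehmer_decode(code)  # [2, 0, 3, 1]
```

Because `code[i]` can take any value from `0` to `n - 1 - i` independently of the other positions, every such vector decodes to a valid permutation, so a search operator can change one entry without needing a repair step.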
ASAG: Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance: This paper proposes ASAG (Adversarial Sinkhorn Attention Guidance), which reinterprets self-attention scores in diffusion models from the perspective of optimal transport theory. By injecting adversarial transport costs into attention layers via the Sinkhorn algorithm to deliberately reduce query-key similarity, ASAG systematically disrupts misleading attention alignment and improves both conditional and unconditional sampling quality. The method is lightweight, plug-and-play, and requires no retraining.
Towards Temporal Fusion Beyond the Field of View for Camera-based Semantic Scene Completion: This paper proposes C3DFusion, a module that explicitly aligns point features from historical and current frames in 3D space, and is the first to systematically address temporal completion of out-of-frame (out-of-view) regions in camera-based SSC. The method achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360.
Tractable Weighted First-Order Model Counting with Bounded Treewidth Binary Evidence: A polynomial-time (in domain size) algorithm is proposed for computing weighted first-order model counting (WFOMC) of the $\text{FO}^2$ and $\text{C}^2$ fragments with bounded-treewidth binary evidence, resolving an open problem on counting stable seating arrangements on bounded-treewidth bounded-degree graphs.
Variance Computation for Weighted Model Counting with Knowledge Compilation Approach: This paper treats the weights in weighted model counting (WMC) as random variables with associated variances, and proposes a polynomial-time algorithm for computing WMC variance on structured d-DNNF representations. It further proves intractability of this problem on structured DNNF, d-DNNF, and FBDD (unless P=NP), and applies the framework to quantify parameter uncertainty in Bayesian network inference.
Verification-Guided Context Optimization for Tool Calling via Hierarchical LLMs-as-editors: This paper proposes the VGCO framework, which employs LLMs as hierarchical editors to iteratively optimize tool documentation and knowledge base context through verification-guided signals, achieving significant improvements in retrieval recall, tool selection, and parameter filling accuracy in large-scale tool calling scenarios.
Whispering Agents: An Event-Driven Covert Communication Protocol for the Internet of Agents: This paper presents the first formal definition of a "Covert Event Channel" in the Internet of Agents (IoA) and proposes the ΠCCAP protocol, which embeds secret data across the storage, timing, and behavioral dimensions of agent conversations, achieving high-capacity, high-robustness covert communication that is imperceptible to LLM-based censors.
Why Isn't Relational Learning Taking Over the World?: This position paper systematically analyzes why relational learning has failed to dominate the AI landscape, identifying core issues including unrealistic datasets, fundamentally flawed evaluation methodologies, the absence of negative examples, and theoretical difficulties with aggregation operations. It further delineates the key improvements necessary for relational learning to realize its potential.